When is a Convolutional Filter Easy to Learn?
Abstract
We analyze the convergence of the (stochastic) gradient descent algorithm for learning a convolutional filter with the Rectified Linear Unit (ReLU) activation function. Our analysis does not rely on any specific form of the input distribution, and our proofs only use the definition of ReLU, in contrast with previous works that are restricted to standard Gaussian inputs. We show that (stochastic) gradient descent with random initialization can learn the convolutional filter in polynomial time, and that the convergence rate depends on the smoothness of the input distribution and the closeness of patches. To the best of our knowledge, this is the first recovery guarantee of gradient-based algorithms for a convolutional filter on non-Gaussian input distributions. Our theory also justifies the two-stage learning rate strategy in deep neural networks. While our focus is theoretical, we also present experiments that illustrate our theoretical findings.
1 Introduction
Deep convolutional neural networks (CNNs) have achieved state-of-the-art performance in many applications, such as computer vision (Krizhevsky et al., 2012), natural language processing (Dauphin et al., 2016), and reinforcement learning applied to classic games like Go (Silver et al., 2016). Despite the highly non-convex nature of the objective function, simple first-order algorithms like stochastic gradient descent and its variants often train such networks successfully. Yet the success of convolutional neural networks remains elusive from an optimization perspective.
When the input distribution is not constrained, existing results are mostly negative, such as the hardness of learning a 3-node neural network (Blum and Rivest, 1989) or a non-overlapping convolutional filter (Brutzkus and Globerson, 2017). Recently, Shamir (2016) showed that learning a simple one-layer fully connected neural network is hard for some specific input distributions.
These negative results suggest that, in order to explain the empirical success of SGD for learning neural networks, stronger assumptions on the input distribution are needed. Recently, a line of research (Tian, 2017; Brutzkus and Globerson, 2017; Li and Yuan, 2017; Soltanolkotabi, 2017; Zhong et al., 2017) assumed the input distribution to be standard Gaussian and showed that (stochastic) gradient descent is able to recover neural networks with ReLU activation in polynomial time.
One major issue with these analyses is that they rely on specialized analytic properties of the Gaussian distribution (cf. Section 1.1) and thus cannot be generalized to the non-Gaussian case, which is where most real-world distributions lie. For general input distributions, new techniques are needed.
In this paper we consider a simple architecture: a convolution layer, followed by a ReLU activation function, and then average pooling. Formally, we let $x$ be an input sample, e.g., an image. We generate $k$ patches from $x$, each of size $p$: $Z \in \mathbb{R}^{p \times k}$, where the $i$-th column $Z_i$ is the $i$-th patch, generated by some known function of $x$. For example, for a filter of size $p$ and stride $s$, $Z_i$ consists of the $((i-1)s+1)$-th to $((i-1)s+p)$-th pixels of $x$. Since for convolutional filters we only need to focus on the patches rather than the raw input, in the following definitions and theorems we refer to $Z$ as the input and let $\mathcal{Z}$ be the distribution of $Z$. The prediction of the filter $w \in \mathbb{R}^{p}$ is ($\sigma(\cdot)$ is the ReLU activation function, $\sigma(x) = \max(x, 0)$)

$$f(w, Z) = \frac{1}{k}\sum_{i=1}^{k}\sigma\left(w^\top Z_i\right). \qquad (1)$$
See Figure 1(a) for a graphical illustration. Such architectures have been used as the first layer in many computer vision models (Lin et al., 2013; Milletari et al., 2016).
We address the realizable case, where training data are generated from (1) with some unknown teacher parameter $w_*$ under the input distribution $\mathcal{Z}$. Consider the $\ell_2$ loss $\ell(w) = \frac{1}{2}\,\mathbb{E}_{Z\sim\mathcal{Z}}\left[\left(f(w, Z) - f(w_*, Z)\right)^2\right]$. We learn by (stochastic) gradient descent, i.e.,

$$w_{t+1} = w_t - \eta_t\, g(w_t), \qquad (2)$$
where $\eta_t$ is the step size, which may change over time, and $g(w_t)$ is a random function whose expectation equals the population gradient, $\mathbb{E}\left[g(w_t)\right] = \nabla \ell(w_t)$. The goal of our analysis is to understand under what conditions $w_t \to w_*$ as $t \to \infty$ when $w$ is optimized by (stochastic) gradient descent.
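To make this setup concrete, the following minimal NumPy sketch implements the prediction function (1) and the update (2) in the realizable case. The patch model, initialization, and hyperparameters here are illustrative assumptions, not settings from the experiments in Section 4.

```python
import numpy as np

def f(w, Z):
    """Average-pooled ReLU responses: f(w, Z) = (1/k) * sum_i relu(w . Z_i).
    Z has shape (p, k): one column per patch."""
    return np.mean(np.maximum(w @ Z, 0.0))

# Illustrative setup: k unit-norm patches of size p per sample.
rng = np.random.default_rng(0)
p, k = 16, 9
w_star = rng.standard_normal(p)            # unknown teacher filter
w = w_star + 0.5 * rng.standard_normal(p)  # an assumed good initialization (see Section 3.2)

eta = 0.1
for t in range(2000):
    Z = rng.standard_normal((p, k))        # fresh sample of patches (stand-in for Z ~ Z-dist)
    Z /= np.linalg.norm(Z, axis=0)         # unit-norm patches, as in Section 3.1
    y = f(w_star, Z)                       # realizable label from (1)
    # Stochastic gradient of the squared loss 0.5 * (f(w, Z) - y)^2:
    active = (w @ Z > 0).astype(float)     # ReLU gates, one per patch
    grad = (f(w, Z) - y) * (Z * active).sum(axis=1) / k
    w -= eta * grad                        # update (2)

print(np.linalg.norm(w - w_star))          # should be small if learning succeeded
```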
In this setup, our main contributions are as follows:
- Learnability of Filters: We show that if the input patches are highly correlated (Section 3), i.e., the angle between any two patches is at most $\rho$ for some small $\rho$, then gradient descent and stochastic gradient descent with random initialization recover the filter in polynomial time.² Furthermore, strong correlations imply faster convergence. To the best of our knowledge, this is the first recovery guarantee for randomly initialized gradient-based algorithms learning a filter (even for the simplest one-layer one-neuron network) on non-Gaussian input distributions, answering an open problem in (Tian, 2017). (²Since in this paper we focus on continuous distributions over the input, our results do not conflict with the previous negative results (Blum and Rivest, 1989; Brutzkus and Globerson, 2017), whose counterexample constructions rely on discrete distributions.)
- Distribution-Aware Convergence Rate: We formally establish the connection between the smoothness of the input distribution and the convergence rate of filter weight recovery, where smoothness is defined as the ratio between the largest and smallest eigenvalues of the second moment on the activation region (Section 2). We show that a smoother input distribution leads to faster convergence, and the Gaussian distribution is a special case that leads to the tightest bound. This theoretical finding also justifies the two-stage learning rate strategy proposed by He et al. (2016); Szegedy et al. (2017) when the step size is allowed to change over time.
1.1 Related Works
In recent years, theorists have tried to explain the success of deep learning from different perspectives. From the optimization point of view, training a neural network is a non-convex optimization problem. Pioneered by Ge et al. (2015), a line of work showed that the class of non-convex problems satisfying the strict saddle property can be optimized by perturbed (stochastic) gradient descent in polynomial time (Jin et al., 2017).³ This motivates the study of the landscape of neural networks (Kawaguchi, 2016; Choromanska et al., 2015; Hardt and Ma, 2016; Haeffele and Vidal, 2015; Mei et al., 2016; Freeman and Bruna, 2016; Safran and Shamir, 2016; Zhou and Feng, 2017). However, these results cannot be directly applied to analyzing the convergence of gradient-based methods for ReLU-activated neural networks. (³Vanilla gradient descent fails in this case (Du et al., 2017).)
From the learning theory point of view, it is well known that training a neural network is hard in the worst case (Blum and Rivest, 1989; Livni et al., 2014; Šíma, 2002; Shalev-Shwartz et al., 2017a, b), and recently Shamir (2016) showed that neither "niceness" of the target function nor of the input distribution alone is sufficient for optimization algorithms used in practice to succeed. With some additional assumptions, many works designed algorithms that provably learn a neural network with polynomial time and sample complexity (Goel et al., 2016; Zhang et al., 2015; Sedghi and Anandkumar, 2014; Janzamin et al., 2015). However, these algorithms are specially designed for certain architectures and cannot explain why (stochastic) gradient-based optimization algorithms work well in practice.
Focusing on gradient-based algorithms, a line of research analyzed the behavior of (stochastic) gradient descent under Gaussian input distributions. Tian (2017) showed that population gradient descent can find the true weight vector with random initialization for the one-layer one-neuron model. Brutzkus and Globerson (2017) showed that population gradient descent recovers the true weights of a convolutional filter with non-overlapping input in polynomial time. Li and Yuan (2017) showed that SGD can recover the true weights of a one-layer ResNet model with ReLU activation under the assumption that the spectral norm of the true weights is bounded by a small constant. All of these methods use explicit formulas available for Gaussian inputs, which enable them to apply trigonometric inequalities to derive the convergence. Under the same Gaussian assumption, Soltanolkotabi (2017) used the Gaussian width (cf. Definition 2.2 of (Soltanolkotabi, 2017)) for concentration arguments to show that the true weights can be exactly recovered by empirical projected gradient descent with enough samples in linear time; however, this approach cannot be extended to learning a convolutional filter.
Other approaches combine tensor methods with assumptions on the input distribution. Zhong et al. (2017) proved that with a sufficiently good initialization, which can be implemented by a tensor method, gradient descent can find the true weights of a one-hidden-layer fully connected neural network. However, their approach only works for known input distributions.
In this paper, we adopt a different approach that relies only on the definition of ReLU. We show that as long as the input distribution satisfies weak smoothness assumptions, we can find the true weights by SGD in polynomial time. Using this conclusion, we can justify the effectiveness of large amounts of data (which help eliminate saddle points) and of the two-stage and adaptive learning rates used by He et al. (2016); Szegedy et al. (2017), among others.
1.2 Organization
This paper is organized as follows. In Section 2, we analyze the simplest one-layer one-neuron model, where we state our key observation and establish the connection between smoothness and the convergence rate. In Section 3, we discuss the performance of (stochastic) gradient descent for learning a convolutional filter. We provide empirical illustrations in Section 4 and conclude in Section 5. We place most of our detailed proofs in the Appendix.
1.3 Notations
Let $\|\cdot\|_2$ denote the Euclidean norm of a finite-dimensional vector. For a matrix $A$, we use $\sigma_{\max}(A)$ to denote its largest singular value and $\sigma_{\min}(A)$ its smallest singular value. Note that if $A$ is a positive semidefinite matrix, $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ represent the largest and smallest eigenvalues of $A$, respectively. Let $O(\cdot)$ and $\Theta(\cdot)$ denote the standard Big-O and Big-Theta notations, hiding only absolute constants. We assume the gradient function is uniformly bounded, i.e., there exists $B > 0$ such that $\|g(w)\|_2 \le B$. This condition is satisfied as long as the patches and the noise are all bounded.
2 Warm Up: Analyzing One-Layer One-Neuron Model
Before diving into the convolutional filter, we first analyze the special case $k = 1$, which is equivalent to the one-layer one-neuron architecture. The analysis of this simple case will give us insights for the fully general case. For ease of presentation, we define the following two events and the corresponding second moments:

$$S_1(w) \equiv \left\{Z : w^\top Z \ge 0,\ w_*^\top Z \ge 0\right\}, \qquad S_2(w) \equiv \left\{Z : w^\top Z \ge 0,\ w_*^\top Z < 0\right\}, \qquad (3)$$

$$A_w \equiv \mathbb{E}\left[ZZ^\top\, \mathbb{1}_{S_1(w)}\right], \qquad B_w \equiv \mathbb{E}\left[ZZ^\top\, \mathbb{1}_{S_2(w)}\right],$$
where $\mathbb{1}_{S}$ is the indicator function of the event $S$. Intuitively, $S_1(w)$ is the joint activation region of $w$ and $w_*$, and $S_2(w)$ is the joint activation region of $w$ and $-w_*$. See Figure 2(a) for a graphical illustration. With some simple algebra we can derive the population gradient:

$$\nabla \ell(w) = A_w\left(w - w_*\right) + B_w w.$$
One key observation is that we can write the inner product $\left\langle \nabla \ell(w),\, w - w_* \right\rangle$ as the sum of two non-negative terms (cf. Lemma A.1). This observation directly leads to the following Theorem 2.1.
Theorem 2.1.
Suppose that for any $w$ with $\theta(w, w_*) \in (0, \pi)$ we have $\sigma_{\min}(A_w) > 0$, and that the initialization satisfies $\theta(w_0, w_*) < \pi$. Then the gradient descent algorithm recovers $w_*$.
The first assumption concerns the non-degeneracy of the input distribution: if it fails, the input distribution is supported on a low-dimensional subspace, meaning the input distribution is degenerate. The second assumption, on the initialization, ensures that gradient descent does not converge to $w = 0$, at which the gradient is undefined. This is a general convergence theorem that holds for a wide class of input distributions and initialization points. In particular, it includes Theorem 6 of (Tian, 2017) as a special case. If the input distribution is degenerate, i.e., there are holes in the input space, gradient descent may get stuck around saddle points, and we believe more data are needed to facilitate the optimization procedure. This is also consistent with empirical evidence, in which more data help optimization.
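The key observation above can also be verified numerically. The following Monte Carlo sketch estimates the population gradient for the $k = 1$ case and checks that its inner product with the error direction $w - w_*$ is non-negative; the Gaussian sampling distribution and the tolerance are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 8, 200_000
w_star = rng.standard_normal(p)

def pop_grad(w, Z):
    """Monte Carlo estimate of E[(relu(w.Z) - relu(w_star.Z)) 1{w.Z >= 0} Z],
    the population gradient of the one-neuron loss."""
    a, b = Z @ w, Z @ w_star
    coef = (np.maximum(a, 0) - np.maximum(b, 0)) * (a >= 0)
    return (coef[:, None] * Z).mean(axis=0)

for _ in range(5):
    w = rng.standard_normal(p)        # arbitrary point with theta(w, w_star) < pi
    Z = rng.standard_normal((n, p))   # illustrative (non-degenerate) distribution
    g = pop_grad(w, Z)
    # Up to Monte Carlo error, <grad, w - w_star> is a sum of non-negative terms:
    print(g @ (w - w_star) >= -1e-3)
```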
2.1 Convergence Rate of One-Layer One-Neuron Model
In the previous section we showed that if the distribution is regular and the weights are initialized appropriately, gradient descent recovers the true weights when it converges. In practice, we also want to know how many iterations are needed. To characterize the convergence rate, we need some quantitative assumptions. We note that different sets of assumptions lead to different rates, and ours is only one possible choice. In this paper we use the following quantities.
Definition 2.1 (The Largest/Smallest Eigenvalues of the Second Moment on the Intersection of Two Half Spaces).
For $0 \le \phi < \pi$, define

$$\gamma(\phi) \equiv \min_{\theta(w_1, w_2) = \phi}\ \sigma_{\min}\!\left(\mathbb{E}\!\left[ZZ^\top\, \mathbb{1}\!\left\{w_1^\top Z \ge 0,\ w_2^\top Z \ge 0\right\}\right]\right), \qquad L(\phi) \equiv \max_{\theta(w_1, w_2) = \phi}\ \sigma_{\max}\!\left(\mathbb{E}\!\left[ZZ^\top\, \mathbb{1}\!\left\{w_1^\top Z \ge 0,\ w_2^\top Z \ge 0\right\}\right]\right).$$
These two quantities characterize the angular smoothness of the input distribution. For a given angle $\phi$, if the gap between $L(\phi)$ and $\gamma(\phi)$ is large, then there is one direction with large probability mass and another with small probability mass, meaning the input distribution is not smooth. On the other hand, if $L(\phi)$ and $\gamma(\phi)$ are close, then all directions have similar probability mass, which means the input distribution is smooth. The smoothest input distributions are rotationally invariant distributions (e.g., the standard Gaussian), for which this gap is smallest. By analogy, we can think of $L(\phi)$ as the Lipschitz constant of the gradient and $\gamma(\phi)$ as the strong convexity parameter in the optimization literature, except that here we allow them to change with the angle. Also observe that $\gamma(\pi) = 0$, because at angle $\pi$ the intersection of the two half spaces has measure $0$, and both $\gamma(\phi)$ and $L(\phi)$ are monotonically decreasing in $\phi$.
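For a concrete distribution, $\gamma(\phi)$ and $L(\phi)$ can be estimated by Monte Carlo: fix a pair of unit vectors at angle $\phi$, form the empirical second moment restricted to their joint activation region, and take its extreme eigenvalues. A sketch under the assumption of standard Gaussian input, for which rotation invariance means a single pair of directions suffices:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 6, 500_000
Z = rng.standard_normal((n, p))          # assumed input: standard Gaussian

def second_moment_eigs(phi):
    """Extreme eigenvalues of E[Z Z^T 1{w1.Z >= 0, w2.Z >= 0}] for one pair of
    unit vectors w1, w2 at angle phi (enough under rotation invariance)."""
    w1 = np.eye(p)[0]
    w2 = np.cos(phi) * np.eye(p)[0] + np.sin(phi) * np.eye(p)[1]
    mask = (Z @ w1 >= 0) & (Z @ w2 >= 0)
    M = (Z[mask].T @ Z[mask]) / n        # the indicator is absorbed by masking
    eig = np.linalg.eigvalsh(M)
    return eig[0], eig[-1]               # (gamma-like, L-like) at this angle

for phi in (0.5, 1.5, 2.5):
    print(phi, second_moment_eigs(phi))  # both shrink as phi grows toward pi
```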
Our next assumption concerns the growth of $B_w$. Note that when $\theta(w, w_*) = 0$, we have $B_w = 0$, because the intersection between $\{w^\top Z \ge 0\}$ and $\{w_*^\top Z < 0\}$ has measure $0$. Also, $\sigma_{\max}(B_w)$ grows as the angle between $w$ and $w_*$ becomes larger.
In the following, we assume the operator norm of $B_w$ increases smoothly with respect to the angle. The intuition is that as long as the input distribution has bounded probability density with respect to the angle, the operator norm of $B_w$ is bounded. We show in Theorem A.1 that the assumption holds for rotationally invariant distributions and in Theorem A.2 that it holds for the standard Gaussian distribution.
Assumption 2.1.
Let $\phi_0 < \pi$ be a fixed angle. We assume there exists $\beta > 0$ such that for all $w$ with $\theta(w, w_*) \le \phi_0$, $\sigma_{\max}(B_w) \le \beta\,\theta(w, w_*)$.
Now we are ready to state the convergence rate.
Theorem 2.2.
Suppose the initialization satisfies $\theta(w_0, w_*) = \phi_0 < \pi/2$, and denote $\gamma = \gamma(\phi_0)$ and $L = L(\phi_0)$. Then, if the step size $\eta$ is set sufficiently small relative to $\gamma$, $L$, and $\beta$, we have for all $t$

$$\|w_{t+1} - w_*\|_2^2 \le \left(1 - \eta\,\gamma\right)\|w_t - w_*\|_2^2.$$
Note that both $\gamma(\phi)$ and $L(\phi)$ increase as $\phi$ decreases, so we can choose a constant step size based on the initial angle $\phi_0$. This theorem implies that we can find an $\epsilon$-close solution of $w_*$ in $O\!\left(\frac{1}{\eta\,\gamma(\phi_0)}\log\frac{\|w_0 - w_*\|_2}{\epsilon}\right)$ iterations.
If we are able to choose the step sizes adaptively, e.g., using methods proposed by Lin and Xiao (2014), we may improve the computational complexity, since as the angle $\theta(w_t, w_*)$ shrinks, $\gamma(\theta(w_t, w_*))$ grows and larger step sizes become admissible.
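For intuition, the iteration count follows from standard linear-rate bookkeeping; a short derivation, assuming the per-step contraction stated in Theorem 2.2:

```latex
% Generic linear-rate bookkeeping, assuming the contraction of Theorem 2.2:
\|w_{t+1} - w_*\|_2^2 \le (1 - \eta\gamma)\,\|w_t - w_*\|_2^2
\;\Longrightarrow\;
\|w_T - w_*\|_2^2 \le (1 - \eta\gamma)^{T}\,\|w_0 - w_*\|_2^2.
% Requiring the right-hand side to be at most \epsilon^2 and using
% \log\tfrac{1}{1-\eta\gamma} \ge \eta\gamma gives
T \;\ge\; \frac{1}{\eta\gamma}\,\log\frac{\|w_0 - w_*\|_2^2}{\epsilon^2}
\;=\; \frac{2}{\eta\gamma}\,\log\frac{\|w_0 - w_*\|_2}{\epsilon}.
```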
The theorem requires the initialization to satisfy $\theta(w_0, w_*) < \pi/2$, which can be achieved by random initialization with constant success probability. See Section 3.2 for a detailed discussion.
3 Main Results for Learning a Convolutional Filter
In this section we generalize the ideas from the previous section to analyze the convolutional filter. First, for given $w$ and $w_*$, we define four events that partition the input space of each patch $Z_i$. Each event corresponds to a different activation region induced by $w$ and $w_*$, similar to (3):

$$S_1^i \equiv \left\{w^\top Z_i \ge 0,\ w_*^\top Z_i \ge 0\right\},\quad S_2^i \equiv \left\{w^\top Z_i \ge 0,\ w_*^\top Z_i < 0\right\},\quad S_3^i \equiv \left\{w^\top Z_i < 0,\ w_*^\top Z_i \ge 0\right\},\quad S_4^i \equiv \left\{w^\top Z_i < 0,\ w_*^\top Z_i < 0\right\}.$$
Please refer to Figure 2(a) again for an illustration. For ease of presentation, we also define, for each region, the average over all patches that fall in that region.
Next, we generalize the smoothness conditions, analogous to Definition 2.1 and Assumption 2.1. Here the smoothness is defined over the average of patches.
Assumption 3.1.
For $0 \le \phi < \pi$, define $\gamma(\phi)$ and $L(\phi)$ as in Definition 2.1, with the single input replaced by the average of the patches in the corresponding activation regions. We assume the distribution $\mathcal{Z}$ satisfies $\gamma(\phi) > 0$ for the relevant range of angles.
The main difference between the simple one-layer one-neuron network and the convolutional filter is that two patches may fall into different regions: for a given sample, there may exist patches $Z_i$ and $Z_j$ that lie in different activation regions, and their interaction plays an important role in the convergence of (stochastic) gradient descent. Here we assume that the second moment of this interaction, i.e., the cross-covariance, also grows smoothly with respect to the angle.
Assumption 3.2.
We assume there exists $L_{\text{cross}} > 0$ such that, for every pair of patches $Z_i, Z_j$ with $i \ne j$ and every pair of mismatched activation regions $(S_a^i, S_b^j)$, $a \ne b$, in which at least one patch falls into a disagreement region ($S_2$ or $S_3$), the cross second moment satisfies $\sigma_{\max}\!\left(\mathbb{E}\!\left[Z_i Z_j^\top\, \mathbb{1}\{Z_i \in S_a^i,\ Z_j \in S_b^j\}\right]\right) \le L_{\text{cross}}\,\theta(w, w_*)$.
First note that if $\theta(w, w_*) = 0$, then $S_2^i$ and $S_3^i$ have measure $0$, so this assumption models the growth of the cross-covariance with the angle. Next, note that $L_{\text{cross}}$ represents the closeness of patches: if $Z_i$ and $Z_j$ are very similar, then the joint probability that they fall into different activation regions is small, which implies $L_{\text{cross}}$ is small. In the extreme setting where all patches are identical, we have $L_{\text{cross}} = 0$, because in this case the events in which two patches fall into different activation regions all have measure $0$.
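The role of patch closeness in Assumption 3.2 can be probed numerically. A hedged Monte Carlo sketch: draw pairs of nearby unit-norm patches from an illustrative Gaussian perturbation model (the closeness parameter `rho` and the base distribution are assumptions), and estimate the operator norm of a cross second moment on a mismatched pair of regions:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 6, 400_000
w_star = np.eye(p)[0]

def cross_cov_norm(w, rho):
    """Operator norm of E[Z_i Z_j^T 1{Z_i in S_2, Z_j in S_1}]: patch i active
    for w but not w_star, patch j active for both (a mismatched region pair).
    rho controls patch closeness: small rho => nearly identical patches."""
    Zi = rng.standard_normal((n, p))
    Zj = Zi + rho * rng.standard_normal((n, p))   # nearby patch (illustrative model)
    Zi /= np.linalg.norm(Zi, axis=1, keepdims=True)
    Zj /= np.linalg.norm(Zj, axis=1, keepdims=True)
    mask = (Zi @ w >= 0) & (Zi @ w_star < 0) & (Zj @ w >= 0) & (Zj @ w_star >= 0)
    M = (Zi[mask].T @ Zj[mask]) / n               # indicator absorbed by masking
    return np.linalg.norm(M, 2)

w = np.eye(p)[1]                                  # a w at angle pi/2 from w_star
for rho in (0.05, 0.2, 1.0):
    print(rho, cross_cov_norm(w, rho))            # closer patches => smaller norm
```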
Now we are ready to present our result on learning a convolutional filter by gradient descent.
Theorem 3.1.
Suppose the initialization satisfies $\theta(w_0, w_*) = \phi_0 < \pi/2$, and define the effective rate parameter $\gamma_{\mathrm{eff}}$ by subtracting the cross-covariance contribution (scaling with $L_{\text{cross}}$) from $\gamma(\phi_0)$, which we require to satisfy $\gamma_{\mathrm{eff}} > 0$. Then, if we choose the step size $\eta$ sufficiently small relative to $\gamma_{\mathrm{eff}}$, $L(\phi_0)$, and $L_{\text{cross}}$, we have for all $t$: $\theta(w_t, w_*) \le \phi_0$ and

$$\|w_{t+1} - w_*\|_2^2 \le \left(1 - \eta\,\gamma_{\mathrm{eff}}\right)\|w_t - w_*\|_2^2.$$
Our theorem suggests that if the initialization satisfies $\theta(w_0, w_*) < \pi/2$ and $\gamma(\phi_0)$ dominates the cross term $L_{\text{cross}}$, we obtain a linear convergence rate. In Section 3.1, we give a concrete example showing that closeness of patches implies a large $\gamma(\phi_0)$ and a small $L_{\text{cross}}$. Similar to Theorem 2.2, if the step size is chosen appropriately, we find an $\epsilon$-close solution of $w_*$ in $O\!\left(\frac{1}{\eta\,\gamma_{\mathrm{eff}}}\log\frac{1}{\epsilon}\right)$ iterations.
In practice, we never have access to the true population gradient, only to a stochastic gradient (cf. Equation (2)). The following theorem shows that SGD also recovers the underlying filter.
Theorem 3.2.
Let $\phi_1 = \frac{1}{2}\left(\phi_0 + \frac{\pi}{2}\right)$. Denote by $\gamma_{\mathrm{eff}}(\phi_1)$ the effective rate parameter at angle $\phi_1$ (as in Theorem 3.1), and recall that $B$ is the uniform upper bound on the stochastic gradient. For sufficiently small target accuracy $\epsilon$, if the step size $\eta$ is chosen sufficiently small, then with high probability SGD finds an $\epsilon$-close solution of $w_*$ in a number of iterations polynomial in $\frac{1}{\epsilon}$, $B$, and $\frac{1}{\gamma_{\mathrm{eff}}(\phi_1)}$.
Unlike the vanilla gradient descent case, here the convergence rate depends on $\gamma_{\mathrm{eff}}(\phi_1)$ instead of $\gamma_{\mathrm{eff}}(\phi_0)$. This is because of the randomness in SGD, and we need a more robust initialization. We choose $\phi_1$ to be the average of $\phi_0$ and $\pi/2$ for ease of presentation; as will be apparent in the proof, we only require $\phi_1$ to be not very close to $\pi/2$. The proof relies on constructing a martingale and using the Azuma-Hoeffding inequality, an idea previously used by Ge et al. (2015).
3.1 For what distributions is it easy for SGD to learn a convolutional filter?
Different from the one-layer one-neuron model, here we also require the Lipschitz constant of the cross-covariance, $L_{\text{cross}}$, to be relatively small and $\gamma(\phi_0)$ to be relatively large. A natural question is: what input distributions satisfy this condition?
Here we give an example. We show that if (1) the patches are close to each other and (2) the input distribution has small probability mass around the decision boundary, then the assumptions of Theorem 3.1 are satisfied. See Figure 1(b)-(c) for graphical illustrations.
Theorem 3.3.
Suppose all patches have unit norm⁴ and the angle between any pair of patches is at most $\rho$, i.e., $\theta(Z_i, Z_j) \le \rho$ for all $i, j$. Further assume the probability density of each patch is uniformly bounded in a neighborhood of the decision boundary. Then $\gamma(\phi)$ admits a lower bound that is monotonically decreasing in $\rho$ and recovers the single-patch quantity of Definition 2.1 when $\rho = 0$, and $L_{\text{cross}}$ admits an upper bound proportional to the bound on the density around the decision boundary.
(⁴This condition can be relaxed to: the norm and the angle of each patch are independent, and the norm of each pair is independent of the others.)
Several comments are in order. We view $\rho$ as a quantitative measure of the closeness between different patches, i.e., a small $\rho$ means the patches are similar. The lower bound on $\gamma(\phi)$ is monotonically decreasing as a function of $\rho$, and when $\rho = 0$ it recovers Definition 2.1.
For the upper bound on $L_{\text{cross}}$, the relevant constant represents an upper bound on the probability density around the decision boundary. For example, a distribution that places little mass near the boundary $\left\{Z : w_*^\top Z = 0\right\}$ yields a small $L_{\text{cross}}$.
3.2 The Power of Random Initialization
For the one-layer one-neuron model, we need the initialization to satisfy $\theta(w_0, w_*) < \pi/2$, and for the convolutional filter we need the stronger initialization of Theorem 3.1. The following theorem shows that uniform random initialization gives a good initialization with constant probability. Note that with this theorem at hand, we can boost the success probability arbitrarily close to $1$ by random restarts. The proof is similar to that of (Tian, 2017).
Theorem 3.4.
If we uniformly sample $w_0$ from a $p$-dimensional ball centered at $0$ whose radius is chosen appropriately relative to $\|w_*\|_2$, then with probability at least an absolute constant, the initialization condition of Theorem 2.2 is satisfied.
To apply this general initialization theorem to our convolutional filter case, we choose the radius accordingly. Therefore, with some simple algebra, we have the following corollary.
Corollary 3.1.
Suppose the patches are sufficiently close to each other in the sense of Theorem 3.3. Then, if $w_0$ is uniformly sampled from a ball with center $0$ and an appropriately chosen radius, the initialization condition of Theorem 3.1 holds with constant probability.
The assumption of this corollary is satisfied if the patches are close to each other, as discussed in the previous section.
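A hedged sketch of random initialization with restarts: draw $w_0$ uniformly from a $p$-dimensional ball and keep the run with the smallest empirical loss. Picking the lowest-loss run is an illustrative stand-in for verifying the initialization condition (which cannot be checked directly, since $w_*$ is unknown); `run_gd` and `loss` are assumed to be supplied by the caller:

```python
import numpy as np

def sample_ball(p, radius, rng):
    """Draw a point uniformly from a p-dimensional ball of the given radius."""
    x = rng.standard_normal(p)
    x /= np.linalg.norm(x)                         # uniform random direction
    return radius * rng.random() ** (1.0 / p) * x  # radius with density prop. to r^(p-1)

def with_restarts(run_gd, loss, p, radius, n_restarts, rng):
    """Boost a constant per-draw success probability toward 1 by restarting
    from fresh random initializations and keeping the lowest-loss solution."""
    candidates = (run_gd(sample_ball(p, radius, rng)) for _ in range(n_restarts))
    return min(candidates, key=loss)
```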
4 Experiments
In this section we use simulations to verify our theoretical findings. We first test how the smoothness affects the convergence rate in the one-layer one-neuron model described in Section 2. To construct input distributions with different $\gamma(\phi)$, $L(\phi)$, and $\beta$ (cf. Definition 2.1 and Assumption 2.1), we fix the patch to have unit norm and use a mixture of truncated Gaussian distributions to model the angle between the input and $w_*$, with components centered at $0$ and at $\pi$. Note that, by the definitions of $L(\phi)$ and $\beta$, if the standard deviation of the components is small, the probability mass concentrates around the centers, so the distribution is very spiky and $L(\phi)$ and $\beta$ will be large. On the other hand, if the standard deviation is large, the input distribution is close to a rotationally invariant distribution, and $L(\phi)$ and $\beta$ will be small. Figure 2(a) verifies our prediction, where we fix the initialization and the step size.
Next we test how the closeness of patches affects the convergence rate in the convolutional setting. We first generate a single patch using the above model with a fixed standard deviation, then generate each remaining unit-norm patch so that its angle with the first patch is sampled from a zero-mean Gaussian with a given variance. Figure 2(b) shows that as the variance between patches becomes smaller, we obtain a faster convergence rate, which coincides with Theorem 3.1.
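The following sketch reconstructs the angular input model described above, with placeholder parameter names; the mixture centers at $0$ and $\pi$, the clipping to $[0, \pi]$ (a crude stand-in for truncation), and all constants are assumptions, since the exact experimental settings are not recoverable here:

```python
import numpy as np

def sample_inputs(w_star, sigma, n, rng):
    """Unit-norm inputs whose angle to w_star follows a two-component Gaussian
    mixture with centers 0 and pi, clipped to [0, pi]. Small sigma => spiky
    angular distribution; large sigma => close to rotationally invariant."""
    p = w_star.size
    centers = rng.choice([0.0, np.pi], size=n)
    theta = np.clip(centers + sigma * rng.standard_normal(n), 0.0, np.pi)
    u = w_star / np.linalg.norm(w_star)
    v = rng.standard_normal((n, p))                # random directions ...
    v -= np.outer(v @ u, u)                        # ... made orthogonal to u
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return np.cos(theta)[:, None] * u + np.sin(theta)[:, None] * v

rng = np.random.default_rng(4)
Z = sample_inputs(np.ones(8), sigma=0.3, n=10_000, rng=rng)
```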
5 Conclusions and Future Works
In this paper we provide the first recovery guarantee for the (stochastic) gradient descent algorithm with random initialization for learning a convolutional filter when the input distribution is not Gaussian. Our analysis only uses the definition of ReLU and some mild structural assumptions on the input distribution. Here we list some future directions.
One possibility is to extend our results to deeper and wider architectures. Even for a two-layer fully-connected network, the convergence of (stochastic) gradient descent with random initialization is not known. Existing results either require a sufficiently good initialization (Zhong et al., 2017) or rely on a special architecture (Li and Yuan, 2017). However, we believe the insights from this paper are helpful for understanding the behavior of gradient-based algorithms in these settings.
Another direction is to consider the agnostic setting, where the label is not exactly the output of a neural network. This leads to different dynamics of (stochastic) gradient descent, and we may need to analyze the robustness of the optimization procedure. This problem is also related to the expressiveness of neural networks (Raghu et al., 2016), where the underlying function is not equal but is close to a neural network. We believe our analysis can be extended to this setting.
Acknowledgment
The authors would like to thank Hanzhang Hu, Tengyu Ma, Yuanzhi Li, Jialei Wang and Kai Zhong for useful discussions.
References
- Blum and Rivest (1989) Avrim Blum and Ronald L Rivest. Training a 3-node neural network is NP-complete. In Advances in neural information processing systems, pages 494–501, 1989.
- Brutzkus and Globerson (2017) Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with Gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.
- Choromanska et al. (2015) Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.
- Dauphin et al. (2016) Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083, 2016.
- Du et al. (2017) Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Barnabas Poczos, and Aarti Singh. Gradient descent can take exponential time to escape saddle points. arXiv preprint arXiv:1705.10412, 2017.
- Freeman and Bruna (2016) C Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540, 2016.
- Ge et al. (2015) Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.
- Goel et al. (2016) Surbhi Goel, Varun Kanade, Adam Klivans, and Justin Thaler. Reliably learning the ReLU in polynomial time. arXiv preprint arXiv:1611.10258, 2016.
- Haeffele and Vidal (2015) Benjamin D Haeffele and René Vidal. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.
- Hardt and Ma (2016) Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Janzamin et al. (2015) Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.
- Jin et al. (2017) Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017.
- Kawaguchi (2016) Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- Li and Yuan (2017) Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. arXiv preprint arXiv:1705.09886, 2017.
- Lin et al. (2013) Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
- Lin and Xiao (2014) Qihang Lin and Lin Xiao. An adaptive accelerated proximal gradient method and its homotopy continuation for sparse optimization. In International Conference on Machine Learning, pages 73–81, 2014.
- Livni et al. (2014) Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855–863, 2014.
- Mei et al. (2016) Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016.
- Milletari et al. (2016) Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 565–571. IEEE, 2016.
- Raghu et al. (2016) Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. arXiv preprint arXiv:1606.05336, 2016.
- Safran and Shamir (2016) Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pages 774–782, 2016.
- Sedghi and Anandkumar (2014) Hanie Sedghi and Anima Anandkumar. Provable methods for training neural networks with sparse connectivity. arXiv preprint arXiv:1412.2693, 2014.
- Shalev-Shwartz et al. (2017a) Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. In International Conference on Machine Learning, pages 3067–3075, 2017a.
- Shalev-Shwartz et al. (2017b) Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Weight sharing is crucial to succesful optimization. arXiv preprint arXiv:1706.00687, 2017b.
- Shamir (2016) Ohad Shamir. Distribution-specific hardness of learning neural networks. arXiv preprint arXiv:1609.01037, 2016.
- Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- Šíma (2002) Jiří Šíma. Training a single sigmoidal neuron is hard. Neural Computation, 14(11):2709–2728, 2002.
- Soltanolkotabi (2017) Mahdi Soltanolkotabi. Learning ReLUs via gradient descent. arXiv preprint arXiv:1705.04591, 2017.
- Szegedy et al. (2017) Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, pages 4278–4284, 2017.
- Tian (2017) Yuandong Tian. An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560, 2017.
- Zhang et al. (2015) Yuchen Zhang, Jason D Lee, Martin J Wainwright, and Michael I Jordan. Learning halfspaces and neural networks with random initialization. arXiv preprint arXiv:1511.07948, 2015.
- Zhong et al. (2017) Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017.
- Zhou and Feng (2017) Pan Zhou and Jiashi Feng. The landscape of deep learning algorithms. arXiv preprint arXiv:1705.07038, 2017.