Diversity Leads to Generalization in Neural Networks

Bo Xie1, Yingyu Liang2, Le Song1
1 Georgia Institute of Technology: [email protected], [email protected]
2 Princeton University: [email protected]

Abstract

Neural networks are a powerful class of functions that can be trained with simple gradient descent to achieve state-of-the-art performance on a variety of applications. Despite their practical success, there is a paucity of results that provide theoretical guarantees on why they are so effective. At the center of the problem lies the difficulty of analyzing the non-convex loss function with potentially numerous local minima and saddle points. Can neural networks corresponding to the stationary points of the loss function learn the true labeling function? If yes, what are the key factors contributing to such generalization ability? In this paper, we provide answers to these questions by analyzing one-hidden-layer neural networks with ReLU activation, and show that despite the non-convexity, neural networks with diverse units can learn the true function. We bypass the non-convexity issue by directly analyzing the first order optimality condition, and show that the loss is bounded if the smallest singular value of the "extended feature matrix" is large enough. We make novel use of techniques from kernel methods and geometric discrepancy, and identify a new relation linking the smallest singular value to the spectrum of a kernel function associated with the activation function and to the diversity of the units. Our results also suggest a novel regularization function to promote unit diversity for potentially better generalization ability.

1 Introduction

In this paper, we show that despite the non-convexity, neural networks with diverse units can learn the true function. We bypass the hurdle of non-convexity by directly analyzing the first order optimality condition of the learning problem, which implies that the training loss is bounded if the minimum singular value of the extended feature matrix is large enough. Bounding this singular value is challenging because it entangles the nonlinear activation function, the weights, and the data in a complicated way. Unlike most previous attempts, we directly analyze the effect of the nonlinearity without assuming independence of the activation patterns from the actual data; in fact, the dependence of the patterns on the data and the unit weights underlies the key connection to the activation kernel spectrum and the diversity of the units. Our proof makes novel use of techniques from geometric discrepancy and kernel methods, and identifies a new relation linking the smallest singular value to the diversity of the units and the spectrum of a kernel function associated with the unit. Our results also suggest a novel regularization scheme to promote unit diversity for potentially better generalization.

2 Related work

In [3], the authors analyze the loss surface of a special random neural network through spin-glass theory and show that for many large-size networks, there is a band of exponentially many local optima whose loss is small and close to that of a global optimum. A similar work shows that all local optima are also global optima in linear neural networks [9]. However, the analysis there for nonlinear neural networks hinges on independence of the activation patterns from the actual data, which is unrealistic. Some other works argue that gradient descent is not trapped in saddle points [10, 6], which were suggested to be the major obstacle in optimization [4]. There is also a seminal work that uses tensor methods to avoid the non-convex optimization problem in neural networks [8]. The work closest to ours is [11], which shows that zero gradient implies zero loss for all weights outside an exception set of measure zero. However, this is insufficient to guarantee low training loss, since a small gradient can still correspond to a large loss.

3 Problem setting and preliminaries

We will focus on a special class of data distributions where the input $x \in \mathbb{R}^d$ is drawn uniformly from the unit sphere,¹ and assume the label satisfies $|y| \le Y$ where $Y$ is a constant. We consider the following hypothesis class:
$$\mathcal{F} = \left\{ \sum_{k=1}^{n} v_k\, \sigma(w_k^\top x) : v_k \in \{\pm 1\},\ W = \{w_k\} \in \mathcal{F}_W \right\}, \quad \text{where } \mathcal{F}_W = \left\{ W = \{w_k\} : \|w_k\| \le C_W \right\},$$
and $\sigma(\cdot) = \max\{0, \cdot\}$ is the rectified linear unit (ReLU) activation function, $\{w_k\}$ and $\{v_k\}$ are the unit weights and combination coefficients respectively, $n$ is the number of units, and $C_W$ is some constant. We restrict $v_k \in \{-1, 1\}$ due to the positive homogeneity of ReLU; that is, the magnitude of $v_k$ can always be absorbed into the corresponding $w_k$.

Given a set of i.i.d. training examples $\{x_l, y_l\}_{l=1}^{m} \subseteq \mathbb{R}^d \times \mathbb{R}$, we want to find a function $f \in \mathcal{F}$ that minimizes the following least-squares loss:
$$L(f) = \frac{1}{2m} \sum_{l=1}^{m} \left( y_l - f(x_l) \right)^2. \tag{1}$$
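The following minimal numpy sketch spells out this hypothesis class and the loss in (1); the sizes and random data are purely illustrative and not taken from the paper.

```python
import numpy as np

# Minimal sketch of the one-hidden-layer ReLU hypothesis class and the loss in (1).
# Sizes and random data are illustrative; v_k is restricted to {-1, +1} as in the paper.
rng = np.random.default_rng(0)
n, d, m = 16, 8, 100

W = rng.normal(size=(n, d))                          # unit weights w_k
v = rng.choice([-1.0, 1.0], size=n)                  # combination coefficients
X = rng.normal(size=(m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # inputs uniform on the unit sphere
y = rng.uniform(-1.0, 1.0, size=m)                   # bounded labels, |y| <= Y

def f(X, W, v):
    """f(x) = sum_k v_k * ReLU(w_k^T x), evaluated for every row of X."""
    return np.maximum(X @ W.T, 0) @ v

def loss(X, y, W, v):
    """Least-squares loss L(f) = (1/2m) * sum_l (y_l - f(x_l))^2 from (1)."""
    return 0.5 * np.mean((y - f(X, W, v)) ** 2)

print(f"L(f) = {loss(X, y, W, v):.4f}")
```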

3.1 First order optimality condition

The gradient of the empirical loss w.r.t. $w_k$ is²
$$\frac{\partial L}{\partial w_k} = \frac{1}{m} \sum_{l=1}^{m} \left( f(x_l) - y_l \right) v_k\, \sigma'(w_k^\top x_l)\, x_l, \tag{2}$$
for all $k = 1, \ldots, n$. We will express this collection of gradient equations using matrix notation. Define the "extended feature matrix" as
$$D = \begin{pmatrix} v_1 \sigma'(w_1^\top x_1)\, x_1 & \cdots & v_1 \sigma'(w_1^\top x_m)\, x_m \\ \vdots & \ddots & \vdots \\ v_k \sigma'(w_k^\top x_1)\, x_1 & \cdots & v_k \sigma'(w_k^\top x_m)\, x_m \\ \vdots & \ddots & \vdots \\ v_n \sigma'(w_n^\top x_1)\, x_1 & \cdots & v_n \sigma'(w_n^\top x_m)\, x_m \end{pmatrix},$$
and the residual as
$$r = \frac{1}{m} \left( f(x_1) - y_1, \cdots, f(x_m) - y_m \right)^\top.$$
Then we have
$$\frac{\partial L}{\partial W} := \left( \frac{\partial L}{\partial w_1}^\top, \ldots, \frac{\partial L}{\partial w_n}^\top \right)^\top = D\, r. \tag{3}$$

A stationary point has zero gradient, so if $D \in \mathbb{R}^{dn \times m}$ has full column rank, then immediately $r = 0$, i.e., the stationary point is in fact a global optimum. Since $nd \ge m$ is necessary for $D$ to have full column rank, we assume this throughout the paper. However, in practice the gradient will not be exactly zero, e.g., because we stop the algorithm after finitely many steps or because we use stochastic gradient descent (SGD). In other words, typically we only have $\|\partial L / \partial W\| \le \epsilon$, and $D$ being full rank is insufficient since a small gradient can still correspond to a large loss. More specifically, let $s_m(D)$ be the minimum singular value of $D$; then we have
$$\|r\| \le \frac{1}{s_m(D)} \left\| \frac{\partial L}{\partial W} \right\|. \tag{4}$$

¹ The restriction of the input to the unit sphere is not too stringent since many inputs are already normalized. Furthermore, it is possible to relax the uniformity assumption to sub-Gaussian rotationally invariant distributions, but we use the current assumption for simplicity.
² Note that even though ReLU is not differentiable, we can use its sub-gradient $\sigma'(u) = \mathbb{I}[u \ge 0]$.
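To make (2)-(4) concrete, here is a small numpy sketch with purely illustrative sizes and random data: it builds the extended feature matrix $D$ and the residual $r$, recovers the gradient via (3), and evaluates both sides of inequality (4).

```python
import numpy as np

# Small numerical check of the quantities in Section 3.1. Sizes and data are illustrative.
rng = np.random.default_rng(0)
n, d, m = 8, 5, 20                                   # nd = 40 >= m = 20, as assumed
X = rng.normal(size=(m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # inputs on the unit sphere
y = rng.uniform(-1.0, 1.0, size=m)                   # bounded labels
W = rng.normal(size=(n, d))                          # unit weights w_k
v = rng.choice([-1.0, 1.0], size=n)                  # combination coefficients in {+1, -1}

f = np.maximum(X @ W.T, 0) @ v                       # f(x_l) = sum_k v_k ReLU(w_k^T x_l)
r = (f - y) / m                                      # residual vector r

# Extended feature matrix D in R^{dn x m}: block k stacks v_k * 1[w_k^T x_l >= 0] * x_l over l.
act = (X @ W.T >= 0).astype(float)                   # sigma'(w_k^T x_l), shape (m, n)
D = np.concatenate([(v[k] * act[:, [k]] * X).T for k in range(n)], axis=0)

grad = (D @ r).reshape(n, d)                         # eq. (3): dL/dW = D r, row k is dL/dw_k
s_m = np.linalg.svd(D, compute_uv=False).min()       # smallest singular value s_m(D)
print("||r||              =", np.linalg.norm(r))
print("||dL/dW|| / s_m(D) =", np.linalg.norm(grad) / s_m)   # inequality (4): left <= right
```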

3.2 Spectrum decay of activation kernel

We will later show that $s_m(D)$ is related to the decay rate of the kernel spectrum associated with the activation function. Specifically, for an activation function $\sigma(w^\top x)$, we can define a kernel function
$$g(x, y) = \mathbb{E}_w\left[ \sigma'(w^\top x)\, \sigma'(w^\top y)\, \langle x, y \rangle \right], \tag{5}$$
where $\mathbb{E}_w$ is over $w$ uniformly distributed on the sphere. In fact, it is a dot-product kernel and we can decompose it with a spherical harmonic decomposition:
$$g(x, y) = \sum_{u=1}^{\infty} \gamma_u\, \phi_u(x)\, \phi_u(y), \tag{6}$$
where the eigenvalues are ordered $\gamma_1 \ge \cdots \ge \gamma_m \ge \cdots \ge 0$ and the bases $\phi_u(x)$ are spherical harmonics. The $m$-th eigenvalue $\gamma_m$ will be related to $s_m(D)$.
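For ReLU units and $w$ uniform on the sphere, $g$ in (5) also admits a simple closed form: writing $\theta = \arccos\langle x, y \rangle$, the set $\{ w : w^\top x \ge 0,\ w^\top y \ge 0 \}$ has normalized area $(\pi - \theta)/(2\pi)$, so $g(x, y) = \langle x, y \rangle (\pi - \theta)/(2\pi)$. This is the standard arc-cosine-type identity rather than a formula quoted from the paper; the sketch below checks it by Monte Carlo with illustrative sizes.

```python
import numpy as np

# Monte Carlo check of the kernel g(x, y) = E_w[sigma'(w^T x) sigma'(w^T y) <x, y>] in (5).
# The closed form g(x, y) = <x, y> (pi - theta) / (2*pi), theta = angle(x, y), is the standard
# arc-cosine identity for ReLU and is used here as an assumption, not quoted from the paper.
rng = np.random.default_rng(0)
d, num_w = 10, 200_000                        # illustrative dimension and sample size

def unit(v):
    return v / np.linalg.norm(v)

x, y = unit(rng.normal(size=d)), unit(rng.normal(size=d))
W = rng.normal(size=(num_w, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # rows are uniform on the unit sphere

mc = np.mean((W @ x >= 0) & (W @ y >= 0)) * (x @ y)       # Monte Carlo estimate of g(x, y)
theta = np.arccos(np.clip(x @ y, -1.0, 1.0))
closed = (x @ y) * (np.pi - theta) / (2 * np.pi)           # arc-cosine closed form
print(f"Monte Carlo: {mc:.4f}   closed form: {closed:.4f}")
```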

3.3 Weight discrepancy

Another key factor in the analysis is the diversity of the unit weights, measured by their geometric discrepancy [2, 5, 1]. Given a set of $n$ points $W = \{w_k\}_{k=1}^{n}$ on the unit sphere, the discrepancy of $W$ w.r.t. a measurable set $S_{xy} = \left\{ w \in \mathbb{S}^{d-1} : w^\top x \ge 0,\ w^\top y \ge 0 \right\}$ associated with a pair of points $(x, y)$ on the unit sphere is defined as
$$\mathrm{dsp}(W, S_{xy}) = \frac{1}{n} |W \cap S_{xy}| - A(S_{xy}), \tag{7}$$
where $A(S_{xy})$ is the normalized area of $S_{xy}$ (i.e., the area of the whole sphere is $A(\mathbb{S}^{d-1}) = 1$). Based on the collection of such sets $\mathcal{S} = \left\{ S_{xy} : x, y \in \mathbb{S}^{d-1} \right\}$, we can define two discrepancy measures relevant to ReLU units:
$$L_\infty(W, \mathcal{S}) = \sup_{S \in \mathcal{S}} \left| \mathrm{dsp}(W, S) \right|, \tag{8}$$
$$L_2(W, \mathcal{S}) = \sqrt{ \mathbb{E}_{x,y}\, \mathrm{dsp}(W, S_{xy})^2 }, \tag{9}$$
where the expectation is taken over $x, y$ uniform on the sphere. We use $L_\infty(W)$ and $L_2(W)$ as shorthands. Both discrepancies measure how diverse the points in $W$ are.
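To get a feel for (7)-(9), one can estimate $(L_2(W))^2$ by Monte Carlo over random pairs $(x, y)$, using the same lune-area identity $(\pi - \theta)/(2\pi)$ for $A(S_{xy})$ as in the kernel sketch above (again an assumption of the sketch, not a formula from the paper).

```python
import numpy as np

# Monte Carlo estimate of the L2 discrepancy (9) of a weight set W on the unit sphere.
# A(S_xy) is computed with the lune-area identity (pi - theta)/(2*pi), an assumption of
# this sketch rather than a formula stated in the paper.
rng = np.random.default_rng(1)
d, n, num_pairs = 10, 200, 5_000              # illustrative sizes

W = rng.normal(size=(n, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)       # n random unit weights

X = rng.normal(size=(num_pairs, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = rng.normal(size=(num_pairs, d)); Y /= np.linalg.norm(Y, axis=1, keepdims=True)

# Fraction of weights falling in S_xy = {w : w^T x >= 0, w^T y >= 0} for each pair (x, y).
in_Sxy = ((W @ X.T >= 0) & (W @ Y.T >= 0)).mean(axis=0)      # |W ∩ S_xy| / n
theta = np.arccos(np.clip((X * Y).sum(axis=1), -1.0, 1.0))
area = (np.pi - theta) / (2 * np.pi)                          # normalized area A(S_xy)

L2_sq = np.mean((in_Sxy - area) ** 2)                         # estimate of L2(W)^2
print(f"estimated L2(W)^2 ≈ {L2_sq:.2e}")
```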

4 Main results

Recall that $\gamma_m$ is the $m$-th eigenvalue of the kernel in (6), and we define $\beta$ such that $\gamma_m = \Omega(m^{-\beta})$ for some $\beta < 1$.

Theorem 1. Let $0 < \delta, \delta' < 1$, and suppose $n = \tilde{\Omega}(m^{\beta})$ and $d = \tilde{\Omega}(m^{\beta})$. Then there exists a set $\mathcal{G}_W \subseteq \mathcal{F}_W$ that takes up a $1 - \delta'$ fraction of the measure of $\mathcal{F}_W$ such that with probability at least $1 - c m^{-\log m} - \delta$, for any $W \in \mathcal{G}_W$,
$$\frac{1}{2}\, \mathbb{E}\left( f(x; v, W) - y \right)^2 \le c \left\| \frac{\partial L}{\partial W} \right\|^2 + c' \left( C_W^2 + Y^2 \right) \sqrt{\frac{1}{m} \log \frac{1}{\delta}}.$$
Here $\tilde{\Omega}$ hides logarithmic terms $\log m \log\frac{1}{\delta} \log\frac{1}{\delta'}$.

By the theorem, when we obtain a solution $W \in \mathcal{G}_W$ with gradient $\|\partial L / \partial W\| \le \epsilon$, the generalization error is $O(\epsilon^2 + 1/\sqrt{m})$. This essentially means that although non-convex, the loss function is well behaved, and learning is not difficult over this set. Furthermore, a randomly sampled set of weights $W$ is likely to fall into this set. This then suggests an explanation for the practical success of training with random initialization: after initialization, the parameters w.h.p. fall into the set, then stay inside during training, and finally arrive at a point with small gradient and small error.

High level intuition. Due to space limitations, we only describe the high level intuition and proof sketch for bounding the minimum singular value. Please see the supplementary material for the full proof.

$s_m(D)$ is necessarily connected to the activation function and the diversity of the weights. For example, if $\sigma'(t)$ is very small for all $t$, then the smallest singular value is expected to be very small. For the weights, if $d < m$ (the interesting case) and all $w_k$'s are the same, then $D$ cannot have rank $m$. If the $w_k$'s are very similar to each other, then one would expect the smallest singular value to be very small or even zero. Therefore, some notion of diversity of the weights is needed.

The analysis begins by considering the matrix $G_n = D^\top D / n$. It suffices to bound $\lambda_m(G_n)$, the $m$-th (and the smallest) eigenvalue of $G_n$. To do so, we introduce a matrix $G$ whose entries are $G(i, j) = \mathbb{E}_w[G_n(i, j)]$, where the expectation $\mathbb{E}_w$ is taken assuming the $w_k$'s are uniformly random on the unit sphere. The intuition is that when $w$ is uniformly distributed, $\sigma'(w^\top x)$ is the least dependent on the actual value of $x$, and the matrix $D$ has the highest chance of having a large smallest singular value. We introduce $G$ as an intermediate quantity and subsequently bound the spectral difference between $G_n$ and $G$. Roughly speaking, we break the proof into two steps:
$$\lambda_m(G_n) \ge \underbrace{\lambda_m(G)}_{\text{I. ideal spectrum}} - \underbrace{\|G - G_n\|}_{\text{II. discrepancy}}, \tag{10}$$
where $\|G - G_n\|$ is the spectral norm of the difference.
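The decomposition (10) is an instance of Weyl's inequality for symmetric matrices. The sketch below, with illustrative sizes and reusing the assumed ReLU kernel closed form from the earlier sketch to build $G$, compares the three quantities numerically.

```python
import numpy as np

# Numerical illustration of (10): lambda_m(G_n) >= lambda_m(G) - ||G - G_n|| (Weyl's inequality).
# G is built from the ReLU kernel closed form g(x, y) = <x, y>(pi - theta)/(2*pi), an assumption
# carried over from the earlier kernel sketch rather than a formula quoted from the paper.
rng = np.random.default_rng(2)
d, n, m = 20, 400, 15                         # illustrative sizes with nd >> m

X = rng.normal(size=(m, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.normal(size=(n, d)); W /= np.linalg.norm(W, axis=1, keepdims=True)

act = (X @ W.T >= 0).astype(float)            # sigma'(w_k^T x_i), shape (m, n)
inner = X @ X.T                               # <x_i, x_j>
Gn = (act @ act.T / n) * inner                # empirical matrix G_n = D^T D / n (v_k^2 = 1)

theta = np.arccos(np.clip(inner, -1.0, 1.0))
G = inner * (np.pi - theta) / (2 * np.pi)     # expected matrix G(i, j) = g(x_i, x_j)

lam = lambda M: np.linalg.eigvalsh(M).min()   # smallest (m-th) eigenvalue
gap = np.linalg.norm(G - Gn, ord=2)           # spectral norm ||G - G_n||
print(f"lambda_m(G_n) = {lam(Gn):.4f} >= {lam(G) - gap:.4f} = lambda_m(G) - ||G - G_n||")
```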

For the first term, we observe that $G$ has a particularly nice form: $G(i, j) = g(x_i, x_j)$, the kernel defined in (6). This allows us to apply the eigendecomposition of the kernel and a positive definite matrix concentration inequality to bound $\lambda_m(G)$, which turns out to be around $m\gamma_m/2$. For the second term, we use the geometric discrepancy to characterize $\|G - G_n\|$ and show that it is small for most $W$. In particular, the entries of $G - G_n$ can be viewed as the kernel of a U-statistic, hence concentration bounds can be applied. The expected U-statistic turns out to be $(L_2(W))^2$, which has a closed form and can be shown to be small.

Our key technical lemma is a lower bound on the smallest singular value of the matrix $D$.

Lemma 2. With probability $\ge 1 - m \exp(-m\gamma_m/8) - 2m^2 \exp(-4\log^2 d) - \delta$, we have
$$s_m(D)^2 \ge n m \gamma_m / 2 - c\, n\, \rho(W), \tag{11}$$
where
$$\rho(W) = \frac{m \log d}{\sqrt{d}} \left( \sqrt{L_\infty(W) L_2(W)} \left( \frac{4}{m} \log\frac{1}{\delta} \right)^{1/4} + L_\infty(W) \sqrt{\frac{4}{3m} \log\frac{1}{\delta}} + L_2(W) \right) + L_\infty(W). \tag{12}$$

We know $L_\infty(W)$ is bounded by 2, so the question is how large $L_2(W)$ is. In the next lemma, we show it is small for most random $W$.

Lemma 3. There exists a constant $c_g$ such that for any $0 < \delta < 1$, with probability at least $1 - \delta$ over $W = \{w_i\}_{i=1}^{n}$ sampled from the unit sphere uniformly at random,
$$(L_2(W))^2 \le c_g \left( \sqrt{\frac{\log d}{nd}} \log\frac{1}{\delta} + \frac{1}{n} \log\frac{1}{\delta} \right).$$

One limitation of the analysis is that we have not analyzed how to obtain a solution with both small discrepancy and small gradient. To encourage solutions with small discrepancy, we propose a novel regularization term that minimizes the $L_2$ discrepancy. Detailed experiments are in the supplementary material.
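As a rough illustration of how such a regularizer could be implemented (the paper's actual regularizer and experiments are in the supplementary material, so everything below is an assumed stand-in), one can penalize a Monte Carlo estimate of $(L_2(W))^2$, replacing the indicator $\mathbb{I}[w^\top x \ge 0]$ with a sigmoid so that the penalty is differentiable in $W$.

```python
import numpy as np

# A rough, assumed sketch of an L2-discrepancy regularizer: penalize a Monte Carlo estimate of
# (L2(W))^2, with the indicator 1[w^T x >= 0] replaced by a sigmoid so the penalty is smooth in W.
# This is only an illustration; the paper's actual regularizer is described in its supplementary.
rng = np.random.default_rng(3)

def soft_l2_discrepancy_sq(W, num_pairs=2000, temp=20.0, rng=rng):
    """Smooth Monte Carlo surrogate for (L2(W))^2 over random pairs (x, y) on the sphere."""
    d = W.shape[1]
    X = rng.normal(size=(num_pairs, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
    Y = rng.normal(size=(num_pairs, d)); Y /= np.linalg.norm(Y, axis=1, keepdims=True)
    sx = 1.0 / (1.0 + np.exp(-temp * (W @ X.T)))        # soft indicator of w^T x >= 0
    sy = 1.0 / (1.0 + np.exp(-temp * (W @ Y.T)))
    frac = (sx * sy).mean(axis=0)                       # soft version of |W ∩ S_xy| / n
    theta = np.arccos(np.clip((X * Y).sum(axis=1), -1.0, 1.0))
    area = (np.pi - theta) / (2 * np.pi)                # normalized area A(S_xy)
    return np.mean((frac - area) ** 2)

# Example penalized objective:  L(f) + lam * soft_l2_discrepancy_sq(W)  for some lam > 0.
W = rng.normal(size=(50, 10)); W /= np.linalg.norm(W, axis=1, keepdims=True)
print(f"discrepancy penalty ≈ {soft_l2_discrepancy_sq(W):.2e}")
```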

5 Conclusion

We have analyzed one-hidden-layer neural networks and identified novel conditions under which they achieve small training and test errors despite the non-convexity of the loss function. Although we focus on a least-squares loss and a uniform input distribution, the analysis technique can be readily extended to other loss functions and input distributions. At the moment, our analysis is still limited in the sense that it is independent of the actual algorithm. In future work, we will explore the interplay between the discrepancy and gradient descent. In addition, we will further investigate how to design an algorithm that guarantees good discrepancy and thus small errors, possibly in a way similar to [7] for low-rank recovery problems.

References

[1] D. Bilyk and M. T. Lacey. One bit sensing, discrepancy, and Stolarsky principle. arXiv preprint arXiv:1511.08452, 2015.
[2] B. Chazelle. The Discrepancy Method: Randomness and Complexity. Cambridge University Press, 2000.
[3] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.
[4] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, 2014.
[5] J. Dick and F. Pillichshammer. Discrepancy theory and quasi-Monte Carlo integration. In A Panorama of Discrepancy Theory, pages 539–619. Springer, 2014.
[6] R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points – online stochastic gradient for tensor decomposition. arXiv preprint arXiv:1503.02101, 2015.
[7] R. Ge, J. D. Lee, and T. Ma. Matrix completion has no spurious local minimum. arXiv preprint arXiv:1605.07272, 2016.
[8] M. Janzamin, H. Sedghi, and A. Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. CoRR abs/1506.08473, 2015.
[9] K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems (NIPS), 2016.
[10] J. Lee, M. Simchowitz, M. Jordan, and B. Recht. Gradient descent only converges to minimizers. In Proceedings of the Annual Conference on Learning Theory (COLT), 2016.
[11] D. Soudry and Y. Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
