A norm-concentration argument for non-convex regularisation
Ata Kabán, Robert J. Durrant
School of Computer Science, The University of Birmingham, Birmingham B15 2TT, UK
ICML/UAI/COLT Workshop on Sparse Optimization and Variable Selection Helsinki, 9 July 2008.
Introduction

L1-regularisation, a workhorse in machine learning:
• sparsity
• convexity
• logarithmic sample complexity

Non-convex norm regularisation seems to have added value:
• statistics (Fan & Li, '01): oracle property
• signal processing (Chartrand, '07), signal reconstruction (Wipf & Rao, '05)
• 0-norm SVM classification (Weston et al., '03) (results data-dependent)
• genomic data classification (Liu et al., '07)
Regularised regression in high dimensions

Training set {(x_j, y_j)}_{j=1}^n, where x_j ∈ R^m are m-dimensional inputs and y_j ∈ {−1, 1} are their labels. Scenario of interest: few (r << m) relevant features, small sample size (n << m). Consider regularised logistic regression for concreteness:

    max_w  Σ_{j=1}^n log p(y_j | x_j, w)   subject to   ||w||_q ≤ A        (1)

where w ∈ R^m are the unknown parameters, p(y | w^T x) = 1/(1 + exp(−y w^T x)), and ||w||_q = (Σ_{i=1}^m |w_i|^q)^{1/q}.

If q = 2: L2-regularised ('ridge') logistic regression.
If q = 1: L1-regularised ('lasso') logistic regression.
If q < 1: Lq<1-regularised logistic regression: non-convex, non-differentiable at 0.
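For concreteness, the two ingredients of (1) can be written down in code. This is a minimal sketch, not the estimation algorithm used here; the function names are ours, and the likelihood term is written for clarity rather than numerical robustness:

```python
import numpy as np

def lq_norm(w, q):
    """||w||_q = (sum_i |w_i|^q)^(1/q); only a quasi-norm for q < 1."""
    return np.sum(np.abs(w) ** q) ** (1.0 / q)

def neg_log_likelihood(w, X, y):
    """Negative log-likelihood of logistic regression with labels y in {-1, +1}.

    Each term is -log p(y_j | x_j, w) = log(1 + exp(-y_j w^T x_j)).
    """
    margins = y * (X @ w)
    return np.sum(np.log1p(np.exp(-margins)))
```

Maximising (1) is then minimising `neg_log_likelihood` subject to `lq_norm(w, q) <= A`; for q < 1 the feasible region is non-convex, which is exactly the setting studied here.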
A word on some recent estimation algorithms

[Figure: local quadratic (Fan & Li, '01) vs. local linear (Zou & Li, '08) bound of the penalty |w_i|^q, plotted over w_i ∈ [−4, 4], tangent at ±3.]

Although the local linear bound appears to be the closer approximation, framing the iterative estimation in the E-M methodology framework shows that the two are in fact equivalent (Kaban & Durrant, ECML'08).
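The two bounds can be compared numerically. A sketch, assuming the standard forms: the quadratic bound linearises |w|^q in w² at the tangent point, the linear bound linearises it in |w|; the function names and the choice q = 0.5, tangent at 3, are illustrative:

```python
import numpy as np

def penalty(w, q):
    """The non-convex penalty term |w|^q for a single coordinate."""
    return np.abs(w) ** q

def lqa_bound(w, w0, q):
    """Local quadratic bound (Fan & Li '01 style): linearise (w^2)^(q/2) in w^2 at w0."""
    return abs(w0) ** q + 0.5 * q * abs(w0) ** (q - 2) * (w ** 2 - w0 ** 2)

def lla_bound(w, w0, q):
    """Local linear bound (Zou & Li '08 style): linearise |w|^q in |w| at |w0|."""
    return abs(w0) ** q + q * abs(w0) ** (q - 1) * (np.abs(w) - abs(w0))

q, w0 = 0.5, 3.0
ws = np.linspace(-4.0, 4.0, 801)
```

A short calculation gives lqa_bound(w) − lla_bound(w) = (q/2)|w0|^{q−2}(|w| − |w0|)² ≥ 0, so the local linear bound is indeed pointwise tighter; the point of the figure is that both majorise |w|^q and touch it at ±w0.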
Sample complexity bound

H = {h(x, y) = −log p(y | w^T x) : x ∈ R^m, y ∈ {−1, 1}}   the function class
er_P(h) = E_{(x,y)~iid P}[h(x, y)]   the true error of h
êr_z(h) = (1/n) Σ_{i=1}^n h(x_i, y_i)   the sample error of h on training set z of size n
opt_P(H) = inf_{h∈H} er_P(h)   the approximation error of H
L(z) = argmin_{h∈H} êr_z(h)   the function returned by the learning algorithm

Theorem (A. Ng, '04, extended from L1 to Lq<1). ∀ε > 0, ∀δ > 0, ∀m, n ≥ 1, in order to ensure that er_P(L(z)) ≤ opt_P(H) + ε with probability 1 − δ, it is enough to have

    n = Ω((log m) × poly(A, r^{1/q}, 1/ε, log(1/δ)))        (2)

- logarithmic in the dimensionality m;
- polynomial in the number of relevant features, but growing with r^{1/q}.
[Figure: test logloss, 0-1 error, and validation logloss as functions of q ∈ [0.2, 1], for r ∈ {5, 10, 30, 50, 100}.]

Experiments on m = 200 dimensional data sets, varying the number of relevant features r ∈ {5, 10, 30, 50, 100}. The medians of 60 independent trials are shown and the error bars represent one standard error. The 0-1 errors are out of 100.
A norm concentration view

Consider the un-regularised version of the problem. Because n << m, the system is under-determined, so m − n components of w can be set arbitrarily. We can model the arbitrary components of w as i.i.d. uniform: w_i ~ Unif[−a, a], ∀i ∈ {n + 1, ..., m}, for some large a.

The regularisation term is meant to constrain the problem so as to make it well-posed. However, in very high dimensions a counter-intuitive phenomenon, known as the concentration of distances and norms, comes into play: the regularisation term becomes essentially the same for all of the infinitely many possible maximisers of the likelihood term.
Distance concentration

Distance concentration is the counter-intuitive phenomenon that, as the data dimensionality grows without bound, all pairwise distances between points become identical. The phenomenon affects every area where high-dimensional data processing is required, e.g. database indexing & retrieval, data analysis, statistical machine learning.

Concentration of the L2-norm (Demartines, '94). Let x ∈ R^m be a random vector with i.i.d. components of any distribution. Then,

    lim_{m→∞} E[||x||_2] / m^{1/2} = const.;    lim_{m→∞} Var[||x||_2] = const.        (3)
Concentration of arbitrary dissimilarity functions in arbitrary multivariate distributions (Beyer et al., '99).

Let F_m, m = 1, 2, ..., be an infinite sequence of data distributions and x_1^{(m)}, ..., x_n^{(m)} a random sample of n independent data vectors distributed as F_m. For each m, let ||·|| : dom(F_m) → R+ be a function that takes a point from the domain of F_m and returns a positive real value, and let p > 0 be an arbitrary positive constant. Assume that E[||x^{(m)}||^p] and Var[||x^{(m)}||^p] are finite and E[||x^{(m)}||^p] ≠ 0.

If  lim_{m→∞} Var[(||x^{(m)}||)^p] / E[(||x^{(m)}||)^p]^2 = 0,  then

    ∀ε > 0,  lim_{m→∞} P[ max_{1≤j≤n} ||x_j^{(m)}|| ≤ (1 + ε) min_{1≤j≤n} ||x_j^{(m)}|| ] = 1.
[Figure: sample estimate of Var[||x||²] / E[||x||²]² (left) and log(DMAX_m / DMIN_m) (right) against the number of dimensions m (up to 700), illustrating concentration as m grows.]
Applying this to our problem, denote RV_m^{(p)}(||x||_q) = Var[(||x||_q)^p] / E[(||x||_q)^p]^2.

Using the independence of w_{n+1}, ..., w_m, we get:

    RV_m^{(q)}(||w||_q) = ( Σ_{i=1}^{n} Σ_{j=1}^{n} Cov[|w_i|^q, |w_j|^q] + Σ_{i=n+1}^{m} Var[|w_i|^q] ) / ( Σ_{i=1}^{m} Σ_{j=1}^{m} E[|w_i|^q] E[|w_j|^q] )

which converges to 0 as m → ∞ (the numerator grows linearly in m while the denominator grows quadratically). Hence, the problem remains ill-posed despite the regularisation.
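A quick Monte Carlo sketch of this convergence, taking all components i.i.d. Unif[−a, a] for simplicity (i.e. no fixed components, so the covariance part of the numerator vanishes; the constants a, q and the sample sizes are illustrative):

```python
import numpy as np

# Monte Carlo estimate of RV_m^{(q)} = Var[||w||_q^q] / E[||w||_q^q]^2
# for w with i.i.d. Unif[-a, a] components, as m grows.
rng = np.random.default_rng(1)
a, q, n_trials = 5.0, 0.5, 5000

rv = {}
for m in (10, 100, 1000):
    w = rng.uniform(-a, a, size=(n_trials, m))
    s = np.sum(np.abs(w) ** q, axis=1)         # ||w||_q^q for each sample
    rv[m] = s.var() / s.mean() ** 2
    print(m, rv[m])
```

The estimates shrink roughly like 1/m, which is exactly the O(m)/O(m²) ratio of the numerator and denominator above.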
The effect of q

Fortunately, not all norms concentrate at the same rate.

[Figure: sample estimate of Var[||x||_q] / E[||x||_q]² against the number of dimensions m (up to 50), for x ~ Unif[0,1], comparing L2-norms, L1-norms, L0.5-'norms' and L0.1-'norms'.]
Theorem (Francois et al., '07, extended). If w ∈ R^m is a random vector with no more than n < m non-i.i.d. components, where n is finite, and all the other components are i.i.d., then

    lim_{m→∞} m · Var[||w||_q] / E[||w||_q]² = (1/q²) · (σ²/µ²)        (4)

where µ = E[|w_{n+1}|^q], σ² = Var[|w_{n+1}|^q], and n + 1 indexes one of the i.i.d. dimensions of w. Applying this to w, we can use (4) to approximate, for some large m,

    Var[||w||_q] / E[||w||_q]² ≈ (1/m) · (1/q²) · (σ²/µ²)        (5)
Computing µ and σ² for w_{n+1} ~ Unif[−a, a]:

    µ = E[|w_{n+1}|^q] = (1/(2a)) ∫_{−a}^{a} |w_{n+1}|^q dw_{n+1} = a^q / (q + 1)

    σ² = E[|w_{n+1}|^{2q}] − E[|w_{n+1}|^q]² = a^{2q} q² / ((2q + 1)(q + 1)²)

So,

    Var[||w||_q] / E[||w||_q]² ≈ (1/m) · 1/(2q + 1)        (6)

(Conveniently, a cancels out in this computation.)
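Approximation (6) is easy to check numerically; a sketch (the sample sizes and the choice a = 7 are arbitrary, and a should indeed not matter):

```python
import numpy as np

# Numerical check of (6): for i.i.d. Unif[-a, a] components,
# Var[||w||_q] / E[||w||_q]^2 ≈ 1 / (m (2q + 1)), independently of a.
rng = np.random.default_rng(2)
n_trials, m, a = 20000, 500, 7.0

checks = {}
for q in (0.1, 0.5, 1.0, 2.0):
    w = rng.uniform(-a, a, size=(n_trials, m))
    norms = np.sum(np.abs(w) ** q, axis=1) ** (1.0 / q)   # ||w||_q
    empirical = norms.var() / norms.mean() ** 2
    predicted = 1.0 / (m * (2 * q + 1))
    checks[q] = (empirical, predicted)
    print(q, empirical, predicted)
```

The empirical relative variances track 1/(m(2q+1)) closely, and are largest for the smallest q.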
Observe that this is a decreasing function of q. Thus the smaller the q, the better, from the point of view of countering the concentration of the norm in regularisation.
[Figure: test 0-1 errors and test logloss against the data dimensionality (200 to 1000), for q ∈ {0.1, 0.3, 0.5, 0.7, 1} and for 1 and 3 relevant features.]

Comparative results on synthetic data from (Ng, '04). Each point is the median of > 100 independent trials. The 0-1 errors are out of 100.
[Figure: 0-1 test errors and test logloss for q ∈ {0.1, 0.5, 1}, with train + validation set sizes 52+23 and 35+15.]

Results on 5000-dimensional synthetic data with only one relevant feature and an even smaller sample size. The improvement over L1 becomes larger. (The 0-1 errors are out of 100.)
[Figure: test 0-1 errors, test logloss, and number of features retained, as functions of q, for 1 relevant feature, 3 relevant features, and exponentially decaying relevance.]

Results on synthetic data from (A. Ng, '04). Training set size = 70, validation set size = 30, and out-of-sample test set size = 100. The statistics are over 10 independent runs with dimensionality ranging from 100 to 1000.
Discussion & further work

The learning-theoretic sample complexity bound for generalisation is only a (loose) upper bound. Our norm-concentration analysis so far has only used that n << m; further work should examine the effect of r << m from this perspective.

The phenomenon of concentration of norms and distances in very high dimensions impacts all high-dimensional problems. Its implications for learning and generalisation (and for other areas) are an open question.
References

C.C. Aggarwal, A. Hinneburg, & D.A. Keim. On the surprising behavior of distance metrics in high dimensional space. Proc. Int. Conf. Database Theory, 2001, pp. 420-434.

K. Beyer, J. Goldstein, R. Ramakrishnan, & U. Shaft. When is nearest neighbor meaningful? Proc. Int. Conf. Database Theory, pp. 217-235, 1999.

D. François, V. Wertz, & M. Verleysen. The concentration of fractional distances. IEEE Trans. on Knowledge and Data Engineering, vol. 19, no. 7, July 2007.

A. Kabán & R.J. Durrant. Learning with Lq<1 vs. L1-norm regularisation with exponentially many irrelevant features. Proc. ECML 2008, to appear.

Z. Liu, F. Jiang, G. Tian, S. Wang, F. Sato, S.J. Meltzer, & M. Tan. Sparse logistic regression with Lp penalty for biomarker identification. Statistical Applications in Genetics and Molecular Biology, vol. 6, issue 1, 2007.

A.Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. Proc. ICML 2004.

H. Zou & R. Li. One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 2008.