A norm-concentration argument for non-convex regularisation
Ata Kabán, Robert J. Durrant
School of Computer Science, The University of Birmingham, Birmingham B15 2TT, UK
ICML/UAI/COLT Workshop on Sparse Optimization and Variable Selection Helsinki, 9 July 2008.
Introduction

L1-regularisation, a workhorse in machine learning:
• sparsity
• convexity
• logarithmic sample complexity

Non-convex norm regularisation seems to have added value:
• statistics (Fan & Li, '01): oracle property
• signal processing (Chartrand, '07), signal reconstruction (Wipf & Rao, '05)
• 0-norm SVM classification (Weston et al., '03) (results data-dependent)
• genomic data classification (Liu et al., '07)
Regularised regression in high dimensions

Training set {(x_j, y_j)}_{j=1}^n, where x_j ∈ R^m are m-dimensional inputs and y_j ∈ {−1, 1} are their labels. Scenario of interest: few (r << m) relevant features, small sample size (n << m). Consider regularised logistic regression for concreteness:

    max_w  Σ_{j=1}^n log p(y_j | x_j, w)   subject to   ||w||_q ≤ A        (1)

where w ∈ R^m are the unknown parameters, p(y | w^T x) = 1/(1 + exp(−y w^T x)), and ||w||_q = (Σ_{i=1}^m |w_i|^q)^{1/q}.

If q = 2: L2-regularised ('ridge') logistic regression.
If q = 1: L1-regularised ('lasso') logistic regression.
If q < 1: Lq<1-regularised logistic regression: non-convex, non-differentiable at 0.
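For concreteness, the two ingredients of (1) can be written down in code. This is a minimal sketch, not the estimation algorithm used here; the function names are ours, and the likelihood term is written for clarity rather than numerical robustness:

```python
import numpy as np

def lq_norm(w, q):
    """||w||_q = (sum_i |w_i|^q)^(1/q); only a quasi-norm for q < 1."""
    return np.sum(np.abs(w) ** q) ** (1.0 / q)

def neg_log_likelihood(w, X, y):
    """Negative log-likelihood of logistic regression with labels y in {-1, +1}.

    Each term is -log p(y_j | x_j, w) = log(1 + exp(-y_j w^T x_j)).
    """
    margins = y * (X @ w)
    return np.sum(np.log1p(np.exp(-margins)))
```

Maximising (1) is then minimising `neg_log_likelihood` subject to `lq_norm(w, q) <= A`; for q < 1 the feasible region is non-convex, which is exactly the setting studied here.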
A word on some recent estimation algorithms

[Figure: local quadratic (Fan & Li, '01) vs. local linear (Zou & Li, '08) bound of the penalty |w_i|^q, plotted over w_i ∈ [−4, 4], tangent at ±3.]

Although the local linear bound appears to be the closer approximation, framing the iterative estimation in the E-M methodology framework shows that the two are in fact equivalent (Kaban & Durrant, ECML'08).
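The two bounds can be compared numerically. A sketch, assuming the standard forms: the quadratic bound linearises |w|^q in w² at the tangent point, the linear bound linearises it in |w|; the function names and the choice q = 0.5, tangent at 3, are illustrative:

```python
import numpy as np

def penalty(w, q):
    """The non-convex penalty term |w|^q for a single coordinate."""
    return np.abs(w) ** q

def lqa_bound(w, w0, q):
    """Local quadratic bound (Fan & Li '01 style): linearise (w^2)^(q/2) in w^2 at w0."""
    return abs(w0) ** q + 0.5 * q * abs(w0) ** (q - 2) * (w ** 2 - w0 ** 2)

def lla_bound(w, w0, q):
    """Local linear bound (Zou & Li '08 style): linearise |w|^q in |w| at |w0|."""
    return abs(w0) ** q + q * abs(w0) ** (q - 1) * (np.abs(w) - abs(w0))

q, w0 = 0.5, 3.0
ws = np.linspace(-4.0, 4.0, 801)
```

A short calculation gives lqa_bound(w) − lla_bound(w) = (q/2)|w0|^{q−2}(|w| − |w0|)² ≥ 0, so the local linear bound is indeed pointwise tighter; the point of the figure is that both majorise |w|^q and touch it at ±w0.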
Sample complexity bound

H = {h(x, y) = −log p(y | w^T x) : x ∈ R^m, y ∈ {−1, 1}}   the function class
er_P(h) = E_{(x,y)~iid P}[h(x, y)]   the true error of h
êr_z(h) = (1/n) Σ_{i=1}^n h(x_i, y_i)   the sample error of h on training set z of size n
opt_P(H) = inf_{h∈H} er_P(h)   the approximation error of H
L(z) = argmin_{h∈H} êr_z(h)   the function returned by the learning algorithm

Theorem (A. Ng, '04, extended from L1 to Lq<1). ∀ε > 0, ∀δ > 0, ∀m, n ≥ 1, in order to ensure that er_P(L(z)) ≤ opt_P(H) + ε with probability 1 − δ, it is enough to have

    n = Ω((log m) × poly(A, r^{1/q}, 1/ε, log(1/δ)))        (2)

- logarithmic in the dimensionality m;
- polynomial in the number of relevant features, but growing with r^{1/q}.
[Figure: test logloss, 0-1 error, and validation logloss as functions of q ∈ [0.2, 1], for r ∈ {5, 10, 30, 50, 100}.]

Experiments on m = 200 dimensional data sets, varying the number of relevant features r ∈ {5, 10, 30, 50, 100}. The medians of 60 independent trials are shown and the error bars represent one standard error. The 0-1 errors are out of 100.
A norm concentration view

Consider the un-regularised version of the problem. Because n << m, the system is under-determined, so m − n components of w can be set arbitrarily. We can model the arbitrary components of w as i.i.d. uniform: w_i ~ Unif[−a, a], ∀i ∈ {n + 1, ..., m}, for some large a.

The regularisation term is meant to constrain the problem so as to make it well-posed. However, in very high dimensions a counter-intuitive phenomenon, known as the concentration of distances and norms, comes into play: the regularisation term becomes essentially the same for all of the infinitely many possible maximisers of the likelihood term.
Distance concentration

Distance concentration is the counter-intuitive phenomenon that, as the data dimensionality grows without bound, all pairwise distances between points become identical. The phenomenon affects every area where high-dimensional data processing is required, e.g. database indexing & retrieval, data analysis, statistical machine learning.

Concentration of the L2-norm (Demartines, '94). Let x ∈ R^m be a random vector with i.i.d. components of any distribution. Then,

    lim_{m→∞} E[||x||_2] / m^{1/2} = const.;    lim_{m→∞} Var[||x||_2] = const.        (3)
Concentration of arbitrary dissimilarity functions in arbitrary multivariate distributions (Beyer et al., '99).

Let F_m, m = 1, 2, ..., be an infinite sequence of data distributions and x_1^{(m)}, ..., x_n^{(m)} a random sample of n independent data vectors distributed as F_m. For each m, let ||·|| : dom(F_m) → R+ be a function that takes a point from the domain of F_m and returns a positive real value, and let p > 0 be an arbitrary positive constant. Assume that E[||x^{(m)}||^p] and Var[||x^{(m)}||^p] are finite and E[||x^{(m)}||^p] ≠ 0.

If  lim_{m→∞} Var[(||x^{(m)}||)^p] / E[(||x^{(m)}||)^p]^2 = 0,  then

    ∀ε > 0,  lim_{m→∞} P[ max_{1≤j≤n} ||x_j^{(m)}|| ≤ (1 + ε) min_{1≤j≤n} ||x_j^{(m)}|| ] = 1.
[Figure: sample estimate of Var[||x||²] / E[||x||²]² (left) and log(DMAX_m / DMIN_m) (right) against the number of dimensions m (up to 700), illustrating concentration as m grows.]
Applying this to our problem, denote RV_m^{(p)}(||x||_q) = Var[(||x||_q)^p] / E[(||x||_q)^p]^2.

Using the independence of w_{n+1}, ..., w_m, we get:

    RV_m^{(q)}(||w||_q) = ( Σ_{i=1}^{n} Σ_{j=1}^{n} Cov[|w_i|^q, |w_j|^q] + Σ_{i=n+1}^{m} Var[|w_i|^q] ) / ( Σ_{i=1}^{m} Σ_{j=1}^{m} E[|w_i|^q] E[|w_j|^q] )

which converges to 0 as m → ∞ (the numerator grows linearly in m while the denominator grows quadratically). Hence, the problem remains ill-posed despite the regularisation.
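A quick Monte Carlo sketch of this convergence, taking all components i.i.d. Unif[−a, a] for simplicity (i.e. no fixed components, so the covariance part of the numerator vanishes; the constants a, q and the sample sizes are illustrative):

```python
import numpy as np

# Monte Carlo estimate of RV_m^{(q)} = Var[||w||_q^q] / E[||w||_q^q]^2
# for w with i.i.d. Unif[-a, a] components, as m grows.
rng = np.random.default_rng(1)
a, q, n_trials = 5.0, 0.5, 5000

rv = {}
for m in (10, 100, 1000):
    w = rng.uniform(-a, a, size=(n_trials, m))
    s = np.sum(np.abs(w) ** q, axis=1)         # ||w||_q^q for each sample
    rv[m] = s.var() / s.mean() ** 2
    print(m, rv[m])
```

The estimates shrink roughly like 1/m, which is exactly the O(m)/O(m²) ratio of the numerator and denominator above.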
The effect of q

Fortunately, not all norms concentrate at the same rate.

[Figure: sample estimate of Var[||x||_q] / E[||x||_q]² against the number of dimensions m (up to 50), for x ~ Unif[0,1], comparing L2-norms, L1-norms, L0.5-'norms' and L0.1-'norms'.]
Theorem (Francois et al., '07, extended). If w ∈ R^m is a random vector with no more than n < m non-i.i.d. components, where n is finite, and all the other components are i.i.d., then

    lim_{m→∞} m · Var[||w||_q] / E[||w||_q]² = (1/q²) · (σ²/µ²)        (4)

where µ = E[|w_{n+1}|^q], σ² = Var[|w_{n+1}|^q], and n + 1 indexes one of the i.i.d. dimensions of w. Applying this to w, we can use (4) to approximate, for some large m,

    Var[||w||_q] / E[||w||_q]² ≈ (1/m) · (1/q²) · (σ²/µ²)        (5)
Computing µ and σ² for w_{n+1} ~ Unif[−a, a]:

    µ = E[|w_{n+1}|^q] = (1/(2a)) ∫_{−a}^{a} |w_{n+1}|^q dw_{n+1} = a^q / (q + 1)

    σ² = E[|w_{n+1}|^{2q}] − E[|w_{n+1}|^q]² = a^{2q} q² / ((2q + 1)(q + 1)²)

So,

    Var[||w||_q] / E[||w||_q]² ≈ (1/m) · 1/(2q + 1)        (6)

(Conveniently, a cancels out in this computation.)
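Approximation (6) is easy to check numerically; a sketch (the sample sizes and the choice a = 7 are arbitrary, and a should indeed not matter):

```python
import numpy as np

# Numerical check of (6): for i.i.d. Unif[-a, a] components,
# Var[||w||_q] / E[||w||_q]^2 ≈ 1 / (m (2q + 1)), independently of a.
rng = np.random.default_rng(2)
n_trials, m, a = 20000, 500, 7.0

checks = {}
for q in (0.1, 0.5, 1.0, 2.0):
    w = rng.uniform(-a, a, size=(n_trials, m))
    norms = np.sum(np.abs(w) ** q, axis=1) ** (1.0 / q)   # ||w||_q
    empirical = norms.var() / norms.mean() ** 2
    predicted = 1.0 / (m * (2 * q + 1))
    checks[q] = (empirical, predicted)
    print(q, empirical, predicted)
```

The empirical relative variances track 1/(m(2q+1)) closely, and are largest for the smallest q.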
Observe that this is a decreasing function of q. Thus the smaller the q, the better, from the point of view of countering the concentration of the norm in regularisation.
[Figure: test 0-1 errors and test logloss against the data dimensionality (200 to 1000), for q ∈ {0.1, 0.3, 0.5, 0.7, 1} and for 1 and 3 relevant features.]

Comparative results on synthetic data from (Ng, '04). Each point is the median of > 100 independent trials. The 0-1 errors are out of 100.
[Figure: 0-1 test errors and test logloss for q ∈ {0.1, 0.5, 1}, with train + validation set sizes 52+23 and 35+15.]

Results on 5000-dimensional synthetic data with only one relevant feature and an even smaller sample size. The improvement over L1 becomes larger. (The 0-1 errors are out of 100.)
[Figure: test 0-1 errors, test logloss, and number of features retained, as functions of q, for 1 relevant feature, 3 relevant features, and exponentially decaying relevance.]

Results on synthetic data from (A. Ng, '04). Training set size = 70, validation set size = 30, and out-of-sample test set size = 100. The statistics are over 10 independent runs with dimensionality ranging from 100 to 1000.
Discussion & further work

The learning-theoretic sample complexity bound for generalisation is only a (loose) upper bound. Our norm-concentration analysis so far has only used that n << m; further work should examine the effect of r << m from this perspective.

The phenomenon of concentration of norms and distances in very high dimensions impacts all high-dimensional problems. Its implications for learning and generalisation (and for other areas) are an open question.
References

C.C. Aggarwal, A. Hinneburg, & D.A. Keim. On the surprising behavior of distance metrics in high dimensional space. Proc. Int. Conf. Database Theory, 2001, pp. 420-434.

K. Beyer, J. Goldstein, R. Ramakrishnan, & U. Shaft. When is nearest neighbor meaningful? Proc. Int. Conf. Database Theory, pp. 217-235, 1999.

D. François, V. Wertz, & M. Verleysen. The concentration of fractional distances. IEEE Trans. on Knowledge and Data Engineering, vol. 19, no. 7, July 2007.

A. Kabán & R.J. Durrant. Learning with Lq<1 vs. L1-norm regularisation with exponentially many irrelevant features. Proc. ECML 2008, to appear.

Z. Liu, F. Jiang, G. Tian, S. Wang, F. Sato, S.J. Meltzer, & M. Tan. Sparse logistic regression with Lp penalty for biomarker identification. Statistical Applications in Genetics and Molecular Biology, vol. 6, issue 1, 2007.

A.Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. Proc. ICML 2004.

H. Zou & R. Li. One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 2008.