A norm-concentration argument for non-convex regularisation

Ata Kabán & Robert J. Durrant
[email protected]
School of Computer Science, The University of Birmingham, Edgbaston, B15 2TT, UK

1. Introduction

L1-regularisation has become a workhorse in statistical machine learning because of its sparsity-inducing property and convenient convexity. In addition, detailed theoretical and empirical analysis (Ng, 2004) has shown its ability to learn with exponentially many irrelevant features, in the context of L1-regularised logistic regression. However, independent results in several areas indicate added value in non-convex norm regularisation, despite the existence of local optima. Work in statistics (Fan & Li, 2001) and signal reconstruction (Wipf & Rao, 2005) has established the oracle properties of non-convex regularisers. Good empirical results were also reported in signal processing (Chartrand, 2007) and SVM classification (Weston et al., 2003). Furthermore, a family of non-convex norms, which we shall refer to as fractional norms in the rest of the paper, turned out to consistently outperform the L1 regulariser in real high-dimensional genomic data classification (Liu et al., 2007), both in terms of error rates and interpretability. Related ideas, termed 'zero-norm' regularisation (Weston et al., 2003), were also found useful in many other applications, though their success appeared to be data dependent. It is therefore of interest to gain a better understanding of the potential advantages of non-convex norm regularisers, which is our purpose here.

2. Regularised regression in high dimensions

We are given a training set of input-target pairs {(x_j, y_j)}_{j=1}^n, where the x_j ∈ R^m are m-dimensional input points and the y_j ∈ {−1, 1} are their labels. We are interested in high-dimensional problems with few (r << m) relevant features and small sample size n << m. Consider regularised logistic regression for concreteness:

    max_w  Σ_{j=1}^n log p(y_j | x_j, w)   subject to   ||w||_q ≤ A        (1)

where w ∈ R^{1×m} are the parameters, and the norm in the regularisation term is defined as ||w||_q = (Σ_{i=1}^m |w_i|^q)^{1/q}. Note that if q = 2 or q = 1, this is L2- or L1-regularised regression respectively. However, if q ∈ (0, 1), we have a non-convex regularisation term, which we will refer to as Lq<1-regularisation or 'fractional norm' regularisation. This is not strictly a norm in the mathematical sense, since it does not satisfy the triangle inequality. Also, parameter estimation becomes more difficult than with the more common L1 or L2 norms, because the Lq<1-norm is non-differentiable at zero and non-convex. Algorithms for this setting have recently been developed (Fan & Li, 2001; Kabán & Durrant, 2008), and we use these in the reported numerical simulations.
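To make the two ingredients of problem (1) concrete, the following minimal Python/NumPy sketch evaluates the logistic log-likelihood and the fractional norm for a given w. It is an illustration only; the function names are ours, and this is not the estimation algorithm of (Fan & Li, 2001) or (Kabán & Durrant, 2008).

```python
import numpy as np

def lq_norm(w, q):
    """Fractional 'norm' ||w||_q = (sum_i |w_i|^q)^(1/q); for q < 1 it is
    non-convex and does not satisfy the triangle inequality."""
    return np.sum(np.abs(w) ** q) ** (1.0 / q)

def log_likelihood(w, X, y):
    """sum_j log p(y_j | x_j, w) for logistic regression with y_j in {-1, +1}."""
    margins = y * (X @ w)
    return -np.sum(np.logaddexp(0.0, -margins))   # log(1 + exp(-margin)), stably

def constrained_objective(w, X, y, q, A):
    """Objective of problem (1): log-likelihood if ||w||_q <= A, else -inf.
    A crude illustration of the constrained formulation, not a solver."""
    return log_likelihood(w, X, y) if lq_norm(w, q) <= A else -np.inf
```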

Sample complexity. Noticing that ||w||_{q<1} ≥ ||w||_1 for all w, and extending the result of (Ng, 2004) obtained for L1-regularised logistic regression, it can be shown (Kabán & Durrant, 2008) that Lq<1-norm regularised logistic regression also enjoys a sample complexity that is logarithmic in the data dimensionality m and polynomial in the number of relevant features r and the other quantities of interest, n = Ω((log m) × poly(A, r^{1/q}, 1/ε, log(1/δ))). Logarithmic bounds are the best known bounds for feature selection. However, we are also interested in whether, and in which cases, there is any advantage in using q < 1 rather than q = 1. Fractional norms were previously studied in the databases and data engineering literature (Aggarwal et al., 2001; François et al., 2007) for mitigating the dimensionality curse. In the sequel, we use some of their results to better understand the effect of q in the regularisation term.

2.1. A norm-concentration view

Consider the un-regularised version of the problem. Because n << m, the system is under-determined and so the set of solutions is infinite. For the analysis that follows, we would like to capture the distribution of this set of solutions to the unregularised model. To ease notation, and without loss of generality, we take the m − n free components of w, which can be set arbitrarily, to be components n + 1, ..., m. We can model the distribution of these arbitrary components as being i.i.d. uniform. So we will have


w_i ∼ Unif[−a, a], ∀i ∈ {n + 1, ..., m}, with some large a; in fact, the result of our analysis will turn out not to depend on a.

It is well known that in such ill-conditioned problems, the regularisation term is meant to constrain the problem and make it well-posed. This is indeed so, as long as m is not too large. However, in very high dimensions the rather counter-intuitive phenomenon known as the concentration of distances and norms comes into play, which has been overlooked in previous analyses of regularisation. That is, as the dimensionality grows, the norm that appears in the regularisation constraint becomes essentially the same for all minimisers of the likelihood term. Therefore the problem remains ill-conditioned despite the regularisation. To see this, we use a result due to (Beyer et al., 1999).

Theorem 1 (Beyer et al., 1999). Let w_1, ..., w_n be a random sample of size n drawn from the multivariate distribution of w as defined by the likelihood term, and let p > 0 be arbitrary. If lim_{m→∞} Var[(||w||_q)^p] / E[(||w||_q)^p]^2 = 0, then ∀ε > 0,

    lim_{m→∞} P( max_{1≤j≤n} ||w_j||_q ≤ (1 + ε) min_{1≤j≤n} ||w_j||_q ) = 1,

where E[.] and Var[.] are the theoretical expectation and variance, and the probability on the r.h.s. is over an arbitrary random sample of size n.

Applying this to our case, denoting RV_m^{(p)} = Var[(||w||_q)^p] / E[(||w||_q)^p]^2, choosing p = q to ease the computations, and using the independence of w_{n+1}, ..., w_m,

    RV_m^{(q)} = ( Σ_{i=1}^n Σ_{j=1}^m Cov[|w_i|^q, |w_j|^q] + Σ_{i=n+1}^m Var[|w_i|^q] ) / ( Σ_{i=1}^m Σ_{j=1}^m E[|w_i|^q] E[|w_j|^q] )

which converges to 0 as m → ∞. Hence, in very high dimensions, the regularisation terms of all the possible solutions become essentially indistinguishable. In fact, simulations in (Beyer et al., 1999) indicate the problem may become of concern already at 10-20 dimensions.
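The concentration described by Theorem 1 is easy to observe numerically. The sketch below, an illustration under our working assumption that the free components are i.i.d. Unif[−a, a], draws a small sample of such vectors and reports the ratio max_j ||w_j||_q / min_j ||w_j||_q, which approaches 1 as m grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_min_ratio(m, q, n=10, a=1.0):
    """max_j ||w_j||_q / min_j ||w_j||_q over a sample of n random vectors
    with i.i.d. Unif[-a, a] components (the free components of w)."""
    w = rng.uniform(-a, a, size=(n, m))
    norms = np.sum(np.abs(w) ** q, axis=1) ** (1.0 / q)
    return norms.max() / norms.min()

for q in (0.5, 1.0, 2.0):
    ratios = [round(max_min_ratio(m, q), 3) for m in (10, 100, 1000, 10000)]
    print(f"q={q}: max/min ratios for m=10,100,1000,10000 -> {ratios}")
```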

2.1.1. The effect of q

Fortunately, not all norms concentrate equally fast as the dimensionality increases. In particular, the family of norms used in our regularisation term was studied in this respect by (Aggarwal et al., 2001; François et al., 2007). Here we propose a straightforward extension of a recent result by (François et al., 2007) to give us insight into the effect of q and to guide our choice of q for the type of problems considered. For this part of the analysis, p = 1.

Theorem 2 (François (2007), extended). If w ∈ R^m is a random vector with no more than n < m non-iid components, where n is finite, then

    lim_{m→∞} m · Var[||w||_q] / E[||w||_q]^2 = (1/q^2) (σ^2 / μ^2)        (2)

where μ = E[|w_{n+1}|^q], σ^2 = Var[|w_{n+1}|^q] and n + 1 is one of the i.i.d. dimensions of w.

Proof (sketch). To allow for a finite number of possibly non-iid marginal distributions we write

    lim_{m→∞} (1/m) Σ_{i=1}^m |w_i|^q = lim_{m→∞} (1/m) { Σ_{i=1}^n |w_i|^q + (m − n) Σ_{i=n+1}^m |w_i|^q / (m − n) } = lim_{m→∞} Σ_{i=n+1}^m |w_i|^q / (m − n)

and

    lim_{m→∞} Var[||w||_q^q] / m = lim_{m→∞} (1/m) { Σ_{i=1}^n Σ_{j=1}^m Cov(|w_i|^q, |w_j|^q) + Σ_{i=n+1}^m Var(|w_i|^q) } = Var[|w_{n+1}|^q].

Using these, the rest of the proof follows the same steps as the original theorem: we can show that lim_{m→∞} E[||w||_q] / m^{1/q} = μ^{1/q} and lim_{m→∞} Var[||w||_q] / m^{2/q−1} = (σ^2/q^2) μ^{2(1−q)/q}, which put together give the required result. Q.E.D.

As in (François et al., 2007), we can then approximate RV_m^{(1)} for some large m using (2), so we can read off the optimal q by computing the maximiser of this expression. For w_{n+1} ∼ Unif[−a, a], we get

    μ = E[|w_{n+1}|^q] = ∫_{−a}^{a} |w_{n+1}|^q (1/(2a)) dw_{n+1} = a^q / (q + 1)

    σ^2 = E[|w_{n+1}|^{2q}] − E[|w_{n+1}|^q]^2 = a^{2q} q^2 / ((2q + 1)(q + 1)^2)

so we have

    Var[||w||_q] / E[||w||_q]^2 ≈ (1/m) · 1/(2q + 1)        (3)

which rather conveniently turns out to be independent of a. Most importantly, we see that (3) is a monotonically decreasing function of q. In other words, in small sample size problems and with increasing input dimensions, the q-norm regularisation term concentrates the slowest if we choose the smallest q. Therefore, in such settings, we can conclude from this analysis that the 0-norm regulariser represents the best choice from the point of view of keeping the problem from becoming ill-conditioned up to fairly high dimensions.
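As a sanity check on (3), the following sketch (again assuming i.i.d. Unif[−a, a] components) estimates Var[||w||_q] / E[||w||_q]^2 by Monte-Carlo and compares it with the approximation 1/(m(2q + 1)); the slower concentration for smaller q is visible directly.

```python
import numpy as np

rng = np.random.default_rng(1)

def relative_variance(m, q, a=1.0, n_draws=5000):
    """Monte-Carlo estimate of Var[||w||_q] / E[||w||_q]^2 for w with
    i.i.d. Unif[-a, a] components."""
    w = rng.uniform(-a, a, size=(n_draws, m))
    norms = np.sum(np.abs(w) ** q, axis=1) ** (1.0 / q)
    return norms.var() / norms.mean() ** 2

m = 1000
for q in (0.1, 0.3, 0.5, 1.0, 2.0):
    print(f"q={q}: simulated {relative_variance(m, q):.2e}  "
          f"approximation (3) {1.0 / (m * (2 * q + 1)):.2e}")
```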

3. Results

We generated synthetic data sets similarly to (Ng, 2004), with r = 1 and r = 3 relevant features. We keep r small to suppress the effect of q on the sample complexity w.r.t. r, and to follow the differences attributable to the norm-concentration effects. In each experiment, the training and validation set sizes (the latter used to select the regularisation parameter) were 70 and 30 respectively, and performance was measured on an independent test set of size 100.
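The exact generator of (Ng, 2004) is not reproduced here; the sketch below is a hypothetical stand-in that only captures the ingredients stated in the text: m features of which r are relevant, labels in {−1, 1} depending only on those r features, and the 70/30/100 split sizes.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_synthetic(n, m, r, rng):
    """Hypothetical generator: m features, of which only the first r are
    relevant; the label is the sign of a linear function of those r features."""
    X = rng.normal(size=(n, m))
    w_star = np.zeros(m)
    w_star[:r] = 1.0                       # only r relevant features
    y = np.where(X @ w_star >= 0.0, 1, -1)  # noiseless linear labelling
    return X, y

m, r = 1000, 3                                # example dimensionality
X_tr, y_tr = make_synthetic(70, m, r, rng)    # training set
X_va, y_va = make_synthetic(30, m, r, rng)    # validation set (selects the regularisation parameter)
X_te, y_te = make_synthetic(100, m, r, rng)   # independent test set
```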


[Figure 1: line plots of test 0-1 errors, test logloss and number of features retained, against the data dimensionality (200-1000) and against q (axis labelled q*10, i.e. q from 0.1 to 1), for q ∈ {0.1, 0.3, 0.5, 0.7, 1}, with 1 and 3 relevant features.]

Figure 1. Comparative results (see text for details). The 0-1 errors are out of 100.

Figure 1 gives the results in terms of both the number of 0-1 errors and the logloss, for different values of q, and varying the data dimensionality. On these plots, each point represents the median of 140 independent trials. We see that the logarithmic increase of errors with the data dimensionality, as predicted by the theory, is well supported. More interestingly, smaller values of q do systematically achieve significant improvements. Also, the relevant features are more correctly recovered, which clearly favours interpretability. L1-regularisation in turn tends to retain too many features in high dimensions.

[Figure 2: boxplots of 0-1 test errors and test logloss for q ∈ {0.1, 0.5, 1}, with training + validation set sizes of 52+23 and 35+15.]

Figure 2. Results on 5000-dimensional synthetic data with only one relevant feature and small sample size. Each boxplot summarises 30 independent trials. The 0-1 errors are out of 100.

In a subsequent experiment we generated 5000-dimensional data with only one relevant feature, and even smaller training set sizes. This is shown in Figure 2. Lq<1-regularisation still has excellent performance (the median of the 0-1 errors is still 0) and the improvement over L1-regularisation becomes even larger.

To summarise, in this work we considered a special problem setting, which is nevertheless quite often relevant, in gene expression arrays for example; the superior empirical results of (Liu et al., 2007) were in fact a motivating factor for our study. Our analysis gives some new insights that complement other analysis frameworks, and our numerical simulations are in agreement with the theory.

Acknowledgement

RJD was funded by an EPSRC CTA studentship.

References

C.C. Aggarwal, A. Hinneburg, & D.A. Keim. On the surprising behavior of distance metrics in high dimensional space. Proc. Int. Conf. Database Theory, 2001.

K. Beyer, J. Goldstein, R. Ramakrishnan, & U. Shaft. When is nearest neighbor meaningful? Proc. Int. Conf. Database Theory, pp. 217-235, 1999.

R. Chartrand. Exact reconstructions of sparse signals via non-convex minimization. IEEE Signal Process. Lett., vol. 14, pp. 707-710, 2007.

J. Fan & R. Li. Variable Selection via Non-concave Penalized Likelihood and its Oracle Properties. J. Amer. Stat. Assoc., vol. 96, no. 456, Dec 2001.

D. François, V. Wertz, & M. Verleysen. The concentration of fractional distances. IEEE Trans. on Knowledge and Data Engineering, vol. 19, no. 7, July 2007.

A. Kabán & R.J. Durrant. Learning with Lq<1 vs. L1-norm regularisation with exponentially many irrelevant features. Proc. ECML 2008.

Z. Liu, F. Jiang, G. Tian, S. Wang, F. Sato, S.J. Meltzer, & M. Tan. Sparse Logistic Regression with Lp Penalty for Biomarker Identification. Statistical Applications in Genetics and Molecular Biology, vol. 6, issue 1, 2007.

A.Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. Proc. ICML 2004.

J. Weston, A. Elisseeff, B. Schölkopf, & M. Tipping. Use of the Zero-Norm with Linear Models and Kernel Methods. J. Machine Learning Research 3, pp. 1439-1461, 2003.

D.P. Wipf & B.D. Rao. ℓ0-Norm Minimization for Basis Selection. NIPS 17, MIT Press, 2005.
