In Defense of l0

Dongyu Lin, Emily Pitler, Dean P. Foster, Lyle H. Ungar
University of Pennsylvania, Philadelphia, PA 19104 USA

[email protected] [email protected] [email protected] [email protected]

Keywords: variable selection, best subset selection, l1 regularization, Lasso, stepwise regression

Abstract

In the past decade, there has been an explosion of interest in using $\ell_1$-regularization in place of $\ell_0$-regularization for feature selection. We present results showing that while $\ell_1$-regularization never outperforms $\ell_0$-regularization by more than a constant factor, in some cases using an $\ell_1$ penalty is infinitely worse than using an $\ell_0$ penalty. We also compare algorithms for solving these two problems and show that, although good solvers have been developed for the $\ell_1$ problem, they may not perform as well as classic stepwise regression, a greedy $\ell_0$ surrogate. In other words, "an approximate solution to the right problem" can be better than "an exact solution to the wrong problem."

We focus on variable selection problems in which there is a large set of potential features, only a few of which are likely to be helpful. This type of sparsity occurs often in machine learning tasks, such as predicting disease based on millions of genes, or predicting the topic of a document based on the occurrences of hundreds of thousands of words.

Consider a normal linear model $y = X\beta + \varepsilon$, where $y$ is the response vector of $n$ observations, $X = (x_1, \ldots, x_p)$ is an $n \times p$ design matrix, $\beta = (\beta_1, \ldots, \beta_p)'$ is the vector of coefficients, and the error $\varepsilon \sim N(0, \sigma^2 I_n)$. Assume that only a subset of $\{x_j\}_{j=1}^{p}$ has nonzero coefficients. The traditional statistical approach to this problem, the $\ell_0$ problem, finds an estimator that minimizes the $\ell_0$-penalized sum of squared errors

$$\arg\min_{\beta} \left\{ \|y - X\beta\|^2 + \Pi \sigma^2 \|\beta\|_{\ell_0} \right\}, \qquad (1)$$

where $\|\beta\|_{\ell_0} = \sum_{i=1}^{p} I_{\{\beta_i \neq 0\}}$.

However, this problem has been shown to be NP-hard (Natarajan, 1995). A tractable relaxation replaces the $\ell_0$ penalty with the $\ell_1$ norm $\|\beta\|_{\ell_1} = \sum_{i=1}^{p} |\beta_i|$ and seeks

$$\arg\min_{\beta} \left\{ \|y - X\beta\|^2 + \lambda \|\beta\|_{\ell_1} \right\}, \qquad (2)$$

known as the $\ell_1$-regularization problem (Tibshirani, 1996). Because (2) is convex, it can be computed far more efficiently (Efron et al., 2004; Candes & Tao, 2007).
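To make the contrast concrete, the following toy sketch (ours, not from the paper; problem sizes and penalty values are illustrative assumptions) solves (1) exactly by exhaustive subset search, which takes exponential time, and (2) by plain coordinate descent, which is fast because the problem is convex:

```python
# Toy contrast of the l0 objective (1) and the l1 objective (2).
# Illustrative sketch only; sizes and penalty values are assumptions.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 8, 1.0
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:2] = 3.0                       # sparse truth: 2 of 8 features matter
y = X @ beta_true + sigma * rng.standard_normal(n)

def best_subset(X, y, penalty):
    """Exact l0 solution of (1) by exhaustive search -- O(2^p)."""
    n, p = X.shape
    best_cost, best_beta = np.sum(y ** 2), np.zeros(p)
    for k in range(1, p + 1):
        for S in itertools.combinations(range(p), k):
            idx = list(S)
            b, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
            rss = np.sum((y - X[:, idx] @ b) ** 2)
            cost = rss + penalty * k
            if cost < best_cost:
                best_cost = cost
                best_beta = np.zeros(p)
                best_beta[idx] = b
    return best_beta

def lasso_cd(X, y, lam, n_iter=200):
    """l1 solution of (2) by plain coordinate descent."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]    # partial residual
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0) / col_sq[j]
    return beta

Pi = 2 * np.log(p)                         # risk-inflation penalty scale
print("l0 support:", np.nonzero(best_subset(X, y, Pi * sigma ** 2))[0])
print("l1 support:", np.nonzero(lasso_cd(X, y, lam=20.0))[0])
```

For $p$ beyond a few dozen features, the exhaustive search is hopeless, which is exactly why greedy surrogates such as stepwise regression matter.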

There are three main reasons why we believe the $\ell_0$ problem is the more correct problem and why stepwise regression can perform better than $\ell_1$ algorithms in sparse settings.

First, $\ell_0$ solutions are more predictive. Suppose $\hat\beta$ is an estimator of $\beta$. We consider the predictive risk of $\hat\beta$, defined as

$$R(\beta, \hat\beta) = E_\beta \|X\beta - X\hat\beta\|^2. \qquad (3)$$

We consider the special case where $X$ is orthogonal. The $\ell_0$ problem can then be solved by simply picking those predictors whose least squares estimates satisfy $|\hat\beta_i| > \gamma \geq 0$, where the choice of $\gamma$ depends on the noise level of the model and on the penalty scale $\Pi$ in (1); Donoho & Johnstone (1994) and Foster & George (1994) showed that $\Pi = 2\log p$ is optimal. Let

$$\hat\beta_{\ell_0}(\gamma_0) = \left( \hat\beta_1 I_{\{|\hat\beta_1| > \gamma_0\}}, \ldots, \hat\beta_p I_{\{|\hat\beta_p| > \gamma_0\}} \right)' \qquad (4)$$

be the $\ell_0$ estimator that solves (1), and let

$$\hat\beta_{\ell_1}(\gamma_1) = \left( \mathrm{sign}(\hat\beta_1)(|\hat\beta_1| - \gamma_1)_+, \ldots, \mathrm{sign}(\hat\beta_p)(|\hat\beta_p| - \gamma_1)_+ \right)' \qquad (5)$$

be the $\ell_1$ solution to (2), where the $\hat\beta_i$ are the least squares estimates.
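In the orthogonal case, then, both estimators reduce to componentwise thresholding of the least squares estimates, as the following short sketch makes plain (our illustration; the threshold and example values are arbitrary assumptions):

```python
# Hard thresholding (the l0 estimator (4)) vs. soft thresholding
# (the l1 estimator (5)) applied to least squares estimates beta_hat.
import numpy as np

def hard_threshold(beta_hat, gamma0):
    """Keep coefficients whose magnitude exceeds gamma0; no shrinkage."""
    return np.where(np.abs(beta_hat) > gamma0, beta_hat, 0.0)

def soft_threshold(beta_hat, gamma1):
    """Shrink every surviving coefficient toward zero by gamma1."""
    return np.sign(beta_hat) * np.maximum(np.abs(beta_hat) - gamma1, 0.0)

beta_hat = np.array([5.0, 1.2, -0.3, 0.1])
print(hard_threshold(beta_hat, 1.0))   # [5.  1.2 0.  0. ]  -- survivors unbiased
print(soft_threshold(beta_hat, 1.0))   # [4.  0.2 0.  0. ]  -- survivors shrunk by 1
```

The key difference is visible in the output: hard thresholding keeps surviving coefficients unbiased, while soft thresholding shrinks every survivor, which is the source of the bias discussed below.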

We have the following theorems.

Theorem 1. For any $\gamma_1 \geq 0$, there exist constants $C_1 > 0$ and $C_2 > 0$ such that

$$\inf_{\gamma_0} \sup_{\beta} \frac{R(\beta, \hat\beta_{\ell_0}(\gamma_0))}{R(\beta, \hat\beta_{\ell_1}(\gamma_1))} \leq 1 + C_1 \gamma_1 e^{-C_2 \gamma_1}. \qquad (6)$$

This theorem implies that the worst-case $\ell_0$ solution performs almost as well as the best $\ell_1$ solution, in both the saturated ($\gamma_1 \approx 0$) and sparse ($\gamma_1 \gg 0$) regimes.
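The risk comparison behind these theorems can be explored numerically. The Monte Carlo sketch below (our illustration, not the paper's experiment; the sparsity level, signal strength, and thresholds are all assumptions) estimates the predictive risk (3) of hard and soft thresholding in the orthogonal sequence model, where $\hat\beta_i \sim N(\beta_i, 1)$ and the risk reduces to $E\|\hat\beta - \beta\|^2$:

```python
# Monte Carlo estimate of the predictive risk (3) for hard (l0) and
# soft (l1) thresholding with an orthonormal design: beta_hat ~ N(beta, I).
# Illustrative sketch; sparsity, signal size, and thresholds are assumptions.
import numpy as np

rng = np.random.default_rng(1)
p, reps = 1000, 200
beta = np.zeros(p)
beta[:10] = 6.0                              # sparse, detectable signal

gamma0 = np.sqrt(2 * np.log(p))              # optimal l0 threshold
for gamma1 in (0.5, gamma0):
    risk_hard = risk_soft = 0.0
    for _ in range(reps):
        bh = beta + rng.standard_normal(p)   # least squares estimates
        hard = np.where(np.abs(bh) > gamma0, bh, 0.0)
        soft = np.sign(bh) * np.maximum(np.abs(bh) - gamma1, 0.0)
        risk_hard += np.sum((hard - beta) ** 2)
        risk_soft += np.sum((soft - beta) ** 2)
    print(f"gamma1={gamma1:.2f}: hard risk {risk_hard / reps:.1f}, "
          f"soft risk {risk_soft / reps:.1f}")
```

A small $\gamma_1$ lets noise coordinates through; a large $\gamma_1$ biases the true signals. Hard thresholding at $\gamma_0$ avoids both costs when the signal is detectable.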

[Figure 1. $\inf_{\gamma_0} \sup_{\beta} R(\beta, \hat\beta(\gamma_0)) / R(\beta, \hat\beta(\gamma_1))$ tends to 1 as $\gamma_1 \to 0$ or $\infty$; in particular, the supremum over $\gamma_1$ is bounded.]

On the contrary:

Theorem 2. For any $\gamma_0$, there exist constants $C_3 > 0$ and $r > 0$ such that

$$\inf_{\gamma_1} \sup_{\beta} \frac{R(\beta, \hat\beta_{\ell_1}(\gamma_1))}{R(\beta, \hat\beta_{\ell_0}(\gamma_0))} \geq 1 + C_3 \gamma_0^{r}. \qquad (7)$$

This bound diverges as $\gamma_0 \to \infty$; in other words, when the system is extremely sparse, the $\ell_1$ solution does a terrible job. Recall that $\gamma_0 = \sqrt{2\log p}$ is optimal, in which case the minimax ratio above blows up when $p$ is very large. The reason is that $\hat\beta_{\ell_1}$ is a biased estimate of $\beta$ and is shrunk heavily toward zero when the system is sparse, as shown in Figure 3.

[Figure 2. $\inf_{\gamma_1} \sup_{\beta} R(\beta, \hat\beta(\gamma_1)) / R(\beta, \hat\beta(\gamma_0))$ tends to $\infty$ as $\gamma_0 \to \infty$.]

[Figure 3. The true $\beta$ is 1, while $\hat\beta_{\ell_1}$ is always shrunk by at least 20%.]

These results show that the $\ell_0$ solution is less risky and more predictive. Empirical results also show that $\ell_0$ is substantially better than $\ell_1$ when there are detectable signals (George & Foster, 2000; Foster & Stine, 2004; Zhou et al., 2006).

Second, $\ell_0$ controls FDR better. The False Discovery Rate (FDR) (Benjamini & Hochberg, 1995) is defined as $E[V/R \mid R > 0]\,P(R > 0)$, where $R$ is the total number of discoveries and $V$ is the number of false discoveries among them.
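As a small illustration of how the empirical FDR $\hat V / \hat R$ discussed below can be computed for any selection procedure, here is a minimal helper (ours; the variable names and the zero-discoveries convention are assumptions):

```python
# Empirical FDR of a selection: V/R, where R = |selected features| and
# V = |selected features that are truly zero|. Illustrative helper only.
import numpy as np

def empirical_fdr(selected, true_support):
    """selected, true_support: index collections of chosen / truly nonzero features."""
    R = len(selected)
    if R == 0:
        return 0.0                       # convention: no discoveries, no false ones
    V = len(set(selected) - set(true_support))
    return V / R

true_support = np.arange(10)             # features 0..9 are truly nonzero
selected = np.array([0, 1, 2, 3, 42])    # four true discoveries, one false
print(empirical_fdr(selected, true_support))   # 0.2
```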

Abramovich et al. (2006) show that an FDR-penalized procedure is adaptively optimal in any $\ell_p$ ball, $0 \leq p \leq 2$. For small $\|\beta\|_{\ell_0}$, this penalty has the flavor of an $\ell_0$ penalty, and the solution is indeed a variable hard-threshold rule. Hence, in some sense, when a sparse solution is preferred, hard thresholding, i.e. the $\ell_0$ solution, surpasses other solutions. We also compared the empirical FDR $\hat V / \hat R$, and forward stepwise regression does a better job of controlling FDR than Lasso does.

Third, $\ell_0$-based stepwise regression provides sparser solutions. Compared to $\ell_0$-regularization, $\ell_1$ does not always provide the sparsest possible solution. It is easy to construct an example in which $\ell_1$ picks a solution with a smaller $\ell_1$ norm but many more nonzero coefficients (Candes et al., 2007). However, NP-hardness makes the exact $\ell_0$ problem intractable: Natarajan (1995) reduced the known NP-hard problem of "exact cover by 3-sets" to the best subset selection problem.



It is then fair to ask which comes closer to solving this type of problem: a greedy approximation to the $\ell_0$ problem, or an exact solution to the $\ell_1$ problem? It turns out that forward stepwise regression yields not only sparser but also more accurate results than Lasso does.
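For concreteness, here is a minimal sketch of greedy forward stepwise regression (our code, not the paper's; the RIC-style stopping rule with $\Pi = 2\log p$ is our assumption): at each step, add the feature that most reduces the residual sum of squares, and stop when the penalized cost (1) no longer improves.

```python
# Greedy forward stepwise regression: a surrogate for the l0 problem (1).
# Minimal sketch; the stopping rule based on (1) is our assumption.
import numpy as np

def forward_stepwise(X, y, sigma2):
    n, p = X.shape
    penalty = 2 * np.log(p) * sigma2        # Pi * sigma^2 with Pi = 2 log p
    selected = []
    cost = np.sum(y ** 2)                    # RSS of the empty model
    while len(selected) < min(n, p):
        best_j, best_rss = None, None
        for j in range(p):                   # try adding each remaining feature
            if j in selected:
                continue
            cols = selected + [j]
            b, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ b) ** 2)
            if best_rss is None or rss < best_rss:
                best_j, best_rss = j, rss
        new_cost = best_rss + penalty * (len(selected) + 1)
        if new_cost >= cost:                 # penalized cost stopped improving
            break
        selected.append(best_j)
        cost = new_cost
    return selected

rng = np.random.default_rng(2)
n, p = 100, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 5.0
y = X @ beta + rng.standard_normal(n)
print("selected features:", forward_stepwise(X, y, sigma2=1.0))
```

Each step costs one least squares fit per candidate feature, so the whole procedure is polynomial time, in sharp contrast to the exponential cost of exact best subset selection.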

Conclusion

An approximate solution to the correct problem is better than an exact solution to the wrong problem.

References

Abramovich, F., Benjamini, Y., Donoho, D. L., & Johnstone, I. M. (2006). Adapting to unknown sparsity by controlling the false discovery rate. Ann. Statist., 34, 584–653.

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B, 57, 289–300.

Candes, E., & Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist., 35.

Candes, E. J., Wakin, M. B., & Boyd, S. P. (2007). Enhancing sparsity by reweighted l1 minimization.

Donoho, D. L., & Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81, 425–455.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression (with discussion). Ann. Statist., 32.

Foster, D. P., & George, E. I. (1994). The risk inflation criterion for multiple regression. Ann. Statist., 22, 1947–1975.

Foster, D. P., & Stine, R. A. (2004). Variable selection in data mining: building a predictive model for bankruptcy. J. Amer. Statistical Assoc., 99, 303–313.

George, E. I., & Foster, D. P. (2000). Calibration and empirical Bayes variable selection. Biometrika, 87, 731–747.

Natarajan, B. (1995). Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24, 227.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 58, 267–288.

Zhou, J., Foster, D. P., Stine, R. A., & Ungar, L. H. (2006). Streamwise feature selection. J. Machine Learning Research, 7, 1861–1885.
