Similarity-Based Theoretical Foundation for Sparse Parzen Window Prediction

Maria-Florina Balcan (ninamf@cs.cmu.edu)
Avrim Blum (avrim@cs.cmu.edu)
Computer Science Department, Carnegie Mellon University

Nathan Srebro ([email protected])
Toyota Technological Institute at Chicago, 1427 East 60th Street, Chicago IL 60637, USA

Extended Abstract

1. Sparse Parzen Window Prediction

We are concerned here with predictors of the form:

    f(x) = Σ_{i=1}^n α_i K(x, x'_i)        (1)

where x'_1, ..., x'_n ∈ X are landmark instances in our instance space X and K : X × X → R is a function encoding the relationship between instances. Predictors of this type are fairly common and natural, and are obtained by various learning rules. Support Vector Machines (SVMs) learn a predictor of the form (1) (the binary labels y'_i are often included explicitly in the predictor, but can also be encoded in the sign of α_i) by minimizing an objective related to a dual large-margin problem. SVMs enjoy performance guarantees based on interpreting K(·,·) as an inner product in an implicit Hilbert space, and also tend to yield sparse predictors, i.e. ones with few non-zero α_i's. Parzen window prediction (aka soft nearest neighbor prediction) corresponds to a predictor of the form (1) with α_i = y'_i. There is no need to interpret K as an inner product, nor to require that it be positive semidefinite; we can simply think of K as specifying similarity. Still thinking of K as encoding similarity, and perhaps also dissimilarity, we might prefer to learn a sparse predictor, with α_i = 0 for many landmarks x'_i as in SVMs, instead of simply fixing α_i = y'_i. A more direct way of doing so is to minimize a loss (here the hinge loss) subject to an explicit constraint on the L1-norm of the coefficients α_i:

    minimize   Σ_{i=1}^m [1 − y_i f(x_i)]_+
    s.t.       Σ_{j=1}^n |α_j| ≤ M        (2)

where (x_i, y_i) ∈ X × {±1} are labeled training examples, which might or might not be the same as the landmarks x'_i (note that unlabeled examples can also be used as landmarks), and [1 − z]_+ = max(1 − z, 0) is the hinge loss. This is a linear program and can be solved efficiently. We view K as a similarity function and provide a natural condition on K that does not require K to be positive semidefinite and that justifies the learning rule (2). Our condition guarantees the success of learning rule (2) and provides bounds on the required number of landmarks and training examples. Furthermore, we show that any similarity function that is good as a kernel, i.e. one that can ensure successful SVM learning, also satisfies our condition and can thus also ensure learning using the learning rule (2) (though possibly with some deterioration of the learning guarantees). These arguments can be used to justify (2) as an alternative to SVMs.
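Since (2) is a linear program, it can be handed almost directly to an off-the-shelf LP solver. The following sketch is only an illustration of this reduction, not the authors' implementation: it assumes a precomputed similarity matrix K_train, splits α into nonnegative parts α⁺ − α⁻, adds one slack variable per training example for the hinge loss, and solves the resulting LP with scipy.optimize.linprog (the function name sparse_parzen_lp is ours).

import numpy as np
from scipy.optimize import linprog

def sparse_parzen_lp(K_train, y, M):
    """Solve learning rule (2) as a linear program.
    K_train: (m, n) array with K_train[i, j] = K(x_i, x'_j);
    y: (m,) labels in {-1, +1};  M: bound on the L1 norm of alpha."""
    m, n = K_train.shape
    # Variables, in order: alpha_plus (n), alpha_minus (n), slacks xi (m); all >= 0.
    c = np.concatenate([np.zeros(2 * n), np.ones(m)])    # objective: sum_i xi_i
    # Hinge constraints 1 - y_i f(x_i) <= xi_i, rewritten as
    # -y_i K_i . (alpha_plus - alpha_minus) - xi_i <= -1.
    yK = y[:, None] * K_train
    A_hinge = np.hstack([-yK, yK, -np.eye(m)])
    b_hinge = -np.ones(m)
    # L1 constraint: sum_j (alpha_plus_j + alpha_minus_j) <= M.
    A_l1 = np.concatenate([np.ones(2 * n), np.zeros(m)])[None, :]
    b_l1 = np.array([M])
    res = linprog(c, A_ub=np.vstack([A_hinge, A_l1]),
                  b_ub=np.concatenate([b_hinge, b_l1]),
                  bounds=(0, None), method="highs")
    alpha = res.x[:n] - res.x[n:2 * n]
    return alpha    # predictor: f(x) = sum_j alpha_j K(x, x'_j)

Because a linear program attains its optimum at a vertex of the feasible polytope, many coordinates of the returned α are typically zero, which is exactly the sparsity the L1 constraint is meant to encourage.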

2. Prior Work

The learning rule (2), usually with the same set of points serving both as training examples and as landmarks, and variations of this rule, have been suggested by various authors and are fairly common in practice. Such a learning rule is typically discussed as an alternative to SVMs: Tipping (2001) suggested the Relevance Vector Machine (RVM) as a Bayesian alternative to SVMs. The MAP estimate of the RVM is given by an optimization problem similar to (2), though with a loss function different from the hinge loss (the hinge loss cannot be obtained as a log-likelihood). Similarly, Singer (2000) suggests Norm-Penalized Leveraging Procedures as a boosting-like approach that mimics SVMs. Again, although the specific loss functions

studied by Singer are different from the hinge loss, the method (with a norm exponent of 1, as in Singer's experiments) otherwise corresponds to a coordinate-descent minimization of (2). Other authors do use the hinge loss and discuss the learning rule (2) as given here, with the express intent of achieving sparsity more directly by minimizing the L1 norm of the coefficients (Bennett & Campbell, 2000; Roth, 2001; Guigue et al., 2005). Despite the interest in the learning rule (2), none of the above works provides learning guarantees. In the case of SVMs, we have an established theory ensuring that when K is positive semidefinite and is a "good kernel" for the learning problem (i.e. corresponds to an implicit Hilbert space in which the problem is mostly separable with large margin), the SVM learning rule is guaranteed to find a predictor of the form (1) with small generalization error. However, to the best of our knowledge, no such theory has previously been suggested for the learning rule (2). Even when the SVM pre-conditions hold, and the SVM learning rule would work, we do not know of a previous guarantee for the alternative learning rule (2). Furthermore, since the learning rule (2) neither requires K to be positive semidefinite nor refers to an implied Hilbert space, one might hope for a more direct condition on K that does not require it to be positive semidefinite and that is sufficient to guarantee the success of the learning rule (2). In fact, in order to enjoy the SVM guarantees while using L1 regularization to obtain sparsity, some authors suggest regularizing both the L1 norm ‖α‖_1 of the coefficient vector α (as in (2)) and the norm ‖β‖ of the corresponding predictor β = Σ_j α_j φ(x'_j) in the Hilbert space implied by K, where K(x, x') = ⟨φ(x), φ(x')⟩, as when using an SVM with K as a kernel (Osuna & Girosi, 1999; Gunn & Kandola, 2002).

3. Our Guarantees

We consider learning problems specified by a joint distribution P over labeled examples (x, y). We consider learning a predictor based on both labeled examples drawn from this distribution and unlabeled examples drawn from the marginal over x. Our goal is to obtain a predictor with low expected error with respect to P.

Our main condition for a similarity function K is summarized in the following definition:

Definition 1  A similarity function K is an (ε, γ, τ)-good similarity function for a learning problem P if there exists a (probabilistic) set R of "reasonable points" (one may think of R as a random indicator function) such that the following conditions hold:

1. We have

       E_{(x,y)∼P} [ [1 − y g(x)/γ]_+ ] ≤ ε,        (3)

   where g(x) = E_{(x',y',R(x'))} [ y' K(x, x') | R(x') ].

2. Pr_{x'} [R(x')] ≥ τ.

That is, we require that at least a τ fraction of the points are "reasonable" (in expectation), and that most points can be predicted according to the reasonable points that are similar, or dissimilar, to them (or rather, that the expected hinge loss of this prediction is low). If a similarity function is good under Definition 1, then we can guarantee that there is a predictor f(x) of the form (1) with low L1-norm |α|_1 achieving low expected hinge loss. This in turn yields a learning guarantee for the learning rule (2):

Theorem 1  Let K be an (ε, γ, τ)-good similarity function for a learning problem P. For any δ, ε_1 > 0, let x'_1, ..., x'_n be a (potentially unlabeled) sample of

    n = (2/τ) ( log(2/δ) + 16 log(2/δ) / (ε_1^2 γ^2) )

landmarks drawn from P. Then with probability at least 1 − δ, there exists a predictor of the form (1) with

    |α|_1 = Σ_{i=1}^n |α_i| ≤ 1/γ

and expected hinge loss

    E_{(x,y)∼P} [ [1 − y f(x)]_+ ] ≤ ε + ε_1.

Corollary 1  Let K be an (ε, γ, τ)-good similarity function for a learning problem P. For any δ, ε_1 > 0, with probability at least 1 − δ the predictor obtained from learning rule (2), with

    n = O( log(1/δ) / (τ γ^2 ε_1^2) )

(unlabeled) landmarks and

    m = Õ( log(n) log(1/δ) / (γ^2 ε_1^2) )

labeled training examples, has expected hinge loss at most ε + ε_1.
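To make Definition 1 concrete, here is a small empirical sketch (ours, with assumed names such as check_goodness and reasonable_mask, not part of the paper) that estimates the two quantities in the definition from a finite sample: the fraction of reasonable points and the expected hinge loss of the prediction g(x)/γ formed from them.

import numpy as np

def check_goodness(K, X, y, reasonable_mask, gamma):
    """Estimate the two quantities of Definition 1 on a finite sample.
    K: similarity function K(x, x') -> float (need not be PSD);
    X: (m, d) array of points;  y: (m,) labels in {-1, +1};
    reasonable_mask: (m,) boolean array marking the candidate set R."""
    tau_hat = reasonable_mask.mean()                     # estimate of Pr[R(x')]
    R_X, R_y = X[reasonable_mask], y[reasonable_mask]
    losses = []
    for x_i, y_i in zip(X, y):
        # g(x_i): average of y' K(x_i, x') over the reasonable points x'.
        g = np.mean([y_r * K(x_i, x_r) for x_r, y_r in zip(R_X, R_y)])
        losses.append(max(0.0, 1.0 - y_i * g / gamma))   # hinge loss at margin gamma
    return tau_hat, float(np.mean(losses))               # (tau estimate, epsilon estimate)

# Example with an assumed Gaussian similarity, bounded in [0, 1]:
# K = lambda a, b: np.exp(-np.linalg.norm(a - b) ** 2)
# tau_hat, eps_hat = check_goodness(K, X, y, reasonable_mask, gamma=0.1)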

As discussed earlier, we also establish that if a similarity function is positive semidefinite and "good" in the traditional kernel sense, then it also satisfies Definition 1, yielding a learning guarantee for the learning rule (2). Recall that a function K : X × X → R is positive semidefinite iff there exists a mapping φ : X → H into a Hilbert space H such that K(x, x') = ⟨φ(x), φ(x')⟩. With this representation of K in mind:

Definition 2  We say that a positive semidefinite K is an (ε, γ)-good kernel if there exists a vector β ∈ H, ‖β‖ ≤ 1/γ, such that

    E_{(x,y)∼P} [ [1 − y ⟨β, φ(x)⟩]_+ ] ≤ ε.

Theorem 2  If a positive semidefinite K is an (ε_0, γ)-good kernel in hinge loss for a learning problem P (with deterministic labels), then for any ε_1 > 0 there exists c > 1 such that K is also an (ε_0 + ε_1, cγ^2/(1 + ε_0/(2ε_1)), (2ε_1 + ε_0)/c)-good similarity function in hinge loss.

Corollary 2  Let K be a positive semidefinite (ε, γ)-good kernel for a learning problem P. For any δ, ε_1 > 0, with probability at least 1 − δ the predictor obtained from learning rule (2), with

    n = O( (1 + ε/ε_1)^2 log(1/δ) / ((ε + ε_1) γ^4 ε_1^2) )

(unlabeled) landmarks and

    m = Õ( (1 + ε/ε_1)^2 log(n) log(1/δ) / (γ^4 ε_1^2) )

labeled training examples, has expected hinge loss at most ε + ε_1.

Note that if ε_1 = Ω(ε), e.g. if we aim for a fixed percentage increase over the "optimal" error, or in the noiseless case ε = 0, the sample sizes simplify to

    n = O( log(1/δ) / (γ^4 ε_1^3) )   and   m = Õ( log(n) log(1/δ) / (γ^4 ε_1^2) ).
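As a rough illustration of how the translation in Theorem 2 affects the bounds, the sketch below evaluates the sample-size expressions of Corollaries 1 and 2 with the constants hidden by the O(·) and Õ(·) notation dropped; the function names and the example numbers are ours and only the relative scalings are meaningful.

import math

def corollary1_sizes(gamma, tau, eps1, delta):
    # Corollary 1: n = O(log(1/delta) / (tau gamma^2 eps1^2)),
    #              m = O~(log n log(1/delta) / (gamma^2 eps1^2)); constants dropped.
    n = math.log(1 / delta) / (tau * gamma ** 2 * eps1 ** 2)
    m = math.log(max(n, 2)) * math.log(1 / delta) / (gamma ** 2 * eps1 ** 2)
    return n, m

def corollary2_sizes(eps, gamma, eps1, delta):
    # Corollary 2: same quantities after the kernel-to-similarity translation of Theorem 2.
    blowup = (1 + eps / eps1) ** 2
    n = blowup * math.log(1 / delta) / ((eps + eps1) * gamma ** 4 * eps1 ** 2)
    m = blowup * math.log(max(n, 2)) * math.log(1 / delta) / (gamma ** 4 * eps1 ** 2)
    return n, m

# For example, with eps = 0.01, gamma = 0.1, tau = 0.05, eps1 = 0.05, delta = 0.01,
# the Corollary 2 scalings exceed those of Corollary 1 by roughly a factor of
# 1/gamma^2 (up to small additional factors).
print(corollary1_sizes(0.1, 0.05, 0.05, 0.01))
print(corollary2_sizes(0.01, 0.1, 0.05, 0.01))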

Proofs of these theorems (in slightly different forms) appear in (Balcan et al., 2008), which focuses on generalizing the theory of learning with kernels to broader classes of pairwise similarity functions. Here, our focus is on how this extension can be used to provide formal guarantees for the common sparsity-inducing learning rule given in equation (2).

References

Balcan, M., Blum, A., & Srebro, N. (2008). Improved guarantees for learning via similarity functions. COLT.

Bennett, K. P., & Campbell, C. (2000). Support vector machines: Hype or hallelujah? SIGKDD Explor. Newsl., 2, 1–13.

Guigue, V., Rakotomamonjy, A., & Canu, S. (2005). Kernel basis pursuit. Proceedings of the 16th European Conference on Machine Learning (ECML'05). Springer.

Gunn, S. R., & Kandola, J. S. (2002). Structural modelling with sparse kernels. Mach. Learn., 48, 137–163.

Osuna, E. E., & Girosi, F. (1999). Reducing the run-time complexity in support vector machines. In Advances in Kernel Methods: Support Vector Learning (pp. 271–283). Cambridge, MA: MIT Press.

Roth, V. (2001). Sparse kernel regressors. ICANN '01: Proceedings of the International Conference on Artificial Neural Networks (pp. 339–346). London, UK: Springer-Verlag.

Singer, Y. (2000). Leveraged vector machines. Advances in Neural Information Processing Systems 12.

Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res., 1, 211–244.
