Consistency of trace norm minimization
Francis Bach
Willow project, INRIA - Ecole Normale Supérieure, Paris

Journal of Machine Learning Research, 2008, to appear

Trace norm minimization - Summary
• Consider learning a linear predictor on rectangular matrices $M \in \mathbb{R}^{p \times q}$
• Loading matrix $W \in \mathbb{R}^{p \times q}$, and prediction $\operatorname{tr} W^\top M$
• Assumption of a low-rank loading matrix:
  – Matrix completion (Srebro et al., 2004)
  – Collaborative filtering (Srebro et al., 2004; Abernethy et al., 2006)
  – Multi-task learning (Argyriou et al., 2006; Obozinski et al., 2007)
• Equivalent of the $\ell_1$-norm: trace norm = sum of the singular values
• Do we actually get low-rank solutions?
  – Necessary and sufficient consistency conditions
  – Extension of the Lasso / group Lasso results

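Trace norm - optimization problem
• $n$ observations $(M_i, z_i)$, $i = 1, \dots, n$, where $z_i \in \mathbb{R}$ and $M_i \in \mathbb{R}^{p \times q}$
• Optimization problem on $W \in \mathbb{R}^{p \times q}$:

$$\min_{W \in \mathbb{R}^{p \times q}} \; \frac{1}{2n} \sum_{i=1}^{n} \left( z_i - \operatorname{tr} W^\top M_i \right)^2 + \lambda_n \|W\|_*$$

• $\|W\|_*$ = trace norm of $W$
  – sum of the singular values
  – convex envelope of the rank (Fazel et al., 2001)
  – solutions $W \in \mathbb{R}^{p \times q}$ are rank-deficient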
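As a minimal sketch of the objective above, the following NumPy code evaluates the trace norm and the regularized least-squares cost. The function names `trace_norm` and `objective`, and the layout of the observations as an array of shape `(n, p, q)`, are assumptions made for illustration and do not come from the slides.

```python
import numpy as np

def trace_norm(W):
    """Trace norm of W: the sum of its singular values."""
    return np.linalg.svd(W, compute_uv=False).sum()

def objective(W, Ms, z, lam):
    """(1/2n) * sum_i (z_i - tr(W^T M_i))^2 + lam * ||W||_*.

    Ms has shape (n, p, q) and z has shape (n,).
    """
    preds = np.einsum("pq,npq->n", W, Ms)  # tr(W^T M_i) = sum over entries of W * M_i
    return 0.5 * np.mean((z - preds) ** 2) + lam * trace_norm(W)

rng = np.random.default_rng(0)
W = np.outer([1.0, 2.0, 2.0], [3.0, 4.0])   # rank one, single nonzero singular value 3 * 5
Ms = rng.standard_normal((5, 3, 2))
z = np.einsum("pq,npq->n", W, Ms)           # noise-free responses, so the data-fit term is zero at W
print(trace_norm(W), objective(W, Ms, z, lam=0.1))
```

With noise-free responses the data-fit term vanishes at `W`, so the objective reduces to `lam` times the trace norm.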

Optimization algorithms
• Convex non-smooth optimization problem:

$$\min_{W \in \mathbb{R}^{p \times q}} \; \frac{1}{2n} \sum_{i=1}^{n} \left( z_i - \operatorname{tr} W^\top M_i \right)^2 + \lambda_n \|W\|_*$$

• Can be cast as a semidefinite programming problem (Fazel et al., 2001; Srebro et al., 2005)
• Solution expected to have low rank
  – Can be optimized efficiently on the space of low-rank matrices (Abernethy et al., 2006; Lu et al., 2008)
• Optimization based on smoothing the trace norm using spectral functions
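The slides do not spell out an algorithm in code. As one illustration only, the sketch below uses proximal gradient descent with singular value soft-thresholding, a standard method for this objective that is swapped in here and is not the smoothing approach mentioned above. The names `svt` and `prox_grad`, the step size rule, and the iteration count are assumptions of this sketch.

```python
import numpy as np

def svt(W, tau):
    """Proximal operator of tau * trace norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def prox_grad(Ms, z, lam, n_iter=500):
    """Proximal gradient on (1/2n) sum_i (z_i - tr(W^T M_i))^2 + lam * ||W||_*."""
    n, p, q = Ms.shape
    X = Ms.reshape(n, p * q)                # row i is M_i flattened in row-major order
    step = n / np.linalg.norm(X, 2) ** 2    # inverse Lipschitz constant of the smooth part
    W = np.zeros((p, q))
    for _ in range(n_iter):
        resid = X @ W.ravel() - z           # tr(W^T M_i) - z_i for every i
        grad = (X.T @ resid).reshape(p, q) / n
        W = svt(W - step * grad, step * lam)
    return W

rng = np.random.default_rng(0)
n, p, q = 200, 4, 3
W_true = np.outer(rng.standard_normal(p), rng.standard_normal(q))  # rank-one loading matrix
Ms = rng.standard_normal((n, p, q))
z = np.einsum("pq,npq->n", W_true, Ms) + 0.1 * rng.standard_normal(n)
W_hat = prox_grad(Ms, z, lam=0.1)
print(np.linalg.svd(W_hat, compute_uv=False))  # inspect how many singular values survive
```

The thresholding step is what produces exactly zero singular values, which is why this family of methods returns rank-deficient iterates for large enough regularization.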

Special cases
• Lasso: if $x_i \in \mathbb{R}^m$, define $M_i = \operatorname{Diag}(x_i) \in \mathbb{R}^{m \times m}$
  – Solution $W$ is diagonal
  – Trace norm = $\ell_1$-norm of the diagonal
  – Extension of known results
• Group Lasso: if $x_{ij} \in \mathbb{R}^{d_j}$ for $j = 1, \dots, m$, $i = 1, \dots, n$, define $M_i \in \mathbb{R}^{(\sum_{j=1}^{m} d_j) \times m}$ as the block-diagonal matrix (with non-square blocks) with diagonal blocks $x_{ij}$, $j = 1, \dots, m$
  – Solution $W$ is block-diagonal
  – Trace norm = block $\ell_1$-norm of the diagonal
• NB: new results are extensions of results for the Lasso (Yuan and Lin, 2007; Zhao and Yu, 2006; Zou, 2006) and the group Lasso (Bach, 2008)
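The two reductions above can be checked numerically on a small example. The numbers below are arbitrary illustrative values; the check shows that for a diagonal loading matrix the prediction is an ordinary inner product and the trace norm is the $\ell_1$-norm, and that for a block-diagonal loading matrix the trace norm is the sum of the block Euclidean norms.

```python
import numpy as np

def trace_norm(W):
    return np.linalg.svd(W, compute_uv=False).sum()

# Lasso case: M = Diag(x), W = Diag(w)
x = np.array([1.0, -2.0, 0.5])
w = np.array([0.0, 3.0, -4.0])
M, W = np.diag(x), np.diag(w)
lasso_pred_matrix = np.trace(W.T @ M)   # tr(W^T M)
lasso_pred_vector = w @ x               # usual linear prediction
lasso_trace_norm = trace_norm(W)
lasso_l1 = np.abs(w).sum()

# Group Lasso case: two groups with d_1 = 2 and d_2 = 1, so matrices are 3 x 2
x1, x2 = np.array([1.0, 2.0]), np.array([5.0])
w1, w2 = np.array([3.0, 4.0]), np.array([-2.0])
Mg = np.zeros((3, 2)); Mg[:2, 0] = x1; Mg[2:, 1] = x2   # block-diagonal, non-square blocks
Wg = np.zeros((3, 2)); Wg[:2, 0] = w1; Wg[2:, 1] = w2
group_pred_matrix = np.trace(Wg.T @ Mg)
group_pred_vector = w1 @ x1 + w2 @ x2
group_trace_norm = trace_norm(Wg)
group_block_l1 = np.linalg.norm(w1) + np.linalg.norm(w2)

print(lasso_trace_norm, lasso_l1, group_trace_norm, group_block_l1)
```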

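Asymptotic analysis - Assumptions
• Notation: $\hat{\Sigma}_{mm} = \frac{1}{n} \sum_{i=1}^{n} \operatorname{vec}(M_i) \operatorname{vec}(M_i)^\top \in \mathbb{R}^{pq \times pq}$
• (A1) Given $M_i$, $i = 1, \dots, n$, the $n$ values $z_i$ are i.i.d. and there exists $\mathbf{W} \in \mathbb{R}^{p \times q}$ such that for all $i$, $E(z_i \mid M_1, \dots, M_n) = \operatorname{tr} \mathbf{W}^\top M_i$ and $\operatorname{var}(z_i \mid M_1, \dots, M_n)$ is a strictly positive constant $\sigma^2$. $\mathbf{W}$ is not equal to zero and does not have full rank.
• (A2) There exists an invertible matrix $\Sigma_{mm} \in \mathbb{R}^{pq \times pq}$ such that $E \|\hat{\Sigma}_{mm} - \Sigma_{mm}\|_F^2 = O(\zeta_n^2)$, where $\zeta_n$ tends to zero.
• (A3) The random variable $n^{-1/2} \sum_{i=1}^{n} \varepsilon_i \operatorname{vec}(M_i)$ converges in distribution to a normal distribution with mean zero and covariance matrix $\sigma^2 \Sigma_{mm}$, with $\varepsilon_i = z_i - \operatorname{tr} \mathbf{W}^\top M_i$.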
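To make the notation concrete, here is a minimal sketch that forms the empirical second-moment matrix of the vectorized observations. The helper names `vec` and `sigma_mm_hat`, the column-major convention for `vec`, and the choice of standard normal entries (for which the population matrix is the identity) are assumptions of this example.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return M.reshape(-1, order="F")

def sigma_mm_hat(Ms):
    """(1/n) * sum_i vec(M_i) vec(M_i)^T for Ms of shape (n, p, q)."""
    V = np.stack([vec(M) for M in Ms])   # shape (n, p * q)
    return V.T @ V / len(Ms)

rng = np.random.default_rng(0)
n, p, q = 2000, 3, 2
Ms = rng.standard_normal((n, p, q))      # i.i.d. standard normal entries: population matrix is I
S_hat = sigma_mm_hat(Ms)
err = np.linalg.norm(S_hat - np.eye(p * q), "fro")
print(S_hat.shape, err)                  # the Frobenius error shrinks as n grows
```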

Sufficient conditions for assumptions
• i.i.d. assumption
  – If the matrices $M_i$ are sampled i.i.d., $z$ and $M$ have finite fourth-order moments, and $E \operatorname{vec}(M) \operatorname{vec}(M)^\top$ is invertible, then (A2) and (A3) are satisfied with $\zeta_n = n^{-1/2}$.
• Collaborative filtering
  – $n_x$ values $\tilde{x}_1, \dots, \tilde{x}_{n_x} \in \mathbb{R}^p$ sampled i.i.d.
  – $n_y$ values $\tilde{y}_1, \dots, \tilde{y}_{n_y} \in \mathbb{R}^q$ sampled i.i.d.
  – Distributions with finite fourth-order moments and invertible second-order moment matrices $\Sigma_{xx}$ and $\Sigma_{yy}$
  – Random subset of size $n$ of pairs $(i_k, j_k)$ in $\{1, \dots, n_x\} \times \{1, \dots, n_y\}$ sampled uniformly, with observation $M_k = \tilde{x}_{i_k} \tilde{y}_{j_k}^\top$
  – (A2) and (A3) are satisfied with $\Sigma_{mm} = \Sigma_{yy} \otimes \Sigma_{xx}$ and $\zeta_n = n^{-1/2} + n_x^{-1/2} + n_y^{-1/2}$.
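The Kronecker structure in the collaborative filtering setting can be illustrated with a short simulation. The sketch below samples rank-one observations from a random subset of pairs and checks two exact identities: $\operatorname{vec}(x y^\top) = y \otimes x$ (with column-major `vec`), and the factorization of the second-moment matrix over the full grid of pairs. Sizes, seeds, and variable names are arbitrary assumptions.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(0)
nx, ny, p, q, n = 30, 20, 3, 2, 100
X = rng.standard_normal((nx, p))     # rows play the role of the x-tilde vectors
Y = rng.standard_normal((ny, q))     # rows play the role of the y-tilde vectors

# Random subset of n pairs (i_k, j_k), drawn uniformly without replacement
flat = rng.choice(nx * ny, size=n, replace=False)
i_idx, j_idx = np.unravel_index(flat, (nx, ny))
Ms = np.einsum("kp,kq->kpq", X[i_idx], Y[j_idx])   # M_k = x_{i_k} y_{j_k}^T

# vec(x y^T) equals the Kronecker product y (x) x
same_vec = np.allclose(vec(Ms[0]), np.kron(Y[j_idx[0]], X[i_idx[0]]))

# Over all nx * ny pairs the second moment factorizes exactly
Sxx = X.T @ X / nx
Syy = Y.T @ Y / ny
full = np.stack([np.kron(y, x) for x in X for y in Y])
S_full = full.T @ full / (nx * ny)
factorizes = np.allclose(S_full, np.kron(Syy, Sxx))
print(same_vec, factorizes)
```

On a random subset of pairs the factorization holds only approximately, which is consistent with the extra $n_x^{-1/2} + n_y^{-1/2}$ terms in $\zeta_n$.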

Asymptotic analysis
• Two types of consistency:
  1. regular consistency: $P(\|\hat{W} - \mathbf{W}\| > \varepsilon)$ tends to zero as $n$ tends to infinity, for all $\varepsilon > 0$
  2. rank consistency: $P(\operatorname{rank}(\hat{W}) \neq \operatorname{rank}(\mathbf{W}))$ tends to zero as $n$ tends to infinity
• Note the difference with the Lasso (no notion of "pattern")
• Consistency depends on the decay of the regularization parameter and on a consistency condition.
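The two notions can be separated on a toy example: a small perturbation of a rank-one matrix is close in norm yet has full rank. The helper names and the singular value threshold `tol` used to define a numerical rank are assumptions of this sketch.

```python
import numpy as np

def estimation_error(W_hat, W_true):
    """Frobenius norm of the difference, the quantity behind regular consistency."""
    return np.linalg.norm(W_hat - W_true, "fro")

def numerical_rank(W, tol=1e-8):
    """Number of singular values above tol (tol is an illustrative choice)."""
    s = np.linalg.svd(W, compute_uv=False)
    return int(np.sum(s > tol))

rng = np.random.default_rng(0)
W_true = np.outer([1.0, 2.0, 2.0], [3.0, 4.0])           # rank one
W_hat = W_true + 1e-3 * rng.standard_normal((3, 2))      # small dense perturbation

print(estimation_error(W_hat, W_true))                   # small error
print(numerical_rank(W_true), numerical_rank(W_hat))     # ranks differ
```

An estimator can therefore be consistent in the first sense without recovering the rank, which is what the results on the regularization rate quantify.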

Summary of main results
a) if $\lambda_n$ does not tend to zero, then the trace norm estimate $\hat{W}$ is not consistent;
b) if $\lambda_n$ tends to zero faster than $n^{-1/2}$, then the estimate is consistent and its error is $O_p(n^{-1/2})$, while it is not rank-consistent with probability tending to one;
c) if $\lambda_n$ tends to zero exactly at rate $n^{-1/2}$, then the estimator is consistent with error $O_p(n^{-1/2})$, but the probability of estimating the correct rank converges to a limit in $(0, 1)$;
d) if $\lambda_n$ tends to zero more slowly than $n^{-1/2}$, then the estimate is consistent with error $O_p(\lambda_n)$ and its rank consistency depends on specific consistency conditions.

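$\lambda_n$ tends to zero more slowly than $n^{-1/2}$
• Notations
  – $\mathbf{W} = U \operatorname{Diag}(s) V^\top$ singular value decomposition, with $U \in \mathbb{R}^{p \times r}$, $V \in \mathbb{R}^{q \times r}$, where $r \in (0, \min\{p, q\})$ denotes the rank of $\mathbf{W}$
  – $U_\perp \in \mathbb{R}^{p \times (p-r)}$ and $V_\perp \in \mathbb{R}^{q \times (q-r)}$ any orthogonal complements of $U$ and $V$
  – $\Lambda \in \mathbb{R}^{(p-r) \times (q-r)}$ defined as

$$\operatorname{vec}(\Lambda) = \left( (V_\perp \otimes U_\perp)^\top \Sigma_{mm}^{-1} (V_\perp \otimes U_\perp) \right)^{-1} (V_\perp \otimes U_\perp)^\top \Sigma_{mm}^{-1} (V \otimes U) \operatorname{vec}(I)$$

• Necessary condition for rank consistency: $\|\Lambda\|_2 \leq 1$
• Sufficient condition for rank consistency: $\|\Lambda\|_2 < 1$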
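The matrix $\Lambda$ can be computed directly from its definition. The sketch below assumes a column-major `vec`, takes the true rank `r` as an input, and uses a full SVD to obtain one possible pair of orthogonal complements; the function name `consistency_matrix` is hypothetical. With $\Sigma_{mm} = I$ the cross term vanishes by orthogonality, so $\Lambda = 0$ and the sufficient condition holds.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return M.reshape(-1, order="F")

def consistency_matrix(W, Sigma_mm, r):
    """Lambda from the definition above, for a matrix W of known rank r."""
    p, q = W.shape
    U_full, _, Vt_full = np.linalg.svd(W, full_matrices=True)
    U, U_perp = U_full[:, :r], U_full[:, r:]
    V, V_perp = Vt_full[:r].T, Vt_full[r:].T
    K_perp = np.kron(V_perp, U_perp)          # shape (p*q, (p-r)*(q-r))
    K = np.kron(V, U)                         # shape (p*q, r*r)
    S_inv = np.linalg.inv(Sigma_mm)
    lhs = K_perp.T @ S_inv @ K_perp
    rhs = K_perp.T @ S_inv @ K @ vec(np.eye(r))
    return np.linalg.solve(lhs, rhs).reshape((p - r, q - r), order="F")

rng = np.random.default_rng(0)
p, q, r = 4, 3, 1
W = np.outer(rng.standard_normal(p), rng.standard_normal(q))

Lam_identity = consistency_matrix(W, np.eye(p * q), r)
G = rng.standard_normal((p * q, p * q))
Lam_generic = consistency_matrix(W, G @ G.T + np.eye(p * q), r)

# Spectral norms to compare against the threshold 1
print(np.linalg.norm(Lam_identity, 2), np.linalg.norm(Lam_generic, 2))
```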

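Adaptive version - I
• Adaptive version to provide a consistent algorithm with no consistency conditions
• Least-square estimate $\operatorname{vec}(\hat{W}_{LS}) = \hat{\Sigma}_{mm}^{-1} \operatorname{vec}(\hat{\Sigma}_{Mz})$
  – $n^{1/2} \left( \hat{\Sigma}_{mm}^{-1} \operatorname{vec}(\hat{\Sigma}_{Mz}) - \operatorname{vec}(\mathbf{W}) \right)$ converges in distribution to a normal distribution with zero mean and covariance matrix $\sigma^2 \Sigma_{mm}^{-1}$
• Singular value decomposition of $\hat{W}_{LS} = U_{LS} \operatorname{Diag}(s_{LS}) V_{LS}^\top$
• For $\gamma \in (0, 1]$, define
  $A = U_{LS} \operatorname{Diag}(s_{LS})^{-\gamma} U_{LS}^\top \in \mathbb{R}^{p \times p}$ and $B = V_{LS} \operatorname{Diag}(s_{LS})^{-\gamma} V_{LS}^\top \in \mathbb{R}^{q \times q}$,
  two positive definite symmetric matrices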
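A minimal sketch of the weight construction follows. It computes the least-squares estimate with `np.linalg.lstsq` and builds $A$ and $B$ from a full SVD. When $p \neq q$ the vector of singular values is shorter than one of the two dimensions; padding it with the value `pad` (here $n^{-1/2}$) is an assumption of this sketch, as are the function names.

```python
import numpy as np

def least_squares(Ms, z):
    """Unregularized least-squares estimate of the loading matrix."""
    n, p, q = Ms.shape
    X = Ms.reshape(n, p * q)                      # row-major flattening of each M_i
    w, *_ = np.linalg.lstsq(X, z, rcond=None)
    return w.reshape(p, q)

def adaptive_weights(W_ls, gamma, pad):
    """A = U Diag(s)^(-gamma) U^T and B = V Diag(s)^(-gamma) V^T from a full SVD."""
    p, q = W_ls.shape
    U, s, Vt = np.linalg.svd(W_ls, full_matrices=True)
    s_p = np.concatenate([s, np.full(p - s.size, pad)])   # assumed padding to length p
    s_q = np.concatenate([s, np.full(q - s.size, pad)])   # assumed padding to length q
    A = (U * s_p ** (-gamma)) @ U.T
    B = (Vt.T * s_q ** (-gamma)) @ Vt
    return A, B

rng = np.random.default_rng(0)
n, p, q = 200, 3, 2
W_true = np.outer([1.0, 2.0, 2.0], [3.0, 4.0])            # rank one
Ms = rng.standard_normal((n, p, q))
z = np.einsum("pq,npq->n", W_true, Ms) + 0.1 * rng.standard_normal(n)

W_ls = least_squares(Ms, z)
A, B = adaptive_weights(W_ls, gamma=0.5, pad=n ** -0.5)
print(np.linalg.eigvalsh(A), np.linalg.eigvalsh(B))       # all eigenvalues are positive
```

Directions associated with small singular values of the least-squares estimate receive large weights, so they are penalized more heavily, in the same spirit as the adaptive Lasso.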

Adaptive version - II
• Corresponds to the adaptive Lasso of Zou (2006).
• Consistency theorem: if $\gamma \in (0, 1]$, $n^{1/2} \lambda_n$ tends to 0 and $\lambda_n n^{1/2 + \gamma/2}$ tends to infinity, then any global minimizer $\hat{W}_A$ of

$$\frac{1}{2n} \sum_{i=1}^{n} \left( z_i - \operatorname{tr} W^\top M_i \right)^2 + \lambda_n \|A W B\|_*$$

is consistent and rank consistent. Moreover, $n^{1/2} \operatorname{vec}(\hat{W}_A - \mathbf{W})$ converges in distribution to a normal distribution with mean zero and covariance matrix

$$\sigma^2 (V \otimes U) \left[ (V \otimes U)^\top \Sigma_{mm} (V \otimes U) \right]^{-1} (V \otimes U)^\top.$$
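One practical observation, not stated in the slides, is that the weighted penalty can be handled with any solver for the unweighted problem through a change of variables. With $\tilde{W} = A W B$ and $\tilde{M}_i = A^{-1} M_i B^{-1}$, and $A$, $B$ symmetric and invertible, the predictions are unchanged: $\operatorname{tr} W^\top M_i = \operatorname{tr} \tilde{W}^\top \tilde{M}_i$. The sketch below checks this identity numerically on random matrices; all names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 4, 3

def random_spd(d):
    """Random symmetric positive definite matrix of size d x d."""
    G = rng.standard_normal((d, d))
    return G @ G.T + d * np.eye(d)

A, B = random_spd(p), random_spd(q)
W = rng.standard_normal((p, q))
M = rng.standard_normal((p, q))

W_tilde = A @ W @ B                                   # new optimization variable
M_tilde = np.linalg.solve(A, M) @ np.linalg.inv(B)    # transformed observation A^{-1} M B^{-1}

pred_original = np.trace(W.T @ M)
pred_transformed = np.trace(W_tilde.T @ M_tilde)
W_back = np.linalg.solve(A, W_tilde) @ np.linalg.inv(B)  # recover W from W_tilde

print(pred_original, pred_transformed)
```

Under this substitution the penalty becomes the plain trace norm of $\tilde{W}$, and the original variable is recovered as $W = A^{-1} \tilde{W} B^{-1}$.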

Simulations - consistency condition satisfied
[Figure: P(correct rank) and log(RMS) plotted against −log(λ), for n = 10^2, 10^3, 10^4, 10^5. Panels: "consistent − non adaptive" and "consistent − adaptive (γ=1/2)".]

Simulations - consistency condition not satisfied
[Figure: P(correct rank) and log(RMS) plotted against −log(λ), for n = 10^2, 10^3, 10^4, 10^5. Panels: "inconsistent − non adaptive" and "inconsistent − adaptive (γ=1/2)".]
