Conditional gradients everywhere
Francis Bach
SIERRA Project-team, INRIA - Ecole Normale Supérieure

NIPS Workshops - December 2013

Wolfe’s universal algorithm

www.di.ens.fr/~fbach/wolfe_anonymous.pdf

Conditional gradients everywhere
• Conditional gradient and subgradient method
  – Fenchel duality
  – Generalized conditional gradient and mirror descent
• Conditional gradient and greedy algorithms
  – Relationship with basis pursuit, matching pursuit
• Conditional gradient and herding
  – Properties of conditional gradient iterates
  – Relationships with sampling


Composite optimization problems

min_{x∈R^p} h(x) + f(Ax)

• Assumptions
  – f : R^n → R Lipschitz-continuous ⇒ f* has compact support C
  – h : R^p → R µ-strongly convex ⇒ h* is (1/µ)-smooth
  – A ∈ R^{n×p}
  – Efficient computation of a subgradient of f and of a gradient of h*

• Dual problem

  min_{x∈R^p} h(x) + f(Ax) = min_{x∈R^p} max_{y∈C} h(x) + y⊤(Ax) − f*(y)
                           = max_{y∈C} [ min_{x∈R^p} h(x) + x⊤A⊤y − f*(y) ]
                           = max_{y∈C} −h*(−A⊤y) − f*(y)

Examples - Primal formulations

min_{x∈R^p} h(x) + f(Ax),  f Lipschitz, h strongly convex

• ℓ2-regularized logistic regression and generalized linear models (see the sketch below)
  – h = (µ/2)‖·‖_2² and f(z) = (1/n) Σ_{i=1}^n log(1 + e^{−y_i z_i})

• SVM and structured max-margin formulations
  – h = (µ/2)‖·‖_2² and f(z) = max_{y∈Y} ℓ(y, y_i) + z_i⊤(y − y_i)
  – Taskar et al. (2005)

• Proximal operators
  – h(x) = ½‖x − x_0‖², A = I, f non-smooth and Lipschitz-continuous
  – Submodular function minimization (see, e.g., Bach, 2013a)
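To make the template concrete, here is a minimal sketch of the first example, ℓ2-regularized logistic regression written as h(x) + f(Ax), exposing the two oracles used throughout these slides: a (sub)gradient of f and the gradient of h*. The random design, the labels and the value µ = 0.1 are illustrative choices, not taken from the slides.

```python
# Sketch (illustrative data): l2-regularized logistic regression as h(x) + f(Ax),
# with h = (mu/2)||.||^2 and f the averaged logistic loss.
import numpy as np

rng = np.random.default_rng(0)
n, p, mu = 20, 4, 0.1
A = rng.standard_normal((n, p))          # design matrix
labels = rng.choice([-1.0, 1.0], size=n)

def f(z):
    """f(z) = (1/n) sum_i log(1 + exp(-y_i z_i)): Lipschitz-continuous."""
    return np.mean(np.logaddexp(0.0, -labels * z))

def f_grad(z):
    """Gradient of f (f is smooth here, so the subgradient is unique)."""
    return -labels / (n * (1.0 + np.exp(labels * z)))

def h_conj_grad(z):
    """h(x) = (mu/2)||x||^2  =>  h*(z) = ||z||^2/(2 mu) and (h*)'(z) = z/mu."""
    return z / mu

x = rng.standard_normal(p)
print(f(A @ x))                               # primal loss term
print(h_conj_grad(-A.T @ f_grad(A @ x)))      # primal point built from a dual vector
```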


Submodular function minimization and conditional gradient

• Submodular (∼ convex homogeneous) function f on {0, 1}^n ⊂ R^n
• Lovász (1982): min_{x∈{0,1}^n} f(x) = min_{x∈[0,1]^n} f(x)
• Fujishige (2005): minimizers are obtained by thresholding at 0 the solution of min_{x∈R^n} f(x) + ½‖x‖²
• Convex duality (see the sketch below)
  – min_{x∈R^n} f(x) + ½‖x‖² = min_{x∈R^n} max_{y∈C} y⊤x + ½‖x‖² = max_{y∈C} −½‖y‖_2²
  – C polytope with exponentially many vertices and facets
  – Linear functions may be minimized efficiently in O(n)
  – May need high precision
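The following sketch illustrates the dual view on a toy submodular function: conditional gradient on min_{y∈B(F)} ½‖y‖², where the linear oracle over the base polytope B(F) is Edmonds' greedy algorithm (the "linear functions may be minimized efficiently" step, up to a sort). The concrete function F (concave-of-cardinality plus a modular term), the step size ρ_t = 2/(t+1) and the iteration count are our illustrative choices; as noted above, high precision may be needed in general.

```python
# Sketch: Frank-Wolfe on the dual of submodular function minimization.
import numpy as np

s = np.array([-2.0, 0.3, -1.5, 0.5])                # modular part
n = len(s)

def F(S):
    """F(S) = sqrt(|S|) + sum_{i in S} s_i  (submodular, F(empty) = 0)."""
    return np.sqrt(len(S)) + sum(s[i] for i in S)

def greedy_linear_oracle(w):
    """arg max_{y in B(F)} <w, y>: Edmonds' greedy algorithm on the base polytope."""
    order = np.argsort(-w)                          # decreasing components of w
    y, prev, chosen = np.zeros(n), 0.0, []
    for i in order:
        chosen.append(i)
        cur = F(chosen)
        y[i] = cur - prev                           # marginal gain
        prev = cur
    return y

# Conditional gradient on min_{y in B(F)} 0.5*||y||^2 (i.e. max of -0.5*||y||^2)
y = greedy_linear_oracle(np.zeros(n))               # an arbitrary vertex of B(F)
for t in range(1, 500):
    vertex = greedy_linear_oracle(-y)               # minimize <y, .> over B(F)
    rho = 2.0 / (t + 1)
    y = (1 - rho) * y + rho * vertex

# Thresholding at 0 (Fujishige): the negative coordinates give a minimizer of F.
S_star = [i for i in range(n) if y[i] < 0]
print(S_star, F(S_star))                            # expected: [0, 2], value ~ -2.09
```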

Examples - Dual formulations

max_{y∈C} −h*(−A⊤y) − f*(y),  C compact, h* smooth

• Constrained smooth supervised learning
  – h* smooth convex data-fitting term, f* = 0, C compact
  – Typically, C = { y ∈ R^n, Ω(y) ≤ ω_0 }
  – Computing a subgradient of f ⇔ maximizing a linear function on C
  – Computing the dual norm Ω*(z)

• Penalized smooth supervised learning
  – f*(y) = κΩ(y) if Ω(y) ≤ ω_0, +∞ otherwise
  – If ω_0 is large enough, equivalent to the penalized formulation max_{y∈C} −h*(−A⊤y) − κΩ(y)

Simple equivalence

• Assume h(x) = (µ/2)‖x‖_2² and f*(y) = 0 on C, i.e., f(x) = max_{y∈C} x⊤y

• Subgradient method on the primal problem min_{x∈R^p} (µ/2)‖x‖_2² + max_{y∈C} y⊤Ax
    x_t = x_{t−1} − (ρ_t/µ) [ A⊤f′(Ax_{t−1}) + µ x_{t−1} ]

• Conditional gradient on the dual problem max_{y∈C} −(1/(2µ))‖A⊤y‖_2²
    ȳ_{t−1} ∈ arg max_{y∈C} y⊤( −(1/µ) AA⊤y_{t−1} )
    y_t = (1 − ρ_t) y_{t−1} + ρ_t ȳ_{t−1}

Simple equivalence

• Assume h(x) = (µ/2)‖x‖_2² and f*(y) = 0 on C, i.e., f(x) = max_{y∈C} x⊤y

• Subgradient method on the primal problem min_{x∈R^p} (µ/2)‖x‖_2² + max_{y∈C} y⊤Ax
    ȳ_{t−1} ∈ arg max_{y∈C} y⊤Ax_{t−1} = f′(Ax_{t−1})
    x_t = (1 − ρ_t) x_{t−1} + ρ_t ( −(1/µ) A⊤ȳ_{t−1} )

• Conditional gradient on the dual problem max_{y∈C} −(1/(2µ))‖A⊤y‖_2²
    x_{t−1} = −(1/µ) A⊤y_{t−1}
    ȳ_{t−1} ∈ arg max_{y∈C} y⊤Ax_{t−1}
    y_t = (1 − ρ_t) y_{t−1} + ρ_t ȳ_{t−1}

• Written in this form, the two recursions coincide, with x_t = −(1/µ)A⊤y_t (numerical check below)
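A small numerical check of this equivalence, under the assumption that C is the unit ℓ1-ball (so the linear oracle returns a signed canonical basis vector and f(x) = ‖x‖_∞); the matrix A, µ and the step size ρ_t = 2/(t+1) are illustrative choices. Starting from matching points, the two recursions produce the same iterates up to rounding and arg-max ties.

```python
# Sketch: subgradient method on the primal = conditional gradient on the dual.
import numpy as np

rng = np.random.default_rng(0)
n, p, mu = 6, 3, 2.0
A = rng.standard_normal((n, p))

def lmo_l1(v):
    """arg max_{||y||_1 <= 1} y^T v: a signed canonical basis vector."""
    i = int(np.argmax(np.abs(v)))
    y = np.zeros_like(v)
    y[i] = 1.0 if v[i] >= 0 else -1.0
    return y

y = lmo_l1(rng.standard_normal(n))      # feasible start for the dual method
x = -A.T @ y / mu                       # matching start for the primal method

for t in range(1, 300):
    rho = 2.0 / (t + 1)

    # Subgradient method on the primal, written as a convex combination.
    y_bar_primal = lmo_l1(A @ x)                      # f'(A x_{t-1})
    x = (1 - rho) * x + rho * (-A.T @ y_bar_primal / mu)

    # Conditional gradient on the dual max_{y in C} -||A^T y||^2 / (2 mu).
    x_dual = -A.T @ y / mu                            # current primal candidate
    y_bar_dual = lmo_l1(A @ x_dual)
    y = (1 - rho) * y + rho * y_bar_dual

# The primal iterate and -A^T y_t / mu coincide (up to rounding / ties).
print(np.linalg.norm(x - (-A.T @ y / mu)))
```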


Mirror descent (Nemirovski and Yudin, 1983)

• Assume h′ is a bijection from int(K) to R^p, where K = dom(h)
• Bregman divergence D(x_1, x_2) = h(x_1) − h(x_2) − (x_1 − x_2)⊤h′(x_2)
• Mirror descent recursion to minimize g_primal(x) = h(x) + f(Ax)

    x_t = arg min_{x∈R^p} (x − x_{t−1})⊤ g′_primal(x_{t−1}) + (1/ρ_t) D(x, x_{t−1})
        = arg min_{x∈R^p} (x − x_{t−1})⊤ [ h′(x_{t−1}) + A⊤f′(Ax_{t−1}) ] + (1/ρ_t) D(x, x_{t−1})
        = arg min_{x∈R^p} h(x) − (1 − ρ_t) x⊤h′(x_{t−1}) + ρ_t x⊤A⊤f′(Ax_{t−1})

• Equivalent reformulation
    ȳ_{t−1} ∈ arg max_{y∈C} y⊤Ax_{t−1} − f*(y) = f′(Ax_{t−1})
    h′(x_t) = (1 − ρ_t) h′(x_{t−1}) − ρ_t A⊤ȳ_{t−1}


Mirror descent

• Mirror descent recursion to minimize g_primal(x) = h(x) + f(Ax)
    ȳ_{t−1} ∈ arg max_{y∈C} y⊤Ax_{t−1} − f*(y) = f′(Ax_{t−1})
    h′(x_t) = (1 − ρ_t) h′(x_{t−1}) − ρ_t A⊤ȳ_{t−1}

• Assume h′(x_t) = −A⊤y_t ⇔ x_t = (h*)′(−A⊤y_t)
    x_{t−1} = arg min_{x∈R^p} h(x) + x⊤A⊤y_{t−1} = (h*)′(−A⊤y_{t−1})
    ȳ_{t−1} ∈ arg max_{y∈C} y⊤Ax_{t−1} − f*(y)
    y_t = (1 − ρ_t) y_{t−1} + ρ_t ȳ_{t−1}

• Generalized conditional gradient for max_{y∈C} −h*(−A⊤y) − f*(y)  (sketch below)
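A minimal sketch of this generalized conditional gradient, with the illustrative choices h(x) = (µ/2)‖x‖² (so that (h*)′(z) = z/µ), C = [−1, 1]^n and f*(y) = κ‖y‖_1 on C, which corresponds to f(z) = Σ_i max(|z_i| − κ, 0); the step size ρ_t = 2/(t+1) is one admissible choice, not the only one.

```python
# Sketch: generalized conditional gradient on max_{y in C} -h*(-A^T y) - f*(y).
import numpy as np

rng = np.random.default_rng(1)
n, p, mu, kappa = 8, 3, 1.0, 0.3
A = rng.standard_normal((n, p))

def generalized_lmo(v):
    """arg max_{y in [-1,1]^n} y^T v - kappa*||y||_1  (coordinate-wise)."""
    return np.where(np.abs(v) > kappa, np.sign(v), 0.0)

y = np.zeros(n)                                   # feasible dual start
for t in range(1, 500):
    x = -A.T @ y / mu                             # x_{t-1} = (h*)'(-A^T y_{t-1})
    y_bar = generalized_lmo(A @ x)                # arg max y^T A x_{t-1} - f*(y)
    rho = 2.0 / (t + 1)
    y = (1 - rho) * y + rho * y_bar               # generalized CG / mirror step

# Dual objective -h*(-A^T y) - f*(y) and primal objective h(x) + f(Ax)
x = -A.T @ y / mu
dual = -np.sum((A.T @ y) ** 2) / (2 * mu) - kappa * np.abs(y).sum()
primal = mu / 2 * np.sum(x ** 2) + np.maximum(np.abs(A @ x) - kappa, 0).sum()
print(primal, dual, primal - dual)                # the duality gap shrinks with t
```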

Duality between mirror descent and conditional gradient

• Generalized conditional gradient for max_{y∈C} −h*(−A⊤y) − f*(y)
  – Algorithm from Bredies and Lorenz (2008)
  – New analysis for ρ_t = 2/(t + 1) or with line search (Bach, 2013b)

• Consequences of the equivalence
  – Primal-dual guarantees (see Jaggi, 2013)
  – Line search for conditional gradient leads to an adaptive step size for mirror descent
  – Any progress on one side leads to progress on the other side
  – Relationship with the equivalence of Grigas and Freund (2013)?


Duality between bundle and simplicial methods

min_{x∈R^p} f(Ax) + ½‖x‖_2² = max_{y∈C} −½‖A⊤y‖_2²

• Simplicial methods (a.k.a. fully corrective steps) (see the sketch below)
  – Maximize −½‖A⊤y‖_2² on the convex hull of y_0, . . . , y_{t−1}
  – More expensive quadratic programming (QP)
  – Finite convergence for polytopes
  – Provably better convergence?

• Bundle methods
  – Minimize the piecewise-affine lower bound
    min_{x∈R^p} max_{i∈{0,...,t−1}} { f(Ax_i) + f′(Ax_i)⊤A(x − x_i) } + ½‖x‖_2²

• Implementation through active-set methods for QP, e.g., the minimum-norm-point algorithm (Wolfe, 1976)
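A sketch of the simplicial (fully corrective) variant: at every iteration the dual objective is re-optimized over the convex hull of all vertices generated so far. We again take C to be the unit ℓ1-ball for illustration and solve the small QP in the convex weights with SciPy's SLSQP solver; in practice a dedicated active-set solver such as the minimum-norm-point algorithm (Wolfe, 1976) would be used.

```python
# Sketch: fully corrective Frank-Wolfe on max_{y in C} -0.5*||A^T y||^2.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 6, 3
A = rng.standard_normal((n, p))

def lmo_l1(v):
    """arg max_{||y||_1 <= 1} y^T v: a signed canonical basis vector."""
    i = int(np.argmax(np.abs(v)))
    y = np.zeros_like(v)
    y[i] = 1.0 if v[i] >= 0 else -1.0
    return y

vertices = [lmo_l1(rng.standard_normal(n))]          # y_0
y = vertices[0]
for t in range(20):
    vertices.append(lmo_l1(-A @ (A.T @ y)))          # new Frank-Wolfe vertex
    V = np.column_stack(vertices)                    # n x k matrix of vertices

    def obj(lam):                                    # 0.5*||A^T V lam||^2
        return 0.5 * np.sum((A.T @ (V @ lam)) ** 2)

    k = V.shape[1]
    res = minimize(obj, np.ones(k) / k, method="SLSQP",
                   bounds=[(0.0, 1.0)] * k,
                   constraints={"type": "eq", "fun": lambda lam: lam.sum() - 1.0})
    y = V @ res.x                                    # fully corrective iterate

print(-0.5 * np.sum((A.T @ y) ** 2))                 # dual objective value
```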

Frank-Wolfe for penalized problems

min_{x∈R^p} f(Ax) + κΩ(x)

• f convex and smooth, Ω a norm
• Conditional gradient algorithm when Ω* is simple (see the sketch below)
    x̄_{t−1} ∈ arg min_{Ω(x)=1} x⊤A⊤f′(Ax_{t−1})
    x_t = (1 − ρ_t) x_{t−1} + τ_t x̄_{t−1}

• Dudik et al. (2012); Harchaoui et al. (2013); Zhang et al. (2012); Bach (2013b)
• Several choices for ρ_t and τ_t lead to a convergence rate of O(1/t)
• Multiplicative gaps are allowed (Bach, 2013c)
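A sketch of the penalized Frank-Wolfe recursion for the illustrative choices Ω = ℓ1-norm and f(z) = ½‖z − b‖², with ρ_t = 2/(t+1) and τ_t obtained by exact minimization of ½‖b − (1−ρ_t)Ax_{t−1} − τAx̄_{t−1}‖² + κ|τ| (one of the "several choices" above, not the only one); the data A, b and κ are ours.

```python
# Sketch: penalized Frank-Wolfe for min_x 0.5*||Ax - b||^2 + kappa*||x||_1.
import numpy as np

rng = np.random.default_rng(0)
n, p, kappa = 30, 10, 0.5
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

x = np.zeros(p)
for t in range(1, 300):
    grad = A.T @ (A @ x - b)                  # A^T f'(A x_{t-1})
    j = int(np.argmax(np.abs(grad)))
    xbar = np.zeros(p)
    xbar[j] = -1.0 if grad[j] > 0 else 1.0    # arg min_{||x||_1 = 1} x^T grad

    rho = 2.0 / (t + 1)
    a = A @ xbar                              # = +/- column j of A
    r = b - (1 - rho) * (A @ x)               # residual after shrinking x_{t-1}
    c = a @ r
    tau = np.sign(c) * max(abs(c) - kappa, 0.0) / (a @ a)   # soft-thresholded 1-D step

    x = (1 - rho) * x + tau * xbar

print(0.5 * np.sum((A @ x - b) ** 2) + kappa * np.abs(x).sum())   # penalized objective
```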

Dealing with approximate oracles

• What if Ω* cannot be computed efficiently?
  – Typically multiplicative errors: for some κ > 1, for any y ∈ R^p, one can find x ∈ R^p such that
    Ω(x) = 1 and (1/κ) Ω*(y) ≤ x⊤y ≤ Ω*(y)
  – Common in relaxations of matrix factorization problems
  – Different from additive errors (Jaggi, 2013)

• Approximate solution with gap (κ − 1)Ω(x*) (Bach, 2013c)


Gauge function interpretation (Dudik et al., 2012)

Ω(x) = inf Σ_{i∈I} |λ_i| such that x = Σ_{i∈I} λ_i z_i, Ω(z_i) = 1

• Equivalent to ℓ1-penalization in potentially infinite-dimensional spaces
    min_{x∈R^p} ½‖y − Ax‖_2² + κΩ(x) = min_{λ} ½‖y − Σ_{i∈I} λ_i Az_i‖_2² + κ Σ_{i∈I} |λ_i|

• Conditional gradient algorithm (Dudik et al., 2012)
    i_t ∈ arg min_{i∈I} z_i⊤A⊤(Ax_{t−1} − y)
    (ρ_t, τ_t) ∈ arg min_{ρ,τ} ½‖y − (1 − ρ)Ax_{t−1} − τ Az_{i_t}‖_2² + κ|τ|
    x_t = (1 − ρ_t) x_{t−1} + τ_t z_{i_t}

• Boosting interpretation (Mason et al., 1999)


Basis pursuit vs. matching pursuit

• Relaxed greedy approximation (Barron et al., 2008) (see the sketch below)
    i_t ∈ arg min_{i∈I} z_i⊤(x_{t−1} − y)
    τ_t ∈ arg min_{τ} ½‖y − (1 − ρ_t)x_{t−1} − τ z_{i_t}‖_2²
    x_t = (1 − ρ_t) x_{t−1} + τ_t z_{i_t}

  – With ρ_t = 2/(t + 1), x_t converges to the basis pursuit solution
    min Σ_{i∈I} |λ_i| such that y = Σ_{i∈I} λ_i z_i

• Matching pursuit (Mallat and Zhang, 1993)
    i_t ∈ arg min_{i∈I} z_i⊤(x_{t−1} − y)
    τ_t ∈ arg min_{τ} ½‖y − x_{t−1} − τ z_{i_t}‖_2²
    x_t = x_{t−1} + τ_t z_{i_t}
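A side-by-side sketch of the two recursions on a small random sign-symmetric unit-norm dictionary (an illustrative choice). Both iterates converge to y here since the atoms span the space; the difference shows up in the implicit coefficients λ, whose ℓ1-norm the relaxed greedy scheme drives toward the basis pursuit value, while matching pursuit need not.

```python
# Sketch: relaxed greedy approximation vs matching pursuit on a toy dictionary.
import numpy as np

rng = np.random.default_rng(0)
d, m0 = 5, 8
Z0 = rng.standard_normal((d, m0))
Z0 /= np.linalg.norm(Z0, axis=0)
Z = np.concatenate([Z0, -Z0], axis=1)         # sign-symmetric, unit-norm atoms z_i
m = Z.shape[1]
y = rng.standard_normal(d)

x_rg, lam_rg = np.zeros(d), np.zeros(m)       # relaxed greedy iterate / coefficients
x_mp, lam_mp = np.zeros(d), np.zeros(m)       # matching pursuit iterate / coefficients

for t in range(1, 500):
    # Relaxed greedy approximation (Barron et al., 2008) with rho_t = 2/(t+1)
    rho = 2.0 / (t + 1)
    i = int(np.argmin(Z.T @ (x_rg - y)))      # i_t in arg min_i z_i^T (x_{t-1} - y)
    tau = Z[:, i] @ (y - (1 - rho) * x_rg)    # exact 1-D minimization (unit atoms)
    x_rg = (1 - rho) * x_rg + tau * Z[:, i]
    lam_rg *= (1 - rho)
    lam_rg[i] += tau

    # Matching pursuit (Mallat and Zhang, 1993): past atoms are not shrunk
    i = int(np.argmin(Z.T @ (x_mp - y)))
    tau = Z[:, i] @ (y - x_mp)
    x_mp = x_mp + tau * Z[:, i]
    lam_mp[i] += tau

print(np.linalg.norm(y - x_rg), np.linalg.norm(y - x_mp))  # both residuals shrink
print(np.abs(lam_rg).sum(), np.abs(lam_mp).sum())          # l1-norms of the implicit coefficients
```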


Herding

• Goals of herding (Welling, 2009)
  – Given a feature map Φ : X → F and a vector µ ∈ F
  – Generate "pseudo-samples" x_1, . . . , x_n with properties similar to samples from the maximum-entropy distribution such that E Φ(x) = µ
    x_{t+1} ∈ arg max_{x∈X} ⟨w_t, Φ(x)⟩
    w_{t+1} = w_t + µ − Φ(x_{t+1})

• Reformulation as mean approximation (Chen et al., 2010)
  – Minimize ‖ (1/n) Σ_{i=1}^n Φ(x_i) − µ ‖² with respect to x_1, . . . , x_n ∈ X
  – Approximation of integrals, if F is a Hilbert space:
    E_{p(x)} f(x) = E_{p(x)} ⟨Φ(x), f⟩ = ⟨E_{p(x)} Φ(x), f⟩ = ⟨µ, f⟩
    |E_{p(x)} f(x) − E_{p̂(x)} f(x)| ≤ ‖µ − µ̂‖ ‖f‖  with µ̂ = (1/n) Σ_{i=1}^n Φ(x_i)

Interpretation as conditional gradient

• Marginal polytope M = hull({Φ(x), x ∈ X}), which contains µ

• Herding is equivalent to conditional gradient on min_{z∈M} ½‖z − µ‖² (Bach et al., 2012) (sketch below)
  – x_{t+1} = arg max_{x∈X} ⟨Φ(x), µ − z_t⟩ is the pseudo-sample
  – Trivial optimization problem...
  – Three strategies (ρ_t = 1/(t + 1), line search, fully corrective)
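A minimal sketch of this equivalence, with X a finite grid, an explicit three-dimensional feature map, w_0 = µ and a random target distribution (all illustrative choices): the herding recursion and conditional gradient with ρ_t = 1/(t + 1) select the same pseudo-samples, and the empirical feature mean µ̂ approaches µ.

```python
# Sketch: herding = conditional gradient on min_{z in M} 0.5*||z - mu||^2.
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-1.0, 1.0, 201)                    # finite candidate set X
Phi = np.stack([X, X**2, np.sin(3 * X)], axis=1)   # feature map Phi(x) in R^3

prob = rng.random(len(X))
prob /= prob.sum()                                 # target distribution p(x)
mu = prob @ Phi                                    # mu = E_p Phi(x), lies in M

T = 50

# Herding recursion (Welling, 2009), started at w_0 = mu
w, herd = mu.copy(), []
for t in range(1, T + 1):
    i = int(np.argmax(Phi @ w))                    # x_t in arg max <w_{t-1}, Phi(x)>
    herd.append(i)
    w = w + mu - Phi[i]

# Conditional gradient on min_{z in M} 0.5*||z - mu||^2 with rho_t = 1/(t+1)
z, fw = np.zeros(3), []
for t in range(1, T + 1):
    i = int(np.argmax(Phi @ (mu - z)))             # same pseudo-sample selection
    fw.append(i)
    z = (1 - 1.0 / (t + 1)) * z + (1.0 / (t + 1)) * Phi[i]

mu_hat = Phi[herd].mean(axis=0)                    # (1/n) sum Phi(x_i)
print(herd == fw, np.linalg.norm(mu - mu_hat))     # same samples, small gap
```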


Convergence rates

• No assumptions
  – ρ_t = 1/(t + 1) or line search: ‖µ − µ̂‖ = O(t^{−1/2} √(log t))

• µ in the interior of M
  – ρ_t = 2/(t + 1): ‖µ − µ̂‖ = O(t^{−1}) (Chen et al., 2010)
  – line search: ‖µ − µ̂‖ = O(exp(−αt)) (Guelat and Marcotte, 1986)

• Proposition 1 (Bach et al., 2012): if F is finite-dimensional and p(x) > 0, then µ is in the interior of M
• Proposition 2 (Bach et al., 2012): if F is infinite-dimensional, then µ cannot be in the interior of M
• Open problem: the convergence is still empirically O(t^{−1}) in many situations

Interesting open issues/problems

• Distributional properties of the iterates
• Other interesting trivial optimization problems
• Stochastic conditional gradient
• Convergence rate of partially corrective algorithms
  – Application to submodular function minimization

References

F. Bach. Learning with Submodular Functions: A Convex Optimization Perspective. Technical Report 00645271, HAL, 2013a.
F. Bach. Duality between subgradient and conditional gradient methods. Technical Report 00757696, HAL, 2013b.
F. Bach. Convex relaxations of structured matrix factorizations. Technical Report 1309.3117, arXiv, 2013c.
F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence between herding and conditional gradient algorithms. Technical Report 1203.4523, arXiv, 2012.
A. R. Barron, A. Cohen, W. Dahmen, and R. A. DeVore. Approximation and learning by greedy algorithms. The Annals of Statistics, 36(1):64–94, 2008.
K. Bredies and D. A. Lorenz. Iterated hard shrinkage for minimization problems with sparsity constraints. SIAM Journal on Scientific Computing, 30(2):657–683, 2008.
Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. In Proc. UAI, 2010.
M. Dudik, Z. Harchaoui, and J. Malick. Lifted coordinate descent for learning with trace-norm regularization. In Proc. AISTATS, 2012.
S. Fujishige. Submodular Functions and Optimization. Elsevier, 2005.
J. Guelat and P. Marcotte. Some comments on Wolfe's "away step". Mathematical Programming, 35(1):110–119, 1986.
Z. Harchaoui, A. Juditsky, and A. Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Technical Report 1302.2325, arXiv, 2013.
M. Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. In Proc. ICML, 2013.
L. Lovász. Submodular functions and convexity. In Mathematical Programming: The State of the Art, Bonn, pages 235–257, 1982.
S. G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.
L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent in function space. In Advances in Neural Information Processing Systems (NIPS), 1999.
A. S. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley, 1983.
B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin. Learning structured prediction models: a large margin approach. In Proc. ICML, 2005.
M. Welling. Herding dynamical weights to learn. In Proc. ICML, 2009.
P. Wolfe. Finding the nearest point in a polytope. Mathematical Programming, 11(1):128–149, 1976.
X. Zhang, D. Schuurmans, and Y. Yu. Accelerated training for matrix-norm regularization: a boosting approach. In Advances in Neural Information Processing Systems (NIPS), 2012.
