Marginal Inference in MRFs using Frank-Wolfe
David Belanger, Daniel Sheldon, Andrew McCallum
School of Computer Science, University of Massachusetts, Amherst
{belanger,sheldon,mccallum}@cs.umass.edu
December 10, 2013
Table of Contents
1. Markov Random Fields
2. Frank-Wolfe for Marginal Inference
3. Optimality Guarantees and Convergence Rate
4. Beyond MRFs
5. Fancier FW
Markov Random Fields

$$\Phi_\theta(x) = \sum_{c \in C} \theta_c(x_c)$$

$$P(x) = \frac{\exp\left(\Phi_\theta(x)\right)}{Z}$$

Overcomplete representation: $x \to \mu$, so that $\Phi_\theta(x) \to \langle \theta, \mu \rangle$.
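The definitions above can be made concrete with a brute-force sketch (a toy model with hypothetical numbers, not from the talk): score each assignment with $\Phi_\theta(x) = \sum_c \theta_c(x_c)$ and normalize by $Z$.

```python
import itertools
import math

# Toy MRF (hypothetical potentials): three binary variables with
# pairwise cliques (0,1) and (1,2).
theta = {
    (0, 1): {(0, 0): 0.5, (0, 1): -0.2, (1, 0): -0.2, (1, 1): 1.0},
    (1, 2): {(0, 0): 0.3, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.8},
}

def score(x):
    """Phi_theta(x) = sum over cliques c of theta_c(x_c)."""
    return sum(table[(x[i], x[j])] for (i, j), table in theta.items())

# Partition function Z = sum_x exp(Phi_theta(x)), so P(x) = exp(Phi_theta(x)) / Z.
states = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(score(x)) for x in states)
P = {x: math.exp(score(x)) / Z for x in states}
```

Enumerating all states is only feasible for tiny models; the point of the talk is avoiding exactly this enumeration.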
Marginal Inference

$$\mu_{\text{MARG}} = \mathbb{E}_{P_\theta}[\mu]$$

Variational formulation:
$$\mu_{\text{MARG}} = \arg\max_{\mu \in \mathcal{M}} \langle \mu, \theta \rangle + H_{\mathcal{M}}(\mu)$$

Tractable approximation:
$$\bar{\mu}_{\text{approx}} = \arg\max_{\mu \in \mathcal{L}} \langle \mu, \theta \rangle + H_B(\mu), \qquad H_B(\mu) = \sum_{c \in C} W_c H(\mu_c)$$
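For intuition, the exact marginals $\mu_{\text{MARG}} = \mathbb{E}_{P_\theta}[\mu]$ can be computed by enumeration on a tiny model (hypothetical numbers); this is exactly what becomes intractable at scale, motivating the variational formulations.

```python
import itertools
import math

# Toy 3-variable chain MRF (hypothetical potentials); exact marginal
# inference by enumerating all 2^3 states.
theta = {
    (0, 1): {(0, 0): 0.5, (0, 1): -0.2, (1, 0): -0.2, (1, 1): 1.0},
    (1, 2): {(0, 0): 0.3, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.8},
}

def score(x):
    return sum(table[(x[i], x[j])] for (i, j), table in theta.items())

states = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(score(x)) for x in states)

# Unary marginals mu_i = P(x_i = 1) = E_P[ indicator(x_i = 1) ].
mu = [sum(math.exp(score(x)) for x in states if x[i] == 1) / Z
      for i in range(3)]
```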
MAP Inference

$$\mu_{\text{MAP}} = \arg\max_{\mu \in \mathcal{M}} \langle \mu, \theta \rangle$$

[Diagram: $\theta$ → Black-Box MAP Solver → $\mu_{\text{MAP}}$; $\theta$ → Gray-Box MAP Solver → $\mu_{\text{MAP}}$]
Marginal → MAP Reductions
Hazan and Jaakkola [2012]; Ermon et al. [2013]
Generic FW with Line Search
$$y_t = \arg\min_{x \in \mathcal{X}} \langle x, -\nabla f(x_{t-1}) \rangle$$

$$x_t = \arg\max_{\gamma \in [0,1]} f\left((1-\gamma)\, x_{t-1} + \gamma\, y_t\right)$$
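A minimal sketch of generic Frank-Wolfe with exact line search, on a hypothetical toy problem not from the talk: minimizing a quadratic over the probability simplex, where the linear minimization oracle simply picks a vertex.

```python
# Minimize f(x) = ||x - b||^2 over the probability simplex with
# Frank-Wolfe.  The linear minimization oracle over the simplex returns
# the vertex whose gradient coordinate is smallest.
def fw_simplex(b, iters=100):
    n = len(b)
    x = [1.0] + [0.0] * (n - 1)                      # start at a vertex
    for _ in range(iters):
        grad = [2.0 * (xi - bi) for xi, bi in zip(x, b)]
        j = min(range(n), key=lambda i: grad[i])     # y_t = argmin_y <y, grad f>
        y = [1.0 if i == j else 0.0 for i in range(n)]
        d = [yi - xi for yi, xi in zip(y, x)]
        gap = -sum(g * di for g, di in zip(grad, d)) # Frank-Wolfe duality gap
        if gap < 1e-12:
            break
        # exact line search for a quadratic: gamma* = gap / (2 ||d||^2), clipped
        gamma = min(1.0, gap / (2.0 * sum(di * di for di in d)))
        x = [(1.0 - gamma) * xi + gamma * yi for xi, yi in zip(x, y)]
    return x

# b lies outside the simplex; FW converges to its Euclidean projection,
# [0.6, 0.4, 0.0]
x = fw_simplex([1.0, 0.8, -0.5])
```

The duality gap doubles as a stopping criterion, a standard property of Frank-Wolfe that the convergence results later in the talk rely on.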
Generic FW with Line Search

[Diagram: $x_{t-1}$ → Compute Gradient $\nabla f(x_{t-1})$ → Linear Minimization → $y_t$ → Line Search → $x_t$]
FW for Marginal Inference

[Diagram: Compute Gradient $\nabla F(\mu_t) = \theta + \nabla H(\mu_t)$ → $\tilde{\theta}$ → MAP Inference Oracle → $\tilde{\mu}_{\text{MAP}}$ → Line Search → $\mu_{t+1}$]
Subproblem Parametrization

$$F(\mu) = \langle \mu, \theta \rangle + \sum_{c \in C} W_c H(\mu_c)$$

$$\tilde{\theta} = \nabla F(\mu_t) = \theta + \sum_{c \in C} W_c \nabla H(\mu_c)$$
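Concretely, the entropy gradient has a closed form: for $H(\mu_c) = -\sum_i \mu_{c,i} \log \mu_{c,i}$, we get $\partial H / \partial \mu_{c,i} = -(1 + \log \mu_{c,i})$. A sketch with hypothetical numbers:

```python
import math

# Building the MAP oracle's parameters
#   tilde_theta = theta + sum_c W_c * grad H(mu_c)
# for a single clique.  With H(mu) = -sum_i mu_i log mu_i, the gradient is
# dH/dmu_i = -(1 + log mu_i); note it blows up as mu_i -> 0.
def entropy_grad(mu):
    return [-(1.0 + math.log(m)) for m in mu]

theta = [0.5, -0.2, 1.0]   # hypothetical potentials for one clique, flattened
mu_t = [0.2, 0.3, 0.5]     # current clique marginal
W_c = 1.0                  # hypothetical counting number / weight for this clique
tilde_theta = [t + W_c * g for t, g in zip(theta, entropy_grad(mu_t))]
```

The unbounded gradient near the boundary of the marginal polytope is the same phenomenon behind the curvature discussion later in the talk.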
Line Search

[Diagram: $\mu_{t+1}$ on the segment from $\mu_t$ toward $\tilde{\mu}_{\text{MAP}}$]

Computing the line search objective can scale with:
Bad: # possible values in cliques.
Good: # cliques in graph. (see paper)
Experiment #1
Convergence Rate

Convergence Rate of Frank-Wolfe [Jaggi, 2013]:
$$F(\mu^*) - F(\mu_t) \le \frac{2 C_F (1 + \delta)}{t + 2}$$

Here the linear subproblem at iteration $t$ may be solved to additive error $\frac{\delta C_F}{t+2}$, i.e., with bounded MAP suboptimality at iteration $t$, which is NP-hard to guarantee.

How to deal with MAP hardness?
Use a MAP solver and hope for the best [Hazan and Jaakkola, 2012].
Relax to the local polytope.
Curvature + Convergence Rate

$$C_f = \sup_{\substack{x, s \in \mathcal{D};\ \gamma \in [0,1];\\ y = x + \gamma(s - x)}} \frac{2}{\gamma^2} \left( f(y) - f(x) - \langle y - x, \nabla f(x) \rangle \right)$$

[Figure: entropy of a binary variable as a function of $P(x = 1)$, with $\mu_t$, $\mu_{t+1}$, and $\tilde{\mu}_{\text{MAP}}$ marked along the curve]
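The figure's point can be checked numerically (a hypothetical sketch, not from the talk): for the negative binary entropy $f(p) = p \log p + (1-p)\log(1-p)$, the term inside the curvature sup grows without bound as $x$ approaches the boundary, because $\nabla f$ is unbounded there.

```python
import math

# Scaled Bregman term from the curvature definition, specialized to
# gamma = 1 (so y = s):  2 * ( f(y) - f(x) - (y - x) * f'(x) ).
# As x -> 0 the term diverges, so C_f is unbounded over the full domain.
def f(p):
    return p * math.log(p) + (1.0 - p) * math.log(1.0 - p)

def fprime(p):
    return math.log(p / (1.0 - p))

def bregman_term(x, y):
    return 2.0 * (f(y) - f(x) - (y - x) * fprime(x))

# Push x toward the boundary with y fixed at 0.5: the term keeps growing.
terms = [bregman_term(eps, 0.5) for eps in (1e-1, 1e-3, 1e-6)]
```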
Experiment #2
Beyond MRFs

Question: Are MRFs the right Gibbs distributions in which to use Frank-Wolfe?
Problem Family                    MAP Algorithm        Marginal Algorithm
tree-structured graphical models  Viterbi              Forward-Backward
loopy graphical models            Max-Product BP       Sum-Product BP
Directed Spanning Tree            Chu-Liu-Edmonds      Matrix-Tree Theorem
Bipartite Matching                Hungarian Algorithm  ×
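As a concrete reference point for the spanning-tree row: the Matrix-Tree theorem computes spanning-tree edge marginals in closed form via the graph Laplacian. The brute-force enumeration below (undirected case, hypothetical toy graph) computes the same quantity the slow way, for a distribution $P(T) \propto \prod_{e \in T} w_e$.

```python
import itertools

# Brute-force spanning-tree edge marginals: enumerate all (n-1)-edge
# subsets, keep the ones that are spanning trees, and accumulate each
# tree's weight onto the edges it contains.
def spanning_tree_marginals(n, edges, weights):
    def is_spanning_tree(subset):
        parent = list(range(n))
        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a
        for u, v in subset:
            ru, rv = find(u), find(v)
            if ru == rv:
                return False      # adding this edge would create a cycle
            parent[ru] = rv
        return True                # n-1 edges, acyclic => spanning tree

    total = 0.0
    edge_mass = [0.0] * len(edges)
    for subset in itertools.combinations(range(len(edges)), n - 1):
        tree = [edges[i] for i in subset]
        if is_spanning_tree(tree):
            w = 1.0
            for i in subset:
                w *= weights[i]
            total += w
            for i in subset:
                edge_mass[i] += w
    return [m / total for m in edge_mass]

# Unweighted triangle: 3 spanning trees, each edge appears in 2 of them,
# so every edge marginal is 2/3.
m = spanning_tree_marginals(3, [(0, 1), (1, 2), (0, 2)], [1.0, 1.0, 1.0])
```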
Norm-regularized marginal inference
$$\mu_{\text{MARG}} = \arg\max_{\mu \in \mathcal{M}} \langle \mu, \theta \rangle + H_{\mathcal{M}}(\mu) + \lambda R(\mu)$$
Harchaoui et al. [2013].

Local linear oracle for MRFs?
$$\tilde{\mu}_t = \arg\max_{\mu \in \mathcal{M} \cap B_r(\mu_t)} \langle \mu, \theta \rangle$$
Garber and Hazan [2013]
Conclusion

We need to figure out how to handle the entropy gradient.
There are plenty of extensions to other Gibbs distributions and regularizers.
Further Reading I

Stefano Ermon, Carla Gomes, Ashish Sabharwal, and Bart Selman. Taming the curse of dimensionality: Discrete integration by hashing and optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 334–342, 2013.

D. Garber and E. Hazan. A Linearly Convergent Conditional Gradient Algorithm with Applications to Online and Stochastic Optimization. ArXiv e-prints, January 2013.

Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. arXiv preprint arXiv:1302.2325, 2013.

Tamir Hazan and Tommi S. Jaakkola. On the Partition Function and Random Maximum A-Posteriori Perturbations. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 991–998, 2012.

Bert Huang and Tony Jebara. Approximating the permanent with belief propagation. arXiv preprint arXiv:0908.1769, 2009.
Further Reading II

Mark Huber. Exact sampling from perfect matchings of dense regular bipartite graphs. Algorithmica, 44(3):183–193, 2006.

Martin Jaggi. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 427–435, 2013.

James Petterson, Tiberio Caetano, Julian McAuley, and Jin Yu. Exponential family graph matching and ranking. 2009.

Tim Roughgarden and Michael Kearns. Marginals-to-models reducibility. In Advances in Neural Information Processing Systems, pages 1043–1051, 2013.

Maksims Volkovs and Richard S. Zemel. Efficient sampling for bipartite matching problems. In Advances in Neural Information Processing Systems, pages 1322–1330, 2012.

Pascal O. Vontobel. The Bethe permanent of a non-negative matrix. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 341–346. IEEE, 2010.
Finding the Marginal Matching

Sampling: Expensive, but doable [Huber, 2006, Volkovs and Zemel, 2012]. Used for maximum-likelihood learning [Petterson et al., 2009].

Sum-Product: Also requires the Bethe approximation. Works well in practice [Huang and Jebara, 2009] and in theory [Vontobel, 2010].

Frank-Wolfe: Basically the same algorithm as for graphical models. Same issue with curvature.
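The matching case can be spelled out by brute force (hypothetical tiny example, not from the talk): the Gibbs distribution over perfect matchings is $P(\sigma) \propto \exp(\sum_i \theta_{i,\sigma(i)})$, and its normalizer is a matrix permanent, which is why exact marginal inference for matchings is hard.

```python
import itertools
import math

# Exact "marginal matching" on a tiny bipartite problem: enumerate all
# permutations (perfect matchings), weight each by exp(sum of its scores),
# and accumulate edge marginals mu[i][j] = P(sigma(i) = j).
def matching_marginals(theta):
    n = len(theta)
    Z = 0.0
    mu = [[0.0] * n for _ in range(n)]
    for sigma in itertools.permutations(range(n)):
        w = math.exp(sum(theta[i][sigma[i]] for i in range(n)))
        Z += w
        for i in range(n):
            mu[i][sigma[i]] += w
    return [[m / Z for m in row] for row in mu]

# 2x2 example with hypothetical scores favoring the identity matching.
mu = matching_marginals([[1.0, 0.0], [0.0, 1.0]])
```

The returned marginal matrix is doubly stochastic, i.e., it lies in the Birkhoff polytope, the marginal polytope for this problem family.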