Marginal Inference in MRFs using Frank-Wolfe
David Belanger, Daniel Sheldon, Andrew McCallum
School of Computer Science, University of Massachusetts, Amherst
{belanger,sheldon,mccallum}@cs.umass.edu
December 10, 2013
Table of Contents
1. Markov Random Fields
2. Frank-Wolfe for Marginal Inference
3. Optimality Guarantees and Convergence Rate
4. Beyond MRFs
5. Fancier FW
Markov Random Fields

$$\Phi_\theta(x) = \sum_{c \in C} \theta_c(x_c)$$

$$P(x) = \frac{\exp\left(\Phi_\theta(x)\right)}{Z}$$

Overcomplete representation: $x \to \mu$, so that $\Phi_\theta(x) \to \langle \theta, \mu \rangle$.
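The definitions above can be made concrete with a brute-force sketch (a toy model with hypothetical numbers, not from the talk): score each assignment with $\Phi_\theta(x) = \sum_c \theta_c(x_c)$ and normalize by $Z$.

```python
import itertools
import math

# Toy MRF (hypothetical potentials): three binary variables with
# pairwise cliques (0,1) and (1,2).
theta = {
    (0, 1): {(0, 0): 0.5, (0, 1): -0.2, (1, 0): -0.2, (1, 1): 1.0},
    (1, 2): {(0, 0): 0.3, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.8},
}

def score(x):
    """Phi_theta(x) = sum over cliques c of theta_c(x_c)."""
    return sum(table[(x[i], x[j])] for (i, j), table in theta.items())

# Partition function Z = sum_x exp(Phi_theta(x)), so P(x) = exp(Phi_theta(x)) / Z.
states = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(score(x)) for x in states)
P = {x: math.exp(score(x)) / Z for x in states}
```

Enumerating all states is only feasible for tiny models; the point of the talk is avoiding exactly this enumeration.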
Marginal Inference

$$\mu_{\text{MARG}} = \mathbb{E}_{P_\theta}[\mu]$$

Variational formulation:
$$\mu_{\text{MARG}} = \arg\max_{\mu \in \mathcal{M}} \langle \mu, \theta \rangle + H_{\mathcal{M}}(\mu)$$

Tractable approximation:
$$\bar{\mu}_{\text{approx}} = \arg\max_{\mu \in \mathcal{L}} \langle \mu, \theta \rangle + H_B(\mu), \qquad H_B(\mu) = \sum_{c \in C} W_c H(\mu_c)$$
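For intuition, the exact marginals $\mu_{\text{MARG}} = \mathbb{E}_{P_\theta}[\mu]$ can be computed by enumeration on a tiny model (hypothetical numbers); this is exactly what becomes intractable at scale, motivating the variational formulations.

```python
import itertools
import math

# Toy 3-variable chain MRF (hypothetical potentials); exact marginal
# inference by enumerating all 2^3 states.
theta = {
    (0, 1): {(0, 0): 0.5, (0, 1): -0.2, (1, 0): -0.2, (1, 1): 1.0},
    (1, 2): {(0, 0): 0.3, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.8},
}

def score(x):
    return sum(table[(x[i], x[j])] for (i, j), table in theta.items())

states = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(score(x)) for x in states)

# Unary marginals mu_i = P(x_i = 1) = E_P[ indicator(x_i = 1) ].
mu = [sum(math.exp(score(x)) for x in states if x[i] == 1) / Z
      for i in range(3)]
```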
MAP Inference

$$\mu_{\text{MAP}} = \arg\max_{\mu \in \mathcal{M}} \langle \mu, \theta \rangle$$

[Diagram: $\theta$ → Black-Box MAP Solver → $\mu_{\text{MAP}}$; $\theta$ → Gray-Box MAP Solver → $\mu_{\text{MAP}}$]
Marginal → MAP Reductions
Hazan and Jaakkola [2012]; Ermon et al. [2013]
Generic FW with Line Search
$$y_t = \arg\min_{x \in \mathcal{X}} \langle x, -\nabla f(x_{t-1}) \rangle$$

$$x_t = \arg\max_{\gamma \in [0,1]} f\left((1-\gamma)\, x_{t-1} + \gamma\, y_t\right)$$
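A minimal sketch of generic Frank-Wolfe with exact line search, on a hypothetical toy problem not from the talk: minimizing a quadratic over the probability simplex, where the linear minimization oracle simply picks a vertex.

```python
# Minimize f(x) = ||x - b||^2 over the probability simplex with
# Frank-Wolfe.  The linear minimization oracle over the simplex returns
# the vertex whose gradient coordinate is smallest.
def fw_simplex(b, iters=100):
    n = len(b)
    x = [1.0] + [0.0] * (n - 1)                      # start at a vertex
    for _ in range(iters):
        grad = [2.0 * (xi - bi) for xi, bi in zip(x, b)]
        j = min(range(n), key=lambda i: grad[i])     # y_t = argmin_y <y, grad f>
        y = [1.0 if i == j else 0.0 for i in range(n)]
        d = [yi - xi for yi, xi in zip(y, x)]
        gap = -sum(g * di for g, di in zip(grad, d)) # Frank-Wolfe duality gap
        if gap < 1e-12:
            break
        # exact line search for a quadratic: gamma* = gap / (2 ||d||^2), clipped
        gamma = min(1.0, gap / (2.0 * sum(di * di for di in d)))
        x = [(1.0 - gamma) * xi + gamma * yi for xi, yi in zip(x, y)]
    return x

# b lies outside the simplex; FW converges to its Euclidean projection,
# [0.6, 0.4, 0.0]
x = fw_simplex([1.0, 0.8, -0.5])
```

The duality gap doubles as a stopping criterion, a standard property of Frank-Wolfe that the convergence results later in the talk rely on.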
Generic FW with Line Search

[Diagram: $x_{t-1}$ → Compute Gradient $\nabla f(x_{t-1})$ → Linear Minimization → $y_t$ → Line Search → $x_t$]
FW for Marginal Inference

[Diagram: Compute Gradient $\nabla F(\mu_t) = \theta + \nabla H(\mu_t)$ → $\tilde{\theta}$ → MAP Inference Oracle → $\tilde{\mu}_{\text{MAP}}$ → Line Search → $\mu_{t+1}$]
Subproblem Parametrization

$$F(\mu) = \langle \mu, \theta \rangle + \sum_{c \in C} W_c H(\mu_c)$$

$$\tilde{\theta} = \nabla F(\mu_t) = \theta + \sum_{c \in C} W_c \nabla H(\mu_c)$$
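Concretely, the entropy gradient has a closed form: for $H(\mu_c) = -\sum_i \mu_{c,i} \log \mu_{c,i}$, we get $\partial H / \partial \mu_{c,i} = -(1 + \log \mu_{c,i})$. A sketch with hypothetical numbers:

```python
import math

# Building the MAP oracle's parameters
#   tilde_theta = theta + sum_c W_c * grad H(mu_c)
# for a single clique.  With H(mu) = -sum_i mu_i log mu_i, the gradient is
# dH/dmu_i = -(1 + log mu_i); note it blows up as mu_i -> 0.
def entropy_grad(mu):
    return [-(1.0 + math.log(m)) for m in mu]

theta = [0.5, -0.2, 1.0]   # hypothetical potentials for one clique, flattened
mu_t = [0.2, 0.3, 0.5]     # current clique marginal
W_c = 1.0                  # hypothetical counting number / weight for this clique
tilde_theta = [t + W_c * g for t, g in zip(theta, entropy_grad(mu_t))]
```

The unbounded gradient near the boundary of the marginal polytope is the same phenomenon behind the curvature discussion later in the talk.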
Line Search

[Diagram: $\mu_{t+1}$ on the segment from $\mu_t$ toward $\tilde{\mu}_{\text{MAP}}$]

Computing the line search objective can scale with:
Bad: # possible values in cliques.
Good: # cliques in graph. (see paper)
Experiment #1
Convergence Rate

Convergence Rate of Frank-Wolfe [Jaggi, 2013]:
$$F(\mu^*) - F(\mu_t) \le \frac{2 C_F (1 + \delta)}{t + 2}$$

Here the linear subproblem at iteration $t$ may be solved to additive error $\frac{\delta C_F}{t+2}$, i.e., with bounded MAP suboptimality at iteration $t$, which is NP-hard to guarantee.

How to deal with MAP hardness?
Use a MAP solver and hope for the best [Hazan and Jaakkola, 2012].
Relax to the local polytope.
Curvature + Convergence Rate

$$C_f = \sup_{\substack{x, s \in \mathcal{D};\ \gamma \in [0,1];\\ y = x + \gamma(s - x)}} \frac{2}{\gamma^2} \left( f(y) - f(x) - \langle y - x, \nabla f(x) \rangle \right)$$

[Figure: entropy of a binary variable as a function of $P(x = 1)$, with $\mu_t$, $\mu_{t+1}$, and $\tilde{\mu}_{\text{MAP}}$ marked along the curve]
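The figure's point can be checked numerically (a hypothetical sketch, not from the talk): for the negative binary entropy $f(p) = p \log p + (1-p)\log(1-p)$, the term inside the curvature sup grows without bound as $x$ approaches the boundary, because $\nabla f$ is unbounded there.

```python
import math

# Scaled Bregman term from the curvature definition, specialized to
# gamma = 1 (so y = s):  2 * ( f(y) - f(x) - (y - x) * f'(x) ).
# As x -> 0 the term diverges, so C_f is unbounded over the full domain.
def f(p):
    return p * math.log(p) + (1.0 - p) * math.log(1.0 - p)

def fprime(p):
    return math.log(p / (1.0 - p))

def bregman_term(x, y):
    return 2.0 * (f(y) - f(x) - (y - x) * fprime(x))

# Push x toward the boundary with y fixed at 0.5: the term keeps growing.
terms = [bregman_term(eps, 0.5) for eps in (1e-1, 1e-3, 1e-6)]
```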
Experiment #2
Beyond MRFs

Question: Are MRFs the right Gibbs distributions in which to use Frank-Wolfe?
Problem Family                    MAP Algorithm        Marginal Algorithm
tree-structured graphical models  Viterbi              Forward-Backward
loopy graphical models            Max-Product BP       Sum-Product BP
Directed Spanning Tree            Chu-Liu-Edmonds      Matrix-Tree Theorem
Bipartite Matching                Hungarian Algorithm  ×
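As a concrete reference point for the spanning-tree row: the Matrix-Tree theorem computes spanning-tree edge marginals in closed form via the graph Laplacian. The brute-force enumeration below (undirected case, hypothetical toy graph) computes the same quantity the slow way, for a distribution $P(T) \propto \prod_{e \in T} w_e$.

```python
import itertools

# Brute-force spanning-tree edge marginals: enumerate all (n-1)-edge
# subsets, keep the ones that are spanning trees, and accumulate each
# tree's weight onto the edges it contains.
def spanning_tree_marginals(n, edges, weights):
    def is_spanning_tree(subset):
        parent = list(range(n))
        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a
        for u, v in subset:
            ru, rv = find(u), find(v)
            if ru == rv:
                return False      # adding this edge would create a cycle
            parent[ru] = rv
        return True                # n-1 edges, acyclic => spanning tree

    total = 0.0
    edge_mass = [0.0] * len(edges)
    for subset in itertools.combinations(range(len(edges)), n - 1):
        tree = [edges[i] for i in subset]
        if is_spanning_tree(tree):
            w = 1.0
            for i in subset:
                w *= weights[i]
            total += w
            for i in subset:
                edge_mass[i] += w
    return [m / total for m in edge_mass]

# Unweighted triangle: 3 spanning trees, each edge appears in 2 of them,
# so every edge marginal is 2/3.
m = spanning_tree_marginals(3, [(0, 1), (1, 2), (0, 2)], [1.0, 1.0, 1.0])
```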
Norm-regularized marginal inference
$$\mu_{\text{MARG}} = \arg\max_{\mu \in \mathcal{M}} \langle \mu, \theta \rangle + H_{\mathcal{M}}(\mu) + \lambda R(\mu)$$
Harchaoui et al. [2013].

Local linear oracle for MRFs?
$$\tilde{\mu}_t = \arg\max_{\mu \in \mathcal{M} \cap B_r(\mu_t)} \langle \mu, \theta \rangle$$
Garber and Hazan [2013]
Conclusion

We need to figure out how to handle the entropy gradient.
There are plenty of extensions to other Gibbs distributions and regularizers.
Further Reading I

Stefano Ermon, Carla Gomes, Ashish Sabharwal, and Bart Selman. Taming the curse of dimensionality: Discrete integration by hashing and optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 334–342, 2013.

D. Garber and E. Hazan. A Linearly Convergent Conditional Gradient Algorithm with Applications to Online and Stochastic Optimization. ArXiv e-prints, January 2013.

Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. arXiv preprint arXiv:1302.2325, 2013.

Tamir Hazan and Tommi S. Jaakkola. On the Partition Function and Random Maximum A-Posteriori Perturbations. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 991–998, 2012.

Bert Huang and Tony Jebara. Approximating the permanent with belief propagation. arXiv preprint arXiv:0908.1769, 2009.
Further Reading II

Mark Huber. Exact sampling from perfect matchings of dense regular bipartite graphs. Algorithmica, 44(3):183–193, 2006.

Martin Jaggi. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 427–435, 2013.

James Petterson, Tiberio Caetano, Julian McAuley, and Jin Yu. Exponential family graph matching and ranking. 2009.

Tim Roughgarden and Michael Kearns. Marginals-to-models reducibility. In Advances in Neural Information Processing Systems, pages 1043–1051, 2013.

Maksims Volkovs and Richard S. Zemel. Efficient sampling for bipartite matching problems. In Advances in Neural Information Processing Systems, pages 1322–1330, 2012.

Pascal O. Vontobel. The Bethe permanent of a non-negative matrix. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 341–346. IEEE, 2010.
Finding the Marginal Matching

Sampling: Expensive, but doable [Huber, 2006, Volkovs and Zemel, 2012]. Used for maximum-likelihood learning [Petterson et al., 2009].

Sum-Product: Also requires the Bethe approximation. Works well in practice [Huang and Jebara, 2009] and in theory [Vontobel, 2010].

Frank-Wolfe: Basically the same algorithm as for graphical models. Same issue with curvature.
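The matching case can be spelled out by brute force (hypothetical tiny example, not from the talk): the Gibbs distribution over perfect matchings is $P(\sigma) \propto \exp(\sum_i \theta_{i,\sigma(i)})$, and its normalizer is a matrix permanent, which is why exact marginal inference for matchings is hard.

```python
import itertools
import math

# Exact "marginal matching" on a tiny bipartite problem: enumerate all
# permutations (perfect matchings), weight each by exp(sum of its scores),
# and accumulate edge marginals mu[i][j] = P(sigma(i) = j).
def matching_marginals(theta):
    n = len(theta)
    Z = 0.0
    mu = [[0.0] * n for _ in range(n)]
    for sigma in itertools.permutations(range(n)):
        w = math.exp(sum(theta[i][sigma[i]] for i in range(n)))
        Z += w
        for i in range(n):
            mu[i][sigma[i]] += w
    return [[m / Z for m in row] for row in mu]

# 2x2 example with hypothetical scores favoring the identity matching.
mu = matching_marginals([[1.0, 0.0], [0.0, 1.0]])
```

The returned marginal matrix is doubly stochastic, i.e., it lies in the Birkhoff polytope, the marginal polytope for this problem family.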