Marginal Inference in MRFs using Frank-Wolfe
David Belanger, Daniel Sheldon, Andrew McCallum
School of Computer Science, University of Massachusetts, Amherst
{belanger,sheldon,mccallum}@cs.umass.edu

December 10, 2013

Table of Contents

1. Markov Random Fields
2. Frank-Wolfe for Marginal Inference
3. Optimality Guarantees and Convergence Rate
4. Beyond MRFs
5. Fancier FW



Markov Random Fields

Φ_θ(x) = Σ_{c∈C} θ_c(x_c)

P(x) = exp(Φ_θ(x) − log Z)

Overcomplete representation: x → µ, so that Φ_θ(x) → ⟨θ, µ⟩
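As a concrete illustration of the x → µ mapping (my own sketch, not from the slides; the chain structure, variable names, and sizes are all made up), the score Φ_θ(x) becomes linear in the overcomplete indicator vector µ:

```python
# Sketch: overcomplete representation for a tiny chain MRF with binary
# variables, unary and pairwise cliques. Illustrative names only.
import numpy as np

n_vars, n_states = 3, 2
theta_unary = np.random.randn(n_vars, n_states)               # theta_c for singleton cliques
theta_pair = np.random.randn(n_vars - 1, n_states, n_states)  # theta_c for edge cliques

def score(x):
    """Phi_theta(x) = sum_c theta_c(x_c), over unary and edge cliques."""
    s = sum(theta_unary[i, x[i]] for i in range(n_vars))
    s += sum(theta_pair[i, x[i], x[i + 1]] for i in range(n_vars - 1))
    return s

def overcomplete(x):
    """Map a configuration x to its indicator vector mu, one block per clique."""
    mu_unary = np.zeros_like(theta_unary)
    mu_pair = np.zeros_like(theta_pair)
    for i in range(n_vars):
        mu_unary[i, x[i]] = 1.0
    for i in range(n_vars - 1):
        mu_pair[i, x[i], x[i + 1]] = 1.0
    return mu_unary, mu_pair

x = [0, 1, 1]
mu_u, mu_p = overcomplete(x)
# In this representation the score is linear in mu: Phi_theta(x) = <theta, mu>.
assert np.isclose(score(x), (theta_unary * mu_u).sum() + (theta_pair * mu_p).sum())
```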

Marginal Inference

µ_MARG = E_{P_θ}[µ]

µ_MARG = arg max_{µ∈M} ⟨µ, θ⟩ + H_M(µ)

µ̄_approx = arg max_{µ∈L} ⟨µ, θ⟩ + H_B(µ)

H_B(µ) = Σ_{c∈C} W_c H(µ_c)
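A minimal sketch of the weighted entropy surrogate H_B (illustrative only; the dict-of-arrays marginal representation and helper names are my assumptions, and the counting numbers W_c would come from, e.g., a Bethe or TRW construction):

```python
# Sketch: H_B(mu) = sum_c W_c * H(mu_c) for clique marginals stored as arrays.
import numpy as np

def clique_entropy(mu_c, eps=1e-12):
    """Shannon entropy of one clique marginal (a normalized array)."""
    p = np.clip(mu_c, eps, 1.0)
    return -(p * np.log(p)).sum()

def weighted_entropy(clique_marginals, weights):
    """H_B(mu): weights[c] are counting numbers W_c, e.g. Bethe or TRW weights."""
    return sum(weights[c] * clique_entropy(mu_c)
               for c, mu_c in clique_marginals.items())
```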

MAP Inference

µ_MAP = arg max_{µ∈M} ⟨µ, θ⟩

[Diagram: θ passed to a black-box MAP solver and, alternatively, a gray-box MAP solver, each returning µ_MAP]

Marginal → MAP Reductions

Hazan and Jaakkola [2012]
Ermon et al. [2013]


Generic FW with Line Search

y_t = arg max_{x∈X} ⟨x, −∇f(x_{t−1})⟩

γ_t = arg min_{γ∈[0,1]} f((1 − γ)x_{t−1} + γ y_t),   x_t = (1 − γ_t)x_{t−1} + γ_t y_t

Generic FW with Line Search

[Diagram: x_{t−1} → Compute Gradient → ∇f(x_{t−1}) → Linear Minimization → y_t → Line Search → x_t]
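The loop above can be sketched in a few lines (a generic illustration under my own assumptions, not the authors' implementation; `lmo` stands in for whatever solves the linear subproblem over X):

```python
# Sketch: generic Frank-Wolfe with exact line search over a compact convex set,
# given a linear-minimization oracle `lmo(g)` returning arg min_{y in X} <y, g>.
import numpy as np
from scipy.optimize import minimize_scalar

def frank_wolfe(x0, f, grad_f, lmo, n_iters=100):
    x = x0
    for _ in range(n_iters):
        g = grad_f(x)
        y = lmo(g)                      # linear minimization: a vertex of X
        # exact line search over the segment [x, y]
        res = minimize_scalar(lambda gamma: f((1 - gamma) * x + gamma * y),
                              bounds=(0.0, 1.0), method="bounded")
        x = (1 - res.x) * x + res.x * y
    return x
```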

FW for Marginal Inference

[Diagram: Compute Gradient ∇F(µ_t) = θ + ∇H(µ_t), giving perturbed potentials θ̃ → MAP Inference Oracle → µ̃_MAP → Line Search → µ_{t+1}]

Subproblem Parametrization

F(µ) = ⟨µ, θ⟩ + Σ_{c∈C} W_c H(µ_c)

θ̃ = ∇F(µ_t) = θ + Σ_{c∈C} W_c ∇H(µ_c)
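One FW step for marginal inference might then look like the following sketch (my assumptions: marginals and potentials stored as dicts of arrays per clique, and a user-supplied `map_oracle` that maximizes ⟨µ, θ̃⟩ over the polytope and returns a vertex; none of these names come from the authors' code):

```python
# Sketch: compute the perturbed potentials theta_tilde and take one FW step.
import numpy as np

def perturbed_potentials(theta, mu, weights, eps=1e-12):
    """theta_tilde = grad F(mu) = theta + sum_c W_c * grad H(mu_c),
    where grad H(p) = -(log p + 1) entrywise."""
    return {c: theta[c] + weights[c] * (-(np.log(np.clip(mu[c], eps, 1.0)) + 1.0))
            for c in theta}

def fw_marginal_step(theta, mu_t, weights, map_oracle, gamma):
    theta_tilde = perturbed_potentials(theta, mu_t, weights)
    mu_map = map_oracle(theta_tilde)          # black- or gray-box MAP call
    # convex update toward the MAP vertex (gamma from line search or 2/(t+2))
    return {c: (1 - gamma) * mu_t[c] + gamma * mu_map[c] for c in mu_t}
```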

Line Search

[Diagram: the segment from µ_t to µ̃_MAP, with µ_{t+1} chosen along it]

Computing line search objective can scale with:
Bad: # possible values in cliques.
Good: # cliques in graph. (see paper)
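A naive sketch of the line search over γ (illustrative only; this version evaluates the decomposed objective directly, so each evaluation touches every entry of every clique marginal — the paper discusses how to do better):

```python
# Sketch: pick gamma in [0,1] maximizing F((1-gamma) mu_t + gamma mu_map).
import numpy as np
from scipy.optimize import minimize_scalar

def entropy(p, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    return -(p * np.log(p)).sum()

def line_search(mu_t, mu_map, theta, weights):
    def neg_F(gamma):
        val = 0.0
        for c in mu_t:      # one pass over cliques
            mu_c = (1 - gamma) * mu_t[c] + gamma * mu_map[c]
            val += (theta[c] * mu_c).sum() + weights[c] * entropy(mu_c)
        return -val         # maximize F by minimizing -F
    res = minimize_scalar(neg_F, bounds=(0.0, 1.0), method="bounded")
    return res.x
```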

Experiment #1



Convergence Rate

Convergence Rate of Frank-Wolfe [Jaggi, 2013]

F(µ*) − F(µ_t) ≤ 2C_F(1 + δ)/(t + 2),

provided the linear subproblem at iteration t is solved to within additive error δC_F/(t + 2), i.e. the allowed MAP suboptimality at iteration t.

MAP suboptimality at iter t −→ NP-Hard

How to deal with MAP hardness?
Use MAP solver and hope for the best [Hazan and Jaakkola, 2012].
Relax to the local polytope.
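Stated more explicitly (my restatement of Jaggi's approximate-oracle condition, written for maximizing F; not copied from the slides):

```latex
% Frank-Wolfe with an approximate linear oracle (Jaggi, 2013), maximizing F:
% the vertex s_t may be suboptimal by an additive slack tied to the step size gamma_t.
\[
\langle s_t, \nabla F(\mu_t) \rangle \;\ge\; \max_{s \in \mathcal{M}} \langle s, \nabla F(\mu_t) \rangle \;-\; \tfrac{1}{2}\,\delta\,\gamma_t\, C_F ,
\qquad \gamma_t = \tfrac{2}{t+2},
\]
\[
\text{which gives}\qquad F(\mu^{*}) - F(\mu_t) \;\le\; \frac{2\, C_F\, (1+\delta)}{t+2}.
\]
```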

Curvature + Convergence Rate

C_f = sup_{x,s∈D; γ∈[0,1]; y=x+γ(s−x)} (2/γ²) (f(y) − f(x) − ⟨y − x, ∇f(x)⟩)

[Figure: binary entropy plotted against P(x = 1), with the interior iterates µ_t, µ_{t+1} and the vertex µ̃_MAP marked]
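To make the concern concrete, a one-variable calculation (my own addition, not on the slides): the entropy's gradient diverges at exactly the vertices the MAP oracle returns, which is what drives the curvature C_f up.

```latex
% Binary entropy and its derivative; the derivative diverges as mu -> 0 or 1,
% so the Bregman gap in the definition of C_f is unbounded near the vertices.
\[
H(\mu) = -\mu \log \mu - (1-\mu)\log(1-\mu),
\qquad
H'(\mu) = \log\frac{1-\mu}{\mu} \;\xrightarrow[\mu \to 0^{+}]{}\; +\infty .
\]
```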

Experiment #2



Beyond MRFs

Question: Are MRFs the right Gibbs distribution on which to use Frank-Wolfe?

Problem Family                     MAP Algorithm         Marginal Algorithm
tree-structured graphical models   Viterbi               Forward-Backward
loopy graphical models             Max-Product BP        Sum-Product BP
Directed Spanning Tree             Chu-Liu-Edmonds       Matrix Tree Theorem
Bipartite Matching                 Hungarian Algorithm   ×


Fancier FW

Norm-regularized marginal inference
µ_MARG = arg max_{µ∈M} ⟨µ, θ⟩ + H_M(µ) + λR(µ)
Harchaoui et al. [2013]

Local linear oracle for MRFs?
µ̃_t = arg max_{µ∈M∩B_r(µ_t)} ⟨µ, θ⟩
Garber and Hazan [2013]

Conclusion

We need to figure out how to handle the entropy gradient.
There are plenty of extensions to further Gibbs distributions + regularizers.

Further Reading

Stefano Ermon, Carla Gomes, Ashish Sabharwal, and Bart Selman. Taming the curse of dimensionality: Discrete integration by hashing and optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 334–342, 2013.

Dan Garber and Elad Hazan. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv e-prints, January 2013.

Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. arXiv preprint arXiv:1302.2325, 2013.

Tamir Hazan and Tommi S. Jaakkola. On the partition function and random maximum a-posteriori perturbations. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 991–998, 2012.

Bert Huang and Tony Jebara. Approximating the permanent with belief propagation. arXiv preprint arXiv:0908.1769, 2009.

Mark Huber. Exact sampling from perfect matchings of dense regular bipartite graphs. Algorithmica, 44(3):183–193, 2006.

Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 427–435, 2013.

James Petterson, Tiberio Caetano, Julian McAuley, and Jin Yu. Exponential family graph matching and ranking. 2009.

Tim Roughgarden and Michael Kearns. Marginals-to-models reducibility. In Advances in Neural Information Processing Systems, pages 1043–1051, 2013.

Maksims Volkovs and Richard S. Zemel. Efficient sampling for bipartite matching problems. In Advances in Neural Information Processing Systems, pages 1322–1330, 2012.

Pascal O. Vontobel. The Bethe permanent of a non-negative matrix. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 341–346. IEEE, 2010.

Finding the Marginal Matching

Sampling: Expensive, but doable [Huber, 2006, Volkovs and Zemel, 2012]. Used for maximum-likelihood learning [Petterson et al., 2009].

Sum-Product: Also requires the Bethe approximation. Works well in practice [Huang and Jebara, 2009] and in theory [Vontobel, 2010].

Frank-Wolfe: Basically the same algorithm as for graphical models. Same issue with curvature.
