Decomposing Structured Prediction via Constrained Conditional Models

Dan Roth

Department of Computer Science, University of Illinois at Urbana-Champaign
SLG Workshop, ICML, Atlanta GA, June 2013
With thanks to collaborators: Ming-Wei Chang, Lev Ratinov, Rajhans Samdani, Vivek Srikumar, and many others.
Funding: NSF; DHS; NIH; DARPA. DASH Optimization (Xpress-MP)
Page 1

Comprehension

(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.

1. Christopher Robin was born in England.
2. Winnie the Pooh is a title of a book.
3. Christopher Robin's dad was a magician.
4. Christopher Robin must be at least 65 now.

This is an Inference Problem.

Page 2

Learning and Inference

- Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome.
  - In current NLP we often think about simpler structured problems: Parsing, Information Extraction, SRL, etc.
  - As we move up the problem hierarchy (Textual Entailment, QA, ...) not all component models can be learned simultaneously.
  - We need to think about (learned) models for different sub-problems, often pipelined.
  - Knowledge relating sub-problems (constraints) becomes more essential and may appear only at evaluation time.
- Goal: Incorporate models' information, along with prior knowledge (constraints), in making coherent decisions:
  - Decisions that respect the local models as well as domain- and context-specific knowledge/constraints.

Page 3

Outline

- Constrained Conditional Models
  - A formulation for global inference with knowledge modeled as expressive structural constraints
  - A Structured Prediction Perspective
- Decomposed Learning (DecL)
  - Efficient structure learning by reducing the learning-time inference to a small output space
  - Provide conditions for when DecL is provably identical to global structural learning (GL)

Page 4

Three Ideas Underlying Constrained Conditional Models

- Idea 1 (Modeling): Separate modeling and problem formulation from algorithms.
  - Similar to the philosophy of probabilistic modeling.
- Idea 2 (Inference): Keep models simple, make expressive decisions (via constraints).
  - Unlike probabilistic modeling, where models become more expressive.
- Idea 3 (Learning): Expressive structured decisions can be supported by simply learned models.
  - Amplified and minimally supervised by exploiting dependencies among models' outcomes.

Page 5

Inference with General Constraint Structure [Roth & Yih '04, '07]
Recognizing Entities and Relations (significant performance improvement)

(The slide's figure shows the sentence "Dole 's wife, Elizabeth , is a native of N.C." with entity variables E1, E2, E3 and relation variables R12, R23. Independently learned models assign a score to each candidate label: per / loc / other for the entities, and spouse_of / born_in / irrelevant for the relations.)

An objective function combines the local scores,
    y = argmax Σ score(y = v) · 1[y = v]
      = argmax score(E1 = PER) · 1[E1 = PER] + score(E1 = LOC) · 1[E1 = LOC] + ... + score(R12 = S-of) · 1[R12 = S-of] + ...,
maximized subject to constraints relating the variables. Note: this is not a sequential model; it is one way for independently learned models to incorporate knowledge.

Key questions: How to guide the global inference over independently learned or pipelined models (pipeline vs. inference subject to constraints)? How to learn: independently or jointly?

Models could be learned separately; constraints may come up only at decision time.

Page 6

Constrained Conditional Models

The CCM objective (displayed graphically on the slide) combines two components:
- A weight vector for "local" models, over features and classifiers: log-linear models (HMM, CRF) or a combination.
- A (soft) constraints component: a penalty for violating each constraint, scaled by how far y is from a "legal" assignment.
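The formula itself did not survive extraction. A standard way to write the CCM objective, consistent with the annotations above (a reconstruction, not text copied from the slide), is:

    y^* = \arg\max_{y \in Y} \; w^{\top}\phi(x,y) \;-\; \sum_{k} \rho_k \, d\big(y, 1_{C_k(x)}\big)

where w^{\top}\phi(x,y) is the score of the local models, each C_k is a constraint with violation penalty \rho_k, and d(y, 1_{C_k(x)}) measures how far y is from an assignment satisfying C_k (a hard constraint corresponds to \rho_k = \infty).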

How to solve? This is an Integer Linear Program. Solving it with ILP packages gives an exact solution; Cutting Planes, Dual Decomposition, and other search techniques are possible. (Cf. the ICML "Inferning" workshop.)

How to train? Training is learning the objective function. Decouple? Decompose? How to exploit the structure to minimize supervision?

Page 7

Placing in context: a crash course in structured prediction

Structured Prediction: Inference

- Inference: given input x (a document, a sentence), predict the best structure y = {y1, y2, ..., yn} ∈ Y (entities & relations).
  - Assign values to y1, y2, ..., yn, accounting for dependencies among the yi's.
- Inference is expressed as a maximization of a scoring function:
      y' = argmax_{y ∈ Y} w^T φ(x, y)
  where Y is the set of allowed structures, w the feature weights (estimated during learning), and φ(x, y) the joint features on inputs and outputs.
- Inference requires, in principle, enumerating all y ∈ Y at decision time, when we are given x ∈ X and attempt to determine the best y ∈ Y for it, given w.
  - For some structures, inference is computationally easy, e.g., using the Viterbi algorithm.
  - In general, it is NP-hard (and can be formulated as an ILP).

Page 8

Structured Prediction: Learning

- Learning: given a set of structured examples {(x, y)}, find a scoring function w that minimizes empirical loss.
- Learning is thus driven by the attempt to find a weight vector w such that for each given annotated example (xi, yi) the conditions on the next slide hold.

Page 9

Structured Prediction: Learning

- Learning: given a set of structured examples {(x, y)}, find a scoring function w that minimizes empirical loss.
- Learning is thus driven by the attempt to find a weight vector w such that for each given annotated example (xi, yi):
      w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(y, yi)    ∀ y
  (the score of the annotated structure beats the score of any other structure by at least the penalty for predicting that other structure).
- We call these conditions the learning constraints.
- In most structured learning algorithms used today, the update of the weight vector w is done in an on-line fashion.
- W.l.o.g. (almost) we can thus write the generic structured learning algorithm as follows.
- What follows is a Structured Perceptron, but with minor variations this procedure applies to CRFs and Linear Structured SVMs.

Page 10

Structured Prediction: Learning Algorithm

Note: in the structured case, the prediction (inference) step is often intractable and needs to be done many times.

- For each example (xi, yi), do (with the current weight vector w):
  - Predict: perform inference with the current weight vector
        yi' = argmax_{y ∈ Y} w^T φ(xi, y)
  - Check the learning constraints: is the score of the current prediction better than that of (xi, yi)?
    - If yes (a mistaken prediction): update w.
    - Otherwise: no need to update w on this example.
- EndFor

Page 11
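As a concrete illustration of the procedure above, here is a minimal structured-perceptron sketch in Python. The feature map phi, the candidate generator candidates(x), and the brute-force argmax are placeholders of my own (real systems use Viterbi or an ILP for the inference step); this is a sketch of the generic algorithm, not code from the talk.

```python
import numpy as np

def structured_perceptron(data, phi, candidates, dim, epochs=5):
    """Generic structured learning loop: predict with the current w, update on mistakes.

    data:       iterable of (x, y_gold) pairs
    phi:        joint feature map phi(x, y) -> np.ndarray of length dim
    candidates: function x -> iterable of candidate structures y (the space Y)
    """
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in data:
            # Predict: inference with the current weight vector (brute force here).
            y_pred = max(candidates(x), key=lambda y: w @ phi(x, y))
            # Learning-constraint check: update w only on a mistaken prediction.
            if y_pred != y_gold:
                w += phi(x, y_gold) - phi(x, y_pred)
    return w
```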

Structured Prediction: Learning Algorithm

Solution I: decompose the scoring function into EASY and HARD parts.

- For each example (xi, yi), do:
  - Predict: perform inference with the current weight vector
        yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y) + w_HARD^T φ_HARD(xi, y)
  - Check the learning constraint: is the score of the current prediction better than that of (xi, yi)?
    - If yes (a mistaken prediction): update w.
    - Otherwise: no need to update w on this example.
- EndDo

EASY could be feature functions that correspond to an HMM, a linear CRF, or a bank of classifiers (omitting the dependence on y at learning time). This may not be enough if the HARD part is still part of each inference step.

Page 12

Structured Prediction: Learning Algorithm

Solution II: disregard some of the dependencies; assume a simple model.

- For each example (xi, yi), do:
  - Predict: perform inference with the current weight vector, using only the simple (EASY) part of the model
        yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y)
  - Check the learning constraint: is the score of the current prediction better than that of (xi, yi)?
    - If yes (a mistaken prediction): update w.
    - Otherwise: no need to update w on this example.
- EndDo

Page 13

Structured Prediction: Learning Algorithm

Solution III: disregard some of the dependencies during learning; take them into account at decision time.

- For each example (xi, yi), do:
  - Predict: perform inference with the current weight vector over the simple model
        yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y)
  - Check the learning constraint: is the score of the current prediction better than that of (xi, yi)?
    - If yes (a mistaken prediction): update w.
    - Otherwise: no need to update w on this example.
- EndDo

At decision time, bring the hard dependencies back:
    yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y) + w_HARD^T φ_HARD(xi, y)

This is the most commonly used solution in NLP today.

Page 14
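A minimal sketch of this decision-time step (learn with the simple model, add the knowledge back when predicting), using hypothetical helpers of my own: phi_easy for the simple model's features, candidates(x) for the output space, and a list of (check, rho) pairs for soft constraints. It illustrates the idea by enumeration; the actual systems in the talk formulate this step as an ILP.

```python
def decode_with_constraints(w, x, phi_easy, candidates, constraints):
    """L+I / CCM-style decision: score each candidate with the simply learned model,
    then subtract a penalty rho for every violated constraint (brute-force argmax)."""
    def score(y):
        penalty = sum(rho for check, rho in constraints if not check(x, y))
        return w @ phi_easy(x, y) - penalty
    return max(candidates(x), key=score)
```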

Examples: CCM Formulations

The (soft) constraints component is more general since constraints can be declarative, non-grounded statements. CCMs can be viewed as a general interface for easily combining declarative domain knowledge with data-driven statistical models.

Formulate NLP problems as ILP problems (inference may be done otherwise):
1. Sequential prediction / sequence tagging (HMM/CRF + global constraints): Argmax Σ λij xij, with linguistic constraints such as "cannot have both A states and B states in an output sequence."
2. Sentence compression / summarization (language model + global constraints): Argmax Σ λijk xijk, with constraints such as "if a modifier is chosen, include its head; if a verb is chosen, include its arguments."
3. SRL (independent classifiers + global constraints).

Constrained Conditional Models allow:
- Learning a simple model (or multiple models; or pipelines).
- Making decisions with a more complex model.
- Accomplished by directly incorporating constraints to bias/re-rank global decisions composed of simpler models' decisions.
- More sophisticated algorithmic approaches exist to bias the output [CoDL: Chang et al. 07, 12; PR: Ganchev et al. 10; UEM: Samdani et al. 12].

Page 15

Outline

- Constrained Conditional Models
  - A formulation for global inference with knowledge modeled as expressive structural constraints
  - A Structured Prediction Perspective
- Decomposed Learning (DecL)
  - Efficient structure learning by reducing the learning-time inference to a small output space
  - Provide conditions for when DecL is provably identical to global structural learning (GL)

Page 16

Training Constrained Conditional Models

- Training options:
  - Independently of the constraints (L+I)
  - Jointly, in the presence of the constraints (IBT, GL)
  - Decomposed to simpler models
- Two kinds of decomposition: decompose the model, and decompose the model from the constraints.
- Not surprisingly, decomposition is good. See [Chang et al., Machine Learning Journal 2012].
- Little can be said theoretically about the quality/generalization of predictions made with a decomposed model.
- Next: an algorithmic approach to decomposition that is both good and comes with interesting guarantees.

Page 17

Decomposed Structured Prediction

- Inference: y = argmax_{y ∈ Y} w^T φ(x, y)
- Learning is driven by the attempt to find a weight vector w such that for each given annotated example (xi, yi):
      w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(y, yi)    ∀ y
- In Global Learning, the output space is exponential in the number of variables, so accurate learning can be intractable.
- "Standard" ways to decompose it forget some of the structure and bring it back only at decision time.

(The slide's figure shows an entity/relation example with output variables y1, ..., y6, an entity weight vector we and a relation weight vector wr, contrasting what is used during learning with what is used during inference.)

Page 18

Decomposed Structural Learning (DecL) [Samdani & Roth, ICML'12]

- Algorithm: restrict the 'argmax' inference to a small subset of variables while fixing the remaining variables to their ground-truth values yi ...
  - ... and repeat for different subsets of the output variables: a decomposition.
  - The resulting set of assignments considered for yi is called a neighborhood, nbr(yi).
- (The slide's figure shows variables y1, ..., y6 split into subsets, with the assignments enumerated per subset, e.g. 00/01/10/11 for a pair and 000-111 for a triple.)
- Related work: Pseudolikelihood (Besag, 77); Piecewise Pseudolikelihood (Sutton and McCallum, 07); Pseudomax (Sontag et al., 10).
- Key contribution: we give conditions under which DecL is provably equivalent to Global Learning (GL), and show experimentally that DecL provides results close to GL when such conditions do not exactly hold.

Page 19
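The restriction is easy to state in code. Below is a small Python sketch (my own illustration, assuming binary output variables and a decomposition given as a list of index subsets): the neighborhood of a gold labeling contains every assignment that differs from it only inside one subset.

```python
from itertools import product

def neighborhood(y_gold, decomposition):
    """All assignments that keep y_gold fixed outside one subset S of the decomposition
    and vary the variables inside S (binary labels assumed)."""
    nbr = {tuple(y_gold)}
    for S in decomposition:
        for vals in product([0, 1], repeat=len(S)):
            y = list(y_gold)
            for i, v in zip(S, vals):
                y[i] = v
            nbr.add(tuple(y))
    return nbr
```

During DecL training, the loss-augmented argmax for example (xj, yj) is taken over nbr(yj) instead of over all of Y.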

What are good neighborhoods?

DecL vs. Global Learning (GL)

- GL: separate the ground truth from every y ∈ Y:
      w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(y, yi)    ∀ y ∈ Y
  (e.g., 2^6 = 64 outputs for six binary variables).
- DecL: separate the ground truth from every y ∈ nbr(yj):
      w^T φ(xj, yj) ≥ w^T φ(xj, y) + Δ(y, yj)    ∀ y ∈ nbr(yj)
  (e.g., 16 outputs).
- Likely scenario: nbr(yj) is much smaller than Y.
- (The slide's figure contrasts the full enumeration over y1, ..., y6 with the much smaller per-subset enumerations.)

Page 20

Creating Decompositions

- DecL allows different decompositions Sj for different training instances yj.
- Example: learning with decompositions in which all subsets of size k are considered: DecL-k (a sketch follows after this slide).
  - For each k-subset of the variables, enumerate its assignments; keep the other n-k variables at their gold values.
  - k=1 is Pseudomax [Sontag et al., 2010].
  - k=2 is Constraint Classification [Har-Peled, Zimak, Roth 2002; Crammer, Singer 2002].
- In practice, neighborhoods should be determined based on domain knowledge:
  - Put highly coupled variables in the same set.
- The goal is to get results that are close to doing exact inference. Are there small & good neighborhoods?

Page 21
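Continuing the neighborhood sketch shown earlier, the DecL-k decompositions can be enumerated directly; this is again my own illustrative code, not code from the paper.

```python
from itertools import combinations

def decl_k_decomposition(n_vars, k):
    """All size-k subsets of the output variables; each subset is varied jointly
    while the remaining n - k variables stay at their gold values."""
    return [list(S) for S in combinations(range(n_vars), k)]

# Example: DecL-2 neighborhoods for six binary variables.
# nbr = neighborhood(y_gold, decl_k_decomposition(6, 2))
```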

Exactness of DecL

- Key result: YES. Under "reasonable conditions", DecL with small-sized neighborhoods nbr(yj) gives the same results as Global Learning.
- For analyzing the equivalence between DecL and GL, we need a notion of 'separability' of the data.
- Separability: existence of a set of weights W* that satisfy
      W* = {w | w·φ(xj, yj) ≥ w·φ(xj, y) + Δ(yj, y), ∀ y ∈ Y}
  (the score of the ground truth yj beats the score of every other y by the required margin).
- Separating weights for DecL (the same condition, over a different label space):
      Wdecl = {w | w·φ(xj, yj) ≥ w·φ(xj, y) + Δ(yj, y), ∀ y ∈ nbr(yj)}
- Naturally, W* ⊆ Wdecl.
- Exactness results: the set of separating weights for DecL equals the set of separating weights for GL: W* = Wdecl.

Page 22

Example of Exactness: Pairwise Markov Networks

- The scoring function is defined over a graph with edges E, with singleton/vertex components φi(.; w) and pairwise/edge components φi,k(.; w).
- Assume domain knowledge on W*: for correct (separating) w ∈ W*, each of the pairwise components φi,k(.; w) is either
  - Submodular: φi,k(0,0) + φi,k(1,1) > φi,k(0,1) + φi,k(1,0), or
  - Supermodular: φi,k(0,0) + φi,k(1,1) < φi,k(0,1) + φi,k(1,0).
- (The slide's figure illustrates the two cases on a six-variable pairwise network with binary labels.)

Page 23
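Read concretely, the two conditions are simple inequalities on the 2x2 table of an edge potential. The sketch below (indexing phi[a][b] for labels a, b in {0, 1}) just restates them; the representation of the potentials is an assumption of mine.

```python
def is_submodular(phi):
    """Edge potential favors agreeing labels: phi(0,0) + phi(1,1) > phi(0,1) + phi(1,0)."""
    return phi[0][0] + phi[1][1] > phi[0][1] + phi[1][0]

def is_supermodular(phi):
    """Edge potential favors disagreeing labels: phi(0,0) + phi(1,1) < phi(0,1) + phi(1,0)."""
    return phi[0][0] + phi[1][1] < phi[0][1] + phi[1][0]
```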

Decomposition for Pairwise Markov Networks

- The scoring function is defined over the graph:
      f(x, y; w) = Σi φi(yi, x; w) + Σ(i,k)∈E φi,k(yi, yk, x; w)
- For an example (xj, yj), define E^j by removing from E the edges whose gold labels disagree with the corresponding φ's preference, i.e. keep the edges
      E^j = {(u,v) ∈ E | (sub(φuv) ∧ yu^j = yv^j) ∨ (sup(φuv) ∧ yu^j ≠ yv^j)}
- Theorem: decomposing the variables according to the connected components of E^j yields exactness.

Page 24
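A sketch of this construction in Python (assuming each edge has already been tagged as submodular or supermodular, and using a small union-find to extract the components); the helper names and data layout are my own, not the paper's.

```python
def decomposition_for_example(n_vars, edges, edge_kind, y_gold):
    """edges: list of (u, v) pairs; edge_kind[(u, v)] is 'sub' or 'sup'.
    Keep edge (u, v) in E^j iff (submodular and gold labels agree) or
    (supermodular and gold labels differ); return the connected components
    of (V, E^j) as the subsets of variables to vary jointly."""
    kept = [(u, v) for (u, v) in edges
            if (edge_kind[(u, v)] == 'sub') == (y_gold[u] == y_gold[v])]

    parent = list(range(n_vars))          # union-find over the kept edges
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for u, v in kept:
        parent[find(u)] = find(v)

    components = {}
    for i in range(n_vars):
        components.setdefault(find(i), []).append(i)
    return list(components.values())
```

These components can be fed directly to the neighborhood(...) sketch shown earlier.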

Experiments: Information Extraction

Citation: Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .

Prediction result of a trained HMM (the slide color-codes the predicted fields [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE]): the prediction violates lots of natural constraints!

Page 25

Adding Expressivity via Constraints

- Each field must be a consecutive list of words and can appear at most once in a citation.
- State transitions must occur on punctuation marks.
- The citation can only start with AUTHOR or EDITOR.
- The words pp., pages correspond to PAGE.
- Four digits starting with 20xx and 19xx are DATE.
- Quotations can appear only in TITLE.
- .......

Page 26
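To make the constraint list concrete, here is how a couple of them could be checked on a candidate label sequence (one label per token). This is a hedged sketch with invented helper names, not the ILP encoding used in the actual system.

```python
def at_most_once_and_consecutive(labels, field):
    """Each field must be a consecutive block of tokens appearing at most once."""
    positions = [i for i, l in enumerate(labels) if l == field]
    return not positions or positions == list(range(positions[0], positions[-1] + 1))

def starts_with_author_or_editor(labels):
    """The citation can only start with AUTHOR or EDITOR."""
    return bool(labels) and labels[0] in ("AUTHOR", "EDITOR")
```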

Information Extraction with Constraints

- Adding constraints, we get correct results!
      [AUTHOR] Lars Ole Andersen . [TITLE] Program analysis and specialization for the C Programming language . [TECH-REPORT] PhD thesis . [INSTITUTION] DIKU , University of Copenhagen , [DATE] May, 1994 .
- Experimental goal:
  - Investigate DecL with small neighborhoods.
  - Note that the required theoretical conditions hold only approximately: output tokens tend to appear in contiguous blocks.
  - Use neighborhoods similar to the PMN (pairwise Markov network) case.

Page 27

Typical Results: Information Extraction (Ads Data)

(Chart: accuracy / F1 scores for HMM (LL), L+I, GL, and DecL; the y-axis spans roughly 75 to 81.)

Page 28

Typical Results: Information Extraction (Ads Data)

(Chart: F1 scores for HMM (LL), L+I, GL, and DecL, alongside the time taken to train in minutes; the axes span roughly 75 to 81 for F1 and 0 to 80 minutes for time.)

Page 29

Thank You!

Conclusion

- Presented Constrained Conditional Models:
  - An ILP formulation for structured prediction that augments statistically learned models with declarative constraints, as a way to incorporate knowledge and support decisions in expressive output spaces.
  - Supports joint inference while maintaining modularity and tractability of training.
  - Interdependent components are learned (independently or pipelined) and, via joint inference, support coherent decisions, modulo declarative constraints.
- Presented Decomposed Learning (DecL): efficient joint learning by reducing the learning-time inference to a small output space.
  - Provided conditions for when DecL is provably identical to global structural learning (GL).
- Interesting open questions remain in developing further understanding of how to support efficient joint inference.
- Check out our tools, demos, tutorials.

Page 30
