Decomposing Structured Prediction via Constrained Conditional Models

Dan Roth

Department of Computer Science, University of Illinois at Urbana-Champaign
SLG Workshop, ICML, Atlanta GA, June 2013
With thanks to collaborators: Ming-Wei Chang, Lev Ratinov, Rajhans Samdani, Vivek Srikumar, and many others.
Funding: NSF; DHS; NIH; DARPA. DASH Optimization (Xpress-MP)
Page 1

Comprehension

(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.

1. Christopher Robin was born in England.
2. Winnie the Pooh is a title of a book.
3. Christopher Robin's dad was a magician.
4. Christopher Robin must be at least 65 now.

This is an Inference Problem.

Page 2

Learning and Inference

- Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome.
  - In current NLP we often think about simpler structured problems: Parsing, Information Extraction, SRL, etc.
  - As we move up the problem hierarchy (Textual Entailment, QA, ...) not all component models can be learned simultaneously.
  - We need to think about (learned) models for different sub-problems, often pipelined.
  - Knowledge relating sub-problems (constraints) becomes more essential and may appear only at evaluation time.
- Goal: Incorporate models' information, along with prior knowledge (constraints), in making coherent decisions:
  - Decisions that respect the local models as well as domain- and context-specific knowledge/constraints.

Page 3

Outline

- Constrained Conditional Models
  - A formulation for global inference with knowledge modeled as expressive structural constraints
  - A Structured Prediction Perspective
- Decomposed Learning (DecL)
  - Efficient structure learning by reducing the learning-time inference to a small output space
  - Provide conditions for when DecL is provably identical to global structural learning (GL)

Page 4

Three Ideas Underlying Constrained Conditional Models

- Idea 1 (Modeling): Separate modeling and problem formulation from algorithms.
  - Similar to the philosophy of probabilistic modeling.
- Idea 2 (Inference): Keep models simple, make expressive decisions (via constraints).
  - Unlike probabilistic modeling, where models become more expressive.
- Idea 3 (Learning): Expressive structured decisions can be supported by simply learned models.
  - Amplified and minimally supervised by exploiting dependencies among models' outcomes.

Page 5

Inference with General Constraint Structure [Roth & Yih '04, '07]
Recognizing Entities and Relations (significant performance improvement)

(The slide's figure shows the sentence "Dole 's wife, Elizabeth , is a native of N.C." with entity variables E1, E2, E3 and relation variables R12, R23. Independently learned models assign a score to each candidate label: per / loc / other for the entities, and spouse_of / born_in / irrelevant for the relations.)

An objective function combines the local scores,
    y = argmax Σ score(y = v) · 1[y = v]
      = argmax score(E1 = PER) · 1[E1 = PER] + score(E1 = LOC) · 1[E1 = LOC] + ... + score(R12 = S-of) · 1[R12 = S-of] + ...,
maximized subject to constraints relating the variables. Note: this is not a sequential model; it is one way for independently learned models to incorporate knowledge.

Key questions: How to guide the global inference over independently learned or pipelined models (pipeline vs. inference subject to constraints)? How to learn: independently or jointly?

Models could be learned separately; constraints may come up only at decision time.

Page 6

Constrained Conditional Models

The CCM objective (displayed graphically on the slide) combines two components:
- A weight vector for "local" models, over features and classifiers: log-linear models (HMM, CRF) or a combination.
- A (soft) constraints component: a penalty for violating each constraint, scaled by how far y is from a "legal" assignment.
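The formula itself did not survive extraction. A standard way to write the CCM objective, consistent with the annotations above (a reconstruction, not text copied from the slide), is:

    y^* = \arg\max_{y \in Y} \; w^{\top}\phi(x,y) \;-\; \sum_{k} \rho_k \, d\big(y, 1_{C_k(x)}\big)

where w^{\top}\phi(x,y) is the score of the local models, each C_k is a constraint with violation penalty \rho_k, and d(y, 1_{C_k(x)}) measures how far y is from an assignment satisfying C_k (a hard constraint corresponds to \rho_k = \infty).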

How to solve? This is an Integer Linear Program. Solving it with ILP packages gives an exact solution; Cutting Planes, Dual Decomposition, and other search techniques are possible. (Cf. the ICML "Inferning" workshop.)

How to train? Training is learning the objective function. Decouple? Decompose? How to exploit the structure to minimize supervision?

Page 7

Placing in context: a crash course in structured prediction

Structured Prediction: Inference

- Inference: given input x (a document, a sentence), predict the best structure y = {y1, y2, ..., yn} ∈ Y (entities & relations).
  - Assign values to y1, y2, ..., yn, accounting for dependencies among the yi's.
- Inference is expressed as a maximization of a scoring function:
      y' = argmax_{y ∈ Y} w^T φ(x, y)
  where Y is the set of allowed structures, w the feature weights (estimated during learning), and φ(x, y) the joint features on inputs and outputs.
- Inference requires, in principle, enumerating all y ∈ Y at decision time, when we are given x ∈ X and attempt to determine the best y ∈ Y for it, given w.
  - For some structures, inference is computationally easy, e.g., using the Viterbi algorithm.
  - In general, it is NP-hard (and can be formulated as an ILP).

Page 8

Structured Prediction: Learning

- Learning: given a set of structured examples {(x, y)}, find a scoring function w that minimizes empirical loss.
- Learning is thus driven by the attempt to find a weight vector w such that for each given annotated example (xi, yi) the conditions on the next slide hold.

Page 9

Structured Prediction: Learning

- Learning: given a set of structured examples {(x, y)}, find a scoring function w that minimizes empirical loss.
- Learning is thus driven by the attempt to find a weight vector w such that for each given annotated example (xi, yi):
      w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(y, yi)    ∀ y
  (the score of the annotated structure beats the score of any other structure by at least the penalty for predicting that other structure).
- We call these conditions the learning constraints.
- In most structured learning algorithms used today, the update of the weight vector w is done in an on-line fashion.
- W.l.o.g. (almost) we can thus write the generic structured learning algorithm as follows.
- What follows is a Structured Perceptron, but with minor variations this procedure applies to CRFs and Linear Structured SVMs.

Page 10

Structured Prediction: Learning Algorithm

Note: in the structured case, the prediction (inference) step is often intractable and needs to be done many times.

- For each example (xi, yi), do (with the current weight vector w):
  - Predict: perform inference with the current weight vector
        yi' = argmax_{y ∈ Y} w^T φ(xi, y)
  - Check the learning constraints: is the score of the current prediction better than that of (xi, yi)?
    - If yes (a mistaken prediction): update w.
    - Otherwise: no need to update w on this example.
- EndFor

Page 11
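As a concrete illustration of the procedure above, here is a minimal structured-perceptron sketch in Python. The feature map phi, the candidate generator candidates(x), and the brute-force argmax are placeholders of my own (real systems use Viterbi or an ILP for the inference step); this is a sketch of the generic algorithm, not code from the talk.

```python
import numpy as np

def structured_perceptron(data, phi, candidates, dim, epochs=5):
    """Generic structured learning loop: predict with the current w, update on mistakes.

    data:       iterable of (x, y_gold) pairs
    phi:        joint feature map phi(x, y) -> np.ndarray of length dim
    candidates: function x -> iterable of candidate structures y (the space Y)
    """
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in data:
            # Predict: inference with the current weight vector (brute force here).
            y_pred = max(candidates(x), key=lambda y: w @ phi(x, y))
            # Learning-constraint check: update w only on a mistaken prediction.
            if y_pred != y_gold:
                w += phi(x, y_gold) - phi(x, y_pred)
    return w
```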

Structured Prediction: Learning Algorithm

Solution I: decompose the scoring function into EASY and HARD parts.

- For each example (xi, yi), do:
  - Predict: perform inference with the current weight vector
        yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y) + w_HARD^T φ_HARD(xi, y)
  - Check the learning constraint: is the score of the current prediction better than that of (xi, yi)?
    - If yes (a mistaken prediction): update w.
    - Otherwise: no need to update w on this example.
- EndDo

EASY could be feature functions that correspond to an HMM, a linear CRF, or a bank of classifiers (omitting the dependence on y at learning time). This may not be enough if the HARD part is still part of each inference step.

Page 12

Structured Prediction: Learning Algorithm

Solution II: disregard some of the dependencies; assume a simple model.

- For each example (xi, yi), do:
  - Predict: perform inference with the current weight vector, using only the simple (EASY) part of the model
        yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y)
  - Check the learning constraint: is the score of the current prediction better than that of (xi, yi)?
    - If yes (a mistaken prediction): update w.
    - Otherwise: no need to update w on this example.
- EndDo

Page 13

Structured Prediction: Learning Algorithm

Solution III: disregard some of the dependencies during learning; take them into account at decision time.

- For each example (xi, yi), do:
  - Predict: perform inference with the current weight vector over the simple model
        yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y)
  - Check the learning constraint: is the score of the current prediction better than that of (xi, yi)?
    - If yes (a mistaken prediction): update w.
    - Otherwise: no need to update w on this example.
- EndDo

At decision time, bring the hard dependencies back:
    yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y) + w_HARD^T φ_HARD(xi, y)

This is the most commonly used solution in NLP today.

Page 14
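A minimal sketch of this decision-time step (learn with the simple model, add the knowledge back when predicting), using hypothetical helpers of my own: phi_easy for the simple model's features, candidates(x) for the output space, and a list of (check, rho) pairs for soft constraints. It illustrates the idea by enumeration; the actual systems in the talk formulate this step as an ILP.

```python
def decode_with_constraints(w, x, phi_easy, candidates, constraints):
    """L+I / CCM-style decision: score each candidate with the simply learned model,
    then subtract a penalty rho for every violated constraint (brute-force argmax)."""
    def score(y):
        penalty = sum(rho for check, rho in constraints if not check(x, y))
        return w @ phi_easy(x, y) - penalty
    return max(candidates(x), key=score)
```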

Examples: CCM Formulations

The (soft) constraints component is more general since constraints can be declarative, non-grounded statements. CCMs can be viewed as a general interface for easily combining declarative domain knowledge with data-driven statistical models.

Formulate NLP problems as ILP problems (inference may be done otherwise):
1. Sequential prediction / sequence tagging (HMM/CRF + global constraints): Argmax Σ λij xij, with linguistic constraints such as "cannot have both A states and B states in an output sequence."
2. Sentence compression / summarization (language model + global constraints): Argmax Σ λijk xijk, with constraints such as "if a modifier is chosen, include its head; if a verb is chosen, include its arguments."
3. SRL (independent classifiers + global constraints).

Constrained Conditional Models allow:
- Learning a simple model (or multiple models; or pipelines).
- Making decisions with a more complex model.
- Accomplished by directly incorporating constraints to bias/re-rank global decisions composed of simpler models' decisions.
- More sophisticated algorithmic approaches exist to bias the output [CoDL: Chang et al. 07, 12; PR: Ganchev et al. 10; UEM: Samdani et al. 12].

Page 15

Outline

- Constrained Conditional Models
  - A formulation for global inference with knowledge modeled as expressive structural constraints
  - A Structured Prediction Perspective
- Decomposed Learning (DecL)
  - Efficient structure learning by reducing the learning-time inference to a small output space
  - Provide conditions for when DecL is provably identical to global structural learning (GL)

Page 16

Training Constrained Conditional Models

- Training options:
  - Independently of the constraints (L+I)
  - Jointly, in the presence of the constraints (IBT, GL)
  - Decomposed to simpler models
- Two kinds of decomposition: decompose the model, and decompose the model from the constraints.
- Not surprisingly, decomposition is good. See [Chang et al., Machine Learning Journal 2012].
- Little can be said theoretically about the quality/generalization of predictions made with a decomposed model.
- Next: an algorithmic approach to decomposition that is both good and comes with interesting guarantees.

Page 17

Decomposed Structured Prediction

- Inference: y = argmax_{y ∈ Y} w^T φ(x, y)
- Learning is driven by the attempt to find a weight vector w such that for each given annotated example (xi, yi):
      w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(y, yi)    ∀ y
- In Global Learning, the output space is exponential in the number of variables, so accurate learning can be intractable.
- "Standard" ways to decompose it forget some of the structure and bring it back only at decision time.

(The slide's figure shows an entity/relation example with output variables y1, ..., y6, an entity weight vector we and a relation weight vector wr, contrasting what is used during learning with what is used during inference.)

Page 18

Decomposed Structural Learning (DecL) [Samdani & Roth, ICML'12]

- Algorithm: restrict the 'argmax' inference to a small subset of variables while fixing the remaining variables to their ground-truth values yi ...
  - ... and repeat for different subsets of the output variables: a decomposition.
  - The resulting set of assignments considered for yi is called a neighborhood, nbr(yi).
- (The slide's figure shows variables y1, ..., y6 split into subsets, with the assignments enumerated per subset, e.g. 00/01/10/11 for a pair and 000-111 for a triple.)
- Related work: Pseudolikelihood (Besag, 77); Piecewise Pseudolikelihood (Sutton and McCallum, 07); Pseudomax (Sontag et al., 10).
- Key contribution: we give conditions under which DecL is provably equivalent to Global Learning (GL), and show experimentally that DecL provides results close to GL when such conditions do not exactly hold.

Page 19
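The restriction is easy to state in code. Below is a small Python sketch (my own illustration, assuming binary output variables and a decomposition given as a list of index subsets): the neighborhood of a gold labeling contains every assignment that differs from it only inside one subset.

```python
from itertools import product

def neighborhood(y_gold, decomposition):
    """All assignments that keep y_gold fixed outside one subset S of the decomposition
    and vary the variables inside S (binary labels assumed)."""
    nbr = {tuple(y_gold)}
    for S in decomposition:
        for vals in product([0, 1], repeat=len(S)):
            y = list(y_gold)
            for i, v in zip(S, vals):
                y[i] = v
            nbr.add(tuple(y))
    return nbr
```

During DecL training, the loss-augmented argmax for example (xj, yj) is taken over nbr(yj) instead of over all of Y.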

What are good neighborhoods?

DecL vs. Global Learning (GL)

- GL: separate the ground truth from every y ∈ Y:
      w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(y, yi)    ∀ y ∈ Y
  (e.g., 2^6 = 64 outputs for six binary variables).
- DecL: separate the ground truth from every y ∈ nbr(yj):
      w^T φ(xj, yj) ≥ w^T φ(xj, y) + Δ(y, yj)    ∀ y ∈ nbr(yj)
  (e.g., 16 outputs).
- Likely scenario: nbr(yj) is much smaller than Y.
- (The slide's figure contrasts the full enumeration over y1, ..., y6 with the much smaller per-subset enumerations.)

Page 20

Creating Decompositions

- DecL allows different decompositions Sj for different training instances yj.
- Example: learning with decompositions in which all subsets of size k are considered: DecL-k (a sketch follows after this slide).
  - For each k-subset of the variables, enumerate its assignments; keep the other n-k variables at their gold values.
  - k=1 is Pseudomax [Sontag et al., 2010].
  - k=2 is Constraint Classification [Har-Peled, Zimak, Roth 2002; Crammer, Singer 2002].
- In practice, neighborhoods should be determined based on domain knowledge:
  - Put highly coupled variables in the same set.
- The goal is to get results that are close to doing exact inference. Are there small & good neighborhoods?

Page 21
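Continuing the neighborhood sketch shown earlier, the DecL-k decompositions can be enumerated directly; this is again my own illustrative code, not code from the paper.

```python
from itertools import combinations

def decl_k_decomposition(n_vars, k):
    """All size-k subsets of the output variables; each subset is varied jointly
    while the remaining n - k variables stay at their gold values."""
    return [list(S) for S in combinations(range(n_vars), k)]

# Example: DecL-2 neighborhoods for six binary variables.
# nbr = neighborhood(y_gold, decl_k_decomposition(6, 2))
```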

Exactness of DecL

- Key result: YES. Under "reasonable conditions", DecL with small-sized neighborhoods nbr(yj) gives the same results as Global Learning.
- For analyzing the equivalence between DecL and GL, we need a notion of 'separability' of the data.
- Separability: existence of a set of weights W* that satisfy
      W* = {w | w·φ(xj, yj) ≥ w·φ(xj, y) + Δ(yj, y), ∀ y ∈ Y}
  (the score of the ground truth yj beats the score of every other y by the required margin).
- Separating weights for DecL (the same condition, over a different label space):
      Wdecl = {w | w·φ(xj, yj) ≥ w·φ(xj, y) + Δ(yj, y), ∀ y ∈ nbr(yj)}
- Naturally, W* ⊆ Wdecl.
- Exactness results: the set of separating weights for DecL equals the set of separating weights for GL: W* = Wdecl.

Page 22

Example of Exactness: Pairwise Markov Networks

- The scoring function is defined over a graph with edges E, with singleton/vertex components φi(.; w) and pairwise/edge components φi,k(.; w).
- Assume domain knowledge on W*: for correct (separating) w ∈ W*, each of the pairwise components φi,k(.; w) is either
  - Submodular: φi,k(0,0) + φi,k(1,1) > φi,k(0,1) + φi,k(1,0), or
  - Supermodular: φi,k(0,0) + φi,k(1,1) < φi,k(0,1) + φi,k(1,0).
- (The slide's figure illustrates the two cases on a six-variable pairwise network with binary labels.)

Page 23
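Read concretely, the two conditions are simple inequalities on the 2x2 table of an edge potential. The sketch below (indexing phi[a][b] for labels a, b in {0, 1}) just restates them; the representation of the potentials is an assumption of mine.

```python
def is_submodular(phi):
    """Edge potential favors agreeing labels: phi(0,0) + phi(1,1) > phi(0,1) + phi(1,0)."""
    return phi[0][0] + phi[1][1] > phi[0][1] + phi[1][0]

def is_supermodular(phi):
    """Edge potential favors disagreeing labels: phi(0,0) + phi(1,1) < phi(0,1) + phi(1,0)."""
    return phi[0][0] + phi[1][1] < phi[0][1] + phi[1][0]
```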

Decomposition for Pairwise Markov Networks

- The scoring function is defined over the graph:
      f(x, y; w) = Σi φi(yi, x; w) + Σ(i,k)∈E φi,k(yi, yk, x; w)
- For an example (xj, yj), define E^j by removing from E the edges whose gold labels disagree with the corresponding φ's preference, i.e. keep the edges
      E^j = {(u,v) ∈ E | (sub(φuv) ∧ yu^j = yv^j) ∨ (sup(φuv) ∧ yu^j ≠ yv^j)}
- Theorem: decomposing the variables according to the connected components of E^j yields exactness.

Page 24
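A sketch of this construction in Python (assuming each edge has already been tagged as submodular or supermodular, and using a small union-find to extract the components); the helper names and data layout are my own, not the paper's.

```python
def decomposition_for_example(n_vars, edges, edge_kind, y_gold):
    """edges: list of (u, v) pairs; edge_kind[(u, v)] is 'sub' or 'sup'.
    Keep edge (u, v) in E^j iff (submodular and gold labels agree) or
    (supermodular and gold labels differ); return the connected components
    of (V, E^j) as the subsets of variables to vary jointly."""
    kept = [(u, v) for (u, v) in edges
            if (edge_kind[(u, v)] == 'sub') == (y_gold[u] == y_gold[v])]

    parent = list(range(n_vars))          # union-find over the kept edges
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for u, v in kept:
        parent[find(u)] = find(v)

    components = {}
    for i in range(n_vars):
        components.setdefault(find(i), []).append(i)
    return list(components.values())
```

These components can be fed directly to the neighborhood(...) sketch shown earlier.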

Experiments: Information Extraction

Citation: Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .

Prediction result of a trained HMM (the slide color-codes the predicted fields [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE]): the prediction violates lots of natural constraints!

Page 25

Adding Expressivity via Constraints

- Each field must be a consecutive list of words and can appear at most once in a citation.
- State transitions must occur on punctuation marks.
- The citation can only start with AUTHOR or EDITOR.
- The words pp., pages correspond to PAGE.
- Four digits starting with 20xx and 19xx are DATE.
- Quotations can appear only in TITLE.
- .......

Page 26
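To make the constraint list concrete, here is how a couple of them could be checked on a candidate label sequence (one label per token). This is a hedged sketch with invented helper names, not the ILP encoding used in the actual system.

```python
def at_most_once_and_consecutive(labels, field):
    """Each field must be a consecutive block of tokens appearing at most once."""
    positions = [i for i, l in enumerate(labels) if l == field]
    return not positions or positions == list(range(positions[0], positions[-1] + 1))

def starts_with_author_or_editor(labels):
    """The citation can only start with AUTHOR or EDITOR."""
    return bool(labels) and labels[0] in ("AUTHOR", "EDITOR")
```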

Information Extraction with Constraints

- Adding constraints, we get correct results!
      [AUTHOR] Lars Ole Andersen . [TITLE] Program analysis and specialization for the C Programming language . [TECH-REPORT] PhD thesis . [INSTITUTION] DIKU , University of Copenhagen , [DATE] May, 1994 .
- Experimental goal:
  - Investigate DecL with small neighborhoods.
  - Note that the required theoretical conditions hold only approximately: output tokens tend to appear in contiguous blocks.
  - Use neighborhoods similar to the PMN (pairwise Markov network) case.

Page 27

Typical Results: Information Extraction (Ads Data)

(Chart: accuracy / F1 scores for HMM (LL), L+I, GL, and DecL; the y-axis spans roughly 75 to 81.)

Page 28

Typical Results: Information Extraction (Ads Data)

(Chart: F1 scores for HMM (LL), L+I, GL, and DecL, alongside the time taken to train in minutes; the axes span roughly 75 to 81 for F1 and 0 to 80 minutes for time.)

Page 29

Thank You!

Conclusion

- Presented Constrained Conditional Models:
  - An ILP formulation for structured prediction that augments statistically learned models with declarative constraints, as a way to incorporate knowledge and support decisions in expressive output spaces.
  - Supports joint inference while maintaining modularity and tractability of training.
  - Interdependent components are learned (independently or pipelined) and, via joint inference, support coherent decisions, modulo declarative constraints.
- Presented Decomposed Learning (DecL): efficient joint learning by reducing the learning-time inference to a small output space.
  - Provided conditions for when DecL is provably identical to global structural learning (GL).
- Interesting open questions remain in developing further understanding of how to support efficient joint inference.
- Check out our tools, demos, tutorials.

Page 30
