On the Role of Seed Lexicons in Learning Bilingual Word Embeddings
Ivan Vulić and Anna Korhonen
University of Cambridge
[email protected]
ACL 2016, Berlin, August 8, 2016
Word Embeddings
Dense representations → real-valued low-dimensional vectors
Word embedding induction → learn word-level features which generalise well across tasks and languages
Word embeddings capture interesting and universal regularities
Motivation
The NLP community has developed useful features for several tasks, but finding features that are...
1. task-invariant (POS tagging, SRL, NER, parsing, ...) (monolingual word embeddings)
2. language-invariant (English, Dutch, Chinese, Spanish, ...) (bilingual word embeddings → this talk)
...is non-trivial and time-consuming (20+ years of feature engineering...)
Learn word-level features which generalise across tasks and languages
Word Embeddings
Representation of each word w ∈ V: vec(w) = [f_1, f_2, ..., f_dim]
Word representations in the same shared semantic (or embedding) space!
Image courtesy of [Gouws et al., ICML 2015]
Bilingual Word Embeddings (BWEs)
Representation of a word w_1^S ∈ V^S: vec(w_1^S) = [f_1^1, f_2^1, ..., f_dim^1]
Exactly the same representation for w_2^T ∈ V^T: vec(w_2^T) = [f_1^2, f_2^2, ..., f_dim^2]
Language-independent word representations in the same shared semantic (or embedding) space!
Bilingual Word Embeddings
Monolingual vs. Bilingual
Q1 → How to align semantic spaces in two different languages?
Q2 → Which bilingual signals are used for the alignment?
See also: [Upadhyay et al.: Cross-Lingual Models of Word Embeddings: An Empirical Comparison; ACL 2016]
Bilingual Word Embeddings
Two desirable properties:
P1 → Leverage (large) monolingual training sets tied together through a bilingual signal in order to learn a shared space in a scalable and widely applicable manner across languages and domains
P2 → Use as inexpensive a bilingual signal as possible
BWEs and Bilingual Signals
(Type 1) Jointly learn and align BWEs using parallel-only data [Hermann and Blunsom, ACL 2014; Chandar et al., NIPS 2014]
(Type 2) Jointly learn and align BWEs using monolingual and parallel data [Gouws et al., ICML 2015; Soyer et al., ICLR 2015; Shi et al., ACL 2015]
(Type 3) Learn BWEs from comparable document-aligned data [Vulić and Moens, ACL 2015, JAIR 2016]
(Type 4) Align pretrained monolingual embedding spaces using seed lexicons [Mikolov et al., arXiv 2013; Lazaridou et al., ACL 2015]
BWEs: Type 4
Post-Hoc Mapping with Seed Lexicons
Bilingual signal → word translation pairs
Learn to transform the pre-trained source-language embeddings into a space where the distance between a word and its translation pair is minimised
BWEs: Type 4
Post-Hoc Mapping with Seed Lexicons
Key Question → Could BWE learning be improved by making more intelligent choices when deciding over seed lexicon entries?
We analyse a spectrum of seed lexicons with respect to controllable parameters such as: lexicon source, lexicon size, translation method, translation pair reliability, ...
Basic Framework
Monolingual WE model → skip-gram with negative sampling (SGNS) [Mikolov et al., NIPS 2013]
Bilingual signal → N word translation pairs (x_i, y_i), i = 1, ..., N
Transformation between spaces → we assume a linear mapping [Mikolov et al., arXiv 2013; Dinu et al., ICLR WS 2015]

min_{W ∈ R^{d_S × d_T}} ||XW − Y||_F^2 + λ||W||_F^2

X → source-language vectors for words from a training set
Y → target-language vectors for words from a training set
W → translation (or transformation) matrix
(n.b.: a max-margin framework [Lazaridou et al., ACL 2015] yields similar insights)
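The regularised least-squares objective above has a closed-form (ridge regression) solution. A minimal NumPy sketch, not the authors' code; the function name and toy data are illustrative:

```python
import numpy as np

def learn_mapping(X, Y, lam=1.0):
    """Solve min_W ||XW - Y||_F^2 + lam * ||W||_F^2 in closed form.

    X: (N, d_S) source-language vectors for the N seed translation pairs
    Y: (N, d_T) target-language vectors for the same pairs
    Returns W: (d_S, d_T) linear map from the source to the target space.
    """
    d_S = X.shape[1]
    # Normal equations with Tikhonov regularisation:
    #   W = (X^T X + lam * I)^{-1} X^T Y
    return np.linalg.solve(X.T @ X + lam * np.eye(d_S), X.T @ Y)

# Toy usage: recover a known linear map from 5-dim "source" to 4-dim "target".
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
W_true = rng.normal(size=(5, 4))
Y = X @ W_true                     # perfectly linear toy data
W = learn_mapping(X, Y, lam=1e-6)  # near-exact recovery on noiseless data
```

At test time a source word is mapped as vec(w^S) W and translated to its nearest target-space neighbour.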
A Hybrid Model: Type 3 + Type 4
A type-hybrid procedure which retains only highly reliable translation pairs obtained by a Type 3 model as a seed lexicon for Type 4 models satisfies P1 and P2.
Type 3 model used: [Vulić and Moens, JAIR 2016]
Seed Lexicon Source and Translation Method
Previous work → 5K most frequent words translated using a dictionary or Google Translate (GT)
To simulate this setup:
(1) Start from the BNC frequency list of 6,318 most frequent English lemmas [Kilgarriff, Journal of Lexicography 1997]
(2) Translate them to other languages using GT → BNC+GT
Why not translate the BNC list using a Type 3 model? → BNC+HYB
Or use the frequency list of a Type 3 model? → HFQ+HYB
Or simply use words shared between the two languages? [Kiros et al., NIPS 2015] → ORTHO
Seed Lexicon Size
Previous work → typically 5K training pairs
We also investigate more extreme settings:
Limited setting → only 100-500 pairs?
Testing "the more the merrier" hypothesis → 40K-50K training pairs?
Translation Pair Reliability
Using a Type 3 model, it is possible to control the reliability of induced translation pairs
The symmetry constraint → using only pairs that are mutual nearest neighbours as training pairs: BNC+HYB+SYM and HFQ+HYB+SYM
Without the constraint → BNC+HYB+ASYM and HFQ+HYB+ASYM
Symmetry with a threshold → even more conservative reliability criteria:

sim(x_i, y_i) − sim(x_i, z_i) > THR
sim(y_i, x_i) − sim(y_i, w_i) > THR

(z_i and w_i: the second-best translation candidates for x_i and y_i)
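The symmetry and threshold criteria above can be sketched as a filter over a source-target similarity matrix. A hypothetical helper, assuming similarities are precomputed (e.g. cosine in a Type 3 shared space); the exact candidate handling is an assumption, not the authors' implementation:

```python
import numpy as np

def reliable_pairs(S, thr=0.0):
    """Filter seed translation pairs by symmetry and a margin threshold.

    S[i, j] holds the similarity between source word i and target word j.
    A pair (i, j) is kept when i and j are mutual nearest neighbours and,
    on both sides, the best candidate beats the runner-up by more than thr.
    Sketch of the SYM / THR criteria, not the authors' code.
    """
    pairs = []
    for i in range(S.shape[0]):
        j = int(np.argmax(S[i]))              # best target for source word i
        if int(np.argmax(S[:, j])) != i:      # symmetry (mutual NN) check
            continue
        src_margin = S[i, j] - np.sort(S[i])[-2]     # margin over 2nd-best target
        tgt_margin = S[i, j] - np.sort(S[:, j])[-2]  # margin over 2nd-best source
        if src_margin > thr and tgt_margin > thr:
            pairs.append((i, j))
    return pairs
```

With thr = 0 this reduces to the plain symmetry constraint; raising thr keeps only the most confident pairs, shrinking the lexicon.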
Experimental Setup
Task → bilingual lexicon learning (BLL)
Goal → to build a non-probabilistic bilingual lexicon of word translations
Test sets → ground-truth word translation pairs built for three language pairs: Spanish (ES)-, Dutch (NL)-, Italian (IT)-English (EN) [Vulić and Moens, NAACL 2013, EMNLP 2013] (similar relative performance on other BLL test sets)
Evaluation metric → Top 1 accuracy (Acc1) (similar model rankings with Acc5 and Acc10)
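Acc1 amounts to nearest-neighbour retrieval in the shared space. A minimal sketch, assuming cosine similarity and one gold translation per test word; the function name is illustrative:

```python
import numpy as np

def acc_at_1(src_vecs, tgt_vecs, gold):
    """Top-1 accuracy (Acc1) for bilingual lexicon learning.

    src_vecs: (n_test, d) mapped source-word vectors, one per test word
    tgt_vecs: (V_T, d) all target-language vectors
    gold: length-n_test sequence of gold target indices
    A test word counts as correct when its gold translation is the nearest
    target vector by cosine similarity.
    """
    # Cosine similarity = dot product of L2-normalised vectors.
    s = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    t = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    predictions = (s @ t.T).argmax(axis=1)
    return float((predictions == np.asarray(gold)).mean())
```

Acc5 and Acc10 follow the same pattern, checking whether the gold index appears among the 5 or 10 highest-scoring targets.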
Baseline BWE Models
Type 1 → BiCVM [Hermann and Blunsom, ACL 2014]
Type 2 → BilBOWA [Gouws et al., ICML 2015]
Type 3 → BWESG with length-ratio shuffle [Vulić and Moens, JAIR 2016]
Type 4 → linear mapping (BNC+GT) [Mikolov et al., arXiv 2013; Dinu et al., ICLR WS 2015]
→ All baselines trained with the standard suggested settings (more in the paper)
→ Baselines use similar training data as our Type 4 models, e.g., Polyglot Wiki plus Europarl for BilBOWA, document-aligned LinguaTools Wiki for BWESG
Training Setup and Data (Our Models)
Monolingual SGNS on Polyglot Wikipedias
Standard pre-processing and SGNS hyper-parameters (window size: 4)
We report results with d = 300 for all models (similar results with d = 40, 64, 500)
Ranked Lists with Different Seed Lexicons
[Table: top-ranked English translations of Spanish "casamiento" under each seed lexicon (BNC+GT, BNC+HYB+ASYM, BNC+HYB+SYM, HFQ+HYB+ASYM, HFQ+HYB+SYM, ORTHO). All lexicons except ORTHO rank correct or closely related translations at the top (marriage, marry, marrying, wedding, betrothal, wed, ...); ORTHO returns unrelated words (maría, señor, doña, juana, noche, amor, guerra).]
Experiments
Experiment I: Standard BLL Setting (5K seed lexicons)

Model              Acc1 (ES-EN / NL-EN / IT-EN)
BiCVM (Type 1)     0.532 / 0.583 / 0.569
BilBOWA (Type 2)   0.632 / 0.636 / 0.647
BWESG (Type 3)     0.676 / 0.626 / 0.643
BNC+GT (Type 4)    0.677 / 0.641 / 0.646
[Further rows: ORTHO, BNC+HYB+ASYM, BNC+HYB+SYM (lexicon sizes: 3388; 2738; 3145), HFQ+HYB+ASYM, HFQ+HYB+SYM; ORTHO scores lowest by a wide margin, while the +SYM hybrid lexicons score highest overall (up to 0.695*)]
→ Document-level semantic spaces can provide seed lexicons
→ Reliability matters
Experiments
Experiment II: Lexicon Size (Spanish-English)
[Figure: Acc1 scores (0.0-0.7) against lexicon size (0.1k-50k) for BNC+GT, BNC+HYB+ASYM, BNC+HYB+SYM, HFQ+HYB+ASYM, HFQ+HYB+SYM and ORTHO]
Experiments
Experiment II: Lexicon Size (Dutch-English)
[Figure: Acc1 scores (0.0-0.7) against lexicon size (0.1k-50k) for BNC+GT, BNC+HYB+ASYM, BNC+HYB+SYM, HFQ+HYB+ASYM, HFQ+HYB+SYM and ORTHO]
→ BNC+SYM and HFQ+SYM are the best models overall
Experiments
Experiment III: Translation Pair Reliability (Spanish-English)
[Figure: Acc1 scores (0.6-0.7) against lexicon size (1k-40k) for THR = None, 0.01, 0.025, 0.05, 0.075, 0.1]
Experiments
Experiment III: Translation Pair Reliability (Dutch-English)
[Figure: Acc1 scores (0.54-0.66) against lexicon size (1k-40k) for THR = None, 0.01, 0.025, 0.05, 0.075, 0.1]
→ Stricter selection criteria can help (but not necessarily)
Experiments
Experiment IV: Another Task - Suggesting Word Translations in Context (6K seed lexicons)

Model                                       Acc1 (ES-EN / NL-EN / IT-EN)
No Context                                  0.406 / 0.433 / 0.408
Best System [Vulić and Moens, EMNLP 2014]   0.703 / 0.712 / 0.789
BiCVM (Type 1)                              0.506 / 0.586 / 0.522
BilBOWA (Type 2)                            0.586 / 0.656 / 0.589
BWESG (Type 3)                              0.783 / 0.858 / 0.792
BNC+GT (Type 4)                             0.794 / 0.858 / 0.783
[Further rows: ORTHO, BNC+HYB+ASYM, BNC+HYB+SYM (lexicon sizes: 3839; 3117; 3693), HFQ+HYB+ASYM, and HFQ+HYB+SYM with THR = None / 0.01 / 0.025; the best hybrid configurations outperform all baselines, scoring up to 0.872]
Conclusion and Future Work
Type 4 BWE models (post-hoc mapping with seed lexicons) are very effective, but...
The choice of training pairs and their reliability matter (excellent results with a hybrid BWE model that can train on monolingual data and use only document alignments as supervision)
More sophisticated reliability measures? Other models of pair selection? Other context types and mapping functions? Other languages? Language pairs with scarce resources?
Questions?