On the Role of Seed Lexicons in Learning Bilingual Word Embeddings

Ivan Vulić and Anna Korhonen

University of Cambridge

ACL 2016; Berlin; August 8, 2016


Word Embeddings

Dense representations → real-valued low-dimensional vectors

Word embedding induction → learn word-level features which generalise well across tasks and languages

Word embeddings capture interesting and universal regularities.


Motivation

The NLP community has developed useful features for several tasks, but finding features that are...

1. task-invariant (POS tagging, SRL, NER, parsing, ...) (monolingual word embeddings)

2. language-invariant (English, Dutch, Chinese, Spanish, ...) (bilingual word embeddings → this talk)

...is non-trivial and time-consuming (20+ years of feature engineering...)

Learn word-level features which generalise across tasks and languages

Word Embeddings

Representation of each word $w \in V$: $\mathrm{vec}(w) = [f_1, f_2, \ldots, f_{dim}]$

Word representations in the same shared semantic (or embedding) space!

Image courtesy of [Gouws et al., ICML 2015]

Bilingual Word Embeddings (BWEs)

Representation of a word $w_1^S \in V^S$: $\mathrm{vec}(w_1^S) = [f_1^1, f_2^1, \ldots, f_{dim}^1]$

Exactly the same representation for $w_2^T \in V^T$: $\mathrm{vec}(w_2^T) = [f_1^2, f_2^2, \ldots, f_{dim}^2]$

Language-independent word representations in the same shared semantic (or embedding) space!

Bilingual Word Embeddings

Monolingual vs. Bilingual

Q1 → How to align semantic spaces in two different languages?

Q2 → Which bilingual signals are used for the alignment?

See also: [Upadhyay et al.: Cross-Lingual Models of Word Embeddings: An Empirical Comparison; ACL 2016]

Bilingual Word Embeddings

Two desirable properties:

P1 → Leverage (large) monolingual training sets tied together through a bilingual signal

P2 → Use as inexpensive a bilingual signal as possible in order to learn a shared space in a scalable and widely applicable manner across languages and domains

BWEs and Bilingual Signals

(Type 1) Jointly learn and align BWEs using parallel-only data [Hermann and Blunsom, ACL 2014; Chandar et al., NIPS 2014]

(Type 2) Jointly learn and align BWEs using monolingual and parallel data [Gouws et al., ICML 2015; Soyer et al., ICLR 2015; Shi et al., ACL 2015]

(Type 3) Learn BWEs from comparable document-aligned data [Vulić and Moens, ACL 2015; JAIR 2016]

(Type 4) Align pretrained monolingual embedding spaces using seed lexicons [Mikolov et al., arXiv 2013; Lazaridou et al., ACL 2015]


BWEs: Type 4 Post-Hoc Mapping with Seed Lexicons

Learn to transform the pre-trained source language embeddings into a space where the distance between a word and its translation pair is minimised

Bilingual signal → word translation pairs


BWEs: Type 4 Post-Hoc Mapping with Seed Lexicons

Key Question → Could BWE learning be improved by making more intelligent choices when deciding over seed lexicon entries?

We analyse a spectrum of seed lexicons with respect to controllable parameters such as:

Lexicon source

Lexicon size

Translation method

Translation pair reliability

...


Basic Framework

Monolingual WE model → Skip-gram with negative sampling (SGNS) [Mikolov et al., NIPS 2013]

Bilingual signal → N word translation pairs $(x_i, y_i)$, $i = 1, \ldots, N$

Transformation between spaces → we assume linear mapping [Mikolov et al., arXiv 2013; Dinu et al., ICLR WS 2015]

$$\min_{W \in \mathbb{R}^{d_S \times d_T}} \|XW - Y\|_F^2 + \lambda \|W\|_F^2$$

X → Source language vectors for words from a training set

Y → Target language vectors for words from a training set

W → Translation (or transformation) matrix

(n.b.: max-margin framework [Lazaridou et al., ACL 2015] yields similar insights)
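The regularised least-squares objective above has the standard ridge-regression closed form $W = (X^\top X + \lambda I)^{-1} X^\top Y$. The sketch below illustrates that closed form with NumPy; it is not the authors' implementation. The function name `fit_linear_map`, the toy dimensions, and the random data are assumptions made only for illustration.

```python
import numpy as np

def fit_linear_map(X, Y, lam=1.0):
    """Minimise ||XW - Y||_F^2 + lam * ||W||_F^2 in closed form.

    X: (N, d_S) source vectors of the seed pairs (hypothetical input layout).
    Y: (N, d_T) target vectors of the same pairs, row-aligned with X.
    Returns W with shape (d_S, d_T).
    """
    d_s = X.shape[1]
    # Normal equations of the ridge problem: (X^T X + lam * I) W = X^T Y
    A = X.T @ X + lam * np.eye(d_s)
    return np.linalg.solve(A, X.T @ Y)

# Toy check on synthetic data: Y is an exact linear image of X,
# so a tiny lam should recover the generating matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
W_true = rng.normal(size=(5, 4))
Y = X @ W_true
W_hat = fit_linear_map(X, Y, lam=1e-8)
```

With real embeddings, the rows of `X` and `Y` would be the SGNS vectors of the seed lexicon entries, and `lam` would be tuned on held-out pairs.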

A Hybrid Model: Type 3 + Type 4

A type-hybrid procedure which retains only highly reliable translation pairs obtained by a Type 3 model as a seed lexicon for Type 4 models satisfies P1 and P2.

Type 3 model used: [Vulić and Moens, JAIR 2016]


Seed Lexicon Source and Translation Method

Previous work → 5K most frequent words translated using a dictionary or Google Translate (GT)

To simulate this setup:

(1) Start from the BNC frequency list of 6,318 most frequent English lemmas [Kilgarriff, Journal of Lexicography 1997]

(2) Translate them to other languages using GT → BNC+GT

Why not translate BNC using a Type 3 model? → BNC+HYB

Or use the frequency list of a Type 3 model? → HFQ+HYB

Or simply words shared between two languages? [Kiros et al., NIPS 2015] → ORTHO
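The ORTHO option can be read as pairing every word form that occurs in both vocabularies with itself. A minimal sketch under that reading follows; the function name `ortho_seed_lexicon` and the toy vocabularies are hypothetical, and any frequency filtering or normalisation used in the actual experiments is not shown.

```python
def ortho_seed_lexicon(src_vocab, tgt_vocab):
    """Return (word, word) pairs for word forms present in both vocabularies."""
    shared = set(src_vocab) & set(tgt_vocab)
    # Sort for a deterministic order; each shared form is treated as its own translation.
    return sorted((w, w) for w in shared)

# Toy vocabularies (illustrative only)
pairs = ortho_seed_lexicon(["hotel", "casa", "radio"], ["radio", "house", "hotel"])
```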


Seed Lexicon Size

Previous work → typically 5K training pairs

We also investigate more extreme settings:

Limited setting: only 100-500 pairs?

Testing the "more the merrier" hypothesis → 40K-50K training pairs?


Translation Pair Reliability

Using a Type 3 model, it is possible to control the reliability of induced translation pairs

The symmetry constraint → using only pairs that are mutual nearest neighbours as training pairs: BNC+HYB+SYM and HFQ+HYB+SYM

Without the constraint → BNC+HYB+ASYM and HFQ+HYB+ASYM

Symmetry with a threshold → even more conservative reliability criteria

$\mathrm{sim}(x_i, y_i) - \mathrm{sim}(x_i, z_i) > THR$

$\mathrm{sim}(y_i, x_i) - \mathrm{sim}(y_i, w_i) > THR$
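The selection criteria above can be sketched as a filter over a cross-lingual similarity matrix. The sketch assumes that $z_i$ is the second-best target candidate for $x_i$ and $w_i$ the second-best source candidate for $y_i$, which the slide does not state explicitly. The function name `symmetric_pairs` and the toy matrix are hypothetical.

```python
import numpy as np

def symmetric_pairs(S, thr=None):
    """Select mutual-nearest-neighbour pairs from a similarity matrix.

    S[i, j] is the similarity of source word i and target word j
    (needs at least two rows and two columns when thr is used).
    If thr is given, the best match must also beat the runner-up by more
    than thr in both directions (assumed reading of z_i and w_i).
    """
    best_tgt = S.argmax(axis=1)  # nearest target word for every source word
    best_src = S.argmax(axis=0)  # nearest source word for every target word
    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] != i:
            continue  # not mutual nearest neighbours
        if thr is not None:
            runner_up_tgt = np.delete(S[i], j).max()     # sim(x_i, z_i)
            runner_up_src = np.delete(S[:, j], i).max()  # sim(y_i, w_i)
            if S[i, j] - runner_up_tgt <= thr or S[i, j] - runner_up_src <= thr:
                continue
        pairs.append((i, int(j)))
    return pairs

# Toy similarity matrix: 3 source words x 3 target words
S = np.array([[0.9, 0.2, 0.1],
              [0.3, 0.8, 0.75],
              [0.1, 0.7, 0.6]])
sym = symmetric_pairs(S)             # symmetry constraint only
strict = symmetric_pairs(S, thr=0.1) # symmetry with a threshold
```

In the toy matrix, the pair (1, 1) passes the plain symmetry check but is dropped with `thr=0.1`, because its margin over the second-best target candidate is only about 0.05.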

Experimental Setup

Task → Bilingual lexicon learning (BLL)

Goal → to build a non-probabilistic bilingual lexicon of word translations

Test Sets → ground truth word translation pairs built for three language pairs: Spanish (ES)-, Dutch (NL)-, Italian (IT)-English (EN) [Vulić and Moens, NAACL 2013, EMNLP 2013] (Similar relative performance on other BLL test sets)

Evaluation Metric → Top 1 accuracy (Acc1) (Similar model rankings with Acc5 and Acc10)
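As a rough illustration of the Acc1 metric for a Type 4 model: map each test source vector with $W$, retrieve the nearest target word, and count how often it matches the gold translation. The sketch assumes cosine similarity for retrieval, which the slide does not specify. The function name `acc_at_1` and the toy vectors are hypothetical.

```python
import numpy as np

def acc_at_1(X_test, W, Y_all, gold):
    """Top-1 accuracy of nearest-neighbour retrieval after a linear mapping.

    X_test: (n, d_S) source vectors of the test words.
    W:      (d_S, d_T) learned mapping.
    Y_all:  (m, d_T) vectors of the whole target vocabulary.
    gold:   n gold target indices into Y_all.
    """
    mapped = X_test @ W
    # Length-normalise both sides so that dot products are cosine similarities.
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    targets = Y_all / np.linalg.norm(Y_all, axis=1, keepdims=True)
    pred = (mapped @ targets.T).argmax(axis=1)
    return float(np.mean(pred == np.asarray(gold)))

# Toy example: identity mapping, three target words in a 2-d space
Y_all = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
X_test = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]])
score = acc_at_1(X_test, np.eye(2), Y_all, gold=[0, 1, 1])
```

Here the third test word is retrieved incorrectly, so two of the three predictions match the gold indices.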

Baseline BWE Models

Type 1 → BiCVM [Hermann and Blunsom, ACL 2014]

Type 2 → BilBOWA [Gouws et al., ICML 2015]

Type 3 → BWESG with length-ratio shuffle [Vulić and Moens, JAIR 2016]

Type 4 → Linear mapping (BNC+GT) [Mikolov et al., arXiv 2013; Dinu et al., ICLR WS 2015]

All baselines trained with standard suggested settings (more in the paper)

Baselines use similar training data as our Type 4 models, e.g., Polyglot Wiki plus Europarl for BilBOWA, document-aligned LinguaTools Wiki for BWESG

Training Setup and Data (Our Models)

Monolingual SGNS on Polyglot Wikipedias

Standard pre-processing and SGNS hyper-parameters (window size: 4)

We report results with d = 300 for all models (similar results with d = 40, 64, 500)

Ranked Lists with Different Seed Lexicons

[Table: ranked nearest neighbours of the Spanish query word "casamiento", one column per seed lexicon: BNC+GT, BNC+HYB+ASYM, BNC+HYB+SYM, HFQ+HYB+ASYM, HFQ+HYB+SYM, ORTHO. The column alignment did not survive; the recovered cell runs follow.]

marriage marry marrying betrothal wedding wed elopement

maría señor doña juana noche amor guerra

marry; marriage; marriage; marriage

marrying wed wedding betrothal remarry

marry marrying wedding betrothal wed marriages

marry betrothal marrying wedding daughter betrothed

marry betrothal marrying wedding wed elopement

marriage

Experiments

Experiment I: Standard BLL Setting (5K seed lexicons)

Model                             ES-EN    NL-EN    IT-EN
BiCVM (Type 1)                    0.532    0.583    0.569
BilBOWA (Type 2)                  0.632    0.636    0.647
BWESG (Type 3)                    0.676    0.626    0.643
BNC+GT (Type 4)                   0.677    0.641    0.646
ORTHO                             0.233    0.506    0.224
BNC+HYB+ASYM                      0.673    0.626    0.644
BNC+HYB+SYM (3388; 2738; 3145)    0.681    0.658*   0.663*
HFQ+HYB+ASYM                      0.673    0.596    0.635
HFQ+HYB+SYM                       0.695*   0.657*   0.667*

Document-level semantic spaces can provide seed lexicons

→ Reliability matters

Experiments

Experiment II: Lexicon Size (Spanish-English)

[Figure: Acc1 scores (y-axis, 0 to 0.7) against lexicon size (x-axis, 0.1k to 50k) for BNC+GT, BNC+HYB+ASYM, BNC+HYB+SYM, HFQ+HYB+ASYM, HFQ+HYB+SYM and ORTHO]

Experiments

Experiment II: Lexicon Size (Dutch-English)

[Figure: Acc1 scores (y-axis, 0 to 0.7) against lexicon size (x-axis, 0.1k to 50k) for BNC+GT, BNC+HYB+ASYM, BNC+HYB+SYM, HFQ+HYB+ASYM, HFQ+HYB+SYM and ORTHO]

BNC+SYM and HFQ+SYM are the best models overall

Experiments

Experiment III: Translation Pair Reliability (Spanish-English)

[Figure: Acc1 scores (y-axis, 0.6 to 0.7) against lexicon size (x-axis, 1k to 40k) for THR=None, THR=0.01, THR=0.025, THR=0.05, THR=0.075 and THR=0.1]

Experiments

Experiment III: Translation Pair Reliability (Dutch-English)

[Figure: Acc1 scores (y-axis, 0.54 to 0.66) against lexicon size (x-axis, 1k to 40k) for THR=None, THR=0.01, THR=0.025, THR=0.05, THR=0.075 and THR=0.1]

Stricter selection criteria can help (but not necessarily)

Experiments

Experiment IV: Another Task - Suggesting Word Translations in Context (6K seed lexicons)

Model                                        ES-EN    NL-EN    IT-EN
No Context                                   0.406    0.433    0.408
Best System [Vulić and Moens, EMNLP 2014]    0.703    0.712    0.789
BiCVM (Type 1)                               0.506    0.586    0.522
BilBOWA (Type 2)                             0.586    0.656    0.589
BWESG (Type 3)                               0.783    0.858    0.792
BNC+GT (Type 4)                              0.794    0.858    0.783
ORTHO                                        0.647    0.794    0.678
BNC+HYB+ASYM                                 0.806*   0.872    0.778
BNC+HYB+SYM (3839; 3117; 3693)               0.808*   0.875*   0.814*
HFQ+HYB+ASYM                                 0.789    0.864    0.781
HFQ+HYB+SYM (THR=None)                       0.792    0.869    0.786
HFQ+HYB+SYM (THR=0.01)                       0.792    0.858    0.789
HFQ+HYB+SYM (THR=0.025)                      0.800    0.853    0.792


Conclusion and Future Work

Type 4 BWE models (post-hoc mapping with seed lexicons) are very effective but...

The choice of training pairs and their reliability matter (Excellent results with a hybrid BWE model that can train on monolingual data and use only document alignments as supervision)

More sophisticated reliability measures? Other models of pair selection? Other context types and mapping functions? Other languages? Language pairs with scarce resources?

Questions?
