On the Role of Seed Lexicons in Learning Bilingual Word Embeddings
Ivan Vulić and Anna Korhonen
University of Cambridge
[email protected]
ACL 2016, Berlin, August 8, 2016
Word Embeddings
Dense representations → real-valued low-dimensional vectors
Word embedding induction → learn word-level features which generalise well across tasks and languages
Word embeddings capture interesting and universal regularities
Motivation
The NLP community has developed useful features for several tasks, but finding features that are...
1. task-invariant (POS tagging, SRL, NER, parsing, ...) (monolingual word embeddings)
2. language-invariant (English, Dutch, Chinese, Spanish, ...) (bilingual word embeddings → this talk)
...is non-trivial and time-consuming (20+ years of feature engineering...)
Learn word-level features which generalise across tasks and languages
Word Embeddings
Representation of each word w ∈ V: vec(w) = [f_1, f_2, ..., f_dim]
Word representations in the same shared semantic (or embedding) space!
Image courtesy of [Gouws et al., ICML 2015]
Bilingual Word Embeddings (BWEs)
Representation of a word w_1^S ∈ V^S: vec(w_1^S) = [f_1^1, f_2^1, ..., f_dim^1]
Exactly the same representation for w_2^T ∈ V^T: vec(w_2^T) = [f_1^2, f_2^2, ..., f_dim^2]
Language-independent word representations in the same shared semantic (or embedding) space!
Bilingual Word Embeddings
Monolingual vs. Bilingual
Q1 → How to align semantic spaces in two different languages?
Q2 → Which bilingual signals are used for the alignment?
See also: [Upadhyay et al.: Cross-Lingual Models of Word Embeddings: An Empirical Comparison; ACL 2016]
Bilingual Word Embeddings
Two desirable properties:
P1 → Leverage (large) monolingual training sets tied together through a bilingual signal in order to learn a shared space in a scalable and widely applicable manner across languages and domains
P2 → Use as inexpensive a bilingual signal as possible
BWEs and Bilingual Signals
(Type 1) Jointly learn and align BWEs using parallel-only data [Hermann and Blunsom, ACL 2014; Chandar et al., NIPS 2014]
(Type 2) Jointly learn and align BWEs using monolingual and parallel data [Gouws et al., ICML 2015; Soyer et al., ICLR 2015; Shi et al., ACL 2015]
(Type 3) Learn BWEs from comparable document-aligned data [Vulić and Moens, ACL 2015, JAIR 2016]
(Type 4) Align pretrained monolingual embedding spaces using seed lexicons [Mikolov et al., arXiv 2013; Lazaridou et al., ACL 2015]
BWEs: Type 4
Post-Hoc Mapping with Seed Lexicons
Bilingual signal → word translation pairs
Learn to transform the pre-trained source-language embeddings into a space where the distance between a word and its translation pair is minimised
BWEs: Type 4
Post-Hoc Mapping with Seed Lexicons
Key Question → Could BWE learning be improved by making more intelligent choices when deciding over seed lexicon entries?
We analyse a spectrum of seed lexicons with respect to controllable parameters such as: lexicon source, lexicon size, translation method, translation pair reliability, ...
Basic Framework
Monolingual WE model → skip-gram with negative sampling (SGNS) [Mikolov et al., NIPS 2013]
Bilingual signal → N word translation pairs (x_i, y_i), i = 1, ..., N
Transformation between spaces → we assume a linear mapping [Mikolov et al., arXiv 2013; Dinu et al., ICLR WS 2015]

min_{W ∈ R^{d_S × d_T}} ||XW − Y||_F^2 + λ||W||_F^2

X → source-language vectors for words from a training set
Y → target-language vectors for words from a training set
W → translation (or transformation) matrix
(n.b.: a max-margin framework [Lazaridou et al., ACL 2015] yields similar insights)
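The regularised least-squares objective above has a closed-form (ridge regression) solution. A minimal NumPy sketch, not the authors' code; the function name and toy data are illustrative:

```python
import numpy as np

def learn_mapping(X, Y, lam=1.0):
    """Solve min_W ||XW - Y||_F^2 + lam * ||W||_F^2 in closed form.

    X: (N, d_S) source-language vectors for the N seed translation pairs
    Y: (N, d_T) target-language vectors for the same pairs
    Returns W: (d_S, d_T) linear map from the source to the target space.
    """
    d_S = X.shape[1]
    # Normal equations with Tikhonov regularisation:
    #   W = (X^T X + lam * I)^{-1} X^T Y
    return np.linalg.solve(X.T @ X + lam * np.eye(d_S), X.T @ Y)

# Toy usage: recover a known linear map from 5-dim "source" to 4-dim "target".
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
W_true = rng.normal(size=(5, 4))
Y = X @ W_true                     # perfectly linear toy data
W = learn_mapping(X, Y, lam=1e-6)  # near-exact recovery on noiseless data
```

At test time a source word is mapped as vec(w^S) W and translated to its nearest target-space neighbour.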
A Hybrid Model: Type 3 + Type 4
A type-hybrid procedure which retains only highly reliable translation pairs obtained by a Type 3 model as a seed lexicon for Type 4 models satisfies P1 and P2.
Type 3 model used: [Vulić and Moens, JAIR 2016]
Seed Lexicon Source and Translation Method
Previous work → 5K most frequent words translated using a dictionary or Google Translate (GT)
To simulate this setup:
(1) Start from the BNC frequency list of 6,318 most frequent English lemmas [Kilgarriff, Journal of Lexicography 1997]
(2) Translate them to other languages using GT → BNC+GT
Why not translate the BNC list using a Type 3 model? → BNC+HYB
Or use the frequency list of a Type 3 model? → HFQ+HYB
Or simply use words shared between the two languages? [Kiros et al., NIPS 2015] → ORTHO
Seed Lexicon Size
Previous work → typically 5K training pairs
We also investigate more extreme settings:
Limited setting → only 100-500 pairs?
Testing "the more the merrier" hypothesis → 40K-50K training pairs?
Translation Pair Reliability
Using a Type 3 model, it is possible to control the reliability of induced translation pairs
The symmetry constraint → using only pairs that are mutual nearest neighbours as training pairs: BNC+HYB+SYM and HFQ+HYB+SYM
Without the constraint → BNC+HYB+ASYM and HFQ+HYB+ASYM
Symmetry with a threshold → even more conservative reliability criteria:

sim(x_i, y_i) − sim(x_i, z_i) > THR
sim(y_i, x_i) − sim(y_i, w_i) > THR

(z_i and w_i: the second-best translation candidates for x_i and y_i)
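The symmetry and threshold criteria above can be sketched as a filter over a source-target similarity matrix. A hypothetical helper, assuming similarities are precomputed (e.g. cosine in a Type 3 shared space); the exact candidate handling is an assumption, not the authors' implementation:

```python
import numpy as np

def reliable_pairs(S, thr=0.0):
    """Filter seed translation pairs by symmetry and a margin threshold.

    S[i, j] holds the similarity between source word i and target word j.
    A pair (i, j) is kept when i and j are mutual nearest neighbours and,
    on both sides, the best candidate beats the runner-up by more than thr.
    Sketch of the SYM / THR criteria, not the authors' code.
    """
    pairs = []
    for i in range(S.shape[0]):
        j = int(np.argmax(S[i]))              # best target for source word i
        if int(np.argmax(S[:, j])) != i:      # symmetry (mutual NN) check
            continue
        src_margin = S[i, j] - np.sort(S[i])[-2]     # margin over 2nd-best target
        tgt_margin = S[i, j] - np.sort(S[:, j])[-2]  # margin over 2nd-best source
        if src_margin > thr and tgt_margin > thr:
            pairs.append((i, j))
    return pairs
```

With thr = 0 this reduces to the plain symmetry constraint; raising thr keeps only the most confident pairs, shrinking the lexicon.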
Experimental Setup
Task → bilingual lexicon learning (BLL)
Goal → to build a non-probabilistic bilingual lexicon of word translations
Test sets → ground-truth word translation pairs built for three language pairs: Spanish (ES)-, Dutch (NL)-, Italian (IT)-English (EN) [Vulić and Moens, NAACL 2013, EMNLP 2013] (similar relative performance on other BLL test sets)
Evaluation metric → Top 1 accuracy (Acc1) (similar model rankings with Acc5 and Acc10)
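Acc1 amounts to nearest-neighbour retrieval in the shared space. A minimal sketch, assuming cosine similarity and one gold translation per test word; the function name is illustrative:

```python
import numpy as np

def acc_at_1(src_vecs, tgt_vecs, gold):
    """Top-1 accuracy (Acc1) for bilingual lexicon learning.

    src_vecs: (n_test, d) mapped source-word vectors, one per test word
    tgt_vecs: (V_T, d) all target-language vectors
    gold: length-n_test sequence of gold target indices
    A test word counts as correct when its gold translation is the nearest
    target vector by cosine similarity.
    """
    # Cosine similarity = dot product of L2-normalised vectors.
    s = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    t = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    predictions = (s @ t.T).argmax(axis=1)
    return float((predictions == np.asarray(gold)).mean())
```

Acc5 and Acc10 follow the same pattern, checking whether the gold index appears among the 5 or 10 highest-scoring targets.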
Baseline BWE Models
Type 1 → BiCVM [Hermann and Blunsom, ACL 2014]
Type 2 → BilBOWA [Gouws et al., ICML 2015]
Type 3 → BWESG with length-ratio shuffle [Vulić and Moens, JAIR 2016]
Type 4 → linear mapping (BNC+GT) [Mikolov et al., arXiv 2013; Dinu et al., ICLR WS 2015]
→ All baselines trained with the standard suggested settings (more in the paper)
→ Baselines use similar training data as our Type 4 models, e.g., Polyglot Wiki plus Europarl for BilBOWA, document-aligned LinguaTools Wiki for BWESG
Training Setup and Data (Our Models)
Monolingual SGNS on Polyglot Wikipedias
Standard pre-processing and SGNS hyper-parameters (window size: 4)
We report results with d = 300 for all models (similar results with d = 40, 64, 500)
Ranked Lists with Different Seed Lexicons
[Table: top-ranked English translations of Spanish "casamiento" under each seed lexicon (BNC+GT, BNC+HYB+ASYM, BNC+HYB+SYM, HFQ+HYB+ASYM, HFQ+HYB+SYM, ORTHO). All lexicons except ORTHO rank correct or closely related translations at the top (marriage, marry, marrying, wedding, betrothal, wed, ...); ORTHO returns unrelated words (maría, señor, doña, juana, noche, amor, guerra).]
Experiments
Experiment I: Standard BLL Setting (5K seed lexicons)

Model              Acc1 (ES-EN / NL-EN / IT-EN)
BiCVM (Type 1)     0.532 / 0.583 / 0.569
BilBOWA (Type 2)   0.632 / 0.636 / 0.647
BWESG (Type 3)     0.676 / 0.626 / 0.643
BNC+GT (Type 4)    0.677 / 0.641 / 0.646
[Further rows: ORTHO, BNC+HYB+ASYM, BNC+HYB+SYM (lexicon sizes: 3388; 2738; 3145), HFQ+HYB+ASYM, HFQ+HYB+SYM; ORTHO scores lowest by a wide margin, while the +SYM hybrid lexicons score highest overall (up to 0.695*)]
→ Document-level semantic spaces can provide seed lexicons
→ Reliability matters
Experiments
Experiment II: Lexicon Size (Spanish-English)
[Figure: Acc1 scores (0.0-0.7) against lexicon size (0.1k-50k) for BNC+GT, BNC+HYB+ASYM, BNC+HYB+SYM, HFQ+HYB+ASYM, HFQ+HYB+SYM and ORTHO]
Experiments
Experiment II: Lexicon Size (Dutch-English)
[Figure: Acc1 scores (0.0-0.7) against lexicon size (0.1k-50k) for BNC+GT, BNC+HYB+ASYM, BNC+HYB+SYM, HFQ+HYB+ASYM, HFQ+HYB+SYM and ORTHO]
→ BNC+SYM and HFQ+SYM are the best models overall
Experiments
Experiment III: Translation Pair Reliability (Spanish-English)
[Figure: Acc1 scores (0.6-0.7) against lexicon size (1k-40k) for THR = None, 0.01, 0.025, 0.05, 0.075, 0.1]
Experiments
Experiment III: Translation Pair Reliability (Dutch-English)
[Figure: Acc1 scores (0.54-0.66) against lexicon size (1k-40k) for THR = None, 0.01, 0.025, 0.05, 0.075, 0.1]
→ Stricter selection criteria can help (but not necessarily)
Experiments
Experiment IV: Another Task - Suggesting Word Translations in Context (6K seed lexicons)

Model                                       Acc1 (ES-EN / NL-EN / IT-EN)
No Context                                  0.406 / 0.433 / 0.408
Best System [Vulić and Moens, EMNLP 2014]   0.703 / 0.712 / 0.789
BiCVM (Type 1)                              0.506 / 0.586 / 0.522
BilBOWA (Type 2)                            0.586 / 0.656 / 0.589
BWESG (Type 3)                              0.783 / 0.858 / 0.792
BNC+GT (Type 4)                             0.794 / 0.858 / 0.783
[Further rows: ORTHO, BNC+HYB+ASYM, BNC+HYB+SYM (lexicon sizes: 3839; 3117; 3693), HFQ+HYB+ASYM, and HFQ+HYB+SYM with THR = None / 0.01 / 0.025; the best hybrid configurations outperform all baselines, scoring up to 0.872]
Conclusion and Future Work
Type 4 BWE models (post-hoc mapping with seed lexicons) are very effective, but...
The choice of training pairs and their reliability matter (excellent results with a hybrid BWE model that can train on monolingual data and use only document alignments as supervision)
More sophisticated reliability measures? Other models of pair selection? Other context types and mapping functions? Other languages? Language pairs with scarce resources?
Questions?