Findings of the 2017 DiscoMT Shared Task on Cross-lingual Pronoun Prediction
Sharid Loáiciga, Sara Stymne, Preslav Nakov, Christian Hardmeier, Jörg Tiedemann, Mauro Cettolo & Yannick Versley
8th September 2017
Pronoun Translation
Machine translation problem caused by:
- Mismatch in pronoun systems: differences in gender, number, case, formality, animacy, etc.
- Null subjects: generating a pronoun in the target for which there is no pronoun in the source.
- Functional ambiguity: pronouns with the same surface form but different function.
English
Facit was a fantastic company. They were born deep in the Swedish forest, and they made the best mechanical calculators in the world.
German
Facit war ein großartiges Unternehmen. Entstanden tief im schwedischen Wald, bauten sie die besten mechanischen Rechenautomaten der Welt.
French
Facit était une entreprise fantastique, fondée dans la forêt suédoise. Elle fabriquait les meilleures calculatrices mécaniques au monde.
English
And among these organisms is a bacterium by the name of Deinococcus radiodurans. It is known to be able to withstand cold, dehydration, vacuum, acid, and, most notably, radiation.
German
Unter diesen Lebewesen existiert ein Bakterium namens Deinococcus radiodurans. Seine Resistenz gegen Kälte, Dehydration, Vakuum, Säuren ist bekannt sowie insbesondere gegen Strahlung.
French
Et parmi ces organismes, il y a une bactérie appelée Deinococcus radiodurans. Elle est connue pour être capable de supporter le froid, la déshydratation, le vide, l’acide et, le plus notable, les radiations.
Spanish
[Arupa] también explicó que ella usó la retina de Thomas y su ARN para tratar de desactivar el gen que causaba la formación de tumores. Luego nos llevó al congelador y nos mostró las dos muestras que todavía conservaba. Dijo que las guardó porque no sabía cuándo podría conseguir más. Después de esto, [el personal de laboratorio] agasajó a Callum con [un regalo de cumpleaños]. Era el kit de laboratorio para niños. Y también le ofrecieron unas prácticas.
English
[Arupa] also explained that she is using Thomas’s retina and his RNA to try to inactivate the gene that causes tumor formation. Then she took us to the freezer and she showed us the two samples that she still has. She said she saved it because she doesn’t know when she might get more. After this, [the lab staff] presented Callum with [a birthday gift]. It was a child’s lab kit. And they also offered him an internship.
Pleonastic
We use this same word, depression, to describe how a kid feels when it rains on his birthday, and to describe how somebody feels the minute before they commit suicide. → il
Nominal reference
A sense of belonging to the European Union will develop only gradually, as the EU achieves tangible results and explains more clearly what it is doing for people. → elle, il
Event reference
So in other words, I need to tell you everything I learned at medical school. But believe me, it isn’t going to take very long. → cela, ça
Pronoun prediction
What is the task?
source: If you ask for the happiness of the remembering self, it’s a completely different thing.
target: Si vous réfléchissez sur le bonheur du “moi des souvenirs”, REPLACE_11 est une toute autre histoire.
class: ce/c’
Advantages:
- It defines a set of possible translations (classes).
- It offers controlled testing of different types of linguistic information.
- Explicit anaphora or coreference resolution is not necessary.
More about the task
- The language pairs included are English→French, Spanish→English, and English↔German.
- Focus on subject pronouns; the data is filtered accordingly.
- The baseline consists of 5-gram language models for each target language, trained on all training data and additional monolingual data from WMT.
- The baseline is optimized with a penalty for shorter strings.
- Macro-averaged recall is the official score.
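The language-model baseline and its penalty can be illustrated with a toy sketch. This is only an illustration under stated assumptions, not the organizers' implementation: it uses an add-one-smoothed bigram model instead of a 5-gram model, a five-sentence made-up corpus, and hypothetical names (`train_bigram`, `fill_gap`, `none_penalty`). Each candidate filler is scored in context, and leaving the gap empty is penalized because the model otherwise tends to prefer the shorter string.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their left contexts over whitespace-tokenized sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), {"<s>", "</s>"}
    for sent in sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(toks)
        contexts.update(toks[:-1])
        bigrams.update(zip(toks, toks[1:]))
    return contexts, bigrams, len(vocab)

def logprob(tokens, model):
    """Add-one-smoothed bigram log-probability of a token sequence."""
    contexts, bigrams, vocab_size = model
    toks = ["<s>"] + tokens + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(toks, toks[1:])
    )

def fill_gap(tokens, gap, candidates, model, none_penalty=0.0):
    """Return the best filler for position `gap`; "" means no pronoun at all."""
    def score(cand):
        filled = tokens[:gap] + ([cand] if cand else []) + tokens[gap + 1:]
        # Assumption: the penalty is subtracted only when the gap is left empty.
        penalty = none_penalty if cand == "" else 0.0
        return logprob(filled, model) - penalty
    return max(candidates, key=score)

# Made-up training data, for illustration only.
corpus = [
    "il est là",
    "elle est là",
    "c' est une histoire",
    "c' est une autre histoire",
    "il mange",
]
model = train_bigram(corpus)
target = ["REPLACE", "est", "une", "autre", "histoire"]
candidates = ["il", "elle", "c'", ""]

no_penalty = fill_gap(target, 0, candidates, model)
with_penalty = fill_gap(target, 0, candidates, model, none_penalty=1.0)
print(repr(no_penalty), repr(with_penalty))
```

On this toy data the empty filler wins without a penalty and `c'` wins once the penalty is set to 1.0, which is the kind of bias the penalty is meant to counteract.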
Data
Corpora: News Commentary v.9, Europarl v.7, TED talks
Train
  de-en: X X X
  en-de, es-en: X X X X X
  en-fr: X X X
Dev & Test
  all pairs: X
- TED talks are particular with respect to pronoun use.
- Pronouns are frequent, including first and second person, but anaphoric references are not always clear.
English → French example from the development dataset
ce OTHER
ce|PRON qui|PRON
It 's an idiotic debate . It has to stop .
REPLACE_0 être|VER un|DET débat|NOM idiot|ADJ REPLACE_6 devoir|VER stopper|VER .|.
0-0 1-1 2-2 3-4 4-3 6-5 7-6 8-6 9-7 10-8

- pronoun class: limited number of classes to predict
- original token: not all original tokens are predicted
- source: not modified
- lemmatized target: discourages use of the target
- REPLACE_X: placeholder
- word alignments: bidirectional word alignments
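The fields of the example above can be read programmatically. The following is a minimal sketch, assuming the five fields arrive as one tab-separated line and that placeholders are written `REPLACE_N` with `N` the index of the source token; the names `Example` and `parse_example` are hypothetical and not part of any released tooling.

```python
import re
from typing import NamedTuple

class Example(NamedTuple):
    classes: list        # one class label per placeholder, in order
    source: list         # source tokens
    target: list         # (lemma, POS) pairs; placeholders have POS None
    placeholders: dict   # target position -> source position
    alignments: list     # (source index, target index) pairs

PLACEHOLDER = re.compile(r"^REPLACE_(\d+)$")

def parse_example(line: str) -> Example:
    """Parse one example, assuming five tab-separated fields."""
    classes, _originals, source, target, align = line.rstrip("\n").split("\t")
    tgt_tokens, placeholders = [], {}
    for pos, tok in enumerate(target.split()):
        match = PLACEHOLDER.match(tok)
        if match:
            placeholders[pos] = int(match.group(1))
            tgt_tokens.append((tok, None))
        else:
            # Split on the last "|" so that ".|." yields (".", ".").
            lemma, _, tag = tok.rpartition("|")
            tgt_tokens.append((lemma, tag))
    pairs = [tuple(int(i) for i in a.split("-")) for a in align.split()]
    return Example(classes.split(), source.split(), tgt_tokens, placeholders, pairs)

line = "\t".join([
    "ce OTHER",
    "ce|PRON qui|PRON",
    "It 's an idiotic debate . It has to stop .",
    "REPLACE_0 être|VER un|DET débat|NOM idiot|ADJ REPLACE_6 devoir|VER stopper|VER .|.",
    "0-0 1-1 2-2 3-4 4-3 6-5 7-6 8-6 9-7 10-8",
])
ex = parse_example(line)

# Pair each placeholder with the source pronoun it stands for and its gold class.
for (tgt_pos, src_pos), label in zip(sorted(ex.placeholders.items()), ex.classes):
    print(f"target position {tgt_pos}: source '{ex.source[src_pos]}' -> class {label}")
```

Both placeholders in this example point back to the source token "It", once with class `ce` and once with class `OTHER`.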
Submitted Systems
- TurkuNLP
- Uppsala
- NYU
- UU-Hardmeier
- Stymne16
Four new submissions covering all language pairs, which also included contrastive systems, and one comparative submission from a 2016 SVM system.
Key characteristics of the submitted systems
[Table: characteristics marked per system. Systems: TurkuNLP, NYU, Uppsala, UU-Hardmeier, UU-Stymne16.]
Characteristics compared:
- SVM
- Neural networks (convolutions, GRUs, BiLSTMs)
- Source pronoun representation
- Target POS tags
- Head dependencies
- Pre-trained word embeddings
- Source intra-sentential context
- Source inter-sentential context
- Target intra-sentential context
- Target inter-sentential context
Results
German - English
General Results
[Figure: macro-averaged recall per system, German-English. Systems: Baseline, Stymne16, UUHardmeier (Primary, Contrastive), NYU (Primary, Contrastive), Uppsala (Primary, Contrastive), TurkuNLP (Primary, Contrastive).]
- Pair with the highest scores.
- TurkuNLP has the best macro-averaged recall (overall).
- Bidirectional RNN with intra-sentential information.
English - German
General Results
[Figure: macro-averaged recall per system, English-German. Systems: Baseline, Stymne16, UUHardmeier (Primary, Contrastive), NYU (Primary, Contrastive), Uppsala (Primary, Contrastive), TurkuNLP (Primary, Contrastive).]
- Best ranking system is Uppsala Primary. BiLSTM with head dependency information.
- There is a big difference between the first and second places.
Spanish - English
General Results
[Figure: macro-averaged recall per system, Spanish-English. Systems: Baseline, UUHardmeier (Primary, Contrastive), NYU (Primary, Contrastive), Uppsala (Primary, Contrastive), TurkuNLP (Primary, Contrastive).]
- Pair with the lowest scores overall.
- Best scoring systems are TurkuNLP Primary and NYU Contrastive.
- Unlike the other teams, NYU submitted a full NMT system.
English - French
General Results
[Figure: macro-averaged recall per system, English-French. Systems: Baseline, Stymne16, UUHardmeier (Primary, Contrastive), NYU (Primary, Contrastive), Uppsala (Primary, Contrastive), TurkuNLP (Primary, Contrastive).]
- Best margin of improvement over the baseline among the language pairs.
- TurkuNLP presents the best macro-averaged recall.
What about the classes?
[Figure: recall per class (0 to 100) for each system. Panels: German to English (classes: he, she, it, they, you, this, these, there, OTHER) and Spanish to English (classes: he, she, it, they, you, there, OTHER). Systems: TurkuNLP, Uppsala, NYU, UUHardmeier, Baseline.]
What about the classes?
[Figure: recall per class (0 to 100) for each system. Panels: English to French (classes: ce, elle, elles, il, ils, cela, on, OTHER) and English to German (classes: er, sie, es, OTHER). Systems: TurkuNLP, Uppsala, NYU, UUHardmeier, UUStymne16, Baseline.]
Best 2017 and 2016 systems (Macro-avg. Recall)
[Figure: recall per class for the best 2017 system against the best 2016 system. Panels: English to German, German to English, English to French. Systems in the legends: Uppsala P17, TurkuNLP P17, TurkuNLP C17 (2017) and TurkuNLP P16 (2016).]
Metrics Comparison
Macro-averaged Recall vs Accuracy
[Figure: accuracy and macro-averaged recall (%) per language pair (DeEn, EnDe, EsEn, EnFr), one panel per system: TurkuNLP Primary, TurkuNLP Contrastive, Uppsala Primary, Uppsala Contrastive, NYU Primary, NYU Contrastive, UUHardmeier Primary, UUHardmeier Contrastive, UUStymne16, Baseline Optimized.]
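Why the two metrics can diverge is easiest to see on a small skewed example. The sketch below uses the standard definitions (accuracy as the share of correct predictions, macro-averaged recall as the unweighted mean of per-class recall); the label lists are made up for illustration and the function names are hypothetical.

```python
from collections import Counter

def accuracy(gold, pred):
    """Share of positions where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_averaged_recall(gold, pred):
    """Unweighted mean of per-class recall over the classes present in gold."""
    total = Counter(gold)
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Made-up data: a frequent class and a rare one, with a system that
# always predicts the frequent class.
gold = ["il"] * 8 + ["elle"] * 2
pred = ["il"] * 10

acc = accuracy(gold, pred)
mar = macro_averaged_recall(gold, pred)
print(f"accuracy={acc:.2f} macro-averaged recall={mar:.2f}")
```

Always predicting the frequent class scores 0.80 accuracy but only 0.50 macro-averaged recall here, since every class counts equally in the latter regardless of its frequency.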
Questions yet to be answered
- Results have improved since 2015, but what have we learned about pronoun translation?
- The best performing systems rely on intra-sentential context. How important, then, is inter-sentential context, and what is the best way to model it?
- n-gram language models were meant to be competitive for SMT. Is this still a telling baseline?
Conclusions
- The shared task winner this year is TurkuNLP.
- The shared task has made steady progress since its first edition in 2015. However, it is less clear that our understanding of pronoun translation has advanced.
- The NYU NMT system performed almost as well as specialized systems.
- As in general MT, neural models have shown advantages for the task.
- The task is not solved yet; there is plenty of room for improvement.
If you are interested in organizing this shared task next year, please let us know :)
Thank you!