Findings of the 2017 DiscoMT Shared Task on Cross-lingual Pronoun Prediction

Sharid Loáiciga, Sara Stymne, Preslav Nakov, Christian Hardmeier, Jörg Tiedemann, Mauro Cettolo & Yannick Versley

8th September 2017

Pronoun Translation

Machine translation problem caused by:

- Mismatch in pronoun systems: differences in gender, number, case, formality, animacy, etc.
- Null subjects: generating a pronoun in the target for which there is no pronoun in the source.
- Functional ambiguity: pronouns with the same surface form but different function.

English

Facit was a fantastic company. They were born deep in the Swedish forest, and they made the best mechanical calculators in the world.

German

Facit war ein großartiges Unternehmen. Entstanden tief im schwedischen Wald, bauten sie die besten mechanischen Rechenautomaten der Welt.

French

Facit était une entreprise fantastique, fondée dans la forêt suédoise. Elle fabriquait les meilleures calculatrices mécaniques au monde.

English

And among these organisms is a bacterium by the name of Deinococcus radiodurans. It is known to be able to withstand cold, dehydration, vacuum, acid, and, most notably, radiation.

German

Unter diesen Lebewesen existiert ein Bakterium namens Deinococcus radiodurans. Seine Resistenz gegen Kälte, Dehydration, Vakuum, Säuren ist bekannt sowie insbesondere gegen Strahlung.

French

Et parmi ces organismes, il y a une bactérie appelée Deinococcus radiodurans. Elle est connue pour être capable de supporter le froid, la déshydratation, le vide, l’acide et, le plus notable, les radiations.

Spanish

[Arupa] también explicó que ella usó la retina de Thomas y su ARN para tratar de desactivar el gen que causaba la formación de tumores. Luego nos llevó al congelador y nos mostró las dos muestras que todavía conservaba. Dijo que las guardó porque no sabía cuándo podría conseguir más. Después de esto, [el personal de laboratorio] agasajó a Callum con [un regalo de cumpleaños]. Era el kit de laboratorio para niños. Y también le ofrecieron unas prácticas.

English

[Arupa] also explained that she is using Thomas’s retina and his RNA to try to inactivate the gene that causes tumor formation. Then she took us to the freezer and she showed us the two samples that she still has. She said she saved it because she doesn’t know when she might get more. After this, [the lab staff] presented Callum with [a birthday gift]. It was a child’s lab kit. And they also offered him an internship.

Pleonastic

We use this same word, depression, to describe how a kid feels when it rains on his birthday, and to describe how somebody feels the minute before they commit suicide. → il

Nominal reference

A sense of belonging to the European Union will develop only gradually, as the EU achieves tangible results and explains more clearly what it is doing for people. → elle, il

Event reference

So in other words, I need to tell you everything I learned at medical school. But believe me, it isn’t going to take very long. → cela, ça

Pronoun prediction: What is the task?

source: If you ask for the happiness of the remembering self, it’s a completely different thing.
target: Si vous réfléchissez sur le bonheur du “moi des souvenirs”, REPLACE_11 est une toute autre histoire.
class: ce/c’

Advantages:

- It defines a set of possible translations (classes).
- It offers controlled testing of different types of linguistic information.
- Explicit anaphora or coreference resolution is not necessary.

More about the task

- The language pairs included are English→French, Spanish→English, and English↔German.
- The focus is on subject pronouns; the data is filtered accordingly.
- The baseline consists of 5-gram language models for each target language, trained on all training data and additional monolingual data from WMT. It is optimized with a penalty for shorter strings.
- Macro-averaged recall is the official score.
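As a rough illustration of the official score, the sketch below computes macro-averaged recall as the unweighted mean of per-class recall and contrasts it with accuracy. The function names and the toy labels are invented for this example, and averaging over the classes present in the gold labels is an assumption; the official scorer may handle edge cases differently.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of instances whose predicted class equals the gold class."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def macro_averaged_recall(gold, predicted):
    """Unweighted mean of per-class recall over the classes present in gold."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    total = Counter(gold)  # number of gold instances per class
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Invented toy labels (not shared task data): six OTHER, two il, two elle.
gold = ["OTHER"] * 6 + ["il"] * 2 + ["elle"] * 2
always_other = ["OTHER"] * len(gold)

print(f"accuracy:              {accuracy(gold, always_other):.3f}")
print(f"macro-averaged recall: {macro_averaged_recall(gold, always_other):.3f}")
```

On these toy labels, always predicting OTHER gets 6 of 10 instances right (accuracy 0.600) but recalls only one of the three classes (macro-averaged recall 0.333), which shows how the macro average gives rare pronoun classes the same weight as frequent ones.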

Data

                          News Commentary v.9   Europarl v.7   TED talks
Train        de-en                 X                  X            X
             en-de                 X                  X            X
             es-en                                    X            X
             en-fr                 X                  X            X
Dev & Test   all pairs                                             X

- TED talks are particular with respect to pronoun use.
- Pronouns are frequent, including first and second person, but anaphoric references are not always clear.

English → French example from the development dataset

pronoun class: ce OTHER
original token: ce|PRON qui|PRON
source: It 's an idiotic debate . It has to stop .
lemmatized target: REPLACE_0 être|VER un|DET débat|NOM idiot|ADJ REPLACE_6 devoir|VER stopper|VER .|.
word alignments: 0-0 1-1 2-2 3-4 4-3 6-5 7-6 8-6 9-7 10-8

- pronoun class: limited number of classes to predict
- original token: not all original tokens are predicted
- source: not modified
- lemmatized target: discourages use of the target
- REPLACE_X: placeholder
- word alignments: bidirectional word alignments
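The fields of such an example can be read with a few lines of code. The sketch below is only an illustration: it assumes the five fields are tab-separated and appear in the order class, original token, source, target, alignments, and that the number in REPLACE_X is the index of the source pronoun, as it is in the example above. The names `Example`, `parse_example` and `placeholders` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Example:
    classes: list          # one class label per placeholder
    original_tokens: list  # the tokens that were replaced in the target
    source: list           # unmodified source tokens
    target: list           # lemma|POS target tokens with REPLACE_X placeholders
    alignments: list       # (source index, target index) pairs


def parse_example(line):
    """Split one line into its five fields (assumed to be tab-separated)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    classes, originals, source, target, align = (f.split() for f in fields)
    pairs = [tuple(int(i) for i in a.split("-")) for a in align]
    return Example(classes, originals, source, target, pairs)


def placeholders(example):
    """Pair each placeholder with its class and the source token it points to."""
    slots = [t for t in example.target if t.startswith("REPLACE_")]
    result = []
    for label, slot in zip(example.classes, slots):
        source_index = int(slot.split("_")[1])
        result.append((slot, label, example.source[source_index]))
    return result


EXAMPLE_LINE = "\t".join([
    "ce OTHER",
    "ce|PRON qui|PRON",
    "It 's an idiotic debate . It has to stop .",
    "REPLACE_0 être|VER un|DET débat|NOM idiot|ADJ REPLACE_6 devoir|VER stopper|VER .|.",
    "0-0 1-1 2-2 3-4 4-3 6-5 7-6 8-6 9-7 10-8",
])

example = parse_example(EXAMPLE_LINE)
for slot, label, source_token in placeholders(example):
    print(slot, label, source_token)
```

For this example, both placeholders point to the source token "It": the first is to be predicted as ce and the second as OTHER.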

Submitted Systems

- TurkuNLP
- Uppsala
- NYU
- UU-Hardmeier
- Stymne16

Four new submissions covered all language pairs and also yielded contrastive systems; one comparative submission came from a 2016 SVM system.

Key characteristics of the submitted systems

[Table: one column per system (TurkuNLP, NYU, Uppsala, UU-Hardmeier, UU-Stymne16) and one row per characteristic, marked X where the system uses it.]

Characteristics compared:

- SVM
- Neural networks: convolutions, GRUs, BiLSTMs
- Source pronoun representation
- Target POS tags
- Head dependencies
- Pre-trained word embeddings
- Source intra-sentential context
- Source inter-sentential context
- Target intra-sentential context
- Target inter-sentential context

Results

German - English

[Figure: General Results, German−English. Bar chart of macro-averaged recall per system: Baseline, Stymne16, and the primary (P) and contrastive (C) submissions of TurkuNLP, Uppsala, NYU and UUHardmeier.]

- Pair with the highest scores.
- TurkuNLP has the best macro-averaged recall (overall).
- Bidirectional RNN with intra-sentential information.

English - German

[Figure: General Results, English−German. Bar chart of macro-averaged recall per system: Baseline, Stymne16, and the primary (P) and contrastive (C) submissions of TurkuNLP, Uppsala, NYU and UUHardmeier.]

- Best ranking system is Uppsala Primary.
- BiLSTM with head dependency information.
- There is a big difference between the first and second places.

Spanish - English

[Figure: General Results, Spanish−English. Bar chart of macro-averaged recall per system: Baseline and the primary (P) and contrastive (C) submissions of TurkuNLP, Uppsala, NYU and UUHardmeier.]

- Pair with the lowest scores overall.
- Best scoring systems are TurkuNLP Primary and NYU Contrastive.
- Unlike the other teams, NYU submitted a full NMT system.

English - French

[Figure: General Results, English−French. Bar chart of macro-averaged recall per system: Baseline, Stymne16, and the primary (P) and contrastive (C) submissions of TurkuNLP, Uppsala, NYU and UUHardmeier.]

- Best margin of improvement over the baseline among the language pairs.
- TurkuNLP presents the best macro-averaged recall.

What about the classes? German to English, Spanish to English

[Figure: two panels (German to English, Spanish to English) showing recall (0 to 100) per class for TurkuNLP, Uppsala, NYU, UUHardmeier and the Baseline. Number labels attached to the classes across the two panels: OTHER 64 and 23; there 8 and 22; these 1; this 2; you 12 and 24; they 36 and 40; it 63 and 58; she 15 and 17; he 12 and 20.]

What about the classes? English to French, English to German

[Figure: two panels (English to French, English to German) showing recall (0 to 100) per class for TurkuNLP, Uppsala, NYU, UUHardmeier, UUStymne16 and the Baseline. Number labels attached to the classes: on 5; cela 5; ils 35; il 29; elles 12; elle 12; ce 32; es 52; sie 62; er 8; OTHER 51 and 62.]

Best 2017 and 2016 systems (Macro-avg. Recall)

[Figure: three panels (English to German, German to English, English to French) comparing the recall per class of the best 2017 system with that of the best 2016 system. Legends: Uppsala P17 vs. TurkuNLP P16; TurkuNLP P17 vs. TurkuNLP P16; TurkuNLP C17 vs. TurkuNLP P16.]

Metrics Comparison: Macro-averaged Recall vs Accuracy

[Figure: one panel per system (TurkuNLP Primary, TurkuNLP Contrastive, Uppsala Primary, Uppsala Contrastive, NYU Primary, NYU Contrastive, UUHardmeier Primary, UUHardmeier Contrastive, UUStymne16, Baseline Optimized) showing accuracy and recall in % (0 to 80) for each language pair (DeEn, EnDe, EsEn, EnFr).]

Questions yet to be answered

- Results have improved since 2015, but what have we learned about pronoun translation?
- The best performing systems rely on intra-sentential context. What, then, is the importance of inter-sentential context, and what is the best way to model it?
- n-gram language models were meant to be competitive for SMT. Is this still a telling baseline?

Conclusions

- The shared task winner this year is TurkuNLP.
- The shared task has made steady progress since its first edition in 2015. However, it is less clear that our understanding of pronoun translation has advanced.
- The NYU NMT system performed almost as well as specialized systems.
- As in general MT, neural models have shown advantages for the task.
- The task is not solved yet; there is plenty of room for improvement.

If you are interested in organizing this shared task next year, please let us know :)

Thank you!
