Visual Dialog: Towards AI agents that can see, talk, and act
Dhruv Batra
Outline
• Cooperative Visual Dialog Agents
• Emergence of Grounded Dialog
• Negotiation Dialog Agents
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning [ICCV ‘17]
Abhishek Das* (Georgia Tech)
Satwik Kottur* (CMU)
José Moura (CMU)
Stefan Lee (Virginia Tech)
Dhruv Batra (Georgia Tech)
Visual Dialog: Task
• Given
  – Image I
  – History of human dialog (Q1, A1), (Q2, A2), …, (Qt−1, At−1)
  – Follow-up question Qt
• Task
  – Produce a free-form natural language answer At
(C) Dhruv Batra
Problems
• No goal – why are we talking?
• Agent not in control
  – Artificially injected at every round into a human conversation
  – Can't steer the conversation
  – Doesn't get to see its own errors during training
• Learning equivalent utterances
  – Many ways of answering the same question should be treated equally, but aren't
  – Is log-likelihood of the human response really a good metric?
Image Guessing Game
• Q-Bot asks questions; it is blindfolded (never sees the image)
• A-Bot answers the questions; it sees the image
Slide Credit: Abhishek Das
RL for Cooperative Dialog Agents
• Agents: (Q-bot, A-bot)
• Environment: image
• Action
  – Q-bot: question qt (symbol sequence), e.g. "Any people in the shot?"
  – A-bot: answer at (symbol sequence), e.g. "No, there aren't any."
  – Q-bot: image regression (guess the image representation)
• State
  – Q-bot: dialog history so far
  – A-bot: image + dialog history so far
• Policy: Q-bot and A-bot policies over symbol sequences
• Reward: improvement in Q-bot's image guess after each round
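The reward structure above can be sketched with plain REINFORCE. This is a minimal single-step toy (my setup, not the paper's model): Q-bot's policy is a softmax over three hypothetical questions, and the scalar reward stands in for the improvement in its image guess.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sketch: Q-bot picks one of 3 candidate questions; question 2 is the only
# one that improves the image guess, so it earns reward 1, the others 0.
# REINFORCE: logits += lr * (reward - baseline) * d/dlogits log pi(action).
logits = np.zeros(3)
QUESTION_REWARD = np.array([0.0, 0.0, 1.0])  # stand-in for guess improvement

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

baseline = 0.0
for step in range(500):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)                 # sample a question
    r = QUESTION_REWARD[a]                     # observe reward
    grad_logp = -probs
    grad_logp[a] += 1.0                        # d log pi(a) / d logits
    logits += 0.5 * (r - baseline) * grad_logp # policy-gradient ascent step
    baseline += 0.1 * (r - baseline)           # running-mean baseline

print(softmax(logits).argmax())  # the rewarded question wins: 2
```

The running-mean baseline is the usual variance-reduction trick; without it, unrewarded samples contribute no learning signal at all in this toy.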
Policy Networks
• Q-Bot
  – Fact embedding of each (question, answer) round
  – History encoder over the rounds so far
  – Question decoder and image-feature regression network
• A-Bot
  – VGG-16 image encoder (A-Bot sees the image)
  – Fact embedding and history encoder
  – Answer decoder
• Example dialog
  – Caption: Two zebra are walking around their pen at the zoo.
  – Q: Is this a zoo?  A: Yes
  – Q: How many zebra?  A: Two
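The Q-bot encoder stack can be sketched structurally. This is a shapes-only sketch under my own simplifications: a hashed bag-of-words stands in for the LSTM fact embedding, and a mean over rounds for the history encoder; the real model uses recurrent encoders (and VGG-16 on the A-bot side).

```python
import numpy as np

# Structural sketch only: embed each (question, answer) round into a "fact"
# vector, then summarize all rounds into one state for the question decoder.
D = 16  # embedding size (arbitrary for this sketch)

def fact_embedding(question, answer):
    """Embed one (question, answer) round into a D-dim fact vector
    (hashed bag-of-words stand-in for an LSTM encoder)."""
    v = np.zeros(D)
    for tok in (question + " " + answer).split():
        v[hash(tok) % D] += 1.0
    return v

def history_encoder(facts):
    """Summarize all rounds so far into a single state vector
    (mean-pooling stand-in for a recurrent history encoder)."""
    return np.mean(facts, axis=0)

rounds = [("is this a zoo?", "yes"), ("how many zebra?", "two")]
state = history_encoder([fact_embedding(q, a) for q, a in rounds])
print(state.shape)  # (16,)
```

The point of the factorization is that the decoder conditions on one fixed-size state no matter how many dialog rounds have occurred.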
Policy Gradients
REINFORCE Gradients
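For reference, the REINFORCE gradient these slides refer to can be written out (standard form; my notation follows the RL formulation earlier in the deck, with qt the sampled question and st the dialog state):

```latex
% REINFORCE gradient for Q-bot's question policy (the A-bot case is analogous):
% J(\theta) = \mathbb{E}_{\pi_\theta}[R], estimated from sampled dialogs.
\nabla_{\theta} J(\theta)
  = \mathbb{E}_{q_t \sim \pi_{\theta}}\!\left[ R \,
      \nabla_{\theta} \log \pi_{\theta}(q_t \mid s_t) \right]
  \approx \frac{1}{N} \sum_{i=1}^{N} R^{(i)}\,
      \nabla_{\theta} \log \pi_{\theta}\!\left(q_t^{(i)} \mid s_t^{(i)}\right)
```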
Turing Test
SL vs RL
SL Agents vs. RL Agents
Image Guessing
Concurrent Work
Outline
• Cooperative Visual Dialog Agents
• Emergence of Grounded Dialog
• Negotiation Dialog Agents
Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog [EMNLP ‘17]
Satwik Kottur* (CMU)
Stefan Lee (Virginia Tech)
José Moura (CMU)
Dhruv Batra (Georgia Tech)
Toy World
• Sanity check
• Simple, synthetic world
  – Instances: (shape, color, style)
  – Total of 4³ (= 64) instances

  shape: triangle, square, circle, star
  color: blue, green, red, purple
  style: filled, dashed, dotted, solid

• Example instances: (triangle, purple, filled), (square, blue, solid), (circle, blue, dotted)
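The instance space is small enough to enumerate directly. A quick sketch (attribute lists copied from this slide):

```python
from itertools import product

# 4 shapes x 4 colors x 4 styles = 4^3 = 64 instances
SHAPES = ["triangle", "square", "circle", "star"]
COLORS = ["blue", "green", "red", "purple"]
STYLES = ["filled", "dashed", "dotted", "solid"]

def all_instances():
    """Enumerate every (shape, color, style) instance in the toy world."""
    return list(product(SHAPES, COLORS, STYLES))

instances = all_instances()
print(len(instances))                                  # 64
print(("triangle", "purple", "filled") in instances)   # True
```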
Task & Talk
• Task (G)
  – Inquire a pair of attributes, e.g. (color, shape) or (shape, color)
• Talk
  – Single token per round
  – Two rounds
• Q-bot guesses a pair
  – Reward: +1 / −1
  – Prediction order matters!
• Example
  – Task: (color, shape); instance: (purple, square, filled)
  – Q1: Y  A1: 2   Q2: Z  A2: 3   Guess: (purple, square) → get reward!
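The reward rule can be made concrete. A sketch under my own naming conventions (instances ordered (shape, color, style) as on the Toy World slide; the function names are mine, not the paper's):

```python
# Q-bot must guess the two attributes named by the task G, in the task's order;
# reward is +1 only if both values are right, else -1.
ATTR_INDEX = {"shape": 0, "color": 1, "style": 2}

def reward(task, instance, guess):
    """task: e.g. ("color", "shape"); instance: a (shape, color, style) triple;
    guess: the pair of attribute values Q-bot predicts, in task order."""
    target = tuple(instance[ATTR_INDEX[a]] for a in task)
    return 1 if guess == target else -1

# Task (color, shape) on instance (square, purple, filled):
print(reward(("color", "shape"), ("square", "purple", "filled"), ("purple", "square")))  # 1
# Prediction order matters -- the same values in the wrong order fail:
print(reward(("color", "shape"), ("square", "purple", "filled"), ("square", "purple")))  # -1
```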
Emergence of Grounded Dialog
• Task: (style, color), prediction: (solid, green)
  – Q1: X  A1: 3   ("color?" → "green")
  – Q2: Z  A2: 4   ("style?" → "solid")
• Task: (style, shape), prediction: (filled, triangle)
  – Q1: Y  A1: 1   ("shape?" → "triangle")
  – Q2: Z  A2: 2   ("style?" → "filled")
Emergence of Grounded Dialog
• Compositional grounding
• Predict dialog for unseen instances
Summary of findings

Setting            |V_Q|  |V_A|  Q-bot memory  A-bot memory  Generalization
A. Overcomplete     64     64        Yes           Yes           25.6%
B. Attribute         3     12        Yes           Yes           38.5%
C. Minimal           3      4        Yes           No            74.4%

Characteristics
• A. Overcomplete: non-compositional language; Q-bot insignificant; inconsistent A-bot grounding; poor generalization
• B. Attribute: non-compositional language; Q-bot uses one round to convey the task; inconsistent A-bot grounding; poor generalization
• C. Minimal: compositional language; Q-bot uses both rounds; consistent A-bot grounding; good generalization
Deep Multi-Agent Communication
• NIPS '16
  – [DeepMind] Learning to Communicate with Deep Multi-Agent Reinforcement Learning. Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, Shimon Whiteson. NIPS '16.
  – [NYU / FAIR] Learning Multiagent Communication with Backpropagation. Sainbayar Sukhbaatar, Arthur Szlam, Rob Fergus. NIPS '16.
• arXiv '17
  – [OpenAI] Emergence of Grounded Compositional Language in Multi-Agent Populations. Igor Mordatch, Pieter Abbeel.
  – [FAIR] Multi-Agent Cooperation and the Emergence of (Natural) Language. Angeliki Lazaridou, Alexander Peysakhovich, Marco Baroni.
  – Learning to Play Guess Who? and Inventing a Grounded Language as a Consequence. Emilio Jorge, Mikael Kågebäck, Emil Gustavsson.
  – Emergence of Language with Multi-Agent Games: Learning to Communicate with Sequences of Symbols. Serhii Havrylov, Ivan Titov.
  – [Berkeley] Translating Neuralese. Jacob Andreas, Anca Dragan, Dan Klein. ACL 2017.
Outline
• Cooperative Visual Dialog Agents
• Emergence of Grounded Dialog
• Negotiation Dialog Agents
Deal or No Deal? End-to-End Learning for Negotiation Dialogues [EMNLP ‘17]
Mike Lewis (FAIR)
Denis Yarats (FAIR)
Devi Parikh (Georgia Tech)
Yann Dauphin (FAIR)
Dhruv Batra (Georgia Tech)
Why Negotiation?
• Negotiation sits on a spectrum between fully adversarial and fully cooperative settings
Slide Credit: Mike Lewis
Why Negotiation?
• Negotiation is useful when:
  – Agents have different goals
  – Not all goals can be achieved at once
  – (i.e., essentially all the time)
Why Negotiation?
• Both a linguistic and a reasoning problem
• Interpret multiple sentences, and generate a new message
• Plan ahead; make proposals and counter-offers; bluff, lie, compromise
Framework
• Both agents are given a reward function; neither can observe the other's
• The agents exchange dialogue until they agree on a common action
• Each agent then independently outputs its selected agreement
• If the agents agree, each receives its reward
Slide Credit: Mike Lewis
Object Division Task
• Agents are shown the same set of objects but different values for each
  – Example values: 2 points each, 1 point each, 5 points each
• Asked to agree on how to divide the objects between them
Multi-Issue Bargaining
  – I'd like the ball and hats
  – I need the hats, you can have the ball
  – Ok, if I get both books?
  – Ok, deal
Data Collection on AMT
Dataset
• ~6k dialogs • Average 6.6 turns/dialog • Average 7.6 words/turn • 80% agreed solutions • 77% Pareto Optimal solutions
Baseline Model
• A language model predicts both agents' tokens (e.g., "Give me both books" … "ok" … "deal")
• The input encoder reads the goal input at each timestep
• The output decoder attends over the complete dialogue, with a separate classifier for each output
SL-Pretraining
• Train to maximize likelihood of human-human dialogues
• Decode by sampling likely messages
• The model knows nothing about the task; it just tries to imitate human actions
• Agrees too easily
• Can't go beyond human strategies
Goal-based RL-Finetuning
• Generate dialogues using self-play (e.g., reward = 9 points)
• Backpropagate reward using REINFORCE
• Interleave with supervised updates
• Very sensitive to hyperparameters
Dialog Rollouts: Goal-based Decoding
• Dialog rollouts use the model to simulate the remainder of the conversation
• Average scores to estimate future reward
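The rollout-decoding idea can be sketched as: for each candidate next message, simulate several complete continuations of the negotiation and send the message with the best average outcome. Everything concrete here (the candidate strings, the simulate() stand-in, its payoffs) is mine for illustration, not the paper's.

```python
import random

random.seed(0)

# Hypothetical candidate next messages the model might propose.
CANDIDATES = ["i'll take the books", "you can have everything", "split them evenly"]

def simulate(message):
    """Stand-in for rolling the dialog model forward to an agreement and
    scoring the final deal; real rollouts sample from the learned model."""
    base = {"i'll take the books": 6, "you can have everything": 1, "split them evenly": 4}
    return base[message] + random.gauss(0, 1)  # noisy rollout outcome

def rollout_decode(candidates, n_rollouts=20):
    """Pick the message whose simulated continuations score best on average."""
    def estimate(msg):
        return sum(simulate(msg) for _ in range(n_rollouts)) / n_rollouts
    return max(candidates, key=estimate)

print(rollout_decode(CANDIDATES))  # "i'll take the books"
```

Averaging over several rollouts per candidate is what turns the noisy simulated outcomes into a usable estimate of future reward.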
Intrinsic Evaluation
[Bar charts comparing the Likelihood and Reinforce models on perplexity and average rank of the human response]
• Supervised learning gives the most "human-like" dialog
End-to-End Evaluation against SL negotiators
[Bar charts comparing SL, RL, and SL+Rollouts on relative score (all / agreed dialogs), % agreed, and % Pareto-optimal solutions]
End-to-End Evaluation against Turkers
[Bar charts comparing SL, RL, and SL+Rollouts on relative score (all / agreed dialogs), % agreed, and % Pareto-optimal solutions]
Model generates meaningful novel language
[Item values for the two agents: 6 1 0 and 3 1 3]
  – I need the book and hats
  – Can I have the hats and book?
  – I need the book and 2 hats
  – I can not make that deal. I need the ball and book, you can have the hats
  – No deal then
  – Sorry, I want the book and one hat
  – No deal doesn't work for me sorry
  – How about I give you the book and I keep the rest
  – Ok deal
Model can be deceptive to achieve its goals
[Item values for the two agents: 2 1 4 and 0 10 0]
  – I would like the ball and two hats
  – I need the book and 3 hats
  – That would work for me. I can take the ball and 1 hat
Conclusion
• Negotiation is both useful and challenging
• The end-to-end approach trades cheaper data for a harder modelling problem
• Goal-based training and decoding improve over likelihood
• Models can generate meaningful language and can be deceptive in pursuit of their goals
Outline
• Cooperative Visual Dialog Agents
• Emergence of Grounded Dialog
• Negotiation Dialog Agents
Sneak Peek: Inner Dialog: Pragmatic Visual Dialog Agents that Rollout a Mental Model of their Interlocutors
Inner Dialog
Agents in Virtual Environments
AI2 Thor
What next?
• So far – Vision + Language
  – Captioning → VQA → Visual Dialog
• Interacting with an intelligent agent
  – Perceive + Communicate + Act
  – Vision + Language + Reinforcement Learning
  – "Ok Google – can you find my picture where I was wearing this red shirt? And order me a new one?"
• Teaching with natural language
  – "No, not that shirt. This one."
Machine Learning & Perception Group
• Dhruv Batra, Assistant Professor
• Postdoc: Stefan Lee
• PhD: Qing Sun, Aishwarya Agrawal, Yash Goyal, Michael Cogswell, Abhishek Das, Ashwin Kalyan
• MS: Aroma Mahendru, Akrit Mohapatra, Deshraj Yadav, Tejas Khot, Viraj Prabhu
• Interns
Computer Vision Lab
Thanks!