Visual Dialog: Towards AI agents that can see, talk, and act
Dhruv Batra
Outline
• Cooperative Visual Dialog Agents
• Emergence of Grounded Dialog
• Negotiation Dialog Agents
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning [ICCV ‘17]
Abhishek Das* (Georgia Tech)
Satwik Kottur* (CMU)
José Moura (CMU)
Stefan Lee (Virginia Tech)
Dhruv Batra (Georgia Tech)
Visual Dialog: Task
• Given
  – Image I
  – History of human dialog (Q1, A1), (Q2, A2), …, (Qt−1, At−1)
  – Follow-up question Qt
• Task
  – Produce a free-form natural language answer At
(C) Dhruv Batra
Problems
• No goal – why are we talking?
• Agent not in control
  – Artificially injected at every round into a human conversation
  – Can't steer the conversation
  – Doesn't get to see its own errors during training
• Learning equivalent utterances
  – Many ways of answering the same question should be treated equally, but aren't
  – Is log-likelihood of the human response really a good metric?
Image Guessing Game
• Q-Bot asks questions; it is blindfolded (never sees the image)
• A-Bot answers the questions; it sees the image
Slide Credit: Abhishek Das
RL for Cooperative Dialog Agents
• Agents: (Q-bot, A-bot)
• Environment: image
• Action
  – Q-bot: question qt (symbol sequence), e.g. "Any people in the shot?"
  – A-bot: answer at (symbol sequence), e.g. "No, there aren't any."
  – Q-bot: image regression (guess the image representation)
• State
  – Q-bot: dialog history so far
  – A-bot: image + dialog history so far
• Policy: Q-bot and A-bot policies over symbol sequences
• Reward: improvement in Q-bot's image guess after each round
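The reward structure above can be sketched with plain REINFORCE. This is a minimal single-step toy (my setup, not the paper's model): Q-bot's policy is a softmax over three hypothetical questions, and the scalar reward stands in for the improvement in its image guess.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sketch: Q-bot picks one of 3 candidate questions; question 2 is the only
# one that improves the image guess, so it earns reward 1, the others 0.
# REINFORCE: logits += lr * (reward - baseline) * d/dlogits log pi(action).
logits = np.zeros(3)
QUESTION_REWARD = np.array([0.0, 0.0, 1.0])  # stand-in for guess improvement

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

baseline = 0.0
for step in range(500):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)                 # sample a question
    r = QUESTION_REWARD[a]                     # observe reward
    grad_logp = -probs
    grad_logp[a] += 1.0                        # d log pi(a) / d logits
    logits += 0.5 * (r - baseline) * grad_logp # policy-gradient ascent step
    baseline += 0.1 * (r - baseline)           # running-mean baseline

print(softmax(logits).argmax())  # the rewarded question wins: 2
```

The running-mean baseline is the usual variance-reduction trick; without it, unrewarded samples contribute no learning signal at all in this toy.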
Policy Networks
• Q-Bot
  – Fact embedding of each (question, answer) round
  – History encoder over the rounds so far
  – Question decoder and image-feature regression network
• A-Bot
  – VGG-16 image encoder (A-Bot sees the image)
  – Fact embedding and history encoder
  – Answer decoder
• Example dialog
  – Caption: Two zebra are walking around their pen at the zoo.
  – Q: Is this a zoo?  A: Yes
  – Q: How many zebra?  A: Two
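The Q-bot encoder stack can be sketched structurally. This is a shapes-only sketch under my own simplifications: a hashed bag-of-words stands in for the LSTM fact embedding, and a mean over rounds for the history encoder; the real model uses recurrent encoders (and VGG-16 on the A-bot side).

```python
import numpy as np

# Structural sketch only: embed each (question, answer) round into a "fact"
# vector, then summarize all rounds into one state for the question decoder.
D = 16  # embedding size (arbitrary for this sketch)

def fact_embedding(question, answer):
    """Embed one (question, answer) round into a D-dim fact vector
    (hashed bag-of-words stand-in for an LSTM encoder)."""
    v = np.zeros(D)
    for tok in (question + " " + answer).split():
        v[hash(tok) % D] += 1.0
    return v

def history_encoder(facts):
    """Summarize all rounds so far into a single state vector
    (mean-pooling stand-in for a recurrent history encoder)."""
    return np.mean(facts, axis=0)

rounds = [("is this a zoo?", "yes"), ("how many zebra?", "two")]
state = history_encoder([fact_embedding(q, a) for q, a in rounds])
print(state.shape)  # (16,)
```

The point of the factorization is that the decoder conditions on one fixed-size state no matter how many dialog rounds have occurred.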
Policy Gradients
REINFORCE Gradients
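For reference, the REINFORCE gradient these slides refer to can be written out (standard form; my notation follows the RL formulation earlier in the deck, with qt the sampled question and st the dialog state):

```latex
% REINFORCE gradient for Q-bot's question policy (the A-bot case is analogous):
% J(\theta) = \mathbb{E}_{\pi_\theta}[R], estimated from sampled dialogs.
\nabla_{\theta} J(\theta)
  = \mathbb{E}_{q_t \sim \pi_{\theta}}\!\left[ R \,
      \nabla_{\theta} \log \pi_{\theta}(q_t \mid s_t) \right]
  \approx \frac{1}{N} \sum_{i=1}^{N} R^{(i)}\,
      \nabla_{\theta} \log \pi_{\theta}\!\left(q_t^{(i)} \mid s_t^{(i)}\right)
```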
Turing Test
SL vs RL
SL Agents vs. RL Agents
Image Guessing
Concurrent Work
Outline
• Cooperative Visual Dialog Agents
• Emergence of Grounded Dialog
• Negotiation Dialog Agents
Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog [EMNLP ‘17]
Satwik Kottur* (CMU)
Stefan Lee (Virginia Tech)
José Moura (CMU)
Dhruv Batra (Georgia Tech)
Toy World
• Sanity check
• Simple, synthetic world
  – Instances: (shape, color, style)
  – Total of 4³ (= 64) instances

  shape: triangle, square, circle, star
  color: blue, green, red, purple
  style: filled, dashed, dotted, solid

• Example instances: (triangle, purple, filled), (square, blue, solid), (circle, blue, dotted)
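The instance space is small enough to enumerate directly. A quick sketch (attribute lists copied from this slide):

```python
from itertools import product

# 4 shapes x 4 colors x 4 styles = 4^3 = 64 instances
SHAPES = ["triangle", "square", "circle", "star"]
COLORS = ["blue", "green", "red", "purple"]
STYLES = ["filled", "dashed", "dotted", "solid"]

def all_instances():
    """Enumerate every (shape, color, style) instance in the toy world."""
    return list(product(SHAPES, COLORS, STYLES))

instances = all_instances()
print(len(instances))                                  # 64
print(("triangle", "purple", "filled") in instances)   # True
```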
Task & Talk
• Task (G)
  – Inquire a pair of attributes, e.g. (color, shape) or (shape, color)
• Talk
  – Single token per round
  – Two rounds
• Q-bot guesses a pair
  – Reward: +1 / −1
  – Prediction order matters!
• Example
  – Task: (color, shape); instance: (purple, square, filled)
  – Q1: Y  A1: 2   Q2: Z  A2: 3   Guess: (purple, square) → get reward!
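The reward rule can be made concrete. A sketch under my own naming conventions (instances ordered (shape, color, style) as on the Toy World slide; the function names are mine, not the paper's):

```python
# Q-bot must guess the two attributes named by the task G, in the task's order;
# reward is +1 only if both values are right, else -1.
ATTR_INDEX = {"shape": 0, "color": 1, "style": 2}

def reward(task, instance, guess):
    """task: e.g. ("color", "shape"); instance: a (shape, color, style) triple;
    guess: the pair of attribute values Q-bot predicts, in task order."""
    target = tuple(instance[ATTR_INDEX[a]] for a in task)
    return 1 if guess == target else -1

# Task (color, shape) on instance (square, purple, filled):
print(reward(("color", "shape"), ("square", "purple", "filled"), ("purple", "square")))  # 1
# Prediction order matters -- the same values in the wrong order fail:
print(reward(("color", "shape"), ("square", "purple", "filled"), ("square", "purple")))  # -1
```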
Emergence of Grounded Dialog
• Task: (style, color), prediction: (solid, green)
  – Q1: X  A1: 3   ("color?" → "green")
  – Q2: Z  A2: 4   ("style?" → "solid")
• Task: (style, shape), prediction: (filled, triangle)
  – Q1: Y  A1: 1   ("shape?" → "triangle")
  – Q2: Z  A2: 2   ("style?" → "filled")
Emergence of Grounded Dialog
• Compositional grounding
• Predict dialog for unseen instances
Summary of findings

Setting            |V_Q|  |V_A|  Q-bot memory  A-bot memory  Generalization
A. Overcomplete     64     64        Yes           Yes           25.6%
B. Attribute         3     12        Yes           Yes           38.5%
C. Minimal           3      4        Yes           No            74.4%

Characteristics
• A. Overcomplete: non-compositional language; Q-bot insignificant; inconsistent A-bot grounding; poor generalization
• B. Attribute: non-compositional language; Q-bot uses one round to convey the task; inconsistent A-bot grounding; poor generalization
• C. Minimal: compositional language; Q-bot uses both rounds; consistent A-bot grounding; good generalization
Deep Multi-Agent Communication
• NIPS '16
  – [DeepMind] Learning to Communicate with Deep Multi-Agent Reinforcement Learning. Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, Shimon Whiteson. NIPS '16.
  – [NYU / FAIR] Learning Multiagent Communication with Backpropagation. Sainbayar Sukhbaatar, Arthur Szlam, Rob Fergus. NIPS '16.
• arXiv '17
  – [OpenAI] Emergence of Grounded Compositional Language in Multi-Agent Populations. Igor Mordatch, Pieter Abbeel.
  – [FAIR] Multi-Agent Cooperation and the Emergence of (Natural) Language. Angeliki Lazaridou, Alexander Peysakhovich, Marco Baroni.
  – Learning to Play Guess Who? and Inventing a Grounded Language as a Consequence. Emilio Jorge, Mikael Kågebäck, Emil Gustavsson.
  – Emergence of Language with Multi-Agent Games: Learning to Communicate with Sequences of Symbols. Serhii Havrylov, Ivan Titov.
  – [Berkeley] Translating Neuralese. Jacob Andreas, Anca Dragan, Dan Klein. ACL 2017.
Outline
• Cooperative Visual Dialog Agents
• Emergence of Grounded Dialog
• Negotiation Dialog Agents
Deal or No Deal? End-to-End Learning for Negotiation Dialogues [EMNLP ‘17]
Mike Lewis (FAIR)
Denis Yarats (FAIR)
Devi Parikh (Georgia Tech)
Yann Dauphin (FAIR)
Dhruv Batra (Georgia Tech)
Why Negotiation?
• Negotiation sits on a spectrum between fully adversarial and fully cooperative settings
Slide Credit: Mike Lewis
Why Negotiation?
• Negotiation is useful when:
  – Agents have different goals
  – Not all goals can be achieved at once
  – (i.e., essentially all the time)
Why Negotiation?
• Both a linguistic and a reasoning problem
• Interpret multiple sentences, and generate a new message
• Plan ahead; make proposals and counter-offers; bluff, lie, compromise
Framework
• Both agents are given a reward function; neither can observe the other's
• The agents exchange dialogue until they agree on a common action
• Each agent then independently outputs its selected agreement
• If the agents agree, each receives its reward
Slide Credit: Mike Lewis
Object Division Task
• Agents are shown the same set of objects but different values for each
  – Example values: 2 points each, 1 point each, 5 points each
• Asked to agree on how to divide the objects between them
Multi-Issue Bargaining
  – I'd like the ball and hats
  – I need the hats, you can have the ball
  – Ok, if I get both books?
  – Ok, deal
Data Collection on AMT
Dataset
• ~6k dialogs • Average 6.6 turns/dialog • Average 7.6 words/turn • 80% agreed solutions • 77% Pareto Optimal solutions
Baseline Model
• A language model predicts both agents' tokens (e.g., "Give me both books" … "ok" … "deal")
• The input encoder reads the goal input at each timestep
• The output decoder attends over the complete dialogue, with a separate classifier for each output
SL-Pretraining
• Train to maximize likelihood of human-human dialogues
• Decode by sampling likely messages
• The model knows nothing about the task; it just tries to imitate human actions
• Agrees too easily
• Can't go beyond human strategies
Goal-based RL-Finetuning
• Generate dialogues using self-play (e.g., reward = 9 points)
• Backpropagate reward using REINFORCE
• Interleave with supervised updates
• Very sensitive to hyperparameters
Dialog Rollouts: Goal-based Decoding
• Dialog rollouts use the model to simulate the remainder of the conversation
• Average scores to estimate future reward
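The rollout-decoding idea can be sketched as: for each candidate next message, simulate several complete continuations of the negotiation and send the message with the best average outcome. Everything concrete here (the candidate strings, the simulate() stand-in, its payoffs) is mine for illustration, not the paper's.

```python
import random

random.seed(0)

# Hypothetical candidate next messages the model might propose.
CANDIDATES = ["i'll take the books", "you can have everything", "split them evenly"]

def simulate(message):
    """Stand-in for rolling the dialog model forward to an agreement and
    scoring the final deal; real rollouts sample from the learned model."""
    base = {"i'll take the books": 6, "you can have everything": 1, "split them evenly": 4}
    return base[message] + random.gauss(0, 1)  # noisy rollout outcome

def rollout_decode(candidates, n_rollouts=20):
    """Pick the message whose simulated continuations score best on average."""
    def estimate(msg):
        return sum(simulate(msg) for _ in range(n_rollouts)) / n_rollouts
    return max(candidates, key=estimate)

print(rollout_decode(CANDIDATES))  # "i'll take the books"
```

Averaging over several rollouts per candidate is what turns the noisy simulated outcomes into a usable estimate of future reward.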
Intrinsic Evaluation
[Bar charts comparing the Likelihood and Reinforce models on perplexity and average rank of the human response]
• Supervised learning gives the most "human-like" dialog
End-to-End Evaluation against SL negotiators
[Bar charts comparing SL, RL, and SL+Rollouts on relative score (all / agreed dialogs), % agreed, and % Pareto-optimal solutions]
End-to-End Evaluation against Turkers
[Bar charts comparing SL, RL, and SL+Rollouts on relative score (all / agreed dialogs), % agreed, and % Pareto-optimal solutions]
Model generates meaningful novel language
[Item values for the two agents: 6 1 0 and 3 1 3]
  – I need the book and hats
  – Can I have the hats and book?
  – I need the book and 2 hats
  – I can not make that deal. I need the ball and book, you can have the hats
  – No deal then
  – Sorry, I want the book and one hat
  – No deal doesn't work for me sorry
  – How about I give you the book and I keep the rest
  – Ok deal
Model can be deceptive to achieve its goals
[Item values for the two agents: 2 1 4 and 0 10 0]
  – I would like the ball and two hats
  – I need the book and 3 hats
  – That would work for me. I can take the ball and 1 hat
Conclusion
• Negotiation is both useful and challenging
• The end-to-end approach trades cheaper data for a harder modelling problem
• Goal-based training and decoding improve over likelihood
• Models can generate meaningful language and can be deceptive in pursuit of their goals
Outline
• Cooperative Visual Dialog Agents
• Emergence of Grounded Dialog
• Negotiation Dialog Agents
Sneak Peek: Inner Dialog: Pragmatic Visual Dialog Agents that Rollout a Mental Model of their Interlocutors
Inner Dialog
Agents in Virtual Environments
AI2 Thor
What next?
• So far – Vision + Language
  – Captioning → VQA → Visual Dialog
• Interacting with an intelligent agent
  – Perceive + Communicate + Act
  – Vision + Language + Reinforcement Learning
  – "Ok Google – can you find my picture where I was wearing this red shirt? And order me a new one?"
• Teaching with natural language
  – "No, not that shirt. This one."
Machine Learning & Perception Group
• Dhruv Batra, Assistant Professor
• Postdoc: Stefan Lee
• PhD: Qing Sun, Aishwarya Agrawal, Yash Goyal, Michael Cogswell, Abhishek Das, Ashwin Kalyan
• MS: Aroma Mahendru, Akrit Mohapatra, Deshraj Yadav, Tejas Khot, Viraj Prabhu
• Interns
Computer Vision Lab
Thanks!