Relating Natural Language and Visual Recognition
Marcus Rohrbach1,2, Jacob Andreas1, Trevor Darrell1, Jiashi Feng7, Lisa Anne Hendricks1, Dan Klein1, Ronghang Hu1, Raymond Mooney4, Anna Rohrbach3, Kate Saenko5, Bernt Schiele3, Subhashini Venugopalan4, Huazhe Xu6
1 UC Berkeley EECS, 2 ICSI, Berkeley, 3 MPI for Informatics, 4 UT Austin, 5 UMass Lowell, 6 Tsinghua University, 7 National University of Singapore
In this poster we will relate and discuss several of our most recent efforts for “Closing the Loop Between Vision and Language”. In Section 1 we show how we can describe videos [6] and images [3] with natural language sentences (Vision ⇒ Language). In Section 2 we show how we ground phrases in images [4] (Language ⇔ Vision). And finally, in Section 3, we discuss how compositional computation allows for effective question answering about images [1] (Language & Vision ⇒ Language).
[Figure 2 columns, per example: SMT [7], S2VT [8], Visual labels [6], Reference.]
1. Describing visual content with natural language sentences
Figure 2: Results on MPII Movie Description dataset [7]. The “Visual labels” approach [6] identifies activities, objects, and places better than related work. From [6].
In [6], we decompose the challenging task of describing movies into two steps. First we recognize the most relevant activities/verbs, scenes, and objects, and then we describe them with natural sentences using a recurrent network, namely an LSTM. The approach is visualized in Fig. 1 and achieves state-of-the-art performance on the challenging MPII Movie Description dataset [7], with respect to both automatic and human evaluation. Qualitative results are shown in Fig. 2.
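The two-step decomposition can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the score values, the names `SCORES`, `recognize`, and `generate_sentence` are hypothetical, and the second step uses a fixed sentence template as a stand-in for the LSTM, which is a learned generator in [6].

```python
from typing import Dict, List

# Hypothetical classifier scores for one video snippet (illustrative numbers,
# not results from the paper).
SCORES: Dict[str, Dict[str, float]] = {
    "verbs": {"enters": 0.81, "leaves": 0.12, "sits": 0.07},
    "places": {"room": 0.74, "street": 0.20, "forest": 0.06},
    "objects": {"door": 0.66, "car": 0.21, "bottle": 0.13},
}


def recognize(scores: Dict[str, Dict[str, float]], k: int = 1) -> Dict[str, List[str]]:
    """Step 1: keep the k highest-scoring labels in each group."""
    return {
        group: sorted(group_scores, key=group_scores.get, reverse=True)[:k]
        for group, group_scores in scores.items()
    }


def generate_sentence(labels: Dict[str, List[str]]) -> str:
    """Step 2: turn labels into a sentence.

    A fixed template stands in for the LSTM here; it is NOT the method of [6].
    """
    verb = labels["verbs"][0]
    place = labels["places"][0]
    return f"Someone {verb} the {place}."


labels = recognize(SCORES)
sentence = generate_sentence(labels)
print(labels)
print(sentence)
```

The point of the sketch is only the interface between the two steps: the recognition stage produces a small set of labels, and the language stage consumes nothing but those labels.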
State-of-the-art deep image and video captioning approaches are limited to describing objects that appear in paired image/sentence data. Hendricks et al. [3] show, with the Deep Compositional Captioner (DCC), how to exploit unpaired vision-only and language-only data to describe novel categories (Fig. 3).
[Figure 1 diagram: Video → Visual recognition (Verbs, Places, Objects) → Language generation (LSTM) → Sentence: “Someone enters the room.”]
[Figure 2 example sentences.
Example 1. SMT: Someone is a man, someone is a man. S2VT: Someone looks at him, someone turns to someone. Visual labels: Someone is standing in the crowd, a little man with a little smile. Reference: Someone, back in elf guise, is trying to calm the kids.
Example 2. SMT: The car is a water of the water. S2VT: On the door, opens the door opens. Visual labels: The fellowship are in the courtyard. Reference: They cross the quadrangle below and run along the cloister.
Example 3. SMT: Someone is down the door, someone is a back of the door, and someone is a door. S2VT: Someone shakes his head and looks at someone. Visual labels: Someone takes a drink and pours it into the water. Reference: Someone grabs a vodka bottle standing open on the counter and liberally pours some on the hand.]
[Figure 3 panels. Unpaired image data: otter, alpaca, pizza, bus. Paired image-sentence data: “A bus driving down the street.” “Yummy pizza sitting on the table.” Unpaired text data: “Otters live in a variety of aquatic environments.” “Pepperoni is a popular pizza topping.” Output of existing methods: “A dog sitting on a boat in the water.” Output of DCC: “A otter that is sitting in the water.”]
Figure 3: Existing deep caption methods are unable to generate sentences about objects unseen in caption corpora (like otter). In contrast, our model (DCC) effectively incorporates information from independent image datasets and text corpora to compose descriptions about novel objects without any paired image-caption data. From [3].
Figure 1: Describing movie snippets with natural sentences.
[Figure 4 diagram: an input image with object proposals gives the candidate location set; the natural language query (“white car on the right”), the local descriptor, the spatial configuration, and the global context feed the Spatial Context Recurrent ConvNet, which outputs candidate scores (0.28, 0.15, 0.42, ..., 0.07, 0.54); the top-scoring candidate is the output object retrieval result.]
[Example query phrases from the figures: “man squatting”, “standing guy”, “leaves of left tree”, “pillar building in the middle”.]
Figure 4: Overview of our approach to grounding phrases in images. Given an input image, a text query and a set of candidate locations (e.g. from object proposal methods), a recurrent neural network model is used to score candidate locations based on local descriptors, spatial configurations and global context. The highest scoring candidate is retrieved. From [4].
Figure 5: Correctly localized examples (IoU ≥ 0.5) on ReferIt [5] with EdgeBox [9]. Ground truth in yellow and correctly retrieved box in green.
[Figure 6 diagram labels: question “Where is the dog?”, Parser, Layout, LSTM, modules count and dog, answer couch.]
2. Grounding natural language phrases in images
In many human-computer interaction or robotic scenarios it is important to be able to ground, i.e. localize, referential natural language expressions in visual content. Hu et al. [4] propose to do this by ranking bounding box proposals using local, context, and spatial information (Fig. 4). An important aspect of our approach is transferring models trained on full-image description datasets to this new task.
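The retrieval step and the IoU ≥ 0.5 correctness criterion used in Fig. 5 can be sketched as follows. This is a minimal illustration under stated assumptions: the candidate boxes, the ground-truth box, and the function names `iou`, `retrieve`, and `is_correct` are hypothetical, and the scores are given as plain numbers, whereas in [4] they come from the recurrent network.

```python
from typing import Sequence, Tuple

# A box is (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def retrieve(candidates: Sequence[Box], scores: Sequence[float]) -> Box:
    """Return the candidate box with the highest score."""
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]


def is_correct(predicted: Box, ground_truth: Box, threshold: float = 0.5) -> bool:
    """A retrieval counts as correct when IoU with the ground truth >= threshold."""
    return iou(predicted, ground_truth) >= threshold


# Hypothetical proposals and scores for one query (illustrative values only).
candidates = [(0.0, 0.0, 10.0, 10.0), (60.0, 5.0, 80.0, 25.0), (48.0, 22.0, 92.0, 58.0)]
scores = [0.28, 0.15, 0.54]
ground_truth = (50.0, 20.0, 90.0, 60.0)

top = retrieve(candidates, scores)
print(top, round(iou(top, ground_truth), 3), is_correct(top, ground_truth))
```

Because only the single top-scoring proposal is evaluated, a query can be counted as wrong even when a well-overlapping proposal exists but is ranked second.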
[Figure 6 diagram labels: modules where, cat, color, standing, ...; CNN.]
Figure 6: We use a natural language parser to dynamically lay out a deep network composed of reusable modules. For visual question answering tasks, an additional sequence model provides sentence context and learns common-sense knowledge. From [1].
3. Answering questions about images
In the third part we discuss how to answer natural language questions about images. Andreas et al. [1] describe an approach to visual question answering based on neural module networks (NMNs). The proposed approach answers natural language questions about images using collections of jointly trained neural “modules”, dynamically composed into deep networks based on linguistic structure. Concretely, given an image and an associated question (e.g. where is the dog?), we wish to predict a corresponding answer (e.g. on the couch, or perhaps just couch) by decomposing the question into a where module and a dog module, as shown in Fig. 6.
This approach surpasses the performance of prior work on the MSCOCO-based VQA dataset [2] as well as on a new, challenging shapes dataset that requires composing up to six modules to answer a question.
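The idea of assembling a question-specific computation from reusable modules can be sketched in a few lines. This is a toy under loud assumptions: the modules below are hand-coded functions, not the jointly trained neural modules of [1]; the image is a hypothetical list of labeled cells rather than CNN features; and the layout is passed in directly instead of being produced by a parser. All names (`IMAGE`, `make_attend`, `execute`, and so on) are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Toy "image": a few cells, each with a region name and the things visible in it.
Cell = Dict[str, object]
IMAGE: List[Cell] = [
    {"region": "floor", "things": {"cat"}},
    {"region": "couch", "things": {"dog"}},
    {"region": "window", "things": set()},
]


def make_attend(word: str) -> Callable[[List[Cell]], List[float]]:
    """Build an attention module for one word: 1.0 where the word is present."""
    def attend(image: List[Cell]) -> List[float]:
        return [1.0 if word in cell["things"] else 0.0 for cell in image]
    return attend


def where(image: List[Cell], attention: List[float]) -> str:
    """Answer with the region name of the most attended cell."""
    best = max(range(len(image)), key=lambda i: attention[i])
    return str(image[best]["region"])


def count(image: List[Cell], attention: List[float]) -> str:
    """Answer with the number of attended cells."""
    return str(int(sum(attention)))


# Reusable module inventory, shared across all questions.
ATTEND = {word: make_attend(word) for word in ("dog", "cat")}
ANSWER = {"where": where, "count": count}


def execute(layout: Tuple[str, str], image: List[Cell]) -> str:
    """Compose the modules named by the layout and run them on the image."""
    head, argument = layout
    attention = ATTEND[argument](image)
    return ANSWER[head](image, attention)


# Layout for "where is the dog?", written by hand here.
print(execute(("where", "dog"), IMAGE))
print(execute(("count", "cat"), IMAGE))
```

The same `where` and `count` functions are reused with different attention modules, which mirrors how a small module inventory can cover many questions once the layout is chosen per question.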
References
[1] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Deep compositional question answering with neural module networks. arXiv:1511.02799, 2015.
[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[3] L. A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell. Deep compositional captioning: Describing novel object categories without paired training data. arXiv:1511, 2015.
[4] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. arXiv:1511.04164, 2015.
[5] S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[6] A. Rohrbach, M. Rohrbach, and B. Schiele. The long-short story of movie description. In Proceedings of the German Conference on Pattern Recognition (GCPR), 2015.
[7] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele. A dataset for movie description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[8] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence – video to text. arXiv:1505.00487v2, 2015.
[9] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In Proceedings of the European Conference on Computer Vision (ECCV), pages 391–405. Springer, 2014.