Sherlock: Modeling Structured Knowledge in Images∗

Mohamed Elhoseiny1,2, Scott Cohen1, Walter Chang1, Brian Price1, Ahmed Elgammal2
1 Adobe Research, 2 Department of Computer Science, Rutgers University

It is a capital mistake to theorize in advance of the facts. -Sherlock Holmes ([3])

How can we build a machine learning method that can continuously gain structured visual knowledge by learning structured facts? We address this question by proposing a problem setting where training data comes as structured facts in images, including (1) objects, (2) attributes, (3) actions, and (4) interactions. Each structured fact has a semantic language view (e.g., <boy, playing>) and a visual view (an image with this fact). A human is able to efficiently gain visual knowledge by learning facts in a never-ending process and, we believe, in a structured way (e.g., understanding that “playing” is the action part of <boy, playing>, and hence generalizing to recognize the same action performed by another subject once that subject is also understood). Inspired by human visual perception, we propose a model that (1) is able to learn a representation that covers different types of structured facts, (2) can flexibly be fed structured fact language-visual view pairs in a never-ending way to gain more structured knowledge, (3) can generalize to unseen facts, and (4) allows retrieval of the fact language view given the visual view and vice versa. We also propose a novel method to generate hundreds of thousands of structured fact pairs from image caption data to train our model, which can be useful for other applications.

Motivation: Imagine a scene with the following facts: there are objects like a “man”, “baby”, and “toy”; these have attributes like “man smiling” and “baby smiling”; and there are interactions such as “man sitting on chair”, “baby sitting on chair”, and “man feeding baby”. We expect that the imagined scene will be very close to the image in Fig. 1 due to the precise structured description. On the other hand, if we were given the same image and asked to describe it, we might expect only a short title such as “man feeding a baby”.
In this work, we want an algorithm with the keen observational skills of a detective that is able to automatically identify the facts present in a scene.

Key differences to related works: State-of-the-art captioning methods (e.g., [5, 7, 10, 11]) rely on the idea of generating a sequence of words given an image, inspired by the success of sequence-to-sequence training of neural nets for translation (e.g., [2]). While an impressive step, the mechanism of these captioning systems makes them incapable of conveying structured information in an image or of providing a confidence for the generated caption given the facts in the image. In other words, while captions and unstructured tags communicate facts to humans, they may not be the best way to represent knowledge in a form that is searchable by machines. For example, if one searches for images of a “red flower”, a bag-of-words approach that considers “red” and “flower” separately may return images of flowers that are not red but have red elsewhere in the image. It is important to know that the user is looking for the fact <flower, red>.

There are indeed other research problems that aim to explicitly understand facts about an image. The main objective in object (e.g., [9]), scene (e.g., [13]), and activity categorization (e.g., [4]) methods is to have a set of discrete categories that the system can recognize. These methods face a scalability problem, since adding a new category means changing the architecture and re-training the model (e.g., to add a new output node). This drawback is shared by recent attribute-based methods (e.g., [1]), which recognize that attribute appearance depends on the class. This motivates them to jointly learn a classifier for every class-attribute pair, which scales poorly as the number of classes and attributes grows. Our goal is to model structured facts in a way that is scalable, i.e., to avoid the need to change the model in order to add more facts.
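The “red flower” search example in this section can be made concrete with a small sketch. It compares a bag-of-words match against a match on whole facts. The image annotations (`GARDEN`, `PICNIC`) and both function names are hypothetical and chosen only for illustration; they do not come from the paper.

```python
from typing import List, Set, Tuple

# A fact is a tuple of words, e.g. ("flower", "red") for an object with an attribute.
Fact = Tuple[str, ...]


def bag_of_words_match(query_words: Set[str], image_facts: List[Fact]) -> bool:
    """True if every query word occurs somewhere among the image's facts."""
    words_in_image = {word for fact in image_facts for word in fact}
    return query_words <= words_in_image


def structured_match(query_fact: Fact, image_facts: List[Fact]) -> bool:
    """True only if the image contains the query fact as a single unit."""
    return query_fact in image_facts


# Hypothetical annotations for two images.
GARDEN: List[Fact] = [("flower", "red"), ("grass", "green")]
PICNIC: List[Fact] = [("flower", "white"), ("blanket", "red")]

if __name__ == "__main__":
    query = ("flower", "red")
    for name, facts in [("garden", GARDEN), ("picnic", PICNIC)]:
        print(name, bag_of_words_match(set(query), facts), structured_match(query, facts))
```

Under these assumptions the bag-of-words match accepts the picnic image, because “red” and “flower” both occur in it, while the structured match rejects it, because no flower in that image is red.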
Furthermore, we aim for a model that can gain structured visual knowledge by being continuously fed instances of structured facts, which may be of different types (e.g., <man, tall>), with example image(s). At no point does the machine have a fixed dictionary of trained object, scene, or activity categories.

Goal: Modeling the connection between a structured fact in its language form and its visual view facilitates gaining richer visual knowledge, which is our focus in this work. Several applications can make use of modeling that connection, such as structured fact tagging, high precision image search from text, generating comprehensive descriptions of complicated scenes, and higher level reasoning about a scene. In our work, we aim to cover three types of facts with the same model. First order facts in images are defined by the set of objects and the scene in an image.

∗ Available on arXiv: http://arxiv.org/abs/1511.04891

Figure 1: Sherlock Problem: Gaining Structured Visual Knowledge
Figure 2: Problem Definition

Second order facts are defined by attributes and single-frame actions that might be performed by the objects (e.g., <baby, smiling>). An object interacting with another object defines a third order fact in the image. First, second, and third order facts are illustrated in Fig. 1. Inspired by the modifier concept in language grammar, we model higher order facts as visual modifiers of the lower order facts. For example, <baby, smiling> applies the visual modifier “smiling” to <baby>. Based on this notion, we propose a model for learning a representation of structured facts covering different orders that can be continuously fed with these facts to gain structured knowledge. Specifically, both the language and visual views of a structured fact inhabit a continuous space; see Fig. 2. Modeling structured facts in a continuous space allows us to extend the knowledge gained from the training facts to unseen facts. For example, if a model learned a set of facts during training, it should be able to recognize a new fact composed of their parts even though it did not see that fact before. Since the proposed setting aims for a model that has an eye for detail and potentially allows higher order reasoning (Fig. 1), we call this problem the Sherlock Problem.

Structured Fact Data Collection. To train a model for our setting, we needed to collect structured fact annotations in the form of (language view, visual view) pairs (i.e., a structured fact as the language view and an image containing this fact as the visual view). This is a challenging task. We started by manually annotating and mining several existing datasets to extract structured fact annotations, which we found limiting for covering different types of facts.
We propose a novel method to automatically annotate structured facts by processing image caption data, since structured facts in image captions are highly likely to be located in the image. Our Sherlock Automatic Fact Annotation (SAFA) method extracts fact language views from image captions and then localizes the facts to image regions to get visual views. SAFA collected tens of thousands of unique knowledge annotations within hundreds of thousands of images in just several hours.

Contributions. There are three contributions in this work: (1) We introduce the Sherlock problem of structured knowledge modeling in an image and propose a model that can learn structured facts of different types and perform both-view retrieval (retrieve the structured fact language view given the visual view, i.e., the image, and vice versa). (2) We propose an automatic structured fact annotation method based on sophisticated Natural Language Processing methods for acquiring high quality structured fact annotation pairs at large scale from free-form image descriptions. We applied the pipeline to the MS COCO [6] and Flickr30K Entities [8, 12] image caption datasets. In total, we built a structured fact dataset of more than 816,000 language and image view fact pairs covering more than 202,000 unique facts in the language view. (3) We develop a novel representation learning network architecture to jointly model the structured fact language and visual views by mapping both views into a common space and using a wild card loss to uniformly represent first, second, and third order facts. Our modeling approach is scalable to new facts without any change to the network architecture; see Figs. 3 and 4.
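The following is a minimal sketch of how facts of different orders might share one representation and support retrieval in both directions. It rests on several assumptions that are not taken from the paper: each fact is stored in three slots named subject, predicate, and object, with `None` for an absent slot; the word vectors in `WORD_VECS` and the image embeddings in `IMAGES` are hand-picked toy values; and the "wild card" idea is approximated by a squared distance that skips absent slots. The actual loss and network may differ.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# (subject, predicate, object); None marks a slot the fact does not specify.
# The slot names are illustrative and not the paper's notation.
Fact = Tuple[str, Optional[str], Optional[str]]

DIM = 2  # toy dimensionality of each slot

# Hypothetical hand-picked word vectors; a real model would learn them.
WORD_VECS: Dict[str, List[float]] = {
    "baby": [1.0, 0.0],
    "man": [0.0, 1.0],
    "smiling": [1.0, 1.0],
    "feeding": [-1.0, 1.0],
}


def embed_fact(fact: Fact) -> Tuple[List[float], List[float]]:
    """Return (vector, mask) with one DIM-sized slot per part; absent parts get a zero mask."""
    vec: List[float] = []
    mask: List[float] = []
    for part in fact:
        present = part is not None
        vec += WORD_VECS[part] if present else [0.0] * DIM
        mask += [1.0 if present else 0.0] * DIM
    return vec, mask


def wildcard_distance(visual: Sequence[float], fact: Fact) -> float:
    """Squared distance between an image embedding and a fact, skipping wild card slots."""
    vec, mask = embed_fact(fact)
    return sum(m * (v - f) ** 2 for v, f, m in zip(visual, vec, mask))


def retrieve_images(fact: Fact, images: Dict[str, Sequence[float]]) -> List[str]:
    """Visual view retrieval: image names ranked from best to worst match for the fact."""
    return sorted(images, key=lambda name: wildcard_distance(images[name], fact))


def retrieve_fact(visual: Sequence[float], candidates: List[Fact]) -> Fact:
    """Language view retrieval among candidate facts of the same order."""
    return min(candidates, key=lambda fact: wildcard_distance(visual, fact))


# Hypothetical image embeddings, as if produced by a visual encoder.
IMAGES: Dict[str, List[float]] = {
    "baby_smiling.jpg": [1.0, 0.0, 1.0, 1.0, 0.0, 0.0],
    "man_smiling.jpg": [0.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    "man_feeding_baby.jpg": [0.0, 1.0, -1.0, 1.0, 1.0, 0.0],
}

if __name__ == "__main__":
    print(retrieve_images(("man", "smiling", None), IMAGES))
    print(retrieve_fact(IMAGES["man_feeding_baby.jpg"],
                        [("man", "feeding", "baby"), ("baby", "feeding", "man")]))
```

In this sketch, adding a new fact only requires composing existing word vectors, so nothing about the layout changes when a new fact appears. The masked distances of facts with different orders are summed over different numbers of slots, which is why `retrieve_fact` is restricted here to candidates of one order.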



Figure 3: Sherlock Models. See Fig. 2 for the full system
Figure 4: Language View Retrieval Example (red means unseen facts)

References
[1] C.-Y. Chen and K. Grauman. Inferring analogous attributes. In CVPR, 2014.
[2] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
[3] S. A. C. Doyle. The adventure of the second stain. In The Return of Sherlock Holmes. 1905.
[4] G. Gkioxari and J. Malik. Finding action tubes. In CVPR, 2015.
[5] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
[6] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV. Springer, 2014.
[7] J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). In ICLR, 2015.
[8] B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015.
[9] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[10] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. 2015.
[11] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
[12] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, pages 67–78, 2014.
[13] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.
