Learning the structure of objects from Web supervision

David Novotny1, Diane Larlus2, Andrea Vedaldi1 (http://www.robots.ox.ac.uk/~vedaldi)

1 Visual Geometry Group, University of Oxford
2 Computer Vision Group, Xerox Research Centre Europe

1 Introduction

While recent research in image understanding has often focused on recognizing more types of objects, understanding more about the objects is just as important. Advances in tasks such as image captioning, activity recognition, and many others have ventured far beyond the standard image classification or object detection problems in order to extract richer information from visual scenes. Even so, image understanding remains rather crude, oblivious to most of the nuances of real world images. Consider for example the notion of object category, which is a basic unit of understanding in computer vision. Modern benchmarks such as ILSVRC consider an increasingly large number of such categories. However, there is only limited understanding of their internal geometric structure and semantics.

We aim at filling this gap by jointly learning about objects, their semantic parts, and their geometric relationships. Learning about semantic nameable parts plays a crucial role in visual understanding. However, standard large scale approaches to this task face the difficulty of collecting vast quantities of corresponding annotated example images. Instead, scalable algorithms must be designed to discover this information with minimal or no supervision.

As others have done, we look at Web supervision to learn the structure of objects from thousands of images obtained automatically by querying search engines. However, this poses two significant challenges: identifying images of the semantic parts in very noisy Web results while, at the same time, discovering their geometric relationships in the presence of drastic scale changes between parts seen in the context of the whole object and parts seen in isolation. To this end, we propose a novel embedding that encodes the geometry of an object and of its parts by expressing them relative to a reference frame that is robust to large scale variations. The reference frame is formed by non-semantic anchor parts which are learned automatically using a new method for non-semantic part discovery.

[Figure 1 panels. Input: noisy Web results for "car wheel"; car images. Learned concept: "car wheel" as an object; "car wheel" as a component.]

Figure 1: Our goal is to learn the semantic structure of objects automatically using Web supervision. For example, given noisy images obtained by querying an Internet search engine for “car wheel” and for “cars”, we aim at learning the “car wheel” concept, and its dual nature: as an object in its own right, and as a component of another object.

2 The MIL framework

We formulate the problem of part detection in Web images as an instance of the Multiple Instance Learning (MIL) problem. More specifically, we first query for the Web images corresponding to the name of a given semantic part class (e.g. “eye”, “car wheel”, etc.). Then, for each part class separately, we learn a weakly supervised object/part detector which distinguishes the positive part images from a set of negative images drawn from a common pool of background clutter samples. Following the MIL algorithm, we learn a linear scoring function $\langle \phi(x_i|R), w \rangle$, where $w$ is a vector of parameters and $\phi(x_i|R) \in \mathbb{R}^d$ is a descriptor of the region $R$ of image $x_i$. MIL should automatically discover regions that are most predictive of a given label, and which therefore should correspond to the sought visual entity (object or semantic part). However, this process may fail if descriptors are not sufficiently strong.

3 A novel geometric embedding

The aforementioned MIL formulation does not leverage the fact that objects have a well-defined geometric structure, which significantly constrains the search space for parts. Here, we propose an alternative method that captures geometry indirectly, on top of a rich set of unsupervised mid-level non-semantic parts $\{p_1, \dots, p_K\}$, which we call anchors. Let us assume that, given an image $x$, we can locate the (selective search) regions $R_{p_k,x}$ containing each anchor $p_k$. We define the following geometric embedding $\phi_g$ of a region $R$ with respect to the anchors:

\[
\phi_g(x|R) =
\begin{bmatrix}
\mathrm{IoU}(R, R_{p_1,x}) \\
\vdots \\
\mathrm{IoU}(R, R_{p_K,x})
\end{bmatrix}
\tag{1}
\]

where IoU is an intersection-over-union box overlap measure. Since IoU is invariant to scaling, rotation, and translation of the regions, so is the embedding $\phi_g$. Hence, as long as anchors stay attached to fixed locations on the surface of the common object class, $\phi_g(x|R)$ encodes the location of $R$ relative to an object-centric frame. To further enrich the embedding, the geometric encoding $\phi_g(x|R)$ is combined with the appearance descriptor $\phi_a(x|R)$ in a joint appearance-geometric embedding $\phi_{ag}$:

\[
\phi_{ag}(x|R) = \phi_a(x|R) \otimes \phi_g(x|R)
\tag{2}
\]

where $\otimes$ is the Kronecker product. After vectorization, this vector is used as a descriptor $\phi(x|R) = \phi_{ag}(x|R)$ of region $R$. An important property of this novel embedding arises once $\phi_{ag}$ is plugged into the MIL objective function. The result is a scoring function which interpolates between appearance models based on how the region $R$ is geometrically related to the individual anchors. In particular, by selecting different anchors this model may simultaneously capture the appearance of all parts of an object.

Properties of the geometric embedding. To provide a better understanding of the novel appearance-geometry embedding, we also prove that the IoU measure is a special case of a family of PD kernels. This property theoretically motivates our approach, since the geometric embedding can then be seen as a feature map induced by the IoU kernel and, consequently, the appearance-geometry embedding corresponds to the product between the (linear) appearance kernel and the IoU kernel.

4 Anchors: weakly-supervised non-semantic parts

The geometric embedding $\phi_g$ leverages the power of an object specific intermediate representation: a collection of anchors $\{p_k\}_{k=1}^{K}$, learned automatically using weak supervision. To learn these anchors, we first gather all the Web images corresponding to a given object class and its parts into a single set of positive samples. Then, we optimize a novel objective function to learn a collection of diverse linear anchor appearance models $\omega_1, \dots, \omega_K$ that discriminate between the negative background clutter images and the object-specific positive images.

Supervision   Method                mAP {face}   mAP {car}   mAP {bus}
Web           Cho et al. [4]        16.6         16.9        12.4
Web           Bilen & Vedaldi [2]    2.7         12.0         4.7
Web           Baseline MIL          20.6         29.1        22.7
Web           Ours                  44.9         34.4        23.0
Full          RCNN                  53.7         51.2        48.2
Full          Ours                  61.4         60.3        54.1

Table 1: Semantic part detection results averaged for the face, car, and bus parent classes.

CNN features   Baseline   BoE [7]   Ours
Decaf [6]      57.7       69.7      71.5
VGG-VD [9]     68.9       77.6      77.8

Table 2: Classification results on MIT Scenes [8], by method (columns) and CNN features (rows).

5 Geometry aware MIL

The combination of the appearance-geometry embedding with the MIL framework defines our geometry aware MIL. Once the anchor appearance models are obtained, the geometric embedding of an arbitrary region $R$ is obtained by first localizing the anchors in the corresponding image and then forming the geometric descriptor $\phi_g$ by computing the overlaps between the anchor positions and $R$ (eq. (1)). This embedding is then combined with the appearance descriptor $\phi_a$ to obtain the appearance-geometric embedding $\phi_{ag}$ (eq. (2)). Finally, $\phi_{ag}$ is used as an improved descriptor for learning semantic parts with the MIL framework.
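The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: boxes are assumed to be axis-aligned `(x1, y1, x2, y2)` tuples, the anchor regions are assumed to be already localized, and all function names (`iou`, `geometric_embedding`, `appearance_geometry_embedding`, `best_region`) are hypothetical. The learning of the anchors and of the weight vector `w` is not shown; `best_region` only illustrates the MIL-style step of picking the highest-scoring region under a given linear model.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def geometric_embedding(region, anchor_boxes):
    """Eq. (1): vector of IoUs between `region` and the K anchor regions."""
    return np.array([iou(region, a) for a in anchor_boxes])

def appearance_geometry_embedding(phi_a, phi_g):
    """Eq. (2): vectorized Kronecker product of appearance and geometry."""
    return np.kron(phi_a, phi_g)

def best_region(regions, appearances, anchor_boxes, w):
    """Score every region with <phi_ag, w> and return (argmax index, scores)."""
    scores = [
        float(w @ appearance_geometry_embedding(
            phi_a, geometric_embedding(r, anchor_boxes)))
        for r, phi_a in zip(regions, appearances)
    ]
    return int(np.argmax(scores)), scores

# Toy example (made-up boxes and descriptors): K = 2 anchors, d = 2 appearance dims.
anchors = [(0, 0, 10, 10), (10, 0, 20, 10)]
regions = [(0, 0, 10, 10), (5, 0, 15, 10)]
appearances = [np.array([1.0, 2.0]), np.array([1.0, 2.0])]
w = np.array([1.0, 0.0, 0.0, 0.0])  # hypothetical weights, length d * K
best, scores = best_region(regions, appearances, anchors, w)
```

With this layout of `np.kron`, reshaping `w` into a `d x K` matrix `W` gives the score as a sum over anchors of `IoU(R, R_pk) * <W[:, k], phi_a>`, which is one way to read the statement that the scoring function interpolates between per-anchor appearance models.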

Figure 2: Navigating the visual semantic atlas. Each pair of solid bounding boxes connected by an arrow denotes a preselected part box (the starting point of an arrow) as detected by our algorithm and the most similar semantic match (the endpoint of the arrow). The best matching bounding box is the detection with highest appearance-geometry descriptor similarity among all the detections in our database of web images. The dashed boxes denote anchors that contributed the most to the similarity.

6 Experiments

The proposed methods were evaluated on the Labeled Face Parts in the Wild (LFPW) dataset [1], which contains annotations for the “face” class parts, and on the PascalParts dataset [3], where we evaluated performance on parts corresponding to the “bus” and “car” classes.

Webly supervised localization of objects and parts. We evaluate the detection performance of our approach against strong baselines [2], [4] and a baseline MIL detector. We also report a fully supervised RCNN detector [5] and the results of our method in the fully supervised scenario. Table 1 reports the average AP over parts of a given object class for all compared methods. Our method outperforms all webly supervised baselines. Moreover, a substantial improvement over the fully supervised RCNN baseline is also achieved, indicating that our representation may be applicable well beyond weakly supervised learning.

Leveraging a single annotation. An issue with weakly supervised part learning is the inherent ambiguity in the part extent, which may differ from dataset to dataset. We address the ambiguity by considering a scenario where a single strong annotation per semantic part is available. Results indicate that our method outperforms the nearest competitor by 13.6, 3.1 and 3.2 mAP points for “face”, “car” and “bus” parts respectively.

Discriminative power of anchors. Since most of the existing methods for learning mid-level patches are evaluated in terms of discriminative content in a classification setting, we adopt the same protocol here. In particular, we evaluate the anchors as mid-level patches on the MIT Scene 67 indoor scene classification task [8]. The results in Table 2 indicate that our mid-level anchor mining algorithm outperforms both the baseline and competitors with both Decaf and VGG-VD features.

Ability of anchors to establish semantic matches. The last set of experiments assessed the ability of our mid-level anchors to induce good semantic matches between pairs of images. Results show that our method outperforms strong baselines on this task as well, which validates our intuition that the local geometry of an object is well captured by the anchors.

An atlas for visual semantic. As a byproduct of Webly-supervised learning, our method annotates the Web images with semantic parts. By endowing an image dataset with such concepts, we show here that it is possible to browse these annotated images. All of this composes our visual semantic atlas (see a subset of the atlas in Figure 2), which allows navigating from one image to another, even between an image of a full object and a zoomed-in image of one of its parts.

7 Conclusions

We have proposed a novel method for learning about objects and their semantic parts from noisy Web supervision. This is achieved through a novel appearance-geometry embedding which relies on a set of mid-level visual elements which define a robust object-centric coordinate frame. We showed improved performance over three strong detection baselines on all benchmarked datasets. Further experiments show comparable results to state-of-the-art on the mid-level element discovery and superior results on the semantic matching tasks. Finally, our method also provides a visually intuitive way to navigate Web images and predicted annotations.

References

[1] Peter N. Belhumeur, David W. Jacobs, David J. Kriegman, and Neeraj Kumar. Localizing parts of faces using a consensus of exemplars. PAMI, 2013.
[2] Hakan Bilen and Andrea Vedaldi. Weakly supervised deep detection networks. In Proc. CVPR, 2015.
[3] Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In Proc. CVPR, 2014.
[4] Minsu Cho, Suha Kwak, Cordelia Schmid, and Jean Ponce. Unsupervised Object Discovery and Localization in the Wild: Part-based Matching with Bottom-up Region Proposals. In Proc. CVPR, 2015.
[5] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. CVPR, 2014.
[6] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[7] Yao Li, Lingqiao Liu, Chunhua Shen, and Anton van den Hengel. Mid-level deep pattern mining. In Proc. CVPR, 2015.
[8] Ariadna Quattoni and Antonio Torralba. Recognizing indoor scenes. In Proc. CVPR, 2009.
[9] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
