Learning the structure of objects from Web supervision


David Novotny1 [email protected]
Diane Larlus2 [email protected]
Andrea Vedaldi1 http://www.robots.ox.ac.uk/~vedaldi

1 Visual Geometry Group, University of Oxford
2 Computer Vision Group, Xerox Research Centre Europe

1 Introduction

While recent research in image understanding has often focused on recognizing more types of objects, understanding more about the objects is just as important. Advances in tasks such as image captioning, activity recognition, and many others have ventured far beyond the standard image classification or object detection problems in order to extract richer information from visual scenes. Even so, image understanding remains rather crude, oblivious to most of the nuances of real-world images.

Consider for example the notion of object category, which is a basic unit of understanding in computer vision. Modern benchmarks such as ILSVRC consider an increasingly large number of such categories. However, there is only limited understanding of their internal geometric structure and semantics. We aim to fill this gap by jointly learning about objects, their semantic parts, and their geometric relationships.

Learning about semantic nameable parts plays a crucial role in visual understanding. However, standard large-scale approaches to this task face the difficulty of collecting vast quantities of annotated example images. Instead, scalable algorithms must be designed to discover this information with minimal or no supervision. As others have done, we look at Web supervision to learn the structure of objects from thousands of images obtained automatically by querying search engines. However, this poses two significant challenges: identifying images of the semantic parts in very noisy Web results while, at the same time, discovering their geometric relationships in the presence of drastic scale changes between parts seen in the context of the whole object or in isolation.

To this end, we propose a novel embedding that encodes the geometry of an object and of its parts by expressing them relative to a reference frame which is robust to large scale variations. The reference frame is formed by non-semantic anchor parts which are learned automatically using a new method for non-semantic part discovery.

[Figure 1 panels. Input: noisy Web results for "car wheel"; car images. Learned concept: "car wheel" as an object; "car wheel" as a component.]

Figure 1: Our goal is to learn the semantic structure of objects automatically using Web supervision. For example, given noisy images obtained by querying an Internet search engine for “car wheel” and for “cars”, we aim at learning the “car wheel” concept, and its dual nature: as an object in its own right, and as a component of another object.

2 The MIL framework

We formulate the problem of part detection in Web images as an instance of the Multiple Instance Learning (MIL) problem. More specifically, we first query for the Web images corresponding to the name of a given semantic part class (e.g. “eye”, “car wheel”, etc.). Then, for each part class separately, we learn a weakly supervised object/part detector which distinguishes the positive part images from a set of negative images drawn from a common pool of background clutter samples. Following the MIL algorithm, we learn a linear scoring function $\langle \phi(x_i|R), w \rangle$, where $w$ is a vector of parameters and $\phi(x_i|R) \in \mathbb{R}^d$ is a descriptor of the region $R$ of image $x_i$. MIL should automatically discover the regions that are most predictive of a given label, and which therefore should correspond to the sought visual entity (object or semantic part). However, this process may fail if the descriptors are not sufficiently strong.

3 A novel geometric embedding

The aforementioned MIL formulation does not leverage the fact that objects have a well-defined geometric structure, which significantly constrains the search space for parts. Here, we propose an alternative method that captures geometry indirectly, on top of a rich set of unsupervised mid-level non-semantic parts $\{p_1, \dots, p_K\}$, which we call anchors. Let us assume that, given an image $x$, we can locate the (selective search) regions $R_{p_k,x}$ containing each anchor $p_k$. We define the following geometric embedding $\phi_g$ of a region $R$ with respect to the anchors:

$$\phi_g(x|R) = \begin{bmatrix} \mathrm{IoU}(R, R_{p_1,x}) \\ \vdots \\ \mathrm{IoU}(R, R_{p_K,x}) \end{bmatrix} \tag{1}$$

where IoU is an intersection-over-union box overlap measure. Since IoU is invariant to scaling, rotation, and translation of the regions, so is the embedding $\phi_g$. Hence, as long as anchors stay attached to fixed locations on the surface of the common object class, $\phi_g(x|R)$ encodes the location of $R$ relative to an object-centric frame. To further enrich the embedding, the geometric encoding $\phi_g(x|R)$ is combined with the appearance descriptor $\phi_a(x|R)$ in a joint appearance-geometric embedding $\phi_{ag}$:

$$\phi_{ag}(x|R) = \phi_a(x|R) \otimes \phi_g(x|R) \tag{2}$$

where $\otimes$ is the Kronecker product. After vectorization, this vector is used as a descriptor $\phi(x|R) = \phi_{ag}(x|R)$ of region $R$. An important property of this novel embedding arises once $\phi_{ag}$ is plugged into the MIL objective function. The result is a scoring function which interpolates between appearance models based on how the region $R$ is geometrically related to the individual anchors. In particular, by selecting different anchors this model may simultaneously capture the appearance of all parts of an object.

Properties of the geometric embedding. To provide a better understanding of the novel appearance-geometry embedding, we also prove that the IoU measure is a special case of a family of positive definite (PD) kernels. This property theoretically motivates our approach, since the geometric embedding can then be seen as a feature map induced by the IoU kernel and, consequently, the appearance-geometry embedding corresponds to the product between the (linear) appearance kernel and the IoU kernel.

4 Anchors: weakly-supervised non-semantic parts

The geometric embedding $\phi_g$ leverages the power of an object-specific intermediate representation: a collection of anchors $\{p_k\}_{k=1}^K$, learned automatically using weak supervision. To learn these anchors, we first gather all the Web images corresponding to a given object class and its parts into a single set of positive samples. Then, we optimize a novel objective function to learn a collection of diverse linear anchor appearance models $\omega_1, \dots, \omega_K$ that discriminate between the negative background clutter images and the object-specific positive images.

Supervision  Method               mAP {face}  mAP {car}  mAP {bus}
Web          Cho et al. [4]       16.6        16.9       12.4
Web          Bilen & Vedaldi [2]   2.7        12.0        4.7
Web          Baseline MIL         20.6        29.1       22.7
Web          Ours                 44.9        34.4       23.0
Full         RCNN                 53.7        51.2       48.2
Full         Ours                 61.4        60.3       54.1

Table 1: Semantic part detection results averaged for the face, car, and bus parent classes.

Method       CNN features
             Decaf [6]   VGG-VD [9]
Baseline     57.7        68.9
BoE [7]      69.7        77.6
Ours         71.5        77.8

Table 2: Classification results on MIT Scenes [8].

5 Geometry aware MIL

The combination of the appearance-geometric embedding with the MIL framework defines our geometry aware MIL. Once the anchor appearance models are obtained, the geometric embedding of an arbitrary region $R$ is obtained by first localizing the anchors in the corresponding image and then forming the geometric descriptor $\phi_g$ by computing the overlaps between the anchor positions and $R$ (eq. (1)). This embedding is then combined with the appearance descriptor $\phi_a$ to obtain the appearance-geometric embedding $\phi_{ag}$ (eq. (2)). $\phi_{ag}$ is then used as an improved descriptor for learning semantic parts with the MIL framework.
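The descriptor construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: boxes are assumed to be axis-aligned `(x1, y1, x2, y2)` tuples, the anchor boxes are assumed to be already localized, the appearance descriptor is a random stand-in for CNN features, and all function names are hypothetical. The last function illustrates the interpolation property: a linear score on the Kronecker embedding equals a sum of per-anchor appearance scores weighted by the IoU with each anchor.

```python
import numpy as np


def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def geometric_embedding(region, anchor_boxes):
    """Eq. (1): vector of IoUs between the region and each anchor box."""
    return np.array([iou(region, box) for box in anchor_boxes])


def appearance_geometry_embedding(phi_a, phi_g):
    """Eq. (2): Kronecker product of the two descriptors, as a flat vector."""
    return np.kron(phi_a, phi_g)


def score_as_interpolation(w, phi_a, phi_g):
    """Same linear score as w @ kron(phi_a, phi_g), written per anchor.

    Reshaping w to (d, K) exposes one appearance model per anchor
    (column k); the score is their IoU-weighted combination.
    """
    W = w.reshape(phi_a.size, phi_g.size)
    return float(phi_g @ (W.T @ phi_a))


# Toy example (all values are illustrative assumptions).
region = (10, 10, 50, 50)
anchors = [(10, 10, 50, 50), (30, 30, 70, 70), (100, 100, 120, 120)]

rng = np.random.default_rng(0)
phi_a = rng.standard_normal(4)             # stand-in appearance descriptor
phi_g = geometric_embedding(region, anchors)
phi_ag = appearance_geometry_embedding(phi_a, phi_g)
w = rng.standard_normal(phi_ag.size)       # stand-in MIL weight vector

print("phi_g:", phi_g)
print("scores agree:", np.isclose(w @ phi_ag,
                                  score_as_interpolation(w, phi_a, phi_g)))
```

Reshaping the weight vector into a d × K matrix is only one way to read the model, but it makes the role of the anchors explicit: each column acts as an appearance model tied to one anchor, and anchors that do not overlap the region contribute nothing to the score.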

6 Experiments

The proposed methods were evaluated on the Labeled Face Parts in the Wild (LFPW) dataset [1], which contains annotations for the “face” class parts, and on the PascalParts dataset [3], where we evaluated performance on parts corresponding to the “bus” and “car” classes.

Webly supervised localization of objects and parts. We evaluate the detection performance of our approach against strong baselines [2], [4] and a baseline MIL detector. We also report a fully supervised RCNN detector [5] and the results of our method in the fully supervised scenario. Table 1 reports the average AP over parts of a given object class for all compared methods. Our method outperforms all webly supervised baselines. Moreover, a substantial improvement over the fully supervised RCNN baseline is also achieved, indicating that our representation may be applicable well beyond weakly supervised learning.

Leveraging a single annotation. An issue with weakly supervised part learning is the inherent ambiguity in the part extent, which may differ from dataset to dataset. We address the ambiguity by considering a scenario where a single strong annotation per semantic part is available. Results indicate that our method outperforms the nearest competitor by 13.6, 3.1 and 3.2 mAP points for “face”, “car” and “bus” parts respectively.

Discriminative power of anchors. Since most of the existing methods for learning mid-level patches are evaluated in terms of discriminative content in a classification setting, we adopt the same protocol here. In particular, we evaluate the anchors as mid-level patches on the MIT Scene 67 indoor scene classification task [8]. The results in Table 2 indicate that our mid-level anchor mining algorithm outperforms both the baseline and competitors when either Decaf or VGG-VD features are used for classification.

Ability of anchors to establish semantic matches. The last set of experiments assessed the ability of our mid-level anchors to induce good semantic matches between pairs of images. Results show that our method outperforms strong baselines on this task as well, which validates our intuition that the local geometry of an object is well captured by the anchors.

An atlas for visual semantics. As a byproduct of Webly-supervised learning, our method annotates the Web images with semantic parts. By endowing an image dataset with such concepts, we show here that it is possible to browse these annotated images. All of this composes our visual semantic atlas (see a subset of the atlas in Figure 2), which allows navigating from one image to another, even between an image of a full object and a zoomed-in image of one of its parts.

Figure 2: Navigating the visual semantic atlas. Each pair of solid bounding boxes connected by an arrow denotes a preselected part box (the starting point of an arrow) as detected by our algorithm and the most similar semantic match (the endpoint of the arrow). The best matching bounding box is the detection with highest appearance-geometry descriptor similarity among all the detections in our database of web images. The dashed boxes denote anchors that contributed the most to the similarity.

7 Conclusions

We have proposed a novel method for learning about objects and their semantic parts from noisy Web supervision. This is achieved through a novel appearance-geometry embedding which relies on a set of mid-level visual elements which define a robust object-centric coordinate frame. We showed improved performance over three strong detection baselines on all benchmarked datasets. Further experiments show comparable results to state-of-the-art on the mid-level element discovery and superior results on the semantic matching tasks. Finally, our method also provides a visually intuitive way to navigate Web images and predicted annotations.

References

[1] Peter N Belhumeur, David W Jacobs, David J Kriegman, and Narendra Kumar. Localizing parts of faces using a consensus of exemplars. PAMI, 2013.
[2] Hakan Bilen and Andrea Vedaldi. Weakly supervised deep detection networks. In Proc. CVPR, 2015.
[3] Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In Proc. CVPR, 2014.
[4] Minsu Cho, Suha Kwak, Cordelia Schmid, and Jean Ponce. Unsupervised Object Discovery and Localization in the Wild: Part-based Matching with Bottom-up Region Proposals. In Proc. CVPR, 2015.
[5] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. CVPR, 2014.
[6] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[7] Yao Li, Lingqiao Liu, Chunhua Shen, and Anton van den Hengel. Mid-level deep pattern mining. In Proc. CVPR, 2015.
[8] Ariadna Quattoni and Antonio Torralba. Recognizing indoor scenes. In Proc. CVPR, 2009.
[9] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
