Interactive Visual Object Search through Mutual Information Maximization

Jingjing Meng¹,², Junsong Yuan², Yuning Jiang², Nitya Narasimhan¹, Venu Vasudevan¹, Ying Wu³

¹ Applied Research Center, Motorola, Schaumburg, IL, USA
² School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
³ Electrical Engineering and Computer Science, Northwestern University, Evanston, IL, USA

ABSTRACT

Searching for small objects (e.g., logos) in images is a critical yet challenging problem. It becomes more difficult when target objects differ significantly from the query object due to changes in scale, viewpoint or style, not to mention partial occlusion or cluttered backgrounds. With the goal of retrieving and accurately locating small objects in images, we formulate object search as the problem of finding the subimages with the largest mutual information toward the query object. Each image is characterized by a collection of local features. Instead of using only the query object for matching, we propose a discriminative matching that uses both positive and negative queries to obtain the mutual information score. The user can verify the retrieved subimages and improve the search results incrementally. Our experiments on a challenging logo database of 10,000 images highlight the effectiveness of this approach.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval models

General Terms
Algorithms, Experimentation, Theory

1. INTRODUCTION

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM'10, October 25–29, 2010, Firenze, Italy. Copyright 2010 ACM 978-1-60558-933-6/10/10 ...$10.00.

Figure 1: Illustration of our discriminative mutual information score and object localization. Left upper: query object. Left bottom: negative queries. Right: search and localization result. Each green point is a local feature with a positive score, while each red point is a local feature with a negative score. The green bounding box is the localization of the query object.

The development of invariant local features and fast search algorithms allows us to search for small visual objects within large image databases [1][2][3][4][5]. Now, consider the case where a user crops a visual object from one image, then searches for occurrences of that object within images in a large database. In such cases, we need to not only retrieve matching images, but also accurately locate the target object within each match. Despite previous work, such small-object search remains challenging for two main reasons. First, there is the matching problem: the target object may differ from the query object in scale, viewpoint, lighting or style, or be partially occluded. Second, there is the localization problem: the presence of other objects or cluttered backgrounds can make it non-trivial to mark the target object efficiently and accurately. In addition, since a large number of candidate images need to be processed, the computational cost of localization is critical.

We propose an efficient object search approach that addresses both challenges. Given a query object, we formulate visual search as a localization problem: finding the subimage with maximum mutual information toward the query object. Characterizing each image as a collection of local feature points, we first measure the pointwise mutual information score using a non-parametric nearest-neighbor classifier. The relevance score of a subimage is then computed as the sum of the relevance scores of its local features. We refer to this as the discriminative mutual information score (Fig. 1) because it involves both positive and negative examples for a discriminative relevance evaluation. Finally, upon user verification of the returned results, we update the relevance scores and re-rank the subimages.

This approach has the following advantages: (1) It does not rely on a visual vocabulary to index and match local features, so there is no quantization loss during matching; the search results depend only on the matching of local features, which is more accurate and robust. (2) The proposed relevance score is compatible with the branch-and-bound search framework proposed in [1], where the complexity of localization can be sub-linear in the number of images, as we do not have to exhaustively check every image and every subimage. Moreover, because we do not adopt the visual word histogram to characterize subimages, we have a reduced memory cost when building the integral images. (3) It readily incorporates relevance feedback, where users can interactively clarify their preferences to further refine subsequent search results; the relevance score can be updated incrementally without retraining the classifier.

2. ALGORITHM

Given an image database D = {I_t}, the goal of object search is to find the subimage I* ⊆ I, where I ∈ D, that has the maximum relevance score toward the query object. § 2.1 defines the relevance score based on the mutual information between the query and a subimage. § 2.2 derives the quality bound for the relevance score that enables efficient subimage retrieval via branch-and-bound. § 2.3 describes how to incrementally update the relevance score with relevance feedback.

2.1 Discriminative Mutual Information Score

We characterize each image I ∈ D as a collection of local invariant features [6]. Each feature point p is denoted by p = {x, y, d_p}, where (x, y) is the location and d_p ∈ R^N is the feature descriptor. Given any subimage I = {p_i}, we want to measure its relevance to the query object.

We denote by Ω+ the positive class. Initially, Ω+ contains only one query object; after relevance feedback, Ω+ expands to contain more examples detected in the images. In general, we denote by Ω+ = {Q_i} the positive training dataset. The query object is represented by the collection of interest points from all the positive sample images: Ω+ = {d_j}. In the feature space, this collection of feature points approximates the distribution of the query object class. The negative training dataset Ω− is initialized with some randomly picked background images and is later updated by adding false detections through relevance feedback; Ω− is likewise characterized as the collection of feature points from all of its training images. In contrast to previous search methods that use only one positive query example for matching, our matching scheme is more discriminative because it uses both Ω+ and Ω−. By incrementally updating Ω+ and Ω− online, our method can better adapt to the query example.

Inspired by previous work in [7], we measure the relevance score of a subimage I by the discriminative pointwise mutual information. Under the Naive-Bayes assumption that interest points are independent of each other, the relevance score of I is:

$$
s(I) = MI(\Omega^{+}, I) = \log \frac{P(I \mid \Omega^{+})}{P(I)}
     = \log \prod_{p \in I} \frac{P(d_{p} \mid \Omega^{+})}{P(d_{p})}
     = \sum_{p \in I} \log \frac{P(d_{p} \mid \Omega^{+})}{P(d_{p})}
$$
$$
     = \sum_{p \in I} \log \frac{P(d_{p} \mid \Omega^{+})}{P(d_{p} \mid \Omega^{+})P(\Omega^{+}) + P(d_{p} \mid \Omega^{-})P(\Omega^{-})}
     = \sum_{p \in I} \log \frac{1}{P(\Omega^{+}) + \frac{P(d_{p} \mid \Omega^{-})}{P(d_{p} \mid \Omega^{+})} P(\Omega^{-})}
     = \sum_{p \in I} s(p), \qquad (1)
$$

where s(p) is the relevance score of an individual point p. To evaluate the likelihood ratio for each p ∈ I, we follow the same strategy as in [8] and apply kernel density estimation based on the training data Ω+ and Ω−. Applying the nearest-neighbor approximation to the Gaussian kernel estimate, we have:

$$
\frac{P(d \mid \Omega^{-})}{P(d \mid \Omega^{+})}
= \frac{\frac{1}{|\Omega^{-}|} \sum_{d_{j} \in \Omega^{-}} K(d - d_{j})}
       {\frac{1}{|\Omega^{+}|} \sum_{d_{j} \in \Omega^{+}} K(d - d_{j})}
\approx \exp\left( -\frac{1}{2\sigma^{2}} \left( \| d - d^{-}_{NN} \|^{2} - \| d - d^{+}_{NN} \|^{2} \right) \right). \qquad (2)
$$

Here d−_NN is the nearest neighbor of d in Ω− and d+_NN is the nearest neighbor of d in Ω+. We omit the factor |Ω+|/|Ω−| since only one nearest neighbor is considered in each class. Such an approximation has been shown to be effective for image matching in [8]. To speed up the nearest-neighbor search, we find approximate ε-nearest neighbors (ε-NN) by locality-sensitive hashing (LSH) [9]. The E2LSH software package is used here, with the probability of correct retrieval set to p = 0.9. We also set 2σ² = 2 in our experiments.

2.2 Branch-and-Bound Search

The objective of object search is to find a subimage I* ⊆ I, where I ∈ D, that has the maximum relevance score toward the query object:

$$
I^{*} = \arg\max_{I \subseteq I,\, I \in D} MI(\Omega^{+}, I)
      = \arg\max_{I \subseteq I,\, I \in D} \sum_{p \in I} s(p)
      = \arg\max_{I \subseteq I,\, I \in D} s(I), \qquad (3)
$$

where s(I) = Σ_{p∈I} s(p) is the objective function. For a single image I of size m × n, the total number of subimages of I is on the order of O(n²m²). To avoid an exhaustive search, we apply the branch-and-bound search of ESR [1], which enables object search in sublinear time. Although our objective function s(I) is different, branch-and-bound search remains feasible in our case. The details of branch-and-bound search can be found in [1]; below we only derive the upper-bound function.

Let 𝕀 be the collection of all subimages in a given image I. Assume there exist two subimages I_min and I_max such that for any I ∈ 𝕀, I_min ⊆ I ⊆ I_max. Then we have s(I) ≤ s+(I_max) + s−(I_min), where s+(I) = Σ_{p∈I} max(s(p), 0) contains only positive votes, while s−(I) = Σ_{p∈I} min(s(p), 0) contains only negative ones. We denote the upper bound of s(I) over all I ∈ 𝕀 by:

$$
\hat{s}(\mathbb{I}) = s^{+}(I_{max}) + s^{-}(I_{min}) \ge \max_{I \in \mathbb{I}} s(I). \qquad (4)
$$

It is easy to see that if I is the only element in 𝕀, we have the equality:

$$
\hat{s}(\mathbb{I}) = s(I). \qquad (5)
$$

Eq. 4 and Eq. 5 thus meet the two requirements discussed in [10] and serve as an effective upper bound for branch-and-bound search. It is worth mentioning that our objective function s(I) has two advantages over the histogram-based functions used in ESR. First, local feature matching is more accurate, since we do not quantize the visual features. Second, the memory cost of building the integral images for branch-and-bound search is lower, because our method only needs to store one number per pixel, instead of a histogram per pixel as in ESR.
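To make Eqs. 1, 2 and 4 concrete, the sketch below computes the per-point score s(p) and the bound ŝ in pure Python. It substitutes exact nearest-neighbor search for LSH and assumes equal class priors P(Ω+) = P(Ω−) = 0.5, which the text does not specify; all function names and defaults are illustrative, not the paper's implementation.

```python
import math

def point_score(d, pos_set, neg_set, sigma2_times2=2.0, p_pos=0.5):
    """Relevance score s(p) of one descriptor d (Eqs. 1-2).

    The kernel density ratio is approximated by nearest neighbors:
    P(d|neg)/P(d|pos) ~ exp(-(||d - d_NN^-||^2 - ||d - d_NN^+||^2) / (2 sigma^2)).
    Exact NN search stands in for the paper's LSH; p_pos is an assumed prior.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    d_pos = min(sq_dist(d, q) for q in pos_set)  # ||d - d_NN^+||^2
    d_neg = min(sq_dist(d, q) for q in neg_set)  # ||d - d_NN^-||^2
    ratio = math.exp(-(d_neg - d_pos) / sigma2_times2)
    # s(p) = log( 1 / (P(pos) + ratio * P(neg)) ), the summand of Eq. 1
    return math.log(1.0 / (p_pos + ratio * (1.0 - p_pos)))

def score_bound(scores_in_max_region, scores_in_min_region):
    """Upper bound of Eq. 4: s^(I) = s+(Imax) + s-(Imin)."""
    s_plus = sum(max(s, 0.0) for s in scores_in_max_region)   # positive votes in Imax
    s_minus = sum(min(s, 0.0) for s in scores_in_min_region)  # negative votes in Imin
    return s_plus + s_minus
```

A point close to the positive set gets a positive score (it pulls the denominator of Eq. 1 below one), and vice versa, which is exactly the green/red coloring of Fig. 1.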

Logo        # ground truth   ESR (Prec. / Recall)   1st round (Prec. / Recall)   2nd round (Prec. / Recall)
CocaCola          32           0     / 0               0     / 0                    NA    / 0
Dexia            494           0.429 / 0.024           0.810 / 0.032                0.100 / 0.020
Ferrari           76           0.020 / 0.026           0.010 / 0.013                0.750 / 0.039
Mercedes          76           0.250 / 0.092           0.917 / 0.145                1.000 / 0.184
Peugeot            6           0     / 0               0.010 / 0.167                0.024 / 0.333
President         14           0.090 / 0.643           0.050 / 0.357                0.826 / 1.000

Table 1: Precision and recall of our detection results, compared with ESR. We update the number of ground-truth images when some of our detections are not included in the ground-truth file.

2.3 Incremental Updates of the Relevance Score via Relevance Feedback

Given the training datasets Ω+ and Ω−, branch-and-bound search returns the subimages with the highest relevance scores. After the user verifies the returned results, we update the training datasets by adding correct detections {I+} to Ω+ and false detections {I−} to Ω−. Since the training datasets change, the relevance score s(I) must be updated accordingly. Because we apply a non-parametric nearest-neighbor classifier, s(I) depends only on the relevance score s(p) of each individual point p. This enables us to update s(I) incrementally without retraining a classifier. Specifically, according to Eq. 2, the score s(p) relies on two distances: (1) the distance d+_NN from p to the positive point set Ω+, and (2) the distance d−_NN from p to the negative point set Ω−. As a result, to update s(p), we only need to update d+_NN and d−_NN. Suppose the current training datasets are Ω+_1 and Ω−_1, and the newly verified subimages form Ω+_2 and Ω−_2, so that the new training datasets become Ω+ = Ω+_1 ∪ Ω+_2 and Ω− = Ω−_1 ∪ Ω−_2. Let d+_NN1 and d−_NN1 denote the nearest-neighbor distances of d_p in Ω+_1 and Ω−_1, respectively; we then only need to compute d+_NN2 and d−_NN2 in Ω+_2 and Ω−_2. Finally, we update d+_NN = min{d+_NN1, d+_NN2} and d−_NN = min{d−_NN1, d−_NN2}.
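Under the nearest-neighbor classifier, the update above reduces to taking a minimum over cached and new distances. A minimal sketch (function names are illustrative; squared L2 distances and exact search over only the newly verified points stand in for LSH):

```python
def update_nn_distances(d_pos_old, d_neg_old, descriptor, new_pos, new_neg):
    """Refresh the two squared NN distances that s(p) depends on (Eq. 2).

    Only the newly verified positive/negative feature points are searched;
    the previously cached distances are reused, so neither retraining nor
    a full re-scan of the training sets is needed.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    d_pos_new = min((sq_dist(descriptor, q) for q in new_pos),
                    default=float("inf"))
    d_neg_new = min((sq_dist(descriptor, q) for q in new_neg),
                    default=float("inf"))
    # d+_NN = min{d+_NN1, d+_NN2};  d-_NN = min{d-_NN1, d-_NN2}
    return min(d_pos_old, d_pos_new), min(d_neg_old, d_neg_new)
```

The cost per point is linear in the number of newly added training points only, which is what makes the feedback loop cheap enough to run interactively.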

3. EXPERIMENTS

3.1 Implementation

We test our algorithm on the challenging BelgaLogos dataset [4], which consists of 10,000 images covering diverse categories of objects and events. As in [4], images in the dataset are resized so that the maximum of height and width is 800 pixels, preserving the original aspect ratio. We extract scale- and affine-invariant interest points from each image using the Harris-Affine detector [11], and characterize them with 128-D SIFT descriptors for point matching [6]. After feature extraction, the entire dataset yields a total of 24,172,440 SIFT points. As in [4], 6 external logos from the first page of Google results are used as queries to test our algorithm (see Table 1 and supplementary materials). Initially, Ω+ contains only the single query, while Ω− consists of two subimages, each containing a face (Fig. 1), cropped from two images randomly selected from the dataset.

Our implementation includes two stages. The first stage prunes irrelevant images before subimage search: for each feature point in Ω+, we find its ε-nearest neighbors in the BelgaLogos dataset using LSH, then filter out images that contain fewer than 3% × M matches to the points in Ω+ (M is the total number of feature points in Ω+). The second stage performs interactive subimage search on the remaining images. It includes the following steps, and a user can iterate the process until satisfied with the results:

1. For each feature point in the dataset, find its matches in the current Ω+ and Ω− using LSH.

2. Calculate the point-wise relevance score s(p) (Eq. 1).

3. Use branch-and-bound search (§ 2.2) to retrieve subimages whose relevance score s(I) is greater than a given detection threshold τ; τ controls the number of detected subimages. Abnormal subimages with an aspect ratio greater than 10 are discarded.

4. The user verifies the top n retrieved subimages, adding correct detections to Ω+ and false detections to Ω−. Since the number of false detections is usually much greater than the number of correct detections in a huge dataset, we balance the sizes of Ω+ and Ω− by using K-means to cluster the feature points from all false detections and adding only the cluster centers to Ω−.

In our experiments, the LSH range-query parameter ε is set to 300 for the Presidential Seal and 230 for the other five queries. The detection threshold τ is set to 3.0 for all six queries. Both parameters stay the same in each round. We iterate the search only twice for each logo.
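The first-stage pruning and the Ω−-balancing of step 4 can be sketched as below. This is a toy stand-in (pure-Python Lloyd's k-means, exact threshold test) rather than the paper's actual implementation, and all names and parameters are illustrative:

```python
import random

def prune_candidates(match_counts, num_query_points, ratio=0.03):
    """Stage 1: keep only images with at least ratio * M matches to the
    positive point set (M = total number of feature points in Omega+)."""
    threshold = ratio * num_query_points
    return [img for img, n in match_counts.items() if n >= threshold]

def kmeans_centers(points, k, iters=20, seed=0):
    """Step 4: toy Lloyd's k-means. Only the k cluster centers of the
    false-detection features are added to Omega-, keeping its size
    comparable to Omega+ instead of adding every false-detection point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((x - y) ** 2
                                      for x, y in zip(p, centers[c])))
            clusters[j].append(p)
        centers = [[sum(xs) / len(c) for xs in zip(*c)] if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers
```

In practice one would run k-means with an optimized library over the 128-D SIFT descriptors; the structure of the loop is the same.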

3.2 Results Evaluation

To evaluate performance in each round, we calculate the precision and recall of the detected subimages. Precision is the percentage of correct detections among all detections, while recall is the percentage of correct detections against the total number of ground-truth images. Since the BelgaLogos dataset does not provide locations of a given logo in its ground-truth images, we estimate recall based on the total number of images that contain the query logo. If more than 100 subimages are returned, only the top 100 results are used as detection results for user verification; otherwise, we verify all returned subimages. All detected subimages are manually verified to ensure that they indeed contain the query. Interestingly, we have also detected logo instances that are not included in the ground-truth files [4] for Mercedes (Fig. 3) and Peugeot; in both cases, we update the total number of ground-truth images and calculate recall accordingly. If multiple logos are found in the same image, we count them only once in the recall calculation. Table 1 summarizes our results for the 6 logos. After one round of relevance feedback, both precision and recall improve significantly.

To demonstrate the effectiveness of our method, we also compare it with ESR [1]. Using the same set of local features, we first build a codebook of 10,000 words by hierarchically clustering 5% of the SIFT points. We then perform ESR using normalized histogram intersection (NHI) as the quality function, which has been shown to be effective [1]. We apply the code provided by [1], but exhaustively search bounding boxes at multiple scales, ranging from 0.25 to 8 times the original query size, with a step factor of 1.1. The resulting precisions and recalls are also shown in Table 1. For a fair comparison, for the precision score of each logo we check the same number of top detections as returned by our algorithm, while recall is based on the total number of ground-truth images. Experimental results on the 6 logos demonstrate that our algorithm performs slightly better than ESR even without user feedback in the first round, while the second round brings more significant improvements.

The detection results (Fig. 2, Fig. 3) demonstrate that our method can effectively handle many challenging cases in object retrieval, such as changes in scale and viewpoint, deformations, blurring, and severe partial occlusions (e.g., the 1st image in the 2nd row of Fig. 2). Our method can also detect multiple objects in a single image (Fig. 3) and handle multiple query images (§ 2.3).

Figure 2: The 2nd-round search results for the Presidential Seal. The first image is the query and the rest are the top-14 detections, all correct. There are only 14 Presidential Seals in the dataset. Results are listed in descending order of relevance score, from top-left to bottom-right.

Figure 3: The 2nd-round search results for the Mercedes logo. The first image is the query and the rest are the top-15 detections, all correct. There are in total 76 Mercedes logos in the dataset. The 6th and 15th detections come from the same image. The 2nd, 11th, and 12th are new detections missing from the ground truth [4].
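The precision and recall figures above follow the standard definitions; as a trivial sketch (the function name is illustrative, and recall is over ground-truth images, counting an image with several logo instances once, per § 3.2):

```python
def precision_recall(num_correct, num_detections, num_ground_truth):
    """Precision over all returned detections; recall over the total
    number of ground-truth images containing the query logo."""
    precision = num_correct / num_detections if num_detections else float("nan")
    recall = num_correct / num_ground_truth
    return precision, recall
```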

4. CONCLUSION

Retrieving small visual objects from a large collection of images is a challenging problem due to the possible appearance variations of objects, as well as the large computational cost of searching the image dataset. Our method combines subimage search with relevance feedback to search for and locate small objects. Using branch-and-bound search, it ranks subimages (instead of whole images) by their mutual information score toward the query object. Once the user verifies the relevant and irrelevant subimages, the mutual information score is updated incrementally and the search results are further improved. Our experiments on a challenging logo dataset validate the effectiveness and efficiency of our method. Future work includes further reducing the search cost when handling millions of images.

Acknowledgments

This work is supported in part by the Nanyang Assistant Professorship (SUG M58040015) and National Science Foundation grants IIS-0347877 and IIS-0916607.

5. REFERENCES

[1] Christoph H. Lampert, "Detecting objects in large image collections and videos by efficient subimage retrieval," in Proc. IEEE Intl. Conf. on Computer Vision, 2009.
[2] S. Litayem, A. Joly, and N. Boujemaa, "Interactive object retrieval with efficient boosting," in Proc. ACM Multimedia, 2009.
[3] Jim Kleban, Xing Xie, and Wei-Ying Ma, "Spatial pyramid mining for logo detection in natural scenes," in Proc. IEEE Intl. Conf. on Multimedia and Expo, 2008, pp. 1077–1080.
[4] Alexis Joly and Olivier Buisson, "Logo retrieval with a contrario visual query expansion," in Proc. ACM Multimedia, 2009.
[5] Josef Sivic and Andrew Zisserman, "Video Google: A text retrieval approach to object matching in videos," in Proc. IEEE Intl. Conf. on Computer Vision, 2003.
[6] David Lowe, "Distinctive image features from scale-invariant keypoints," Intl. Journal of Computer Vision, 2004.
[7] Junsong Yuan, Zicheng Liu, and Ying Wu, "Discriminative subvolume search for efficient action detection," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2009.
[8] O. Boiman, E. Shechtman, and M. Irani, "In defense of nearest-neighbor based image classification," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2008.
[9] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab Mirrokni, "Locality-sensitive hashing scheme based on p-stable distributions," in Proc. of the Twentieth Annual Symposium on Computational Geometry, 2004, pp. 253–262.
[10] Christoph H. Lampert, Matthew B. Blaschko, and Thomas Hofmann, "Beyond sliding windows: Object localization by efficient subwindow search," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2008.
[11] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615–1630, 2005.
[12] S. Zhang, Q. Huang, G. Hua, S. Jiang, W. Gao, and Q. Tian, "Building contextual visual vocabulary for large-scale image applications," in Proc. ACM Multimedia, 2010 (to appear).
