Towards a Relevant and Diverse Search of Social Images


IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 12, NO. 8, DECEMBER 2010


Towards a Relevant and Diverse Search of Social Images Meng Wang, Member, IEEE, Kuiyuan Yang, Xian-Sheng Hua, Member, IEEE, and Hong-Jiang Zhang, Fellow, IEEE

Abstract—Recent years have witnessed the great success of social media websites. Tag-based image search is an important approach to accessing the image content on these websites. However, the existing ranking methods for tag-based image search frequently return results that are irrelevant or not diverse. This paper proposes a diverse relevance ranking scheme that is able to take relevance and diversity into account by exploring the content of images and their associated tags. First, it estimates the relevance scores of images with respect to the query term based on both the visual information of images and the semantic information of associated tags. Then, we estimate the semantic similarities of social images based on their tags. Based on the relevance scores and the similarities, the ranking list is generated by a greedy ordering algorithm which optimizes average diverse precision, a novel measure that is extended from the conventional average precision. Comprehensive experiments and user studies demonstrate the effectiveness of the approach. We also apply the scheme for web image search reranking, and it is shown that the diversity of search results can be enhanced while maintaining a comparable level of relevance. Index Terms—Diversity, social image search, tag.

I. INTRODUCTION

THERE is an explosion of community-contributed multimedia content available online, on sites such as YouTube, Flickr, and Zooomr. Such media repositories encourage users to collaboratively create, evaluate, and distribute media information. They also allow users to annotate their uploaded media data with descriptive keywords called tags, which can greatly facilitate the organization of the social media. However, performing search on large-scale social media data is not an easy task. Currently, Flickr provides two ranking options for tag-based image search. One is “most recent,” which orders images based on their uploading time, and the other is “most interesting,” which ranks the images by “interestingness,” a measure that integrates information such as click-through and comments. In the following discussion, we name these two methods

Manuscript received October 10, 2009; revised February 25, 2010 and May 02, 2010; accepted June 14, 2010. Date of publication June 28, 2010; date of current version November 17, 2010. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Francesco G. B. De Natale. M. Wang is with the Media Computing Group, Microsoft Research Asia, Beijing 100080, China (e-mail: [email protected]). X.-S. Hua is with the Internet Media Group, Microsoft Research Asia, Beijing 100080, China (e-mail: [email protected]). K. Yang is with the Department of Automation, the University of Science and Technology of China, Anhui 230027, China (e-mail: [email protected]). H.-J. Zhang is with Microsoft Advanced Technology Center, Beijing 100080, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TMM.2010.2055045

time-based ranking and interestingness-based ranking, respectively. Since they both rank images according to measures (interestingness or time) that are not related to relevance, many irrelevant images will be introduced among the top search results. As an example, Fig. 1 illustrates the top results of query “waterfall” with the two ranking options, in which we can see that many images are irrelevant to the query, such as those marked with red boxes. In addition to relevance, lack of diversity is also a problem. Many images from social media websites are actually close to each other. For example, many users upload continuously captured images in batch, and such images are usually visually and semantically close. When these images appear simultaneously as top results, users will get only limited information. From Fig. 1, we can also observe this fact. The images marked with blue or green boxes (online version) are very close to at least one of the other images. Therefore, a ranking scheme that can simultaneously generate relevant and diverse results is highly desired. This problem is closely related to a key scientific challenge released recently by Yahoo research: “how do we combine both content-based retrieval with tags to do something better than either approach alone for multimedia retrieval.”1 The importance of relevance is clear. In fact, this is usually regarded as the bedrock of information retrieval: if an IR system’s response to each query is a ranking of documents in order of decreasing probability of relevance, the overall effectiveness of the system to its user will be maximized [18]. The time-based and interestingness-based ranking options are of course useful. For example, users can easily browse the images that are recently uploaded via the time-based ranking. However, when users perform a search with the intention of finding specific images, relevance will be more important than time and interestingness. 
The necessity of diversity may seem less intuitive than relevance, but its importance has also been long acknowledged in information retrieval [5], [12]. One explanation is that the relevance of a document (web page, image, or video) with respect to the query should depend on not only the document itself but also its difference from the documents appearing before it. Now, we observe this issue from another perspective. In many cases, users cannot precisely and exhaustively describe their requests, and, thus, keeping diversity in the search results will provide users more chances to find the desired content quickly. For example, we can consider the following cases in image search. 1) The users only provide an ambiguous query. For example, the query “apple” may refer to different topics, such as fruit, computers, and mobile devices. Thus, it is better to provide diverse results to cover multiple topics.

1Yahoo Key Scientific Challenges Program. [Online]. Available: http://research.yahoo.com/ksc/multimedia

1520-9210/$26.00 © 2010 IEEE


Fig. 1. Top 20 search results of the query “waterfall” with the two ranking options. (a) Interestingness-based ranking. (b) Time-based ranking. We can see that many images are irrelevant to the query (marked with red (online version) border) or close to others (marked with blue or green (online version) borders).

2) The users cannot fully describe their requests with simple words. For example, although a user only provides a simple query “car”, he/she may actually want to find a picture of a red car on grass. In this case, the hit probability of a diverse image set should be greater than that of a set of images that are quite close. Therefore, diversity of results is also important for users [12], [20]. This fact can be explained from an information-theoretic point of view as well. If the returned images are all identical for a query, the information gained by the user is equivalent to that of returning only one image.

To address the aforementioned relevance and diversity problems, we propose a diverse relevance ranking (DRR) scheme for social image search. It is able to rank the images based on their relevance levels with respect to the query tag while simultaneously considering the diversity of the ranking list.2 The scheme works as follows. First, we estimate the relevance score of each image with respect to the query term as well as the semantic similarity of each image pair. The relevance estimation incorporates both the visual information of images and the semantic information of their associated tags into an optimization framework, and the semantic similarity is mined based on the associated tags of images. With the estimated relevance scores and similarities, we then implement the DRR algorithm, which can be viewed as a greedy ordering algorithm that optimizes average diverse precision (ADP), a novel measure that is extended from conventional average precision (AP). Different from AP, which only considers relevance, ADP further takes diversity into account and, thus, the derived DRR algorithm can generate results that are both relevant and diverse. It is worth noting that, although the approach is proposed for social image search, it is a flexible scheme and can be easily extended to many other applications. As an example, in Section VI, we will apply it in web image search reranking and show that it is able to enhance the diversity of search results.

The main contributions of this paper can be summarized as follows. 1) We propose a diverse relevance ranking scheme for social image search, which is complementary to the existing ranking approaches. 2) We propose a method to estimate the relevance scores of images with respect to a query tag. It leverages both the visual information of images and the semantic information of tags. 3) We extend the conventional AP measure to ADP to take diversity into account and then derive a greedy ordering algorithm accordingly that balances relevance and diversity.

The organization of the remainder of this paper is as follows. In Section II, we provide a short review on the related work. In Section III, we present DRR as a general ranking algorithm. In Section IV, we detail the relevance and semantic similarity estimation of social images. Empirical study is presented in Section V. In Section VI, we show the application of DRR in enhancing the diversity of web image search. Finally, we conclude the paper in Section VII.

2When we rank a set of images, the images will all appear in the ranking list no matter how many duplicates exist. Here, keeping the diversity of a ranking list or search results actually means making top images diverse. This is reasonable since most users will only focus on the top results, and the average diverse precision (ADP) measure is also designed according to this principle (see Section III).


II. RELATED WORKS

A. Social Image Search

The last decade has witnessed a great advance in image search technology [6], [13], [19]. Different from general web images, social images are usually associated with a set of user-provided descriptors called tags, and, thus, tag-based search can be easily accomplished by using the descriptors as index terms. However, user-provided tags are usually very noisy [9], [14], which often leads to unsatisfactory search results. In comparison with the extensive studies on how to help users better perform tagging or on mining tags for other applications, the literature regarding tag-based image search is still very sparse. Most of these efforts focus on how to refine the images’ tags or measure their relevance levels. Wang et al. [25] proposed a simple method to construct a comprehensive set of models for tags with free data from a social media website. Li et al. proposed a tag-relevance learning method which is able to assign each tag a relevance score, and they have shown its application in tag-based image search [14]. Kennedy et al. [10] proposed a method to establish reliable tags by investigating highly similar images that are annotated by different photographers. Liu et al. [15] proposed an optimization scheme for tag refinement based on the visual and semantic connection between images. Sun and Bhowmick [22] proposed a method to measure the tag clarity score based on the query language model and the collection language model. These methods can help tag-based image search by improving the tags’ quality, but they cannot deal with the aforementioned lack-of-diversity problem.

B. Diversity in Image Search

It has been long acknowledged that diversity plays an important role in information retrieval. In 1964, Goffman recognized that the relevance of a document must be determined with respect to the documents appearing before it [5]. Carbonell et al. proposed a ranking method named maximal marginal relevance, which attempts to maximize relevance while minimizing similarity to higher ranked documents [2]. Zhai et al. proposed a subtopic search method, which aims to return the results that cover more subtopics [28], [29]. Many related works can be found in [28], [29] and the references therein. The diversity problem is actually more challenging in image search, as it involves not only the semantic ambiguity of queries but also the visual similarity of search results [12]. Currently, there are two typical approaches to enhancing the diversity in image search: search results clustering and duplicate removal. When performing search results clustering, a representative image can be selected from each cluster. Then, we can either only present these representatives to users or put other images behind them in the ranking list. In [1], Cai et al. proposed a method to cluster web image search results into different semantic clusters to facilitate users’ browsing. Jing et al. [8] proposed an IGroup system for clustering image search results. Song et al. studied the topic coverage of an image search diversification method [20]. Recently, Leuken et al. investigated different clustering methods for visual diversification of image search results [12]. Different from clustering, the duplicate-removal approach directly eliminates the duplicates


or near-duplicates detected in image search results. Many different duplicate detection methods have been proposed, such as pair-wise image comparison [7], approximate search [23], and fingerprint-based algorithms [21]. Recent progress of image duplicate detection can be found in [30] and [31]. Although encouraging results have been demonstrated, the clustering and duplicate removal techniques have their limitations due to the involved heuristics. For clustering, how to establish the number of clusters is a problem. If too many clusters are generated, then the diversity of search results cannot be guaranteed, and, contrarily, if the clusters are too few, then the search relevance may be degraded. In addition, how to take images’ relevance levels into the clustering process is also a problem. For duplicate removal, if we set a low threshold for near-duplicate detection, then the diversity of search results cannot be guaranteed, and, contrarily, if we set a high threshold for near-duplicate detection, many informative images will be removed. In this work, our diverse relevance ranking scheme adopts a different approach. We just order all images to optimize a performance metric that considers both relevance and diversity. In this way, we do not need to establish the number of clusters or the threshold for near-duplicate detection, and users will not miss any information as we do not remove images. III. DIVERSE RELEVANCE RANKING Here, we introduce the DRR approach. We present it as a general ranking algorithm and leave the two flexible components, i.e., relevance score and similarity estimation of images, to the next section.3 We first prove that ranking by relevance scores can be viewed as the process of optimizing the mathematical expectation of the conventional AP measure. Then we analyze the limitation of AP and generalize it to an ADP measure to integrate diversity. 
The DRR algorithm is then derived by greedily optimizing the mathematical expectation of the ADP measure.

3These two components are actually replaceable. For example, they will be different in social image search and web image search reranking (see Section VI).

A. Average Precision

AP is a widely applied performance-evaluation measure in information retrieval. Given a collection of images $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$, we denote the binary relevance label of $x_i$ with respect to the given query as $y(x_i)$, i.e., $y(x_i) = 1$ if $x_i$ is relevant; otherwise $y(x_i) = 0$. Denote by $\tau$ an ordering of the images, and let $\tau(i)$ be the image at the position of rank $i$ (a lower number indicates an image with a higher rank). Let $R$ be the number of true relevant images in the set $\mathcal{D}$. Then, the noninterpolated AP is defined as

$$AP(\tau, \mathcal{D}) = \frac{1}{R} \sum_{j=1}^{n} y(\tau(j)) \, \frac{\sum_{k=1}^{j} y(\tau(k))}{j}. \tag{1}$$

Ranking images by their relevance scores in decreasing order is the most intuitive approach. Now, we prove that the ranking list generated in this way actually maximizes the mathematical expectation of the AP measure. Denote by $r(x_i)$ the relevance score of $x_i$ (how to estimate it will be introduced in the next section), and it is reasonable for us to assume that $r(x_i) = P(y(x_i) = 1)$, i.e., we regard the relevance score $r(x_i)$ as the probability that $x_i$ is relevant. Since $R$ can be regarded as a constant, we do not take it into account in the expectation estimation. We also assume that the relevance of an image is independent of other images, and hence the expected value of AP can be computed as follows:

$$E\{AP(\tau, \mathcal{D})\} = \frac{1}{R} \sum_{j=1}^{n} \frac{1}{j} \left( r(\tau(j)) + r(\tau(j)) \sum_{k=1}^{j-1} r(\tau(k)) \right). \tag{2}$$

Then, we have the following theorem.

Theorem 1: Ranking the images in $\mathcal{D}$ with relevance scores in nonincreasing order maximizes $E\{AP(\tau, \mathcal{D})\}$.

Proof: Denote by $\tau^*$ the ranking of images in $\mathcal{D}$ with relevance scores in nonincreasing order, i.e., $r(\tau^*(1)) \ge r(\tau^*(2)) \ge \cdots \ge r(\tau^*(n))$. Then we only need to prove $E\{AP(\tau^*, \mathcal{D})\} \ge E\{AP(\tau, \mathcal{D})\}$ for every possible $\tau$. Without loss of generality, we consider an ordering $\tau'$ that has exchanged the documents at the positions of rank $i$ and $i+1$ in $\tau^*$, i.e., $\tau'(i) = \tau^*(i+1)$ and $\tau'(i+1) = \tau^*(i)$. Actually, it is not difficult to find that any change on $\tau^*$ can be decomposed into a series of such adjacent exchanges. Thus, our task is simplified to prove $E\{AP(\tau^*, \mathcal{D})\} \ge E\{AP(\tau', \mathcal{D})\}$. For simplicity, we denote $a = r(\tau^*(i))$, $b = r(\tau^*(i+1))$, and $q = \sum_{k=1}^{i-1} r(\tau^*(k))$. Since the terms of (2) at all ranks other than $i$ and $i+1$ are identical for $\tau^*$ and $\tau'$, we have

$$E\{AP(\tau^*, \mathcal{D})\} - E\{AP(\tau', \mathcal{D})\} = \frac{1}{R} \left( \frac{1}{i} - \frac{1}{i+1} \right) (a - b)(1 + q). \tag{3}$$

Since $a \ge b$ and $q \ge 0$, we have $E\{AP(\tau^*, \mathcal{D})\} - E\{AP(\tau', \mathcal{D})\} \ge 0$, i.e., $E\{AP(\tau^*, \mathcal{D})\} \ge E\{AP(\tau', \mathcal{D})\}$, which completes the proof.

This demonstrates that the AP performance evaluation measure encourages prioritizing images with high relevance. However, the measure may be inconsistent with users’ experience due to the neglect of diversity. Fig. 2 illustrates an example to demonstrate this fact. In Fig. 2(a), all images are relevant, and several images in Fig. 2(b) are irrelevant. Therefore, placing the images in (a) at the top of the ranking list will most probably yield a higher AP than placing those in (b) there, but clearly it provides little information for users because the images are just duplicates. Therefore, the conventional AP measure can be improved to be more consistent with user experience by integrating diversity.

Fig. 2. Extreme example to illustrate the limitation of the conventional AP measure that only considers relevance. (a) All of the top 20 results are highly relevant to “car.” (b) Several images are irrelevant. Therefore, placing the images in (a) at the top of the ranking list will most probably yield a higher AP than (b), but clearly the images in (b) are more informative because (a) only shows duplicates.

B. Average Diverse Precision

Here we generalize the existing AP measure to ADP to take diversity into account, which is defined as

$$ADP(\tau, \mathcal{D}) = \frac{1}{R} \sum_{j=1}^{n} y(\tau(j)) \, Div(\tau(j)) \, \frac{\sum_{k=1}^{j} y(\tau(k)) \, Div(\tau(k))}{j} \tag{4}$$

where $Div(\tau(j))$ indicates the diversity score of $\tau(j)$. We define $Div(\tau(j))$ as its minimal difference with the images appearing before it, i.e.,

$$Div(\tau(j)) = \min_{1 \le k < j} \bigl( 1 - s(\tau(k), \tau(j)) \bigr) \tag{5}$$

where $s(\cdot, \cdot)$ is a similarity measure between two images. Comparing the definitions of AP and ADP [see (1) and (4)], we can see that the only difference is that we have changed $y(\tau(j))$ to $y(\tau(j)) \, Div(\tau(j))$. For an image in the ranking list, its contribution to the ADP measure is determined not only by its relevance with respect to the query but also by its difference with the images appearing before it. If an image is identical to one of the previously appeared images, it will add no contribution to the ADP measurement. Thus, the ADP measure takes both relevance and diversity into account. Denote by $\tau^*$ the optimal ranking list under the ADP performance evaluation measure, i.e., the list that achieves the highest ADP measurement; we can prove that $r(\tau^*(i)) \, Div(\tau^*(i)) \ge r(\tau^*(j)) \, Div(\tau^*(j))$ for any $i < j$. This indicates that the top images will tend to be more relevant and diverse. Here, we omit its proof since it is analogous to Theorem 1.

C. Diverse Relevance Ranking

Fig. 3. Pseudo-code of the proposed DRR algorithm.
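As a companion to Fig. 3, the greedy ordering can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, relevance scores and pairwise similarities are assumed to be precomputed and to lie in [0, 1], the first-ranked image is assumed to have a diversity score of 1, and ties are broken by image index. At each rank k the sketch picks the image maximizing r(x) Div(x) (Div(x) + C) / k, where C accumulates r · Div over the images already placed; this is the per-rank term of the expected ADP when image relevances are treated as independent.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def diversity(candidate: int, selected: Sequence[int], sim: Matrix) -> float:
    """Div of an image: its minimal dissimilarity to the images ranked before it."""
    if not selected:
        return 1.0  # assumed convention: nothing precedes the first image
    return min(1.0 - sim[candidate][j] for j in selected)


def average_diverse_precision(order: Sequence[int],
                              labels: Sequence[int],
                              sim: Matrix) -> float:
    """ADP of a ranking given binary relevance labels (1 = relevant)."""
    num_relevant = sum(labels)
    if num_relevant == 0:
        return 0.0
    total, running = 0.0, 0.0
    for rank, i in enumerate(order, start=1):
        gain = labels[i] * diversity(i, order[:rank - 1], sim)
        running += gain                  # sum of y * Div up to this rank
        total += gain * running / rank
    return total / num_relevant


def drr_rank(rel: Sequence[float], sim: Matrix) -> List[int]:
    """Greedy ordering: at each rank, maximize the expected-ADP increment."""
    remaining = list(range(len(rel)))
    order: List[int] = []
    c = 0.0                              # sum of r * Div over selected images
    while remaining:
        k = len(order) + 1
        best, best_score, best_div = remaining[0], -1.0, 0.0
        for i in remaining:
            d = diversity(i, order, sim)
            score = rel[i] * d * (d + c) / k
            if score > best_score:       # strict '>' keeps the lowest index on ties
                best, best_score, best_div = i, score, d
        order.append(best)
        remaining.remove(best)
        c += rel[best] * best_div
    return order
```

For example, with two duplicate images (0 and 1) and one distinct image (2), `drr_rank([0.9, 0.8, 0.6], sim)` places the distinct image second even though its relevance score is the lowest, and the resulting list has a higher ADP than the pure relevance ordering when all three images are relevant. When all off-diagonal similarities are zero, `average_diverse_precision` reduces to the conventional AP.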

The DRR algorithm is actually a greedy approach to optimizing the expected value of the ADP measurement. Analogous to AP, we can estimate the expected value of ADP as

$$E\{ADP(\tau, \mathcal{D})\} = \frac{1}{R} \sum_{j=1}^{n} \frac{1}{j} \left( r(\tau(j)) \, Div^2(\tau(j)) + r(\tau(j)) \, Div(\tau(j)) \sum_{k=1}^{j-1} r(\tau(k)) \, Div(\tau(k)) \right). \tag{6}$$

The direct optimization of $E\{ADP(\tau, \mathcal{D})\}$ is a permutation problem, and the solution space scales as $O(n!)$. Thus, here we propose a greedy method to solve it. Considering the top $k-1$ documents have been established, based on (6), we can derive that the $k$th image should be decided as follows:

$$\tau(k) = \arg\max_{x \in \mathcal{D} \setminus \mathcal{S}} \; \frac{1}{k} \, r(x) \, Div(x) \bigl( Div(x) + C \bigr) \tag{7}$$

where $\mathcal{S} = \{\tau(1), \ldots, \tau(k-1)\}$ is the set of images already selected,

$$Div(x) = \min_{x' \in \mathcal{S}} \bigl( 1 - s(x', x) \bigr) \tag{8}$$

$$C = \sum_{j=1}^{k-1} r(\tau(j)) \, Div(\tau(j)). \tag{9}$$

Fig. 3 illustrates the implementation process of the DRR algorithm. Note that $C$ can be viewed as constant in (7). Thus, we can clearly see that the selection of the $k$th image will be determined by two factors: the relevance of the image and its difference with the previously selected images.

IV. RELEVANCE AND SIMILARITY OF SOCIAL IMAGES

Here, we introduce the estimation of relevance scores and similarities of social images, which are the two necessary components of the DRR algorithm (see Fig. 3). The following notations will be used. Given a query tag $t_q$, denote by $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$ the collection of images that are associated with the tag. For image $x_i$, denote by $T_i$ the set of its associated tags. The relevance scores of all images in $\mathcal{D}$ are represented in a vector $\mathbf{r} = [r(x_1), r(x_2), \ldots, r(x_n)]^T$, whose element $r(x_i)$ denotes the relevance score of image $x_i$ with respect to query tag $t_q$. Denote by $W$ a similarity matrix whose element $W_{ij}$ indicates the visual similarity between images $x_i$ and $x_j$.

A. Relevance Estimation

Our relevance estimation approach is accomplished by leveraging both the visual information of images and the semantic information of tags. Our first assumption is that the relevance of an image should depend on the “closeness” of its tags to the query tag. Thus, we first have to define the similarity of tags. Different from images that can be represented as sets of low-level features, tags are textual words, and their similarity exists only in semantics. Recently, there have been several works that aimed to address this issue [3], [27]. Here, we adopt an approach that is analogous to Google distance [3], in which the similarity between tags $t_i$ and $t_j$ is defined as

$$s(t_i, t_j) = \exp\left( - \frac{\max\bigl(\log f(t_i), \log f(t_j)\bigr) - \log f(t_i, t_j)}{\log G - \min\bigl(\log f(t_i), \log f(t_j)\bigr)} \right) \tag{10}$$

where $f(t_i)$ and $f(t_j)$ are the numbers of images associated with $t_i$ and $t_j$ on Flickr, respectively, $f(t_i, t_j)$ is the number of images associated with both $t_i$ and $t_j$ simultaneously, and $G$ is the total number of images on Flickr. Therefore, the similarity of the query tag $t_q$ and the tag set $T_i$ of image $x_i$ can be computed as

$$s(t_q, T_i) = \frac{1}{|T_i|} \sum_{t \in T_i} s(t_q, t). \tag{11}$$
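The tag-based part of the relevance estimate can be illustrated with a short Python sketch. It assumes that the tag similarity is the exponential of a negated Google-distance-like quantity computed from tag counts, and that the similarity between the query tag and an image's tag set is the average over that image's tags; both forms, as well as the function names, the dictionary-based count tables, and the toy counts in the usage note, are assumptions made for illustration rather than details confirmed by the paper.

```python
import math
from typing import Dict, FrozenSet, Iterable


def tag_similarity(f_i: int, f_j: int, f_ij: int, total: int) -> float:
    """Similarity of two tags from their image counts (assumed exp(-distance) form).

    f_i, f_j: numbers of images carrying each tag (assumed positive and < total)
    f_ij:     number of images carrying both tags
    total:    total number of images in the collection
    """
    if f_ij == 0:
        return 0.0  # tags that never co-occur are treated as unrelated
    log_i, log_j = math.log(f_i), math.log(f_j)
    distance = (max(log_i, log_j) - math.log(f_ij)) / (math.log(total) - min(log_i, log_j))
    return math.exp(-distance)


def tag_set_relevance(query: str,
                      tags: Iterable[str],
                      counts: Dict[str, int],
                      cooccur: Dict[FrozenSet[str], int],
                      total: int) -> float:
    """Average similarity between the query tag and the tags of one image."""
    tag_list = list(tags)
    if not tag_list:
        return 0.0
    sims = []
    for t in tag_list:
        if t == query:
            sims.append(1.0)  # a tag is maximally similar to itself
        else:
            f_ij = cooccur.get(frozenset((query, t)), 0)
            sims.append(tag_similarity(counts[query], counts[t], f_ij, total))
    return sum(sims) / len(sims)
```

With hypothetical counts in which “dolphin” co-occurs with “ocean” far more often than with “party,” an image tagged “ocean” receives a higher tag-set relevance for the query “dolphin” than an image tagged “party.”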

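Fig. 4 illustrates two examples to demonstrate the rationale of this approach. Fig. 4(a) and (b) shows two images associated with query tags “dolphin” and “eagle,” respectively. Intuitively, we can see that the images on the left are much more relevant than those on the right. This fact can be inferred from the associated tag sets of the images: the tags of the images on the left are obviously closer to the query tags (for example, “dolphin” is strongly correlated with “ocean” and “water,” and this correlation can be reflected in their WordNet distance or Google distance [3]).

Fig. 4. (a) Two images tagged with “dolphin.” (b) Two images tagged with “eagle.” We can see that the images on the left are more relevant with respect to the query tags, and their associated tags are also closer to the query tags.

Our second assumption is that the relevance scores of visually similar images should be close. The visual similarity between two images can be directly computed based on a Gaussian kernel function with a radius parameter $\sigma$, i.e.,

$$W_{ij} = \exp\left( - \frac{\| x_i - x_j \|^2}{\sigma^2} \right). \tag{12}$$

However, it is worth mentioning that we can also adopt other similarity measures here, such as those proposed in [24] and [17]. Note that this assumption may not hold for several images, but it is still reasonable in most cases. Based on the two assumptions, we formulate a regularization framework as follows:

$$Q(\mathbf{r}) = \frac{1}{2} \sum_{i,j=1}^{n} W_{ij} \left( \frac{r_i}{\sqrt{D_{ii}}} - \frac{r_j}{\sqrt{D_{jj}}} \right)^2 + \lambda \sum_{i=1}^{n} \bigl( r_i - s(t_q, T_i) \bigr)^2 \tag{13}$$

where $r_i$ is the relevance score of $x_i$, and $D_{ii} = \sum_{j=1}^{n} W_{ij}$. We can see that the above regularization scheme contains two terms. The first term is a smoothness term, which means that the relevance scores of two visually similar images should be close (i.e., $r_i / \sqrt{D_{ii}}$ and $r_j / \sqrt{D_{jj}}$ should be close if $W_{ij}$ is great), and the second term is a consistency term, which means that the relevance score should be consistent with the relevance of the tag set (i.e., $r_i$ should be large if $s(t_q, T_i)$ is great). The above equation can be written in matrix form as

$$Q(\mathbf{r}) = \mathbf{r}^T (I - S) \mathbf{r} + \lambda \| \mathbf{r} - \mathbf{v} \|^2 \tag{14}$$

where $S = D^{-1/2} W D^{-1/2}$, $D$ is the diagonal matrix with entries $D_{ii}$, and $\mathbf{v} = [s(t_q, T_1), \ldots, s(t_q, T_n)]^T$. Taking the derivative of (14) with respect to $\mathbf{r}$, we obtain

$$\frac{\partial Q}{\partial \mathbf{r}} = 2 (I - S) \mathbf{r} + 2 \lambda (\mathbf{r} - \mathbf{v}) = 0 \tag{15}$$

and we can derive that

$$\mathbf{r}^* = \frac{\lambda}{1 + \lambda} \left( I - \frac{1}{1 + \lambda} S \right)^{-1} \mathbf{v}. \tag{16}$$

This is the closed-form solution of our optimization framework. However, we can see that the above solution involves the inverse of an $n \times n$ matrix, of which the computational cost scales as $O(n^3)$. So, we propose a more efficient algorithm to solve $\mathbf{r}$ in an iterative way, with the steps given here.

Step 1) Construct the image affinity matrix $W$ by (12) if $i \ne j$, and otherwise $W_{ii} = 0$.

Step 2) Initialize $\mathbf{r}^{(0)}$. The initial values will not influence the final results.

Step 3) Iterate

$$\mathbf{r}^{(t+1)} = \frac{1}{1 + \lambda} S \mathbf{r}^{(t)} + \frac{\lambda}{1 + \lambda} \mathbf{v} \tag{17}$$

until convergence.

The method can be viewed as a random walk process, and it will converge to a fixed point, as stated in Theorem 2.

Theorem 2: The iterative process converges to the optimal $\mathbf{r}^*$ in (16).

Proof: According to the iterative function (17), we have

$$\mathbf{r}^{(t)} = \left( \frac{1}{1 + \lambda} S \right)^{t} \mathbf{r}^{(0)} + \frac{\lambda}{1 + \lambda} \sum_{k=0}^{t-1} \left( \frac{1}{1 + \lambda} S \right)^{k} \mathbf{v}. \tag{18}$$

Based on the fact that $0 < 1/(1 + \lambda) < 1$ and the eigenvalues of matrix $S$ are in $[-1, 1]$, we have

$$\lim_{t \to \infty} \left( \frac{1}{1 + \lambda} S \right)^{t} = 0 \tag{19}$$

and

$$\lim_{t \to \infty} \sum_{k=0}^{t-1} \left( \frac{1}{1 + \lambda} S \right)^{k} = \left( I - \frac{1}{1 + \lambda} S \right)^{-1}. \tag{20}$$

Hence,

$$\lim_{t \to \infty} \mathbf{r}^{(t)} = \frac{\lambda}{1 + \lambda} \left( I - \frac{1}{1 + \lambda} S \right)^{-1} \mathbf{v}. \tag{21}$$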
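The iterative scheme can be illustrated with a small, dependency-free Python sketch. It assumes a Gaussian affinity of the form exp(-d²/σ²) with a zero diagonal, the symmetric normalization S = D^{-1/2} W D^{-1/2}, and an update that mixes the propagated scores S r with the tag-based scores v using weights 1/(1+λ) and λ/(1+λ). These specific forms, the function names, and the stopping rule are assumptions for illustration; every image is also assumed to have at least one neighbor with nonzero affinity.

```python
import math
from typing import List, Sequence


def gaussian_affinity(features: Sequence[Sequence[float]], sigma: float) -> List[List[float]]:
    """Pairwise Gaussian affinities with a zero diagonal (assumed kernel form)."""
    n = len(features)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
                W[i][j] = math.exp(-d2 / sigma ** 2)
    return W


def normalized_affinity(W: Sequence[Sequence[float]]) -> List[List[float]]:
    """S = D^{-1/2} W D^{-1/2}, where D holds the row sums of W (assumed positive)."""
    deg = [sum(row) for row in W]
    n = len(W)
    return [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def propagate_relevance(W: Sequence[Sequence[float]],
                        v: Sequence[float],
                        lam: float = 0.1,
                        tol: float = 1e-10,
                        max_iter: int = 10000) -> List[float]:
    """Iterate r <- alpha * S r + (1 - alpha) * v with alpha = 1 / (1 + lam)."""
    S = normalized_affinity(W)
    alpha = 1.0 / (1.0 + lam)
    n = len(v)
    r = list(v)  # any initialization leads to the same fixed point
    for _ in range(max_iter):
        new = [alpha * sum(S[i][j] * r[j] for j in range(n)) + (1.0 - alpha) * v[i]
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, r)) < tol:
            return new
        r = new
    return r
```

The iteration avoids the matrix inverse of the closed-form solution: each sweep costs one matrix-vector product, and the error shrinks geometrically because the factor 1/(1+λ) is below one. In a toy case with two visually close images whose tag-based scores are 0.9 and 0.1, the propagated scores are pulled toward each other, which is the intended effect of the smoothness term.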

This limit is the same as the closed-form solution in (16).

B. Similarity Estimation

We define a semantic similarity for social images, which is estimated based on their associated tag sets. Note that we have obtained the similarity of a tag pair in (10). Consequently, we estimate the semantic similarity of images $x_i$ and $x_j$ as

$$s(x_i, x_j) = \frac{1}{2 |T_i|} \sum_{t \in T_i} \max_{t' \in T_j} s(t, t') + \frac{1}{2 |T_j|} \sum_{t' \in T_j} \max_{t \in T_i} s(t', t). \tag{22}$$

We can see that the above definition satisfies the following properties.

1) $s(x_i, x_j) = s(x_j, x_i)$, i.e., the semantic similarity is symmetric.

2) $s(x_i, x_j) = 1$ if $T_i = T_j$, i.e., the semantic similarity of two images is 1 if their tag sets are identical.

3) $s(x_i, x_j) = 0$ if and only if $s(t, t') = 0$ for every $t \in T_i$ and $t' \in T_j$, i.e., the semantic similarity is 0 if and only if every pair formed by the two tag sets has zero similarity.

This method is close to Song et al.’s approach [20], which estimates the similarity of images based on their annotated semantic concepts. However, their method simply counts the overlapped concepts of two images, and our approach further takes the relationship between different concepts into consideration.

Now we explain why we do not use visual similarity, which should be the most straightforward approach. First, we emphasize that visual diversity and semantic diversity have both been investigated in many research works [12], [20], [26], and they have their own rationale. However, in our scheme, search relevance will be remarkably degraded if we adopt visual similarity. This is because the distribution of relevant images is tighter in visual space than in semantic space. To quantitatively demonstrate this fact, we first define the aggregation score of relevant images for a query as follows:

(23)

Here, the score is computed from the sets of relevant and irrelevant images for the query. Then, we compare the aggregation scores using visual similarity and semantic similarity for several queries. The dataset and parameter settings will be described in Section V. Table I illustrates the results. From the table, we can see that the obtained aggregation scores with visual similarity are much higher than those obtained with semantic similarity. This indicates that the distribution of relevant images is tighter in visual space than in semantic space (now we can revisit our second assumption in relevance estimation and see its rationale: the relevance scores of visually similar images should be close). Therefore, in our diverse relevance ranking approach it will be difficult to simultaneously maintain a high relevance level and visual diversity. For example, in the process in Fig. 3, if the previous images are relevant, then the next image will have a high probability of being irrelevant, as we force it to be visually different from the previous images. Empirical results in Section V will demonstrate this fact, and the user study will show the superiority of semantic similarity over visual similarity.

TABLE I. AGGREGATION SCORE COMPARISON OF RELEVANT IMAGES WITH VISUAL SIMILARITY AND WITH SEMANTIC SIMILARITY. HIGHER AGGREGATION SCORE INDICATES THAT THE DISTRIBUTION OF RELEVANT SAMPLES IS TIGHTER IN THE SPACE AND THE DIVERSIFYING PROCESS IS MORE LIKELY TO DEGRADE SEARCH RELEVANCE

V. EMPIRICAL STUDY

A. Flickr Dataset

We evaluate our approach on a set of social images collected from Flickr. We first select a diverse set of popular tags from the tag list of [24]. They are: airshow, apple, beach, bird, car, cow, dolphin, eagle, flower, fruit, jaguar, jellyfish, lion, owl, panda, starfish, triumphal, turtle, watch, waterfall, wolf, chopper, fighter, flame, hairstyle, horse, motorcycle, rabbit, shark, snowman, sport, wildlife, aquarium, basin, bmw, chicken, decoration, forest, furniture, glacier, hockey, matrix, Olympics, palace, rainbow, rice, sailboat, seagull, spider, swimmer, telephone, and weapon. We then perform tag-based image search by regarding each of the 52 tags as a query, using the “ranking by most recent” option. The top 2000 returned images for each query are collected together with their associated information, including tags, uploading time, and user identifier. In this way, we obtain a social image collection consisting of 104 000 images and 83 999 unique tags. However, many of the raw tags are misspelled or meaningless. Hence, we adopt a prefiltering process on these tags. Specifically, we match each tag with the entries in a Wikipedia thesaurus, and only the tags that have a coordinate in Wikipedia are kept. In this way, 12 921 unique tags are kept for our experiment, and there are 7.74 tags associated with an image on average. Fig. 5 illustrates the frequencies of the tags in the dataset before and after filtering. For each image, we extract 428-D features, including 225-D block-wise color moment features generated from a 5 × 5 fixed partition of the image, 128-D wavelet texture features, and 75-D edge distribution histogram features. The ground truth of the relevance of each image is voted by three human labelers.4 The radius parameter $\sigma$ in (12) is empirically set to the median value of all of the pair-wise Euclidean distances between images, and the weighting parameter $\lambda$ in (13) is empirically set to 0.1 for all queries.
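The median rule for the kernel radius can be sketched in Python as follows. The function name is hypothetical and the sketch simply enumerates all image pairs, which is only practical for small collections; for the 2000 images per query used here, a vectorized or sampled variant would likely be preferable.

```python
import math
from itertools import combinations
from statistics import median
from typing import Sequence


def median_pairwise_distance(features: Sequence[Sequence[float]]) -> float:
    """Median Euclidean distance over all unordered pairs of feature vectors."""
    return median(math.dist(a, b) for a, b in combinations(features, 2))
```

Setting the radius this way ties the kernel width to the scale of the data, so that roughly half of the image pairs fall within one radius of each other regardless of the feature dimensionality.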
4There are 0.59% of images that have labeling disagreements (for the web image dataset introduced in Section VI, 0.67% of the images have labeling disagreements). These disagreements are mainly caused by: 1) careless labeling mistakes and 2) several images that are relevant but not so typical, and thus they involve subjective disagreements. Thus, here we perform a voting with three labelers. In this way, most mislabeling caused by carelessness can be corrected, and for subjective disagreements it is also fair to adopt majority decision.


Fig. 5. Frequency distribution of tags before and after filtering.

B. Experimental Results

We first compare the following six ranking methods.

1) Time-based ranking, i.e., we order the images according to their uploading time.
2) Relevance-based ranking, i.e., we order the images according to their estimated relevance scores.
3) We cluster the images with affinity propagation [4] and then generate the ranking list accordingly. Here, we choose affinity propagation because it can simultaneously perform clustering and select an exemplar (i.e., a representative sample) from each cluster. In addition, it can automatically determine the number of clusters. We place the cluster exemplars at the front of the ranking list, ordered according to their relevance scores. The nonexemplars are then ordered by their relevance scores behind the exemplars.
4) We first rank images according to their relevance scores and then perform the folding method proposed in [12]. It sequentially selects the images that are most relevant and lie at least a threshold distance from the already selected samples. The remaining images are then ordered according to their relevance scores behind them. We set the distance threshold to the mean value of pairwise distances among all images.
5) We use DRR with visual similarity.
6) We use DRR with semantic similarity, i.e., the method proposed in this work.

For simplicity, these methods are denoted as "time-based ranking," "relevance-based ranking," "AP-based diversifying," "folding-based diversifying," "DRR with visual similarity," and "DRR with semantic similarity," respectively. The first two methods are baselines. The AP-based and folding-based diversifying methods both diversify search results with a clustering approach on a relevance-based ranking list, but the AP-based method favors diversity more and the folding-based method favors relevance more.

Fig. 6 illustrates the top 20 results of an exemplary query "waterfall," from which we can see that the results of our method are both relevant and diverse, whereas the results of the other methods are unsatisfactory in terms of relevance or diversity. Figs. 7 and 8 illustrate the AP and ADP measurements obtained by different methods, respectively. We also illustrate the mean AP (MAP) and mean ADP (MADP) measurements that are averaged over all queries. The MAP measurements of the six methods are 0.583, 0.684, 0.646, 0.621, 0.577, and 0.664, respectively, and their MADP measurements are 0.308, 0.361, 0.374, 0.334, 0.331, and 0.411, respectively. Relevance-based ranking achieves the highest AP measurement, but its ADP measurement is rather low. This indicates that it suffers from the lack-of-diversity problem. Although the AP-based diversifying, folding-based diversifying, and DRR with visual similarity methods can enhance the diversity, they degrade search relevance considerably in comparison with relevance-based ranking (we have analyzed why DRR with visual similarity will degrade search relevance in Section VI-B), and thus their ADP measurements are not high. DRR with semantic similarity achieves the best ADP measurements, and it performs only slightly worse than relevance-based ranking in terms of AP. This shows that it is able to achieve a good tradeoff between relevance and diversity.

We have conducted a user study to compare the six ranking schemes. To avoid bias, a third-party data management company was involved. The company invited 30 anonymous participants, who declared they were regular users of the Internet and familiar with image-search and media-sharing websites. They were asked to freely choose queries and observe image ranking lists. They compared DRR with semantic similarity, i.e., our proposed approach, with each of the other five ranking approaches in terms of search relevance and diversity.5 The users were asked to give each comparison result as one of three outcomes: "better," "much better," or "comparable." To quantify the results, we convert them into ratings. We assign a score of 1 to the worse scheme, and the other scheme is assigned a score of 2, 3, or 1 if it is better than, much better than, or comparable to this one, respectively.
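The rating conversion described above can be sketched as follows. This is only an illustrative sketch: the outcome labels, the function names `to_ratings` and `summarize`, the use of the sample standard deviation, and the four demo judgments are all assumptions made for illustration and are not taken from the study.

```python
from statistics import mean, stdev

# Hypothetical labels for the judgments in which one scheme wins.
WINNER_SCORE = {"better": 2, "much better": 3}


def to_ratings(outcome, winner=None):
    """Convert one pairwise judgment into (rating_a, rating_b).

    outcome: "better", "much better", or "comparable".
    winner:  "a" or "b"; ignored when the outcome is "comparable".
    """
    if outcome == "comparable":
        return (1, 1)  # neither scheme is worse, so both get the base score
    if outcome not in WINNER_SCORE or winner not in ("a", "b"):
        raise ValueError(f"invalid judgment: {outcome!r}, {winner!r}")
    top = WINNER_SCORE[outcome]
    # The worse scheme always receives 1; the winner receives 2 or 3.
    return (top, 1) if winner == "a" else (1, top)


def summarize(judgments):
    """Return {"a": (mean, std), "b": (mean, std)} over all judgments."""
    pairs = [to_ratings(outcome, winner) for outcome, winner in judgments]
    ratings_a = [p[0] for p in pairs]
    ratings_b = [p[1] for p in pairs]
    # Sample standard deviation is an assumption; the text does not specify it.
    return {
        "a": (mean(ratings_a), stdev(ratings_a)),
        "b": (mean(ratings_b), stdev(ratings_b)),
    }


# Made-up judgments from four hypothetical participants.
demo = [("better", "a"), ("much better", "a"), ("comparable", None), ("better", "b")]
summary = summarize(demo)
```

On the invented `demo` data, scheme a receives ratings 2, 3, 1, 1 and scheme b receives 1, 1, 1, 2, so their mean ratings are 1.75 and 1.25.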
Thus, for each comparison, there are 30 ratings. Since there will be disagreements among the evaluators, we perform a two-way analysis of variance (ANOVA) test [11] to statistically analyze the comparison. It partitions the observed rating scores into components corresponding to different explanatory factors, and it is able to test the significance levels of the rating differences with respect to the factors of ranking scheme and user. The number of total degrees of freedom is 59, and the numbers of degrees of freedom for the ranking and user factors are 1 and 29, respectively. The five comparison results are illustrated in Tables II-VI, respectively. The results demonstrate the superiority of our approach over the other methods. The ANOVA test shows that the superiority is statistically significant and the difference among the evaluators is not significant. This further confirms the effectiveness of our approach.

In the user study, we also found several failure cases of our approach, in which the top images are irrelevant or not diverse enough. One major reason is noise in the tags, which results in inaccurate relevance and semantic similarity estimation. Performing a tag refinement step [10], [14], [15] to reduce the noisy tags should further benefit our approach.

5 It is worth noting that diversity is not directly related to a user's search requirements. Therefore, our user test process is as follows. In the comparison of two ranking schemes, the ranking lists were shuffled and blended to keep the comparison fair. The users compared them by considering both search relevance and diversity (they only saw the images without the associated tags). They were asked to take search relevance and comprehensiveness into account. For search comprehensiveness, they were asked to imagine different search intentions when they posed these queries for themselves; a list is then better if its top results cover more possibilities.

Fig. 6. Top results of different ranking methods of query "waterfall." (a) Time-based ranking. (b) Relevance-based ranking. (c) AP-based diversifying. (d) Folding-based diversifying. (e) DRR with visual similarity. (f) DRR with semantic similarity.

C. Computational Cost

According to the introduction in Sections III and IV, the computational costs of relevance estimation, semantic similarity computation, and the DRR algorithm all scale at the same order. However, the relevance and similarity estimation can be accomplished offline (an image-tag relevance matrix and a sparse image similarity matrix can be stored). We also do not need to generate the full ranking list using DRR. In practice, we can generate only the top portion of the list with the proposed algorithm, and then rank the remaining images simply by their relevance scores. In our experiments, it takes approximately 1.2 s to accomplish the ranking with 2000 images (Pentium 4 3.0-GHz CPU and 2 GB of memory), and a study on web users shows that the tolerable waiting time for web information retrieval is about 2 s [16]. The process can be sped up further by adopting several strategies. For example, we can rank the images in a piecewise manner, such as first ranking the most relevant 500 images with DRR and putting them on the top, then ranking the next most relevant 500 images with DRR, and so forth.

VI. APPLICATION IN WEB IMAGE SEARCH DIVERSIFICATION

Although DRR is proposed for social image search, it is a flexible scheme and can be easily extended to other applications. Here, we demonstrate its application in web image search reranking to tackle the lack-of-diversity problem.

A. Web Image Search Reranking With DRR

Nowadays, most of the popular web image search websites adopt the text-based search modes. The relevance of each web image is estimated by analyzing its related textual information, such as the image’s file name, ALT texts, captions, and surrounding texts on the web pages, and then images are ranked accordingly. However, near-duplicates widely exist on the Internet (for example, many images are proliferated and shared by multiple websites), and the lack-of-diversity problem is frequently encountered in web image search.


Fig. 7. Comparison of AP measurements of different approaches.

Fig. 8. Comparison of ADP measurements of different approaches.

TABLE II THE LEFT SIDE OF EACH COLUMN ILLUSTRATES THE MEAN AND STANDARD DEVIATION VALUES OF THE RATING SCORES CONVERTED FROM THE USER STUDY ON THE COMPARISON OF DRR WITH SEMANTIC SIMILARITY AND TIME-BASED RANKING. THE RIGHT SIDE OF EACH COLUMN ILLUSTRATES THE ANOVA TEST RESULTS. THE p-VALUES SHOW THAT THE DIFFERENCE OF THE TWO RANKING SCHEMES IS SIGNIFICANT, AND THE DIFFERENCE OF USERS IS INSIGNIFICANT

TABLE III THE LEFT SIDE OF EACH COLUMN ILLUSTRATES THE MEAN AND STANDARD DEVIATION VALUES OF THE RATING SCORES CONVERTED FROM THE USER STUDY ON THE COMPARISON OF DRR WITH SEMANTIC SIMILARITY AND RELEVANCE-BASED RANKING. THE RIGHT SIDE OF EACH COLUMN ILLUSTRATES THE ANOVA TEST RESULTS. THE p-VALUES SHOW THAT THE DIFFERENCE OF THE TWO RANKING SCHEMES IS SIGNIFICANT, AND THE DIFFERENCE OF USERS IS INSIGNIFICANT


TABLE IV THE LEFT SIDE OF EACH COLUMN ILLUSTRATES THE MEAN AND STANDARD DEVIATION VALUES OF THE RATING SCORES CONVERTED FROM THE USER STUDY ON THE COMPARISON OF DRR WITH SEMANTIC SIMILARITY AND AP-BASED DIVERSIFYING. THE RIGHT SIDE ILLUSTRATES THE ANOVA TEST RESULTS. THE p-VALUES SHOW THAT THE DIFFERENCE OF THE TWO RANKING SCHEMES IS SIGNIFICANT, AND THE DIFFERENCE OF USERS IS INSIGNIFICANT

TABLE V THE LEFT SIDE OF EACH COLUMN ILLUSTRATES THE MEAN AND STANDARD DEVIATION VALUES OF THE RATING SCORES CONVERTED FROM THE USER STUDY ON THE COMPARISON OF DRR WITH SEMANTIC SIMILARITY AND FOLDING-BASED DIVERSIFYING. THE RIGHT SIDE ILLUSTRATES THE ANOVA TEST RESULTS. THE p-VALUES SHOW THAT THE DIFFERENCE OF THE TWO RANKING SCHEMES IS SIGNIFICANT, AND THE DIFFERENCE OF USERS IS INSIGNIFICANT

TABLE VI THE LEFT SIDE OF EACH COLUMN ILLUSTRATES THE MEAN AND STANDARD DEVIATION VALUES OF THE RATING SCORES CONVERTED FROM THE USER STUDY ON THE COMPARISON OF DRR WITH SEMANTIC SIMILARITY AND DRR WITH VISUAL SIMILARITY. THE RIGHT SIDE ILLUSTRATES THE ANOVA TEST RESULTS. THE p-VALUES SHOW THAT THE DIFFERENCE OF THE TWO RANKING SCHEMES IS SIGNIFICANT, AND THE DIFFERENCE OF USERS IS INSIGNIFICANT

Fig. 9. Top results of query “US Flag” before and after reranking. (a) Original results. (b) After reranking.

Here, we propose a reranking approach based on the previously introduced DRR algorithm to enhance the diversity of web image search results. Compared with its implementation for social images, we only need to replace the relevance and similarity estimation components. Different from social images, the relevance scores of web images are measured based on their initial ranking order. Specifically, given an initial image ranking list, we estimate the relevance of each image as a monotonically decreasing function of its position in the list. In this work, we use a sigmoid function, given in (24). According to the equation, the top images are assigned high relevance scores and the score degrades as the rank position increases (we have utilized the order of the previous search results, and thus DRR is used as a reranking approach here). For image similarity estimation, we directly adopt visual similarity since there is no tag information for web images. It is worth mentioning that this approach is, of course, not optimal. In fact, we can also estimate web images' relevance scores and similarities based on the information of surrounding text. However, this method is robust and flexible, as it is independent of the specific web image search technology. Experimental results will show that this simple approach can already significantly improve search performance.

Fig. 10. Top results of query "moon" before and after reranking. (a) Original results. (b) After reranking.

Fig. 11. AP and ADP comparison of the results before and after reranking.

B. Experimental Results

We conduct an empirical study with a set of web images collected from a popular commercial image search engine. We first select 30 popular queries from a query log of the commercial image search engine and then collect the top 1000 images for each query. The queries are: US Flag, apple, basketball, bear, bull, chocolate, heart, lion, moon, msn, peony, sunflower, sunset, tiger, cat, panda, earth, background, spider, fish, animal, turtle, beach, school, sports, cow, football, nokia, map, and cowboy. In this way, we obtain a collection of 30 000 images. Analogous to social images, the relevance ground truth of each image with respect to the corresponding query is established by the voting of three human labelers. We also extract the 428-D features from each image to represent it. The parameter in (24) is empirically set to 100 for all queries.

We compare the results before and after reranking. Figs. 9 and 10 illustrate the top 20 results of two typical queries, and Fig. 11 illustrates the AP and ADP comparison. We have previously analyzed that, in diverse relevance ranking for social image search, relevance is significantly degraded if we apply visual similarity, but Fig. 11 shows that here the AP measurements degrade only slightly after reranking. This is because the original web image search maintains high relevance levels (the mean AP is about 0.8), and thus most images are relevant. Therefore, the relevance degradation factor is alleviated. Meanwhile, the ADP measurements are remarkably improved after reranking. This demonstrates that our reranking approach successfully diversifies the original search results while keeping relevance.

We have also conducted a user study with the same 30 participants and the process introduced in Section V. Table VII illustrates the mean and standard deviations of the rating scores as well as the ANOVA test results. From the numbers, we can see that the results after reranking are better than the original results and that the difference is statistically significant.

TABLE VII THE LEFT SIDE OF EACH COLUMN ILLUSTRATES THE MEAN AND STANDARD DEVIATION VALUES OF THE RATING SCORES CONVERTED FROM THE USER STUDY ON THE COMPARISON OF WEB IMAGE SEARCH WITH AND WITHOUT RERANKING. THE RIGHT SIDE ILLUSTRATES THE ANOVA TEST RESULTS. THE p-VALUES SHOW THAT THE DIFFERENCE OF THE TWO RANKING SCHEMES IS SIGNIFICANT, AND THE DIFFERENCE OF USERS IS INSIGNIFICANT

VII. CONCLUSION

This paper proposes a DRR scheme for social image search, which is able to simultaneously take relevance and diversity into account. It leverages both the visual information of images and the semantic information of tags.
The ranking algorithm optimizes an ADP measure, which is generalized from the conventional AP measure by integrating diversity. Experimental results have demonstrated the effectiveness of the approach. In addition, we have also shown the application of the DRR scheme in web image search diversification.

REFERENCES

[1] D. Cai, X. He, Z. Li, W. C. Ma, and J. C. Wen, “Hierarchical clustering of WWW image search results using visual, textual and link information,” in Proc. ACM Multimedia Conf., 2004, pp. 952–959.
[2] J. Carbonell and J. Goldstein, “The use of MMR, diversity-based reranking for reordering documents and producing summaries,” in Proc. SIGIR Conf., 1998, pp. 335–336.
[3] R. Cilibrasi and P. M. B. Vitanyi, “The Google similarity distance,” IEEE Trans. Knowl. Data Eng., vol. 19, no. 3, pp. 370–383, Mar. 2007.
[4] B. J. Frey and D. Dueck, “Clustering by passing messages between data points,” Science, vol. 315, pp. 972–976, 2007.
[5] W. Goffman, “A searching procedure for information retrieval,” Inf. Storage Retrieval, vol. 2, pp. 73–78, 1964.
[6] W. H. Hsu, L. S. Kennedy, and S. C. Chang, “Video search reranking via information bottleneck principle,” in Proc. ACM Multimedia Conf., 2006, pp. 35–44.
[7] A. Jaimes, S. C. Chang, and A. C. Loui, “Detection of non-identical duplicate consumer photographs,” in Proc. ACM Multimedia Conf., 2003, pp. 16–20.
[8] F. Jing, C. Wang, Y. Yao, K. Deng, L. Zhang, and W. C. Ma, “IGroup: Web image search results clustering,” in Proc. ACM Multimedia Conf., 2006, pp. 587–596.
[9] L. S. Kennedy, S. F. Chang, and I. V. Kozintsev, “To search or to label? Predicting the performance of search-based automatic image classifiers,” in Proc. MIR Conf., 2006, pp. 249–258.
[10] L. Kennedy, M. Slaney, and K. Weinberger, “Reliable tags using image similarity: Mining specificity and expertise from large-scale multimedia databases,” in WSMC ’09: Proc. 1st Workshop Web-Scale Multimedia Corpus, New York, 2009, pp. 17–24.
[11] B. M. King and E. W. Minium, Statistical Reasoning in Psychology and Education. New York: Wiley, 2003.
[12] R. H. V. Leuken, L. Garcia, X. Olivares, and R. Zwol, “Visual diversification of image search results,” in Proc. WWW Conf., 2009, pp. 341–350.
[13] J. Li and J. Wang, “Real-time computerized annotation of pictures,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 6, pp. 985–1002, Jun. 2008.
[14] X. R. Li, C. G. M. Snoek, and M. Worring, “Learning tag relevance by neighbor voting for social image retrieval,” in Proc. MIR Conf., 2008, pp. 180–187.
[15] D. Liu, M. Wang, L. Yang, X.-S. Hua, and H. Zhang, “Tag quality improvement for social images,” in Proc. ICME Conf., 2009, pp. 350–353.
[16] F. F. C. Nah, “A study on tolerable waiting time: How long are web users willing to wait,” Behavior Inf. Technol., vol. 23, no. 3, pp. 153–163, 2004.
[17] G. Qi, X. Hua, Y. Rui, J. Tang, Z. Zha, and H. Zhang, “A joint appearance-spatial distance for kernel-based image categorization,” in Proc. CVPR Conf., 2008, pp. 1–8.
[18] S. Robertson, “The probability ranking principle in IR,” J. Documentation, vol. 33, no. 294, pp. 294–304, 1977.
[19] Y. Rui and T. S. Huang, “Relevance feedback: A power tool for interactive content-based image retrieval,” IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 5, pp. 644–655, Sep. 1999.
[20] K. Song, Y. Tian, T. Huang, and W. Gao, “Diversifying the image retrieval results,” in Proc. ACM Multimedia Conf., 2006, pp. 707–710.
[21] S. H. Srinivasan and N. Sawant, “Finding near-duplicate images on the web using fingerprints,” in Proc. ACM Multimedia Conf., 2008, pp. 881–884.
[22] A. Sun and S. S. Bhowmick, “Image tag clarity: In search of visual-representative tags for social images,” in WSM ’09: Proc. 1st SIGMM Workshop on Social Media, New York, 2009, pp. 19–26.
[23] B. Wang, Z. Li, M. Li, and W. C. Ma, “Large-scale duplicate detection for web image search,” in Proc. ICME Conf., 2006, pp. 353–356.
[24] M. Wang, X.-S. Hua, J. Tang, and R. Hong, “Beyond distance measurement: Constructing neighborhood similarity for video annotation,” IEEE Trans. Multimedia, vol. 11, no. 3, pp. 465–476, Mar. 2009.
[25] M. Wang, K. Yang, X.-S. Hua, and H.-J. Hong, “Visual tag dictionary: Interpreting tags with visual words,” in Proc. ACM Workshop Web-Scale Multimedia Corpus, 2009.
[26] K. Q. Weinberger, M. Slaney, and R. Van Zwol, “Resolving tag ambiguity,” in MM ’08: Proc. 16th ACM Int. Conf. Multimedia, New York, 2008, pp. 111–120.
[27] L. Wu, X. C. Hua, W. C. M. N. Yu, and S. Li, “Flickr distance,” in Proc. ACM Multimedia Conf., 2008, pp. 31–40.
[28] C. Zhai, W. W. Cohen, and J. Lafferty, “Beyond independent relevance: Methods and evaluation metrics for subtopic retrieval,” Inf. Process. Manag., pp. 10–17, 2006.
[29] C. Zhai and J. Lafferty, “A risk minimization framework for information retrieval,” Inf. Process. Manag., pp. 31–55, 2006.
[30] W. L. Zhao and C. W. Ngo, “Scale-rotation invariant pattern entropy for keypoint-based near-duplicate detection,” IEEE Trans. Image Process., vol. 18, no. 2, pp. 412–423, Feb. 2009.
[31] J. Zhu, S. C. H. Hoi, M. R. Lyu, and S. Yan, “Near-duplicate keyframe retrieval by nonrigid image matching,” in Proc. ACM Multimedia Conf., 2008, pp. 41–50.

Meng Wang (M’09) received the B.S. and Ph.D. degrees from the University of Science and Technology of China (USTC), Hefei, China, in 1999 and 2008, respectively. From 2008 to 2010, he was with Microsoft Research Asia, Beijing, China, as a Research Staff Member. In 2010, he joined AKiiRA Media Systems, Palo Alto, CA, as a Research Scientist. He has authored or coauthored over 60 papers and book chapters. He served as an editorial board member, guest editor, or program committee member of numerous international journals and conferences. His current research interests include multimedia content analysis, management, search, mining, and large-scale computing. Dr. Wang is a member of the Association for Computing Machinery. He was the recipient of the Best Paper Award at the ACM International Conference on Multimedia 2009 and the Best Paper Award at the International Multimedia Modeling Conference 2010.

Kuiyuan Yang received the B.E. degree from the University of Science and Technology of China in Automation, Hefei, China, in 2007. He is now with the Media Computing Group, Microsoft Research Asia, Beijing, China, as a Research Intern. His research interests include computer vision, multimedia, and machine learning, especially content-based image/video retrieval, analysis, management, and sharing.

Xian-Sheng Hua (M’05) received the B.S. and Ph.D. degrees from Peking University, Beijing, China, in 1996 and 2001, respectively, both in applied mathematics. Since 2001, he has been with Microsoft Research Asia, Beijing, China, where he is currently a Lead Researcher with the Media Computing Group. He has authored or coauthored more than 180 publications in these areas and has more than 50 filed patents or pending applications. He is now an Adjunct Professor with the University of Science and Technology of China, Hefei, China, and serves as an associate editor of ACM Transactions on Intelligent Systems and Technology, an editorial board member of Advances in Multimedia and Multimedia Tools and Applications, and an editor of Scholarpedia (Multimedia Category). He also has successfully served or is serving as program chairs, workshop organizers, demonstration chairs, tutorial chairs, special session chairs, senior TPC members, and PC members of a large number of international conferences. He has invented and shipped more than six technologies into Microsoft mainstream products. His current research interests are in the areas of video content analysis, multimedia search, management, authoring, sharing, mining, advertising and mobile multimedia computing. Dr. Hua is a senior member of the Association for Computing Machinery, a member of the Visual Signal Processing and Communications Technical Committee (VSPC TC) and Multimedia Systems and Applications Technical Committee (MAS TC) of the IEEE Circuits and Systems Society. He is an associate editor of the IEEE TRANSACTIONS ON MULTIMEDIA. 
He was the recipient of the Best Paper Award and Best Demonstration Award at ACM Multimedia Conference in 2007, the Best Poster Award at the 2008 IEEE International Workshop on Multimedia Signal Processing, the Best Student Paper Award at the ACM Conference on Information and Knowledge Management in 2009, and the Best Paper Award at the International Conference on MultiMedia Modeling in 2010. He was also the recipient of the 2008 MIT Technology Review TR35 Young Innovator Award and named as one of the “Business Elites of People under 40 to Watch” by Global Entrepreneur.

Hong-Jiang Zhang (F’04) received the B.S. degree from Zhengzhou University, Henan, China, in 1982, and the Ph.D. degree from the Technical University of Denmark, Lyngby, Denmark, in 1991, both in electrical engineering. From 1992 to 1995, he was with the Institute of Systems Science, National University of Singapore, where he led several projects in video and image content analysis and retrieval and computer vision. From 1995 to 1999, he was a Research Manager with Hewlett-Packard Laboratories, Palo Alto, CA, where he was responsible for research and development in the areas of multimedia management and intelligent image processing. In 1999, he joined Microsoft Research Asia, Beijing, China, where he is currently the Managing Director of Advanced Technology Center. He has coauthored/coedited four books, over 350 papers and book chapters, numerous special issues of international journals on image and video processing, content-based media retrieval, and computer vision, as well as over 60 granted patents. Dr. Zhang is a Fellow of the Association for Computing Machinery. He currently serves as the Editor-In-Chief of the IEEE TRANSACTIONS ON MULTIMEDIA and is on the editorial board of the PROCEEDINGS OF IEEE.
