
Similarity Space Projection for Web Image Search and Annotation

Ying Liu1,2*, Tao Qin1,3*, Tie-Yan Liu1, Lei Zhang1, Wei-Ying Ma1

1 Microsoft Research Asia, 5F, Sigma Center, No. 49, Zhichun Road, Beijing, 100080, P. R. China
2 Gippsland School of Computing and Info. Tech, Monash University, Australia, 3842
3 Dept. Electronic Engineering, Tsinghua University, Beijing, 100084, P. R. China

ABSTRACT
Web image search has been explored and developed in academic as well as commercial areas for over a decade. To measure the similarity between Web images and user queries, most existing Web image search systems try to convert an image to textual keywords by analyzing the available textual information (such as surrounding text and image filename), with or without leveraging visual image features (such as color, texture, shape). In this way, the existing systems transform “Web images” to the “query (text)” space so as to compare the relevance of images to the query. In this paper, we present a novel solution to Web image search: similarity space projection (SSP). This algorithm takes images and queries as two heterogeneous object peers, and projects them into a third Euclidean “similarity space” in which their similarity can be directly measured. The rule of projection guarantees that in the new space the relevant images are kept close to the corresponding query and the irrelevant ones are kept away from it. Experiments on real-world Web image collections showed that the proposed algorithm significantly outperformed traditional information retrieval models (such as the vector space model) in the application of image search. Besides Web image search, we demonstrate that this algorithm can also be applied to the image annotation scenario, with promising performance. Thus, this algorithm unifies Web image search and image annotation into the same framework.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Retrieval Models, Search Process.

General Terms
Algorithms, Design, Experimentation, Performance, Theory

Keywords
Similarity Space Projection, Web Image Search, Image Annotation.

1. INTRODUCTION
Due to the explosive growth of the World-Wide Web (the Web), we can now easily access a huge number of images from the Web. Most of the images on the Web, however, are not categorized properly according to their contents. Hence, the Web can be viewed as a large, unstructured image database [5][17], and Web image search has been actively explored and developed in academic as well as commercial areas [10]. There are commercial search engines such as Google Image Search [23], Yahoo Image Search [25], and AltaVista Image [21], and specialized Web image search engines such as Ditto [22] and PicSearch [26]. There are also systems developed by academic researchers, including WebSeek [14], WebSeer [4], Image Rover [12], and iFind [11]. In addition, many algorithms have been developed to improve Web image search performance [3][5][7][10][17].

Web image search systems can be categorized in terms of the user queries that they support. There are mainly two types of image queries: keyword-based and content-based. Though efforts have been made to provide content-based search using image contents such as color features [4][14], text-based search is still the prevailing choice, as it is more convenient and straightforward for users. Hence, we can say that in most systems the ‘user query’ space is in fact a ‘textual keywords’ space. Web image search systems can also be classified in terms of how images are represented. As Web images usually come with HTML source code that includes textual descriptions, many Web image search systems are text-based, and their representation of images includes filename, caption, surrounding text, etc. [5][16][23][25]. These systems measure the similarity between an image and a user query by estimating the probability that the textual information of the image is relevant to the query. Considering that the textual information associated with Web images might be noisy and incomplete, some systems have been designed to improve Web image search performance by leveraging visual image features [4][7][14][18]. In these systems, learning tools such as image clustering [7] and Support Vector Machines are used to associate Web images with the concepts expressed by user queries. In addition, some other systems make use of further information sources, such as link structure [3] and user feedback [11], to improve the performance of Web image search.

In summary, from the above description, we draw the following conclusions. 1) The basic idea of most existing Web image search systems is to convert Web images to textual keywords, by means of either surrounding text extraction or image annotation, in order to measure the similarity between images and user queries. In other words, the ‘Web image’ is transformed to the ‘user query’ space in order to measure their similarity. 2) More and more attention is being paid to leveraging multiple information sources for image representation in order to boost Web image search performance. However, there are still some challenging problems in this respect. For example, considering the ‘semantic gap’ between the descriptive power of visual image features and the rich semantics in user queries, is it effective to directly transform ‘Web images’ into ‘user query’ space? In addition, how can multiple types of features be efficiently integrated into one model?

In an attempt to find solutions to the aforementioned problems, in this paper we present a novel framework which takes image and query as two heterogeneous object peers and projects both of them into a third space (the so-called ‘similarity space’) in which their similarity can be directly measured. The algorithm is designed to guarantee that, in the ‘similarity space’, the relevant images are kept close to the corresponding query and the irrelevant ones are kept away from it. We refer to this algorithm as ‘similarity space projection (SSP)’. In SSP, images and queries are represented using multiple types of features. Note that the philosophy of SSP is quite different from that of previous Web image search systems, in that queries are no longer treated as conceptual labels but have their own features and can be manipulated. When applying SSP to the Web image search scenario, given a new query, the system can project it into the ‘similarity space’ and measure its distance to the Web images to find the relevant ones. Besides, SSP can also be used for image annotation. Supposing we have a collection of user queries, for a new incoming image SSP can find the queries relevant to it, with the query terms and other words relevant to the selected queries serving as the annotation keywords. Thus, SSP unifies Web image search and image annotation into the same framework. Experimental results on a real-world Web data collection confirm the effectiveness of the proposed algorithm for Web image search as well as image annotation.

The rest of the paper is organized as follows. Section 2 reviews the related work in Web image search. In Section 3, we describe the details of the proposed algorithm. Section 4 explains implementation issues such as how the training data are collected and processed. Experimental results and their discussion are given in Section 5. Finally, Section 6 concludes this paper.

* The work was performed when the first and the second authors were interns at Microsoft Research Asia.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MIR’05, November 10-11, 2005, Singapore. Copyright 2005 ACM 1-59593-244-5/05/0011…$5.00.

2. RELATED WORK
Believing that image contents can enrich Web image search compared with using text only, as in traditional document search systems, researchers developed Web image systems such as WebSeek [14] and WebSeer [4], which integrate keyword-based search and image-feature-based search. WebSeek categorizes the database images into a hierarchy of topics (such as travel, Europe, France) by analyzing URLs and HTML tags. Users can search the topics in the database or search using image color features. The WebSeer [4] system uses image content as well as associated text information from HTML data to classify Web images into categories such as photographs, portraits and computer-generated drawings. The HTML metadata includes image filenames, image captions, the text of hyperlinks, etc. Image content provides cues such as color depth and image size.

However, as users usually prefer a keyword-based search interface, most current Web image systems are text-based, and they use textual information such as filename, caption and surrounding text to represent images [5][16][23][25]. To measure the similarity between an image and a user query, these systems estimate the probability that the corresponding textual information is relevant to the query. To find images matching a user’s query, the Google Image search engine analyzes textual factors of the image, such as the surrounding text and the image caption, to determine image content [23]. To make better use of HTML documents for Web search, Cheng and Ethan developed a relevance model combining 53 textual features obtained from HTML source code, including page-level features (URL, title, metadata), image element features (URL, identifier, alternative text), and the textual content of links, objects, etc. [1]. Given a set of user queries and Web images, each image’s relevance to a query was evaluated by human raters as ‘Relevant’ or ‘Not Relevant’. Logistic regression is then used to create a relevance model that uses the HTML features to predict the human ratings of relevance. In [16], the authors study the effectiveness of HTML metadata (textual content and structure) for Web image search. The system uses the AltaVista search engine to find HTML documents matching the textual query. These documents are then analyzed using a set of 8 clues (such as the image filename and the value of the ALT attribute) to decide whether the images they contain match the query. In [10], to re-order the images retrieved from an image search engine, a relevance model is developed to estimate the probability that an image is relevant to the query by analyzing the HTML documents linking to the image.

Although it is still the prevailing choice for indexing Web images, textual information might be incomplete or ambiguous in describing the actual image content. For example, filenames may be misleading, and surrounding text might not describe the content of an image because of page layout considerations [10]. Considering this, some systems have been designed to improve Web image search performance by leveraging other information sources such as visual image features, link structure and user feedback [3][4][7][11][14][18]. In these systems, relevance feedback [11] or other learning tools, such as the clustering used for image annotation [3][7], are explored to relate Web images to the concepts expressed by user queries. For example, in [7] a bootstrapping framework is presented to automatically annotate Web images using a pre-defined set of concepts. The system adopts a co-training approach involving two classifiers, based on visual and textual evidence, to bootstrap the learning process. The iFind system [11] indexes Web images by their low-level visual features, high-level semantic features collected from the Web, and image annotations learned from user feedback. In [3], a hierarchical clustering technique using visual, textual and link information is proposed. This algorithm can organize the search results returned by existing image search engines into different semantic categories, so that users can quickly find the desired images.

As we can see, existing Web image search systems try to transform the ‘Web image’ to the ‘user query (textual keywords)’ space to measure the similarity between image and query. As mentioned before, there are two challenging questions. The first is “Is it effective to directly transform ‘Web images’ into ‘user query’ space, especially when visual image features are included?” The second is “How can multiple types of features be efficiently integrated into one model?” In attempting to answer these questions, this paper presents a novel solution to Web image search: Similarity Space Projection (SSP). SSP projects both ‘Web image’ and ‘user query’ into a so-called ‘similarity space’. In this space, the similarity between Web image and user query can be directly measured. In addition, this algorithm integrates multiple types of features in the image and query representation.

Besides Web image search, this algorithm can be applied to the image annotation scenario for auto-illustration purposes. Hence, SSP unifies Web image search and image annotation into the same framework. Our initial experimental results demonstrate the effectiveness of the proposed algorithm for both Web image search and image annotation.

3. SIMILARITY SPACE PROJECTION

Unlike existing Web image search systems, which try to transform ‘Web images’ into ‘query (keyword)’ space, our idea is to project both ‘query’ and ‘image’ into a third space, the ‘similarity space’, in which their similarity can be measured. We refer to this algorithm as ‘similarity space projection (SSP)’. The details of SSP are given below.

3.1 Definitions and Preprocessing

First, we define some notation. Suppose we have a set of m images and n queries. In our algorithm, images and queries are represented by their features (the details of feature extraction are given in Section 4). The queries are described by their textual features, while the images are represented by both their textual and visual features. We can generate a global vocabulary containing $N_0$ keywords, $D = \{D_1, \ldots, D_{N_0}\}$, which includes all the words related to the n queries as well as those related to the m images. With the global vocabulary available, and knowing which words in the vocabulary are used to describe the queries, we can easily obtain the initial feature of each query, $q_i^0 \in R^{N_0}$, $i = 1, \ldots, n$. Similarly, we can obtain the initial textual feature of each image, $x_j^t \in R^{N_0}$, $j = 1, \ldots, m$. In the next step, we perform ‘feature selection’ to select the keywords closely related to a query/image for its feature representation. For each query, we design a corresponding ‘feature selection’ operator $T^{(i)}$, $i = 1, \ldots, n$. $T^{(i)}$ is a $k \times N_0$ matrix in which each row contains exactly one element with value ‘1’ and all other elements are ‘0’. Using $T^{(i)}$, we can select from the vocabulary the keywords really relevant to query i. Then we have the new query feature

$$ q_i = T^{(i)} q_i^0 \tag{1} $$

For each image, there are n feature vectors, the textual part of each corresponding to one of the n queries. For example, using $T^{(i)}$, we perform ‘feature selection’ on the initial textual feature of image j to obtain the new textual feature of this image as

$$ x_{ij}^t = T^{(i)} x_j^t \tag{2} $$

Thus, the final image feature corresponding to query i is

$$ x_{ij} = \begin{bmatrix} x_{ij}^t \\ x_j^v \end{bmatrix}, \quad i = 1, \ldots, n; \; j = 1, \ldots, m \tag{3} $$

where $x_{ij}^t$ is the textual feature of image j (with respect to query i) and $x_j^v$ is the visual feature.

Note that there are three reasons to conduct the above feature selection. 1) De-noising. The global vocabulary is usually very large and may contain millions of words. Among these words, only a small number are meaningful to an individual query/image, while the others may be noise. 2) Dimension reduction. Feature selection greatly reduces the dimension of the query/image feature and hence improves computational efficiency. 3) Generalization. Queries are quite different from each other, and generalization over queries has been an important problem. In our approach, we design the feature selection to be query-dependent so that the projection operations can be query-independent.

3.2 The Algorithm

With a set of queries and a set of images (including both relevant and irrelevant images) as training data, we intend to obtain two operators f and g to project images and queries, respectively, into the similarity space. In the ‘similarity space’, we want the relevant images to be close to the corresponding query and the irrelevant ones to be away from it. Note that here both images and queries refer to their feature vectors after feature selection.

Figure 1 explains our basic idea. The triangles represent images relevant to query q1 (dashed circle) but irrelevant to q2 (solid circle). The pentagons represent images relevant to query q2 but irrelevant to q1. After being projected into the ‘similarity space’ using operators f and g, the triangles are close to query q1 but away from q2, while the pentagons are away from q1 but close to q2.

Figure 1. Illustration of the SSP model

Suppose we have n queries $q_1, q_2, \ldots, q_n$. For $q_i$, $x_{ij}^r$ is the feature representation of the jth relevant image, and $x_{ij}^o$ represents the jth irrelevant image. In the ‘similarity space’, we use Euclidean distance to measure the distance between images and queries. For example, as the projection of image $x_{ij}^r$ in the similarity space is $f(x_{ij}^r)$ and the projection of query $q_i$ in the similarity space is $g(q_i)$, the distance between image $x_{ij}^r$ and query $q_i$ is defined as

$$ [f(x_{ij}^r) - g(q_i)]^T [f(x_{ij}^r) - g(q_i)] \tag{4} $$

Obviously, operators f and g are the key to the performance of SSP. In our approach, we obtain them from the training data using the following optimization method. We denote the sum of the distances (in fact, the squared distances) between all the relevant images and query $q_i$ as

$$ D_i^{(relevant)} = \sum_j [f(x_{ij}^r) - g(q_i)]^T [f(x_{ij}^r) - g(q_i)] \tag{5} $$

and the sum of the distances between all the irrelevant images and this query as

$$ D_i^{(irrelevant)} = \sum_j [f(x_{ij}^o) - g(q_i)]^T [f(x_{ij}^o) - g(q_i)] \tag{6} $$

To keep the relevant images close to the query and the irrelevant ones away from it, we should minimize $D_i^{(relevant)}$ and maximize $D_i^{(irrelevant)}$ at the same time. This can be done by minimizing the following cost function:

$$ J_i(f, g) = \frac{D_i^{(relevant)}}{D_i^{(irrelevant)}} \tag{7} $$

Thus, for all the queries in our training set, we have the following objective function:

$$ \min J(f, g) = \min \sum_i J_i(f, g) = \min \sum_i \frac{\sum_j [f(x_{ij}^r) - g(q_i)]^T [f(x_{ij}^r) - g(q_i)]}{\sum_j [f(x_{ij}^o) - g(q_i)]^T [f(x_{ij}^o) - g(q_i)]} \tag{8} $$

Note that f and g can be any operators. For simplicity, we use two linear operators A and B in the following derivations. In this way, the objective function in (8) can be rewritten as

$$ \min J(A, B) = \min \sum_i J_i(A, B) = \min \sum_i \frac{\sum_j (A x_{ij}^r - B q_i)^T (A x_{ij}^r - B q_i)}{\sum_j (A x_{ij}^o - B q_i)^T (A x_{ij}^o - B q_i)} \tag{9} $$

To solve this optimization problem, we adopt the gradient descent method [2]. This method alternates between the determination of the descent directions $\Delta A$ and $\Delta B$ and the selection of the step sizes $t_1$ and $t_2$. The descent directions $\Delta A$ and $\Delta B$ are obtained using equations (10) and (11).

$$ \Delta A = -\frac{\partial J}{\partial A} = -2 \sum_i \frac{\sum_j (A x_{ij}^r - B q_i) [x_{ij}^r]^T}{\sum_j (A x_{ij}^o - B q_i)^T (A x_{ij}^o - B q_i)} + 2 \sum_i \frac{\left[ \sum_j (A x_{ij}^o - B q_i) [x_{ij}^o]^T \right] \sum_j (A x_{ij}^r - B q_i)^T (A x_{ij}^r - B q_i)}{\left[ \sum_j (A x_{ij}^o - B q_i)^T (A x_{ij}^o - B q_i) \right]^2} \tag{10} $$

$$ \Delta B = -\frac{\partial J}{\partial B} = 2 \sum_i \frac{\sum_j (A x_{ij}^r - B q_i) q_i^T}{\sum_j (A x_{ij}^o - B q_i)^T (A x_{ij}^o - B q_i)} - 2 \sum_i \frac{\left[ \sum_j (A x_{ij}^o - B q_i) q_i^T \right] \sum_j (A x_{ij}^r - B q_i)^T (A x_{ij}^r - B q_i)}{\left[ \sum_j (A x_{ij}^o - B q_i)^T (A x_{ij}^o - B q_i) \right]^2} \tag{11} $$

The outline of the algorithm is shown in Table 1. For simplicity, we set the starting points of A and B to pseudo-identity matrices. Suppose A is a k*m matrix ( k ≤ m ), where m and n here denote the dimensions of the image and query feature vectors rather than the numbers of images and queries. Then A=[I O1], in which I is a k*k identity matrix and O1 is a k*(m-k) zero matrix. Similarly, if B is a k*n matrix ( k ≤ n ), then B=[I O2], with I being a k*k identity matrix and O2 a k*(n-k) zero matrix. In our implementation, the following stopping criterion is used:

$$ \|\Delta A\| \le \varepsilon, \quad \|\Delta B\| \le \varepsilon \tag{12} $$

where ε is a small constant. We use ε = 0.001 in our experiments.

Table 1. Gradient descent method for SSP

Given starting points of A and B
Repeat
  1. Calculate the gradient descent direction ∆A
  2. Perform line search* to select a step size t1 > 0
  3. Update A = A + t1 ∗ ∆A
  4. Calculate the gradient descent direction ∆B
  5. Perform line search to select a step size t2 > 0
  6. Update B = B + t2 ∗ ∆B
until the stopping criterion is met.

3.3 Unifying Web Image Search and Image Annotation

As described in Section 3.2, SSP can project both images and queries into the ‘similarity space’ so that we can directly measure their similarity. The projection operators A and B can be obtained using training data collected from the Web. In the Web image search scenario, given a user query, SSP can project it into the ‘similarity space’ using operator B. By calculating the distance between the query and the Web images in the database using equation (4), we can find the images relevant to the query. Similarly, for a new incoming image, we can also find the queries relevant to it. As queries are in fact textual keywords, it is natural to think of using this property of SSP for image annotation. Though Web images usually come with textual information, such information is often found to be incomplete or ambiguous in describing the actual image content. Suppose we have a large number of user queries in our training set, which form a vocabulary including most of the keywords we need for image annotation. Then we can annotate a new image using the relevant queries selected from the query set. In addition, other keywords describing the selected queries can also be included in the list of annotation keywords. Thus, by taking images and user queries as two object peers, SSP unifies Web image search and image annotation into the same framework.

4. IMPLEMENTATION ISSUES

4.1 Query Selection

Many systems tested their algorithms with a small, carefully selected query set. For example, in [10], 6 test queries are used: Birds, Food, Fish, Fruits and Vegetables, Sky, Flowers. In [7], 15 keywords are used for Web image annotation, including tiger, lion, dog, cat, mountain, sunset, etc. Such carefully selected keywords cannot fully represent the semantics of user queries in a real-world Web image search scenario.

According to the statistics in [15], popular user queries come from the following areas: computing, entertainment, games, holidays, shopping, travel, sports. By also referring to the statistics of the top user queries in Google search in recent years [24], we collected 90 user queries. A survey of 8 students was conducted to select queries satisfying the following two conditions. First, the query must be popular among users. Second, users prefer to search the query by images. The first condition is applied to guarantee that the queries we selected are not too specific for general use. The second condition is applied to make sure that we can download enough images for each query.

Finally, 40 queries were selected from 6 categories: pets, holidays, shopping, travel, entertainment, sports. Here are some examples of the selected queries: tiger, cat, pets, rose flower, wedding bouquet, Christmas gift, mini ipod, digital camera, Chinese clothes, oriental clothes, Indian clothes, Great wall, Bali island, Great Pyramid, world wonder, Yaoming, NBA, Michael Jordan, Sushi, Chinese tea and so on.
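As an implementation note on the optimization of Section 3.2, the alternating procedure of Table 1 can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the backtracking (step-halving) line search, the Frobenius norm in the stopping test, the `max_iter` cap, the toy data and all function names are assumptions introduced for illustration, and each image is given a single feature vector rather than one per query.

```python
def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def residual(A, B, x, q):
    """A x - B q: image projection minus query projection."""
    return [a - b for a, b in zip(matvec(A, x), matvec(B, q))]

def cost(A, B, data):
    """J(A, B) of equation (9): sum over queries of D_relevant / D_irrelevant."""
    total = 0.0
    for q, relevant, irrelevant in data:
        num = sum(sum(e * e for e in residual(A, B, x, q)) for x in relevant)
        den = sum(sum(e * e for e in residual(A, B, x, q)) for x in irrelevant)
        total += num / den
    return total

def descent_directions(A, B, data):
    """Negative gradients of J with respect to A and B (equations (10), (11))."""
    dA = [[0.0] * len(A[0]) for _ in A]
    dB = [[0.0] * len(B[0]) for _ in B]

    def add_outer(G, u, v, scale):
        # G += scale * u v^T
        for r, ur in enumerate(u):
            for c, vc in enumerate(v):
                G[r][c] += scale * ur * vc

    for q, relevant, irrelevant in data:
        res_r = [residual(A, B, x, q) for x in relevant]
        res_o = [residual(A, B, x, q) for x in irrelevant]
        N = sum(e * e for r in res_r for e in r)  # D_i(relevant)
        M = sum(e * e for r in res_o for e in r)  # D_i(irrelevant)
        for x, r in zip(relevant, res_r):
            add_outer(dA, r, x, -2.0 / M)
            add_outer(dB, r, q, 2.0 / M)
        for x, r in zip(irrelevant, res_o):
            add_outer(dA, r, x, 2.0 * N / M ** 2)
            add_outer(dB, r, q, -2.0 * N / M ** 2)
    return dA, dB

def step(M, t, D):
    """Return M + t * D without modifying M."""
    return [[m + t * d for m, d in zip(rm, rd)] for rm, rd in zip(M, D)]

def line_search(f, M, D, t=1.0, shrink=0.5, t_min=1e-12):
    """Assumed backtracking rule: halve t until the cost decreases."""
    base = f(M)
    while t > t_min and f(step(M, t, D)) >= base:
        t *= shrink
    return t if t > t_min else 0.0

def pseudo_identity(k, n):
    """[I O]: k*k identity followed by a k*(n-k) zero block."""
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(k)]

def norm(M):
    """Frobenius norm (assumed norm for the stopping criterion (12))."""
    return sum(v * v for row in M for v in row) ** 0.5

def train_ssp(data, k, eps=1e-3, max_iter=200):
    """Alternate updates of A and B as outlined in Table 1."""
    q0, rel0, _ = data[0]
    A = pseudo_identity(k, len(rel0[0]))
    B = pseudo_identity(k, len(q0))
    for _ in range(max_iter):
        dA, _unused = descent_directions(A, B, data)
        A = step(A, line_search(lambda M: cost(M, B, data), A, dA), dA)
        _unused, dB = descent_directions(A, B, data)
        B = step(B, line_search(lambda M: cost(A, M, data), B, dB), dB)
        if norm(dA) <= eps and norm(dB) <= eps:
            break
    return A, B

# Hypothetical toy data: (query feature, relevant images, irrelevant images).
group_1 = [[0.9, 0.1, 0.3], [0.8, 0.0, 0.5]]
group_2 = [[0.1, 0.9, 0.4], [0.0, 1.0, 0.2]]
data = [([1.0, 0.0], group_1, group_2), ([0.0, 1.0], group_2, group_1)]
A0, B0 = pseudo_identity(2, 3), pseudo_identity(2, 2)
A, B = train_ssp(data, k=2)
```

Because the assumed line search only accepts steps that lower the cost, `cost(A, B, data)` after training is never larger than `cost(A0, B0, data)` at the pseudo-identity starting point.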

* Footnote to Table 1: for details of the line search, please refer to Section 2 of Chapter 9 in [2].

4.2 Web Image Collection

Sending each query to the Yahoo search engine, we downloaded images from the top 25 returned pages. A total of 1000 pages and all the images they referred to were downloaded. Images of very small size (with either the width or the height less than 60 pixels) or of very extreme aspect ratio (larger than 4 or smaller than 0.25) were discarded, as they are usually ‘decorative’ images such as borders and bullets. Finally, a total of 3205 images were retained. For each query, there are both relevant and irrelevant images. The number of images obtained is not the same for different queries.

4.3 Query Feature and Image Feature

As aforementioned, existing Web image search systems try to ‘transform’ Web images into ‘textual keywords’ in order to measure their similarity with user queries. In such systems, Web images are represented by their features (such as textual and visual features), while queries do not have any features. In our system, images and queries are taken as two object peers and both have their feature vectors. Below we explain how to extract image and query features.

For each query, besides the query terms in it, we also extract the words appearing in the first 10 pages returned by Yahoo to describe it. Stop words, which do not contribute to the semantics of the query (such as ‘a’, ‘an’, ‘the’, ‘when’, ‘where’), are discarded. Table 2 gives a few examples, with the words describing each query listed in descending order of frequency.

Table 2. Examples of queries and the words describing them

Query | Keywords extracted
Chinese clothes | Chinese, Clothes, Clothing, Dress, Cheongsam, Qipao, Oritental, Jewelry, Silk, Tradition, …
YaoMing | Yao, Ming, NBA, Players, Basketball, YaoMing, Career, ESPN, Match, Seasonal, Sportsline, Chinese, Rockets, Houston, Shanghai, …
Great Pyramid | Pyramid, Great, Khufu, Giza, Egypt, Buildings, Wonders, World, Seven, Ancient, Mystery, …

For each image, we extract the words appearing in its surrounding text, excluding stop words. Table 3 lists a few examples. It is clear that some images are accompanied by rich textual information, while the textual description of others is too incomplete to describe the image content. Then, we formed the global vocabulary $D = \{D_1, \ldots, D_{N_0}\}$ and conducted feature selection for each query and image.

Table 3. Examples of images and the keywords extracted from their surrounding text

Image | Keywords extracted
[image] | Warriors, Row, Reconstructed, Xi'an, Terra-Cotta
[image] | Yaoming, NBA, Leaders, Ranked, Seventh, Field, Goal, Shots, Game
[image] | Sphinx, Overview

Define the target dimension of the feature selection as k. For query $q_i$ (i=1, …, n), k words are considered: the p query terms and the top (k-p) most frequent words, in descending order of frequency. These k words form a set $W_i = \{w_{i1}, w_{i2}, \ldots, w_{ik}\}$. Then we have the k-dimensional feature of this query, $q_i = \{q_{i1}, q_{i2}, \ldots, q_{ik}\}$, with

$$ q_{ij} = \begin{cases} 1/p, & j \le p \\ \dfrac{c_j}{\sum_{l=p+1}^{k} c_l}, & \text{else} \end{cases} \tag{13} $$

where $c_l$ is the number of occurrences of the (l-p)th most frequent word in the top 10 pages returned for $q_i$, and $c_j / \sum_{l=p+1}^{k} c_l$ is the normalized frequency. After this, we construct the feature selection operator $T^{(i)}$ as follows. If the word $w_{il}$ is the jth word in the global vocabulary, that is,

$$ w_{il} = D_j, \quad w_{il} \in W_i, \; D_j \in D, \; 1 \le l \le k, \; 1 \le j \le N_0 \tag{14} $$

then $T_{lj}^{(i)} = 1$. Otherwise, $T_{lj}^{(i)} = 0$.

For each image, we also use the frequency of the terms in the surrounding text as its textual feature. After applying the feature selection operator $T^{(i)}$, we normalize the selected terms using a method similar to that used in (13). Finally, by adding a 64-dimensional visual image feature, we get the full feature representation of the image corresponding to query $q_i$. The visual image feature we use is a combination of the 44-dimensional banded auto-correlogram [20], the 6-dimensional color moment feature in LUV color space, and the 14-dimensional color texture moment [19]. Performing the aforementioned feature extraction and selection for all the 40 queries and the 3205 images, we obtain our training feature set. Then, using the SSP algorithm described in Section 3, we can obtain the two projection operators A and B accordingly.

4.4 Relevance Matrix

A relevance matrix R describing the relevance of image j (j=1,…,m) to query i (i=1,…,n) is obtained manually and used as the ground truth for the performance test. For image j (j=1,…,m), a relevance score $R_{ji}$ with value ‘1’ or ‘0’ is given to describe whether it is relevant or irrelevant to query i (i=1,…,n). Three human raters are asked to do this, and the majority result is used as the final decision. For example, if two out of the three raters consider an image relevant to a query, then the relevance score is ‘1’. Note that whether an image is relevant to a given query is judged by the visual content of the image as well as its surrounding text. For example, it is hard to tell from the picture alone whether a picture of a holiday inn is relevant to the query ‘Bali island’. If its surrounding text explains that the holiday inn is on ‘Bali island’, then it is considered relevant to the query.

5. EXPERIMENTAL RESULTS

In this section, we present the experimental results of SSP for Web image search as well as image annotation.

5.1 Web Image Search

To evaluate the performance of our SSP algorithm for Web image search, we use the ‘Precision of the top 10 retrieved images’ (P@10), which is widely used for the evaluation of retrieval performance [1]:

$$ P@10 = N_r / 10 $$

where $N_r$ represents the number of relevant images among the top 10 images retrieved. To avoid the bias caused by the division into training and testing sets, we adopt the leave-one-out methodology [25]. That is, each query in our query set is used in turn for the performance test, while the rest are used for training. Then, the average performance of SSP over all these runs is used as the final measure of the algorithm.
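The P@10 measure and the leave-one-out protocol of Section 5.1 can be illustrated with a short sketch. It is only an illustration under assumptions: the `train` and `rank` callables stand in for SSP training and retrieval, and the toy relevance sets and rankings are hypothetical.

```python
def precision_at_10(ranked_image_ids, relevant_ids):
    """P@10 = N_r / 10, with N_r the number of relevant images in the top 10."""
    n_r = sum(1 for image_id in ranked_image_ids[:10] if image_id in relevant_ids)
    return n_r / 10

def leave_one_out(queries, relevant, train, rank):
    """Hold out each query in turn, train on the rest, and average P@10."""
    scores = []
    for held_out in queries:
        training_queries = [q for q in queries if q != held_out]
        model = train(training_queries)      # hypothetical training routine
        ranking = rank(model, held_out)      # hypothetical retrieval routine
        scores.append(precision_at_10(ranking, relevant[held_out]))
    return sum(scores) / len(scores)

# Hypothetical toy setup: image ids 1..20, two queries with known relevant sets.
relevant = {"tiger": set(range(1, 11)), "cat": {11, 12, 13, 14, 15}}

def toy_train(training_queries):
    return sorted(training_queries)          # placeholder "model"

def toy_rank(model, query):
    return list(range(1, 21))                # same fixed ranking for every query

average_p10 = leave_one_out(list(relevant), relevant, toy_train, toy_rank)
```

In this toy setup the fixed ranking places ten relevant images in the top 10 for ‘tiger’ and none for ‘cat’, so the averaged P@10 is 0.5.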

For comparison, we also need to test some traditional Web image search algorithms. However, it is not reasonable to compare the performance of our algorithm directly with commercial image search engines such as Yahoo image search, because the dataset we used is much smaller than those used by these search engines, and comparing the P@10 values of two methods is not meaningful when the datasets are quite different. Hence, we implemented a reference algorithm of our own on our experimental dataset. Considering that many existing Web image search engines use textual information to index images, and that traditional IR (information retrieval) techniques such as the vector space model [1][13] are still widely used in this area, we use as the reference algorithm a vector space model that calculates the cosine similarity between images and user queries [1]. For simplicity, we refer to it as ‘Baseline’. In the ‘Baseline’ algorithm, the vocabulary D is the same as that described in Section 4.3. Suppose all the query terms of query i form a word set $W_i^U$ with $L_i$ words in total. The feature of this query, $u_i = \{u_{i1}, \ldots, u_{iN_0}\}$, is calculated as

$$u_{ij} = \begin{cases} 0, & \text{if } D_j \notin W_i^U \\ 1/L_i, & \text{otherwise} \end{cases} \qquad j = 1, \ldots, N_0 \tag{15}$$

That is, if word $D_j$ in the vocabulary is one of the query terms, then the j-th element of the feature vector is non-zero; otherwise, it is set to zero. In a similar way, based on the words contained in the surrounding text of an image, we can obtain the image feature $p_j$, $j = 1, \ldots, m$. The cosine similarity between image $p_j$ and query $q_i$ is defined as

$$\frac{\langle u_i, p_j \rangle}{\sqrt{\langle u_i, u_i \rangle \, \langle p_j, p_j \rangle}} \tag{16}$$

Figure 3(a) shows the UI we designed for Web image search. The user can input the query and select the search method by clicking the radio button corresponding to either ‘SSP’ or ‘Baseline’. The system then displays the retrieved images in ascending order of their distance to the query, i.e., with the closest images first.

For the query ‘tiger’, all 10 images returned by SSP are relevant. With the ‘Baseline’ method, 6 of the 10 retrieved images are relevant to the query, while irrelevant images such as Chinese clothes with a tiger pattern and the computer software icon ‘Mac OS X Tiger’ are also selected. For the query ‘Chinese clothes’, the ‘Baseline’ method returns 5 relevant images, while SSP returns 10 relevant images.
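The ‘Baseline’ scoring of Eqs. (15) and (16) can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names and the toy vocabulary are hypothetical, and the image feature is assumed to be built from the surrounding-text words with the same rule as the query feature.

```python
import math


def term_vector(vocabulary, words):
    """Eq. (15): 1/L for vocabulary words present in `words`, 0 otherwise.

    L is the number of distinct words in `words`.
    """
    present = set(words)
    if not present:
        raise ValueError("at least one word is required")
    weight = 1.0 / len(present)
    return [weight if w in present else 0.0 for w in vocabulary]


def cosine_similarity(u, p):
    """Eq. (16): <u, p> / sqrt(<u, u> <p, p>); returns 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, p))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in p))
    return dot / norm if norm > 0 else 0.0


# Hypothetical toy vocabulary and data.
vocabulary = ["tiger", "cat", "china", "clothes"]
query = term_vector(vocabulary, ["tiger"])
image = term_vector(vocabulary, ["tiger", "cat"])  # surrounding-text words
score = cosine_similarity(query, image)
```

Ranking the images by this score in descending order gives the ‘Baseline’ result list for a query.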

For our dataset with 40 miscellaneous queries, statistics show that SSP provides better search performance than ‘Baseline’ for 23 (57.5%) of the 40 queries. For 10 (25%) of the 40 queries, SSP and ‘Baseline’ have almost the same P@10 value, and for only 7 (17.5%) queries does ‘Baseline’ outperform SSP. Figure 2 compares the performance of SSP and ‘Baseline’ for each of the 40 queries. As the x-axis is the P@10 of the ‘Baseline’ algorithm and the y-axis is that of SSP, the more points there are above the diagonal line x=y, the better SSP performs relative to ‘Baseline’. Note that different queries may have the same P@10 values, hence there are fewer than 40 points in Figure 2. On average, SSP achieves a P@10 of 0.77 and the ‘Baseline’ method an average P@10 of 0.688. Hence, there is a relative performance gain of about 11.9%.
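The reported numbers can be checked with a short sketch. It assumes that P@10 is the fraction of relevant images among the top 10 returned and that the relative gain is measured against the ‘Baseline’ average; the function names are hypothetical.

```python
def precision_at_k(relevance_flags, k=10):
    """Fraction of the top-k results that are relevant (P@k)."""
    return sum(1 for r in relevance_flags[:k] if r) / k


def relative_gain(new_value, reference_value):
    """Relative improvement of new_value over reference_value."""
    return (new_value - reference_value) / reference_value


# 'tiger' with 'Baseline': 6 of the 10 returned images are relevant.
tiger_baseline = precision_at_k([True] * 6 + [False] * 4)

# Average P@10 values reported above: 0.77 (SSP) and 0.688 ('Baseline').
gain = relative_gain(0.77, 0.688)
print(f"relative gain: {gain:.1%}")
```

The gain evaluates to (0.77 − 0.688) / 0.688 ≈ 0.119, which matches the 11.9% stated in the text.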

Figure 2. SSP vs. ‘Baseline’ for all the 40 queries (x-axis: P@10 of ‘Baseline’; y-axis: P@10 of SSP)

As examples, Figure 3 displays the top 10 returned images for the queries ‘tiger’ and ‘Chinese clothes’, using SSP and the ‘Baseline’ method respectively.

Figure 3. Web image search examples: (a) retrieval results for the query “tiger”; (b) retrieval results for the query “Chinese clothes”. Each panel shows the ‘Baseline’ and SSP results.


5.2 Image Annotation

For each image in the dataset, using SSP, we can project it into the ‘similarity space’ and then calculate its distance to each query in the query set. The queries with the smallest distances to the image are returned as ‘selected queries’. As described in Section 4.3, each query is represented not only by the query terms but also by the frequent words extracted from the relevant pages returned by Yahoo. Therefore, we can include all the words describing the ‘selected queries’ for the purpose of image annotation. In this way, for images with incomplete or noisy textual descriptions, we can provide additional keywords that describe their content more accurately and in more detail.
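The annotation step described above can be sketched as follows. The sketch assumes that the image and the queries have already been projected into the ‘similarity space’ (the projection itself is not shown), that distances are Euclidean, and that each query comes with a list of its frequent words; all names and the toy data are hypothetical.

```python
import math


def annotate(image_point, query_points, query_words, num_queries=5, num_words=10):
    """Return the nearest queries and the frequent words of the closest one.

    image_point  -- coordinates of the image in the similarity space
    query_points -- dict: query string -> coordinates in the same space
    query_words  -- dict: query string -> frequent words, most frequent first
    """
    # Rank the queries by Euclidean distance to the image, closest first.
    ranked = sorted(
        query_points,
        key=lambda q: math.dist(image_point, query_points[q]),
    )
    selected = ranked[:num_queries]
    keywords = query_words[selected[0]][:num_words] if selected else []
    return selected, keywords


# Hypothetical 2-D coordinates, for illustration only.
query_points = {"tiger": (0.0, 0.0), "cat": (1.0, 0.0), "Beijing": (5.0, 5.0)}
query_words = {
    "tiger": ["tiger", "wildlife", "stripes"],
    "cat": ["cat", "pets"],
    "Beijing": ["Beijing", "China"],
}
selected, keywords = annotate((0.1, 0.0), query_points, query_words, num_queries=2)
```

Here only the words of the closest query are returned; collecting the words of all selected queries, as the text also allows, is a small variation.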

Table 4 gives a few examples of using SSP for image annotation. The first column of the table displays the image to be annotated. The second column lists the keywords extracted from its surrounding text. In the third column, we list the top 5 ‘selected queries’ followed by the top 10 frequent words describing the first query. Note that, as there are only 2-4 queries relevant to each image in our dataset, we need not look at the tail of the returned query list. For the images in Table 4, all the relevant queries are included in the top 5 queries returned. The relevant queries are highlighted in bold and the irrelevant ones in italics.

To explain the annotation results, we take the last image in Table 4 as an example. This image was downloaded for the query ‘Terra-cotta warrior’. The relevant queries in the dataset are ‘Terra-cotta warrior’, ‘China Travel’ and ‘World Wonder’. The top 5 ‘selected queries’ include all three relevant queries and two other queries, ‘Cat’ and ‘Beijing’, which are irrelevant. The surrounding text of this image includes only two keywords, ‘King’s’ and ‘Horses’, telling us that it is a picture of ‘the king’s horses’. Obviously these two keywords are not informative enough to describe the content of the image. Using SSP, we find more words related to this image, such as ‘Terra’, ‘Cotta’, ‘Warrior’, ‘Qin’, ‘China’, ‘Xian’, ‘Army’ and ‘Museum’. Then, even if we have never heard of the famous ‘Terra-cotta warrior’ in Xian, China, we have no problem relating the image to a museum in China displaying an army of terra-cotta warriors of an ancient emperor.

In the image annotation UI (Figure 4), below the horizontal line across the page, the left side shows the image to be annotated and the middle part lists the keywords extracted from its surrounding text. The right side displays the annotation result, which includes two parts: the selected queries and the annotation keywords. The top 5 ‘selected queries’ are listed right below the title ‘Annotation Result’. Following the 5 ‘selected queries’, we give the top 10 frequent words describing the first selected query.

Figure 4. Image annotation UI

5.3 Discussions

In this section, we first analyze the computational complexity of SSP and then discuss some future work.

1) Computation Complexity. Since the training process in SSP can be done offline, the most critical part of the computational load in real-world use comes from the testing process. For example, in the Web image search scenario, for a new query we have to perform the following steps to find relevant images:

a. Extract the feature of the query;
b. Apply ‘feature selection’ to this query and all the images;
c. Project the query and all the images into the ‘similarity space’;
d. Calculate the distance between the query and all the images to find the relevant images.

The response time hence might be too long for real-world applications. To tackle this problem, one possible solution is as follows. Given a new query, we could first select K (e.g., K=1000) images in whose surrounding text the query terms appear frequently. Then we apply SSP only to these K images and the query. In this way, the computational cost of SSP at query time can be greatly reduced.
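The candidate pre-selection suggested above could look like the following sketch. It assumes that ‘appear frequently’ is measured by the total number of occurrences of the query terms in an image's surrounding text; the scoring rule, the function name and the data layout are assumptions made for illustration.

```python
import heapq
from collections import Counter


def prefilter_images(query_terms, surrounding_texts, K=1000):
    """Return up to K image ids whose surrounding text best matches the query.

    surrounding_texts -- dict: image id -> list of words in its surrounding text
    """
    terms = {t.lower() for t in query_terms}
    scores = {}
    for image_id, words in surrounding_texts.items():
        counts = Counter(w.lower() for w in words)
        score = sum(counts[t] for t in terms)
        if score > 0:  # skip images that never mention the query terms
            scores[image_id] = score
    # Keep the K highest-scoring images; only these are passed on to SSP.
    return heapq.nlargest(K, scores, key=scores.get)


# Hypothetical toy collection.
texts = {
    "img_a": ["tiger", "Tiger", "zoo"],
    "img_b": ["tiger", "cat"],
    "img_c": ["car"],
}
candidates = prefilter_images(["tiger"], texts, K=2)
```

Steps b to d of the testing process then run over at most K images instead of the whole collection.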

Table 4. Examples of image annotation results

| Image | Surrounding text | Annotation results |
|---|---|---|
| Image 1 | Chinese, Clothes, Fan, Cheongsam | Selected queries: Chinese clothes, Chinese silk, Oriental clothes, Chinese tea, Beijing. Keywords: Chinese, Clothes, Oriental, Clothing, Dress, Cheongsam, Qipao, Jewelry, Silk, Tradition… |
| Image 2 | Sphinx, Overview | Selected queries: Great pyramid, Egypt, Great wall, World wonder, Beijing. Keywords: Pyramid, Great, Khufu, Pyramids, Giza, Egypt, Buildings, Wonders, World, Seven… |
| Image 3 | King’s, Horses | Selected queries: Terra-cotta warrior, China Travel, World wonder, Cat, Pets. Keywords: Terra, Cotta, Warriors, Qin, China, Chinese, Xian, Army, Emperor, Museum… |

Figure 4 demonstrates the UI we designed for image annotation using SSP.

2) Future work. The idea we present in this paper is novel, and our initial experimental results on a real-world Web data collection demonstrate the effectiveness of the proposed algorithm. However, the study in this paper is still small in scale and further work needs to be done to improve its performance. We are considering the following ways to improve the performance of SSP.

a. In our future work, a larger set of queries and a larger image set will be used for training to make the results more robust and more practical for real-world use.
b. We will investigate the benefit of applying nonlinear projection operators instead of the present linear operators.
c. We will study how much visual image features contribute to the performance of SSP compared with textual image features.
d. Better feature representation of images and queries is another possible way to improve the performance of SSP. Besides the surrounding text, we will include other textual information, such as the image filename, in the image description. In addition, the query feature can be extended by including the relevant queries obtained through query log analysis.
e. It may also be worth the effort to investigate the effect of different types of query on the performance of the algorithm.

6. CONCLUSIONS

This paper presents SSP (similarity space projection), a novel and unified framework for Web image search and image annotation. SSP projects images and queries into the ‘similarity space’, in which their similarity can be directly measured. This is different from most existing systems, which try to convert images to the ‘query space’ (text). The rule of projection guarantees that in the ‘similarity space’ the relevant images are kept close to the corresponding query and the irrelevant ones are kept away from it. In addition, by taking images and queries as two heterogeneous object peers, this algorithm effectively unifies Web image search and image annotation into the same framework. Our initial experimental results on a real-world Web data collection demonstrate the effectiveness of the proposed algorithm.

It should be noted that this study is still small in scale and that further work is needed to improve the performance of the proposed algorithm.

7. REFERENCES

[1] Baeza-Yates, R., Ribeiro-Neto, B. Modern Information Retrieval. Addison Wesley, 1999.
[2] Boyd, S., and Vandenberghe, L. Convex Optimization, Cambridge Univ. Press, Cambridge, U.K., available at http://www.stanford.edu/~boyd/cvxbook.html, 2003.
[3] Cai, D., He, X.F., Li, Z.W., Ma, W.Y., and Wen, J.R. Hierarchical clustering of WWW image search results using visual, textual and link analysis. ACM Multimedia, Oct 10-16, 2004.
[4] Charles Frankel, Michael J Swain, Vassilis Athitsos. WebSeer: an image search engine for the World Wide Web. Technical Report: TR-96-14, 1996.
[5] Cheng Thao, Ethan V. Munson. A relevance model for Web image search. WDA2003, UK, August 3, 2003.
[6] David Forsyth, David Blei, and Michael I. Jordan. Matching words and pictures. Journal of Machine Learning Research, Vol 3, pp 1107-1135, 2003.
[7] Feng, H.M., Shi, R., Chua, T.S. A bootstrapping framework for annotating and retrieving WWW images. ACM Multimedia, 2004.
[8] Huang J., Kumar S. R., Mitra M., Zhu W. J. and Zabih R. Image indexing using color correlograms. IEEE Conf. on Computer Vision and Pattern Recognition, pp 762-765, 1997.
[9] Kearns, M., and Ron, D. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427-1453, 1999.
[10] Lin, W.H., Jin, R., and Hauptmann, A. Web image retrieval re-ranking with relevance model. WI'03, pp 242-248, 2003.
[11] Lu, Y., Hu, C.H., Zhu, X.Q., Zhang, H-J., and Yang, Q. A unified framework for semantics and feature based relevance feedback in image retrieval systems. ACM Multimedia 2000.
[12] Sclaroff, S., Cascia, M.L., and Sethi, S. Unifying textual and visual cues for content-based image retrieval on the World Wide Web. Computer Vision and Image Understanding, 75(1/2), pp 86-98, 1999.
[13] Singhal, A. Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35-43, 2001.
[14] Smith, J.R., and Chang, S.F. WebSeek: visually searching the Web for content. IEEE Multimedia, 4(3):12-20, 1997.
[15] Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury, David Grossman, Ophir Frieder. Hourly analysis of a very large topically categorized Web query log. SIGIR, pp 321-328, 2004.
[16] Tsymbalenko, Y., and Munson, E.V. Using HTML metadata to find relevant images on the Web. Proc. of Internet Computing 2001, Vol. II, pp 842-848, June 2001.
[17] Yanai, K. Image collector II: an over-one-thousand-image-gathering system. WWW2003, Budapest, Hungary, 2003.
[18] Yanai, K. Web image mining toward generic image recognition. WWW, May 2003.
[19] Yu H., Li M., Zhang H. and Feng J. Color texture moment for content-based image retrieval. ICIP, September, 2002.
[20] Zhang L., Lin F. and Zhang B. A CBIR method based on color-spatial feature. TENCON'99, pp 166-169, 1999.
[21] Altavista image: http://www.altavista.com/images
[22] Ditto: http://ditto.com/
[23] Google image search: http://images.google.com
[24] http://www.google.com/press/zeitgeist.html
[25] Yahoo image search: http://images.yahoo.com
[26] PicSearch: http://www.picsearch.com
