Content-based Image Retrieval Using a Combination of Visual Features and Eye Tracking Data

Zhen Liang1*, Hong Fu1, Yun Zhang1, Zheru Chi1, Dagan Feng1,2
1 Centre for Multimedia Signal Processing, Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, China
2 School of Information Technologies, The University of Sydney, Sydney, Australia

Abstract

Image retrieval technology has been developed for more than twenty years. However, current image retrieval techniques still cannot achieve satisfactory recall and precision. To improve the effectiveness and efficiency of image retrieval, a novel content-based image retrieval method combining image segmentation and eye tracking data is proposed in this paper. In the method, eye tracking data are collected by a non-intrusive table-mounted eye tracker at a sampling rate of 120 Hz, and the corresponding fixation data are used to locate the human’s Regions of Interest (hROIs) on the segmentation result produced by the JSEG algorithm. The hROIs are treated as important informative segments/objects and used in image matching. In addition, the relative gaze duration of each hROI is used to weight the similarity measure for image retrieval. The similarity measure proposed in this paper is based on a retrieval strategy that emphasizes the most important regions. Experiments on 7,346 manually annotated Hemera color images show that the retrieval results of our proposed approach compare favorably with conventional content-based image retrieval methods, especially when the important regions are difficult to locate based on visual features alone.

CR Categories: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Relevance feedback, Search process; H.5.2 [Information Interfaces and Representation]: User Interfaces

Keywords: eye tracking, content-based image retrieval (CBIR), visual perception, similarity measure, fixation

1 Introduction

Due to the exponential growth of digital images on a daily basis, content-based image retrieval (CBIR) has been a very active research topic since the early 1990s. In CBIR, image contents are characterized in order to search for images similar to a query image. Usually, low-level features such as colors, shapes and textures are used to form a feature vector representing an image, and the similarity between two images is computed from the distance between the corresponding feature vectors. However, low-level features are quite often not sufficient to describe an image, and the gap between low-level features and high-level semantic concepts has become a major difficulty that hinders the further development of CBIR systems [Smeulders et al. 2002].

As an attempt to reduce the semantic gap and improve retrieval performance, Region-Based Image Retrieval (RBIR) techniques have been proposed. In an RBIR system, local information is extracted from the whole image and used for image retrieval. The basic rationale is that a person searching for similar images is normally interested in visual objects/segments, and features extracted from the whole image may not properly represent the characteristics of those objects. The standard process of an RBIR system includes three steps: (1) segmenting an image into a set of regions; (2) extracting features from the segmented regions, known as “local features”; and (3) measuring the similarity between the query image and every candidate image in terms of the local features. Many recent algorithms focus on improving the efficiency and effectiveness of image segmentation, feature extraction and similarity measurement in RBIR systems [Tsai et al. 2003; Lv et al. 2004; Marques et al. 2006; Wang et al. 2006]. On the other hand, the segmentation process may fail to produce objects that are not salient in terms of their visual features, even though these objects carry important semantic information. One influential RBIR approach overcomes this problem by integrating into the system a process of manually selecting important regions and indicating feature importance [Carson et al. 1999]. However, this places a heavy burden on users and is far from convenient.

In 2003, Parkhurst and Niebur pointed out that eye movements under natural viewing conditions are determined by a Selective Visual Attention Model (SVAM) [Parkhurst and Niebur 2003]. The SVAM is composed of two stages: a bottom-up procedure driven by low-level features and a top-down process guided by a high-level understanding of the image. For integrating top-down processing into an RBIR system, eye tracking techniques provide a more natural, convenient and imperceptible way to understand the user’s intention than asking him/her to manually select the ROIs. It has been found that longer and more frequent fixations land on the objects in a scene [De Graef et al. 1990; Henderson and Hollingworth 1998]. Therefore, the relative gaze duration can be utilized to improve retrieval performance by weighting the corresponding hROI.

In this paper, a model using a combination of visual features and eye tracking data is proposed to reduce the semantic gap and improve retrieval performance. The flowchart of our proposed model is shown in Figure 1. After eye tracking data are collected by a non-intrusive table-mounted eye tracker and the image is segmented by the JSEG algorithm, the fixation data are extracted and used to locate the hROIs on the segmented image. The relative gaze duration in each hROI is also computed to weight its importance. The selected hROIs are treated as important informative segments/objects and used in image retrieval.

The rest of this paper is organized as follows. Eye tracking data acquisition is described in Section 2. In Section 3, the JSEG segmentation algorithm is explained. Section 4 then discusses how to construct an image retrieval model with eye tracking data in terms of region selection, feature extraction, weight calculation and similarity measurement. In Section 5, experimental results are reported, comparing our new approach with conventional image retrieval methods. Finally, conclusions are drawn and future work is outlined in Section 6.

Figure 1 The flowchart of our proposed model.

Figure 2 Representative images (a) with the corresponding segmentation results (b) and eye tracking data acquisition (c).

2 Eye Tracking Data Acquisition

A non-intrusive table-mounted eye tracker, the Tobii X120, was used to collect eye movement data in a user-friendly environment with high accuracy (0.5 degrees) at a 120 Hz sampling rate. The freedom of head movement is 30 x 22 x 30 cm. Before each data collection session, a calibration was conducted with a grid of nine calibration points to minimize eye tracking errors. Fixations (location and duration) were extracted from the raw eye tracking data using a fixation radius criterion of 35 pixels and a minimum fixation duration of 100 ms. Each image in the 7,346-image Hemera color image database was viewed by a participant with normal vision for up to 5 seconds. The viewing distance was approximately 68 cm from the display, which has a resolution of 1920 x 1200 pixels; the corresponding subtended visual angle is about 41.5º x 26.8º.
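For illustration, the sketch below groups raw gaze samples into fixations with a simple dispersion rule using the 35-pixel radius and 100 ms minimum duration mentioned above. It is only a minimal example of this kind of fixation filter, not the filter implemented by the Tobii software; the function name, array layout and centroid-based grouping are assumptions made for this sketch.

```python
import numpy as np

def extract_fixations(gaze_xy, timestamps, radius=35.0, min_duration=0.1):
    """Group raw gaze samples into fixations with a simple dispersion rule.

    gaze_xy    : (N, 2) array of gaze positions in pixels.
    timestamps : (N,) array of sample times in seconds (120 Hz -> ~8.3 ms apart).
    A fixation is a run of consecutive samples staying within `radius` pixels
    of the running centroid for at least `min_duration` seconds.
    """
    fixations = []  # list of (x, y, duration) tuples
    start = 0
    for i in range(1, len(gaze_xy) + 1):
        window = gaze_xy[start:i]
        centroid = window.mean(axis=0)
        # Close the current window when the next sample drifts out of the
        # radius, or when we run out of samples.
        if i == len(gaze_xy) or np.linalg.norm(gaze_xy[i] - centroid) > radius:
            duration = timestamps[i - 1] - timestamps[start]
            if duration >= min_duration:
                fixations.append((centroid[0], centroid[1], duration))
            start = i
    return fixations
```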

3 Image Segmentation

A state-of-the-art segmentation algorithm, JSEG [Deng and Manjunath 2001], is used in this paper to segment images into regions based on low-level features (color and texture). This image segmentation step is similar to human bottom-up processing, which helps locate objects and boundaries in an image retrieval system. In our experiment, images are downsized to a maximum width/length of 100 pixels with a fixed aspect ratio before segmentation, which reduces the computational complexity and increases the retrieval efficiency. Figure 2(b) gives some segmentation results.
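As a small sketch of the pre-segmentation downsizing step, the following Pillow-based function resizes an image so that its longer side does not exceed the maximum size while keeping the aspect ratio fixed; the use of Pillow and of Lanczos resampling are assumptions of this example, not details taken from the paper.

```python
from PIL import Image

def downsize_for_segmentation(path, max_side=100):
    """Downsize an image so its longer side is at most `max_side` pixels,
    preserving the aspect ratio, before it is passed to JSEG."""
    img = Image.open(path).convert("RGB")
    scale = max_side / max(img.size)
    if scale < 1.0:  # only shrink, never enlarge
        new_size = (round(img.width * scale), round(img.height * scale))
        img = img.resize(new_size, Image.LANCZOS)
    return img
```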

4 Image Retrieval Model

A novel image retrieval model using a combination of image segmentation results and eye tracking data is proposed in this section. The aim of the image segmentation step is to simulate bottom-up processing, which coarsely interprets and parses images into coherent segments based on low-level features [Spekreijse 2000]. Rutishauser et al. [2004] demonstrated that bottom-up attention is partially useful for object recognition, but it is not sufficient on its own. In the second stage of human visual attention, top-down processing, one or a few objects are selected from the whole image for a more thorough analysis [Spekreijse 2000]. The selection procedure is guided not only by elementary features but also by human understanding. Thus, if an image retrieval strategy can incorporate both bottom-up and top-down mechanisms, its accuracy and efficiency can be largely improved. Eye tracking provides an important signal for understanding which regions the user is concerned with, or which objects in the image are the targets the user wants to search for. Fixations are among the most important features in eye tracking data; they tell us where the subject’s attention points are and how long the subject fixates on each attention point. Thus, in our proposed model, the fixation data are used to define the hROIs on the segmented image, and the relative gaze duration in each hROI is treated as the corresponding region significance.

4.1 Selection Process of Human’s ROI

Here, we use eye tracking data to locate the observer’s regions of interest in an image, and an importance value for each segmented region is defined as the relative gaze duration on that region. Some example images with their eye tracking data are shown in Figure 2(c). Suppose that an image I is composed of N segmented regions (Eq. (1)) and that the total gaze duration on the image I is D:

$I = \{S_1, \ldots, S_i, \ldots, S_N\},$   (1)

where $S_i$ is the $i$-th segmented region. A concept of importance value is introduced to indicate the degree of the observer’s interest in a region. Assuming that the relative gaze duration on the segmented region $S_i$ is $d_i$, the corresponding importance value $C_i$ can be calculated by Eq. (2); the value is 0 if there is no fixation on the region:

$C_i = \frac{d_i}{D}, \quad \text{and} \quad \sum_{i=1}^{N} C_i = 1,$   (2)

where $D = \sum_{i=1}^{N} d_i$.

As shown in Figure 3(a), the popping-out process of human ROIs consists of the following steps.

Step 1: Scale the eye tracking data to the segmented map size.
Step 2: Determine whether a segmented region is an hROI or not.
Step 3: If all segmented regions in the image have been processed, terminate the procedure; otherwise, go to Step 2.

Figure 3(b) gives some examples of the selection results in terms of weighting maps: the higher the importance value, the redder the region in the map. In the next image retrieval step, the selected hROIs are treated as important informative segments/objects, and the corresponding importance values are used as the region significance to weight the similarity measure.
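To make the selection process concrete, the sketch below maps fixations onto a segmentation label map and computes the importance value of Eq. (2) for every region. Treating regions with nonzero gaze duration as hROIs, as well as the array shapes and the scaling argument, are assumptions of this illustration rather than details from the paper.

```python
import numpy as np

def region_importance(seg_map, fixations, scale=1.0):
    """Compute the importance value C_i (Eq. (2)) for each segmented region.

    seg_map   : (H, W) integer label map from JSEG, labels 0 .. N-1.
    fixations : list of (x, y, duration) fixations in screen coordinates.
    scale     : factor mapping screen coordinates to seg_map coordinates
                (Step 1 of the selection process).
    Returns a dict {region_label: C_i}; regions without fixations get 0.
    """
    num_regions = int(seg_map.max()) + 1
    durations = np.zeros(num_regions)            # d_i per region
    for x, y, dur in fixations:
        row, col = int(round(y * scale)), int(round(x * scale))
        if 0 <= row < seg_map.shape[0] and 0 <= col < seg_map.shape[1]:
            durations[seg_map[row, col]] += dur  # accumulate gaze duration d_i
    total = durations.sum()                      # D = sum of all d_i
    if total == 0:
        return {i: 0.0 for i in range(num_regions)}
    return {i: durations[i] / total for i in range(num_regions)}  # C_i = d_i / D
```

Regions with a nonzero importance value then form the hROI set used in the matching stage.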

Figure 3 Selection process of hROIs (a) and weighting maps (b) (the value in the weighting map is the corresponding importance value of the region).

4.2 Feature Extraction

Color and texture properties are extracted from the selected hROIs for the similarity measure. For the color property, the HSV color space is used because it approximates human perception [Paschos 2001]. For the texture property, the Sobel operator is used to produce an edge map from the gray-scale image. A feature set consisting of an 11 x 11 x 11 HSV color histogram and a 1 x 41 texture histogram of the edge map is used to characterize each region.
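A minimal sketch of this region descriptor, assuming an OpenCV/NumPy pipeline, is given below; the bin ranges, the magnitude binning of the Sobel edge map, and the final normalization are illustrative choices, since the paper does not spell them out.

```python
import cv2
import numpy as np

def region_features(image_bgr, mask):
    """Build the region descriptor: an 11x11x11 HSV histogram plus a
    41-bin histogram of Sobel edge magnitudes, both restricted to `mask`.

    image_bgr : (H, W, 3) uint8 image.
    mask      : (H, W) uint8 mask, nonzero inside the hROI.
    """
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    color_hist = cv2.calcHist([hsv], [0, 1, 2], mask,
                              [11, 11, 11],
                              [0, 180, 0, 256, 0, 256]).flatten()

    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    edge_hist, _ = np.histogram(magnitude[mask > 0], bins=41,
                                range=(0, magnitude.max() + 1e-6))

    feat = np.concatenate([color_hist, edge_hist.astype(np.float32)])
    # Normalize so that regions of different sizes are comparable.
    return feat / (feat.sum() + 1e-6)
```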

4.3 Similarity Measure

An image is represented by several regions in the image retrieval system. Suppose that there are m ROIs from the query image, $I_q = \{R_{q1}, \ldots, R_{qm}\}$, and n ROIs from a candidate image, $I_c = \{R_{c1}, \ldots, R_{cn}\}$. As discussed in Section 4.1, the corresponding region weight vectors of the query image and the candidate image are $W_q = \{C_{q1}, \ldots, C_{qm}\}$ and $W_c = \{C_{c1}, \ldots, C_{cn}\}$, respectively. The similarity matrix among the regions of the two images is defined as

$S = [s_{ij}] = [E(R_{qi}, R_{cj})], \quad i = 1, \ldots, m; \; j = 1, \ldots, n,$   (3)

where $E(R_{qi}, R_{cj})$ is the Euclidean distance between the feature vectors of regions $R_{qi}$ and $R_{cj}$. The weight matrix, which indicates the importance of the corresponding region similarity in the similarity matrix, is defined as

$W = [w_{ij}] = [C_{qi} C_{cj}], \quad i = 1, \ldots, m; \; j = 1, \ldots, n.$   (4)

To find the most similar region in the candidate image, a matching matrix is defined as

$M = [m_{ij}], \quad i = 1, \ldots, m; \; j = 1, \ldots, n,$   (5)

where

$m_{ij} = \begin{cases} 1 & \text{if } j = j^{*} \text{ and } j^{*} = \arg\min_{j} s_{ij}, \\ 0 & \text{otherwise,} \end{cases} \quad i = 1, \ldots, m.$   (6)

In the matching matrix, only one element in each row is 1 and the others are 0. The value 1 indicates that the corresponding $s_{ij}$ in the similarity matrix is the minimum in that row. Thus, the distance between two images in the proposed image retrieval model is defined as

$s(I_q, I_c) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} m_{ij} w_{ij} s_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} m_{ij} w_{ij}}.$   (7)

When the query image and a candidate image are identical, the distance in Eq. (7) is zero. Thus, for a query image, a smaller distance indicates that there are more matched regions in the candidate image; in other words, the corresponding image is more relevant to the query.
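The sketch below combines Eqs. (3)-(7) into a single distance computation between two images, assuming each hROI is described by the feature vector of Section 4.2 and weighted by its importance value; the function and variable names are illustrative.

```python
import numpy as np

def image_distance(query_feats, query_weights, cand_feats, cand_weights):
    """Weighted region-matching distance between two images (Eqs. (3)-(7)).

    query_feats  : (m, d) feature vectors of the query hROIs.
    query_weights: (m,) importance values C_qi.
    cand_feats   : (n, d) feature vectors of the candidate hROIs.
    cand_weights : (n,) importance values C_cj.
    """
    # Eq. (3): pairwise Euclidean distances s_ij between region features.
    diff = query_feats[:, None, :] - cand_feats[None, :, :]
    s = np.linalg.norm(diff, axis=2)                      # (m, n)

    # Eq. (4): weight matrix w_ij = C_qi * C_cj.
    w = np.outer(query_weights, cand_weights)             # (m, n)

    # Eqs. (5)-(6): matching matrix, one best-matching candidate region per row.
    m_mat = np.zeros_like(s)
    m_mat[np.arange(s.shape[0]), s.argmin(axis=1)] = 1.0

    # Eq. (7): weighted average of the matched region distances.
    denom = (m_mat * w).sum()
    return (m_mat * w * s).sum() / denom if denom > 0 else np.inf
```

Candidate images can then be ranked by this distance in ascending order.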

5 Experimental Results and Discussion

5.1 Image Database and Evaluation Criterion

The retrieval experiments are conducted on 7,346 Hemera color images manually annotated with keywords. Figure 4 shows example images from a few categories. The evaluation criterion applied here for image retrieval performance is not simply labeling returned images as “relevant” or “irrelevant”, but is based on the ratio of matched keywords between the query image and a returned database image. Suppose that the query image and a retrieved image have M and N keywords, respectively, with P matched keywords; then the semantic similarity is defined as

$s(\text{query image}, \text{retrieved image}) = \frac{P}{(M+N)/2}.$   (8)
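For completeness, Eq. (8) can be computed directly from the two keyword sets, as in the small sketch below; representing annotations as Python sets is an assumption of this example.

```python
def semantic_similarity(query_keywords, retrieved_keywords):
    """Eq. (8): matched keywords divided by the average number of keywords."""
    q, r = set(query_keywords), set(retrieved_keywords)
    matched = len(q & r)                       # P
    m, n = len(q), len(r)                      # M, N
    return matched / ((m + n) / 2) if (m + n) > 0 else 0.0

# Example: 2 matched keywords out of 3 and 4 keywords -> similarity 4/7.
print(semantic_similarity({"cow", "grass", "field"},
                          {"cow", "grass", "sky", "animal"}))
```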

Figure 4 Example images in the image database. (a) Asian architecture; (b) Close-up; (c) People; (d) Landmarks.

5.2 Performance Evaluation of Image Retrieval

The performance of our proposed image retrieval model on different types of query images (Figure 4) is shown in Figure 5, compared with the following three methods: 1) Global based: retrieval based on the Euclidean distance of the global color and texture histograms of two images; 2) Attention based: the attention-driven image retrieval strategy proposed in [Fu et al. 2006]; 3) Attention object1 based: retrieval using only the first popped-out object in the attention-driven image retrieval strategy [Fu et al. 2006]. The fixation-based image retrieval system is the one proposed in this paper. Figure 5 shows the retrieval results of the above four methods for the different image classes. We can see that our fixation-based method is significantly better than the other methods for the “Asian Architecture” and “People” image classes in terms of average semantic similarity. For the other two image classes, “Close-up” and “Landmarks”, our method is better when the number of returned images is not large (20 or fewer), suggesting that our method achieves a better effectiveness.

Figure 5 Average semantic similarity vs. the number of images returned for the four themes of images shown in Figure 4: (a)-(d).

5.3 Discussion

Our proposed model achieves better retrieval performance than the other three image retrieval methods when the objects are hidden in the background (i.e., the low-level features of the objects are not conspicuous) or when there are multiple objects in the image. For example, for the image shown in Figure 6 (left), “man working out in the gym”, Fu et al.’s model places a higher importance value on the white floor and treats the other parts as background, while in our model the man and the woman in the corner are considered the two most important hROIs. A comparison of the hROIs selected on this example image by the fixation-based and attention-based approaches is shown in Figure 6 (right). On the other hand, global-based image retrieval mixes all the information together and cannot distinguish objects of different significance within the image, especially when the objects are hidden in the background. In our method, the important information in the image can be extracted and properly ranked based on the human visual attention process. For example, for the image with a cow against a grass background (Figure 2), the grass occupies a much larger area than the cow. As a result, global-based image retrieval tends to retrieve images that also contain green objects and/or a green background. In contrast, our method identifies the cow as the most important object in the image, and the retrieval performance is accordingly much improved.

Figure 6 Fixation-based vs. attention-based selection, where the value below the left image is the corresponding importance value.

6 Conclusion

In this paper, we report our study on imitating the human visual attention process for CBIR by combining image segmentation and eye tracking techniques. The JSEG algorithm is used to parse an image into homogeneous sub-regions, and eye tracking data are utilized to locate the hROIs on the segmented image. In the similarity measurement step, each hROI is weighted by the fixation duration it receives, used as an importance value, to emphasize the most important regions. Retrieval results on 7,346 Hemera color images show that our proposed approach compares favorably with conventional CBIR methods, especially when the important regions are difficult to locate based on the visual features of an image. Future work includes collecting eye tracking data during the relevance feedback process and refining both the feature extraction and the weight computation.

Acknowledgements

This work is supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (Project code: PolyU 5141/07E) and a PolyU Grant (Project code: 1-BBZ9).

References

CARSON, C., THOMAS, M., BELONGIE, S., HELLERSTEIN, J. M., AND MALIK, J. 1999. Blobworld: A System for Region-Based Image Indexing and Retrieval. In Proc. Visual Information Systems, 509-516.

DENG, Y., AND MANJUNATH, B. 2001. Unsupervised Segmentation of Color-Texture Regions in Images and Videos. IEEE Trans. Pattern Anal. Mach. Intell., Vol. 23(8), 800-810.

FU, H., CHI, Z., AND FENG, D. 2006. Attention-Driven Image Interpretation with Application to Image Retrieval. Pattern Recognition, Vol. 39(9), 1604-1621.

DE GRAEF, P., CHRISTIAENS, D., AND D’YDEWALLE, G. 1990. Perceptual Effects of Scene Context on Object Recognition. Psychological Research, Vol. 52, 317-329.

HENDERSON, J. M., AND HOLLINGWORTH, A. 1998. Eye Movements During Scene Viewing: An Overview. In Eye Guidance While Reading and While Watching Dynamic Scenes, Underwood, G. (Ed.). Elsevier Science, Amsterdam, 269-293.

LV, Q., CHARIKAR, M., AND LI, K. 2004. Image Similarity Search with Compact Data Structures. In Proceedings of the ACM International Conference on Information and Knowledge Management, 208-217.

MARQUES, O., MAYRON, L., BORBA, G., AND GAMBA, H. 2006. Using Visual Attention to Extract Regions of Interest in the Context of Image Retrieval. In Proceedings of the ACM Annual Southeast Regional Conference, 638-643.

PARKHURST, D. J., AND NIEBUR, E. 2003. Scene Content Selected by Active Vision. Spatial Vision, Vol. 16(2), 125-154.

PASCHOS, G. 2001. Perceptually Uniform Color Spaces for Color Texture Analysis: An Empirical Evaluation. IEEE Trans. Image Process., Vol. 10(6), 932-937.


RUTISHAUSER, U., WALTHER, D., KOCH, C., AND PERONA, P. 2004. Is Bottom-Up Attention Useful for Object Recognition? In Proceedings of CVPR 2004, Vol. 2, 37-44.

SMEULDERS, A. W. M., WORRING, M., AND SANTINI, S. 2002. Content-Based Image Retrieval at the End of the Early Years. IEEE Trans. Pattern Anal. Mach. Intell., Vol. 22(12), 1349-1380.

SPEKREIJSE, H. 2000. Pre-attentive and Attentive Mechanisms in Vision. Perceptual Organization and Dysfunction. Vision Research, Vol. 40, 1179-1638.

TSAI, C. F., MCGARRY, K., AND TAIT, J. 2003. Image Classification Using Hybrid Neural Network. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 431-432.

WANG, X. Y., HU, F. L., AND YANG, H. Y. 2006. A Novel Regions-of-Interest Based Image Retrieval Using Multiple Features. In Proceedings of the Multi-Media Modeling International Conference, Vol. 1, 377-380.
