Eye Movement as an Interaction Mechanism for Relevance Feedback in a Content-Based Image Retrieval System

Yun Zhang* 1,2, Hong Fu 2, Zhen Liang 2, Zheru Chi 2, Dagan Feng 2,3

1 School of Computer Science, Northwestern Polytechnical University, Xi'an, Shaanxi, China
2 Centre for Multimedia Signal Processing, Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, China
3 School of Information Technologies, The University of Sydney, Sydney, Australia
Abstract
Relevance feedback (RF) mechanisms are widely adopted in Content-Based Image Retrieval (CBIR) systems to improve image retrieval performance. However, there exist some intrinsic problems: (1) the semantic gap between high-level concepts and low-level features, and (2) the subjectivity of human perception of visual contents. The primary focus of this paper is to evaluate the possibility of inferring the relevance of images from eye movement data. In total, 882 images from 101 categories are viewed by 10 subjects to test the usefulness of implicit RF, where the relevance of each image is known beforehand. A set of fixation-based measures, including fixation duration, fixation count, and the number of revisits, is thoroughly evaluated. Finally, the paper proposes a decision tree to predict the user's input during the image searching tasks. The prediction precision of the decision tree is over 87%, which sheds light on a promising integration of natural eye movement into CBIR systems in the future.

CR Categories: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Relevance feedback, Search process; H.5.2 [Information Interfaces and Presentation]: User Interfaces

Keywords: Eye Tracking, Relevance Feedback (RF), Content-Based Image Retrieval (CBIR), Visual Perception
1 Introduction

Numerous digital images are being produced every day by digital cameras, medical devices, security monitors, and other image capturing apparatus. It has become more and more difficult to retrieve a desired picture even from a photo album on a home computer because of the exponential increase in the number of images. Traditional methods of image retrieval based on metadata, such as textual annotations or user-specified tags, have become the industry standard for retrieval from large image collections. However, manual image annotation is time-consuming, laborious and expensive. Moreover, the subjective nature of human annotation adds another dimension of difficulty in managing image databases.

CBIR is an alternative solution for retrieving images. However, after years of rapid growth since the 1990s [Flickner et al. 1995], the gap between low-level features and the semantic content of images has held back progress, and the field has entered a plateau phase. This gap can be concretely outlined in three aspects: (1) image representation, (2) similarity measure, and (3) user interaction. Most image representations are based on the intuition of researchers and mathematical convenience, rather than on human eye behavior. Do the extracted features reflect humans' understanding of the image's content? There is no clear answer to this question. The similarity measure is highly dependent on the features and structures used in image representation; moreover, developing better distance descriptors and refining similarity measures are also very challenging. User interaction can be a feasible approach to answer this question and to improve image retrieval performance. In the Relevance Feedback (RF) process, the user is asked to refine the search by providing explicit RF, such as selecting Areas-of-Interest (AOIs) from the query image, or ticking positive and negative samples among the retrieved images. In the past few years, many articles have reported that RF can help to establish the association between the low-level features and the semantics of images and to improve the retrieval performance [Liu et al. 2006; Tao et al. 2008].

However, explicit feedback is laborious for the user and limited in complexity. In this paper, we propose eye movement based implicit feedback as a rich and natural source to replace the time-consuming and expensive explicit feedback. As far as we know, there are only a few preliminary studies on using general eye movement features in image retrieval. One is Oyekoya and Stentiford's work [Oyekoya and Stentiford 2004; Oyekoya and Stentiford 2006]. They investigated fixation duration and found that it differs between images with and without a clear AOI. The other work was reported by Klami et al. [Klami et al. 2008]. They proposed a nine-feature vector built from different forms of fixations and saccades and used a classifier to predict one relevant image out of four candidates.

emails: [email protected], [email protected], [email protected], [email protected], [email protected]
Different from the previous work, the study reported in this paper attempts to simulate a more realistic and complex image retrieval situation and to quantitatively analyze the correlation between users' eye behavior and target images (positive images). In our experiments, the images come from a wide variety of web sources, and in each task the query image and the number of positive images vary from stimulus to stimulus. We evaluate the significance of fixation durations, fixation counts, and the number of revisits to provide a systematic interpretation of the user's attention and effort allocation in eye movements, laying a concrete and substantial foundation for involving natural eye movement as a robust RF source [Zhou and Huang 2003].

The rest of the paper is organized as follows. Section 2 introduces the experimental design and settings for the relevance feedback tasks and the corresponding eye movement data collection. In Section 3, we report our investigation on using fixation duration, fixation count and the number of revisits for the prediction of relevant images; ANOVA tests are performed on these factors to reveal their significance and interconnections. Section 4 proposes a decision tree model to predict the user's input during the image searching tasks. Finally, we conclude the paper and propose future work.
2 Design of Experiments

2.1 Task Setup
We study an image searching task which reflects the kinds of activities occurring in a complete CBIR system. In total, 882 images are randomly selected from 101 object categories. The image set is obtained by collecting images through the Google image search engine [Li 2005]. The design and an example of the searching task interface are shown in Fig. 1. On the top left is the query image. Twenty candidate images are arranged in a 4x5 grid. All of the images are from 101 categories such as landscapes, animals, buildings, human faces, and home appliances. The red blocks in Fig. 1(a) denote the locations of positive images in Fig. 1(b) (Class No. 22: Pyramid). The others are negative images, each from a different image class. That is to say, apart from the query image's category, no two images in the grid are from the same category. The candidate images in one searching stimulus are randomly arranged.
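The stimulus construction described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name `make_stimulus`, its argument names, and the representation of images by class IDs are our own assumptions.

```python
import random

def make_stimulus(n_positive, query_class, all_classes, rng=random):
    """Build one 4x5 stimulus as a flat list of 20 class IDs:
    n_positive slots hold the query image's class; the remaining slots
    hold negatives drawn from distinct other classes (no class repeats)."""
    negatives = rng.sample([c for c in all_classes if c != query_class],
                           20 - n_positive)
    grid = [query_class] * n_positive + negatives
    rng.shuffle(grid)  # candidate images are randomly arranged
    return grid
```

For example, `make_stimulus(5, 22, range(1, 102))` yields a grid with the same statistics as Fig. 1(a): five positives from Class No. 22 and fifteen negatives from fifteen distinct classes.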
[Figure 1(a): the 4x5 grid layout, in which five positions hold positive images (Class No. 22) and the remaining fifteen positions hold negative images, each from a different class.]
(b)

Figure 1. Image searching stimulus: (a) the layout of the searching stimulus with 5 positive images; (b) an example. Such a simulated relevance feedback task asks each participant to locate the positive images on each stimulus with the eyes. On locating a positive image, the participant selects the target by fixating on it for a short period of time. A task set is composed of 21 such stimuli, whose numbers of positive images vary from 0 to 20. Thus, a task set contains 21 x 21 = 441 images, and the total numbers of negative and positive images are equal (210 images each).

2.2 Apparatus and Procedure

Eye tracking data are collected by a Tobii X120 eye tracker, whose accuracy is α = 0.5° and drift β = 0.3°. Each candidate image has a resolution of 300 x 300 pixels and thus an image stimulus has 1800 x 1200 pixels. Each stimulus is displayed on the screen at a viewing distance of R = 600 mm; the screen's resolution is 1920 x 1280 pixels and its pixel pitch is h = 0.264 mm. Hence the output uncertainty is R·tan(α + β)/h ≈ 30 pixels, which ensures that the error of the gaze data is no larger than 1% of the area of each candidate image.

Ten participants took part in the study, four females and six males, aged from 20 to 32, all with an academic background. All of them are proficient computer users, and half of them had experience with an eye tracking system. Their vision is either normal or corrected-to-normal. The participants were asked to complete two sets of the above-mentioned image searching tasks, and the gaze data were recorded at a 60 Hz sampling rate. Afterwards, the participants were asked to indicate which images they had chosen as positive images, to ensure the accuracy of the further analysis of their eye movement data. The eye tracker is non-intrusive and allows a 300 x 220 x 300 mm free head movement space. Candidate images and the locations of positive images differ within and between the task sets; in other words, no two images are the same and no two stimuli have the same positive image locations. This reduces memory effects and simulates a natural relevance feedback situation.

3 Analysis of Gaze Data in Image Searching

Raw gaze data are preprocessed by finding fixations with the built-in filter provided by Tobii Technology. The filter maps a series of raw coordinates to a single fixation if the coordinates stay sufficiently long within a circle of a given radius. We used an interval threshold of 150 ms and a radius of 1° of visual angle.

3.1 Fixation Duration and Fixation Count

The main features used in eye-tracking-related information retrieval are fixations and saccades [Jacob and Karn 2003]. Two groups of metrics derived from fixations, fixation duration and fixation count, are thoroughly studied to support the possibility of inferring the relevance of images based on eye movements [Goldberg et al. 2002; Gołofit 2008]. Suppose that FDP(m) and FDN(m) are the fixation durations on the positive and the negative images observed by subject m, respectively, and FCP(m) and FCN(m) are the corresponding fixation counts. Then, in our searching task, FDP(m) and FDN(m) are defined as
FDP(m) = ( Σ_{i,j,k} FD^(m)_{i,j,k} · δ^(m)_{i,j,k} ) / ( Σ_{i,j,k} δ^(m)_{i,j,k} ),
FDN(m) = ( Σ_{i,j,k} FD^(m)_{i,j,k} · (1 − δ^(m)_{i,j,k}) ) / ( Σ_{i,j,k} (1 − δ^(m)_{i,j,k}) ),   (1)

where i = 1, 2, …, 20 denotes the image candidate in each searching stimulus interface; j = 1, 2, …, 21 denotes the stimulus in each searching task (stimulus j contains j − 1 positive images); k = 1, 2 denotes the task set; and m = 1, 2, …, 10 represents the subject. Consequently, FD^(m)_{i,j,k} is the fixation duration on the i-th image candidate of the j-th stimulus of the k-th task set from subject m, and

δ^(m)_{i,j,k} = 1 if subject m regards candidate image i as positive, and 0 if subject m regards candidate image i as negative.

In a similar manner, FCP(m) and FCN(m) are defined as

FCP(m) = ( Σ_{i,j,k} FC^(m)_{i,j,k} · δ^(m)_{i,j,k} ) / ( Σ_{i,j,k} δ^(m)_{i,j,k} ),
FCN(m) = ( Σ_{i,j,k} FC^(m)_{i,j,k} · (1 − δ^(m)_{i,j,k}) ) / ( Σ_{i,j,k} (1 − δ^(m)_{i,j,k}) ),   (2)

where FC^(m)_{i,j,k} is the fixation count on the i-th image candidate of the j-th stimulus of the k-th task set from subject m. The two pairs of fixation-related variables were monitored and recorded during the experiment. The average values and standard deviations of the ten participants are summarized in Table 1.

Table 1 Statistics on the fixation duration (in seconds) and fixation count on positive and negative images

Sub.   FDP(m)          FDN(m)          FCP(m)     FCN(m)
1      1.410 ± 1.081   0.415 ± 0.481   2.5 ± 1.9  1.3 ± 1.3
2      1.332 ± 0.394   0.283 ± 0.247   2.7 ± 1.4  1.2 ± 0.9
3      2.582 ± 1.277   0.418 ± 0.430   5.6 ± 3.3  1.7 ± 1.5
4      0.805 ± 0.414   0.356 ± 0.328   2.4 ± 1.2  1.5 ± 1.2
5      1.154 ± 0.484   0.388 ± 0.284   2.6 ± 1.4  1.5 ± 1.0
6      1.880 ± 0.926   0.402 ± 0.338   3.0 ± 1.9  1.4 ± 1.0
7      0.987 ± 0.397   0.166 ± 0.283   1.7 ± 0.8  0.6 ± 0.7
8      0.704 ± 0.377   0.358 ± 0.254   2.2 ± 1.1  1.3 ± 0.9
9      1.125 ± 0.674   0.329 ± 0.403   3.0 ± 2.0  1.4 ± 1.5
10     1.101 ± 0.444   0.392 ± 0.235   2.7 ± 1.3  1.5 ± 0.8
AVG.   1.308 ± 0.891   0.351 ± 0.345   2.8 ± 2.0  1.3 ± 1.1
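Equations (1) and (2) amount to per-image averages taken separately over the images the subject marked positive and negative. A minimal sketch for one subject follows; the flat dictionary layout keyed by (i, j, k) is our assumption, not the paper's data format.

```python
def fd_fc_stats(fix_durations, labels):
    """fix_durations: dict (i, j, k) -> list of fixation durations (seconds)
    on that candidate image; labels: dict (i, j, k) -> 1 (positive) or
    0 (negative), i.e. the delta of Eq. (1).
    Returns (FDP, FDN, FCP, FCN) for one subject."""
    dur = {0: 0.0, 1: 0.0}   # summed fixation duration per label
    cnt = {0: 0, 1: 0}       # summed fixation count per label
    num = {0: 0, 1: 0}       # number of images per label
    for key, durs in fix_durations.items():
        lab = labels[key]
        dur[lab] += sum(durs)
        cnt[lab] += len(durs)
        num[lab] += 1
    return (dur[1] / num[1], dur[0] / num[0],
            cnt[1] / num[1], cnt[0] / num[0])
```

For instance, two positive images fixated for [0.5, 0.5] s and [1.0] s give FDP = 1.0 s and FCP = 1.5, matching the per-image averaging in Eqs. (1) and (2).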
Analysis of variance (ANOVA) tests are performed to find out whether there are discriminating visual behaviors between the observation of positive and negative images. Given the individual differences in eye movements, we designed two groups of two-way ANOVA among three factors: test subject, fixation duration and fixation count. The results are shown in Table 2.

Table 2 ANOVA test results among three factors: test subject, fixation duration and fixation count

GROUP I
Factor                  Levels                     Test result
(A) Test Subjects       10 levels (10 subjects)    F(9,9) = 1.26, p < 0.37
(B) Fixation Duration   2 levels (FDP & FDN)       F(1,9) = 32.84, p < 0.0003

GROUP II
Factor                  Levels                     Test result
(A) Test Subjects       10 levels (10 subjects)    F(9,9) = 2.03, p < 0.15
(B) Fixation Count      2 levels (FCP & FCN)       F(1,9) = 28.28, p < 0.0005
As illustrated in Table 2, both fixation duration and fixation count show significant differences between positive and negative images during the simulated relevance feedback tasks. Concretely speaking, the average fixation duration on a positive image across all subjects (1.30 seconds) is much longer than that on a negative image (0.35 seconds). Correspondingly, the analysis of fixation count produces a similar result: subjects fixate more often on a positive image (2.8 fixations) than on a negative one (1.3). On the other hand, the variation between subjects has no significant effect in either group (in GROUP I, p = 0.37 > α = 0.05; in GROUP II, p = 0.15 > α = 0.05).
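The GROUP I test can be reproduced from the per-subject means in Table 1 with a two-way ANOVA without replication (factor A: subjects, 9 df; factor B: image relevance, 1 df; error, 9 df). A sketch in numpy, using the FDP/FDN columns of Table 1 as data:

```python
import numpy as np

# Per-subject means from Table 1: columns = (positive, negative)
fd = np.array([
    [1.410, 0.415], [1.332, 0.283], [2.582, 0.418], [0.805, 0.356],
    [1.154, 0.388], [1.880, 0.402], [0.987, 0.166], [0.704, 0.358],
    [1.125, 0.329], [1.101, 0.392],
])

def two_way_anova(x):
    """Two-way ANOVA without replication: factor A = rows (subjects),
    factor B = columns (image relevance). Returns (F_A, F_B)."""
    r, c = x.shape
    grand = x.mean()
    ss_a = c * ((x.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_b = r * ((x.mean(axis=0) - grand) ** 2).sum()   # between relevance levels
    resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0) + grand
    ss_e = (resid ** 2).sum()                          # residual
    ms_a = ss_a / (r - 1)
    ms_b = ss_b / (c - 1)
    ms_e = ss_e / ((r - 1) * (c - 1))
    return ms_a / ms_e, ms_b / ms_e

F_A, F_B = two_way_anova(fd)
# F_A ≈ 1.26 (subjects, not significant), F_B ≈ 32.8 (relevance, significant)
```

Running this on the Table 1 values reproduces the GROUP I statistics of Table 2; applying the same routine to the FCP/FCN columns gives GROUP II.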
3.2 Number of Revisits
A revisit is defined as a re-fixation on an AOI previously fixated. Much human-computer interaction and usability research shows that a re-fixation or revisit on a target may indicate special interest in the target. Therefore, analyzing revisits during the relevance feedback process may reveal the correlation between eye movement patterns and positive image candidates. Figure 2 shows an overview of the visit frequencies (no. of revisits = no. of visits - 1) throughout the whole image searching task. We can see that (1) some of the candidate images are never visited, which indicates the use of pre-attentive vision at the very beginning of the visual search [Salojärvi et al. 2004]; during the pre-attentive process, all the candidate images are examined to decide the successive fixation locations; and (2) in our experiments, revisits happen on both positive and negative images. The majority of images are visited only once, while some are revisited during the image searching.

[Figure 2 (bar chart): number of visits per candidate image, with bins No Visit, 1, 2, 3, 4, 5, >6 times; extracted bar values: 2149, 878, 403, 306, 119, 65, 80.]
Figure 2 The total visit histogram. The X-axis denotes the number of visits and the Y-axis the corresponding number of candidate images.

Table 3 Overall revisits on positive and negative images
A1    1     2     3     4     5     6     >7
A2    549   196   88    55    34    13    27
A3    329   110   31    10    3     2     1
A4    878   306   119   65    37    15    28
A5    63%   64%   74%   85%   92%   87%   100%

A1 = the number of revisits on an image candidate; A2 = revisit counts on positive images; A3 = revisit counts on negative images; A4 = the total number of revisits; A5 = the percentage of the total revisits occurring on positive images.

To compare with Oyekoya and Stentiford's work [2006], we investigate whether the variance of revisit counts has a different effect between positive and negative image candidates over all the participants (as shown in Table 3). When the revisit count is ≥ 3, the result of a one-way ANOVA is significant, with F(1,8) = 5.73, p < 0.044. That is to say, the probability that a revisit falls on a positive image increases with the revisit count. For example, when an image is revisited three or more times, it has a very high probability (over 74%) of being a positive image candidate. As a result, the number of revisits is also a feasible implicit relevance feedback signal to drive an image retrieval engine.
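Row A5 follows directly from rows A2 and A3:

```python
pos = [549, 196, 88, 55, 34, 13, 27]   # A2: revisits on positive images
neg = [329, 110, 31, 10, 3, 2, 1]      # A3: revisits on negative images
pct = [round(100 * p / (p + n)) for p, n in zip(pos, neg)]
# pct -> [63, 64, 74, 85, 92, 87, 96]
# Note: the last bin computes to 27/28 ≈ 96%, whereas Table 3 prints 100%.
```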
4 Feature Extraction and Results
The primary focus of this paper is on evaluating the possibility of inferring the relevance of images based on eye movement data. Features such as fixation duration, fixation count and the number of revisits have shown discriminating power between positive and negative images. Consequently, we composed a simple set of 11 features {f1, f2, …, f11}, an eye movement feature vector, to predict the positive images from each returned 4x5 image candidate set in the simulated relevance feedback task. The features are derived from the quantities listed in Table 4, where FL_i = FD_i / FC_i and i = 1, …, 20 indexes the candidate images.

Table 4 Features used in relevance feedback to predict positive images

Feature   Description
FD_i      Fixation duration on the i-th image inside the 4x5 image candidate set interface
FC_i      Fixation count on the i-th image inside the 4x5 image candidate set interface
FL_i      Fixation length FL_i = FD_i / FC_i on the i-th image inside the 4x5 image candidate set interface
R_i       Number of revisits on the i-th image inside the 4x5 image candidate set interface
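Per candidate image, the Table 4 quantities can be assembled as below. The input layout (a list of fixation durations plus a visit count per image) is our assumption for illustration:

```python
def image_features(fix_durations, n_visits):
    """Return [FD_i, FC_i, FL_i, R_i] for one candidate image.
    fix_durations: fixation durations (seconds) on the image;
    n_visits: number of separate visits, so revisits = visits - 1."""
    FD = sum(fix_durations)            # total fixation duration
    FC = len(fix_durations)            # fixation count
    FL = FD / FC if FC else 0.0        # average fixation length FD/FC
    R = max(n_visits - 1, 0)           # number of revisits
    return [FD, FC, FL, R]
```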
Different from Klami et al.'s work [Klami et al. 2008], we use a decision tree (DT) as a classifier to automatically learn the prediction rules. The data set described in Section 2 is divided into a training set and a testing set to evaluate the prediction accuracy. Two different methods are used to train the DT, as illustrated in Table 5 (prediction precisions are 87.3% and 93.5%, respectively), and an example of predicted positive images from a 4x5 candidate set is shown in Figure 3.

Table 5 Training methods and testing results of decision trees

Method   Training Data Set    Testing Data Set     Prediction Precision
I        {1, 2, …, 5}         {5, 6, …, 10}        87.3%
II       {1, 3, 5, …, 19}     {2, 4, 6, …, 20}     93.5%
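The paper does not give the DT induction details. As a hedged illustration only, even a depth-1 tree (a decision stump over the Table 4 features) captures rules of the form "fixation duration above a threshold implies positive":

```python
def train_stump(X, y):
    """Learn the (feature, threshold) pair whose rule
    'value >= threshold -> positive' best fits the labels:
    a depth-1 decision tree, a stand-in for the full DT of Section 4."""
    n = len(y)
    best = (0.0, 0, 0.0)  # (accuracy, feature index, threshold)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            acc = sum((row[f] >= t) == bool(lab)
                      for row, lab in zip(X, y)) / n
            best = max(best, (acc, f, t))
    return best

# Toy vectors [FD_i, FC_i]: positives get long, repeated fixations
X = [[1.4, 3], [0.3, 1], [1.1, 2], [0.2, 1]]
y = [1, 0, 1, 0]
acc, feat, thr = train_stump(X, y)
```

A real replication would grow a full tree over all 11 features; this sketch only shows the thresholding principle the fixation statistics of Section 3 suggest.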
Figure 3 An example of predicted positive images from a 4x5 candidate set in the simulated relevance feedback task. The query image is "hedgehog", and the DT model returned 8 predicted positive images (in red frames) based on the 11-feature vector with 100% accuracy.

5 Conclusion and Further Work

An eye tracking system can be integrated into a CBIR system as a more efficient input mechanism for implementing the user's relevance feedback process. In this paper, we mainly concentrate on a group of fixation-related measurements which reflect static eye movement patterns. In fact, dynamic characteristics such as saccades and scan paths can also manifest human organizational behavior and decision processes, revealing the pre-attention and cognition processes of a human being while viewing an image. In our further work, we will develop a more comprehensive study which includes both the static and the dynamic features of eye movements. Eye movement is a unity of humans' conscious and unconscious visual cognition behavior, which can be used not only in relevance feedback but also as a new source for image representation. Human image viewing automatically bridges low-level features, such as color, texture, shape, and spatial information, to human attention, such as AOIs. As a result, eye tracking data can be a rich new source for improving image representation [Lei Wu et al. 2009]. Our future work is to develop an eye tracking based CBIR system in which human beings' natural eye movements will be effectively exploited in the modules of image representation, similarity measurement and relevance feedback.

Acknowledgments

The work reported in this paper is substantially supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (Project code: PolyU 5141/07E) and the PolyU Grant (Project code: 1-BBZ9).

References

DACHENG TAO, XIAOOU TANG, AND XUELONG LI. 2008. Which components are important for interactive image searching? IEEE Transactions on Circuits and Systems for Video Technology 18, 3-11.
FLICKNER, M., SAWHNEY, H., NIBLACK, W., ASHLEY, J., HUANG, Q., DOM, B., GORKANI, M., HAFNER, J., LEE, D., PETKOVIC, D., STEELE, D., AND YANKER, P. 1995. Query by image and video content: The QBIC system. Computer 28, 23-32.
GOLDBERG, J.H., STIMSON, M.J., LEWENSTEIN, M., SCOTT, N., AND WICHANSKY, A.M. 2002. Eye tracking in web search tasks: design implications. In ETRA '02: Proceedings of the 2002 Symposium on Eye Tracking Research & Applications, ACM, New York, NY, USA, 51-58.
GOŁOFIT, K. 2008. Click passwords under investigation. In Computer Security - ESORICS 2007, 343-358.
JACOB, R., AND KARN, K. 2003. Eye tracking in human-computer interaction and usability research: Ready to deliver the promises. In The Mind's Eye: Cognitive and Applied Aspects of Eye Movement Research, HYONA, RADACH, AND DEUBEL, Eds. Elsevier Science, Oxford, England.
KLAMI, A., SAUNDERS, C., DE CAMPOS, T.E., AND KASKI, S. 2008. Can relevance of images be inferred from eye movements? In MIR '08: Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, ACM, New York, NY, USA, 134-140.
LEI WU, YANG HU, MINGJING LI, NENGHAI YU, AND XIANSHENG HUA. 2009. Scale-invariant visual language modeling for object categorization. IEEE Transactions on Multimedia 11, 286-294.
LI, F. 2005. Visual Recognition: Computational Models and Human Psychophysics. PhD thesis, California Institute of Technology.
LIU, D., HUA, K., VU, K., AND YU, N. 2006. Fast query point movement techniques with relevance feedback for content-based image retrieval. In Advances in Database Technology - EDBT 2006, 700-717.
OYEKOYA, O., AND STENTIFORD, F. 2004. Exploring human eye behaviour using a model of visual attention. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), Volume 4, IEEE Computer Society, Washington, DC, USA, 945-948.
OYEKOYA, O., AND STENTIFORD, F. 2006. Perceptual image retrieval using eye movements. In Advances in Machine Vision, Image Processing, and Pattern Analysis, 281-289.
SALOJÄRVI, J., PUOLAMÄKI, K., AND KASKI, S. 2004. Relevance feedback from eye movements for proactive information retrieval. In Workshop on Processing Sensory Information for Proactive Systems (PSIPS 2004), 14-15.
ZHOU, X.S., AND HUANG, T.S. 2003. Relevance feedback in image retrieval: A comprehensive review. Multimedia Systems 8, 536-544.