Clustering and semantically filtering web images to create a large-scale image ontology S. Zinger 1, C. Millet, B. Mathieu, G. Grefenstette, P. Hède, P.-A. Moëllic Commissariat à l'Energie Atomique-LIST/Atomic Energy Agency of France-LIST LIC2M (Multilingual Multimedia Knowledge Engineering Laboratory), 18 Route du Panorama 92265, Fontenay aux Roses, France
[email protected], {milletc, mathieub, grefenstetteg, hedep, moellicp}@zoe.cea.fr

ABSTRACT

In our effort to help close the "semantic gap" between images and their semantic description, we are building a large-scale ontology of images of objects. This visual catalog will contain a large number of object images, structured in a hierarchical catalog, allowing image processing researchers to derive signatures for wide classes of objects. We are building this ontology using images found on the web. In this article we describe our approach for finding coherent sets of object images. We first perform two semantic filtering steps: the first involves deciding which words correspond to objects and using these words to query databases that index text found associated with an image (e.g. Google Image search) to obtain a set of candidate images; the second involves using face detection technology to remove images of people from the candidate set (we have found that requests for objects often return images of people). After these two steps, we have a cleaner set of candidate images for each object. We then index and cluster the remaining images using our system VIKA (VIsual KAtaloguer) to find coherent sets of objects.

Keywords: web image retrieval, image indexing, clustering, image ontology, semantics.
1. INTRODUCTION

The constantly increasing amount of information on the Internet requires the development of effective image retrieval strategies. While text processing methods have been successfully applied in information search engines, much work remains to be done in the area of web image retrieval [2]. The goal is to locate the images most relevant to a query submitted to a search engine. Currently, most work considers image retrieval based on the text surrounding images on web pages; textual and link information is used in this framework. There are many strategies for analyzing the text around images and for segmenting web pages into blocks in order to better locate the information [1]. The images are then clustered using the textual and link data, and low-level image features are applied to reorganize the clustering results [1]. In our work, we are mostly interested in using low-level features to index web image search results. Our motivation comes from the fact that existing image search engines use text processing for image retrieval, and the results clearly leave room for improvement [2]. The images on the Internet can be divided into large semantic groups [4]. In this article we consider one semantic feature: the presence of human faces. Our choice is based on an observation about image search engines: when the query is the name of an object, the resulting images often include photographs of people. Excluding these images provides better input for clustering and therefore better image retrieval results. Our approach is designed to improve the image search results of online engines and thereby contribute to web-based data mining.
1 This work was carried out during the tenure of a MUSCLE Internal fellowship.
2. IMAGE INDEXING AND CLUSTERING

The algorithm used for indexing is inspired by the approach based on border/interior pixel classification [9]. The idea of that article is to build two histograms for an image: one takes into account only border pixels, the other only interior ones. The first step of the algorithm is therefore to classify pixels as interior or border. This indexing algorithm is fast and simple, and it provides information not only on the colors of an image but also on the sizes of its constant-color areas. The indexing we use is designed for broad image domains, an important property because web images are very diverse. The indexing method described above is implemented in PIRIA (Program for the Indexing and Research of Images by Affinity) [6]. The indexing produces a vector of 128 elements for each image. We use the Riemann distance as a similarity measure between images and compute an array containing the distances between all the considered images. The next step is to cluster the retrieved images in order to find prototype images to illustrate the query words. For this task, we use a k-SNN (Shared Nearest Neighbor) clustering algorithm, based on ideas from [5]; a complete description can be found in [3]. For each image, the algorithm considers only the k most similar neighbor images. Its main idea is that the more neighbors two images have in common, the more similar they are. Images that are most similar to their neighbors are considered topic images. Topic images are used to create clusters, and images that are strongly linked to topic images are aggregated into those clusters. The remaining images stay unclustered; their number depends on the parameters used. In our application, the image collection retrieved from the Internet is noisy.
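The border/interior classification at the start of this section can be sketched as follows. This is a minimal illustration, not PIRIA's implementation; the 4x4x4 color quantization (giving 64 border + 64 interior bins, consistent with the 128-element vector mentioned above) is our assumption.

```python
import numpy as np

def bic_signature(image, bins=4):
    """Sketch of a Border/Interior pixel Classification (BIC) signature.
    `image` is an (H, W, 3) uint8 RGB array. Colors are quantized to
    bins^3 values; a pixel is interior when all its 4-neighbors share
    its quantized color, border otherwise (image-edge pixels count as
    border). The two histograms are concatenated into one vector."""
    q = (image // (256 // bins)).astype(np.int32)
    labels = q[..., 0] * bins * bins + q[..., 1] * bins + q[..., 2]
    h, w = labels.shape
    interior = np.zeros((h, w), dtype=bool)
    interior[1:-1, 1:-1] = (
        (labels[1:-1, 1:-1] == labels[:-2, 1:-1]) &
        (labels[1:-1, 1:-1] == labels[2:, 1:-1]) &
        (labels[1:-1, 1:-1] == labels[1:-1, :-2]) &
        (labels[1:-1, 1:-1] == labels[1:-1, 2:])
    )
    n = bins ** 3
    border_hist = np.bincount(labels[~interior].ravel(), minlength=n)
    interior_hist = np.bincount(labels[interior].ravel(), minlength=n)
    vec = np.concatenate([border_hist, interior_hist]).astype(float)
    return vec / vec.sum()  # normalize so images of different sizes compare
```

Normalizing the concatenated histograms makes the signature independent of image size, so the distance between two signatures reflects color-area structure rather than resolution.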
We expect the clustering algorithm to ignore off-topic images, which are supposed to be isolated, and to extract some highly coherent clusters that will provide prototype images for the query words. The SNN clustering algorithm has two main advantages in this approach. First, it does not assume a predefined number of clusters. Second, it does not force images to belong to a cluster. By changing the parameters, we can adjust the focus of the extracted clusters and the number of unclustered images. This clustering algorithm is linear in time and in space in the number of images. The main bottleneck of the system is the computation of image similarities.
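The shared-nearest-neighbor idea described above can be sketched roughly as follows; the parameter names (`k`, `strong`, `topic`) and their thresholds are illustrative choices of ours, not the settings of [3] or [5].

```python
import numpy as np

def snn_cluster(dist, k=4, strong=2, topic=2):
    """Minimal sketch of k-SNN clustering. `dist` is an (n, n) symmetric
    distance matrix. Each image keeps its k nearest neighbors; two
    images are linked only if each is in the other's neighbor list, and
    the link strength is the number of neighbors they share. Images
    with at least `topic` strong links become topic images; other
    images join a topic's cluster through a strong link or stay
    unclustered (label -1)."""
    n = len(dist)
    nn = [set(np.argsort(dist[i])[1:k + 1]) for i in range(n)]
    strength = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in nn[i]:
            if i in nn[j]:  # mutual neighbors only
                strength[i, j] = strength[j, i] = len(nn[i] & nn[j])
    strong_links = strength >= strong
    labels = np.full(n, -1)
    next_label = 0
    topics = [i for i in range(n) if strong_links[i].sum() >= topic]
    for t in topics:
        linked = [u for u in topics if labels[u] >= 0 and strong_links[t, u]]
        if linked:
            labels[t] = labels[linked[0]]  # merge with an existing topic
        else:
            labels[t] = next_label         # seed a new cluster
            next_label += 1
    for i in range(n):                     # attach strongly linked images
        if labels[i] == -1:
            for t in topics:
                if strong_links[i, t]:
                    labels[i] = labels[t]
                    break
    return labels
```

Because an isolated image is rarely a mutual nearest neighbor of anything, it accumulates no strong links and is left with label -1, which is exactly the behavior the text relies on for noisy web collections.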
3. REMOVING IMAGES WITH FACES

Faces are a common subject of pictures found on the Internet, and such pictures are usually annotated not as "face" but with other, unrelated keywords. Thus, when a query is submitted to an image retrieval system based on surrounding text, most of the time (if not always) some of the returned images are faces that are not relevant to the query. Table 1 shows statistics on the proportion of faces obtained when querying the Alltheweb picture finder. Except for glasses, these objects are not directly related to faces, yet the average proportion of faces is 10% (glasses excepted). This proportion is too high to be ignored, so removing faces is a necessary step in our process. We expect to remove noise from an image set by deleting the images with faces. Some clusters may otherwise partly consist of images containing faces, which worsens the results, since our goal is to create clusters of images that are representative of the given query. We therefore prefer to delete the images containing faces before clustering. The algorithm used is the multi-stage AdaBoost detector proposed by Paul Viola and Michael Jones in 2001 [10], with the improvements of [8]. This algorithm processes images very rapidly, searching at different scales for a specific object (faces, in our application). For color pictures, we validate the results of this algorithm with skin color detection: a detected face is validated if more than 30% of its pixels are skin colored.
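The skin-color validation step can be illustrated as follows. The paper does not specify its skin model, so the RGB thresholds below (a common Peer et al.-style rule) are our assumption; only the 30% threshold comes from the text.

```python
import numpy as np

def skin_ratio(region):
    """Fraction of skin-colored pixels in an (H, W, 3) uint8 RGB patch.
    The RGB rule below is an illustrative stand-in for the paper's
    unspecified skin model."""
    r, g, b = (region[..., i].astype(int) for i in range(3))
    skin = ((r > 95) & (g > 40) & (b > 20) &
            (region.max(-1).astype(int) - region.min(-1) > 15) &
            (np.abs(r - g) > 15) & (r > g) & (r > b))
    return skin.mean()

def validate_face(region, threshold=0.3):
    """Keep a detected face box only if more than 30% of its pixels
    are skin colored, as described in the text."""
    return skin_ratio(region) > threshold
```

In practice, `region` would be the bounding box returned by the Viola-Jones cascade, cropped from the color image before the check is applied.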
Query        Number of images retrieved   Proportion of faces
Armchair     459                          6%
Boat         399                          23%
Glasses      415                          52%
Knife        440                          10%
Mug          472                          8%
Tree         412                          11%
Wristwatch   418                          2%

Table 1: Proportion of faces on the Internet for seven queries.
False alarms do not have a strong influence: it is better to remove good images than to keep bad ones, and the number of images on the Internet is vast and growing quickly. Table 2 presents an evaluation of our face detector's performance.
Query        Precision (%)   Recall (%)
Armchair     19              38
Boat         51              40
Glasses      81              56
Knife        28              44
Mug          19              48
Tree         23              37
Wristwatch   14              55

Table 2: Performance of the face detector on web image search results.
The difficulty of face detection on web images lies in their high variability. For example, faces are often turned, and our face detector does not recognise profiles.
4. EXPERIMENTAL RESULTS

Our VIKA system (Fig. 1) uses a web image search engine to acquire images (Fig. 2) for a given query and then indexes and clusters these images using the algorithms described above. The user can specify the number of images to acquire and can choose to detect and remove images with faces before indexing. We evaluated the VIKA system on the seven sets of images presented in Table 1. The images of each set were manually sorted twice. First, we marked the images containing faces.
Figure 1: Functional scheme of VIKA system.
Figure 2: VIKA (VIsual KAtaloguer) system
Second, we selected the images representative of a given object. The criteria applied for the manual selection of relevant images are the following:
- the whole object is present in the image,
- the object occupies the largest part of the image,
- the object is pictured in its "habitual" form (for example, a chair from the side, not from above).
An example of VIKA's output is shown in Figure 3.
Figure 3: Clustered web image search results for the query "chair"; images of chairs form clusters representative of the object "chair".
The left image in Figure 4 shows web image search results as currently available, and the right image shows the results we would like to obtain: its images form clusters in the VIKA system. Tables 3 and 4 present our experimental evaluation of the VIKA system compared to Alltheweb. We consider the images in clusters as the retrieved images. We calculate precision and recall for VIKA before and after removing images with people in order to assess the benefit of the face detector, and we compare VIKA's precision and recall to Alltheweb's.
We are primarily concerned with precision, because it corresponds to the proportion of relevant images inside clusters. Precision shows how successful the system is in constructing sets of coherent images, which we need for building a large-scale image ontology. While VIKA clearly outperforms Alltheweb for the query armchair (an 18% improvement), the improvement for other queries is much smaller. One reason is the manual sorting of images into relevant and irrelevant ones: VIKA's worst performance occurs for queries where manual sorting was difficult. For example, only images containing entire trees were considered relevant, while images showing only some branches were marked as irrelevant.
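Under this protocol, precision and recall reduce to simple set operations over image identifiers. A sketch, assuming only the convention stated above that clustered images count as retrieved:

```python
def precision_recall(clustered, relevant):
    """Precision and recall as used in Tables 3-4. `clustered` is the
    set of ids of images placed in clusters (the retrieved set);
    `relevant` is the set of ids judged relevant by manual sorting."""
    hits = len(clustered & relevant)
    precision = hits / len(clustered) if clustered else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, a cluster set of four images of which two are relevant, out of three relevant images overall, gives a precision of 50% and a recall of about 67%.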
Figure 4: Left image - web image search results in the original order given by the web search engine; right image - desirable results (clusters from VIKA).
             Precision (%)          Recall (%)
Query        Alltheweb   VIKA       Alltheweb   VIKA
armchair     58          77         44          58
boat         14          16         42          48
glasses      10          10         69          63
knife        26          24         61          55
mug          69          74         54          58
tree         21          20         53          51
wristwatch   64          66         43          44

Table 3: Alltheweb and VIKA performances before removing images with faces.
5. CONCLUSIONS

The presented work is oriented towards content-based web image retrieval. We use image indexing and clustering to reorganize and improve web image search results. The indexing algorithm applied is based on the colors of an image and, partially, on the shapes of the objects in it. The advantages of the clustering method include its ability to work with multidimensional data and to form clusters of different sizes and shapes. The clustering also identifies unclustered images, an important property since web image search results always contain irrelevant images, and we are interested in isolating such images by classifying them as unclustered.
             Precision (%)          Recall (%)
Query        Alltheweb   VIKA       Alltheweb   VIKA
armchair     59          77         37          49
boat         14          18         38          48
glasses      11          12         43          46
knife        32          32         54          54
mug          75          78         45          47
tree         22          22         52          53
wristwatch   64          66         44          45

Table 4: Alltheweb and VIKA performances after removing images with faces.
During the experiments we noticed that search results often contain pictures of people, so we introduced a semantic feature - the presence of a human face - to improve the results. Both the indexing and the clustering methods have been implemented and tested on images retrieved from the web. Clustering the images from the web produces better organized results. The next step is to keep only relevant images; the problem of classifying the clusters remains for future work. We also have to compile a list of queries for which removing people's photographs is not needed. For example, the query "fireman" should return photographs of people, and therefore images with faces must be kept. More experiments are to be done in the framework presented in this article. Other indexing methods as well as different clustering parameters can be explored. We are interested in choosing the proper indexing method depending on the query: for example, we use a color correlogram to index images of man-made objects, but for indexing images of trees we prefer a method based on texture. Our future research will concentrate on building a large-scale image ontology. This work includes web-based data mining, text and image processing. We would like to explore the possibility of building a visual dictionary using web image retrieval.
REFERENCES
[1] Cai, D., He, X., Li, Z., Ma, W.-Y., and Wen, J.-R. Hierarchical Clustering of WWW Image Search Results Using Visual, Textual and Link Information. In Proceedings of the 12th Annual ACM International Conference on Multimedia, New York, NY, USA, 2004, 952-959.
[2] Deselaers, T., Keysers, D., and Ney, H. Clustering Visually Similar Images to Improve Image Search Engines. In Informatiktage 2003 der Gesellschaft für Informatik, Bad Schussenried, Germany, November 2003.
[3] Ertöz, L., Steinbach, M., and Kumar, V. Finding Topics in Collections of Documents: A Shared Nearest Neighbor Approach. In Proceedings of TextMine'01, Workshop on Text Mining, First SIAM International Conference on Data Mining, 2001.
[4] Frankel, C., Swain, M. J., and Athitsos, V. WebSeer: An Image Search Engine for the World Wide Web. Technical Report TR-96-14, University of Chicago, IL, USA, 1996.
[5] Jarvis, R. A., and Patrick, E. A. Clustering Using a Similarity Measure Based on Shared Nearest Neighbors. IEEE Transactions on Computers, vol. C-22, no. 11, 1973.
[6] Joint, M., Moëllic, P.-A., Hède, P., and Adam, P. PIRIA: a general tool for indexing, search, and retrieval of multimedia content. In Image Processing: Algorithms and Systems III (Proceedings of the SPIE, Volume 5298), 2004, 116-125.
[7] Kherfi, M. L., Ziou, D., and Bernardi, A. What is Behind Image Retrieval from the World Wide Web? In Proceedings of the International Conference on Web-Based Communities, Lisbon, Portugal, March 2004.
[8] Lienhart, R., Kuranov, A., and Pisarevsky, V. Empirical Analysis of Detection Cascades of Boosted Classifiers for Rapid Object Detection. Microprocessor Research Lab Technical Report, May 2002.
[9] Stehling, R. O., Nascimento, M. A., and Falcão, A. X. A Compact and Efficient Image Retrieval Approach Based on Border/Interior Pixel Classification. In Proceedings of the Eleventh International Conference on Information and Knowledge Management (McLean, Virginia, USA, 2002). ACM Press, New York, NY, 2002, 102-109.
[10] Viola, P., and Jones, M. Robust Real-time Object Detection. Second Int'l Workshop on Statistical and Computational Theories of Vision - Modeling, Learning, Computing and Sampling, Vancouver, Canada, 2001.