VDictionary: Automatically Generate Visual Dictionary via Wikimedia Yanling Wu1,2, Mei Wang2,3, Guangda Li2, Zhiping Luo2, Tat-Seng Chua2, and Xumin Liu1

1 College of Information Engineering, Capital Normal University, Beijing, China, [email protected]
2 School of Computing, National University of Singapore, Singapore, {wuyanlin, wangmei, luozhipi, chuats}@comp.nus.edu.sg
3 School of Computer Science and Technique, Donghua University, China

Abstract. This paper presents a novel system that automatically generates visual explanations by exploiting the visual information in Wikimedia Commons together with automatic image labeling techniques. Sample images and sub-object-based training data are obtained from Wikimedia Commons. We then propose an image labeling algorithm to extract salient semantic sub-objects, each of which is assigned a semantic label. In this way, our system provides visual references at different semantic levels.

1 Introduction

Compared to text, visual information such as images conveys richer information. Recently, many online visual dictionaries have leveraged the visual world [1] for their applications. A visual dictionary is designed to quickly answer questions related to real-world objects. Fig. 1 illustrates an example for the object "car" from the well-known website Visual Dictionary Online [1]. However, the definitions on such websites are developed by professional experts, which requires considerable time and labor and therefore impairs their scalability and completeness. Region-based image annotation algorithms have been applied to automatic visual dictionary construction, but their performance largely depends on image segmentation, which is a fragile and error-prone process [2]. In addition, selecting the salient sub-objects of a given image to annotate is itself a complicated problem. Moreover, most image annotation algorithms rely heavily on training data [6], and region-based annotation training sets are difficult to obtain. Nowadays, community websites such as Wikimedia [3] are popular in daily life. Fig. 2 illustrates an example of the "transport" hierarchy obtained from Wikimedia Commons. As can be seen, information from Wikimedia Commons has the following characteristics: 1) textual concepts and images are well organized by professional users; 2) images in a given object category are relatively


"pure"; and 3) visual information about the sub-objects, which can be used as training data for automatic sub-object annotation, is also provided by Wikimedia Commons. We present a novel system that automatically generates visual explanations by combining the web source Wikimedia Commons with statistical image labeling techniques.

Fig. 1. The Visual Dictionary Online.

Fig. 2. The transport hierarchy from Wikimedia.

2 The Highlight of the System

Fig. 3 shows the framework of the proposed system. First, we feed each object in the dictionary into Wikimedia Commons, obtaining a set of sample images S for that category and the images in its subcategories as the sub-object annotation training data T. We employ the SIFT detector and descriptor [4] to locate salient local points and compute their descriptions, respectively. It can be observed that dense salient feature points often appear where interesting object parts exist. We then exploit mean shift clustering (MSC) [5] to automatically generate sub-objects for each image in S, and use a k-NN classifier to assign a label to each sub-object based on T.

Automatic Sub Object Generation: Meaningful sub-objects usually lie where feature points cluster into dense areas, and MSC is widely used to find dense regions in a data space. Therefore, to obtain a meaningful image representation, we apply MSC to the geometric positions of the local salient points in order to find semantic patches where these points are densely distributed. After clustering, we obtain a set of cluster centers. We define two thresholds α1 and α2: α1 is used to refine the clusters, and α2 filters out isolated local feature points. Taking each cluster center as the center and the distance to its farthest member point as the radius, we obtain the image patches where interesting sub-objects are likely to exist.

Automatic Semantic Region Generation: Once the sub-objects have been obtained, we assign a sub-object label to each patch. Given the training set T, we use a k-NN classifier to label each sub-object patch.
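The clustering and labeling steps above can be sketched as follows. This is a minimal, illustrative implementation, not the authors' code: keypoints are toy 2D coordinates standing in for SIFT keypoint locations, descriptors are toy 2D vectors rather than 128-D SIFT descriptors, and the cluster-refinement threshold α1 is omitted because the paper does not specify its exact rule.

```python
import math
from collections import Counter

def mean_shift(points, bandwidth, iters=50, tol=1e-3):
    """Flat-kernel mean shift: move each point to the mean of its
    neighbours within `bandwidth` until the shift is below `tol`."""
    modes = [list(p) for p in points]
    for m in modes:
        for _ in range(iters):
            nbrs = [p for p in points if math.dist(p, m) <= bandwidth]
            new = [sum(c) / len(nbrs) for c in zip(*nbrs)]
            shift = math.dist(new, m)
            m[:] = new
            if shift < tol:
                break
    # merge modes that converged to (almost) the same location
    centres = []
    for m in modes:
        if not any(math.dist(m, c) <= bandwidth for c in centres):
            centres.append(tuple(m))
    return centres

def sub_object_patches(keypoints, bandwidth, alpha2):
    """Cluster salient-point coordinates and return candidate
    sub-object patches as (centre, radius) pairs, where the radius
    is the distance to the farthest cluster member.  Clusters with
    fewer than `alpha2` points are dropped as isolated noise."""
    centres = mean_shift(keypoints, bandwidth)
    members = {c: [] for c in centres}
    for p in keypoints:
        nearest = min(centres, key=lambda c: math.dist(p, c))
        members[nearest].append(p)
    patches = []
    for c, pts in members.items():
        if len(pts) < alpha2:          # filter isolated feature points
            continue
        radius = max(math.dist(p, c) for p in pts)
        patches.append((c, radius))
    return patches

def knn_label(descriptor, training, k=5):
    """k-NN labeling: majority vote among the k nearest
    (descriptor, label) pairs from the training data T."""
    nearest = sorted(training, key=lambda t: math.dist(descriptor, t[0]))[:k]
    return Counter(lbl for _, lbl in nearest).most_common(1)[0][0]
```

For example, two dense keypoint groups plus one stray point yield two patches (the stray point is filtered out by α2), and each patch descriptor is then labeled by majority vote over its nearest training descriptors.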


Fig. 3. System framework. Fig. 4. System interface.

Demonstration Interface: Fig. 4 shows the interface of the system, which consists of two panels: a control panel and an image panel. In the control panel, the image selection button "Browse" allows the user to select an example image or an object to be explained. After clicking the "Show Image" button, the example image or the representative image selected for the object is presented in the image panel. Clicking "Sub Object Generation" makes the image panel display the image with its semantic parts marked, while clicking "Show Annotation" labels each sub-object with its textual annotation. In the image panel, interactive zooming is supported to show detailed information about the image.

3 Conclusion

This paper presented a system that automatically constructs a visual dictionary, helping users find visual answers to their queries.

References
1. Visual Dictionary Online, http://visual.merriam-webster.com/
2. Gao, Y., Fan, J., Luo, H., Xue, X., Jain, R.: Automatic Image Annotation by Incorporating Feature Hierarchy and Boosting to Scale up SVM Classifiers. ACM Multimedia (2006)
3. Wikimedia Commons, http://commons.wikimedia.org/wiki/Main_Page
4. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
5. Wang, P., Lee, D., Gray, A., Rehg, J.: Fast Mean Shift with Accurate and Stable Convergence. International Conference on Artificial Intelligence and Statistics, 604–611 (2007)
6. Wang, M., Hua, X.S., Tang, J., Hong, R.: Beyond Distance Measurement: Constructing Neighborhood Similarity for Video Annotation. IEEE Transactions on Multimedia 11(3), 465–476 (2009)
