
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 11, NO. 2, FEBRUARY 2009

Scale-Invariant Visual Language Modeling for Object Categorization

Lei Wu, Yang Hu, Student Member, IEEE, Mingjing Li, Senior Member, IEEE, Nenghai Yu, and Xian-Sheng Hua, Member, IEEE

Abstract—In recent years, "bag-of-words" models, which treat an image as a collection of unordered visual words, have been widely applied in the multimedia and computer vision fields. However, because they ignore the spatial structure among visual words, they cannot discriminate between objects that have similar word frequencies but different spatial distributions of those words. In this paper, we propose a visual language modeling (VLM) method, which incorporates the spatial context of local appearance features into a statistical language model. To represent the object categories, models with different orders of statistical dependency are exploited. In addition, a multilayer extension of the VLM makes it more resistant to scale variations of objects. The model is effective and applicable to large-scale image categorization. We train scale-invariant visual language models on images grouped by Flickr tags and use these models for object categorization. Experimental results show that they achieve better performance than single-layer visual language models and "bag-of-words" models. They also achieve performance comparable to 2-D MHMM and SVM-based methods at a much lower computational cost.

Index Terms—Computer vision, content-based image retrieval, image classification, visual language model.

I. INTRODUCTION

Categorizing images that contain different objects is an important open problem in multimedia retrieval and computer vision. While humans can readily recognize thousands of object categories, it remains difficult for computers to do so. The task is challenging because the appearance of objects belonging to the same category may vary due to changes in scale, illumination, viewpoint, occlusion, and clutter. On the other hand, objects of different categories may resemble each other, which further complicates the problem. In recent years, the success of information retrieval techniques in text analysis has aroused much interest in applying them to computer vision tasks. To imitate the representation of a

Manuscript received April 12, 2008; revised October 06, 2008. Current version published January 16, 2009. This work was supported in part by the National Natural Science Foundation of China (60672056) and by the Specialized Research Fund for the Doctoral Program of Higher Education of China (20070358040). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Jiebo Luo. L. Wu, Y. Hu, M. Li, and N. Yu are with the MOE-Microsoft Key Laboratory of Multimedia Computing and Communication (MCC), Department of Electrical Engineering and Information Science, University of Science and Technology of China, Hefei 230026, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). X.-S. Hua is with Microsoft Research Asia, Beijing, China (e-mail: [email protected]). Digital Object Identifier 10.1109/TMM.2008.2009692

text document, images are first represented by sets of local appearance or shape descriptors [1]–[4], which are extracted from images either densely, at random, or sparsely according to some local saliency criteria [5]. An image can then be viewed as a collection of visual words drawn from a vocabulary constructed by vector quantization of the local descriptors. This "bag-of-words" representation makes it possible to leverage text data mining techniques to analyze images. While some work applied discriminative training [6] (e.g., the maximum entropy framework and SVM) to images represented by histograms of visual words, more interest has been devoted to using generative probabilistic models to recognize object and scene classes. In [7], Csurka et al. applied a Naïve Bayes classifier to visual categorization. They assumed that each visual word in an image is chosen independently from a multinomial distribution specific to the class of the image. Unlike Naïve Bayes, which assumes that each image is generated from only one topic, latent space analysis assumes that an image may contain multiple latent topics. Two typical models that inherit this assumption are the probabilistic Latent Semantic Analysis (pLSA) of Hofmann [8] and the Latent Dirichlet Allocation (LDA) of Blei et al. [9]. Both methods have been exploited for visual categorization [10]–[15]. A desirable property of these generative models is that object localization or segmentation can be achieved by investigating the topic posteriors of the visual words, as shown in [10]. These "bag-of-words" models [16], [17], however, assume that the local appearance descriptors are independent of each other conditioned on the object class or the latent topic. Although this assumption greatly simplifies the computations, ignoring the spatial co-occurrence of local features may reduce the performance of the algorithms. Objects with different appearance but similar statistics of visual words tend to be confused. For example, a permutation of the patches of a motorbike image generates the same word frequencies as the original image, yet it should by no means be recognized as a motorbike. To tackle this problem, some related work proposed to consider the co-occurrence of local descriptors. In [18], Lazebnik et al. included the probability of co-occurrence of pairs of visual words in the feature function when building a maximum entropy classifier. Sivic et al. [15] augmented the visual vocabulary with "doublets," which are pairs of visual words co-occurring within a local spatial neighborhood. Inspired by the correlograms of (quantized) colors, which have been used as global features for indexing and classifying images, Savarese et al. [19] proposed to use correlograms of pixels to model


the typical spatial correlations between visual words in object classes. Related work that captures the co-occurrence of multiple (more than two) neighboring descriptors includes the hyper-features proposed by Agarwal and Triggs [20] and the visual phrases of Yuan et al. [21]. Hyper-features form a multilevel visual representation: the histogram of descriptors in a local region is regarded as the descriptor of that region, and this new descriptor is then used to build a local histogram characterizing a larger region. Visual phrases are spatially co-occurring patterns obtained using a frequent itemset mining technique. These two works, although sharing a similar intuition with ours, did not explicitly model the conditional relations among contextual descriptors, which is exactly what we do in this work. In [22], Li and Wang proposed to use two-dimensional multiresolution hidden Markov models (2-D MHMMs) to model images, which also describe the spatial dependency between neighboring patches. However, the 2-D MHMM is quite complex compared to our visual language model. Due to its hidden layers, the 2-D MHMM method requires an iterative EM algorithm, which greatly reduces its efficiency. On the other hand, some other methods [23]–[25] adopted Markov random fields (MRFs) [26] and conditional random fields (CRFs) [27] to model the dependencies between image pixels. The goal of these works is image segmentation and region-level image labeling, which differs from ours: they took region-level labels as input and trained models that classify pixels so as to distinguish different regions in an image. In our work, we only collect image-level labels, and the task is to categorize images that contain different objects. We propose a lightweight supervised context modeling method, which incorporates the contextual dependency into a more efficient statistical language model to represent the objects. The proposed approach avoids the iterative EM algorithm while keeping performance similar to other state-of-the-art supervised methods for the image categorization task, as demonstrated by our experiments. An important contribution of this work is the multilayer extension of the visual language model. Uniformly distributed, equally sized patches simplify the modeling of the conditional relations between nearby visual words, but they are sensitive to scale changes of objects: objects belonging to the same category but appearing at different scales may generate completely different sets of visual words. It is therefore reasonable to introduce multiresolution patches to describe the image and then exploit the conditional distributions between patches of different sizes. Besides addressing the scale variance of objects, this multilayer visual language model also captures the co-occurrence of pairs of patches at different levels of proximity. As a result, more information is captured by the multilayer model, and a significant performance improvement over the single-layer model is achieved. The rest of the paper is organized as follows. In Section II, we introduce the single-layer visual language modeling method in detail. The multilayer visual language model is presented in Section III. The experiments on object categorization and their results are discussed in Section IV. We conclude the paper in Section V.


II. SINGLE-LAYER VISUAL LANGUAGE MODELING

Inspired by the successful application of statistical language models (SLMs) in text classification, we propose to build visual language models (VLMs) for image classification. There is a theoretical parity between natural language and visual language. Natural language consists of words, and visual language consists of visual words. In natural language, grammar restricts the distribution and order of words. In an image, which is divided into patches, there are likewise constraints on how the patches are combined to form a meaningful object; a random combination of patches cannot generate a meaningful image. Visual language modeling is based on the assumption that there are implicit visual grammars in a meaningful image. To build visual language models, we should make some preparations. First, a text document consists of words, each of which has its own semantic meaning, while an image consists of pixels. A pixel alone does not carry any meaning; only a group of pixels together contains some semantics. Thus, a larger unit is required to represent the content of an image. Here, we adopt local patches, each of which covers a sub-window of the image. Second, considering the variation of pixels within a local patch, the number of different patches is huge even if the patch size is small. Therefore, local patches should be properly quantized into a limited number of visual words, typically via clustering or hash coding. Third, images are 2-D signals while text documents are 1-D in nature. We could assume that each visual word is conditionally dependent on all its neighbors; however, too many parameters would then need to be estimated. For example, if the size of the visual vocabulary is $n$ and only the 4-neighbors are considered, the number of conditional probabilities is on the order of $n^5$, and a large number of training images would be required to estimate these parameters. To simplify the computation, we assume that an image is generated through a Markov process, with its visual words generated from top to bottom in the image and from left to right within the same row. This assumption is crucial for simplifying the computation; we could equally assume the contrary, e.g., "generated from bottom to top and from right to left," which is equivalent to rotating or flipping the image. In our experiments, the images are already normalized by rotation, flipping, and scaling, so the simple left-to-right assumption is appropriate.

A. Image Representation

To model the visual language, we first represent the image by a set of visual words. To keep the spatial relations between patches simple, we extract local patches on a uniform grid, i.e., we divide the image into a matrix of non-overlapping, equally sized patches. Each patch is described by an 8-dimensional texture histogram. To make the texture histogram robust to rotation, we first determine the average gradient direction $D$ of a patch. Let $P(i,j)$ be the pixel in the $i$th row and $j$th column of a patch; then we have


(1) (2) (3)


Then the patch is rotated so that the average direction points horizontally to the right. Starting from the average direction, we define eight texture directions; the angle between any two adjacent directions is 45°. We then calculate the gradient of each pixel and quantize its direction into one of the eight predefined directions. The texture histogram of the $k$th patch is obtained by summing the gradient magnitudes in each direction, as given in (4)–(6),

where the normalizing term is the total number of pixels in the patch. We use a hash coding scheme similar to that in [28] to construct the visual vocabulary from the local texture histograms; this hash code has proven quite efficient and effective for large-scale duplicate image detection. The advantage of the proposed feature is its low complexity: it saves both time and storage. A comparison with the SIFT feature is conducted and discussed in the experiment section. From that comparison, we find that the proposed texture feature requires only eight dimensions of storage per image patch, while the SIFT representation needs 128 dimensions, and computing the texture histogram takes about 1/20 of the time of SIFT. The proposed feature also outperforms SIFT in accuracy. The explanation is that the SIFT feature contains much redundant information, which may not be helpful for representing a simple image patch. The efficiency and effectiveness of the texture histogram and SIFT are discussed further in the experiment section.

B. Language Model Training

According to the previously mentioned assumptions, the training process models the conditional probability

$P(w_{1,1}, w_{1,2}, \ldots, w_{M,N} \mid C)$  (7)

where $C$ is the category and the image is represented by an $M \times N$ matrix of visual words. The calculation of this conditional probability is still not efficient enough. Inspired by the 2-D HMM [29] used in face recognition, we assume that each patch depends only on its immediate vertical and horizontal neighbors. Although there may be some statistical dependency between visual words at larger intervals, describing this dependency would make the model too complex to implement. In this paper, we ignore this kind of dependency, just as the language model does for text classification. According to how much dependency information is considered, we propose three kinds of visual language models: unigram, bigram, and trigram. In the unigram model, the visual words in an image are regarded as independent of each other. In the bigram model, the dependency between two neighboring words is considered. In the trigram model, a word is assumed to depend on both the word on its left and the word above it.
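Before these models can be estimated, each training image must first be turned into a matrix of visual words. The following sketch illustrates one possible implementation of the patch feature and quantization of Section II-A; the finite-difference gradients, the bin shift used in place of physically rotating the patch, and the toy stand-in for the hash coding of [28] are all assumptions rather than the paper's exact procedure.

```python
import numpy as np

def texture_histogram(patch):
    """8-bin gradient-orientation histogram of a grayscale patch (Sec. II-A).

    Orientations are measured relative to the patch's average gradient
    direction, which stands in for rotating the patch itself; simple
    finite differences replace the unspecified gradient operator.
    """
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)
    avg_dir = np.arctan2(gy.sum(), gx.sum())        # average gradient direction D
    rel = (ang - avg_dir) % (2 * np.pi)             # direction relative to D
    bins = (rel // (np.pi / 4)).astype(int) % 8     # quantize into 8 bins of 45 degrees
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=8)
    return hist / patch.size                        # normalize by the number of pixels

def image_to_word_matrix(image, grid=80, n_words=256):
    """Partition an image into grid x grid patches and map each to a word id.

    The two strongest histogram bins act as a toy hash code here; the real
    vocabulary in the paper comes from the hash coding scheme of [28].
    """
    h, w = image.shape
    ph, pw = h // grid, w // grid
    words = np.empty((grid, grid), dtype=int)
    for x in range(grid):
        for y in range(grid):
            patch = image[x * ph:(x + 1) * ph, y * pw:(y + 1) * pw]
            hist = texture_histogram(patch)
            top = np.argsort(hist)[-2:]             # two dominant directions
            words[x, y] = int(top[1] * 8 + top[0]) % n_words
    return words
```

With one such word matrix per training image, the unigram, bigram, and trigram statistics below can be accumulated by simple counting.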

These three models are expressed in (8)–(10), respectively, as

$P(W \mid C) = \prod_{x,y} P(w_{x,y} \mid C)$  (8)

$P(W \mid C) = \prod_{x,y} P(w_{x,y} \mid w_{x,y-1}, C)$  (9)

$P(W \mid C) = \prod_{x,y} P(w_{x,y} \mid w_{x,y-1}, w_{x-1,y}, C)$  (10)

where $W = \{w_{x,y}\}$ and $w_{x,y}$ represents the visual word at row $x$, column $y$ of the word matrix. In the following, we discuss the training process for the three models in detail.

1) Unigram Model: For each category, a unigram model characterizes the distribution of individual visual words under that category. In the training process, we calculate $P(w_i \mid C)$ using

$P(w_i \mid C) = \dfrac{\mathrm{freq}(w_i, C)}{\sum_{j} \mathrm{freq}(w_j, C)}$  (11)

where $\mathrm{freq}(w_i, C)$ is the frequency of word $w_i$ in category $C$. To avoid zero probabilities, which would cause the classifier to fail, we assign a small prior probability to each word that is unseen in the category. Accordingly, this prior probability mass is discounted from the probabilities of the words that do appear in the category, so that the probabilities still sum to 1. The smoothed word distribution is represented by (12) as

(12)

where the two quantities appearing in (12) are the total number of words in the training set and the number of words that do not appear in class $C$, respectively. This probabilistic model tells how likely a word is to appear in an image belonging to that category.

2) Bigram Model: Unlike the unigram model, a bigram model assumes that each visual word is conditionally dependent on its left neighbor. The training process therefore learns the conditional probability

$P(w_{x,y} \mid w_{x,y-1}, C) = \dfrac{\mathrm{freq}(w_{x,y-1}, w_{x,y}, C)}{\mathrm{freq}(w_{x,y-1}, C)}$  (13)

where $w_{x,y-1}$ is the left neighbor of $w_{x,y}$ in the visual word matrix. However, bigrams are usually quite sparse in the probability space: the maximum likelihood estimate is biased high for observed samples and biased low for unobserved samples. Thus, a smoothing technique is needed to provide better estimates of infrequent or unseen bigrams. Instead of just assigning a small constant prior probability, we adopt a more accurate smoothing method [30], which combines back-off and discounting [31], as shown in (14)–(16). The back-off method is represented in (14) and (15), and discounting is represented in (16). If a bigram does not appear in the category, the back-off method is applied to calculate the bigram probability from the unigram model, scaled by a back-off factor. If a bigram


appears in category $C$, the discounting method is used to depress the estimate of its conditional probability; the corresponding scaling factor is called the discounting coefficient

(17) (18)

where $r$ is the number of times a bigram appears and $n_r$ is the number of visual words that appear $r$ times in the category.

3) Trigram Model: The above two modeling processes are almost the same as the statistical language models used in text classification. However, the trigram model for visual language differs from that for natural language processing. In text analysis, a trigram is a sequence of three consecutive words, while in a visual document, which is a two-dimensional matrix, we assume each word is conditionally dependent on its left neighbor and the word above it, so these three words form a trigram. The training process of the trigram model is illustrated in the following equation:

$P(w_{x,y} \mid w_{x,y-1}, w_{x-1,y}, C) = \dfrac{\mathrm{freq}(w_{x-1,y}, w_{x,y-1}, w_{x,y}, C)}{\mathrm{freq}(w_{x-1,y}, w_{x,y-1}, C)}$  (19)

For the same reason as in the bigram model, a back-off method is introduced in (20) and (21), and the discounting method for trigrams is represented in (22). The spatial correlation between

visual words is thus captured by the conditional probabilities of trigrams. In summary, the training procedure is as follows: 1) divide each training image into patches; 2) generate a hash code for each patch to form a visual document; and 3) build visual language models for each category by calculating the conditional distributions of unigrams, bigrams, and trigrams. It is worth noting that not all visual words are useful for classification. Therefore, we introduce a feature selection step during visual language training. Words are selected according to their term frequency (TF) and inverse document frequency (IDF), defined in (23) and (24): the term frequency measures how often a word appears in images belonging to category $C$, and the inverse document frequency reflects how rarely the word appears across the dataset. The tf-idf weight thus evaluates how important a word is to category $C$. For each category, we select the words whose tf-idf weight is larger than a threshold, and the selected words from different categories are combined. This approach suppresses the influence of random background and reduces the size of the vocabulary.
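A minimal sketch of this feature selection step follows. The standard tf and idf formulas and the threshold value below are assumptions; the paper's exact definitions in (23) and (24) are not reproduced here.

```python
import math
from collections import Counter

def select_words(docs_by_category, threshold=0.001):
    """Sketch of tf-idf based visual-word selection (Section II-B).

    docs_by_category: {category: list of visual documents}, where each
    document is a flat list of visual-word ids. Standard tf and idf
    definitions and the threshold are illustrative assumptions.
    """
    all_docs = [doc for docs in docs_by_category.values() for doc in docs]
    n_docs = len(all_docs)
    doc_freq = Counter()
    for doc in all_docs:
        doc_freq.update(set(doc))             # in how many documents each word occurs

    selected = set()
    for category, docs in docs_by_category.items():
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts.values())
        for w, c in counts.items():
            tf = c / total                     # frequency of w within the category
            idf = math.log(n_docs / (1 + doc_freq[w]))
            if tf * idf > threshold:           # keep words important to this category
                selected.add(w)
    return selected                            # union of per-category selections
```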

(14)–(16), (20)–(22): back-off and discounting formulas referenced in Section II-B.
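Since (14)–(22) follow the Katz-style back-off and discounting of [30] and [31], whose exact coefficients are not reproduced above, the sketch below uses interpolated absolute discounting as a simpler stand-in to show how probability mass is shifted from observed bigrams to unseen ones; the discount value and vocabulary size are illustrative assumptions.

```python
from collections import Counter

def train_bigram(word_matrices, discount=0.5, vocab_size=256):
    """Bigram model with interpolated absolute discounting (a stand-in for
    the Katz-style back-off and discounting of (14)-(22)).

    word_matrices: list of 2-D lists of visual-word ids belonging to one
    category; bigrams pair each word with its left neighbor.
    """
    uni, bi = Counter(), Counter()
    followers = {}                                    # distinct words seen after each context
    for m in word_matrices:
        for row in m:
            for y, w in enumerate(row):
                uni[w] += 1
                if y > 0:
                    prev = row[y - 1]
                    bi[(prev, w)] += 1
                    followers.setdefault(prev, set()).add(w)
    total = sum(uni.values())

    def p_uni(w):
        # Laplace-smoothed unigram so unseen words keep nonzero mass.
        return (uni[w] + 1) / (total + vocab_size)

    def p_bi(w, prev):
        c_prev = sum(bi[(prev, v)] for v in followers.get(prev, ()))
        if c_prev == 0:
            return p_uni(w)                           # unseen context: fall back fully
        lam = discount * len(followers[prev]) / c_prev  # mass reserved for unseen pairs
        p_seen = max(bi[(prev, w)] - discount, 0) / c_prev
        return p_seen + lam * p_uni(w)

    return p_bi
```

The returned function can then be queried as p_bi(w, prev) when scoring a test image with the bigram classifier of (27).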


C. Image Classification

In the classification phase, each test image is first transformed into a matrix of visual words in the same way as in the training process. The image is then assigned to the most probable category by maximizing the posterior probability

$C^{*} = \arg\max_{C} P(C \mid d) = \arg\max_{C} P(d \mid C)\, P(C)$  (25)

where $d$ represents a novel visual document. The category with the maximum posterior probability over all categories is chosen. For the unigram model, the visual words in the document are assumed to be independent of each other, so the classification process can be transformed into

$C^{*} = \arg\max_{C} P(C) \prod_{x,y} P(w_{x,y} \mid C)$  (26)

Fig. 1. (a) Scaling problem and (b) multilayer image representation.

For the bigram model, each word depends on its left neighbor, so the classification is formulated as the following maximization:

$C^{*} = \arg\max_{C} P(C) \prod_{x,y} P(w_{x,y} \mid w_{x,y-1}, C)$  (27)

Accordingly, a trigram-model-based classifier is given by

$C^{*} = \arg\max_{C} P(C) \prod_{x,y} P(w_{x,y} \mid w_{x,y-1}, w_{x-1,y}, C)$  (28)

III. MULTILAYER VISUAL LANGUAGE MODELING

A. Scaling Problem

One problem with the previous visual language model, and one of the biggest, is its sensitivity to the scale variations of the object. The same object or scene at different scales may generate completely different visual word matrices. To make the model more resistant to scale variation, we introduce a multilayer extension to the visual language model, denoted MVLM. Instead of extracting image patches of a single size, we extract patches of different sizes from an image, which can capture object characteristics at different scales. The visual language model built on these patches models the spatial co-occurrence relationships between them. The basic idea of the scale-invariant modeling method is to train the language model on patches of various scales, so that given any image, the conditional word distribution of the object region can be best matched. In the multilayer patch representation, the patches on the same layer are of the same size, while those on a higher layer are twice the size; for example, we use 4 × 4 patches for the first layer, 8 × 8 for the next, and so on. The first layer is called the base layer; the other layers are called extended layers, as shown in Fig. 1. All these patches are transformed into visual words in the same way as in the monolayer language model training process.

For the multilayer visual language modeling method, three assumptions are made, corresponding to the multilayer unigram model (M-unigram), the multilayer bigram model (M-bigram), and the multilayer trigram model (M-trigram), respectively.

Assumption 1: In the M-unigram model, visual words on different layers are independent of each other.

Assumption 2: In the M-bigram model, each visual word depends only on its left neighbor in the same layer.

Assumption 3: In the M-trigram model, each visual word depends on its left neighbor and the word above it in the same layer.

The training processes of these three models are formulated by the following equations. For the multilayer unigram model,

(29)

For the multilayer bigram model,

(30) (31)

For the multilayer trigram model,

(32) (33)

In (29)–(33), the word symbols represent any three words in the vocabulary and the layer index denotes the $k$th layer; the remaining symbols have the same definitions as in the monolayer VLM. Here $m$ is the number of layers: for each document in category $C$, we divide it into $m$ layers and count the $n$-grams on all layers. The parameter $m$ is determined experimentally.
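As an illustration of Assumption 3 and the per-layer counting described above, the following sketch accumulates within-layer trigram counts over all $m$ layers of one image. Representing each layer as a 2-D word matrix, half the resolution of the layer below, follows the description in Section III-A; the data layout itself is an assumption.

```python
from collections import Counter

def multilayer_trigram_counts(word_matrices_by_layer):
    """Accumulate within-layer trigram counts over all layers (Section III).

    word_matrices_by_layer: list over layers; each entry is a 2-D list of
    visual-word ids for that layer (patches double in size per layer, so
    each matrix has half the width and height of the previous one).
    """
    counts = Counter()
    for w in word_matrices_by_layer:
        for x in range(1, len(w)):
            for y in range(1, len(w[x])):
                # Trigram = (word above, word to the left, current word),
                # counted within a single layer as in Assumption 3.
                counts[(w[x - 1][y], w[x][y - 1], w[x][y])] += 1
    return counts
```

During classification, only the base-layer word matrix of a test image would be scored against the models built from all layers, as described below.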


TABLE I CATEGORIES AND DATA DISTRIBUTION OF CALTECH DATASET

The multilayer visual language modeling method brings several favorable properties to the VLM. Since the patches come in various sizes, object scaling is no longer a problem for the VLM. Moreover, the MVLM does not increase the computational time of the classification phase: the additional computational cost is incurred during training, so classification takes the same time as with the monolayer VLM. During training, the models are built at various scales; during classification, the base layer of the test image is compared with all layers of the models to find the best match. The classification procedure is therefore exactly the same as for the monolayer VLM classifier.

IV. EXPERIMENTS

We conduct six experiments to evaluate the performance of the multilayer visual language model. In the first experiment, we compare the multilayer visual language model with other state-of-the-art methods, including unsupervised methods such as pLSA [17] and LDA [9] and supervised methods such as the 2-D MHMM [22] and an SVM-based method. In the second experiment, we compare the multilayer and monolayer visual language models. In the third, we analyze the influence of the parameter settings on the performance of the model. The fourth experiment compares the effectiveness of the texture histogram with other types of local descriptors. In the fifth, we examine the robustness of the model. Finally, we apply the multilayer visual language modeling (MVLM) method to classify real-world photos from the popular photo-sharing website Flickr.

A. Experiment A: Comparison With Other Methods

In order to compare classification results, we use the same dataset as [17]; detailed information about the dataset is shown in Table I. In our classification experiments, we compare the performance of the multilayer trigram model (M-trigram) with four other state-of-the-art image classification methods: pLSA, LDA, 2-D MHMM, and an SVM-based approach.


For all the approaches, the local texture histogram is used as the descriptor for each patch, and a shared codebook is generated by the hash coding method. The size of the codebook is 256 for all methods, so that a compact 8-bit code can index the visual words. For the proposed method, the class of an image is obtained by maximizing the posterior probability. We fix the number of patches in each layer: on the base layer, the image is partitioned into 160 × 160 equal-sized patches; on the first extended layer, there are 80 × 80 equal-sized patches; and so on. The number of layers is 3. For the pLSA and LDA approaches, the number of latent topics equals the number of image categories in the dataset. The images are classified by the latent topics in the pLSA or LDA model, which is the same setting as in the bag-of-words approach. The topics in the LDA model are drawn from a Dirichlet distribution with the same prior. For the 2-D MHMM method, we also adopt multiple resolutions [22]. We implement the 2-D MHMM based on the HMM toolbox by Murphy [32]. Following the setting in [23], we set the number of resolutions to 3, the number of states at the lowest resolution to 3, and the number of states at the two higher resolutions to 4. For a fair comparison, each block is represented by the same texture feature as that used in the proposed method; we have tried our best to compare the different models fairly. For the SVM-based method, we use LibSVM to train a classifier on the histograms of visual words in an image. Five-fold cross-validation is adopted in the evaluation process: for each run, 80% of the images are used for training and 20% for testing, and the average performance over the five runs is reported as the average accuracy. The accuracy and classification time are compared in Fig. 2, where M-trigram denotes the multilayer trigram model. Compared with the pLSA method, M-trigram gains 40% in accuracy while using only 1/10 of the time; compared with the LDA method, M-trigram outperforms it by around 30% while consuming only 1/60 of the time. It is worth noting that the times for pLSA and LDA are quite different because they use different inference procedures: pLSA uses the EM algorithm, whereas LDA uses Gibbs sampling, whose computational cost is large because a sufficient number of sampling rounds is needed to obtain reliable results. Compared with the state-of-the-art supervised image classifiers, the proposed method achieves comparable accuracy while using only 1/10 of the time of the 2-D MHMM method and 1/20 of the time of the SVM-based method. Therefore, incorporating the spatial context among local features with this lightweight model is both necessary and rewarding.

B. Experiment B: Model Comparison

In this experiment, we compare the multilayer visual language modeling method with the monolayer modeling methods. Here, the 3-layer trigram model is compared with single-layer unigram and trigram models. The number of patches in the base layer of the 3-layer trigram model is 160 × 160; for the monolayer models, we fix the patch number to 80 × 80. The F1 measure is used as the performance measure, and the other experimental settings are the same as in Experiment A. In Fig. 2, "1G", "2G", "3G", and "M3G" represent the unigram, bigram, trigram, and M-trigram models, respectively. The M-trigram model outperforms the single-layer trigram model by


Fig. 2. Classification comparison with pLSA and LDA, and between different VLMs (a) Accuracy, (b) Computational cost, (c) Different VLM.

Fig. 3. Parameter sensitivity. (a) Influence of patch number. (b) Influence of layer number.

around 14% in the F1 measure, which demonstrates the advantage of the multilayer extension to the visual language model.

C. Experiment C: Parameter Property

In this experiment, we test the influence of the number of patches and the number of layers on the performance of the multilayer trigram model. In Fig. 3(a), we fix the number of layers to 3 and change the number of patches in the base layer from 20 × 20 to 160 × 160. The classification accuracy increases with the number of patches, but there is little further gain once the base layer contains more than 160 × 160 patches. In Fig. 3(b), we fix the number of patches in the base layer to 160 × 160 and vary the number of layers from 1 to 4. According to this figure, there is a sharp increase in classification accuracy when an additional layer is added to a single-layer model.

D. Experiment D: Feature Comparison

In this experiment, we compare the local texture histogram feature with other commonly used local features. The images are divided into the same pattern of patches, which are represented by four kinds of features: RGB pixel color, HSI pixel color, SIFT, and the local texture histogram. The performance of these four kinds of features is evaluated by classification accuracy, as shown in Fig. 4(b). From the comparison, we find that both the SIFT and texture histogram features are relatively robust across different concept categories, while the texture histogram feature clearly outperforms SIFT on the categories "airplane", "car", "motorbike", and "face". The explanation is that the SIFT feature contains much

Fig. 4. (a) Influence of training noise and (b) influence of local feature types. In (a), the horizontal axis is the proportion of the error labels (noise) in the training set and the vertical axis is the classification accuracy.

redundant information, which not only yields an overly complex representation of a simple image patch but may also disturb the consistent description of local patches. So for local image patches, which are less informative than whole image regions, the simple local texture histogram is the more appropriate representation. Besides, the proposed texture feature requires only eight dimensions of storage per patch, while the SIFT representation needs 128. We also find that the simple local texture histogram can capture the semantic objects well even without a complex interest-region detection process. Fig. 6 shows that most of the local features that frequently appear for a concept are located in the regions of the semantic objects in the photos. This also explains why the seemingly naive visual language modeling method can truly capture the semantics of the objects even though there is no object segmentation or foreground detection step.

E. Experiment E: Robustness

In this experiment, we show that the multilayer visual language model is resistant to noise in the training data. Training data are often noisy, for example because of manual labeling errors or because the data are collected automatically from the web. Therefore, it is important for a model to be insensitive to noisy samples in the training set. Fig. 4(a) shows the performance of the multilayer trigram model when different amounts of label error are introduced. Even when 30% of the labels in the training set are erroneous, the model still performs almost as well as on noise-free data. This is because the classification of an image mainly depends on the visual words with high frequencies; the frequencies of the words from noisy samples are relatively low within a category and therefore have only limited influence on the model performance.

F. Experiment F: Classification on Real-World Data

To evaluate the performance of the model on real-world images, we collect 14,405 photos from Flickr. The photos are classified into 20 categories, as shown in Table I. Images with multiple objects are classified according to the most salient object; for example, Fig. 5(a) is classified into the accordion category rather than face, and Fig. 5(b) is classified into building rather than car. The multilayer trigram model with 160 × 160 patches in the base layer is evaluated. We report the results of fivefold cross-validation on this dataset in Fig. 5; both the precision and the recall of the classification are shown.


Fig. 5. Flickr photo classification results. Categories "eagle" and "ketch" are left empty, as the number of training samples for these two categories is too small to train the language model. (a) Precision. (b) Recall.

Fig. 6. Illustration of the most frequent visual words for each object category. The small rectangles show the local patches represented by frequent visual words of the corresponding object; different colors represent patches from different layers. Most of the highly frequent local features are located in the regions of semantic objects, which also explains why the simple grid-based local features, obtained without a complex interest-region detection process, can still effectively represent the semantic objects.

V. CONCLUSION

In this paper, we propose the scale-invariant visual language model. This model incorporates spatial context into a statistical language model to represent objects, and supervised image-level labels are used to learn the object categories. It regards an image as an arrangement of visual words that depend on their contextual neighbors, and it represents the visual semantics of an object by the spatial arrangement of low-level visual features. In this respect, the model is superior to previous "bag-of-words" models, which ignore spatial contextual information in the modeling process. The higher-order statistics of visual words prove more discriminative for distinguishing object categories than the simple visual word distribution. To tackle the scale variance of objects, the multilayer extension of the visual language model calculates the conditional probabilities of visual words over image patches at multiple scales. A study of the frequencies of the visual words in the sample images shows that most of the effective words capture the regions of semantic objects in the image. The experiments demonstrate that, by adopting the spatial context, the multilayer visual language model outperforms the single-layer model. It is also superior in computational time to other state-of-the-art supervised methods such as 2-D MHMM and SVM.

ACKNOWLEDGMENT

This work was performed while L. Wu and Y. Hu were interns at Microsoft Research Asia.

REFERENCES

[1] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. Int. Conf. Computer Vision (ICCV'99), 1999, pp. 1150–1157.
[2] R. Marée, P. Geurts, J. Piater, and L. Wehenkel, "Random subwindows for robust image classification," in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR'05), Jun. 2005, vol. 1, pp. 34–40.
[3] A. Opelt, A. Pinz, and A. Zisserman, "A boundary-fragment-model for object detection," in Proc. European Conf. Computer Vision (ECCV'06), 2006.
[4] J. Z. Wang, J. Li, and G. Wiederhold, "SIMPLIcity: Semantics-sensitive integrated matching for picture libraries," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 9, pp. 947–963, Sep. 2001.
[5] K. Mikolajczyk and C. Schmid, "An affine invariant interest point detector," in Proc. European Conf. Computer Vision (ECCV'02), 2002, pp. 128–142.
[6] J. Bi, Y. Chen, and J. Wang, "A sparse support vector machine approach to region-based image categorization," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR'05), 2005, pp. 1121–1128.
[7] C. Dance, J. Willamowski, L. Fan, C. Bray, and G. Csurka, "Visual categorization with bags of keypoints," in Proc. European Conf. Computer Vision (ECCV'04) Int. Workshop on Statistical Learning in Computer Vision, 2004.
[8] T. Hofmann, "Probabilistic latent semantic indexing," in Proc. 22nd Annu. ACM Conf. Research and Development in Information Retrieval, Aug. 1999, pp. 50–57.
[9] D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, no. 5, pp. 993–1022, 2003.
[10] L. Cao and L. Fei-Fei, "Spatially coherent latent topic model for concurrent object segmentation and classification," in Proc. IEEE Int. Conf. Computer Vision (ICCV'07), 2007.
[11] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR'05), 2004, vol. 12, p. 178.
[12] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman, "Learning object categories from Google's image search," in Proc. 10th IEEE Int. Conf. Computer Vision (ICCV'05), 2005, vol. 2, pp. 1816–1823.
[13] P. Quelhas, F. Monay, J.-M. Odobez, D. Gatica-Perez, T. Tuytelaars, and L. V. Gool, "Modeling scenes with local descriptors and latent aspects," in Proc. 10th IEEE Int. Conf. Computer Vision (ICCV'05), 2005, vol. 1, pp. 883–890.


[14] B. C. Russell, W. T. Freeman, A. A. Efros, J. Sivic, and A. Zisserman, "Using multiple segmentations to discover objects and their extent in image collections," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR'06), 2006, pp. 1605–1614.
[15] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman, "Discovering objects and their localization in images," in Proc. 10th IEEE Int. Conf. Computer Vision (ICCV'05), 2005, vol. 1, pp. 370–377.
[16] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR'05), 2005, vol. 2, pp. 524–531.
[17] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman, "Discovering object categories in image collections," in Proc. IEEE Int. Conf. Computer Vision (ICCV'05), 2005.
[18] S. Lazebnik, C. Schmid, and J. Ponce, "A maximum entropy framework for part-based texture and object recognition," in Proc. 10th IEEE Int. Conf. Computer Vision (ICCV'05), 2005, vol. 1, pp. 832–838.
[19] S. Savarese, J. Winn, and A. Criminisi, "Discriminative object class models of appearance and shape by correlatons," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR'06), 2006, pp. 2033–2040.
[20] A. Agarwal and B. Triggs, "Hyperfeatures: Multilevel local coding for visual recognition," in Proc. European Conf. Computer Vision (ECCV'06), 2006.
[21] J. Yuan, Y. Wu, and M. Yang, "Discovery of collocation patterns: From visual words to visual phrases," in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR'07), 2007.
[22] J. Li and J. Z. Wang, "Automatic linguistic indexing of pictures by a statistical modeling approach," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 9, pp. 1075–1088, Sep. 2003.
[23] F. Cohen, Z. Fan, and M. Patel, "Classification of rotated and scaled textured images using Gaussian Markov random field models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 2, pp. 192–202, Feb. 1991.
[24] D. Larlus and F. Jurie, "Combining appearance models and Markov random fields for category level object segmentation," in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR'08), 2008.
[25] X. He, R. S. Zemel, and M. A. Carreira-Perpinan, "Multiscale conditional random fields for image labeling," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR'04), 2004.
[26] S. Z. Li, Markov Random Field Modeling in Image Analysis. New York: Springer-Verlag, 2001.
[27] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proc. 18th Int. Conf. Machine Learning (ICML'01). San Francisco, CA: Morgan Kaufmann, 2001, pp. 282–289.
[28] B. Wang, Z. W. Li, M. J. Li, and W. Y. Ma, "Duplicate detection for web image search," in Proc. IEEE Int. Conf. Multimedia & Expo (ICME'06), 2006.
[29] H. Othman and T. Aboulnasr, "Low complexity 2-D hidden Markov model for face recognition," in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS'00), 2000.
[30] P. Clarkson and R. Rosenfeld, "Statistical language modeling using the CMU-Cambridge toolkit," in Proc. Eurospeech'97, Rhodes, Greece, 1997, pp. 2707–2710.
[31] S. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Trans. Acoust., Speech, Signal Process., vol. 35, no. 3, pp. 400–401, Mar. 1987.
[32] K. Murphy, Hidden Markov Model Toolbox for Matlab, 2005. [Online]. Available: http://www.cs.ubc.ca/murphyk/Software/HMM/hmm.html

Lei Wu received the Bachelor's degree from the Special Class for the Gifted Young (SCGY) of the University of Science and Technology of China, Hefei, in 2005, where he is currently pursuing the Ph.D. degree in electronic engineering and information science. From 2006 to 2008, he was a Research Intern at Microsoft Research Asia, Beijing, China. His research interests include machine learning, multimedia retrieval, and computer vision. He received a Microsoft Fellowship in 2007.

Yang Hu (S’08) received the Bachelor’s degree from the University of Science and Technology of China, Hefei, in 2004. She is currently pursuing the Ph.D. degree in the Electronic Engineering and Information Science Department of the University of Science and Technology of China. Since August 2005, she has been a Research Intern at Microsoft Research Asia, Beijing, China. Her current research interests are in multimedia information retrieval, computer vision, and machine learning.

Mingjing Li (SM’08) received the B.S. degree in electrical engineering from the University of Science and Technology of China, Hefei, in 1989, and the Ph.D. degree in pattern recognition from the Institute of Automation, Chinese Academy of Sciences, in 1995. His research interests include pattern recognition and multimedia search.

Nenghai Yu is currently a Professor in the Department of Electronic Engineering and Information Science, University of Science and Technology of China (USTC). He is the Executive Director of the MOE-Microsoft Key Laboratory of Multimedia Computing and Communication and the Director of the Information Processing Center at USTC. He graduated from Tsinghua University, Beijing, China, receiving the M.Sc. degree in electronic engineering in 1992, and then joined USTC, where he has worked ever since. He received the Ph.D. degree in information and communications engineering from USTC, Hefei, China, in 2004. His research interests are in the fields of multimedia information retrieval, digital media analysis and representation, media authentication, video surveillance, and communications. He has been responsible for many national research projects. Based on his contributions, Professor Yu and his research group won both the Excellent Person Award and the Excellent Collectivity Award from the National Hi-tech Development Project of China in 2004. He has contributed more than 80 papers to journals and international conferences.

Xian-Sheng Hua (M’05) received the B.S. and Ph.D. degrees from Peking University, Beijing, China, in 1996 and 2001, respectively, both in applied mathematics. When he was with Peking University, his major research interests were in the areas of image processing and multimedia watermarking. Since 2001, he has been with Microsoft Research Asia, Beijing, China, where he is currently a Lead Researcher with the Internet Media Group. He is also an Adjunct Professor at the University of Science and Technology of China, Heifei. His current interests are in the areas of video content analysis, multimedia search, management, authoring, sharing, and advertising. He has authored more than 130 publications in these areas and has more than 30 filed patents or pending applications. Dr. Hua is a member of the Association for Computing Machinery (ACM). He serves as an Associate Editor of the IEEE TRANSACTIONS ON MULTIMEDIA and is on the Editorial Board Member of Multimedia Tools and Applications. He received the Best Paper Award and Best Demonstration Award from ACM Multimedia in 2007 and the 2008 MIT Technology Review TR35 Young Innovator Award.
