IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 11, NO. 3, APRIL 2009


Beyond Distance Measurement: Constructing Neighborhood Similarity for Video Annotation

Meng Wang, Xian-Sheng Hua, Member, IEEE, Jinhui Tang, Member, IEEE, and Richang Hong

Abstract—In the past few years, video annotation has benefited greatly from the progress of machine learning techniques. Recently, graph-based semi-supervised learning has gained much attention in this domain. However, as a crucial factor in these algorithms, the estimation of pairwise similarity has not been sufficiently studied. Generally, the similarity of two samples is estimated based on the Euclidean distance between them. But we will show that the similarity between two samples is not merely related to their distance but also to the distribution of surrounding samples and labels. It is shown that the traditional distance-based similarity measure may lead to high classification error rates even on several simple datasets. To address this issue, we propose a novel neighborhood similarity measure, which explores the local sample and label distributions. We show that the neighborhood similarity between two samples simultaneously takes into account three characteristics: 1) their distance; 2) the distribution difference of the surrounding samples; and 3) the distribution difference of the surrounding labels. Extensive experiments have demonstrated the superiority of the neighborhood similarity over the existing distance-based similarity.

Index Terms—Neighborhood similarity, semi-supervised learning, video annotation.

I. INTRODUCTION

With rapid advances in storage devices, networks, and compression techniques, large-scale video data have become available to ordinary users. Hence, content-based video retrieval has become an increasingly active field. It is well known that a central problem in this field is the semantic gap between low-level features and high-level queries. Recent studies reveal that video semantic annotation is able to deal with this issue [13], [14]. By annotating a large set of concepts on the video dataset, the semantic gap can be bridged by mapping the high-level queries to these concepts. A typical approach to accomplishing automatic video semantic annotation is to apply machine learning methods. For

Manuscript received May 08, 2008; revised October 22, 2008. First published February 18, 2009; current version published March 18, 2009. A shorter, four-page version of this paper was published at ACM Multimedia 2007. Compared with the preliminary version [31], we have made enhancements in three aspects: 1) we provide a fuller introduction to existing related works; 2) we conduct more empirical evaluations; and 3) more discussions and analyses are provided. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Alan Hanjalic. M. Wang and X.-S. Hua are with Microsoft Research Asia, Beijing 100080, China (e-mail: [email protected]; [email protected]). J. Tang is with the National University of Singapore (e-mail: [email protected]). R. Hong is with the University of Science and Technology of China, Hefei 230027, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TMM.2009.2012919

each concept, its annotation can be formulated as a binary classification task. However, learning-based methods typically need a large labeled training set in order to conquer the gap between the low-level features and the to-be-annotated semantic concepts. As human annotation is labor-intensive and time-consuming (experiments show that annotating 1 h of video with 100 concepts can typically take anywhere between 8 and 15 h [18]), several methods that can help reduce human effort have been proposed, such as semi-supervised learning, active learning, and mining from the web. Semi-supervised learning is the most widely applied approach to addressing this issue [8], [40]. By leveraging unlabeled data, semi-supervised learning is expected to build more accurate models than those that can be achieved by purely supervised methods. Recently, graph-based semi-supervised learning [5], [23], [41] has attracted great interest for its effectiveness and efficiency. These methods define a graph where the vertices are the labeled and unlabeled samples and the edges reflect the similarities between sample pairs. A labeling function is then estimated on the graph. The label smoothness over the graph is characterized in a regularization framework, which has a regularizer term and a loss function term. However, despite many different works dedicated to graph-based methods, the graph construction has not been sufficiently studied. For example, the similarity between two samples is often simply estimated based on their Euclidean distance and a smoothing factor, i.e.,

$$w_{ij} = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right) \qquad (1)$$

However, we argue that similarities are not merely related to distances. We consider the following two intuitive examples:

1) Example 1: Fig. 1(a) illustrates three samples $\mathbf{x}_1$, $\mathbf{x}_2$, and $\mathbf{x}_3$, where the Euclidean distance between $\mathbf{x}_1$ and $\mathbf{x}_2$ is equal to that between $\mathbf{x}_1$ and $\mathbf{x}_3$. Consequently, the distance-based similarity of $\mathbf{x}_1$ and $\mathbf{x}_2$ is equal to that of $\mathbf{x}_1$ and $\mathbf{x}_3$, i.e., $w_{12} = w_{13}$. However, after observing the distribution of the neighbors around $\mathbf{x}_1$, $\mathbf{x}_2$, and $\mathbf{x}_3$ in Fig. 1(b), we find that intuitively it is more rational to let $w_{12} > w_{13}$, since the sample distributions around $\mathbf{x}_1$ and $\mathbf{x}_2$ are much more similar than those around $\mathbf{x}_1$ and $\mathbf{x}_3$ (note that the local distribution around $\mathbf{x}_3$ is much tighter). In Section V we will conduct experiments on this synthetic dataset to verify this intuition. We will show that if we apply the distance-based similarity, graph-based methods will yield high classification error rates due to severe classification boundary shift.
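For reference, a minimal sketch (ours, in Python with NumPy) of the distance-based similarity in (1); the function and variable names are our own:

```python
import numpy as np

def distance_based_similarity(X, sigma=1.0):
    """Pairwise similarity of (1): w_ij = exp(-||x_i - x_j||^2 / (2*sigma^2)).

    X: (n, d) array of samples; sigma: smoothing factor.
    Returns an (n, n) affinity matrix with zero diagonal.
    """
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # W_ii is set to 0, as required in Section III
    return W
```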



Fig. 1. An illustrative example showing that similarity estimation can potentially benefit from exploiting neighborhood sample distributions.

Fig. 2. An illustrative example showing that similarity estimation can potentially benefit from exploiting neighborhood label distributions.

2) Example 2: We observe another illustrative example. In Fig. 2, the distance between the samples $\mathbf{x}_1$ and $\mathbf{x}_2$ is also equal to that between $\mathbf{x}_1$ and $\mathbf{x}_3$, and accordingly the traditional distance-based similarity of $\mathbf{x}_1$ and $\mathbf{x}_2$ is equal to that of $\mathbf{x}_1$ and $\mathbf{x}_3$, i.e., $w_{12} = w_{13}$. Then we observe the labels of the neighbors around these three samples. We can see that it is also more rational to let $w_{12} > w_{13}$, since $\mathbf{x}_1$ and $\mathbf{x}_2$ are more likely to belong to the same class, which follows from the fact that the surrounding label distributions of $\mathbf{x}_1$ and $\mathbf{x}_2$ are more similar than those around $\mathbf{x}_1$ and $\mathbf{x}_3$.

From these two examples, we can see that the traditional distance-based similarity can be enhanced by leveraging the information embedded in the local distributions of samples and labels. In this paper, we propose a similarity measure named neighborhood similarity which is able to explore this information. It consists of two components, namely, the neighborhood sample similarity and the neighborhood label similarity, which are developed based on the local sample distribution and the local label distribution, respectively. To explore the sample distribution, we model the distribution of every sample's neighborhood and then compute the pairwise Kullback–Leibler divergence of these distributions. To explore the label distribution, we compute the difference of the label histograms around the corresponding samples. In this way, we will show that the neighborhood similarity between two samples simultaneously takes into account the following three characteristics: 1) their distance; 2) the distribution difference of the surrounding samples; and 3) the distribution difference of the surrounding labels.

The main contributions of this paper can be summarized as follows:

1) we analyze the limitation of the traditional distance-based similarity;

2) we propose the neighborhood sample similarity, which is able to take into account both the distance between samples and the difference of their surrounding local distributions; and

3) we propose the neighborhood label similarity, which considers the surrounding label distribution of each sample.

The neighborhood similarity was first introduced in our previous work [31]. However, the "neighborhood similarity" proposed in [31] is actually merely the neighborhood sample similarity of this work. Here we further introduce the neighborhood label similarity and integrate it with the neighborhood sample similarity, such that the local distributions of samples and labels can be exploited simultaneously.

The organization of the rest of this paper is as follows. In Section II, we provide a review of related works, including video annotation, graph-based semi-supervised learning, and the similarity estimation issue. In Section III, we introduce the adopted graph-based semi-supervised methods. In Section IV, we detail the proposed neighborhood similarity. Experimental results are presented in Section V. Finally, we conclude the paper in Section VI.

II. RELATED WORKS

A. Video Annotation and Semi-Supervised Learning

Automatic video annotation (also referred to as "video concept detection" [20], "video semantic analysis" [26], or "high-level feature extraction" [16]) aims to assign suitable concepts to video clips according to their content. Typically it can be accomplished by machine learning methods. A learning-based video annotation method works as follows. First, videos are segmented into short units such as shots and


sub-shots. Then, low-level features are extracted from each unit to describe its content. Video annotation is then formalized as learning a set of predefined concepts for each unit based on these low-level features. Since the to-be-annotated concepts may not be mutually exclusive (such as the concepts "street" and "outdoor"), a general scheme is to conduct a binary classification procedure for each concept. Given a concept, each unit is then annotated as "positive" or "negative" according to whether it is associated with this concept. The National Institute of Standards and Technology (NIST) has established "high-level feature extraction" as a task in the TREC Video Retrieval Evaluation (TRECVID) [1], [24], which aims to provide a benchmark for evaluating video annotation technologies. Many different learning methods have been used in this task [4], [20], [25]. For example, Amir et al. [4] have utilized a diverse set of learning methods for the TRECVID 2005 high-level feature extraction task, including support vector machines, Gaussian mixture models, maximum entropy methods, a modified nearest-neighbor classifier, and multiple instance learning. Naphade et al. [20] have presented a survey on the benchmark, where a great number of different algorithms applied in this task can be found.

As previously mentioned, training data insufficiency is an obstacle in video annotation, and this has turned the attention of many researchers to the topic of semi-supervised learning [8], [40]. However, many semi-supervised learning algorithms proposed in the machine learning community have not shown satisfying performance in video annotation. This can be attributed to the large variation and diversity of video data. Cohen et al. [9] have shown that semi-supervised learning methods can only improve performance when the assumed prior model is correct, but it is difficult to accurately model video semantic concepts. In [27], Song et al. applied co-training to video annotation based on a careful split of visual features. In [33], Yan et al. pointed out the drawbacks of co-training in video annotation, and proposed an improved co-training-style algorithm named semi-supervised cross-feature learning. Recently, graph-based semi-supervised learning methods, which are nonparametric models and thus avoid the prior model assumption issue [29], have attracted great interest in this community due to their effectiveness and efficiency in this application. In [15], He et al. adopted a graph-based method named manifold-ranking in image retrieval, and Yuan et al. [37] then applied the same algorithm to video annotation. Tang et al. proposed a graph-based method named kernel linear neighborhood propagation and demonstrated its effectiveness in video annotation [28]. Wang et al. developed a multigraph learning method, such that several difficulties in video annotation can be addressed in a unified scheme [30].

However, despite the extensive research on graph-based semi-supervised learning, the graph construction issue has not been sufficiently studied. Usually the graph edge weights are simply estimated based on the pairwise Euclidean distances. But in Section I we have shown that the distance-based similarity can be enhanced by leveraging the information embedded


in the local sample and label distributions. The neighborhood similarity is thus proposed to address this issue, and extensive experiments will demonstrate its effectiveness.

B. Similarity Estimation

It is well known that similarity estimation plays a crucial role in many different machine learning algorithms, and extensive research has been dedicated to this topic for decades. In most cases, the similarity between two samples is estimated based on a distance measure, e.g., the widely-applied Minkowski distance [38], which measures the distance between two samples $\mathbf{x}_i$ and $\mathbf{x}_j$ as

$$d(\mathbf{x}_i, \mathbf{x}_j) = \left(\sum_{k=1}^{d} |x_{ik} - x_{jk}|^{p}\right)^{1/p} \qquad (2)$$

The similarity shown in (1) can be regarded as developed based on the Euclidean distance (i.e., the Minkowski distance with $p = 2$).
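A minimal sketch (ours, in Python with NumPy) of the Minkowski distance in (2); setting p = 2 recovers the Euclidean distance underlying (1):

```python
import numpy as np

def minkowski_distance(x_i, x_j, p=2):
    """Minkowski distance of (2) between two feature vectors;
    p=2 gives the Euclidean distance used by the similarity in (1)."""
    return np.sum(np.abs(x_i - x_j) ** p) ** (1.0 / p)
```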

Aggarwal et al. have given a general study on Minkowski distance-based similarity and demonstrated that the optimal choice of $p$ should depend on the dimension of the feature space [3]. Sebe et al. [22] and Yu et al. [36] have clarified this problem from the maximum likelihood perspective, and show that the optimal distance measure depends on the data distribution. Distance metric learning, which is also an intensively studied topic in the machine learning community, aims to construct a suitable distance metric based on training data [10], [32], [35]. But these methods are usually computationally intensive and prone to overfitting, especially when the training samples are limited and the dimensionality of the feature space is high [35]. Several studies have also been conducted on similarity estimation for specific applications and features. For example, the tangent distance has shown encouraging performance in handwritten character recognition [11], and the earth mover's distance (EMD) is superior for histogram features in image retrieval/annotation [21]. But these specific methods are developed based on certain domain knowledge; in contrast, the proposed neighborhood similarity is a general approach that can deal with generic learning tasks.

III. GRAPH-BASED SEMI-SUPERVISED LEARNING

In graph-based semi-supervised learning methods, a graph is defined where the vertices are the labeled and unlabeled samples and the edges reflect the similarities between sample pairs. These algorithms are all based on a label smoothness assumption, which requires the labeling function to simultaneously satisfy the following two conditions: 1) it should be close to the ground truth on the labeled vertices; and 2) it should be smooth on the whole graph. Generally, these two conditions are expressed in a regularization framework. Here we consider the two most widely applied graph-based methods, i.e., the Gaussian random fields (GRF) method [42] and the learning with local and global consistency (LLGC) method [39]. Consider a $c$-class classification problem.


There are $l$ labeled samples $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_l, y_l)\}$ and $n - l$ unlabeled samples $\{\mathbf{x}_{l+1}, \ldots, \mathbf{x}_n\}$, where $n$ is the total number of samples. Denote by $W$ an $n \times n$ affinity matrix in which $W_{ij}$ indicates the similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$, and $W_{ii}$ is set to 0. Denote by $D$ a diagonal matrix whose $(i, i)$-element equals the sum of the $i$th row of $W$. Define an $n \times c$ label matrix $Y$ in which $Y_{ij}$ is 1 if $\mathbf{x}_i$ is a labeled sample belonging to class $j$, and 0 otherwise. Define an $n \times c$ matrix $F$ in which $F_{ij}$ is the confidence of $\mathbf{x}_i$ having label $j$; the classification rule assigns each sample $\mathbf{x}_i$ the label $y_i = \arg\max_{j} F_{ij}$. Then the GRF¹ and the LLGC methods are formulated as

$$\min_{F} \frac{1}{2} \sum_{i,j=1}^{n} W_{ij} \| F_i - F_j \|^2 + \lambda \sum_{i=1}^{l} \| F_i - Y_i \|^2 \qquad (3)$$

$$\min_{F} \frac{1}{2} \sum_{i,j=1}^{n} W_{ij} \left\| \frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}} \right\|^2 + \mu \sum_{i=1}^{n} \| F_i - Y_i \|^2 \qquad (4)$$

¹The GRF method proposed in [42] actually uses the hard constraint $F_i = Y_i$ for labeled samples $i \le l$, which equals setting $\lambda$ to $\infty$ in (3). Here we have relaxed this constraint [see (3)], and the studies in [6] have demonstrated that this modification improves the learning performance.

The above two regularization frameworks both have two terms, which indicate the smoothness of the labels on the graph and the constraint of the training data, respectively. These two methods have closed-form solutions, but more frequently they are solved in an iterative manner for efficiency. The iterative solution processes for GRF and LLGC are illustrated in Figs. 3 and 4, respectively. More detailed implementation issues of these two methods can be found in [37], [39], and [41].

Fig. 3. Iterative solution for GRF.

Fig. 4. Iterative solution for LLGC.

Obviously the construction of the similarity matrix $W$ is crucial to the performance of these algorithms. However, despite many different works dedicated to graph-based semi-supervised learning, this issue has not been sufficiently studied. A widely applied method is to estimate pairwise similarity according to Euclidean distance, e.g., (1). But as previously analyzed, similarities are not merely related to distances. This motivates our study on embedding more information into the similarity measure such that the performance of graph-based methods can be improved.
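For illustration, a minimal sketch (ours, in Python with NumPy) of the iterative solutions referenced in Figs. 3 and 4; the LLGC update follows the standard form F(t+1) = αSF(t) + (1−α)Y of [39], while the GRF sketch uses the classic propagate-and-clamp form of [42] rather than the relaxed variant in (3):

```python
import numpy as np

def llgc_iterate(W, Y, alpha=0.99, n_iter=100):
    """LLGC iteration F(t+1) = alpha * S @ F(t) + (1 - alpha) * Y,
    with S = D^{-1/2} W D^{-1/2} (see [39])."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1.0 - alpha) * Y
    return F

def grf_iterate(W, Y, labeled_mask, n_iter=100):
    """Classic GRF/label-propagation iteration [42]: propagate
    F(t+1) = D^{-1} W F(t), then clamp the labeled rows to Y."""
    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = P @ F
        F[labeled_mask] = Y[labeled_mask]  # hard constraint on labeled data
    return F
```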

IV. NEIGHBORHOOD SIMILARITY

In this section, we first provide the definitions of the neighborhood sample similarity and the neighborhood label similarity, and then we integrate these two similarities to obtain the neighborhood similarity.

A. Neighborhood Sample Similarity

To exploit the difference between local sample distributions, we model the distribution of every sample's neighborhood and then compute the pairwise divergence of these distributions. Here we use a normal distribution with mean vector $\mathbf{x}_i$ to model the neighbors around $\mathbf{x}_i$, i.e.,

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \mathbf{x}_i)^{T} \Sigma_i^{-1} (\mathbf{x} - \mathbf{x}_i)\right) \qquad (5)$$

Of course, the real distribution of a certain sample's neighborhood might be far from normal. However, the estimated covariance matrices of the normal distributions are sufficient to encode the structure information. We will see that this assumption gives rise to a rather rational neighborhood sample similarity measure. Denote by $N(\mathbf{x}_i)$ the set of neighbors of $\mathbf{x}_i$. Then the covariance matrix $\Sigma_i$ can be estimated by maximum likelihood as

$$\Sigma_i = \frac{1}{|N(\mathbf{x}_i)|} \sum_{\mathbf{x} \in N(\mathbf{x}_i)} (\mathbf{x} - \mathbf{x}_i)(\mathbf{x} - \mathbf{x}_i)^{T} \qquad (6)$$

Now we can compute the symmetric Kullback–Leibler (KL) divergence [17] between $N(\mathbf{x}_i)$ and $N(\mathbf{x}_j)$ as

$$D_{KL}(\mathbf{x}_i, \mathbf{x}_j) = \frac{1}{2} \operatorname{tr}\left(\Sigma_i^{-1} \Sigma_j + \Sigma_j^{-1} \Sigma_i - 2I\right) + \frac{1}{2} (\mathbf{x}_i - \mathbf{x}_j)^{T} \left(\Sigma_i^{-1} + \Sigma_j^{-1}\right) (\mathbf{x}_i - \mathbf{x}_j) \qquad (7)$$

The neighborhood sample similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$ is then defined based on this KL divergence.

Definition 1 (Neighborhood Sample Similarity): The neighborhood sample similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$ is defined as

$$w^{s}_{ij} = \exp\left(-\frac{D_{KL}(\mathbf{x}_i, \mathbf{x}_j)}{2\sigma_s^2}\right) \qquad (8)$$

Note that (7) involves the inversion of the $d \times d$ matrices $\Sigma_i$ and $\Sigma_j$, and the computational cost scales as $O(d^3)$ if we adopt full covariance matrices. Here we use diagonal covariance matrices to simplify the computation, i.e.,

$$\Sigma_i = \operatorname{diag}\left(\sigma_{i1}^2, \sigma_{i2}^2, \ldots, \sigma_{id}^2\right) \qquad (9)$$

where $\sigma_{ik}^2$ is the $k$th diagonal entry of $\Sigma_i$, and $x_{ik}$ and $x_{jk}$ are the $k$th components of the $d$-dimensional vectors $\mathbf{x}_i$ and $\mathbf{x}_j$, respectively. Then (7) turns into

$$D_{KL}(\mathbf{x}_i, \mathbf{x}_j) = \frac{1}{2} \sum_{k=1}^{d} \left(\frac{\sigma_{ik}^2}{\sigma_{jk}^2} + \frac{\sigma_{jk}^2}{\sigma_{ik}^2} - 2\right) + \frac{1}{2} \sum_{k=1}^{d} \left(\frac{1}{\sigma_{ik}^2} + \frac{1}{\sigma_{jk}^2}\right) (x_{ik} - x_{jk})^2 \qquad (10)$$

Obviously, the neighborhood sample similarity estimation can be decomposed into two items, i.e.,

$$w^{s}_{ij} = w^{s1}_{ij} \cdot w^{s2}_{ij} \qquad (11)$$

where

$$w^{s1}_{ij} = \exp\left(-\frac{1}{4\sigma_s^2} \sum_{k=1}^{d} \left(\frac{1}{\sigma_{ik}^2} + \frac{1}{\sigma_{jk}^2}\right) (x_{ik} - x_{jk})^2\right) \qquad (12)$$

$$w^{s2}_{ij} = \exp\left(-\frac{1}{4\sigma_s^2} \sum_{k=1}^{d} \left(\frac{\sigma_{ik}^2}{\sigma_{jk}^2} + \frac{\sigma_{jk}^2}{\sigma_{ik}^2} - 2\right)\right) \qquad (13)$$

Obviously these two items are based on the weighted distance between the two samples [later we will show that this weighted distance-based similarity outperforms the traditional Euclidean distance-based similarity illustrated in (1)] and the difference between the neighborhood structures, respectively.

B. Neighborhood Label Similarity

To explore the label information for similarity estimation, we estimate the label histogram in the neighborhood of each sample. Suppose there are $n_{ij}$ samples labeled as class $j$ in $N(\mathbf{x}_i)$, where $0 \le j \le c$. To simplify our analysis, we assume that all unlabeled samples are labeled as "0" or "null" in the following discussion. Then the label histogram around sample $\mathbf{x}_i$ is defined as

$$\mathbf{h}_i = \left(\frac{n_{i0}}{|N(\mathbf{x}_i)|}, \frac{n_{i1}}{|N(\mathbf{x}_i)|}, \ldots, \frac{n_{ic}}{|N(\mathbf{x}_i)|}\right) \qquad (14)$$

Definition 2 (Neighborhood Label Similarity): The neighborhood label similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$ is defined as

$$w^{l}_{ij} = \exp\left(-\frac{\|\mathbf{h}_i - \mathbf{h}_j\|^2}{2\sigma_l^2}\right) \qquad (15)$$

where $\sigma_l$ is a smoothing factor.

Note that using solely the neighborhood label similarity is meaningless, since it only utilizes the local label distribution and does not contain any information about the distance between the samples. But we can integrate it with the neighborhood sample similarity, such that learning performance can be further improved.

C. Neighborhood Similarity

The neighborhood similarity is defined as the integration of the neighborhood sample similarity and the neighborhood label similarity.

Definition 3 (Neighborhood Similarity): The neighborhood similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$ is defined as

$$w_{ij} = w^{s}_{ij} \cdot w^{l}_{ij} \qquad (16)$$

According to the previous analysis [see (11)], the neighborhood similarity can be decomposed into three items, i.e.,

$$w_{ij} = w^{s1}_{ij} \cdot w^{s2}_{ij} \cdot w^{l}_{ij} \qquad (17)$$

These three items indicate distance-based similarity, structure difference-based similarity, and label distribution difference-based similarity, respectively.
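To make the construction concrete, here is a minimal sketch (ours, in Python with NumPy) of the neighborhood similarity of Definition 3 under the diagonal-covariance simplification of (9) and (10); the function and variable names, the regularization constant, and the optional nbrs argument are our own additions:

```python
import numpy as np

def neighborhood_similarity(X, labels, k=10, sigma_s=0.1, sigma_l=0.1,
                            n_classes=1, nbrs=None):
    """Neighborhood similarity w_ij = w^s_ij * w^l_ij, following (8)-(16).

    X: (n, d) samples. labels: length-n int array, 0 for unlabeled ("null")
    and 1..n_classes for labeled samples. k: neighborhood size. nbrs:
    optional (n, k) precomputed neighbor indices; if None, neighbors are
    identified by Euclidean distance (the bootstrap step of Sec. IV-D).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, d = X.shape
    if nbrs is None:
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        nbrs = np.argsort(sq, axis=1)[:, 1:k + 1]  # exclude the sample itself

    var = np.empty((n, d))               # diagonal covariances, (9)
    hist = np.empty((n, n_classes + 1))  # label histograms, (14)
    for i in range(n):
        diff = X[nbrs[i]] - X[i]
        var[i] = np.mean(diff ** 2, axis=0) + 1e-8  # ML estimate, regularized
        hist[i] = (np.bincount(labels[nbrs[i]], minlength=n_classes + 1)
                   / nbrs.shape[1])

    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # symmetric KL divergence (10) between the diagonal Gaussians
            kl = 0.5 * np.sum(var[i] / var[j] + var[j] / var[i] - 2.0)
            kl += 0.5 * np.sum((1.0 / var[i] + 1.0 / var[j])
                               * (X[i] - X[j]) ** 2)
            ws = np.exp(-kl / (2.0 * sigma_s ** 2))              # (8)
            wl = np.exp(-np.sum((hist[i] - hist[j]) ** 2)
                        / (2.0 * sigma_l ** 2))                  # (15)
            W[i, j] = W[j, i] = ws * wl                          # (16)
    return W
```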


D. Discussion

The neighborhood sample similarity and the neighborhood label similarity can also be understood from the perspective of the smoothness of probability densities and class probabilities, respectively. Note that all samples are i.i.d. generated from the probability density function, and thus, by supposing that the values of the probability density in the local regions around two nearby samples are close, their neighborhood structures should be similar. Analogously, under the smoothness assumption of class probabilities, we can derive that the label distributions around nearby samples should be close, since the labels can be viewed as generated from the class probability functions.

As previously mentioned, the traditional graph-based semi-supervised learning methods are based on the assumption that the labels of nearby samples should be close. By applying the neighborhood sample similarity, the assumption turns into: the labels of nearby samples with similar local structures are close, i.e., the assumption has been enhanced. That is why the neighborhood sample similarity is able to outperform the distance-based similarity. Now we observe the effect of integrating the neighborhood label similarity.

1) Considering the smoothness of labels, if the label histograms $\mathbf{h}_i$ and $\mathbf{h}_j$ are close, then $\mathbf{x}_i$ and $\mathbf{x}_j$ are likely to belong to the same class, and thus $w^{s}_{ij}$ can be enhanced by multiplying the neighborhood label similarity $w^{l}_{ij}$. Otherwise, $\mathbf{x}_i$ and $\mathbf{x}_j$ are likely to belong to different classes, and $w^{s}_{ij}$ can be reduced. Thus, integrating the neighborhood label similarity can enhance the discriminative abilities of the graphs.

2) Suppose an extreme case, namely, $l = 0$. Then for each $i$ we have $\mathbf{h}_i = (1, 0, \ldots, 0)$, and consequently $w^{l}_{ij} = 1$. This indicates that $w_{ij} = w^{s}_{ij}$. Therefore, in this case the impact of the neighborhood label similarity is limited, and the proposed neighborhood similarity degenerates to the neighborhood sample similarity. This implies that integrating the neighborhood label similarity will at least not degrade learning performance.

It is also noteworthy that there is a chicken-and-egg dilemma in the neighborhood similarity estimation: on one hand, to construct the neighborhood similarity, we need to identify the neighbors of each sample; on the other hand, to identify the neighbors, we need to know the appropriate similarity measure. In our work, we first identify the neighbors around each sample using the Euclidean distance and then construct the neighborhood similarity measure based on these results. But we can also further refine the neighborhood of each sample based on the new similarity measure, then derive the neighborhood similarity again, and this process can be repeated. Of course, this process will increase the computational cost proportionally.
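A sketch of this optional refinement loop (ours; it reuses the neighborhood_similarity sketch above, whose nbrs argument accepts precomputed neighborhoods):

```python
import numpy as np

def refined_neighborhood_similarity(X, labels, n_rounds=2, k=10, **kwargs):
    """Refinement loop of Sec. IV-D: bootstrap neighborhoods with Euclidean
    distance, then alternately re-select each sample's k neighbors under the
    current similarity and rebuild the similarity. Each extra round adds a
    proportional computational cost."""
    W = neighborhood_similarity(X, labels, k=k, **kwargs)  # Euclidean bootstrap
    for _ in range(n_rounds - 1):
        nbrs = np.argsort(-W, axis=1)[:, :k]  # k most similar under current W
        W = neighborhood_similarity(X, labels, k=k, nbrs=nbrs, **kwargs)
    return W
```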

V. EXPERIMENTS

To evaluate the performance of the proposed neighborhood similarity in graph-based methods, we conduct experiments for

Fig. 5. “Left-right-circle” binary classification task with two training samples. (a) Labels of all samples. (b) Two training samples.

three different applications, including 1) toy problems; 2) handwritten digit recognition; and 3) video annotation. We compare the performance of the following six methods: 1) GRF with distance-based similarity [see (1)]; 2) GRF with neighborhood sample similarity; 3) GRF with neighborhood similarity; 4) LLGC with distance-based similarity; 5) LLGC with neighborhood sample similarity; and 6) LLGC with neighborhood similarity. These six methods are denoted by GRF.DS, GRF.NSS, GRF.NS, LLGC.DS, LLGC.NSS, and LLGC.NS, respectively. The detailed implementation issues of GRF and LLGC can be found in [37], [39], and [41]. The parameters $\lambda$ in GRF and $\mu$ in LLGC are empirically set to 100 and 0.01, respectively. The parameter $N$ (neighborhood size) is simply set to 10 for all experiments if not specifically indicated, but in fact this parameter can also be tuned such that better performance can be obtained.

A. Toy Problem

We conduct binary classification experiments on two synthetic datasets. The first dataset, named "left-right-circle", is illustrated in Fig. 5. There are 130 two-dimensional samples that are uniformly distributed within two circles. For each class one training sample is labeled, as illustrated in Fig. 5(b). The second dataset, named "interior-exterior-circle" and containing 130 samples as well, is illustrated in Fig. 6. This dataset is more difficult for classification since many samples are close to the boundary. We randomly label three samples for each class, as illustrated in Fig. 6(b). The smoothing factors (including $\sigma$, $\sigma_s$, and $\sigma_l$) for these two classification tasks are all set to 0.1.

The experimental results obtained by the different methods for these two tasks are illustrated in Figs. 7 and 8, respectively. First, we observe the results for "left-right-circle". From Fig. 7 it is clear that when applying the distance-based similarity, the classification boundaries yielded by GRF and LLGC are significantly biased towards the sparser class, and thereby the performance of these two methods severely degrades. The classification accuracies obtained by these two methods are only 74.6% and 76.9%. On the contrary, by applying the neighborhood sample similarity, the problem is successfully alleviated and the


Fig. 6. “Interior-exterior-circle” binary classification task with six training samples. (a) Labels of all samples. (b) Six training samples.

Fig. 8. Performance comparison for six different methods on the “interior-exterior-circle” classification task.

Fig. 7. Performance comparison for six different methods on the "left-right-circle" classification task.

performance can be significantly improved. With merely two training samples, GRF.NSS and LLGC.NSS attain a high classification accuracy of 98.5%, i.e., only two samples are misclassified. Applying the neighborhood similarity achieves the same results as the neighborhood sample similarity. This means that the neighborhood label similarity has no effect in this task. This is attributed to two facts: 1) the labeled samples are too limited (see the analysis in Section IV-D); and 2) the neighborhood sample similarity already achieves very high classification accuracy, and this limits the space for improvement.

Then we observe the results in Fig. 8. We can see that applying the neighborhood sample similarity significantly outperforms the traditional distance-based similarity for the "interior-exterior-circle" dataset as well. Furthermore, we can see that applying the neighborhood similarity slightly outperforms merely using the neighborhood sample similarity. This indicates that the neighborhood label similarity can help further improve performance even with merely six labeled samples.

B. Digit Recognition

We conduct experiments on the "handwritten digit recognition" dataset from Cedar Buffalo [42]. The dataset contains 1100 samples for each digit from "0" to "9". Each digit image is a 16 × 16 grid, i.e., it is represented by a 256-dimensional feature vector. Based on this dataset, we generate two classification tasks: 1) ten-way classification of all digits; and 2) classification of even versus odd digits. In the first task, there are ten classes and each class contains 1100 samples, and in the second task there


Fig. 9. Performance comparison of different methods for digit classification. (a) Comparison of SVM, GRF.DS, GRF.NSS1, GRF.NSS, and GRF.NS for ten-way digit classification. (b) Comparison of SVM, LLGC.DS, LLGC.NSS1, LLGC.NSS, and LLGC.NS for ten-way digit classification. (c) Comparison of SVM, GRF.DS, GRF.NSS1, GRF.NSS, and GRF.NS for even/odd digit classification. (d) Comparison of SVM, LLGC.DS, LLGC.NSS1, LLGC.NSS, and LLGC.NS for even/odd digit classification.

are two classes and 5500 samples for each class. We gradually increase the labeled data size $l$, and perform 10 trials for each $l$. In each trial we randomly select $l$ labeled samples and use the rest of the samples as testing data. Besides applying GRF and LLGC with the different similarities, we also illustrate the results obtained by a support vector machine (SVM) [7] for comparison. Since there is no reliable model selection approach when labeled samples are extremely few, we tune the following parameters to their optimal values: the smoothing factors (including $\sigma$, $\sigma_s$, and $\sigma_l$) in all the graph-based semi-supervised learning methods, and the radius parameter of the RBF kernel and the trade-off between training error and margin in the SVM model.

Fig. 9 illustrates the performance comparison of the seven methods. From the figure we can see that the performances of GRF and LLGC have been significantly improved by applying the neighborhood similarity. Due to the small sizes of the labeled sets in these two tasks, the performance gaps between the neighborhood sample similarity and the neighborhood similarity are small in magnitude. In Section V-C, we will demonstrate more significant improvements from the neighborhood sample similarity to the neighborhood similarity in video annotation. To further investigate the neighborhood similarity, we also illustrate in Fig. 9 the results achieved by using solely the first item in the neighborhood sample similarity, i.e., $w^{s1}_{ij}$ in (12), which are indicated by GRF.NSS1 and LLGC.NSS1, respectively. Comparing (12) and (1), we can see that the only difference lies in the weights applied to each feature dimension, which are obtained from the local

structures. From the results we can see that the first item in the neighborhood sample similarity can considerably outperform the traditional Euclidean distance-based similarity, and this indicates the effectiveness of the weights. We can also see that GRF.NSS and LLGC.NSS perform better than GRF.NSS1 and LLGC.NSS1, especially when labeled samples are limited. This indicates the positive impact of the second item in the neighborhood sample similarity.

We also study the stability of the neighborhood similarity with respect to the setting of the neighborhood size $N$. We increase $N$ from 5 to 20 and illustrate the performance of GRF.NS and LLGC.NS for the two classification tasks in Fig. 10. We also illustrate the performance of GRF.DS and LLGC.DS for comparison. We can see that the performance of GRF.NS and LLGC.NS varies only slightly with different $N$, and using the neighborhood similarity consistently outperforms the distance-based similarity in all cases. This indicates the stability of the neighborhood similarity.

C. Video Annotation

To evaluate the performance of the proposed approach, we conduct experiments on the benchmark video corpus of TRECVID 2005. The dataset consists of 137 news videos recorded from 13 different programs in English, Arabic, and Chinese [1]. The videos are about 160 h in duration, and they are segmented into 49 532 shots and 61 901 sub-shots. A key-frame

Fig. 10. Performance variation of the neighborhood similarity-based methods with respect to neighborhood size N. (a) Comparison of GRF.NS with different N for ten-way digit classification. (b) Comparison of LLGC.NS with different N for ten-way digit classification. (c) Comparison of GRF.NS with different N for even/odd digit classification. (d) Comparison of LLGC.NS with different N for even/odd digit classification.

is selected from each sub-shot, and from each key-frame we extract a 225-D block-wise color moment feature based on a 5-by-5 division of the image and a 75-D edge distribution histogram. We annotate 39 concepts in the experiments, namely, the LSCOM-Lite concepts [19]. Several exemplary key-frames are illustrated in Fig. 11. Following the guideline in [34], we separate the dataset into four partitions, i.e., a "training set" with 90 videos, a "validation set" with 16 videos, a "fusion set" with 16 videos, and a "test set" with 15 videos. Details about the data partition can be found in [34].

As aforementioned, the annotation of each concept is considered as a binary classification problem. Thus for each sample we obtain positive and negative confidence scores by the graph-based methods (see Section III). Here we define a relevance score for each sample by combining these two scores, with the weight of the positive score determined by frequency, i.e., the percentage of positive samples in the labeled set. This setting follows from the fact that positive samples are usually fewer than negative ones in video concept learning, and the distributions of negative samples are usually spread over a very broad domain; thus positive samples should contribute more in this task. In fact, this setting is equivalent to duplicating copies of each positive training sample, so that they are balanced with the negative ones.

A noteworthy issue in our implementation is that we have made the matrix $W$ sparse in the GRF and LLGC methods by only keeping the largest values in each row (we empirically keep the 20 largest). This is a frequently used strategy in graph-based learning methods, which is able to significantly reduce the computational cost while retaining comparable performance.

Having obtained the relevance scores, we rank the samples accordingly and then evaluate the average precision (AP) measure on the "test set" [2]. We average the APs over all the 39 concepts to obtain the mean average precision (MAP), which can be regarded as the overall evaluation. We use the "validation set" to tune the smoothing factors in the algorithms by grid-search (for GRF.NS and LLGC.NS, we simultaneously tune the two parameters $\sigma_s$ and $\sigma_l$). Table I illustrates the MAP results obtained by using the two different feature sets.
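As an illustration, a minimal sketch (ours, in Python with NumPy) of the row-wise sparsification strategy mentioned above; re-symmetrizing the result is our assumption, since graph-based methods typically expect a symmetric affinity matrix:

```python
import numpy as np

def sparsify_affinity(W, keep=20):
    """Keep only the `keep` largest entries in each row of the affinity
    matrix and zero out the rest (Sec. V-C), then re-symmetrize."""
    W = W.copy()
    for i in range(W.shape[0]):
        # indices of all but the `keep` largest entries in row i
        drop = np.argsort(W[i])[:-keep]
        W[i, drop] = 0.0
    return np.maximum(W, W.T)  # keep the graph symmetric
```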


Fig. 11. Exemplary key-frames of the 39 concepts.

TABLE I MAP RESULTS OBTAINED BY DIFFERENT METHODS FOR THE 39 LSCOM-LITE CONCEPTS

For comparison, we have also illustrated the results obtained by the Columbia374 SVM detectors with these features [34]. Furthermore, we have also linearly fused the results obtained with the two feature sets, with the fusion weights tuned on the "fusion set". From Table I we can clearly see the superiority of the neighborhood similarity. For each feature set, the neighborhood similarity shows better performance than the distance-based similarity and the neighborhood sample similarity. The fused MAP results obtained by GRF.DS and LLGC.DS with color moment features are 0.360 and 0.354, respectively, whereas the MAPs obtained by GRF.NSS and LLGC.NSS are 0.374 and 0.373, respectively. Thus applying the neighborhood sample similarity instead of the traditional distance-based similarity introduces relative improvements of 3.89% and 5.37%, respectively. Applying the neighborhood similarity yields even better results. The MAPs obtained by GRF.NS and LLGC.NS are 0.376 and 0.378, respectively. Thus the relative improvements

obtained by adopting the neighborhood similarity are 4.44% and 6.78%, respectively.

We also observe the impact of the setting of the neighborhood size $N$ in this experiment. We increase $N$ from 5 to 40, and compare the performance of GRF.NS and LLGC.NS with GRF.DS and LLGC.DS for the two feature sets and the fused results. The results are illustrated in Fig. 12. We can see that in most cases, the performance curves of GRF.NS and LLGC.NS exhibit an inverted-U shape.² From Fig. 12 we can see that the superiority of the neighborhood similarity over the distance-based similarity is rather consistent. In practical experiments, we can also choose to tune the parameter $N$ by grid-search, and this can help

²This can be understood with the tradeoff between bias and variance that is widely used to analyze neighborhood-style models [12]: when the neighborhood size increases, the bias increases while the variance decreases.



achieve better performance (of course, it needs a much larger computational cost).

VI. CONCLUSION

In this paper, we have proposed a novel pairwise similarity measure, namely the neighborhood similarity, which can be applied in similarity-based learning algorithms such as graph-based semi-supervised learning. The neighborhood similarity measure exploits the difference of the local distributions of samples and labels. It consists of two components, i.e., the neighborhood sample similarity and the neighborhood label similarity. The neighborhood sample similarity is estimated based on the KL divergence of the local sample distributions, and the neighborhood label similarity is estimated based on the difference of the label histograms around the corresponding two samples. We have shown that the neighborhood similarity between two samples simultaneously takes into account three characteristics: 1) their distance; 2) the distribution difference of the surrounding samples; and 3) the distribution difference of the surrounding labels. Extensive experiments have demonstrated the effectiveness of the proposed similarity in different applications including toy problems, digit recognition, and video annotation.

Fig. 12. Performance variation of the neighborhood similarity-based methods with respect to neighborhood size N. (a) Performance curves of GRF.DS, GRF.NS, LLGC.DS, and LLGC.NS with color moment features. (b) Performance curves of GRF.DS, GRF.NS, LLGC.DS, and LLGC.NS with edge distribution histogram features. (c) Fused performance curves of GRF.DS, GRF.NS, LLGC.DS, and LLGC.NS.

REFERENCES

[1] TRECVID: TREC Video Retrieval Evaluation. [Online]. Available: http://www-nlpir.nist.gov/projects/trecvid.
[2] TREC-10 Proceedings Appendix on Common Evaluation Measures. [Online]. Available: http://trec.nist.gov/pubs/trec10/appendices/measures.pdf.
[3] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, "On the surprising behavior of distance metrics in high dimensional space," in Proc. Int. Conf. Database Theory, 2001.
[4] A. Amir, J. Argillander, M. Campbell, A. Haubold, G. Iyengar, S. Ebadollahi, F. Kang, M. R. Naphade, A. Natsev, J. R. Smith, J. Tesic, and T. Volkmer, "IBM research TRECVID-2005 video retrieval system," in Proc. TREC Video Retrieval Evaluation, 2005.
[5] M. Belkin, L. Matveeva, and P. Niyogi, "Regularization and semi-supervised learning on large graphs," in Proc. COLT, 2004.
[6] Y. Bengio, O. Delalleau, and N. L. Roux, "Label propagation and quadratic criterion," in Semi-Supervised Learning. Cambridge, MA: MIT Press, 2006.
[7] C. C. Chang and C. J. Lin, LIBSVM: A Library for Support Vector Machines. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[8] O. Chapelle, A. Zien, and B. Scholkopf, Semi-Supervised Learning. Cambridge, MA: MIT Press, 2006.
[9] I. Cohen, F. G. Cozman, N. Sebe, M. C. Cirelo, and T. S. Huang, "Semi-supervised learning of classifiers: Theory, algorithms and their application to human-computer interaction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 12, pp. 1553–1567, Dec. 2004.
[10] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, "Neighborhood component analysis," in Proc. Advances in Neural Information Processing Systems, 2005.
[11] T. Hastie and P. Simard, "Models and metrics for handwritten character recognition," Statist. Sci., vol. 13, no. 1, pp. 54–65, 1998.
[12] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag, 2001.
[13] A. G. Hauptmann, "Lessons for the future from a decade of Informedia video analysis research," in Proc. ACM Int. Conf. Image and Video Retrieval, 2005.
[14] A. G. Hauptmann, R. Yan, W. H. Lin, M. Christel, and H. Wactlar, "Can high-level concepts fill the semantic gap in video retrieval? A case study with broadcast news," IEEE Trans. Multimedia, vol. 9, no. 5, pp. 958–966, Aug. 2007.
[15] J. R. He, M. J. Li, H. J. Zhang, H. H. Tong, and C. S. Zhang, "Manifold-ranking based image retrieval," in Proc. ACM Multimedia, 2004.
[16] W. Kraaij and P. Over, "TRECVID-2005 high-level feature task: Overview," in Proc. TRECVID. [Online]. Available: http://www-nlpir.nist.gov/projects/tvpubs/tv6.papers/tv6.hlf.slides-final.pdf.
[17] S. Kullback, Information Theory and Statistics. New York: Wiley, 1959.


[18] C. Y. Lin, B. Tseng, and J. R. Smith, "VideoAnnEx: IBM MPEG-7 annotation tool for multimedia indexing and concept learning," in Proc. Int. Conf. Multimedia & Expo, 2003.
[19] M. R. Naphade, L. Kennedy, J. R. Kender, S.-F. Chang, J. R. Smith, P. Over, and A. Hauptmann, "A light scale concept ontology for multimedia understanding for TRECVID 2005," IBM Research Report RC23612 (W0505-104), 2005.
[20] M. R. Naphade and J. R. Smith, "On the detection of semantic concepts at TRECVID," in Proc. ACM Multimedia, 2004.
[21] Y. Rubner, C. Tomasi, and L. J. Guibas, "The earth mover's distance as a metric for image retrieval," Int. J. Comput. Vis., vol. 40, no. 2, pp. 99–121, 2000.
[22] N. Sebe, M. S. Lew, and D. P. Huijsmans, "Toward improved ranking metrics," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 10, pp. 1132–1143, Oct. 2000.
[23] H. Shin, N. J. Hill, and G. Rätsch, "Graph-based semi-supervised learning with sharper edges," in Proc. Eur. Conf. Machine Learning, 2006.
[24] A. F. Smeaton, P. Over, and W. Kraaij, "Evaluation campaigns and TRECVid," in Proc. ACM Workshop Multimedia Information Retrieval, 2007.
[25] C. G. Snoek, M. Worring, J.-M. Geusebroek, D. C. Koelma, F. J. Seinstra, and A. W. M. Smeulders, "The semantic pathfinder: Using an authoring metaphor for generic multimedia indexing," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 10, pp. 1678–1689, Oct. 2006.
[26] C. G. Snoek, M. Worring, and A. W. Smeulders, "Early versus late fusion in semantic video analysis," in Proc. ACM Multimedia, 2005.
[27] Y. Song, X. S. Hua, L. R. Dai, and M. Wang, "Semi-automatic video annotation based on active learning with multiple complementary predictors," in Proc. ACM Int. Workshop Multimedia Information Retrieval, 2005.
[28] J. Tang, X. S. Hua, G. J. Qi, Y. Song, and X. Wu, "Kernel based linear neighborhood label propagation for semantic video annotation," in Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining, 2007.
[29] M. Wang, X. S. Hua, Y. Song, X. Yuan, S. Li, and H. J. Zhang, "Automatic video annotation by semi-supervised learning with kernel density estimation," in Proc. ACM Multimedia, 2006.
[30] M. Wang, X. S. Hua, X. Yuan, Y. Song, and L. R. Dai, "Optimizing multi-graph learning: Towards a unified video annotation scheme," in Proc. ACM Multimedia, 2007.
[31] M. Wang, T. Mei, X. Yuan, and L. R. Dai, "Video annotation by graph-based learning with neighborhood similarity," in Proc. ACM Multimedia, 2007.
[32] K. Q. Weinberger, J. Blitzer, and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," in Proc. Advances in Neural Information Processing Systems, 2006.
[33] R. Yan and M. R. Naphade, "Semi-supervised cross feature learning for semantic concept detection in videos," in Proc. Int. Conf. Computer Vision and Pattern Recognition, 2005.
[34] A. Yanagawa, S.-F. Chang, L. Kennedy, and W. Hsu, Columbia University's Baseline Detectors for 374 LSCOM Semantic Visual Concepts, Columbia University ADVENT Tech. Rep. #222-2006-8, 2007.
[35] L. Yang, R. Jin, R. Sukthankar, and Y. Liu, "An efficient algorithm for local distance metric learning," in Proc. AAAI Conf. Artificial Intelligence, 2006.
[36] J. Yu, J. Amores, N. Sebe, P. Radeva, and Q. Tian, "Distance learning for similarity estimation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 3, pp. 451–462, Mar. 2008.
[37] X. Yuan, X. S. Hua, M. Wang, and X. Wu, "Manifold-ranking based video concept detection on large database and feature pool," in Proc. ACM Multimedia, 2006.
[38] M. Zakai, "General distance criteria," IEEE Trans. Inf. Theory, vol. IT-10, no. 1, pp. 94–95, Jan. 1964.
[39] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," in Proc. Advances in Neural Information Processing Systems, 2004.
[40] X. Zhu, Semi-Supervised Learning Literature Survey, University of Wisconsin-Madison, Tech. Rep. 1530.
[41] X. Zhu, "Semi-supervised learning with graphs," Ph.D. dissertation, Carnegie Mellon Univ., Pittsburgh, PA, 2005.
[42] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proc. Int. Conf. Machine Learning, 2003.

Meng Wang received the B.E. degree in the Special Class for the Gifted Young and the Ph.D. degree in the Department of Electronic Engineering and Information Science from the University of Science and Technology of China (USTC), Hefei, China, in 2003 and 2008, respectively. Since July 2008, he has been an associate researcher with Microsoft Research Asia. His current research interests include multimedia content analysis, computer vision, and pattern recognition.

Xian-Sheng Hua (M'05) received the B.S. and Ph.D. degrees from Peking University, Beijing, China, in 1996 and 2001, respectively, both in applied mathematics. When he was at Peking University, his major research interests were in the areas of image processing and multimedia watermarking. Since 2001, he has been with Microsoft Research Asia, Beijing, where he is currently a Lead Researcher with the Internet Media Group. He is also an adjunct professor at the University of Science and Technology of China. His current interests are in the areas of video content analysis, multimedia search, management, authoring, sharing, and advertising. He has authored more than 130 publications in these areas and has more than 30 filed patents or pending applications. Dr. Hua is a member of the Association for Computing Machinery. He serves as an Associate Editor for the IEEE TRANSACTIONS ON MULTIMEDIA and is an Editorial Board Member of Multimedia Tools and Applications. He won the Best Paper Award and Best Demonstration Award at ACM Multimedia 2007 and the Best Poster Paper Award at the 2008 IEEE International Workshop on Multimedia Signal Processing. He also won the 2008 MIT Technology Review TR35 Young Innovator Award.

Jinhui Tang (M'08) received the B.E. and Ph.D. degrees from the University of Science and Technology of China, Hefei, China, in 2003 and 2008, respectively, both in the Department of Electronic Engineering and Information Science. Since July 2008, he has been a research fellow in the School of Computing, National University of Singapore. His current research interests include content-based image retrieval, video content analysis, and pattern recognition. Dr. Tang is a recipient of the 2008 President Scholarship of the Chinese Academy of Sciences, and a co-recipient of the Best Paper Award at ACM Multimedia 2007.

Richang Hong received the Ph.D. degree from the University of Science and Technology of China, Hefei, China, in 2008. He is currently affiliated with the University of Science and Technology of China and is also a Research Assistant in the School of Computing, National University of Singapore. From February 2006 to June 2006, he worked as a research intern in the Web Search and Data Mining group at Microsoft Research Asia. His current research interests include content-based image retrieval, video content analysis, and pattern recognition. Dr. Hong is a member of the ACM.
