Abstract

In this paper, we present a direct application of Support Vector Machine with Augmented Features (AFSVM) [3] for video concept detection. For each visual concept, we learn an adapted classifier by leveraging the pre-learnt SVM classifiers of other concepts. The solution of AFSVM is to re-train the SVM classifier using an augmented feature, which concatenates the original feature vector with the decision value vector obtained from the pre-learnt SVM classifiers in the Reproducing Kernel Hilbert Space (RKHS). The experiments on the challenging TRECVID 2005 dataset demonstrate the effectiveness of AFSVM for video concept detection.

1. Introduction

There is increasing interest in developing new video concept detection techniques that classify videos into high-level semantic concepts such as "office", "animal" and "person". Such techniques have a broad range of applications, including video summarization and content-based video retrieval. Most video concept detection algorithms follow three steps. First, a large set of concept lexicons is defined. Then, a large corpus of labeled training samples is obtained through time-consuming and expensive human annotation. Finally, robust classifiers (also called models or concept detectors) are learned from the training data and used to detect the presence of the concepts in any test video.

Broadcast news videos contain rich information about objects, people, activities and events [17]. To promote progress in video concept detection from broadcast news videos, the TRECVID competition [1], a laboratory-style evaluation that models real-world situations, has been held annually since 2001. More recently, the Large-Scale Concept Ontology for Multimedia (LSCOM) lexicon [11] defined 834 dominant semantic concepts present in broadcast news video, covering objects (e.g., car, flag), scenes (e.g., outdoor, waterscape), locations (e.g., office, studio), people (e.g., person, crowd, military), events (e.g., people walking or running, people marching), and programs (e.g., weather, entertainment). These concepts were selected using input from a large group of video analysis researchers, knowledge representation experts and other specialists. Manual annotation of 449 visual concepts was carried out for a large data set in TRECVID 2005, which contains about 80 hours of news videos from three English channels, two Chinese channels and one Arabic channel. Recently, Yanagawa et al. [18] developed a set of baseline concept detectors, called Columbia374-baseline, for 374 visual concepts chosen from the LSCOM ontology; please refer to [18] for the complete listing of the 374 semantic concepts.

In this paper, we present a direct application of Support Vector Machine with Augmented Features (AFSVM) [3] for video concept detection. For each visual concept, we learn an adapted classifier by leveraging the pre-learnt SVM classifiers of other concepts; for example, AFSVM can utilize the pre-learnt classifiers associated with the concepts "Tennis" and "Basketball" when training a classifier for the concept "Athlete". The solution of AFSVM is to re-train the SVM classifier using an augmented feature, which concatenates the original feature vector with the decision value vector obtained from the pre-learnt SVM classifiers in the Reproducing Kernel Hilbert Space (RKHS). Comprehensive experiments are conducted on the TRECVID 2005 dataset using all 374 visual concepts listed in [18], and they demonstrate the effectiveness of AFSVM for video concept detection.

2. Support Vector Machine with Augmented Features (AFSVM)

In many video concept detection methods such as [18], the concept detectors are learnt independently. However, it is well known that some semantic concepts are correlated with others. For example, it is beneficial to use the classifiers associated with the concepts "Tennis" and "Basketball" to detect the visual concept "Athlete". Previous work [7, 8, 13, 15] has shown that it is useful to exploit the inter-correlation among visual concepts for semantic concept detection. Recently, cross-domain learning (or domain adaptation) methods were proposed [5, 6, 9] to learn robust classifiers from only a limited number of labeled samples in the target domain by leveraging pre-learnt source classifiers. In this work, we use a similar idea to exploit the inter-correlation among concepts for the video concept detection problem.

In our work, we learn a one-versus-all AFSVM classifier for each of the 374 visual concepts listed in [18] by leveraging a set of pre-learnt SVM classifiers of other concepts. At first, we train 374 baseline SVM classifiers, one per visual concept, using the corresponding training samples. Let us denote X = {(x_i, y_i)}_{i=1}^{l} as the training samples for learning the AFSVM classifier of one concept (e.g., "Athlete"), where y_i ∈ {−1, 1} is the label of training sample x_i ∈ ℝ^d, with d being the feature dimension. Let us define a set of classifiers f_h, h = 1, …, H, which are the pre-learnt baseline SVM classifiers from the other H = 373 concepts. Motivated by the semi-parametric SVM [16] and the existing cross-domain learning methods [6], we assume that the target classifier takes the following form:

    f̂(x) = w^⊤ φ(x) + β^⊤ ψ(f(x)) + b,

where f(x) = [f_1(x), …, f_H(x)]^⊤ is the vector of decision values on sample x from the pre-learnt classifiers f_h's, β is the weight vector, ψ is the nonlinear feature mapping function for f(x), and w^⊤ φ(x) + b is the decision function of the standard SVM, with φ being another nonlinear feature mapping function. To minimize the structural risk, we propose the following objective function:

    min_{w, b, β, ξ_i}   (1/2)(∥w∥² + ∥β∥²) + C ∑_{i=1}^{l} ξ_i,                (1)

    s.t.   y_i f̂(x_i) ≥ 1 − ξ_i,   ξ_i ≥ 0,   i = 1, …, l.
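To see why the augmented feature leads to a summed kernel, consider the linear case where φ and ψ are both identity maps: the classifier above is then just a standard SVM on the concatenated vector [x; f(x)]. A minimal numpy sketch with toy data (all sizes hypothetical, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: l samples, d original features, H decision values
# from the pre-learnt classifiers of other concepts.
l, d, H = 6, 4, 3
X = rng.normal(size=(l, d))   # original features x_i
F = rng.normal(size=(l, H))   # decision value vectors f(x_i)

# Augmented feature: concatenate x with f(x) (linear case, phi = psi = identity).
X_aug = np.hstack([X, F])     # shape (l, d + H)

# The inner product of two augmented features splits into exactly two kernel
# terms: x_i^T x_j + f(x_i)^T f(x_j), i.e., a sum of two kernel matrices.
K_hat = X_aug @ X_aug.T
assert np.allclose(K_hat, X @ X.T + F @ F.T)
```

With nonlinear φ and ψ the same decomposition holds in the respective RKHSs, which is what the dual form below exploits.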

In contrast to the semi-parametric SVM in [16], we also penalize the complexity of the weight vector β in the above objective function to control the complexity of the pre-learnt classifiers. By introducing the Lagrangian multipliers α_i's and μ_i's for the constraints in (1), we arrive at the following Lagrangian:

    L = (1/2)(∥w∥² + ∥β∥²) + C ∑_{i=1}^{l} ξ_i − ∑_{i=1}^{l} μ_i ξ_i − ∑_{i=1}^{l} α_i [y_i f̂(x_i) − 1 + ξ_i].                (2)

By setting the derivatives of (2) with respect to the primal variables w, b, β and ξ_i to zero, we obtain the following results:

    w = ∑_{i=1}^{l} α_i y_i φ(x_i),   β = ∑_{i=1}^{l} α_i y_i ψ(f(x_i)),   ∑_{i=1}^{l} α_i y_i = 0,   C = α_i + μ_i.

Substituting these back into (2), we arrive at the dual form of (1), which is a quadratic programming problem:

    max_{α_i}   ∑_{i=1}^{l} α_i − (1/2) ∑_{i,j=1}^{l} α_i α_j y_i y_j K̂(x_i, x_j)                (3)

    s.t.   0 ≤ α_i ≤ C,   ∑_{i=1}^{l} α_i y_i = 0,

where

    K̂(x_i, x_j) = φ(x_i)^⊤ φ(x_j) + ψ(f(x_i))^⊤ ψ(f(x_j)).                (4)

Finally, the decision function for any test sample x becomes:

    f̂(x) = w^⊤ φ(x) + β^⊤ ψ(f(x)) + b = ∑_{i=1}^{l} α_i y_i K̂(x_i, x) + b.                (5)
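Given the learnt dual variables, the decision function (5) can be evaluated without ever forming w or β explicitly. A minimal sketch, assuming both kernel terms are RBF kernels (one possible choice of φ and ψ, matching our later experiments); the function name and argument layout are ours, not from any library:

```python
import numpy as np

def decision_value(x, f_x, X_train, F_train, y_train, alpha, b, gamma1, gamma2):
    """Evaluate f_hat(x) = sum_i alpha_i * y_i * K_hat(x_i, x) + b, per Eq. (5).

    K_hat is the sum of an RBF kernel over the original features and an RBF
    kernel over the decision values f(x) from the pre-learnt classifiers.
    """
    k_x = np.exp(-gamma1 * np.sum((X_train - x) ** 2, axis=1))    # phi term
    k_f = np.exp(-gamma2 * np.sum((F_train - f_x) ** 2, axis=1))  # psi term
    return float(np.dot(alpha * y_train, k_x + k_f) + b)
```

Only the support vectors (nonzero α_i) and their stored decision value vectors f(x_i) are needed at test time, plus one pass through the H pre-learnt classifiers to compute f(x) for the test sample.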

The dual form in (3) is a Quadratic Programming (QP) problem, which can be readily solved using any QP solver, such as the one in LIBSVM [2]. Interestingly, it is similar to the dual form of the standard SVM; the only difference is the new kernel matrix defined in (4). This new kernel matrix can be interpreted as the sum of two kernel matrices, defined respectively on the original features and on the decision values from the pre-learnt baseline SVM classifiers of other concepts. In our initial work on AFSVM [3], we first concatenated the original feature vector with the decision value vector from the pre-learnt classifiers of other concepts, and then used a single nonlinear mapping function to obtain the new kernel matrix. This paper introduces a more flexible way to calculate the new kernel matrix: we map the original feature vector and the decision value vector with two separate mapping functions, whose outputs are concatenated such that the kernel matrix is calculated using (4). To be consistent with the previous work [3], we still use the name Support Vector Machine with Augmented Features (AFSVM). We list the detailed procedure of AFSVM for video concept detection in Algorithm 1.

Algorithm 1: Procedure of AFSVM for video concept detection.
1. Train a nonlinear SVM classifier for each visual concept listed in [18] using the corresponding training samples.
2. For each visual concept, re-train an AFSVM classifier f̂(x) using the new kernel matrix defined in (4).
3. Output the final 374 AFSVM classifiers f̂(x)'s.
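Algorithm 1 can be sketched end-to-end as follows. The paper's experiments use LIBSVM; this illustration instead uses scikit-learn's SVC with a precomputed kernel, and the function name, data layout (one 0/1 label column per concept) and toy sizes are our own assumptions, not the paper's implementation:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def train_afsvm(X, Y, gamma1, gamma2, C=1.0):
    """Sketch of Algorithm 1.

    X: (n, d) training features; Y: (n, H) 0/1 label matrix, one column per
    concept (hypothetical layout). Returns baseline and AFSVM classifiers.
    """
    H = Y.shape[1]
    # Step 1: train a baseline RBF-kernel SVM per concept.
    base = [SVC(kernel="rbf", gamma=gamma1, C=C).fit(X, Y[:, h]) for h in range(H)]
    # Decision values of every baseline classifier on the training samples.
    F_all = np.column_stack([clf.decision_function(X) for clf in base])
    K_x = rbf_kernel(X, gamma=gamma1)
    # Step 2: per concept, re-train with the summed kernel of Eq. (4),
    # using only the decision values of the *other* concepts.
    afsvm = []
    for h in range(H):
        F = np.delete(F_all, h, axis=1)
        K_hat = K_x + rbf_kernel(F, gamma=gamma2)
        afsvm.append(SVC(kernel="precomputed", C=C).fit(K_hat, Y[:, h]))
    return base, afsvm
```

At test time, the precomputed-kernel classifiers expect the kernel between test and training samples, built the same way from the test features and their baseline decision values.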

2.1. Discussion of existing work

We discuss the differences between our work and the existing methods [7, 12, 15] that also utilize the inherent correlation among semantic concepts for video concept detection. In [7], Jiang et al. proposed a boosted Conditional Random Field (CRF) framework that exploits the inter-correlation among concepts by combining boosting and CRF. In [12], Qi et al. developed a Correlative Multi-Label (CML) framework to simultaneously classify the concepts and model the correlations among them. However, the experiments in both works were conducted on only 39 concepts, possibly because both methods are too computationally expensive to handle a large number of concepts, such as all 374 concepts in [18]. The most closely related work is the Discriminative Model Fusion (DMF) method [15], which re-trains the SVM classifier for each concept using the decision value vector obtained from the baseline SVM classifiers of other concepts. In contrast to [15], we re-train the SVM classifier for each concept using both the original feature and the decision value vector, and our experiments show that AFSVM outperforms DMF for video concept detection.

3. Experiments

We first compare AFSVM with the baseline SVM and DMF [15] for video concept detection on the large TRECVID 2005 database, using all 374 visual concepts listed in [18]. The only difference among AFSVM, DMF and SVM lies in their kernel matrices. To explain the performance improvement, we also analyze the three kernel matrices using the kernel target alignment method in [4].

3.1. Dataset description and experimental setup

The TRECVID 2005 database is probably the largest annotated video benchmark data set available to researchers today. The development set comprises 137 labeled video programs (61,901 subshots) from six broadcast sources (in English, Arabic and Chinese). To describe the visual content of each of the 61,901 subshots, 374 semantic concepts have been manually annotated by a large group of students. Following [18], we partition the development set into a training set and a test set, which contain 90 video programs (41,847 subshots) and 47 video programs (20,054 subshots), respectively.

Each subshot is represented by one keyframe, from which we extract three types of global visual features: a 73-D Edge Direction Histogram, 48-D Gabor Texture and 225-D Grid Color Moment. While it is possible to use local SIFT descriptors [10], we use these global features because of their consistently good performance reported in TRECVID; more details about the features are described in [18]. Finally, we concatenate the three types of visual features to form a 346-dimensional feature vector for each keyframe.

For performance evaluation, we use the non-interpolated Average Precision (AP) [1, 14], which has been the official performance metric in TRECVID since 2001. It corresponds to the multi-point AP value of a precision-recall curve and incorporates the effect of recall when AP is computed over the entire classification result set. Mean AP (MAP) is the mean of the APs over all 374 semantic concepts.

For all three methods, we train one-versus-all SVM classifiers with the regularization parameter fixed at C = 1. For the baseline SVM, we directly use the RBF kernel, i.e., K(x_i, x_j) = exp(−γ_1 ∥x_i − x_j∥²), with γ_1 = 1/d, where d = 346 is the dimension of the original feature vector. For the DMF algorithm, we use the decision value vector obtained from the baseline SVM classifiers of all the concepts to calculate the kernel matrix. The kernel matrix in AFSVM is K̂(x_i, x_j) = exp(−γ_1 ∥x_i − x_j∥²) + exp(−γ_2 ∥f(x_i) − f(x_j)∥²), where γ_2 = 1/H. Note that for all the algorithms we normalize the original features and/or the decision values to zero mean and unit variance before calculating the kernel matrices.
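The kernel construction just described can be sketched in a few lines of numpy; the helper name is ours, while the z-score normalization and the choices γ_1 = 1/d and γ_2 = 1/H follow the setup above:

```python
import numpy as np

def afsvm_train_kernel(X, F):
    """Training kernel matrix of AFSVM as used in the experiments:
    K_hat = exp(-g1*||x_i - x_j||^2) + exp(-g2*||f(x_i) - f(x_j)||^2),
    with g1 = 1/d and g2 = 1/H, after z-scoring features and decision values."""
    def zscore(A):
        return (A - A.mean(axis=0)) / (A.std(axis=0) + 1e-12)

    def rbf(A, gamma):
        sq = np.sum(A ** 2, axis=1)
        dist2 = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
        return np.exp(-gamma * np.maximum(dist2, 0.0))

    Xn, Fn = zscore(X), zscore(F)
    return rbf(Xn, 1.0 / X.shape[1]) + rbf(Fn, 1.0 / F.shape[1])
```

Since each RBF term is at most 1, the resulting kernel values lie in (0, 2], with the diagonal exactly 2.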

3.2. Performance comparison

Method   SVM      DMF      AFSVM
MAP      26.43%   27.99%   29.10%
Table 1. MAP comparison of AFSVM, DMF and SVM over all the 374 concepts.

The MAPs of AFSVM, DMF and SVM over all the 374 concepts are listed in Table 1. We observe that the MAP improves significantly from 26.43% (SVM) (resp., 27.99% (DMF)) to 29.10% (AFSVM), a 10.10% (resp., 3.97%) relative improvement. The AP difference between AFSVM and SVM, as well as between AFSVM and DMF, is plotted for each semantic concept in Fig. 1. We observe that AFSVM improves performance on 309 (resp., 301) out of the 374 concepts when compared with the baseline SVM (resp., DMF). Some concepts enjoy large performance gains; for example, the AP for the concept "Capital" improves from 11.15% (SVM) (resp., 28.46% (DMF)) to 41.78% (AFSVM), a 274.71% (resp., 46.80%) relative improvement. For each concept, we can rank the test keyframes according to their decision values; Fig. 2 shows the top-20 ranked test keyframes using AFSVM, DMF and SVM for the concept "Parade".
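The non-interpolated AP used for evaluation can be computed directly from the ranked decision values; a short sketch (the function name is ours):

```python
import numpy as np

def average_precision(scores, labels):
    """Non-interpolated AP: average of precision@k over the ranks k at which
    a positive sample appears, after sorting by decreasing decision value."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    rel = np.asarray(labels)[order].astype(bool)
    hits = np.cumsum(rel)                   # positives retrieved up to rank k
    ranks = np.arange(1, rel.size + 1)
    prec_at_pos = hits[rel] / ranks[rel]    # precision at each positive rank
    return float(prec_at_pos.mean()) if rel.any() else 0.0
```

For instance, if the two positives of a three-sample list are ranked first and third, the AP is (1/1 + 2/3)/2 = 5/6.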

3.3. Analysis using kernel target alignment [4]

The only difference among AFSVM, DMF and SVM lies in their kernel matrices. To explain the performance improvement, we analyze these kernel matrices using the kernel target alignment method in [4]. Let us denote the Frobenius product between two Gram matrices K_1 and K_2 as:

    ⟨K_1, K_2⟩_F = ∑_{i,j} K_1(x_i, x_j) K_2(x_i, x_j).                (6)

The alignment value (AV) between a given kernel matrix K and the ideal kernel matrix yy^⊤ constructed from the label vector y ∈ {−1, 1}^l is defined as follows:

    AV = ⟨K, yy^⊤⟩_F / √(⟨K, K⟩_F ⟨yy^⊤, yy^⊤⟩_F) = ⟨K, yy^⊤⟩_F / (l √⟨K, K⟩_F).                (7)
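Equation (7) is straightforward to compute; a minimal sketch (the function name is ours), with a sanity check that the ideal kernel itself attains AV = 1, since ⟨yy^⊤, yy^⊤⟩_F = l²:

```python
import numpy as np

def alignment_value(K, y):
    """Kernel target alignment, Eq. (7):
    AV = <K, yy^T>_F / (l * sqrt(<K, K>_F)),
    where <K1, K2>_F = sum_ij K1_ij * K2_ij and y is in {-1, +1}^l."""
    y = np.asarray(y, dtype=float)
    yyT = np.outer(y, y)
    return float(np.sum(K * yyT) / (y.size * np.sqrt(np.sum(K * K))))

# Sanity check: the ideal kernel yy^T aligns perfectly with itself.
y = np.array([1, -1, 1, 1, -1])
assert np.isclose(alignment_value(np.outer(y, y), y), 1.0)
```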

Method   SVM      DMF      AFSVM
MAV      0.7498   0.8386   0.8498
Table 2. MAV comparison of AFSVM, DMF and SVM over all the 374 concepts.

As shown in [4], a larger AV indicates that a kernel is closer to the ideal kernel. For each concept, we use the test samples to calculate the AV between the kernel matrix of SVM/DMF/AFSVM and the ideal kernel matrix using (7). Table 2 lists the mean alignment values (MAVs) of AFSVM, DMF and SVM over all the 374 concepts. Again, we observe that the MAV of AFSVM is higher than that of SVM or DMF, demonstrating that the new kernel matrix used in AFSVM is closer to the ideal kernel. This also explains the performance improvement of AFSVM.

4. Conclusion

We have presented a direct application of Support Vector Machine with Augmented Features (AFSVM) for video concept detection. For each visual concept, we learn a classifier by leveraging the pre-learnt SVM classifiers of other concepts. The solution of AFSVM is to re-train the SVM classifier using an augmented feature, which concatenates the original feature vector with the decision value vector obtained from the pre-learnt SVM classifiers in the Reproducing Kernel Hilbert Space (RKHS). The experiments on the large TRECVID 2005 dataset with 374 semantic concepts demonstrate the effectiveness of AFSVM for video concept detection.

Acknowledgements: This work is supported by Singapore A*STAR SERC Grant (082 101 0018).

References
[1] http://www-nlpir.nist.gov/projects/tv2005/tv2005.html.
[2] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[3] L. Chen, D. Xu, I. W.-H. Tsang, and J. Luo. Tag-based web photo retrieval improved by batch mode re-tagging. In CVPR, 2010.
[4] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. S. Kandola. On kernel-target alignment. In NIPS, pages 367–373, 2001.
[5] L. Duan, I. W. Tsang, D. Xu, and T.-S. Chua. Domain adaptation from multiple sources via auxiliary classifiers. In ICML, 2009.
[6] L. Duan, D. Xu, I. W. Tsang, and J. Luo. Visual event recognition in videos by learning from web data. In CVPR, 2010.
[7] W. Jiang, S.-F. Chang, and A. C. Loui. Context-based concept fusion with boosted conditional random fields. In ICASSP, pages 949–952, 2007.
[8] Y.-G. Jiang, J. Wang, S.-F. Chang, and C.-W. Ngo. Domain adaptive semantic diffusion for large scale context-based video annotation. In ICCV, 2009.
[9] Y. Liu, D. Xu, I. W. Tsang, and J. Luo. Textual query of personal photos facilitated by large-scale web data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
[10] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[11] M. R. Naphade, J. R. Smith, J. Tesic, S.-F. Chang, W. H. Hsu, L. S. Kennedy, A. G. Hauptmann, and J. Curtis. Large-scale concept ontology for multimedia. IEEE MultiMedia, 13(3):86–91, 2006.
[12] G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, T. Mei, and H.-J. Zhang. Correlative multi-label video annotation. In ACM Multimedia, pages 17–26, 2007.
[13] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In ICCV, 2007.
[14] A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVID. In Multimedia Information Retrieval, pages 321–330, 2006.

[Figure 1] Per-concept AP difference between AFSVM and SVM (top) and between AFSVM and DMF (bottom). The 374 concepts are sorted according to the frequency of positively labeled samples in the TRECVID 2005 dataset.

[Figure 2] Top-20 ranked test keyframes for the semantic concept "Parade" using SVM, DMF and AFSVM. Incorrect results are highlighted by red boxes.

[15] J. R. Smith, M. Naphade, and A. Natsev. Multimedia semantic indexing using model vectors. In IEEE International Conference on Multimedia and Expo, pages 445–448, 2003.
[16] A. J. Smola, T.-T. Frieß, and B. Schölkopf. Semiparametric support vector and linear programming machines. In NIPS, pages 585–591, 1998.

[17] D. Xu and S.-F. Chang. Video event recognition using kernel methods with multilevel temporal alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1985–1997, 2008.
[18] A. Yanagawa, S.-F. Chang, L. Kennedy, and W. Hsu. Columbia University's baseline detectors for 374 LSCOM semantic visual concepts. ADVENT Technical Report, Columbia University, March 2007.