Grouplet-based Distance Metric Learning for Video Concept Detection

Wei Jiang, Alexander C. Loui
Corporate Research and Engineering, Eastman Kodak Company, Rochester, NY 14650
{wei.jiang,alexander.loui}@kodak.com

Abstract—We investigate general concept detection in unconstrained videos. A distance metric learning algorithm is developed that uses the grouplet structure for improved detection. A grouplet is defined as a set of audio and/or visual codewords that are grouped together according to their strong correlations in videos. By using entire grouplets as building elements, concepts can be detected more robustly than by using discrete audio or visual codewords. Compared with the traditional method of generating aggregated grouplet-based features for classification, our grouplet-based distance metric learning approach directly learns distances between data points, which better preserves the grouplet structure. Specifically, our algorithm uses an iterative quadratic programming formulation where the optimal distance metric can be effectively learned based on the large-margin nearest-neighbor setting. The framework is quite flexible: various types of distances can be computed using individual grouplets, and through the same distance metric learning algorithm the distances computed over individual grouplets can be combined for final classification. We extensively evaluate our method over the large-scale Columbia Consumer Video set. Experiments demonstrate that our approach achieves consistent and significant performance improvements.

Keywords—grouplet, video concept classification, distance metric learning

I. INTRODUCTION

Many efforts have been devoted to detecting general concepts in unconstrained generic videos, such as human action recognition in Hollywood movies [1], the TRECVID high-level feature extraction and multimedia event detection tasks [2], and concept detection in the Columbia Consumer Video (CCV) set [3]. The current state of the art in video concept detection is based on the Bag-of-Words (BoW) model and SVM classifiers [2], [3], [4]. In the visual aspect, local visual descriptors (e.g., SIFT [5] or HOG [6]) are computed from 2D local points or 3D local volumes. These descriptors are vector-quantized against a codebook of prototypical visual descriptors to generate a histogram-like visual representation. In the audio aspect, audio descriptors (e.g., MFCCs or transients [7]) are computed from short temporal windows that are either uniformly distributed in the soundtrack or sparsely detected as salient audio onsets. These descriptors are also vector-quantized against a codebook of prototypical audio descriptors to generate a histogram-like audio representation.
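To make the quantization step concrete, the following is a minimal sketch of the hard-assignment variant in Python; the soft weighting scheme actually used later in this paper (Section II) replaces the hard argmin with graded assignments, and all names and dimensions below are illustrative:

import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors (n x d) against a codebook (K x d)
    and return an L1-normalized histogram over the K codewords."""
    # Squared Euclidean distance from every descriptor to every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                  # hard assignment
    hist = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)           # normalize to a distribution

# Toy usage: 200 random 128-dim "SIFT" descriptors, 50-word codebook.
rng = np.random.default_rng(0)
h = bow_histogram(rng.normal(size=(200, 128)), rng.normal(size=(50, 128)))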

Next, the visual and audio histogram-like representations are combined (e.g., by early fusion in the form of feature concatenation, or by late fusion in the form of a classifier ensemble) to learn SVM-based video concept classifiers [2], [3].

From another perspective, beyond the traditional BoW representation, structured local descriptors have recently been found to be effective in many computer vision tasks. For example, in addition to local visual appearance, pairwise spatial constraints among local interest points have been used to enhance image registration [8]; various types of spatial contextual information have been used for object detection [9]; and a grouplet representation has been developed to capture discriminative visual features and their spatial configurations for detecting human-object-interaction scenes in images [10]. The rationale is that individual local descriptors tend to be sensitive to variations such as changes of illumination, view, scale, and occlusion, whereas a set of co-occurrent local patterns can be less ambiguous. Along this direction, for the task of general video concept detection, Jiang and Loui recently developed an Audio-Visual Grouplet (AVG) representation [11] that incorporates temporal audio-visual correlations to enhance classification. An AVG is defined as a set of audio and visual codewords that are grouped together according to their strong temporal correlations in videos. By using entire grouplets as building elements to represent videos, various concepts can be classified more robustly than by using discrete audio and visual codewords. For example, the AVG that captures the visual bride and audio speech gives a strong audio-visual cue for classifying the “wedding ceremony” concept, and the AVG that captures the visual bride and audio dancing music is quite discriminative for classifying the “wedding dance” concept.

Although the extracted AVGs can capture interesting audio-visual signatures, a naive approach is taken in [11] to compute the grouplet-based features for final concept classification. That is, for the audio and/or visual codewords associated with a grouplet, the values over the corresponding bins in the original visual-based and/or audio-based BoW features are aggregated together (e.g., by taking the summation or average) as the feature for the grouplet. As illustrated in Figure 1, such aggregated BoW features can be problematic and cannot fully utilize the advantage of the grouplet structure.

[Figure 1 appears here: an Audio-Visual Grouplet (AVG) spanning visual codewords #1-#3 and audio codewords #1-#2, with the per-codeword BoW values of three data points x1, x2, and x3, each of which aggregates to the same sum of 5.]

Figure 1. An example of the aggregated BoW feature based on a grouplet. In the example, assuming that all codewords have equal weights, data points x1, x2, and x3 have the same aggregated BoW features for the given grouplet (value 5 by taking the summation). However, data points x1 and x3 should be more similar to each other than data points x1 and x2, because x1 and x3 have the same feature values over visual codeword #1 and visual codeword #3, while x1 and x2 only have the same feature value over audio codeword #2.
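The point of Figure 1 can be reproduced numerically. In the following sketch the per-codeword values are illustrative (chosen to be consistent with the caption), not taken from an actual video:

import numpy as np

# Per-codeword BoW values over the grouplet's five codewords
# (visual #1-#3, audio #1-#2); all three rows aggregate to 5.
x1 = np.array([2., 0., 2., 0., 1.])
x2 = np.array([0., 2., 0., 2., 1.])
x3 = np.array([2., 1., 2., 0., 0.])

print(x1.sum(), x2.sum(), x3.sum())   # 5.0 5.0 5.0 -- aggregation ties them
print(np.linalg.norm(x1 - x3))        # ~1.41: x1 and x3 are close
print(np.linalg.norm(x1 - x2))        # 4.0 : x1 and x2 are far apart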

In this work, we propose a distance metric learning algorithm to better use the grouplet structure information. Instead of computing an aggregated feature for each grouplet, we directly learn the distance metric between data points by using the grouplets. We thereby avoid the information loss incurred in generating aggregated grouplet-based features and compute better grouplet-based distances that preserve the grouplet structure. For instance, in the example of Figure 1, we can calculate distances $D^G(x_i, x_j)$ between data points from their AVG-based histogram-like vectors (e.g., with the Euclidean distance), and such distances better preserve the grouplet structure (i.e., the distance between data points x1 and x3 is smaller than the distance between data points x1 and x2). Next, by combining the grouplet-based distances $D^G(x_i, x_j)$ computed over different AVGs, a final metric $D(x_i, x_j)$ can be learned to measure the distances between data points.

Distance metric learning is an important machine learning technique that adapts the underlying distance metric to the available data for improved classification. The most popular distance metric learning algorithms are based on the Mahalanobis distance metric, such as the Large-Margin Nearest Neighbor (LMNN) method [12], the Maximally Collapsing Metric Learning (MCML) approach [13], and the Information-Theoretic Metric Learning (ITML) method [14]. However, it is non-trivial to incorporate the grouplet structure into these existing distance metric learning algorithms. In this work, we develop a distance metric learning algorithm that uses the grouplet structure information. An iterative Quadratic Programming (QP) problem is formulated to learn the optimal distance metric based on the LMNN setting [12]. LMNN is used because of its resemblance to SVMs: the role of the large margin in LMNN is inspired by its role in SVMs, so LMNN should inherit various strengths of SVMs [15]. Therefore, the final learned distance metric can provide reasonably good performance for SVM concept detectors.

Our distance metric learning framework is quite flexible: various types of distances $D_k^G(x_i, x_j)$ can be computed using individual grouplets, and these grouplet-based distances can be combined by the same distance metric learning algorithm.

In addition, we propose a grouplet-based distance $D_k^G(x_i, x_j)$ based on the chi-square distance and word specificity [16], and through our distance metric learning such a grouplet-based distance achieves consistent and significant detection performance gains. Our approach is evaluated over the large-scale CCV set [3], which contains 9317 consumer videos from YouTube. Experimental comparison with the original grouplet-based aggregated BoW feature used in [11] shows the effectiveness of our approach.

The remainder of the paper is organized as follows. Section II describes the process of generating the grouplets. Section III develops our distance metric learning algorithm and introduces our grouplet-based distance. Section IV gives the experimental results, and Section V concludes the paper.

II. GROUPLET CONSTRUCTION

Following the recipe of [11], we extract four types of AVGs by computing four types of audio-visual temporal correlations. Specifically, in the visual aspect, we extract SIFT points, conduct SIFT tracking using Lowe's method [5], and separate the SIFT tracks into foreground tracks and background tracks. Each SIFT track is represented by a 136-dim feature vector composed of the 128-dim SIFT descriptor and an 8-dim motion vector. Next, by clustering the foreground SIFT tracks from a set of training videos, a foreground visual codebook $V^{f-v}$ is constructed. Similarly, by clustering the background SIFT tracks from the same training videos, a background visual codebook $V^{b-v}$ is constructed. In the audio aspect, 13-dim MFCCs are extracted from evenly distributed overlapping short windows (i.e., 25 ms windows with 10 ms hops) in the soundtrack, and by clustering the MFCCs from the training videos, a background audio codebook $V^{b-a}$ is constructed. Also, the 20-dim transient features [7] describing foreground audio salient events are computed, and by clustering the transient features from the training videos, a foreground audio codebook $V^{f-a}$ is constructed.

For each codebook (e.g., codebook $V^{f-v}$), given an input video $x_j$, a histogram-like temporal sequence $\{H_{j1}^{f-v}, H_{j2}^{f-v}, \ldots\}$ can be generated, where each $H_{jk}^{f-v}$ is the BoW feature for the $k$-th frame in the video, computed using a soft weighting scheme [17]. Based on the temporal sequences generated for the different codebooks, the temporal Granger causality [18] between pairwise audio and visual codewords can be calculated. That is, we compute four types of temporal Granger causalities: between visual foreground and audio foreground codewords; between visual foreground and audio background codewords; between visual background and audio foreground codewords; and between visual background and audio background codewords. The Granger causality between two codewords measures the similarity between these codewords, and we can obtain a causal matrix describing the pairwise similarities between audio and visual codewords.
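As a rough illustration of this step, the following sketch estimates a toy Granger causality score between two codeword histogram sequences by comparing autoregressive fits with and without the other codeword's past. The lag, the log-ratio score, and all names are simplifying assumptions, not the exact recipe of [11] or [18]:

import numpy as np

def granger_causality(a, b, lag=2):
    """Toy Granger causality from codeword sequence a to codeword
    sequence b: compare the residual error of predicting b(t) from
    its own past with that of predicting it from the past of both
    b and a. Returns log(err_restricted / err_full); larger values
    indicate a stronger causal relation."""
    T = len(b)
    past_b = np.column_stack([b[lag - k - 1 : T - k - 1] for k in range(lag)])
    past_a = np.column_stack([a[lag - k - 1 : T - k - 1] for k in range(lag)])
    y = b[lag:]
    def sse(X):
        X1 = np.column_stack([X, np.ones(len(X))])   # add an intercept
        coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
        return ((y - X1 @ coef) ** 2).sum()
    return np.log(sse(past_b) / sse(np.column_stack([past_b, past_a])))

Evaluating this score for every audio-visual codeword pair fills one causal matrix of the kind described above.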

Spectral clustering algorithms (e.g., the method of [19]) can then be used to cluster the audio and visual codewords into grouplets (AVGs) based on the causal matrix. Each grouplet G contains a set of audio and visual codewords that have strong Granger causal relations. Therefore, four types of AVGs are obtained from the four types of temporal Granger causal audio-visual correlations.

After obtaining the grouplets G, the work of [11] takes a naive approach to compute the grouplet-based features for concept classification: for the audio and/or visual codewords associated with a grouplet, the values over the corresponding bins in the original visual-based and/or audio-based BoW features are aggregated together (e.g., by taking the summation) as the feature for the grouplet. In comparison, in the following section we develop a distance metric learning algorithm to directly learn the pairwise distances between data points by using the grouplet information. Our method better preserves the grouplet structure and avoids the information loss in generating aggregated grouplet-based features, and is therefore better suited for concept classification.

III. PROBLEM FORMULATION AND OUR DISTANCE METRIC LEARNING APPROACH

Assume that we have K grouplets $G_k$, $k = 1, \ldots, K$. Let $D_k^G(x_i, x_j)$ denote the distance between data points $x_i$ and $x_j$ computed based on the grouplet $G_k$. The overall distance $D(x_i, x_j)$ between $x_i$ and $x_j$ is given by:

$$D(x_i, x_j) = \sum_{k=1}^{K} v_k D_k^G(x_i, x_j). \quad (1)$$

SVM classifiers are generally used as concept classifiers due to their good performance in classifying generic videos, and RBF-like kernels (Eqn. (2)) have been found to provide state-of-the-art performance in several semantic concept classification tasks [2], [3]:

$$K(x_i, x_j) = \exp\{-\gamma D(x_i, x_j)\}. \quad (2)$$

For example, the chi-square RBF kernel usually performs well with histogram-like features [3], [17], where the distance $D(x_i, x_j)$ in Eqn. (2) is the chi-square distance. It is not trivial, however, to directly learn the optimal weights $v_k$ ($k = 1, \ldots, K$) in the SVM optimization setting, due to the exponential function in RBF-like kernels. In this work, we formulate an iterative QP problem to learn the optimal weights $v_k$. The basic idea is to adopt the LMNN setting for distance metric learning [12]. The rationale is that the role of the large margin in LMNN is inspired by its role in SVMs, so LMNN should inherit various strengths of SVMs [15]. Therefore, although we do not directly optimize the weights $v_k$ in the SVM optimization setting, the final optimal weights can still provide reasonably good performance for SVM-based concept classifiers.
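As a concrete illustration of Eqns. (1) and (2), a minimal sketch that assembles the combined distance and the RBF-like kernel from precomputed per-grouplet distance matrices; the array shapes and names are assumptions:

import numpy as np

def combined_kernel(D_grouplet, v, gamma=1.0):
    """Eqns. (1)-(2): D_grouplet is a (K, N, N) array of per-grouplet
    distance matrices, v the learned nonnegative weights. Returns the
    N x N kernel matrix for an SVM with a precomputed kernel."""
    D = np.tensordot(v, D_grouplet, axes=1)   # weighted sum over grouplets
    return np.exp(-gamma * D)

# e.g., with scikit-learn: SVC(kernel="precomputed").fit(combined_kernel(...), y)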

A. The LMNN Formulation

Let $d_M^2(x_i, x_j)$ denote the Mahalanobis distance between two data points $x_i$ and $x_j$:

$$d_M^2(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j), \quad (3)$$

where $M \succeq 0$ is a positive semidefinite matrix. LMNN aims to learn an optimal $M$ over a set of training data $(x_i, y_i)$, $i = 1, \ldots, N$, where $y_i \in \{1, \ldots, c\}$ and $c$ is the number of classes. For LMNN classification, the training process has two steps. First, $n_k$ similarly labeled target neighbors are identified for each training input $x_i$. The target neighbors are selected by using prior knowledge or by simply computing the $n_k$ nearest (similarly labeled) neighbors under the Euclidean distance. Let $\eta_{ij} = 1$ (or 0) denote that $x_j$ is a target neighbor of $x_i$ (or not). In the second step, the Mahalanobis distance metric is adapted so that these target neighbors are closer to $x_i$ than all other differently labeled inputs. The metric can be estimated by solving the following semidefinite programming problem:

$$\min_M \sum_{ij} \eta_{ij} d_M^2(x_i, x_j) + C \sum_{ijl} \eta_{ij} (1 - y_{il}) \epsilon_{ijl}, \quad (4)$$
$$\text{s.t.} \quad d_M^2(x_i, x_l) - d_M^2(x_i, x_j) \geq 1 - \epsilon_{ijl}, \quad \epsilon_{ijl} \geq 0, \quad M \succeq 0.$$

Here $y_{il} \in \{0, 1\}$ indicates whether the inputs $x_i$ and $x_l$ have the same class label, and $\epsilon_{ijl}$ is the amount by which a differently labeled input $x_l$ invades the “perimeter” around the input $x_i$ defined by its target neighbor $x_j$.

B. Our Approach

Following the idea of LMNN, and defining $\mathbf{v} = [v_1, \ldots, v_K]^T$ and $\mathbf{D}(x_i, x_j) = [D_1^G(x_i, x_j), \ldots, D_K^G(x_i, x_j)]^T$, we obtain the following problem:

$$\min_{\mathbf{v}} J = \min_{\mathbf{v}} \left\{ \frac{\|\mathbf{v}\|_2^2}{2} + C_0 \sum_{ij} \eta_{ij} \mathbf{v}^T \mathbf{D}(x_i, x_j) + C \sum_{ijl} \eta_{ij} (1 - y_{il}) \epsilon_{ijl} \right\}, \quad (5)$$
$$\text{s.t.} \quad \mathbf{v}^T \mathbf{D}(x_i, x_l) - \mathbf{v}^T \mathbf{D}(x_i, x_j) \geq 1 - \epsilon_{ijl}, \quad \epsilon_{ijl} \geq 0, \quad v_k \geq 0.$$

$\|\mathbf{v}\|_2^2$ is the L2 regularization term that controls the complexity of $\mathbf{v}$. By introducing Lagrange multipliers $\mu_{ijl} \geq 0$, $\gamma_{ijl} \geq 0$, and $\sigma_k \geq 0$, we have:

$$\min_{\mathbf{v}} J = \min_{\mathbf{v}} \left\{ \frac{\|\mathbf{v}\|_2^2}{2} + C_0 \sum_{ij} \eta_{ij} \mathbf{v}^T \mathbf{D}(x_i, x_j) + C \sum_{ijl} \eta_{ij} (1 - y_{il}) \epsilon_{ijl} \right.$$
$$\left. - \sum_{ijl} \mu_{ijl} \eta_{ij} \left[ \mathbf{v}^T \mathbf{D}(x_i, x_l) - \mathbf{v}^T \mathbf{D}(x_i, x_j) - 1 + \epsilon_{ijl} \right] - \sum_{ijl} \gamma_{ijl} \eta_{ij} \epsilon_{ijl} - \sum_k \sigma_k v_k \right\}. \quad (6)$$

Next, we obtain:

$$\frac{\partial J}{\partial \epsilon_{ijl}} = 0 \implies C \eta_{ij} (1 - y_{il}) - \mu_{ijl} \eta_{ij} - \gamma_{ijl} \eta_{ij} = 0. \quad (7)$$

That is, for any pair of $x_i$ and its target neighbor $x_j$, since we only consider $x_l$ with $y_{il} = 0$, we have $0 \leq \mu_{ijl} \leq C$. Based on Eqn. (7), Eqn. (6) becomes:

$$\min_{\mathbf{v}} J = \min_{\mathbf{v}} \left\{ \frac{1}{2} \|\mathbf{v}\|_2^2 + C_0 \sum_{ij} \eta_{ij} \mathbf{v}^T \mathbf{D}(x_i, x_j) - \sum_{ijl} \mu_{ijl} \eta_{ij} \left[ \mathbf{v}^T \mathbf{D}(x_i, x_l) - \mathbf{v}^T \mathbf{D}(x_i, x_j) - 1 \right] - \mathbf{v}^T \boldsymbol{\sigma} \right\}, \quad (8)$$

where $\boldsymbol{\sigma} = [\sigma_1, \ldots, \sigma_K]^T$. Thus,

$$\frac{\partial J}{\partial \mathbf{v}} = 0 \implies \mathbf{v} = \sum_{ijl} \mu_{ijl} \eta_{ij} \left[ \mathbf{D}(x_i, x_l) - \mathbf{D}(x_i, x_j) \right] + \boldsymbol{\sigma} - C_0 \sum_{ij} \eta_{ij} \mathbf{D}(x_i, x_j). \quad (9)$$

Define the set $P$ as the set of index triples $(i, j, l)$ that satisfy $\eta_{ij} = 1$, $y_{il} = 0$, and the condition that $x_l$ invades the “perimeter” around the input $x_i$ defined by its target neighbor $x_j$, i.e., $0 \leq \mathbf{v}^T \mathbf{D}(x_i, x_l) - \mathbf{v}^T \mathbf{D}(x_i, x_j) \leq 1$. Define the set $Q$ as the set of index pairs $(i, j)$ that satisfy $\eta_{ij} = 1$. We can then use $\mu_p$, $p \in P$, to replace the original notation $\mu_{ijl}$; use $\mathbf{D}_p^P$, $p \in P$, to replace the corresponding $\mathbf{D}(x_i, x_l) - \mathbf{D}(x_i, x_j)$; and use $\mathbf{D}_q^Q$, $q \in Q$, to replace the corresponding $\mathbf{D}(x_i, x_j)$. Define $\mathbf{u} = [\mu_1, \ldots, \mu_{|P|}]^T$, the $|P| \times K$ matrix $\mathbf{D}_P = [\mathbf{D}_1^P, \ldots, \mathbf{D}_{|P|}^P]^T$, and the $|Q| \times K$ matrix $\mathbf{D}_Q = [\mathbf{D}_1^Q, \ldots, \mathbf{D}_{|Q|}^Q]^T$. Through some derivation, we can obtain the dual of Eqn. (8) as follows:

$$\max_{\boldsymbol{\sigma}, \mathbf{u}} \left\{ -\frac{1}{2} \mathbf{u}^T \mathbf{D}_P \mathbf{D}_P^T \mathbf{u} + C_0 \mathbf{u}^T \mathbf{D}_P \mathbf{D}_Q^T \mathbf{1}_Q + \mathbf{u}^T \mathbf{1}_P - \frac{1}{2} \boldsymbol{\sigma}^T \boldsymbol{\sigma} - \mathbf{u}^T \mathbf{D}_P \boldsymbol{\sigma} + C_0 \boldsymbol{\sigma}^T \mathbf{D}_Q^T \mathbf{1}_Q \right\}, \quad (10)$$

where $\mathbf{1}_Q$ is a $|Q|$-dim vector of all ones, and $\mathbf{1}_P$ is a $|P|$-dim vector of all ones. When $\boldsymbol{\sigma}$ is fixed, Eqn. (10) can be rewritten as the following QP problem:

$$\max_{\mathbf{u}} \left\{ -\frac{1}{2} \mathbf{u}^T \mathbf{D}_P \mathbf{D}_P^T \mathbf{u} + \mathbf{u}^T \left( C_0 \mathbf{D}_P \mathbf{D}_Q^T \mathbf{1}_Q + \mathbf{1}_P - \mathbf{D}_P \boldsymbol{\sigma} \right) \right\}, \quad \text{s.t.} \quad 0 \leq \mu_p \leq C \ \ \forall p \in P. \quad (11)$$

On the other hand, when $\mathbf{u}$ is fixed, Eqn. (10) turns into the following QP problem:

$$\max_{\boldsymbol{\sigma}} \left\{ -\frac{1}{2} \boldsymbol{\sigma}^T \boldsymbol{\sigma} + \boldsymbol{\sigma}^T \left( C_0 \mathbf{D}_Q^T \mathbf{1}_Q - \mathbf{D}_P^T \mathbf{u} \right) \right\}, \quad \text{s.t.} \quad \sigma_k \geq 0 \ \ \forall k = 1, \ldots, K. \quad (12)$$

Therefore, we can iteratively solve the QP problems of Eqn. (11) and Eqn. (12) and obtain the desired weights $\mathbf{v}$ through Eqn. (9).
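A minimal numerical sketch of this alternation follows. It exploits the fact that Eqn. (12) has the closed-form solution $\boldsymbol{\sigma} = \max(C_0 \mathbf{D}_Q^T \mathbf{1}_Q - \mathbf{D}_P^T \mathbf{u}, 0)$, and solves the box-constrained QP of Eqn. (11) by projected gradient ascent; a production implementation would instead call a dedicated QP solver and verify the KKT conditions:

import numpy as np

def learn_weights(DP, DQ, C=1.0, C0=1.0, outer_iters=50, pg_steps=200):
    """Alternate between Eqns. (11) and (12), then recover v via Eqn. (9).
    DP is |P| x K (rows D_p^P), DQ is |Q| x K (rows D_q^Q)."""
    P, K = DP.shape
    dq1 = C0 * DQ.sum(axis=0)               # C0 * DQ^T 1_Q, shape (K,)
    A = DP @ DP.T                            # Hessian of the QP in Eqn. (11)
    lr = 1.0 / (np.linalg.norm(A, 2) + 1e-12)  # safe step size
    u, sigma = np.zeros(P), np.zeros(K)
    for _ in range(outer_iters):
        # Eqn. (11): maximize -0.5 u^T A u + u^T b over the box [0, C].
        b = DP @ (dq1 - sigma) + 1.0
        for _ in range(pg_steps):
            u = np.clip(u + lr * (b - A @ u), 0.0, C)
        # Eqn. (12): closed-form nonnegative projection.
        sigma = np.maximum(dq1 - DP.T @ u, 0.0)
    return DP.T @ u + sigma - dq1            # Eqn. (9)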

Each of these QP problems has a positive definite Hessian (or a positive semidefinite one that can be made positive definite by standard practical tricks), so it can be solved relatively efficiently in polynomial time.

C. Grouplet-based Kernels

One of the most intuitive kernels that incorporates the grouplet information is the grouplet-based chi-square RBF kernel, in which each $D_k^G(x_i, x_j)$ is a chi-square distance:

$$D_k^G(x_i, x_j) = \sum_{w_m \in G_k} \frac{\left[ f_{w_m}(x_i) - f_{w_m}(x_j) \right]^2}{\frac{1}{2} \left[ f_{w_m}(x_i) + f_{w_m}(x_j) \right]}, \quad (13)$$

where $f_{w_m}(x_i)$ is the feature of $x_i$ corresponding to the codeword $w_m$ in grouplet $G_k$. When $v_k = 1$ for all $k = 1, \ldots, K$, Eqn. (13) yields the standard chi-square RBF kernel.
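Eqn. (13) transcribes directly to code, assuming each grouplet is stored as an array of codeword indices into the full BoW feature vector; the small epsilon guarding against empty bins is an implementation detail, not part of the paper:

import numpy as np

def grouplet_chi2(f_i, f_j, grouplet, eps=1e-12):
    """Eqn. (13): chi-square distance between the BoW features of two
    videos, restricted to the codeword indices of one grouplet."""
    a, b = f_i[grouplet], f_j[grouplet]
    return np.sum((a - b) ** 2 / (0.5 * (a + b) + eps))

# grouplet = np.array([3, 17, 42]) would select codewords #3, #17, #42.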

From another perspective, we can treat each grouplet as a phrase consisting of the orderless codewords associated with that grouplet. Analogous to measuring the similarity between two text segments, we should take word specificity [16] into account when measuring the similarity between data points. One popular way of computing word specificity is to use the inverse document frequency (idf). Therefore, we use the following metric to compute the distance $D_k^G(x_i, x_j)$:

$$D_k^G(x_i, x_j) = \frac{1}{\sum_{w_m \in G_k} \mathrm{idf}(w_m)} \sum_{w_m \in G_k} \mathrm{idf}(w_m) \frac{\left[ f_{w_m}(x_i) - f_{w_m}(x_j) \right]^2}{\frac{1}{2} \left[ f_{w_m}(x_i) + f_{w_m}(x_j) \right]}. \quad (14)$$

Here $\mathrm{idf}(w_m)$ is computed as the total number of occurrences of all codewords in the training corpus divided by the total number of occurrences of codeword $w_m$ in the training corpus:

$$\mathrm{idf}(w_m) = \left[ \sum_{w_{m'}} \sum_{x} f_{w_{m'}}(x) \right] \Big/ \left[ \sum_{x} f_{w_m}(x) \right]. \quad (15)$$
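Eqns. (14) and (15) can be sketched the same way; F, the stacked training BoW matrix, and the smoothing constant are assumptions:

import numpy as np

def idf_weights(F):
    """Eqn. (15): F is the N x M matrix of training BoW features.
    idf of a codeword = total mass of all codewords / mass of that one."""
    per_word = F.sum(axis=0)           # occurrences of each codeword
    return per_word.sum() / (per_word + 1e-12)

def grouplet_idf_chi2(f_i, f_j, grouplet, idf, eps=1e-12):
    """Eqn. (14): idf-weighted, idf-normalized chi-square over a grouplet."""
    a, b, w = f_i[grouplet], f_j[grouplet], idf[grouplet]
    chi2 = (a - b) ** 2 / (0.5 * (a + b) + eps)
    return np.sum(w * chi2) / w.sum()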

Using either the chi-square distance of Eqn. (13) or the idf-weighted chi-square distance of Eqn. (14), the distance metric learning method developed in Section III-B can be applied to find the optimal metric and compute the optimal kernels for concept classification. Finally, as described in Section II, we have four types of grouplets extracted by studying four types of audio-visual temporal correlations. The distance metric learning algorithm of Section III-B can be applied to each type of grouplet individually, yielding four types of optimal kernels. The Multiple Kernel Learning technique [20] is then adopted to combine the four kernels for final concept detection.

IV. EXPERIMENTS

We evaluate our method over the large-scale CCV set [3], which contains 9317 consumer videos from YouTube. These videos were captured by ordinary users under unrestricted challenging conditions, without post-editing. Each video is manually labeled with 20 semantic concepts by using Amazon Mechanical Turk.

[Figure 2 appears here: four bar charts of per-concept AP and MAP comparing five approaches (χ2-RBF, agg-BoW, idf-χ2-RBF, w-χ2-RBF, w-idf-χ2-RBF) over the 20 CCV concepts, for (a) the visual-foreground-audio-foreground AVG, (b) the visual-background-audio-background AVG, (c) the visual-foreground-audio-background AVG, and (d) the visual-background-audio-foreground AVG.]

Figure 2. Performance comparison of different approaches using individual types of grouplets.

Our experiments use similar settings to [11]: we use the same training (4659 videos) and test (4658 videos) sets, and similar AVGs generated with the same procedure. Performance is measured by Average Precision (AP, the area under the un-interpolated PR curve) and Mean AP (MAP, the AP averaged across concepts).

We compare the classification performance of five approaches: the aggregated BoW ("agg-BoW") used in [11]; the standard chi-square RBF kernel that does not use any grouplet information ("χ2-RBF"); the chi-square RBF kernel that uses the idf information but no grouplet information ("idf-χ2-RBF"); the weighted chi-square RBF kernel with distance metric learning that uses the grouplets ("w-χ2-RBF"); and the weighted chi-square RBF kernel with distance metric learning that uses both the idf information and the grouplets ("w-idf-χ2-RBF").

Figures 2(a)-(d) give the performance comparison of the different kernels using individual types of grouplets, i.e., the visual-foreground-audio-foreground AVG, the visual-background-audio-background AVG, the visual-foreground-audio-background AVG, and the visual-background-audio-foreground AVG, respectively. The figures show that by finding appropriate weights of AVGs through our distance metric learning, we can consistently improve the detection performance.

For example, for all four types of AVGs, "w-χ2-RBF" works better than "χ2-RBF" on average, and "w-idf-χ2-RBF" outperforms "idf-χ2-RBF". In contrast, compared with the standard "χ2-RBF", the naive "agg-BoW" method used in [11] only marginally improves the overall MAP over the visual-foreground-audio-foreground AVG, the visual-background-audio-background AVG, and the visual-foreground-audio-background AVG, and does not achieve better performance over the visual-background-audio-foreground AVG. As discussed in the previous sections, this is because in generating aggregated grouplet-based features, the naive "agg-BoW" method loses the important grouplet structure information and therefore cannot take full advantage of the AVGs. In comparison, the advantages of our "w-idf-χ2-RBF" are quite apparent: it performs best over almost every concept across all four types of AVGs. Compared with the naive "agg-BoW" from [11], "w-idf-χ2-RBF" improves the final MAP by 26%, 15%, 14%, and 32% on a relative basis over the visual-foreground-audio-foreground AVG, the visual-background-audio-background AVG, the visual-foreground-audio-background AVG, and the visual-background-audio-foreground AVG, respectively.

Figure 3 gives the performance comparison when we combine the various types of grouplets.

[Figure 3 appears here: a bar chart of per-concept AP and MAP comparing χ2-RBF, agg-BoW, and w-idf-χ2-RBF when kernels computed over different types of grouplets are combined.]

Figure 3. Performance comparison of combining kernels that are computed over different types of grouplets.

For both the standard chi-square RBF kernel ("χ2-RBF") and our weighted chi-square RBF kernel using the idf information and grouplet structure ("w-idf-χ2-RBF"), multiple kernel learning is applied to find the optimal weights for combining the kernels computed over individual grouplets. Again, the aggregated BoW ("agg-BoW") is the method used in [11], where features computed based on the various AVGs are concatenated into a single vector to train classifiers. The figure shows that our "w-idf-χ2-RBF" consistently and significantly outperforms both the standard "χ2-RBF" and the naive "agg-BoW" over all concepts. For example, compared with "χ2-RBF", which does not use the grouplet information, "w-idf-χ2-RBF" improves the AP of every concept by more than 5% on a relative basis, and for 15 out of the 20 concepts the improvements are more than 10%. Again, the naive "agg-BoW" only marginally improves the performance compared with "χ2-RBF". Finally, "w-idf-χ2-RBF" improves the overall MAP by 14% and 10%, respectively, compared with "χ2-RBF" and "agg-BoW".

V. CONCLUSION

We developed a distance metric learning algorithm that uses the grouplet structure for effective concept classification in unconstrained videos. Based on the LMNN setting, the algorithm solves an iterative QP problem to find the appropriate weights for combining individual grouplet-based distances for optimal classification. Compared with the traditional method that generates aggregated grouplet-based features, we directly learn distances between data points, which better preserves the grouplet structure. Specifically, we proposed a grouplet-based distance using the chi-square distance and word specificity, which achieves consistent and significant detection performance improvements over the large-scale CCV set. Future work includes exploring other types of regularization, e.g., L1 regularization of the weights $\mathbf{v}$. The L2 norm is used in this work to prevent sparse solutions, due to the relatively small number of grouplets in our experiments; for tasks with a large number of grouplets, the L1 norm may be a better choice.

REFERENCES

[1] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," IEEE CVPR, 2008.
[2] A.F. Smeaton, P. Over, and W. Kraaij, "Evaluation campaigns and TRECVid," ACM MIR, pp. 321-330, 2006.
[3] Y.G. Jiang, G.N. Ye, S.F. Chang, D. Ellis, and A.C. Loui, "Consumer video understanding: A benchmark database and an evaluation of human and machine performance," ACM ICMR, 2011.
[4] K. Van de Sande, T. Gevers, and C.G.M. Snoek, "Evaluating color descriptors for object and scene recognition," IEEE PAMI, vol. 32, no. 9, pp. 1582-1596, 2010.
[5] D.G. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, vol. 60, no. 2, pp. 91-110, 2004.
[6] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," IEEE CVPR, 2005.
[7] C. Cotton, D. Ellis, and A. Loui, "Soundtrack classification by transient events," IEEE ICASSP, 2011.
[8] O. Enqvist, K. Josephson, and F. Kahl, "Optimal correspondences from pairwise constraints," IEEE ICCV, 2009.
[9] S. Divvala, D. Hoiem, J. Hays, A. Efros, and M. Hebert, "An empirical study of context in object detection," IEEE CVPR, 2009.
[10] B. Yao and L. Fei-Fei, "Grouplet: A structured image representation for recognizing human and object interactions," IEEE CVPR, 2010.
[11] W. Jiang and A.C. Loui, "Audio-visual grouplet: Temporal audio-visual interactions for general video concept classification," ACM Multimedia, 2011.
[12] K. Weinberger and L. Saul, "Distance metric learning for large margin nearest neighbor classification," JMLR, vol. 10, pp. 207-244, 2009.
[13] A. Globerson and S. Roweis, "Metric learning by collapsing classes," NIPS, pp. 451-458, 2006.
[14] J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon, "Information-theoretic metric learning," ICML, pp. 209-216, 2007.
[15] B. Schölkopf and A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA: MIT Press, 2002.
[16] R. Mihalcea, C. Corley, and C. Strapparava, "Corpus-based and knowledge-based measures of text semantic similarity," AAAI, pp. 775-780, 2006.
[17] Y.G. Jiang, C.W. Ngo, and J. Yang, "Towards optimal bag-of-features for object categorization and semantic video retrieval," ACM CIVR, pp. 494-501, 2007.
[18] C. Granger, "Investigating causal relations by econometric models and cross-spectral methods," Econometrica, vol. 37, no. 3, pp. 424-438, 1969.
[19] A. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," NIPS, pp. 849-856, 2001.
[20] M. Varma and B.R. Babu, "More generality in efficient multiple kernel learning," ICML, pp. 1065-1072, 2009.
