A Procedure of Adaptive Kernel Combination with Kernel-Target Alignment for Object Classification

Motoaki Kawanabe
Fraunhofer Institute FIRST.IDA, Kekuléstr. 7, Berlin, Germany
Technical University of Berlin, Franklinstr. 28, Berlin, Germany
nabe@first.fhg.de

Shinichi Nakajima
Nikon Corporation, Tokyo, Japan
[email protected]

Alexander Binder
Fraunhofer Institute FIRST, Kekuléstr. 7, Berlin, Germany
Technical University of Berlin, Franklinstr. 28, Berlin, Germany
binder@first.fhg.de
ABSTRACT
In order to achieve good performance in object classification, it is necessary to combine information from various image features. Because large margin classifiers are constructed based on similarity measures between samples called kernels, finding appropriate feature combinations boils down to designing good kernels among a set of candidates, for example, positive mixtures of predetermined base kernels. There are several ways to determine the mixing weights of multiple kernels: (a) uniform weights, (b) a brute force search over a validation set, and (c) multiple kernel learning (MKL). MKL is theoretically and technically very attractive, because it learns the kernel weights and the classifier simultaneously based on the margin criterion. In practice, however, we often observe that the support vector machine (SVM) with the average kernel works at least as well as MKL. In this paper, we propose an alternative two-step approach: first, the kernel weights are determined by optimizing the kernel-target alignment score, and then the combined kernel is used by a standard single-kernel SVM. Experimental results on the VOC 2008 data set [8] show that our simple procedure outperforms the average kernel and MKL.
Categories and Subject Descriptors: I.4.8 [Computing Methodologies]: Image Processing and Computer Vision—Scene Analysis
General Terms: Measurement, Performance
Keywords: Object Classification, Multiple Kernel Learning, Kernel-Target Alignment
1. INTRODUCTION
In computer vision, it is necessary to combine image representations of various types (color, texture, shape, etc.) in order to achieve good performance. Moreover, the importance of image features changes significantly between tasks. For example, in object classification, color information is not relevant for the car class, while it helps substantially in recognizing stop signs. Therefore, techniques for combining many features while adapting their contributions are useful. We approach object classification problems with machine learning techniques. In the last decades, kernel machines have been investigated extensively and have been widely and successfully applied to practical problems in various fields [18]. Kernels define similarities between data points and play a central role in such techniques. Although generic kernels such as polynomials and Gaussians were used at the beginning, designing good kernels for the data set at hand is essential to achieve better performance. On the one hand, many fine-tuned kernels incorporating prior assumptions and domain knowledge have been proposed [10, 23]. On the other hand, a general framework called multiple kernel learning (MKL) was introduced by Lanckriet et al. (2004) [12] to construct appropriate kernels adaptively, and it has been developed further in terms of both theory and algorithms [2, 19, 17, 22]. More specifically, from positive mixtures of predetermined base kernels, MKL selects the optimal mixing weights and simultaneously learns the classifier by maximizing the margin. In computer vision, MKL was recently applied to object classification in order to combine various image descriptors optimally [20]. Currently, the feature or kernel combination methods used for object classification are: (a) uniform weights after an appropriate normalization [13, 21], (b) a brute force search over a validation set [3, 4], and (c) MKL [20]. Although the average kernel (a) is not adaptive, it often works reasonably well. The best combination found by a validation search (b) outperforms the others, but it is computationally intractable unless the number of base kernels is small. MKL (c) gives adaptive kernel weights within a reasonable amount of time. It is also theoretically and technically attractive, because the kernel weights and the classifier can be learned simultaneously under the margin criterion. However, it often cannot outperform the average kernel, because it relies on the ℓ1 regularizer, which yields overly sparse weights. Motivated by
this observation, we propose in this paper a method which outputs less sparse solutions. The weights are determined adaptively by optimizing the kernel-target alignment [5], which measures the goodness of kernels, prior to learning the classifier with the support vector machine (SVM). Through experiments with an image data set from the PASCAL visual object classification (VOC) challenge 2008 [8], we show that our simple two-step approach is at least as good as the average kernel and ℓ1 MKL in all object classes. This paper is organized as follows. In Section 2, we briefly explain the relevant techniques, MKL and the kernel-target alignment. The optimization of the alignment score over mixtures of base kernels and a simple procedure that approximates its solution are described in Section 3. Experimental results with the VOC 2008 data set are presented in Section 4, followed by conclusions and future directions in Section 5.
2. RELATED RESEARCH

2.1 Multiple Kernel Learning
Let K_1, . . . , K_m be a set of base kernels obtained from different features. In the MKL framework, linear combinations of the base kernels, i.e.
$$K_\beta = \sum_{j=1}^{m} \beta_j K_j,$$
are considered, and the mixing parameter β = (β_j) is learned together with the model parameters so as to maximize the generalization ability. Lanckriet et al. (2004) [12] considered two kinds of constraints on β and the combined kernel K. In the most general form, the class of positive semi-definite matrices K ⪰ 0 with tr(K) ≤ c is investigated, which leads to an expensive semi-definite programming (SDP) problem. Therefore, the additional constraints β_j ≥ 0 are usually imposed, and the optimization problem can then be solved by a more tractable quadratically constrained quadratic program (QCQP).

Suppose that ψ_j is the feature mapping from the input space X to a reproducing kernel Hilbert space H_j which gives rise to the j-th base kernel function via k_j(x, x̄) = ⟨ψ_j(x), ψ_j(x̄)⟩_{H_j}. In the case of learning with multiple kernels, the large margin classifier is extended as
$$f(x) = \sum_{j=1}^{m} \beta_j\, w_j^\top \psi_j(x) + b = \sum_{j=1}^{m} v_j^\top \psi_j(x) + b,$$
where v_j denotes the directional parameter w_j for the j-th feature multiplied by the kernel weight β_j. MKL with the ℓ1 regularizer optimizes the following problem [22]:
$$\min_{\beta, v, b, \xi}\; \frac{1}{2}\sum_{j=1}^{m}\frac{v_j^\top v_j}{\beta_j} + C\|\xi\|_1$$
$$\text{s.t.}\quad y_i\Big(\sum_{j=1}^{m} v_j^\top \psi_j(x_i) + b\Big) \ge 1 - \xi_i \;\; \forall i; \qquad \xi \ge 0; \quad \beta \ge 0; \quad \|\beta\|_1 \le 1.$$
Here, ∥β∥_1 = Σ_{j=1}^{m} β_j is the ℓ1 norm of the vector β. This optimization problem can also be solved as a semi-infinite program (SIP) in the dual formulation [19]:
$$\min_{\lambda, \beta}\; \lambda \quad \text{s.t.}\quad \lambda \ge \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{j=1}^{m}\beta_j \sum_{i,l=1}^{n}\alpha_i\alpha_l\, y_i y_l\, k_j(x_i, x_l) \quad \forall \alpha \in \mathbb{R}^n \text{ with } 0 \le \alpha_i \le C \;\forall i,\; \sum_{i=1}^{n} y_i\alpha_i = 0; \tag{1}$$
$$\beta_j \ge 0 \;\forall j; \qquad \|\beta\|_1 \le 1.$$
The solution can be obtained with an interleaving cutting-plane algorithm, i.e. by iterating the following two steps alternately.
− For the current mixture β, the solution of the regular SVM in the second line of eq. (1) generates the most strongly violated constraint.
− With respect to the set of active constraints, the optimal values of β and λ are identified by solving a linear program.
An implementation of MKL is available in the Shogun toolbox [19].
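To make the kernel-combination setup concrete, the following minimal sketch (Python with numpy and scikit-learn, not the Shogun-based code used in this paper) builds the combined Gram matrix K_β from precomputed base kernels and trains a standard SVM on it; the toy data, function names, and weights are illustrative placeholders.

```python
import numpy as np
from sklearn.svm import SVC

def combine_kernels(kernels, beta):
    """Weighted sum K_beta = sum_j beta_j * K_j of precomputed Gram matrices."""
    return sum(b * K for b, K in zip(beta, kernels))

# Toy stand-ins for the base kernels K_1, ..., K_m computed from different features.
rng = np.random.RandomState(0)
n, m = 60, 3
feats = [rng.rand(n, 8) for _ in range(m)]        # one feature representation per kernel
kernels = [F @ F.T for F in feats]                # linear base kernels, each K_j >= 0
y = np.where(rng.rand(n) > 0.5, 1, -1)

# (a) uniform weights: the "average kernel" baseline.
beta_uniform = np.ones(m) / m
K_avg = combine_kernels(kernels, beta_uniform)

# A standard single-kernel SVM trained on the precomputed combined kernel.
svm = SVC(C=1.0, kernel="precomputed").fit(K_avg, y)
train_pred = svm.predict(K_avg)                   # at test time, pass K(test, train) instead
```

MKL would additionally learn β inside the margin optimization; the procedure proposed in Section 3 instead fixes β beforehand and then reuses exactly this single-kernel SVM step.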
However, the above MKL framework tends to output sparse solutions; many elements of β are suppressed to 0. This is due to the nature of the ℓ1 regularizer [18]. Sparsity often provides some advantage, but it sometimes causes poorer performance. As shown in Section 4, MKL is outperformed even by the average kernel in our application. In the following sections, we propose a simple procedure which outputs less sparse solutions and outperforms both MKL and the average kernel.
2.2 Kernel-Target Alignment
Cristianini et al. (2001) [5] introduced the notion of kernel alignment, a measure of similarity between two kernel functions, or between a kernel and a target function. Let K_1 and K_2 be the Gram matrices of kernel functions k_1 and k_2 for a set {x_i}_{i=1}^{n} of inputs. Then, the alignment between the kernels k_1 and k_2 is defined as the cosine of the angle between the two matrices K_1 and K_2:
$$A(K_1, K_2) := \frac{\langle K_1, K_2\rangle_F}{\|K_1\|_F\,\|K_2\|_F}, \tag{2}$$
where $\langle K_1, K_2\rangle_F := \sum_{i,j=1}^{n} k_1(x_i, x_j)\,k_2(x_i, x_j)$ and $\|K_1\|_F := \langle K_1, K_1\rangle_F^{1/2}$ are the standard inner product and the Frobenius norm in matrix space, respectively. Let us take as K_2 the ideal kernel/similarity
$$L = (L_{ij}), \qquad L_{ij} := \begin{cases} 1, & y_i = y_j \\ 0, & \text{otherwise}, \end{cases}$$
which perfectly reproduces the correct clustering defined by the labels {y_i}_{i=1}^{n}. Then, the alignment A(K, L) measures how well the kernel K fits the given learning task based on the training samples {x_i, y_i}_{i=1}^{n}. Cristianini et al. (2001) [5] proposed a procedure to update kernels (Gram matrices) by optimizing the alignment A(K, L), where the definition of the ideal kernel L was slightly different.

The original definition (2) of the alignment is rather problematic if the class sizes are unbalanced. Therefore, it is recommended to center the kernels (Gram matrices) before computing the alignment score. Centering in the corresponding feature spaces is achieved by multiplying Gram matrices K from both sides by
$$H := I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top,$$
where I is the identity matrix of size n and 1 is the column vector with all elements equal to 1. In this paper we also use the alignment with centering,
$$A(HKH, HLH) = \frac{\langle HKH, HLH\rangle_F}{\|HKH\|_F\,\|HLH\|_F}, \tag{3}$$
as the measure for evaluating the goodness of kernels, where ⟨HKH, HLH⟩_F = tr(HKHL), because H is a projection matrix. In binary classification problems, the centered ideal kernel HLH is proportional to ỹỹ^⊤, where
$$\tilde{y} = (\tilde{y}_i), \qquad \tilde{y}_i := \begin{cases} \frac{1}{n_+}, & y_i = +1 \\ -\frac{1}{n_-}, & y_i = -1, \end{cases}$$
and n_+ and n_- are the sizes of the positive and negative classes, respectively.
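For reference, a small numpy sketch of the centered kernel-target alignment of eq. (3); the function name and its interface are illustrative assumptions, not code from the paper.

```python
import numpy as np

def centered_alignment(K, y):
    """Centered kernel-target alignment A(HKH, HLH) of eq. (3).

    K : (n, n) Gram matrix; y : array of labels in {-1, +1}.
    """
    n = len(y)
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix H = I - (1/n) 11^T
    L = (y[:, None] == y[None, :]).astype(float)   # ideal kernel: L_ij = 1 if y_i = y_j
    Kc, Lc = H @ K @ H, H @ L @ H
    return np.sum(Kc * Lc) / (np.linalg.norm(Kc) * np.linalg.norm(Lc))
```

Higher values indicate a Gram matrix whose induced similarity structure matches the class labels more closely.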
3. TUNING KERNEL WEIGHTS BY KERNEL-TARGET ALIGNMENT

3.1 The Optimization Problem
MKL solves kernel selection (learning β) and classification (learning α) simultaneously by optimizing the margin criterion. In contrast, we use the kernel-target alignment to determine the mixing weights β separately, prior to learning the classifier. Lanckriet et al. (2004) [12] proposed using the alignment score for learning the kernel weights, together with an optimization procedure that solves a quadratically constrained quadratic program (QCQP). Suppose that K is the set of Gram matrices from which adaptive kernels are selected for the final classification task. The kernel matrix K ∈ K which is maximally aligned with the set of modified labels y ∈ R^n can be found by solving the following optimization problem:
$$\max_{A, K}\; \langle K, yy^\top\rangle_F \quad \text{s.t.}\quad \mathrm{tr}(A) \le 1, \quad \begin{pmatrix} A & K^\top \\ K & I_n \end{pmatrix} \succeq 0, \quad K \in \mathcal{K}. \tag{4}$$
We remark that, when K is the set of all positive semi-definite matrices, i.e. K = {K ⪰ 0}, this is an SDP. In this case, the optimization problem (4) has the trivial solution K ∝ yy^⊤, which achieves the maximal alignment score of 1. If we restrict ourselves to linear mixtures K = Σ_{j=1}^{m} β_j K_j of base kernels as in MKL, there are two options, i.e. with or without further positivity constraints β_j ≥ 0 for all j. With the positivity constraints and the assumption K_j ⪰ 0 for all j, the optimization problem simplifies to a QCQP:
$$\max_{\beta}\; \sum_{j=1}^{m} \beta_j\,\langle K_j, yy^\top\rangle_F \quad \text{s.t.}\quad \sum_{i,j=1}^{m} \beta_i\beta_j\,\langle K_i, K_j\rangle_F \le 1, \quad \beta \ge 0. \tag{5}$$

3.2 An Approximation Procedure
Instead of the alignment score considered in Lanckriet et al. (2004) [12], we use its centered version (3), i.e.
$$A(\beta) := A(HK_\beta H, HLH), \qquad \text{where } K_\beta = \sum_{j=1}^{m}\beta_j K_j.$$
In addition, we consider an approximation procedure which is much simpler than the exact QCQP (5). Thanks to the simple form of the alignment (3), we can compute the optimal parameter β analytically if we ignore the positivity constraint β ≥ 0. Let us define the following vector and matrix:
$$b = (b_i), \quad b_i = \mathrm{tr}(HK_iHL), \qquad G = (G_{ij}), \quad G_{ij} = \mathrm{tr}(HK_iHK_j).$$
Then, equation (3) can be expressed as
$$A(\beta) = A(HK_\beta H, HLH) \propto \frac{b^\top\beta}{(\beta^\top G\beta)^{1/2}}.$$
Thus, the solution that maximizes the alignment without constraints can be obtained analytically as
$$\beta^* = \operatorname*{argmax}_{\beta} A(\beta) \propto G^{-1}b.$$
However, if the matrix G has a bad condition number, this solution is unstable and unreliable. Therefore, it is common to introduce a regularization term in the denominator of the alignment [1], i.e. to modify the objective as
$$A(\beta, r) := \frac{b^\top\beta}{\|HLH\|_F\,(\beta^\top G\beta + r\|\beta\|^2)^{1/2}}, \tag{6}$$
where r > 0 is a regularization parameter. This quantity is maximized at
$$\beta^*(r) = \operatorname*{argmax}_{\beta} A(\beta, r) \propto (G + rI)^{-1}b, \tag{7}$$
which still has an analytical form. The regularization parameter r can be determined by a validation scheme. Note that only one parameter is optimized by validation in our method, whereas all the kernel weights are optimized in a brute force search.

Since we have ignored the positivity constraint so far, some elements of β*(r) can be negative. In such cases, we fix the negative weights at 0 and recompute the unconstrained solution using only the remaining elements, analogously to eq. (7). We repeat this process until all weights are non-negative (see Algorithm 1). This additional procedure could result in a sparse solution. However, as will be shown in the next section, our method gives much less sparse solutions than MKL in practice. In summary, the differences from the short note in Lanckriet et al. (2004) are: (a) centering of the alignment score, (b) introducing a regularization parameter in the denominator to make the score robust, and (c) an approximation procedure (Algorithm 1) which gives reasonable solutions.
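The closed-form solution (7) together with the iterative sign correction of Algorithm 1 can be written in a few lines. The sketch below (numpy, hypothetical function name, not the authors' implementation) follows the description above.

```python
import numpy as np

def alignment_weights(kernels, y, r=1e-2):
    """Kernel weights maximizing the regularized centered alignment, eqs. (6)-(7),
    with negative entries removed iteratively as in Algorithm 1."""
    n, m = len(y), len(kernels)
    H = np.eye(n) - np.ones((n, n)) / n
    L = (y[:, None] == y[None, :]).astype(float)
    Kc = [H @ K @ H for K in kernels]                              # centered base kernels
    Lc = H @ L @ H
    b = np.array([np.sum(Ki * Lc) for Ki in Kc])                   # b_i = <HK_iH, HLH>_F
    G = np.array([[np.sum(Ki * Kj) for Kj in Kc] for Ki in Kc])    # G_ij = <HK_iH, HK_jH>_F

    beta = np.linalg.solve(G + r * np.eye(m), b)                   # unconstrained optimum (7)
    while np.any(beta < 0):                                        # Algorithm 1: clip and re-solve
        active = beta > 0                                          # keep only positive components
        beta = np.zeros(m)
        idx = np.flatnonzero(active)
        if idx.size == 0:                                          # everything clipped; stop
            break
        beta[idx] = np.linalg.solve(G[np.ix_(idx, idx)] + r * np.eye(idx.size), b[idx])
    return beta
```

The resulting weights would then be plugged into the combined kernel of Section 2.1 and a standard single-kernel SVM; r is the only hyperparameter left for validation.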
Figure 1: Example images of the VOC 2008 data set, one per class (aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor). An image can contain multiple objects, and some objects are difficult to detect (see, e.g., the "bottle" example: two bottles stand on a table behind two people).
4. EXPERIMENTS
In this section, we demonstrate the advantage of our procedure on the VOC 2008 data set [8]. The data set for the classification tasks contains 8780 images of 20 object classes, which were split into train, validation, and test sets by the organizers (2113 for train, 2227 for validation, and 4340 for test). For each class, the label is y_i = 1 if at least one object from the class is included in the i-th image, y_i = -1 if no such object is included, and y_i = 0 if only objects that are too difficult to find are included (this judgement, as well as the annotation of all images, was done by humans). The evaluation is based on precision-recall (PR) curves; the principal quantitative measure is the average precision (AP) over all recall values. Since the images labeled as y = 0 are not taken into account in the challenge, we also excluded them from the training and validation processes in our experiments.
We used the following four types of basic features. HoWg: histogram of visual words [6] based on the SIFT descriptor [15] in the gray channel (average over RGB); instead of interest point detectors, we used a regular grid with 10-pixel spacing, which has proved to be more powerful for classification tasks. HoWh: HoW in the hue color channel. HoG: histogram of oriented gradients [7]. Hocol: histogram of the hue color channel [16]. We also applied the image pyramid representation, where an image is tiled into regions at multiple resolutions. With a 1-level pyramid representation, for example, the histogram over the whole image (level 0) and the histograms within the 4 quarter images (level 1) are used. This captures the spatial layout of feature appearances to some extent and has shown good performance in image classification [14]. We applied the 2-level representation, i.e., 1, 4, and 16 tiles were considered. For each feature, we constructed a kernel
from each level of the pyramid, so that the kernel weights between levels can be controlled depending on the object class. In the end, we used 4 × 3 = 12 kernels. Our choice of kernel function is the χ² kernel, which has proved to be a suitable similarity measure between bag-of-words histograms [21]:
$$k(x, x') = \exp\left(-\frac{1}{2\gamma}\chi^2\big(h(x), h(x')\big)\right), \tag{8}$$
where
$$\chi^2(h, h') = \sum_{v=1}^{V}\frac{(h_v - h'_v)^2}{h_v + h'_v}.$$
Here, x and x' are images, h(x) = (h_1, . . . , h_V) is the histogram of image x with V bins, and γ is the width parameter of the χ² kernel. We set the width of each kernel to the mean of the χ² distances between all pairs of training samples [11]. We used the Shogun library for training both the SVM and the MKL-SVM [19].

Algorithm 1 Iterative Modification
Input: base kernels {K_i}, label y, regularization constant r
  Calculate the ideal kernel L from the labels
  Calculate all inner products G_ij = ⟨HK_iH, HK_jH⟩_F and b_i = ⟨HK_iH, HLH⟩_F
  Calculate the unconstrained solution β* = (G + rI)^{-1} b
  while β* has negative elements do
    Set the negative elements to 0
    Take the submatrix G̃ and the subvector b̃ corresponding to the positive components
    Calculate (G̃ + rI)^{-1} b̃ and substitute it into the corresponding elements of β*
  end while
Output: the mixing weights β*

Table 1: Average precisions (mean ± standard deviation over 10 random splits) on the test images. For each class, the best method and those comparable to it according to a Wilcoxon signed-rank test at the 5% significance level form the best group; in almost all classes, our method was among the best ones. Note that the APs are lower than in Table 2, since the number of training images (train+val) in the test phase is smaller than in the official split.

class        | uniform    | MKL        | proposed
average      | 40.8±1.0   | 40.8±0.9   | 41.5±0.8
aeroplane    | 66.4±6.6   | 66.9±6.8   | 66.2±6.6
bicycle      | 39.1±5.9   | 36.4±6.6   | 40.9±6.6
bird         | 43.3±5.8   | 44.1±5.7   | 44.4±5.6
boat         | 57.5±5.0   | 56.8±5.0   | 56.4±5.3
bottle       | 18.4±3.6   | 19.2±3.8   | 18.6±3.4
bus          | 42.3±9.1   | 39.3±10.6  | 43.7±8.7
car          | 48.9±3.3   | 49.0±2.8   | 48.9±3.1
cat          | 46.1±3.2   | 47.7±3.8   | 47.4±3.1
chair        | 43.0±4.7   | 44.1±4.9   | 42.9±5.1
cow          | 8.2±2.9    | 10.8±3.5   | 9.4±2.5
diningtable  | 29.5±9.1   | 27.1±7.0   | 28.4±8.7
dog          | 33.2±2.7   | 34.4±4.4   | 34.5±2.4
horse        | 42.5±6.5   | 39.6±5.8   | 42.7±7.4
motorbike    | 42.8±2.9   | 41.7±4.5   | 44.8±3.8
person       | 83.9±1.2   | 84.1±1.3   | 84.3±1.2
pottedplant  | 15.5±4.3   | 14.7±4.5   | 16.1±5.5
sheep        | 22.9±7.0   | 26.3±7.7   | 26.7±6.8
sofa         | 31.3±6.5   | 33.0±7.2   | 32.9±6.2
train        | 50.9±9.2   | 50.9±9.8   | 50.6±9.3
tvmonitor    | 51.0±5.7   | 50.8±5.5   | 51.1±6.3

In the first experiment, we prepared 10 repetitions of smaller
data sets with 2111 train, 1111 validation, and 1110 test images by randomly splitting the original train and validation sets (the ground truth for the official test set is not available). In the validation phase, classifiers were trained on the train set and the APs were evaluated on the validation set. The regularization parameter of the SVM (usually denoted by C) was optimized for each class in this phase. For the proposed method, the regularization parameter r in (6) was also optimized, chosen from {10^-4, 10^-3, . . . , 10^4}. In the test phase, we trained the classifiers on the train + validation sets with the parameters optimized in the validation phase and evaluated the APs on the test set. We repeated this procedure with the 10 different random splits. Table 1 shows the APs with standard deviations. In order to check the significance of the performance differences, we conducted a Wilcoxon signed-rank test over the 10 AP scores for each method and class (and for the average over all classes), and grouped together the best method and all methods whose performance is not significantly worse than it. Our proposed method is almost always included in the best group, while the uniform weights and MKL are sometimes significantly worse than the best. Figure 2 shows the kernel weights selected by MKL and by our method. By the nature of ℓ1-norm regularization, MKL tends to give sparse solutions, while our method gives less sparse solutions. In the second experiment, we evaluated the performance on the official split of the VOC 2008 challenge. The validation and test procedures were the same as in the first experiment. Table 2 shows the test APs evaluated by the organizers on request, which demonstrate the advantage of our method.
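For completeness, here is a small numpy sketch of the χ² kernel of eq. (8) with the mean-distance bandwidth heuristic described above; it is an illustrative re-implementation under those stated assumptions, not the code used for the experiments.

```python
import numpy as np

def chi2_kernel(H1, H2, gamma=None):
    """Chi-square kernel of eq. (8) between rows of two histogram matrices.

    H1 : (n1, V) and H2 : (n2, V) arrays of non-negative histograms.
    If gamma is None, it is set to the mean of the pairwise chi-square
    distances; in practice it should be fixed on the training set and
    reused at test time.
    """
    num = (H1[:, None, :] - H2[None, :, :]) ** 2
    den = H1[:, None, :] + H2[None, :, :]
    dist = np.where(den > 0, num / np.maximum(den, 1e-12), 0.0).sum(axis=-1)  # chi^2 distances
    if gamma is None:
        gamma = dist.mean()                      # bandwidth heuristic: mean pairwise distance
    return np.exp(-dist / (2.0 * gamma))
```

One such Gram matrix would be computed per feature type and pyramid level, giving the 12 base kernels that are then combined and fed to the SVM.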
Figure 2: Chosen (normalized) weights for the 12 kernels by MKL (upper panel) and by the proposed method (lower panel), one curve per object class. The indices on the horizontal axis correspond to the following kernels. 1–3: HoWg with pyramid levels 0, 1, and 2; 4–6: HoWh with pyramid levels 0, 1, and 2; 7–9: HoG with pyramid levels 0, 1, and 2; 10–12: Hocol with pyramid levels 0, 1, and 2. Our method gives less sparse weights than MKL.
Table 2: Average precisions on the test images of the official split. The classifier outputs were evaluated by the VOC challenge organizers on request. On average and in most classes, our proposed method gave the best AP scores.

class        | uniform | MKL  | proposed
average      | 44.5    | 43.2 | 45.8
aeroplane    | 70.9    | 71.5 | 72.0
bicycle      | 41.5    | 37.1 | 42.0
bird         | 49.6    | 45.7 | 51.3
boat         | 59.9    | 58.6 | 59.0
bottle       | 25.2    | 24.8 | 25.6
bus          | 34.2    | 37.6 | 39.8
car          | 51.0    | 48.5 | 50.8
cat          | 48.4    | 44.1 | 49.0
chair        | 45.4    | 43.1 | 44.0
cow          | 24.3    | 23.1 | 26.1
diningtable  | 34.2    | 28.7 | 34.3
dog          | 37.3    | 35.9 | 36.9
horse        | 54.4    | 51.6 | 57.1
motorbike    | 51.7    | 47.8 | 53.0
person       | 82.2    | 81.5 | 82.6
pottedplant  | 22.9    | 19.2 | 23.2
sheep        | 21.0    | 21.3 | 28.7
sofa         | 27.7    | 30.8 | 31.1
train        | 62.6    | 65.5 | 62.9
tvmonitor    | 46.5    | 47.3 | 46.6
5. CONCLUSIONS
In this paper, we proposed a procedure to tune the weights of multiple kernels adaptively based on a regularized kernel-target alignment score. The combined kernel determined by our method can then be used by a standard single-kernel SVM to obtain classification results. Through experiments with the image data set from the PASCAL visual object classification (VOC) challenge 2008 [8], we showed that our simple approach significantly outperforms the average kernel SVM and MKL. Future work includes investigating other frameworks which output less sparse kernel weights; for example, optimizing the Hilbert-Schmidt normalized information criterion [9] is a candidate.
6. ACKNOWLEDGMENTS
We acknowledge the THESEUS project (01MQ07018) funded by the German Federal Ministry of Economics and Technology. We also thank the organizers of the PASCAL VOC challenge, in particular Mark Everingham, for evaluating the test performance of our classification results.
7. REFERENCES
[1] F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.
[2] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the 21st International Conference on Machine Learning (ICML '04), volume 69 of ACM International Conference Proceeding Series, 2004.
[3] A. Bosch, A. Zisserman, and X. Muñoz. Representing shape with a spatial pyramid kernel. In Proceedings of the 6th ACM International Conference on Image and Video Retrieval (CIVR '07), pages 401–408, 2007.
[4] A. Bosch, A. Zisserman, and X. Muñoz. Image classification using ROIs and multiple kernel learning, 2008. Submitted to International Journal of Computer Vision.
[5] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. S. Kandola. On kernel-target alignment. In Advances in Neural Information Processing Systems, volume 14, pages 367–373, 2001.
[6] G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, pages 1–22, Prague, Czech Republic, May 2004.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In C. Schmid, S. Soatto, and C. Tomasi, editors, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), volume 2, pages 886–893, San Diego, USA, June 2005.
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results. http://www.pascal-network.org/challenges/VOC/voc2008/workshop/index.html, 2008.
[9] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems, volume 20, 2008.
[10] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems, volume 11, pages 487–493, 1998.
[11] C. Lampert and M. Blaschko. A multiple kernel learning approach to joint multi-class object detection. In DAGM, pages 31–40, 2008.
[12] G. R. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
[13] S. Lazebnik, C. Schmid, and J. Ponce. A sparse texture representation using local affine regions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:1265–1278, 2005.
[14] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), volume 2, pages 2169–2178, New York, USA, 2006.
[15] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[16] M. Marszalek and C. Schmid. Learning representations for visual object class recognition. http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2007/workshop/marszalek.pdf.
[17] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet. More efficiency in multiple kernel learning. In Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.
[18] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.
[19] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.
[20] M. Varma and D. Ray. Learning the discriminative power-invariance trade-off. In Proceedings of the IEEE 11th International Conference on Computer Vision (ICCV '07), pages 1–8, 2007.
[21] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 73(2):213–238, 2007.
[22] A. Zien and C. S. Ong. Multiclass multiple kernel learning. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), pages 1191–1198, New York, NY, USA, 2007. ACM Press.
[23] A. Zien, G. Rätsch, S. Mika, B. Schölkopf, and T. Lengauer. Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics, 16:799–807, 2000.