Publications

1. Multiple Kernel Clustering. Published at SDM 2009.
2. Efficient MultiClass Maximum Margin Clustering. Published at ICML 2008.
3. CutS3VM: A Fast Semi-Supervised SVM Algorithm. Published at KDD 2008.
4. Efficient Maximum Margin Clustering via Cutting Plane Algorithm. Published at SDM 2008.
5. Maximum Margin Embedding. Published at ICDM 2008.
6. Active Model Selection for Graph-Based Semi-Supervised Learning. Published at ICASSP 2008.
7. Smoothness Maximization via Gradient Descents. Published at ICASSP 2007.

Multiple Kernel Clustering

Bin Zhao∗        James T. Kwok†        Changshui Zhang∗

∗ Department of Automation, Tsinghua University, China.
† Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong.
Abstract Maximum margin clustering (MMC) has recently attracted considerable interest in both the data mining and machine learning communities. It first projects data samples to a kernel-induced feature space and then performs clustering by finding the maximum margin hyperplane over all possible cluster labelings. As in other kernel methods, choosing a suitable kernel function is imperative to the success of maximum margin clustering. In this paper, we propose a multiple kernel clustering (MKC) algorithm that simultaneously finds the maximum margin hyperplane, the best cluster labeling, and the optimal kernel. Moreover, we provide a detailed analysis of the time complexity of the MKC algorithm and also extend multiple kernel clustering to the multi-class scenario. Experimental results on both toy and real-world data sets demonstrate the effectiveness and efficiency of the MKC algorithm.

1 Introduction Over the decades, many clustering methods have been proposed in the literature, with popular examples including k-means clustering [9], mixture models [9] and spectral clustering [4, 8, 21]. Recently, maximum margin clustering (MMC) has also attracted considerable interest in both the data mining and machine learning communities [26, 27, 28, 30, 31, 32]. The key idea of MMC is to extend the maximum margin principle of support vector machines (SVM) to the unsupervised learning scenario. Given a set of data samples, MMC performs clustering by labeling the samples such that the SVM margin obtained is maximized over all possible cluster labelings [27]. Recent studies have demonstrated its superior performance over conventional clustering methods. However, while supervised large margin methods are usually formulated as convex optimization problems, MMC leads to a non-convex integer optimization problem which is much more difficult to solve. Recently, different optimization techniques have been used to alleviate this problem. Examples include semi-definite programming (SDP) [26, 27, 28], alternating optimization [30] and the cutting-plane method [31, 32].

Moreover, like other kernel methods, MMC also relies on a kernel function to project the data samples to a high-dimensional kernel-induced feature space. A good choice of the kernel function is therefore imperative to the success of MMC. However, one of the central problems with kernel methods in general is that it is often unclear which kernel is the most suitable for a particular task [2, 5, 14, 17]. So, instead of using a single fixed kernel, recent developments in the SVM and other kernel methods have shown encouraging results in constructing the kernel from a number of homogeneous or even heterogeneous kernels [1, 10, 13, 14, 18, 33, 23, 24, 29]. This provides extra flexibility and also allows domain knowledge from possibly different information sources to be incorporated into the base kernels. However, previous works in this so-called multiple kernel learning approach have all focused on the supervised and semi-supervised learning settings. Therefore, how to efficiently learn the kernel in unsupervised learning, or maximum margin clustering in particular, is still an interesting yet unexplored research topic. In this paper, we propose a multiple kernel clustering (MKC) algorithm that finds the maximum margin hyperplane over all possible cluster labelings, together with the optimal kernel-induced feature map, automatically from the data. Specifically, we consider a non-negative combination of a given set of M feature maps Φ1, . . . , ΦM (corresponding to M base kernels K1, . . . , KM):

$$\Phi(x) = \sum_{k=1}^{M} \beta_k \Phi_k(x), \tag{1.1}$$

with $\beta_k \ge 0$ and $\sum_{k=1}^{M}\beta_k^p \le 1$ for some integer p. By simultaneously optimizing the objective function in MMC with respect to both the hyperplane parameters (weight w and bias b) and the combination parameters βk's, we can obtain the optimal feature mapping for MMC. Computationally, the optimization problem in multiple kernel clustering can be solved by the cutting plane method [12]. As will be shown later in the sequel, one can construct a nested sequence of successively tighter relaxations of the original MKC problem, and each optimization problem in this sequence can be efficiently

solved as a second order cone program (SOCP) [3] by using the constrained concave-convex procedure (CCCP) [22]. Experimental evaluations on toy and real-world data sets demonstrate both the effectiveness and efficiency of multiple kernel clustering. The rest of this paper is organized as follows. In Section 2, we first present the principles of multiple kernel clustering in the simpler setting of two-class clustering. We will show that the original integer programming problem can be transformed to a sequence of convex programs which are then efficiently solved by a cutting plane algorithm. In Section 3, we provide theoretical analysis on the time complexity of the MKC algorithm. Section 4 extends multiple kernel clustering from the two-class to the multi-class setting. Experimental results on both toy and real-world data sets are provided in Section 5, followed by some concluding remarks in Section 6.
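Before moving to the formulation, the following is a minimal numpy sketch of the kind of non-negative kernel combination the paper builds on (the experiments in Section 5.2 combine a linear, a polynomial and a Gaussian base kernel). The function name and the simple rescaling used to enforce the norm constraint are our own illustrative choices, not part of the paper.

```python
import numpy as np

def combine_kernels(base_kernels, beta, p=2):
    """Non-negative combination of base kernel matrices.

    base_kernels : list of (n, n) PSD kernel matrices K_1, ..., K_M
    beta         : array of M non-negative weights
    p            : exponent in the constraint sum_k beta_k**p <= 1
    """
    beta = np.asarray(beta, dtype=float)
    assert np.all(beta >= 0), "weights must be non-negative"
    norm = np.sum(beta ** p)
    if norm > 1:                      # rescale onto the feasible set (illustrative)
        beta = beta / norm ** (1.0 / p)
    return sum(b * K for b, K in zip(beta, base_kernels))

# toy usage: linear, polynomial and Gaussian base kernels on random data
X = np.random.randn(50, 4)
K1 = X @ X.T                                       # linear kernel
K2 = (1 + X @ X.T) ** 2                            # polynomial kernel (degree 2)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K3 = np.exp(-sq / 2.0)                             # Gaussian kernel
K = combine_kernels([K1, K2, K3], beta=[0.5, 0.3, 0.2])
```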


2 Multiple Kernel Clustering

In this section, we first present the multiple kernel clustering algorithm for two-class clustering. Extension to the multi-class case will be discussed in Section 4.

2.1 Maximum Margin Clustering As briefly introduced in Section 1, the key idea of maximum margin clustering (MMC) is to extend the maximum margin principle from supervised learning to unsupervised learning. In the two-cluster case, given a set of examples X = {x1, · · · , xn}, MMC aims at finding the best label combination y = {y1, . . . , yn} ∈ {−1, +1}^n such that an SVM trained on {(x1, y1), . . . , (xn, yn)} will yield the largest margin. Computationally, it solves the following optimization problem:

$$\min_{\mathbf{y}\in\{\pm 1\}^n}\ \min_{\mathbf{w},b,\xi_i}\ \frac{1}{2}\mathbf{w}^T\mathbf{w} + \frac{C}{n}\sum_{i=1}^{n}\xi_i \tag{2.2}$$
$$\text{s.t.}\quad \forall i\in\{1,\ldots,n\}:\ y_i(\mathbf{w}^T\Phi(x_i)+b) \ge 1-\xi_i,\ \xi_i\ge 0,\qquad -l \le \sum_{i=1}^{n} y_i \le l.$$

Here, the data samples X are mapped to a high-dimensional feature space using a possibly nonlinear feature mapping Φ. In the support vector machine, training is usually performed in the dual and this Φ is utilized implicitly by using the kernel trick. In cases where primal optimization with a nonlinear kernel is preferred, we can still obtain a finite-dimensional representation for each sample in the kernel-induced feature space by using kernel principal component analysis [20]. Alternatively, following [6], one can also compute the Cholesky decomposition of the kernel matrix $K = \hat{X}\hat{X}^T$, and set $\Phi(x_i) = (\hat{X}_{i,1}, \ldots, \hat{X}_{i,n})^T$.

Moreover, the last constraint in (2.2) is the class balance constraint, which is introduced to avoid the trivially “optimal” solution that assigns all patterns to the same class and thus achieves “infinite” margin. This class balance constraint also avoids the unwanted solution of separating a single outlier or a very small group of samples from the rest of the data. Here, l > 0 is a constant controlling the class imbalance. According to Eq.(2.2), maximum margin clustering maximizes the margin with respect to both the labeling vector y and the separating hyperplane parameters (w, b). The unknown binary vector y renders Eq.(2.2) an integer program, which is much more difficult to solve than the quadratic program (QP) in SVM. However, as shown in [31], we can equivalently formulate the maximum margin clustering problem as
$$\min_{\mathbf{w},b,\xi_i}\ \frac{1}{2}\mathbf{w}^T\mathbf{w} + \frac{C}{n}\sum_{i=1}^{n}\xi_i \tag{2.3}$$
$$\text{s.t.}\quad \forall i\in\{1,\ldots,n\}:\ |\mathbf{w}^T\Phi(x_i)+b| \ge 1-\xi_i,\ \xi_i\ge 0,\qquad -l \le \sum_{i=1}^{n}\big[\mathbf{w}^T\Phi(x_i)+b\big] \le l.$$

Here, the labeling vector y is computed as yi = sgn(wT φ(xi ) + b) and a slightly relaxed class balance constraint is used [21]. This is much easier to handle than the original one in Eq.(2.2). 2.2 Multiple Kernel Maximum Margin Clustering Traditionally, maximum margin clustering projects the data samples to the feature space by a fixed feature mapping Φ (which is induced by a kernel K). Choosing a suitable kernel is therefore imperative to the success of maximum margin clustering. However, it is often unclear which kernel is the most suitable for the task at hand. In this paper, inspired by the works of multiple kernel learning in supervised learning [1, 10, 13, 14, 18, 33, 23, 24, 29], we propose to use a non-negative combination of several base kernels for computing the feature map in this maximum margin clustering setting. Specifically, each data sample xi in the input space is translated via M mappings Φk : x 7→ Φk (x) ∈ RDk , k = 1, . . . , M , to M feature representations Φ1 (xi ), . . . , ΦM (xi ). Here, Dk denotes the dimensionality of the kth feature space. For each feature mapping, there is a separate weight vector wk . Then one solves the following optimization problem, which is equivalent

to the MMC formulation in Eq.(2.3) when M = 1:

$$\min_{\beta,\mathbf{w},b,\xi}\ \frac{1}{2}\sum_{k=1}^{M}\beta_k\|\mathbf{w}_k\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i \tag{2.4}$$
$$\text{s.t.}\quad \forall i\in\{1,\ldots,n\}:\ \Big|\sum_{k=1}^{M}\beta_k\mathbf{w}_k^T\Phi_k(x_i)+b\Big| \ge 1-\xi_i,\ \xi_i\ge 0,$$
$$\forall k\in\{1,\ldots,M\}:\ \beta_k\ge 0,\qquad \sum_{k=1}^{M}\beta_k^p\le 1,$$
$$-l \le \sum_{i=1}^{n}\Big[\sum_{k=1}^{M}\beta_k\mathbf{w}_k^T\Phi_k(x_i)+b\Big] \le l.$$

Here, we regularize the M output functions according to their weights βk's. The non-negativity constraints on the weights guarantee that the combined regularizer is convex, and the resulting kernel is positive semi-definite. Moreover, p here is a positive integer. In this paper, we choose p = 2 or, in other words, the ℓ2 regularizer is used on β = (β1, . . . , βM)^T. While Eq.(2.4) is quite intuitive, it has the disadvantage that both the objective function and the first and last constraints are non-convex due to the coupling of βk and wk in the output function. Therefore, we apply the following change of variables [33]:

$$\forall k\in\{1,\ldots,M\}:\ \mathbf{v}_k = \beta_k\mathbf{w}_k. \tag{2.5}$$

After the above change of variables, multiple kernel MMC is equivalently formulated as follows.

$$\min_{\beta,\mathbf{v},b,\xi}\ \frac{1}{2}\sum_{k=1}^{M}\frac{\|\mathbf{v}_k\|^2}{\beta_k} + \frac{C}{n}\sum_{i=1}^{n}\xi_i \tag{2.6}$$
$$\text{s.t.}\quad \forall i\in\{1,\ldots,n\}:\ \Big|\sum_{k=1}^{M}\mathbf{v}_k^T\Phi_k(x_i)+b\Big| \ge 1-\xi_i,\ \xi_i\ge 0,$$
$$\forall k\in\{1,\ldots,M\}:\ \beta_k\ge 0,\qquad \sum_{k=1}^{M}\beta_k^2\le 1,$$
$$-l \le \sum_{i=1}^{n}\Big[\sum_{k=1}^{M}\mathbf{v}_k^T\Phi_k(x_i)+b\Big] \le l,$$

where v = (v1, . . . , vM)^T. Note that the objective function and all constraints except the first one are now convex.

2.3 Cutting Plane Algorithm The multiple kernel MMC formulation in Eq.(2.6) has n slack variables ξi's, one for each data sample. In the following, we first reformulate Eq.(2.6) to reduce the number of slack variables.

Theorem 2.1. Multiple kernel MMC can be equivalently formulated as:

$$\min_{\beta,\mathbf{v},b,\xi}\ \frac{1}{2}\sum_{k=1}^{M}\frac{\|\mathbf{v}_k\|^2}{\beta_k} + C\xi \tag{2.7}$$
$$\text{s.t.}\quad \forall \mathbf{c}\in\{0,1\}^n:\ \frac{1}{n}\sum_{i=1}^{n}c_i\Big|\sum_{k=1}^{M}\mathbf{v}_k^T\Phi_k(x_i)+b\Big| \ge \frac{1}{n}\sum_{i=1}^{n}c_i-\xi, \tag{2.8}$$
$$\forall k\in\{1,\ldots,M\}:\ \beta_k\ge 0,\qquad \sum_{k=1}^{M}\beta_k^2\le 1,\quad \xi\ge 0,$$
$$-l \le \sum_{i=1}^{n}\Big[\sum_{k=1}^{M}\mathbf{v}_k^T\Phi_k(x_i)+b\Big] \le l.$$

Proof. For simplicity, we denote the optimization problem shown in Eq.(2.6) as OP1 and the problem in Eq.(2.7) as OP2. To prove the theorem, we will show that OP1 and OP2 have the same optimal objective value and an equivalent set of constraints. Specifically, we will prove that for every (v, b, β), the optimal ξ* and {ξ1*, . . . , ξn*} are related by $\xi^*=\frac{1}{n}\sum_{i=1}^{n}\xi_i^*$. This means that, with (v, b, β) fixed, (v, b, β, ξ1*, . . . , ξn*) and (v, b, β, ξ*) are optimal solutions to OP1 and OP2, respectively, and they result in the same objective value. First, note that for any given (v, b, β), each slack variable ξi in OP1 can be optimized individually as

$$\xi_i^* = \max\Big\{0,\ 1-\Big|\sum_{k=1}^{M}\mathbf{v}_k^T\Phi_k(x_i)+b\Big|\Big\}. \tag{2.9}$$

For OP2, the optimal slack variable ξ is

$$\xi^* = \max_{\mathbf{c}\in\{0,1\}^n}\Big\{\frac{1}{n}\sum_{i=1}^{n}c_i - \frac{1}{n}\sum_{i=1}^{n}c_i\Big|\sum_{k=1}^{M}\mathbf{v}_k^T\Phi_k(x_i)+b\Big|\Big\}. \tag{2.10}$$

Since the ci's are independent of each other in Eq.(2.10), they can also be optimized individually and so

$$\xi^* = \frac{1}{n}\sum_{i=1}^{n}\max_{c_i\in\{0,1\}}\Big\{c_i - c_i\Big|\sum_{k=1}^{M}\mathbf{v}_k^T\Phi_k(x_i)+b\Big|\Big\} = \frac{1}{n}\sum_{i=1}^{n}\max\Big\{0,\ 1-\Big|\sum_{k=1}^{M}\mathbf{v}_k^T\Phi_k(x_i)+b\Big|\Big\} = \frac{1}{n}\sum_{i=1}^{n}\xi_i^*. \tag{2.11}$$

Hence, the objectives of OP1 and OP2 have the same value for any (v, b, β) given the optimal ξ* and {ξ1*, . . . , ξn*}. Therefore, the optima of these two optimization problems are the same. That is to say, we can solve the optimization problem in Eq.(2.7) to get the multiple kernel MMC solution. □

In the optimization problem shown in Eq.(2.7), the number of slack variables is reduced by n−1 and a single slack variable ξ is now shared across all the non-convex constraints. This greatly reduces the complexity of the non-convex optimization problem for multiple kernel MMC. On the other hand, the number of constraints in Eq.(2.8) is increased from n to 2^n. This exponential increase of constraints may seem intimidating at first sight. However, we will show that we can always find a small subset of constraints from the whole constraint set in (2.8) while still ensuring a sufficiently accurate solution. Specifically, we employ an adaptation of the cutting plane algorithm [12] to solve the multiple kernel MMC problem. It starts with an empty constraint subset Ω, and computes the optimal solution to problem (2.7) subject to the constraints in Ω. The algorithm then finds the most violated constraint in (2.8) and adds it to the subset Ω. In this way, we construct a series of successively tightening approximations to the original multiple kernel MMC problem. The algorithm stops when no constraint in (2.8) is violated by more than ε. The whole cutting plane algorithm for multiple kernel MMC is presented in Algorithm 1.

Algorithm 1 Cutting plane algorithm for multiple kernel maximum margin clustering.
Input: M feature mappings Φ1, . . . , ΦM, parameters C, l and ε, constraint subset Ω = ∅.
repeat
  Solve problem (2.7) for (v, b, β) under the current working constraint set Ω.
  Select the most violated constraint c and set Ω = Ω ∪ {c}.
until the newly selected constraint c is violated by no more than ε.

We will prove in Section 3 that one can always find a polynomially-sized subset of constraints such that the solution of the corresponding relaxed problem satisfies all the constraints in (2.8) up to a precision of ε. That is to say, the remaining exponential number of constraints are guaranteed to be violated by no more than ε, and thus do not need to be explicitly added to the optimization problem [11]. There are two remaining issues in our cutting plane algorithm for multiple kernel MMC. First, how to solve

problem (2.7) under a given constraint subset Ω? Second, how to find the most violated constraint in (2.8)? These will be addressed in the following two subsections.

2.3.1 Optimization via the CCCP In each iteration of the cutting plane algorithm, we need to solve a non-convex optimization problem to obtain the optimal separating hyperplane under the current working constraint set Ω. Although the objective function in (2.7) is convex, the constraints are not. This makes problem (2.7) difficult to solve. Fortunately, the constrained concave-convex procedure (CCCP) is designed to solve optimization problems with a concave-convex objective function and concave-convex constraints [22]. Specifically, the objective function in (2.7) is quadratic and all the constraints except the first one are linear. Moreover, note that although the constraint in Eq.(2.8) is non-convex, it is a difference of two convex functions and can be written as:

$$\forall \mathbf{c}\in\Omega:\quad \frac{1}{n}\sum_{i=1}^{n}c_i-\xi - \frac{1}{n}\sum_{i=1}^{n}c_i\Big|\sum_{k=1}^{M}\mathbf{v}_k^T\Phi_k(x_i)+b\Big| \le 0. \tag{2.12}$$

Hence, we can solve problem (2.7) with the CCCP as follows. Given an initial estimate (v^(0), b^(0)), the CCCP computes (v^(t+1), b^(t+1)) from (v^(t), b^(t)) by replacing $\frac{1}{n}\sum_{i=1}^{n}c_i\big|\sum_{k=1}^{M}\mathbf{v}_k^T\Phi_k(x_i)+b\big|$ in the constraint (2.12) with its first-order Taylor expansion at (v^(t), b^(t)). Problem (2.7) then becomes:

$$\min_{\beta,\mathbf{v},b,\xi}\ \frac{1}{2}\sum_{k=1}^{M}\frac{\|\mathbf{v}_k\|^2}{\beta_k} + C\xi \tag{2.13}$$
$$\text{s.t.}\quad \forall \mathbf{c}\in\Omega:\ \frac{1}{n}\sum_{i=1}^{n}c_i \le \xi + \frac{1}{n}\sum_{i=1}^{n}c_i z_i^{(t)}\Big[\sum_{k=1}^{M}\mathbf{v}_k^T\Phi_k(x_i)+b\Big],$$
$$\forall k\in\{1,\ldots,M\}:\ \beta_k\ge 0,\qquad \sum_{k=1}^{M}\beta_k^2\le 1,\quad \xi\ge 0,$$
$$-l \le \sum_{i=1}^{n}\Big[\sum_{k=1}^{M}\mathbf{v}_k^T\Phi_k(x_i)+b\Big] \le l,$$

where $z_i^{(t)} = \mathrm{sgn}\big(\sum_{k=1}^{M}\mathbf{v}_k^{(t)T}\Phi_k(x_i)+b^{(t)}\big)$. Introducing an additional variable $t_k$ defined as an upper bound on $\frac{\|\mathbf{v}_k\|^2}{\beta_k}$ (i.e., adding the additional constraints $\frac{\|\mathbf{v}_k\|^2}{\beta_k}\le t_k$), we can formulate the above as the following second order

cone programming (SOCP) [3] problem:

$$\min_{\beta,\mathbf{v},b,\xi,\mathbf{t}}\ \frac{1}{2}\sum_{k=1}^{M}t_k + C\xi \tag{2.14}$$
$$\text{s.t.}\quad \forall \mathbf{c}\in\Omega:\ \frac{1}{n}\sum_{i=1}^{n}c_i \le \xi + \frac{1}{n}\sum_{i=1}^{n}c_i z_i^{(t)}\Big[\sum_{k=1}^{M}\mathbf{v}_k^T\Phi_k(x_i)+b\Big],$$
$$\forall k\in\{1,\ldots,M\}:\ \left\|\begin{bmatrix}2\mathbf{v}_k\\ t_k-\beta_k\end{bmatrix}\right\| \le t_k+\beta_k,\quad \beta_k\ge 0,$$
$$\sum_{k=1}^{M}\beta_k^2\le 1,\quad \xi\ge 0,$$
$$-l \le \sum_{i=1}^{n}\Big[\sum_{k=1}^{M}\mathbf{v}_k^T\Phi_k(x_i)+b\Big] \le l.$$

Here, we have used the fact that hyperbolic constraints of the form $\mathbf{s}^T\mathbf{s}\le xy$, where $x,y\in\mathbb{R}_+$ and $\mathbf{s}\in\mathbb{R}^n$, can be equivalently transformed to the second order cone constraint [16, 25]

$$\left\|\begin{bmatrix}2\mathbf{s}\\ x-y\end{bmatrix}\right\| \le x+y. \tag{2.15}$$
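The equivalence in (2.15) follows from the identity (x+y)² − (4sᵀs + (x−y)²) = 4(xy − sᵀs); the short numpy check below simply verifies this identity on a concrete instance. It is a sanity check of the algebra, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(size=5)
x, y = 1.7, 0.4

lhs = (x + y) ** 2 - np.linalg.norm(np.concatenate([2 * s, [x - y]])) ** 2
rhs = 4 * (x * y - s @ s)
print(np.isclose(lhs, rhs))   # True: the hyperbolic and SOC forms define the same set
```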

The above SOCP problem can be solved in polynomial time [15]. Following the CCCP, the obtained solution (v, b, β, ξ, t) from this SOCP problem is then used as (v^(t+1), b^(t+1), β, ξ, t), and the iteration continues until convergence. The algorithm for solving problem (2.7) subject to the constraint subset Ω is summarized in Algorithm 2. As for its termination criterion, we check if the difference in objective values from two successive iterations is less than α% (which is set to 0.01 in the experiments).

Algorithm 2 Solve problem (2.7) subject to constraint subset Ω via the constrained concave-convex procedure.
Initialize (v^(0), b^(0)).
repeat
  Obtain (v^(t+1), b^(t+1), β, ξ, t) as the solution of the second order cone programming problem (2.14).
  Set v = v^(t+1), b = b^(t+1) and t = t + 1.
until the stopping criterion is satisfied.
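A minimal Python skeleton of the CCCP iteration in Algorithm 2 is sketched below. It assumes a routine solve_socp_2_14 for the convex subproblem (2.14); that routine name, its return values and the stacked data layout (Phi stores [Φ1(xi), . . . , ΦM(xi)] row-wise, v the stacked weight blocks) are our own assumptions, not specified in the paper.

```python
import numpy as np

def cccp(Phi, Omega, C, l, solve_socp_2_14, v0, b0, rel_tol=1e-4, max_iter=50):
    """CCCP loop of Algorithm 2 (sketch).

    Phi   : (n, D) matrix; row i stacks [Phi_1(x_i), ..., Phi_M(x_i)]
    Omega : list of binary constraint vectors c of length n
    solve_socp_2_14 : assumed solver returning (v, b, beta, xi, t, obj)
                      for the convex subproblem (2.14) given the signs z
    """
    v, b = v0, b0
    prev_obj = np.inf
    for _ in range(max_iter):
        # linearization step: z_i = sgn(sum_k v_k^T Phi_k(x_i) + b), cf. Eq. (2.13)
        z = np.sign(Phi @ v + b)
        z[z == 0] = 1.0
        v, b, beta, xi, t, obj = solve_socp_2_14(Phi, Omega, C, l, z)
        # stop when the relative change in objective drops below the tolerance
        if prev_obj < np.inf and abs(prev_obj - obj) <= rel_tol * max(abs(prev_obj), 1.0):
            break
        prev_obj = obj
    return v, b, beta, xi
```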

2.3.2 The Most Violated Constraint The most violated constraint in (2.8) can be easily identified. Recall that the feasibility of a constraint in (2.8) is measured by the corresponding value of ξ. Therefore, the most violated constraint is the one that results in the largest ξ. Since each constraint in (2.8) is represented by a vector c, we have the following theorem:

Theorem 2.2. The most violated constraint c in (2.8) can be computed as:

$$c_i = \begin{cases} 1 & \text{if } \big|\sum_{k=1}^{M}\mathbf{v}_k^T\Phi_k(x_i)+b\big| < 1,\\ 0 & \text{otherwise.}\end{cases} \tag{2.16}$$

Proof. The most violated constraint is the one that results in the largest ξ. In order to fulfill all the constraints in problem (2.7), the optimal ξ can be computed as:

$$\xi^* = \frac{1}{n}\sum_{i=1}^{n}\max_{c_i\in\{0,1\}}\Big\{c_i - c_i\Big|\sum_{k=1}^{M}\mathbf{v}_k^T\Phi_k(x_i)+b\Big|\Big\} = \frac{1}{n}\sum_{i=1}^{n}\max_{c_i\in\{0,1\}}\Big\{c_i\Big[1-\Big|\sum_{k=1}^{M}\mathbf{v}_k^T\Phi_k(x_i)+b\Big|\Big]\Big\}. \tag{2.17}$$

Therefore, the most violated constraint c corresponding to ξ* can be obtained as in Eq.(2.16). □
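The selection rule in Eq.(2.16) is a simple thresholding and can be written in one line of numpy; the stacked data layout (Phi row-wise, v the stacked weight blocks) is our own convention.

```python
import numpy as np

def most_violated_constraint(Phi, v, b):
    # Eq. (2.16): c_i = 1 exactly when |sum_k v_k^T Phi_k(x_i) + b| < 1
    scores = np.abs(Phi @ v + b)          # Phi: (n, D), v: (D,)
    return (scores < 1.0).astype(float)   # binary constraint vector c of length n
```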

The cutting plane algorithm iteratively selects the most violated constraint under the current hyperplane parameter and then adds it to the working constraint set Ω, until no constraint is violated by more than ε. Moreover, there is a direct correspondence between ξ and the feasibility of the set of constraints in problem (2.7). If a point (v, b, β, ξ) fulfills all the constraints up to precision ε, i.e.,

$$\forall \mathbf{c}\in\{0,1\}^n:\quad \frac{1}{n}\sum_{i=1}^{n}c_i\Big|\sum_{k=1}^{M}\mathbf{v}_k^T\Phi_k(x_i)+b\Big| \ge \frac{1}{n}\sum_{i=1}^{n}c_i-(\xi+\varepsilon), \tag{2.18}$$

then the point (v, b, β, ξ + ε) is feasible. Furthermore, note that in the objective function of problem (2.7), there is a single slack variable ξ measuring the clustering loss. Hence, we can simply select the stopping criterion in Algorithm 1 as all samples satisfying inequality (2.18). Then, the approximation accuracy ε of this approximate solution is directly related to the clustering loss.
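Putting the pieces together, the outer cutting-plane loop of Algorithm 1 can be sketched as follows, reusing the cccp and most_violated_constraint helpers sketched above. The warm-start constraint and the solver interface are illustrative assumptions; the paper itself starts from an empty working set.

```python
import numpy as np

def multiple_kernel_clustering(Phi, C, l, eps, solve_socp_2_14, max_cuts=200):
    """Cutting plane loop of Algorithm 1 (sketch)."""
    n, D = Phi.shape
    Omega = [np.ones(n)]               # illustrative warm start for the working set
    v, b = np.zeros(D), 0.0
    for _ in range(max_cuts):
        # solve problem (2.7) restricted to the working set Omega (via the CCCP)
        v, b, beta, xi = cccp(Phi, Omega, C, l, solve_socp_2_14, v, b)
        # select the most violated constraint, Eq. (2.16)
        c = most_violated_constraint(Phi, v, b)
        # its violation: (1/n) sum_i c_i - (1/n) sum_i c_i |f(x_i)| - xi
        violation = c.mean() - np.mean(c * np.abs(Phi @ v + b)) - xi
        if violation <= eps:           # stopping rule of Algorithm 1 / Eq. (2.18)
            break
        Omega.append(c)
    labels = np.sign(Phi @ v + b)      # cluster labels y_i = sgn(w^T Phi(x_i) + b)
    return labels, v, b, beta
```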

2.4 Accuracy of the Cutting Plane Algorithm The following theorem characterizes the accuracy of the solution computed by the cutting plane algorithm.

Theorem 2.3. For any ε > 0, the cutting plane algorithm for multiple kernel MMC returns a point (v, b, β, ξ) for which (v, b, β, ξ + ε) is feasible in problem (2.7).

Proof. In the cutting plane algorithm, the most violated constraint c in (2.8), which leads to the largest value of ξ, is selected using Eq.(2.16). The cutting

plane algorithm terminates only when the newly selected constraint c is violated by no more than ε, i.e., $\frac{1}{n}\sum_{i=1}^{n}c_i - \frac{1}{n}\sum_{i=1}^{n}c_i\big|\sum_{k=1}^{M}\mathbf{v}_k^T\Phi_k(x_i)+b\big| \le \xi+\varepsilon$. Since the newly selected constraint c is the most violated one, all the other constraints will satisfy the above inequality. Therefore, if (v, b, β, ξ) is the solution returned by our cutting plane algorithm, then (v, b, β, ξ+ε) will be a feasible solution to problem (2.7). □

Based on this theorem, ε indicates how close one wants to be to the error rate of the best separating hyperplane. This justifies its use as the stopping criterion in Algorithm 1.

3 Time Complexity Analysis In this section, we provide theoretical analysis on the time complexity of the cutting plane algorithm for multiple kernel MMC.

Theorem 3.1. The cutting plane algorithm for multiple kernel MMC takes $O\big(\frac{D^{3.5}+nD}{\varepsilon^2} + \frac{D^{2.5}}{\varepsilon^4}\big)$ time, where $D = \sum_{k=1}^{M}D_k$ and $D_k$ is the dimensionality of the kth feature space.

To prove the above theorem, we will first obtain the time involved in each iteration of the algorithm. Next, we will prove that the total number of constraints added into the working set Ω, i.e., the total number of iterations involved in the cutting plane algorithm, is upper bounded. Specifically, we have the following two lemmas.

Lemma 3.1. Each iteration of the cutting plane algorithm for multiple kernel MMC takes $O(D^{3.5} + nD + D^{2.5}|\Omega|)$ time for a working constraint set of size $|\Omega|$.

Proof. In each iteration of the cutting plane algorithm, two steps are involved: solving problem (2.7) under the current working constraint set Ω via the CCCP and selecting the most violated constraint. To solve problem (2.7) under the working constraint set Ω, we need to solve a sequence of SOCP problems. Specifically, for an SOCP problem of the form

$$\min_{\mathbf{x}}\ \mathbf{f}^T\mathbf{x} \quad \text{s.t.}\quad \forall k\in\{1,\ldots,M\}:\ \|A_k\mathbf{x}+\mathbf{b}_k\| \le \mathbf{c}_k^T\mathbf{x}+d_k, \tag{3.19}$$

where $\mathbf{x}\in\mathbb{R}^N$, $\mathbf{f}\in\mathbb{R}^N$, $A_k\in\mathbb{R}^{(N_k-1)\times N}$, $\mathbf{b}_k\in\mathbb{R}^{N_k-1}$, $\mathbf{c}_k\in\mathbb{R}^N$ and $d_k\in\mathbb{R}$, the time complexity for each iteration is $O(N^2\sum_k N_k)$ [15, 25]. According to the SOCP formulation in (2.14), we have $N = \sum_{k=1}^{M}D_k + 2M + 2 = O(D)$ and $\sum_k N_k = \sum_{k=1}^{M}(D_k+2) + 2M + |\Omega| + 3 = O(D+|\Omega|)$. Thus, the time complexity per iteration is $O(D^3 + D^2|\Omega|)$. Using the primal-dual method for solving this SOCP, the accuracy of a given solution can be improved by an absolute constant factor in $O(D^{0.5})$ iterations [16]. Hence, each iteration in the CCCP takes $O(D^{3.5} + D^{2.5}|\Omega|)$ time. Moreover, as will be seen from the numerical experiments in Section 5, each round of the cutting plane algorithm requires fewer than 10 iterations for solving problem (2.7) subject to Ω via the CCCP. This is the case even on large data sets. Therefore, the time complexity for solving problem (2.7) under the working constraint set Ω via the CCCP is $O(D^{3.5} + D^{2.5}|\Omega|)$. Finally, to select the most violated constraint using Eq.(2.16), we need to compute n inner products between $(\mathbf{v}_1,\ldots,\mathbf{v}_M)$ and $(\Phi_1(x_i),\ldots,\Phi_M(x_i))$. Each inner product takes O(D) time and so a total of n inner products can be computed in O(nD) time. Thus, the time complexity for each iteration of the cutting plane algorithm is $O(D^{3.5} + nD + D^{2.5}|\Omega|)$. □

Lemma 3.2. The cutting plane algorithm terminates after adding at most $\frac{CR}{\varepsilon^2}$ constraints, where R is a constant independent of n and D.

Proof. Note that v = 0, b = 0, ξ = 1 with arbitrary β ≥ 0 satisfying $\sum_{k=1}^{M}\beta_k^2\le 1$ is a feasible solution to problem (2.7). Therefore, the optimal objective of (2.7) is upper bounded by C. In the following, we will prove that in each iteration of the cutting plane algorithm, the objective value is increased by at least a constant after adding the most violated constraint. Since the objective value is non-negative and upper bounded by C, the total number of iterations is then upper bounded.

For simplicity, we omit the class balance constraint in problem (2.7) and set the bias term b = 0. The proof for the problem with class balance constraint and non-zero bias term can be obtained similarly. To compute the increase brought about by adding one constraint to the working constraint set Ω, we first need the dual problem of (2.7). The difficulty in obtaining this dual comes from the $|\sum_{k=1}^{M}\mathbf{v}_k^T\Phi_k(x_i)+b|$ term in the constraints. Thus, we first replace the constraints in (2.8) with

$$\forall \mathbf{c}\in\Omega:\ \frac{1}{n}\sum_{i=1}^{n}c_i t_i \ge \frac{1}{n}\sum_{i=1}^{n}c_i-\xi,\qquad \forall i\in\{1,\ldots,n\}:\ t_i^2\le \mathbf{v}^T\Psi_i\mathbf{v},\quad t_i\ge 0,$$

where the D × D matrix $\Psi_i$ is defined as

$$\Psi_i=\begin{bmatrix}\Phi_1(x_i)\Phi_1(x_i)^T & \cdots & \Phi_1(x_i)\Phi_M(x_i)^T\\ \vdots & \ddots & \vdots\\ \Phi_M(x_i)\Phi_1(x_i)^T & \cdots & \Phi_M(x_i)\Phi_M(x_i)^T\end{bmatrix}. \tag{3.20}$$

Let λ, γ, µ, δ, α, ρ be the dual variables corresponding to the various constraints. The Lagrangian dual function for problem (2.7) can be obtained as

$$L(\lambda,\gamma,\mu,\delta,\alpha,\rho)=\inf_{\mathbf{v},\beta,\xi,\mathbf{t}}\Big\{\frac{1}{2}\sum_{k=1}^{M}\frac{\|\mathbf{v}_k\|^2}{\beta_k}+C\xi+\sum_{p=1}^{|\Omega|}\lambda_p\Big[\frac{1}{n}\sum_{i=1}^{n}c_{pi}(1-t_i)-\xi\Big]+\sum_{i=1}^{n}\gamma_i\big(t_i^2-\mathbf{v}^T\Psi_i\mathbf{v}\big)-\mu\xi-\sum_{i=1}^{n}\delta_i t_i-\sum_{k=1}^{M}\alpha_k\beta_k+\rho\Big(\sum_{k=1}^{M}\beta_k-1\Big)\Big\}$$
$$=\sum_{i=1}^{n}\Big\{-\frac{\big(\sum_{p=1}^{|\Omega|}\lambda_p c_{pi}+n\delta_i\big)^2}{4n^2\gamma_i}+\frac{1}{n}\sum_{p=1}^{|\Omega|}\lambda_p c_{pi}\Big\}-\rho, \tag{3.21}$$

satisfying the following constraints

$$E_\beta-2\sum_{i=1}^{n}\gamma_i\Psi_i\succeq 0,\quad \alpha_k-\rho=0,\quad C-\sum_{p=1}^{|\Omega|}\lambda_p-\mu=0,\quad t_i=\frac{1}{2n\gamma_i}\sum_{k=1}^{|\Omega|}\lambda_k c_{ki}+\frac{\delta_i}{2\gamma_i},\quad \lambda,\gamma,\mu,\delta,\alpha\ge 0, \tag{3.22}$$

where $E_\beta=\mathrm{diag}\big(\frac{I_{D_1\times D_1}}{\beta_1},\ldots,\frac{I_{D_M\times D_M}}{\beta_M}\big)$ and $I_{D_k\times D_k}$ is the $D_k\times D_k$ identity matrix. The cutting plane algorithm selects the most violated constraint c′ and continues if the following inequality holds:

$$\frac{1}{n}\sum_{i=1}^{n}c'_i(1-t_i^*)\ge\xi+\varepsilon. \tag{3.23}$$

Since ξ ≥ 0, the newly added constraint satisfies

$$\frac{1}{n}\sum_{i=1}^{n}c'_i(1-t_i^*)\ge\varepsilon. \tag{3.24}$$

Let $L_*^{(t+1)}$ be the optimal value of the Lagrangian dual function subject to $\Omega^{(t+1)}=\Omega^{(t)}\cup\{\mathbf{c}'\}$, and $\gamma_i^{(t)}$ be the value of $\gamma_i$ which results in the largest $L_*^{(t)}$. The addition of a new constraint to the primal problem is equivalent to adding a new variable $\lambda_{t+1}$ to the dual problem, and so

$$L_*^{(t+1)}=\max_{\lambda,\gamma,\mu,\delta,\alpha,\rho}\Big\{\sum_{i=1}^{n}\Big[-\frac{\big(\sum_{p}\lambda_p c_{pi}+\lambda_{t+1}c'_i+n\delta_i\big)^2}{4n^2\gamma_i}+\frac{1}{n}\Big(\sum_{p=1}^{t}\lambda_p c_{pi}+\lambda_{t+1}c'_i\Big)\Big]-\rho\Big\}$$
$$\ge L_*^{(t)}+\max_{\lambda_{t+1}\ge 0}\sum_{i=1}^{n}\Big\{-\frac{\lambda_{t+1}c'_i\sum_{p=1}^{t}\lambda_p^{(t)}c_{pi}}{2\gamma_i^{(t)}n^2}-\frac{\lambda_{t+1}c'_i\delta_i^{(t)}}{2\gamma_i^{(t)}n}-\frac{(\lambda_{t+1}c'_i)^2}{4\gamma_i^{(t)}n^2}+\frac{1}{n}\lambda_{t+1}c'_i\Big\}. \tag{3.25}$$

According to inequality (3.24), the expression for $t_i$ in (3.22) and the constraint $\lambda_{t+1}\ge 0$, we have

$$\sum_{i=1}^{n}\Big[\frac{\lambda_{t+1}c'_i\sum_{p=1}^{t}\lambda_p^{(t)}c_{pi}}{2\gamma_i^{(t)}n^2}+\frac{\lambda_{t+1}c'_i\delta_i^{(t)}}{2\gamma_i^{(t)}n}\Big]\le\frac{1}{n}\sum_{i=1}^{n}\lambda_{t+1}c'_i-\varepsilon\lambda_{t+1}.$$

Substituting the above inequality into (3.25), we get the following lower bound of $L_*^{(t+1)}$:

$$L_*^{(t+1)}\ge L_*^{(t)}+\max_{\lambda_{t+1}\ge 0}\Big\{\varepsilon\lambda_{t+1}-\sum_{i=1}^{n}\frac{(\lambda_{t+1}c'_i)^2}{4\gamma_i^{(t)}n^2}\Big\}=L_*^{(t)}+\frac{\varepsilon^2}{\sum_{i=1}^{n}(c'_i)^2/(\gamma_i^{(t)}n^2)}. \tag{3.26}$$

By maximizing the Lagrangian dual function shown in Eq.(3.21), $\gamma^{(t)}$ can be obtained as

$$(\lambda^{(t)},\gamma^{(t)},\mu^{(t)},\delta^{(t)},\alpha^{(t)},\rho^{(t)})=\arg\max_{\lambda,\gamma,\mu,\delta,\alpha,\rho}\Big\{\sum_{i=1}^{n}\Big[-\frac{\big(\sum_{p=1}^{t}\lambda_p c_{pi}+n\delta_i\big)^2}{4n^2\gamma_i}+\frac{1}{n}\sum_{p=1}^{t}\lambda_p c_{pi}\Big]-\rho\Big\}=\arg\max_{\lambda,\gamma,\mu,\delta,\alpha,\rho}\sum_{i=1}^{n}(\gamma_i-\delta_i),$$

subject to the following equation

$$2n\gamma_i=\sum_{p=1}^{t}\lambda_p c_{pi}+n\delta_i. \tag{3.27}$$

The only constraint on $\delta_i$ is $\delta_i\ge 0$. Therefore, to maximize $\sum_{i=1}^{n}(\gamma_i-\delta_i)$, the optimal value for $\delta_i$ is 0. Hence, the following equation holds:

$$2n\gamma_i^{(t)}=\sum_{p=1}^{t}\lambda_p^{(t)}c_{pi}. \tag{3.28}$$

Thus, $n\gamma_i^{(t)}=\frac{1}{2}\sum_{p=1}^{t}\lambda_p^{(t)}c_{pi}$ is a constant independent of n. Moreover, $\frac{1}{n}\sum_{i=1}^{n}(c'_i)^2$ measures the fraction of non-zero elements in the constraint vector c′, and therefore is a constant related only to the newly added constraint, also independent of n. Hence, $\sum_{i=1}^{n}(c'_i)^2/(\gamma_i^{(t)}n^2)$ is a constant independent of n and D, and we denote it by $Q^{(t)}$. Moreover, define $R=\max_t\{Q^{(t)}\}$ as the maximum of $Q^{(t)}$ throughout the whole cutting plane process. Therefore, the increase of the objective function of the Lagrangian dual problem after adding the most violated constraint c′ is at least $\frac{\varepsilon^2}{R}$. Furthermore, denote by $G^{(t)}$ the value of the objective function in problem (2.7) subject to $\Omega^{(t)}$ after adding t constraints. Due to weak duality [3], at the optimal solution $L_*^{(t)}\le G_*^{(t)}\le C$. Since the Lagrangian dual function is upper bounded by C, the cutting plane algorithm terminates after adding at most $\frac{CR}{\varepsilon^2}$ constraints. □

Recall that Lemma 3.2 bounds the number of iterations in our cutting plane algorithm by $\frac{CR}{\varepsilon^2}$, which is a constant independent of n and D. Moreover, each iteration of the algorithm takes $O(D^{3.5}+nD+D^{2.5}|\Omega|)$ time. Therefore, the cutting plane algorithm for multiple kernel MMC has a time complexity of $\sum_{|\Omega|=1}^{CR/\varepsilon^2}O(D^{3.5}+nD+D^{2.5}|\Omega|)=O\big(\frac{D^{3.5}+nD}{\varepsilon^2}+\frac{D^{2.5}}{\varepsilon^4}\big)$. Hence, we have proved Theorem 3.1.

4 Multi-Class Multiple Kernel Clustering In this section, we extend the multiple kernel MMC algorithm to multi-class clustering.

4.1 Multi-Class Formulation For the multi-class scenario, we will start with an introduction to the multi-class support vector machine formulation proposed in [7]. Given a point set X = {x1, · · · , xn} and their labels y = (y1, . . . , yn) ∈ {1, . . . , m}^n, the SVM defines a weight vector $\mathbf{w}_p$ for each class p ∈ {1, . . . , m} and classifies sample x by $p^*=\arg\max_{p\in\{1,\ldots,m\}}\mathbf{w}_p^T\Phi(x)$. The weight vectors are obtained as follows:

$$\min_{\mathbf{w},\xi}\ \frac{1}{2}\sum_{p=1}^{m}\|\mathbf{w}_p\|^2+\frac{C}{n}\sum_{i=1}^{n}\xi_i \tag{4.29}$$
$$\text{s.t.}\quad \forall i\in\{1,\ldots,n\},\ r\in\{1,\ldots,m\}:\ \mathbf{w}_{y_i}^T\Phi(x_i)+\delta_{y_i,r}-\mathbf{w}_r^T\Phi(x_i)\ge 1-\xi_i,\qquad \forall i\in\{1,\ldots,n\}:\ \xi_i\ge 0.$$

Instead of a single feature mapping Φ, we consider the non-negative combination of M feature mappings as shown in Eq.(1.1). The multiple kernel multi-class SVM can therefore be formulated as:

$$\min_{\beta,\mathbf{w},\xi}\ \frac{1}{2}\sum_{k=1}^{M}\beta_k\sum_{p=1}^{m}\|\mathbf{w}_k^p\|^2+\frac{C}{n}\sum_{i=1}^{n}\xi_i \tag{4.30}$$
$$\text{s.t.}\quad \forall i\in\{1,\ldots,n\},\ r\in\{1,\ldots,m\}:\ \sum_{k=1}^{M}\beta_k(\mathbf{w}_k^{y_i}-\mathbf{w}_k^{r})^T\Phi_k(x_i)+\delta_{y_i,r}\ge 1-\xi_i,$$
$$\forall i\in\{1,\ldots,n\}:\ \xi_i\ge 0,\qquad \forall k\in\{1,\ldots,M\}:\ \beta_k\ge 0,\qquad \sum_{k=1}^{M}\beta_k^2\le 1,$$

where the superscript p in $\mathbf{w}_k^p$ denotes the pth class and the subscript k denotes the kth feature mapping. Instead of finding a large margin classifier given labels on the data as in SVM, MMC targets to find a labeling that will result in a large margin classifier. The multiple kernel multi-class maximum margin clustering problem can therefore be formulated as:

$$\min_{\mathbf{y},\beta,\mathbf{v},\xi}\ \frac{1}{2}\sum_{k=1}^{M}\sum_{p=1}^{m}\frac{\|\mathbf{v}_k^p\|^2}{\beta_k}+\frac{C}{n}\sum_{i=1}^{n}\xi_i \tag{4.31}$$
$$\text{s.t.}\quad \forall i\in\{1,\ldots,n\},\ r\in\{1,\ldots,m\}:\ \sum_{k=1}^{M}(\mathbf{v}_k^{y_i}-\mathbf{v}_k^{r})^T\Phi_k(x_i)+\delta_{y_i,r}\ge 1-\xi_i,$$
$$\forall i\in\{1,\ldots,n\}:\ \xi_i\ge 0,\qquad \forall k\in\{1,\ldots,M\}:\ \beta_k\ge 0,\qquad \sum_{k=1}^{M}\beta_k^2\le 1,$$
$$\forall p,q\in\{1,\ldots,m\}:\ -l\le\sum_{i=1}^{n}\sum_{k=1}^{M}(\mathbf{v}_k^p-\mathbf{v}_k^q)^T\Phi_k(x_i)\le l,$$

where we have applied the following change of variables

$$\forall p\in\{1,\ldots,m\},\ k\in\{1,\ldots,M\}:\ \mathbf{v}_k^p=\beta_k\mathbf{w}_k^p \tag{4.32}$$

to ensure that the objective function and the last constraint are convex. Similar to two-class clustering, we have also added class balance constraints (where l > 0) in the formulation to control class imbalance. Again, the above formulation is an integer program, and is much more complex than the QP problem in multi-class SVM. Fortunately, similar to the two-class case, we have the following theorem.


Theorem 4.1. Problem (4.31) is equivalent to

$$\min_{\beta,\mathbf{v},\xi}\ \frac{1}{2}\sum_{k=1}^{M}\sum_{p=1}^{m}\frac{\|\mathbf{v}_k^p\|^2}{\beta_k}+\frac{C}{n}\sum_{i=1}^{n}\xi_i \tag{4.33}$$
$$\text{s.t.}\quad \forall i\in\{1,\ldots,n\},\ r\in\{1,\ldots,m\}:\ \sum_{k=1}^{M}\Big(\sum_{p=1}^{m}z_{ip}\mathbf{v}_k^p-\mathbf{v}_k^r\Big)^T\Phi_k(x_i)+z_{ir}\ge 1-\xi_i,$$
$$\forall i\in\{1,\ldots,n\}:\ \xi_i\ge 0,\qquad \forall k\in\{1,\ldots,M\}:\ \beta_k\ge 0,\qquad \sum_{k=1}^{M}\beta_k^2\le 1,$$
$$\forall p,q\in\{1,\ldots,m\}:\ -l\le\sum_{i=1}^{n}\sum_{k=1}^{M}(\mathbf{v}_k^p-\mathbf{v}_k^q)^T\Phi_k(x_i)\le l,$$

where $z_{ip}$ is defined as

$$\forall i\in\{1,\ldots,n\},\ p\in\{1,\ldots,m\}:\quad z_{ip}=\prod_{q=1,q\ne p}^{m} I\Big[\sum_{k=1}^{M}\mathbf{v}_k^{pT}\Phi_k(x_i)>\sum_{k=1}^{M}\mathbf{v}_k^{qT}\Phi_k(x_i)\Big],$$

with I(·) being the indicator function, and the label for sample $x_i$ is determined as $y_i=\arg\max_p\sum_{k=1}^{M}\mathbf{v}_k^{pT}\Phi_k(x_i)=\sum_{p=1}^{m}p\,z_{ip}$.

The multiple kernel clustering formulation shown in Eq.(4.33) has n slack variables {ξ1, . . . , ξn} in the non-convex constraints. We propose the following theorem to reduce the number of slack variables in Eq.(4.33).

Theorem 4.2. Problem (4.33) can be equivalently formulated as problem (4.34), with $\xi^*=\frac{1}{n}\sum_{i=1}^{n}\xi_i^*$:

$$\min_{\beta,\mathbf{v},\xi}\ \frac{1}{2}\sum_{k=1}^{M}\sum_{p=1}^{m}\frac{\|\mathbf{v}_k^p\|^2}{\beta_k}+C\xi \tag{4.34}$$
$$\text{s.t.}\quad \forall \mathbf{c}_i\in\{\mathbf{e}_0,\mathbf{e}_1,\ldots,\mathbf{e}_m\},\ i\in\{1,\ldots,n\}:$$
$$\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{M}\sum_{p=1}^{m}(\mathbf{c}_i^T\mathbf{e}\,z_{ip}-c_{ip})\mathbf{v}_k^{pT}\Phi_k(x_i)+\frac{1}{n}\sum_{i=1}^{n}\sum_{p=1}^{m}c_{ip}z_{ip}\ge\frac{1}{n}\sum_{i=1}^{n}\mathbf{c}_i^T\mathbf{e}-\xi,$$
$$\forall k\in\{1,\ldots,M\}:\ \beta_k\ge 0,\qquad \sum_{k=1}^{M}\beta_k^2\le 1,\quad \xi\ge 0,$$
$$\forall p,q\in\{1,\ldots,m\}:\ -l\le\sum_{i=1}^{n}\sum_{k=1}^{M}(\mathbf{v}_k^p-\mathbf{v}_k^q)^T\Phi_k(x_i)\le l.$$

Here, $\mathbf{e}_p$ denotes the m × 1 vector with only the pth element being 1 and all others 0, $\mathbf{e}_0$ the m × 1 zero vector, and $\mathbf{e}$ the vector of all ones.

After the above reformulation, a single slack variable ξ is shared across all the non-convex constraints. We propose the use of the cutting plane algorithm to handle the exponential number of constraints in problem (4.34).

Algorithm 3 Cutting plane algorithm for multiple kernel multi-class maximum margin clustering.
Input: M feature mappings Φ1, . . . , ΦM, parameters C, l and ε, constraint subset Ω = ∅.
repeat
  Solve problem (4.34) for (v_k^p, β), k = 1, . . . , M and p = 1, . . . , m, under the current working constraint set Ω.
  Select the most violated constraint c and set Ω = Ω ∪ {c}.
until the newly selected constraint c is violated by no more than ε.

4.2 Optimization via the CCCP Given an initial point v^(0), the CCCP computes v^(t+1) from v^(t) by replacing $\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{M}\sum_{p=1}^{m}\mathbf{c}_i^T\mathbf{e}\,z_{ip}\mathbf{v}_k^{pT}\Phi_k(x_i)+\frac{1}{n}\sum_{i=1}^{n}\sum_{p=1}^{m}c_{ip}z_{ip}$ in the constraint with its first-order Taylor expansion at v^(t):

$$\min_{\beta,\mathbf{v},\xi}\ \frac{1}{2}\sum_{k=1}^{M}\sum_{p=1}^{m}\frac{\|\mathbf{v}_k^p\|^2}{\beta_k}+C\xi \tag{4.35}$$
$$\text{s.t.}\quad \forall [\mathbf{c}_1,\ldots,\mathbf{c}_n]\in\Omega:$$
$$\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{M}\sum_{p=1}^{m}(\mathbf{c}_i^T\mathbf{e}\,z_{ip}^{(t)}-c_{ip})\mathbf{v}_k^{pT}\Phi_k(x_i)+\frac{1}{n}\sum_{i=1}^{n}\sum_{p=1}^{m}c_{ip}z_{ip}^{(t)}\ge\frac{1}{n}\sum_{i=1}^{n}\mathbf{c}_i^T\mathbf{e}-\xi,$$
$$\forall k\in\{1,\ldots,M\}:\ \beta_k\ge 0,\qquad \sum_{k=1}^{M}\beta_k^2\le 1,\quad \xi\ge 0,$$
$$\forall p,q\in\{1,\ldots,m\}:\ -l\le\sum_{i=1}^{n}\sum_{k=1}^{M}(\mathbf{v}_k^p-\mathbf{v}_k^q)^T\Phi_k(x_i)\le l.$$

Similar to two-class clustering, the above problem can be formulated as an SOCP and solved efficiently. According to the procedure of CCCP, we solve problem (4.34) under the constraint set Ω with Algorithm 4.
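As a small illustration of the multi-class decision rule behind Eq.(4.33), the numpy sketch below computes the per-class scores, the cluster labels, and the one-hot indicators $z_{ip}$. The array layout (V as an m × D matrix stacking all $v_k^p$ blocks row-wise) is our own choice, not specified in the paper.

```python
import numpy as np

def multiclass_labels(Phi, V):
    """Phi: (n, D) rows stack [Phi_1(x_i), ..., Phi_M(x_i)];
    V: (m, D) rows stack the per-class weights [v_1^p, ..., v_M^p]."""
    scores = Phi @ V.T                 # scores[i, p] = sum_k v_k^{pT} Phi_k(x_i)
    y = np.argmax(scores, axis=1)      # y_i = argmax_p score, cf. Eq. (4.33)
    z = np.zeros_like(scores)
    z[np.arange(len(y)), y] = 1.0      # z_ip = 1 iff p is the winning class for x_i
    return y, z
```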

4.3 The Most Violated Constraint The most violated constraint is the one that results in the largest ξ, and can be obtained by the following theorem.

Theorem 4.3. The most violated constraint can be obtained as

$$\mathbf{c}_i=\begin{cases}\mathbf{e}_{r^*} & \text{if }\sum_{k=1}^{M}\mathbf{v}_k^{p^*T}\Phi_k(x_i)-\sum_{k=1}^{M}\mathbf{v}_k^{r^*T}\Phi_k(x_i)<1,\\ \mathbf{e}_0 & \text{otherwise,}\end{cases}$$

where $p^*=\arg\max_p\sum_{k=1}^{M}\mathbf{v}_k^{pT}\Phi_k(x_i)$ and $r^*=\arg\max_{r\ne p^*}\sum_{k=1}^{M}\mathbf{v}_k^{rT}\Phi_k(x_i)$.

Algorithm 4 Solve problem (4.34) subject to constraint subset Ω via the CCCP.
Initialize v^(0).
repeat
  Obtain (v^(t+1), β, ξ) as the solution to the second order cone programming problem (4.35).
  Set v = v^(t+1) and t = t + 1.
until the stopping criterion is satisfied.

5 Experiments In this section, we demonstrate the accuracy and efficiency of the multiple kernel clustering algorithms on several toy and real-world data sets. All the experiments are performed with MATLAB 7.0 on a 1.66GHz Intel Core™ 2 Duo PC running Windows XP with 1.5GB memory. In the following, we will simply refer to the proposed algorithms as MKC.

5.1 Data Sets We use seven data sets which are intended to cover a wide range of properties: ionosphere, digits, letter and satellite (from the UCI repository), svmguide1-a (from the LIBSVM data¹), ringnorm² and mnist³. The two-class data sets are created following the same setting as in [30]. We also create several multi-class data sets from the digits, letter and mnist data. We summarize all of these in Table 1.

Table 1: Descriptions of the data sets.

Data          Size   Dimension  Class
digits1v7      361      64        2
digits2v7      356      64        2
ionosphere     354      64        2
svmguide1-a   1000       4        2
ringnorm      1000      20        2
letterAvB     1555      16        2
satellite     2236      36        2
digits0689     713      64        4
digits1279     718      64        4
letterABCD    3096      16        4
mnist01234   28911     196        5

¹ http://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/data sets/
² http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm
³ http://yann.lecun.com/exdb/mnist/

5.2 Comparison of Clustering Accuracy We will first study the clustering accuracy of the MKC algorithm. Specifically, we use a kernel matrix $K=\sum_{k=1}^{3}\beta_k K_k$, where the $K_k$'s are initial “guesses” of the kernel matrix. We use a linear kernel function $k_1(x_1,x_2)=x_1^Tx_2$ for $K_1$, a polynomial kernel function $k_2(x_1,x_2)=(1+x_1^Tx_2)^d$ for $K_2$ and a Gaussian kernel function $k_3(x_1,x_2)=\exp(-(x_1-x_2)^T(x_1-x_2)/2\sigma)$ for $K_3$. For comparison, we use k-means clustering (KM) and normalized cut (NC) as baselines. We also compare with IterSVR [30], which performs MMC on two-class data. For KM, the cluster centers are initialized randomly, and the performance reported is summarized over 50 independent runs. For NC, the width of the Gaussian kernel is set by an exhaustive search from the grid {0.1σ0, 0.2σ0, . . . , σ0}, where σ0 is the range of distance between any two data points in the data set. Finally, for IterSVR, the initialization is based on k-means with randomly selected initial cluster centers. The Gaussian kernel is used and its width is set in the same way as in NC.

In all the clustering algorithms, we set the number of clusters to the true number of classes (m). To assess the clustering accuracy, we follow the strategy used in [27]: We first take a set of labeled data, remove all the labels and run the clustering algorithms; then we label each of the resulting clusters with the majority class according to the original labels, and finally measure the number of correct classifications. Moreover, we also calculate the Rand Index⁴ [19] for each clustering result. Results on the various data sets are summarized in Table 2. As can be seen, the clustering accuracy and Rand Index of MKC are comparable to those attained by the best base kernel. In most cases, it is even better than the other competing clustering algorithms.

⁴ The Rand index has a value between 0 and 1, with 0 indicating that the data clustering does not agree on any pair of points with the true clustering, and 1 indicating that the two clustering results are exactly the same.

Table 2: Clustering accuracies (%) and Rand Indices on the various data sets. For each method, the number on the left denotes the clustering accuracy, and the number on the right stands for the Rand Index. The symbol '-' means that the corresponding algorithm cannot handle the data set in reasonable time. Moreover, note that IterSVR can only be used on two-class data sets.

Data          KM             NC             IterSVR        K1              K2             K3             MKC
digits1v7     99.45  0.995   55.00  0.504   99.45  0.995   100.00  1.00    95.52  0.915   95.24  0.910   100.00  1.00
digits2v7     96.91  0.940   66.00  0.550   100.00 1.00    100.00  1.00    97.47  0.951   99.16  0.989   100.00  1.00
ionosphere    68.00  0.564   75.00  0.626   67.70  0.562   72.36   0.599   62.51  0.531   86.52  0.767   91.01   0.839
svmguide1-a   76.50  0.640   87.50  0.781   93.20  0.873   78.40   0.661   77.60  0.653   84.30  0.735   85.50   0.752
ringnorm      76.00  0.635   77.70  0.653   80.70  0.831   76.70   0.642   58.10  0.513   98.40  0.969   99.00   0.980
letterAvB     82.06  0.706   76.80  0.644   92.80  0.867   93.12   0.873   90.35  0.826   93.38  0.877   93.83   0.885
satellite     95.93  0.922   95.79  0.919   96.82  0.939   98.48   0.971   76.79  0.644   88.68  0.799   99.37   0.992
digits0689    42.33  0.696   93.13  0.939   -      -       96.63   0.968   94.11  0.946   84.57  0.887   97.77   0.978
digits1279    40.42  0.681   90.11  0.909   -      -       94.01   0.943   90.11  0.911   90.39  0.914   94.43   0.948
letterABCD    66.18  0.782   68.19  0.811   -      -       70.77   0.804   65.05  0.761   53.55  0.684   71.89   0.821
mnist01234    67.63  0.898   -      -       -      -       89.98   0.901   87.34  0.882   90.12  0.907   91.55   0.922

5.3 Speed A potential concern about multiple kernel clustering is that it might be much slower than maximum margin clustering. Figure 1 compares the CPU-time⁵ of MKC with the total CPU-time of MMC with K1, K2 and K3. As can be seen, the speed of MKC is comparable to MMC. Indeed, MKC even converges faster than MMC on several data sets. However, unlike MMC which requires a carefully defined kernel matrix, MKC has the strong advantage that it can automatically choose a good combination of base kernels.

⁵ The CPU-time consists of the computational time of kernel PCA (to obtain the feature representations corresponding to nonlinear kernels) and that of MKC or MMC.

Figure 1: CPU-time (seconds) of MKC and MMC as a function of the data set size n.

5.4 Scaling Properties of MKC In Section 3, we showed that the time complexity of MKC scales linearly with the number of samples. Figure 2 shows a log-log plot of the empirical results. Note that lines in a log-log plot correspond to polynomial growth O(n^d), where d is the slope of the line. As can be seen, the CPU-time of MKC scales roughly as O(n), and is thus consistent with Theorem 3.1. Moreover, as mentioned in Section 3, each round of MKC requires fewer than 10 iterations for solving problem (2.7) or (4.34) subject to the constraints in Ω. Again, this is confirmed by the experimental results in Figure 2, which shows how the number of CCCP iterations (averaged over all the cutting plane iterations) varies with sample size on the various data sets.

Figure 2: Left: CPU-time of MKC vs. data set size. Right: Average number of CCCP iterations in MKC vs. data set size.

Next, we study how the CPU-time of MKC varies when the number of base kernels⁶ is increased from one to ten. As can be seen from Figure 3, the CPU-time of MKC scales roughly quadratically with the number of base kernels. This is much better than the bound of O(M^3.5) in Section 3. Finally, Section 3 states that the total number of iterations involved in MKC is at most CR/ε². This means that with a higher ε, the algorithm might converge faster. Figure 3 shows how the CPU-time of MKC scales with ε. As can be seen, the empirical scaling is roughly O(1/ε^0.2), which is much better than the $O\big(\frac{D^{3.5}+nD}{\varepsilon^2}+\frac{D^{2.5}}{\varepsilon^4}\big)$ shown in the bound.

⁶ Gaussian kernels are used here.

Figure 3: Left: CPU-time of MKC vs. the number of kernels. Right: CPU-time of MKC vs. ε.

5.5 Generalization Ability of MKC Recall that maximum margin clustering adopts the maximum margin principle of SVM, which often allows good generalization on unseen data. In this experiment, we also examine the generalization ability of MKC on unseen data samples. We first learn the multiple kernel clustering model on a data subset randomly drawn from the whole data set. Then we use the learned model to cluster the whole data set. As can be seen in Table 3, the clustering performance of the model learned on the data subset is comparable with that of the model learned on the whole data set. This suggests an important application scenario for multiple kernel clustering, namely that for a large data set, we can simply perform the clustering process on a small subset of the data and then use the learned model to cluster the remaining data points.

Table 3: Generalization ability on unseen samples when the MKC model is learned only from a data subset.

Data          from whole set (Acc / RIndex)   subset size   from data subset (Acc / RIndex)
letterAvB     93.83 / 0.885                    500           93.27 / 0.874
satellite     99.37 / 0.992                    500           98.47 / 0.984
letterABCD    71.89 / 0.821                    500           70.00 / 0.781
mnist01234    91.55 / 0.922                   1000           91.68 / 0.920
Conclusions

In this paper, we extend multiple kernel learning to the unsupervised learning setting. In particular, we propose the multiple kernel clustering (MKC) algorithm that simultaneously finds the maximum margin hyperplane, the best cluster labeling and the optimal kernel. We provide detailed analysis on the time complexity of the MKC algorithm and also extend it to the multiclass scenario. Experimental results on both toy and real-world data sets demonstrate the effectiveness and efficiency of the algorithm. References [1] F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004. [2] O. Bousquet and D. Herrmann. On the complexity of learning the kernel matrix. In NIPS, 2003. [3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. [4] P. K. Chan, D. F. Schlag, and J. Y. Zien. Spectral kway ratio-cut partitioning and clustering. IEEE Trans. Computer-Aided Design, 13:1088–1096, 1994. [5] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46:131–159, 2002. [6] O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In AISTATS, 2005. [7] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265–292, 2001. [8] C. Ding, X. He, H. Zha, M. Gu, and H. D. Simon. A min-max cut algorithm for graph partitioning and data mining. In ICDM, pages 107–114, 2001. [9] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, Inc., 2001. [10] M. Gonen and E. Alpaydin. Localized multiple kernel learning. In ICML, 2008. [11] T. Joachims. Training linear SVMs in linear time. In SIGKDD, 2006. [12] J. E. Kelley. The cutting-plane method for solving convex programs. Journal of the Society for Industrial Applied Mathematics, 8:703–712, 1960.

[13] G. Lanckriet, T. De Bie, N. Cristianini, M. Jordan, and W. Noble. A statistical framework for genomic data fusion. Bioinfomatics, 20(16):2626–2635, 2004. [14] G. Lanckriet, N. Cristianini, L. Ghaoui, P. Bartlett, and M. Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 5:27–72, 2004. [15] M. Lobo, L. Vandenberghe, S. Boyd, and H. Lebret. Applications of second-order cone programming. Linear Algebra Appl., 284:193–228, 1998. [16] Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, 1994. [17] C. Ong, A. Smola, and R. Williamson. Learning the kernel with hyperkernels. JMLR, 6:1043–1071, 2005. [18] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. More efficiency in multiple kernel learning. In ICML, 2007. [19] W. Rand. Objective criteria for the evaluation of clustering methods. JASA, 66:846–850, 1971. [20] B. Sch¨ olkopf, A. J. Smola, and K. R. M¨ uller. Kernel principal component analysis. Advances in kernel methods: Support vector learning, pages 327–352, 1999. [21] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000. [22] A. J. Smola, S.V.N. Vishwanathan, and T. Hofmann. Kernel methods for missing variables. In AISTATS, 2005. [23] S. Sonnenburg, G. R¨ atsch, C. Sch¨ afer, and B. Sch¨ olkopf. Large scale multiple kernel learning. JMLR, 7:1531– 1565, 2006. [24] P. Sun and X. Yao. Boosting kernel models for regression. In ICDM, 2006. [25] I. Tsang and J. Kwok. Efficient hyperkernel learning using second-order cone programming. IEEE Transactions on Neural Networks, 17(1):48–58, 2006. [26] H. Valizadegan and R. Jin. Generalized maximum margin clustering and unsupervised kernel learning. In NIPS, 2007. [27] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In NIPS, 2004. [28] L. Xu and D. Schuurmans. Unsupervised and semisupervised multi-class support vector machines. In AAAI, 2005. [29] J. Ye, S. Ji, and J. Chen. Learning the kernel matrix in discriminant analysis via quadratically constrained quadratic programming. In SIGKDD, 2007. [30] K. Zhang, I. W. Tsang, and J. T. Kwok. Maximum margin clustering made practical. In ICML, 2007. [31] B. Zhao, F. Wang, and C. Zhang. Efficient maximum margin clustering via cutting plane algorithm. In SDM, 2008. [32] B. Zhao, F. Wang, and C. Zhang. Efficient multiclass maximum margin clustering. In ICML, 2008. [33] A. Zien and C. Ong. Multiclass multiple kernel learning. In ICML, 2007.

Efficient MultiClass Maximum Margin Clustering

Bin Zhao [email protected] Fei Wang [email protected] Changshui Zhang [email protected] State Key Laboratory of Intelligent Technologies and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Automation, Tsinghua University, Beijing 100084, China

Abstract This paper presents a cutting plane algorithm for multiclass maximum margin clustering (MMC). The proposed algorithm constructs a nested sequence of successively tighter relaxations of the original MMC problem, and each optimization problem in this sequence could be efficiently solved using the constrained concave-convex procedure (CCCP). Experimental evaluations on several real world datasets show that our algorithm converges much faster than existing MMC methods with guaranteed accuracy, and can thus handle much larger datasets efficiently.

1. Introduction Clustering (Duda et al., 2001; Shi & Malik, 2000; Ding et al., 2001) aims at dividing data into groups of similar objects, i.e. clusters. Recently, motivated by the success of large margin methods in supervised learning, (Xu et al., 2004) proposed maximum margin clustering (MMC), which borrows the idea from the support vector machine theory and aims at finding the maximum margin hyperplane which can separate the data from different classes in an unsupervised way. Technically, what MMC does is to find a way to label the samples by running an SVM implicitly, and the SVM margin obtained would be maximized over all possible labelings (Xu et al., 2004). However, unlike supervised large margin methods which are usually formulated as convex optimization problems, maximum margin clustering is a non-convex integer optimization problem, which is much more difficult to solve.

Several attempts have been made to solve the maximum margin clustering problem in polynomial time. (Xu et al., 2004) and (Valizadegan & Jin, 2007) made several relaxations to the original MMC problem and reformulated it as a semi-definite programming (SDP) problem. However, even with the recent advances in interior point methods, solving SDPs is still computationally very expensive. Consequently, the algorithms can only handle very small datasets containing several hundreds of samples. More recently, Zhang et al. utilized alternating optimization techniques to solve the MMC problem (Zhang et al., 2007), in which the maximum margin clustering result is obtained by solving a series of SVM training problems. However, there is no guarantee on how fast it can converge and the algorithm is still time demanding on large scale datasets. Moreover, the methods described above can only handle binary clustering problems (Zhao et al., 2008), and there are significant complications in deriving an effective maximum margin clustering approach for the multiclass scenario¹. Therefore, how to efficiently solve the multiclass MMC problem to make it capable of clustering large scale datasets is a very challenging research topic. In this paper, we propose a cutting plane multiclass maximum margin clustering algorithm, CPM3C. Specifically, the algorithm constructs a nested sequence of successively tighter relaxations of the original multiclass MMC problem, and each optimization problem in this sequence can be efficiently solved using the constrained concave-convex procedure (CCCP). Moreover, we show that the computational time of CPM3C scales roughly linearly with the dataset size. Our experimental evaluations on several real world datasets show that CPM3C performs better than existing MMC methods, both in efficiency and accuracy.

It should be noted that (Xu & Schuurmans, 2005) proposed a multiclass extension for MMC, however, their algorithm has a time complexity of O(n7 ), which renders it impractical for real world datasets.


The rest of this paper is organized as follows. Section 2 will show the CPM3C algorithm in detail, and the time complexity analysis of CPM3C will be presented in section 3. Section 4 presents the experimental results on several real world datasets, followed by the conclusions in section 5.

2. Cutting Plane Multiclass Maximum Margin Clustering

We will formally present the cutting plane multiclass maximum margin clustering (CPM3C) algorithm in this section.

2.1. Multiclass Maximum Margin Clustering

Maximum margin clustering (MMC) extends the theory of the support vector machine (SVM) to the unsupervised scenario. Specifically, given a point set X = {x1, · · · , xn} and their labels y = (y1, . . . , yn) ∈ {1, . . . , k}^n, SVM defines a weight vector wp for each class p ∈ {1, . . . , k} and classifies sample x by $y^*=\arg\max_{y\in\{1,\ldots,k\}}\mathbf{w}_y^T x$, with the weight vectors obtained as follows² (Crammer & Singer, 2001):

$$\min_{\mathbf{w}_1,\ldots,\mathbf{w}_k,\xi}\ \frac{1}{2}\sum_{p=1}^{k}\|\mathbf{w}_p\|^2 + \beta\sum_{i=1}^{n}\xi_i \tag{1}$$
$$\text{s.t.}\quad \forall i=1,\ldots,n,\ r=1,\ldots,k:\ \mathbf{w}_{y_i}^T x_i+\delta_{y_i,r}-\mathbf{w}_r^T x_i \ge 1-\xi_i,$$

where the data samples in X are mapped into a high (possibly infinite) dimensional feature space, and by using the kernel trick, this mapping could be done implicitly. However, in those cases where the kernel trick cannot be applied, it is possible to compute the coordinates of each sample in the kernel PCA basis (Schölkopf et al., 1999) according to kernel K. Therefore, throughout the rest of this paper, we use xi to denote the sample mapped by the kernel function.

² Although we focus on the multiclass SVM formulation of (Crammer & Singer, 2001), our method can be directly applied to other multiclass SVM formulations.

Instead of finding a large margin classifier given labels on the data as in SVM, MMC targets to find a labeling that would result in a large margin classifier (Xu et al., 2004). That is to say, if we subsequently run an SVM with the labeling obtained from MMC, the margin would be maximal among all possible labelings. Multiclass MMC could be formulated as follows:

$$\min_{\mathbf{w}_1,\ldots,\mathbf{w}_k,\xi,\mathbf{y}}\ \frac{\beta}{2}\sum_{p=1}^{k}\|\mathbf{w}_p\|^2 + \frac{1}{n}\sum_{i=1}^{n}\xi_i \tag{2}$$
$$\text{s.t.}\quad \forall i=1,\ldots,n,\ r=1,\ldots,k:\ \mathbf{w}_{y_i}^T x_i+\delta_{y_i,r}-\mathbf{w}_r^T x_i \ge 1-\xi_i,$$

where $\sum_{i=1}^{n}\xi_i$ is divided by n to better capture how the regularization parameter β scales with the dataset size. Different from SVM, where the class labels are given and the only variables are (w1, . . . , wk), MMC targets to find not only the optimal weight vectors, but also the optimal labeling vector y*.

2.2. Cutting Plane Algorithm

In this section, we will reformulate problem (2) to reduce the number of variables. Specifically,

Theorem 1 Problem (2) is equivalent to

$$\min_{\mathbf{w}_1,\ldots,\mathbf{w}_k,\xi}\ \frac{\beta}{2}\sum_{p=1}^{k}\|\mathbf{w}_p\|^2 + \frac{1}{n}\sum_{i=1}^{n}\xi_i \tag{3}$$
$$\text{s.t.}\quad \forall i=1,\ldots,n,\ r=1,\ldots,k:$$
$$\sum_{p=1}^{k}\mathbf{w}_p^T x_i\prod_{q=1,q\ne p}^{k} I(\mathbf{w}_p^T x_i>\mathbf{w}_q^T x_i) + \prod_{q=1,q\ne r}^{k} I(\mathbf{w}_r^T x_i>\mathbf{w}_q^T x_i) - \mathbf{w}_r^T x_i \ge 1-\xi_i,$$

where I(·) is the indicator function and the label for sample $x_i$ is determined as

$$y_i = \sum_{p=1}^{k}p\prod_{q=1,q\ne p}^{k} I(\mathbf{w}_p^T x_i>\mathbf{w}_q^T x_i). \tag{4}$$

Proof. We will show that for every $(w_1,\ldots,w_k)$ the smallest feasible $\sum_{i=1}^{n}\xi_i$ are identical for problem (2) and problem (3), and that their corresponding labeling vectors are the same. For a given $(w_1,\ldots,w_k)$, the $\xi_i$ in problem (2) can be optimized individually. According to the constraint in problem (2),

$$\xi_i\geq 1-(w_{y_i}^Tx_i+\delta_{y_i,r}-w_r^Tx_i),\quad\forall r=1,\ldots,k \qquad (5)$$

As the objective is to minimize $\frac{1}{n}\sum_{i=1}^{n}\xi_i$, the optimal value for $\xi_i$ is

$$\xi_i^{(1)}=\min_{y_i=1,\ldots,k}\ \max_{r=1,\ldots,k}\{1-(w_{y_i}^Tx_i+\delta_{y_i,r}-w_r^Tx_i)\} \qquad (6)$$

and we denote the corresponding class label by $y_i^{(1)}$. Without loss of generality, we assume the following relationship

$$w_{i_1}^Tx_i\geq w_{i_2}^Tx_i\geq\ldots\geq w_{i_k}^Tx_i \qquad (7)$$

where $(i_1,i_2,\ldots,i_k)$ is a permutation of $(1,2,\ldots,k)$. For $y_i\neq i_1$, $\max_{r}\{1-(w_{y_i}^Tx_i+\delta_{y_i,r}-w_r^Tx_i)\}\geq 1$, while for $y_i=i_1$, $\max_{r}\{1-(w_{y_i}^Tx_i+\delta_{y_i,r}-w_r^Tx_i)\}\leq 1$; therefore, $y_i^{(1)}=i_1$ and

$$\xi_i^{(1)}=\max_{r=1,\ldots,k}\{1-(w_{i_1}^Tx_i+\delta_{i_1,r}-w_r^Tx_i)\}=\max\{0,\,1-(w_{i_1}^Tx_i-w_{i_2}^Tx_i)\} \qquad (8)$$

Similarly, for any given $(w_1,\ldots,w_k)$, the $\xi_i$ in problem (3) can be optimized individually, and the optimum is achieved as

$$\xi_i^{(2)}=\max_{r=1,\ldots,k}\Big\{1-\Big[\sum_{p=1}^{k}w_p^Tx_i\prod_{q\neq p}I(w_p^Tx_i>w_q^Tx_i)+\prod_{q\neq r}I(w_r^Tx_i>w_q^Tx_i)-w_r^Tx_i\Big]\Big\} \qquad (9)$$
$$\phantom{\xi_i^{(2)}}=\max_{r=1,\ldots,k}\{1-(w_{i_1}^Tx_i+\delta_{i_1,r}-w_r^Tx_i)\}=\max\{0,\,1-(w_{i_1}^Tx_i-w_{i_2}^Tx_i)\}$$

where we assume the relation in (7) holds, and the class label could be calculated as

$$y_i^{(2)}=\sum_{p=1}^{k}p\prod_{q=1,q\neq p}^{k}I(w_p^Tx_i>w_q^Tx_i)=i_1 \qquad (10)$$

Therefore, the objective functions of both optimization problems are equivalent for any $(w_1,\ldots,w_k)$ with the same optimal $\xi_i$, and consequently so are their optima. Moreover, their corresponding labeling vectors $y$ are the same. Hence, we proved that problem (2) is equivalent to problem (3). □

By reformulating problem (2) as problem (3), the number of variables involved is reduced by $n$, but there are still $n$ slack variables $\xi_i$ in problem (3). Define $e_p$ as the $k\times 1$ vector with only the $p$-th element being 1 and the others 0, $e_0$ as the $k\times 1$ zero vector, and $e$ as the all-one vector. To further reduce the number of variables involved in the optimization problem, we have the following theorem.

Theorem 2. Problem (3) can be equivalently formulated as problem (11), with $\xi^*=\frac{1}{n}\sum_{i=1}^{n}\xi_i^*$:

$$\min_{w_1,\ldots,w_k,\xi\geq 0}\ \frac{\beta}{2}\sum_{p=1}^{k}\|w_p\|^2+\xi \qquad (11)$$
$$\mathrm{s.t.}\ \ \frac{1}{n}\sum_{i=1}^{n}\Big\{c_i^Te\sum_{p=1}^{k}w_p^Tx_i z_{ip}+\sum_{p=1}^{k}c_{ip}(z_{ip}-w_p^Tx_i)\Big\}\geq\frac{1}{n}\sum_{i=1}^{n}c_i^Te-\xi,\quad\forall c_i\in\{e_0,e_1,\ldots,e_k\},\ i=1,\ldots,n \qquad (12)$$

where $z_{ip}=\prod_{q=1,q\neq p}^{k}I(w_p^Tx_i>w_q^Tx_i)$ for all $i=1,\ldots,n$ and $p=1,\ldots,k$, and each constraint $c$ is represented as a $k\times n$ matrix $c=(c_1,\ldots,c_n)$.

Proof. To justify the above theorem, we will show that problem (3) and problem (11) have the same objective value and an equivalent set of constraints. Specifically, we will prove that for every $(w_1,\ldots,w_k)$, the smallest feasible $\xi$ and $\sum_{i=1}^{n}\xi_i$ are related by $\xi=\frac{1}{n}\sum_{i=1}^{n}\xi_i$. This means that, with $(w_1,\ldots,w_k)$ fixed, $(w_1,\ldots,w_k,\xi)$ and $(w_1,\ldots,w_k,\xi_i)$ are optimal solutions to problem (11) and problem (3) respectively, and they result in the same objective function value.

Similarly, for problem (11), the optimal $\xi$ is

$$\xi^{(3)}=\max_{c_1,\ldots,c_n\in\{e_0,\ldots,e_k\}}\Big\{\frac{1}{n}\sum_{i=1}^{n}c_i^Te-\frac{1}{n}\sum_{i=1}^{n}\Big[c_i^Te\sum_{p=1}^{k}w_p^Tx_i z_{ip}+\sum_{p=1}^{k}c_{ip}(z_{ip}-w_p^Tx_i)\Big]\Big\} \qquad (13)$$

Since each $c_i$ is independent in Eq.(13), they can be optimized individually. Therefore,

$$\xi^{(3)}=\frac{1}{n}\sum_{i=1}^{n}\max_{c_i}\Big\{c_i^Te-c_i^Te\sum_{p=1}^{k}w_p^Tx_i z_{ip}-\sum_{p=1}^{k}c_{ip}(z_{ip}-w_p^Tx_i)\Big\}$$
$$=\frac{1}{n}\sum_{i=1}^{n}\max\Big\{0,\ \max_{p=1,\ldots,k}\big[1-(w_{i_1}^Tx_i+\delta_{i_1,p}-w_p^Tx_i)\big]\Big\}$$
$$=\frac{1}{n}\sum_{i=1}^{n}\max\big\{0,\ \max[0,\,1-(w_{i_1}^Tx_i-w_{i_2}^Tx_i)]\big\}$$
$$=\frac{1}{n}\sum_{i=1}^{n}\max\{0,\,1-(w_{i_1}^Tx_i-w_{i_2}^Tx_i)\}=\frac{1}{n}\sum_{i=1}^{n}\xi_i^{(2)}$$

Hence, for any $(w_1,\ldots,w_k)$, the objective functions of problem (3) and problem (11) have the same value given the optimal $\xi$ and $\xi_i$. Therefore, the optima of the two optimization problems are the same. □

Putting Theorems 1 and 2 together, we could therefore solve problem (11) instead to find the same maximum margin clustering solution, with the number of variables reduced by $2n-1$. Although the number of variables in problem (11) is greatly reduced, the number of constraints increases from $nk$ to $(k+1)^n$. The algorithm we propose in this paper targets to find a small subset of constraints from the whole set of constraints in problem (11) that ensures a sufficiently accurate solution. Specifically, we employ an adaptation of the cutting plane algorithm (Kelley, 1960) to solve problem (11), where we construct a nested sequence of successively tighter relaxations of problem (11). Moreover, we can prove theoretically (see Section 3) that we can always find a polynomially sized subset of constraints with which the solution of the relaxed problem fulfills all constraints from problem (11) up to a precision of $\epsilon$. That is to say, the remaining exponential number of constraints are guaranteed to be violated by no more than $\epsilon$, without the need for explicitly adding them to the optimization problem (Tsochantaridis et al., 2005). Specifically, the CPM3C algorithm keeps a subset $\Omega$ of working constraints and computes the optimal solution to problem (11) subject to the constraints in $\Omega$.


The algorithm then adds the most violated constraint in problem (11) into $\Omega$. In this way, a successively strengthening approximation of the original MMC problem is constructed by a cutting plane that cuts off the current optimal solution from the feasible set (Kelley, 1960). The algorithm stops when no constraint in (11) is violated by more than $\epsilon$. Here, the feasibility of a constraint is measured by the corresponding value of $\xi$; therefore, the most violated constraint is the one that results in the largest $\xi$. Since each constraint in problem (11) is represented by a $k\times n$ matrix $c$, we have

Theorem 3. Define $p^*=\arg\max_p(w_p^Tx_i)$ and $r^*=\arg\max_{r\neq p^*}(w_r^Tx_i)$ for $i=1,\ldots,n$. The most violated constraint could be calculated as follows:

$$c_i=\begin{cases}e_{r^*}&\text{if }(w_{p^*}^Tx_i-w_{r^*}^Tx_i)<1\\ e_0&\text{otherwise}\end{cases},\quad i=1,\ldots,n \qquad (14)$$

Proof. The most violated constraint is the one that would result in the largest $\xi$. As each $c_i$ in the constraint is independent, in order to fulfill all constraints in problem (11), the value of $\xi$ is as follows:

$$\xi^*=\frac{1}{n}\sum_{i=1}^{n}\max_{c_i}\Big\{c_i^Te-c_i^Te\sum_{p=1}^{k}w_p^Tx_i z_{ip}-\sum_{p=1}^{k}c_{ip}(z_{ip}-w_p^Tx_i)\Big\}=\frac{1}{n}\sum_{i=1}^{n}\max_{c_i}\big\{c_i^T[e-w_{p^*}^Tx_i\,e-z_i+t_i]\big\}$$

where $t_i=(w_1^Tx_i,\ldots,w_k^Tx_i)^T$. Since $c_i\in\{e_0,\ldots,e_k\}$, $c_i$ selects the largest element of the vector $e-w_{p^*}^Tx_i\,e-z_i+t_i$, which could be calculated as $1-(w_{p^*}^Tx_i-w_{r^*}^Tx_i)$. Therefore, the most violated constraint $c$ that results in the largest $\xi^*$ could be calculated as in Eq.(14). □

The CPM3C algorithm iteratively selects the most violated constraint under the current weight vectors and adds it into the working constraint set $\Omega$ until no violation of constraints is detected. Moreover, if a point $(w_1,\ldots,w_k,\xi)$ fulfills all constraints up to precision $\epsilon$, i.e.,

$$\frac{1}{n}\sum_{i=1}^{n}\Big\{c_i^Te\sum_{p=1}^{k}w_p^Tx_i z_{ip}+\sum_{p=1}^{k}c_{ip}(z_{ip}-w_p^Tx_i)\Big\}\geq\frac{1}{n}\sum_{i=1}^{n}c_i^Te-\xi-\epsilon,\quad\forall c_i\in\{e_0,e_1,\ldots,e_k\},\ i=1,\ldots,n \qquad (15)$$

then the point $(w_1,\ldots,w_k,\xi+\epsilon)$ is feasible. Furthermore, as in the objective function of problem (11), there is a single slack variable $\xi$ that measures the clustering loss. Hence, we could simply select the stopping criterion as all samples satisfying inequality (15). Then, the approximation accuracy $\epsilon$ of this approximate solution is directly related to the training loss.

2.3. Enforcing the Class Balance Constraint

In 2-class maximum margin clustering, a trivially "optimal" solution is to assign all patterns to the same class, and the resultant margin will be infinite (Xu et al., 2004). Similarly, for the multiclass scenario, a large margin can always be achieved by eliminating classes (Xu & Schuurmans, 2005). Therefore, we add the following class balance constraints to avoid such trivially "optimal" solutions:

$$-l\leq\sum_{i=1}^{n}w_p^Tx_i-\sum_{i=1}^{n}w_q^Tx_i\leq l,\quad\forall p,q=1,\ldots,k \qquad (16)$$

where $l\geq 0$ controls the class imbalance. Therefore, multiclass maximum margin clustering with working constraint set $\Omega$ could be formulated as follows:

$$\min_{w_1,\ldots,w_k,\xi\geq 0}\ \frac{\beta}{2}\sum_{p=1}^{k}\|w_p\|^2+\xi \qquad (17)$$
$$\mathrm{s.t.}\ \ \frac{1}{n}\sum_{i=1}^{n}\Big\{c_i^Te\sum_{p=1}^{k}w_p^Tx_i z_{ip}+\sum_{p=1}^{k}c_{ip}(z_{ip}-w_p^Tx_i)\Big\}\geq\frac{1}{n}\sum_{i=1}^{n}c_i^Te-\xi,\quad\forall[c_1,\ldots,c_n]\in\Omega$$
$$-l\leq\sum_{i=1}^{n}w_p^Tx_i-\sum_{i=1}^{n}w_q^Tx_i\leq l,\quad\forall p,q=1,\ldots,k$$

Before getting into the details of solving problem (17), we first present the CPM3C approach in Algorithm 1.

Algorithm 1 Cutting Plane Multiclass MMC
  Initialize $\Omega=\emptyset$.
  repeat
    Solve problem (17) for $(w_1,\ldots,w_k)$ under the current working constraint set $\Omega$ and select the most violated constraint $c$ with Eq.(14). Set $\Omega=\Omega\cup\{c\}$.
  until $(w_1,\ldots,w_k)$ satisfies $c$ up to precision $\epsilon$.
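A minimal Python sketch of the two ingredients of Algorithm 1 that have a closed form, the label rule $y_i=\arg\max_p w_p^Tx_i$ and the most-violated-constraint rule of Eq. (14), may help make the procedure concrete. It assumes a dense data matrix, breaks ties arbitrarily, and deliberately omits the CCCP solver for problem (17); it is an illustration only, not the authors' implementation.

# Illustrative sketch (not the authors' code) of the constraint selection in
# Eq. (14) and the label rule of Theorem 1.
import numpy as np

def labels_and_most_violated_constraint(W, X):
    """W: (k, d) weight vectors; X: (n, d) data.
    Returns (labels, C, xi): C is the k x n constraint matrix of Eq. (14),
    each column equal to e_{r*} or the zero vector e_0, and xi is the slack
    this constraint would induce in problem (11)."""
    scores = W @ X.T                           # scores[p, i] = w_p^T x_i
    k, n = scores.shape
    idx = np.arange(n)
    p_star = scores.argmax(axis=0)             # p* = argmax_p w_p^T x_i
    runner = scores.copy()
    runner[p_star, idx] = -np.inf
    r_star = runner.argmax(axis=0)             # r* = argmax_{r != p*} w_r^T x_i
    margin = scores[p_star, idx] - scores[r_star, idx]
    C = np.zeros((k, n))
    viol = margin < 1.0                        # c_i = e_{r*} iff w_{p*}^T x_i - w_{r*}^T x_i < 1
    C[r_star[viol], idx[viol]] = 1.0
    xi = np.maximum(0.0, 1.0 - margin).mean()  # cf. the proof of Theorem 3
    return p_star, C, xi

# Usage on a toy problem: three well-separated Gaussian blobs.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=m, scale=0.3, size=(30, 2))
                   for m in ([0, 3], [3, 0], [-3, -3])])
    W = rng.normal(size=(3, 2))                # some current weight vectors
    y, C, xi = labels_and_most_violated_constraint(W, X)
    print("cluster sizes:", np.bincount(y), " induced slack:", round(xi, 3))

In the full Algorithm 1 this selection step alternates with re-solving problem (17) over the working set, which is what drives the slack of the most violated constraint below the tolerance $\epsilon$.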

2.4. Optimization via the CCCP

In each iteration of the CPM3C algorithm, we need to solve problem (17) to obtain the optimal classifying hyperplanes under the current working constraint set $\Omega$. Although the objective function in (17) is convex, the constraints are not, and this makes problem (17) difficult to solve. Fortunately, the constrained concave-convex procedure (CCCP) is designed precisely for optimization problems with a concave-convex objective function under concave-convex constraints (Smola et al., 2005). In the following, we will show how to utilize the CCCP to solve problem (17).


The objective function in (17) and the second constraint are convex. Moreover, the first constraint is, although non-convex, the difference of two convex functions. Hence, we can solve (17) with the CCCP. Notice that while $\frac{1}{n}\sum_{i=1}^{n}c_i^Te\big[\sum_{p=1}^{k}w_p^Tx_i z_{ip}+\sum_{p=1}^{k}c_{ip}z_{ip}\big]$ is convex, it is a non-smooth function of $(w_1,\ldots,w_k)$. To use the CCCP, we need to calculate the subgradients:

$$\partial_{w_r}\Big\{\frac{1}{n}\sum_{i=1}^{n}c_i^Te\Big[\sum_{p=1}^{k}w_p^Tx_i z_{ip}+\sum_{p=1}^{k}c_{ip}z_{ip}\Big]\Big\}\Big|_{w=w^{(t)}}=\frac{1}{n}\sum_{i=1}^{n}c_i^Te\,z_{ir}^{(t)}x_i,\quad\forall r=1,\ldots,k \qquad (18)$$

Given an initial point $(w_1^{(0)},\ldots,w_k^{(0)})$, the CCCP computes $(w_1^{(t+1)},\ldots,w_k^{(t+1)})$ from $(w_1^{(t)},\ldots,w_k^{(t)})$ by replacing $\frac{1}{n}\sum_{i=1}^{n}c_i^Te\big[\sum_{p=1}^{k}w_p^Tx_i z_{ip}+\sum_{p=1}^{k}c_{ip}z_{ip}\big]$ in the constraint with its first-order Taylor expansion at $(w_1^{(t)},\ldots,w_k^{(t)})$, i.e.,

$$\frac{1}{n}\sum_{i=1}^{n}c_i^Te\Big\{\sum_{p=1}^{k}w_p^{(t)T}x_i z_{ip}^{(t)}+\sum_{p=1}^{k}c_{ip}z_{ip}^{(t)}\Big\}+\frac{1}{n}\sum_{i=1}^{n}c_i^Te\sum_{p=1}^{k}(w_p-w_p^{(t)})^Tx_i z_{ip}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}c_i^Te\Big\{\sum_{p=1}^{k}w_p^Tx_i z_{ip}^{(t)}+\sum_{p=1}^{k}c_{ip}z_{ip}^{(t)}\Big\} \qquad (19)$$

By substituting the above first-order Taylor expansion into problem (17), we obtain the following quadratic programming (QP) problem:

$$\min_{w_1,\ldots,w_k,\xi\geq 0}\ \frac{\beta}{2}\sum_{p=1}^{k}\|w_p\|^2+\xi \qquad (20)$$
$$\mathrm{s.t.}\ \ \forall[c_1,\ldots,c_n]\in\Omega:\ \ \frac{1}{n}\sum_{i=1}^{n}c_i^Te-\xi+\frac{1}{n}\sum_{i=1}^{n}\sum_{p=1}^{k}c_{ip}w_p^Tx_i-\frac{1}{n}\sum_{i=1}^{n}c_i^Te\Big\{\sum_{p=1}^{k}w_p^Tx_i z_{ip}^{(t)}+\sum_{p=1}^{k}c_{ip}z_{ip}^{(t)}\Big\}\leq 0$$
$$-l\leq\sum_{i=1}^{n}w_p^Tx_i-\sum_{i=1}^{n}w_q^Tx_i\leq l,\quad\forall p,q=1,\ldots,k$$

Moreover, the dual problem of (20) is a QP problem with $|\Omega|+2$ variables and could be solved in polynomial time, where $|\Omega|$ denotes the total number of constraints in $\Omega$. Putting everything together, according to the formulation of the CCCP (Smola et al., 2005), we solve problem (17) with the approach presented in Algorithm 2. We set the stopping criterion in the CCCP as the difference between two iterations being less than $\alpha\%$, with $\alpha\%=0.01$, which means the current objective function is larger than $1-\alpha\%$ of the objective function in the last iteration, since the CCCP decreases the objective function monotonically.

Algorithm 2 Solve problem (17) with the CCCP
  Initialize $w_p=w_p^{0}$ for $p=1,\ldots,k$.
  repeat
    Find $(w_1^{t+1},\ldots,w_k^{t+1})$ as the solution to the quadratic programming problem (20).
    Set $w_p=w_p^{t+1}$, $p=1,\ldots,k$.
  until the stopping criterion is satisfied.

2.5. Theoretical Analysis

We provide the following theorem regarding the correctness of the CPM3C algorithm.

Theorem 4. For any dataset $\mathcal{X}=(x_1,\ldots,x_n)$ and any $\epsilon>0$, if $(w_1^*,\ldots,w_k^*,\xi^*)$ is the optimal solution to problem (11) with the class balance constraint, then our CPM3C algorithm returns a point $(w_1,\ldots,w_k,\xi)$ for which $(w_1,\ldots,w_k,\xi+\epsilon)$ is feasible in problem (11) and satisfies the class balance constraint. Moreover, the corresponding objective value is better than the one corresponding to $(w_1^*,\ldots,w_k^*,\xi^*)$.

Based on the above theorem, $\epsilon$ indicates how close one wants to be to the error rate of the best classifying hyperplanes and can thus be used as the stopping criterion (Joachims, 2006).
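The following short Python sketch illustrates the CCCP linearization used in problem (20): with the current iterate fixed, the indicators $z_{ip}^{(t)}$ are frozen, so the non-convex constraint of (17) becomes linear in the weight vectors. Only the construction and evaluation of that linearized constraint is shown; passing it, together with the class balance constraints, to an off-the-shelf QP solver is assumed to happen elsewhere and is not part of this sketch.

# Sketch of the CCCP linearization step behind problem (20); illustration only.
import numpy as np

def frozen_indicators(W_t, X):
    """z[i, p] = prod_{q != p} I(w_p^T x_i > w_q^T x_i), evaluated at W_t
    (ties broken arbitrarily)."""
    scores = X @ W_t.T                         # (n, k)
    z = np.zeros_like(scores)
    z[np.arange(X.shape[0]), scores.argmax(axis=1)] = 1.0
    return z

def linearized_constraint_value(W, W_t, X, C, xi):
    """Left-hand side of the linearized constraint in problem (20) for one
    working constraint C (k x n); values <= 0 mean the constraint holds."""
    n = X.shape[0]
    z_t = frozen_indicators(W_t, X)            # frozen at the current iterate
    scores = X @ W.T                           # (n, k), depends on the variable W
    ce = C.sum(axis=0)                         # c_i^T e for each sample i
    term_cur = (C.T * scores).sum()            # sum_i sum_p c_ip w_p^T x_i
    term_lin = (ce * ((scores + C.T) * z_t).sum(axis=1)).sum()
    return (ce.sum() + term_cur - term_lin) / n - xi

Because $z_{ip}^{(t)}$ no longer depends on $W$, the returned expression is affine in the weight vectors, which is exactly why each CCCP iteration reduces to the QP (20).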

3. Time Complexity Analysis

In this section, we will provide an analysis of the time complexity of CPM3C. For the high-dimensional (say, $d$-dimensional) sparse data commonly encountered in applications like text mining and bioinformatics, we assume each sample has only $s\ll d$ non-zero features, i.e., $s$ measures the sparsity; for non-sparse data, by simply setting $s=d$, all our theorems still hold.

Theorem 5. Each iteration of CPM3C takes time $O(snk)$ for a constant working set size $|\Omega|$.

Moreover, for the binary clustering scenario, we have the following theorem.

Theorem 6. For any $\epsilon>0$, $\beta>0$, and any dataset $\mathcal{X}=\{x_1,\ldots,x_n\}$ with samples belonging to two different classes, the CPM3C algorithm terminates after adding at most $R/\epsilon^2$ constraints, where $R$ is a constant independent of $n$ and $s$.

It is true that the number of constraints can potentially explode for small values of $\epsilon$; however, experience with CPM3C shows that relatively large values of $\epsilon$ are sufficient without loss of clustering accuracy. Since the number of iterations in CPM3C (with $k=2$) is bounded by $R/\epsilon^2$, a constant independent of $n$ and $s$, and each iteration of the algorithm takes time $O(snk)$ ($O(sn)$ for the binary clustering scenario), we arrive at the following theorem.

Theorem 7. For any dataset $\mathcal{X}=\{x_1,\ldots,x_n\}$ with $n$ samples belonging to 2 classes and sparsity $s$, and any fixed value of $\beta>0$ and $\epsilon>0$, the CPM3C algorithm takes time $O(sn)$ to converge.

For the multiclass scenario, the experimental results in Section 4 also demonstrate that the computational time of CPM3C scales roughly linearly with the dataset size $n$.

4. Experiments

In this section, we will validate the accuracy and efficiency of the CPM3C algorithm on several real-world datasets. Moreover, we will also analyze the scaling behavior of CPM3C with the dataset size and the sensitivity of CPM3C to $\epsilon$, both in accuracy and efficiency. All the experiments are performed with MATLAB 7.0 on a 1.66GHz Intel Core 2 Duo PC running Windows XP with 1.5GB main memory.

4.1. Datasets

We use eight datasets in our experiments, which are selected to cover a wide range of properties: Digits, Letter and Satellite from the UCI repository, MNIST (http://yann.lecun.com/exdb/mnist/), 20 newsgroup (http://people.csail.mit.edu/jrennie/20Newsgroups/), WebKB (http://www.cs.cmu.edu/~WebKB/), Cora (McCallum et al., 2000) and RCVI (Lewis et al., 2004). In order to compare CPM3C with other MMC algorithms which can only perform binary clustering, we choose the first two classes from Letter and Satellite. For the 20 newsgroup dataset, we choose the topic rec, which contains autos, motorcycles, baseball and hockey, from the version 20-news-18828. For WebKB, we select a subset consisting of about 6000 web pages from the computer science departments of four schools (Cornell, Texas, Washington, and Wisconsin). For Cora, we select a subset containing the research papers of the subfields data structure (DS), hardware and architecture (HA), machine learning (ML), operating system (OS) and programming language (PL). For RCVI, we use the data samples with the highest four topic codes (CCAT, ECAT, GCAT and MCAT) in the "Topic Codes" hierarchy in the training set.

Table 1. Descriptions of the datasets.
Data      Size (n)  Feature (N)  Class  Sparsity
Letter    1555      16           2      98.9%
UCIDig    1797      64           10     51.07%
UCISat    2236      36           2      100%
MNIST     70000     784          10     19.14%
Cora-DS   751       6234         9      0.68%
Cora-HA   400       3989         7      1.1%
Cora-ML   1617      8329         7      0.58%
Cora-OS   1246      6737         4      0.75%
Cora-PL   1575      7949         9      0.56%
WK-CL     827       4134         7      2.32%
WK-TX     814       4029         7      1.97%
WK-WT     1166      4165         7      2.05%
WK-WC     1210      4189         7      2.16%
20-news   3970      8014         4      0.75%
RCVI      21251     47152        4      0.16%

4.2. Comparisons and Clustering Results

Besides our CPM3C algorithm, we also implement several other competitive algorithms and present their results for comparison. Specifically, we use k-means (KM) and Normalized Cut (NC) as baselines, and also compare with Maximum Margin Clustering (MMC) (Xu et al., 2004), Generalized Maximum Margin Clustering (GMC) (Valizadegan & Jin, 2007) and Iterative Support Vector Regression (SVR) (Zhang et al., 2007), which all aim at clustering data with the maximum margin hyperplane. Technically, for k-means, the cluster centers are initialized randomly. For NC, the implementation is the same as in (Shi & Malik, 2000), and the width of the Gaussian kernel is set by exhaustive search over the grid {0.1σ0, 0.2σ0, ..., σ0}, where σ0 is the range of distances between any two data points in the dataset. Moreover, for MMC and GMC, the implementations are the same as in (Xu et al., 2004; Xu & Schuurmans, 2005) and (Valizadegan & Jin, 2007) respectively. Furthermore, the implementation code for SVR is downloaded from http://www.cse.ust.hk/~twinsen; its initialization is based on k-means with randomly selected initial data centers, and the width of the Gaussian kernel is set in the same way as in NC. In the experiments, we set the number of clusters equal to the true number of classes k for all the clustering algorithms.

To assess clustering accuracy, we follow the strategy used in (Xu et al., 2004): we take a set of labeled data, remove the labels for all data samples and run the clustering algorithms, then label each of the resulting clusters with the majority class according to the original training labels, and finally measure the number of correct classifications made by each clustering. Moreover, we also calculate the Rand Index (Rand, 1971) for each clustering result (a short illustrative sketch of this evaluation protocol appears at the end of Section 4.4 below). The clustering accuracy and Rand Index results are summarized in Table 2 and Table 3 respectively, where the results for k-means and iterative SVR are averaged over 50 independent runs and '-' means the corresponding algorithm cannot handle the dataset in a reasonable time. Since GMC and iterative SVR can only handle binary clustering problems, we also provide experiments on several 2-class problems: Letter, Satellite, autos vs. motorcycles (Text-1) and baseball vs. hockey (Text-2). Moreover, for the UCI-Digits and MNIST datasets, we enumerate all 45 possible class pairs and report the average clustering results. Furthermore, as the MMC and GMC algorithms can only handle datasets with no more than a few hundred samples, we perform experiments on UCI Digits and focus on those pairs (3 vs 8, 1 vs 7, 2 vs 7, 8 vs 9, 0689 and 1279) that are difficult to differentiate. From the tables we can clearly observe that our CPM3C algorithm beats the other competitive algorithms on almost all the datasets.

Table 2. Clustering accuracy (%) comparisons.
Data      KM     NC     MMC    GMC    SVR    CPM3C
Dig 3-8   94.68  65.00  90.00  94.40  96.64  96.92
Dig 1-7   94.45  55.00  68.75  97.8   99.45  100.0
Dig 2-7   96.91  66.00  98.75  99.50  100.0  100.0
Dig 8-9   90.68  52.00  96.25  84.00  96.33  97.74
Letter    82.06  76.80  -      -      92.80  94.47
UCISat    95.93  95.79  -      -      96.82  98.48
Text-1    50.53  93.79  -      -      96.82  95.00
Text-2    50.38  91.35  -      -      93.99  96.28
UCIDig    96.38  97.57  -      -      98.18  99.38
MNIST     89.21  89.92  -      -      92.41  95.71
Dig 0689  42.23  93.13  94.83  -      -      96.63
Dig 1279  40.42  90.11  91.91  -      -      94.01
Cora-DS   28.24  36.88  -      -      -      43.75
Cora-HA   34.02  42.00  -      -      -      59.75
Cora-ML   27.08  31.05  -      -      -      45.58
Cora-OS   23.87  23.03  -      -      -      58.89
Cora-PL   33.80  33.97  -      -      -      46.83
WK-CL     55.71  61.43  -      -      -      71.95
WK-TX     45.05  35.38  -      -      -      69.29
WK-WT     53.52  32.85  -      -      -      77.96
WK-WC     49.53  33.31  -      -      -      73.88
20-news   35.27  41.89  -      -      -      70.63
RCVI      27.05  -      -      -      -      61.97

Table 3. Rand Index comparisons.
Data      KM     NC     MMC    GMC    SVR    CPM3C
Dig 3-8   0.904  0.545  0.823  0.899  0.940  0.945
Dig 1-7   0.995  0.504  0.571  0.962  0.995  1.00
Dig 2-7   0.940  0.550  0.978  0.994  1.00   1.00
Dig 8-9   0.835  0.500  0.929  0.733  0.934  0.956
Letter    0.706  0.644  -      -      0.867  0.897
UCISat    0.922  0.919  -      -      0.939  0.971
Text-1    0.500  0.884  -      -      0.939  0.905
Text-2    0.500  0.842  -      -      0.887  0.929
UCIDig    0.933  0.956  -      -      0.967  0.989
MNIST     0.808  0.818  -      -      0.860  0.921
Dig 0689  0.696  0.939  0.941  -      -      0.968
Dig 1279  0.681  0.909  0.913  -      -      0.943
Cora-DS   0.589  0.744  -      -      -      0.735
Cora-HA   0.385  0.659  -      -      -      0.692
Cora-ML   0.514  0.720  -      -      -      0.754
Cora-OS   0.518  0.522  -      -      -      0.721
Cora-PL   0.643  0.675  -      -      -      0.703
WK-CL     0.603  0.602  -      -      -      0.728
WK-TX     0.604  0.514  -      -      -      0.707
WK-WT     0.616  0.581  -      -      -      0.747
WK-WC     0.581  0.509  -      -      -      0.752
20-news   0.581  0.496  -      -      -      0.782
RCVI      0.471  -      -      -      -      0.698

4.3. Speed of CPM3C

Table 4 compares the CPU time of CPM3C with that of the other competitive algorithms. According to the table, CPM3C is at least 18 times faster than SVR and 200 times faster than GMC. As reported in (Valizadegan & Jin, 2007), GMC is about 100 times faster than MMC; hence, CPM3C is faster than MMC by about four orders of magnitude. Moreover, as the sample size increases, the CPU time of CPM3C grows much more slowly than that of iterative SVR, which indicates that CPM3C has a much better scaling property with the sample size than SVR. Finally, CPM3C also runs much faster than conventional k-means, which is a very appealing result. As for the NC method, since the calculation of the similarity matrix is very time consuming and usually takes several hours on the text datasets, we do not report its running time here.

Table 4. CPU time (seconds) comparisons.
Data      KM       GMC     SVR    CPM3C
Dig 3-8   0.51     276.16  19.72  1.10
Dig 1-7   0.54     289.53  20.49  0.95
Dig 2-7   0.50     304.81  19.69  0.75
Dig 8-9   0.49     277.26  19.41  0.85
Letter    0.08     -       2133   0.87
UCISat    0.19     -       6490   4.54
Text-1    66.09    -       930.0  19.75
Text-2    52.32    -       913.8  16.16
Dig 0689  34.28    -       -      9.66
Dig 1279  17.78    -       -      17.47
Cora-DS   839.67   -       -      35.31
Cora-HA   204.43   -       -      24.35
Cora-ML   22781    -       -      69.04
Cora-OS   47931    -       -      13.98
Cora-PL   7791.4   -       -      165.0
WK-CL     672.69   -       -      9.534
WK-TX     766.77   -       -      10.53
WK-WT     4135.2   -       -      10.67
WK-WC     1578.2   -       -      9.041
20-news   2387.8   -       -      215.6
RCVI      428770   -       -      587.9

4.4. Dataset Size n vs. Speed

In the theoretical analysis section, we stated that the computational time of CPM3C scales linearly with the number of samples. We present a numerical demonstration of this statement in Figure 1, where a log-log plot of how the computational time increases with the size of the dataset is shown. Specifically, lines in the log-log plot correspond to polynomial growth $O(n^d)$, where $d$ is the slope of the line. Figure 1 shows that the CPU time of CPM3C scales roughly as $O(n)$, which is consistent with the statement in Section 3.

Figure 1. CPU time (seconds) of CPM3C as a function of dataset size n (log-log plots; left panel: Cora & 20News, right panel: WebKB & RCVI; x-axis: Number of Samples, y-axis: CPU-Time (seconds), with an O(n) reference line).
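The evaluation protocol described in Section 4.2, majority-class relabeling accuracy (following Xu et al., 2004) and the Rand Index (Rand, 1971), can be summarized by the short Python sketch below. It is a generic re-implementation written for clarity, not the scoring code actually used to produce Tables 2 and 3.

# Illustrative sketch of the clustering evaluation protocol of Section 4.2.
import numpy as np

def majority_vote_accuracy(true_labels, cluster_ids):
    """Label every cluster with its majority true class, then score accuracy."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    correct = 0
    for c in np.unique(cluster_ids):
        members = true_labels[cluster_ids == c]
        correct += np.bincount(members).max()   # samples matching the majority class
    return correct / len(true_labels)

def rand_index(true_labels, cluster_ids):
    """Fraction of sample pairs on which the two partitions agree."""
    y = np.asarray(true_labels)
    c = np.asarray(cluster_ids)
    same_true = y[:, None] == y[None, :]
    same_clus = c[:, None] == c[None, :]
    agree = (same_true == same_clus)
    iu = np.triu_indices(len(y), k=1)            # unordered pairs only
    return agree[iu].mean()

# Example: a 4-point toy case with one clustering mistake.
print(majority_vote_accuracy([0, 0, 1, 1], [2, 2, 2, 5]),  # 0.75
      rand_index([0, 0, 1, 1], [2, 2, 2, 5]))               # 0.5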


4.5. ε vs. Accuracy & Speed

Theorem 6 states that the total number of iterations involved in CPM3C is at most $R/\epsilon^2$, which means that with a higher $\epsilon$ the algorithm might converge faster. However, as $\epsilon$ is directly related to the training loss in CPM3C, we need to determine how small $\epsilon$ should be to guarantee sufficient accuracy. We present in Figure 2 and Figure 3 how clustering accuracy and computational time scale with $\epsilon$. According to Figure 2, $\epsilon=0.01$ is small enough to guarantee clustering accuracy. The log-log plot in Figure 3 verifies that the CPU time of CPM3C decreases as $\epsilon$ increases. Moreover, the empirical scaling of roughly $O(1/\epsilon^{0.5})$ is much better than the $O(1/\epsilon^2)$ of the bound in Theorem 6.

Figure 2. Clustering accuracy of CPM3C vs. ε (left panel: Cora & 20News, right panel: WebKB & RCVI; x-axis: Epsilon, y-axis: Clustering Accuracy).

Figure 3. CPU time (seconds) of CPM3C vs. ε (left panel: Cora & 20News, right panel: WebKB & RCVI; x-axis: Epsilon, y-axis: CPU-Time (seconds), with an O(x^{-0.5}) reference line).

5. Conclusions

We propose the cutting plane multiclass maximum margin clustering (CPM3C) algorithm in this paper to cluster data samples with the maximum margin hyperplane. Preliminary theoretical analysis of the algorithm is provided, where we show that the computational time of CPM3C scales linearly with the sample size n with guaranteed accuracy. Moreover, experimental evaluations on several real-world datasets show that CPM3C performs better than existing MMC methods, both in efficiency and accuracy.

Acknowledgments

This work is supported by projects (60721003) and (60675009) of the National Natural Science Foundation of China.

References

Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2, 265–292.
Ding, C., He, X., Zha, H., Gu, M., & Simon, H. D. (2001). A min-max cut algorithm for graph partitioning and data mining. ICDM (pp. 107–114).
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. John Wiley & Sons, Inc.
Joachims, T. (2006). Training linear SVMs in linear time. SIGKDD 12.
Kelley, J. E. (1960). The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8, 703–712.
Lewis, D. D., Yang, Y., Rose, T., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. JMLR, 5, 361–397.
McCallum, A., Nigam, K., Rennie, J., & Seymore, K. (2000). Automating the construction of internet portals with machine learning. Information Retrieval Journal, 3, 127–163.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. JASA, 66, 846–850.
Schölkopf, B., Smola, A. J., & Müller, K. R. (1999). Kernel principal component analysis. Advances in kernel methods: support vector learning, 327–352.
Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. PAMI.
Smola, A. J., Vishwanathan, S., & Hofmann, T. (2005). Kernel methods for missing variables. AISTATS 10.
Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and interdependent output variables. JMLR, 6, 1453–1484.
Valizadegan, H., & Jin, R. (2007). Generalized maximum margin clustering and unsupervised kernel learning. NIPS 19 (pp. 1417–1424).
Xu, L., Neufeld, J., Larson, B., & Schuurmans, D. (2004). Maximum margin clustering. NIPS 17.
Xu, L., & Schuurmans, D. (2005). Unsupervised and semi-supervised multi-class support vector machines. AAAI.
Zhang, K., Tsang, I. W., & Kwok, J. T. (2007). Maximum margin clustering made practical. ICML 24.
Zhao, B., Wang, F., & Zhang, C. (2008). Efficient maximum margin clustering via cutting plane algorithm. SDM (pp. 751–762).

CutS3VM: A Fast Semi-Supervised SVM Algorithm

Bin Zhao, Fei Wang, Changshui Zhang
Department of Automation, Tsinghua University, Beijing, P. R. China
[email protected], [email protected], [email protected]

ABSTRACT

Semi-supervised support vector machine (S3VM) attempts to learn a decision boundary that traverses through low data density regions by maximizing the margin over labeled and unlabeled examples. Traditionally, S3VM is formulated as a non-convex integer programming problem and is thus difficult to solve. In this paper, we propose the cutting plane semi-supervised support vector machine (CutS3VM) algorithm to solve the S3VM problem. Specifically, we construct a nested sequence of successively tighter relaxations of the original S3VM problem, and each optimization problem in this sequence could be efficiently solved using the constrained concave-convex procedure (CCCP). Moreover, we prove theoretically that the CutS3VM algorithm takes time O(sn) to converge with guaranteed accuracy, where n is the total number of samples in the dataset and s is the average number of non-zero features, i.e. the sparsity. Experimental evaluations on several real-world datasets show that CutS3VM performs better than existing S3VM methods, both in efficiency and accuracy.

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning

General Terms
Algorithms, Performance

Keywords
Semi-Supervised Support Vector Machine, Cutting Plane, Constrained Concave-Convex Procedure

1. INTRODUCTION

In many practical applications of pattern classification and data mining, one often faces a lack of sufficient labeled data, since labeling often requires expensive human labor and much time. However, in many cases, a large amount of unlabeled data can be far easier to obtain. For example, in text classification, one may have easy access to a large database of documents (e.g. by crawling the web), but only a small part of them are classified by hand. Consequently, Semi-Supervised Learning (SSL) methods, which aim to learn from partially labeled data, have been proposed [5, 23]. Motivated by the success of large margin methods in supervised learning, [19] proposed a semi-supervised extension of the Support Vector Machine (SVM), which treats the unknown data labels as additional optimization variables in the standard SVM problem. Specifically, by maximizing the margin over labeled and unlabeled data, the method targets to learn a decision boundary that traverses through low data density regions while respecting labels in the input space [7]. The idea was first proposed in [19] under the name Transductive SVM, but since it learns an inductive rule defined over the entire space, we refer to this approach as Semi-Supervised SVM (S3VM) in this paper. Several attempts have been made to solve the non-convex optimization problem associated with S3VM, e.g., local combinatorial search [11], branch-and-bound algorithms [2, 6], gradient descent [7], semi-definite programming [5, 20, 21, 22], continuation techniques [4], non-differentiable methods [1], the concave-convex procedure [10, 9], and deterministic annealing [17]. However, the time complexity of these methods scales at least quadratically with the dataset size, which renders them quite time consuming on large scale real-world applications. Therefore, how to efficiently solve the S3VM problem so that it can handle large scale datasets is a very challenging research topic.

Joachims proposes in [12] a fast training method for linear SVM based on the cutting plane algorithm [13]. Although the algorithm in [12] greatly reduces the training time for linear SVM, it can only handle the supervised training case. Unlike supervised large margin methods, which are usually formulated as convex optimization problems, semi-supervised SVM involves a non-convex integer optimization problem, which is much more difficult to solve. In this paper, we apply the cutting plane algorithm to semi-supervised SVM training, and propose the cutting plane semi-supervised support vector machine algorithm, CutS3VM. Specifically, we construct a nested sequence of successively tighter relaxations [12] of the original S3VM problem, and each optimization problem in this sequence could be efficiently solved using the constrained concave-convex procedure (CCCP). Moreover, we prove theoretically that the CutS3VM algorithm takes time O(sn) to converge with guaranteed accuracy, where n is the total number of samples in the dataset and s is the average number of non-zero features, i.e. the sparsity. Our experimental evaluations on several real-world datasets show that CutS3VM converges much faster than existing S3VM methods with guaranteed accuracy, and can thus handle larger datasets efficiently.

The rest of this paper is organized as follows. The cutting plane semi-supervised SVM algorithm is presented in detail in Section 2. In Section 3, we provide theoretical analysis on the accuracy and time complexity of CutS3VM. Experimental results on several real-world datasets are provided in Section 4, followed by the conclusions in Section 5.

2. THE PROPOSED METHOD

In this section, we first introduce a slightly different formulation of the semi-supervised support vector machine that will be used throughout this paper and show that it is equivalent to the conventional S3VM formulation. Then we present the main procedure of the cutting plane semi-supervised support vector machine (CutS3VM) algorithm.

2.1 Semi-Supervised Support Vector Machine

We are given a point set $\mathcal{X}=\{x_1,\ldots,x_l,x_{l+1},\ldots,x_n\}$, where the first $l$ points in $\mathcal{X}$ are labeled as $y_i\in\{-1,+1\}$ (datasets with more than two classes can be handled with the one-versus-rest approach) and the remaining $u=n-l$ points are unlabeled. Conventionally, the semi-supervised support vector machine could be formulated as the following optimization problem:

$$\min_{y_{l+1},\ldots,y_n}\ \min_{w,b,\xi}\ \frac{1}{2}w^Tw+\frac{C_l}{n}\sum_{i=1}^{l}\xi_i+\frac{C_u}{n}\sum_{j=l+1}^{n}\xi_j \qquad (1)$$
$$\mathrm{s.t.}\ \ y_i[w^T\phi(x_i)+b]\geq 1-\xi_i,\ \forall i=1,\ldots,l;\qquad y_j[w^T\phi(x_j)+b]\geq 1-\xi_j,\ \forall j=l+1,\ldots,n$$
$$\quad\ \ \xi_i\geq 0,\ \forall i=1,\ldots,l;\qquad \xi_j\geq 0,\ \forall j=l+1,\ldots,n$$

where $\sum_{i=1}^{l}\xi_i$ and $\sum_{j=l+1}^{n}\xi_j$ are divided by $n$ to better capture how $C_l$ and $C_u$ scale with the dataset size. Moreover, $\phi(\cdot)$ maps the data samples in $\mathcal{X}$ into a high (possibly infinite) dimensional feature space, and by using the kernel trick, this mapping could be done implicitly. However, in those cases where the kernel trick cannot be applied, if we still want to use a nonlinear kernel, it is possible to compute the coordinates of each sample in the kernel PCA basis [16] according to the kernel K. More directly, as stated in [7], one can also compute the Cholesky decomposition of the kernel matrix $K=\hat{X}\hat{X}^T$ and set $\phi(x_i)=(\hat{X}_{i,1},\ldots,\hat{X}_{i,n})^T$. Furthermore, in problem (1), the losses over labeled and unlabeled samples are weighted by two parameters, $C_l$ and $C_u$, which reflect confidence in the labels and in the cluster assumption respectively. In general, $C_l$ and $C_u$ need to be set to different values for optimal generalization performance [7]. The difficulty with problem (1) lies in the fact that we have to minimize the objective function with respect to the labels $y_{l+1},\ldots,y_n$, in addition to $w$, $b$ and $\xi$.

Finally, to avoid unbalanced solutions, [7] introduces the following class balance constraint by enforcing that a certain fraction $r$ of the unlabeled data should be assigned to the positive class:

$$\frac{1}{u}\sum_{j=l+1}^{n}\big[w^T\phi(x_j)+b\big]=2r-1 \qquad (2)$$

However, as the true class ratio for the unlabeled data is unknown, $r$ can be estimated from the class ratio on the labeled set, or from prior knowledge of the classification problem. For a given $r$, an easy way to enforce the class balance constraint (2) is to translate all the points such that $\sum_{j=l+1}^{n}\phi(x_j)=0$. Then, by fixing $b=2r-1$, we have an unconstrained optimization problem on $w$ [7]. We assume that the $\phi(x_i)$ are translated and $b$ is fixed in this manner for the rest of this paper.

2.2 Cutting Plane Algorithm

In this section, we will reformulate problem (1) to reduce the number of variables. Specifically,

Theorem 1. Problem (1) is equivalent to

$$\min_{w,\xi}\ \frac{1}{2}w^Tw+\frac{C_l}{n}\sum_{i=1}^{l}\xi_i+\frac{C_u}{n}\sum_{j=l+1}^{n}\xi_j \qquad (3)$$
$$\mathrm{s.t.}\ \ y_i[w^T\phi(x_i)+b]\geq 1-\xi_i,\ \forall i=1,\ldots,l;\qquad |w^T\phi(x_j)+b|\geq 1-\xi_j,\ \forall j=l+1,\ldots,n$$
$$\quad\ \ \xi_i\geq 0,\ \forall i=1,\ldots,l;\qquad \xi_j\geq 0,\ \forall j=l+1,\ldots,n$$

where the labels $y_j$, $j=l+1,\ldots,n$, are calculated as $y_j=\mathrm{sign}(w^T\phi(x_j)+b)$.

For simplicity, unless noted otherwise, the index $i$ runs over $1,\ldots,l$, i.e., the labeled samples, and the index $j$ runs over $l+1,\ldots,n$, i.e., the unlabeled samples. The proof of the above theorem is simple and we omit it due to lack of space. By reformulating problem (1) as problem (3), the number of variables involved in the S3VM problem is reduced by $u$, but there are still $n$ slack variables in problem (3). To further reduce the number of variables involved in the optimization problem, we have the following theorem.

Theorem 2. Problem (3) can be equivalently formulated as

$$\min_{w,\xi\geq 0}\ \frac{1}{2}w^Tw+\xi \qquad (4)$$
$$\mathrm{s.t.}\ \ \frac{1}{n}\Big\{C_l\sum_{i=1}^{l}c_i y_i[w^T\phi(x_i)+b]+C_u\sum_{j=l+1}^{n}c_j|w^T\phi(x_j)+b|\Big\}\geq\frac{1}{n}\Big\{C_l\sum_{i=1}^{l}c_i+C_u\sum_{j=l+1}^{n}c_j\Big\}-\xi,\quad\forall c\in\{0,1\}^n$$

and any solution $w$ to problem (4) is also a solution to problem (3) (and vice versa), with $\xi^*=\frac{C_l}{n}\sum_{i=1}^{l}\xi_i^*+\frac{C_u}{n}\sum_{j=l+1}^{n}\xi_j^*$.

Proof. We will show that problem (3) and problem (4) have the same objective value and an equivalent set of constraints. Specifically, we will prove that for every $w$, the smallest feasible $\frac{C_l}{n}\sum_{i=1}^{l}\xi_i+\frac{C_u}{n}\sum_{j=l+1}^{n}\xi_j$ in problem (3) and $\xi$ in problem (4) are equal. This means that, with $w$ fixed, $(w,\xi_i,\xi_j)$ and $(w,\xi)$ are optimal solutions to problems (3) and (4) respectively, and they result in the same objective function value.
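The preprocessing described in Section 2.1, estimating r from the labeled set, translating the (explicit) feature vectors so that the unlabeled ones sum to zero, and fixing b = 2r - 1, can be written as the short Python sketch below. A linear feature map is assumed purely for illustration; the paper also allows kernel-PCA or Cholesky features.

# Sketch of the class-balance preprocessing behind constraint (2).
import numpy as np

def center_and_fix_bias(X_labeled, y_labeled, X_unlabeled):
    """Returns translated labeled/unlabeled features and the fixed bias b."""
    r = np.mean(np.asarray(y_labeled) == 1)    # estimated positive-class fraction
    shift = X_unlabeled.mean(axis=0)           # so that sum_j phi(x_j) = 0
    Xl = X_labeled - shift
    Xu = X_unlabeled - shift
    b = 2.0 * r - 1.0
    return Xl, Xu, b

# With this preprocessing, constraint (2) holds automatically for every w:
# mean_j(w @ Xu[j] + b) = w @ mean_j(Xu[j]) + b = 0 + (2r - 1).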

For any given $w$, the $\xi_i$ and $\xi_j$ in problem (3) can be optimized individually, and the optimum is achieved as

$$\xi_i^{(1)}=\max\{0,\,1-y_i(w^T\phi(x_i)+b)\} \qquad (5)$$
$$\xi_j^{(1)}=\max\{0,\,1-|w^T\phi(x_j)+b|\} \qquad (6)$$

Similarly, for problem (4), the optimal $\xi$ is

$$\xi^{(2)}=\max_{c\in\{0,1\}^n}\Big\{\frac{C_l}{n}\Big[\sum_{i=1}^{l}c_i-\sum_{i=1}^{l}c_i y_i[w^T\phi(x_i)+b]\Big]+\frac{C_u}{n}\Big[\sum_{j=l+1}^{n}c_j-\sum_{j=l+1}^{n}c_j|w^T\phi(x_j)+b|\Big]\Big\} \qquad (7)$$

Since each $c_i$, $c_j$ is independent in Eq.(7), they can be optimized individually. Therefore,

$$\xi^{(2)}=\frac{C_l}{n}\sum_{i=1}^{l}\max_{c_i\in\{0,1\}}\{c_i-c_i y_i[w^T\phi(x_i)+b]\}+\frac{C_u}{n}\sum_{j=l+1}^{n}\max_{c_j\in\{0,1\}}\{c_j-c_j|w^T\phi(x_j)+b|\}$$
$$=\frac{C_l}{n}\sum_{i=1}^{l}\max\{0,\,1-y_i[w^T\phi(x_i)+b]\}+\frac{C_u}{n}\sum_{j=l+1}^{n}\max\{0,\,1-|w^T\phi(x_j)+b|\}=\frac{C_l}{n}\sum_{i=1}^{l}\xi_i^{(1)}+\frac{C_u}{n}\sum_{j=l+1}^{n}\xi_j^{(1)} \qquad (8)$$

Hence, for any $w$, the objective functions of problem (3) and problem (4) have the same value given the optimal $\xi_i$, $\xi_j$ and $\xi$. Therefore, the optima of the two optimization problems are the same. □

Although problem (4) has $2^n$ constraints, one for each possible vector $c=(c_1,\ldots,c_n)\in\{0,1\}^n$, it has only one slack variable $\xi$ that is shared across all constraints; thus, the number of variables is further reduced by $n-1$. Each constraint in this formulation corresponds to the sum of a subset of constraints from problem (3), and the vector $c$ selects the subset [12]. Putting Theorem 1 and Theorem 2 together, we could therefore solve problem (4) instead to find the same maximum margin classifying hyperplane. On the other hand, the number of constraints is increased from $n$ to $2^n$. The algorithm we propose in this paper targets to find a small subset of constraints from the whole set of constraints in problem (4) that ensures a sufficiently accurate solution. Specifically, we employ an adaptation of the cutting plane algorithm [13] to solve the S3VM training problem, where we construct a nested sequence of successively tighter relaxations of problem (4) [12]. Moreover, we will prove theoretically in Section 3 that we can always find a polynomially sized subset of constraints with which the solution of the relaxed problem fulfills all constraints from problem (4) up to a precision of $\epsilon$. That is to say, the remaining exponential number of constraints are guaranteed to be violated by no more than $\epsilon$, without the need for explicitly adding them to the optimization problem [12]. Specifically, the CutS3VM algorithm keeps a subset $\Omega$ of working constraints and computes the optimal solution to problem (4) subject to the constraints in $\Omega$. The algorithm then adds the most violated constraint in problem (4) into $\Omega$. In this way, a successively strengthening approximation of the original S3VM problem is constructed by a cutting plane that cuts off the current optimal solution from the feasible set [13]. The algorithm stops when no constraint in (4) is violated by more than $\epsilon$. Here, the feasibility of a constraint is measured by the corresponding value of $\xi$; therefore, the most violated constraint is the one that would result in the largest $\xi$. Since each constraint in problem (4) is represented by a vector $c$, we have

Theorem 3. The most violated constraint could be computed as follows:

$$c_i=\begin{cases}1&\text{if }y_i[w^T\phi(x_i)+b]<1\\0&\text{otherwise}\end{cases} \qquad (9) \qquad\qquad c_j=\begin{cases}1&\text{if }|w^T\phi(x_j)+b|<1\\0&\text{otherwise}\end{cases} \qquad (10)$$

Proof. As stated above, the most violated constraint is the one that would result in the largest $\xi$. In order to fulfill all constraints in problem (4), the slack variable $\xi^*$ could be calculated as follows:

$$\xi^*=\max_{c\in\{0,1\}^n}\Big\{\frac{C_l}{n}\Big[\sum_{i=1}^{l}c_i-\sum_{i=1}^{l}c_i y_i[w^T\phi(x_i)+b]\Big]+\frac{C_u}{n}\Big[\sum_{j=l+1}^{n}c_j-\sum_{j=l+1}^{n}c_j|w^T\phi(x_j)+b|\Big]\Big\}$$
$$=\frac{C_l}{n}\sum_{i=1}^{l}\max_{c_i\in\{0,1\}}\{c_i[1-y_i(w^T\phi(x_i)+b)]\}+\frac{C_u}{n}\sum_{j=l+1}^{n}\max_{c_j\in\{0,1\}}\{c_j[1-|w^T\phi(x_j)+b|]\} \qquad (11)$$

Therefore, the most violated constraint $c$ that results in the largest $\xi^*$ could be calculated as in Eqs. (9) and (10). □

The CutS3VM algorithm iteratively selects the most violated constraint under the current hyperplane parameter and adds it into the working constraint set $\Omega$ until no constraint violation is detected. Moreover, in problem (4) there is a direct correspondence between $\xi$ and the feasibility of the set of constraints: if a point $(w,b,\xi)$ fulfills all constraints up to precision $\epsilon$, i.e.,

$$\frac{1}{n}\Big\{C_l\sum_{i=1}^{l}c_i y_i[w^T\phi(x_i)+b]+C_u\sum_{j=l+1}^{n}c_j|w^T\phi(x_j)+b|\Big\}\geq\frac{1}{n}\Big\{C_l\sum_{i=1}^{l}c_i+C_u\sum_{j=l+1}^{n}c_j\Big\}-(\xi+\epsilon),\quad\forall c\in\{0,1\}^n \qquad (12)$$

then the point $(w,b,\xi+\epsilon)$ is feasible. Furthermore, as in the objective function of problem (4), there is a single slack variable $\xi$ that measures the training loss. Hence, we could simply select the stopping criterion as all samples satisfying inequality (12). Then, the approximation accuracy $\epsilon$ of this approximate solution is directly related to the training loss.
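The constraint selection of Theorem 3 is simple enough to state directly in code. The Python sketch below (an illustration, not the authors' implementation) assumes the decision values w^T phi(x) + b have already been computed for all points; a labeled point enters the constraint when its margin is below 1, an unlabeled point when its absolute decision value is below 1.

# Illustrative sketch of Eqs. (9)-(11): most violated constraint for CutS3VM.
import numpy as np

def most_violated_constraint(f_labeled, y_labeled, f_unlabeled, Cl, Cu):
    """f_* are decision values w^T phi(x) + b.  Returns the 0/1 constraint
    vectors (c_i, c_j) and the slack xi* of Eq. (11) they induce."""
    f_l = np.asarray(f_labeled)
    y_l = np.asarray(y_labeled)
    f_u = np.asarray(f_unlabeled)
    n = len(f_l) + len(f_u)
    c_lab = (y_l * f_l < 1.0).astype(float)         # Eq. (9)
    c_unl = (np.abs(f_u) < 1.0).astype(float)       # Eq. (10)
    xi = (Cl * np.maximum(0.0, 1.0 - y_l * f_l).sum()
          + Cu * np.maximum(0.0, 1.0 - np.abs(f_u)).sum()) / n   # Eq. (11)
    return c_lab, c_unl, xi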

Assuming the current working constraint set is $\Omega$, the S3VM problem could be formulated as the following optimization problem:

$$\min_{w,\xi\geq 0}\ \frac{1}{2}w^Tw+\xi \qquad (13)$$
$$\mathrm{s.t.}\ \ \forall c\in\Omega:\ \frac{1}{n}\Big\{C_l\sum_{i=1}^{l}c_i y_i[w^T\phi(x_i)+b]+C_u\sum_{j=l+1}^{n}c_j|w^T\phi(x_j)+b|\Big\}\geq\frac{1}{n}\Big\{C_l\sum_{i=1}^{l}c_i+C_u\sum_{j=l+1}^{n}c_j\Big\}-\xi$$

where $b=2r-1$. Before getting into the details of solving problem (13), we first present the outline of our CutS3VM algorithm in Algorithm 1.

Algorithm 1: Cutting Plane Semi-Supervised SVM
  1. Initialization. Set the values for $C_l$, $C_u$, $r$ and $\epsilon$, and set $\Omega=\emptyset$ and $b=2r-1$.
  2. Solve optimization problem (13) under the current working constraint set $\Omega$.
  3. Select the most violated constraint $c$ under the current classifying hyperplane with Eqs. (9) and (10).
  4. If the selected constraint is violated by no more than $\epsilon$, go to step 5; otherwise set $\Omega=\Omega\cup\{c\}$ and go to step 2.
  5. Output. Return the labels for the unlabeled samples as $y_j=\mathrm{sign}(w^T\phi(x_j)+b)$.

2.3 Optimization via the CCCP

In each iteration of CutS3VM, we need to solve problem (13) to obtain the optimal classifying hyperplane under the current working constraint set $\Omega$. Although the objective function in (13) is convex, the constraints are not, and this makes problem (13) difficult to solve. Fortunately, the constrained concave-convex procedure (CCCP) is designed to solve those optimization problems with a concave-convex objective function under concave-convex constraints [18]. Specifically, the objective function in problem (13) is quadratic, and the constraint, rewritten as

$$\forall c\in\Omega:\ \frac{1}{n}\Big\{C_l\sum_{i=1}^{l}c_i+C_u\sum_{j=l+1}^{n}c_j\Big\}-\frac{C_l}{n}\sum_{i=1}^{l}c_i y_i[w^T\phi(x_i)+b]-\xi-\frac{C_u}{n}\sum_{j=l+1}^{n}c_j|w^T\phi(x_j)+b|\leq 0 \qquad (14)$$

is, though non-convex, a difference between two convex functions. Hence, we can solve problem (13) with the CCCP. Notice that while $\frac{C_u}{n}\sum_{j=l+1}^{n}c_j|w^T\phi(x_j)+b|$ is convex, it is a non-smooth function of $w$. To use the CCCP, we need to replace the gradient by the subgradient [8]:

$$\partial_w\Big[\frac{C_u}{n}\sum_{j=l+1}^{n}c_j|w^T\phi(x_j)+b|\Big]\Big|_{w=w_t}=\frac{C_u}{n}\sum_{j=l+1}^{n}c_j\,\mathrm{sign}(w_t^T\phi(x_j)+b)\,\phi(x_j) \qquad (15)$$

Given an initial point $w_0$, the CCCP computes $w_{t+1}$ from $w_t$ by replacing $\frac{C_u}{n}\sum_{j=l+1}^{n}c_j|w^T\phi(x_j)+b|$ in the constraint with its first-order Taylor expansion at $w_t$, i.e.,

$$\frac{C_u}{n}\sum_{j=l+1}^{n}c_j|w_t^T\phi(x_j)+b|+\frac{C_u}{n}\sum_{j=l+1}^{n}c_j\,\mathrm{sign}(w_t^T\phi(x_j)+b)\,\phi(x_j)^T(w-w_t)=\frac{C_u}{n}\sum_{j=l+1}^{n}c_j\,\mathrm{sign}(w_t^T\phi(x_j)+b)\big[w^T\phi(x_j)+b\big] \qquad (16)$$

By substituting the above first-order Taylor expansion (16) into problem (13), we obtain the following quadratic programming (QP) problem:

$$\min_{w,\xi\geq 0}\ \frac{1}{2}w^Tw+\xi \qquad (17)$$
$$\mathrm{s.t.}\ \ \forall c\in\Omega:\ \frac{1}{n}\Big\{C_l\sum_{i=1}^{l}c_i+C_u\sum_{j=l+1}^{n}c_j\Big\}-\frac{C_l}{n}\sum_{i=1}^{l}c_i y_i[w^T\phi(x_i)+b]-\xi-\frac{C_u}{n}\sum_{j=l+1}^{n}c_j\,\mathrm{sign}(w_t^T\phi(x_j)+b)\big[w^T\phi(x_j)+b\big]\leq 0$$

and the above QP problem could be solved in polynomial time. Following the CCCP, the solution $w$ obtained from this QP problem is then used as $w_{t+1}$, and the iteration continues until convergence; Smola et al. [18] proved that the CCCP is guaranteed to converge. We will show in the theoretical analysis section that the dual problem of (17) has desirable sparseness properties. For simplicity, define the following variables:

$$\|c_k\|_1=\frac{1}{n}\Big(C_l\sum_{i=1}^{l}c_{ki}+C_u\sum_{j=l+1}^{n}c_{kj}\Big),\qquad z_{lk}=\frac{C_l}{n}\sum_{i=1}^{l}c_{ki}y_i\phi(x_i),\qquad z_{uk}=\frac{C_u}{n}\sum_{j=l+1}^{n}c_{kj}\,\mathrm{sign}(w_t^T\phi(x_j)+b)\,\phi(x_j),$$
$$s_{lk}=\frac{C_l}{n}\sum_{i=1}^{l}c_{ki}y_i b,\qquad s_{uk}=\frac{C_u}{n}\sum_{j=l+1}^{n}c_{kj}\,\mathrm{sign}(w_t^T\phi(x_j)+b)\,b \qquad (18)$$

where $k$ runs over $1,\ldots,|\Omega|$. The dual problem of (17) is

$$\max_{\lambda\geq 0}\ -\frac{1}{2}\sum_{k=1}^{|\Omega|}\sum_{h=1}^{|\Omega|}\lambda_k\lambda_h(z_{lk}+z_{uk})^T(z_{lh}+z_{uh})+\sum_{k=1}^{|\Omega|}\lambda_k(\|c_k\|_1-s_{lk}-s_{uk}) \qquad (19)$$
$$\mathrm{s.t.}\ \ \sum_{k=1}^{|\Omega|}\lambda_k\leq 1$$

The above optimization problem is a QP problem with $|\Omega|$ variables, where $|\Omega|$ denotes the total number of constraints in the subset $\Omega$. Note that in successive iterations of the CutS3VM algorithm, the optimization problem (13) differs only by a single constraint. Therefore, we can employ the solution from the last iteration of the CutS3VM algorithm as the initial point for the CCCP, which greatly reduces the runtime. Putting everything together, according to the formulation of the CCCP [18], we solve problem (13) with Algorithm 2, where we set the stopping criterion in the CCCP as the difference between two iterations being less than $\alpha\%$, with $\alpha\%=0.01$. This means the current objective function is larger than $1-\alpha\%$ of the objective function in the last iteration, since the CCCP decreases the objective function monotonically.

Algorithm 2: Solve problem (13) using the CCCP
  1. Initialize $(w_0,b_0)$ with the output of the last iteration of the CutS3VM algorithm; if this is the first iteration of the CutS3VM algorithm, initialize $(w_0,b_0)$ with random values.
  2. Find $(w_{t+1},b_{t+1})$ as the solution to the quadratic programming problem (17).
  3. If the convergence criterion is satisfied, return $(w_t,b_t)$ as the optimal hyperplane parameter; otherwise set $t=t+1$ and go to step 2.
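The linearization carried out in each CCCP iteration (Eqs. (15)-(16)) can be made concrete with the small Python sketch below: with w_t fixed, the signs sign(w_t^T phi(x_j) + b) are frozen, so the absolute-value term in constraint (14) is replaced by an affine function of w. Only the coefficient construction is shown; solving the resulting QP (17), or its dual (19), with a standard solver is assumed to happen elsewhere.

# Sketch of the Taylor expansion (16); illustration only.
import numpy as np

def linearize_unlabeled_term(w_t, b, X_unlabeled, c_unl, Cu, n):
    """Returns (a, a0) such that expansion (16) equals a @ w + a0.
    n is the total number of samples (labeled + unlabeled), matching the
    1/n scaling used in (16)."""
    s = np.sign(X_unlabeled @ w_t + b)             # frozen signs at w_t
    a = (Cu / n) * (c_unl * s) @ X_unlabeled       # coefficient of w
    a0 = (Cu / n) * np.sum(c_unl * s) * b          # constant term
    return a, a0

Because the frozen-sign expression is affine in w, substituting it into (14) turns each working constraint into a linear inequality, which is exactly what makes (17) a QP.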

3. THEORETICAL ANALYSIS

In this section, we will provide a detailed theoretical analysis of the CutS3VM algorithm, including its correctness and time complexity. Specifically, the following theorem characterizes the accuracy of the solution computed by CutS3VM.

Theorem 4. For any dataset $\mathcal{X}=(x_1,\ldots,x_n)$ and any $\epsilon>0$, the CutS3VM algorithm returns a point $(w,b,\xi)$ for which $(w,b,\xi+\epsilon)$ is feasible in problem (4).

Proof. In step 3 of our CutS3VM algorithm, the most violated constraint $c$, which leads to the largest value of $\xi$, is selected using Eqs. (9) and (10). According to the outline of the CutS3VM algorithm, it terminates only when the newly selected constraint $c$ satisfies the following inequality:

$$\frac{1}{n}\Big\{C_l\sum_{i=1}^{l}c_i y_i(w^T\phi(x_i)+b)+C_u\sum_{j=l+1}^{n}c_j|w^T\phi(x_j)+b|\Big\}\geq\frac{1}{n}\Big\{C_l\sum_{i=1}^{l}c_i+C_u\sum_{j=l+1}^{n}c_j\Big\}-(\xi+\epsilon) \qquad (20)$$

If the above relation holds, then since the newly selected constraint is the most violated one, all other constraints satisfy the above inequality as well. Therefore, if $(w,b,\xi)$ is the solution returned by our CutS3VM algorithm, then $(w,b,\xi+\epsilon)$ will be a feasible solution to problem (4). □

Based on the above theorem, $\epsilon$ indicates how close one wants to be to the error rate of the best classifying hyperplane and can thus be used as the stopping criterion [12]. We next analyze the time complexity of CutS3VM. For the high-dimensional (say, $d$-dimensional) sparse data commonly encountered in applications like text mining, web log analysis and bioinformatics, we assume each data sample has only $s\ll d$ non-zero features, i.e., $s$ measures the sparsity; for non-sparse data, by simply setting $s=d$, all our theorems still hold.

Theorem 5. Each iteration of CutS3VM takes time $O(sn)$ for a constant working set size $|\Omega|$.

Proof. In steps 3 and 4 of the CutS3VM algorithm, we need to compute $n$ inner products between $w$ and $\phi(x_i)$. Each inner product takes time $O(s)$ when using sparse vector algebra, so in total the $n$ inner products can be computed in $O(sn)$ time. To solve the CCCP problem in step 2, we will need to solve a series of quadratic programming (QP) problems. Setting up the dual problem (19) is dominated by computing the $|\Omega|^2$ elements $(z_{lk}+z_{uk})^T(z_{lh}+z_{uh})$ involved in the sum $\sum_{k=1}^{|\Omega|}\sum_{h=1}^{|\Omega|}\lambda_k\lambda_h(z_{lk}+z_{uk})^T(z_{lh}+z_{uh})$, and this can be done in time $O(|\Omega|^2 sn)$ after first computing $z_{lk},z_{uk}$, $k=1,\ldots,|\Omega|$. Since the number of variables involved in the QP problem (19) is $|\Omega|$ and (19) can be solved in polynomial time, the time required for solving the dual problem is independent of $n$ and $s$. Therefore, each iteration in the CCCP takes time $O(|\Omega|^2 sn)$. Moreover, in our numerical analyses we observed that in each round of the CutS3VM algorithm, fewer than 10 CCCP iterations are required for solving problem (13), even for large scale datasets, and the number of iterations required is independent of $n$ and $s$. Therefore, the time complexity for each iteration of our CutS3VM algorithm is $O(sn)$, which scales linearly with $n$ and $s$. □

Define $\tilde{w}=[w,b]$ and $\tilde{\phi}(x_i)=[\phi(x_i),1]$; then $\tilde{w}^T\tilde{\phi}(x_i)=w^T\phi(x_i)+b$. With $b$ fixed at $2r-1$, $\arg\min_{\tilde{w}}\frac{1}{2}\tilde{w}^T\tilde{w}=\arg\min_{w}\frac{1}{2}w^Tw$. Therefore, problem (4) could be equivalently formulated as

$$\min_{\tilde{w},\xi\geq 0}\ \frac{1}{2}\tilde{w}^T\tilde{w}+\xi \qquad (21)$$
$$\mathrm{s.t.}\ \ \frac{1}{n}\Big\{C_l\sum_{i=1}^{l}c_i y_i\tilde{w}^T\tilde{\phi}(x_i)+C_u\sum_{j=l+1}^{n}c_j|\tilde{w}^T\tilde{\phi}(x_j)|\Big\}\geq\frac{1}{n}\Big\{C_l\sum_{i=1}^{l}c_i+C_u\sum_{j=l+1}^{n}c_j\Big\}-\xi,\quad\forall c\in\Omega$$

Theorem 6. For any $\epsilon>0$, $C_l>0$, $C_u>0$, $l\geq 0$, and any dataset $\mathcal{X}=\{x_1,\ldots,x_n\}$, the CutS3VM algorithm terminates after adding at most $\frac{(\rho C_l+(1-\rho)C_u)R}{\epsilon^2}$ constraints, where $R$ is a constant independent of $n$ and $s$.

Proof. Note that $w=0$, $\xi=\frac{lC_l+uC_u}{n}=\rho C_l+(1-\rho)C_u$ is a feasible solution to problem (4), where $\rho=\frac{l}{n}$ denotes the fraction of labeled samples in the whole dataset $\mathcal{X}$. Therefore, the objective value of the solution of (4) is upper bounded by $\rho C_l+(1-\rho)C_u$. We will prove that in each iteration of the CutS3VM algorithm, adding the most violated constraint increases the objective function by at least a constant [12]. Since the objective value of the solution is non-negative and upper bounded by $\rho C_l+(1-\rho)C_u$, the total number of iterations is then upper bounded as well.

To compute the increase brought about by adding one constraint to the working set $\Omega$, we first need to derive the dual problem of (4). The difficulty in obtaining this dual problem comes from the absolute values in the constraints. Therefore, we first replace the constraints in (4) with the following:

$$\frac{1}{n}\Big\{C_l\sum_{i=1}^{l}c_i y_i\tilde{w}^T\tilde{\phi}(x_i)+C_u\sum_{j=l+1}^{n}c_j t_j\Big\}\geq\frac{1}{n}\Big\{C_l\sum_{i=1}^{l}c_i+C_u\sum_{j=l+1}^{n}c_j\Big\}-\xi,\quad\forall c\in\Omega \qquad (22)$$
$$t_j^2\leq\tilde{w}^T\tilde{\phi}(x_j)\tilde{\phi}(x_j)^T\tilde{w},\quad\forall j\in\{l+1,\ldots,n\} \qquad (23)$$
$$t_j\geq 0,\quad\forall j\in\{l+1,\ldots,n\} \qquad (24)$$

and we define $D_j=\tilde{\phi}(x_j)\tilde{\phi}(x_j)^T$ for $j=l+1,\ldots,n$. Hence, the Lagrangian dual function can be obtained as follows:

$$L(\lambda,\gamma,\mu,\delta)=\inf_{\tilde{w},\xi,t}\Big\{\frac{1}{2}\tilde{w}^T\tilde{w}+\xi+\sum_{k=1}^{|\Omega|}\lambda_k\Big[\frac{1}{n}\Big(C_l\sum_{i=1}^{l}c_{ki}+C_u\sum_{j=l+1}^{n}c_{kj}\Big)-\xi-\frac{1}{n}\Big(C_l\sum_{i=1}^{l}c_{ki}y_i\tilde{w}^T\tilde{\phi}(x_i)+C_u\sum_{j=l+1}^{n}c_{kj}t_j\Big)\Big]-\mu\xi-\sum_{j=l+1}^{n}\delta_j t_j+\sum_{j=l+1}^{n}\gamma_j\big(t_j^2-\tilde{w}^TD_j\tilde{w}\big)\Big\} \qquad (25)$$

Setting the derivatives with respect to $\tilde{w}$, $\xi$ and $t_j$ to zero gives

$$\tilde{w}=\frac{C_l}{n}L^{-1}\sum_{k=1}^{|\Omega|}\lambda_k\sum_{i=1}^{l}c_{ki}y_i\tilde{\phi}(x_i),\qquad 1-\sum_{k=1}^{|\Omega|}\lambda_k-\mu=0,\qquad t_j=\frac{C_u}{2n\gamma_j}\sum_{k=1}^{|\Omega|}\lambda_k c_{kj}+\frac{\delta_j}{2\gamma_j},\qquad\lambda,\gamma,\mu,\delta\geq 0$$

where we define $L=I-2\sum_{j=l+1}^{n}\gamma_j D_j$. Writing $v^{(k)}=\sum_{p=1}^{k}\lambda_p\sum_{i=1}^{l}c_{pi}y_i\tilde{\phi}(x_i)$, substitution yields the dual function

$$L(\lambda,\gamma,\mu,\delta)=\frac{1}{n}\sum_{k=1}^{|\Omega|}\lambda_k\Big[C_l\sum_{i=1}^{l}c_{ki}+C_u\sum_{j=l+1}^{n}c_{kj}\Big]-\sum_{j=l+1}^{n}\frac{\big(\sum_{k=1}^{|\Omega|}\lambda_k c_{kj}C_u+n\delta_j\big)^2}{4n^2\gamma_j}-\frac{C_l^2}{2n^2}\,v^{(|\Omega|)T}L^{-1}v^{(|\Omega|)} \qquad (26)$$

The CutS3VM algorithm selects the most violated constraint $c'$ and continues if the following inequality holds:

$$\frac{1}{n}\Big\{C_l\sum_{i=1}^{l}c'_i+C_u\sum_{j=l+1}^{n}c'_j\Big\}-\frac{1}{n}\Big\{C_l\sum_{i=1}^{l}c'_i y_i\tilde{w}^T\tilde{\phi}(x_i)+C_u\sum_{j=l+1}^{n}c'_j t_j^*\Big\}>\xi+\epsilon \qquad (27)$$

Since $\xi\geq 0$, the newly added constraint $c'$ satisfies

$$\frac{1}{n}\Big\{C_l\sum_{i=1}^{l}c'_i+C_u\sum_{j=l+1}^{n}c'_j\Big\}-\frac{1}{n}\Big\{C_l\sum_{i=1}^{l}c'_i y_i\tilde{w}^T\tilde{\phi}(x_i)+C_u\sum_{j=l+1}^{n}c'_j t_j^*\Big\}>\epsilon \qquad (28)$$

Denote by $L_{k+1}(\lambda^{(k+1)},\gamma^{(k+1)},\mu^{(k+1)},\delta^{(k+1)})$ the optimal value of the Lagrangian dual function subject to $\Omega_{k+1}=\Omega_k\cup\{c'\}$. Adding a new constraint to the primal problem is equivalent to adding a new variable $\lambda_{k+1}$ to the dual problem. Writing $v'=\sum_{i=1}^{l}c'_i y_i\tilde{\phi}(x_i)$,

$$L_{k+1}(\lambda^{(k+1)},\gamma^{(k+1)},\mu^{(k+1)},\delta^{(k+1)})=\max_{\lambda,\gamma,\mu,\delta}L_{k+1}(\lambda,\gamma,\mu,\delta)\geq L_k(\lambda^{(k)},\gamma^{(k)},\mu^{(k)},\delta^{(k)})+\max_{\lambda_{k+1}\geq 0}\Big\{\frac{\lambda_{k+1}}{n}\Big[C_l\sum_{i=1}^{l}c'_i+C_u\sum_{j=l+1}^{n}c'_j\Big]-\sum_{j=l+1}^{n}\frac{\lambda_{k+1}C_u c'_j\big(C_u\sum_{p=1}^{k}\lambda_p c_{pj}+n\delta_j\big)}{2n^2\gamma_j}-\sum_{j=l+1}^{n}\frac{(C_u\lambda_{k+1}c'_j)^2}{4n^2\gamma_j}-\frac{C_l^2}{n^2}\lambda_{k+1}v^{(k)T}L^{-1}v'-\frac{C_l^2}{2n^2}\lambda_{k+1}^2\,v'^TL^{-1}v'\Big\} \qquad (29)$$

Substituting Eq.(26) into inequality (28) and using $\lambda_{k+1}\geq 0$, we have

$$\frac{\lambda_{k+1}}{n}\Big[C_l\sum_{i=1}^{l}c'_i+C_u\sum_{j=l+1}^{n}c'_j\Big]-\sum_{j=l+1}^{n}\frac{\lambda_{k+1}C_u c'_j\big(C_u\sum_{p=1}^{k}\lambda_p c_{pj}+n\delta_j\big)}{2n^2\gamma_j}-\frac{C_l^2}{n^2}\lambda_{k+1}v^{(k)T}L^{-1}v'\geq\epsilon\lambda_{k+1} \qquad (30)$$

Substituting the above inequality into (29), we obtain the lower bound

$$L_{k+1}(\lambda^{(k+1)},\gamma^{(k+1)},\mu^{(k+1)},\delta^{(k+1)})\geq L_k(\lambda^{(k)},\gamma^{(k)},\mu^{(k)},\delta^{(k)})+\max_{\lambda_{k+1}\geq 0}\Big\{\epsilon\lambda_{k+1}-\sum_{j=l+1}^{n}\frac{(C_u\lambda_{k+1}c'_j)^2}{4n^2\gamma_j}-\frac{C_l^2}{2n^2}\lambda_{k+1}^2\,v'^TL^{-1}v'\Big\} \qquad (31)$$

Maximizing the right-hand side over $\lambda_{k+1}$ shows that the increase of the dual objective is at least

$$\frac{\epsilon^2}{\sum_{j=l+1}^{n}\frac{C_u^2 c_j'^2}{\gamma_j^{(k)}n^2}+\eta},\qquad \eta=\frac{2C_l^2}{n^2}\,v'^TL^{-1}v'$$

where $\gamma_j^{(k)}$ denotes the value of $\gamma_j$ that results in the largest $L_k(\lambda,\gamma,\mu,\delta)$. Since in the semi-supervised scenario $l\ll u$ always holds, the optimal value of $\gamma_j^{(k)}$ can be approximately obtained by solving the following optimization problem:

$$(\lambda^{(k)},\gamma^{(k)},\delta^{(k)})=\arg\max_{\lambda,\gamma,\delta}\tilde{L}(\lambda,\gamma,\delta)=\arg\max_{\lambda,\gamma,\delta}\Big\{\frac{C_u}{n}\sum_{j=l+1}^{n}\sum_{p=1}^{k}\lambda_p c_{pj}-\sum_{j=l+1}^{n}\frac{\big(\sum_{p=1}^{k}\lambda_p c_{pj}C_u+n\delta_j\big)^2}{4n^2\gamma_j}\Big\}=\arg\max_{\lambda,\gamma,\delta}\sum_{j=l+1}^{n}(\gamma_j-\delta_j) \qquad (32)$$

subject to

$$2n\gamma_j=\sum_{p=1}^{k}\lambda_p c_{pj}C_u+n\delta_j \qquad (33)$$

The only constraint on $\delta_j$ is $\delta_j\geq 0$; therefore, to maximize $\sum_{j}(\gamma_j-\delta_j)$, the optimal value of $\delta_j$ is 0. Hence, by maximizing the approximate Lagrangian dual function $\tilde{L}(\lambda,\gamma,\delta)$, $\gamma^{(k)}$ could be obtained as

$$2n\gamma_j^{(k)}=\sum_{p=1}^{k}\lambda_p^{(k)}c_{pj}C_u \qquad (34)$$

Thus, $n\gamma_j$ is a constant independent of $n$. Moreover,

$$\eta=\frac{C_l^2}{n^2}\,v'^T\Big[\tfrac{1}{2}I-\sum_{j=l+1}^{n}\gamma_j D_j\Big]^{-1}v'=\frac{C_l^2}{n^2}\sum_{i,j=1}^{l}c'_i c'_j y_i y_j\,\tilde{\phi}(x_i)^T\Big[\tfrac{1}{2}I-\sum_{j=l+1}^{n}\gamma_j D_j\Big]^{-1}\tilde{\phi}(x_j)=\frac{C_l^2}{n^2}\sum_{i,j=1}^{l}c'_i c'_j y_i y_j\,\tilde{\phi}(x_i)^T\Big[\tfrac{1}{2}I-(1-\rho)E_{\gamma D}\Big]^{-1}\tilde{\phi}(x_j) \qquad (35)$$

where $E_{\gamma D}=\frac{1}{u}\sum_{j=l+1}^{n}n\gamma_j D_j$. Since we proved that $n\gamma_j$ is a constant independent of $n$, the matrix $\frac{1}{2}I-(1-\rho)E_{\gamma D}$ is also independent of $n$; therefore, $\eta$ is proportional to $\frac{l^2}{n^2}=\rho^2$. Moreover, $\sum_{j=l+1}^{n}\frac{c_j'^2}{n}$ measures the fraction of non-zero elements in the constraint vector $c'$, and therefore is a constant related only to the newly added constraint and proportional to $\frac{u}{n}=1-\rho$. Therefore, $\sum_{j=l+1}^{n}\frac{C_u^2 c_j'^2}{\gamma_j^{(k)}n^2}+\eta$ is a constant independent of $n$ and $s$, and we denote it by $Q_k$. Moreover, we define $R=\max_k\{Q_k\}$ as the maximum of $Q_k$ throughout the whole CutS3VM process. Therefore, the increase of the objective function of the Lagrangian dual problem after adding the most violated constraint $c'$ is at least $\frac{\epsilon^2}{R}$. Furthermore, denote by $G_k$ the value of the objective function in problem (4) subject to $\Omega_k$ after adding $k$ constraints. By weak duality [3], at the optimal solution $L_k(\lambda^{(k)},\gamma^{(k)},\mu^{(k)},\delta^{(k)})\leq G_k(\tilde{w}^{(k)},\xi^{(k)},t^{(k)})\leq\rho C_l+(1-\rho)C_u$. Since the Lagrangian dual function is upper bounded by $\rho C_l+(1-\rho)C_u$, the CutS3VM algorithm terminates after adding at most $\frac{(\rho C_l+(1-\rho)C_u)R}{\epsilon^2}$ constraints. □

It is true that the number of constraints can potentially explode for small values of $\epsilon$; however, experience with the CutS3VM algorithm shows that relatively large values of $\epsilon$ are sufficient without loss of accuracy. Note that formulating the objective function of problem (4) with the scaled $\frac{C_l}{n}$ and $\frac{C_u}{n}$ instead of $C_l$ and $C_u$ is essential for this theorem. Putting everything together, we arrive at the following theorem regarding the time complexity of CutS3VM.

Theorem 7. For any dataset $\mathcal{X}=\{x_1,\ldots,x_n\}$ with $n$ samples and sparsity $s$, and any fixed values of $C_l>0$, $C_u>0$, $l\geq 0$ and $\epsilon>0$, CutS3VM takes time $O(sn)$.

Proof. Theorem 6 bounds the number of iterations of our CutS3VM algorithm by the constant $\frac{(\rho C_l+(1-\rho)C_u)R}{\epsilon^2}$, which is independent of $n$ and $s$, and each iteration of the algorithm takes time $O(sn)$. Hence the CutS3VM algorithm has time complexity $O(sn)$. □

4. EXPERIMENTS

In this section, we will validate the accuracy and efficiency of CutS3VM on several real-world datasets. Moreover, we will also analyze the scaling behavior of CutS3VM with the dataset size and its sensitivity to $\epsilon$, $C_l$ and $C_u$, both in accuracy and efficiency. All the experiments are performed on a 1.66GHz Intel Core 2 Duo PC running Windows XP with 1.5GB main memory.

4.1 Datasets

We use 10 datasets in our experiments, which are selected to cover a wide range of properties: g50c, uspst, text1 and coil20 from [7], Ionosphere and Sonar from the UCI repository, 20 newsgroup, WebKB, Cora [15] and RCVI [14]. For the 20 newsgroup dataset, we choose the topic rec, which contains autos, motorcycles, baseball and hockey, from the version 20-news-18828. For WebKB, we select a subset consisting of about 6000 web pages from the computer science departments of four schools (Cornell, Texas, Washington, and Wisconsin). For Cora, we select a subset containing the research papers of the subfields data structure (DS), hardware and architecture (HA), machine learning (ML), operating system (OS) and programming language (PL). For RCVI, we use the data samples with the highest two topic codes (CCAT and GCAT) in the "Topic Codes" hierarchy in the training set.

Table 1: Descriptions of the datasets.
Data      Size   Labeled  Feature  Class  Sparsity
g50c      550    50       50       2      100%
text1     1946   50       7511     2      0.73%
uspst     2007   50       256      10     96.71%
coil20    1440   40       1024     20     65.61%
Iono      351    20       34       2      88.1%
Sonar     208    20       60       2      99.93%
Cora-DS   751    50       6234     9      0.68%
Cora-HA   400    50       3989     7      1.1%
Cora-ML   1617   50       8329     7      0.58%
Cora-OS   1246   50       6737     4      0.75%
Cora-PL   1575   50       7949     9      0.56%
WK-CL     827    50       4134     7      2.32%
WK-TX     814    50       4029     7      1.97%
WK-WT     1166   50       4165     7      2.05%
WK-WC     1210   50       4189     7      2.16%
20-news   3970   50       8014     4      0.75%
RCVI      14005  1000     47152    2      0.17%

4.2 Classification Accuracy

Besides our CutS3VM algorithm, we also implement some other competitive algorithms and present their results for comparison. Specifically, we use the conventional SVM algorithm as a baseline, and also compare with four state-of-the-art methods: TSVM-Light (TSVML) [11], Gradient Descent TSVM (LDS) [7], the Concave Convex Procedure (CCCP) [9] and Deterministic Annealing (DA) [17]. Moreover, we also report the 5-fold cross validation results (SVM-5cv in Table 2) of an SVM trained on the whole dataset using the labels of the unlabeled points. Multiclass datasets are learned with a one-versus-rest approach. For each dataset, classification accuracy averaged over 20 independent trials is reported. In each trial, the training set contains at least one labeled point for each class, and the remaining data are used as the unlabeled (test) data. Moreover, for CutS3VM, a linear kernel is used. For other

It is true that the number of constraints can potentially explode for small values of ², however, experience with the CutS3VM algorithm shows that relatively large values of ² are sufficient without loss of accuracy. Note that the objective function of problem (4) with the scaled Cnl and Cnu instead of Cl and Cu is essential for this theorem. Putting everything together, we arrive at the following theorem regarding the time complexity of CutS3VM. Theorem 7. For any dataset X = {x1 , . . . , xn } with n samples and sparsity of s, and any fixed value of Cl > 0, Cu > 0, l ≥ 0 and ² > 0, CutS3VM takes time O(sn). Proof. Since theorem 6 bounds the number of iterations u )R in our CutS3VM algorithm to a constant (ρCl +(1−ρ)C ²2 which is independent of n and s. Moreover, each iteration of the algorithm takes time O(sn). The CutS3VM algorithm has time complexity O(sn).

2 All datasets used in this paper could be found on http://binzhao02.googlepages.com/

836

Table 2: Classification accuracy(%) and parameters used in CutS3VM (last 3 columns). CCCP 95.12 81.92 94.29 82.72 78.25 68.62 41.62 42.15 51.50 51.21 44.34 68.36 73.07 67.74 70.78 60.15 89.27

DA 93.00 72.80 94.30 87.70 83.39 67.09 39.66 57.43 38.48 54.85 32.36 70.79 68.97 78.23 74.31 75.38 82.19

CutS3VM 95.92 83.90 94.67 78.93 85.80 78.19 45.79 67.14 63.37 68.98 42.16 81.47 79.45 88.26 79.14 84.74 89.18

CPU−Time (seconds)

2

10

WebKB & RCVI

3

1

2

10

WK−CL WK−TX WK−WT WK−WC RCVI O(n)

1

10

0

10 2 10

3

10 Number of Samples

3

4

10 Number of Samples

10

Figure 1: CPU-Time (seconds) of CutS3VM as a function of dataset size n. tional time increases with the size of the dataset is shown. Specifically, lines in the log-log plot correspond to polynomial growth O(nd ), where d is the slope of the line. Figure 1 shows that the CPU-time of CutS3VM scales roughly O(n), which is consistent with theorem 7.

4.5

² vs. Accuracy & Speed Theorem 6 states that the total number of iterations inu )R volved in CutS3VM is at most (ρCl +(1−ρ)C , and this ²2 Cora & 20News

WebKB & RCVI

0.95

0.8 0.9

0.7 Accuracy

CutS3VM 0.73 2.00 13.90 2.00 4.42 2.00 41.18 2.00 0.43 2.00 0.17 2.00 66.72 1.86 27.81 2.00 172.00 2.00 48.25 2.00 175.05 2.33 84.95 2.14 16.99 2.29 29.15 2.57 22.38 2.33 26.45 2.17 20.85 2.20

0.6 0.5 0.4 0.3 0.2 −4 10

Cora−DS Cora−HA Cora−ML Cora−OS Cora−PL 20News

−2

Epsilon

2

1 Cora−DS Cora−HA Cora−ML Cora−OS Cora−PL 20News

4.4 Dataset size n vs. Speed

0

10

WebKB & RCVI

10

2

10

1

10

0

10

WK−CL WK−TX WK−WT WK−WC RCVI −0.5

−0.5

O(x

O(x

)

−1

10 −4 10

Epsilon

3

10

0

−2

10

4

3

10

WK−CL WK−TX WK−WT WK−WC RCVI

10

10

10

0.8

0.7 −4 10

0

10

Cora & 20News

4

0.85

0.75

10

10

CPU−Time (seconds)

DA 9.51 365.83 77.39 2772.0 7.69 5.85 148.53 57.32 325.61 123.12 386.68 110.48 102.44 149.71 136.76 551.47 1946.1

² 0.01 0.1 0.01 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

10

10

Table 3: CPU-time (seconds) comparisons. The last column gives the average CCCP iterations involved in CutS3VM. CCCP 2.21 93.15 25.71 142.20 1.02 0.78 172.17 30.13 1104.2 605.02 864.86 88.06 93.64 137.45 144.31 342.90 1208.3

Cu 1 128 270 64 64 0.5 4 2 8 4 0.25 0.125 0.0625 0.125 0.125 64 64

0

Table 3 compares the CPU-time of CutS3VM with other competitive algorithms. Moreover, as we state in section 3 that in each round of CutS3VM, less than 10 iterations are required for solving problem (13), even for large scale datasets. Therefore, we also provide the average number of CCCP iterations involved in CutS3VM. According to table 3, CutS3VM converges faster on almost all the datasets. Moreover, as the sample size increases, the CPU-time of CutS3VM grows much slower, which indicates CutS3VM has better scaling property with the sample size. Finally, less than 10 CCCP iterations are required in each round of CutS3VM on average, regardless of the dataset size.

LDS 5.72 240.48 1266.0 163.77 2.23 0.98 261.78 52.60 2360.0 924.33 1586.0 231.53 239.74 481.27 497.22 10341 -

Cl 1 32768 960 8192 1024 512 128 32 64 256 64 32 16 128 128 128 1024

Cora−DS Cora−HA Cora−ML Cora−OS Cora−PL 20News O(n)

10 2 10

4.3 Speed of CutS3VM

TSVML 5.77 967.72 570.73 421.74 15.40 29.30 79.19 25.60 234.45 90.20 236.34 65.68 50.74 77.50 75.05 408.21 4860.8

SVM-5cv 96.36 93.78 97.17 100.0 92.86 85.71 53.74 68.99 67.78 72.62 52.38 89.73 84.80 91.68 91.82 92.05 90.02 Cora & 20News

3

10

algorithms, both RBF and linear kernels are used and the better result among the two kernels is reported. The kernel width of RBF kernel is chosen by 5-fold cross-validation on the labeled data. The margin parameters Cl and Cu are tuned via grid search. Finally, we set r in the balancing constraint to the true ratio of the positive points in the unlabeled set. Table 2 clearly shows that CutS3VM can beat other competitive algorithms on almost all the datasets.

Data g50c uspst text1 coil20 Iono Sonar Cora-DS Cora-HA Cora-ML Cora-OS Cora-PL WK-CL WK-TX WK-WT WK-WC 20-news RCVI

SVM 91.68 76.82 81.14 75.36 80.21 66.22 28.71 46.25 34.80 42.55 34.07 59.07 62.48 64.58 60.45 57.35 88.74

CPU−Time (seconds)

LDS 94.20 82.39 94.29 82.44 89.61 62.77 39.44 57.40 41.46 54.17 35.99 71.24 69.25 67.03 67.51 52.53 -

Accuracy

TSVML 93.13 73.54 92.56 73.74 75.56 65.27 32.94 50.09 53.02 57.76 34.45 73.89 73.17 79.63 71.46 70.61 86.25

CPU−Time (seconds)

Data g50c uspst text1 coil20 Iono Sonar Cora-DS Cora-HA Cora-ML Cora-OS Cora-PL WK-CL WK-TX WK-WT WK-WC 20-news RCVI

)

−1

−2

10

Epsilon

0

10

10 −4 10

−2

10

Epsilon

0

10

Figure 2: Classification accuracy and CPU-time (seconds) of CutS3VM vs. ².

In the theoretical analysis section, we state that the computational time of CutS3VM scales linearly with the number of samples. We present numerical demonstration for this statement in figure 1, where a log-log plot of how computa-

means with higher ², the algorithm might converge fast. However, as ² is directly related to the training loss, we need

837

7.

to determine how small ² should be to guarantee sufficient accuracy. We present in figure 2 how classification accuracy and computational time scale with ². According to figure 2, ² = 0.1 is small enough to guarantee classification accuracy. The log-log plot in figure 2 verifies that the CPU-time of CutS3VM decreases as ² increases. Moreover, the empirical 1 scaling of roughly O( ²0.5 ) shown in the log-log plot is much 1 better than O( ²2 ) in theorem 6.

4.6

Cl & Cu vs. Accuracy & Speed Besides ², Cl and Cu are also crucial in CutS3VM as they adjust the tradeoff between the margin and the training loss. Therefore, we study their effects on the accuracy and speed of CutS3VM and present in figure 3 and figure 4 how classification accuracy and computational time scale with Cl and Cu . Figures 3 and 4 show that the classification accuracy WebKB & 20News & RCVI

1

4

10

WebKB & 20News & RCVI

Accuracy

0.7 0.6 0.5 WK−CL WK−TX WK−WT WK−WC 20News RCVI

0.4 0.3

CPU−Time (seconds)

0.9 0.8

2

10

0

WK−CL WK−TX WK−WT WK−WC 20News RCVI O(x)

10

−2

0.2

0

10

10

2

Cl

10

0

10

2

Cl

10

Figure 3: Classification accuracy and CPU-time (seconds) of CutS3VM vs. Cl . WebKB & 20News & RCVI

0.95

4

10

WebKB & 20News & RCVI

0.9

CPU−Time (seconds)

Accuracy

0.85 0.8 0.75 0.7 0.65 0.6 0.55

WK−CL WK−TX WK−WT WK−WC 20News RCVI

2

10

0

WK−CL WK−TX WK−WT WK−WC 20News RCVI O(x)

10

−2

0

10

2

Cu

10

10

0

10

2

Cu

REFERENCES

[1] A. Astorino and A. Fuduli. Nonsmooth optimization techniques for semisupervised classification. IEEE Trans. Pattern Anal. Mach. Intell., 29(12):2135–2142, 2007. [2] K. P. Bennett and A. Demiriz. Semi-supervised support vector machines. In NIPS, pages 368–374, 1999. [3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, March 2004. [4] O. Chapelle, M. Chi, and A. Zien. A continuation method for semi-supervised svms. In ICML 23, pages 185–192, 2006. [5] O. Chapelle, B. Scholkopf, and A. Zien. Semi-Supervised Learning. MIT Press: Cambridge, MA, 2006. [6] O. Chapelle, V. Sindhwani, and S. S. Keerthi. Branch and bound for semi-supervised support vector machines. In NIPS, pages 217–224, 2006. [7] O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In AISTATS 10, 2005. [8] P. M. Cheung and J. T. Kowk. A regularization framework for multiple-instance learning. In ICML 23, 2006. [9] R. Collobert, F. Sinz, J. Weston, and L. Bottou. Large scale transductive svms. Journal of Machine Learning Research, 7:1687–1712, 2006. [10] G. Fung and O. Mangasarian. Semi-supervised support vector machines for unlabeled data classification. Optimization Methods and Software, 15:29–44, 2001. [11] T. Joachims. Transductive inference for text classification using support vector machines. In ICML 16, pages 200–209, 1999. [12] T. Joachims. Training linear svms in linear time. In SIGKDD 12, 2006. [13] J. E. Kelley. The cutting-plane method for solving convex programs. Journal of the Society for Industrial Applied Mathematics, 8:703–712, 1960. [14] D. D. Lewis, Y. Yang, T. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization research. JMLR, 5:361–397, 2004. [15] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the contruction of internet portals with machine learning. Information Retrieval Journal, 3:127–163, 2000. [16] B. Sch¨ olkopf, A. J. Smola, and K. R. M¨ uller. Kernel principal component analysis. Advances in kernel methods: support vector learning, pages 327–352, 1999. [17] V. Sindhwani, S. S. Keerthi, and O. Chapelle. Deterministic annealing for semi-supervised kernel machines. In ICML 23, pages 841–848, 2006. [18] A. J. Smola, S.V.N. Vishwanathan, and T. Hofmann. Kernel methods for missing variables. In AISTATS 10, 2005. [19] V. N. Vapnik and A. Sterin. On structural risk minimization or overall risk in a problem of pattern recognition. Automation and Remote Control, 10(3):1495–1503, 1977. [20] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In NIPS, 2004. [21] L. Xu and D. Schuurmans. Unsupervised and semi-supervised multi-class support vector machines. In AAAI, 2005. [22] Z. Xu, R. Jin, J. Zhu, I. King, and M. Lyu. Efficient convex relaxation for transductive support vector machine. In NIPS, 2007. [23] X. Zhu. Semi-supervied learning literature survey. Computer Sciences Technical Report, 1530, University of Wisconsin-Madison, 2006.

10

Figure 4: Classification accuracy and CPU-time (seconds) of CutS3VM vs. Cu . increases if we increase Cl or decrease Cu . Moreover, the computational time scales roughly linearly with Cl and Cu , which coincides with our theoretical analysis in section 3.

5. CONCLUSIONS We propose the cutting plane semi-supervised SVM algorithm in this paper, to efficiently classify data samples with the maximum margin hyperplane in the semi-supervised scenario. Detailed theoretical analysis of the algorithm is provided, where we prove that the computational time of our method scales linearly with the sample size n and sparsity s with guaranteed accuracy. Moreover, experimental evaluations on several real world datasets show that CutS3VM performs better than existing S3VM methods, both in efficiency and accuracy.

6. ACKNOWLEDGMENTS This work is supported by the National Natural Science Foundation of China, Grant No. 60675009, and the National 863 project, Grant No. 2006AA01Z121.

838

Efficient Maximum Margin Clustering via Cutting Plane Algorithm Bin Zhao∗

Fei Wang

Abstract Maximum margin clustering (MMC) is a recently proposed clustering method, which extends the theory of support vector machine to the unsupervised scenario and aims at finding the maximum margin hyperplane which separates the data from different classes. Traditionally, MMC is formulated as a non-convex integer programming problem and is thus difficult to solve. Several methods have been proposed in the literature to solve the MMC problem based on either semidefinite programming or alternative optimization. However, these methods are time demanding while handling large scale datasets and therefore unsuitable for real world applications. In this paper, we propose the cutting plane maximum margin clustering (CPMMC) algorithm, to solve the MMC problem. Specifically, we construct a nested sequence of successively tighter relaxations of the original MMC problem, and each optimization problem in this sequence could be efficiently solved using the constrained concave-convex procedure (CCCP). Moreover, we prove theoretically that the CPMMC algorithm takes time O(sn) to converge with guaranteed accuracy, where n is the total number of samples in the dataset and s is the average number of non-zero features, i.e. the sparsity. Experimental evaluations on several real world datasets show that CPMMC performs better than existing MMC methods, both in efficiency and accuracy.

1 Introduction Clustering [8] is one of the most fundamental research topics in both data mining and machine learning communities. It aims at dividing data into groups of similar objects, i.e. clusters. From a machine learning perspective, what clustering does is to learn the hidden patterns of the dataset in an unsupervised way, and these patterns are usually referred to as data concepts. Clustering plays an important role in many real world data mining applications such as scientific information retrieval and text mining, web analysis, marketing, computational biology, and many others [7]. Many clustering methods have been proposed in the literature over the decades, including k-means clustering ∗ State Key Laboratory on Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Automation, Tsinghua University, Beijing, China.

Changshui Zhang [6], mixture models[6] and spectral clustering [15, 2, 5]. More recently, Xu et al. [19] proposed maximum margin clustering (MMC), which borrows the idea from the support vector machine theory and aims at finding the maximum margin hyperplane which can separate the data from different classes in an unsupervised way. Their experimental results showed that the MMC technique often obtains more accurate results than conventional clustering methods [19]. Technically, that MMC does is just to find a way to label the samples by running an SVM implicitly, and the SVM margin obtained would be maximized over all possible labelings [19]. However, unlike supervised large margin methods which are usually formulated as convex optimization problems, maximum margin clustering is a non-convex integer optimization problem, which is much more difficult to solve. [19] and [18] made several relaxations to the original MMC problem and reformulate it as a semi-definite programming (SDP) problem. Then maximum margin clustering solution could be obtained using standard SDP solvers such as SDPA and SeDuMi. However, even with the recent advances in interior point methods, solving SDPs is still computationally very expensive [21]. Consequently, the algorithms proposed in [19] and [18] can only handle very small datasets containing several hundreds of samples. In real world applications such as scientific information retrieval and text mining, web analysis, and computational biology, the dataset usually contains a large amount of data samples. Therefore, how to efficiently solve the MMC problem to make it capable of clustering large scale dataset is a very challenging research topic. Very recently, Zhang et al. utilized alternating optimization techniques to solve the MMC problem [21], in which the maximum margin clustering result is obtained by solving a series of SVM training problems. However, there is no guarantee on how fast it can converge and the algorithm is still likely to be time demanding on large scale datasets. In this paper, we propose a cutting plane maximum margin clustering algorithm CPMMC. Specifically, we construct a nested sequence of successively tighter relaxations of the original MMC problem, and each optimization problem in this sequence could be efficiently

solved using the constrained concave-convex procedure (CCCP). Moreover, we prove theoretically that the CPMMC algorithm takes time O(sn) to converge with guaranteed accuracy, where n is the total number of samples in the dataset and s is the average number of non-zero features, i.e. the sparsity. Our experimental evaluations on several real world datasets show that CPMMC performs better than existing MMC methods, both in efficiency and accuracy. The rest of this paper is organized as follows. In section 2, we introduce some works related to this paper. The CPMMC algorithm is presented in detail in section 3. In section 4, we provide a theoretical analysis on the accuracy and time complexity of the CPMMC algorithm. Experimental results on several real-world datasets are provided in section 5, followed by the conclusions in section 6. 2 Notations and Related Works In this section we will briefly review some related works of this paper and establish the notations we will use. 2.1 Maximum Margin Clustering Maximum margin clustering (MMC) extends the theory of support vector machine (SVM) to the unsupervised scenario, where instead of finding a large margin classifier given labels on the data as in SVM, the target is to find a labeling that would result in a large margin classifier [19]. That is to say, if we subsequently run an SVM with the labeling obtained from MMC, the margin obtained would be maximal among all possible labelings. Given a point set X = {x1 , · · · , xn } and their labels y = (y1 , . . . , yn ) ∈ {−1, +1}n , SVM finds a hyperplane f (x) = wT φ(x)+b by solving the following optimization problem n

(2.1)

minw,b,ξi s.t.

X 1 T ξi w w+C 2 i=1

yi (wT φ(xi ) + b) ≥ 1 − ξi ξi ≥ 0 i = 1, . . . , n

where the data samples X are mapped into a high (possibly infinite) dimensional feature space using a possibly nonlinear function φ(x), and by using the kernel trick, this mapping could be done implicitly. Specifically, we define the kernel matrix K formed from the inner products of feature vectors, such that Kij = φ(xi )T φ(xj ), and transform the problem such that φ(x) only appears in the inner product. However, in those cases where kernel trick cannot be applied, if we still want to use a nonlinear kernel, it is possible to compute the coordinates of each sample in the kernel

PCA basis [14] according to the kernel K. More directly, as stated in [3], one can also compute the Cholesky ˆX ˆ T , and set decomposition of the kernel matrix K = X T ˆ i,1 , . . . , X ˆ i,n ) . Therefore, throughout the φ(xi ) = (X rest of this paper, we use φ(xi ) to denote the sample mapped by the kernel function φ(·). Different from SVM, where the class labels are given and the only variables are the hyperplane parameters (w, b), MMC targets to find not only the optimal hyperplane (w∗ , b∗ ), but also the optimal labeling vector y∗ [19] n

(2.2)

min

min

y∈{−1,+1}n w,b,ξi

s.t.

X 1 T w w+C ξi 2 i=1

yi (wT φ(xi ) + b) ≥ 1 − ξi ξi ≥ 0 i = 1, . . . , n

However, the above optimization problem has a trivially “optimal” solution, which is to assign all patterns to the same class, and the resultant margin will be infinite. Moreover, another unwanted solution is to separate a single outlier or a very small group of samples from the rest of the data. To alleviate these trivial solutions, Xu el al. [19] introduced a class balance constraint on y (2.3)

−l ≤ eT y ≤ l

where l ≥ 0 is a constant controlling the class imbalance and e is the all-one vector. Similar as in SVM, [19] solves problem (2.2) in its dual form which is a non-convex integer optimization problem, and to get the MMC solution in reasonable time, several relaxations are made. The first one relaxes the labeling vector y to take continuous values. The second one relaxes yyT to a positive semi-definite matrix M with diagonal elements all set to 1. The final relaxation is to set the bias term b in the classifying hyperplane to 0. After making these relaxations, the dual problem is simplified into a semi-definite programming (SDP) problem. To ensure M = yy T , a few more constraints are added into the SDP problem [19]. However, the number of parameters in the SDP problem scales quadratically with the number of samples in X [19], and by setting the bias term b to zero restrains the classifying hyperplane to cross the origin of the feature space. To alleviate these deficiencies, Valizadegan and Jin proposed the generalized maximum margin clustering (GMMC) algorithm [18], in which the number of parameters is reduced from n2 to n and the bias term b could take non-zero values. Nevertheless, MMC and GMMC both solve the maximum margin clustering problem via SDP, which could be quite time demanding when handling large

datasets. Therefore, Zhang et al. [21] proposed a simple alternating optimization technique which alternates between maximizing the margin w.r.t. the class label vector y with the classifying hyperplane fixed and maximizing the margin w.r.t. the hyperplane parameter (w, b) with y fixed, until convergence. 2.2 Concave-Convex Procedure The concaveconvex procedure (CCCP) [20] is a method for solving non-convex optimization problem whose objective function could be expressed as a difference of convex functions. It can be viewed as a special case of variational bounding [10] and related techniques including lower(upper) bound maximization(minimization) [13], surrogate functions and majorization [12]. While in [20] the authors only considered linear constraints, Smola et al. [16] proposed a generalization, the constrained concave-convex procedure (CCCP), for problems with a concave-convex objective function under concaveconvex constraints. Assume we are solving the following optimization problem [16] (2.4)

min

f0 (z) − g0 (z)

s.t.

fi (z) − gi (z) ≤ ci

z

i = 1, . . . , n

where fi and gi are real-valued convex functions on a vector space Z and ci ∈ R for all i = 1, . . . , n. Denote by T1 {f, z}(z0 ) the first order Taylor expansion of f at location z, that is T1 {f, z}(z0 ) = f (z) + ∂z f (z)(z0 − z), where ∂z f (z) is the gradient of the function f at z. For non-smooth functions, it can be easily shown that the gradient ∂z f (z) would be replaced by the subgradient [4]. Given an initial point z0 , the CCCP computes zt+1 from zt by replacing gi (z) with its first-order Taylor expansion at zt , i.e. T1 {gi , zt }(z), and setting zt+1 to the solution to the following relaxed optimization problem (2.5) min f0 (z) − T1 {g0 , zt }(z) z

s.t. fi (z) − T1 {gi , zt }(z) ≤ ci

i = 1, . . . , n

The above procedure continues until zt converges, and Smola et al. [16] proved that the CCCP is guaranteed to converge. 2.3 Notations We denote the number of samples in X by n and the number of features for each sample by N . For the high-dimensional sparse data commonly encountered in applications like text mining, web log analysis and bioinformatics, we assume each sample has only s ¿ N non-zero features, i.e., s implies the sparsity.

Maximum Margin Clustering via Cutting Plane Algorithm In this section, we first introduce a slightly different formulation of the maximum margin clustering problem that will be used throughout this paper and show that it is equivalent to the conventional MMC formulation. Then we present the main procedure of the cutting plane maximum margin clustering (CPMMC ) algorithm. 3

3.1 Cutting Plane Algorithm Conventionally, maximum margin clustering could be formulated as the following optimization problem (3.6)

min

min

y∈{−1,+1}n w,b,ξi

s.t.

n CX 1 T ξi w w+ 2 n i=1

yi (wT φ(xi ) + b) ≥ 1 − ξi ξi ≥ 0

i = 1, . . . , n

Pn

where i=1 ξi is divided by n to better capture how C scales with the dataset size. Existing MMC algorithms solve either problem (3.6) directly [21] or its dual form [19, 18]. The difficulty with problem (3.6) lies in the fact that we have to minimize the objective function w.r.t. the labeling vector y, in addition to w, b and ξi . To reduce the variables involved, we could formulate the MMC problem as min

(3.7)

w,b,ξi

s.t.

n 1 T CX ξi w w+ 2 n i=1

|wT φ(xi ) + b| ≥ 1 − ξi ξi ≥ 0 i = 1, . . . , n

where the labeling vector y is calculated as yi = sign(w T φ(xi ) + b). Moreover, we have the following theorem. Theorem 3.1. Any solution to problem (3.7) is also a solution to problem (3.6) (and vice versa). Proof. We Pnwill show that for every (w, b) the smallest feasible i=1 ξi are identical for problem (3.6) and problem (3.7), and their corresponding labeling vectors are the same. For a given (w, b), the ξi in problem (3.6) can be optimized individually. According to the constraint in problem (3.6), ξi ≥ 1 − yi (wT φ(xi ) + b) Pn As the objective is to minimize C i=1 ξi , the optimal n value for ξi is

(3.8)

(1)

ξi =min max{0, 1−yi (wTφ(xi )+b)} yi

=min{max{0, 1−wTφ(xi )−b}, max{0, 1+wTφ(xi )+b}}

(1)

and we denote the corresponding class label by yi . Similarly, for problem (3.7), the optimal value for ξi is (2)

ξi

= max{0, 1−|wTφ(xi )+b|} = max{0, min{1−wTφ(xi )−b, 1+wTφ(xi )+b}} (2)

Proof. Similar with the proof for theorem 3.1, we will show that problem (3.7) and problem (3.9) have the same objective value and an equivalent set of constraints. Specifically, we will prove Pn that for every (w, b), P the smallest feasible ξ and i=1 ξi are related by n ξ = n1 i=1 ξi . This means, with (w, b) fixed, (w, b, ξ) and (w, b, ξi ) are optimal solutions to problem (3.7) and (3.9) respectively, and they result in the same objective function value. For any given (w, b), the ξi in problem (3.7) can be optimized individually and the optimum is achieved as

and the class label is calculated as yi = sign(w T φ(xi )+ (1) (2) (1) (2) b). ξi , ξi and yi , yi are determined by the value T of w φ(xi ) + b. If we arrange (−1, 0, 1, wT φ(xi ) + b) in ascending order, there are four possible sequences (1) (2) in total, and the corresponding values for ξi , ξi and (1) (2) (1) yi , yi are shown in table 1, from which we see ξi = (2) (3.10) ξi = max{0, 1 − |wT φ(xi ) + b|} (2) (1) (2) ξi and yi = yi always hold. For simplicity, define z1 = 1 − wT φ(xi ) − b and z2 = 1 + wT φ(xi ) + b, Similarly for problem (3.9), the optimal ξ is Relation w φ(xi ) + b ≤ −1 −1 < wT φ(xi ) + b ≤ 0 0 < wT φ(xi ) + b ≤ 1 wT φ(xi ) + b ≥ 1 T

(1)

Table 1: Proof of ξi

(1)

ξi 0 z2 z1 0

(2)

= ξi

(2)

(1)

ξi 0 z2 z1 0

(2)

yi −1 −1 1 1 (1)

and yi

yi −1 −1 1 1

s.t.

n n 1X 1X ci − ci |wT φ(xi )+b|} n i=1 n i=1

Since each ci are independent in Eq.(3.11), they can be optimized individually. Therefore,

(2)

By reformulating problem (3.6) as problem (3.7), the number of variables involved in the MMC problem is reduced by n, but there are still n slack variables ξi in problem (3.7). Next we will further reduce the number of variables by reformulating problem (3.7) into the following optimization problem w,b,ξ≥0

c∈{0,1}

= yi

Therefore, the objective functions of both optimization problems are equivalent for any (w, b) with the same optimal ξi , and consequently so are their optima. Moreover, their corresponding labeling vectors y are the same. Hence, we proved that problem (3.6) is equivalent to problem (3.7). 2

(3.9) min

(3.11)ξ (3)= max n{

1 T w w+Cξ 2 ∀ c ∈ {0, 1}n : n n 1X 1X ci |wT φ(xi )+b| ≥ ci −ξ n i=1 n i=1

Although problem (3.9) has 2n constraints, one for each possible vector c = (c1 , . . . , cn ) ∈ {0, 1}n , it has only one slack variable ξ that is shared across all constraints, thus, the number of variables is further reduced by n − 1. Each constraint in this formulation corresponds to the sum of a subset of constraints from problem (3.7), and the vector c selects the subset. Problem (3.7) and problem (3.9) are equivalent in the following sense. Theorem 3.2. Any solution (w∗ , b∗ ) to problem (3.9) is also aPsolution to problem (3.7) (and vice versa), with n ξ ∗ = n1 i=1 ξi∗ .

ξ (3) =

n X i=1

=

n X

1 1 max { ci − ci |wT φ(xi ) + b|} n n

ci ∈{0,1}

max{0,

i=1

=

n

1 (1 − |wT φ(xi ) + b|)} n n

1X 1 X (2) max{0, 1 − |wT φ(xi ) + b|} = ξ n i=1 n i=1 i

Hence, for any (w, b), the objective functions for problem (3.7) and problem (3.9) have the same value given the optimal ξ and ξi . Therefore, the optima of the two optimization problems are the same. 2 The above theorem shows that it is possible to solve problem (3.9) instead to get the optimal solution to problem (3.7), i.e. to get the same soft-margin classifying hyperplane. Putting theorem 3.1 and 3.2 together, we could therefore solve problem (3.9) instead of problem (3.6) to find the same maximum margin clustering solution, with the number of variables reduced by 2n−1. Although the number of variables in problem (3.9) is greatly reduced, the number of constrains is increased from n to 2n . The algorithm we propose in this paper targets to find a small subset of constraints from the whole set of constraints in problem (3.9) that ensures a sufficiently accurate solution. Specifically, we employ an adaptation of the cutting plane algorithm[11], recently proposed in [17] for structural SVM training, to solve the maximum margin clustering problem, where we construct a nested sequence of successively tighter relaxations of problem (3.9). Moreover, we will prove

theoretically in section 4 that we can always find a polynomially sized subset of constraints, with which the solution of the relaxed problem fulfills all constraints from problem (3.9) up to a precision of ². That is to say, the remaining exponential number of constraints are guaranteed to be violated by no more than ², without the need for explicitly adding them to the optimization problem [17]. The CPMMC algorithm starts with an empty constraint subset Ω, and it computes the optimal solution to problem (3.9) subject to the constraints in Ω. The algorithm then finds the most violated constraint in problem (3.9) and adds it into the subset Ω. In this way, we construct a successive strengthening approximation of the original MMC problem by a cutting plane that cuts off the current optimal solution from the feasible set [11]. The algorithm stops when no constraint in (3.9) is violated by more than ². Here, the feasibility of a constraint is measured by the corresponding value of ξ, therefore, the most violated constraint is the one that would result in the largest ξ. Since each constraint in problem (3.9) is represented by a vector c, then we have Theorem 3.3. The most violated constraint could be computed as follows ½ 1 if |wT φ(xi )+b| < 1 (3.12) ci = 0 otherwise

Proof. As stated above, the most violated constraint is the one that would result in the largest ξ. In order to fulfill all constraints in problem (3.9), the minimum value of ξ is as follows

(3.13)ξ ∗ = max n{ c∈{0,1}

= =

n X

n n 1X 1X ci − ci |wT φ(xi )+b|} n i=1 n i=1

1 1 max { ci − ci |wT φ(xi ) + b|} n ci ∈{0,1} n i=1 n

1X max {ci (1 − |wT φ(xi ) + b|)} n i=1 ci ∈{0,1}

Therefore, the most violated constraint c that results in the largest ξ ∗ could be calculated as in Eq.(3.12). 2 The CPMMC algorithm iteratively selects the most violated constraint under the current hyperplane parameter and adds it into the working constraint set Ω until no violation of constraint is detected. Moreover, since in problem (3.9), there is a direct correspondence between ξ and the feasibility of the set of constraints. If a point (w, b, ξ) fulfills all constraints up to precision ², i.e. (3.14)

∀ c ∈ {0, 1}n : n n 1X 1X ci |wT φ(xi )+b| ≥ ci −(ξ + ²) n i=1 n i=1

then the point (w, b, ξ + ²) is feasible. Furthermore, as in the objective function of problem (3.9), there is a single slack variable ξ that measures the clustering loss. Hence, we could simply select the stopping criterion as all samples satisfying the inequality (3.14). Then, the approximation accuracy ² of this approximate solution is directly related to the training loss. 3.2 Enforcing the Class Balance Constraint As discussed in section 2, one has to enforce the class balance constraint to avoid trivially “optimal” solutions. However, in our algorithm, the optimal labeling vector y is calculated as yi = sign(wT φ(xi ) + b). The nonlinear function ‘sign’ greatly complicates the situation. Since the intention of restraining class balance is to exclude assigning all data samples into the same class or separating a very small group of outliers from the rest of the dataset. We can approximate the class balance constraint (2.3) originally proposed in [19] with the following similar but slightly relaxed constraint, which excludes the above stated two trivial solutions n X ¡ T ¢ −l ≤ (3.15) w φ(xi ) + b ≤ l i=1

where l ≥ 0 is a constant controlling the class imbalance. This approximation could also be viewed as an analogy to the treatment of the min-cut problem in spectral clustering [15]. Assume the current working constraint set is Ω, maximum margin clustering with the class balance constraint could be formulated as the following optimization problem 1 T w w+Cξ w,b,ξ≥0 2 n n 1X 1X ci |wTφ(xi )+b| ≥ ci −ξ s.t. ∀c ∈ Ω : n i=1 n i=1

(3.16)min

−l ≤

n X ¡ i=1

¢ wTφ(xi ) + b ≤ l

Before getting into details of solving problem (3.16), we first present the outline of our CPMMC algorithm for maximum margin clustering in table 2. 3.3 Optimization via the CCCP In each iteration of the CPMMC algorithm, we need to solve problem (3.16) to obtain the optimal classifying hyperplane under the current working constraint set Ω. Although the objective function in (3.16) is convex, the constraints are not, and this makes problem (3.16) difficult to solve. As stated in section 2, the constrained concave-convex procedure is designed to solve those optimization problems with a concave-convex objective function under

CPMMC for maximum margin clustering Initialization. Set the values for C and ², and set Ω = φ; Solve optimization problem (3.16) under the current working constraint set Ω; Select the most violated constraint c under the current classifying hyperplane with Eq.(3.12). If the selected constraint c satisfies the following inequality Pn Pn 1 1 T i=1 ci − n i=1 ci |w φ(xi ) + b| ≤ ξ + ² n goto step 5; otherwise Ω = Ω ∪ {c}, go to step 2. Output. Return the labeling vector y as yi = sign(wT φ(xi ) + b)

1. 2. 3. 4.

5.

Table 2: Outline of the CPMMC algorithm. concave-convex constraints [16]. In the following, we will show how to utilize CCCP to solve problem (3.16). By rearranging the constraints in problem (3.16), we can reformulate it as 1 T w w+Cξ 2 s.t. ξ ≥ 0 n X ¡ T ¢ w φ(xi ) + b ≤ l −l ≤

(3.17)min

w,b,ξ

order Taylor expansion at (wt , bt ), i.e. n n 1X 1X ci |wtTφ(xi )+bt |+ ci sign(wtTφ(xi )+bt ) n i=1 n i=1 £ ¤ · φ(xi )T (w−wt )+(b−bt ) n n 1X 1X ci |wtTφ(xi )+bt |− ci |wtT φ(xi )+bt | = n i=1 n i=1

(3.20)

n

£ ¤ 1X + ci sign(wtTφ(xi )+bt ) wT φ(xi )+b n i=1 n

£ ¤ 1X ci sign(wtTφ(xi )+bt ) wTφ(xi )+b n i=1

=

By substituting the above first-order Taylor expansion (3.20) into problem (3.17), we obtain the following quadratic programming (QP) problem: 1 T w w+Cξ 2 s.t. ξ ≥ 0 n X ¡ T ¢ −l ≤ w φ(xi ) + b ≤ l

(3.21)min

w,b,ξ

i=1

n n 1X 1X ci −ξ − ci n i=1 n i=1 £ ¤ ·sign(wtTφ(xi )+bt ) wTφ(xi )+b ≤ 0

∀c ∈ Ω :

i=1

∀c ∈ Ω :

n n 1X 1X ci |wTφ(xi )+b| ≥ ci −ξ n i=1 n i=1

The objective function in (3.17) is quadratic and the first two constraints are linear. Moreover, the third constraint is, though non-convex, a difference between two convex functions. Hence, we can solve problem (3.17) with the constrained Pn concave-convex procedure. Notice that while n1 i=1 ci |wT φ(xi ) + b| is convex, it is a non-smooth function of (w, b). To use the CCCP, we need to replace the gradients by the subgradients [4]:

and the above QP problem could be solved in polynomial time. Following the CCCP, the obtained solution (w, b) from this QP problem is then used as (wt+1 , bt+1 ) and the iteration continues until convergence. Moreover, we will show in the theoretical analysis section that the Wolfe dual of problem (3.21) has desirable sparseness properties. As it will be referred to later, we present the Wolfe dual here. For simplicity, we define the following three variables n

∂w

(3.18) = (3.19)

"

n

1X ci |wT φ(xi ) + b| n i=1

#¯ ¯ ¯ ¯ ¯

1 n

n X

ci sign(wtT φ(xi )

b=bt

1X cki , k = 1, . . . , |Ω| n i=1

n

w=wt

n 1X ci sign(wtT φ(xi ) + bt )φ(xi ) n i=1 " n #¯ ¯ 1X ¯ T ∂b ci |w φ(xi ) + b| ¯ ¯ n i=1

=

||ck ||1 = zk = ˆ= x

1X cki sign(wtTφ(xi )+bt )φ(xi ), k = 1, . . . , |Ω| n i=1

n X

φ(xi )

i=1

The Wolfe dual of problem (3.21) is

+ bt )

i=1

Given an initial point (w0 , b0 ), the CCCP computes (wt+1 , bt+1 ) from (wt , bt ) by replacing Pn 1 T c |w φ(xi ) + b| in the constraint with its firsti=1 i n

(3.22)max − λ≥0,µ≥0

|Ω| |Ω|

|Ω|

k=1 l=1

k=1

X 1 XX ˆ Tzk λkλl zTkzl +(µ1−µ2 ) λk x 2

|Ω| X 1 ˆ Tx ˆ −(µ1+µ2 )l+ λk||ck ||1 − (µ1 −µ2 )2 x 2 k=1

s.t.

|Ω| X

λk ≤ C

k=1 |Ω| n X λk X cki sign(wtTφ(xi )+bt ) = 0 (µ1 −µ2 )n− n i=1 k=1

Problem (3.22) is a QP problem with |Ω| + 2 variables, where |Ω| denotes the total number of constraints in the subset Ω. Note that in successive iterations of the CPMMC algorithm, the optimization problem (3.16) differs only by a single constraint. Therefore, we can employ the solution in last iteration of the CPMMC algorithm as the initial point for the CCCP, which greatly reduces the runtime. Putting everything together, according to the formulation of the CCCP [16], we solve problem (3.16) with the algorithm presented in table 3. We set the

1.

2. 3.

Solving problem (3.16) using the CCCP Initialize (w0 , b0 ) with the output of the last iteration of CPMMC algorithm, if this is the first iteration of CPMMC algorithm, initialize (w0 , b0 ) with random values. Find (wt+1 , bt+1 ) as the solution to the quadratic programming problem (3.21); If convergence criterion satisfied, return (wt , bt ) as the optimal hyperplane parameter; otherwise t = t + 1, goto step 2. Table 3: Outline of the CCCP algorithm.

stopping criterion in CCCP as the difference between two iterations less than α% and set α% = 0.01, which means the current objective function is larger than 1 − α% of the objective function in last iteration, since CCCP decreases the objective function monotonically. 4

Theoretical Analysis

In this section, we will provide a theoretical analysis of the CPMMC algorithm, including its correctness and time complexity. Specifically, the following theorem characterizes the accuracy of the solution computed by CPMMC. Theorem 4.1. For any dataset X = (x1 , . . . , xn ) and any ² > 0, if (w∗ , b∗ , ξ ∗ ) is the optimal solution to problem (3.9), then our CPMMC algorithm for maximum margin clustering returns a point (w, b, ξ) for which (w, b, ξ + ²) is feasible in problem (3.9). Moreover, the corresponding objective value is better than the one corresponds to (w∗ , b∗ , ξ ∗ ). Proof. In step 3 of our CPMMC algorithm, the most violated constraint c, which leads to the largest value of

ξ, is selected using Eq.(3.12). According to the outline of the CPMMC algorithm, it terminates only when the newly selected constraint satisfies the following inequality n

n

1X 1X ci − ci |wT φ(xi ) + b| ≤ ξ + ² n i=1 n i=1 If the above relation holds, since the newly selected constraint is the most violated one, all other constraints will satisfy the above inequality relation. Therefore, if (w, b, ξ) is the solution returned by our CPMMC algorithm, then (w, b, ξ + ²) will be a feasible solution to problem (3.9). Moreover, since the solution (w, b, ξ) is calculated with only constraints in the subset Ω instead of all constraints in problem (3.9), it holds that 1 ∗T ∗ 2 w + Cξ ∗ ≥ 12 wT w + Cξ. 2w Based on the above theorem, ² indicates how close one wants to be to the error rate of the best classifying hyperplane and can thus be used as the stopping criterion [9]. We next analyze the time complexity of CPMMC. We will mainly focus on the high-dimensional sparse data, where s ¿ N , while for low-dimensional data, by simply setting s = N , all our theorems still hold. We will first show that each iteration of the CPMMC algorithm takes polynomial time. Since the algorithm is iterative, we will next prove that the total number of constraints added into the working set Ω, i.e. the total iterations involved in the CPMMC algorithm, is upper bounded. Theorem 4.2. Each iteration of CPMMC takes time O(sn) for a constant working set size |Ω|. Proof. In steps 3 and 4 of the CPMMC algorithm, we need to compute n inner products between w and φ(xi ). Each inner product takes time O(s) when using sparse vector algebra, and totally n inner products will be computed in O(sn) time. To solve the CCCP problem in step 2, we will need to solve a series of quadratic programming (QP) problems. Setting up the dual problem (3.22) is dominated by computing the |Ω|2 elements P|Ω| P|Ω| zTk zl involved in the sum k=1 l=1 λk λl zTk zl , and this can be done in time O(|Ω|2 sn) after first computing zk , (k = 1, . . . , |Ω|). Since the number of variables involved in the QP problem (3.22) is |Ω| + 2 and (3.22) can be solved in polynomial time, the time required for solving the dual problem is then independent of n and s. Therefore, each iteration in the CCCP takes time O(|Ω|2 sn). Moreover, in numerical analyses, we observed in each round of the CPMMC algorithm, less than 10 iterations is required for solving the CCCP

problem, even for large scale dataset. Moreover, the number of iterations required is independent of n and s. Therefore, the time complexity for each iteration of our CPMMC algorithm is O(sn), which scales linearly with n and s. 2 Next we will prove an upper bound on the number of iterations in the CPMMC algorithm. For simplicity, we omit the class balance constraint in problem (3.16) and set the bias term b = 0. The theorem and proof for the problem with class balance constraint and non-zero bias term could be obtained similarly. Theorem 4.3. For any ² > 0, C > 0, and any dataset X = {x1 , . . . , xn }, the CPMMC algorithm terminates after adding at most CR ²2 constraints, where R is a constant number independent of n and s. Proof. Note that w = 0, ξ = 1 is a feasible solution to problem (3.9), therefore, the objective function of the solution of (3.9) is upper bounded by C. We will prove that in each iteration of the CPMMC algorithm, by adding the most violated constraint, the increase of the objective function is at least a constant number. Due to the fact that the objective function of the solution is non-negative and has upper bound C, the total number of iterations will be upper bounded. To compute the increase brought up by adding one constraint into the working set Ω, we will first need to present the dual problem of (3.9). The difficulty involved in obtaining this dual problem comes from the abstracts in the constraints. Therefore, we first need to replace the constraints in (3.9) with the following n

n

1X 1X c i ti ≥ ci − ξ, n i=1 n i=1

∀c ∈ Ω

∀i ∈ {1, . . . , n} t2i ≤ wT φ(xi )φ(xi )T w, ti ≥ 0, ∀i ∈ {1, . . . , n} and we define Di = φ(xi )φ(xi )T . Hence, the Lagrangian dual function can be obtained as follows (4.23) L(λ, γ, µ, δ)  " n # |Ω| 1 X 1X T = inf w w+Cξ + λk ci (1−ti )−ξ w,ξ,t2 n i=1 k=1 ) n n X X T 2 δ i ti γi (ti −w Di w)−µξ − +

i=1  |Ω| n 1 X X T T w w−w γi Di w+Cξ − = inf λk ξ −µξ w,ξ,t2 i=1 i=1

k=1

 |Ω| n  X X X 1 1 cki ti − δi ti + λk cki + γi t2i − λk n i=1 n i=1  i=1 i=1 n X

|Ω|

k=1

n X

n X

k=1

  P|Ω| |Ω| n   X ( k=1 λk cki + nδi )2 1 X = − + λ c k ki   4n2 γi n i=1

k=1

satisfying the following constraints (4.24)

I −2

n X

γi Di º 0

i=1

(4.25)

C−

|Ω| X

λk − µ = 0

k=1 |Ω| δi 1 X λk cki + 2nγi 2γi

(4.26)

ti =

(4.27)

λ, γ, µ, δ ≥ 0

k=1

The CPMMC algorithm selects the most violated constraint c0 and continues if the following inequality holds n

1X 0 c (1 − t∗i ) ≥ ξ + ² n i=1 i

(4.28)

Since ξ ≥ 0, the newly added constraint satisfies n

1X 0 c (1 − t∗i ) ≥ ² n i=1 i

(4.29)

Denote by Lk+1 (λ(k+1) , γ (k+1) , µ(k+1) , δ (k+1) ) the optimal value of the Lagrangian dual function subject to Ωk+1 = Ωk ∪ {c0 }. The addition of a new constraint to the primal problem is equivalent to adding a new variable λk+1 into the dual problem. (4.30)Lk+1 (λ(k+1) , γ (k+1) , µ(k+1) , δ (k+1) ) ( Pk n X ( p=1 λp cpi +λk+1 c0i +nδi )2 = max − λ,γ,µ,δ 4n2 γi i=1 " k #) 1 X 0 + λp cpi + λk+1 ci n p=1 ( Pk (k) n X λk+1c0i p=1λp cpi (k) (k) (k) (k) − ≥Lk (λ ,γ ,µ ,δ )+ max (k) λk+1 ≥0 2γi n2 i=1 ) (k) (λk+1 c0i )2 1 λk+1 c0i δi 0 − + λk+1 ci − (k) (k) n 2γ n 4γ n2 i

i

According to inequality (4.29) and the constraint λk+1 ≥ 0, we have " # Pk (k) n n X λk+1 c0i p=1 λp cpi λk+1 c0i δi(k) 1X λk+1 c0i −²λk+1 ≤ + (k) 2 (k) n 2γ n 2γ n i=1 i=1 i i Substituting the above inequality into (3.9), we get the lower bound of Lk+1 (λ(k+1) , γ (k+1) , µ(k+1) , δ (k+1) )

as follows (4.31) Lk+1 (λ(k+1) , γ (k+1) , µ(k+1) , δ (k+1) ) ( n 1X (k) (k) (k) (k) λk+1 c0i +²λk+1 ≥Lk (λ ,γ ,µ ,δ )+ max − λk+1 ≥0 n i=1 ) n n X (λk+1 c0i )2 X 1 − + λk+1 c0i (k) 2 n 4γ n i=1 i=1 i ) ( n X (λk+1 c0i )2 (k) (k) (k) (k) =Lk (λ ,γ ,µ ,δ )+ max ²λk+1 − (k) 2 λk+1 ≥0 i=1 4γi n ²2

=Lk (λ(k),γ (k),µ(k),δ (k) )+ Pn

(k) 2 02 i=1 (ci /γi n )

(k)

where γi is the value of γi which results in the largest Lk (λ, γ, µ, δ). By maximizing the Lagrangian dual function shown in Eq.(4.23), γ (k) could be obtained as follows (λ(k) , γ (k) , µ(k) , δ (k) ) ( Pk ) n k X ( p=1 λp cpi +nδi )2 1 X = arg max − + λp cpi λ,γ,µ,δ 4n2 γi n p=1 i=1 = arg max

λ,γ,µ,δ

n X

(γi − δi )

i=1

subject to the following equation (4.32)

2nγi =

k X

λp cpi + nδi

p=1

The only P constraint on δi is δi ≥ 0, therefore, to n maximize i=1 (γi − δi ), the optimal value for δi is 0. Hence, the following equation holds (k)

2nγi

(4.33)

=

k X

λ(k) p cpi

p=1

(k)

Thus, nγi

is a constant number independent of n. Pn (c0i )2 Moreover, i=1 n measures the fraction of non-zero elements in the constraint vector c0 , and therefore is a constant only related to the newly added constraint, Pn (c0i )2 also independent of n. Hence, i=1 (k) 2 is a conγi

n

stant number independent of n and s, and we denote it with Qk . Moreover, we define R = maxk {Qk } as the maximum of Qk through the whole CPMMC process. Therefore, the increase of the objective function of the Lagrangian dual problem after adding the 2 most violated constraint c0 is at least ²R . Furthermore, denote with Gk the value of the objective function in problem (3.9) subject to Ωk after adding k constraints. Due to the weak duality [1], at the optimal

solution Lk (λ(k) , γ (k) , µ(k) , δ (k) ) ≤ Gk (w(k) , ξ (k) , t(k) ) ≤ C. Since the Lagrangian dual function is upper bounded by C, the CPMMC algorithm terminates after adding at most CR 2 ²2 constraints. It is true that the number of constraints can potentially explode for small values of ², however, experience with CPMMC shows that relatively large values of ² are sufficient without loss of clustering accuracy. Note that the objective function of problem (3.9) with the scaled C n instead of C is essential for this theorem. Putting everything together, we arrive at the following theorem regarding the time complexity of CPMMC. Theorem 4.4. For any dataset X = {x1 , . . . , xn } with n samples and sparsity of s, and any fixed value of C > 0 and ² > 0, the CPMMC algorithm takes time O(sn). Proof. Since theorem 4.3 bounds the number of iterations in our CPMMC algorithm to a constant CR ²2 which is independent of n and s. Moreover, each iteration of the algorithm takes time O(sn). The CPMMC algorithm has time complexity O(sn). 2 5 Experiments In this section, we will validate the accuracy and efficiency of the CPMMC algorithm on several real world datasets. Specifically, we will analyze the scaling behavior of CPMMC with the sample size. Moreover, we will also study the sensitivity of CPMMC to ² and C, both in accuracy and efficiency. All the experiments are performed with MATLAB 7.0 on a 1.66GHZ Intel CoreTM 2 Duo PC running Windows XP with 1.5GB main memory. 5.1 Datasets We use seven datasets in our experiments, selected to cover a wide range of properties. Specifically, experiments are performed on a number of datasets from the UCI repository (ionosphere, digits, letter and satellite), MNIST database1 and 20newsgroup dataset2 . For the digits data, we follow the experimental setup of [21] and focus on those pairs (3 vs 8, 1 vs 7, 2 vs 7, and 8 vs 9) that are difficult to differentiate. For the letter and satellite datasets, there are multiple classes and we use their first two classes only [21]. For the 20-newsgroup dataset, we choose the topic rec which contains autos, motorcycles, baseball and hockey from the version 20-news-18828. We preprocess the data in the same manner as [22] and obtain 3970 document vectors in a 8014-dimensional space. Similar with the digits dataset, we focus on those pairs (autos vs. motorcycles (Text-1), and baseball vs. hockey 1 http://yann.lecun.com/exdb/mnist/ 2 http://people.csail.mit.edu/jrennie/20Newsgroups/

Data Ionosphere Letter UCI digits Text-1 Text-2 Satellite MNIST digits

Size (n) 351 1555 1797 1980 1989 2236 70000

Feature (N ) 34 16 64 8014 8014 36 784

Sparsity 88.1% 98.9% 51.07% 0.70% 0.79% 100% 19.14%

Table 4: Descriptions of the datasets. 5.2 Clustering Accuracy We will first study the clustering accuracy of the CPMMC algorithm. We use k-means clustering (KM) and normalized cut (NC) as baselines, and also compared with MMC [19], GMMC [18] and IterSVR [21] which all aim at clustering data with the maximum margin hyperplane. For CPMMC, a linear kernel is used, while for IterSVR, both linear kernel and Gaussian kernel are used and we report the better result. Parameters involved are tuned using grid search, unless noted otherwise. To assess clustering accuracy, we follow the strategy used in [19] where we first remove the labels for all data samples and run the clustering algorithms, then we label each of the resulting clusters with the majority class according to the original training labels, and finally measure the number of misclassifications made by each clustering [19]. The clustering accuracy results are shown in table 5, from which we clearly see that the CPMMC algorithm achieves higher accuracy than other methods on most of the datasets.

Data Digits 3-8 Digits 1-7 Digits 2-7 Digits 8-9 Ionosphere Letter Satellite Text-1 Text-2

5.4 Average Number of CCCP Iterations We state in section 4 that in each round of the CPMMC algorithm, less than 10 iterations is required for solving the CCCP problem, even for large scale datasets. Moreover, the number of iterations required is independent of

NC 0.12 0.13 0.11 0.11 0.12 2.24 5.01 6.04 5.35

GMMC 276.16 289.53 304.81 277.26 273.04 -

IterSVR 19.72 20.49 19.69 19.41 18.86 2133 6490 5844 6099

CPMMC 1.10(5.42) 0.95(6.15) 0.75(7.55) 0.85(3.77) 0.78(2.33) 0.87(4.00) 4.54(4.08) 19.75(3.80) 16.16(4.00)

n and s. We validate this statement with experimental results on various datasets in figure 1, where we show how the average number of CCCP iterations r scales with the sample size. Moreover, the last column of table 6 provides the average number of CCCP iterations in CPMMC until convergence, from which we see regardless of the size of the datasets, in each round of CPMMC algorithm, less than 10 CCCP iterations are required on average. 10

Letter Satellite Text−1 Text−2 MNIST 1vs2 MNIST 1vs7

8 6 4 2 2

10

5.3 Speed of CPMMC Table 6 compares the CPUtime of CPMMC with k-means clustering, normalized cut, GMMC and IterSVR on 9 real world clustering problems. According to table 6, CPMMC is at least 18 times faster than IterSVR and 200 times faster than GMMC. As reported in [18], GMMC is about 100 times faster than MMC. Hence, CPMMC is still faster than MMC by about four orders of magnitude. Moreover, as the sample size increases, the CPU-time of CPMMC grows much slower than that of IterSVR, which indicates CPMMC has much better scaling property with the sample size than IterSVR.

KM 0.51 0.54 0.50 0.49 0.07 0.08 0.19 66.09 52.32

Table 6: CPU-time (seconds) on the various datasets. For CPMMC, the number inside the bracket is the average number of CCCP iterations r.

CCCP Iterations

(Text-2)) that are difficult to differentiate. Furthermore, for UCI digits and MNIST datasets, we give a more through comparison by considering all 45 pairs of digits 0 − 9.

3

10 Number of Samples

4

10

Figure 1: Average number of CCCP iterations in CPMMC as a function of sample size n. 5.5 How does Computational Time Scale with the Number of Samples? In the theoretical analysis section, we state that the computational time of CPMMC scales linearly with the number of samples. We present numerical demonstration for this statement in figure 2, where a log-log plot of how computational time increases with the size of the data set is shown. Specifically, lines in the log-log plot correspond to polynomial growth O(nd ), where d is the slope of the line. Figure 2 shows that the CPU-time of CPMMC scales roughly O(n), which is consistent with theorem 4.4. 5.6 How does ² Affect the Accuracy and Speed of CPMMC ? Theorem 4.3 states that the total num-

Table 5: Clustering errors (%) on the various datasets.

Data           Size    KM            NC      MMC     GMMC   IterSVR       CPMMC
Digits 3-8     357     5.32 ± 0      35      10      5.6    3.36 ± 0      3.08
Digits 1-7     361     0.55 ± 0      45      31.25   2.2    0.55 ± 0      0.0
Digits 2-7     356     3.09 ± 0      34      1.25    0.5    0.0 ± 0       0.0
Digits 8-9     354     9.32 ± 0      48      3.75    16.0   3.67 ± 0      2.26
Ionosphere     351     32 ± 17.9     25      21.25   23.5   32.3 ± 16.6   27.64
Letter         1555    17.94 ± 0     23.2    -       -      7.2 ± 0       5.53
Satellite      2236    4.07 ± 0      4.21    -       -      3.18 ± 0      1.52
Text-1         1980    49.47 ± 0     6.21    -       -      3.18 ± 0      5.00
Text-2         1989    49.62 ± 0     8.65    -       -      6.01 ± 1.82   3.72
UCI digits     1797    3.62          2.43    -       -      1.82          0.62
MNIST digits   70000   10.79         10.08   -       -      7.59          4.29


Figure 3: Clustering results of CPMMC with various values for ε. (a) Clustering accuracy of CPMMC as a function of ε; (b) CPU-time (seconds) of CPMMC as a function of ε.




Figure 2: CPU-time (seconds) of CPMMC as a function of sample size n.

5.7 How does C Affect the Accuracy and Speed of CPMMC? Besides ε, C is also a crucial parameter in CPMMC; it adjusts the tradeoff between the margin and the clustering loss. We present in figure 4 how the clustering accuracy and the computational time scale with C. From this figure we observe that, for most of the datasets, the clustering accuracy does not change noticeably as long as C resides in a proper region, and the computational time scales linearly with C, which coincides with our theoretical analysis in section 4.


6 Conclusions We propose the cutting plane maximum margin clustering (CPMMC) algorithm in this paper, to cluster data samples with the maximum margin hyperplane. Detailed theoretical analysis of the algorithm is provided, where we prove that the computational time of CPMMC scales linearly with the sample size n and the sparsity s, with guaranteed accuracy. Moreover, experimental evaluations on several real-world datasets show that CPMMC performs better than existing MMC methods, both in efficiency and accuracy.


Figure 4: Clustering results of CPMMC with various values for C. (a) Clustering accuracy of CPMMC as a function of C; (b) CPU-time (seconds) of CPMMC as a function of C.

Acknowledgements This work is supported by the project (60675009) of the National Natural Science Foundation of China.

References [1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, March 2004. [2] P. K. Chan, D. F. Schlag, and J. Y. Zien. Spectral kway ratio-cut partitioning and clustering. IEEE Trans. Computer-Aided Design, 13:1088–1096, 1994. [3] O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005. [4] P. M. Cheung and J. T. Kowk. A regularization framework for multiple-instance learning. In Proceedings of the 23rd International Conference on Machine Learning, 2006. [5] C. Ding, X. He, H. Zha, M. Gu, and H. D. Simon. A min-max cut algorithm for graph partitioning and data mining. In Proceedings of the 1st International Conference on Data Mining, pages 107–114, 2001. [6] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, Inc., 2001. [7] J. Han and M. Kamber. Data Mining. Morgan Kaufmann Publishers, 2001. [8] A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ., 1988. [9] T. Joachims. Training linear svms in linear time. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 2006. [10] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999.

[11] J. E. Kelley. The cutting-plane method for solving convex programs. Journal of the Society for Industrial Applied Mathematics, 8:703–712, 1960. [12] K. Lange, D.R. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. Jounal of Computational and Graphical Statistics, 9:1–59, 2000. [13] S. P. Luttrell. Partitioned mixture distributions: An adaptive bayesian network for low-level image processing. In IEEE Proceedings on Vision, Image and Signal Processing, volume 141, pages 251–260, 1994. [14] B. Sch¨ olkopf, A. J. Smola, and K. R. M¨ uller. Kernel principal component analysis. Advances in kernel methods: support vector learning, pages 327–352, 1999. [15] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2000. [16] A. J. Smola, S.V.N. Vishwanathan, and T. Hofmann. Kernel methods for missing variables. In Proceedins of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005. [17] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453–1484, 2005. [18] H. Valizadegan and R. Jin. Generalized maximum margin clustering and unsupervised kernel learning. In Advances in Neural Information Processing Systems 19, pages 1417–1424, 2007. [19] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In Advances in Neural Information Processing Systems, 2004. [20] A. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15:915–936, 2003. [21] K. Zhang, I. W. Tsang, and J. T. Kowk. Maximum margin clustering made practical. In Proceedings of the 24th International Conference on Machine Learning, 2007. [22] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Sch¨ olkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems, 2003.

Maximum Margin Embedding Bin Zhao, Fei Wang, Changshui Zhang State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Automation, Tsinghua University, Beijing, China Abstract We propose a new dimensionality reduction method called Maximum Margin Embedding (MME), which aims to project data samples into the most discriminative subspace, where clusters are most well-separated. Specifically, MME projects input patterns onto the normals of the maximum margin separating hyperplanes. As a result, MME depends only on the geometry of the optimal decision boundary and not on the distribution of those data points lying further away from this boundary. Technically, MME is formulated as an integer programming problem and we propose a cutting plane algorithm to solve it. Moreover, we prove theoretically that the computational time of MME scales linearly with the dataset size. Experimental results on both toy and real-world datasets demonstrate the effectiveness of MME.

1 Introduction

In machine learning and data mining, we are often faced with data of very high dimensionality. Directly dealing with these data is usually intractable due to the computational cost and the curse of dimensionality. In recent years, many embedding (or dimensionality reduction) methods have been proposed, among which the most well-known ones include Principal Component Analysis (PCA) [5] and manifold based embedding algorithms such as ISOMAP [10], locally linear embedding (LLE) [7], Hessian LLE (HLLE) [3], Laplacian eigenmap [1], local tangent space alignment (LTSA) [13], charting [2], conformal eigenmap [8], semi-definite embedding [11], etc. However, there are still some limitations for directly applying these afore mentioned embedding methods as they are not closely related to the following classification or clustering task. Specifically, although PCA is a popular unsupervised method which aims at extracting a subspace in which the variance of the projected data is maximized (or, equivalently, the reconstruction error is minimized), by its

very nature PCA is not defined for classification tasks. That is to say, after the projection computed by PCA, the Bayes error rate may become arbitrarily bad even though the original input patterns are perfectly classifiable. Moreover, as the majority of manifold based embedding algorithms adopt a similar framework as PCA where certain form of reconstruction loss is minimized, they also suffer from the same problem as PCA. Different from previously proposed embedding methods which find the optimal subspace by minimizing some form of average loss or cost, our goal in this paper is to directly find the most discriminative subspace, where clusters are most well-separated. We propose maximum margin embedding (MME) to achieve this goal by finding the maximum margin separating hyperplanes which separate data points from different classes with the maximum margin and projecting input patterns onto the normal of these hyperplanes. These projections will later result in the maximal margin separation of data points from different classes. Moreover, MME is insensitive to the actual probability distribution of patterns lying further away from the separating hyperplanes. Furthermore, MME is robust to noise as it is insensitive to small perturbations of patterns lying further away from the decision boundary. Finally, the large margin criterion implemented in MME enables the good generalization on future data. We formulate maximum margin embedding (MME) as an integer programming problem and propose a cutting plane algorithm to solve it. Specifically, we construct a nested sequence of successively tighter relaxations of the original MME problem, and each optimization problem in this sequence could be efficiently solved using the constrained concave-convex procedure (CCCP). Moreover, we prove theoretically that the cutting plane algorithm takes time O(snd) to converge with guaranteed accuracy, where n is the total number of samples in the dataset, s is the average number of non-zero features, i.e. the sparsity, and d is the targeted subspace dimension. Our experimental evaluations on both toy and real world datasets demonstrate that MME extracts a subspace more suitable for classification or clus-

tering. Moreover, as there are no time-consuming eigendecomposition problems involved, MME converges much faster than several state-of-the-art embedding methods. The rest of this paper is organized as follows. In section 2, we present the principles of maximum margin embedding and formulate it as an integer programming problem. The cutting plane algorithm for efficiently solving this optimization problem is presented in detail in section 3. In section 4, we provide a theoretical analysis on the time complexity of our MME algorithm. Experimental results on both toy and real-world datasets are provided in section 5, followed by the conclusions in section 6.

2 Maximum Margin Embedding

Conventionally, embedding methods target to projecting input patterns into certain subspace, where data points from different classes are scattered far away from each other. However, previous methods calculate the optimal projecting direction by minimizing some form of average loss or cost, such as the reconstruction loss. These attempts could be understood as indirectly maximizing the scattering between data points from different classes. Different from these traditional embedding algorithms, maximum margin embedding targets to directly finding the optimal separating hyperplanes for the input points and projecting the data points onto the normal vectors of these hyperplanes. This projection will later result in the maximal margin separation of data points from different classes. Motivated by the success of large margin methods in the supervised scenario, we define the optimal separating hyperplane as the one that separates data points from different classes with the maximum margin. The benefits of projecting data points onto the maximum margin separating hyperplane are threefold: (1) as the maximum margin separating hyperplanes are determined by those data points lying close to the decision boundary, MME is insensitive to the actual probability distribution of patterns lying further away from the classifying hyperplane. Therefore, when a large mass of the input patterns lies far away from the ideal classifying hyperplane, we can expect MME to win against those methods that minimize some form of average loss since they unnecessarily take into account the full distribution of the input patterns. (2) MME is robust to noise as it is insensitive to small perturbations of patterns lying further away from the separating hyperplane. (3) By projecting the input data points onto the maximum margin separating hyperplane, the obtained projection vectors are likely to provide good generalization on future data. Suppose we want to embed the input patterns into a ddimensional subspace, in order to extract “new” information unrelated to the previously obtained projecting vectors, all extracted separating hyperplanes should be orthogonal

to each other. To accomplish this, we calculate the maximum margin separating hyperplanes serially and incorporate a suitable orthogonality constraint such that the r-th (r = 2, . . . , d) separating hyperplane is orthogonal to all the previously extracted r − 1 hyperplanes. The problem now boils down to how to find the maximum margin separating hyperplanes. Different from the supervised scenario where the data labels are known, we consider the data embedding problem in the unsupervised case, where the optimal separating hyperplane is the one with the maximum margin over all possible labelings for the data points. Hence, we first need to find the labels for the input patterns such that if we subsequently run an SVM with this labeling, the margin would be maximal among all possible labelings. By projecting the input data points onto the normal of the hyperplane associated with the data labels calculated above, the projected data points from different classes can be separated maximally. Therefore, the major computation involved in maximum margin embedding is finding the maximum margin separating hyperplanes, which can be formulated as the following optimization problem:

    \min_{y,\, w,\, b,\, \xi_i \ge 0} \ \frac{1}{2} w^T w + \frac{C}{n} \sum_{i=1}^{n} \xi_i    (1)
    \text{s.t.} \ \ y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \ i = 1, \ldots, n,
               \ \ A^T w = 0,
               \ \ y \in \{-1, +1\}^n,

where \sum_{i=1}^{n} \xi_i is divided by n to better capture how C scales with the dataset size, and A^T w = 0 with A = [w_1, \ldots, w_{r-1}] constrains w to be orthogonal to all previously calculated projecting vectors. Moreover, \phi(\cdot) maps the data samples in X into a high (possibly infinite) dimensional feature space. Maximum margin embedding therefore first solves problem (1) without the orthogonality constraint and obtains (w_1, b_1). Then, supposing the hyperplanes (w_1, b_1), \ldots, (w_{r-1}, b_{r-1}) have already been calculated, the r-th hyperplane (w_r, b_r) is obtained as the solution to problem (1) with A = [w_1, \ldots, w_{r-1}]. However, the above optimization problem has a trivially "optimal" solution, which is to assign all patterns to the same class, so that the resulting margin is infinite. Another unwanted solution is to separate a single outlier or a very small group of samples from the rest of the data. To alleviate these trivial solutions, we propose the following class balance constraint:

    -l \le \sum_{i=1}^{n} \left( w^T \phi(x_i) + b \right) \le l    (2)

where l ≥ 0 is a constant controlling the class imbalance. For the multi-class scenario, MME is still formulated as in Eq.(1). This could be understood as implicitly grouping data samples into two “super-clusters” and the extracted

projecting direction is the most discriminative one for these two “super-clusters”.
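To make the roles of the pieces of problems (1) and (2) concrete, the following hedged sketch (assuming a linear kernel and illustrative variable names, not the authors' code) reads off the implied labels y_i = sign(w^T x_i + b), the slacks ξ_i, and the class balance constraint for a candidate hyperplane (w, b).

```python
# Sketch: how a candidate (w, b) relates to problem (1) and constraint (2).
import numpy as np

def inspect_hyperplane(X, w, b, l):
    f = X @ w + b                          # responses w^T x_i + b (linear kernel)
    y = np.sign(f)                         # implied cluster labels
    slacks = np.maximum(0.0, 1.0 - y * f)  # xi_i = max(0, 1 - y_i (w^T x_i + b))
    balanced = -l <= f.sum() <= l          # class balance constraint (2)
    return y, slacks, balanced

X = np.random.randn(20, 5)                 # toy data
y, slacks, balanced = inspect_hyperplane(X, w=np.ones(5), b=0.0, l=5.0)
print(slacks.mean(), balanced)
```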

3 The Proposed Method

The difficulty with problem (1) lies in the fact that we have to minimize the objective function w.r.t. the labeling vector y, in addition to w, b and ξ_i. In this section, we propose a cutting plane algorithm to solve this problem.

3.1 Cutting Plane Algorithm

Theorem 1 Maximum margin embedding can be equivalently formulated as

    \min_{w,\, b,\, \xi_i \ge 0} \ \frac{1}{2} w^T w + \frac{C}{n} \sum_{i=1}^{n} \xi_i    (3)
    \text{s.t.} \ \ |w^T \phi(x_i) + b| \ge 1 - \xi_i, \ i = 1, \ldots, n,
               \ \ -l \le \sum_{i=1}^{n} \left( w^T \phi(x_i) + b \right) \le l,
               \ \ A^T w = 0,

where the labeling vector y is calculated as y_i = \mathrm{sign}(w^T \phi(x_i) + b).

By reformulating MME as problem (3), the number of variables involved is reduced by n, but there are still n slack variables ξ_i in problem (3). To further reduce the number of variables, we have the following theorem.

Theorem 2 Problem (3) can be equivalently formulated as follows:

    \min_{w,\, b,\, \xi \ge 0} \ \frac{1}{2} w^T w + C \xi    (4)
    \text{s.t.} \ \ \forall c \in \{0,1\}^n: \ \frac{1}{n} \sum_{i=1}^{n} c_i |w^T \phi(x_i) + b| \ge \frac{1}{n} \sum_{i=1}^{n} c_i - \xi,
               \ \ -l \le \sum_{i=1}^{n} \left( w^T \phi(x_i) + b \right) \le l,
               \ \ A^T w = 0.

Any solution (w^*, b^*) to problem (4) is also a solution to problem (3) (and vice versa), with \xi^* = \frac{1}{n} \sum_{i=1}^{n} \xi_i^*.

Although problem (4) has 2^n constraints, one for each possible vector c = (c_1, \ldots, c_n) \in \{0,1\}^n, it has only one slack variable ξ that is shared across all constraints; thus, the number of variables is further reduced by n − 1. Each constraint in this formulation corresponds to the sum of a subset of constraints from problem (3), and the vector c selects the subset. However, the number of constraints in problem (4) is increased by 2^n − n. The algorithm we propose in this paper aims to find a small subset of constraints from the whole set of constraints in problem (4) that ensures a sufficiently accurate solution. Specifically, we employ an adaptation of the cutting plane algorithm [6], where we construct a nested sequence of successively tighter relaxations of problem (4). Moreover, we will prove theoretically in section 4 that we can always find a polynomially sized subset of constraints with which the solution of the relaxed problem fulfills all constraints from problem (4) up to a precision of ε [4]. That is to say, the remaining exponential number of constraints are guaranteed to be violated by no more than ε, without the need for explicitly adding them to the optimization problem. Specifically, the cutting plane algorithm keeps a subset Ω of working constraints and computes the optimal solution to problem (4) subject to the constraints in Ω. The algorithm then adds the most violated constraint in problem (4) into Ω. In this way, a successively strengthening approximation of the original MME problem is constructed by a cutting plane that cuts off the current optimal solution from the feasible set [6]. The algorithm stops when no constraint in (4) is violated by more than ε, i.e.,

    \forall c \in \{0,1\}^n: \ \frac{1}{n} \sum_{i=1}^{n} c_i |w^T \phi(x_i) + b| \ge \frac{1}{n} \sum_{i=1}^{n} c_i - (\xi + \varepsilon)    (5)

Here, the feasibility of a constraint is measured by the corresponding value of ξ, therefore, the most violated constraint is the one that would result in the largest ξ. With each constraint in problem (4) represented by a vector c, we have

Theorem 3 The most violated constraint could be computed as follows:

    c_i = 1 \ \text{if} \ |w^T \phi(x_i) + b| < 1, \quad c_i = 0 \ \text{otherwise}.    (6)
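Theorem 3 and the stopping rule (5) are straightforward to implement. The sketch below assumes a linear kernel and illustrative names, and is not the authors' implementation.

```python
# Sketch of constraint selection (Eq.(6)) and the epsilon-violation test (Eq.(5)).
import numpy as np

def most_violated_constraint(X, w, b):
    """c_i = 1 iff |w^T x_i + b| < 1 (Theorem 3)."""
    return (np.abs(X @ w + b) < 1.0).astype(float)

def violation(X, w, b, c):
    """Amount by which constraint c is violated: mean(c_i) - mean(c_i |w^T x_i + b|)."""
    f = np.abs(X @ w + b)
    return np.mean(c) - np.mean(c * f)

X = np.random.randn(50, 4)
w, b, xi, eps = np.zeros(4), 0.0, 0.0, 0.01
c = most_violated_constraint(X, w, b)
# add c to the working set Omega only if it is violated by more than epsilon
print(violation(X, w, b, c) > xi + eps)
```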

The cutting plane algorithm iteratively selects the most violated constraint under the current hyperplane parameters and adds it into the working constraint set Ω until no violation of the constraints is detected. Assuming the current working constraint set is Ω, MME can be formulated as follows:

    \min_{w,\, b,\, \xi \ge 0} \ \frac{1}{2} w^T w + C \xi    (7)
    \text{s.t.} \ \ \forall c \in \Omega: \ \frac{1}{n} \sum_{i=1}^{n} c_i |w^T \phi(x_i) + b| \ge \frac{1}{n} \sum_{i=1}^{n} c_i - \xi,
               \ \ -l \le \sum_{i=1}^{n} \left( w^T \phi(x_i) + b \right) \le l,
               \ \ A^T w = 0.

The cutting plane algorithm for MME is presented in Algorithm 1. As there are two “repeat” steps involved in Algorithm 1, for the rest of this paper we denote the outer loop as LOOP-1 and the inner loop as LOOP-2.

Algorithm 1 Cutting plane algorithm for maximum margin embedding
  Initialize the values for C, l and ε; set A = ∅, r = 0
  repeat
    Set Ω = ∅
    repeat
      Solve problem (7) for (w, b) under the current working constraint set Ω, select the most violated constraint c with Eq.(6), and set Ω = Ω ∪ {c}
    until the newly selected constraint c is violated by no more than ε
    A = A ∪ {w} and r = r + 1
  until r equals the expected subspace dimension

3.2 Optimization via the CCCP

In each iteration of LOOP-2, we need to solve problem (7) to obtain the optimal classifying hyperplane under the current working constraint set Ω. Although the objective function in (7) is convex, the constraints are not, and this makes problem (7) difficult to solve. Fortunately, the constrained concave-convex procedure (CCCP) is designed to solve optimization problems with a concave-convex objective function under concave-convex constraints [9]. Specifically, the objective function in (7) is quadratic and the last two constraints are linear. Moreover, the first constraint, as shown in Eq.(8), is, though non-convex, a difference between two convex functions:

    \forall c \in \Omega: \ \frac{1}{n} \sum_{i=1}^{n} c_i - \xi - \frac{1}{n} \sum_{i=1}^{n} c_i |w^T \phi(x_i) + b| \le 0    (8)

Hence, we can solve problem (7) with the CCCP. Given an initial point (w_0, b_0), the CCCP computes (w_{t+1}, b_{t+1}) from (w_t, b_t) by replacing \frac{1}{n} \sum_{i=1}^{n} c_i |w^T \phi(x_i) + b| in the constraint with its first-order Taylor expansion at (w_t, b_t):

    \min_{w,\, b,\, \xi \ge 0} \ \frac{1}{2} w^T w + C \xi    (9)
    \text{s.t.} \ \ \forall c \in \Omega: \ \frac{1}{n} \sum_{i=1}^{n} c_i - \xi - \frac{1}{n} \sum_{i=1}^{n} c_i \, \mathrm{sign}(w_t^T \phi(x_i) + b_t) \left[ w^T \phi(x_i) + b \right] \le 0,
               \ \ -l \le \sum_{i=1}^{n} \left( w^T \phi(x_i) + b \right) \le l,
               \ \ A^T w = 0,

and the above QP problem could be solved in polynomial time. Following the CCCP, the obtained solution (w, b) from this QP problem is then used as (wt+1 , bt+1 ) and the iteration continues until convergence. Putting everything together, according to the formulation of the CCCP [9], we solve problem (7) with Algorithm 2.

Algorithm 2 Solve problem (7) via the constrained concave-convex procedure
  Initialize (w_0, b_0)
  repeat
    Find (w_{t+1}, b_{t+1}) as the solution to the quadratic programming problem (9)
    Set w = w_{t+1}, b = b_{t+1} and t = t + 1
  until stopping criterion satisfied
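The key step inside each CCCP iteration is the linearization in Eq.(9). The following sketch (linear kernel, illustrative names, QP solver omitted) shows how the working constraints in Ω become linear constraints on (w, b); it is a sketch under these assumptions, not a definitive implementation.

```python
# One CCCP linearization step: |w^T x_i + b| is replaced by
# sign(w_t^T x_i + b_t) * (w^T x_i + b), turning each constraint in Omega
# into a linear constraint a . [w; b] >= r - xi of the resulting QP.
import numpy as np

def linearize_constraints(X, Omega, w_t, b_t):
    s = np.sign(X @ w_t + b_t)                 # signs at the current iterate (w_t, b_t)
    n = X.shape[0]
    rows, rhs = [], []
    for c in Omega:                            # c is a 0/1 vector selecting samples
        coef = (c * s) / n                     # per-sample coefficients c_i s_i / n
        rows.append(np.concatenate([coef @ X, [coef.sum()]]))  # [sum_i coef_i x_i, sum_i coef_i]
        rhs.append(c.mean())                   # right-hand side (1/n) sum_i c_i
    return np.array(rows), np.array(rhs)

X = np.random.randn(30, 4)
Omega = [np.ones(30), (np.random.rand(30) > 0.5).astype(float)]
A, r = linearize_constraints(X, Omega, w_t=np.random.randn(4), b_t=0.0)
print(A.shape, r)
```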

3.3 Justification of the Cutting Plane Algorithm

The following theorem characterizes the accuracy of the solution computed by the cutting plane algorithm. Theorem 4 For any dataset X = (x_1, . . . , x_n) and any ε > 0, the cutting plane algorithm for maximum margin embedding returns a point (w, b, ξ) for which (w, b, ξ + ε) is feasible in problem (4). Based on the above theorem, ε indicates how close one wants to be to the error rate of the best projecting hyperplane and can thus be used as the stopping criterion.

4 Time Complexity Analysis

In this section, we provide a theoretical analysis of the time complexity of the cutting plane algorithm. We mainly focus on high-dimensional sparse data, where s ≪ N and N is the dimension of the original feature space. For non-sparse data, by simply setting s = N, all our theorems still hold. Assuming the targeted subspace dimension is d, we first analyze the computational time involved in LOOP-2. By multiplying this time with d, we get the time complexity of our cutting plane algorithm. Specifically, we show that while calculating w_r, each iteration of the cutting plane algorithm takes polynomial time. Since the algorithm is iterative, we then prove that the total number of constraints added into the working set Ω, i.e. the total number of iterations of LOOP-2, is upper bounded.

Theorem 5 Each iteration of LOOP-2 takes time O(sn) for a constant working set size |Ω|.

Theorem 6 For any ε > 0, C > 0, l > 0 and any dataset X = {x_1, . . . , x_n}, LOOP-2 terminates after adding at most CR/ε² constraints, where R is a constant independent of n and s.

Theorem 7 For any dataset X = {x_1, . . . , x_n} with n samples and sparsity s, and any fixed values of C > 0, l > 0 and ε > 0, assuming the targeted subspace dimension is d, the cutting plane algorithm takes time O(snd).

5 Experiments

In this section, we validate the accuracy and efficiency of MME on both toy and real-world datasets. Moreover, we also analyze the scaling behavior of MME with the dataset size. All the experiments are performed on a 1.66GHz Intel Core 2 Duo PC running Windows XP with 1.5GB main memory.

5.1 Datasets

We use 11 datasets in our experiments, which are selected to cover a wide range of properties: Ionosphere, Sonar, Wine, Breast, Iris and Digits from the UCI repository, 20-newsgroup, WebKB, Cora, USPS and ORL. For 20-newsgroup, we choose the topic rec, which contains autos, motorcycles, baseball and hockey, from the version 20-news-18828. For WebKB, we select a subset consisting of about 6000 web pages from the computer science departments of four schools (Cornell, Texas, Washington, and Wisconsin). For Cora, we select a subset containing the research papers of the subfields data structure (DS), hardware and architecture (HA), machine learning (ML), operating system (OS) and programming language (PL). For USPS, we use images of digits 1, 2, 3 and 4 from the handwritten 16 × 16 digits dataset.

5.2 Visualization Capability of MME

We will first demonstrate the visualization capability of MME, where we compare with PCA, ISOMAP and LLE. Two dimensional projections on four datasets from UCI repository are shown in Figure 1, from which we observe that MME separates data samples from different classes with a larger margin, which will clearly improve the performance of the following classification or clustering algorithms.


5.3 Comparisons and Accuracy

To investigate whether MME can extract useful subspaces that preserve information necessary for the classification task, we run MME on a number of classification problems and calculate the accuracy of the Nearest Neighbor classifier on the embedded data samples. We change the number of dimensions of the estimated subspace and measure the leave-one-out classification accuracy that can be achieved by projecting data onto the extracted subspace. Besides our MME algorithm, we also implement several other competitive algorithms and present their results for comparison, i.e., PCA [5], ISOMAP [10], LLE [7] and unsupervised LDA (uLDA) [12]. For MME, a linear kernel is used, and the values of C and l are tuned using grid search. The value of ε is fixed at 0.01. For ISOMAP and LLE, a k-NN graph is used, where we select k from 1, . . . , 50 and report the highest accuracy. The results in Figure 2 clearly show the advantage of MME over other embedding methods.
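The leave-one-out nearest-neighbour accuracy used as the evaluation measure here can be computed directly on the embedded points, as in the snippet below; it is an illustrative sketch with random data, not the experimental code.

```python
# Leave-one-out 1-NN accuracy on an n x d matrix Z of embedded points with labels y.
import numpy as np

def loo_1nn_accuracy(Z, y):
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)  # pairwise distances
    np.fill_diagonal(dist, np.inf)                                # exclude the point itself
    nearest = dist.argmin(axis=1)                                 # index of nearest neighbour
    return np.mean(y[nearest] == y)

Z = np.random.randn(100, 2)                 # e.g. a 2-dimensional embedding
y = np.random.randint(0, 3, size=100)
print(loo_1nn_accuracy(Z, y))
```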


Figure 1. Dataset visualization results of (from left to right) PCA, ISOMAP, LLE and MME applied to (from top to bottom) the "digits", "breast cancer", "wine" and "iris" datasets.

5.4 Dataset Size n vs. Speed

In the theoretical analysis section, we state that the computational time of MME scales linearly with the number of samples. We present a numerical demonstration of this statement in figure 3, which shows a log-log plot of how the computational time increases with the size of the dataset. Lines in the log-log plot correspond to polynomial growth O(n^d), where d is the slope of the line. Figure 3 shows that the CPU-time of MME scales roughly as O(n), which is consistent with theorem 7.


Figure 3. CPU-time (seconds) of MME as a function of the dataset size n.

1 uLDA failed to work on USPS and the text datasets due to an out-of-memory problem.


Figure 2. Leave-one-out classification accuracy of Nearest Neighbor classifier on the embedded data samples.

6 Conclusions

We propose maximum margin embedding (MME) to project data samples into the most discriminative subspace, where clusters are most well-separated. Detailed theoretical analysis of the algorithm is provided, where we prove that the computational time of MME scales linearly with the sample size n with guaranteed accuracy. Moreover, experimental evaluations on several real-world datasets show that MME performs better than several state-of-the-art embedding methods, both in efficiency and accuracy.

Acknowledgments This work is supported by the project (60675009) of the National Natural Science Foundation of China.

References [1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003. [2] M. Brand. Charting a manifold. In NIPS, 2003. [3] D. L. Donoho and C. E. Grimes. Hessian eigenmaps: locally linear embedding techniques for highdimensional data. In Proceedings of the National Academy of Arts and Sciences, volume 100, 2003.

[4] T. Joachims. Training linear svms in linear time. In SIGKDD, 2006. [5] I. T. Jolliffe. Principal Component Analysis. SpringerVerlag, New York, 1986. [6] J. E. Kelley. The cutting-plane method for solving convex programs. Journal of the Society for Industrial Applied Mathematics, 8:703–712, 1960. [7] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000. [8] F. Sha and L. K. Saul. Analysis and extension of spectral methods for nonlinear dimensionality reduction. In ICML, 2005. [9] A. J. Smola, S. Vishwanathan, and T. Hofmann. Kernel methods for missing variables. In AISTATS, 2005. [10] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000. [11] K. Q. Weinberger, F. Sha, and L. K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In ICML, 2004. [12] J. Ye, Z. Zhao, and M. Wu. Discriminative k-means for clustering. In NIPS. 2007. [13] Z. Y. Zhang and H. Y. Zha. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM Journal on Scientific Computing, 26(1):313–338, 2004.

ACTIVE MODEL SELECTION FOR GRAPH-BASED SEMI-SUPERVISED LEARNING Bin Zhao, Fei Wang, Changshui Zhang, Yangqiu Song State Key Laboratory of Intelligent Technologies and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Automation,Tsinghua University, Beijing 100084, China ABSTRACT The recent years have witnessed a surge of interest in GraphBased Semi-Supervised Learning (GBSSL). However, despite its extensive research, there has been little work on graph construction, which is at the heart of GBSSL. In this study, we propose a novel active learning method, Active Model Selection (AMS), which aims at learning both data labels and the optimal graph by allowing the learner the flexibility to choose samples for labeling. AMS minimizes the regularization function in GBSSL by iterating between the active sample selection step and the graph reconstruction step, where the samples querying which leads to the optimal graph are selected. Experimental results on four real-world datasets are provided to demonstrate the effectiveness of AMS. Index Terms— Graph Based Semi-Supervised Learning (GBSSL), Model Selection, Active Learning, Gaussian Function, Gradient Descent 1. INTRODUCTION In many practical applications of pattern classification and data mining, one often faces a lack of sufficient labeled data, since labeling often requires expensive human labor and much time. However, in many cases, large number of unlabeled data can be far easier to obtain. For example, in text classification, one may have an easy access to a large database of documents (e.g. by crawling the web), but only a small part of them are classified by hand. Consequently, Semi-Supervised Learning (SSL) methods, which aim to learn from partially labeled data, are proposed[1]. In recent years, GBSSL has become one of the most active research areas in SSL community [2]. GBSSL uses a graph G =< V, E > to describe the structure of a dataset, where V is the node set corresponding to the labeled and unlabeled samples, and E is the edge set. In most of the traditional methods [3, 4, 5], each edge eij ∈ E is associated with a weight wij , which reflects the similarity between pairwise samples. The weight is usually computed by certain parametric function, i.e., wij = hθ (xi , xj , θ) (1)



Here, a specific choice of h_θ and related parameters θ is called a model, with which we construct the graph. The choice of the model can affect the final classification result significantly, which can be seen from the toy example shown in Fig. 1, where h_θ is fixed to the Gaussian function,

    w_{ij} = \exp\left( - \| x_i - x_j \|^2 / (2\sigma^2) \right)    (2)

and classification results with different values of variance σ are shown. However, as pointed out by [1], although at the heart of GBSSL, model selection is still a problem that has not been well studied. To address such a problem, we propose an active learning method, Active Model Selection (AMS), which aims at learning both data labels and the optimal model by allowing the learner the flexibility to choose samples for labeling. Traditionally, active learning methods aim to query samples that could decrease most the generalization error of the resulting classifier. However, since graph construction is at the heart of GBSSL, the active learning method we employ here targets to select the most informative samples for model selection. More concretely, the AMS algorithm selects samples, querying which could lead to the optimal model. The AMS algorithm first establishes an objective function composed of two parts, i.e. the smoothness and the fitness of the data labels, to measure how good the classification result of the Semi-Supervised Learning task is. Then AMS will minimize this objective function by alternating between the active sample selection step and the graph reconstruction step. Fig. 2 presents the flow charts of traditional active learning methods and AMS to show the difference.

Fig. 2. Flow charts of (a) traditional active learning methods and (b) AMS. The rest of this paper is organized as follows. In section 2, we introduce some works related to this paper. The AMS algorithm is presented in detail in section 3. In section 4,



Fig. 1. Classification results on the two-moon pattern using the method in [3], a powerful transductive approach operating on a graph with the edge weights computed by a Gaussian function. (a) toy data set with two labeled points; (b) classification results with σ = 0.15; (c) classification results with σ = 0.4. We can see that a small variation of σ will cause a dramatically different classification result.

we provide experimental results on four real-world datasets, followed by the conclusions in section 5.

2. NOTATIONS AND RELATED WORKS

In this section we introduce some notations and briefly review some related works of this paper. Given a point set X = {x_1, · · · , x_l, x_{l+1}, · · · , x_n} and a label set T = {−1, +1} (generalization to the multi-class scenario can be obtained in the same manner), where the first l points in X are labeled as y_i ∈ T, while the remaining points are unlabeled. Our goal is to predict the labels of the unlabeled points¹. We denote the initial labels in the dataset by an n × 1 vector y with y_i = 1 or −1 if x_i is labeled as positive or negative, and 0 if x_i is unlabeled. The classification result on the dataset X is also represented as an n × 1 vector f = [f_1, . . . , f_n]^T, which determines the label of x_i by y_i = sgn(f_i). In GBSSL, we construct the n × n weight matrix W for graph G with its (i, j)-th entry W_{ij} = w_{ij} computed by Eq.(1), and W_{ii} = 0. The degree matrix D for graph G is defined as an n × n diagonal matrix with its (i, i)-th entry equal to the sum of the i-th row of W. Finally, the normalized graph Laplacian [7] for graph G is defined as L = I − S = D^{-1/2}(D − W)D^{-1/2}. Based on the above preliminaries, Zhou et al. proposed the following regularization function [3] for GBSSL:

    Q = (f − y)^T (f − y) + λ f^T (I − S) f    (3)

The first term in Eq.(3) restricts that a good classifying function should not change too much from the initial label assignment, and the second term measures the smoothness of the data labels. The regularization parameter λ > 0 adjusts the tradeoff between these two terms. Thus, the optimal classification function can be obtained as f* = arg min_f Q = (1 − α)(I − αS)^{−1} y, where α = λ/(1 + λ). By letting f = f* in Eq.(3), the regularization function Q is fully determined by the initial labels y and the model {h_θ, θ}:

    Q(y, h_θ, θ) = y^T [I − (1 − α)(I − αS)^{−1}] y = y^T A y    (4)

¹ In this paper we concentrate on the transductive setting. One can easily extend our algorithm to the inductive setting using the method introduced in [6].
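For concreteness, the quantities introduced in Eqs.(2)–(4) — the Gaussian weights W, the normalized matrix S, the closed-form solution f*, and the value Q = y^T A y — can be computed as in the following sketch. Variable names and data are illustrative; this is not the authors' code.

```python
# Sketch of graph construction and of the closed-form quantities in Eqs.(2)-(4).
import numpy as np

def build_graph(X, sigma):
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)   # squared distances d_ij^2
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                                    # W_ii = 0
    Dinv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = Dinv_sqrt[:, None] * W * Dinv_sqrt[None, :]             # S_ij = W_ij / sqrt(D_ii D_jj)
    return W, S

def regularization_value(S, y, alpha):
    n = len(y)
    A = np.eye(n) - (1.0 - alpha) * np.linalg.inv(np.eye(n) - alpha * S)   # A from Eq.(4)
    f_star = (1.0 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, y)     # optimal f
    return y @ A @ y, f_star

X = np.random.randn(40, 2)
y = np.zeros(40); y[0], y[1] = 1.0, -1.0          # two labeled samples
_, S = build_graph(X, sigma=0.5)
Q, f_star = regularization_value(S, y, alpha=0.9)  # alpha = lambda / (1 + lambda)
print(Q, np.sign(f_star)[:5])
```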

where A = I − (1 − α)(I − αS)^{−1} depends on h_θ and θ. As we noted in section 1, one of the problems existing in these graph based methods is that the model (i.e. h_θ and θ in Eq.(1)) can affect the final classification results significantly. Moreover, as shown in Eq.(4), the model and the labels y are dependent. Specifically, the optimal model {h_θ, θ} relies on the vector y.

3. ACTIVE MODEL SELECTION

In this section, we first propose a gradient descent based model selection method for GBSSL. Then we provide details of the Active Model Selection algorithm.

3.1. Model Selection via Gradient Descents

We fix the parametric function h_θ to the Gaussian function in this paper, as shown in Eq.(2), and select the optimal variance σ. The derivative of Q w.r.t. σ can be calculated as follows:

    ∂Q(y, σ)/∂σ = −α(1 − α) y^T (I − αS)^{−1} (∂S/∂σ) (I − αS)^{−1} y    (5)

Since S_{ij} = W_{ij} / \sqrt{D_{ii} D_{jj}}, we get

    ∂S_{ij}/∂σ = \frac{1}{\sqrt{D_{ii} D_{jj}}} \frac{∂W_{ij}}{∂σ} − \frac{1}{2} \frac{W_{ij}}{\sqrt{D_{ii}^3 D_{jj}}} \frac{∂D_{ii}}{∂σ} − \frac{1}{2} \frac{W_{ij}}{\sqrt{D_{ii} D_{jj}^3}} \frac{∂D_{jj}}{∂σ}    (6)

    ∂W_{ij}/∂σ = \frac{∂}{∂σ} \exp\left(−\frac{d_{ij}^2}{2σ^2}\right) = \frac{d_{ij}^2}{σ^3} \exp\left(−\frac{d_{ij}^2}{2σ^2}\right)    (7)

    ∂D_{ii}/∂σ = \frac{∂}{∂σ} \sum_j W_{ij} = \sum_j \frac{∂W_{ij}}{∂σ}    (8)

where dij is the distance between samples xi and xj , and we employ Euclidean distance in this paper unless further noticed. With the derivative of Q w.r.t. σ calculated above, the model selection problem can be tackled with gradient descent. 3.2. Active Model Selection Since the optimal model is determined via gradient based method, the most informative samples for model selection would be those that maximize the derivative of the objective

function Q w.r.t. the model hyperparameter, which for the Gaussian function is the variance σ. However, since querying those samples that maximize |∂Q/∂σ| might increase the objective function, we control the acceptance of such a query by introducing an acceptance probability determined by the increase of Q. After the sample is queried, AMS retrains the model and constructs the graph in GBSSL with this new model. We assume only one sample is added to y at a time, therefore y = y^0 + y_k e_k, where y^0 is the label vector before querying sample x_k, y_k is the actual label for sample x_k and e_k = (0, . . . , 0, 1, 0, . . . , 0) is the unit vector with only the k-th element equal to 1.

    y^* = \arg\min_y Q = \arg\min_y (Q − Q_0) = \arg\min_y ∆Q    (9)

where Q_0 stands for the value of the regularization function before querying the selected sample. The decrease of Q after querying sample x_k is ∆Q = 2 y_k [A y^0]_{(k)} + A_{kk}, where [A y^0]_{(k)} denotes the k-th element of the vector A y^0, with A defined the same as in Eq.(4). The algorithm is as follows:

Initialization. Randomly initialize σ. Iterate between the following two steps until convergence.

Active sample selection. Denote the present value of the variance by σ*. Actively select the sample x_k querying which maximizes |∂Q(σ, y)/∂σ |_{σ=σ*}|, and calculate ∆Q = Q(f^{+x_k}) − Q_0. If ∆Q ≤ 0, accept x_k as the next sample for querying; else accept x_k with probability P(k) = exp(−∆Q · l / (k_B (l − q))), where l is the total number of samples to query, q is the number of samples already queried so far, and k_B is Boltzmann's constant [8], which is chosen to be the largest possible decrease of Q while selecting the first sample. If x_k is not accepted, check the next sample whose |∂Q(σ, y)/∂σ |_{σ=σ*}| is only less than that of x_k. Proceed until one sample is accepted. Query x_k and set y = y^0 + y_k e_k.

Graph reconstruction. Calculate σ* = arg min_σ Q(σ, y) by gradient descent.

Actually, P(k) is the Boltzmann probability [8] with the temperature selected as T = (l − q)/l. In the above algorithm, instead of discarding those samples querying which might increase Q, they can also be incorporated into the label vector with a controlled acceptance probability. Consequently, AMS obtains the ability to jump out of local minima. With the temperature T decreasing as more samples are queried, the probability of accepting an uphill step also decreases.

Now we present how to select the sample querying which maximizes |∂Q/∂σ|. According to Eq.(5), the derivative of Q with respect to σ can be computed as ∂Q(σ, y)/∂σ = y^T B(σ) y, where B(σ) = −α(1 − α)[(I − αS)^{−1} (∂S/∂σ) (I − αS)^{−1}] only depends on σ. Since in each iteration of AMS the graph reconstruction step optimizes the regularization function Q w.r.t. σ, suppose the present label vector is y = y^0; then

(10)

1883

Hence, we only need to compute the increase of ∂Q(σ,y) w.r.t. ∂σ the newly labeled sample. Denote the index of the newly labeled sample by k, ∂Q(σ, y) |σ=σ∗ ∂σ

= yT B(σ ∗ )y = 2yk

m 

Bkj (σ ∗ )yj0 + Bkk (σ ∗ )

j=1

= 2yk [B(σ ∗ )y0 ](k) + Bkk (σ ∗ ) (11) where [B(σ ∗ )y0 ](k) denotes the k-th element of the vector B(σ ∗ )y0 . Define the gain for labelling the k-th sample as the increase of the derivative of Q w.r.t. σ after querying it, therefore:   G(f +(xk ,yk ) ) = 2yk [B(σ ∗ )y0 ](k) + Bkk (σ ∗ ) (12) Since we don’t know what answer yk we will receive, we assume the answer is approximated with p+1 (yk )  p(yk = 1) ≈

1 1 + e−fk

(13)

where p+1 (yk ) denotes the probability of yk = 1. The expected gain after querying node k is therefore: G(f +xk )=p−1 (yk )G(f +(xk ,−1) )+p+1 (yk )G(f +(xk ,+1) ) 1 |− 2[B(σ ∗ )y0 ](k)+Bkk (σ ∗ )| ≈ 1− 1 + e−fk 1 |2[B(σ ∗ )y0 ](k)+Bkk (σ ∗ )| (14) + 1 + e−fk Hence, the next sample xk is selected as k = arg maxk G(f +xk ). After the sample xk is selected, AMS checks  if xk is accepted,    if not, check the next sample that leads to  ∂Q(σ,y) |σ=σ∗  only ∂σ less than xk until one sample is accepted. 4. EXPERIMENTS We validate the effectiveness of Active Model Selection on four real-world datasets, the Breast Cancer and Ionosphere datasets from UCI database2 , USPS3 , and 20-newsgroup4 datasets. Breast Cancer dataset contains 683 samples, and Ionosphere contains 351 samples. The USPS handwritten digits dataset contains images of 0, . . . , 9 as 10 classes. Finally, we choose the topic rec which contains autos, motorcycles, baseball and hockey from the 20-newsgroup dataset version 20-news-18828. We preprocess the data in the same manner as [3] and obtain 3970 document vectors in a 8014-dimensional space. For document classification, the distance between points xi and xj is defined to be xi ·xj d(xi , xj ) = 1 − ||xi ||||x . j || 2 http://www.ics.uci.edu/∼mlearn/

3 http://www.kernel-machines.org/data.html

4 http://people.csail.mit.edu/jrennie/20Newsgroups/

4.1. Comparison with Other Model Selection Methods for GBSSL In this section, we compare our active model selection (AMS) method with three state-of-the-art model selection algorithms for GBSSL, i.e., label entropy minimization (MinEnt) [5], leave-one-out cross validation (LOO) [9] and evidence maximization (LEM) [10]. Moreover, to validate the novel ac(a) Breast Cancer

1

0.8 0.75 0.7

20

40 60 Labeled Set Size

80

0.4 20

AMS LOO LEM MinEnt SIC

0.8 0.75 40 60 Labeled Set Size

80

Accuracy

0.9

40 60 Labeled Set Size

80

0.7

0.95

0.9

0.9

0.85 0.8 AMS LOO LEM MinEnt SIC

0.7 0.65

100

20

40 60 Labeled Set Size

20

80

40 60 Labeled Set Size

80

0.85 0.8 AMS LOO LEM MinEnt SIC

0.75 0.7 0.65

100

We propose an active learning method, Active Model Selection, to solve the model selection problem for GBSSL. Different from traditional active learning methods, AMS queries the most informative samples for model selection. Experimental results on both toy and real-world datasets show the effectiveness of AMS even with only few samples labeled.

100

(f) Baseball vs Hockey

1

0.95

0.75

AMS LOO LEM MinEnt SIC

0.75

100

5. CONCLUSIONS

0.85 0.8

(e) Autos vs Motorcycles

1

0.85

20

AMS LOO LEM MinEnt SIC

100

0.95

0.7

0.6 0.5

(d) 7 vs 9

1

0.9

0.7

Accuracy

AMS LOO LEM MinEnt SIC

Accuracy

Accuracy

Accuracy

0.85

(c) 1 vs 2

1 0.95

0.8

0.9

Accuracy

(b) Ionosphere

1 0.9

0.95

0.6

20

40 60 Labeled Set Size

80

100

Fig. 3. Test accuracies on UCI, USPS and 20-newsgroup datasets. (a) Breast cancer; (b) Ionosphere; (c) 1 vs 2; (d) 7 vs 9; (e) autos vs motorcycles; (f) baseball vs hockey. The number of labeled samples increases from 2 to 100.

tive learning framework we propose in this paper, we also compare with the method which simply iterates between traditional active learning and gradient descent based model selection. We call this method Simple Iterative Combination (SIC). Since MinEnt, LOO and LEM only consider the binary classification scenario, we provide experimental results on six two-way classification tasks. Test accuracies averaged over 20 random trials are reported. From Fig. 3 we can clearly see the advantage of active model selection, i.e., with the same number of samples labeled, AMS achieves the highest classification accuracy. Moreover, the advantage of AMS over SIC demonstrates the effectiveness of our active learning framework, i.e., for GBSSL, active model selection is better than combining traditional active learning and model selection. However, as controlled uphill steps are incorporated in AMS, it might take more time to converge than SIC.

4.2. Comparison with Other Classification Methods

0.9

0.95

0.8

0.9

0.7

0.85

SIC AMS LLGC Harmonic SVM k−NN

0.8 0.75 0.7

Accuracy

Accuracy

1

10

20 30 Labeled Set Size

We compare the performance of AMS on multi-category classification tasks with two supervised methods, k-NN and SVM, and two semi-supervised methods, LLGC [3] and harmonic function [5], in this section. The parameters in k-NN, SVM, LLGC and harmonic function are tuned by grid search. The number of labeled samples increases from 4 to 50 and test accuracies averaged over 50 random trials are reported. Figure 4 shows a clear advantage of AMS on multi-category classification.

40

(b) Text Classification on 20−Newsgroup

0.6

SIC AMS LLGC Harmonic SVM k−NN

0.5 0.4 50

10

20 30 Labeled Set Size

40

50

Fig. 4. Multi-category classification accuracies on USPS and 20-newsgroup datasets. (a) Digit recognition with USPS digits dataset for a total of 3874 samples (a subset containing digits from 1 to 4). (b) Text classification with 20-newsgroup dataset for a total of 3970 document vectors.


6. ACKNOWLEDGEMENT This work is supported by the project (60675009) of the National Natural Science Foundation of China. 7. REFERENCES [1] X. Zhu, “Semi-supervied learning literature survey,” Computer Sciences Technical Report, 1530, University of WisconsinMadison, 2006. [2] O. Chapelle, B. Scholkopf, and A. Zien, Semisupervised Learning, MIT Press: Cambridge, MA, 2006. [3] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf, “Learning with local and global consistency,” in Advances in Neural Information Processing Systems, 2004, vol. 16. [4] O. Chapelle, J. Weston, and B. Scholkopf, “Cluster kernels for semi-supervised learning,” in Advances in Neural Information Processing Systems, 2003, vol. 15. [5] X. Zhu, Semi-Supervised Learning with Graphs, Doctoral thesis, Carnegie Mellon University, May 2005. [6] O. Delalleu, Y. Bengio, and N. Le Roux, “Non-parametric function induction in semi-supervised learning,” in Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, 2005. [7] F. Chung, Spectral Graph Theory, American Mathematical Society, 1997. [8] S. Kirkpatrick, C. D. Gelatt, and Jr. M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, no. 4598, pp. 671–680, May 1983. [9] Xinhua Zhang and Wee Sun Lee, “Hyperparameter learning for graph based semi-supervised learning algorithms,” in Advances in Neural Information Processing Systems, 2007, vol. 19. [10] Ashish Kapoor, Yuan (Alan) Qi, Hyungil Ahn, and Rosalind W. Picard, “Hyperparameter and kernel learning for graph based semi-supervised classification,” in Advances in Neural Information Processing Systems, 2006, vol. 18.

SMOOTHNESS MAXIMIZATION VIA GRADIENT DESCENTS

Bin Zhao, Fei Wang, Changshui Zhang State Key Laboratory of Intelligent Technologies and Systems, Department of Automation, Tsinghua University, Beijing 100084, P.R.China ABSTRACT

The recent years have witnessed a surge of interest in graph based semi-supervised learning. However, despite its extensive research, there has been little work on graph construction. In this study, employing the idea of gradient descent, we propose a novel method called Iterative Smoothness Maximization (ISM) to learn an optimal graph automatically for a semi-supervised learning task. The main procedure of ISM is to minimize the upper bound of the semi-supervised classification error through an iterative gradient descent approach. We also prove the convergence of ISM theoretically, and finally experimental results on two real-world data sets are provided to demonstrate the effectiveness of ISM.

Index Terms— Semi-Supervised Learning (SSL), Cluster Assumption, Gaussian Function, Gradient Descent

1. INTRODUCTION

In many practical applications of pattern classification and data mining, one often faces a lack of sufficient labeled data,

since labeling often requires expensive human labor and much time. However, in many cases, a large number of unlabeled data can be far easier to obtain. Consequently, Semi-Supervised Learning (SSL) methods, which aim to learn from partially labeled data, are proposed [1]. In general, these methods can be categorized into two classes: transductive learning (e.g. [2]) and inductive learning (e.g. [3]). The goal of transductive learning is to estimate the labels of the given unlabeled data, whereas inductive learning tries to induce a decision function which has a low error rate on the whole sample space. In recent years, graph based semi-supervised learning has become one of the most active research areas in the SSL community [4]. The key to graph based SSL is the cluster assumption [5]. It states that (1) nearby points are likely to have the same label; (2) points on the same structure (such as a cluster or a submanifold) are likely to have the same label. Based on the above assumptions, graph-based SSL uses a graph G = ⟨V, E⟩ to describe the structure of a data set, where V is the node set corresponding to the labeled and unlabeled examples in the data set, and E is the edge set. In


most of the traditional methods [2][5][6], each edge e ∈ E is associated with a weight, usually computed by a Gaussian function which reflects the similarity between pairwise data points, i.e.

    w_{ij} = \exp\left( − \| x_i − x_j \|^2 / (2σ^2) \right)    (1)

However, as pointed out by [1], although graph construction is at the heart of graph based SSL, it is still a problem that has not been well studied. More concretely, the variance σ in Eq.(1) can affect the final classification result significantly, which can be seen from the toy example shown in Fig. 1, but according to [2], there has been no reliable approach that can determine an optimal σ automatically so far. To address such a problem, we propose a gradient based method, Iterative Smoothness Maximization (ISM), which aims at learning both the data labels and the optimal hyperparameter of the Gaussian function for constructing the graph. The ISM algorithm first establishes a cost function composed of two

parts, i.e. the smoothness and the fitness of the data labels,

to measure how good the classification result of the Semi-Supervised Learning task is. Then ISM minimizes this cost function by alternating between the smoothness maximization step and the graph reconstruction step. We also prove theoretically the convergence of the ISM algorithm. The rest of this paper is organized as follows. In section 2, we introduce some works related to this paper. The ISM algorithm is presented in detail in section 3. In section 4, we analyze the convergence property of the algorithm. Experimental results on two real-world data sets are provided in section 5, followed by the conclusions and future works in section 6.

2. NOTATIONS AND RELATED WORKS

In this section we introduce some notations and briefly review some related works of this paper. Given a point set X = {x_1, . . . , x_l, x_{l+1}, . . . , x_n} and a label set L = {1, . . . , c}, where the first l points in X are labeled as t_i ∈ L while the remaining points are unlabeled, our goal is to predict the labels of the unlabeled points¹.

¹ In this paper we concentrate on the transductive setting. One can easily extend our algorithm to the inductive setting using the method introduced in [7].




Fig. 1. Classification results on the two-moon pattern using the method in [2], a powerful transductive approach operating on a graph with the edge weights computed by a Gaussian function. (a) toy data set with two labeled points; (b) classification results with σ = 0.15; (c) classification results with σ = 0.4. We can see that a small variation of σ will cause a dramatically different classification result.

We denote the initial labels in the data set by an n × c matrix T. For each labeled point x_i, T_ij = 1 if x_i is labeled as t_i = j and T_ij = 0 otherwise. For unlabeled points, the corresponding rows of T are zero. The classification result on the data set X is also represented as an n × c matrix F = [F_1^T, ..., F_n^T]^T, which determines the label of x_i by t_i = arg max_{j≤c} F_ij. Following [2], the quality of a classification result F under a graph built with variance σ is measured by the cost function

Q(F, σ) = (1/2) [ Σ_{i,j=1}^n W_ij ‖ F_i/√D_ii − F_j/√D_jj ‖² + μ Σ_{i=1}^n ‖ F_i − T_i ‖² ]    (2)

where D is the diagonal matrix with D_ii = Σ_{j=1}^n W_ij. The first term measures the smoothness of the data labels, and the second term requires that a good classifying function should not change too much from the initial label assignment. The regularization parameter μ > 0 adjusts the trade-off between these two terms. Thus, the optimal classification function can be obtained by F* = arg min_F Q. As we noted in Section 1, one of the problems in these graph based methods is that the hyperparameter (i.e. σ in Eq.(1)) can affect the final classification results significantly. Therefore, many methods have been proposed to determine the optimal hyperparameter automatically, such as the Local Scaling [8] and Minimum Spanning Tree [6] methods. Although they can work well empirically, they are heuristic and thus lack a theoretical foundation.

3. ITERATIVE SMOOTHNESS MAXIMIZATION

In this section we introduce the main procedure of the Iterative Smoothness Maximization (ISM) algorithm.

3.1. Gradient Computing

We propose to learn the optimal σ through a gradient descent procedure. More concretely, employing the cost function in Eq.(2), we can compute the gradient of Q(F, σ) w.r.t. σ with F fixed as F* = arg min_F Q:


∂Q(F, σ)/∂σ = (1/2) Σ_{i,j=1}^n (∂W_ij/∂σ) ‖ F_i/√D_ii − F_j/√D_jj ‖²
              − Σ_{i,j=1}^n W_ij ( F_i/√D_ii − F_j/√D_jj )^T ( (F_i/(2 D_ii^{3/2})) ∂D_ii/∂σ − (F_j/(2 D_jj^{3/2})) ∂D_jj/∂σ )    (3)

Using the Gaussian function to measure similarity, we get

∂W_ij/∂σ = ( ‖x_i − x_j‖² / σ³ ) exp( −‖x_i − x_j‖² / (2σ²) ) = ( d_ij² / σ³ ) W_ij    (4)

∂D_ii/∂σ = Σ_{j=1}^n ∂W_ij/∂σ = Σ_{j=1}^n ( d_ij² / σ³ ) W_ij    (5)

where d_ij = ‖x_i − x_j‖.
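As a sanity check on Eqs.(2)-(5), the sketch below evaluates Q(F, σ) and its analytic gradient with respect to σ, reusing the gaussian_affinity helper from the earlier sketch. The commented finite-difference comparison at the end, along with all names and toy values, is our own illustration rather than part of the paper.

```python
import numpy as np

def cost_Q(F, T, X, sigma, mu):
    """Eq.(2): smoothness of the labels F plus the fit to the initial labels T."""
    W, D = gaussian_affinity(X, sigma)            # helper from the earlier sketch
    d = np.diag(D)
    Fn = F / np.sqrt(d)[:, None]                  # rows F_i / sqrt(D_ii)
    diff = Fn[:, None, :] - Fn[None, :, :]        # F_i/sqrt(D_ii) - F_j/sqrt(D_jj)
    smooth = 0.5 * np.sum(W * np.sum(diff ** 2, axis=-1))
    fit = 0.5 * mu * np.sum((F - T) ** 2)
    return smooth + fit

def grad_sigma(F, X, sigma):
    """Gradient of the smoothness part of Q w.r.t. sigma, following Eqs.(3)-(5)."""
    W, D = gaussian_affinity(X, sigma)
    d = np.diag(D)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    dW = (sq / sigma ** 3) * W                    # Eq.(4): dW_ij / dsigma
    dD = dW.sum(axis=1)                           # Eq.(5): dD_ii / dsigma
    Fn = F / np.sqrt(d)[:, None]
    diff = Fn[:, None, :] - Fn[None, :, :]
    term1 = 0.5 * np.sum(dW * np.sum(diff ** 2, axis=-1))
    G = F * (dD / (2.0 * d ** 1.5))[:, None]      # rows F_i dD_ii / (2 D_ii^{3/2})
    inner = np.einsum('ijk,ik->ij', diff, G) - np.einsum('ijk,jk->ij', diff, G)
    return term1 - np.sum(W * inner)

# Finite-difference check on hypothetical toy data (the fit term does not depend on sigma):
# X = np.random.randn(30, 2); F = np.random.rand(30, 3); T = np.zeros_like(F)
# s, eps = 0.8, 1e-5
# num = (cost_Q(F, T, X, s + eps, 1.0) - cost_Q(F, T, X, s - eps, 1.0)) / (2 * eps)
# print(num, grad_sigma(F, X, s))   # the two values should agree closely
```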

3.2. Learning Rate Selection in Gradient Descent

In gradient descent, the hyperparameter is updated as σ_new = σ_old − η ∂Q/∂σ |_{σ=σ_old}; the learning rate η affects the performance of gradient descent severely. In ISM, η is selected dynamically to accelerate the convergence of the algorithm. If we fix F, the cost function Q varies only with σ. Therefore, we plot the cost function curve in Fig. 2, in which the current value of σ is denoted by point A. In the case where


Fig. 2. Possible regions σ might appear in during the iteration, where B represents the region between A and the minimum point O. Regions C and D are separated by the point A* satisfying Q(σ_{A*}) = Q(σ_A).

the cost function has multiple local minima, this figure can be considered as a local region of the cost function curve. After updating σ with gradient descent, its new value σ_1 might appear in three regions: B, C or D. In ISM, we want the cost function to decrease monotonically to guarantee the convergence of the algorithm. Also, to simplify our method, we want the value of σ to avoid oscillation. Hence, B is the ideal region for σ_1, and σ_1 should be near the minimum point O. If σ_1 appears in region C or D, then the learning rate is too large and we set η equal to η̃, whose value is small enough to guarantee that, with η̃, the new variance appears in region B. On the other hand, if σ_1 appears in region B, we increase the learning rate by doubling it and calculate σ_2 with the new η. If σ_2 appears in region C or D, we resume the original learning rate and output σ_1 as the variance in this iteration; otherwise, we output σ_2.

3.3. Implementation of Iterative Smoothness Maximization

The implementation details of the ISM algorithm are as follows:

1. Initialization: set σ = σ_0, the total number of iteration steps N_0, the initial learning rate η_0 and the small learning rate η̃.

2. Calculate the optimal F for the current σ_n [2]:

F_{n+1} = (1 − α)(I − αS(σ_n))^{-1} T,

where S = D^{-1/2} W D^{-1/2} and α = 1/(1 + μ).

3. Update σ with gradient descent and adjust the learning rate η as described in Section 3.2: σ_{n+1} = σ_n − η ∂Q/∂σ |_{σ=σ_n}, with η chosen so that Q(F_{n+1}, σ_{n+1}) < Q(F_{n+1}, σ_n).

4. If n > N_0, quit the iteration and output the classification result; else, go to step 2.
4. CONVERGENCE STUDY OF ITERATIVE SMOOTHNESS MAXIMIZATION

Since the algorithm proposed here is iterative, it is crucial to study its convergence property. Without loss of generality, we only consider binary classification here, which can be easily extended to the multi-class case. The cost function can be written as Q(f, σ) = f^T(I − S)f + μ(f − t)^T(f − t), where S = D^{-1/2} W D^{-1/2}, f is the vector of predicted labels and t is the vector of initial labels. Setting the derivative with respect to f to zero gives the minimizer

∂Q/∂f = 0  ⟹  f* = (1 − α)(I − αS)^{-1} t,  with α = 1/(1 + μ),    (6)

while the Hessian matrix is as follows:

H = ∂²Q/∂f² = 2[ (I − S) + μI ].    (7)

For all x ∈ R^n, x ≠ 0,

x^T H x = 2[ x^T(I − S)x + μ x^T x ] = Σ_{i,j=1}^n W_ij ( x_i/√D_ii − x_j/√D_jj )² + 2μ Σ_{i=1}^n x_i² > 0.    (8)

Hence, the Hessian matrix is positive definite; moreover, f* is the unique zero point of ∂Q/∂f. Therefore, f* is the global minimum of Q with σ fixed. The property of f* being the global minimum of Q leads to the inequality Q(f_{n+1}, σ_n) ≤ Q(f_n, σ_n), where f_{n+1} = f*. In step 3 of ISM, σ_{n+1} is calculated as σ_{n+1} = σ_n − η ∂Q/∂σ |_{σ=σ_n}, with the learning rate in gradient descent selected to guarantee that Q(f_{n+1}, σ_{n+1}) ≤ Q(f_{n+1}, σ_n). Thus, in every iteration of ISM, the following inequality is guaranteed:

Q(f_{n+1}, σ_{n+1}) ≤ Q(f_{n+1}, σ_n) ≤ Q(f_n, σ_n)    (9)

which implies that the cost function Q(f, σ) decreases monotonically. Moreover, since Q(f, σ) ≥ 0, Q is lower bounded by zero. Hence, the algorithm is guaranteed to converge. In practice, our method converges after 29 iterations on the two-moon data set with σ_0 = 1.

5. EXPERIMENTS

We validate our method on two real-world data sets in this section.


5.1. Digit Recognition

In this experiment, we focus on classifying handwritten digits. We use images of the digits 1, 2, 3 and 4 from the USPS⁴ handwritten 16 × 16 digits data set; there are 1005, 731, 658 and 652 samples in each class, 3046 in total. We employ the Nearest Neighbor classifier and SVM [9] as baselines. For comparison, we also provide the classification results of the LLGC method [2], in which the affinity matrix is constructed by a Gaussian function with variance 1.25, tuned with grid search. The number of labeled samples increases from 4 to 50, and the test errors averaged over 50 random trials are summarized in Fig. 3(b), from which we can clearly see the advantages of ISM and LLGC. Evidently, the hyperparameter of LLGC obtained from grid search is suboptimal.

Fig. 3. Classification results on USPS data. (a) Changes of σ during the iteration; (b) recognition accuracies, with the horizontal axis representing the number of randomly labeled samples.

5.2. Object Recognition

In this section, we address the task of object recognition using the COIL-20⁵ data set, a database of gray-scale images of 20 objects. For each object there are 72 images of size 128 × 128. Similar to the digit recognition experiment, we employ the Nearest Neighbor classifier and SVM as baselines. For comparison, we also test the recognition accuracy of the LLGC method, in which the affinity matrix is constructed by a Gaussian function with variance 40, also tuned with grid search. The number of labeled samples per class increases from 2 to 18, and the test errors averaged over 50 random trials are summarized in Fig. 4(b), from which we can clearly see the advantages of ISM and LLGC, while in ISM the hyperparameter σ is determined automatically.

Fig. 4. Object recognition results on COIL-20 data. (a) Changes of σ during the iteration; (b) recognition accuracies, with the horizontal axis representing the number of randomly labeled samples per class.

6. CONCLUSIONS AND FUTURE WORKS

Employing the idea of gradient descent, we propose the Iterative Smoothness Maximization algorithm in this paper to learn an optimal graph automatically for a semi-supervised learning task. Moreover, the algorithm is guaranteed to converge. Experimental results on both toy and real-world data sets show the effectiveness of ISM for parameter selection when only a few labeled samples are provided. Although we focus on learning the hyperparameter σ in the Gaussian function, our method can be applied to other kernel functions as well. Besides these advantages of ISM, there are still certain aspects that can be investigated further to improve the efficiency of our method, including employing more advanced optimization algorithms to speed up convergence and extending ISM to allow different σ along different directions.

7. ACKNOWLEDGEMENT

This work is supported by the project (60675009) of the National Natural Science Foundation of China.

8. REFERENCES

[1] X. Zhu, "Semi-supervised learning literature survey," Computer Sciences Technical Report 1530, University of Wisconsin-Madison, 2006.
[2] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," Advances in Neural Information Processing Systems, 2003.
[3] M. Belkin, P. Niyogi, and V. Sindhwani, "On manifold regularization," Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, 2005.
[4] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, MA, 2006.
[5] O. Chapelle, J. Weston, and B. Scholkopf, "Cluster kernels for semi-supervised learning," Advances in Neural Information Processing Systems 15, 2003.
[6] X. Zhu, Semi-Supervised Learning with Graphs, Doctoral thesis, Carnegie Mellon University, May 2005.
[7] O. Delalleau, Y. Bengio, and N. Le Roux, "Non-parametric function induction in semi-supervised learning," Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, 2005.
[8] L. Zelnik-Manor and P. Perona, "Self-tuning spectral clustering," Advances in Neural Information Processing Systems, 2004.
[9] B. Scholkopf and A. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2006.

⁴ http://www.kernel-machines.org/data.html
⁵ http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php
