Self-Taught Spectral Clustering via Constraint Augmentation

Xiang Wang∗    Jun Wang∗    Buyue Qian∗    Fei Wang∗    Ian Davidson†

∗ IBM T. J. Watson Research Center. [email protected], [email protected], [email protected], [email protected].
† University of California, Davis. [email protected].

Abstract Although constrained spectral clustering has been used extensively over the past few years, all existing work assumes that the guidance (constraints) is given by humans. Original formulations of the problem assumed the constraints are given passively, whilst later work allowed actively polling an Oracle (human experts). In this paper, for the first time to our knowledge, we explore the problem of augmenting the given constraint set for constrained spectral clustering algorithms. This moves spectral clustering towards self-teaching, as has occurred in the supervised learning literature. We present a formulation for self-taught spectral clustering and show that the self-teaching process can drastically improve performance without further human guidance.

1 Introduction

Spectral clustering is a fundamental unsupervised learning technique that has been extensively studied in the literature [18, 19]. It has been drawing increasing attention in the past decade thanks to the prevalence of networked data. Compared to other clustering schemes, spectral clustering has many advantages, such as a simple yet principled objective function, a closed-form solution with easy implementation, and the ability to model clusters of arbitrary shapes. A limitation of spectral clustering is that its effectiveness relies solely on the quality of the graph, which is often noisy and/or incomplete in practice. To mitigate this problem, a variety of constrained spectral clustering algorithms [9–11, 13, 22, 26] were introduced to incorporate prior knowledge, in the form of pairwise constraints, into the spectral clustering process. This marks the progression of spectral clustering from unsupervised learning to semi-supervised learning.

Existing constrained spectral clustering algorithms generally work well when a sufficient number of constraints is given. However, their performance becomes questionable when the expert guidance is scarce. To address this challenge, researchers proposed to optimize the performance of constrained spectral clustering by actively selecting the most helpful constraints, under the assumption that an Oracle (human experts) is available to provide the ground truth answer to any unknown pairwise relation [21, 23, 25]. This marks the progression of spectral clustering from passive learning to active learning.

Active spectral clustering addresses the issue of how to (efficiently) learn unknown constraints from human experts. In this paper we explore the other side of the same coin: how to learn unknown constraints through self-teaching. This is an important problem because 1) in many applications it is impossible to incrementally poll an Oracle for new constraints, and 2) even if an Oracle is available, self-teaching can reduce the number of queries and thus the burden on human experts. Our work marks the next step in the progression of spectral clustering: from expert-teaching to self-teaching.

The term "self-teaching" was coined in the transfer learning literature [5, 16], where it means the transfer of knowledge from unlabeled auxiliary data to the target domain. Analogously, our framework can be seen as transferring knowledge to unconstrained instances by adding augmented constraints to them. Specifically, we extend the objective function of constrained spectral clustering in such a way that unknown constraints can be self-taught based on the given graph affinity matrix and the known constraints. Our formulation exploits the simple but crucial fact that the ground truth constraint matrix is low-rank: if the underlying ground truth consists of, say, two clusters, then the ground truth constraint matrix can be exactly represented as the outer product of a single cluster indicator vector. Therefore the known constraint matrix can be viewed as an incomplete version of the ground truth constraint matrix, and the missing entries can be recovered by rank-regularized matrix completion techniques [3, 4]. Unlike standard matrix completion problems, our formulation also incorporates the graph affinity matrix by using it as a regularizer, which further improves the effectiveness of our approach when the known constraints are extremely scarce.

Our contributions are:

• Our work is, to the best of our knowledge, the first to extend constrained spectral clustering to incorporate self-teaching, which is useful in numerous application scenarios where expert guidance is scarce and polling an Oracle is infeasible.

• We propose a principled framework that leverages both the affinity structure of the graph and the low-rank property of the constraint matrix to augment the given constraint set.

• We show how to solve our objective function efficiently with alternating minimization.

• We justify the effectiveness of our approach on real datasets, with comparison to state-of-the-art constrained spectral clustering techniques.

[Figure 1 panels: (a) Spectral Clustering, Accuracy=0.591; (b) Constrained Spectral Clustering, Accuracy=0.623; (c) Self-Taught Spectral Clustering, Accuracy=0.904.]

Figure 1: An illustrative example showing that, using the same set of 200 randomly selected constraints, our self-taught algorithm can significantly improve the clustering accuracy (in terms of Adjusted Rand Index), whereas traditional constrained spectral clustering offers little improvement. (a) is the original graph affinity matrix; (b) is the combination of the affinity matrix and the given constraint matrix; (c) is the self-taught constraint matrix.

An Illustrative Example. Figure 1 demonstrates the advantage of our proposed method using the Iris dataset from the UCI Archive [1], which consists of 150 instances belonging to 3 clusters. We can see that with 200 constraints, the traditional constrained spectral clustering algorithm can only find a marginally better partition (as compared to the one found by unconstrained spectral clustering). In contrast, our algorithm can recover an almost perfect constraint matrix via self-teaching, which leads to a significantly better partition.

2 Related Work

2.1 Spectral Clustering and Its Extensions Spectral clustering [18, 19] was originally an unsupervised learning algorithm. Its output is decided solely by the affinity structure of the graph, which means it does not perform well when the graph is noisy or incomplete. To address this problem, spectral clustering was extended to the semi-supervised setting and a variety of constrained spectral clustering algorithms were introduced [9–11, 13, 22, 26]. For instance, Kamvar et al. [9] used constraints to directly reset the edge weights in the graph affinity matrix; Kulis et al. [11] used the constraint penalty matrix to modify the affinity matrix of the graph, which is essentially a linear kernel combination; Wang and Davidson [22] introduced a quadratic constrained optimization objective where a certain portion of constraints must be satisfied; Kawale and Boley [10] proposed a linear programming objective with a sparsity regularization term.

Later, constrained spectral clustering was extended to the active learning setting [21, 23, 25], where it is assumed that an Oracle (human experts) is available and new constraints that are considered most informative can be queried incrementally. Existing techniques differ in how informativeness is defined. For example, Xu et al. [25] proposed to query boundary nodes in the graph to resolve ambiguity; Wauthier et al. [23] proposed to query edges in the graph that lead to maximum perturbation; Wang and Davidson [21] considered the current cut and the constraint set to query the pairwise relation with maximum disagreement.

Existing constrained spectral clustering algorithms, passive or active, all use the constraint set as is, i.e., if a pairwise relation is not specified in the current constraint set, the algorithm treats the relation as unknown. As a result, these algorithms do not work well when the constraints are scarce. There have been attempts in the literature [6, 20] to augment the given constraint set using the transitivity and entailment properties. However, this rudimentary technique cannot handle soft constraints or noise, and can easily lead to over-constraining in practice.

2.2 Self-Taught Learning Self-taught learning is a term coined in the transfer learning literature. Raina et al. [16] introduced the first self-taught learning algorithm, which uses irrelevant/auxiliary unlabeled data to improve the performance of classification. Dai et al. [5] proposed a self-learning strategy for clustering. Both techniques use the auxiliary data to learn a better feature representation, and the knowledge is transferred through a shared feature space. Zhang et al. [28] used the term to describe unsupervised learning from data that gathers underlying structure for a supervised learning stage. In this paper we borrow the term "self-teaching" because our framework can be viewed as transferring knowledge to unconstrained instances by adding augmented constraints to them, and the process does not involve human guidance. However, we want to emphasize that our problem setting is not a typical transfer learning setting: our problem has only one domain (the graph) and one task (clustering), whereas standard transfer learning normally assumes a source domain and a target domain [15].

2.3 Matrix Completion The self-teaching in our framework is realized by exploiting the low-rank property for matrix completion [2–4, 14]. Matrix completion via rank minimization has drawn much attention during the past few years after the theoretical breakthrough showing that a low-rank matrix can be perfectly recovered with high probability using only a small number of observed entries. Most work in this area assumes that the entries are missing at random; Lee and Shraibman [12] discussed the case where the entries are not missing at random. It has been proven that given enough observed constraints (O(N log^2 N)), matrix completion techniques can perfectly recover the ground truth constraint matrix. However, the purpose of our work is to further reduce the number of observed constraints needed to recover the ground truth. Not until very recently did researchers start to consider how to incorporate side information to assist matrix completion [24, 27]. Shang et al. [17] explored matrix completion under the Laplacian embedding of a graph. To the best of our knowledge, we are the first to combine graph min-cut with matrix completion into an iterative self-teaching process.

3 Background and Preliminaries

In this section we introduce notations (summarized in Table 1) and the standard formulations of spectral clustering and matrix completion. Readers who are familiar with these materials can skip to Section 4.

Table 1: Notations
  W         Graph affinity matrix
  D         Graph degree matrix
  L         Graph Laplacian matrix
  V         Graph cut indicator matrix
  Q         Constraint matrix
  Q*        Ground truth constraint matrix
  Ω         Index set of observed entries
  P_Ω(·)    Mask function for missing entries
  I_K       K × K identity matrix
  tr(·)     Matrix trace
  ||·||_*   Matrix nuclear norm

3.1 Spectral Clustering Given a graph with N nodes, it is represented by an N × N affinity matrix W, which is real, symmetric, and non-negative. Without loss of generality we can assume W ∈ [0, 1]^{N×N}. W_{ij} represents the edge weight, or similarity, between nodes i and j. The diagonal matrix D = diag(d_1, ..., d_N) is called the degree matrix of the graph, whose diagonal elements are defined as

d_i = Σ_j W_{ij}.

L = D − W is called the (unnormalized) graph Laplacian matrix. The objective function of K-way spectral clustering is [19]:

(3.1)  argmin_{V ∈ R^{N×K}} tr(V^T L V),  s.t. V^T V = I_K.

It is well known that Eq.(3.1) is an approximation to the ratio-cut problem; if we replace the unnormalized Laplacian L with the normalized Laplacian L̄ = D^{−1/2} L D^{−1/2}, then Eq.(3.1) is an approximation to the normalized min-cut problem [19].

The optimal solution to Eq.(3.1) is V*, whose columns are the K smallest eigenvectors of L. To recover the actual graph partition from V*, the common practice is to run K-means on the rows of V* to obtain the node assignments.
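The eigenvector computation and K-means rounding above can be summarized in a few lines. The following is a minimal sketch of our own (not the authors' code), assuming NumPy and scikit-learn are available:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, K):
    """Unnormalized spectral clustering as in Eq.(3.1): embed with the
    K smallest eigenvectors of L = D - W, then round with K-means."""
    d = W.sum(axis=1)
    L = np.diag(d) - W                    # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    V = eigvecs[:, :K]                    # K smallest eigenvectors
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(V)
    return V, labels
```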

3.2 Matrix Completion via Rank Minimization Let X be an incomplete observation of a low-rank matrix X*, and let Ω denote the index set of the observed entries of X*. The objective of matrix completion is to recover the missing entries of X by exploiting its low-rank structure:

(3.2)  argmin_X rank(X),  s.t. P_Ω(X) = P_Ω(X*),

where P_Ω(·) is a mask function defined as

P_Ω(X)_{ij} = X_{ij} if (i, j) ∈ Ω, and 0 if (i, j) ∉ Ω.

Eq.(3.2) is intractable due to the non-convexity of the rank function. Therefore we replace the rank function with its convex surrogate, the matrix nuclear norm, and solve

(3.3)  argmin_X ||X||_*,  s.t. P_Ω(X) = P_Ω(X*).

Recent progress in matrix completion [2–4, 14] shows that Eq.(3.3) can be solved efficiently and that perfect recovery is possible with a sufficient number of observed entries.
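For concreteness, the mask operator P_Ω and the nuclear-norm surrogate are straightforward to express in code. This is an illustrative sketch only, where Omega is assumed to be given as a list of (i, j) index pairs:

```python
import numpy as np

def mask(X, Omega):
    """P_Omega(X): keep the observed entries, zero out the rest."""
    M = np.zeros_like(X)
    rows, cols = zip(*Omega)
    M[rows, cols] = X[rows, cols]
    return M

def nuclear_norm(X):
    """||X||_* = sum of singular values, the convex surrogate of rank(X)."""
    return np.linalg.svd(X, compute_uv=False).sum()
```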

4 Problem Formulation

In this section we introduce a novel optimization framework for self-taught constrained spectral clustering, where the given constraint set is automatically augmented by combining its low-rank property with the affinity structure of the graph.

4.1 Objective Function Assume we have N data instances that can be mapped into K ground truth clusters. We can encode the ground truth cluster assignment by an N × N binary matrix Q*:

(4.4)  Q*_{ij} = 1 if i ∼ j,  and  Q*_{ij} = 0 if i ≁ j,

where i ∼ j means i and j belong to the same cluster and i ≁ j means i and j belong to different clusters. Note that the rank of Q* is exactly K. In the context of constrained clustering, the constraint set Q can be viewed as an incomplete observation of the ground truth assignment Q*, with the index set of the observed entries denoted by Ω. In other words, we should always have P_Ω(Q) = P_Ω(Q*).

Let L be the graph Laplacian and K the number of clusters. Our objective function is:

(4.5)  argmin_{V ∈ R^{N×K}, Q ∈ R^{N×N}}  α tr(V^T L V) − β tr(V^T Q V) + rank(Q),
       s.t. V^T V = I_K,  P_Ω(Q) = P_Ω(Q*).

α, β > 0 are weighting parameters and I_K is the K × K identity matrix.
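To make the encoding in Eq.(4.4) concrete, the following sketch (our own illustration, not part of the paper) builds Q* from a ground-truth label vector and simulates a partially observed constraint matrix P_Ω(Q*):

```python
import numpy as np

def ground_truth_constraints(labels):
    """Q*_ij = 1 if instances i and j share a label, else 0 (Eq. 4.4).
    rank(Q*) equals the number of clusters K."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def sample_observed(Q_star, n_pairs, rng=np.random.default_rng(0)):
    """Randomly reveal n_pairs symmetric entries of Q*; return the
    masked matrix P_Omega(Q*) and the index set Omega."""
    N = Q_star.shape[0]
    idx = rng.choice(N * N, size=n_pairs, replace=False)
    Omega = [(k // N, k % N) for k in idx]
    observed = np.zeros_like(Q_star)
    for i, j in Omega:
        observed[i, j] = observed[j, i] = Q_star[i, j]
    return observed, Omega
```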

4.2 Interpretation When the constraint matrix Q is fixed, our objective becomes a constrained spectral clustering problem:

(4.6)  argmin_{V^T V = I_K} tr(V^T (αD − αW − βQ) V).

This is a formulation in which the graph affinity matrix is directly modified by the constraint matrix Q through a linear kernel combination. It is easy to see that when Q → 0, V approaches the min-cut of the graph (unconstrained spectral clustering); when Q → Q* and β/α is sufficiently large, V approaches the ground truth partition.

When the constrained cut V is fixed, our objective leads to the self-teaching of unknown constraints:

(4.7)  argmin_Q rank(Q) − Σ_{i,j} G_{ij} Q_{ij},  s.t. P_Ω(Q) = P_Ω(Q*),

where G = V V^T is the Gram matrix of V. Eq.(4.7) can be viewed as a regularized matrix completion problem: if the pairwise relation between nodes i and j is observed, then Q_{ij} is set to the ground truth value Q*_{ij}; otherwise it is optimized to minimize the rank of Q and to approximate G_{ij}, the pairwise relation derived from the current cut V.

When both V and Q are to be optimized, our objective in Eq.(4.5) couples them through the term tr(V^T Q V), so that the affinity structure of L helps estimate the missing entries of Q* and the augmented Q in turn leads to a better cut V.

5 Algorithm

5.1 Efficient Solution Since the rank minimization part of Eq.(4.5) is intractable, it is common practice to approximate it with nuclear norm minimization:

(5.8)  argmin_{V,Q}  α tr(V^T L V) − β tr(V^T Q V) + ||Q||_*,
       s.t. V^T V = I_K,  P_Ω(Q) = P_Ω(Q*).

Eq.(5.8) can be rewritten in an equivalent regularization form:

(5.9)  argmin_{V,Q}  α tr(V^T L V) − β tr(V^T Q V) + µ ||Q||_* + (1/2) ||P_Ω(Q − Q*)||_F^2,
       s.t. V^T V = I_K.

α, β, µ > 0 are weighting parameters. Our objective in Eq.(5.9) can be decomposed into two sub-problems via alternating minimization. Given Q, Eq.(5.9) becomes

(5.10)  argmin_{V^T V = I_K} tr(V^T L̃ V),

where L̃ = αL − βQ. Let V* be the optimal solution to Eq.(5.10); it is easy to show that the columns of V* are the K smallest eigenvectors of L̃.

Algorithm 1 Fixed Point Continuation for Eq.(5.11)
Input: Partially observed ground truth P_Ω(Q*); graph cut V; parameters µ, β
Output: Self-taught constraint matrix Q
 1: Initialize: η ← 0.5, τ ← 1.9, µ_0 ← η^{-10} µ
 2: Initialize: Q ← 0
 3: G ← V V^T
 4: repeat
 5:   µ_0 ← max(µ, µ_0 η)
 6:   repeat
 7:     Y ← Q − τ (P_Ω(Q − Q*) − β G)
 8:     Compute the SVD of Y: Y = U_Y Σ_Y V_Y^T
 9:     for i = 1, ..., N do
10:       Σ_{Y,ii} ← max(Σ_{Y,ii} − τ µ_0, 0)
11:     end for
12:     Q ← U_Y Σ_Y V_Y^T
13:   until convergence
14: until µ_0 = µ
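A compact NumPy transcription of Algorithm 1 (the fixed point continuation solver for the Q sub-problem of Eq.(5.11) below) is given next. It is a sketch under our own simplifications (dense SVD, a fixed inner iteration cap instead of the paper's relative-change test), not the authors' implementation; mask and Omega are as in the earlier sketches.

```python
import numpy as np

def fpc_complete(Q_star_obs, Omega, V, mu=0.1, beta=0.1,
                 eta=0.5, tau=1.9, inner_iters=2000):
    """Algorithm 1: fixed point continuation with singular value shrinkage.
    Q_star_obs holds P_Omega(Q*); V is the current constrained cut."""
    N = Q_star_obs.shape[0]
    G = V @ V.T                       # Gram matrix of the current cut
    Q = np.zeros((N, N))
    mu0 = mu / eta**10                # start with a loose shrinkage level
    while True:
        mu0 = max(mu, mu0 * eta)      # continuation: tighten toward mu
        for _ in range(inner_iters):
            grad = mask(Q - Q_star_obs, Omega) - beta * G
            Y = Q - tau * grad        # gradient step
            U, s, Vt = np.linalg.svd(Y, full_matrices=False)
            s = np.maximum(s - tau * mu0, 0.0)   # shrink singular values
            Q_new = (U * s) @ Vt
            rel = np.linalg.norm(Q_new - Q, 'fro') / max(1.0, np.linalg.norm(Q, 'fro'))
            Q = Q_new
            if rel < 1e-5:
                break
        if mu0 <= mu:
            break
    return Q
```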

Given V, Eq.(5.9) becomes:

(5.11)  argmin_Q  µ ||Q||_* + (1/2) ||P_Ω(Q − Q*)||_F^2 − β tr(V^T Q V).

Eq.(5.11) is a regularized nuclear norm minimization problem. Since the regularization term tr(V^T Q V) is a convex function of Q, Eq.(5.11) can be solved efficiently with the Fixed Point Continuation algorithm proposed by Ma et al. [14]. In summary, our algorithm alternates between solving Eq.(5.10) and Eq.(5.11) until convergence. We summarize it in Algorithm 2.

5.2 Parameter Settings There are three main parameters in our objective function Eq.(5.9) that decide the relative importance of its four terms. α > 0 decides the influence of the original graph structure. µ > 0 decides how strongly we enforce the low-rank requirement on the self-taught constraint matrix. β > 0 decides the mutual influence between the graph and the constraint matrix. The magnitudes of α, β, and µ together decide how much we allow the self-taught constraint matrix Q to deviate from the partially observed ground truth P_Ω(Q*). In our experiments we simply set α = β = µ = 0.1.

5.3 Convergence For the outer loop in Algorithm 2, we use the relative change in the objective value of Eq.(5.9), ε_out, as the convergence criterion. For the inner loop in Algorithm 1, we use the relative change in the objective value of Eq.(5.11), ε_in, to determine convergence. In our experiments we set ε_out to 10^{-2} and the outer loop converges within 10 iterations. We set ε_in to 10^{-5} and the inner loop converges within 2,000 iterations.

Algorithm 2 Self-Taught Spectral Clustering
Input: Partially observed ground truth P_Ω(Q*); graph Laplacian L; number of clusters K; parameters µ, α, β
Output: Self-taught constraint matrix Q; constrained cut V
1: Initialize V to be the K smallest eigenvectors of L
2: repeat
3:   Update Q using Algorithm 1
4:   L̃ ← αL − βQ
5:   Update V to be the K smallest eigenvectors of L̃
6: until convergence
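The outer loop of Algorithm 2 thus only alternates an eigendecomposition with the completion step. A minimal sketch (again our own, reusing fpc_complete from the previous block and a fixed number of outer iterations instead of the ε_out test):

```python
import numpy as np
from sklearn.cluster import KMeans

def self_taught_spectral_clustering(L, Q_star_obs, Omega, K,
                                    alpha=0.1, beta=0.1, mu=0.1,
                                    outer_iters=10):
    """Alternate between the constraint update (Eq. 5.11, via Algorithm 1)
    and the cut update (Eq. 5.10, K smallest eigenvectors of alpha*L - beta*Q)."""
    _, vecs = np.linalg.eigh(L)
    V = vecs[:, :K]                        # unconstrained spectral embedding
    Q = np.zeros_like(L)
    for _ in range(outer_iters):
        Q = fpc_complete(Q_star_obs, Omega, V, mu=mu, beta=beta)
        L_tilde = alpha * L - beta * Q     # constraint-modified Laplacian
        _, vecs = np.linalg.eigh(L_tilde)
        V = vecs[:, :K]
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(V)
    return Q, V, labels
```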

5.4 Runtime The runtime of our algorithm is dominated by the inner loop of Algorithm 1, which is in turn dominated by computing the SVD of the N × N matrix Y (line 8), normally taking O(N^3) time. Fortunately, in our case we do not need the full SVD of Y but only a low-rank approximation. More specifically, we are only interested in the singular values of Y that are greater than τµ (line 10). Therefore, a fast low-rank approximation algorithm [7] is available that returns the top-K singular values and singular vectors in time linear in N. This algorithm has been successfully integrated into the Fixed Point Continuation algorithm by Ma et al. [14] and can easily scale to a graph with ten thousand nodes on a modern desktop computer. Overall, the runtime of Algorithm 2 is the number of iterations to converge multiplied by the runtime of the (partial) SVD. Put another way, the scalability of our algorithm is similar to that of popular matrix completion algorithms [2, 14].
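In practice, the dense SVD in our earlier sketch would be swapped for a truncated SVD; for instance, SciPy's sparse SVD routine returns only a few leading singular triplets. The snippet below is our illustration of the idea, not the randomized algorithm of [7]:

```python
import numpy as np
from scipy.sparse.linalg import svds

def shrink_top_k(Y, threshold, k):
    """Approximate the shrinkage step of Algorithm 1 using only the
    top-k singular triplets of Y (k should exceed the expected rank)."""
    U, s, Vt = svds(Y, k=k)             # top-k singular values (unordered)
    s = np.maximum(s - threshold, 0.0)  # soft-threshold the singular values
    return (U * s) @ Vt
```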

6 Empirical Study

In this section, we use benchmark datasets to validate the effectiveness of our approach in comparison to various baseline methods, including a state-of-the-art constrained spectral clustering algorithm. We also study the convergence behavior and parameter sensitivity of our algorithm.

6.1 Experiment Setup We chose five commonly used benchmark datasets from the UCI Archive [1], namely Hepatitis, Iris, Glass, Heart, and Sonar. We also used the preprocessed 20 Newsgroups dataset¹, sub-sampled to exclude short documents.

¹ http://www.cs.nyu.edu/~roweis/data.html

In our experiment we chose these six small datasets for clarity of presentation; our algorithm scales to graphs with thousands of nodes using partial SVD (see the runtime analysis in Section 5). For each dataset, we constructed the graph affinity matrix using the RBF kernel.
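The paper does not specify the kernel width, so the sketch below, which is purely illustrative, uses a common median-distance heuristic for the RBF bandwidth:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def rbf_affinity(X, sigma=None):
    """Build W with W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    If sigma is not given, use the median pairwise distance (a heuristic,
    not necessarily the paper's choice)."""
    D = squareform(pdist(X))            # pairwise Euclidean distances
    if sigma is None:
        sigma = np.median(D[D > 0])
    W = np.exp(-D**2 / (2.0 * sigma**2))
    np.fill_diagonal(W, 0.0)            # no self-loops
    return W
```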

Table 2: Datasets used in our experiment

  ID              #Instances  #Attributes  #Clusters
  Hepatitis            80         19           2
  Iris                150          4           3
  Heart               270         12           2
  Glass               214          9           2
  Sonar               208         60           2
  20 Newsgroups       290        100           4

The statistics of the datasets are listed in Table 2 and the constructed graphs are shown in Figure 2. These datasets represent a variety of problem settings and difficulties, including very noisy graphs and very imbalanced clusters. We implemented the following algorithms:

Figure 2: The graph affinity matrix W and the ground truth partition Q* for each dataset. Panels: (a) Hepatitis, (b) Iris, (c) Glass, (d) Heart, (e) Sonar, (f) 20 Newsgroups.

• Self-Taught: This is our approach described in Algorithm 2. For all experiments, we fixed α, β, and µ to 0.1.

• Spectral: This is the unconstrained spectral clustering baseline. It represents the quality of the given graph Laplacian.

• Constrained Spectral Clustering (CSC): This is the baseline where we use the given constraint set as is, without self-teaching. It can be viewed as a variation of the Spectral Learning algorithm proposed in [9], where instead of resetting the entries in the graph affinity matrix we simply add the constraint matrix to the affinity matrix. It can also be viewed as an extreme case of our objective where α, β → ∞. Note that this baseline converges to the ground truth partition when the constraint set is large enough. We chose this baseline to validate the benefit of augmenting the constraint set.

• Matrix Completion (MC): This is the baseline where we augment the constraint set using only the matrix completion technique, without the regularization from the graph affinity matrix. It can be viewed as the other extreme case of our objective where α, β → 0. This baseline also converges to the ground truth partition when the constraint set is large enough. We chose this baseline to validate the benefit of adding the graph affinity matrix as a regularizer during the self-teaching process.

• Flexible Constrained Spectral Clustering (FCSC): This is the algorithm proposed in [22], which compared favorably to several existing constrained spectral clustering algorithms. We used the 2-way version of the algorithm (which performs better than the K-way version), so it was not compared against on datasets with more than 2 clusters. We chose this baseline to validate the advantage of our algorithm over alternative constrained spectral clustering schemes.

The MATLAB code and the datasets used in our experiments will be made publicly available upon publication. To evaluate the accuracy of the clustering methods, we used the Adjusted Rand Index (ARI) [8], which is considered a better evaluation metric when the cluster sizes are imbalanced. Larger ARI means better performance: 1 means the partition perfectly matches the ground truth and 0 means the partition is as good as a random assignment.
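ARI is available off the shelf; for example, with scikit-learn (shown purely as a usage illustration on toy labels):

```python
from sklearn.metrics import adjusted_rand_score

labels_true = [0, 0, 1, 1, 2, 2]   # toy ground truth assignment
labels_pred = [0, 0, 1, 2, 2, 2]   # toy clustering output
print(adjusted_rand_score(labels_true, labels_pred))
```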

6.2 Results and Analysis We first compare the accuracy of our algorithm against the baseline methods. For each dataset, we randomly generated a constraint set, then applied all algorithms to the dataset with the same constraint set. We repeated the process over 10 different random constraint sets and reported the average ARI and standard deviation in Figure 3. The first thing to notice is that the average accuracy of our algorithm (Self-Taught) significantly outperformed the baseline methods, including the FCSC algorithm.

Figure 3: Comparison of the different methods in terms of accuracy (Adjusted Rand Index vs. number of constraints) on (a) Hepatitis, (b) Iris, (c) Glass, (d) Heart, (e) Sonar, (f) 20 Newsgroups. FCSC was only applied to the 2-cluster datasets.

Comparing our algorithm to its two extreme cases, CSC and MC: on the one hand, with only a few hundred constraints, the CSC method could only find marginally better partitions; on the other hand, although the MC method worked better than CSC on some datasets, it still performed much worse than Self-Taught when the number of observed constraints was small. This justifies the benefit of our self-teaching process, which combines the graph affinity matrix with the matrix completion technique to recover unobserved constraints. We also notice that the standard deviation of our algorithm over the 10 random constraint sets is usually small, which suggests more stable performance in practice.

In Figure 4 we show the self-taught constraint matrix Q learned from a set of N random constraints, where N is the number of instances in the dataset. We can see that our self-teaching algorithm was able to accurately recover the ground truth constraint matrix from a very sparse input constraint matrix. Even on the Sonar dataset (e), where the ARI seems low (partially because the input graph is very noisy, see Figure 2(e)), our algorithm still captured the overall structure of the underlying dataset. This experiment empirically shows that our algorithm can provide a reasonably good recovery of the constraint matrix with only a linear number of observed entries, much less than the O(N log^2 N) sample complexity of the standard matrix completion setting [4]. This is due to the use of the affinity matrix, similar to recent work that uses data patterns to reduce the sample complexity [27].

Throughout our experiments, the convergence behavior of our algorithm is consistent. In Figure 7 we show that on the Sonar dataset our algorithm took only 5 outer iterations (see Algorithm 2) to converge. On all the datasets Q usually converged after 5 to 10 outer iterations, while within each outer iteration it usually took 1,000 to 2,000 inner iterations to converge (see Figure 5), depending on how the convergence criterion was set. In our case we set ε_in to 10^{-5}. In practice this convergence result means that we need to compute 5,000 to 10,000 approximate SVDs (a few top singular values only) for each run.

In Figure 6 we show the parameter sensitivity of our algorithm with respect to α and β. In our experiments we fixed both parameters to 0.1; in Figure 6 we vary them over a range of values from 0.02 to 0.2. Overall, the performance of our algorithm is stable over different parameter combinations. One noticeable trend is that the performance decreases when α/β becomes large. This is expected because β is the weight of the constraint matrix Q, and a large ratio α/β suppresses the influence of the learned constraints.

Figure 4: The self-taught constraint matrix Q using N initial constraints (N is the size of the dataset). For each dataset, on the left is the visualization of Ω, where 1 means the entry is observed; on the right is Q obtained from the self-teaching process. With only a linear number of observed entries, our algorithm achieves a good recovery of the ground truth constraint matrix. Panels: (a) Hepatitis (ARI=0.60), (b) Iris (ARI=0.83), (c) Glass (ARI=0.96), (d) Heart (ARI=0.88), (e) Sonar (ARI=0.33), (f) 20 Newsgroups (ARI=0.79).

Figure 5: A typical convergence curve of our objective in Eq.(5.9) (objective value vs. number of inner iterations) within one outer iteration. Each "step" corresponds to a change in µ_0 (see Algorithm 1).

Figure 6: Parameter sensitivity of our algorithm on Iris (Adjusted Rand Index for α and β ranging from 0.02 to 0.2).

Figure 7: The convergence behavior of our algorithm on Sonar with 300 random constraints; panels (a) through (e) show Q at outer iterations 1 through 5. The self-taught constraint matrix Q quickly converged to Q* after only 5 outer iterations (Accuracy=0.906).

7 Conclusion

In this work we investigate a novel problem: how to incorporate self-teaching into constrained spectral clustering. Our objective is to augment the given constraint set without expert guidance so that existing constrained spectral clustering schemes can achieve better performance using a small number of constraints. To achieve this goal, we formulate a bivariate minimization problem that simultaneously exploits the affinity structure of the graph and the low-rank property of the ground truth constraint matrix. We present an efficient alternating minimization scheme to derive the optimal solution. Empirical results on real benchmark datasets clearly demonstrate the effectiveness of our proposed approach. Our future work includes further improving the efficiency to adapt to large-scale applications.

Acknowledgments

This work was supported in part by ONR grants N00014-09-1-0712, N00014-11-1-0108, and NSF grant IIS-0801528.

References

[1] A. Asuncion and D. Newman. UCI machine learning repository, 2007.
[2] J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
[3] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
[4] E. J. Candès and T. Tao. The power of convex relaxation: near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
[5] W. Dai, Q. Yang, G.-R. Xue, and Y. Yu. Self-taught clustering. In ICML, pages 200–207, 2008.
[6] I. Davidson and S. S. Ravi. Identifying and generating easy sets of constraints for clustering. In AAAI, pages 336–341, 2006.
[7] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM J. Comput., 36(1):158–183, 2006.
[8] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985.
[9] S. D. Kamvar, D. Klein, and C. D. Manning. Spectral learning. In IJCAI, pages 561–566, 2003.
[10] J. Kawale and D. Boley. Constrained spectral clustering using L1 regularization. In SDM, pages 103–111, 2013.
[11] B. Kulis, S. Basu, I. S. Dhillon, and R. J. Mooney. Semi-supervised graph clustering: a kernel approach. In ICML, pages 457–464, 2005.
[12] T. Lee and A. Shraibman. Matrix completion from any given set of observations. In NIPS, 2013.
[13] Z. Lu and M. Á. Carreira-Perpiñán. Constrained spectral clustering through affinity propagation. In CVPR, 2008.
[14] S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, 128(1-2):321–353, 2011.
[15] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[16] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: transfer learning from unlabeled data. In ICML, pages 759–766, 2007.
[17] F. Shang, L. C. Jiao, Y. Liu, and F. Wang. Learning spectral embedding via iterative eigenvalue thresholding. In CIKM, pages 1507–1511, 2012.
[18] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888–905, 2000.
[19] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
[20] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl. Constrained k-means clustering with background knowledge. In ICML, pages 577–584, 2001.
[21] X. Wang and I. Davidson. Active spectral clustering. In ICDM, pages 561–568, 2010.
[22] X. Wang and I. Davidson. Flexible constrained spectral clustering. In KDD, pages 563–572, 2010.
[23] F. L. Wauthier, N. Jojic, and M. I. Jordan. Active spectral clustering via iterative uncertainty reduction. In KDD, pages 1339–1347, 2012.
[24] M. Xu, R. Jin, and Z.-H. Zhou. Speedup matrix completion with side information: Application to multi-label learning. In NIPS, 2013.
[25] Q. Xu, M. desJardins, and K. Wagstaff. Active constrained clustering by examining spectral eigenvectors. In Discovery Science, pages 294–307, 2005.
[26] Q. Xu, M. desJardins, and K. Wagstaff. Constrained spectral clustering under a local proximity structure assumption. In FLAIRS Conference, pages 866–867, 2005.
[27] J. Yi, L. Zhang, R. Jin, Q. Qian, and A. K. Jain. Semi-supervised clustering by input pattern assisted pairwise similarity matrix completion. In ICML, 2013.
[28] D. Zhang, J. Wang, D. Cai, and J. Lu. Self-taught hashing for fast similarity search. In SIGIR, pages 18–25, 2010.
