minCEntropy: a Novel Information Theoretic Approach for the Generation of Alternative Clusterings

Nguyen Xuan Vinh & Julien Epps
School of Electrical Engineering and Telecommunications
The University of New South Wales, Sydney, Australia & National ICT Australia (NICTA)
{n.x.vinh,j.epps}@unsw.edu.au
Abstract—Traditional clustering has focused on creating a single good clustering solution, while modern, high-dimensional data can often be interpreted, and hence clustered, in different ways. Alternative clustering aims at creating multiple clustering solutions that are both of high quality and distinctive from each other. Methods for alternative clustering can be divided into objective-function-oriented and data-transformation-oriented approaches. This paper presents a novel information-theoretic, objective-function-oriented approach to generate alternative clusterings, in either an unsupervised or semi-supervised manner. We employ the conditional entropy measure to quantify both clustering quality and distinctiveness, resulting in an analytically consistent combined criterion. Our approach employs a computationally efficient nonparametric entropy estimator, which does not impose any assumption on the probability distributions. We propose a partitional clustering algorithm, named minCEntropy, to concurrently optimize both clustering quality and distinctiveness. minCEntropy requires setting only a few rather intuitive parameters, and performs competitively with existing methods for alternative clustering.

Keywords—clustering; alternative clustering; transformation; multi-objective optimization; information theoretic clustering.

I. INTRODUCTION

Traditional clustering algorithms have mostly focused on creating a single good clustering solution. Data, however, often bear multiple equally reasonable clusterings, especially in high-dimensional settings, because they can often be interpreted in different ways. This observation has led to the recent emergence of the field of alternative clustering, which aims at creating different clustering solutions that are both of high quality and distinctive from each other, thus providing users with multiple, alternative views of the data structure. In our observation, current approaches for alternative clustering can be roughly divided into two categories: (i) the objective-function-oriented approach, in which the alternative clustering process is guided by a diversity-aware objective function that drives the search away from one or multiple existing target clusterings; and (ii) the data-transformation-oriented approach, wherein the alternative clustering process is mainly guided by a data transformation prior to using a regular clustering algorithm. Herein we give a brief review of approaches in these two categories:

- Transformation approaches take in an existing target clustering, then transform the data so that the previous clustering is unlikely to be rediscovered, encouraging a clustering algorithm to find novel aspects of the data. An earlier related work is the distance metric learning algorithm by Xing et al. [1] in the context of constrained clustering, which transforms the data so that a number of pre-specified data pairs appear closer to each other in the transformed space. Davidson & Qi [2] used distance metric learning to learn the characteristic transformation matrix of a given clustering. They then used singular value decomposition and inverted the stretcher matrix to obtain the Alternative Distance Function Transformation (ADFT). ADFT transforms the data so that data points that previously appeared in the same cluster now lie far apart, and vice versa. Using a similar idea, Qi & Davidson [3] recently proposed another transformation approach, which transforms the data such that the new data preserve most statistical characteristics of the original data in terms of the Kullback-Leibler divergence between the empirical distributions, while each transformed data point is placed further from its current cluster center, so as to encourage subsequent clustering algorithms to discover a different cluster structure. Cui et al. [4] motivated their transformation approach from the viewpoint of orthogonalization, and proposed two orthogonal transformations, called orthogonal clustering and orthogonal subspaces. Both orthogonal transforms successively project the data onto the residue space that is orthogonal to the space containing the current clustering.

- Objective-function-oriented approaches: Works in this category can be further divided into semi-supervised methods, which take an existing target clustering as negative information, and unsupervised methods. Semi-supervised methods include the seminal Conditional Information Bottleneck (CIB) method [5], which extends the information bottleneck framework by conditioning on a known, given, existing clustering; the COALA hierarchical technique [6], which attempts to satisfy as many of the cannot-link constraints generated from a given clustering as possible; and the recently proposed NACI [7] algorithm, which uses the mutual information to quantify both clustering quality and distinctiveness from a given clustering. Unsupervised methods can be concurrent methods, such as the decorrelated K-means and convolutional-EM algorithms [8], which concurrently fit two clusterings to the data subject to maximizing a
measure of orthogonality between clusterings; and the CAMI algorithm [9], which concurrently fits two Gaussian mixture models to the data subject to minimizing the mutual information between the two clusterings. Unsupervised methods can also proceed in a sequential manner, in which clusterings are discovered one after another, such as the minCEntropy algorithm proposed in this paper. Apart from the mentioned approaches, there are other hybrid approaches that make use of both data transformation and a diversity-aware objective function. Meta clustering [10] can be regarded as such a method, though the transformation is done in a random fashion via feature weighting, and an objective function is only used in the post-processing step to group the clusterings. A more recent approach is the work by Niu et al. [11], which combines data transformation with spectral clustering.

This paper proposes a novel information-theoretic, objective-function-oriented approach to generate clusterings and alternative clusterings. We employ the conditional entropy to quantify both clustering quality and diversity, resulting in an analytically consistent combined objective. We propose a partitional clustering algorithm, named minCEntropy, to concurrently optimize both objectives. Compared with other approaches, minCEntropy has several distinct features: (i) despite being built upon measures of information, it does not place any assumption on the data distribution. Indeed, our method does not make use of the conventional Shannon entropy, but is based on a generalized measure of entropy, namely the Havrda-Charvat α-structural entropy. In combination with Parzen window density estimation, the resulting clustering formulation is conceptually clear, simple and easy to implement. (ii) We propose a simple yet effective heuristic for setting the kernel parameter and the quality-diversity trade-off parameter. (iii) minCEntropy can be flexibly used in both unsupervised and semi-supervised settings. It can be used in conjunction with other data-transformation-oriented approaches, and can take in either one or multiple existing target clusterings.

The paper is organized as follows: Sections II and III lay out our minCEntropy approach for generating clusterings and alternative clusterings. Experimental results are presented in Section IV, followed by discussions and conclusions.
II. MINCENTROPY – AN INFORMATION THEORETIC CRITERION FOR CLUSTERING

A. Method

Given a data set X = {x_1, x_2, ..., x_N} of N data items in R^d, a partitional clustering C = {c_1, c_2, ..., c_K} is a way to divide X into K non-overlapping subsets. Let \mathcal{C} be the space of all possible K-cluster partitions of X. We are interested in identifying the clustering C^* in \mathcal{C} which maximizes the mutual information between the data and the clustering:

C^* = \arg\max_{C \in \mathcal{C}} \{ I(C;X) \}    (1)

or equivalently, minimizes the conditional entropy:

C^* = \arg\min_{C \in \mathcal{C}} \{ H(X|C) \}    (2)

since I(C;X) = H(X) − H(X|C), and H(X) is constant. Such an objective function for clustering has been previously motivated and used in various works [7], [12]–[14]. Intuitively, it says that a good clustering should maximize the information shared between the clustering and the data, or equivalently minimize the data uncertainty given the cluster labels. This objective also has an interesting interpretation similar to the likelihood objective L(X|Θ) in model-based clustering, where Θ is the set of parameters governing the model. If a clustering C is regarded as a hypothesized structure or model imposed on the data, then H(X|C) measures how well the model fits our data.

Without making strong assumptions on the data distribution, it is generally hard to estimate the conditional entropy in (2). Here we employ the original idea of Principe et al. [15], which allows us to proceed. These authors have shown that, using a suitable entropy and distance measure in conjunction with the Parzen window method for density estimation, the resulting distance between the probability distribution functions (pdfs) admits a computationally efficient form. More specifically, they proposed using the quadratic Renyi entropy and the squared Euclidean distance I_ED between the pdfs. Strictly speaking, I_ED is a measure of shared information, but it is not the mutual information itself by definition. In this paper, we present a different approach that allows us to proceed more straightforwardly, in the sense that it uses the mutual information directly. We shall make use of the general Havrda-Charvat α-structural entropy [16], defined as:

H_\alpha = (2^{1-\alpha} - 1)^{-1} \left( \sum_{k=1}^{K} p_k^\alpha - 1 \right), \quad \alpha > 0, \ \alpha \neq 1    (3)

In this paper we employ the following quadratic Havrda-Charvat entropy (α = 2, with the constant coefficient discarded for simplicity):

H_2 = 1 - \sum_{k=1}^{K} p_k^2    (4)

The rationale behind this choice is that, as we shall see shortly, this entropy admits a very computationally efficient estimator. The conditional quadratic Havrda-Charvat entropy of X given C is defined as:

H_2(X|C) = \sum_{k=1}^{K} p(c_k) H_2(X|C = c_k)    (5)

With this measure of entropy, our objective in (2) becomes:

C^* = \arg\min_{C \in \mathcal{C}} \sum_{k=1}^{K} p(c_k) H_2(X|C = c_k)    (6)
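As a quick numerical illustration of (3) and (4) (our own toy example, not part of the paper; the probabilities below are arbitrary), setting α = 2 in (3) yields exactly twice the quantity in (4), confirming that only a constant coefficient is discarded:

```python
import numpy as np

def havrda_charvat(p, alpha):
    """Havrda-Charvat alpha-structural entropy of a discrete distribution, eq. (3)."""
    p = np.asarray(p, dtype=float)
    return (np.sum(p ** alpha) - 1.0) / (2.0 ** (1.0 - alpha) - 1.0)

p = [0.5, 0.3, 0.2]                    # arbitrary toy distribution
print(havrda_charvat(p, alpha=2.0))    # 1.24 = 2 * (1 - sum p_k^2)
print(1.0 - np.sum(np.square(p)))      # 0.62, the quadratic entropy H2 of eq. (4)
```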
Now let us revisit the Parzen window method for probability density estimation, and see how this method, when combined with the objective above, results in a neat estimator. Let G(·) be the Gaussian kernel in d-dimensional space:

G(x - a, \sigma^2) = \frac{1}{(2\pi\sigma)^{d/2}} \exp\left( \frac{-\|x - a\|^2}{2\sigma^2} \right)    (7)

where σ is the kernel width parameter and a is the center of the Gaussian window. The density estimate of X is then given by:

p(x) = \frac{1}{N} \sum_{i=1}^{N} G(x - x_i, \sigma^2)    (8)

Now the quadratic entropy of p(x) can be estimated as:

H_2(X) = 1 - \int_x p^2(x)\, dx = 1 - \int_x \left( \frac{1}{N} \sum_{i=1}^{N} G(x - x_i, \sigma^2) \right)^2 dx = 1 - \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G(x_i - x_j, 2\sigma^2)    (9)

wherein we have employed a nice property of the Gaussian kernel, namely that the convolution of two Gaussians remains a Gaussian:

\int_x G(x - x_i, \sigma^2)\, G(x - x_j, \sigma^2)\, dx = G(x_i - x_j, 2\sigma^2)    (10)

Similarly we have:

H_2(X|C = c_k) = 1 - \frac{1}{n_k^2} \sum_{x_i \in c_k} \sum_{x_j \in c_k} G(x_i - x_j, 2\sigma^2)

where n_k is the number of data items in cluster c_k. Given this estimation, our objective in (6) becomes:

C^* = \arg\max_{C \in \mathcal{C}} \left\{ \sum_{k=1}^{K} p(c_k) \frac{1}{n_k^2} \sum_{x_i \in c_k} \sum_{x_j \in c_k} G(x_i - x_j, 2\sigma^2) \right\}

Noting that the probability of encountering cluster c_k in C is n_k/N, we finally have C^* = \arg\max_{C \in \mathcal{C}} CE(C), with

CE(C) = \sum_{k=1}^{K} \frac{\sum_{x_i, x_j \in c_k} \exp\left( \frac{-\|x_i - x_j\|^2}{4\sigma^2} \right)}{n_k}    (11)

being our objective function. By maximizing CE(C) we minimize the conditional entropy criterion.

B. Comparison with the K-means Clustering Formulation

The K-means clustering algorithm minimizes the following total sum-of-squares objective function:

SS(C) = \sum_{k=1}^{K} \sum_{x \in c_k} \|x - \mu_k\|^2    (12)

where \mu_k = \sum_{x \in c_k} x / n_k is the centroid of cluster c_k. This objective can also be expressed in terms of the pairwise distances between data items within clusters [17]:

SS(C) \equiv \frac{1}{2} \sum_{k=1}^{K} \frac{1}{n_k} \sum_{x_i, x_j \in c_k} \|x_i - x_j\|^2    (13)

Interestingly, this objective function is exactly analogous to our objective in (11). While K-means aims to minimize the weighted average intra-cluster distance, measured by the squared Euclidean distance, the minimum conditional entropy criterion aims to maximize the weighted average intra-cluster similarity, as judged by the Gaussian kernel.

C. minCEntropy – a Hill Climbing Strategy to Optimize the Minimum Conditional Entropy Objective

Despite its simple appearance, the K-means clustering problem has been shown to be NP-hard even for K = 2 [18]. Therefore, although the hardness of the minimum conditional entropy optimization problem is still an open question, a heuristic approach seems warranted. In this section, we develop a hill climbing approach, called minCEntropy, to iteratively improve the conditional entropy objective function. The strategy resembles that of K-means: in each iteration, each data point is considered for a move to the cluster that is closest to it. While in K-means the point-to-cluster closeness is judged by the distance from a point to a cluster center, in minCEntropy clustering it is judged by the total similarity between a point and all the other data points in a cluster. Our algorithm makes use of the following three data structures:

- S_{N×N}: an N×N array containing the pairwise similarities as judged by the (simplified) Gaussian kernel, i.e. S_{ij} = \exp\left( \frac{-\|x_i - x_j\|^2}{4\sigma^2} \right).
- SPC_{N×K}: an N×K array containing the point-to-cluster similarities, as judged by the total similarity between a point and all the other points in a cluster, i.e. SPC_{ik} = \sum_{x_j \in c_k} S_{ij}.
- Q_{1×K}: a length-K vector containing the cluster qualities, as judged by the total pairwise similarity between points within a cluster, i.e. Q_k = \sum_{x_i, x_j \in c_k} S_{ij} = \sum_{x_i \in c_k} SPC_{ik}.

Note that with this notation, CE(C) = \sum_{k=1}^{K} Q_k / n_k. Now suppose that we are considering moving a data item x_i from its current cluster c_l to cluster c_k. With this change, only the qualities of c_l and c_k are affected, i.e.:

Q_l^{new} = \sum_{x_p, x_j \in c_l \setminus \{x_i\}} G(x_p - x_j, 2\sigma^2) = Q_l - 2\, SPC_{il}

Q_k^{new} = \sum_{x_p, x_j \in c_k \cup \{x_i\}} G(x_p - x_j, 2\sigma^2) = Q_k + 2\, SPC_{ik}
The change in the objective value is therefore:

\Delta CE(C \mid x_i, c_l \to c_k) = \frac{Q_l^{new}}{n_l - 1} + \frac{Q_k^{new}}{n_k + 1} - \frac{Q_l}{n_l} - \frac{Q_k}{n_k} = \frac{Q_l}{n_l(n_l - 1)} - \frac{Q_k}{n_k(n_k + 1)} - \frac{2\, SPC_{il}}{n_l - 1} + \frac{2\, SPC_{ik}}{n_k + 1}    (14)

For each datum x_i, the membership re-assignment that results in the largest objective improvement is c_k = \arg\max_{c \in C} \Delta CE(C \mid x_i, c_l \to c). If k ≠ l, the datum is moved using the following procedure:

Update(x_i, c_l → c_k):
  c_l ← c_l \ {x_i}
  c_k ← c_k ∪ {x_i}
  for j = 1 to N do
    SPC_{jl} ← SPC_{jl} − S_{ji}
    SPC_{jk} ← SPC_{jk} + S_{ji}
  end for
  Q_l ← Q_l − 2 SPC_{il}
  Q_k ← Q_k + 2 SPC_{ik}

We are now ready to state the minCEntropy algorithm:

Algorithm 1 minCEntropy
Input: The data set X = {x_1, x_2, ..., x_N}, the number of desired clusters K. Optional: a hard initial clustering C = {c_1, c_2, ..., c_K} on X.
Initialization:
  - Calculate the pairwise similarity kernel matrix S.
  - Generate a random initial hard clustering C (if none provided).
  - For C, calculate the point-to-cluster similarity matrix SPC and the cluster quality vector Q.
  - continue ← TRUE
Begin:
  while continue do
    continue ← FALSE
    for i = 1 to N do
      c_k = arg max_{c ∈ C} ΔCE(C | x_i, mem(x_i) → c)
      if ΔCE(C | x_i, mem(x_i) → c_k) > 0 then
        Update(x_i, mem(x_i) → c_k)
        continue ← TRUE
      end if
    end for
  end while
Note: mem(x_i) = {c_l | x_i ∈ c_l}

Convergence: The algorithm loops until there is no possible single membership change that strictly decreases the conditional entropy. Since the conditional entropy is lower-bounded by zero, and since there is only a finite number of possible clusterings, the algorithm will eventually terminate after a finite number of steps.

Proposition 1: minCEntropy converges to a local minimum of H(X|C). Indeed, H(X|C) is a discrete function of the discrete variable C, and the problem of minimizing H(X|C) is a discrete optimization problem over the space of clusterings \mathcal{C}. At convergence, no single change in membership can improve the objective, meaning that no neighboring clustering of C^* is better than C^* itself. Therefore, C^* is a local optimum of H(X|C).

Complexity analysis: The initialization step costs O(N^2 d) time, since the complete kernel matrix needs to be set up. In practice, only N(N−1)/2 distances need to be computed, due to symmetry. Finding the best target cluster for each datum costs O(K) time, and the Update procedure costs O(N) time. The cost of the main loop of the algorithm is therefore O(IN(K + ηN)), where I is the number of iterations and 0 < η < 1 is the expected ratio of data items that change membership. In our observation, the number of membership changes is large for the first few iterations, then quickly decreases as the algorithm converges. Overall, the time complexity of minCEntropy is dominated by the quadratic cost of computing the kernel matrix. With a fixed kernel, after computing the kernel matrix S, it is a good strategy to run the algorithm several times with different initializations to obtain a better optimum.
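To make the bookkeeping of Algorithm 1 concrete, the following is a minimal NumPy sketch of the minCEntropy iteration, written by us for illustration: it builds the simplified kernel S_ij = exp(−‖x_i − x_j‖²/(4σ²)), maintains SPC and Q, and applies the gain formula (14). The function name, the empty-cluster guard and the iteration cap are our own choices, not the authors' implementation.

```python
import numpy as np

def mincentropy(X, K, sigma, labels=None, max_iter=100, seed=None):
    """Hill-climbing sketch of minCEntropy: maximize CE(C) = sum_k Q_k / n_k."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    labels = rng.integers(K, size=N) if labels is None else np.asarray(labels).copy()
    # Pairwise similarity kernel S_ij = exp(-||x_i - x_j||^2 / (4 sigma^2)).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-sq / (4.0 * sigma ** 2))
    # Point-to-cluster similarities SPC (N x K), cluster sizes n and qualities Q (K).
    member = np.eye(K)[labels]                      # one-hot membership matrix
    SPC = S @ member
    n = member.sum(axis=0)
    Q = np.array([S[labels == k][:, labels == k].sum() for k in range(K)])
    for _ in range(max_iter):
        moved = False
        for i in range(N):
            l = labels[i]
            if n[l] <= 1:                           # do not empty a cluster
                continue
            # Gain of moving x_i from c_l to each candidate cluster, eq. (14).
            gain = (Q[l] / (n[l] * (n[l] - 1)) - Q / (n * (n + 1))
                    - 2 * SPC[i, l] / (n[l] - 1) + 2 * SPC[i] / (n + 1))
            gain[l] = 0.0                           # staying put brings no gain
            k = int(np.argmax(gain))
            if gain[k] > 0:                         # Update(x_i, c_l -> c_k)
                labels[i] = k
                SPC[:, l] -= S[:, i]
                SPC[:, k] += S[:, i]
                Q[l] -= 2 * SPC[i, l]
                Q[k] += 2 * SPC[i, k]
                n[l] -= 1
                n[k] += 1
                moved = True
        if not moved:                               # no improving single move: converged
            break
    return labels
```

With σ set by the heuristic of Section III-C, a call such as labels = mincentropy(X, K=3, sigma=sigma0) would return a hard K-cluster assignment.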
III. MINCENTROPY+ FOR DISCOVERING ALTERNATIVE CLUSTERINGS

In this section, we develop an extension of minCEntropy to generate one alternative clustering. Let C^1 be a previously provided clustering of R clusters {c^1_1, c^1_2, ..., c^1_R}, produced by either minCEntropy or any other clustering method. We would like to find a clustering C^* which concurrently satisfies two objectives: (i) high quality; and (ii) distinctiveness from C^1. As clustering distinctiveness can be well measured by means of the mutual information between two clusterings, we can formulate the following multi-objective optimization problem:

C^* = \arg\max_{C \in \mathcal{C}} \{ I(C;X) - \lambda I(C;C^1) \}    (15)

where λ is a trade-off factor used to balance the two objectives. Since I(C;C^1) = H(C^1) − H(C^1|C) and H(C^1) is a constant, we have the following equivalent optimization problem:

C^* = \arg\min_{C \in \mathcal{C}} \{ H(X|C) - \lambda H(C^1|C) \}    (16)

Using the notation of the quadratic Havrda-Charvat entropy, we have:

H_2(C^1|C) = 1 - \frac{1}{N} \sum_{k=1}^{K} \sum_{l=1}^{R} \frac{n_{lk}^2}{n_k}    (17)

where n_{lk} denotes the number of data items shared by cluster c_k in C and cluster c^1_l in C^1. The resulting optimization problem is:

C^* = \arg\min_{C \in \mathcal{C}} \left\{ - \sum_{k=1}^{K} \frac{ \frac{1}{(2\pi\sigma)^{d/2}} \sum_{x_i, x_j \in c_k} \exp\left( \frac{-\|x_i - x_j\|^2}{4\sigma^2} \right) }{n_k} + \lambda \sum_{k=1}^{K} \frac{ \sum_{l=1}^{R} n_{lk}^2 }{n_k} \right\}    (18)

A. Objective Calibration and Optimization

Since the two objectives, quality and distinctiveness, are on different scales, it is important to choose a suitable factor λ to calibrate them. Note that for quality, for each c_k we have:

\frac{ \sum_{x_i, x_j \in c_k} \frac{1}{(2\pi\sigma)^{d/2}} \exp\left( \frac{-\|x_i - x_j\|^2}{4\sigma^2} \right) }{n_k} \leq n_k \frac{1}{(2\pi\sigma)^{d/2}}    (19)

while for distinctiveness, for each c_k we have:

\frac{ \sum_{l=1}^{R} n_{lk}^2 }{n_k} \leq \frac{n_k^2}{n_k} = n_k    (20)

We therefore omit the kernel normalization factor (2\pi\sigma)^{-d/2}, bringing the two objectives to a common scale. This omission does not affect our formulation, since the kernel normalization factor is essentially a weighting factor, a role that is now played solely by λ. Our final optimization problem takes the form:

C^* = \arg\max_{C \in \mathcal{C}} \{ CE(C) + \lambda\, DI(C^1|C) \}    (21)

where the new objective is a weighted combination of quality, measured by CE(C) as defined in (11), and diversity, measured by DI(C^1|C) = - \sum_{k=1}^{K} \frac{ \sum_{l=1}^{R} n_{lk}^2 }{n_k}.

Now suppose that we are moving a single data item x_i from cluster c_l to cluster c_k in C. Assume that x_i ∈ c^1_j in C^1. The change in CE(C) is ΔCE(C) as in (14), while the change in DI(C^1|C) can be shown to be:

\Delta DI(C^1|C) = - \frac{ (n_{jl} - 1)^2 + \sum_{r=1, r \neq j}^{R} n_{rl}^2 }{ n_l - 1 } - \frac{ (n_{jk} + 1)^2 + \sum_{r=1, r \neq j}^{R} n_{rk}^2 }{ n_k + 1 } + \sum_{r=1}^{R} \frac{n_{rl}^2}{n_l} + \sum_{r=1}^{R} \frac{n_{rk}^2}{n_k}    (22)

In order to improve our objective we must have:

\Delta CE(C \mid x_i, c_l \to c_k) + \lambda\, \Delta DI(C^1|C, x_i, c_l \to c_k) > 0    (23)

The potential target cluster for each datum x_i is therefore given by c_k = \arg\max_{c \in C} \{ \Delta CE(C) + \lambda \Delta DI(C^1|C) \mid x_i, mem(x_i) \to c \}. By replacing this search operation in minCEntropy (Algorithm 1), we obtain the minCEntropy+ algorithm for discovering an alternative clustering distinctive from a given clustering. It can be easily shown that minCEntropy+ converges to a local minimum of H(X|C) − λH(C^1|C), and admits a complexity of O(IN(KR + ηN)) for the main loop.

B. minCEntropy++ for Discovering an Alternative Clustering that is Different from Multiple Given Clusterings

minCEntropy+ can be easily extended to discover an alternative clustering that is different from multiple pre-given clusterings. Using the material developed in the previous sections, this extension is straightforward. Let {C^1, C^2, ..., C^M} be M given clusterings, with {K^1, K^2, ..., K^M} clusters respectively. We would like to discover a clustering C^* that is (i) of high quality; and (ii) distinctive from {C^1, C^2, ..., C^M}. That is, we would like to optimize all of the terms CE(C) + λDI(C^1|C), CE(C) + λDI(C^2|C), ..., CE(C) + λDI(C^M|C). A reasonable objective would therefore be:

C^* = \arg\max_{C \in \mathcal{C}} \left\{ M \cdot CE(C) + \lambda \sum_{u=1}^{M} DI(C^u|C) \right\}    (24)

The potential target cluster for each datum x_i is now given by c_k = \arg\max_{c \in C} \{ M \cdot \Delta CE(C) + \lambda \sum_{u=1}^{M} \Delta DI(C^u|C) \mid x_i, mem(x_i) \to c \}. It can be shown that minCEntropy++ converges to a local minimum of H(X|C) - \lambda \sum_{u=1}^{M} H(C^u|C), and admits a complexity of O(IN(K \sum_{u=1}^{M} K^u + \eta N)) for the main loop.

C. Parameter Setting

Kernel width parameter σ: There is a vast literature on kernel width parameter tuning. Herein, we adapt a simple heuristic proposed in [19]. In his work on feature extraction using transformation, Torkkola suggested using a kernel width σ of half the distance between the two furthest data points in the output space. A tempting idea is therefore to choose σ as half the distance between the two furthest data points in our data space. This choice is, however, susceptible to outliers. We have experimentally taken σ as half the average pairwise distance, i.e. \sigma_0 = \frac{1}{2N^2} \sum_{x_i, x_j \in X} \|x_i - x_j\|, which seems to work fairly well compared with some other approaches that we have implemented, such as kernel width annealing, in which one starts with a large kernel and then gradually reduces it in a deterministic annealing fashion, or variable kernel width [20], in which each datum has its own kernel width based on the local density.

The trade-off parameter λ plays an important role in calibrating the two objectives, quality and diversity. As λ → 0, minCEntropy+/++ behave like the original minCEntropy, while as λ → ∞, very low quality clusterings that are maximally different from the given clusterings are created. A general rule of thumb is to equalize the range of quality and diversity, and probably put slightly more weight on the quality side. We propose the following simple heuristic: at the end of the initialization step, set λ = (1/m)λ_0 with \lambda_0 \equiv M \cdot CE(C) / \sum_{u=1}^{M} |DI(C^u|C)| (roughly meaning that we judge quality m times as important as diversity). Generally we have found that m ∈ [1, 3] works reasonably well for minCEntropy+. Note that if the algorithm is to be restarted with different initializations, λ needs to be set only once in the beginning, so that the objective values are comparable across runs.
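For illustration, the sketch below (our own code with hypothetical function names, not the authors' Matlab implementation) computes the diversity term DI(C^u|C) from the contingency table, the kernel width heuristic σ0, and the trade-off factor λ = (1/m)·M·CE(C)/Σ_u |DI(C^u|C)| described above; cluster labels are assumed to be integer-coded NumPy arrays and clusters are assumed non-empty.

```python
import numpy as np

def diversity(labels, given):
    """DI(C^u | C) = -sum_k (sum_l n_lk^2) / n_k, from the R x K contingency table."""
    R, K = given.max() + 1, labels.max() + 1
    n_lk = np.zeros((R, K))
    np.add.at(n_lk, (given, labels), 1)          # n_lk = |c_k intersect c^u_l|
    n_k = n_lk.sum(axis=0)                       # cluster sizes of C
    return -float(((n_lk ** 2).sum(axis=0) / n_k).sum())

def quality(S, labels):
    """CE(C) = sum_k Q_k / n_k, with S the (simplified) Gaussian kernel matrix."""
    return float(sum(S[labels == k][:, labels == k].sum() / (labels == k).sum()
                     for k in np.unique(labels)))

def kernel_width(X):
    """Heuristic sigma_0 = (1 / 2N^2) * sum_{i,j} ||x_i - x_j||: half the average pairwise distance."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    return 0.5 * d.mean()

def tradeoff(S, labels, given_list, m=2.0):
    """lambda = (1/m) * M * CE(C) / sum_u |DI(C^u | C)|, set once after initialization."""
    M = len(given_list)
    return M * quality(S, labels) / (m * sum(abs(diversity(labels, g)) for g in given_list))
```

These quantities would be computed once at the end of the initialization step, as described above, so that objective values remain comparable across restarts.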
IV. EXPERIMENTAL RESULTS

A. Method

We compared the minCEntropy and minCEntropy+ algorithms (herein referred to as minCEntropy/+) against several recently proposed methods:

- Transformation approaches: we employed the two orthogonal transformations proposed by Cui et al. [4], herein referred to as Ortho1 and Ortho2; the ADFT transform [2]; and the transformation in [3], herein referred to as Qi's transform. All authors reported good performance with the K-means algorithm. In this work, we used K-means and spectral clustering to accompany these transformations. On image data, we employed the regular K-means algorithm, while on text data, we used the spherical K-means variant (S. K-means), which works better in the text domain [21]. On image data, we employed the self-tuning spectral clustering algorithm (S.T. Spectral) [20], with a Gaussian kernel and a variable kernel width for each datum, while on text data we used the regular spectral clustering algorithm with the dot product between normalized feature vectors as the similarity measure.

- Objective-function-oriented approaches: We took the decorrelated K-means (Dec. K-means) algorithm for comparison [8]. Note that Dec. K-means concurrently generates two clusterings, and therefore it is not possible to incorporate side information, such as a ground-truth first clustering, in this case. We now digress a bit and introduce a modified, semi-supervised version of the Dec. K-means algorithm (S-Dec. K-means) that can take in a clustering as negative information. More specifically, Dec. K-means employs gradient descent to optimize the following objective function:

SK(\mu_{1...K}, \nu_{1...R}) = \sum_i \sum_{z \in c^1_i} \|z - \mu_i\|^2 + \sum_j \sum_{z \in c^2_j} \|z - \nu_j\|^2 + \lambda \sum_{i,j} (\beta_j^T \mu_i)^2 + \lambda \sum_{i,j} (\alpha_i^T \nu_j)^2

where α_i, μ_i are respectively the mean and representative vectors of cluster c^1_i in clustering C^1 with K clusters, and β_j, ν_j are respectively the mean and representative vectors of cluster c^2_j in clustering C^2 with R clusters. Now suppose that we are given a target clustering C^1; we simply set μ_i = α_i for every cluster in C^1, and fix this set of mean and representative vectors during the iterations. S-Dec. K-means then optimizes the objective with respect to the second clustering only.

We gave the true number of clusters in each clustering to all algorithms. Dec. K-means generated two clusterings concurrently, which were then matched against all the ground-truth clusterings to find the best match. For sequential methods, K-means, spectral clustering and minCEntropy first generated the first dominant clustering. Transformed data were generated with input from either the first clustering generated by the algorithms, or the ground-truth first clustering. Note that since the ADFT transform (with the distance metric learning algorithm) is almost three orders of magnitude slower than the other transforms, we only tested it once, using the ground-truth first clustering as input. S-Dec. K-means was tested with the first ground-truth clustering given. We also tested objective-function-oriented approaches on top of transformation approaches.

1) Initialization and Parameter Setting: The parameters for minCEntropy/+ were set according to Section III-C. Unless otherwise stated, we set m = 2 for the trade-off parameter. On each data set, minCEntropy/+ was executed 10 times, each with 10 different random initializations (for 100 runs in total). Note that for minCEntropy+ these 10 objective values are not strictly comparable due to the difference in the trade-off factor λ. We also ran the other competitor algorithms 10 times, each with 10 different initializations. The trade-off parameters for Dec. K-means and S-Dec. K-means were set according to the authors' suggestions [8].

2) Quality Assessment and Report: Clustering quality was assessed using the Adjusted Mutual Information (AMI) [22], an adjusted-for-chance version of the normalized mutual information [23], between the clustering and the ground-truth classification. For visualization, we selected the best clustering in terms of the objective value, while for tabulation, we reported the AMI mean±standard deviation values.

B. Artificial data

As a sanity check, we first tested the minCEntropy/+/++ clustering algorithms on several artificial data sets. minCEntropy discovered the first clustering. Its output was then given to minCEntropy+/++ to discover the second/third clustering.

Set 1: We generated 6 Gaussian clusters, each with 100 points in 2 dimensions, as presented in Fig. 1(a). When the number of clusters K is set to 3, it can be seen that there are two reasonable clustering solutions, discovered sequentially by minCEntropy/+ as expected (Fig. 1(b-c)). On the same data set, if we set K = 2, minCEntropy/+ discovered the first two clusterings as in Fig. 1(d-e). Taking the minCEntropy/+ clusterings as input, minCEntropy++ generated the third clustering as in Fig. 1(f). If m was set to 3 for the trade-off parameter, then minCEntropy+ and minCEntropy++ generated the two alternative clusterings as in Fig. 1(g-h), which indeed consist of more compact clusters, since now the weight on clustering quality has been increased.

Set 2: We generated 8 Gaussian clusters of different sizes and densities, each with 100 points in 2 dimensions, as presented in Fig. 1(i). With K = 4, minCEntropy/+ discovered the two reasonable clusterings as in Fig. 1(j-k). Given K = 2, minCEntropy/+/++ sequentially discovered the three clusterings as in Fig. 1(l-n).
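A minimal sketch of how such artificial data can be generated is given below; the cluster centers are our own assumption (the paper does not specify them), chosen on a 2×3 grid so that both a 2-cluster and a 3-cluster grouping are reasonable.

```python
import numpy as np

rng = np.random.default_rng(0)
# Six 2-D Gaussian clusters of 100 points each, placed on an assumed 2 x 3 grid of
# centers so that grouping by row (K = 2) and by column (K = 3) are both plausible.
centers = [(-7, -3), (0, -3), (7, -3), (-7, 5), (0, 5), (7, 5)]
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(100, 2)) for c in centers])
```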
Figure 1. minCEntropy/+/++ clustering results on the 6-Gaussian-cluster data set (a-h) and the 8-Gaussian-cluster data set (i-n). Panels: (a) original data; (b) minCEntropy clustering, K=3; (c) minCEntropy+ clustering, K=3; (d) minCEntropy clustering, K=2; (e) minCEntropy+ clustering, K=2, m=2; (f) minCEntropy++ clustering, K=2, m=2; (g) minCEntropy+ clustering, K=2, m=3; (h) minCEntropy++ clustering, K=2, m=3; (i) original data; (j) minCEntropy clustering, K=4; (k) minCEntropy+ clustering, K=4; (l) minCEntropy clustering, K=2; (m) minCEntropy+ clustering, K=2; (n) minCEntropy++ clustering, K=2.
C. Real data

Due to the lack of real data sets with multiple good ground-truth clusterings, we tested minCEntropy/+ and the other algorithms on several data sets with two known, good ground-truth classifications.

1) The ALOI data set: The Amsterdam Library of Object Images (ALOI) [24] consists of 110,250 images of 1000 common objects. For each object, a number of photos are taken from different angles and under various lighting conditions. Based on color and texture histograms, we first extracted a dense 641-dimensional feature vector for each object, using the method described in [25]. For a test set, we chose 9 objects of different shapes and colors, as exemplified in Fig. 2, for a total of 984 images. We then applied principal component analysis (PCA) to reduce the number of dimensions to 16, which retains more than 90% of the variance of the original data.

Figure 2. 9 objects of different shapes and colors chosen for clustering.

It can be seen that there are two reasonable ways to cluster the objects: by color (K1 = 3) or by shape (K2 = 3). We first ran the minCEntropy algorithm, for which the result is presented in Fig. 3 (top row), where it can be seen that the objects are clustered by color, which is the dominant first clustering for this data set. We next executed the minCEntropy+ algorithm on the same data set, with the minCEntropy clustering result given as C^1. The result is presented in Fig. 3 (bottom row), where it can be seen that the objects are now grouped by their shapes, as expected. The full clustering results for all algorithms are presented in Table I.

Figure 3. minCEntropy/+ clustering on the ALOI data set. Displayed is the mean image of each cluster: (top row) minCEntropy clustering, objects are grouped by color; (bottom row) minCEntropy+ alternative clustering, objects are grouped by shape. The numbers denote the percentage of the dominant color (shape) in each cluster.

2) The CMUFace data set: The CMUFace data set, drawn from the UCI repository [26], contains 624 32×30 images of 20 people, taken with varying pose (straight, left, right, up), expression (neutral, happy, sad, angry) and eyes (wearing sunglasses or not). We applied PCA to reduce the number of dimensions to 39, which explains more than 90% of the original data variance. minCEntropy was first applied, with K1 = 20 (for the 20 people in the database). It can be observed in Fig. 4 that the objects are clustered by person.
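As an illustration of this dimensionality-reduction step (a sketch under our own assumptions; scikit-learn is our assumed tool, not one the authors state they used, and the feature matrix below is a placeholder), PCA retaining at least 90% of the variance can be applied as follows:

```python
import numpy as np
from sklearn.decomposition import PCA

X_features = np.random.rand(984, 641)   # placeholder for the 984 x 641 ALOI feature matrix
pca = PCA(n_components=0.90)            # keep the fewest components explaining >= 90% variance
X_reduced = pca.fit_transform(X_features)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```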
Table I. CLUSTERING RESULTS ON THE ALOI AND CMUFACE DATA SETS (AMI VALUES)

Algorithm                      | ALOI 1st (Color) | ALOI 2nd (Shape) | ALOI 2nd* (Shape) | CMUFace 1st (ID) | CMUFace 2nd (Pose) | CMUFace 2nd* (Pose)
minCEntropy/+                  | 0.98±0.00        | 0.68±0.06        | 0.71±0.00         | 0.69±0.03        | 0.13±0.05          | 0.47±0.01
minCEntropy/+ (ADFT)           | 0.98±0.00        | N/A              | 0.64±0.12         | 0.69±0.03        | N/A                | 0.47±0.01
minCEntropy/+ (Qi's trans.)    | 0.98±0.00        | 0.70±0.05        | 0.66±0.11         | 0.69±0.03        | 0.13±0.05          | 0.47±0.01
minCEntropy/+ (Ortho1)         | 0.98±0.00        | 0.70±0.05        | 0.65±0.11         | 0.69±0.03        | 0.13±0.05          | 0.47±0.01
minCEntropy/+ (Ortho2)         | 0.98±0.00        | 0.69±0.07        | 0.71±0.00         | 0.69±0.03        | 0.13±0.06          | 0.47±0.01
K-means (ADFT)                 | 0.97±0.00        | N/A              | 0.58±0.10         | 0.72±0.03        | N/A                | 0.01±0.00
K-means (Qi's trans.)          | 0.97±0.00        | 0.42±0.15        | 0.32±0.16         | 0.72±0.03        | 0.05±0.05          | 0.41±0.06
K-means (Ortho1)               | 0.97±0.00        | 0.70±0.00        | 0.66±0.09         | 0.72±0.03        | 0.18±0.05          | 0.45±0.00
K-means (Ortho2)               | 0.97±0.00        | 0.65±0.10        | 0.63±0.11         | 0.72±0.03        | 0.13±0.06          | 0.55±0.01
S.T. Spectral (ADFT)           | 0.99±0.00        | N/A              | 0.10±0.00         | 0.64±0.01        | N/A                | 0.01±0.00
S.T. Spectral (Qi's trans.)    | 0.99±0.00        | 0.20±0.00        | 0.20±0.00         | 0.64±0.01        | 0.03±0.01          | 0.08±0.02
S.T. Spectral (Ortho1)         | 0.99±0.00        | 0.22±0.01        | 0.27±0.00         | 0.64±0.01        | 0.11±0.02          | 0.55±0.00
S.T. Spectral (Ortho2)         | 0.99±0.00        | 0.26±0.02        | 0.26±0.02         | 0.64±0.01        | 0.05±0.01          | 0.55±0.01
Dec. K-means                   | 0.82±0.14        | 0.20±0.06        | N/A               | 0.72±0.04        | 0.15±0.07          | N/A
S-Dec. K-means                 | N/A              | N/A              | 0.24±0.09         | N/A              | N/A                | 0.55±0.01
S-Dec. K-means (ADFT)          | N/A              | N/A              | 0.68±0.00         | N/A              | N/A                | 0.06±0.02
S-Dec. K-means (Qi's trans.)   | N/A              | N/A              | 0.39±0.11         | N/A              | N/A                | 0.31±0.08
S-Dec. K-means (Ortho1)        | N/A              | N/A              | 0.69±0.00         | N/A              | N/A                | 0.45±0.00
S-Dec. K-means (Ortho2)        | N/A              | N/A              | 0.66±0.05         | N/A              | N/A                | 0.52±0.00
*: 2nd clustering, with the 1st ground-truth clustering given to the algorithm.

Figure 4. minCEntropy clustering result on the CMUFace data set - clustering by identity. Displayed is the mean image of each cluster. The numbers denote the percentage of the dominant person in each cluster.

Given the ground-truth person clustering and setting the number of clusters K2 = 4, minCEntropy+ discovered the second clustering, in which the objects are clustered by the four different poses: straight, up, left and right, as demonstrated in Fig. 5. The full results for all algorithms are presented in Table I.

Figure 5. minCEntropy+ alternative clustering result on the CMUFace data set - clustering by pose. Displayed is the mean image of each cluster. The numbers denote the percentage of the dominant pose in each cluster.

3) The Reuters Corpus Volume I (RCV1) data set: We tested the clustering algorithms on RCV1, an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes [27]. For a test set, we selected all documents in one of the three meta topics, namely Corporate/Industrial (CCAT), Economics (ECAT) and Markets (MCAT), that also contain one of the three country names U.S., Japan and China in their titles. We preprocessed the data by removing common words and rare words, stemming, and using TF-IDF weighting to construct the feature vectors. The final data matrix contains 378 documents and 242 words. As per common practice for text data, the feature vector for each document was normalized to have unit norm. The two ground-truth clusterings for this data set are thus topic (ECAT, MCAT, CCAT) and country (U.S., Japan, China), with topic being the dominant clustering, discovered first by all algorithms. The full clustering results for all algorithms are presented in Table II.

4) The WebKB data set: The WebKB data set (www.cs.cmu.edu/~webkb) contains HTML documents collected mainly from four universities: Cornell, Texas Austin, Washington and Wisconsin Madison. We selected all documents from those four universities that fall under one of the four page types, namely course, faculty, projects and students. After a preprocessing step similar to that applied to the RCV1 data set, the final data matrix contains 1041 documents and 486 words. This data set can be clustered either by the four universities or by the four page types. The full clustering results for all algorithms are presented in Table II.

Table II. CLUSTERING RESULTS ON THE RCV1 AND WEBKB DATA SETS (AMI VALUES)

Algorithm                         | RCV1 1st (Topic) | RCV1 2nd (Country) | RCV1 2nd* (Country) | WebKB 1st (University) | WebKB 2nd (Page type) | WebKB 2nd* (Page type)
minCEntropy/+                     | 0.24±0.02        | 0.26±0.07          | 0.54±0.12           | 0.38±0.05              | 0.22±0.08             | 0.28±0.01
minCEntropy/+ (ADFT)              | 0.24±0.02        | N/A                | 0.50±0.12           | 0.38±0.05              | N/A                   | 0.27±0.02
minCEntropy/+ (Qi's trans.)       | 0.24±0.02        | 0.26±0.09          | 0.44±0.10           | 0.38±0.05              | 0.22±0.08             | 0.25±0.04
minCEntropy/+ (Ortho1)            | 0.24±0.02        | 0.26±0.07          | 0.52±0.12           | 0.38±0.05              | 0.23±0.08             | 0.27±0.01
minCEntropy/+ (Ortho2)            | 0.24±0.02        | 0.23±0.09          | 0.51±0.11           | 0.38±0.05              | 0.21±0.07             | 0.26±0.02
Spherical K-means (ADFT)          | 0.31±0.01        | N/A                | 0.04±0.02           | 0.52±0.02              | N/A                   | 0.08±0.02
Spherical K-means (Qi's trans.)   | 0.31±0.01        | 0.00±0.00          | 0.00±0.01           | 0.52±0.02              | 0.01±0.00             | 0.01±0.00
Spherical K-means (Ortho1)        | 0.31±0.01        | 0.29±0.04          | 0.62±0.06           | 0.52±0.02              | 0.22±0.07             | 0.23±0.03
Spherical K-means (Ortho2)        | 0.31±0.01        | 0.29±0.04          | 0.61±0.05           | 0.52±0.02              | 0.20±0.08             | 0.24±0.03
Spectral (ADFT)                   | 0.12±0.00        | N/A                | 0.02±0.00           | 0.12±0.01              | N/A                   | 0.05±0.00
Spectral (Qi's trans.)            | 0.12±0.00        | 0.01±0.00          | 0.00±0.00           | 0.12±0.01              | 0.02±0.00             | 0.02±0.00
Spectral (Ortho1)                 | 0.12±0.00        | 0.46±0.02          | 0.62±0.01           | 0.12±0.01              | 0.17±0.03             | 0.15±0.01
Spectral (Ortho2)                 | 0.12±0.00        | 0.45±0.00          | 0.63±0.01           | 0.12±0.01              | 0.09±0.03             | 0.15±0.01
Dec. K-means                      | 0.23±0.04        | 0.40±0.10          | N/A                 | 0.38±0.01              | 0.15±0.03             | N/A
S-Dec. K-means                    | N/A              | N/A                | 0.49±0.10           | N/A                    | N/A                   | 0.20±0.08
S-Dec. K-means (ADFT)             | N/A              | N/A                | 0.04±0.03           | N/A                    | N/A                   | 0.07±0.03
S-Dec. K-means (Qi's trans.)      | N/A              | N/A                | 0.01±0.00           | N/A                    | N/A                   | 0.01±0.02
S-Dec. K-means (Ortho1)           | N/A              | N/A                | 0.61±0.01           | N/A                    | N/A                   | 0.33±0.03
S-Dec. K-means (Ortho2)           | N/A              | N/A                | 0.61±0.01           | N/A                    | N/A                   | 0.36±0.01
*: 2nd clustering, with the 1st ground-truth clustering given to the algorithm.

Overall, minCEntropy/+ with the default parameter setting performed competitively. Albeit not always producing the best clusterings, its results were more consistent and competitive for both clusterings across all the data sets. Our experiments also confirmed the good performance of the K-means algorithm on top of a transformation approach, as reported in previous works [3], [4], especially the spherical K-means, which is fine-tuned for text data. Spectral clustering performance appeared to fluctuate more. Of the transformation approaches, the performance of the two orthogonal transforms Ortho1 and Ortho2 was more consistent. The ADFT and Qi's transforms (using the implementation provided by the author, available at http://wwwcsif.cs.ucdavis.edu/~qiz/code/code.html) with the default parameters performed poorly on the two text data sets. Although transformations have a relatively slight effect on minCEntropy+, their effect on another objective-function-oriented approach, namely the semi-supervised Dec. K-means, was more substantial, with marked improvement on 3 of the 4 real data sets. For all sequential approaches to generating alternative clusterings, having access to the first ground-truth clustering significantly improved the quality of the second clustering.
V. RELATED WORK AND DISCUSSION

Most objective-function-oriented approaches, such as CIB [5], Dec. K-means, convolutional-EM [8], semi-supervised Dec. K-means, CAMI [9], NACI [7] and minCEntropy/+, are based on a common formulation, with a combined criterion of clustering quality and diversity. Dec. K-means quantifies clustering quality by the regular sum-of-squared-distance criterion, while diversity is quantified by the sum of squared dot products between the mean and representative vectors of clusters in the two clusterings. CAMI combines the traditional likelihood criterion for clustering quality with the mutual information for clustering diversity. minCEntropy, on the other hand, employs the conditional entropy for both quality and diversity, resulting in a more analytically consistent combined criterion. The closest approach to ours is the recently proposed NACI method, which uses the mutual information for both clustering quality and diversity. NACI employs a greedy hierarchical strategy to optimize its objective, and therefore presumably shares all the advantages and disadvantages of traditional hierarchical clustering approaches. While greedy agglomerative approaches generally possess a strong capability for handling non-convex, irregularly shaped clusters, their main disadvantage is that finding even a locally optimal clustering solution is not guaranteed [28]. minCEntropy, on the other hand, employs a partitional clustering approach, for which convergence to a locally optimal solution is assured.

Methods for revealing alternative clusterings can proceed in a sequential manner, in which alternative clusterings are created one after another (CIB, COALA, NACI, S-Dec. K-means, minCEntropy), or in a concurrent manner, in which a pair of clusterings is created in parallel (Dec. K-means, convolutional-EM, CAMI). In our opinion, concurrent schemes for discovering alternative clusterings suffer from several disadvantages: (i) their optimization problem becomes "twice" as hard, as the number of variables is doubled since two clusterings need to be handled concurrently. As the dimension of the optimization problem increases, local optimization procedures such as gradient descent (employed by Dec. K-means) or Expectation-Maximization (employed by CAMI and convolutional-EM) become less effective (the curse of dimensionality). (ii) Extension to more than two clusterings is likely to be much more involved. (iii) It is not possible to combine these approaches with data-transformation-oriented approaches. (iv) Finally, side information, such as an existing classification, cannot be incorporated into these algorithms. On the other hand, sequential approaches for alternative clustering, which generally do not suffer from the mentioned shortcomings, appear to be a more flexible choice.
VI. CONCLUSIONS AND FUTURE WORK

This paper has introduced minCEntropy/+, an objective-function-oriented approach to sequentially generate clusterings and alternative clusterings, in either a semi-supervised or unsupervised manner. Although built upon the strong foundation of information theory, minCEntropy requires no specific assumption on the density functions, but employs an efficient procedure to estimate the relevant information quantities. minCEntropy with the recommended parameter setting showed consistent and competitive performance throughout our experiments. As a final note, it is possible to develop a hierarchical version of minCEntropy, which would probably handle data with non-convex, irregularly shaped clusters well. Also, a directional version of minCEntropy, which employs a directional distribution, such as the von Mises-Fisher, for its density estimation, would probably better handle directional data, such as text and microarray data. Our future work includes investigating these variants of minCEntropy, together with a more comprehensive evaluation against the most recently proposed methods [7], [11], which are concurrent with this submission.

ACKNOWLEDGEMENT

James Bailey and Xuan-Hong Dang are warmly acknowledged for their insightful comments on an early draft. We thank the anonymous reviewers for their constructive comments. This work was partially supported by a NICTA Research Project Award.

Availability: A Matlab implementation of the minCEntropy/+ algorithms will be made available via our web site.

REFERENCES

[1] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, "Distance metric learning, with application to clustering with side-information," in Advances in Neural Information Processing Systems 15, vol. 15, 2002, pp. 505–512.
[2] I. Davidson and Z. Qi, "Finding alternative clusterings using constraints," in Procs. ICDM '08, 2008, pp. 773–778.
[3] Z. Qi and I. Davidson, "A principled and flexible framework for finding alternative clusterings," in Procs. KDD '09, 2009.
[4] Y. Cui, X. Z. Fern, and J. G. Dy, "Non-redundant multi-view clustering via orthogonalization," in Procs. ICDM '07, 2007.
[5] D. Gondek and T. Hofmann, "Conditional information bottleneck clustering," in Procs. ICDM '03, 2003.
[6] E. Bae and J. Bailey, "COALA: A novel approach for the extraction of an alternate clustering of high quality and high dissimilarity," in Procs. ICDM '06, 2006, pp. 53–62.
[7] X. H. Dang and J. Bailey, "A hierarchical information theoretic technique for the discovery of non linear alternative clusterings," in Procs. KDD '10, 2010.
[8] P. Jain, R. Meka, and I. S. Dhillon, "Simultaneous unsupervised learning of disparate clusterings," Stat. Anal. Data Min., vol. 1, no. 3, pp. 195–210, 2008.
[9] X. H. Dang and J. Bailey, "Generation of alternative clusterings using the CAMI approach," in Procs. SDM '10, 2010.
[10] R. Caruana, M. Elhawary, N. Nguyen, and C. Smith, "Meta clustering," in Procs. ICDM '06, 2006, pp. 107–118.
[11] D. Niu, J. G. Dy, and M. I. Jordan, "Multiple non-redundant spectral clustering views," in Procs. ICML '10, 2010.
[12] S. J. Roberts, R. Everson, and I. Rezek, "Maximum certainty data partitioning," Pattern Recognition, vol. 33, no. 5, pp. 833–839, 2000.
[13] Y. Lee and S. Choi, "Minimum entropy, k-means, spectral clustering," in Procs. IJCNN '04, 2004.
[14] H. Li, K. Zhang, and T. Jiang, "Minimum entropy clustering and applications to gene expression analysis," in Proc. IEEE Comput. Syst. Bioinform. Conf., 2004, pp. 142–151.
[15] J. C. Principe, D. Xu, and J. Fisher, "Information theoretic learning," in Unsupervised Adaptive Filtering, S. Haykin, Ed., 2000, pp. 265–321.
[16] J. Havrda and F. Charvat, "Quantification method of classification processes. Concept of structural α-entropy," Kybernetika, no. 3, pp. 30–35, 1967.
[17] P. Hansen, B. Jaumard, and N. Mladenovic, "Minimum sum of squares clustering in a low dimensional space," Journal of Classification, vol. 15, no. 1, pp. 37–55, 1998.
[18] S. Dasgupta, "The hardness of k-means clustering," Tech. Report CS2007-0890, University of California San Diego, 2008.
[19] K. Torkkola, "Feature extraction by non parametric mutual information maximization," J. Mach. Learn. Res., vol. 3, pp. 1415–1438, 2003.
[20] L. Zelnik-Manor and P. Perona, "Self-tuning spectral clustering," in Advances in Neural Information Processing Systems 17, MIT Press, 2004, pp. 1601–1608.
[21] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, no. 1-2, pp. 143–175, 2001.
[22] N. X. Vinh, J. Epps, and J. Bailey, "Information theoretic measures for clusterings comparison: is a correction for chance necessary?" in Procs. ICML '09, 2009.
[23] A. Strehl and J. Ghosh, "Cluster ensembles - a knowledge reuse framework for combining multiple partitions," Journal of Machine Learning Research, vol. 3, pp. 583–617, 2002.
[24] J.-M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders, "The Amsterdam Library of Object Images," Int. J. Comput. Vision, vol. 61, no. 1, pp. 103–112, 2005.
[25] N. Boujemaa, J. Fauqueur, M. Ferecatu, F. Fleuret, V. Gouet, B. LeSaux, and H. Sahbi, "Ikona: Interactive specific and generic image retrieval," in International Workshop on Multimedia Content-Based Indexing and Retrieval (MMCBIR 2001), 2001.
[26] A. Frank and A. Asuncion, "UCI machine learning repository," 2010.
[27] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, "RCV1: A new benchmark collection for text categorization research," J. Mach. Learn. Res., vol. 5, pp. 361–397, 2004.
[28] S. Gordon, H. Greenspan, and J. Goldberger, "Applying the information bottleneck principle to unsupervised clustering of discrete and continuous image representations," in ICCV '03: Procs. 9th IEEE Int. Conf. Computer Vision, 2003, p. 370.