KNOSYS 2363 · No. of Pages 10, Model 5G · 12 October 2012
Knowledge-Based Systems xxx (2012) xxx–xxx
Contents lists available at SciVerse ScienceDirect
Knowledge-Based Systems — journal homepage: www.elsevier.com/locate/knosys
A modification of the k-means method for quasi-unsupervised learning
David Rebollo-Monedero a, Marc Solé b, Jordi Nin b, Jordi Forné a,*

a Department of Telematics Engineering, Technical University of Catalonia (UPC), E-08034 Barcelona, Spain
b Department of Computer Architecture, Technical University of Catalonia (UPC), E-08034 Barcelona, Spain

Article info
Article history: Received 28 November 2011. Received in revised form 30 July 2012. Accepted 31 July 2012. Available online xxxx.
Keywords: k-Means method; Quasi-unsupervised learning; Constrained clustering; Size constraints
Abstract. Since the advent of data clustering, the original formulation of the clustering problem has been enriched to incorporate a number of twists to widen its range of application. In particular, recent heuristic approaches have proposed to incorporate restrictions on the size of the clusters, while striving to minimize a measure of dissimilarity within them. Such size constraints effectively constitute a way to exploit prior knowledge, readily available in many scenarios, which can lead to an improved performance in the clustering obtained. In this paper, we build upon a modification of the celebrated k-means method resorting to a similar alternating optimization procedure, endowed with additive partition weights controlling the size of the partitions formed, adjusted by means of the Levenberg–Marquardt algorithm. We propose several further variations on this modification, in which different kinds of additional information are present. We report experimental results on various standardized datasets, demonstrating that our approaches outperform existing heuristics for size-constrained clustering. The running-time complexity of our proposal is assessed experimentally by means of a power-law regression analysis. © 2012 Elsevier B.V. All rights reserved.
1. Introduction
Essentially, data clustering aims to create groups of similar objects, while keeping different objects in separate groups, according to a certain quantifiable measure of similarity. Clustering is commonly used in a large variety of domains, including machine learning, data mining, pattern recognition, image analysis, information retrieval, and bioinformatics. In machine learning, for example, one frequently deals with the important problem of classification, in which a training dataset of observations, tagged or labeled with categories of interest, is employed by an algorithm allowing a classifier to learn via inductive inference to automatically categorize new unlabeled data. Techniques that in this context employ both the data and the tags of the training set are referred to as supervised learning [5,32]. At the opposite extreme is unsupervised learning [10], covering techniques which face a complete unavailability of labels in the training data, and must therefore resort only to properties of the data itself in order to partition it, guided by a preestablished metric of dissimilarity or distortion. Unsupervised clustering may also be of great value in reducing the complexity of overwhelmingly fine-grained data, thus facilitating the training and operation of a subsequent supervised classifier.
* Corresponding author. E-mail address: [email protected] (J. Forné).
In a middle ground lies the case when only a fraction of the data is labeled, and techniques known as semisupervised learning [6] are then most suitable, but by no means exclusive. In this realistic case, the unsupervised preprocessing just described may come in handy to take advantage of the properties of the large portion of untagged data. Ultimately, this may improve the performance of a supervised classifier, trained on the smaller portion of tagged data, preprocessed in the same manner. Such improvement may come in the form of reduced overfitting issues, as described in [2]. A fundamental question, which we shall address in this paper, is whether an unsupervised learning technique may be slightly modified, retaining the convenient low complexity stemming from the essence of its operation on the data, while incorporating only a small amount of information about the labels, in order to finally improve the suitability of the partitions obtained. In accordance with the provision of such partial label information, we shall rightfully call such a modified technique, potentially quite cost-effective, quasi-unsupervised. A practically simple piece of information about the data, possibly computed from labeled data or known by any other means, is the relative frequency of each category of interest in the classification. A conceptually simple way of incorporating such information would consist in introducing, in a thus modified clustering algorithm, constraints on the sizes of the partitions, in keeping with the prevalence of the data labels. The algorithm should then meet those size constraints, while striving to minimize a given measure of dissimilarity within the clusters, a measure based solely on the geometric properties of the unlabeled data.
0950-7051/$ - see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.knosys.2012.07.024
Please cite this article in press as: D. Rebollo-Monedero et al., A modification of the k-means method for quasi-unsupervised learning, Knowl. Based Syst. (2012), http://dx.doi.org/10.1016/j.knosys.2012.07.024
Finally, the k-means method [27] has undoubtedly played a preeminent role in the field of unsupervised learning, particularly as a feature-extraction strategy prior to classification, and more specifically as an algorithm to cluster unlabeled data, according to a distortion metric, commonly mean squared Euclidean distance, between cluster points and a cluster representative value. We postulate that incorporating even small pieces of information regarding the labeling of some of the data provided may noticeably improve classification performance. This raises the question of how the k-means method may be modified not only to minimize the overall distortion due to clustering, but also to observe appropriate cluster size constraints, or similarly simple restrictions reflecting partial label information.
1.1. Overview of the k-means method and size-constrained clustering
There exists extensive literature on algorithms for unsupervised clustering [13,10,43,24,42], with the k-means method [27] being one of the most popular choices. This algorithm satisfies several relevant properties and has numerous variants [18,16,4,34]; most notably, k-means iterates by alternatingly fulfilling two necessary optimality conditions in the minimization of the distortion incurred by replacing the clusters formed by a common representative data point [14,15]. This algorithm has been rediscovered several times in the statistical clustering literature [15], under various names. The term k-means was first used by MacQueen in [27], although the idea goes back to Steinhaus in [36]. The algorithm was rediscovered independently by Lloyd in 1957 as a quantization technique for pulse-code modulation, but it was not published until much later, in 1982 [26]. Accordingly, in the field of data compression, the algorithm is often named the Lloyd algorithm, but also the Lloyd–Max algorithm [30], among other names. Regardless of the research area and the algorithm name, in many real-world problems, for instance document clustering [25,22] and localization of merchandise assortment [11], to name a few, practitioners have some background knowledge about the number of clusters and their approximate size. This kind of knowledge may be incorporated into the original k-means, or any of its variants, by adding size constraints to the data clustering problem. Of course, size constraints are a relevant feature because algorithms able to use such information may lead to better clusters than traditional algorithms not designed to take advantage of it, and the potential applications of size-constrained clustering algorithms abound.
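As a concrete illustration of the alternating optimization just described, the following minimal Python sketch iterates the assignment and centroid steps under MSE distortion until the relative distortion reduction falls below a threshold. This is our own illustrative code, not the implementation used in the paper; all names are ours.

```python
import numpy as np

def k_means(x, k, tol=1e-4, seed=0):
    """Minimal k-means sketch: alternate the nearest-neighbor and centroid
    steps until the relative distortion reduction drops below tol."""
    rng = np.random.default_rng(seed)
    # Initialize the reconstructions with k distinct data points.
    x_hat = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    prev_d = None
    while True:
        # Nearest-neighbor step: assign each point to its closest centroid.
        dist2 = ((x[:, None, :] - x_hat[None, :, :]) ** 2).sum(axis=2)
        q = dist2.argmin(axis=1)
        d = dist2[np.arange(len(x)), q].mean()  # current MSE distortion
        if prev_d is not None and prev_d - d <= tol * prev_d:
            return q, x_hat, d
        prev_d = d
        # Centroid step: move each reconstruction to its cluster's mean.
        for j in range(k):
            if np.any(q == j):
                x_hat[j] = x[q == j].mean(axis=0)
```

Because each step can only decrease the distortion, the stopping criterion is guaranteed to trigger eventually on finite data, although, as discussed later, convergence of the distortion does not by itself guarantee a jointly optimal clustering.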
Among many others, size-constrained clustering algorithms may find applications in k-anonymous microaggregation [38,8,9], and in a variety of resource-allocation and operations research problems, such as the one mentioned [11], particularly applications of similarity-based allocation of resources or workload according to predetermined volume constraints. In the context of our application of interest, namely quasi-unsupervised learning, we focus our attention on a recent heuristic addressing the size-constrained variation of the clustering problem. Specifically, we shall compare our proposal with a heuristic method called the size-constrained clustering (SCC) algorithm, proposed in 2010 by Zhu et al. [44]. SCC is designed to solve the data clustering problem with size constraints, under the assumption that those sizes are available as one of the types of prior knowledge on the labels we have introduced. Although the authors report an excellent performance for SCC compared to the vanilla k-means method, their approach does not inherit the distortion-optimality properties of k-means, in the sense that the necessary nearest-neighbor condition of [14,15] would not be satisfied if the constraints were removed. On the other hand, in our own previous work [35], a significant modification of the k-means algorithm is proposed to address constraints on cell probabilities, while striving to inherit the optimality characteristics of this method, named, accordingly, the probability-constrained Lloyd (PCL) algorithm. In the cited work, PCL is proposed strictly as a heuristic method, merely pointing out the formal similarities between the necessary optimality conditions in conventional quantization and the modified conditions, without further theoretical analysis. Although PCL outperforms the state-of-the-art contender in k-anonymous microaggregation for statistical disclosure control [41], the algorithm is only analyzed in said field, without contemplating its potential in the area of quasi-unsupervised learning, nor adapting it to other kinds of side information on the labels. To the best of our knowledge, there has been no investigation on whether the desirable properties of this size-constrained variation of the k-means algorithm, PCL, could result in more compact clusters than the ones obtained with the SCC heuristic, and on whether such compactness would prove beneficial from the classification point of view.
1.2. Contribution and organization
Motivated by the optimality properties of the k-means algorithm and its long history of application across numerous fields, in this paper, we put forth a heuristic method for size-constrained clustering, loosely inspired by this celebrated algorithm. Precisely, the contribution of this work is twofold. First, we propose the application of a very recent algorithm, namely the aforementioned probability-constrained Lloyd algorithm, originally formulated for k-anonymous microaggregation, to quasi-unsupervised learning. More specifically, after certain adjustments, we demonstrate that PCL is suited to the problem of size-constrained clustering, despite the completely different context. Effectively, PCL is a substantial modification of the k-means method, a modification with excellent performance in its original field. We experimentally analyze its performance against the SCC algorithm, mentioned above, a state-of-the-art method in the field of quasi-unsupervised learning, on various standardized datasets. Our second contribution consists in further modifying the PCL algorithm beyond its ability to perform clustering in a manner analogous to the k-means algorithm while satisfying cluster-size constraints. The modifications in this work endow said algorithm with the possibility of initializing the reconstruction centroids from (a subset of) labeled data, when such information is available, and also with the possibility of fixing the cluster assignment of a small subset of labeled data while clustering the rest of the data, which is unlabeled. The detailed experimental evaluation we conduct on PCL strongly supports its consideration as a state-of-the-art candidate for quasi-unsupervised learning, not only for its intended field of application.
Furthermore, the three variations proposed here help address various scenarios concerning the partial availability of information regarding the labels of the data, thus widely extending the applicability range of the celebrated k-means method. Our experimental results confirm that our proposal is capable of outperforming the state-of-the-art SCC method, yielding more compact clusters and higher-quality classification. In addition to evaluating its distortion performance, the running-time complexity of our proposal is assessed experimentally by means of a power-law regression analysis. The rest of the paper is organized as follows. Section 2 is devoted to a more formal overview of the traditional k-means method, where after a few statistical preliminaries, the two necessary conditions for distortion-optimal clustering are laid out. Inspired by those optimality conditions, Section 3 mathematically formulates the application of the aforementioned PCL heuristic to quasi-unsupervised clustering with prior knowledge related to cluster sizes, and provides a running-time complexity analysis. In Section 4, we describe our three variations of said heuristic, widening the types of prior knowledge considered. Our k-means
Fig. 1. Example of two-dimensional clustering: original data points are mapped to a cluster index (clustering), and each index is mapped to a reconstructed data point (reconstruction).
modifications are then experimentally evaluated against the state-of-the-art heuristic SCC in Section 5. Finally, conclusions and future work are presented in Section 6.

2. Mathematical preliminaries and background on the traditional k-means method
2.1. Mathematical preliminaries
It is important to stress that, as a method of strictly unsupervised learning, absolutely no label information is available to k-means. Consequently, even if the method is to be used as preprocessing for the supervised learning of otherwise overwhelmingly fine-grained data, k-means merely exploits geometric properties of the data according to a preset measure of dissimilarity or distortion, commonly the squared Euclidean distance. Prefaced by a few statistical preliminaries, this section overviews the traditional k-means method more formally, and lays out the two necessary conditions for distortion-optimal clustering. Throughout the paper, the measurable space in which a random variable takes on values will be called an alphabet. The cardinality of a set $\mathcal{X}$ is denoted by $|\mathcal{X}|$. We shall follow the convention of using uppercase letters for random variables, and lowercase letters for the particular values they take on. Thus, the notation $X = x$ represents the event that the random variable $X$ takes on the value $x \in \mathcal{X}$. Recall that a probability mass function (PMF) is essentially a relative-frequency histogram representing the probability distribution of a random variable over its alphabet. The expectation operator is denoted by $\mathrm{E}$. Expectation can model the special case of averages over a finite set of data points $\{x_1, \ldots, x_n\}$, simply by defining a random variable $X$ uniformly distributed over this set, so that $\mathrm{E}X = \frac{1}{n}\sum_{i=1}^{n} x_i$. More generally, when $X$ is distributed according to a PMF $p(x)$ on a discrete alphabet $\mathcal{X}$, a function $f$ of $X$ has expectation
$$\mathrm{E}\, f(X) = \sum_{x \in \mathcal{X}} f(x)\, p(x).$$
We adhere to the common convention of separating conditioning variables with a vertical bar, so that, for instance, $\mathrm{E}[f(X) \mid y]$ denotes the expectation of $f(X)$ conditioned on the event $Y = y$:
$$\mathrm{E}[f(X) \mid y] = \sum_{x \in \mathcal{X}} f(x)\, p(x \mid y),$$
where $p(x \mid y)$ denotes the conditional PMF of $X$ given $Y$.
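As a quick illustration of these definitions, the following Python sketch (our own, not from the paper) evaluates the expectation of f(X) for a PMF stored as a dictionary, covering both the uniform empirical case and a conditional PMF:

```python
def expectation(f, pmf):
    """E f(X) = sum over x of f(x) p(x), for a PMF given as {x: p(x)}."""
    return sum(f(x) * p for x, p in pmf.items())

# Averages over a finite set are the special case of a uniform PMF.
data = [1.0, 2.0, 4.0, 5.0]
p_uniform = {x: 1.0 / len(data) for x in data}
mean = expectation(lambda x: x, p_uniform)  # (1/4)(1 + 2 + 4 + 5) = 3.0

# Conditional expectation E[f(X) | y] uses the conditional PMF p(x | y).
p_given_y = {1.0: 0.5, 2.0: 0.5}
cond_mean = expectation(lambda x: x, p_given_y)  # 1.5
```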
2.2. Background on the traditional k-means method
A clustering method is a function that partitions the range of values $x$ of a random variable $X$, commonly continuous, approximating each cluster by a value $\hat{x}$ of a discrete random variable $\hat{X}$. The resulting clustering map $\hat{x}(x)$ may be broken down into two steps: first, an assignment of data points $X$ to a cluster index $Q$, usually a natural number, by means of a clustering function $q(x)$; and secondly, a reconstruction function $\hat{x}(q)$ mapping the index $Q$ into a value $\hat{X}$ that approximates the original data, so that $\hat{x}(x) = \hat{x}(q(x))$. Both the composition $\hat{x}(x)$ and the first step $q(x)$ are often interchangeably interpreted, and referred to, as clustering. This is represented in Fig. 1, along with an example where the random variable $X$ takes on values in $\mathbb{R}^2$. Depending on the field of application, the terms quantizer or (micro)aggregation are used in lieu of clustering, and cell in lieu of cluster, even though they have essentially equivalent meanings. Often, a distinction is made between clustering and microaggregation, to emphasize the large and small size of the resulting cells, respectively. Cluster indices $Q = q(X)$ take on values in a finite alphabet $\mathcal{Q} = \{1, \ldots, |\mathcal{Q}|\}$. The size $|\mathcal{Q}|$ of this alphabet, simply put, the number of clusters, may be given as an application requirement. Clearly, clustering comes at the price of introducing a certain amount of distortion between the original data $X$ and its reconstructed version $\hat{X} = \hat{x}(Q)$. In mathematical terms, we define a nonnegative function $d(x, \hat{x})$ called a distortion measure, and consider the expected distortion $D = \mathrm{E}\, d(X, \hat{X})$. A common measure of distortion is the mean squared error (MSE), that is, $D = \mathrm{E}\|X - \hat{X}\|^2$, popular due to its mathematical tractability. Optimal clustering is that of minimum distortion for a given number of possible indices. It is well known [14,15] that optimal clustering must satisfy the following conditions. First, the nearest-neighbor condition, according to which, once a reconstruction function $\hat{x}(q)$ is chosen, the optimal clustering $q^*(x)$ is given by
$$q^*(x) = \operatorname*{arg\,min}_{q = 1, \ldots, |\mathcal{Q}|} d(x, \hat{x}(q)), \qquad (1)$$
that is, each value $x$ of the data is assigned to the index corresponding to the nearest reconstruction value. Secondly, the centroid condition, which, in the important case when MSE is used as the distortion measure, states that for a chosen clustering $q(x)$, the optimal reconstruction function $\hat{x}^*(q)$ is given by
$$\hat{x}^*(q) = \mathrm{E}[X \mid q], \qquad (2)$$
that is, each reconstruction value is the centroid of a cluster. Each necessary condition may be interpreted as the solution to a Bayesian decision problem. We would like to stress that these are necessary but not sufficient conditions for joint optimality. Still, these optimality conditions are exploited in the k-means method [27,36], also widely known as the Lloyd–Max algorithm [26,30] in the data compression field. The method in question consists in the iterative, alternating optimization of $q(x)$ given $\hat{x}(q)$ and vice versa, according to (1) and (2). A popular method to initialize the algorithm consists
in choosing random reconstruction points first. A common stopping criterion compares the relative distortion reduction in each iteration to a given small threshold. In the k-means method, either the clustering or the reconstruction is improved at a time, leading to a successive improvement in the distortion; strictly speaking, the sequence of distortions is nonincreasing. Even though a nonnegative, nonincreasing sequence has a limit, rigorously, the convergence of the sequence of distortions does not guarantee that the sequence of clusterings obtained by the algorithm will tend to a stable configuration, less so to a jointly optimal one. In theory, the algorithm can merely provide a sophisticated upper bound on the optimal distortion. In practice, however, the k-means algorithm often exhibits excellent performance [15, Sections II.E and III].

3. Theoretical formulation of the k-means method with size constraints, and complexity analysis
In this section we first introduce the formulation of the size-constrained clustering problem. A slightly more general formulation is developed, which in fact contemplates probability-constrained clustering. Secondly, we postulate two update steps inspired by the optimality conditions of the k-means method, but taking into account those constraints. Thirdly, we proceed to propose two variations of our own algorithm, for large and small cluster sizes, respectively. We end this section with a detailed running-time complexity analysis of the PCL algorithm, the basic building block of our new proposals.
3.1. Formulation of and insight into the problem
We consider the design of minimum-distortion clusters satisfying size or, more generally, cluster probability constraints, with the same block structure of traditional clustering depicted in Fig. 1. For finite sets of discrete data points, one may simply view cluster probability constraints equivalently as size constraints, normalized by the total number of original data points available. Our general formulation in terms of probabilities permits including the case of continuous distributions of the data. Original data values, that is, the data points to be clustered, are modeled by a random variable $X$ in an arbitrary alphabet $\mathcal{X}$, possibly discrete or continuous, for example a set of points in a multidimensional Euclidean space, or a set of words in an ontology. The clustering $q(x)$ assigns $X$ to an index $Q$ in a finite alphabet $\mathcal{Q} = \{1, \ldots, |\mathcal{Q}|\}$ of a predetermined size. The reconstruction function $\hat{x}(q)$ maps $Q$ into the reconstruction $\hat{X}$, which may be regarded as an approximation to the original data, defined in an arbitrary alphabet $\hat{\mathcal{X}}$, commonly but not necessarily equal to the original data alphabet $\mathcal{X}$. Just as in traditional clustering, for any nonnegative (measurable) function $d(x, \hat{x})$, called a distortion measure, define the associated expected distortion $D = \mathrm{E}\, d(X, \hat{X})$, that is, a measure of the discrepancy between the original data values and their reconstructed values, which reflects the loss in data accuracy. Recall also from the background section that an important example of a distortion measure is $d(x, \hat{x}) = \|x - \hat{x}\|^2$, for which $D$ becomes the MSE. Alternatively, $d(x, \hat{x})$ may represent a semantic distance in an ontological hierarchy [7], with $\hat{X}$ modeling a conceptual generalization of $X$, a random variable taking on words as values. $p_Q(q)$ denotes the PMF corresponding to the cell probabilities. The cluster size requirement in the quantization problem is established by means of the cell probability constraints $p_Q(q) = p_0(q)$, for any predetermined PMF $p_0(q)$.
As we have introduced before, the motivating application of this work is the problem of size-constrained clustering of a set of samples. In this important albeit special case, $X$ would be given by the empirical distribution of the data values in the original set, that is, $p_X(x)$ would be the PMF corresponding to the relative frequency of occurrences of the different tuples of data values. Let $n$ be the number of records in the dataset. An $m$-size constraint could be translated into probability constraints by setting $|\mathcal{Q}| = \lfloor n/m \rfloor$ and $p_0(q) = 1/|\mathcal{Q}|$, which ensures that $n\, p_0(q) \geq m$. We would like to remark that even if we are only interested in the clustering portion $q(x)$ of the quantization process, the reconstruction values may still be used, although only as a means to compute a measure of similarity, and not as a replacement for the data clustered. Finally, given a distortion measure $d(x, \hat{x})$ and probability constraints expressed by means of $p_0(q)$ (along with the number of quantization cells $|\mathcal{Q}|$), we wish to design an optimal clustering $q^*(x)$ and an optimal reconstruction function $\hat{x}^*(q)$, in the sense that they jointly minimize the distortion $D$ while satisfying the probability constraints.
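The size-to-probability translation above is a one-line computation; the following sketch (our own, with illustrative numbers not taken from the paper) makes it explicit:

```python
def size_to_probability(n, m):
    """Translate an m-size constraint on n records into |Q| cells with
    uniform target probabilities p0(q) = 1/|Q|, so that n * p0(q) >= m."""
    num_cells = n // m      # |Q| = floor(n / m)
    p0 = 1.0 / num_cells    # uniform target PMF over the cells
    assert n * p0 >= m      # each cell is asked to hold at least m points
    return num_cells, p0

# For example, n = 1000 records with a minimum cell size of m = 30:
cells, p0 = size_to_probability(1000, 30)   # 33 cells, p0 = 1/33
```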
3.2. A heuristic modification of the traditional optimality conditions
Next, we propose heuristic optimization steps for probability-constrained clustering, analogous to the nearest-neighbor and centroid conditions found in k-means [14,15], reviewed in Section 2.2. We then modify the conventional k-means algorithm by applying its underlying alternating optimization principle to these steps. Finding the optimal reconstruction function $\hat{x}^*(q)$ for a given clustering $q(x)$ is a problem identical to that in conventional k-means:

$$\hat{x}^*(q) = \operatorname*{arg\,min}_{\hat{x} \in \hat{\mathcal{X}}} \mathrm{E}[d(X, \hat{x}) \mid q]. \qquad (3)$$

In the special case when MSE is used as the distortion measure, this is the centroid step (2). On the other hand, we may not apply the nearest-neighbor condition of conventional clustering directly if we wish to guarantee the probability constraints $p_Q(q) = p_0(q)$. We introduce a cell cost function $c : \mathcal{Q} \to \mathbb{R}$, a real-valued function of the cluster indices, which assigns an additive cost $c(q)$ to a cell indexed by $q$. The intuitive purpose of this function is to shift the cluster boundaries appropriately to satisfy the probability constraints. Specifically, given a reconstruction function $\hat{x}(q)$ and a cost function $c(q)$, we propose the following cost-sensitive nearest-neighbor step:

$$q^*(x) = \operatorname*{arg\,min}_{q = 1, \ldots, |\mathcal{Q}|} d(x, \hat{x}(q)) + c(q). \qquad (4)$$

This is a heuristic step inspired by the nearest-neighbor condition of the conventional k-means method (1). According to this formula, increasing the cost of a cluster, leaving the cost of the others and all centroids unchanged, will reduce the number of points assigned to it. Conversely, decreasing the cost will push the cluster boundaries outwards and thus increase its size. The coefficients may be construed, in the Euclidean case, as an additional dimension in which centroids can distance themselves from the points to be clustered. In technical words, the costs $c(q)$ adding to a quadratic distance $d(x, \hat{x}(q))$ in (4) may all be considered nonnegative without loss of generality, since adding a common base value to all of them would determine the same exact partition. But then,

$$d(x, \hat{x}(q)) + c(q) = \|x - \hat{x}(q)\|^2 + c(q) = \Big\| (x, 0) - \big(\hat{x}(q), \sqrt{c(q)}\big) \Big\|^2.$$

This geometric interpretation is illustrated in Fig. 2. The step just proposed naturally leads to the question of how to find a cost function $c(q)$ such that the probability constraints
Fig. 2. Geometrical interpretation of the c(q) coefficients. (Left) Initial assignment of points {x1, . . . , x6} to two clusters q1 and q2: the leftmost cluster has four points assigned, the rightmost only two. Initially, the coefficients c(q1) = c(q2) = 0, so the reconstruction points lie on the same plane as the data points. (Right) The clusters are balanced by increasing the coefficient of the cluster with an excess of points. The coefficient can be viewed as the coordinate of the reconstruction point in an additional dimension.
$p_Q(q) = p_0(q)$ are satisfied, given a reconstruction function $\hat{x}(q)$. Clearly, for given, fixed reconstructions, the cluster probabilities $p_Q(q)$ are entirely determined by the costs $c(q)$, through the modified nearest-neighbor condition (4). Considering each of the costs $c(1), \ldots, c(|\mathcal{Q}|)$ an unknown and each of the constraints $p_Q(1) = p_0(1), \ldots, p_Q(|\mathcal{Q}|) = p_0(|\mathcal{Q}|)$ an equation, we are left with a system of $|\mathcal{Q}|$ nonlinear equations in $|\mathcal{Q}|$ unknowns. For smooth probability distributions of the data and large cluster size constraints, we propose the following derivative-based numerical method, which proved to be very successful in all of our experiments, including those reported in Section 5. Later on, we shall describe a mechanism to extend this derivative-based method to the case of small-size constraints and arbitrary distributions of the data, not necessarily smooth. Specifically, we propose an application of the Levenberg–Marquardt algorithm [28], a powerful, sophisticated algorithm to solve systems of nonlinear equations numerically. A finite-difference estimation of the Jacobian is carried out by slightly increasing each of the coordinates of $c(q)$ at a time. To do so more efficiently, we exploit the fact that only the coordinates of $p_Q(q)$ corresponding to neighboring cells may change, and compute the negative-semidefinite approximation in Frobenius norm to the Jacobian.
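To convey the flavor of this balancing step, the sketch below implements the cost-sensitive assignment (4) and adjusts the costs c(q) with a plain fixed-point update: raising the cost of oversized cells and lowering it for undersized ones. This simple rule is our illustrative stand-in for the Levenberg–Marquardt solver actually used in the paper; the function names and the step-size heuristic are ours.

```python
import numpy as np

def cost_sensitive_assign(x, x_hat, c):
    """Modified nearest-neighbor step (4): argmin_q ||x - x_hat(q)||^2 + c(q)."""
    dist2 = ((x[:, None, :] - x_hat[None, :, :]) ** 2).sum(axis=2)
    return (dist2 + c[None, :]).argmin(axis=1)

def balance_costs(x, x_hat, p0, step=None, iters=200):
    """Adjust additive costs c(q) so the empirical cell probabilities
    approach the targets p0(q). The paper solves this nonlinear system
    with Levenberg-Marquardt; here we use a simple fixed-point update."""
    k = len(x_hat)
    c = np.zeros(k)
    if step is None:
        # Heuristic step size: scale updates by the spread of the data.
        step = ((x - x.mean(axis=0)) ** 2).sum(axis=1).mean()
    for _ in range(iters):
        q = cost_sensitive_assign(x, x_hat, c)
        p = np.bincount(q, minlength=k) / len(x)
        c += step * (p - p0)  # oversized cell -> larger cost -> fewer points
    return c
```

Note that for finite data the cell probabilities are a piecewise-constant function of the costs, which is precisely why the paper's finite-difference Jacobian estimation, and the small-cluster extension discussed later, are needed in general.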
3.3. Running-time complexity analysis
We now proceed to investigate the running-time complexity of PCL, the algorithm proposed and adapted in this work for the purpose of quasi-unsupervised learning. Throughout the discussion that follows, $n$ will denote the number of records of a dataset consisting of points in the $d$-dimensional Euclidean space $\mathbb{R}^d$, on which we wish to find $k$ cells or clusters with a minimum size constraint of $s$ points.¹
3.3.1. Theoretical considerations

Because PCL is a fairly sophisticated extension of the k-means method, intuition suggests that any theoretical analysis of its running time will be at the very least as intricate as that of k-means, already fairly involved. For a fixed number of clusters $k$ and dimension $d$, the general problem of finding a partition minimizing the mean squared distance can be solved exactly in time $O(n^{kd+1} \log n)$ in the number of points $n$ [20,21], by bounding the number of iterations by the number $O(n^{kd})$ of distinct Voronoi partitions on $n$ points.
¹ The use of the symbol k here should not be confused with its common use in statistical disclosure control, one of the fields of application of our algorithm, where, in the notion of k-anonymity, k represents the minimum cell size constraint (s in our notation) rather than the number of cells.
In practice, the problem is solved approximately by the k-means or Lloyd algorithm itself, usually very fast on a wide variety of datasets. The difficulty in investigating the complexity of k-means from a theoretical perspective is blatantly obvious from the fact that published studies are hardly conducive to any practical application, as extremely loose upper bounds are reported for very specific synthetic datasets. In an attempt to remedy this discrepancy, "smoothed" running-time studies have been conducted more recently [1,3], but they still yield fairly loose bounds, valid only for specific data. Precisely, [1] shows that for an arbitrary set of n points in [0, 1]^d, if each point is independently perturbed by a zero-mean normal distribution with variance σ², the expected "smoothed" running time of k-means is

O(n^34 k^34 d^8 log^4 n / σ^6),
polynomial in both the number of records n and the number of cells k. An attempt to provide a tighter bound in [3] resorts to n points in an integer lattice {1, …, r}^d, obtaining O(dn^4 r²). Because running times in practice are not adequately approximated by the theoretical bounds, and because some hypotheses on dataset properties are not easily verifiable, some complexity studies incorporate extensive experimentation. For example, the complexity of an efficient implementation of k-means by means of kd-trees [40] is investigated both theoretically and experimentally in [23]. The time complexity of the Levenberg–Marquardt algorithm, also part of our proposal, and of the more general family of variations of the Gauss–Newton method, has also been the object of extensive study [31,39,33]. In any case, neither the theoretical results on the running-time complexity of the k-means method nor those on the Levenberg–Marquardt algorithm are directly applicable to bound the overall complexity of PCL, because of their nontrivial interplay. Similarly, the state-of-the-art SCC algorithm, against which we shall compare PCL, requires an initial clustering step, which in [44] was performed with the k-means algorithm. Next, an integer linear programming problem with nk variables is solved. This type of problem is NP-complete, and the fastest solving algorithms currently known have a worst-case exponential running time of O(2^(nk)). Just as we argued for PCL, these theoretical bounds are hardly representative of performance in practice. Hence, additional, experimental investigation is highly desirable.
3.3.2. Experimental setup
The above theoretical considerations lead us to analyze the running-time complexity of PCL and SCC from an experimental perspective. The analysis in question is carried out for a dataset
Fig. 3. Running time t of PCL versus estimated running time t̂ = τ n^μ s^σ. The proportionality coefficient τ and the exponents μ and σ are adjusted via multiple linear regression on the logarithms of the recorded values.
synthetically generated from n independent, identically distributed drawings of d-dimensional vectors with independent, zero-mean, unit-variance Gaussian entries. The PCL algorithm described in this work and the SCC algorithm of [44] are applied to this dataset, for n ranging from 30,000 to 100,000 in steps of 5000, and d = 2, 3, 5. A minimum cell size constraint s is imposed, for s ranging from 1000 to 5000 in steps of 1000, resulting in a number of cells k = ⌊n/s⌋. The following robust stopping condition is observed: PCL is automatically interrupted whenever the MSE distortion decrement is less than 10⁻⁴ in five consecutive iterations. We consider a power-law model for the overall running time t as a function of n and s. Precisely, we consider the estimator

t̂ = τ n^μ s^σ    (5)

of t from n and s, where the proportionality coefficient τ and the exponents μ and σ are adjusted via multiple linear regression on the logarithms of the recorded values, given by

log t ≃ log τ + μ log n + σ log s.

Recall that, for the best affine MSE estimate of a target variable Y from an observation X, the usual definition of the correlation coefficient ρ for simple linear regression satisfies the property

MSE / σ_Y² = 1 − ρ²,
where σ_Y² denotes the variance of Y. In the multiple regression case, we redefine the absolute value of ρ from said property; it is thus a measure of the ratio between the estimation error and the variance of the target variable. Simply put, values of |ρ| close to 1 indicate that the model t̂ defined in (5) is an adequately approximate representation of the running time t.

3.3.3. Experimental results
Our proof-of-concept implementation of PCL was written in Matlab R2010b, with an integrated MEX C routine to speed up nearest-neighbor search, and run on a modern computer equipped with an Intel Xeon CPU at 2.67 GHz and Windows 7 64-bit. Running times and the results of our regression analysis are reported for the specific case of d = 5 in Fig. 3, in logarithmic axes. The resulting values of |ρ|, μ and σ for the other dimensions analyzed, d = 2 and 3, are very similar. In light of our model (5), the high absolute value of the correlation coefficient |ρ| ≃ 0.9 and the exponent values μ ≃ 2.14 and σ ≃ −1.11 suggest that the
Fig. 4. Running time t of SCC versus estimated running time t̂ = τ n^ν s^σ. The proportionality coefficient τ and the exponents ν and σ are adjusted via multiple linear regression on the logarithms of the recorded values.
running time is roughly of the order of n²/s, and, since k is approximately n/s, t is roughly proportional to nk. Interestingly, n²/s ≃ nk is precisely the computational complexity of the modified nearest-neighbor routine implemented, which simply checks the nearest centroid for each point, and which is invoked a large number of times, both in the inner iterations executing the Levenberg–Marquardt algorithm and in the outer iterations reshaping the cells produced by PCL. Future research may of course take into consideration the possibility of using efficient methods for nearest-neighbor search, such as kd-trees [40], to speed up our algorithm. We would like to conclude this section on the running-time complexity analysis of PCL with a quick comparison against its contender, SCC, applying the same type of experimental regression analysis described above. Our implementation of the SCC algorithm was written in C and, since the algorithm is largely independent of the dimensionality of the data (practically all of the running time is spent solving the ILP problem, which is unaffected by dimensionality), we have used d = 2 for the regression analysis. These experimental results for SCC, plotted in Fig. 4, suggest that the running time of SCC is of the order of approximately n^4.62 s^−1.97 ≃ n^2.65 k^1.97, with accuracy supported by a very high absolute correlation coefficient |ρ| ≃ 0.991. This means that the running time of SCC scales with roughly the square of that of PCL, the latter estimated to be of the order of n²/s ≃ nk. In other words, the scalability of PCL amply surpasses that of SCC for the experiments conducted, a result that is supported by the fact that we had to severely reduce the dataset for SCC to obtain reasonable running times.
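The power-law regression itself is a short computation: take logarithms and fit an affine model by least squares. The following sketch is our own illustration, with synthetic timings generated from hypothetical ground-truth exponents (the recorded running times themselves are not reproduced here); it recovers the exponents and the correlation measure in the manner described above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = rng.integers(30_000, 100_001, size=200).astype(float)
s = rng.integers(1_000, 5_001, size=200).astype(float)
# Hypothetical ground-truth model t = tau * n^mu * s^sigma, with log-normal noise.
tau, mu, sigma = 3e-7, 2.14, -1.11
t = tau * n**mu * s**sigma * np.exp(rng.normal(0.0, 0.05, size=200))

# Multiple linear regression on logarithms: log t = log tau + mu log n + sigma log s.
A = np.column_stack([np.ones_like(n), np.log(n), np.log(s)])
coef, *_ = np.linalg.lstsq(A, np.log(t), rcond=None)
log_tau_hat, mu_hat, sigma_hat = coef

# |rho| redefined from the property MSE / var(Y) = 1 - rho^2.
resid = np.log(t) - A @ coef
abs_rho = np.sqrt(1.0 - resid.var() / np.log(t).var())
```

With 200 samples and mild noise, the recovered exponents are tight and |ρ| is close to 1, mirroring the kind of fit reported for Figs. 3 and 4.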
4. Constrained k-means for quasi-unsupervised learning
In this section we introduce three different modifications of the traditional k-means algorithm for quasi-unsupervised learning. First, as in [44], we assume that clusters have a specified size. Then we relax this restriction and assume that we only have an approximate idea of the cluster sizes, but that we know an archetype, a possible centroid, of each cluster. Finally, we operate under the assumption that a small portion of the available data points are labeled, information which is exploited in the classification of the entire dataset.
4.1. Exact cluster sizes
Ideally, we wish to find a clustering and a reconstruction function that jointly minimize the distortion subject to the cluster size constraints defined by the user as a parameter. The conventional k-means method, reviewed in Section 2.2, is essentially an alternating optimization algorithm that iterates between the nearest-neighbor (1) and the centroid optimality (2) conditions. These are necessary but not sufficient conditions; thus the algorithm can only hope to approximate a jointly optimal pair q*(x), x̂(q), as it really only guarantees that the sequence of distortions is nonincreasing. We also mentioned that, experimentally, the k-means algorithm very often shows excellent performance [15, Sections II.E and III]. Bear in mind that our modification of the nearest-neighbor condition (4) for the probability-constrained problem is a heuristic proposal, in the sense that this work does not prove it to be a necessary condition. We still use the same alternating optimization principle, albeit with a more sophisticated nearest-neighbor condition (4), and define the following modified k-means method for probability-constrained quantization:
1. Choose an initial reconstruction function x̂(q) and initial cost function c(q).
2. Update c(q) to satisfy the probability constraints p_Q(q) = p_0(q), given the current x̂(q). To this end, use the method described at the end of Section 3.2, setting the initial cost function as the cost function at the beginning of this step.
3. Find the next clustering q(x) corresponding to the current x̂(q) and the current c(q), according to Eq. (4).
4. Find the optimal x̂(q) corresponding to the current q(x), according to Eq. (3).
5. Go back to Step 2, until an appropriate convergence condition is satisfied or a maximum number of iterations is exceeded.
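The five steps above can be sketched as follows. This is a minimal illustration in which the Levenberg–Marquardt update of Step 2 is replaced by a simpler damped fixed-point correction of the costs (raising the cost of over-full cells), which suffices for smooth, well-populated data; the function name, parameters and data are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def constrained_kmeans(X, p0, outer=20, inner=100, eta=1.0, seed=0):
    """Probability-constrained k-means, Steps 1-5 (sketch). Step 2 uses a damped
    fixed-point cost update in place of the Levenberg-Marquardt solver."""
    rng = np.random.default_rng(seed)
    k = len(p0)
    xhat = X[rng.choice(len(X), size=k, replace=False)]  # Step 1: initial x^(q)
    c = np.zeros(k)                                      # Step 1: initial c(q) = 0
    for _ in range(outer):
        for _ in range(inner):                           # Step 2: drive p_Q toward p_0
            d2 = ((X[:, None, :] - xhat[None, :, :]) ** 2).sum(-1)
            q = np.argmin(d2 + c, axis=1)                # Step 3: Eq. (4)
            pQ = np.bincount(q, minlength=k) / len(X)
            c += eta * (pQ - p0)                         # over-full cells get costlier
        for j in range(k):                               # Step 4: centroids, Eq. (3)
            if (q == j).any():
                xhat[j] = X[q == j].mean(axis=0)
    return q, xhat, c                                    # Step 5: fixed iteration budget

# Usage: three Gaussian blobs of unequal sizes, uniform target probabilities.
rng = np.random.default_rng(42)
X = np.concatenate([rng.normal(m, 1.0, size=(sz, 2))
                    for m, sz in [((0, 0), 1200), ((8, 0), 1000), ((0, 8), 800)]])
q, xhat, c = constrained_kmeans(X, p0=np.full(3, 1 / 3))
```

Despite the natural 1200/1000/800 imbalance, the additive costs steer the cell boundaries until the three cells hold roughly a third of the points each.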
The initial reconstruction values may simply be chosen as |Q| random samples distributed according to the probability distribution p_X(x) of X. A simple yet effective cost function initialization is c(q) = 0, because it prevents empty clusters. Note that the numerical computation of c(q) in Step 2 should benefit from better and better initializations as the reconstruction values become stable. That stability may be improved during the initial steps of the algorithm by replacing previous reconstruction values with convex combinations weighting both the updated reconstruction values and the previous ones, instead of directly replacing the previous value with the updated one, thus slowing the update process and improving the validity of previous costs as initializations to the Levenberg–Marquardt method. If the probability of a cell happens to vanish at any point of the algorithm, this can be tackled by choosing a new, random reconstruction value, with cost equal to the minimum of the remaining costs, or by splitting the largest cell into two, with similar centroids and costs, in order to keep the same number of clusters. The stopping convergence condition might, for instance, consider slowdowns in the sequence of distortions obtained. For example, the algorithm can be stopped the first time a minimum relative distortion decrement is not attained, or, more simply, when a maximum number of iterations is exceeded.

4.2. Approximate cluster sizes and centroids

In scenarios where it is not possible to obtain the exact number of objects in each of the classes, but it is still possible to have an estimate of these values, we might be able to define a "representative" member of each class as in [29], which we call an archetype, or at least provide an approximation of this representative member. When both pieces of information are available, they can be
incorporated into the algorithm defined in the previous section as follows:
1. Given a representative for each cluster q, assign it as the reconstruction point x̂(q). Set the initial cost function c(q) (typically c(q) = 0).
2. Update c(q) to satisfy the probability constraints p_Q(q) = p_0(q), given the current x̂(q).
3. Find the next clustering q(x) corresponding to the current x̂(q) and the current c(q), according to Eq. (4).
The idea is to use the representative point (archetype) of each cluster as its reconstruction point and then perform a single outer iteration of the constrained k-means algorithm. Although the archetypes could simply have been used as initial reconstruction points without further modification of the algorithm, if several iterations are allowed, the alternating optimization procedure typically produces a drift in the reconstruction points that can yield a final reconstruction point quite different from the archetype. Since we assume the archetypes summarize the properties of the class they represent, that is, they are not very far from the centroid of the class, it is a better approach to freeze the reconstruction points. To simulate the fact that we have approximate knowledge of the cluster sizes and the archetypes, in our experiments (Section 5) we have used 10% of the records, randomly selected, as a sample to estimate the sizes of the clusters as well as the archetypes. The approximate cluster sizes and archetypes are then simply passed to the algorithm, so the labels of the records in the sample are never directly used by the algorithm.
4.3. Some training labeled records
In datasets in which classes are quite mixed, that is, in which the convex hulls of the points of each class have intersection zones containing a significant number of records, the previous strategies might yield quite rough classifications. In these contexts, some additional information can increase the performance of the classifier. In particular, as in [17], we assume in this case that we have a small sample of records that have already been classified. This can be the case, for instance, when there are too many records to be analyzed by humans, but a small fraction can be classified by hand. In contrast with the strategy presented in Section 4.2, in which the cluster sizes and archetypes were approximated using a 10% sample but the algorithm was not aware of the records in the sample, this time we assume that we have access to these records, so that the algorithm is aware of the class of each of these sample points. Instead of performing a classification looking for |Q| different classes, we perform a classification with a number of classes equal to the size of the sampling set (in our experiments, 10%). For instance, in a dataset containing 150 records belonging to three different classes, the sampling set will contain 15 records. Instead of running the algorithm with |Q| = 3, we assume each of the sample records is used as the archetype of a distinct class, thus |Q| = 15. The probabilities of all of these classes are the same, so that p_0(q) = p_0(q′) for all q and q′, and we use the sample point associated with each of these classes as the initial reconstruction point of the class. In our example, each of the 15 clusters would contain 150/15 = 10 records, thus p_0(q) = 10/150.
Once this initial setup has been performed, the c(q) coefficients are computed so that the probability constraints p_Q(q) = p_0(q) are satisfied, and then all the records classified in the same cluster are assigned to the class of the sample point associated with that cluster. For instance, in our running example,
H(V) = −∑_{j=1}^{k} (n_j/n) log (n_j/n), with n_j = ∑_{i=1}^{c} n_{ij}. Since the algorithms used have a predefined common number of clusters, in this scenario c = k.
Table 1
Datasets from UCI [12] used in the experiments.

Dataset        Objects  Classes  Attributes  Cluster sizes
Iris             150       3         4       (50, 50, 50)
Wine             178       3        13       (71, 59, 48)
Balance scale    625       3         4       (288, 288, 49)
Ionosphere       351       2        33       (226, 125)
WDBC             569       2        30       (357, 212)
Crx              666       2         6       (367, 299)
Pima             768       2         9       (500, 268)
Bands            366       2        19       (230, 135)
after the computation of the c(q) coefficients, each of the clusters would effectively contain ten records, which would all be assigned to the same original class as the archetype (sample point) of the cluster.
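The running example can be sketched end to end as follows; the synthetic 150-record dataset and the simple proportional cost update (in place of the Levenberg–Marquardt step) are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
labels = np.repeat([0, 1, 2], 50)                 # 150 records, three classes
X = (rng.normal(scale=0.5, size=(150, 2))
     + np.array([[0, 0], [4, 0], [0, 4]])[labels])

sample = rng.choice(150, size=15, replace=False)  # 10% labeled sample: |Q| = 15
xhat = X[sample].copy()                           # one archetype per sample record
p0 = np.full(15, 10 / 150)                        # p0(q) = 10/150 for every q

c = np.zeros(15)
for _ in range(400):                              # balance the 15 micro-clusters
    d2 = ((X[:, None, :] - xhat[None, :, :]) ** 2).sum(-1)
    q = np.argmin(d2 + c, axis=1)
    c += np.bincount(q, minlength=15) / 150 - p0
pred = labels[sample][q]                          # cluster -> label of its archetype
accuracy = (pred == labels).mean()
```

Each record inherits the label of the sample point whose micro-cluster absorbed it, which is exactly the final assignment step described above.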
5. Experimental evaluation
In this section, we describe the experiments we have carried out to test the performance of the proposed k-means modifications. The main conclusion we may draw from these experiments is that our modifications of the k-means algorithm are perfectly suitable candidates in situations where some prior knowledge is available, as in the quasi-unsupervised learning scenario. First, we describe the datasets considered in our experiments; second, we explain the performance measures used. Next, we show the results obtained using our proposal, as well as those using the state-of-the-art SCC algorithm proposed in [44] and briefly reviewed in Section 1.1.
5.1. Datasets and description of performance measures
We have conducted a large number of experiments to validate our approach on eight UCI datasets from [12], whose main characteristics are summarized in Table 1. In particular, we provide the number of objects (records or data points), classes (categories) and attributes (dimensions) of each dataset, as well as the sizes of the classes. These datasets are exactly those used in [44]. To compare the quality of the clusterings obtained by all the algorithms, we have used the following metrics of clustering performance for classification:
• Adjusted Rand index (ARI) [19], computed as the number of pairs of objects such that (i) both are assigned to the same cluster and have the same label, or (ii) they are assigned to different clusters and have different labels, divided by the total number of pairs. This metric always gives a value between 0 and 1, and higher values indicate higher agreement between the clustering and the class labels.
• Normalized mutual information (NMI) [37], which quantifies the amount of information shared by two random variables representing the cluster assignments (variable U) and the class labels (variable V). It is computed as the mutual information I(U, V) normalized by the geometric mean of the entropies H(U) and H(V); the precise formula is given below.
5.2. Experimental results
The three variations of the size-constrained k-means (CKM) method developed and experimentally analyzed are listed in Table 2, and Table 3 reports the results obtained in the experiments. The ARI and NMI values for the CKM variants are the average of three executions. Among these methods, CKMb and CKMt showed almost no variability across executions despite the randomness of some of the initial choices (the initial reconstruction points for CKMb and the 10% sample used in CKMt), while CKMl showed a significant degree of variability. From these results, we may draw the following conclusions:
• In general, the performance of the CKM algorithms increases as the available prior knowledge increases. This effect is common to all the considered datasets except Wine, for both measures, and demonstrates that the CKM algorithms are able to profit from the available prior knowledge. This is convincing evidence that our modifications of the k-means algorithm are useful candidates in several quasi-unsupervised learning scenarios.
• At least one CKM configuration outperforms SCC on every dataset. This is possible thanks to the great flexibility of CKM in the kinds of prior knowledge it can exploit: as the considered datasets have very different characteristics, the different forms of prior knowledge allow CKM to reach higher performance than SCC.
Table 2
Algorithmic variations of the k-means method implemented and experimentally analyzed.

Algorithm  Description                                                Sec.
CKMb       Random centroid initialization and moving centroids        4.1
CKMt       Approximate cluster sizes and centroids from a 10% sample  4.2
CKMl       Preassigned clustering of 10% of labeled records           4.3
Table 3
Comparison of the clustering quality between the heuristic size-constrained clustering (SCC) of [44] and our size-constrained k-means algorithm, in the three variations described in Table 2.

Dataset         Alg    ARI     NMI
Iris            SCC    0.6416  0.6750
                CKMb   0.7859  0.7773
                CKMt   0.7860  0.7782
                CKMl   0.5939  0.6419
Wine            SCC    0.3863  0.4383
                CKMb   0.7693  0.7818
                CKMt   0.6756  0.6655
                CKMl   0.5648  0.5965
Balance scale   SCC    0.1389  0.0931
                CKMb   0.0762  0.0715
                CKMt   0.3846  0.3620
                CKMl   0.3882  0.2803
Ionosphere      SCC    0.4056  0.2926
                CKMb   0.0216  0.0616
                CKMt   0.2047  0.1669
                CKMl   0.4729  0.3538
WDBC            SCC    0.1172  0.1678
                CKMb   0.2156  0.3134
                CKMt   0.5312  0.5103
                CKMl   0.7129  0.5917
Crx             SCC    0.1989  0.1461
                CKMb   0.1441  0.1043
                CKMt   0.2127  0.1608
                CKMl   0.1629  0.1172
Pima            SCC    0.0365  0.0138
                CKMb   0.1930  0.1179
                CKMt   0.1704  0.1421
                CKMl   0.1762  0.1085
Bands           SCC    0.0528  0.0268
                CKMb   0.0083  0.0027
                CKMt   0.0236  0.0207
                CKMl   0.0827  0.0422
Note that these measures are independent of the prior knowledge available to the algorithms. The values for SCC were extracted directly, as reported, from the paper in which it was proposed [44].
NMI = I(U, V) / √(H(U)H(V)),

where I(U, V) is the mutual information between the cluster assignments and the class labels, defined as ∑_{i=1}^{c} ∑_{j=1}^{k} (n_{ij}/n) log (n n_{ij} / (n_i n_j)), where n_{ij} stands for the number of objects assigned to cluster i with label j, n for the total number of objects, c for the number of clusters and k for the number of classes (labels). H(U) and H(V) are the corresponding entropies, defined as H(U) = −∑_{i=1}^{c} (n_i/n) log (n_i/n), where n_i = ∑_{j=1}^{k} n_{ij}, and H(V) is defined analogously.
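The NMI formula can be computed directly from the contingency table; the sketch below uses natural logarithms and assumes both labelings have at least two distinct values (so neither entropy vanishes).

```python
import numpy as np

def nmi(u, v):
    """Normalized mutual information NMI = I(U, V) / sqrt(H(U) H(V))."""
    u, v = np.asarray(u), np.asarray(v)
    n = len(u)
    _, ui = np.unique(u, return_inverse=True)
    _, vi = np.unique(v, return_inverse=True)
    nij = np.zeros((ui.max() + 1, vi.max() + 1))
    np.add.at(nij, (ui, vi), 1)                    # contingency table n_ij
    ni, nj = nij.sum(axis=1), nij.sum(axis=0)      # cluster and class totals
    mask = nij > 0                                 # 0 log 0 terms contribute nothing
    I = (nij[mask] / n * np.log(n * nij[mask] / np.outer(ni, nj)[mask])).sum()
    Hu = -(ni / n * np.log(ni / n)).sum()
    Hv = -(nj / n * np.log(nj / n)).sum()
    return I / np.sqrt(Hu * Hv)
```

A labeling identical up to renaming of the clusters scores 1, while independent labelings score near 0. For the ARI, a standard implementation such as sklearn.metrics.adjusted_rand_score may be used.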
• CKMb obtains the best ARI and NMI results in two of the eight considered datasets, largely outperforming SCC in the Iris, Wine, WDBC and Pima datasets. Specifically, in the Wine, Pima and WDBC datasets, CKMb vastly outperforms SCC (over 100% improvement) using exactly the same kind of prior knowledge. In the Balance scale dataset, SCC obtains slightly better performance than CKMb, but CKMt and CKMl reach higher ARI and NMI values than SCC.
• The Ionosphere dataset has some peculiarities, and the results obtained with it deserve closer study. The clusters of this dataset are completely mixed; for this reason, the results obtained by CKMb, which creates geometrically convex cells, are worse than those obtained by SCC. However, since in the Ionosphere dataset very similar, close records share the same label, CKMl manages to outperform SCC, with an improvement of around 15%. Note that CKMl considers a small subset of record labels as its prior knowledge and can therefore take advantage of this fact. Indeed, this issue is well known in the classification community, and it is the reason why the Ionosphere dataset has been widely used to simulate classification scenarios.
• SCC and the CKM variants yield poor results with the Balance scale dataset because the records of this dataset are nearly equidistant. This makes the task of building compact clusters more challenging; therefore, the results of all the methods are not as good as with other datasets such as Iris.
6. Conclusion and future work
In this paper we have described the traditional k-means clustering algorithm from the point of view of distortion-optimized quantization, stressing the importance of the necessary optimality conditions it is built upon. Then, we have formulated the size-constrained clustering problem and shown how to accommodate cluster size constraints in the classical k-means algorithm. Specifically, the introduction of additive weights in the nearest-neighbor condition enables us to control cell sizes while remaining faithful to the optimality properties of the original algorithm. In order to adjust these additive weights, we have resorted to the Levenberg–Marquardt algorithm, a sophisticated variation of the Gauss–Newton method for numerical optimization. Next, we have argued that incorporating size constraints when clusters are built can be regarded as a form of prior knowledge. This reasoning allows us to consider k-means as a clustering method for quasi-unsupervised scenarios. Additionally, we have shown that other types of prior knowledge can be integrated into k-means; specifically, we have considered the scenario where some cluster archetypes, or frozen cluster centroids, are known, and the scenario where some labeled records, the training dataset, are available. In addition to evaluating its distortion performance, the running-time complexity of our proposal has been assessed experimentally by means of a power-law regression analysis, suggesting that the overall running time of our clustering algorithm is of the order of nk, with n the number of points in the dataset and k the number of cells, for several fixed dimensions.
Finally, in the experimental part of this work we have tested our proposal on well-known and widely accepted real datasets extracted from the UCI repository, and we have compared our results against a state-of-the-art size-constrained clustering method, SCC, showing that our k-means modifications are capable of attaining lower distortion, thereby outperforming it, on various standardized datasets. As future work, we intend to investigate an extension exploiting the fact that inequality constraints offer more flexibility,
which may lead to a distortion improvement in applications where strict equality is not required.
Acknowledgments
This work was partly supported by the Spanish Government through projects Consolider Ingenio 2010 CSD2007-00004 "ARES" and TEC2010-20572-C02-02 "Consequence", and by the Government of Catalonia under Grant 2009 SGR 1362. D. Rebollo-Monedero is the recipient of a Juan de la Cierva postdoctoral fellowship, JCI-200905259, from the Spanish Ministry of Science and Innovation.
References
[1] D. Arthur, B. Manthey, H. Roeglin, k-Means has polynomial smoothed complexity, in: Proc. IEEE Annual Symp. Found. Comput. Sci. (FOCS), Atlanta, GA, October 2009, pp. 1157–1160.
[2] L. Bai, J. Liang, C. Dang, An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data, Knowl.-Based Syst. 24 (6) (2011) 785–795.
[3] A. Bhowmick, A theoretical analysis of Lloyd's algorithm for k-means clustering, 2009.
[4] M. Bilenko, S. Basu, R.J. Mooney, Integrating constraints and metric learning in semi-supervised clustering, in: Proc. Int. Conf. Mach. Learn. (ICML), Banff, Alberta, Canada, July 2004, pp. 81–88.
[5] C. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, New York, 2006.
[6] O.B. Chapelle, B. Schölkopf, A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, MA, 2006.
[7] V. Cross, Fuzzy semantic distance measures between ontological concepts, in: Proc. N. Amer. Fuzzy Inform. Process. Soc. (NAFIPS), 2004, pp. 236–240.
[8] J. Domingo-Ferrer, J.M. Mateo-Sanz, Practical data-oriented microaggregation for statistical disclosure control, IEEE Trans. Knowl. Data Eng. 14 (1) (2002) 189–201.
[9] J. Domingo-Ferrer, V. Torra, Ordinal, continuous and heterogeneous k-anonymity through microaggregation, Data Min. Knowl. Disc. 11 (2) (2005) 195–212.
[10] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley, New York, 2001.
[11] M. Fisher, K. Rajaram, Accurate retail testing of fashion merchandise: methodology and application, J. Market. Sci. 19 (3) (2000).
[12] A. Frank, A. Asuncion, UCI machine learning repository, Univ. California, Irvine, Sch. Inform., Comput. Sci., 2010.
[13] H. Frigui, R. Krishnapuram, A robust competitive clustering algorithm with applications in computer vision, IEEE Trans. Pattern Anal. Mach. Intell. 21 (5) (1999) 450–465.
[14] A. Gersho, R.M. Gray, Vector Quantization and Signal Compression, Kluwer Academic Publishers, Boston, MA, 1992.
[15] R.M. Gray, D.L. Neuhoff, Quantization, IEEE Trans. Inform. Theory 44 (1998) 2325–2383.
[16] S.K. Gupta, K.S. Rao, V. Bhatnagar, k-Means clustering algorithm for categorical attributes, in: Proc. Data Warehous. Knowl. Disc. (DaWaK), Lecture Notes Comput. Sci. (LNCS), vol. 1676, Springer-Verlag, Florence, Italy, 1999, pp. 203–208.
[17] T. Huang, Y. Yu, G. Guo, K. Li, A classification algorithm based on local cluster centers with a few labeled training examples, Knowl.-Based Syst. 23 (6) (2010) 563–571.
[18] Z. Huang, Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Min. Knowl. Disc. 2 (3) (1998) 283–304.
[19] L. Hubert, P. Arabie, Comparing partitions, J. Classif. 2 (1) (1985) 193–218.
[20] M. Inaba, N. Katoh, H. Imai, Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering, in: Proc. ACM Symp. Comput. Geom., 1994, pp. 332–339.
[21] M. Inaba, N. Katoh, H. Imai, Variance-based k-clustering algorithms by Voronoi diagrams and randomization, IEICE Trans. Inform. Syst. E83-D (6) (2000) 1199–1206.
[22] F. Jacquenet, C. Largeron, Discovering unexpected documents in corpora, Knowl.-Based Syst. 22 (6) (2009) 421–429.
[23] T. Kanungo, D.M. Mount, N.S. Netanyahu, C.D. Piatko, R. Silverman, A.Y. Wu, An efficient k-means clustering algorithm: analysis and implementation, IEEE Trans. Pattern Anal. Mach. Intell. 24 (7) (2002) 881–892.
[24] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, 2005.
[25] M. Li, L. Zhang, Multinomial mixture model with feature selection for text clustering, Knowl.-Based Syst. 21 (7) (2008) 704–708.
[26] S.P. Lloyd, Least squares quantization in PCM, IEEE Trans. Inform. Theory IT-28 (1982) 129–137.
[27] J.B. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proc. Berkeley Symp. Math. Stat. Prob., vol. I, Berkeley, CA, 1967, pp. 281–297.
[28] D. Marquardt, An algorithm for least-squares estimation of nonlinear parameters, SIAM J. Appl. Math. 11 (1963) 431–441.
Q1 Please cite this article in press as: D. Rebollo-Monedero et al., A modification of the k-means method for quasi-unsupervised learning, Knowl. Based Syst. (2012), http://dx.doi.org/10.1016/j.knosys.2012.07.024
[29] S. Martínez, A. Valls, D. Sánchez, Semantically-grounded construction of centroids for datasets with textual attributes, Knowl.-Based Syst., in press.
[30] J. Max, Quantizing for minimum distortion, IEEE Trans. Inform. Theory 6 (1) (1960) 7–12.
[31] J.J. Moré, The Levenberg–Marquardt algorithm: implementation and theory, in: G.A. Watson (Ed.), Numerical Analysis, Lecture Notes Math., vol. 630, Springer-Verlag, 1977, pp. 105–116.
[32] A. Ng, CS229 course on machine learning, Stanford Univ., 2011.
[33] A.-H. Phan, P. Tichavsky, A. Cichocki, Low complexity damped Gauss–Newton algorithms for CANDECOMP/PARAFAC, SIAM J. Matrix Anal. Appl. (SIMAX), submitted for publication.
[34] D. Rebollo-Monedero, Quantization and transforms for distributed source coding, Ph.D. dissertation, Stanford Univ., 2007.
[35] D. Rebollo-Monedero, J. Forné, M. Soriano, An algorithm for k-anonymous microaggregation and clustering inspired by the design of distortion-optimized quantizers, Data Knowl. Eng. 70 (10) (2011) 892–921.
[36] H. Steinhaus, Sur la division des corps matériels en parties, Bull. Pol. Acad. Sci. IV (12) (1956) 801–804.
[37] C. Studholme, D.L.G. Hill, D.J. Hawkes, An overlap invariant entropy measure of 3D medical image alignment, Pattern Recognit. 32 (1) (1999) 71–86.
[38] L. Sweeney, k-anonymity: a model for protecting privacy, Int. J. Uncertain. Fuzz. Knowl.-Based Syst. 10 (5) (2002) 557–570.
[39] K. Ueda, N. Yamashita, On a global complexity bound of the Levenberg–Marquardt method, J. Optim. Theory Appl. 147 (2010) 443–453.
[40] I. Wald, V. Havran, On building fast kd-trees for ray tracing, and on doing that in O(N log N), in: Proc. IEEE Symp. Interact. Ray Trac., 2006, pp. 61–69.
[41] L. Willenborg, T. DeWaal, Elements of Statistical Disclosure Control, Springer-Verlag, New York, 2001.
[42] R. Xu, D. Wunsch II, Survey of clustering algorithms, IEEE Trans. Neural Netw. 16 (3) (2005) 645–678.
[43] X. Xu, J. Jäger, H.-P. Kriegel, A fast parallel clustering algorithm for large spatial databases, in: Y. Guo, R. Grossman (Eds.), High Performance Data Mining: Scaling Algorithms, Applications and Systems, Springer-Verlag, 2002, pp. 263–290.
[44] S. Zhu, D. Wang, T. Li, Data clustering with size constraints, Knowl.-Based Syst. 23 (8) (2010) 883–889.