KNOSYS 2363 · No. of Pages 10, Model 5G · 12 October 2012
Knowledge-Based Systems xxx (2012) xxx–xxx
Contents lists available at SciVerse ScienceDirect
Knowledge-Based Systems — journal homepage: www.elsevier.com/locate/knosys
A modification of the k-means method for quasi-unsupervised learning
David Rebollo-Monedero a, Marc Solé b, Jordi Nin b, Jordi Forné a,*

a Department of Telematics Engineering, Technical University of Catalonia (UPC), E-08034 Barcelona, Spain
b Department of Computer Architecture, Technical University of Catalonia (UPC), E-08034 Barcelona, Spain

Article info
Article history: Received 28 November 2011. Received in revised form 30 July 2012. Accepted 31 July 2012. Available online xxxx.
Keywords: k-Means method; Quasi-unsupervised learning; Constrained clustering; Size constraints
Abstract. Since the advent of data clustering, the original formulation of the clustering problem has been enriched to incorporate a number of twists to widen its range of application. In particular, recent heuristic approaches have proposed to incorporate restrictions on the size of the clusters, while striving to minimize a measure of dissimilarity within them. Such size constraints effectively constitute a way to exploit prior knowledge, readily available in many scenarios, which can lead to an improved performance in the clustering obtained. In this paper, we build upon a modification of the celebrated k-means method resorting to a similar alternating optimization procedure, endowed with additive partition weights controlling the size of the partitions formed, adjusted by means of the Levenberg–Marquardt algorithm. We propose several further variations on this modification, in which different kinds of additional information are present. We report experimental results on various standardized datasets, demonstrating that our approaches outperform existing heuristics for size-constrained clustering. The running-time complexity of our proposal is assessed experimentally by means of a power-law regression analysis. © 2012 Elsevier B.V. All rights reserved.
1. Introduction
Essentially, data clustering aims to create groups of similar objects, while keeping different objects in separate groups, according to a certain quantifiable measure of similarity. Clustering is commonly used in a large variety of domains, including machine learning, data mining, pattern recognition, image analysis, information retrieval, and bioinformatics. In machine learning, for example, one frequently deals with the important problem of classification, in which a training dataset of observations, tagged or labeled with categories of interest, is employed by an algorithm allowing a classifier to learn via inductive inference to automatically categorize new unlabeled data. Techniques that in this context employ both the data and the tags of the training set are referred to as supervised learning [5,32]. At the opposite extreme is unsupervised learning [10], covering techniques which face a complete unavailability of labels in the training data, and must therefore resort only to properties of the data itself in order to partition it, guided by a preestablished metric of dissimilarity or distortion. Unsupervised clustering may also be of great value in reducing the complexity of overwhelmingly fine-grained data, thus facilitating the training and operation of a subsequent supervised classifier.
* Corresponding author. E-mail address: [email protected] (J. Forné).
In a middle ground lies the case when only a fraction of the data is labeled, and techniques known as semisupervised learning [6] are then most suitable, but by no means exclusive. In this realistic case, the unsupervised preprocessing just described may come in handy to take advantage of the properties of the large portion of untagged data. Ultimately, this may improve the performance of a supervised classifier, trained on the smaller portion of tagged data, preprocessed in the same manner. Such improvement may come in the form of reduced overfitting issues, as described in [2]. A fundamental question, which we shall address in this paper, is whether an unsupervised learning technique may be slightly modified, retaining the convenient low complexity stemming from the essence of its operation on the data, while incorporating only a small amount of information about the labels, in order to finally improve the suitability of the partitions obtained. In accordance with the provision of such partial label information, we shall rightfully call such a modified technique, potentially quite cost-effective, quasi-unsupervised. A practically simple piece of information about the data, possibly computed from labeled data or known by any other means, is the relative frequency of each category of interest in the classification. A conceptually simple way of incorporating such information would consist in introducing, in a thus modified clustering algorithm, constraints on the sizes of the partitions, in keeping with the prevalence of the data labels. The algorithm should then meet those size constraints, while striving to minimize a given measure of dissimilarity within the clusters, a measure based solely on the geometric properties of the unlabeled data.
0950-7051/$ - see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.knosys.2012.07.024
Please cite this article in press as: D. Rebollo-Monedero et al., A modification of the k-means method for quasi-unsupervised learning, Knowl. Based Syst. (2012), http://dx.doi.org/10.1016/j.knosys.2012.07.024
Finally, the k-means method [27] has undoubtedly played a preeminent role in the field of unsupervised learning, particularly as a feature-extraction strategy prior to classification, and more specifically as an algorithm to cluster unlabeled data, according to a distortion metric, commonly mean squared Euclidean distance, between cluster points and a cluster representative value. We postulate that incorporating even small pieces of information regarding the labeling of some of the data provided may noticeably improve classification performance. This raises the question of how the k-means method may be modified not only to minimize the overall distortion due to clustering, but also to observe appropriate cluster size constraints, or similarly simple restrictions reflecting partial label information.
1.1. Overview of the k-means method and size-constrained clustering
There exists extensive literature on algorithms for unsupervised clustering [13,10,43,24,42], with the k-means method [27] being one of the most popular choices. This algorithm satisfies several relevant properties and has numerous variants [18,16,4,34]; most notably, k-means iterates by alternatingly fulfilling two necessary optimality conditions in the minimization of the distortion incurred by replacing the clusters formed by a common representative data point [14,15]. This algorithm has been rediscovered several times in the statistical clustering literature [15], under various names. The term k-means was first used by MacQueen in [27], although the idea goes back to Steinhaus in [36]. The algorithm was rediscovered independently by Lloyd in 1957 as a quantization technique for pulse-code modulation, but it was not published until much later, in 1982 [26]. Accordingly, in the field of data compression, the algorithm is often named the Lloyd algorithm, but also the Lloyd–Max algorithm [30], among other names. Regardless of the research area and the algorithm name, in many real-world problems, for instance document clustering [25,22] and localization of merchandise assortment [11], to name a few, practitioners have some background knowledge about the number of clusters and their approximate size. This kind of knowledge may be incorporated into the original k-means, or any of its variants, by adding size constraints to the data clustering problem. Of course, size constraints are a relevant feature because algorithms able to use such information may lead to better clusters than traditional algorithms not designed to take advantage of it, and the potential applications of size-constrained clustering algorithms abound.
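As a concrete illustration of the alternating optimization just described, the following minimal Python sketch iterates the assignment and centroid steps under MSE distortion until the relative distortion reduction falls below a threshold. This is our own illustrative code, not the implementation used in the paper; all names are ours.

```python
import numpy as np

def k_means(x, k, tol=1e-4, seed=0):
    """Minimal k-means sketch: alternate the nearest-neighbor and centroid
    steps until the relative distortion reduction drops below tol."""
    rng = np.random.default_rng(seed)
    # Initialize the reconstructions with k distinct data points.
    x_hat = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    prev_d = None
    while True:
        # Nearest-neighbor step: assign each point to its closest centroid.
        dist2 = ((x[:, None, :] - x_hat[None, :, :]) ** 2).sum(axis=2)
        q = dist2.argmin(axis=1)
        d = dist2[np.arange(len(x)), q].mean()  # current MSE distortion
        if prev_d is not None and prev_d - d <= tol * prev_d:
            return q, x_hat, d
        prev_d = d
        # Centroid step: move each reconstruction to its cluster's mean.
        for j in range(k):
            if np.any(q == j):
                x_hat[j] = x[q == j].mean(axis=0)
```

Because each step can only decrease the distortion, the stopping criterion is guaranteed to trigger eventually on finite data, although, as discussed later, convergence of the distortion does not by itself guarantee a jointly optimal clustering.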
Among many others, size-constrained clustering algorithms may find applications in k-anonymous microaggregation [38,8,9], and in a variety of resource-allocation and operations research problems, such as the one mentioned [11], particularly applications of similarity-based allocation of resources or workload according to predetermined volume constraints. In the context of our application of interest, namely quasi-unsupervised learning, we focus our attention on a recent heuristic addressing the size-constrained variation of the clustering problem. Specifically, we shall compare our proposal with a heuristic method called the size-constrained clustering (SCC) algorithm, proposed in 2010 by Zhu et al. [44]. SCC is designed to solve the data clustering problem with size constraints, under the assumption that those sizes are available as one of the types of prior knowledge on the labels we have introduced. Although the authors report an excellent performance for SCC compared to the vanilla k-means method, their approach does not inherit the distortion-optimality properties of k-means, in the sense that the necessary nearest-neighbor condition of [14,15] would not be satisfied if the constraints were removed. On the other hand, in our own previous work [35], a significant modification of the k-means algorithm is proposed to address constraints on cell probabilities, while striving to inherit the optimality characteristics of this method, named, accordingly, the probability-constrained Lloyd (PCL) algorithm. In the cited work, PCL is proposed strictly as a heuristic method, merely pointing out the formal similarities between the necessary optimality conditions in conventional quantization and the modified conditions, without further theoretical analysis. Although PCL outperforms the state-of-the-art contender in k-anonymous microaggregation for statistical disclosure control [41], the algorithm is only analyzed in said field, without contemplating its potential in the area of quasi-unsupervised learning, nor adapting it to other kinds of side information on the labels. To the best of our knowledge, there has been no investigation on whether the desirable properties of this size-constrained variation of the k-means algorithm, PCL, could result in more compact clusters than the ones obtained with the SCC heuristic, and on whether such compactness would prove beneficial from the classification point of view.
1.2. Contribution and organization
Motivated by the optimality properties of the k-means algorithm and its long history of application across numerous fields, in this paper, we put forth a heuristic method for size-constrained clustering, loosely inspired by this celebrated algorithm. Precisely, the contribution of this work is twofold. First, we propose the application of a very recent algorithm, namely the aforementioned probability-constrained Lloyd algorithm, originally formulated for k-anonymous microaggregation, to quasi-unsupervised learning. More specifically, after certain adjustments, we demonstrate that PCL is suited to the problem of size-constrained clustering, despite the completely different context. Effectively, PCL is a substantial modification of the k-means method, a modification with excellent performance in its original field. We experimentally analyze its performance against the SCC algorithm, mentioned above, a state-of-the-art method in the field of quasi-unsupervised learning, on various standardized datasets. Our second contribution consists in further modifying the PCL algorithm beyond its ability to perform clustering in a manner analogous to the k-means algorithm while satisfying cluster-size constraints. The modifications in this work endow said algorithm with the possibility of initializing the reconstruction centroids from (a subset of) labeled data, when such information is available, and also with the possibility of fixing the cluster assignment of a small subset of labeled data while clustering the rest of the data, which is unlabeled. The detailed experimental evaluation we conduct on PCL strongly supports its consideration as a state-of-the-art candidate for quasi-unsupervised learning, not only for its intended field of application.
Furthermore, the three variations proposed here help address various scenarios concerning the partial availability of information regarding the labels of the data, thus widely extending the applicability range of the celebrated k-means method. Our experimental results confirm that our proposal is capable of outperforming the state-of-the-art SCC method, yielding more compact clusters and higher-quality classification. In addition to evaluating its distortion performance, the running-time complexity of our proposal is assessed experimentally by means of a power-law regression analysis. The rest of the paper is organized as follows. Section 2 is devoted to a more formal overview of the traditional k-means method, where after a few statistical preliminaries, the two necessary conditions for distortion-optimal clustering are laid out. Inspired by those optimality conditions, Section 3 mathematically formulates the application of the aforementioned PCL heuristic to quasi-unsupervised clustering with prior knowledge related to cluster sizes, and provides a running-time complexity analysis. In Section 4, we describe our three variations of said heuristic, widening the types of prior knowledge considered. Our k-means
Fig. 1. Example of two-dimensional clustering: original data points are mapped to a cluster index (clustering), and each index is mapped to a reconstructed data point (reconstruction).
modifications are then experimentally evaluated against the state-of-the-art heuristic SCC in Section 5. Finally, conclusions and future work are presented in Section 6.

2. Mathematical preliminaries and background on the traditional k-means method
2.1. Mathematical preliminaries
It is important to stress that, as a method of strictly unsupervised learning, absolutely no label information is available to k-means. Consequently, even if the method is to be used as preprocessing for the supervised learning of otherwise overwhelmingly fine-grained data, k-means merely exploits geometric properties of the data according to a preset measure of dissimilarity or distortion, commonly the squared Euclidean distance. Prefaced by a few statistical preliminaries, this section overviews the traditional k-means method more formally, and lays out the two necessary conditions for distortion-optimal clustering. Throughout the paper, the measurable space in which a random variable takes on values will be called an alphabet. The cardinality of a set $\mathcal{X}$ is denoted by $|\mathcal{X}|$. We shall follow the convention of using uppercase letters for random variables, and lowercase letters for the particular values they take on. Thus, the notation $X = x$ represents the event that the random variable $X$ takes on the value $x \in \mathcal{X}$. Recall that a probability mass function (PMF) is essentially a relative-frequency histogram representing the probability distribution of a random variable over its alphabet. The expectation operator is denoted by $\mathrm{E}$. Expectation can model the special case of averages over a finite set of data points $\{x_1, \ldots, x_n\}$, simply by defining a random variable $X$ uniformly distributed over this set, so that $\mathrm{E}X = \frac{1}{n}\sum_{i=1}^{n} x_i$. More generally, when $X$ is distributed according to a PMF $p(x)$ on a discrete alphabet $\mathcal{X}$, a function $f$ of $X$ has expectation
$$\mathrm{E}\, f(X) = \sum_{x \in \mathcal{X}} f(x)\, p(x).$$
We adhere to the common convention of separating conditioning variables with a vertical bar, so that, for instance, $\mathrm{E}[f(X) \mid y]$ denotes the expectation of $f(X)$ conditioned on the event $Y = y$:
$$\mathrm{E}[f(X) \mid y] = \sum_{x \in \mathcal{X}} f(x)\, p(x \mid y),$$
where $p(x \mid y)$ denotes the conditional PMF of $X$ given $Y$.
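As a quick illustration of these definitions, the following Python sketch (our own, not from the paper) evaluates the expectation of f(X) for a PMF stored as a dictionary, covering both the uniform empirical case and a conditional PMF:

```python
def expectation(f, pmf):
    """E f(X) = sum over x of f(x) p(x), for a PMF given as {x: p(x)}."""
    return sum(f(x) * p for x, p in pmf.items())

# Averages over a finite set are the special case of a uniform PMF.
data = [1.0, 2.0, 4.0, 5.0]
p_uniform = {x: 1.0 / len(data) for x in data}
mean = expectation(lambda x: x, p_uniform)  # (1/4)(1 + 2 + 4 + 5) = 3.0

# Conditional expectation E[f(X) | y] uses the conditional PMF p(x | y).
p_given_y = {1.0: 0.5, 2.0: 0.5}
cond_mean = expectation(lambda x: x, p_given_y)  # 1.5
```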
2.2. Background on the traditional k-means method
A clustering method is a function that partitions the range of values $x$ of a random variable $X$, commonly continuous, approximating each cluster by a value $\hat{x}$ of a discrete random variable $\hat{X}$. The resulting clustering map $\hat{x}(x)$ may be broken down into two steps: first, an assignment of data points $X$ to a cluster index $Q$, usually a natural number, by means of a clustering function $q(x)$; and secondly, a reconstruction function $\hat{x}(q)$ mapping the index $Q$ into a value $\hat{X}$ that approximates the original data, so that $\hat{x}(x) = \hat{x}(q(x))$. Both the composition $\hat{x}(x)$ and the first step $q(x)$ are often interchangeably interpreted, and referred to, as clustering. This is represented in Fig. 1, along with an example where the random variable $X$ takes on values in $\mathbb{R}^2$. Depending on the field of application, the terms quantizer or (micro)aggregation are used in lieu of clustering, and cell in lieu of cluster, even though they have essentially equivalent meanings. Often, a distinction is made between clustering and microaggregation, to emphasize the large and small size of the resulting cells, respectively. Cluster indices $Q = q(X)$ take on values in a finite alphabet $\mathcal{Q} = \{1, \ldots, |\mathcal{Q}|\}$. The size $|\mathcal{Q}|$ of this alphabet, simply put, the number of clusters, may be given as an application requirement. Clearly, clustering comes at the price of introducing a certain amount of distortion between the original data $X$ and its reconstructed version $\hat{X} = \hat{x}(Q)$. In mathematical terms, we define a nonnegative function $d(x, \hat{x})$ called a distortion measure, and consider the expected distortion $D = \mathrm{E}\, d(X, \hat{X})$. A common measure of distortion is the mean squared error (MSE), that is, $D = \mathrm{E}\|X - \hat{X}\|^2$, popular due to its mathematical tractability. Optimal clustering is that of minimum distortion for a given number of possible indices. It is well known [14,15] that optimal clustering must satisfy the following conditions. First, the nearest-neighbor condition, according to which, once a reconstruction function $\hat{x}(q)$ is chosen, the optimal clustering $q^*(x)$ is given by
$$q^*(x) = \operatorname*{arg\,min}_{q = 1, \ldots, |\mathcal{Q}|} d(x, \hat{x}(q)), \qquad (1)$$
that is, each value $x$ of the data is assigned to the index corresponding to the nearest reconstruction value. Secondly, the centroid condition, which, in the important case when MSE is used as the distortion measure, states that for a chosen clustering $q(x)$, the optimal reconstruction function $\hat{x}^*(q)$ is given by
$$\hat{x}^*(q) = \mathrm{E}[X \mid q], \qquad (2)$$
that is, each reconstruction value is the centroid of a cluster. Each necessary condition may be interpreted as the solution to a Bayesian decision problem. We would like to stress that these are necessary but not sufficient conditions for joint optimality. Still, these optimality conditions are exploited in the k-means method [27,36], also widely known as the Lloyd–Max algorithm [26,30] in the data compression field. The method in question consists in the iterative, alternating optimization of $q(x)$ given $\hat{x}(q)$ and vice versa, according to (1) and (2). A popular method to initialize the algorithm consists
in choosing random reconstruction points first. A common stopping criterion compares the relative distortion reduction in each iteration to a given small threshold. In the k-means method, either the clustering or the reconstruction is improved at a time, leading to a successive improvement in the distortion; strictly speaking, the sequence of distortions is nonincreasing. Even though a nonnegative, nonincreasing sequence has a limit, rigorously, the convergence of the sequence of distortions does not guarantee that the sequence of clusterings obtained by the algorithm will tend to a stable configuration, less so to a jointly optimal one. In theory, the algorithm can merely provide a sophisticated upper bound on the optimal distortion. In practice, however, the k-means algorithm often exhibits excellent performance [15, Sections II.E and III].

3. Theoretical formulation of the k-means method with size constraints, and complexity analysis
In this section we first introduce the formulation of the size-constrained clustering problem. A slightly more general formulation is developed, which in fact contemplates probability-constrained clustering. Secondly, we postulate two update steps inspired by the optimality conditions of the k-means method, but taking into account those constraints. Thirdly, we proceed to propose two variations of our own algorithm, for large and small cluster sizes, respectively. We end this section with a detailed running-time complexity analysis of the PCL algorithm, the basic building block of our new proposals.
3.1. Formulation of and insight into the problem
We consider the design of minimum-distortion clusters satisfying size or, more generally, cluster probability constraints, with the same block structure of traditional clustering depicted in Fig. 1. For finite sets of discrete data points, one may simply view cluster probability constraints equivalently as size constraints, normalized by the total number of original data points available. Our general formulation in terms of probabilities permits including the case of continuous distributions of the data. Original data values, that is, the data points to be clustered, are modeled by a random variable $X$ in an arbitrary alphabet $\mathcal{X}$, possibly discrete or continuous, for example a set of points in a multidimensional Euclidean space, or a set of words in an ontology. The clustering $q(x)$ assigns $X$ to an index $Q$ in a finite alphabet $\mathcal{Q} = \{1, \ldots, |\mathcal{Q}|\}$ of a predetermined size. The reconstruction function $\hat{x}(q)$ maps $Q$ into the reconstruction $\hat{X}$, which may be regarded as an approximation to the original data, defined in an arbitrary alphabet $\hat{\mathcal{X}}$, commonly but not necessarily equal to the original data alphabet $\mathcal{X}$. Just as in traditional clustering, for any nonnegative (measurable) function $d(x, \hat{x})$, called a distortion measure, define the associated expected distortion $D = \mathrm{E}\, d(X, \hat{X})$, that is, a measure of the discrepancy between the original data values and their reconstructed values, which reflects the loss in data accuracy. Recall also from the background section that an important example of a distortion measure is $d(x, \hat{x}) = \|x - \hat{x}\|^2$, for which $D$ becomes the MSE. Alternatively, $d(x, \hat{x})$ may represent a semantic distance in an ontological hierarchy [7], with $\hat{X}$ modeling a conceptual generalization of $X$, a random variable taking on words as values. $p_Q(q)$ denotes the PMF corresponding to the cell probabilities. The cluster size requirement in the quantization problem is established by means of the cell probability constraints $p_Q(q) = p_0(q)$, for any predetermined PMF $p_0(q)$.
As we have introduced before, the motivating application of this work is the problem of size-constrained clustering of a set of samples. In this important albeit special case, $X$ would be given by the empirical distribution of the data values in the original set, that is, $p_X(x)$ would be the PMF corresponding to the relative frequency of occurrences of the different tuples of data values. Let $n$ be the number of records in the dataset. An $m$-size constraint could be translated into probability constraints by setting $|\mathcal{Q}| = \lfloor n/m \rfloor$ and $p_0(q) = 1/|\mathcal{Q}|$, which ensures that $n\, p_0(q) \geq m$. We would like to remark that even if we are only interested in the clustering portion $q(x)$ of the quantization process, the reconstruction values may still be used, although only as a means to compute a measure of similarity, and not as a replacement for the data clustered. Finally, given a distortion measure $d(x, \hat{x})$ and probability constraints expressed by means of $p_0(q)$ (along with the number of quantization cells $|\mathcal{Q}|$), we wish to design an optimal clustering $q^*(x)$ and an optimal reconstruction function $\hat{x}^*(q)$, in the sense that they jointly minimize the distortion $D$ while satisfying the probability constraints.
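The size-to-probability translation above is a one-line computation; the following sketch (our own, with illustrative numbers not taken from the paper) makes it explicit:

```python
def size_to_probability(n, m):
    """Translate an m-size constraint on n records into |Q| cells with
    uniform target probabilities p0(q) = 1/|Q|, so that n * p0(q) >= m."""
    num_cells = n // m      # |Q| = floor(n / m)
    p0 = 1.0 / num_cells    # uniform target PMF over the cells
    assert n * p0 >= m      # each cell is asked to hold at least m points
    return num_cells, p0

# For example, n = 1000 records with a minimum cell size of m = 30:
cells, p0 = size_to_probability(1000, 30)   # 33 cells, p0 = 1/33
```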
3.2. A heuristic modification of the traditional optimality conditions
Next, we propose heuristic optimization steps for probability-constrained clustering, analogous to the nearest-neighbor and centroid conditions found in k-means [14,15], reviewed in Section 2.2. We then modify the conventional k-means algorithm by applying its underlying alternating optimization principle to these steps. Finding the optimal reconstruction function $\hat{x}^*(q)$ for a given clustering $q(x)$ is a problem identical to that in conventional k-means:

$$\hat{x}^*(q) = \operatorname*{arg\,min}_{\hat{x} \in \hat{\mathcal{X}}} \mathrm{E}[d(X, \hat{x}) \mid q]. \qquad (3)$$

In the special case when MSE is used as the distortion measure, this is the centroid step (2). On the other hand, we may not apply the nearest-neighbor condition of conventional clustering directly if we wish to guarantee the probability constraints $p_Q(q) = p_0(q)$. We introduce a cell cost function $c : \mathcal{Q} \to \mathbb{R}$, a real-valued function of the cluster indices, which assigns an additive cost $c(q)$ to a cell indexed by $q$. The intuitive purpose of this function is to shift the cluster boundaries appropriately to satisfy the probability constraints. Specifically, given a reconstruction function $\hat{x}(q)$ and a cost function $c(q)$, we propose the following cost-sensitive nearest-neighbor step:

$$q^*(x) = \operatorname*{arg\,min}_{q = 1, \ldots, |\mathcal{Q}|} d(x, \hat{x}(q)) + c(q). \qquad (4)$$

This is a heuristic step inspired by the nearest-neighbor condition of the conventional k-means method (1). According to this formula, increasing the cost of a cluster, leaving the cost of the others and all centroids unchanged, will reduce the number of points assigned to it. Conversely, decreasing the cost will push the cluster boundaries outwards and thus increase its size. The coefficients may be construed, in the Euclidean case, as an additional dimension in which centroids can distance themselves from the points to be clustered. In technical words, the costs $c(q)$ adding to a quadratic distance $d(x, \hat{x}(q))$ in (4) may all be considered nonnegative without loss of generality, since adding a common base value to all of them would determine the same exact partition. But then,

$$d(x, \hat{x}(q)) + c(q) = \|x - \hat{x}(q)\|^2 + c(q) = \Big\| (x, 0) - \big(\hat{x}(q), \sqrt{c(q)}\big) \Big\|^2.$$

This geometric interpretation is illustrated in Fig. 2. The step just proposed naturally leads to the question of how to find a cost function $c(q)$ such that the probability constraints
Fig. 2. Geometrical interpretation of the c(q) coefficients. (Left) Initial assignment of points {x1, . . . , x6} to two clusters q1 and q2: the leftmost cluster has four points assigned, the rightmost only two. Initially, the coefficients c(q1) = c(q2) = 0, so the reconstruction points lie on the same plane as the data points. (Right) The clusters are balanced by increasing the coefficient of the cluster with an excess of points. The coefficient can be viewed as the coordinate of the reconstruction point in an additional dimension.
$p_Q(q) = p_0(q)$ are satisfied, given a reconstruction function $\hat{x}(q)$. Clearly, for given, fixed reconstructions, the cluster probabilities $p_Q(q)$ are entirely determined by the costs $c(q)$, through the modified nearest-neighbor condition (4). Considering each of the costs $c(1), \ldots, c(|\mathcal{Q}|)$ an unknown and each of the constraints $p_Q(1) = p_0(1), \ldots, p_Q(|\mathcal{Q}|) = p_0(|\mathcal{Q}|)$ an equation, we are left with a system of $|\mathcal{Q}|$ nonlinear equations in $|\mathcal{Q}|$ unknowns. For smooth probability distributions of the data and large cluster size constraints, we propose the following derivative-based numerical method, which proved to be very successful in all of our experiments, including those reported in Section 5. Later on, we shall describe a mechanism to extend this derivative-based method to the case of small-size constraints and arbitrary distributions of the data, not necessarily smooth. Specifically, we propose an application of the Levenberg–Marquardt algorithm [28], a powerful, sophisticated algorithm to solve systems of nonlinear equations numerically. A finite-difference estimation of the Jacobian is carried out by slightly increasing each of the coordinates of $c(q)$ at a time. To do so more efficiently, we exploit the fact that only the coordinates of $p_Q(q)$ corresponding to neighboring cells may change, and compute the negative-semidefinite approximation in Frobenius norm to the Jacobian.
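To convey the flavor of this balancing step, the sketch below implements the cost-sensitive assignment (4) and adjusts the costs c(q) with a plain fixed-point update: raising the cost of oversized cells and lowering it for undersized ones. This simple rule is our illustrative stand-in for the Levenberg–Marquardt solver actually used in the paper; the function names and the step-size heuristic are ours.

```python
import numpy as np

def cost_sensitive_assign(x, x_hat, c):
    """Modified nearest-neighbor step (4): argmin_q ||x - x_hat(q)||^2 + c(q)."""
    dist2 = ((x[:, None, :] - x_hat[None, :, :]) ** 2).sum(axis=2)
    return (dist2 + c[None, :]).argmin(axis=1)

def balance_costs(x, x_hat, p0, step=None, iters=200):
    """Adjust additive costs c(q) so the empirical cell probabilities
    approach the targets p0(q). The paper solves this nonlinear system
    with Levenberg-Marquardt; here we use a simple fixed-point update."""
    k = len(x_hat)
    c = np.zeros(k)
    if step is None:
        # Heuristic step size: scale updates by the spread of the data.
        step = ((x - x.mean(axis=0)) ** 2).sum(axis=1).mean()
    for _ in range(iters):
        q = cost_sensitive_assign(x, x_hat, c)
        p = np.bincount(q, minlength=k) / len(x)
        c += step * (p - p0)  # oversized cell -> larger cost -> fewer points
    return c
```

Note that for finite data the cell probabilities are a piecewise-constant function of the costs, which is precisely why the paper's finite-difference Jacobian estimation, and the small-cluster extension discussed later, are needed in general.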
3.3. Running-time complexity analysis
We now proceed to investigate the running-time complexity of PCL, the algorithm proposed and adapted in this work for the purpose of quasi-unsupervised learning. Throughout the discussion that follows, $n$ will denote the number of records of a dataset consisting of points in the $d$-dimensional Euclidean space $\mathbb{R}^d$, on which we wish to find $k$ cells or clusters with a minimum size constraint of $s$ points.¹
3.3.1. Theoretical considerations

Because PCL is a fairly sophisticated extension of the k-means method, intuition suggests that any theoretical analysis of its running time will be at the very least as intricate as that of k-means, already fairly involved. For a fixed number of clusters $k$ and dimension $d$, the general problem of finding a partition minimizing the mean squared distance can be solved exactly in time $O(n^{kd+1} \log n)$ in the number of points $n$ [20,21], by bounding the number of iterations by the number $O(n^{kd})$ of distinct Voronoi partitions on $n$ points.
¹ The use of the symbol k here should not be confused with its common use in statistical disclosure control, one of the fields of application of our algorithm, where, in the notion of k-anonymity, k represents the minimum cell size constraint (s in our notation) rather than the number of cells.
In practice, the problem is solved approximately by the k-means or Lloyd algorithm itself, usually very fast on a wide variety of datasets. The difficulty in investigating the complexity of k-means from a theoretical perspective is blatantly obvious from the fact that published studies are hardly conducive to any practical application, as extremely loose upper bounds are reported for very specific synthetic datasets. In an attempt to remedy this discrepancy, "smoothed" running-time studies have been conducted more recently [1,3], but they still yield fairly loose bounds, valid only for specific data. Precisely, [1] shows that for an arbitrary set of n points in [0, 1]^d, if each point is independently perturbed by a zero-mean normal distribution with variance σ², the expected "smoothed" running time of k-means is

O(n^34 k^34 d^8 log^4 n / σ^6),
polynomial in both the number of records n and the number of cells k. An attempt to provide a tighter bound in [3] resorts to n points in an integer lattice {1, …, r}^d, obtaining O(dn^4 r²). Because running times in practice are not adequately approximated by the theoretical bounds, and because some hypotheses on dataset properties are not easily verifiable, some complexity studies incorporate extensive experimentation. For example, the complexity of an efficient implementation of k-means by means of kd-trees [40] is investigated both theoretically and experimentally in [23]. The time complexity of the Levenberg–Marquardt algorithm, also part of our proposal, and of the more general family of variations of the Gauss–Newton method, has also been the object of extensive study [31,39,33]. In any case, neither the theoretical results on the running-time complexity of the k-means method nor those on the Levenberg–Marquardt algorithm are directly applicable to bound the overall complexity of PCL, because of their nontrivial interplay. Similarly, the state-of-the-art SCC algorithm, against which we shall compare PCL, requires an initial clustering step, which in [44] was performed with the k-means algorithm. Next, an integer linear programming problem with nk variables is solved. This type of problem is NP-complete, and the fastest solving algorithms currently known have a worst-case exponential running time of O(2^(nk)). Just as we argued for PCL, these theoretical bounds are hardly representative of performance in practice. Hence, additional, experimental investigation is highly desirable.
3.3.2. Experimental setup
The above theoretical considerations lead us to analyze the running-time complexity of PCL and SCC from an experimental perspective. The analysis in question is carried out for a dataset
Fig. 3. Running time t of PCL versus estimated running time t̂ = τ n^μ s^σ. The proportionality coefficient τ and the exponents μ and σ are adjusted via multiple linear regression on the logarithms of the recorded values.
synthetically generated from n independent, identically distributed drawings of d-dimensional vectors with independent, zero-mean, unit-variance Gaussian entries. The PCL algorithm described in this work and the SCC algorithm of [44] are applied to this dataset, for n ranging from 30,000 to 100,000 in steps of 5000, and d = 2, 3, 5. A minimum cell size constraint s is imposed, for s ranging from 1000 to 5000 in steps of 1000, resulting in a number of cells k = ⌊n/s⌋. The following robust stopping condition is observed: PCL is automatically interrupted whenever the MSE distortion decrement is less than 10⁻⁴ in five consecutive iterations. We consider a power-law model for the overall running time t as a function of n and s. Precisely, we consider the estimator

t̂ = τ n^μ s^σ    (5)

of t from n and s, where the proportionality coefficient τ and the exponents μ and σ are adjusted via multiple linear regression on the logarithms of the recorded values, given by

log t ≃ log τ + μ log n + σ log s.

Recall that, for the best affine MSE estimate of a target variable Y from an observation X, the usual definition of the correlation coefficient ρ for simple linear regression satisfies the property

MSE / σ_Y² = 1 − ρ²,
where σ_Y² denotes the variance of Y. In the multiple regression case, we redefine the absolute value of ρ from said property; it is thus a measure of the ratio between the estimation error and the variance of the target variable. Simply put, values of |ρ| close to 1 indicate that the model t̂ defined in (5) is an adequately approximate representation of the running time t.

3.3.3. Experimental results
Our proof-of-concept implementation of PCL was written in Matlab R2010b, with an integrated MEX C routine to speed up nearest-neighbor search, and run on a modern computer equipped with an Intel Xeon CPU at 2.67 GHz and Windows 7 64-bit. Running times and the results of our regression analysis are reported for the specific case of d = 5 in Fig. 3, in logarithmic axes. The resulting values of |ρ|, μ and σ for the other dimensions analyzed, d = 2 and 3, are very similar. In light of our model (5), the high absolute value of the correlation coefficient |ρ| ≃ 0.9 and the exponent values μ ≃ 2.14 and σ ≃ −1.11 suggest that the
Fig. 4. Running time t of SCC versus estimated running time t̂ = τ n^ν s^σ. The proportionality coefficient τ and the exponents ν and σ are adjusted via multiple linear regression on the logarithms of the recorded values.
running time is roughly of the order of n²/s, and, since k is approximately n/s, t is roughly proportional to nk. Interestingly, n²/s ≃ nk is precisely the computational complexity of the modified nearest-neighbor routine implemented, which simply checks the nearest centroid for each point, and which is invoked a large number of times, both in the inner iterations executing the Levenberg–Marquardt algorithm and in the outer iterations reshaping the cells produced by PCL. Future research may of course take into consideration the possibility of using efficient methods for nearest-neighbor search, such as kd-trees [40], to speed up our algorithm. We would like to conclude this section on the running-time complexity analysis of PCL with a quick comparison against its contender, SCC, applying the same type of experimental regression analysis described above. Our implementation of the SCC algorithm was written in C and, since the algorithm is largely independent of the dimensionality of the data (practically all of the running time is spent solving the ILP problem, which is unaffected by dimensionality), we have used d = 2 for the regression analysis. These experimental results for SCC, plotted in Fig. 4, suggest that the running time of SCC is of the order of approximately n^4.62 s^−1.97 ≃ n^2.65 k^1.97, with accuracy supported by a very high absolute correlation coefficient |ρ| ≃ 0.991. This means that the running time of SCC scales with roughly the square of that of PCL, the latter estimated to be of the order of n²/s ≃ nk. In other words, the scalability of PCL amply surpasses that of SCC for the experiments conducted, a result that is supported by the fact that we had to severely reduce the dataset for SCC to obtain reasonable running times.
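The power-law regression itself is a short computation: take logarithms and fit an affine model by least squares. The following sketch is our own illustration, with synthetic timings generated from hypothetical ground-truth exponents (the recorded running times themselves are not reproduced here); it recovers the exponents and the correlation measure in the manner described above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = rng.integers(30_000, 100_001, size=200).astype(float)
s = rng.integers(1_000, 5_001, size=200).astype(float)
# Hypothetical ground-truth model t = tau * n^mu * s^sigma, with log-normal noise.
tau, mu, sigma = 3e-7, 2.14, -1.11
t = tau * n**mu * s**sigma * np.exp(rng.normal(0.0, 0.05, size=200))

# Multiple linear regression on logarithms: log t = log tau + mu log n + sigma log s.
A = np.column_stack([np.ones_like(n), np.log(n), np.log(s)])
coef, *_ = np.linalg.lstsq(A, np.log(t), rcond=None)
log_tau_hat, mu_hat, sigma_hat = coef

# |rho| redefined from the property MSE / var(Y) = 1 - rho^2.
resid = np.log(t) - A @ coef
abs_rho = np.sqrt(1.0 - resid.var() / np.log(t).var())
```

With 200 samples and mild noise, the recovered exponents are tight and |ρ| is close to 1, mirroring the kind of fit reported for Figs. 3 and 4.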
4. Constrained k-means for quasi-unsupervised learning
In this section we introduce three different modifications of the traditional k-means algorithm for quasi-unsupervised learning. First, as in [44], we assume that clusters have a specified size. Then we relax this restriction and assume that we only have an approximate idea of the cluster sizes, but that we know an archetype, a possible centroid, of each cluster. Finally, we operate under the assumption that a small portion of the available data points are labeled, information which is exploited in the classification of the entire dataset.
4.1. Exact cluster sizes
Ideally, we wish to find a clustering and a reconstruction function that jointly minimize the distortion subject to the cluster size constraints defined by the user as a parameter. The conventional k-means method, reviewed in Section 2.2, is essentially an alternating optimization algorithm that iterates between the nearest-neighbor (1) and the centroid optimality (2) conditions. These are necessary but not sufficient conditions; thus the algorithm can only hope to approximate a jointly optimal pair q*(x), x̂(q), as it really only guarantees that the sequence of distortions is nonincreasing. We also mentioned that, experimentally, the k-means algorithm very often shows excellent performance [15, Sections II.E and III]. Bear in mind that our modification of the nearest-neighbor condition (4) for the probability-constrained problem is a heuristic proposal, in the sense that this work does not prove it to be a necessary condition. We still use the same alternating optimization principle, albeit with a more sophisticated nearest-neighbor condition (4), and define the following modified k-means method for probability-constrained quantization:
1. Choose an initial reconstruction function x̂(q) and initial cost function c(q).
2. Update c(q) to satisfy the probability constraints p_Q(q) = p_0(q), given the current x̂(q). To this end, use the method described at the end of Section 3.2, setting the initial cost function as the cost function at the beginning of this step.
3. Find the next clustering q(x) corresponding to the current x̂(q) and the current c(q), according to Eq. (4).
4. Find the optimal x̂(q) corresponding to the current q(x), according to Eq. (3).
5. Go back to Step 2, until an appropriate convergence condition is satisfied or a maximum number of iterations is exceeded.
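The five steps above can be sketched as follows. This is a minimal illustration in which the Levenberg–Marquardt update of Step 2 is replaced by a simpler damped fixed-point correction of the costs (raising the cost of over-full cells), which suffices for smooth, well-populated data; the function name, parameters and data are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def constrained_kmeans(X, p0, outer=20, inner=100, eta=1.0, seed=0):
    """Probability-constrained k-means, Steps 1-5 (sketch). Step 2 uses a damped
    fixed-point cost update in place of the Levenberg-Marquardt solver."""
    rng = np.random.default_rng(seed)
    k = len(p0)
    xhat = X[rng.choice(len(X), size=k, replace=False)]  # Step 1: initial x^(q)
    c = np.zeros(k)                                      # Step 1: initial c(q) = 0
    for _ in range(outer):
        for _ in range(inner):                           # Step 2: drive p_Q toward p_0
            d2 = ((X[:, None, :] - xhat[None, :, :]) ** 2).sum(-1)
            q = np.argmin(d2 + c, axis=1)                # Step 3: Eq. (4)
            pQ = np.bincount(q, minlength=k) / len(X)
            c += eta * (pQ - p0)                         # over-full cells get costlier
        for j in range(k):                               # Step 4: centroids, Eq. (3)
            if (q == j).any():
                xhat[j] = X[q == j].mean(axis=0)
    return q, xhat, c                                    # Step 5: fixed iteration budget

# Usage: three Gaussian blobs of unequal sizes, uniform target probabilities.
rng = np.random.default_rng(42)
X = np.concatenate([rng.normal(m, 1.0, size=(sz, 2))
                    for m, sz in [((0, 0), 1200), ((8, 0), 1000), ((0, 8), 800)]])
q, xhat, c = constrained_kmeans(X, p0=np.full(3, 1 / 3))
```

Despite the natural 1200/1000/800 imbalance, the additive costs steer the cell boundaries until the three cells hold roughly a third of the points each.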
The initial reconstruction values may simply be chosen as |Q| random samples distributed according to the probability distribution p_X(x) of X. A simple yet effective cost function initialization is c(q) = 0, because it prevents empty clusters. Note that the numerical computation of c(q) in Step 2 should benefit from better and better initializations as the reconstruction values become stable. That stability may be improved during the initial steps of the algorithm by replacing previous reconstruction values with convex combinations weighting both the updated reconstruction values and the previous ones, instead of directly replacing the previous value with the updated one, thus slowing the update process and improving the validity of previous costs as initializations to the Levenberg–Marquardt method. If the probability of a cell happens to vanish at any point of the algorithm, this can be tackled by choosing a new, random reconstruction value, with cost equal to the minimum of the remaining costs, or by splitting the largest cell into two, with similar centroids and costs, in order to keep the same number of clusters. The stopping convergence condition might, for instance, consider slowdowns in the sequence of distortions obtained. For example, the algorithm can be stopped the first time a minimum relative distortion decrement is not attained, or, more simply, when a maximum number of iterations is exceeded.

4.2. Approximate cluster sizes and centroids

In scenarios where it is not possible to obtain the exact number of objects in each of the classes, but it is still possible to have an estimate of these values, we might be able to define a "representative" member of each class as in [29], which we call an archetype, or at least provide an approximation of this representative member. When both pieces of information are available, they can be
incorporated into the algorithm defined in the previous section as follows:
1. Given a representative for each cluster q, assign it as the reconstruction point x̂(q). Set the initial cost function c(q) (typically c(q) = 0).
2. Update c(q) to satisfy the probability constraints p_Q(q) = p_0(q), given the current x̂(q).
3. Find the next clustering q(x) corresponding to the current x̂(q) and the current c(q), according to Eq. (4).
The idea is to use the representative point (archetype) of each cluster as its reconstruction point and then perform a single outer iteration of the constrained k-means algorithm. Although the archetypes could simply have been used as initial reconstruction points without further modification of the algorithm, if several iterations are allowed, the alternating optimization procedure typically produces a drift in the reconstruction points that can yield a final reconstruction point quite different from the archetype. Since we assume the archetypes summarize the properties of the class they represent, that is, they are not very far from the centroid of the class, it is a better approach to freeze the reconstruction points. To simulate the fact that we have approximate knowledge of the cluster sizes and the archetypes, in our experiments (Section 5) we have used 10% of the records, randomly selected, as a sample to estimate the sizes of the clusters as well as the archetypes. The approximate cluster sizes and archetypes are then simply passed to the algorithm, so the labels of the records in the sample are never directly used by the algorithm.
4.3. Some training labeled records
In datasets in which classes are quite mixed, that is, in which the convex hulls of the points of each class have intersection zones containing a significant number of records, the previous strategies might yield quite rough classifications. In these contexts, some additional information can increase the performance of the classifier. In particular, as in [17], we assume in this case that we have a small sample of records that have already been classified. This can be the case, for instance, when there are too many records to be analyzed by humans, but a small fraction can be classified by hand. In contrast with the strategy presented in Section 4.2, in which the cluster sizes and archetypes were approximated using a 10% sample but the algorithm was not aware of the records in the sample, this time we assume that we have access to these records, so that the algorithm is aware of the class of each of these sample points. Instead of performing a classification looking for |Q| different classes, we perform a classification with a number of classes equal to the size of the sampling set (in our experiments, 10%). For instance, in a dataset containing 150 records belonging to three different classes, the sampling set will contain 15 records. Instead of running the algorithm with |Q| = 3, we assume each of the sample records is used as the archetype of a distinct class, thus |Q| = 15. The probabilities of all of these classes are the same, so that p_0(q) = p_0(q′) for all q and q′, and we use the sample point associated with each of these classes as the initial reconstruction point of the class. In our example, each of the 15 clusters would contain 150/15 = 10 records, thus p_0(q) = 10/150.
Once this initial setup has been performed, the c(q) coefficients are computed so that the probability constraints p_Q(q) = p_0(q) are satisfied, and then all the records classified in the same cluster are assigned to the class of the sample point associated with that cluster. For instance, in our running example,
H(V) = −∑_{j=1}^{k} (n_j/n) log (n_j/n), with n_j = ∑_{i=1}^{c} n_{ij}. Since the algorithms used have a predefined common number of clusters, in this scenario c = k.
Table 1
Datasets from UCI [12] used in the experiments.

Dataset        Objects  Classes  Attributes  Cluster sizes
Iris             150       3         4       (50, 50, 50)
Wine             178       3        13       (71, 59, 48)
Balance scale    625       3         4       (288, 288, 49)
Ionosphere       351       2        33       (226, 125)
WDBC             569       2        30       (357, 212)
Crx              666       2         6       (367, 299)
Pima             768       2         9       (500, 268)
Bands            366       2        19       (230, 135)
after the computation of the c(q) coefficients, each of the clusters would effectively contain ten records, which would all be assigned to the same original class as the archetype (sample point) of the cluster.
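The running example can be sketched end to end as follows; the synthetic 150-record dataset and the simple proportional cost update (in place of the Levenberg–Marquardt step) are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
labels = np.repeat([0, 1, 2], 50)                 # 150 records, three classes
X = (rng.normal(scale=0.5, size=(150, 2))
     + np.array([[0, 0], [4, 0], [0, 4]])[labels])

sample = rng.choice(150, size=15, replace=False)  # 10% labeled sample: |Q| = 15
xhat = X[sample].copy()                           # one archetype per sample record
p0 = np.full(15, 10 / 150)                        # p0(q) = 10/150 for every q

c = np.zeros(15)
for _ in range(400):                              # balance the 15 micro-clusters
    d2 = ((X[:, None, :] - xhat[None, :, :]) ** 2).sum(-1)
    q = np.argmin(d2 + c, axis=1)
    c += np.bincount(q, minlength=15) / 150 - p0
pred = labels[sample][q]                          # cluster -> label of its archetype
accuracy = (pred == labels).mean()
```

Each record inherits the label of the sample point whose micro-cluster absorbed it, which is exactly the final assignment step described above.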
5. Experimental evaluation
In this section, we describe the experiments we have carried out to test the performance of the proposed k-means modifications. The main conclusion we may draw from these experiments is that our modifications of the k-means algorithm are perfectly suitable candidates in situations where some prior knowledge is available, as in the quasi-unsupervised learning scenario. First, we describe the datasets considered in our experiments; second, we explain the performance measures used. Next, we show the results obtained using our proposal, as well as those using the state-of-the-art SCC algorithm proposed in [44] and briefly reviewed in Section 1.1.
5.1. Datasets and description of performance measures
We have conducted a large number of experiments to validate our approach on eight UCI datasets from [12], whose main characteristics are summarized in Table 1. In particular, we provide the number of objects (records or data points), classes (categories) and attributes (dimensions) of each dataset, as well as the sizes of the classes. These datasets are exactly those used in [44]. To compare the quality of the clusterings obtained by all the algorithms, we have used the following metrics of clustering performance for classification:
• Adjusted Rand index (ARI) [19], computed as the number of pairs of objects such that (i) both are assigned to the same cluster and have the same label, or (ii) they are assigned to different clusters and have different labels, divided by the total number of pairs. This metric always gives a value between 0 and 1, and higher values indicate higher agreement between the clustering and the class labels.
• Normalized mutual information (NMI) [37], which quantifies the amount of information shared by two random variables representing the cluster assignments (variable U) and the class labels (variable V). It is computed as the mutual information I(U, V) normalized by the geometric mean of the entropies H(U) and H(V); the precise formula is given below.
5.2. Experimental results
The three variations of the size-constrained k-means (CKM) method developed and experimentally analyzed are listed in Table 2, and Table 3 reports the results obtained in the experiments. The ARI and NMI values for the CKM variants are the average of three executions. Among these methods, CKMb and CKMt showed almost no variability across executions despite the randomness of some of the initial choices (the initial reconstruction points for CKMb and the 10% sample used in CKMt), while CKMl showed a significant degree of variability. From these results, we may draw the following conclusions:
• In general, the performance of the CKM algorithms increases as the available prior knowledge increases. This effect is common to all the considered datasets except Wine, for both measures, and demonstrates that the CKM algorithms are able to profit from the available prior knowledge. This is convincing evidence that our modifications of the k-means algorithm are useful candidates in several quasi-unsupervised learning scenarios.
• At least one CKM configuration outperforms SCC on every dataset. This is possible thanks to the great flexibility of CKM in the kinds of prior knowledge it can exploit: as the considered datasets have very different characteristics, the different forms of prior knowledge allow CKM to reach higher performance than SCC.
Table 2
Algorithmic variations of the k-means method implemented and experimentally analyzed.

Algorithm  Description                                                Sec.
CKMb       Random centroid initialization and moving centroids        4.1
CKMt       Approximate cluster sizes and centroids from a 10% sample  4.2
CKMl       Preassigned clustering of 10% of labeled records           4.3
Table 3
Comparison of the clustering quality between the heuristic size-constrained clustering (SCC) of [44] and our size-constrained k-means algorithm, in the three variations described in Table 2.

Dataset         Alg    ARI     NMI
Iris            SCC    0.6416  0.6750
                CKMb   0.7859  0.7773
                CKMt   0.7860  0.7782
                CKMl   0.5939  0.6419
Wine            SCC    0.3863  0.4383
                CKMb   0.7693  0.7818
                CKMt   0.6756  0.6655
                CKMl   0.5648  0.5965
Balance scale   SCC    0.1389  0.0931
                CKMb   0.0762  0.0715
                CKMt   0.3846  0.3620
                CKMl   0.3882  0.2803
Ionosphere      SCC    0.4056  0.2926
                CKMb   0.0216  0.0616
                CKMt   0.2047  0.1669
                CKMl   0.4729  0.3538
WDBC            SCC    0.1172  0.1678
                CKMb   0.2156  0.3134
                CKMt   0.5312  0.5103
                CKMl   0.7129  0.5917
Crx             SCC    0.1989  0.1461
                CKMb   0.1441  0.1043
                CKMt   0.2127  0.1608
                CKMl   0.1629  0.1172
Pima            SCC    0.0365  0.0138
                CKMb   0.1930  0.1179
                CKMt   0.1704  0.1421
                CKMl   0.1762  0.1085
Bands           SCC    0.0528  0.0268
                CKMb   0.0083  0.0027
                CKMt   0.0236  0.0207
                CKMl   0.0827  0.0422
Note that these measures are independent of the prior knowledge available to the algorithms. The values for SCC were extracted directly, as reported, from the paper in which it was proposed [44].
NMI = I(U, V) / √(H(U)H(V)),

where I(U, V) is the mutual information between the cluster assignments and the class labels, defined as ∑_{i=1}^{c} ∑_{j=1}^{k} (n_{ij}/n) log (n n_{ij} / (n_i n_j)), where n_{ij} stands for the number of objects assigned to cluster i with label j, n for the total number of objects, c for the number of clusters and k for the number of classes (labels). H(U) and H(V) are the corresponding entropies, defined as H(U) = −∑_{i=1}^{c} (n_i/n) log (n_i/n), where n_i = ∑_{j=1}^{k} n_{ij}, and H(V) is defined analogously.
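The NMI formula can be computed directly from the contingency table; the sketch below uses natural logarithms and assumes both labelings have at least two distinct values (so neither entropy vanishes).

```python
import numpy as np

def nmi(u, v):
    """Normalized mutual information NMI = I(U, V) / sqrt(H(U) H(V))."""
    u, v = np.asarray(u), np.asarray(v)
    n = len(u)
    _, ui = np.unique(u, return_inverse=True)
    _, vi = np.unique(v, return_inverse=True)
    nij = np.zeros((ui.max() + 1, vi.max() + 1))
    np.add.at(nij, (ui, vi), 1)                    # contingency table n_ij
    ni, nj = nij.sum(axis=1), nij.sum(axis=0)      # cluster and class totals
    mask = nij > 0                                 # 0 log 0 terms contribute nothing
    I = (nij[mask] / n * np.log(n * nij[mask] / np.outer(ni, nj)[mask])).sum()
    Hu = -(ni / n * np.log(ni / n)).sum()
    Hv = -(nj / n * np.log(nj / n)).sum()
    return I / np.sqrt(Hu * Hv)
```

A labeling identical up to renaming of the clusters scores 1, while independent labelings score near 0. For the ARI, a standard implementation such as sklearn.metrics.adjusted_rand_score may be used.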
• CKMb obtains the best ARI and NMI results in two of the eight considered datasets, largely outperforming SCC in the Iris, Wine, WDBC and Pima datasets. Specifically, in the Wine, Pima and WDBC datasets, CKMb vastly outperforms SCC (over 100% improvement) using exactly the same kind of prior knowledge. In the Balance scale dataset, SCC obtains slightly better performance than CKMb, but CKMt and CKMl reach higher ARI and NMI values than SCC.
• The Ionosphere dataset has some peculiarities, and the results obtained with it deserve closer study. The clusters of this dataset are completely mixed; for this reason, the results obtained by CKMb, which creates geometrically convex cells, are worse than those obtained by SCC. However, since in the Ionosphere dataset very similar, close records share the same label, CKMl manages to outperform SCC, with an improvement of around 15%. Note that CKMl considers a small subset of record labels as its prior knowledge and can therefore take advantage of this fact. Indeed, this issue is well known in the classification community, and it is the reason why the Ionosphere dataset has been widely used to simulate classification scenarios.
• SCC and the CKM variants yield poor results with the Balance scale dataset because the records of this dataset are nearly equidistant. This makes the task of building compact clusters more challenging; therefore, the results of all the methods are not as good as with other datasets such as Iris.
6. Conclusion and future work
In this paper we have described the traditional k-means clustering algorithm from the point of view of distortion-optimized quantization, stressing the importance of the necessary optimality conditions it is built upon. Then, we have formulated the size-constrained clustering problem and shown how to accommodate cluster size constraints in the classical k-means algorithm. Specifically, the introduction of additive weights in the nearest-neighbor condition enables us to control cell sizes while remaining faithful to the optimality properties of the original algorithm. In order to adjust these additive weights, we have resorted to the Levenberg–Marquardt algorithm, a sophisticated variation of the Gauss–Newton method for numerical optimization. Next, we have argued that incorporating size constraints when clusters are built can be regarded as a form of prior knowledge. This reasoning allows us to consider k-means as a clustering method for quasi-unsupervised scenarios. Additionally, we have shown that other types of prior knowledge can be integrated into k-means; specifically, we have considered the scenario where some cluster archetypes, or frozen cluster centroids, are known, and the scenario where some labeled records, the training dataset, are available. In addition to evaluating its distortion performance, the running-time complexity of our proposal has been assessed experimentally by means of a power-law regression analysis, suggesting that the overall running time of our clustering algorithm is of the order of nk, with n the number of points in the dataset and k the number of cells, for several fixed dimensions.
Finally, in the experimental part of this work we have tested our proposal on well-known and widely accepted real datasets extracted from the UCI repository, and we have compared our results against a state-of-the-art size-constrained clustering method, SCC, showing that our k-means modifications are capable of attaining lower distortion, thereby outperforming it, on various standardized datasets. As future work, we intend to investigate an extension exploiting the fact that inequality constraints offer more flexibility,
which may lead to a distortion improvement in applications where strict equality is not required.
Acknowledgments
This work was partly supported by the Spanish Government through projects Consolider Ingenio 2010 CSD2007-00004 "ARES" and TEC2010-20572-C02-02 "Consequence", and by the Government of Catalonia under Grant 2009 SGR 1362. D. Rebollo-Monedero is the recipient of a Juan de la Cierva postdoctoral fellowship, JCI-200905259, from the Spanish Ministry of Science and Innovation.
References
[1] D. Arthur, B. Manthey, H. Roeglin, k-Means has polynomial smoothed complexity, in: Proc. IEEE Annual Symp. Found. Comput. Sci. (FOCS), Atlanta, GA, October 2009, pp. 1157–1160.
[2] L. Bai, J. Liang, C. Dang, An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data, Knowl.-Based Syst. 24 (6) (2011) 785–795.
[3] A. Bhowmick, A theoretical analysis of Lloyd's algorithm for k-means clustering, 2009.
[4] M. Bilenko, S. Basu, R.J. Mooney, Integrating constraints and metric learning in semi-supervised clustering, in: Proc. Int. Conf. Mach. Learn. (ICML), Banff, Alberta, Canada, July 2004, pp. 81–88.
[5] C. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, New York, 2006.
[6] O.B. Chapelle, B. Schölkopf, A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, MA, 2006.
[7] V. Cross, Fuzzy semantic distance measures between ontological concepts, in: Proc. N. Amer. Fuzzy Inform. Process. Soc. (NAFIPS), 2004, pp. 236–240.
[8] J. Domingo-Ferrer, J.M. Mateo-Sanz, Practical data-oriented microaggregation for statistical disclosure control, IEEE Trans. Knowl. Data Eng. 14 (1) (2002) 189–201.
[9] J. Domingo-Ferrer, V. Torra, Ordinal, continuous and heterogeneous k-anonymity through microaggregation, Data Min. Knowl. Disc. 11 (2) (2005) 195–212.
[10] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley, New York, 2001.
[11] M. Fisher, K. Rajaram, Accurate retail testing of fashion merchandise: methodology and application, J. Market. Sci. 19 (3) (2000).
[12] A. Frank, A. Asuncion, UCI machine learning repository, Univ. California, Irvine, Sch. Inform., Comput. Sci., 2010.
[13] H. Frigui, R. Krishnapuram, A robust competitive clustering algorithm with applications in computer vision, IEEE Trans. Pattern Anal. Mach. Intell. 21 (5) (1999) 450–465.
[14] A. Gersho, R.M. Gray, Vector Quantization and Signal Compression, Kluwer Academic Publishers, Boston, MA, 1992.
[15] R.M. Gray, D.L. Neuhoff, Quantization, IEEE Trans. Inform. Theory 44 (1998) 2325–2383.
[16] S.K. Gupta, K.S. Rao, V. Bhatnagar, k-Means clustering algorithm for categorical attributes, in: Proc. Data Warehous. Knowl. Disc. (DaWaK), Lecture Notes Comput. Sci. (LNCS), vol. 1676, Springer-Verlag, Florence, Italy, 1999, pp. 203–208.
[17] T. Huang, Y. Yu, G. Guo, K. Li, A classification algorithm based on local cluster centers with a few labeled training examples, Knowl.-Based Syst. 23 (6) (2010) 563–571.
[18] Z. Huang, Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Min. Knowl. Disc. 2 (3) (1998) 283–304.
[19] L. Hubert, P. Arabie, Comparing partitions, J. Classif. 2 (1) (1985) 193–218.
[20] M. Inaba, N. Katoh, H. Imai, Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering, in: Proc. ACM Symp. Comput. Geom., 1994, pp. 332–339.
[21] M. Inaba, N. Katoh, H. Imai, Variance-based k-clustering algorithms by Voronoi diagrams and randomization, IEICE Trans. Inform. Syst. E83-D (6) (2000) 1199–1206.
[22] F. Jacquenet, C. Largeron, Discovering unexpected documents in corpora, Knowl.-Based Syst. 22 (6) (2009) 421–429.
[23] T. Kanungo, D.M. Mount, N.S. Netanyahu, C.D. Piatko, R. Silverman, A.Y. Wu, An efficient k-means clustering algorithm: analysis and implementation, IEEE Trans. Pattern Anal. Mach. Intell. 24 (7) (2002) 881–892.
[24] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, 2005.
[25] M. Li, L. Zhang, Multinomial mixture model with feature selection for text clustering, Knowl.-Based Syst. 21 (7) (2008) 704–708.
[26] S.P. Lloyd, Least squares quantization in PCM, IEEE Trans. Inform. Theory IT-28 (1982) 129–137.
[27] J.B. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proc. Berkeley Symp. Math. Stat. Prob., vol. I, Berkeley, CA, 1967, pp. 281–297.
[28] D. Marquardt, An algorithm for least-squares estimation of nonlinear parameters, SIAM J. Appl. Math. 11 (1963) 431–441.
Q1 Please cite this article in press as: D. Rebollo-Monedero et al., A modification of the k-means method for quasi-unsupervised learning, Knowl. Based Syst. (2012), http://dx.doi.org/10.1016/j.knosys.2012.07.024
[29] S. Martínez, A. Valls, D. Sánchez, Semantically-grounded construction of centroids for datasets with textual attributes, Knowl.-Based Syst., in press.
[30] J. Max, Quantizing for minimum distortion, IEEE Trans. Inform. Theory 6 (1) (1960) 7–12.
[31] J.J. Moré, The Levenberg–Marquardt algorithm: implementation and theory, in: G.A. Watson (Ed.), Numerical Analysis, Lecture Notes Math., vol. 630, Springer-Verlag, 1977, pp. 105–116.
[32] A. Ng, CS229 course on machine learning, Stanford Univ., 2011.
[33] A.-H. Phan, P. Tichavsky, A. Cichocki, Low complexity damped Gauss–Newton algorithms for CANDECOMP/PARAFAC, SIAM J. Matrix Anal. Appl. (SIMAX), submitted for publication.
[34] D. Rebollo-Monedero, Quantization and transforms for distributed source coding, Ph.D. dissertation, Stanford Univ., 2007.
[35] D. Rebollo-Monedero, J. Forné, M. Soriano, An algorithm for k-anonymous microaggregation and clustering inspired by the design of distortion-optimized quantizers, Data Knowl. Eng. 70 (10) (2011) 892–921.
[36] H. Steinhaus, Sur la division des corps matériels en parties, Bull. Pol. Acad. Sci. IV (12) (1956) 801–804.
[37] C. Studholme, D.L.G. Hill, D.J. Hawkes, An overlap invariant entropy measure of 3D medical image alignment, Pattern Recognit. 32 (1) (1999) 71–86.
[38] L. Sweeney, k-anonymity: a model for protecting privacy, Int. J. Uncertain. Fuzz. Knowl.-Based Syst. 10 (5) (2002) 557–570.
[39] K. Ueda, N. Yamashita, On a global complexity bound of the Levenberg–Marquardt method, J. Optim. Theory Appl. 147 (2010) 443–453.
[40] I. Wald, V. Havran, On building fast kd-trees for ray tracing, and on doing that in O(N log N), in: Proc. IEEE Symp. Interact. Ray Trac., 2006, pp. 61–69.
[41] L. Willenborg, T. DeWaal, Elements of Statistical Disclosure Control, Springer-Verlag, New York, 2001.
[42] R. Xu, D. Wunsch II, Survey of clustering algorithms, IEEE Trans. Neural Netw. 16 (3) (2005) 645–678.
[43] X. Xu, J. Jäger, H.-P. Kriegel, A fast parallel clustering algorithm for large spatial databases, in: Y. Guo, R. Grossman (Eds.), High Performance Data Mining: Scaling Algorithms, Applications and Systems, Springer-Verlag, 2002, pp. 263–290.
[44] S. Zhu, D. Wang, T. Li, Data clustering with size constraints, Knowl.-Based Syst. 23 (8) (2010) 883–889.