Data & Knowledge Engineering 70 (2011) 892–921


An algorithm for k-anonymous microaggregation and clustering inspired by the design of distortion-optimized quantizers☆

David Rebollo-Monedero a,⁎, Jordi Forné a, Miguel Soriano a,b

a Department of Telematics Engineering, Universitat Politècnica de Catalunya (UPC), C. Jordi Girona 1–3, 08034 Barcelona, Spain
b Centre Tecnològic de Telecomunicacions de Catalunya (CTTC), Av. Carl Friedrich Gauss 7, 08860 Castelldefels, Barcelona, Spain

Article info

Article history: Received 27 April 2010; received in revised form 21 June 2011; accepted 21 June 2011; available online 2 July 2011.

Keywords: k-anonymity; privacy; anonymous microaggregation; MDAV; location-based services; distortion-optimized quantizer design; Lloyd algorithm; k-means method

Abstract

We present a multidisciplinary solution to the problems of anonymous microaggregation and clustering, illustrated with two applications, namely privacy protection in databases, and private retrieval of location-based information. Our solution is perturbative, is based on the same privacy criterion used in microdata k-anonymization, and provides anonymity through a substantial modification of the Lloyd algorithm, a celebrated quantization design algorithm, endowed with numerical optimization techniques. Our algorithm is particularly suited to the important problem of k-anonymous microaggregation of databases, with a small integer k representing the number of individual respondents indistinguishable from each other in the published database. Our algorithm also exhibits excellent performance in the problem of clustering or macroaggregation, where k may take on arbitrarily large values. We illustrate its applicability in this second, somewhat less common case, by means of an example of location-based services. Specifically, location-aware devices entrust a third party with accurate location information. This party then uses our algorithm to create distortion-optimized, size-constrained clusters, where k nearby devices share a common centroid location, which may be regarded as a distorted version of the original one. The centroid location is sent back to the devices, which use it when contacting untrusted location-based information providers, in lieu of the exact home location, to enforce k-anonymity. We compare the performance of our novel algorithm to the state-of-the-art microaggregation algorithm MDAV, on both synthetic and standardized real data, which encompass the cases of small and large values of k. The most promising aspect of our proposed algorithm is its capability to maintain the same k-anonymity constraint, while outperforming MDAV by a significant reduction in data distortion, in all the cases considered. © 2011 Elsevier B.V. All rights reserved.

1. Introduction

The right to privacy was recognized as early as 1948 by the United Nations in the Universal Declaration of Human Rights, Article 12. With the Internet connectivity paradigm shifting towards almost every object of everyday life, privacy will undeniably become more crucial than ever. In this section, we motivate the importance of privacy protection with two distinct applications, in the growing technological fields of statistical disclosure control (SDC) and location-based services (LBSs), respectively. The next section will offer a more technical review in the context of the state of the art.

☆ The material in this paper has been published in part in the proceedings of the 20th Tyrrhenian International Workshop on Digital Communications, Sardinia, Italy, Sept. 2–4, 2009 [1].
⁎ Corresponding author. Tel.: +34 93 401 7027.
E-mail addresses: [email protected] (D. Rebollo-Monedero), [email protected] (J. Forné), [email protected], [email protected] (M. Soriano).
doi:10.1016/j.datak.2011.06.005


Fig. 1. k-Anonymous quantization, that is, aggregation of key attribute values to attain k-anonymity.

1.1. Statistical disclosure control

A microdata set is a database table whose records carry information concerning individual respondents, either people or companies. This set commonly contains key attributes or quasi-identifiers, namely attributes that, in combination, may be linked with external information to reidentify the respondents to whom the records in the microdata set refer. Examples include job, address, age, gender, height and weight. Additionally, the data set contains confidential attributes with sensitive information on the respondent, such as salary, religion, political affiliation or health condition. The classification of attributes as key or confidential may ultimately rely on the specific application and the privacy requirements the microdata set is intended for. Intuitively, perturbation of the key attributes enables us to preserve privacy to a certain extent, at the cost of losing some of the data utility with respect to the unperturbed version.

k-Anonymity is the requirement that each tuple of key attribute values be shared by at least k records in the data set, typically for a small integer k. This may be achieved through the microaggregation approach illustrated by the example depicted in Fig. 1, where height and weight are regarded as key attributes, and the blood concentration of cholesterol as a confidential attribute. Rather than making the original table available, we publish a k-anonymous quantized version containing aggregated records, in the sense that all key attribute values within each group are replaced by a common representative tuple. Despite the fact that k-anonymity as a measure of privacy is not without shortcomings, its simplicity makes it a widely popular criterion in the SDC literature.

1.2. Privacy in location-based services

Recent advances in wireless communications and positioning technologies have opened up enormous business opportunities for location-based services (LBSs). Naturally, some of the data exchanged refers to private user information, such as user location and preferences, and should be carefully managed. In this spirit, we consider a particular application of location-based Internet access, which will serve as motivation for an architecture of private information retrieval, where k-anonymity is attained by means of clustering of user coordinates, for a large integer k.

More precisely, as depicted in Fig. 2, consider Internet-enabled devices equipped with any sort of location tracking technology, frequently operative near a fixed reference location, for example a home computer, or a cell phone that is most commonly used from the same workplace. Suppose that such devices access the Internet to contact information providers, occasionally to inquire about location-based information that does not require perfectly accurate coordinates, say weather reports, traffic congestion, or local news and events. Even if authentication to the information providers were carried out with pseudonyms or authorization credentials, accurate location information could be exploited by the providers to infer user identities, possibly with the help of an address directory such as the yellow pages.

Fig. 2. Internet-enabled devices retrieving information related to a fixed reference location where they commonly operate.


Analyzing both location-based and location-independent queries coming from these devices, information providers could profile users according to their queries, in terms of both activity and content, thereby compromising their privacy. At this point we would like to describe a possible mechanism to counter this, at a functional level, solely to motivate our work. A trusted third party (TTP) collects accurate location information corresponding to the home location of these devices, possibly already publicly available in address directories. This party performs k-anonymous clustering of locations, that is, it groups locations, minimizing the distortion with respect to centroid locations common to k nearby devices. Intuitively, while the same measure of privacy may be applied to all devices, devices with a home location in more densely populated areas should belong to smaller clusters and enjoy a smaller location distortion. The devices trust this intermediary party to send them back the appropriate centroid, which they simply use in lieu of their exact home location, together with their pseudonym, in order to access LBS providers, as often as needed. Ideally, the TTP would carry out all the computational work required to cluster locations while minimizing the distortion, in a reasonably dynamic way that should enable devices to sign up for and cancel this anonymization service based on the perturbation of their home locations.

1.3. Contribution

In this work, we develop a multidisciplinary solution to the important problems of k-anonymous microaggregation and clustering, each illustrated with an application: respectively, privacy protection in databases and private retrieval of location-based information. Our main contribution is two variations of a k-anonymous aggregation algorithm, one of which is particularly suited to the important problem of k-anonymous microaggregation of databases, with a small integer k representing the number of individual respondents indistinguishable from each other in the published database. Our algorithm in its second variation also exhibits excellent performance in the problem of clustering or macroaggregation, where k may take on arbitrarily large values.¹ This newly developed algorithm is a substantial modification of the Lloyd algorithm [2,3], a celebrated quantization design algorithm, endowed with a numerical method to solve nonlinear systems of equations based on the Levenberg–Marquardt algorithm [4].

The most promising aspect of this algorithm is undoubtedly its performance in terms of the trade-off between privacy and data utility, inherent in any data-perturbative method for privacy protection. Precisely, while maintaining the same k-anonymity constraints, our algorithm outperforms the state-of-the-art microaggregation algorithm MDAV by a significant reduction in distortion. Experiments show a typical distortion reduction, with respect to MDAV, of approximately 20% and 15% for clustering of uniform and Gaussian data, respectively, and 27% and 15% for microaggregation of uniform and Gaussian data, respectively. When applied to real, standardized microdata sets, our algorithm is still capable of reducing the distortion introduced by MDAV by a noticeable margin, of up to 14% in some cases, at no anonymity cost whatsoever.
Under a more general perspective, the algorithm presented here is a distortion-optimized, probability-constrained quantizer design method, applicable to a wide range of engineering problems, among which the most widely known match, within the privacy literature, would be microdata anonymization. Indeed, we choose microaggregation as the main motivating example, and consequently one of the experimental cases analyzed. Less common in the privacy literature is the problem of k-anonymous clustering for large k, which we attempt to illustrate here with a simple architecture for k-anonymous retrieval of location-based information. Essentially, we consider location-aware devices, commonly operative near a fixed reference location. We then regard accurate, fixed location data as a quasi-identifier, and rescue and merge the principles behind pseudonymization, location anonymization and the privacy criterion used in microdata k-anonymization. Precisely, accurate location information is collected by a trusted third party to create distortion-optimized, size-constrained clusters, where k nearby devices share a common centroid location. This centroid location is sent back to the devices, which use it as a location quasi-pseudonym whenever they need to contact location-based information providers, in lieu of the exact home location, in order to enforce k-anonymity. Even though our actual contribution is the proposed algorithm, we believe the second application example described is an interesting complement, for it hints at the possibility of its applicability well beyond SDC with small k.

1.4. Contents

This paper is organized as follows. Sections 2.1 and 2.2 summarize the state of the art on SDC and privacy in LBSs, respectively, and Section 2.3 reviews the fundamentals of quantizer design. Section 3 describes our main contribution. It formally states the general problem of k-anonymous aggregation as a quantizer design problem, and develops a novel modification of the Lloyd algorithm for distortion-optimized, size-constrained clustering, suitable for both microaggregation and clustering. An architecture for k-anonymous retrieval of location-based information is proposed in Section 4, merely as an example of anonymous clustering, one of the possible applications of our algorithm, along with considerations of privacy attacks and connections with the literature on pseudonymization and microaggregation. Examples and empirical results are reported in Section 5 for various scenarios and statistics. Conclusions are drawn in Section 6. Finally, the rationale behind part of the heuristics involved in our algorithm is detailed in Appendix A.

¹ Microaggregation commonly deals with values of k ranging from 2 to 100, and clustering could very well be defined on the basis of any value beyond that range. However, the distinction between small and large k will be made clear during our description of the two variations of the algorithm and the experimental results. Suffice it to say that our two variations of the algorithm cover any possible range of k.


2. Background and state of the art

This section gives first a rough overview of the state of the art on the two motivating applications of our contribution, namely SDC and privacy in LBSs. Secondly, it offers a few mathematical preliminaries and a brief summary of the fundamentals of traditional quantizer design, in preparation for the formulation of our own work.

2.1. State of the art on statistical disclosure control

We now proceed to sketch a brief overview of the well-established field of SDC, motivated in the introductory Section 1.1. Although we shall focus on the application of k-anonymous microaggregation to databases, it is convenient, from this point on, to bear in mind that the principles of microaggregation extend to location data in LBSs. In any case, this extension will be emphasized in the next subsection.

Recall from Section 1.1 that quasi-identifiers are data tuples that could be used to identify individual respondents, possibly with the help of external information. Microaggregation attains k-anonymity by selecting groups of at least k quasi-identifiers, and then replacing them with a common representative value for each group. The ultimate goal of microaggregation algorithms is to find, among all possible k-anonymous aggregations, the one with maximum data utility; in other words, the one introducing the minimum data distortion, according to some predefined distortion measure, commonly the average squared error.

We also mentioned in the introduction that the concept of k-anonymity, originally proposed by the SDC community [5,6], is a widely popular privacy criterion, partly due to its mathematical tractability. More formally, k-anonymity is best understood as a key component of a privacy model addressing the problem of reidentification by linking [7]. In essence, a privacy attacker has access to two datasets: one linking identities with quasi-identifiers, and another linking quasi-identifiers to confidential attributes. SDC contemplates the aggregation of the quasi-identifiers in the former dataset to hinder attackers in their efforts to establish a link between identities and confidential attributes. The original formulation of k-anonymity as a privacy criterion, based on generalization and recoding of key attributes, was modified into the microaggregation-based approach already commented on, and illustrated in Fig. 1, in Refs. [8–11]. Both approaches may be regarded as special cases of a generalization utilizing an abstract distortion measure between the unperturbed and the perturbed data, possibly in rather different alphabets.

Distortion-optimal multivariate microaggregation was proved to be NP-hard [12], and its combinatorial complexity, even with a small number of samples, precludes brute-force search.² A number of heuristic methods have been proposed, which can be categorized into fixed-size and variable-size methods, according to whether all groups but one have exactly k elements with common perturbed quasi-identifiers. The maximum distance (MD) algorithm [9] and its less computationally demanding variation, the maximum distance to average vector (MDAV) algorithm [10,13–15], are fixed-size algorithms that perform particularly well, in terms of the distortion they introduce, for many data distributions. Popular variable-size algorithms include the μ-Approx algorithm [11], the minimum spanning tree (MST) algorithm [16], the variable MDAV (VMDAV) algorithm [17] and the two fixed reference points (TFRP) algorithm [18].

² There exist n!/(k!)^(n/k) ways of clustering n samples into cells of k samples each, assuming k divides n. For example, n = 100, k = 2 yields 8.3 · 10^142 possible microaggregations.
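As a quick sanity check on the footnote's figure, the count is easy to evaluate exactly; the snippet below is our own illustration, not part of the original paper:

```python
from math import factorial

def clusterings(n: int, k: int) -> int:
    # Number of ways to split n samples into n/k cells of k samples each,
    # with cells distinguished by their quantization index: n! / (k!)^(n/k).
    assert n % k == 0
    return factorial(n) // factorial(k) ** (n // k)

print(f"{clusterings(100, 2):.1e}")  # 8.3e+142, matching the footnote
```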
Efforts to circumvent the complexity of multivariate microaggregation exploit projections onto one dimension, but are reported to yield a much higher disclosure risk [19].

Unfortunately, while k-anonymity prevents identity disclosure, it may fail to protect against attribute disclosure. Precisely, the definition of this privacy criterion establishes that complete reidentification is unfeasible within a group of records sharing the same tuple of perturbed key attribute values. However, if the records in the group also share a common value of a confidential attribute, the association between an individual linkable to the group of perturbed key attributes and the corresponding confidential attribute remains disclosed. More generally, the main issue with k-anonymity as a privacy criterion is its vulnerability against the exploitation of the difference between the prior distribution of confidential data in the entire population, and the posterior conditional distribution of a group given the observed, perturbed key attributes. For example, imagine that, within a given group, the proportion of respondents with high cholesterol is much higher than that in the overall data set. This is known as a skewness attack.

This vulnerability motivated the proposal of enhanced privacy criteria, some of which we proceed to sketch, along with algorithm modifications. A restriction of k-anonymity called p-sensitive k-anonymity was presented in Refs. [20,21]. In addition to the k-anonymity requirement, it is required that there be at least p different values for each confidential attribute within the group of records sharing the same tuple of perturbed key attribute values. Clearly, large values of p may lead to huge data utility loss. A slight generalization called l-diversity [22,23] was defined with the same purpose of enhancing k-anonymity. The difference with respect to p-sensitivity is that each group of records must contain at least l "well-represented" values for each confidential attribute. Depending on the definition of well-represented, l-diversity can reduce to p-sensitive k-anonymity or be more restrictive. We would like to stress that neither of these enhancements succeeds in completely removing the vulnerability of k-anonymity against skewness attacks. Furthermore, both are still susceptible to similarity attacks, in the sense that while confidential attribute values within a group might be p-sensitive or l-diverse, they might also very well be semantically similar, for example similar diseases or salaries.

A privacy criterion aimed at overcoming similarity and skewness attacks is t-closeness [24]. A perturbed microdata set satisfies t-closeness if, for each group sharing a common tuple of perturbed key attribute values, the distance between the posterior distribution of the confidential attributes in the group and the prior distribution of the overall population does not exceed a threshold t. As argued in Ref. [25], to the extent to which the within-group distribution of confidential attributes resembles the distribution of those attributes for the entire dataset, skewness attacks will be thwarted.


In addition, since the within-group distribution of confidential attributes mimics the distribution of those attributes over the entire dataset, no semantic similarity can occur within a group that does not occur in the entire dataset. The main limitation of the original t-closeness work [24] is that no computational procedure to reach t-closeness was specified.

An information-theoretic privacy criterion, inspired by t-closeness, was proposed in Refs. [26,27]. In the latter work, privacy risk is defined as the conditional Kullback–Leibler divergence between the posterior and the prior distributions. This criterion is also tightly related to the concept of equivocation introduced by Shannon in 1949 [28], namely the conditional entropy of a private message given an observed cryptogram. Conceptually, the privacy risk defined may be regarded as an averaged version of the t-closeness requirement, over all aggregated groups. The main advantage of this information-theoretic privacy criterion is that it leads to a mathematical formulation of the privacy–utility trade-off that generalizes a well-known, extensively studied information-theoretic problem with half a century of maturity, namely the problem of lossy compression of source data with a distortion criterion, first proposed by Shannon in 1959 [29]. It is important to notice as well that the criterion for privacy risk in Refs. [26,27], in spite of its convenient mathematical tractability, like any criterion based on averages, may not be adequate in all applications [30]. A less technical, fairly conceptual discussion of this measure of privacy risk for SDC based on the Kullback–Leibler divergence appears in Ref. [31]. A related criterion was used in the context of optimal query forgery strategies for PIR in Ref. [32].

In conclusion, we would like to emphasize that despite the shortcomings of k-anonymity and its enhancements as a measure of privacy, it is still a widely popular criterion for SDC, mainly because of its simplicity and its theoretical interest. More generally, we acknowledge that the formulation of any privacy–utility problem relies on the appropriateness of the criteria optimized, which in turn depends on the specific application, on the statistics of the data, on the degree of data utility we are willing to compromise, and, last but not least, on the adversarial model and the mechanisms against privacy contemplated. No privacy criterion, including k-anonymity in its numerous varieties, is the be-all and end-all of database anonymization [33].

2.2. State of the art on privacy in location-based services

Because our contribution is essentially a novel algorithm for k-anonymous microaggregation and clustering, along with an illustration of its applicability to both SDC and LBSs, the remainder of our literature review, which shall contemplate privacy in LBSs, must necessarily be regarded merely as a partial, quick glance at the subject.

The simplest form of interaction between a user and an LBS provider involves a direct message from the former to the latter including a query and the location to which the query refers. An example would be the query "Where is the nearest bank from my home address?", accompanied by the geographic coordinates or simply the address of the user's residence. Under the assumption that the communication system used allows the LBS provider to recognize the user ID, there exists a patent privacy risk. Namely, the provider could profile users according to their locations, the contents of their queries and their activity.
An intuitive solution that would preserve user privacy in terms of both queries and locations is the mediation of a TTP in the location-based information transaction, as depicted in Fig. 3. The TTP may simply act as an anonymizer, in the sense that the provider cannot know the user ID, but merely the identity IDTTP of the TTP itself inherent in the communication. Alternatively, the TTP may act as a pseudonymizer by supplying a pseudonym ID′ to the provider, where only the TTP knows the correspondence between the pseudonym ID′ and the actual user ID. A convenient twist to this approach is the use of digital credentials [34–36] granted by a TTP, namely digital content proving that a user has sufficient privileges to carry out a particular transaction without completely revealing their identity. The main advantage is that this TTP would not need to be online at the time of service access for users to access a service with a certain degree of anonymity. Unfortunately, this approach does not prevent the LBS from attempting to infer the real identity of a user by linking their location to, say, a public address directory, for instance by using restricted space identification (RSI) or observation identification (OI) attacks [37]. In addition, TTP-based solutions require that users shift their trust from the LBS provider to another party, possibly capable of collecting queries for diverse services, which unfortunately might facilitate user profiling through cross-referencing inferences. Finally, traffic bottlenecks are a potential issue with TTP solutions, and TTPs may represent a severe infrastructure requirement in certain ad hoc networks.

We shall see that the main TTP-free alternatives rely on perturbation of the location information, user collaboration, and user–provider collaboration. The principle behind TTP-free perturbative methods for privacy in LBSs is represented in Fig. 4. Essentially, users may contact an untrusted LBS provider directly, perturbing their location information in order to hinder providers in their efforts to compromise user privacy in terms of location, although clearly not in terms of query contents and activity. This approach, sometimes referred to as obfuscation, presents the inherent trade-off between data utility and privacy common to any perturbative privacy method. A wide variety of fairly sophisticated perturbation methods for LBSs has been proposed [38–42].

Fig. 3. Anonymous access to an LBS provider through a TTP.


Fig. 4. Users may contact an untrusted LBS provider directly, perturbing their location information to help protect their privacy.

There exist TTP-free methods relying on the collaboration between multiple users to achieve group anonymity. Many collaborative methods [43–45] strive to guarantee k-anonymity regarding both query contents and location, in the sense of the term defined in Sections 1.1 and 2.1, by means of constructing spatial cloaking regions or through the encrypted computation of average positions. Returning to privacy mechanisms specifically aimed at LBSs, a third class of TTP-free methods, such as Ref. [46], builds upon cryptographic techniques for private information retrieval (PIR) [47], which may be regarded as a form of untrusted collaboration between users and providers. Recall that PIR enables a user to privately retrieve the contents of a database, indexed by a memory address sent by the user, in the sense that it is not feasible for the database provider to ascertain which of the entries was retrieved [47]. Unfortunately, PIR methods require the provider's cooperation in the privacy protocol, are limited to a certain extent to query-response functions in the form of a finite lookup table of precomputed answers, and are burdened with a significant computational overhead.

An approach that preserves user privacy to a certain extent, at the cost of traffic and processing overhead, and that does not require the user to trust the information provider or the network, consists in accompanying original queries with bogus queries. Query forgery is a simple form of PIR, in the most general sense of the term, usually motivated in the context of privacy in search engines but applicable to LBSs as a particularization. Building on this simple principle, several PIR protocols, mainly heuristic, have been proposed and implemented, with various degrees of sophistication [48–50]. An illustrative example for LBSs is Ref. [51]. An information-theoretic formulation of optimal forgery strategies is introduced in Ref. [32]. Simple, heuristic implementations in the form of add-ons for popular browsers have recently begun to appear [52,53]. In addition to legal implications, there are a number of technical considerations regarding bogus traffic generation for privacy [54].

Not surprisingly, a number of proposals for privacy in LBSs combine several of the elements appearing in all of the solutions above. The hybrid solutions most relevant to this work build upon the idea of location anonymizers, that is, TTPs implementing location perturbative methods [55], with the aim of hindering RSI and OI attacks, in addition to hiding the identity of the user. A sophisticated example of SDC clustering techniques applied to LBS databases containing trajectory information is Ref. [56], where the k-anonymity criterion is broadened to exploit the imprecision inherent in positioning systems. Many others are based on the k-anonymity and cloaking privacy criteria [37,57,44,58–60]. In fact, the application example of macroaggregation for private LBSs in this work falls into this category. We would like to remark that, unlike some of these architectures, the location anonymizer in our application example receives fixed reference locations once, clusters them using a k-anonymous algorithm in an optimized fashion that strives to minimize the distortion with respect to a shared centroid value for each cluster, informs the users of their centroid, and lets users contact the LBS provider directly from then on, as often as needed, using that centroid.
Finally, we would like to briefly comment on some recent, rather sophisticated distributed protocols for privacy in LBSs, which also combine some of the ideas presented. While they do not assume that users necessarily trust each other, they do require certain trust relationships between architectural entities, or provide cloaking regions rather than exact positions to the LBS provider. The first example [61] envisions an architecture where two entities have access to user identities and location information separately. Accordingly, it is assumed that no information that could compromise location anonymity is exchanged between those two entities, for instance through auditing by a trusted mediator. An even more recent example [62] enables users to achieve location k-anonymity via a distributed, homomorphic cryptographic protocol. The protocol in Refs. [63,64] is presented as a solution to the problem of private retrieval of location-based information, but it is directly applicable to PIR in the most general sense of the term. In essence, it combines the techniques of query forgery and exchange in a collaborative network of users that do not trust each other, nor require a TTP or location anonymizer.

2.3. Mathematical preliminaries and background on traditional quantization

Throughout the paper, the measurable space in which a random variable (r.v.) takes on values will be called an alphabet. The cardinality of a set X is denoted by |X|. We shall follow the convention of using uppercase letters for r.v.'s, and lowercase letters for the particular values they take on. Probability density functions (PDFs) and probability mass functions (PMFs) are denoted by p, subscripted by the corresponding r.v. The expectation operator is denoted by E. Expectation can model the special case of averages over a finite set of data points {x1, …, xn}, simply by defining an r.v. X uniformly distributed over this set, so that, for instance, EX = (1/n) ∑ᵢ xᵢ.

A quantizer is a function that partitions the range of values x of an r.v. X, commonly continuous, approximating each resulting cell by a value x̂ of a discrete r.v. X̂. The quantizer map x̂(x) may be broken down into two steps.


Fig. 5. Formalization and example of a quantizer: (a) k-anonymous aggregation regarded as a quantization design problem; (b) example of a quantizer in the two-dimensional Euclidean space.

First, an assignment of source data X to a quantization index Q, usually a natural number, by means of a clustering function q(x); and secondly, a reconstruction function x̂(q) mapping the index Q into a value X̂ that approximates the original data, so that x̂(x) = x̂(q(x)). Both x̂(x) and q(x) are often referred to as quantizers. A general k-anonymous aggregation function is depicted in Fig. 5 in terms of the formulation just described, along with an example where the r.v. X takes on values in ℝ². Depending on the field of application, the terms clustering or (micro)aggregation are used in lieu of quantization, and cluster in lieu of cell, even though they have essentially equivalent meanings. Often, a distinction is made between clustering and microaggregation, to emphasize the large and small size of the resulting cells, respectively.

In the context of source coding, quantizers are required due to the need to represent the data in a countable alphabet, such as the set of finite bit strings, suitable for storage and transmission in computer systems. We reflect this requirement mathematically by assuming that Q is a finite-alphabet r.v. The size of this alphabet, that is, the number of quantization cells, is usually given as an application requirement. Clearly, quantization comes at the price of introducing a certain amount of distortion between the original data X and its reconstructed version X̂. In mathematical terms, we define a nonnegative function d(x, x̂) called distortion measure, and consider the expected distortion D = E d(X, X̂). A common measure of distortion is the mean squared error (MSE), that is, D = E‖X − X̂‖², popular due to its mathematical tractability.

Optimal quantizers are those of minimum distortion for a given number of possible indices. It is well known [65,66] that optimal quantizers must satisfy the following conditions:

• Nearest-neighbor condition. Given a reconstruction function x̂(q), the optimal quantizer q*(x) is given by

q*(x) = arg min_q d(x, x̂(q)),   (1)

that is, each value x of the data is assigned to the index corresponding to the nearest reconstruction value.

• Centroid condition. In the special case when MSE is used as the distortion measure, given a quantizer q(x), the optimal reconstruction function x̂*(q) is given by

x̂*(q) = E[X | q],   (2)

that is, each reconstruction value is the centroid of a quantization cell.

Each necessary condition may be interpreted as the solution to a Bayesian decision problem. We would like to stress that these are necessary but not sufficient conditions for joint optimality. Still, these optimality conditions are exploited in the Lloyd–Max algorithm [2,3], a quantizer design algorithm based on the alternating optimization of q(x) given x̂(q) and vice versa, according to Eqs. (1) and (2). The Lloyd algorithm has been rediscovered several times in the statistical clustering literature [66], under names such as the k-means method.³ In the Lloyd algorithm, either the quantization or the reconstruction is improved at a time, leading to an improvement in the distortion; strictly speaking, the sequence of distortions is nonincreasing.

³ The term k-means was first used by J. MacQueen in 1967 [67], although the idea goes back to H. Steinhaus in 1956 [68]. The algorithm was rediscovered independently by S. Lloyd in 1957 as a quantization technique for pulse-code modulation, but it was not published until much later, in 1982 [2].


Even though a nonnegative, nonincreasing sequence has a limit, rigorously, the convergence of the sequence of distortions does not guarantee that the sequence of quantizers obtained by the algorithm will tend to a stable configuration, less so to a jointly optimal one. In theory, the algorithm can merely provide a sophisticated upper bound on the optimal distortion. In practice, however, the algorithm often exhibits excellent performance ([66], §II.E, §III).

Lastly, we would like to remark that the literature abounds with sophisticated modifications of the Lloyd algorithm, many addressing the complexity of quantizing large quantities of high-dimensional data. Practical lossy codecs, for instance, almost universally exploit statistical redundancies of the data by means of an orthonormal transformation prior to the one-dimensional quantization of each of the transformed coefficients, trading off an acceptable distortion loss for a substantial complexity reduction [69]. Conceptually similar are the strategies of wavelet coding, trellis quantization, predictive coding, motion compensation in video, and many others [70,71]. Further, both the theories of quantization and transform coding have been extended beyond the paradigm of traditional source coding [72], and the study of Voronoi diagrams remains the object of intensive research to this day [73]. A clear example of the application of traditional and new heuristic clustering techniques outside the domain of source coding is undoubtedly that of automated document classification [74,75]. Conceivably, these modifications and applications stand as potentially exciting sources of inspiration for future research into the main contribution of this paper, which, as we shall see in the next section, consists in a significant modification of a traditional quantization algorithm.
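To make the alternating structure of conditions (1) and (2) concrete, the following minimal sketch (ours, added for illustration; not the authors' code) runs the conventional Lloyd iteration on a finite set of data points under MSE:

```python
import numpy as np

def lloyd(x: np.ndarray, num_cells: int, iters: int = 50, seed: int = 0):
    """Conventional Lloyd algorithm (k-means) on data x of shape (n, d)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), num_cells, replace=False)].astype(float)
    for _ in range(iters):
        # Nearest-neighbor step (1): assign each point to its closest centroid.
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        q = dist.argmin(axis=1)
        # Centroid step (2): recompute each reconstruction as its cell mean.
        for j in range(num_cells):
            if (q == j).any():
                centroids[j] = x[q == j].mean(axis=0)
    mse = ((x - centroids[q]) ** 2).sum(-1).mean()
    return q, centroids, mse
```

Note that nothing in this iteration constrains the cell sizes, which is precisely the gap the modification of Section 3 addresses.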

3. A modification of the Lloyd algorithm for k-anonymous aggregation

3.1. Formal statement of the k-anonymous aggregation problem as a quantizer design problem

In this section, we proceed to formalize the problem of k-anonymous aggregation in a generic fashion, as a probability-constrained, distortion-optimized quantizer design problem. Microaggregation, in the context of SDC, is a well-established problem with extensive literature, and consequently sufficient to motivate this section. However, to hint at the potential applicability of our work beyond that, in Section 4 we shall provide a second example of k-anonymous aggregation, or more precisely, clustering, where the need for larger values of k arises naturally, this time in the context of LBSs. In either case we are concerned with the general problem of constructing groups of k, possibly multivariate, data samples, and then finding a common representative value for each group minimizing a distortion measure between the original data and such a k-anonymous representation. In the SDC application, the data samples are simply the quasi-identifiers of individual respondents, and the common representative value replacing them in the published version of the data is the means to attain k-anonymity. In the LBS application, as we shall see in detail later in Section 4, the data samples will model exact locations, which may of course be regarded as quasi-identifiers. Experimental results for both applications will be reported in Section 5.

We may now proceed to formally consider the design of minimum-distortion quantizers satisfying cell probability constraints, with the same block structure depicted in Fig. 5. (Tuples of) key attribute values are modeled by an r.v. X in an arbitrary alphabet X, possibly discrete or continuous. The quantizer q(x) assigns X to a quantization index Q in a finite alphabet Q = {1, …, |Q|} of a predetermined size. The reconstruction function x̂(q) maps Q into the aggregated key attribute value X̂, which may be regarded as an approximation to the original data, defined in an arbitrary alphabet X̂, commonly but not necessarily equal to the original data alphabet X. For any nonnegative (measurable) function d(x, x̂), called distortion measure, define the associated expected distortion D = E d(X, X̂), that is, a measure of the discrepancy between the original data values and their reconstructed values, which reflects the loss in data accuracy. An important example of distortion measure is d(x, x̂) = ‖x − x̂‖², for which D becomes the MSE. Alternatively, d(x, x̂) may represent a semantic distance in an ontological hierarchy, and X̂ model a conceptual generalization of X, an r.v. taking on words as values. pQ(q) denotes the PMF corresponding to the cell probabilities.

The k-anonymity requirement in the clustering problem is formalized, from a more general perspective, by means of probability constraints pQ(q) = p0(q) for each cell q, for any given PMF p0(q). In the important example of microdata k-anonymization, let n be the total number of records to be microaggregated. The k-anonymity constraint can be translated into probability constraints by setting |Q| = ⌊n/k⌋ and p0(q) = 1/|Q|, which ensures that n p0(q) ≥ k. More generally, for a given probability p0, we could naturally speak of p0-anonymization, a term better suited to continuous probability models of user locations.
Given a distortion measure d(x, x̂) and probability constraints pQ(q) = p0(q) (along with the specification of the number of quantization cells |Q|), we wish to design an optimal quantizer q*(x) and an optimal reconstruction function x̂*(q), in the sense that they jointly minimize the distortion D while satisfying the probability constraints. We would like to stress that our formulation of the probability-constrained quantization problem may also find further applications in a variety of similarity-based, workload-constrained resource allocation problems. Even though this paper focuses on MSE, our solution is suitable for an entirely generic distortion measure, possibly over categorical alphabets, for example semantic distances.

The remainder of this section investigates the problem of distortion-optimized, probability-constrained quantization formulated here, motivated as the functionality implemented by the microaggregation process in the case of SDC, and by a location k-anonymizer in the case of LBSs, as we shall see later in Section 4. The quantizer design method proposed is a substantial modification of the Lloyd algorithm [2,3], a celebrated quantization design algorithm, endowed with a numerical method to solve nonlinear systems of equations inspired by the Levenberg–Marquardt algorithm [4].
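For the microaggregation instance above, the probability constraints take a particularly simple form. The short sketch below (ours; the function name is illustrative) sets them up for n records and anonymity parameter k:

```python
import numpy as np

def uniform_cell_constraints(n: int, k: int):
    """|Q| = floor(n / k) cells, each constrained to probability 1 / |Q|,
    so that every cell is guaranteed to hold at least k of the n records."""
    num_cells = n // k
    p0 = np.full(num_cells, 1.0 / num_cells)
    assert n * p0[0] >= k  # n * p0(q) >= k, as required for k-anonymity
    return num_cells, p0
```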


3.2. Optimization steps

Next, we propose heuristic optimization steps for probability-constrained quantizers and reconstruction functions, analogous to the nearest-neighbor and centroid conditions found in conventional quantization [65,66], reviewed in Section 2.3. We then modify the conventional Lloyd algorithm by applying its underlying alternating optimization principle to these steps.

Finding the optimal reconstruction function x̂*(q) for a given quantizer q(x) is a problem identical to that in conventional quantization:

x̂*(q) = arg min_x̂ E[d(X, x̂) | q].   (3)

In the special case when MSE is used as the distortion measure, this is the centroid step (2). On the other hand, we may not apply the nearest-neighbor condition of conventional quantization directly if we wish to guarantee the probability constraints pQ(q) = p0(q). We introduce a cell cost function c : Q → ℝ, a real-valued function of the quantization indices, which assigns an additive cost c(q) to the cell indexed by q. The intuitive purpose of this function is to shift the cell boundaries appropriately to satisfy the probability constraints. Specifically, given a reconstruction function x̂(q) and a cost function c(q), we propose the following cost-sensitive nearest-neighbor step:

q*(x) = arg min_q [d(x, x̂(q)) + c(q)].   (4)

This is a heuristic step inspired by the nearest-neighbor condition of conventional quantization (1). According to this formula, increasing the cost of a cell, leaving the cost of the others and all centroids unchanged, will reduce the number of points assigned to it. Conversely, decreasing the cost will push cell boundaries outwards and thus increase its size. Instead of the conventional Voronoi cells [65,66] determined by the traditional nearest-neighbor condition, it is routine to check that our modified requirement defines an additively weighted Voronoi partition composed of convex polytopes.

The step just proposed naturally leads to the question of how to find a cost function c(q) such that the probability constraints pQ(q) = p0(q) are satisfied, given a reconstruction function x̂(q). We remark that for discrete probability distributions of X, it is easy to see that it may not be possible to find such a c(q). In the continuous case, we propose the following method, which proved very successful in all of our experiments, including those reported in Section 5. Specifically, we propose an application of the Levenberg–Marquardt algorithm [4], an algorithm to solve systems of nonlinear equations numerically, or, similarly but slightly more simply, a Tychonov regularization of the Gauss–Newton algorithm [76], for example with backtracking line search [77] along the descent direction. A finite-difference estimation of the Jacobian is carried out by slightly increasing each of the coordinates of c(q), one at a time. To do so more efficiently, we exploit the fact that only the coordinates of pQ(q) corresponding to neighboring cells may change, and compute the negative semidefinite approximation in Frobenius norm to the Jacobian.

3.3. A modification of the Lloyd algorithm for probability-constrained quantization

Ideally, we wish to find a pair of quantizers and reconstruction functions that jointly minimize the distortion. The conventional Lloyd algorithm [2,3], reviewed in Section 2.3, is essentially an alternating optimization algorithm that iterates between the nearest-neighbor (1) and centroid (2) optimality conditions. These are necessary but not sufficient conditions; thus the algorithm can only hope to approximate a jointly optimal pair q*(x), x̂*(q), while only guaranteeing that the sequence of distortions is nonincreasing. We also mentioned that, experimentally, the Lloyd algorithm very often shows excellent performance ([66], §II.E, §III). Bear in mind that our modification of the nearest-neighbor condition (Eq. (4)) for the probability-constrained problem is a heuristic proposal, in the sense that this work does not prove it to be a necessary condition. We still use the same alternating optimization principle, albeit with a more sophisticated nearest-neighbor condition (Eq. (4)), and define the following modified Lloyd algorithm for probability-constrained quantization. In recognition of the celebrated quantization design method which inspired this work, we shall call our algorithm the probability-constrained Lloyd (PCL) algorithm:

1. Choose an initial reconstruction function x̂(q) and an initial cost function c(q).
2. Update c(q) to satisfy the probability constraints pQ(q) = p0(q), given the current x̂(q). To this end, use the method described at the end of Section 3.2, setting the initial cost function as the cost function at the beginning of this step.
3. Find the next quantizer q(x) corresponding to the current x̂(q) and the current c(q), according to Eq. (4).
4. Find the optimal x̂(q) corresponding to the current q(x), according to Eq. (3).
5. Go back to Step 2, until an appropriate convergence condition is satisfied or a maximum number of iterations is exceeded.

The initial reconstruction values may simply be chosen as |Q| random samples distributed according to the probability distribution pX(x) of X. A simple yet effective cost function initialization is c(q) = 0, because it ensures that the corresponding quantizer cells cannot have zero volume. Note that the numerical computation of c(q) in Step 2 should benefit from better and better initializations as the reconstruction values become stable. If the probability of a cell happens to vanish at any point of the algorithm, this can be tackled by choosing a new, random reconstruction value, with cost equal to the minimum of the remaining costs.
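The following compact sketch (our own illustration, not the authors' implementation, under simplifying assumptions: a plain finite-difference Jacobian over all cells and a fixed damping constant, rather than the neighbor-aware Jacobian and line search described in Section 3.2) renders Steps 1–5 in Python for MSE and uniform constraints p0(q) = 1/|Q|:

```python
import numpy as np

def assign(x, centroids, cost):
    # Cost-sensitive nearest-neighbor step (4): additively weighted Voronoi.
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return (dist + cost[None, :]).argmin(axis=1)

def cell_pmf(x, centroids, cost):
    q = assign(x, centroids, cost)
    return np.bincount(q, minlength=len(centroids)) / len(x)

def adjust_costs(x, centroids, cost, p0, damping=1e-3, steps=20, eps=1e-3):
    # Step 2: drive pQ towards p0 with a damped (Levenberg-Marquardt-style)
    # Gauss-Newton iteration; the Jacobian of the cell probabilities with
    # respect to the costs is estimated by finite differences, which only
    # behaves well when every cell contains many samples (see Section 3.4).
    m = len(centroids)
    for _ in range(steps):
        p = cell_pmf(x, centroids, cost)
        residual = p - p0
        if np.abs(residual).max() < eps:
            break
        jac = np.empty((m, m))
        for j in range(m):
            perturbed = cost.copy()
            perturbed[j] += eps
            jac[:, j] = (cell_pmf(x, centroids, perturbed) - p) / eps
        # Damping also handles the singular direction: adding a constant to
        # all costs leaves the assignment, hence the probabilities, unchanged.
        step = np.linalg.solve(jac.T @ jac + damping * np.eye(m), jac.T @ residual)
        cost = cost - step
    return cost

def pcl(x, k, iters=30, seed=0):
    """Probability-constrained Lloyd sketch: |Q| = n // k cells, uniform p0."""
    rng = np.random.default_rng(seed)
    m = len(x) // k
    p0 = np.full(m, 1.0 / m)
    centroids = x[rng.choice(len(x), m, replace=False)].astype(float)  # Step 1
    cost = np.zeros(m)
    for _ in range(iters):
        cost = adjust_costs(x, centroids, cost, p0)       # Step 2
        q = assign(x, centroids, cost)                    # Step 3, Eq. (4)
        for j in range(m):                                # Step 4, Eq. (3)
            if (q == j).any():
                centroids[j] = x[q == j].mean(axis=0)
    return assign(x, centroids, cost), centroids, cost
```

On finite data the constraint can only be met approximately, which is precisely what motivates the noisy variation of Section 3.4 below.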


The stopping convergence condition might, for instance, consider slowdowns in the sequence of distortions obtained. For example, the algorithm can be stopped the first time a minimum relative distortion decrement is not attained, or, more simply, when a maximum number of iterations is exceeded.

It is important to remark that the numerical computation of the costs in the PCL algorithm just described, by means of the Levenberg–Marquardt method or any of the variations mentioned in Section 3.2, assumes that cell probabilities are a smooth function of these costs, as these numerical methods estimate partial derivatives through finite differences. In other words, the PCL algorithm, as described at this point, is only suitable for probability constraints leading to a large number of samples per cell, or for continuous data. In particular, it is suitable for k-anonymous clustering with relatively large values of k. Next, in Section 3.4, we propose a modification of our PCL algorithm that addresses this issue, extending its applicability to microaggregation, that is, small values of k, and, more generally, to any sort of probability constraint.

We shall provide empirical evidence in Section 5 that, just as in the conventional Lloyd algorithm, either the quantization or the reconstruction is improved at a time, and that the sequence of distortions obtained is nonincreasing. This is a very convenient property, implying that the PCL algorithm can be used to improve the performance of any other aggregation algorithm, simply by using that algorithm's output as the initialization of PCL. The same section reports experimental results both for clustering, or large k, and for microaggregation, or small k.

3.4. A noisy variation for the special case of microaggregation or small k

We argued that the PCL algorithm described in Section 3.3 was only suitable for probability constraints leading to a large number of points per cell, or continuous models of the data, because the derivative-based numerical methods involved in the adjustment of the costs c(q) assume a smooth dependence between the costs and the cell probabilities pQ(q). Here, we present a variation of the PCL algorithm suitable for constraints involving any number, however small, of points per cell, and in particular k-anonymous microaggregation, that is, small k. The main idea is the introduction of additional, noisy samples around the original data samples, with decreasing variance. As the number of total samples is large, the numerical methods at the end of Section 3.2 are appropriate, and as the noise variance decreases with the iterations, the final distribution of the data resembles that of the original samples. More precisely, this heuristic variation, which we shall refer to as the noisy PCL algorithm, is as follows:

1. Choose an initial reconstruction function x̂(q) and an initial cost function c(q).
2. For each original data sample, introduce new noisy samples with a given covariance matrix Σ, and take the original data samples and the newly created noisy samples as the data samples for the remainder of the algorithm (any previous noisy samples are replaced).
3. Update c(q) to satisfy the probability constraints pQ(q) = p0(q), given the current x̂(q). To this end, use the method described at the end of Section 3.2, setting the initial cost function as the cost function at the beginning of this step.
4. Find the next quantizer q(x) corresponding to the current x̂(q) and the current c(q), according to Eq. (4).
5. Find the optimal x̂(q) corresponding to the current q(x), according to Eq. (3).
6. Reduce the magnitude of the entries of the covariance Σ.
7. Go back to Step 2, until an appropriate convergence condition is satisfied or a maximum number of iterations is exceeded.

The number of noisy samples per data sample must be chosen so that the probability constraints lead to a large number of (total) samples per cell, and the finite-difference estimations of partial derivatives in the numerical methods used to adjust the costs work appropriately. The experiments reported later on in Section 5 use independent, identically distributed Gaussian samples with independent components, so that there are thousands of samples per cell. The noise variance is initially large enough for the resulting clouds of points to overlap, forming a fairly contiguous distribution, and finally small enough for the clouds not to intersect, allowing an approximate distinction of the original samples. The reduction of the variance follows a geometric progression. The appropriate maximum number of iterations is highly dependent on the data. This heuristic variation of the PCL algorithm will prove very effective in the experiments of Section 5. The rationale behind it is detailed in Appendix A, and illustrated with a few examples.
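As an illustration of the noise schedule (again a sketch of ours, reusing the assign and adjust_costs helpers from the previous listing, with hypothetical parameter choices), the noisy variation can be written as a wrapper around the same alternating steps:

```python
import numpy as np

def noisy_pcl(x, k, outer_iters=10, copies=50, sigma0=1.0, decay=0.7, seed=0):
    """Noisy PCL sketch: surround each original sample with Gaussian copies of
    geometrically decreasing variance, and run the PCL steps on the mix."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    m = n // k
    p0 = np.full(m, 1.0 / m)
    centroids = x[rng.choice(n, m, replace=False)].astype(float)   # Step 1
    cost, sigma = np.zeros(m), sigma0
    for _ in range(outer_iters):
        # Step 2: fresh noisy samples replace those of the previous iteration.
        noise = sigma * rng.standard_normal((n, copies, d))
        data = np.vstack([x, (x[:, None, :] + noise).reshape(-1, d)])
        cost = adjust_costs(data, centroids, cost, p0)             # Step 3
        q = assign(data, centroids, cost)                          # Step 4
        for j in range(m):                                         # Step 5
            if (q == j).any():
                centroids[j] = data[q == j].mean(axis=0)
        sigma *= decay   # Step 6: geometric reduction of the noise variance
    # Finally, read off the k-anonymous cells of the original samples alone.
    return assign(x, centroids, cost), centroids
```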


3.5. On convergence, complexity and hierarchical partitioning

Shortly, we shall provide experimental evidence that PCL may, and often does, outperform the state-of-the-art MDAV in terms of data utility for the exact same k-anonymity requirement. It is fair to stress, however, that PCL is noticeably more complex than MDAV, both in terms of sophistication and running time, as one may gather from the description above. Fortunately, the two scenarios of application that we contemplate in this paper will typically come with fairly lenient time constraints.

Despite the heuristic, experimental, necessarily limited nature of our contribution, and the previous argument against a strong concern for running time, we would not like to finish this section without a quick note on the convergence and complexity properties of PCL. Because PCL combines and extends two already sophisticated algorithms, namely the Lloyd and the Levenberg–Marquardt algorithms, a rigorous, detailed analysis would surely require a great deal of effort beyond the scope of this paper. As far as the conventional Lloyd algorithm is concerned, suffice it to say that the algorithm is known to converge only under certain statistics [78–80]. We stressed in our review of Section 2.3 that the conditions the Lloyd algorithm is based on are necessary for optimality but not sufficient, and that even though the sequence of distortions obtained has a limit, this does not imply that the sequence of quantizers tends towards a stable, jointly optimal configuration. It is essentially the excellent experimental performance of the Lloyd algorithm that warrants its use ([66], §II.E, §III). In reference to the running time of the algorithm, and more concretely, the scaling of the number of iterations required with the size of the dataset, again, common practice imposes a stopping condition based on the numerical stabilization of the sequence of distortions [65,66]. It is only natural to expect the analysis of PCL to be at least as intricate as that of Lloyd, being an extension. There exist a number of mathematically involved convergence and complexity results on the Levenberg–Marquardt method in some of its variations [81]. A particularly recent one bounds complexity in terms of the norm of the gradient of the error [82]. Regrettably, it is far from straightforward to apply these bounds to PCL, not to mention that doing so would only address one step of PCL, namely that where the costs are adjusted to satisfy the probability constraints. In conclusion, aside from marking clear paths for future research, we confine ourselves to a preliminary, merely experimental analysis of the running time of PCL for a real, standardized dataset, in Section 5.3.

We shall see in that section that, unsurprisingly, the complexity of the algorithm is markedly nonlinear in the number of cells, which is the number of unknown costs in the system of nonlinear equations modeling the cost adjustment. For this very reason, we propose a hierarchical application of PCL, implemented in the experimental section and sketched below. Precisely, we propose the prepartitioning of the dataset into a small number of large macrocells, and the individual postpartitioning of these macrocells into cells of the intended size k. Effectively, we are breaking down a large system of equations into decoupled parts. Clearly, the postpartitioning process has a complexity that grows linearly with the number of macrocells, and offers the convenient advantage of allowing parallel computation. Even though we used a two-level approach, multiple levels of hierarchical partitioning may help reduce the running time of the preprocessing. In our experiments of Section 5, we shall see that the time required for prepartitioning was almost negligible compared to postpartitioning, but that the hierarchical application of PCL incurs a small but noticeable loss in distortion performance. On the flip side, one may regard prepartitioning as a practically convenient strategy to gracefully trade off distortion for running time.
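A minimal rendering of the two-level strategy (ours, built on the pcl sketch of Section 3.3; the policy for macrocells too small to split is an assumption of this illustration):

```python
import numpy as np

def hierarchical_pcl(x, k, num_macrocells=8):
    """Two-level PCL: prepartition into a few large macrocells, then
    postpartition each macrocell independently into cells of size k,
    decoupling the large system of cost equations into small ones."""
    macro_size = max(len(x) // num_macrocells, k)
    macro_q, _, _ = pcl(x, macro_size)        # prepartitioning (coarse, cheap)
    q = np.empty(len(x), dtype=int)
    next_label = 0
    for j in np.unique(macro_q):
        idx = np.flatnonzero(macro_q == j)
        if len(idx) < 2 * k:                  # too small to split any further
            q[idx] = next_label
            next_label += 1
            continue
        sub_q, _, _ = pcl(x[idx], k)          # postpartitioning (parallelizable)
        q[idx] = next_label + sub_q
        next_label += sub_q.max() + 1
    return q
```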
Because our main contribution is precisely the introduction of a k-anonymous aggregation algorithm, namely PCL, described in Section 3.3, and because its applicability to microaggregation suffices to motivate it, we do not flesh out the clustering application for LBSs beyond what is strictly necessary to hint at the possibility that PCL may be applicable for large k. Section 5 will provide experimental evidence that PCL typically outperforms the state-of-the-art algorithm MDAV, for both microaggregation and clustering, that is, for both low and high k, respectively. This will demonstrate that PCL is a superior candidate to implement the key building block of the architecture described next.

4.1. Privacy attack against location-based querying

Recall from Section 1.2 that we consider devices frequently operative near a fixed reference location, accessing the Internet to contact information providers, occasionally to inquire about location-based information that does not require perfectly accurate coordinates. In addition to the examples of devices already mentioned in that section, depicted in Fig. 2, namely home computers and cell phones used from the same workplace, we may contemplate others, such as desktop computers at a particular Internet café frequently visited at similar times, Internet-enabled PDAs with WiFi capabilities used on a daily train to work, and laptops commonly operated from a hotel one may regularly stay in. We assume that users submit their reference location to an information provider, along with a query based on that location, to inquire, perhaps, about an elegant restaurant relatively close to their home, the branch of their bank in the area where they work, local news and events, traffic congestion or weather reports.

Submission of a perfectly accurate location to an untrusted information provider poses a serious privacy risk. We provide a more specific illustration of such a privacy compromise by means of an example. A user accesses a location-based information provider from a home computer to inquire about an elegant, local Italian restaurant to take her date to. The user specifies her home location directly by means of her address, or perhaps GPS coordinates. Immediately afterwards, and from the same computer, the user checks the restaurant website and proceeds to make an online reservation for dinner, for two people, the same night. A few minutes later, she buys two tickets for a musical downtown, directly at the theater's website or through an intermediary service for online ticket purchases. Assuming that the same IP address has been used for all these transactions, and despite the fact that the initial query to the LBS has been pseudonymized, a privacy attacker is able to pinpoint the home address of the user, link it to her IP address, and, with the help of the yellow pages, infer her full name and home phone number. In this way, the attacker knows, in significant detail, about the user's date that night, in addition to her dinner and entertainment preferences. Furthermore, with the help of someone working in the payments department associated with one of the websites involved, the attacker may gain critical information that could be used for identity theft.


This example illustrates that the solutions for preserving user privacy relying on anonymizers and pseudonymizers [34–36], cited in Section 2.2 and conceptually depicted in Fig. 3, may not be entirely satisfactory. Even if authentication to the information providers were carried out with pseudonyms or authorization credentials, accurate location information could be exploited by the providers to infer user identities, for instance by using RSI and OI attacks [37], possibly with the help of an address directory such as the yellow pages. In the terminology of SDC introduced in Sections 1.1 and 2.1, accurate location data, particularly when corresponding to a fixed reference location, as assumed here, plays the role of a quasi-identifier. Finally, note that location anonymization must be used in conjunction with anonymization and pseudonymization techniques, such as those mentioned in Section 2.2, that prevent attackers from obtaining any and all quasi-identifiers, including identifiers inherent in the communication system, such as an IP address. In the event that the privacy requirement of k-anonymization of locations, viewed as quasi-identifiers, is strengthened to require k-anonymization of all quasi-identifiers available to an attacker, one must consider a joint solution, in which aggregation is performed on the entire tuple of quasi-identifiers.

4.2. Informal description of the architecture

We now present an informal description of an architecture for k-anonymous retrieval of location-based information. A more formal specification is the object of Section 4.3. As we mentioned in Section 2.2, in order to protect user privacy in the retrieval of location-based information, a number of solutions go a step beyond the idea of anonymizers and pseudonymizers and propose location anonymizers that perturb location data [55], many of them based on k-anonymity and cloaking [37,57,44,58–60]. The discussion of Section 4.1 and our assumption that reference locations are rather fixed enable us to regard location data as an effective identifier. Accordingly, we rescue and merge the principles behind pseudonymization and location k-anonymization to propose the architecture informally depicted in Fig. 6. In essence, a TTP collects accurate location information corresponding to the reference locations of the users, possibly already publicly available in address directories. The TTP then performs k-anonymous aggregation of locations, that is, it groups locations, minimizing the distortion with respect to centroid locations common to k nearby reference locations. From then on, users directly send their corresponding centroid, in lieu of their exact reference location, to the LBS provider, whenever they need to access it. In principle, this aggregation could be carried out by means of any of the microaggregation algorithms cited in Section 2.1, for example MDAV. However, we shall introduce and recommend an algorithm of our own, which outperforms the state-of-the-art algorithm MDAV, for both microaggregation and clustering, that is, for both low and high k, respectively, as we shall show experimentally in Section 5. Intuitively, while the same measure of privacy may be applied to all devices, devices with a home location in more densely populated areas should belong to geographically smaller clusters and thus enjoy a smaller location distortion.
The devices trust this intermediary party to send them back the appropriate centroid, which they simply use in lieu of their exact home location, together with their pseudonym, in order to access LBS providers. In the same way that we are led to regard accurate location data as a quasi-identifier, we may regard the perturbed location data as a (k-anonymous) quasi-pseudonym. Ideally, the TTP would carry out all the computational work required to cluster locations while minimizing the distortion, in a reasonably dynamic way that should enable devices to sign up for and cancel this anonymization service based on the perturbation of their home locations.

4.3. Formal description of the architecture

We formalize the architecture described in Section 4.2, under the general formulation of Section 3.1. Specifically, and according to the terminology of Section 2, we describe a protocol, sketched in Fig. 7, for k-anonymous location-based information retrieval with a location anonymizer.

[Fig. 6. Informal depiction of the proposed architecture: the user submits the exact location to the location anonymizer, and uses the perturbed location with the LBS provider.]

[Fig. 7. Architecture.]

Summarizing Section 4.2, users are assumed to frequently operate near a fixed reference location, which we call the home location, represented by values of a r.v. X in an arbitrary alphabet, possibly discrete or continuous, for example, Cartesian or spherical coordinates, or vertices of a graph modeling geographic adjacencies. A TTP playing the role of location anonymizer collects accurate home location information, either from the users, or from publicly available address directories. This party performs k-anonymous clustering of locations, that is, it groups locations around centroid locations common to k nearby devices. Users trust this intermediary party to send them back the appropriate centroid, which they simply use in lieu of their exact home location whenever they access LBS providers. The centroid is represented by the r.v. X̂, which may be regarded as an approximation to the original data, defined in an arbitrary alphabet, commonly but not necessarily equal to the original data alphabet. We stress once more that, for higher privacy protection, users may in addition utilize anonymizers, pseudonymizers or digital credentials, as explained in Section 2.2.

Suppose for the sake of argument that the population of users remained static. Under that idyllic hypothesis, the clustering algorithm, PCL, would need to be executed by the location anonymizer only once. In practice, as new customers start signing up for, or canceling, LBSs, but wish to anonymize their location, the location anonymizer could, in principle, reuse the same centroids and costs already computed by PCL to cluster the new users according to the modified nearest-neighbor condition (Eq. 4). Over time, however, a privacy attacker with a list of users who have canceled the service may observe a slightly smaller anonymity than that intended, and the old aggregation may yield slightly inadequate distortion on the new population. Hence, periodically, once the newer portion of the population acquires sufficient weight, PCL could be rerun by the TTP for the updated population, taking advantage of the iterative nature of the algorithm, that is, using the old parameters as initialization. The hassle of having to inform all customers of their updated centroids would of course prevent this update from occurring too frequently. Evidently, the actual k-anonymity attained by such a slowly updated location anonymizer would be lower than intended when facing a sophisticated privacy attacker that could keep track of the history of perturbed locations used. Alternatively, PCL could be run only once on a population model represented by a probability distribution capturing the relative density of a growing population, for a relative anonymity constraint in a statistical sense, but knowledge of new subscriptions and cancelations by an attacker would still pose a privacy problem. As one may expect after this preliminary digression, the anonymization of time-varying datasets, data streams and trajectories [83,56,42] is still a topic of extensive research. The clustering function implemented by the location anonymizer is depicted in Fig. 8, which is nothing but the particularization of Fig. 5(a) to the special case at hand.
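To fix ideas, the following minimal sketch (in Python; function and variable names are ours, and the numeric values are purely illustrative) shows how the anonymizer could reuse centroids and costs already computed by PCL to cluster a new subscriber, assuming the modified nearest-neighbor condition (Eq. 4) takes the additively cost-weighted form suggested by the additively weighted Voronoi cells of Section 5:

    import numpy as np

    def assign_cell(x, centroids, costs):
        # Modified nearest-neighbor rule (Eq. 4), additively cost-weighted:
        # pick the cell whose squared distance plus cost is smallest,
        # i.e., an additively weighted Voronoi assignment.
        d2 = np.sum((centroids - x) ** 2, axis=1)
        return int(np.argmin(d2 + costs))

    # Hypothetical usage: a new subscriber is clustered with centroids and
    # costs previously computed by PCL, without rerunning the algorithm.
    centroids = np.array([[0.2, 0.3], [0.7, 0.8]])  # reconstruction values (illustrative)
    costs = np.array([0.0, -0.05])                  # costs c(q) (illustrative)
    q = assign_cell(np.array([0.6, 0.7]), centroids, costs)

Because cell sizes are not rechecked here, such an incremental assignment is only a stopgap until PCL is rerun on the updated population, as discussed above.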

5. Examples and experimental results

This section illustrates the noiseless and noisy variations of PCL, the algorithm for probability-constrained quantization design proposed in Sections 3.3 and 3.4, with experimental results for a few intuitive, synthetic datasets in R², and compares its performance, in terms of distortion and anonymity, with that of the state-of-the-art algorithm MDAV. After presenting a couple of simple but insightful examples in Section 5.1, we turn our attention to an experimental analysis for larger sets of samples with uniform and Gaussian statistics in Section 5.2, which contemplates the cases of both clustering and microaggregation, that is, large and small k-anonymity, respectively. Finally, in Section 5.3, we analyze real, standardized datasets to confirm, once more, the higher performance of PCL over MDAV in terms of the privacy–utility trade-off.

In all cases, MSE normalized per dimension is used as distortion measure, thus D = (1/2) E‖X − X̂‖². More explicitly, the expectation is of course taken according to the empirical distribution of the n points x1, …, xn, hence D = (1/(2n)) ∑_{i=1}^{n} ‖x_i − x̂_i‖².
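As a point of reference, this distortion is straightforward to compute; the following sketch (Python; names are ours) evaluates the per-dimension normalized MSE used throughout this section, for an arbitrary dimension d, of which d = 2 is the case above:

    import numpy as np

    def distortion(x, xhat):
        # MSE normalized per dimension: D = (1/(n*d)) * sum_i ||x_i - xhat_i||^2,
        # which reduces to D = (1/(2n)) * sum_i ||x_i - xhat_i||^2 when d = 2.
        n, d = x.shape
        return np.sum((x - xhat) ** 2) / (n * d)

    x = np.random.rand(10, 2)                    # n = 10 points in the unit square
    xhat = np.tile(x.mean(axis=0), (len(x), 1))  # trivial single-cell aggregation
    print(distortion(x, xhat))

For the normalized “Census” dataset of Section 5.3, this quantity coincides with the SDC measure SSE/SST, as noted there.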

[Fig. 8. Location anonymizer: maps exact locations to perturbed locations.]

[Fig. 9. Simple example of k-anonymous microaggregation of two-dimensional points, where PCL outperforms MDAV by a 42% distortion reduction, with exactly the same k-anonymity requirement. The small dots depict n = 10 samples aggregated into |Q| = 5 cells of k = 2 elements (p0 = k/n = 0.2); the large dots represent the reconstruction values x̂1, …, x̂5. PCL was initialized randomly. (a) MDAV; edges represent cell assignments. (b) PCL, 50 iterations; additively weighted Voronoi cells are shown.]

Even though our code has not been optimized for speed, we provide preliminary remarks regarding computational costs for the most challenging experiments carried out, in Section 5.3.

5.1. Simple, illustrative examples

The following examples are meant to show that, even for simple sample sets, the performance advantage of PCL over MDAV may be significant. Figs. 9 and 10 compare the k-anonymous microaggregation of particular distributions of n = 10 points inside the square [0, 1]², with k = 2 and k = 3, for which both MDAV and 50 iterations of the noisy version of PCL have been run. The small dots depict the samples x1, …, xn, and the large ones, the reconstruction values x̂1, …, x̂|Q|. The edges in the MDAV microaggregation illustrate cell assignments. The additively weighted Voronoi cells determined by the modified nearest-neighbor condition (Eq. 4) are shown in the PCL microaggregation. The initial reconstruction values of PCL were randomly drawn from the samples available.

[Fig. 10. Another simple example of k-anonymous microaggregation of two-dimensional points, where PCL outperforms MDAV by a 43% distortion reduction, with exactly the same k-anonymity requirement. The small dots depict n = 10 samples aggregated into |Q| = 3 cells of at least k = 3 elements, occasionally 4 (p0 = k/n = 0.3); the large dots represent the reconstruction values x̂1, x̂2 and x̂3. PCL was initialized randomly. (a) MDAV, D ≈ 0.0332; edges represent cell assignments. (b) PCL, D ≈ 0.0189 (−43%), 50 iterations; additively weighted Voronoi cells are shown.]

The probability constraints used for the case in which k is not a divisor of n were arbitrarily set to pQ(1) = 4/10 and pQ(2) = pQ(3) = 3/10, because a common, fractional probability p0 = k/n = 3/10 cannot be even approximately met for such a small k. Interestingly, PCL yields a distortion roughly 40% lower than that of MDAV, while respecting the very same k-anonymity requirement. It is particularly clear from Fig. 10 that the greedy heuristic implemented by MDAV, after choosing groups 1 and 2 first, is finally left with a poor choice for group 3. It is only fair to recognize that the particular improvement in distortion must necessarily depend on the data samples and the initialization of PCL. In fact, it is straightforward to find samples and initializations leading to a lower distortion improvement, or even, albeit exceptionally, to a poorer distortion, but also to a higher distortion improvement. What these simple examples reveal, however, is the potential for a considerable improvement of PCL over MDAV in terms of the privacy–distortion trade-off. We look into distortion improvements for much larger sets next, in Section 5.2, which shall also prove quite significant.
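When k does not divide n, one simple way to set the equality constraints, in the spirit of the sizes used above and of the slightly enlarged first constraints described in Section 5.2, is sketched below (Python; the function is our illustration of one such choice, not the authors' exact rule):

    def cell_sizes(n, k):
        # |Q| = n // k cells of at least k points each; the n mod k leftover
        # points enlarge the first few constraints by one point apiece.
        m = n // k              # number of cells
        r = n - m * k           # leftover points when k does not divide n
        return [k + 1] * r + [k] * (m - r)

    print(cell_sizes(10, 3))    # [4, 3, 3], i.e., p_Q = (4/10, 3/10, 3/10)

The probability constraints then follow as these sizes divided by n.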

5.2. Experimental results for uniform and Gaussian statistics

We turn now to an experimental analysis based on larger sets of samples, which, while still synthetic and somewhat simplistic, we believe to be quite insightful. Specifically, we draw independent, identically distributed samples of a r.v. X for two probability distributions of points in R². Namely, we first assume that X is uniformly distributed on the square [0, 1]², and secondly, we consider the case when X is composed of independent, zero-mean, unit-variance Gaussian entries. Although in this subsection we shall content ourselves with the empirical intuition provided by synthetic data, it is far from difficult to find real-world data roughly fitting a uniform or even jointly Gaussian model in the SDC scenario of Section 1.1. For example, the height and (the logarithm of) the weight of adult men approximately follow a Gaussian model with correlation coefficient 0.48, according to Ref. [84]. In any case, experiments on real data are reported at the end, in Section 5.3.

From the perspective of the motivating application for LBSs described in Section 1.2, the first distribution may be roughly interpreted as the set of home locations of users in a square, uniformly populated sector. The second distribution is intended to evaluate the algorithm performance for drastically different statistics, but may also be interpreted as the set of home locations of users in a circular urban area with a denser, centric downtown and more spread out suburbs. Further, we contemplate both the cases of clustering and microaggregation, that is, both the cases of large and small k-anonymity, respectively. In the first set of experiments, a total of n = 10000 points are drawn according to each of these statistics, and the PCL algorithm without noisy samples is applied for a common probability constraint pQ(q) = p0, with p0 = k/n = 1/2, 1/3, …, 1/10. In the second set of experiments, the number of samples is n = 250 and the probability constraint corresponds to the k-anonymity requirements k = 2, …, 10, but the noisy version of the PCL algorithm is used to satisfy the k-anonymity constraint strictly. In the latter set, whenever k is not a divisor of n, as a common fractional probability constraint k/n cannot be met even approximately due to the low value of k, we set different equality constraints, either matching the cell sizes of MDAV or enlarging slightly the first few constraints. Even though one could conceivably optimize the choice of probability constraints in this case, distortion improvements for large n and the distributions considered would go unnoticed. In both sets, a total of 5 initializations of the reconstruction values of PCL were used, one based on the results of MDAV, and the remaining 4 randomly chosen from the initial samples.

The results for the noiseless version of the PCL algorithm, appropriate for large k, are shown in Figs. 11 and 12 for the uniform case, and Figs. 13 and 14 for the Gaussian case. A total of 50 iterations were carried out.

[Fig. 11. Performance comparison of PCL vs. MDAV for k-anonymous clustering of n = 10000 two-dimensional points, drawn according to a uniform distribution inside the square [0, 1]², with both algorithms run for p0 = k/n = 1/2, 1/3, …, 1/10. For each k, 5 initializations of the reconstruction values of PCL were used, one based on the results of MDAV, and the rest randomly chosen from the initial samples.]

[Fig. 12. MDAV and PCL clusterings from the experiments represented in Fig. 11, corresponding to the probability constraint p0 = 1/7, with exactly the same k-anonymity requirement. The small dots depict n = 10000 uniform samples aggregated into |Q| = 7 cells of at least k = 1428 elements; the large dots represent the reconstruction values x̂1, …, x̂7. (a) MDAV, kmin = 1428, pmin ≈ 0.14, D ≈ 0.0184. (b) PCL, kmin = 1428, pmin ≈ 0.14, D ≈ 0.0128 (−30%), 50 iterations.]

In the absolutely worst case, among all initializations and for both statistics, the k-anonymity constraint was violated by only 0.2%, while the distortion improvement attained approximate values of up to 35% over MDAV in the uniform case, and up to 20% in the Gaussian case, hardly sensitive to the initialization method. Note that, just as we remarked in Section 3.2, clusters are convex polytopes. We would like to remark here as well that the quantizers for the uniform case in Fig. 12(b), and for the Gaussian case in Fig. 14(b), bear a passing resemblance to a hexagonal lattice, known to minimize the distortion among all two-dimensional lattices [85].

The results for the noisy version of the PCL algorithm, appropriate for small k, are shown in Figs. 15 and 16 for the uniform case, and Figs. 17 and 18 for the Gaussian case. The distortion improvements were approximately of up to 29% over MDAV in the uniform case, and up to 25% in the Gaussian case. Not entirely unexpectedly, the results were more sensitive to the initialization than those for large k, typically swinging the distortion by roughly ±2% with respect to the average performance. Perhaps more surprising was the fact that initializations based on MDAV did not lead in general to a better performance, despite MDAV being an excellent, state-of-the-art heuristic for microaggregation.

[Fig. 13. Performance comparison of PCL vs. MDAV for k-anonymous clustering of n = 10000 two-dimensional points, drawn according to a Gaussian distribution with identity covariance, with both algorithms run for p0 = 1/2, 1/3, …, 1/10. For each k, 5 initializations of the reconstruction values of PCL were used, one based on the results of MDAV, and the rest randomly chosen from the initial samples.]

[Fig. 14. MDAV and PCL clusterings from the experiments represented in Fig. 13, corresponding to the probability constraint p0 = k/n = 1/9. PCL outperforms MDAV by a 15% distortion reduction, at the cost of only a 0.09% worst-case violation of the anonymity requirement. The small dots depict n = 10000 Gaussian samples aggregated into |Q| = 9 cells of a target minimum of k = 1111 elements; the large dots represent the reconstruction values x̂1, …, x̂9. (a) MDAV, kmin = 1111, pmin ≈ 0.11, D ≈ 0.218. (b) PCL, kmin = 1110, pmin ≈ 0.11 (−0.09%), D ≈ 0.187 (−15%), 50 iterations.]

The number of iterations and the initial variance were increased until the k-anonymity criterion was met perfectly, by trial and error in preliminary experimentation, with a single exception among all initializations, for the Gaussian case, which was of course discarded. The number of iterations required by the noisy version of the PCL algorithm, 150 for the uniform case and 300 for the Gaussian, was higher than that required by the noiseless version, partly because the denoising process must be sufficiently slow, and it was implemented as part of the outer iterations, rather than as part of the inner iterations running the Levenberg–Marquardt optimization. In the uniform case, we used 4000 noisy samples per original sample, and 5000 in the Gaussian case. As our proof-of-concept implementation used a common variance value for all samples, we confirmed the intuition that the experiments with Gaussian statistics required a higher initial noise variance, presumably due to the sparse samples forming the exterior of the cloud, and a very small final variance, due to the tightly packed samples at the center.

In this and other experiments carried out, we observed that each optimization step decreased the distortion while approximately respecting the probability constraints. This experimental finding leads us to believe that not only is the centroid condition (3) optimal, but that our proposal for the nearest-neighbor step (4) may be as well, although this remains to be proved. The convergence behavior is promising, and similar to that often exhibited by the conventional Lloyd algorithm; namely, often the same low-distortion solution is found regardless of the initialization, in a small number of iterations.
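The interplay between outer iterations and variance cooling may be easier to see in code. The sketch below (Python; all numeric values are illustrative assumptions, and the cost-adjustment step of PCL is elided, leaving a plain Lloyd-style update, since its details are the subject of Appendix A) shows noisy copies being drawn around each sample and their variance being reduced geometrically across outer iterations, as prescribed in Section 3.4:

    import numpy as np

    rng = np.random.default_rng(0)

    def noisy_copies(x, m, sigma):
        # m i.i.d. Gaussian copies of each sample, with independent components
        n, d = x.shape
        return (x[:, None, :] + sigma * rng.standard_normal((n, m, d))).reshape(-1, d)

    x = rng.random((10, 2))                                      # original samples
    centroids = x[rng.choice(len(x), 5, replace=False)].copy()   # random initialization
    sigma, ratio = 0.25, 0.9          # initial deviation and geometric cooling ratio
    for it in range(50):              # outer iterations
        y = noisy_copies(x, 1000, sigma)                         # thousands of samples per cell
        # ... the cost adjustment (Levenberg-Marquardt) of full PCL would go here ...
        q = np.argmin(((y[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        centroids = np.array([y[q == j].mean(0) if np.any(q == j) else centroids[j]
                              for j in range(5)])                # centroid condition (3)
        sigma *= ratio                # geometric reduction of the noise variance

Initially the clouds around neighboring samples overlap, so the empirical cell probabilities vary smoothly with the parameters; as the deviation shrinks, the clouds separate and the hard, sample-level constraints are recovered.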

[Fig. 15. Performance comparison of PCL vs. MDAV for k-anonymous microaggregation of n = 250 two-dimensional points, drawn according to a uniform distribution inside the square [0, 1]², with both algorithms run for k = 2, …, 10. For each k, 5 initializations of the reconstruction values of PCL were used, one based on the results of MDAV, and the rest randomly chosen from the initial samples.]

[Fig. 16. MDAV and PCL microaggregations from the experiments represented in Fig. 15, corresponding to k = 3. PCL outperforms MDAV by a 26% distortion reduction, with the exact same k-anonymity requirement. The small dots depict n = 250 samples aggregated into |Q| = 83 cells of at least k = 3 elements (p0 = k/n = 0.012); the large dots represent the reconstruction values x̂1, …, x̂83. (a) MDAV, kmin = 3, pmin = 0.012, D ≈ 0.00079; edges represent cell assignments. (b) PCL, kmin = 3, pmin = 0.012 (0%), D ≈ 0.000585 (−26%), 150 iterations; additively weighted Voronoi cells are shown.]

5.3. Experimental results for a real, standardized dataset

We conclude our experimental results with the particularly challenging problem of microaggregation of a real, standardized dataset with the noisy version of PCL of Section 3.4, a problem on which MDAV is known to yield unparalleled performance. Being the most computationally intensive of all tests, we shall also report preliminary results on the computational complexity of our algorithm. The dataset in question, “Census”, was used in the computational aspects of statistical confidentiality (CASC) project [86,87], and has since served as a widespread benchmark in the SDC literature. It contains 1080 records with 13 numerical attributes. In addition to CASC, examples of research studies utilizing this dataset include [9,16].

[Fig. 17. Performance comparison of PCL vs. MDAV for k-anonymous microaggregation of n = 250 two-dimensional points, drawn according to a Gaussian distribution with identity covariance, with both algorithms run for k = 2, …, 10. For each k, 5 initializations of the reconstruction values of PCL were used, one based on the results of MDAV, and the rest randomly chosen from the initial samples.]

An interesting property of this dataset is that it is resistant against microaggregation algorithms that exploit variable-size strategies, natural clusters and heavy skewness of the data, such as μ-Approx [16] or VMDAV [17], making MDAV the best choice. Because PCL has been designed for equality constraints, just like MDAV, “Census” provides an adequate test for PCL, and given the fact that MDAV yields unsurpassed performance on it, an exciting challenge for our algorithm. We adhered to the common practice of normalizing each column of the dataset for zero mean and unit variance. We explored a reasonably wide range of target anonymity constraints, ktarget = 5, 10, 25, 50, 75, 100, broadly representative of the values in the microaggregation literature. Observe that, since every column of the dataset underwent unit-variance normalization, the total variance of the dataset is its number of dimensions. Because our measure of distortion is normalized by the number of dimensions, as we mentioned at the beginning of Section 5, the numbers reported turn out to be equivalent to the popular SDC measure of the sum of squared errors (SSE) divided by the sum of squares total (SST).

Our implementation of MDAV followed the usual fixed-size strategy that starts isolating groups of the target size ktarget, away from the centroid of the overall dataset, and is left with a last group, of larger size when the target size is not a divisor of n, near that centroid. Hence, by specification, the anonymity attained by the algorithm was kmin = ktarget. The resulting distortion and running time for each case are reported in Table 1. In all cases, the running time was practically instantaneous compared to those of PCL, of the order of milliseconds.

Naturally, the sophistication of PCL is not without a price in computational complexity, particularly for the experiments carried out in this subsection. We should keep in mind, however, that the processing times reported correspond to a Matlab implementation of both MDAV and PCL, and that neither the code nor the numerical methods implemented were particularly optimized for speed, but originally designed merely to assess the privacy–utility performance of PCL against MDAV.⁴

As we observed that the computational complexity of the cost adjustment, implemented by the numerical method based on the Levenberg–Marquardt algorithm, increased more than linearly with the number of cells, we resorted to the prepartitioning strategy introduced in Section 3.5 to limit the total computational cost. Specifically, some of the results are obtained from a hierarchical application of PCL, first on the entire dataset, to create size-constrained prepartitions, and secondly on each prepartition individually, in accordance with the k-anonymity constraint. Prepartitioning not only enabled us to reduce the total computational complexity to a linear function of the complexity of each macrocell, but also to implement the postpartitionings as parallel processes on our multicore CPU. The cost of the prepartitioning itself was almost negligible in comparison, due to the small number of macrocells, 2 or 3. We would like to stress that all performance comparisons in terms of distortion improvement of PCL over MDAV are against the most challenging case, namely when MDAV is not prepartitioned, but directly applied to the entire dataset for the best possible performance against PCL. In all cases we verified that, unsurprisingly, a prepartitioned version of MDAV resulted in a distortion loss; MDAV being so fast, prepartitioning it would not have made for a fair comparison.
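A minimal sketch of this two-level strategy follows (Python; the median-split prepartitioner is our stand-in for illustration only, whereas the experiments reported here prepartition with PCL itself, and the per-macrocell microaggregation is left as a pluggable step):

    import numpy as np

    def prepartition_median(x, levels=1):
        # Recursively split at the median of the widest coordinate,
        # yielding 2**levels size-balanced macrocells (index arrays).
        parts = [np.arange(len(x))]
        for _ in range(levels):
            new = []
            for p in parts:
                j = int(np.argmax(x[p].max(0) - x[p].min(0)))  # widest dimension
                order = p[np.argsort(x[p, j])]
                new += [order[: len(p) // 2], order[len(p) // 2:]]
            parts = new
        return parts

    # Each macrocell may now be microaggregated independently, e.g. with PCL
    # or MDAV; being decoupled, these runs can execute as parallel processes.
    x = np.random.randn(1080, 13)               # same shape as "Census"
    macrocells = prepartition_median(x, 1)      # 2 macrocells of 540 points each

The cost adjustment then solves several small systems of nonlinear equations instead of one large one, at the price of the small distortion loss quantified below for ktarget = 25.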
It is reasonable to assume that, given the prepartitioning imposed by the computational complexity of PCL, if additional computation time were available, so that larger prepartitions could be used, the performance of PCL would improve. The experiments reported employed the noisy version of PCL, with probability constraints corresponding to the same anonymity requirements ktarget = 5, 10, 25, 50, 75, 100 used for MDAV, with a total of 5 initializations of the reconstruction values, one based on the results of MDAV, and the remaining 4 randomly chosen from the initial samples.

⁴ Matlab R2010b on a multicore Intel Xeon CPU at 2.67 GHz running Windows 7 64-bit.

[Fig. 18. MDAV and PCL microaggregations from the experiments represented in Fig. 17, corresponding to k = 4. PCL outperforms MDAV by a 26% distortion reduction, with the exact same k-anonymity requirement. The small dots depict n = 250 samples aggregated into |Q| = 62 cells of at least k = 4 elements (p0 = k/n = 0.016); the large dots represent the reconstruction values x̂1, …, x̂62. (a) MDAV, kmin = 4, pmin = 0.016, D ≈ 0.0359; edges represent cell assignments. (b) PCL, kmin = 4, pmin = 0.016 (0%), D ≈ 0.0266 (−26%), 300 iterations; additively weighted Voronoi cells are shown.]

Table 1
MDAV on “Census” (kmin = ktarget).

kmin    D        t [ms]
5       0.0909   120
10      0.142    61.2
25      0.214    25.7
50      0.290    13.4
75      0.350    9.44
100     0.397    7.58


Table 2
Performance of PCL for k-anonymous microaggregation of “Census”, with 1080 records and 13 attributes. The algorithm was run for target anonymities ktarget = 5, 10, 25, 50, 75, 100, attaining a minimum anonymity greater than or equal to that required, i.e., kmin ≥ ktarget. For each ktarget, 5 initializations of the reconstruction values of PCL were used, one based on the results of MDAV, and the rest randomly chosen from the initial samples. The PCL results reported correspond to the initialization leading to the minimum final distortion, which turned out to be random rather than based on MDAV, in all cases. In the most time-consuming cases, PCL was applied hierarchically, first to obtain prepartitions of size nPP, and once again on each individual prepartition, with i iterations. Times include prepartitioning and correction, when applicable.

kmin   ktarget   D        (DPCL − DMDAV)/DMDAV   t [m:s]   i     nPP
5      5         0.0796   −12.4%                 16:15     150   360
10     10        0.122    −13.7%                 08:53     130   540
25     25        0.192    −10.1%                 01:38     110   540
25     25        0.182    −14.9%                 05:36     110   –
51     50        0.247    −14.8%                 01:22     90    –
76     75        0.290    −17.3%                 00:37     70    –
107    100       0.331    −16.8%                 00:16     50    –
Even though, as we mentioned, distortion comparisons are against unpartitioned MDAV, the initialization of PCL based on MDAV reconstructions did of course require additionally computing MDAV on each prepartition. We also mentioned that “Census” was chosen partly due to its resistance against variable cell-size strategies. In any case, whenever ktarget was not a divisor of n, the actual size constraints imposed on PCL roughly shared the remaining points equally, in order to leave a small margin for cost adjustment, rather than exploiting the strategy of MDAV of concentrating large cells near the centroid of the dataset. For that reason, the minimum cell size kmin attained by PCL was occasionally slightly larger than the required target size ktarget, which means that the minimum privacy attained by the algorithm was actually slightly better than the one intended.

In the experiments carried out previously, we saw that, due to the differentiability assumption inherent in the cost-adjustment methods used, when PCL is applied to a finite set of data points the cell-size constraints may be attained only within a small margin of error; occasionally, cell constraints are met to within plus or minus a few points. In the experiments of this subsection, posterior reassignment of these points, taking into account simple considerations of centroid proximity, enabled us to satisfy the constraints perfectly, with numerically negligible impact on the distortion, in utterly negligible time. Of course, the distortions reported here take this small correction into account, and running times include every single process, that is, correction, in addition to prepartitioning, when required.

We used 3 prepartitions of size approximately nPP = 360 for ktarget = 5, and 2 prepartitions of approximate size nPP = 540 for ktarget = 10, 25. The case ktarget = 25 was repeated without prepartition, to quantify an example of the performance loss and the gain in running time due to prepartitioning. For the remaining values of ktarget, no prepartitioning was needed to reduce the running time. The prepartitioning process was carried out with the PCL algorithm itself, and repeated for each initialization, using 1500 points per cell and 150 iterations, in roughly 15 seconds on average for each reinitialization. In the postpartitioning process, PCL used 1000 points per cell. The number of iterations i ranged from 150 for the smallest ktarget, to 50 for the largest, and the total running time, including prepartitioning when applied, ranged from approximately 16 min to 16 s, for the smallest and largest ktarget, respectively. Running times were hardly dependent on the initialization.
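The correction step admits a simple greedy reading, sketched below (Python; the names and the specific greedy rule are our illustration of the "simple considerations of centroid proximity", not the exact procedure used in the experiments); it moves points out of oversized cells into deficient ones, one at a time, picking the cheapest reassignment in the centroid-proximity sense:

    import numpy as np

    def correct(x, q, centroids, sizes):
        # Greedy repair: while some cell exceeds its size constraint, move one
        # of its points to the deficient cell with the closest centroid.
        # Terminates as long as sum(sizes) >= len(x).
        sizes = np.asarray(sizes)
        q, counts = q.copy(), np.bincount(q, minlength=len(sizes))
        while np.any(counts > sizes):
            over = int(np.argmax(counts - sizes))       # most oversized cell
            under = np.flatnonzero(counts < sizes)      # deficient cells
            idx = np.flatnonzero(q == over)             # points in the oversized cell
            d2 = ((x[idx, None, :] - centroids[under]) ** 2).sum(-1)
            i, j = np.unravel_index(int(np.argmin(d2)), d2.shape)
            q[idx[i]] = under[j]                        # cheapest single reassignment
            counts[over] -= 1
            counts[under[j]] += 1
        return q

In our experiments this affected at most a few points per run, consistent with the sub-0.04% impact reported below.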

[Fig. 19. Performance comparison of PCL vs. MDAV for k-anonymous microaggregation of “Census”. The two clusters of points for ktarget = 25 correspond to the cases with prepartitioning (worse distortion) and without (better distortion).]

[Fig. 20. Running time of PCL on “Census”, including prepartitioning and correction when applicable. The two clusters of points for ktarget = 25 correspond to the cases with prepartitioning (faster) and without (slower). Exact times and prepartition sizes are reported in Table 2.]

The experimental results with PCL are summarized in Table 2, for the best out of the 5 initializations, which turned out to be random rather than based on MDAV. Only the experiments for ktarget = 25 without prepartition required use of the correction postprocess previously described, which operated on a single defective cell out of 43, with hardly any impact on the distortion (<0.04%) or the running time (<0.04%). The distortions of MDAV and PCL are plotted in Fig. 19, and the running times of PCL, in Fig. 20.

The distortion improvements ranged approximately between 9% and 17% over MDAV, less pronounced than for the microaggregation experiments on uniform and Gaussian statistics in Section 5.2. This is likely linked to two facts. First, “Census” has a particularly skewed geometry, and lower gains on it seem a natural extrapolation of the empirical observation, in the previous subsection, that PCL performed better for uniform statistics than for Gaussian statistics. Secondly, the prepartitioning process used to limit the total computation cost of PCL might also impose a limit on the performance gain with respect to the unpartitioned MDAV. The end result remains that PCL outperforms MDAV for the exact same (or occasionally an even stricter) anonymity constraint, by a noticeable margin in distortion reduction, albeit at a noticeable price in running time, and that the distortion improvement depends on the distribution of data points. Just as we observed in our previous microaggregation experiments, the results seem to be more sensitive to the initialization than those for clustering. A practical implication is that, if we were to reinitialize PCL several times and pick the most favorable outcome, the overall process would gracefully buy performance at the cost of complexity, evidently with a saturation effect. In any case, all of these experiments, together with those for uniform and Gaussian statistics in the previous subsections, which encompass the cases of both microaggregation and clustering, leave PCL as an excellent candidate for the lowest distortion, that is, the highest data utility, for the same anonymity requirement, against the state-of-the-art MDAV.

[Fig. 21. Simple example of k-anonymous microaggregation of two-dimensional points. The small black dots represent n = 10 samples to be aggregated into |Q| = 2 cells of k = 5 samples each, thus p0 = k/n = 1/2. The reconstruction values x̂1 and x̂2, shown as large colored dots, are set arbitrarily.]

[Fig. 22. In the example corresponding to Fig. 21, we set c1 = 0 and depict p0 − pmin as a function of c2, leaving x̂1 and x̂2 fixed. Minimizing the objective function, that is, finding the value of c2 reducing the objective p0 − pmin to zero, is equivalent to satisfying the k-anonymity constraint p1 = p2 = p0 = 1/2 for the given x̂1 and x̂2.]

Even though our proof-of-concept implementation was not meant to investigate computational issues, as PCL was initially designed for offline aggregation, it is reasonable to infer from our empirical observations that the cost adjustment with the Levenberg–Marquardt algorithm was the computationally most expensive part of our quantizer design algorithm. Fortunately, the sparsity of the Jacobian involved in the linearization step makes the problem scale with the number of neighboring cells rather than with the total number of quantizer cells. In addition, the iterative nature of PCL seems suited to small data updates without the need for a complete recomputation.

6. Conclusion

In this work, we develop a multidisciplinary solution, with state-of-the-art performance in terms of data utility and privacy, which addresses the important problems of k-anonymous microaggregation and clustering, each illustrated with an application, respectively SDC and privacy for LBSs, but potentially extensible to numerous other scenarios. More precisely, our main contribution consists of two variations of a k-anonymous aggregation algorithm, which we call PCL, one of which is particularly suited to the important problem of k-anonymous microaggregation of databases. The other variation also exhibits excellent performance in the problem of clustering or macroaggregation, applicable to privacy in LBSs. This newly developed algorithm is a substantial, cost-weighted modification of the Lloyd algorithm, a celebrated quantization design algorithm, endowed with a numerical method to solve nonlinear systems of equations based on the Levenberg–Marquardt algorithm.

[Fig. 23. The small black dots represent n = 15 samples to be aggregated into |Q| = 3 cells of k = 5 samples each, thus p0 = k/n = 1/3. The reconstruction values x̂1, x̂2 and x̂3, shown as large colored dots, are set arbitrarily.]

[Fig. 24. In the example introduced with Fig. 23, we set c1 = 0 and depict p0 − pmin as a function of c2 and c3, leaving the reconstruction values fixed. Minimizing the objective function, that is, finding the values of c2 and c3 reducing the objective p0 − pmin to zero, is equivalent to satisfying the k-anonymity constraint p1 = p2 = p3 = p0 = 1/3 for the given reconstruction values.]

[Fig. 25. Simple example with n = 2 samples, represented as small dots, and k = 1, thus p0 = k/n = 1/2. The reconstruction values x̂1 and x̂2, shown as large dots, are set arbitrarily and left fixed. As the algorithm is iterated (panels show 15, 20, 24 and 32 iterations), the variance of the noise around the two samples is reduced.]

[Fig. 26. In the example of Fig. 25, we set c1 = 0 and depict the objective function p0 − pmin to minimize as a function of c2, after 15, 20, 24 and 32 iterations. As the variance of the noise is reduced, p0 − pmin is sharpened.]

We illustrate the somewhat less common application of macroaggregation with a simple architecture for k-anonymous retrieval of location-based information. Essentially, we consider location-aware devices, commonly operative near a fixed reference location. We then regard accurate, fixed location data as a quasi-identifier, and rescue and merge the principles behind pseudonymization, location anonymization and the privacy criterion used in microdata k-anonymization. Precisely, accurate location information is collected by a trusted third party to create distortion-optimized, size-constrained clusters, where k nearby devices share a common centroid location.

We report experimental results regarding k-anonymous clustering and microaggregation, that is, large and small k, for Gaussian and uniform statistics, with MSE as distortion measure. The resulting quantization cells are observed to be convex polytopes, and, just as in the conventional Lloyd algorithm, the sequence of distortions is nonincreasing and the clustering configurations seem to converge rapidly to a low-distortion solution. While maintaining exactly the same k-anonymity constraints, our algorithm outperforms the state-of-the-art microaggregation algorithm MDAV by a significant reduction in distortion. The approximate distortion reduction was typically 20% and 15% for clustering of uniform and Gaussian data, respectively, and 27% and 15% for microaggregation of uniform and Gaussian data, respectively. The largest reductions observed were approximately 35%, 20%, 30% and 25% for the same data, also respectively. We then challenge PCL with a real, standardized dataset, called “Census”, over which MDAV exhibited unparalleled performance until now. PCL is still able to produce a noticeable distortion improvement of approximately up to 17% over MDAV, for the same exact k-anonymity constraint. In short, PCL yielded a better privacy–utility trade-off than MDAV for a variety of data distributions, and that improvement depended on the data.

Despite the experimental evidence that PCL outperforms MDAV in all cases considered, in terms of data utility for the same exact k-anonymity requirement, it is fair to stress that PCL is noticeably more complex than MDAV, both in terms of sophistication and running time. In particular, the running time of PCL is clearly nonlinear in the number of cells. Fortunately, the two scenarios of application that we contemplate in this paper will typically come with fairly lenient time constraints. Still, in order to gracefully trade off distortion for running time, we propose a hierarchical application of PCL, where the final postpartitioning requires a time linear in the number of macrocells resulting from the prepartitioning, while the number of partitioning levels is chosen to avoid severe time requirements for prepartitioning. The nature of PCL is such that it enables users to buy privacy–utility performance at the price of computation, not only because of the prepartitioning strategy, but also because it may be initialized a number of times, in different ways, and the best outcome picked.

A quantizer designed with our algorithm admits a compact representation, simply as a list of reconstruction values and costs, one per cell, rather than an arbitrary clustering of a large cloud of points.

[Fig. 27. Simple example with n = 15 samples, represented as small dots, and k = 5, thus p0 = k/n = 1/3. The reconstruction values x̂1, x̂2 and x̂3, shown as large dots, are set arbitrarily and left fixed. As the algorithm is iterated (panels show 4, 10, 17 and 32 iterations), the variance of the noise around the samples is reduced.]

This is particularly useful when a model of the data is given by means of a PDF, for which a probability-constrained quantizer is to be designed only once, but later applied repeatedly to dynamic sets of samples distributed according to the original model.

To sum up, our experiments on various statistics, for the cases of both microaggregation and clustering, demonstrate that PCL is an excellent candidate for lowest-distortion k-anonymous aggregation, for the same anonymity requirement, against the state-of-the-art MDAV. The sophistication of PCL, particularly the use of its derivative-based cost-adjustment mechanism, does come at the price of an increased computation time, which may preclude its use in the event that microaggregation over thousands of data samples must absolutely be carried out in seconds and hardware resources are severely scarce. Because all evidence points to PCL yielding better data utility than MDAV for the same privacy requirements, under milder computational constraints or in offline applications, our proposed algorithm should become the preferred choice for microaggregation and clustering. Under stricter computational constraints, PCL may still prove helpful in quantifying the loss in privacy–utility performance incurred by the use of faster algorithms such as MDAV, and aid in assessing whether that loss is acceptable.

The k-anonymous location clustering mechanism proposed in our work may be regarded more generally as a solution to a problem of minimum-distortion, probability-constrained quantization, which also addresses applications of similarity-based, workload-constrained resource allocation. Even though this paper focuses on MSE, our solution is suitable for an entirely generic distortion measure, possibly over categorical alphabets, for example semantic distances.

Acknowledgment

We would like to thank Javier Parra-Arnau, the editor and the three anonymous referees for their thorough, extremely valuable comments, which motivated major improvements on this manuscript. This work was partly supported by the Spanish Government through projects CONSOLIDER INGENIO 2010 CSD2007-00004 “ARES”, TEC2010-20572-C02-02 “CONSEQUENCE” and TEC-2008-06663-C03-01 “P2PSec”, and by the Government of Catalonia under grant 2009 SGR 1362.

[Fig. 28. In the example of Fig. 23, we set c1 = 0 and depict the objective function p0 − pmin to minimize as a function of c2 and c3, after 4, 10, 17 and 32 iterations. As the variance of the noise is reduced, p0 − pmin is sharpened.]

Appendix A. Rationale behind the noisy PCL algorithm

This section is devoted to an overview of the rationale behind the noisy variation of the PCL algorithm proposed in Section 3.4, empirically proven to be effective for small k in Section 5. We emphasized that the key problem of the noiseless version of PCL was the use of derivative-based numerical methods for the computation of the costs c(q) constraining the probabilities pQ(q). For brevity, throughout this section we rewrite c(q) as cq, pQ(q) as pq, and x̂(q) as x̂q.

In the following, assume that we are given specific data samples x1, …, xn and reconstruction values x̂q, and that we are left with the problem of adjusting the costs cq in order to satisfy certain equality constraints on pq. Because adding a common constant to all costs does not change the cells determined by the modified nearest-neighbor condition (4) of Section 3.2, we may assume without loss of generality that c1 = 0.

Consider first the example depicted in Fig. 21, in which n = 10 samples are to be aggregated into |Q| = 2 cells of k = 5 samples each, and thus a common probability constraint p0 = k/n = 1/2 is enforced. Define pmin = minq pq. Fig. 22 plots the objective function p0 − pmin as a function of c2. Observe that minimizing this objective, that is, finding the value of c2 reducing the objective p0 − pmin to zero, is equivalent to satisfying the k-anonymity constraint p1 = p2 = p0 = 1/2 for the given x̂1 and x̂2. As Fig. 22 illustrates, the problem lies in the fact that the objective is piecewise constant, and that the range of solutions may be at the bottom of a fairly narrow well. Consequently, estimates of local derivatives may not provide an efficient strategy to find a solution for c2. An analogous example is represented in Figs. 23 and 24, this time for n = 15 samples to be aggregated into |Q| = 3 cells of k = 5 samples each, with p0 = k/n = 1/3. Observe that the space of solutions for c2 and c3 is fairly narrow, at the bottom of a piecewise constant surface, which numerical methods based only on local derivatives could not possibly find easily.

Let us look now into the effect of introducing noisy samples into the cost adjustment problem. We follow the noisy PCL algorithm described in Section 3.4, but leave the reconstruction values fixed. We shall observe that the objective function to minimize gradually transforms, starting as a smooth function, and then sharpening into the usual piecewise constant objective, as the variance of the noise is reduced. This effect is shown in Figs. 25 and 26 for a fairly trivial albeit insightful example with n = 2 samples and k = 1, thus p0 = k/n = 1/2, and once more in Figs. 27 and 28, for n = 15 samples and k = 5, thus p0 = k/n = 1/3.
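The following sketch makes this objective concrete (Python; function names and numeric values are our illustrative assumptions, with Eq. (4) taken in its additively cost-weighted form): it evaluates the empirical cell probabilities over noise-augmented samples and sweeps c2 with c1 = 0, qualitatively reproducing Figs. 22 and 26, smooth for large noise and staircase-like as the noise vanishes:

    import numpy as np

    def cell_probabilities(y, centroids, costs):
        # Empirical p_q under the cost-weighted nearest-neighbor rule (4)
        d2 = ((y[:, None, :] - centroids[None]) ** 2).sum(-1)
        q = np.argmin(d2 + costs, axis=1)
        return np.bincount(q, minlength=len(costs)) / len(y)

    rng = np.random.default_rng(1)
    x = rng.random((10, 2))                  # n = 10 samples, as in Fig. 21
    centroids = x[:2].copy()                 # two arbitrary reconstruction values
    p0, sigma, m = 0.5, 0.2, 2000            # constraint, noise deviation, copies
    y = (x[:, None, :] + sigma * rng.standard_normal((10, m, 2))).reshape(-1, 2)
    for c2 in np.linspace(-1.0, 1.0, 9):     # sweep c2, keeping c1 = 0
        p = cell_probabilities(y, centroids, np.array([0.0, c2]))
        print(round(c2, 2), p0 - p.min())    # the objective p0 - pmin of Fig. 22

On this smoothed objective, finite-difference derivatives become informative again, which is what allows the Levenberg–Marquardt cost adjustment of Section 3.3 to make progress before the variance is cooled.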


Even though we proposed in Section 3.4 to accompany the variance reduction process with the updating of the reconstruction values, we could have just as well repeated the variance cooling as an inner iteration for each update of the reconstruction values, with similar results. References [1] D. Rebollo-Monedero, J. Forné, M. Soriano, Private location-based information retrieval via k-anonymous clustering, Proc. CNIT Int. Workshop Digit. Commun., ser. Lecture Notes Comput. Sci. (LNCS), Springer-Verlag, Sardinia, Italy, Sep. 2009 invited paper. [2] S.P. Lloyd, Least squares quantization in PCM, IEEE Trans. Inform. Theory IT-28 (Mar. 1982) 129–137. [3] J. Max, Quantizing for minimum distortion, IEEE Trans. Inform. Theory 6 (1) (Mar. 1960) 7–12. [4] D. Marquardt, An algorithm for least-squares estimation of nonlinear parameters, SIAM J. Appl. Math. (SIAP) 11 (1963) 431–441. [5] P. Samarati, L. Sweeney, Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression, SRI Int., Tech. Rep. (1998). [6] P. Samarati, Protecting respondents' identities in microdata release, IEEE Trans. Knowl. Data Eng. 13 (6) (2001) 1010–1027. [7] L. Sweeney, k-Anonymity: a model for protecting privacy, Int. J. Uncertain., Fuzz., Knowl.-Based Syst. 10 (5) (2002) 557–570. [8] D. Defays, P. Nanopoulos, Panels of enterprises and confidentiality: the small aggregates method, Proc. Symp. Design, Anal. Longitudinal Surveys, Stat. Canada, Ottawa, Canada, 1993, pp. 195–204. [9] J. Domingo-Ferrer, J.M. Mateo-Sanz, Practical data-oriented microaggregation for statistical disclosure control, IEEE Trans. Knowl. Data Eng. 14 (1) (2002) 189–201. [10] J. Domingo-Ferrer, V. Torra, Ordinal, continuous and heterogenerous k-anonymity through microaggregation, Data Min., Knowl. Disc. 11 (2) (2005) 195–212. [11] J. Domingo-Ferrer, F. Sebé, A. Solanas, A polynomial-time approximation to optimal multivariate microaggregation, Elsevier Comput., Math. with Appl. 55 (4) (Feb. 2008) 714–732. [12] A. Oganian, J. Domingo-Ferrer, On the complexity of optimal microaggregation for statistical disclosure control, UNECE Stat. J. 18 (4) (Apr. 2001) 345–354. [13] J. Domingo-Ferrer, A. Martínez-Ballesté, J.M. Mateo-Sanz, F. Sebé, Efficient multivariate data-oriented microaggregation, VLDB J. 15 (4) (2006) 355–369. [14] A. Hundepool, R. Ramaswamy, P.-P. DeWolf, L. Franconi, R. Brand, J. Domingo-Ferrer, μ-ARGUS version 4.1 software and user's manual, Voorburg, Netherlands, 2007. [Online]. Available: http://neon.vb.cbs.nl/casc. [15] M. Templ, Statistical disclosure control for microdata using the R-package sdcMicro, Trans. Data Privacy 1 (2) (2008) 67–85. [Online]. Available: http://cran.r-project.org/web/packages/sdcMicro. [16] M. Laszlo, S. Mukherjee, Minimum spanning tree partitioning algorithm for microaggregation, IEEE Trans. Knowl. Data Eng. 17 (7) (Jul. 2005) 902–911. [17] A. Solanas, A. Martínez-Ballesté, J. Domingo-Ferrer, VMDAV: a multivariate microaggregation with variable group size, Proc. Comput. Stat. (COMPSTAT), Springer-Verlag, Rome, Italy, 2006. [18] C. Chin-chen, L. Yu-chiang, H. Wen-huang, TFRP: an efficient microaggregation algorithm for statistical disclosure control, Elsevier J. Syst., Softw. 80 (11) (Nov. 2007) 1866–1878. [19] J. Nin, J. Herranz, V. Torra, On the disclosure risk of multivariate microaggregation, Elsevier Data, Knowl. Eng. 67 (3) (2008) 399–412. [20] T.M. Truta, B. Vinay, Privacy protection: p-sensitive k-anonymity property, Proc. Int. 

David Rebollo-Monedero received the M.S. and Ph.D. degrees in electrical engineering from Stanford University, California, USA, in 2003 and 2007, respectively. His doctoral research at Stanford focused on data compression, specifically quantization and transforms for distributed source coding. Previously, he was an information technology consultant for PricewaterhouseCoopers in Barcelona, Spain, from 1997 to 2000, and was involved in the Retevisión startup venture. During the summer of 2003, while still a Ph.D. student at Stanford, he worked for Apple Computer on the QuickTime video codec team. He is currently a postdoctoral researcher with the Information Security Group of the Department of Telematics Engineering at the Universitat Politècnica de Catalunya (UPC), also in Barcelona, where he investigates the application of data compression formalisms to privacy in information systems.

Jordi Forné received the M.S. degree in telecommunications engineering from the Universitat Politècnica de Catalunya (UPC) in 1992, and the Ph.D. degree in 1997. In 1991, he joined the Cryptography and Network Security Group in the Department of Applied Mathematics and Telematics. He is currently an associate professor at the Telecommunications Engineering School of Barcelona (ETSETB) and works with the Information Security Group, both affiliated with the Department of Telematics Engineering of UPC in Barcelona. He is the coordinator of the Ph.D. program in Telematics Engineering (which holds a Spanish Quality Mention) and the director of the research master's program in Telematics Engineering. His research interests span a number of subfields within information security and privacy, including network security, electronic commerce, and public-key infrastructures. He has served on the program committees of a number of security conferences, and he is an editor of Computer Standards & Interfaces (Elsevier).

Miguel Soriano received the M.S. degree in telecommunications engineering from the Universitat Politècnica de Catalunya (UPC) in 1992, and the Ph.D. degree in 1996. In 1991, he joined the Cryptography and Network Security Group in the Department of Applied Mathematics and Telematics. He is currently a professor at the Telecommunications Engineering School of Barcelona (ETSETB) and leads the Information Security Group, both affiliated with the Department of Telematics Engineering of UPC in Barcelona. His research interests encompass network security, electronic commerce, and information hiding for copyright protection. He has served on the program committees of a number of security conferences, and he is an editor of the International Journal of Information Security (Springer-Verlag). He is also an associate researcher at the Centre Tecnològic de Telecomunicacions de Catalunya (CTTC).
