On the (In)Security and (Im)Practicality of Outsourcing Precise Association Rule Mining

Ian Molloy, Ninghui Li, and Tiancheng Li
Center for Education and Research in Information Assurance and Security and Department of Computer Science, Purdue University, West Lafayette, Indiana, USA
{imolloy, ninghui, li83}@cs.purdue.edu

Abstract—The recent interest in outsourcing IT services onto the cloud raises two main concerns: security and cost. One task that could be outsourced is data mining. In VLDB 2007, Wong et al. propose an approach for outsourcing association rule mining [1]. Their approach maps a set of real items into a set of pseudo items, then maps each transaction nondeterministically. This paper analyzes both the security and the costs associated with outsourcing association rule mining. We show how to break the encoding scheme from [1] without using context-specific information, reducing its security to that of a one-to-one mapping. We present a stricter notion of security than the one used in [1], and then consider the practicality of outsourcing association rule mining. Our results indicate that outsourcing association rule mining may not be practical if the data owner is concerned with data confidentiality.

Keywords—association rule mining, outsourcing, security

I. INTRODUCTION

The problem of outsourcing data mining tasks to a third-party service provider has been studied in a number of recent papers [1]–[3]. While outsourcing data mining has the potential of reducing the computation and software costs for data owners, it is important that private information about the data is not disclosed to the service providers. The raw data and the mining results can both contain business intelligence of the organization and private information about its customers, and both require protection from the service provider. Unfortunately, the current understanding of the potential privacy threats in outsourcing data mining, and of the needed privacy protection, is still quite primitive.

In [1], Wong et al. proposed an approach for outsourcing association rule mining. In their model, the data owner first encodes the transactional database before sending it to the service provider. The service provider finds the frequent itemsets and their support counts in the encoded database, then sends the information back to the data owner. The data owner then decodes the results to get the correct support counts of frequent itemsets in the original database. One naïve encoding approach is to replace each item in the original data with a randomly generated pseudo-identifier, but this is subject to frequency analysis attacks [4]–[6].

To defend against this, Wong et al. propose an encoding algorithm (which we call the WCH+ algorithm) that supplements the naïve approach with additional, random, dummy items. Wong et al. claim that "[the proposed] technique is highly secure with a low data transformation cost," and include a proof of security [1]. We present here an attack that breaks the WCH+ encoding and reduces it to the naïve approach with a one-to-one mapping, allowing standard frequency analysis to be applied [4]–[6]. We go on to show that perfect secrecy is achievable but prohibitively expensive. As an alternative, we introduce more natural and practical notions of security aimed at preventing frequency analysis attacks, and give a more secure encoding. There exists a tradeoff between security and efficiency, and when the security cost reaches a certain point it is cheaper to perform the association rule mining oneself. Hence, we analyze how encoding impacts the costs associated with outsourcing. Our work makes the following contributions:
• We present a frequency-analysis-based attack that breaks a state-of-the-art algorithm for outsourcing association rule mining. Knowledge of this attack can be used to develop more secure schemes and may be more widely applicable.
• We show the security approach in [1] to be insufficient to withstand known-frequency attacks, and propose alternatives.
• With these alternatives, we go on to evaluate the costs associated with outsourcing association rule mining. We find that if one is not concerned about frequency attacks, the naïve approach is sufficient. Conversely, efficiently outsourcing while securing against frequency attacks may be impossible.
The remainder of this paper is organized as follows. In Section II, we present Wong et al.'s encoding scheme. In Section III we present our attack. Section IV presents several security properties and analyzes several adversarial models. We present an alternative encoding in Section V, and consider the practicality of outsourcing in general. Finally, we review related work in Section VI and Section VII concludes.

II. BACKGROUND

This section provides the data-transformation framework definitions for outsourcing frequent itemset mining. We then describe the encoding algorithm proposed in [1].

A. A Data Transformation Framework

In the data transformation framework, outsourcing works as follows. The data owner has a transaction database T, where each tuple represents a transaction. In an encoding step, the data owner computes an encoded version of T denoted by W. The data owner then provides W and a threshold θ to the service provider, who computes all itemsets in W that have support at least θ, and returns these itemsets and their support to the data owner. In a decoding step, the data owner computes the support of itemsets in T. Finally, using the result of the decoding step, the data owner computes the association rules and their support and confidence.

Let I be the set of items that can appear in the input table T. A transaction t is a subset of I, i.e., t ∈ 2^I, and T is a sequence of transactions, i.e., T ∈ (2^I)*. Let Σ be the set of items that can appear in an encoded table; then W ∈ (2^Σ)*.

Definition 1: A data transformation algorithm E takes as input a table T ∈ (2^I)*, and outputs ⟨W, D⟩, where W ∈ (2^Σ)* is the encoded table, and D : 2^Σ → 2^I is the decoding mapping. A data transformation algorithm E : (2^I)* → ⟨W, D⟩ is said to be sound if and only if the following two conditions hold:
1) ∀x ∈ 2^I \ {∅} ∀y ∈ 2^Σ \ {∅} [(D(y) = x) ⇒ supp_T(x) = supp_W(y)].
2) ∀x ∈ 2^I \ {∅} ∃y ∈ 2^Σ \ {∅} [D(y) = x].
where supp_T(x) is the support of itemset x in table T.

Condition 1 above requires that if the itemset y decodes into a non-empty itemset x (i.e., D(y) = x), then y's support in W must equal x's support in T. Intuitively, this means that y corresponds to x in the original data. An itemset y ⊆ Σ may not correspond to any itemset in the original set, in which case we should have D(y) = ∅. Upon receiving an itemset y and its support from the service provider, the data owner discards it if D(y) = ∅ and otherwise records the support of D(y). These conditions ensure that we can follow the outsourcing process described earlier to find all itemsets with frequency at least θ. For any itemset x that occurs at least θ times in T, Condition 2 requires that there must exist y that decodes into x, and Condition 1 requires y to appear in W with the same frequency. Hence y will be found by frequent itemset mining in W, and be returned with its support.

A naïve encoding approach is to replace each item with a pseudo-identifier. This, however, is insecure and vulnerable to frequency analysis [4]–[6]. An adversary may obtain frequency information about frequent items and itemsets from other sources and use that information to attack the encoded database.

For example, an adversary may know that an item i is the most frequent item, and that when i occurs, j is also highly likely to occur. Knowing this information, the attacker can find out which pseudo-identifiers correspond to i and j and recover their exact frequencies. To defeat frequency analysis attacks, Wong et al. [1] introduced a more sophisticated encoding approach.

B. The WCH+ Encoding Algorithm

We call the encoding algorithm in [1] the WCH+ encoding algorithm. In this algorithm, items in I are called original items. Σ, the set of items used in the encoded table, consists of three disjoint sets: U, the set of unique items; C, the set of common items; and F, the set of fake items. That is, Σ = U ∪ C ∪ F. The items in U correspond one-to-one with the items in I, that is, |U| = |I|. They can be viewed as replacing each original item with a pseudo-identifier so that the item name itself does not tell which item it is. Items in C and F are used to defeat frequency analysis attacks. The sizes of C and F are parameters to the algorithm. The algorithm consists of the following two steps.

Step 1: Construct an item-level mapping. In the first step, the algorithm generates an item mapping m : I → 2^(U∪C), which maps each original item i ∈ I to a set of items in U ∪ C. Items in U appear in the image of exactly one item: let u : I → U be a random bijection between I and U; for each i ∈ I, m(i) contains u(i). For each item c ∈ C, the algorithm randomly picks b items j ∈ I and adds c to m(j), where b has an expected mean of N_B. The number N_B is an input parameter to the algorithm. The decoding mapping D is uniquely determined by m. Given an itemset σ ⊆ Σ, D(σ) = {i ∈ I | m(i) ⊆ σ}.

Step 2: Construct a transaction-level mapping. Here the algorithm processes the transactions in T one by one. For each transaction t ⊆ I, it performs three steps:
(a) Compute M(t) = ∪_{i∈t} m(i).
(b) Compute N(t) = M(t) ∪ E(t), where E(t) is a subset of U ∪ C. For each item j ∉ t, E(t) may include items in m(j) as long as m(j) ⊈ N(t), which is sufficient to ensure that D(N(t)) = t.
(c) Compute the final transformation R(t) = N(t) ∪ s_f, where s_f is a random subset of F of a random size with a mean of N_F.

To recap, the algorithm uses the following three steps to defend against frequency analysis attacks: (1) each original item is mapped to a set of pseudo items, including one unique item and zero or more common items; (2) in each transaction additional unique and common items are added, while ensuring that one does not include all items in the mapping of an original item not in the transaction; (3) in each transaction additional fake items are added.
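For concreteness, the following Python sketch illustrates one possible reading of the two encoding steps and the induced decoding D. The parameter names (n_common, n_b, n_e, n_f), the labeling of encoded items, and the exact randomization are our assumptions, not the implementation of [1].

```python
import random

def build_mapping(items, n_common, n_b):
    """Step 1: each item gets one unique item; each common item joins ~n_b mappings."""
    labels = [('u', k) for k in range(len(items))]
    random.shuffle(labels)                                   # random bijection u(i)
    m = {i: {lab} for i, lab in zip(items, labels)}
    for k in range(n_common):
        size = min(len(items), max(1, round(random.gauss(n_b, 1))))
        for j in random.sample(items, size):
            m[j].add(('c', k))
    return m

def decode(m, sigma):
    """D(sigma) = { i | m(i) is a subset of sigma }."""
    sigma = set(sigma)
    return {i for i, img in m.items() if img <= sigma}

def encode_transaction(t, m, items, fakes, n_e, n_f):
    """Step 2: M(t), then extra items E(t) that keep decoding sound, then fake items."""
    encoded = set()
    for i in t:                                              # (a) M(t)
        encoded |= m[i]
    absent = [j for j in items if j not in t]
    random.shuffle(absent)
    for j in absent[:n_e]:                                   # (b) add part of m(j), j not in t,
        extra = set(list(m[j])[:len(m[j]) - 1])              #     but never all of m(j)
        if all(not (m[k] <= encoded | extra) for k in absent):
            encoded |= extra
    s_f = {f for f in fakes if random.random() < n_f / max(1, len(fakes))}  # (c) fake items
    return encoded | s_f
```

With m = build_mapping(items, n_common, n_b), a transaction t would be published as encode_transaction(t, m, items, fakes, n_e, n_f), and the owner recovers t as decode(m, R(t)); the soundness guard in step (b) is what prevents a one-to-one decoding from existing.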

III. ATTACK

We now present an attack that breaks the security of the WCH+ encoding algorithm.

A. Summary of Our Attack

Before presenting how our attack works, let us first examine what a successful attack against the WCH+ algorithm needs to do. Recall that the WCH+ algorithm is introduced as an enhancement to the naïve algorithm of replacing each item with a pseudo-identifier, which is vulnerable to frequency analysis attacks [7], [8]. Hence the goal of a successful attack is to reduce the security of tables encoded using the WCH+ algorithm to the same level as those encoded using the naïve one-to-one mapping approach, at which point frequency analysis attacks may be applied.

Given a table W encoded from the original table T using the mapping m, there is a bijection g : T → W such that for every transaction t ∈ T, R(t) is the encoded transaction of t. Our attack succeeds if it can identify all images of the mapping m. That is, if we can find Γ = {m(i) | i ∈ I}, then we know that each γ ∈ Γ corresponds to an original item, and we can apply frequency analysis attacks just as in the case of the naïve algorithm. Furthermore, while finding Γ = {m(i) | i ∈ I} is sufficient, it is not necessary. For example, even if an item a ∉ m(i), it may be fine to include a in the itemset corresponding to i, provided that a occurs in every transaction in which m(i) occurs. More precisely, our attack succeeds when it is able to find a set of correct mappings, defined as follows.

Definition 2: Given W, an encoded table of T under the bijection g : T → W, we say that Y is a correct mapping of the original item i when for each transaction t ∈ T, i ∈ t ⇔ Y ⊂ R(t).

It is sufficient for the attack to find a set of correct mappings for original items in I. As frequency analysis is most effective with items of high frequency, it is critical to identify the correct mappings of high-frequency original items. We illustrate the effectiveness of our attack on high-frequency items in Section III-F. Our attack analyzes the frequencies of single items as well as pairs of items in the encoded database and relies only on knowledge of the encoding algorithm, not on its security parameters or on the frequencies of any itemsets in the original database. Given the encoded database, our attack works as follows. First, we identify and remove fake items from the encoded table. Then, we identify pairs of associated items that occur in correct mappings. Finally, we recover the itemsets that are correct mappings. We now explain our attack in detail.

B. Example

To illustrate how our attack works, we introduce a small running example. We use the IBM data generator [9] to create a dataset with thirty items and 10,000 transactions.

Table I. The mappings m(·) in the example.

 i   m(i)                supp_T(i)     i   m(i)                supp_T(i)
 0   {0}                 2573         15   {15, 34}            3375
 1   {1, 34}             3118         16   {16, 38, 39}        3207
 2   {2}                 2917         17   {17, 33}            3058
 3   {3}                 3884         18   {18}                1980
 4   {4}                 3886         19   {19}                2950
 5   {5}                 3378         20   {20, 30, 33, 39}    1236
 6   {6, 32}             3608         21   {21, 36}            2396
 7   {7}                 3364         22   {22}                4594
 8   {8, 30, 38}         4339         23   {23, 31}            3774
 9   {9, 37}             2847         24   {24}                3759
10   {10}                2919         25   {25}                2513
11   {11, 35}            2599         26   {26, 37}            2594
12   {12, 32, 36}        2716         27   {27}                2156
13   {13}                4483         28   {28, 35}            3342
14   {14}                4176         29   {29}                2012

We encode it using the scheme from Wong et al. [1] with ten common items and two fake items. Each common item is added to an average of three mappings (N_B = 3), on average two extra items are added to each transaction (N_E = 2), and on average one fake item is added to each transaction (N_F = 1). The generated mapping is shown in Table I; note that, without loss of generality, we assume u(i) = i. We illustrate the WCH+ encoding on a transaction from our dataset using these parameters. Consider the transaction
t = {2, 6, 11, 14, 16, 22, 29}.
The first step is to take the union of the item-wise mappings,
M(t) = ∪_{j∈t} m(j) = {2, 6, 11, 14, 16, 22, 29, 32, 35, 38, 39}.

In the next step we add E(t) = {8, 34} to the transaction,
N(t) = {2, 6, 8, 11, 14, 16, 22, 29, 32, 34, 35, 38, 39}.
Finally we add the fake item s_f = {41}, yielding
R(t) = {2, 6, 8, 11, 14, 16, 22, 29, 32, 34, 35, 38, 39, 41}.

C. Identifying and Removing Fake Items

In Step 2c, the WCH+ algorithm randomly generates a set s_f ⊆ F and adds s_f to obtain the final transformation R(t) = N(t) ∪ s_f. This approach of adding fake items has two weaknesses. The first weakness is that each fake item has the same probability of being added to each transaction, and thus appears with similar frequencies when the number of transactions is large. The second weakness is that fake items are added to transactions independently of the items already present. As a result, each fake item f is independent of all other items x. That is, for each item x, Pr[f ∧ x] = Pr[f] · Pr[x]. This second observation

holds even if the frequency of each fake item is different. To measure independence, we define the following metric.

Definition 3: The loading factor of a pair {x, y} ⊂ Σ in W, denoted by loading_W({x, y}), is the ratio of the number of times we observe the pair to the number of times we expect to observe the pair assuming they are independent:

loading_W({x, y}) = (|W| · supp_W({x, y})) / (supp_W({x}) · supp_W({y})).    (1)

When loading_W({x, y}) = 1, x and y are independent. When loading_W({x, y}) > 1, x and y are positively correlated. When loading_W({x, y}) < 1, x and y are negatively correlated. Hence 1 − loading_W({x, y}) measures the degree of independence, with smaller values meaning higher independence. To tell whether an item x is a fake item or not, we need to check whether it is independent of all other items; hence:

Definition 4: The independence factor set of an item x ∈ Σ in W, denoted by Ind_W(x), is defined as
Ind_W(x) = { 1 − loading_W({x, y}) | x ≠ y ∧ y ∈ Σ }.

To identify fake items, we use the following observation.

Observation 1: When x is a fake item, both the arithmetic mean and standard deviation of Ind_W(x) should be close to zero and smaller than those of unique or common items.

The effectiveness of this is illustrated in Figure 1. We add an additional eight fake items (for a total of ten), where the remaining items are given random support in [0, |T|]. We tested this method on a number of datasets with up to 1000 original items, 300 common items, 20 fake items, and 500K transactions with equal results.

Figure 1. Fake (•) versus non-fake (×) items: the standard deviation of Ind_W(x) plotted against the mean of Ind_W(x).
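A minimal sketch of the fake-item test follows, assuming the single-item and pair supports have already been counted in one pass over W; the eps threshold is our own illustrative choice, not a value from the paper.

```python
from statistics import mean, stdev

def loading(pair_supp, item_supp, n_trans, x, y):
    """Equation (1): observed vs. expected co-occurrence of x and y."""
    joint = pair_supp.get(frozenset((x, y)), 0)
    return n_trans * joint / (item_supp[x] * item_supp[y])

def likely_fake(item_supp, pair_supp, n_trans, eps=0.01):
    """Observation 1: flag items whose independence factors have near-zero mean and deviation."""
    fakes = []
    for x in item_supp:
        ind = [1 - loading(pair_supp, item_supp, n_trans, x, y)
               for y in item_supp if y != x]
        if abs(mean(ind)) < eps and stdev(ind) < eps:
            fakes.append(x)
    return fakes
```

Here item_supp maps each encoded item to its support and pair_supp maps frozenset pairs to their joint support; items flagged by likely_fake would be removed before the association step.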

D. Identify Associations

The goal of the next step is to identify associations between pairs of items. There may be many candidate mappings m′(·) that allow an adversary to recover the original data. We try to find a suitable candidate by identifying true associations between pairs of items.

Definition 5: We say that y ∈ Σ is truly associated with x ∈ Σ if and only if there exists an original item i such that {x, y} ⊆ Y and Y is a correct mapping for i.

We use two metrics to tell when an association x ⇒ y is true: loading_W({x, y}), defined above, and the following.

Definition 6: The association confidence of an association x ⇒ y in W, denoted by conf_W(x ⇒ y), is defined as
conf_W(x ⇒ y) = supp_W({x, y}) / supp_W({x}).    (2)

Our attack relies on the following observation.

Observation 2: A true association x ⇒ y is very likely to have high association confidence and high loading factor, and any other pair is unlikely to have both high association confidence and high loading factor.

We now explain the rationale underlying this observation. Consider a true association x ⇒ y; the pair must appear in m(i), where i is some original item. When x occurs in an encoded transaction, there are two cases. Case one is that i occurs in the original transaction. Here, y must also occur, contributing to the association confidence. Case two is that x is added during Step 2b. For a number of reasons, such situations will not occur frequently. First, as a unique item, x is unlikely to be added frequently in Step 2b. This is because common items in m(i) may appear in many mappings, thus appearing more frequently, reducing the probability of x being added, because adding x may result in incorrect decodings. A true association x ⇒ y may have a low association confidence when {x, y} ⊆ m(i) and i appears rarely in the table. In most situations, missing these associations is acceptable because frequency analysis, the step after our attack, is most effective against frequent items. We point out that when x ⇒ y is a true association, conf_W(y ⇒ x) may not be high, because y may appear in many mappings, and hence may appear often without x.

Some pairs that are not true associations may still have high association confidence. Such an association x ⇒ y is due to y having a high frequency. These pairs can be differentiated from the true associations by examining the loading factor. When x ⇒ y is a true association, many of the joint occurrences are due to the fact that they are in the same mapping, hence x and y will have a high loading factor. On the other hand, when the high association confidence of x ⇒ y is simply due to the high frequency of y, they will have a loading factor close to 1. Based on Observation 2, by setting thresholds on the association confidence and the loading factor, one can identify the true associations. We combine these two metrics to create a one-dimensional ordering over all candidate sets as

μ_W({x, y}) = √( loading_W({x, y}) · conf_W(x ⇒ y) ).    (3)

Note that this captures the intuition that both the association confidence and the loading factor must be high. By selecting an appropriate threshold, the adversary can recover the correct associations and minimize the number of false positives and false negatives.
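Continuing the sketch above, candidate associations can be scored with Equation (3); the 0.95 threshold mirrors the figure captions, and the support dictionaries are assumed to come from the same counting pass.

```python
from math import sqrt

def mu(pair_supp, item_supp, n_trans, x, y):
    """Equation (3): geometric mean of loading factor and association confidence."""
    joint = pair_supp.get(frozenset((x, y)), 0)
    load = n_trans * joint / (item_supp[x] * item_supp[y])
    conf = joint / item_supp[x]
    return sqrt(load * conf)

def association_graph(item_supp, pair_supp, n_trans, threshold=0.95):
    """Edges of the mapping association graph: ordered pairs with mu above the threshold."""
    items = list(item_supp)
    return {(x, y) for x in items for y in items
            if x != y and mu(pair_supp, item_supp, n_trans, x, y) > threshold}
```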

The effectiveness of this one-dimensional metric is shown in Figure 2.

Figure 2. Confidence versus loading for the candidate associations (panels: 8 ⇒ y_i, 20 ⇒ y_i, 27 ⇒ y_i, x_i ⇒ 30, x_i ⇒ 32, and x_i ⇒ 35). An item in a gray circle is a correct association. The dashed line is the separation for μ_W({·, ·}) > 0.95.

Each antecedent i thus defines a candidate set of consequence items that may compose m(·). If desired, one may define two candidate sets, one for antecedents and one for consequences, and take their intersection. In practice, this is not required. The result is what we call the mapping association graph. The mapping association graph is a graph G = (V, E) where V = Σ and (u, v) ∈ E, with u, v ∈ Σ, if μ_W({u, v}) is greater than some threshold. The graph for our running example is shown in Figure 3. It produces three false positives and zero false negatives. The final step is to identify subgraphs of the mapping association graph that represent the correct mappings.

E. Finding Correct Mappings

At this stage of the attack, we have a set of associations that we believe belong to some mapping. The final stage is to identify all associations to recreate the mappings m(·) before frequency analysis. The key challenge is to identify which items are unique items and which items are common items. Once we are able to do that, we can recover all mappings. We use the following observation.

Observation 3: Unique items can only appear in one subgraph by definition, defining boundaries between mappings, and two unique items are unlikely to be associated together.

The subgraphs can thus be identified by two-coloring the mapping association graph. We two-color the association graph to obtain the final mappings; each mapping is a unique item together with the adjacent common items. The difficulty is determining which of the two possible two-colorings is correct. Consider the association graph in Figure 3. In some instances, such as the subgraph ⟨23, 31⟩, either two-coloring is correct (item 31 may also be considered unique). Otherwise, we must select the correct two-coloring. This may be deferred until frequency analysis, or we may select the most probable mapping by choosing the most anomalous two-coloring, as follows.

Consider the subgraph containing items ⟨6, 12, 21, 32, 36⟩; the possible two-colorings are that {6, 12, 21} are unique, or that {32, 36} are unique. If 12 is unique, there exists a mapping with three items, m(i) = {12, 32, 36}; otherwise there exist mappings of size three m(i) = {6, 12, 32} and m(j) = {12, 21, 36}. We extend the loading factor to sets of items by assuming a set of items is composed of two independent disjoint sets:

loading_W(S) = min_{S1 ⊂ S, S2 = S \ S1} (|W| · supp_W(S)) / (supp_W(S1) · supp_W(S2)).    (4)

This is the most natural extension of the loading defined in Equation 1 because it maintains a similar scale. In the first possible two-coloring, loading_W({12, 32, 36}) = 1.702, while in the other two-coloring we have loading_W({6, 12, 32}) = 1.146 and loading_W({12, 21, 36}) = 0.607. We select the first instance as the most positively anomalous, and from Table I we can verify that this is correct. The result of two-coloring the remainder of the graph is shown in Figure 3. Each mapping m(·) may be defined as a unique item and each adjacent common item separated by at most one edge, treating the graph as undirected. For example, {20, 30, 33, 39}, {8, 30, 38}, or {12, 32, 36}.

Figure 3. Association graph for μ > 0.95; unique items are colored gray using Observation 3.
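The choice between the two colorings can be sketched as below, assuming a helper supp(S) that counts how many encoded transactions contain the set S; the split over halves and the "most anomalous" rule follow Equation (4) and the example above.

```python
from itertools import combinations

def set_loading(supp, n_trans, s):
    """Equation (4): minimum loading over all two-way splits of the set s."""
    s = frozenset(s)
    if len(s) < 2:
        return 1.0                      # a singleton has no split; treat as neutral
    best = float('inf')
    for r in range(1, len(s)):
        for s1 in combinations(s, r):
            s2 = s - frozenset(s1)
            best = min(best, n_trans * supp(s) / (supp(frozenset(s1)) * supp(s2)))
    return best

def pick_coloring(supp, n_trans, colorings):
    """Choose the coloring whose induced mappings contain the most anomalous (largest) loading."""
    def score(mappings):
        return max(set_loading(supp, n_trans, m) for m in mappings)
    return max(colorings, key=score)
```

For the subgraph ⟨6, 12, 21, 32, 36⟩, the two candidate colorings would induce the mappings [{6, 32}, {12, 32, 36}, {21, 36}] versus [{6, 12, 32}, {12, 21, 36}], matching the comparison in the text.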

F. Evaluation

In this section, we evaluate the effectiveness of our attack at identifying and isolating the true associations and allowing an adversary to recover the original database. We generate several datasets, encode them without fake items (because they are easily identified), and select all pairs of candidate sets such that μ_W({·, ·}) > 1.0. Next, we make the simplifying assumption that we can identify each unique item, and define the candidate mappings m′(·) as a unique item and its candidate set. For example, let i ∈ I and x be the unique item in m(i). Then
m′(i) = {x} ∪ { y | μ_W({x, y}) > 1.0 ∨ μ_W({y, x}) > 1.0 }.
Next we find all transactions W[m(i)] that contain m(i) and all transactions W[m′(i)] that contain m′(i). If W[m(i)] ≠ W[m′(i)], we consider the decoding to contain an error and the item i decodes incorrectly.

In [1], Wong et al. report the decryption accuracy using recall, the total number of correct decodings divided by the total number of items. Without indicating the false positive rate this is a meaningless measure (decoding every transaction into I has perfect recall). We calculate the recall R only for items that are correctly decoded, as a percentage of the total size of the entire table T and not on an individual item basis, i.e.,

R = (1 / ‖T‖) · Σ_{i ∈ I, W[m(i)] = W[m′(i)]} supp_T(i).

Table II. Analysis of our attack. E: number of items in I producing at least one false positive or negative. R: percentage of W correctly decoded.

           W1         W2         W3         W4
 |I|       100        1000       1000       1000
 |C|       20         150        150        300
 N_B       2.5        4          4          4
 N_E       2          8          8          8
 |W|       100 k      100 k      500 k      500 k
           E   R      E   R      E   R      E   R
 10%       0   22.6%  0   20.0%  0   21.0%  0   20.5%
 20%       0   38.4%  0   35.6%  0   36.8%  0   36.4%
 40%       0   63.5%  0   60.7%  0   61.4%  0   61.4%
 50%       1   72.7%  0   70.8%  0   71.3%  0   71.2%
 75%       7   87.7%  0   90.1%  1   90.2%  5   90.0%
 100%      12  92.2%  32  99.1%  40  98.4%  33  98.9%
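A small sketch of the recall metric defined above; supp_t, total_size, and the decode-check predicate stand in for quantities computed from T, W, m, and m′.

```python
def recall(items, supp_t, total_size, decodes_correctly):
    """R: supports of correctly decoded items as a fraction of the total table size ||T||."""
    return sum(supp_t[i] for i in items if decodes_correctly(i)) / total_size
```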

In Table II we report the number of items that decode incorrectly as errors, E, and the recall, R, of W. We provide results for the top 10%, 20%, 40%, 50%, 75%, and 100% of items (by support in T), illustrating that most errors are due to infrequent items in the original data. Many of these errors are unlikely to be present in any large itemsets, allowing for complete recovery of the association rules by an adversary. It should be clear from the table that our attack is highly effective, especially for the most frequent items, and results in a very low error rate. Further, most errors are due to infrequent items and are unlikely to adversely affect the recovery of the association rules, or of large portions of the original database. Our larger tests (W2–W4) caused errors in around 3–4% of the unique items and caused only around 1% of the total number of items in the original table to decode incorrectly. Depending on θ, this may have no effect on the association rules.

IV. SECURELY OUTSOURCING

It was claimed in [1] that "[the proposed technique] is highly secure"; in fact, there was a proof of security. It is natural to ask how an algorithm can be both proven secure and broken at the same time. We show that the answer is that the notion of security in [1] was incorrect, and we discuss what notions should be used.

A. Incorrect Notion of Security

We summarize the security proof in [1] as follows. Because a naïve one-to-one mapping is vulnerable to frequency analysis attacks, the paper extends a one-to-one

mapping to a one-to-n mapping. To ensure soundness, only one-to-n mappings that contain a unique item are admissible. This requires that ∀x ∈ I, m(x) \ ∪_{y∈I\{x}} m(y) ≠ ∅. However, it was shown (Theorem 4 of [1]) that any admissible one-to-n mapping m can be decoded by a one-to-one mapping m′ that only considers the unique item, concluding that a one-to-n mapping is no more secure than the naïve approach. They then show (Theorem 6 of [1]) that for the proposed encoding algorithm, there does not exist a one-to-one mapping that correctly decodes W. This is due mainly to Step 2b, which allows unique items to be added to any transaction. Finally, the non-existence of a one-to-one mapping, together with experimental analysis showing its resistance to a single attack, concludes their proof of security.

We now show that the existence or non-existence of a one-to-one mapping does not constitute a proof of security. First, even if a one-to-one mapping does not exist, it may be easy for an attacker to recover a one-to-many mapping and fully recover the original database. Second, even if a one-to-one decoding mapping exists, an encoding scheme may remain secure because it is infeasible for the attacker to find the mapping. Our attack in Section III is an example of the first point. To further illustrate the point, we introduce a small example.

Example 1: Each item i ∈ I is mapped to a set of two items {a_i, b_i}. In addition, we add two extra transactions, t′_a = {a_i | i ∈ I} and t′_b = {b_i | i ∈ I}. There does not exist a one-to-one decoding; using either a_i or b_i alone to decode i yields an incorrect result. However, this is no more secure than a one-to-one mapping for almost all databases. For almost all databases, the two additional transactions t′_a and t′_b are immediately apparent, because they have the same size as the total number of items and are the only transactions whose items are disjoint. The adversary can remove the two transactions and all items that occur in one of them, say t′_a. As a result, the adversary obtains a database that is a one-to-one mapping of the original.

While the above example is artificial by construction, it illustrates the flaw in the security proof. To see the converse, that the presence of a one-to-one mapping does not constitute a security flaw, we present an encoding scheme with perfect secrecy that nonetheless has a one-to-one decoding.

B. Perfect Secrecy

If the notion of security in [1] is inadequate, what is the right notion of security? We start from the well-known notion of perfect secrecy (or information-theoretic secrecy) in cryptography. Intuitively, the encoded table W should reveal no information about T. More specifically, given the randomized encoding algorithm E, it should satisfy

∀T1 ∀T2 ∀W ( Pr[E(T1) = W] = Pr[E(T2) = W] ).

Note that we slightly abuse notation and write E(T1) = W to denote that, when running the encoding algorithm E on T1, the output is W together with some decoding mapping, the details of which we do not care about. If E satisfies the above condition, then the output W does not reveal any information about the input table, because any table could equally likely have been encoded into W. This level of security may be trivially accomplished, but is computationally prohibitive in practice. First, the data owner must encode all ω tables with ℓ items and n transactions, T1, T2, ..., Tω, using one-to-one mappings over the individual items into the encoded tables W1, W2, ..., Wω, such that each table Ti is mapped to a disjoint set of items Σi. To produce a single table W, the ω encoded tables are combined by concatenating transactions. Let Wij be transaction j of table i; then Wj = ∪_{1≤i≤ω} Wij. To decode table i, D decodes the items in Σi (the items of table Wi) back into Ti, and all items in Σj, j ≠ i, are translated to the empty set.

Unfortunately, such an approach is extremely impractical. There are k = 2^ℓ possible transactions taken from ℓ items. Each table consists of n transactions from the set of k possible transactions where duplicates are allowed, resulting in ω = (k+n−1)! / (n!(k−1)!) possible tables. The impracticality of achieving perfect secrecy should not be surprising at all. If the encoded database contains no information about the original dataset, then the service provider cannot do anything useful for the data owner; the data owner would essentially be mining the data by herself.

C. A Practical Notion of Security

An implicit assumption of our attack (and the reason for Wong et al. to devise their encoding in the first place) is that a one-to-one mapping is vulnerable to a frequency analysis attack. It should always be assumed that the adversary has recovered some frequency information regarding the items in the table. The frequency information may be recovered from multiple sources, such as actuaries, public sales records, shareholder briefings, coercion, theft or bribery, social engineering, dumpster diving, or public records [10]. We argue an encoding is only as secure as its ability to protect against an adversary with frequency information.

1) Security Against Frequency Information: Wong et al. assume the adversary may have frequency information, and define (α, β)-security. In (α, β)-security, let L be the set of large itemsets in T and assume the adversary knows that the supports of α·|L| itemsets are in the range supp_T(i)·(1 ± β). The encoding is (α, β)-secure if the adversary still cannot achieve their goal. We define a more general notion of security against an adversary with frequency information; it is stated for individual items i ∈ I only and may be extended to other sets.

Definition 7: The candidate set of a set S ⊆ I, given table W and β-knowledge, is
cand_β^W(S) = { X ⊆ Σ | supp_T(S)·(1 − β) ≤ supp_W(X) ≤ supp_T(S)·(1 + β) }.

First, consider an adversary that has precise (β = 0) frequency knowledge for an item i ∈ I, i.e., supp_T({i}) = k. The encoding obfuscates the item only if |cand_0^W({i})| = n is large. Note that this condition is necessary, but not sufficient, because the encoding may leak information by other means. Given the large number, 2^|Σ| − 1, of itemsets, it is possible that a large subset have support k. Unfortunately, it should be apparent that most of these subsets will have very low support. Further, given knowledge of the encoding scheme, we can observe that the probability that an itemset S is correct decreases with the size of S. This leads us to the following property. An encoding is insecure if

∃ i ∈ I : |cand_β^W({i})| ≤ n    (5)

for some security parameter n > 2. If |cand_0^W(S)| is only large for S ∈ L, the encoding provides little security. Consider an item i ∈ I that has support k. If the number of itemsets S ⊆ Σ that have support k is large, but the number of itemsets that have support k ± 1 is small, these anomalies become security vulnerabilities. Instead we should require

∀ k′, k_0 ≤ k′ ≤ k_1 : |{ X ⊆ Σ | supp_W(X) = k′ }| ≥ n    (6)

for security parameters n, k_0, and k_1.

To illustrate the necessity of these security requirements, we perform two experiments on our running example. First we calculate the number of candidate itemsets |S| ≤ 5 that have exactly support k, for all k ≤ |W|. The results are shown in Figure 4(a); the vertical line is the mean support for all original items, the dark gray region is one standard deviation, and the light gray region is the support range. In our second experiment, we illustrate how ineffective the WCH+ encoding is at securing against an (α, β)-adversary when β is small. We assume an α = |I|/|L| adversary that knows the frequency of all original items. The results are shown in Figure 4(b). It can be seen that the WCH+ encoding provides very poor protection against an (α, 0) adversary; in many instances, the mappings are uniquely identified by their support.
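As a sketch of how the candidate sets of Definition 7 might be estimated, the code below counts small encoded itemsets whose support falls in the β window; capping the itemset size is our assumption to keep the enumeration tractable, mirroring the |S| ≤ 5 restriction in the experiment, and supp_w is an assumed support-counting helper.

```python
from itertools import combinations

def candidate_count(encoded_items, supp_w, target, beta, max_size=3):
    """|cand_beta^W(S)|, restricted to encoded itemsets of at most max_size items."""
    lo, hi = target * (1 - beta), target * (1 + beta)
    count = 0
    for r in range(1, max_size + 1):
        for x in combinations(encoded_items, r):
            if lo <= supp_w(frozenset(x)) <= hi:
                count += 1
    return count
```

Evaluating candidate_count for every original item against Equation (5), and the per-support counts against Equation (6), corresponds to the curves plotted in Figure 4.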

D. Suggested Changes

We suggest three simple changes, based on observations gained from our attack and the security requirements presented in the previous section, that we believe improve the security of outsourcing association rule mining.
1) All items should be unique. The difference in association between two unique items, and between a unique item and a common item, was easily exploited in our attack to identify associations and two-color the graph. All mappings m(i) should be disjoint; the sizes of the mappings are still user-definable.
2) Adding unique items in Step 2b and fake items in Step 2c should be dependent on the items already in M(t) and N(t). We accomplish this by associating a small set of items {j_0, j_1, ...} with an item i and adding an item j_k (either a unique or a fake item) with probability φ_k if i is in the partially encoded transaction; a sketch of this modification follows this list. For unique items we must still ensure we do not break soundness in Definition 1. This defeats the independence property described in Section III-C for fake items.
3) The number of transactions in W should be increased by adding transactions that decode into the empty set. This aims at obfuscating frequency attacks by increasing the number of candidate itemsets and is aimed directly at satisfying Equation 6. Note that this requires the threshold θ to be transformed into θ′ = θ·|T| / |W|.
The new key space K is determined by the final size of Σ. If we assume mappings of size one are allowed, we can obtain an upper bound on the key space as |K| = S(|Σ|, |I|), where S(n, k) denotes a Stirling number of the second kind.
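A rough sketch of one way to realize suggested change 2 follows, under our own assumptions about the data structures: each original item i carries a list of companion encoded items (unique or fake) with per-item probabilities φ_k, and a companion is only added when it cannot complete the mapping of an original item absent from the transaction.

```python
import random

def add_dependent_items(partial, companions, mappings, original_items, present):
    """Change 2: add companion items conditioned on the items already in the transaction."""
    out = set(partial)
    for i in present:                                    # items of the original transaction t
        for item, phi in companions.get(i, []):          # (companion encoded item, probability phi_k)
            if random.random() >= phi:
                continue
            candidate = out | {item}
            # soundness guard: never complete m(j) for an original item j not in the transaction
            if any(mappings[j] <= candidate for j in original_items if j not in present):
                continue
            out = candidate
    return out
```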

E. Defense Against Attack

We test our suggestions with our running example dataset. We encode it by restricting each mapping m(i) to at most four items, restricting φ_i ≥ 0.85, and increasing the number of transactions by 50%. The resulting table W contains 96 unique items and 15,000 transactions. We run our attack for the WCH+ encoding and the naïve frequency analysis attack against this encoded table. Our attack produced twenty-three false positives and forty-three false negatives, indicating that over 70% of the true associations are obfuscated. This encoding was equally successful at preventing the naïve frequency analysis attack. From Figure 4(d) we see that the number of candidate mappings for each original item i ∈ I increases by one to two orders of magnitude. This is the product of shifting the mass (Figure 4(d)) of the candidate curve into the region occupied by the items of interest.

Does this imply our encoding is secure? No, or at least it does not prove that it is, as was suggested in [1]. This analysis only considers two simple attacks. While we empirically believe our encoding is more secure than the WCH+ encoding, and that it has properties we have shown are necessary for a secure encoding, this does not constitute a proof. In some sense, this inability to formally prove properties of such encoding schemes is to be expected. An analogy for such encoding schemes in the field of cryptography are block ciphers. While provable security is the foundation of modern cryptography, for cryptographic primitives such as block ciphers there are no proofs of security. One can only assume that they are secure and prove that protocols using them are secure. The security of such primitives is ensured by first making sure that their designs adopt the best principles derived from known attacks, and then making sure that they withstand all known attacks. In the study of such primitives, new attack techniques are extremely important because knowledge of them contributes to the design of new algorithms. In the next section we will consider a more important question: is outsourcing even practical?

V. PRACTICALITY OF OUTSOURCING

In the previous sections, we analyzed several security properties for outsourcing association rule mining. While Wong et al. argue that outsourcing can be performed efficiently, we analyze the problem more closely and evaluate the tradeoff between security and efficiency. While preliminary, our results indicate that outsourcing may not be practical in a large number of instances. Sion and Carbunar [11] consider a similar problem for private information retrieval (PIR). In PIR, a client queries a server and wishes for their query to remain secret. Given the best known algorithms, [11] shows that the computational costs of PIR protocols exceed the network costs of trivially sending all data to the client, by comparing the computational requirements, in millions of instructions per second, to network throughput, in megabits per second. We attempt to analyze the feasibility of association rule outsourcing by comparing the time required to perform association rule mining with the network costs of transferring the results back to the data owner. We make a simple observation.

Observation 4: For any large itemset X in T, Apriori must generate and return all subsets of X, of which there are 2^|X|, while for the encoded table Apriori will generate all subsets of E(X), of which there are O(2^(N_B·|X|)).

The service provider cannot prune the results, because this would imply they can distinguish between two itemsets X ⊂ Y ⊆ I (for example, D(X) ≡ D(Y) yet supp_W(X) ≠ supp_W(Y)). Instead of attempting to evaluate how the parameters to the WCH+ algorithm impact the size of the results, we use a more generic metric: the ratio of the size of W to the size of T. This should make our results applicable to any encoding scheme that works within the framework described in Section II-A, should more be proposed in the future.

We perform our analysis as follows. We first generate a small example table T with 10 k transactions and 100 items using the IBM data generator as above. We use the WCH+ encoding and select a variety of values for the encoding and mining parameters: |C| = {5, 10, 20, 40}; N_B = {1, 2, 4, 5}; N_E = {1, 2, 4}; N_F = 4; and support thresholds θ ∈ {5%, 1%, 0.5%}. We use an implementation of Apriori [9] by Ferenc Bodon [12] (http://www.cs.bme.hu/~bodon/en/apriori/), written in C and running on a 2.66 GHz Pentium 4 machine with 1 GB of RAM running Linux, to mine the association rules, encode the data, and mine the encoded results.
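Observation 4 can be illustrated with a back-of-the-envelope comparison; the numbers below are hypothetical and only show how quickly the encoded result set can outgrow a link's capacity.

```python
def result_blowup(itemset_size, n_b):
    """Subsets returned for one large itemset: 2^|X| originally vs. roughly 2^(N_B*|X|) encoded."""
    return 2 ** itemset_size, 2 ** (n_b * itemset_size)

def transfer_seconds(n_itemsets, bytes_per_itemset, mbit_per_s):
    """Time to ship the mined itemsets back to the data owner over a given link."""
    return n_itemsets * bytes_per_itemset * 8 / (mbit_per_s * 1_000_000)

# Example: a 10-item large itemset with N_B = 2, 20 bytes per returned itemset, Fast Ethernet.
plain, encoded = result_blowup(10, 2)
print(plain, encoded, transfer_seconds(encoded, 20, 100))
```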

Figure 4. WCH+ (left) versus our encoding (right) at defending against naïve frequency attacks. (a) Number of candidate itemsets for each support value k. (b) Number of candidates for all i ∈ I given β knowledge (β = ±10%, ±5%, ±1%, ±0%). (c) Number of candidate itemsets for each support value k. (d) Number of candidates for all i ∈ I given β knowledge (β = ±10%, ±5%, ±1%, ±0%).

Figure 5. The tradeoff between security (increase in table size ‖W‖/‖T‖) and the required network throughput (Mbit/s), for θ = 0.5%, 1%, and 5%, with reference lines from 802.11g and Fast Ethernet up to USB 2.0, Gigabit Ethernet, SATA 2, and 10 Gigabit Ethernet.

The size of the results file is calculated in bytes and is approximately the same regardless of whether the file is stored in ASCII or binary format. For those interested, gzip provided approximately a 1:9 compression ratio, although the output must still be processed and decoded. We consider the encoding inefficient when the time required to transfer the results at a practical throughput exceeds the time required to mine the original data. Consider an input table T and encoded table W. If it takes 0.5 s to mine T, and mining W produces a 2 Mbit file, a transfer rate of ≤ 4 Mbit/s is clearly not efficient or practical. The results are shown in Figure 5. We separate the large itemset threshold θ because it is a data-owner-selected parameter that does not affect security and has the greatest impact on the size of the mined association rules in our experiments. The horizontal axis is the ratio of the size of the encoded table W to the original table T; intuitively, a larger ratio implies increased security. Several throughput frames of reference, ranging from 54 Mbit/s (802.11g Wi-Fi) to 10 Gbit/s Ethernet, are provided. Each reference is the theoretical peak transfer rate, not the real-world observed rate; this may offset our decision not to use gzip for our measurements. These results indicate that modestly doubling the size of T (mapping each item i ∈ I to a pair) may be efficient for θ = 0.05, inefficient to outsource over the network for θ = 0.01, and inefficient in general for θ = 0.005.


While these results may depend highly on the input data, they do indicate that a data owner should carefully consider whether outsourcing is an efficient use of resources for association rule mining, given the security risks posed to the confidentiality and privacy of the data.

VI. RELATED WORK

We categorize existing related work into four categories. We describe each category of work and explain how it differs from our own.

Outsourcing Privacy. Kumar et al. [7] and Lakshmanan et al. [8] consider the amount of privacy provided by one-to-one encodings and have modeled the amount of information loss given different levels of domain knowledge. The main distinction between these works and frequency analysis in general is the availability and accuracy of the frequency information obtained by an attacker.

Privacy-Preserving Data Mining (PPDM). In PPDM [13], the data collector receives responses from many users and tries to discover valuable information in aggregate while maintaining individual privacy. In our context, both aggregate and individual confidentiality and privacy are required. Two main approaches for PPDM have been proposed: randomization and secure multi-party computation (SMC). In the randomization approach [13], individuals randomize their information, introducing noise and restricting the learned results to an approximation only. In SMC [14], the data is stored by multiple parties and the objective is to learn knowledge regarding the aggregate data. The SMC approach focuses on distributed databases, while we protect the outsourced database from a single data miner.

Database Outsourcing. In database outsourcing [15], the data owner publishes its data through a number of remote servers, with the goal of enabling users at the edge of the network to access and query the data more efficiently. Database outsourcing is more concerned with the integrity of the data than with its confidentiality. Some works [16] concentrate on querying encrypted databases to prevent data theft; however, they do not prevent an adversary from learning sensitive information.

Data Encryption. Substitution ciphers are an encryption method where each letter of the plaintext is substituted with another letter in the ciphertext. It is easy to break substitution ciphers ([4], among others) when statistical information about the plaintext space is available. Frequency analysis [4] and genetic algorithms are two widely used techniques for breaking substitution ciphers. In [1], Wong et al. experimentally show that their encoding scheme is resistant to a genetic algorithm. However, as we show in this paper, statistical anomalies can be used to break their scheme more efficiently.

VII. CONCLUSIONS AND FUTURE WORK

In this paper we presented an attack on a database encoding scheme for outsourcing association rule mining. We showed how an attacker may identify patterns in the data created by the encoding algorithm, allowing a significant amount of the original data to be recovered; the attack makes no assumptions regarding a priori knowledge of the data. We also analyzed the security properties of outsourcing association rule mining, and illustrated why the security properties discussed in [1] are inadequate to ensure the confidentiality of the data. We analyzed the security of their encoding given an adversary that possesses frequency knowledge, and considered the practicality of using such an encoding in practice. We make several suggestions on how to improve the encoding that provide some defense against our attack, and present an encoding that maintains perfect secrecy against a computationally unbounded adversary. Further, we questioned the practicality of outsourcing association rule mining in general. The evidence we gathered suggests that outsourcing is not efficient in many settings. It remains an open problem whether there exist provably secure encoding schemes that are still practical.

Portions of this work were supported by a Google grant titled "Utility and Privacy in Data Anonymization" and by sponsors of CERIAS.

REFERENCES

[1] W. K. Wong, D. W. Cheung, E. Hung, B. Kao, and N. Mamoulis, "Security in outsourcing of association rule mining," in VLDB, 2007, pp. 111–122.
[2] L. Qiu, Y. Li, and X. Wu, "An approach to outsourcing data mining tasks while protecting business intelligence and customer privacy," in ICDM Workshops (ICDMW), 2006.
[3] L. Xiong, S. Chitti, and L. Liu, "Preserving data privacy in outsourcing data aggregation services," ACM Trans. Internet Technol., vol. 7, no. 3, p. 17, 2007.
[4] I. A. Al-Kadit, "Origins of cryptology: The Arab contributions," Cryptologia, vol. 16, no. 2, pp. 97–126, April 1992.
[5] D. L. Kahn, The Codebreakers: The Story of Secret Writing. New York: Scribner, 1996.
[6] D. R. Stinson, Cryptography: Theory and Practice. CRC Press, 1995.
[7] R. Kumar, J. Novak, B. Pang, and A. Tomkins, "On anonymizing query logs via token-based hashing," in Proc. of the 16th International Conference on World Wide Web (WWW), New York, NY, USA, 2007.
[8] L. V. S. Lakshmanan, R. T. Ng, and G. Ramesh, "To do or not to do: the dilemma of disclosing anonymized data," in Proc. of the 2005 ACM SIGMOD International Conference on Management of Data, 2005.
[9] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in VLDB, 1994, pp. 487–499.
[10] K. D. Mitnick and W. L. Simon, The Art of Deception: Controlling the Human Element of Security. Wiley, 2002.
[11] R. Sion and B. Carbunar, "On the computational practicality of private information retrieval," in Network and Distributed System Security Symposium (NDSS), 2007.
[12] F. Bodon, "A survey on frequent itemset mining," Budapest University of Technology and Economics, Tech. Rep., 2006.
[13] R. Agrawal and R. Srikant, "Privacy preserving data mining," in SIGMOD, 2000, pp. 439–450.
[14] Y. Lindell and B. Pinkas, "Privacy preserving data mining," in CRYPTO, 2000, pp. 36–53.
[15] H. Hacigumus, B. Iyer, and S. Mehrotra, "Providing database as a service," in ICDE, 2002, pp. 29–38.
[16] D. X. Song, D. Wagner, and A. Perrig, "Practical techniques for searches on encrypted data," in S&P, 2000, p. 44.
