Information Sciences xxx (2005) xxx–xxx
www.elsevier.com/locate/ins

A false negative approach to mining frequent itemsets from high speed transactional data streams

Jeffrey Xu Yu a,*, Zhihong Chong b, Hongjun Lu c, Zhenjie Zhang d, Aoying Zhou b

a Department of Systems Engineering and Engineering Management, Chinese University of Hong Kong, Shatin, New Territories, Hong Kong, China
b Fudan University, Shanghai, China
c Hong Kong University of Science and Technology, Hong Kong, China
d National University of Singapore, Singapore

Abstract

Mining frequent itemsets from transactional data streams is challenging due to the exponential explosion in the number of itemsets and the limited memory available for mining them. Given a domain of |I| unique items, the number of possible itemsets can be up to 2^|I| − 1. When the length of a data stream grows toward a very large number N, the chance of an itemset being frequent becomes larger, and such itemsets are difficult to track with limited memory. The existing studies on finding frequent items from high speed data streams are false-positive oriented. That is, they control memory consumption in the counting processes by an error parameter ε, and allow items with support below the specified minimum support s but above s − ε to be counted as frequent. However, such false-positive oriented approaches cannot be effectively applied to frequent itemset mining, for two reasons. First, false-positive items increase the number of false-positive frequent itemsets exponentially. Second, minimizing the number of false-positive items by using a small ε makes memory consumption large. Therefore, such approaches may make the problem computationally intractable with bounded memory consumption. In this paper, we develop algorithms that can effectively mine frequent item(set)s from high speed transactional data streams with bounded memory consumption. Our algorithms are based on the Chernoff bound: we use a running error parameter to prune item(set)s and use a reliability parameter to control memory. While our algorithms are false-negative oriented, that is, certain frequent itemsets may not appear in the results, the number of false-negative itemsets can be controlled by a predefined parameter so that a desired recall rate of frequent itemsets can be guaranteed. Our extensive experimental studies show that the proposed algorithms have high accuracy, require less memory, and consume less CPU time. They significantly outperform the existing false-positive algorithms.
© 2005 Published by Elsevier Inc.

* Corresponding author. Tel.: +852 2609 8309; fax: +852 2603 5505.
E-mail addresses: [email protected] (J.X. Yu), [email protected] (Z. Chong), [email protected] (H. Lu), [email protected] (Z. Zhang), [email protected] (A. Zhou).

0020-0255/$ - see front matter © 2005 Published by Elsevier Inc.
doi:10.1016/j.ins.2005.11.003


Keywords: Data stream; Frequent pattern mining; Memory minimization

1. Introduction


Recently, data streams have emerged as a new data type that attracts great attention from both researchers and practitioners. A data stream is essentially a virtually unbounded sequence of data items arriving at a rapid rate. Since data items arrive continuously, it is only feasible to store a certain form of synopsis (in memory or on disk) rather than the raw data for analysis or information extraction. It is also infeasible to scan the original data multiple times to build such a synopsis, because of the massive volume as well as the rapid arrival rate. Research work related to data streams boils down to the problem of finding the right form of synopsis, and the related construction algorithms, so that the required statistics or patterns can be obtained with a bounded error over unbounded input with limited memory. A large amount of work has been reported for various statistics and patterns, including simple aggregates such as maximum, minimum, average, and median values and quantiles, as well as complex patterns such as decision trees, clusters, and frequent itemsets.

In this paper we study the problem of mining frequent item(set)s (or patterns) from high speed transactional data streams. Manku and Motwani gave an excellent review of a wide range of applications for the problem of frequent data stream pattern mining [12]. The problem can be stated as follows. Let I = {x1, x2, ..., xn} be a set of items. An itemset is a subset of I. A transactional data stream, D, is a sequence of incoming transactions, (t1, t2, ..., tN), where a transaction ti is an itemset and N is an unknown, large number of transactions that will arrive. The number of transactions in D that contain X is


called the support of X, denoted sup(X). An itemset X is a frequent pattern if and only if sup(X) ≥ sN, where s ∈ (0, 1) is a threshold called the minimum support. The frequent data stream pattern mining problem, denoted FDPM, is to find an approximate set of frequent patterns (itemsets) in D with respect to a given support threshold s. The approximation is controlled by two parameters, ε and δ, where ε (∈ (0, 1)) controls errors and δ (∈ (0, 1)) controls reliability. We call it an (ε, δ) approximation scheme. The challenge is to devise algorithms that support (ε, δ) approximation for the FDPM problem with a bound on space complexity.

The main difficulty is the exponential explosion inherent in frequent pattern mining. For data streams, the incoming transactions are not stored, and we can only scan them once. If a count is required for each itemset, an application with m distinct items requires 2^m − 1 counts. Even with a moderate set of items, for example m = 1000, the total number of itemsets is 2^1000 − 1, which is obviously intractable.

A simple version of the FDPM problem, that is, mining frequent items rather than itemsets, has recently been widely studied in data stream environments with bounded memory [2,3,5–7,9,10,12]. After a careful study of these publications, we observed that, while the detailed algorithms differ, almost all of them are false-positive oriented approaches. That is, given a minimum support s, they control memory consumption in the counting processes by an error parameter ε, and allow items with support below the specified minimum support s but above s − ε to be counted as frequent. In this paper, we argue that since frequent items mining is the first step of frequent itemsets mining, even a small number of false-positive items, resulting from false-positive oriented item counting, can lead to a large number of false-positive itemsets, which makes efficient and effective frequent itemsets mining infeasible. This motivated us to develop a false-negative oriented approach for frequent items mining. In addition, to further address the problem caused by the explosion of frequent itemsets, we explored a tight bound to control the counting process for frequent item(set)s mining. Our contribution can be summarized as follows:


• While most existing work follows the false-positive oriented approach to frequent items counting, we show that a false-negative oriented approach, which allows a controlled number of frequent itemsets to be missing from the output, is a more promising solution for mining frequent itemsets from high speed transactional data streams.
• We developed the first set of one-scan false-negative oriented algorithms, which significantly outperform the existing false-positive oriented approaches for frequent itemsets mining as well as frequent items mining. We also derived memory bounds for both cases.



• Most existing approaches use the error parameter ε for two conflicting purposes: quality control (ε) and memory size control (1/ε or 1/ε²). This leads to a dilemma: a small increase of ε makes the number of false-positive items large, and a small decrease of ε makes memory consumption large. Our algorithms adopt an (ε, δ) approximation scheme with ε = 0 and δ > 0, which decouples the two interrelated but conflicting purposes and makes the parameter setting of the mining process easier.

The remainder of the paper is organized as follows. Section 2 analyzes the false-positive and false-negative approaches to frequent item(set)s mining. Sections 3 and 4 present our frequent items mining algorithm and the results of its performance study. Sections 5 and 6 present our frequent itemsets mining algorithm and the results of its performance study. Section 7 concludes the paper.


2. False-positive versus false-negative


Given the small space allowed for mining frequent data stream patterns, the key point becomes how to prune potentially infrequent patterns and how to maintain potentially frequent patterns with probabilistic guarantees [11]. Approximate mining of frequent patterns with probabilistic guarantees can take two possible approaches, namely false-positive oriented and false-negative oriented. The former includes some infrequent patterns in the final result, whereas the latter misses some frequent patterns. There are a large number of publications on false-positive oriented approaches [2,3,5–7,9,10,12]. All of the false-positive oriented approaches focus on frequent items mining rather than frequent itemsets mining. In [12], as a first attempt, Manku and Motwani also studied false-positive oriented frequent itemsets mining, with a focus on system-level issues rather than theoretical guarantees. The early work on the one-scan false-negative oriented approach of this paper is reported in [15].


2.1. Deficiency of false-positive oriented approaches

Because the focus of this paper is on frequent itemsets mining, we concentrate on frequent itemsets mining, and we mainly address it in comparison with the algorithms proposed by Manku and Motwani [12]. In [12], Manku and Motwani developed two false-positive oriented algorithms for frequent items counting, Sticky-Sampling and Lossy-Counting. Sticky-Sampling uses an expected O((1/ε) log(s⁻¹δ⁻¹)) entries, and Lossy-Counting uses O((1/ε) log(εN)) entries. In theory, Sticky-Sampling requires constant space, while Lossy-Counting requires space that grows logarithmically with N. In practice, as shown in [12], Sticky-Sampling performs


worse because of its tendency to remember every unique item sampled. Lossy-Counting can prune low frequency items quickly and keep only highly frequent items. Based on this fact, Manku and Motwani gave a Lossy-Counting based three-module system (Buffer–Trie–SetGen) for mining frequent itemsets, again with a focus on system-level issues rather than theoretical guarantees. The main features of their algorithms are that (1) all item(set)s whose true frequency exceeds sN are output, (2) no item(set)s whose true frequency is less than (s − ε)N are output, and (3) estimated frequencies are less than the true frequencies by at most εN.

In the following, without loss of generality, we address our false-negative approach in comparison with the Lossy-Counting algorithm and the Buffer–Trie–SetGen approach.


Remark 1. Like Sticky-Sampling, Lossy-Counting is false-positive oriented and is ε-deficient. The parameter ε is coupled with two conflicting goals. First, let fε be the number of items whose support falls in [s − ε, s]. Then fε ≥ fε′ if ε > ε′: the smaller ε is, the fewer false-positive items are included in the result set. Second, because memory consumption is a factor of 1/ε, memory consumption increases reciprocally with ε.


Remark 1 states the dilemma of false-positive oriented approaches (ε-deficiency). Memory consumption increases reciprocally with ε, where ε also controls the error bound. It is difficult to decouple the two functions, memory consumption control and error control, from the error bound ε. In Sticky-Sampling, ε is used to determine a sampling rate, and in Lossy-Counting, ε is used to determine the bucket width. Changing their ways of dealing with ε means changing the worst case space-complexity analysis. The impact of the parameter ε is even greater when frequent itemsets mining is concerned, which is in fact related to the fundamental issue of applying the Apriori property [1]. The Apriori property states that if any length-k pattern is not frequent in a dataset, its length-(k + 1) super-patterns can never be frequent. In other words, the Apriori property suggests using the smallest possible set of frequent k-itemsets to generate the candidate (k + 1)-itemsets, and then mining those candidates. The false-positive oriented approaches allow 1-itemsets with support below s but above s − ε to be counted as frequent. Consequently, when there are false-positive 1-itemsets in [s − ε, s], the exponential explosion makes the number of potentially frequent itemsets very large and difficult for false-positive oriented approaches to manage.


2.2. Our false-negative oriented approach: ε-decoupling

False-positive oriented approaches have their limits in supporting frequent item(set)s mining. One of the main difficulties is caused by the conflicting roles of


the error parameter ε, as stated in Remark 1. In this paper, we decouple the two conflicting functions of the error parameter ε as follows.


• Error control and pruning: We use an effective ε to control the error bound; it is changeable rather than fixed. The effective value of ε becomes smaller as more data items are received from the data stream. In brief, we compute the effective value of ε using the minimum support s (user given), the reliability δ (user given), and the number of observations n (variable), where ε decreases as n grows. The effective value of ε approaches zero as the number of observations increases, so frequent item(set)s mining becomes more accurate. It is important to note that we use ε to prune data but do not use it to control memory.
• Memory control: We use the reliability δ instead of ε to control memory consumption. Unlike false-positive oriented approaches, whose memory consumption is determined by 1/ε, the memory consumption of our algorithms is related to ln(1/δ). Considering the same memory space using either ε or δ, we have 1/ε = ln(1/δ). To get the same memory space, when ε = 0.1, δ = 0.00005; when ε = 0.01, δ = 3.7 × 10⁻⁴⁴ (see the short sketch after this list). Because in practice δ = 0.0001 suffices, our approach can significantly reduce the memory consumption and processing cost for frequent item(set)s mining, while achieving high accuracy. We will discuss bounds for frequent item(set)s mining later.
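The ε-versus-δ memory comparison above can be checked directly: equating the two memory factors 1/ε and ln(1/δ) gives δ = e^(−1/ε). A minimal sketch (the function name is ours), reproducing the two values quoted above:

```python
import math

def delta_for_same_memory(eps):
    """Solve 1/eps = ln(1/delta) for delta: the reliability setting whose
    memory factor ln(1/delta) equals the error parameter's factor 1/eps."""
    return math.exp(-1.0 / eps)

print(delta_for_same_memory(0.1))    # ~4.5e-05, i.e. delta ~= 0.00005
print(delta_for_same_memory(0.01))   # ~3.7e-44
```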


Our approach does not allow 1-itemsets with support below s to be counted as frequent, and is therefore a false-negative oriented approach. We will give the details of our approach, and show later in this paper that the probability of missing frequent item(set)s is considerably small. Our one-scan false-negative oriented approach is different from Toivonen's two-scan false-negative oriented approach [13]. In brief, Toivonen's algorithm picks a random sample and, in one pass, finds all association rules over this sample that probably hold in the whole dataset, and then verifies the results against the rest of the dataset. It allows false negatives with probabilistic guarantees, and the sample size can be at least O((1/ε²) log(1/δ)). One of the problems of Toivonen's algorithm is that, because the error parameter ε can be very small, its memory consumption can be very large (1/ε²). We summarize some bounds in Table 1 for comparison. Note that, in Table 1, GroupTest, as a false-positive approach, does not rely on ε; but it requires knowledge of the domain of a data stream, which is difficult to obtain beforehand.


2.3. Frequent itemsets mining: a comparison

To verify our analysis, we conducted experiments to study the impact of a large number of itemsets in the range [s − ε, s + ε] on frequent itemsets mining.


Table 1
Theoretical memory bounds

Algorithm                  Type            Space
Charikar et al. [3]        False-positive  O((k/ε²) log(n/δ))
Sticky-Sampling [12]       False-positive  O((1/ε) log(s⁻¹δ⁻¹))
Lossy-Counting [12]        False-positive  O((1/ε) log(εN))
GroupTest [6]              False-positive  O(k(log(k) + log(δ⁻¹)) log(M))
Toivonen [13]              False-negative  O((1/ε²) log(δ⁻¹))
FDPM-1 (this paper, [15])  False-negative  O((2 + 2 ln(2/δ))/s)
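To get a feel for the bounds in Table 1, they can be evaluated at a typical setting. The sketch below plugs in the parameters used throughout this paper (s = 0.1%, ε = s/10, δ = 0.1, N = 1000K); the constants hidden by the O-notation are ignored, so these are only order-of-magnitude illustrations of the formulas, not measured memory:

```python
import math

s, eps, delta, N = 0.001, 0.0001, 0.1, 1_000_000

sticky   = (1 / eps) * math.log(1 / (s * delta))   # O((1/eps) log(1/(s*delta)))
lossy    = (1 / eps) * math.log(eps * N)           # O((1/eps) log(eps*N))
toivonen = (1 / eps ** 2) * math.log(1 / delta)    # O((1/eps^2) log(1/delta))
fdpm1    = (2 + 2 * math.log(2 / delta)) / s       # exact bound of this paper

for name, v in [("Sticky-Sampling", sticky), ("Lossy-Counting", lossy),
                ("Toivonen", toivonen), ("FDPM-1", fdpm1)]:
    print(f"{name:16s} ~{v:,.0f}")
# FDPM-1's bound (~7,991) does not depend on eps at all.
```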

We report here one of the experiments. We generated a data stream of length 1000K with an average transaction size of 15 and a maximal potentially frequent itemset size of 6, over 10K unique items. We implemented the Lossy-Counting based frequent itemsets mining approach, denoted BTS (Buffer–Trie–SetGen) [12]. With ε = s/10 and δ = 0.1, we obtained the results shown in Table 2. To measure quality, we use two metrics, recall and precision, defined as follows. Given the set of true frequent itemsets A and the set of obtained frequent itemsets B, the recall is |A ∩ B|/|A| and the precision is |A ∩ B|/|B|.

In Table 2, the first column is the minimum support (s), and the second is the true number of frequent itemsets (|A|). The next three columns summarize the quality of BTS using minimum support s: the result size (|B|), its recall, and its precision. It can be seen that the sizes of the obtained results can be up to 10 times larger than the true size. All three recalls are 1, which means that the obtained results contain all the true frequent itemsets. All three precisions are less than 0.2, which means that the obtained results contain a large number of itemsets with support below s but above s − ε. The number of false positives is large, and its impact is significant in two ways: (i) the quality of the mining result is low, and (ii) the memory needed at run time grows accordingly.


Table 2
Impact of false positives in BTS

s (%)  True size  Mined size  Recall  Precision
0.08   21,361     126,307     1.00    0.17
0.10   12,252     68,275      1.00    0.18
0.20   2359       23,154      1.00    0.16


Table 3
Impact of false negatives: BTS(s + ε) where ε = s/10

s (%)  True size  Mined size  Recall  Precision
0.08   21,361     18,351      0.86    1.00
0.10   12,252     10,411      0.85    1.00
0.20   2359       1739        0.74    1.00

Astute readers may suggest turning a false-positive algorithm into a false-negative one for frequent itemsets mining. That is, for user-given s and ε, we can deliberately use s + ε as the minimum support, so that the output contains only frequent itemsets with support greater than s, but some frequent itemsets with support between s and s + ε may not be in the output, which makes the algorithm false-negative. We implemented this idea and obtained the results shown in Table 3. Note that the true frequent itemsets in Tables 2 and 3 are the same. We can see that, in Table 3, the precisions become 1.0, as there are no false positives. However, the recall rate drops by 15–26%, which seems unsatisfactorily low.

We also tested our false-negative oriented approach. For the same minimum supports (s) as in Table 2, with ε = s/10 and δ = 0.1, we achieve 0.99 recall and 1.0 precision in all settings. Note that BTS does not perform well because there are many itemsets in [s − ε, s + ε] (Table 2). In order to test whether our false-negative oriented approach misses itemsets, we used the same setting as Table 2 but set the minimum support to s − ε instead, and tested whether we miss many itemsets in [s − ε, s + ε]. We found that we can still achieve 0.99 recall and 1.0 precision in all settings. In conclusion, we believe that, contrary to most existing approaches, the false-negative oriented approach is the more promising way to solve the FDPM problem.
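For reference, the recall and precision used in these comparisons are plain set computations. The following is a minimal sketch (the function name and the toy itemsets are our own illustration, not taken from the paper):

```python
def recall_precision(true_itemsets, mined_itemsets):
    """Recall |A∩B|/|A| and precision |A∩B|/|B| for two sets of itemsets."""
    a, b = set(true_itemsets), set(mined_itemsets)
    hits = len(a & b)                      # itemsets that are both true and mined
    recall = hits / len(a) if a else 1.0
    precision = hits / len(b) if b else 1.0
    return recall, precision

# Hypothetical example: two true frequent itemsets, three mined ones.
A = {frozenset({1}), frozenset({1, 2})}
B = {frozenset({1}), frozenset({1, 2}), frozenset({2, 3})}
print(recall_precision(A, B))  # (1.0, 0.666...): all true itemsets found, one false positive
```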


3. Mining frequent items from a data stream

In this section, we focus on frequent items mining, and discuss the Chernoff bound [4], our basic approach, and our algorithm. We discuss frequent itemsets mining in Section 5.

3.1. Chernoff bound


Suppose there is a sequence of observations, o1, o2, ..., on, on+1, .... The Chernoff bound gives us certain probabilistic guarantees on the estimation of statistics about the underlying data that generates these observations, based on the n observations obtained so far. Consider the sequence of observations o1, o2, ..., on as n independent Bernoulli trials (coin flips) such that Pr[oi = 1] = p and Pr[oi = 0] = 1 − p for a probability p. Let r be the number of heads in the n coin flips. The expectation of r is np. The Chernoff bound states that, for any γ > 0,


Pr{|r − np| ≥ npγ} ≤ 2e^{−npγ²/2}


Let r̄ be r/n, and consider the minimum support s as the probability p. The above inequality becomes

Pr{|r̄ − s| ≥ sγ} ≤ 2e^{−nsγ²/2}


Further, replacing sγ with ε, we have

Pr{|r̄ − s| ≥ ε} ≤ 2e^{−nε²/(2s)}    (1)


Let the right side of Eq. (1) be δ. We see that, with probability at most δ, the running average r̄ is beyond ±ε of s, where

ε = √(2s ln(2/δ)/n)    (2)

FDPM can be considered as an application of the Chernoff bound as follows. Given a sequence of 1-item transactions, D = t1, t2, ..., tn, tn+1, ..., tN, let n be the number of transactions observed so far, with n ≪ N. For a pattern X, its running support up to n is sūp(X) and its true support up to N is sup(X). By taking sūp(X)/n as the running average r̄ and sup(X)/N as s, we can make the following statement: for a pattern X, when n observations have been made, the running support of X is beyond ±ε of s with probability at most δ. In other words, the running support of X is within ±ε of s with probability at least 1 − δ.

Consider s = 0.1, δ = 0.1 and ε = 0.01. With the Chernoff bound, n ≈ 5991 (Eq. (2)). This implies the following for a pattern X: if we have about 5991 observations, its true value sup(X)/N is in the range (sūp(X)/n − 0.01, sūp(X)/n + 0.01) with high probability 0.9.

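As a sanity check of Eq. (2), the running error and the number of observations it implies can be computed directly. A minimal sketch (the function names are ours), reproducing the example above:

```python
import math

def running_error(s, delta, n):
    """Eq. (2): running error epsilon after n observations, for minimum
    support s and reliability delta."""
    return math.sqrt(2.0 * s * math.log(2.0 / delta) / n)

def observations_needed(s, delta, eps):
    """Invert Eq. (2): smallest n for which the running error drops to eps."""
    return math.ceil(2.0 * s * math.log(2.0 / delta) / eps ** 2)

# The example from the text: s = 0.1, delta = 0.1, eps = 0.01 gives n ~ 5991.
print(observations_needed(0.1, 0.1, 0.01))   # 5992 (the text rounds to ~5991)
print(running_error(0.1, 0.1, 5991))         # ~0.01
```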

3.2. The basic approach


Based on the Chernoff bound, we group arriving items into two groups, namely potentially infrequent patterns and potentially frequent patterns, defined as follows. Here, in order to distinguish it from the error parameter ε used in false-positive oriented approaches as a user-given, fixed parameter, we use εn in the following discussions of our false-negative oriented approaches. The error parameter εn is neither user given nor fixed. As shown in Eq. (2), εn is a running variable which decreases while n increases.


Definition 1. Given n observations, a running error εn in terms of n can be obtained from Eq. (2). A pattern X is potentially infrequent if sūp(X)/n < s − εn in terms of εn. A pattern X is potentially frequent if it is not potentially infrequent in terms of εn.

The condition for determining a potentially infrequent pattern can be represented alternatively as sūp(X) < (s − εn)n for a given n observations. A pattern


X is potentially frequent if sūp(X) ≥ (s − εn)n. It is important to note that when n becomes a very large number N, εn ≈ 0, and the condition therefore approaches sūp(X) ≥ sN.


Algorithm 1. FDPM-1(s, δ)
 1: let n0 be the required number of observations (Eq. (3));
 2: n ← 0; P ← ∅;
 3: while a new transaction t arrives do
 4:   if t ∈ P then
 5:     increase t's count by 1;
 6:   else
 7:     if |P| ≥ n0 then
 8:       calculate the running εn for the n observations;
 9:       delete all entries in P that are potentially infrequent;
10:     end if
11:     insert t with an initial count 1 into P;
12:   end if
13:   n ← n + 1;
14:   output P on demand;
15: end while


Remark 2. Our algorithm is false-negative oriented and is a (0, δ) approximation scheme.

Remark 3. For a given minimum support s and reliability δ, the memory consumption is bounded in terms of the number of observations, and in practice is much less than the number of observations.


Remark 3 reflects the fact that the same transactions may appear many times in a transactional data stream. As discussed later, our bound does not rely on a user-specified error ε, but on the running error εn, which decreases as the number of observations n increases.

3.3. Mining frequent items


Our algorithm for mining frequent items from a data stream, denoted FDPM-1, is outlined in Algorithm 1, which takes s and δ as inputs. Note that we do not take ε as input. Algorithm 1 makes use of the Chernoff bound. In line 1, n0 is the required number of observations, which is given below:

n0 = (2 + 2 ln(2/δ))/s    (3)

We will show later how we determine n0, which is in fact the memory bound. Now, when we receive a transaction t from a 1-itemset transactional data stream, we check whether it exists in the pool P. If it exists, we increase its


count by 1 (lines 4–5). Otherwise, we insert t into P if the number of entries in P is less than n0. When P becomes full (|P| ≥ n0) (line 7), we prune potentially infrequent patterns X from P based on Definition 1. We output the mining results (P) only when there is such a demand (line 14). Note that we do not initially allocate memory for keeping n0 entries in P; we increase the size of P incrementally.
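The following is a minimal runnable sketch of FDPM-1 in Python, under our own naming and data-structure choices (a plain dict for the pool P); it illustrates the pseudocode above and is not the authors' implementation:

```python
import math

class FDPM1:
    """Sketch of Algorithm 1 (FDPM-1): false-negative frequent-item mining."""

    def __init__(self, s, delta):
        self.s, self.delta = s, delta
        # Eq. (3); the paper rounds to the nearest integer (7991 for s=0.001, delta=0.1).
        self.n0 = round((2 + 2 * math.log(2 / delta)) / s)
        self.n = 0        # observations so far
        self.pool = {}    # item -> running count (the pool P)

    def _eps(self):
        """Running error after n observations (Eq. (2))."""
        return math.sqrt(2 * self.s * math.log(2 / self.delta) / self.n)

    def process(self, item):
        self.n += 1
        if item in self.pool:
            self.pool[item] += 1
            return
        if len(self.pool) >= self.n0:
            # Pool is full: drop potentially infrequent items (Definition 1).
            threshold = (self.s - self._eps()) * self.n
            self.pool = {x: c for x, c in self.pool.items() if c >= threshold}
        self.pool[item] = 1

    def result(self):
        """Items whose running support reaches s over the n observations so far."""
        return {x: c for x, c in self.pool.items() if c >= self.s * self.n}
```

Note that early in the stream the running error εn exceeds s, so the pruning threshold is negative and nothing is evicted; the pool grows incrementally, exactly as noted above, and pruning takes effect once εn drops below s.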


Theorem 1. Algorithm 1 finds frequent 1-itemsets in a data stream, given the two parameters s and δ. When the data is independent, Algorithm 1 ensures the following:

355 356 357 358 359 360 361 362 363 364 365 366 367 368

The proof of Theorem 1 is sketched below. First, the three properties (a), (b) and (c) can be derived directly from the Chernoff bound. When n transactions have been received, the true support of a pattern X, sup(X)/N, for N ≫ n, is within ±εn of the running support sūp(X)/n when the Chernoff bound is used. Recall that εn approaches 0 as the number of observations n increases. Because a pruned potentially infrequent pattern has its true support outside the given range with probability at most δ, the probability of pruning a frequent pattern is at most δ. Therefore, the probability that the estimated support equals the true support is no less than 1 − δ.

Second, we show the proof of property (d) when the Chernoff bound is concerned. As shown in Algorithm 1, P always keeps all potentially frequent patterns X such that sūp(X) ≥ (s − εn)n when n transactions have been received. Therefore |P| ≤ 1/(s − εn) when s − εn > 0; otherwise |P| · (s − εn)n > n, which is impossible. Let |P| = n = 1/(s − εn). We have the following equation:

n = 1/(s − εn) = 1/(s − √(2s ln(2/δ)/n))    (4)


Solving the equation, we get

n = (2 + 2 ln(2/δ))/s    (5)

which proves the last property (d) of Theorem 1. □

The last property of Theorem 1 is proved for the minimum number of observations.


As an example, suppose s = 0.001, ε = s/10 and δ = 0.1. The memory bound is n0 = 7991. Consider εn as follows. When n ≤ n0, εn = 0, because all items can be kept in the pool. When n = 7992 (= n0 + 1), εn = 0.000866 (the largest possible error). When n = 100,000, εn = 0.000245. When n = 1,000,000, εn = 0.000077.

We briefly discuss the time complexity of Algorithm 1. The cost of inserting a new item is O(1). The cost of one pruning pass is O(n0), because we only maintain n0 items, and the number of pruning passes is at most N/n0, where N is the length of the data stream.

Algorithm 1 is designed on top of the Chernoff bound, which assumes independent data. In reality, the data in a stream may well be dependent, and when it is, the quality of Algorithm 1 cannot be guaranteed. Several approaches can be taken to handle dependent data streams. One is to conduct random sampling with a reservoir [14], as indicated in [12]. The technique of random sampling with a reservoir selects, in one sequential pass, a random sample of n records without replacement from a pool of N records, where N is unknown [14]. With this technique, we can handle a dependent data stream as if it were independent. In [8], a probabilistic-inplace algorithm was introduced to handle different distributions. Given m counters, the probabilistic-inplace algorithm reserves m/2 of them to store the current best candidates, and uses the unreserved m/2 to monitor network traffic. For every run, it replaces the m/2 reserved counters with the top counters out of all m. With this technique, we can divide a transactional data stream into segments and apply Algorithm 1 to the segments one by one, continuously. The length of each segment is k · n0, where k is a positive number and n0 is the smallest number of observations. The memory required is 2n0: one n0 for reserving potentially frequent patterns and the other for monitoring a segment.
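The numbers in this example follow directly from Eqs. (2) and (3); the following small script (our own illustration) reproduces them:

```python
import math

def n0(s, delta):
    return round((2 + 2 * math.log(2 / delta)) / s)    # Eq. (3)

def eps(s, delta, n):
    return math.sqrt(2 * s * math.log(2 / delta) / n)  # Eq. (2)

s, delta = 0.001, 0.1
print(n0(s, delta))                   # 7991
for n in (7992, 100_000, 1_000_000):
    print(n, round(eps(s, delta, n), 6))
# 7992 -> 0.000866, 100000 -> 0.000245, 1000000 -> 0.000077, matching the text
```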


4. Performance study I: mining frequent items

We report our experimental results for frequent items mining in this section. We implemented our false-negative oriented algorithm FDPM-1 (Algorithm 1). We also implemented the false-positive oriented algorithms Lossy-Counting [12] and Sticky-Sampling [12], denoted LC and SS, respectively. For testing frequent items mining, we generate 1-itemset transactional data streams using a Zipf distribution. We implemented all the frequent item(set)s mining algorithms using Microsoft Visual C++ Version 6.0, with the same data structures and subroutines in all implementations, in order to minimize any performance differences caused by minor differences in implementation. We conducted all tests (Sections 4 and 6) on a Dell PC with a 1.7 GHz CPU and 1 GB of memory. Because the memory size is 1 GB, there were no I/Os in any of our tests. We report our results in terms of memory consumption (the number of counters) and CPU time (in seconds), as well as recall and precision.


4.1. Data distribution

We first test two data sets of length 1000K using two Zipf factors, 0.5 and 1.5. We compare the three algorithms LC, SS, and FDPM-1 by varying s, with ε = s/10 and δ = 0.1. The memory and CPU results are shown in Fig. 1. In both cases, SS consumes the most memory across the different s. Unlike SS, the memory consumption of both LC and FDPM-1 decreases as s increases. It is interesting to note that when Zipf = 0.5, FDPM-1 outperforms LC in terms of memory consumption, whereas when Zipf = 1.5, LC outperforms FDPM-1. In terms of CPU time, FDPM-1 outperforms LC in all cases. Some explanation follows. When Zipf = 0.5, the data distribution is nearly uniform, so there are only a small number of items whose support is greater than the minimum support s. LC and SS need more memory to maintain items in such a sparse data stream. FDPM-1 prunes items using the running error εn; as n increases, εn approaches zero, which allows us to track the items near s with less memory. When Zipf = 1.5, the data is more skewed and the number of unique items is smaller. FDPM-1 cannot prune as it does when Zipf = 0.5, while LC can effectively prune items whose support is less than s − ε. The recall and precision for Zipf = 1.5 are given in Table 4. FDPM-1 achieves 100% recall and 100% precision.

Fig. 1. The effectiveness of s (ε = s/10, δ = 0.1): (a) memory (Zipf = 0.5), (b) CPU (Zipf = 0.5), (c) memory (Zipf = 1.5), (d) CPU (Zipf = 1.5).


Table 4
Varying s (Zipf = 1.5)

s (%)  LC R  LC P  SS R  SS P  FDPM-1 R  FDPM-1 P
0.01   1     0.91  1     0.91  1         1
0.1    1     0.96  1     0.96  1         1
1      1     0.92  1     0.92  1         1
10     1     1     1     1     1         1

SS and LC ensure 100% recall, but allow precision to drop to 91–92%, despite the fact that the patterns are skewed.

In order to investigate the effect of the Zipf distribution, we fix s = 0.1%, ε = s/10, and δ = 0.1, and test different Zipf factors. The results are shown in Fig. 2, in which there is a turnover when Zipf is about 1.25 (Fig. 2(a)). When Zipf is less than 1.25, FDPM-1 consumes less memory than LC. FDPM-1 performs best in terms of CPU cost. The recall and precision are shown in Table 5. When Zipf = 0.5 and s = 0.1%, no frequent items can be found. When Zipf = 1.0, the precisions of LC and SS drop as low as 0.87. FDPM-1 ensures high recall and precision.

Fig. 2. Effectiveness of Zipf factors: (a) memory, (b) CPU.

Table 5
Varying Zipf factors

Zipf  LC R  LC P  SS R  SS P  FDPM-1 R  FDPM-1 P
0.5   –     –     –     –     –         –
1.0   1     0.87  1     0.87  0.99      1
1.5   1     0.96  1     0.96  1         1
3.0   1     1     1     1     1         1


4.2. Critical region testing


In this section, we conduct several further tests on the critical region [s − ε, s]. Suppose that many frequent items reside in the critical region. LC and SS may suffer if they mine with s and ε, because they are likely to include many false positives, hurting precision. On the other hand, FDPM-1 may suffer if it mines with s − ε, because it is likely to miss items.

First, with a window size ε = 0.00001, we slide the minimum support from s = 0.01% down to s − i · ε for i = 1, 2, ..., 6, and test two data streams with Zipf = 0.5 and Zipf = 1.5. The recall and precision are shown in Table 6(a) and (b). LC and SS perform similarly to each other. When the data is not skewed (Zipf = 0.5), LC and SS include false positives more easily. FDPM-1 reaches 100% recall and 100% precision in all cases.

Second, we identify a region [s − ε, s] using the Zipf = 1.5 data stream, where LC and SS perform well. We artificially move frequent items from (s, 1] into [s − ε, s], where s = 0.1% and ε = s/10. We test LC and SS using s as the minimum support, and test FDPM-1 using s − ε as the minimum support. The recall and precision are shown in Table 7. As expected, the precision of LC and SS decreases as more items reside in [s − ε, s]. But FDPM-1 is insensitive to the number of items in the critical region.

Table 6
Sliding window

(a) Zipf = 0.5
s (%)  LC R  LC P  SS R  SS P  FDPM-1 R  FDPM-1 P
0.010  1     0.79  1     0.79  1         1
0.009  1     0.73  1     0.73  1         1
0.008  1     0.79  1     0.79  1         1
0.007  1     0.80  1     0.80  1         1
0.006  1     0.82  1     0.82  1         1
0.005  1     0.80  1     0.80  1         1
0.004  1     0.78  1     0.78  1         1

(b) Zipf = 1.5
s (%)  LC R  LC P  SS R  SS P  FDPM-1 R  FDPM-1 P
0.010  1     0.91  1     0.91  1         1
0.009  1     0.95  1     0.95  1         1
0.008  1     0.92  1     0.92  1         1
0.007  1     0.96  1     0.96  1         1
0.006  1     0.93  1     0.93  1         1
0.005  1     0.93  1     0.93  1         1
0.004  1     0.93  1     0.93  1         1

Table 7
Critical region: LC/SS tested with s = 0.1%, FDPM-1 tested with s = 0.09%, where ε = s/10, δ = 0.1, Zipf = 1.5

[s − ε, s]  (s, 1]  LC R  LC P  SS R  SS P  FDPM-1 R  FDPM-1 P
25          247     1     0.91  1     0.91  1         1
73          200     1     0.74  1     0.74  1         1
123         150     1     0.55  1     0.55  1         1
173         100     1     0.36  1     0.36  1         1
223         50      1     0.17  1     0.17  1         1






4.3. The impacts of data arrival order



We test several data arrival orders, in order to determine whether our approach is order sensitive. Let s = 0.1%, ε = s/10, δ = 0.1, and Zipf = 1.5. Several arrival orders are tested: OO (Original Order), rO (reverse Order), RO (Random Order), SO (segment-based random order, in which we randomly reorder the data in units of 1000-item segments), FF (Frequent First), FM (Frequent Middle), and FL (Frequent Last). FDPM-1 is shown to be insensitive to the data arrival order. The results are shown in Fig. 3 and Table 8. It achieves 100% recall and 100% precision, outperforms the others in terms of CPU, and its memory consumption is not influenced by the data arrival order. LC and SS are rather sensitive to the data arrival order. For example, when frequent items arrive late (FM or FL), both LC and SS consume more memory than FDPM-1.

Fig. 3. Effectiveness of data arrival order: (a) memory, (b) CPU.







Table 8
Data arrival order

Data  LC R  LC P  SS R  SS P  FDPM-1 R  FDPM-1 P
OO    1     0.96  1     0.96  1         1
rO    1     0.96  1     0.96  1         1
RO    1     1     1     1     1         1
SO    1     1     1     1     1         1
FF    1     0.95  1     0.95  1         1
FM    1     0.95  1     0.95  1         1
FL    1     0.95  1     0.95  1         1

5. Mining frequent itemsets from a data stream


Algorithm 2. FDPM(s, δ)
 1: let n0 be the required number of observations (Eq. (3));
 2: n1 ← k · n0;
 3: n ← 0; F ← ∅; P ← ∅;
 4: for every n1 transactions do
 5:   keep the potentially frequent patterns in P in terms of n1;
 6:   F ← P ∪ F;
 7:   prune potentially infrequent patterns from F further if |F| > cu · n0;
 8:   P ← ∅;
 9:   n ← n + n1;
10: end for
11: output the patterns in F whose count ≥ sn, on demand;

We show our frequent data stream pattern (itemset) mining algorithm in Algorithm 2. In line 1, we obtain n0 based on the Chernoff bound; here n0 is a number of transactions. We divide a transactional data stream into segments. The length of a segment is n1 = k · n0 (line 2). The parameter k controls the number of transactions we process in each run, in a manner similar to the probabilistic-inplace algorithm of [8]. We maintain potentially frequent patterns in F, and use P within each segment in a run. Both are initialized in line 3. We discuss the sizes of P and F in detail later.

In a for loop (lines 4–10), we treat every segment of n1 transactions as an individual data stream, repeatedly. For each segment, first, we prune potentially infrequent patterns (line 5), using the same techniques given in Algorithm 1: a pattern X is potentially infrequent if sūp(X) < (s − εn)n, where n increases from 0 to n1 and εn is computed in terms of n (Definition 1). Second, we merge the potentially frequent patterns in P into F. That is, for every pattern X ∈ P with a count c, we increase the count of the same pattern X in F by c if we can find it there; otherwise, we create X in F with an initial count of c. Third, we further prune potentially infrequent patterns from F when |F| > cu · n0, using an existing association rule mining algorithm. We discuss cu in detail next.

In Algorithm 2, k controls the size of a segment (k · n0) in a run. If k is small, Algorithm 2 prunes potentially infrequent patterns frequently, which leads to less memory but more CPU time. On the other hand, a large k may lead to more memory but less CPU time. Regarding data dependence, we found in our extensive tests that a small k does not necessarily decrease the quality of frequent itemsets mining, because the number of combinations is large in comparison with frequent items mining.
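The segment loop of Algorithm 2 can be sketched in a few lines of Python. This is our own simplified illustration, not the authors' implementation: it enumerates candidate itemsets up to a small fixed size (max_len) instead of Apriori-style candidate generation, uses Eq. (2) for the running error, and ignores a trailing partial segment; cu and k are the knobs discussed above.

```python
import math
from itertools import combinations

def n0(s, delta):
    return round((2 + 2 * math.log(2 / delta)) / s)     # Eq. (3)

def eps(s, delta, n):
    return math.sqrt(2 * s * math.log(2 / delta) / n)   # Eq. (2)

def fdpm(stream, s, delta, k=2, cu=1.0, max_len=3):
    """Sketch of Algorithm 2 (FDPM). stream yields transactions (sets of items)."""
    n_0 = n0(s, delta)
    n1 = k * n_0                     # segment length (line 2)
    F, n = {}, 0                     # global pool and global transaction count
    P, m = {}, 0                     # per-segment pool and per-segment count
    for t in stream:
        m += 1
        for size in range(1, min(len(t), max_len) + 1):
            for itemset in combinations(sorted(t), size):
                P[itemset] = P.get(itemset, 0) + 1
        if m == n1:                  # segment boundary (lines 5-9)
            e = eps(s, delta, m)
            P = {x: c for x, c in P.items() if c >= (s - e) * m}   # line 5
            for x, c in P.items():                                  # line 6
                F[x] = F.get(x, 0) + c
            n += m
            if len(F) > cu * n_0:                                   # line 7
                e = eps(s, delta, n)
                F = {x: c for x, c in F.items() if c >= (s - e) * n}
            P, m = {}, 0
    # Line 11; a trailing partial segment is ignored for brevity.
    return {x: c for x, c in F.items() if c >= s * n}
```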


Theorem 2. Algorithm 2 finds frequent itemsets in a data stream, given the two parameters s and δ. Algorithm 2 ensures the following properties:


(a) All itemsets whose true frequency exceeds sN are output with probability of at least 1 − δ.
(b) No itemsets whose true frequency is less than sN are output.
(c) The probability that the estimated support equals the true support is no less than 1 − δ.


Theorem 2 can be derived directly from the Chernoff bound. Below, we concentrate on the bounds of Algorithm 2. In Algorithm 2, P keeps the potentially frequent itemsets of a segment of n1 transactions, and F keeps the potentially frequent itemsets of all n transactions received so far. At run time, some potentially infrequent itemsets may exist in P (F). An itemset in P (F) is an entry (a pair of itemset and count). We discuss the size of P (F) in terms of the number of entries, denoted |P| (|F|). Obviously, |P| ≤ |F|. The size |P| (|F|) can possibly be larger than n1 (n). For example, suppose that we receive two transactions, ..., t1, t2, ..., where t1 = {1, 2} and t2 = {2, 3}. The possible potentially frequent itemsets can be {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, and {1, 2, 3}. Because P (F) may contain potentially infrequent patterns, a theoretical upper bound on P (F) is difficult to determine, due to the exponential explosion of itemsets. In this paper, we establish an empirical upper bound on |F| (with |P| ≤ |F|) using the Chernoff bound. We show that the empirical upper bound on |F|, uF, can be determined as a factor of n0, that is,

uF = cu · n0

such that |F| ≤ uF. Here, n0 is determined by the Chernoff bound (Eq. (3)). The empirical upper bound uF is determined as follows. First, let Fmax denote the largest |F| for a given minimum support s in the process of frequent itemsets mining. Here, Fmax is the number of entries used for processing transactions up to the current n transactions (n ≥ n1). We obtained different Fmax values using T10.I4.D1000K and T15.I6.D1000K, by varying s and δ. In Table 9, due to space limits, we only show the results

UN C

531 532 533 534 535 536 537

u F ¼ c u  n0

RE CT

ED

PR

OO F

507 508 509 510 511 512

0.1 0.2 0.4 0.6 0.8 1.0

7991 3996 1998 1332 999 799

T10.I4.D1000K

T15.I6.D1000K

Fmax

cmax

Fmax

cmax

59,385 6874 875 233 71 20

0.00006 0.00005 0.00006 0.00005 0.00004 0.00002

454,092 19,690 1722 660 245 102

0.00045 0.00015 0.00011 0.00014 0.00012 0.00010

INS 7284 28 November 2005 Disk Used

No. of Pages 30

ARTICLE IN PRESS

J.X. Yu et al. / Information Sciences xxx (2005) xxx–xxx

bmax ¼ Fmax =n0 ¼ ðcmax =s3 Þ=n0

549

ED

Some bmax values are shown in Table 10 for different minimum supports. Several points can be made: (i) bmax increases while the minimum support s decreases. (ii) bmax can be greater than, equal to, or less than 1. Third, for determining the empirical upper bound of jFj for different data streams, D1 ; D2 ; . . ., the above finding suggests that we can select the largest cmax, cmax , to determine the largest bmax value using a representative data stream Dr . For example, T15.I6.D1000K is the representative in comparison with T10.I4.D1000K, because T15.I6.D1000K has a larger transaction size and a larger maximal potentially frequent itemsets than T10.I4.D1000K. As future work, we will further study the issues related to the representative data streams. In Table 9, cmax ¼ 0:00045. Alternatively, we can determine cmax based on a regression line among cmax values. Consequently, we can determine the largest bmax value,  bmax , for a transactional data stream (Di ), that is represented by the representative data stream (Dr ), with an arbitrary minimum support s.

RE CT

550 551 552 553 554 555 556 557 558 559 560 561 562 563 564

OO F

with d = 0.1. We find that cmax ¼ Fmax =s3 is about the same for different minimum support values (s), for a data stream with a given d. The cmax value obtained from T15.I6.D1000K is larger than the cmax value obtained from T10.I4.D1000K, because the average of transaction size and the maximal potentially frequent itemsets of T15.I6.D1000K are larger than those of T10.I4.D1000K. Note: when d decreases (higher reliability), cmax increases a little. For example, when s = 0.1% and d = 0.01, cmax = 0.0001 and cmax = 0.00072 for T10.I4.D1000K and T15.I6.D1000K, respectively. Second, based on our finding, consider Fmax = bmax Æ n0 for a given minimum support s, then, we have

PR

538 539 540 541 542 543 544 545 546 547

19

 bmax ¼ ðcmax =s3 Þ=n0

566

OR

bmax . 567 Finally, we identify cu ¼ 

UN C

Table 10 The bmax for Table 9 s (%)

n0

bmax (T10.I4)

bmax (T15.I6)

0.1 0.2 0.4 0.6 0.8 1.0

7991 3996 1998 1332 999 799

7.43 1.72 0.44 0.17 0.07 0.03

56.82 4.93 0.86 0.50 0.25 0.13

INS 7284 28 November 2005 Disk Used

20

No. of Pages 30

ARTICLE IN PRESS

J.X. Yu et al. / Information Sciences xxx (2005) xxx–xxx

568 Remark 4. For mining frequent patterns from a transactional data stream, the 569 number of entries in F can empirically be bounded by cu Æ n0 where cu is 570 selected using a representative data stream.

OO F

Remark 4 is important because it states that in fact the number of potential frequent itemsets can be possibly bounded by cu Æ n0. In addition, cu is a considerably small constant, and is not necessarily related with the domain of I unique items. Recall the number of potential frequent itemsets can be up to 2I for a transactional data stream in a domain of I unique items. In other words, it states that the memory required for jFj is possible to be multiplication of n0 (linearity). The value cu is used as a way to determine pruning (line 7) in Algorithm 2. In addition, we are able to do eager pruning. There are patterns that we can possibly prune if the running error n > s in terms of n observations. It is based on Definition 1. A pattern X is potential infrequent if supðX Þ=n < s  n . Because supðX Þ P 0, n < s means no patterns can be pruned.

PR

571 572 573 574 575 576 577 578 579 580 581

ED

582 Remark 5. Based on Algorithm 2, the empirical upper bound for transactional 583 data streams is O(1/s3) if cu is selected from a representative data stream with a 584 fixed d.

RE CT

585 Remark 5 is based on uF = cu Æ n0 where n0 is a denominator of cu. Note: the 586 bound, (2 + 2ln(2/d))/s, for Algorithm 1 is an exact bound.

587 6. Performance study II: mining frequent itemsets

OR

We report our experimental results for frequent itemsets mining. For frequent itemsets mining, we implemented our false-negative oriented algorithm FDPM (Algorithm 2). The idea of probabilistic-inplace is also used in Algorithm 2. For comparison purposes, we implemented Manku and Motwanis falsepositive oriented three module system BTS (Buffer–Trie–SetGen). The Apriori implementation we used is available from, http://fuzzy.cs.uni-magdeburg.de/borgelt/software.html#assoc, which is used in many commercial data mining tools. Its version is 4.07. For testing frequent itemsets mining, we generate transactional data streams using IBM data generator [12]. We mainly use two datasets, T10.I4.D1000K and T15.I6.D1000K with 10K unique items (as default). We process transactional data in batches. The size of a batch is 50,000 transactions. The parameter k used in FDPM and b used in BTS are adjusted accordingly.

UN C

588 589 590 591 592 593 594 595 596 597 598 599 600

601 6.1. Effect of minimum support 602 We fix  = s/10 and d = 0.1, and vary s from 0.1% to 1.0%. Fig. 4(a) and (b) 603 show memory consumption and CPU for T10.I4.D1000K, and Fig. 4(c) and

INS 7284 28 November 2005 Disk Used

No. of Pages 30

ARTICLE IN PRESS

J.X. Yu et al. / Information Sciences xxx (2005) xxx–xxx 1000000

1000

(a)

100

10 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Support (%)

0

(b)

CPU (second)

10000000 Memory

1000000 100000 10000 1000 100

(c)

50

BTS FDPM Bound

BTS FDPM Bound

10 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Support (%)

0.1

800 700 600 500 400 300 200 100 0 0.1

(d)

0.2

0.3

0.4

OO F

Memory

10000

BTS FDPM

0.5 0.6 0.7 Support (%)

0.8

0.9

1.0

BTS FDPM

PR

CPU (second)

150

100000

100

21

0.2

0.3

0.4

0.5 0.6 0.7 Support (%)

0.8

0.9

1.0

OR

RE CT

(d) show memory consumption and CPU for T15.I6.D1000K. Recall memory consumption is the number of counters. In addition to BTS and FDPM, we show our empirical bounds (Bound) of FDPM, which is computed by cu Æ n0 and cu is computed using cmax ¼ 0:00045. As shown in Fig. 4, FDPM significantly outperforms BTS. In the worst case, when s = 0.1%, FDPM only consumes 59,385 entries for T10.I4.D1000K and 454,092 entries for T15.I6.D1000K, whereas BTS consumes 259,581 and 2,373,968, accordingly. In the best case, when s = 1.0%, FDPM consumes only 20 entries for T10.I4.D1000K and 102 entries for T15.I6.D1000K, whereas BTS consumes 16,218 and 53,767, accordingly. FDPM significantly outperforms BTS for both memory consumption and CPU cost. Fig. 4(a) and (c) show that the memory consumption is bounded by our empirical bound cu Æ n0. Table 11(a) and (b) show the recall and precision for Fig. 4. Here, FDPM achieves high recall (at least 99%) and ensures 100% precision. Table 11 shows recall and precision by rounding to two places of decimal. The details about the missing itemsets are given in Table 12. The column Total and Missed show the total number of itemsets being found and the total number of missing itemsets. The column Max and Min show the max s% and min s% for the missing itemsets. It is important to notice that it does not miss important itemsets. All the missing itemsets are in [s, s + s/10], and in fact are very close to s side. Recall BTS will include itemsets in [s  , s] where  = s/10.

UN C

604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625

ED

Fig. 4. Varying s ( = s/10, d = 0.1): (a) Mem (T10.I4.D1000K), (b) CPU (T10.I4.D1000K), (c) Mem (T15.I6.D1000K), (d) CPU (T15.I6.D1000K).

INS 7284 28 November 2005 Disk Used

22

No. of Pages 30

ARTICLE IN PRESS

J.X. Yu et al. / Information Sciences xxx (2005) xxx–xxx

Table 11 Varying s ( = s/10, d = 0.1) BTS

FDPM R

(a) T10.I4.D1000K 0.1 1 0.2 1 0.4 1 0.6 1 0.8 1 1.0 1

0.85 0.84 0.70 0.68 0.46 0.55

1 1 0.99 0.99 1 1

(b) T15.I6.D1000K 0.1 1 0.2 1 0.4 1 0.6 1 0.8 1 1.0 1

0.72 0.91 0.80 0.72 0.64 0.58

Table 12 Missing itemsets (d = 0.1) Total

(a) T10.I4.D1000K 0.1 10,595 0.2 1737 0.4 340 0.6 86 0.8 18 1.0 6

Missed

UN C

OR

(b) T15.I6.D1000K 0.1 34,720 0.2 4813 0.4 916 0.6 302 0.8 103 1.0 6

1 1 0.99 0.99 0.99 1

1 1 1 1 1 1

1 1 1 1 1 1

Max

Min

14 1 2 1 0 0

0.1026 0.2059 0.4030 0.6037

0.1000 0.2059 0.4000 0.6037

45 9 7 3 1 0

0.1027 0.2069 0.4079 0.6018 0.8258

0.1002 0.2002 0.4006 0.6013 0.8258

RE CT

s (%)

P

PR

P

ED

R

OO F

s (%)

626 6.2. Effect of error control 627 We fix s = 0.1% and d = 0.1, and vary . Fig. 5(a) and (b) show memory 628 consumption and CPU for T10.I4.D1000K, and Fig. 5(c) and (d) show mem629 ory consumption and CPU for T15.I6.D1000K. The recall and precision are 630 shown in Table 13(a) and (b).

INS 7284 28 November 2005 Disk Used

No. of Pages 30

ARTICLE IN PRESS

J.X. Yu et al. / Information Sciences xxx (2005) xxx–xxx 200 CPU (second)

100

BTS FDPM

10000 0.005 0.01

(a)

BTS FDPM

0.02 0.03 Epsilon (%)

0 0.005 0.01

0.04

1000000

BTS FDPM

100000 0.005 0.01

0.02 0.03 Epsilon (%)

0.04

800 600 400 200

(d)

0.04

BTS FDPM

1000 CPU (second)

Memory

0.02 0.03 Epsilon (%)

(b)

10000000

(c)

OO F

100000

PR

Memory

1000000

23

0 0.005 0.01

0.02 0.03 Epsilon (%)

0.04

Table 13 Varying  (s = 0.1%, d = 0.1)  (%)

BTS

(b) T15.I6.D1000K 0.040 0.030 0.020 0.010 0.005

1 1 1 1 1

OR

1 1 1 1 1

FDPM R

P

0.32 0.42 0.58 0.85 0.93

1 1 1 1 1

1 1 1 1 1

0.16 0.27 0.48 0.72 0.85

1 1 1 1 1

1 1 1 1 1

As shown in Fig. 5, our false-negative oriented algorithm FDPM is not influenced by . Both memory consumption and CPU are constant while varying . FDPM only needs 59,385 and 454,092 entries for T10.I4.D1000K and T15.I6.D1000K, respectively. However,  has great impacts on the falsepositive oriented approach BTS. Its memory consumption increases while  decreases. When  = 0.005%, BTS needs large memory to keep 360,476 entries for T10.I4.D1000K, and 3,196,445 entries for T15.I6.D1000K, and achieves 93% and 85% precision, respectively. When  = 0.04%, BTS needs

UN C

631 632 633 634 635 636 637 638

(a) T10.I4.D1000K 0.040 0.030 0.020 0.010 0.005

P

RE CT

R

ED

Fig. 5. Varying  (s = 0.1%, d = 0.1): (a) Mem (T10.I4.D1000K), (b) CPU (T10.I4.D1000K), (c) Mem (T15.I6.D1000K), (d) CPU (T15.I6.D1000K).

INS 7284 28 November 2005 Disk Used

24

J.X. Yu et al. / Information Sciences xxx (2005) xxx–xxx

small memory to keep 86,537 entries for T10.I4.D1000K, and 659,233 entries for T15.I6.D1000K. But, BTS can only have 32% precision, and 16% precision for T10.I4.D1000K and T15.I6.D1000K, respectively. In sequent, BTS faces a dilemma: a little increase of  will make the number of false-positive items large, and a little decrease of  will make memory consumption large.

OO F

639 640 641 642 643

No. of Pages 30

ARTICLE IN PRESS

644 6.3. Effect of reliability control

ED

PR

We fix s = 0.1% and  = s/10, and vary d. We compare FDPM with BTS, and show results in Fig. 6. As expected, varying d does not affect BTS, because it treats d = 0. As shown in Fig. 6, while the reliability increases (smaller d), the memory consumption of FDPM increases, because it uses d to approximate the memory consumption. Even when d = 0.0001, the memory consumption of FDPM is much smaller than the memory consumption of BTS. As shown in Table 14(a), for T10.I4.D1000K, FDPM achieves 100% recall and 100% precision, even with d = 0.1, while BTS achieves 100% recall and 85% precision. Table 14(b) shows that, for T10.I4.D1000K, FDPM achieves 99% recall and 300000

2.5e+06

250000

Memory

2e+06

200000

1.5e+06

150000 100000 50000

(a)

RE CT

Memory

1e+06

500000

BTS

FDPM 0 0.0001

0.0010

0.0100

0.1000

Delta

(b)

BTS

0 FDPM 0.0001

0.0010

0.0100

0.1000

Delta

Fig. 6. Varying d: (a) Mem (T10.I4.D1000K), (b) Mem (T15.I6.D1000K).

d

OR

Table 14 Varying d

BTS

FDPM

R

P

R

P

(a) T10.I4.D1000K 0.1 0.01 0.001 0.0001

1 1 1 1

0.85 0.85 0.85 0.85

1 1 1 1

1 1 1 1

(b) T15.I6.D1000K 0.1 0.01 0.001 0.0001

1 1 1 1

0.72 0.72 0.72 0.72

0.99 1 1 1

1 1 1 1

UN C

645 646 647 648 649 650 651 652 653

INS 7284 28 November 2005 Disk Used

No. of Pages 30

ARTICLE IN PRESS

J.X. Yu et al. / Information Sciences xxx (2005) xxx–xxx

25

654 100% precision, while BTS achieves 100% recall and 72% precision. FDPM out655 performs BTS.
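For intuition on how δ bounds FDPM's memory, the following minimal sketch computes the sample-size threshold n0. The closed form (2 + 2 ln(2/δ))/s is our reading of the Chernoff-based bound, not a quote of the algorithm; it does reproduce the n0 = 7991 reported in Section 6.7 for s = 0.1% and δ = 0.1, and FDPM's memory bound is cu · n0.

from math import log

def n0(s, delta):
    # Observations after which the running error has shrunk enough
    # that the number of kept entries can be bounded. Assumed form:
    # (2 + 2 ln(2/delta)) / s, which gives n0 = 7991 at s = 0.001,
    # delta = 0.1, matching the value reported in Section 6.7.
    return int((2 + 2 * log(2 / delta)) / s)

for delta in (0.1, 0.01, 0.001, 0.0001):
    print(f"delta = {delta}: n0 = {n0(0.001, delta):,}")
# n0 grows only logarithmically in 1/delta, which is why the memory
# of FDPM in Fig. 6 rises gently as delta shrinks and stays far
# below that of BTS even at delta = 0.0001.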

6.4. The impacts of data stream length

We test the impacts of the data stream length on FDPM. We fix s = 0.1% (ε = s/10, δ = 0.1), and vary the length of T15.I6.D1000K from 1000K to 9000K. The memory consumption and CPU cost are shown in Fig. 7. FDPM significantly outperforms BTS. When dealing with a 9000K data stream, BTS consumes 2,333,510 entries, whereas FDPM consumes only 269,126 entries, about 10% of that. Also, BTS requires 7545 s to process the stream, whereas FDPM needs only 1355 s. As shown in Table 15, FDPM guarantees high recall and precision (almost 100%), while the precision of BTS is only 72%.

[Fig. 7. Varying length (T15.I6.D1000K): (a) memory, (b) CPU.]

Table 15
Varying length (T15.I6.D1000K)

|D|     BTS R   BTS P   FDPM R   FDPM P
1000K   1       0.72    0.99     1
3000K   1       0.72    0.99     1
6000K   1       0.72    0.99     1
9000K   1       0.72    1        1

6.5. The impacts of unique items

We test the impacts of the domain size, 1K, 5K and 10K, using T15.I6.D1000K, where s = 0.1%, ε = s/10 and δ = 0.1. The results are shown in Fig. 8 and Table 16. When the data is dense (1K), there are many patterns, and BTS needs 10 times more CPU time than FDPM. Also, as shown in Table 16, FDPM ensures 100% precision and high recall (99%), whereas BTS can only achieve about 71% precision.

[Fig. 8. Varying domain (T15.I6.D1000K): (a) memory, (b) CPU.]

Table 16
Varying domain (T15.I6.D1000K)

|I|   BTS R   BTS P   FDPM R   FDPM P
1K    1       0.76    0.99     1
5K    1       0.71    0.99     1
10K   1       0.72    0.99     1


6.6. The impacts of data arrival order

We test several data arrival orders using T15.I6.D1000K: OO (Original Order), rO (reverse Order), RO (Random Order), SO (Segment-based random Order), FF (Frequent First), FM (Frequent Middle), and FL (Frequent Last), where s = 0.1%, ε = s/10, and δ = 0.1. As shown in Fig. 9 and Table 17, both FDPM and BTS are insensitive to the data arrival order for frequent itemset mining. FDPM achieves high recall (0.99 for most orders, 0.97 for RO and 0.96 for FL) and ensures 100% precision, while the precision of BTS is low (72%). In addition, FDPM outperforms BTS in terms of CPU time and memory consumption.

[Fig. 9. Impacts of data arrival order: (a) memory, (b) CPU.]

Table 17
Impacts of data arrival order

Data   BTS R   BTS P   FDPM R   FDPM P
OO     1       0.72    0.99     1
rO     1       0.72    0.99     1
RO     1       0.72    0.97     1
SO     1       0.72    0.99     1
FF     1       0.72    0.99     1
FM     1       0.72    0.99     1
FL     1       0.72    0.96     1

6.7. How to determine bmax

We state that the empirical upper bound for transactional data streams is cu · n0, where cu = bmax. Here, the bmax value is determined from a representative data stream. In this testing, we show bmax values under various settings; we fix δ = 0.1 below. Table 18 shows the bmax values when varying the domain size over 1K, 5K and 10K, using T10.I4.D1000K and T15.I6.D1000K. As can be seen in Table 18, a larger domain does not necessarily increase the bmax value. Even though a larger domain makes the possibility of overlap between itemsets smaller, many factors affect the randomized process of generating the datasets, and it is difficult to identify a trend when changing the domain sizes. However, compared with the changes caused by the maximal potentially frequent itemset size and by the average transaction size, the bmax values do not change significantly when the domain becomes smaller or larger.

Table 18
Determining bmax by varying domains

(a) T10.I4.D1000K
s (%)   bmax (1K)   bmax (5K)   bmax (10K)
0.1     9.68        7.78        7.43
0.2     1.96        1.47        1.72
0.4     0.43        0.65        0.44
0.6     0.48        0.46        0.17
0.8     0.53        0.29        0.07
1.0     0.55        0.17        0.03

(b) T15.I6.D1000K
s (%)   bmax (1K)   bmax (5K)   bmax (10K)
0.1     48.74       43.50       56.82
0.2     8.07        3.74        4.93
0.4     1.48        1.00        0.86
0.6     0.76        0.84        0.50
0.8     0.70        0.67        0.25
1.0     0.75        0.48        0.13

Table 19
Determining bmax by varying I

(a) T10.*.D1000K
s (%)   bmax (I4)   bmax (I6)   bmax (I8)
0.1     7.43        27.06       52.57
0.2     1.72        1.05        0.97
0.4     0.44        0.45        0.51
0.6     0.17        0.16        0.20
0.8     0.07        0.06        0.06
1.0     0.03        0.02        0.02

(b) T15.*.D1000K
s (%)   bmax (I4)   bmax (I6)   bmax (I8)
0.1     13.52       56.82       209.59
0.2     4.41        4.93        2.66
0.4     0.93        0.86        0.86
0.6     0.55        0.50        0.49
0.8     0.31        0.25        0.23
1.0     0.13        0.13        0.10

Table 19 shows the bmax values when varying the maximal potentially frequent itemset size I (I4, I6 and I8), using T10.*.D1000K and T15.*.D1000K. When s is small enough (s = 0.1%), the bmax value becomes noticeably large. This is expected, because the maximal potentially frequent itemset size becomes larger. Some details are given below. In both T10.I4.D1000K and T15.I6.D1000K, when s = 0.1%, n0 = 7991. For T10.*.D1000K, when s = 0.1%, the bmax values change from 7.43 (I4) and 27.06 (I6) to 52.57 (I8); the corresponding maximum numbers of entries increase from 59,385 (I4) and 216,211 (I6) to 420,053 (I8). For T15.*.D1000K, when s = 0.1%, the bmax values change from 13.52 (I4) and 56.82 (I6) to 209.59 (I8), and the corresponding maximum numbers of entries increase accordingly (for example, 454,092 entries for I6, as reported in Section 6.2). However, we also observe that when s is not small enough (for example, s = 0.2% in Table 19), bmax does not always become larger as the maximal potentially frequent itemset size grows, because a larger maximal potentially frequent itemset size does not necessarily yield more frequent itemsets when s is not small.

Table 20
Determining bmax by varying T

(a) *.I6.D1000K
s (%)   bmax (T10)   bmax (T12)   bmax (T15)
0.1     27.06        38.20        56.82
0.2     1.05         2.08         4.93
0.4     0.45         0.62         0.86
0.6     0.16         0.28         0.50
0.8     0.06         0.11         0.25
1.0     0.02         0.04         0.13

(b) *.I8.D1000K
s (%)   bmax (T10)   bmax (T12)   bmax (T15)
0.1     52.57        90.72        209.59
0.2     0.97         1.54         2.66
0.4     0.51         0.64         0.86
0.6     0.20         0.31         0.49
0.8     0.06         0.11         0.23
1.0     0.02         0.04         0.10

Table 20 shows the bmax values when varying the average transaction size over T10, T12 and T15, using *.I6.D1000K and *.I8.D1000K. Here, the bmax value becomes larger as the average transaction size becomes larger. When the maximal potentially frequent itemset size is fixed, the bmax value becomes larger as s becomes smaller. This suggests that it is better to determine bmax using a representative data stream with a larger average transaction size, a reasonably large maximal potentially frequent itemset size, and a small minimum support s.
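As a quick consistency check, the entry bounds above are just cu · n0 with cu = bmax. The following minimal sketch, reusing the assumed n0 formula from the sketch in Section 6.3 (the function name is ours), reproduces them up to the two-decimal rounding of bmax.

from math import log

def fdpm_entry_bound(b_max, s, delta):
    # Empirical memory bound c_u * n0 with c_u = b_max.
    n0 = int((2 + 2 * log(2 / delta)) / s)  # 7991 for s = 0.001, delta = 0.1
    return int(b_max * n0)

# b_max values from Table 19, T10.*.D1000K, s = 0.1%, delta = 0.1:
for label, b_max in (("I4", 7.43), ("I6", 27.06), ("I8", 52.57)):
    print(label, fdpm_entry_bound(b_max, 0.001, 0.1))
# Prints 59373, 216236 and 420086 -- close to the reported 59,385,
# 216,211 and 420,053; the small gaps come from b_max being rounded.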

7. Conclusion

In this paper, we studied the problem of mining frequent patterns from transactional data streams (FDPM). While most existing algorithms for mining frequent items over data streams use false-positive oriented approaches to control the error on the estimated frequency of the mined patterns and the memory requirement, we explored a new paradigm for FDPM: the false-negative oriented approach. That is, we control the mining process by limiting the probability that a frequent pattern is missing from the result, while all mined patterns are guaranteed to be frequent. We developed both frequent-item and frequent-itemset mining algorithms based on the Chernoff bound. The bound enables us to prune infrequent patterns from the continuously arriving transactions while guaranteeing the required recall rate of frequent patterns. The performance study demonstrated the effectiveness and efficiency of our false-negative oriented approach, which uses a running error, εn, to prune infrequent item(set)s and uses δ to control the memory space. Although the Chernoff bound makes assumptions about the underlying distribution of the data, our performance study indicated that the bound is surprisingly effective even when the data does not strictly follow these assumptions. One direction of our immediate future work is to further study the data distribution issues and to explore other possible theoretical bounds for frequent data stream pattern mining.


Acknowledgment

The work described in this paper was supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (CUHK4229/01E).

References

[1] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proceedings of the 20th International Conference on Very Large Data Bases, 1994, pp. 487–499.
[2] N. Alon, Y. Matias, M. Szegedy, The space complexity of approximating the frequency moments, in: Proceedings of ACM STOC, 1996.
[3] M. Charikar, K. Chen, M. Farach-Colton, Finding frequent items in data streams, in: Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP), 2002, pp. 693–703.
[4] H. Chernoff, A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations, The Annals of Mathematical Statistics 23 (4) (1952) 493–507.
[5] S. Cohen, Y. Matias, Spectral bloom filters, in: Proceedings of ACM SIGMOD, 2003.
[6] G. Cormode, S. Muthukrishnan, What's hot and what's not: tracking most frequent items dynamically, in: Proceedings of the 22nd ACM Symposium on Principles of Database Systems (PODS), 2003, pp. 296–306.
[7] M. Datar, A. Gionis, P. Indyk, R. Motwani, Maintaining stream statistics over sliding windows, in: Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, 2002.
[8] E. Demaine, A. López-Ortiz, J.I. Munro, Frequency estimation of internet packet streams with limited space, in: Proceedings of the 10th Annual European Symposium on Algorithms, 2002, pp. 348–360.
[9] J. Feigenbaum, S. Kannan, An approximate L1-difference algorithm for massive data streams, in: IEEE Symposium on Foundations of Computer Science, 1999.
[10] P. Flajolet, G.N. Martin, Probabilistic counting algorithms, Journal of Computer and System Sciences 31 (1985) 182–209.
[11] M. Garofalakis, J. Gehrke, R. Rastogi, Querying and mining data streams: you only get one look, tutorial in: the 28th International Conference on Very Large Data Bases, 2002.
[12] G.S. Manku, R. Motwani, Approximate frequency counts over data streams, in: Proceedings of the 28th International Conference on Very Large Data Bases, 2002, pp. 346–357.
[13] H. Toivonen, Sampling large databases for association rules, in: Proceedings of the 22nd International Conference on Very Large Data Bases, 1996, pp. 134–145.
[14] J.S. Vitter, Random sampling with a reservoir, ACM Transactions on Mathematical Software (TOMS) 11 (1) (1985) 37–57.
[15] J.X. Yu, Z. Chong, H. Lu, A. Zhou, False positive or false negative: mining frequent itemsets from high speed transactional data streams, in: Proceedings of the 30th International Conference on Very Large Data Bases, 2004.
