3rd International Conference on Advanced Computing & Communication Technologies, November 08-09, 2008, APIIT, Panipat, India.

Development of New Structure for Frequent Pattern Mining

Upasna Singh, G.C. Nandi
Indian Institute of Information Technology, Allahabad
{upasnasingh, gcnandi}@iiita.ac.in

Abstract: In this paper we develop a novel data structure called SH-Struct (Soft-Hyperlinked Structure), which mines the complete set of frequent itemsets using the SH-Mine algorithm. The algorithm enables frequent pattern mining at different supports. SH-Struct is based on building an SH-Tree, which extends the idea of H-Struct to improve storage compression and allow very fast frequent pattern mining. The algorithm has been tested extensively on various datasets, and the experimental analysis shows that it outperforms the FP-growth (Frequent Pattern growth) algorithm in terms of both space and time.

I. INTRODUCTION

Association rules were introduced by Agrawal et al. [1] to discover patterns in supermarket transaction data. Successive refinements, generalizations and improvements have been discussed in [2],[3],[4],[5],[6],[7],[8],[9]. One of the most popular and fastest algorithms for frequent itemset mining is the FP-growth algorithm [2],[10],[11]. A close relative of this approach is the H-mine algorithm [3], which uses a simpler data structure called H-Struct (Hyper-Linked Structure) that is better than the FP-tree in terms of space compression [3]. The FP-growth method uses the Apriori property but does not generate candidate sets: it recursively partitions the database into sub-databases, finds the frequent patterns in each, and then assembles longer patterns by searching the local frequent patterns [8]. H-Struct, on the other hand, is used for fast mining of temporal datasets; it has polynomial space complexity and is therefore more space efficient than other pattern-growth methods such as FP-growth and Tree-Projection when mining sparse datasets [3]. Since both algorithms generate frequent patterns quickly, several researchers have combined them into hybrid structures, such as H-mine using FP-growth [12], to extract the advantages of both. In the present investigation we propose a new algorithm, SH-Mine, which uses SH-Struct to build an SH-Tree and find frequent patterns. The underlying goal is to save both time and space on sparse as well as dense datasets.

The presentation is arranged as follows: Section II introduces the basic concepts of association rule mining and frequent pattern mining, Section III describes the SH-Tree and SH-Mine algorithms for mining frequent patterns, Section IV experimentally evaluates the performance of SH-Mine by comparing its results with those of the FP-growth algorithm, and Section V presents conclusions and future work.

II. BASIC CONCEPTS

A. Association Rule Mining

Association rule mining is one of the most popular data mining techniques. Its aim is to find useful and interesting patterns in a transaction database. Such a database resembles a market-basket database, consisting of a transaction id and a set of items per transaction. An association rule is written X→Y, where X and Y are two disjoint subsets of the items available in the database. The significance of a rule depends on two measures, support and confidence: the support of the rule X→Y is P(X∪Y), and its confidence is P(Y|X). The task of association rule mining is to find all strong association rules that satisfy minimum support and minimum confidence thresholds.
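The two rule measures can be computed directly from a transaction database. The following Python sketch illustrates them on a small hypothetical basket database (the item names are ours, chosen for illustration, not taken from the paper):

```python
# Hypothetical transaction database (illustrative items only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, db):
    """P(itemset): fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(X, Y, db):
    """P(Y|X) = support(X u Y) / support(X) for the rule X -> Y."""
    return support(set(X) | set(Y), db) / support(set(X), db)

print(support({"bread", "milk"}, transactions))     # 0.5 (2 of 4 transactions)
print(confidence({"bread"}, {"milk"}, transactions))  # 2/3 = 0.666...
```

A rule is "strong" when both printed values meet the user-chosen minimum support and minimum confidence thresholds.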

B. Frequent Pattern Mining

The frequent pattern mining problem was introduced by Agrawal et al. [1] for mining association rules between sets of items. Since then, hundreds of research papers have been published presenting new algorithms that solve these mining problems more efficiently. One of the most popular algorithms for frequent pattern mining is FP-growth (Frequent Pattern growth). Its main idea is to use an extended prefix-tree (FP-tree) structure to store the database in compressed form. It adopts a divide-and-conquer approach to decompose the mining tasks and the database, and it avoids the costly process of candidate generation by using a pattern-fragment growth method.

III. SH-STRUCT

The present study develops a new data structure called SH-Struct. This structure does two things: it builds the SH-Tree, and it maintains an H-Struct at each level of the tree for every header node. It finds the complete set of frequent patterns, as FP-growth does, but the prefix tree is built differently; in our algorithm the prefix tree is called the SH-Tree. We use the sample database given in TABLE I to illustrate the construction of the SH-Tree. In this database the first two columns are given, and the third column, 'Frequent Projected Items', is created on the condition that support(item) ≥ minimum_support_threshold. Suppose that for the given database minimum_support_threshold = 2; all transactions are then updated accordingly. After that, a list called Freq_List, shown in TABLE II, is created, containing each item and its support throughout the database.
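The construction of the 'Frequent Projected Items' column can be sketched as follows; the code uses the TABLE I data with the paper's threshold of 2:

```python
from collections import Counter

# Sample database of TABLE I.
db = {
    "T001": ["a", "b", "e", "g"],
    "T002": ["b", "d"],
    "T003": ["b", "c"],
    "T004": ["a", "b", "d"],
    "T005": ["a", "c", "h"],
    "T006": ["b", "c", "f"],
    "T007": ["a", "c"],
    "T008": ["a", "b", "c", "e"],
    "T009": ["a", "b", "c"],
}

def project(db, min_support=2):
    """Drop from every transaction the items whose support is below the threshold."""
    counts = Counter(item for items in db.values() for item in items)
    return {tid: [i for i in items if counts[i] >= min_support]
            for tid, items in db.items()}

projected = project(db)
print(projected["T001"])  # ['a', 'b', 'e']  ('g' occurs only once)
print(projected["T005"])  # ['a', 'c']       ('h' occurs only once)
```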


TABLE I
SAMPLE DATABASE

Transactions    Items       Frequent Projected Items
T001            a,b,e,g     a,b,e
T002            b,d         b,d
T003            b,c         b,c
T004            a,b,d       a,b,d
T005            a,c,h       a,c
T006            b,c,f       b,c
T007            a,c         a,c
T008            a,b,c,e     a,b,c,e
T009            a,b,c       a,b,c

TABLE II
FREQ_LIST

Item       a    b    c    d    e
Support    6    7    6    2    2
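The supports in TABLE II can be reproduced with a single counting pass over the projected transactions:

```python
from collections import Counter

# Frequent projected items (third column of TABLE I).
projected = [
    ["a", "b", "e"], ["b", "d"], ["b", "c"], ["a", "b", "d"], ["a", "c"],
    ["b", "c"], ["a", "c"], ["a", "b", "c", "e"], ["a", "b", "c"],
]

freq_list = Counter(item for items in projected for item in items)
print(sorted(freq_list.items()))
# [('a', 6), ('b', 7), ('c', 6), ('d', 2), ('e', 2)]  (matches TABLE II)
```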

A. SH-Tree Algorithm

To create the SH-Tree, we first take an empty root node, considered level 0 of the tree, and then scan the transaction database in the same way as FP-growth does. A list of header nodes is created for each transaction. For example, for the sample database in TABLE I the header nodes obtained from the third column after scanning each transaction are 'a' and 'b'. The children of the empty root node are therefore 'a' and 'b', which form the nodes of the tree at level 1. Each transaction is then scanned once again, and the items that follow the headers 'a' and 'b' become the children of those headers at the next level. As FIGURE I shows, header 'a' is followed by items 'b' and 'c' in transactions T001, T004, T008, T009 and T005 respectively, so 'b' and 'c' are the children of header 'a' at level 2. We proceed in the same way for the other header nodes. At each level we maintain an H-Struct, on the basis of which we partition the database according to the headers and then maintain each level according to its sub-databases. The overall process is given by the pseudo code of SH-Tree:

Build_SH_Tree(t)
{
    create_freq_list();
    while (!EOF)
    {
        get_transaction(t);
        if (t.node == header_node)
            Insert_header_nodes(root, item);
        else
            Insert_nodes(t, item);
    }
}

Insert_header_nodes(tree *root, item)
{
    if (t.root == NULL)
        t.root = newnode;
    else
        t.root->next = newnode;
    support(item) += 1;
}

Insert_nodes(tree *t, item)
{
    if (t.child == NULL)
        t.child = newnode;
    else
        t.child->next = newnode;
    support(item) += 1;
}
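The level-by-level construction can be approximated by an ordinary prefix tree over the projected transactions. The Python sketch below is a simplification of the structure described above: it omits the per-level H-Struct bookkeeping and keeps only items, supports and children.

```python
class Node:
    """One SH-Tree node: an item, its support count, and its children."""
    def __init__(self, item=None):
        self.item = item
        self.support = 0
        self.children = {}  # item -> Node

def build_sh_tree(projected):
    root = Node()                  # level 0: the empty root
    for items in projected:
        node = root
        # The first item of each transaction ends up as a child of the
        # root, i.e. a header node at level 1; later items become
        # children at deeper levels.
        for item in items:
            child = node.children.setdefault(item, Node(item))
            child.support += 1
            node = child
    return root

# Projected transactions from TABLE I.
projected = [
    ["a", "b", "e"], ["b", "d"], ["b", "c"], ["a", "b", "d"], ["a", "c"],
    ["b", "c"], ["a", "c"], ["a", "b", "c", "e"], ["a", "b", "c"],
]
root = build_sh_tree(projected)
print(sorted(root.children))                     # ['a', 'b'] (header nodes)
print(root.children["a"].children["b"].support)  # 4 (T001, T004, T008, T009)
```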

FIGURE I SH-TREE

B. SH-Mine Algorithm

Mining of frequent patterns in the SH-Tree starts from the root of the tree and proceeds recursively. A pattern is frequent if, at each level, the items (nodes) of the pattern satisfy the condition support(item) ≥ minimum_support_threshold. For example, from FIGURE I one of the frequent patterns is {a,b,c}; item 'e' does not satisfy the condition, so mining along that branch stops at level 3. Other patterns are mined from the SH-Tree in the same manner. At each level a frequent_item_list is maintained to hold the frequent patterns. The procedure is given by the pseudo code of the SH-Mine algorithm:


SH_Mine(t)
{
    Let L(i) be level i of SH-Tree t.
    Start from a header node at level 1;
    frequent_item_list = item at level 1;
    for (i = 1; i <= total number of levels; i++)
        if (support(item at level L(i+1)) >= min_support_threshold)
            frequent_item_list(i+1) = frequent_item_list(i), item at L(i+1);
    display frequent_item_list;
    Repeat the process for all header nodes at level 1.
}

Using the above algorithm, the frequent patterns for our example are obtained as {b,c}, {a,c} and {a,b,c}.
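A minimal sketch of this traversal over the prefix-tree simplification (nested dicts, without the H-Struct links) looks as follows; it reports every tree path whose nodes all meet the threshold, which includes the example patterns above:

```python
def build(projected):
    """Prefix tree as nested dicts: item -> [support, children]."""
    root = {}
    for items in projected:
        node = root
        for item in items:
            entry = node.setdefault(item, [0, {}])
            entry[0] += 1
            node = entry[1]
    return root

def sh_mine(children, min_support=2, prefix=()):
    """Extend the current pattern only while each level meets the threshold."""
    patterns = []
    for item, (support, grandchildren) in children.items():
        if support >= min_support:
            pattern = prefix + (item,)
            patterns.append(pattern)
            patterns.extend(sh_mine(grandchildren, min_support, pattern))
    return patterns

# Projected transactions from TABLE I.
projected = [
    ["a", "b", "e"], ["b", "d"], ["b", "c"], ["a", "b", "d"], ["a", "c"],
    ["b", "c"], ["a", "c"], ["a", "b", "c", "e"], ["a", "b", "c"],
]
patterns = sh_mine(build(projected))
print(("a", "b", "c") in patterns)  # True; 'e' fails the threshold, so the branch stops
```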

IV. PERFORMANCE STUDY WITH EXPERIMENTAL RESULTS

This performance study evaluates the efficiency and scalability of the SH-Mine algorithm. Our experimental results give a comparative analysis of SH-Mine and FP-growth, showing that SH-Mine often outperforms FP-growth and sometimes performs on par with it. The experiments were run on a Pentium IV machine with 1 GB of main memory and an 80 GB hard disk under the MS Windows/NT operating system. Both SH-Mine and FP-growth were implemented in Dev-C++ 4.9.9.2 using the gcc compiler. We compared the algorithms on various datasets, but for reasons of space only results for representative datasets are shown in this study.

A. Scalability Analysis

Our experiments measure the scalability of the SH-Mine algorithm with respect to the number of transactions and with respect to support. FIGURE II and FIGURE III show the comparative behaviour of both algorithms on the Market Basket (MB) dataset, a sparse dataset taken from [13], with respect to transactions and support respectively.

FIGURE II SCALABILITY WITH RESPECT TO NUMBER OF TRANSACTIONS FOR MARKET BASKET (MB) DATASET

FIGURE III SCALABILITY WITH RESPECT TO SUPPORT FOR MARKET BASKET (MB) DATASET

This analysis shows that our algorithm is efficient for sparse datasets, especially when the minimum support threshold is low. A similar analysis is shown in FIGURE IV and FIGURE V for the Synthetic dataset, a dense dataset taken from the UCI AI repository. Our algorithm works efficiently on sparse datasets in terms of both run time and space; on dense datasets it sometimes performs on par with FP-growth, as FIGURE V shows, but it remains more space efficient.

FIGURE IV SCALABILITY WITH RESPECT TO NUMBER OF TRANSACTIONS FOR SYNTHETIC DATASET

FIGURE V SCALABILITY WITH RESPECT TO SUPPORT FOR SYNTHETIC DATASET



B. Memory Usage Analysis

We also compared the memory usage of both algorithms on the same datasets. FIGURE VI and FIGURE VII clearly show that SH-Mine is far better than FP-growth in terms of space. One of the main reasons is the use of H-Struct while creating the SH-Tree, since it was already shown in [3],[4] that H-Struct is more space efficient than the FP-tree.

FIGURE VI MEMORY ANALYSIS FOR MARKET BASKET (MB) DATASET

FIGURE VII MEMORY ANALYSIS FOR SYNTHETIC DATASET

V. CONCLUSION AND FUTURE WORK

We have developed a novel data structure called SH-Struct for mining frequent patterns. The experimental results show that the algorithm is scalable and efficient in terms of both space and time across different datasets. We have benchmarked the results against the well-known FP-growth algorithm, one of the most efficient frequent pattern mining algorithms. In future work we aim to improve the performance of our algorithm on highly dense datasets. The algorithm will also be used to generate temporal association rules for both sparse and dense datasets.

REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases", Proc. ACM SIGMOD Conf. on Management of Data, pp. 207-216, New York, NY, USA, 1993.
[2] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation", Proc. ACM SIGMOD Int'l Conference on Management of Data, 2000.
[3] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang, "H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases", Proc. IEEE Int'l Conference on Data Mining, San Jose, CA, November 2001.
[4] Q. Wan and A. An, "Efficient Mining of Indirect Associations Using HI-Mine", Proc. 16th Conference of the Canadian Society for Computational Studies of Intelligence, AI 2003, Halifax, Canada, June 2003.
[5] R. Bayardo Jr., R. Agrawal, and D. Gunopulos, "Constraint-Based Rule Mining in Large, Dense Databases", Proc. 15th Int'l Conf. on Data Engineering, pp. 188-197, 1999.
[6] W. Cheung and O. R. Zaiane, "Incremental Mining of Frequent Patterns without Candidate Generation or Support Constraint", Proc. 7th International Database Engineering and Applications Symposium (IDEAS'03), p. 111, 2003.
[7] Y. G. Sucahyo and R. Gopalan, "CT-PRO: A Bottom-Up Non-Recursive Frequent Itemset Mining Algorithm Using Compressed FP-Tree Data Structure", Proc. IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI), Brighton, UK, 2004.
[8] M. Song and S. Rajasekaran, "A Transaction Mapping Algorithm for Frequent Itemsets Mining", IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 4, pp. 472-481, April 2006.
[9] Q. Wan and A. An, "HI-Mine*: Efficient Indirect Association Discovery Using Compact Transaction Database", Proc. IEEE International Conference on Granular Computing (GrC'06), Atlanta, USA, May 10-12, 2006.
[10] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2001.
[11] C. Borgelt, "Keeping Things Simple: Finding Frequent Item Sets by Recursive Elimination", Proc. 1st International Workshop on Open Source Data Mining: Frequent Pattern Mining Implementations, Chicago, Illinois, pp. 66-70, 2005.
[12] K. Verma and O. P. Vyas, "Efficient Calendar Based Temporal Association Rule", ACM SIGMOD Record, vol. 34, no. 3, pp. 63-71, 2005.
[13] C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases, Dept. of Information and Computer Science, University of California at Irvine, CA, USA, 1998.

