3rd International Conference on Advanced Computing & Communication Technologies, November 08-09, 2008, APIIT, Panipat, India.

Development of New Structure for Frequent Pattern Mining

Upasna Singh, G.C. Nandi
Indian Institute of Information Technology, Allahabad
{upasnasingh, gcnandi}@iiita.ac.in

Abstract: In this paper we develop a novel data structure called SH-Struct (Soft-Hyperlinked Structure), which mines the complete set of frequent itemsets using the SH-Mine algorithm. The algorithm enables frequent pattern mining at different supports. SH-Struct is based on building an SH-Tree, which extends the idea of H-Struct to improve storage compression and allow very fast frequent pattern mining. The algorithm has been tested extensively on various datasets, and the experimental analysis shows that it outperforms the FP-growth (Frequent Pattern growth) algorithm in terms of both space and time.

I. INTRODUCTION

Association rules were introduced by Agrawal et al. [1] to discover patterns in supermarket transaction data. Successive refinements, generalizations and improvements have been discussed in [2],[3],[4],[5],[6],[7],[8],[9]. One of the most popular and fastest algorithms for frequent itemset mining is the FP-growth algorithm [2],[10],[11]. A close relative of this approach is the H-mine algorithm [3], which uses a simpler data structure called H-Struct (Hyper-Linked Structure) that is better than the FP-tree in terms of space compression [3]. The FP-growth method uses the Apriori property but does not generate candidate sets: it recursively partitions the database into sub-databases, finds the frequent patterns in each, and then assembles longer patterns by searching the local frequent patterns [8]. H-Struct, on the other hand, is used for fast mining of temporal datasets; it has polynomial space complexity and is therefore more space efficient than other pattern-growth methods such as FP-growth and Tree-Projection when mining sparse datasets [3]. Since both algorithms generate frequent patterns quickly, several researchers have combined them into hybrid structures, such as H-mine using FP-growth [12], to extract the advantages of both. In the present investigation we propose a new algorithm, SH-Mine, which uses SH-Struct to build an SH-Tree and find frequent patterns. The underlying goal is to save both time and space on sparse as well as dense datasets.

The presentation is arranged as follows: Section II introduces the basic concepts of association rule mining and frequent pattern mining, Section III describes the SH-Tree and SH-Mine algorithms for mining frequent patterns, Section IV experimentally evaluates the performance of SH-Mine by comparing its results with those of the FP-growth algorithm, and Section V presents conclusions and future work.

II. BASIC CONCEPTS

A. Association Rule Mining

Association rule mining is one of the most popular data mining techniques. Its aim is to find useful and interesting patterns in a transaction database. Such a database resembles a market-basket database, consisting of a transaction id and a set of items per transaction. An association rule is written X→Y, where X and Y are two disjoint subsets of the items available in the database. The significance of a rule depends on two measures, support and confidence: the support of the rule X→Y is P(X∪Y), and its confidence is P(Y|X). The task of association rule mining is to find all strong association rules that satisfy minimum support and minimum confidence thresholds.
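The two rule measures can be computed directly from a transaction database. The following Python sketch illustrates them on a small hypothetical basket database (the item names are ours, chosen for illustration, not taken from the paper):

```python
# Hypothetical transaction database (illustrative items only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, db):
    """P(itemset): fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(X, Y, db):
    """P(Y|X) = support(X u Y) / support(X) for the rule X -> Y."""
    return support(set(X) | set(Y), db) / support(set(X), db)

print(support({"bread", "milk"}, transactions))     # 0.5 (2 of 4 transactions)
print(confidence({"bread"}, {"milk"}, transactions))  # 2/3 = 0.666...
```

A rule is "strong" when both printed values meet the user-chosen minimum support and minimum confidence thresholds.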

B. Frequent Pattern Mining

The frequent pattern mining problem was introduced by Agrawal et al. [1] for mining association rules between sets of items. Since then, hundreds of research papers have been published presenting new algorithms that solve these mining problems more efficiently. One of the most popular algorithms for frequent pattern mining is FP-growth (Frequent Pattern growth). Its main idea is to use an extended prefix-tree (FP-tree) structure to store the database in compressed form. It adopts a divide-and-conquer approach to decompose the mining tasks and the database, and it avoids the costly process of candidate generation by using a pattern-fragment growth method.

III. SH-STRUCT

The present study develops a new data structure called SH-Struct. This structure does two things: it builds the SH-Tree, and it maintains an H-Struct at each level of the tree for every header node. It finds the complete set of frequent patterns, as FP-growth does, but the prefix tree is built differently; in our algorithm the prefix tree is called the SH-Tree. We use the sample database given in TABLE I to illustrate the construction of the SH-Tree. In this database the first two columns are given, and the third column, 'Frequent Projected Items', is created on the condition that support(item) ≥ minimum_support_threshold. Suppose that for the given database minimum_support_threshold = 2; all transactions are then updated accordingly. After that, a list called Freq_List, shown in TABLE II, is created, containing each item and its support throughout the database.
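The construction of the 'Frequent Projected Items' column can be sketched as follows; the code uses the TABLE I data with the paper's threshold of 2:

```python
from collections import Counter

# Sample database of TABLE I.
db = {
    "T001": ["a", "b", "e", "g"],
    "T002": ["b", "d"],
    "T003": ["b", "c"],
    "T004": ["a", "b", "d"],
    "T005": ["a", "c", "h"],
    "T006": ["b", "c", "f"],
    "T007": ["a", "c"],
    "T008": ["a", "b", "c", "e"],
    "T009": ["a", "b", "c"],
}

def project(db, min_support=2):
    """Drop from every transaction the items whose support is below the threshold."""
    counts = Counter(item for items in db.values() for item in items)
    return {tid: [i for i in items if counts[i] >= min_support]
            for tid, items in db.items()}

projected = project(db)
print(projected["T001"])  # ['a', 'b', 'e']  ('g' occurs only once)
print(projected["T005"])  # ['a', 'c']       ('h' occurs only once)
```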


TABLE I
SAMPLE DATABASE

Transactions    Items       Frequent Projected Items
T001            a,b,e,g     a,b,e
T002            b,d         b,d
T003            b,c         b,c
T004            a,b,d       a,b,d
T005            a,c,h       a,c
T006            b,c,f       b,c
T007            a,c         a,c
T008            a,b,c,e     a,b,c,e
T009            a,b,c       a,b,c

TABLE II
FREQ_LIST

Item       a    b    c    d    e
Support    6    7    6    2    2
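The supports in TABLE II can be reproduced with a single counting pass over the projected transactions:

```python
from collections import Counter

# Frequent projected items (third column of TABLE I).
projected = [
    ["a", "b", "e"], ["b", "d"], ["b", "c"], ["a", "b", "d"], ["a", "c"],
    ["b", "c"], ["a", "c"], ["a", "b", "c", "e"], ["a", "b", "c"],
]

freq_list = Counter(item for items in projected for item in items)
print(sorted(freq_list.items()))
# [('a', 6), ('b', 7), ('c', 6), ('d', 2), ('e', 2)]  (matches TABLE II)
```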

A. SH-Tree Algorithm

To create the SH-Tree, we first take an empty root node, considered level 0 of the tree, and then scan the transaction database in the same way as FP-growth does. A list of header nodes is created for each transaction. For example, for the sample database in TABLE I the header nodes obtained from the third column after scanning each transaction are 'a' and 'b'. The children of the empty root node are therefore 'a' and 'b', which form the nodes of the tree at level 1. Each transaction is then scanned once again, and the items that follow the headers 'a' and 'b' become the children of those headers at the next level. As FIGURE I shows, header 'a' is followed by items 'b' and 'c' in transactions T001, T004, T008, T009 and T005 respectively, so 'b' and 'c' are the children of header 'a' at level 2. We proceed in the same way for the other header nodes. At each level we maintain an H-Struct, on the basis of which we partition the database according to the headers and then maintain each level according to its sub-databases. The overall process is given by the pseudo code of SH-Tree:

Build_SH_Tree(t)
{
    create_freq_list();
    while (!EOF)
    {
        get_transaction(t);
        if (t.node == header_node)
            Insert_header_nodes(root, item);
        else
            Insert_nodes(t, item);
    }
}

Insert_header_nodes(tree *root, item)
{
    if (t.root == NULL)
        t.root = newnode;
    else
        t.root->next = newnode;
    support(item) += 1;
}

Insert_nodes(tree *t, item)
{
    if (t.child == NULL)
        t.child = newnode;
    else
        t.child->next = newnode;
    support(item) += 1;
}
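The level-by-level construction can be approximated by an ordinary prefix tree over the projected transactions. The Python sketch below is a simplification of the structure described above: it omits the per-level H-Struct bookkeeping and keeps only items, supports and children.

```python
class Node:
    """One SH-Tree node: an item, its support count, and its children."""
    def __init__(self, item=None):
        self.item = item
        self.support = 0
        self.children = {}  # item -> Node

def build_sh_tree(projected):
    root = Node()                  # level 0: the empty root
    for items in projected:
        node = root
        # The first item of each transaction ends up as a child of the
        # root, i.e. a header node at level 1; later items become
        # children at deeper levels.
        for item in items:
            child = node.children.setdefault(item, Node(item))
            child.support += 1
            node = child
    return root

# Projected transactions from TABLE I.
projected = [
    ["a", "b", "e"], ["b", "d"], ["b", "c"], ["a", "b", "d"], ["a", "c"],
    ["b", "c"], ["a", "c"], ["a", "b", "c", "e"], ["a", "b", "c"],
]
root = build_sh_tree(projected)
print(sorted(root.children))                     # ['a', 'b'] (header nodes)
print(root.children["a"].children["b"].support)  # 4 (T001, T004, T008, T009)
```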

FIGURE I SH-TREE

B. SH-Mine Algorithm

Mining of frequent patterns in the SH-Tree starts from the root of the tree and proceeds recursively. A pattern is frequent if, at each level, the items (nodes) of the pattern satisfy the condition support(item) ≥ minimum_support_threshold. For example, from FIGURE I one of the frequent patterns is {a,b,c}; item 'e' does not satisfy the condition, so mining along that branch stops at level 3. Other patterns are mined from the SH-Tree in the same manner. At each level a frequent_item_list is maintained to hold the frequent patterns. The procedure is given by the pseudo code of the SH-Mine algorithm:


SH_Mine(t)
{
    Let L(i) be level i of SH-Tree t.
    Start from a header node at level 1;
    frequent_item_list = item at level 1;
    for (i = 1; i <= total number of levels; i++)
        if (support(item at level L(i+1)) >= min_support_threshold)
            frequent_item_list(i+1) = frequent_item_list(i), item at L(i+1);
    display frequent_item_list;
    Repeat the process for all header nodes at level 1.
}

Using the above algorithm, the frequent patterns for our example are obtained as {b,c}, {a,c} and {a,b,c}.
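A minimal sketch of this traversal over the prefix-tree simplification (nested dicts, without the H-Struct links) looks as follows; it reports every tree path whose nodes all meet the threshold, which includes the example patterns above:

```python
def build(projected):
    """Prefix tree as nested dicts: item -> [support, children]."""
    root = {}
    for items in projected:
        node = root
        for item in items:
            entry = node.setdefault(item, [0, {}])
            entry[0] += 1
            node = entry[1]
    return root

def sh_mine(children, min_support=2, prefix=()):
    """Extend the current pattern only while each level meets the threshold."""
    patterns = []
    for item, (support, grandchildren) in children.items():
        if support >= min_support:
            pattern = prefix + (item,)
            patterns.append(pattern)
            patterns.extend(sh_mine(grandchildren, min_support, pattern))
    return patterns

# Projected transactions from TABLE I.
projected = [
    ["a", "b", "e"], ["b", "d"], ["b", "c"], ["a", "b", "d"], ["a", "c"],
    ["b", "c"], ["a", "c"], ["a", "b", "c", "e"], ["a", "b", "c"],
]
patterns = sh_mine(build(projected))
print(("a", "b", "c") in patterns)  # True; 'e' fails the threshold, so the branch stops
```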

IV. PERFORMANCE STUDY WITH EXPERIMENTAL RESULTS

This performance study evaluates the efficiency and scalability of the SH-Mine algorithm. Our experimental results give a comparative analysis of SH-Mine and FP-growth, showing that SH-Mine often outperforms FP-growth and sometimes performs on par with it. The experiments were run on a Pentium IV machine with 1 GB of main memory and an 80 GB hard disk under the MS Windows/NT operating system. Both SH-Mine and FP-growth were implemented in Dev-C++ 4.9.9.2 using the gcc compiler. We compared the algorithms on various datasets, but for reasons of space only results for representative datasets are shown in this study.

A. Scalability Analysis

Our experiments measure the scalability of the SH-Mine algorithm with respect to the number of transactions and with respect to support. FIGURE II and FIGURE III show the comparative behaviour of both algorithms on the Market Basket (MB) dataset, a sparse dataset taken from [13], with respect to transactions and support respectively.

FIGURE II SCALABILITY WITH RESPECT TO NUMBER OF TRANSACTIONS FOR MARKET BASKET (MB) DATASET

FIGURE III SCALABILITY WITH RESPECT TO SUPPORT FOR MARKET BASKET (MB) DATASET

This analysis shows that our algorithm is efficient for sparse datasets, especially when the minimum support threshold is low. A similar analysis is shown in FIGURE IV and FIGURE V for the Synthetic dataset, a dense dataset taken from the UCI AI repository. Our algorithm works efficiently on sparse datasets in terms of both run time and space; on dense datasets it sometimes performs on par with FP-growth, as FIGURE V shows, but it remains more space efficient.

FIGURE IV SCALABILITY WITH RESPECT TO NUMBER OF TRANSACTIONS FOR SYNTHETIC DATASET

FIGURE V SCALABILITY WITH RESPECT TO SUPPORT FOR SYNTHETIC DATASET



B. Memory Usage Analysis

We also compared the memory usage of both algorithms on the same datasets. FIGURE VI and FIGURE VII clearly show that SH-Mine is far better than FP-growth in terms of space. One of the main reasons is the use of H-Struct while creating the SH-Tree, since it was already shown in [3],[4] that H-Struct is more space efficient than the FP-tree.

FIGURE VI MEMORY ANALYSIS FOR MARKET BASKET (MB) DATASET

FIGURE VII MEMORY ANALYSIS FOR SYNTHETIC DATASET

V. CONCLUSION AND FUTURE WORK

We have developed a novel data structure called SH-Struct for mining frequent patterns. The experimental results show that the algorithm is scalable and efficient in terms of both space and time across different datasets. We have benchmarked the results against the well-known FP-growth algorithm, one of the most efficient frequent pattern mining algorithms. In future work we aim to improve the performance of our algorithm on highly dense datasets. The algorithm will also be used to generate temporal association rules for both sparse and dense datasets.

REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases", Proc. ACM SIGMOD Conf. on Management of Data, pp. 207-216, New York, NY, USA, 1993.
[2] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation", Proc. ACM SIGMOD Int'l Conference on Management of Data, 2000.
[3] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang, "H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases", Proc. IEEE Int'l Conference on Data Mining, San Jose, CA, November 2001.
[4] Q. Wan and A. An, "Efficient Mining of Indirect Associations Using HI-Mine", Proc. 16th Conference of the Canadian Society for Computational Studies of Intelligence, AI 2003, Halifax, Canada, June 2003.
[5] R. Bayardo Jr., R. Agrawal, and D. Gunopulos, "Constraint-Based Rule Mining in Large, Dense Databases", Proc. 15th Int'l Conf. on Data Engineering, pp. 188-197, 1999.
[6] W. Cheung and O. R. Zaiane, "Incremental Mining of Frequent Patterns without Candidate Generation or Support Constraint", Proc. 7th International Database Engineering and Applications Symposium (IDEAS'03), p. 111, 2003.
[7] Y. G. Sucahyo and R. Gopalan, "CT-PRO: A Bottom-Up Non-Recursive Frequent Itemset Mining Algorithm Using Compressed FP-Tree Data Structure", Proc. IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI), Brighton, UK, 2004.
[8] M. Song and S. Rajasekaran, "A Transaction Mapping Algorithm for Frequent Itemsets Mining", IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 4, pp. 472-481, April 2006.
[9] Q. Wan and A. An, "HI-Mine*: Efficient Indirect Association Discovery Using Compact Transaction Database", Proc. IEEE International Conference on Granular Computing (GrC'06), Atlanta, USA, May 10-12, 2006.
[10] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2001.
[11] C. Borgelt, "Keeping Things Simple: Finding Frequent Item Sets by Recursive Elimination", Proc. 1st International Workshop on Open Source Data Mining: Frequent Pattern Mining Implementations, Chicago, Illinois, pp. 66-70, 2005.
[12] K. Verma and O. P. Vyas, "Efficient Calendar Based Temporal Association Rule", ACM SIGMOD Record, vol. 34, no. 3, pp. 63-71, 2005.
[13] C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases, Dept. of Information and Computer Science, University of California at Irvine, CA, USA, 1998.

