JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, VOLUME 8, ISSUE 1, JULY 2011

MCAIM: Modified CAIM Discretization

Shivani V. Vora and Rupa G. Mehta

Abstract— Discretization is the process of dividing a continuous attribute into a finite set of intervals, producing an attribute with a small number of distinct values by associating a discrete numerical value with each generated interval. It yields a concise summarization of continuous attributes and makes learning more accurate and faster. Discretization is usually performed prior to the learning process and plays an important role in data mining and knowledge discovery. The results of CAIM are not satisfactory in some cases, which led us to modify the algorithm. The Modified CAIM (MCAIM) algorithm is compared with other discretization techniques for classification accuracy and produces outperforming results.

Index Terms— Discretization, Class-attribute interdependency maximization, CAIM, MCAIM.


1 INTRODUCTION

In the era of Information Technology, the data and information available on the Internet and in database systems are increasing rapidly. Machine learning (ML) algorithms are used to automate the analysis of large datasets. ML algorithms generate knowledge from class-labeled data sets, in which attributes come in mixed formats such as numerical, nominal or continuous. These algorithms are known as classification algorithms. Some classification algorithms can only handle categorical attributes, while others can handle continuous attributes but perform better on categorical ones [1]. To speed up classification algorithms, improve predictive accuracy, and generate simple decision rules, many discretization algorithms have been proposed to pre-process learning data [2], [3], [4].

Discretization is the process of dividing a continuous attribute into a finite set of intervals, producing an attribute with a small number of distinct values by associating a discrete numerical value with each generated interval. Discretization is usually performed prior to the learning process and plays an important role in data mining and knowledge discovery [5], [6], [7]. More details of the discretization process and of particular algorithms can be found in [2], [8], [9].

From the literature survey it is found that CAIM can generate a better discretization scheme. CAIM is a top-down discretization algorithm that discretizes attributes into possibly the smallest number of intervals and maximizes the class-attribute interdependency, so as to improve the results of the subsequently used classification algorithm. The main advantage of CAIM is that it does not require user interaction, since it automatically picks a proper number of discrete intervals. Although CAIM has many strengths, it still has two drawbacks. Firstly, CAIM usually generates a simple discretization scheme in which the number of intervals is very close to the number of target classes. Secondly, for each discretized interval, CAIM considers only the class with the most samples and ignores all the other target classes [4]. Detailed discussion and examples of CAIM are presented in Section 2.3.

The results of CAIM are not satisfactory in some cases, which motivated us to modify the algorithm. The main goal of MCAIM discretization is to improve the CAIM algorithm and thereby increase classification accuracy. To verify the proposed algorithm, we compared MCAIM with five well-known discretization algorithms:

♦ Unsupervised algorithms: Equal width [8], Equal count [8] and Standard deviation
♦ Supervised algorithms: Embedded discretization of the C5.0 classifier [10], CAIM [2]

The results show that MCAIM gives superior results compared with all five discretization methods. The results of the proposed algorithm are tested through tree-based classification (C5.0) using different real datasets from the UCI repository [11].

The rest of the paper is organized as follows. In Section 2, we review some related works. Section 3 presents our modified Class-Attribute Interdependence Maximization discretization algorithm. The experimental comparisons of six discretization algorithms on two real datasets are presented in Section 4. Finally, the conclusions are presented in Section 5.

2 RELATED WORKS

In this section, we review some of the related works. Since we evaluate the performance of several discretization algorithms in Section 4 by using the well-known classification algorithm C5.0, we first give a brief introduction to classification in Section 2.1. Then, in Section 2.2, we review the discretization algorithms on the axis of unsupervised versus supervised and give a brief idea of online (dynamic) and offline (static) discretization algorithms. Finally, a detailed discussion of CAIM is given in Section 2.3.

————————————————
• Shivani V. Vora is a student of M.Tech (Research)-II in the Computer Engineering Department of the Sardar Vallabhbhai National Institute of Technology, Surat, Gujarat, India.
• Rupa G. Mehta is with the Department of Computer Engineering of the Sardar Vallabhbhai National Institute of Technology, Surat, Gujarat, India.

2.1 Classification

Classification is a data mining (DM) technique used to predict group membership for data instances. Many classification algorithms have been developed, such as decision trees [12], classification and regression trees [13], Bayesian classification [14], neural networks [15] and K-nearest-neighbor classification [16]. Among them, the decision tree has become the more popular algorithm, as it has several advantages [17]:

♦ Compared to neural networks or a Bayesian-based approach, it is more easily interpreted by humans.
♦ It is more efficient for large training data than neural networks, which require a lot of time for thousands of iterations.
♦ A decision tree algorithm does not require domain knowledge or prior knowledge.
♦ It displays good classification accuracy compared to other techniques.

A decision tree such as C5.0 [10] is a flow-chart-like tree structure, constructed by a recursive divide-and-conquer algorithm that generates a partition of the data. In a decision tree, each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node is associated with a target class. The topmost node in a tree is called the root, and each path from the root to a leaf node represents a rule. Classifying an unknown example begins with the root node, and successive internal nodes are visited until the example reaches a leaf node; the class of that leaf node is then the predicted class of the example.

2.2 Discretization algorithms

Discretization algorithms can be divided into unsupervised versus supervised [18]. Well-known unsupervised top-down algorithms are Equal Width, Equal Frequency [8] and Standard deviation. The supervised top-down algorithms include maximum entropy [19], Paterson–Niblett [20] (used as dynamic discretization in the decision tree algorithm [10]), Information Entropy Maximization (IEM) [9], other information gain or entropy-based algorithms [21], [22], Class-Attribute Interdependence Maximization (CAIM) [2], and Fast Class-Attribute Interdependence Maximization (FCAIM) [3]. Kurgan and Cios (2004) showed outperforming results of the CAIM discretization algorithm compared to other top-down discretization algorithms, since its discretization schemes generally maintain the highest interdependence between the target class and the discretized attributes, result in the least number of generated rules, and achieve the highest classification accuracy [2]. FCAIM [3], an extension of the CAIM algorithm, was proposed to speed up CAIM.
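To make the two simplest unsupervised schemes concrete, the following is a minimal sketch in Python; the function names, the convention of returning the k-1 interior cut points, and the toy value list are ours for illustration, not taken from any of the cited implementations.

```python
def equal_width_cuts(values, k):
    """Split the range of `values` into k intervals of equal width.
    Returns the k-1 interior cut points."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_cuts(values, k):
    """Split `values` into k intervals holding (roughly) the same
    number of examples. Returns the k-1 interior cut points."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

# Invented toy data (the paper's actual datasets are not reproduced here).
ages = [3, 7, 12, 16, 25, 33, 47, 55, 62, 66, 70]
print(equal_width_cuts(ages, 3))      # cuts at equal spacing of the range
print(equal_frequency_cuts(ages, 3))  # cuts at (roughly) equal counts
```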

The C5.0 algorithm uses dynamic discretization, also known as online discretization [23]. It discretizes continuous attributes while the classifier is being built. Many researchers have developed dynamic discretization algorithms for particular learning algorithms [4]. Berzal et al. (2004) built multi-way decision trees by using a dynamic discretization method in each internal node to reduce the size of the resulting decision trees [24]; their experiments showed that the accuracy of these compact decision trees was preserved. Wu et al. (2006) defined a distributional index and then proposed a dynamic discretization algorithm to enhance the decision accuracy of naive Bayes classifiers [1]. However, the computational complexity of dynamic discretization is higher, as discretization takes place during model preparation and a special discretization scheme must be developed for the particular classification model. In static discretization, also known as offline discretization, the process is completed prior to the learning task [4], [23]. The advantage of static discretization over dynamic discretization is its independence from the learning algorithm [18]: a dataset discretized by a static discretization algorithm can be used with any classification algorithm that deals with discrete attributes.

2.3 CAIM discretization algorithm

Given the two-dimensional quanta matrix (also called a contingency table) in Table 1, CAIM defines the interdependency between the target class and the discretization scheme D of a continuous attribute F as

$$\mathrm{CAIM}(C, D \mid F) = \frac{1}{n}\sum_{r=1}^{n}\frac{\max_r^{2}}{M_{+r}} \qquad (1)$$

where q_ir (i = 1, 2, ..., S, r = 1, 2, ..., n) denotes the number of examples belonging to the ith class that are within the interval (d_{r-1}, d_r], M_{i+} is the total number of examples belonging to the ith class, M_{+r} is the total number of continuous values of attribute F that are within the interval (d_{r-1}, d_r], n is the number of intervals, and max_r is the maximum among all q_ir values in the rth column of the quanta matrix. A larger CAIM value indicates a better discretization scheme D.

CAIM is a progressive top-down discretization algorithm that does not require users to provide any parameters. For a continuous attribute, CAIM tests all possible cut points and generates one in each loop; the loop stops once a specific condition is met. For each possible cut point in each loop, the corresponding CAIM value is computed according to Equation (1), and the one with the highest CAIM value is chosen. Since finding the discretization scheme with the globally optimal CAIM value would require a great deal of computation, the CAIM algorithm only finds a local maximum of CAIM and thus generates a suboptimal discretization scheme.
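To make Equation (1) concrete, the following is a minimal Python sketch; the function name and the row-per-class layout of the quanta matrix are our own conventions for illustration.

```python
def caim_value(quanta):
    """Compute the CAIM criterion of Equation (1).

    `quanta` is the quanta matrix: quanta[i][r] = q_ir, the number of
    examples of class i falling in interval r (S rows, n columns).
    """
    n = len(quanta[0])                       # number of intervals
    total = 0.0
    for r in range(n):
        column = [row[r] for row in quanta]  # counts of each class in interval r
        max_r = max(column)                  # count of the dominant class
        m_plus_r = sum(column)               # all examples in interval r
        total += max_r ** 2 / m_plus_r
    return total / n                         # average over the n intervals

# Example: 3 classes, 2 intervals
print(caim_value([[5, 0],
                  [1, 4],
                  [0, 2]]))                  # (25/6 + 16/6) / 2 = 3.4166...
```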

© 2011 JCSE http://sites.google.com/site/jcseuk/

18

Table 1. The quanta matrix for attribute F and discretization scheme D

Class          | [d0, d1] | ... | (dr-1, dr] | ... | (dn-1, dn] | Class total
C1             | q11      | ... | q1r        | ... | q1n        | M1+
...            | ...      |     | ...        |     | ...        | ...
Ci             | qi1      | ... | qir        | ... | qin        | Mi+
...            | ...      |     | ...        |     | ...        | ...
CS             | qS1      | ... | qSr        | ... | qSn        | MS+
Interval total | M+1      | ... | M+r        | ... | M+n        | M

Although CAIM outperformed the other top-down methods, it still has two drawbacks. Firstly, CAIM gives a high weight to the number of generated intervals when it discretizes an attribute. Hence, CAIM usually generates a simple discretization scheme in which the number of intervals is very close to the number of target classes. For example, if we take the Age dataset in Table 2 as the training data, the generated quanta matrix is given in Table 3 and the discretization scheme of CAIM is presented in Table 4. In Table 4, CAIM divided the Age dataset into three intervals: [3.00, 10.00], (10.00, 61.50], and (61.50, 70.00]. Interval [3.00, 10.00] contains samples 1–2, interval (10.00, 61.50] contains samples 3–9, and interval (61.50, 70.00] contains samples 10–11. However, this result is not good: the Age dataset should obviously be discretized into five intervals, covering samples 1–2, 3–4, 5–7, 8–9, and 10–11. If a classifier learns from such a discretized dataset produced by CAIM, its accuracy will be worse. Secondly, CAIM considers only the distribution of the major target class, which is also unreasonable in some cases.

3 PROPOSED MODIFIED CAIM DISCRETIZATION BASED CLASSIFICATION MODEL

In this section, we discuss the proposed modified CAIM based classification model and the modified CAIM discretization.

3.1 Proposed MCAIM Based Classification Model

Table 2 Age dataset

In the proposed model, classification is preceded by modified CAIM discretization. Before discretization, we prepare the data by applying different pre-processing techniques such as normalization, outlier detection and removal, correlation analysis and feature subset selection, to enable classification algorithms to operate faster and more effectively. The prepared data is given to the modified CAIM discretization algorithm to generate discrete intervals, and the discretized attributes are then given to the classification model. The tree-based classifier based on information gain, proposed by Quinlan [10], natively uses online discretization. The process of the proposed model is shown in Fig. 1. In this paper, we place more emphasis on the modified CAIM discretization process, which is described in the next section.

Fig. 1. Classification using modified CAIM discretization
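The following is a runnable toy version of the Fig. 1 pipeline. All helper names and the trivial majority-vote "classifier" are ours; in the paper the discretizer is MCAIM and the classifier is C5.0.

```python
def min_max_normalize(col):
    """Pre-processing step: scale a continuous attribute to [0, 1]."""
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) for v in col]

def to_intervals(col, cuts):
    """Discretization step: replace each value by its interval index."""
    return [sum(v > c for c in sorted(cuts)) for v in col]

def majority_rule(train_codes, train_labels):
    """Stand-in classifier: for each interval code, predict the majority class."""
    table = {}
    for code, label in zip(train_codes, train_labels):
        table.setdefault(code, []).append(label)
    return {c: max(set(ls), key=ls.count) for c, ls in table.items()}

# Invented toy attribute and labels, for illustration only.
attribute = [3, 7, 21, 30, 55, 62, 70]
labels = ['low', 'low', 'mid', 'mid', 'mid', 'high', 'high']

normalized = min_max_normalize(attribute)
codes = to_intervals(normalized, cuts=[0.25, 0.80])
model = majority_rule(codes, labels)
print([model[c] for c in codes])   # predictions on the training data
```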

3.2 Modified CAIM Discretization Algorithm

The main goal of MCAIM is to improve the CAIM algorithm so as to increase classification accuracy. In its discretization process, CAIM only accepts the cut point whose CAIM value is the highest among all corresponding CAIM values; it ignores all possible small intervals whose max_r is smaller than the max_r of the interval where the highest CAIM is found. The proposed algorithm changes the comparison scheme: instead of searching for the interval that produces the globally highest CAIM value, the search is performed for the interval which produces the locally highest CAIM value.

Table 3. Quanta matrix for the Age dataset

Table 4. The discretization scheme of the Age dataset by CAIM

Fig. 2. The pseudo code of MCAIM
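The pseudo code of Fig. 2 is not reproduced in this text. The following is a minimal sketch of the greedy top-down loop shared by CAIM and MCAIM, under our own reading of Section 3.2; the function names, candidate generation, and the stopping rule (at least as many intervals as classes, then stop when no candidate improves) are assumptions, not the authors' exact code. MCAIM keeps this skeleton but accepts boundaries on a local CAIM comparison rather than the single global maximum.

```python
def quanta(values, labels, cuts):
    """Build the quanta matrix (rows = classes, columns = intervals)."""
    classes = sorted(set(labels))
    cuts = sorted(cuts)
    q = [[0] * (len(cuts) + 1) for _ in classes]
    for v, y in zip(values, labels):
        r = sum(v > c for c in cuts)          # interval index of value v
        q[classes.index(y)][r] += 1
    return q

def caim_value(q):
    """Equation (1): average of max_r^2 / M_+r over all intervals."""
    return sum(max(col) ** 2 / sum(col) for col in zip(*q)) / len(q[0])

def greedy_discretize(values, labels):
    """CAIM-style top-down loop: start from one interval covering all
    values, repeatedly add the candidate boundary with the best criterion."""
    vs = sorted(values)
    # Candidate boundaries: midpoints between distinct adjacent values.
    candidates = sorted({(a + b) / 2 for a, b in zip(vs, vs[1:]) if a != b})
    cuts, best = [], 0.0
    n_classes = len(set(labels))
    while candidates:
        scored = [(caim_value(quanta(values, labels, cuts + [c])), c)
                  for c in candidates]
        score, cut = max(scored)
        # Stop once every class has an interval and no candidate improves.
        if score <= best and len(cuts) + 1 >= n_classes:
            break
        best = max(best, score)
        cuts.append(cut)
        candidates.remove(cut)
    return sorted(cuts)

# Usage (invented toy data):
# cuts = greedy_discretize([3, 7, 12, 47, 62, 70], ['a', 'a', 'b', 'b', 'c', 'c'])
```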


The pseudo code of the MCAIM algorithm is depicted in Fig. 2; it follows the same initial steps as the CAIM algorithm [2]. The algorithm starts with a single interval that covers all possible values of a continuous attribute and divides it iteratively. From all possible division points that are tried (with replacement) in step 2.2, it chooses the division boundary that gives the locally highest value of the CAIM criterion. The experiment generates the expected result as described above; the result of the proposed algorithm is shown in Table 5. Since discretization is a preprocessing task for classification, the resultant table after discretization is then used in classification.

Table 5. The discretization scheme for the Age dataset by modified CAIM


4 RESULTS

In the following sections, the results of the MCAIM algorithm, along with five other leading discretization algorithms, on two well-known continuous and mixed-mode data sets are presented.

4.1 The Experimental Setup

In the proposed model, classification is preceded by discretization. The tree-based classification based on information gain, proposed by Quinlan [10], uses online discretization. The accuracy of the classification model is compared for classification with online discretization, CAIM discretization and the proposed modified CAIM discretization, along with the traditional static techniques Equal width, Equal count, and Standard deviation.

The two datasets used to test the CAIM and modified CAIM algorithms are:

1. Contraceptive Method Choice data set (cmc) [11]
2. Iris Plants data set (iris) [11]

The data sets are obtained from the UC Irvine ML repository [11]. A detailed description of the data sets is shown in Table 6. For the evaluation of the discretization algorithms we use 20% of the examples as the training dataset and 80% of the examples as the test dataset. For measuring classification accuracy we have used the C5.0 classification of Clementine 8.5, and the test results are depicted in Table 7. A sketch of this evaluation protocol is given at the end of this section.

Table 6. Main properties of datasets considered in the experimentation

4.2 Analysis of the Results

Table 7 depicts the results of the six discretization methods for the two different datasets. Among the six discretization methods, online discretization is the dynamic discretization method and the remaining are static. For the cmc and iris datasets, MCAIM discretization gives the best result among all discretization methods. Classification accuracy using CAIM discretization on the iris dataset is not good; as we discussed earlier, CAIM does not ensure the best classification results as far as accuracy is concerned.

Table 7. Comparisons of six discretization schemes using two different datasets (bold values indicate the best results; *Std. deviation stands for Standard Deviation)
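As a rough illustration of the setup above, the sketch below reproduces the 20%/80% train/test protocol, with scikit-learn's CART decision tree standing in for C5.0/Clementine (for which no open implementation is assumed here); the `cut_points_per_column` argument is a placeholder for the output of whichever of the six discretization methods is under test.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def apply_cuts(X, cut_points_per_column):
    """Replace each continuous column of X by interval indices."""
    coded = np.empty_like(X, dtype=int)
    for j, cuts in enumerate(cut_points_per_column):
        coded[:, j] = np.searchsorted(np.asarray(cuts), X[:, j])
    return coded

def evaluate(X, y, cut_points_per_column, seed=0):
    """Train on 20% of the discretized examples, report accuracy on 80%."""
    Xd = apply_cuts(X, cut_points_per_column)
    X_tr, X_te, y_tr, y_te = train_test_split(
        Xd, y, train_size=0.2, random_state=seed, stratify=y)
    tree = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    return tree.score(X_te, y_te)   # classification accuracy on the 80%
```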

5 SUMMARY AND CONCLUSIONS

Discretization algorithms have played an important role in the extraction of information from large data sets. Discretization is a preprocessing step and thus should be characterized by very low complexity. From the existing research, CAIM discretization has proven to be a very efficient discretization technique for classification algorithms; but CAIM's results are not satisfactory on some datasets [4], and to improve the results the updated CAIM discretization algorithm, MCAIM, is proposed. The results depicted in Table 7 show that the proposed MCAIM gives improved classification accuracy compared to the other static discretization methods and the dynamic discretization of the C5.0 algorithm. It is found that the number of intervals generated by the modified CAIM discretization algorithm is larger, and some modification is required to reduce the number of intervals. In our proposed model, data is first prepared with different pre-processing methods and then given to the discretization algorithm, which leads to more accurate results.

ACKNOWLEDGMENT

We are grateful to the authors of the CAIM algorithm for its important role in our paper.

REFERENCES

[1] Q. Wu, D.A. Bell, T.M. McGinnity, G. Prasad, G. Qi and X. Huang, "Improvement of decision accuracy using discretization of continuous attributes," Proc. Third Int'l Conf. Fuzzy Systems and Knowledge Discovery, Lecture Notes in Computer Science, pp. 674-683, 2006.
[2] L. Kurgan and K.J. Cios, "CAIM Discretization Algorithm," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 2, February 2004.
[3] L. Kurgan and K.J. Cios, "Fast Class-Attribute Interdependence Maximization (CAIM) Discretization Algorithm," unpublished.
[4] C.-J. Tsai, C.-I. Lee and W.-P. Yang, "A discretization algorithm based on Class-Attribute Contingency Coefficient," Information Sciences, vol. 178, pp. 714-731, 2008.
[5] J. Catlett, "On Changing Continuous Attributes into Ordered Discrete Attributes," Proc. European Working Session on Learning, pp. 164-178, 1991.
[6] J. Dougherty, R. Kohavi and M. Sahami, "Supervised and Unsupervised Discretization of Continuous Features," Proc. Twelfth Int'l Conf. Machine Learning, pp. 194-202, 1995.
[7] U.M. Fayyad and K.B. Irani, "On the Handling of Continuous-Valued Attributes in Decision Tree Generation," Machine Learning, vol. 8, pp. 87-102, 1992.
[8] D. Chiu, A. Wong and B. Cheung, "Information Discovery through Hierarchical Maximum Entropy Discretization and Synthesis," in G. Piatetsky-Shapiro and W.J. Frawley (eds.), Knowledge Discovery in Databases, MIT Press, 1991.
[9] U.M. Fayyad and K.B. Irani, "Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning," Proc. Thirteenth Int'l Joint Conf. Artificial Intelligence, Morgan Kaufmann, San Francisco, CA, pp. 1022-1027, 1993.
[10] R. Quinlan, Data Mining Tools, http://www.rulequest.com/see5info.html, 2005.
[11] C.L. Blake and C.J. Merz, UCI Repository of Machine Learning Databases, University of California, Department of Information and Computer Science, Irvine, CA, http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[12] S. Cohen, L. Rokach and O. Maimon, "Decision-tree instance-space decomposition with grouped gain-ratio," Information Sciences, pp. 3592-3612, 2007.
[13] L. Breiman, J.H. Friedman, R.A. Olshen and C.J. Stone, Classification and Regression Trees, Monterey, CA: Wadsworth & Brooks/Cole, 1984.
[14] D. Heckerman, "A Tutorial on Learning with Bayesian Networks," March 1995 (revised November 1996).
[15] R. Rojas, Neural Networks: A Systematic Introduction, Springer-Verlag, 1996.
[16] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Trans. Information Theory, vol. 13, no. 1, pp. 21-27, 1967.
[17] R. Rastogi and K. Shim, "A decision tree classifier that integrates building and pruning," Proc. Twenty-Fourth Int'l Conf. Very Large Data Bases, pp. 404-415, 1998.
[18] H. Liu, F. Hussain, C.L. Tan and M. Dash, "Discretization: An Enabling Technique," Data Mining and Knowledge Discovery, vol. 6, no. 4, pp. 393-423, 2002.
[19] A.K.C. Wong and D.K.Y. Chiu, "Synthesizing Statistical Knowledge from Incomplete Mixed-Mode Data," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 9, pp. 796-805, 1987.
[20] A. Paterson and T.B. Niblett, ACLS Manual, Edinburgh: Intelligent Terminals, Ltd., 1987.
[21] J. Dougherty, R. Kohavi and M. Sahami, "Supervised and Unsupervised Discretization of Continuous Features," Proc. Twelfth Int'l Conf. Machine Learning, pp. 194-202, 1995.
[22] X. Wu, "A Bayesian Discretizer for Real-Valued Attributes," The Computer Journal, vol. 39, 1996.
[23] P. Berka and I. Bruha, "Discretization and Grouping: Preprocessing Steps for Data Mining."
[24] F. Berzal, J.C. Cubero, N. Marin and D. Sanchez, "Building multi-way decision trees with numerical attributes," Information Sciences, pp. 73-90, 2004.
