A Bottom-Up Oblique Decision Tree Induction Algorithm
Rodrigo C. Barros, Ricardo Cerri, Pablo A. Jaskowiak and André C. P. L. F. de Carvalho
Department of Computer Science, ICMC, University of São Paulo (USP), São Carlos - SP, Brazil
{rcbarros,cerri,pablo,andre}@icmc.usp.br

Abstract—Decision tree induction algorithms are widely used in knowledge discovery and data mining, especially in scenarios where model comprehensibility is desired. A variation of the traditional univariate approach is the so-called oblique decision tree, which allows multivariate tests in its non-terminal nodes. Oblique decision trees can model decision boundaries that are oblique to the attribute axes, whereas univariate trees can only perform axis-parallel splits. The majority of oblique and univariate decision tree induction algorithms follow a top-down strategy for growing the tree, relying on an impurity-based measure for splitting nodes. In this paper, we propose a novel bottom-up algorithm for inducing oblique trees named BUTIA. It does not require an impurity measure for dividing nodes, since we know a priori the data resulting from each split. For generating the splitting hyperplanes, our algorithm implements a support vector machine solution, and a clustering algorithm is used for generating the initial leaves. We compare BUTIA to traditional univariate and oblique decision tree algorithms, C4.5, CART, OC1 and FT, as well as to a standard SVM implementation, using real gene expression benchmark data. Experimental results show the effectiveness of the proposed approach in several cases.

Keywords: oblique decision trees; bottom-up induction; clustering; SVM; hybrid intelligent systems

I. INTRODUCTION

Decision tree induction algorithms are widely used in a variety of domains for knowledge discovery and pattern recognition. The induced knowledge, in the form of a hierarchical tree, can be regarded as a disjunction of conjunctions of constraints on the attribute values [1]. Each path from the root to a leaf is a conjunction of attribute tests, and the tree itself allows the choice of different paths, i.e., a disjunction of these conjunctions. Such a representation is intuitive and easy for humans to assimilate, which partially explains the large number of studies that make use of these techniques. Another reason for their popularity is their good predictive accuracy in several application domains, such as medical diagnosis and credit risk assessment [2]. A major issue in decision tree induction is which attribute(s) to choose for splitting an internal node.

For the case of axis-parallel decision trees (also known as univariate), the problem is to choose the attribute that best discriminates the input data. A decision rule based on such an attribute is then generated, and the input data is filtered according to the consequents of this rule. For oblique decision trees (also known as multivariate), the goal is to find a combination of attributes with good discriminatory power. Oblique decision trees are not as popular as the univariate ones, mainly because they are harder to interpret. Nevertheless, researchers argue that multivariate splits can improve the performance of the tree on several datasets, while generating smaller trees [3]–[5]. Clearly, there is a tradeoff to consider in allowing multivariate tests: simple tests may result in large trees that are hard to understand, yet multivariate tests may result in small trees with tests that are hard to understand [6].

One of the advantages of oblique decision trees is that they are able to produce polygonal (polyhedral) partitions of the attribute space, i.e., hyperplanes at an oblique orientation to the attribute axes. Univariate trees, on the other hand, can only produce hyper-rectangles parallel to the attribute axes. The test at each node of an oblique tree has the form w0 + Σ_{i=1}^{m} wi·xji ≤ 0, where wi is a real-valued coefficient associated with the ith attribute of a given instance xj, and w0 is the disturbance coefficient (bias) of the test.
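For illustration only (not part of the original paper), a minimal Python sketch of how such an oblique test routes an instance, using hypothetical coefficients, contrasted with an axis-parallel test on a single attribute:

# Oblique test of the form w0 + sum_i wi*xi <= 0; the coefficients below are hypothetical.
def oblique_test(x, w, w0):
    return w0 + sum(wi * xi for wi, xi in zip(w, x)) <= 0

x = [1.0, 0.9]                                   # an instance with m = 2 attributes
print(oblique_test(x, w=[1.0, -2.0], w0=0.5))    # oblique split x1 - 2*x2 + 0.5 <= 0 -> True
print(x[0] <= 0.7)                               # axis-parallel split on a single attribute -> False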

For growing either oblique or axis-parallel decision trees, there is a clear preference in the literature for algorithms that rely on a greedy, top-down, recursive partitioning strategy, i.e., top-down induction. The most well-known algorithms for decision tree induction indeed implement this strategy, e.g., CART [7], C4.5 [8] and OC1 [9]. These algorithms make use of impurity-based criteria to decide which attribute(s) will split the data into purer subsets (a pure subset is one whose instances all belong to the same class). Since these algorithms are top-down, it is not possible to know a priori which instances will end up in each subset of a partition. Thus, in top-down induction, trees are usually grown until every leaf node is pure, and a pruning method is employed to avoid data overfitting.

Works that implement a bottom-up strategy are quite rare in the literature. The key ideas behind bottom-up induction were first presented by Landeweerd et al. [10]. The authors propose growing a decision tree from the leaves to the root, assuming that each class is represented by a leaf node, and that the closest nodes (according to the Mahalanobis distance) are recursively merged into a parent node. Albeit simple, their approach presents several deficiencies, e.g., it allows only a single leaf per class, which means that binary-class problems will always be modeled by a 3-node tree. This is quite problematic, since there are complex binary-class problems in which a 3-node decision tree cannot accurately model the attribute space. We believe this deficiency is one of the reasons researchers have been discouraged from further investigating the bottom-up induction of decision trees. Another reason may be the extra computational effort required to compute the costly Mahalanobis distance.

In this paper, we propose alternatives to solve the deficiencies of the typical bottom-up approach. For instance, we apply a well-known clustering algorithm to allow each class to be represented by more than one leaf node. In addition, we incorporate into our algorithm a support vector machine (SVM) [11] solution to build the hyperplane that divides the data within each non-terminal node of the oblique decision tree. We call our approach BUTIA (Bottom-Up oblique Tree Induction Algorithm), and we evaluate its performance on gene expression benchmark datasets.

This paper is organized as follows. In Section II we detail the proposed algorithm, which combines clustering and SVM for generating oblique decision trees. In Section III we conduct a comparison between BUTIA and the traditional top-down decision tree induction algorithms C4.5 [8], CART [7] and OC1 [9]. Additionally, we compare BUTIA to Sequential Minimal Optimization (SMO) [12] and Functional Trees [13]. Section IV presents studies related to our approach, whereas in Section V we discuss the main conclusions of this work.

II. BUTIA

We propose a new bottom-up oblique decision tree induction algorithm, named BUTIA (Bottom-Up oblique Tree Induction Algorithm). It employs two machine learning algorithms in different steps of tree growth, namely Expectation-Maximization (EM) [14] and Sequential Minimal Optimization (SMO) [12]. Our motivation for building bottom-up trees is twofold:

• In a bottom-up approach we have a priori information about which group of instances belongs to a given node of the tree. This means we know the result of each node split before even generating the separating hyperplane. In fact, our algorithm uses this a priori information to generate hyperplanes that maximize the separation margin between the instances of two nodes. Hence, there is no need to rely on an impurity measure to evaluate the goodness of a split;

• The top-down induction strategy usually overgrows the decision tree until every leaf node is pure, and a pruning procedure is then responsible for simplifying the tree in order to avoid data overfitting. In bottom-up induction, a pruning step is not necessary, because the tree is never overgrown. Since we start growing the tree from the leaves to the root, clustering the data instances, our approach significantly reduces the chances of overfitting.

Given a space of instances X = {x1, . . . , xn}, xi ∈ R^m, BUTIA first runs the EM algorithm within each class, creating one leaf node per resulting cluster, and then recursively merges the closest pair of nodes with distinct (meta-)classes into a new internal node, for which an SVM hyperplane is generated. The complete procedure is given in Algorithm 1.
Algorithm 1: BUTIA's procedures.

procedure butia(Dataset X, Classes C)
  input : a dataset X and a set of classes C
  output: the oblique decision tree
  begin
    Leaves ← ∅
    foreach class ci ∈ C do
      Si ← {x | C(x) = ci}
      Partition ← result of EM over Si
      foreach cluster ∈ Partition do
        create new leaf node n
        n.instances ← instances from cluster
        n.centroid ← mean vector of n.instances
        n.class ← ci
        Leaves ← Leaves ∪ {n}
    Tree ← merging(Leaves)
    return Tree

procedure merging(L)
  input : a set L of nodes
  output: the oblique decision tree root node
  begin
    if |L| = 1 then
      return the unique node in L
    else
      s1 ← −1; s2 ← −1
      smallestDistance ← ∞
      foreach i, j ∈ L with i ≠ j do
        if i.class ≠ j.class then
          distance ← distance between i and j
          if distance < smallestDistance then
            smallestDistance ← distance
            s1 ← i; s2 ← j
      create new internal node t
      t.leftChild ← s1
      t.rightChild ← s2
      t.instances ← s1.instances ∪ s2.instances
      t.class ← new meta-class
      t.centroid ← mean vector of t.instances
      t.hyperplane ← SVM hyperplane for t.instances
      L ← L ∪ {t}
      L ← L \ {s1}
      L ← L \ {s2}
      return merging(L)
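For concreteness, a compact Python sketch of this procedure (not the authors' implementation; it assumes scikit-learn's GaussianMixture as the EM clustering step and LinearSVC as the hyperplane generator, and treats the number of clusters per class as a free parameter):

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC

class Node:
    def __init__(self, instances, label):
        self.instances = instances                  # instances belonging to this node
        self.label = label                          # class (or meta-class) of the node
        self.centroid = instances.mean(axis=0)      # mean vector of the node's instances
        self.left = self.right = self.svm = None    # filled in for internal nodes

def butia(X, y, clusters_per_class=2):
    # Leaf generation: cluster each class with EM (Gaussian mixture), one leaf per cluster.
    leaves = []
    for c in np.unique(y):
        Xc = X[y == c]
        k = min(clusters_per_class, len(Xc))
        groups = GaussianMixture(n_components=k, random_state=0).fit_predict(Xc)
        leaves += [Node(Xc[groups == g], label=c) for g in np.unique(groups)]
    return merging(leaves)

def merging(nodes):
    # Recursively merge the closest pair of nodes with distinct (meta-)classes and
    # train a binary SVM to separate the two children of the new internal node.
    if len(nodes) == 1:
        return nodes[0]
    _, i, j = min((np.linalg.norm(a.centroid - b.centroid), i, j)
                  for i, a in enumerate(nodes) for j, b in enumerate(nodes)
                  if i < j and a.label != b.label)
    left, right = nodes[i], nodes[j]
    parent = Node(np.vstack([left.instances, right.instances]), label=object())  # new meta-class
    parent.left, parent.right = left, right
    targets = np.r_[np.zeros(len(left.instances)), np.ones(len(right.instances))]
    parent.svm = LinearSVC().fit(parent.instances, targets)   # oblique hyperplane of this node
    return merging([n for k, n in enumerate(nodes) if k not in (i, j)] + [parent])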

Figure 1 depicts the execution steps of BUTIA.


Note that each internal node separates only the instances of the two nodes being merged (its two children). Hence, the SVM needs only to deal with binary problems, regardless of the number of classes in the original dataset, which means its application is straightforward and does not rely on a majority voting scheme between pairwise class subsets. This is particularly interesting in order to preserve the interpretability of the generated hyperplane as a rule. The merging step is repeated recursively until a root node is reached.

Figure 1. Diagram of BUTIA's execution steps.

We believe our approach presents the following advantages over top-down methods:

• It can handle imbalanced classes regardless of any explicitly provided cost matrix. Unlike top-down approaches, BUTIA always generates at least one leaf node per class, which guarantees that rare classes are represented in the predictive model.

• It can model problems in which one class is described by more than one probability distribution. By clustering each pure subset, we can identify possibly distinct probability distributions, and thus generate separating hyperplanes for the closest inter-class boundaries. Note that this is not trivially achieved by other methods, especially those whose design is based on impurity measures.

In Figure 2, we detail how BUTIA performs on a synthetic binary-class dataset with m = 2. The original data are presented in (a), and the clustering step is performed within each class in (b). The first class (x) was clustered into two groups, as was the second (o), resulting in 4 leaf nodes. Next, the centroids of each node are calculated, and the closest nodes according to the Euclidean distance are merged (recall that nodes can only be merged when their (meta-)classes are distinct), generating a new node in the tree (c). Still referring to (c), the new merged node has its centroid computed, a meta-class is assigned to its instances, and finally the corresponding hyperplane is generated. The algorithm then merges the next two closest centroids, repeating the same procedure of creating a new node, assigning a new meta-class and generating a new hyperplane (d and e). The algorithm terminates when all nodes (but one, the root node) are merged, resulting in the oblique partitions presented in (f).


Figure 2. BUTIA’s execution in a synthetic 2-d binary-class dataset.
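A usage illustration of the Python sketch above, on hypothetical synthetic data roughly mirroring the Figure 2 setting (two Gaussian clusters per class, m = 2):

import numpy as np

rng = np.random.default_rng(0)
# Two classes, each drawn from two Gaussian blobs, as in the Figure 2 walkthrough.
centers = {0: [(-2, -2), (2, 2)], 1: [(-2, 2), (2, -2)]}
X = np.vstack([rng.normal(c, 0.4, size=(25, 2)) for cls in centers for c in centers[cls]])
y = np.repeat([0, 0, 1, 1], 25)

root = butia(X, y, clusters_per_class=2)          # 4 leaves, recursively merged into 3 internal nodes
print(root.svm.coef_[0], root.svm.intercept_[0])  # coefficients of the root's oblique hyperplane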

III. EXPERIMENTAL ANALYSIS

In order to evaluate our novel algorithm, we consider a set of 35 publicly available datasets of microarray gene expression data [15]. Microarray technology enables the measurement of expression levels for thousands of genes in parallel, given a biological tissue. A gene expression dataset combines a fixed number of such microarray experiments. The considered datasets are related to different types or subtypes of cancer (e.g., prostate, lung and skin) and cover the two flavors in which the technology is generally available: single channel (21 datasets) and double channel (14 datasets) [16]. Hereafter we refer to single-channel microarrays as Affymetrix (Affy) and double-channel microarrays as cDNA, since the data were collected using either of these technologies [15]. Our final task consists in classifying different examples (instances) according to their gene (attribute) expression levels. The main characteristics of the datasets are summarized in Table I.

We compared BUTIA to five different classifiers. Four of them are decision tree induction algorithms: Oblique Classifier 1 (OC1) [9] (oblique trees), Functional Trees (FT) [13] (logistic regression trees), Classification and Regression Trees (CART) [7] (univariate trees) and C4.5 [8] (univariate trees).

Table I. SUMMARY OF THE GENE EXPRESSION DATASETS.

Id  Dataset         Type  # Instances  # Classes  # Genes
 1  alizadeh-v1     cDNA       42          2        1095
 2  alizadeh-v2     cDNA       62          3        2093
 3  alizadeh-v3     cDNA       62          4        2093
 4  bittner         cDNA       38          2        2201
 5  bredel          cDNA       50          3        1739
 6  chen            cDNA      180          2          85
 7  garber          cDNA       66          4        4553
 8  khan            cDNA       83          4        1069
 9  lapointe-v1     cDNA       69          3        1625
10  lapointe-v2     cDNA      110          4        2496
11  liang           cDNA       37          3        1411
12  risinger        cDNA       42          4        1771
13  tomlins-v1      cDNA      104          5        2315
14  tomlins-v2      cDNA       92          4        1288
15  armstrong-v1    Affy       72          2        1081
16  armstrong-v2    Affy       72          3        2194
17  bhattacharjee   Affy      203          5        1543
18  chowdary        Affy      104          2         182
19  dyrskjot        Affy       40          3        1203
20  golub-v1        Affy       72          2        1877
21  golub-v2        Affy       72          3        1877
22  gordon          Affy      181          2        1626
23  laiho           Affy       37          2        2202
24  nutt-v1         Affy       50          4        1377
25  nutt-v2         Affy       28          2        1070
26  nutt-v3         Affy       22          2        1152
27  pomeroy-v1      Affy       34          2         857
28  pomeroy-v2      Affy       42          5        1379
29  ramaswamy       Affy      190         14        1363
30  shipp           Affy       77          2         798
31  singh           Affy      102          2         339
32  su              Affy      174         10        1571
33  west            Affy       49          2        1198
34  yeoh-v1         Affy      248          2        2526
35  yeoh-v2         Affy      248          6        2526

The fifth classifier, SMO [12], is an implementation of the SVM method, which is the state-of-the-art algorithm for cancer classification in gene expression data [17]. We make use of the following implementations: (i) OC1 - the publicly available code at the authors' website (http://www.cbcb.umd.edu/~salzberg/announce-oc1.html), and (ii) FT, CART, C4.5 and SMO - the publicly available Java implementations found within the Weka toolkit [18]. The parameters used were the default ones. Each of the six classifiers was evaluated by its generalization capability (accuracy rates), estimated using 10-fold cross-validation.

In order to provide some reassurance about the validity and non-randomness of the obtained results, we present the results of statistical tests following the approach proposed by Demšar [19]. In brief, this approach seeks to compare multiple algorithms on multiple datasets, and it is based on the use of the Friedman test with a corresponding post-hoc test. The Friedman test is a non-parametric counterpart of the well-known ANOVA. If the null hypothesis, which states that the classifiers under study present similar performances, is rejected, then we proceed with the Nemenyi post-hoc test for pairwise comparisons.
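As an illustration of this comparison protocol (a sketch assuming SciPy is available; the accuracy values below are random placeholders, not the paper's results):

import numpy as np
from scipy.stats import friedmanchisquare, rankdata

methods = ["BUTIA", "OC1", "FT", "CART", "J48", "SMO"]
# acc[d, m]: mean 10-fold cross-validation accuracy of method m on dataset d (placeholders).
acc = np.random.default_rng(0).uniform(0.5, 1.0, size=(35, len(methods)))

# Friedman test: do the six classifiers perform significantly differently over the 35 datasets?
stat, p_value = friedmanchisquare(*acc.T)
print(f"Friedman chi-square = {stat:.2f}, p-value = {p_value:.2e}")

# Average ranks (rank 1 = best accuracy on a dataset); a Nemenyi post-hoc test would then
# compare these average ranks pairwise, as reported in Table III.
avg_ranks = rankdata(-acc, axis=1).mean(axis=0)
for name, rank in sorted(zip(methods, avg_ranks), key=lambda pair: pair[1]):
    print(f"{name}: average rank {rank:.2f}")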

The experimental results are summarized in Table II. The average accuracies obtained in the 10-fold cross-validation procedure are presented, together with the corresponding standard deviations. Additionally, the number of times each method was among the top three accuracies is shown in the bottom part of the table.

Table II. ACCURACY ANALYSIS OF THE 35 DATASETS (AVERAGE ± S.D.).

Id  BUTIA       OC1         FT          CART        J48         SMO
 1  0.94±0.14   0.72±0.19   0.89±0.15   0.69±0.26   0.69±0.20   0.94±0.14
 2  1.00±0.00   0.87±0.12   0.99±0.05   0.90±0.13   0.89±0.14   1.00±0.00
 3  0.94±0.10   0.66±0.20   0.82±0.14   0.71±0.10   0.70±0.16   0.94±0.08
 4  0.92±0.14   0.56±0.23   0.81±0.18   0.58±0.14   0.56±0.23   0.88±0.13
 5  0.86±0.13   0.76±0.08   0.78±0.11   0.76±0.13   0.84±0.13   0.86±0.10
 6  0.84±0.08   0.82±0.07   0.94±0.04   0.85±0.06   0.84±0.06   0.94±0.07
 7  0.85±0.12   0.74±0.13   0.82±0.13   0.76±0.12   0.80±0.10   0.80±0.14
 8  0.98±0.05   0.79±0.14   1.00±0.00   0.83±0.11   0.87±0.09   0.99±0.04
 9  0.88±0.12   0.80±0.07   0.76±0.12   0.77±0.07   0.72±0.16   0.85±0.15
10  0.86±0.09   0.66±0.19   0.84±0.09   0.65±0.08   0.63±0.18   0.85±0.08
11  1.00±0.00   0.68±0.28   0.93±0.12   0.76±0.25   0.79±0.20   0.98±0.08
12  0.74±0.15   0.57±0.18   0.78±0.18   0.53±0.25   0.45±0.24   0.81±0.16
13  0.90±0.11   0.61±0.11   0.87±0.12   0.54±0.15   0.55±0.10   0.93±0.11
14  0.90±0.08   0.53±0.22   0.82±0.14   0.59±0.13   0.56±0.17   0.91±0.07
15  0.99±0.05   0.71±0.38   0.96±0.07   0.91±0.09   0.90±0.07   0.99±0.05
16  0.97±0.06   0.74±0.11   0.90±0.10   0.75±0.15   0.76±0.09   0.96±0.07
17  0.97±0.03   0.92±0.06   0.97±0.03   0.89±0.09   0.91±0.08   0.96±0.05
18  0.92±0.08   0.39±0.51   0.95±0.05   0.97±0.05   0.93±0.07   0.96±0.05
19  0.85±0.24   0.68±0.17   0.65±0.24   0.75±0.17   0.73±0.25   0.90±0.17
20  0.93±0.07   0.83±0.13   0.89±0.11   0.85±0.14   0.86±0.13   0.97±0.06
21  0.94±0.07   0.81±0.09   0.94±0.07   0.92±0.07   0.96±0.07   0.94±0.07
22  0.99±0.02   0.96±0.02   0.98±0.03   0.95±0.02   0.95±0.04   0.99±0.02
23  0.88±0.16   0.83±0.22   0.80±0.21   0.76±0.18   0.89±0.14   0.97±0.11
24  0.56±0.18   0.24±0.31   0.66±0.21   0.58±0.18   0.56±0.21   0.72±0.25
25  0.57±0.21   0.80±0.22   0.37±0.07   0.85±0.20   0.82±0.20   0.93±0.14
26  0.73±0.34   0.73±0.24   0.68±0.23   0.63±0.20   0.60±0.33   1.00±0.00
27  0.89±0.14   0.57±0.50   0.79±0.15   0.73±0.19   0.73±0.29   0.92±0.14
28  0.77±0.15   0.63±0.18   0.79±0.18   0.53±0.17   0.59±0.17   0.79±0.21
29  0.62±0.08   0.36±0.27   0.65±0.11   0.57±0.08   0.62±0.12   0.72±0.08
30  0.89±0.10   0.71±0.10   0.80±0.14   0.78±0.11   0.81±0.13   0.93±0.09
31  0.77±0.07   0.76±0.11   0.89±0.10   0.74±0.10   0.82±0.10   0.92±0.06
32  0.93±0.05   0.65±0.15   0.89±0.09   0.76±0.12   0.81±0.10   0.90±0.05
33  0.84±0.18   0.68±0.36   0.90±0.14   0.78±0.22   0.86±0.10   0.86±0.16
34  0.98±0.04   0.99±0.03   0.99±0.03   0.99±0.03   0.99±0.02   0.98±0.03
35  0.87±0.04   0.75±0.10   0.85±0.06   0.76±0.10   0.70±0.10   0.84±0.08

#1  13          0           5           1           2           19
#2   9          2          12           2           2           12
#3   6          4          11           6           5           3

Results suggest that BUTIA and SMO achieved the best overall performances. The ranking provided by the Friedman test supports this assumption, showing SMO as the best-ranked method, followed by BUTIA, FT, C4.5, CART and OC1. The Friedman test also indicates the rejection of the null hypothesis, i.e., there is a statistically significant difference among the algorithms (p-value = 3.49 × 10^-21). Hence, we executed the Nemenyi post-hoc test for pairwise comparisons, as depicted in Table III. Notice that SMO outperforms all algorithms but BUTIA with statistical significance. BUTIA and FT, on the other hand, outperform C4.5, CART and OC1 with statistical significance, but neither outperforms the other. Roughly speaking, BUTIA is a better option than FT because it is not significantly outperformed by SMO. Moreover, the trees generated by FT (whose nodes, both internal and leaves, hold logistic regression models) are harder to interpret than the oblique trees provided by BUTIA, which is a clear advantage of our method.

BUTIA seems to be an interesting alternative to SMO, since it is more comprehensible than a standard SVM. Recall that, in an oblique tree, one can follow the linear models through the internal nodes until reaching a class label in a leaf node. This path from the root to the leaves can be seamlessly transformed into a set of interpretable rules. In FT, both internal and leaf nodes can hold logistic regression models, and leaves can hold more than one model for characterizing distinct classes. This collection of models can considerably harm interpretability. SVM, in turn, when trained with a linear kernel, is only interpretable in binary-class problems (it becomes equivalent to a regular linear model). In problems with more than two classes, SVM has to test the results of pairwise class combinations, deciding the final classification through a majority voting system. This kind of system also harms model comprehensibility, and it is one of the reasons why SVM is considered by many to be a black-box technique [20]. In summary, BUTIA was shown to be competitive with the state-of-the-art technique for gene expression analysis, SVM, with the further advantage of being a comprehensible model. The advantages of presenting a comprehensible model to the domain specialist (in this case, the biologist) are well described in [21], and they are easily generalizable to gene expression data.
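To make this concrete, a small sketch (reusing the hypothetical Node structure, root and X from the earlier Python sketches, with svm holding a fitted linear model) of how the root-to-leaf path taken by an instance can be read off as a conjunction of linear inequalities:

def path_to_rule(node, x):
    # Follow the oblique tests from the root to a leaf, collecting one inequality per node.
    conditions = []
    while node.svm is not None:                       # internal node: apply its hyperplane test
        w, b = node.svm.coef_[0], node.svm.intercept_[0]
        go_left = float(w @ x + b) <= 0
        lhs = " ".join(f"{wi:+.2f}*x{i}" for i, wi in enumerate(w))
        conditions.append(f"{lhs} {'<=' if go_left else '>'} {-b:.2f}")
        node = node.left if go_left else node.right
    return "IF " + " AND ".join(conditions) + f" THEN class = {node.label}"

print(path_to_rule(root, X[0]))   # prints a rule like: IF <linear inequality> AND ... THEN class = 0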

Table III. NEMENYI PAIRWISE COMPARISON RESULTS.

        BUTIA   C4.5   OC1   SMO   FT   CART
BUTIA     —
C4.5      N      —            N     N
OC1       N             —     N     N
SMO                           —
FT                            N     —
CART      N                   N     N     —

N - the algorithm in the column outperforms the one in the row with statistical significance at a 95% confidence level.

IV. RELATED WORK

Many works in the literature propose the top-down induction of oblique decision trees, and we briefly review some of them as follows. Classification and Regression Trees (CART) [7] is one of the first systems that allowed multivariate splits. It employs a hill-climbing strategy with backward attribute elimination for finding good (albeit suboptimal) linear combinations of attributes in non-terminal nodes. It is a fully deterministic algorithm with no built-in mechanism to escape local optima.

Simulated Annealing of Decision Trees (SADT) [3] is a system that employs simulated annealing (SA) for finding good coefficient values for attributes in non-terminal nodes of decision trees. First, it places a hyperplane in a canonical location, and then iteratively perturbs the coefficients by small random amounts, guided by the SA algorithm. Although SADT can eventually escape from local optima, its efficiency is compromised, since it may consider tens of thousands of hyperplanes in a single node during annealing. Oblique Classifier 1 (OC1) [9] is yet another top-down oblique decision tree system. It is a thorough extension of CART's oblique decision tree strategy. OC1 has the advantage of being more efficient than the previously described systems. It searches for the best univariate split as well as the best oblique split, and it only employs the oblique split when it improves over the univariate split. It uses both a deterministic heuristic search (as employed in CART) for finding local optima and a non-deterministic search (as employed in SADT, though not SA) for escaping local optima. Ittner [22] proposes using OC1 over an augmented attribute space, generating non-linear decision trees. The key idea is to "build" new attributes by considering all possible pairwise products and squares of the original set of n attributes.

A more recent approach for top-down oblique trees that also employs SVM for generating hyperplanes is SVM-ODT [23]. It is an extension of the original C4.5 algorithm, allowing the use of either a univariate split or an SVM-generated oblique split, according to the impurity-measure values. Several methods are applied to avoid model overfitting, such as reduced-error pruning and MDL computation, resulting in an overly complex algorithm. Model overfitting is directly related to the very nature of the top-down strategy. Another recently explored strategy for inducing oblique decision trees is through evolutionary algorithms (EAs). The interested reader can refer to [24] for works that employ EAs for generating the hyperplanes in top-down oblique tree inducers.

Bottom-up induction of decision trees has received very little attention in the research community. The first work to present the concepts of bottom-up induction is Landeweerd et al. [10], in which the authors propose growing a decision tree from the leaves to the root, assuming that each class is represented by a leaf node and that the most similar nodes are recursively merged into a parent node. This approach is too simplistic, because it allows only a single leaf per class, which means that binary-class problems will always be modeled by a 3-node tree. We believe this deficiency is the main reason researchers have been discouraged from further investigating the bottom-up induction of decision trees.

The bottom-up strategy has otherwise been employed only for pruning decision trees [25] or for evaluating Alternating Decision Trees (ADTree) [26], topics that are not investigated in this paper. Our approach, presented in Section II, extends the work of Landeweerd et al. [10] in such a way that it can be effectively applied to distinct problem domains. In particular, we solve the one-leaf-per-class problem by performing clustering within each class of the problem, and we solve the hyperplane generation task by employing SVMs in each internal node.

V. CONCLUSIONS AND FUTURE WORK

In this work, we have presented a novel bottom-up oblique decision tree induction algorithm, named BUTIA, which makes use of the well-known machine learning algorithms EM [14] and SMO [12]. Due to its bottom-up strategy for building the tree, BUTIA presents some interesting advantages over top-down algorithms, such as robustness to imbalanced data and to data overfitting. Indeed, BUTIA does not require the further execution of a pruning procedure, which is the case for virtually every top-down decision tree algorithm. The use of SVM-generated hyperplanes within internal nodes guarantees that each hyperplane maximizes the boundary margin between instances from different classes, i.e., it guarantees convergence to the globally optimal solution for each node division. This is not true for the optimization techniques employed in algorithms such as OC1 [9], CART [7] and SADT [3], which do not guarantee convergence to global optima.

We have tested BUTIA on 35 gene expression benchmark datasets [15]. Experimental results indicated that BUTIA outperformed traditional algorithms, such as C4.5 [8], CART [7] and OC1 [9], with statistical significance regarding accuracy, according to the Friedman and Nemenyi tests, as recommended in [19]. BUTIA is competitive with SMO [12], since there was no significant difference between the two methods regarding accuracy. Since BUTIA is just as interpretable as any oblique decision tree, it can be seen as an interpretable alternative to standard SVM (a technique known to be a black-box approach [20]), which is currently considered the state-of-the-art algorithm for classifying gene expression data [17].

This work has opened several avenues for future research. We plan to investigate the impact of using different clustering algorithms for the generation of the leaves, as well as different strategies for automatically defining the number of clusters (leaves) for each class. In addition, we intend to test different methods for generating the separating hyperplanes, such as Fisher's linear discriminant

analysis (FLDA). The reason we did not employ FLDA in the first place is its restriction of dealing exclusively with linearly separable problems. Nevertheless, we believe that in certain applications, in which a low computational cost is required, FLDA may be a useful replacement for SVM. Finally, it is interesting to investigate the performance of BUTIA in more complex problems, such as hierarchical multilabel classification [27].

ACKNOWLEDGEMENT

Our thanks to the Brazilian research agencies CAPES, CNPq and FAPESP for supporting this research.

REFERENCES

[1] T. M. Mitchell, Machine Learning. McGraw-Hill, 1997.

[2] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison-Wesley, 2005.

[3] D. Heath, S. Kasif, and S. Salzberg, "Induction of oblique decision trees," J ARTIF INTELL RES, vol. 2, pp. 1–32, 1993.

[4] S. K. Murthy, S. Kasif, and S. S. Salzberg, "A System for Induction of Oblique Decision Trees," J ARTIF INTELL RES, vol. 2, pp. 1–32, 1994.

[5] L. Rokach and O. Maimon, "Top-down induction of decision trees classifiers - a survey," IEEE T SYST MAN CY C, vol. 35, no. 4, pp. 476–487, 2005.

[6] P. Utgoff and C. Brodley, "An incremental method for finding multivariate splits for decision trees," in 7th Int. Conf. on Machine Learning, 1990, pp. 58–65.

[7] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Wadsworth, 1984.

[8] J. R. Quinlan, C4.5: programs for machine learning. San Francisco, CA, USA: Morgan Kaufmann, 1993.

[9] S. K. Murthy, S. Kasif, S. Salzberg, and R. Beigel, "OC1: A Randomized Induction of Oblique Decision Trees," in AAAI, 1993, pp. 322–327.

[10] G. Landeweerd, T. Timmers, E. Gelsema, M. Bins, and M. Halie, "Binary tree versus single level tree classification of white blood cells," PATTERN RECOGN, vol. 16, no. 6, pp. 571–577, 1983.

[11] C. Cortes and V. Vapnik, "Support-vector networks," MACH LEARN, vol. 20, pp. 273–297, 1995.

[12] J. C. Platt, Fast training of support vector machines using sequential minimal optimization. Cambridge, MA, USA: MIT Press, 1999, pp. 185–208.

[13] J. Gama, "Functional trees," MACH LEARN, vol. 55, pp. 219–250, 2004.

[14] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," J R STAT SOC, vol. 39, pp. 1–38, 1977.

[15] M. de Souto, I. Costa, D. de Araujo, T. Ludermir, and A. Schliep, "Clustering cancer gene expression data: a comparative study," BMC BIOINF, vol. 9, p. 497, 2008.

[16] A. L. Tarca, R. Romero, and S. Draghici, "Analysis of microarray experiments of gene expression profiling," AM J OBSTET GYNECOL, vol. 195, pp. 373–388, 2006.

[17] A. Statnikov, L. Wang, and C. Aliferis, "A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification," BMC BIOINF, vol. 9, no. 1, p. 319, 2008.

[18] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, October 1999.

[19] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," J MACH LEARN RES, vol. 7, pp. 1–30, 2006.

[20] A. Navia-Vázquez and E. Parrado-Hernández, "Support vector machine interpretation," NEUROCOMP, vol. 69, no. 13-15, pp. 1754–1759, 2006.

[21] A. A. Freitas, D. C. Wieser, and R. Apweiler, "On the importance of comprehensible classification models for protein function prediction," IEEE ACM T COMPUT BI, vol. 7, pp. 172–182, January 2010.

[22] A. Ittner, "Non-linear decision trees-NDT," in 13th Int. Conf. on Machine Learning, 1996, pp. 1–6.

[23] V. Menkovski, I. Christou, and S. Efremidis, "Oblique decision trees using embedded support vector machines in classifier ensembles," in 7th IEEE Int. Conf. on Cybernetic Intelligent Systems, Sept. 2008, pp. 1–6.

[24] R. C. Barros, M. P. Basgalupp, A. C. P. L. F. Carvalho, and A. A. Freitas, "A survey of evolutionary algorithms for decision tree induction," to appear in IEEE T SYST MAN CY C, pp. 1–21, 2011.

[25] F. Esposito, D. Malerba, and G. Semeraro, "A Comparative Analysis of Methods for Pruning Decision Trees," IEEE T PATTERN ANAL, vol. 19, no. 5, pp. 476–491, 1997.

[26] B. Yang, T. Wang, D. Yang, and L. Chang, "BOAI: Fast Alternating Decision Tree Induction Based on Bottom-Up Evaluation," in LNCS. Springer, 2008, pp. 405–416.

[27] R. Cerri and A. C. P. L. F. Carvalho, "Hierarchical multilabel classification using top-down label combination and artificial neural networks," in XI SBRN, 2010, pp. 253–258.
