Neural Networks, Decision Tree Induction and Discriminant Analysis: An Empirical Comparison

STEPHEN P. CURRAM and JOHN MINGERS

University of Warwick, UK

Correspondence: J. Mingers, Warwick Business School, University of Warwick, Coventry CV4 7AL, UK

This paper presents an empirical comparison of three classification methods: neural networks, decision tree induction and linear discriminant analysis. The comparison is based on seven datasets with different characteristics, four being real and three artificially created. Analysis of variance was used to detect any significant differences between the performance of the methods. There is also some discussion of the problems involved with using neural networks and, in particular, of overfitting of the training data. A comparison between two methods of preventing overfitting is presented: finding the most appropriate network size, and the use of an independent validation set to determine when to stop training the network.

Key words: classification, decision-trees, discriminant analysis, induction, neural networks

INTRODUCTION

In a recent paper, Hart¹ described some experiments in using neural networks for classification tasks; that is, classifying entities correctly into known groups. The networks were compared with induced decision trees (formerly known as rule induction) on two sets of artificially-generated data. Similarly, Yoon et al.² compared neural networks with discriminant analysis on a single set of real data. This paper reports more extensive experimentation involving more datasets and all three techniques. Seven different datasets are used, four real and three artificial (including two used by Hart). In the first section the different methods are explained briefly, and then the datasets and the experimental methodology are described. Some attention is paid to the important problem of producing effective network structures. The final section reports the results, then conclusions and recommendations for future work are made.

METHODS FOR CLASSIFICATION

This section describes the three classification techniques and the way in which they were used. Reference is made to replications of the data. As described in the section on experimental procedure, each dataset was split into a training and testing set. This split was repeated nine times for each set of data, and a replication refers to one of these testing/training combinations.

Neural networks

Neural networks are a family of methods that use a number of simple processors which are linked together to 'learn' the relationships between sets of variables. Neural networks have been used as models for investigating the way that the brain works, but have also been exploited for their mathematical properties as signal processors and non-linear statistical models³⁻⁵. A back-propagation network⁶ (also known as the multilayer perceptron) was chosen for these experiments. It is a layered network with an input layer, hidden layers and an output layer (see Figure 1) where neurons (nodes) in one layer are linked to neurons in the next layer by weighted connections.

[FIG. 1. A 3:2:3 back-propagation network, showing the input layer, hidden layer and output layer.]

Neurons in the input layer simply pass on the values of the input variables to the next layer of the network through the weighted connections. Neurons in the hidden and output layers sum all the weighted input signals coming to them, add a bias term (which regulates the sensitivity of the neuron) and then pass the net result through a sigmoid function, producing an 'activation' between 0 and 1. In the case of the hidden layers, this activation is passed on to neurons in the next layer via the weighted connections. For the output layer, the activations represent the network's output in response to the original input variable values.

The back-propagation network uses supervised learning, in which the examples are presented to the network along with their target outputs. Learning is done by comparing the network output with the target outcome for each example and propagating these errors back through the network. The weights of the connections between neurons are changed using a steepest descent method to try to reduce the sum of squared errors for the example set. The size of the weight change steps is controlled by a gain parameter, and a degrading momentum term helps to push changes in a direction which has been historically beneficial.

The learning procedure used differs from that used by Hart¹ in that it incorporates adaptations to the learning algorithm as suggested by Vogl et al.⁷. This uses a gain term (step size for connection weight changes) which is not fixed, but which is allowed to vary depending on the success of learning. Where there is an improvement in the total sum of squared errors (i.e. the network has improved) the step size is allowed to increase, representing an increase in the confidence in the direction of learning. If the total sum of squares worsens by more than some small proportion then the step size is reduced, the momentum term is ignored (i.e. past weight changes are not allowed to affect current weight changes) and the weight changes from this iteration are not used. The momentum term is switched back on when a successful learning iteration occurs. These adaptations have the effect of often speeding up the learning process, and also seem to reduce the chance of convergence to poor local minima solutions (a problem which can be encountered when using the steepest descent method).

The back-propagation network was used because of its flexibility. It is able to cope with binary or real valued inputs and outputs (scaled from 0 to 1). For the datasets described in this paper, both real and binary input values were used, the real values being scaled linearly between 0 and 1, while all the outputs were binary valued classes. The disadvantage of back-propagation is that, by using the steepest-descent method to determine weight changes, it can converge to local minima (although the chance of this is reduced by the use of the adapted gain term described above) and the rate of learning can be slow, particularly as the minimum error level is approached.
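To make the learning procedure concrete, the following is a minimal sketch (not the authors' implementation) of a single-hidden-layer back-propagation network trained by batch steepest descent on the sum of squared errors, with a Vogl-style adaptive gain: the step size grows after a successful iteration, while a sufficiently bad iteration is rejected, the momentum is switched off and the gain is reduced. The class name, initial weights and parameter values are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class BackpropNet:
    """Single-hidden-layer back-propagation network with a Vogl-style
    adaptive gain and a momentum term that is dropped after a rejected step."""

    def __init__(self, n_in, n_hidden, n_out, gain=0.1, momentum=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.5, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)
        self.gain, self.momentum = gain, momentum
        self.vel = [np.zeros_like(p) for p in self.params()]

    def params(self):
        return [self.W1, self.b1, self.W2, self.b2]

    def forward(self, X):
        h = sigmoid(X @ self.W1 + self.b1)   # hidden activations in (0, 1)
        y = sigmoid(h @ self.W2 + self.b2)   # output activations in (0, 1)
        return h, y

    def sse(self, X, T):
        return float(np.sum((self.forward(X)[1] - T) ** 2))

    def train_epoch(self, X, T, tolerance=1.04, grow=1.05, shrink=0.7):
        h, y = self.forward(X)
        err_before = np.sum((y - T) ** 2)
        # Back-propagate the squared-error gradient through both sigmoid layers.
        d_out = (y - T) * y * (1.0 - y)
        d_hid = (d_out @ self.W2.T) * h * (1.0 - h)
        grads = [X.T @ d_hid, d_hid.sum(axis=0), h.T @ d_out, d_out.sum(axis=0)]
        old = [p.copy() for p in self.params()]
        # Steepest-descent step with momentum.
        for p, g, v in zip(self.params(), grads, self.vel):
            v *= self.momentum
            v -= self.gain * g
            p += v
        if self.sse(X, T) <= tolerance * err_before:
            self.gain *= grow                 # successful step: increase the gain
        else:
            for p, o in zip(self.params(), old):
                p[:] = o                      # reject the weight changes entirely
            self.vel = [np.zeros_like(p) for p in self.params()]  # drop momentum
            self.gain *= shrink               # reduce the step size
        return self.sse(X, T)

One full pass over the example set corresponds to one learning iteration in the description above; the rejection branch approximates "ignoring" the momentum term by resetting the stored weight-change history.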

The size and shape of a network is critical to the way in which it can perform. The size of the input and output layers is dictated largely by the number of input variables and output classes, although the outputs can be coded in different ways. The output classes could be coded into binary numbers so that each output node represents one binary digit (i.e. 1 = 01; 2 = 10; 3 = 11). Alternatively, each of the classes could be represented by one node, so that a class could be identified by which node has the value 1, with the others all equal to 0. The choice of representation makes little difference to how well the network learns for the datasets we used. The former method has the advantage of requiring fewer nodes (forming a smaller and slightly faster network), while the latter is easier to read. The former binary number representation was used for all the datasets except for the wave-form data, which used one node for each class so as to be consistent with the approach used by Hart¹. Since the output from the network can be in the range 0 to 1, our final classifications were made using a winner-takes-all criterion, where values below 0.5 are rounded down to 0, and values of 0.5 and above are rounded up to 1. In practice, it is often useful to highlight classifications where one or more of the outputs are not strongly classified (say in the range 0.3 to 0.7) as being 'unsure' so they may be reviewed further.

Setting the number and size of hidden layers is more difficult. Hidden layer(s) are used to represent non-linearities or interactions between variables. The more complex the interactions, the more hidden units are required. If too few hidden units are used, the network may fail to learn. If too many hidden units are used, the data may be overfitted (fitting to individual data points rather than the trend), and this consequently reduces the network's ability to generalize for new examples⁸,⁹. The aim then is to use the minimum number of hidden units able to learn (without converging to poor local minima) so that the generalizing capabilities of the network are optimized. There are no hard and fast rules for deciding the number of hidden units that are required for learning to occur unless the nature of the relationships in the data is fully understood¹⁰. Past experience with similar datasets can help to choose the approximate size of the network, but a certain amount of trial and error is required to find a sensible network size.

An alternative approach is to use an independent validation step in the training of the network¹¹. Here, a network is used which has sufficient hidden units to ensure that learning takes place, but has not been streamlined to stop overfitting from occurring. At the end of each learning iteration, the network is tested on an independent validation data set and training stopped where the error is minimized for the validation set. The argument for this approach is that the underlying model lies closer to the network's initial starting point than a model which has been overfitted to the individual data points. Since the steepest descent method tries to minimize the fitting error in as short a distance from the starting point as possible, it will fit the underlying structure before the detail of the individual data points. Thus, stopping the network at the right time can prevent overfitting from occurring. The comparisons with LDA and decision tree induction are done using the networks with an optimized number of hidden units. Further experiments were then done using the independent validation step method, and the results from these compared with those obtained from the optimized network sizes.
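As an illustration of the independent validation step, the following sketch (building on the hypothetical BackpropNet above, whose train_epoch and sse methods are assumptions) trains an over-sized network but keeps the weights that give the lowest error on the validation set, stopping once no improvement has been seen for a fixed number of iterations, as in the experimental procedure described later.

import copy
import numpy as np

def train_with_validation(net, X_train, T_train, X_val, T_val,
                          patience=200, max_epochs=5000):
    """Keep the network weights that minimise the validation error and stop
    training once `patience` iterations have passed with no improvement."""
    best_err, best_net, since_best = np.inf, copy.deepcopy(net), 0
    for _ in range(max_epochs):
        net.train_epoch(X_train, T_train)    # one learning iteration
        val_err = net.sse(X_val, T_val)      # error on the validation set
        if val_err < best_err:               # improvement: save this network
            best_err, best_net, since_best = val_err, copy.deepcopy(net), 0
        else:
            since_best += 1
        if since_best >= patience:           # overfitting has set in: stop
            break
    return best_net, best_err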

Linear discriminant analysis (LDA)

Linear discriminant analysis, developed by Fisher¹², is the classic method for this classification task. It is theoretically optimal for situations where the underlying populations are multivariate normal and where all the different groups have equal covariance structures. Such populations can be well separated by straight lines and planes. With multivariate populations having unequal covariance structures, quadratic discriminant analysis can be used. These assumptions are very restrictive in that they apply, in full, only rarely in practice. However, LDA has been found to be very robust to deviations from them and works well with many 'well-behaved' datasets. It works poorly where the groups form substantially non-linear shapes, as in Hart's example (Reference 1, p. 217, Fig. 3). In using LDA, it would normally be the practice to remove independent variables which were not significant, in a similar manner to multiple regression, using, for example, a stepwise technique. That was not done in this case, partly because of the effort of tuning LDA to the 63 different datasets, and partly because it would have given an advantage over the other methods.

To check the effect of this, one dataset (Babs) did have each replication analysed separately using stepwise analysis, but this made no improvement to the results. There are other forms of discriminant analysis which may cope better with such data, in particular non-parametric methods such as kernel DA and nearest neighbour DA, which make no assumptions about the underlying populations. These are, however, essentially heuristic procedures which do not provide consistent performance.
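For readers who want to experiment, the following is a small modern illustration of linear discriminant analysis using scikit-learn rather than the SAS procedure the authors used; the two-Gaussian data are a toy stand-in for the 'well-behaved' case (equal covariance structures) in which LDA is theoretically optimal.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Two classes drawn from Gaussians with a shared covariance matrix.
X0 = rng.multivariate_normal([0, 0], np.eye(2), 200)
X1 = rng.multivariate_normal([2, 2], np.eye(2), 200)
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
print("test error rate:", 1 - lda.score(X_te, y_te))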

Induced rule trees

Several approaches to inductive learning have been developed¹³⁻¹⁵, and one of these involves the construction of decision trees. Based on initial work by Hunt et al.¹⁶, Quinlan¹⁷ developed the ID3 algorithm for deterministic problems such as chess endgames. At the same time, Breiman et al.¹⁸ were developing a similar approach to classification problems, which is known as CART (classification and regression trees). Research then focused on the use of such methods in domains where the data are not deterministic but uncertain (Quinlan¹⁹, Niblett²⁰, Mingers²¹,²²). Uncertainty in the data may be due to noise in the measurements or to the presence of factors which cannot be measured.

There are three phases to rule induction with uncertain data: first, creating an initial, large rule tree from the set of examples; second, pruning this tree to remove branches with little statistical validity; and, third, processing the pruned tree to improve its understandability. Mingers²³ compared several measures for tree creation in terms of the size and classificatory accuracy of the trees produced. The study concluded that there was little difference between the methods, and that their use reduced the size of a tree rather than improved its accuracy. A further paper²⁴ reported comparisons between different pruning methods, concluding that there were differences between methods in terms of both accuracy and reliability. For the experiments reported in this paper, the best combination of measure (G-statistic with Marshall correction) and pruning method (Breiman's error-complexity) was used.
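The following sketch illustrates the induce-then-prune idea with a modern library as a rough analogue: grow a large tree, then apply Breiman's error-complexity (cost-complexity) pruning and select the pruned subtree on a separate pruning set. The splitting measure (Gini rather than the G-statistic with Marshall correction) and the synthetic data are substitutions, not the authors' procedure.

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=7, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.6, random_state=0)
X_prune, X_test, y_prune, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# 1. Grow a full, unpruned tree on the training set.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 2. Generate the nested sequence of cost-complexity subtrees and keep the
#    one that classifies the independent pruning set best.
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in alphas),
    key=lambda t: t.score(X_prune, y_prune),
)

print("leaves before/after pruning:", full_tree.get_n_leaves(), best.get_n_leaves())
print("test error rate:", 1 - best.score(X_test, y_test))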

EXPERIMENTAL COMPARISON OF THE METHODS

The experimental design compared the three classification methods on seven datasets with widely differing characteristics.

Data sets

The experiments used seven datasets, four from natural domains and three artificially constructed. These datasets cover a wide range of conditions, including degree of noisiness and residual variation, number of classes and attributes, mix of integer and discrete attributes, and variation in the distribution of classes.

BA Business Studies degree student profiles (Babs)

These data relate various attributes of each student on entry to the course to the final class of degree achieved. There are 186 observations with seven attributes: age (years), type of entry qualification (A-level, BTEC Ordinary National Diploma, or some other), sex (male/female), number of O-levels, number of points at A-level (0-20), grade of maths O-level (A, B, C, Fail), and full-time employment before the course (yes/no). There are four possible classes of degree: first, upper second, lower second, or third. Three of the attributes are integer and four categorical. There is no known noise, but there are many other factors affecting the results that have not been (and probably could not be) measured, giving high residual variation.

The recurrence of breast cancer (Cancer)

These data, containing 286 examples, derive from those used in Bratko and Kononenko²⁵ and concern the recurrence of breast cancer. There are two classes (recur or not recur) and

nine attributes, of which four are integer. These include age, tumour size, number of nodes, malignant (yes/no), age of menopause (<60, >=60, not occurred), breast (left, right), radiation treatment (yes/no), and quadrant of breast (left, right, top, bottom, centre). There are both missing data and residual variation.

Classifying types of Iris (Iris)

Kendall and Stewart²⁶ use these data as a test of discriminant analysis. There are 150 examples of three different varieties of iris, with roughly equal numbers of each. The four integer attributes are measurements such as petal length and petal width, from which the examples can be classified. There is little noise or residual variation.

Recognizing LCD display digits (Digits)

This is an artificial domain suggested by Breiman¹⁸. A digit in a calculator display consists of seven lines, each of which may be on or off. Thus, there are ten classes (one for each digit) and seven binary-valued attributes (one for each line). Residual variation is introduced by assuming that a malfunction leads to a 10% chance of a line being incorrect. Such errors affect the attributes but not the class. Note that the chance of an example being completely correct is (0.9)⁷ ≈ 0.48. 3000 cases were randomly generated. In Hart's paper¹ another version is also considered in which 17 random 'noise' variables are included.

Predicting soccer results (Footb)

These data contain the results of 346 British league soccer matches. There are three classes (win, lose, draw), and five integer attributes measuring the past performance of the teams. There is little noise, but a high degree of residual variation.

Waveform data (Wave)

This is an artificial dataset developed by Breiman¹⁸ and used also by Hart¹; it is described more fully in both. Briefly, it consists of three different wave shapes formed from 21 continuous variables whose values have had random values added to them. A set of 1000 examples was generated. In Hart's paper another version is also considered in which 19 extra random 'noise' variables are included.

Sphere data (Sphere)

This is another artificial dataset, generated for these experiments, specifically designed not to be amenable to traditional discriminant analysis. It consists of points either inside or outside a sphere. There are thus two classes and three continuous variables. There are roughly equal numbers of each class and no noise or residual variation. A set of 1000 examples was used.

Experimental procedure

The procedure splits each dataset randomly into one training and two testing sets in the proportions 60%, 20%, 20%. Two testing sets are needed since the pruning method makes use of a separate set to select the final pruned tree, and the neural network approach makes use of it to find the best network size. This would bias the results in favour of these two methods if the same set were used for the accuracy measurements. As a single random split of the data may give unrepresentative results, the whole procedure is repeated nine times, giving nine different groups of training and testing data for each different domain.
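A minimal sketch of this splitting scheme (a hypothetical helper, not the authors' code): each dataset is divided at random into 60% training, 20% tuning/pruning and 20% final test data, and the whole split is repeated nine times.

import numpy as np

def make_replications(X, y, n_replications=9, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    reps = []
    for _ in range(n_replications):
        idx = rng.permutation(n)
        n_train, n_test = int(0.6 * n), int(0.2 * n)
        train = idx[:n_train]
        test1 = idx[n_train:n_train + n_test]   # used to tune pruning / network size
        test2 = idx[n_train + n_test:]          # held out for the accuracy comparison
        reps.append((train, test1, test2))
    return reps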

Neural network training

The main method used for comparison was to find the number of hidden units which optimized the classification results on a second test set (separate from the one used in the final analysis). The first step was to find roughly the number of hidden units which were able to learn effectively the structure of the data, without too much overfitting. This was done by

considering only one of the replications from each dataset (to save time) and experimenting with a number of different network sizes. Some past knowledge helped find the starting point. For instance, knowing that the classes for the Iris data are largely linearly separable suggested that the number of hidden units needed to be small, while previous experiments with data classifying points inside or outside a circle suggested that a single hidden layer with eight units might be sufficient for the Sphere data (a three-dimensional version of the circle problem). Once an approximate network size had been found, all nine replications were trained using this network size and slightly smaller and larger networks to find the optimal network size (using the second test set). If the larger or smaller network turned out to produce a better result, then the next size in that direction was tried, and so on until the best network size was found. In some cases, different network sizes were found to be best for different data replications; however, in general, the differences were small and the results presented later refer to a single network size for each of the main datasets. In all cases the networks were allowed to learn to their full potential (i.e. including any overfitting).

Further experiments were done on the seven datasets using an independent validation step and the results compared with those for the 'optimal' network size. Training was stopped when the network achieved its best percentage of mis-classifications, with the lowest mean square error being used to distinguish between results with the same percentage of mis-classifications. In practice, the network (the weights and biases) was saved when there was an overall improvement in the error rate, and training stopped when no improvement had occurred for 200 iterations, with the last saved network being used as the final result.
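The incremental search for the hidden-layer size described at the start of this subsection can be sketched as follows; evaluate is a hypothetical helper that trains networks of a given hidden size on all nine replications and returns the mean error on the second (tuning) test set.

def find_best_hidden_size(evaluate, start):
    """Hill-climb from an initial guess: keep moving the hidden-layer size in
    whichever direction improves the tuning-set error, and stop otherwise."""
    best_h, best_err = start, evaluate(start)
    for step in (+1, -1):                      # try growing, then shrinking
        h = best_h + step
        while h >= 1:
            err = evaluate(h)
            if err < best_err:
                best_h, best_err = h, err      # keep moving in this direction
                h += step
            else:
                break
    return best_h, best_err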

Analysis of results

The results (in Table 3 later) show the mean error rate, i.e. the percentage of mis-classifications, for each combination of method and dataset, averaged across the nine replicated sets. These results were then analysed using Analysis of Variance (ANOVA) to detect statistically significant differences within each factor and also interactions between them. From a statistical point of view, ANOVA assumes that all the data combinations have equal variance, although it is robust to violations of this assumption²⁷. In our case particular datasets have larger variances. A variance-stabilizing transformation was applied, but the results were so similar that the original data were used for ease of interpretation. An ANOVA test was also used to analyse the difference between the results from the networks with an 'optimized' number of nodes in the hidden layer and those from the networks with independent validation, using the same approach as described above.
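A sketch of such a two-way ANOVA on replicated error rates, using statsmodels on dummy data; the column names and the synthetic error values are assumptions, but the model formula (method, dataset and their interaction) mirrors the analysis described above.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
methods = ["LDA", "Induction", "Neural"]
datasets = ["Wave", "Sphere", "Iris"]

# One row per (method, dataset, replication): nine replications of dummy
# error rates; in the study these came from the nine train/test splits.
rows = [(m, d, 20 + 5 * i + 3 * j + rng.normal(0, 2))
        for i, m in enumerate(methods)
        for j, d in enumerate(datasets)
        for _ in range(9)]
results_df = pd.DataFrame(rows, columns=["method", "dataset", "error"])

model = smf.ols("error ~ C(method) * C(dataset)", data=results_df).fit()
print(anova_lm(model, typ=2))   # F tests for method, dataset and interaction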

RESULTS

Efficiency of using the classification methods

Before looking at the results of the classifications, it is worth discussing the time efficiency of the three methods. It should be noted that the timings are dependent on the type of machine used and are given to illustrate the order of magnitude of processing time. The neural network was the most time consuming of the methods. Finding sensible network sizes required running networks of different sizes and comparing the results. This took differing amounts of time for each of the datasets depending on the size of the networks (for computational speed) and the appropriateness of the initial size estimate (for the number of re-runs required). Once a sensible network size had been found, learning times were still slow, ranging from 1 hour for the Iris data to 18 hours for the Waveform data to run the nine replications of each dataset on a 33 MHz 486 PC. However, it must be pointed out that a large number of iterations were used to ensure that training had been done to its full potential. For the neural network approach using the extra independent validation step, there were generally far fewer iterations required (see Table 4 later), so that while each iteration took longer (needing to do the extra validation step), total learning times ranged from 30 minutes to 3 hours.

With decision tree induction the time depends on the number of examples, how easily they can be classified and the pruning method. In this case, times for the nine replications ranged from 20 to 40 minutes on a 386 PC. The LDA was done using the SAS Discrim function²⁸ with macros to repeat the process for the nine replications of each dataset. Running times were relatively fast, ranging from 5 to 30 minutes for the nine replications of each dataset on a 386 PC. It is worth noting, though, that stepwise discriminant analysis would have considerably increased the amount of work to be done for all the datasets.

Size of network and decision tree

Table 1 shows the size of neural network used for each of the datasets. The size of the input and output layers is determined by the number of variables and classes in the dataset respectively. The number of hidden nodes was chosen to be sufficient to model the data, but small enough to prevent overfitting on the example data. The size of the hidden layer is an indication of the level of interaction between the input variables, although the extra hidden nodes may also be required to overcome strong local minima in the data. Table 1 indicates that there is a relatively high degree of interaction for the Sphere, Wave and Digit data, but little for the other datasets.

TABLE 1. Optimal neural network sizes (Input*Hidden*Output)

Wave      Sphere   Digit    Babs    Cancer   Iris    Footb
21*10*3   3*8*1    7*10*4   9*1*2   13*1*1   4*2*2   4*2*2

Table 2 shows the size of the decision trees before and after pruning, averaged across the replications. The size before is determined by the difficulty of classification and the number of observations. The size after pruning depends on the number of classes and the statistical reliability in the data: the less reliable, the more the tree will be pruned. Babs, Cancer and Footb have large initial trees and small pruned ones. This shows that there is a high degree of residual variation in these datasets. Digit is large after pruning partly because it has a large number of classes. Sphere is still very large after pruning, reflecting the fact that it is trying to approximate a sphere with a series of straight lines. Iris is small because it is easy to split the classes linearly on the variables.

TABLE 2. Mean size of induced decision trees (No. of leaves)

                      Wave   Sphere   Digit   Babs   Cancer   Iris   Footb
Original                87       88      57     69       63      8      94
Pruned                  13       39      13      2        3      3       2
No. of Classes           3        2      10      4        2      3       3
No. of Observations   1000     1000    3000    186      286    150     346

Classification performance of the three methods

Table 3 gives the main results of the experiment. It shows, for each combination of method and dataset, the percentage of errors on an independent test set averaged across the nine replications. The bracketed figures are the corresponding standard deviations. Overall, it can be seen that the datasets vary considerably in their predictability, from the Iris data with an overall average error of 5%, to the Football data with 46%, which is hardly better than chance. The methods also differ in overall terms, with neural networks having an average of 23% and LDA 29%. However, these averages are affected by interactions between the methods and particular datasets. The standard deviations reveal the degree of variability between replications and thus the precision of the estimates of the mean. As can be seen, in some cases (for example, Babs and Cancer) it is quite high, and this shows how important it is to employ replications.

TABLE 3. Mean error rates across differing methods and datasets (%)

            Wave        Sphere      Digit       Babs         Cancer      Iris       Footb       All
LDA         15.0 (2.4)  53.4 (4.5)  24.4 (1.4)  35.6 (6.5)   28.1 (5.7)  3.1 (2.9)  45.4 (5.4)  29.3
Induction   28.5 (4.6)  14.3 (3.1)  26.2 (2.4)  42.1 (7.4)   29.3 (4.2)  7.9 (3.0)  44.1 (8.4)  27.5
Neural      13.6 (1.1)   5.6 (1.2)  26.3 (1.7)  36.1 (10.7)  28.5 (5.9)  4.6 (3.5)  48.2 (4.9)  23.3
All         19.1        24.4        25.6        37.9         28.6        5.2        45.9        26.7

(Standard deviations shown in brackets)

If only a single point estimate, based on a single split of the data, is provided (as in Yoon's results) it may be significantly different from the true mean. In fact, the ideal situation would be to draw a number of independent samples from the population instead of repeatedly splitting the sample. In our case this would have been possible for the artificial datasets but not for the real datasets and so, for consistency, we adopted the replication approach.

Looking at the results in more detail using ANOVA, there are significant differences between the datasets (as would be expected), significant differences between the methods (F2,168 = 24.0), and significant interactions (F12,168 = 40.7). Looking at particular interactions, the main effects are induction with the Wave data, and both LDA and networks with the Sphere data. Induction is poor in comparison with the other methods on the wave data, with an error rate of 28% as opposed to 15% and 14%. This is, in fact, the same error rate achieved by Breiman¹⁸ on this type of data and so replicates his results. It could be said that LDA and networks do very well, matching the notional minimum Bayesian error rate¹,¹⁸. Some attempts were made to improve this figure by trying alternative pruning methods, and it was possible to reduce it to 24% using pessimistic pruning, but this is still worse than the other methods. The Sphere data was designed to be highly non-linear and so proved to be hard for LDA but was coped with well by the networks. The 53% error rate for LDA is no better than chance, reflecting the impossibility of separating the sphere from non-sphere with a plane. Networks did very well with an error rate of just 6%, and induction did reasonably with 14%, although it generated large trees. Taking the size of the neural network as an indication of the level of interaction between the input variables (and subsequently an indication of the level of non-linearity), a comparison between the size of network (Table 1) and the mean error rates shows that LDA tends to do better than the neural networks where the level of non-linearity is low, but that neural networks do better when there is a greater degree of non-linearity.

The main purpose of the experiments was to test for overall differences between the methods. The initial results show LDA worst; however, this difference is influenced by the extreme values on the Sphere dataset. On the other sets the two methods performed similarly well. A further ANOVA was carried out without the Sphere results. This still produced significant (but smaller) method and interaction effects, but now LDA was slightly better than networks. Induction was clearly worst, with a mean of 30% and the poorest results on the majority of the datasets. A final comparison of LDA and networks without induction was performed. This showed no significant method (F1 = 0.97) or interaction (F = 0.36) effects. Thus, on all but the Sphere data, LDA and neural networks gave indistinguishable levels of performance.

Performance of neural networks with independent validation

The comparison between the standard networks and those using the independent validation method is shown in Table 4. The table shows the average percentage of mis-classifications across the replications of each dataset, the size of network used in each case, and the number of iterations used in training.

TABLE 4. Comparison of optimal networks against networks with validation step

                                  Wave     Sphere   Digit    Babs    Cancer   Iris    Footb   All
Optimal      % Error rate         13.6     5.6      26.3     36.1    28.6     4.6     48.2    23.3
             Network size         21*10*3  3*8*1    7*10*4   9*1*2   13*1*1   4*2*2   4*2*2
             Iterations           2000     2000     1000     3000    2000     2000    5000
Validation   % Error rate         13.5     5.5      25.9     38.4    29.7     4.2     47.3    23.5
             Network size         21*15*3  3*12*1   7*15*4   9*6*2   13*4*1   4*6*2   4*6*2
             Average iterations   310      1262     80       38      150      856     291

In the case of the independent validation step, the figure for the number of iterations is the average for the nine replications of each dataset. In all cases the independent validation step was used with networks having more than the optimal number of hidden nodes, which would thus, without stopping the training, be subject to overfitting of the data. A comparison of the number of iterations used shows that, on average, far fewer iterations were required when using the validation step, although it should be pointed out that the number of iterations used for the optimal-sized networks did include a considerable safety margin to ensure that these networks were trained to their full potential. Even taking this safety margin into account, the validation step does afford considerable advantages in terms of training times.

It can be seen from the means that there is a similar performance from the two approaches, with the ANOVA showing no significant method (F1 = 0.05) or interaction (F6 = 0.22) effects. In the datasets where the validation approach does noticeably worse (Babs and Cancer) it is interesting to note that much lower error rates were observed for the validation data than on the test data (11.7% lower for Babs and 8.8% for Cancer), whereas the difference was much smaller for the other datasets. It is expected that the error rates for the validation set will be lower than for the test set, but the large differences suggest that, in some cases, the validation data were not wholly representative of the test data.

Comparison with previous results

In comparison with Hart's results¹, our results are more wide-ranging and have been analysed statistically, but there are some points of direct comparison, in particular the network results on the wave and digit data without the extra noise variables (Hart's problems 1 and 3). On the digit data Hart achieved a 29% error rate with a 20 hidden node network. We managed an improvement to a 26% error rate using a much smaller network with 10 hidden nodes. LDA also did better, with a 24% error rate. With the wave data we achieved 14% compared with Hart's 17% and LDA's 15%; again a significant improvement with a smaller network. Hart's results are likely to be subject to overfitting of the example data, caused by using more hidden nodes than necessary. The networks which we used were smaller and better able to generalize to the test data.

It is worth commenting on how these results for the Digit data compare with the supposed minimum Bayesian error rate. The theoretical figure is 26%, yet our results are lower. How does this come about? The Bayesian error rate is the best that can be achieved given complete knowledge of the prior distribution of classes and the multivariate distributions across all variables for all classes. For the digit data, the variables are independent of each other and we know the theoretical probability distributions for each class given the noise (0.9, 0.1 or 0.1, 0.9). We also know that the classes are equally likely. Given this, the theoretical value of 26%, as quoted in Breiman¹⁸ (and Hart¹), can be calculated. However, any particular sample of data will not exactly match these theoretical distributions and so will have a different actual Bayesian rate. A computer program has been written to calculate the true Bayesian rate for any particular sample. This has been used to explore the sensitivity to changes and to calculate the actual rate for our data. The general conclusions are as follows. (i) Changing the prior probabilities of the classes away from equal probability always reduces the Bayes error rate, as exactly equal priors are the hardest to predict. (ii) Changing the distribution of the variables has a potentially bigger effect in either direction: going from 0.9:0.1 to 0.95:0.05 reduces the error rate to about 18%, while going to 0.85:0.15 increases the error rate to about 31%. The actual calculated value for our Digit data is 23.8%, which is below our figures.
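The authors' program computes the sample-specific Bayes rate; the sketch below only reproduces the theoretical (population) figure, by enumerating all 2^7 attribute patterns under equal priors and assuming the standard seven-segment encodings (which are not listed in the paper).

import math
from itertools import product

SEGMENTS = {0: "1110111", 1: "0010010", 2: "1011101", 3: "1011011", 4: "0111010",
            5: "1101011", 6: "1101111", 7: "1010010", 8: "1111111", 9: "1111011"}

def bayes_error(noise=0.10):
    """Theoretical Bayes error for the LED digit data with equal priors:
    sum, over every possible attribute pattern, the probability mass that
    even the best possible classification rule must get wrong."""
    err = 0.0
    for x in product((0, 1), repeat=7):
        # Joint probability P(x, digit) for each of the ten digits.
        joint = [0.1 * math.prod((1 - noise) if x[i] == int(SEGMENTS[d][i]) else noise
                                 for i in range(7))
                 for d in range(10)]
        err += sum(joint) - max(joint)
    return err

print(round(bayes_error(0.10), 2))   # roughly 0.26, the figure quoted above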

As for Yoon's results², comparison is harder because we have no dataset in common, and their results are rather limited. In particular, they only use one dataset and that is rather small; they only allow the network access to the four variables selected by DA; and, most importantly, they do not replicate their split into testing and training sets. We found considerable variation in error rate between replications and it is quite possible that their results are unrepresentative of the true situation. We have also found in the past that, although quadratic discriminant analysis is theoretically appropriate for the unequal covariance case, in practice LDA often works better. It would be interesting to see their results repeated with more data, replications, and using LDA.

CONCLUSIONS

LDA performed well on the datasets which proved to be linearly separable. It performed poorly on the Sphere dataset, which had highly non-linear relationships, but reasonably on the wave data, which also contained non-linearities. The datasets which were linearly separable tended to be the real datasets, where it was not obvious at the outset that this would be the case, and which contained various levels of noise and residual variation. In this respect, the performance of LDA has shown it to be reasonably robust, even where its assumptions are not strictly adhered to, and able to cope with certain levels of non-linearity.

The neural networks performed well on the datasets which were non-linear, and reasonably well across all the datasets. Classification rates were better than LDA when non-linearity existed, but generally slightly worse where the data were linearly separable. The main problems with the back-propagation network are deciding on the size of the network and the long training periods necessary. However, the use of a validation step in the learning cycle removes the need for fine-tuning the network size and also requires fewer iterations to learn the general form of the data, resulting in little or no degradation of the classification rate compared with the optimized network size.

The decision tree induction generally performed worse than the other two methods, particularly against neural networks on the non-linear data, and was susceptible to noisy data. It does, however, have the advantage that the resulting decision tree is more transparent than the results from the other two methods, and may give some insight into the relationships between the factors.

The results suggest that both LDA and neural networks are valuable tools for discriminant analysis. In cases where non-linear relationships are suspected, neural networks should be tried, while if the data is thought to be linearly separable, LDA should be used. However, when using real data, the degree of non-linearity is often not known. In this case, the use of both methods to compare their performance may prove beneficial, particularly where even a small improvement in the classification rate is of value.

REFERENCES

1. A. HART (1992) Using neural networks for classification tasks - some experiments on datasets and practical advice. J. Opl Res. Soc. 43, 215-226.

2. Y. YOON, G. SWALES Jr and T. M. MARGAVIO (1993) A comparison of discriminant analysis versus artificial neural networks. J. Opl Res. Soc. 44, 51-60.
3. R. P. LIPPMAN (1987) An introduction to computing with neural nets. IEEE ASSP Magazine 4(2), 4-22.
4. P. D. HAWLEY, J. D. JOHNSON and D. RAINA (1990) Artificial neural systems: a new tool for financial decision making. Financial Analysts J. 46(6), 63-72.
5. L. I. BURKE (1991) Introduction to artificial neural systems for pattern recognition. Comp. Ops Res. 18, 211-220.
6. D. E. RUMELHART, G. E. HINTON and R. J. WILLIAMS (1986) Learning internal representations by error propagation. In Parallel Distributed Processing, Vol 1 (D. E. RUMELHART, J. L. MCCLELLAND and the PDP Research Group, Eds) pp. 318-362. MIT Press, Cambridge, MA.
7. T. P. VOGL, J. K. MANGIS, A. K. RIGLER, W. T. ZINK and D. L. ALKON (1988) Accelerating the convergence of the back-propagation method. Biol. Cyb. 59, 257-263.
8. A. M. WILDBERGER (1990) Neural networks as a modelling tool. In Proceedings of the SCS Eastern Multiconference (W. WEBSTER and J. UITAMSINGH, Eds), April 1990, Nashville, Tennessee, pp. 65-74. Society for Computer Simulation, San Diego, California.
9. J. SIETSMA and R. J. F. DOW (1991) Creating artificial neural networks that generalize. Neural Networks 4, 67-79.
10. S. C. HUANG and Y. F. HUANG (1991) Bounds on the number of hidden neurons in multilayer perceptrons. IEEE Trans. Neural Networks 2(1), 47-55.
11. R. G. HOPTROFF (1993) The principles and practice of time series forecasting and business modelling using neural networks. Neural Comput. Applic. 1(1), 59-66.
12. R. A. FISHER (1936) The use of multiple measurements in taxonomic problems. Ann. Eugenics 7, 179-188.
13. R. MICHALSKI, J. CARBONELL and T. MITCHELL (1983) Machine Learning: An Artificial Intelligence Approach, Vol I. Morgan Kaufman, Los Altos.
14. R. MICHALSKI, J. CARBONELL and T. MITCHELL (1986) Machine Learning: An Artificial Intelligence Approach, Vol II. Morgan Kaufman, Los Altos.
15. I. BRATKO and N. LAVRAC (Eds) (1987) Progress in Machine Learning. Sigma Press, UK.
16. E. HUNT, J. MARIN and P. STONE (1966) Experiments in Induction. Academic Press, New York.
17. J. R. QUINLAN (1983) Learning efficient classification procedures and their application to chess end games. In Machine Learning: An Artificial Intelligence Approach, Vol. 1 (R. MICHALSKI, J. CARBONELL and T. MITCHELL, Eds) pp. 463-482. Morgan Kaufman, Los Altos.
18. L. BREIMAN, J. FRIEDMAN, R. OLSHEN and C. STONE (1984) Classification and Regression Trees. Wadsworth International, California.
19. J. R. QUINLAN (1987) Simplifying decision trees. Int. J. Man-Machine Studies 27, 221-234.
20. T. NIBLETT (1987) Constructing decision trees in noisy domains. In Progress in Machine Learning (I. BRATKO and N. LAVRAC, Eds) pp. 67-79. Sigma Press, UK.
21. J. MINGERS (1987) Expert systems - rule induction with statistical data. J. Opl Res. Soc. 38, 39-47.
22. J. MINGERS (1987) Rule induction with statistical data - a comparison with multiple regression. J. Opl Res. Soc. 38, 347-352.
23. J. MINGERS (1989) An empirical comparison of selection measures for decision-tree induction. Machine Learning 3, 319-342.
24. J. MINGERS (1989) An empirical comparison of pruning methods for decision-tree induction. Machine Learning 4, 227-243.
25. I. BRATKO and I. KONONENKO (1986) Learning diagnostic rules from incomplete and noisy data. Seminar on AI Methods in Statistics, London Business School, UK, Unicom Seminars Ltd.
26. M. KENDALL and A. STEWART (1976) The Advanced Theory of Statistics (Vol 3). Griffin, London.
27. W. COCHRAN (1947) Some consequences when the assumptions for the Analysis of Variance are not satisfied. Biometrics 3, 22-38.
28. SAS INSTITUTE INC. (1988) SAS/STAT User's Guide, Release 6.03 Edition. SAS Institute Inc., Cary, NC.
