Invasive Connectionist Evolution

Paulito P. Palmes and Shiro Usui

RIKEN Brain Science Institute, 2-1 Hirosawa, Wako, Saitama 351-0198 Japan
[email protected],
[email protected]
Abstract. The typical automatic way to search for an optimal neural network is to combine structure evolution by evolutionary computation with weight adaptation by backpropagation. In this model, since structure and weight optimization are carried out by two different algorithms, each using its own search space, every change in network topology during structure evolution requires relearning of the entire set of weights by backpropagation. Because of this inefficiency, we propose that the evolution of network structure and weights should be purely stochastic and tightly integrated, so that good weights and structures are not relearned but propagated from generation to generation. Since this model does not depend on gradient information, the entire process allows more flexibility in the implementation of its evolution and in the formulation of its fitness function. This study demonstrates how invasive connectionist evolution can easily be implemented using particle swarm optimization (PSO), evolutionary programming (EP), and differential evolution (DE), with good performance on cancer and glass classification tasks.
1 Introduction
Artificial Neural Networks (ANNs) have been a popular tool in many fields of study due to their general applicability to problem domains that require intelligent processing, such as classification, recognition, clustering, prediction, and generalization. The most popular algorithm for ANN learning is backpropagation (BP), which minimizes the error surface by gradient descent. Since BP is a local search algorithm, it converges quickly but can easily be trapped in local optima. Moreover, choosing the optimal architecture for a particular problem remains an active area of research because of BP's tendency to overfit or underfit the training data due to its sensitivity to the choice of architecture.

A typical approach to help BP find an appropriate architecture is to evolve its structure. Many studies have been conducted on how to carry out structure evolution by evolutionary computation. A comprehensive review of papers related to evolutionary neural networks can be found in [1]; recent insights and techniques for effective evolution strategies are found in [2,3]. The most typical approach is non-invasive [4]. This type of evolution uses a dual representation: one for stochastic or rule-based structure evolution and the other for gradient-based weight adaptation.

L. Wang, K. Chen, and Y.S. Ong (Eds.): ICNC 2005, LNCS 3612, pp. 1119–1127, 2005. © Springer-Verlag Berlin Heidelberg 2005

While non-invasive evolution makes the hybridization process straightforward, there is no tight integration between structure evolution and weight adaptation. Hence, every time the network structure evolves, the entire set of weights must be relearned by BP. In a typical evolutionary model, optimal parameter values are not relearned but propagated to succeeding generations; this is not possible, however, with gradient-based weight adaptation.

The alternative approach we propose belongs to the class of "invasive evolutionary models" [4], which rely on purely stochastic evolution of the network structure and weights. Invasive evolution uses a network representation in which weights and structure are tightly integrated, such that changes to the former bring corresponding changes to the latter, and vice versa. It avoids relearning good weights and structures by propagating them to succeeding generations. Since invasive connectionist evolution uses a direct representation and does not rely on fixed rules or heuristics, it can easily utilize the evolution process of other evolutionary models such as particle swarm optimization (PSO) [5], differential evolution (DE) [6], and evolutionary programming (EP) [7].

Dynamic adaptation is important, since fixed rules or parameter values optimized for one problem domain become useless for another [8]. What is needed is to let the processes of mutation, crossover, adaptation, and selection filter the most appropriate set of rules, traits, and parameters for the problem under consideration. It is important, therefore, to avoid developing evolutionary systems that rely on fixed rules or heuristics. We believe that a purely stochastic implementation with proper adaptation strategies is essential for robust connectionist evolution.
2 Invasive Connectionist Model
ANN learning can be considered a form of optimization whose main objective is to find the network structure and weights with optimal generalization performance. Performance is measured using a quality function $Q_{fit}$, which measures the distance of the ANN's output $F(X, S, W)$ from the target output $T(X)$:

$$Q_{fit} = \| T(X_i), F(X_i, S_i, W_i) \|_{\theta} \qquad (1)$$

where $X$, $S$, and $W$ are the network's input, structure, and weights, respectively, and $\|x\|_{\theta}$ is a similarity metric or error function. The main objective is to evolve the appropriate structure and weights so that the output of the function $F$ is as close as possible to the target $T$. The function $F$ uses the typical feedforward computation of an $n$-layered network:

$$o_i = f\Big(\sum_{j} w_{ij}\, x_j\Big) \qquad (2)$$

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (3)$$
[Fig. 1. Subnetwork Nodule — a nodule with input, hidden, and output units defined by two weight matrices; each unit computes $o_i = f\big(\sum_{j \in col} W_{ij}\big)$ with $f(x) = 1/(1 + e^{-x})$.]
Figure 1 shows the invasive connectionist's building-block component, which is composed of two weight matrices. The first weight matrix contains the topology, connection strengths, and threshold values between the input and the hidden layer. Similarly, the second weight matrix describes the topology, threshold values, and connection strengths between the hidden layer and the output layer. While each nodule can be considered a complete network capable of performing neural computation or learning, more complex structures can easily be created by combining several of these nodules to address more challenging problems in machine learning. Figure 2 shows an example of a complex network formed by combining a population of nodules. Evolutionary operators such as mutation and crossover can operate independently on the weight matrices of each nodule to improve the fitness of the entire network. In the next section, we discuss several ways to induce invasive evolution on a swarm of networks using PSO, DE, and EP.
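As an illustration, the nodule of Fig. 1 can be sketched as follows. This is a minimal sketch only; the class name, layer sizes, and bias handling are our own assumptions, not from the paper.

```python
import numpy as np

def sigmoid(x):
    # Logistic activation of Eq. (3): f(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

class Nodule:
    """Subnetwork nodule: two weight matrices plus threshold (bias) vectors."""

    def __init__(self, n_in, n_hidden, n_out, rng=None):
        rng = rng or np.random.default_rng()
        # Random weights in [-1, 1], matching the PSO variant's initialization
        self.W1 = rng.uniform(-1, 1, (n_hidden, n_in))
        self.b1 = rng.uniform(-1, 1, n_hidden)
        self.W2 = rng.uniform(-1, 1, (n_out, n_hidden))
        self.b2 = rng.uniform(-1, 1, n_out)

    def forward(self, x):
        # Eq. (2) applied layer by layer: o_i = f(sum_j w_ij x_j)
        h = sigmoid(self.W1 @ x + self.b1)
        return sigmoid(self.W2 @ h + self.b2)

nodule = Nodule(n_in=9, n_hidden=4, n_out=2, rng=np.random.default_rng(0))
out = nodule.forward(np.ones(9))  # sigmoid outputs lie strictly in (0, 1)
```

Because evolution operates directly on `W1`, `W2`, and the bias vectors, pruning a connection and changing a weight are the same kind of edit to the same representation.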
3 Invasive Connectionist Algorithm
Figure 3 shows a neural-network swarm model. Each independent nodule optimizes its structure and weights through its interactions with other nodules in its neighborhood. In this example, the degree of overlapping is set to two; hence, every network pair has two neighboring pairs. This model can be reduced to the commonly used single-population model by considering just one neighborhood. From the implementation point of view, the multi-neighborhood representation is a generalization of the single-neighborhood representation. This allows us to develop both single- and multiple-neighborhood
[Fig. 2. Invasive Connectionist Model — a complex network formed from a population of nodules, each defined by its weight matrices and threshold vectors.]

[Fig. 3. Connectionist Swarm — network swarms organized into overlapping neighborhoods.]
approaches without changes to the representation of the base component network. The invasive connectionist evolution algorithm is summarized in Fig. 4. For the PSO implementation, the update of a component's position relative to its best neighbor and personal best has the following formulation [9,5]:
[Fig. 4. Invasive Connectionist Algorithm — flow chart: Start → Initialize Structures and Weights, Initialize Neighborhood Assignment → Compute Fitness → Update Position of BestNeighbor and PersonalBest → Fly components / Perform Differential Evolution / Perform Evolutionary Programming (the update rules shown in the chart are Eqs. (4), (7), and (8)-(10) in the text) → Stop? If NO, return to Compute Fitness; if YES, Test component with best validation → End.]
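The flow of Fig. 4 can be sketched as a generic loop. All names here are illustrative placeholders, not the paper's implementation; `update` stands for whichever stochastic operator is plugged in (PSO fly, DE recombination, or EP mutation), and lower fitness is taken to be better.

```python
def evolve(population, fitness, update, stopped, max_gen=500):
    """Skeleton of the loop in Fig. 4: evaluate fitness, track bests,
    apply a stochastic operator, and stop early when the criterion fires."""
    personal_best = list(population)
    best_fit = [fitness(p) for p in population]
    for gen in range(max_gen):
        fits = [fitness(p) for p in population]
        for i, f in enumerate(fits):          # update personal bests
            if f < best_fit[i]:
                best_fit[i], personal_best[i] = f, population[i]
        g_best = min(range(len(population)), key=lambda i: fits[i])
        # "Fly" every component using its personal and neighborhood bests
        population = [update(p, personal_best[i], population[g_best])
                      for i, p in enumerate(population)]
        if stopped(gen, best_fit):            # e.g. overfitting detected
            break
    return personal_best[min(range(len(best_fit)), key=lambda i: best_fit[i])]

# Toy usage: scalar "networks" pulled toward the fittest component
best = evolve([0.0, 1.0, 5.0],
              fitness=lambda w: abs(w - 3.0),
              update=lambda p, pb, gb: p + 0.5 * (gb - p),
              stopped=lambda gen, bf: min(bf) < 1e-9)  # best converges to 3.0
```

The neighborhood structure of Fig. 3 would replace the single `g_best` with one best per neighborhood; the loop itself is unchanged.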
$$w_i = w_i + \Delta w_i^{s} \cdot sf \cdot U(0,1) + \Delta w_i^{p} \cdot if \cdot U(0,1) \qquad (4)$$

such that:

$$\Delta w_i^{s} = (w_i - w_i^{s}) \qquad (5)$$

$$\Delta w_i^{p} = (w_i - w_i^{p}) \qquad (6)$$

where sf = 1.0 and if = 1.0 refer to the component's sociability and individuality factors, respectively. Components with higher sf than if have a greater tendency to converge towards the best component in their neighborhood; components with higher if than sf tend to converge towards their personal best. It is through these interactions, governed by sociability and individuality, that the population performs both local and global search of the weight and structure spaces. All weights are randomly initialized in the range −1 to 1. The invasive connectionist model also supports the incorporation of other evolutionary approaches such as differential evolution (DE) [6] and evolutionary programming (EP) [7]. The DE and EP in the current implementation operate
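A minimal sketch of the position update of Eq. (4) follows. The array names are our own; the differences are written in the conventional PSO attraction form (best minus current), and the paper's `if` is renamed `indf` since `if` is a Python keyword.

```python
import numpy as np

def fly(w, w_best_neighbor, w_personal_best, sf=1.0, indf=1.0, rng=None):
    """PSO-style position update of Eq. (4) applied to a weight matrix.

    sf is the sociability factor and indf the individuality factor;
    the paper sets both to 1.0."""
    rng = rng or np.random.default_rng()
    dw_s = w_best_neighbor - w   # social term toward the best neighbor, Eq. (5)
    dw_p = w_personal_best - w   # individual term toward the personal best, Eq. (6)
    return w + dw_s * sf * rng.uniform(0, 1) + dw_p * indf * rng.uniform(0, 1)

w = np.zeros((2, 3))  # zeros keep the example simple
w = fly(w, np.ones((2, 3)), 0.5 * np.ones((2, 3)), rng=np.random.default_rng(0))
```

Because the same update applies element-wise to a whole weight matrix, topology (zero entries) and weights move together, which is exactly the tight integration invasive evolution requires.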
on the entire population, although it can also be applied to each neighborhood. The feasibility of the latter scheme will be studied in the future.

The weight update of DE roughly resembles that of PSO. It randomly selects three neighbors (p1, p2, p3) from the entire population as bases for changing the weights and structure of a component. Equation (7) is a modification of the DE implementation. There are two main operations, namely exploitation and discovery. The exploitation part uses information from the three randomly selected parents to form a new set of weights, while the discovery part introduces new weights by Gaussian perturbation:

$$w_i = \begin{cases} w_i^{p1} + \alpha \, (w_i^{p2} - w_i^{p3}) & \text{if } U(0,1) < cr \\ w_i + \rho(0, \sigma) & \text{otherwise} \end{cases} \qquad (7)$$

where cr = 0.99 is the probability of exploitation and 1 − cr is the probability of discovery; U is the uniform distribution; ρ is the Gaussian distribution with mean 0 and standard deviation σ; and α is a scaling factor. Network initialization starts from zero weights, and the only way for components to acquire new weights is through the discovery operation in (7). The purpose of setting the probability cr close to 1 is to give the population more time to exploit the existing weight space before dealing with the new weights slowly added by the discovery operation. Selection follows the standard DE policy, where only new components with better fitness replace their corresponding parents.

The EP implementation [4], on the other hand, uses uniform crossover, rank-based selection, and adaptation of the step-size parameter (ssp) during mutation:

$$ssp_i = U(0,1)\left(\beta + \frac{Q_{fit_i}}{Q_{tot}}\right) \qquad (8)$$

$$m_i = m_i + \alpha\, \rho(0,1)\, ssp_i \qquad (9)$$

$$\omega_i = \omega_i + \rho(0, m_i) \quad \text{if } U(0,1) < mp \qquad (10)$$

where α = 0.25 and β = 0.5 are arbitrary constants that minimize the occurrences of too-large and too-weak mutations, respectively; Q_fit and Q_tot refer to the component's fitness and the total fitness, respectively; U is the uniform random function, which minimizes large ssp occurrences; mp = 0.01 is the mutation probability; ρ is the Gaussian; and ω refers to weights and threshold values. The parameter m accumulates the net change in mutation strength over time. It is expected that the networks surviving into later generations have the appropriate m that enabled them to adapt their structure and weights better than the other networks. The EP implementation uses an elitist replacement policy to avoid losing the best traits found so far. All networks start with no connections, which ensures that new connections and weights are introduced gradually by stochastic mutation.

All algorithms use the stopping criterion described in our previous papers [10,11,4]. It monitors for overfitting using validation performance and stops training as soon as overfitting becomes apparent.
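The DE and EP weight updates of Eqs. (7)-(10) can be sketched as follows. This is a hedged sketch: the `alpha` and `sigma` defaults of `de_update` are illustrative (the paper leaves them open), and the Gaussian of Eq. (10) uses |m| as its scale since m may drift negative.

```python
import numpy as np

def de_update(w, w_p1, w_p2, w_p3, alpha=0.5, cr=0.99, sigma=1.0, rng=None):
    """Element-wise DE update of Eq. (7): exploit three randomly chosen
    parents with probability cr, otherwise discover new weights by
    Gaussian perturbation."""
    rng = rng or np.random.default_rng()
    exploit = rng.uniform(0, 1, w.shape) < cr
    return np.where(exploit,
                    w_p1 + alpha * (w_p2 - w_p3),       # exploitation
                    w + rng.normal(0, sigma, w.shape))  # discovery

def ep_mutate(w, m, q_fit, q_tot, alpha=0.25, beta=0.5, mp=0.01, rng=None):
    """EP mutation of Eqs. (8)-(10): draw a step-size parameter, update
    the mutation strength m, and perturb each weight with probability mp."""
    rng = rng or np.random.default_rng()
    ssp = rng.uniform(0, 1) * (beta + q_fit / q_tot)  # Eq. (8)
    m = m + alpha * rng.normal() * ssp                # Eq. (9)
    mutate = rng.uniform(0, 1, w.shape) < mp          # Eq. (10), per weight
    return np.where(mutate, w + rng.normal(0, abs(m), w.shape), w), m
```

With zero-initialized weights, setting `cr` near 1 (or `mp` small) means most entries are recombined or left untouched each generation, so new structure enters only through the occasional Gaussian draw, as the text describes.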
4 Simulations and Results
The quality or fitness function used in this study considers two major criteria, namely classification error and normalized mean-squared error:

$$Q_{fit} = \alpha \cdot Q_{acc} + \beta \cdot Q_{nmse} \qquad (11)$$

$$Q_{acc} = 100 \cdot \left(1 - \frac{correct}{total}\right) \qquad (12)$$

$$Q_{nmse} = \frac{100}{N P} \sum_{j=1}^{P} \sum_{i=1}^{N} (T_{ij} - O_{ij})^2 \qquad (13)$$

where N and P refer to the number of samples and outputs, respectively; Q_acc is the percentage classification error; Q_nmse is the percentage normalized mean-squared error (NMSE); and α = 0.7 and β = 0.3 are user-defined constants that control the strength of influence of their respective factors.

Simulation results include comparisons of the performance of four types of connectionist evolution, namely connectionist EP (cEP), connectionist DE (cDE), connectionist PSO (cPSO), and connectionist PSO-DE (cPSO-DE). These four variants were tested on the cancer and glass classification tasks from the UCI repository [12]. The datasets for each task were taken from the experiments of Prechelt and divided into 50% training, 25% validation, and 25% testing [13]. Results from Prechelt's manually optimized pivot BP architecture (pBP) were also included for benchmarking. Table 1 summarizes the main features of the different variants. Analysis of variance (ANOVA) and Tukey's HSD test at the α = 0.05 level of significance were used for significance and multiple-comparison testing.

Figure 5 shows the means and standard deviations of the different variants on the cancer and glass problems, respectively. A line connecting two or more means indicates no significant difference within that group. The ANOVA for the cancer problem indicates no significant difference in mean classification error among the five approaches. The ANOVA for the glass problem, however, indicates significant differences in performance. A closer analysis using Tukey's HSD indicates that cEP has the best

Table 1. Main Features of Invasive Connectionist Variants
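The fitness criterion of Eqs. (11)-(13) can be sketched as follows; the argmax decoding of the class label when counting `correct` is our own assumption.

```python
import numpy as np

def fitness(targets, outputs, alpha=0.7, beta=0.3):
    """Q_fit of Eqs. (11)-(13): weighted sum of percentage classification
    error and percentage normalized MSE. `targets` and `outputs` are
    (N samples x P outputs) arrays."""
    n, p = targets.shape
    correct = np.sum(np.argmax(outputs, axis=1) == np.argmax(targets, axis=1))
    q_acc = 100.0 * (1.0 - correct / n)                            # Eq. (12)
    q_nmse = (100.0 / (n * p)) * np.sum((targets - outputs) ** 2)  # Eq. (13)
    return alpha * q_acc + beta * q_nmse                           # Eq. (11)
```

Weighting classification error at α = 0.7 makes the discrete error dominate, while the NMSE term (β = 0.3) still rewards confident, well-calibrated outputs among networks with equal accuracy.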
[Fig. 5. Mean Classification Error Performance — mean classification error (%) of pBP, cEP, cPSO, cDE, and cPSO-DE on a) Cancer (means roughly between 1.5% and 2.0%) and b) Glass (means roughly between 33% and 44%).]
performance. Its performance, however, is not significantly different from that of cPSO, cDE, and pBP.
5 Conclusion
All variants performed as well as the manually optimized BP architecture despite using a relatively large hidden layer (see Table 1). These preliminary results demonstrate the feasibility of invasive connectionist evolution. Furthermore, this study showed several advantages of invasive evolution, such as a high degree of flexibility in formulation and ease of implementation, which makes incorporating and combining other stochastic evolutionary techniques trivial.

The swarm model of neural networks is one way of combining several nodules to achieve complexity based on their collective behavior. This nodule organization has great potential for expert ensembling. The idea is to have several swarms specializing on different parts of the solution space; finding the best solution then requires identifying which swarm to use for evaluation. The degree of overlapping can be minimized to increase the specialization or search localization of each swarm. This may provide better identification or discrimination in noisy classification or clustering tasks. This concept will be investigated further in the near future.
References

1. Yao, X.: Evolving artificial neural networks. Proceedings of the IEEE 87 (1999) 1423–1447
2. Mitchell, M.: An Introduction to Genetic Algorithms. MIT Press, Cambridge, MA (1998)
3. Tan, K.C., Lim, M., Yao, X., Wang, L., eds.: Recent Advances in Simulated Evolution and Learning. World Scientific, Singapore (2004)
4. Palmes, P., Hayasaka, T., Usui, S.: Mutation-based genetic neural network. IEEE Transactions on Neural Networks 16 (2005)
5. Eberhart, R.C., Kennedy, J.: A new optimizer using particle swarm theory. In: Proceedings of the Sixth International Symposium on Micro Machine and Human Science, Nagoya, Japan (1995) 39–43
6. Storn, R., Price, K.: Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization 11 (1997) 341–359
7. Fogel, D.: Evolutionary Computation: Toward a New Philosophy of Machine Intelligence. IEEE Press, Piscataway, NJ (1995)
8. Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1 (1997) 67–82
9. Kennedy, J., Eberhart, R.C.: Particle swarm optimization. In: Proceedings of the IEEE International Conference on Neural Networks, Piscataway, NJ (1995) 1942–1948
10. Palmes, P., Hayasaka, T., Usui, S.: Evolution and adaptation of neural networks. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN). Volume II, Portland, Oregon, USA, IEEE Computer Society Press (2003) 397–404
11. Palmes, P., Hayasaka, T., Usui, S.: SEPA: Structure evolution and parameter adaptation. In Cantú-Paz, E., ed.: Proceedings of the Genetic and Evolutionary Computation Conference. Volume 2, Chicago, Illinois, USA, Morgan Kaufmann (2003) 223
12. Murphy, P.M., Aha, D.W.: UCI Repository of Machine Learning Databases. University of California, Department of Information and Computer Science, Irvine, CA (1994)
13. Prechelt, L.: Proben1 – A set of neural network benchmark problems and benchmarking rules. Technical report, Fakultät für Informatik, Universität Karlsruhe, Karlsruhe, Germany (1994)