GLOBAL OPTIMIZATION OF RADIAL BASIS ...

Viewer
Transcript

GLOBAL OPTIMIZATION OF RADIAL BASIS FUNCTION NETWORKS BY HYBRID SIMULATED ANNEALING Jong-Seok Lee∗, Cheol Hoon Park†

Abstract: This paper presents a global optimization method of radial basis function networks. In the proposed method, stochastic search by simulated annealing is combined with a local search technique in order to perform global optimization of the network parameters with enhanced convergence speed. Its convergence property is proved mathematically. Experimental results demonstrate that the proposed method improves the performance of the networks over the conventional local and global training methods and reduces influence of the initial parameter values on the final results. Key words: Radial basis function network, hybrid simulated annealing, global optimization Received: July 9, 2009 Revised and accepted: May 12, 2010

1.

Introduction

The radial basis function network (RBFN), which is motivated by the locally tuned response observed in biologic neurons, is a popular class of artificial neural networks for solving various problems, such as function approximation and pattern recognition [1–3]. An RBFN has a feedforward structure consisting of a nonlinear hidden layer and a linear output layer. The Gaussian function is the most commonly used neuron characteristics of the hidden layer. For a given input vector x, the output of an RBFN having one output node is given by f (x) =

µ ¶ ||x − ci ||2 wi exp − + w0 , 2σi i=1

N X

(1)

∗ Jong-Seok

Lee School of Electrical Engineering and Computer Science, KAIST, Republic of Korea, E-mail: [email protected] (Current affiliation: Institute of Electrical Engineering, Ecole Polytechnique F´ ed´ erale de Lausanne (EPFL), Switzerland, E-mail: [email protected]) † Cheol Hoon Park School of Electrical Engineering and Computer Science, KAIST, Republic of Korea, E-mail: [email protected]

c °ICS AS CR 2010

519

Neural Network World 4/10, 519-537

where N is the number of hidden neurons, ci , σi and wi are the center, the width and the output weight of the i-th hidden neuron, respectively, and w0 is the bias of the output layer. It has been shown that an RBFN can approximate any continuous function within an arbitrary error bound [4, 5]. Before an RBFN is used to solve a given problem, its parameters (ci , σi , wi and w0 ) should be trained with the training data. An intuitive way for this is to utilize gradient descent algorithms, such as the error back-propagation method, as in the case of multilayer perceptrons. Wettschereck and Dietterich [6] showed that this idea can be successfully applied to the training of RBFNs. Karayiannis et al. [7] showed that RBFNs trained by a gradient descent algorithm competes with feedforward neural networks having sigmoidal hidden neurons. For faster training and better performance of RBFNs, second-order algorithms, such as the quasi-Newton and the Levenberg-Marquardt algorithms, can be adopted [8]. In [9], the extended Kalman filter was utilized for the training of RBFNs with reduced computational complexity, which was shown to produce nearly the same performance as gradient descent training. Typically, the hidden and the output layers of an RBFN can be trained separately due to the locality of the Gaussian functions and the linear characteristic between the outputs of hidden units and the network outputs. Broomhead and Lowe [10] used a method in which the centers of the Gaussian functions are randomly chosen from the training samples. This method has a drawback in that the final performance is quite sensitive to the randomly selected data. Moody and Darken [11] applied the k-means clustering algorithm to the training input data and set the centers of the Gaussian functions to the cluster centers. The widths of the Gaussian functions can be determined independently of the way of determining the centers. Simply, the widths of all Gaussian functions in the hidden layer can be set to the same value [12]. However, it was shown that using different widths for each Gaussian function significantly improves performance of the RBFN [13]. It was proposed to set the width values to the average distances of the centers and their neighboring samples [11]. In [14], the values of the widths were set to a half the maximum distance between any training input samples. Once the centers and the widths of the Gaussian functions are determined, optimizing the output layer is a linear problem, and thus the weights and the bias of the output layer can be uniquely obtained by the pseudo-inverse method [15]. Separate training of the parameters of the Gaussian functions and the output layer, which is described above, is called two-phase learning [16]. In addition, it was shown that further training of the whole network after the two-phase learning by gradient-based training algorithms improves the performance of the network, which is called three-phase learning [16]. However, the training strategies described above have a limitation in that they can only perform local optimization of the network in the parameter space and, thus, tend to be trapped in local optima depending on the initial values of the parameters. Consequently, the performance of the trained network may not be satisfactory and, moreover, the parameters and the performance of a trained network may be sensitive to the initial parameter values of the network. In order to overcome such limitation, this paper proposes a global optimization method of RBFNs by using the hybrid simulated annealing (HSA) algorithm, 520

Lee J.-S., Park C. H.: Global optimization of radial basis function networks. . .

which was originally developed for optimization of hidden Markov models [17]. The method combines the exploration capability of the underlying stochastic search by simulated annealing (SA) over the whole search space and the exploitation capability of the local search by the gradient-based learning algorithm, so that it performs global optimization with improved accuracy and convergence speed. Thus, the RBFNs trained by the proposed algorithm show improved generalization performance and consistent results with low sensitivity to different initial network parameters. Additionally, a careful design of the scheme is invented for determining the algorithm parameters. In previous researches, global training methods of RBFNs were proposed based on genetic algorithms (GAs) [18, 19]. In these methods, suitable genetic representations of the RBFNs parameters and effective search operators were designed for global optimization of RBFNs. Compared to those methods, our method has two distinct features. First, it has a strong mathematical background on global convergence. We prove that it converges in probability to the global optimal solution of an RBFNs parameter set. Second, it has a simpler algorithm structure than the GAbased methods; while the latter methods have several algorithm parameters, such as population sizes, probabilities and their evolving schemes for mutation, crossover and selection operators, and termination conditions, the user-specific parameters in our method are only the termination condition and the degree of local optimization. Moreover, experimental results demonstrate that the proposed method shows a better global search capability than the GA-based methods. In the following section, we describe the proposed method in detail. In Section 3, we prove the global convergence property of the method. Section 4 shows the experimental results of comparing the performance of the proposed method and that of the conventional methods. Finally, Section 5 gives the conclusion of the paper and directions for future work.

2.

Proposed Algorithm

The proposed algorithm basically utilizes the SA algorithm, which models the cooling process of a metal to the minimum energy crystalline structure [20]. Starting from an initial solution, the SA algorithm repeats generation, evaluation and selection of new solutions with a controlled annealing schedule. An important algorithm parameter, called temperature, is gradually decreased by the annealing schedule and governs the processes of generation and selection of solutions. An advantage of SA is its ability to escape from local optima. Since the transition to a solution having a worse cost than the current solution can occur with a nonzero probability, SA has a possibility to reach the global optimum even when there are local optima in the solution space. SA has been successfully applied to many optimization problems in various areas (e.g. [21, 22]). In the proposed algorithm, we combine the SA algorithm with a local optimization technique. Once a new solution is generated by random mutation of the current solution, it is locally optimized to some extent. Introducing the local optimization technique into SA helps the algorithm to converge fast and produce solutions of improved quality. This hybrid algorithm can be shown to converge to 521

Neural Network World 4/10, 519-537

the global optimum in probability for an arbitrary initial solution, which will be given in the following section. Once the centers and the widths of the Gaussian functions are set, the parameters of the output layer can be determined by the pseudo-inverse method. Thus, the solution vector in our method is composed of the parameters of the Gaussian functions. The procedure of the algorithm is summarized in Fig. 1 and its detailed description is given in the following subsections.

Initialize ci , σ i , wi and w0 ( i = 1,..., N ) by two-phase learning. Set the initial solution p 0 to the collection of normalized ci and σ i . Initialize the temperature T0 . repeat for k=1,2,… Generate an intermediate solution q k from the current one p k : q k ← p k + ∆p k . Apply the LM algorithm to obtain a new solution rk : rk ← LM (q k ) . Evaluate the cost of rk : C (rk ) ← M log(RMSE(rk )) . Select rk over p k for p k +1 with the following probability:   C (rk ) − C (p k )   pa = min exp −  ,1 . Tk    

Decrease the temperature: Tk ← T0 / k . until the termination condition is satisfied.

Fig. 1 Pseudo-code of the proposed algorithm.

2.1

Initialization

We start the algorithm by normalizing the training input data so that they are within [−1, 1]d , where d is the dimension of the input vectors. This step enables the algorithm to be applied to data sets having different scale factors. The RBFN to be trained is initialized by two-phase learning. The Gaussian centers are determined by random selection among the training samples or by clustering, and the widths are determined by calculating the distances between the centers and a few nearest samples [16]. Then, the weights and the bias of the output layer are determined by the pseudo-inverse method. 522

Lee J.-S., Park C. H.: Global optimization of radial basis function networks. . .

In the algorithm, it is necessary to set the minimum and the maximum possible values of the centers and the widths. The bound of the centers is set to [−1.1, 1.1], which allows a margin of 10% for each dimension around the data space, and that for the widths to [0.1, 1]. Then, the values of the centers and the widths initialized by two-phase learning are normalized with these bounds and compose the initial solution vector p0 ∈ Θ so that the search space is a unit hypercube, i.e., Θ = [0, 1]n , where n is the dimension of the solution vector, i.e. n = N d + N . The initial temperature T0 is determined in a way that, in the early stage of the algorithm, global search is performed with the solution exploring over the whole search space by making the largest change of the solution in the space easy through random mutation. Since the largest change corresponds to the move from a corner of the search space Θ to its opposite corner along the diagonal line of the space, we set the initial temperature to be proportional to the length of the diagonal line of the space, i.e. √ T0 = c n,

(2)

where c is a scaling factor. From experiments we observed that we can simply put c = 1 over various problems and network structures.

2.2

Generation of new solutions

A new solution is generated by the hybrid of the random mutation of the current solution and a local optimization technique. First, an intermediate solution qk is generated from the current one pk by random mutation ∆pk : qk = pk + ∆pk ,

(3)

where k is the iteration index. The Cauchy probability distribution function is used for determining the amount of mutation ∆pk [23]: g(∆pk ) =

an Tk , (||∆pk ||2 + Tk2 )(n+1)/2

(4)

where Tk is the temperature at iteration k and an = π −(n+1)/2 Γ((n + 1)/2) is the normalizing constant. Note that the magnitude of displacement between the current and the new solutions is controlled by the temperature Tk . In the beginning of the algorithm, the temperature is high and the solution is likely to be changed largely. As the temperature is decreased by the annealing step (which will be explained in Subsection 2.4), the random displacement of the solution tends to become small. For the new Gaussian functions parameters determined by (3), the parameters of the output layer of the RBFN are obtained by the pseudo-inverse method. Then, we apply a few iterations of a gradient-based learning algorithm to train the whole network. We use the Levenberg-Marquardt (LM) algorithm, which is one of the fastest training algorithms of neural networks [8]. After application of the LM algorithm, the output weights and the bias are adjusted by the pseudo-inverse 523

Neural Network World 4/10, 519-537

method. The centers and the widths of the RBFN form the final candidate solution at iteration k, rk . We calculate the cost of the new solution rk by C(rk ) = M log(RM SE(rk )),

(5)

where RM SE(rk ) is the root-mean-squared error of the network determined by rk for the training data. M is the scaling parameter, which is determined in the initialization step of the algorithm so that the acceptance probability (given by (6)) of the worst solution against the best one among 100 trials is 0.5. The use of the logarithm in (5) is based on the observation that the cost change over iteration is quite well-fitted with the logarithm.

2.3

Selection

After the new candidate solution rk is obtained, we determine if it is accepted as the solution of the next iteration pk+1 over the current one pk according to the Metropolis rule [24]. The acceptance probability of rk is given by · ½ ¾ ¸ C(rk ) − C(pk ) pa = min exp − ,1 . (6) Tk In other words, if the new solution is better than the current one, the transition from the current solution to the new one always occurs; otherwise, the transition occurs probabilistically. A nonzero acceptance probability for transition to a worse solution helps the algorithm escape from local optima. The temperature governs the probability of this hill-climbing transition. In the beginning of the algorithm, the temperature is high and thus the acceptance probability of a worse solution against the current one is large. When the temperature becomes low, a hill-climbing transition is hardly accepted, which allows the algorithm to perform fine-tuning of the solution at its final stage rather than exploration in the search space.

2.4

Annealing

The annealing schedule decreases the temperature gradually. The temperature controls the optimization process by affecting generation and selection of new solutions. A high temperature at the beginning of the algorithm encourages exploration of the solution by allowing large amounts of mutation of the solution in (3) and hill-climbing transitions in (6); a low temperature at the final stage of the algorithm lowers the probabilities of large changes of the solution and acceptance of worse solutions than the current ones. In previous applications of SA, various annealing schedules have been adopted: • The inverse-logarithmic annealing: Tk = T0 / log k

(7)

Tk = T0 γ k

(8)

• The exponential annealing:

524

Lee J.-S., Park C. H.: Global optimization of radial basis function networks. . .

• The reciprocal annealing: Tk = T0 /k

(9)

The inverse-logarithmic schedule comes from the classical SA algorithm [20], which is known to produce too slowly decreasing temperature sequences. While the exponential annealing schedule is frequently used in many studies because it is reasonably fast (e.g. [25, 26]), an appropriate value of the parameter γ should be determined manually. The reciprocal annealing has been used in the fast SA algorithm by Szu and Hartley and found to be fast for practical use [27]. Therefore, we use the reciprocal annealing schedule which is fast and has no additional parameter to be tuned. Another important reason for adopting this schedule is to ensure the global convergence property of the algorithm, which will be shown in Section 3.

2.5

Termination

The algorithm continues until the termination condition is satisfied. We consider the algorithm converged when the relative change of the RMSE value over s iterations is less than a pre-defined small positive value, i.e. RM SE(pk−s ) − RM SE(pk ) < ². RM SE(pk−s )

3.

(10)

Convergence

In this section, the global convergence property of the proposed algorithm is presented. For this, let us consider a general HSA algorithm which adjusts all parameters to be optimized simultaneously. A new solution is randomly generated from the current one according to a generating function, is optimized by a local optimization operator, and competes with the current solution for being selected as the solution of the next iteration, which is repeated with an annealing schedule. Our algorithm differs from the general HSA in that, while the centers and the widths of the Gaussian function of an RBFN are changed by random mutation as in the general HSA algorithm, the output layer of the RBFN is determined by the pseudo-inverse method. The following lemma gives the conditions for convergence of the general HSA algorithm to the global optimum. Lemma 1. Let uk be the solution vector of the k-th iteration of the general HSA algorithm, which is defined in the feasible space Ψ ⊂ Rm . If a new solution is generated by (3), the temperature update function is given by (9), the Metropolis selection rule in (6) is used, and the local optimization never makes the solution worse, then the cost value sequence of the general HSA algorithm, {C(uk ), k ≥ 0}, converges in probability to the global optimum for any initial solution u0 ∈ Ψ, i.e. for any δ > 0, lim P (C(uk ) − C ∗ ≥ δ) = 0, (11) k→0

where C ∗ is the global optimum. 525

Neural Network World 4/10, 519-537

The proof of the lemma can be found in [28, 29]. It is worth mentioning that the combination of the Cauchy generating function (4) and the reciprocal annealing schedule (9) is important for obtaining the convergence in the above lemma. If one uses too fast decreasing a temperature function along with the Cauchy generating function or a generating function having too short tails such as the Gaussian distribution together with the reciprocal annealing schedule, the global convergence of the algorithm is not guaranteed. Based on Lemma 1, the global convergence of the whole network parameter set in our method is proved in the following theorem. Theorem 1. The cost value sequence {C(pk ), k ≥ 0} in the HSA algorithm for RBFNs converges in probability to the global optimum for any initial solution p0 ∈ Θ. Proof. Let uk = [pk , vk ] ∈ Rm be the set of all parameters of the RBFN at iteration k, where vk ∈ Rm−n consists of the weights and the bias of the output layer. For a given pk , finding the optimal vk minimizing the training error is a quadratic problem, which is unimodal and can be solved by the pseudo-inverse method uniquely. Let us define Φ as follows: © ª Φ = u = [p, v] ∈ Rm |p ∈ Θ, C([p, v]) ≤ C([p, v0 ]) for any v0 ∈ Rm−n , (12) i.e. Φ is the composite space of Θ and the optimal output weights and bias for each point of Θ. The proposed HSA algorithm for RBFNs can be considered to perform the general HSA algorithm in Φ. According to Lemma 1, the algorithm converges in probability to the global optimum of Φ regardless of the initial value of p. Since the global optimum of Φ coincides with the global optimum of the RBFNs parameters, the proposed HSA algorithm converges in probability to the global optimum in the space of all parameters of the RBFN. Since the cost function in our algorithm is a monotonic function of the RMSE for the training data as in (5), global convergence in terms of the cost function is equivalent to that in terms of the RMSE.

4.

Experimental results

We demonstrate the performance of the proposed method through experiments on problems of different domains: three function approximation problems, nine pattern recognition problems, and a time series prediction problem.

4.1

Function approximation

The function approximation problems are to approximate the following functions having two inputs and one output [29, 30]: 526

Lee J.-S., Park C. H.: Global optimization of radial basis function networks. . .

F1 (x1 , x2 )

=

1.9(1.35 + e0.5(x1 +1) sin(13(0.5x1 − 0.1)2 )e−0.5(x2 +1) sin(3.5x2 + 3.5)),

(13)

2 −9x21 −(3x2 +1)2

F2 (x1 , x2 )

=

sin(3x1 + sin(3x2 )) + 0.25{3(1 − 3x1 ) e

F3 (x1 , x2 )

−10(1.5x1 − 27x31 − 81x42 )e−9x1 −9x2 − e−(3x1 +2) = sin(4x1 x2 ).

2

2

2

−9x22

/3}, (14) (15)

For each function the training data included 225 random points within [−1, 1]2 and four corner points. A regular grid of 100×100 in [−1, 1]2 was used for the test data. We compared the proposed HSA algorithm and the three-phase learning method which uses the LM algorithm for training. Note that the three-phase learning here would show better results than that in [16] because we used the LM algorithm which is known to be superior to the error backpropagation algorithm used in [16]. For initialization of RBFNs in both methods, the Gaussian centers were either chosen randomly among training inputs or determined by k-means clustering, and the widths were set to be the average distances between the corresponding centers and the five nearest data points. We applied 30 iterations of the LM algorithm for the local optimization in HSA. For the termination condition in (10), we used (s, ²) = (100, 0.001) and (s, ²) = (500, 0.001) for the three-phase learning and the proposed methods, respectively. In other words, we regarded the training process converged if the relative change of RMSE over 100 and 500 iterations was less than 0.1% for LM and HSA, respectively. The results for various numbers of hidden neurons are shown in Tabs. I, II and III, where the average and the standard deviation values of RMSEs over 20 repetitions with different random seeds are shown. In addition, in order to examine the significance of the performance difference between the two algorithms, the two-sample t-test was performed under the null hypothesis that their results are independent random samples from normal distributions having equal means. In the tables, the cases where the improvement by HSA is statistically significant at the 95% significance level are marked with bold fonts. From the tables, one can observe that HSA outperforms the conventional method in terms of the training and the test performance in all cases, which shows effectiveness of global optimization by HSA. The statistical analysis also confirms that the performance difference is significant. Besides, there is little difference in the results by the two different initialization methods in the case of HSA and the standard deviations over multiple runs are much smaller for HSA than for the three-phase learning with LM. These results indicate that the effect of initial network parameters was significantly reduced by HSA. In some cases of the tables, it is observed that the performance for the test data is better than that for the training data, which may not be usual in a general sense. This phenomenon comes from the fact that the data used do not contain noise and form smooth surfaces in the three dimensional space and, moreover, the number of the test data is larger than that of the training data. Therefore, when an RBFN produces an approximation for the given training data, the errors for some test data can be smaller than those for their neighboring training data, so 527

Neural Network World 4/10, 519-537

#hidden Initialization 5 10 15 20

Random Clustering Random Clustering Random Clustering Random Clustering

Three-phase learning Train Test 0.273±0.071 0.248±0.078 0.220±0.015 0.201±0.012 0.118±0.073 0.103±0.056 0.141±0.041 0.115±0.032 0.067±0.068 0.055±0.052 0.098±0.048 0.074±0.034 0.075±0.060 0.056±0.043 0.097±0.041 0.070±0.029

HSA Train 0.139±0.013 0.149±0.021 0.018±0.002 0.018±0.002 0.008±0.001 0.008±0.001 0.004±0.001 0.004±0.001

Test 0.140±0.013 0.150±0.019 0.023±0.003 0.023±0.002 0.014±0.002 0.013±0.003 0.009±0.006 0.010±0.008

Tab. I Averages and standard deviations of RMSEs by the three-phase learning and the proposed HSA methods for F1 . #hidden Initialization 5 10 15 20

Random Clustering Random Clustering Random Clustering Random Clustering

Three-phase learning Train Test 0.495±0.117 0.487±0.113 0.308±0.074 0.306±0.075 0.168±0.070 0.174±0.067 0.193±0.087 0.193±0.082 0.123±0.090 0.130±0.086 0.118±0.103 0.126±0.098 0.065±0.088 0.077±0.085 0.072±0.074 0.105±0.116

HSA Train 0.189±0.007 0.189±0.009 0.049±0.004 0.048±0.005 0.021±0.004 0.021±0.004 0.015±0.003 0.015±0.003

Test 0.176±0.007 0.174±0.006 0.058±0.004 0.057±0.005 0.032±0.004 0.032±0.004 0.027±0.004 0.028±0.004

Tab. II Averages and standard deviations of RMSEs by the three-phase learning and the proposed HSA methods for F2 . that the average error (i.e. RMSE) for the test data becomes smaller than that for the training data. We used 30 iterations of the LM algorithm for local optimization in HSA for obtaining the results in the tables. Actually, the number of iterations of the LM algorithm does not greatly affect the results of HSA. Instead, different numbers of LM iterations cause different degrees of fine-tuning of the final solution. To verify this, we plot the training RMSE values of RBFNs having 20 hidden units for F1 with varying the number of iterations of the LM algorithm in Fig. 2. In the figure, we show the average training RMSE value of the networks trained by HSA with respect to the number of LM iterations. In addition, to examine the results without the effect of fine-tuning in HSA, the networks obtained by HSA were additionally trained by LM so that a fixed number of LM iterations (50 iterations) was applied to the networks at the last iteration of HSA; for example, the networks produced by HSA with local optimization of 10 LM iterations were trained further by the LM algorithm of 40 iterations. The figure also shows the training RMSE for the three-phase learning method, which appears constant. In the results of HSA, there exist slight differences between the results for different numbers of LM 528

Lee J.-S., Park C. H.: Global optimization of radial basis function networks. . .

#hidden Initialization 5 10 15 20

Random Clustering Random Clustering Random Clustering Random Clustering

Three-phase learning Train Test 0.335±0.092 0.341±0.093 0.339±0.119 0.394±0.121 0.146±0.033 0.150±0.035 0.130±0.034 0.135±0.034 0.061±0.019 0.069±0.021 0.065±0.019 0.071±0.018 0.032±0.020 0.041±0.024 0.029±0.014 0.037±0.017

HSA Train 0.216±0.005 0.217±0.006 0.069±0.008 0.069±0.010 0.032±0.003 0.031±0.003 0.020±0.002 0.020±0.003

Test 0.221±0.002 0.223±0.005 0.072±0.008 0.071±0.009 0.037±0.004 0.036±0.005 0.025±0.003 0.026±0.003

Tab. III Averages and standard deviations of RMSEs by the three-phase learning and the proposed HSA methods for F3 . iterations. Applying more LM iterations yielded smaller training errors (note that, however, these results are always much better than those obtained by the threephase learning method). However, such performance differences became negligible when we used the additional LM training step after HSA. This demonstrates that varying the number of LM iterations in HSA mostly influences only fine-tuning of the final solutions. 0.1

RMSE

0.08

Three−phase learning HSA HSA (fine−tuning)

0.06

0.04

0.02

0 10

20 30 40 Number of LM iterations

50

Fig. 2 Training errors of the RBFNs obtained by HSA with respect to the number of iterations of the LM algorithm for the local optimization step. The superiority of HSA was obtained at the expense of additional computational complexity. The training time for HSA was about 20-70 times that for the three529

Neural Network World 4/10, 519-537

phase learning method in the MATLAB environment. However, this complexity is worth accepting when the performance improvement is considered.

4.2

Pattern classification

In this subsection, the performance of HSA on pattern classification problems is examined. The data sets used are nine real-world classification problems from the UCI Machine Learning Repository [31]. Tab. IV summarizes the characteristics of the data sets. It can be seen that the problems are different with respect to the numbers of samples (from 101 to 1484), attributes (from 4 to 18), and classes (from 2 to 10). Since the data sets did not explicitly separate training and test data, four-fold cross-validation experiments were conducted; we divided the whole data of each problem into four parts and repeated experiments four times by using the data of the three parts for training and those of the remaining part for generalization test. And, we repeated this procedure ten times with different random seeds and report the mean and the standard deviation values over the ten experiments. The training input data of each problem were scaled so that their values lay within [−1, 1] for each dimension. Then, the test data were scaled by using the minimum and the maximum training input values. Other experimental setups, such as the termination condition and the initialization of RBFNs, were the same as those of the function approximation experiments. Data set Cancer Glass Iris Liver Pima Vehicle Wine Yeast Zoo

#sample 699 214 150 345 768 846 178 1484 101

#attribute 9 9 4 6 8 18 13 8 17

#class 2 6 3 2 2 4 3 10 7

Tab. IV Characteristics of data sets for pattern classification. Tabs. VII to XIII show the results for each problem. The tables compare the performance of the three-phase learning method and that of the proposed HSA method for various numbers of hidden units of RBFNs. As in the function approximation problems, we performed the t-test with the 95% significance level to examine the statistical significance of the performance difference between the two methods, and the results are shown with bold fonts in the tables. From the tables, it can be observed that the classification performance of the RBFNs trained by HSA is at least similar to and better in most cases than that by the conventional method, which was verified by the statistical analysis. It is also noticed that the performance of HSA is similar for both initialization methods, while careful initialization of RBFNs is crucial for the conventional method. For most of the data sets, 530

Lee J.-S., Park C. H.: Global optimization of radial basis function networks. . .

HSA considerably outperforms the three-phase learning method in terms of both accuracies and variances over repeated experiments. For the Cancer and the Wine data sets, the two methods show comparable performance only when the networks by clustering were initialized. #hidden 4 8 12

Initialization Random Clustering Random Clustering Random Clustering

Three-phase learning Train Test 78.4±5.7 77.8±6.1 96.8±0.1 96.7±0.2 86.5±2.6 85.7±2.9 97.0±0.1 96.7±0.2 90.9±2.1 90.7±2.9 97.3±0.2 96.8±0.3

HSA Train 98.0±0.2 98.1±0.2 98.4±0.3 98.4±0.2 98.5±0.2 98.6±0.2

Test 96.8±0.3 96.7±0.3 96.4±0.4 96.5±0.6 96.5±0.4 96.5±0.3

Tab. V Averages and standard deviations of recognition accuracies (%) by the three-phase learning and the proposed HSA methods for the Cancer problem.

#hidden 5 10 15

Initialization Random Clustering Random Clustering Random Clustering

Three-phase learning Train Test 47.5±2.4 44.2±3.7 58.2±2.0 51.6±3.3 57.0±2.3 51.3±2.6 69.1±2.2 61.3±2.8 61.5±1.7 55.1±2.9 72.2±2.5 61.2±4.0

HSA Train 72.6±0.8 73.0±1.3 78.3±0.9 78.8±1.2 81.8±1.1 80.6±1.5

Test 62.8±1.8 62.8±2.0 64.8±2.0 65.8±1.1 65.4±2.0 65.1±2.1

Tab. VI Averages and standard deviations of recognition accuracies (%) by the three-phase learning and the proposed HSA methods for the Glass problem.

#hidden 4 8 12

Initialization Random Clustering Random Clustering Random Clustering

Three-phase learning Train Test 75.0±1.7 73.2±2.7 89.9±0.7 88.5±0.6 81.2±2.0 78.1±3.1 95.5±2.2 93.2±2.5 84.9±2.2 81.7±3.1 95.5±1.4 92.3±1.9

HSA Train 98.7±0.4 98.7±0.4 99.0±0.2 98.8±0.3 98.9±0.3 98.7±0.3

Test 95.9±0.8 94.9±0.9 95.4±1.1 95.5±1.1 95.5±0.9 94.9±1.0

Tab. VII Averages and standard deviations of recognition accuracies (%) by the three-phase learning and the proposed HSA methods for the Iris problem. We also compared the proposed and the three-phase learning methods for the same computational resources. For this, the following procedure was adopted for the three-phase learning method: 531

Neural Network World 4/10, 519-537

1. Initialize an RBFN. The Gaussian parameters were set by the clustering method and the output weights by the pseudo-inverse method. 2. Train the network by the LM algorithm. 3. Repeat the above steps and select the network showing the best performance for the training data. The steps 1 and 2 were repeated until the accumulated number of LM iterations became the same as that of HSA, which enabled us to compare the two methods

#hidden 10 15 20

Initialization Random Clustering Random Clustering Random Clustering

Three-phase learning Train Test 62.0±0.8 59.2±1.4 65.6±0.8 58.0±1.8 64.0±0.9 60.7±1.9 69.6±1.6 61.1±3.1 66.1±1.3 61.0±1.8 74.4±2.1 65.2±3.4

HSA Train 78.7±0.8 78.9±0.7 79.5±0.8 79.4±1.0 79.7±0.6 79.4±0.8

Test 70.8±1.6 69.9±1.0 70.1±1.9 70.6±1.6 69.5±1.4 69.9±0.9

Tab. VIII Averages and standard deviations of recognition accuracies (%) by the three-phase learning and the proposed HSA methods for the Liver problem. #hidden 4 8 12

Initialization Random Clustering Random Clustering Random Clustering

Three-phase learning Train Test 66.3±0.7 65.6±0.6 73.5±0.9 71.9±1.2 67.1±0.9 66.6±0.9 76.3±0.7 74.7±0.8 68.1±1.1 67.5±1.2 77.8±0.6 75.7±0.9

HSA Train 79.3±0.3 79.5±0.4 80.6±0.5 80.7±0.4 81.2±0.5 81.2±0.6

Test 76.6±0.5 76.3±0.8 76.3±0.7 76.3±0.6 76.0±0.8 75.8±0.9

Tab. IX Averages and standard deviations of recognition accuracies (%) by the three-phase learning and the proposed HSA methods for the Pima problem. #hidden 10 15 20

Initialization Random Clustering Random Clustering Random Clustering

Three-phase learning Train Test 44.4±1.5 41.3±3.2 65.6±1.5 62.4±1.6 49.6±1.8 47.0±1.7 72.7±1.1 69.0±1.4 53.3±1.8 49.7±1.9 75.8±0.9 71.0±1.2

HSA Train 79.1±1.5 79.4±1.9 79.6±1.5 79.4±1.5 80.4±0.9 80.5±0.9

Test 76.3±1.4 75.7±1.9 75.4±1.4 75.9±1.7 76.0±1.4 76.6±1.0

Tab. X Averages and standard deviations of recognition accuracies (%) by the three-phase learning and the proposed HSA methods for the Vehicle problem. 532

Lee J.-S., Park C. H.: Global optimization of radial basis function networks. . .

#hidden 4 8 12

Initialization Random Clustering Random Clustering Random Clustering

Three-phase learning Train Test 77.2±3.2 75.4±4.3 98.3±0.2 97.8±0.5 88.3±1.2 85.6±3.1 98.8±0.3 97.6±0.5 91.8±1.1 89.2±2.4 99.2±0.3 97.7±0.7

HSA Train 99.9±0.1 99.7±0.1 100.0±0.0 100.0±0.1 100.0±0.0 100.0±0.0

Test 97.3±0.8 97.4±1.0 95.8±1.4 96.8±1.2 96.3±0.7 96.8±1.2

Tab. XI Averages and standard deviations of recognition accuracies (%) by the three-phase learning and the proposed HSA methods for the Wine problem. #hidden 4 8 12

Initialization Random Clustering Random Clustering Random Clustering

Three-phase learning Train Test 34.6±1.1 33.5±1.1 46.9±1.4 45.7±1.6 37.1±1.1 35.7±1.1 57.2±1.4 56.3±1.7 39.8±1.2 38.3±1.2 58.3±0.9 56.9±1.4

HSA Train 56.8±0.5 57.0±0.3 59.7±0.4 59.8±0.4 61.1±0.3 60.8±0.6

Test 55.5±1.1 55.9±0.5 57.8±0.8 57.7±0.6 58.3±0.5 58.1±0.6

Tab. XII Averages and standard deviations of recognition accuracies (%) by the three-phase learning and the proposed HSA methods for the Yeast problem. #hidden 4 8 12

Initialization Random Clustering Random Clustering Random Clustering

Three-phase learning Train Test 67.2±2.7 63.4±5.9 84.4±1.9 79.3±3.8 79.5±2.6 73.5±6.0 90.6±2.6 83.7±3.1 86.6±1.5 81.1±2.9 91.3±3.5 82.1±5.5

HSA Train 88.5±0.9 89.0±1.8 94.6±1.1 95.2±1.3 98.4±1.0 98.3±0.7

Test 83.0±2.9 83.8±2.7 88.4±1.9 87.3±3.2 90.0±1.9 90.3±2.7

Tab. XIII Averages and standard deviations of recognition accuracies (%) by the three-phase learning and the proposed HSA methods for the Zoo problem. for the same computational complexity. Tab. XIV shows the results for some of the pattern classification problems for which the performance gap between the HSA and the three-phase learning is large in Tabs. V-XIII. It is observed that, while the best performance of the repeated three-phase learning in Tab. XIV is slightly better than that of the three-phase learning in Tabs. V-XIII, it is still outperformed by HSA. This demonstrates that HSA is completely different from the repetition of random initialization and local optimization, and it is a carefully designed global optimization process. 533

Neural Network World 4/10, 519-537

Problem

#hidden

Glass Liver Pima Vehicle Yeast

5 10 4 10 4

HSA Train 73.0 78.9 79.5 79.4 57.0

Test 62.8 69.9 76.3 75.7 55.9

Best of Train 65.6 75.2 76.6 72.8 51.9

three-phase learning Test 58.4 67.3 75.0 68.7 49.9

Tab. XIV Comparison of HSA and three-phase learning with the same computational resources for the pattern classification problems.

4.3

Time series prediction

Finally, a time series prediction problem was used to test the performance of HSA in comparison with the existing GA-based global optimization methods [18, 19]. The method presented in [18] works as a clustering algorithm to globally optimize the centers of the Gaussian functions. A population consists of diverse solutions which compete with each other. Evolution of the solutions is performed through generation of solutions by the genetic operators, such as random mutation and recombination, and substitution of some bad solutions in the current population with newly generated solutions. The widths are determined as the variances of the clusters and the output layer is optimized by the recursive least squares method. In the cooperative-competitive genetic evolution method [19], a population of competing basis functions composes a single RBFN for modeling the function given by data in a cooperative manner. A fitness function was designed to promote this cooperative-competitive evolution of the centers and the widths of the Gaussian functions. Recombination, creep and mutation operators are employed to generate a new population from the current one at each iteration. We used the Mackey-Glass time series prediction problem given by the following differential equation: ax(t − τ ) dx(t) = −bx(t) + , dt 1 + x(t − τ )10

(16)

which results in chaotic time series data. This problem has been widely used as a benchmark for comparing learning and generalization capabilities of neural networks. As in previous studies [11,18,19,32,33], we used the parameters a = 0.2, b = 0.1 and τ = 17. The task was to predict x(t + α) from past samples x(t), x(t − β), x(t − 2β) and x(t − 3β). Following the studies cited above, we set α = 85 and β = 6, which made the problem a significant challenge because it is a long-term prediction problem. We also used the same way as in many of previous studies for the preparation of training and test data; 500 training samples were randomly chosen from the points from t = 500 to t = 4000 and test samples were 500 samples for t > 4000. The termination condition and the iteration number of the LM algorithm in HSA were set to the same as in the experiments of the previous subsections. Fig. 3 compares the normalized prediction errors (i.e. the prediction RMSE divided by the standard deviation of the correct prediction) of the trained RBFNs 534

Lee J.-S., Park C. H.: Global optimization of radial basis function networks. . .

having 25 hidden units. The results of HSA are the mean values over 20 trials with different random seeds and those of the conventional ones are from their original papers. The results show that HSA outperforms the GA-based methods significantly, which demonstrates that the global search capability of HSA is better than those of the conventional methods. Again, one can see that the initialization method of HSA influences little the final results.

Normalized prediction error

0.5

0.4

0.3

0.2

0.1

0

[16]

[17]

HSA (random)

HSA (clustering)

Fig. 3 Normalized prediction error by the two conventional GA-based methods and the proposed method with different initialization for the Mackey-Glass time series prediction problem. As mentioned in the introduction, HSA has additional advantages over the GA-based methods other than in performance. Since HSA has fewer algorithm parameters to be determined by users than the GA-based methods, the former is easier to be applied to various problems without much consideration of the tuning of the parameters than the latter. Moreover, we showed the mathematical convergence property of HSA in Section 3, whereas convergence of the GA-based methods has not been shown.

5.

Conclusion

We have presented the HSA algorithm to perform global optimization of RBFNs. The algorithm utilizes both the global and the local search techniques and performs optimization of RBFNs via iterative generation, local optimization, evaluation and selection of the solution with the reciprocal annealing scheme. We mathematically proved that the training error converges to the global optimum value in probability. Experimental results showed that the algorithm generated RBFNs, showing better performance than the conventional local and GA-based global training algorithms, and reduced variances in results for different initial parameters of networks. 535

Neural Network World 4/10, 519-537

As our future work, we are considering to devise a similar algorithm to the proposed one for optimization of neural networks having sigmoidal hidden neurons. In addition, it would be desirable to extend the algorithm to perform structural optimization of neural networks as well as parameter optimization.

References [1] Poggio T., Girosi F.: Networks for approximation and learning. Proc. IEEE, 78, 1990, pp. 1481-1497. [2] Monfared M., Daryani A. M., Abedi M.: Online tuning of genetic based PID controller in LFC systems using RBF neural network and VSTLF technique. Neural Network World, 18, 2008, pp. 309-322. [3] Tatarinov V.: Vigilance classification based on EEG analysis. Neural Network World, 14, 2004, pp. 325-335. [4] Chen T., Chen H.: Approximation capability to functions of several variables, nonlinear functionals, and operators by radial basis function neural networks. IEEE Trans. Neural Networks, 6, 1995, pp. 904-910. [5] Park J., Sandberg I. W.: Approximation and radial basis function networks. Neural Computation, 5, 1993, pp. 305-316. [6] Wettsschereck D., Dietterich T.: Improving the performance of radial basis function networks by learning center locations. In: J. E. Moody, S. J. Hanson, R. P. Lippman (Eds.), Advances in Neural Information Processing Systems 4, Morgan Kaufmann, Los Altos, CA, 1992, pp. 1133-1140. [7] Karayiannis N. B., Balasubramanian M., Malki H.: Evaluation of cosine radial basis function neural networks on electric power load forecasting. In: Proc. Int. Joint Conf. Neural Networks, Portland, OR, 2003, pp. 2100-2105. [8] Bishop C.: Neural networks for pattern recognition, Oxford Univ. Press Inc., New York, 1995. [9] Simon D.: Training radial basis neural networks with the extended Kalman filter. Neurocomputing, 48, 2002, pp. 455-475. [10] Broomhead D. S., Lowe D.: Multivariable functional interpolation and adaptive networks, Complex Systems, 2, 1988, pp. 321-355. [11] Moody J., Darken C.: Fast learning in networks of locally-tuned processing units. Neural Computation, 1, 1989, pp. 291-294. [12] Chen S., Cowan C. F. N., Grant P. M.: Orthogonal least squares learning algorithm for radial basis function networks. IEEE Trans. Neural Networks, 2, 1991, pp. 302-309. [13] Benoudjit N., Archambeau C., Lendasse A., Lee M. V. J. A.: Width optimization of the Gaussian kernels in radial basis function networks. In: Proc. European Symposium on Artificial Neural Networks, Bruges, Belgium, 2002, pp. 425-432. [14] Orr M. J. L.: Regularisation in the selection of radial basis function centers. Neural Computation, 7, 1995, pp. 606-623. [15] Orr M. J. L.: Introduction to radial basis function networks. Technical Report, Center for Cognitive Science, Univ. Edinburgh, 1996. [16] Schwenker F., Kestler H. A., Palm G.: Three learning phases for radial-basis-function networks. Neural Networks, 14, 2001, pp. 439-458. [17] Lee J.-S., Park C. H.: Training hidden Markov models by hybrid simulated annealing for visual speech recognition. In: Proc. Int. Conf. Systems, Man and Cybernetics, Taipei, Taiwan, 2006, pp. 198-202. [18] Aiguo S., Jiren L.: Evolving Gaussian RBF network for nonlinear time series modeling and prediction. Electronics Letters, 34, 1998, pp. 1241-1243.

536

Lee J.-S., Park C. H.: Global optimization of radial basis function networks. . .

[19] Whitehead B. A., Choate T. D.: Cooperative-competitive genetic evolution of radial basis function centers and widths for time-series prediction. IEEE Trans. Neural Networks, 7, 1996, pp. 869-880. [20] Kirkpatrick S., Gerlatt C. D., Vecchi M. P.: Optimization by simulated annealing. Science, 220, 1983, pp. 671-680. [21] Posp´ıchal J., Kvasni˘ cka V.: Simulated annealing construction of shortest paths on incomplete graphs. Neural Network World, 13, 2008, pp. 277-290. [22] Lee J.-S., Park C. H.: Discriminative training of hidden Markov models by multiobjective optimization for visual speech recognition. In: Proc. Int. Joint Conf. Neural Networks, Montreal, Canada, 2005, pp. 2053-2058. [23] Nam D., Lee J.-S., Park C. H.: n-dimensional Cauchy neighbor generation for the fast simulated annealing. IEICE Trans. Information and Systems, E87-D, 2004, pp. 2499-2502. [24] Metropolis N., Rosenbluth A. W., Rosenbluth M. N., Teller A. H., Teller E.: Equation of state calculations by fast computing machines. J. Chemical Physics, 21, 1953, pp. 1087-1092. [25] Kahng A. B., Mandoiu I. I., Xu X., Zelikovsky A. Z.: Enhanced design flow and optimizations for multiproject wafers. IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, 26, 2007, pp. 301-311. [26] Lee L.-W., Wang L.-H., Chen S.-M.: Temperature prediction and TAIFEX forecasting based on high-order fuzzy logical relationships and genetic simulated annealing techniques. Expert Systems with Applications, 34, 2008, pp. 328-336. [27] Szu H., Hartley R.: Fast simulated annealing. Physics Letters A, 122, 1987, pp. 157-162. [28] Lee J.-S., Park C. H.: Hybrid simulated annealing and its application to optimization of hidden Markov models for visual speech recognition. IEEE Trans. Systems, Man and Cybernetics: Part B (accepted). [29] Lee Y., Lee J.-S., Lee S.-Y., Park C. H.: Improving generalization capability of neural networks based on simulated annealing. In: Proc. IEEE Congress on Evolutionary Computation, Singapore, 2007, pp. 3447-3453. [30] Lee J.-S., Lee H., Nam D., Park C. H.: Self-organizing neural networks by construction and pruning. IEICE Trans. Information and Systems, E87-D, 2004, pp. 2489-2498. [31] Blake C. L., Merz C. J.: UCI repository of machine learning database. Univ. California, Dept. Information and Computer Science, Irvine, CA, 1998. [32] Gonzalez J., Rojas I., Ortega J., Pomares H., Fernandez F. J., Diaz A. F.: Multiobjective evolutionary optimization of the size, shape, and position parameters of radial basis function networks for function approximation. IEEE Trans. Neural Networks, 14, 2003, pp. 1478-1495. [33] Mishra D., Yadav A., Kalra P. K.: A learning algorithm for a novel neural network architecture motivated by integrate-and-fire neuron model. Neural Network World, 16, 2006, pp. 513-532.

537

Radial Basis Function.pdf

A Probabilistic Radial Basis Function Approach for ...

$pdf-0749\radial-basis-function-rbf-neural-network-control-for ...$

pdf-0749\radial-basis-function-rbf-neural-network-control-for ...

Global optimization of minority game by intelligent agents - Springer Link

Towards global optimization of combined distillation ...

Electromagnetic interaction of arbitrary radial ...

A synthesis of radial growth patterns preceding tree mortality SI.pdf ...

Summer temperature drives radial growth of alpine shrub willows on ...

Global Optimization for Hash-based Splitting

Efficient Global Optimization Based 3D Carotid AB-LIB ...

A synthesis of radial growth patterns preceding tree mortality.pdf ...

Drum and method of shaping a radial tire

general radial after 2 races.pdf

Molecular Basis of Colorectal Cancer

Biogeosystemic Basis of Free-Standing Neurosensing.pdf ...

BuettnerEnneverJA-2002-neuroanatomical-basis-of-disorders-dual ...

OPTIMIZATION OF INTENSITY-MODULATED RADIOTHERAPY ...

MOLECULAR BASIS OF INHERITANCE.pdf

The pathogenic basis of malaria