IEICE TRANS. INF. & SYST., VOL.E87–D, NO.11 NOVEMBER 2004


PAPER

Self-Organizing Neural Networks by Construction and Pruning

Jong-Seok LEE†, Member, Hajoon LEE†, Jae-Young KIM††, Nonmembers, Dongkyung NAM†, Member, and Cheol Hoon PARK†a), Nonmember

SUMMARY  Feedforward neural networks have been successfully developed and applied in many areas because of their universal approximation capability. However, there still remains the problem of determining a suitable network structure for the given task. In this paper, we propose a novel self-organizing neural network which automatically adjusts its structure according to the task. Utilizing both constructive and pruning procedures, the proposed algorithm finds a near-optimal network which is compact and shows good generalization performance. One of its important features is reliability, meaning that the randomness inherent in neural network training is effectively reduced. The resultant networks have suitable numbers of hidden neurons and hidden layers according to the complexity of the given task. Simulation results on well-known function regression problems show that our method successfully organizes near-optimal networks.
Key words: self-organizing neural network, construction, pruning, impact factor, pool of candidates

1. Introduction

The artificial neural network, or simply the neural network, models the structure and the function of the brain in an artificial way and thus can be applied to many real-world problems. Among models of the neural network, feedforward multilayer neural networks are used in many areas such as pattern recognition, signal processing, optimization, control and identification [1]–[5] because of their universal approximation capability [4]. The design goal of a neural network for a specific problem is to find a proper network which can learn the training data sufficiently and generalize well to untrained data. When designing a suitable network, we need to determine an appropriate network structure for the problem and train the network with training samples, because the structure of the network influences its overall performance considerably. In general, when the network is too simple compared to the given task, it fails to learn the underlying structure of the task fully (underfitting). On the other hand, when the network is too complicated, it can learn the training data but may show poor generalization performance (overfitting) [5]. Therefore, it is necessary to make the complexity of the network closely match that of the task. The simplest way to find a suitable network structure is a trial-and-error method.

Manuscript received May 10, 2004. Manuscript revised July 20, 2004.
† The authors are with the Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology, 373–1 Guseong-dong, Yuseong-gu, Daejeon 305–701, Korea.
†† The author is with Makus, Inc., 748–14 Yeoksam-dong, Gangnam-gu, Seoul 135–080, Korea.
a) E-mail: [email protected]

However, this is laborious because we have to try many different structures, each initialized with several different random weight values, and train them all. Thus, it is desirable to find an algorithm that determines a suitable network architecture for the given task in a self-organized way. Researchers have developed several methods for solving this problem, among which the two major approaches are the constructive and the pruning methods. The former begins with a small-sized network and increases the number of hidden neurons until the network succeeds in learning [6]. In the latter approach, one trains a network having a sufficiently complex structure and then removes some of its hidden neurons or weights which contribute little to the output [7]. Each approach has its own advantages and disadvantages. For the constructive method, it is straightforward to specify an initial network, whereas the pruning method requires prior knowledge of the sufficient network complexity for the task. However, the network produced by the constructive method may have some insignificant neurons or weights, which can be removed by the pruning method. Therefore, it is helpful to combine both methods for optimizing the network structure for the given problem. Motivated by these considerations, we propose a novel systematic self-organizing method for optimizing the structure of networks having (possibly) multiple hidden layers with sigmoidal hidden neurons. Algorithms for finding an "optimal" network usually concentrate on producing networks that are as small as possible while showing good generalization performance for the given task. In addition, we consider another important aspect of an algorithm: reliability. Networks starting from different initial weight values end up with quite different final weight values and generalization performance; we call this phenomenon the randomness of neural networks. Randomness makes an algorithm unreliable in the sense that different trials give quite different performance. Therefore, the main objectives of the proposed method are compactness, good generalization performance and reliability. In order to meet these objectives, we use both constructive and pruning algorithms. In the constructive procedure, the network grows by the addition of hidden layers as well as hidden neurons to an initial network having one hidden neuron, and we introduce the concept of a pool of candidates for the optimization of newly added neurons' weights in order to enhance the reliability of the algorithm.


After the network has grown enough to satisfy the desired error requirement by the constructive procedure, it is made to shrink by the elimination of some insignificant hidden neurons, because the constructive procedure might produce neurons which do not contribute much to the output and may even degrade the generalization performance. In the pruning procedure, we use an effective measure of a neuron's significance, called the impact factor [8], [9]. We test the developed method through computer simulations on well-known regression problems. In the following section, we briefly review previous constructive and pruning approaches. In Sect. 3, we describe the proposed self-organizing neural networks in detail. Section 4 demonstrates the performance of the proposed method via computer simulations on four regression problems. Finally, conclusions are drawn in Sect. 5.

2. Previous Approaches

A pruning algorithm is a top-down approach; it starts with a sufficiently complex network and, after training the network, eliminates some useless neurons or weights. One important issue is which neurons or weights are to be eliminated among the existing ones. A few methods have been developed for measuring the significance (saliency) of a neuron or a weight. Le Cun et al. [10] proposed optimal brain damage (OBD), in which the saliency of a weight is obtained by estimating the second derivative of the error with respect to the weight. This method needs to calculate the Hessian matrix under the assumption that it is diagonal. Another Hessian-based approach, optimal brain surgeon (OBS), was proposed by Hassibi and Stork [11]. This approach is similar to OBD but does not assume that the Hessian matrix is diagonal; accordingly, it contains the optimal brain damage procedure as a special case.

Regularization can also be used for pruning. It drives unnecessary weights toward zero during training by adding penalty terms to the objective function. When the network is larger than required, there will be some unnecessary connections in the network. By forcing such connections to take values close to zero during training, so that they can be pruned afterwards, we can prevent overfitting and improve the generalization performance for untrained data. The weight decay [12] and the weight elimination [13] methods are among the most frequently used regularization algorithms. In these methods, the penalty terms are functions of the weight values, and thus some of the weights are forced to have small magnitudes. Regularization itself, however, does not alter the network structure; the structure must be determined a priori. Moreover, the balancing factor between the conventional error function and the complexity penalty term must be set to an appropriate value. The performance of regularization is usually quite sensitive to this parameter, whose value is difficult to determine.
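As a small illustration of the regularization idea above, the sketch below adds a weight-decay penalty to a mean-squared-error objective. The function name and the balancing factor lam are our own illustrative choices, not notation from the paper or from [12], [13].

    import numpy as np

    def weight_decay_objective(residuals, weight_matrices, lam=1e-3):
        # Conventional error term: mean squared error over the training residuals.
        mse = np.mean(residuals ** 2)
        # Weight-decay penalty: sum of squared weights; weights driven toward zero
        # by this term are the ones that can later be pruned.
        penalty = sum(np.sum(W ** 2) for W in weight_matrices)
        return mse + lam * penalty

The quality of the result hinges on lam, the balancing factor mentioned above, for which regularization itself offers no principled choice.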

The constructive algorithm is basically a bottom-up approach, in which the network incrementally grows from a small size until the network error reaches a pre-defined value or the amount of error reduction falls below a threshold. The cascade-correlation (CC) architecture [14], proposed by Fahlman, is a constructive network architecture having multiple hidden layers with one node in each layer. Each inserted neuron is connected not only to the input and the output but also to the existing hidden neurons, thereby enabling the network to have high-order characteristics. Projection pursuit learning (PPL) [15] constructs a network whose hidden units' activation functions are not prescribed in advance but determined from the data as a part of the training procedure; typically, each hidden unit is represented by a linear combination of Hermite functions. In dynamic node creation (DNC) methods [16], [17], sigmoidal hidden units with random initial weights are repeatedly added into the same hidden layer and the network is retrained after the addition of each hidden unit. A review of various constructive algorithms can be found in [6]. The constructive algorithms are more robust than the pruning algorithms because the latter require prior knowledge of the sufficient network complexity for the given problem. In [18], it was shown that the constructive algorithms represent a class of universal learners. Furthermore, they can utilize pruning algorithms as parts of their procedures for removing inefficient neurons and weights. The networks produced by the constructive methods may have some insignificant neurons or weights which contribute little to the output or even cause poor generalization performance due to overfitting. In these cases, the pruning procedures can eliminate them to yield a more compact network with improved generalization performance.

3. The Proposed Self-Organizing Neural Network

In this section, we describe the proposed self-organizing network in detail. First of all, the features and the advantages of the proposed method are summarized as follows.

• Both the constructive and the pruning algorithms are involved, and so the benefits of both methods are obtained. From an initial network, the network grows by the constructive procedures until the desired error goal is reached. After the network becomes large enough to learn the training data, there might exist some insignificant neurons in the network, and the pruning procedures are applied to eliminate them in order to make the network more compact and enhance the generalization performance.

• The constructive procedures can build hidden layers as well as hidden neurons. Although it is well known that neural networks with one hidden layer are universal approximators [19], networks having more than one hidden layer are sometimes more efficient in terms of learning speed, network size and generalization capability [20]. When the addition of neurons becomes ineffective, a new hidden layer is inserted and new neurons are added in the new hidden layer. Thus, the network complexity is effectively increased when necessary.


• We use an effective pruning method, called impact factor pruning, to remove insignificant neurons of the network generated by the constructive procedures.

• When a new hidden neuron is added, it is chosen from a pool of candidates [14] which consists of a few neurons having different weight values. This scheme allows a more careful and optimized initialization of inserted neurons than random initialization. Moreover, it improves the reliability of the algorithm.

• Instead of the classical error backpropagation algorithm or its variants, the network is trained by the Levenberg-Marquardt (LM) algorithm [21], which is one of the fastest second-order algorithms.

Without loss of generality, we assume that the organized network has one output node; it is easy to extend the proposed method to networks having multiple output nodes. Also, we describe the network with a linear output node in order to apply it to regression problems. However, the proposed method can be used in a similar manner for classification problems, in which the output activation function is usually a sigmoid.

Figure 1 shows the overall procedure of the proposed algorithm. We start with an initial network of a small size. Neurons are added in the hidden layer until the amount of error reduction obtained by the addition of neurons becomes saturated. Then, some insignificant neurons are removed and a new hidden layer is constructed. Again, hidden neurons are inserted in the new hidden layer. When the network error reaches the given error goal through the construction of hidden neurons and hidden layers, insignificant hidden neurons in all hidden layers are pruned and the algorithm ends. After each addition of neurons/layers and each pruning of neurons, the whole network is retrained by the LM algorithm. Through the repeated constructive and pruning procedures, the network structure is adaptively determined. We should consider effective and optimized ways for each growing/pruning step in order to help the network organize its structure efficiently. Therefore, we devise useful methods for the specification of the initial network, the selection of newly added neurons, the construction of new hidden layers and the pruning of neurons, which are explained one by one in detail in what follows.
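The overall flow of Fig. 1 can be summarized by the structural sketch below. The callables stand for the procedures of Sects. 3.1–3.4; their names and signatures are ours and are not defined in the paper.

    def self_organize(net, train, error_of, add_neuron, saturated,
                      prune_layer, add_layer, prune_all_layers, error_goal):
        """Structural sketch of the flowchart in Fig. 1; all callables are user-supplied."""
        train(net)                      # the paper retrains with the LM algorithm
        while error_of(net) > error_goal:
            add_neuron(net)             # Sect. 3.2: best neuron from a pool of candidates
            train(net)
            if saturated(net):          # criterion (4): error reduction has leveled off
                prune_layer(net)        # Sect. 3.4: prune the current hidden layer
                add_layer(net)          # Sect. 3.3: old output neuron becomes hidden
                train(net)
        prune_all_layers(net)           # final layer-by-layer pruning of the whole network
        train(net)
        return net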

3.1 An Initial Network

The initial network has one sigmoidal hidden neuron and one linear output neuron, as shown in Fig. 2 (a). It is known that the network in Fig. 2 (a) has local solutions [22]. Therefore, in order to reduce the randomness, various initial networks in a pool are trained, among which the best is selected as the initial network. When the input dimension of the network is low, the initial input weights can be assigned uniformly in angle without random number generation while the magnitudes of the weight vectors are kept the same.

Fig. 1  A flowchart of the proposed algorithm.

Fig. 2  The initial network and the initialization scheme of its input weights. (a) Structure of the initial network. (b) Example of the generation of candidates for an initial network having two input nodes.


For example, when the input of the network is two-dimensional, the N_c candidates of the input weight vector are given by

    w = [w_1, w_2] = \left[\, r\cos\!\left(\frac{\pi i}{N_c}\right),\; r\sin\!\left(\frac{\pi i}{N_c}\right) \right], \quad i = 0, 1, \ldots, N_c - 1,    (1)

where r is the magnitude of w. This situation is illustrated in Fig. 2 (b). We only have to consider the range [0, π), since the vectors in [π, 2π) can be obtained by inverting the signs of the output weights. This initialization scheme has some merits. First, since it is deterministic, i.e., no random number generation is involved, the randomness of the overall self-organizing procedure is reduced. Second, the initial geometry of the hyperplane of the sigmoid function can be explored evenly in angle in the weight space. We determine the initial bias of each candidate by making the decision boundary of the sigmoid function pass through the maximal target point, so that the network learns the target effectively and the active (sensitive) region of its sigmoid function is located in the input domain. For example, if the maximal target occurs at the training input x^k, the equation of a candidate neuron's decision boundary passing through x^k can be written as

    w \cdot x^k + b = 0    (2)

and thus the bias is given by

    b = -w \cdot x^k.    (3)

The output weight of each candidate is obtained by the pseudo-inverse method. Among these candidates, we select the one showing the smallest training error as the initial network.
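To make the candidate scheme of (1)–(3) concrete, the sketch below generates the N_c deterministic candidates for a two-dimensional input, fits each candidate's output weight by ordinary least squares (standing in for the pseudo-inverse step), and keeps the one with the smallest training error. All names are ours and the code is only an illustration of the description above, not the authors' implementation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def best_initial_candidate(X, t, n_c=18, r=0.05):
        """X: (M, 2) training inputs, t: (M,) targets.  Returns (w, b, v, sse)."""
        k = np.argmax(t)                          # training point with the maximal target
        best = None
        for i in range(n_c):
            theta = np.pi * i / n_c               # angles spread uniformly over [0, pi), Eq. (1)
            w = r * np.array([np.cos(theta), np.sin(theta)])
            b = -w @ X[k]                         # decision boundary through x^k, Eq. (3)
            h = sigmoid(X @ w + b)                # hidden neuron output for every training input
            A = np.column_stack([h, np.ones(len(t))])
            v, *_ = np.linalg.lstsq(A, t, rcond=None)   # linear output weight and bias
            sse = np.sum((A @ v - t) ** 2)
            if best is None or sse < best[3]:
                best = (w, b, v, sse)
        return best

With the settings used later in the paper (N_c = 18, r = 0.05), such a comparison of candidates would correspond to the kind of plot shown in Fig. 7.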

3.2 Addition of Neurons

Figure 3 depicts the procedure of adding a new hidden neuron. New interconnections between the new neuron and the existing neurons in the adjacent layers (dashed lines) are established. At each addition of a neuron, we make a pool of candidates consisting of a few neurons with randomly assigned initial weights and train them separately to learn the existing network's residual error. Among them, the best is selected as the inserted neuron. As in the case of the initial network, we obtain the initial bias by making the decision boundary of the sigmoid function pass through the maximal error point. After a neuron is added, the whole network is retrained by the LM algorithm.

Fig. 3 Addition of a new hidden neuron. Each end point of the dashed line forms a candidate.
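In the same spirit, a newly inserted neuron can be chosen from a pool of randomly initialized candidates scored on the current residual error. The sketch below only fits each candidate's outgoing weight in closed form, whereas the paper trains the candidates and then retrains the whole network with the LM algorithm; all names are ours.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def best_new_neuron(H, residual, n_candidates=5, scale=0.05, seed=0):
        """H: (M, d) inputs of the layer being grown, residual: (M,) current residual error."""
        rng = np.random.default_rng(seed)
        k = np.argmax(np.abs(residual))            # maximal error point, used for the bias
        best = None
        for _ in range(n_candidates):
            w = rng.uniform(-scale, scale, H.shape[1])   # random incoming weights
            b = -w @ H[k]                          # decision boundary through the maximal error point
            y = sigmoid(H @ w + b)
            v = (y @ residual) / (y @ y)           # least-squares outgoing weight
            sse = np.sum((v * y - residual) ** 2)
            if best is None or sse < best[3]:
                best = (w, b, v, sse)
        return best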

3.3 Addition of Layers

As new hidden neurons are added one by one, the error reduction tends to decrease gradually. When it becomes small, the addition of a new hidden layer becomes more effective than the addition of neurons. Therefore, the network adds a new layer when the following condition is satisfied:

    \frac{E(k) - E(k+n)}{E(k)} < \epsilon,    (4)

where E(k) is the training error of the network after the k-th neuron is inserted and the network is retrained, and n is the comparison interval (in number of neurons) for checking whether the error reduction has saturated. From our extensive experiments, we observed that a value of n greater than three causes the addition of excessive neurons, and thus we set n to three in our simulations. Before the creation of a new layer, we prune some insignificant neurons in the current hidden layer in order to make the layer more compact, which will be explained in the following subsection. Then, a new hidden layer is constructed and new hidden neurons are added in it in the following way (Fig. 4): First, the existing linear output neuron becomes the first sigmoidal neuron of the new hidden layer. Then, a new output neuron is added and the second hidden neuron of the new hidden layer is created. Because the output neuron of the previous network changes from a linear neuron to a sigmoidal neuron, there would be a jump in the network error. In order to prevent this discontinuity, the output weights from the neurons in the existing hidden layer to the output neuron of the previous network (w_a in Fig. 4) are scaled down so that the input to the changing neuron stays within the linear range of the sigmoid, and the outgoing weight from the first neuron of the new hidden layer to the new output neuron (w_b in Fig. 4) is inversely scaled. In our experiments, the linear range is determined so that the slope of the tangent at every point in the range is between 0.9975 and 1.0. For classification, since the output neuron is sigmoidal, this scaling scheme is not necessary and can be omitted.

Fig. 4  Addition of a new hidden layer.
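A compact reading of the layer-addition test (4) and of the weight rescaling described above is sketched below. The threshold eps, the helper names, and the linear-range bound z_lin are our assumptions; in particular, z_lin ≈ 0.05 is what a tangent slope of at least 0.9975 would imply for a unit-slope sigmoid such as tanh, while the paper does not state which sigmoid it uses.

    import numpy as np

    def should_add_layer(errors, n=3, eps=0.01):
        """Eq. (4): errors[k] holds the training error after the k-th neuron was added
        and the network retrained; eps is an illustrative threshold."""
        if len(errors) <= n:
            return False
        e_k, e_kn = errors[-1 - n], errors[-1]
        return (e_k - e_kn) / e_k < eps

    def rescale_for_new_layer(w_a, z_max, z_lin=0.05):
        """Scale the old output weights w_a down so that the largest input magnitude z_max
        to the neuron that becomes sigmoidal falls inside the linear range |z| <= z_lin,
        and return the inverse factor to apply to the new outgoing weight w_b."""
        s = z_lin / max(abs(z_max), 1e-12)
        return w_a * s, 1.0 / s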


3.4 Pruning of Neurons

Pruning of neurons aims at producing a compact network and improving the generalization performance. In our method, pruning is applied before a new hidden layer is constructed and after the network has grown enough to approximate the training data within the desired error bound. As a measure of a neuron's significance, we proposed the impact factor (ImF) [8]. The key idea of the ImF is to define the significance of a neuron by measuring how important a role it plays for the next layer in the data domain. Consider a part of a network as shown in Fig. 5. The input of the j-th neuron in the next layer from the current layer for the m-th training input, x_j^m, is written as

    x_j^m = \sum_i w_{ji} y_i^m + b_j = \sum_i w_{ji}\,(y_i^m - \bar{y}_i) + \sum_i w_{ji}\,\bar{y}_i + b_j,    (5)

where y_i^m is the i-th neuron's output in the current layer for the m-th input data, w_{ji} the weight from the i-th neuron in the current layer to the j-th neuron in the next layer, b_j the bias of the j-th neuron in the next layer, and ȳ_i the average value of the i-th neuron's output over all the training input patterns. Since the second and the third terms in (5) are constants for the current network, the output of the j-th neuron is determined by the first term of the equation. Moreover, for x_j^m, the amount of contribution of the i-th neuron in the current layer through w_{ji} is w_{ji}(y_i^m − ȳ_i). Consequently, the total contribution of the i-th neuron to the next layer is given by \sum_j w_{ji}(y_i^m − ȳ_i). Therefore, the i-th hidden neuron's ImF is defined as follows:

    \mathrm{ImF}_i = \sum_j w_{ji}^2\,\sigma_i^2,    (6)

where σ_i^2 is the sample variance of the i-th neuron's output. The ImF becomes small when 1) \sum_j |w_{ji}| is small, i.e., the connection to the next layer is weak, 2) the variance σ_i^2 is small because the sigmoidal neuron operates in its saturation region for the training data, or 3) σ_i^2 is small because the connection weights from the previous layer are small. Since a small ImF implies that the neuron behaves like a bias without contributing much to the regression, such a neuron can be selected for pruning. When eliminating a neuron, we add the sample average of its outputs over all training data to the bias of the next layer in order to minimize the jump in the network error and to help the network to be retrained easily. Therefore, after the i-th neuron in Fig. 5 is eliminated, the bias of the j-th neuron in the next layer is compensated by

    b_j^{\mathrm{new}} = b_j^{\mathrm{old}} + w_{ji}\,\bar{y}_i.    (7)

Then, the whole network is retrained by the LM algorithm.
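Equations (6) and (7) translate directly into code; the array shapes and names below are ours, and the retraining performed after each elimination is omitted.

    import numpy as np

    def impact_factors(Y, W):
        """Eq. (6): ImF_i = sum_j w_ji^2 * var(y_i).
        Y: (M, H) outputs of the current layer's H neurons for M training patterns.
        W: (J, H) weights from the current layer to the J neurons of the next layer."""
        return (W ** 2).sum(axis=0) * Y.var(axis=0)

    def prune_neuron(Y, W, b_next, i):
        """Remove the i-th neuron and compensate the next layer's biases as in Eq. (7)."""
        b_new = b_next + W[:, i] * Y[:, i].mean()   # b_j + w_ji * mean(y_i) for every j
        return np.delete(Y, i, axis=1), np.delete(W, i, axis=1), b_new

Repeatedly removing the neuron with the smallest impact factor, retraining, and stopping once the error exceeds β times its pre-pruning value gives the pruning loop described next.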

Before adding a new hidden layer, we prune some insignificant neurons in the current hidden layer. Pruning continues until the network error becomes greater than the error before pruning multiplied by a limiting parameter β (≥ 1). Usually β has a value close to unity. We observed that, without this procedure, the final networks tend to be larger; it is easier to perform the subsequent organizing procedures with the compact network resulting from the elimination of insignificant neurons. After the given error goal is reached, we prune insignificant neurons of the whole network. In this case, pruning is performed in a layer-by-layer manner: alternating over the layers, the neuron having the smallest ImF in a layer is pruned and the whole network is retrained. This procedure is repeated until no more neurons can be pruned in any of the hidden layers. It is not obvious in which order the layers should be pruned, but a rule of thumb based on our simulation results is to prune a neuron in the layer having the most neurons first. The ImF pruning has some advantages over other pruning algorithms [9]. First, the ImF pruning method produces simpler networks showing better generalization performance than conventional methods such as magnitude-based pruning, OBD and OBS; its superiority is especially prominent for complex networks having multiple hidden layers. Second, the ImF pruning is computationally less expensive than OBD and OBS because the ImF measures the significance of neurons in the data space while the others do so in the weight space. Third, the increase of the network error is reduced by compensating the bias term after each pruning. Fourth, the ImF pruning has a physical meaning, as explained above.

4. Simulation

The developed method, called the self-organizing neural network (SONN), is applied to regression modeling problems. We choose four functions, g_i : [0, 1]^2 → R, i = 1, 2, 3, 4, which are widely used nonlinear test functions [8], [15], [23], [24]. They are given by the following equations:

Fig. 5  A part of a network. y_i is the i-th neuron's output.


• Sine function:

    g_1(x_1, x_2) = 1.58\left[\sin\!\left(16(x_1 - 0.5)(x_2 - 0.5)\right) + 1\right].    (8)

• Additive function:

    g_2(x_1, x_2) = 1.35\left[1.5(1 - x_1) + e^{2x_1 - 1}\sin\!\left(3\pi(x_1 - 0.6)^2\right) + e^{3(x_2 - 0.5)}\sin\!\left(4\pi(x_2 - 0.9)^2\right)\right].    (9)

• Complicated interaction function:

    g_3(x_1, x_2) = 1.89\left[1.35 + e^{x_1}\sin\!\left(13(x_1 - 0.6)^2\right)e^{-x_2}\sin(7x_2)\right].    (10)

• Harmonic function:

    g_4(x_1, x_2) = 1.46\left[\sin\!\left(4\pi\sqrt{(x_1 - 0.5)^2 + (x_2 - 0.5)^2}\right) + 1\right].    (11)

Figure 6 shows three-dimensional perspective plots of the functions. Using functions of two-dimensional input is practically useful because the input–output relationship of the neural network can be visualized graphically. Randomly selected 225 data points are used for training the neural networks, and a regular grid of 10,000 points is used for the generalization test.

Fig. 6  Perspective plots of the test functions. (a) g1. (b) g2. (c) g3. (d) g4.

We measure the network error in terms of the square root of the fraction of variance unexplained (SFVU) [15], defined by

    \mathrm{SFVU} = \sqrt{\frac{\sum_{m=1}^{M}\left(\hat{g}(x^m) - g(x^m)\right)^2}{\sum_{m=1}^{M}\left(g(x^m) - \bar{g}\right)^2}},    (12)

where x^m is the m-th training pattern, g the desired output, ĝ the actual response of the network, ḡ the mean of the target data, and M the number of training data. The denominator of the SFVU corresponds to the standard deviation of the training targets and the numerator to the usual root mean squared error (RMSE); therefore, the SFVU measure is proportional to the RMSE. The training SFVU goal is set to 0.04 for all cases. This goal is determined on the basis of our observation that networks trained below this goal show good generalization performance.

First, we compare our method with the DNC algorithm, which is similar to the proposed method in that both methods yield multilayer neural networks having sigmoidal hidden neurons and contain a process of repeated neuron addition. We use the DNC algorithm of Setiono and Hui [17], in which the quasi-Newton method is employed for training the neural networks. Since DNC adds hidden neurons only, the proposed method is prevented from constructing hidden layers in this comparison by setting ε = −∞ in (4). The performance of the proposed method is compared with that of DNC in terms of the number of weight parameters, the generalization SFVU and the reliability over different trials. We perform each experiment ten times.
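For reference, (12) can be computed directly as below; the function name and argument order are ours.

    import numpy as np

    def sfvu(y_pred, y_true):
        """Square root of the fraction of variance unexplained, Eq. (12)."""
        num = np.sum((y_pred - y_true) ** 2)
        den = np.sum((y_true - np.mean(y_true)) ** 2)
        return np.sqrt(num / den)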



Then, we compare SONN with neural networks of fixed topology: varying the numbers of hidden neurons and hidden layers, we train 100 different networks with random initial weights generated within [−0.05, +0.05] by the LM algorithm for each test function until they meet the SFVU goal of 0.04. The maximum number of epochs is set to 1200. This comparison is to demonstrate the ability of the proposed method to generate compact networks with good generalization capability. For SONN, the number of training epochs was set to 100 for addition and 1200 for pruning. The value of β for pruning before adding a layer was set to 1.02. We use 18 candidates for the initial network (i.e., N_c = 18) and five candidates for each newly added hidden neuron. These parameter settings are based on the experimental observation that using more candidates does not help the performance of the algorithm much. The magnitude of the initial weight vector of the initial network, r, is set to 0.05, and the initial weights of the candidates for a newly added neuron are randomly chosen within [−0.05, +0.05]. Figure 7 compares the performance of the candidates for the initial network, and Fig. 8 that of the candidates for an added hidden neuron, in the case of g1. Each candidate shows a different performance from the others, and the one with the smallest training error is selected as the initial network or the newly inserted neuron. Table 1 shows the comparison of the results of DNC using the quasi-Newton training method (DNC-QN), DNC using the LM training method (DNC-LM) and the proposed method. The number of weight parameters, i.e., degrees of freedom (DoFs), the generalization performance measured by the SFVU, and their standard deviations over ten experiments are shown. For the proposed method, the results before pruning are included along with the final results. We can see that the final networks by SONN show better performance than those by DNC. The networks by SONN have about 10–45% and 15–26% fewer weights than DNC-QN and DNC-LM, respectively, and SONN shows better generalization performance by 10–15% except for g2, for which all methods show similar results.

The standard deviations of both the DoFs and the SFVU over ten trials are much smaller for SONN than for the DNCs, which means that the reliability is improved in our method; comparing the results of SONN before pruning with those of the DNCs shows that using the pool of candidates for the initial network and for neuron addition reduces the randomness of the neural networks. Also, we see that the pruning procedure improves the reliability as well as the network compactness and the generalization performance. We observed that the pruning procedure is the most time-consuming part of the proposed method, while the time for choosing the initial network and new hidden neurons from the pools of candidates is less than 10% of the total time of the overall self-organizing procedure. Hence, the proposed method has a higher time complexity than DNC, which does not contain the pruning step. In practice, SONN takes only about 1.5–2 times as much time as DNC-LM on average†.

In Table 2, we compare SONN with the networks of fixed topology for two-layer networks. Addition of a layer is not allowed in SONN here. We see that the networks produced by the proposed method have only about one or two more hidden neurons on average than the minimum trainable networks of fixed topology. As for the best results of SONN, it finds the 2-21-1 network (i.e., 2 input nodes, 21 hidden neurons and one output node), which is not trainable with a fixed topology, for g4, and the minimum trainable network for g2.

Fig. 7  Performance of the candidates for the initial network for g1. The 17th candidate (marked as the gray bar) shows the smallest training error and is selected as the initial network.

Fig. 8  Performance of the candidates for (a) the second hidden neuron and (b) the third hidden neuron for g1. The ones showing the smallest error (marked as the gray bars) are selected as the inserted neurons.

† As for DNC-QN, since the quasi-Newton algorithm showed much slower convergence than the LM algorithm, DNC-QN was slower than SONN.


Table 1  Comparisons of the proposed method and DNC. The standard deviation values for ten trials are shown in parentheses.

Function  Measure           DNC-QN            DNC-LM            Proposed (before pruning)  Proposed (after pruning)
g1        DoFs (std)        56.6 (9.32)       53.0 (13.2)       57.8 (5.00)                39.4 (2.64)
          Test SFVU (std)   0.0566 (0.01082)  0.0516 (0.00390)  0.0540 (0.00129)           0.0497 (0.00144)
g2        DoFs (std)        32.2 (5.60)       27.8 (3.80)       37.0 (6.80)                23.4 (3.68)
          Test SFVU (std)   0.0414 (0.00295)  0.0423 (0.00298)  0.0421 (0.00081)           0.0420 (0.00178)
g3        DoFs (std)        77.4 (13.2)       57.0 (6.52)       61.4 (2.28)                49.0 (0)
          Test SFVU (std)   0.0678 (0.00878)  0.0633 (0.00674)  0.0657 (0.00194)           0.0548 (0.00064)
g4        DoFs (std)        162.2 (24.8)      113.8 (10.3)      108.6 (10.8)               90.2 (3.60)
          Test SFVU (std)   0.0690 (0.01605)  0.0697 (0.02373)  0.0617 (0.01076)           0.0604 (0.00758)

Table 2  Comparisons of the two-layer networks by the proposed algorithm with the networks having fixed structures. For the fixed topology, the success rate over 100 trials is given in parentheses.

Function  Measure     Fixed topology                                   Proposed (avg)  Proposed (best)
g1        DoFs        29 (0%)    33 (2%)    37 (11%)   41 (52%)        39.4            37
          Structure   2-7-1      2-8-1      2-9-1      2-10-1          –               2-9-1
          Test SFVU   –          0.0513     0.0497     0.0510          0.0497          0.0482
g2        DoFs        17 (0%)    21 (4%)    25 (92%)   29 (96%)        23.4            21
          Structure   2-4-1      2-5-1      2-6-1      2-7-1           –               2-5-1
          Test SFVU   –          0.0430     0.0411     0.0411          0.0420          0.0384
g3        DoFs        41 (0%)    45 (3%)    49 (21%)   53 (48%)        49.0            49
          Structure   2-10-1     2-11-1     2-12-1     2-13-1          –               2-12-1
          Test SFVU   –          0.0627     0.0598     0.0617          0.0548          0.0542
g4        DoFs        85 (0%)    89 (8%)    93 (38%)   97 (65%)        90.2            85
          Structure   2-21-1     2-22-1     2-23-1     2-24-1          –               2-21-1
          Test SFVU   –          0.0639     0.0615     0.0611          0.0604          0.0519

Table 3  Comparisons of the three-layer networks by the proposed algorithm with the networks having fixed structures. For the fixed topology, the success rate over 100 trials is given in parentheses.

Function  Measure     Fixed topology                                    Proposed (avg)  Proposed (best)
g1        DoFs        19 (0%)    20 (2%)    25 (2%)    30 (71%)         26.2            20
          Structure   2-2-3-1    2-3-2-1    2-4-2-1    2-5-2-1          –               2-3-2-1
          Test SFVU   –          0.0376     0.0400     0.0460           0.0411          0.0384
g3        DoFs        37 (0%)    43 (8%)    44 (25%)   51 (73%)         48.0            43
          Structure   2-4-4-1    2-6-3-1    2-5-4-1    2-5-5-1          –               2-6-3-1
          Test SFVU   –          0.0554     0.0575     0.0582           0.0594          0.0499
g4        DoFs        20 (0%)    25 (3%)    31 (6%)    44 (15%)         34.1            25
          Structure   2-3-2-1    2-3-3-1    2-4-3-1    2-5-4-1          –               2-3-3-1
          Test SFVU   –          0.0415     0.5029     0.2921           0.0439          0.0328

age than the trainable minimum networks of the fixed structure. Moreover, the minimum structures by SONN are always the simplest ones that can be trained with the success rates of only 2%, 8% and 3% using fixed topology. The performance of SONN is most outstanding in the case of g4 ; it finds the trainable minimum network (2-3-3-1) with better generalization performance without showing overfitting shown by the fixed topology. Note that, in terms of DoFs, three-layer neural networks show better performance than two-layer networks when we compare the results in Tables 2 and 3. This indicates that the addition of a layer in the proposed method provides a possibility of finding more


Fig. 9  Changes of the network errors for the training and test data during the self-organizing procedure for g4. The dashed line indicates the training goal of 0.04. The network structures at some specific time steps are also shown. Phase 1: addition of neurons. Phase 2: pruning of neurons before addition of a hidden layer. Phase 3: addition of neurons in the second hidden layer. Phase 4: pruning of neurons in all hidden layers.

Figure 9 shows a typical example of the self-organizing procedure for g4. We plot the network errors for the training and the test data after each addition/pruning of neurons and retraining of the whole network. The network error gradually decreases as neurons are inserted into the initial network, but the amount of error reduction becomes saturated when more than about ten neurons have been added (phase 1). Therefore, it is time to prune some neurons and add a new hidden layer. After five neurons are removed (phase 2), a new hidden layer is constructed by changing the existing output neuron into the first neuron of the new hidden layer and adding a new output neuron (at time step 18 in the figure). The slight increase of the error in phase 2 is due to the value of β being greater than one. Then, neurons are added into the new hidden layer (phase 3). One should note that the error reduction ratio here is much larger than in the latter part of phase 1, which supports the effectiveness of the layer-addition scheme. After three more neurons are inserted, the desired error bound is reached (at time step 21). Finally, some neurons in the two hidden layers are pruned (phase 4).

We can see that the generalization performance is improved by the pruning steps.

5. Conclusion

We have proposed self-organizing neural networks using both constructive and pruning procedures and carried out a comparative study via computer simulation. With the proposed method, the networks autonomously adapt their structures to the given problem by repeating addition of neurons, addition of layers and pruning of neurons. The important features of the developed method are the simplicity of the resultant network, good generalization performance and the reliability of the results.

In order to achieve these features, we analyzed the processes of the organizing algorithm, developed effective methods for each of them and combined them into an integrated algorithmic approach. The comparisons with DNC showed that the proposed method reliably yields more compact networks with better generalization, owing to the careful initialization of the inserted weights and the pruning of neurons. The compactness of the produced two- and three-layer networks was demonstrated by comparisons with networks having fixed structures. In future work, it would be necessary to apply the algorithm to high-dimensional regression and pattern classification problems and to perform an analytic study of convergence. The proposed method can also be extended to self-organizing algorithms for networks having activation functions other than the sigmoid. A deterministic way of adding neurons is also currently under investigation.

Acknowledgements

This work was supported by grant No. R01-2003-00010829-0 from the Basic Research Program of the Korea Science and Engineering Foundation.

References

[1] S. Chen, S. Billings, and P. Grant, "Nonlinear system identification using neural networks," Int. J. Control, vol.51, no.6, pp.1191–1214, 1990.
[2] N. Sadegh, "A perceptron network for functional identification and control of nonlinear systems," IEEE Trans. Neural Netw., vol.4, no.6, pp.982–988, Nov. 1993.
[3] H. Kabre, "Robustness of a chaotic modal neural network applied to audio-visual speech recognition," Proc. IEEE Neural Networks for Signal Processing, pp.607–616, 1997.
[4] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford Univ. Press, Oxford, UK, 1995.
[5] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, Upper Saddle River, NJ, 1999.
[6] T.-Y. Kwok and D.-Y. Yeung, "Constructive algorithms for structure learning in feedforward neural networks for regression problems," IEEE Trans. Neural Netw., vol.8, no.3, pp.630–645, May 1997.
[7] R. Reed, "Pruning algorithms — A survey," IEEE Trans. Neural Netw., vol.4, no.5, pp.730–747, Sept. 1993.
[8] J.-S. Lee and C.H. Park, "Self-organizing neural networks using adaptive neurons," Proc. Int. Conf. Neural Information Processing, pp.935–939, Singapore, Nov. 2002.
[9] H. Lee, S.-B. Chung, and C.H. Park, "A pruning algorithm of neural networks using impact factors," J. Inst. Electronics Engineers of Korea, vol.41CI, no.2, pp.77–86, March 2004.
[10] Y. Le Cun, J.S. Denker, and S.A. Solla, "Optimal brain damage," in Advances in Neural Information Processing Systems 2, ed. D.S. Touretzky, pp.598–605, Morgan Kaufmann, San Mateo, CA, 1990.
[11] B. Hassibi and D.G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Advances in Neural Information Processing Systems 5, pp.164–171, Morgan Kaufmann, San Mateo, CA, 1993.
[12] G.E. Hinton, "Connectionist learning procedures," Artif. Intell., vol.40, pp.185–234, 1989.
[13] A.S. Weigend, D.E. Rumelhart, and B.A. Huberman, "Generalization by weight-elimination applied to currency exchange rate prediction," Proc. Int. Conf. Neural Networks, vol.1, pp.837–841, Seattle, WA, 1991.


[14] S.E. Fahlman and C. Lebiere, "The cascade-correlation learning architecture," in Advances in Neural Information Processing Systems 2, ed. D.S. Touretzky, pp.524–532, Morgan Kaufmann, San Mateo, CA, 1990.
[15] J.N. Hwang, S.R. Lay, M. Maechler, D. Martin, and J. Schimert, "Regression modeling in backpropagation and projection pursuit learning," IEEE Trans. Neural Netw., vol.5, no.3, pp.342–353, 1994.
[16] T. Ash, "Dynamic node creation in backpropagation networks," Connection Sci., vol.1, no.4, pp.365–375, 1989.
[17] R. Setiono and L.C.K. Hui, "Use of a quasi-Newton method in a feedforward neural network construction algorithm," IEEE Trans. Neural Netw., vol.6, no.1, pp.273–277, Jan. 1995.
[18] E. Baum, "A proposal for more powerful learning algorithms," Neural Comput., vol.1, no.2, pp.201–207, 1989.
[19] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Netw., vol.2, no.5, pp.359–366, 1989.
[20] S. Tamura and M. Tateishi, "Capabilities of a four-layered feedforward neural network: Four layers versus three," IEEE Trans. Neural Netw., vol.8, no.2, pp.251–255, March 1997.
[21] M.T. Hagan and M.B. Menhaj, "Training feedforward networks with the Marquardt algorithm," IEEE Trans. Neural Netw., vol.5, no.6, pp.989–993, Nov. 1994.
[22] F.M. Coetzee and V.L. Stonick, "On the uniqueness of weights in single-layer perceptrons," IEEE Trans. Neural Netw., vol.7, no.2, pp.318–325, March 1996.
[23] C.H. Park, J.P. Yu, L.-J. Park, and S. Park, "A new neural network construction algorithm using a pool of hidden candidates," Proc. 4th Int. Conf. on Soft Computing, pp.654–657, Iizuka, Japan, Sept. 1996.
[24] V. Cherkassky and H. Lari-Najafi, "Constrained topological mapping for nonparametric regression analysis," Neural Netw., vol.4, no.1, pp.27–40, 1991.

Jong-Seok Lee received the B.S. degree in Electrical and Electronic Engineering from Korea Advanced Institute of Science and Technology, Daejeon, Korea, in 1999 and the M.S. degree in Electrical Engineering and Computer Science from Korea Advanced Institute of Science and Technology, Daejeon, Korea, in 2001, where he is currently working toward the Ph.D. degree. His research interests include neural networks, audio-visual speech recognition and evolutionary computation.

Hajoon Lee received the B.S. degree in Electronic and Electrical Engineering from Kyungpook National University, Daegu, Korea, in 2000. He received the M.S. degree in Electrical Engineering and Computer Science from Korea Advanced Institute of Science and Technology, Daejeon, Korea, in 2002, where he is currently working toward the Ph.D. degree. His research interests are intelligent control, neural networks and evolutionary computation.

Jae-Young Kim received the B.S. and M.S. degrees in Electrical and Electronic Engineering from Korea Advanced Institute of Science and Technology, Daejeon, Korea, in 1996 and 1998, respectively. He has been a research engineer at Makus, Inc. since 1998. His current research interests are IEEE 802.16d/e and high-speed portable internet.

Dongkyung Nam received the B.S. and M.S. degrees in Electrical Engineering from Seoul National University, Seoul, Korea, in 1994 and 1996, respectively. He received the Ph.D. degree in Electrical Engineering and Computer Science from Korea Advanced Institute of Science and Technology, Daejeon, Korea, in 2002, where he is currently a postdoctoral fellow. His research interests are evolutionary algorithms, optimization, artificial life and cognition theory.

Cheol Hoon Park received the B.S. degree in Electronics Engineering with the best student award from Seoul National University, Seoul, Korea, in 1984 and the M.S. and Ph.D. degrees in Electrical Engineering from California Institute of Technology, Pasadena, California, in 1985 and 1990, respectively. He joined the Department of Electrical Engineering at the Korea Advanced Institute of Science and Technology in 1991, where he is currently a Professor. His current research interests are in the area of intelligent systems including intelligence, neural networks, fuzzy logic, evolutionary algorithms, and their application to recognition, information processing, intelligent control, dynamic systems and optimization. He is a senior member of IEEE and a member of INNS and KITE.
