Expert Systems with Applications 38 (2011) 12066–12084


RFCMAC: A novel reduced localized neuro-fuzzy system approach to knowledge extraction

Richard J. Oentaryo (a), Michel Pasquier (b), Chai Quek (c,*)

(a) Intelligent Robotics Lab, S2.1-B4-01, School of Electrical and Electronic Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798, Singapore
(b) College of Engineering, EB1-202, American University of Sharjah, Dubai, United Arab Emirates
(c) Centre for Computational Intelligence, N4-B1a-02, School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798, Singapore


Keywords: Fuzzy Cerebellar Model Articulation Controller (FCMAC); Localized neuro-fuzzy system; Yager inference; Knowledge extraction; Structural reduction

Abstract: Neuro-fuzzy systems (NFSs), and especially localized NFSs, are powerful rule-based methods for knowledge extraction, capable of automatically inducing salient knowledge structures from data. Contemporary localized NFSs, however, often demand large numbers of features and rules to accurately describe the overall domain data, thus degrading their interpretability and generalization traits. In light of these issues, a new localized NFS termed the Reduced Fuzzy Cerebellar Model Articulation Controller (RFCMAC) is proposed that models the two-stage neural development of cortical memories in the brain to construct and then reduce its memory structure. This idea is realized in both the label generation and rule generation phases of the RFCMAC learning process to derive a compact and representative rule base structure, prior to an iterative parameter tuning phase. The incorporation of reduction mechanisms provides RFCMAC with several benefits over classical localized NFSs, including the discovery of highly concise and intuitive rules, satisfactory generalization performance, and enhanced system scalability. A series of experiments on nonlinear regression, water plant monitoring, and leukemia diagnosis tasks demonstrates the efficacy of the proposed system as a novel knowledge extraction tool. © 2011 Published by Elsevier Ltd.

1. Introduction

Pattern recognition systems have achieved great successes in a wide range of real-world applications, especially in advanced decision support frameworks. In general, two main objectives should prevail in their development: to achieve satisfactory performance in either classification or regression tasks, and to extract useful and comprehensible knowledge structures from data. Most techniques developed in statistical pattern recognition, neural networks, evolutionary computation, and machine learning, however, focus mainly on the first goal. This results in black-box models that may give accurate predictions but are obscure to human operators (Duch, Setiono, & Zurada, 2004; Duda, Hart, & Stork, 2000), thus lacking the explanatory power to describe the salient knowledge structures, association patterns, or causal dependencies in the data. Such knowledge extraction ability is nevertheless desired in many applications, e.g., medical diagnosis or financial engineering, so as to provide human-verifiable explanations and increase confidence in the recommendations made by the system. Recent interest in knowledge extraction methodologies has led to the development of various computational techniques, a popular example of which involves encapsulating knowledge induced from data in the form of a set of expressive, logical rules (Duch et al., 2004; Lin & Lee, 1996; Nauck, Klawonn, & Kruse, 1997).

* Corresponding author. Tel.: +65 6790 4926; fax: +65 6792 6559. E-mail address: [email protected] (C. Quek).

Our main interest in this enterprise lies in one class of self-organizing rule-based systems known as the neuro-fuzzy system (NFS) (Lin & Lee, 1996; Nauck et al., 1997), which provides a novel approach to the extraction of fuzzy rules from data. An NFS is a powerful hybrid that combines the learning capability, parallelism, and robustness of a neural network model with the human-like, symbolic, and approximate reasoning capacities of a fuzzy rule-based system. The NFS approach accordingly offers an effective means of automatically inducing decision logic from raw numerical data in the form of highly intuitive fuzzy linguistic rules. Contemporary approaches to NFS learning and recall broadly fall into two groups: globalized and localized. In the former, learning and inference are accomplished via a global activation of the entire rule base (i.e., the underlying network). Globalized NFSs, such as (Angelov & Zhou, 2008; Chakraborty & Pal, 2004; Jang, 1993; Kasabov, 2001; Lin & Lin, 1997; Liu, Quek, & Ng, 2007; Oentaryo, Pasquier, & Quek, 2008), typically exhibit good accuracy and generalization performance, since all network parameters are utilized to compute the output for any given input. However, the acquisition of new information in these systems affects all parameters and may cause a catastrophic interference with (or forgetting of) previously gained knowledge. Moreover, the intrinsic network plasticity and the transient nature of the acquisition process often render unstable learning (e.g., oscillation or divergence) that may


be undesirable in some applications, such as controller design and signal processing. An attractive alternative to the globalized approach is the localized NFS, whose learning and inference processes involve only the activation of a small portion of the entire rule base defined within certain neighborhood or spatial constraints. One prominent family of localized NFSs is the Fuzzy Cerebellar Model Articulation Controller (FCMAC), first proposed in Jou (1992), and the related variants (Hwang & Hsu, 2002; Ng, Quek, & Jiang, 2008; Nguyen, Shi, & Quek, 2006; Nie & Linkens, 1994; Peng, Wang, & Sun, 2007; Sim, Tung, & Quek, 2006; Su, Lee, & Wang, 2006; Ting & Quek, 2009). These systems are well-known to be more resilient to interference and can provide more efficient knowledge recall and learning than the globalized type. However, as only local information is used in learning and inference, one major issue is their lack of an overall view of the domain data, leading possibly to generalization deficiency. As such, an accurate representation of the overall data usually requires large number of rules and/or features, resulting in increased memory requirements and degraded system interpretability. The work presented in this article aims at addressing the aforementioned issues in localized NFSs by exploiting current understanding of cognitive processes in the brain. In particular, neuroscience studies have established that the neural development of human long-term memory from birth, essentially that of the cortical systems, involves two overlapping stages (Estomih & Gregory, 2006; Kandel, Schwartz, & Jessel, 2000). In the first stage, the basic cortical memory structure and coarse connections are laid out as a result of neurogenesis, producing a large, excessive formation of neurons. In the second stage, these initial structure and connections are reduced and refined via various activity-dependent experiences occurring throughout one’s life span. Active neurons gradually stabilize via the update of neurotrophic factors, whereas the losing neurons eventually perish and the extraneous connections get pruned. These two processes enable humans to consolidate the knowledge learned, compressing information into a compact memory while keeping its semantic content sufficiently accurate. In this article, a localized NFS termed the Reduced Fuzzy Cerebellar Model Articulation Controller (RFCMAC) is proposed that implements the brain inspiration discussed above. Specifically, RFCMAC includes in its learning procedure a label generation and a rule generation phase, each composed of construction and reduction steps that model the two-stage cortical development process. These are then followed by a parameter tuning phase for further refinements of the crafted rule base structure. Compared to conventional localized NFSs, a distinct feature of RFCMAC lies precisely in the provision of effective structural reduction mechanisms to induce a concise and interpretable rule base, while at the same time improving the system’s accuracy/generalization. This approach enables the system to deal with large high-dimensional data and to enhance robustness by pruning spurious or redundant fuzzy rules and labels that typically stem from noise or outliers. It is also worth noting that, to the best of our knowledge, RFCMAC constitutes a new and the first FCMAC that employs rule base reduction mechanisms. Research on FCMAC has yielded a variety of models that provide some inspirations in developing RFCMAC. 
In Nie and Linkens (1994), for instance, a self-organizing FCMAC model capable of adapting its structure and parameters in real time was developed. Hwang and Hsu (2002) proposed an FCMAC with reinforcement learning ability based on stochastic actor-critic model to compute the optimal actions. Meanwhile, to improve the interpretability and address the rigid structural limitations in classical FCMAC, Ng et al. (2008) devised an FCMAC realizing Compositional Rule of Inference (FCMAC-CRI), and a similar model realizing Yager Inference (FCMAC-Yager) was developed in Sim et al. (2006). Further improvements were made in Nguyen et al. (2006) by


integrating Bayesian Ying-Yang learning into FCMAC in order to derive good input fuzzy sets that can more accurately capture the input data distribution. To improve the learning speed, Su et al. (2006) proposed a non-uniform credit assignment scheme for updating the FCMAC weights. More recently, Peng et al. (2007) built a recurrent FCMAC for dynamic tasks that has feedback connections capturing the dynamics of a system via time delays. Regardless, in contrast to RFCMAC, all these models still lack a comprehensive account for reduction mechanisms, and their abilities to extract concise, intuitive rules and handle large, highdimensional data have yet to be demonstrated. Meanwhile, comparisons can be made with several contemporary NFSs incorporating reduction mechanisms. Chakraborty and Pal (2004) developed a feature and rule reduction technique based on certainty factor. This approach, however, requires multiple rule base tuning phases that yield rather slow learning, whereas RFCMAC uses only a single phase. In Angelov and Zhou (2008), an evolving fuzzy system is proposed that can dynamically discard ambiguous or old rules, but its reduction method does not yet account for removal of inconsequential input features and labels as in RFCMAC. Ang and Quek (2005) proposed the Rough Set-Based Pseudo Outer Product (RSPOP) that employs a consistency measure derived from rough set theory to reduce input features and subsequently discard redundant remaining rules that have inconsequential input labels. Unlike RFCMAC, however, RSPOP does not warrant optimal input–output mapping, for there is no tuning phase after the reduction step. Recently, Liu et al. (2007) devised the Hebb Rule Reduction (HRR) that addresses this issue by tuning the rule base parameters after feature and label reductions are performed. Yet its reduction is not extended to omit redundant remaining rules, unlike RSPOP and RFCMAC. Moreover, these models still do not scale very well to large datasets and, being globalized NFS type, remain subject to catastrophic interference. The remainder of this article is organized as follows. Section 2 describes the architecture of the RFCMAC system. Section 3 details its learning procedure, accompanied by a pedagogical illustration in Section 4. Subsequently, experimental results and discussion are presented in Section 5. Section 6 finally concludes this article. 2. System architecture 2.1. Connectionist structure The RFCMAC system is a fuzzy associative memory that is built upon the original CMAC (Albus, 1975) which models the physiological and localized learning traits of the cortical circuits in the brain, and expanded to model fuzzy linguistic rules to provide system interpretability similar to Sim et al. (2006). It realizes the Mamdani model of fuzzy rule (Mamdani & Assilian, 1999), where the input and output feature spaces are partitioned into a number of antecedent and consequent fuzzy sets/labels, respectively, in a nonuniform manner. The model is defined in (1)

IF $x_1$ is $A_{1,j}$ and ... $x_i$ is $A_{i,j}$ and ... $x_I$ is $A_{I,j}$ THEN $y_1$ is $C_{l,1}$ and ... $y_m$ is $C_{l,m}$ and ... $y_M$ is $C_{l,M}$ \qquad (1)

where xi is the ith input feature of interest (e.g., velocity and distance to obstacle in the case of a car control system (Pasquier, 2001; Pasquier & Oentaryo, 2008)), ym is the mth output feature (e.g., brake control), and I and M are the total number of inputs and outputs, respectively, Ai,j is the jth antecedent label of the ith input (e.g., Slow, Fast and Near, Far for the velocity and distance inputs, respectively), and Cl,m is the lth consequent label of the mth output (e.g., Weak, Strong for the brake control output). An example of RFCMAC architecture with two inputs x1, x2 and an output y is shown in Fig. 1(a).
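To make the rule form in (1) concrete, a minimal sketch of how such Gaussian labels and Mamdani-style rules could be encoded is given below; the class names (GaussianLabel, FuzzyRule) and the car-control values are illustrative assumptions, not part of the original system.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class GaussianLabel:
    """A fuzzy label with a Gaussian membership function (centroid m, width sigma)."""
    name: str      # e.g., "Slow", "Fast", "Weak"
    m: float       # centroid
    sigma: float   # width

@dataclass
class FuzzyRule:
    """One Mamdani-style rule as in (1): one antecedent label per linked input, one consequent per output."""
    antecedents: Dict[str, GaussianLabel]   # input feature name -> antecedent label
    consequents: Dict[str, GaussianLabel]   # output feature name -> consequent label

# Example: "IF velocity is Fast AND distance is Near THEN brake is Strong"
rule = FuzzyRule(
    antecedents={"velocity": GaussianLabel("Fast", 80.0, 15.0),
                 "distance": GaussianLabel("Near", 5.0, 2.0)},
    consequents={"brake": GaussianLabel("Strong", 0.9, 0.1)},
)
print(rule.antecedents["velocity"].name)  # Fast
```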


Fig. 1. An overview of the proposed RFCMAC system.

In the RFCMAC architecture, an arbitrary virtual rule cell is denoted as $Z_{[j_1,\ldots,j_i,\ldots,j_I]}$, where $[j_1, \ldots, j_i, \ldots, j_I]$ refers to the address of the rule cell. The total number of possible rules in RFCMAC is thus $K = \prod_{i=1}^{I} J_i$, where $J_i$ is the number of antecedent labels for the ith input. This shows that the number of rules depends heavily on the dimensionality of the inputs and the number of labels in each input. In practice, however, only a small portion of the rule space is actually used, and allocating memory for all $K$ rules may be unreasonable, especially for large datasets. A method is thus needed to map this virtually large rule space into a physically small storage, analogous to the mapping between virtual and physical memory spaces in computer systems. While it is possible to use a hash map, for instance, this approach imposes the risk of rule collisions, where distant rules are mapped into the same location, possibly yielding undesirable generalization and degraded performance (Wang, Schiano, & Ginsberg, 1996). To alleviate this issue, RFCMAC instead employs an associative container data structure based on a simple bidirectional linked list for storing its physical rules. This data structure ensures that only the physical rules relevant to the current domain data are stored and that each rule is unique, thus eliminating collisions. The only tradeoff (compared to hashing) is slower neighborhood invocation, due to the need to trace from the first list element to find the rules corresponding to the activated antecedent labels. However, this time excess is generally negligible, since the number of physical rules usually remains small in practice. This idea is illustrated in Fig. 1(a), where virtual rules are mapped to physical rules using an associative container. Each physical rule $R_k$ in the container has access, via the bidirectional links, to its adjacent rules, i.e., the trained physical rules whose corresponding virtual rule cells are located closest to that of $R_k$. Such access allows a fast traversal from one physical rule to its neighboring rules, whose antecedent labels are invoked during the inference process. Fig. 1(a) also shows that several virtual rules can map to the same physical rule (e.g., Z[11] and Z[13] map to R1). This does not

imply a collision but indicates that the antecedent links from some inputs are omitted (e.g., R1 reads: ``IF $x_1$ is $A_{1,1}$ THEN $y$ is $C_{1,1}$'', ignoring input $x_2$). This omission results from the rule reduction phase of the RFCMAC learning process and is detailed in Section 3.2.

2.2. Inference scheme

The RFCMAC architecture is generic and can accommodate various fuzzy inference schemes. For illustration and experimentation purposes, the so-called Yager inference scheme (Keller, Yager, & Tahani, 1992) is adopted in the current work, yielding the RFCMAC–Yager system. In this scheme, the system output is computed based on the degree of dissimilarity between the input and the rule antecedent. Its key feature is that, when the input perfectly matches the rule antecedents, the final output exactly matches the rule consequents. Such a deduction process is conceptually sound, for it maps closely to the material implication of classical Boolean logic while corresponding to humans' intuitive reasoning (Oentaryo et al., 2008; Sim et al., 2006). In this work, a Gaussian membership function (MF) is used to describe both the antecedent and consequent labels in the RFCMAC–Yager system. The computation of the final crisp output $y_m$ at the mth dimension based on the Yager inference is defined in (2)

$$y_m = \frac{\displaystyle\sum_{k \in S} \frac{m_{(l,m)_k}\, f_k}{\sigma_{(l,m)_k}\,(2 - f_k)}}{\displaystyle\sum_{k \in S} \frac{f_k}{\sigma_{(l,m)_k}\,(2 - f_k)}} \qquad (2)$$

where $S$ is the set of (physical) rule indices being selected, and $m_{(l,m)_k}$ and $\sigma_{(l,m)_k}$ respectively denote the centroid and width of the MF of the consequent label $C_{(l,m)_k}$ linked to the selected rule $R_k$ with firing strength $f_k$, as per (3)

$$f_k = \prod_{i=1}^{I}\left(1 - d_{(i,j)_k}\right) \qquad (3)$$


The term $d_{(i,j)_k}$ in (3) computes the degree of dissimilarity between the input $x_i$ and the antecedent label $A_{(i,j)_k}$ (whose MF is defined by the centroid $m_{(i,j)_k}$ and width $\sigma_{(i,j)_k}$) connected to rule $R_k$, as per (4)

$$d_{(i,j)_k} = 1 - \exp\!\left(-\frac{\left(x_i - m_{(i,j)_k}\right)^2}{\sigma_{(i,j)_k}^2}\right) \qquad (4)$$

Meanwhile, the (localized) selection of rules in RFCMAC–Yager is determined based upon the neighborhood set Ni of the selected, relevant antecedent labels, as per (5)

$$N_i = \begin{cases} \{1\} & \text{if } m_{i,1} > x_i \\ \{J_i\} & \text{if } m_{i,J_i} < x_i \\ \{j, j+1\} & \text{if } \exists j : m_{i,j} \le x_i \le m_{i,j+1} \end{cases} \qquad (5)$$
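To illustrate how (2)–(5) interact during recall, the following sketch computes the dissimilarities (4), firing strengths (3), neighborhood sets (5) and the crisp output (2) for a toy single-output system. The list-based data layout and function names are assumptions made for illustration only; they do not reflect the actual RFCMAC implementation.

```python
import math

def dissimilarity(x, m, sigma):
    """Eq. (4): dissimilarity between input x and a Gaussian label (m, sigma)."""
    return 1.0 - math.exp(-((x - m) ** 2) / (sigma ** 2))

def neighborhood(x, centroids):
    """Eq. (5): indices of the relevant antecedent labels (centroids sorted ascending)."""
    if x < centroids[0]:
        return {0}
    if x > centroids[-1]:
        return {len(centroids) - 1}
    for j in range(len(centroids) - 1):
        if centroids[j] <= x <= centroids[j + 1]:
            return {j, j + 1}
    return set()

def yager_output(x, antecedent_labels, rules):
    """Eqs. (2)-(3) for a single output dimension.

    antecedent_labels[i] = list of (m, sigma) per input i, centroids sorted ascending.
    rules = list of dicts {"ante": {input index: label index}, "cons": (m, sigma)}.
    """
    hoods = [neighborhood(xi, [m for m, _ in labs]) for xi, labs in zip(x, antecedent_labels)]
    num, den = 0.0, 0.0
    for rule in rules:
        # A rule is selected only if all its antecedent labels lie in the activated neighborhoods.
        if not all(j in hoods[i] for i, j in rule["ante"].items()):
            continue
        f = 1.0
        for i, j in rule["ante"].items():          # Eq. (3): product of (1 - dissimilarity)
            m, s = antecedent_labels[i][j]
            f *= 1.0 - dissimilarity(x[i], m, s)
        cm, cs = rule["cons"]
        num += cm * f / (cs * (2.0 - f))           # Eq. (2), numerator term
        den += f / (cs * (2.0 - f))                # Eq. (2), denominator term
    return num / den if den > 0 else 0.0

# Toy example: two inputs, one output, two rules.
labels = [[(0.0, 1.0), (1.0, 1.0)], [(0.0, 1.0), (1.0, 1.0)]]
rules = [{"ante": {0: 0, 1: 0}, "cons": (0.0, 0.5)},
         {"ante": {0: 1, 1: 1}, "cons": (1.0, 0.5)}]
print(round(yager_output([0.9, 0.8], labels, rules), 3))
```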

Here each instance $[j_1, \ldots, j_i, \ldots, j_I]$ of the index combination set $\{N_1, \ldots, N_i, \ldots, N_I\}$ uniquely addresses a virtual rule $Z_{[j_1,\ldots,j_i,\ldots,j_I]}$, which then maps to a physical rule $R_k$.

3. Learning procedure

The learning procedure of the RFCMAC system consists of three phases: label generation, rule generation, and parameter tuning, as elaborated in Sections 3.1, 3.2 and 3.3, respectively. Fig. 1(b) provides an overview of the procedure along with its correspondence to the connectionist structure, whereby each of the first two phases further comprises construction and reduction steps modeling the two-stage neural development explained in Section 1. A complexity analysis of the learning procedure is also presented in Section 3.4.

3.1. Label generation

The label generation phase aims at identifying the input features that are most relevant to the final outputs (i.e., the classification/approximation made by the system), and subsequently partitioning the data space into a set of input/output fuzzy labels. This phase comprises the following three steps:

3.1.1. Stage 1A: Feature construction

This step serves to select a (sub)set of input features that are highly correlated to the output but uncorrelated to one another. Starting from an empty subset, features are added one at a time (i.e., forward feature selection) until $D$ features are selected, where $D$ is empirically set as $1 \le D \le I$, and the feature subset selected in each step is the one that maximizes the correlation function (Hall, 1999) in (6)

$$C_F = \frac{n\, R_{fy}}{\sqrt{n + n(n-1)\, R_{ff}}} \qquad (6)$$

where $C_F$ is the merit of a feature subset $F$ containing $n \le D$ features, $R_{fy}$ is the average correlation between features $f \in F$ and output $y$, and $R_{ff}$ is the average correlation among different features $f \in F$. The numerator of (6) indicates how relevant a feature subset is, while the denominator denotes how much redundancy exists among the features. Consequently, by adding each time the feature that maximizes $C_F$, a set of unique, informative features with low redundancy can be obtained. The chosen features are used from step 2 onwards, and their orders are kept for the later rule generation phase. In (6), the degree of correlation between any two features $A$ and $B$ is computed using a symmetrical uncertainty measure (Press, Flannery, Teukolsky, & Vetterling, 1986), which is chosen due to its simplicity and low bias for multi-valued features. The measure is expressed in (7)

$$R_{AB} = \frac{2\, I(A, B)}{H(A) + H(B)} \in [0, 1] \qquad (7)$$

where $I(A, B) = H(A) + H(B) - H(A, B)$ is the information gain (also called mutual information) between features $A$ and $B$, $H(A)$ is the entropy of $A$, and $H(A, B)$ is the joint entropy of $A$ and $B$. To obtain $H(A)$ and $I(A, B)$, the probability distributions of $A$ and $B$ are typically computed a priori via discretization (Fayyad & Irani, 1993) or a kernel estimator (Parzen, 1962). However, this estimation is rather complex and expensive, with a time complexity up to $O(N^3)$, where $N$ is the number of data instances. For simplicity, a normal distribution is assumed instead to avoid the expensive distribution estimation. The notion of differential entropy (Lazo & Rathie, 1978) in (8) is used, and the corresponding information gain is given in (9)

$$H(A) = \frac{1}{2}\left(1 + \log 2\pi\sigma_A^2\right) \qquad (8)$$

$$I(A, B) = -\frac{1}{2}\log\left(1 - \rho_{AB}^2\right) \qquad (9)$$

where $\sigma_A^2$ is the variance of feature $A$ and $\rho_{AB}$ the Pearson's correlation between $A$ and $B$ (Goldman & Weinberg, 1985). In this manner, the complexity of computing $H(A)$ and $I(A, B)$ reduces to $O(N)$.
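A possible rendering of this Stage 1A procedure, under the normality assumption behind (8) and (9), is sketched below: the symmetrical uncertainty (7) is obtained from Pearson correlations and variances, and features are added greedily so as to maximize the merit (6). The NumPy-based layout and function names are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def symmetrical_uncertainty(a, b):
    """Eqs. (7)-(9) under the normality assumption: R_AB = 2*I(A,B) / (H(A)+H(B))."""
    rho = np.corrcoef(a, b)[0, 1]
    rho2 = min(rho ** 2, 1.0 - 1e-12)                      # guard against log(0)
    info_gain = -0.5 * np.log(1.0 - rho2)                  # Eq. (9)
    h_a = 0.5 * (1.0 + np.log(2.0 * np.pi * np.var(a)))    # Eq. (8)
    h_b = 0.5 * (1.0 + np.log(2.0 * np.pi * np.var(b)))
    return 2.0 * info_gain / (h_a + h_b)

def forward_feature_selection(X, y, D):
    """Greedy forward selection maximizing the merit C_F of Eq. (6)."""
    n_features = X.shape[1]
    selected = []
    while len(selected) < D:
        best_feat, best_merit = None, -np.inf
        for i in range(n_features):
            if i in selected:
                continue
            subset = selected + [i]
            n = len(subset)
            r_fy = np.mean([symmetrical_uncertainty(X[:, j], y) for j in subset])
            pairs = [(a, b) for a in subset for b in subset if a < b]
            r_ff = np.mean([symmetrical_uncertainty(X[:, a], X[:, b]) for a, b in pairs]) if pairs else 0.0
            merit = n * r_fy / np.sqrt(n + n * (n - 1) * r_ff)   # Eq. (6)
            if merit > best_merit:
                best_feat, best_merit = i, merit
        selected.append(best_feat)
    return selected

# Toy check: x0 and x1 drive y, x2 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200)
print(forward_feature_selection(X, y, D=2))   # expected: features 0 and 1 (in some order)
```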

3.1.2. Stage 1B: Label construction

The initial antecedent and consequent labels are crafted in this step. For each pth training data point ($p \in \{1, \ldots, N\}$), if the overall similarity $S^{(p)}$ between the inputs and the selected antecedent labels, computed in (10) based on the dissimilarity degree $d^{(p)}_{i,j}$ as defined in (4), is lower than a splitting threshold $\delta$, then a new label is created at every input and output dimension.

$$S^{(p)} = \frac{1}{D}\sum_{i=1}^{D}\left(1 - \min_{j \in \{1,\ldots,J_i\}}\left\{d^{(p)}_{i,j}\right\}\right) \qquad (10)$$

The centroid of the newly generated label is initialized to the current data point, and the width is set in proportion to the scale of the respective feature. In essence, $\delta$ controls the coverage of data by the labels and affects the number of initial labels (and initial rules). Its value is chosen empirically as $0 < \delta \le 1$; a larger $\delta$ yields more labels, and vice versa. When $S^{(p)} \ge \delta$, Hebbian learning (Hebb, 1949) is performed based on the current data point to update the winning antecedent and consequent labels $A_{i,j^*}$ and $C_{l^*,m}$, which have the lowest dissimilarities $d^{(p)}_{i,j^*}$ and $d^{(p)}_{l^*,m}$ with respect to input $x_i$ and target output $t_m$, respectively. The weights $w^{(p)}_{i,j^*}$ of $A_{i,j^*}$ and $w^{(p)}_{l^*,m}$ of $C_{l^*,m}$ are updated using (11) and (12)

$$\Delta w^{(p)}_{i,j^*} = \frac{1}{DM}\left(\sum_{i'=1}^{D}\left(1 - d^{(p)}_{i',j^*}\right)\right)\left(\sum_{m'=1}^{M}\left(1 - d^{(p)}_{l^*,m'}\right)\right) \qquad (11)$$

$$\Delta w^{(p)}_{l^*,m} = \frac{1}{DM}\left(\sum_{i'=1}^{D}\left(1 - d^{(p)}_{i',j^*}\right)\right)\left(\sum_{m'=1}^{M}\left(1 - d^{(p)}_{l^*,m'}\right)\right) \qquad (12)$$

where $\Delta w^{(p)}_{i,j^*} = w^{(p+1)}_{i,j^*} - w^{(p)}_{i,j^*}$ and $\Delta w^{(p)}_{l^*,m} = w^{(p+1)}_{l^*,m} - w^{(p)}_{l^*,m}$. For every new label created, its weight is initially set to one. The accumulated weights $W_{i,j} = w^{(N)}_{i,j}$ and $W_{l,m} = w^{(N)}_{l,m}$ are later used to rank the antecedent and consequent labels in each dimension, respectively.

3.1.3. Stage 2: Label reduction

Antecedent (or consequent) labels in each dimension are successively evaluated based on their accumulated weights, starting from the highest rank. In each evaluation, among the labels of every input, the label $A_{i,j'}$ is chosen whose centroid is the nearest to that of the label $A_{i,j}$ being evaluated ($j' \ne j$). If their overlapping degree exceeds a merging threshold $\alpha$ (empirically set as $0 < \alpha \le 1$),


the two labels are combined into a new label $A_{i,j''}$ and $A_{i,j'}$ is deleted, thus improving interpretability; otherwise, $A_{i,j'}$ remains. In sum, $\alpha$ governs the tradeoff between accuracy and interpretability: a higher $\alpha$ can improve accuracy but may reduce interpretability, and vice versa. The centroid and width of $A_{i,j}$ and $A_{i,j'}$, i.e., $(m_{i,j}, \sigma_{i,j})$ and $(m_{i,j'}, \sigma_{i,j'})$, are then merged into $(m_{i,j''}, \sigma_{i,j''})$ using (13) and (14), and the new label weight is given in (15)

$$m_{i,j''} = \frac{m_{i,j}\, W_{i,j} + m_{i,j'}\, W_{i,j'}}{W_{i,j} + W_{i,j'}} \qquad (13)$$

$$\sigma_{i,j''} = \frac{\max\{m_{i,j} + \sigma_{i,j},\; m_{i,j'} + \sigma_{i,j'}\} - \min\{m_{i,j} - \sigma_{i,j},\; m_{i,j'} - \sigma_{i,j'}\}}{2} \qquad (14)$$

$$W_{i,j''} = W_{i,j} \qquad (15)$$

In this process, the overlapping degree between $A_{i,j}$ and $A_{i,j'}$ is given in (16)

$$\varphi(A_{i,j}, A_{i,j'}) = \max\left(\frac{|A_{i,j} \cap A_{i,j'}|}{|A_{i,j}|},\; \frac{|A_{i,j} \cap A_{i,j'}|}{|A_{i,j'}|}\right) \qquad (16)$$

where $|A_{i,j}|$, $|A_{i,j'}|$ and $|A_{i,j} \cap A_{i,j'}|$ are computed from the centroids and widths of the labels via (17) and (18), respectively, assuming $m_{i,j} \ge m_{i,j'}$, as in Lin and Lee (1994)

$$|A_{i,j}| = \sqrt{\pi}\,\sigma_{i,j} \qquad (17)$$

$$|A_{i,j} \cap A_{i,j'}| = \frac{\left[h\!\left(m_{i,j'} - m_{i,j} + \sqrt{\pi}\,(\sigma_{i,j} + \sigma_{i,j'})\right)\right]^2}{2\sqrt{\pi}\,(\sigma_{i,j} + \sigma_{i,j'})} - \frac{\left[h\!\left(m_{i,j'} - m_{i,j} + \sqrt{\pi}\,(\sigma_{i,j} - \sigma_{i,j'})\right)\right]^2}{2\sqrt{\pi}\,(\sigma_{i,j} - \sigma_{i,j'})} + \frac{\left[h\!\left(m_{i,j'} - m_{i,j} - \sqrt{\pi}\,(\sigma_{i,j} - \sigma_{i,j'})\right)\right]^2}{2\sqrt{\pi}\,(\sigma_{i,j} - \sigma_{i,j'})} \qquad (18)$$

where $h(x) = \max\{0, x\}$. The merging of consequent labels at the output section proceeds in the same way as in (13)–(16), except that $A_{i,j}$ is replaced with $C_{l,m}$. In the special case where only one label remains in an input feature, that feature is deleted and $D$ is decreased by 1.
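The overlap test of (16)–(18) and the merge of (13)–(15) can be sketched as follows, with each label represented as an illustrative (centroid, width, weight) tuple; the equal-width guard handles the degenerate case of (18) and is an assumption of this sketch.

```python
import math

def h(x):                       # h(x) = max{0, x}
    return max(0.0, x)

def area(sigma):                # Eq. (17): |A| = sqrt(pi) * sigma
    return math.sqrt(math.pi) * sigma

def intersection(m1, s1, m2, s2):
    """Eq. (18): overlap cardinality of two Gaussian labels (assumes m1 >= m2)."""
    sp = math.sqrt(math.pi)
    t1 = h(m2 - m1 + sp * (s1 + s2)) ** 2 / (2.0 * sp * (s1 + s2))
    if abs(s1 - s2) < 1e-12:    # equal widths: the second and third terms vanish in the limit
        return t1
    t2 = h(m2 - m1 + sp * (s1 - s2)) ** 2 / (2.0 * sp * (s1 - s2))
    t3 = h(m2 - m1 - sp * (s1 - s2)) ** 2 / (2.0 * sp * (s1 - s2))
    return t1 - t2 + t3

def overlap_degree(m1, s1, m2, s2):
    """Eq. (16): the larger of the two relative overlaps."""
    if m1 < m2:                 # enforce m1 >= m2 as assumed in (17)-(18)
        m1, s1, m2, s2 = m2, s2, m1, s1
    inter = intersection(m1, s1, m2, s2)
    return max(inter / area(s1), inter / area(s2))

def merge(label_a, label_b, alpha=0.5):
    """Eqs. (13)-(15): merge two labels (m, sigma, W) if their overlap exceeds alpha."""
    (m1, s1, w1), (m2, s2, w2) = label_a, label_b
    if overlap_degree(m1, s1, m2, s2) <= alpha:
        return None
    m_new = (m1 * w1 + m2 * w2) / (w1 + w2)                          # Eq. (13)
    s_new = (max(m1 + s1, m2 + s2) - min(m1 - s1, m2 - s2)) / 2.0    # Eq. (14)
    return (m_new, s_new, w1)                                        # Eq. (15): keep the evaluated label's weight

print(merge((0.50, 0.20, 2.0), (0.55, 0.20, 1.0)))   # highly overlapping -> merged label
print(merge((0.10, 0.05, 1.0), (0.90, 0.05, 1.0)))   # disjoint -> None
```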

3.2. Rule generation

In this phase, a set of rules linking the identified antecedent and consequent labels is constructed. The rules are then reduced/compressed based on a consistency measure to produce a concise yet accurate rule base. The steps are detailed hereafter.

3.2.1. Stage 1: Rule construction

The identified antecedent and consequent labels are first sorted by their centroids in ascending order, so as to support the neighborhood selection process in (5). Then, for each pth data point, a physical rule $R_k$, if not already present in the associative container, is created at the address pointed to by the set of currently selected winning antecedent labels (from all inputs) giving the minimum degrees of dissimilarity. This new rule is then linked to all consequent labels in all outputs, with all link weights initially set to zero. As only one rule can be created for each data point, the number of rules $K$ is constrained as $K \le N$. Hebbian learning (Hebb, 1949) is then carried out for each pth data point to update the weight $w^{(p)}_{k,l,m}$ linking the most influential rule $R_k$ (addressed by the current winning antecedent labels) and the winning consequent label $C_{l,m}$ in each mth output. This is defined in (19)

$$\Delta w^{(p)}_{k,l,m} = f^{(p)}_k \prod_{m'=1}^{M}\left(1 - d^{(p)}_{l',m'}\right), \quad \forall\, l,\, l' \in \{1, \ldots, L_m\} \qquad (19)$$

where $\Delta w^{(p)}_{k,l,m} = w^{(p+1)}_{k,l,m} - w^{(p)}_{k,l,m}$, $f^{(p)}_k$ is the firing strength of rule $R_k$, $d^{(p)}_{l',m'}$ is the dissimilarity degree between the consequent label $C_{l',m'}$ and the target output $t^{(p)}_{m'}$, and $L_m$ is the number of consequent labels in the mth output. The final consequent label $C_{(l,m)_k}$, to which $R_k$ should be linked, is subsequently determined using (20):

$$C_{(l,m)_k} = C_{l^*,m} \;\Big|\; W_{k,l^*,m} = \max_{l \in \{1,\ldots,L_m\}}\left\{W_{k,l,m}\right\} \qquad (20)$$

where $W_{k,l,m} = w^{(N)}_{k,l,m}$ is the accumulated link weight over the $N$ data points. Afterwards, the rule links to all non-winning consequent labels (i.e., $\forall\, l \ne l^*$) are discarded.

3.2.2. Stage 2: Rule reduction

More input features are eliminated in this step using a rule consistency criterion adopted from rough set theory (Pawlak, 1982). Specifically, for each ith input feature, consistent rules are identified which, when their antecedent links to the ith input are omitted, will have the same remaining antecedent and consequent labels. That is, a rule $R_k$ is consistent with another rule $R_{k'}$ when it satisfies (21)

$$\left(\mathbf{A}_k \setminus \{A_{(i,j)_k}\} = \mathbf{A}_{k'} \setminus \{A_{(i,j)_{k'}}\}\right) \;\wedge\; \left(\mathbf{C}_k = \mathbf{C}_{k'}\right) \qquad (21)$$

where $\mathbf{A}_k = \{A_{(1,j)_k}, \ldots, A_{(i,j)_k}, \ldots, A_{(I,j)_k}\}$ and $\mathbf{C}_k = \{C_{(l,1)_k}, \ldots, C_{(l,m)_k}, \ldots, C_{(l,M)_k}\}$ are the sets of all antecedent and consequent labels of $R_k$. The ith input feature is thus discarded when all rules $R_k$ in memory meet criterion (21). Consequently, duplicate rules with the same residual set of antecedent labels $\mathbf{A}_k \setminus \{A_{(i,j)_k}\}$ are deleted. Otherwise, a partial feature reduction is performed for each rule $R_k$ as long as the removal of the ith input of $R_k$ does not yield an ambiguous rule $R_{k'}$, i.e., one for which $\mathbf{A}_k \setminus \{A_{(i,j)_k}\} = \mathbf{A}_{k'} \setminus \{A_{(i,j)_{k'}}\}$ but $\mathbf{C}_k \ne \mathbf{C}_{k'}$. In turn, any consistent/duplicate rules matching $R_k$ as per (21) are discarded, and the inconsequential inputs of $R_k$ are deleted. These steps ensure that each physical rule is stored only once, for memory efficiency, while enhancing the generalization ability of the rules. It is important to note that a different order of feature reduction may result in different final rule sets. A reasonable heuristic in this case is to perform the reduction operations starting from the least significant input feature, i.e., the one selected last in the feature construction step.

The rough set-based rule reduction in RFCMAC bears some similarities to that in RSPOP (Ang & Quek, 2005), but they differ in several important ways. RSPOP uses two separate phases of feature and rule reduction, whereas RFCMAC combines the two in a single phase. Moreover, the RFCMAC reduction is only concerned with the consistency criterion in (21), whereas each RSPOP reduction phase additionally checks whether a certain objective measure deteriorates before reduction can take place. The latter, however, requires multiple passes over the data, which are rather slow and may degrade the conciseness of the final rule base. In contrast, RFCMAC tries to reduce as many rules as possible, and only afterwards are the rule base parameters tuned. As shown by our experiments, this method is more effective for deriving a compact rule base, while still giving good output accuracy/generalization. The RFCMAC parameter tuning procedure is described in Section 3.3.

3.3. Parameter tuning

The parameter tuning phase involves an iterative Localized Least Mean Square (LLMS) procedure, an iterative version of the LMS algorithm (Widrow & Stearns, 1985) adapted to the RFCMAC architecture for locally updating the kernel parameters of the antecedent and consequent labels selected in the current neighborhood. The procedure attempts to minimize the error criterion defined in (22)


$$E^{(p)} = \frac{1}{2}\sum_{m=1}^{M}\left(t^{(p)}_m - y^{(p)}_m\right)^2 \qquad (22)$$

where $t^{(p)}_m$ and $y^{(p)}_m$ are the mth target and system outputs for the pth data point, respectively. The LLMS-based tuning in RFCMAC involves successive corrections, one step at a time, to the kernel parameters of the selected antecedent label $A_{i,j}$ and consequent label $C_{l,m}$ in the negative direction of their gradients, as defined in (23) and (24)

$$\theta^{(p)}_{i,j} = \theta^{(p-1)}_{i,j} - \eta\,\frac{\partial E^{(p)}}{\partial \theta^{(p)}_{i,j}} \qquad (23)$$

$$\theta^{(p)}_{l,m} = \theta^{(p-1)}_{l,m} - \eta\,\frac{\partial E^{(p)}}{\partial \theta^{(p)}_{l,m}} \qquad (24)$$

where $\eta$ is the learning rate and $\theta^{(p)}_{i,j}$ ($\theta^{(p)}_{l,m}$) is either the centroid $m_{i,j}$ ($m_{l,m}$) or the width $\sigma_{i,j}$ ($\sigma_{l,m}$) of $A_{i,j}$ ($C_{l,m}$). A high $\eta$ value produces fast learning at the expense of stability, whereas a low value yields stable but slow learning (Duda et al., 2000; Lin & Lee, 1996). To ensure a good compromise between learning speed and stability, and to avoid introducing a further meta (free) parameter in the system, $\eta$ is heuristically set to $1/N$ in this work. Resolving (23) based on the definitions given in (2)–(4), the updating formulae for the antecedent parameters $m_{i,j}$ and $\sigma_{i,j}$ are computed as per (25) and (26), respectively

$$\Delta m^{(p)}_{i,j} = \frac{\eta}{M|S|}\,\frac{x_i - m^{(p)}_{i,j}}{\left(\sigma^{(p)}_{i,j}\right)^2}\sum_{m=1}^{M}\left(\frac{t^{(p)}_m - y^{(p)}_m}{D^{(p)}_m}\right)\sum_{k \in S}\left(\frac{m^{(p)}_{(l,m)_k} - y^{(p)}_m}{\sigma^{(p)}_{(l,m)_k}}\right)\frac{f^{(p)}_k}{\left(2 - f^{(p)}_k\right)^2} \qquad (25)$$

$$\Delta \sigma^{(p)}_{i,j} = \frac{\eta}{M|S|}\,\frac{\left(x_i - m^{(p)}_{i,j}\right)^2}{\left(\sigma^{(p)}_{i,j}\right)^3}\sum_{m=1}^{M}\left(\frac{t^{(p)}_m - y^{(p)}_m}{D^{(p)}_m}\right)\sum_{k \in S}\left(\frac{m^{(p)}_{(l,m)_k} - y^{(p)}_m}{\sigma^{(p)}_{(l,m)_k}}\right)\frac{f^{(p)}_k}{\left(2 - f^{(p)}_k\right)^2} \qquad (26)$$

where $\Delta m^{(p)}_{i,j} = m^{(p+1)}_{i,j} - m^{(p)}_{i,j}$ and $\Delta\sigma^{(p)}_{i,j} = \sigma^{(p+1)}_{i,j} - \sigma^{(p)}_{i,j}$. Similarly, resolving (24), the updating formulae for the consequent parameters $m_{l,m}$ and $\sigma_{l,m}$ are given in (27) and (28), respectively

$$\Delta m^{(p)}_{l,m} = \frac{\eta}{|S_{l,m}|}\left(\frac{t^{(p)}_m - y^{(p)}_m}{D^{(p)}_m}\right)\sum_{k \in S_{l,m}}\frac{f^{(p)}_k}{\sigma^{(p)}_{l,m}\left(2 - f^{(p)}_k\right)} \qquad (27)$$

$$\Delta \sigma^{(p)}_{l,m} = \frac{\eta}{|S_{l,m}|}\sum_{k \in S_{l,m}}\left(\frac{\left(t^{(p)}_m - y^{(p)}_m\right)\left(m^{(p)}_{l,m} - y^{(p)}_m\right)}{D^{(p)}_m}\right)\frac{f^{(p)}_k}{\left(\sigma^{(p)}_{l,m}\right)^2\left(2 - f^{(p)}_k\right)} \qquad (28)$$

where $\Delta m^{(p)}_{l,m} = m^{(p+1)}_{l,m} - m^{(p)}_{l,m}$, $\Delta\sigma^{(p)}_{l,m} = \sigma^{(p+1)}_{l,m} - \sigma^{(p)}_{l,m}$, and $S_{l,m}$ is the set of selected rule indices that are linked to $C_{l,m}$. The complete derivations of all the above updating equations can be found in Appendix A. An important difference between this updating procedure and that of contemporary FCMAC systems is that it covers not only the consequent labels but also the antecedent labels, which helps to further improve the accuracy of the input–output mapping. Also note that the shifting of the antecedent centroid $m_{i,j}$ due to (23) may alter the activated input neighborhood set computed in (5) when the same data point is presented to the system in the next training epoch. This may lead to unwanted spikes in the learning curve that hamper learning stability. To resolve this issue, a simple regularization procedure is applied to the $m^{(p+1)}_{i,j}$ obtained via (23), as per (29)

$$m^{(p+1)}_{i,j} = \begin{cases} x^{l}_i & \text{if } m^{(p+1)}_{i,j} < x^{l}_i \;\wedge\; m^{(p+1)}_{i,j} \ne x^{\min}_i \\ x^{r}_i & \text{if } m^{(p+1)}_{i,j} > x^{r}_i \;\wedge\; m^{(p+1)}_{i,j} \ne x^{\max}_i \\ m^{(p+1)}_{i,j} & \text{otherwise} \end{cases} \qquad (29)$$

where $x^{l}_i$ and $x^{r}_i$ are the nearest input points to the left and right of the centroid $m^{(p+1)}_{i,j}$, and $x^{\min}_i = \min_{p \in \{1,\ldots,N\}} x^{(p)}_i$ and $x^{\max}_i = \max_{p \in \{1,\ldots,N\}} x^{(p)}_i$ are the minimum and maximum values of the ith input, respectively. The above tuning process is repeated until the convergence criterion (30) is met

$$\left|\frac{\sum_{p=1}^{N} E^{(p)}_t - \sum_{p=1}^{N} E^{(p)}_{t-1}}{\sum_{p=1}^{N} E^{(p)}_{t-1}}\right| \le \epsilon \qquad (30)$$

where $E^{(p)}_t$ is the cost for the pth data point at the tth training epoch and $\epsilon$ is a small threshold value (e.g., $10^{-5}$), or until a maximum number of epochs $T_{\max}$ is reached (i.e., $t = T_{\max}$). The $\epsilon$ and $T_{\max}$ thus constitute complementary safeguards against excessive training epochs that may otherwise yield overfitting and degraded generalization.
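A behavioural sketch of the tuning loop of this section is given below: per-sample negative-gradient corrections in the spirit of (23)–(24), stopped by a relative-change test in the spirit of (30) or by a maximum epoch count. For brevity the gradients are obtained numerically instead of through the closed forms (25)–(28), so this is only an illustration of the LLMS idea, not the actual derivation.

```python
import math

def llms_tune(params, error_fn, data, eta, eps=1e-5, t_max=100):
    """Iterative localized LMS-style tuning (sketch).

    params: dict name -> float (centroids/widths of the selected labels)
    error_fn(params, sample): per-sample cost E^(p), as in Eq. (22)
    """
    prev_total = None
    for t in range(t_max):
        total = 0.0
        for sample in data:                      # one correction per data point, as in (23)-(24)
            for name in params:
                delta = 1e-6                     # numerical gradient in place of (25)-(28)
                base = error_fn(params, sample)
                params[name] += delta
                grad = (error_fn(params, sample) - base) / delta
                params[name] -= delta
                params[name] -= eta * grad       # negative-gradient correction
            total += error_fn(params, sample)
        if prev_total is not None and prev_total > 0:
            if abs(total - prev_total) / prev_total <= eps:   # relative-change test, cf. (30)
                break
        prev_total = total
    return params

# Toy usage: fit a single Gaussian "rule" y = exp(-(x-m)^2/s^2) to sampled data.
data = [(x / 10.0, math.exp(-((x / 10.0 - 0.6) ** 2) / 0.04)) for x in range(11)]
model = {"m": 0.3, "s": 0.3}
err = lambda p, d: 0.5 * (d[1] - math.exp(-((d[0] - p["m"]) ** 2) / (p["s"] ** 2))) ** 2
print(llms_tune(model, err, data, eta=1.0 / len(data)))
```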

3.4. Complexity analysis

A summary of the worst-case time complexity of the three learning phases, using the notation adopted in the previous sections, is given in Table 1. It can be seen that the main computational load lies in the feature construction and parameter tuning steps, especially when the number of inputs $I$ and the number of data points $N$ are large. Nevertheless, the complexity of the feature construction step remains fairly low, as it is governed by the parameter $D$, which is typically much smaller than $I$ (i.e., $D \ll I$). On the other hand, as the parameter tuning in RFCMAC is localized and only updates the labels selected in the current neighborhood, its complexity is relatively low compared to that of globalized learning methods. It is also notably faster than conventional FCMAC methods, such as (Hwang & Hsu, 2002; Ng et al., 2008; Nie & Linkens, 1994; Nguyen et al., 2006; Sim et al., 2006; Ting & Quek, 2009), since the tuning is done in a much reduced feature/rule space after the label generation and rule generation phases have been performed. Meanwhile, the space complexity of the proposed system is given in (31), which is essentially the sum of the numbers of input/output labels and (physical) rules.

$$O\left(K + \sum_{i=1}^{D} J_i + \sum_{m=1}^{M} L_m\right) \qquad (31)$$

As noted in Section 3.2, it must be that $K \le N$, although most of the time it is much smaller (i.e., $K \ll N$), implying in turn a total (worst-case) space complexity as per (32).

$$O\left(N + \sum_{i=1}^{D} J_i + \sum_{m=1}^{M} L_m\right) \qquad (32)$$

This requirement is nonetheless considerably low, which again can be attributed to the reduction processes in the label generation and rule generation phases. This benefit is appealing when compared to classical FCMAC systems.

4. Pedagogical example

A pedagogical example is given to illustrate the working of the RFCMAC learning procedure. Consider the simple data in Fig. 2, comprising five nominal input features x1–x5 and an output y. In this case, x2 and x3 are the two features relevant to y, whereas x1, x4 and x5 are irrelevant. There also exist three indispensable rules that best describe the data, and it is expected that RFCMAC will discover them.


Table 1
Complexity analysis of the RFCMAC learning procedure.

Learning phase | Time complexity | Description
Label generation:
(1) Feature construction | $O(N \cdot D \cdot I \cdot M)$ | Compute pairwise input–output and input–input feature correlations, and add selected features one at a time
(2) Label construction | $O(N \cdot (\sum_{i=1}^{D} J_i + \sum_{m=1}^{M} L_m))$ | For each data point, update the Hebbian strengths of the winning input/output labels, or create new labels
(3) Label reduction | $O(\sum_{i=1}^{D} J_i + \sum_{m=1}^{M} L_m)$ | Combine the labels in all input and output dimensions when their overlapping degrees are sufficiently high
Rule generation:
(1) Rule construction | $O(N \cdot (\sum_{i=1}^{D} J_i + \sum_{m=1}^{M} L_m))$ | For each data point, use the winning input labels to create a rule linking them to the winning output labels
(2) Rule reduction | $O(K^2 \cdot (D + M))$ | Perform pairwise rule comparisons to check whether rules are consistent, and then remove duplicate rules
Parameter tuning | $O(N \cdot K \cdot T_{\max} \cdot (D + M))$ | Update the kernel parameters of the labels belonging to the selected rules, for all data points in every epoch

Fig. 2. A simple dataset with redundant input features.

For visualization purposes, the parameter D is set to three here. Learning begins with the feature construction step selecting x2, x3 and x5, in this order, based on (6) (i.e., x2 is ranked first, and so on). The subsequent label construction step involves determining whether a new fuzzy label needs to be created for each pth data point, as illustrated in Fig. 3(a). For simplicity, the figure only shows the newly generated or existing (winning) labels, i.e., those having the lowest dissimilarities $d^{(p)}_{i,j}$ (or $d^{(p)}_{l,m}$) and Hebbian weights $w^{(p)}_{i,j}$ (or $w^{(p)}_{l,m}$), for $p \in \{1, \ldots, 6\}$. Here the splitting threshold $\delta$ is set to 0.5, and the initial width of the Gaussian MF is set to a small value so as to minimize the overlaps between neighboring MFs. For p = 1–4 and 6, a new label is created at every input/output dimension, as the overall similarity $S^{(p)}$ defined in (10) falls below $\delta$; conversely, no new label is formed for p = 5, since the 5th data point is the same as the 3rd (i.e., $S^{(5)} = 1$) when viewed from x2, x3 and x5. In the first case, duplicate (or very similar) input and output labels may emerge (e.g., A2,1 and A2,2, C1,1 and C2,1, etc.) that have identical (or very similar) MFs. This is possible mainly because the computation of $S^{(p)}$ involves all, rather than one, input dimensions. Label reduction is carried out afterwards to merge similar labels, as per (13) and (14), taking into account the weights $W_{i,j} = w^{(6)}_{i,j}$ and $W_{l,m} = w^{(6)}_{l,m}$ accumulated from p = 1 to 6. Fig. 3(b) illustrates this process, where the callouts show how the final labels $A_{i,j'}$ and $C_{l',m}$ are derived. For instance, the antecedent label A2,3 has the same MF as A2,4, so they can be merged into a new label $A_{2,2'}$. In this case, A2,3 has the larger accumulated weight W2,3 = 2.0, because the 5th data point equals the 3rd, as explained before. Hence, $A_{2,2'}$ will have a weight $W_{2,2'} = 2.0$. Meanwhile, for any other labels having identical MFs and accumulated weights, the resulting merged label will have exactly the same parameters, e.g., the merged label $A_{2,1'}$ is identical to its constituents A2,1, A2,2 and A2,5. Next, the rule construction step is performed via (19) and (20), producing five (physical) rules R1–R5, in this order, as shown in Fig. 4(a). The dark and white circles denote the rules for y = 1 and 2, respectively, while the (x, y, z) triples label the centroid

coordinates of their respective antecedent labels. Rule reduction is then carried out, starting by examining x5 (i.e., the lowest-ranked input feature). It is found that x5 can be removed entirely, since all rules meet (21). The memory representation of RFCMAC thus reduces from three- to two-dimensional, as per Fig. 4(b). Here R3–R5 expand their territories into the uncovered memory region, giving $R'_3$–$R'_5$, while R1 and R2 are merged into $R'_1$ since they are consistent. In effect, this improves the rule generalization. The next consistency test using (21) indicates that removing the links to x3 from rules $R'_4$ and $R'_5$ does not yield ambiguity; hence, they are merged into $R''_4$, as per Fig. 4(c). For the same consistency reason, it is also valid to remove the link to x2 from $R'_3$, yielding $R''_3$. This gives the three final rules $R'_1$, $R''_3$ and $R''_4$, as in Fig. 4(d). Parameter tuning is carried out afterwards; it stops after one epoch, since the error (22) is already zero, which also satisfies (30).
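The consistency test (21) driving the reductions illustrated above can be sketched as follows. The rule encoding (an antecedent dictionary plus a consequent tuple) and the toy rule set are illustrative assumptions in the spirit of the pedagogical example, not the actual data.

```python
def consistent(rule_a, rule_b, feature):
    """Eq. (21): rule_a and rule_b agree once 'feature' is dropped from their antecedents."""
    ante_a = {f: lab for f, lab in rule_a[0].items() if f != feature}
    ante_b = {f: lab for f, lab in rule_b[0].items() if f != feature}
    return ante_a == ante_b and rule_a[1] == rule_b[1]

def reduce_feature(rules, feature):
    """Drop 'feature' entirely if no pair of rules becomes ambiguous, then deduplicate."""
    for a in rules:
        for b in rules:
            residual_a = {f: l for f, l in a[0].items() if f != feature}
            residual_b = {f: l for f, l in b[0].items() if f != feature}
            if a is not b and residual_a == residual_b and not consistent(a, b, feature):
                return rules       # ambiguity: same residual antecedents, different consequents
    reduced, seen = [], set()
    for ante, cons in rules:
        residual = tuple(sorted((f, l) for f, l in ante.items() if f != feature))
        if (residual, cons) not in seen:
            seen.add((residual, cons))
            reduced.append((dict(residual), cons))
    return reduced

# A toy rule set: dropping x5 keeps all rules consistent and collapses the first two.
rules = [({"x2": "A2,1", "x3": "A3,1", "x5": "A5,1"}, ("y", "C1")),
         ({"x2": "A2,1", "x3": "A3,1", "x5": "A5,2"}, ("y", "C1")),
         ({"x2": "A2,2", "x3": "A3,1", "x5": "A5,1"}, ("y", "C2"))]
print(reduce_feature(rules, "x5"))   # two rules remain, with x5 removed
```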

5. Experimental results and discussion

5.1. Simulation setup

A number of experimental studies have been conducted to validate the RFCMAC–Yager approach, the three most illustrative of which are reported in this article: the Nakanishi regression benchmark (Nakanishi, Turksen, & Sugeno, 1993), wastewater treatment plant monitoring (Asuncion & Newman, 2007), and acute leukemia diagnosis (Golub et al., 1999). In all experiments, the merging threshold $\alpha$ and the maximum number of training epochs $T_{\max}$ are fixed at 0.5 and 15,000, respectively. The other system parameters are determined empirically for each case study, as summarized in Table 2. Detailed results and analysis of the case studies are given in Sections 5.2–5.4.

5.2. Nakanishi regression benchmark The Nakanishi datasets (Nakanishi et al., 1993) are explored here to provide an initial evaluation for the function approximation and knowledge reduction abilities of the RFCMAC system. Three datasets are available: the nonlinear system, human operation of a chemical plant, and daily price in a stock market, each split into three similar groups: A, B and C (Nakanishi et al., 1993). In this study, A and B are used as the training set, while C the testing set. The results are then compared with those of several regression methods: Multilayer Perceptron (MLP) (Rumelhart, Hinton, & Williams, 1986), Radial Basis Function (RBF) (Powell, 1987), Support Vector Regression (SVR) (Smola & Schölkopf, 1998), Interval-Valued Compositional Rule of Inference (IV-CRI) (Nakanishi et al., 1993), Adaptive-Network-based Fuzzy Inference System (ANFIS) (Jang, 1993), Evolving Fuzzy Neural Network (EuFNN) (Kasabov, 2001), RSPOP realizing CRI (RSPOP-CRI) (Ang & Quek, 2005), HRR (Liu et al., 2007), and FCMAC-Yager (Sim et al., 2006).


Fig. 3. Example of label generation in RFCMAC.

The nonlinear system in Nakanishi et al. (1993) is described using (33)

$$y = \left(1 + x_1^{-2} + x_2^{-1.5}\right)^2, \quad 1 \le x_1, x_2 \le 5 \qquad (33)$$

which involves two input features x1, x2 and an output feature y. Meanwhile, the actual data used in this study include two redundant, noisy inputs x3 and x4. Comparisons can accordingly be made among the listed systems that employ feature selection methods, i.e., IV-CRI, RSPOP-CRI, HRR, and RFCMAC–Yager. All these systems select inputs x1 and x2 as desired, except for RSPOP-CRI, which still keeps x4 (albeit partially). The membership functions identified by RFCMAC–Yager are shown in Fig. 5(a). As can be seen, two antecedent labels are generated in each of x1 and x2, while y has three consequent labels. The RFCMAC–Yager system generates a total of three

rules, listed in Table 3(a), which accurately capture the inverse relationship between the inputs and output in (33). For instance, rule R1 states ‘‘IF x1 is High AND x2 is High THEN y is Low.’’, and R3 states ‘‘IF x1 is Low THEN y is High.’’ In the latter, the omission of links to x2 stems from the partial feature reduction performed in the rule reduction step. The chemical plant dataset in Nakanishi et al. (1993) contains five inputs and a single output, labeled x1, x2, x3, x4, x5 and y, respectively. In this example, the feature selection in IV-CRI (Nakanishi et al., 1993) omits x2, x4 and x5, while RSPOP-CRI and HRR delete x1, x2, x5 and x1, x2, x4, x5, respectively. RFCMAC–Yager achieves the same result as HRR (i.e., retaining only x3), and crafts six antecedent and consequent labels in both input x3 and output y, respectively, as per Fig. 5(b). It subsequently generates six rules, as listed in Table 3(b). The monotonic mapping of the rules in the


Fig. 4. Example of rule generation in RFCMAC.

Table 2
Parameter configurations of RFCMAC–Yager for the experiments.

Dataset / case | D | δ | ε
Nakanishi data, Example 1 | 2 | 0.7 | 10^-5
Nakanishi data, Example 2 | 1 | 0.4 | 10^-5
Nakanishi data, Example 3 | 2 | 0.2 | 6 × 10^-3
Water plant data, 2-Class | 5 | 0.7 | 10^-4
Water plant data, 3-Class | 6 | 0.3 | 10^-4
Leukemia data, Independent test | 3 | 0.3 | 10^-3
Leukemia data, Leave-one-out | 4 | 0.3 | 10^-5

table essentially suggests that x3 and y are approximately linearly correlated to one another. The stock market price prediction dataset consists of 10 inputs and one output, termed x1, x2, x3,x4, x5, x6, x7, x8, x9, x10 and y, respectively, as described in Nakanishi et al. (1993). The feature selection procedures used in IV-CRI and RSPOP-CRI preserve inputs x4, x5, x8 and x4, x5, x7, x8, x9, respectively, while HRR omits x2 and x5. Meanwhile, RFCMAC–Yager keeps only two inputs: x2 and x4, which is smaller than the other methods. Input x2 is the one discarded by the other methods. However, RFCMAC–Yager shows that x2 is in fact informative, as reflected later by its relatively good approximation results. Five antecedent labels are then generated in both x2 and x4, and four consequent labels are crafted in y, as per Fig. 5(c). Table 3(c) lists in turn the eight rules identified in this example, which are again very intuitive and concise. Traces of the system prediction and training convergence for all three examples are given in Fig. 6(a), (c), (e) and (b), (d), (f), respectively. The former illustrates generalization performance on unseen testing data while the latter shows a stable, non-increasing learning curve that warrants training convergence (corresponding chiefly to the parameter tuning phase described in Section 3.3). Consolidated simulation results are shown in Table 4, involving four metrics: the number of selected features, number of rules, Pearson’s correlation R (Goldman & Weinberg, 1985), and test mean square error (MSE). As seen, RFCMAC–Yager gives the smallest set of features for all examples. Its number of rules is also the

smallest for the first example, second smallest for the second example (excluding IV-CRI as its rules are coded by hand), and smallest for the last example (excluding IV-CRI again). On the other hand, RFCMAC–Yager yields the best R and MSE for the first two examples, and second best for the last. While HRR gives a better result in the latter case, it has more features and rules, and hence less interpretability. In addition, due to its reduction mechanisms, RFCMAC–Yager compares favorably to its predecessor FCMAC-Yager in terms of memory requirements and prediction performances. In sum, these results show the salient interpretability and generalization traits of the proposed system. 5.3. Water plant monitoring This experiment concerns a case study of wastewater treatment plant data (Asuncion & Newman, 2007), conducted to evaluate the classification performance of the RFCMAC–Yager in an ill-structured domain with missing feature values and unbalanced class distribution. The task involves a historical plant dataset collected over 527 days (i.e., samples) with one series of measurements per day. There are 38 different real-valued input features observed daily: 9 features for representing plant inputs, 6 and 7 features for inputs to the primary and secondary settlers respectively, 7 features for plant outputs, and 9 features for plant performances. Each day is categorized into one of the 13 classes associated with the status of the plant, some indicating normal operation of varying


Fig. 5. Fuzzy labels crafted by RFCMAC–Yager for Nakanishi datasets.

Table 3 Fuzzy rules identified by RFCMAC–Yager for Nakanishi datasets. x1 (a) Nonlinear system example R1 High R2 High R3 Low x3 (b) Chemical plant example R1 Very Low R2 Low R3 Rather Low R4 Rather High R5 High R6 Very High x2 (c) Stock market example R1 – R2 – R3 High R4 Low R5 – R6 Very Low R7 – R8 Very High

x2

y

High Low –

Low Medium High

y Very Low Low Rather Low Rather High High Very High x4

y

Very High High Medium Medium Low Medium Very Low Medium

Very Low Very Low Low High High High Very High Very High

types, others denoting faults at various parts of the plant. The detailed feature information can be found in Asuncion and Newman (2007), while the distribution of the classes is presented in Table 5.

However, as all faults in the actual plant occur for a brief period (only 1–4 days) and are resolved immediately, one main issue is the lack of training samples for the fault cases. Following (Shen & Jensen, 2004), therefore, the fault cases were collated to form two (i.e., Normal and Faulty) and three (i.e., OK, Good and Faulty) broader categories, also given in Table 5. This results in two datasets, labeled 2-class and 3-class, respectively. Additionally, in order to handle the missing feature values, class mean imputation method is used (Kalton, 1983). That is, the data are grouped based on the output classes, and in each group, the mean value of every feature is computed to fill in the missing values in that feature. It is not advisable to simply discard/rule out samples with missing values here, due to the presence of instrumental failures or other occasions disrupting the measurement. Evaluation of the RFCMAC–Yager is subsequently done using a stratified 3-fold cross-validation (CV) procedure, i.e., partitioning the data into three separate groups of train and test sets, each retaining the output class proportion as in the original data. Fig. 7 shows in turn the fuzzy labels crafted by RFCMAC–Yager for the 2-class data in CV1. In this, the system finds that the status of the plant depends solely on three features: suspended solids to primary settler (SS-P), biological demand of oxygen to secondary settler (RD-DBO-S), and sediments (RS-SED-G). Meanwhile, the rule set pertaining to the semantic interpretations of the labels in Fig. 7 is given in Table 6. As can be seen, the rules are short, comprehensible, and fairly rational. For instance, rules R2, R3 and R4 indicate that a fault is expected if either the SS-P level is high or RD-DBO-S level

12076

R.J. Oentaryo et al. / Expert Systems with Applications 38 (2011) 12066–12084

Fig. 6. Results of RFCMAC–Yager on Nakanishi datasets.

is low or RS-SED-G level is low; otherwise, a normal operation is assumed, as per R1.

The results of RFCMAC–Yager on the 2-class data are summarized in Fig. 8(a) and (b). Fig. 8(a) shows the learning curve and

12077

R.J. Oentaryo et al. / Expert Systems with Applications 38 (2011) 12066–12084 Table 4 Benchmark results on Nakanishi datasets. Predictor

MLP RBF (Gaussian) SVR IV-CRI ANFIS EFuNN RSPOP-CRI HRR FCMAC–Yager RFCMAC–Yager

Nonlinear system example

Chemical plant example

Stock market example

SF

#Rules

R

MSE

SF

#Rules

R

MSE

SF

#Rules

R

MSE

All All All x1, x2 All All x1, x2, x4 x1, x2 All x1, x2

– – – 6+ ⁄ ⁄ 17 7 7 3

0.874 0.857 0.832 0.609 0.853 0.720 0.856 0.911 0.557 0.957

0.396 0.403 0.363 0.706 0.286 0.566 0.383 0.185 0.827 0.109

All All All x1, x3 All All x3, x4 x3 All x3

– – – 5+ ⁄ ⁄ 14 3 19 6

0.999 0.921 0.998 0.993 0.780 0.946 0.983 0.998 0.927 0.999

1.703  104 8.803  105 1.833  104 2.581  105 2.968  106 7.247  105 2.124  105 2.423  104 7.856  105 1.656  104

All All All x4, x5, x8 All All x4, x5, x7–x9 x1, x3, x4, x6–x10 All x2, x4

– – – 4+ ⁄ ⁄ 29 20 42 8

0.696 0.881 0.816 0.661 0.875 0.756 0.922 0.947 0.652 0.931

143.015 33.420 49.438 93.020 38.062 72.542 24.859 15.128 167.003 19.597

SF = selected features, R = Pearson’s correlation, MSE = mean squared error, + = manually set,  = not applicable, ⁄ = not specified.

Table 5 Distribution of output classes in water plant dataset. Original dataset

2-Class dataset

3-Class dataset

#Samples

Category

Value

Category

Value

Category

Value

Normal situation Secondary settler problems, type 1 Secondary settler problems, type 2 Secondary settler problems, type 3 Normal situation with good performance Solids overload, type 1 Secondary settler problems, type 4 Storm, type 1 Normal situation, low influent Storm, type 2 Normal situation Storm, type 3 Solids overload, type 2

1 2 3 4 5 6 7 8 9 10 11 12 13

Normal Faulty Faulty Faulty Normal Faulty Faulty Faulty Normal Faulty Normal Faulty Faulty

1 2 2 2 1 2 2 2 1 2 1 2 2

OK Faulty Faulty Faulty Good Faulty Faulty Faulty OK Faulty OK Faulty Faulty

1 3 3 3 2 3 3 3 1 3 1 3 3

Fig. 7. Fuzzy labels crafted by RFCMAC–Yager for 2-class water dataset (CV1).

275 1 1 4 116 3 1 1 69 1 53 1 1


Table 6 Fuzzy rules of RFCMAC–Yager for 2-class water dataset (CV1).

R1 R2 R3 R4

SS-P

RD-DBO-S

RS-SED-G

Class

Low – High –

High Low – –

High – – Low

Normal Faulty Faulty Faulty

convergence traits of the system (in CV1), while Fig. 8(b) presents a receiver operating characteristic (ROC) plot that showcases the robustness and discriminative power of the system on the test data throughout all three CVs. The latter is obtained by varying a decision threshold and measuring the specificity and sensitivity rates for every threshold value (Fawcett, 2006). In the current context, specificity and sensitivity are the proportions of normal and faulty cases that are correctly classified, respectively, as per (34) and (35). A larger area under the ROC curve indicates better performance (Fawcett, 2006). Also, EER refers to the equal error rate, where specificity equals sensitivity. Another related criterion (not shown in the plot but used in later benchmarking) is precision, i.e., the proportion of predicted faulty cases that are actually correct, as in (36). These metrics serve to complement the system's accuracy evaluation in relation to the unbalanced class distribution in the water data.

$$\text{Specificity} = \frac{\#\text{normal samples correctly classified}}{\#\text{actual normal samples}} \qquad (34)$$

$$\text{Sensitivity} = \frac{\#\text{fault samples correctly classified}}{\#\text{actual faulty samples}} \qquad (35)$$

$$\text{Precision} = \frac{\#\text{fault samples correctly classified}}{\#\text{samples classified as faulty}} \qquad (36)$$

ð35Þ ð36Þ

Comparisons are then made with several other methods: Fuzzy Rough Feature Selection (FRFS) (Shen & Jensen, 2004), Naïve Bayes (Domingos & Pazzani, 1997), MLP (Rumelhart et al., 1986), RBF (Powell, 1987), SVM (Platt, 1998), C4.5 decision tree (Quinlan, 1993), k-Nearest Neighbors (k-NN) (Cover & Hart, 1967), RSPOPCRI (Ang & Quek, 2005), the Generic Self-organizing Fuzzy Neural Network realizing Yager inference (GenSoFNN-Yager) (Oentaryo et al., 2008), and FCMAC-Yager (Sim et al., 2006). Table 7 shows the benchmark results using the accuracy, sensitivity, specificity and precision metrics described before, plus the average numbers


Fig. 8. Results of RFCMAC–Yager on 2-class water dataset.


Table 7
Benchmark results on 2-class water dataset.

Classifier         | Evaluation | #Features | #Rules | Accuracy (%) | Sensitivity (%) | Specificity (%) | Precision (%)
FRFS               | 75%t–25%e  | 10.00     | 2+     | 83.90        | *               | *               | *
Naïve Bayes        | 3-Fold CV  | All       | –      | 98.67        | 85.71           | 99.03           | 70.59
MLP (20 neurons)   | 3-Fold CV  | All       | –      | 98.29        | 42.86           | 99.81           | 85.71
RBF (5 Gaussians)  | 3-Fold CV  | All       | –      | 98.10        | 57.14           | 99.22           | 66.67
SVM (c = 1.0)      | 3-Fold CV  | All       | –      | 98.48        | 42.86           | 100.00          | 100.00
C4.5 (p = 0.25)    | 3-Fold CV  | 4.00      | 5.00   | 98.29        | 64.29           | 99.22           | 69.23
k-NN (k = 3)       | 3-Fold CV  | All       | –      | 98.48        | 42.86           | 100.00          | 100.00
RSPOP-CRI          | 3-Fold CV  | 24.00     | 241.33 | 98.29        | 42.86           | 99.81           | 85.71
GenSoFNN–Yager     | 3-Fold CV  | All       | 9.33   | 98.67        | 85.71           | 99.03           | 70.59
FCMAC–Yager        | 3-Fold CV  | All       | 64.33  | 97.53        | 14.28           | 99.81           | 66.67
RFCMAC–Yager       | 3-Fold CV  | 2.67      | 3.67   | 98.67        | 50.00           | 100.00          | 100.00

75%t–25%e = 75% train–25% test, CV = cross validation, – = not applicable, * = not specified, + = manually set.

Table 8
Benchmark results on 3-class water dataset.

Classifier         | Evaluation | #Features | #Rules | Accuracy (%) | FR-OK (%) | FR-Good (%) | FR-Faulty (%)
FRFS               | 75%t–25%e  | 11.00     | 3+     | 71.80        | *         | *           | *
Naïve Bayes        | 3-Fold CV  | All       | –      | 85.20        | 9.82      | 30.17       | 28.57
MLP (20 neurons)   | 3-Fold CV  | All       | –      | 85.58        | 7.30      | 28.45       | 100.00
RBF (5 Gaussians)  | 3-Fold CV  | All       | –      | 84.25        | 9.82      | 32.76       | 42.86
SVM (c = 1.0)      | 3-Fold CV  | All       | –      | 85.39        | 5.29      | 42.24       | 50.00
C4.5 (p = 0.25)    | 3-Fold CV  | 19.00     | 31.00  | 79.89        | 9.57      | 54.31       | 35.71
k-NN (k = 3)       | 3-Fold CV  | All       | –      | 82.35        | 6.80      | 50.86       | 50.00
RSPOP-CRI          | 3-Fold CV  | 35.33     | 348.67 | 74.19        | 20.65     | 41.38       | 42.86
GenSoFNN–Yager     | 3-Fold CV  | All       | 323.33 | 79.51        | 5.79      | 61.21       | 100.00
FCMAC–Yager        | 3-Fold CV  | All       | 204.67 | 69.07        | 12.59     | 85.34       | 100.00
RFCMAC–Yager       | 3-Fold CV  | 5.67      | 34.67  | 86.34        | 7.81      | 25.86       | 78.57

75%t–25%e = 75% train–25% test, CV = cross validation, – = not applicable, * = not specified, + = manually set.

5.4. Acute leukemia diagnosis

To examine the ability of RFCMAC–Yager in handling high-dimensional data, an experiment is conducted using the acute leukemia microarray data (Golub et al., 1999), which is widely studied in bioinformatics. The dataset contains the expression profiles of 7129 genes (probe sets) collected from 72 leukemia patients, where the expression levels of each sample are measured by the Affymetrix oligonucleotide microarray. Of these patients, 47 were identified as having acute lymphoblastic leukemia (ALL) and the other 25 as acute myeloid leukemia (AML). Following the setup in Golub et al. (1999), the data is split into a train set of 38 samples (27 ALL and 11 AML) and an independent test set of 34 samples (20 ALL and 14 AML).

Given the ultra-high dimensionality of the dataset, feature selection, and in particular the RFCMAC feature construction step, provides a natural means to remove redundant or inconsequential input features (which may contribute significantly to the classification errors) and to reduce the overall computational time and memory requirements. In this experiment, the feature construction step initially selects three genes (i.e., D = 3): Leukotriene C4 synthase (Probe U50136), Zyxin (Probe X95735), and FAH Fumarylacetoacetate (Probe M55150), which were also selected and discussed in Golub et al. (1999). However, subsequent consistency evaluations in the RFCMAC rule reduction step find that the last feature (gene) can be removed entirely, leaving only the first two. The fuzzy labels associated with the two genes selected by RFCMAC–Yager are shown in Fig. 9, while the corresponding set of rules is provided in Table 9. It can again be observed that the knowledge base formulated by RFCMAC–Yager is very simple and easy to understand.

There is also biological evidence supporting the validity of the features and rules identified. For instance, a study in Yagi et al. (2003) revealed the significance of Zyxin in the prognosis of pediatric AML. Zyxin has also been reported to play a critical role in encoding a LIM domain protein for cell adhesion in fibroblasts (Crawford & Beckerle, 1991), and more recent studies showed that Zyxin LIM (1–2) facilitates the interaction between the CasL-HEF1 and Crk1 adaptor proteins (Yi et al., 2002), which are closely associated with chronic myeloid leukemia and ALL (Salgia et al., 1996). On the other hand, Leukotriene C4 synthase constitutes the key enzyme in the biosynthesis of Leukotriene C4, which stimulates the growth of human myeloid progenitor cells and is frequently overproduced in myeloid leukemia (Brunet, Tamayo, Golub, & Mesirov, 2004). In sum, these facts suggest that both Zyxin and Leukotriene C4 synthase are crucial for leukemogenesis, although further (wet) experiments are needed to confirm the conjecture.

A closer examination based upon the expression patterns of the selected genes depicted in Fig. 10(a) reveals the validity and intuitiveness of the RFCMAC–Yager rules in Table 9. The rules conform to the plotted distribution of gene expression levels: rule R1 covers the lower-left data region, where most of the ALL cases are present (in both train and test sets), whereas R2 and R3 capture the upper and right regions, respectively, which contain the majority of the AML cases. Despite its simplicity, this set of rules provides satisfactory predictions; it yields perfect accuracy on the train set and 94.12% on the test set (one mistake each from ALL and AML). The learning curve corresponding to the tuning of these rules is given in Fig. 10(b), which again showcases the stable learning traits of the system.

The classification performance and robustness of the RFCMAC–Yager network on the (independent) test set under a varying decision threshold are illustrated by the ROC plot in Fig. 11(a), where sensitivity and specificity correspond to the ALL and AML classes, respectively. The results are consolidated in Table 10, shown along with some previously published results on the same test set obtained with: Voting Machine (Golub et al., 1999), SVM with Correlation Feature Selection (SVM-CFS) (Platt, 1998; Wang et al., 2005), C4.5 (Quinlan, 1993; Wang et al., 2005), Prediction Analysis of Microarrays (PAM) (Tibshirani, Hastle, Narasimhan, & Chu, 2002), Partial Least Squares with Logistic Discrimination (PLS-LD) (Nguyen & Rocke, 2002), Emerging Patterns (EP) (Li & Wong, 2002), and RBF with Median Vote Relevance (RBF-MVR) (Chow, Moler, & Mian, 2001). Note that comparisons with (conventional) FCMAC systems are not made here, owing to their lack of scalability in dealing with high-dimensional data. As shown, SVM-CFS, EP and C4.5 employed only one feature, Zyxin, to perform the classification. Although Zyxin alone is sufficient to (linearly) separate the ALL and AML cases on the train set, this is no longer true for the test set, as illustrated in Fig. 10(a). Meanwhile, RFCMAC–Yager shows that adding Leukotriene C4 synthase can indeed provide better discrimination and generalization results on the test set.


Fig. 9. Fuzzy labels crafted by RFCMAC–Yager for leukemia dataset.

Table 9
Fuzzy rules of RFCMAC–Yager for leukemia dataset.

Rule | LTC4S | Zyxin | Class
R1   | Low   | Low   | ALL
R2   | –     | High  | AML
R3   | High  | –     | AML

Comparisons can also be made with the other remaining methods, i.e., Voting Machine, PAM, PLS-LD, SVM-FFS, and RBF-MVR. As shown in Table 10, RFCMAC–Yager achieves a high accuracy rate comparable to PAM and SVM-FFS, albeit second to PLS-LD and RBF-MVR. Despite their superior performances, both PLS-LD and RBF-MVR function as black-box predictors; no meaningful knowledge (rules) can be extracted from these systems. In this respect, the primary advantage of RFCMAC–Yager is its ability to formulate a concise, interpretable rule base with a significantly smaller set of features, and to explain its outputs in a way highly akin to a human physician's decision-making process. The interpretability of the system is further enhanced by the use of linguistic labels in its fuzzy rules (as opposed to the crisp rules of C4.5 and EP), thereby allowing the user to understand them in familiar terms and making system verification easier.
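For readers who wish to see how compact rules such as those in Table 9 could be exercised, the sketch below evaluates Gaussian-membership rules and assigns the class of the strongest-firing rule. It is only an illustration under stated assumptions: the membership centres and widths, the max-firing decision, and all function names are hypothetical placeholders, deliberately simpler than the actual Yager inference used by RFCMAC–Yager.

```python
import math

# Hypothetical illustration (not the authors' implementation): evaluating
# simple fuzzy rules of the Table 9 flavour with Gaussian membership
# functions and classifying by the strongest-firing rule. Centres and
# widths below are made-up placeholders, not values from the paper.

def gaussian(x, centre, width):
    return math.exp(-((x - centre) ** 2) / (width ** 2))

# Each rule: ({feature: (centre, width)}, class). Missing antecedents are
# treated as "don't care", mirroring the '-' entries in Table 9.
RULES = [
    ({"LTC4S": (0.2, 0.3), "Zyxin": (0.2, 0.3)}, "ALL"),   # R1: Low, Low -> ALL
    ({"Zyxin": (0.8, 0.3)},                      "AML"),   # R2: -, High -> AML
    ({"LTC4S": (0.8, 0.3)},                      "AML"),   # R3: High, - -> AML
]

def fire(antecedent, sample):
    # Firing strength as the product of antecedent memberships
    # (consistent with the product form of f_k in Appendix A).
    strength = 1.0
    for feature, (centre, width) in antecedent.items():
        strength *= gaussian(sample[feature], centre, width)
    return strength

def classify(sample):
    return max((fire(a, sample), label) for a, label in RULES)[1]

if __name__ == "__main__":
    print(classify({"LTC4S": 0.15, "Zyxin": 0.25}))  # expected: ALL
    print(classify({"LTC4S": 0.30, "Zyxin": 0.85}))  # expected: AML
```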

Fig. 10. Results of RFCMAC–Yager on leukemia dataset (independent test).


Fig. 11. ROC curves for the RFCMAC–Yager predictions on leukemia dataset.

Table 10
Benchmark results on leukemia dataset (independent test).

Classifier      | #Features | #Rules | Accuracy (%) | #Errors (ALL:AML)
Voting machine  | 50        | –      | 85.29        | 5 (*)
SVM-CFS         | 1         | –      | 91.18        | 3 (2:1)
C4.5            | 1         | 2      | 91.18        | 3 (2:1)
PAM             | 21        | –      | 94.12        | 2 (*)
PLS-LD          | 50        | –      | 97.06        | 1 (0:1)
EP              | 1         | 2      | 91.18        | 3 (2:1)
SVM-FFS         | 25–1000   | –      | 88.24–94.12  | 2–4 (*)
RBF-MVR         | 50        | –      | 97.06        | 1 (1:0)
RFCMAC–Yager    | 2         | 3      | 94.12        | 2 (1:1)

– = not applicable, * = not specified.

To confirm the performance verity of the proposed system, further experiments adopting a leave-one-out (LOO) strategy on all 72 samples (i.e., no division between train and test sets) were conducted, owing to the small data size. This is equivalent to 72-fold CV, where 71 samples serve to train the system and 1 sample to assess its generalization performance. The result is summarized in Fig. 11(b), showing an improvement in generalization given the larger number of training samples, as compared to the previous result in Fig. 11(a). Comparisons are then made against several published results based on the LOO setting, utilizing: C4.5 (Quinlan, 1993), PAM (Tibshirani et al., 2002), Top Scoring Pair (TSP) (Geman, d'Avignon, Naiman, & Winslow, 2004), k-TSP (Tan, Naiman, Xu, Winslow, & Geman, 2005), Relevance Vector Machine (RVM) (Tipping, 2001), Sparse Logistic Regression (SLogReg) (Shevade & Keerthi, 2003), Bayesian Logistic Regression (BLogReg) (Cawley & Talbot, 2006), and RBF-MVR (Chow et al., 2001), as listed in Table 11. The results (especially the comparisons with RBF-MVR) are similar to those in Table 10. In general, RFCMAC–Yager shows a good balance between interpretability and prediction accuracy, demonstrating its efficacy as a clinical decision support system.

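As a concrete illustration of the LOO protocol described above, the following sketch uses scikit-learn's LeaveOneOut splitter with a placeholder classifier. The choice of classifier, the random stand-in data, and the variable names are assumptions for illustration only, not the authors' actual experimental pipeline.

```python
# Illustrative LOO (72-fold CV) evaluation loop; any classifier object with
# fit/predict could stand in for the placeholder used here.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier  # placeholder model

def loo_accuracy(X, y):
    loo = LeaveOneOut()
    correct = 0
    for train_idx, test_idx in loo.split(X):
        model = KNeighborsClassifier(n_neighbors=3)
        model.fit(X[train_idx], y[train_idx])
        correct += int(model.predict(X[test_idx])[0] == y[test_idx][0])
    return correct / len(y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(72, 2))        # stand-in for the 2 selected genes
    y = rng.integers(0, 2, size=72)     # stand-in for ALL/AML labels
    print(f"LOO accuracy: {loo_accuracy(X, y):.4f}")
```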
Table 11
Benchmark results on leukemia dataset (leave-one-out).

Classifier    | #Features | #Rules | Accuracy (%) | #Errors (ALL:AML)
C4.5          | 2.00      | 2.00   | 73.61        | 19 (*)
PAM           | 2296.00   | –      | 97.22        | 2 (*)
TSP           | 2.00      | 2.00   | 93.06        | 5 (*)
k-TSP         | 18.00     | 18.00  | 95.83        | 3 (*)
RVM           | 3.63      | –      | 93.06        | 5 (*)
SLogReg       | 5.06      | –      | 94.44        | 4 (*)
BLogReg       | 11.59     | –      | 93.06        | 5 (*)
RBF-MVR       | 25.00     | –      | 98.61        | 1 (1:0)
RFCMAC–Yager  | 2.92      | 3.92   | 97.22        | 2 (0:2)

– = not applicable, * = not specified.

6. Conclusion


A new knowledge extraction tool is presented in this article, termed the Reduced Fuzzy Cerebellar Model Articulation Controller (RFCMAC), which incorporates structural reduction mechanisms to improve the interpretability and generalization of contemporary FCMAC and localized neuro-fuzzy systems. The potential of the proposed system in identifying a concise, interpretable rule base, while enhancing classification/approximation accuracy, has been exemplified through many experimental results. These suggest that RFCMAC can be used to address complex real-world problems, and to eventually realize the large-scale learning memory systems required for developing a more general kind of intelligence.

Several projects are now underway that aim at extending the proposed system and applying it to novel tasks. One example is the development of an integrated neuro-cognitive architecture which models the putative functional aspects of the brain and ultimately human general intelligence (Duch, Oentaryo, & Pasquier, 2008; Oentaryo & Pasquier, 2008). A potential application is the study and modeling of cognitive skill acquisition, from novice to expert level, such as that of driving skills in an autonomous vehicle (Pasquier, Quek, & Toh, 2001; Pasquier & Oentaryo, 2008).

While RFCMAC is a powerful tool for solving complex problems, several deficiencies remain within the current system. For instance, a "hole" may occur in memory when the testing inputs fall within the antecedent labels of untrained (virtual) rule cells, yielding a possibly erroneous output. Also, the acquisition of novel, fast-changing patterns may be rather slow in the proposed system, due to its localized, fairly stable adaptation processes. To address these issues, the development of a "patching" algorithm capable of reconstructing a plausible space (Yao, Pasquier, & Quek, 2007) and a brain-inspired dual neuro-fuzzy system that incorporates a complementary blending of globalized and localized adaptation methods (Oentaryo & Pasquier, 2009, in press) are currently being investigated. The latter also provides the groundwork to realize knowledge consolidation mechanisms resembling those of humans (Oentaryo & Pasquier, 2009, in press).


Appendix A. Parameter tuning derivation

This section describes the mathematical derivation of the RFCMAC parameter tuning phase, which is concerned with the procedures for updating the kernel parameters of the fuzzy labels in the formulated rule base. Sections A.1 and A.2 present the updating rule derivations for the consequent labels and the antecedent labels, respectively.

A.1. Consequent part

With reference to (24), the updating of a kernel parameter h_{l,m} (i.e., centroid m_{l,m} or width \sigma_{l,m} of the Gaussian MF) in a consequent label C_{l,m} depends on the derivative term \partial E / \partial h_{l,m}. The term can be resolved using the standard chain rule, as per (37):

\frac{\partial E}{\partial h_{l,m}} = \frac{\partial E}{\partial y_m}\,\frac{\partial y_m}{\partial h_{l,m}} = -(t_m - y_m)\,\frac{\partial y_m}{\partial h_{l,m}}    (37)

To derive \partial y_m / \partial h_{l,m}, the output defuzzification formula in (2) first needs to be rewritten in terms of the individual parameter h_{l,m} of consequent label C_{l,m}, as given in (38):

y_m = \frac{\sum_{k \in S} \frac{m_{(l,m)_k}\, f_k}{\sigma_{(l,m)_k}(2 - f_k)}}{\sum_{k \in S} \frac{f_k}{\sigma_{(l,m)_k}(2 - f_k)}} = \frac{\sum_{l=1}^{L_m} \frac{m_{l,m}}{\sigma_{l,m}} \sum_{k \in S_{l,m}} \frac{f_k}{2 - f_k}}{\sum_{l=1}^{L_m} \frac{1}{\sigma_{l,m}} \sum_{k \in S_{l,m}} \frac{f_k}{2 - f_k}}    (38)

where S_{l,m} is the set of selected rule indices that point to C_{l,m}. For simplicity, let N_m and D_m denote the numerator and denominator of (38), respectively. Using this notation and (38), the term \partial y_m / \partial h_{l,m} can be computed as per (39):

\frac{\partial y_m}{\partial h_{l,m}} = \frac{1}{D_m}\,\frac{\partial}{\partial h_{l,m}}\left[\sum_{l'=1}^{L_m} \frac{m_{l',m}}{\sigma_{l',m}} \sum_{k \in S_{l',m}} \frac{f_k}{2 - f_k}\right] - \frac{N_m}{D_m^2}\,\frac{\partial}{\partial h_{l,m}}\left[\sum_{l'=1}^{L_m} \frac{1}{\sigma_{l',m}} \sum_{k \in S_{l',m}} \frac{f_k}{2 - f_k}\right] = \frac{1}{D_m}\,\frac{\partial N_m}{\partial h_{l,m}} - \frac{y_m}{D_m}\,\frac{\partial D_m}{\partial h_{l,m}}    (39)

In the context of m_{l,m}, (39) subsequently evaluates to (40):

\frac{\partial y_m}{\partial m_{l,m}} = \frac{1}{D_m}\,\frac{1}{\sigma_{l,m}} \sum_{k \in S_{l,m}} \frac{f_k}{2 - f_k} - \frac{y_m}{D_m}\,(0) = \frac{1}{D_m \sigma_{l,m}} \sum_{k \in S_{l,m}} \frac{f_k}{2 - f_k}    (40)

while the corresponding formula for \sigma_{l,m} is presented in (41):

\frac{\partial y_m}{\partial \sigma_{l,m}} = \frac{1}{D_m}\left(-\frac{m_{l,m}}{\sigma_{l,m}^2} \sum_{k \in S_{l,m}} \frac{f_k}{2 - f_k}\right) - \frac{y_m}{D_m}\left(-\frac{1}{\sigma_{l,m}^2} \sum_{k \in S_{l,m}} \frac{f_k}{2 - f_k}\right) = \frac{y_m - m_{l,m}}{D_m \sigma_{l,m}^2} \sum_{k \in S_{l,m}} \frac{f_k}{2 - f_k}    (41)

Substituting (40) and (41) back into (37) accordingly yields (42) and (43), respectively:

\frac{\partial E}{\partial m_{l,m}} = -\frac{t_m - y_m}{D_m \sigma_{l,m}} \sum_{k \in S_{l,m}} \frac{f_k}{2 - f_k}    (42)

\frac{\partial E}{\partial \sigma_{l,m}} = -\frac{(t_m - y_m)(y_m - m_{l,m})}{D_m \sigma_{l,m}^2} \sum_{k \in S_{l,m}} \frac{f_k}{2 - f_k}    (43)

As such, the (full) updating formulae of the parameters m_{l,m} and \sigma_{l,m} for each pth data point can be written as in (44) and (45), respectively:

\Delta m_{l,m}^{(p)} = \eta\,\frac{t_m^{(p)} - y_m^{(p)}}{D_m^{(p)} \sigma_{l,m}^{(p)}} \sum_{k \in S_{l,m}^{(p)}} \frac{f_k^{(p)}}{2 - f_k^{(p)}}    (44)

\Delta \sigma_{l,m}^{(p)} = \eta\,\frac{\bigl(t_m^{(p)} - y_m^{(p)}\bigr)\bigl(y_m^{(p)} - m_{l,m}^{(p)}\bigr)}{D_m^{(p)} \bigl(\sigma_{l,m}^{(p)}\bigr)^2} \sum_{k \in S_{l,m}^{(p)}} \frac{f_k^{(p)}}{2 - f_k^{(p)}}    (45)

where \Delta m_{l,m}^{(p)} = m_{l,m}^{(p+1)} - m_{l,m}^{(p)} and \Delta \sigma_{l,m}^{(p)} = \sigma_{l,m}^{(p+1)} - \sigma_{l,m}^{(p)}. It should be noted, however, that each consequent label C_{l,m} may be linked to many rules R_k. As a result, the amount of update computed in (44) and (45) may be overly large. To avoid excessive changes and upsetting the learning stability, the amount of update is scaled down by accounting for the number of selected rules |S_{l,m}^{(p)}| which point to C_{l,m}. The resultant revised updating formulae are subsequently given in (46) and (47):

\Delta m_{l,m}^{(p)} = \frac{\eta}{|S_{l,m}^{(p)}|}\,\frac{t_m^{(p)} - y_m^{(p)}}{D_m^{(p)} \sigma_{l,m}^{(p)}} \sum_{k \in S_{l,m}^{(p)}} \frac{f_k^{(p)}}{2 - f_k^{(p)}}    (46)

\Delta \sigma_{l,m}^{(p)} = \frac{\eta}{|S_{l,m}^{(p)}|}\,\frac{\bigl(t_m^{(p)} - y_m^{(p)}\bigr)\bigl(y_m^{(p)} - m_{l,m}^{(p)}\bigr)}{D_m^{(p)} \bigl(\sigma_{l,m}^{(p)}\bigr)^2} \sum_{k \in S_{l,m}^{(p)}} \frac{f_k^{(p)}}{2 - f_k^{(p)}}    (47)

A.2. Antecedent part

Similar to the consequent section, the updating of each antecedent label A_{i,j} is proportional to the derivative term \partial E / \partial h_{i,j}, which subsequently evaluates to (48):

\frac{\partial E}{\partial h_{i,j}} = \frac{\partial}{\partial h_{i,j}}\left[\frac{1}{2}\sum_{m=1}^{M}(t_m - y_m)^2\right] = -\sum_{m=1}^{M}(t_m - y_m)\,\frac{\partial y_m}{\partial h_{i,j}}    (48)

Using the notations N_m and D_m described before, the term \partial y_m / \partial h_{i,j} translates to (49):

\frac{\partial y_m}{\partial h_{i,j}} = \frac{1}{D_m} \sum_{k \in S} \frac{m_{(l,m)_k}}{\sigma_{(l,m)_k}}\,\frac{\partial}{\partial h_{i,j}}\!\left(\frac{f_k}{2 - f_k}\right) - \frac{y_m}{D_m} \sum_{k \in S} \frac{1}{\sigma_{(l,m)_k}}\,\frac{\partial}{\partial h_{i,j}}\!\left(\frac{f_k}{2 - f_k}\right) = \frac{1}{D_m} \sum_{k \in S} \frac{m_{(l,m)_k} - y_m}{\sigma_{(l,m)_k}} \cdot \frac{(2 - f_k)\frac{\partial f_k}{\partial h_{i,j}} + f_k \frac{\partial f_k}{\partial h_{i,j}}}{(2 - f_k)^2} = \frac{1}{D_m} \sum_{k \in S} \frac{m_{(l,m)_k} - y_m}{\sigma_{(l,m)_k}} \cdot \frac{2}{(2 - f_k)^2}\,\frac{\partial f_k}{\partial h_{i,j}}    (49)

where h_{i,j} denotes either the centroid m_{i,j} or the width \sigma_{i,j} of the Gaussian MF. In the case of m_{i,j}, the term \partial f_k / \partial h_{i,j} can be evaluated as per (50):

\frac{\partial f_k}{\partial m_{i,j}} = \frac{\partial}{\partial m_{i,j}} \prod_{i'=1}^{I}\bigl(1 - d_{(i',j)_k}\bigr) = \frac{\partial}{\partial m_{i,j}} \prod_{i'=1}^{I} \exp\!\left(-\frac{(m_{i',j} - x_{i'})^2}{\sigma_{i',j}^2}\right) = \left[\prod_{i'=1,\, i' \neq i}^{I} \exp\!\left(-\frac{(m_{i',j} - x_{i'})^2}{\sigma_{i',j}^2}\right)\right]\left[-\frac{2(m_{i,j} - x_i)}{\sigma_{i,j}^2} \exp\!\left(-\frac{(m_{i,j} - x_i)^2}{\sigma_{i,j}^2}\right)\right] = \frac{2(x_i - m_{i,j})}{\sigma_{i,j}^2}\,f_k    (50)

while, for \sigma_{i,j}, the term translates to (51):

\frac{\partial f_k}{\partial \sigma_{i,j}} = \left[\prod_{i'=1,\, i' \neq i}^{I} \exp\!\left(-\frac{(m_{i',j} - x_{i'})^2}{\sigma_{i',j}^2}\right)\right]\left[\frac{2(m_{i,j} - x_i)^2}{\sigma_{i,j}^3} \exp\!\left(-\frac{(m_{i,j} - x_i)^2}{\sigma_{i,j}^2}\right)\right] = \frac{2(x_i - m_{i,j})^2}{\sigma_{i,j}^3}\,f_k    (51)

Substituting (50) and (51) into (49), and then into (48), yields (52) and (53), respectively:

\frac{\partial E}{\partial m_{i,j}} = -\frac{x_i - m_{i,j}}{\sigma_{i,j}^2} \sum_{m=1}^{M} \frac{t_m - y_m}{D_m} \sum_{k \in S} \frac{m_{(l,m)_k} - y_m}{\sigma_{(l,m)_k}}\,\frac{4 f_k}{(2 - f_k)^2}    (52)

\frac{\partial E}{\partial \sigma_{i,j}} = -\frac{(x_i - m_{i,j})^2}{\sigma_{i,j}^3} \sum_{m=1}^{M} \frac{t_m - y_m}{D_m} \sum_{k \in S} \frac{m_{(l,m)_k} - y_m}{\sigma_{(l,m)_k}}\,\frac{4 f_k}{(2 - f_k)^2}    (53)

which in turn lead to the updating formulae in (54) and (55):

\Delta m_{i,j}^{(p)} = \eta\,\frac{x_i^{(p)} - m_{i,j}^{(p)}}{\bigl(\sigma_{i,j}^{(p)}\bigr)^2} \sum_{m=1}^{M} \frac{t_m^{(p)} - y_m^{(p)}}{D_m^{(p)}} \sum_{k \in S^{(p)}} \frac{m_{(l,m)_k}^{(p)} - y_m^{(p)}}{\sigma_{(l,m)_k}^{(p)}}\,\frac{4 f_k^{(p)}}{\bigl(2 - f_k^{(p)}\bigr)^2}    (54)

\Delta \sigma_{i,j}^{(p)} = \eta\,\frac{\bigl(x_i^{(p)} - m_{i,j}^{(p)}\bigr)^2}{\bigl(\sigma_{i,j}^{(p)}\bigr)^3} \sum_{m=1}^{M} \frac{t_m^{(p)} - y_m^{(p)}}{D_m^{(p)}} \sum_{k \in S^{(p)}} \frac{m_{(l,m)_k}^{(p)} - y_m^{(p)}}{\sigma_{(l,m)_k}^{(p)}}\,\frac{4 f_k^{(p)}}{\bigl(2 - f_k^{(p)}\bigr)^2}    (55)

where \Delta m_{i,j}^{(p)} = m_{i,j}^{(p+1)} - m_{i,j}^{(p)} and \Delta \sigma_{i,j}^{(p)} = \sigma_{i,j}^{(p+1)} - \sigma_{i,j}^{(p)}. As with (46) and (47), downscaling is required to prevent large updates in the antecedent label parameters. To achieve this, both (54) and (55) are multiplied by a factor of 1/(4M|S|), where |S| and M are the number of selected rules and the number of output dimensions, respectively. The modified updating equations are given in (56) and (57):

\Delta m_{i,j}^{(p)} = \frac{\eta}{M|S|}\,\frac{x_i^{(p)} - m_{i,j}^{(p)}}{\bigl(\sigma_{i,j}^{(p)}\bigr)^2} \sum_{m=1}^{M} \frac{t_m^{(p)} - y_m^{(p)}}{D_m^{(p)}} \sum_{k \in S^{(p)}} \frac{m_{(l,m)_k}^{(p)} - y_m^{(p)}}{\sigma_{(l,m)_k}^{(p)}}\,\frac{f_k^{(p)}}{\bigl(2 - f_k^{(p)}\bigr)^2}    (56)

\Delta \sigma_{i,j}^{(p)} = \frac{\eta}{M|S|}\,\frac{\bigl(x_i^{(p)} - m_{i,j}^{(p)}\bigr)^2}{\bigl(\sigma_{i,j}^{(p)}\bigr)^3} \sum_{m=1}^{M} \frac{t_m^{(p)} - y_m^{(p)}}{D_m^{(p)}} \sum_{k \in S^{(p)}} \frac{m_{(l,m)_k}^{(p)} - y_m^{(p)}}{\sigma_{(l,m)_k}^{(p)}}\,\frac{f_k^{(p)}}{\bigl(2 - f_k^{(p)}\bigr)^2}    (57)
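The derivation above maps directly onto a simple gradient-style update loop. The following sketch, offered purely as an illustration under assumed data structures (flat NumPy arrays, a single output dimension, one Gaussian antecedent per rule and input with no label sharing), implements the consequent updates (46)-(47) and the antecedent updates (56)-(57). It is not the authors' implementation; the array layout and function names are hypothetical.

```python
import numpy as np

# Illustrative sketch of the Appendix A updates (not the authors' code).
# Assumptions: single output dimension (M = 1); rule k has Gaussian
# antecedents (ant_m[k, i], ant_s[k, i]) over the I inputs and points to
# one consequent label cons_of_rule[k] with kernel (con_m[l], con_s[l]).

def firing_strengths(x, ant_m, ant_s):
    # f_k = prod_i exp(-(m_{i,j} - x_i)^2 / sigma_{i,j}^2)
    return np.exp(-(((ant_m - x) ** 2) / ant_s ** 2).sum(axis=1))

def defuzzify(f, con_m, con_s, cons_of_rule):
    # Eq. (38): output as a weighted average of consequent centroids.
    w = f / (con_s[cons_of_rule] * (2.0 - f))
    return (con_m[cons_of_rule] * w).sum() / w.sum(), w.sum()

def tune_step(x, t, ant_m, ant_s, con_m, con_s, cons_of_rule, eta=0.05):
    f = firing_strengths(x, ant_m, ant_s)
    y, D = defuzzify(f, con_m, con_s, cons_of_rule)
    err = t - y
    g = f / (2.0 - f)

    # Consequent updates, Eqs. (46)-(47), scaled by the number of rules
    # |S_l| that share each consequent label.
    d_con_m = np.zeros_like(con_m)
    d_con_s = np.zeros_like(con_s)
    for l in range(len(con_m)):
        sel = np.where(cons_of_rule == l)[0]
        if sel.size == 0:
            continue
        ssum = g[sel].sum()
        d_con_m[l] = (eta / sel.size) * err / (D * con_s[l]) * ssum
        d_con_s[l] = (eta / sel.size) * err * (y - con_m[l]) / (D * con_s[l] ** 2) * ssum

    # Antecedent updates, Eqs. (56)-(57); the 1/(4M|S|) downscaling is
    # folded in (the 4 cancels against the 4 f_k / (2 - f_k)^2 term).
    per_rule = (err / D) * ((con_m[cons_of_rule] - y) / con_s[cons_of_rule]) * (f / (2.0 - f) ** 2)
    scale = eta / (1 * len(f))          # M = 1, |S| = number of rules
    d_ant_m = scale * ((x - ant_m) / ant_s ** 2) * per_rule[:, None]
    d_ant_s = scale * (((x - ant_m) ** 2) / ant_s ** 3) * per_rule[:, None]

    con_m += d_con_m; con_s += d_con_s
    ant_m += d_ant_m; ant_s += d_ant_s
    return y

if __name__ == "__main__":
    ant_m = np.array([[0.2, 0.2], [0.8, 0.8]])   # 2 rules x 2 inputs
    ant_s = np.full((2, 2), 0.4)
    con_m = np.array([0.0, 1.0])                 # 2 consequent labels
    con_s = np.array([0.3, 0.3])
    cons_of_rule = np.array([0, 1])
    for _ in range(50):
        tune_step(np.array([0.25, 0.2]), 0.1, ant_m, ant_s, con_m, con_s, cons_of_rule)
        tune_step(np.array([0.75, 0.8]), 0.9, ant_m, ant_s, con_m, con_s, cons_of_rule)
    print(con_m, con_s)
```

Under shared antecedent labels, as in the actual RFCMAC architecture, the per-rule antecedent deltas would additionally be accumulated over all selected rules that reference the same label A_{i,j}.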

References

Albus, J. S. (1975). A new approach to manipulator control: The cerebellar model articulation controller. Journal of Dynamic Systems, Measurement, and Control, 97, 220–227. Ang, K. K., & Quek, C. (2005). RSPOP: Rough set-based pseudo outer-product fuzzy rule identification algorithm. Neural Computation, 17, 205–243. Angelov, P. P., & Zhou, X. (2008). Evolving fuzzy-rule-based classifiers from data streams. IEEE Transactions on Fuzzy Systems, 16(6), 1462–1475. Asuncion, A., & Newman, D. (2007). UCI machine learning repository. School of Information and Computer Science, University of California. Available: ). Brunet, J., Tamayo, P., Golub, T., & Mesirov, J. (2004). Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academic Sciences of USA, 101, 4164–4169. Cawley, G. C., & Talbot, N. L. C. (2006). Gene selection in cancer classification using sparse logistic regression with Bayesian regularisation. Bioinformatics, 22(19), 2348–2355. Chakraborty, D., & Pal, N. R. (2004). A neuro-fuzzy scheme for simultaneous feature selection and fuzzy rule-based classification. IEEE Transactions on Neural Networks, 15(1), 110–123. Chow, M. L., Moler, E. J., & Mian, I. S. (2001). Identifying marker genes in transcription profiling data using a mixture of feature relevance experts. Physiological Genomics, 5, 99–111. Cover, T. M., & Hart, P. E. (1967). Nearest pattern classification. IEEE Transactions on Information Theory, IT-13(1), 21–27. Crawford, A., & Beckerle, M. (1991). Purification and characterization of zyxin, an 82,000-Dalton component of adherens junctions. Journal of Biological Chemistry, 266(9), 5847–5853. Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103130. Duch, W., Oentaryo, R. J., & Pasquier, M. (2008). Cognitive architectures: Where do we go from here? In P. Wang, B. Goertzel, & S. Franklin (Eds.). Frontiers in artificial intelligence and applications (Vol. 171, pp. 122–136). IOS Press. Duch, W., Setiono, R., & Zurada, J. M. (2004). Computational intelligence methods for rule-based data understanding. Proceedings of the IEEE, 92(5), 771–805. Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York, NY: John Wiley & Sons. Estomih, M., & Gregory, G. (2006). Clinical neuroanatomy and neuroscience. Philadelphia: Saunders. Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27, 861–874. Fayyad, U., & Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the international joint conference on artificial intelligence, Chambry, France (pp. 1022–1029). Geman, D., d’Avignon, C., Naiman, D., & Winslow, R. L. (2004). Classifying gene expression profiles from pairwise mRNA comparisons. Statistical Applications in Genetics and Molecular Biology, 3(1), 19. Goldman, R. N., & Weinberg, J. S. (1985). Statistics: An introduction. Englewood Cliffs, NJ: Prentice Hall. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531–537. Hall, M. (1999). Correlation-based feature subset selection for machine learning. Unpublished doctoral dissertation, Department of Computer Science, University of Waikato. Hebb, D. O. (1949). The organization of behavior. New York: John Wiley & Sons. 
Hwang, K. S., & Hsu, Y. B. (2002). A self-improving fuzzy cerebellar model articulation controller with stochastic action generation. Cybernetics and Systems, 33(2), 101–128. Jang, J. S. R. (1993). ANFIS: Adaptive-network-based fuzzy inference system. IEEE Transactions on Systems, Man and Cybernetics, 23(3), 665–685. Jou, C. C. (1992). A fuzzy cerebellar model articulation controller. In Proceedings of the IEEE international conference on fuzzy systems, San Diego (pp. 1171–1178). Kalton, G. (1983). Introduction to survey sampling. Beverly Hills and London: SAGE Publications, Inc..


Kandel, E. R., Schwartz, J. H., & Jessel, T. M. (2000). Principles of neural science (4th ed.). New York: McGraw-Hill. Kasabov, N. (2001). Evolving fuzzy neural networks for supervised/unsupervised online knowledge-based learning. IEEE Transactions on Systems, Man and Cybernetics, Part B, 31(6), 902–918. Keller, J. M., Yager, R. R., & Tahani, H. (1992). Neural network implementation of fuzzy logic. Fuzzy Sets and Systems, 45(1), 1–12. Lazo, A., & Rathie, P. (1978). On the entropy of continuous probability distributions. IEEE Transactions on Information Theory, 24(1), 120–122. Li, J., & Wong, L. (2002). Identifying good diagnostic gene groups from gene expression profiles using the concept of emerging patterns. Bioinformatics, 18(5), 725–734. Lin, C. J., & Lin, C. T. (1997). An ART-based fuzzy adaptive learning control network. IEEE Transactions on Fuzzy Systems, 5(4), 477–496. Lin, C. T., & Lee, C. S. G. (1994). Reinforcement structure/parameter learning for neural-network based fuzzy logic control systems. IEEE Transactions on Fuzzy Systems, 2, 46–63. Lin, C. T., & Lee, C. S. G. (1996). Neural fuzzy systems: A neuro-fuzzy synergism to intelligent systems. Upper Saddle River, NJ: Prentice Hall. Liu, F., Quek, C., & Ng, G. S. (2007). A novel generic Hebbian-ordering-based fuzzy rule base reduction approach to Mamdani neuro-fuzzy system. Neural Computation, 19, 1656–1680. Mamdani, E. H., & Assilian, S. (1999). An experiment in linguistic synthesis with a fuzzy logic controller. International Journal of Human–Computer Studies, 51(2), 135–147. Nakanishi, H., Turksen, I. B., & Sugeno, M. (1993). A review and comparison of six reasoning methods. Fuzzy Sets and Systems, 57(3), 257–294. Nauck, D., Klawonn, F., & Kruse, R. (1997). Foundations of neuro-fuzzy systems. New York, NY: John Wiley & Sons. Ng, G. S., Quek, C., & Jiang, H. (2008). FCMAC-EWS: A bank failure early warning system based on a novel localized pattern learning and semantically associative fuzzy neural network. Expert Systems with Applications, 34(2), 989–1003. Nguyen, D. V., & Rocke, D. M. (2002). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, 18(1), 39–50. Nguyen, M. N., Shi, D., & Quek, C. (2006). FCMAC-BYY: Fuzzy CMAC using Bayesian Ying-Yang learning. IEEE Transactions on Systems, Man and Cybernetics, Part B, 36(5), 1180–1190. Nie, J., & Linkens, D. (1994). FCMAC: A fuzzified cerebellar model articulation controller with self-organizing capacity. Automatica, 30(4), 655–664. Oentaryo, R. J., & Pasquier, M. (2008). Towards a novel integrated neuro-cognitive architecture (INCA). In Proceedings of the IEEE international joint conference on neural networks, Hong Kong (pp. 1902–1909). Oentaryo, R. J., & Pasquier, M. (2009). A novel dual neuro-fuzzy system approach for large-scale knowledge consolidation. In Proceedings of the IEEE international conference on fuzzy systems, Jeju, South Korea (pp. 53–58). Oentaryo, R.J., Pasquier, M. (in press). Knowledge consolidation and inference in the integrated neuro-cognitive architecture. IEEE Intelligent Systems. Oentaryo, R. J., Pasquier, M., & Quek, C. (2008). GenSoFNN–Yager: A novel braininspired generic self-organizing neuro-fuzzy system realizing Yager inference. Expert Systems with Applications, 35(4), 1825–1840. Parzen, E. (1962). On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33, 1065–1076. Pasquier, M., & Oentaryo, R. J. (2008). 
Learning to drive the human way: A step towards intelligent vehicles. International Journal of Vehicle Autonomous Systems, Special Issue on Advances in Autonomous Vehicle Technologies for Urban Environment, 6(1–2), 24–47. Pasquier, M., Quek, C., & Toh, M. (2001). Fuzzylot: A novel self-organising fuzzyneural rule-based pilot system for automated vehicles. Neural Networks, 14(8), 1099–1112. Pawlak, Z. (1982). Rough sets. International Journal of Computer and Information Sciences, 11, 341–356.

Peng, J., Wang, Y., & Sun, W. (2007). Trajectory-tracking control for mobile robot using recurrent fuzzy cerebellar model articulation controller. Neural Information Processing – Letters and Reviews, 11(1), 15–23. Platt, J. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schoelkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods – Support vector learning. advances in kernel methods – Support vector learning. MIT Press. Powell, M. J. D. (1987). Radial basis functions for multivariable interpolation: A review. In J. Mason & M. Cox (Eds.), Algorithms for approximation (pp. 143–167). Oxford University Press. Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1986). Numerical recipes in C. Cambridge, England: Cambridge University Press. Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. Salgia, R., Pisick, E., Sattler, M., Li, J., Uemura, N., Wong, W., et al. (1996). p130CAS forms a signaling complex with the adapter protein crk1 in hematopoietic cells transformed by the BCR/ABL oncogene. Journal of Biological Chemistry, 271(41), 25198–25203. Shen, Q., & Jensen, R. (2004). Selecting informative features with fuzzy-rough sets and its application for complex systems monitoring. Pattern Recognition, 37(7), 1351–1363. Shevade, S. K., & Keerthi, S. S. (2003). A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17), 2246–2253. Sim, J., Tung, W. L., & Quek, C. (2006). FCMAC–Yager: A novel Yager-inferencescheme-based fuzzy CMAC. IEEE Transactions on Neural Networks, 17(6), 1394–1410. Smola, A., & Schölkopf, B. (1998). A tutorial on support vector regression. Tech. Rep. NeuroCOLT2 Technical Report Series, NC2-TR-1998-030. Su, S. F., Lee, Z. J., & Wang, Y. P. (2006). Robust and fast learning for fuzzy cerebellar model articulation controllers. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 36(1), 203–208. Tan, A. C., Naiman, D. Q., Xu, L., Winslow, R. L., & Geman, D. (2005). Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics, 21(20), 3896–3904. Tibshirani, R., Hastle, T., Narasimhan, B., & Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academic Sciences of USA, 99(10), 6567–6572. Ting, C. W., & Quek, C. (2009). A novel blood glucose regulation using TSK0-FCMAC: A fuzzy CMAC based on the zero-ordered TSK fuzzy inference scheme. IEEE Transactions on Neural Networks, 20(5), 563–582. Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244. Wang, Y., Tetko, I. V., Hall, M., Frank, E., Facius, A., Mayer, K. F. X., et al. (2005). Gene selection from microarray data for cancer classification A machine learning approach. Computational Biology and Chemistry, 29, 37–46. Wang, Z. Q., Schiano, J. L., & Ginsberg, M. (1996). Hash-coding in CMAC neural networks. In Proceedings of the IEEE international conference on neural networks (Vol. 3, pp. 1698–1703). Widrow, B., & Stearns, S. D. (1985). Adaptive signal processing. New Jersey: Prentice Hall. Yagi, T., Morimoto, A., Eguchi, M., Hibi, S., Sako, M., Ishii, E., et al. (2003). Identification of a gene expression signature associated with pediatric AML prognosis. 
Blood, 102(5), 1849–1856. Yao, S., Pasquier, M., & Quek, C. (2007). A foreign exchange portfolio management mechanism based on fuzzy neural networks. In Proceedings of the IEEE congress on evolutionary computation, Singapore (pp. 2576–2583). Yi, J., Kloeker, S., Jensen, C., Bockholt, S., Honda, H., Hirai, H., et al. (2002). Members of the zyxin family of lim proteins interact with members of the p130cas family of signal transducers. Journal of Biological Chemistry, 277(11), 9580–9589.
