
Neurocomputing 69 (2006) 651–659 www.elsevier.com/locate/neucom

Generalized relevance LVQ (GRLVQ) with correlation measures for gene expression analysis

Marc Strickert, Udo Seiffert, Nese Sreenivasulu, Winfriede Weschke, Thomas Villmann, Barbara Hammer

Pattern Recognition Group, Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Germany; Gene Expression Group, IPK Gatersleben, Germany; Clinic for Psychotherapy, University Leipzig, Germany; Institute of Computer Science, Technical University of Clausthal, Germany

Available online 10 January 2006

Abstract

A correlation-based similarity measure is derived for generalized relevance learning vector quantization (GRLVQ). The resulting GRLVQ-C classifier makes Pearson correlation available in a classification cost framework in which data prototypes and global attribute weighting terms are adapted towards minimum cost function values. In contrast to the Euclidean metric, the Pearson correlation measure makes input vector processing invariant to shifting and scaling transforms, which is a valuable feature for dealing with functional data and with intensity observations like gene expression patterns. Two types of data measures are derived from Pearson correlation in order to make its benefits for data processing available in compact prototype classification models. Fast convergence and high accuracies are demonstrated for cDNA-array gene expression data. Furthermore, the automatic attribute weighting of GRLVQ-C is successfully used to rate the functional relevance of analyzed genes.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Prototype-based learning; Adaptive metrics; Correlation measure; Learning vector quantization; GRLVQ; Gene expression analysis

1. Introduction

Pattern classification is the key technology for solving tasks in diagnostics, automation, information fusion, and forecasting. The backbone of pattern classification is the underlying distance metric: it defines how data items are compared, and it controls the grouping of data. Thus, depending on the definition of the distance, a data set can be viewed and processed from different perspectives. Unsupervised clustering with a specific similarity measure, for example, visualized as the result of a self-organizing map (SOM), provides first hints about the appropriateness of the chosen metric for meaningful data grouping [5]. In prototype-based models like

doi:10.1016/j.neucom.2005.12.004

the SOM, a data item can be compared with an 'average' data prototype in various ways, for example according to the Euclidean distance or the Manhattan block distance. Different physical and geometric interpretations are then obtained, because the former measures diagonally across the vector space, while the latter sums up distances along each dimension axis. In any case, the specific structure of the data space can and should be accounted for by selecting an appropriate metric. Once a suitable metric is identified, it can be further utilized for the design of good classifiers. In supervised scenarios, auxiliary class information can be used for adapting parameters that improve the specificity of data metrics during data processing, as proposed by Kaski for (semi-)supervised extensions of the SOM [4]. Another metric-adapting classification architecture is generalized relevance learning vector quantization (GRLVQ), developed by Hammer and Villmann [3]. Data metrics in the mathematical sense, however, might be too restrictive for some applications in which a relaxation


to more general similarity measures would be useful. In the biological sciences, for example, functional aspects of collected data often play an important role: general spatio-temporal patterns in time series, intensity fields, or observation sequences might be more inter-related than patterns that are merely spatially close in the Euclidean sense. This applies to the aim of the present work, the analysis of gene expression patterns, for which the Pearson correlation is commonly used. Since recent technological achievements allow probing of thousands of gene expression levels in parallel, fast and accurate methods are required to deal with the resulting large data sets. Thereby, the definition of genetic similarity in terms of Pearson correlation should be possible, and the curse of dimensionality, related to the small number of available experiments in high-dimensional gene expression space, should be reduced to a minimum. Many commercial and freely available bioinformatics tools, such as ArrayMiner, GeneSpring, J-Express Pro, and Eisen's Gene Cluster, use Pearson correlation for analysis. The common goal of these programs is the identification of key regulators and clusters of co-expressed genes that determine metabolic functions in developing organisms. Usually, only the metric of algorithms initially designed for processing Euclidean data is exchanged for a one-minus-correlation term. Here, GRLVQ-C is proposed, a classifier that is mathematically derived from scratch for correlation-based classification. Its foundations are the generic update rules of generalized relevance learning vector quantization (GRLVQ, [2,3]). This allows the incorporation of auxiliary information for genetic distinction, such as the developmental stage of the probed tissues, or the stress factors applied to the growing organisms.
Using the GRLVQ approach with its rigid classification cost function, a fast, prototype-based, and intuitive classification model with very good generalization properties is derived. Both data attribute relevances and prototype locations are obtained as a result of optimizing Pearson correlations. The specific requirements of gene expression analysis are met in two ways: firstly, the implemented correlation measure accounts for the nature of gene expression experiments, which, for physico-chemical reasons, tend to differ in their overall intensities and in their dynamic ranges, but not in the general structure of the expressed patterns. Secondly, automatic relevance weighting attenuates the curse of high dimensionality. The properties and benefits of the proposed GRLVQ-C classifier are demonstrated for real-life data sets.

2. Generalized relevance LVQ (GRLVQ) and extensions

Let $X = \{(x^i, y^i) \in \mathbb{R}^d \times \{1,\dots,c\} \mid i = 1,\dots,n\}$ be a training data set with $d$-dimensional elements $x^k = (x^k_1, \dots, x^k_d)$ to be classified and $c$ classes. A set $W = \{w^1, \dots, w^K\}$ of prototypes in data space with class labels $y^i$ is used for data representation, $w^i = (w^i_1, \dots, w^i_d, y^i) \in \mathbb{R}^d \times \{1,\dots,c\}$.

The classification cost function to be minimized is given in the generic form [3]:

$$E_{\mathrm{GRLVQ}} := \sum_{i=1}^{n} g(q_\lambda(x^i)) \quad \text{with} \quad q_\lambda(x^i) = \frac{d_\lambda^{+}(x^i) - d_\lambda^{-}(x^i)}{d_\lambda^{+}(x^i) + d_\lambda^{-}(x^i)}, \qquad d_\lambda(x) := d_\lambda(x, w).$$

The classification costs of all patterns are summed, whereby $q_\lambda(x^i)$ serves as a quality measure of the classification depending on the degree of fit of the presented pattern $x^i$ to the two closest prototypes: $w^{i+}$, carrying the same label as $x^i$, and $w^{i-}$, carrying a different label. A sigmoid transfer function $g(x) = \mathrm{sgd}(x) = 1/(1 + \exp(-x)) \in (0, 1)$ is used [8]. Implicit degrees of freedom of the cost minimization are the prototype locations in the weight space and a set of adaptive parameters $\lambda$ connected to the measure $d_\lambda(x) = d_\lambda(x, w)$ comparing pattern and prototype. In prior work, $d_\lambda(x)$ was supposed to be a metric in the mathematical sense, i.e. taking only non-negative values, conforming to the triangle inequality, and yielding a distance of $d = 0$ only for $w = x$. These conditions enable intuitive interpretations of prototype relationships. However, if just a well-performing classifier invariant to certain features is wanted, the distance conditions may be relaxed to a mere similarity measure plugged into the algorithm. Overall similarity maximization can be expressed in the GRLVQ framework by flipping the sign of the measure and then simply keeping the minimization of $E_{\mathrm{GRLVQ}}$. Since the iterative GRLVQ update implements a gradient descent on $E$, $d$ must be differentiable almost everywhere, no matter whether it acts as a distance or as a similarity measure. Partial derivatives of $E_{\mathrm{GRLVQ}}$ yield the generic update formulas for the closest correct prototype, the closest wrong prototype, and the metric weights:

$$\Delta w^{i+} = -\gamma^{+} \cdot \frac{\partial E_{\mathrm{GRLVQ}}}{\partial w^{i+}} = -\gamma^{+} \cdot g'(q_\lambda(x^i)) \cdot \frac{2 \cdot d_\lambda^{-}(x^i)}{(d_\lambda^{+}(x^i) + d_\lambda^{-}(x^i))^2} \cdot \frac{\partial d_\lambda^{+}(x^i)}{\partial w^{i+}},$$

$$\Delta w^{i-} = -\gamma^{-} \cdot \frac{\partial E_{\mathrm{GRLVQ}}}{\partial w^{i-}} = \gamma^{-} \cdot g'(q_\lambda(x^i)) \cdot \frac{2 \cdot d_\lambda^{+}(x^i)}{(d_\lambda^{+}(x^i) + d_\lambda^{-}(x^i))^2} \cdot \frac{\partial d_\lambda^{-}(x^i)}{\partial w^{i-}},$$

$$\Delta \lambda = -\gamma_\lambda \cdot \frac{\partial E_{\mathrm{GRLVQ}}}{\partial \lambda} = -\gamma_\lambda \cdot g'(q_\lambda(x^i)) \cdot \frac{2 \cdot \left( \frac{\partial d_\lambda^{+}(x^i)}{\partial \lambda} \cdot d_\lambda^{-}(x^i) - d_\lambda^{+}(x^i) \cdot \frac{\partial d_\lambda^{-}(x^i)}{\partial \lambda} \right)}{(d_\lambda^{+}(x^i) + d_\lambda^{-}(x^i))^2}.$$

Learning rates are $\gamma_\lambda$ for the metric parameters $\lambda_j$, all initialized equally to $\lambda_j = 1/d$, $j = 1, \dots, d$; $\gamma^{+}$ and $\gamma^{-}$ describe the prototype update amounts. Their choice depends on the used measure; generally, they should be chosen according to the relation $0 \le \gamma_\lambda \ll \gamma^{-} \le \gamma^{+} \le 1$ and decreased within these constraints during training. Metric adaptation should be realized slowly, as a reaction to the quasi-stationary solutions for the prototype positions. The above set of equations is a convenient starting point to test different concepts of similarity by simply inserting the denoted partial derivatives of $d_\lambda(x)$.
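For concreteness, one stochastic GRLVQ learning step under these generic update rules can be sketched as follows. This is an illustrative numpy implementation (function names and learning rates are ours, not from the original software), instantiated with the squared weighted Euclidean distance $d_\lambda(x, w) = \sum_j \lambda_j (x_j - w_j)^2$ as the plug-in measure:

```python
import numpy as np

def grlvq_step(x, y, prototypes, labels, lam,
               lr_plus=0.1, lr_minus=0.05, lr_lam=0.001):
    """One stochastic GRLVQ update for pattern x with class y.

    prototypes: (K, d) array, labels: (K,) prototype classes,
    lam: (d,) relevance factors; d_lam(x, w) = sum_j lam_j * (x_j - w_j)**2.
    """
    d = np.sum(lam * (x - prototypes) ** 2, axis=1)
    correct = labels == y
    ip = np.flatnonzero(correct)[np.argmin(d[correct])]    # closest correct prototype
    im = np.flatnonzero(~correct)[np.argmin(d[~correct])]  # closest wrong prototype
    dp, dm = d[ip], d[im]
    s = 1.0 / (1.0 + np.exp(-(dp - dm) / (dp + dm)))       # sgd(q)
    gprime = s * (1.0 - s)                                 # sgd'(q)
    denom = (dp + dm) ** 2
    diff_p, diff_m = x - prototypes[ip], x - prototypes[im]
    # relevance gradient: dd/dlam_j = (x_j - w_j)**2
    grad_lam = gprime * (2.0 / denom) * (dm * diff_p ** 2 - dp * diff_m ** 2)
    # Hebbian attraction of w+ towards x, repulsion of w- (dd/dw_j = -2*lam_j*diff_j)
    prototypes[ip] += lr_plus * gprime * (2.0 * dm / denom) * 2.0 * lam * diff_p
    prototypes[im] -= lr_minus * gprime * (2.0 * dp / denom) * 2.0 * lam * diff_m
    lam = np.clip(lam - lr_lam * grad_lam, 0.0, None)
    lam /= lam.sum()                                       # keep sum(lam) = 1
    return prototypes, lam
```

Iterated over epochs and randomly presented patterns, this performs the stochastic gradient descent on $E_{\mathrm{GRLVQ}}$; exchanging the measure only requires replacing the two partial derivatives in the marked lines.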


3. Metrics and similarity measures

The missing ingredient for carrying out comparisons is either a distance metric or a more general similarity measure $d_\lambda(x, w)$. In contrast to metrics, similarity measures are sometimes also called dis-similarity measures, because they are maximal for the best match, which is opposed to the semantics of metrics. For reference, the formulas for the weighted Euclidean distance are revisited first. Then, by relaxing the conditions of metrics, two types of measures are derived from the Pearson correlation coefficient, both of which inherit the invariance to component offsets and amplitude scaling. This prototype invariance, implemented by the presented update dynamic, is desirable in situations where mainly frequency information and simple curve-shape matching are of interest. More details on functional shape representations and functional data processing with neural SOM, RBF, and MLP networks are given by Rossi et al. [6,7], and for use with support vector machines by Villa and Rossi [11].

3.1. Weighted Euclidean metric

The weighted Euclidean metric yields the following set of formulas [12]:

$$d_\lambda^{\mathrm{Euc}}(x, w^i) = \sum_{j=1}^{d} \lambda_j^{b_\lambda} \cdot (x_j - w^i_j)^{b_w}, \quad \text{integers } b_\lambda, b_w \ge 0,\ b_w \text{ even},$$

$$\frac{\partial d_\lambda^{\mathrm{Euc}}(x, w^i)}{\partial w^i_j} = -b_w \cdot \lambda_j^{b_\lambda} \cdot (x_j - w^i_j)^{b_w - 1},$$

$$\frac{\partial d_\lambda^{\mathrm{Euc}}(x, w^i)}{\partial \lambda_j} = b_\lambda \cdot \lambda_j^{b_\lambda - 1} \cdot (x_j - w^i_j)^{b_w}.$$

For simplicity, roots have been omitted. In the squared case with $b_w = 2$, the derivative for the prototype update, $2 \cdot (x_j - w^i_j)$, contains the well-known Hebbian learning term. In other cases, large $b_w$ tend to focus on dimensions with large differences, and small $b_w$ on dimensions with small differences. Proven values for the exponent of the relevance factors are $b_\lambda \in \{1, 2\}$. Normalization to $\sum_{i=1}^{d} \lambda_i = 1$, $\lambda_i \ge 0$, is necessary after each update step to keep the parameters from diverging or collapsing.

3.2. Correlation-based measures

In the following, a correlation-based classification is derived from the term

$$r = d_r(x, w^i) = \frac{\sum_{l=1}^{d} (w^i_l - \mu_{w^i}) \cdot (x_l - \mu_x)}{\sqrt{\sum_{l=1}^{d} (w^i_l - \mu_{w^i})^2} \cdot \sqrt{\sum_{l=1}^{d} (x_l - \mu_x)^2}} \in [-1, 1], \qquad (1)$$





Fig. 1. Data patterns compared with different similarity functions. Relation characterizations for the squared Euclidean metric differ from those for Pearson correlation: $d^{\mathrm{Euc}}(\mathrm{RS}, \mathrm{P1}) = 0.82 < d^{\mathrm{Euc}}(\mathrm{RS}, \mathrm{P2}) = 1.81$, i.e. P1 closer to RS than P2; but $d_r(\mathrm{RS}, \mathrm{P1}) = -0.53 < d_r(\mathrm{RS}, \mathrm{P2}) = 0.89$, i.e. P2 more similar to RS (highly correlated) than P1 (anti-correlated).

which is the Pearson correlation coefficient; therein, $\mu_y$ denotes the mean value of vector $y$. As illustrated in Fig. 1, this correlation possesses fundamentally different properties than the Euclidean distance: depending on the applied similarity function, the two patterns compared with a reference pattern yield opposite relations. Simple data preprocessing cannot transform a correlation-based classification problem into an equivalent one solvable with the Euclidean metric. As a rough rule of thumb: if a prototype with 'sufficient' variance is similar to input points in the Euclidean sense, then it is very likely that it is also highly correlated with them. The other direction is untrue: if high correlation exists, there might still be a large Euclidean distance. Thus, potentially fewer prototypes are necessary for representations based on correlation, and sparser data models can be realized. The straightforward definition of Pearson correlation by Eq. (1), however, is not suitable for direct implementation in GRLVQ: firstly, the required cost function minimization conflicts with the desired maximization of Pearson similarity between data and prototypes; secondly, only the small range of values $-1 \le r \le 1$ is available for expressing best match versus worst match, which yields sub-optimal convergence of $E_{\mathrm{GRLVQ}}$. As a first approach, one might think of a version of Fisher's $Z' = 0.5 \cdot (\ln(1 + r) - \ln(1 - r))$ as a standard transformation of Pearson correlation. This, however, leads to unstable behavior, because almost perfect (dis-)correlation is mapped to arbitrarily large absolute values. Therefore, inverse fractions of appropriately reshaped functions of $r$ are considered in the following. The derivations presented here unify and improve the transformations given in the authors' prior work [9].
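The claimed invariance is easy to check numerically. A small illustrative sketch (numpy; the function and variable names are ours): Pearson correlation is unchanged by shifting and positively scaling an input vector, while the Euclidean distance is not:

```python
import numpy as np

def pearson(x, w):
    """Pearson correlation r = d_r(x, w) as in Eq. (1), without relevances."""
    xc, wc = x - x.mean(), w - w.mean()
    return float(xc @ wc / np.sqrt((xc @ xc) * (wc @ wc)))

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
w = np.array([0.5, 2.0, 1.0, 4.0, 3.5])

# shift and scale invariance: r(a*x + b, w) == r(x, w) for a > 0
assert abs(pearson(3.0 * x + 10.0, w) - pearson(x, w)) < 1e-12
# perfect correlation and anti-correlation mark the range limits
assert abs(pearson(x, 2.0 * x + 1.0) - 1.0) < 1e-12
assert abs(pearson(x, -x) + 1.0) < 1e-12
# the Euclidean distance, in contrast, changes under the same transform
assert np.linalg.norm(3.0 * x + 10.0 - w) != np.linalg.norm(x - w)
```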
Since metric adaptivity is a very beneficial property for rating individual data attributes, free parameters are added to the Pearson correlation in such a way that the meaning of correlation can be fully maintained. Then, by paying attention to the current prototype w:¼wi , the numerator of


Eq. (1) becomes

$$H := \sum_{l=1}^{d} \lambda_l^2 \cdot (w_l - \mu_w) \cdot (x_l - \mu_x) = \lambda_j^2 \cdot (w_j - \mu_w) \cdot (x_j - \mu_x) + \bar{H}_j(\mu_w) \quad \text{(focusing on component } j\text{)},$$

$$\bar{H}_j(\mu_w) = \sum_{l=1,\, l \ne j}^{d} \lambda_l^2 \cdot (w_l - \mu_w) \cdot (x_l - \mu_x), \qquad \mu_w = \mu_w(w_j) = \frac{1}{d} \cdot w_j + \frac{1}{d} \cdot \sum_{l=1,\, l \ne j}^{d} w_l.$$

The focus on component $j$ will be a convenient abbreviation for deriving the formulas for prototype update and relevance adaptation. Each of the mean-subtracted weight and pattern components, $(w_j - \mu_w)$ and $(x_j - \mu_x)$, has its own relevance factor $\lambda_j$. This is reflected in both rewritten denominator factors of Eq. (1), again with a focus on weight vector component $j$:

$$W := \sum_{l=1}^{d} \lambda_l^2 \cdot (w_l - \mu_w)^2, \qquad X := \sum_{l=1}^{d} \lambda_l^2 \cdot (x_l - \mu_x)^2,$$

$$W(w_j, \lambda_j) = \lambda_j^2 \cdot (w_j - \mu_w)^2 + \bar{W}_j, \qquad \bar{W}_j = \sum_{l=1,\, l \ne j}^{d} \lambda_l^2 \cdot (w_l - \mu_w)^2.$$

Using the defined shortcuts, the adaptive Pearson correlation can be written as $r_\lambda = H / \sqrt{W \cdot X}$. Two types of measures are obtained by a unified transform:

$$R = \left( \frac{1}{C + r_\lambda(x, w)} \right)^{k} - R_{\min} = \left( C + \frac{H}{\sqrt{W \cdot X}} \right)^{-k} - R_{\min} =: R_k - R_{\min}. \qquad (2)$$

The resulting classifiers are characterized as follows:

C0: One type of classifier is obtained for $C = 0$, even integer exponents $k \ge 2$, and minimum $R_{\min} = 1$. This classifier allows the separation of both correlation and anti-correlation from mere dis-correlation. The minimum value $R_{\min}$ is subtracted in order to obtain sharp zeros for perfect matches. In computer implementations, a special treatment of the unlikely case of extreme dis-correlation might be considered in order to avoid division by (near-)zero values. C0-prototypes match both correlated patterns and their inverted, anti-correlated counterparts, which allows the realization of very compact classifiers. For specific data, however, this type of classification might lead to an undesired intermingling of data profile shapes, as can occur in gene expression analysis.

C1: The other type of classifier separates correlated patterns from anti-correlated ones. The C1 model is realized by $C = 1$, an integer exponent $k \ge 1$, and $R_{\min} = 2^{-k}$; here, only the rare occurrence of extreme anti-correlation might require a handling of singular values in computer realizations. The C1 setup allows classification with intuitive Pearson correlations, a feature that is well suited for co-expression analysis of gene intensities.

For calculating derivatives of $R$, the constant expression $R_{\min}$ can be omitted. Solutions can be obtained manually or by using computer algebra systems. In the latter case, after some rearrangements, the following equations are found:

$$\frac{\partial R(w_j, \lambda_j)}{\partial w_j} = \frac{\partial}{\partial w_j} \left( C + \frac{H(w_j, \lambda_j)}{(W(w_j, \lambda_j) \cdot X(\lambda_j))^{1/2}} \right)^{-k} = F \cdot \left( H'(w_j) \cdot W \cdot X - \tfrac{1}{2} \cdot H \cdot W'(w_j) \cdot X \right),$$

$$\frac{\partial R}{\partial \lambda_j} = F \cdot \left( H'(\lambda_j) \cdot W \cdot X - \tfrac{1}{2} \cdot H \cdot W'(\lambda_j) \cdot X - \tfrac{1}{2} \cdot H \cdot W \cdot X'(\lambda_j) \right),$$

using the factor $F = -k \cdot (C + r_\lambda)^{-k-1} / (W \cdot X)^{3/2}$. The missing derivatives are

$$H'(w_j) = \lambda_j^2 \cdot (x_j - \mu_x) - \frac{1}{d} \cdot \sum_{l=1}^{d} \lambda_l^2 \cdot (x_l - \mu_x),$$

$$W'(w_j) = 2 \cdot \lambda_j^2 \cdot (w_j - \mu_w) - \frac{2}{d} \cdot \sum_{l=1}^{d} \lambda_l^2 \cdot (w_l - \mu_w),$$

$$H'(\lambda_l) = 2 \cdot \lambda_l \cdot (w_l - \mu_w) \cdot (x_l - \mu_x), \qquad W'(\lambda_l) = 2 \cdot \lambda_l \cdot (w_l - \mu_w)^2, \qquad X'(\lambda_l) = 2 \cdot \lambda_l \cdot (x_l - \mu_x)^2.$$

These formulas contain plausible Hebb terms, adapting $w_j$ into the direction of $(x_j - \mu_x)$ and away from $(w_l - \mu_w)$ in case of correct classification, whereby further scaling factors come from the cost function. Similarly, $\lambda_l$ is adapted according to the correlation $(w_l - \mu_w) \cdot (x_l - \mu_x)$ in comparison to the variances of these terms. Note that very efficient computer implementations can be realized, because most calculations can already be done during the pattern presentation phase. There, the similarity measure $R$ and all its constituents $H$, $W$, $X$, and some further terms are computed, and they can be stored for each prototype for later reuse in the update phase. Again, the normalization of the relevance factors to $\sum_{i=1}^{d} \lambda_i = 1$, $\lambda_i \ge 0$, is advised in order to avoid numerical instabilities and to make different training runs comparable.

Two finally discussed practical issues concern the handling of singular expressions and the choice of the exponent $k$. Singular states occur if the variance of one of the vectors is absent.
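Under the stated definitions, the measure of Eq. (2) is straightforward to implement. The following numpy sketch (our illustrative code, not the authors' implementation) computes the relevance-weighted correlation $r_\lambda = H/\sqrt{W \cdot X}$ and the transformed similarity $R$ for both classifier types:

```python
import numpy as np

def r_lambda(x, w, lam):
    """Adaptive Pearson correlation r_lam = H / sqrt(W * X)."""
    xc, wc = x - x.mean(), w - w.mean()
    H = np.sum(lam ** 2 * wc * xc)
    W = np.sum(lam ** 2 * wc ** 2)
    X = np.sum(lam ** 2 * xc ** 2)
    return H / np.sqrt(W * X)

def R(x, w, lam, C, k):
    """Measure R = (C + r_lam)**(-k) - R_min of Eq. (2).

    C = 0 with even k >= 2 gives the C0 type (R_min = 1);
    C = 1 with integer k >= 1 gives the C1 type (R_min = 2**(-k)).
    """
    R_min = 1.0 if C == 0 else 2.0 ** (-k)
    return (C + r_lambda(x, w, lam)) ** (-k) - R_min
```

For a perfectly correlated pair, both variants yield $R = 0$ by construction of $R_{\min}$; for C0, an anti-correlated pair also yields $R = 0$ (even exponent), which is exactly the matching of inverted profiles described above, while for C1 anti-correlation drives $R$ towards arbitrarily large values, the singular case noted in the text.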

If, for example, a pattern vector $x$ is affected, then the simplified limit correlation

$$\lim_{x_l \to \mu_x} \frac{\sum_{l=1}^{d} (x_l - \mu_x) \cdot (w_l - \mu_w)}{\sqrt{\sum_{l=1}^{d} (x_l - \mu_x)^2 \cdot \sum_{l=1}^{d} (w_l - \mu_w)^2}} = \lim_{x \to \mu_x} \frac{(x - \mu_x) \cdot \sum_{l=1}^{d} (w_l - \mu_w)}{\sqrt{d \cdot (x - \mu_x)^2 \cdot \sum_{l=1}^{d} (w_l - \mu_w)^2}} = \frac{0 \cdot \sum_{l=1}^{d} (w_l - \mu_w)}{\sqrt{d \cdot \sum_{l=1}^{d} (w_l - \mu_w)^2}} = 0$$

is of interest. In this case, all prototypes would end up with zero correlations and impracticable terms $F$. Analogous reasoning holds in the rare case of equal prototype components, which, in practice, would occur through inappropriate initialization rather than through the update dynamic. Here, prototypes are assumed to be initialized by data instances. By skipping the degenerate constant patterns or prototypes, such unpleasant situations can be effectively avoided. Alternatively, a single randomly picked component can be set to a different value, which, on average, produces the desired state of uncorrelatedness, even for two vectors with simultaneously equal components subject to that modification.

The free parameter $k$ influences the speed of convergence and the generalization ability. Integer values in the range $1 \le k < 20$ have been found to be reasonable choices in experiments. Too high values lead to fast adaptation, but sometimes also to over-fitting or to unstable convergence, unless a very small learning rate $\gamma$ is chosen. Good initial exponents are $k = 7$ or $8$, odd and/or even according to the desire for a C1 or a C0 type classifier. For the presented experiments, training is only a matter of one or two minutes; therefore, systematic parameter searches can be realized.

4. Experiments

Three experiments underline the usefulness of correlation-based classification. First, a proof of concept is given for the Tecator benchmark data, which is a set of absorbance spectra. Then the focus is put on cDNA array classification: in the second experiment, a distinction between two types of leukemia is sought from gene spotting intensities; this involves the classification of 72 complete cDNA microarrays, i.e. 7129-dimensional gene expression intensity vectors, and the rating of these genes for their relevance to classification. The last experiment detects systematic differences between two series of gene expression records by analyzing two corresponding sets of seven-dimensional expression patterns with 1421 genes each.

4.1. Tecator spectral data

The first data set, publicly available at http://lib.stat.cmu.edu/datasets/tecator, contains 215 samples of 100-dimensional infrared absorbance spectra measured with the Tecator Infratec Food and Feed Analyzer. The task is to predict the binary fat content, low or high, of meat by spectrum classification, thereby using random data partitions of 120 training patterns and 95 test patterns, as suggested in [11]. It is known that the Euclidean distance is not appropriate for raw spectrum comparison, and the question of interest is whether Pearson correlation yields any benefit. Fig. 2, top panel, shows some of the spectra with their corresponding classes. Apart from a tendency towards dints around channel 41 for high fat content, a substantial visual data overlap can be stated. This is reflected in the results for Euclidean-based classifiers: k-nearest neighbor (k-NN) reaches its best classification results of about 80% accuracy for k = 3; GRLVQ with the squared Euclidean metric reaches 88% on the test set at 94% training accuracy. However, with correlation-based classification the problem gets easier, and results above 97% are obtained. For comparison, k-NN with maximum correlation neighborhood is taken, k-NN-C for short. Table 1 contains the average numbers of misclassifications for each classifier, each trained 25 times. In contrast to k-NN-C, which utilizes all available training data for classification, GRLVQ-C training requires only 20 prototypes per class, trained for 500 epochs. The learning rates for both types C0 and C1 without relevance learning are $\gamma_\lambda = 0$ and $\gamma^{+} = \gamma^{-} = 10^{-8}$, and the exponents are $k = 6$ and $5$, respectively. In additional runs, relevance learning is switched on by choosing a relatively large non-zero learning rate $\gamma_\lambda = 10^{-7}$ while the prototype adaptation rates are kept at $\gamma^{+} = \gamma^{-} = 10^{-8}$. To summarize Table 1, GRLVQ-C classification is superior to k-NN-C. The differences between C0 and C1 accuracies become visible beyond non-relevance-based learning: while C0 does not profit much from relevance adaptation, C1 does account for it, as some of the runs ended with no misclassifications at all.

The bottom panel of Fig. 2 shows individual and average relevance profiles for the C1 classifier. As an intuitive result, the apparent discriminators

Fig. 2. Upper panel: sample spectra from the Tecator data set (classes 0 and 1). Lower panel: GRLVQ-C1 relevance profiles (individual profiles, average, and standard-deviation boundary) over the frequency channels.
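For reference, the k-NN-C baseline used above, k-nearest neighbors with a maximum-correlation neighborhood, can be sketched as follows (an illustrative implementation; names are ours, and the original comparison code is not reproduced here):

```python
import numpy as np

def knn_c_predict(x, train_X, train_y, k=3):
    """Classify x by majority vote among the k training patterns
    that are most correlated (Pearson) with x."""
    xc = x - x.mean()
    Tc = train_X - train_X.mean(axis=1, keepdims=True)
    r = (Tc @ xc) / (np.linalg.norm(Tc, axis=1) * np.linalg.norm(xc))
    nearest = np.argsort(-r)[:k]                     # k most correlated neighbors
    classes, votes = np.unique(train_y[nearest], return_counts=True)
    return classes[np.argmax(votes)]
```

Rescaling training and test vectors component-wise by a relevance profile learned with GRLVQ-C before computing the correlation corresponds to the 'relevance-rescaled' k-NN-C variant reported in Table 1.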


Table 1
Tecator correlation-based classification results for the test set

                GRLVQ-C0        GRLVQ-C1        1-NN-C          3-NN-C          7-NN-C
                no rel.  rel.   no rel.  rel.   no rel.  rel.   no rel.  rel.   no rel.  rel.
Misclassified   3.32     2.04   2.4      2.16   3.96     2.52   5.04     2.84   6.4      3.24

Average numbers of misclassifications are shown for 25 runs of each classifier. k-NN-C utilizes the maximum correlation neighborhood. Relevance utilization is indicated by 'rel.', denoting metric adaptation for GRLVQ-C and application of relevance-rescaled data for k-NN-C.

Table 2
Leukemia list of candidate genes for differentiating between types AML and ALL

Gene-#  Found  ID      Name
1745    Yes    M16038  LYN Yamaguchi sarc. vir. rel. oncog. homolog
1834    Yes    M23197  CD33 antigen (differentiation antigen)
1882    Yes    M27891  CST3 Cystatin C (amyloid cerebral hemorrhage)
2354    Yes    M92287  CCND3 Cyclin D3
4190    No     X16706  FOS-related antigen 2
4211    No     X51521  VIL2 Villin 2 (ezrin)
4847    Yes    X95735  Zyxin
5954    No     Y00339  CA2 Carbonic anhydrase II
6277    No     M30703  Amphiregulin (AR) gene
6376    Yes    M83652  PFC Properdin P factor, complement

The 'Found' column indicates whether the specific gene was also identified by the team of Golub et al. [1].

shown at particular channels of the data, such as channel 41, get amplified, while less important channels are suppressed. Although k-NN-C decreases in performance for larger k-neighborhoods, its results can be improved by transforming the input data according to the scaling factors shown in the relevance plots. The GRLVQ-C scaling weights can thus be used to boost k-NN-C classification accuracies, which underlines a more general validity of the found data scaling properties. To conclude, sparse and accurate GRLVQ-C classifiers are obtained for the Tecator data set without further data preprocessing. The built-in relevance detection yields highly interpretable results which, in the following, help to identify key regulators in gene expression experiments.

4.2. Leukemia cancer type detection

The second task is gene expression analysis, where the GRLVQ-C property of automatic attribute weighting is used for gene ranking. Data are taken from cDNA arrays, which are powerful tools for probing in parallel the expression levels of thousands of genes extracted from organic tissue cells. A very important issue in gene expression analysis is the identification of functionally relevant genes. Particularly medical diagnostics and therapies profit from the isolation of small sets of candidate genes responsible for defective or mutative operations. In cancer research, many well-documented data sets and publications are available online. One of the discussed problems is the differentiation between two types of

leukemia: the acute lymphoblastic leukemia (ALL) and the acute myeloid leukemia (AML). Background information is provided by Golub et al. [1]. The corresponding data sets and further online material can be found at http://www.broad.mit.edu/. The available data contain real-valued expression levels of 7129 genes (some redundant) for each of the 38 training samples (27 ALL, 11 AML) and of the 34 test samples (20 ALL, 14 AML). In order to distinguish correlation from anti-correlation, GRLVQ-C1 is considered. Training has been carried out for a minimalistic GRLVQ-C1 model, using only one prototype per class, which prevents over-fitting in the 7129-dimensional space. Learning rates are $\gamma^{+} = \gamma^{-} = 2.5 \times 10^{-3}$, the exponent $k = 2$ is taken, and 1000 epochs are trained (Table 2). Table 3 shows that the average results of 100 runs with only prototype adaptation are rather poor in contrast to the neighborhood analysis method of Golub et al.; however, allowing relevance adaptation at $\gamma_\lambda = 5 \times 10^{-8}$, the GRLVQ-C1 accuracies are drastically improved. Thus, the obtained relevance factors must explain correlative differences between AML and ALL. Yet, this statement does not claim biological truth. For validation purposes, the genes have been ranked according to their relevance values for those 19 of 100 independently trained classifiers that showed perfect results on the training and test set. A list of the top ten genes with the highest sum of their 19 ranks has been extracted and matched against a longer list of 50 genes given by Golub and his group. The results are given in Table 2. Remarkably, six of the identified


Table 3
Leukemia data set classification results

Set    GRLVQ-C1 (rel.)   GRLVQ-C1 (no rel.)   Neigh.-Analysis
Train  0.19              5.96                 2
Test   2.1               7.14                 5

Average numbers of misclassifications are shown for 100 runs of each GRLVQ-C1 classifier. Relevance utilization is indicated by 'rel.'. The results for the neighborhood analysis (done a single time) are taken from Golub et al. [1].
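The gene ranking behind the candidate list of Table 2, ranking genes by their relevance values within each perfectly classifying run and summing the per-run ranks, can be sketched like this (illustrative code with stand-in data; function and variable names are ours):

```python
import numpy as np

def top_genes_by_relevance(relevance_runs, n_top=10):
    """relevance_runs: (runs, genes) array of learned relevance profiles.

    Within each run, the most relevant gene gets rank 0; genes are then
    ordered by their rank sum over all runs, and the n_top best returned."""
    # double argsort turns relevance values into per-run ranks
    ranks = np.argsort(np.argsort(-relevance_runs, axis=1), axis=1)
    return np.argsort(ranks.sum(axis=0))[:n_top]
```

With rank 0 denoting the most relevant gene, the smallest rank sum corresponds to the 'highest sum of ranks' criterion of the text under the opposite rank orientation; applied to the 19 perfect runs, the ten best genes give the candidate list.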


Fig. 3. Leukemia data embedded by correlation-based multi-dimensional scaling. Left: original data set. Right: data set after application of GRLVQ-C1 relevance factors. Class prototypes are indicated by large symbols.
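The significance of the gene-list overlap discussed in this section is a hypergeometric tail probability: the chance that at least six of ten genes picked at random out of 7129 land in a fixed list of 50. It can be reproduced directly with standard-library combinatorics (sketch):

```python
from math import comb

# P(at least 6 of 10 randomly chosen genes out of 7129 fall into a fixed
# candidate list of 50) -- a hypergeometric tail
P = sum(comb(50, k) * comb(7129 - 50, 10 - k) for k in range(6, 11)) / comb(7129, 10)
print(P)  # on the order of 1.8e-11
```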

genes are consistent with the reference list. The event of 'finding at least six matches' has a vanishing probability of

$$P = \sum_{k=6}^{10} \binom{50}{k} \binom{7129 - 50}{10 - k} \Bigg/ \binom{7129}{10} \approx 1.8 \times 10^{-11}$$

for randomly selected genes. This means that, although only 10 instead of 50 genes are considered for brevity here, these genes are confirmed as being of special importance. The functional role of the other four identified genes might still be of interest from a biological point of view, but this remains open for investigation in further studies. Finally, the grouping of all 72 data samples is visualized together with the GRLVQ-C prototypes by correlation-based multi-dimensional scaling (HiT-MDS [10]) in Fig. 3. For the original data, shown in the left panel, there is a rough unsupervised separation of the 7129-dimensional gene expression vectors according to their type AML or ALL. The corresponding GRLVQ-C1 prototypes define a tight data separation boundary which is still imperfect due to noisy data grouping. However, after the rescaling transform utilizing the GRLVQ-C1 relevance factors, much clearer data clusters are obtained, as shown in the right panel of Fig. 3. This visual aid re-emphasizes how the curse of dimensionality is effectively circumvented by using an adaptive metric driven by the available auxiliary class information.

4.3. Validation of gene expression experiments

The third study is connected to developmental gene expression series obtained from macroarrays. Expression

patterns of 1421 genes were collected from filial tissues of barley seeds during seven developmental stages between 0 and 12 days after flowering in steps of two days. For control purposes, each experiment has been repeated using two sets of independently grown plant material. The question of interest is, whether a systematic difference can be found in the gene expression profiles resulting from the two experimental series. Thus, 1421 data vectors in seven dimensions are considered for each of the two classes related to series 1 and 2. Random partitions into 50% training and 50% test sets are trained for 2500 epochs and 25 runs for each classification model, GRLVQ-C0 with k ¼ 8, GRLVQ-C1 with k ¼ 7, and GRLVQ-Euc. The exponents have been determined in a number of runs as a compromise between speed of convergence, related to small exponents, and overfitting observed for high values. Table 4 contains the average results of the classifiers with optimum model size. GRLVQ-C0 uses only three prototypes, one for series one, and two for series two. This asymmetry has proofed to be beneficial for classification. Likewise, GRLVQ-C1 makes use of two and three prototypes for series one and two, respectively. The squared Euclidean GRLVQ-Euc yields about guessing results for two prototypes per class; accuracies get better for 20 prototypes per class, but then the generalization is rather poor. All classifiers but the small Euclidean one indicate a detectable difference between the two series of experiments. However, the GRLVQ-C1 -classifier that maintains the opposition of correlation and anti-correlation is a good choice with respect to model size and generalization ability. The relevance profiles for the three classifiers are given in Fig. 4. They show a rough correspondence in identifying relevant developmental stages within a range of 4–8 days. 
However, the details must be considered with care, because different configurations of the relevance profiles are found to minimize the E_GRLVQ cost function, especially for the C1 type. Thus, the data space is too homogeneously filled to emphasize specific dimensions clearly. This is also reflected in the small variability of the relevance factors \lambda_i; in this case, larger relevance learning rates produce unstable solutions. Nevertheless, a biologically consistent interpretation of the relevance profiles has been found: further biological investigations have supported a slight shift in the assignment of developmental stages between the two sets of independent experiments. In the conducted gene expression experiments, a robust transcriptional

ARTICLE IN PRESS M. Strickert et al. / Neurocomputing 69 (2006) 651–659


Table 4
GRLVQ classification accuracies for differentiating between the two series of macroarray experiments; the number of prototypes used is given in brackets

Set      GRLVQ-C0 (3 pt.)   GRLVQ-C1 (5 pt.)   GRLVQ-Euc (4 pt.)   GRLVQ-Euc (40 pt.)
Train    68.07%             66.91%             53.14%              68.38%
Test     64.95%             66.44%             49.66%              58.32%

reprogramming occurred during the intermediate stage, related to days 4–8 of filial tissue development. Although the overall expression data of the two sets of experiments are hardly distinguishable in practice, the slight systematic influence of the assignment of developmental stages affects gene expression during this intermediate phase. These slight differences were detected and could be well exploited by the GRLVQ classifiers, which confirms their use for processing biological observations.

Fig. 4. GRLVQ relevance profiles (relevance factor vs. developmental stage in days after flowering, 0–12) that enhance the distinction of the two experimental gene expression series; each panel shows the relevance profile together with its average and standard deviation. From top to bottom: Euclidean, C0, C1. Different characteristics occur depending on the underlying similarity measure.

5. Conclusions

Adaptive correlation-based similarity measures have been successfully derived and integrated into the existing mathematical framework of GRLVQ learning. The experiments with the GRLVQ-C classifiers show that there is much potential in using non-Euclidean similarity measures. GRLVQ-C1 maintains the opposition of correlation and anti-correlation, while GRLVQ-C0 opposes both characteristics against dis-correlation, which leads to structurally different classifiers. The GRLVQ-C0 pattern matching is somewhat analogous to the property of Hopfield networks, which do not distinguish a pattern from an inverted copy of it. Through the use of Pearson correlation, no preprocessing is required to become independent of data contrast related to scaling and shifting. As a consequence, fewer prototypes are necessary than with the Euclidean metric to represent certain types of data: it has been shown that both the functional Tecator data and the gene expression classification profit from using correlation measures. High sensitivity to specific differences in the data is realized, and very good classification results are obtained. Many other areas of GRLVQ-C application can be thought of, ranging from image processing to mass spectroscopy; these areas profit from relaxed pattern matching in contrast to strict metric-based classification. A very important property of the proposed types of correlation measures, C0 and C1, is their adaptivity for enhanced data discrimination from a global data perspective. As the experiments have shown, relevance scaling helps to find interesting data attributes and thereby drastically increases classification accuracies for high-dimensional data. Even standard methods, like the demonstrated k-NN-C and the MDS visualization technique, can be made much more expressive if the data are preprocessed with the GRLVQ-C scaling factors.
Yet, apart from the practical benefits, an interesting theoretical problem remains: to which extent can the large-margin properties of GRLVQ-Euc be transferred to the new correlation measures of GRLVQ-C? This and a number of further issues will be addressed in future work. GRLVQ-C is available online as an instance of SRNG-GM at http://srng.webhop.net/.
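Two properties claimed for the correlation measures, invariance to shifting and scaling of the inputs, and the Hopfield-like treatment of inverted patterns by C0, can be illustrated with plain Pearson correlation. The distance forms d_c0 and d_c1 below are simplified sketches of my own: the global relevance weights are omitted, and the exact exponentiated forms defined in the paper may differ.

```python
import numpy as np

def pearson(x, w):
    """Pearson correlation between a data vector and a prototype."""
    xc, wc = x - x.mean(), w - w.mean()
    return float(xc @ wc / (np.linalg.norm(xc) * np.linalg.norm(wc)))

def d_c1(x, w, k=7):
    """C1-style distance: keeps correlation opposed to anti-correlation."""
    return ((1.0 - pearson(x, w)) / 2.0) ** k

def d_c0(x, w, k=8):
    """C0-style distance: a pattern and its inverted copy are equally close."""
    return (1.0 - pearson(x, w) ** 2) ** k

x = np.array([1.0, 2.0, 4.0, 3.0])
y = 5.0 * x + 2.0   # shifted and rescaled version of x

# Shifting/scaling leaves the correlation-based distance unchanged ...
print(d_c1(x, y))            # ~0: y is perfectly correlated with x
# ... while the Euclidean distance changes drastically.
print(np.linalg.norm(x - y))

# C0 does not distinguish a pattern from its inverted copy, C1 does.
print(d_c0(x, -x))           # ~0
print(d_c1(x, -x))           # ~1
```

Note how the exponent k only sharpens the contrast between small and large correlation-derived distances; it does not change which prototype is closest.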


Acknowledgements

Thanks to Dr. Volodymyr Radchuk for the macroarray hybridization experiments. We are grateful to Prof. Wolfgang Stadje, University of Osnabrück, for his solution to the combinatorial probability of accidentally identified relevant genes. Many thanks also for the precise remarks of the anonymous reviewers. The present work is supported by BMBF Grant FKZ 0313115, GABI-SEED-II.


Udo Seiffert has been the head of the Pattern Recognition group at the Leibniz-Institute of Plant Genetics and Crop Plant Research Gatersleben (IPK), Germany, since 2002. He has a number of teaching assignments on artificial neural networks as well as evolutionary algorithms at the University of Magdeburg. After completing his Ph.D. in 1998, he received a grant from the German Academic Exchange Service to continue his research work at the University of South Australia, Adelaide. His research interests focus on computational intelligence, image processing, and high-performance computing within these fields.

References

[1] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, E. Lander, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (5439) (1999) 531–537.
[2] B. Hammer, M. Strickert, T. Villmann, On the generalization ability of GRLVQ networks, Neural Process. Lett. 21 (2005) 109–120.
[3] B. Hammer, T. Villmann, Generalized relevance learning vector quantization, Neural Networks 15 (2002) 1059–1068.
[4] S. Kaski, Bankruptcy analysis with self-organizing maps in learning metrics, IEEE Trans. Neural Networks 12 (2001) 936–947.
[5] T. Kohonen, Self-Organizing Maps, third ed., Springer, Berlin, 2001.
[6] F. Rossi, B. Conan-Guez, A.E. Golli, Clustering functional data with the SOM algorithm, in: Proceedings of ESANN 2004, Bruges, Belgium, 2004, pp. 305–312.
[7] F. Rossi, N. Delannay, B. Conan-Guez, M. Verleysen, Representation of functional data in neural networks, Neurocomputing 64 (2005) 183–210.
[8] A. Sato, K. Yamada, Generalized learning vector quantization, in: G. Tesauro, D. Touretzky, T. Leen (Eds.), Advances in Neural Information Processing Systems 7 (NIPS), MIT Press, Cambridge, MA, 1995, pp. 423–429.
[9] M. Strickert, N. Sreenivasulu, W. Weschke, U. Seiffert, T. Villmann, Generalized relevance LVQ with correlation measures for biological data, in: M. Verleysen (Ed.), European Symposium on Artificial Neural Networks (ESANN), D-side Publications, 2005, pp. 331–338.
[10] M. Strickert, S. Teichmann, N. Sreenivasulu, U. Seiffert, High-throughput multi-dimensional scaling (HiT-MDS) for cDNA-array expression data, in: W. Duch, J. Kacprzyk, E. Oja, S. Zadrożny (Eds.), Artificial Neural Networks: Biological Inspirations, ICANN 2005, Springer Lecture Notes in Computer Science, Springer, Berlin, 2005, pp. 625–633.
[11] N. Villa, F. Rossi, Support vector machine for functional data classification, in: Proceedings of ESANN 2005, Bruges, Belgium, 2005, pp. 467–472.
[12] T. Villmann, F. Schleif, B. Hammer, Supervised neural gas and relevance learning in learning vector quantization, in: T. Yamakawa (Ed.), Proceedings of the Workshop on Self-Organizing Networks (WSOM), Kyushu Institute of Technology, 2003, pp. 47–52.

Marc Strickert obtained his Ph.D. in Computer Science in 2005 at the University of Osnabrück in the research group 'Learning with Neural Methods on Structured Data'. Self-organizing neural learning architectures for high-dimensional data analysis, time series processing, and pattern recognition are his major topics of interest. He is currently working in the Pattern Recognition Group at the Leibniz-Institute of Plant Genetics and Crop Plant Research Gatersleben (IPK), Germany.

Nese Sreenivasulu completed his Ph.D. in biology in 2002. Presently, he is working as a post-doctoral research scientist at the Institute of Plant Genetics and Crop Plant Research (IPK) with a focus on genomic studies of barley seeds.

Winfriede Weschke studied chemistry at the TU Merseburg (Germany). She finished her Ph.D. work in organic photochemistry in 1980 and afterwards worked at the former Zentralinstitut für Genetik und Kulturpflanzenforschung (ZIGUK) in Gatersleben on genetic engineering of seed storage proteins of Vicia faba. She took a permanent position at the Institute of Plant Genetics and Crop Plant Research (IPK) in 1990. Presently, she is working on the molecular physiology of barley seed development with a main focus on expression analysis.

Thomas Villmann works in the Clinic for Psychotherapy at the University of Leipzig, where he is the head of the computational intelligence group. His research interests include theory and applications of neural networks, in particular SOM, neural gas, and LVQ, as well as data mining and evolutionary algorithms. Applications cover medical problems, satellite remote sensing, and model optimization. He holds a diploma degree in mathematics and a Ph.D. in computer science, both received from the University of Leipzig, Germany. Further, he is a founding member of the German chapter of ENNS (GNNS).

Barbara Hammer received her Ph.D. in Computer Science in 1995 and her venia legendi in Computer Science in 2003, both from the University of Osnabrück, Germany. During 2000–2004, she was leader of the junior research group 'Learning with Neural Methods on Structured Data' at the University of Osnabrück, before becoming professor of Theoretical Computer Science at Clausthal University of Technology, Germany, in 2004. Several research visits have taken her to Pisa and Padova (Italy), Birmingham (UK), Bangalore (India), Paris (France), and the USA. She is coauthor of more than 60 papers in international journals and conferences on different aspects of Computational Intelligence, most of which can be retrieved from http://www.in.tu-clausthal.de/hammer/
