
Neurocomputing 69 (2006) 651–659 www.elsevier.com/locate/neucom

Generalized relevance LVQ (GRLVQ) with correlation measures for gene expression analysis

Marc Strickert, Udo Seiffert, Nese Sreenivasulu, Winfriede Weschke, Thomas Villmann, Barbara Hammer

Pattern Recognition Group, Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Germany; Gene Expression Group, IPK Gatersleben, Germany; Clinic for Psychotherapy, University Leipzig, Germany; Institute of Computer Science, Technical University of Clausthal, Germany

Available online 10 January 2006

Abstract

A correlation-based similarity measure is derived for generalized relevance learning vector quantization (GRLVQ). The resulting GRLVQ-C classifier makes Pearson correlation available in a classification cost framework in which data prototypes and global attribute weighting terms are adapted towards minimum cost function values. In contrast to the Euclidean metric, the Pearson correlation measure makes input vector processing invariant to shifting and scaling transforms, which is a valuable feature for dealing with functional data and with intensity observations like gene expression patterns. Two types of data measures are derived from Pearson correlation in order to make its benefits for data processing available in compact prototype classification models. Fast convergence and high accuracies are demonstrated for cDNA-array gene expression data. Furthermore, the automatic attribute weighting of GRLVQ-C is successfully used to rate the functional relevance of analyzed genes.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Prototype-based learning; Adaptive metrics; Correlation measure; Learning vector quantization; GRLVQ; Gene expression analysis

1. Introduction

Pattern classification is the key technology for solving tasks in diagnostics, automation, information fusion, and forecasting. The backbone of pattern classification is the underlying distance metric: it defines how data items are compared, and it controls the grouping of data. Thus, depending on the definition of the distance, a data set can be viewed and processed from different perspectives. Unsupervised clustering with a specific similarity measure, for example, visualized as the result of a self-organizing map (SOM), provides first hints about the appropriateness of the chosen metric for meaningful data grouping [5]. In prototype-based models like

doi:10.1016/j.neucom.2005.12.004

the SOM, a data item can be compared with an 'average' data prototype in various ways, for example according to the Euclidean distance or the Manhattan block distance. Different physical and geometric interpretations are then obtained, because the former measures diagonally across the vector space, while the latter sums up distances along each dimension axis. In any case, the specific structure of the data space can and should be accounted for by selecting an appropriate metric. Once a suitable metric is identified, it can be further utilized for the design of good classifiers. In supervised scenarios, auxiliary class information can be used for adapting parameters that improve the specificity of data metrics during data processing, as proposed by Kaski for (semi-)supervised extensions of the SOM [4]. Another metric-adapting classification architecture is generalized relevance learning vector quantization (GRLVQ), developed by Hammer and Villmann [3]. Data metrics in the mathematical sense, however, might be too restrictive for some applications in which a relaxation


to more general similarity measures would be useful. In the biological sciences, for example, functional aspects of collected data often play an important role: general spatio-temporal patterns in time series, intensity fields, or observation sequences might be more inter-related than patterns that are merely spatially close in the Euclidean sense. This applies to the aim of the present work, the analysis of gene expression patterns, for which the Pearson correlation is commonly used. Since recent technological achievements allow probing of thousands of gene expression levels in parallel, fast and accurate methods are required to deal with the resulting large data sets. Thereby, the definition of genetic similarity in terms of Pearson correlation should be possible, and the curse of dimensionality, related to the small number of available experiments in high-dimensional gene expression space, should be reduced to a minimum. Many commercial and freely available bioinformatics tools, such as ArrayMiner, GeneSpring, J-Express Pro, and Eisen's Gene Cluster, use Pearson correlation for analysis. The common goal of these programs is the identification of key regulators and clusters of co-expressed genes that determine metabolic functions in developing organisms. Usually, only the metric of algorithms initially designed for processing Euclidean data is exchanged for a one-minus-correlation term. Here, GRLVQ-C is proposed, a classifier that is mathematically derived from scratch for correlation-based classification. Its foundations are the generic update rules of generalized relevance learning vector quantization (GRLVQ, [2,3]). This allows the incorporation of auxiliary information for genetic distinction, such as the developmental stage of the probed tissues, or the stress factors applied to the growing organisms.
Using the GRLVQ approach with its rigid classification cost function, a fast, prototype-based, and intuitive classification model with very good generalization properties is derived. Both data attribute relevances and prototype locations are obtained as a result of optimizing Pearson correlations. The specific requirements of gene expression analysis are met in two ways: firstly, the implemented correlation measure accounts for the nature of gene expression experiments, which, for physico-chemical reasons, tend to differ in their overall intensities and in their dynamic ranges, but not in the general structure of the expressed patterns. Secondly, automatic relevance weighting attenuates the curse of high dimensionality. The properties and benefits of the proposed GRLVQ-C classifier are demonstrated for real-life data sets.

2. Generalized relevance LVQ (GRLVQ) and extensions

Let $X = \{(x^i, y^i) \in \mathbb{R}^d \times \{1,\dots,c\} \mid i = 1,\dots,n\}$ be a training data set with $d$-dimensional elements $x^k = (x^k_1, \dots, x^k_d)$ to be classified and $c$ classes. A set $W = \{w^1, \dots, w^K\}$ of prototypes in data space with class labels $y^i$ is used for data representation, $w^i = (w^i_1, \dots, w^i_d, y^i) \in \mathbb{R}^d \times \{1,\dots,c\}$.

The classification cost function to be minimized is given in the generic form [3]:

$$E_{\mathrm{GRLVQ}} := \sum_{i=1}^{n} g(q_\lambda(x^i)) \quad \text{with} \quad q_\lambda(x^i) = \frac{d_\lambda^{+}(x^i) - d_\lambda^{-}(x^i)}{d_\lambda^{+}(x^i) + d_\lambda^{-}(x^i)}, \qquad d_\lambda(x) := d_\lambda(x, w).$$

The classification costs of all patterns are summed, whereby $q_\lambda(x^i)$ serves as a quality measure of the classification depending on the degree of fit of the presented pattern $x^i$ to the two closest prototypes: $w^{i+}$, carrying the same label as $x^i$, and $w^{i-}$, carrying a different label. A sigmoid transfer function $g(x) = \mathrm{sgd}(x) = 1/(1 + \exp(-x)) \in (0, 1)$ is used [8]. Implicit degrees of freedom of the cost minimization are the prototype locations in the weight space and a set of adaptive parameters $\lambda$ connected to the measure $d_\lambda(x) = d_\lambda(x, w)$ comparing pattern and prototype. In prior work, $d_\lambda(x)$ was supposed to be a metric in the mathematical sense, i.e. taking only non-negative values, conforming to the triangle inequality, and yielding a distance of $d = 0$ only for $w = x$. These conditions enable intuitive interpretations of prototype relationships. However, if just a well-performing classifier invariant to certain features is wanted, the distance conditions may be relaxed to a mere similarity measure plugged into the algorithm. Overall similarity maximization can be expressed in the GRLVQ framework by flipping the sign of the measure and then simply keeping the minimization of $E_{\mathrm{GRLVQ}}$. Since the iterative GRLVQ update implements a gradient descent on $E$, $d$ must be differentiable almost everywhere, no matter whether it acts as a distance or as a similarity measure. Partial derivatives of $E_{\mathrm{GRLVQ}}$ yield the generic update formulas for the closest correct prototype, the closest wrong prototype, and the metric weights:

$$\Delta w^{i+} = -\gamma^{+} \cdot \frac{\partial E_{\mathrm{GRLVQ}}}{\partial w^{i+}} = -\gamma^{+} \cdot g'(q_\lambda(x^i)) \cdot \frac{2 \cdot d_\lambda^{-}(x^i)}{(d_\lambda^{+}(x^i) + d_\lambda^{-}(x^i))^2} \cdot \frac{\partial d_\lambda^{+}(x^i)}{\partial w^{i+}},$$

$$\Delta w^{i-} = -\gamma^{-} \cdot \frac{\partial E_{\mathrm{GRLVQ}}}{\partial w^{i-}} = \gamma^{-} \cdot g'(q_\lambda(x^i)) \cdot \frac{2 \cdot d_\lambda^{+}(x^i)}{(d_\lambda^{+}(x^i) + d_\lambda^{-}(x^i))^2} \cdot \frac{\partial d_\lambda^{-}(x^i)}{\partial w^{i-}},$$

$$\Delta \lambda = -\gamma_\lambda \cdot \frac{\partial E_{\mathrm{GRLVQ}}}{\partial \lambda} = -\gamma_\lambda \cdot g'(q_\lambda(x^i)) \cdot \frac{2 \cdot \left( \frac{\partial d_\lambda^{+}(x^i)}{\partial \lambda} \cdot d_\lambda^{-}(x^i) - d_\lambda^{+}(x^i) \cdot \frac{\partial d_\lambda^{-}(x^i)}{\partial \lambda} \right)}{(d_\lambda^{+}(x^i) + d_\lambda^{-}(x^i))^2}.$$

Learning rates are $\gamma_\lambda$ for the metric parameters $\lambda_j$, all initialized equally to $\lambda_j = 1/d$, $j = 1, \dots, d$; $\gamma^{+}$ and $\gamma^{-}$ describe the prototype update amounts. Their choice depends on the used measure; generally, they should be chosen according to the relation $0 \le \gamma_\lambda \ll \gamma^{-} \le \gamma^{+} \le 1$ and decreased within these constraints during training. Metric adaptation should be realized slowly, as a reaction to the quasi-stationary solutions for the prototype positions. The above set of equations is a convenient starting point to test different concepts of similarity by simply inserting the denoted partial derivatives of $d_\lambda(x)$.
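For concreteness, one stochastic GRLVQ learning step under these generic update rules can be sketched as follows. This is an illustrative numpy implementation (function names and learning rates are ours, not from the original software), instantiated with the squared weighted Euclidean distance $d_\lambda(x, w) = \sum_j \lambda_j (x_j - w_j)^2$ as the plug-in measure:

```python
import numpy as np

def grlvq_step(x, y, prototypes, labels, lam,
               lr_plus=0.1, lr_minus=0.05, lr_lam=0.001):
    """One stochastic GRLVQ update for pattern x with class y.

    prototypes: (K, d) array, labels: (K,) prototype classes,
    lam: (d,) relevance factors; d_lam(x, w) = sum_j lam_j * (x_j - w_j)**2.
    """
    d = np.sum(lam * (x - prototypes) ** 2, axis=1)
    correct = labels == y
    ip = np.flatnonzero(correct)[np.argmin(d[correct])]    # closest correct prototype
    im = np.flatnonzero(~correct)[np.argmin(d[~correct])]  # closest wrong prototype
    dp, dm = d[ip], d[im]
    s = 1.0 / (1.0 + np.exp(-(dp - dm) / (dp + dm)))       # sgd(q)
    gprime = s * (1.0 - s)                                 # sgd'(q)
    denom = (dp + dm) ** 2
    diff_p, diff_m = x - prototypes[ip], x - prototypes[im]
    # relevance gradient: dd/dlam_j = (x_j - w_j)**2
    grad_lam = gprime * (2.0 / denom) * (dm * diff_p ** 2 - dp * diff_m ** 2)
    # Hebbian attraction of w+ towards x, repulsion of w- (dd/dw_j = -2*lam_j*diff_j)
    prototypes[ip] += lr_plus * gprime * (2.0 * dm / denom) * 2.0 * lam * diff_p
    prototypes[im] -= lr_minus * gprime * (2.0 * dp / denom) * 2.0 * lam * diff_m
    lam = np.clip(lam - lr_lam * grad_lam, 0.0, None)
    lam /= lam.sum()                                       # keep sum(lam) = 1
    return prototypes, lam
```

Iterated over epochs and randomly presented patterns, this performs the stochastic gradient descent on $E_{\mathrm{GRLVQ}}$; exchanging the measure only requires replacing the two partial derivatives in the marked lines.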


3. Metrics and similarity measures

The missing ingredient for carrying out comparisons is either a distance metric or a more general similarity measure $d_\lambda(x, w)$. In contrast to metrics, similarity measures are sometimes also called dis-similarity measures, because they are maximal for the best match, which is opposed to the semantics of metrics. For reference, the formulas for the weighted Euclidean distance are revisited first. Then, by relaxing the conditions of metrics, two types of measures are derived from the Pearson correlation coefficient, both of which inherit the invariance to component offsets and amplitude scaling. This prototype invariance, implemented by the presented update dynamic, is desirable in situations where mainly frequency information and simple curve-shape matching are of interest. More details on functional shape representations and functional data processing with neural SOM, RBF, and MLP networks are given by Rossi et al. [6,7], and for use with support vector machines by Villa and Rossi [11].

3.1. Weighted Euclidean metric

The weighted Euclidean metric yields the following set of formulas [12]:

$$d_\lambda^{\mathrm{Euc}}(x, w^i) = \sum_{j=1}^{d} \lambda_j^{b_\lambda} \cdot (x_j - w^i_j)^{b_w}, \quad \text{integers } b_\lambda, b_w \ge 0,\ b_w \text{ even},$$

$$\frac{\partial d_\lambda^{\mathrm{Euc}}(x, w^i)}{\partial w^i_j} = -b_w \cdot \lambda_j^{b_\lambda} \cdot (x_j - w^i_j)^{b_w - 1},$$

$$\frac{\partial d_\lambda^{\mathrm{Euc}}(x, w^i)}{\partial \lambda_j} = b_\lambda \cdot \lambda_j^{b_\lambda - 1} \cdot (x_j - w^i_j)^{b_w}.$$

For simplicity, roots have been omitted. In the squared case with $b_w = 2$, the derivative for the prototype update, $2 \cdot (x_j - w^i_j)$, contains the well-known Hebbian learning term. In other cases, large $b_w$ tend to focus on dimensions with large differences, and small $b_w$ on dimensions with small differences. Proven values for the exponent of the relevance factors are $b_\lambda \in \{1, 2\}$. Normalization to $\sum_{i=1}^{d} \lambda_i = 1$, $\lambda_i \ge 0$, is necessary after each update step to keep the parameters from diverging or collapsing.

3.2. Correlation-based measures

In the following, a correlation-based classification is derived from the term

$$r = d_r(x, w^i) = \frac{\sum_{l=1}^{d} (w^i_l - \mu_{w^i}) \cdot (x_l - \mu_x)}{\sqrt{\sum_{l=1}^{d} (w^i_l - \mu_{w^i})^2} \cdot \sqrt{\sum_{l=1}^{d} (x_l - \mu_x)^2}} \in [-1, 1], \qquad (1)$$





Fig. 1. Data patterns compared with different similarity functions. Relation characterizations for the squared Euclidean metric differ from those for Pearson correlation: $d^{\mathrm{Euc}}(\mathrm{RS}, \mathrm{P1}) = 0.82 < d^{\mathrm{Euc}}(\mathrm{RS}, \mathrm{P2}) = 1.81$, i.e. P1 closer to RS than P2; but $d_r(\mathrm{RS}, \mathrm{P1}) = -0.53 < d_r(\mathrm{RS}, \mathrm{P2}) = 0.89$, i.e. P2 more similar to RS (highly correlated) than P1 (anti-correlated).

which is the Pearson correlation coefficient; therein, $\mu_y$ denotes the mean value of vector $y$. As illustrated in Fig. 1, this correlation possesses fundamentally different properties than the Euclidean distance: depending on the applied similarity function, the two patterns compared with a reference pattern yield opposite relations. Simple data preprocessing cannot transform a correlation-based classification problem into an equivalent one solvable with the Euclidean metric. As a rough rule of thumb: if a prototype with 'sufficient' variance is similar to input points in the Euclidean sense, then it is very likely that it is also highly correlated with them. The other direction is untrue: if high correlation exists, there might still be a large Euclidean distance. Thus, potentially fewer prototypes are necessary for representations based on correlation, and sparser data models can be realized. The straightforward definition of Pearson correlation by Eq. (1), however, is not suitable for direct implementation in GRLVQ: firstly, the required cost function minimization conflicts with the desired maximization of Pearson similarity between data and prototypes; secondly, only the small range of values $-1 \le r \le 1$ is available for expressing best match versus worst match, which yields sub-optimal convergence of $E_{\mathrm{GRLVQ}}$. As a first approach, one might think of a version of Fisher's $Z' = 0.5 \cdot (\ln(1 + r) - \ln(1 - r))$ as a standard transformation of Pearson correlation. This, however, leads to unstable behavior, because almost perfect (dis-)correlation is mapped to arbitrarily large absolute values. Therefore, inverse fractions of appropriately reshaped functions of $r$ are considered in the following. The derivations presented here unify and improve the transformations given in the authors' prior work [9].
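The claimed invariance is easy to check numerically. A small illustrative sketch (numpy; the function and variable names are ours): Pearson correlation is unchanged by shifting and positively scaling an input vector, while the Euclidean distance is not:

```python
import numpy as np

def pearson(x, w):
    """Pearson correlation r = d_r(x, w) as in Eq. (1), without relevances."""
    xc, wc = x - x.mean(), w - w.mean()
    return float(xc @ wc / np.sqrt((xc @ xc) * (wc @ wc)))

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
w = np.array([0.5, 2.0, 1.0, 4.0, 3.5])

# shift and scale invariance: r(a*x + b, w) == r(x, w) for a > 0
assert abs(pearson(3.0 * x + 10.0, w) - pearson(x, w)) < 1e-12
# perfect correlation and anti-correlation mark the range limits
assert abs(pearson(x, 2.0 * x + 1.0) - 1.0) < 1e-12
assert abs(pearson(x, -x) + 1.0) < 1e-12
# the Euclidean distance, in contrast, changes under the same transform
assert np.linalg.norm(3.0 * x + 10.0 - w) != np.linalg.norm(x - w)
```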
Since metric adaptivity is a very beneficial property for rating individual data attributes, free parameters are added to the Pearson correlation in such a way that the meaning of correlation can be fully maintained. Then, by paying attention to the current prototype w:¼wi , the numerator of


Eq. (1) becomes

$$H := \sum_{l=1}^{d} \lambda_l^2 \cdot (w_l - \mu_w) \cdot (x_l - \mu_x) = \lambda_j^2 \cdot (w_j - \mu_w) \cdot (x_j - \mu_x) + \bar{H}_j(\mu_w) \quad \text{(focusing on component } j\text{)},$$

$$\bar{H}_j(\mu_w) = \sum_{l=1,\, l \ne j}^{d} \lambda_l^2 \cdot (w_l - \mu_w) \cdot (x_l - \mu_x), \qquad \mu_w = \mu_w(w_j) = \frac{1}{d} \cdot w_j + \frac{1}{d} \cdot \sum_{l=1,\, l \ne j}^{d} w_l.$$

The focus on component $j$ will be a convenient abbreviation for deriving the formulas for prototype update and relevance adaptation. Each of the mean-subtracted weight and pattern components, $(w_j - \mu_w)$ and $(x_j - \mu_x)$, has its own relevance factor $\lambda_j$. This is reflected in both rewritten denominator factors of Eq. (1), again with a focus on weight vector component $j$:

$$W := \sum_{l=1}^{d} \lambda_l^2 \cdot (w_l - \mu_w)^2, \qquad X := \sum_{l=1}^{d} \lambda_l^2 \cdot (x_l - \mu_x)^2,$$

$$W(w_j, \lambda_j) = \lambda_j^2 \cdot (w_j - \mu_w)^2 + \bar{W}_j, \qquad \bar{W}_j = \sum_{l=1,\, l \ne j}^{d} \lambda_l^2 \cdot (w_l - \mu_w)^2.$$

Using the defined shortcuts, the adaptive Pearson correlation can be written as $r_\lambda = H / \sqrt{W \cdot X}$. Two types of measures are obtained by a unified transform:

$$R = \left( \frac{1}{C + r_\lambda(x, w)} \right)^{k} - R_{\min} = \left( C + \frac{H}{\sqrt{W \cdot X}} \right)^{-k} - R_{\min} =: R_k - R_{\min}. \qquad (2)$$

The resulting classifiers are characterized as follows:

C0: One type of classifier is obtained for $C = 0$, even integer exponents $k \ge 2$, and minimum $R_{\min} = 1$. This classifier allows the separation of both correlation and anti-correlation from mere dis-correlation. The minimum value $R_{\min}$ is subtracted in order to obtain sharp zeros for perfect matches. In computer implementations, a special treatment of the unlikely case of extreme dis-correlation might be considered in order to avoid division by (near-)zero values. C0-prototypes match both correlated patterns and their inverted, anti-correlated counterparts, which allows the realization of very compact classifiers. For specific data, however, this type of classification might lead to an undesired intermingling of data profile shapes, as can occur in gene expression analysis.

C1: The other type of classifier separates correlated patterns from anti-correlated ones. The C1 model is realized by $C = 1$, an integer exponent $k \ge 1$, and $R_{\min} = 2^{-k}$; here, only the rare occurrence of extreme anti-correlation might require a handling of singular values in computer realizations. The C1 setup allows classification with intuitive Pearson correlations, a feature that is well suited for co-expression analysis of gene intensities.

For calculating derivatives of $R$, the constant expression $R_{\min}$ can be omitted. Solutions can be obtained manually or by using computer algebra systems. In the latter case, after some rearrangements, the following equations are found:

$$\frac{\partial R(w_j, \lambda_j)}{\partial w_j} = \frac{\partial}{\partial w_j} \left( C + \frac{H(w_j, \lambda_j)}{(W(w_j, \lambda_j) \cdot X(\lambda_j))^{1/2}} \right)^{-k} = F \cdot \left( H'(w_j) \cdot W \cdot X - \tfrac{1}{2} \cdot H \cdot W'(w_j) \cdot X \right),$$

$$\frac{\partial R}{\partial \lambda_j} = F \cdot \left( H'(\lambda_j) \cdot W \cdot X - \tfrac{1}{2} \cdot H \cdot W'(\lambda_j) \cdot X - \tfrac{1}{2} \cdot H \cdot W \cdot X'(\lambda_j) \right),$$

using the factor $F = -k \cdot (C + r_\lambda)^{-k-1} / (W \cdot X)^{3/2}$. The missing derivatives are

$$H'(w_j) = \lambda_j^2 \cdot (x_j - \mu_x) - \frac{1}{d} \cdot \sum_{l=1}^{d} \lambda_l^2 \cdot (x_l - \mu_x),$$

$$W'(w_j) = 2 \cdot \lambda_j^2 \cdot (w_j - \mu_w) - \frac{2}{d} \cdot \sum_{l=1}^{d} \lambda_l^2 \cdot (w_l - \mu_w),$$

$$H'(\lambda_l) = 2 \cdot \lambda_l \cdot (w_l - \mu_w) \cdot (x_l - \mu_x), \qquad W'(\lambda_l) = 2 \cdot \lambda_l \cdot (w_l - \mu_w)^2, \qquad X'(\lambda_l) = 2 \cdot \lambda_l \cdot (x_l - \mu_x)^2.$$

These formulas contain plausible Hebb terms, adapting $w_j$ into the direction of $(x_j - \mu_x)$ and away from $(w_l - \mu_w)$ in case of correct classification, whereby further scaling factors come from the cost function. Similarly, $\lambda_l$ is adapted according to the correlation $(w_l - \mu_w) \cdot (x_l - \mu_x)$ in comparison to the variances of these terms. Note that very efficient computer implementations can be realized, because most calculations can already be done during the pattern presentation phase. There, the similarity measure $R$ and all its constituents $H$, $W$, $X$, and some further terms are computed, and they can be stored for each prototype for later reuse in the update phase. Again, the normalization of the relevance factors to $\sum_{i=1}^{d} \lambda_i = 1$, $\lambda_i \ge 0$, is advised in order to avoid numerical instabilities and to make different training runs comparable.

Two finally discussed practical issues concern the handling of singular expressions and the choice of the exponent $k$. Singular states occur if the variance of one of the vectors is absent.
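Under the stated definitions, the measure of Eq. (2) is straightforward to implement. The following numpy sketch (our illustrative code, not the authors' implementation) computes the relevance-weighted correlation $r_\lambda = H/\sqrt{W \cdot X}$ and the transformed similarity $R$ for both classifier types:

```python
import numpy as np

def r_lambda(x, w, lam):
    """Adaptive Pearson correlation r_lam = H / sqrt(W * X)."""
    xc, wc = x - x.mean(), w - w.mean()
    H = np.sum(lam ** 2 * wc * xc)
    W = np.sum(lam ** 2 * wc ** 2)
    X = np.sum(lam ** 2 * xc ** 2)
    return H / np.sqrt(W * X)

def R(x, w, lam, C, k):
    """Measure R = (C + r_lam)**(-k) - R_min of Eq. (2).

    C = 0 with even k >= 2 gives the C0 type (R_min = 1);
    C = 1 with integer k >= 1 gives the C1 type (R_min = 2**(-k)).
    """
    R_min = 1.0 if C == 0 else 2.0 ** (-k)
    return (C + r_lambda(x, w, lam)) ** (-k) - R_min
```

For a perfectly correlated pair, both variants yield $R = 0$ by construction of $R_{\min}$; for C0, an anti-correlated pair also yields $R = 0$ (even exponent), which is exactly the matching of inverted profiles described above, while for C1 anti-correlation drives $R$ towards arbitrarily large values, the singular case noted in the text.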

If, for example, a pattern vector $x$ is affected, then the simplified limit correlation

$$\lim_{x_l \to \mu_x} \frac{\sum_{l=1}^{d} (x_l - \mu_x) \cdot (w_l - \mu_w)}{\sqrt{\sum_{l=1}^{d} (x_l - \mu_x)^2 \cdot \sum_{l=1}^{d} (w_l - \mu_w)^2}} = \lim_{x \to \mu_x} \frac{(x - \mu_x) \cdot \sum_{l=1}^{d} (w_l - \mu_w)}{\sqrt{d \cdot (x - \mu_x)^2 \cdot \sum_{l=1}^{d} (w_l - \mu_w)^2}} = \frac{0 \cdot \sum_{l=1}^{d} (w_l - \mu_w)}{\sqrt{d \cdot \sum_{l=1}^{d} (w_l - \mu_w)^2}} = 0$$

is of interest. In this case, all prototypes would end up with zero correlations and impracticable terms $F$. Analogous reasoning holds in the rare case of equal prototype components, which, in practice, would occur through inappropriate initialization rather than through the update dynamic. Here, prototypes are assumed to be initialized by data instances. By skipping the degenerate constant patterns or prototypes, such unpleasant situations can be effectively avoided. Alternatively, a single randomly picked component can be set to a different value, which, on average, produces the desired state of uncorrelatedness, even for two vectors with simultaneously equal components subject to that modification.

The free parameter $k$ influences the speed of convergence and the generalization ability. Integer values in the range $1 \le k < 20$ have been found to be reasonable choices in experiments. Too high values lead to fast adaptation, but sometimes also to over-fitting or to unstable convergence, unless a very small learning rate $\gamma$ is chosen. Good initial exponents are $k = 7$ or $8$, odd and/or even according to the desire for a C1 or a C0 type classifier. For the presented experiments, training is only a matter of one or two minutes; therefore, systematic parameter searches can be realized.

4. Experiments

Three experiments underline the usefulness of correlation-based classification. First, a proof of concept is given for the Tecator benchmark data, which is a set of absorbance spectra. Then the focus is put on cDNA array classification: in the second experiment, a distinction between two types of leukemia is sought from gene spotting intensities; this involves the classification of 72 complete cDNA microarrays, i.e. 7129-dimensional gene expression intensity vectors, and the rating of these genes for their relevance to classification. The last experiment detects systematic differences between two series of gene expression records by analyzing two corresponding sets of seven-dimensional expression patterns with 1421 genes each.

4.1. Tecator spectral data

The first data set, publicly available at http://lib.stat.cmu.edu/datasets/tecator, contains 215 samples of 100-dimensional infrared absorbance spectra measured with the Tecator Infratec Food and Feed Analyzer. The task is to predict the binary fat content, low or high, of meat by spectrum classification, thereby using random data partitions of 120 training patterns and 95 test patterns, as suggested in [11]. It is known that the Euclidean distance is not appropriate for raw spectrum comparison, and the question of interest is whether Pearson correlation yields any benefit. Fig. 2, top panel, shows some of the spectra with their corresponding classes. Apart from a tendency towards dints around channel 41 for high fat content, a substantial visual data overlap can be stated. This is reflected in the results for Euclidean-based classifiers: k-nearest neighbor (k-NN) reaches its best classification results of about 80% accuracy for k = 3; GRLVQ with the squared Euclidean metric reaches 88% on the test set at 94% training accuracy. However, with correlation-based classification the problem gets easier, and results above 97% are obtained. For comparison, k-NN with maximum correlation neighborhood is taken, k-NN-C for short. Table 1 contains the average numbers of misclassifications for each classifier, each trained 25 times. In contrast to k-NN-C, which utilizes all available training data for classification, GRLVQ-C training requires only 20 prototypes per class, trained for 500 epochs. The learning rates for both types C0 and C1 without relevance learning are $\gamma_\lambda = 0$ and $\gamma^{+} = \gamma^{-} = 10^{-8}$, and the exponents are $k = 6$ and $5$, respectively. In additional runs, relevance learning is switched on by choosing a relatively large non-zero learning rate $\gamma_\lambda = 10^{-7}$ while the prototype adaptation rates are kept at $\gamma^{+} = \gamma^{-} = 10^{-8}$. To summarize Table 1, GRLVQ-C classification is superior to k-NN-C. The differences between C0 and C1 accuracies become visible beyond non-relevance-based learning: while C0 does not profit much from relevance adaptation, C1 does account for it, as some of the runs ended with no misclassifications at all.

The bottom panel of Fig. 2 shows individual and average relevance profiles for the C1 classifier. As an intuitive result, the apparent discriminators

Fig. 2. Upper panel: sample spectra from the Tecator data set (classes 0 and 1). Lower panel: GRLVQ-C1 relevance profiles (individual profiles, average, and standard-deviation boundary) over the frequency channels.
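For reference, the k-NN-C baseline used above, k-nearest neighbors with a maximum-correlation neighborhood, can be sketched as follows (an illustrative implementation; names are ours, and the original comparison code is not reproduced here):

```python
import numpy as np

def knn_c_predict(x, train_X, train_y, k=3):
    """Classify x by majority vote among the k training patterns
    that are most correlated (Pearson) with x."""
    xc = x - x.mean()
    Tc = train_X - train_X.mean(axis=1, keepdims=True)
    r = (Tc @ xc) / (np.linalg.norm(Tc, axis=1) * np.linalg.norm(xc))
    nearest = np.argsort(-r)[:k]                     # k most correlated neighbors
    classes, votes = np.unique(train_y[nearest], return_counts=True)
    return classes[np.argmax(votes)]
```

Rescaling training and test vectors component-wise by a relevance profile learned with GRLVQ-C before computing the correlation corresponds to the 'relevance-rescaled' k-NN-C variant reported in Table 1.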


Table 1
Tecator correlation-based classification results for the test set

                GRLVQ-C0        GRLVQ-C1        1-NN-C          3-NN-C          7-NN-C
                no rel.  rel.   no rel.  rel.   no rel.  rel.   no rel.  rel.   no rel.  rel.
Misclassified   3.32     2.04   2.4      2.16   3.96     2.52   5.04     2.84   6.4      3.24

Average numbers of misclassifications are shown for 25 runs of each classifier. k-NN-C utilizes the maximum correlation neighborhood. Relevance utilization is indicated by 'rel.', denoting metric adaptation for GRLVQ-C and application of relevance-rescaled data for k-NN-C.

Table 2
Leukemia list of candidate genes for differentiating between types AML and ALL

Gene-#  Found  ID      Name
1745    Yes    M16038  LYN Yamaguchi sarc. vir. rel. oncog. homolog
1834    Yes    M23197  CD33 antigen (differentiation antigen)
1882    Yes    M27891  CST3 Cystatin C (amyloid cerebral hemorrhage)
2354    Yes    M92287  CCND3 Cyclin D3
4190    No     X16706  FOS-related antigen 2
4211    No     X51521  VIL2 Villin 2 (ezrin)
4847    Yes    X95735  Zyxin
5954    No     Y00339  CA2 Carbonic anhydrase II
6277    No     M30703  Amphiregulin (AR) gene
6376    Yes    M83652  PFC Properdin P factor, complement

The 'Found' column indicates whether the specific gene was also identified by the team of Golub et al. [1].

shown at particular channels of the data, such as channel 41, get amplified, while less important channels are suppressed. Although k-NN-C decreases in performance for larger k-neighborhoods, its results can be improved by transforming the input data according to the scaling factors shown in the relevance plots. The GRLVQ-C scaling weights can thus be used to boost k-NN-C classification accuracies, which underlines a more general validity of the found data scaling properties. To conclude, sparse and accurate GRLVQ-C classifiers are obtained for the Tecator data set without further data preprocessing. The built-in relevance detection yields highly interpretable results which, in the following, help to identify key regulators in gene expression experiments.

4.2. Leukemia cancer type detection

The second task is gene expression analysis, where the GRLVQ-C property of automatic attribute weighting is used for gene ranking. Data are taken from cDNA arrays, which are powerful tools for probing in parallel the expression levels of thousands of genes extracted from organic tissue cells. A very important issue in gene expression analysis is the identification of functionally relevant genes. Particularly medical diagnostics and therapies profit from the isolation of small sets of candidate genes responsible for defective or mutative operations. In cancer research, many well-documented data sets and publications are available online. One of the discussed problems is the differentiation between two types of

leukemia: the acute lymphoblastic leukemia (ALL) and the acute myeloid leukemia (AML). Background information is provided by Golub et al. [1]. The corresponding data sets and further online material can be found at http://www.broad.mit.edu/. The available data contain real-valued expression levels of 7129 genes (some redundant) for each of the 38 training samples (27 ALL, 11 AML) and of the 34 test samples (20 ALL, 14 AML). In order to distinguish correlation from anti-correlation, GRLVQ-C1 is considered. Training has been carried out for a minimalistic GRLVQ-C1 model, using only one prototype per class, which prevents over-fitting in the 7129-dimensional space. Learning rates are $\gamma^{+} = \gamma^{-} = 2.5 \times 10^{-3}$, the exponent $k = 2$ is taken, and 1000 epochs are trained (Table 2). Table 3 shows that the average results of 100 runs with only prototype adaptation are rather poor in contrast to the neighborhood analysis method of Golub et al.; however, allowing relevance adaptation at $\gamma_\lambda = 5 \times 10^{-8}$, the GRLVQ-C1 accuracies are drastically improved. Thus, the obtained relevance factors must explain correlative differences between AML and ALL. Yet, this statement does not claim biological truth. For validation purposes, the genes have been ranked according to their relevance values for those 19 of 100 independently trained classifiers that showed perfect results on the training and test set. A list of the top ten genes with the highest sum of their 19 ranks has been extracted and matched against a longer list of 50 genes given by Golub and his group. The results are given in Table 2. Remarkably, six of the identified


Table 3
Leukemia data set classification results

Set    GRLVQ-C1 (rel.)   GRLVQ-C1 (no rel.)   Neigh.-Analysis
Train  0.19              5.96                 2
Test   2.1               7.14                 5

Average numbers of misclassifications are shown for 100 runs of each GRLVQ-C1 classifier. Relevance utilization is indicated by 'rel.'. The results for the neighborhood analysis (done a single time) are taken from Golub et al. [1].
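The gene ranking behind the candidate list of Table 2, ranking genes by their relevance values within each perfectly classifying run and summing the per-run ranks, can be sketched like this (illustrative code with stand-in data; function and variable names are ours):

```python
import numpy as np

def top_genes_by_relevance(relevance_runs, n_top=10):
    """relevance_runs: (runs, genes) array of learned relevance profiles.

    Within each run, the most relevant gene gets rank 0; genes are then
    ordered by their rank sum over all runs, and the n_top best returned."""
    # double argsort turns relevance values into per-run ranks
    ranks = np.argsort(np.argsort(-relevance_runs, axis=1), axis=1)
    return np.argsort(ranks.sum(axis=0))[:n_top]
```

With rank 0 denoting the most relevant gene, the smallest rank sum corresponds to the 'highest sum of ranks' criterion of the text under the opposite rank orientation; applied to the 19 perfect runs, the ten best genes give the candidate list.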


Fig. 3. Leukemia data embedded by correlation-based multi-dimensional scaling. Left: original data set. Right: data set after application of GRLVQ-C1 relevance factors. Class prototypes are indicated by large symbols.
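The significance of the gene-list overlap discussed in this section is a hypergeometric tail probability: the chance that at least six of ten genes picked at random out of 7129 land in a fixed list of 50. It can be reproduced directly with standard-library combinatorics (sketch):

```python
from math import comb

# P(at least 6 of 10 randomly chosen genes out of 7129 fall into a fixed
# candidate list of 50) -- a hypergeometric tail
P = sum(comb(50, k) * comb(7129 - 50, 10 - k) for k in range(6, 11)) / comb(7129, 10)
print(P)  # on the order of 1.8e-11
```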

genes are consistent with the reference list. The event of 'finding at least six matches' has a vanishing probability of

$$P = \sum_{k=6}^{10} \binom{50}{k} \binom{7129 - 50}{10 - k} \Bigg/ \binom{7129}{10} \approx 1.8 \times 10^{-11}$$

for randomly selected genes. This means that, although only 10 instead of 50 genes are considered for brevity here, these genes are confirmed as being of special importance. The functional role of the other four identified genes might still be of interest from a biological point of view, but this remains open for investigation in further studies. Finally, the grouping of all 72 data samples is visualized together with the GRLVQ-C prototypes by correlation-based multi-dimensional scaling (HiT-MDS [10]) in Fig. 3. For the original data, shown in the left panel, there is a rough unsupervised separation of the 7129-dimensional gene expression vectors according to their type AML or ALL. The corresponding GRLVQ-C1 prototypes define a tight data separation boundary which is still imperfect due to noisy data grouping. However, after the rescaling transform utilizing the GRLVQ-C1 relevance factors, much clearer data clusters are obtained, as shown in the right panel of Fig. 3. This visual aid re-emphasizes how the curse of dimensionality is effectively circumvented by using an adaptive metric driven by the available auxiliary class information.

4.3. Validation of gene expression experiments

The third study is connected to developmental gene expression series obtained from macroarrays. Expression

patterns of 1421 genes were collected from filial tissues of barley seeds during seven developmental stages between 0 and 12 days after flowering in steps of two days. For control purposes, each experiment has been repeated using two sets of independently grown plant material. The question of interest is, whether a systematic difference can be found in the gene expression profiles resulting from the two experimental series. Thus, 1421 data vectors in seven dimensions are considered for each of the two classes related to series 1 and 2. Random partitions into 50% training and 50% test sets are trained for 2500 epochs and 25 runs for each classification model, GRLVQ-C0 with k ¼ 8, GRLVQ-C1 with k ¼ 7, and GRLVQ-Euc. The exponents have been determined in a number of runs as a compromise between speed of convergence, related to small exponents, and overfitting observed for high values. Table 4 contains the average results of the classifiers with optimum model size. GRLVQ-C0 uses only three prototypes, one for series one, and two for series two. This asymmetry has proofed to be beneficial for classification. Likewise, GRLVQ-C1 makes use of two and three prototypes for series one and two, respectively. The squared Euclidean GRLVQ-Euc yields about guessing results for two prototypes per class; accuracies get better for 20 prototypes per class, but then the generalization is rather poor. All classifiers but the small Euclidean one indicate a detectable difference between the two series of experiments. However, the GRLVQ-C1 -classifier that maintains the opposition of correlation and anti-correlation is a good choice with respect to model size and generalization ability. The relevance profiles for the three classifiers are given in Fig. 4. They show a rough correspondence in identifying relevant developmental stages within a range of 4–8 days. 
However, the details must be considered with care, because different configurations of the relevance profiles are found to minimize the E_GRLVQ cost function, especially for the C1 type. Thus, the data space is too homogeneously filled to emphasize specific dimensions clearly. This is also reflected in the small variability of the relevance factors \lambda_i; in this case, larger relevance learning rates produce unstable solutions. Nevertheless, a biologically consistent interpretation of the relevance profiles has been found: further biological investigations have supported a slight shift in the assignment of developmental stages between the two sets of independent experiments. In the conducted gene expression experiments, a robust transcriptional

ARTICLE IN PRESS M. Strickert et al. / Neurocomputing 69 (2006) 651–659


Table 4
GRLVQ classification accuracies for differentiating between the two series of macroarray experiments; the number of prototypes used is given in brackets

Set      GRLVQ-C0 (3 pt.)   GRLVQ-C1 (5 pt.)   GRLVQ-Euc (4 pt.)   GRLVQ-Euc (40 pt.)
Train    68.07%             66.91%             53.14%              68.38%
Test     64.95%             66.44%             49.66%              58.32%

reprogramming occurred during the intermediate stage, related to days 4–8 of filial tissue development. Although the overall expression data of the two sets of experiments are hardly distinguishable in practice, the slight systematic influence of the assignment of developmental stages affects gene expression during this intermediate phase. These slight differences were detected and could be well exploited by the GRLVQ classifiers, which confirms their use for processing biological observations.

Fig. 4. GRLVQ relevance profiles (relevance factor vs. developmental stage in days after flowering, 0–12) that enhance the distinction of the two experimental gene expression series; each panel shows the relevance profile together with its average and standard deviation. From top to bottom: Euclidean, C0, C1. Different characteristics occur depending on the underlying similarity measure.

5. Conclusions

Adaptive correlation-based similarity measures have been successfully derived and integrated into the existing mathematical framework of GRLVQ learning. The experiments with the GRLVQ-C classifiers show that there is much potential in using non-Euclidean similarity measures. GRLVQ-C1 maintains the opposition of correlation and anti-correlation, while GRLVQ-C0 opposes both characteristics against dis-correlation, which leads to structurally different classifiers. The GRLVQ-C0 pattern matching is somewhat analogous to the property of Hopfield networks, which do not distinguish a pattern from an inverted copy of it. Through the use of Pearson correlation, no preprocessing is required to become independent of data contrast related to scaling and shifting. As a consequence, fewer prototypes are necessary than with the Euclidean metric to represent certain types of data: it has been shown that both the functional Tecator data and the gene expression classification profit from using correlation measures. High sensitivity to specific differences in the data is realized, and very good classification results are obtained. Many other areas of GRLVQ-C application can be thought of, ranging from image processing to mass spectroscopy; these areas profit from relaxed pattern matching in contrast to strict metric-based classification. A very important property of the proposed types of correlation measures, C0 and C1, is their adaptivity for enhanced data discrimination from a global data perspective. As the experiments have shown, relevance scaling helps to find interesting data attributes and thereby drastically increases classification accuracies for high-dimensional data. Even standard methods, like the demonstrated k-NN-C and the MDS visualization technique, can be made much more expressive if the data are preprocessed with the GRLVQ-C scaling factors.
Yet, apart from the practical benefits, an interesting theoretical problem remains: to which extent can the large-margin properties of GRLVQ-Euc be transferred to the new correlation measures of GRLVQ-C? This and a number of further issues will be addressed in future work. GRLVQ-C is available online as an instance of SRNG-GM at http://srng.webhop.net/.
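Two properties claimed for the correlation measures, invariance to shifting and scaling of the inputs, and the Hopfield-like treatment of inverted patterns by C0, can be illustrated with plain Pearson correlation. The distance forms d_c0 and d_c1 below are simplified sketches of my own: the global relevance weights are omitted, and the exact exponentiated forms defined in the paper may differ.

```python
import numpy as np

def pearson(x, w):
    """Pearson correlation between a data vector and a prototype."""
    xc, wc = x - x.mean(), w - w.mean()
    return float(xc @ wc / (np.linalg.norm(xc) * np.linalg.norm(wc)))

def d_c1(x, w, k=7):
    """C1-style distance: keeps correlation opposed to anti-correlation."""
    return ((1.0 - pearson(x, w)) / 2.0) ** k

def d_c0(x, w, k=8):
    """C0-style distance: a pattern and its inverted copy are equally close."""
    return (1.0 - pearson(x, w) ** 2) ** k

x = np.array([1.0, 2.0, 4.0, 3.0])
y = 5.0 * x + 2.0   # shifted and rescaled version of x

# Shifting/scaling leaves the correlation-based distance unchanged ...
print(d_c1(x, y))            # ~0: y is perfectly correlated with x
# ... while the Euclidean distance changes drastically.
print(np.linalg.norm(x - y))

# C0 does not distinguish a pattern from its inverted copy, C1 does.
print(d_c0(x, -x))           # ~0
print(d_c1(x, -x))           # ~1
```

Note how the exponent k only sharpens the contrast between small and large correlation-derived distances; it does not change which prototype is closest.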


Acknowledgements

Thanks to Dr. Volodymyr Radchuk for the macroarray hybridization experiments. We are grateful to Prof. Wolfgang Stadje, University of Osnabrück, for his solution to the combinatorial probability of accidentally identified relevant genes. Many thanks also for the precise remarks of the anonymous reviewers. The present work is supported by BMBF Grant FKZ 0313115, GABI-SEED-II.


Udo Seiffert has been the head of the Pattern Recognition group at the Leibniz-Institute of Plant Genetics and Crop Plant Research Gatersleben (IPK), Germany, since 2002. He has a number of teaching assignments on artificial neural networks as well as evolutionary algorithms at the University of Magdeburg. After completing his Ph.D. in 1998, he received a grant from the German Academic Exchange Service to continue his research work at the University of South Australia, Adelaide. His research interests focus on computational intelligence, image processing, and high-performance computing within these fields.

References

[1] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, E. Lander, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (5439) (1999) 531–537.
[2] B. Hammer, M. Strickert, T. Villmann, On the generalization ability of GRLVQ networks, Neural Process. Lett. 21 (2005) 109–120.
[3] B. Hammer, T. Villmann, Generalized relevance learning vector quantization, Neural Networks 15 (2002) 1059–1068.
[4] S. Kaski, Bankruptcy analysis with self-organizing maps in learning metrics, IEEE Trans. Neural Networks 12 (2001) 936–947.
[5] T. Kohonen, Self-Organizing Maps, third ed., Springer, Berlin, 2001.
[6] F. Rossi, B. Conan-Guez, A.E. Golli, Clustering functional data with the SOM algorithm, in: Proceedings of ESANN 2004, Bruges, Belgium, 2004, pp. 305–312.
[7] F. Rossi, N. Delannay, B. Conan-Guez, M. Verleysen, Representation of functional data in neural networks, Neurocomputing 64 (2005) 183–210.
[8] A. Sato, K. Yamada, Generalized learning vector quantization, in: G. Tesauro, D. Touretzky, T. Leen (Eds.), Advances in Neural Information Processing Systems 7 (NIPS), MIT Press, Cambridge, MA, 1995, pp. 423–429.
[9] M. Strickert, N. Sreenivasulu, W. Weschke, U. Seiffert, T. Villmann, Generalized relevance LVQ with correlation measures for biological data, in: M. Verleysen (Ed.), European Symposium on Artificial Neural Networks (ESANN), D-side Publications, 2005, pp. 331–338.
[10] M. Strickert, S. Teichmann, N. Sreenivasulu, U. Seiffert, High-throughput multi-dimensional scaling (HiT-MDS) for cDNA-array expression data, in: W. Duch, J. Kacprzyk, E. Oja, S. Zadrożny (Eds.), Artificial Neural Networks: Biological Inspirations, ICANN 2005, Springer Lecture Notes in Computer Science, Springer, Berlin, 2005, pp. 625–633.
[11] N. Villa, F. Rossi, Support vector machine for functional data classification, in: Proceedings of ESANN 2005, Bruges, Belgium, 2005, pp. 467–472.
[12] T. Villmann, F. Schleif, B. Hammer, Supervised neural gas and relevance learning in learning vector quantization, in: T. Yamakawa (Ed.), Proceedings of the Workshop on Self-Organizing Networks (WSOM), Kyushu Institute of Technology, 2003, pp. 47–52.

Marc Strickert obtained his Ph.D. in Computer Science in 2005 at the University of Osnabrück in the research group 'Learning with Neural Methods on Structured Data'. Self-organizing neural learning architectures for high-dimensional data analysis, time series processing, and pattern recognition are his major topics of interest. He is currently working in the Pattern Recognition Group at the Leibniz-Institute of Plant Genetics and Crop Plant Research Gatersleben (IPK), Germany.

Nese Sreenivasulu completed his Ph.D. in biology in 2002. Presently, he is working as a post-doctoral research scientist at the Institute of Plant Genetics and Crop Plant Research (IPK) with a focus on genomic studies of barley seeds.

Winfriede Weschke studied chemistry at the TU Merseburg (Germany). She finished her Ph.D. work in organic photochemistry in 1980 and afterwards worked at the former Zentralinstitut für Genetik und Kulturpflanzenforschung (ZIGUK) in Gatersleben on genetic engineering of seed storage proteins of Vicia faba. She took a permanent position at the Institute of Plant Genetics and Crop Plant Research (IPK) in 1990. Presently, she is working on the molecular physiology of barley seed development with a main focus on expression analysis.

Thomas Villmann works in the Clinic for Psychotherapy at the University of Leipzig, where he is the head of the computational intelligence group. His research interests include theory and applications of neural networks, in particular SOM, neural gas, and LVQ, as well as data mining and evolutionary algorithms. Applications cover medical problems, satellite remote sensing, and model optimization. He holds a diploma degree in mathematics and a Ph.D. in computer science, both received from the University of Leipzig, Germany. Further, he is a founding member of the German chapter of ENNS (GNNS).

Barbara Hammer received her Ph.D. in Computer Science in 1995 and her venia legendi in Computer Science in 2003, both from the University of Osnabrück, Germany. During 2000–2004, she was leader of the junior research group 'Learning with Neural Methods on Structured Data' at the University of Osnabrück, before becoming professor of Theoretical Computer Science at Clausthal University of Technology, Germany, in 2004. Several research visits have taken her to Pisa and Padova (Italy), Birmingham (UK), Bangalore (India), Paris (France), and the USA. She is coauthor of more than 60 papers in international journals and conferences on different aspects of Computational Intelligence, most of which can be retrieved from http://www.in.tu-clausthal.de/hammer/
