High-Throughput Multi-dimensional Scaling (HiT-MDS) for cDNA-Array Expression Data

M. Strickert¹, S. Teichmann², N. Sreenivasulu¹, and U. Seiffert¹

¹ Pattern Recognition Group, Gene Expression Group, Institute of Plant Genetics and Crop Plant Research Gatersleben
{stricker, srinivas, seiffert}@ipk-gatersleben.de
² University of Osnabrück
[email protected]
Abstract. Multidimensional scaling (MDS) is a powerful dimension reduction technique for embedding high-dimensional data into a low-dimensional target space. In doing so, the distance relationships of the source space are reconstructed in the target space as faithfully as possible according to a given embedding criterion. Here, a new stress function with intuitive properties and very good convergence behavior is presented. Its optimization is combined with an efficient implementation for calculating dynamic distance matrix correlations, and this implementation can be transferred to other related algorithms. The suitability of the proposed MDS for high-throughput data (HiT-MDS) is studied in applications to macroarray analysis for up to 12,000 genes.

Keywords: Multi-dimensional scaling, clustering, gene expression analysis.
1 Introduction
Data analysis is a multi-stage process that usually starts with raw data inspection, from which iteratively more sophisticated data models are formulated. Thus, for unknown data, such as gene expression levels from a first series of screening experiments, many subsequent assumptions are based upon the groundwork of the initial results. Visualization techniques in particular can help to gain basic knowledge about the data. For a modest number of continuous-valued attributes, scatter plots provide very direct access to the data. For high-dimensional input, linear mapping techniques like principal component analysis (PCA) or projection pursuit (PP) can help to focus on data projections that are 'interesting' from a mathematical point of view, e.g. the directions of largest variance or views on the central data mass [3,14]. Self-organizing maps (SOM) are well-established non-linear neural tools for clustering data on a rectangular or hexagonal lattice of specialized neurons [9]. The SOM, however, imposes a data model that is structurally different from linear mappings. Since data are represented by prototypes, the one-to-one correspondence between input data and their low-dimensional counterparts vanishes: in the case of Ultsch's emergent SOM there are more prototypes than input points, but usually input samples
W. Duch et al. (Eds.): ICANN 2005, LNCS 3696, pp. 625–633, 2005. © Springer-Verlag Berlin Heidelberg 2005
are mapped to a smaller number of specialized neurons. As a result, prototype-based representations give a more abstract view of the data, which is beneficial for model generalization and simplification but not for revealing details of the data distribution or of the data granularity. A nonlinear method that assigns each data point a related copy in the target space is multi-dimensional scaling (MDS) [10]. Its most common objective for visualization is to arrange the low-dimensional copies in such a way that their mutual distances best match the original distances. This optimization with respect to good surrogate locations is realized by stress function minimization, for which several alternatives exist [2]. Here, MDS is realized by a stochastic gradient descent on a stress function with superior convergence properties. The proposed high-throughput HiT-MDS method is suitable for embedding very large data sets in a distance-preserving way into Euclidean space, usually with 2 or 3 dimensions for visualization. Using MDS for large data sets is closely related to the speed-up of Sammon's mapping suggested by Pekalska et al. [12] and to the work of Naud and Duch, who combine MDS with the data prototypes of a learning vector quantizer [11]. The topic presented here, MDS for gene expression analysis, coincides with a recent publication of Taguchi and Oono introducing a non-metric MDS variant called nMDS [6]. Like other MDS approaches, the proposed HiT-MDS can be used for non-metric data too, for example by considering the rank order of mutual data items; non-symmetric dissimilarity measures and sparse distance matrices can also be handled for obtaining global reconstructions of incomplete local distance information.
2 Improved Multi-dimensional Scaling
The challenge of multi-dimensional scaling lies in the optimization of the free parameters $\hat{x}_i \in \hat{X}^{n \times d}$, which are the locations of points $\hat{x}_i = (\hat{x}_{i1}, \ldots, \hat{x}_{id})$ in the $d$-dimensional target space corresponding to the $i = 1 \ldots n$ input points $x_i \in X^{n \times q}$. The most canonic approach, forcing the mutual distances $\hat{d}_{ij} = d(\hat{x}_i, \hat{x}_j)$ of all data pairs indexed by $(i,j)$ towards the original distances $d_{ij} = d(x_i, x_j)$, is defined by minimizing the raw stress function of classical scaling [5]:

$$ s = \sum_{i<j}^{n} (d_{ij} - \hat{d}_{ij})^2 \;\overset{!}{=}\; \min \qquad \text{with distances} \qquad d_{ij}(x_i, x_j) = \sum_{k=1}^{q} (x_{ik} - x_{jk})^2 . $$
For computational convenience, the squared Euclidean distance is considered, and symmetry $d_{ij} = d_{ji}$ can be assumed for the distance matrix. However, as pointed out by Basalaj [1], the straightforward minimization of this stress by gradient descent is prone to getting stuck in local optima. For Euclidean distances, though, eigenvalue methods can be used to find a solution, making classical scaling equivalent to PCA [5]. The freely available XGvis package by Buja et al. [2] implements different types of MDS for which eigenmethods fail, and it comes with a very good documentation of the technical details. Their stress function is of the general form $s = (1 - \cos^2\varphi)^{\frac{1}{2}}$, which is based on the squared cosine between the original distance matrix $D$ and the reconstructed distance matrix $\hat{D}$; thereby, the variables in $\varphi$ are parameters of the used Minkowski metric, of the data weighting, and of the specification of the order of the moments of the original distances. Here, an approach is taken which is less versatile than the one of Buja et al., but which yields very faithful overall data representations: such most 'honest' data displays are aimed at by maximizing the Pearson correlation of distances:

$$ r(D, \hat{D}) = \frac{\sum_{i \neq j}^{n} (d_{ij} - \mu_D) \cdot (\hat{d}_{ij} - \mu_{\hat{D}})}{\sqrt{\sum_{i \neq j}^{n} (d_{ij} - \mu_D)^2 \cdot \sum_{i \neq j}^{n} (\hat{d}_{ij} - \mu_{\hat{D}})^2}} \;=:\; \frac{B(\hat{d})}{\sqrt{C \cdot D(\hat{d})}} \;\in\; [-1; 1] . $$
Matrix $D = (d_{ij})_{i,j=1 \ldots n}$ contains the pattern distances and $\hat{D} = (\hat{d}_{ij})_{i,j=1 \ldots n}$ those of the reconstructions. Maximizing $r$ yields a balanced consideration of all distance matrix entries; to the authors' knowledge, this maximization cannot be solved with eigenmethods, which motivates the proposed heuristic solution. The right fraction is a convenient one-to-one correspondence to the left term, used in the following: $B(\hat{d})$ is related to the mixed summation of both original and reconstructed distances, $D(\hat{d})$ refers to the dissimilarities dependent on the choices of the reconstructions $\hat{X}$, and $C$ denotes the connection to the initially calculated and thus constant input pattern distances. In order to formulate an efficient correlation-based stress function for minimization, the tight range $[-1; 1]$ of the Pearson correlation ($-1$ denoting anti-correlation, $0$ uncorrelatedness, and $1$ perfect correlation) is swapped and widened by an inverse power transform:

$$ s = r(D, \hat{D})^{-K} \qquad \text{for Euclidean distances} \qquad \hat{d}_{ij} = \sum_{l=1}^{d} (\hat{x}_{il} - \hat{x}_{jl})^2, \quad j \neq i, \;\; i,j = 1 \ldots n . $$
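The correlation-based stress can be computed directly from the two collections of off-diagonal distances. The following minimal Python sketch (an illustrative re-implementation, not the authors' C program; all names are ours) evaluates $s = r(D, \hat{D})^{-K}$. Note that $s = 1$ exactly when the target distances are a positively scaled and shifted copy of the source distances, since Pearson correlation is invariant under such transforms:

```python
import math

def offdiag(M):
    """Collect the off-diagonal entries of a square distance matrix."""
    n = len(M)
    return [M[i][j] for i in range(n) for j in range(n) if i != j]

def hitmds_stress(D, Dhat, K=1):
    """Correlation-based stress s = r(D, Dhat)^(-K); s -> 1 as the target
    distances become perfectly (positively) correlated with the source ones."""
    x, y = offdiag(D), offdiag(Dhat)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    B = sum((a - mx) * (b - my) for a, b in zip(x, y))   # mixed term B(dhat)
    C = sum((a - mx) ** 2 for a in x)                    # constant term C
    Dd = sum((b - my) ** 2 for b in y)                   # target term D(dhat)
    r = B / math.sqrt(C * Dd)
    return r ** (-K)

D = [[0, 1, 4], [1, 0, 9], [4, 9, 0]]
Dscaled = [[2 * v + 3 if i != j else 0 for j, v in enumerate(row)]
           for i, row in enumerate(D)]
# a positively scaled and shifted copy still correlates perfectly: s ~ 1.0
print(hitmds_stress(D, Dscaled, K=4))
```

Any stretching or compression of the target configuration therefore leaves the stress unchanged; only the relative arrangement of the points matters.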
Integers $K > 0$ are assumed. For even $K$, best 'inverse' point configurations with anti-correlated distance relationships might in principle be found. This theoretical situation, however, has never been encountered experimentally; maybe the applied random projection initialization of the embedded points prevents such a defect. Thus, no matter which relationships are coded by $D$, the inverse power $r^{-K}$ can be minimized by optimally arranging the reconstructions $\hat{X}$ in the Euclidean target space. This is achieved by a gradient descent on the stress function $s$, which requires finding zeros of the derivatives of $s$ w.r.t. the free parameters $\hat{x}_{ik}$:

$$ s = r^{-K} \circ \hat{D} \circ \hat{X} \;\overset{!}{=}\; \min \;\;\Rightarrow\;\; \frac{\partial s}{\partial \hat{x}_{ik}} = \sum_{j \neq i}^{n} \frac{\partial r^{-K}}{\partial \hat{d}_{ij}} \cdot \frac{\partial \hat{d}_{ij}}{\partial \hat{x}_{ik}} \overset{!}{=} 0, \quad i = 1 \ldots n \qquad (1) $$
Solutions are found by iterative updates $\Delta \hat{x}_{ik} = -\gamma \cdot \partial s / \partial \hat{x}_{ik}$ of step size $\gamma$ in the direction of the steepest gradient of $s$. The missing derivatives in Eqn. 1 are

$$ \frac{\partial r^{-K}}{\partial \hat{d}_{ij}} = K \cdot r^{-K} \cdot \frac{(\hat{d}_{ij} - \mu_{\hat{D}}) \cdot B^{-}_{ij}(\hat{d}) - (d_{ij} - \mu_D) \cdot D^{-}_{ij}(\hat{d})}{B(\hat{d}) \cdot D(\hat{d})} $$

with

$$ B^{-}_{ij}(\hat{d}) = B(\hat{d}) - (d_{ij} - \mu_D) \cdot (\hat{d}_{ij} - \mu_{\hat{D}}), \qquad D^{-}_{ij}(\hat{d}) = D(\hat{d}) - (\hat{d}_{ij} - \mu_{\hat{D}})^2 $$

and

$$ \frac{\partial \hat{d}_{ij}}{\partial \hat{x}_{ik}} = \frac{\partial}{\partial \hat{x}_{ik}} \sum_{l=1}^{d} (\hat{x}_{il} - \hat{x}_{jl})^2 = 2 \cdot (\hat{x}_{ik} - \hat{x}_{jk}) . $$
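The analytic derivative can be checked numerically. The sketch below (illustrative names, pure Python) evaluates $\partial r^{-K} / \partial \hat{d}_{ij}$ for a single matrix entry using the $B^{-}_{ij}$, $D^{-}_{ij}$ form, which expands to $(\hat{d}_{ij} - \mu_{\hat{D}}) B - (d_{ij} - \mu_D) D$, and compares it against a finite difference of the stress:

```python
import math, random

def stress_and_entry_grad(d, dhat, i, j, K=1):
    """Stress s = r^(-K) and the analytic partial derivative of s with
    respect to the single target distance dhat[i][j]."""
    n = len(d)
    pairs = [(a, b) for a in range(n) for b in range(n) if a != b]
    mu_d = sum(d[a][b] for a, b in pairs) / len(pairs)
    mu_h = sum(dhat[a][b] for a, b in pairs) / len(pairs)
    B = sum((d[a][b] - mu_d) * (dhat[a][b] - mu_h) for a, b in pairs)
    C = sum((d[a][b] - mu_d) ** 2 for a, b in pairs)
    D = sum((dhat[a][b] - mu_h) ** 2 for a, b in pairs)
    r = B / math.sqrt(C * D)
    Bm = B - (d[i][j] - mu_d) * (dhat[i][j] - mu_h)   # B-_ij(dhat)
    Dm = D - (dhat[i][j] - mu_h) ** 2                 # D-_ij(dhat)
    ds = K * r ** (-K) * ((dhat[i][j] - mu_h) * Bm
                          - (d[i][j] - mu_d) * Dm) / (B * D)
    return r ** (-K), ds

# toy matrices whose entries grow with |a - b|, so r is clearly positive
rng = random.Random(0)
n = 6
d = [[abs(a - b) + rng.random() for b in range(n)] for a in range(n)]
dhat = [[1.1 * abs(a - b) + rng.random() for b in range(n)] for a in range(n)]

i, j, eps = 1, 4, 1e-6
s0, ds = stress_and_entry_grad(d, dhat, i, j, K=2)
dhat[i][j] += eps
s1, _ = stress_and_entry_grad(d, dhat, i, j, K=2)
# the forward difference closely matches the analytic derivative
print((s1 - s0) / eps, ds)
```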
The embedding procedure is outlined in Algorithm 1. High-dimensional data items are assumed to be embedded. They are used for the initialization of their low-dimensional counterparts by random linear mapping. Random mappings provide a first shot at loosely distance-preserving initial target point configurations [8]; here, the matrix elements of $R$ (Alg. 1, l. 2) are randomly taken from $[-1; 1]$. The input data are also used for calculating the distance matrix $D$ which, in an alternative initialization setup, could be taken directly from file. For computational convenience the mean is subtracted from $D$, which does not affect the Pearson correlation computations $r(D, \hat{D})$. Target point adaptation is done iteratively in a neural network training manner: in each step a random point is taken and its distance relations to the other points are improved by changing its coordinates. Convergence takes place when the stress function cannot be further minimized for any point, i.e. when the correlation $r(D, \hat{D})$ is maximal.

Algorithm 1 HiT-MDS
 1: Read input data $X$.
 2: Initialize $\hat{X}$ by random projection $\hat{X}^{n \times d} = X^{n \times q} \cdot R^{q \times d}$.
 3: Calculate distance matrix $D$ and subtract its mean $\Rightarrow$ constant $C$.
 4: Calculate $\hat{D}$ $\Rightarrow$ initial $B(\hat{d})$, $D(\hat{d})$.
 5: repeat
 6:   Draw a pattern index $1 \le i \le n$ from a randomly shuffled list.
 7:   for all $j \neq i$ do
 8:     $\Delta \hat{x}_i \leftarrow \Delta \hat{x}_i - \frac{\partial r^{-K}}{\partial \hat{d}_{ij}} \cdot \frac{\partial \hat{d}_{ij}}{\partial \hat{x}_i}$   { accumulate derivatives corresponding to Eqn. 1 }
 9:   end for
10:   $\hat{x}_i \leftarrow \hat{x}_i + \gamma \cdot \Delta \hat{x}_i$   { adapt location of target point }
11:   Recalculate the distances $\hat{d}(\hat{x}_i, \hat{x}_j)$ influenced by the new position of point $\hat{x}_i$;
12:   thereby, keep track of the changes in $B(\hat{d})$ and $D(\hat{d})$.
13: until convergence criterion is met.
14: Postprocess: center and normalize $\hat{X}$ by the largest dimension variance.
A substantial speedup is obtained if, for each update step, the distance matrix correlation is not completely recalculated but incrementally adjusted. An unoptimized implementation would take on the order of $O(n^2)$ matrix elements into consideration, but since only $n-1$ distances $\hat{d}_{ij}$ change during each iteration on pattern $i$, much computing power is saved by tracking only the changes $\Delta \hat{d}_{ij}$ in the constituents $B(\hat{d})$ and $D(\hat{d})$ of $r$. Generally, new values are expressed by means of the existing old values plus the specific changes caused by the adaptation of the $i$-th target point. Reformulating the average target distance yields:

$$ \mu(\hat{D}^{\text{new}}) = \mu(\hat{D}^{\text{old}} + \Delta \hat{D}^i) = \mu(\hat{D}^{\text{old}}) + \mu(\Delta \hat{D}^i), \qquad \Delta \hat{D}^i = (\Delta \hat{d}_{kj})_{k,j=1 \ldots n} $$

with zero elements $\Delta \hat{d}_{kj}$ whenever neither $k$ nor $j$ equals $i$. In the same manner, incremental update formulas can be given for the constituents of the correlation $r$ and thus for the terms in the stress function $s$ that depend on the target distances (Alg. 1, l. 12):
$$ B^{\text{new}}(\hat{d}) = B^{\text{old}}(\hat{d}) + 2 \sum_{j \neq i}^{n} \Delta \hat{d}_{ij} \cdot d_{ij} , $$

$$ D^{\text{new}}(\hat{d}) = D^{\text{old}}(\hat{d}) + (n^2 - n) \cdot T + 2 \sum_{j \neq i}^{n} \Delta \hat{d}_{ij} \cdot \Big( \Delta \hat{d}_{ij} + 2 \cdot \big( \hat{d}^{\text{old}}_{ij} - \mu(\hat{D}^{\text{old}}) - \mu(\Delta \hat{D}^i) \big) \Big) \qquad \text{with } T := \mu(\Delta \hat{D}^i)^2 , $$

where the factor 2 accounts for the fact that both symmetric entries $\hat{d}_{ij}$ and $\hat{d}_{ji}$ change with the position of point $\hat{x}_i$.
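These update rules can be verified numerically. The sketch below (illustrative pure Python, not the authors' implementation) moves one target point, applies the incremental formulas to $B$ and $D$, and compares the result against a full recomputation:

```python
import random

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def full_BD(d, dhat):
    """Full O(n^2) recomputation of B(dhat) and D(dhat) over ordered pairs."""
    n = len(d)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    mu_h = sum(dhat[i][j] for i, j in pairs) / len(pairs)
    B = sum(d[i][j] * (dhat[i][j] - mu_h) for i, j in pairs)  # mu_D = 0 here
    D = sum((dhat[i][j] - mu_h) ** 2 for i, j in pairs)
    return B, D, mu_h

random.seed(1)
n, q, dim = 8, 5, 2
X = [[random.gauss(0, 1) for _ in range(q)] for _ in range(n)]
Y = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

# mean-subtracted source distances (Alg. 1, l. 3)
d = [[sqdist(X[i], X[j]) for j in range(n)] for i in range(n)]
mu_d = sum(d[i][j] for i in range(n) for j in range(n) if i != j) / (n * n - n)
for i in range(n):
    for j in range(n):
        if i != j:
            d[i][j] -= mu_d

dhat = [[sqdist(Y[i], Y[j]) for j in range(n)] for i in range(n)]
B_old, D_old, mu_old = full_BD(d, dhat)

# move target point i and record the n-1 distance changes of its row
i = 3
Y[i] = [c + 0.5 for c in Y[i]]
delta = [sqdist(Y[i], Y[j]) - dhat[i][j] if j != i else 0.0 for j in range(n)]
mu_delta = 2 * sum(delta) / (n * n - n)   # mean of the sparse change matrix

# incremental updates (factor 2: entries (i,j) and (j,i) both change)
B_new = B_old + 2 * sum(delta[j] * d[i][j] for j in range(n) if j != i)
D_new = (D_old + (n * n - n) * mu_delta ** 2
         + 2 * sum(delta[j] * (delta[j] + 2 * (dhat[i][j] - mu_old - mu_delta))
                   for j in range(n) if j != i))

# reference: write the changes into dhat and recompute everything
for j in range(n):
    if j != i:
        dhat[i][j] += delta[j]
        dhat[j][i] += delta[j]
B_ref, D_ref, _ = full_BD(d, dhat)
print(abs(B_new - B_ref), abs(D_new - D_ref))  # both differences are tiny
```

Per moved point, the incremental path touches only $O(n)$ terms instead of the $O(n^2)$ sums of the reference computation.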
MDS with the new stress function, in combination with the above speedup formulas, is designed for high-throughput operation and is therefore referred to as HiT-MDS.¹ The free parameters of HiT-MDS are the exponent $K$ of the stress function and the step size $\gamma$, which can be regarded as a learning rate. Empirical studies on different data sets have shown that the proposed algorithm is very robust with respect to the choices of $\gamma$ and $K$. A two-phase learning scheme, first a rough ordering with $K = 4$ and increasing $\gamma = 10^{-3}, 10^{-2}, 10^{-1}$, followed by fine tuning with $K = 1$ and $\gamma = 5, 25, 100, 250$, is likely to meet the requirements even of unknown data sets; usually, between 10 and 500 training epochs suffice to obtain good convergence for the current parameter set.
3 Application to cDNA Array Expression Data
Two studies are presented that show the suitability of HiT-MDS for the analysis of high-throughput expression data. In the experimental design, several thousand gene expression patterns were analyzed which correspond to barley seed development in the time span of 0–26 days after flowering, in three distinct tissues: pericarp (p), endosperm (end), and embryo (e). The conducted experiments resulted in high-throughput expression data with a relatively small number of experiments and a large number of gene expression levels. The two major questions of interest are: 1. How are the experiments, representing the tissues at a particular developmental stage, characterized with respect to the transcriptome similarity of their specifically expressed genes? 2. Can inter-experimental clusters of coexpressed genes be identified to which regulatory functions can afterwards be assigned? In order to address these key questions it is necessary to bring the high-dimensional data into a lower dimension for visualization, thereby keeping their original distances. HiT-MDS is applied to obtain such visualizations, first for grouping the experiments with similar transcriptomes, and secondly for identifying the hot-spots of coexpressed genes in unfiltered expression data.

1. Experiment Clustering. The first HiT-MDS application aims at the visualization of the inter-experiment relationships: a two-dimensional embedding is wanted that reconstructs the distances of the original space of gene expression intensities. The common way to do this is principal component analysis (PCA). More recently, independent component analysis (ICA), a linear decomposition technique, has been used [7], or locally linear embeddings (LLE) are computed [13]. However, none of these methods is explicitly formulated, like MDS, for the most natural approach to the original data, which is the best available
¹ A C-language command line version is available at http://hitmds.webhop.net/.
[Figure: two scatter plots of the 28 gene expression experiments embedded in 2D; left panel 'Principal Component Plot' (axes pc1, pc2), right panel 'Multi-Dimensional Scaling' (axes $\hat{x}_1$, $\hat{x}_2$), with experiment labels such as 00e1 through 12p2.]
Fig. 1. Gene expression experiments embedded in 2D. Left: PCA. Right: HiT-MDS.
distance reconstruction. The analyzed data set comprises 28 experiments, covering 7 developmental time points of endosperm/embryo (e) and pericarp (p) tissue, each reproduced in two independent experiments (1, 2); each experiment is represented by 1,421 log-transformed gene expression values. Fig. 1 compares the experiment visualization of PCA (left) and HiT-MDS (right). Both embeddings show a temporal ordering from 00 to 12 days, but the tissues e vs. p are more strictly separated by HiT-MDS, and the 00–04e1/2 cluster as well as the 08p1/2 pair are more faithfully spread. Distance correlations characterize the reconstruction quality; they are $r(X, \hat{X}_{\text{PCA}}) = 0.973$ and $r(X, \hat{X}_{\text{HiT-MDS}}) = 0.981$, which supports higher confidence in the HiT-MDS embedding. For these obtained groupings, the identification of regulating key genes will be of interest in future investigations.

2. Gene Clustering. A much more demanding experiment is conducted with data from 12k-arrays (11,786 genes) available for 31 developmental stages of three tissue types. The left panel of Fig. 2 shows a density plot of the embedded 31-dimensional genes, zoomed to the region of highest interest. Dark patches represent areas of high density, i.e. many points with small Euclidean distance, leading to an overall correlation of $r(X, \hat{X}_{\text{HiT-MDS}}) = 0.979$. In contrast, PCA yields a correlation of only $r(X, \hat{X}_{\text{PCA}}) = 0.937$, which would result in misleading interpretations of a PCA-based density plot (omitted). HiT-MDS becomes even more useful for coexpression analysis if the Euclidean distance measure for the 31-dimensional gene vectors is replaced by (1 − correlation) between them; then the maximum dissimilarity is 2, and perfectly correlated profiles have dissimilarity 0. Thus, a 2D-embedding of such a dissimilarity matrix will cluster gene profiles of high correlation opposed to anti-correlated groups of genes. This is illustrated in the right panel of Fig. 2 for a preselected subset of 2000 embedded genes. By focusing on these fewer points, i.e. by loosening the global embedding constraints, more details become visible. The gene profiles associated with the points inside the exemplarily picked, encircled high-density areas are plotted in Fig. 3. A good spatial separation and specificity is observed for the embedded genes and their corresponding profiles, making the proposed approach a valuable tool for gene expression analysis: it is demonstrated that by HiT-MDS hot-spots of
[Figure: left panel, density plot of the 2D embedding of all 12k genes based on Euclidean inter-gene distances; right panel, correlation-based embedding of the 2000 gene subset with two encircled high-density clusters C1 and C2.]
Fig. 2. Genes embedded in 2D. Left: Euclidean inter-gene embedding of 12k-genes. Right: correlation-based similarity on 2000 gene subset with finer clustering details.
coexpressed genes can be identified despite the presence of noise in the expression data by simultaneously processing all available data. Such coexpressed genes will be further examined for their functional role.
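The (1 − correlation) dissimilarity used above is straightforward to compute from the gene profiles. A minimal sketch (toy profiles for illustration, not the barley data):

```python
import math

def pearson(u, v):
    """Pearson correlation between two expression profiles."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def corr_dissimilarity(profiles):
    """(1 - correlation) matrix: ~0 for coexpressed (perfectly correlated)
    profiles, ~2 for anti-correlated ones."""
    n = len(profiles)
    return [[0.0 if i == j else 1.0 - pearson(profiles[i], profiles[j])
             for j in range(n)] for i in range(n)]

# toy profiles over four developmental stages
g1 = [0.0, 1.0, 2.0, 3.0]
g2 = [1.0, 3.0, 5.0, 7.0]   # a scaled/shifted copy of g1: coexpressed
g3 = [3.0, 2.0, 1.0, 0.0]   # the mirror image of g1: anti-correlated
M = corr_dissimilarity([g1, g2, g3])
print(M[0][1], M[0][2])  # close to 0 and close to 2
```

Feeding such a matrix to the embedding in place of the Euclidean distances yields the correlation-space layout of Fig. 2 (right): coexpressed genes collapse into common hot-spots while anti-correlated groups are pushed apart.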
4 Conclusions
The proposed improvement of multi-dimensional scaling, HiT-MDS, allows embedding very large data sets into a low-dimensional Euclidean space, making the method suitable for visual data inspection. As a nonlinear method, HiT-MDS is able to capture even minor differences in the data by faithfully reconstructing given distance relations. The corresponding low-dimensional surrogates are obtained by a neural-network type of incremental training that minimizes a fast-converging stress function. This function is based on the correlation between input and output distances; thereby, the functional terms contributing to the correlation calculation are also updated in an incremental way. In contrast to SOM learning, HiT-MDS is not a mapping technique, which restricts its applicability to static data sets. However, no topological parameters have to be chosen; consequently, HiT-MDS is not subject to topological restrictions of an underlying
[Figure: expression profiles of the genes in clusters C1 (left panel) and C2 (right panel), plotted over the developmental stages of the end/e and p tissue samples.]

Fig. 3. Gene profiles corresponding to cluster C1 (left) and C2 (right) of Fig. 2.
neuron grid, and the success of the data model can be well evaluated by the correlation between the input distances and those of the embedded points. Obviously, HiT-MDS can be applied to many problems beyond bioinformatics, e.g. for the visualization of high-dimensional items, for graph layout [4], and also for space conversion like the one illustrated here, from correlation space to Euclidean space. Future attention will be put on further improvements of the stress function: first preliminary results show that even faster initial convergence and more tolerant parameter choices are achieved by replacing the presented exponential correlations by Fisher's Z transform. Finally, parallel implementations on an SMP machine are under consideration for making interactive zooming into large subsets of the analyzed data easily possible.

Acknowledgement. Thanks to Dr. Volodymyr Radchuk for helping with probe preparation for the macroarray experiments. Thanks also to the anonymous reviewers for helping to improve this contribution with their valuable comments. The work is supported by BMBF grant FKZ 0313115, GABI-SEED-II.
References

1. W. Basalaj. Proximity Visualization of Abstract Data. PhD thesis, Computer Lab, Univ. of Cambridge, URL: http://www.pavis.org/essay/pavis.pdf, 2001.
2. A. Buja, D. Swayne, M. Littman, N. Dean, and H. Hofmann. Interactive Data Visualization with Multidimensional Scaling. Report, University of Pennsylvania, URL: http://www-stat.wharton.upenn.edu/~buja/, 2004.
3. D. Cook, A. Buja, and J. Cabrera. Grand tour and projection pursuit. Journal of Computational and Graphical Statistics, 4(3):155–172, 1995.
4. E. Gansner, Y. Koren, and S. North. Graph drawing by stress majorization. In Lecture Notes in Computer Science, Proc. of the 12th Int. Symp. on Graph Drawing (GD'04). Springer, 2004.
5. J. Gower. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:325–338, 1966.
6. Y.-h. Taguchi and Y. Oono. Relational patterns of gene expression via non-metric multidimensional scaling analysis. Bioinformatics, 21(6):730–740, 2005.
7. A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 1st edition, 2001.
8. S. Kaski. Dimensionality reduction by random mapping: Fast similarity computation for clustering. In Proc. of the International Joint Conf. on Neural Networks (IJCNN'98), volume 1, pages 413–418. IEEE Service Center, Piscataway, NJ, 1998.
9. T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, 3rd edition, 2001.
10. J. Kruskal and M. Wish. Multidimensional Scaling. Sage Publications, Newbury Park, California, 1978.
11. A. Naud and W. Duch. Visualization of large data sets using MDS combined with LVQ. In L. Rutkowski and J. Kacprzyk, editors, Advances in Soft Computing, pages 632–637. Physica Verlag, 2002.
12. E. Pekalska, D. de Ridder, R. Duin, and M. Kraaijveld. A new method of generalizing Sammon mapping with application to algorithm speed-up. In M. Boasson, J. Kaandorp, J. Tonino, and M. Vosselman, editors, Proc. 5th Annual Conf. of the Advanced School for Computing and Imaging, pages 221–228. ASCI, 1999.
13. S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
14. M. Strickert, S. Teichmann, N. Sreenivasulu, and U. Seiffert. 'DiPPP' Online Self-Improving Linear Map for Distance-Preserving Macro-Array Data Analysis. To appear at WSOM 2005.