
Current Bioinformatics, 2010, 5, 204-216

On Biclustering of Gene Expression Data

Anirban Mukhopadhyay*,1, Ujjwal Maulik2 and Sanghamitra Bandyopadhyay3

1Department of Theoretical Bioinformatics, DKFZ (Deutsches Krebsforschungszentrum, German Cancer Research Center), Im Neuenheimer Feld 580, D-69120, Heidelberg, Germany; 2Department of Computer Science and Engineering, Jadavpur University, Kolkata-700032, India; 3Machine Intelligence Unit, Indian Statistical Institute, Kolkata-700108, India

Abstract: Microarray technology enables the simultaneous monitoring of the expression patterns of a huge number of genes across different experimental conditions or time points. Biclustering of microarray data is an important technique for discovering groups of genes that are co-regulated in a subset of experimental conditions. Traditional clustering algorithms find groups of genes or conditions over the complete feature space. Therefore, they may fail to discover local patterns in which a subset of genes behaves similarly over a subset of conditions. Biclustering algorithms aim to discover such local patterns from the gene expression matrix, and can thus be thought of as clustering genes and conditions simultaneously. In recent years, a large number of biclustering algorithms have been proposed in the literature. This article studies various issues regarding the biclustering problem and presents a comprehensive survey of available biclustering algorithms. A survey of freely available biclustering software is also included.

Keywords: Microarray, gene expression, biclustering, bicluster types, biclustering algorithms, biclustering software.

1. INTRODUCTION

The classical approach to genomic research was based on the local study and collection of data on single genes. With the advancement of microarray technology, it has become feasible to obtain a global and simultaneous view of the expression levels of many thousands of genes across different time points or experimental conditions [1]. In recent years, microarray technology has had a major impact on many fields, such as medical diagnosis, bio-medicine, characterization of gene functions, understanding of molecular biological processes and gene expression profiling [2-5]. The development of microarrays has also created new application opportunities for data mining methodologies. Microarray chips measure the expression levels of a large number of genes and hence produce large amounts of data. Because of this volume, computational analysis is essential for extracting knowledge from microarray gene expression data.

Clustering [6] is one of the primary approaches for analyzing such large amounts of data, and it has been used to identify sets of genes with similar expression profiles (co-expressed genes). In some early works, visual analysis was successfully used to group genes into functionally relevant classes in the Yeast cell cycle [3, 7] and Human large B-cell lymphoma [2] data sets. However, as these methods were very subjective, standard clustering methods, such as K-means [8], fuzzy C-means [9], hierarchical methods [4], Self Organizing Maps (SOM) [10], the graph theoretic approach [11], the simulated annealing based approach [12] and genetic algorithm (GA) based clustering methods [13], have been utilized for clustering microarray data.

Clustering algorithms have been applied to microarray data either to group the genes across the time points or experimental conditions/samples [10, 13-16], or to group the samples across the genes [17-19]. Clustering techniques that aim to find clusters of genes over all the experimental conditions may fail to discover genes that have similar expression patterns over only a subset of conditions. Similarly, a clustering algorithm that groups the conditions/samples across all the genes may not capture a group of samples that have similar expression values for only a subset of genes. It is often the case that a subset of genes is co-regulated and co-expressed across a subset of experimental conditions and has largely different expression patterns over the remaining conditions. Traditional clustering methods are not able to identify such local patterns, which are usually termed biclusters. Biclustering can thus be thought of as the simultaneous clustering of genes and conditions, instead of clustering them separately. The aim of biclustering algorithms is to discover a subset of genes that are co-regulated over a subset of experimental conditions. Hence, they provide a better reflection of the biological reality.

Although biclustering is a relatively new approach to gene expression data, it has a fast growing literature. In this article, we discuss several issues of biclustering, including a comprehensive review of the recent literature. The rest of the article is organized as follows: the next section describes the structure of a microarray gene expression data set. In Section 3, the biclustering problem is defined formally and different related definitions are provided. In Section 4, the available biclustering algorithms are discussed. Section 5 describes some publicly available biclustering software. Section 6 concludes the article.

*Address correspondence to this author (on leave from the Department of Computer Science and Engineering, University of Kalyani, Kalyani – 741235, India); Tel: +91 33 2580 9618; Fax: +91 33 2582 8282; E-mail: [email protected]

Current Bioinformatics, 2010, Vol. 5, No. 3. 1574-8936/10 $55.00+.00 © 2010 Bentham Science Publishers Ltd.

2. MICROARRAY GENE EXPRESSION DATA

A microarray [20] is a small chip onto which a large number of DNA molecules (probes) are attached in fixed grids. The chip is made of chemically coated glass, nylon, membrane or silicon. Each grid cell of a microarray chip corresponds to a DNA sequence. There are mainly two types of microarrays, viz., two-channel microarrays and single-channel microarrays [21]. In two-channel microarrays (also called two-color microarrays), two mRNA samples are reverse-transcribed into cDNA (targets) labelled using different fluorescent dyes (red-fluorescent dye Cy5 and green-fluorescent dye Cy3). Due to the complementary nature of the base-pairs, the cDNA binds to the specific oligonucleotides on the array. In the subsequent stage, the dye is excited by a laser so that the amount of cDNA can be quantified by measuring the fluorescence intensities. The log ratio of the intensities of the two dyes is used as the gene expression level:

$$\text{gene expression level} = \log_2 \frac{\text{Intensity(Cy5)}}{\text{Intensity(Cy3)}} \tag{1}$$

Although absolute levels of gene expression may be determined using two-channel microarrays, the system is more useful for determining relative differences in gene expression within a sample and between samples. Single-channel microarrays (also called one-color microarrays) are prepared to estimate the absolute levels of gene expression, and thus require two separate single-dye hybridizations to compare two sets of conditions. As only a single dye is used, the data represent absolute values of gene expression. An advantage of single-channel microarrays is that the data are more easily compared with arrays from different experiments. However, in a single-channel system, one needs twice as many microarrays to compare the samples within an experiment.

Mathematically, a microarray data set can be viewed as a $G \times C$ matrix $A(\mathcal{G}, \mathcal{C})$ that represents the expression levels of a set of $G$ genes $\mathcal{G} = \{I_1, I_2, \ldots, I_G\}$ over a set of $C$ conditions $\mathcal{C} = \{J_1, J_2, \ldots, J_C\}$. Each element $m_{ij}$ of the matrix $A(\mathcal{G}, \mathcal{C})$ represents the expression level of the $i$th gene at the $j$th condition, where $i \in \mathcal{G}$ and $j \in \mathcal{C}$ (Eqn. 2).

$$A(\mathcal{G}, \mathcal{C}) = \begin{array}{c|cccc} & J_1 & J_2 & \cdots & J_C \\ \hline I_1 & m_{11} & m_{12} & \cdots & m_{1C} \\ I_2 & m_{21} & m_{22} & \cdots & m_{2C} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ I_G & m_{G1} & m_{G2} & \cdots & m_{GC} \end{array} \tag{2}$$

3. BICLUSTERING PROBLEM AND DEFINITIONS

Given a $G \times C$ microarray data matrix $A(\mathcal{G}, \mathcal{C})$ consisting of a set of $G$ genes $\mathcal{G} = \{I_1, I_2, \ldots, I_G\}$ and a set of $C$ conditions $\mathcal{C} = \{J_1, J_2, \ldots, J_C\}$, a bicluster can be defined as follows:

Definition 1 (Bicluster) A bicluster is a submatrix $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$, of the matrix $A(\mathcal{G}, \mathcal{C})$, where $I \subseteq \mathcal{G}$ and $J \subseteq \mathcal{C}$, and the subset of genes in the bicluster is similarly expressed over the subset of conditions.

The problem of biclustering is thus to identify a set of biclusters from a given data matrix, using some coherence criterion to evaluate the quality of the biclusters. In general, the complexity of a biclustering problem depends on the coherence criterion used. However, in almost all cases, the biclustering problem is known to be NP-complete. Therefore, a number of approaches use heuristics for discovering biclusters from a gene expression matrix. Depending on how the genes in a bicluster are similar to each other under the experimental conditions, biclusters can be categorized into different types. The following subsection provides the definitions of the different types of biclusters.

3.1. Types of Biclusters

There are mainly six types of biclusters, viz., (1) biclusters with constant values, (2) biclusters with constant rows, (3) biclusters with constant columns, (4) biclusters with additive pattern, (5) biclusters with multiplicative pattern and (6) biclusters with both additive and multiplicative pattern. The additive and multiplicative patterns are also referred to as shifting and scaling patterns, respectively [22]. The different types of biclusters are defined as follows:

Definition 2 (Biclusters with Constant Values) In a bicluster $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$ with constant values, all the elements have the same value, i.e.,

$$m_{ij} = \pi, \quad i \in I, j \in J. \tag{3}$$

Definition 3 (Biclusters with Constant Rows) In a bicluster $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$ with constant rows, all the elements of each row of the bicluster have the same value. Hence in this type of bicluster, each element is represented using one of the following notations:

$$m_{ij} = \pi + a_i, \quad i \in I, j \in J, \tag{4}$$

$$m_{ij} = b_i, \quad i \in I, j \in J, \tag{5}$$

$$m_{ij} = b_i + a_i, \quad i \in I, j \in J. \tag{6}$$

Here $\pi$ is a constant value for a bicluster, $a_i$ is the additive (shifting) factor for row $i$ and $b_i$ is the multiplicative (scaling) factor for row $i$.

Definition 4 (Biclusters with Constant Columns) In a bicluster $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$ with constant columns, all the elements of each column of the bicluster have the same value. Hence in this type of bicluster, each element is represented using one of the following notations:

$$m_{ij} = \pi + p_j, \quad i \in I, j \in J, \tag{7}$$

$$m_{ij} = q_j, \quad i \in I, j \in J, \tag{8}$$

$$m_{ij} = q_j + p_j, \quad i \in I, j \in J. \tag{9}$$

Here $\pi$ is a constant value for a bicluster, $p_j$ is the additive (shifting) factor for column $j$ and $q_j$ is the multiplicative (scaling) factor for column $j$.

Definition 5 (Biclusters with Additive Pattern) In a bicluster $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$ with additive (shifting) pattern, each column and row has only some additive (shifting) factors. Hence in this type of bicluster, each element is represented as:

$$m_{ij} = \pi + a_i + p_j, \quad i \in I, j \in J. \tag{10}$$

Definition 6 (Biclusters with Multiplicative Pattern) In a bicluster $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$ with multiplicative (scaling) pattern, each column and row has only some multiplicative (scaling) factors. Hence in this type of bicluster, each element is represented as:

$$m_{ij} = b_i q_j, \quad i \in I, j \in J. \tag{11}$$

Definition 7 (Biclusters with both Additive and Multiplicative Patterns) In a bicluster $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$ with both additive (shifting) and multiplicative (scaling) pattern, each column and row has both additive (shifting) and multiplicative (scaling) factors. Hence in this type of bicluster, each element is represented as:

$$m_{ij} = b_i q_j + a_i + p_j, \quad i \in I, j \in J. \tag{12}$$

Note that these biclusters are the most general form of biclusters. All other types of biclusters are special cases of these biclusters.

Fig. (1). Examples of different types of biclusters: (a) Constant, (b) Row-constant, (c) Column-constant, (d) Additive Pattern, (e) Multiplicative Pattern, (f) Both Additive and Multiplicative Patterns.

3.2. Some Important Definitions

Here we discuss some important terms regarding biclusters and the biclustering problem.

Definition 8 (Bicluster Variance) The bicluster variance $VARIANCE(I, J)$ of a bicluster $M(I, J)$ is defined as follows:

$$VARIANCE(I, J) = \sum_{i \in I, j \in J} (m_{ij} - m_{IJ})^2, \tag{13}$$

where $m_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} m_{ij}$, i.e., the mean of the elements in the bicluster.

Definition 9 (Residue) The residue $r_{ij}$ of any element $m_{ij}$ of a bicluster $M(I, J)$ is defined as:

$$r_{ij} = m_{ij} - m_{iJ} - m_{Ij} + m_{IJ}, \tag{14}$$

where $m_{iJ}$ is the mean of the $i$th row, i.e., $m_{iJ} = \frac{1}{|J|} \sum_{j \in J} m_{ij}$, $m_{Ij}$ is the mean of the $j$th column, i.e., $m_{Ij} = \frac{1}{|I|} \sum_{i \in I} m_{ij}$, and $m_{IJ}$ is the mean of all the elements in the bicluster, i.e., $m_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} m_{ij}$.

Definition 10 (Mean Squared Residue) The mean squared residue $MSR(I, J)$ of a bicluster $M(I, J)$ is defined as:

$$MSR(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} r_{ij}^2. \tag{15}$$

The mean squared residue score of a bicluster represents the level of coherence among the elements of the bicluster. A lower residue score indicates greater coherence and thus better quality of the bicluster.

Definition 11 (Row Variance) The row variance $VAR(I, J)$ of a bicluster $M(I, J)$ is defined as:

$$VAR(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (m_{ij} - m_{iJ})^2. \tag{16}$$
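As an illustration of Definitions 8 to 11, the following minimal Python sketch computes the residue, the mean squared residue, the row variance and the bicluster variance of a submatrix with NumPy. The function names and the toy shifting and scaling factors are hypothetical and chosen only for this example. It also shows, on toy data, that a bicluster with a purely additive pattern has zero MSR (up to floating-point error), whereas a purely multiplicative one in general does not.

```python
import numpy as np

def residues(M):
    """Residue r_ij = m_ij - m_iJ - m_Ij + m_IJ (Eqn. 14)."""
    M = np.asarray(M, dtype=float)
    row_mean = M.mean(axis=1, keepdims=True)  # m_iJ, one value per row
    col_mean = M.mean(axis=0, keepdims=True)  # m_Ij, one value per column
    all_mean = M.mean()                       # m_IJ
    return M - row_mean - col_mean + all_mean

def msr(M):
    """Mean squared residue (Eqn. 15)."""
    return float(np.mean(residues(M) ** 2))

def row_variance(M):
    """Row variance VAR(I, J) (Eqn. 16)."""
    M = np.asarray(M, dtype=float)
    return float(np.mean((M - M.mean(axis=1, keepdims=True)) ** 2))

def bicluster_variance(M):
    """Bicluster variance VARIANCE(I, J) (Eqn. 13)."""
    M = np.asarray(M, dtype=float)
    return float(np.sum((M - M.mean()) ** 2))

# Hypothetical toy factors, used only for illustration.
a = np.array([0.0, 1.0, 3.0])        # row shifts a_i
p = np.array([0.0, 2.0, 5.0, 7.0])   # column shifts p_j
b = np.array([1.0, 2.0, 3.0])        # row scales b_i
q = np.array([1.0, 2.0, 4.0, 8.0])   # column scales q_j

additive = 4.0 + a[:, None] + p[None, :]     # pattern of Eqn. 10
multiplicative = b[:, None] * q[None, :]     # pattern of Eqn. 11

print("MSR, additive pattern:      ", msr(additive))
print("MSR, multiplicative pattern:", msr(multiplicative))
print("Row variance, additive:     ", row_variance(additive))
```

For a multiplicative bicluster, the residue factorizes as $(b_i - \bar{b})(q_j - \bar{q})$, which is why its MSR vanishes only when all row factors or all column factors are equal.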

A high row variance (Eqn. 16) indicates that the rows (genes) of the bicluster have large variance across the conditions. Sometimes high row variance is desirable in order to escape from trivial constant biclusters.

4. BICLUSTERING ALGORITHMS

In recent years, a large number of biclustering algorithms have been proposed for gene expression data analysis. In this section, we discuss some popular biclustering algorithms in different categories such as iterative greedy search, randomized greedy search, evolutionary techniques, graph based algorithms and fuzzy methods.

4.1. Iterative Greedy Search

The concept of biclustering was first introduced by Hartigan [23] in the form of direct clustering. The bicluster variance $VARIANCE(I, J)$ (Eqn. 13) was used as the coherence measure of a bicluster $M(I, J)$. The goal of the algorithm was to extract $K$ biclusters from the given data set while minimizing the sum of the bicluster variances of the $K$ biclusters. In each iteration, the algorithm partitions the data matrix into a set of submatrices, each of which is considered a bicluster. For a constant bicluster, $VARIANCE(I, J)$ is zero. Since every single element of the data matrix trivially satisfies this zero-variance criterion, the algorithm avoids the degenerate solution by running only until the data matrix has been partitioned into $K$ submatrices. Hartigan's algorithm was able to detect constant biclusters only. However, he proposed using other homogeneity criteria to detect other types of biclusters.

Cheng and Church first introduced the biclustering problem in the case of microarray gene expression data [24]. They introduced the coherence measure called Mean Squared Residue (MSR) (Eqn. 15). Cheng and Church proposed a greedy search heuristic that searches for the largest possible bicluster while keeping the MSR under a threshold $\delta$ (called a $\delta$-bicluster). The algorithm has two phases. In the first phase, starting with the complete data matrix, rows and columns are deleted in order to bring the MSR score below $\delta$. For this, Cheng and Church suggested a greedy heuristic that rapidly converges to a locally maximal submatrix with MSR score below $\delta$. In the second phase, rows and columns are added as long as the MSR score does not increase. The same procedure is executed for $K$ iterations in order to discover $K$ $\delta$-biclusters. At each iteration, the bicluster found in the previous iteration is masked with random values in order to avoid overlaps. Since the MSR score is zero for biclusters with constant values, constant rows, constant columns and additive patterns, the Cheng and Church algorithm is able to detect these kinds of biclusters only. Moreover, the algorithm is known to get stuck at local optima often, and it also suffers from random interference due to the masking of biclusters with random values.

In [25], the authors extended the concept of the $\delta$-bicluster to cope with the problem of masking the missing values as well as masking the biclusters found in the previous iteration with random values. In this algorithm, the residue of a specified (non-missing) element in a bicluster is taken as per Eqn. 14, but the residue of an unspecified (missing) element is taken to be zero. This algorithm allows the biclusters to overlap and is thus termed FLexible Overlapped biClustering (FLOC). The FLOC algorithm begins with an initial set of biclusters (seeds) and iteratively improves the overall quality of the biclustering. At each iteration, each row and column is moved among the biclusters to yield a better biclustering in terms of lower MSR. The best biclustering obtained during an iteration is used as the initial biclustering seed in the next iteration. The algorithm terminates automatically when the current iteration fails to improve the overall biclustering quality. Thus FLOC is able to evolve $k$ biclusters simultaneously. However, this algorithm also can only identify constant and additive patterns, and fails to detect multiplicative patterns.

In [26], an algorithm called Order Preserving Sub-matrix (OPSM) is proposed. Here a bicluster is defined as a submatrix where the order of the selected conditions is preserved for all of the selected genes. Hence, the expression values of the genes within a bicluster induce an identical linear ordering across the selected conditions. The authors proposed a deterministic iterative algorithm to find large and statistically significant biclusters. The time complexity of this technique is $O(GC^3 k)$, where $G$ and $C$ are the number of genes and conditions of the input data set, respectively, and $k$ is the number of biclusters found. Thus OPSM does not scale well for high-dimensional data sets.

In [27] and [28], the authors proposed the Iterative Signature Algorithm (ISA), where a bicluster is considered to be a transcription module, i.e., a set of co-regulated genes together with the associated set of regulating conditions. The algorithm starts with an initial set of genes, and all samples are scored with respect to this gene set. The samples for which the score exceeds a predefined threshold are chosen. Similarly, all genes are scored with respect to the selected samples, and a new set of genes is selected based on another user-defined threshold. This procedure is iterated until the set of genes and the set of samples converge, i.e., do not change anymore. ISA can discover more than one bicluster by starting with different initial gene sets. The choice of the initial reference gene set plays an important role in obtaining good quality results with ISA. ISA is highly sensitive to the threshold values and often tends to identify the same strong bicluster many times.

In xMotif biclustering [29], biclusters that contain genes that are almost constantly expressed across the selected conditions are identified. At first, each gene is assigned a set of statistically significant states, which define the set of valid biclusters. In xMotif, a bicluster is considered to be a submatrix where each gene is in exactly the same state for all the selected conditions. The aim is to identify the largest valid bicluster. For this purpose, an iterative search method is proposed that is run on different initial random seeds. It should be noted that the xMotif framework requires pre-identification of the classes of biclusters present in the data, which may not be feasible for most real-life data sets.

In general, greedy search algorithms scale well in large data sets.
However, they mainly suffer from the problem of getting stuck at local optima, depending on the initial configuration.

4.2. Two-Way Clustering

In [30], the authors present a coupled two-way clustering (CTWC) approach to gene microarray data analysis. The main idea is to identify subsets of the genes and samples such that, when one of these is used to cluster the other, stable and significant partitions emerge. They present an algorithm, based on iterative clustering, that performs such a search. This two-way clustering algorithm repeatedly performs one-way clustering on the rows and columns of the data matrix, using stable clusters of rows as attributes for column clustering and vice-versa. Although the authors used hierarchical clustering, any reasonable choice of clustering method and definition of stable cluster can be used within the framework of CTWC. As a preprocessing step, they used normalization, which allowed them to capture biclusters with constant columns also.

Interrelated Two-Way Clustering (ITWC) [31], an algorithm similar to CTWC, combines the results of one-way clustering on both dimensions of the gene expression matrix to produce biclusters. As a preprocessing step, the rows of the data matrix are first normalized. Thereafter, the vector-angle cosine value between each row and a predefined stable pattern is computed to determine whether the row values vary much among the columns. The rows with very little variation are then removed. After that, the correlation coefficient is used to measure the strength of the linear relationship between two rows or two columns, in order to perform the two-way clustering. As the correlation coefficient is independent of the magnitude and depends only on the pattern, ITWC is able to detect both additive and multiplicative biclusters.

The Double Conjugated Clustering (DCC) [32] algorithm is a node-driven algorithm that unifies the two viewpoints of microarray clustering, viz., clustering the samples taking the genes as the features and clustering the genes taking the samples as the features. DCC performs both tasks simultaneously to achieve a unified clustering in which the sample clusters are discriminated by subsets of genes. The clusterings in sample space and gene space are synchronized by a projection of nodes between the spaces, which maps the sample clusters to the corresponding gene clusters. The method may utilize any relevant clustering technique, such as SOM or K-means. Owing to the projection between the two clustering spaces, the data do not scatter across all offered nodes. The DCC algorithm can provide sharp clusters and empty nodes even when the number of nodes exceeds the number of clusters. However, DCC can only find constant biclusters in the input data set.

In general, the two-way clustering algorithms cluster the data set along both dimensions (rows and columns) and finally try to combine the clusterings of the two dimensions in order to obtain the biclusters. However, there is no standard rule for the choice of the number of clusters in the gene and condition dimensions.

4.3. Evolutionary Biclustering

Evolutionary algorithms, like Genetic Algorithms (GA) [33] and Simulated Annealing (SA) [34], have been used extensively for the biclustering problem. Some of these algorithms are described below.

4.3.1. GA Based Biclustering

In [35], a genetic algorithm based biclustering framework has been developed. As an encoding strategy, the authors use a binary string of length $G + C$, where $G$ and $C$ denote the number of genes and the number of conditions/samples/time points, respectively. If a bit position is `1', the corresponding gene or condition is selected in the bicluster, and if a bit position is `0', the corresponding gene or condition is not selected in the bicluster. Hence, each chromosome encodes one possible bicluster. The following fitness function $F$ is minimized:

$$F = \begin{cases} \dfrac{1}{|I||J|} & \text{if } MSR(I, J) \le \delta \\ \dfrac{MSR(I, J)}{\delta} & \text{otherwise.} \end{cases} \tag{17}$$
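The binary encoding and the fitness function of Eqn. 17 can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of [35]: the function names, the toy matrix and the treatment of an empty bicluster (assigned the worst possible fitness) are assumptions made for the example.

```python
import numpy as np

def msr(M):
    """Mean squared residue of a submatrix (Eqn. 15)."""
    M = np.asarray(M, dtype=float)
    r = M - M.mean(axis=1, keepdims=True) - M.mean(axis=0, keepdims=True) + M.mean()
    return float(np.mean(r ** 2))

def decode(chromosome, n_genes):
    """Split a binary string of length G + C into selected gene and condition indices."""
    bits = np.asarray(chromosome, dtype=bool)
    return np.flatnonzero(bits[:n_genes]), np.flatnonzero(bits[n_genes:])

def fitness(chromosome, A, delta):
    """Fitness of Eqn. 17 (to be minimized) for one binary chromosome."""
    genes, conds = decode(chromosome, A.shape[0])
    if genes.size == 0 or conds.size == 0:
        return float("inf")  # assumption: an empty bicluster gets the worst fitness
    score = msr(A[np.ix_(genes, conds)])
    if score <= delta:
        # delta-bicluster: a larger volume |I||J| gives a smaller (better) fitness
        return 1.0 / (genes.size * conds.size)
    return score / delta     # not yet coherent enough: reduce the MSR first

# Hypothetical 4 x 4 data matrix; rows 0-2 and columns 0-2 form an additive bicluster.
A = np.array([[1.0, 2.0, 3.0, 9.0],
              [2.0, 3.0, 4.0, 0.0],
              [4.0, 5.0, 6.0, 7.0],
              [9.0, 0.0, 5.0, 1.0]])

planted = [1, 1, 1, 0, 1, 1, 1, 0]   # first G = 4 bits: genes, last C = 4 bits: conditions
everything = [1] * 8

print(fitness(planted, A, delta=0.5))
print(fitness(everything, A, delta=0.5))
```

Both branches of Eqn. 17 are minimized, and any $\delta$-bicluster has fitness at most 1 while any other submatrix has fitness greater than 1, so coherent biclusters are always preferred and, among them, larger ones.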

In Eqn. 17, if the MSR of the bicluster encoded in a chromosome is less than the threshold $\delta$ (i.e., it is a $\delta$-bicluster), the objective is to maximize the volume. Otherwise, the objective is to minimize the MSR. The algorithm employs a special selection operator called environmental selection to maintain the diversity of the population, so that a set of biclusters can be identified in one run. A local search strategy is used to expedite the rate of convergence. As the local search, one iteration of the Cheng and Church node deletion and addition algorithm is executed before computing the fitness value of a chromosome. The chromosome is also updated with the new bicluster obtained after the local search. Standard uniform crossover and bit-flip mutation operators are adopted for generating the next generation.

A similar GA based biclustering approach can be found in [36]. Here, instead of using the Cheng and Church algorithm as a local search strategy in each step of the fitness computation, it is used only once, initially. The initial population consists of bicluster seeds generated by K-means clustering in both dimensions and combining the gene and sample clusters. Thereafter, these seeds are grown through the Cheng and Church algorithm. Subsequently, the normal GA process follows. As the fitness function, the authors minimized the ratio of MSR to the volume of the biclusters in order to capture large yet coherent biclusters.

Another GA based biclustering method, called Sequential Evolutionary BIclustering (SEBI), is proposed in [37]. In this work also, the authors use binary chromosomes as discussed above. SEBI minimizes the following fitness function:

$$F = \frac{MSR(I, J)}{\delta} + \frac{1}{VAR(I, J)} + w_d + penalty, \tag{18}$$

where $w_d = w_V \left( w_r \frac{\delta}{|I|} + w_c \frac{\delta}{|J|} \right)$. Here $w_V$, $w_r$ and $w_c$ represent weights on the volume, the number of rows and the number of columns in the bicluster, respectively. Also, $penalty = \sum_{i \in I, j \in J} w_p(m_{ij})$, where $w_p(m_{ij})$ is a weight associated with each element $m_{ij}$ of the bicluster and is defined as:

$$w_p(m_{ij}) = \begin{cases} 0 & \text{if } |COV(m_{ij})| = 0 \\ \dfrac{\sum_{k \in I, l \in J} e^{-|COV(m_{kl})|}}{e^{-|COV(m_{ij})|}} & \text{if } |COV(m_{ij})| > 0. \end{cases} \tag{19}$$

Here $|COV(m_{ij})|$ denotes the number of biclusters containing $m_{ij}$. The weight $w_p(m_{ij})$ is used to control the amount of overlap among the biclusters. Binary tournament selection is used. Three crossover operators, viz., one-point, two-point and uniform crossover, have been studied. Three mutation operators, namely standard bit-flip mutation, mutation by adding a row and mutation by adding a column, are also studied. SEBI does not use any local search strategy for updating the chromosomes.

All the above algorithms use chromosomes of length equal to the number of genes plus the number of conditions. Thus the chromosomes are very long if the data set is large. This may cause operators like crossover and mutation to take longer, thus slowing down the convergence. Taking this into account, a novel encoding strategy is proposed in GA based Biclustering (GABI) [38]. Here each string has two parts: one for clustering the genes, and another for clustering the conditions. If $M$ and $N$ denote the maximum number of gene clusters and the maximum number of condition clusters, respectively, then the length of each string is $M + N$. The first $M$ positions represent the $M$ cluster centers for the genes, and the remaining $N$ positions represent the $N$ cluster centers for the conditions. Thus a string looks like the following:

$$\{gc_1 \; gc_2 \; \ldots \; gc_M \; cc_1 \; cc_2 \; \ldots \; cc_N\},$$

where each $gc_i$, $i = 1 \ldots M$, represents the index of a gene that acts as the cluster center of a set of genes, and each $cc_j$, $j = 1 \ldots N$, represents the index of a condition that acts as the cluster center of a set of conditions. For a data set having $n$ points, it is usual to assume that the data set may contain at most $\sqrt{n}$ clusters. Taking this into account, the values of the maximum number of gene clusters ($M$) and the maximum number of condition clusters ($N$) are taken as $\lceil \sqrt{G} \rceil$ and $\lceil \sqrt{C} \rceil$, respectively. Here $G$ and $C$ denote the number of genes and the number of conditions in the data set, respectively. The first $M$ positions can have values in the range $\{0, 1, 2, \ldots, G\}$ and the next $N$ positions can have values in the range $\{0, 1, 2, \ldots, C\}$. Hence the gene and condition cluster centers are represented by the indices of the genes and conditions, respectively, while a value of 0 at any position means the absence of a cluster center.

A string that encodes $M$ gene clusters and $N$ condition clusters represents a set of $M \times N$ biclusters, one for each pair of gene and condition clusters. Each pair $\langle gc_i, cc_j \rangle$, $i = 1 \ldots M$, $j = 1 \ldots N$, represents a bicluster that consists of all genes of the gene cluster centered at gene $gc_i$ and all conditions of the condition cluster centered at condition $cc_j$. During the fitness computation, the gene clusters and condition clusters encoded in the chromosome are updated in a K-means like iteration. The fitness function of a bicluster is defined as follows:

$$F = \frac{MSR(I, J)}{\delta \cdot (1 + VAR(I, J))}. \tag{20}$$

The denominator of $F$ is chosen in this way to avoid an accidental divide-by-zero condition when the row variance $VAR(I, J)$ becomes 0. $F$ is minimized to obtain highly coherent yet “interesting” (high variance) biclusters. For each encoded $\delta$-bicluster, the fitness function $F$ is computed. The fitness of a chromosome is then computed as the mean of the fitness values of all the $\delta$-biclusters encoded in it. Conventional roulette wheel selection and uniform crossover are used in GABI. The mutation operation works as follows. A random position is chosen from the first $M$ positions and its value is replaced by an index randomly chosen from the range $\{0, 1, 2, \ldots, G\}$, where $G$ is the total number of genes. Similarly, to mutate the condition portion of the string, a random position is selected from the next $N$ positions and its value is substituted by a randomly selected index from the range $\{0, 1, 2, \ldots, C\}$, where $C$ is the total number of conditions. Elitism is used to track the best string found up to the current generation.

4.3.2. SA Based Biclustering

There are many instances in the literature that use Simulated Annealing (SA) for the biclustering problem. A standard representation of a configuration in SA is equivalent to the binary string used in GA based biclustering. In [39], this representation is used. Here the initial configuration consists of all `1's, i.e., it encodes the complete data set. The perturbation is equivalent to the bit-flip mutation used in GA. The energy to be minimized is taken as the MSR of the encoded bicluster. A similar approach is found in [40], where instead of starting from the complete data matrix, the author first creates a seed bicluster by clustering the genes and samples and combining them. Thereafter, SA is used to grow the seed. Here also, MSR is used as the coherence measure. The perturbation includes only the addition of a random gene and/or condition.
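The SA scheme described above (all-ones initial configuration, bit-flip perturbation, MSR as energy) can be sketched as follows. This is a minimal sketch under stated assumptions, not the algorithm of [39]: the geometric cooling schedule, its parameter values, the Metropolis acceptance rule and the minimum bicluster size are all assumptions made for this example. The size constraint is added here because minimizing the MSR alone would otherwise shrink the bicluster to a trivial single row or column, whose MSR is always zero.

```python
import math
import random
import numpy as np

def msr(M):
    """Mean squared residue of a submatrix (Eqn. 15)."""
    r = M - M.mean(axis=1, keepdims=True) - M.mean(axis=0, keepdims=True) + M.mean()
    return float(np.mean(r ** 2))

def energy(bits, A, min_rows=2, min_cols=2):
    """MSR of the encoded bicluster; too-small biclusters are forbidden (assumption)."""
    n_genes = A.shape[0]
    rows = np.flatnonzero(bits[:n_genes])
    cols = np.flatnonzero(bits[n_genes:])
    if rows.size < min_rows or cols.size < min_cols:
        return float("inf")
    return msr(A[np.ix_(rows, cols)])

def anneal(A, t_start=1.0, t_end=1e-3, cooling=0.95, moves_per_temp=50, seed=0):
    """Simulated annealing over binary strings of length G + C."""
    rng = random.Random(seed)
    bits = np.ones(sum(A.shape), dtype=bool)  # all '1's: the complete data matrix
    current = energy(bits, A)
    best_bits, best = bits.copy(), current
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            k = rng.randrange(bits.size)
            bits[k] = not bits[k]             # perturbation: flip one random bit
            candidate = energy(bits, A)
            # Metropolis rule: always accept improvements, sometimes accept worse moves
            if candidate <= current or rng.random() < math.exp((current - candidate) / t):
                current = candidate
                if current < best:
                    best_bits, best = bits.copy(), current
            else:
                bits[k] = not bits[k]         # reject: undo the flip
        t *= cooling                          # geometric cooling (assumption)
    return best_bits, best

# Hypothetical toy data: random background with a planted additive 4 x 3 bicluster.
data_rng = np.random.default_rng(0)
A = data_rng.normal(0.0, 3.0, size=(8, 6))
A[:4, :3] = 5.0 + np.arange(4)[:, None] + np.arange(3)[None, :]

best_bits, best_energy = anneal(A)
print("selected genes:     ", np.flatnonzero(best_bits[:8]))
print("selected conditions:", np.flatnonzero(best_bits[8:]))
print("MSR of best bicluster:", best_energy)
```

Because the energy contains no volume term, such a sketch tends to return small biclusters; an energy in the style of Eqn. 17 could be substituted to reward larger $\delta$-biclusters.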

4.3.3. Hybrid Approaches

In [41], a hybrid Genetic Algorithm-Particle Swarm Optimization (GA-PSO) approach, which uses binary strings to encode the biclusters, is proposed. The GA and PSO have their own populations, which evolve through the standard GA and PSO processes, respectively. At each iteration, a random set of individual solutions is exchanged between the two populations. The fitness function is the same as that described in Eqn. 17.

4.3.4. Multiobjective Biclustering

As the biclustering problem requires several objectives to be optimized, such as MSR, volume, row variance etc., some approaches pose biclustering as a multiobjective optimization problem [42]. The work in [38] has been extended to the multiobjective case in [43]. The algorithm is termed MultiObjective GA based Biclustering (MOGAB). Here the authors used the same encoding strategy consisting of gene clusters and condition clusters. Two objectives, viz., $MSR(I, J)$ and $\frac{1}{1 + VAR(I, J)}$, are optimized simultaneously. This algorithm uses NSGA-II [44] as the underlying multiobjective optimization tool. The crossover and mutation operators are kept the same as in [38].

In [45], the authors extended their work of [37] to the multiobjective case. The algorithm is called Sequential Multi-Objective Biclustering (SMOB). Here also they used a binary encoding strategy. Three objective functions, viz., mean squared residue, volume and row variance, are optimized simultaneously. In [46], a Crowding distance based Multi-Objective Particle Swarm Optimization Biclustering (CMOPSOB) algorithm is proposed that uses binary encoding. The algorithm optimizes the MSR, volume and VAR simultaneously. In [47], a hybrid multiobjective biclustering algorithm that combines NSGA-II and Estimation of Distribution Algorithm (EDA) [48] for searching biclusters is proposed. The volume and MSR of the biclusters are optimized simultaneously. A multiobjective artificial immune system based biclustering algorithm that is capable of performing a multi-population search, named MOM-aiNet, is proposed in [49].

In general, evolutionary algorithms are known for their strength in avoiding locally optimum solutions. In particular, when they are equipped with some local search, they can converge fast toward the global optimum. However, the algorithms that optimize MSR as an objective function fail to discover multiplicative patterns. Also, evolutionary algorithms are inherently slower than the greedy iterative algorithms and depend heavily on parameters such as population size, number of generations, crossover and mutation rates, annealing schedule etc. But in general, it has been found that evolutionary algorithms, especially the multiobjective ones, perform better than the greedy search strategies.

4.4. Fuzzy Biclustering

Some recent biclustering algorithms employ fuzzy set theory in order to capture overlapping biclusters. In [50], a flexible fuzzy co-clustering algorithm that incorporates feature-cluster weighting in the formulation is proposed. The algorithm, called Flexible Fuzzy Co-clustering with Feature-cluster Weighting (FFCFW), allows the number of object clusters to be different from the number of feature clusters. A feature-cluster weighting scheme is incorporated for each object cluster generated by FFCFW so that the relationships between the two types of clusters are manifested in the feature-cluster weights. This enables FFCFW to generate a more accurate representation of fuzzy co-clusters. FFCFW uses an iterative optimization procedure.

In [51], a GA based possibilistic fuzzy biclustering algorithm GFBA is proposed. In GFBA, instead of a binary chromosome, the authors use a real valued chromosome of length $G + C$. Each position in the chromosome has a value between 0 and 1, representing the degree of membership of the corresponding gene or condition to the encoded bicluster. They fuzzified the different coherence and quality metrics such as MSR, VAR and volume of the biclusters as follows. The means of each row ($m_{iJ}$), each column ($m_{Ij}$) and all the elements ($m_{IJ}$) of a bicluster are redefined as:

$$m_{iJ} = \frac{\sum_{j=1}^{C} f_J(j)^{\mu}\, m_{ij}}{\sum_{j=1}^{C} f_J(j)^{\mu}}, \qquad (21)$$

$$m_{Ij} = \frac{\sum_{i=1}^{G} f_I(i)^{\mu}\, m_{ij}}{\sum_{i=1}^{G} f_I(i)^{\mu}}, \qquad (22)$$

and

$$m_{IJ} = \frac{\sum_{i=1}^{G} \sum_{j=1}^{C} f_I(i)^{\mu} f_J(j)^{\mu}\, m_{ij}}{\sum_{i=1}^{G} \sum_{j=1}^{C} f_I(i)^{\mu} f_J(j)^{\mu}}. \qquad (23)$$

Here $f_I(i)$ and $f_J(j)$ denote the membership degree of the $i$th gene and $j$th condition to the bicluster, respectively, and $\mu$ is the fuzzy exponent. Hence the fuzzy mean squared residue FMSR is defined as:

$$FMSR(I, J) = \frac{1}{|I||J|} \sum_{i=1}^{G} \sum_{j=1}^{C} f_I(i)^{\mu} f_J(j)^{\mu} (m_{ij} - m_{iJ} - m_{Ij} + m_{IJ})^2, \qquad (24)$$

where $|I| = \sum_{i=1}^{G} f_I(i)^{\mu}$ and $|J| = \sum_{j=1}^{C} f_J(j)^{\mu}$. The objective function to be minimized is selected as:

$$F = FMSR(I, J) + \alpha \, \frac{\sum_{i=1}^{G} FMSR(I, J) \, (1 - f_I(i))^{\mu}}{G - |I|} + \beta \, \frac{\sum_{j=1}^{C} FMSR(I, J) \, (1 - f_J(j))^{\mu}}{C - |J|}, \qquad (25)$$

where the two weights, written here as $\alpha$ and $\beta$, are parameters provided to satisfy different requirements on the incoherence and the sizes of the biclusters. Conventional roulette wheel selection and single point crossover followed by mutation (increasing or decreasing a membership value) have been used. GFBA also uses a bicluster optimization technique at each generation for faster convergence.

In [52], an NSGA-II based multiobjective probabilistic fuzzy biclustering algorithm is proposed which uses chromosomes encoding a set of gene cluster centers and a set of condition cluster centers as in [38, 43]. In this case, the gene and condition cluster centers are updated using one step of fuzzy K-medoids clustering [53] and, for each gene and condition, the fuzzy membership degree to each gene cluster and condition cluster, respectively, is computed. The fuzzy volume of the bicluster $B(I, J)$ corresponding to the $\langle gc_x, cc_y \rangle$ pair ($\langle$gene cluster $x$, condition cluster $y\rangle$) is defined as:

$$fvol(I, J) = \sum_{i=1}^{G} \sum_{j=1}^{C} \mu_{xi}^{m} \, \eta_{yj}^{m}. \qquad (26)$$

Here $I$ is a fuzzy set corresponding to the fuzzy gene cluster centered at $gc_x$. It consists of all genes $g_i$ with membership degree $\mu_{xi}$, $1 \le i \le G$. Similarly, $J$ is a fuzzy set corresponding to the fuzzy condition cluster centered at $cc_y$. It consists of all conditions $c_j$ with membership degree $\eta_{yj}$ (the symbol $\eta$ is the notation used here for the condition memberships), $1 \le j \le C$. The residue of an element $a_{ij}$ of the fuzzy bicluster $B(I, J)$ is defined as:

$$fr_{ij} = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}, \qquad (27)$$

where

$$a_{iJ} = \frac{\sum_{j=1}^{C} \eta_{yj}^{m} \, a_{ij}}{\sum_{j=1}^{C} \eta_{yj}^{m}}, \qquad (28)$$

$$a_{Ij} = \frac{\sum_{i=1}^{G} \mu_{xi}^{m} \, a_{ij}}{\sum_{i=1}^{G} \mu_{xi}^{m}}, \qquad (29)$$

and

$$a_{IJ} = \frac{\sum_{i=1}^{G} \sum_{j=1}^{C} \mu_{xi}^{m} \, \eta_{yj}^{m} \, a_{ij}}{fvol(I, J)}, \qquad (30)$$

where $m$ is the fuzzy exponent. The fuzzy mean squared residue ($FMSR(I, J)$) of the fuzzy bicluster $B = (I, J)$ is defined as:

$$FMSR(I, J) = \frac{1}{fvol(I, J)} \sum_{i=1}^{G} \sum_{j=1}^{C} \mu_{xi}^{m} \, \eta_{yj}^{m} \, fr_{ij}^{2}. \qquad (31)$$

Subsequently, the fuzzy expression profile variance of $B(I, J)$ is computed as:

$$fvar(I, J) = \frac{1}{fvol(I, J)} \sum_{i=1}^{G} \sum_{j=1}^{C} \mu_{xi}^{m} \, \eta_{yj}^{m} \, (a_{ij} - a_{iJ})^{2}. \qquad (32)$$

For each $\langle gc_x, cc_y \rangle$ pair, representing a fuzzy bicluster, the above three objectives ($fvol$, $FMSR$ and $fvar$) are computed. As each chromosome encodes a number of possible biclusters, the average value of each of the above three terms, i.e., fuzzy volume, fuzzy MSR and fuzzy variance, are taken as three objectives to be optimized simultaneously. Note that the first and the third objectives are to be maximized while minimizing the second one. The other genetic operators used are similar to those used in [43].

In [54], another fuzzy biclustering algorithm called Fuzzy Biclustering for Microarray Data Analysis (FBMDA) is proposed. The method employs a combination of the Nelder-Mead and min-max algorithms to construct hierarchically structured biclustering, and thus can represent the biclustering information at different levels. FBMDA uses multiobjective optimization that optimizes volume, variance and fuzzy entropy simultaneously. The Nelder-Mead algorithm is used to compute a single objective optimal solution, and the min-max algorithm is used to trade off between multiple objectives. FBMDA is not subject to convexity limitations, and also does not use derivative information. FBMDA ensures that the current local optimal solution is removed and that a higher precision is reached.

Incorporation of fuzziness in biclustering algorithms enables them to deal with noisy data and overlapping biclusters efficiently. But as most of the aforementioned fuzzy algorithms use evolutionary techniques as the underlying optimization strategy, they suffer from the fundamental disadvantages of evolutionary methods. Furthermore, computation of fuzzy membership degrees takes additional time, which adds to the time taken by the fuzzy biclustering methods.

4.5. Graph Theoretic Approaches

Graph theoretic concepts and techniques have been utilized in detecting biclusters. In [55], the authors introduced SAMBA (Statistical Algorithmic Method for Bicluster Analysis), a graph-theoretic approach to biclustering in combination with a statistical data model. In SAMBA the expression matrix is modeled as a bipartite graph consisting of two sets of vertices corresponding to genes and conditions. A bicluster is defined as a subgraph, and a likelihood score is used in order to assess the significance of observed subgraphs. SAMBA repeatedly finds the maximal highly connected subgraph in the bipartite graph. Then it performs local improvement by adding or deleting a single vertex until no further improvement is possible.
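The bipartite-graph view of an expression matrix used by SAMBA can be sketched as follows. This is only an illustration of the data model, not SAMBA itself: the rule that an edge exists when the absolute expression value reaches `threshold`, and the use of plain edge density in place of SAMBA's likelihood score, are assumptions made here, and both function names are hypothetical.

```python
import numpy as np


def build_bipartite_edges(expr, threshold=1.0):
    """Map each gene index to the set of condition indices it is connected to.

    Assumed edge rule: gene g is linked to condition c when |expr[g, c]| >= threshold.
    """
    n_genes = expr.shape[0]
    return {
        g: set(np.flatnonzero(np.abs(expr[g]) >= threshold).tolist())
        for g in range(n_genes)
    }


def subgraph_density(edges, genes, conds):
    """Fraction of possible gene-condition edges present in the induced subgraph."""
    conds = set(conds)
    present = sum(len(edges[g] & conds) for g in genes)
    return present / (len(genes) * len(conds))


expr = np.array([[2.0, 2.0, 0.0],
                 [2.0, 2.0, 0.0],
                 [0.0, 0.0, 0.0]])
edges = build_bipartite_edges(expr)
dense = subgraph_density(edges, genes=[0, 1], conds=[0, 1])          # fully connected block
sparse = subgraph_density(edges, genes=[0, 1, 2], conds=[0, 1, 2])   # whole graph
```

In this toy matrix the subgraph on genes {0, 1} and conditions {0, 1} is fully connected, whereas the whole graph contains only 4 of its 9 possible edges. A local improvement step of the kind described above would try adding or deleting one vertex at a time and keep the change whenever the score improves.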
SAMBA's time complexity is $O(N 2^{d})$, where d is the upper bound on the degree of each vertex.

The Binary inclusion-Maximal (BiMax) biclustering algorithm proposed in [56] identifies all biclusters in the input matrix. The BiMax algorithm works on a binary matrix. The input matrix is first discretized to zeros and ones according to a user-specified threshold. Based on this binary matrix, BiMax identifies all maximal biclusters, where a bicluster is defined as a submatrix E containing all 1s. An inclusion-maximal bicluster is one that is not completely contained in any other bicluster. The authors used an incremental algorithm to find the inclusion-maximal biclusters, exploiting the fact that the matrix E induces a bipartite graph. As BiMax works with a binary matrix, it is suitable only for detecting constant biclusters.

In [57], the optimal biclustering problem is posed as a problem of maximal crossing number reduction (minimization) in a weighted bipartite graph. In this regard, an algorithm called cHawk is proposed that employs the barycenter heuristic and a local search technique. There are three main steps in the algorithm, viz., construction of a bipartite graph from the input matrix, bipartite graph crossing minimization and, finally, bicluster identification. This approach reorders the matrix so that all rows and columns belonging to the same bicluster are brought into the vicinity of each other. cHawk is able to detect constant, additive and overlapped noisy biclusters.

The graph based biclustering algorithms usually model the input data set as a bipartite graph with two sets of nodes corresponding to the genes and conditions, respectively. The edges of the graph represent the level of overexpression and underexpression of a gene under a certain condition. A bicluster is a subgraph of the bipartite graph, where the genes have coherence across the selected conditions. In these types of algorithms, the genes and conditions are partitioned into the same number of clusters, which may be impractical. Moreover, the input data set has to be discretized properly before applying graph based algorithms. Also, they do not scale well to large data sets.

4.6. Randomized Greedy Search

In [58], a greedy random walk search technique for the biclustering problem, enriched by a local search strategy to escape local optima, has been presented. The algorithm begins with an initial random solution and searches for a locally optimal solution by successive transformations (including random moves depending on some probability) that improve a gain function defined as a combination of mean squared residue, expression profile variance and the volume of the biclusters. The algorithm iterates k times to generate k biclusters.

In [59], the basic concepts of the metaheuristic Greedy Randomized Adaptive Search Procedure (GRASP), namely its construction and local search phases, are reviewed. Also, a variant of GRASP called Reactive Greedy Randomized Adaptive Search Procedure (Reactive GRASP) is proposed to detect significant biclusters from large microarray datasets. The method has two major steps. First, high quality bicluster seeds are generated by applying K-means clustering in both dimensions and combining the clusters. In the second step, these seeds are grown using Reactive GRASP. In Reactive GRASP, the basic parameter that defines the restrictiveness of the candidate list is self-adjusted, depending on the quality of the solutions found previously.

Randomized greedy search algorithms try to combine the advantages of greedy search and randomization, so that they execute fast and do not get stuck at local optima. However, these algorithms still depend heavily on the initial choice of the solution, and there is no clear way to recover from a poor choice.

4.7. Other Recent Approaches

A number of biclustering algorithms that follow new methodologies have appeared in recent literature. Some of them are described here.

In [60], the authors introduce the plaid model, a statistical model assuming that the expression value $m_{ij}$ in a bicluster is the sum of the main effect $\mu$, the gene effect $p_i$, the condition effect $q_j$, and the noise term $\epsilon_{ij}$:

$$m_{ij} = \mu + p_i + q_j + \epsilon_{ij}. \qquad (33)$$

Also, it is assumed that the expression values of two overlapping biclusters are the sum of the two module effects. In the plaid model, a greedy search strategy is used, hence errors can accumulate easily. Moreover, in the case of multiple clusters, the clusters identified by the method tend to overlap to a great extent.

In [61], a biclustering algorithm is proposed based on probabilistic Gibbs sampling. Gibbs sampling does not suffer from the problem of local minima that often characterizes Expectation Maximization. However, when the microarray data is organized in patient vs. gene fashion, and the number of patients is much lower than the number of genes, the algorithm faces computational difficulties. Moreover, the algorithm is only able to identify biclusters with constant columns.

In [62], the authors developed a spectral biclustering method that simultaneously clusters genes and conditions, finding distinctive checkerboard patterns in matrices of gene expression data, if they exist. The method is based on the observation that checkerboard structures can be found in eigenvectors corresponding to the characteristic expression patterns across the genes or conditions. In addition, these eigenvectors can be readily identified by commonly used linear algebra approaches such as singular value decomposition (SVD), coupled with closely integrated normalization steps.

In [63], the authors proposed a biclustering method that employs dynamic programming and a divide-and-conquer technique, as well as efficient data structures such as the trie and zero-suppressed decision diagrams (ZBDDs). Use of ZBDDs extends the stability of the method substantially.

In [64], the authors developed MicroCluster, a deterministic biclustering method. In MicroCluster, only the maximal biclusters satisfying certain homogeneity criteria are considered. The clusters can be arbitrarily positioned anywhere in the input data matrix, and they can have arbitrary overlapping regions.
MicroCluster uses a flexible definition of a cluster that lets it mine several types of biclusters. Moreover, MicroCluster can delete or merge biclusters that have large overlaps. So, it can tolerate some noise in the data set and let the users focus on the most important clusters. As MicroCluster relies on extracting maximal cliques from the constructed range multigraph, it is computationally demanding. Moreover, there are several input parameters that have to be tuned properly in order to find suitable biclusters.

A method based on the application of the non-smooth non-negative matrix factorization technique for discovering local structures (biclusters) from gene expression datasets is developed in [65]. This method utilizes non-negative matrix factorization with non-smoothness constraints for identifying biclusters in gene expression data for a given factorization rank.

In [66], biclustering algorithms using basic linear algebra and arithmetic tools have been developed. The proposed biclustering algorithms can be used to search for all biclusters with constant values, biclusters with constant values on rows, biclusters with constant values on columns, and biclusters with coherent values from a set of data in a timely manner and without solving any optimization problem.

In [67], the authors proposed a biclustering method that alternately sorts the genes and conditions using the dominant set. By using a weighted correlation coefficient, they emphasize the similarities across a subset of the genes/conditions. Additionally, a coherence measure called Average Correlation Value (ACV) is proposed which is effective in determining both additive and multiplicative patterns. Some special preprocessing of the input data set is needed for detecting additive and multiplicative biclusters. To detect different types of biclusters, different runs are needed.

In [68], a biclustering algorithm that adopts a bucketing technique to find a raw submatrix is proposed. The algorithm refines and extends the raw submatrix into a bicluster. The algorithm is called Bucketing and Extending Algorithm (BEA).

A Bayesian BiClustering (BBC) model that uses Gibbs sampling is proposed in [69]. For a single bicluster, the same model as in the plaid model is assumed, whereas for multiple biclusters, the overlapping of biclusters is allowed either in genes or in conditions. Moreover, the authors used a flexible error model, which permits the error term of each bicluster to have a different variance.
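The Average Correlation Value mentioned above for [67] can be illustrated with a short Python sketch. It assumes the commonly cited form of the measure, namely the larger of the average absolute Pearson correlation between distinct rows and between distinct columns of the bicluster; the function name `acv` and the helper `avg_abs_corr` are hypothetical, and the details in [67] may differ.

```python
import numpy as np


def acv(sub):
    """Average Correlation Value of a bicluster (rows = genes, columns = conditions).

    Assumed definition: max over rows/columns of
    (sum of |pairwise Pearson correlations| - k) / (k^2 - k),
    where k is the number of rows (or columns).
    Requires at least two rows and two columns, none of them constant.
    """
    def avg_abs_corr(m):
        k = m.shape[0]
        corr = np.corrcoef(m)  # k x k correlation matrix between the rows of m
        return (np.abs(corr).sum() - k) / (k * k - k)

    return max(avg_abs_corr(sub), avg_abs_corr(sub.T))


base = np.array([1.0, 2.0, 4.0, 7.0])
additive = np.array([base, base + 3.0, base + 10.0])        # rows differ by a shift
multiplicative = np.array([base, 2.0 * base, 5.0 * base])   # rows differ by a scale
acv_add = acv(additive)
acv_mul = acv(multiplicative)
```

Both toy biclusters score (up to rounding) 1, because Pearson correlation is invariant to shifting and to positive scaling, which is consistent with the statement that ACV responds to additive as well as multiplicative patterns.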
In [70], the authors presented a rigorous approach to biclustering, which is based on the Optimal RE-Ordering (OREO) of the rows and columns of a data matrix so as to globally minimize the dissimilarity metric. The physical permutations of the rows and columns of the data matrix can be modeled as either a network flow problem or a traveling salesman problem. Cluster boundaries in one dimension are used to partition and re-order the other dimensions of the corresponding submatrices to generate the biclusters. The reordering of the rows and the columns for large data sets can be computationally demanding.

The authors in [71] proposed an algorithm that finds and reports all maximal contiguous column coherent (CCC) biclusters in time linear in the size of the expression matrix. The linear time complexity of CCC-Biclustering relies on the use of a discretized matrix and efficient string processing techniques based on suffix trees. This algorithm can only detect biclusters with columns arranged contiguously.

In [72], an iterative density based biclustering algorithm called BIDENS is proposed. BIDENS is able to detect a set of k possibly overlapping biclusters simultaneously. The algorithm is similar to FLOC, but instead of using residue as the objective function, it tries to maximize the overall density of the biclusters. The input data set needs to be discretized before the BIDENS algorithm is applied.

5. BICLUSTERING SOFTWARE

A number of free biclustering software tools are available, either for download and offline use or in the form of a web server. Here we list some free/open source biclustering software and discuss them in brief.

5.1. BicAT

BicAT (Biclustering Analysis Toolbox) [73] integrates various biclustering (Cheng and Church, Bimax, xMotif, OPSM, ISA) and clustering techniques (K-means, hierarchical clustering) with a common graphical user interface. Moreover, BicAT provides different facilities for data preparation, inspection and postprocessing such as discretization, filtering of biclusters according to specific criteria or gene pair analysis for constructing gene interconnection graphs. The toolbox is described in the context of gene expression analysis, but is also applicable to other types of data, e.g. data from proteomics or synthetic lethal experiments. The BicAT toolbox is freely available at http://www.tik.ee.ethz.ch/sop/bicat and it is platform independent. The Java source code of the program and a developer's guide are provided on the website as well. Users can add further algorithms or extensions.

5.2. BiVisu

BiVisu (Bicluster detection and Visualization) [74] is an open-source biclustering software tool for detection and visualization of biclusters embedded in a gene expression matrix. By using appropriate coherence relations, BiVisu is able to detect constant, row-constant, column-constant, additive and multiplicative biclusters. The biclustering results can also be visualized under a 2D setting in the form of parallel coordinate (PC) plots for each bicluster. From the PC plots of the biclusters, both objective and subjective cluster quality evaluation can be performed. BiVisu also integrates some data preprocessing and postprocessing techniques. BiVisu has been developed in Matlab and is available at http://www.eie.polyu.edu.hk/~nflaw/Biclustering/ for free download.

5.3. GEMS

GEMS (Gene Expression Mining Server) [75] is a web-enabled service for biclustering microarray gene expression data. Users may upload their gene expression data and specify a set of criteria. GEMS then performs biclustering based on a Gibbs sampling paradigm. The GEMS web server provides a useful and flexible platform for the discovery of co-expressed and potentially co-regulated gene modules. GEMS is open source software and is available at http://genomics10.bu.edu/terrence/gems/ for free download.

5.4. EXPANDER

EXPANDER (EXpression Analyzer and DisplayER) [76] is a Java-based tool for analysis of gene expression data. It is capable of clustering, visualization, biclustering and performing downstream analysis of clusters and biclusters such as functional enrichment and promoter analysis. In general, EXPANDER can analyze groups of genes for enrichment of transcription factor binding sites in their promoters. EXPANDER currently integrates the SAMBA [55] biclustering algorithm. The software is freely downloadable from http://acgt.cs.tau.ac.il/expander/.

5.5. BicOverlapper

BicOverlapper [77] is a tool for visualizing biclusters from gene-expression matrices in a way that helps to compare biclustering methods, to unravel trends and to highlight relevant genes and conditions. The technique is based on a force-directed graph where biclusters are represented as flexible overlapped groups of genes and conditions. The BicOverlapper software and supplementary material are available at http://vis.usal.es/bicoverlapper.

5.6. BiGGEsTS

BiGGEsTS (Biclustering Gene Expression Time Series) [78] is a free and open source software tool providing an integrated environment for the biclustering of time series gene expression data. It offers a set of biclustering algorithms (CCC [71], e-CCC [71], CC-TSB [79]) for time series expression data. Moreover, it implements several visualization techniques such as colored matrices, expression evolution charts, pattern charts, dendrograms and gene ontology graphs. BiGGEsTS integrates well known techniques for preprocessing data: filtering genes, filling missing values, smoothing, normalization and discretization. The software is available at http://kdbio.inescid.pt/software/biggests/.

5.7. BICLUST

BICLUST is an R package for biclustering analysis which contains a collection of bicluster algorithms, preprocessing methods (normalization and discretization) for two way data, and validation and visualization techniques for bicluster results. The main function biclust provides several algorithms to find biclusters in two-dimensional data: Cheng and Church, Spectral, Plaid Model, Xmotifs and BiMax. The package is available at the following website: http://crantastic.org/packages/biclust.

CONCLUSION AND FUTURE CHALLENGES

Biclustering is a method for simultaneous clustering of both genes and conditions of a microarray gene expression matrix. Unlike clustering, biclustering methods try to capture local modules, i.e., sets of genes that are coregulated and coexpressed in a subset of conditions. In recent times, there has been tremendous growth in biclustering research and a large number of algorithms have been proposed. In this article, we have made an attempt to present a comprehensive review of the biclustering models. Recent biclustering algorithms of different categories along with their pros and cons have been discussed. Moreover, an overview of some freely available biclustering software is provided.

Most of the biclustering algorithms have been applied to microarray gene expression data sets for identifying coregulated genes and classifying tissue samples. Biclustering algorithms have also been applied to detect different responses to treatment, and the sets of genes to be used as the most effective probes, mainly in cancer microarrays, such as Leukemia [29, 30, 61]. Other than gene expression data sets, biclustering algorithms have also been successfully applied to e-commerce data and collaborative filtering [80], marketing data [81], and text mining [82].

Although many publications are appearing in the biclustering area, many challenges remain to be addressed by researchers. In many of the papers, the authors have posed biclustering as an optimization problem that optimizes some coherence measure. Many such algorithms optimize MSR to capture coherent biclusters. However, it has recently been proved that MSR is only able to detect constant and additive patterns and is unable to detect multiplicative or combined patterns [22]. Therefore, it is a challenge for researchers to devise a new coherence measure that can capture both additive and multiplicative patterns and, more desirably, combined patterns as well. Moreover, there is still no generally accepted measure for comparing the quality of the biclusters obtained using different biclustering algorithms. Therefore, it is difficult to judge the superiority of any particular biclustering algorithm. This issue must be addressed by researchers. Furthermore, studies are needed on extending biclustering algorithms to generate triclusters from 3D gene-sample-time microarray data sets.
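The limitation of MSR noted above can be checked on a toy example. For a purely multiplicative bicluster $a_{ij} = s_i b_j$, the row means are $s_i \bar{b}$, the column means are $\bar{s} b_j$ and the overall mean is $\bar{s}\bar{b}$, so the residue of each element is $(s_i - \bar{s})(b_j - \bar{b})$, which is nonzero unless the rows or the columns are constant. For an additive bicluster $a_{ij} = s_i + b_j$ the same calculation gives a residue of exactly zero. The following Python sketch, with the hypothetical helper `msr_full` and made-up values, reproduces this.

```python
import numpy as np


def msr_full(sub):
    """Mean squared residue of a complete submatrix."""
    row_mean = sub.mean(axis=1, keepdims=True)
    col_mean = sub.mean(axis=0, keepdims=True)
    residue = sub - row_mean - col_mean + sub.mean()
    return float((residue ** 2).mean())


b = np.array([1.0, 2.0, 4.0, 7.0])   # condition profile (illustrative values)
s = np.array([1.0, 2.0, 5.0])        # per-gene shift or scale (illustrative values)

additive = s[:, None] + b[None, :]        # a_ij = s_i + b_j
multiplicative = s[:, None] * b[None, :]  # a_ij = s_i * b_j

msr_add = msr_full(additive)
msr_mul = msr_full(multiplicative)
```

Here `msr_add` is zero up to rounding, while `msr_mul` equals the product of the population variances of `s` and `b`, i.e. (26/9) x (21/4), approximately 15.17, even though both matrices are perfectly coherent in their own sense.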

REFERENCES

[1] Sharan R, Adi M-K, Shamir R. CLICK and EXPANDER: A system for clustering and visualizing gene expression data. Bioinformatics 2003; 19: 1787-99.
[2] Alizadeh AA, Eisen MB, Davis R, et al. Distinct types of diffuse large B-cell lymphomas identified by gene expression profiling. Nature 2000; 403: 503-11.
[3] Chu S, DeRisi J, Eisen M, et al. The transcriptional program of sporulation in budding yeast. Science 1998; 282: 699-705.
[4] Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998; 95(25): 14863-68.
[5] Bandyopadhyay S, Maulik U, Wang JTL. Analysis of Biological Data: A Soft Computing Approach. World Scientific 2007.
[6] Jain AK, Dubes RC. Algorithms for clustering data. Englewood Cliffs, NJ: Prentice-Hall 1988.
[7] Cho RJ, Campbell MJ, Winzeler EA, et al. A genome-wide transcriptional analysis of mitotic cell cycle. Mol Cell 1998; 2: 65-73.
[8] Herwig R, Poustka A, Meuller C, Lehrach H, O'Brien J. Large-scale clustering of cDNA fingerprinting data. Genome Res 1999; 9(11): 1093-105.
[9] Dembele D, Kastner P. Fuzzy c-means method for clustering microarray data. Bioinformatics 2003; 19(8): 973-80.
[10] Tamayo P, Slonim D, Mesirov J, et al. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 1999; 96: 2907-12.
[11] Hartuv E, Shamir R. A clustering algorithm based on graph connectivity. Inform Proc Lett 2000; 76(200): 175-81.
[12] Lukashin AV, Fuchs R. Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics 2001; 17(5): 405-19.
[13] Bandyopadhyay S, Mukhopadhyay A, Maulik U. An improved algorithm for clustering gene expression data. Bioinformatics 2007; 23(21): 2859-65.
[14] Maulik U, Mukhopadhyay A, Bandyopadhyay S. Combining pareto-optimal clusters using supervised learning for identifying coexpressed genes. BMC Bioinformatics 2009; 10: 27.
[15] Yeung KY, Haynor DR, Ruzzo WL. Validating clustering for gene expression data. Bioinformatics 2001; 17(4): 309-18.
[16] Qin ZS. Clustering microarray gene expression data using weighted Chinese restaurant process. Bioinformatics 2006; 22(16): 1988-97.
[17] Pan H, Zhu J, Han D. Genetic algorithms applied to multi-class clustering for gene expression data. Genomics Proteomics Bioinformatics 2003; 1: 279-87.
[18] Tasoulis DK, Plagianakos VP, Vrahatis MN. Unsupervised clustering of bioinformatics data. In Eur Symp Int Tech Hybrid Syst implementation Smart Adaptive Syst 2004; pp. 47-53.
[19] Mukhopadhyay A, Maulik U, Bandyopadhyay S. Unsupervised cancer classification through SVM-boosted multiobjective fuzzy clustering with majority voting ensemble. In Proc IEEE Congress on Evolutionary Comput 2009; pp. 255-61.
[20] Causton HC, Quackenbush J, Brazma A. Microarray gene expressions data analysis: A beginner's guide. Blackwell Pub., April 2003.
[21] http://en.wikipedia.org/wiki/dna microarray.
[22] Aguilar-Ruiz JS. Shifting and scaling patterns from gene expression data. Bioinformatics 2005; 21(20): 3840-45.
[23] Hartigan J. Direct clustering of a data matrix. J Am Stat Assoc 1972; 67(337): 123-29.
[24] Cheng Y, Church GM. Biclustering of gene expression data. Proc Int Conf Int Syst Mol Biol 2000; pp. 93-103.
[25] Yang J, Wang W, Wang H, Yu P. Enhanced biclustering on expression data. In Proc 3rd IEEE Conf Bioinform Bioeng 2003; pp. 321-27.
[26] Ben-Dor A, Chor B, Karp R, Yakhini Z. Discovering local structure in gene expression data: The order-preserving sub-matrix problem. In Proc 6th Ann Int Conf Comput Biol 2002; 1-58113-498-3: pp. 49-57.
[27] Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N. Revealing modular organization in the yeast transcriptional network. Nat Genet 2002; 31: 370-7.
[28] Ihmels J, Bergmann S, Barkai N. Defining transcription modules using large-scale gene expression data. Bioinformatics 2004; 20: 1993-2003.
[29] Murali TM, Kasif S. Extracting conserved gene expression motifs from gene expression data. In Proc Pacific Symp Biocomput 2003; 8: 77-88.
[30] Getz G, Levine E, Domany E. Coupled two-way cluster analysis of gene microarray data. Proc Natl Acad Sci USA 2000; 12079-84.
[31] Tang C, Zhang L, Zhang I, Ramanathan M. Interrelated two-way clustering: An unsupervised approach for gene expression data analysis. In Proc Sec IEEE Int Symp Bioinform Bioeng 2001; pp. 41-8.
[32] Busygin S, Jacobsen G, Krämer E, Ag C. Double conjugated clustering applied to leukemia microarray data. In Proc 2nd SIAM ICDM Workshop on clustering high dimensional data 2002.
[33] Goldberg DE. Genetic algorithms in search optimization and machine learning. New York: Addison-Wesley 1989.
[34] Kirkpatrick S, Gelatt CD, Vecchi MP. Optimization by simulated annealing. Science 1983; 220: 671-80.
[35] Bleuler S, Prelic A, Zitzler E. An EA framework for biclustering of gene expression data. In Proc IEEE Congress on Evolutionary Comput 2004; pp. 166-73.
[36] Chakraborty A, Maka H. Biclustering of gene expression data using genetic algorithm. In Proc IEEE Symp Comput Int Bioinform Comput Biol 2005.
[37] Divina F, Aguilar-Ruiz JS. Biclustering of expression data with evolutionary computation. IEEE Trans Knowl Data Eng 2006; 18: 590-602.
[38] Mukhopadhyay A, Maulik U, Bandyopadhyay S. Evolving coherent and non-trivial biclusters from gene expression data: An evolutionary approach. Proc IEEE Region 10 Conf 2008.
[39] Bryan K, Cunningham P, Bolshakova N. Biclustering of expression data using simulated annealing. In Proc 18th IEEE Symp Comput Based Medical Syst (Dublin, Ireland) 2005; pp. 383-8.
[40] Chakraborty A. Biclustering of gene expression data by simulated annealing. In Proc Eighth Intl Conf High-Perform Comput Asia-Pacific Region 2005; pp. 627-32.
[41] Xie B, Chen S, Liu F. Biclustering of gene expression data using PSO-GA hybrid. Proc Int Conf Bioinform Biomed Eng 2007; 30205.
[42] Deb K. Multi-objective optimization using evolutionary algorithms. England: John Wiley and Sons, Ltd 2001.
[43] Maulik U, Mukhopadhyay A, Bandyopadhyay S. Finding multiple coherent biclusters in microarray data using variable string length multiobjective genetic algorithm. IEEE Trans Inform Tech Biomed 2009; 13(6): 969-75.
[44] Deb K, Pratap A, Agrawal S, Meyarivan T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 2002; 6: 182-97.
[45] Divina F, Aguilar-Ruiz JS. A multi-objective approach to discover biclusters in microarray data. In Proc 9th Ann Conf Genetic Evol Comput New York, NY, USA: ACM 2007; pp. 385-92.
[46] Liu J, Li Z, Hu X, Chen Y. Biclustering of microarray data with MOPSO based on crowding distance. BMC Bioinformatics 2008; 10(Suppl 4): S9.
[47] Fei L, Juan L. Biclustering of gene expression data with a new hybrid multi-objective evolutionary algorithm of NSGA-II and EDA. In Proc Int Conf Bioinform Biomed Eng 2008; pp. 1912-5.
[48] Larranaga P, Lozano JA. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. MA: Kluwer Academic Publisher 2001.
[49] Coelho GP, Franca FO, Zuben FJ. A multi-objective multi-population approach for biclustering. In Proc 7th Int Conf Artificial Immune Syst Springer-Verlag 2008; pp. 71-82.
[50] Tjhi W-C, Lihui C. Flexible fuzzy co-clustering with feature-cluster weighting. In Proc 9th Int Conf Control, Automation, Robotics and Vision 2006.
[51] Fei X, Lu S, Pop HF, Liang LR. GFBA: A biclustering algorithm for discovering value-coherent biclusters. In Proc Int Symp Bioinform Res Appl 2007.
[52] Maulik U, Mukhopadhyay A, Bandyopadhyay S, Zhang MQ, Zhang X. Multiobjective fuzzy bi-clustering in microarray data: Method and a new performance measure. In Proc IEEE World Congress Comput Int/IEEE Congress Evol Comput (Hong Kong) 2008; pp. 383-8.
[53] Krishnapuram R, Joshi A, Yi L. A fuzzy relative of the k-medoids algorithm with application to document and snippet clustering. In Proc IEEE Intl Conf Fuzzy Systems FUZZ-IEEE 99 (Seoul, South Korea) 1999; pp. 1281-6.
[54] Han L, Yan H. Fuzzy biclustering for DNA microarray data analysis. In Proc IEEE Int Conf Fuzzy Syst FUZZ-IEEE 2008 (IEEE World Congress Comput Int) 2008; pp. 1132-8.
[55] Tanay A, Sharan R, Shamir R. Discovering statistically significant biclusters in gene expression data. Bioinformatics 2002; 18: S136-S44.
[56] Prelic A, Bleuler S, Zimmermann P, et al. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 2006; 22(9): 1122-9.
[57] Ahmad W, Khokhar A. cHawk: A highly efficient biclustering algorithm using bigraph crossing minimization. In Proc 2nd Int Workshop Data Mining Bioinform 2007.
[58] Angiulli F, Cesario E, Pizzuti C. Random walk biclustering for microarray data. Inform Sci 2008; 178(6): 1479-97.
[59] Dharan S, Nair AS. Biclustering of gene expression data using reactive greedy randomized adaptive search procedure. BMC Bioinformatics 2009; 10(Suppl 1): S27.
[60] Lazzeroni L, Owen A. Plaid models for gene expression data. Statistica Sinica 2002; 12: 61-86.
[61] Sheng Q, Moreau Y, Moor BD. Biclustering microarray data by Gibbs sampling. Bioinformatics 2003; 19: 196-205.

[62] Kluger Y, Basri R, Chang JT, Gerstein M. Spectral biclustering of microarray data: coclustering genes and conditions. Genome Res 2003; 13: 703-16.
[63] Yoon S, Nardini C, Benini L, Micheli GD. Discovering coherent biclusters from gene expression data using zero-suppressed binary decision diagrams. IEEE/ACM Trans Comput Biol Bioinform 2005; 2(4): 339-54.
[64] Zhao L, Zaki MJ. MicroCluster: Efficient deterministic biclustering of microarray data. IEEE Int Syst 2005; 20(6): 40-9.
[65] Carmona-Saez P, Pascual-Marqui RD, Tirado F, Carazo JM, Pascual-Montano A. Biclustering of gene expression data by nonsmooth non-negative matrix factorization. BMC Bioinformatics 2006; 7: 366.
[66] Tchagang BA, Tewfik AH. DNA microarray data analysis: a novel biclustering algorithm approach. EURASIP J Appl Signal Proc 2006.
[67] Teng L, Chan L-W. Biclustering gene expression profiles by alternately sorting with weighted correlated coefficient. In Proc IEEE Int Workshop Machine Learning Signal Proc 2006; pp. 289-94.
[68] Liu F, Zhou H. Biclustering of gene expression data based on bucketing technique. In Proc 1st Int Conf Bioinform Biomed Eng (ICBBE) (Wuhan, China) 2007; pp. 359-62.
[69] Gu J, Liu JS. Bayesian biclustering of gene expression data. BMC Genomics 2008; 9(Suppl 1): S4.
[70] Dimaggio P, Mcallister S, Floudas C, Feng XJ, Rabinowitz J, Rabitz H. Biclustering via optimal re-ordering of data matrices in systems biology: rigorous methods and comparative studies. BMC Bioinformatics 2008; 9(1): 458.
[71] Madeira SC, Oliveira AL. An efficient biclustering algorithm for finding genes with similar patterns in time-series expression data. In Proc 5th Asia Pacific Bioinformatics Conference, Series in Advances in Bioinformatics and Computational Biology 5 (Hong Kong), Imperial College Press, January 2007; pp. 67-80.
[72] Mahfouz MA, Ismail MA. BIDENS: Iterative density based biclustering algorithm with application to gene expression analysis. Proc World Acad Sci Eng Tech 2009; 37: 342-48.
[73] Barkow S, Bleuler S, Prelic A, Zimmermann P, Zitzler E. BicAT: a Biclustering Analysis Toolbox. Bioinformatics 2006; 22(10): 1282-3.
[74] Cheng KO, Law NF, Siu WC, Lau TH. BiVisu: software tool for bicluster detection and visualization. Bioinformatics 2007; 23: 2342-4.
[75] Wu C-J, Kasif S. GEMS: A web server for biclustering analysis of expression data. Nucleic Acids Res 2005; 33: 596-9.
[76] Shamir R, Maron-Katz A, Tanay A, et al. EXPANDER - an integrative program suite for microarray data analysis. BMC Bioinformatics 2005; 6: 232.
[77] Santamaría R, Theron R, Quintales L. BicOverlapper: a tool for bicluster visualization. Bioinformatics 2008; 24(9): 1212-3.
[78] Gonçalves JP, Madeira SC, Oliveira AL. BiGGEsTS: integrated environment for biclustering analysis of time series gene expression data. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa (INESC-ID). Tech Rep 2009; p. 23.
[79] Zhang Y, Zha H, Chu C-H. A time-series biclustering algorithm for revealing co-regulated genes. In Proc Int Conf Inform Tech Coding Comput 2005; 1: pp. 32-7.
[80] de Franca FO, Ferreira HM, Zuben FJ V. Applying biclustering to perform collaborative filtering. In Proc Int Conf Int Syst Design Appl (Los Alamitos, CA, USA), IEEE Computer Society 2007; pp. 421-6.
[81] Liu S, Chen Y, Yang M, Ding R. Bicluster algorithm and used in market analysis. In Intl Workshop Knowl Dis Data Mining (Los Alamitos, CA, USA), IEEE Computer Society 2009; pp. 504-7.
[82] de Castro PA D, de Franca FO, Ferreira HM, Zuben FJ V. Applying biclustering to text mining: An immune-inspired approach. In Proc Int Conf Artificial Immune Syst 2007; pp. 83-94.

Received: August 22, 2009

Revised: November 04, 2009

Accepted: January 06, 2010
