DEPARTMENT OF MECHANICAL ENGINEERING

NCTT 2006

Optimization in DNA Microarray Manufacture

Gopikrishnan1, Achuthsankar S. Nair2, and S. Sivakumar3, Kerala, India.

Abstract— Microarray manufacture involves the synthesis of known probe sequences onto a chip. Since the probe sequences are known in advance, an initial deposition sequence can be formed to synthesize them. The situation can be compared to a gun ready to be fired: the loaded bullet represents the deposition sequence, and the target board represents the solid surface (slide) where the probe has to be synthesized. The analogy becomes interesting when the gun is loaded with bullets of different character (say, colour) and a specific pattern is required on the target board. Creating a specific pattern is not difficult; the question is 'Can we limit the number of bullets loaded inside the gun?' or, equivalently, 'Is it possible to minimize the length of the deposition sequence?' This paper aims at finding efficient algorithms to generate such a deposition sequence for any random set of probes (say, 100 probes of length 25 mers). The deposition sequence may be named the Shortest Common Deposition Sequence (SCDS) of all probe sets. A shortest common deposition sequence can reduce the number of errors during a microarray experiment; average probe quality also increases, along with a reduction in overall cost and production time. The algorithm has been tested using random simulation of probe sets and fitting the resulting values to appropriate distributions.

Key Words: Microarray, Shortest Common Sequence, Deposition Sequence Optimization.

I. INTRODUCTION

Sequencing may be defined as doing things in a logical and predictable order. In relation to life sciences, sequencing may be considered as determining the exact order of the base pairs in a segment of DNA. DNA, or deoxyribonucleic acid, is a unique molecule of the living world, written with nature's alphabet A, C, G and T. Mathematical models play a significant role in the life sciences, especially in genetics. Bioinformatics and computational biology are rooted in the life sciences as well as in computer and information sciences and technologies. Both interdisciplinary approaches draw on disciplines such as mathematics, statistics, physics, computer science and engineering, biology, and behavioural science. Bioinformatics applies principles of information sciences and technologies to make the vast, diverse, and complex life sciences data more understandable and useful. Biological systems are complex, ranging from the molecular and cellular levels to the organismal and ecosystem levels. Understanding these systems and their interactions requires the development of algorithms, heuristics, and software for the accumulation, manipulation, and especially modelling of biological data, and their incorporation into disciplinary biological research. Computational and mathematical models are helping biologists to understand the beating of a heart, the molecular dances underlying the cell-division cycle and cell movement, and much more. The difficulty in life science modelling is that the behaviour of living systems with respect to time cannot be predicted; a high degree of randomness is involved in biological processes. For example, the stress variation with time in an iron rod (a mechanical system) may be obtained easily and accurately by finite element analysis, but the behaviour of a cell injected with an antibody (a biological system) cannot be predicted and must be tested: the former analysis is non-destructive, while the latter involves destructive testing. Biological systems, being stochastic, require large amounts of data and resources for modelling, which entails heavy investment.

1 P.G. Student, Department of Mechanical Engineering, College of Engineering, Thiruvananthapuram – 695016, Kerala, India. (Phone: +91 9447657729, email: [email protected])
2 Hon. Director, Centre for Bioinformatics, Karyavattom Campus, Thiruvananthapuram – 695581, Kerala, India. (Phone: +91 471 2412759, email: [email protected])
3 Professor, Department of Mechanical Engineering, College of Engineering, Thiruvananthapuram – 695016, Kerala, India. (Phone: +91 9495568929, email: [email protected])

II. INDUSTRIAL ENGINEERING AND LIFE SCIENCE

Industrial engineering is concerned with the use of applied sciences (operations research tools and techniques) in industry. In a raw sense, an industry is a place where transformation takes place through the utilization of resources. The resources can be men, machines, money, management, etc. The transformation process is generally done to add value to a product, i.e., industries are value-addition centres. Value addition may be obtained by changing shape, size, position, colour, etc. Moving on to the life sciences, the field is always characterized by transformations. Nature, ecosystems and living organisms are all subject to change. In fact, growth itself is a transformation process using a resource called 'food'. Food provides all the necessary nutrients – vitamins, minerals, carbohydrates, fats, etc. – which aid the growth process. Considering the human body as an industry, there are several processes happening inside it: organization, metabolism, responsiveness, movement, reproduction, growth, differentiation, respiration, digestion, excretion, etc. Basically, all these processes are optimized in nature. The controller of the process is the central nervous system, comprising the brain, the spinal cord and the numerous nerve cells. When the process is out of control, we say that


the person is diseased. Medicines bring the process back under control, or optimize it again. In this sense, many scientific tools used in industrial engineering can be applied to the life sciences as well [5]. The decision trees used in statistical decision making are analogous to the hybrid cross-charts used by Mendel in his experiments. The plot of an organization structure may be compared to the pedigree chart (a chart showing a person's ancestry). Conversely, ideas from the life sciences can be used in the field of optimization; genetic algorithms and ant colony optimization (ACO) are examples of this. It has been widely observed that natural processes follow statistical distributions. For example, a study conducted by Davenport in 1913 on the inheritance of skin colour revealed that the frequency distribution of the phenotypes in a population follows a bell-shaped curve, very similar to a normal distribution. He found that the maximum number of progeny are of intermediate skin colour when the parents are Negro–Caucasian pairs, with a phenotypic ratio of 1:6:15:20:15:6:1. In a similar way, if the spread of an epidemic is analysed, the frequency follows an irregular bell-shaped curve. The phenomenon of central tendency in biological processes needs to be studied in detail. The central tendency can be assumed to be the optimizer in many life processes, i.e., the mean or central value about which the process runs may be treated as the optimum value. Introducing such central values into processes involving biological elements can bring optimum results.

III. DNA MICROARRAYS

A DNA microarray is a collection of microscopic DNA spots attached to a solid surface, such as a glass, plastic or silicon chip, forming an array. The affixed DNA segments are known as probes, thousands of which can be used in a single DNA microarray. A DNA microarray measures approximately 1.28 cm × 1.28 cm.
Each microarray contains millions of locations and each location contains thousands of spots, as shown in Fig. 1. It may be noted that a single location contains copies of one probe, i.e., probes in a particular location are identical, but the probe nature varies from location to location. The probes, or DNA segments, are oligonucleotides of length 25 mers (say). A single probe may follow a sequence like AGTTTCG…, and the location contains copies of the probe per spot. The next location may have a different probe.

A. Microarray Working Mechanism

The working mechanism of the DNA microarray follows Watson–Crick duplex formation. Each single-stranded DNA fragment is made up of four different nucleotides, adenine (A), thymine (T), guanine (G), and cytosine (C), linked end to end. Adenine is the complement of, and will always pair with, thymine; guanine is the complement of cytosine. When two complementary sequences meet each other, they hybridize. One strand may be the immobilized probe in the microarray, and the second strand may be provided externally. When a solution containing fluorescently labelled DNA segments is poured over the


Fig. 1. A DNA Microarray (a) Actual size is 1.28 cm × 1.28 cm approx., (b) 500,000 locations on each microarray, (c) Millions of DNA strands built up in each location, (d) Actual strand is about 25 base pairs.

microarray, probe–target hybridization occurs. Thousands of probes are provided in each location due to the stochastic nature of probe–target hybridization; the chance of forming duplexes is limited when the probes are few in number. Once the hybridization process is complete, the microarray can be washed and scanned for image acquisition. Analysis of the image can provide helpful hints about the nature of the target solution. At the application level, if two target solutions, one containing DNA segments from healthy tissue and the other from diseased tissue, are poured over identical microarrays, the image obtained after scanning can give information about disease genes. DNA microarrays have numerous applications, such as gene discovery, disease diagnosis, drug discovery, toxicological research, gene expression profiling, differential expression, disease gene identification, reverse engineering of regulatory networks, identification of microorganisms, finding new transcripts, etc.

B. Microarray Manufacture

There are two main technologies for making microarrays: robotic spotting and in-situ synthesis. In the former method, pre-synthesised probes are spotted onto the glass slide with the help of a robot arm. In in-situ synthesis, instead of pre-synthesising oligonucleotides, the oligos are built base by base on the surface of the array. During each round of synthesis, a single base is added to appropriate parts of the array. Light is used to excite the points in the array where the base is to be added, and a mask directs the light to the appropriate regions. This is a straightforward method of manufacturing identical microarrays [8].


IV. THE SHORTEST SEQUENCE PROBLEM

In this paper we focus on in-situ synthesis of microarrays. The probe sequences are known in advance and must be synthesised on the chip. The synthesis is done base by base to generate the probes in parallel: in each synthesis step, the same nucleotide is appended to all probes that have been selectively excited to receive it. Light is used to excite the points where the base is to be added, and a mask directs the light; this method is called photolithographic masking. Since the nucleotides are deposited base by base and the probe sequences are known in advance, a sequencing technique may be applied to achieve optimum deposition of bases. An optimum deposition is required in order to minimize time as well as cost. In addition, the average probe quality increases, because the deposition sequence is formed so that there is no ambiguity about which base is to be deposited in each synthesis step. An optimum deposition means a shortest deposition sequence. The sequence of nucleotides used in the manufacture is called the deposition sequence. Each probe can be treated as a subsequence of the deposition sequence, and hence the deposition sequence is a common supersequence of all probes. Minimizing this supersequence leads to higher throughput and shorter manufacturing time, and makes the process cost-effective. Photolithographic masking becomes imperfect when the deposition sequence is longer, which leads to a higher error probability; imperfect masking results in erroneous probes. Hence, to reduce such errors, we need a common supersequence of minimum length, i.e., the deposition sequence must be a shortest common deposition sequence. Moreover, the deposition of nucleotides is beyond human visibility, i.e., it is done at the nano level using nanotechnology, and shorter but optimum sequences aid easy handling at that level.
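The subsequence relationship above is the core invariant of the problem: every probe must embed, in order (though not contiguously), into the deposition sequence. A minimal illustration of the check (in Python rather than the MATLAB used later in the paper; the function name and sample strings are ours):

```python
def is_subsequence(probe, deposition):
    """True if the characters of `probe` occur in `deposition`
    in the same order (not necessarily contiguously)."""
    it = iter(deposition)
    # `base in it` advances the iterator, so order is enforced.
    return all(base in it for base in probe)

# A valid deposition sequence embeds every probe as a subsequence.
deposition = "ACGTACGTACGT"
probes = ["AGG", "CTA", "ACGT"]
print(all(is_subsequence(p, deposition) for p in probes))  # True
```

Any candidate deposition sequence that fails this check for some probe cannot be used to synthesize that probe.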
The problem is to formulate and check algorithms that can generate the shortest common deposition sequence (SCDS) of all probes used in a microarray. The algorithm should follow a sequencing technique. The range of upper and lower bounds for such sequences may be found using random simulations [4].

V. ALPHABET-LEFTMOST EMBEDDING ALGORITHM

The Alphabet-Leftmost embedding algorithm is a generalized algorithm for finding the shortest common superstring of a set of strings [6], [7]. The algorithm is flexible enough to be used for generating a shortest common supersequence of random probe sets [1], [2], [3]. Probes are formed by combinations of the DNA letters A, G, C and T; these four letters form the fixed common alphabet of all sequences. According to Chase (1976), the sequences of length N that contain the largest number of distinct subsequences of length n, uniformly for each n ≤ N, are precisely the repeated permutations of the alphabet. Hence, if we denote Σ = {A,G,C,T}, i.e., |Σ| = 4, and π is a fixed permutation of the characters in Σ, then a common supersequence can be generated using folded repetitions of π. The number of folds depends on the length of the longest

input string (probe). The common supersequence is to be minimized in length, and we use the Alphabet-Leftmost technique for obtaining the SCDS. The algorithm, as applied to random probe sets, can be summarized in the following steps.

1. The input set R consists of m random probes formed from the alphabet Σ, where Σ = {A,G,C,T}.
2. Let L be the length (in mers) of the longest input probe.
3. Let π be any fixed permutation of the letters in Σ (say π = ACTG).
4. Form the L-fold repetition of π and name the resulting string S. Hence S has a length of n = |Σ|·L mers.
5. Embed the strings from the input set R onto S to form the embedding matrix. The size of the embedding matrix is m × n; each row is a binary vector representing the embedding of one string on S.
6. Remove all unproductive steps from the embedding matrix to obtain the SCDS. Denote the SCDS length by l.

VI. LOWER AND UPPER BOUNDS

A lower bound is a number equal to or less than every other number in a given set: a sequence has a lower bound LB if all of the terms of the sequence are greater than or equal to LB. Conversely, an upper bound is a number equal to or greater than every other number in a given set, and a sequence has an upper bound UB if all of the terms of the sequence are less than or equal to UB. Lower and upper bounds are used for forecasting the expected range of the deposition sequence length, and are found by random simulations of probe sets. A lower bound over the length L of the probes can be obtained easily. For i = 1, …, m and x ∈ Σ, let N_i(x) denote the number of occurrences of character x in the ith sequence, and define

    N(x) = max_{i=1,…,m} N_i(x)    (1)

Clearly, every common supersequence must contain at least N(x) occurrences of x. Thus a lower bound on its length is given by

    LB = Σ_x N(x)    (2)

To be more specific, let N_p(x) denote the number of x's in probe p, x ∈ {A,C,T,G}. In the supersequence, x occurs at least N(x) = max_p N_p(x) times.
Thus, N(A) + N(C) + N(G) + N(T) is a lower bound. An upper bound on the SCDS length can be obtained directly by running the Alphabet-Leftmost algorithm: after computing the embeddings of R on S, we remove the unproductive steps from S to obtain the SCDS, and take its length as the upper bound, i.e., UB = |SCDS|. In practice, the actual SCDS may differ from the theoretical one, since the shortest common deposition sequence can be refined further to obtain an improved SCDS. The refinement removes low-productivity steps from the supersequence without affecting the embeddings.
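The steps of Section V and the two bounds can be sketched compactly. The following Python sketch (the paper's implementation is in MATLAB; the function names and sample probes here are illustrative) computes LB from eq. (2) and the SCDS, whose length is UB, by leftmost embedding into the L-fold repetition of π:

```python
from collections import Counter
from itertools import cycle, islice

PI = "ACTG"  # fixed permutation of Sigma = {A, C, T, G} (step 3)

def lower_bound(probes):
    """LB = sum over x of N(x) = max_i N_i(x), eq. (2)."""
    counts = [Counter(p) for p in probes]
    return sum(max(c[x] for c in counts) for x in PI)

def alphabet_leftmost_scds(probes):
    """Embed each probe leftmost into the L-fold repetition of PI
    (steps 4-5), then drop unproductive steps (step 6)."""
    L = max(len(p) for p in probes)
    S = "".join(islice(cycle(PI), len(PI) * L))  # |S| = 4L mers
    pos = [0] * len(probes)  # next needed character of each probe
    kept = []
    for base in S:
        productive = False
        for i, p in enumerate(probes):
            if pos[i] < len(p) and p[pos[i]] == base:
                pos[i] += 1
                productive = True
        if productive:
            kept.append(base)  # at least one probe used this step
    return "".join(kept)

probes = ["ACGT", "GATT", "CCAT"]
scds = alphabet_leftmost_scds(probes)
print(lower_bound(probes), len(scds))  # 6 8, i.e. LB <= |SCDS| = UB
```

Because each cycle of π contains every letter once, every probe of length at most L completes within the 4L deposition steps, so the embedding always succeeds.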



A. Distribution of Lower Bound

In this section, we analyse the nature of the lower and upper bounds. The usual procedure is to fit the obtained values to discrete distributions; the goodness of fit determines the best fit. In the case of the lower bound, since N_i(x) is the number of occurrences of character x in the ith sequence, N_i(x) follows a binomial distribution with parameters L_i and 1/σ, where L_i is the length of the ith sequence and σ = 4 since there are four letters in the alphabet; 1/σ is the probability of occurrence of any particular one of the four letters. Furthermore, for fixed x, the N_i are independent. The distribution for k steps can be given as

    P(N_i(x) ≤ k) = Σ_{j=0}^{k} (L_i choose j) · σ^{-j} · (1 − 1/σ)^{L_i − j}    (3)

    P(N(x) ≤ k) = Π_{i=1}^{m} P(N_i(x) ≤ k)    (4)
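Equations (3) and (4) can be evaluated exactly in a few lines. The sketch below (our own illustration, not the paper's MATLAB code) computes the distribution of N(x) for m probes of given lengths:

```python
from math import comb

SIGMA = 4  # |Sigma|: four nucleotides, each occurring with prob. 1/4

def p_ni_le_k(L_i, k):
    """P(N_i(x) <= k), eq. (3): N_i(x) ~ Binomial(L_i, 1/SIGMA)."""
    p = 1 / SIGMA
    return sum(comb(L_i, j) * p**j * (1 - p)**(L_i - j)
               for j in range(min(k, L_i) + 1))

def p_n_le_k(lengths, k):
    """P(N(x) <= k): product over the m sequences, eq. (4)."""
    prob = 1.0
    for L_i in lengths:
        prob *= p_ni_le_k(L_i, k)
    return prob

# Example: m = 100 probes, each of length 25 mers.
cdf = [p_n_le_k([25] * 100, k) for k in range(26)]
```

The pmf of N(x), and hence (by convolution over the four letters) the approximate distribution of LB, follows by differencing this CDF.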

From (2), it can be recalled that summation over N(x) gives the lower bound. Thus the approximate distribution of LB is obtained by convolutions of the N(x) from (4).

B. Distribution of Upper Bound

The upper bound is obtained by removing unproductive steps from the embedding matrix. For k = 1, …, L_i, let W_{i,k} be the number of consecutive unproductive steps between the (k − 1)st and kth productive steps for sequence i. The deposition sequence is periodic with alphabet Σ, and the characters of the probes are independent and uniformly distributed; hence the W_{i,k} can be assumed to be independent and uniformly distributed [9]. Between the (k − 1)st and kth productive steps, W_{i,k} can take the values 0, 1, 2, or 3, i.e., W_{i,k} ∈ {0,1,2,3}. Hence the completion step for a sequence may be written as

    C_i = L_i + Σ_{k=1}^{L_i} W_{i,k}    (5)

The distribution of C_i may be found by the L_i-fold convolution of the uniform distribution on {0,1,2,3}. If C is set to max_i C_i, we can compute the distribution of C as

    P(C ≤ c) = Π_{i=1}^{m} P(C_i ≤ c)    (6)

It may be noted that the initial supersequence, before removal of unproductive steps, has length L_i + 3L_i = 4L_i for sequence i, and UB > L_i. From (5) it can be seen that a non-negative value, obtained by summing the unproductive steps, is added to L_i to give the completion step for a sequence. The upper bound is always less than or equal to C, since C counts unproductive steps that Alphabet-Leftmost later removes. The distribution of C can therefore be used as an approximation to that of UB. However, in our analysis we fit the actual lower and upper bounds to known distributions.
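Equations (5) and (6) can likewise be evaluated numerically. The following sketch (illustrative function names; the paper's code is in MATLAB) builds the pmf of the waiting-time sum in C_i by L_i-fold convolution of the uniform distribution on {0,1,2,3}, then takes the product over sequences:

```python
def convolve(a, b):
    """Discrete convolution of two pmfs given as lists (index = value)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, pa in enumerate(a):
        for j, pb in enumerate(b):
            out[i + j] += pa * pb
    return out

def wait_sum_pmf(L_i):
    """pmf of sum_{k=1}^{L_i} W_{i,k}, each W uniform on {0,1,2,3}."""
    pmf = [1.0]
    for _ in range(L_i):
        pmf = convolve(pmf, [0.25] * 4)
    return pmf  # support 0 .. 3*L_i

def p_c_le_c(lengths, c):
    """P(C <= c) = prod_i P(C_i <= c), eqs. (5)-(6),
    where C_i = L_i + (sum of the waits W_{i,k})."""
    prob = 1.0
    for L_i in lengths:
        pmf = wait_sum_pmf(L_i)
        # C_i <= c  iff  the wait sum is <= c - L_i
        prob *= sum(pmf[:max(0, c - L_i + 1)])
    return prob

# Example: completion-step CDF value for 100 probes of 25 mers.
print(p_c_le_c([25] * 100, 60))
```

Differencing P(C ≤ c) over c gives the approximate distribution of the completion step, and hence an approximation to the distribution of UB.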

VII. COMPUTATIONAL EXPERIMENTS, RESULTS AND DISCUSSION

The computational experiments of Sven Rahmann and Sergio A. de Carvalho Jr. [1], [2] pioneered this line of work. In their work, random sequences were generated and upper and lower bounds were found empirically; the theoretical distribution was derived from the unproductivity measure obtained from random simulations, so the stochasticity of the simulation results cast its shadow over the theoretical results. In the present work, the simulated results were fitted to different discrete distributions and the best-fit distribution was found. The advantage of such a method is that one can actually forecast the resource requirement during microarray manufacture. Moreover, the success value (p) indicates how closely the obtained SCDS length agrees with the theoretical one. The SCDS is obtained by running Alphabet-Leftmost over the generated probe sequences. To measure and check the power of the Alphabet-Leftmost algorithm, several uniformly distributed random probe sets of 25 mers were generated, and the shortest common deposition sequence and the corresponding bounds were found for these sequences by running Alphabet-Leftmost. The algorithm was coded in MATLAB [10] and tested on an AMD Athlon machine (1.4 GHz, 256 MB) running Windows XP. The observations were fitted using EASYFIT Version 3.0. Generation of the SCDS for 30 probes of length 10 mers is illustrated in Table i; the sequence above the embedding matrix indicates the SCDS. The approximate completion time for SCDS generation is 10.5 s for 100 probes of length 25 mers. Lower and upper bound simulation requires 1563.25 s for 100 probes and 2478.32 s for 200 probes, each of length 25 mers. The simulation analysis of lower and upper bounds on random probe sets of length 25 mers shows that the values tend to centralize, as expected. Simulation has been done for 100 and 200 probes; the results are shown in Fig. 2.
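The distribution fitting above was done with EASYFIT. As a rough stand-in for such a fit (our own assumption, not the paper's procedure), binomial parameters can be estimated from simulated bound values by the method of moments:

```python
from statistics import mean, variance

def fit_binomial(samples):
    """Method-of-moments binomial fit: for X ~ Binomial(n, p),
    mean = n*p and variance = n*p*(1-p), so p = 1 - var/mean."""
    mu, var = mean(samples), variance(samples)
    p = 1 - var / mu
    n = round(mu / p)
    return n, p

# Example on synthetic bound values centred at 10 (illustrative data):
n, p = fit_binomial([8, 9, 10, 10, 10, 11, 12])
print(n, round(p, 3))  # 12 0.833
```

A proper fit would follow the estimate with a goodness-of-fit test, as the EASYFIT procedure does.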
Both the lower bound and the upper bound have a value that occurs most often. This value, which is taken as the mean, has the highest probability of occurrence and is used for experimental purposes: the initial setting of resources (oligonucleotides) is based on it. The fitting data show that both upper and lower bounds are best fitted by a binomial distribution. The mean length (n in Fig. 2) obtained from fitting may vary slightly from the simulated one due to stochasticity and the small number of probes; however, the variation is not large. From Fig. 2, the probability of success (p) is always above 90%, which indicates that the values given by the distribution can safely be used for resource setting.

VIII. CONCLUSION

The essence of this work is the influence of modelling techniques in the field of life science. In this paper it has been shown that the four DNA letters can be sequenced for experimental work. A basic search algorithm called Alphabet-Leftmost is used for optimal sequencing. The results indicate that the sequencing can result


in values which follow specific statistical distributions. Such results can aid in interpolating and extrapolating values that are otherwise very difficult to obtain. Contrary to the usual destructive nature of biological experiments, this method is non-destructive and requires less time to arrive at a result.

IX. IN FUTURE

The present algorithm is based on the removal of unproductive steps, but there is provision for low-productivity steps to be re-embedded or merged with other productive steps. This may be termed supersequence refinement, and it results in a much shorter SCDS. An algorithm for supersequence editing and refinement can be formulated as part of future work. In addition, a model illustrating the probabilistic nature of probe–target hybridization can be formed; such a model could substitute for the actual microarray experiment.

REFERENCES

[1] S. Rahmann and S. A. de Carvalho Jr., "Combinatorial Optimization Problems in DNA Microarray Design," Algorithms and Statistics for Systems Biology Group, Genome Informatics, 2005, pp. 1-13.
[2] S. Rahmann, "The shortest common supersequence problem in a microarray production setting," Bioinformatics, Oxford University Press, Vol. 19 Suppl. 2, 2003, pp. ii156–ii161.
[3] S. Rahmann, "Algorithmic Problem Solving," SoSe 2006, Course #392016, pp. 1-4.
[4] P. Barone, P. Bonizzoni, G. Della Vedova and G. Mauri, "An Approximation Algorithm for the Shortest Common Supersequence Problem," ACM, 2001.
[5] R. H. Lathrop and M. J. Pazzani, "Combinatorial Optimization in Rapidly Mutating Drug-resistant Viruses," Journal of Combinatorial Optimization, Vol. 3, 1999, pp. 301-320.
[6] A. S. Rebai and M. Elloumi, "Approximation Algorithm for the Shortest Approximate Common Superstring Problem," Enformatika, Vol. 12, 2006, ISSN 1305-5313, pp. 301-306.
[7] A. Zaritsky, "Coevolving Solutions to the Shortest Common Superstring Problem," M.Sc. Thesis, Ben-Gurion University of the Negev, 2003.
[8] D. Stekel, Microarray Bioinformatics, Cambridge University Press, First South Asian Edition, 2005.
[9] N. Deo, Graph Theory with Applications to Engineering and Computer Science, Prentice Hall of India, Nineteenth Edition, 2000.
[10] D. M. Etter and D. C. Kuncicky, Introduction to MATLAB 6, Pearson Education, Second Edition, 2004.

Table i. SCDS generation for 30 probes of length 10 mers



Fig. 2. Simulation and Fitting Results. (a) SCS bounds for 100 DNA sequences of length 25 mers. (b) Fitted lower bounds for 100 DNA sequences of length 25 mers. (c) Fitted upper bounds for 100 DNA sequences of length 25 mers. (d) SCS bounds for 200 DNA sequences of length 25 mers. (e) Fitted lower bounds for 200 DNA sequences of length 25 mers. (f) Fitted upper bounds for 200 DNA sequences of length 25 mers. Note the rightward shift of the bounds as the number of probes increases.

