SEGMENTATION TECHNIQUES OF MICROARRAY IMAGES – A COMPARATIVE STUDY

Submitted in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Electronics and Communication Engineering of Cochin University of Science and Technology

By

JISHNU L REUBEN GEORGE STEPHEN SAJIT S NAIR VINEETH P

Under the guidance of

Mrs. DEEPA J Assistant Professor in Electronics

April 2009

Department of Electronics and Communication Engineering
College of Engineering, Chengannur-689121
Phone: (0479) 2454125, 2451424
Fax: (0479) 2451424

COLLEGE OF ENGINEERING, CHENGANNUR KERALA

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING

CERTIFICATE

This is to certify that the project report entitled SEGMENTATION TECHNIQUES OF MICROARRAY IMAGES – A COMPARATIVE STUDY, submitted by JISHNU L REUBEN GEORGE STEPHEN SAJIT S NAIR VINEETH P, is a bonafide account of work done by them under our supervision.

Guide

Head of the Department

Coordinator

ACKNOWLEDGEMENT

First of all we would like to thank Mr. V P Devassia, Principal, College of Engineering, Chengannur and Mrs. Nisha Kuruvilla, Head of the Department, Electronics and Communication, for providing us with all the necessary facilities for implementation of the project.

Our sincere thanks go to Mr. Anilkumar C V, Assistant Professor in Electronics, for coordinating all our activities.

We express our gratitude to Mrs. Deepa J, Assistant Professor in Electronics, for providing us with her invaluable guidance and constant encouragement, without which we could not have completed this project.

We also thank Mr. Ulsah R and the other supporting staff in the Project Lab.

We thank all the staff and students of the college who have directly or indirectly helped us in the completion of this project.

Above all, we thank God Almighty for all His blessings.

ABSTRACT

A microarray is a device that detects the presence and abundance of labelled nucleic acids in a biological sample. Microarray imaging is considered an important tool for large scale analysis of gene expression. A DNA microarray is a multiplex technology used in molecular biology and in medicine. It consists of an arrayed series of thousands of microscopic spots of DNA oligonucleotides, called features, each containing picomoles of a specific DNA sequence. Probe-target hybridization is usually detected and quantified by fluorescence-based detection of fluorophore-labelled targets to determine the relative abundance of nucleic acid sequences in the target. The accuracy of the measured gene expression depends on the experiment itself and on the subsequent image processing. Image analysis is the first step in microarray data analysis. It has a strong impact on the successive phases of analysis (clustering and identification of differently expressed genes).

The current project aims at performing image analysis on a microarray image, involving automatic gridding, segmentation and background correction. Segmentation of microarray images is performed using various common techniques such as fixed circle, adaptive circle and adaptive shape segmentation, and the values of the logarithmic ratios of red and green channel intensities of the image obtained using each technique are compared. Edge detection is performed using the Canny method, and background correction using the Extended Local Background Quantification method (ELB-Q).

CONTENTS

I. INTRODUCTION
   1. DNA MICROARRAYS
      1.1. IMPORTANCE OF MICROARRAYS
      1.2. DNA MICROARRAY
      1.3. DESIGNING A MICROARRAY EXPERIMENT
      1.4. THE COLOURS OF A MICROARRAY
      1.5. TYPES OF MICROARRAYS
         1.5.1. Changes in Gene Expression Levels
         1.5.2. Genomic Gains and Losses
         1.5.3. Mutations in DNA
II. MICROARRAY IMAGE ANALYSIS
   1. ADDRESSING
   2. SEGMENTATION
      2.1. FIXED CIRCLE SEGMENTATION
      2.2. ADAPTIVE CIRCLE SEGMENTATION
      2.3. ADAPTIVE SHAPE SEGMENTATION
      2.4. HISTOGRAM SEGMENTATION
   3. INTENSITY EXTRACTION
      3.1. SPOT INTENSITY
      3.2. BACKGROUND INTENSITY
         3.2.1. Local Background Intensities
         3.2.2. Morphological Opening
         3.2.3. Constant Background
         3.2.4. No Adjustment
   4. OVERVIEW OF EXISTING SOFTWARE
      4.1. ScanAlyze
      4.2. GenePix
      4.3. Spot
      4.4. Angulo and Serrà
      4.5. Matarray
      4.6. MAGIC
III. PROBLEM DEFINITION
IV. EXPERIMENTAL DETAILS
   1. PREPROCESSING
   2. EDGE DETECTION
   3. GRIDDING
   4. SEGMENTATION
   5. BACKGROUND CORRECTION
   6. INTENSITY EXTRACTION
   7. GRAPHICAL USER INTERFACE
V. RESULTS AND DISCUSSION
VI. CONCLUSION AND FUTURE SCOPE
VII. REFERENCES

Segmentation of Microarray Images Using MATLAB®


I. INTRODUCTION

Biomedical research evolves and advances not only through the compilation of knowledge but also through the development of new technologies. Using traditional methods to assay gene expression, researchers were able to survey a relatively small number of genes at a time. The emergence of new tools enables researchers to address previously intractable problems and to uncover novel potential targets for therapies. Microarrays allow scientists to analyze expression of many genes in a single experiment quickly and efficiently. They represent a major methodological advance and illustrate how the advent of new technologies provides powerful tools for researchers. Scientists are using microarray technology to try to understand fundamental aspects of growth and development as well as to explore the underlying genetic causes of many human diseases [1].

With only a few exceptions, every cell of the body contains a full set of chromosomes and identical genes. Only a fraction of these genes are turned on, however, and it is the subset that is expressed that confers unique properties to each cell type. Gene expression is the term used to describe the transcription of the information contained within the DNA, the repository of genetic information, into messenger RNA (mRNA) molecules that are then translated into the proteins that perform most of the critical functions of cells. Scientists study the kinds and amounts of mRNA produced by a cell to learn which genes are expressed, which in turn provides insights into how the cell responds to its changing needs. Gene expression is a highly complex and tightly regulated process that allows a cell to respond dynamically both to environmental stimuli and to its own changing needs. This mechanism acts both as an "on/off" switch to control which genes are expressed in a cell and as a "volume control" that increases or decreases the level of expression of particular genes as necessary [1].

College of Engineering, Chengannur


1. DNA MICROARRAYS

Two recent complementary advances, one in knowledge and one in technology, are greatly facilitating the study of gene expression and the discovery of the roles played by specific genes in the development of disease [1]. As a result of the Human Genome Project, there has been an explosion in the amount of information available about the DNA sequence of the human genome. Consequently, researchers have identified a large number of novel genes within these previously unknown sequences. The challenge currently facing scientists is to find a way to organize and catalog this vast amount of information into a usable form. Only after the functions of the new genes are discovered will the full impact of the Human Genome Project be realized.

The second advance may facilitate the identification and classification of this DNA sequence information and the assignment of functions to these new genes: the emergence of DNA microarray technology. A microarray works by exploiting the ability of a given mRNA molecule to bind specifically to, or hybridize to, the DNA template from which it originated. By using an array containing many DNA samples, scientists can determine, in a single experiment, the expression levels of hundreds or thousands of genes within a cell by measuring the amount of mRNA bound to each site on the array. With the aid of a computer, the amount of mRNA bound to the spots on the microarray is precisely measured, generating a profile of gene expression in the cell. A microarray is a tool for analyzing gene expression that consists of a small membrane or glass slide containing samples of many genes arranged in a regular pattern [1].

1.1. IMPORTANCE OF MICROARRAYS

Microarrays are a significant advance both because they may contain a very large number of genes and because of their small size [1]. Microarrays are therefore useful when one wants to survey a large number of genes quickly or when the sample to be studied is small. Microarrays may be used to assay gene expression within a single sample or to compare gene expression in two different cell types or tissue samples, such as in healthy and diseased tissue. Because a microarray can be used to examine the expression of hundreds or thousands of genes at once, it promises to revolutionize the way scientists examine gene expression. This technology is still considered to be in its infancy; therefore, many initial studies using microarrays have represented simple surveys of gene expression profiles in a variety of cell types. Nevertheless, these studies represent an important and necessary first step in our understanding and cataloging of the human genome.

As more information accumulates, scientists will be able to use microarrays to ask increasingly complex questions and perform more intricate experiments. With new advances, researchers will be able to infer probable functions of new genes based on similarities in expression patterns with those of known genes. Ultimately, these studies promise to expand the size of existing gene families, reveal new patterns of coordinated gene expression across gene families, and uncover entirely new categories of genes. Furthermore, because the product of any one gene usually interacts with those of many others, our understanding of how these genes coordinate will become clearer through such analyses, and precise knowledge of these inter-relationships will emerge.

The use of microarrays may also speed the identification of genes involved in the development of various diseases by enabling scientists to examine a much larger number of genes. This technology will also aid the examination of the integration of gene expression and function at the cellular level, revealing how multiple gene products work together to produce physical and chemical responses to both static and changing cellular needs.

1.2. DNA MICROARRAY

DNA microarrays are small, solid supports onto which the sequences from thousands of different genes are immobilized, or attached, at fixed locations. The supports themselves are usually glass microscope slides, about the size of two side-by-side pinky fingers, but can also be silicon chips or nylon membranes. The DNA is printed, spotted, or actually synthesized directly onto the support. It is important that the gene sequences in a microarray are attached to their support in an orderly or fixed way, because a researcher uses the location of each spot in the array to identify a particular gene sequence. The spots themselves can be DNA, cDNA, or oligonucleotides. An oligonucleotide, or oligo as it is commonly called, is a short fragment of single-stranded DNA that is typically 5 to 50 nucleotides long.

1.3. DESIGNING A MICROARRAY EXPERIMENT: THE BASIC STEPS

The whole process of extracting information about a disease condition from the dime-sized glass or silicon chip containing thousands of individual gene sequences is based on hybridization probing. It is a technique that uses fluorescently labeled nucleic acid molecules as "mobile probes" to identify complementary molecules, that is, sequences that are able to base-pair with one another. Each single-stranded DNA fragment is made up of four different nucleotides, adenine (A), thymine (T), guanine (G), and cytosine (C), which are linked end to end. Adenine is the complement of, or will always pair with, thymine, and guanine is the complement of cytosine. Therefore, the complementary sequence to G-T-C-C-T-A will be C-A-G-G-A-T. When two complementary sequences find each other, such as the immobilized target DNA and the mobile probe DNA, cDNA, or mRNA, they will lock together, or hybridize.

Now, consider two cells: cell type 1, a healthy cell, and cell type 2, a diseased cell. Both contain an identical set of four genes, A, B, C, and D. Scientists are interested in determining the expression profile of these four genes in the two cell types. To do this, scientists isolate mRNA from each cell type and use this mRNA as templates to generate cDNA with a "fluorescent tag" attached. Different tags (red and green) are used so that the samples can be differentiated in subsequent steps. The two labeled samples are then mixed and incubated with a microarray containing the immobilized genes A, B, C, and D. The labeled molecules bind to the sites on the array corresponding to the genes expressed in each cell.
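The base-pairing rule described above can be illustrated with a short Python sketch. The function name `complement` and the lookup table `PAIRS` are illustrative choices made here and are not taken from the project code; the sketch only substitutes each base with its partner and does not model strand direction.

```python
# Watson-Crick base pairing: A pairs with T, and G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(sequence: str) -> str:
    """Return the base-by-base complement of a DNA sequence."""
    return "".join(PAIRS[base] for base in sequence.upper())


print(complement("GTCCTA"))  # CAGGAT
```

Applying the function twice returns the original sequence, which is a quick way to check that the pairing table is symmetric.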

A DNA Microarray Experiment [1]

1. Prepare the DNA chip using the chosen target DNAs.
2. Generate a hybridization solution containing a mixture of fluorescently labeled cDNAs.
3. Incubate the hybridization mixture containing fluorescently labeled cDNAs with the DNA chip.
4. Detect bound cDNA using laser technology and store data in a computer.
5. Analyze data using computational methods.


After this hybridization step is complete, a researcher will place the microarray in a "reader" or "scanner" that consists of some lasers, a special microscope, and a camera. The fluorescent tags are excited by the laser, and the microscope and camera work together to create a digital image of the array. These data are then stored in a computer, and a special program is used either to calculate the red-to-green fluorescence ratio or to subtract out background data for each microarray spot by analyzing the digital image of the array. If calculating ratios, the program then creates a table that contains the ratios of the intensity of red-to-green fluorescence for every spot on the array. For example, using the scenario outlined above, the computer may conclude that both cell types express gene A at the same level, that cell 1 expresses more of gene B, that cell 2 expresses more of gene C, and that neither cell expresses gene D. But remember, this is a simple example used to demonstrate key points in experimental design. Some microarray experiments can contain up to 30,000 target spots. Therefore, the data generated from a single array can mount up quickly.
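The ratio table described above can be sketched for the four-gene scenario as follows. The intensity values are invented for illustration, and the assignment of red to cell type 2 and green to cell type 1 is an assumption made here (it follows the colour convention of the next section). The helper name `log_ratio` is likewise hypothetical.

```python
import math


def log_ratio(red: float, green: float):
    """Return log2(red / green), or None when either channel has no signal."""
    if red <= 0 or green <= 0:
        return None
    return math.log2(red / green)


# Hypothetical background-corrected intensities as (red, green) pairs.
# Assumption: red = cell type 2 (diseased), green = cell type 1 (healthy).
spots = {
    "A": (1200.0, 1200.0),  # expressed equally in both cell types
    "B": (300.0, 2400.0),   # cell 1 expresses more
    "C": (2000.0, 500.0),   # cell 2 expresses more
    "D": (0.0, 0.0),        # expressed in neither cell type
}

for gene, (red, green) in spots.items():
    print(gene, log_ratio(red, green))
```

A log ratio of zero indicates equal expression, a positive value indicates stronger red signal, and a negative value indicates stronger green signal. Spots with no signal in a channel are reported as `None` rather than forced to a number.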

1.4. THE COLORS OF A MICROARRAY [1]

In this schematic:

GREEN represents Control DNA, where either DNA or cDNA derived from normal tissue is hybridized to the target DNA.
RED represents Sample DNA, where either DNA or cDNA derived from diseased tissue is hybridized to the target DNA.
YELLOW represents a combination of Control and Sample DNA, where both hybridized equally to the target DNA.
BLACK represents areas where neither the Control nor Sample DNA hybridized to the target DNA.

Each spot on an array is associated with a particular gene. Each color in an array represents either healthy (control) or diseased (sample) tissue. Depending on the type of array used, the location and intensity of a color will tell us whether the gene, or mutation, is present in either the control and/or sample DNA. It will also provide an estimate of the expression level of the gene(s) in the sample and control DNA.
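The colour scheme above can be expressed as a simple rule on a spot's red and green intensities. The following Python sketch is only an illustration: the function name `spot_colour` and the two cut-offs (`threshold` for "signal present" and `balance` for "hybridized equally") are assumptions introduced here, not values from the source.

```python
def spot_colour(red, green, threshold=100.0, balance=2.0):
    """Name the colour a spot would show in the red/green overlay.

    threshold: minimum intensity for a channel to count as present (assumed).
    balance:   largest red/green ratio still treated as "equal" (assumed).
    """
    red_on = red >= threshold
    green_on = green >= threshold
    if not red_on and not green_on:
        return "black"   # neither control nor sample hybridized
    if red_on and green_on and max(red, green) / min(red, green) < balance:
        return "yellow"  # both hybridized about equally
    return "red" if red > green else "green"


print(spot_colour(1000.0, 900.0))  # yellow
```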

1.5. TYPES OF MICROARRAYS

There are three basic types of samples [1] that can be used to construct DNA microarrays: two are genomic, and the other is "transcriptomic", that is, it measures mRNA levels. What makes them different from each other is the kind of immobilized DNA used to generate the array and, ultimately, the kind of information that is derived from the chip. The target DNA used will also determine the type of control and sample DNA that is used in the hybridization solution.

1.5.1. Changes in Gene Expression Levels

Determining the level, or volume, at which a certain gene is expressed is called microarray expression analysis, and the arrays used in this kind of analysis are called "expression chips". The immobilized DNA is cDNA derived from the mRNA of known genes, and once again, at least in some experiments, the control and sample DNA hybridized to the chip is cDNA derived from the mRNA of normal and diseased tissue, respectively. If a gene is over-expressed in a certain disease state, then more sample cDNA, as compared to control cDNA, will hybridize to the spot representing that expressed gene. In turn, the spot will fluoresce red with greater intensity than it will fluoresce green. Once researchers have characterized the expression patterns of various genes involved in many diseases, cDNA derived from diseased tissue from any individual can be hybridized to determine whether the expression pattern of the gene from the individual matches the expression pattern of a known disease. If this is the case, treatment appropriate for that disease can be initiated.

As researchers use expression chips to detect expression patterns (whether a particular gene is being expressed more or less under certain circumstances), expression chips may also be used to examine changes in gene expression over a given period of time, such as within the cell cycle. The cell cycle is a molecular network that determines, in the normal cell, if the cell should pass through its life cycle. There are a variety of genes involved in regulating the stages of the cell cycle. Also built into this network are mechanisms designed to protect the body when this system fails or breaks down because of mutations within one of the "control genes", as is the case with cancerous cell growth. An expression microarray "experiment" could be designed where cell cycle data are generated in multiple arrays and referenced to time "zero". Analysis of the collected data could further elucidate details of the cell cycle and its "clock", providing much needed data on the points at which gene mutation leads to cancerous growth as well as sources of therapeutic intervention.

In the same way, expression chips can be used to develop new drugs. For instance, if a certain gene is over-expressed in a particular form of cancer, researchers can use expression chips to see if a new drug will reduce over-expression and force the cancer into remission. Expression chips could also be used in disease diagnosis, e.g., in the identification of new genes involved in environmentally triggered diseases, such as those diseases affecting the immune, nervous, and pulmonary/respiratory systems.

1.5.2. Genomic Gains and Losses

DNA repair genes are thought to be the body's frontline defense against mutations and, as such, play a major role in cancer. Mutations within these genes often manifest themselves as lost or broken chromosomes. It has been hypothesized that certain chromosomal gains and losses are related to cancer progression and that the patterns of these changes are relevant to clinical prognosis. Using different laboratory methods, researchers can measure gains and losses in the copy number of chromosomal regions in tumor cells. Then, using mathematical models to analyze these data, they can predict which chromosomal regions are most likely to harbor important genes for tumor initiation and disease progression. The results of such an analysis may be depicted as a hierarchical treelike branching diagram, referred to as a "tree model of tumor progression".

Researchers use a technique called microarray Comparative Genomic Hybridization (CGH) to look for genomic gains and losses or for a change in the number of copies of a particular gene involved in a disease state. In microarray CGH, large pieces of genomic DNA serve as the target DNA, and each spot of target DNA in the array has a known chromosomal location. The hybridization mixture will contain fluorescently labeled genomic DNA harvested from both normal (control) and diseased (sample) tissue. Therefore, if the number of copies of a particular target gene has increased, a large amount of sample DNA will hybridize to those spots on the microarray that represent the gene involved in that disease, whereas comparatively small amounts of control DNA will hybridize to those same spots. As a result, those spots containing the disease gene will fluoresce red with greater intensity than they will fluoresce green, indicating that the number of copies of the gene involved in the disease has gone up.

1.5.3. Mutations in DNA

When researchers use microarrays to detect mutations or polymorphisms in a gene sequence, the target, or immobilized DNA, is usually that of a single gene. In this case though, the target sequence placed on any given spot within the array will differ from that of other spots in the same microarray, sometimes by only one or a few specific nucleotides. One type of sequence commonly used in this type of analysis is called a Single Nucleotide Polymorphism, or SNP, a small genetic change or variation that can occur within a person's DNA sequence. Another difference in mutation microarray analysis, as compared to expression or CGH microarrays, is that this type of experiment only requires genomic DNA derived from a normal sample for use in the hybridization mixture.


Once researchers have established that a SNP pattern is associated with a particular disease, they can use SNP microarray technology to test an individual for that disease expression pattern to determine whether he or she is susceptible to (at risk of developing) that disease. When genomic DNA from an individual is hybridized to an array loaded with various SNPs, the sample DNA will hybridize with greater frequency only to specific SNPs associated with that person. Those spots on the microarray will then fluoresce with greater intensity, demonstrating that the individual being tested may have, or is at risk for developing, that disease.

In Brief: Microarray Applications [1]

Microarray type                  Application
CGH                              Tumor classification, risk assessment, and prognosis prediction
Expression analysis              Drug development, drug response, and therapy development
Mutation/Polymorphism analysis   Drug development, therapy development, and tracking disease progression


II. MICROARRAY IMAGE ANALYSIS

Image analysis is the first processing step in microarray technology; it has a strong impact on the successive phases of analysis (clustering and identification of differently expressed genes) [7]. The way in which the image of a microarray experiment is extracted from a scanned slide can have a substantial impact on subsequent analyses [2]. In a microarray experiment, hybridized microarrays are imaged in a microarray scanner to produce red and green fluorescence intensity measurements over a large collection of pixels that collectively cover the array. Fluorescence intensities correspond to the levels of hybridization of the two targets to the probes spotted on the slide. These fluorescence intensity values are stored as 16-bit images that are typically described as “raw” data [2].

Usually, input images are scaled by reducing the overall dynamic range to 8 bits, yielding a single RGB image (for the sake of visualization). The reduction is obtained by a square root transformation ($2^{16} \rightarrow 2^{8}$). Alternatively, it is possible to select only 8 of the overall 16 bits. In GenePix, for example, it is possible to manually select 8 bits or to use the predefined options (high, low or centre bits). In the final 24-bit RGB image the blue channel is set to zero, and the red and green values come, respectively, from the 630-660 nm and 510-550 nm microarray scans [7].
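The two ways of reducing a 16-bit value to 8 bits can be sketched in Python as follows. The function names are illustrative, and the bit-selection helper is a plain shift-and-mask that discards all other bits; how GenePix itself treats the discarded bits is not described in the source, so this is an assumption.

```python
import math


def sqrt_to_8bit(value16: int) -> int:
    """Square-root transform: maps the range 0..65535 onto 0..255."""
    return math.isqrt(value16)


def select_bits(value16: int, shift: int) -> int:
    """Keep 8 consecutive bits starting at bit `shift`.

    shift = 0 keeps the low bits, 4 the centre bits, 8 the high bits.
    """
    return (value16 >> shift) & 0xFF


def to_rgb(red16: int, green16: int):
    """Build one 24-bit RGB pixel; the blue channel is set to zero."""
    return (sqrt_to_8bit(red16), sqrt_to_8bit(green16), 0)


print(to_rgb(65535, 256))  # (255, 16, 0)
```

The square-root transform compresses bright values more strongly than dim ones, which keeps faint spots visible in the 8-bit display image. It is meant for visualization only; intensity extraction would normally use the 16-bit data.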

During the last few years, a number of image analysis packages for glass slide cDNA microarrays, both commercial software and freeware, have become available [2]. Some of these packages are variants of those that are used to analyze radioactive signals from arrays spotted onto nylon membranes. Other software packages are designed specifically for glass slide microarrays. These specifically designed packages take advantage of the rigid layout of the spots in their spot finding algorithm, utilize the information from the two fluorescent channels, and deal with characteristics of fluorescent signals that are different from signals generated by radioactivity. The processing of scanned microarray images can be separated into three major tasks [2]:

• Addressing or gridding is the process of assigning coordinates to each of the spots. Automating this part of the procedure permits high-throughput analysis.
• Segmentation allows the classification of pixels either as foreground, i.e., within a printed DNA spot, or as background.
• Intensity extraction involves calculating, for each spot on the array, red and green fluorescence intensity pairs (R, G), background intensities, and possibly, quality measures.

In microarray image analysis, the higher the quality of slide printing, target hybridization, and image scanning, the easier it is for image analysis programs to correctly measure spot intensities at each of these stages. Frequently, the last slides of a print run are the most highly variable in terms of spot morphology.

1. ADDRESSING

The basic layout or structure of a microarray image is known, as it is determined by the spot deposition by the arrayer. This information is used to help the microarray image analysis software define the spots. To address the spots in an image, i.e., to match an idealized model of the microarray with the scanned image data, a number of parameters must be estimated, including:

• displacement of grids or spots (translation) from the expected position caused by slight variations in print-tip positions,
• small individual translations of spots,
• separation between rows and columns of grids,
• separation between rows and columns of spots within each grid, and
• overall position of the microarray in the scanned image.
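The "idealized model" mentioned above can be sketched as a function that turns a handful of layout parameters into expected spot centres. This Python sketch is an illustration only: the function name `spot_centres` and all parameter names are hypothetical, and it covers only the overall position and the row and column separations. Individual spot translations, rotation and channel misregistration are not modelled.

```python
def spot_centres(origin, grid_shape, grid_pitch, spot_shape, spot_pitch):
    """Expected (y, x) centre of every spot in an idealized array layout.

    origin:     (y, x) of the first spot of the first grid in the image
    grid_shape: (rows, cols) of grids on the slide
    grid_pitch: (dy, dx) separation between neighbouring grid origins
    spot_shape: (rows, cols) of spots inside each grid
    spot_pitch: (dy, dx) separation between neighbouring spots
    """
    centres = {}
    for gr in range(grid_shape[0]):
        for gc in range(grid_shape[1]):
            for sr in range(spot_shape[0]):
                for sc in range(spot_shape[1]):
                    y = origin[0] + gr * grid_pitch[0] + sr * spot_pitch[0]
                    x = origin[1] + gc * grid_pitch[1] + sc * spot_pitch[1]
                    # Key: (grid row, grid col, spot row, spot col).
                    centres[(gr, gc, sr, sc)] = (y, x)
    return centres


layout = spot_centres((10, 20), (2, 2), (100, 100), (3, 4), (12, 12))
print(len(layout), layout[(1, 1, 2, 3)])  # 48 (134, 156)
```

An addressing procedure would then adjust these parameters, automatically or with user help, until the predicted centres line up with the spots in the scanned image.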

Other parameters that may need to be considered are misregistration of the red and green channels, rotation of the array in the image, or deviation from symmetry due to printer or scanning artefacts. The last two parameters are important issues for automated gridding algorithms. It is important that the addressing procedure be both accurate, to ensure precision with subsequent steps of image analysis, and efficient, to allow rapid slide throughput. Allowing user intervention can increase reliability of the addressing stage, although this has the potential to make the process very slow. The addressing steps are often referred to as “gridding” in the microarray literature. Most software systems now provide both manual and automatic gridding procedures. Addressing procedures are very varied and have not been well documented.

2. SEGMENTATION

Segmentation of an image can generally be defined as the process of partitioning the image into different regions, each having certain properties [2]. In a microarray experiment, segmentation is the classification of pixels as foreground (i.e., within a spot) or background. Segmentation of microarray images allows fluorescence intensities to be calculated for each spotted DNA sequence as measures of relative transcript abundance in the two samples tested on the microarray. Each segmentation method produces a spot mask, which consists of the set of foreground pixels for each given spot. Existing segmentation methods for microarray images can be categorized into four groups, according to the geometry of the spots they produce:

• Fixed-circle segmentation
• Adaptive circle segmentation
• Adaptive shape segmentation
• Histogram segmentation

2.1. FIXED-CIRCLE SEGMENTATION

Fixed-circle segmentation fits a circle with a constant diameter to all the spots in the image. This method is easy to implement and works nicely when all of the spots are circular and of the same size. It was first implemented in the ScanAlyze software written by M. B. Eisen (http://rana.lbl.gov/EisenSoftware.htm), and it is typically provided as a standard option in most software. However, for microarrays with spots of varying size, fixed-diameter segmentation may not be satisfactory.
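A fixed-circle spot mask can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ScanAlyze implementation: the function names are hypothetical, and spot centres are assumed to be already known from the addressing stage.

```python
def circle_mask(centre, radius, shape):
    """Set of (row, col) pixels within `radius` of `centre`, clipped to the image."""
    cy, cx = centre
    rows, cols = shape
    return {
        (y, x)
        for y in range(max(0, int(cy - radius)), min(rows, int(cy + radius) + 1))
        for x in range(max(0, int(cx - radius)), min(cols, int(cx + radius) + 1))
        if (y - cy) ** 2 + (x - cx) ** 2 <= radius ** 2
    }


def fixed_circle_segmentation(centres, radius, shape):
    """Apply the same radius to every spot centre and return one mask per spot."""
    return {name: circle_mask(c, radius, shape) for name, c in centres.items()}


masks = fixed_circle_segmentation({"a": (5, 5), "b": (5, 15)}, 2, (11, 21))
print(len(masks["a"]), len(masks["b"]))  # 13 13
```

Every spot receives a mask of the same size, which is exactly why the method struggles when the printed spots vary in diameter.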

2.2. ADAPTIVE CIRCLE SEGMENTATION

In adaptive circle segmentation, the diameter of the circle that defines the foreground is estimated independently for each spot. The software GenePix (Axon Instruments http://axon.com/GN_Genomics.htm/#Software) for the Axon scanner implements such an algorithm. ScanAlyze and other similar software packages provide the user with the option to manually adjust the circle diameter spot by spot. In practice, this can be very time-consuming, as each array contains thousands of spots. Another example of adaptive circle segmentation is the software Dapple, which finds spots by detecting edges of spots. Briefly, Dapple calculates the negative second derivative of the image (Laplacian). Pixels with high values in the Laplacian image correspond to edges of a spot. In addition, Dapple enforces a circularity constraint by finding the brightest ring (circle) in the Laplacian images.

Adaptive circle segmentation methods work quite well with well-produced microarrays but are less effective with oval- or doughnut-shaped spots. Common sources of non-circularity include the printing process (e.g., features of the print tips, uneven solute deposition) and effects of post-processing of the slides after printing. Segmentation algorithms that do not place restrictions on the shape of the spots are then more desirable.
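One simple way to estimate a per-spot diameter is sketched below in Python. It is not the GenePix or Dapple algorithm, whose details are not given here. It is a stand-in that counts the bright pixels around a known spot centre and converts that area to the radius of a disc of equal area. The function name, the search `window` and the intensity `threshold` are all assumptions.

```python
import math


def adaptive_radius(image, centre, window, threshold):
    """Estimate a spot radius from the number of bright pixels near its centre.

    The bright pixels are treated as a disc of equal area: area = pi * r**2.
    `image` is a list of rows; `centre` is an integer (row, col) pair.
    """
    cy, cx = centre
    rows, cols = len(image), len(image[0])
    area = sum(
        1
        for y in range(max(0, cy - window), min(rows, cy + window + 1))
        for x in range(max(0, cx - window), min(cols, cx + window + 1))
        if image[y][x] > threshold
    )
    return math.sqrt(area / math.pi)


# Synthetic 21 x 21 image with one bright disc of radius 4 at (10, 10).
image = [
    [1000 if (y - 10) ** 2 + (x - 10) ** 2 <= 16 else 10 for x in range(21)]
    for y in range(21)
]
print(round(adaptive_radius(image, (10, 10), 8, 500), 2))
```

Because the estimate depends only on area, an oval or doughnut-shaped spot still yields a circle, which illustrates the limitation noted above.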

2.3. ADAPTIVE SHAPE SEGMENTATION

Two commonly used methods of adaptive shape segmentation in image analysis are watershed and seeded region growing (SRG). Adaptive shape segmentation methods are not incorporated into the most widely used microarray analysis software packages. Both watershed and SRG segmentation require the specification of starting points, or seeds, and the weak points of image segmentation procedures using these methods are typically the selection of the number and the location of the seed points. In microarray image analysis, however, the number of features (spots) is known exactly a priori, and the approximate locations of the spot centres are determined at the addressing stage. Microarray images are therefore well-suited to such methods. The SRG algorithm is implemented in the software Spot (http://www.cmis.csiro.au/iap/Spot/Spotmanual.htm) and AlphaArray. These programs have the advantage of being able to cope with spot shapes that deviate from circles.

Figure 1: The thick white line shows the result of SRG (seeded region growing) segmentation applied to a non-circular spot. The pixels inside the thick white line are classified as foreground and the other pixels are classified as background.

2.4. HISTOGRAM SEGMENTATION

Histogram segmentation methods do not explicitly classify pixels into foreground or background. Instead, these methods analyze pixels within designated regions and estimate foreground intensity from the distribution of the pixel values. This type of method uses a target mask that is chosen to be larger than any given spot. For each spot, foreground and background intensity estimates are determined from a histogram of pixel values for the pixels within the masked area. These methods therefore do not use any local spatial information. An example of histogram segmentation is implemented in the QuantArray software, where the foreground and background are defined as mean intensities between some predefined percentile values. The default values are the 5th and 20th percentiles for the background and the 80th and 95th percentiles for the foreground. Another implementation of the histogram segmentation method is described by Chen et al. (1997). These authors use a circular target mask and compute a threshold value based on a Mann-Whitney test. Pixels are classified as foreground if their value is greater than the threshold and as background otherwise. This method is implemented in the QuantArray software for the GSI Lumonics scanner and in DeArray by Scanalytics. The main advantage of the histogram method is simplicity. However, a major disadvantage is that quantification is unstable when a large target mask has been set to compensate for spot size variation. Furthermore, the resulting spot masks are not necessarily connected.
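The percentile rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the QuantArray implementation: the percentile is taken as a simple index into the sorted pixel values, and the pixel data and the helper idx are hypothetical.

```matlab
% Synthetic pixel values inside a target mask: 60 background, 40 spot pixels.
pixels = [linspace(100, 159, 60), linspace(1000, 1390, 40)];

v = sort(pixels(:));
n = numel(v);

% Assumed percentile rule: index ceil(p * n / 100) into the sorted values.
idx = @(p) max(1, ceil(p * n / 100));

% Background: mean of values between the 5th and 20th percentiles.
bg = mean(v(idx(5):idx(20)));
% Foreground: mean of values between the 80th and 95th percentiles.
fg = mean(v(idx(80):idx(95)));
```

No pixel positions are used at any point, which is why the method is simple but cannot guarantee a connected spot mask.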


3. INTENSITY EXTRACTION

3.1. SPOT INTENSITY

Each pixel value in a scanned image represents the level of hybridization at a specific location on the slide. The total amount of hybridization for a particular spotted DNA sequence is proportional to the total fluorescence at the spot. The natural measure of spot intensity is therefore the sum of pixel intensities within the spot mask. Later calculations are based on ratios of fluorescence intensities. When both channels are measured over the same spot mask, the pixel count cancels and the ratio of averages equals the ratio of sums, so the average pixel value over the spot mask is computed. Alternatively, a ratio of medians may be used, in which case the median pixel value over the spot mask is computed.
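A small numerical check makes the relation between these measures concrete. The 3 x 3 patches and the mask are made-up values, and the sketch assumes that the red and green channels share one spot mask.

```matlab
% Hypothetical 3x3 red and green channel patches and a shared spot mask.
R = [10 12 11; 40 44 42; 9 41 10];
G = [ 5  6  5; 20 22 21; 5 20  6];
mask = logical([0 0 0; 1 1 1; 0 1 0]);

sumR  = sum(R(mask));     sumG  = sum(G(mask));
meanR = mean(R(mask));    meanG = mean(G(mask));
medR  = median(R(mask));  medG  = median(G(mask));

ratio_of_sums    = sumR / sumG;
ratio_of_means   = meanR / meanG;   % same value: the pixel count cancels
ratio_of_medians = medR / medG;     % robust alternative, generally different
```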

3.2. BACKGROUND INTENSITY

It is important in microarray image analysis to adjust for background, as the measured intensity of each spot includes a contribution of nonspecific hybridization and fluorescence emitted from other chemicals on the glass. Apart from histogram-based methods, the segmentation procedures described above identify local background regions for this calculation. The various background methods implemented in software packages can be divided into four categories as described below.

3.2.1. Local Background Intensities

These intensities are estimated by focusing on small regions surrounding the spot mask. Usually the background estimate is the median of pixel values within these specific regions. Most software packages currently use this approach.


Figure 2: Image illustrating different local background adjustment methods. The region inside the dashed circle represents the spot mask and the other regions bounded by lines represent regions used for local background calculation by different methods. Solid circles: used in QuantArray; dotted square: used in ScanAlyze; and dashed diamond shapes: used in Spot

Figure 2 illustrates different local background adjustment methods. One approach (used in ScanAlyze) is to include all pixels within a square centred at the spot centre and exclude all those that are within the spot mask (Figure 2, dotted square). An alternative approach, employed in QuantArray and ArrayVision, is to estimate the background from the area between two concentric circles (Figure 2, solid circles). By not considering the pixels immediately surrounding the spots, the background estimate is less sensitive to the performance of the segmentation procedure. Finally, local background may be estimated in the regions having the furthest distance from all four surrounding spots, also known as the valley regions (Figure 2, diamond-shaped areas). The local background for each spot can be estimated by the median of values from the four surrounding valleys. Depending on the software, the shapes of the local valley regions are different, but this method of background estimation is somewhat independent of segmentation results. GenePix implements this method for background estimation. Using valley pixels that are distant from all spots ensures, to a large degree, that the background estimate is not corrupted by pixels belonging to a spot. Corruption by bright pixels may occur in the other methods, particularly the ScanAlyze method, introducing an upward bias into the background estimate. Using remote pixels reduces this bias effectively but entails the use of a smaller number of pixels and therefore increases the variance of the estimate, an example of a bias-variance trade-off.
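The concentric-circle variant can be sketched in a few lines. The radii r_inner and r_outer are assumed values chosen for this synthetic patch, not defaults of any package; the gap between the spot and the inner radius plays the role of the excluded pixels immediately surrounding the spot.

```matlab
% Synthetic patch: background level 50, a bright spot of radius 5 at the centre.
[x, y] = meshgrid(1:41, 1:41);
r = sqrt((x - 21).^2 + (y - 21).^2);
img = 50 * ones(41);
img(r <= 5) = 500;

% Assumed radii (pixels): skip a guard band around the spot, then sample a ring.
r_inner = 8;
r_outer = 12;
ring = (r > r_inner) & (r <= r_outer);

% Local background estimate: median of the ring pixels.
bg_local = median(img(ring));

% Background-corrected mean spot intensity.
spot_mean = mean(img(r <= 5)) - bg_local;
```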

3.2.2. Morphological Opening

A second approach to background adjustment relies on a nonlinear filter called a morphological opening [2]. This filter is obtained by computing a local minimum filter (erosion) followed by a local maximum filter (dilation) with the same window. In a microarray image, the effect of using a window that is larger than any spot is to replace all spots with nearby background values. In Spot, the morphological opening is applied to the original images using a square structuring element with a side length at least twice as large as the spot separation distance. This operation removes all of the spots and generates an image that is an estimate of the background for the entire slide. For individual spots, background is estimated by sampling this background image at the nominal centre of the spot. Because a large window is used to create the morphological background image, spatial variation is expected to be low. Morphological opening results in smaller background estimates than those of other, simpler methods. More importantly, morphological background estimation is expected to be less variable than the other methods, because spot background estimates are based on pixel values in a large local window, yet are not biased upward by brighter pixels belonging to or on the edge of spots.
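The erosion-then-dilation construction can be tried on a synthetic image. The sketch assumes the Image Processing Toolbox functions imerode and imdilate are available; the spot layout, the spacing value and the variable names are hypothetical, and the window side is chosen as just over twice the spot separation, following the description of Spot above.

```matlab
% Requires the Image Processing Toolbox (imerode, imdilate).
[x, y] = meshgrid(1:60, 1:60);
img = 50 * ones(60);
centres = [20 20; 40 20; 20 40; 40 40];   % spot centres, 20 pixels apart
for k = 1:size(centres, 1)
    img((x - centres(k, 1)).^2 + (y - centres(k, 2)).^2 <= 4^2) = 500;
end

spacing = 20;
nhood = ones(2 * spacing + 1);     % square window, side > 2 x spot separation

% Opening = local minimum filter (erosion) then local maximum filter (dilation).
bg_img = imdilate(imerode(img, nhood), nhood);

% Background for each spot: sample the background image at the nominal centre.
bg_at_centres = bg_img(sub2ind(size(bg_img), centres(:, 2), centres(:, 1)));
```

Since the window never fits inside a spot, every spot is replaced by the surrounding background level.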

3.2.3. Constant Background

A global method that subtracts a constant background for all spots is yet another alternative for estimating background noise [2]. The approaches previously described assume that the nonspecific contribution to a spot signal can be estimated from the area surrounding the spot. However, some findings (X.J. Lou, unpublished. DNA Microarrays) suggest that the binding of fluorescent dyes to negative control spots (for example, spots corresponding to plant genes that will not hybridize with human mRNA samples) is lower than the binding to the glass slide. If this is the case, it may be more meaningful to estimate background on the basis of a set of negative control spots. This approach is of limited value if the background is unevenly distributed across the slide, for example due to inadequate washing.

3.2.4. No Adjustment

Finally, we also consider the possibility of no background adjustment at all. A comparison of different methods [3] found that the choice of background correction method has a larger impact on the log ratio of intensities than the segmentation method used. Thus, finding better segmentation methods may not be as important as choosing a stable background adjustment method. For the estimation of the background contribution, it is suggested [3] that morphological opening provides a better estimate of background than other methods. The log ratios log2(R/G) computed after morphological background correction tended to exhibit low within-slide and between-slide variability. In addition, this method did not seem to compromise accuracy. A good image analysis method should permit a clear distinction to be made between differentially expressed genes and noise. Methods using local background subtraction show greater variability around the low intensity spots than methods using either no background subtraction or the morphological background adjustment. Local background adjustment tends to blur the distinction between differentially expressed genes and noise. The comparison of different background correction methods indicates that estimates based on means or medians over local neighbourhoods tend to be quite noisy and can potentially double the standard deviation of the log ratios. At the other extreme, no background adjustment seems to reduce the ability to identify differentially expressed genes [3].


A widely accepted way to correct spot intensities for background has yet to emerge. It seems possible that use of large numbers of negative control spots across a slide may lead to better background adjustments, which differ from those obtained by measuring fluorescence on non-spot regions of the slide in ways that have been discussed above. Any advice must necessarily be tentative, but for the moment, it is recommended to adjust the background by using a method that varies the background more slowly than do local background methods [2].

4. OVERVIEW OF EXISTING SOFTWARE [4]

4.1. ScanAlyze

In ScanAlyze, the addressing phase is manual. Various parameters for grid construction have to be specified (e.g. separation between rows and columns). ScanAlyze uses the fixed circle segmentation method. The circle size is set manually during the addressing phase. All spots are considered valid even if they do not enclose any signal; such spots can be excluded manually from the information extraction phase. The local background intensity is estimated using median values. The ratio G/R of the average foreground intensities is calculated, optionally after background correction. The software produces a set of information and several quality values (e.g. channel correlation) for each spot.

4.2. GenePix

GenePix uses a square root transformation to reduce the dynamic range of input images. It is also possible to select manually the 8 bits to be saved, or to use the predefined options to save the high, low or central bits. Addressing is automatic. In the previous version, circle-based segmentation with variable diameter was used, and a spot was classified as “not found” according to certain conditions (e.g. “spot diameter less than 6 pixels”, “spot position overlaps another spot” or “spot diameter is outside the planned options limits”). In the latest version this segmentation method has been replaced by irregular spot segmentation. Local and global methods are used to compute background and foreground intensities for each spot. Spot circularity is also used as a measure of quality.

4.3. Spot

Spot uses a linear combination weighted by the median values to obtain a single image, I = G’ + (mG/mR)*R’, where G’ and R’ are the images obtained from G and R using a square root transformation and mG and mR are the corresponding median values. Spot addressing is based on batch processing over a collection of microarray images with the same geometric structure. The successive steps are automatic and produce two estimated grids:

• fitted foreground grids: horizontal and vertical lines passing through the centres of the estimated spots;

• fitted background grids: horizontal and vertical lines passing through the gaps between the estimated spots.
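The combination formula I = G’ + (mG/mR)*R’ can be evaluated on a toy example. The sketch assumes that mG and mR are the medians of the square-root-transformed images (the text does not say whether they are taken before or after the transform), and the 2 x 2 channel values are made up.

```matlab
% Hypothetical red and green channel intensities.
R = [ 400  900; 1600 2500];
G = [ 100  400;  900 1600];

% Square-root transform of each channel.
Rp = sqrt(R);
Gp = sqrt(G);

% Assumption: mG and mR are the medians of the transformed images.
mR = median(Rp(:));
mG = median(Gp(:));

% Median-weighted linear combination: I = G' + (mG/mR) * R'.
I = Gp + (mG / mR) * Rp;
```

The weight mG/mR rescales the red channel so that its median matches that of the green channel before the two are added.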

In the segmentation phase a seeded region growing algorithm is used. The seeds are chosen according to the estimated grids. Background and foreground intensities and quality measures are computed similarly to GenePix.

4.4. Angulo and Serra

Angulo and Serra combine the input images using a linear combination weighted by the median. The overall pipeline combines addressing and gridding techniques, making use of morphological operators together with classical segmentation algorithms. The overall performance is evaluated in terms of segmentation accuracy, without providing quality measures.

4.5. Matarray

Matarray uses a combination of intensity and spatial information for spot detection and signal segmentation. The anchor point and grid dimension are specified by the user. Starting from a first rough identification of the spot centres, the overall area is split into patches, and a circular mask is defined for each patch and used for spot segmentation. An iterative process then calculates the signal intensity and local background to improve the detection. A combined quality index is used for quality assessment.

4.6. MicroArray Genome Imaging and Clustering Tool (MAGIC)

MicroArray Genome Imaging and Clustering Tool (MAGIC) analyzes all types of gene expression data on all major operating systems. Visualization is performed using a weighted linear combination. Gridding does not require grid and feature dimensions or spacing. Segmentation is performed with one of three algorithms: fixed circle, adaptive circle or seeded region growing. The fixed circle is centred in the grid square, with a user-specified radius. The adaptive circle algorithm examines the signal in each grid square to determine the most appropriate centre and radius (within a user-specified range) for each circle; the centre of each adaptive circle is set so that the circle contains the largest number of ‘on’ pixels. The seeded region growing algorithm connects each pixel to a background or foreground region, continuing until all pixels are assigned. A user-specified threshold and geometric considerations determine which pixels are used to ‘seed’ the regions. The user can choose whether to take the background into account when computing the green/red signal ratio. Any spot can be ignored by flagging it manually. MAGIC creates an “Expression file” containing the foreground and background intensities of each unflagged spot for each channel, together with the ratio of the channel intensities.


III. PROBLEM DEFINITION

Microarray images are difficult to handle in general, due to several factors such as the variability from experiment to experiment, noise, and image artefacts [7]. Segmentation of spots in microarray images can be further complicated by non-uniform shape and surface intensity distribution. In addition, the effect of background noise and reflection should also be taken into account. The aim of the project is to classify pixels as foreground (the signal of interest) or background, taking the background effects into account.


IV. EXPERIMENTAL DETAILS

The project aims to perform an analysis of a microarray image with addressing and segmentation leading to intensity extraction of each spot and to perform a comparison among the various techniques available for segmentation. Gene expression information acquired in the form of an image has to be extracted from the image using the techniques of image processing and analysis. Image processing of microarrays consists of the following steps:

1. Gridding: identify spots (automatic, semiautomatic, manual)
2. Segmentation: separate spots from background (fixed circle, adaptive circle, adaptive shape, histogram)
3. Intensity extraction: mean or median of pixels in spot
4. Background correction: local or global

In this data acquisition and processing pipeline, segmentation of cDNA spots is the most challenging task. Segmentation is a process that divides an image into mutually exclusive regions. Each region is homogeneous with respect to a region property such as gray-level intensity.

Image → Gridding → Segmentation → Background correction → Intensity Extraction → Output Data

Figure 3: Block Diagram for Microarray Image Analysis

The entire system is implemented on the MATLAB 7.0.1 platform using the built-in Image Processing Toolbox.


The algorithm followed for processing the microarray image and comparing the various segmentation techniques has the following steps:

1. PREPROCESSING

To avoid altering the original intensities of the microarray image, a binary reference image is generated from the original image in the preprocessing step. This reference image can be used for calculating the parameters necessary for gridding.

Preprocessing starts with image enhancement. The RGB image is first read from the file and then converted to greyscale using the function rgb2gray. rgb2gray converts RGB images to greyscale by eliminating the hue and saturation information while retaining the luminance. I = rgb2gray(RGB) converts the true colour image RGB to the greyscale intensity image I. rgb2gray converts the RGB values to NTSC coordinates, sets the hue and saturation components to zero, and then converts back to RGB colour space [5].

The intensity image obtained above is then converted to double precision format using the function im2double. im2double takes an image as input and returns an image of class double. If the input image is of class double, the output image is identical to it. If the input image is not double, im2double returns the equivalent image of class double, rescaling or offsetting the data as necessary. I2 = im2double(I) converts the intensity image I to double precision, rescaling the data if necessary [5].

Next, the histogram of the image obtained above is equalized using the function adapthisteq. J = adapthisteq(I) enhances the contrast of the intensity image by transforming the values using contrast-limited adaptive histogram equalization (CLAHE). CLAHE operates on small regions in the image, called tiles, rather than the entire image. Each tile's contrast is enhanced, so that the histogram of the output region approximately matches the histogram specified by the 'Distribution' parameter. [In our program, the 'Distribution' parameter is not specified, and so it defaults to 'uniform', which represents a flat histogram.] The neighbouring tiles are then combined using bilinear interpolation to eliminate artificially induced boundaries. The contrast, especially in homogeneous areas, can be limited to avoid amplifying any noise that might be present in the image [5].

After histogram equalization, the image intensity values are adjusted using imadjust. J = imadjust(I) maps the values in intensity image I to new values in J such that 1% of the data is saturated at low and high intensities of I. This increases the contrast of the output image J [5].
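The preprocessing chain described above can be put together as a short sketch. It assumes the Image Processing Toolbox is installed and uses a synthetic RGB image in place of a scanned microarray file; the variable names follow the text, and all functions are called with their default parameters, which may differ from the settings used in the project.

```matlab
% Requires the Image Processing Toolbox.
% Synthetic RGB "microarray": dim background with two bright spots.
RGB = uint8(20 * ones(128, 128, 3));
RGB(30:45, 30:45, 1) = 200;            % red spot
RGB(80:95, 80:95, 2) = 180;            % green spot

I  = rgb2gray(RGB);                    % keep luminance only
I2 = im2double(I);                     % class double, scaled to [0, 1]
J  = adapthisteq(I2);                  % CLAHE, default 'uniform' distribution
K  = imadjust(J);                      % saturate 1% at both ends
```

With a real image, the first line would be replaced by a call to imread on the scanned file.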

2. EDGE DETECTION

The edges in the image are found in order to identify the locations of the spots. The function edge takes an intensity image I as its input, and returns a binary image BW of the same size as I, with 1's where the function finds edges in I and 0's elsewhere. edge supports six different edge-finding methods. The Sobel, Prewitt and Roberts methods find edges using the corresponding approximation to the derivative, returning edges at those points where the gradient of I is maximum. The Laplacian of Gaussian method finds edges by looking for zero crossings after filtering I with a Laplacian of Gaussian filter. The zero-cross method finds edges by looking for zero crossings after filtering I with a filter you specify [5].

Ideally, an edge point should be detected precisely, in the sense that a true edge point in an image should not be missed, while a false, non-existent edge point should not be erroneously detected. These conditions conflict with each other, and the balance between them depends on the threshold. If the selected threshold is large, true edge points may go undetected, while if the threshold is low, many noisy points may be falsely detected as edge points. A good edge detector therefore depends on the choice of the right threshold. Since the gradient operation enhances noise, the best approach is to filter out the high spatial frequencies, which contain the noise, before applying the gradient operators. Canny proposed that an optimal detector can be approximated by the first derivative of a Gaussian [8].

Canny’s Edge Detection: Canny's detector ensures good noise immunity and at the same time detects true edge points with minimum error. Canny used three criteria to design his edge detector:

• Reliable detection of edges, with a low probability of missing true edges and a low probability of detecting false edges.

• The detected edges should be close to the true location of the edge.

• There should be only one response to a single edge.

He optimised the edge detection process by:

• Maximizing the S/N ratio of the gradient.

• Maximizing an edge localization factor, which ensures that the detected edge is localized as accurately as possible.

• Minimizing multiple responses to a single edge.

The S/N ratio of the gradient is maximized when true edges are detected and false edges are avoided. Thus, when there are multiple responses to a single edge, discarding the false responses removes the noise-corrupted edges. However, increasing the S/N ratio by a factor decreases localization by the same factor, which suggests maximizing the product of the two. The optimum filter derived from these requirements can be approximated by the first derivative of a Gaussian filter. The choice of σ depends on the size or scale of the objects contained in the image. The method proceeds as follows [8]:

• First, the image is convolved with a Gaussian smoothing filter of standard deviation σ.

• This operation is followed by a gradient operation.

• Non-maxima suppression: The gradient operation produces thick edges that are wider than a pixel. Non-maxima suppression thins down these broad ridges of gradient magnitude. There are several techniques for such a thinning operation. In one technique, the edge magnitudes of two neighbouring edge pixels perpendicular to the edge direction are considered and the one with the lesser edge magnitude is discarded.

• Double thresholding: Two thresholds T1 and T2, where T2 ≈ 1.5T1, are used to create two different edge images E1 and E2. E1 will contain some false edge points, whereas E2 will contain very few false edge points but will miss a few true edge points. The algorithm first links all edge points in E2 to form contours. At the ends of each contour, it searches the 8-neighbourhood for the next edge points in the edge image E1. The gaps between two edge contours are filled by taking edge points from E1 until the gap has been completely filled. This process yields complete contours made up of the true edges of the image. Threshold selection is the most important part of the whole procedure.
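The double-thresholding step can be illustrated with a simplified sketch. It is not Canny's contour-following procedure: it grows the strong edge image E2 into 8-connected pixels of E1 by repeated dilation, which gives the same kind of result on this toy input. The gradient magnitudes and the value of T1 are made up.

```matlab
% Hypothetical gradient-magnitude image (after non-maxima suppression).
mag = [0 0 0 0 0 0 0;
       0 9 6 5 0 0 0;
       0 0 0 0 0 5 0;
       0 0 0 0 0 0 0];

T1 = 4;            % low threshold (assumed value)
T2 = 1.5 * T1;     % high threshold, T2 = 1.5 * T1 as in the text

E1 = mag > T1;     % weak and strong candidates (may contain false edges)
E2 = mag > T2;     % strong edges only

% Grow the strong edges into 8-connected pixels of E1 until nothing changes.
edges = E2;
while true
    grown = (conv2(double(edges), ones(3), 'same') > 0) & E1;
    if isequal(grown, edges), break; end
    edges = grown;
end
```

The weak pixels next to the strong one are kept, while the isolated weak pixel at row 3, column 6 is discarded.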

In MATLAB®, the Canny method finds edges by looking for local maxima of the gradient of I. The gradient is calculated using the derivative of a Gaussian filter. The method uses two thresholds, to detect strong and weak edges, and includes the weak edges in the output only if they are connected to strong edges. This method is therefore less likely than the others to be fooled by noise, and more likely to detect true weak edges.
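A minimal usage sketch of the toolbox function is given here, assuming the Image Processing Toolbox is available. The synthetic disk image and the explicit threshold and sigma values in the second call are illustrative only and are not the settings used in the project.

```matlab
% Requires the Image Processing Toolbox (edge).
% Synthetic image: one bright disk on a dark background.
[x, y] = meshgrid(1:64, 1:64);
I = double((x - 32).^2 + (y - 32).^2 <= 10^2);

% Default thresholds and sigma; the thresholds chosen by edge are also returned.
[BW, thresh] = edge(I, 'canny');

% Explicit [low high] thresholds and Gaussian sigma (illustrative values).
BW2 = edge(I, 'canny', [0.1 0.3], 1.5);
```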


After edge detection, the image regions defined by the edges obtained above are filled using the imfill function. BW2 = imfill(BW,'holes') fills holes in the binary image BW. A hole is a set of background pixels that cannot be reached by filling in the background from the edge of the image [5].

The noise that has been emphasized by the intensity adjustment step needs to be removed. Mathematical morphology has been widely used in image processing, especially when dealing with changes in geometric shape. Here morphological opening (erosion followed by dilation) can be used to eliminate the noise from the local region. The function bwmorph is used to perform erosion on the filled binary image. Erosion “shrinks” or “thins” objects in a binary image. The manner and extent of the shrinking process is controlled by a structuring element. Erosion can be seen as a process of translating the structuring element throughout the domain of the image and checking to see where it fits entirely within the foreground of the image. The output image has a value of 1 at each location of the origin of the structuring element such that the element overlaps only 1-valued pixels of the input image (i.e., it does not overlap any of the image background). A statement such as BW2 = bwmorph(BW,operation) applies a specific morphological operation to the binary image BW. Here the operation is 'erode', which performs erosion using the structuring element ones(3) [5].
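The fill and erosion steps can be tried on a small synthetic edge map. The sketch assumes the Image Processing Toolbox; the square outline and the single noise pixel are hypothetical stand-ins for a detected spot boundary and a spurious edge response.

```matlab
% Requires the Image Processing Toolbox (imfill, bwmorph).
% Synthetic edge map: the outline of a 9x9 square plus one isolated noise pixel.
BW = false(20);
BW(5:13, 5)  = true;   BW(5:13, 13) = true;
BW(5, 5:13)  = true;   BW(13, 5:13) = true;
BW(18, 18)   = true;                     % noise

filled = imfill(BW, 'holes');            % closed outline becomes a solid square
eroded = bwmorph(filled, 'erode');       % one erosion with ones(3)
```

The erosion shrinks the filled square by one pixel on each side and removes the isolated noise pixel, because the 3 x 3 structuring element cannot fit inside it.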

3. GRIDDING

The next step is to detect an optimum sub-image for calculating the intensity projection profile. This is done by applying block processing to the binary reference image. The block size is selected depending on the size of the input image. A sliding window of the specified block size is defined, which moves row-wise and column-wise. The mean intensity value of the pixels inside the window is calculated at each position. The block with the maximum mean intensity is considered the optimum sub-image.

The intensity projection profile of the sub-image is calculated by summing the intensity of each pixel along the x and y axes. The horizontal sum profile Srow is obtained by adding the pixel intensities of every row. Similarly, the vertical sum profile Scol is obtained by adding the pixel intensities of every column of the sub-image I of size (M × N). The sum profile is obtained after applying thresholding to this sub-image. The width of the bars in this profile indicates the diameter of the spots, and the spacing between bars indicates the distance between spots. The medians of all the spot diameters and inter-spot distances are taken as the estimated spot diameter and the estimated distance between spots, respectively. These parameters are essential for precisely gridding the image.
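The projection-profile estimate can be reproduced on a synthetic binary sub-image. The sketch assumes that any non-zero column sum counts as part of a bar; the spot layout and the names bars, rise and fall are hypothetical, and only the column profile is analysed (the row profile is handled the same way).

```matlab
% Synthetic binary sub-image: 3 x 3 spots, each 6 pixels wide, 4-pixel gaps.
I = false(32);
starts = [3 13 23];
for r = starts
    for c = starts
        I(r:r+5, c:c+5) = true;
    end
end

Srow = sum(I, 2)';      % horizontal sum profile: one value per row
Scol = sum(I, 1);       % vertical sum profile: one value per column

% Threshold the profile (assumed: any non-zero sum counts as "inside a bar").
bars = [0, Scol > 0, 0];
rise = find(diff(bars) == 1);        % first column of each bar
fall = find(diff(bars) == -1);       % first column after each bar

spot_diameter = median(fall - rise);                    % bar widths
spot_gap      = median(rise(2:end) - fall(1:end-1));    % spacing between bars
```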

4. SEGMENTATION

Segmentation subdivides an image into its constituent regions or objects. The level to which the subdivision is carried depends on the problem being solved. That is, segmentation should stop when the objects of interest in an application have been isolated. In microarray image processing, segmentation refers to the classification of pixels as either the signal or the surrounding area, i.e. foreground or background [9]. Segmentation algorithms for monochrome images are generally based on one of two basic properties of image intensity values: discontinuity and similarity [8]. In the first category, the approach is to partition an image based on abrupt changes in intensity, such as edges in an image. The principal approaches in the second category are based on partitioning an image into regions that are similar according to a set of predefined criteria. Here, segmentation is done using three methods: fixed circle, adaptive circle and region growing, which is an adaptive shape method. Before performing segmentation, the red and green channel intensity images are extracted from the original RGB image of the microarray, and subsequent processing is done on each of these images separately.

For the fixed circle method, hexagons (which form the closest approximation to a circle) are drawn in each grid, the length of the side being determined by the grid size. Since the grid size is the same for a given image, the size of the hexagons (circles) is the same for all the spots in a particular image. These small hexagons are then filled with the function imfill, which has been explained earlier. The filled regions are then considered as the foreground, while the rest of the area in the grid is considered as background. The mean or median intensities, as the case may be, can be calculated using these foreground and background regions.

For the adaptive circle method, the image is thresholded using the threshold computed by the function graythresh, and then each region is labelled using bwlabel. The regions that are above the threshold determined by graythresh are approximately circular and are considered as the foreground, while regions below the threshold are considered as the background.

As its name implies, region growing is a procedure that groups pixels or sub-regions into larger regions based on predefined criteria for growth [8]. The basic approach is to start with a set of “seed” points and from these grow regions by appending to each seed those neighbouring pixels that have predefined properties similar to the seed (such as specific ranges of gray level or colour). Here, a function regiongrow [8] is used to do basic region growing. The syntax for this function is [g, NR, SI, TI] = regiongrow(f, S, T), where f is an image to be segmented and parameter S can be an array (the same size as f) or a scalar. If S is an array, it must contain 1s at all the coordinates where seed points are located and 0s elsewhere.
Such an array can be determined by inspection, or by an external seed-finding function. If S is a scalar, it defines an intensity value such that all the points in f with that value become seed points. Here we have specified S as a scalar representing the maximum intensity in a particular grid. Similarly, T can be an array (the same size as f) or a scalar. If T is an array, it contains a threshold value for each location in f. If T is a scalar, it defines a global threshold. The threshold value(s) is (are) used to test if a pixel in the image is sufficiently similar to the seed or seeds to which it is 8-connected. Here, we have given the threshold T as a scalar equal to twice the standard deviation of the intensity in each grid. In the output, g is the segmented image, with the members of each region being labelled with an integer value. Parameter NR is the number of different regions. Parameter SI is an image containing the seed points, and parameter TI is an image containing the pixels that passed the threshold test before they were processed for connectivity. Both SI and TI are of the same size as f. The function bwmorph is used to reduce to 1 the number of connected seed points in each region in S (when S is an array), and the function imreconstruct is used to find pixels connected to each seed [8].
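The seed and threshold choices described above can be tried in a simplified stand-in for region growing. This is not the regiongrow function of [8]: it applies the same two stages (a threshold test against the seed value, then a connectivity test) using only base MATLAB, with repeated 3 x 3 dilation via conv2 in place of imreconstruct. The synthetic grid cell and the variable names are hypothetical.

```matlab
% Synthetic grid cell: a bright, irregular spot on a darker background.
f = 10 * ones(9);
f(3:6, 3:7) = 100;
f(7, 4:5)   = 95;
f(4, 5)     = 110;                 % brightest pixel, used as the seed

% Choices taken from the text: seed = maximum intensity in the cell,
% threshold T = twice the standard deviation of the cell intensities.
[seed_val, seed_idx] = max(f(:));
T = 2 * std(f(:));

% Threshold test: pixels similar enough to the seed value.
candidate = abs(f - seed_val) <= T;

% Connectivity test: keep only candidates that are 8-connected to the seed.
g = false(size(f));
g(seed_idx) = true;
while true
    grown = (conv2(double(g), ones(3), 'same') > 0) & candidate;
    if isequal(grown, g), break; end
    g = grown;
end

foreground_mean = mean(f(g));
background_mean = mean(f(~g));
```

The grown region follows the irregular outline of the spot, including the two pixels that protrude below it, which a circular mask would miss or pad with background.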

5. BACKGROUND CORRECTION

It is common practice to approximate the local background of a spot using the inter-spot background calculated from pixels within the proximity of the target spot (excluding other nearby spots). The most common way to get such an estimate is to calculate the sample mean or median of pixels identified as background pixels in a region near each spot [9]. ScanAlyze, ImaGene, and GenePix are examples of software packages that implement this type of method. Other ways of estimating local background include those using histogram information and those using rank filters, which do not depend on the segmentation and the exact positioning of spots.

Figure 4: Background region definition used in ScanAlyze (left, the square minus the gray disk), ImaGene and QuantArray (middle, the larger disk minus the center disk) and GenePix (right, the small rectangle regions at the four corners).

Here we have used a newer technique, ELB-Q (Extended Local Background Quantification), which performs local background estimation, segmentation, and quantification of the microarray image. We implemented it separately and not as a continuation of the above steps. ELB-Q is reported to reduce the uncertainties in background and foreground pixel classification and, thus, to increase the robustness and accuracy of the quantification of spot intensities [9]. In ELB-Q, pixels in a large “extended local background” (ELB) region, which is similar in spirit to the regions used in ScanAlyze, are used to obtain robust estimates of the local background. In segmentation, based on the estimates of the mean and the variance from the local background calculation, background pixels inside the putative target spot regions are further identified and excluded. Finally, quantification of the spot intensities is performed based on the classification of the background/foreground pixels in the target spot area. The method takes advantage of the abundant spatial information around each spot of interest, makes no assumption about the shape and size of the spots, and needs no sophisticated adjustment.

Figure 5: A sample configuration of the extended local background (the shaded area within the square region). Note that the ELB background estimate obtained using such a region is for the centre spot only.

Figure 6: Sample ELB configurations. (a) Square (5×5, spot-wise). (b) Circle (radius is 55, pixel-wise). (c) Rectangle (sides are 104 and 72, pixel-wise). (d) Ellipse (long and short radii are 55 and 36, pixel-wise).

Figure 7: Flowchart of quantification procedure in ELB-Q.

The algorithm is as explained below [9].

Step 1) Choose an ELB configuration: Based on the layout of the sub-arrays and the density of the grids, a suitable ELB configuration must be chosen. Here we have implemented a square, spot-wise 3×3 configuration.

Step 2) Calculate the local background cutoff value: ELB-Q uses the spot gridding template constructed in the gridding stage to estimate the global background. All pixels in the background region of the whole chip are collected and sorted according to their intensity values. The top 20% of the pixels are excluded when estimating the global background, which helps to disregard noise such as chemical contamination on the microarray slide. We calculate the mean and standard deviation, $\bar{v}_{\text{global}}$ and $\sigma_{\text{global}}$, of the remaining pixels, whose intensities roughly follow a shifted Gaussian distribution. The local background cutoff threshold, $v_{\text{bg cutoff}}$, is determined as

$$v_{\text{bg cutoff}} = \bar{v}_{\text{global}} + c_1 \sigma_{\text{global}} \qquad (5)$$

in which $c_1$ could assume a value of either 1, 2, or 3 to yield confidence intervals of 91%, 95%, or 99%, respectively (the default value is set to $c_1 = 1$), so that all “true” background pixels are included in the local background calculation in the ELB region in the next step.

Step 3) Calculate the local background value and local foreground cutoff: In the ELB region, i.e., the large area template from the chosen ELB configuration centered at the target spot, the mean value and standard deviation of all pixels with intensity lower than $v_{\text{bg cutoff}}$ are calculated as $\bar{v}_{\text{local background}}$ and $\sigma_{\text{local background}}$, respectively, and the former is used as the local background intensity, i.e., $v_{\text{ELB}} \equiv \bar{v}_{\text{local background}}$. Other pixels are considered “noise of the background” that might be due to contamination within the ELB region and are therefore excluded when estimating the local background. The local foreground cutoff threshold $v_{\text{foreground cutoff}}$ is then established as

$$v_{\text{foreground cutoff}} = v_{\text{ELB}} + c_2 \sigma_{\text{local background}} \qquad (6)$$

in which $c_2$ could assume a value of 1, 2, or 3 to yield confidence intervals of 91%, 95%, or 99%, respectively, such that all “true” background pixels are excluded from the foreground estimates, assuming a shifted Gaussian distribution of the local background pixels (the default value is set to $c_2 = 2$).

Step 4) Calculate the foreground intensities: In the spot region, pixels with intensities less than or equal to $v_{\text{foreground cutoff}}$ are classified as background pixels. Pixels with intensities greater than $v_{\text{foreground cutoff}}$ are classified as “true” foreground pixels, from which we compute the median or mean value $v_{\text{foreground}}$.

Step 5) Quantify the true signals: Foreground intensities calculated in Step 4 contain contributions from both “true” signals and background noise. Assuming the inter-spot background intensity to be equal to the background intensity of the actual spot, the true signal intensity is obtained through background correction by the following equation:

$$v_{\text{true signal}} = v_{\text{foreground}} - v_{\text{ELB}}$$

In our implementation of this method, only spots that were originally bright could be included in the calculation, because we used the graythresh and bwlabel functions to convert the microarray image into a labelled binary image. The dark regions could not be handled, because no gridding as such was performed before applying the ELB-Q algorithm.
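Steps 2 to 5 can be summarised in a short Python sketch. It is only an illustration of the arithmetic, not our MATLAB® implementation and not the reference implementation of [9]. The function name elbq_true_signal and its three input lists are hypothetical, the pixel selection for each region is assumed to have been done already, the population standard deviation is assumed, and the mean (rather than the median) is used for the foreground.

```python
from statistics import mean, pstdev

def elbq_true_signal(chip_background, elb_pixels, spot_pixels, c1=1, c2=2):
    """Return (v_elb, v_foreground, v_true_signal) for one spot.

    chip_background: intensities of all background pixels on the chip
    elb_pixels:      intensities inside the ELB region around the spot
    spot_pixels:     intensities inside the target spot region
    """
    # Step 2: drop the brightest 20% of the chip background pixels, then
    # set the background cutoff from the mean and std of the rest (Eq. 5).
    ordered = sorted(chip_background)
    kept = ordered[:int(len(ordered) * 0.8)]
    v_bg_cutoff = mean(kept) + c1 * pstdev(kept)

    # Step 3: local background from the ELB pixels below the cutoff (Eq. 6).
    local_bg = [v for v in elb_pixels if v < v_bg_cutoff]
    v_elb = mean(local_bg)
    v_fg_cutoff = v_elb + c2 * pstdev(local_bg)

    # Step 4: foreground pixels are the spot pixels strictly above the cutoff.
    foreground = [v for v in spot_pixels if v > v_fg_cutoff]
    v_foreground = mean(foreground)

    # Step 5: background-corrected signal.
    return v_elb, v_foreground, v_foreground - v_elb
```

The sketch raises an error if a region contains no qualifying pixels (for example, a spot with no pixel above the foreground cutoff); a practical implementation would have to decide how to report such spots.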

6. INTENSITY EXTRACTION

Segmentation of the microarray image was performed using the fixed circle, adaptive circle and adaptive shape (seeded region growing) methods on both the red and green channel images. The mean intensity in the foreground (spot) region was calculated for the red and green intensity images, and the logarithm of the intensity ratio R/G was calculated as log2(R/G) for each spot in the microarray image. The values were calculated for each segmentation method so that the performance of the methods could be compared.
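The log-ratio calculation for a single spot can be sketched in Python as follows. The function name log_ratio and the representation of the segmentation result as a 0/1 mask are assumptions for illustration; the mask could come from any of the three segmentation methods.

```python
from math import log2
from statistics import mean

def log_ratio(red, green, mask):
    """log2 of mean red over mean green, taken over pixels where mask is 1."""
    red_fg = [red[r][c] for r, row in enumerate(mask) for c, m in enumerate(row) if m]
    green_fg = [green[r][c] for r, row in enumerate(mask) for c, m in enumerate(row) if m]
    return log2(mean(red_fg) / mean(green_fg))
```

A negative value indicates that the green channel is brighter than the red channel over the spot, and a positive value indicates the opposite.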

7. GRAPHICAL USER INTERFACE

GUIDE, the MATLAB® graphical user interface development environment, provides a set of tools for creating graphical user interfaces (GUIs). These tools greatly simplify the process of designing and building GUIs. You can use the GUIDE tools to [5]:

• Lay out the GUI. Using the GUIDE Layout Editor, you can lay out a GUI easily by clicking and dragging GUI components (such as panels, buttons, text fields, sliders and menus) into the layout area. GUIDE stores the GUI layout in a FIG-file.

• Program the GUI. GUIDE automatically generates an M-file that controls how the GUI operates. The M-file initializes the GUI and contains a framework for the most commonly used callbacks for each component, that is, the commands that execute when a user clicks a GUI component. Using the M-file editor, you can add code to the callbacks to perform the functions you want.

Here we have developed a GUI using the GUIDE tool. It enables the user to select and open a file containing the desired microarray image. After loading the input image, the user can select the segmentation method to be used in processing the image. On selecting the segmentation method and clicking the ‘Start’ button, processing starts and the program is executed on the input file. The Status Bar shows the current progress of the execution. The output of the program is a set of values corresponding to the logarithmic ratios of the red and green intensities, as explained above, calculated for the selected segmentation method. These values are exported to a worksheet in the .xls format, which is compatible with popular spreadsheet software such as Microsoft Excel.

V. RESULTS AND DISCUSSION

The microarray image was processed and gridded by calculating a standard row width and column width for the entire image from the intensity projection profile of the optimum subimage of the original image. This gridded image was then segmented using the commonly used methods of fixed circle segmentation, adaptive circle segmentation and adaptive shape (region growing) segmentation. The region growing was performed with the maximum intensity point in each grid selected as the seed. The threshold was set to twice the standard deviation of the intensity in that grid. In a separate program, the ELB-Q method of local background estimation, segmentation and quantification was implemented.

A Graphical User Interface was created in MATLAB® version 7.1, which allows the user to select the desired input microarray image. After loading the image, the user selects the segmentation method to be used on it through the GUI. On clicking the ‘Start’ button in the GUI, processing is carried out on the loaded image with the segmentation method selected by the user. The logarithmic ratios of the red and green intensities for each spot in the image, calculated for the selected segmentation method, are exported to a worksheet in the .xls format. The GUI in various stages of interaction and a worksheet showing the values obtained with each segmentation method side by side are shown in Figures 8 and 9.

On comparing the values obtained with those available in the Stanford Microarray Database [6], it can be seen that the seeded region growing method of segmentation generates values closer to those in the database. For example, for the first spot, the value of log2(R/G) from the database [6] is –0.481, whereas the corresponding values calculated in our project are –0.418 for seeded region growing, –0.378 for fixed circle and –0.402 for adaptive circle. Thus it can be inferred that, of the three segmentation methods compared here, the seeded region growing method exhibits the best performance in the segmentation of microarray images.
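The comparison for the first spot can be made explicit with a small Python calculation of the absolute deviation of each method from the database value. Only the four values quoted above are used; the variable names are illustrative, and a single spot is of course only an example and not a statistical evaluation.

```python
database_value = -0.481          # log2(R/G) for the first spot, from [6]
measured = {
    "seeded region growing": -0.418,
    "fixed circle": -0.378,
    "adaptive circle": -0.402,
}
# Absolute deviation of each method from the database value, rounded to 3 places.
deviation = {name: round(abs(value - database_value), 3)
             for name, value in measured.items()}
closest = min(deviation, key=deviation.get)
print(deviation)
print(closest)
```

For this spot the deviations are 0.063 for seeded region growing, 0.103 for fixed circle and 0.079 for adaptive circle, so seeded region growing is the closest of the three.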

Figure 8: (a) Initial screen of the Graphical User Interface (b) GUI after loading the microarray image (c) GUI during processing of the image

Seed Region Growing    Fixed Circle     Adaptive Circle
-0.418110718           -0.377601693     -0.401683464
 0.080767563            0.136240139      0.082155374
-0.623502431           -0.499117701     -0.561527418
-0.215408451           -0.179447776     -0.27408562
-0.124840302           -0.016678741     -0.020015069
-0.43762187            -0.365106788     -0.363794028
-0.023852429           -0.019255338     -0.011557659
-0.282173776           -0.431021139     -0.277650583
-0.256021397           -0.128039191     -0.090403494
 0.334264876            0.322903658      0.342176784
-0.040980547           -0.038081744     -0.038419707
-0.188732478           -0.122593842     -0.16899644
-0.555152799           -0.506815025     -0.565695523
-0.600108371           -0.53738916      -0.55622101
 0.479970226            0.521410869      0.472345699
 0.394619541            0.460239323      0.394902792
-0.018001354           -0.017943137     -0.065177172
 0.190293768            0.170997557      0.157746006
-0.005255725            0.089467977     -0.039266241
-0.345496566           -0.313274783     -1.184424571
-0.184178985           -0.074231394      0.021072224
-0.745816512           -0.314314911     -1.907011571
 0.121990524            0.018378529     -0.777172652
-0.368644594           -0.179323699     -0.726960594
-0.358047521           -0.283973744     -0.257395133
-0.742842161           -0.401448435     -1.011353477
-0.905784658           -0.22521294      -2.457002936
 0.394278939           -0.430634354     -0.053142768
-0.347923303           -0.060024771     -0.163249374
-0.033245934           -0.02777838      -0.074246042
 0.29071952             0.343045044      0.300701099
 0.070135801            0.023690106     -0.032514396
 0.683526335            0.360191111      0.468926961
 0.377983067            0.374053401      0.403426135
 0.166795995           -0.199425735     -0.007177332
 0.240051088            0.067476455      0.20398865
 0.543270162            0.596308767      0.589083668
-0.391804422           -0.376611018     -0.418536419
-0.668187642           -0.630182143     -0.684458848

Figure 9: Spreadsheet showing log2(R/G) values for 40 spots in the image segmented using the seed-region growing, fixed circle and adaptive circle methods

VI. CONCLUSION AND FUTURE SCOPE

The applications and development of microarray technology have been growing exponentially in the past few years, since the technology was introduced in 1994. There are numerous applications of this technology, including clinical diagnosis and treatment, drug design and discovery, tumor detection, and environmental health research. One of the key issues in experimental approaches that utilize microarrays is extracting quantitative information from the spots, which represent genes in a given experiment. For this process, the initial stages are quite important and influence the later steps of the analysis. Thus, identifying the spots and separating the background from the foreground is a fundamental problem in DNA microarray data analysis, and any improvement over the present methods is welcome [7].

Because microarray technology is still growing rapidly, there are as yet no established standards for microarray experiments or for how the raw data should be processed. Quite a few problems remain open and deserve investigation. In microarray image segmentation, it is not enough to just group the pixels into foreground and background. In some cases, the noisy pixels also need to be identified. In the process of background correction, an adaptive formula could be used to reveal the true foreground intensity instead of a rigid formula like $I_t = I_f - I_b$, where the true intensity $I_t$ is obtained by subtracting the background intensity $I_b$ from the foreground intensity $I_f$. Neural networks could also be applied to obtain the true foreground intensity [7]. In gene expression data analysis, although hierarchical clustering is a widely used method, it suffers from drawbacks such as difficulty in dealing with noise and the non-uniqueness of its solution. These two problems are currently being investigated [7].

The values obtained with the three segmentation methods can also be used for a comparative study with those generated by existing commercial software using these methods.

VII. REFERENCES

1. “Microarrays: chipping away at the mysteries of science and medicine”, A Science Primer – National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/About/primer/microarrays.html, accessed April 2009
2. David Bowtell and Joseph Sambrook (edited by), “DNA Microarrays – A Molecular Cloning Manual”, Cold Spring Harbor Laboratory Press, 2003
3. Yee Hwa Yang, Michael J Buckley and Terrence P Speed, “Analysis of cDNA Microarray images”, Briefings in Bioinformatics, Vol. 2, No. 4, 341-349, December 2001
4. S. Battiato, G. Di Blasi, G. M. Farinella, G. Gallo, G. C. Guarnera, “Ad-Hoc Segmentation Pipeline for Microarray Image Analysis”, IPLab – Image Processing Laboratory, http://www.dmi.unict.it/~iplab, Dipartimento di Matematica e Informatica, University of Catania, Via Andrea Doria, Catania (Italy)
5. MATLAB 2007b® Image Processing Toolbox, MATLAB® Help
6. Stanford Microarray Database, http://smd.stanford.edu/, accessed February 2009
7. Li Qin, Luis Rueda, Adnan Ali and Alioune Ngom, “Spot Detection and Image Segmentation in DNA Microarray Data”, School of Computer Science, Department of Biological Sciences, University of Windsor, 401 Sunset Ave., Windsor, Canada
8. Rafael C Gonzalez, Richard E Woods, Steven L Eddins, “Digital Image Processing Using MATLAB®”, 1st ed., Pearson Education Inc., 2004
9. Marc Q. Ma, Kai Zhang, Hui-Yun Wang, and Frank Y Shih, “ELB-Q: A New Method for Improving the Robustness in DNA Microarray Image Quantification”, IEEE Transactions on Information Technology in Biomedicine, Vol. 11, No. 5, September 2007
