B. Sc. Engg. Thesis Applications of Graphs in Bioinformatics By Abdullah Al Mueen Student No. 0005040 & Md. Nurul Amin Student No. 0005091

Submitted to Department of Computer Science and Engineering in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering

Department of Computer Science and Engineering Bangladesh University of Engineering and Technology (BUET) Dhaka 1000, Bangladesh November, 2006

CERTIFICATE

This is to certify that the work presented in this thesis entitled “Applications of Graphs in Bioinformatics” is the outcome of the investigation carried out by us under the supervision of Professor Dr. Md. Saidur Rahman in the Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka. It is also declared that neither this thesis nor any part thereof has been submitted or is being currently submitted anywhere else for the award of any degree or diploma.

(Author) Abdullah Al Mueen (Supervisor)

Student No: 0005040

Dr. Md. Saidur Rahman Professor Department of Computer Science and Engineering, BUET, Dhaka.

(Author) Md. Nurul Amin Student No: 0005091

Contents Acknowledgments

vi

Abstract

vii

1 Introduction

1

1.1

History of Bioinformatics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2

1.2

Computations in Bioinformatics . . . . . . . . . . . . . . . . . . . . . . . . . . .

3

1.3

Scope of This Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4

1.4

Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

5

2 Preliminaries

6

2.1

Cell-The Unit of Life . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6

2.2

Genome Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7

2.2.1

Genome Organization in Eukaryotic Cell . . . . . . . . . . . . . . . . . .

7

2.2.2

Genome Organization in Prokaryotic Cell . . . . . . . . . . . . . . . . . .

7

2.3

Genome Expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9

2.4

Generation of Genomic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9

2.5

2.4.1

DNA Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.4.2

Haplotype Reconstruction and Inference . . . . . . . . . . . . . . . . . . 10

Genomic Data Analysis and Applications . . . . . . . . . . . . . . . . . . . . . . 11 2.5.1

Phylogeny . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.5.2

Study of Single Nucleotide Polymorphism

2.5.3

SNPs and Disease Diagnosis . . . . . . . . . . . . . . . . . . . . . . . . . 13 i

. . . . . . . . . . . . . . . . . 13

CONTENTS

ii

2.5.4

SNPs and Personalized Drug Prediction . . . . . . . . . . . . . . . . . . 14

2.6

Graph Terminologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.7

Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3 Haplotyping

18

3.1

Haplotype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.2

Individual Haplotyping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.3

3.4

3.2.1

Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.2.2

Minimum Fragment Removal : MFR . . . . . . . . . . . . . . . . . . . . 22

3.2.3

Minimum SNP Removal : MSR . . . . . . . . . . . . . . . . . . . . . . . 24

3.2.4

Minimum Error Correction : MEC . . . . . . . . . . . . . . . . . . . . . 25

3.2.5

A Heuristic Algorithm for MEC Problem . . . . . . . . . . . . . . . . . . 26

3.2.6

Comparison of Different Individual Haplotyping Problems . . . . . . . . 35

Population Haplotyping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.3.1

Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.3.2

Pure Parsimony Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.3.3

Maximum Resolution Problem – MR . . . . . . . . . . . . . . . . . . . . 39

3.3.4

Perfect Phylogeny Haplotyping Problem – PPH . . . . . . . . . . . . . . 40

Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4 Phylogenetic Tree 4.1

4.2

4.3

44

Majority Rule Consensus Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.1.1

Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.1.2

Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.1.3

A Simulated Example

. . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

Uniform Sampling of Phylogenetic Trees . . . . . . . . . . . . . . . . . . . . . . 51 4.2.1

Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.2.2

Sampling Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

CONTENTS

iii

5 Conclusion

55

References

57

Index

59

List of Figures 2.1

Organization of DNA molecule in a cell . . . . . . . . . . . . . . . . . . . . . . .

8

2.2

Directed and undirected graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.1

Minimum fragment removal using fragment conflict graph . . . . . . . . . . . . . 23

3.2

Minimum SNP removal using SNP conflict graph . . . . . . . . . . . . . . . . . 25

3.3

SNP matrix and its partition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.4

An example calculation of Gain measure . . . . . . . . . . . . . . . . . . . . . . 30

3.5

An example iteration of HMEC . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.6

An example log table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.7

An example of haplotype perfect phylogeny (hpp) . . . . . . . . . . . . . . . . . 42

3.8

Forbidden matrix for hpp

4.1

A sample phylogenetic tree of living organisms . . . . . . . . . . . . . . . . . . . 45

4.2

An example of majority rule consensus tree for l =

4.3

Steps of majority rule tree construction . . . . . . . . . . . . . . . . . . . . . . . 50

4.4

An induced subtree of a phylogenetic tree

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

iv

1 2

. . . . . . . . . . . . . . . 47

. . . . . . . . . . . . . . . . . . . . . 52

List of Tables 3.1

A chromosome and two haplotypes assembled from it . . . . . . . . . . . . . . . 19

3.2

An SNP matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.3

Different relations between fragments . . . . . . . . . . . . . . . . . . . . . . . . 21

3.4

The pseudocode for the HMEC algorithm . . . . . . . . . . . . . . . . . . . . . . 32

3.5

Comparison among the SNP problems . . . . . . . . . . . . . . . . . . . . . . . 36

3.6

A chromosome and its genotype . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.7

Example of optimal application of haplotype inference rule . . . . . . . . . . . . 40

v

Acknowledgments All the appraisal belongs to the Almighty ALLAH. We would like to express our deep gratitude to our supervisor Professor Dr. Md. Saidur Rahman. We would like to thank him for introducing us in this beautiful field of bioinformatics and for teaching us how to carry on research work. We again express our most sincere gratitude to him for his affectionate supervision, continuous motivation and valuable advice, without which this thesis would not have come to a completion. We would like to thank every one associated with our research group. We specially thank Mr. Abdullah Adnan, Mr. Abul Hasan Samee and Mr. Md. Shariful Islam Bhuyan for their invaluable comments and encouragement throughout the period of this thesis. Each presentation session was greatly supported by the lab assistants and officials of the department. We would like to thank them for each of the services and facilities they provided us. Finally, we are grateful to our families for giving us the moral support to overcome the tedium of repetitive trials to new findings.

vi

Abstract Bioinformatics principally is the science of biological information processing. It requires wide range of computational models for effective representation and efficient computation of biological data. This thesis describes graphs as computational models in bioinformatics. Here we focus on two major areas of bioinformatics: Haplotyping and Phylogeny. We have compiled four optimization problems of individual haplotyping and their graphical modelings. We propose a new heuristic algorithm to solve minimum error correction problem which constructs a pair of haplotypes of an individual from aligned and overlapping but intermixed and erroneous fragments. We describe the perfect phylogeny haplotyping problem which arranges a set of haplotypes for a population in a phylogenetic tree. We describe an algorithm to build a single consensus tree from a number of different small phylogenetic trees on the same set of taxa using majority rule. We discuss the uniform sampling of a phylogenetic tree that generates an induced subtree uniformly at random and carrying a subset of leaves of the actual tree satisfying some constraints. We annotated three problems of uniform sampling having constraints on leaf depths, edge lengths and pairwise leaf distances of the resulting samples.

vii

Chapter 1 Introduction Analysis of biological experiments and processes is labor intensive and time consuming due to the increasing complexity of the processes and explosive growth of biological data emerging from laboratories worldwide. Hence, the recent challenge is to transform this huge data into knowledge for complete understanding of the biological processes and experiments relating to both health and diseases. The quest for this knowledge has given rise to a new era of science, bioinformatics. Bioinformatics is a highly interdisciplinary field where techniques and concepts from informatics, statistics, mathematics, chemistry, biochemistry, physics, and linguistics merge into a single branch of science. The most prominent and important of these disciplines is information processing. Therefore, “bioinformatics” can be stretched as “Biological information processing”. More precisely, bioinformatics derives knowledge from computer analysis of biological data. The development of powerful computers, and the availability of experimental data, that can readily be treated by computation, launched bioinformatics as an independent field. From the perspective of information processing, bioinformatics can be divided into several areas. The major areas are : • Develop better tools for data generation, capture, and annotation. An example of data generation would be “shotgun sequencing” which identifies the sequence of nucleotides of a 1

Chapter 1. Introduction

2

target DNA molecule. • Develop and improve tools for comprehensive functional and relational analysis. Numerous tools for phylogenetic analysis of biological data have already been developed to determine the evolutionary relationship among species. • Develop and improve tools for representing and comparing sequence similarity and variation. An example would be BLAST (Basic Local Alignment Search Tool) which aligns two or more sequence and computes different similarity and dissimilarity measures. • Improve content and utility of biological databases to store the data in effective ways. An example would be GenBank. GenBank, developed and housed at NCBI (National Center for Biotechnology information), is the U.S depository for all DNA and protein sequences containing more than 61 million sequence records. • Create mechanisms to support effective approaches for producing robust, exportable software that can be widely shared. Each of these areas are evolving day by day and are subjects to extensive research. The next section of this chapter contains a snap of the history of bioinformatics. The section following it contains an overview of the computations required in bioinformatics. Finally the scope of this thesis is detailed.

1.1

History of Bioinformatics

The generation of micro-biological data was started first in 1955 when amino acid sequence of a protein (bovine insulin) was announced for the first time by F. Sanger. Since then, hundreds of proteins have been sequenced and analyzed and the need for creating a data bank for these proteins so that the information can be spread very easily to all. In 1973 one such Protein Data Bank (PDB) named Brookhaven was announced. This PDB was fully described in 1977 (http://www.pdb.bnl.gov). These PDBs were actually in printed form because of the unavailability of desktop computers and communication facilities at those days. The first complete genome sequence for an organism (FX174) was published in 1980. The

1.2. Computations in Bioinformatics

3

gene consists of 5,386 nucleotide base pairs encoding nine proteins. Just after one year, in 1981, IBM introduced its personal computer in the market and concurrently the smith-waterman algorithm for sequence alignment was published. The growth of computer performance and biological data is exponential since then and leads to the creation of some new databases like SWISS-PR0T in the ’80s. Many organizations were also founded in that decade namely Genetics Computer Group(GCG), National Center for Biotechnology Information(NCBI), etc. Whole genome sequences are published for different single cell organisms like Haemophilus influenza of 1.6 Mbp (million base pair), baker’s yeast of 12.1 Mbp, E. coli of 4.7 Mbp in ’90s. The Human Genome Project has successfully published the complete sequence of human genome (3000 Mbp) in 2001. Biochemical methods to sequence, recombine and engineer the DNA sequences have been developed by the 90s. Besides, new modeling and analysis provides new dimensions in the datasets. Thus, the growth of the datasets obviates the necessity of various forms of computation in bioinformatics from the beginning of this decade.

1.2

Computations in Bioinformatics

Computations required in bioinformatics varies largely depending on the objective of the application. Almost all the developed branches of computer science have strong applications in bioinformatics. Some examples of the computational tools that are used in bioinformatics are given below. Graphs are used to represent relationships among species on different physical and microbiological criteria. For example, the evolutionary relationships among the existing species are expressed in a tree structure called phylogenetic tree. Graphs are also used in problems to analyze biological data. Numerical simulations of biological systems are used to model systems that are very difficult to be modeled by analytical methods and deterministic operations. For example, the genetic regulatory networks can be modeled by stochastic process. Similarly, host-parasite system,

Chapter 1. Introduction

4

ecosystem etc are well studied through numerical simulation. Machine learning has many applications in bioinformatics. Generally biological data are huge in quantity but with no established theory. For such data, learning theories provide methods to gain an insight into the underlying theory of the origin of these data. Besides, statistical analysis can be used in population oriented biology like epidemic controlling, drug designing, etc. Data mining and advanced database technology are one of the main part of biological information analysis and preservation. The huge amount of data requires efficient processing to maximize their use in research and educational purpose.

1.3

Scope of This Thesis

There are varieties of applications of graph theory in modeling, representing, analyzing and comparing biological data. Our study covers two major areas concerning the data generation through biological experiments and analyzing represented data. The first area we covered is computational problems of constructing haplotypes. Haplotypes are sequence of nucleotides that are positioned into some fixed sites in the DNA molecule called the SNP sites. Experimental readout can not completely determine haplotypes due to the lack of precision and correctness in constructing haplotypes, that’s why methods are required to build the haplotypes in an optimum way so that the reliability of the haplotypes maximizes. In modeling such optimization problems graphs are extensively used [CIKT05, BVDL03, BILR05]. We studied several models and in this thesis, we have compiled the problems in a well sorted fashion. Besides, we provide an heuristic algorithm to one of the optimization problems of correcting minimum errors in a set of nucleotide sequences. The second area we covered is the analysis of relationships among species through phylogeny. Phylogeny is also a well developed branch of bioinformatics. Construction, assembling, visualizing and sampling of phylogenetic trees are major applications of graph in this branch [ACJ03, KMP03, MM81]. In this thesis, we discuss the majority rule method of assembling

1.4. Summary

5

phylogenetic tree from many small trees. We also discuss a way of arranging haplotypes in phylogenetic order which justifies the relation between the major two areas of our study.

1.4

Summary

In this chapter highlighted bioinformatics from the biological and computational viewpoints. We also placed a short history of bioinformatics to demonstrate the rapid growth of biological data throughout the world. The final section describes the scope of our thesis.

Chapter 2 Preliminaries Basic concepts and terminologies used in this thesis have come from two distinct and different fields of natural and applied science. Hence, it is worth reviewing and initiating these concepts in a brief manner. The first section is about the typical structure of a cell down to molecular level including major biological processes. Generation, assembly, analysis and applications of biological data are discussed in the later sections. The final section contains the definitions and terminologies of graph theory that are required in this thesis.

2.1

Cell-The Unit of Life

Cells are the structural and functional unit of all living organisms. Cells range in size from one millimeter down to one micrometer. Cells are highly organized and complicated assembly of large polymeric molecules with specific compartments called organelles. The most important organelle is the nucleus, which houses most of the cellular DNA-the hereditary material of living organisms. Cells are of two types - prokaryotic cells (e.g. bacteria, amoebae), which lack a defined nucleus and with a simplified internal organization, and eukaryotic cells (body cells of human, animals), which have a more complicated organization including a defined nucleus. In addition to the nucleus, there are several other organelles in typical eukaryotic cells: the mitochondria, where the cell’s energy metabolism is carried out; the rough and smooth 6

2.2. Genome Organization

7

endoplasmic reticula, a membranous network in which glycoproteins and lipids are synthesized; Golgi vesicles, which transfers membrane constituents to appropriate places in the cell; and peroxisome, in which fatty acids and amino acids undergo degradation. Animal cells, but not plant cells, contain lysosomes, which degrade unnecessary materials taken in by the cell. Plant cells have chloroplasts, where photosynthesis takes place.

2.2 2.2.1

Genome Organization Genome Organization in Eukaryotic Cell

DNA (deoxyribonucleic acid) is the master molecule of the cell that stores genetic information and transfers genetically determined characteristics from one generation to the next. The threedimensional structure of DNA, consists of two long helical strands coiled around a common axis forming a double helix. Each strand of DNA is composed of continually varying sequence of just four different types of monomers(Adenine, Cytosine,Guanine and Thymine) called nucleotides. Genes are simply portion of nucleotide sequence residing in the DNA molecule that codes polypeptides or proteins- the main constituents of cells. DNA also contains instructions in form of nucleotide sequence to direct when, which proteins are to be made and in what quantities. The DNA in the nucleus of a eukaryotic cell is organized among 1 to more than 50 long linear, compact structures. They are called chromosomes. All cells of an organism contain chromosomes of same size and number but they vary among different types of organisms. Each chromosome comprises of a single DNA molecule associated various DNA binding proteins. The total DNA content in the chromosomes of an organism is referred as its genome. The organization schematic is shown in Fig. 2.1 [www.mtsinai.on.ca/pdmg/images/chromosome.jpg]

2.2.2

Genome Organization in Prokaryotic Cell

In all prokaryotic cells, most of or all the genetic information resides in a single circular DNA molecule that folds back on itself many times. Usually it is one millimeter in length and stays

Chapter 2. Preliminaries

Figure 2.1: Organization of DNA molecule in a cell

8

2.3. Genome Expression

9

in the central region of the cell. The genomic DNA molecule in prokaryotes is also associated with proteins and often is referred to as a chromosome. But the organization of DNA within a bacterial chromosome differs greatly from that within the chromosomes of eukaryotic cells.

2.3

Genome Expression

DNA encodes all of the RNA(ribonucleic acid) and protein molecules of the cells of an organism. Proteins are the most abundant and functionally versatile of the cellular macromolecules. Proteins are polymers formed from only 20 different monomers, the amino acids. Many proteins within cells are enzymes, which accelerate (catalyze) biochemical reactions. Proteins also direct their own synthesis and that of other macromolecules, maintain internal cell rigidity, and transport small molecules and ions across membranes,. Protein synthesis from genes does not occur directly. RNA acts as an intermediary molecule. Firstly, a portion of DNA sequence of the large DNA molecule in a chromosome is copied into RNA. The process is called transcription. These RNA copies of segments of the DNA works as templates to direct the synthesis of the protein. This process is called translation as genetic information stored in the form of nucleotide sequence is decoded in the form of amino acid sequence in proteins. DNA can undergo replication (synthesis of new DNA) also. Therefore genetic information in cells flows from DNA to RNA to protein. This fundamental principle genome expression is termed as the central dogma of molecular biology. Genome expression is under fine regulation at various levels.

2.4

Generation of Genomic Data

Bioinformatics primarily deals with a huge range of genomic data gathered from different biological experiment. Computational biology studies the data and derives knowledge from it. Thus generation of genomic data is of prime importance. This section briefly discuss about few effective ways for generation of genomic data.

Chapter 2. Preliminaries

2.4.1

10

DNA Sequencing

DNA sequencing is the laboratory technique by which chemical code of the genome is deciphered. DNA sequencing determines the exact order of chemical building blocks, nucleotides (abbreviated A, T, C, and G) that make up the DNA. First, chromosomes are broken into much shorter pieces. After sequencing of the short sequences (in blocks of about 500 bases each, called the read length) they are assembled into long continuous stretches that are analyzed for errors, gene-coding regions, and other characteristics. The DNA sequence acts as a blueprint, determining what species of organism is produced, and within a species, the DNA sequence of each individual is unique. The DNA sequence that makes up a genome may include coding DNA (gene) and also non-coding DNA. Non-coding DNA does not encode a protein but may regulate where and when genes and proteins are active. Advances in molecular biology, genomics, and robotics have resulted in automatic sequencing of DNA. Today, the DNA sequence of an entire microbial genome can be determined in just a few weeks rather than several years as in the past. The resulting DNA sequence maps are being used by 21st century scientists to explore biology phenomena.

2.4.2

Haplotype Reconstruction and Inference

A haplotype, is simply the genetic constitution of an individual chromosome. It also refers to a set of single nucleotide polymorphisms (SNPs) found to be statistically associated on a single chromatid (one-half of a replicated chromosome). Such information is most valuable to investigate the genetics behind common diseases. Haplotypes may be used to compare different populations. Haplotype diversity refers to the uniqueness of a particular haplotype in a given population. Haplogroups are large groups of haplotypes that define genetic populations and are often geographically oriented.

2.5. Genomic Data Analysis and Applications

2.5

11

Genomic Data Analysis and Applications

Bioinformatics creates the tools to store, manage, analyze, compare genomic data.

Vast

amounts of sequence are now stored in organized computer databases. The genome sequence has been interpreted using computational tools combined with biological knowledge. Computer software associated with the database is being used for easier data retrieval and data analysis process. Sequence analysis tools can also translate the DNA sequence into protein sequence and can provide information on the predicted physical properties of the protein such as molecular weight. Sequence comparisons also can be used to categorize groups of related gene or sequences into families. Sequences in the same family suggest that the genes or proteins perform similar functions. Another use for sequence comparisons is studying the relatedness and evolution of different genes or organisms. Here are some research areas where important relationships and predictions are being generated by genomic data analysis with the help of bioinformatics tools: • Gene number, exact locations, and functions • Gene regulation • DNA sequence organization • Chromosomal structure and organization • Non-coding DNA types, amount, distribution, information content, and functions • Coordination of gene expression, protein synthesis, and post-translational events • Predicted vs experimentally determined gene function • Evolutionary conservation among organisms • Correlation of SNPs (single-base DNA variations among individuals) with health and disease • Disease-susceptibility prediction based on gene sequence variation • Genes involved in complex traits and multi-gene diseases Some uses of genomic data are discussed in this section which will be elaborated throughout this thesis.

Chapter 2. Preliminaries

2.5.1

12

Phylogeny

Phylogeny is the description of biological relationships based on classification according to similarity of one or more sets of characters or on a model of evolutionary processes. Phylogenetic relationship based different characters are consistent and support one another. Phylogeny is usually expressed as trees called phylogenetic trees. The goals of phylogeny is to work out the relationships among species, populations, individuals or genes(generally referred as taxa). Relationship is taken as assignment of a scheme of descendants of a common ancestor.The results are usually presented as an evolutionary tree. A phylogenetic tree is a two dimensional graph composed of nodes representing the taxa and branches representing the relationships among the taxa. A tree is a connected graph in which there is exactly one path (consecutive set of edges beginning at one point and ending at the other) between every two points. A particular node may be selected as a root. Abstract trees may be rooted or unrooted. Unrooted tree displays the topology of relationship. But it does not show the pattern of the descent. Rooted trees are directed graphs in which each edge is a one-way street and the ancestor-descendant relationship implies the direction of each edge. If every node of rooted trees has two descendants, they are called Binary trees. Numbers are often assigned to the edges of a graph to signify a distance between the nodes connected by the edges. Thus the sizes of the edges can be drawn proportional to the assigned lengths. The length of a path through the graph is the sum of the edge lengths. In phylogenetic trees edge length signify either some measure of the dissimilarity between two species or the period since their separation. There are two approaches for derivation of phylogenetic trees • Phenetic approach proceeds by measuring a set of distances between species to generate a tree by hierarchical clustering procedure. • Cladistic approach considers possible pathways of evolution, inferring the characteristics of the ancestor at each node. There are two types of data used for building phylogenetic trees: • Distance-based: A matrix of distances between the species is used as input(e.g., the

2.5. Genomic Data Analysis and Applications

13

alignment score between them or the fraction of residues they agree on). • Character-based: Each character (e.g., a base in a specific position in the DNA) is examined separately.

2.5.2

Study of Single Nucleotide Polymorphism

Single nucleotide polymorphisms or SNPs (pronounced ”snips”) are DNA sequence variations of single nucleotide (A,T,C,or G) in the genome sequence. For example, a change in the DNA sequence AATTAC to ATTTAC is a SNP. When a variation occurs in at least 1% of the population, it is considered as an SNP. SNPs are found in both coding (gene) and non-coding regions of the genome. Most SNPs occur outside of “coding sequences”. SNPs found within a coding sequence are of particular interest to researchers because they are more likely to alter the biological function of a protein. SNPs, make up about 90% of all human genome sequence variation. Two of every three SNPs are due to the replacement of cytosine (C) with thymine (T). These variations in DNA sequence are believed to be associated with humans respond to disease; environmental insults such as bacteria, viruses, chemicals; and therapies. Thus SNPs has become of great value for biomedical research and for developing pharmaceutical products or medical diagnostics. SNPs are easier to follow in population studies because they are evolutionarily stable - not changing much from generation to generation.

2.5.3

SNPs and Disease Diagnosis

Each person has a unique SNP pattern. In most cases, SNPs do not cause disease, they only serve as biological markers for pinpointing a disease on the human genome map, because they are usually located near a gene found to be associated with a certain disease. Thus SNPs help to determine a person’s genetic predisposition to a particular disease based on genes and hereditary factors. Researchers may also identify relevant genes associated with a disease by studying stretches of DNA that have been found to harbor a SNP associated with a disease

Chapter 2. Preliminaries

14

trait. Thus SNPs will also allow researchers a better evaluation about the impact of non-genetic factors like behavior, diet, lifestyle, and physical activities diseases. To create a genetic test to screen a disease in which the disease-causing gene has already been identified, scientists may compare SNP patterns of a group of individuals affected by the disease with that of individuals unaffected by the disease. This association study can indicate which pattern is most likely associated with the disease-causing gene. Then SNP profiles that are characteristic of a variety of diseases can be established and a physician would easily screen individuals for susceptibility to a disease just by analyzing their DNA samples for specific SNP patterns. SNP maps may help them identify the multiple genes associated with such complex diseases as cancer, diabetes and mental disorder, etc.

2.5.4

SNPs and Personalized Drug Prediction

A treatment proved effective in one patient for a particular disease may be found ineffective in others. SNPs are also believed to be useful in helping to determine and understand why individuals differ in their abilities to absorb or clear certain drugs, as well as to determine why an individual may experience an adverse side effect to a particular drug while other may not have any. The most appropriate drug for an individual could be determined by analyzing a patient’s SNP pattern. When a medicine works well for a group of people, researchers tries to find out the DNA markers (SNPs) that are alike for these people. Scientists could identify which medicines are best for any one person by using these markers. For example, when a person is in need of medicine, doctors will compare the person’s SNP pattern with several SNP patterns of different groups of individuals having same disease and will prescribe individualized therapies specific to a patient’s needs. The prediction of the appropriate treatment to the right person is referred to as “personalized drug prediction”. Personalized medicine would allow pharmaceutical companies to bring many more drugs to market and allow doctors to suggest a drug that will be most effective for that individual patient. Therefore, the recent discovery of SNPs are on the way of a revolution in the process of disease detection and the practice of preventative and curative medicine

15

2.6. Graph Terminologies

2.6

Graph Terminologies

The focus of this thesis is to study the use of graphs in bioinformatics. The necessary graph terminologies are very briefly described here. Interested readers are referred to detailed texts of the literature [Wes03,Die05]. A graph is a tuple (V, E), which consists of a finite set of vertices V and a finite set of edges E. Each edge is an ordered or unordered pair of distinct vertices. The set of vertices are denoted by V (G) and the set of edges are denoted by E(G). A graph may look like the Fig. 2.2(i) where each vertex in V(G) = {v1 , v2 , v3 , v4 , v5 } is drawn by a small black circle and each edge in E(G) = {e1 , e2 , e3 , e4 , e5 } is drawn by a line connecting a pair of vertices. Depending upon the directional property of the edges a graph can be one of two classes: Directed Graph and Undirected Graph. A graph where the edges are simple undirected line between two vertices is an undirected graph. A graph where the edges are directed line from a source vertex to a destination vertex is called directed graph.

v2

v2 v5

e4 v1

e5

e3

e1

e2

v3

e4 v1

v4

v5 e5

e3

e1

e2

v3

(i)

v4

(ii) Figure 2.2: Directed and undirected graphs

In this thesis an edge e of an undirected graph is represented by (u, v) where u and v are the vertices which are connected by e and an edge of a directed graph is represented as < u, v > where u is the source vertex and v is the destination vertex. From here on, we will discuss every terminology for undirected graph unless otherwise specified. Let G be a graph. A graph H is called a subgraph of G if every edge in H is an edge in

Chapter 2. Preliminaries

16

G and every vertex of H is a vertex of G. It is to be noted that the definition is a one way implication. A v0 − vl walk, v0 ; e1 ; v1 ; . . . ; vl−1 ; el ; vl , in G is an alternating sequence of vertices and edges of G, beginning and ending with a vertex, in which from each vertex one can move to the following vertex through their intermediate edge. If the vertices v0 , v1 , . . . , vl are distinct (except possibly v0 and vl ), then the walk is called a path and usually denoted either by the sequence of vertices v0 , v1 , . . . , vl or by the sequence of edges e1 , e2 , . . . , el . A path with l edges has length l. A path or walk is closed if v0 = vl . A closed path containing at least one edge is called a cycle. A graph G is connected if for every pair u, v of distinct vertices there exists a path between them on G. Number of incident edges in a vertex v is called the degree of v in an undirected graph. It is denoted by d(v). For directed graph there are two types of degree for each vertex. The in-degree denotes the number of incoming edges and the out-degree the number of outgoing edges from a vertex. A tree is a connected graph without any cycle. Since there is no cycle there exists exactly one path between any two vertices. The vertices of a tree are usually called nodes. The vertex v of a tree with d(v) = 1 is called a leaf and with d(v) > 1 is called an internal node. Thus a leaf has exactly one edge connected to it. A rooted tree is a tree that has a distinguished vertex called root. Since the tree is a connected graph, there is a path from root to every other vertex in the tree. The vertex p that immediately precedes vertex u in the path from root to u is called the parent of u and u is called the child of p. An internal node may have more than one child but exactly one parent. A leaf has no child. If u1 , u2 , u3 , . . . , ul is a sequence of nodes in a tree such that u1 is the parent of u2 which is the parent of u3 and so on, then u1 is called the ancestor of ul and ul is called the descendant of u1 . The root is the ancestor of all nodes and it is descendant of none. The height of a tree is the length of the longest path from root to a leaf.

2.7

Summary

This chapter is written as a suitable starting point for the readers who lack necessary biological background to read the rest of the thesis. The very basics of bioinformatics is actually deeply

2.7. Summary

17

rooted in the concepts and discoveries of biologists. That’s why, as a student of computer science, to study of bioinformatics is a difficult job and requires much time and effort to grasp the ideas inherent in this beautiful research area.

Chapter 3

Haplotyping

One of the biggest achievements in the quest for mystery of life is the complete sequencing of human genome. The Human Genome Project successfully discovered the fact that humans are almost 99% identical at the DNA level. Therefore what makes us different is very small region of the whole genome. The smallest possible variation in DNA sequence is the Single Nucleotide Polymorphism abbreviated as SNP and pronounced as “snip”. It is also the prominent kind of variation in humans. Let us describe SNP more elaborately because haplotyping is the process of locating and determining SNPs.

An SNP is a specific nucleotide, placed inside a DNA molecule which is otherwise identical for all of us, whose value varies, in a statistically significant way, within a population. For a chromosome all the sites where SNP occurs have been well identified in the human genome project. The base at an SNP site is called allele. The possible variations in a particular SNP site over the entire population are most of the times between two alleles. Such SNP is called bi-allelic. It is still to be explained the reason why bi-allelic SNPs are prominent than multiallelic SNPs which has three or more different bases. Snips are the most extensive research topic in recent years for the computational biologists. 18

19

3.1. Haplotype

3.1

Haplotype

Several computational problems related to SNPs have been devised in few years. One of the popular problems is Haplotyping which deals with generating haplotypes from genomic sequence data. Diploid organisms are organized in pairs of chromosomes. A diploid organism has one copy of a specific chromosome inherited from father and another copy of that same chromosome inherited from mother. For each SNP site one can be homozygous (same allele on both chromosomes) or heterozygous (different alleles). The values of a set of SNPs on a particular chromosome copy define a haplotype. An example is shown in Table. 3.1. Two copies of the same chromosome of an individual are aligned and the SNP sites are shown by capital letters. The individual is heterozygous at SNPs 1 and 3 and homozygous at SNPs 2 and 4. The haplotypes are TGGT and AGCT.

paternal copy:

atcatcTcaagtGgaattGctcTctaa

maternal copy:

atcatcAcaagtGgaattCctcTctaa

Haplotype 1:

T

G

G

T

Haplotype 2:

A

G

C

T

Table 3.1: A chromosome and two haplotypes assembled from it Two major category of problems related to haplotyping are • Individual Haplotyping - The Haplotype Assembly Problem • Population Haplotyping - The Haplotype Inference Problem The following sections elaborate the idea of haplotyping problems in details.

3.2

Individual Haplotyping

Haplotyping an individual consists of determining a pair of haplotypes, one for each copy of a given chromosome. This pair of haplotypes completely define the SNP fingerprints of an

Chapter 3. Haplotyping

20

individual for a specific chromosome. Given a sequence of bases of a specific chromosome, we just need to check all the SNP sites to generate the haplotypes. But, the situation is a bit complicated for generating haplotypes from sequencing data. Sequencing data for a genome does not contain the total sequence of bases for a specific chromosome, rather it provides sequences of a set of fragments of the whole genome. Hence the actual problem of individual haplotyping is to find two haplotypes from the set of overlapping fragments of the both copies of a chromosome where fragments may have error and to which copy of the chromosome a fragment belongs is not determined. The very general formulation of the problem is given below [BILR05]. “Given a set of fragments obtained by DNA sequencing from the two copies of a chromosome, find the smallest amount of data to remove so that there exist two haplotypes compatible with all the data remaining.” Before describing some of the optimization problems of individual haplotyping, the mathematical terminology is presented below.

3.2.1

Terminology

Let, S be the set of n bi-allelic SNP sites over which the haplotypes will be constructed. Let, F be the set of m fragments of the two chromosomes where each fragment contains information for nonzero number of SNPs in S. Without loss of generality, let the two alleles for each SNP are 0 and 1 which are two different elements of {A, T, G, C}. Thus each fragment f ∈ F is a string of symbols {0, 1, −} of length n where ‘−’ denotes an undetermined or unknown SNP named as hole. All the fragments can be arranged in an m × n matrix M [f, s] named as SNP matrix . An example of an SNP matrix is given in Table. 3.2. The Consecutive sequence of ‘−’ that lies between two non-hole symbols is called a gap. A gapless SNP matrix is the one that has no gap in any of the fragments. In the example, the first, second and third rows have no gaps while the fourth and sixth rows both have one gap. Two fragments f and g are said to have conf lict, if there exist an SNP position s where two fragments disagree, i.e. M [f, s] = 0 and M [g, s] = 1 or vice versa. If two fragments do not have

21

3.2. Individual Haplotyping

- - - -1101- - - - - - - - - - - - - - - -0001110101- - - - 11010010011- - - - - - - - - - -10100- - -010- - - - - - - - - - - - - -10110101011 010111- - - - - - - - -01011 Table 3.2: An SNP matrix. conflict on any SNP, they are said to agree. Some pairs of fragments are given in Table. 3.3 to illustrate the idea of conf lict and agree.

relation pair of fragments conflict

0111010101100 - - - -010001- --

agree

0111010101100 - - - -010101- --

agree

011101- - - - - -- - - - - - -110001

conflict

010001- - - - - -- - - - -00001- --

Table 3.3: Different relations between fragments

An SNP matrix is error-free if it can be partitioned into two classes of non-conflicting fragments. For a specific SNP matrix there may be more than one such partitions. From each non-conflicting partition of fragments, a haplotype can be constructed by just taking the common allele of the non-conflicting fragments for a particular SNP site. Let, partition of rows in M are Ml and Mr where rows in each set are pairwise non-conflicting. From these two

22

Chapter 3. Haplotyping

partitions, the construction of haplotypes Hl and Hr by combining the rows can be described as

Hij =

            

1

if Mij = 1 for at least one row

0

if Mij = 0 for at least one row

(3.1)

− if Mij = − for all fragments

where i ∈ {l, r} and j = 1, 2, . . . , n. Although actual haplotypes can not have any hole, haplotype assembly problems can introduce holes in the haplotypes if there is no allelic information in the fragments of the partition. Thus the general problem can now be redefined as to finding an error-free matrix from a given SNP matrix through optimal changes. The following sections deal with different approaches of finding error-free matrix and finally the comparison of the complexities of these approaches is presented.

3.2.2

Minimum Fragment Removal : MFR

The fragment data generated from DNA sequencing contains not only overlapping fragments but also multiple fragments from the same region of the DNA. That’s why there are many redundant fragments and optimal removing of redundant fragments to make the matrix errorfree is the most reviewed SNP problem. Problem : Minimum Fragment Removal – MFR Input : An SNP matrix M . Output : Find the smallest set of fragments (rows) whose removal makes M error-free. MFR has been modeled using a graph so that its tractability can be found. Let GF (F, EF ) is a fragment conflict graph , where there is an edge between two fragments if they conflict on any of their common SNPs. Therefore, to find out two classes of non-conflicting fragments is equivalent to find out a Bipartite graph from G by removing minimum number of vertices, as illustrated in Fig. 3.1. Here Fig. 3.1(a) is an SNP matrix, Fig. 3.1(b) is its fragment conflict graph and Fig. 3.1(c) is the bipartite graph resulted after removing the fragment E.

23

3.2. Individual Haplotyping A A 101011 B 100010 C 1010 D 1 0001 E 00011 (a)

A B

C

D

B

C

D

E (b)

(c)

Figure 3.1: Minimum fragment removal using fragment conflict graph Depending upon the presence of gaps, solution to MFR varies in its time complexity. For gapless SNP matrix, MFR is polynomial time solvable [LBIS01]. Dynamic programming method is a good approach to implement MFR for gapless SNP matrix [BILR05]. MFR problem with gapped SNP matrix is NP-hard [BILR05].

Variation MFR deletes a fragment if necessary. Whenever there are more than one fragments, any of whose deletion can make the matrix error-free, MFR does not chose wisely. Therefore, a variation of MFR is readily available and described below [CIKT05]. The length of a haplotype Hi , i ∈ {l, r} is the number of non-hole SNP sites. Let, the sum of the lengths of the two haplotypes reconstructed from an error-free matrix M is L(M ). In this problem, the matrix M is made error-free by removing fragments in such a way that the haplotypes constructed from the matrix achieve maximum length. Thus maximizing L(M ) as well as removing fragments is the goal of the new version. Problem : Longest Haplotype Reconstruction – LHR Input : An SNP matrix M . Output : Find the smallest set of fragments (rows) whose removal makes M error-free with maximum L(M ) As for MFR, LHR is also polynomial time solvable for gapless cases and NP-Hard for gapped versions. The complexity status of LHR for “at most k-holes” is still open. Here

Chapter 3. Haplotyping

24

also, dynamic programing algorithm could be an approach to implement LHR for gapless SNP matrix [CIKT05].

3.2.3

Minimum SNP Removal : MSR

If a fixed sequencing technique is applied to all the fragments then it is likely that errors occur in the same SNPs for all the fragments. That’s why, MSR makes the SNP matrix error-free by removing columns from the matrix. This problem is similar to MFR from algorithmic viewpoint. Problem : Minimum SNP Removal – MSR Input : An SNP matrix M . Output : Find the smallest set of SNPs (columns) whose removal makes M error-free. Similar to the conflict between fragments, conflict between two SNPs is defined. Two SNPs s and t are said to be conflicting if they have both 0 and 1 in their corresponding columns and there exists two fragments f and g, where M [f, s], M [f, t], M [g, s] and M [g, t] are non-hole symbols in which exactly one is different from the other three. Let GS (S, ES ) is a SNP conflict graph , where there is an edge between two SNPs if they conflict. For an error free SNP matrix the GS is an independent set i.e. ES = φ [LBIS01]. Hence, to make an SNP matrix error free by minimum SNP removal is equivalent to finding the maximum independent set of GS and removing all vertices (SNPs) s ∈ S that are not in the maximum independent set of GS . Here Fig. 3.2(a) is an SNP matrix, Fig. 3.2(b) is its SNP conflict graph and Fig. 3.2(c) is the maximum independent set resulted after removing the SNPs 3, 5 and 6 from S. Depending upon the presence of gaps, MSR also varies in its time complexity. For a gapless SNP matrix the GS is a perfect graph with no chordless cycle of length > 4 and for perfect graphs the maximum independent set problem is polynomial time solvable. Hence, for gapless SNP matrix, MSR is polynomial time solvable. A class of gapped SNP matrix can also be solved in polynomial time. This class of matrices is called C1P matrices [LBIS01].

25

3.2. Individual Haplotyping

101011 00010− 101001 100010 000100 (a)

6

1

123456

1 5

2

2 3

4

4 (b)

(c)

Figure 3.2: Minimum SNP removal using SNP conflict graph

3.2.4

Minimum Error Correction : MEC

With the advent of sophisticated sequencing methods, errors in SNP matrices are getting smaller. As a result, correcting these errors rather than removing a whole row or column is more preferable approach. Erroneous SNP values can be corrected by just flipping it to the other allele. Obviously it would have not been possible if it were a multi-allele (more than two possible symbols in an SNP) SNP matrix. A good example of saving information by correcting the errors would be for the SNP matrix in Fig. 3.1. In that matrix fragment E has conflict with fragment B just for the last SNP that is common to them. Thus flipping that SNP of fragment E to 0 would result into an error-free matrix having the partition {A,C} and {B,D,E}. MEC retains the information of the other four alleles of fragment E, while MFR would delete all the five alleles from the matrix. Problem : Minimum Error Correction – MEC Input : An SNP matrix M . Output : The smallest set of SNP alleles whose flipping makes M error-free. MEC problem for gapless SNP matrix is NP-hard [CIKT05]. Hence, MEC is more difficult than the gapless MFR or MSR which have polynomial time algorithm. For 1-gap case MEC is proved to be APX-Hard. It has been showed that haplotype assembly problem has reasonable input size for practical exact algorithms [Huf05]. An exact algorithm based on branch and bound technique is available. It searches all possible pairs of haplotypes to find the solution [WWLZ05]. Now, we present a heuristic algorithm to find the solution of a minimum error

26

Chapter 3. Haplotyping correction problem.

3.2.5

A Heuristic Algorithm for MEC Problem

Terminologies and Definitions Before describing the algorithm, let us redefine the terminologies. Let, M = {Mij } is an SNP matrix of dimension m × k where Mij ∈ {0, 1, −}. There are m fragments; each of which has either a fixed allele (i.e. {0, 1}) or a gap (i.e. ‘−’) in each of its k SNP sites. Conceptually, the matrix M = {M1 , M2 , . . . , Mm } is a set of m fragments where a fragment Mi = {Mi1 , Mi2 , . . . , Mik } is an array of k alleles or holes (see Fig. 3.3(a)). A fragment Mi is called to cover the jth SNP if Mij ∈ {0, 1} and called to skip the jth SNP if Mij =‘−’. Let, Ms and Mt are two fragments. The distance between two fragments, D(Ms , Mt ) is defined as the number of SNPs that are covered by both of the fragments and have different alleles. For example in Fig. 3.3(a) the D(M3 , M2 ) = 4.

D(Ms , Mt ) =

k X

d(Msj , Mtj )

(3.2)

j=1

where d(x, y) is defined as

d(x, y) =

    

1 if x 6= − and y 6= − and x 6= y

(3.3)

0 Otherwise

Two fragments Ms and Mt are said to be conf licting if D(Ms , Mt ) > 0. Now, P (C1 , C2 ) is a partition of M , where C1 and C2 are two collections (sets) of fragments taken from M so that C1

S

C2 = M and C1

T

C2 = φ. In Fig. 3.3(b) a partition is shown. Therefore, M will be

an error-free matrix if and only if there exists a partition P (C1 , C2 ) of M such that for any two fragments x, y ∈ Ci , i ∈ {1, 2}, x and y are non-conflicting, in other words D(x, y) = 0 for all x and y belonging to the same collection Ci . Such a partition is called a error-free partition. The partition in the figure is not error free because D(M1 , M2 ) > 0 and D(M5 , M6 ) > 0 insert error

27

3.2. Individual Haplotyping 1

01000

M

1000001

3

11001

4 5

P

00111000

2

010010 1000001

6

1 2 3

4 5 6

C1

C2

H1 = 0100000000 H2 = 0100000001

(b)

(a)

(c)

Figure 3.3: SNP matrix and its partition in C1 and C2 respectively. Construction of the two haplotypes from an error free partition is similar to that of LHR. A haplotype Hi , i ∈ {1, 2} is constructed by taking common allele from the component fragments for both the parts Ci , i ∈ {1, 2}. The mathematical expression that describes such construction is

Hij =

            

1

if at least one fragment in Ci has a 1 in jth SNP

0

if at least one fragment in Ci has a 0 in jth SNP

(3.4)

− if all the fragments in Ci skips jth SNP

If the matrix M is not error-free, there will be no error-free partition P as well and there will be conflicts between fragments in any or both of the collections of all possible partitions. Thus, for such M , there is no partition which we can use to construct two haplotypes by just taking the common allele for a particular. So far we described how an SNP matrix becomes erroneous, now the question is to correct an erroneous SNP matrix to make it error-free. If we are given a partition P (C1 , C2 ) of an erroneous matrix M and the two actual haplotypes H1 and H2 , the number of errors E(P ) that must be corrected are just the sum of the distances of all the fragments from their corresponding haplotypes. More mathematical expression is given below.

E(P ) =

2 X X

i=1 f ∈Ci

D(f, Hi )

(3.5)

28

Chapter 3. Haplotyping

Now, the MEC problem reduces to finding a partition P that minimizes the error function E(P ).

Algorithm - HMEC Now, to minimize the E(P ), we need to search all possible partitions of a matrix M . This would certainly require running time exponential to the number of fragments in M . But such a search is not possible because we don’t have the real haplotypes to calculate E(P ). Hence, we have to approximate E(P ) by constructing two haplotypes from the given partition P . For best approximation we should construct haplotypes which are minimum conflicting with the fragments of their corresponding collections. Therefore, for each SNP site, the haplotype Hi should take the allele that is present in majority of the fragments in Ci . In case of ties, 0 is favored because it is the more common than 1. In Fig. 3.3(c) the two haplotypes H1 and H2 are shown for the partition P in Fig. 3.3(b) which are constructed in this method. To define this construction more mathematically, let Nj0 (Ci ) is the number of fragments in a collection Ci that have 0 in jth SNP. Similarly, Nj1 (Ci ) defines the number of 1s. Thus to minimize the number of errors E(P ) for a specific partition P , the haplotype should be constructed following the rule

Hij =

            

1

if Nj1 (Ci ) > Nj0 (Ci )

0

if Nj0 (Ci ) >= Nj1 (Ci ) and Nj0 (Ci ) 6= 0

(3.6)

− if Nj1 (Ci ) = Nj0 (Ci ) = 0

where i ∈ {1, 2} and j = 1, 2, . . . , k. To find the best partition we will use a local search heuristic. The algorithm iteratively searches a better partition with respect to the current one and chooses it to move to. Because the algorithm searches from within a small set of partitions, the chosen partition may not lead to the optimum solution and the algorithm has a chance to fall into the local optimum solution. The algorithm is inspired from the famous Fiduccia and Mattheyses (FM) algorithm for bipartitioning a hypergraph minimizing the cut size.

29

3.2. Individual Haplotyping

This algorithm starts with an arbitrary partition as for example P (M, φ) and iteratively searches a better partition. In each iteration the algorithm performs a sequence of transfer of distinct fragments from their present collection to the other one so that the partition becomes less erroneous. It should be noted that, a fragment’s transfer of collection can both increase or decrease the error function E(P ). Let, the partition before transferring a fragment f is Pp and the partition resulted is Pn . We define the gain of the transfer as Gain(f ) = E(Pp ) − E(Pn ). Let, F =< fi >, i = {1, 2, . . . , m} is an ordering of all the fragments in a partition P in such a way that fragment fi will precede fragment fj if all the fragments preceding fi have been transferred to form an intermediate partition Pi and Gain(fi ) >= Gain(fj ) over Pi . Thus the first intermediate partition P1 is the current partition Pc of the ongoing iteration. We also define the cumulative gain of a fragment ordering F up to its nth fragment as CGain(F, n) =

Pn

j=0

Gain(fj ). Note that, all the gains

used to compute the cumulative gain are calculated over different intermediate partitions. The maximum cumulative gain, M CGain(F ) is defined as M CGain(F ) =

max CGain(F, i)

1≤i≤m

Now, in each iteration the algorithm finds the ordering Fc of current partition Pc and transfers only those fragments of Fc that can achieve the M CGain(Fc ). The fragment that is the last to be transferred is referred as fmax . Thus, the algorithm iterates from one partition to another reducing the error function. Since our algorithm is local search algorithm, it continues whenever M CGain(Fc ) > 0 and stops as soon as M CGain(Fc ) ≤ 0.

Implementation and Complexity There are several issues to discuss about the above described algorithms. We will discuss the data structure for each such issue. To find Fc in each iteration the algorithm repeatedly transfers the fragment that is not transferred previously in this iteration and has maximum gain over all such fragments. To accomplish this we use a locking mechanism. At the beginning of each iteration all the fragments

30

Chapter 3. Haplotyping

are set free. The free fragment with maximum gain is found out and tentatively transferred to the other collection. After the transfer the fragment is locked at the new collection. This tentative transfer creates the first intermediate partition P1 . The algorithm then finds the next free fragment with maximum gain in P1 and transfer and lock that fragment to create the P2 . Thus, free fragments are transferred until all the fragments are locked and the order of the transfer (Fc ) is stored in the log table along with the cumulative gains (CGain). M CGain is the maximum CGain and fmax is the fragment corresponding to M CGain in the log table. Although after finishing all such tentative transfers Pc has been changed to an undefined partition, the algorithm checks the log to find the M CGain(Fc ) and fmax and rollback the transfer of all the fragments that were transferred after fmax . When the rollback completes the Pc becomes ready for the next iteration. 1 2 3 C1

Pp

Pn

4 5 6

1 3

4 2 5 6

C2

C1

C2

H1 = 0100000000 H2 = 0100000001 E(Pp ) = 8 Gain(2) = E(Pp )

H1 = 01000001−− H2 = 0100101001 E(Pn ) = 6 E(Pn ) = 2

Figure 3.4: An example calculation of Gain measure While tentatively transferring a free fragment, the algorithm needs to find the fragment with maximum gain among the free fragments (which are not yet transferred). This requires calculating gains for each of them. To calculate the Gain(f ) = E(Pp )−E(Pn ) for a fragment we need to calculate two error values of two different partitions; the present intermediate partition and the next partition which will be resulted if f is transferred. Each of these error functions requires calculation of two new haplotypes from their corresponding collections (see Fig. 3.4). Although E(Pp ) and the haplotypes of Pp can be found from the previous transfer, calculation of E(Pn ) requires construction of haplotypes of Pn . Since, the difference between Pp and Pn is only one transfer, we can introduce differential calculation of haplotypes Hnj , j ∈ {1, 2} of

3.2. Individual Haplotyping

31

next partition from the haplotypes of Hpi , i ∈ {1, 2} of present partition. For this purpose, the algorithm stores Nj1 (Cpi ) and Nj0 (Cpi ) values of the present partition. After a transfer these values will either be incremented or decremented by 1 or remain the same. Hence, it is now possible to construct Hnj , j ∈ {1, 2} in O(k) time. To compute E(Pn ) from the haplotypes requires O(mk) time. Therefore, running time to compute the E(Pn ) as well as to compute Gain(f ) is O(mk + k). For each intermediate partition Pi , i = 1, . . . , n we need to compute Gain measures for m − i unlocked fragments to find the maximum one. The transfer of this fragments require updating of Nj1 (Ci ) and Nj0 (Ci ), i ∈ {1, 2} and j = 1, 2, . . . , k. So, it also needs O(k) time to run. Finally, there will be m such transfer in each iteration. Thus each iteration will require O(m(m(mk + k) + k)) ∼ O(m3 k) running time.

Approximation to Improve Performance For large SNP matrix O(m3 k) running time is critical to the performance of the algorithm. We can use an approximation in the calculation of the Gain(f ) by using only the fragment f and not using the m − 1 other fragments. The approximate gain should be

AppxGain(f ) = D(Hip , f ) − D(Hjn , f )

(3.7)

where Hpi is the haplotype of f ’s present collection Ci of partition Pp and Hnj is the haplotype of f ’s next collection Cj of partition Pn . This function ignores the effect of fragments other than f on Gain(f ) but reduces the run time of calculating gain to O(k). The total run time of each iteration will be O(m2 k)

A Simulated Example In Fig. 3.5 we present an example iteration of HMEC. We consider that the current partition Pc = P1 is the partition given in Fig. 3.3(b) for the matrix M . All the intermediate partitions Pi , i ∈ {1, . . . , 7} are shown sequentially and the gains of each fragment over the intermediate

32

Chapter 3. Haplotyping

Algorithm HMEC(M) 1:

Pc = P (M, φ)

2:

FREE LOCKS()

3:

CLEAR LOG()

4:

repeat always

5:

while there is an unlocked fragment in Pc do begin

6:

find a free fragment f so that Gain(f ) is maximum

7:

transfer f to the other collection

8:

update the haplotypes after the transfer

9:

LOCK(f )

10:

LOG RECORD(f ,Gain(f )) end

11:

FREE LOCKS()

12:

check the log and find M CGain(Fc ) and fmax

13:

if M CGain(Fc ) > 0 begin

14:

set new Pc by rolling back the transfers that occurred after the transfer of fmax

15:

calculate haplotypes of Pc

16:

CLEAR LOG()

17:

continue the loop end

18:

else begin

19:

terminate the algorithm and output current haplotypes end end repeat Table 3.4: The pseudocode for the HMEC algorithm

33

3.2. Individual Haplotyping

1 2 3

P1

P2 1 3

1 3 6

P3

4 5 6

Gain(1)=0 Gain(2)=2 Gain(3)=2 Gain(4)=1 Gain(5)=1 Gain(6)=1

2 4 5 6

Gain(1)=−1 Gain(3)=−2 Gain(4)=−2 Gain(5)=1 Gain(6)=2

2 4 5

Gain(1)=−1 Gain(3)=−2 Gain(4)=−1 Gain(5)=0

1 3 5 6

P4

P7

4

2

6

3

P6

3 4 5 6

Gain(3)=−2

P5 4

5 6

4

locked fragment

1 2

1 3

2

1

5

2

Gain(1)=−1 Gain(3)=−3

Gain(1)=−2 Gain(3)=−3 Gain(4)=−1

Figure 3.5: An example iteration of HMEC

34

Chapter 3. Haplotyping

partitions are shown on the right of each partition. The free fragment with maximum gain is marked in each intermediate partition. For example, the maximum gaining fragment on P2 is fragment 6 with gain 2. After each transfer the transferred fragment is locked by a circle. Here, the ordering Fc of the fragments is < 2, 6, 5, 1, 4, 3 > which is also the order of locking of the fragments. In the log this ordering will be stored along with the CGains (see Fig. 3.6). All the tentative transfers after fmax have to be rolled back so that the P2 becomes the next Pc . Log Table Fc

CGain

2 6 5 4 1 3

2 4 4 3 2 0

MCGain

Figure 3.6: An example log table

Performance Evaluation We have implemented the algorithm and tested it thoroughly. We first chose two independent set of haplotypes. One of them was based on real sequences of DCP 1 genes that contains lot of similarity among the haplotypes [RTCN99]. There were 7 pairs of haplotypes of 53 SNPs in this set. The other one was generated sequences reflecting true randomness. There were 6 pairs of haplotypes of 90 SNPs in this set. Upon each of these sets of haplotypes, we ran a simulated sequencing operation that generates a set of fragments (i.e. the SNP matrix) from every pair of haplotypes within the set. The simulation program was controlled in several ways. We varied the number of fragment in the matrix as well as the length (number of non-hole SNP sites) of each fragment. We also varied the maximum amount of errors that was introduced in the matrix. Over each of the SNP matrices we ran our algorithm and it performed very well to reconstruct the haplotypes. The worst scenario, that was tested, was a matrix from the generated haplotypes with only 40 fragments each having 60% holes in them and and 40% errors in the alleles. Our algorithm constructed on the average 85% of the haplotypes. This

35

3.2. Individual Haplotyping

percentage is quite a good one considering the true randomness of the SNP matrix and obviously better performing than a genetic algorithm for MEC [WWLZ05]. Besides the correctness, it is lot more efficient in execution time. It took 2.46 sec to process all the 6 matrices of 40X90 dimension. The most successful feature of this search algorithm is its Independence of gap. The position of holes (i.e. gaps) in the SNP matrix will not create any difference to the HMEC algorithm. That’s why the number of gaps was not a control parameter of the sequencing simulation. An algorithmic comparison with the FM algorithm for bipartitioning with minimum cut size reveals that our FM algorithm runs in O(m) time in updating the gain values in one pass where our algorithm runs in O(mk) if no approximation is applied.

Variation A new model of MEC has been proposed to incorporate the genotypic information of the organism, which is more readily available, to increase the correctness of the solution [WWLZ05]. From algorithmic point of view there is another variation of MEC [CITK05]. The problem is open at its complexity status. Problem : Binary-Witness-MEC Input : An SNP matrix M that does not contain any holes. Output : For an input matrix M of size n × m, two haplotypes H1 ,H2 ∈ {0, 1}m minimizing. D(H1 , H2 ) =

X

min(d(r, H1 ), d(r, H2 ))

rows r∈M

3.2.6

Comparison of Different Individual Haplotyping Problems

We described four haplotype assembly problems in this section. Those are MFR, LHR, MSR and MEC. All these problems are very extensively studied over the last few years and several algorithms are established for each of the settings. Several papers are published regarding the complexity status of the problems. Several new models derived from these four problems are also coming by the last couple of years. The problems are themselves varied in nature because of

36

Chapter 3. Haplotyping

the presence of gap in the input SNP matrix. It should be noted here that a hole in a fragment is not necessarily a gap. Gap is a sequence of one or more holes separated by nonempty sequence of alleles. The HMEC that we proposed is significantly better than a Genetic algorithm. Other than these two we did not find any approximation algorithm for the other problems. The most recent compiled state of these four problems (as per our knowledge) is shown in the Table. 3.5

Problem

Gap and hole type

Time

Reference

MFR

UnGapped

O(m2 n + m3 )

[BILR05]

At most k holes

O(22k m2 n + 23k m3 )

[BILR05]

At most 1-gap

NP-Hard

[LBIS01]

UnGapped

O(mn2 )

[BILR05]

At most k holes

O(2k mn2 )

[BILR05]

At most 1-gap

NP-Hard

[BILR05]

UnGapped

O(mn2 + n3 )

[CITK05]

At most k holes

?

At most 1-gap

NP-Hard

[CITK05]

UnGapped

NP-Hard

[CITK05]

At most k holes

NP-Hard

[CITK05]

At most 1-gap

NP-Hard

[CITK05]

MSR

LHR

MEC

Table 3.5: Comparison among the SNP problems

3.3

Population Haplotyping

Sequencing an individual DNA and having all of its haplotypic information is not the major concern of the researchers who deals with disease analysis, drug design, hybridization of crops etc. They need the trend of SNPs in the haplotypes in many individuals of a particular diploid organism. But, it is not possible for even the most powerful sequencer built by today to

37

3.3. Population Haplotyping

generate large scale haplotype data in short time and cost. The more economical method for population haplotype mapping is to generate large scale “genotype” data and compute the map from it. To be more precise, a genotype is not a fragment of a haplotype, rather it is a combined information of the fragments of the haplotypes. A genotype can only say whether an individual is homozygous or heterozygous in some SNP sites. Population haplotyping problems takes on genotypes of a large number of individuals over a fixed number of SNP sites, and infers the haplotypes of this population. Haplotype is a sequence of SNP values on a copy of chromosome of an individual. A genotype vector, or simply genotype, represents two haplotypes as a sequence of unordered pairs of alleles. Each pair represents the nucleotides in a given site. The following Table. 3.6 shows the genotype for a chromosome.

paternal copy:

atc T cat G gat G ctc T ctaa

maternal copy:

atc A cat G gat C ctc T ctaa

Haplotype 1:

T

G

G

T

Haplotype 2:

A

G

C

T

Genotype:

(T,A)

(G,G)

(C,G)

(T,T)

Table 3.6: A chromosome and its genotype Before describing some of the haplotype inference problems, the mathematical terminology is described as usually.

3.3.1

Terminology

For bi-allelic SNPs, haplotypes can also be represented as a string of symbols {0, 1}. Thus a genotype is a string of unordered pairs over the set {0, 1}. Whenever an SNP site is homozygous the genotype pair will be either (0,0) or (1,1), while for the heterozygous site the pair is (0,1). For example two haplotypes of length 3 are 0,1,1 and 1,0,1 which are combined into the genotype (0,1),(0,1),(1,1). Since, only 3 distinct pairs are possible we can use a different alphabet {0, 1, ?}

38

Chapter 3. Haplotyping

to represent genotypes more precisely. For the homozygous pairs we will use 0 and 1 respectively and for the heterozygous pair the ? will be used. Hence, for the previous example the genotype should be ?,?,1. Given a genotype g = {gi }, i = 1, 2, . . . , m, the resolution of g is pair h , k of haplotypes, where h = {hi }, i = 1, 2, . . . , m and k = {ki }, i = 1, 2, . . . , m such that hi = ki = gi if g 6=? and hi , ki ∈ {0, 1}, hi 6= ki if gi =?. We say that h, k resolves g. Given a genotype g and a haplotype h, we say that h is compatible with g if hi = gi whenever g 6=?. If h is compatible with g we can always find a haplotype h0 such that h and h0 resolves g. Here, h0 is called the realization of g by h and denoted as R(g, h). It should be noted here that, h and k can be compatible with g separately but they may not resolve g together. Given a g and h, R(g, h) is a unique haplotype computed as R(g, h) =

    

hi

if gi 6=?, i = 1, 2, . . . , m

1 − hi if gi =?, i = 1, 2, . . . , m

Now, the general problem of haplotype inference is defined below [BVDL03]. “Given a set of genotypes G = {g1 , g2 , . . . , gn }, for each g ∈ G find a pair h, k of haplotypes resolving g.” Preserving this general settings, there are many optimization problems with different models. The following subsections will discuss several of such problems along with their algorithms.

3.3.2

Pure Parsimony Problem

The first optimization problem that comes in mind about this general problem is to minimize the number of haplotypes. The pure parsimony problem states to do so. Problem : Pure Parsimony Haplotyping Problem Input : A set G of genotypes. Output : Find the cardinality-smallest set of haplotypes H such that for each g ∈ G, there is a pair of haplotypes h, h0 ∈ H that resolves g. An algorithm introducing the maximum parsimony approach to haplotyping was described earlier [Cla90]. That was actually an approximation algorithm using the rationale that ho-

3.3. Population Haplotyping

39

mozygous site in one genotype should be common to the other genotype where the site is ?. Later, Lancia et.al proved that the pure parsimony problem is NP-hard for genotype set that has no genotype with more than 3 heterozygous sites. It has been proved that pure parsimony problem can be solved in polynomial time (O(mnlog(n) + n3/2 )) for a genotype set G of no more than 2 heterozygous sites in a single genotype [CITK05]. It is yet to prove the hardness of pure parsimony problem for a genotype set with more than 3 heterozygous sites. The haplotype inference problems has another name called “genotype phasing”. Recent software applications approach this problem for large population set and real time operations. An entropy minimization algorithm can be used for practical implementation [PM06]. Algorithm for a highly reliable and large scale implementation is also available [BZ06].

3.3.3

Maximum Resolution Problem – MR

With the general settings, many optimization problems have been devised by introducing rule of inference over the genotype data. The inference rule is an operation that is performed to expand a known set of haplotypes to make it consistent with a known set of genotypes [BVLD03]. The definition is as follows Definition 3.3.1 (Inference rule) Let G = {g1 , g2 , . . . , gm } be a set of genotypes and let H be a nonempty set of haplotypes. The application of the inference rule to a haplotype h ∈ H compatible with a genotype g ∈ G consists of adding R(g, h) to H and removing g from G. The MR problem deals with the application of inference rule over given (G, H). It states to find the sequence of inference that achieve maximum resolution of G by H. Problem : Maximum Resolution – MR Input : A set G of genotypes and a set H of haplotypes. Output : A maximum cardinality subset G0 of G of genotypes that are removed from G by a sequence of application of the inference rule starting from G and H. An example of application of inference rule will give an insight into the problems. Let G = {g1 = 00??0?, g2 = 00?1?1} and H = {h1 = 001000, h2 = 000001}. The optimal sequence

40

Chapter 3. Haplotyping of application of inference rule in comparison to a non-optimal one is shown in Table. 3.7

Step 0

1

2

Optimal Sequence

Non Optimal Sequence

G = {g1 , g2 }, H = {h1 , h2 }

G = {g1 , g2 }, H = {h1 , h2 }

G = {g1 = 00??0?, g2 = 00?1?1} and

G = {g1 = 00??0?, g2 = 00?1?1} and

H = {h1 = 001000, h2 = 000001}

H = {h1 = 001000, h2 = 000001}

h3 = R(g1 , h1 ) = 000101

h5 = R(g1 , h2 ) = 001100

G = {g2 }, H = {h1 , h2 , h3 }

G = {g2 }, H = {h1 , h2 , h5 }

h4 = R(g2 , h3 ) = 001111

No haplotype in H is compatible with g2

G = {}, H = {h1 , h2 , h3 , h4 }

G = {g2 }, H = {h1 , h2 , h5 }

Table 3.7: Example of optimal application of haplotype inference rule A formal framework to analyze and investigate the computational complexity of haplotype inference problems is presented by stating an optimization problem, whose corresponding decision version is NP-hard [Gus01].

Variation A variation of the MR problem named “single genotype resolution problem” is presented whose computational complexity is yet to determine [BVLD05]. This open problem deals with a single genotype to determine its existence in a population. Problem : Single Genotype Resolution – SGR Input : A set G of genotypes and a set H of haplotypes and a distinguished genotype g ∈ G. Output : A sequence of applications of the inference rule that resolves a subset of G including g.

3.3.4

Perfect Phylogeny Haplotyping Problem – PPH

Difference in SNPs in individuals of a particular population is not a frequent occurrence. A white man and a black man surely have some differences in their SNPs. But, for a population

3.3. Population Haplotyping

41

of white people, living in the same geographic region, having many similar traits, SNPs of individuals should not differ very much except some cases of atomic disaster like that of Hiroshima and Nagasaki which can mutate many SNPs of reproductive cell and contaminate the whole next generation of that individual. A mathematical description of the above mentioned fact will clear the idea. Let, there are n SNPs in the entire human race, so there might be 2n possible SNP configuration if all these SNPs are bi-allelic. But actually only a few of these 2n configurations can be identified so far in humans. All those configuration are very much different from each other and every single configuration is common to some population. This changes in SNPs occur for natural mutations and represents the evolution of humans through a large epoch. Thus haplotypes of a population contains the evolutionary information and they could be arranged in some phylogenetic tree. Thus the very general population haplotyping problem can degenerate into the phylogeny of haplotypes. Let, us define a phylogeny of the haplotypes as haplotype perfect phylogeny.

Definition 3.3.2 (Haplotype Perfect Phylogeny) Let B be an n × m {0, 1} matrix where each row in B is a binary haplotype and each column i is the n vector of the SNP sites of the m haplotypes. A haplotype perfect phylogeny for B, in short hpp, is a rooted tree T with n leaves such that the following properties hold: • each leaf of the tree is labeled by a distinct haplotype from B, that is a distinct row of B. • each internal edge of T can be labeled by at least a SNP site j changing from 0 to 1, while each site labels at most one edge. • for each haplotype leaf h, the unique path from the root of T to h specifies exactly all SNP sites that are 1 in h. An example of an hpp is given in the Fig. 3.7. It should be noted that the edges without label are not the internal edges of the tree. Hence, the example completely agrees the definition. Here, the matrix of haplotypes on the left is arranged in the hpp on the right. Since the hpp intends the use of haplotypes as a matrix, we should define the resolution of a genotype

42

Chapter 3. Haplotyping

3 A B C D E

123456 101000 001011 001010 010000 010100

1

2

5

4

A

D

E

6 B

(a)

C (b)

Figure 3.7: An example of haplotype perfect phylogeny (hpp) matrix. A {0,1} matrix H of haplotypes resolves a {0,1,?} matrix G of genotypes if each row of G is resolved by a pair of rows of H. The perfect phylogeny haplotyping problem is Problem : Perfect Phylogeny Haplotyping – PPH Input : an n × m matrix G over alphabet {0,1,?}. Output : A matrix H which is a realization of matrix G and a haplotype perfect phylogeny for H, or decide that such tree does not exist. Now, the issue of existence of a solution is important in this problem. We need to know how to check that a matrix H has no hpp. In the Fig. 3.8 a matrix that has no hpp is shown. The trees in the figure are possible trials to represent the matrix in hpp, but both failed to represent a haplotype. Such matrix, where three rows and two columns and the rows are (1,1),(0,1) and (1,0) is called the forbidden matrix. It is proved that when a matrix has a sub matrix of any two column and three rows equal to a forbidden matrix, the matrix has no hpp [BVLD05].

3.4

Summary

In this chapter, we described the haplotyping problem, one of the most important SNP related problems. We described both types of haplotyping: individual and population. The individual haplotyping determines the haplotype of an individual from the DNA sequence data of that

43

3.4. Summary

i

j

h1

1

1

h2

1

0

h3

0

1

i j

i

h1 (a)

j

h2 (b)

h1

h3 (c)

Figure 3.8: Forbidden matrix for hpp individual. There are several variation of this generic problem. The MFR, MSR and MEC problem are described along with some graph based computational model and their variations. The fragment conflict graph, SNP conflict graph are examples of such models. A local search algorithm is given for the MEC problem. Population haplotyping determines the haplotypes of a population from the genotypes of a number of individuals. The pure parsimony problem, maximum resolution problem and perfect phylogeny haplotyping problem are described. For each problem, illustrative example is used to elaborate the computation required to solve the problem.

Chapter 4 Phylogenetic Tree Phylogeny is the description of relations among species, population, individuals or genes. This relationship can be of several kinds having the property of ancestor-descendant relationship. For a quick example, we can describe a phylogeny of human languages of different civil societies where “Hindi” and “Bangla” may have same ancestor named “Sanskrit” while “arabic” and “Persian” may have same ancestor. Generally relations that can be expressed as descendantancestor have a close tie with time. That’s why phylogeny is extensively used to research the evolutionary history of biological elements like species, population, genes, etc (generally referred as taxa). In this setting, the most reliable source of information that can be used to relate taxa are cellular information like ribosomal RNAs, gene families, repetitive DNA sequences, protein sequences, etc. The best representation of a phylogeny should be nothing but tree. A tree organization of taxas enables easier analysis and makes it more suitable for applying computer algorithms. A tree representation of a phylogeny is called the Phylogenetic Tree. An example of a phylogenetic tree constructed from RNA comparison is shown in Fig. 4.1 taken from the book Fundamental concepts of bioinformatics by D. E. Krane et. al. (page 113). From the point of view of computer science, phylogenetic tree can be studied in two different phases. • Construction of phylogenetic tree using some distance metric to measure the relationship 44

45

Microsporidia Diplomonads Trichomonads

Bacteria

Flagellates

Green nonsulfur bacteria Cyanobacteria

Ciliates

Purple Bacteria

Plants Fungi Animals Slime molds Entamoebae

Gram−Positive Bacteria Flavobacteria Thermotogales

Eucarya

Euryarchaeota

Crenarchaeota

Archaea

Figure 4.1: A sample phylogenetic tree of living organisms

Chapter 4. Phylogenetic Tree

46

among taxa so that the topology of the tree represent the relation reliably.

• Analysis of information (ancestral sequence, path length, subtree, etc.) from an already constructed phylogenetic tree accurately and visualization the trees in easy shapes.

In this chapter we will discuss two different analysis algorithm to enchant the current trend of research in phylogenetic tree analysis.

4.1

Majority Rule Consensus Tree

Most of the construction problems of phylogenetic trees are NP-hard and run heuristic algorithms to generate phylogenetic tree for smaller set of taxa than required. While looking for building a tree for a set of million taxa, smaller trees generated by these algorithms should be combined reliably and efficiently. A consensus tree for a set of input trees is a single tree which include features on which all or most of the input trees agree. Depending upon the definition of the feature, on which the trees should agree, consensus tree can be different. For example, it can preserve the topological structure of the trees or can preserve the pairwise path lengths of individual taxa. The algorithm that we will discuss in this section builds consensus tree using majority rule. In this problem, a node in a tree is equal to another node in a different tree if they have the same set of taxa in the corresponding subtree rooted at them. The majority rule consensus tree problem states to include all nodes that are present in more than fraction l of the input trees into the consensus tree so that the topology of the consensus tree agrees with majority of the input trees [ACJ03]. Problem : Majority Rule Consensus Tree Problem Input : A set of phylogenetic trees Output : A consensus tree only having all the nodes that appear in majority of the input trees. The algorithm we are presenting is a simplified linear time randomized algorithm.

47

4.1. Majority Rule Consensus Tree

4.1.1

Terminology

Let S = {s0 , s1 , . . . , sn−1 } is a set of taxa and T = {T1 , T2 , . . . , Tt } is a set of phylogenetic trees, each having leaves labeled by s such that s ∈ S. Without loss of generality, let s0 is a distinguished taxon that is present in all the trees Tj and is connected directly to the root of each of the trees. Such taxon is called the outgroup to the rest of the tree. If there is no common outgroup present in T , to continue with the algorithm we will insert s dummy node to each of the trees and make it the outgroup. Now, let i is a node in a tree Tj . If we delete the edge from i towards the root of Tj , we will get a bipartition : a subtree rooted at i and the rest of the tree having s0 . Such a bipartition B can be represented uniquely only by the leaves of the subtree rooted at i and denoted as S(B). For example, in the Fig. 4.2 a bipartition in tree T2 is s2 s3 |s0 s1 s4 and S(s2 s3 |s0 s1 s4 ) = {s2 , s3 }. The cardinality of the bipartition is 2. Therefore, two nodes i and i0 of two different phylogenetic trees Tj and Tk are equal if their respective bipartitions B and B 0 are equal, i.e. S(B) = S(B 0 )

s0

s1 s2

s3

s4

s0 s1

s2

s3

s4

s0 s1

s2

s4

s3

Majority rule consensus tree s0 s1

s2

s3 s4

Figure 4.2: An example of majority rule consensus tree for l =

1 2

In majority rule tree, or Ml tree, includes nodes for exactly those bipartitions which occur in more than some fraction l of the input trees. It has been showed that there exists such tree for any 1/2 < l ≤ 1 [MM81].

48

Chapter 4. Phylogenetic Tree

4.1.2

Algorithm

The algorithm can be divided into two phase. The counting phase counts the number of occurrence of all the bipartition present in T and the construction phase constructs an Ml for some

1 2

< l ≤ 1.

Counting Bipartitions Counting phase counts the the number of occurrences of a bipartition so that we can determine whether the bipartition is a majority one. To count every possible bipartition we need a storage for them. A tree with n leaves can have O(n) distinct bipartition. Because |S|= n, maximum number of distinct bipartitions is O(tn). We can represent each bipartition B by B = (b1 , b2 , . . . , bn ), a bit-string of n-bits, where bit bj carries a 1 if the respective taxon sj ∈ B. The bit-string for a bipartition B of node i can be calculated from the bit-strings of the its children’s bipartition’s B1 , B2 , . . . , Br by just taking a logical OR of them. This is because the Bi , i = 1, 2, . . . , r are disjoint set of taxa.

B=

r _

Bi .

(4.1)

i=1

We will use a hash function that hashes a bipartition to a hash table address. Each location of the hash table contains the record of the running count of the bipartition, the bit-string representation of the bipartition. The hash function for a bipartition B is h(B).

h(B) =

n X

bi ai mod m

(4.2)

i=1

where a = (a1 , a2 , . . . , an ) is a list of random integers in (0, . . . , m − 1) and m is a prime number and m > tn. Such hash function has an advantage of implicit calculation. The hash index for a bipartition B of node i can also be calculated from the bipartitions of the its children’s bipartition’s B1 , B2 , . . . , Br by just adding and taking mod of their hash indices.

49

4.1. Majority Rule Consensus Tree

h(B) =

n X

bi ai mod m =

i=1

r X n X

bji ai mod m =

j=1 i=1

r X

h(Bj ) mod m

(4.3)

j=1

To count bipartitions the algorithm traverses every tree in postorder and calculates the hash index and the bit-string representation from those of its children. Using index the algorithm accesses the hash table which uses separate chaining method to resolve the collision and inserts the bit-string as a new record if it is not in the table and increments its count.

Constructing Majority Rule Tree The construction phase makes use of the following facts. Fact 4.1.1 Let Ba and B are two majority bipartitions. If Ba is an ancestor of B in a tree in T , then Ba is an ancestor of B in Ml . Fact 4.1.2 Let Bp and B are two majority bipartitions. If Bp is the parent of B in Ml , then at least one tree in T has Bp as Bs ancestor. Fact 4.1.3 Let Ba , Bp and B are all majority bipartitions. If Bp is the parent of B and Ba is an ancestor of B other than Bp in Ml , then |Ba | > |Bp |. The Fact 4.1.1 tells that a preorder traversal of the input trees will be congruent with the top-down order of bipartitions in the Ml . The Fact 4.1.2 tells that if we traverse all the input trees in preorder, we won’t miss any edge of Ml . The algorithm starts a preorder traversal of each tree in T . Let, the algorithm is in node i. Each traversal uses a pointer c which always points to the nearest ancestor of i that is majority node. We create a dummy node whose bipartition is S and before each traversal c is initialized to dummy. After the algorithm, the Ml will be the sole subtree of this dummy. Let C be the bipartition of c. At node i, the algorithm calculates the Bi and h(Bi ) using Equ 4.1 and Equ 4.3 and using h(Bi ) it finds the count of Bi . If Bi is not a majority node, it is just ignored. If Bi is a majority

50

Chapter 4. Phylogenetic Tree

node then it is searched in the Ml . If there is no node for Bi in the Ml then a node for Bi is created. Since, c is the ancestor of i in Ml , a node for C obviously exists in Ml and the node for C is made the parent of Bi . If, on the other hand, a node for Bi does exist in Ml , the algorithm finds the current parent P of Bi in Ml . If P has a greater cardinality than C, then C is a strong contender to be the parent of Bi (using Fact 4.1.3). Hence, C is made the parent of Bi in Ml . Otherwise no change is done in Ml .

4.1.3

A Simulated Example

s0

s1 s2

s3

s4

s0

s1

T1

s2

s3

s4

s0

s1

T2 s1s2s3s4

s2 s3

s4

T3 s1s2s3s4

s1s2s3s4 s1s2s3

s2s3 s0

s1 s2

(a)

s3

s4

s1s2s3 s0 s1 s2

(b)

s2s3 s3

s2s3 s4

s0

s1

s2

(c)

s1s2s3s4

Majority rule consensus tree

s1s2s3 s2s3

s1

s2

s4

s1s2s3s4

s1s2s3 s0

s3

s3

(e)

s2s3 s4

s0

s1

s2

s3

s4

(d)

Figure 4.3: Steps of majority rule tree construction In Fig. 4.3 the steps of applying the above algorithm to the trees of the first row is shown. The order of traversal of the tree is < T3 , T2 , T1 >. After finishing the traversal of T3 , the tree would look like (a). Since s2 s3 s4 (third node

4.2. Uniform Sampling of Phylogenetic Trees

51

from the top) of T3 is not a majority node, both s1 s2 and s4 is added under s1 s2 s3 s4 . The Ml would look like (b) after finishing the traversal of s1 s2 s3 (third node from the top) of T2 . Since, Bi = s1 s2 s3 was not in the Ml built so far, a node for it is created and C = s1 s2 s3 s4 is made its parent. When the traversal move on to Bi = s1 the c moves down to s1 s2 s3 . Now, the parent of Bi in Ml is P = s1 s2 s3 s4 and it is greater in cardinality than the C = s1 s2 s3 . Therefore, Bi ’s parent is updated to C = s1 s2 s3 from P = s1 s2 s3 s4 (see (c)). Similarly When the traversal move on to Bi = s2 s3 of T2 , it updates its parent to C = s1 s2 s3 from P = s1 s2 s3 s4 (see (d)). After finishing the traversal of T2 and T1 the Ml would look like (e) which is the desired majority rule consensus tree. Although the example uses more simple binary tree, the algorithm equally performs in case of n-ary tree.

4.2

Uniform Sampling of Phylogenetic Trees

Most of the analysis algorithms proposed earlier were tested on the simulated topological and sequence data. The common reasons behind this are unavailability of biologically obtained phylogenetic trees and the large sizes of the few available ones. With the increase in the study of phylogenetic analysis, the necessity of more correct and reliable phylogenetic trees is also increasing. So, recent trend is to analyze an algorithm with a smaller and representative sample from a real phylogenetic tree. One might think it as a trivial problem by just choosing an arbitrary subtree from a large phylogenetic tree. But it becomes non-trivial when the samples are required to be uniformly selected over some biologically motivated constrained set of induced subtrees. In this section, the general problem of uniform tree sampling is defined and some specific problems are described.

4.2.1

Terminology

Let, T is a phylogenetic tree over a set of taxa S, where |S| = n. If S 0 is a nonempty set of taxa such that S 0 ⊂ S, then the subtree induced by S 0 from T is denoted by T 0 and is obtained

52

Chapter 4. Phylogenetic Tree by the following sequence of steps • deleting all leaves from S\S 0 • deleting all edges connecting those leaves

• deleting all degree-2 internal nodes by merging the edges connected to it into a single one.

If T is a weighted tree then the merging of two weighted edges should be defined. In the Fig. 4.4 the T is shown in (a) where S = {1, 2, . . . , 10} and a induced subtree T 0 for S 0 = {1, 3, 4, 5, 6, 7} is shown in (b). 8

7

7

9 6

10 1

5

6

1

5

2 3

4

(a)

3

4

(b)

Figure 4.4: An induced subtree of a phylogenetic tree Now, if we are given a T ,S and S 0 , a unique T 0 is defined. But if we are given |S 0 | instead of S 0 then a number of T 0 s can be found. This |S 0 | is named as samplesize. The uniform sampling problem states to find a fixed sized sample from T uniformly.

Problem : Uniform phylogenetic tree sampling problem Input : A phylogenetic tree T over a set of taxa S such that |S| = n and the sample size 2 ≤ m < n. Output : A uniformly chosen induced subtree T 0 over all such trees where T 0 has m distinct leaves taken from S and satisfies some constraints.

Although phylogenetic trees can be n-ary tree, the binary representation is the mostly used.

4.2. Uniform Sampling of Phylogenetic Trees

53

In this section we will introduce three sampling problem of binary phylogenetic trees [KMP03].

4.2.2

Sampling Problems

All the problems stated in this section consider trees weighted by positive real numbers. To merge two edges of a degree-2 internal node, the node is just removed and the solitary edge gets a weight equal to the sum of the previous two weights.

Problem : Leaf depth problem Input : A phylogenetic tree T over a set of taxa S such that |S| = n , the sample size 2 ≤ m < n and two non-negative real number dmin and dmax . Output : A uniformly chosen induced subtree T 0 over all such trees where T 0 has m distinct leaves taken from S and satisfies dmin < dist(r, v) < dmax for all the leaves v of T 0 and root r of T 0 .

Problem : Edge length problem Input : A phylogenetic tree T over a set of taxa S such that |S| = n , the sample size 2 ≤ m < n and two non-negative real number emin and emax . Output : A uniformly chosen induced subtree T 0 over all such trees where T 0 has m distinct leaves taken from S and satisfies emin < |e| < emax for all the edges e of T 0 .

Problem : Pair wise leaf distance problem Input : A phylogenetic tree T over a set of taxa S such that |S| = n , the sample size 2 ≤ m < n and two non-negative real number dmin and dmax . Output : A uniformly chosen induced subtree T 0 over all such trees where T 0 has m distinct leaves taken from S and satisfies dmin < dist(v1 , v2 ) < dmax for all pair of distinct leaves v1 and v2 of T 0 . Efficient algorithms for all the above problems are available considering T to be binary [KMP03]. For arbitrary trees, the complexities of these problems are still to be explored.

Chapter 4. Phylogenetic Tree

54

Although every arbitrary phylogenetic tree can be represented as binary phylogenetic tree by adding internal nodes, to our belief determining the complexity of sampling arbitrary tree for the above problems will be a tough task. Since most of the phylogenetic software packages use binary representation, it will also be a good option to generate binary samples from an arbitrary n-ary phylogenetic tree satisfying the above constraints.

4.3

Summary

Phylogeny is one of the most extensively studied area of bioinformatics. The obvious representation of tree and rich theory behind its analysis made phylogeny more advanced than other recent areas. Phylogenetic tree analysis is currently widely used in disease tracing, Biodiversity analysis, customized vaccine research, understanding microbial ecology and many such fields of applied biology. In this chapter we present an algorithm to construct consensus tree using majority rule. The algorithm is carried out in two phases: counting phase and construction phase. We also annotated three phylogenetic tree sampling problems. Uniform sampling of small induced trees from large phylogenetic trees of millions of taxa is relatively new area of study. There could be many variations of this nontrivial uniform sampling problem. Some directions are also discussed at the end of the chapter.

Chapter 5 Conclusion This thesis reveals several uses of graph in generating and representing biological data. We organize the thesis around two specific problems on these two broad classes : haplotyping and phylogeny. Generation of haplotypes for an individual from fragments of SNP sequence can be done using several algorithms. Minimum fragment removal, minimum SNP removal and minimum error correction are such algorithms suitable for different experimental set ups. In this thesis, we described the computational model (fragment conflict graph and SNP conflict graph) for these problems. We proposed a local search based algorithm for MEC problem. We also incorporate variations of these problems, some of which are still open in their complexity and algorithmic status. Generation of haplotypes for a population from their genotype data is another important area. Pure parsimony problem, maximum resolution problem, perfect phylogeny haplotyping are discussed in this thesis. We provide illustrative examples to explain the computation required to solve these problems. Phylogenetic tree is the most widely used tool to represent the parent-child relationships of biological elements, like DNA sequences, external features, physiological processes, etc. We described an algorithm to construct consensus tree from many small candidate trees using majority rule. We also described the uniform sampling problems of large phylogenetic tree. 55

Chapter 5. Conclusion

56

Let us justify the organization of this thesis again. In chapter 1, we defined the research area : “bioinformatics” and provide two different perspectives to look at it : biological and computational perspective. We explicitly defined the scope of this thesis with respect to the possible area of researches. In chapter 2 we define several preliminary ideas to help a reader to grasp the content of this thesis. In chapter 3 we discussed the haplotyping problems. We described the LSMEC, an algorithm for solving minimum error correction problem. The algorithm is an iterative search algorithm and performs quite efficiently even for true random fragment data. Two graphs are defined for the MFR and MSR as computational model. The computational status of different versions with respect to these models are also presented. We also explained the perfect phylogeny haplotyping problem and maximum resolution problem. In chapter 4 we discussed the phylogeny. We explained the steps of the majority rule algorithm to construct consensus tree. The algorithm is illustrated very elegantly for easier understanding. Uniform sampling of phylogenetic tree meeting some defined constraints is a non-trivial sampling problem. Two different constraints on pair-wise leaf distances and on edge lengths of the induced subtrees are defined here. Sampling of binary tree from arbitrary n-ary tree would be an interesting problem in this aspect. The use of graph in bioinformatics as a computational tool is the subject matter of this thesis. Other branches of computer science are also significant in their contribution to bioinformatics and this thesis inspires us to study those areas very soon.

References [ACJ03] N. Amenta, F. Clarke, K. S. Jhon: A Linear-Time majority tree algorithm, Proc. of 3rd International Workshop on Algorithms in Bioinformatics, LNBI 2812, Springer, pp. 216-227, 2003. [BILR05] V. Bafna, S. Istrail, G. Lancia, R. Rizzi: Polynomial and APX-hard cases of the individual haplotyping problem, Theoretical Computer Science, 335, pp. 109-125, 2005. [BVDL03] P. Bonizzoni, G. D. Vedova, R. Dondi, J. Li: The Haplotyping problem: an overview of computational models and solutions, Journal of Computer Science and Technology, 18(6), pp. 675-688, 2003. [BZ06] D. Branza, A. Zelikovsky: 2snp: scalable phasing based on 2-snp haplotypes, Bioinformatics, 22(3), pp. 371-373, 2006. [CIKT05] R. Cilibrasi, L. V. Iersel, S. Kelk, J. Tromp: On the complexity of several haplotyping problems, Proc. of 5th International Workshop on Algorithms in Bioinformatics, LNCS 3692, Springer, pp. 128-139, 2005. [Cla90] A. G. Clark: Inference of haplotypes from PCR-amplified samples of diploid populations, Molecular Biology and Evolution, 7(2), pp. 111-122, 1990. [Die05] R. Diestel: Graph theory, Springer, Heidelberg, 2005. [Gus01] D. M. Gusfield: Inference of haplotypes from samples of diploid populations: Complexity and algorithms, Journal of Computational Biology, 8(3), pp. 305-323, 2001. 57

References

58

[Huf05] Huffner F.: Algorithm engineering for optimal graph bipartization, Proc. of 4th International Workshop on Efficient and Experimental Algorithms, LNCS 3503, Springer, pp. 240-252, 2005. [KMP03] P. Kearney, J. I. Munro, D. Phillips: Efficient generation of uniform samples from phylogenetic trees, Proc. of 3rd International Workshop on Algorithms in Bioinformatics, LNBI 2812, Springer, pp. 177-189, 2003. [LBIS01] G. Lancia, V. Bafna, S. Istrail, R. Schwartz: SNP Problems, complexity and algorithms, Proc. of 9th Annual European Symposium on Algorithms (ESA), 2161, LNCS, Springer, pp. 182-193, 2001. [MM81] T. Margush, F. R. McMorris : Consensus n-trees, Bulletin of Mathematical Biology, 43, pp. 239-244, 1981. [PM06] B. Pasaniuc, I. Mandoiu: Highly scalable genotype phasing by entropy minimization, Proc. of IEEE International Conference of the Engineering in Medicine and Biology Society, pp. 3482-3486, 2006. [RTCN99] M. J. Rieder, S. L. Taylor, A. G. Clark, D. A. Nickrson: Sequence variation in the human angiotensin aonverting enzyme, Nature Genetics, 22, pp. 59-62, 1999. [WWLZ05] R.-S. Wang, L.-Y. Wu, Z.-P. Li, X.-S. Zhang: Haplotype reconstruction from SNP fragments by minimum error correction, Bioinformatics, 21(10), pp. 2456-2462, 2005. [Wes03] D. B. West: Introduction to graph theory, Prentice Hall of India, 2003.

Index allele, 18, 26

gap, 20

heterozygous, 19

GenBank, 2

homozygous, 19

genotype, 37 phasing, 39

bi-allelic, 20

resolution, 38

binary-witness MEC, 35

graph, 15

bioinformatics, 1

bipartite, 22

areas, 1

connected graph, 16

bipartition, 47

cycle, 16

bit-string, 48

degree, 16

BLAST, 2

directed, 15 cardinality, 51

path, 16

chromosome, 7

subgraph, 15

collection, 26

undirected, 15

conflict, 20 consensus tree, 46

haplotype, 10, 19

majority rule tree, 47

construction, 22 length, 23

diploid, 19

haplotype perfect phylogeny, 41

DNA, 7

haplotyping, 19 forbidden matrix, 42

individual, 19

fragment, 26

population, 19, 36

fragment conflict graph, 22

HMEC, 26 hole, 20

gain, 29 59

60

INDEX induced subtree, 52

SNP, 18

inference rule, 39

SNP conflict graph, 24 SNP matrix, 20

LHR, 23 taxa, 12, 44 marker, 14 maximum cumulative gain, 29 MEC, 25 MFR, 22 MR, 39 MSR, 24 NCBI, 2 nucleotide, 7 partition, 26 PDB, 2 perfect graph, 24 personalized drug prediction, 14 phylogenetic Tree, 44 phylogenetic tree, 12 character-based, 13 cladistic, 12 distance-based, 12 phenetic, 12 phylogeny, 12 protein, 9 pure parsimony, 38 replication, 9 RNA, 9

outgroup, 47 transcription, 9 tree, 16 topology, 46

B. Sc. Engg. Thesis Applications of Graphs in ...

3.1 Minimum fragment removal using fragment conflict graph . . . . . . . . . . . . . .... Develop better tools for data generation, capture, and annotation. .... visualizing and sampling of phylogenetic trees are major applications of graph in this branch.

392KB Sizes 11 Downloads 98 Views

Recommend Documents

Scheme B Sc V Sem.pdf
Page 1 of 2. SHRI VAISHNAV VIDYAPEETH VISHWAVIDYALAYA, SVIFS, INDORE. Choice Based Credit System (CBCS) Scheme. B.Sc. / B.Sc.-M.Sc. (Forensic Science). V- SEMESTER. S. No. Subject. Code Name of Subject Course. Code. Teaching. Scheme/week Examination

Scheme B Sc IV Sem.pdf
Page 1 of 2. SHRI VAISHNAV VIDYAPEETH VISHWAVIDYALAYA, SVIFS, INDORE. Choice Based Credit System (CBCS) Scheme. B.Sc. / B.Sc.-M.Sc. (Forensic Science). IV – SEMESTER. S. No. Subject. Code Name of Subject Course. Code. Teaching. Scheme/week Examinat

ALI B. AHMAD _ THESIS - Ali Bawasheakh Ahmad.pdf
There was a problem previewing this document. Retrying... Download. Connect more apps... Try one of the apps below to open or edit this item. ALI B. AHMAD _ ...

Third Year B. Sc. Examination 2013 Chemistry P-VIII (Physical ...
There was a problem previewing this document. Retrying... Download. Connect more apps... Try one of the apps below to open or edit this item. Main menu.

sc/sc name in attendance register.pdf
sc/sc name in attendance register.pdf. sc/sc name in attendance register.pdf. Open. Extract. Open with. Sign In. Main menu. Displaying sc/sc name in attendance ...

Environmental Engg. Public Health Engg. Water Tech Health Sciences ...
Environmental Engg. Public Health Engg. Water Tech Health Sciences.pdf. Environmental Engg. Public Health Engg. Water Tech Health Sciences.pdf. Open.

. Computer Science and Engg Information Science Engg..pdf ...
There was a problem previewing this document. Retrying... Download. Connect more apps... . Computer S ... ce Engg..pdf . Computer Sc ... nce Engg..pdf. Open.

REPRESENTATION OF GRAPHS USING INTUITIONISTIC ...
Nov 17, 2016 - gN ◦ fN : V1 → V3 such that (gN ◦ fN )(u) = ge(fe(u)) for all u ∈ V1. As fN : V1 → V2 is an isomorphism from G1 onto G2, such that fe(v) = v′.

Engg-Chemistry.pdf
Amazing Spider-Man 2 - новые приключения отважного героя. Разработчик:Gameloft / Системныетребования:Android 4.0 и выше. We give you the gameto ...

Computer Science and Engg Information Science Engg..pdf
Page 3 of 4. -Computer Science and Engg Information Science Engg..pdf. -Computer Science and Engg Information Science Engg..pdf. Open. Extract. Open with.

. Computer Science and Engg Information Science Engg..pdf ...
Page 3 of 4 . Computer Science and Engg Information Science Engg..pdf . Computer Science and Engg Information Science Engg..pdf. Open. Extract. Open with.

Master of Education Major in Social Science (Non-Thesis).pdf ...
Master of Education Major in Social Science (Non-Thesis).pdf. Master of Education Major in Social Science (Non-Thesis).pdf. Open. Extract. Open with. Sign In.

Master of Education Major in Social Science (Non-Thesis).pdf ...
Whoops! There was a problem loading more pages. Master of Education Major in Social Science (Non-Thesis).pdf. Master of Education Major in Social Science ...

Centrality in valued graphs - Linton C. Freeman
037%8733/91/$03.50. 0 1991 - Elsevier Science Publishers B.V. All rights ... Marriott 1958; Anthonisse 1971; Freeman 1977; Friedkin 1991). Such people can facilitate or inhibit the communication .... In extending the betweenness model to valued graph

Ramanujan graphs in cryptography - Cryptology ePrint Archive
... Research, One Microsoft Way, Redmond, WA 98052, [email protected] .... We begin by recalling some basic facts about isogenies of elliptic curves and ...