RET. Revista de Estudios Transdisciplinarios Vol. 1. Nº 2. Serie verde | Caracas, July-December 2009

Phylogenetic tree calculations using the Grid with DAG workflow jobs (Cálculos de árboles filogenéticos empleando la metodología DAG en un Grid)

Raúl Isea*,1 Juan L. Chaves,2 Fernando Blanco,3 Rafael Mayo3

The goal of this work is to implement molecular phylogenetic calculations on a Grid with the MrBayes software, using Directed Acyclic Graph (DAG) jobs. In this method, the calculations are grouped into tasks that run in sequence, each depending on the input or the output of other tasks, with both input and output data distributed following the Grid computing philosophy. Once the runs have finished successfully, the results are collected and processed by Perl scripts inside the defined DAG job. To test this methodology, we calculate the phylogeny of 121 sequences (taxa) from human papillomavirus.

Keywords: Directed Acyclic Graphs; DAG; MrBayes; Grid; Phylogenetics.

Introduction

One of the most exciting challenges emerging in computational biology today is determining the evolutionary history of different species. One method for determining the relationships among species is phylogenetics (Woese, 1998; Maher, 2002). An example is the work of Korber et al. (2000), which determined the evolutionary history of AIDS. In that work, the authors deduced that AIDS did not come from a contaminated sample of a polio vaccine in Africa, but had its origin many years earlier (Hooper & Hamilton, 2000). This result was obtained on Nirvana, a cluster of 512 Origin 2000 CPUs at the Los Alamos Advanced Computing Laboratory, by running a modified version of fastDNAml. The program had been extended with routines for parallel architectures and a reversible base-substitution model (Korber et al., 2000).

This shows that a computational method able to reconstruct the evolutionary history of a species exists. Unfortunately, this achievement came at a high cost in computing time and, in most cases, the scientific community does not have access to supercomputing infrastructures such as the one used by Korber et al. (2000). For this reason, it is important to develop alternative, computationally efficient techniques for faster and reliable phylogeny estimation. The methods currently available are based on distance, maximum parsimony, maximum likelihood or Bayesian inference. Only the last of these is used in this work, with the help of the MrBayes software (available at www.mrbayes.net), although it is relatively new in the construction of phylogenetic trees, as reflected in the pioneering work of Rannala & Yang (1996). This methodology builds on the Bayesian statistics previously proposed by Felsenstein in 1968, as indicated by Huelsenbeck, and is a technique for maximizing the posterior probability (Huelsenbeck, Larget, Miller, & Ronquist, 2002). This approach was chosen because it is computationally fast and because it considers all the possible values for the generated trees, so that no single dominant value prevails.

Several centres are currently involved in Grid computing in the biomedical field; examples are EGEE (web site at http://www.eu-egee.org/) and EELA (web site at http://www.eu-eela.eu/) and their later phases. The latter aims to disseminate distributed computing technology and to share European and Latin American resources via communication networks built on both continents. One example is the alignment of nucleotide sequences by means of the BLAST tools (Hernández et al., 2007). However, to our knowledge there are no applications that calculate phylogenies in a Grid environment using heterogeneous and distributed resources through the submission of a Job Description Language (JDL) file with Directed Acyclic Graph (DAG) dependencies (Sato, 2008). This work therefore intends to fill that gap, since DAG is a proven, powerful technique in bioinformatics.

The diversity of papillomavirus (PV) types has been calculated (Villiers, Fauquet, Broker, Bernard, & Hausen, 2004). Four years ago, in the context of phylogeny, it was necessary to define the term “species”. For example, the species linked to human PV-2 is typically found in common skin warts. PV types that form a species linked to PV-16 are also found in a high percentage of cervical cancers and their precursor lesions, i.e. they are considered “high risk” types (such as human PV-18, human PV-45 and so on). In this context, this research was carried out with 121 sequences from different PVs, based on the nucleotide sequence of the major capsid gene, the most conserved gene in PV genomes.

Finally, it is important to mention that this work does not focus on the selection of the sequences or of the phylogenetic model, but on the development of a new method that performs phylogenetic calculations more efficiently by means of Grid technology.

Methods

The sequences used in this work were downloaded from the GenBank database and then grouped into a single multi-FASTA file. The next step was to align these sequences with the Clustal software (Thompson, Higgins & Gibson, 1994) and to store the output in NEXUS format (in a file called ha.nex), which is the format used by MrBayes.

The DAG method was run on a small four-node cluster, three nodes of which were used as workstations for three independent calculations, labelled A, B and C. These correspond to three independent calculations of the phylogeny (see Fig. 1), each driven by an input script identified as hai.run, where i runs from 1 to 3 and corresponds to A, B and C, respectively. The three input scripts are independent of one another and each begins from a randomly generated tree; by doing this, repetitions across the input files are avoided.

FIG. 1. Schema of the DAG dependencies used in this work.

The calculations start in nodes A, B and C, and their results are then used as input for node D, the central node of the cluster. We used only three nodes in order to validate the use of DAG in these phylogenetic Grid calculations, but this number can of course be increased (Fig. 1). The input of node D contains all the results of the previous executions, which are stored in a variable called OutputSandboxBase in our example. This variable can be stored and used with an improved GridFTP called Grid Security Infrastructure File Transfer Protocol (GSIFTP). Next, the Perl script burntrees.pl, developed by Johan Nylander (http://www.abc.se/~nylander), manipulates the phylogenetic trees produced independently by MrBayes and places them together without losing information. In the last step (identified in Fig. 1 as node E), MrBayes performs a new calculation with a new script called haf.run, which yields the final phylogenetic tree derived from the three previous ones. This final calculation implicitly contains the three results obtained in nodes A, B and C. For this to work, consistency must be preserved; for this reason, the burn-in value used in haf.run is exactly the same as the one used in the burntrees.pl script (i.e. 100).

At the same time, a similar calculation was performed on a cluster supporting MPI with the same number and type of processors that MrBayes used in the DAG method. The reason is two-fold: first, to reproduce the same topology and validate the calculation performed in the Grid environment; and second, to compare the time consumed by the two methods.
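The dependency structure described above can be illustrated with a minimal Python sketch. It is not the authors' JDL file or Perl code: the dictionary `DEPENDENCIES`, the helper names, and the placeholder tree strings are assumptions made for illustration only. The sketch shows two ideas from the text: nodes A, B and C must finish before node D, which must finish before node E; and the pooling step discards the same burn-in (100) from every run before the trees are placed together.

```python
from graphlib import TopologicalSorter

# Hypothetical stand-in for the dependencies of Fig. 1: each key lists the
# nodes whose output it needs. A, B and C are the independent MrBayes runs.
DEPENDENCIES = {
    "A": set(),
    "B": set(),
    "C": set(),
    "D": {"A", "B", "C"},  # collects the three outputs
    "E": {"D"},            # final MrBayes calculation (haf.run in the text)
}

BURNIN = 100  # same burn-in value for the pooling step and the final step


def execution_order(dependencies):
    """Return one valid order in which the DAG nodes can be executed."""
    return list(TopologicalSorter(dependencies).static_order())


def pool_after_burnin(runs, burnin):
    """Drop the first `burnin` sampled trees of each run, then concatenate."""
    pooled = []
    for name in sorted(runs):
        pooled.extend(runs[name][burnin:])
    return pooled


if __name__ == "__main__":
    # Placeholder samples: plain strings instead of real Newick trees.
    runs = {n: [f"{n}_tree_{i}" for i in range(150)] for n in "ABC"}
    print(execution_order(DEPENDENCIES))
    print(len(pool_after_burnin(runs, BURNIN)))
```

The relative order of A, B and C in the result is not fixed, which reflects that the three runs are independent and can be dispatched to different nodes at the same time.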


Results and Discussion

The study of molecular phylogenies is the most widely used method worldwide for classifying the different types of papillomavirus (Villiers et al., 2004); moreover, it is the only method available for classifying the diversity of PV types, which is why we tested our methodology on a phylogenetic calculation. The final phylogenetic tree generated with the DAG Grid job is shown in Fig. 2, where the high-risk types identified earlier (human PV-16, human PV-18 and so on) fall in the same group. This agrees with the paper by Villiers et al. (2004), in which the neighbour-joining phylogenetic method was used. Since our goal is to validate Grid technology for this kind of calculation, the two results can be compared directly. As shown, both results are equivalent.

FIG. 2. Phylogenetic tree visualization of PVs with ATV software.

The output trees can be compared with the “comparetree” command. Applying this command to our Grid DAG result and the previous one gives an agreement of 0.99718 (where 1.0 means identical). Our results are thus consistent with the previous ones and, as a consequence, the use of the Grid is validated.

In our example, Intel Dual Xeon 2.3 GHz processors were used, and one node calculates about 1 million steps in 5 hours. Since the Grid method can distribute the calculations to many Working Nodes at the same time, this technique is expected to bring a large benefit when the calculations are extended to a greater number of inputs in the DAG dependencies. It is important to point out that only three nodes were used in our Grid job. Comparing the time consumed by the Grid and the local cluster calculations, the ratio of the times needed to obtain our phylogenetic result was 1.21, i.e. the Grid is slower. The reason lies in the fact that some Perl scripts must be run independently in order to collect the different calculations as if they were executed in a single batch job. In this way, there is a master node that controls the whole process. Moreover, a drawback in this work was that we had to monitor the nodes to ensure that the program did not crash or stall. These longer run times will be avoided in the future by using a meta-scheduler such as GridWay (Huedo, Montero, & Llorente, 2005). This tool also lets the user improve load balancing and the assignment of nodes, so new scripts for such tasks will no longer be needed.
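The timing figures quoted above can be turned into a small back-of-the-envelope calculation. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes that the run time scales linearly with the number of steps and that the ratio of 1.21 applies uniformly, neither of which is stated in the text.

```python
HOURS_PER_MILLION_STEPS = 5.0  # quoted in the text for one node
GRID_TO_CLUSTER_RATIO = 1.21   # Grid time divided by local cluster time


def single_node_hours(steps):
    """Estimated hours for one node, assuming linear scaling with steps."""
    return steps / 1_000_000 * HOURS_PER_MILLION_STEPS


def grid_overhead_hours(cluster_hours, ratio=GRID_TO_CLUSTER_RATIO):
    """Extra hours spent on the Grid relative to the local cluster."""
    return cluster_hours * (ratio - 1.0)


if __name__ == "__main__":
    hours = single_node_hours(1_000_000)
    print(f"local: {hours:.2f} h, Grid overhead: {grid_overhead_hours(hours):.2f} h")
```

Under these assumptions, a calculation that takes 5 hours on the local cluster would cost roughly one additional hour on the Grid, which is the overhead attributed to the collecting scripts and manual monitoring.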

Conclusions

This research demonstrates the efficient use of Grid technologies for performing molecular phylogenetic calculations on many computational resources. To this end, a DAG script has been developed. It takes advantage of the possibility of deploying distributed calculations on different allocated resources and obtaining a single result. The calculation was only 1.21 times slower than the similar calculation performed without the Grid environment. Finally, initiatives such as EELA2 offer the scientific community computing time through a friendly interface and allow researchers to submit highly demanding scientific calculations from their own computers. This easy access is a key factor, since most researchers are not familiar with distributed computing techniques such as the Grid.

Acknowledgments

We are grateful to Johan Hoebeke, who provided helpful comments on a previous draft of this paper. The authors particularly thank the EELA Project (E-infrastructure shared between Europe and Latin America, http://www.eu-eela.org), contract n° 026409, 6th Framework Programme for Research, Technological Development and Demonstration (FP6), for its support.

References

1. Hernández, V.; Blanquer, I.; Aparicio, G.; Isea, R.; Chaves, J.L.; Hernández, A.; Mora, H.R.; Fernández, M.; Acero, A.; Montes, E. & Mayo, R. (2007). Advances in the biomedical applications of the EELA Project. Studies in Health Technology and Informatics. 126: 31-36.
2. Hooper, E. & Hamilton, B. (2000). The River: A Journey to the Source of HIV and AIDS. USA: Back Bay Books.
3. Huedo, E.; Montero, R.S. & Llorente, I.M. (2005). The GridWay Framework for Adaptive Scheduling and Execution on Grids. Scalable Computing: Practice and Experience. 6: 1-8.
4. Huelsenbeck, J.P.; Larget, B.; Miller, R.E. & Ronquist, F. (2002). Potential Applications and Pitfalls of Bayesian Inference of Phylogeny. Systematic Biology. 51: 673-688.
5. Korber, B.; Muldoon, M.; Theiler, J.; Gao, F.; Gupta, R.; Lapedes, A.; Hahn, B.H.; Wolinsky, S. & Bhattacharya, T. (2000). Timing the Ancestor of the HIV-1 Pandemic Strains. Science. 288: 1789-1796.
6. Maher, B.A. (2002). Uprooting the Tree of Life. The Scientist. 18: Sep. 16.
7. Rannala, B. & Yang, Z. (1996). Probability distribution of molecular evolutionary trees: A new method of phylogenetic inference. Journal of Molecular Evolution. 43: 304-311.
8. Sato, K.; Mituyama, T.; Asai, K. & Sakakibara, Y. (2008). Directed acyclic graph kernels for structural RNA analysis. Bioinformatics. 9: 318-330.
9. Thompson, J.D.; Higgins, D.G. & Gibson, T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research. 22: 4673-4680.
10. Villiers, E.M.; Fauquet, C.; Broker, T.R.; Bernard, H.U. & Hausen, H.Z. (2004). Classification of papillomaviruses. Virology. 324: 17-27.
11. Woese, C.R. (1998). The Universal Ancestor. Proceedings of the National Academy of Sciences of the United States of America. 95: 6854-6859.
12. Zmasek, C.M. & Eddy, S.R. (2001). ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics. 17: 383-384.
13. Cox, J.L.; Farrell, R.A.; Hart, R.W. & Langham, M.E. (1970). The transparency of the mammalian cornea. J. Physiol. 210: 601.

1 Fundación Instituto de Estudios Avanzados - IDEA, Valle de Sartenejas, Baruta 1080, Venezuela.
2 Parque Tecnológico Mérida, Av.4 Edif. Masini, Mérida 5101, Venezuela.
3 CIEMAT, Avda. Complutense, 22, 28040 Madrid, Spain.
(*) Corresponding author: Raúl Isea, Fundación Instituto de Estudios Avanzados IDEA, Valle de Sartenejas, Hoyo de la Puerta, Baruta, Venezuela. Email: [email protected] Tel. 9035162
