OUP CORRECTED PROOF – FINAL, 2/2/2012, SPi
Codon Evolution Mechanisms and Models EDITED BY
Gina M. Cannarozzi University of Bern, Switzerland
Adrian Schneider University of Utrecht, The Netherlands
1
OUP CORRECTED PROOF – FINAL, 2/2/2012, SPi
3
Great Clarendon Street, Oxford OX2 6DP Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide in Oxford New York Auckland Cape Town Dar es Salaam Hong Kong Karachi Kuala Lumpur Madrid Melbourne Mexico City Nairobi New Delhi Shanghai Taipei Toronto With offices in Argentina Austria Brazil Chile Czech Republic France Greece Guatemala Hungary Italy Japan Poland Portugal Singapore South Korea Switzerland Thailand Turkey Ukraine Vietnam Oxford is a registered trademark of Oxford University Press in the UK and in certain other countries Published in the United States by Oxford University Press Inc., New York © Oxford University Press 2012 The moral rights of the authors have been asserted Database right Oxford University Press (maker) First published 2012 All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above You must not circulate this book in any other binding or cover and you must impose the same condition on any acquirer British Library Cataloguing in Publication Data Data available Library of Congress Cataloging in Publication Data Library of Congress Control Number: 2011944051 Typeset by SPI Publisher Services, Pondicherry, India Printed and bound by CPI Group (UK) Ltd, Croydon, CR0 4YY ISBN 978–0–19–960116–5 1 3 5 7 9 10 8 6 4 2
OUP CORRECTED PROOF – FINAL, 30/1/2012, SPi
CHAPTER 9
Simulation of coding sequence evolution Miguel Arenas and David Posada
9.1
Introduction
Computer simulations try to mimic the processes that happen in the real world (Peck, 2004) and are widely used in a large number of different fields. In general, computer simulations allow to study complex systems, including those analytically intractable. Indeed, computer simulations can be considered as experimental systems (Winsberg, 2003), and using appropriate models, they can be more efficient than simple analytic methods in order to gain information about the system. In silico simulations can incorporate stochasticity very easily, which is a typical and inherent property in most biological processes. In molecular evolution, computer simulations have been commonly used to understand the interactions among evolutionary processes, for hypothesis testing, to evaluate and compare different analytical methods, or to estimate evolutionary parameters. Indeed, in order to obtain meaningful biological inferences from simulated data, it is important that the generating models are as realistic as possible. In this chapter, we describe simulation algorithms for the evolution of coding sequences under different evolutionary scenarios, potential uses of these simulations, and current simulation software.
9.2
Simulation of coding sequences
There are different simulation strategies that can be more or less appropriate depending on the question of interest or the purpose of the study. Below the species level, samples of codon sequences can be taken from whole populations simulated forward in time, generation by generation, starting 126
from the sequence of their most recent common ancestor (MRCA). On the other hand, a sample of sequences can be simulated if their genealogy (below the species level) or phylogeny (above the species level) is known. Coalescent theory provides a versatile framework to simulate the genealogy of a sample backward in time. Phylogenies can be given from previous studies, estimated from data, or simulated itself under branching processes like the coalescent or the birth–death process. It is also possible to simulate gene trees within population or species trees. In general, ‘forward’ simulations are extremely flexible, allowing for the simulation of almost any population genetic model. On the other hand, ‘backward’ approaches are computationally much more efficient, especially for short sequences, but have more limitations especially regarding the simulation of selective scenarios and large recombination rates. Both approaches have different advantages and disadvantages (see below) and, in general, they can be considered to complement each other.
9.2.1 Forward simulations Forward-time simulations describe the evolutionary history of a whole population from the past to the present. Traditionally they were constrained by the available computational power, but in recent years there have been important advances in this regard, such as the implementation of more computationally efficient algorithms (e.g. Peng and Kimmel, 2005; Padhukasahasram et al., 2008). Forward simulations often start from an initial population, which constitutes the basis for the descendant generation. Often, generations are considered discrete
OUP CORRECTED PROOF – FINAL, 30/1/2012, SPi
SIMULATION OF CODING SEQUENCES
and non-overlapping. After introducing mutations at random, individuals contributing to the next generation are randomly chosen, according to their relative fitness. For sexual diploid populations, recombining gametes are produced and mating pairs are identified in order to generate descendants. In addition, migration, demographic, and speciation events or population history can be considered as well. The process continues for a given number of generations and stops at some arbitrary point from which samples are taken. Currently, we are only aware of two programs that allow for the simulation of coding sequences forward in time: GenomePop (Carvajal-Rodriguez, 2008) and SFS_CODE (Hernandez, 2008).
9.2.2 Simulations of coalescent histories Coalescent theory (Kingman, 1982) describes the probability of different genealogical histories of a set of genes sampled from a theoretical population that follows Wright–Fisher neutral model (finite constant population size, non-overlapping generations, random mating). Hudson (1990) described a simple algorithm for the fast simulations of genealogies, implemented in the popular program ms (Hudson, 2002). The algorithm starts from the sample nodes and goes back in time introducing coalescent events. The waiting time to the next coalescent event follows an exponential distribution that depends on the effective population size and the number of active lineages. At each coalescent event, two random lineages are chosen to produce an ancestral lineage. The last coalescent event defines the MRCA of the sample. Several extensions to the standard coalescent have been developed in recent years, considering recombination (Box 9.1), selection, gene flow, or demographic history (e.g. Fearnhead, 2006). At least three different coalescent-based programs can simulate coding sequences: CodonRecSim (Anisimova et al., 2003), Recodon (Arenas and Posada, 2007), and NetRecodon (Arenas and Posada, 2010a). Also, it is possible to constrain the simulation of genealogies upon a given population or species’ history, and several programs implement this possibility (e.g. Liu and Yu, 2010; Maddison and Maddison, 2010).
127
9.2.3 Simulation of codon substitutions Codon substitutions are usually assumed to be independent among different sites. In forward approaches, coding regions can be easily simulated, defining an open reading frame and assuming given genetic code. In this case, particular mutations are introduced every generation, resulting in synonymous and nonsynonymous changes. When the simulation occurs along lineages, i.e. a coalescent genealogy or a phylogenetic tree, continuoustime Markov models of codon substitution are used instead (see Chapter 1), defined by a 61 × 61 matrix Q of instantaneous substitution rates (stop codons are ignored). These codon models define the probability of changing from codon i to codon j along a branch of length t, where t is the product of time and substitution rate. The most popular method to simulate coding sequences using Markov codon models consists of evolving each codon in turn over a rooted tree. For a given site, the codon at the root is usually specified according to the equilibrium frequencies of the chosen model (or directly specified by the user). This codon is evolved towards the tip of the tree, one branch at a time, according to the transition probabilities (P(t) = e Qt ). This strategy is very flexible and allows for heterogeneous processes operating at different sites or in different parts of the tree. This is the method implemented in programs like EVOLVER in the PAML package (Yang, 1997), SeqGen (Rambaut and Grassly, 1997), and Recodon and NetRecodon. Another possibility is to use the exponentially distributed waiting times from the Markov chain (Yang, 2006). Starting from the root, the waiting time to the next substitution event can be calculated according to an exponential distribution whose mean is the inverse of the instantaneous rate matrix for the given site. If the waiting time is longer than the length of the branch, there will be no substitution. Otherwise, sampling from to the jump process of the continuous Markov chain (also called the embedded Markov chain), a transition from state i to state j will occur with probability q ij /q i (Yang, 2006) (where q i is the substitution rate for nucleotide i), and a new substitution will be attempted for the remaining branch length. This approach has several advantages, as it can be
OUP CORRECTED PROOF – FINAL, 30/1/2012, SPi
128
SIMULATION OF CODING SEQUENCE EVOLUTION
Box 9.1 The coalescent with intracodon recombination
TTC
ATC
TTA
Present
(7) (8)
TTG
TTT
CTT
(2)
(6) (5) (3)
TTG
ATT (4)
(1)
ATG
Past
GMRCA
Figure 9.A.1 An example of codon evolution along an ancestral codon graph with an intracodon recombination event. White and grey circles correspond to coalescence and parental nodes, respectively. Starting from the GMRCA, the codon is evolved between nodes by a recursive function, with substitutions introduced according to the probabilities specified by the codon model and the branch length. The recursion is interrupted if a recombinant node is reached and its corresponding parental node has not been assigned a codon yet; in this case the recursion continues in other direction (note the ‘stop’ in step 3 and continuation in step 4). Only when the two parental nodes have already been assigned a codon, the algorithm combines their codons according to the recombination breakpoint (step 6). The resulting recombinant codon (TTT) continues evolving towards the present (step 7).
applied to the whole sequence, it does not require the calculation of the P matrix and allows for site and context-dependent simulation of insertions and deletions (Fletcher and Yang, 2009). Conveniently, because the embedded Markov chain provides the conditional probabilities of transitioning from state i into state j in one step or generation, the approach suggested by Yang can be used to generate data using codon models in forward simulations (Carvajal-Rodriguez, 2008).
9.3
Uses of simulated coding data
Truth in molecular evolution is usually unknown and simulated data are needed to test the
Simulation of coding sequences with recombination can be specially useful to mimic the evolution of fast-evolving pathogens, like viruses and bacteria, where the recombination rates can be very high. In the past, the coalescent simulation of recombining sequences only allowed for recombination breakpoints between the substitution units, so in the case of codon models, recombination was forced to occur between codons (Anisimova et al., 2003). This problem was solved by Arenas and Posada (2010a) who recently proposed a new coalescent algorithm that allows for intracodon recombination, implemented in the program NetRecodon (http://darwin.uvigo.es/, software section). In this case, not only the codons that reach the sample (sampled ancestral material), but also the codons that contain sites that evolved at some point with ancestral sites (unsampled ancestral material), need to be tracked. In this way, an ancestral codon graph (ACG), containing sampled and unsampled ancestral material, represents the evolutionary history at each codon. Second, for the simulation along the ACGs, a recursion was proposed to solve the loops introduced by intracodon recombination, which basically waits for pairs of parental nodes to be occupied before proceeding further towards the present (Figure 9.A.1).
accuracy and robustness of the different methods used to make evolutionary inferences from codon data. Indeed, simulations have been used to evaluate different estimators of the nonsynonymous/synonymous rate ratio (˘), a central parameter of codon models (e.g. Kosakovsky Pond and Frost, 2005; Nickel et al., 2008), or its relationship with the selection coefficient at the intra-specific level (Kryazhimskiy and Plotkin, 2008). Most estimators of ˘ do not consider the potentially misleading effect of recombination (Wilson and McVean, 2006), and simulations have been very useful to assess this effect and its causes. In general, different studies have shown that recombination does not have an effect on the estimation of the global ˘
Phylo Phylo
Phylo Forw Forw Coal
EvolSimulator SIMGRAM /SIMGENOME ALF GenomePop SFS_CODE CodonRecSim GY94
Codon position Codon position GY94 and MG94
Codon position E. coli spectra GY94
G+I
No No Yes2 No
No No
No No G+I
No No No1
No
Rate variation among codons
Sites
Sites No No Sites
Sites /Branches No No Sites /Branches No No Sites /Branches No No
˘ variation
Software for the simulation of coding sequences
No
Yes No Yes No
No Yes
Yes No No
No Yes Yes
No
Indels
R, D, P, M, L
R R, S, D, P, M R, S, D, P, M, L R3
R, S, P, M S
R – R
R – –
–
Evolutionary scenarios
Linux Source code All Source code + Win All
Source code Source code
Source code All All
All All All
All
OS
‘Category’ includes forward (Forw), phylogenetic (Phylo), and coalescent (Coal) approaches. ‘Substitution model’ indicates how substitution events are modelled: ‘Codon position’ indicates that a different substitution rate is specified for each codon position; ‘Nucleotide’ indicates that mutations are filtered afterwards according to predefined fitness effects, ‘ECM’ means empirical codon model; and ‘"CodonPAM’ means codon PAM distance. ‘Rate variation among codons’ indicates whether different codons are allowed to evolve under different synonymous rates (G: gamma distribution; I: proportion of invariable sites). ‘˘ variation’ specifies whether ˘ can change across codons and/or branches. ‘Indels’ indicates whether the program considers insertion and deletion events. ‘Evolutionary scenarios’ indicates the possibility of simulation under different evolutionary scenarios that can include recombination ‘R’ (considering different strategies), selection pressures ‘S’, demographics ‘D’, population history ‘P’, migration ‘M’, and longitudinal sampling ‘L’. ‘OS’ indicates the availability of executable files and/or source code for different operative systems (‘All’ indicates that Macintosh, Windows, and Linux executables are available). 1 G + I could be available in a newer version (current is 1.03; personal communication). 2 If codons are treated as independent loci. 3 Only intercodon recombination.
Coal
ECM and CodonPAM MG94 Nucleotide GY94
Phylo Phylo Phylo
indel-Seq-Gen vs 2.0 SISSI HyPhy
Recodon /NetRecodon
Nucleotide ECM
Phylo Phylo Phylo
Seq-Gen EvolveAGene3 INDELible
GY94
Phylo
EVOLVER
Substitution model
Category
Program
Table 9.1
OUP CORRECTED PROOF – FINAL, 30/1/2012, SPi
OUP CORRECTED PROOF – FINAL, 30/1/2012, SPi
130
SIMULATION OF CODING SEQUENCE EVOLUTION
for a given alignment (e.g. Anisimova et al., 2003; Kosakovsky Pond et al., 2008; but see Shriner et al., 2003). Our own simulations indicate that this is also the case when ˘ is constant along the alignment but varies among lineages (unpublished data). Importantly, recombination can also result in the generation of false positively selected sites and/or spurious inference of selection when likelihood-ratio tests (LRTs) are used (Anisimova et al., 2003; Shriner et al., 2003; Arenas and Posada, 2010a). On the other hand, recombination can seriously bias the reconstruction of ancestral sequences, in particular when dealing with coding sequences (Arenas and Posada, 2010b).
9.4
Software implementations
Different computer programs have been developed for the simulation of coding sequences. Tables 9.1 and 9.2 show the most relevant programs and their main characteristics and possibilities, like their general approach (e.g. backward), substitution model, possibility of synonymous and nonsynonymous rate heterogeneity, ability to simulate indels, possible evolutionary scenarios, and operative system. Only a few of these programs use a forward approach. GenomePop (Carvajal-Rodriguez, 2008) implements recombination, population structure and demographics, diploid/haploid models, and fitness-based selection. In addition, it implements complex codon models, like GTR × MG94. A very friendly forward simulator is SFS_CODE (Hernandez, 2008), which implements a wide variety of mutation models, including indels and many different scenarios, such as population expansions and bottlenecks, multiple populations, distribution of selective effects, and consideration of recombination hotspots for crossing-over and gene conversion events. On the other hand, most codon simulators evolve the sequences along phylogenetic trees. EVOLVER from the PAML package (Yang, 1997) was one of the fist programs capable of generating coding data. More recently, INDELible (Fletcher and Yang, 2009) implements complex substitution models that include indels, and ˘ rate variation among sites and/or branches. The program indel-Seq-Gen 2.0 (Strope et al., 2009), an extension of Seq-Gen (Ram-
Table 9.2
URLs for the different coding-sequence simulators
Program
URL
EVOLVER
http://abacus.gene.ucl.ac.uk/software/ paml.html http://tree.bio.ed.ac.uk/software/ seqgen/ http://web.me.com/barryghall/ Software/Software.html http://abacus.gene.ucl.ac.uk/software/ indelible/ http://rec.vga.edu/software/upp/isg/ http://www.cibiv.at/software/sissi/ http://www.hyphy.org/current/index.php www.bioinformatics.org.au/evolsim http://biowiki.org/SimulationTools
Seq-Gen EvolveAGene3 INDELible indel-Seq-Gen vs 2.0 SISSI HyPhy EvolSimulator SIMGRAM /SIMGENOME ALF GenomePop SFS_CODE
CodonRecSim
Recodon/ NetRecodon
http://www.cbrg.ethz.ch/alf http://webs.uvigo.es/acraaj/ GenomePop.htm http://sfscode.sourceforge. net/SFS_CODE/SFS _CODE_home/SFS_CODE_home.html http://fisher.berkeley.edu/cteg/software. html# codonrecsim/http://people.binf.ku.dk/ rasmus/webpage/CodonRecSim.html http://darwin.uvigo.es/
All websites last accessed September 2011.
baut and Grassly, 1997), allows simulation under a wide variety of indel models and includes other options like pseudogene simulation or motif conservation. EvolveAGene (Hall, 2005; 2008) evolves a real coding sequence along a given tree using a mutation spectrum experimentally determined from Escherichia coli, and including indels. Other tools like SISSI (Gesell and Von Haeseler, 2006) implement the simulation of codons with structural dependency among sites. The program HyPhy (Kosakovsky Pond et al., 2005) implements a wide range of codon models that can be suited to particular user requirements using its flexible scripting language. Among the coalescent-based programs CodonRecSim (Anisimova et al., 2003) and Recodon can simulate coding sequences with recombination but forcing breakpoints to occur between codons. This stringent assumption was avoided in NetRe-
OUP CORRECTED PROOF – FINAL, 30/1/2012, SPi
REFERENCES
codon, where coding sequences are evolved on the ancestral recombination graph using a more sophisticated algorithm. At the genomic scale, EvolSimulator (Beiko and Charlebois, 2007) simulates the evolution of gene families under non-stationary models, including lateral transfer and gene gains and losses. The suite GSIMULATOR/SIMGRAM/SIMGENOME (Varadarajan et al., 2008) can simulate coding genomic regions using models trained in the PANDIT database (Whelan et al., 2003). Indeed, some programming libraries offer some functionalities to evolve codon sequences, like Darwin (Gonnet et al., 2000) and Bio + + (Dutheil and Boussau, 2008). Finally, A recent package for simulating genome evolution is ALF (Dalquen et al., in press), which models evolution both at gene and genome level. It can simulate coding sequences according to parametric M-series codon models (Yang et al., 2000) as well as the empirical codon models ECM (Kosiol et al., 2007) and CodonPAM (Schneider et al., 2005). In addition, it has various other options (see Table 9.1). The application is already written in Darwin.
Acknowledgements This work was funded by the Spanish Ministry of Education and Science (MEC) through an FPI fellowship BES-2005-9151 to M.A., and a research grant BIO2007-61411 to D.P. We thank Gina Cannarozzi and an anonymous reviewer for their comments.
References Anisimova, M., Nielsen, R., and Yang, Z. (2003). Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites. Genetics, 164, 1229–36. Arenas, M. and Posada, D. (2007). Recodon: coalescent simulation of coding DNA sequences with recombination, migration and demography. BMC Bioinformatics, 8, 458. Arenas, M. and Posada, D. (2010a). Coalescent simulation of intracodon recombination. Genetics, 184, 429–37. Arenas, M. and Posada, D. (2010b). The effect of recombination on the reconstruction of ancestral sequences. Genetics, 184, 1133–1139.
131
Beiko, R.G. and Charlebois, R.L. (2007). A simulation test bed for hypotheses of genome evolution. Bioinformatics, 23, 825–31. Carvajal-Rodriguez, A. (2008). GENOMEPOP: a program to simulate genomes in populations. BMC Bioinformatics, 9, 223. Dalquen, D.A., Anisimova, M., Gonnet, G.H., and Dessimoz C. (in press): ALF-A simulation framework for genome evolution, Mol Biol Evol. Dutheil, J. and Boussau, B. (2008). Non-homogeneous models of sequence evolution in the Bio + + suite of libraries and programs. BMC Evol Biol, 8, 255. Fearnhead, P. (2006). Perfect simulation from nonneutral population genetic models: variable population size and population subdivision. Genetics, 174, 1397–406. Fletcher, W. and Yang, Z. (2009). INDELible: a flexible simulator of biological sequence evolution. Mol Biol Evol, 26, 1879–88. Gesell, T. and Von Haeseler, A. (2006). In silico sequence evolution with site-specific interactions along phylogenetic trees. Bioinformatics, 22, 716–22. Gonnet, G.H., Hallet, M.T., Korostensky, C., and Bernardin, L. (2000). Darwin v. 2.0: an interpreted computer language for the biosciences. Bioinformatics, 16, 101–103. Hall, B.G. (2005). Comparison of the accuracies of several phylogenetic methods using protein and DNA sequences. Mol Biol Evol, 22, 792–802. Hall, B.G. (2008). Simulating DNA coding sequence evolution with EvolveAGene 3. Mol Biol Evol, 25, 688–95. Hernandez, R.D. (2008). A flexible forward simulator for populations subject to selection and demography. Bioinformatics, 24, 2786–7. Hudson, R.R. (1990). Gene genealogies and the coalescent process. Oxford Surveys in Evolutionary Biology, 7, 1–44. Hudson, R.R. (2002). Generating samples under a Wright– Fisher neutral model of genetic variation. Bioinformatics, 18, 337–338. Kingman, J.F.C. (1982). The coalescent. Stochastic Processes and their Applications, 13, 235–248. Kosakovsky Pond, S.L. and Frost, S.D. (2005). Not so different after all: a comparison of methods for detecting amino acid sites under selection. Mol Biol Evol, 22, 1208–22. Kosakovsky Pond, S.L., Frost, S.D., and Muse, S.V. (2005). HyPhy: Hypothesis testing using phylogenies. Bioinformatics, 21, 676–679. Kosakovsky Pond, S.L., Poon, A.F.Y., Zárate, S., Smith, D.M., Little, S.J., Pillai, S.K., et al., (2008). Estimating
OUP CORRECTED PROOF – FINAL, 30/1/2012, SPi
132
SIMULATION OF CODING SEQUENCE EVOLUTION
selection pressures on HIV-1 using phylogenetic likelihood models. Statistics in Medicine, 27, 4779–4789. Kosiol, C., Holmes, I., and Goldman, N. (2007). An empirical codon model for protein sequence evolution. Mol Biol Evol, 24, 1464–79. Kryazhimskiy, S. and Plotkin, J.B. (2008). The population genetics of dN/dS. PLoS Genet, 4, e1000304. Liu, L. and Yu, L. (2010). Phybase: an R package for species tree analysis. Bioinformatics, 26, 962–3. Maddison, W.P. and Maddison, D.R. (2010). Mesquite: a modular system for evolutionary analysis, Version 2.74, http://mesquiteproject.org, last accessed September 2011. Nickel, G.C., Tefft, D.L., Goglin, K., and Adams, M.D. (2008). An empirical test for branch-specific positive selection. Genetics, 179, 2183–93. Padhukasahasram, B., Marjoram, P., Wall, J.D., Bustamante, C.D., and Nordborg, M. (2008). Exploring population genetic models with recombination using efficient forward-time simulations. Genetics, 178, 2417–27. Peck, S.L. (2004). Simulation as experiment: a philosophical reassessment for biological modeling. Trends Ecol Evol, 19, 530–4. Peng, B. and Kimmel, M. (2005). simuPOP: a forward-time population genetics simulation environment. Bioinformatics, 21, 3686–7. Rambaut, A. and Grassly, N.C. (1997). Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosciences, 13, 235–238. Schneider, A., Cannarozzi, G.M., and Gonnet, G.H. (2005). Empirical codon substitution matrix. BMC Bioinformatics, 6, 134.
Shriner, D., Nickle, D.C., Jensen, M.A., and Mullins, J.I. (2003). Potential impact of recombination on sitewise approaches for detecting positive natural selection. Genetical Research, 81, 115–121. Strope, C.L., Abel, K., Scott, S.D., and Moriyama, E.N. (2009). Biological sequence simulation for testing complex evolutionary hypotheses: indel-Seq-Gen version 2.0. Mol Biol Evol, 26, 2581–93. Varadarajan, A., Bradley, R.K., and Holmes, I.H. (2008). Tools for simulating evolution of aligned genomic regions with integrated parameter estimation. Genome Biol, 9, R147. Whelan, S., De Bakker, P.I. and Goldman, N. (2003). Pandit: a database of protein and associated nucleotide domains with inferred trees. Bioinformatics, 19, 1556–63. Wilson, D.J. and McVean, G. (2006). Estimating diversifying selection and functional constraint in the presence of recombination. Genetics, 172, 1411–25. Winsberg, E. (2003). Simulated experiments: methodology for a virtual world. Philosophy of Science, 70, 105–125. Yang, Z. (1997). PAML: a program package for phylogenetic analysis by maximum likelihood. Computer Applications in the Biosciences, 13, 555–556. Yang, Z. (2006). Computational molecular evolution. Oxford University Press. Yang, Z., Nielsen, R., Goldman, N. and Pedersen, A.-M.K. (2000). Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics, 155, 431–449.