Genetics and population analysis

PopABC: a program to infer historical demographic parameters Joao S. Lopes1,*, David Balding2, and Mark A. Beaumont1 1

School of Biological Sciences, University of Reading, Whiteknights, PO Box 228, Reading RG6 6AJ, UK Department of Epidemiology and Public Health, Imperial College, St Mary’s Campus, Norfolk Place, London W2 1PG, UK

2

Associate Editor: Dr. Jeffrey Barrett

ABSTRACT Summary: PopABC is a computer package for inferring the pattern of demographic divergence of closely related populations and species. The software performs coalescent simulation in the framework of approximate Bayesian computation (ABC). PopABC can also be used to perform Bayesian model choice to discriminate between different demographic scenarios. The program can be used either for research or for education and teaching purposes. Availability and Implementation: Source code and binaries are freely available at http://www.reading.ac.uk/~sar05sal/software.htm. The program was implemented in C and can run on UNIX, MacOSX and Windows operating systems. Contact: [email protected] Supplementary information: Software and supplementary data are available in http://www.reading.ac.uk/~sar05sal/software.htm.

1

INTRODUCTION

With the development of new techniques in molecular biology it is possible to scan the entire genomes of multiple individuals within a population in a dense map of SNP markers, microsatellite markers or DNA sequence data. These data can then be used to infer aspects of the history of a population, such as population sizes and migration rates, recombination rates, and selection coefficients. The Bayesian statistical paradigm offers an efficient framework for such inferences, because it allows maximal extraction of information from data under the specified model, and background information can be incorporated via prior distributions. The spread of the use of MCMC (Markov Chain Monte Carlo) methods in the early 1990s made possible accurate statistical inferences from molecular genetic data (Wilson and Balding, 1998). Further increases in the volume of data as well as the ambitions of researchers have quickly exceeded the capabilities of these computer-intensive methods. In the past few years, some developments in computational statistics have arisen that pushed back the boundaries of the models that can be analysed at the cost of some approximation (Pritchard et al, 1999; Marjoram and Tavaré, 2006). These developments typically consist of the characterization of the data by summary statistics and the use of Monte Carlo simulations that avoid the need for an explicit likelihood function. Bayesian versions of these computational methods have come to be known as ABC (Approximate Bayesian Computation: Beaumont et al, 2002; Marjoram et al, 2003; Sisson et al, 2007). These studies suggest that ABC can compare well with methods based on fully specified likelihood functions We present a computer package that enables the user to perform ABCbased inference on problems in population genetics that involve multiple populations, with or without migration events. The model allows multilocus microsatellites or sequence data to be used either separately or

jointly. Recombination events can also be incorporated. Inferences regarding the choice between different branching histories (topologies) of population models can be performed automatically for up to 4 populations. The developed software is especially aimed for phylogeographic problems with well defined panmictic populations. The species can have any ploidy, sexual or asexual reproduction, short or long generation times. The package includes a coalescent simulator, based on routines described in Beaumont (2008). We have tested these routines against theoretical expectations from a coalescent model, and by performing comparisons with established coalescent simulators (ms, Hudson, 2002; simcoal, Laval and Excoffier, 2004 [using rejector, Jobin and Mountain, 2008]) (See Supplementary Information). ABC techniques are becoming more widely used in population genetics and molecular biology, and some general packages are currently available (Hickerson, 2007; Cornuet et al, 2008). Indeed the ABC approach is easily ‘pipelined’, as discussed in section 2, and a number of simulation packages have been expanded to allow them to be used for inference (e.g. ms, Hudson, 2002; Serial Simcoal, Anderson et al, 2005 [based on Simcoal, Laval and Excoffier, 2004]).The ‘pipelining’ approach is extremely flexible, and the preferred option for more computationally-adept users. However there is a strong case for friendlier packages, which has motivated the development of the DIY-ABC package by Cornuet et al (2008). PopABC complements DIY-ABC in a number of ways. It has the option to use a menubased console application; the possibility to include recombination events either in linked microsatellites or in sequence data; the option to add migration events to the population models.

2

MODELS

2.1

Statistical model

Although there are a number of different flavours of ABC (Sisson et al, 2007; Blum and Francois, 2008), PopABC follows the standard rejection/regression approach, which has been frequently tested against fulllikelihood alternatives (Tallmon et al, 2004; Excoffier et al, 2005; Hickerson et al, 2006; Sousa et al 2009). The base algorithm (Pritchard et al, 1999) can be summarized as: (1)

Sample parameters, Φ, from the priors: Φi ~ p(Φ);

(2)

Simulate data, D, given Φi: Di ~ p(D | Φi);

(3)

Summarize Di with a set of chosen summary statistics to obtain Si; go to (1) until N sample points from the joint distribution p(S,Φ) have been created;

(4)

Accept the points whose Si is within a distance λ from s’, the real data summarized by the same set of summary statistics, |Si – s’| < λ;

To whom correspondence should be addressed.

*

© 2009 The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

1

J.Lopes et al.

PopABC implements these four base steps of the ABC algorithm. In addition the output files have been designed so that other subsequent steps such as the use of a weighting scheme and a local linear regression to correct the simulated points (Beaumont et al, 2002) can be later applied by the use of simple R scripts (available in Beaumont’s webpage http://www.rubic.rdg.ac.uk/~mab/). Although this latter step still involves an element of pipelining it has the advantage of allowing the user to apply more recently developed approaches such as the use of Sequential Monte Carlo sampling (Sisson et al, 2007; Beaumont et al, 2009) and non-linear regression (Blum and Francois, 2008).

2.2

Population models

PopABC is based on the Isolation-with-Migration model (Nielsen and Wakeley, 2001; Hey and Nielsen, 2004), and encompasses population vicariance without migration, and also an equilibrium migration model with some choice of migration matrix. In principle any number of populations, which may comprise related species, can be analysed, but in practice this is limited by the number of summary statistics that need to be calculated.

2.3

Evolutionary models

Two different mutation models can be used according to the DNA data type considered: The Infinite Sites model (Kimura, 1969) for DNA sequences; and the Stepwise Mutation Model (Kimura and Ohta, 1978) for microsatellites. The microsatellite loci can be assumed to be either segregating independently or linked. A recombination rate can be assumed between the linked microsatellites. The DNA sequence loci are assumed to be segregating independently, but recombination events within a locus can be modelled. Mitochondrial DNA, nuclear DNA, Y- or X-linked data, can be used separately or jointly.

2.4

Parameters

The program allows for the estimation of demographic parameters (effective population size; time of splitting events between sister populations; migration rates and topology of populations trees) and genetic parameters (mutation and recombination rates).

2.5

Model-choice

It is possible to estimate the Bayesian posterior probability for different models (i.e. models with or without migration; models with or without recombination and between topologies of population models). Note that Bayesian model choice can be highly dependent on the priors chosen for the parameters in the respective models.

2.6

Summary Statistics

We chose summary statistics that have had extensive use in population genetic inference. For microsatellite data: the heterozygosity; variance and kurtosis of allele length; the number of different alleles; an index of gene diversity; and an FST-based estimator of the number of migrants. For DNA sequences: the mean pairwise difference; the number of segregating sites; the number and frequency of private segregating sites; the number of different haplotypes; an index of haplotype diversity; the mean and standard deviation of the mutation frequency spectrum; and an FST-based estimator of the number of migrants. Summary statistics are computed for each population, and then for all pairs of populations combined as described in Beaumont (2008).

3

ASSUMPTIONS

If assuming “Isolation with Migration” the following should be verified (Nielsen and Wakeley, 2001): there should not be other populations that are more closely related to the sampled populations than they are to each other; and there should not be unsampled populations exchanging genes with the studied populations or their own ancestors.

2

Other assumptions come from the Coalescent Method employed: the variation within the DNA data has to be neutral; free recombination between DNA sequence loci is assumed; and mutations follow the mutation model considered.

ACKNOWLEDGEMENTS We would like to thank E. Bazin, L. Chikhi, J.-M. Cornuet, K. Dawson, V. Sousa, and D. Welch for helpful advice and discussions. We acknowledge three anonymous reviewers and editor for very useful comments which highly contributed to raise the quality of the software. Funding for development of this program and website was received from Engineering and Physical Sciences Research Council and Fundacao Ciencia e Tecnologia.

REFERENCES Anderson, C.N.K., et al. (2005) Serial SimCoal: A population genetics model for data from multiple populations and points in time. Oxford Univ. Press, 1733–1734. Beaumont, M.A., et al. (2002) Approximate Bayesian computation in population genetics, Genetics, 162, 2025–2035. Beaumont, M.A., (2008) Joint determination of topology, divergence time, and immigration in population trees. In Matsura, S., Forster, P. and Renfrew, C. (eds), Simulations, Genetics and Human Prehistory. McDonald Institute for Archaeological Research, Cambridge, 135-154. Beaumont, M.A., et al. (2009) Adaptive approximate Bayesian computation, Biometrika. Blum, M.G.B. and Francois, O. (2008) Highly tolerant likelihood-free Bayesian inference: An adaptive non-linear heteroscedastic model, Stat. Comput. Cornuet, J.M. et al. (2008) Inferring population history with DIYABC: a user-friendly approach to Approximate Bayesian Computation, Bioinformatics 24, 2713. Excoffier, L. et al (2005) Bayesian analysis of an admixture model with mutations and arbitrarily linked markers, Genetics, 169, 1727-1738. Hey, J. and Nielsen, R. (2004) Multilocus methods for estimating population sizes, migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis, Genetics, 167, 747-760. Hickerson, M.J. et al. (2006) Test for simultaneous divergence using approximate Bayesian computation, Evolution, 60, 2435–2453. Hickerson, M.J. et al (2007) msBayes: Pipeline for testing comparative phylogeographic histories using hierarchical approximate Bayesian computation, BMC Bioinformatics, 8, 268. Hudson, R.R. (2002) Generating samples under a Wright-Fisher neutral model of genetic variation, Bioinformatics, 18, 337–338. Kimura, M. (1969) The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations, Genetics, 61, 893–903. Kimura, M. and Ohta, T. (1978) Stepwise mutation model and distribution of allelic frequencies in a finite population, Proc. Natl. Acad. Sci. U. S. A., 75, 2868–2872. Jobin, M. J., Mountain, J. L. (2008) REJECTOR: software for population history inference from genetic data via a rejection algorithm. Bioinformatics 24, 2936. Laval, G. and Excoffier, L. (2004) SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history. Oxford Univ. Press, 2485–2487. Marjoram, P., et al (2003) Markov chain Monte Carlo without likelihoods, Proc. Natl. Acad. Sci., 100, 15324-15328. Marjoram, P. and Tavare, S. (2006) Modern computational approches for analysing molecular genetic variation data, Nat. Rev. Genet., 7, 759-770. Nielsen, R. and Wakeley, J. (2001) Distinguishing migration from isolation: a Markov chain Monte Carlo approach, Genetics, 158, 885–896. Pritchard, J.K., et al. (1999) Population growth of human Y chromosomes: a study of Y chromosome microsatellites, Mol. Biol. Evol., 16, 1791–1798. Sisson, S.A., et al. (2007) Sequential Monte Carlo without likelihoods, Proc. Nat. Acad. Sci. U. S. A., 104, 1760. Sousa, V.C., et al (2009) Approximate Bayesian computation without summary statistics: the case of admixture, Genetics. Tallmon, D.A., et al (2004) Comparative evaluation of a new effective population size estimator based on approximate Bayesian computation, Genetics, 167, 977-988. Wilson, I.J. and Balding, D.J. (1998) Genealogical inference from microsatellite data, Genetics, 150, 499–510.