De Novo Variant Caller using Google Genomics API Subhodeep Moitra Carnegie Mellon University [email protected] September 5, 2014 Abstract This tool identifies de novo genetic mutations in a family trio (child, mother and father). De novo genetic mutations are defined as those mutations which occur in the child but do not occur in the genomes of the parents. This tool queries the Google Genomics API in order to search and retrieve genetic variants and reads. It implements a Bayesian inference algorithm for identifying de novo mutations from the retrieved data. This document describes theory, design and early experiments for the de novo caller.

1

Background

De novo genetic mutations are a set of rare genetic mutations that occur in the genomes of a child but do not occur in the genome of the child’s parents. These mutations are being increasingly implicated to play an important role in a variety of diseases such as Autism [2] and Schizophrenia. With the advent of Whole Genome Sequencing, a large number of these mutations are being discovered at both the exonic and the intronic regions. Detecting de novo mutations is a challenging task. Some of the issues that affect de novo mutation detection are : • Rare event: A de novo mutation is a very rare event (∼1 in 108 bp). • Sequencing errors: Sequencing errors occur at the rate of 1 in 100 bp. This makes it much more likely to observe a sequencing error rather than a de novo mutation. • Read Depth: Since the reads mapped to a particular position are essentially random, it is possible that the denovo mutation is completely skipped by inadequate coverage at that position. • Misalignment: Variant detection pipelines assume a high quality alignment of reads. In areas of poor alignment, variant detection suffers. • Expensive Validation: Candidates have to be verified through expensive Sanger sequencing. • Structural Variation: Complications arise through structural variations in the chromosome through insertions, deletions, inversions, copy number variations, translocations, etc. The Google Genomics API [5] provides services to import and query variants and reads data. This tool assumes that variants and reads are available for the trio under study and have been imported into the Genomics API. The tool provides two callers (1) Variant Caller (2) Read Caller. The variant caller walks over the trio of genomes and examines SNPs avaiable through the variants API. After filtering for variants based on quality cutoffs and mendelian inheritance rules, a set of candidate calls are generated and passed onto the Read Caller for finer discrimination. The Read Caller makes requests to the reads API to fetch mapped reads for these candidate position and makes a finer call using a bayesian inference algorithm. Validation experiments against NA12878 trio from Platinum genomes [6] were also run and compared against experimentally validated de novo and somatic mutations [4] ; results are reported in section 3. 1

GF

GM

GC DF

DM

DC Figure 1: The hidden variables GF , GM and GC correspond to the genotypes of the father, mother and child respectively. The genotypes take one among 10 values from AA, AC, AT, AG, CC, CT, CG, GG, GT and T T . The trio of genotypes values can thus have 103 possible values. The observed variables (DF , DM and DC ) are the bases present at the candidate position for a particular read

1.1

Bayes Net

We employ a probabilistic inference framework using bayes nets for de novo variant calling. A bayes net belongs to the class of mathematical models known as graphical models which are a marriage between graph theory and probability theory [3]. A Bayes Net is a directed acyclic graph which defines conditional independence relationships between random variables in a probability distribution. A Bayes net can be used for sampling, marginal inference and decoding of the hidden variables. We follow the bayes net implementation for de novo variant calling as in Li et al [1]. They model a bayes net to encode the diploid Mendelian relationship between the members of the trio. See Figure 1 for an illustration of the bayes net. The hidden variables GF , GM and GC correspond to the genotypes of the father, mother and child respectively. The genotypes take one among 10 values from AA, AC, AT, AG, CC, CT, CG, GG, GT and T T . The trio of genotypes values can thus have 103 possible values. The observed variables (DF , DM and DC ) are the reads obtained at a candidate position. We fix the baseline sequencing error rate seq and the baseline de novo mutation rate denovo . In equation 1, we apply Bayes rule to calculate the probability of a particular genotype trio given observed reads. P (DF , DM , DC |GF , GM , GC )P (GF , GM , GC ) GF ,GM ,GC P (DF , DM , DC |GF , GM , GC )P (GF , GM , GC )

P (GF , GM , GC |DF , DM , DC ) = P

(1)

The term P (DF , DM , DC |GF , GM , GC ) is the data likelihood and is calculated as in equation 2. P (DF , DM , DC |GF , GM , GC ) = P (DF |GF )P (DM |GM )P (DC |GM )     |DF | |DM | |DC | Y Y Y P (Rf |GF )  P (Rm |GM )  P (Rc |GC ) = m=1

f =1

(2) (3)

c=1

The term P (GF , GM , GC ) encodes the mendelian inheritance relationship in the trio and can be factorized further according to equation 4. Furthermore, this likelihood can be expressed in terms of seq and denovo as in equation 5 where Ndenovo corresponds to the number of GC cases that are de novo given the parents ; similarly Nmendelian corresponds to the number of GC cases that are mendelian given the parents.

2

P (GF , GM , GC ) = P (GC |GF , GM )P (GF )P (GM )  P (GC |GF , GM ) =

denovo /Ndenovo (1 − denovo )/Nmendelian

if GC is a denovo mutation otherwise

(4) (5)

The base likelihood P (R|G) conditioned on the genotype of the sample is calculated according to equation 6.  1 − seq if G is homozygous and R ∈ G    seq /3 if G is homozygous and R ∈ /G P (R|G) = (6) 0.5(1 − 2 /3) if G is heterozygous and R ∈G  seq   seq /3 if G is heterozygous and R ∈ /G

1.2

Bayesian Inference Strategies

In the previous section we defined the bayes net model. In this section we will discuss the inference strategies for determining if a candidate position is a de novo variant. We define the function Idenovo (GF , GM , GC ) as an indicator function when the child genotype is a de novo mutation. We also calculate the likelihoods of all the de novo and the mendelian inheritance cases as described in equation 7.

Ldenovo =

X

P (DF , DM , DC |GF , GM , GC )P (GF , GM , GC )Idenovo (GF , GM , GC )

(7)

P (DF , DM , DC |GF , GM , GC )P (GF , GM , GC )Imendelian (GF , GM , GC )

(8)

GF ,GM ,GC

Lmendelian =

X GF ,GM ,GC

1.2.1

Maximum A Posteriori (MAP)

Call a de novo variant if condition 9 holds. Idenovo ( arg max P (GF , GM , GC |DF , DM , DC ))

(9)

GF ,GM ,GC

1.2.2

Bayesian Risk Minimizer (Bayes)

Call a de novo variant if condition 10 holds. This inference rule minimizes the empirical bayesian risk. Ldenovo > 0.5 Ldenovo + Lmendelian 1.2.3

(10)

Likelihood Ratio Test (LRT)

Call a de novo variant if condition 11 holds where T is a user specified likelihood ratio threshold. Note that LRT is the same as Bayes if T is equal to 1. Ldenovo >T Lmendelian

3

(11)

Variant Caller Variants .VCF files

import

Varstore

API Calls

Buffering

Variant Stream

Variant Buffer

Mendel Filter

Filtered Candidates

Read Caller Reads .BAM files

import

Readstore

API Calls

Read Stream

Bayesian Inference MAP Bayes LRT

Final Candidates Figure 2: De novo variant caller design : There are two main components (1) Variant Caller interacting with Varstore (2) Read Caller interacting with Readstore. The Variant caller makes API calls to Varstore using a Variant Stream object that interfaces with a variants buffer. The buffer manages overlapping variants from the trio. Once a set of suitable variants have been obtained it is filtered using mendelian inheritance checks. These candidates are then passed onto the the Read Caller which obtains reads for the candidate position from readstore. The candidates are filtered using one of three Bayesian inference procedures (1) MAP Maximum a posteriori (2) Bayes Rule (3) LRT - Likelihood Ratio test. Successful candidates are written onto disk.

4

2

Design and Implementation

This tool is designed as a standalone command line tool for performing inference. This tool assumes that variants and reads are available for the trio under study and have been imported into the Genomics API. There are two main components (1) Variant Caller interacting with Varstore (2) Read Caller interacting with Readstore. See Figure 2 for an illustrative diagram. The Variant caller makes API calls to Varstore using a Variant Stream object that interfaces with a variants buffer. The buffer manages overlapping variants from the trio. Once a set of suitable variants have been obtained it is filtered using mendelian inheritance checks. These candidates are then passed onto the the Read Caller which obtains reads for the candidate position from readstore. The candidates are filtered using one of three Bayesian inference procedures (1) MAP - Maximum a posteriori (2) Bayes Rule (3) LRT - Likelihood Ratio test. Successful candidates are written onto disk.

2.1

Concurrency / Multi-threading

Profiling indicated that a lot of time was spent in network I/O delays. To speed up the execution, requests to the variants API and the reads API can be parallelized using multiple threads. Performance testing showed that multi-threading achieves a linear speedup with the number of threads(Graphs not shown). For this purpose the Variant Caller and the Read Caller implements a helper class SimpleDenovoRunnable and BayesDenovoRunnable respectively. Their behavior can be controlled by the --num_threads option.

2.2

Commandline usage

This tool is predominantly meant to be used from the commandline. Typical commandline use See below for typical commandline usage : java -jar target/denovo-variant-caller-0.1.jar --caller full \ --client_secrets_filename ${HOME}/Downloads/client_secrets.json \ --dataset_id 14004469326575082626 \ --dad_callset_name NA12891 \ --mom_callset_name NA12892 \ --child_callset_name NA12878 \ --chromosome chr1 \ --start_position 1 \ --end_position 14000000 \ --log_level debug \ --num_threads 25 \ --output_file NA12878_full.calls All options See below for all available options: Usage: DenovoMain [flags...] --caller [VARIANT | READ | FULL] --child_callset_name --chromosome

: The caller mode : Child’s callset name e.g. NA12879 : specify the chromosomes to search (specify multiple times for multiple chromsomes)

5

--client_secrets_filename --dad_callset_name --dataset_id --denovo_mut_rate

: : : :

--end_position

:

--inference_method [MAP | BAYES | LRT] --input_calls_file --log_file --log_level [ERROR | INFO | DEBUG] --lrt_threshold

: : : : :

--max_api_retries --max_variant_results

: :

--mom_callset_name --num_threads

: :

--output_dir --output_file --seq_err_rate

: : :

--start_position

:

3

Path to client_secrets.json Dad’s callset name e.g. NA12877 Dataset id Specify the denovo mutation rate (default 1e-8) end position ( usually set automatically ) Inference method (map | bayes | lrt) File to read from specify the log file specify the logging level likelihood ratio test significance level (default 1. ;higher the stricter) max api retry count (default 5) max variants returned per request (default 10000) Mom’s callset name e.g. NA12878 Specify the number of threads (default 1 ; 1 to 50 suggested) File to write results File to write results Specify the sequence error rate (default 1e-2) start position ( usually 1 )

Experiments

We ran validation experiments against the NA12878 trio (NA12878 child, NA12891 father, NA12892 mother) from Platinum genomes [6] and compared them against experimentally validated de novo and somatic mutations [4]. The results are shown in Table 3. The tool has high recall ∼ 96%. The precision is less than desirable ; however it is only a lower bound since the NA12878 trio is derived from cell lines and hence has accumulated a lot of somatic cell mutations. A blood line dataset may reveal more accurate numbers. The compute time is favorable ; 3 hours to go over 3 Billion base pairs. Figure 3 shows a false de novo mutation candidate at chr10:23,662,774 in NA12878 trio. The top two tracks correspond to parents (NA12891 and NA12892) while the bottom track (NA12878) corresponds to the child genome. Father Genotype from variants API is AA, mother genotype is AA while child is AG. Variant is called by variant caller but correctly rejected by read caller due to insufficient evidence from reads. The Bayesian inference reads caller observes that there have been occurrences of some G bases in the parents which diminishes the likelihood of a denovo mutation. Figure 4 shows a true experimentally validated germline mutation at chr1:75,884343 in NA12878 trio. The top two tracks correspond to parents (NA12891 and NA12892) while the bottom track (NA12878) corresponds to the child genome. Inferred father Genotype is T T , mother genotype is T T while child is CT . De novo mutation is called correctly by both variant and read caller.

4

Future Work and Conclusion

This tool identifies de novo genetic mutations in a family trio (child, mother and father) by exercising the Google Genomics API to fetch variants and reads which is subsequently . It provides concurrency options 6

Filter Variant Variant + BayesNet

Num Calls Filter 4090 2207

Precision 22.12%* 40.55%*

Recall 96.69% 95.62%

Time ∼3 hours (20 threads) ∼30 mins(20 threads)

Throughput ∼3 mbps(20 threads) ∼ 1.5 mbps (20 threads)

Table 1: Results on NA12878 trio (NA12878 child, NA12891 father, NA12892 mother) from Platinum genomes [6] compared against experimentally validated de novo and somatic mutations [4]. The tool has high recall ∼ 96%. * The precision is inconclusive ; most likely is a lower bound since the NA12878 trio is derived from somatic cell lines. A follow up blood line experiment will likely reveal more accurate numbers. The compute time is favorable; 3 hours to go over 3 Billion base pairs.

Figure 3: Shows a false candidate for a denovo mutation at chr10:23662774 in NA12878 trio. The top two tracks correspond to parents (NA12891 and NA12892) while the bottom track corresponds to the child genome (NA12878). Father Genotype variant is AA, mother genotype is AA and child is AG. Variant is called by variant caller but correctly rejected by read caller due to insufficient evidence from reads. The bayesian inference reads caller observes that there have been occurrences of some G bases (brown lines) in the parents which diminishes the likelihood of a denovo mutation. to speed up execution. Experiments were run against the NA12878 trio which showed high recall ∼ 96% against experimentally validated germline and cell line mutations. Precision scores were inconclusive and will need follow up experiments against blood derived genomes.

References [1] Li B et. al, A likelihood-based framework for variant calling and de novo mutation detection in families, PLoS Genetics, 2012 Volume 8, Number 10, Pages e1002944 1.1 [2] Michaelson et al, Whole-Genome Sequencing in Autism Identifies Hot Spots for De Novo Germline Mutation, Cell, 2012, Volume 151, Number 7, Pages 1431 - 1442 1

7

Figure 4: Shows a true experimentally validated germline mutation at chr1:75,884343 in NA12878 trio. The top two tracks correspond to parents (NA12891 and NA12892) while the bottom track (NA12878) corresponds to the child genome. Inferred father Genotype is T T , mother genotype is T T while child is CT . Variant is called by correctly by both variant and read caller [3] Wainwright, Jordan, Graphical Models, Exponential Families, and Variational Inference, Foundations and Trends in Machine Learning, 2008, Volume 1, Issue 1-2, pp 1-305 1.1 [4] Conrad et al., Variation in genome-wide mutation rates within and between human families, Nature Genetics, July 2011 Volume 43, Number 7, pp 712-715 1, 3, 1 [5] Google Genomics API, https://developers.google.com/genomics/ 1 [6] Platinum Genomes - Illumina Sequencing, http://www.illumina.com/platinumgenomes/ 1, 3, 1

8

De Novo Variant Caller using Google Genomics API - GitHub

Sep 5, 2014 - fetch mapped reads for these candidate position and makes a finer call using a bayesian inference algorithm. Validation experiments against ...

321KB Sizes 13 Downloads 283 Views

Recommend Documents

Variant A Variant B Variant C Variant D - GitHub
2017-05-15 13:15 /home/wingel/eagle/diff-probe/diff-probe.sch (Sheet: 1/1). Variant A. Variant B. Variant C. Variant D. LMH6702. 0. Ω. /N. F. NF. 1. 5. 0. Ω. GND. GND. 2. 2. 0. Ω. N. F. CON-1X3-TAB-AMP-29173-3-NARROW. V. +. V. -. GND. V. +. V. -.

API - GitHub
Dec 5, 2014 - •http://www.thoughtworks.com/insights/blog/rest-api- ... Slim http://coenraets.org/blog/2011/12/restful-services-with-jquery-php-and- ... Page 24 ...

StackMap API Specification - GitHub
domain is the specific StackMap installation for your library. POST Data. The POST ... A node with the name of the library to search for the holding. ▫ Attributes.

Building Single Page Applications using Web API and ... - GitHub
This book is the pdf version of the online post in chsakell's Blog and ..... For our application there will be only the Admin role (employees) but we will discuss later the scalability options we have in ...... not only to be authenticated but also b

BamTools API Tutorial - GitHub
Mar 23, 2011 - https://github.com/pezmaster31/bamtools/wiki/BamTools-1x_PortingGuide.pdf ... adjust how your app locates the shared library at runtime.

NOVO CRONOGRAMA DE DATAS.pdf
Page 1 of 1. NOVO CRONOGRAMA DE DATAS.pdf. NOVO CRONOGRAMA DE DATAS.pdf. Open. Extract. Open with. Sign In. Main menu. Displaying NOVO ...

Commands for EV3-API - GitHub
char. Color of text. 1: black text. 0: wihteteext with black background. .... LED color cannot be changed while warning is set. ... (1: true, 0: false).

Sliceable Network Management API - GitHub
virtualizing all layer 2 functions the API distributes resource management such ... can be categorized as Infrastructure as a Service (IaaS) in the cloud computing.

StackMap JSON API Specification - GitHub
o The text of the call number of the holding. ▫ “library” o The text ... o An decimal, the x position of the center of the range on the map, in pixels. ▫ “y” o An decimal ...

Chatter REST API Developer Overview - GitHub
Building an application outside the Salesforce platform. • Pull feed and social graph out into another application. • Push notifications and activity into the feed.

Building a motor simulation de novo: Observation of ...
Mar 10, 2006 - Corresponding author. Fax: +1 603 646 1181. .... Powerbook G4 Laptop computer running PsyScope software. (Cohen et al., 1993). ..... Supported by the. James S. McDonnwell Foundation and Public Health Service grant.

Novo-calendario-vacinal-de-2017.pdf
1 dose aos 15 meses. ou até 4 anos, 11. meses e 29 dias. Mudou. ANTES. Page 5 of 27. Main menu. Displaying Novo-calendario-vacinal-de-2017.pdf. Page 1 ...

Maternal Famine, De Novo Mutations, and Schizophrenia
pared with the health outcomes of unexposed persons born in the same cities but in gestation before or after the fam- ine. This birth cohort experienced increased risk of con- genital anomalies of the central nervous system, including neural tube def

De novo sequencing of antimicrobial peptides isolated ...
2 Center for Experimental Bioinformatics, Department of Biochemistry and Molecular Biology, ... zoic acid (DHB) were obtained from Sigma-Aldrich (St. Louis ...

Using py-aspio - GitHub
Jan 17, 2017 - Load ASP program and input/output specifications from file ..... Programs can be loaded from files or strings: .... AI Magazine, 37(3):53–68.

novo-catalogo-hinode-2016-novo-preco-aumento.pdf
Page 1 of 4. Sons oftheanarchy s06e02.Drag you to hell.56789418536 - Download Portal video game.MorningComes nikki.I forever shallalso measure.

novo-catalogo-hinode-2016-novo-preco-aumento.pdf
There was a problem loading more pages. Retrying... novo-catalogo-hinode-2016-novo-preco-aumento.pdf. novo-catalogo-hinode-2016-novo-preco-aumento.

Using FeatureExtraction - GitHub
Oct 27, 2017 - Venous thrombosis. 3. 20-24. 2. Medical history: Neoplasms. 25-29. 4. Hematologic neoplasm. 1. 30-34. 6. Malignant lymphoma. 0. 35-39. 7. Malignant neoplasm of anorectum. 0. 40-44. 9. Malignant neoplastic disease. 6. 45-49. 11. Maligna

Building a motor simulation de novo: Observation of ...
Mar 10, 2006 - Introduction. A little girl who ... dance classes and learns the movements she first saw professional .... Protection of Human Subjects at Dartmouth College. ... Powerbook G4 Laptop computer running PsyScope software.

Sylvia Day - Novo trecho de #TodoSeu!.pdf
Page 1 of 3. #TODOSEU. @SiteCrossfireBR. “Você está desistindo?” O olhar incrédulo de Mark Garrity se ergueu da minha carta de demissão. para encontrar o meu. Meu estômago deu um nó com a expressão no rosto do meu chefe. "Sim. Me desculpe,

Using SqlRender - GitHub
6. 3.5 Case sensitivity in string comparisons . .... would have had no clue the two fields were strings, and would incorrectly leave the plus sign. Another clue that.

De Morgan's Laws - GitHub
While it is legit to place a ! outside parentheses to negate an expression, this is considered to be bad form because it makes it difficult for future programmers to ...

Integration and Visualisation of Genomics and Proteomics Data - GitHub
This enables co-visualization of genomics, transcriptomics, and proteomics data using the Integrative ... The software is available via the GitHub code repository:.