Full-stack genomics pipelining with GATK4 + WDL + Cromwell

Data Sciences Platform

Kate Voss, Jeff Gentry, Geraldine Van der Auwera

Genome Analysis Toolkit v.4

Workflow Definition Language

Cromwell Execution Engine

GATK4 is the newest version of our variant discovery package, featuring major performance enhancements and expanded scope of operation.

WDL is a workflow language designed to express tasks and workflows in a user-friendly way, making it easy to build sophisticated pipelines without advanced engineering experience.

Cromwell is a workflow management system designed to execute scientific analysis pipelines on any computing platform, local or cloud.

Cromwell was designed to cover as many use cases as possible, without making any assumptions about the nature or domain of the workflows it runs. It is simple to get started with for small workloads and scales easily to a large production environment. Several preconfigured backends are available out of the box (local, SGE, Google, TES). Work is underway to add backends for AWS and Alibaba Cloud, as well as support for CWL pipelines. Cromwell development aims to offer choice and flexibility to a wide user community.


GATK4's expanded scope of operation includes Best Practices workflows for somatic SNPs and indels as well as copy number variation in both germline and somatic cases. Work is underway to add structural variation workflows.


The GATK engine was completely rewritten with a focus on speed, scalability and versatility, with support for key Big Data technologies including Apache Spark and cloud platforms, and optimized algorithm implementations in flagship tools such as HaplotypeCaller.
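As a concrete illustration, a GATK4 tool such as HaplotypeCaller can be wrapped as a WDL task. The sketch below is illustrative rather than one of our published Best Practices scripts: it assumes GVCF-mode calling, omits the reference index/dictionary and BAM index inputs for brevity, and the launcher command and arguments should be checked against the GATK4 documentation (the launcher name has varied across beta releases).

# Sketch of a WDL task wrapping GATK4 HaplotypeCaller in GVCF mode
# (illustrative names; reference index/dictionary inputs omitted for brevity).
task HaplotypeCallerGvcf {
  File ref_fasta
  File input_bam
  String sample_name

  command {
    gatk HaplotypeCaller \
      -R ${ref_fasta} \
      -I ${input_bam} \
      -O ${sample_name}.g.vcf.gz \
      -ERC GVCF
  }
  output {
    File output_gvcf = "${sample_name}.g.vcf.gz"
  }
  runtime {
    docker: "broadinstitute/gatk"   # illustrative container image
  }
}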

workflow myWorkflowName {                      # workflow definition
  File my_input
  String name

  call task_A { input: in=my_input, id=name }
  call task_B { input: in=task_A.out }
}

task task_A {                                  # task definition
  File in
  String id

  command { do_stuff I=${in} O=${id}.ext }
  output { File out = "${id}.ext" }
}

task task_B {
  ...
}
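Beyond chaining calls linearly as above, WDL also provides a scatter construct that runs a task once per element of an array, in parallel; Cromwell schedules each shard as an independent job on the configured backend, and the outputs of the scattered call are gathered into an array. A minimal sketch, with hypothetical workflow and task names:

workflow scatterExample {
  Array[File] input_files

  # One shard of count_lines runs per input file; count_lines.out is
  # gathered into an Array[File] for any downstream calls.
  scatter (f in input_files) {
    call count_lines { input: in=f }
  }
}

task count_lines {
  File in

  command { wc -l < ${in} > line_count.txt }
  output { File out = "line_count.txt" }
}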

Running WDL workflows through Cromwell on local infrastructure or in the cloud

In the Broad Institute's Data Sciences Platform, we are responsible for developing GATK tools and workflows and for operating large production pipelines in which the GATK Best Practices workflows are applied to the Broad's massive volume of genome sequencing data (24 TB/day). The diagram below illustrates the two strategies we use: execution on local infrastructure (SGE backend and NFS filesystem) for small-scale development and testing, and on Google Cloud Platform for large-scale testing and production.


Operators submit workflows to a Cromwell execution service. This can be a persistent server set up locally or in the cloud (submission via curl to API endpoints), or a service simply spun up on the user's machine (e.g. a laptop) at the time of submission.

Cromwell parses the WDL, generates individual jobs and dispatches them for execution via the specified backend.

[Diagram: a WDL script and a list of inputs are submitted to the Cromwell Execution Service. On Google Cloud, jobs run via the Genomics Pipelines API on Compute Engine, pulling the container image and localizing/delocalizing data to and from gs:// storage. On local infrastructure, jobs run on an SGE server/cluster, localizing/delocalizing data via NFS storage.]
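Much of the backend-specific behavior shown in the diagram (which container image to pull, how much memory and disk to provision) is driven by the runtime block of each WDL task. A minimal sketch with illustrative values; the disks attribute applies to the Google (Pipelines API) backend, and other backends interpret or ignore these hints according to their configuration:

task do_stuff_in_container {
  File in

  command { do_stuff I=${in} O=result.ext }
  output { File out = "result.ext" }
  runtime {
    docker: "ubuntu:16.04"         # image pulled onto the worker VM
    memory: "4 GB"
    cpu: 1
    disks: "local-disk 50 HDD"     # Google (Pipelines API) backend attribute
  }
}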

Resources and upcoming developments

GATK4 is currently available as a beta version from https://software.broadinstitute.org/gatk/download/beta. We expect to release GATK version 4.0 into general availability later this summer. We plan to publish all our Best Practices workflows as WDL scripts, accompanied by all necessary example data, at https://software.broadinstitute.org/gatk/best-practices

The WDL website is the best place to go for more information on WDL and Cromwell, including a quick start guide and many tutorials and example scripts. For many real-world analysis scripts, see https://www.github.com/broadinstitute/wdl/scripts. For visualizing WDL pipelines we recommend the Pipeline Builder package developed by the external group EPAM, available at http://pb.opensource.epam.com

Curious to try this out on the cloud? Take Cromwell for a spin and start running WDL workflows using the Google Genomics WDL Runner, as described in this step-by-step tutorial: https://cloud.google.com/genomics/v1alpha2/gatk. You can also try running WDLs on our cloud analysis platform, FireCloud, which provides advanced data and methods management functionality along with a GUI workspace environment for executing workflows: https://www.firecloud.org
