Full-stack genomics pipelining with
GATK4 + WDL + Cromwell
Data Sciences Platform
Kate Voss, Jeff Gentry, Geraldine Van der Auwera
Genome Analysis Toolkit v.4
Workflow Definition Language
Cromwell Execution Engine
GATK4 is the newest version of our variant discovery package, featuring major performance enhancements and expanded scope of operation.
WDL is a workflow language designed to express tasks and workflows in a user-friendly way, making it easy to build sophisticated pipelines without advanced engineering experience.
Cromwell is a workflow management system designed to execute scientific analysis pipelines on any computing platform, local or cloud.
Cromwell is designed to cover as many use cases as possible, without making assumptions about the nature or domain of the workflows it runs. It is simple to get started with small workloads and scales easily to a large production environment. Several preconfigured backends are available out of the box (Local, SGE, Google, TES). Work is underway to add backends for AWS and Alibaba Cloud, as well as support for CWL pipelines. Cromwell development aims to offer choice and flexibility to a wide user community.
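As a sketch of how a backend is selected, the fragment below shows the shape of a Cromwell configuration file (HOCON format) with an SGE provider defined through the generic "config" backend; the submit string is illustrative and would need to match your cluster's qsub options.

```hocon
# Illustrative Cromwell configuration fragment (HOCON) -- a sketch,
# not a complete configuration. Key names follow Cromwell's
# reference.conf conventions; the SGE commands are examples only.
backend {
  default = "Local"
  providers {
    SGE {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        # How Cromwell submits, kills, and monitors each job
        submit = "qsub -terse -V -b n -N ${job_name} -wd ${cwd} -o ${out} -e ${err} ${script}"
        kill = "qdel ${job_id}"
        check-alive = "qstat -j ${job_id}"
        job-id-regex = "(\\d+)"
      }
    }
  }
}
```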
The GATK engine was completely rewritten with a focus on speed, scalability and versatility, with support for key Big Data technologies including Apache Spark and cloud platforms, and optimized algorithm implementations in flagship tools such as HaplotypeCaller. Its expanded scope of operation includes Best Practices workflows for somatic SNPs and Indels as well as copy number variation in both germline and somatic cases. Work is underway to add structural variation workflows.

Example WDL script:

workflow myWorkflowName {      # workflow definition
  File my_input
  String name
  call task_A { input: in=my_input, id=name }
  call task_B { input: in=task_A.out }
}

task task_A {                  # task definition
  File in
  String id
  command { do_stuff I=${in} O=${id}.ext }
  output { File out = "${id}.ext" }
}

task task_B {
  ...
}
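The inputs to such a workflow are supplied as a JSON file mapping fully qualified input names (workflow name, dot, declaration name) to values; the path and sample name below are placeholders.

```json
{
  "myWorkflowName.my_input": "/path/to/input.file",
  "myWorkflowName.name": "sampleA"
}
```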
Running WDL workflows through Cromwell on local infrastructure or on cloud
In the Broad Institute’s Data Sciences Platform, we are responsible for developing GATK tools and workflows, and for operating large production pipelines in which the GATK Best Practices workflows are applied to the Broad’s massive volumes of genome sequencing data (24 TB/day). This diagram illustrates the two strategies we use: execution on local infrastructure (SGE backend and NFS filesystem) for small-scale development and testing, and on Google Cloud Platform for large-scale testing and production.
[Diagram: two execution strategies. Google Cloud: Cromwell dispatches jobs via the Genomics Pipelines API to Compute Engine instances, which pull container images and localize/delocalize data from Cloud Storage (gs://). Local infrastructure: Cromwell dispatches jobs to an SGE server/cluster, localizing and delocalizing data over NFS storage.]

A WDL script and a list of inputs are submitted to Cromwell. Operators submit workflows to a Cromwell execution service; this can be a persistent server set up locally or on the cloud (submission via curl to API endpoints), or simply spun up on the user’s machine (e.g. a laptop) at submission time. Cromwell parses the WDL, generates individual jobs, and dispatches them for execution via the specified backend.
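As a sketch of the two submission modes (assuming a Cromwell server listening on the default port 8000; the file names are placeholders), this looks like:

```shell
# One-off execution on the user's own machine (run mode):
java -jar cromwell.jar run myWorkflow.wdl myWorkflow.inputs.json

# Submission to a persistent Cromwell server via curl to its
# REST API endpoint (returns a workflow id):
curl -X POST "http://localhost:8000/api/workflows/v1" \
  -F workflowSource=@myWorkflow.wdl \
  -F workflowInputs=@myWorkflow.inputs.json

# Poll the status of the submitted workflow by its id:
curl "http://localhost:8000/api/workflows/v1/<workflow-id>/status"
```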
Resources and upcoming developments
GATK4 is currently available as a beta version from https://software.broadinstitute.org/gatk/download/beta. We expect to release GATK version 4.0 into general availability later this summer. We plan to publish all our Best Practices workflows as WDL scripts, accompanied by all necessary example data, at https://software.broadinstitute.org/gatk/best-practices
The WDL website is the best place to go for more information on WDL and Cromwell, including a quick start guide and many tutorials and example scripts. For many additional real-world analysis scripts, see https://www.github.com/broadinstitute/wdl/scripts. For visualizing WDL pipelines we recommend the Pipeline Builder package developed by the external group EPAM, available at http://pb.opensource.epam.com
Curious to try this out on the cloud? Take Cromwell for a spin and start running WDL workflows using the Google Genomics WDL Runner, as described in this step-by-step tutorial: https://cloud.google.com/genomics/v1alpha2/gatk. You can also try running WDLs on our cloud analysis platform, FireCloud, which provides advanced data and methods management functionality along with a GUI workspace environment for executing workflows: https://www.firecloud.org