Working with PanCancer Data on AWS

The International Cancer Genome Consortium (ICGC) PanCancer dataset, generated by the PanCancer Analysis of Whole Genomes (PCAWG) study, is now available on Amazon Web Services (AWS), giving cancer researchers access to over 2,400 consistently analyzed genomes corresponding to over 1,100 unique ICGC donors. The project’s BAM files (a compressed binary format for sequence alignment data) and VCF files (a text format encoding somatic variants) are available in the US East (N. Virginia) region via Amazon S3 and can be accessed securely from Amazon EC2 instances. Users can search for files using the ICGC Data Portal and access individual or related sets of alignment and variant files through the ICGC Storage Client. A Docker-based PanCancer Launcher, with the alignment workflow pre-installed, is also available, allowing users to align their own sequence data on AWS using the same method applied to the PanCancer genomes. More information can be found on the ICGC on the Cloud page.

Accessing PanCancer Data on AWS

Users can search for files using the ICGC Data Portal and access individual or related sets of alignment and variant files through the ICGC Storage Client. The alignments and a selection of Sanger somatic variant calls are currently available in Amazon S3. Further variant calls will be released following additional quality checking, validation, and analysis.

To download PanCancer data, users need:

  • a valid AWS account with access to the us-east-1 (North Virginia) region
  • a running Amazon EC2 instance in the us-east-1 (North Virginia) region
  • a DACO-approved account with “Cloud” access on the ICGC Data Portal

Using the Data Repositories section of the ICGC Data Portal, users can search for data hosted in AWS by selecting the “AWS - Virginia” repository.

ICGC Data Portal - Data Repositories filtered on “AWS - Virginia”

Users can narrow their search by selecting additional donor and file filters, such as “Data Type”, “File Format”, or “Primary Site”, using the facets on the left side of the screen.

Once the Data Repositories section has been filtered so that it contains only the files to be downloaded, users can generate a manifest for use with the ICGC Storage Client.

ICGC Data Portal - Download Manifest

After configuring the ICGC Storage Client with an authorization token generated from the ICGC Data Portal, users can begin downloading files to their Amazon EC2 instances for further analysis.
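As a rough sketch of this step, the snippet below assembles a manifest-driven download command without executing it. The executable name `icgc-storage-client`, the `download` subcommand, and the `--manifest` and `--output-dir` options are assumptions that should be checked against the ICGC Storage Client documentation for the installed version; the file and directory names are hypothetical.

```python
import shlex
from pathlib import Path


def build_download_command(manifest, output_dir, client="icgc-storage-client"):
    """Return the argument list for a manifest-driven download.

    Nothing is executed here. The subcommand and option names are
    assumptions; verify them against the client's own help output.
    """
    return [
        client,
        "download",
        "--manifest", str(Path(manifest)),
        "--output-dir", str(Path(output_dir)),
    ]


# Hypothetical manifest file saved from the Data Portal, and a local target directory.
command = build_download_command("manifest.txt", "data")
command_line = shlex.join(command)
print(command_line)
```

Building the command as a list keeps it safe to hand to `subprocess.run(command, check=True)` on the EC2 instance once the option names have been confirmed.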

ICGC Storage Client - Downloading objects

The ICGC Storage Client uses multi-part and resume capabilities, which provide a fast and reliable file download experience. It also supports slicing, which allows users to extract regions of interest from BAM files, and a FUSE filesystem mode, which enables working with the files in S3 as though they were local files. The latter feature enables a range of tools to work directly with BAM and VCF files without first having to download the data to an Amazon EC2 instance.
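To illustrate the general idea behind multi-part downloads with resume, the sketch below splits an object into byte ranges and skips the ranges that have already been fetched. It is not the ICGC Storage Client's implementation; the function names and the tiny sizes are hypothetical, chosen only to make the behaviour easy to follow.

```python
def plan_parts(object_size, part_size):
    """Split an object of object_size bytes into inclusive (start, end) byte ranges."""
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size must be > 0")
    return [
        (start, min(start + part_size, object_size) - 1)
        for start in range(0, object_size, part_size)
    ]


def remaining_parts(parts, completed):
    """Return the parts still to be fetched, given the ranges already completed."""
    return [part for part in parts if part not in completed]


def range_header(part):
    """Format an inclusive byte range as an HTTP Range header value."""
    start, end = part
    return f"bytes={start}-{end}"


# A 10-byte object split into 4-byte parts; the first part was fetched before an interruption.
parts = plan_parts(object_size=10, part_size=4)
todo = remaining_parts(parts, completed={(0, 3)})
headers = [range_header(part) for part in todo]
print(headers)
```

Because each part is an independent byte range, parts can be fetched in parallel, and an interrupted transfer only needs to request the ranges that are still missing.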

For more information see the ICGC on the Cloud page and ICGC Storage Client documentation.

Using PanCancer Workflows on AWS

The PanCancer dataset was aligned and quality checked over the course of 2015 using a variety of cloud and high-performance computing (HPC) cluster environments, in both commercial and academic settings. We used Docker to encapsulate our analytical pipelines for alignment and variant calling. The latter encompassed current best-practice variant calling pipelines from four academic organizations: the German Cancer Research Center (DKFZ), the European Molecular Biology Laboratory (EMBL) in Heidelberg, the Wellcome Trust Sanger Institute, and the Broad Institute.

Analysis at the scale of the PCAWG project necessitated the creation of a framework that could create cloud instances, enqueue analysis work, trigger Docker-based analysis pipelines, and clean up cloud resources as tasks were completed. To accomplish this, we created the PanCancer Launcher, which is based on the Consonance project. Using this tool, users can analyze genomes identically to the PCAWG project, enabling direct comparison and co-analysis with the larger PanCancer dataset. In addition, the Launcher can be configured to use the Amazon EC2 spot instance facility, which greatly reduces costs. Running a >30x whole genome alignment is relatively fast and inexpensive, with a turnaround of roughly 4 days at a cost of about $10 on a single m4.2xlarge instance.

The PanCancer Launcher - An Overview of the Workflow Launcher and Instance Fleet

To set up a PanCancer Launcher users need:

The PanCancer Launcher is command-line based and does not have a graphical user interface. However, we have tried to streamline the system as much as possible. We provide detailed documentation in our Launcher HOWTO Guide but, at a high level, the process of analyzing genomes using PCAWG workflows includes the following steps:

  1. prepare genomes as unaligned BAM files, one per specimen
  2. create an Amazon S3 bucket for unaligned BAM inputs, and another bucket for results
  3. upload input genomes to Amazon S3 (the bucket for unaligned BAM input files)
  4. launch an Ubuntu 14.04 instance on Amazon EC2 in the us-east-1 (North Virginia) region
  5. log in to the instance and use the PanCancer Launcher bootstrap script to configure various settings and dependencies
  6. select the BWA-mem alignment workflow and generate an “INI” file template, which is used to parameterize a run of the workflow
  7. create one or more INI files, each corresponding to a genome to analyze
  8. trigger the running of these workflows, monitor their progress, and examine any failure logs if necessary; the system will scale up the fleet of Amazon EC2 instances as needed
  9. worker instances will be terminated when the workflows successfully finish
  10. alignment results can be retrieved from Amazon S3 for downstream analysis
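The step of creating one INI file per genome can be sketched as follows. The real template is generated by the PanCancer Launcher itself, so the section name, key names, bucket names, and file-naming pattern below are all hypothetical placeholders; only the one-INI-per-genome pattern is taken from the steps above.

```python
import configparser
import io


def render_ini(specimen_id, input_bucket, output_bucket):
    """Render the text of one INI file for one genome.

    The [workflow] section and its keys are hypothetical; the actual
    template produced by the Launcher defines the real parameter names.
    """
    config = configparser.ConfigParser()
    config["workflow"] = {
        "input_bam": f"s3://{input_bucket}/{specimen_id}.unaligned.bam",
        "output_prefix": f"s3://{output_bucket}/{specimen_id}/",
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Hypothetical specimen identifiers and bucket names.
specimens = ["specimen_A", "specimen_B"]
ini_files = {
    f"{specimen}.ini": render_ini(specimen, "my-unaligned-bams", "my-alignment-results")
    for specimen in specimens
}

for name, text in ini_files.items():
    print(name)
    print(text)
```

Generating the files from a single function keeps every run parameterized the same way, which matters when the results are meant to be compared with the PanCancer dataset.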

A more detailed, step-by-step guide is provided in the PanCancer Launcher HOWTO Guide.

Currently, the PanCancer Launcher includes the BWA-mem-based alignment pipeline and its associated quality control steps. Future releases will be expanded to include the core variant calling pipelines created by the PCAWG project.

For More Information

Additional documentation can be found on the ICGC on the Cloud page, and updated releases of our software infrastructure and tools will be announced on the PCAWG project page.
