Distributed exome analysis pipeline with CloudBioLinux and CloudMan

A major challenge in building analysis pipelines for next-generation sequencing data is combining a large number of processing steps in a flexible, scalable manner. Current best-practice software needs to be installed and configured alongside the custom code to chain individual programs together. Scaling to handle increasing throughput requires running that custom code on a wide variety of parallel architectures, from single multicore machines to heterogeneous clusters.

Establishing community resources that meet the challenges of building these pipelines ensures that bioinformatics programmers can share the burden of building large scale systems. Two open-source efforts which aim at providing this type of architecture are:

CloudBioLinux -- A community effort to create shared images filled with bioinformatics software and libraries, using an automated build environment.
CloudMan -- Uses CloudBioLinux as a platform to build a full SGE cluster environment. Written by Enis Afgan and the Galaxy Team, CloudMan is used to provide a ready-to-run, dynamically scalable version of Galaxy on Amazon AWS.

Here we combine CloudBioLinux software with a CloudMan SGE cluster to build a fully automated pipeline for processing high throughput exome sequencing data:

The underlying analysis software is from CloudBioLinux.
CloudMan provides an SGE cluster managed via a web front end.
RabbitMQ is used for communication between cluster nodes.
An automated pipeline, written in Python, organizes parallel processing across the cluster.

Below are instructions for starting a cluster on Amazon EC2 resources to run an exome sequencing pipeline that processes FASTQ sequencing reads, producing fully annotated variant calls.

Start cluster with CloudBioLinux and CloudMan

Start in the Amazon web console, a convenient front end for managing EC2 servers. The first step is to follow the CloudMan setup instructions to create an Amazon account and set up appropriate security groups and user data. The wiki page contains detailed screencasts. Below is a short screencast showing how to boot your CloudBioLinux specific CloudMan server:

[youtube=http://www.youtube.com/watch?v=eS8vmKIXIB4]

Once this is booted, proceed to the CloudMan web interface on the server and startup an instance from this shared identifier:

cm-b53c6f1223f966914df347687f6fc818/shared/2012-07-23--19-23

This screencast shows all of the details, including starting an additional node on the SGE cluster:

[youtube=http://www.youtube.com/watch?v=4kIRI1m0g7Y]

Configure AMQP messaging

Edit: The AMQP messaging steps have now been full automated so the configuration steps in this section are no longer required. Skip down to the 'Run Analysis' section to start processing the data immediately.

With your server booted and ready to run, the next step is to configure RabbitMQ messaging to communicate between nodes on your cluster. In the AWS console, find the external and internal hostname of the head machine. Start by opening an ssh connection to the machine with the external hostname:

$ ssh -i your-keypair ubuntu@ec2-50-19-177-134.compute-1.amazonaws.com

Edit the /export/data/galaxy/universe_wsgi.ini configuration file to add the internal hostname. After editing, the AMQP section will look like:

[galaxy_amqp]
host = ip-10-125-10-182.ec2.internal
port = 5672
userid = biouser
password = tester

Finally, add the user and virtual host to the running RabbitMQ server on the master node with 3 commands:

$ sudo rabbitmqctl add_user biouser tester
creating user "biouser" ...
...done.
$ sudo rabbitmqctl add_vhost bionextgen
creating vhost "bionextgen" ...
...done.
$ sudo rabbitmqctl set_permissions -p bionextgen biouser ".*" ".*" ".*"
setting permissions for user "biouser" in vhost "bionextgen" ...
...done.

Run analysis

With messaging in place, we are ready to run the analysis. /export/data contains a ready to run example exome analysis, with FASTQ input files in /export/data/exome_example/fastq and configuration information in /export/data/exome_example/config. Start the fully automated pipeline with a single command:

 $ cd /export/data/work
 $ distributed_nextgen_pipeline.py /export/data/galaxy/post_process.yaml
                                   /export/data/exome_example/fastq
                                   /export/data/exome_example/config/run_info.yaml

distributed_nextgen_pipeline.py starts processing servers on each of the cluster nodes, using SGE for scheduling. Then a top level analysis server runs, splitting the FASTQ data across the nodes at each step of the process:

Alignment with BWA
Preparation of merged alignment files with Picard
Recalibration and realignment with GATK
Variant calling with GATK
Assessment of predicted variant effects with snpEff
Preparation of summary PDFs for each sample with read details from FastQC alongside alignment, hybrid selection and variant calling statistics from Picard

Monitor the running process

The example data is from a human chromosome 22 hybrid selection experiment. While running, you can keep track of the progress in several ways. SGEs qstat command will tell you where the analysis servers are running on the cluster:

$ qstat
ob-ID  prior   name   user  state submit/start at   queue
----------------------------------------------------------------------------------
1 0.55500 nextgen_an ubuntu  r  08/14/2011 18:16:32 all.q@ip-10-125-10-182.ec2.int
2 0.55500 nextgen_an ubuntu  r  08/14/2011 18:16:32 all.q@ip-10-86-254-105.ec2.int
3 0.55500 automated_ ubuntu  r  08/14/2011 18:16:47 all.q@ip-10-125-10-182.ec2.int

Listing files in the working directory will show our progress:

$ cd /export/data/work
$ ls -lh
drwxr-xr-x 2 ubuntu ubuntu 4.0K 2011-08-13 21:09 alignments
-rw-r--r-- 1 ubuntu ubuntu 2.0K 2011-08-13 21:17 automated_initial_analysis.py.o11
drwxr-xr-x 2 ubuntu ubuntu   33 2011-08-13 20:43 log
-rw-r--r-- 1 ubuntu ubuntu  15K 2011-08-13 21:17 nextgen_analysis_server.py.o10
-rw-r--r-- 1 ubuntu ubuntu  15K 2011-08-13 21:17 nextgen_analysis_server.py.o9
drwxr-xr-x 8 ubuntu ubuntu  102 2011-08-13 21:06 tmp

The files that end with .o* are log files from each of the analysis servers and provide detailed information about the current state of processing at each server:

$ less nextgen_analysis_server.py.o10
INFO: nextgen_pipeline: Processing sample: Test replicate 2; lane
  8; reference genome hg19; researcher ; analysis method SNP calling
INFO: nextgen_pipeline: Aligning lane 8_100326_FC6107FAAXX with bwa aligner
INFO: nextgen_pipeline: Combining and preparing wig file [u'', u'Test replicate 2']
INFO: nextgen_pipeline: Recalibrating [u'', u'Test replicate 2'] with GATK

Retrieve results

The processing pipeline results in numerous intermediate files. These take up a lot of disk space and are not necessary after processing is finished. The final step in the process is to extract the useful files for visualization and further analysis:

$ upload_to_galaxy.py /export/data/galaxy/post_process.yaml
                      /export/data/exome_example/fastq
                      /export/data/work
                      /export/data/exome_example/config/run_info.yaml

For each sample, this script copies:

A BAM file with aligned sequeneces and original FASTQ data
A realigned and recalibrated BAM file, ready for variant calling
Variant calls in VCF format.
A tab delimited file of predicted variant effects.
A PDF summary file containing alignment, variant calling and hybrid selection statistics.

into an output directory for the flowcell: /export/data/galaxy/storage/100326_FC6107FAAXX:

$ ls -lh /export/data/galaxy/storage/100326_FC6107FAAXX/7
-rw-r--r-- 1 ubuntu ubuntu  38M 2011-08-19 20:50 7_100326_FC6107FAAXX.bam
-rw-r--r-- 1 ubuntu ubuntu  22M 2011-08-19 20:50 7_100326_FC6107FAAXX-coverage.bigwig
-rw-r--r-- 1 ubuntu ubuntu  72M 2011-08-19 20:51 7_100326_FC6107FAAXX-gatkrecal.bam
-rw-r--r-- 1 ubuntu ubuntu 109K 2011-08-19 20:51 7_100326_FC6107FAAXX-snp-effects.tsv
-rw-r--r-- 1 ubuntu ubuntu 827K 2011-08-19 20:51 7_100326_FC6107FAAXX-snp-filter.vcf
-rw-r--r-- 1 ubuntu ubuntu 1.6M 2011-08-19 20:50 7_100326_FC6107FAAXX-summary.pd

As suggested by the name, the script can also integrate the data into a Galaxy instance if desired. This allows biologists to perform further data analysis, including visual inspection of the alignments in the UCSC browser.

Learn more

All components of the pipeline are open source and part of community projects. CloudMan, CloudBioLinux and the pipeline are customized through YAML configuration files. Combined with the CloudMan managed SGE cluster, the pipeline can be applied in parallel to any number of samples.

The overall goal is to share the automated infrastructure work that moves samples from sequencing to being ready for analysis. This allows biologists more rapid access to the processed data, focusing attention on the real work: answering scientific questions.

If you'd like to hear more about CloudBioLinux, CloudMan and the exome sequencing pipeline, I'll be discussing it at the AWS Genomics Event in Seattle on September 22nd.

Blue Collar Bioinformatics

Community built tools for biological data analysis