CloudBioLinux: progress on bioinformatics cloud images and data

My last post introduced a framework for building bioinformatics cloud images, which makes it easy to do biological computing work using Amazon EC2 and other on-demand computing providers. Since that initial announcement we've had amazing interest from the community and made great progress with:

A permanent web site at cloudbiolinux.org
Additional software and genomic data
New user documentation
A community coding session: Codefest 2010

New software and data

The most exciting change has been the rapid expansion of installed software and libraries. The goal is to provide an image that experienced developers will find as useful as their custom-configured servers. A great group of contributors has put together a large set of programs and libraries; the configuration files have all the details on installed programs as well as libraries for Python, Perl, Ruby, and R. Another addition is support for non-packaged programs, which covers software not yet neatly wrapped in a package manager or library-specific install system: next-gen software packages like Picard, GATK, and Bowtie are installed through custom scripts.
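To make the split between packaged and non-packaged software concrete, here is a minimal sketch of how a configuration could be turned into install commands. It is not the project's actual code or file format: the `CONFIG` layout, the `plan_install` function, and the `install_bowtie_from_source` placeholder are all assumptions made for illustration, and the package names are only examples.

```python
import shlex

# Hypothetical configuration: the group names and package lists are
# illustrative and do not mirror the real CloudBioLinux config files.
CONFIG = {
    "apt": ["emboss", "samtools"],
    "python": ["biopython"],
    "r": ["ggplot2"],
    "custom": ["bowtie"],
}


def install_bowtie_from_source():
    """Placeholder for a hand-written installer of a non-packaged program."""
    return ["# download, unpack, and build bowtie here"]


# Non-packaged programs map to custom install functions.
CUSTOM_INSTALLERS = {"bowtie": install_bowtie_from_source}


def plan_install(config):
    """Return the shell commands needed to install everything in config."""
    commands = []
    if config.get("apt"):
        packages = " ".join(shlex.quote(p) for p in config["apt"])
        commands.append("sudo apt-get install -y " + packages)
    for pkg in config.get("python", []):
        commands.append("pip install " + shlex.quote(pkg))
    for pkg in config.get("r", []):
        expr = 'install.packages("%s", repos="https://cloud.r-project.org")' % pkg
        commands.append("Rscript -e " + shlex.quote(expr))
    for name in config.get("custom", []):
        if name not in CUSTOM_INSTALLERS:
            raise KeyError("no custom installer registered for %s" % name)
        commands.extend(CUSTOM_INSTALLERS[name]())
    return commands


if __name__ == "__main__":
    for command in plan_install(CONFIG):
        print(command)
```

Keeping the package lists in data and the custom installers in code means a contributor can add a packaged program by editing one list, while anything without a package gets its own small function.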

To improve accessibility for developers who prefer a desktop experience, a FreeNX server was integrated with the provided images. Tim Booth from the NEBC Bio-Linux team headed up the integration of FreeNX, and the user experience looks very similar to a locally installed Bio-Linux desktop.

In addition to the software image, a public data volume is now available that contains:

Genome sequences pre-indexed for search with next-gen aligners like Bowtie, Novoalign, and BWA.
LiftOver files for mapping between sequence coordinates.
UniRef protein databases, indexed for searching with BLAST+.

Coupled with the software images, this volume makes it easy to do next-gen analyses. Start up an instance from the Amazon AMI, attach the genome data volume, transfer your fastq file to the instance, and kick off the analysis. The overhead of software installation and genome indexing is completely removed. Thanks to the work of Enis Afgan and James Taylor of Galaxy, the data volume plugs directly into Galaxy's ready-to-use cloud image. Coupling the data and software with Galaxy provides a familiar web interface for running tools and developing biological workflows.
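The start, attach, transfer, and analyze steps can be sketched as an ordered plan of commands. This is an illustration under several assumptions, not an official recipe: it assumes the AWS command line tools, a data volume created from a public snapshot, an `ubuntu` login user, and Bowtie as the aligner. Every identifier (`INSTANCE_ID`, `VOLUME_ID`, `PUBLIC_DNS`, the AMI and snapshot IDs, the mount point, and the index path) is a hypothetical placeholder; look up the real values on the CloudBioLinux website. The function only builds the commands and does not run them.

```python
import os
import shlex


def plan_analysis(ami_id, snapshot_id, zone, key_name, instance_type,
                  fastq, index_prefix):
    """Build (description, argv) steps for a basic alignment run on EC2."""
    # Placeholders: in practice these come from the output of earlier steps.
    instance_id, volume_id, host = "INSTANCE_ID", "VOLUME_ID", "PUBLIC_DNS"
    login = "ubuntu@" + host  # assumed login user
    remote_fastq = os.path.basename(fastq)
    # The device name seen on the instance may differ from the one requested.
    mount_cmd = "sudo mkdir -p /mnt/biodata && sudo mount /dev/sdf /mnt/biodata"
    align_cmd = "bowtie -S %s %s aligned.sam" % (
        shlex.quote(index_prefix), shlex.quote(remote_fastq))
    return [
        ("start an instance from the AMI",
         ["aws", "ec2", "run-instances", "--image-id", ami_id,
          "--instance-type", instance_type, "--key-name", key_name,
          "--placement", "AvailabilityZone=" + zone]),
        ("create the data volume from its snapshot",
         ["aws", "ec2", "create-volume", "--snapshot-id", snapshot_id,
          "--availability-zone", zone]),
        ("attach the data volume",
         ["aws", "ec2", "attach-volume", "--volume-id", volume_id,
          "--instance-id", instance_id, "--device", "/dev/sdf"]),
        ("mount the data volume", ["ssh", login, mount_cmd]),
        ("transfer the fastq file",
         ["scp", fastq, "%s:%s" % (login, remote_fastq)]),
        ("run the alignment", ["ssh", login, align_cmd]),
    ]


if __name__ == "__main__":
    steps = plan_analysis("ami-00000000", "snap-00000000", "us-east-1a",
                          "my-key", "m1.large", "reads.fastq",
                          "/mnt/biodata/genomes/hg19/bowtie/hg19")
    for description, argv in steps:
        print("%s: %s" % (description, " ".join(argv)))
```

The instance and the volume have to be in the same availability zone, which is why `zone` is passed to both the first and second steps.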

The data volume preparation is fully automated via a fabric install script, similar to the software install script. Additional data sources are easily integrated, and we hope to expand the available datasets based on feedback from the community.
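One property that makes this kind of automation practical is that it can be rerun safely: indexes already on the volume are skipped and only missing ones are built. The sketch below illustrates that idea and is not the actual fabric script. The directory layout (`<data_dir>/<genome>/<aligner>/`), the `plan_indexes` function, and the `INDEXERS` table are assumptions for illustration; `bowtie-build` and `bwa index -p` are the standard indexing commands for those two aligners.

```python
import os

# Map each aligner to a function producing its index-building command.
INDEXERS = {
    "bowtie": lambda fasta, prefix: ["bowtie-build", fasta, prefix],
    "bwa": lambda fasta, prefix: ["bwa", "index", "-p", prefix, fasta],
}


def plan_indexes(data_dir, genomes, aligners):
    """Return index commands for genome/aligner pairs not yet on the volume."""
    jobs = []
    for genome in genomes:
        # Hypothetical layout: <data_dir>/<genome>/seq/<genome>.fa
        fasta = os.path.join(data_dir, genome, "seq", genome + ".fa")
        for aligner in aligners:
            out_dir = os.path.join(data_dir, genome, aligner)
            # A non-empty output directory is treated as "already indexed".
            if os.path.isdir(out_dir) and os.listdir(out_dir):
                continue
            prefix = os.path.join(out_dir, genome)
            jobs.append(INDEXERS[aligner](fasta, prefix))
    return jobs


if __name__ == "__main__":
    for job in plan_indexes("/mnt/biodata/genomes", ["hg19"], ["bowtie", "bwa"]):
        print(" ".join(job))
```

Adding a new data source then amounts to registering one more entry in the table, which fits the goal of expanding the datasets based on community feedback.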

Documentation and presentations

The software and data volumes are only as good as the documentation that helps people use them:

Bela Tiwari of the NEBC Bio-Linux team has written an excellent introduction to Amazon EC2 and CloudBioLinux. This breaks down the process of signing up for an account, creating a software image, associating data volumes and setting up a graphical server. It's a great place to get started with CloudBioLinux.
Ntino Krampis, from the JCVI Cloud Bio-Linux project, gave a presentation on CloudBioLinux explaining the motivation behind the project and providing usage examples.
My presentation from Amazon's Genomic Data workshop covers the open source community behind CloudBioLinux, detailing the project goals and automated code organization.

Community: Codefest 2010

The CloudBioLinux community had a chance to work together for two days in July at Codefest 2010. Held in conjunction with the Bioinformatics Open Source Conference (BOSC) in Boston, this was a free-to-attend coding session hosted at Harvard School of Public Health and Massachusetts General Hospital. Over 30 developers donated two days of their time to work on CloudBioLinux and other bioinformatics open source projects.

Many of the advances in CloudBioLinux detailed above were made possible through this session: the FreeNX graphical client integration, documentation, Galaxy interoperability, and many library and data improvements were started during the two days of coding and discussions. Additionally, the relationships developed are the foundation for better communication amongst open source projects, which is something we need to be continually striving for in the scientific computing world.

It was amazing and inspiring to get such positive feedback from so many members of the bioinformatics community. We're planning another session next year in Vienna, again just before BOSC and ISMB 2011; and again, everyone is welcome.

Summary

Go to the CloudBioLinux website for the latest publicly available images and data volumes, which are ready to use on Amazon EC2. With Amazon's new micro instances you can start analyzing data for only a few cents an hour. It's an easy way to explore whether cloud resources will help with the computational demands in your work. We're very interested in feedback and happy to have other developers helping out; please get in touch on the CloudBioLinux mailing list.

Blue Collar Bioinformatics

Community built tools for biological data analysis
