Automated build environment for Bioinformatics cloud images

By Brad Chapman

Amazon Web Services provides scalable, on-demand computational resources through its Elastic Compute Cloud (EC2). Previously, I described the goal of providing publicly available machine images loaded with bioinformatics tools. I'm happy to describe an initial step in that direction: an automated build system, driven by easily editable configuration files, that generates a bioinformatics-focused Amazon Machine Image (AMI) containing packages integrated from several existing efforts. The hope is to consolidate the community's open source work around a single, continuously improving machine image.

This image incorporates software from several existing AMIs:

  • JCVI Cloud BioLinux -- JCVI's work porting Bio-Linux to the cloud.
  • bioperl-max -- Fortinbras' package of BioPerl and associated informatics tools.
  • MachetEC2 -- An InfoChimps image loaded with data mining software.

Each of these images inspired different aspects of developing this image and its associated infrastructure, and I'm extremely grateful to the authors for their code, documentation and discussions.

The current AMI is available for loading on EC2 -- search for 'CloudBioLinux' in the AWS console or go to the CloudBioLinux project page for the latest AMIs. The automation scripts, and the configuration files listing the included packages, are available in a GitHub repository.

Contributions encouraged

This image is intended as a starting point for developing a community resource that provides biology and data-mining oriented software. Experienced developers should be able to fire up this image and expect to find the same up-to-date libraries and programs they have installed on their work machines. If their favorite package is missing, it should be quick and easy to add, making the improvement available to future developers.

Achieving these goals requires help and contributions from other programmers utilizing the cloud -- everyone reading this. The current image is ready to be used, but is more complete in areas where I normally work. For instance, the Python and R libraries are off to a good start. I'd like to extend an invitation to folks with expertise in other areas to help improve the coverage of this AMI:

  • Programmers: help expand the configuration files for your areas of interest:
    • Perl CPAN support and libraries
    • Ruby gems
    • Java libraries
    • Haskell Hackage support and libraries
    • Erlang libraries
    • Bioinformatics areas of specialization:
      • Next-gen sequencing
      • Structural biology
      • Parallelized algorithms
    • Much more... Let us know what you are interested in.
  • Documentation experts: provide cookbook style instructions to help others get started.
  • Porting specialists: The automation infrastructure is dependent on having good ports for libraries and programs. Many widely used biological programs are not yet ported. Establishing a Debian or Ubuntu port for a missing program will not only help this effort, but make the programs more widely available.
  • Systems administrators: The ultimate goal is to have the AMI be automatically updated on a regular basis with the latest changes. We'd like to set up an Amazon instance that pulls down the latest configuration, populates an image, builds the AMI, and then updates a central web page and REST API for getting the latest and greatest.
  • Testers: Check that this runs on open source Eucalyptus clouds, additional Linux distributions, and other cloud deployments.

If any of this sounds interesting, please get in contact. The Cloud BioLinux mailing list is a good central point for discussion.

Infrastructure overview

In addition to supplying an image for downstream use, this implementation was designed to be easily extensible. Inspired by the MachetEC2 project, packages to be installed are entered into a set of easy-to-edit configuration files in YAML syntax. There are three different configuration file types:

  • main.yaml -- The high level configuration file defining which groups of packages to install. This allows a user to build a custom image simply by commenting out those groups which are not of interest.
  • packages.yaml -- Defines Debian/Ubuntu packages to be installed. This leans heavily on the work of the DebianMed and Bio-Linux communities, as well as all of the hard-working package maintainers for the distributions. If it exists in package form, you can list it here.
  • python-libs.yaml, r-libs.yaml -- These take advantage of language-specific ways of installing libraries. Currently implemented is support for Python library installation from the Python Package Index, and R library installation from CRAN and Bioconductor. This will be expanded to include support for other languages.
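To make the relationship between the files concrete, here is a minimal sketch of how the groups enabled in main.yaml could be resolved into a flat list of packages from packages.yaml. The dictionaries stand in for what a YAML parser would return; the key names, group names and package lists are illustrative assumptions, not the actual contents of the repository's files.

```python
# Hypothetical stand-in for a parsed main.yaml: the groups to install.
# Leaving a group out of this list has the same effect as commenting it out.
MAIN_CONFIG = {
    "packages": ["python", "r", "bio_general"],
}

# Hypothetical stand-in for a parsed packages.yaml: group name -> packages.
# Group and package names are illustrative only.
PACKAGE_GROUPS = {
    "python": ["python-numpy", "python-scipy"],
    "r": ["r-base"],
    "bio_general": ["emboss"],
    "java": ["ant"],  # defined, but not enabled in MAIN_CONFIG above
}


def select_packages(main_config, package_groups):
    """Return an ordered, de-duplicated package list for the enabled groups."""
    selected = []
    for group in main_config["packages"]:
        if group not in package_groups:
            raise KeyError("Unknown package group: %s" % group)
        for package in package_groups[group]:
            if package not in selected:
                selected.append(package)
    return selected


if __name__ == "__main__":
    for name in select_packages(MAIN_CONFIG, PACKAGE_GROUPS):
        print(name)
```

Keeping the group list separate from the package definitions is what lets someone build a slimmer custom image by touching only the top-level file.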

The Fabric remote automated deployment tool is used to build AMIs from these configuration files. Written in Python, the fabfile automates the process of installing packages on the cloud machine.
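As a rough illustration of the installation step, the sketch below turns package lists into the shell commands that an automation tool would run on the remote machine. It only builds command strings and does not use Fabric itself; the function names are hypothetical, and using pip for Python libraries is an assumption for this example rather than a description of what the actual fabfile does.

```python
import shlex


def apt_install_command(packages):
    """Build a non-interactive apt-get command, or None if nothing to install."""
    if not packages:
        return None
    quoted = " ".join(shlex.quote(name) for name in packages)
    return "apt-get install -y " + quoted


def pip_install_command(libraries):
    """Build a pip command for Python libraries, or None if nothing to install."""
    if not libraries:
        return None
    quoted = " ".join(shlex.quote(name) for name in libraries)
    return "pip install " + quoted


if __name__ == "__main__":
    # Illustrative package names; a real build would read them from the
    # YAML configuration files and run each command on the cloud instance.
    commands = [
        apt_install_command(["emboss", "r-base"]),
        pip_install_command(["biopython"]),
    ]
    for command in commands:
        if command is not None:
            print(command)
```

In a fabfile, strings like these would be handed to Fabric's remote execution functions with root privileges instead of being printed.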

We hope that the straightforward architecture of the build system will encourage other developers to dig in and provide additional coverage of programs and libraries through the configuration files. For those comfortable with Python, the fabfile is very accessible for adding new functionality.

If you are interested in face-to-face collaboration and will be in the Boston area on July 7th and 8th, check out Codefest 2010; it'll be two enjoyable days of cloud informatics development. I'm looking forward to hearing from other developers who are interested in building and maintaining an easy-to-use, up-to-date machine image that can help make biological computation more accessible to the community.
