The Genome Analysis Toolkit (GATK) is a full-featured library for dealing with next-generation sequencing data. The open-source Java code base, written by the Genome Sequencing and Analysis Group at the Broad Institute, exposes a Map/Reduce framework allowing developers to code custom tools taking advantage of support for: BAM Alignment files through Picard, BED and other interval file formats through Tribble, and variant data in VCF format.
Here I'll show how to utilize the GATK API from Clojure, a functional, dynamic programming language that targets the Java Virtual Machine. We'll:
- Write a GATK walker that plots variant quality scores using the Map/Reduce API.
- Create a custom annotation that adds a mean neighboring base quality metric using the GATK VariantAnnotator.
- Use the VariantContext API to parse and access variant information in a VCF file.
The Clojure variation library is freely available and is part of a larger project to provide variant assessment capabilities for the Archon Genomics XPRIZE competition.
Map/Reduce GATK Walker
GATK's well-documented Map/Reduce API eases the development of custom programs for processing BAM and VCF files. The presentation from Eli Lilly is a great introduction to developing your own custom GATK Walkers in Java. Here we'll follow a similar approach to code these in Clojure.
We'll start by defining a simple Java base class that extends the base GATK walker and defines an output and input variable. The output is a string specifying the output file to write to, and the input is any type of variant file the GATK accepts. Here we'll be dealing with VCF input files:
[sourcecode language="java"]
public abstract class BaseVariantWalker extends RodWalker {
  @Output
  public String out;

  @ArgumentCollection
  public StandardVariantContextInputArgumentCollection invrns = new StandardVariantContextInputArgumentCollection();
}
[/sourcecode]
This base class is all the Java we need. We implement the remaining walker in Clojure and will walk through the fully annotated source in sections. To start, we import the base walker we wrote and extend this to generate a Java class, which the GATK will pick up and make available as a command line walker:
[sourcecode language="clojure"]
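(ns bcbio.variation.vcfwalker
  (:import [bcbio.variation BaseVariantWalker])
  ;; from-vc and the Incanter aliases are used by the functions below;
  ;; from-vc is assumed to live in the variantcontext namespace used later in this post.
  (:use [bcbio.variation.variantcontext :only [from-vc]])
  (:require [incanter.charts :as icharts]
            [incanter.core :as icore])
  (:gen-class
   :name bcbio.variation.vcfwalker.VcfSimpleStatsWalker
   :extends bcbio.variation.BaseVariantWalker))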
[/sourcecode]
Since this is a Map/Reduce framework, we first need to implement the map function. GATK passes this function a tracker, used to retrieve the actual variant call values, and a context, which describes the current location. We use the invrns argument we defined in Java to reference the input VCF file supplied on the command line. Finally, we extract the quality score from each VariantContext and return those. This map function produces a stream of quality scores from the input VCF file:
[sourcecode language="clojure"]
(defn -map
  [this tracker ref context]
  (if-not (nil? tracker)
    (for [vc (map from-vc
                  (.getValues tracker (.variants (.invrns this))
                              (.getLocation context)))]
      (-> vc :genotypes first :qual))))
[/sourcecode]
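To see the shape of the data this map step returns without running the GATK, here is a minimal sketch in plain Clojure. The `mock-variants` maps and the `quality-scores` function are hypothetical stand-ins, assuming only the `:genotypes` and `:qual` keys used above; the real maps come from `from-vc` and carry more information.

```clojure
;; Hypothetical stand-ins for the maps that from-vc returns; only the
;; :genotypes and :qual keys used by -map above are modeled here.
(def mock-variants
  [{:chr "MT" :start 73  :genotypes [{:qual 99.0} {:qual 45.0}]}
   {:chr "MT" :start 150 :genotypes [{:qual 32.5}]}])

(defn quality-scores
  "Mirror the body of -map: return the quality of the first genotype
  for each variant, or nil when there is nothing to look at."
  [variants]
  (when-not (nil? variants)
    (for [vc variants]
      (-> vc :genotypes first :qual))))

(println (quality-scores mock-variants)) ; prints (99.0 32.5)
(println (quality-scores nil))           ; prints nil
```

Each call yields a lazy sequence of scores, or nil when no data is available, which is why the reduce step needs to handle both cases.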
For the reduce part, we take the stream of quality scores and plot a histogram. In the GATK this happens in three functions: reduceInit starts the reduction step and creates a list to add the quality scores to, reduce collects all of the quality scores into this list, and onTraversalDone plots a histogram of these scores using the Incanter statistical library:
[sourcecode language="clojure"]
(defn -reduceInit
  [this]
  [])

(defn -reduce
  [this cur coll]
  (if-not (nil? cur)
    (vec (flatten [coll cur]))
    coll))

(defn -onTraversalDone
  [this result]
  (doto (icharts/histogram result
                           :x-label "Variant quality"
                           :nbins 50)
    (icore/save (.out this) :width 500 :height 400)))
[/sourcecode]
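The accumulation logic can be tried out on its own, away from the GATK. The sketch below simulates the traversal with Clojure's `reduce`; `collect-scores` is a hypothetical name that copies the body of `-reduce` minus the `this` argument, and `map-results` is made-up data standing in for what the map step returns at each position.

```clojure
(defn collect-scores
  "Same logic as -reduce above: append the scores from one map call
  to the accumulated vector, skipping positions that returned nil."
  [cur coll]
  (if-not (nil? cur)
    (vec (flatten [coll cur]))
    coll))

;; Hypothetical per-position results, as the map step would return them.
(def map-results [[99.0 45.0] nil [32.5] [] [12.0]])

;; Simulate the traversal: start from the reduceInit value and fold in
;; each map result. The walker receives the map value first and the
;; accumulated collection second, hence the argument swap.
(def all-scores
  (reduce (fn [coll cur] (collect-scores cur coll)) [] map-results))

(println all-scores) ; prints [99.0 45.0 32.5 12.0]
```

The nil and empty results drop out, leaving one flat vector of scores ready to hand to the plotting function.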
We've implemented a full GATK walker in Clojure, taking advantage of existing Clojure plotting libraries. To run this, compile the code into a jar file and run it like a standard GATK tool:
$ lein uberjar
$ java -jar bcbio.variation-0.0.1-SNAPSHOT-standalone.jar -T VcfSimpleStats \
    -R test/data/grch37.fa --variant test/data/gatk-calls.vcf --out test.png
which produces a plot of quality score distributions:
Custom GATK Annotation
GATK's Variant Annotator is a useful way to add metrics information to a file of variants. These metrics allow filtering and prioritization of variants, either by variant quality score recalibration or hard filtering. We can add new annotation metrics by inheriting from GATK Java interfaces. Here we'll implement Mean Neighboring Base Quality (NBQ), a metric from the Atlas2 variation suite that assesses the quality scores in a region surrounding a variation.
We start walking through the full implementation by again defining a generated Java class that inherits from a GATK interface. In this case, InfoFieldAnnotation:
[sourcecode language="clojure"]
(ns bcbio.variation.annotate.nbq
  (:import [org.broadinstitute.sting.gatk.walkers.annotator.interfaces InfoFieldAnnotation]
           [org.broadinstitute.sting.utils.codecs.vcf VCFInfoHeaderLine VCFHeaderLineType])
  (:require [incanter.stats :as istats])
  (:gen-class
   :name bcbio.variation.annotate.nbq.MeanNeighboringBaseQuality
   :extends org.broadinstitute.sting.gatk.walkers.annotator.interfaces.InfoFieldAnnotation))
[/sourcecode]
The annotate function does the work of calculating the mean quality score. We define functions that use the GATK API to:
- Retrieve the pileup at the current position.
- Get the neighbor qualities from a read at a position.
- Combine the qualities for all reads in a pileup.
With these three functions, we can use the Clojure threading macro to cleanly organize the steps of the operation as we retrieve the pileup, get the qualities and calculate the mean:
[sourcecode language="clojure"]
;; flank-bp (the flanking window size in bases) is defined in the full source.
(defn -annotate
  [_ _ _ _ contexts _]
  (letfn [(get-pileup [context]
            (if (.hasExtendedEventPileup context)
              (.getExtendedEventPileup context)
              (.getBasePileup context)))
          (neighbor-qualities [[offset read]]
            (let [quals (-> read .getBaseQualities vec)]
              (map #(nth quals % nil) (range (- offset flank-bp) (+ offset flank-bp)))))
          (pileup-qualities [pileup]
            (map neighbor-qualities (map vector (.getOffsets pileup) (.getReads pileup))))]
    {"NBQ" (->> contexts
                vals
                (map get-pileup)
                (map pileup-qualities)
                flatten
                (remove nil?)
                istats/mean
                (format "%.2f"))}))
[/sourcecode]
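To check the windowing arithmetic by hand, here is a self-contained sketch of the same calculation on mock data. It makes several assumptions: reads are represented as plain vectors of quality scores rather than GATK read objects, `flank-bp` is set to 2 purely to keep the example small, and `mean-nbq` is a hypothetical helper that replaces `istats/mean` with plain arithmetic.

```clojure
(def flank-bp 2) ; hypothetical window size, chosen small for the example

(defn neighbor-qualities
  "Qualities in the window around offset. Positions that fall off either
  end of the read come back as nil, as in the annotator above."
  [[offset quals]]
  (map #(nth quals % nil)
       (range (- offset flank-bp) (+ offset flank-bp))))

(defn mean-nbq
  "Mean of all in-bounds neighbor qualities across [offset quals] pairs,
  or nil when no qualities are available."
  [pileup]
  (let [quals (->> pileup
                   (mapcat neighbor-qualities)
                   (remove nil?))]
    (when (seq quals)
      (/ (reduce + quals) (double (count quals))))))

;; Two mock reads: the variant sits at offset 1 in the first read and
;; at offset 3 in the second, so each window runs off one end.
(def mock-pileup [[1 [30 32 34 36]]
                  [3 [20 22 24 26]]])

(println (mean-nbq mock-pileup)) ; prints 28.0
```

Because `range` excludes its upper bound, the window covers `flank-bp` bases before the variant, the variant base itself, and `flank-bp - 1` bases after it. The out-of-range lookups return nil and are dropped before averaging.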
With this in place we can now run this directly using the standard GATK command line arguments. As before, we create a jar file with the new annotator, and then pass the name as a desired annotation when running the VariantAnnotator, producing a VCF file with NBQ annotations:
$ lein uberjar
$ java -jar bcbio.variation-0.0.1-SNAPSHOT-standalone.jar -T VariantAnnotator \
    -A MeanNeighboringBaseQuality -R test/data/GRCh37.fa -I test/data/aligned-reads.bam \
    --variant test/data/gatk-calls.vcf -o annotated-file.vcf
Access VCF variant information
In addition to extending the GATK through walkers and annotations, you can also use the extensive API directly, taking advantage of parsers and data structures to handle common file formats. Using Clojure's Java interoperability, the variantcontext module provides a high-level API to parse and extract information from VCF files. To loop through a VCF file and print the location, reference allele and called alleles for each variant we:
- Open a VCF source providing access to the underlying file inside a with-open statement to ensure closing of the resource.
- Parse the VCF source, returning an iterator of VariantContext maps for each variant in the file.
- Extract values from the map: the chromosome, start, reference allele and called alleles for the first genotype.
[sourcecode language="clojure"]
(use 'bcbio.variation.variantcontext)
(with-open [vcf-source (get-vcf-source "test/data/gatk-calls.vcf")]
  (doseq [vc (parse-vcf vcf-source)]
    (println (:chr vc) (:start vc) (:ref-allele vc)
             (-> vc :genotypes first :alleles))))
[/sourcecode]
This produces:
MT 73 #<Allele G*> [#<Allele A> #<Allele A>]
MT 150 #<Allele T*> [#<Allele C> #<Allele C>]
MT 152 #<Allele T*> [#<Allele C> #<Allele C>]
MT 195 #<Allele C*> [#<Allele T> #<Allele T>]
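Since each variant is an ordinary Clojure map, the standard sequence functions apply directly. As a rough sketch, the example below picks out homozygous non-reference calls from mock data. The `mock-vcs` maps are hypothetical and use strings where the real maps hold GATK Allele objects, and `homozygous-alt?` is an illustrative helper, not part of the library.

```clojure
;; Hypothetical stand-ins for parsed variants; real maps hold GATK Allele
;; objects rather than strings.
(def mock-vcs
  [{:chr "MT" :start 73  :ref-allele "G" :genotypes [{:alleles ["A" "A"]}]}
   {:chr "MT" :start 150 :ref-allele "T" :genotypes [{:alleles ["C" "T"]}]}
   {:chr "MT" :start 152 :ref-allele "T" :genotypes [{:alleles ["C" "C"]}]}])

(defn homozygous-alt?
  "True when the first genotype carries identical alleles that differ
  from the reference allele."
  [vc]
  (let [alleles (-> vc :genotypes first :alleles)]
    (boolean
     (and (seq alleles)
          (apply = alleles)
          (not= (first alleles) (:ref-allele vc))))))

(println (map :start (filter homozygous-alt? mock-vcs))) ; prints (73 152)
```

The same `filter` and `map` pipeline would work on the sequence returned by `parse-vcf`, provided the comparison is adapted to Allele objects.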
I hope this tour provides some insight into the powerful tools that can be rapidly built by leveraging the GATK from Clojure. The full library contains a range of additional functionality including normalization of complex MNPs and support for phased haplotype comparisons.