Genomics X Prize public phase: reference genome preparation and comparisons to Illumina and Complete Genomics

Background

The Archon Genomics X Prize, presented by Express Scripts, is a 10 million dollar competition to establish highly accurate clinical grade sequencing and variation detection methods. Our group at Harvard School of Public Health works with the EdgeBio team on developing the infrastructure for the competition: identify variations in the grading genomes and provide software to compare these reference variation sets against a competitor's list of variations.

The exciting aspect of the Genomics X Prize is that it enables open comparisons between sequencing technologies and variant calling methodologies. Sequencing genomes to the high degree of accuracy sufficient for clinical usage is a difficult, open, problem. Here I'll present detailed numbers comparing variants called by different sequencing technologies and variant callers.

The public phase of the Genomics X Prize starts today, August 15th. The goal of this six month period is to have an open dialog with everyone working in the sequencing and variant calling communities. We want to refine our methods to provide the most accurate and fair variant calling for the reference genomes. To start the discussion we've prepared:

Variant calls for a HapMap individual: NA19239, a Yoruba male from Ibadan, Nigeria. We sequenced haploid fosmids from NA19239 with two technologies: Illumina and SOLiD; and called variants with three different methods: GATK Unified Genotyper, FreeBayes, and SAMtools mpileup. We combined these calls into a unified final call set, NA19239 version 0.1, that I discuss in detail below.
Fully documented methods, access to all data files used, and a public scoring site. The X Prize Validation wiki contains detailed information about sequencing, variant calling, validation and scoring. The validationprotocol.org website provides a simple way for anyone to compare their variant calls against the public reference genomes. It encourages submission and analysis in public tools like Galaxy through transparent interoperability with GenomeSpace.
An automated variant analysis infrastructure built on top of the Broad's Genome Analysis Toolkit (GATK) that performs comparisons as well as variant unification. This is a generally useful toolkit of functionality to manipulate variants, and we presented an overview at the Bioinformatics Open Source Conference last month. This is an open-source community developed project, and has received great contributions from the Genome in a Bottle Consortium at the National Institute of Standards and Technology.

The goal of this writeup, and the X Prize public phase, is to iterate over calling and unification methods to improve our algorithms and approaches. Rather than promoting or disparaging any particular technology or calling method, we're instead providing full transparency and a good-faith effort to combining approaches. Our hope is that this will help engage the community, encourage feedback, and result in a unbiased and accurate set of reference genomes for the competition.

Unification of variant calls

For the August 15th public phase kickoff, we prepared a reference data set of NA19239 based on pooled sequencing of haploid fosmid clones. The callable regions of these clones totaled 129,513,026 total bases, covering ~4% of the 3.1 billion bases in the human genome. We use fosmid clones to obtain complete regional haplotype coverage and focus on partial genome coverage to achieve high coverage depth and accuracy for assessed regions.

Version 0.1 of the NA19239 reference set uses variant calls from two technologies: Illumina and SOLiD; and three callers: GATK's Unified Genotyper, FreeBayes and SAMtools. To move from these data to a unified call set we:

Align to GRCh37 reference genome with Novoalign.
Perform post-processing and indel realignment with GATK's IndelRealigner.
Perform variant calling with GATK's UnifiedGenotyper, FreeBayes and samtools mpileup.
Do pairwise comparisons between all technology/caller approaches.
Generate the union of all possible calls and merge with initial GATK calls, recalling any no-call positions at expected sites.
Use validation information on variants found in multiple technologies, plus metrics associated with common variants, to filter the full call set to a final set of trusted calls.

The challenging decisions begin when merging and filtering the final call set. This requires careful bookkeeping and variant representation to ensure identical variants are directly comparable, followed by setting cutoffs for variant inclusion.

Comparison details

The details of variant comparisons introduce an additional layer of complexity during assessment. The approach we've taken is create a normalized set of variants so all comparison differences are due to actual call differences rather than variant representation. We split multiple nucleotide polymorphisms into individual calls, split complex indel-variant combinations, and left-align remaining variants.

For haploid/diploid comparisons, we establish haplotype blocks for the diploid sequence based on phasing provided in the input variant file, and then compare the best matching haplotype to our fosmid reference. Single nucleotide polymorphisms and indels less than 30bp require exact machines between two comparison genomes. Larger indels and structural variations receive more flexible matching with confidence intervals around start and end coordinates.

The goal of the normalized, compared variants is to reflect real underlying differences in calling approaches relative to how well we can currently resolve variation endpoints.

Comparisons between variation callers

For a concrete example of two different variant calling approaches, below is a table comparing GATK variants against samtools calls for the NA19239 sample, using identically aligned and post-processed BAMs:


concordant: total	160851
concordant: SNPs	136146
concordant: indels	24705
GATK discordant: total	13925
GATK discordant: SNPs	1315
GATK discordant: indels	12610
samtools discordant: total	25368
samtools discordant: SNPs	17247
samtools discordant: indels	8121

The number of discordant variant calls is high, making up 8% of the GATK calls and 14% of the samtools calls, and samtools calls almost 16,000 additional SNPs compared to GATK. As a result, a large percentage of variants require making hard decisions: are those additional calls interesting, real variants in samtools and false negatives in the GATK calls? Or conversely, are they false positives in samtools that GATK correctly excludes?

Comparisons between sequencing technologies

There is a similar level of discrepancy when comparing variant calls between Illumina and SOLiD sequencing. Below is a comparison between GATK Unified genotyper calls on the two technologies:


concordant: total	135263
concordant: SNPs	122267
concordant: indels	12996
Illumina discordant: total	39491
Illumina discordant: unique	7079
Illumina discordant: SNPs	15188
Illumina discordant: indels	24303
SOLiD discordant: total	16022
SOLiD discordant: unique	3800
SOLiD discordant: SNPs	3908
SOLiD discordant: indels	12114

Unique coverage explains some differences: 4% of the Illumina variants (7079) and 2.5% (3800) of the SOLiD variants were uniquely covered by the technologies. However, the remaining variant discordant calls are on the order of those seen in the technology comparisons. Adding to the complexity, we find only 84% of the total concordant variants compared to the Illumina only GATK/samtools comparison.

Unified call set

The level of discrepancy between calling methods and sequencing approaches introduces complexity in the preparation of the final call set: How much evidence does a variant need for inclusion? Can single calls be true positives if supported by high confidence values? This will require extensive refinement throughout the public phase. For the initial version 0.1 release of NA19239, we took the following high level approach to filtering:

Retain variants found in 4 out of 6 calling/technology methods (including genotyping data).
Retain variants identified across multiple technologies.
Retain variants found in both more stringent (GATK) and more lenient (FreeBayes, samtools) callers.
Assess remaining variants using a Support Vector Machine with quality score, read depth and variant distance from read ends metrics, training the classifier on likely true and false positives from the pairwise overlap comparisons.

The result is a unified call set of 171,009 variants derived from all technologies and callers, that we're releasing as NA19239 version 0.1.

Comparisons with whole genome datasets

To assess the quality of the unified call set, we compared to two public genomes:

Complete Genomics's NA19239 variants from their public whole genome datasets,
An Illumina whole genome dataset for NA19239 at 30x coverage.

This provides us with three independent call sets to assess variability between approaches. To provide a baseline, here is the comparison of the Illumina and Complete Genomics calls in our assessment regions:


Overall genotype concordance	98.47
concordant: total	205868
concordant: SNPs	186365
concordant: indels	19503
Illumina discordant: total	31267
Illumina discordant: SNPs	19334
Illumina discordant: indels	11933
Complete Genomics discordant: total	15174
Complete Genomics discordant: SNPs	9586
Complete Genomics discordant: indels	5510

We see familiar discordance rates: 13% of the Illumina calls and 7% of the Complete Genomics calls differ. Since it's diploid versus diploid, this comparison includes all heterozygous variant matches. As a result the numbers in this comparison will be higher, but it is a good order of magnitude approximation for looking at our fosmid reference set versus each individual technology.

Illumina

The comparison against the Illumina whole genome variant calls contains 12% discordant calls in our fosmid reference set, with 79% of those being indel differences. Indels are notoriously more difficult to identify and assess, so this will be an area of increased focus as we move forward:


concordant: total	150420
concordant: SNPs	132604
concordant: indels	17816
fosmid discordant: total	19624
fosmid discordant: SNPs	4165
fosmid discordant: indels	15459
Illumina discordant: total	5475
Illumina discordant: SNPs	2952
Illumina discordant: indels	2523

Complete Genomics

The Complete Genomics comparison has 17% discordant calls including 2x more discordant SNP calls. This highlights another key area of call set refinement: identifying and correcting for technology specific calls.


concordant: total	139559
concordant: SNPs	126296
concordant: indels	13263
fosmid discordant: total	29883
fosmid discordant: SNPs	10162
fosmid discordant: indels	19721
Complete Genomics discordant: total	7571
Complete Genomics discordant: SNPs	5542
Complete Genomics discordant: indels	1965

Summary

The initial NA19239 public genome for the Genomics X Prize provides unified variant calls based on two sequencing technologies and three calling methods. I've delved into a lot of details on our approaches, challenges and goals with the hopes of encouraging suggestions from other researchers working on these problems. We're especially interested in feedback on these areas of ongoing research:

Digging deeper into potential false positives and negatives: By combining comparison information between the unified callset and external resources, we can identify 17654 fosmid variants (10%) not found in both the Complete Genomics and Illumina datasets. These require additional in-depth analysis to classify as uniquely identified fosmid calls or potential false positives. Similarly, Illumina and Complete Genomics combine to call 1228 variants (0.7%) that are not in the fosmid call set. These need examination to classify as fosmid false negatives, or false positive calls in the individual technologies.
Additional public genomes: We're actively working with teams like the Genome in a Bottle Consortium and Genome Research Consortium to compare with their reference sets and approaches. Our next target public genome is NA12878, used in both of these projects and widely studied.
Improve variant representation and assessment: The variation software framework works hard to make variant representations as uniform as possible. Indels are especially challenging and we welcome practical examples of regions that need additional standardization.
Refine approaches to unifying variant calls: What we learn from the additional inspection of discordant variants can help inform improved approaches to filtering. This is a great opportunity to develop generalized, reusable methods for combining variants from multiple approaches.

The call sets used here are available as public data folders on GenomeSpace:

Public/chapmanb/xprize/NA19239-v0_1 – The combined final call set along with training true/false positives and Illumina/Complete Genomics comparison based potential false positives and negatives.
Public/EdgeBio/PublicData/Release1 – All of the raw input data, including fastq files, BAM alignments and individual variant calls.

Combined with the open source code and configurations, we hope this will provided interested researchers with all the raw materials needed to reproduce and extend these analyses. Your feedback and suggestions are very welcome.

Blue Collar Bioinformatics

Community built tools for biological data analysis