Genomics X Prize public phase: reference genome preparation and comparisons to Illumina and Complete Genomics

Brad Chapman bio photo By Brad Chapman Comment

Background

The Archon Genomics X Prize, presented by Express Scripts, is a 10 million dollar competition to establish highly accurate clinical grade sequencing and variation detection methods. Our group at Harvard School of Public Health works with the EdgeBio team on developing the infrastructure for the competition: identify variations in the grading genomes and provide software to compare these reference variation sets against a competitor's list of variations.

The exciting aspect of the Genomics X Prize is that it enables open comparisons between sequencing technologies and variant calling methodologies. Sequencing genomes to the high degree of accuracy sufficient for clinical usage is a difficult, open, problem. Here I'll present detailed numbers comparing variants called by different sequencing technologies and variant callers.

The public phase of the Genomics X Prize starts today, August 15th. The goal of this six month period is to have an open dialog with everyone working in the sequencing and variant calling communities. We want to refine our methods to provide the most accurate and fair variant calling for the reference genomes. To start the discussion we've prepared:

The goal of this writeup, and the X Prize public phase, is to iterate over calling and unification methods to improve our algorithms and approaches. Rather than promoting or disparaging any particular technology or calling method, we're instead providing full transparency and a good-faith effort to combining approaches. Our hope is that this will help engage the community, encourage feedback, and result in a unbiased and accurate set of reference genomes for the competition.

Unification of variant calls

For the August 15th public phase kickoff, we prepared a reference data set of NA19239 based on pooled sequencing of haploid fosmid clones. The callable regions of these clones totaled 129,513,026 total bases, covering ~4% of the 3.1 billion bases in the human genome. We use fosmid clones to obtain complete regional haplotype coverage and focus on partial genome coverage to achieve high coverage depth and accuracy for assessed regions.

Version 0.1 of the NA19239 reference set uses variant calls from two technologies: Illumina and SOLiD; and three callers: GATK's Unified Genotyper, FreeBayes and SAMtools. To move from these data to a unified call set we:

  • Align to GRCh37 reference genome with Novoalign.
  • Perform post-processing and indel realignment with GATK's IndelRealigner.
  • Perform variant calling with GATK's UnifiedGenotyper, FreeBayes and samtools mpileup.
  • Do pairwise comparisons between all technology/caller approaches.
  • Generate the union of all possible calls and merge with initial GATK calls, recalling any no-call positions at expected sites.
  • Use validation information on variants found in multiple technologies, plus metrics associated with common variants, to filter the full call set to a final set of trusted calls.

The challenging decisions begin when merging and filtering the final call set. This requires careful bookkeeping and variant representation to ensure identical variants are directly comparable, followed by setting cutoffs for variant inclusion.

Comparison details

The details of variant comparisons introduce an additional layer of complexity during assessment. The approach we've taken is create a normalized set of variants so all comparison differences are due to actual call differences rather than variant representation. We split multiple nucleotide polymorphisms into individual calls, split complex indel-variant combinations, and left-align remaining variants.

For haploid/diploid comparisons, we establish haplotype blocks for the diploid sequence based on phasing provided in the input variant file, and then compare the best matching haplotype to our fosmid reference. Single nucleotide polymorphisms and indels less than 30bp require exact machines between two comparison genomes. Larger indels and structural variations receive more flexible matching with confidence intervals around start and end coordinates.

The goal of the normalized, compared variants is to reflect real underlying differences in calling approaches relative to how well we can currently resolve variation endpoints.

Comparisons between variation callers

For a concrete example of two different variant calling approaches, below is a table comparing GATK variants against samtools calls for the NA19239 sample, using identically aligned and post-processed BAMs:

concordant: total 160851
concordant: SNPs 136146
concordant: indels 24705
GATK discordant: total 13925
GATK discordant: SNPs 1315
GATK discordant: indels 12610
samtools discordant: total 25368
samtools discordant: SNPs 17247
samtools discordant: indels 8121

The number of discordant variant calls is high, making up 8% of the GATK calls and 14% of the samtools calls, and samtools calls almost 16,000 additional SNPs compared to GATK. As a result, a large percentage of variants require making hard decisions: are those additional calls interesting, real variants in samtools and false negatives in the GATK calls? Or conversely, are they false positives in samtools that GATK correctly excludes?

Comparisons between sequencing technologies

There is a similar level of discrepancy when comparing variant calls between Illumina and SOLiD sequencing. Below is a comparison between GATK Unified genotyper calls on the two technologies:

concordant: total 135263
concordant: SNPs 122267
concordant: indels 12996
Illumina discordant: total 39491
Illumina discordant: unique 7079
Illumina discordant: SNPs 15188
Illumina discordant: indels 24303
SOLiD discordant: total 16022
SOLiD discordant: unique 3800
SOLiD discordant: SNPs 3908
SOLiD discordant: indels 12114

Unique coverage explains some differences: 4% of the Illumina variants (7079) and 2.5% (3800) of the SOLiD variants were uniquely covered by the technologies. However, the remaining variant discordant calls are on the order of those seen in the technology comparisons. Adding to the complexity, we find only 84% of the total concordant variants compared to the Illumina only GATK/samtools comparison.

Unified call set

The level of discrepancy between calling methods and sequencing approaches introduces complexity in the preparation of the final call set: How much evidence does a variant need for inclusion? Can single calls be true positives if supported by high confidence values? This will require extensive refinement throughout the public phase. For the initial version 0.1 release of NA19239, we took the following high level approach to filtering:

  • Retain variants found in 4 out of 6 calling/technology methods (including genotyping data).
  • Retain variants identified across multiple technologies.
  • Retain variants found in both more stringent (GATK) and more lenient (FreeBayes, samtools) callers.
  • Assess remaining variants using a Support Vector Machine with quality score, read depth and variant distance from read ends metrics, training the classifier on likely true and false positives from the pairwise overlap comparisons.

The result is a unified call set of 171,009 variants derived from all technologies and callers, that we're releasing as NA19239 version 0.1.

Comparisons with whole genome datasets

To assess the quality of the unified call set, we compared to two public genomes:

This provides us with three independent call sets to assess variability between approaches. To provide a baseline, here is the comparison of the Illumina and Complete Genomics calls in our assessment regions:

Overall genotype concordance 98.47
concordant: total 205868
concordant: SNPs 186365
concordant: indels 19503
Illumina discordant: total 31267
Illumina discordant: SNPs 19334
Illumina discordant: indels 11933
Complete Genomics discordant: total 15174
Complete Genomics discordant: SNPs 9586
Complete Genomics discordant: indels 5510

We see familiar discordance rates: 13% of the Illumina calls and 7% of the Complete Genomics calls differ. Since it's diploid versus diploid, this comparison includes all heterozygous variant matches. As a result the numbers in this comparison will be higher, but it is a good order of magnitude approximation for looking at our fosmid reference set versus each individual technology.

Illumina

The comparison against the Illumina whole genome variant calls contains 12% discordant calls in our fosmid reference set, with 79% of those being indel differences. Indels are notoriously more difficult to identify and assess, so this will be an area of increased focus as we move forward:

concordant: total 150420
concordant: SNPs 132604
concordant: indels 17816
fosmid discordant: total 19624
fosmid discordant: SNPs 4165
fosmid discordant: indels 15459
Illumina discordant: total 5475
Illumina discordant: SNPs 2952
Illumina discordant: indels 2523

Complete Genomics

The Complete Genomics comparison has 17% discordant calls including 2x more discordant SNP calls. This highlights another key area of call set refinement: identifying and correcting for technology specific calls.

concordant: total 139559
concordant: SNPs 126296
concordant: indels 13263
fosmid discordant: total 29883
fosmid discordant: SNPs 10162
fosmid discordant: indels 19721
Complete Genomics discordant: total 7571
Complete Genomics discordant: SNPs 5542
Complete Genomics discordant: indels 1965

Summary

The initial NA19239 public genome for the Genomics X Prize provides unified variant calls based on two sequencing technologies and three calling methods. I've delved into a lot of details on our approaches, challenges and goals with the hopes of encouraging suggestions from other researchers working on these problems. We're especially interested in feedback on these areas of ongoing research:

  • Digging deeper into potential false positives and negatives: By combining comparison information between the unified callset and external resources, we can identify 17654 fosmid variants (10%) not found in both the Complete Genomics and Illumina datasets. These require additional in-depth analysis to classify as uniquely identified fosmid calls or potential false positives. Similarly, Illumina and Complete Genomics combine to call 1228 variants (0.7%) that are not in the fosmid call set. These need examination to classify as fosmid false negatives, or false positive calls in the individual technologies.
  • Additional public genomes: We're actively working with teams like the Genome in a Bottle Consortium and Genome Research Consortium to compare with their reference sets and approaches. Our next target public genome is NA12878, used in both of these projects and widely studied.
  • Improve variant representation and assessment: The variation software framework works hard to make variant representations as uniform as possible. Indels are especially challenging and we welcome practical examples of regions that need additional standardization.
  • Refine approaches to unifying variant calls: What we learn from the additional inspection of discordant variants can help inform improved approaches to filtering. This is a great opportunity to develop generalized, reusable methods for combining variants from multiple approaches.

The call sets used here are available as public data folders on GenomeSpace:

  • Public/chapmanb/xprize/NA19239-v0_1 – The combined final call set along with training true/false positives and Illumina/Complete Genomics comparison based potential false positives and negatives.
  • Public/EdgeBio/PublicData/Release1 – All of the raw input data, including fastq files, BAM alignments and individual variant calls.

Combined with the open source code and configurations, we hope this will provided interested researchers with all the raw materials needed to reproduce and extend these analyses. Your feedback and suggestions are very welcome.

comments powered by Disqus