Overview
The post discusses work validating multiple cancer variant callers in bcbio-nextgen using a synthetic reference call set from the ICGC-TCGA DREAM challenge. We've previously validated germline variant calling methods, but cancer calling is additionally challenging. Tumor samples have mixed cellularity due to contaminating normal sample, and consist of multiple sub-clones with different somatic variations. Low-frequency sub-clonal variations can be critical to understand disease progression but are more difficult to detect with high sensitivity and precision.
Publicly available whole genome truth sets like the NA12878 Genome in a Bottle reference materials don't currently exist for cancer calling, but other groups have been working to prepare standards for use in evaluating callers. The DREAM challenge provides a set of synthetic datasets that include cellularity and multiple sub-clones. There is also on ongoing DREAM contest with real, non-simulated data. In addition, the ICGC has done extensive work on assessing the challenges in cancer variant calling and produced a detailed comparison of multiple variant calling pipelines. Finally, Bina recently released a simulation framework called varsim that they used to compare multiple callers as part of their cancer benchmarking. I'm excited about all this community benchmarking and hope this work can contribute to the goal of having fully consented real patient reference materials, leading to a good understanding of variant caller tradeoffs.
In this work, we evaluated cancer tumor/normal variant calling with synthetic dataset 3 from the DREAM challenge, using multiple approaches to detect SNPs, indels and structural variants. A majority voting ensemble approach combines inputs from multiple callers into a final callset with good sensitivity and precision. We also provide a prioritization method to enrich for somatic mutations in tumor samples without matched normals.
Cancer variant calling in bcbio is due to contributions and support from multiple members of the community:
- Miika Ahdesmaki and Justin Johnson at AstraZeneca collaborated with our group on integrating and evaluating multiple variant callers. Their financial support helped fund our time on this comparison, and their scientific contributions improved SNP and indel calling.
- Luca Beltrame integrated a number of cancer variant callers into bcbio and wrote the initial framework for somatic calling.
- Lorena Pantano integrated variant callers and performed benchmarking work. Members of our group, the Harvard Chan school Bioinformatics core, regularly contribute to bcbio development and validation work.
- The Wolfson Wohl Cancer Research Centre supported our work on validation of somatic callers.
- James Cuff, Paul Edmon and the team at Harvard FAS research computing provided compute infrastructure and support that enabled the large number of benchmarking runs it takes to get good quality calling.
I'm continually grateful to the community for all the contributions and support. The MuTect commit history for bcbio is a great example of multiple distributed collaborators working towards the shared goal of integrating and validating a caller.
Variant caller validation
We used a large collection of open source variant callers to detect SNPs, Indels and structural variants:
- MuTect (1.1.5) – A SNP only caller, built on GATK UnifiedGenotyper, from the Broad Institute. MuTect requires a license if used for commercial purposes.
- VarDict (2014-12-15) – A SNP and indel caller from Zhongwu Lai and the oncology team at AstraZeneca.
- FreeBayes (0.9.20-1) – A haplotype aware realigning caller for SNPs and indels from Erik Garrison and Gabor Marth's lab.
- VarScan (2.3.7) – A heuristic/statistic based somatic SNP and indel caller from Dan Koboldt and The Genome Institute at Washington University.
- Scalpel (0.3.1) – Micro-assembly based Indel caller from Giuseppe Narzisi and the Schatz lab. We pair Scalpel with MuTect to provide a complete set of small variant calls.
- LUMPY (0.2.7) – A probabilistic structural variant caller incorporating both split read and read pair discordance, developed by Ryan Layer in Aaron Quinlan and Ira Hall's labs.
- DELLY (0.6.1) – An integrated paired-end and split-end structural variant caller developed by Tobias Rausch.
- WHAM (1.5.1) – A structural variant caller that can incorporate association testing, from Zev Kronenberg in Mark Yandell's lab at the University of Utah.
bcbio runs these callers and uses simple ensemble methods to combine small variants (SNPs, indels) and structural variants into final combined callsets. The new small variant ensemble method uses a simplified majority rule classifier that picks variants to pass based on being present in a configurable number of samples. This performs well and is faster than the previous implementation that made use of both this approach and a subsequent support vector machine step.
Using the 100x whole genome tumor/normal pair from DREAM synthetic dataset 3 we evaluated each of the callers for sensitivity and precision on small variants (SNPs and indels). This synthetic dataset contains 100% tumor cellularity with 3 subclones at allele frequencies of 50%, 33% and 20%.
In addition to the whole genome results, the validation album includes results from running against the same dataset limited to exome regions. This has identical patterns of sensitivity and precision. It runs quicker, so is useful for evaluating changes to filtering or program parameters.
We also looked at structural variant calls for larger deletions, duplications and inversions. Here is the precision and sensitivity for duplications across multiple size classes:
The full album of validation results includes the comparisons for deletion and inversion events. These comparisons measure the contributions of individual callers within an ensemble approach that attempts to maximize sensitivity and specificity for the combined callset. Keep in mind that each of the individual results make use of other caller information in filtering. Our goal is to create the best possible combined calls, rather than a platform for unbiased comparison of structural variant callers. We're also actively working on improving sensitivity and specificity for individual callers and expect the numbers to continue to evolve. For example, Zev Kronenberg added WHAM's ability to identify different classes of structural changes, and we're still in the process of improving the classifier.
Improvements in filtering
Our evaluation comparisons show best effort attempts to provide good quality calls for every caller. The final results often come from multiple rounds of improving sensitivity and precision by adjusting program parameters or downstream filtering. The goal of tightly integrating bcbio with validation is that the community can work on defining a set of parameters and filters that work best in multiple cases, and then use these directly within the same framework for processing real data.
In presenting the final results only, it may not be clear that plugging a specific tool into a custom bash script will not always produce the same results we see here. As an example, here are the improvements in FreeBayes sensitivity and precision from our initial implementation, presented over the exome regions of synthetic dataset 3:
The original implementation used a vcfsamplediff based approach to filtering, as recommended on the FreeBayes mailing list. The current, improved, version uses a custom filter based on genotype likelihoods, based on the approach in the speeseq pipeline.
In general, you can find all of the integration work for individual callers in the bcbio source code, broken down by caller. For instance, here is the integration work on MuTect. The goal is to make all of the configuration settings and filters fully transparent so users can understand how they work when using bcbio, or transplant them into their own custom pipelines.
Tumor-only prioritization
The above validations were all done on cancer calling with tumor and normal pairs. The filters to separate pre-existing germline mutations from cancer specific somatic mutations rely on the presence of a normal sample. In some cases, we don't have matched normal samples to do this filtering. Two common examples are FFPE samples and tumor cell lines. For these samples, we'd like to be able to prioritize likely tumor specific variations for followup using publicly available resources.
We implemented a prioritization strategy from tumor-only samples in bcbio that takes advantage of publicly available resources like COSMIC, ClinVar, 1000 genomes, ESP and ExAC. It uses GEMINI to annotate the initial tumor-only VCF calls with external annotations, then extracts these to prioritize variants with high or medium predicted impact, not present in 1000 genomes or ExAC at more than 1% in any subpopulation, or identified as pathenogenic in COSMIC or ClinVar.
Validating this prioritization strategy requires real tumor samples with known mutations. Our synthetic datasets are not useful here, since the variants do not necessarily model standard biological variability. You could spike in biologically relevant mutations, as done in the VarSim cancer simulated data, but this will bias towards our prioritization approach since both would use the same set of necessarily imperfect known variants and population level mutations.
We took the approach of using published tumor data with validated mutations. Andreas Sjödin identified a Hepatoblastoma exome sequencing paper with publicly available sample data and 23 validated cancer related variations across 5 samples. This is a baseline to help determine how stringent to be in removing potential germline variants.
The prioritization enriches variants of interest by 35-50x without losing sensitivity to confirmed variants:
sample | caller | confirmed | enrichment | additional | filtered |
---|---|---|---|---|---|
HB2T | freebayes | 6 / 7 | 44x | 1288 | 56046 |
HB2T | mutect | 6 / 7 | 48x | 1014 | 47755 |
HB2T | vardict | 6 / 7 | 36x | 1464 | 52090 |
HB3T | freebayes | 4 / 4 | 46x | 1218 | 54997 |
HB3T | mutect | 4 / 4 | 49x | 961 | 46894 |
HB3T | vardict | 4 / 4 | 35x | 1511 | 51404 |
HB6T | freebayes | 4 / 4 | 43x | 1314 | 56240 |
HB6T | mutect | 4 / 4 | 51x | 946 | 47747 |
HB6T | vardict | 3 / 4 | 35x | 1497 | 51625 |
HB8T | freebayes | 6 / 6 | 42x | 1364 | 57121 |
HB8T | mutect | 6 / 6 | 47x | 1053 | 48639 |
HB8T | vardict | 6 / 6 | 35x | 1542 | 52642 |
HB9T | freebayes | 2 / 2 | 41x | 1420 | 57582 |
HB9T | mutect | 2 / 2 | 44x | 1142 | 49858 |
HB9T | vardict | 2 / 2 | 36x | 1488 | 53098 |
We consistently missed one confirmed mutation in the HB2T sample. This variant, reported as a somatic mutation in an uncharacterized open reading frame (C2orf57), may actually be a germline mutation in the study sub-population. The variant is present at a 10% frequency in the East Asian population but only 2% in the overall population, based on data from both the ExAC and 1000 genomes projects. Although the ethnicity of the original samples is not reported, the study authors are all from China. This helps demonstrate the effectiveness of large population frequencies, stratified by population, in prioritizing and evaluating variant calls.
The major challenge with tumor-only prioritization approaches is that you can't expect to accurately filter private germline mutations that you won't find in genome-wide catalogs. With a sample set using a small number of validated variants it's hard to estimate the proportion of 'additional' variants in the table above that are germline false positives versus the proportion that are additional tumor-only mutations not specifically evaluated in the study. We plan to continue to refine filtering with additional validated datasets to help improve this discrimination.
Practically, bcbio automatically runs prioritization with all tumor-only analyses. It filters variants by adding a ``LowPriority`` filter to the output VCF so users can readily identify variants flagged during the prioritization.
Future work
This is a baseline for assessing SNP, indel and structural variant calls in cancer analyses. It also prioritizes impact variants in cases where we lack normal matched normals. We plan to continue to improve cancer variant calling in bcbio and some areas of future focus include:
- Informing variant calling using estimates of tumor purity and sub-clonal frequency. bcbio integrates CNVkit, a copy number caller, which exports read count segmentation data. Tools like THetA2, phyloWGS, PyLOH and sciClone use these inputs to estimate normal contamination and sub-clonal frequencies.
- Focusing on difficult to call genomic regions and provide additional mechanisms to better resolve and improve caller accuracy in these regions. Using small variant calls to define problematic genome areas with large structural rearrangements can help prioritize and target the use of more computationally expensive methods for copy number assessment, structural variant calling and de-novo assembly.
- Evaluating calling and tumor-only prioritization on Horizon reference standards. They contain a larger set of validated mutations at a variety of frequencies.
As always, we welcome suggestions, contributions and feedback.