BAM file size reduction and quality score binning
We have a large upcoming whole genome sequencing project with Illumina, and they approached us about delivering BAM files with reduced resolution base quality scores. They have a white paper describing the approach, which involves binning scores to reduce resolution. This reduces the number of scores describing the quality of a base from 40 down to 8.
The advantage of this approach is a significant reduction in file size. BAM files use BGZF compression, and the underlying gzip DEFLATE algorithm compresses based on shared text regions. Reducing the number of quality values increases shared blocks and improves compression. This reduces BAM file sizes by 25-35%: an exome BAM file reduced from 5.7Gb to 3.7Gb after quality binning.
The potential downside is that the reduction in quality resolution may impact alignment and variant calling approaches that rely on base quality scores. To assess this, I implemented quality score binning as part of the bcbio-nextgen analysis pipeline using the CRAM toolkit and ran alignment, recalibration, realignment and variant calling on:
- The original unbinned 40-resolution base quality BAM from an NA12878 exome.
- The BAM binned into 8-resolution base qualities before alignment.
- The BAM binned into 8-resolution base qualities before alignment and binned again following base quality score recalibration.
A comparison of alignment and variant calls from the three approaches indicates that binning has nearly no impact on alignment and a small impact on variant calls, primarily in low depth regions.
Alignment differences
We aligned 100bp paired end reads with Novoalign, a quality aware aligner. Comparison of mapped reads showed nearly no impact on total mapped reads. The plot below shows a generic delta of changes in mapped reads across the 22 autosomes alongside the increase in unmapped pairs. Out of 73 million total reads, the changes account for ~0.003% of the total reads. There also did not appear to be any worrisome patterns of loss for specific chromosomes. Overall, there is a minimal impact of quality score binning on the ability to align the reads.
Variant call differences
We called variants using the GATK Unified Genotyper following the best practice recommendations for exomes and then compared calls from original and binned quality scores. Both approaches for binning -- pre-binning, and pre-binning plus post-quality recalibration binning -- showed similar levels of concordance to non-binned quality scores: 99.81 and 99.78, respectively. Since the additional binning after recalibration provides a smaller prepared BAM file for storage and has a similar impact to pre-binning only, we used this for additional analysis of discordant variants.
The table below shows the discordant differences between the 40 quality score resolution and binned, 8 quality score resolution BAMs. 40 quality discordant variants are those called with full quality score resolution but not called, or called differently, after binning to 8 quality score resolution. Conversely, the 8-quality discordants are those called uniquely after quality binning:
Overall genotype concordance | 99.78 |
concordant: total | 117887 |
concordant: SNPs | 109144 |
concordant: indels | 8743 |
40-quality discordant: total | 821 |
40-quality discordant: SNPs | 759 |
40-quality discordant: indels | 62 |
8-quality discordant: total | 1289 |
8-quality discordant: SNPs | 1240 |
8-quality discordant: indels | 49 |
het/hom discordant | 259 |
We investigated the discordant variants further since 1.5% of the total variant calls change as a result of binning, Of the 1851 unique discordant variants, approximately half (928) fall into reproducible variants identified by looking at ensemble combinations of replicates. Of these potentially problematic discordant variants more than half are in low coverage regions with less than 10 reads:
The major influence of quality score binning is resolution of variants in low coverage regions. This manifests as differences in heterozygote and homozygote calling, indel representation and filtering differences related to quality and mappability. To assess the potential impact, we looked at the loss in callable bases on a 30x whole genome sequence when moving from a minimum of 5 reads to a minimum of 10, using GATK's CallableLoci tool. Regions with read coverage of 5 to 9 make up 4.7 million genome positions, 0.17% of the total callable bases.
5 read minimum | 10 read minimum | |
---|---|---|
Callable bases | 2,775,871,235 | 2,771,109,000 |
Percent callable | 96.90% | 96.73% |
Low coverage | 17,641,980 | 22,404,215 |
No coverage/ poor mapping | 71,272,008 | 71,272,008 |
In conclusion, quality score binning provides a useful reduction in input file sizes with minimal impact on alignment. For variant calling, use additional caution in low coverage regions with less than 10 supporting reads. Given the rapid increases in read throughput that are driving the need for file size reduction, quality score binning is a worthwhile tradeoff for high-coverage recalling work.