A paper in this month’s Genome Research sheds light on predictors of sequencing error in next-generation sequencing. Using data from both 454 and Illumina platforms, Shen et al. applied logistic regression models to identify sequence- and platform-related factors that contribute to substitution (SNP) errors.
The results, I think, offer new insight into the challenge of accurate variant detection in massively parallel sequencing data. On the 454 platform, four factors were significantly correlated with erroneous SNP calls:
1.) Base quality of the substitution. No surprise there. Quality scores are inversely correlated with sequencing error rate.
2.) Dinucleotide polymorphism events (DNPs). Runs of 2+ consecutive substitutions are enriched for erroneous calls. In particular, so-called “swap-base” events, in which two adjacent positions invert their nucleotides, are evidently the result of “loss of synchrony” in 454 pyrosequencing.
3.) Neighboring quality standard (NQS). Requiring Q>20 at the SNP position and Q>15 at flanking positions (+/- 5bp) improved SNP calling accuracy. This is not a novel idea, as tools like ssahaSNP incorporated NQS into their algorithms a few years ago. However, I’m intrigued by the application of quality “windows” to reduce false positives in next-gen sequencing.
4.) Distance of base from 3′ end of read, normalized against the read length. This is an interesting source of erroneous SNP calls that I personally have observed in both 454 and Illumina datasets.
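To make factors 1 and 3 concrete, here is a minimal sketch (my own illustration, not the authors’ code) of the Phred quality/error-probability relationship and an NQS-style filter using the thresholds described above (Q>20 at the SNP position, Q>15 within +/- 5bp):

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability p,
    via Q = -10 * log10(p)."""
    return 10 ** (-q / 10.0)

def passes_nqs(quals, pos, min_center=20, min_flank=15, window=5):
    """Return True if the base at `pos` meets a neighboring quality
    standard: Q > min_center at the SNP position and Q > min_flank at
    every flanking position within +/- `window` bp (clipped at read
    ends). Thresholds default to the values quoted in the paper."""
    if quals[pos] <= min_center:
        return False
    lo = max(0, pos - window)
    hi = min(len(quals), pos + window + 1)
    return all(q > min_flank for i, q in enumerate(quals[lo:hi], lo)
               if i != pos)
```

For example, a Q20 base corresponds to a 1% error probability, and a single low-quality flanking base is enough to fail the window.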
Not Significant: Homopolymers (!?), GC Content, Sequence Context
What I find most intriguing about this list is what’s missing: homopolymers, which have long plagued 454 pyrosequencing. The authors noted that the new Titanium base-caller performs much better on homopolymers, so much better, in fact, that homopolymer-associated errors failed to reach statistical significance in their analysis. Other factors that were examined but not significant included GC content, sequence context, and substitution type. The Illumina platform shared the same significant factors with one exception: swap-base events were not a significant source of error.
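For readers unfamiliar with swap-base events, they are easy to spot in a gap-free alignment: two adjacent substitutions where the read carries the reference bases in reverse order. A small sketch (my own, not from the paper):

```python
def is_swap_base(ref, read, i):
    """Return True if positions i and i+1 form a "swap-base" event:
    both positions are substitutions, and the read bases are the
    reference bases exchanged (e.g. ref "AG" read as "GA")."""
    return (ref[i] != read[i] and ref[i + 1] != read[i + 1]
            and read[i] == ref[i + 1] and read[i + 1] == ref[i])

def find_swap_base_events(ref, read):
    """Scan an aligned, gap-free ref/read pair for swap-base events,
    returning the starting index of each event."""
    return [i for i in range(len(ref) - 1) if is_swap_base(ref, read, i)]
```

For example, reference `ACGTA` read as `AGCTA` contains one swap-base event at position 1 (the `CG` dinucleotide read as `GC`).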
Variant Caller Comparison: Atlas-SNP2, Maq, and VarScan
The authors applied their logistic regression models to E. coli data and built sets of prior probabilities to tune their SNP caller, Atlas-SNP2. They then compared the performance of Atlas-SNP2 to that of two other tools: VarScan and Maq. The comparison datasets included 454 data (about 1/2 region) and Illumina data (1 lane of 76-bp reads) from a highly characterized strain of S. aureus. The variant calls from each tool were compared to known SNP positions to determine sensitivity and specificity.
| Dataset  | Caller     | Sensitivity | Specificity |
|----------|------------|-------------|-------------|
| 454      | Atlas-SNP2 | 97.6%       | 88.4%       |
| 454      | VarScan    | 97.6%       | 96.8%       |
| Illumina | Atlas-SNP2 | 98.8%       | 99.9%       |
| Illumina | VarScan    | 85.7%       | 99.9%       |
| Illumina | Maq        | 4.8%-88.1%* | 99.9%       |

\* Maq was applied with various values of -D, from 100 (default) to 618.
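For reference, the textbook formulas behind the table’s columns (the paper doesn’t spell out its exact counts, so this is just the standard definition, not the authors’ calculation):

```python
def sensitivity(tp, fn):
    """Fraction of true SNP positions recovered: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of non-variant positions correctly left uncalled:
    TN / (TN + FP)."""
    return tn / (tn + fp)
```

Note that in a genome-sized denominator of true negatives, even 99.9% specificity can still mean thousands of false-positive calls.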
Please correct me if I’m wrong, but it seems that VarScan out-performed Atlas-SNP2 on 454 data. Baylor is (in my opinion) a leader in 454 sequencing, having published the first whole genome on that platform, so I’m pretty pleased. As one would expect, Atlas-SNP2 did shine in one area: the high-read-depth Illumina data. The authors’ tuning eventually produced higher sensitivity than VarScan or Maq while maintaining similar specificity. While I acknowledge this achievement, I’m a little wary of any results that report 99.9% specificity for SNP calling in Illumina data.
Will Atlas-SNP2 Be Widely Adopted?
There do appear to be benefits of tuning variant detection software to account for platform-specific sources of error. It’s my opinion, however, that Atlas-SNP2 must overcome two obstacles prior to widespread adoption by the NGS community:
- Ruby implementation. The software package comes as a set of Ruby scripts with little documentation; it will take a bioinformatician to run it. That limits the potential for extending the tool and/or integrating it with existing pipelines. Also, scripting languages like Perl and Ruby, while easy to write code in, are unlikely to scale to where NGS throughput is headed. No performance data were included in the study, so we don’t know how long it takes to train Atlas-SNP2 on a sizable dataset.
- Archaic read mapping/alignment pipeline. This is the central limitation of Atlas-SNP2. The authors rely upon the same pipeline – anchoring with BLAT, then genome partitioning, then local alignment with cross_match – that was utilized in the Watson/454 paper. Come on, guys! This may have been feasible for their limited dataset and small bacterial genomes. But here’s the problem with using old-school aligners: BLAT and cross_match won’t scale to the large genomes and Illumina-like throughput (~20m reads per lane). They also cannot use read pairing information. It looks like there may be preliminary support for alternative alignments in SAM format, but only for 454 data.
The paper itself is well-written and clear, but there’s a lot missing. The methods are brief, and don’t describe the versions or parameters of any software that was used. The sequence datasets don’t appear to be publicly available. The numbers of expected and observed SNPs for S. aureus that were used for sensitivity/specificity calculations are not clear. It’s my understanding that there’s no limit on supplemental materials, so this information should have been provided. I’m surprised, in fact, that this publication made it to Genome Research. The novelty and scope are limited, and probably more appropriate for something like BMC Genomics. I say this, admittedly, with just a touch of envy ;-).
Insights into Next-Generation Sequencing
The publication was informative, however, in identifying several of the key factors that contribute to sequencing error. While some of these (quality score, for example) were obvious, the swap-base phenomenon and the NQS window approach are worth consideration for those of us developing variant detection algorithms. The logistic regression training model may also have some value, especially as we continue to generate variant calls in large datasets and submit them for validation.
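As a rough illustration of the modeling idea, here is a toy logistic model that combines per-base features like those discussed above into an error probability. The weights and intercept below are invented for illustration only; they are not the paper’s fitted coefficients, and the feature set is a simplification:

```python
import math

def error_prior(quality, is_dnp, passes_nqs, dist_3p_frac,
                weights=(-0.25, 1.5, -1.0, -2.0), intercept=1.0):
    """Toy logistic model of per-base substitution-error probability.
    Features mirror the significant factors discussed in the post:
    base quality, DNP membership (0/1), NQS pass (0/1), and distance
    from the 3' end as a fraction of read length (bases farther from
    the 3' end are assumed less error-prone). All weights are made up
    for illustration."""
    w_q, w_dnp, w_nqs, w_3p = weights
    z = (intercept + w_q * quality + w_dnp * is_dnp
         + w_nqs * passes_nqs + w_3p * dist_3p_frac)
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) link
```

A high-quality, NQS-passing base far from the 3' end gets a near-zero error prior, while a low-quality base in a DNP near the 3' end gets a much higher one; a real implementation would fit the weights by logistic regression on known-error training data.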
References
Shen Y, Wan Z, Coarfa C, Drabek R, Chen L, Ostrowski EA, Liu Y, Weinstock GM, Wheeler DA, Gibbs RA, & Yu F (2009). A SNP discovery method to assess variant allele probability from next-generation resequencing data. Genome Research. PMID: 20019143
MB says
I have not read the paper, but I believe NQS is useful mainly because base qualities are inaccurate. When base qualities are accurate, we can largely treat each base as independent of the others. As for the adoption of a software package, it seems to me that publishing a tool is easier than getting it widely adopted because: a) a lot more effort is required to make a tool bug-free and user-friendly; and b) it is easy to make a program work well on one data set, but much harder to make it work well on most data sets, which is particularly true for variant callers.
Fuli Yu @ BCM-HGSC says
Thank you for this blog entry on Atlas-SNP2. We have updated the download URL with the most recent version of the software and detailed documentation. Please check again. We hope that the newly released version is easier to use.
http://www.hgsc.bcm.tmc.edu/cascade-tech-software_atlas_snp-ti.hgsc