VarScan 2 Released on SourceForge

February 4, 2010 by Dan Koboldt

Accurate variant detection in massively parallel sequencing data is a significant bioinformatics challenge. Not only do new sequencers offer unprecedented breadth (whole genome) and depth (30x or more), but they suffer coverage biases and error rates that make variant calling difficult. Last year, we published VarScan, our in-house algorithm for SNP and indel detection on next-generation platforms. NGS analysis has changed somewhat since that time; SAM/BAM format was widely adopted, for one thing, and data throughput has skyrocketed.

To address these issues as well as the requests of many users, we have released VarScan 2 on SourceForge.net. The new version features many improvements and enhancements, including:

SAM/BAM compatibility. Rather than reading various native alignment formats, VarScan now accepts as input the “pileup” format of SAMtools. Since most widely used aligners can be converted to SAM format or output it directly, this makes VarScan compatible with a wide array of tools.
Java implementation. To increase speed and performance in the face of ever-increasing sequencing throughput, we’ve implemented the new VarScan in Java. This also means that it can run on any operating system through the Java virtual machine (VM).
New filtering and comparison tools. VarScan 2 now has commands to limit variants to a list of positions or chromosomal regions, which is useful for targeted sequencing projects. It also has a comparison tool that intersects or merges two sets of variants.
Somatic variant detection. This is the flagship feature of VarScan 2 – given pileup files from a tumor sample and matched control (normal), VarScan calls variants (SNPs and indels) and determines their somatic status (Germline, Somatic, LOH) using heuristic and statistical approaches.

Software and Documentation on SourceForge.net

VarScan joins the ranks of some of the most widely used tools for NGS analysis – Bowtie, Maq, BWA, SAMtools, and Picard – that are hosted on sourceforge.net. The download page, user’s manual, and Java documentation for VarScan are already online. There’s a new wiki site and discussion forum for VarScan as well, to help us developers keep in touch with users and the NGS community. The project page will have information about known issues and new software releases.

You’ll find it all at http://varscan.sourceforge.net.

References

Koboldt DC, Chen K, Wylie T, Larson DE, McLellan MD, Mardis ER, Weinstock GM, Wilson RK, & Ding L (2009). VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics (Oxford, England), 25 (17), 2283-5 PMID: 19542151

Making the Leap: Maq to BWA

December 23, 2009 by Dan Koboldt

Like most curmudgeons I fought the change as stubbornly as I could. Leave Maq behind for something else? Never! Yet over the past few months I have come to realize that BWA, as it’s called, is not bad. At our genome center we still generate both Maq and BWA alignments for Illumina data; thanks to the hard work of our automated pipeline (apipe) group, I never have to run either on my own. Admittedly, even without a learning curve to daunt me, I found myself reluctant to use the BWA results over Maq.

How I Came to Love BWA

Speed was obviously the most compelling argument. Whereas Maq can take 1-2 days to map a single lane of Illumina data to the human genome, BWA can do it in a few hours. Also, BWA outputs directly into SAM format, which we’ve universally (and gratefully) adopted as the standard for next-gen sequencing data. There are, as it turns out, many differences between Maq and BWA, which I’ve summarized in the table below:

Aligner:	Maq	BWA
Algorithm:	Hash lookup table	Burrows-Wheeler Transform
Input format:	Binary FastQ	FastQ
Output format:	MAP file	SAM file
Gapped Alignment:	Paired-end only; one mate must map	Single-end and paired-end
Maximum Read Length:	63 bp	200 bp (aln) or 1,000 bp (dbtwsw)
Multi-threading:	No	Yes
Assembly / SNP calling:	Tools included	None. Use SAMtools.
Simulation Tools:	Tools included.	None.
Continued development:	No	Yes

There are other differences, but these are the ones I find most important. The ability to take standard FastQ input is a plus for BWA, but its native SAM output is critical. Granted, it’s possible to convert the output of several aligners (Maq, Bowtie, NovoCraft, even BLAST) to SAM format, but a direct output is convenient. Compared to Maq, BWA is lean and mean – no assembly, SNP calling, or simulation tools – with the single purpose of mapping reads to a reference, and a reliance on companion software (SAMtools) for subsequent analysis.

Two new features in BWA – gapped alignment and the ability to take longer reads – are a devastating blow to other next-generation sequencing aligners. Now, it’s possible to map single-end reads with gaps (Novoalign) and to map longer 454 reads (SSAHA2). Like Maq, BWA has color-space functions as well.

The take-home message: BWA can map data from all three available next-generation sequencers – Illumina/Solexa, Roche/454, and ABI SOLiD.

The Heng Li Factor

No software tool can remain competitive without continued development. Thus, when I heard that Heng Li did not plan any further releases of Maq, I knew its days were numbered. Some of us had already gotten the impression that Heng was focusing on BWA more than Maq anyway, but that made it official. Maq is still an incredible piece of software, and represented a technological leap for NGS analysis that helped bring about the era of whole-genome sequencing. We were among the first to climb aboard the Maq train, and it took us far. Now, sadly but with eager anticipation, we transferred to the express train, the bullet train. BWA.

References
Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics, 25 (14), 1754-1760 DOI: 10.1093/bioinformatics/btp324

Crossbow: NGS Informatics in the Cloud

November 23, 2009 by Dan Koboldt

Just online at Genome Biology is a new paper from the Steven Salzberg lab (UMD) on searching for SNPs with cloud computing. Using $85 of computing time rented from Amazon’s EC2, Langmead et al processed an entire human genome – 3.3 billion reads totaling 38x coverage – in three hours.

logo_aws

The “Cloud” Can Be Nebulous

Cloud computing is a term bandied about often these days. What it boils down to is this: places with huge banks of computers (Providers, i.e. Amazon) rent out processing time to people who need it (Users). The “cloud” refers to a software layer between providers and users that acts like a virtual operating system – it loads any software needed by the user, and also provides an access point for running highly parallelized tasks on the cluster. Next-gen sequencing data is well suited to this kind of processing, since a large NGS dataset can usually be broken into smaller subsets (i.e. Illumina lanes) and processed at the same time on different computers, without affecting the results.

Map, Sort, Reduce

Crossbow – the cloud computing software featured in this publication – cleverly breaks down the analysis into a series of map, sort, and reduce steps. It takes a large sequencing dataset, breaks the reads into subsets, and maps them to the human genome using Bowtie (map). Then, it divides the 3.2 gigabase human genome into 1,600 non-overlapping 2-megabase partitions and assigns every mapped read to a bin (sort). The SNP caller, in this case SOAPsnp, is applied to each of these smaller bins rather than to the entire genome (reduce).

The Need for Parallelization

The CHB dataset is ~3.3 billion reads, with an average read length of 35 bp. Even with Bowtie’s multi-threading and incredible speed, this massive dataset would take months to process on a single computer. However, the authors divided the input reads into smaller subsets and aligned them in parallel, then processed the 2-Mbp genome “bins” in parallel as well. Throw all of these parallel tasks at Amazon’s Elastic Compute Cloud (EC2), and it eats them up. The high-performance EC2 cluster (40 nodes, each with 8 CPUs and 7 GB of RAM) finished all of the tasks in about 3 hours.

Digging into the Numbers

There are a couple of inconsistencies in the numbers that need to be ironed out. For example, the BGI study reported 36X coverage from 3.3 billion reads (2.02 billion single-end, 658 million paired-end), whereas Langmead et al downloaded 2.7 billion reads from the “YanHuang Site” and noted that it represented 38X coverage. Where did that extra 2X come from? Langmead et al do cite the Nature paper by Wang et al, and I believe it’s the same dataset.

At first I was concerned that the Salzberg group had only downloaded the mapped reads and run them, which would have been a biased test of alignment performance. However, I don’t believe this is the case. Instead, I believe they meant to say that they’d downloaded 2.02 billion single-end reads, and they’d also downloaded 657 million read pairs (1.314 billion paired-end reads). This would yield the correct total of 3.3 billion reads. I realize this is nitpicky.

More of a concern and hopefully less nitpicky are the SNP calling numbers. Langmead et al reported over 21% more SNPs (3.73 million) than BGI did (3.07 million) on the same dataset, and attributed the difference to less stringent filtering. Yet both groups used the same SNP caller, so is it possible that the Bowtie alignment, not the SNP filters, were responsible for what we presume are false positives? This is an important question that Heng Li and others are already considering.

Whole-genome Sequencing Analysis for the Masses

I like the Salzberg group because they’re all about the small lab, about putting NGS processing capabilities into the hands of people without substantial computing resources. Bowtie made it possible to map a lane of Illumina/Solexa data in a few hours, using only a laptop with 4 GB of RAM. Now, Crossbow offers anyone with $85 in their budget to run entire WGS datasets on borrowed (or rented) CPU time. There’s no need to purchase, maintain, or continuously upgrade expensive computing hardware. Even the storage space can be rented (i.e. from Amazon S3, which the authors used). It is literally now possible for someone to analyze an entire human genome while sitting on their laptop at the local coffee house.

References
Ben Langmead, Michael C. Schatz, Jimmy Lin, Mihai Pop and Steven L. Salzberg (2009). Searching for SNPs with cloud computing Genome Biology, 10 (R134) : doi:10.1186/gb-2009-10-11-r134

Short Read Aligners and Variant Detection

November 6, 2009 by Dan Koboldt

In recent weeks I’ve had conversations with many people in the NGS community who are attempting to call variants, accurately, in Illumina/Solexa data. Part of it stems from VarScan, my SNP and indel caller for next-gen sequencing data that works with Bowtie, Novoalign, cross_match, and other aligners.

Another part of it stems from my involvement in 1,000 Genomes Pilot 3, for which several participants have applied their own variant detection pipelines to the same dataset. Last month, Goncalo Abecasis, with input from David Craig, Heng Li, Gerton Lunter, and Fiona Hyland, proposed an exercise comparing several read mappers on real and simulated ABI SOLiD and Illumina/Solexa data. The initial list of aligners – Maq, BWA, Stampy, BFAST, BioScope, and KARMA – demonstrated just how rapidly the field has grown since my aligner comparison last year at AGBT. I’d looked at Maq and BFAST, and knew about (but hadn’t tried) BWA, but the others on the list (Stampy, BioScope, and KARMA) were ones I’d never heard of.

I proposed adding three aligners to the list: Bowtie and Novoalign for Illumina data, and SHRiMP for SOLiD data. My suggestions were politely declined by Richard Durbin (WTSI), who said “In our hands Bowtie doesn’t seem accurate enough for variant calling. It is a great tool for fast assignment of reads for some other purposes. Novoalign is accurate and good, but perhaps a little slow. SHRiMP is also I think very slow.”

Personally, I think that Bowtie works very well for variant calling, I know of several groups who are using it for that exact purpose. And while Novoalign *is* a bit slow, in my experience it’s just as fast as Maq, one of the two aligners out of Durbin’s lab that were already on the list. Of course, Maq remains the most widely used tool for Illumina data (for now), and that’s an important consideration. Most NGS analysts know and love Maq as much as I do.

Balancing Speed and Sensitivity

However, these assessments bring into focus the key issue surrounding short read alignment for variant detection – finding the balance between speed and sensitivity. Bowtie and Novoalign exemplify this well. Bowtie is ultrafast – the fastest short read aligner I’ve used – and maps an entire lane (~15m reads) in just 1-2 hours. Yet in my experience, it places slightly fewer reads than BWA/Maq. And it performs only ungapped alignments, so indels won’t be detected. In contrast, Novoalign typically maps more reads than Maq and BWA, seems very accurate, and remains one of the few aligners to allow gaps in fragment-end reads. In general, my comparisons demonstrated that Novoalign speed is comparable to Maq on typical datasets. However, longer reads and lower-quality data can make Novoalign very slow indeed. The ultimate short read aligner, in my opinion, would have Bowtie-like speed, Novoalign-like sensitivity, and the widespread community support that Maq enjoys.

Ask the Guru: Heng Li

Heng Li, who led development of both Maq and BWA, told me that he’s not worried about sensitivity. “Most aligners nowadays are sensitive enough,” Heng wrote to me in an e-mail this week. “For detecting variations, specificity is of more importance. Nonetheless, how much wrong alignments may contribute to wrong SNPs is an open question. As long as alignment errors are random, more wrong alignments may not necessarily lead to worse SNP calls.” Clearly, he has already given some thought to these issues. If we’re lucky, Heng Li may begin to address these open questions in his new post at the Broad Institute.

Underlying Causes of False Positives

Read mis-alignment would not be a serious problem if it occurred randomly across the genome. The trouble is that wrong alignments don’t seem to be random, at least in my experience. In projects like TCGA Ovarian, we see numerous false positives (particularly in tumors) that seem to arise from read mis-alignment. These typically manifest as clusters of variants, often present in each of a subset of reads whose true alignment is probably a paralogous region of the genome. It’s also possible that they’re caused by an indel, which (as Kiran Garimella of the Broad Institute recently showed) sometimes manifest as clusters of substitutions at several positions near one another. We can aggressively filter these by looking for clusters of predicted SNPs, but even better would be to remove the mis-alignments before variant calling even begins.

Read Mis-Alignment and Paired-End Sequencing

Here at WashU, we have a growing concern that the alignment scores for short reads are continually over-estimated. Often our manual reviewers find that reads supporting false-positives have mate pairs that align to a different chromosome altogether. In the absence of translocation events, when this occurs, one of the two reads is incorrectly placed, and any variant it supports is probably not real. Personally, I’d rather remove both reads in such situations, and rely on correctly mapped read pairs for detection of small variants.

The pervasive spread of paired-end sequencing is beginning to reveal just how often short aligners can get it wrong. The corollary here is that taking read pair information into account during alignment is of critical importance, and those hopeful short read aligners that don’t do it yet (crossmatch, for example) are destined for inferiority.

High-Throughput Sequencing: Speed Matters

Yet what I’m learning from discussions with others in the community – particularly the growing surge of users making the leap from Maq to BWA – is that speed matters. With Illumina machines cranking out 20 gigabases in a single run, and projects like the 1,000 Genomes generating terabytes of sequence over the course of months, we can’t afford to be using the slower aligners, no matter their sensitivity. At worst, we might apply a two-stage approach to alignment that rapidly maps reads that precisely match the reference, and passes only the variant reads to a more sensitive aligner for mapping.

Of course, as a colleague of mine recently joked, by the time we write the perfect aligner, Pac Bio will have come along and sequenced the entire genome, kilobases at a time.