As promised, I’m sharing the results of my short read aligners poster from AGBT 2009. There was a healthy amount of interest in this topic at the meeting; I spoke with numerous people who largely were using Maq, but wanted to explore other options for short read alignment. Some people were interested in colorspace-capable aligners, presumably for SOLiD data, while others wanted a tool that could align across gaps (indels) in single-end mode.
The Short Read Aligner Comparison
In the end I presented a comparison of ten short read aligners: BFAST, Bowtie, CELL (CLCbio), cross_match, Maq, Novoalign (Novocraft), RMAP, SeqMap, SHRiMP, and SOAP. These are, of course, only a subset of the tools currently available – several people at AGBT asked about other aligners – but they’re a good sampling of different approaches to the same problem.
I focused on three data sets, all of which were based on 36-bp Illumina/Solexa paired-end libraries sequenced here as part of the 1000 Genomes Project:
- 1 million simulated read pairs from C. elegans
- 1 million simulated read pairs from Hs36
- 1 million real Illumina/Solexa read pairs from a YRI sample
Speed and Accuracy of Aligners
One trend was immediately apparent: the aligners that use Burrows-Wheeler Transform indexing of the reference sequence (Bowtie & SOAP) were consistently faster, especially at human genome scale. It’s not surprising that Heng Li is now focusing on BWA rather than Maq. Another trend I find rather frightening is this: when I introduced SNPs and indels, almost every aligner mis-placed (that is, placed uniquely but at the wrong location) 18% of the reads. This has important implications for variant detection, since even single-base changes can have a dramatic effect.
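To see how a single SNP can yield a unique-but-wrong placement (rather than just a failed alignment), here is a toy sketch of my own – made-up sequences and a naive best-Hamming-distance “aligner,” not any of the benchmarked tools. A reference with two near-identical repeat copies places an unmutated read at its true origin, but the same read carrying one SNP matches the other copy perfectly and lands there instead:

```python
# Toy illustration (not a real aligner): place a read at the position
# with the fewest mismatches against the reference.

def best_placement(read, ref):
    """Return (position, mismatches) of the best-scoring placement."""
    hits = []
    for i in range(len(ref) - len(read) + 1):
        mm = sum(a != b for a, b in zip(read, ref[i:i + len(read)]))
        hits.append((mm, i))
    hits.sort()
    return hits[0][1], hits[0][0]

# Two copies of a 40-bp segment differing at a single base (offset 20).
copy_a = "ACGTTGCAATGGCCATAGCATAGCTTACGGATCCATTGCA"
copy_b = copy_a[:20] + "G" + copy_a[21:]   # same segment, T->G at offset 20
ref = copy_a + "TTTTTTTTTT" + copy_b        # copy_b starts at position 50

read = ref[10:30]                           # 20-bp read from copy_a, spans the diff
print(best_placement(read, ref))            # -> (10, 0): true origin

# One SNP in the read (the very base that distinguishes the copies)
# makes copy_b a perfect match: zero mismatches, wrong locus.
snp_read = read[:10] + "G" + read[11:]
print(best_placement(snp_read, ref))        # -> (60, 0): mis-placed
```

Real aligners are far more sophisticated, but the underlying hazard is the same: a variant can tip a read from one repeat copy to another.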
Aligners that Disappointed
I noticed very quickly that SHRiMP was the slowest aligner, and given that it’s geared toward SOLiD (a platform we’re not heavily invested in), it was easy to drop SHRiMP from consideration. Another downer we knew about in advance was RMAP, whose authors abandoned the project after getting the publication out.
Aligners that Surprised Me
There’s a close relationship between our genome center and David Gordon / Phil Green, which no doubt accounts for why cross_match continues to be used. Months ago, Gordon came and touted the latest CM revision as faster than Maq, though he pointed out that you had to adjust several parameters from their default settings to make this happen. Did it turn out to be true? Not necessarily – in a few tests CM was faster than Maq, in a few it was slower, though the two were always comparable. Where cross_match shone was sensitivity at mutated sites: in the C. elegans simulation, it successfully placed more reads in single-end mode than the other aligners.
Most Promising Aligners
Maq is our current tool of choice, but we’re looking closely at Novoalign (a Maq-like tool with more sensitivity), as well as Bowtie/SOAP for speed considerations. For details, see the “Short Read Aligners” section of my blog.
Matt says
Very helpful, thanks! Did you compare memory requirements?
Also, would you expect each of the aligners to scale by the same factor with larger dataset size?
Ben Langmead says
This is great, Dan! Very fair and timely. I continue to promise paired-end mode “soon” for Bowtie, though it won’t have escaped your notice that I didn’t make my AGBT deadline… I’m working on it right now, though.
Thanks,
Ben
Ben Berman says
Just wanted to thank you very much! It would be useful to include memory requirements, as we often run multiple processes in parallel and need to know how many can be multiplexed. Thanks again!
Ben Berman says
Did you make an attempt to use mapping scores when available? For instance, in Maq, repetitive or ambiguous reads get reported with low mapping quality scores, often 0. Did you count these in the “ambiguous” count? Reads that are “somewhat” ambiguous will have scores between 1 and 10. I know Maq is a bit different in that it combines quality of mapping with uniqueness of mapping in one score.
Ben Berman says
Hey Dan, sorry for all the comments. I was thinking about your scary 18% misalignment rate, since it seems to be in such disagreement with Heng Li’s published simulation data from the X chromosome (he reports something like a 0.01 misalignment rate with a Q30 mapping quality cutoff, and he introduced SNPs at a 0.001 rate along with indels).
I re-read your description and noticed that you actually had the Maq simulation introduce its own variation, including SNPs at the 0.001 rate. So your SNP rate is already 0.002, correct, since you introduced your own. And also presumably your graph where you show that most of the misalignments are of “unknown” origin might actually be from SNPs introduced in the simulation. It’s kind of hard to imagine that doubling the SNP rate would explode the mis-alignment rate, but it’s possible since Maq and some of the other aligners barf if the read has more than 2 mismatches. In any case, it would be interesting to try your simulation without including extra SNPs in the simulation (Maq simulation has parameters for this).
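On the “more than 2 mismatches” point, here is a quick back-of-envelope calculation of my own (the rates are illustrative, not from the poster): treating each base of a 36-bp read as independently mismatching the reference at rate p, the chance a read exceeds a 2-mismatch limit is a binomial tail:

```python
# Back-of-envelope: chance a 36-bp read carries more than 2 mismatches,
# assuming each base mismatches the reference independently at rate p.
from math import comb

def p_more_than_2_mismatches(n, p):
    """P(X > 2) for X ~ Binomial(n, p)."""
    p_le_2 = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3))
    return 1 - p_le_2

for p in (0.001, 0.002, 0.01, 0.02):
    print(f"p={p}: {p_more_than_2_mismatches(36, p):.2e}")
```

For small p the tail is roughly C(36,3)·p³, so doubling the SNP rate from 0.001 to 0.002 multiplies it by about eight – but it stays tiny for 36-bp reads, which supports the intuition that the extra SNPs alone shouldn’t explode the misalignment rate unless sequencing errors push the effective per-base rate much higher.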
Anyway, I really appreciate this work, it’s great! A good way to do this benchmarking might be to organize a kind of blind test, where you prepare a blind simulation dataset where the actual locations are masked out, and then you send the reads to each development group and they can do the processing themselves since they are experts with the parameters.
thanks,
ben.
MB says
1. My major concern is the “incorrect” category. I guess this includes reads that cannot be mapped confidently, but in practice we seldom use these reads and therefore whether they are incorrect does not matter. If this category does not include repetitive hits, then the error rate seems too high to me. Maybe C. elegans is special?
2. On real data, another measurement is the fraction of PE reads that can be mapped consistently.
3. I am a little surprised that CLC is no faster than maq. In its white paper, it says it is much faster than SOAPv1 and maq.
4. Additional note on SOAPv2: in the PE mode, only reads that can be properly paired are output, while for maq and novoalign, singletons are also output if I am right. SOAP-PE should not map fewer than SOAP.
5. I do not quite understand why CM finds more correct alignments than novo-PE. I can understand that CM performs local alignment, but that alone would not explain it.
Anyway, a great benchmark. Thank you very much!
Ben Berman says
You might link to this: Heng has also put up some of his own benchmarking results, in which he includes 70-bp and 125-bp reads, plus memory requirements:
http://maq.sourceforge.net/bwa-man.shtml
Ryan M says
That ~18% of reads with errors/SNPs get misplaced is a sobering and frightening thought. I can see why this would happen with single-end reads, but I would expect paired data to be much more robust to this (for aligners that handle paired data). Have you (or any of your readers) investigated that?
Thanks,
Ryan
Dan Koboldt says
Thanks, all, for the comments and feedback. Ben and MB raise some interesting points about the apparent 18% mis-alignment. I did use alignment quality (for Maq) to distinguish between uniquely mapped and ambiguously mapped reads.
I, too, find the 18% number disturbing, and am investigating this in some detail. It may in fact be related to the fact that the genome was mutated twice: first, when I created a mutated reference sequence, and second, when I simulated reads with Maq and allowed a certain amount of variation. It could be that by introducing 10,000 indels, Maq changed the length of the reference sequence sufficiently that numerous read positions were no longer accurate based on the un-mutated reference.
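The coordinate-shift effect described above is easy to see in miniature. This is a toy sketch with a made-up sequence, not the actual simulation: a single 1-bp insertion in the mutated reference shifts every downstream position, so a read’s simulated position on the mutated copy no longer matches the same locus on the original reference.

```python
# Toy sketch of the coordinate-shift effect (hypothetical sequence):
# one 1-bp insertion offsets all downstream coordinates by 1.

original = "ACGTTGCAATGGCCATAGCATAGCTTACGGATCCATTGCA"
ins_at = 15
mutated = original[:ins_at] + "T" + original[ins_at:]  # 1-bp insertion

# Simulate a read from the MUTATED reference, downstream of the insertion.
sim_pos = 25
read = mutated[sim_pos:sim_pos + 10]

# The same sequence sits one base earlier on the ORIGINAL reference,
# so an evaluation keyed to original coordinates would call this "wrong."
true_pos_on_original = original.find(read)
print(sim_pos, true_pos_on_original)   # -> 25 24
```

With 10,000 indels, every read downstream of an indel inherits a cumulative offset like this, which could easily account for a large block of apparent mis-placements.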
More details to follow….
~Dan Koboldt, MassGenomics
Daniel Brewer says
This post is really interesting, thanks for producing it. I was wondering whether it would be possible for you to do a post on the analysis workflow that you use – in particular, the maq map settings, etc. I am just starting to analyse a Solexa small RNA project and I am finding it difficult to work out the best way to use these aligners.
Shawn says
I have a question: I have used the CELL quite a bit and have noticed that it is faster than Maq and SOAP. I was wondering if you used the memory option to include more RAM. Also, what kind of hardware are you using? Some, if not all, of these algorithms can be run with multiple threads. I would like to see more information about your benchmarks – what parameters are you using? Gapped or ungapped? What are the mismatch and deletion costs, and what limits did you set? Thanks in advance
Shawn