Short Read Aligners Update at AGBT

January 20, 2009 by Dan Koboldt

Marco Island: Feb 4-8, 2009

I’ll be attending the coveted Marco Island meeting early next month (February 4-8), where I’ll present a poster on my evaluations of short read aligners for next-gen sequencing data. As you might infer from our AML cancer genome paper, Maq has been the central alignment tool here for over a year. This may not always be the case, because the longer reads (75-100 bp) promised by Illumina/Solexa may eventually reach lengths where the maq algorithm no longer has superiority. Heng Li, Maq’s developer, is already working on a new aligner currently in beta that uses the Burrows-Wheeler Algorithm. Has anyone looked at it yet? I’m curious about it, but don’t yet have time.

Alignment Programs Evaluated

Here’s a partial list of the short read aligners that I’m evaluating for this poster. I listed 10 aligners in my accepted AGBT abstract, but I expect there will be some changes.

Maq – obviously we’re evaluating Maq, not only as a benchmark for other aligners, but to better understand the results we’re getting from it. While we run ELAND (Illumina’s aligner) as well, more and more of our runs are paired-end, an area in which Maq is far stronger. Also, have you ever tried to look at ELAND output? It’s incomprehensible to me. No, it’s safe to say that we decided to gamble on Maq over a year ago, and so far, the bet has paid off.
Novoalign – this is Colin Hercus’ alignment tool, already in v2.0. Its speed is at worst comparable to Maq’s (in single-ended mode), and it does offer paired-end alignment. High marks for usability and allowing gaps in single-end alignments.
Bowtie – an aligner from Steven Salzberg’s group that claims to be 35x faster than Maq. My colleague Todd Wylie has evaluated Bowtie in some depth. Sadly, no paired-end mode yet.
cross_match – the classic pairwise aligner has seen some dramatic performance changes to address nextgen data. Still waiting for the usability and documentation to catch up.
RMAP – one of the few aligners (other than Maq/NovoCraft) that makes use of quality scores during alignment, RMAP shows promise. Unfortunately, there have been no updates since the initial release, and I hear through the grapevine that the authors have abandoned the project.
SOAP – this tool has seen the most dramatic changes since I began my evaluation. Initially, I had several problems with SOAP v1 (couldn’t get PE mode to work, for example). And, the practice of scanning reads into memory was rather slow. However, SOAP v2 has significant performance improvements (PE works too) and I see that BGI is also developing SNP and indel callers. This is probably a tool to watch.

Alignment Metrics and Comparisons

So what do I look for in a short read aligner? Obviously speed is a consideration, since we’re generating ever-more-overwhelming amounts of data. Usability and compatability with our in-house platforms (notably Illumina/Solexa) are just as important. And because we have a pipeline in place already, I’m looking for aligners that can beat Maq – in performance, features, or sensitivity – and that’s not easy to do. Maq is fast and does quality-based alignment, single or paired-end, assembly, SNP calling… there’s a reason why the rest of the industry seems to be conforming to it. Furthermore, Maq is well documented and (thus far) consistently updated. The latter point is, I think, a very serious consideration. We have no use for a tool that was developed once just to get a publication and will never see future improvements.

The Advantage of Open Source

Maq is open source, too, which is certainly not a requirement for a next-gen aligner, though it’s a strong selling point. My former colleague Brian Dunford-Shore used to delve into the code of earlier Maq releases when we encountered a problem. Now that the codebase seems to be more robust, it’s still useful to be able to look at the Maq code (and .map file format) to develop our own ancillary tools. It’s safe to say that no matter how good the aligner, we’ll almost certainly use more than one in order to build the most comprehensive pipeline.

AML: A New Era of Cancer Genomics

November 6, 2008 by Dan Koboldt

Next-generation sequencing technologies are said to be ushering in a new era of cancer genomics. A powerful demonstration of the new paradigm for cancer research came out today in Nature. It’s the much-anticipated publication of our AML project, in which we used the Illumina/Solexa platform to sequence the entire genome of a woman who died of acute myeloid leukemia. See my colleague David Dooling’s post on Politigenomics for links to some of the news coverage. This study offers two important milestones to the field:

1.) The first complete genome of a woman. Patient 933124 follows James Watson and J. Craig Venter into the archives of whole-genome history.

2.) The first cancer genome to be completely sequenced on a next-generation platform.

The basic biological problem is simple. This patient had AML. AML is a disease state that was initiated, and driven, by mutations in her genome. Her blood cells should be dominated by the clone of the most effective cancer genome. So, we sequence both tumor and germline DNA. We find any mutations in the tumor that are not in the germline, and identify which of those are novel, protein-coding variants. Among these should be, must be, the mutations that initiated and drove the development of cancer.

Major Informatic Challenges

Yet even with new technologies, this was no simple task. The sheer volume of data generated for this project is what amazes me. It took 98 full Solexa runs (4 libraries totaling ~5.86 billion reads) to reach our target diploid coverage in the tumor, which was 90%. The data was generated over a period of several months, during which both the sequencing technologies and the informatic algorithms were constantly evolving. The AML project offered me my first view of both 454 and Solexa data, and it presented our group with numerous challenges. Disk space. Computing power. Short read alignment. Variant calling. You name it.

The Power of the Unbiased Approach

It seems like a lot of work just for ten mutations. That’s how many validated, somatic, nonsynonymous mutations we found in this AML genome. Eight of these ten mutations, however, implicated new genes that were not previously linked to AML. Four of the genes are in gene families strongly associated with cancer pathogenesis (PTPRT, CDH24, PCLKC, and SLC15A1). The other four genes (KNDC1, GPR123, EBI2, and GRINL1B) are not known to contribute to cancer pathogenesis, but they have potential roles in metabolic pathways that may act to promote cancer growth. These are also four genes that would almost certainly have been excluded from a candidate gene approach.

Mutation Frequencies by 454 Read Count

Another interesting application of massively parallel sequencing was the estimation of mutation frequencies in the tumor sample using the Roche/454 platform. For each of the 10 somatic mutations, as well as 2 germline control variants, we performed PCR-targeted 454 resequencing in samples from the primary tumor, the relapse tumor, and the germline. The idea here was to profile the relative proportions of clonal cells that made up each sample. We got some results that the first author of the study, Tim Ley, described more than once (in our meetings) as “absolutely beautiful.” All of the somatic SNPs were at 50% frequencies in the tumor, as you’d expect for heterozygotes. They hovered slightly lower (around 40%) in the relapse sample, which was known to be less pure (i.e. ~78% blasts), but if you correct for the blast count, they reach 50% as well. Intriguingly, the somatic variants were detected in the germline sample as well at frequencies of 5-13%, suggesting that the skin sample was contaminated by a small fraction of leukemic cells. The one non-beautiful result was FLT3, which had frequencies of around 35% in tumor and 31% in relapse. It may be that the FLT3 ITD mutation was not present in all tumor cells; perhaps it was introduced later than the others.

Yes, We Can Find Indels in Short Reads

One of the significant bio-informatic challenges in which I became intimately involved was the detection of indels, which is theoretically possible but practically very difficult in fragment (non-paired) reads that are only ~36bp long. We ended up combining a few different approaches and found over 700 putative small indels, more than half of which were already in dbSNP. We attempted to validate 28 of these by 3730 sequencing. Two were the previously-known mutations in FLT3 and NPM1. Two were false positives. The other 26 were real, but present in the germline, which was a bummer since we thought they’d be somatic. Those are the breaks. Fortunately, indel detection is one area that will be helped dramatically by improvements to the sequencing technologies, namely longer reads and paired-end protocols.

A New Paradigm for Cancer Genomics

I think that most of all, this work was important because it established the feasibility of sequencing entire genomes with massively parallel / short read technologies and getting valuable results from it. It also drove us to develop and apply new algorithms (like decision trees) to analyze the data. I expect that we’ll begin to see a number of whole-genome-sequencing approaches to the study of cancer and other disease that take advantage of this new paradigm. The question of whether or not we can do science on a whole-genome scale has been answered. In the words of our next president, “yes we can!”

Dave and Decision Trees for NGS

October 15, 2008 by Dan Koboldt

My colleague David Larson just returned from CSHL’s Personal Genomes meeting, where he presented a poster on decision-tree filtering of variant predictions from Illumina/Solexa data. I don’t know much about machine learning, but I can see that it offers a useful approach in at least one aspect of next-generation sequencing.

From my basic understanding, a decision tree is a machine learning algorithm that you “train” on a dataset where the correct decisions are known, and then apply to another dataset in which decisions are not known.

A sample decision tree that uses weather attributes to determine if a game will be played or not. Image Credit: Wikipedia

A sample decision tree that uses weather attributes to determine if a game will be played. Credit: Wikipedia

For example, Dave’s poster described a decision tree that determines whether SNP predictions from Solexa are real (“Germline”) or false-positives (“WildType”). As a training set, Dave used ~650 SNPs whose true status had previously been determined on 3730 sequencing. For each SNP, he provided several attributes (base quality, read count, etc.) as well as the correct “answer (Germline or WildType) as determined by 3730. These inputs went into the c4.5 program which generated a decision tree to distinguish Germline from Wildtype based on these characteristics.

Dave applied the decision tree to whole-genome Solexa data for an individual that we recently sequenced to over 10x coveraged with Solexa fragmented reads. Maq had predicted ~5 million SNPs; the decision tree filter cut this number in half. Even more promising, Dave’s decision tree filter isolated a substantially better data set. Over 90% of the SNPs detected by array-based genotyping were among the Germline-classified SNPs. Concordance with dbSNP, which is one of our measures of specificity, was over 80% the last I heard.

It occurred to me that the decision tree approach has numerous applications for next-generation sequencing analysis. It could be used to distinguish true variants from false positives, or somatic mutations from germline variants. Decision trees might also be informative for short read alignments, where a number of attributes (read length, alignment score, alignment quality, mismatches, etc.) could be used to determine whether or not a read was correctly placed.

After talking with Dave, I spent half a day building decision trees that might be useful for 454 variant detection. One thing I realized very quickly is the importance of the training data set. First, I tried a training set of ~75 variants sequenced by 3730. This was way too small, yielding a tree with one decision (allele frequency) to classify the data. Then, I tried a training set of ~400,000 454 read alignments with several attributes. This was far too much, yielding a massive tree with hundreds of branches. Also, I worry about the correctness of the “answers” in my data sets. While 3730 sequencing is a gold standard, it also has a tendency to miss certain kinds of variants, which might be detected in 454. Real variants, labeled as Wild-Type in the training data set. I think I’ll have to find a larger, more reliable training set before decision trees bear fruit for 454 variant detection.

Short Read Aligners: Maq, Eland, and Others

May 14, 2008 by dkoboldt

This month I’ve come across some interesting statistics on the performance of Maq, Eland, and other short-read alignment tools as applied to Illumina/Solexa data. I took note because these programs are finally being evaluated against appropriate data sets, as opposed to simulated reads or tiny genomes. First the disclaimers: all of these numbers came from people other than myself (see Credits, below), so please forgive any inaccuracies. Also, this entry reflects my personal second-hand impressions of the different alignment tools, and should not be considered an endorsement or criticism of the different alignment tools by the WashU GC.

Short-Read Data Sets at the WashU Genome Center

One of our data sets includes 100+ Solexa runs (non-paired) from the genomic DNA of a single individual. We’ve applied a number of alignment tools to these data: Eland (part of the Illumina suite), Maq (free/open source), SX Oligo Search (proprietary), SlimSearch (proprietary), and even BLAT. Our group (Medical Genomics) is currently leaning toward Maq for read mapping and SNP discovery purposes. There’s recently been a new release of Maq (0.6.5) which seems to run substantially faster:

Metric	Maq 0.6.3	Maq 0.6.5
Average alignment time for normal runs	17.7 hours	9.1 hours
Max alignment time for a normal run	240 hours	28.8 hours
Total number of jobs	2168	1467
Jobs that took longer than 1 day	443	3

The developer of Maq, Heng Li, presented a poster describing the Maq algorithm at CSHL last week and also gave a small workshop talk on issues in short read mapping. He sent these links out to the Maq user list along with a benchmarking comparison of various read mapping tools.

Heng Li’s Comparison of Short-Read Aligners

For the comparison, Heng generated 1 million simulated read-pairs from chromosome X. The numbers themselves are a bit mind-boggling, but fortunately he summarized the results with these notes:

Eland: eland is definitely the fastest, much faster than all the competitors. What is more important, eland gives the number of alternative places, which makes it possible for you to get further information about the repetitive structures of the genome and to select reads that can be really confidently mapped. In addition, with the help of additional scripts, Eland IS able to map reads longer than 32bp. Eland is one of the best software I ever used. It would be even superior if Tony could make it easier to use for a user, like me, who wants to run eland independently of the GAPipeline.

RMAP: the strength of rmap is to use base qualities to improve the alignment accuracy. I believe it can produce better alignment than maq -se because maq trades accuracy for speed at this point (technically it is a bit hard to explain here). Nonetheless, I think rmap would be more popular if its authors could add support for fastq-like quality string which is now the standard in both Illumina and the Sanger Institute (although maybe not elsewhere). rmap supports longer reads, which is also a gain. Furthermore, I did learn a lot from its way to count the number of mismatches.

SOAP: soap is a versatile program. It supports iterative-trimmed alignment, long reads, gapped alignment, TAG alignment and PE mode. Its PE mode is easier to use than eland. In principle, soap and eland should give almost the same number of wrong alignments. However, soap gives 442 more wrong alignments. Further investigation shows that most of these 442 wrong ones are flagged as R? (repeat) by eland.

SHRiMP: Actually I was not expecting that a program using seeding +Smith-Waterman could be that fast. So far as I know, all the other software in the list do not do Smith-Waterman (maq does for PE data only), which is why they are fast. SHRiMP’s normodds score has similar meaning to mapping quality. Such score helps to determine whether an alignment is reliable. The most obvious advantage is SHRiMP can map long reads (454/capillary) with the standard gapped alignment. If you only work with small genomes, SHRiMP is a worthy choice. I think SHRiMP would be better if it could make use of paired end information; it would be even better if it could calculate mapping quality. The current normodds score helps but is not exactly the mapping quality. In addition, I also modified probcalc codes because in 1.04 underflow may occur to long reads and leads to “nan” normodds. However, although my revision fixes the underflow, it may lead to some inaccurate normodds.

Maq: at the moment maq is easier to use than eland. Supporting SNP calling is maq’s huge gain. Its paired end mode is also highly helpful to recover some repetitive regions. Maq’s random mapping, which is frequently misused by users who have not noticed mapping qualities, may be useful to some people, too, and at least it helps to call SNPs at the verge of repeats.

What a nice guy! Here he is, comparing his own tool against several competitors and he manages to praise the strengths of each one. That takes humility.

More Comments from Heng Li

Ken Chen, a colleague of mine, happened to discuss the benchmarking with Heng at Cold Spring Harbor. According to his evaluation, the current version of recently-published SOAP may be somewhat buggy (it had more mapping errors and crashed on paired-end alignment), but is nevertheless promising because it supports gapped alignment and longer reads. Paired-end alignment is perhaps Maq’s greatest strength; the alignment error rate from Maq for paired-end data is significantly reduced. Heng also mentioned that the upcoming new release of Eland will support longer read lengths (>32 bp) and will also calculate mapping quality scores.

Unbiased Comparisons of Short-Read Aligners

In summary, there are a number of competing tools for short read alignment, each with its own set of strengths, weaknesses, and caveats. It’s hard to trust any benchmarking comparison on tools like these because usually, it’s the developers of one of the tools that publish them. Here’s an idea: what if NHGRI, Illumina, or another group put together a short-read-aligning contest? They generate a few short-read data sets: real, simulated, with/without errors, with/without SNPs and indels, etc. Then, the developers of each aligner are invited to throw their best efforts at it. Every group submits the results to a DCC, which analyzes the results in a simple, unbiased way: # of reads placed correctly/incorrectly. # of SNPs/indels detected, missed, or false-positives. The results are published on a web site or in the literature for all to see. Yeah, I know, there are hurdles, like the fact that most proprietary tool developers would probably chicken out of an unbiased head-to-head comparison, given the stakes. But wouldn’t it be nice to know the results? Unless that happens, however, I think Heng’s analysis is about as unbiased as can be.

Credits

WashU GC Maq version comparisons were sent out by Jim Eldred on 5/01/2008. Heng Li’s benchmarking comparison was sent to the Maq user list on 5/12/2008. Additional comments from Heng Li were reported by Ken Chen on 5/12/2008.