Archives for February 2009

Short Read Aligner Results

February 13, 2009 by Dan Koboldt

As promised, I’m sharing the results of my short read aligners poster from AGBT 2009. There was a healthy amount of interest in this topic at the meeting; I spoke with numerous people who were largely using Maq but wanted to explore other options for short read alignment. Some people were interested in colorspace-capable aligners, presumably for SOLiD data, while others wanted a tool that could align across gaps (indels) in single-end mode.

The Short Read Aligner Comparison

In the end I presented a comparison of ten short read aligners: BFAST, Bowtie, CELL (CLCbio), cross_match, Maq, Novoalign (Novocraft), RMAP, SeqMap, SHRiMP, and SOAP. These are, of course, only a subset of the tools currently available – several people at AGBT asked about other aligners – but they’re a good sampling of different approaches to the same problem.

I focused on three data sets, all of which were based on 36-bp Illumina/Solexa paired-end libraries sequenced here as part of the 1000 Genomes Project:

1 million simulated read pairs from C. elegans
1 million simulated read pairs from Hs36
1 million real Illumina/Solexa read pairs from a YRI sample

Speed and Accuracy of Aligners

One trend was immediately apparent: the aligners that index the reference sequence with the Burrows-Wheeler Transform (Bowtie and SOAP) were consistently faster, especially at the human genome scale. It’s not surprising that Heng Li is now focusing on BWA as opposed to Maq. Another trend I find rather frightening is this: when I introduced SNPs and indels, almost every aligner misplaced (that is, placed uniquely but at the wrong location) 18% of the reads. This has important implications for variant detection, since even single-base changes can have a dramatic effect.
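For readers who want to run the same kind of check on their own simulated reads, here is a minimal Python sketch of how a "misplaced" rate could be tallied. It is not the code behind the poster; the function names, the tuple layout for an alignment, and the 5-bp position tolerance are all assumptions made for illustration.

```python
from collections import Counter

LABELS = ("correct", "misplaced", "ambiguous", "unplaced")


def classify_alignment(true_chrom, true_pos, aln, tolerance=5):
    """Label one simulated read.

    aln is None for an unaligned read, otherwise a hypothetical
    (chrom, pos, is_unique) tuple parsed from the aligner's output.
    """
    if aln is None:
        return "unplaced"
    chrom, pos, is_unique = aln
    if not is_unique:
        return "ambiguous"
    # Allow a small offset (assumed 5 bp) so that reads shifted by an
    # indel or soft-clipping are not counted as wrong.
    if chrom == true_chrom and abs(pos - true_pos) <= tolerance:
        return "correct"
    return "misplaced"  # placed uniquely, but at the wrong location


def summarize(truth, alignments, tolerance=5):
    """Return the fraction of reads in each category.

    truth:      read_id -> (chrom, pos) known from the simulation
    alignments: read_id -> (chrom, pos, is_unique); missing = unaligned
    """
    counts = Counter(
        classify_alignment(chrom, pos, alignments.get(read_id), tolerance)
        for read_id, (chrom, pos) in truth.items()
    )
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


# Toy data: four simulated reads, one of each outcome.
truth = {
    "r1": ("chrI", 1000),
    "r2": ("chrI", 5000),
    "r3": ("chrII", 200),
    "r4": ("chrII", 900),
}
alignments = {
    "r1": ("chrI", 1002, True),   # within tolerance of the true position
    "r2": ("chrIII", 77, True),   # unique but wrong
    "r3": ("chrII", 200, False),  # right place, but not a unique hit
}
rates = summarize(truth, alignments)
print(rates)
```

The distinction that matters for variant calling is the one between "misplaced" and the other failure modes: an unplaced or ambiguous read can simply be ignored, while a confidently misplaced read carries its mismatches to the wrong locus.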

Aligners that Disappointed

I noticed very quickly that SHRiMP was the slowest aligner, and given that it’s geared towards SOLiD (a platform we’re not heavily invested in), it was easy to drop SHRiMP from consideration. Another downer we knew about in advance was RMAP, whose authors abandoned their project after getting the publication out.

Aligners that Surprised Me

There’s a close relationship between our genome center and David Gordon / Phil Green, which no doubt accounts for why cross_match continues to be used. Months ago Gordon came and touted the latest CM revision as faster than Maq. Of course, he pointed out that you had to adjust several parameters from their default settings to make this happen. Did it turn out to be true? Not necessarily – in a few tests CM was faster than Maq and in a few it was slower, though the two were always comparable. Where cross_match shone was sensitivity at mutated sites – in the C. elegans simulation, it successfully placed more reads in single-end mode than the other aligners.

Most Promising Aligners

Maq is our current tool of choice, but we’re looking closely at Novoalign (a Maq-like tool with more sensitivity), as well as Bowtie/SOAP for speed considerations. For details, see the “Short Read Aligners” section of my blog.

Genomics Bloggers Meet at AGBT

February 9, 2009 by Dan Koboldt

One of my favorite parts of AGBT was the bloggers’ luncheon, in which four of us from the genomics blogging community got together to talk shop. We had Daniel MacArthur of Genetic Future, Anthony Fejes of Fejes.ca, David Dooling of PolITigenomics, and me, Dan Koboldt of MassGenomics.

Daniel MacArthur (Genetic Future), Anthony Fejes (Fejes.ca), Dan Koboldt (MassGenomics), and David Dooling (PolITigenomics)


It was an impromptu but very enjoyable meeting where I learned that I’m probably the least serious about all of this blogging. We all agreed that Daniel MacArthur is the most prolific among us, and as I found out in the course of AGBT, he’s probably the best known as well. He and David Dooling had an intimate knowledge of their blog statistics and readership. I have much to learn in that vein.

Anthony Fejes of the self-titled Fejes.ca started his blog to write not about genomics but about photography, a keen interest of his. David Dooling, of course, covers the three areas that give his blog its title: politics, IT, and genomics.

Friendly competition aside, I was intrigued to learn that we all face similar challenges. Perhaps foremost among them is knowing just how much or how little to say about the work that we’re doing. We are all scientists, of course, and thus keen to share knowledge with others. Yet we do work in a competitive field, one in which there are never as many grants as grantees, never as many publications as submitted manuscripts.

Fortunately, at our little roundtable, there were no adversaries for the NIH budget – Dooling and I work together, Fejes is in Canada, and MacArthur is on the other side of the Atlantic. For that reason, perhaps, we could talk openly, learn from one another, and finally, put a face to the names we already knew so well.

AGBT Day 3: Pharmacogenetics and Capture

February 7, 2009 by Dan Koboldt

The third day at AGBT finally saw decent weather, with temperatures pushing into the mid-sixties. That’s almost as high as the atypically warm weather St. Louis is having right now. It’s just not fair. Fortunately, the sessions were exciting enough to keep most of us indoors anyway.

Howard McLeod on Pharmacogenomics

There was a provocative and entertaining talk by Howard McLeod (of UNC, formerly of WashU) on using the genome to optimize drug therapy. He opened with the challenges faced by clinicians who treat their patients with drugs – there are multiple active regimens for most diseases, and the right one is prescribed only 50% of the time. Complications like variation in response and unpredictable toxicity add to the difficulty of optimizing patient treatments. While the speaker emphasized that DNA is not the only answer, he admitted that it does seem to be getting the most traction.

He went on to describe my favorite poster-child of pharmacogenetics – warfarin, an oral anticoagulant prescribed to more than 2 million patients per year – in which the variation in maintenance dose can be as high as 50-fold from patient to patient. One of the pioneers of warfarin pharmacogenetics is Brian Gage, a colleague of mine at Washington University. In 2007, warfarin became the first drug for which the FDA added genetic information to the label: genotypes in VKORC1 (the target) and CYP2C9 (the metabolizer), which, together with age, body size, and drug interactions, explain nearly 60% of the variability in warfarin dose. There’s a free tool (http://www.WarfarinDosing.org) that lets clinicians incorporate all of these data into a dose decision engine.

Capture Technologies Abound

Richard Gibbs and Matt Bainbridge gave talks concerning Baylor’s well-known NimbleGen-based capture system. Most of the groups running capture now do so on the HapMap/1000 Genomes pilot samples, which are especially powerful because of the broad genotyping information available from the HapMap data sets. A highlight of the sessions for me was the presentation of WashU’s solution-based capture platform, WU-CAP, which performed comparably to NimbleGen solid-phase capture and is also easy to sequence on Illumina. What’s incredible about all of the capture platforms presented is that they’re reporting dbSNP concordances in the 80-95% range, which is higher than I typically see in whole-genome and PCR-targeted sequencing. Since variant prediction is my main area of interest, you can guess how excited I am about these platforms.
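The talks did not spell out how dbSNP concordance was computed, so the sketch below assumes the simplest reading: the fraction of predicted SNP positions that are already present in dbSNP. The function name and the (chromosome, position) keys are hypothetical, and a real comparison would also have to agree on genome build and, ideally, on alleles.

```python
def dbsnp_concordance(called_sites, dbsnp_sites):
    """Fraction of called SNP sites that are already known to dbSNP.

    Both arguments are iterables of (chrom, pos) tuples on the same
    genome build. Returns 0.0 when nothing was called.
    """
    called = set(called_sites)
    if not called:
        return 0.0
    known = set(dbsnp_sites)
    return len(called & known) / len(called)


# Toy example: five predicted SNPs, four of them already in dbSNP.
called = [("chr1", 1000), ("chr1", 2500), ("chr2", 300),
          ("chr2", 9100), ("chr7", 42)]
dbsnp = [("chr1", 1000), ("chr1", 2500), ("chr2", 300),
         ("chr7", 42), ("chrX", 5)]
print(f"dbSNP concordance: {dbsnp_concordance(called, dbsnp):.0%}")
```

Under this definition a high value suggests that most calls are real, previously observed variants; it says nothing about how many known variants were missed.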

Daniel MacArthur Also Delivers

I can at last offer unbiased scientific endorsement of my blogging colleague Daniel MacArthur (Genetic Future), who presented some of his work on variation around the ACTN3 gene. The story of one variant (R577X) was particularly fascinating. Evidently the 20% of people who are homozygous for the variant (XX) exhibit a marked reduction in muscle strength and explosive power. As such, they’re highly under-represented among Olympic sprinters, for example. Yet there’s a trade-off in that XX individuals have better muscle endurance and faster recovery, and as you could guess, the allele is very common among long distance runners. Well done, Daniel, fascinating stuff.

Buying 21 Illumina/Solexa Sequencers

February 6, 2009 by Dan Koboldt

Not me, the center. It became official this morning: The Genome Center at Washington University will buy 21 additional Illumina GAII sequencers.

Illumina must be absolutely loving this conference. They presented some very impressive plans for improvements (reads out to 250 bp) by the end of 2009, and apparently WashU decided to double down on the platform.

This will bring our total count to something like 35 Illumina/Solexa machines, which officially puts us ahead of the Wellcome Trust Sanger Institute (24). With this capacity we’ll be able to sequence an entire human genome to 25X coverage in a single day. Incredible.
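To put that 25X figure in perspective, here is the back-of-the-envelope arithmetic as a short Python sketch. The 3.1 Gb genome size is my assumption (the machine count and coverage come from the numbers above), and the calculation ignores unmapped reads, duplicates, and the fact that a single run takes several days, so treat the result as a rough aggregate rate rather than a spec.

```python
def required_bases(genome_size_bp, coverage):
    """Total sequenced bases needed to reach a given average coverage."""
    return genome_size_bp * coverage


def per_machine_yield(total_bases, machines):
    """Bases each machine must contribute if the load is split evenly."""
    return total_bases / machines


GENOME_SIZE_BP = 3_100_000_000  # assumed size of the human genome (~3.1 Gb)
COVERAGE = 25                   # target average depth from the post
MACHINES = 35                   # approximate fleet size from the post

total = required_bases(GENOME_SIZE_BP, COVERAGE)
per_machine = per_machine_yield(total, MACHINES)

print(f"{total / 1e9:.1f} Gb of sequence per day in total")
print(f"{per_machine / 1e9:.2f} Gb per machine per day")
```

In other words, the claim works out to roughly 77.5 Gb per day across the fleet, or a little over 2 Gb per machine per day under these assumptions.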

As part of the medical genomics analysis team, I was immediately thrilled at the news. Then, I remembered that someone has to analyze all of that data…

Good thing we have Dave Larson. Yeah, I’m looking at you, buddy!