Archives for January 2009

Short Read Aligners Update at AGBT

January 20, 2009 by Dan Koboldt

Marco Island: Feb 4-8, 2009

I’ll be attending the coveted Marco Island meeting early next month (February 4-8), where I’ll present a poster on my evaluations of short read aligners for next-gen sequencing data. As you might infer from our AML cancer genome paper, Maq has been the central alignment tool here for over a year. This may not always be the case, because the longer reads (75-100 bp) promised by Illumina/Solexa may eventually reach lengths where the Maq algorithm is no longer superior. Heng Li, Maq’s developer, is already working on a new aligner, currently in beta, that uses the Burrows-Wheeler transform. Has anyone looked at it yet? I’m curious about it, but don’t yet have time.
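For anyone who hasn't run into the Burrows-Wheeler transform before, here is a toy Python sketch of the transform and its inverse. To be clear, this is only an illustration and not how the new aligner works internally: the function names and the "$" end marker are my own assumptions, and sorting every rotation like this would be hopeless on a genome-sized reference, where real tools build a compressed index on top of the transform instead.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column.

    The sentinel (an assumption of this sketch) marks the end of the text and
    must sort before every other character so the transform is invertible.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, sentinel="$"):
    """Rebuild the original text by repeatedly prepending and re-sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Prepend the last column to every row, then sort to restore order.
        table = sorted(char + row for char, row in zip(last_column, table))
    # The row ending in the sentinel is the original text plus the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("GATTACA")
    print(transformed)
    print(inverse_bwt(transformed))
```

The appeal for alignment is that the transformed string groups identical characters with similar context together, so it compresses well and can be searched without ever unpacking the whole reference.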

Alignment Programs Evaluated

Here’s a partial list of the short read aligners that I’m evaluating for this poster. I listed 10 aligners in my accepted AGBT abstract, but I expect there will be some changes.

Maq – obviously we’re evaluating Maq, not only as a benchmark for other aligners, but to better understand the results we’re getting from it. While we run ELAND (Illumina’s aligner) as well, more and more of our runs are paired-end, an area in which Maq is far stronger. Also, have you ever tried to look at ELAND output? It’s incomprehensible to me. No, it’s safe to say that we decided to gamble on Maq over a year ago, and so far, the bet has paid off.
Novoalign – this is Colin Hercus’ alignment tool, already in v2.0. Its speed is at worst comparable to Maq’s (in single-ended mode), and it does offer paired-end alignment. High marks for usability and allowing gaps in single-end alignments.
Bowtie – an aligner from Steven Salzberg’s group that claims to be 35x faster than Maq. My colleague Todd Wylie has evaluated Bowtie in some depth. Sadly, no paired-end mode yet.
cross_match – the classic pairwise aligner has seen some dramatic performance changes to address next-gen data. Still waiting for the usability and documentation to catch up.
RMAP – one of the few aligners (other than Maq and Novoalign) that makes use of quality scores during alignment, RMAP shows promise. Unfortunately, there have been no updates since the initial release, and I hear through the grapevine that the authors have abandoned the project.
SOAP – this tool has seen the most dramatic changes since I began my evaluation. Initially, I had several problems with SOAP v1 (couldn’t get PE mode to work, for example). And, the practice of scanning reads into memory was rather slow. However, SOAP v2 has significant performance improvements (PE works too) and I see that BGI is also developing SNP and indel callers. This is probably a tool to watch.

Alignment Metrics and Comparisons

So what do I look for in a short read aligner? Obviously speed is a consideration, since we’re generating ever-more-overwhelming amounts of data. Usability and compatibility with our in-house platforms (notably Illumina/Solexa) are just as important. And because we have a pipeline in place already, I’m looking for aligners that can beat Maq – in performance, features, or sensitivity – and that’s not easy to do. Maq is fast and does quality-based alignment, single or paired-end, assembly, SNP calling… there’s a reason why the rest of the industry seems to be conforming to it. Furthermore, Maq is well documented and (thus far) consistently updated. The latter point is, I think, a very serious consideration. We have no use for a tool that was developed once just to get a publication and will never see future improvements.

The Advantage of Open Source

Maq is open source, too, which is certainly not a requirement for a next-gen aligner, though it’s a strong selling point. My former colleague Brian Dunford-Shore used to delve into the code of earlier Maq releases when we encountered a problem. Now that the codebase seems to be more robust, it’s still useful to be able to look at the Maq code (and .map file format) to develop our own ancillary tools. It’s safe to say that no matter how good the aligner, we’ll almost certainly use more than one in order to build the most comprehensive pipeline.

Lynda Chin on Mining the Cancer Genome

January 13, 2009 by Dan Koboldt

We had a very interesting talk from Lynda Chin, a researcher from the Dana-Farber Cancer Institute who played a key role in the analysis of our TCGA glioblastoma publication. The attendees were an esoteric group, including key players from our genome center (Rick Wilson, Elaine Mardis, Tim Ley) and every cancer researcher here that I’ve met (Paul Goodfellow, Brian Suarez, Greg Longmore). The presenter’s talk – Mining and Translating the Cancer Genome – came in two logical parts: first, a discussion of “why study the genome” of cancer, and then a fascinating overview of her research into the determinants of cancer metastasis.

Why Study the Cancer Genome?

I wonder if people are still asking this question. Dr. Chin hit us with a few high-profile examples of how studying cancer genetics/genomics has improved our knowledge of and ability to treat the disease. There was the 2004 Science publication by Matt Meyerson’s group showing a correlation between patient mutations in the EGFR gene and their response to gefitinib therapy in lung cancer. A year later in the New England Journal of Medicine, Mellinghoff et al. showed a significant association between PTEN/EGFR co-expression and response to EGFR kinase inhibitors (erlotinib) in glioblastoma. In 2007, Stommel et al. published some interesting (and slightly frightening) findings in Science. They found that certain GBM tumors – particularly those deficient in PTEN – had multiple co-activated receptor tyrosine kinases (RTKs). Treatment with a single RTK inhibitor had little effect in these tumors, but combined therapy to target several RTKs (with Iressa/Tarceva/Gleevec, etc.) decreased signaling, cell survival, and anchorage-independent growth in GBM.

Dr. Chin was also kind enough to acknowledge work by Li Ding’s group on MGMT methylation and PIK3R1 mutations in glioblastoma, part of the pilot TCGA study published a few months ago in Nature. She emphasized several times the value of TCGA’s GBM sequencing as a reference cancer genome, with high-quality tumor samples already profiled for mutations, gene expression, etc.

Current Cancer Therapy: Medical Whack-A-Mole

The speaker likened our current approaches to cancer therapy to the game of Whack-A-Mole; numerous different targeted therapies are applied across a wide spectrum of cancer patients, each therapy proving effective in as few as 40% of patients with the same disease. The other 60% of patients might show no benefit, a possible explanation for the failure of so many Phase III clinical trials. Cancer, she argued, is likely to be the result of a network of interacting genes, pathways, and environmental factors. Only when we functionalize the entire network can we begin to treat individual patients with rational drug combinations based on their unique tumor genome – the promise of personalized medicine.

Model Systems for Cancer Metastasis

The second half of the talk focused on work done in mouse models. Throughout her talk, Dr. Chin championed model systems as critical components to studying cancer. Her group set out to answer the question, how can we distinguish at an early stage if a cancer will metastasize? The ability to do so, obviously, would have major implications for diagnosis and treatment; metastasis is generally what kills cancer patients, and being able to predict who’s at risk can inform the plan for therapy. The idea was to identify genes that were metastasis determinants (MDs) by correlating gene expression with some kind of invasion assay that measured tumor progression. They whittled a 1,597-gene dataset down to 360 MD suspects, which went through a knowledge-based pathway analysis using Ingenuity Pathway Analysis (IPA). The filtered dataset highlighted several key cancer processes, including those involved in cell motility and adhesion.

Metastasis Determinants in Other Cancers

When they narrowed down their list of metastasis determinants to the top 20 or so, things got pretty interesting. It turned out that mutations in these MD candidates were prevalent across many types of cancer. Twelve of their MDs were mutated in breast cancer, around 8 in colon cancer, and so on. One particular gene of interest that Dr. Chin presented was HOXA1, a gene involved in development. When expressed, HOXA1 activated the TGFB network which, according to their IPA analysis, was centered on SMAD3. Thus, the take-home finding was that tumors whose expression profiles show HOXA1 over-expression might benefit from TGFB inhibitors.

Early Metastatic Genes: Probably Oncogenes

The presenter pointed out that genes that confer advantages for metastasis (invasion) alone have no Darwinian selective advantage in early tumor development. Thus, she proposed that many determinants of metastasis are likely to be oncogenes as well. If this is true, then finding mutations that confer both tumorigenesis and metastasis may provide biomarkers for prognosis. Metastatic potential is probably hard-wired, Dr. Chin said, and thus may eventually be predictable from the genetics of the primary tumor.

Genomics-related journals by impact factor

January 7, 2009 by Dan Koboldt

Looking ahead for 2009, I’ve already established that publication will be one of my highest priorities. To motivate myself, I updated my list of top journals for genomics, ranked them by their most recent impact factor (2007), then printed it out and taped it on a cabinet right in front of me.

When I was taping it on top of my older (2004?) list, I happened to notice some interesting changes in impact factor over the past few years. First, I saw that Nature has finally surpassed Science by impact factor (28.75 compared to 26.37). I used to get Science in print and like its format, but ironically, the big publications on my CV are all in Nature. AAAS had better watch out, because Nature Genetics, which I consider the top journal for our field, has moved to within striking distance of Science (up from 24.18 to 25.56).

Next, I was pleased to see open access journals moving up in the ranks, particularly PLoS Genetics (from 7.67 to 8.72) and BMC Genomics (4.03 to 4.18). I noted that the American Journal of Human Genetics (11.09) is still thrashing the European Journal of Human Genetics (4.00). I was disappointed at a slight drop for Human Mutation (6.47 to 6.27), since they published my inaugural first-author paper (don’t worry HuMu, I still love ya!).

Consistently moving up were Pharmacogenetics and Genomics (5.39 to 5.78) and Pharmacogenomics Journal (3.96 to 4.97), a reflection, perhaps, of the growing interest in personalized medicine.
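For the curious, the comparisons above are easy to tabulate. Here is a small Python sketch that ranks the journals by relative change, using only the before/after numbers quoted in this post. The function names are my own, and keep in mind that the "older" values come from a list I only think dates to 2004, so treat the percentages as rough.

```python
# (journal, older impact factor, 2007 impact factor), as quoted in this post
IMPACT_FACTORS = [
    ("Nature Genetics", 24.18, 25.56),
    ("PLoS Genetics", 7.67, 8.72),
    ("BMC Genomics", 4.03, 4.18),
    ("Human Mutation", 6.47, 6.27),
    ("Pharmacogenetics and Genomics", 5.39, 5.78),
    ("Pharmacogenomics Journal", 3.96, 4.97),
]


def impact_factor_change(old, new):
    """Return (absolute change, percent change relative to the old value)."""
    delta = new - old
    return round(delta, 2), round(100 * delta / old, 1)


def rank_by_relative_change(rows):
    """Sort journals from largest relative gain to largest relative loss."""
    scored = [(name, *impact_factor_change(old, new)) for name, old, new in rows]
    return sorted(scored, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    for name, delta, percent in rank_by_relative_change(IMPACT_FACTORS):
        print(f"{name:32s} {delta:+.2f} ({percent:+.1f}%)")
```

Ranking by percent rather than absolute change keeps the smaller journals from being swamped by the big ones, where a one-point move means much less.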

Just for fun, I took the publications from my own CV and looked at the impact factors of journals where they were published (Nature, PLoS Biology, Human Mutation, J. of Infectious Diseases, and Genomics). It seems my average career impact factor is 19.24. Not too shabby!