Archives for July 2009

Maq, BWA, and Bowtie Compared

July 30, 2009 by Dan Koboldt

Until recently, Maq has provided the central alignment/assembly/variant-detection functionality for our Illumina pipeline. As technologies and algorithms evolve, however, we continue to investigate possible improvements. Heng Li’s sequel to Maq, called BWA, utilizes the incredibly fast Burrows-Wheeler indexing algorithm to speed up alignment time by orders of magnitude. Also, BWA generates alignments in SAM/BAM format by default, which is convenient for our large-scale sequencing projects where BAM files are becoming the standard format.

These features, along with our impression that Heng Li and company do not plan future updates to Maq, lead me to infer that BWA is the heir-apparent for our Illumina pipeline. Before the transition, however, we must compare Maq results with BWA results on the same dataset, to identify any differences that may affect downstream analysis. Also, we are continuing to evaluate other aligners, especially Bowtie, which offer comparable or even better speed at short read alignment.

Test Data: WGS and Targeted Sequencing of a Single Sample

We have a sample in-house for which we performed whole genome sequencing (WGS) and subsequently validated numerous novel variants. We also performed capture-based targeted resequencing (Illumina 2x75bp PE) of 6,000 genes in the same sample. To compare the performance of BWA, Maq, and Bowtie, we aligned the capture data with each tool separately, and looked at about a dozen sites where we’d validated novel variants from WGS.

Sensitivity – Total Reads Mapped

Here’s a histogram of the read depth at each of the 12 variant sites by aligner:

bwa-maq-bowtie-coverage

These results surprised me. Based on previous experience, I’d guessed that Maq would yield the highest depth, followed by BWA, and then Bowtie. Instead, with one exception, it was the other way around – Bowtie was more sensitive than BWA, which in turn was more sensitive than Maq. Yet these differences were relatively minor; overall, the coverage seems very comparable across all three aligners. I think that’s good news.

Variant Frequency by Read Count

Next, we looked at the observed variant frequencies, calculated as the relative fraction of reads supporting reference or variant alleles.

bwa-maq-bowtie-varfreq

When it comes to variant frequency, Maq and BWA yield almost identical results (despite slight coverage disparities). Bowtie yielded slighly higher frequencies in some cases, slightly lower frequencies in others. Again, these were very minor differences from three very different alignment algorithms, suggesting that each of them yields fairly robust results.

Farewell to Maq

Unfortunately, the results of my analysis do not bode well for Maq, only because Maq took a few days to align data that BWA and Bowtie processed in a matter of hours. So which Burrows-Wheeler aligner will prevail? It’s difficult to say. As far as SNP detection goes, BWA and Bowtie seem comparable.

SAM, BAM, Thank You Ma’am

July 24, 2009 by Dan Koboldt

Genome centers around the world have at last found something that we agree on. It seems like only a few months ago that I first heard of SAM (Sequence Alignment/Map) and his brother BAM (binary/compressed SAM). Yet those who produce next-gen sequencing data are rapidly adopting this universal short read data format as the de facto standard.

The Need for a Standard Short Read Data Format

From my work with short read aligners over the past year, it was quite obvious that we needed a standard format. More than a dozen alignment tools for next-gen data have been released – Maq, Bowtie, Novoalign, SeqMap, BFAST, and SHRiMP, just to name a few – and every one of them has its own custom format. The closest thing to a standard was Maq’s MAP format, a non-human-readable file that reached prominence only because Maq was the most widely used tool out there.

NGS Community Looks to Heng Li

The need for a standard is almost certainly not the only reason for the rapid adoption of the SAM/BAM specification. Another key factor is Heng Li, of Richard Durbin’s lab at Sanger. His work in developing Maq, and his reputation as a scientist, established a wide credibility for Heng in the NGS community. While Maq is, and continues to be, a powerful tool for analyzing Illumina and ABI SOLiD data, newer algorithms that leveraged Burrows-Wheeler Transformation (BWT) proved orders of magnitude faster at mapping reads. Bowtie was one of the first; soon after the makers of SOAP released SOAP2, a revamped version of their short read aligner that uses Burrows-Wheeler and is no longer open source.

Heng’s group soon developed a sequel to Maq, called BWA, that leverages this algorithm to speed up read alignments. It also produces BAM format output by default, a key feature that encouraged friends of Maq to make the switch. Richard Durbin’s group worked with a number of institutions to develop the SAM/BAM specification. Perhaps even more important was the development (and recent publication) of SAMtools, a suite of programs for manipulating SAM/BAM files and calling variants. By working with a standard alignment format, SAMtools can serve as a generic, aligner-agnostic variant detection pipeline.

SAM/BAM Convergence

Adoption of SAM/BAM by the sequencing community has been surprisingly rapid and widespread. A few short read aligners like Maq and Bowtie already have tools for converting their native formats to SAM. The latest Novoalign has a SAM output option built-in. There’s even a SAM/BAM converter for 454 data, which sounds like a dicey prospect given the homopolymer undercalling/overcalling problem. Large-scale sequencing efforts like the 1,000 Genomes Project and The Cancer Genome Atlas now consider BAM the standard format for data exchange. Many variant calling algorithms, like those developed at the Broad Institute, now accept BAM input as well.

The rush to make SAM/BAM a standard surprised me, but I don’t think it’s a bad thing. We needed a way to share NGS data that could be easily compressed, but also had a human-readable form. My only concern is this: many people may not realize that the data in a SAM/BAM file is aligner-dependent. There’s no such thing as a SAM file for an Illumina lane – instead, there’s on file for Maq alignments, one file for Bowtie, one file for BWA, etc. The good news, however, is that now we will have a centralized format for direct head-to-head comparisons of short read aligners.

Integrating Genomic Analyses with Functional Validation in Cancer

July 17, 2009 by Dan Koboldt

Yesterday our group discussed the recent Nature paper from Lynda Chin’s lab that identified GOLPH3 as a “first-in-class” Golgi oncogene. The study began where most cancer genomics efforts end up: with the identification of a genomic region (5p13) that’s amplified in numerous solid tumours. The authors reasoned that the amplified region likely contains a gene whose over-expression drives tumor development and/or growth. The minimal common region of amplification spanned six annotated protein-coding genes, yet only two of them (GOLPH3 and SUB1) showed significantly increased gene expression associated with increased copy number.

As John Wallis in our group pointed out, it’s the ensuing wet lab experiments that made this a Nature paper. Using a series of cancer cell lines (melanoma, prostate, pancreatic, NSC lung, and others), the authors show that overexpression of GOLPH3, but not SUB1, significantly increased colony formation and cell proliferation in vitro. Next, they performed a series of experiments to assess the function of GOLPH3 in cancer:

siRNA knockout, to demonstrate the effect of GOLPH3 on colony formation and cell proliferation.
Co-transformation assays, to show that GOLPH3 cooperates with HRAS in mouse embryonic fibroblasts and transforms human cell lines that constitutively express BRAF.
Mouse xenografts, to assess the effect of over-expressing GOLPH3 on tumor size.

Next, it was on to characterizing the mechanisms of GOLPH3’s oncogenic activity. Using a yeast two-hybrid screen, the authors found that GOLPH3 interacts with VPS35, a protein that’s part of the “retromer” – a protein complex involved in retrograde transport between endosomes and the trans-Golgi network. Though the nature of the interaction between VPS35 and GOLPH3 was not explored in-depth, this interaction (supported by staining/microscopy in cancer cell lines) implicated GOLPH3 as a Golgi protein and got the authors excited because, to their knowledge, it would be the first “Golgi oncoprotein.”

Linking GOLPH3 to mTOR

Pathway analysis is one of Lynda Chin’s specialties. Based upon their findings and evidence in the literature, the authors posited that GOLPH3 might be involved in the mTOR signaling pathway, which (as described in our TSP lung adenocarcinoma paper) plays a central role in many solid tumors.

L Ding et al. Nature 455, 1069-1075 (2008)

In a compelling series of experiments, the authors demonstrated that overexpression of GOLPH3 correlates with mTOR expression and increased phosphorylation of S6K, a substrate of the mTOR complex. This link suggests that GOLPH3’s regulates mTOR activity in mammalian cells, and because mTOR is the mammalian Target Of Rapamycin, it seemed reasonable to assess the influence of 5p13 copy number (and GOLPH3 expression) in the presence of rapamycin. As you could guess from the paper’s title, GOLPH3 overexpression increased sensitivity to rapamycin treatment. In fact, treating tumor cell lines with rapamycin had a similar effect on cell growth/proliferation as siRNA knockout of GOLPH3. For the life of me, I couldn’t understand the “peak-forward scatter height flow histograms” in figure 4A, but I got the message: same pathway inhibited, same effect.

Clinical Implications: How Will This Help Treat Cancer?

The authors are optimistic but careful in their discussion of the clinical implications of these findings. Obviously, GOLPH3’s role in tumor growth makes it an attractive target for rational drug design. Thanks to the key pathway analyses in this study, it is clear that GOLPH3 confers its oncogenic activity through the mTOR pathway, which can be inhibited by rapamycin treatment. Thus, the authors suggest that 5p13 copy number or GOLPH3 expression level might serve as a biomarker for sensitivity to mTOR inhibitors. I wonder if this possibility might be taken a step further – could induction of GOLPH3 overexpression in the tumors of cancer patients render them sensitive to rapamycin in a sort of cancer one-two punch? Regardless, if GOLPH3 expression or 5p13 copy number yields an informative biomarker for such treatment, it’s another step along the road towards personalized cancer therapy.

References
Scott, K., Kabbarah, O., Liang, M., Ivanova, E., Anagnostou, V., Wu, J., Dhakal, S., Wu, M., Chen, S., Feinberg, T., Huang, J., Saci, A., Widlund, H., Fisher, D., Xiao, Y., Rimm, D., Protopopov, A., Wong, K., & Chin, L. (2009). GOLPH3 modulates mTOR signalling and rapamycin sensitivity in cancer Nature, 459 (7250), 1085-1090 DOI: 10.1038/nature08109

With Collins at the Helm, What’s Next for NIH?

July 10, 2009 by Dan Koboldt

The nomination of Francis Collins for director of NIH is very exciting, particularly for those of us who work in gen collins-hgp omics. It not only recognizes his leadership and scientific reputation, but it seems to me a formal recognition of the importance of genetics and genomics for human health in the 21st century. At least three major scientific efforts were conceived and initiated during Collins’ tenure at NHGRI: the Human Genome Project (HGP), the International Haplotype Map Project (HapMap), and most recently, the 1,000 Genomes Project (TGP).

What Will Collins’ Next Big Project Be?

If Collins is confirmed as director of NIH, we can only wonder what his Next Big Project might be. Given his track record, I’d wager that it will have some of the same characteristics of past projects, namely:

Driven by new technology. Major advances in technology were key enablers of all of these big projects. For the human genome, it was capillary-based sequencing. For the HapMap project, it was high-throughput genotyping. For the 1,000 Genomes Project, it was massively parallel sequencing.
Unprecedented in scope and scale. There seems to be this one-upmanship that occurs when big projects are initiated. “We can sequence an entire gene,” someone probably told Collins. “Great,” he no doubt replied. “Let’s sequence the entire genome.” Or perhaps when groups were genotyping 50,000 SNPs, he just suggested, “let’s do all of them.”
A major advance in knowledge. The key to successfully initiating big projects, and certainly to getting funding for them, is a “big picture” deliverable – a huge new body of knoweldge that the project will yield, like the complete human genome sequence, or all of the variants in it. Also, they seem to promise major advances that will immediately impact scientific research, and eventually improve human health. Look at what the HapMap did for genome-wide association studies.

Unfortunately for Francis, he’s set the bar pretty high already. I think that few people appreciate the scale of the 1,000 Genomes Project and just how much information it’s already yielding about human genetic variation. TGP is also driving development of new technologies (like capture) at a pace that no one expected. If the project meets its fundamental goal, thousands of genomes will be sampled, many of them completely sequenced to high coverage. I’m not sure that anyone can one-up the TGP, at least in terms of undiseased individuals.

My Guess: Disease-Related Omics at Unprecedented Scale

Granted, I’m biased, but I believe that next-gen or next-next-gen sequencing will be the key new technology that drives Collins’ Big New Project. This time around, however, I doubt it will be anonymous, undiseased individuals with no phenotypic information that are surveyed. Instead, I foresee an ambitious effort to characterize the *functional* variation in genomes of phenotyped individuals. Let’s say that sequencing an entire genome reaches the $1,000 mark. As far as big-project budgets go, that means “free.” So if sequencing had no cost, who would you sequence? Anyone where the sequence could be interesting. What would you sequence? The genome, the transcriptome, and the methylome.

Yet with NIH, there will be a decidedly medical focus, so we’re talking patients. Perhaps 1000 patients for the 100 most prevalent genetics-influenced diseases. Such a project will keep us genome centers busy, certainly. More importantly, it will bring the research and medical communities closer together, because we’ll need geneticists, molecular biologists, and clinicians, working in concert with one another, to fully understand the relationship between genotype and phenotype.