Archives for December 2009

Sanger Adds Two Cancer Genomes

December 31, 2009 by Dan Koboldt

This week in Nature, investigators from Wellcome Trust Sanger Institute published the fourth and fifth complete cancer genomes. Interestingly, both are cancers in which the primary mutagen is known: malignant melanoma (UV light) and small-cell lung cancer (tobacco smoke). This seems to be important, because when I looked at the number of validated somatic coding mutations in each of the first five genomes, the latest two stood out.

first5cancer

Granted, small-cell lung cancer and malignant melanoma differ in many ways from leukemia and breast cancer. Yet the increase of confirmed somatic coding variants in these two recent studies is striking.

Corregendum: Not the First Catalogue of Somatic Mutations

Even so, the authors’ claim that they are “providing the first comprehensive catalogue of somatic mutations from an individual cancer” seems unjustified. Perhaps this is based on the idea that AML1 and LBC1 focused on coding variants only. Yet I point out that in AML2 we evaluated 282 noncoding somatic mutations and confirmed over 50. In the melanoma study, only 470 of the 33,345 newly found somatic mutations were sent through validation, and the method for selecting these was not made clear. At best, the “first comprehensive” claim is a semantic one; at worst, it’s just wrong.

That said, these are still landmark studies. Even with the emergence of next-generation sequencing, completing a cancer genome is a marvelous achievement. It requires substantial financial resources and technical expertise; we certainly knew that WTSI had these. But guiding the data through analysis and forming a cohesive story out of it is the real challenge. It requires persistence, intellect, and scientific rigor, but most of all it requires strong leadership. I congratulate our friends across the pond for showing that they have what it takes.

Illumina’s Fourth, ABI SOLiD’s First Cancer Genome

The melanoma study, like the first three cancer genomes, applied high-throughput sequencing on the Illumina platform (2 x 75bp, in this case). In contrast, the SCLC study is the first cancer genome to be sequenced on a different platform – ABI SOLiD. While the read lengths for SOLiD were not impressive (2×25 bp), the specificity was – 77 of 79 (97%) of somatic coding SNVs and 333 of 354 (94%) randomly chosen genome-wide variants tested confirmed by PCR and traditional sequencing.

Unfortunately, SOLiD also comes with a sensitivity cost. Only 22 of 29 previously identified SNVs (77%) were called. Indels were a real problem – neither of the two previously known coding indels were detected, and the validation rate for predicted somatic indels was 25%.

Mutational Signatures Implicate UV Light and Tobacco

Intriguingly, in both studies the authors identified distinct mutational signatures of exposure to the long-suspected environmental risk factor – ultraviolet radiation (in malignant skin cancer) and tobacco smoke’s “cocktail of carcinogens” (in lung cancer). The substantial number of mutations in each genome made it possible to characterize these signatures with unprecedented statistical power.

The strength of both studies is the insight into the molecular mechanisms of DNA damage, repair, and mutation that could be inferred from such a powerful dataset. In melanoma, the most prevalent mutations were C->T and G->A transitions; the mutational spectrum and sequence context indicate that most of these are attributable to ultraviolet-light-induced DNA damage. In lung cancer, G->T transversions were the commonest substitution; these mutations have previously been linked to polycyclic aromatic hydrocarbons and acrolein in tobacco smoke.

Don’t Smoke. Wear Sunscreen.

Cancer, like many common, complex diseases, has many risk factors that come from the environment as well as from one’s DNA. Here, for perhaps the first time, we get a picture of the significant mutational burden associated with two lifestyle choices. Smoking is an obvious one – avoiding it dramatically decreases one’s risk of cancer and a host of other diseases. Now science can offer another definitive piece of advice. Everybody’s free, as Baz Luhrmann puts it, to wear sunscreen.

References
Pleasance ED, Cheetham RK, Stephens PJ, McBride DJ, Humphray SJ, Greenman CD, Varela I, Lin ML, Ordóñez GR, Bignell GR, Ye K, Alipaz J, Bauer MJ, Beare D, Butler A, Carter RJ, Chen L, Cox AJ, Edkins S, Kokko-Gonzales PI, Gormley NA, Grocock RJ, Haudenschild CD, Hims MM, James T, Jia M, Kingsbury Z, Leroy C, Marshall J, Menzies A, Mudie LJ, Ning Z, Royce T, Schulz-Trieglaff OB, Spiridou A, Stebbings LA, Szajkowski L, Teague J, Williamson D, Chin L, Ross MT, Campbell PJ, Bentley DR, Futreal PA, & Stratton MR (2009). A comprehensive catalogue of somatic mutations from a human cancer genome. Nature PMID: 20016485
Pleasance ED, Stephens PJ, O’Meara S, McBride DJ, Meynert A, Jones D, Lin ML, Beare D, Lau KW, Greenman C, Varela I, Nik-Zainal S, Davies HR, Ordoñez GR, Mudie LJ, Latimer C, Edkins S, Stebbings L, Chen L, Jia M, Leroy C, Marshall J, Menzies A, Butler A, Teague JW, Mangion J, Sun YA, McLaughlin SF, Peckham HE, Tsung EF, Costa GL, Lee CC, Minna JD, Gazdar A, Birney E, Rhodes MD, McKernan KJ, Stratton MR, Futreal PA, & Campbell PJ (2009). A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature PMID: 20016488

SNP Discovery in NGS Data, Atlas-SNP2, and VarScan

December 29, 2009 by Dan Koboldt

A paper in this month’s Genome Research sheds light on predictors of sequencing error in next-generation sequencing. Using data from both 454 and Illumina platforms, Shen et al applied logistic regression models to identify sequence- and platform-related factors that contribute to substitution (SNP) errors.

atlas-snp2-paper

The results, I think, offer new insight into the challenge of accurate variant detection in massively parallel sequencing data. On the 454 platform, four factors were significantly correlated with erroneous SNP calls:

1.) Base quality of the substitution. No surprise there. Quality scores are inversely correlated with sequencing error rate.

2.) Dinucleotide polymorphism events (DNPs). Runs of 2+ consecutive substitutions are enriched for erroneous calls. In particular, so-called “swap-base” events, in which two adjacent positions invert their nucleotides, are evidently the result of “loss of synchrony” in 454 pyrosequencing.

3.) Neighboring quality standard (NQS). Requiring Q>20 at the SNP position and Q>15 at flanking positions (+/- 5bp) improved SNP calling accuracy. This is not a novel idea, as tools like ssahaSNP incorporated NQS into their algorithms a few years ago. However, I’m intrigued by the application of quality “windows” to reduce false positives in next-gen sequencing.

4.) Distance of base from 3′ end of read, normalized against the read length. This is an interesting source of erroneous SNP calls that I personally have observed in both 454 and Illumina datasets.

Not Significant: Homopolymers (!?), GC Content, Sequence Context

What I find most intriguing about this list is what’s missing: homopolymers, which have long plagued 454 pyrosequencing. The authors noted that the new Titanium base-caller had much improved performance calling homopolymers, so much improvement, in fact, that homopolymer-associated errors failed to reach statistical significance in their analysis. Other factors that were examined but not significant included GC content, sequence context, and substitution type. The Illumina platform had the same significant factors with one exception: swap-base events were not a significant source of error.

Variant Caller Comparison: Atlas-SNP2, Maq, and VarScan

The authors applied their linear regression models on E. coli and built sets of prior probabilities to tune their SNP caller, Atlas-SNP2. Then, they compared the performance of Atlas-SNP2 on to that of two other tools: VarScan and Maq. The comparison datasets included 454 data (about 1/2 region) and Illumina data (1 lane of 76-bp reads) from a highly-characterized strain of S. aureus. The variant calls from each tool were compared to known SNP positions to determine sensitivity and specificity.

DataSet	Caller	Sensitivity	Specificity
454	Atlas-SNP2	97.6%	88.4%
454	VarScan	97.6%	96.8%
Illumina	Atlas-SNP2	98.8%	99.9%
Illumina	VarScan	85.7%	99.9%
Illumina	Maq	4.8%-88.1%*	99.9%
* Maq was applied with various values of -D, from 100 (default) to 618

Please correct me if I’m wrong, but it seems that VarScan out-performed Atlas-SNP2 on 454 data. Baylor is (in my opinion) a leader in 454 sequencing, and published the first whole genome with that platform, so I’m pretty pleased. As one would expect, Atlas-SNP did shine in one area, the high-read-depth Illumina datasets. The authors’ tuning eventually produced a higher sensitivity than VarScan or Maq while maintaining similar specificity. While I acknowledge this achievement, I’m a little wary of any results that report 99.9% specificity for SNP calling in Illumina data.

Will Atlas-SNP2 Be Widely Adopted?

There do appear to be benefits of tuning variant detection software to account for platform-specific sources of error. It’s my opinion, however, that Atlas-SNP2 must overcome two obstacles prior to widespread adoption by the NGS community:

Ruby implementation. The software package comes as a set of Ruby scripts with little documentation. It will take a bioinformatician to run it. That limits the potential of expanding the tool, and/or integrating it with existing pipelines. Also, scripting languages like Perl and Ruby, while easy to write code in, are unlikely to scale to where NGS throughput is headed. No performance data was included in the study, so we don’t know how long it takes to train Atlas-SNP2 on a considerable dataset.
Archaic read mapping/alignment pipeline. This is the central limitation of Atlas-SNP2. The authors rely upon the same pipeline – anchoring with BLAT, then genome partitioning, then local alignment with cross_match – that was utilized in the Watson/454 paper. Come on, guys! This may have been feasible for their limited dataset and small bacterial genomes. But here’s the problem with using old-school aligners: BLAT and cross_match won’t scale to the large genomes and Illumina-like throughput (~20m reads per lane). They also cannot use read pairing information. It looks like there may be preliminary support for alternative alignments in SAM format, but only for 454 data.

The paper itself is well-written and clear, but there’s a lot missing. The methods are brief, and don’t describe the versions or parameters of any software that was used. The sequence datasets don’t appear to be publicly available. The numbers of expected and observed SNPs for S. aureus that were used for sensitivity/specificity calculations are not clear. It’s my understanding that there’s no limit on supplemental materials, so this information should have been provided. I’m surprised, in fact, that this publication made it to Genome Research. The novelty and scope are limited, and probably more appropriate for something like BMC Genomics. I say this, admittedly, with just a touch of envy ;-).

Insights into Next-Generation Sequencing

The publication was informative, however, in correlating several of the key factors contributing to sequencing error. While some of these (quality score, for example) were obvious, the swap-base phenomenon and NQS window approach are worth consideration for those of us developing variant detection algorithms. The linear regression training model may also have some value, especially as we continue to generate variant calls in large datasets and submit them for validation.

References
Shen Y, Wan Z, Coarfa C, Drabek R, Chen L, Ostrowski EA, Liu Y, Weinstock GM, Wheeler DA, Gibbs RA, & Yu F (2009). A SNP discovery method to assess variant allele probability from next-generation resequencing data. Genome research PMID: 20019143

Making the Leap: Maq to BWA

December 23, 2009 by Dan Koboldt

Like most curmudgeons I fought the change as stubbornly as I could. Leave Maq behind for something else? Never! Yet over the past few months I have come to realize that BWA, as it’s called, is not bad. At our genome center we still generate both Maq and BWA alignments for Illumina data; thanks to the hard work of our automated pipeline (apipe) group, I never have to run either on my own. Admittedly, even without a learning curve to daunt me, I found myself reluctant to use the BWA results over Maq.

How I Came to Love BWA

Speed was obviously the most compelling argument. Whereas Maq can take 1-2 days to map a single lane of Illumina data to the human genome, BWA can do it in a few hours. Also, BWA outputs directly into SAM format, which we’ve universally (and gratefully) adopted as the standard for next-gen sequencing data. There are, as it turns out, many differences between Maq and BWA, which I’ve summarized in the table below:

Aligner:	Maq	BWA
Algorithm:	Hash lookup table	Burrows-Wheeler Transform
Input format:	Binary FastQ	FastQ
Output format:	MAP file	SAM file
Gapped Alignment:	Paired-end only; one mate must map	Single-end and paired-end
Maximum Read Length:	63 bp	200 bp (aln) or 1,000 bp (dbtwsw)
Multi-threading:	No	Yes
Assembly / SNP calling:	Tools included	None. Use SAMtools.
Simulation Tools:	Tools included.	None.
Continued development:	No	Yes

There are other differences, but these are the ones I find most important. The ability to take standard FastQ input is a plus for BWA, but its native SAM output is critical. Granted, it’s possible to convert the output of several aligners (Maq, Bowtie, NovoCraft, even BLAST) to SAM format, but a direct output is convenient. Compared to Maq, BWA is lean and mean – no assembly, SNP calling, or simulation tools – with the single purpose of mapping reads to a reference, and a reliance on companion software (SAMtools) for subsequent analysis.

Two new features in BWA – gapped alignment and the ability to take longer reads – are a devastating blow to other next-generation sequencing aligners. Now, it’s possible to map single-end reads with gaps (Novoalign) and to map longer 454 reads (SSAHA2). Like Maq, BWA has color-space functions as well.

The take-home message: BWA can map data from all three available next-generation sequencers – Illumina/Solexa, Roche/454, and ABI SOLiD.

The Heng Li Factor

No software tool can remain competitive without continued development. Thus, when I heard that Heng Li did not plan any further releases of Maq, I knew its days were numbered. Some of us had already gotten the impression that Heng was focusing on BWA more than Maq anyway, but that made it official. Maq is still an incredible piece of software, and represented a technological leap for NGS analysis that helped bring about the era of whole-genome sequencing. We were among the first to climb aboard the Maq train, and it took us far. Now, sadly but with eager anticipation, we transferred to the express train, the bullet train. BWA.

References
Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics, 25 (14), 1754-1760 DOI: 10.1093/bioinformatics/btp324

Mapping Bias in Short Read Alignment

December 11, 2009 by Dan Koboldt

A recent paper in Bioinformatics investigates the effect of read-mapping biases on detecting allele-specific expression (ASE) from RNA-Seq data. The authors generated 16 million 36-bp cDNA reads in each of two HapMap individuals on the Illumina/Solexa platform. When evaluating known SNPs for evidence of ASE, they observed that heterozygous SNPs exhibited a mapping bias favoring the reference allele.

mapping-bias-header

This alone is perhaps not surprising, as we already knew that indels suffer from such bias. Initially, most short read aligners simply ignored gapped alignments. Now, even with aligners like BWA and Novoalign that allow for gaps when mapping short reads, alignments supporting the reference allele (ungapped) will be favored over alignments supporting an indel (gapped). The longer the indel, the larger the gap, and the less likely a short read would be to be mapped across it.

It is easy to see how SNPs might have a similar effect. Clusters of SNPs in close proximity, for example, may result in reads with more mismatches than are permitted by the aligner. In simulations, the authors found that random error (i.e. sequencing error) exacerbated the mapping bias. At an error rate of 0.01, some 51.4% of reads at heterozygous sites supported the reference allele, while an error rate of 0.05 increased the proportion to 59%. My own conclusion based on these results is that a variant allele, combined with nearby sequence changes that result from random error, pushes the mismatch profile of certain reads above the threshold at which alignments are discarded.

SNP-Masking Reveals Inherent Bias

What is surprising in this study by Degner et al is that even after they masked SNP positions in the reference sequence, some 5-10% of SNPs still had an inherent mapping bias favoring one allele. For 1.4% of SNPs, in fact, all of the reads came from a single allele. This obviously has important implications for evaluating ASE in RNA-Seq data, since the relative frequency of alleles from read mapping is used to infer allelic expression. It also affects the now-widespread application of Illumina/Solexa and ABI/SOLiD sequencing to characterize genetic variation from genomic DNA. Because virtually every variant calling algorithm relies on the ratio of reads supporting variant versus reference alleles, an inherent mapping bias favoring the reference allele will reduce the detection sensitivity.

Mapping Bias and Sequence Homology

To better understand the causes of inherent mapping bias, the authors investigated some of the most severely affected SNPs. The strongest biases occured among SNPs in regions of the genome with homology to other locations. When the SNP position was not masked, variant-containing reads matched another locus equally or even better than the true location. When the SNP position was masked, both reference- and variant-containing reads had a 1-bp mismatch to the reference, but either allele might match better elsewhere in the genome. In Figure 3, two examples of such SNPs demonstrate how variant-containing reads either mapped incorrectly or were “not mapped.” Some of these “not mapped” reads may have exceeded the number of allowable mismatches, while others may have become non-unique (i.e. matching multiple places). The authors filtered any alignments with mapping quality of 0, so it’s unclear which caused the mapping failure.

I should point out here that the masking approach may have contributed to this result. The authors “masked” heterozygous SNPs by changing the reference base to a third allele that matched neither reference nor the known variant. A superior approach might be to mask heterozygous SNPs to N, so that any base call at that position is considered a match. This would reduce the number of read mismatches overall, and might help improve the bias. Then again, some read aligners may consider any base at “N” to be a mismatch, which would have essentially no effect. What might have been interesting, though, is increasing the # and base-quality-sum of mismatches allowed by Maq to see if the read bias was removed.

Implications Moving Forward for ASMB

Your reaction might be to shrug, since Illumina/Solexa now routinely generates 76-bp and 100-bp reads. There are, however, a number of reasons why this might not address the bias issue. First, while read lengths are getting longer, alignment “seeds” for short reads are essentially unchanged, and if the SNP occurs in the ~22-25 bp alignment seed, it can still have an effect. Second, many published datasets these days are still based on read lengths of 50 bp or less, especially from groups running ABI/SOLiD or older Illuminas. Third, at least one promising single-molecule sequencer is still generating reads in the 30 bp range. And finally, there’s a practical reason that we’ll continue to see short read datasets: running a 75-bp or 100-bp Illumina flowcell takes several days and multiple kits – expenses of time and dollars that may not always be available. Thus, allele-specific mapping bias (ASMB) [acronym invented, D. Koboldt, 12/11/09] in short reads will remain a key issue in next-generation sequencing.

References
Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, & Pritchard JK (2009). Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics (Oxford, England), 25 (24), 3207-12 PMID: 19808877