Mappability of the human genome

December 9, 2010 by Dan Koboldt

Accurately mapping reads to a reference sequence is one of the most critical tasks for next-generation sequencing, yet it remains one of the most challenging areas of bioinformatics. Certainly, we’ve come a long way since ELAND and Maq. Dozens of new mapping and alignment algorithms have been published. Speed and sensitivity continue to improve; with BWA, we can map a lane of Illumina data (2×100 bp) to the human genome in about four hours. Nevertheless, it seems like most of the problems that arise in downstream analysis can be traced back to the read alignments.

Mapping Error: Consequences and Causes

Take variant calling, for example. The two central issues we try to address are false positives (variants that aren’t real) and false-negatives (real variants that are missed). Countless hours of manual review and extensive validation have shown that false positives rarely arise from simple sequencing errors. Such errors occur randomly, and with sufficient read depth (8x-10x), they can be readily ignored as background noise. Instead, most false positives are the result of an alignment errors – reads mapped to the wrong place, local mis-alignment around indels, etc. False-negatives are also a concern, particularly for indels. Why? Because the larger the gap between read and reference sequence, the more difficult it is to place correctly.

In my humble opinion, three factors contribute to the problem of accurate read mapping: relatively short read lengths (36-100 bp), unprecedented throughput of NGS instruments, and size/complexity of the genome. As a bioinformatician, I can’t do much about the first two. The size and complexity of the genome, however, remains somewhat fixed. As such, many groups have sought to define those portions of the genome that are “mappable”, so that they can be prioritized for analysis.

Quantifying Short Read Mappability

Recently, I noticed that several such datasets have been incorporated as tracks in the UCSC Genome Browser. For build 18 (NCBI 36), there are about a dozen of them, all under the track name “Mapability”. Here are some of them across the BRCA1 gene region on chromosome 17:

ucsc-mappability

The top three tracks (in green) represent mappability with 36, 75, and 100bp reads as calculated by CRG-GEM, a suite of genome analysis tools by a group in Spain. Then there’s Duke uniqueness (35 bp) in red, and UMass uniqueness (15 bp) in blue. Looking at the tracks together, a few patterns become apparent: first, that mappability calculations remain fairly consistent across datasets generated by different groups. Most of the drops in mappability correspond nicely to RepeatMasker regions. Also, if the CRG results are accurate, the mappability increases substantially with read length (although the near-complete mappability for 100-mers seems a bit optimistic). This dovetails nicely with the theoretical expectation that longer reads can resolve more of the genome.

Utilizing Genome-Wide Mappability Scores

How might this information assist bioinformaticians with the analysis of NGS data? I have a few ideas. It could be used to normalize sequencing coverage by local mappability, for read-depth-sensitive measurements like genome-wide copy number. It can be plotted alongside whole-genome sequencing or capture coverage read depth, to help explain holes in coverage.

It could even be used pre-emptively, for targeted sequencing projects, to assess the regions that will or will not have reads mapped. This sort of analysis is useful to set realistic expectations for collaborators, who might expect that, since they’ve heard so many nice things about next-gen sequencing, it will yield beautiful, even coverage across every single base they want. For time-sensitive projects, you could even adapt a sequencing strategy based on this information. On the Illumina platform, for example, a 36-bp fragment-end run finishes days before a paired-end run of longer reads. If one’s target regions were very mappable, and time was of the essence, a faster and less expensive sequencing run could be ordered.

VarScan 2 Released on SourceForge

February 4, 2010 by Dan Koboldt

Accurate variant detection in massively parallel sequencing data is a significant bioinformatics challenge. Not only do new sequencers offer unprecedented breadth (whole genome) and depth (30x or more), but they suffer coverage biases and error rates that make variant calling difficult. Last year, we published VarScan, our in-house algorithm for SNP and indel detection on next-generation platforms. NGS analysis has changed somewhat since that time; SAM/BAM format was widely adopted, for one thing, and data throughput has skyrocketed.

To address these issues as well as the requests of many users, we have released VarScan 2 on SourceForge.net. The new version features many improvements and enhancements, including:

SAM/BAM compatibility. Rather than reading various native alignment formats, VarScan now accepts as input the “pileup” format of SAMtools. Since most widely used aligners can be converted to SAM format or output it directly, this makes VarScan compatible with a wide array of tools.
Java implementation. To increase speed and performance in the face of ever-increasing sequencing throughput, we’ve implemented the new VarScan in Java. This also means that it can run on any operating system through the Java virtual machine (VM).
New filtering and comparison tools. VarScan 2 now has commands to limit variants to a list of positions or chromosomal regions, which is useful for targeted sequencing projects. It also has a comparison tool that intersects or merges two sets of variants.
Somatic variant detection. This is the flagship feature of VarScan 2 – given pileup files from a tumor sample and matched control (normal), VarScan calls variants (SNPs and indels) and determines their somatic status (Germline, Somatic, LOH) using heuristic and statistical approaches.

Software and Documentation on SourceForge.net

VarScan joins the ranks of some of the most widely used tools for NGS analysis – Bowtie, Maq, BWA, SAMtools, and Picard – that are hosted on sourceforge.net. The download page, user’s manual, and Java documentation for VarScan are already online. There’s a new wiki site and discussion forum for VarScan as well, to help us developers keep in touch with users and the NGS community. The project page will have information about known issues and new software releases.

You’ll find it all at http://varscan.sourceforge.net.

References

Koboldt DC, Chen K, Wylie T, Larson DE, McLellan MD, Mardis ER, Weinstock GM, Wilson RK, & Ding L (2009). VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics (Oxford, England), 25 (17), 2283-5 PMID: 19542151

Making the Leap: Maq to BWA

December 23, 2009 by Dan Koboldt

Like most curmudgeons I fought the change as stubbornly as I could. Leave Maq behind for something else? Never! Yet over the past few months I have come to realize that BWA, as it’s called, is not bad. At our genome center we still generate both Maq and BWA alignments for Illumina data; thanks to the hard work of our automated pipeline (apipe) group, I never have to run either on my own. Admittedly, even without a learning curve to daunt me, I found myself reluctant to use the BWA results over Maq.

How I Came to Love BWA

Speed was obviously the most compelling argument. Whereas Maq can take 1-2 days to map a single lane of Illumina data to the human genome, BWA can do it in a few hours. Also, BWA outputs directly into SAM format, which we’ve universally (and gratefully) adopted as the standard for next-gen sequencing data. There are, as it turns out, many differences between Maq and BWA, which I’ve summarized in the table below:

Aligner:	Maq	BWA
Algorithm:	Hash lookup table	Burrows-Wheeler Transform
Input format:	Binary FastQ	FastQ
Output format:	MAP file	SAM file
Gapped Alignment:	Paired-end only; one mate must map	Single-end and paired-end
Maximum Read Length:	63 bp	200 bp (aln) or 1,000 bp (dbtwsw)
Multi-threading:	No	Yes
Assembly / SNP calling:	Tools included	None. Use SAMtools.
Simulation Tools:	Tools included.	None.
Continued development:	No	Yes

There are other differences, but these are the ones I find most important. The ability to take standard FastQ input is a plus for BWA, but its native SAM output is critical. Granted, it’s possible to convert the output of several aligners (Maq, Bowtie, NovoCraft, even BLAST) to SAM format, but a direct output is convenient. Compared to Maq, BWA is lean and mean – no assembly, SNP calling, or simulation tools – with the single purpose of mapping reads to a reference, and a reliance on companion software (SAMtools) for subsequent analysis.

Two new features in BWA – gapped alignment and the ability to take longer reads – are a devastating blow to other next-generation sequencing aligners. Now, it’s possible to map single-end reads with gaps (Novoalign) and to map longer 454 reads (SSAHA2). Like Maq, BWA has color-space functions as well.

The take-home message: BWA can map data from all three available next-generation sequencers – Illumina/Solexa, Roche/454, and ABI SOLiD.

The Heng Li Factor

No software tool can remain competitive without continued development. Thus, when I heard that Heng Li did not plan any further releases of Maq, I knew its days were numbered. Some of us had already gotten the impression that Heng was focusing on BWA more than Maq anyway, but that made it official. Maq is still an incredible piece of software, and represented a technological leap for NGS analysis that helped bring about the era of whole-genome sequencing. We were among the first to climb aboard the Maq train, and it took us far. Now, sadly but with eager anticipation, we transferred to the express train, the bullet train. BWA.

References
Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics, 25 (14), 1754-1760 DOI: 10.1093/bioinformatics/btp324

Mapping Bias in Short Read Alignment

December 11, 2009 by Dan Koboldt

A recent paper in Bioinformatics investigates the effect of read-mapping biases on detecting allele-specific expression (ASE) from RNA-Seq data. The authors generated 16 million 36-bp cDNA reads in each of two HapMap individuals on the Illumina/Solexa platform. When evaluating known SNPs for evidence of ASE, they observed that heterozygous SNPs exhibited a mapping bias favoring the reference allele.

mapping-bias-header

This alone is perhaps not surprising, as we already knew that indels suffer from such bias. Initially, most short read aligners simply ignored gapped alignments. Now, even with aligners like BWA and Novoalign that allow for gaps when mapping short reads, alignments supporting the reference allele (ungapped) will be favored over alignments supporting an indel (gapped). The longer the indel, the larger the gap, and the less likely a short read would be to be mapped across it.

It is easy to see how SNPs might have a similar effect. Clusters of SNPs in close proximity, for example, may result in reads with more mismatches than are permitted by the aligner. In simulations, the authors found that random error (i.e. sequencing error) exacerbated the mapping bias. At an error rate of 0.01, some 51.4% of reads at heterozygous sites supported the reference allele, while an error rate of 0.05 increased the proportion to 59%. My own conclusion based on these results is that a variant allele, combined with nearby sequence changes that result from random error, pushes the mismatch profile of certain reads above the threshold at which alignments are discarded.

SNP-Masking Reveals Inherent Bias

What is surprising in this study by Degner et al is that even after they masked SNP positions in the reference sequence, some 5-10% of SNPs still had an inherent mapping bias favoring one allele. For 1.4% of SNPs, in fact, all of the reads came from a single allele. This obviously has important implications for evaluating ASE in RNA-Seq data, since the relative frequency of alleles from read mapping is used to infer allelic expression. It also affects the now-widespread application of Illumina/Solexa and ABI/SOLiD sequencing to characterize genetic variation from genomic DNA. Because virtually every variant calling algorithm relies on the ratio of reads supporting variant versus reference alleles, an inherent mapping bias favoring the reference allele will reduce the detection sensitivity.

Mapping Bias and Sequence Homology

To better understand the causes of inherent mapping bias, the authors investigated some of the most severely affected SNPs. The strongest biases occured among SNPs in regions of the genome with homology to other locations. When the SNP position was not masked, variant-containing reads matched another locus equally or even better than the true location. When the SNP position was masked, both reference- and variant-containing reads had a 1-bp mismatch to the reference, but either allele might match better elsewhere in the genome. In Figure 3, two examples of such SNPs demonstrate how variant-containing reads either mapped incorrectly or were “not mapped.” Some of these “not mapped” reads may have exceeded the number of allowable mismatches, while others may have become non-unique (i.e. matching multiple places). The authors filtered any alignments with mapping quality of 0, so it’s unclear which caused the mapping failure.

I should point out here that the masking approach may have contributed to this result. The authors “masked” heterozygous SNPs by changing the reference base to a third allele that matched neither reference nor the known variant. A superior approach might be to mask heterozygous SNPs to N, so that any base call at that position is considered a match. This would reduce the number of read mismatches overall, and might help improve the bias. Then again, some read aligners may consider any base at “N” to be a mismatch, which would have essentially no effect. What might have been interesting, though, is increasing the # and base-quality-sum of mismatches allowed by Maq to see if the read bias was removed.

Implications Moving Forward for ASMB

Your reaction might be to shrug, since Illumina/Solexa now routinely generates 76-bp and 100-bp reads. There are, however, a number of reasons why this might not address the bias issue. First, while read lengths are getting longer, alignment “seeds” for short reads are essentially unchanged, and if the SNP occurs in the ~22-25 bp alignment seed, it can still have an effect. Second, many published datasets these days are still based on read lengths of 50 bp or less, especially from groups running ABI/SOLiD or older Illuminas. Third, at least one promising single-molecule sequencer is still generating reads in the 30 bp range. And finally, there’s a practical reason that we’ll continue to see short read datasets: running a 75-bp or 100-bp Illumina flowcell takes several days and multiple kits – expenses of time and dollars that may not always be available. Thus, allele-specific mapping bias (ASMB) [acronym invented, D. Koboldt, 12/11/09] in short reads will remain a key issue in next-generation sequencing.

References
Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, & Pritchard JK (2009). Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics (Oxford, England), 25 (24), 3207-12 PMID: 19808877