Next-Gen Sequencing in 2010

March 9, 2010 by Dan Koboldt

On the shuttle from Marco Island to the airport last week, I happened to sit next to a very nice gentleman from Illumina. We got to talking, of course, and I asked him if they saw a threat from any of the new sequencing platforms presented at AGBT. I’m aware that Illumina currently enjoys a greater-than-50% share of the next-gen sequencing market, so I was curious about his impressions.

“We definitely see a segmentation of the market,” he admitted.

Something had been bothering me about the sequencing-company presentations this year, and I finally realized what it was. During AGBT 2009, every player was gunning to take over the world. This year it seems like every sequencing platform has a niche in mind.

General Sequencing: Illumina vs. Life Technologies

Illumina’s HiSeq2000 and Life Tech’s SOLiD 4 are after the general sequencing market – whole genome, transcriptome, and targeted (capture) sequencing. It’s a constant game of one-upmanship in throughput and claimed accuracy. In February this year, Illumina launched the HiSeq2000 with expected throughput of 200 Gb per run. Life Technologies launched SOLiD 4 with 100 Gb per run, but promised 300 Gb per run later this year. On the read length front, Illumina remains the clear winner – 2×100 is in production at many genome centers, and even longer reads have been promised. Life Tech, to their credit, is pushing the SOLiD 4 platform pretty hard.

When Length Matters: 454

Roche/454 has wisely backed away from large-scale sequencing, and instead seems to be targeting applications where longer (450 bp) reads are a requirement. At AGBT, Henry Erlich (Roche) gave an interesting talk about genotyping and haplotyping human HLA regions to improve donor matching for organ transplants. Here’s a key challenge of modern medicine where sequencing can offer tangible benefits. Here at the genome center, we use 454 runs for validation and for small-scale targeted sequencing. There are many applications where relatively inexpensive long-read sequencing runs are ideal; full-length cDNA sequencing, for example, comes to mind.

Complete Genomics: Sequencing as a Service

The business model of Complete Genomics seems a bit of a gamble to me. They aim to be the provider of relatively inexpensive, start-to-finish sequencing services. No technology or reagent sales for these guys. Instead, they want to take your samples and give you back the SNPs. In the coming years, they hope to build as many as 10 facilities throughout the world that provide these services. I’m a bit leery of Complete Genomics, not only because their proprietary technology lags behind others (currently it’s at 2×35 bp), but because they’ll need to do something like 10,000 genomes a year just to stay in business. I don’t think we’re ready for that.

Sequencing for the Masses: IonTorrent

Many of us were impressed by IonTorrent this year at AGBT. The incredibly low cost of their instrument ($50K) and sequencing runs ($300-500) mean that nearly any lab could write a grant around this technology. The sample prep, accuracy, and throughput are still a grey area, but if they prove to be good enough, high-throughput sequencing will suddenly be available to just about everyone.

Single Molecule Applications: Pac Bio and Oxford Nanopore

The true single-molecule sequencing platforms that are close to market are certainly getting everyone excited. In the next few years, however, it’s unlikely that Pacific Biosciences, Oxford Nanopore, mystery-Chinese-platform, or other companies will displace massively parallel sequencing. No, I think Illumina and SOLiD will remain the “workhorses” for discovery, certainly at major genome centers. Where SMS technologies can excel, however, is ultra-long reads – think about PacBio’s strobe sequencing to resolve structural variation or finish assemblies – and lots of molecule-kinetics stuff that I don’t understand.

I think that 2010 will be an exciting and telling time for all of these platforms. In a year’s time, we should have results in hand from HiSeq, SOLiD4, PacBio, and even IonTorrent, and be able to distinguish between marketing claims and sequencing reality.

Mapping Bias in Short Read Alignment

December 11, 2009 by Dan Koboldt

A recent paper in Bioinformatics investigates the effect of read-mapping biases on detecting allele-specific expression (ASE) from RNA-Seq data. The authors generated 16 million 36-bp cDNA reads in each of two HapMap individuals on the Illumina/Solexa platform. When evaluating known SNPs for evidence of ASE, they observed that heterozygous SNPs exhibited a mapping bias favoring the reference allele.


This alone is perhaps not surprising, as we already knew that indels suffer from such bias. Initially, most short read aligners simply ignored gapped alignments. Now, even with aligners like BWA and Novoalign that allow for gaps when mapping short reads, alignments supporting the reference allele (ungapped) will be favored over alignments supporting an indel (gapped). The longer the indel, the larger the gap, and the less likely a short read is to be mapped across it.

It is easy to see how SNPs might have a similar effect. Clusters of SNPs in close proximity, for example, may result in reads with more mismatches than are permitted by the aligner. In simulations, the authors found that random error (i.e. sequencing error) exacerbated the mapping bias. At an error rate of 0.01, some 51.4% of reads at heterozygous sites supported the reference allele, while an error rate of 0.05 increased the proportion to 59%. My own conclusion based on these results is that a variant allele, combined with nearby sequence changes that result from random error, pushes the mismatch profile of certain reads above the threshold at which alignments are discarded.
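To make that intuition concrete, here is a toy calculation in Python. It is not the authors' simulation. It assumes an aligner that simply discards any read with more than a fixed number of mismatches (the `max_mismatches` value of 2 is my assumption, as are the function names), a uniform per-base error rate, and no errors at the SNP position itself. Under those assumptions, a variant-containing read has already spent one mismatch on the SNP, so it tolerates one fewer sequencing error than a reference read.

```python
from math import comb


def prob_at_most(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    if k < 0:
        return 0.0
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(min(k, n) + 1))


def reference_fraction(read_len, error_rate, max_mismatches):
    """Expected fraction of mapped reads carrying the reference allele at a
    heterozygous SNP, under a hard mismatch cutoff (toy model)."""
    # Reference-allele read: only sequencing errors count as mismatches.
    p_ref = prob_at_most(max_mismatches, read_len, error_rate)
    # Variant-allele read: the SNP uses up one mismatch, so the remaining
    # read_len - 1 bases may carry at most max_mismatches - 1 errors.
    p_alt = prob_at_most(max_mismatches - 1, read_len - 1, error_rate)
    return p_ref / (p_ref + p_alt)


if __name__ == "__main__":
    # 36-bp reads as in the paper; the cutoff of 2 is an assumed value.
    for rate in (0.0, 0.01, 0.05):
        print(rate, round(reference_fraction(36, rate, 2), 3))
```

With no sequencing error the two alleles map equally well, and the reference fraction climbs as the error rate goes up, which is the direction of the effect described above. The exact numbers depend entirely on the assumed cutoff.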

SNP-Masking Reveals Inherent Bias

What is surprising in this study by Degner et al. is that even after they masked SNP positions in the reference sequence, some 5-10% of SNPs still had an inherent mapping bias favoring one allele. For 1.4% of SNPs, in fact, all of the reads came from a single allele. This obviously has important implications for evaluating ASE in RNA-Seq data, since the relative frequency of alleles from read mapping is used to infer allelic expression. It also affects the now-widespread application of Illumina/Solexa and ABI/SOLiD sequencing to characterize genetic variation from genomic DNA. Because virtually every variant calling algorithm relies on the ratio of reads supporting variant versus reference alleles, an inherent mapping bias favoring the reference allele will reduce the detection sensitivity.
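As a rough illustration of how read ratios feed into an ASE call, the sketch below runs an exact two-sided binomial test of the reference/variant read counts at one heterozygous SNP against an expected 50:50 split. This is a generic test that I am using for illustration, not necessarily what Degner et al. or any particular variant caller does, and the function name is hypothetical.

```python
from math import comb


def binomial_two_sided_p(ref_reads, alt_reads, expected=0.5):
    """Exact two-sided binomial p-value for the observed number of
    reference reads, given an expected reference fraction."""
    n = ref_reads + alt_reads
    pmf = [comb(n, k) * expected**k * (1 - expected) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Sum every outcome at least as unlikely as the observed one
    # (small tolerance guards against floating-point ties).
    p_value = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


if __name__ == "__main__":
    # Balanced counts versus all reads from one allele.
    print(binomial_two_sided_p(5, 5))
    print(binomial_two_sided_p(10, 0))
```

The catch is the `expected` parameter: if mapping bias pushes the true null above 0.5 for a given SNP, a test against 0.5 will flag allelic imbalance that is really an alignment artifact.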

Mapping Bias and Sequence Homology

To better understand the causes of inherent mapping bias, the authors investigated some of the most severely affected SNPs. The strongest biases occurred among SNPs in regions of the genome with homology to other locations. When the SNP position was not masked, variant-containing reads matched another locus equally or even better than the true location. When the SNP position was masked, both reference- and variant-containing reads had a 1-bp mismatch to the reference, but either allele might match better elsewhere in the genome. In Figure 3, two examples of such SNPs demonstrate how variant-containing reads either mapped incorrectly or were “not mapped.” Some of these “not mapped” reads may have exceeded the number of allowable mismatches, while others may have become non-unique (i.e. matching multiple places). The authors filtered any alignments with mapping quality of 0, so it’s unclear which caused the mapping failure.

I should point out here that the masking approach may have contributed to this result. The authors “masked” heterozygous SNPs by changing the reference base to a third allele that matched neither reference nor the known variant. A superior approach might be to mask heterozygous SNPs to N, so that any base call at that position is considered a match. This would reduce the number of read mismatches overall, and might help improve the bias. Then again, some read aligners may consider any base at “N” to be a mismatch, which would have essentially no effect. What might have been interesting, though, is increasing the number and base-quality sum of mismatches allowed by Maq to see if the read bias was removed.
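The two masking strategies can be sketched in a few lines of Python. Everything here is illustrative: the function names, the 0-based SNP dictionary, and the rule for picking the third allele (first of A, C, G, T that is neither known allele) are my assumptions, not details from the paper. The `n_is_wildcard` flag stands in for the aligner behavior in question, since whether N counts as a match varies between aligners.

```python
def mask_third_allele(reference, snps):
    """Replace each SNP position with a base matching neither known allele.
    snps maps a 0-based position to a (ref_allele, alt_allele) tuple."""
    seq = list(reference)
    for pos, (ref, alt) in snps.items():
        seq[pos] = next(b for b in "ACGT" if b not in (ref, alt))
    return "".join(seq)


def mask_to_n(reference, snps):
    """Replace each SNP position with N."""
    seq = list(reference)
    for pos in snps:
        seq[pos] = "N"
    return "".join(seq)


def count_mismatches(read, ref_window, n_is_wildcard=True):
    """Count mismatches between a read and an equal-length reference window.
    If n_is_wildcard is True, an N in the reference matches any base."""
    return sum(
        1
        for r, g in zip(read, ref_window)
        if r != g and not (n_is_wildcard and g == "N")
    )


if __name__ == "__main__":
    reference = "ACGTACGT"
    snps = {2: ("G", "A")}  # hypothetical G/A SNP at 0-based position 2
    for masked in (mask_third_allele(reference, snps), mask_to_n(reference, snps)):
        # Compare a reference-allele read and a variant-allele read.
        print(masked, count_mismatches("ACGT", masked[:4]), count_mismatches("ACAT", masked[:4]))
```

Either way both alleles end up with the same mismatch count at the SNP. The difference is that third-allele masking costs every read one mismatch, while N-masking costs nothing when the aligner treats N as a wildcard.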

Implications Moving Forward for ASMB

Your reaction might be to shrug, since Illumina/Solexa now routinely generates 76-bp and 100-bp reads. There are, however, a number of reasons why this might not address the bias issue. First, while read lengths are getting longer, alignment “seeds” for short reads are essentially unchanged, and if the SNP occurs in the ~22-25 bp alignment seed, it can still have an effect. Second, many published datasets these days are still based on read lengths of 50 bp or less, especially from groups running ABI/SOLiD or older Illuminas. Third, at least one promising single-molecule sequencer is still generating reads in the 30 bp range. And finally, there’s a practical reason that we’ll continue to see short read datasets: running a 75-bp or 100-bp Illumina flowcell takes several days and multiple kits – expenses of time and dollars that may not always be available. Thus, allele-specific mapping bias (ASMB) [acronym invented, D. Koboldt, 12/11/09] in short reads will remain a key issue in next-generation sequencing.

References
Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, & Pritchard JK (2009). Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics (Oxford, England), 25(24), 3207-12. PMID: 19808877

Annotation of Insertions and Deletions

March 29, 2008 by dkoboldt

We’re working on a project with ~2.2 million 454 reads from two cDNA libraries and my job is to find and classify the insertion/deletion variants (indels). As you might guess, since these are reads of transcribed sequence, there’s a lot of noise due to mRNA processing. Spliced-out introns look like deletions. Partially-processed transcripts might look like they contain insertions. So, once I made indel predictions based on aligning 454 data to the NCBI Build 36 (hg18) reference sequence, the next priority was to remove the noise.

Fortunately, two colleagues in my group, Ken Chen (the developer of PolyScan) and Brian Dunford-Shore (our resident physicist), have built a “transcriptome” based on all of the known transcripts in the CCDS, Ensembl, and Vega databases. One of the files generated with the transcriptome is the refseq “footprint” which contains all of the UTRs and exons of all transcripts. It seems to me this file offers the most comprehensive source for annotating the indels from cDNA data.

So, I wrote a script, annotate_with_footprint.pl, which cross-references a set of indels with the footprint file. Insertions are classified as either within-CDS-exon, within-UTR-exon, or noncoding. Deletions are a bit more complicated – they could be within-CDS-exon, within-UTR-exon, or noncoding. They could also span multiple CDS or UTR exons, span intron-exon-junctions, etc.
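For anyone curious about the logic, here is a minimal Python sketch of that kind of classification. It is not the Perl script itself. The footprint is assumed to be a list of exon intervals tagged as CDS or UTR with 1-based inclusive coordinates, and the labels for deletions that cross exon boundaries are names I made up for illustration.

```python
from typing import List, NamedTuple


class Exon(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    kind: str   # "CDS" or "UTR"


# Hypothetical footprint: one UTR exon and two CDS exons on chr1.
EXAMPLE_FOOTPRINT = [
    Exon("chr1", 100, 200, "UTR"),
    Exon("chr1", 300, 400, "CDS"),
    Exon("chr1", 500, 600, "CDS"),
]


def classify_insertion(chrom: str, pos: int, footprint: List[Exon]) -> str:
    """An insertion sits at a single position, so it is either inside an exon or not."""
    hits = [e for e in footprint if e.chrom == chrom and e.start <= pos <= e.end]
    if any(e.kind == "CDS" for e in hits):
        return "within-CDS-exon"
    if hits:
        return "within-UTR-exon"
    return "noncoding"


def classify_deletion(chrom: str, start: int, end: int, footprint: List[Exon]) -> str:
    """A deletion covers an interval, so it can also cross exon boundaries."""
    overlaps = [e for e in footprint if e.chrom == chrom and e.start <= end and start <= e.end]
    if not overlaps:
        return "noncoding"
    # Deletion lies entirely inside at least one exon.
    contained = [e for e in overlaps if e.start <= start and end <= e.end]
    if contained:
        kind = "CDS" if any(e.kind == "CDS" for e in contained) else "UTR"
        return f"within-{kind}-exon"
    # Deletion touches two or more exons: possibly an exon-skipping event.
    if len(overlaps) > 1:
        return "spans-multiple-exons"
    # Deletion overlaps one exon but extends past at least one of its edges.
    return "spans-intron-exon-junction"


if __name__ == "__main__":
    print(classify_insertion("chr1", 350, EXAMPLE_FOOTPRINT))
    print(classify_deletion("chr1", 390, 510, EXAMPLE_FOOTPRINT))
```

A linear scan over the footprint is fine for a sketch. With a real footprint file and millions of reads, you would want the exons sorted or indexed per chromosome.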

As it turned out, only about 12% of the insertions and 1% of the deletions were in exons; the vast majority were in UTR/intron regions or intron-exon splice artifacts. Another 4% of the deletions appeared to span one or more CDS exons, but many of these may be exon-skipping events, not true deletions.

Even with strong 454 cDNA support, I won’t be confident that these are real coding mutations until we validate them in genomic DNA.