    First Look: Data from IonTorrent’s 316 Chip

    June 22nd, 2011

    I got an early look at data from IonTorrent’s forthcoming semiconductor sequencing chip, called the 316. The data comes from a single sequencing run of E. coli, specifically the laboratory reference strain DH10B, not to be confused with the hybrid E. coli strain in the European outbreak that IonTorrent sequenced earlier this year. According to the run report, some 1.69 million reads were generated, of which 1.35 million were 100 base pairs or longer. It’s a nice, shiny report with color images, and of course I summarily ignored it in favor of looking at the data myself.

    The data package included an SFF file, a data format often used for Roche/454 data that contains read sequences and base qualities. There was also a BAM file, which included the read sequences aligned to the E. coli reference, indexed for random access. And there was a FASTQ file, which is the common sequence-and-quality input format used by many short read aligners. So how does one go about evaluating a new sequencing dataset? This is a relevant question for bioinformaticians, who are often handed a lane or a run’s worth of data and asked, “How does it look?” Personally, I like to start with the basics: reads, bases, and qualities. I extracted sequence and quality files from the SFF archive, downloaded the latest E. coli reference genome from NCBI, and got to work.

    Distribution of Read Lengths

    There were 1.69 million reads totaling 175,570,901 bases, which works out to an average read length of about 104 bp. Here’s my own plot of the read length distribution:

    [Figure: IonTorrent read length distribution]
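
    For readers who want to reproduce this kind of summary, here is a minimal Python sketch. It is not the script used for the plot; it assumes a FASTQ file with exactly four lines per record (no wrapped sequences), and the function names and the tiny in-memory example are made up for illustration.

```python
from collections import Counter
from io import StringIO


def read_lengths(handle):
    """Yield the sequence length of each record in a 4-line-per-record FASTQ stream."""
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:  # the sequence is the second line of each record
            yield len(line.strip())


def summarize(lengths):
    """Return (read count, total bases, mean length, longest read, length histogram)."""
    histogram = Counter(lengths)
    n_reads = sum(histogram.values())
    total_bases = sum(length * count for length, count in histogram.items())
    return n_reads, total_bases, total_bases / n_reads, max(histogram), histogram


# Tiny made-up example; a real run would pass an open FASTQ file instead.
example = StringIO("@r1\nACGTA\n+\nIIIII\n@r2\nACG\n+\nIII\n")
n_reads, total_bases, mean_length, longest, histogram = summarize(read_lengths(example))
print(n_reads, total_bases, mean_length, longest)
```

    The histogram maps each read length to the number of reads of that length, which is what a length distribution plot is drawn from.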

    The distribution was very tight, peaking right around 100 bases. The longest read was 127 bp. Using BLAT, I mapped 1.63 million reads to the E. coli reference (96.36%), and 1.51 million (about 90%) were uniquely placed. Yes, I could have simply used the alignments provided in the BAM file, but I knew little about how they were generated. I’ll talk about those later. The BLAT-like alignment score distribution closely mirrors the read length:

    [Figure: IonTorrent alignment score distribution]

    What this means is that the majority of bases in each read were mapped, and most of the time, they matched the reference. You’d expect a leftward shift in the distribution if the reads contained undesired primer or barcode sequences that didn’t match the reference.

    Base Quality by Read Length

    Since this is a new sequencing technology, I was keenly interested in base quality along the length of the read. If you’ll recall, early versions of the Illumina/Solexa platform (1×32 bp or 1×36 bp, remember those days?) showed a decline in base quality as the read went on, with a substantial drop at around 28 bp, while early 454 data had fairly uniform base quality along the length of the read. Thus, I calculated the average base quality from the first 1,600 reads in the IonTorrent FASTQ file. This is where things got interesting.

    [Figure: IonTorrent average base quality by read position]

    You’ll notice that, like early Illumina/Solexa data, average base quality declines along the length of the read. I should point out that these are variable-length reads, so more values were used for read position 1 than, say, read position 60. Thus, it’s possible that only the last few bases in each read are low-quality, which reduces the average score as you reach the end of the reads. Also notable is the fact that the highest average base quality is 23. Remember, this is the platform’s own estimate of error rate; a score of 20 indicates that 1/100 bases will be wrong. I also noticed a lot of base quality values of 8, and wonder if this is the equivalent of Illumina’s q=2 indicating virtually no confidence in the base.
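
    The per-position average, and the caveat about fewer reads contributing at later positions, can be sketched in Python as follows. This is an illustration under stated assumptions, not the code behind the plot: it assumes four-line FASTQ records and Phred+33 quality encoding (the offset is a parameter in case the file uses something else), and the function names and example reads are hypothetical.

```python
from io import StringIO


def phred_to_error(q):
    """Convert a Phred quality score to the implied probability of a wrong base call."""
    return 10 ** (-q / 10)


def mean_quality_by_position(handle, max_reads=1600, offset=33):
    """Return (means, counts) per read position over the first max_reads FASTQ records."""
    sums, counts = [], []
    reads_seen = 0
    for line_number, line in enumerate(handle):
        if line_number % 4 != 3:  # the quality string is the fourth line of each record
            continue
        if reads_seen == max_reads:
            break
        reads_seen += 1
        for position, char in enumerate(line.strip()):
            if position == len(sums):  # first read long enough to reach this position
                sums.append(0)
                counts.append(0)
            sums[position] += ord(char) - offset  # assumed Phred+33 encoding
            counts[position] += 1
    means = [total / count for total, count in zip(sums, counts)]
    return means, counts


# Two made-up reads of different lengths: '5' encodes Q20 and ')' encodes Q8.
example = StringIO("@r1\nACGT\n+\n5555\n@r2\nAC\n+\n5)\n")
means, counts = mean_quality_by_position(example)
for position, (mean, count) in enumerate(zip(means, counts), start=1):
    print(position, count, mean)

print(phred_to_error(20), phred_to_error(8))
```

    Returning the counts alongside the means makes it easy to see how many reads back each position's average. On the Phred scale, a quality of 20 corresponds to an error probability of 1 in 100, and a quality of 8 to roughly 16%.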

    The E. coli reference genome totals about 4.69 Mbp. With 175 Mbp of data, the theoretical coverage is around 37.5-fold across the E. coli genome. Because this is a laboratory reference strain, we expect it to be genetically homogeneous, and essentially identical to the reference sequence. Any apparent SNPs or indels are likely due to sequencing error. To get a handle on this, I used VarScan 1 (which parses BLAT output) to detect SNPs and indels using the ~1.51 million uniquely mapped reads. Some 142,920 substitution (SNP) events were detected, reflecting a substitution error rate of 0.092%. The distribution of where those substitutions occurred indicates some interesting biases:

    [Figure: distribution of substitutions by read position]

    Most substitutions happen near the ends of reads, which is consistent with my earlier observations of base quality. There’s also a notable peak in the first few bases, which is more likely to represent an alignment artifact than true error rates. The average sequence depth across all substitutions was around 36, which is very consistent with the theoretical 37.5-fold coverage genome-wide. Looking at the context of substitutions, if one ignores a homopolymer issue (see below), the most frequent changes were T>C and A>G transitions, which often occurred when flanked by G/C bases.
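
    The coverage and error-rate arithmetic can be checked with a few lines of Python. The figures are the ones quoted in this post; the denominator for the error rate is not stated, so the sketch back-calculates the number of aligned bases that the quoted rate implies, which is an inference rather than a reported value.

```python
total_bases = 175_570_901    # bases in the run, as quoted above
genome_size = 4.69e6         # approximate E. coli reference length in bp, as quoted above
substitutions = 142_920      # substitution events reported by VarScan
reported_rate = 0.092 / 100  # substitution error rate quoted above

# Theoretical coverage: total sequenced bases divided by genome length.
coverage = total_bases / genome_size

# Not stated in the post: the aligned-base denominator implied by the quoted rate.
implied_aligned_bases = substitutions / reported_rate

print(f"theoretical coverage: {coverage:.1f}x")
print(f"implied aligned bases: {implied_aligned_bases / 1e6:.0f} Mbp")
```

    This gives a theoretical coverage of about 37.4-fold and an implied denominator of about 155 Mbp, which is in line with roughly 1.51 million uniquely mapped reads of about 104 bp each.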

    Homopolymers and Insertion/Deletion Errors

    VarScan detected an astonishing 1,122,276 insertion/deletion events, reflecting an indel error rate of 0.726%, or about eight-fold higher than the substitution error rate. VarScan’s heuristic filter immediately removed 983,367 indels (87.6%) because they were clear artifacts in homopolymer runs of 4 or more bases. Even the indels that passed this filter (28,182 insertions and 110,727 deletions) are mostly single base indels matching one of the flanking bases. In other words, an overcall or undercall in a homopolymer. This is very interesting, because I asked Jonathan Rothberg about this precise issue (homopolymer errors) at AGBT 2010, and he answered glibly that they wouldn’t be an issue because there wouldn’t be signal loss from the semiconductor (something about a linear pH change).

    Homopolymers are obviously an issue for this platform. In fact, when I went back and looked at the context of substitutions, at least 28% of them were due to an overcall followed by an undercall, e.g. AATT sequenced as AAAT, or vice-versa. Looking at the indel distributions, it seems that C and G were the bases most often affected (accounting for 58% of all indels), and they were far more likely to be under-called (i.e. bases missed) than over-called.
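
    To make the homopolymer idea concrete, here is a minimal Python sketch of a check for indels that fall in a run of four or more identical bases. It is not VarScan's actual filter; the function names, the 0-based coordinate convention (the position is the first deleted base, or the base just after an insertion point), and the example reference are all assumptions made for illustration.

```python
def homopolymer_run_length(reference, position):
    """Length of the run of identical bases in reference that contains position (0-based)."""
    base = reference[position]
    start = position
    while start > 0 and reference[start - 1] == base:
        start -= 1
    end = position
    while end + 1 < len(reference) and reference[end + 1] == base:
        end += 1
    return end - start + 1


def is_homopolymer_indel(reference, position, indel_bases, min_run=4):
    """True if an indel made of one repeated base touches a run of min_run or more of that base."""
    if len(set(indel_bases)) != 1:
        return False  # mixed-base indels are not simple overcalls or undercalls
    base = indel_bases[0]
    # Check the reference base at the indel site and the base immediately before it.
    for pos in (position - 1, position):
        if 0 <= pos < len(reference) and reference[pos] == base:
            if homopolymer_run_length(reference, pos) >= min_run:
                return True
    return False


reference = "CGTAAAAAGCT"  # made-up sequence with a run of five A bases
print(is_homopolymer_indel(reference, 5, "A"))  # A indel inside the run
print(is_homopolymer_indel(reference, 9, "C"))  # C indel at an isolated C
```

    The first call flags the indel as a likely overcall or undercall, while the second does not, because the C at that site is not part of a long run.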

    Applications and Implications

    The 454-like homopolymer issue raises some concerns about using this platform for variant discovery. Yet despite the errors, I’m impressed at how rapidly the technology has matured. In just a couple of years, IonTorrent is delivering over 100 Mbp per run. In this run, at least. There’s no guarantee that Dr. Rothberg himself will do the sequencing in future experiments (just an educated guess, based on his initials in the library name). What can you do with 100 Mbp of data? Human genome and exome sequencing are out, obviously, but more targeted efforts (PCR or custom capture) could achieve high read depth. Microbial sequencing is an obvious application as well, particularly when time is of the essence. At a price that’s 1/10 of most NGS systems, IonTorrent seems rather promising.


    Recurrent mutations in chronic lymphocytic leukemia

    June 10th, 2011

    A study published online at Nature reports the identification of four recurrently mutated genes by whole-genome sequencing of four cases with chronic lymphocytic leukemia (CLL). This is the most common adult leukemia in western nations, with two major subtypes distinguished by somatic hypermutation of the immunoglobulin heavy chain (IgH) variable region. Led by Xose S. Puente of Universidad de Oviedo in Spain, the authors applied a combination of whole-genome sequencing, exome sequencing, and long-insert library sequencing to tumor samples and matched (normal) controls from two patients of each subtype.

    Puente et al identified roughly 1,000 somatic mutations per tumor in unique regions, estimating a mutation rate of less than one per megabase. This is consistent with other leukemias, although (to my disappointment) the authors failed to refer to the first two sequenced leukemia genomes, AML1 (Ley et al, Nature 2008) and AML2 (Mardis et al, NEJM 2009), which I’ve cited below. In these four CLL cases, the most common substitution was G>A / C>T, which usually occurred in a CpG context. Interestingly, the mutation spectrum differed between subtypes; IGHV-mutated cases showed a higher fraction of A>C / T>G substitutions, and often A>C mutations occurred at adenines preceded by a thymine. The context and patterns of mutations in IGHV-mutated cases were consistent with error-prone polymerase activity during the normal process of somatic hypermutation of IGHV genes.

    Somatic Coding Mutations

    The authors assigned each somatic mutation to one of three classes:

    1. Class 1 mutations, which include nonsynonymous substitutions and frameshift indels
    2. Class 2 mutations, which include synonymous and UTR substitutions
    3. Class 3 mutations, comprising everything else.

    This classification system is similar to my group’s approach, which we apply separately to SNVs and indels. We classify variants as “tier 1” if they affect coding sequences, “tier 2” if they affect conserved bases or known regulatory elements, “tier 3” if they map to unique noncoding regions, or “tier 4” otherwise. For the present study, however, I dug into supplementary information to build this summary table of somatic coding mutations:

    Category               CLL1  CLL2  CLL3  CLL4
    Frameshift (indels)       1     2     2     0
    Nonsynonymous/Splice      9    18     9     5
    Synonymous (silent)       3     5     2     3
    Total Coding             13    25    13     8

    Summarized in the above fashion, these mutation counts are similar to the number observed in AML1 (n=10) and AML2 (n=12). The relatively small number of somatic coding mutations in leukemia is just incredible.
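
    As a rough illustration of the four-tier scheme described in this section, the rules can be written as an ordered series of checks in Python. This is a sketch and not my group's actual pipeline; the dictionary keys are hypothetical annotation flags that a real workflow would have to derive from gene models, conservation scores, and repeat annotation.

```python
def assign_tier(variant):
    """Return the tier (1 to 4) for a variant described by boolean annotation flags."""
    if variant.get("in_coding_sequence"):
        return 1  # affects coding sequence
    if variant.get("in_conserved_or_regulatory"):
        return 2  # conserved base or known regulatory element
    if variant.get("in_unique_noncoding"):
        return 3  # unique (non-repetitive) noncoding region
    return 4      # everything else


# Made-up variants; the flag names are assumptions for this example only.
variants = [
    {"in_coding_sequence": True, "in_conserved_or_regulatory": True},
    {"in_conserved_or_regulatory": True},
    {"in_unique_noncoding": True},
    {},
]
tiers = [assign_tier(variant) for variant in variants]
print(tiers)
```

    The order of the checks matters: a variant that is both coding and conserved is assigned to tier 1, because the first matching rule wins.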

    Recurrent Mutations in CLL

    Using a pooled sequencing strategy, Puente et al screened for mutations in 26 genes among 169 additional CLL cases. The rate of recurrence is reported for 363 CLL patients; it was unclear how the ~200 additional cases were examined. In any event, four genes proved to harbor recurrent mutations in CLL:

    • NOTCH1 (12% of cases), a key signaling molecule involved in developmental processes that controls cell fate decisions. The observed mutations generate a truncated protein lacking the PEST sequence, which was constitutively activated and more stable than the wild-type isoform. NOTCH1-mutated patients had more advanced CLL at presentation.
    • MYD88 (2.9% of cases), an effector molecule for IL-1 and Toll-like receptor (TLR) signaling. In mutated cells, activation of IL-1 or TLR signaling triggered a dramatic over-production of IL1RA, IL6, CCL2, CCL3, and CCL4. The high production of these cytokines is known to recruit macrophages and T-lymphocytes, creating a favorable micro-environment for tumor survival. Indeed, patients with MYD88 mutations were diagnosed at a younger age, and with more advanced tumors.
    • XPO1 (1.1% of cases), which encodes exportin 1, a protein implicated in the nuclear export of proteins and mRNAs (including MAP kinases). Notably, all four cases with this mutation were of the IGHV-unmutated subtype and had NOTCH1 mutations, indicating a possible synergistic effect between mutated NOTCH1 and XPO1.
    • KLHL6 (0.8% of cases), which plays a role in germinal center formation during B-cell maturation. The three mutated cases harbored multiple point mutations, consistent with somatic hypermutation.

    Based on functional and clinical analyses, the authors conclude that mutations in NOTCH1, MYD88, and XPO1 are oncogenic changes that contribute to the clinical evolution of CLL.

    Structural Alterations

    Using paired-end sequence data and a basic analytical approach, the authors identified ten somatic structural variants (SVs), most of which were known events in CLL. Three of four cases harbored a deletion of 13q14; the minimally-deleted region includes several genes and a couple of micro-RNAs. This is a known lesion in CLL, and was not pursued further in the main text. From the copy number data in Figure 1, it is clear that these CLL genomes harbor relatively few genomic rearrangements, which is again consistent with what we’ve seen for acute leukemia.

    Variant Detection Sensitivity and Specificity

    The authors mention that they employed not just WGS but exome sequencing, though the latter finds no place in the main text. Looking through the supplemental materials, I found that some 42 mutations were identified and validated by exome sequencing. Of these, 37 were found using WGS data, suggesting a sensitivity of ~88% for somatic coding mutations. All mutations were manually reviewed to remove common sequencing- and alignment-related artifacts. Some validation was performed using PCR and Sanger sequencing; among the 86 class 1 / class 2 variants for which PCR and Sanger sequence data were obtained, 83 proved to be valid somatic mutations. An additional 384 random mutations (96 per tumor) underwent validation as well, and 96% of these were validated. This is an impressive specificity, though I would attribute it to the manual review process, which may not scale to genomes with more than 10-20 somatic mutations.
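
    The sensitivity and validation figures in this section are simple ratios, and a short Python sketch makes the arithmetic explicit. The counts are the ones quoted above; the variable names and the helper function are my own.

```python
def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100 * part / whole


exome_validated = 42  # mutations identified and validated by exome sequencing
found_by_wgs = 37     # of those, the number also found in the WGS data
sanger_tested = 86    # class 1 / class 2 variants with PCR and Sanger data
sanger_valid = 83     # of those, the number confirmed as somatic

sensitivity = percent(found_by_wgs, exome_validated)
validation_rate = percent(sanger_valid, sanger_tested)
print(f"WGS sensitivity: {sensitivity:.1f}%")
print(f"Sanger validation rate: {validation_rate:.1f}%")
```

    These come out to about 88.1% and 96.5%, the latter matching the 96% validation rate reported for the 384 randomly selected mutations.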

    References

    Puente XS, Pinyol M, Quesada V, Conde L, Ordóñez GR, Villamor N, Escaramis G, Jares P, Beà S, González-Díaz M, Bassaganyas L, Baumann T, Juan M, López-Guerra M, Colomer D, Tubío JM, López C, Navarro A, Tornador C, Aymerich M, Rozman M, Hernández JM, Puente DA, Freije JM, Velasco G, Gutiérrez-Fernández A, Costa D, Carrió A, Guijarro S, Enjuanes A, Hernández L, Yagüe J, Nicolás P, Romeo-Casabona CM, Himmelbauer H, Castillo E, Dohm JC, de Sanjosé S, Piris MA, de Alava E, Miguel JS, Royo R, Gelpí JL, Torrents D, Orozco M, Pisano DG, Valencia A, Guigó R, Bayés M, Heath S, Gut M, Klatt P, Marshall J, Raine K, Stebbings LA, Futreal PA, Stratton MR, Campbell PJ, Gut I, López-Guillermo A, Estivill X, Montserrat E, López-Otín C, & Campo E (2011). Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia. Nature PMID: 21642962

    Mardis E, Ding L, Dooling D, Larson D, McLellan M, Chen K, Koboldt D, Fulton R, Delehaunty K, McGrath S, Fulton L, Locke D, Magrini V, Abbott R, Vickery T, Reed J, Robinson J, Wylie T, Smith S, Carmichael L, Eldred J, Harris C, Walker J, Peck J, Du F, Dukes A, Sanderson G, Brummett A, Clark E, McMichael J, Meyer R, Schindler J, Pohl C, Wallis J, Shi X, Lin L, Schmidt H, Tang Y, Haipek C, Wiechert M, Ivy J, Kalicki J, Elliott G, Ries R, Payton J, Westervelt P, Tomasson M, Watson M, Baty J, Heath S, Shannon W, Nagarajan R, Link D, Walter M, Graubert T, DiPersio J, Wilson R, & Ley T (2009). Recurring mutations found by sequencing an acute myeloid leukemia genome. New England Journal of Medicine, 361 (11), 1058-1066 DOI: 10.1056/NEJMoa0903840

    Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, Dooling D, Dunford-Shore BH, McGrath S, Hickenbotham M, Cook L, Abbott R, Larson DE, Koboldt DC, Pohl C, Smith S, Hawkins A, Abbott S, Locke D, Hillier LW, Miner T, Fulton L, Magrini V, Wylie T, Glasscock J, Conyers J, Sander N, Shi X, Osborne JR, Minx P, Gordon D, Chinwalla A, Zhao Y, Ries RE, Payton JE, Westervelt P, Tomasson MH, Watson M, Baty J, Ivanovich J, Heath S, Shannon WD, Nagarajan R, Walter MJ, Link DC, Graubert TA, DiPersio JF, & Wilson RK (2008). DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature, 456 (7218), 66-72 PMID: 18987736
