Archives for September 2009

VisionWalk Funds Sight-saving Research

September 25, 2009 by Dan Koboldt

This weekend is VisionWalk St. Louis, a fundraising event for the Foundation Fighting Blindness (FFB) that supports research of retinitis pigmentosa (RP), glaucoma, age-related macular degeneration (AMD), and other vision-stealing diseases.

Retinitis Pigmentosa

Retinitis pigmentosa is a degenerative eye disease that affects some 100,000 Americans and eventually leads to blindness. It can manifest as early as childhood, when affected individuals often have “night blindness” – difficulty seeing in low-light conditions. Peripheral and perceptive vision gradually decline with age, and by age 40, many people with RP are legally blind.

Mutations in as many as 60 genes may cause RP, at least 18 of which have been implicated in the autosomal dominant form of the disease (adRP). There is significant interest in pinpointing the causal mutations underlying this disease, particularly because clinical safety trials of gene therapy reported improved vision for several RP patients. We have an ongoing collaboration with Stephen Daiger and colleagues at the University of Texas (Houston), who have collected a large cohort of RP families with multiple affected individuals. If you read my VarScan paper, you might have noticed that our collaborative sequencing efforts were supported by Foundation Fighting Blindness.

Family Connections: Real Life Genetics Quiz

I admit that my interest in RP goes beyond research alone – my aunt, cousin, and several of their relatives are affected. My aunt is under driving restrictions due to her vision, and my cousin was told that he could go blind within a decade. Recently I sat down with them to discuss joining the family cohort and also to take down the pedigree. Here it is:

Family Pedigree of Retinitis Pigmentosa

While the pedigree is not complete enough to be certain, I have a guess as to the inheritance pattern that it suggests. But I’m not a classically trained geneticist, so here’s your chance to chime in. Let me know what kind of inheritance you think it is, and why. The correct answer, whatever it may be, has real-life implications for some of my relatives. The family chair of VisionWalk St. Louis, Kelly (another cousin of mine), is the mother on the bottom-left of the pedigree. She and her husband have two children; she was told – perhaps erroneously – that she was a carrier, so they’ve undergone rigorous tests. It would be helpful to know the genetic probabilities that either child inherited RP.

Support Team Koboldt at the VisionWalk

I’m happy to report that Team Koboldt will make a strong showing tomorrow at VisionWalk – with coordinating T-shirts to boot. We’re currently the fifth-place team in funds raised so far for VisionWalk St. Louis – and we’re gunning for the fourth-place team. So if you’d like to join Team Koboldt and support vision-saving research, here’s the link:

http://www.fightblindness.org/goto/DanKoboldt

Capture and Illumina Sequencing of Human Exomes

September 24, 2009 by Dan Koboldt

This month in Nature, a group from Jay Shendure’s lab reported perhaps the most ambitious targeted resequencing study to date – the whole exome sequences of 12 individuals.

Targeted capture and massively parallel sequencing of human exomes

Using an array-based hybridization capture method (2 microarrays, 10 μg of input DNA), Ng et al. selectively targeted CCDS regions totaling 26.6 Mb of sequence (~0.83% of the human genome). Capture specificity was similar to that of other published methods (35-55% of reads mapping to targets), but the completeness was astonishing – on average, 99.7% of target bases covered at least once and 96.3% covered at 8x with q>=30.

By focusing on coding exons, the authors achieved 51x coverage (on average) with just 6.4 Gb of mappable sequence per individual. Illumina 76-bp single-end sequencing was the platform of choice. If I make some rough empirical estimates of mapping rate and reads per lane, they generated a single Illumina run of data (7-8 lanes) per individual. Compared to whole-genome sequencing, the authors claim a 20-fold reduction in the amount of sequence required. I’d say this estimate is pretty close. Our second leukemia genome, which had 23x haploid coverage, took 16.5 Illumina runs to complete.
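A rough sketch of the arithmetic behind these figures, in Python. The 45% on-target fraction is an assumption (the midpoint of the 35-55% range quoted earlier), the function and constant names are made up for illustration, and treating the exome as exactly one run is a simplification. Duplicates and quality filtering are ignored, so the coverage figure is an upper bound, not a reproduction of the reported 51x.

```python
def mean_target_coverage(mapped_bases, on_target_fraction, target_size):
    """Naive mean depth: bases landing on target divided by target size."""
    return mapped_bases * on_target_fraction / target_size

MAPPED_BASES = 6.4e9   # mappable sequence per individual (figure quoted above)
TARGET_SIZE = 26.6e6   # targeted CCDS sequence in bases (figure quoted above)
ON_TARGET = 0.45       # ASSUMPTION: midpoint of the 35-55% specificity range

naive = mean_target_coverage(MAPPED_BASES, ON_TARGET, TARGET_SIZE)
print(f"naive mean coverage: {naive:.0f}x")   # about 108x before any filtering

# Run-count comparison using the numbers in the paragraph above:
# roughly 1 Illumina run per exome vs. 16.5 runs for a 23x genome.
EXOME_RUNS = 1.0       # ASSUMPTION: "a single Illumina run" taken literally
GENOME_RUNS = 16.5
print(f"runs saved: {GENOME_RUNS / EXOME_RUNS:.1f}-fold")
```

The naive figure is roughly twice the reported 51x; the gap presumably reflects de-duplication and quality filtering, neither of which this sketch models.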

Strong Illumina Pipeline

It’s not simply the technological feat that impressed me about this study. The presentation of the work and underlying analytical approaches are just outstanding. While reading through the methods, I couldn’t help but think that nearly every step the authors took in processing their data was something that we’ve implemented here – Maq alignment, start site de-duplication, mining Maq-unplaced reads for indels, etc. We have a bit of a friendly rivalry with University of Washington (since we are, after all, Washington University), so I looked for weak points. Try as I might, I couldn’t find much to criticize about the analysis. When it comes to Illumina sequencing, UW seems to know what they’re doing.

How to Write A Nature Paper

And the paper itself is just clear, concise, well-written – everything I’d expect from a Nature publication. Take Figure 1, for example. Figure 1, in general, is the focal point of most research papers, and for that reason I think many authors try to cram way too much into it. Not this time. Four histograms that all have “Number of observations of minor allele” as their X-axis. Yet each one tells a different story: (a) how novel-to-dbSNP variants were rare; (b) how nonsynonymous variant frequencies are shifted to lower values relative to those of synonymous variants; (c) how this shift in allele frequencies is more pronounced for damaging nsSNPs, consistent with natural selection; and (d) how the sizes of observed indels are enriched for non-frameshift events divisible by 3.

Illumina Sequencing and Deduplication

Early in our days of Illumina/Solexa sequencing, we observed a strange phenomenon in the data: lots of reads with identical start sites and orientations. The theory was that these occasional pileups were PCR-related, and each one arose from a single molecule that somehow was sequenced over and over again. Since just about every downstream analysis (coverage, mutation detection, etc.) relies on unbiased read counts, it’s important to normalize for such events. This requires a “de-duplication” step in which multiple reads with the same start site and orientation (presumably the same molecule) are collapsed so that only one is kept.
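A minimal sketch of start-site de-duplication, assuming reads have already been aligned and reduced to a chromosome, start position, and strand. The `Read` type and `deduplicate` function are hypothetical names for illustration, not part of Maq or any other tool. This version keeps the first read seen for each key; a production pipeline would more likely keep the best-quality read and would take care over how the start of a reverse-strand read is defined.

```python
from typing import Iterable, List, NamedTuple

class Read(NamedTuple):
    name: str
    chrom: str
    start: int    # leftmost aligned position (simplification for both strands)
    strand: str   # "+" or "-"

def deduplicate(reads: Iterable[Read]) -> List[Read]:
    """Keep the first read seen for each (chrom, start, strand) key."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, read.start, read.strand)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept

reads = [
    Read("r1", "chr1", 100, "+"),
    Read("r2", "chr1", 100, "+"),   # same start and strand as r1: discarded
    Read("r3", "chr1", 100, "-"),   # same start, opposite strand: kept
    Read("r4", "chr1", 101, "+"),   # different start: kept
]
print([r.name for r in deduplicate(reads)])   # ['r1', 'r3', 'r4']
```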

Credit: Nature 461:272-276 (2009)

One implication of this deduplication requirement, as pointed out by Ng et al., is that the maximum read depth for any given position in the genome is twice the read length for single-end libraries. In their case, 152x. One might be concerned that even with de-duplication there would be substantial bias in targeted capture. But look at the bell curve of the coverage distribution from supplemental figure 1 (left).

Someone had better call O’Reilly, because that’s just beautiful data. Importantly, the deduplication paradigm changes somewhat for paired-end sequencing, which is largely what we do here. With paired ends, you have two reads from each molecule, each with a start site and orientation. So the maximum coverage immediately jumps to 4 times the read length. Furthermore, due to the variation in fragment sizes of sheared DNA, insert sizes add further distinction for different molecules, allowing for read depths of 1000x or more after de-duplication for paired-end reads.
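The depth ceilings can be checked by brute force. The sketch below counts the distinct de-duplication keys that can cover a single position. It assumes a paired-end key of (fragment start, fragment size, orientation) and fragments at least twice the read length, so the two reads never overlap; both are simplifications for illustration, and the function names are made up.

```python
def max_depth_single_end(read_len):
    """A read covering a position can start at read_len offsets, on 2 strands."""
    return 2 * read_len

def max_depth_paired_end(read_len, fragment_sizes):
    """Count distinct (start, size, orientation) keys with a read over position 0."""
    keys = set()
    for size in fragment_sizes:
        for start in range(-size + 1, 1):       # fragments spanning position 0
            end = start + size - 1
            left_covers = start <= 0 <= start + read_len - 1
            right_covers = end - read_len + 1 <= 0 <= end
            if left_covers or right_covers:
                for orientation in ("+", "-"):
                    keys.add((start, size, orientation))
    return len(keys)

print(max_depth_single_end(76))                     # 152
print(max_depth_paired_end(76, [200]))              # 304, i.e. 4 x 76
print(max_depth_paired_end(76, range(200, 301)))    # 30704
```

With a single fragment size the ceiling is four times the read length, and each additional distinct fragment size adds another 4 x 76 keys, which is why depths well beyond 1000x remain possible after de-duplication.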

Identifying Disease-Causing Mutations

What pleased me most about this study is that the authors didn’t just present exome capture and sequencing of “undiseased” individuals. In addition to 8 HapMap samples, they included four samples from unrelated individuals with Freeman–Sheldon syndrome (FSS), an autosomal-dominant disorder caused by mutations in MYH3. After collecting the set of coding variants in each individual, the authors asked a simple question: could we have pinpointed the disease gene from mutation data? With the knowledge in hand that this was a monogenic, autosomal-dominant disorder, the authors assumed that the same gene might be mutated in most (or all) samples. And since the disease itself is uncommon, the authors inferred that common variants could be excluded. So, with the full set of mutations for each affected individual in hand, the authors looked for genes where:

There was at least one (but not necessarily the same) nonsynonymous SNP, splice-site SNP, or coding indel in all four samples.
The mutations were novel; that is, they weren’t found in dbSNP or the other 8 HapMap samples.
The mutations were predicted to damage the encoded protein.

When these criteria were applied, the authors whittled down a list of 4,510 genes with mutations in at least one sample to just 1, and that gene was MYH3. Thus, whole-exome sequencing allowed for direct identification of a disease-causing gene with just a few samples from affected individuals. Granted, the authors got lucky. The causal mutations might have been SVs, or missed by variant callers, or not covered sufficiently by sequence data. Or, the disorder might be caused by a single mutation in one of several genes, as is the case of autosomal dominant RP, a monogenic disorder for which at least 16 genes have been implicated.
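The three-criterion filter described above can be sketched as a set intersection. This is an illustration under stated assumptions, not the authors’ pipeline: the `Variant` fields, function names, sample labels, and the placeholder genes `GENE_A` and `GENE_B` are all invented, and the toy data exists only to show the logic.

```python
from typing import Dict, List, NamedTuple, Set

class Variant(NamedTuple):
    gene: str
    effect: str      # "nonsynonymous", "splice", "indel", "synonymous", ...
    known: bool      # True if present in dbSNP or the HapMap exomes
    damaging: bool   # True if predicted to damage the protein

CANDIDATE_EFFECTS = {"nonsynonymous", "splice", "indel"}

def candidate_genes(variants: List[Variant]) -> Set[str]:
    """Genes with at least one novel, damaging, protein-altering variant."""
    return {
        v.gene
        for v in variants
        if v.effect in CANDIDATE_EFFECTS and not v.known and v.damaging
    }

def shared_candidate_genes(samples: Dict[str, List[Variant]]) -> Set[str]:
    """Genes that pass the filter in every affected sample."""
    gene_sets = [candidate_genes(vs) for vs in samples.values()]
    return set.intersection(*gene_sets) if gene_sets else set()

# Toy data: four hypothetical affected individuals.
samples = {
    "FSS1": [Variant("MYH3", "nonsynonymous", False, True),
             Variant("GENE_A", "nonsynonymous", True, False)],
    "FSS2": [Variant("MYH3", "nonsynonymous", False, True),
             Variant("GENE_B", "indel", False, True)],
    "FSS3": [Variant("MYH3", "nonsynonymous", False, True),
             Variant("GENE_B", "nonsynonymous", False, True)],
    "FSS4": [Variant("MYH3", "nonsynonymous", False, True),
             Variant("GENE_A", "synonymous", False, False)],
}
print(shared_candidate_genes(samples))   # {'MYH3'}
```

Note that the intersection is over genes, not over specific mutations, which matches the first criterion: each sample may carry a different variant in the same gene.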

Even so, the authors applied a relatively straightforward approach and got the right answer. With whole-exome sequencing capability within reach, finding the genes behind autosomal disorders is only a matter of time.

References
Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, & Shendure J (2009). Targeted capture and massively parallel sequencing of 12 human exomes. Nature, 461 (7261), 272-6 PMID: 19684571

HITS-CLIP Unravels microRNA-mRNA Interactions

September 3, 2009 by Dan Koboldt

Micro-RNAs (miRNAs) are short (18-26 nt) sequences that act as post-transcriptional repressors of gene expression. Over 700 miRNAs have been reported in the human genome; each is believed to bind directly to many mRNAs to regulate their translation or stability. Thus, miRNAs represent a key regulatory mechanism affecting numerous cellular activities, and are of particular interest in cancer research. Understanding the complex relationships between miRNAs and mRNAs remains challenging, however, and computational approaches alone have been largely unsuccessful.

HITS-CLIP: Isolation and Sequencing of Argonaute-miRNA-mRNA Complexes

Enter HITS-CLIP, a new approach that applies high-throughput sequencing of RNAs isolated by crosslinking immunoprecipitation. Essentially, it’s a method by which ultraviolet radiation is used to cross-link protein-RNA complexes so that they can be stringently purified. Then, massively parallel sequencing yields all of the RNA “tags” bound by the protein of interest.

Ago-miRNA-mRNA Complex (Image Credit: Nature 460: 479-486, 2009)

In a recent Nature paper, Chi et al. used HITS-CLIP to isolate RNA bound by the Argonaute protein (Ago), which mediates miRNA-mRNA interaction (see figure). The purified complexes showed two different modal sizes (110 kDa and 130 kDa), suggesting that Ago (97 kDa) was crosslinked to two different RNA species – hopefully, miRNAs (small) and the mRNAs that they were targeting (large).

The authors applied Illumina high-throughput sequencing to characterize Ago-bound miRNAs and the mRNA “tags” to which they were linked. With relatively straightforward bioinformatics approaches, it was possible to cross-reference expressed miRNAs with complementary sequences of mRNA tags. The resulting “ternary map” of miRNA-mRNA interaction sites yields a wealth of information about this post-transcriptional regulatory mechanism.
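To give a feel for what cross-referencing miRNAs with mRNA tags can involve, here is a minimal seed-matching sketch. It assumes the conventional definition of a seed as miRNA nucleotides 2-8 and looks for the reverse complement of that seed in each tag. This is not the authors’ actual analysis; the function names and tag sequences are invented, and the miR-124 sequence should be treated as illustrative and checked against miRBase before reuse.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def seed_site(mirna, start=2, end=8):
    """Reverse complement of the miRNA seed (1-based positions start..end)."""
    seed = mirna[start - 1:end]
    return seed.translate(COMPLEMENT)[::-1]

def find_seed_matches(mirnas, tags):
    """Return (miRNA name, tag name, 0-based offset) for every seed match."""
    hits = []
    for mir_name, mir_seq in mirnas.items():
        site = seed_site(mir_seq)
        for tag_name, tag_seq in tags.items():
            pos = tag_seq.find(site)
            while pos != -1:
                hits.append((mir_name, tag_name, pos))
                pos = tag_seq.find(site, pos + 1)
    return hits

mirnas = {"miR-124": "UAAGGCACGCGGUGAAUGCC"}   # illustrative; verify in miRBase
tags = {
    "tag1": "ACCAGUGCCUUAGC",   # invented tag containing the site GUGCCUU
    "tag2": "AUGGCUAGCUAGGA",   # invented tag with no site
}
print(seed_site(mirnas["miR-124"]))      # GUGCCUU
print(find_seed_matches(mirnas, tags))   # [('miR-124', 'tag1', 4)]
```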

Decoding miRNA-mRNA Interaction

The authors identified 454 unique miRNAs crosslinked to Ago in the mouse brain; miR-30e was the most abundant species, representing 14% of all miRNA tags. In silico clustering and normalization of the messenger RNA tags yielded 1,463 robust clusters from 829 different brain transcripts.

Locations of Ago-bound mRNA tags (Image Credit: Nature 460: 479-486, 2009)

When these tags were overlaid with gene annotations, several patterns emerged. As expected, a substantial portion (40%) of Ago-bound tags were in 3′ UTRs, where miRNA activity is known to have high efficacy. Some 8% (one-fifth of the 40%) were actually outside of the UTR but <10 kb downstream, regions likely to harbor unannotated 3′ UTRs.

Unsurprisingly, very few Ago-bound tags were in 5′ UTRs. However, a substantial fraction of tags fell in coding sequences (25%), introns (12%), and non-coding RNAs (4%), suggesting that miRNA activity occurs in these regions as well. Another 6% of tags were in intergenic regions, possibly in as-yet-unannotated transcripts. These unexpected locations of miRNA binding may offer additional insights into the mechanisms of miRNA regulation.

Next, the authors sought to define the Ago-mRNA “footprint” in which the majority of tags were contained. The distribution of tags in a defined cluster, at least in their figure, looks like a bell curve, with a sharp peak in the middle. About 95% of the time, Ago bound within 45-62 nucleotides of this peak, so the authors defined this region as the average Ago-mRNA footprint. Linear regression analysis of all 6-8 base motifs in clusters yielded numerous “enriched” seed sequences; the most prevalent corresponded to the binding site of miR-124, a well-known brain-specific miRNA. Indeed, Ago-mRNA footprints were rich in miRNA binding sites, suggesting that this approach may predict active sites with far better specificity than other methods.
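One simple way to think about a footprint defined around a cluster peak is sketched below: take the most common tag position as the peak, then find the smallest distance from the peak that contains 95% of tags. This is only an illustration of the idea and not the procedure used in the paper; the function name and the toy positions are invented.

```python
from collections import Counter

def footprint_half_width(positions, percent=95):
    """Return (peak, d) where at least `percent`% of positions lie within d of the peak.

    The peak is the most frequent position; ties go to the smallest position.
    """
    if not positions:
        raise ValueError("no tag positions given")
    counts = Counter(positions)
    peak = max(counts, key=lambda p: (counts[p], -p))
    distances = sorted(abs(p - peak) for p in positions)
    needed = (percent * len(distances) + 99) // 100   # integer ceiling
    return peak, distances[needed - 1]

# Toy cluster: most tags pile up at position 50, with one distant outlier.
positions = [50] * 10 + [49] * 4 + [51] * 4 + [45] + [80]
print(footprint_half_width(positions))   # (50, 5)
```

In the toy data the outlier at position 80 falls outside the 95% window, so the half-width is set by the tag at position 45.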

HITS-CLIP Implications

By reducing the search space for miRNA binding sites to a 45-62-nucleotide Ago footprint, HITS-CLIP offers a powerful complementary approach to bioinformatic methods for miRNA binding site prediction. Computational approaches alone are known to have high false positive rates, whereas the authors estimate FP rates of just 15% for HITS-CLIP. The new method offers dramatic improvement for transcripts with highly conserved 3′ UTRs, which often have many “predicted” miRNA binding sites because so many computational methods rely on conservation. Analysis of the HITS-CLIP ternary map revealed that real miRNA-mRNA binding events are very specific, with an average of just 2.6 Ago-mRNA clusters per regulated transcript. Despite the thousands of predicted binding sites, each miRNA bound an average of 655 targets. These results suggest that miRNA selectivity is much higher than previously believed. Yet Ago-mRNA clusters seemed to show no apparent sequence preference (data not shown), so it’s likely that other RNA-binding proteins are involved.

Thus, this study sets the stage for large scale genome-wide RNA-protein maps that include other proteins, tissues, and species, which should yield an unprecedented new level of understanding of this complex regulatory process.

References
Chi SW, Zang JB, Mele A, & Darnell RB (2009). Argonaute HITS-CLIP decodes microRNA-mRNA interaction maps. Nature, 460 (7254), 479-86 PMID: 19536157