    VisionWalk Funds Sight-saving Research

    September 25th, 2009

    This weekend is VisionWalk St. Louis, a fundraising event for the Foundation Fighting Blindness (FFB) that supports research of retinitis pigmentosa (RP), glaucoma, age-related macular degeneration (AMD), and other vision-stealing diseases.

    Retinitis Pigmentosa

    Retinitis pigmentosa is a degenerative eye disease affecting some 100,000 Americans that eventually leads to blindness.  It manifests as early as childhood, when affected individuals often have “night blindness” – difficulty seeing in low light conditions.  Peripheral and perceptive vision gradually decline with age, and by age 40, many people with RP are legally blind.

    Mutations in as many as 60 genes may cause RP, at least 18 of which have been implicated in the autosomal dominant form of the disease (adRP).  There is significant interest in pinpointing the causal mutations underlying this disease, particularly because clinical safety trials of gene therapy reported improved vision for several RP patients.  We have an ongoing collaboration with Stephen Daiger and colleagues at the University of Texas (Houston), who have collected a large cohort of RP families with multiple affected individuals.  If you read my VarScan paper, you might have noticed that our collaborative sequencing efforts were supported by Foundation Fighting Blindness.

    Family Connections: Real Life Genetics Quiz

    I admit that my interest in RP goes beyond research alone – my aunt, cousin, and several of their relatives are affected.  My aunt is under driving restrictions due to her vision, and my cousin was told that he could go blind within a decade.  Recently I sat down with them to discuss joining the family cohort and also to take down the pedigree.  Here it is:

    Family Pedigree of Retinitis Pigmentosa

    While the pedigree is not complete enough to be certain, I have a guess as to the inheritance pattern that it suggests.  But I’m not a classically trained geneticist, so here’s your chance to chime in.  Let me know what kind of inheritance you think it is, and why.  The correct answer, whatever it may be, has real life implications for some of my relatives.  The family chair of VisionWalk St. Louis, Kelly (another cousin of mine), is the mother on the bottom-left of the pedigree.  She and her husband have two children; she was told – perhaps erroneously – that she was a carrier, so they’ve undergone rigorous tests.  It would be helpful to know the genetic probabilities that either child inherited RP.

    Support Team Koboldt at the VisionWalk

    I’m happy to report that Team Koboldt will make a strong showing tomorrow at VisionWalk – with coordinating T-shirts to boot.  We’re currently the 5th place team in terms of funds raised so far for St. Louis VisionWalk – and we’re gunning for the fourth place team.  So if you’d like to join Team Koboldt and support vision-saving research, here’s the link:

    http://www.fightblindness.org/goto/DanKoboldt

    Capture and Illumina Sequencing of Human Exomes

    September 24th, 2009

    This month in Nature, a group from Jay Shendure’s lab reported perhaps the most ambitious targeted resequencing study to date – the whole exome sequences of 12 individuals.

    Targeted capture and massively parallel sequencing of 12 human exomes

    Using an array-based hybridization capture method (2 microarrays, 10 µg of input DNA), Ng et al. selectively targeted CCDS regions totaling 26.6 Mb of sequence (~0.83% of the human genome). Capture specificity was similar to that of other published methods (35-55% of reads mapping to targets), but the completeness was astonishing – on average, 99.7% of target bases covered at least once and 96.3% covered at 8x with q>=30.

    By focusing on coding exons, the authors achieved 51x coverage (on average) with just 6.4 Gb of mappable sequence per individual.  Illumina 76-bp single-end sequencing was the platform of choice.  If I make some rough empirical estimates of mapping rate and reads per lane, they generated a single Illumina run of data (7-8 lanes) per individual.  Compared to whole-genome sequencing, the authors claim a 20-fold reduction in the amount of sequence required.  I’d say this estimate is pretty close.  Our second leukemia genome, which had 23x haploid coverage, took 16.5 Illumina runs to complete.
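
    The figures in this paragraph can be sanity-checked with a little arithmetic.  The sketch below uses only numbers quoted in the post, plus one assumption of its own: a haploid genome size of roughly 3.2 Gb.  It counts mapped reads rather than raw reads, so it is a lower bound on what each lane actually produced.

```python
# Back-of-the-envelope check of the sequencing numbers quoted above.
READ_LENGTH_BP = 76      # single-end read length (from the post)
MAPPABLE_BP = 6.4e9      # mappable sequence per individual (from the post)
TARGET_BP = 26.6e6       # size of the CCDS target (from the post)
GENOME_BP = 3.2e9        # ASSUMPTION: approximate haploid human genome size

# Number of mapped reads needed to reach 6.4 Gb at 76 bp per read.
mapped_reads = MAPPABLE_BP / READ_LENGTH_BP

# Mapped reads per lane if the data came from 7 or 8 lanes.
reads_per_lane = [mapped_reads / lanes for lanes in (7, 8)]

# Fraction of the genome represented by the capture target.
target_fraction = TARGET_BP / GENOME_BP

print(f"mapped reads per individual: {mapped_reads / 1e6:.1f} million")
print(f"mapped reads per lane: {reads_per_lane[1] / 1e6:.1f}"
      f" to {reads_per_lane[0] / 1e6:.1f} million")
print(f"target as a fraction of the genome: {target_fraction:.2%}")
```

    Roughly 84 million mapped reads spread over 7-8 lanes works out to 10-12 million mapped reads per lane, and the 26.6 Mb target comes to about 0.83% of a 3.2 Gb genome.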

    Strong Illumina Pipeline

    It’s not simply the technological feat that impressed me about this study.  The presentation of the work and underlying analytical approaches are just outstanding.  While reading through the methods, I couldn’t help but think that nearly every step the authors took in processing their data was something that we’ve implemented here – Maq alignment, start site de-duplication, mining Maq-unplaced reads for indels, etc.  We have a bit of a friendly rivalry with University of Washington (since we are, after all, Washington University), so I looked for weak points.  Try as I might, I couldn’t find much to criticize about the analysis.  When it comes to Illumina sequencing, UW seems to know what they’re doing.

    How to Write A Nature Paper

    And the paper itself is just clear, concise, well-written – everything I’d expect from a Nature publication.  Take Figure 1, for example.  Figure 1, in general, is the focal point of most research papers, and for that reason I think many authors try to cram way too much into it.  Not this time.  Four histograms that all have “Number of observations of minor allele” as their X-axis.  Yet each one tells a different story: (a) how novel-to-dbSNP variants were rare; (b) how nonsynonymous variant frequencies are shifted to lower values relative to those of synonymous variants; (c) how this shift in allele frequencies is more pronounced for damaging nsSNPs, consistent with natural selection; and (d) how the sizes of observed indels are enriched for non-frameshift events divisible by 3.

    Illumina Sequencing and Deduplication

    Early into our days of Illumina/Solexa sequencing, we observed a strange phenomenon in the data: lots of reads with identical start sites and orientations.  The theory was that these occasional pileups were PCR-related, and each one arose from a single molecule that somehow was sequenced over and over again.  Since just about every downstream analysis (coverage, mutation detection, etc.) relies on unbiased read counts, it’s important to normalize for such events.  This requires a “de-duplication” step in which multiple reads with the same start site and orientation (presumably the same molecule) are discarded and only one is kept.
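
    The de-duplication step described above can be sketched in a few lines.  This is a minimal illustration, not the pipeline used at either center: the `Read` record and its fields are hypothetical, and `start` is simply taken to be the leftmost aligned coordinate on either strand.

```python
from typing import Iterable, List, NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal alignment record, for illustration only."""
    name: str
    chrom: str
    start: int    # leftmost aligned coordinate (simplifying assumption)
    strand: str   # "+" or "-"


def deduplicate(reads: Iterable[Read]) -> List[Read]:
    """Keep the first read seen for each (chromosome, start, strand) key."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, read.start, read.strand)
        if key in seen:
            continue  # same start site and orientation: presumed duplicate
        seen.add(key)
        kept.append(read)
    return kept


reads = [
    Read("r1", "chr1", 100, "+"),
    Read("r2", "chr1", 100, "+"),  # duplicate of r1
    Read("r3", "chr1", 100, "-"),  # same start, opposite orientation: kept
    Read("r4", "chr1", 101, "+"),  # different start: kept
]
unique_reads = deduplicate(reads)
print([r.name for r in unique_reads])  # ['r1', 'r3', 'r4']
```

    Keeping the first read is an arbitrary choice here; a real pipeline might prefer the read with the best base or mapping quality.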

    Credit: Nature 461:272-276 (2009)

    The implication of this deduplication requirement, as pointed out by Ng et al., is that the maximum read depth for any given position in the genome is twice the read length for single-end libraries.  In their case, 152x.  One might be concerned that even with de-duplication there would be substantial bias in targeted capture.  But look at the bell curve of the coverage distribution from supplemental figure 1 (left).
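
    The 152x ceiling follows from counting start sites.  The sketch below enumerates every distinct (start, strand) combination whose read would overlap a given position, assuming all reads have the same length and that `start` means the leftmost aligned coordinate; the function name is made up for this example.

```python
def max_depth_single_end(read_length: int) -> int:
    """Count distinct (start, strand) keys that can cover one position."""
    position = 10 * read_length  # any position away from the contig edge
    covering = set()
    for strand in ("+", "-"):
        # A read of this length covers `position` only if it starts within
        # the read_length coordinates ending at `position`.
        for start in range(position - read_length + 1, position + 1):
            covering.add((start, strand))
    return len(covering)


print(max_depth_single_end(76))  # 152
```

    With one read allowed per start site and orientation, 76 possible start sites on each of two strands gives 152.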

    Someone had better call O’Reilly, because that’s just beautiful data.  Importantly, the deduplication paradigm changes somewhat for paired-end sequencing, which is largely what we do here.  With paired ends, you have two reads from each molecule, each with a start site and orientation.  So the maximum coverage immediately jumps to 4 times the read length.  Furthermore, due to the variation in fragment sizes of sheared DNA, insert sizes add further distinction for different molecules, allowing for read depths of 1000x or more after de-duplication for paired-end reads.
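
    For paired-end data, the same idea can be sketched by keying on both ends of the fragment, so that two pairs sharing one start site still count as different molecules when their mates differ.  The `ReadPair` record is hypothetical and this is only one plausible way to build the key.

```python
from typing import Iterable, List, NamedTuple


class ReadPair(NamedTuple):
    """Hypothetical record for an aligned read pair, for illustration only."""
    name: str
    chrom: str
    start1: int    # leftmost coordinate of read 1
    strand1: str
    start2: int    # leftmost coordinate of read 2
    strand2: str


def deduplicate_pairs(pairs: Iterable[ReadPair]) -> List[ReadPair]:
    """Keep one pair per combination of both ends' start and orientation."""
    seen = set()
    kept = []
    for pair in pairs:
        key = (pair.chrom, pair.start1, pair.strand1,
               pair.start2, pair.strand2)
        if key in seen:
            continue  # both ends identical: presumed duplicate molecule
        seen.add(key)
        kept.append(pair)
    return kept


pairs = [
    ReadPair("p1", "chr1", 100, "+", 350, "-"),
    ReadPair("p2", "chr1", 100, "+", 350, "-"),  # duplicate of p1
    ReadPair("p3", "chr1", 100, "+", 372, "-"),  # longer fragment: kept
]
unique_pairs = deduplicate_pairs(pairs)
print([p.name for p in unique_pairs])  # ['p1', 'p3']
```

    Because the mate position varies with fragment size, many more distinct keys can pile up at one position than in the single-end case.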

    Identifying Disease-Causing Mutations

    What pleased me most about this study is that the authors didn’t just present exome capture and sequencing of “undiseased” individuals.  In addition to 8 HapMap samples, they included four samples from unrelated individuals with Freeman–Sheldon syndrome (FSS), an autosomal-dominant disorder caused by mutations in MYH3.  After collecting the set of coding variants in each individual, the authors asked a simple question: could we have pinpointed the disease gene from mutation data? With the knowledge in hand that this was a monogenic, autosomal-dominant disorder, the authors assumed that the same gene might be mutated in most (or all) samples.  And since the disease itself is uncommon, the authors inferred that common variants could be excluded. So, with the full set of mutations for each affected individual in hand, the authors looked for genes where:

    1. There was at least one (but not necessarily the same) nonsynonymous SNP, splice-site SNP, or coding indel in all four samples.
    2. The mutations were novel; that is, they weren’t found in dbSNP or the other 8 HapMap samples.
    3. The mutations were predicted to damage the encoded protein
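
    The three criteria above amount to filtering each sample's variants and then intersecting the surviving gene sets.  The sketch below is a toy illustration under that reading, not the authors' code: the `Variant` record, its boolean flags, and the gene names are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Set


class Variant(NamedTuple):
    """Hypothetical annotated variant call, for illustration only."""
    gene: str
    variant_id: str
    is_coding_change: bool  # nonsynonymous SNP, splice-site SNP, or coding indel
    is_novel: bool          # absent from dbSNP and the 8 HapMap exomes
    is_damaging: bool       # predicted to damage the encoded protein


def candidate_genes(variants_by_sample: Dict[str, List[Variant]]) -> Set[str]:
    """Genes with a qualifying variant in every sample."""
    per_sample = []
    for variants in variants_by_sample.values():
        per_sample.append({
            v.gene for v in variants
            if v.is_coding_change and v.is_novel and v.is_damaging
        })
    return set.intersection(*per_sample) if per_sample else set()


# Toy data: GENE_A has a different qualifying variant in each sample,
# GENE_B is missing from one sample, GENE_C is shared but not novel.
variants_by_sample = {
    "FSS1": [Variant("GENE_A", "a1", True, True, True),
             Variant("GENE_B", "b1", True, True, True),
             Variant("GENE_C", "c1", True, False, True)],
    "FSS2": [Variant("GENE_A", "a2", True, True, True),
             Variant("GENE_B", "b1", True, True, True),
             Variant("GENE_C", "c1", True, False, True)],
    "FSS3": [Variant("GENE_A", "a1", True, True, True),
             Variant("GENE_B", "b2", True, True, True),
             Variant("GENE_C", "c1", True, False, True)],
    "FSS4": [Variant("GENE_A", "a3", True, True, True),
             Variant("GENE_C", "c1", True, False, True)],
}
print(candidate_genes(variants_by_sample))  # {'GENE_A'}
```

    Intersecting on genes rather than on variants is what allows each affected individual to carry a different mutation in the same gene.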

    When these criteria were applied, the authors whittled down a list of 4,510 genes with mutations in at least one sample to just 1, and that gene was MYH3.  Thus, whole-exome sequencing allowed for direct identification of a disease-causing gene with just a few samples from affected individuals.  Granted, the authors got lucky.  The causal mutations might have been SVs, or missed by variant callers, or not covered sufficiently by sequence data.  Or, the disorder might be caused by a single mutation in one of several genes, as is the case of autosomal dominant RP, a monogenic disorder for which at least 16 genes have been implicated.

    Even so, the authors applied a relatively straightforward approach and got the right answer.  With whole-exome sequencing capability within reach, finding the genes behind autosomal disorders is only a matter of time.

    References
    Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, & Shendure J (2009). Targeted capture and massively parallel sequencing of 12 human exomes. Nature, 461 (7261), 272-6 PMID: 19684571

    NGS Informatics: Hail to the Chief

    September 17th, 2009

    Bio-IT World’s Kevin Davies has a nice interview with David Dooling, who heads informatics here at the Genome Center and still finds time for his PolITiGenomics blog.  Dooling joined the center in 2001, as the Human Genome Project was wrapping up.  Now, he oversees about half of our informatics group – including IT personnel as well as the developers of our LIMS and automated data pipelines.

    All three groups, now that I think about it, have had to address significant challenges during our transition to a next-generation sequencing center.  Our LIMS deals with tens of millions of transactions per month, with a back-end database whose tables sometimes have billions of records.  Our automated pipeline (or APIPE) group develops all of the data pipelines that make whole-genome sequencing feasible – primary data analysis, alignment, coverage reporting, mutation detection, etc.  And the IT group must address the exponentially growing needs of data transfer and compute time for all of it – not an easy job.

    Despite these monumental tasks, under the leadership of David and others we’re currently “on a good path” to handle the current generation of sequencing tools.  Of course, that may change in the next couple of years, when technologies like Pac Bio’s SMRT platform begin cranking out single-molecule sequences 1,000 bases long or longer.

    In-House and Open Source

    Bio-IT World is heavily read by providers of commercial informatics tools, and this is reflected somewhat in the interview.  Davies often asks whether we’re working with any specific vendors, or considering any commercial tools.  Often enough we are – certainly for storage and data transfer systems, things that can’t be built from the ground up.  Yet whenever possible, we opt for the open-source solution.  Every workstation here, for example, is Linux.  We have but one Windows PC, and it’s not allowed to connect to the internet.  Most of our LIMS system and many of our in-house tools were written in Perl.

    A Tough Nut for Commercial Vendors

    There are, of course, commercial alternatives to anything.  Yet vendors face significant hurdles in marketing products to large genome centers.  The tools that we use are often highly customized, and must continually evolve to address new technological developments.  Take aligners for example.  In the early days of Illumina sequencing, we licensed some commercial software – SLIMsearch and SXOG, for example – because there simply were no good alternatives to ELAND.  Then Maq came along, offering better functionality and performance in a free and open source program (offered, no less, by our trusted friends across the pond).  Exorbitantly priced licenses, needless to say, were quickly not renewed.

    Now there are numerous commercial solutions, and we’re often wooed by companies like CLC bio.  Yet for every commercial aligner there’s half a dozen free/open-source alternatives, developed by academic groups that we respect and trust (Maq/BWA from Sanger, Bowtie from UMD, etc.), and many of these tools are pretty damn good.  A commercial option would have to be so incredible, so vastly superior to what’s currently available for us to consider a paid license.  With Bowtie and BWA mapping lanes of 15 million reads in just a couple of hours, the bar is already set pretty high.

    Outsourcing Sequencing?

    David offers, I think, a polite response to the question of whether we’d ever outsource our sequencing to a third party.  Personally, I can offer two reasons why this will probably never happen.  First, we’re already pretty happy with Illumina, a platform that can deliver whole human genomes at high coverage in just a few weeks.  All available evidence suggests that throughput will only continue to grow, and before long I expect we’ll be doing a genome on a single flowcell or less.  Of course, cost is a consideration (Illumina runs aren’t cheap).  It’s very possible that a company like Complete Genomics might be able to offer similar yields at a substantially reduced cost.  We do use companies like IDT and Agilent, for example, to synthesize oligo sequences that we might make in house.  They can make them cheaper, and faster, than we can.

    There is a second, and perhaps more compelling reason to keep sequencing in-house – because we’re in the business of research, and data is precious.  With our current capacity we can track the progress of sequencing runs in real-time, monitor error rates and alignment rates, and assess results the moment data is off of machines.  We maintain a forensics-lab-like “chain of custody” on the data from start to finish.  Doing so offers a certain sense of security, and confidence, when we use the results to tackle some of the most fundamental questions in biology.

    HITS-CLIP Unravels microRNA-mRNA Interactions

    September 3rd, 2009

    Micro-RNAs (miRNAs) are short (18-26 nt) sequences that act as post-transcriptional repressors of gene expression.  Over 700 miRNAs have been reported in the human genome; each is believed to bind directly to many mRNAs to regulate their translation or stability.  Thus, miRNAs represent a key regulatory mechanism affecting numerous cellular activities, and are of particular interest in cancer research.  Understanding the complex relationships between miRNAs and mRNAs remains challenging, however, and computational approaches alone have been largely unsuccessful.

    HITS-CLIP: Isolation and Sequencing of Argonaute-miRNA-mRNA Complexes

    Enter HITS-CLIP, a new approach that applies high throughput sequencing of RNAs isolated by crosslinking immunoprecipitation.  Essentially, it’s a method by which radiation is used to cross-link protein-RNA complexes and stringently purify them.  Then, massively parallel sequencing yields all of the RNA “tags” bound by the protein of interest.

    Ago-miRNA-mRNA Complex (Image Credit: Nature 460: 479-486, 2009)

    In a recent Nature paper, Chi et al. used HITS-CLIP to isolate RNA bound by the Argonaute protein (Ago), which mediates miRNA-mRNA interaction (see figure).  The purified complexes showed two different modal sizes (110 kDa and 130 kDa), suggesting that Ago (97 kDa) was crosslinked to two different RNA species – hopefully, miRNAs (small) and the mRNAs that they were targeting (large).

    The authors applied Illumina high-throughput sequencing to characterize Ago-bound miRNAs and the mRNA “tags” to which they were linked.  With relatively straightforward bio-informatics approaches, it was possible to cross-reference expressed miRNAs with complementary sequences of mRNA tags.  The resulting “ternary map” of miRNA-mRNA interaction sites yields a wealth of information about this post-transcriptional regulatory mechanism.
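
    One simple way to cross-reference miRNAs with mRNA tags is to look for sites complementary to each miRNA's seed.  The sketch below assumes the conventional seed at miRNA positions 2-8, which the post does not specify, and uses made-up sequences and names throughout; it is not the authors' method.

```python
from typing import Dict, List, Tuple

# Translation table for RNA base complementation.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence."""
    return rna.translate(COMPLEMENT)[::-1]


def seed_site(mirna: str) -> str:
    """Sequence an mRNA must contain to pair with miRNA positions 2-8."""
    return reverse_complement(mirna[1:8])


def find_seed_matches(mirnas: Dict[str, str],
                      tags: Dict[str, str]) -> List[Tuple[str, str, int]]:
    """List (miRNA, tag, offset) for the first seed site in each tag."""
    hits = []
    for mir_name, mir_seq in mirnas.items():
        site = seed_site(mir_seq)
        for tag_name, tag_seq in tags.items():
            offset = tag_seq.find(site)
            if offset != -1:
                hits.append((mir_name, tag_name, offset))
    return hits


# Toy sequences, invented for this example.
mirnas = {"toy-miR": "UACGGAUCCAGUUCGAUAGGC"}
tags = {
    "tag_A": "AAACCCGAUCCGUUUU",  # contains the seed site GAUCCGU
    "tag_B": "GGGGCCCCAAAAUUUU",  # no seed site
}
hits = find_seed_matches(mirnas, tags)
print(hits)  # [('toy-miR', 'tag_A', 6)]
```

    Restricting the search to sequenced Ago-bound tags, rather than whole transcripts, is what keeps a simple match like this from drowning in chance hits.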

    Decoding miRNA-mRNA Interaction

    The authors identified 454 unique miRNAs crosslinked to Ago in the mouse brain; miR-30e was the most abundant species, representing 14% of all miRNA tags.  In silico clustering and normalization of the messenger RNA tags yielded 1,463 robust clusters from 829 different brain transcripts.

    Locations of Ago-bound mRNA tags (Image Credit: Nature 460: 479-486, 2009)

    When these tags were overlaid with gene annotations, several patterns emerged.  As expected, a substantial portion (40%) of Ago-bound tags were in 3′ UTRs where miRNA activity is known to have high efficacy.  Some 8% (one-fifth of the 40%) were actually outside of the UTR but <10kb downstream, regions likely to harbor unannotated 3′ UTRs.

    Unsurprisingly, very few Ago-bound tags were in 5′ UTRs.  However, a substantial fraction of tags fell in coding sequences (25%), introns (12%), and non-coding RNAs (4%), suggesting that miRNA activity occurs in these regions as well.  Another 6% of tags were in intergenic regions, possibly in as-yet-unannotated transcripts.  These unexpected locations of miRNA binding may offer additional insights into the mechanisms of miRNA regulation.

    Next, the authors sought to define the Ago-mRNA “footprint” in which the majority of tags were contained.  The distribution of tags in a defined cluster, at least in their figure, looks like a bell curve, with a sharp peak in the middle.  About 95% of the time, Ago bound within 45-62 nucleotides of this peak, so the authors defined this region as the average Ago-miRNA footprint.  Linear regression analysis of all 6-8 base motifs in clusters yielded numerous “enriched” seed sequences; the most prevalent corresponded to the binding site of miR-124, a well known brain-specific miRNA.  Indeed, Ago-mRNA footprints were rich in miRNA binding sites, suggesting that this approach may predict active sites with far better specificity than other methods.
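
    The footprint idea can be illustrated by asking how far from a cluster's peak one has to go to capture 95% of its tags.  The sketch below is a rough illustration on toy positions, not the authors' procedure: it takes the peak to be the most common tag position and the function name is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def footprint_half_width(tag_positions: List[int],
                         percent: int = 95) -> Tuple[int, int]:
    """Return (peak, distance) where `percent` of tags lie within distance."""
    counts = Counter(tag_positions)
    # Peak = most frequent position; ties go to the smallest coordinate.
    peak = max(counts, key=lambda pos: (counts[pos], -pos))
    distances = sorted(abs(pos - peak) for pos in tag_positions)
    # Smallest number of tags that makes up at least `percent` of the total.
    needed = (percent * len(distances) + 99) // 100
    return peak, distances[needed - 1]


# Toy cluster of 20 tag midpoints with one far outlier.
positions = ([100] * 8 + [98] * 3 + [102] * 3 +
             [95] * 2 + [105] * 2 + [90] + [130])
peak, half_width = footprint_half_width(positions)
print(peak, half_width)  # 100 10
```

    Here 19 of the 20 tags fall within 10 nucleotides of the peak, so the single outlier at 130 does not stretch the footprint.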

    HITS-CLIP Implications

    By reducing the search space for miRNA binding sites to a 45-62-nucleotide Ago footprint, HITS-CLIP offers a powerful complementary approach to bioinformatic methods for miRNA binding site prediction.  Computational approaches alone are known to have high false positive rates, whereas the authors estimate FP rates of just 15% for HITS-CLIP.  The new method offers dramatic improvement for transcripts with highly conserved 3′ UTRs, which often have many “predicted” miRNA binding sites because so many computational methods rely on conservation.  Analysis of the HITS-CLIP ternary map revealed that real miRNA-mRNA binding events are very specific, with an average of just 2.6 Ago-mRNA clusters per regulated transcript.  Despite the thousands of predicted binding sites, each miRNA bound an average of 655 targets.  These results suggest that miRNA selectivity is much higher than previously believed.  Yet Ago-mRNA clusters seemed to show no apparent sequence preference (data not shown), so it’s likely that other RNA-binding proteins are involved.

    Thus, this study sets the stage for large scale genome-wide RNA-protein maps that include other proteins, tissues, and species, which should yield an unprecedented new level of understanding of this complex regulatory process.

    References
    Chi SW, Zang JB, Mele A, & Darnell RB (2009). Argonaute HITS-CLIP decodes microRNA-mRNA interaction maps. Nature, 460 (7254), 479-86 PMID: 19536157
