Archives for April 2008

Happy DNA Day!

April 25, 2008 by dkoboldt

April 25th is National DNA Day in the U.S., an occasion that commemorates the discovery of DNA’s structure by Watson and Crick (published in Nature on April 25th, 1953) and the completion of the Human Genome Project fifty years later in April 2003. NHGRI has set up a web site for DNA Day featuring a welcome video from Francis Collins and a live chatroom where anyone can chat with “leading researchers in the field of genetics.” Strangely, I haven’t yet gotten the call.

The WashU Genome Center outreach office is also sponsoring several activities. A number of DNA Day Ambassadors are visiting local high schools in St. Louis. A symposium on the medical campus will have student poster sessions and a seminar, “The Human Genome Sequence: A Foundation for Biological Inquiry”, given by our co-director Elaine Mardis.

By chance I happened to Google the King and Wilson 1975 paper the other day, and came across a very interesting site of Landmarks in the History of Genetics. While not up-to-date, it’s a nice story of key events in DNA’s history (and their implications) since 1745. Of course there’s Darwin’s publication of The Origin of Species in 1859, and Mendel’s Experiments in Plant Hybridisation just six years later. I recognize the name of Erwin Chargaff, whose insight (1950) was that the relative incidence of A, C, G, and T nucleotides was not random, but perhaps a kind of code. Two years later came the Hershey and Chase experiments, which showed that viruses infect host cells by injecting their DNA, while the proteins generally remain outside the cell. Of course we know Watson and Crick (1953), as well as King and Wilson (1975). How about Barbara McClintock, whose discovery of transposable elements in maize in the 1940s was not fully recognized for decades?

That rather sounds like Mendel, doesn’t it? I wonder how many other important discoveries in biology have already been made but not yet appreciated. It’s something to think about. Happy DNA Day everyone!

Cis-regs and Functional Noncoding Variation

April 24, 2008 by dkoboldt

On Tuesday I attended a very interesting thesis defense by Scott Doniger, a student in Justin Fay’s lab. I admit, I was lured in by the thesis title, “Comparing and Contrasting Cis-regulatory Sequences to Identify Functional Noncoding Sequence Variation.” While I do not know Scott personally, I’m certainly familiar with Justin Fay’s work on positive and negative selection in the human genome. His paper, in fact, is the foundation of my work on signatures of natural selection and the SNPseek project.

Scott proved a confident and articulate speaker, and laid the groundwork for his thesis by presenting three convincing motivations for this work:

The regulatory hypothesis of evolution. Despite the obvious phenotypic diversity of species on this planet, the DNA sequence diversity is surprisingly limited. More than twenty-five years before the completion of the human genome sequence, King and Wilson [1975] found that the chimpanzee and human genomes diverged by only 1.6%. From this seminal paper came the idea that regulation of gene expression, not differences in DNA sequence, drove phenotypic divergence.
The functional relevance of noncoding sequences. Despite the traditional view that functional variants in humans alter protein-coding sequence, it is becoming clear that the genetics underlying many traits extend into noncoding DNA, particularly for complex phenotypes like disease susceptibility and drug response.
The availability of numerous genome sequences. Draft genome sequences for at least 27 vertebrate species have been completed to date, and their availability has spurred wide interest in the field of comparative genomics.

Scott’s work is based on the reasonable premise that functional noncoding sequences are subject to purifying selection (fewer changes tolerated over time), and thus they should be conserved between genomes that share common ancestry. Comparative genomics can therefore guide us to functional variants, as SNPs in constrained positions are more likely to be deleterious. This works well for coding sequences in both humans and yeast (the Fay lab model organism). Scott looked at the 9 known quantitative trait nucleotides (QTNs) in yeast and sure enough, 8 of them were SNPs in highly conserved amino acid positions. Gravy.

Because deep sequence conservation approaches might not work for noncoding SNPs, they focused on a few closely related species of yeast, identifying 2,106 variant positions (13% of the total) that fell within conserved transcription factor binding sites (TFBS’s). Of those, 615 (29%) appear to be deleterious based on their conserved-nucleotide model. If I can extrapolate, by their approach about 3.8% of the SNPs between closely related yeast species are likely to be functional.
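
The extrapolation is just two proportions multiplied together. A minimal Python sketch of the arithmetic, using only the figures quoted above; the total SNP count is back-calculated from the rounded 13% figure and is an assumption, not a number from the thesis:

```python
# Figures quoted in the post
snps_in_conserved_tfbs = 2106   # variant positions inside conserved TFBSs
fraction_in_tfbs = 0.13         # those positions as a share of all SNPs
predicted_deleterious = 615     # TFBS SNPs called deleterious by the model

# ASSUMPTION: total SNP count back-calculated from the rounded 13% figure
estimated_total_snps = snps_in_conserved_tfbs / fraction_in_tfbs

deleterious_within_tfbs = predicted_deleterious / snps_in_conserved_tfbs
deleterious_overall = predicted_deleterious / estimated_total_snps

print(f"Estimated total SNPs: {estimated_total_snps:,.0f}")
print(f"Deleterious among TFBS SNPs: {deleterious_within_tfbs:.1%}")
print(f"Deleterious among all SNPs: {deleterious_overall:.1%}")
```

Because the inputs are rounded percentages, the result should be read as "roughly 4%" rather than a precise estimate.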

The Model-Free Approach: PhyloNet-SNP

All of Scott’s work to this point relies on having good annotations of cis-regulatory TFBS’s in your genome of interest. Because you can’t always count on that, they developed a “model-free” approach to evaluating SNPs. With some help from Gary Stormo’s group, they devised an algorithm (PhyloNet-SNP) that uses each SNP plus 20 bp of flanking sequence on each side as a query sequence to identify those within multi-copy conserved elements of a genome. By this approach, ~15% of the SNPs in their model system were called as functional.
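
To make the query construction concrete, here is a minimal Python sketch of pulling out a SNP together with 20 bp of flanking sequence on each side (a 41-bp window). It illustrates only that windowing step, not PhyloNet-SNP itself; the function name `snp_query_window`, the 0-based coordinates, and the toy sequence are all assumptions made for illustration.

```python
def snp_query_window(sequence: str, snp_index: int, flank: int = 20) -> str:
    """Return the SNP base plus up to `flank` bases on each side.

    `snp_index` is 0-based. Windows are truncated at the sequence ends
    rather than padded.
    """
    if not 0 <= snp_index < len(sequence):
        raise IndexError("snp_index is outside the sequence")
    start = max(0, snp_index - flank)
    end = min(len(sequence), snp_index + flank + 1)
    return sequence[start:end]


# Toy example: a made-up 100-bp sequence with a hypothetical SNP at index 50
toy_sequence = "ACGT" * 25
window = snp_query_window(toy_sequence, 50)
print(len(window), window)
```

How the real tool handles SNPs near the ends of a sequence, or on the reverse strand, is not covered in the talk summary above, so the truncation here is simply one reasonable choice.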

The Experimental Backup: Allele-specific Expression

The brief wet-lab portion of the thesis work was an allele-specific expression experiment where the ability of SNPs to alter gene expression levels was evaluated in vivo. Among randomly chosen SNPs, about 8% had a regulatory effect. However, using sequence conservation and/or PhyloNet-SNP to select SNPs brought this up to 25%, suggesting that the conservation approach yields a three-fold enrichment of SNPs that affect gene expression.

At the conclusion, Scott admitted that while comparative genomics does help identify functional sequences and variation, it doesn’t explain everything. Indeed, recent findings from the ENCODE project cast doubt on whether many conserved noncoding sequences are important at all. Yet until we have a better understanding of the dark matter of the human genome, using sequence conservation to identify SNPs of interest seems like a good way to go.

The Genome that Won A Nobel Prize

April 18, 2008 by dkoboldt

My group met recently to discuss the in-press-at-Nature publication of Jim Watson’s genome – the first diploid human genome to be sequenced with next-generation technology. I’ve been waiting for this since 454 announced the project’s completion at the HGM2007 meeting last year in Montreal. It’s a landmark publication in terms of human genetic variation, and of particular interest to me since I work on our center’s 454 analysis pipeline.

In two months Roche/454 generated ~106.5 million genomic reads from Watson’s DNA in 234 runs. Using BLAT they mapped 93.2 million reads (87.5%) to NCBI build 36 (hg18), yielding an average coverage of about 7.4x. No doubt the expense of this effort was substantial, though the authors claim it was 1/100th of what capillary sequencing would have cost. It probably also hurt to throw away 2.5 million “unmapped” reads, though they did some post-processing of these with interesting results.
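
As a sanity check on those numbers, the mapped fraction and the average read length implied by 7.4x coverage can be worked out in a few lines of Python. The genome size of 3.1 Gb is an assumption (the figure the authors used is not stated here), so the implied read length is only a rough estimate.

```python
total_reads = 106.5e6     # reads generated (from the post)
mapped_reads = 93.2e6     # reads mapped with BLAT (from the post)
coverage = 7.4            # average coverage (from the post)
genome_size_bp = 3.1e9    # ASSUMPTION: approximate haploid human genome size

mapped_fraction = mapped_reads / total_reads

# Coverage = mapped bases / genome size, so the implied mean length
# of a mapped read is coverage * genome size / number of mapped reads.
implied_read_length = coverage * genome_size_bp / mapped_reads

print(f"Mapped fraction: {mapped_fraction:.1%}")
print(f"Implied mean read length: {implied_read_length:.0f} bp")
```

Under that assumed genome size, the implied read length comes out to a little under 250 bp.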

After a few filters were applied, the authors produced a set of 3.32 million SNPs in Watson’s genome, a number deliciously comparable to Craig Venter’s 3.47 million SNPs. In both men, more than 80% of the SNPs are already known (to dbSNP). The most recent build of dbSNP (build 128), which doesn’t yet include novel Watson/Venter SNPs, has 9.89 million SNPs. The authors didn’t say, but I estimate that the men share about 300,000 novel SNPs. Together they’ll add about 10% to the set of known SNPs, and only 1-2% of nonsynonymous SNPs. I hate to break it to you, but the sun is setting for nsSNPs. We know about 95% of them already, and in Jim Watson only 7% are likely to be deleterious.

Also, over at GeneticFuture Daniel MacArthur discusses how the Watson Genome may be gloomy news for the field of personal genomics. He points out that we’re perhaps five years away from affordable whole-genome sequencing, and by then we will no doubt have a much better understanding of how functional variation affects human phenotypes.

Indels are why I love 454 technology. In Watson’s genome they identified >200,000 indels of at least 2 bp. Insertion detection is limited by read length, and so most were <200 bp. The largest deletion, however, was nearly 40 kbp. Only a fraction of the indels (~350) affected coding sequence. They saw a validation rate of 70% for a sampling of coding indels between 2 and 50 bp, which is pretty good. Single-base indels were treated with extreme caution, as over 80% of these were associated with homopolymers, the Achilles heel of 454 sequencing.

This paper was worth the wait. Not only was it an impressive demonstration of the power of 454 sequencing for whole-genome sequencing, but it openly addressed many of the informatics challenges therein and answered some interesting questions along the way. We can now confidently say that an individual carries ~3.7 million SNPs relative to the reference sequence, of which perhaps 10,000 are protein-altering. Ten of Watson’s nsSNPs were Mendelian-recessive, highly penetrant, disease-causing alleles according to HGMD, suggesting that each of us carries many more deleterious alleles than was previously believed. Yet analysis of the unplaced 454 reads suggests that as many as 100 protein-coding genes are still absent from the reference sequence. It seems like the work on the human genome is never done. I certainly know the feeling.

Drowning in the Flood of Next-Gen Data

April 18, 2008 by dkoboldt

Working at the WashU Genome Center, I expect to encounter datasets that are large even by bioinformatician standards. But as we transition from traditional 3730-based sequencing to next-generation platforms, I’m beginning to appreciate just how much additional infrastructure we’ll need to handle the data flow. In the Medical Genomics group we’re constantly pushing up against capacity – servers, disk space, and man hours. None of these are in adequate supply for what’s ahead.

This is not to say that we’re without resources. In fact, the infrastructure already in place is considerable. We have about 500 computational servers (1600 cores) and nearly a petabyte (1,000 terabytes) of disk space. There’s an LSF system through which we submit and monitor jobs on The Blades.

You Didn’t Need That Done TODAY, Did You?

I submit about 1,000 small jobs and notice they’re all pending.

No doubt that’s because there are 61,000 jobs in front of me. We have a few different “queues” into which jobs can be submitted. The “short” queue is for jobs that execute in less than 15 minutes. At one job per core, and if every job takes the full 15 minutes, it looks like my jobs will start in about 9 hours. Oy.
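
That estimate is simple division, sketched below in Python. It assumes all 1,600 cores mentioned earlier are available to the short queue and that nothing else competes for them, which is almost certainly optimistic; the variable names are illustrative only.

```python
jobs_ahead = 61_000     # jobs already queued (from the post)
cores = 1_600           # cores in the cluster (from the post)
minutes_per_job = 15    # short-queue limit, used as the worst case

# ASSUMPTION: one job per core, every core serves this queue,
# and every job runs for the full 15 minutes.
rounds_ahead = jobs_ahead / cores
wait_hours = rounds_ahead * minutes_per_job / 60

print(f"Rounds of jobs ahead of mine: {rounds_ahead:.1f}")
print(f"Estimated wait: {wait_hours:.1f} hours")
```

With these inputs the wait works out to between 9 and 10 hours, in line with the estimate above.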

The powers that be around here are rushing to build up our resources. As I’m not part of management, I can’t say for sure how long it will take to get the disk space and hardware we need. One thing I do know: we need a lot, and we need it soon.