Archives for April 2010

DNA Day 2010 Highlights: Han Solo’s DNA

April 29, 2010 by Dan Koboldt


Last Friday was National DNA Day (April 23) in the United States, which commemorates the discovery of the double helix by James Watson and Francis Crick (1953), as well as the completion of the Human Genome Project (2003). Destined, unfortunately, to dwell in the shadow of Earth Day (April 22), DNA Day nonetheless offers an opportunity to share the wonders of genetics and genomics with the vast majority of the populace (>99%) who know very little about them.

NHGRI’s Big Day

The National Human Genome Research Institute (NHGRI) really goes all out. They host a DNA Day web site, a Facebook page, and even an online chatroom where students around the world can ask questions and have them answered by a panel of experts. There are always some good questions; here were a few of my favorites from this year:

Q: Is the DNA testing done on NCIS (TV criminal investigation show), the actual truth?
A: Phyllis Frosst, Ph.D.: The kind of testing on NCIS (and other TV shows) is quite close to reality in terms of capability, however the speed at which these tests take place is (by necessity) greatly accelerated compared to reality. It slows down the pace of a TV show if the characters have to wait 3 weeks for a DNA test!

Q: Do we retain any of the genes of our ancestors, such as dinosaurs?
A: Ian Wallace, M.S.: There have been numerous studies to determine what parts of our genetic code are conserved across species. Primates are our closest relatives, while lizards and birds would be very distant genetically. Modern lizards and birds surely have some stretches of DNA that are similar to those of their long-ago ancestors, such as dinosaurs.

Q: Since the DNA structure of humans has been mapped, are we any closer to cloning humans?
A: Dale Lea, R.N., M.P.H., C.G.C., F.A.A.N.: We are not close to cloning humans. The NHGRI has an Ethical, Legal, and Social Issues Branch that looks closely at these kinds of issues.

Q: Why do we still use SNPs when the whole human genome has been sequenced?
A: Sandy Woo, M.S.: While the majority of the human genome was sequenced (work remains on the centromeres and telomeres), we are still studying how the genome functions to affect human health and disease. That is where SNPs (which are found in the DNA between genes) are helpful as they serve as biological markers for scientists to locate genes associated with disease. In other words, scientists are still trying to “translate” the genome now that it has been transcribed.

Q: What happened to Han Solo’s DNA when he was frozen in carbonite? [Asked by “Jarjar Binks”]
A: Sarah Harding, M.P.H: Well, since Han did not appear to age during the time he was frozen, it is likely his cells stopped dividing. When Leia released him, it restarted his aging process, so basically all Jabba did was to keep Han looking young and fresh.

Well answered, Dr. Harding, well answered.

Taking the Wonders of DNA to Students

This year I again recruited Tech D group leader (and one of Genome Technology “Tomorrow’s PI’s”) Vince Magrini to join me on a visit to Our Lady of Lourdes, a Catholic elementary/middle school. It was the 7th grade science class, and although they hadn’t yet covered DNA in the curriculum, their teacher Mr. Falkler had given them a crash course the day before. We’d hardly arrived in the classroom when Vince and I were sweating – not because we were nervous, but because the school apparently has no A/C, and it was one of those humid, 80-plus-degree days that we get in Missouri in the spring.

I warm up the class with some double helix history.

The audience is clearly enraptured by my presentation.

Vince gives a demo of his “DNA from Strawberries” experiment.

Sure, it’s easy if you’re in TechD at the Genome Center.

EVERYBODY GETS A STRAWBERRY!

Lab technicians in the making.

Cheers!

Ah, success.

Transcriptome Genetics with HapMap and RNA-Seq

April 22, 2010 by Dan Koboldt

Two papers in Nature this month leverage the power of second-generation sequencing technologies to investigate gene expression variation in human cell lines. By performing RNA-Seq in HapMap cell lines, the authors generated the most extensive gene expression data to date for these samples, and were able to use publicly available HapMap genotypes to associate expression differences with genetic variation. This strategy was applied to the HapMap samples two years ago using expression microarrays. Using RNA-Seq instead of microarrays, however, offers a few key advantages:

More accurate quantification of highly abundant transcripts, where microarrays reach saturation
Access to rare transcripts below the sensitivity threshold for microarrays
Detection of novel gene structure from alternative splicing and unannotated exons
Identification of allele-specific expression

The first study, from Jonathan Pritchard’s lab at the University of Chicago, sequenced RNA from 69 Yoruban (African) individuals on the Illumina GAII platform. They generated at least two lanes per individual, for a total of 1.2 billion reads, of which 964 million (80%) mapped uniquely to the genome or to exon-exon boundaries. The second study, from Emmanouil Dermitzakis’s group at the Sanger center, sequenced RNA from 60 CEU (CEPH Europeans from Utah) individuals, also on the Illumina GAII platform. They generated one lane of paired-end data per individual, for a total of about 1.0 billion reads. Since neither study provided a table summarizing their data (which I’d have liked), I put one together:

Study	Pickrell et al.	Montgomery et al.
Samples	69, African descent (YRI)	60, European descent (CEU)
Sequencing	Illumina 1X35 or 1X46	Illumina 2X37
Reads/Sample	17.4 million	16.9 million
SNP Dataset	HapMap II/III	HapMap III
Total SNPs	3.8 million	1.2 million

Up to this point, the two studies sounded nearly identical. For the data analysis, however, each group went in a different (and interesting) direction.

Pooled Data for Discovery of Novel Gene Structures

Novel Exons. The Pritchard group pooled all data to examine the completeness of current gene annotations. Some 86% of uniquely mapped reads corresponded to known exons. Using conservation data from alignments of 28 vertebrate genomes, the authors identified 4,031 regions that are evolutionarily conserved and show evidence of transcription. About one-quarter of these appeared to be part of spliced transcripts, but most appeared to be novel untranslated regions (UTRs). Some 115 regions, however, had sequences consistent with protein-coding exons. To investigate whether their novel exons are real, the authors used RNA-Seq data from several human tissues and chimpanzee cell lines. The evidence suggests that their regions do represent novel exons, but ones that are expressed in a more tissue-specific fashion than annotated exons.

Novel Poly-A Sites. The authors next screened the ~70 million unmapped sequence reads for long runs of A or T nucleotides, which might indicate novel poly-adenylation sites. Of the ~8,000 novel sites that they identified, some 45% fell within 10 bp of a known cleavage site. To further validate their findings, they screened their poly-A regions for the binding site of the CPSF polyadenylation factor, and found a 32-fold enrichment for the CPSF target hexamer. The net result was a high confidence set of 3,481 cleavage sites that show evidence of poly-A (from RNA-Seq data) and CPSF binding.
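To make the poly-A screen concrete, here is a minimal sketch of how one might flag unmapped reads that end in a long run of A's (or begin with a long run of T's, the reverse-strand view). The function name, the 10-base threshold, and the toy reads are all my own assumptions for illustration; the paper's actual criteria may well differ.

```python
import re

# Hypothetical threshold: how many consecutive A's (or T's) count as a "long run".
MIN_RUN = 10

def polya_candidate(read, min_run=MIN_RUN):
    """Return the read with its trailing poly-A (or leading poly-T) removed,
    or None if the read has no such run at the relevant end."""
    tail = re.search(r"A{%d,}$" % min_run, read)
    if tail:
        return read[:tail.start()]
    head = re.match(r"T{%d,}" % min_run, read)
    if head:
        return read[head.end():]
    return None

# Toy reads, built by concatenation so the run lengths are unambiguous.
reads = [
    "GATTACAGGCTTAACG" + "A" * 12,   # ends in a 12-base poly-A run
    "T" * 12 + "CGTTAAGCCTGTAATC",   # starts with a 12-base poly-T run
    "GATTACAGGCTTAACGGATCCGATTACA",  # no long run at either end
]
trimmed = [polya_candidate(r) for r in reads]
print(trimmed)
```

In a real pipeline the trimmed remainder would presumably be re-aligned to the genome to pin down the cleavage site, but that step is outside this sketch.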

RNA-Seq: 10 million Reads Is All You Need

The Dermitzakis study generated 16.9 million (+/- 5.9 million) reads per individual, which were mapped to the NCBI 36 reference sequence using Maq (with a maximum insert size of 2 megabases). The resulting alignments were filtered to remove those with low mapping quality and those mapping to the X, Y, or MT chromosomes. Discordant read pairs (by distance or orientation) were also removed. To quantify the expression of known exons/transcripts/genes, the authors scaled read counts for each individual to a theoretical yield of 10 million reads, and only considered exons with data in >90% of individuals. This resulted in data for 90,064 exons from 10,777 genes, of which 95% had at least 10 reads (on average) per individual. While the normalization seems to reduce the dataset to less than half of known genes, it nevertheless provided an extensive view of gene expression across these 60 individuals.
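The scaling step is simple enough to sketch. The code below rescales each individual's exon counts to a 10-million-read yield and keeps exons with data in more than 90% of individuals. The function names are hypothetical, and I am assuming that "data" means a non-zero count and that the denominator is each individual's total of retained reads; the paper may define both differently.

```python
TARGET_YIELD = 10_000_000  # theoretical yield per individual
MIN_FRACTION = 0.90        # exon must have data in more than this share of individuals

def scale_counts(exon_counts, total_reads, target=TARGET_YIELD):
    """Rescale one individual's raw exon read counts to the target yield."""
    factor = target / total_reads
    return {exon: n * factor for exon, n in exon_counts.items()}

def exons_with_enough_data(counts_by_individual, min_fraction=MIN_FRACTION):
    """Return the exons with a non-zero count in more than min_fraction of individuals."""
    n_individuals = len(counts_by_individual)
    exons = set().union(*(c.keys() for c in counts_by_individual.values()))
    keep = set()
    for exon in exons:
        covered = sum(1 for c in counts_by_individual.values() if c.get(exon, 0) > 0)
        if covered / n_individuals > min_fraction:
            keep.add(exon)
    return keep

# An individual sequenced to 20 million reads has its counts halved.
scaled = scale_counts({"exonA": 50, "exonB": 8}, total_reads=20_000_000)
print(scaled)
```

Note that the filter is strict: an exon seen in exactly 90% of individuals does not pass.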

Cis-Regulatory Effects on Gene Expression

Using HapMap genotypes for 1.2 million SNPs, the Dermitzakis group identified 836 genes associated with cis-regulatory variants (compared to 539 genes identified in microarray studies of the same individuals). Even when normalized for the number of genes tested, the increased resolution of RNA-Seq over microarrays yielded a larger number of genetic regulatory effects. The RNA-Seq exon eQTLs (expression quantitative trait loci) were enriched for abundant transcripts, suggesting that saturation of highly expressed exons reduces the sensitivity of microarrays for detecting some cis-regulatory effects.

The Pritchard group searched for cis-regulatory variation with an even larger dataset – RNA-Seq for 69 individuals and 3.8 million HapMap SNPs. They identified 929 genes with local eQTLs (4.6% of annotated genes); consistent with previous findings, virtually all SNPs associated with expression level were near the corresponding gene. They also reported the overlap with the CEU study results: the top 500 associations reported in CEU samples were enriched 10- to 40-fold for significant eQTLs in YRI samples. Given the marked genetic differences between these two populations, this result suggests that these studies are identifying replicable cis-regulatory events.
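For anyone new to eQTLs, the core idea is an association test between genotype and expression level. Here is a bare-bones sketch that regresses expression on allele dosage (0, 1, or 2 copies of one allele) for a single SNP-gene pair. This is a generic illustration with made-up numbers, not the statistical model that either group actually used; real analyses also handle covariates, normalization, and multiple testing.

```python
def eqtl_slope(genotypes, expression):
    """Least-squares slope and r^2 of expression on genotype dosage."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((e - mean_e) ** 2 for e in expression)
    slope = sxy / sxx               # expression change per extra allele copy
    r2 = (sxy * sxy) / (sxx * syy)  # share of expression variance explained
    return slope, r2

# Hypothetical data for six individuals: dosage and normalized expression.
dosage = [0, 0, 1, 1, 2, 2]
expr = [10, 12, 15, 17, 20, 22]
slope, r2 = eqtl_slope(dosage, expr)
print(slope, round(r2, 3))
```

A steep slope with a high r^2, as in this toy case, is the kind of signal an eQTL scan looks for across millions of SNP-gene pairs.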

Mechanism of Cis-Regulatory Effects

An important feature of RNA-Seq data is that they can be used not only to detect cis-regulatory variation, but also to assess the mechanism by which these variants act. The Pritchard group looked at 222 of their 929 eQTLs for which the associated SNPs fell within the gene's exons. They classified the RNA-Seq reads as originating from the high-expression haplotype or the low-expression haplotype, and found that for 195 of the genes (88%), more than 50% of the expressed transcripts carried the allele associated with high expression. This suggests that the modulation of gene expression is a direct, allele-specific result of the associated variation (probably through nearby cis-regulatory elements). In other words: the eQTL tells us that variants near the gene are associated with its expression, which means something nearby is regulating it. The fact that the haplotype associated with increased expression is also the haplotype that predominates among the transcripts tells us that the high-expression allele drives expression of the gene copy on its own chromosome, as opposed to, say, driving expression of the gene from both chromosomes.
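The allele-specific check boils down to counting reads per haplotype. Below is a rough sketch with invented gene names and read counts: for each gene, compute the fraction of reads carrying the high-expression allele, then ask how many genes exceed 50%. The last line simply re-derives the 88% figure from the 195 and 222 quoted above.

```python
def high_allele_fraction(high_reads, low_reads):
    """Fraction of allele-informative reads carrying the high-expression allele."""
    total = high_reads + low_reads
    return high_reads / total if total else None

# Hypothetical per-gene read counts: (high-expression allele, low-expression allele).
genes = {"GENE_A": (70, 30), "GENE_B": (45, 55), "GENE_C": (120, 80)}

fractions = {g: high_allele_fraction(h, l) for g, (h, l) in genes.items()}
consistent = sorted(g for g, f in fractions.items() if f is not None and f > 0.5)
print(consistent)

# Sanity check on the reported numbers: 195 of 222 genes.
reported_pct = round(100 * 195 / 222)
print(reported_pct)
```

Only reads overlapping a heterozygous exonic SNP can be assigned to a haplotype, which is presumably why the analysis was limited to the 222 eQTLs with SNPs inside exons.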

Allelic Effects on Splicing

Finally, both groups looked at the actual content of expressed transcripts, to find SNPs associated with alternative splicing. The Pritchard group calls these splicing quantitative trait loci (sQTLs), and found 187 genes with significant associations. Binding sites for known splice factors (U1 snRNP and U2AF) were enriched for sQTLs, as were SNPs within 2 bp of a canonical splice site. The Dermitzakis group found 110 genes with significant associations, and stratified splicing-associated variants according to their position in the gene structure. When tested against the exons upstream and downstream of where they resided, splice donor variants were enriched 3.17-fold with the upstream (5′) exon, while splice acceptor variants were enriched 7.02-fold with the downstream (3′) exon. Thus, these SNPs affect the inclusion/exclusion of their exons in the mature transcript.

Dermitzakis’s group visually examined their most significant associations to characterize the mechanism of splicing regulation. Of the 110 significant sQTLs identified in CEU samples:

41% were single exon skipping events
17% created an alternate acceptor
13% were double or triple exon skipping events
6% created an alternate donor
5% were mutually exclusive exons
5% were retained introns.

In summary, these studies establish the feasibility of transcriptome sequencing to assess gene expression and characterize regulatory variation. Indeed, as the title of one study suggests, RNA sequencing is a powerful tool for studying the mechanisms underlying human gene expression variation, and will undoubtedly yield better understanding of the complex relationships between genotype and phenotype.

References
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, & Pritchard JK (2010). Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature, 464 (7289), 768-72 PMID: 20220758

Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, & Dermitzakis ET (2010). Transcriptome genetics using second generation sequencing in a Caucasian population. Nature, 464 (7289), 773-7 PMID: 20220756

The Four Dimensions of a Breast Cancer Genome

April 15, 2010 by Dan Koboldt

Published today in the journal Nature is the whole-genome sequencing of a basal-like breast cancer tumor, metastasis, and xenograft. There’s also a News and Views article by Joe Gray of Lawrence Berkeley National Laboratory, as well as a news feature on large-scale cancer projects.


This study is a bit unlike our previous cancer genomes (AML1 and AML2). By my count it is the sixth cancer genome to be sequenced, and the third to come out of the Genome Center at Washington University. Obviously, it’s our first solid tumor. What’s particularly interesting about this study, however, is that we sequenced four DNA samples from a single patient with “double-negative” breast cancer: the primary tumor, peripheral blood (normal), a brain metastasis, and a mouse xenograft derived from the primary tumor. The xenograft is a success story in itself – we managed to create a human-in-mouse (HIM) transplant of the primary tumor that was >90% pure when harvested 101 days after engraftment.

The genomes of these four samples (tumor, normal, metastasis, and xenograft), examined with the incredible power of Illumina massively parallel sequencing, offer an unprecedented view of the somatic changes that underlie breast cancer development, growth, and metastasis.

Repertoire of Somatic Mutations

We validated a total of 50 somatic sites in at least one of the three cancer genomes, including:

28 missense mutations predicted to alter the sequence of an encoded protein
11 synonymous (silent) mutations in coding sequences
4 small insertions ranging in size from 1 to 6 bp
3 small deletions ranging in size from 1 to 13 bp
2 splice site mutations at intron-exon junctions
1 nonsense mutation predicted to result in a truncated protein
1 RNA mutation in a gene encoding a signal recognition particle (SRP) RNA.

We employed deep Illumina sequencing of PCR amplicons to assess the frequencies of each mutation across all four tissues. Intriguingly, more than half of them exhibited differential frequencies between primary tumor, metastasis, and/or xenograft. Two mutations (a nonsense mutation in MYCBP2 and a missense mutation in TGFBI) were significantly enriched in the primary tumor (88-89% vs 14-44%). Some 26 mutations were significantly enriched in the metastasis and/or xenograft. Perhaps most interesting, however, were two sites (a missense mutation in SNED1 and a silent mutation in FLNC) that appear to be de novo mutations unique to the metastasis.
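As a rough illustration of what "differential frequencies" means here, the sketch below computes a variant allele frequency (variant reads divided by total reads at the site) for two samples and compares them with a two-sided Fisher's exact test. The read counts are invented, and Fisher's test is my stand-in for illustration; it is not necessarily the statistic used in the paper.

```python
from math import comb

def vaf(variant_reads, total_reads):
    """Variant allele frequency at one site in one sample."""
    return variant_reads / total_reads

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]],
    where rows are samples and columns are variant / reference read counts."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x variant reads in the first sample.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= observed * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical amplicon read counts at one site: (variant, reference).
primary = (12, 88)
metastasis = (60, 40)

print(vaf(primary[0], sum(primary)), vaf(metastasis[0], sum(metastasis)))
p_value = fisher_two_sided(*primary, *metastasis)
print(p_value)
```

With real amplicon data the depth runs to thousands of reads per site, so even modest frequency shifts can reach significance; the effect size (the difference in frequency) matters as much as the p-value.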

Acquired Structural Variation

Using our internally developed tools for structural variant prediction (BreakDancer) and de novo assembly (TIGRA), we predicted 59 deletions and 18 inversions that were putative somatic events. Validation by PCR and 454/3730 sequencing showed that 73/77 (94.8%) were real structural variants, of which 34 (28 deletions and 6 inversions) were somatic alterations not present in the normal genome. Among them were a 46.5-kb heterozygous deletion affecting FBXW7 (a known cancer gene) and two overlapping 500-kb deletions affecting CTNNA1 and a handful of other genes. The latter were particularly interesting, because loss of CTNNA1 has been shown to result in global loss of cell adhesion in human breast cancer cell lines.

We also validated seven translocations with a combination of manual review (Pairoscope), assembly, and PCR/3730 sequencing. One translocation that we assembled in all three tumor samples involves a long terminal repeat (LTR) from the ERVL-MaLR family on chromosome 4 and the ABCA2 gene on chromosome 9. Two other validated translocations that assembled in all three tumors are on chromosome 2, and separated only by a 393-bp TcMar-Tigger repeat.

Insights from Comparisons of Tumor, Metastasis, and Xenograft

One of the most intriguing findings from our study was the differential mutation frequencies and structural variation patterns that we observed in the metastasis and xenograft, compared to the primary tumor. More than half of the somatic mutations (26/50) were significantly enriched in the metastasis and xenograft, while observed at relatively low frequencies in the primary tumor. This suggests that a sub-population of tumor cells, not the primary clone, gave rise to the cerebellar metastasis that eventually killed the patient.

Is there a fitness cost to the mutations that enabled metastasis? Can we develop sensitive tests to detect the cells that are likely to spread? Genome sequencing has brought us to a point where we can begin to ask these questions, and answering them brings us one step closer to unraveling the complex, devastating, deadly disease that is cancer.

References
Li Ding, Matthew J. Ellis, Shunqiang Li, David E. Larson, Ken Chen, John W. Wallis, Christopher C. Harris, Michael D. McLellan, Robert S. Fulton, Lucinda L. Fulton, Rachel M. Abbott, Jeremy Hoog, David J. Dooling, Daniel C. Koboldt, Heather Schmidt, Joell (2010). Genome remodelling in a basal-like breast cancer metastasis and xenograft. Nature, 464 (7291), 999-1005 DOI: 10.1038/nature08989

Genome Technology Cancer Issue

April 7, 2010 by Dan Koboldt

This month’s Genome Technology is the 100th issue of the magazine, and the 6th annual cancer issue. In a brief editorial, magazine editor Ciara Curtin calls cancer The Good Fight and notes that the cover story focuses on “researchers who are using all the tools at their disposal – bioinformatics, metabolomics, sequencing, and more – to better understand the basics of cancer, why people are susceptible to it, and to better treat the disease.”

Systems Biology (?) Fights Cancer

I’m not certain if the title for the cover story (above) is appropriate. The research profiled in the article spans bioinformatics, gene expression, genotyping, proteomics, and other disciplines. Or do I just not understand what “systems biology” means? Either way, the story offered an esoteric view of some of the methods being applied to study cancer. Neil Hayes (of UNC Chapel Hill) and other members of the Cancer Genome Atlas research consortium leveraged the rich TCGA dataset to identify subtypes of glioblastoma based on gene expression and genomic alterations. Laura MacConnaill of the Dana-Farber Cancer Institute described OncoMap, a customized genotyping array for “actionable” somatic mutations – ones that confer sensitivity or resistance to cancer drugs. UCLA’s Stan Nelson recounted (again) the sequencing of a well-studied glioblastoma cell line on Life Technologies’ SOLiD platform. A collaborative effort between Harvard, Dana-Farber, and the Broad Institute used mass spectrometry genotyping (Sequenom?) to genotype 240,000 sites in 1,000 human tumors.

It was an interesting article overall, and left me with the feeling that (as NCI’s Stephen Chanock was quoted) we are living in a golden age of cancer research.

News Briefs in the Sequencing World

There were some interesting non-cancer tidbits in the issue as well. From the news briefs, I learned that Stephen Lombardi left his position as president of Helicos BioSciences, another discouraging sign for the struggling single-molecule-sequencing company. Pacific Biosciences, meanwhile, announced the ten early-access customers for its Single Molecule Real Time (SMRT) sequencers: WashU, Broad, Baylor, CSHL, JGI, UW, Monsanto, NCI, OICR, and Stanford. RainDance Technologies and Life Technologies announced an agreement to co-market the RDT 1000 enrichment kit with the ABI SOLiD sequencer.

Computing Demands of NGS Data

An article on high-performance computing discussed the challenges of managing next-generation sequencing data. Among the interviewees were David Dooling of WashU and PolITigenomics fame, David Jaffe of the Broad, and Michael Brudno of the University of Toronto. It seems that most genome centers, like ours, are building pipelines from a combination of freely-available and internally developed tools for NGS analysis. The ever-increasing flood of data demands efficiency and automation wherever possible. “We take a holistic approach,” says Dooling. “Find out the areas of ambiguity that are problematic and understand them as best we can.”

In the face of the NGS data deluge, there’s not much time to look back.