Archives for January 2010

Subtyping Glioblastoma via Genomic Analysis

January 28, 2010 by Dan Koboldt

A recent paper in Cancer Cell reveals the power of integrated genomic datasets for understanding cancer origins and treatment. Members of the TCGA Research Network identified and characterized four glioblastoma subtypes using gene expression, somatic mutation, and copy number data.

Genetic Characteristics of GBM Subtypes

Each subtype was classified by gene expression clustering, and showed specific patterns of genetic alterations, particularly for four genes: platelet-derived growth factor receptor alpha (PDGFRA), isocitrate dehydrogenase 1 (IDH1), epidermal growth factor receptor (EGFR), and neurofibromin 1 (NF1). Moreover, when analyzed with gene expression patterns of normal brain cells, the four GBM subtypes associate with different cell lineages.

Subtype	Expression	Signature	Alterations
Classical	NES; Notch (NOTCH3, JAG1), and Sonic hedgehog (SMO, GAS1) pathways	Astrocytic	EGFR, CDKN2A
Mesenchymal	Mesenchymal markers (CHI3L1, MET)	Astroglial	NF1 and PTEN
Proneural	Oligodendrocytic development genes (PDGFRA, NKX2-2, OLIG2)	Oligodendrocytic	TP53, PDGFRA or PIK3CA/PIK3R1, IDH1
Neural	Neural markers (NEFL, GABRA1, SYT1, SLC12A5)	Neuron, Astro & Oligo	—

Clinical Features of GBM Subtypes

Overlaying the GBM subtypes with available clinical data revealed some interesting patterns as well. The Proneural subtype skewed toward younger patients and thus included most of the secondary GBMs; as a result, hypermutators were overrepresented in Proneural. Perhaps the most important observation was that subtypes differed in their response to aggressive therapy: it significantly reduced mortality in Classical and Mesenchymal, and showed signs of efficacy in Neural, but did not alter the survival of patients with Proneural GBM.

Clinical Feature	GBM Association
Age of patient	Proneural: Younger patients overrepresented
Hypermutator phenotype	Proneural: Among secondary GBMs
Response to aggressive therapy	Classical: Significantly reduced mortality; Mesenchymal: Significantly reduced mortality; Neural: Efficacy suggested; Proneural: Did not alter survival

These findings have important implications for GBM diagnosis and treatment, and also demonstrate the power of the Cancer Genome Atlas: integrating gene expression, mutation, copy number, and clinical data for some of the world’s deadliest cancers.

References
Verhaak, R., Hoadley, K., Purdom, E., Wang, V., Qi, Y., Wilkerson, M., Miller, C., Ding, L., Golub, T., Mesirov, J., and The Cancer Genome Atlas Research Network (2010). Integrated Genomic Analysis Identifies Clinically Relevant Subtypes of Glioblastoma Characterized by Abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell, 17 (1), 98-110. DOI: 10.1016/j.ccr.2009.12.020

St. Jude’s and WashU Tackle Pediatric Cancers

January 25, 2010 by Dan Koboldt

St. Jude Children’s Research Hospital and Washington University have joined forces on the Pediatric Cancer Genome Project, an effort to study some of the most deadly and devastating cancers. Over the next three years, we will sequence the tumor genomes of 600 pediatric cancer cases to characterize the inherited and acquired genetic changes that underlie cancer in children.

New Approaches to Cancer

It is an ambitious and expensive effort. Even with the plummeting costs of DNA sequencing, the estimated project cost ($65 million) works out to around $100,000 per patient. Unlike our first two published cancer genomes (AML1 and AML2), this project must examine germline (inherited) changes as well as somatic (acquired) ones, since we expect that many childhood cancers have a large heritable component. We already have a robust pipeline for identifying somatic mutations, but we’ll need to develop new algorithms for analyzing millions of inherited genetic variants to identify which ones are important.

Technologies, Old and New

Technology will be a critical factor as well. My guess is that Illumina will be our primary discovery platform. PacBio and other single-molecule sequencers, while promising, won’t reach the capacity we need in time. I expect that whole genome sequencing will be the main strategy, but exome capture approaches (those that target the coding regions of known genes) are promising too. The large sample size is an important consideration. Deep sequencing of pooled samples and multiplexed/barcoded sequencing runs will present new challenges for VarScan and other in-house tools. For validation, we might be looking at custom arrays or oligo sets in addition to 3730 and 454 sequencing.

Win-Win Collaboration

I really don’t see a downside to the collaboration. St. Jude’s brings not only their considerable expertise and valuable samples to this fight, but also their good name. It’s a boon for Washington University in St. Louis (see the article in today’s St. Louis Business Journal). It’s obviously a boon for our genome center and an incredible opportunity to tackle some rare and uncharacterized cancers. Most importantly, it’s a boon for children with cancer and the parents who suffer through it with them.

Capture and Subassembly with Jay Shendure

January 15, 2010 by Dan Koboldt

Yesterday our 2010 Genetics Seminar Series kicked off with Jay Shendure (University of Washington), whose twelve-exome paper landed in Nature late last year. His talk covered three very different applications of next-generation sequencing: high-throughput mutational studies of core promoters, sub-assembly of Illumina reads to 454-length contigs, and exome capture to unravel Mendelian disorders.

Mutational Profiling

First, Dr. Shendure described some interesting experiments under way in his lab to elucidate the function of non-coding regulatory variants – specifically, single nucleotide changes in the core promoter that alter gene transcription. The approach is called “saturation mutagenesis” and involves generating every possible mutant in a construct, and then assaying the effect of each construct on transcription. By leveraging high-density Agilent arrays and next-generation sequencing, Shendure and his colleagues performed saturation mutagenesis in vitro in high-throughput fashion. Their process involves three steps:

1. Synthesize mutant constructs on an Agilent array. The oligos (probably ~150 bp) include the core promoter region surrounding a gene’s transcription start site (TSS). They generate a single mutation (SNP or single-base indel) per construct, and label each construct with a sequence barcode downstream of the TSS.
2. Cleave mutant templates from the array, amplify, and sequence on Illumina to measure relative construct abundance.
3. Perform in vitro transcription, then Illumina RNA-Seq, to measure the expression of each construct.

Dr. Shendure noted that there was some sequencing bias between barcodes, so they used multiple barcodes (6) per mutant construct and normalized the results. Then, by combining the construct abundance data (Seq) and the expression data (RNA-Seq) for mutants and comparing them to the results for the wild-type construct, they could assess the functional impact of each synthesized mutation on transcription.
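To make the normalization idea concrete, here is a minimal sketch of how per-barcode abundance (Seq) and expression (RNA-Seq) counts might be combined into a single effect score. The talk did not spell out the exact formula, so everything here is an assumption for illustration: the function names, the use of the median across barcodes, the log2 scale, and the read counts themselves are all hypothetical.

```python
import math
from statistics import median

def construct_activity(dna_counts, rna_counts):
    """Summarize one construct: RNA/DNA ratio per barcode, then the median.

    Taking the median across barcodes is an assumed way to dampen
    barcode-specific sequencing bias; it is not the lab's stated method.
    """
    ratios = [rna / dna for dna, rna in zip(dna_counts, rna_counts) if dna > 0]
    return median(ratios)

def mutation_effect(mutant, wild_type):
    """log2(mutant activity / wild-type activity).

    Negative values mean the mutation reduced transcription.
    """
    return math.log2(construct_activity(*mutant) / construct_activity(*wild_type))

# Hypothetical read counts for six barcodes per construct: (DNA reads, RNA reads).
wild_type = ([100, 120, 80, 110, 90, 100], [200, 240, 160, 220, 180, 200])
tata_mutant = ([100, 90, 110, 100, 120, 80], [50, 45, 55, 50, 60, 40])

effect = mutation_effect(tata_mutant, wild_type)
print(f"log2 effect relative to wild type: {effect:.2f}")
```

Dividing by the wild-type activity from the same experiment means overall sequencing depth cancels out, which is why the sketch does not normalize for library size separately.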

As far as results go, Dr. Shendure showed a histogram: on the X-axis was each base of the core promoter region that they evaluated, and on the Y-axis, the effect of mutating that position on transcription. Most of the values were negative, indicating that mutations reduced transcriptional activity, particularly around the TATA box and INR site. Essentially, the plot neatly described the footprint of RNA polymerase binding, with the most effective mutations centered on the TSS. Intriguingly, the single-base deletion mutants consistently showed the greatest reduction of transcription, suggesting, perhaps, that indels in promoter regions are likely to be functional variants.

Short Read Subassembly

The next area of interest was very pertinent to groups with access to next-generation sequencing, but not the 454 “length matters” platform. While Illumina read lengths are still growing (most groups currently run 75- or 100-bp protocols), they still cannot rival the ~450 bp reads consistently produced on 454 Titanium. And yet, many applications of NGS benefit from longer reads – de novo assembly, metagenomics, and the core promoter assays I’ve just described, to name a few. Thus, Shendure and his group sought to combine some Tech D cleverness with Illumina’s incredible read depth to generate localized assemblies of kilobase-length fragments.

First, they sheared DNA into fragments that were a few kilobases long, ligated adapters to the ends of each fragment, and did a round of amplification. Now they had many copies of each fragment with adapters on each end. The fragments were concatemerized, then somehow randomly sheared into variable-length pieces of the original fragment such that each piece had one of the original adapters on one end. A new adapter was ligated to the sheared end. Then came another round of PCR, followed by Illumina paired-end sequencing. The resulting paired-end reads (75-mers) have a “read2” that’s the same for all pieces of the same kilobase fragment, but a “read1” that comes from some random location within the fragment.
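The key trick is that read2 acts as a tag identifying the parent fragment. A minimal sketch of the binning step that would precede any local assembly might look like this. The function name and the toy sequences are hypothetical, and exact string matching on read2 is a simplifying assumption; real data would need to tolerate sequencing errors in the tag.

```python
from collections import defaultdict

def bin_by_tag(read_pairs):
    """Group read1 sequences by their shared read2 tag.

    Each bin holds the reads that (under the assumption of error-free
    tags) derive from the same kilobase-length parent fragment.
    """
    bins = defaultdict(list)
    for read1, read2 in read_pairs:
        bins[read2].append(read1)
    return dict(bins)

# Toy (read1, read2) pairs; real reads would be 75-mers.
read_pairs = [
    ("ACGTAC", "TAGAAA"),
    ("GTACGG", "TAGAAA"),
    ("TTGCAA", "TAGCCC"),
    ("CGGATC", "TAGAAA"),
    ("CAATGG", "TAGCCC"),
]

bins = bin_by_tag(read_pairs)
for tag, reads in bins.items():
    print(tag, len(reads), "reads")
```

Each bin could then be handed to an assembler on its own, which keeps every assembly problem small and local.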

Then, it’s possible to perform a localized assembly for each kilobase fragment. It’s an interesting approach, but here’s the problem: after assembly, in their proof-of-principle experiment, they achieved a median contig size of 350 bp. Granted, the per-base quality was very high (85% of bases had Q>40), but the lengths are unimpressive. As Dr. Shendure joked, they managed to get similar read lengths to a 454 run and make it cost just as much. There’s still a lot of work to do. Or they could just pick up one of those cute little GS-Juniors.

Human Exomes and Mendelian Disease

Finally, Dr. Shendure gave an overview of last year’s elegant Nature paper, in which exome sequencing of four individuals, followed up by careful downstream informatics, correctly identified the causative gene. Their defined “exome” was 30 Mb, which they targeted using two solid-phase array capture chips. Illumina sequencing of the exome capture generated about 6.4 gigabases per individual. Exome sequencing makes a lot of sense in certain Mendelian disorders, where (1) the pattern of inheritance, e.g. autosomal recessive, is known, and (2) the causative mutations occur in a single gene.

By sequencing the exomes of multiple individuals, isolating what we’d call “tier 1” variants (nonsynonymous, nonsense, splice site, or frameshift-indel), and then removing all known common variants from public databases, Dr. Shendure and colleagues can reduce 20,000 gene candidates down to a handful. It worked out beautifully in the Nature paper: all four individuals had rare, tier 1 mutations in the same gene.
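The filtering logic described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the variant records, effect labels, gene names, and the `candidate_genes` function are all hypothetical, and requiring just one qualifying variant per gene is a simplification (a recessive model would ask for two).

```python
# Assumed labels for the "tier 1" variant classes named in the text.
TIER1 = {"nonsynonymous", "nonsense", "splice_site", "frameshift_indel"}

def candidate_genes(exomes, known_variants):
    """Return genes carrying a novel tier 1 variant in every individual.

    exomes: one list of variant dicts per individual.
    known_variants: IDs of variants to discard (e.g. those in a public database).
    """
    per_individual = []
    for variants in exomes:
        genes = {
            v["gene"]
            for v in variants
            if v["effect"] in TIER1 and v["id"] not in known_variants
        }
        per_individual.append(genes)
    return set.intersection(*per_individual) if per_individual else set()

# Toy data: three individuals, made-up variant IDs and gene names.
exomes = [
    [
        {"id": "chr1:100A>G", "gene": "GENE_A", "effect": "nonsense"},
        {"id": "chr2:200C>T", "gene": "GENE_B", "effect": "nonsynonymous"},
        {"id": "chr3:300G>A", "gene": "GENE_C", "effect": "synonymous"},
    ],
    [
        {"id": "chr1:150C>T", "gene": "GENE_A", "effect": "nonsynonymous"},
        {"id": "chr2:200C>T", "gene": "GENE_B", "effect": "nonsynonymous"},
    ],
    [
        {"id": "chr1:175G>T", "gene": "GENE_A", "effect": "splice_site"},
        {"id": "chr4:400T>C", "gene": "GENE_D", "effect": "frameshift_indel"},
    ],
]
known = {"chr2:200C>T"}

shared = candidate_genes(exomes, known)
print(sorted(shared))
```

Note that the individuals do not need to share the same mutation, only the same gene, which is what makes the intersection across unrelated cases so powerful.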

But in another cohort (4 individuals from 3 kindreds with Miller syndrome, a rare developmental disorder) Dr. Shendure and colleagues discovered the danger of overfiltering. They removed all variants from dbSNP 129, but when they limited the scope to only mutations predicted to be “damaging” or “deleterious”, the number of genes dropped to zero. Apparently the deleteriousness of at least one of the causal mutations wasn’t predicted correctly.

Obviously, the need is for better filters of common variants. But with projects like the 1000 Genomes in full swing, I wonder, will filtering against dbSNP get better, or worse? Already, as Shendure pointed out, certain genes have basically a SNP reported at every position. I know that TP53 does. What’s more, with the advent of next-generation sequencing, I hate to tell you, but people are going to be reporting a lot of false positives. I guarantee it. So when you filter out all of the known variants, you might actually remove the ones you’re looking for.

References
Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, & Shendure J (2009). Targeted capture and massively parallel sequencing of 12 human exomes. Nature, 461 (7261), 272-6 PMID: 19684571

Finding Recurrent CNVs in Cancer

January 6, 2010 by Dan Koboldt

Copy number aberrations (CNAs) represent one of the most prevalent genetic alterations in cancer cells. There is considerable interest in finding CNAs that affect the same chromosomal region in multiple tumor samples. Recurrent CNA (RCNA) implies the presence of key cancer genes; on chromosome 7, for example, we often see amplification of the region containing the EGFR gene.

Most approaches to RCNA identification work in two steps: first, call CNAs in each individual sample; second, perform cross-sample analysis to look for recurrence. Unfortunately, with large numbers of samples and increasingly dense genomic data, this two-step strategy carries a significant computational burden.

Enter the Matrix: Correlation Matrix Diagonal Segmentation

Now online at Bioinformatics Early Access is a paper describing CMDS, a population-based method for detecting RCNA in cancer that was developed here at Washington University by Qunyuan Zhang and his colleagues.

CMDS uses raw intensity ratio data (from SNP arrays, CGH, etc.) and adopts a diagonal transformation strategy to identify RCNAs via between-chromosomal-site correlation. Not only does this reduce the computational burden of RCNA identification, but it increases the detection power as well.
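To give a feel for why between-site correlation is informative, here is a heavily simplified sketch. It is not the published CMDS algorithm, which includes a specific diagonal transformation and significance testing; it only illustrates the underlying intuition that neighboring sites inside a recurrent aberration rise and fall together across samples. The function names, the `band` parameter, and the toy data are all assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def diagonal_scores(data, band=1):
    """Score each site by its mean correlation with sites within `band` positions.

    data[s][i] is the intensity ratio for sample s at site i. Only the
    entries near the diagonal of the site-by-site correlation matrix are used.
    """
    n_sites = len(data[0])
    columns = [[row[i] for row in data] for i in range(n_sites)]
    scores = []
    for i in range(n_sites):
        lo, hi = max(0, i - band), min(n_sites, i + band + 1)
        neighbors = [j for j in range(lo, hi) if j != i]
        total = sum(pearson(columns[i], columns[j]) for j in neighbors)
        scores.append(total / len(neighbors))
    return scores

# Toy data: 4 samples x 7 sites. Samples 0 and 1 carry a gain at sites 2-4;
# the other sites hold small, mutually uncorrelated fluctuations.
data = [
    [0.1, 0.1, 1.0, 1.0, 1.0, 0.1, 0.1],
    [-0.1, -0.1, 1.0, 1.0, 1.0, -0.1, -0.1],
    [0.1, -0.1, 0.0, 0.0, 0.0, 0.1, -0.1],
    [-0.1, 0.1, 0.0, 0.0, 0.0, -0.1, 0.1],
]

scores = diagonal_scores(data, band=1)
# The middle of the recurrent region (site 3) scores highest.
print([round(s, 2) for s in scores])
```

Because the score comes straight from raw ratios pooled across the whole population, there is no per-sample CNA calling step, which is the part the two-step methods spend most of their time on.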

Done in 13 Seconds

CMDS has a speed advantage as well. Qunyuan compared its execution time to that of AWS-STAC, SBS-STAC, and pREC-A on a dataset comprising 10,000 sites in 100 samples. The R version of CMDS finished in 13 seconds. The other algorithms took more than 300 times longer on the same dataset, indicating that CMDS represents a substantial performance gain. There’s also a C version of CMDS that runs even faster.

Application to Real Data: Lung Cancer and Glioblastoma

To evaluate CMDS on real data, Qunyuan applied it to lung adenocarcinoma and glioblastoma (brain cancer) datasets that were generated as part of the Tumor Sequencing Project (TSP) and the Cancer Genome Atlas (TCGA), respectively. CMDS called 39 significant RCNA regions in lung cancer and 37 in brain cancer. All of the significant regions had been previously reported/validated; they included or were proximal to a number of well-known cancer genes including EGFR, CCND1, KRAS, MDM2, PDGFRA, and others.

When the two datasets were combined, a few key RCNA regions emerged – amplification of EGFR, CDK4, and MDM2, and deletion of CDKN2A – that were shared by both cancers. This, to me, demonstrates one of the most powerful aspects of CMDS – its population-based approach can compare not only samples of the same cancer type, but also pools of samples across sample types. It makes a great addition to our arsenal of cancer genomics tools at Washington University.

CMDS is implemented in R and C programs which are available from Qunyuan’s web site.

References
Zhang Q, Ding L, Larson DE, Koboldt DC, McLellan MD, Chen K, Shi X, Kraja A, Mardis ER, Wilson RK, Borecki IB, & Province MA (2009). CMDS: a population-based method for identifying recurrent DNA copy number aberrations in cancer from high-resolution data. Bioinformatics (Oxford, England) PMID: 20031968