Archives for October 2009

The Search for Somatic Changes

October 29, 2009 by Dan Koboldt

As cancer genome sequencing ramps up here and pretty much everywhere around the world, I got to thinking about strategies for identifying somatic changes, with confidence, from massively parallel sequencing data. As part of the Cancer Genome Atlas Project (TCGA), we’ve been applying both targeted (capture-based) and whole-genome sequencing approaches to tumor samples and matched normal controls. Ideally, the resulting data will yield high (>20x) coverage in both tumor and normal across our positions of interest. What happens next, at least at WashU, is the culmination of a multiple-year effort to develop a comprehensive pipeline for detecting somatic variants.

First up: Single Nucleotide Variants (SNVs)

With more than 15 million entries in dbSNP, single nucleotide polymorphisms (SNPs) remain the most common form of DNA sequence variation in humans. In cancer, most of the well-characterized somatic mutations are single nucleotide changes as well. Conceptually, SNVs should be the easiest things to find in next-gen sequencing data. They occur at a single position that can be directly compared between tumor and normal. They should have minimal effects on sequence alignments to the reference genome. For example, here’s a putative somatic variant in TP53:

What you see above is SAMtools “pileup” output at a single position (7518990 on chr17), for Normal and Tumor. The Normal shows 4 reads that all support the reference on the – strand (,,,,). The Tumor, however, shows 6 reads that all support a G variant, 2 on the + strand (GG) and 4 on the – strand (gggg). It seems reasonable that, given this output across the entire genome for Normal and Tumor, one can compare them at every position and look for differences such as these.
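To make that position-by-position comparison concrete, here is a minimal Python sketch. It is not our pipeline: the function names and the `min_tumor_reads` threshold are made up for illustration, and it assumes well-formed pileup base strings in which `.` and `,` mark reference matches on the + and – strands, upper- and lowercase letters mark mismatches on the + and – strands, `^` plus one character marks a read start, `$` marks a read end, and `+N`/`-N` introduce indels.

```python
import re
from collections import Counter

def count_pileup_bases(bases):
    """Tally reference and variant observations, by strand, from a pileup base string."""
    # Drop read-start markers ('^' plus a mapping-quality character) and read-end markers ('$').
    bases = re.sub(r"\^.", "", bases).replace("$", "")
    counts = Counter()
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch in "+-":
            # Indel: sign, length, then that many bases. Skipped in this SNV-only sketch.
            length = re.match(r"\d+", bases[i + 1:]).group()
            i += 1 + len(length) + int(length)
            continue
        if ch == ".":
            counts[("ref", "+")] += 1
        elif ch == ",":
            counts[("ref", "-")] += 1
        elif ch in "ACGTN":
            counts[(ch, "+")] += 1
        elif ch in "acgtn":
            counts[(ch.upper(), "-")] += 1
        i += 1  # anything else (e.g. '*') is ignored
    return counts

def candidate_somatic_alleles(normal_bases, tumor_bases, min_tumor_reads=3):
    """Return alleles seen in the tumor (>= min_tumor_reads) but never in the normal."""
    normal = count_pileup_bases(normal_bases)
    tumor = count_pileup_bases(tumor_bases)
    candidates = []
    for allele in "ACGT":
        tumor_support = tumor[(allele, "+")] + tumor[(allele, "-")]
        normal_support = normal[(allele, "+")] + normal[(allele, "-")]
        if tumor_support >= min_tumor_reads and normal_support == 0:
            candidates.append(allele)
    return candidates

# The TP53 position described above: 4 reference reads in Normal, 6 G reads in Tumor.
print(candidate_somatic_alleles(",,,,", "GGgggg"))
```

Note that a naive comparison like this would happily call the TP53 site somatic from only 4 normal reads, which is exactly the under-sampling problem discussed next.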

Yet we struggle to validate even high-confidence SNVs that look to be somatic. Some are real but germline (probably under-sampled or missed in the Normal); most are simply false positives in the tumor. These might arise from a number of causes – homopolymers, paralogs, repeats, sequencing error, alignment error, etc. Only a small fraction of variants that appear somatic in NGS data will validate as such.

Why is that? In general, it’s because by screening for somatic variants, we remove all of the variants that are most likely to be real. First, we exclude any variants that are present in the normal (germline) – which account for the majority of true sequence variations. We also exclude known variants from dbSNP and 1,000 Genomes databases, which are also likely to be real but almost certainly germline events. Then, we prioritize variants that are predicted to have functional effects – on protein coding, on splicing, in conserved regions, etc. Such regions are often under negative selection for damaging mutations, meaning that variants should be exceedingly rare. Every one of these filters selects for variants that are less likely to be valid.
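The filtering cascade just described can be sketched in a few lines of Python. Everything here is illustrative: the dictionary fields, the effect labels, and the toy variants are assumptions, not real calls or our actual annotation scheme. The point is only to show how each stage shrinks the candidate set.

```python
def screen_for_somatic(tumor_calls, normal_keys, known_keys,
                       functional_effects=("nonsynonymous", "splice_site")):
    """Apply the three filters in order and record how many calls survive each one."""
    def key(v):
        return (v["chrom"], v["pos"], v["ref"], v["alt"])

    tally = [("called in tumor", len(tumor_calls))]
    # 1. Exclude anything also seen in the matched normal (germline).
    remaining = [v for v in tumor_calls if key(v) not in normal_keys]
    tally.append(("absent from normal", len(remaining)))
    # 2. Exclude known variants (e.g. dbSNP, 1,000 Genomes).
    remaining = [v for v in remaining if key(v) not in known_keys]
    tally.append(("not in known-variant databases", len(remaining)))
    # 3. Keep only variants predicted to have a functional effect.
    remaining = [v for v in remaining if v["effect"] in functional_effects]
    tally.append(("predicted functional", len(remaining)))
    return remaining, tally

# Toy data, invented for this example.
tumor_calls = [
    {"chrom": "17", "pos": 100, "ref": "A", "alt": "G", "effect": "nonsynonymous"},
    {"chrom": "17", "pos": 200, "ref": "C", "alt": "T", "effect": "synonymous"},
    {"chrom": "17", "pos": 300, "ref": "G", "alt": "A", "effect": "nonsynonymous"},
    {"chrom": "17", "pos": 400, "ref": "T", "alt": "C", "effect": "intergenic"},
]
normal_keys = {("17", 200, "C", "T")}   # also present in the matched normal
known_keys = {("17", 300, "G", "A")}    # already in a known-variant database

survivors, tally = screen_for_somatic(tumor_calls, normal_keys, known_keys)
for stage, count in tally:
    print(f"{stage}: {count}")
```

Each stage removes the calls with the strongest independent evidence of being real, so whatever reaches the end of the list is enriched for artifacts.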

Small Indels

With longer (>50 bp) fragment-end reads and/or paired-end libraries, it’s possible to detect small insertion/deletion variants (indels) in next-gen sequencing data. Here, detection and specificity are the challenges. In 454 data, the reads are [hopefully] of sufficient length (250 bp) for accurate gapped alignment to a reference sequence, and indeed, aligners commonly used with 454 data (Newbler, BLAT, cross_match, SSAHA2) do so. Unfortunately, indels are both the strength and the weakness of 454 data – due to the underlying pyrosequencing, homopolymeric regions are often under- or over-called, resulting in numerous false positives. Many can be filtered, but often homopolymer-associated errors cause mis-alignment of reads, yielding indels that might not look like homopolymer artifacts.

Indel detection is also possible with Illumina data, though the shorter read lengths make this challenging. Few short read aligners can handle the throughput of Illumina data and allow for gaps in read alignments, because speed and gapped alignment are at odds with one another. Fortunately, paired-end sequencing on Illumina offers a solution implemented by Maq some time ago – first, map all reads that you can without gaps, and then, look for gapped alignments in unplaced reads whose mate is mapped nearby. This reduces the search space considerably for gapped alignment, and also limits the query space to reads that likely contain indels (gaps).
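Here is a rough Python sketch of that two-pass idea. It is not Maq's code: the `ReadEnd` type, the `max_insert` value, and the function name are hypothetical, and the sketch only picks out which reads to hand to a gapped aligner and which reference window to search.

```python
from typing import NamedTuple, Optional

class ReadEnd(NamedTuple):
    name: str
    chrom: Optional[str]  # None when the ungapped pass could not place the read
    pos: Optional[int]

def gapped_alignment_targets(pairs, max_insert=500):
    """For pairs with exactly one placed end, return (read, chrom, start, end) search windows."""
    targets = []
    for end1, end2 in pairs:
        for placed, unplaced in ((end1, end2), (end2, end1)):
            if placed.chrom is not None and unplaced.chrom is None:
                # Only search near the mapped mate, within the expected insert size.
                start = max(1, placed.pos - max_insert)
                end = placed.pos + max_insert
                targets.append((unplaced.name, placed.chrom, start, end))
    return targets

pairs = [
    (ReadEnd("a/1", "chr17", 7518000), ReadEnd("a/2", None, None)),      # one end unplaced
    (ReadEnd("b/1", "chr17", 7519000), ReadEnd("b/2", "chr17", 7519200)),  # both placed
    (ReadEnd("c/1", None, None), ReadEnd("c/2", None, None)),            # neither placed
]
print(gapped_alignment_targets(pairs))
```

Only the first pair yields a target, and its search space is a window of about 1 kb rather than the whole genome.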

In cancer sequencing, small indels present one additional problem – determining whether they are present in the normal. Even the best aligners can’t always precisely define where an indel starts or stops. Thus, a germline indel might have different coordinates in the tumor than in its matched control; when comparing the samples, it might appear to be somatic.
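One common way to reduce this coordinate ambiguity is to shift every indel to its leftmost equivalent position before comparing tumor to normal. The Python sketch below does this for deletions only, using 0-based coordinates on a made-up reference string. It illustrates the general technique and is not a description of what any particular aligner or our pipeline does.

```python
def apply_deletion(ref, pos, length):
    """Return the sequence that results from deleting ref[pos:pos+length]."""
    return ref[:pos] + ref[pos + length:]

def left_align_deletion(ref, pos, length):
    """Shift a deletion left while doing so leaves the resulting sequence unchanged."""
    # Moving the deleted window one base left is equivalent whenever the base
    # entering the window equals the base leaving it.
    while pos > 0 and ref[pos - 1] == ref[pos + length - 1]:
        pos -= 1
    return pos

ref = "GGCACACACATT"  # toy reference containing a CA repeat

# The same 2-bp germline deletion, reported at different coordinates in two samples.
tumor_call = (6, 2)
normal_call = (3, 2)

print(apply_deletion(ref, *tumor_call) == apply_deletion(ref, *normal_call))
print(left_align_deletion(ref, *tumor_call), left_align_deletion(ref, *normal_call))
```

After left-alignment both calls land on the same coordinate, so a simple positional comparison no longer mistakes the germline deletion for a somatic one.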

Loss of Heterozygosity (LOH)

It is well known that tumor genomes show extensive loss of heterozygosity (LOH). Generally, this occurs because a position that is heterozygous in the germline is affected by some kind of structural event – deletion, gene conversion, chromosome loss, etc. – that results in the loss of one allele. Of course, to detect LOH, one needs a variant that’s heterozygous in the Normal, and to precisely define the region of LOH, one needs a dense set of heterozygotes. Even so, the maximum precision for the start and stop of an LOH region is the inter-SNP distance, since only SNPs can inform on LOH, and that can be hundreds or thousands of bases. But LOH calls do tend to cluster, and detection of LOH regions is not really the problem. Even lower-resolution array technologies identify recurrent LOH regions in tumor samples.
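A small Python sketch shows why the boundaries of an LOH region can only be resolved to the nearest informative SNPs. The input format (position, normal genotype, tumor genotype, with genotypes as two-letter strings) and the example sites are assumptions made up for illustration.

```python
def loh_regions(sites):
    """Group consecutive LOH calls on one chromosome; sites must be sorted by position."""
    # Only sites that are heterozygous in the normal can inform on LOH.
    informative = [(pos, tumor) for pos, normal, tumor in sites if normal[0] != normal[1]]
    regions, current, prev_retained = [], None, None
    for pos, tumor in informative:
        if tumor[0] == tumor[1]:  # heterozygosity lost in the tumor
            if current is None:
                current = {"first": pos, "last": pos, "n_snps": 1,
                           "left_bound": prev_retained, "right_bound": None}
            else:
                current["last"] = pos
                current["n_snps"] += 1
        else:  # heterozygosity retained: closes any open region
            if current is not None:
                current["right_bound"] = pos
                regions.append(current)
                current = None
            prev_retained = pos
    if current is not None:
        regions.append(current)
    return regions

sites = [
    (100, "AG", "AG"),   # het in normal, retained in tumor
    (250, "CC", "CC"),   # homozygous in normal: uninformative
    (900, "CT", "TT"),   # LOH
    (1500, "AG", "AA"),  # LOH
    (4000, "GT", "GT"),  # retained
]
print(loh_regions(sites))
```

In this toy case the region is supported by SNPs at 900 and 1500, but its true start could lie anywhere between 100 and 900 and its true end anywhere between 1500 and 4000.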

But what exactly does LOH mean in terms of cancer development and growth? It’s hard to say. Quite possibly, a tumor suppressor gene was deleted, or an oncogenic allele was duplicated. Unfortunately, LOH regions tend to be kilobases or megabases in size, containing dozens or hundreds of genes, and identifying which ones are truly affected in terms of cancer remains challenging. We see a lot of LOH in cancer, but sadly, it never seems to get anyone excited.

Structural and Copy Number Variation


Last and most difficult to characterize are the sub-microscopic structural changes – insertions, deletions, inversions, translocations, duplications, etc. – that often occur in tumor genomes. These tend to be large, complex events that are tough to infer from NGS data. We run Ken Chen’s breakDancer, of course, and it predicts numerous SVs. But how do you validate a massive, complex variant spanning thousands of bases? We do our best with PCR and 3730/454 sequencing, but until read lengths get really really long (perhaps on single-molecule sequencing), validating such events and determining their breakpoints is tough.

There are well-characterized, recurrent copy number alterations in cancer, like EGFR amplification on chromosome 7. Here’s my question: where are all of those extra copies? Are they just tandem duplications of part of a chromosome, or are they duplications that get inserted elsewhere in the genome? In the absence of a complete, linear, high-confidence genome, I’m not sure we can tell.

Fruits of Our Labors

It occurs to me that this is a bit of a negative article – focusing entirely on the challenges and failures, without highlighting the successes. And there are many successes. Every cancer genome tells us something, and every new piece of knowledge goes into our arsenal in the war against cancer. As sequencing ramps up, we’ll see exponential growth in the number of known somatic mutations across a wide array of cancers. With the help of cancer biologists, these data will be leveraged to better understand the genes, proteins, and pathways underlying tumorigenesis. Greater understanding will undoubtedly improve the detection, diagnosis, prognosis, and treatment of cancer patients.

Back from Baylor

October 16, 2009 by Dan Koboldt

I’ve returned from Baylor’s Human Genome Sequencing Center (HGSC), where earlier this week colleagues from Baylor, Boston College, the Broad Institute, University of Michigan, NCBI, and EBI converged for a face-to-face meeting on Pilot 3 of the 1,000 Genomes Project. Unlike Pilots 1 and 2, which emphasized whole-genome sequencing to low and high coverage, respectively, Pilot 3 selectively targeted the exons of 1,000 genes (~1.5 Mbp total) for sequencing by capture technologies.

Capture, Exons, and the Exome

For the genome centers, this pilot was one of the first applications of relatively new technologies to enrich for particular regions of the genome. The idea is that by focusing on the exons of protein-coding genes, one can maximize the return of sequencing because variation in those regions is [presumably] more likely to be phenotypically relevant. A post by fellow blogger Keith Robison of Omics! Omics! discusses how capture technologies have recently scaled to offer “exome sequencing” and wonders if this approach will miss important non-coding variation.

While the question of which genomic regions harbor phenotypically-relevant variation is a subject of open debate, I think that Pilot 3’s focus is more technological than biological. It motivated Baylor, Broad, WashU, and Sanger centers to push developing capture technologies into production. Perhaps the most important aspect of this project, as Carrie Sougnez of the Broad Institute put it, is that Pilot 3 “helped us learn how to do capture.”

Cross-Platform and Cross-Pipeline Comparisons

In the face-to-face meeting, Kiran Garimella of the Broad Institute and Gabor Marth of Boston College presented some comparisons of variant calls across platforms and across BAM-generation pipelines. The results, surprisingly, were similar across most of the approaches in terms of the variants that were detected. Comparisons of BAM files generated by different pipelines (Broad’s and Baylor’s, for example) revealed few differences. One exception, however, was the aggressive marking of PCR duplicates in 454 data by Baylor’s MarkDuplicates algorithm, which reduced the number of [false-positive] SNP calls. Matthew Bainbridge of Baylor has already been generous enough to share this algorithm with other centers.

Overall, the Pilot 3 variant calls are looking good – dbSNP concordances in the 70-80% range or higher, and transition/transversion ratios of about 3.0-3.5 – and consistent across 454 and Solexa data from multiple centers.
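For readers unfamiliar with these two quality metrics, here is a minimal Python sketch of how they can be computed. The function names and the toy calls are invented for this example. A transition is a purine-to-purine (A/G) or pyrimidine-to-pyrimidine (C/T) change, and anything else is a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for A<->G or C<->T substitutions (assumes ref != alt)."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES

def titv_ratio(snvs):
    """Ratio of transitions to transversions for a list of (ref, alt) pairs."""
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = len(snvs) - transitions
    return transitions / transversions if transversions else float("inf")

def dbsnp_concordance(called_sites, known_sites):
    """Fraction of called sites that are already present in a known-variant set."""
    if not called_sites:
        return 0.0
    return len(set(called_sites) & set(known_sites)) / len(set(called_sites))

# Toy calls, invented for illustration.
snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
called = [("chr1", 100), ("chr1", 200), ("chr2", 50), ("chr3", 75)]
known = {("chr1", 100), ("chr1", 200), ("chr2", 50)}

print(titv_ratio(snvs))
print(dbsnp_concordance(called, known))
```

Random errors would produce roughly twice as many transversions as transitions, so a ratio near 3 in exons is a reassuring sign that most calls are real.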

Validation and Biological Significance

As with any SNP discovery project, validation is a key step, and the decisions of how to validate thousands of variants across hundreds of samples are non-trivial. Much of the discussion at the face-to-face meeting was devoted to coming up with a validation plan. While we don’t yet know for certain how many of the ~80,000 novel putative variants discovered in Pilot 3 are real, the results look promising. As expected, novel variants tend to be rare – found in just one or a few individuals in the study. Yet our strategy of capture-based sequencing to target exons seems to be paying off, because more than half of the novel variants are predicted to alter protein sequence (nonsynonymous) or mRNA splicing.

Although there’s a lot of work yet to be done, it’s clear to me that this Pilot, and the 1,000 Genomes Project as a whole, will yield a tremendous wealth of new knowledge about sequence variation in the human genome.

First Breast Cancer Genome in Nature

October 9, 2009 by Dan Koboldt

October is Breast Cancer Awareness Month, and the timing couldn’t be better. Our friends at the BC Cancer Agency published the whole genome sequencing of a breast cancer this week in a letter to Nature.

Nature Vol 461 | 8 October 2009

Using Illumina paired-end sequencing, Shah et al generated 141 Gbp of sequence to achieve 43x haploid coverage of a metastatic lobular breast cancer. Some 32 somatic, protein-altering (nonsynonymous) mutations were identified, of which 11 could be detected in the primary tumor sample from 9 years earlier. Deep RNA-Seq data from the metastatic sample also permitted transcriptome analysis, though its presentation was brief. Interestingly, the authors validated two novel RNA-editing events that change the amino acid sequences of SRP9 and COG3.

No WGS of Normal or Primary Tumor?

I realize that a letter to Nature must be brief, but even so, what struck me most about this paper is what’s missing. First of all, only the metastatic sample was whole-genome sequenced – the primary tumor and matched normal were not. Instead, the authors identified nonsynonymous coding variants in the metastasis WGS data, and validated them by PCR/3730 sequencing in the metastasis, primary tumor, and normal samples. This seems laborious to me, since there were 1,120 nonsynonymous SNVs, of which 437 (39%) were valid and only 32 (<3%) were absent in the normal and therefore somatic. Another regrettable limitation of this approach is that it doesn’t offer a complete picture of the somatic mutations beyond nonsynonymous-coding events.
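The yield at each validation step is easy to work out from the three counts quoted above. The short Python sketch below just does that arithmetic; the variable names are mine.

```python
candidates = 1120  # nonsynonymous SNVs predicted from the metastasis WGS data
validated = 437    # confirmed by PCR/3730 sequencing
somatic = 32       # confirmed and absent from the normal

def percent(part, whole):
    """Percentage of whole represented by part, rounded to one decimal place."""
    return round(100 * part / whole, 1)

validation_rate = percent(validated, candidates)      # share of predictions that were real
somatic_of_all = percent(somatic, candidates)         # share of predictions that were somatic
somatic_of_validated = percent(somatic, validated)    # share of real variants that were somatic

print(validation_rate, somatic_of_all, somatic_of_validated)
```

Put another way, by these counts roughly 35 candidates had to be tested for every somatic mutation confirmed.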

Missing Methods

My understanding of Nature journals is that there’s no limit on supplementary material that accompanies publications. Thus, I don’t understand why the methods are incomplete. For example, though the authors found and confirmed >60 germline indels, there’s no description anywhere of the indel-calling algorithm. There’s a lot of text describing their internally developed SNVmix algorithm to identify SNVs, but no link to download it that I could find. No mention of dbSNP or Affy SNP array concordance for SNVmix calls was offered, so one cannot evaluate the algorithm. Also, there’s no description of read de-duplication, which is alarming because it suggests that duplicate reads from the same molecule weren’t removed prior to analysis.

The Importance of RNA-Seq

I do like that the authors performed RNA-Seq of the transcriptome, which provides insights into mechanisms like alternative splicing (AS), allele-specific expression (ASE), and RNA editing. Sadly, only the last one received mention in the results section, suggesting that no significant AS or ASE events were found. Interestingly, not only did the authors validate two instances of high-frequency, protein-altering RNA editing (COG3 and SRP9), but they found that the ADAR RNA-editing enzyme was one of the most highly expressed genes in the metastasis. The authors note that “these observations emphasize the importance of integrating RNA-seq data with tumor genomes,” although this claim would have been far better supported if one did not have to dig through a massive/disorganized Excel file for most of the RNA-seq data.

“Evolution” of a Breast Cancer Tumor

Perhaps the most intriguing – and contentious – finding of the paper (as highlighted by GT’s In Sequence magazine and Keith Robison on Omics! Omics!) was that few of the somatic mutations in the metastasis were detected in the primary tumor sample from 9 years earlier. PCR and deep resequencing of mutation-containing amplicons in the metastasis and primary tumor allowed for a frequency analysis of the 32 somatic mutations. Five of these (in ABCB11, HAUS3, SLC24A4, SNX4, and PALB2) were present at high levels in the primary tumor, while another six (in KIF1C, USP28, MYH8, MORC1, KIAA1468, and RNASEH2A) were detectable at lower frequencies (1-3%). Of the remaining 21 mutations, 19 were not detected at all and 2 could not be determined.

I’m not an oncologist, but I still wonder how surprising it should be that many of the mutations in a metastatic tumor are absent from a primary tumor almost a decade earlier. Are these simply passenger mutations that arose from a surviving subclone from the original tumor, or are they key drivers of metastasis and tumor growth? Or was it the intervening radiation and therapy that caused these mutations? There was zero discussion of the known functions of these genes in this paper, so it’s difficult to say. The authors contrast this result with our sequencing of AML1, though I’m not sure it is an appropriate comparison, since we had data from a relapse 3 years post-diagnosis, whereas theirs was from a metastasis 9 years post-diagnosis. Even so, the findings in the breast cancer study are interesting enough to merit further investigation.

References
Shah, S., Morin, R., Khattra, J., Prentice, L., Pugh, T., Burleigh, A., Delaney, A., Gelmon, K., Guliany, R., Senz, J., Steidl, C., Holt, R., Jones, S., Sun, M., Leung, G., Moore, R., Severson, T., Taylor, G., Teschendorff, A., Tse, K., Turashvili, G., Varhol, R., Warren, R., Watson, P., Zhao, Y., Caldas, C., Huntsman, D., Hirst, M., Marra, M., & Aparicio, S. (2009). Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution. Nature, 461 (7265), 809-813. DOI: 10.1038/nature08489

Nobel Prize: Telomeres, Telomerase, Aging and Cancer

October 8, 2009 by Dan Koboldt

The 2009 Nobel Prize in Physiology/Medicine was shared by three researchers “for the discovery of how chromosomes are protected by telomeres and the enzyme telomerase.”

Elizabeth H. Blackburn, now at the University of California, San Francisco.
Carol W. Greider, now at Johns Hopkins University.
Jack W. Szostak, now at Harvard Medical School / Massachusetts General Hospital.

In response to the award, Scientific American made available a very nice article (Telomeres, Telomerase, and Cancer) written by Greider and Blackburn in 1996. In it, the authors chronicled the work of their predecessors as well as their own travails that eventually yielded the Nobel-prize-winning discoveries.

Discovery of the Telomere

In the 1930s, two geneticists working independently and with different model organisms – Barbara McClintock at the University of Missouri-Columbia (my alma mater) and Hermann J. Muller at the University of Edinburgh – found that chromosomes had “caps” on their ends that provided stability. Muller coined the term “telomere,” and McClintock found that chromosomes without telomeres stuck to one another or underwent structural changes that were bad news for the cell housing them.

Telomeres and the End-Replication Problem

In the 1970s, the makeup of telomeres (6-nucleotide repeats, or telomeric subunits) was determined. It differed across species, and the number of telomeric subunits varied from cell to cell, even within a single organism. Telomere size fluctuated over time as well. In 1972, our friend James D. Watson observed that DNA polymerases could not copy linear chromosomes all the way to the tip, which meant that some sequence at each end of a chromosome goes uncopied during DNA replication. Blackburn and others reasoned that there must be a way for cells to compensate for this constant shortening of telomeres.

It was in 1984 that Blackburn and Greider, working together at UC Berkeley, set out to find the agent that maintained telomere length in cells. Experiments with Tetrahymena cells showed that something in the cell extracts – an enzyme – had the ability to lengthen telomeres by adding subunits. The mystery agent, telomerase, is made up largely of protein. However, it contains one additional key element – a single-stranded RNA template from which new telomeric subunits are made.

A Factor in Human Aging

In complex organisms such as humans, not all cells express telomerase. Indeed, while telomerase is found in cells of the germline – no doubt to preserve telomeres in constantly-dividing germ cells – most somatic cells don’t have it. This means that as cells divide, their telomeres are shrinking. Somatic cells from a human newborn will divide 80 to 90 times when placed in culture, yet the same cells from a 70-year-old human will divide only 20-30 times. When cells that normally undergo division stop doing so, they “senesce” – change in appearance and eventually die.

Intuitively, this provides a very neat model for human aging, whose common hallmarks (e.g. wrinkles, weakened immunity, loss of hair) seem tied to the loss of dividing cells. Could it be that telomerase, then, might offer immortality?

Telomerase and Cancer

It appears so, though not in the long-sought Fountain of Youth manner that one might hope. Instead, telomerase plays a key role in the immortality of many forms of cancer. In 1996, Harley and Shay (the latter at UT Dallas) found that telomerase was expressed in 90 of 101 tumor samples representing 12 different cancer types, but none of 50 control samples of somatic cells. Greider, Harley and others, in working with virus-induced models of cancer, found that transformed cells divide continuously – shortening their telomeres as they go – and at some critical point, most of them die. Some few, however, begin expressing telomerase when their telomeres are dangerously short, and survive. It seems that telomerase activation is not the initiating event in tumor development; instead, it confers immortality to cells that are already cancerous.

As Blackburn and Greider point out in their article, this offers a very intriguing (if hypothetical) model for the role of telomerase in cancer. Normally, in the course of human development, telomerase activity is suppressed in the cells of somatic tissue. These cells divide continuously until reaching some critical shortening of telomeres, at which point they stop. If, however, somatic mutations occur that confer the ability to ignore that “stop” signal, the cells may transform. Loss of the telomeres would result in dangerously unstable chromosomes, which might explain the crazy cytogenetics and structural changes often observed in cancers. Many such cells will die, but some fraction may acquire changes that allow them to become immortal, such as the activation of telomerase.

It could be, however, that telomerase plays only an ancillary role in the development of most human cancers. Only further research, perhaps with tumors that have many structural/cytogenetic abnormalities, will define the part that telomeres and telomerase play in tumorigenesis.