Driver Mutations and Metastasis

November 30, 2010 by Dan Koboldt

Two recent papers used very different approaches to shed light on the genetic alterations underlying tumor growth and progression in human cancers. Peter Campbell and colleagues from the Wellcome Trust Sanger Institute employed Illumina paired-end sequencing to survey the landscape of structural variation in metastatic pancreatic cancer. Ivana Bozic and colleagues from Harvard University took a different approach – they constructed mathematical models of tumor progression via the accumulation of driver and passenger mutations. I happened to read both papers on a long airplane ride, and learned a great deal about mutations and metastasis in human cancers.

Pancreatic Cancer: Bad News

You learn a lot from the introduction sections of these papers, even if the Letter to Nature format keeps them short. I knew that pancreatic cancer had, in general, a poor prognosis. It turns out that the five-year mortality for this cancer is 97-98%, usually due to “widespread metastatic disease.” These tumors also appear to carry a heavy mutational load. A 2008 survey of 24 pancreatic cancers (by Bert Vogelstein’s group at Johns Hopkins) found that tumors had ~63 genetic alterations on average, the majority of which were point mutations. Copy number changes are also common in this cancer type. Frequently mutated genes include tumor suppressors (TP53, SMAD4, CDKN2A) as well as oncogenes (KRAS, MYC). Less was known about the patterns of structural variation in pancreatic cancer.

Detecting Rearrangements by Paired-End Sequencing

Peter Campbell’s group has developed a very nice strategy for identifying somatically acquired rearrangements by massively parallel paired-end sequencing on the Illumina platform. They’ve already applied it to the characterization of SVs in several cancer cell lines. In this study, they generated 50-150 million read pairs (2 x 37 bp) per patient, which, in their experience, enables detection of 50-60% of rearrangements in a sample. Across the 13 pancreatic tumors, they identified 381 somatic and 177 germline rearrangements in seven categories: amplicon, deletion, tandem duplication, inversion, fold-back inversion, interchromosomal (translocation), and “other” intrachromosomal.
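To make the idea concrete, here is a minimal sketch of how the orientation and spacing of a mapped read pair can hint at a rearrangement class. This is a generic textbook-style scheme, not the authors' pipeline: the `ReadPair` type, the `classify_pair` function, and the 500 bp insert-size cutoff are all assumptions for illustration. Real callers cluster many supporting pairs, and telling apart amplicons or fold-back inversions needs extra information (copy number, breakpoint proximity) that this toy ignores.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """One mapped read pair (hypothetical minimal representation)."""
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str  # '+' or '-'


def classify_pair(pair, max_insert=500):
    """Return a coarse label for a single read pair.

    max_insert is an assumed upper bound on the library insert size.
    """
    if pair.chrom1 != pair.chrom2:
        return "interchromosomal"

    # Order the two ends by coordinate so 'left' is the lower position.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    distance = right_pos - left_pos

    if left_strand == right_strand:
        # Both ends on the same strand: one end has been flipped.
        return "inversion"
    if left_strand == "+" and right_strand == "-":
        # Ends point toward each other, as expected for this library type;
        # only the spacing can be abnormal.
        return "concordant" if distance <= max_insert else "deletion"
    # '-' then '+': ends point away from each other.
    return "tandem_duplication"


print(classify_pair(ReadPair("chr12", 100, "-", "chr12", 50_000, "+")))
```

A single discordant pair proves very little on its own, which is one reason each candidate still had to be confirmed by PCR.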

Many rearrangements corresponded with a change in copy number. In one metastasis, for example, numerous rearrangements (some inverted, some not) combine to amplify the KRAS oncogene.

Rearrangement/Amplification of KRAS (Credit: Nature).

Fold-back Inversions and Inter-Lesion Genetic Heterogeneity

One sixth of the rearrangements identified fell into a class the authors call “fold-back” inversions. These are genomic regions that are duplicated, but the two copies face in opposite directions from the breakpoint (as opposed to a tandem duplication). The authors suggest breakage-fusion-bridge cycles as the likely mechanism that creates such an event. Basically, a double-stranded break that occurs during G0-G1 phase is replicated (in S phase), creating two duplicated end sequences. These are fused together by DNA repair processes, resulting in a sort of inverted duplication (fold-back inversion) with two centromeres. These “dicentric” chromosomes are unstable, and frequently initiate the amplification of oncogenes.

Each rearrangement was [laboriously] genotyped by PCR in both the index tumor sample and matched normal control to verify the somatic status. Further, PCR and capillary sequencing were employed to resolve breakpoints, and some 206 rearrangements were genotyped across multiple lesions (metastases) in the 10 patients for whom metastatic samples were available. There was a considerable amount of genetic heterogeneity among samples from the same patient. While the majority of rearrangements were present in all samples but not the germline (omnipresent), several were present in some samples but not others (partially shared) or unique to the index tumor sample (private).
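The three labels can be written down as a small decision rule. The sketch below assumes each rearrangement comes with a simple present/absent PCR call per sample; the function name, the sample names, and the "germline" and "not detected" fallbacks are my own additions for illustration, not something taken from the paper.

```python
def classify_rearrangement(calls, index="index", germline="germline"):
    """Label one rearrangement from per-sample PCR calls.

    calls maps sample name -> True/False (present/absent). It must contain
    the germline (normal) sample and the index tumor sample.
    """
    if calls[germline]:
        return "germline"  # not somatic at all

    tumor_calls = {s: hit for s, hit in calls.items() if s != germline}
    positives = [s for s, hit in tumor_calls.items() if hit]

    if len(positives) == len(tumor_calls):
        return "omnipresent"       # every lesion, but not the germline
    if positives == [index]:
        return "private"           # index tumor sample only
    if positives:
        return "partially shared"  # some lesions but not others
    return "not detected"


example = {"germline": False, "index": True, "liver_met": True, "lung_met": False}
print(classify_rearrangement(example))  # partially shared
```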

Telomere Loss and Breakage-Fusion-Bridge Cycles

Fold-back inversions were significantly more likely than other classes of rearrangement to be omnipresent, suggesting that they occur early during tumor progression, before cancer cells disseminate. Because breakage-fusion-bridge cycles are often initiated by telomere loss, the activity of telomerase to maintain telomeres may play a pivotal role in the development of pancreatic cancer. Other studies have shown that telomerase expression is low in early tumor stages, but markedly increased in the invasive tumor. The increased expression likely suppresses breakage-fusion-bridge cycles, which may help explain why fold-back inversions are more likely to occur earlier in the development of the disease.

Ongoing Evolution in Tumors and Mets

In several patients, the authors found rearrangements that were in the primary tumor and some metastases, but not all of them. The most likely explanation for such a pattern is that the metastases were “seeded” by different cells from the primary tumor. This is intriguing, because it suggests ongoing clonal evolution, in the primary tumor, among cells capable of initiating metastases. There were also rearrangements in some metastases that weren’t detected in the primary tumor, suggesting that secondary lesions, too, are undergoing clonal evolution.

Overall, the authors demonstrated that pancreatic cancers and secondary invasions show a substantial amount of genetic heterogeneity within the same patient. There’s certainly more to be done to get the full picture of genetic alterations in these tumors, but at just ~4-10 Gbp of data per sample, the scope and nature of what the authors have uncovered is pretty impressive.
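As a quick sanity check on that data volume, the read-pair counts quoted earlier (50-150 million pairs of 2 x 37 bp reads) can be turned into total bases. Only the numbers come from the text; the helper name is mine, and this is back-of-the-envelope arithmetic rather than anything reported by the authors.

```python
READ_LENGTH_BP = 37   # from the text: 2 x 37 bp reads
READS_PER_PAIR = 2


def yield_gbp(read_pairs):
    """Total sequenced bases, in Gbp, for a given number of read pairs."""
    return read_pairs * READS_PER_PAIR * READ_LENGTH_BP / 1e9


low = yield_gbp(50_000_000)
high = yield_gbp(150_000_000)
print(f"{low:.1f} to {high:.1f} Gbp per patient")  # 3.7 to 11.1 Gbp per patient
```

That lands close to the ~4-10 Gbp figure, which works out to only a few-fold sequence coverage of a human genome. Rearrangement detection gets by on so little because it relies on read pairs spanning a breakpoint rather than on deep base-level coverage.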

Drivers and Passengers

The other paper (contributed by Bert Vogelstein to PNAS) took a theoretical approach to modeling the accumulation of driver and passenger mutations during tumor progression. In contrast to previous models that account for only 1-2 mutations, the authors develop a model in which mutations occur sequentially in tumor cells, with each new driver mutation conferring a slightly faster growth rate. This more closely reflects recently-characterized solid tumors, which harbor 40-100 coding gene alterations, of which 5-15 are considered “driver” mutations.

Based on the assumption that any human cell contains 286 tumor suppressor genes and 91 oncogenes, the authors estimate that ~34,000 positions in the human genome could host a driver mutation. By this estimate, the driver mutation rate is approximately 3.4 x 10^-5 per cell division. Under the authors’ assumption that each driver speeds tumor growth, the rate at which drivers accumulate becomes faster and faster, because the more drivers a cell has, the faster it divides. Not every driver mutation takes hold, because a driver only reduces the probability that a cell will senesce or die; it doesn’t guarantee the cell’s survival. The authors considered a mutation in a tumor suppressor gene to be the central rate-limiting factor, since the other working copy tends to be lost relatively quickly due to large-scale LOH events.
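The two numbers in that estimate can be related with a little arithmetic. The sketch below simply divides one by the other, which assumes every candidate position is equally likely to be hit (my simplification), and then asks how long a single cell lineage would wait, on average, for a driver.

```python
DRIVER_POSITIONS = 34_000          # from the text
DRIVER_RATE_PER_DIVISION = 3.4e-5  # from the text

# Implied chance that any one candidate position is hit in one division,
# assuming the rate is spread evenly across positions.
per_position_rate = DRIVER_RATE_PER_DIVISION / DRIVER_POSITIONS

# Mean number of divisions before a single lineage acquires a driver
# (the mean of a geometric distribution is 1 / p).
mean_divisions = 1 / DRIVER_RATE_PER_DIVISION

print(f"{per_position_rate:.1e} per position per division")  # 1.0e-09
print(f"{mean_divisions:,.0f} divisions on average")          # 29,412
```

Nearly thirty thousand divisions is a very long wait for one lineage, but a tumor contains a great many dividing cells, so the wait for the population as a whole is far shorter.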

Six simulated patients were modeled and presented in this study. All of them started with one driver mutation. Strikingly, though all of the input values (mutation rate, division rate) were the same, there was enormous variation in the rates of tumor progression between simulated patients. Patient 1, for example, went 20 years before acquiring a second driver mutation, and the size of the tumor remained small (<5 g). In contrast, patient 6 had a secondary driver mutation in less than 5 years; by the end of the simulation, that tumor weighed hundreds of grams. While this model is undoubtedly an oversimplification, it does highlight the importance of, well, random chance. Given the large size of the human genome and the relatively small number of potential driver mutations, an individual’s fate hinges on stochastic processes. If you’re lucky, you go decades without picking up that crucial second hit. If you’re unlucky, you don’t.
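You can see the same flavor of luck in a toy that is much cruder than the authors' model. The sketch below follows a single cell lineage per "patient" and samples how many divisions pass before the next driver arrives, using the 3.4 x 10^-5 rate from the paper. Everything else (one lineage instead of a growing clone, no cell death, no fitness advantage, the function name, the seed) is my own simplification, so the absolute numbers mean nothing; only the spread is the point.

```python
import math
import random

DRIVER_RATE = 3.4e-5  # per cell division, from the text


def divisions_until_driver(rate, rng):
    """Sample a geometric waiting time (in divisions) by inverse transform."""
    u = rng.random()  # uniform in [0, 1)
    return max(1, math.ceil(math.log1p(-u) / math.log1p(-rate)))


rng = random.Random(2010)  # arbitrary seed so the run is repeatable
waits = [divisions_until_driver(DRIVER_RATE, rng) for _ in range(6)]

for patient, wait in enumerate(waits, start=1):
    print(f"patient {patient}: {wait:,} divisions until the next driver")
```

All six simulated lineages share identical inputs, yet the waiting times differ widely from one to the next. The paper's branching-process model layers growth and cell death on top of this, but the underlying source of patient-to-patient variation is the same kind of randomness.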

Intuitively, this seems reasonable, given the anecdotal evidence of de novo cancers, which seem to strike somewhat randomly. Of course, the older you are, the more times your cells divide, and the better chance you have of picking up additional driver mutations. And environmental exposures (like smoking and radiation exposure) certainly have a role to play, because they increase cellular mutation rates. Even so, if you believe in the model, chance plays a significant role.

Here’s to hoping you’re one of the lucky ones.

References

Bozic I, Antal T, Ohtsuki H, Carter H, Kim D, Chen S, Karchin R, Kinzler KW, Vogelstein B, & Nowak MA (2010). Accumulation of driver and passenger mutations during tumor progression. Proceedings of the National Academy of Sciences of the United States of America, 107 (43), 18545-50 PMID: 20876136

Campbell PJ, Yachida S, Mudie LJ, Stephens PJ, Pleasance ED, Stebbings LA, Morsberger LA, Latimer C, McLaren S, Lin ML, McBride DJ, Varela I, Nik-Zainal SA, Leroy C, Jia M, Menzies A, Butler AP, Teague JW, Griffin CA, Burton J, Swerdlow H, Quail MA, Stratton MR, Iacobuzio-Donahue C, & Futreal PA (2010). The patterns and dynamics of genomic instability in metastatic pancreatic cancer. Nature, 467 (7319), 1109-13 PMID: 20981101

The Fruits of a Thousand Genomes

November 1, 2010 by Dan Koboldt

Last week saw the publication of the 1,000 Genomes Project, which has characterized ~15 million SNPs, 1 million short insertions/deletions (indels), and 20,000 structural variants in seven human populations. This is discovery and genotyping at unprecedented scale, with an astonishing 4.9 terabases (trillion bases) sequenced – the equivalent of about 1,500 human genomes – across three pilot projects:

Deep whole-genome sequencing of trios (mother-father-daughter) from 2 populations
Low-coverage sequencing of 179 unrelated individuals from 4 populations
Exon sequencing of 906 randomly-selected genes in 697 individuals from 7 populations.

The three pilots have shed new light on sequence variation in human genomes and its distribution among human populations. Perhaps unsurprisingly, variation was not evenly distributed in the genome – certain regions (e.g. HLA and sub-telomeres) show high rates of variation, whereas others (e.g. a 5 Mbp, gene-dense, highly-conserved region on chromosome 3) show very little. At the chromosomal level, different forms of variation were highly correlated (e.g. SNPs and indels), but there were exceptions for some types of structural variants, implicating different mechanisms of mutation.

Novelty and Population-Specificity

The vast majority of SNPs detected were already known to dbSNP. Among known variants, 56% were present in all population panels while 25% were found in only a single panel. In contrast, only 4% of novel variants were found in all panels and 84% were found in only one. This difference supports the notion that the majority of common SNPs in human populations have already been found. There’s more work to do for other forms of variation, though. Many of the novel SVs were detected in all population panels. Half of the common short indels had never been reported.

The smallest two chromosomes – mitochondrial and Y – seemed to benefit the most. There was a lot of heteroplasmy in mitochondrial DNA within individuals – 79% of samples had length heteroplasmy, and 45% had substitution heteroplasmy. On the Y-chromosome, there were 2,870 variable sites, most of which (74%) were novel to public databases. These new variants helped identify several clear, significant sub-clades within the 12 haplotype groups represented in 1,000 Genomes samples.

Coding Regions and Loss-of-Function Variants

In total, the three pilots identified 68,300 non-synonymous variants, almost half of which were novel. Genotyping a subset of these in 620 samples revealed that novel nonsynonymous variants had dramatically lower minor allele frequency (2.2%) than known ones (26.2%). From this I can draw two conclusions: most novel nonsynonymous variants are rare, and the majority could only have been identified by population-scale sequencing projects like these.

The authors estimate that an individual genome differs from the reference at 10,000 to 11,000 nonsynonymous sites and perhaps 12,000 synonymous sites. A typical genome harbors a much smaller number of loss-of-function (LOF) variants — inframe/frameshift indels, early stops, and splice-site variants — perhaps 340-400 LOF variants per individual, affecting 250-300 genes. Compared to synonymous variants, putative functional variants (nonsynonymous and LOF) tend to have lower allele frequencies and be more population-specific, presumably due to the action of purifying selection against deleterious mutations. Which means, of course, that the really important variants are much harder to find.

Signatures of Natural Selection

Looking in and around genes, the authors found diversity is lowest in exons (50% that of introns) and slightly reduced in 5′ and 3′ UTRs, compared to intronic and intergenic sequences. This signature of natural selection acting upon genes actually has a broad effect; diversity is reduced by 10% in the vicinity of genes compared to gene-distant loci, and that reduction extends up to 85 kbp away. Thus, selection on linked sites appears to restrict variation across the majority of the human genome. Looking across panels, the authors observed that SNPs with large allele frequency differences between populations were enriched for nonsynonymous sites, likely reflecting local adaptation and selection by different continental groups.

Finally, the authors examined the trios to look at a different environment for mutation and selection – immortalized cell lines. Some 952/1001 new mutations in the CEU daughter and 634/669 new mutations in the YRI daughter were not present in the germline, indicating that they occurred either in somatic cells or in the cell lines. Further, the higher number of mutations in the CEU sample may be related to the age of the lines – the CEU line is decades older than the YRI line.

Implications for Future Studies

The findings of the 1,000 Genomes Project thus far have immediate, significant impact on genetic association studies. Using publicly available gene expression data and their expanded catalogue of variants, the authors identified 20-30% more significant expression quantitative trait loci (eQTLs) than had previously been detectable. Thus, it is clear that while existing SNP arrays represent the majority of common variation, a significant amount of rare, phenotypically-relevant variation remains to be incorporated.

References
1000 Genomes Project Consortium (2010). A map of human genome variation from population-scale sequencing. Nature, 467 (7319), 1061-73 PMID: 20981092

The Four Dimensions of a Breast Cancer Genome

April 15, 2010 by Dan Koboldt

Published today in the journal Nature is the whole-genome sequencing of a basal-like breast cancer tumor, metastasis, and xenograft. There’s also a News and Views article by Joe Gray of Lawrence Berkeley National Laboratory, as well as a news feature on large-scale cancer projects.


This study is a bit unlike our previous cancer genomes (AML1 and AML2). By my count it is the sixth cancer genome to be sequenced, and the third to come out of the Genome Center at Washington University. Obviously, it’s our first solid tumor. What’s particularly interesting about this study, however, is that we sequenced four DNA samples from a single patient with “double-negative” breast cancer: the primary tumor, peripheral blood (normal), a brain metastasis, and a mouse xenograft derived from the primary tumor. The xenograft is a success story in itself – we managed to create a human-in-mouse (HIM) transplant of the primary tumor that was >90% pure when harvested 101 days after engraftment.

The genomes of these four samples (tumor, normal, metastasis, and xenograft), examined with the incredible power of Illumina massively parallel sequencing, offer an unprecedented view of the somatic changes that underlie breast cancer development, growth, and metastasis.

Repertoire of Somatic Mutations

We validated a total of 50 somatic sites in at least one of the three cancer genomes, including:

28 missense mutations predicted to alter the sequence of an encoded protein
11 synonymous (silent) mutations in coding sequences
4 small insertions ranging in size from 1 to 6 bp
3 small deletions ranging in size from 1 to 13 bp
2 splice site mutations at intron-exon junctions
1 nonsense mutation predicted to result in a truncated protein
1 RNA mutation in a gene encoding a signal recognition particle (SRP) RNA.

We employed deep Illumina sequencing of PCR amplicons to assess the frequencies of each mutation across all four tissues. Intriguingly, more than half of them exhibited differential frequencies between primary tumor, metastasis, and/or xenograft. Two mutations (a nonsense mutation in MYCBP2 and a missense mutation in TGFBI) were significantly enriched in the primary tumor (88-89% vs 14-44%). Some 26 mutations were significantly enriched in the metastasis and/or xenograft. Perhaps most interesting, however, were two sites (a missense mutation in SNED1 and a silent mutation in FLNC) that appear to be de novo mutations unique to the metastasis.

Acquired Structural Variation

Using our internally developed tools for structural variant prediction (BreakDancer) and de novo assembly (TIGRA), we predicted 59 deletions and 18 inversions that were putative somatic events. Validation by PCR and 454/3730 sequencing showed that 73/77 (94.8%) were real structural variants, of which 34 (28 deletions and 6 inversions) were somatic alterations not present in the normal genome. Among them was a 46.5 kbp heterozygous deletion affecting FBXW7 (a known cancer gene) and two overlapping 500-kb deletions affecting CTNNA1 and a handful of other genes. The latter was particularly interesting, because loss of CTNNA1 has been shown to result in global loss of cell adhesion in human breast cancer cell lines.

We also validated seven translocations with a combination of manual review (Pairoscope), assembly, and PCR/3730 sequencing. One translocation that we assembled in all three tumor samples involves a long terminal repeat (LTR) from the ERVL-MaLR family on chromosome 4 and the ABCA2 gene on chromosome 9. Two other validated translocations that assembled in all three tumors are on chromosome 2, and separated only by a 393-bp TcMar-Tigger repeat.

Insights from Comparisons of Tumor, Metastasis, and Xenograft

One of the most intriguing findings from our study was the differential mutation frequencies and structural variation patterns that we observed in the metastasis and xenograft, compared to the primary tumor. More than half of the somatic mutations (26/50) were significantly enriched in the metastasis and xenograft, while observed at relatively low frequencies in the primary tumor. This suggests that a sub-population of tumor cells, not the primary clone, gave rise to the cerebellar metastasis that eventually killed the patient.

Is there a fitness cost to the mutations that enabled metastasis? Can we develop sensitive tests to detect the cells that are likely to spread? Genome sequencing has brought us to a point where we can begin to ask these questions, and answering them brings us one step closer to unraveling the complex, devastating, deadly disease that is cancer.

References
Li Ding, Matthew J. Ellis, Shunqiang Li, David E. Larson, Ken Chen, John W. Wallis, Christopher C. Harris, Michael D. McLellan, Robert S. Fulton, Lucinda L. Fulton, Rachel M. Abbott, Jeremy Hoog, David J. Dooling, Daniel C. Koboldt, Heather Schmidt, Joell (2010). Genome remodelling in a basal-like breast cancer metastasis and xenograft. Nature, 464 (15), 999-1005 DOI: 10.1038/nature08989

AGBT: PacBio Somewhat Unveiled

February 27, 2010 by Dan Koboldt

Yesterday the Pacific Biosciences commercial instrument (photo) was at last unveiled to a packed room of conference attendees. The road to this third generation sequencer’s release has been paved with nearly $300 million of investment capital since leaving a basement at Cornell University. PacBio, in addition to becoming something of a media darling, has quietly swelled to a several-hundred-employee company.

Since last year, PacBio claims to have achieved read lengths of up to 10.3 kbp, although I haven’t spoken to anyone outside the company who has seen reads that long. Even so, a few vignettes presented in the workshop told of how PacBio has been applied to influenza strain identification and detection of structural variants (SVs).

Strobe Sequencing in Real Time

Of particular interest is the “strobe sequencing” mode of the instrument, in which the detection laser is turned off for precise amounts of time to generate mate-pair-like reads spanning large fragments. This feature relies on the real time sequencing, which occurs at a very consistent per-base rate. In fact, it’s possible to infer sequence insertions and deletions as spikes or dips (respectively) in the time required to sequence a template of known size.
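The timing idea is easy to sketch. If the polymerase really does move at a steady rate, the time taken to traverse a template implies its length, and a mismatch against the expected length suggests an insertion or deletion. The function below is purely illustrative: its name, the rate of 3 bases per second, and the 50 bp tolerance are made-up values, not PacBio specifications or anything shown at the workshop.

```python
def infer_indel(observed_seconds, expected_length_bp, bases_per_second,
                tolerance_bp=50):
    """Compare the length implied by sequencing time against the expected one.

    Returns (label, size_bp), where label is 'insertion', 'deletion' or 'none'.
    """
    implied_length = observed_seconds * bases_per_second
    delta = round(implied_length - expected_length_bp)
    if delta > tolerance_bp:
        return ("insertion", delta)   # took longer than expected: a spike
    if delta < -tolerance_bp:
        return ("deletion", -delta)   # finished sooner than expected: a dip
    return ("none", 0)


# A 3,000 bp template at an assumed 3 bases/second should take 1,000 seconds.
print(infer_indel(1100, 3000, 3.0))  # ('insertion', 300)
print(infer_indel(900, 3000, 3.0))   # ('deletion', 300)
```

In practice the tolerance would have to reflect how consistent the per-base rate really is, which is exactly the kind of detail the workshop left unstated.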

Kinetic Variation Applications

The kinetics of real-time sequencing add an informative new dimension to PacBio data. In a talk today, Eric Schadt of PacBio showed that the kinetics of sequencing vary significantly for “modified” bases, i.e. methylated residues. In a collaboration with Carrie Harwood (UW), PacBio is sequencing the genomes and transcriptomes of 132 isolates of a hydrogen-producing species of Rhodopseudomonas. It turned out that kinetic variation exists at many bases as a “mixture” of sequencing times; by mining these, they identified thousands of methylated bases that caused up to 12-fold variation in sequencing kinetics.

Burning Questions Unanswered

Personally, I was not entirely satisfied with the PacBio workshop. When it opened for questions, I asked the first: whether PacBio had made any progress on the “dark bases” that go undetected in single-molecule sequencing. The presenter, Stephen Turner of PacBio, first gave me a nice 2-minute lecture on why there is no such thing as “dark bases” on PacBio’s sequencing platform due to its inherent awesomeness (sarcasm mine). There is still a problem with “missed bases,” but Turner was almost comically evasive (as Daniel MacArthur put it) about how often they occur. The next question concerned read lengths, a second topic on which Turner refused to provide concrete information.

Thus, I find myself cautious in my excitement about this new platform, and will reserve judgment until later this year, when the first of the golden-ticket early access partners begin generating data on their own PacBio SMRT sequencers.