Archives for March 2011

Sequenced: A mouse model of leukemia

March 24, 2011 by Dan Koboldt

A study published yesterday in the Journal of Clinical Investigation reports the whole-genome sequencing of a mouse acute promyelocytic leukemia (APL) genome. APL is a subtype of acute myeloid leukemia (AML), characterized by the presence of a t(15;17) translocation that creates the PML-RARA fusion oncoprotein. In mice, you can induce expression of PML-RARA transgenically, and they’ll develop APL after a latency period that’s often a year or more. This suggests that the oncogene alone is not sufficient to cause disease, but requires additional progression mutations. The identity of these mutations, and the means by which they contribute to APL development, are largely unknown.

Generating A Mouse APL Cohort

Wartman et al have a mouse model of APL in which a human PML-RARA cDNA is knocked in to the 5′ UTR of the cathepsin G gene on chromosome 14 of the 129/SVJ strain. The fusion protein is expressed in myeloid progenitor cells, and the mice develop myeloproliferative disease that evolves into “acute leukemia with promyelocytic features” after a latency of 7-18 months. In the current study, they back-crossed transgenic mice onto the C57BL/6 background for 10 generations. The goal here was to reduce “passenger” variation inherent to the 129/SVJ genome. A large cohort was established for sample banking; some 60% of the resulting mice went on to develop APL.

Whole-genome Sequencing and Mutation Detection

To identify disease-contributing mutations, the authors performed whole-genome sequencing on the diploid tumor genome of a mouse that developed APL after 335 days. Three runs on the Illumina GAIIx platform yielded 59.64 Gbp (billion base pairs) of sequence, roughly 16x haploid coverage of the mouse genome. But the subsequent SNP calling revealed a bit of a surprise: over 100,000 variants. Unfortunately, a matched germline sample for the donor mouse hadn’t been preserved; it was thought unnecessary because of the back-crossing.

So the authors did the next best thing: sequenced the 129/SVJ genome, using pooled samples from 6 young wild-type males. At roughly 30x coverage, some 4.95 million SNPs between 129/SVJ and C57BL/6 were identified. These served to eliminate 79,339 of the tumor SNVs. Another 17,179 occurred in “contiguous blocks” of 4+ SNPs within 40 kbp, which were presumed passenger mutations arising from genetic drift. That left 15,628 potential somatic SNVs, of which 31 were heterozygous “tier 1” (coding) variants.
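The filtering funnel above is easy to check with a little arithmetic, and the “contiguous block” rule can be sketched as a simple windowing step. This is only an illustration, not the authors’ pipeline: the function name `flag_clustered` is made up, and I’m assuming that “4+ SNPs within 40 kbp” means any run of four sorted positions spanning at most 40,000 bp. The toy positions are invented.

```python
# Counts reported in the post.
removed_strain = 79_339   # matched SNPs found in the 129/SVJ genome
removed_blocks = 17_179   # fell in "contiguous blocks"
remaining = 15_628        # putative somatic SNVs
total_called = removed_strain + removed_blocks + remaining


def flag_clustered(positions, min_snps=4, window=40_000):
    """Return positions that belong to any run of `min_snps` sorted
    positions spanning at most `window` bp (assumed interpretation)."""
    pos = sorted(positions)
    flagged = set()
    for i in range(len(pos) - min_snps + 1):
        j = i + min_snps - 1
        if pos[j] - pos[i] <= window:
            flagged.update(pos[i:j + 1])
    return flagged


# Toy positions on a single chromosome (invented for illustration).
toy = [1_000, 5_000, 12_000, 30_000, 500_000, 900_000]
clustered = flag_clustered(toy)
isolated = [p for p in toy if p not in clustered]
```

Here the three reported counts sum to 112,146, consistent with the “over 100,000 variants” figure, and only the two widely spaced toy positions survive the block filter.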

Pursuit of the Coding Mutations

PCR and 3730 sequencing confirmed 8 of 31 coding SNVs as valid mutations. Two were synonymous (silent) mutations and not given further consideration. The remaining 6 were screened in a collection of 89 mouse APL tumors. Three were recurrent among litter-mates, presumably de novo germline mutations in a common ancestor. Two weren’t recurrent at all. Only one remained: a missense mutation (V657F) in the pseudokinase domain of Jak1. Intriguingly, this mutation is homologous to a recently described activating mutation (JAK1 V658F) found in human APL and ALL tumors. The orthologous position in human JAK2 is commonly mutated (V617F) in myeloproliferative neoplasms, or MPN.

JAK Mutations in Human Cancers

In the commentary article, Rampal and Levine note that more than 30 mutations in JAKs have been reported in various human cancers including AML, ALL, breast cancer, lung cancer, and others. Janus kinases (JAKs) are protein tyrosine kinases involved in the transduction of cytokine receptor signaling. Since these are predominantly gain-of-function mutations, JAKs are an appealing target for new cancer drugs. Indeed, some JAK inhibitors have already entered late-stage clinical trials for treatment of MPN.

In the last part of the paper, the authors demonstrate that a pan-JAK inhibitor reduced APL colony formation with similar efficacy to all-trans retinoic acid, the standard therapeutic agent for APL. Thus, their work not only serves to elucidate the pathogenesis of leukemia, but also points the way to new strategies for treating this type of cancer.

References
Wartman, L., Larson, D., Xiang, Z., Ding, L., Chen, K., Lin, L., Cahan, P., Klco, J., Welch, J., Li, C., Payton, J., Uy, G., Varghese, N., Ries, R., Hoock, M., Koboldt, D., McLellan, M., Schmidt, H., Fulton, R., Abbott, R., Cook, L., McGrath, S., Fan, X., Dukes, A., Vickery, T., Kalicki, J., Lamprecht, T., Graubert, T., Tomasson, M., Mardis, E., Wilson, R., & Ley, T. (2011). Sequencing a mouse acute promyelocytic leukemia genome reveals genetic events relevant for disease progression. Journal of Clinical Investigation. DOI: 10.1172/JCI45284

Rampal, R., & Levine, R. (2011). Finding a needle in a haystack: whole genome sequencing and mutation discovery in murine models. Journal of Clinical Investigation. DOI: 10.1172/JCI57200

Exome sequencing of human induced stem cells

March 18, 2011 by Dan Koboldt

Human induced pluripotent stem cells (hiPS cells) have incredible promise for therapeutic use. With genetically identical stem cell lines, it may become possible to replace cells or tissues that have been compromised by disease. Yet questions remain about the safety of hiPS cell lines. Does the induction of pluripotency alter the genome of the resulting cell? If so, what mutations could arise? A new study in Nature begins to shed light on these issues. Using whole-exome sequencing, Gore et al characterized somatic coding mutations in 22 hiPS cell lines that were reprogrammed using five different methods.

Patterns of Mutation

They identified and validated a total of 124 mutations, or roughly 5-6 per hiPS line. The majority of mutations (92/124, or 74%) were missense, nonsense, or splice-site variants predicted to alter protein sequences. Strikingly, some 50 of the affected genes were known to be mutated in human cancers, an enrichment that proved highly significant (p=0.0019). Among these were tumor-suppressor gene ATM, tyrosine kinase receptors NTRK1 and NTRK3, cell division proteins, and others. Furthermore, 14 of the 22 hiPS lines had acquired mutations in genes linked to Mendelian disorders.

Sources of Mutation

Where did all of these mutations come from? The authors considered two possible explanations:

The mutations already existed in the fibroblast cells, having been acquired over the lifetime of the donors, or
The mutations were newly acquired during or shortly after reprogramming, and became fixed in hiPS cell populations.

The ages of the donors for this study ranged from 0 to 82 years, and did not correlate with mutational load. Further, the observed mutation rate was ten-fold higher than that of skin fibroblasts from the same patients grown in culture without reprogramming.

Uber-Deep Sequencing in Normal Fibroblasts

All of the somatic mutations in hiPS lines were heterozygous and fixed at roughly 50%. Thus, the process for generating these mutations had already completed. It could be that some mutations were present at very low levels in the skin fibroblasts, and became fixed through clonal expansion. To investigate this possibility, the authors performed PCR and deep Illumina sequencing (10-million-x coverage) for 32 mutations in the original fibroblast DNA. They claim that for 17 of 32 mutations, they detected the variant allele at low frequency (0.003-10%) by comparing read counts between fibroblasts and negative controls to remove the noise. Personally, I’m leery of this result. We’re talking about detection of variants at levels far below the known error rate for Illumina sequencing (~0.1%), and that doesn’t even account for PCR. I know, I know, you can try to model the errors by looking at the negative controls, but I just don’t buy it.
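To put rough numbers on that concern, here is a back-of-the-envelope sketch. It assumes a uniform per-base error rate of 0.1% at the nominal 10-million-x depth, with errors split evenly among the three possible wrong bases. Those are simplifications of mine, not figures from the paper, since real error rates vary by position and base change.

```python
depth = 10_000_000         # nominal coverage quoted above
error_rate = 0.001         # ~0.1% per-base error, assumed uniform
lowest_vaf = 0.003 / 100   # lowest reported frequency (0.003%) as a fraction

# Reads expected to show any wrong base at this position from error alone.
expected_error_reads = depth * error_rate

# Reads expected from a true variant at the lowest reported frequency.
expected_variant_reads = depth * lowest_vaf

# Assumed even split of errors among the three possible wrong bases.
error_reads_per_alt_base = expected_error_reads / 3
noise_to_signal = error_reads_per_alt_base / expected_variant_reads
```

Under these assumptions, error alone contributes about 3,300 reads carrying any particular wrong base, against about 300 reads from a true variant at 0.003%. That’s roughly an 11-to-1 ratio of noise to signal before any control-based correction.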

Dead Ends: Reprogramming, Mutation, and Selection

Even if you believe the uber-deep read counts, roughly half of the mutations (15/32) are completely absent from fibroblasts. Also, different hiPS clones derived from the same line contained different sets of mutations. These observations suggest that a significant portion (at the very least) of somatic mutations occurred during reprogramming and subsequent culturing. This might have happened during a short window of elevated mutation rate, possibly due to transient repression of TP53, RB1, or other tumor suppressor genes. Selection, too, might have played a role by favoring mutations that facilitated induction or colony growth. Yet the colonies with mutations in tumor-suppressor genes had similar mutational loads to those without, and pathway analysis of the iPS-acquired mutations found no significant functional advantage.

We seem to leave this study with more questions than answers about mutations in hiPS cell lines. When do they occur, and by what mechanism? How do they consistently become fixed in the colony population? More studies are needed to shed light on these important issues.

References

Gore A, Li Z, Fung HL, Young JE, Agarwal S, Antosiewicz-Bourget J, Canto I, Giorgetti A, Israel MA, Kiskinis E, Lee JH, Loh YH, Manos PD, Montserrat N, Panopoulos AD, Ruiz S, Wilbert ML, Yu J, Kirkness EF, Izpisua Belmonte JC, Rossi DJ, Thomson JA, Eggan K, Daley GQ, Goldstein LS, & Zhang K (2011). Somatic coding mutations in human induced pluripotent stem cells. Nature, 471 (7336), 63-7 PMID: 21368825

Improving Detection of Genome Structural Variation

March 8, 2011 by Dan Koboldt

Large-scale structural variation (SV) is pervasive in the human genome, both in healthy individuals and in tumor cells. Numerous methods have been developed to detect such variants, most of which rely on the information provided by molecularly paired reads. Even the most sophisticated methods, however, still generate numerous false positives. A new study in Nature Genetics describes an innovative, population-based method to improve the accuracy of SV calling. In their introduction, Handsaker et al offer four main causes underlying false positives in SV calls:

Sequencing errors, which occur more frequently in next-generation sequencing data and exhibit both random and platform-specific bias distributions.
Chimeric molecules, in which read pairs linking two non-contiguous segments of DNA masquerade as SVs. Sequencing libraries can contain millions of such fragments, which represent ~1% of sequence reads.
Read depth variation, which fluctuates across the genome for both biological and technical reasons.
Genome repeats, which confound most short-read aligners even when read pairing information is available.

These issues are exacerbated in population-scale sequencing, which often yields lower coverage across large numbers of samples. As more genomes are sequenced, false positives accumulate faster than real variants do. However, the authors hypothesized that population-scale sequencing might enable new analytical approaches. Here, they describe three strategies to do just that: allele sharing, population heterogeneity, and allelic substitution.

Credit: Nat. Genet. 768 (Handsaker et al, 2011)

Coherence Around Shared Alleles

Most of the variation in any given genome is shared, at some level, with other members of the population. Pilots from the 1,000 Genomes Project have shown that variants with appreciable allele frequencies (>1%) in the population will generally be shared by multiple samples if the pool is sufficiently sized. Further, for medical sequencing projects, causative variants should be enriched among cases even if they’re rare at the population level. The authors sought to exploit shared variation wherever it could be found, without filtering out singleton variants. In essence, they looked for evidence of similar deletion alleles (measured by larger-than-expected insert sizes for read pairs) across multiple samples. The idea was that random chimeric events should be specific to a single library, whereas SVs reflecting true variation should persist across multiple libraries from multiple samples. Looking in the 1,000 Genomes data, the authors found that 89% of the SVs had evidence across multiple genomes.

However, it became clear that allele coherence by itself was an insufficient criterion for SV calling, because even after it was applied, there were ten times the number of expected SVs according to extrapolations of copy number data.
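The core of the allele-sharing idea can be sketched as a tally of how many distinct samples contribute supporting read pairs to each candidate deletion. This is a toy illustration under my own assumptions: the `(candidate_id, sample_id)` tuple format and the function name are hypothetical, and the tally is used here only to annotate candidates, not to discard singletons.

```python
from collections import defaultdict


def samples_per_candidate(evidence):
    """Count distinct samples supporting each candidate deletion.

    `evidence` is an iterable of (candidate_id, sample_id) tuples, one per
    read pair with a larger-than-expected insert size (hypothetical format).
    """
    support = defaultdict(set)
    for candidate_id, sample_id in evidence:
        support[candidate_id].add(sample_id)
    return {cand: len(samples) for cand, samples in support.items()}


# Invented evidence: del_B is seen in one sample only, as a chimera might be.
toy_evidence = [
    ("del_A", "s1"), ("del_A", "s2"), ("del_A", "s2"),
    ("del_B", "s3"), ("del_B", "s3"),
    ("del_C", "s1"), ("del_C", "s4"),
]
counts = samples_per_candidate(toy_evidence)
multi_sample = sorted(c for c, n in counts.items() if n > 1)
```

Candidates backed by several samples are less likely to be library-specific chimeras, though as noted above, this criterion alone was not enough.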

Heterogeneity in Populations

Next, the authors sought to use allele heterogeneity in their sample populations to distinguish real variants, which should be present in some individuals, but not others. For each deletion, they performed a chi-squared test of the number of read pairs supporting or not supporting the variant across 168 genomes. The resulting p-value, or heterogeneity statistic, was consistently low for “control” deletions that were known to be real by copy number data. Many of the loci that had passed the shared allele coherence test, but failed the heterogeneity statistic, were flanked by homologous sequences that caused aligners to mis-place reads; copy number data suggested that few such cases represented real variants.
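A heterogeneity test of this kind can be sketched as a Pearson chi-squared statistic on a 2 x N table (supporting vs. non-supporting read pairs, one column per genome). The function below is a generic textbook version with invented counts, not the authors’ implementation. It returns the statistic and degrees of freedom rather than a p-value, and the chi-squared approximation gets shaky when expected counts are small.

```python
def heterogeneity_chi2(supporting, not_supporting):
    """Pearson chi-squared statistic for a 2 x N table of read-pair counts.

    Rows: pairs supporting / not supporting the deletion.
    Columns: genomes. Returns (statistic, degrees_of_freedom).
    """
    col_totals = [s + t for s, t in zip(supporting, not_supporting)]
    row_totals = [sum(supporting), sum(not_supporting)]
    grand = sum(row_totals)
    stat = 0.0
    for row, row_total in zip((supporting, not_supporting), row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat, len(supporting) - 1


# Invented counts for four genomes.
# Support concentrated in two genomes, as expected for a real deletion.
stat_real, df_real = heterogeneity_chi2([10, 0, 9, 0], [10, 20, 11, 20])
# Same low-level support everywhere, as an alignment artifact might give.
stat_flat, df_flat = heterogeneity_chi2([2, 2, 2, 2], [18, 18, 18, 18])
```

A large statistic (and thus a low p-value) means the supporting reads are unevenly distributed across genomes, which is what a segregating variant should look like. Uniform support gives a statistic near zero.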

Copy Number Correlations

To bolster the support for putative SVs, the authors evaluated the relationship between predicted deletions and copy number depth for the reference allele. In theory, if the variants represented true deletions, there should be a corresponding drop in coverage. In many cases, there was no such correlation; further review showed that many of these loci bore cryptic polymorphisms (often small indels) that caused reads to mis-align to nearby, paralogous sequences. Another cause of predictions that passed shared-allele and heterogeneity tests but failed the read depth correlation was transposon insertion polymorphisms not contained in the reference sequence. Reads from such insertions often mapped to nearby paralogous sequences, thereby falsely supporting large deletions of the intervening sequences.
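The read-depth check can be sketched as a correlation across samples between the number of deleted copies predicted from read pairs and the relative depth over the putative deletion. The helper below is a plain Pearson correlation with invented data. The variable names and the choice of Pearson’s r are my assumptions, not details from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Predicted number of deleted copies per sample (invented).
predicted_deleted_copies = [0, 1, 2, 0, 1]

# Relative read depth over the region for a true deletion (invented):
# depth falls as the number of deleted copies rises.
depth_true = [1.0, 0.5, 0.05, 0.95, 0.55]

# Relative depth for a mis-alignment artifact (invented): no drop at all.
depth_artifact = [1.0, 1.02, 0.98, 0.97, 1.01]

r_true = pearson_r(predicted_deleted_copies, depth_true)
r_artifact = pearson_r(predicted_deleted_copies, depth_artifact)
```

A strongly negative correlation is what a real deletion should produce. A correlation near zero suggests the paired-end signal comes from something else, such as the mis-aligned reads described above.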

Breakpoint Resolution and Genotype Determination

By combining data across all individuals found to have a structural allele in common, it was possible to localize the breakpoints of deletions with resolutions of 1-20 bp. Many types of information in population-scale sequencing data – paired-end alignments, read depth, and breakpoint-spanning reads – can supply partial information about the genotype state of SVs in individuals. The authors developed a Bayesian framework to combine this information into a single measure of the relative likelihood that the sequence data from each genome arose from each potential SV genotype at that locus. Comparisons of these inferred genotypes to copy number data and (where available) high-resolution array genotype data supported a high accuracy for the method, and showed that the confidence score tracked with accuracy.
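In its simplest form, this kind of integration can be sketched as multiplying per-source genotype likelihoods (treating the sources as independent), applying a prior, and normalizing. The sketch below is a generic Bayesian calculation with invented likelihoods. The independence assumption, the uniform prior, and all names are mine, and the authors’ actual model is certainly more involved.

```python
GENOTYPES = ("ref/ref", "ref/del", "del/del")


def genotype_posteriors(likelihoods_by_source, priors=None):
    """Combine per-source genotype likelihoods into posterior probabilities.

    `likelihoods_by_source` maps a source name to a dict of
    P(data from that source | genotype). Sources are assumed independent.
    """
    if priors is None:
        priors = {g: 1.0 / len(GENOTYPES) for g in GENOTYPES}  # uniform prior
    unnormalized = {}
    for g in GENOTYPES:
        value = priors[g]
        for source_likelihoods in likelihoods_by_source.values():
            value *= source_likelihoods[g]
        unnormalized[g] = value
    total = sum(unnormalized.values())
    return {g: v / total for g, v in unnormalized.items()}


# Invented likelihoods for one sample at one locus.
evidence = {
    "read_pairs": {"ref/ref": 0.10, "ref/del": 0.60, "del/del": 0.30},
    "read_depth": {"ref/ref": 0.05, "ref/del": 0.70, "del/del": 0.25},
    "breakpoint_reads": {"ref/ref": 0.20, "ref/del": 0.50, "del/del": 0.30},
}
posteriors = genotype_posteriors(evidence)
best_genotype = max(posteriors, key=posteriors.get)
```

No single source is decisive here, but together they favor the heterozygous deletion with a posterior of about 0.90, which is the sort of confidence score that can then be checked against array data.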

For deletions smaller than 300 bp, few genotypes could be inferred with high confidence. To resolve these, the authors utilized haplotypes formed by SNPs and SVs together. Most common SVs characterized to date have been shown to segregate with common SNP haplotypes; by employing imputation algorithms and haplotype information, the authors were able to extend this approach to resolve many low-confidence genotypes. The resulting calls were consistent with the features of sequence data and also fit the haplotype structure of the population. The authors genotyped 13,826 of the deletion polymorphisms identified by the 1,000 Genomes Project (ranging in size from 48 bp to 960 kbp), with an average call rate of 94.1%. This was ten times as many deletions as could be genotyped by combination SNP-CNV arrays that were designed for genome-wide association studies.

In summary, Handsaker and colleagues have presented strategies that could help develop new analytical approaches as sequencing is extended to large populations. Together with SNP- and small indel-detection algorithms, these approaches will help realize the full potential of population-scale sequencing.

References
Handsaker RE, Korn JM, Nemesh J, & McCarroll SA (2011). Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nature Genetics, 43 (3), 269-76 PMID: 21317889