AML: A New Era of Cancer Genomics

November 6, 2008 by Dan Koboldt

Next-generation sequencing technologies are said to be ushering in a new era of cancer genomics. A powerful demonstration of the new paradigm for cancer research came out today in Nature. It’s the much-anticipated publication of our AML project, in which we used the Illumina/Solexa platform to sequence the entire genome of a woman who died of acute myeloid leukemia. See my colleague David Dooling’s post on Politigenomics for links to some of the news coverage. This study offers two important milestones to the field:

1.) The first complete genome of a woman. Patient 933124 follows James Watson and J. Craig Venter into the archives of whole-genome history.

2.) The first cancer genome to be completely sequenced on a next-generation platform.

The basic biological problem is simple. This patient had AML. AML is a disease state that was initiated, and driven, by mutations in her genome. Her blood cells should be dominated by the clone of the most effective cancer genome. So, we sequence both tumor and germline DNA. We find any mutations in the tumor that are not in the germline, and identify which of those are novel, protein-coding variants. Among these should be, must be, the mutations that initiated and drove the development of cancer.
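As a toy illustration of that subtraction, variant calls can be treated as sets. This is only a sketch, not the pipeline used in the study: the `Variant` record, the exon intervals, the `known` list (standing in for a catalogue such as dbSNP) and all coordinates are made up, and "falls inside an exon" is a crude stand-in for a real protein-coding annotation.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Variant(NamedTuple):
    """Hypothetical variant record; not the study's actual data format."""
    chrom: str
    pos: int
    ref: str
    alt: str


def in_exons(v: Variant, exons: Iterable[Tuple[str, int, int]]) -> bool:
    """True if the variant position falls inside any (chrom, start, end) exon."""
    return any(c == v.chrom and start <= v.pos <= end for c, start, end in exons)


def candidate_somatic(tumor, germline, known, exons) -> List[Variant]:
    """Tumor-only variants that are novel and land in coding sequence."""
    tumor_only = set(tumor) - set(germline)  # drop inherited variants
    novel = tumor_only - set(known)          # drop previously catalogued variants
    return sorted(v for v in novel if in_exons(v, exons))


# Made-up example data.
shared = Variant("chr1", 1000, "A", "G")      # also present in the germline
catalogued = Variant("chr2", 2000, "C", "T")  # tumor-only, but already known
intergenic = Variant("chr3", 3000, "G", "A")  # tumor-only and novel, not coding
hit = Variant("chr4", 4000, "T", "C")         # tumor-only, novel, and coding

exons = [("chr2", 1900, 2100), ("chr4", 3900, 4100)]
result = candidate_somatic(
    tumor=[shared, catalogued, intergenic, hit],
    germline=[shared],
    known=[catalogued],
    exons=exons,
)
print(result)  # only the chr4 variant survives all three filters
```

In practice each of these steps is noisy (sequencing errors, uneven coverage, misalignments), which is why candidates still have to be validated.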

Major Informatic Challenges

Yet even with new technologies, this was no simple task. The sheer volume of data generated for this project is what amazes me. It took 98 full Solexa runs (4 libraries totaling ~5.86 billion reads) to reach our target diploid coverage in the tumor, which was 90%. The data was generated over a period of several months, during which both the sequencing technologies and the informatic algorithms were constantly evolving. The AML project offered me my first view of both 454 and Solexa data, and it presented our group with numerous challenges. Disk space. Computing power. Short read alignment. Variant calling. You name it.

The Power of the Unbiased Approach

It seems like a lot of work just for ten mutations. That’s how many validated, somatic, nonsynonymous mutations we found in this AML genome. Eight of these ten mutations, however, implicated new genes that were not previously linked to AML. Four of the genes are in gene families strongly associated with cancer pathogenesis (PTPRT, CDH24, PCLKC, and SLC15A1). The other four genes (KNDC1, GPR123, EBI2, and GRINL1B) are not known to contribute to cancer pathogenesis, but they have potential roles in metabolic pathways that may act to promote cancer growth. These are also four genes that would almost certainly have been excluded from a candidate gene approach.

Mutation Frequencies by 454 Read Count

Another interesting application of massively parallel sequencing was the estimation of mutation frequencies in the tumor sample using the Roche/454 platform. For each of the 10 somatic mutations, as well as 2 germline control variants, we performed PCR-targeted 454 resequencing in samples from the primary tumor, the relapse tumor, and the germline. The idea here was to profile the relative proportions of clonal cells that made up each sample. We got some results that the first author of the study, Tim Ley, described more than once (in our meetings) as “absolutely beautiful.” All of the somatic SNPs were at 50% frequencies in the tumor, as you’d expect for heterozygotes. They hovered slightly lower (around 40%) in the relapse sample, which was known to be less pure (i.e. ~78% blasts), but if you correct for the blast count, they reach 50% as well. Intriguingly, the somatic variants were detected in the germline sample as well at frequencies of 5-13%, suggesting that the skin sample was contaminated by a small fraction of leukemic cells. The one non-beautiful result was FLT3, which had frequencies of around 35% in tumor and 31% in relapse. It may be that the FLT3 ITD mutation was not present in all tumor cells; perhaps it was introduced later than the others.
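The blast-count correction mentioned above is simple arithmetic, sketched below. It assumes the non-leukemic cells in the sample do not carry the mutation and that the locus is diploid, so the observed frequency is just the frequency within the tumor cells scaled by the tumor fraction. The function names and the read counts are hypothetical; only the ~40% and ~78% figures come from the text.

```python
def variant_allele_frequency(variant_reads: int, total_reads: int) -> float:
    """Fraction of reads at a site that carry the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return variant_reads / total_reads


def purity_corrected(observed_vaf: float, blast_fraction: float) -> float:
    """Rescale an observed frequency to what it would be in pure tumor."""
    if not 0.0 < blast_fraction <= 1.0:
        raise ValueError("blast_fraction must be in (0, 1]")
    return observed_vaf / blast_fraction


# Hypothetical read counts chosen to give the ~40% seen in the relapse sample.
relapse_vaf = variant_allele_frequency(400, 1000)
corrected = purity_corrected(relapse_vaf, 0.78)
print(f"{corrected:.3f}")  # 0.513, i.e. roughly the 50% expected for a heterozygote
```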

Yes, We Can Find Indels in Short Reads

One of the significant bioinformatic challenges in which I became intimately involved was the detection of indels, which is theoretically possible but practically very difficult in fragment (non-paired) reads that are only ~36bp long. We ended up combining a few different approaches and found over 700 putative small indels, more than half of which were already in dbSNP. We attempted to validate 28 of these by 3730 sequencing. Two were the previously-known mutations in FLT3 and NPM1. Two were false positives. The other 24 were real, but present in the germline, which was a bummer since we thought they’d be somatic. Those are the breaks. Fortunately, indel detection is one area that will be helped dramatically by improvements to the sequencing technologies, namely longer reads and paired-end protocols.

A New Paradigm for Cancer Genomics

I think that most of all, this work was important because it established the feasibility of sequencing entire genomes with massively parallel / short read technologies and getting valuable results from it. It also drove us to develop and apply new algorithms (like decision trees) to analyze the data. I expect that we’ll begin to see a number of whole-genome-sequencing approaches to the study of cancer and other diseases that take advantage of this new paradigm. The question of whether or not we can do science on a whole-genome scale has been answered. In the words of our next president, “yes we can!”

You there, with the Typhoid!

August 4, 2008 by Dan Koboldt

There’s an interesting study in this month’s Nature Genetics in which the authors performed 454 and Solexa whole-genome sequencing on 19 isolates of Salmonella enterica serovar Typhi, the pathogen that causes typhoid fever. Typhi is different from many other Salmonella pathogens in that it’s human-restricted; the main reservoir driving transmission of Typhi is thought to be human carriers. Also, Typhi isolates exhibit very low levels of genetic variation (something like 1 SNP every 2,300 bp).

Among the 19 isolates selected for this study, 10 were sequenced on 454 (average depth: 10.8x) and 12 were sequenced on Solexa (average depth: 20.4x) with 3 isolates done on both platforms. Following Newbler assembly, 454 contigs were aligned to the finished Typhi sequence with MUMmer. Solexa reads were “too short to be assembled effectively using current software” and thus were mapped directly to the finished sequence with Maq (v0.6.0). Cut-offs for SNP calling were determined by comparing data from 454, Solexa, and published sequences for the three strains done on both platforms.

The authors offered little discussion of the relative performance of 454 and Solexa technology for the three strains sequenced on both platforms. However, I got the scoop from Supplemental Table 1. After applying their filtering criteria, the platform-specific performance was as follows. On 454, the mean false positive rate was 1.8% and the mean sensitivity was 85.0%. On Solexa, the mean false positive rate was higher (2.7%) and the sensitivity lower (77.1%). These estimates, by the way, are based on the assumption that the SNPs detected independently by *both* platforms represent the true set of SNPs for each isolate.
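For readers who want to run this kind of comparison themselves, here is a minimal sketch of how the two numbers can be computed from a set of calls and a reference "truth" set. The definitions used (sensitivity as the share of true SNPs recovered, false positive rate as the share of calls not in the truth set) are one common choice and an assumption on my part; I have not checked that they match the supplemental table exactly. The positions are invented.

```python
def evaluate_calls(calls, truth):
    """Compare called SNP positions against a reference set of true SNPs."""
    calls, truth = set(calls), set(truth)
    if not calls or not truth:
        raise ValueError("both call set and truth set must be non-empty")
    true_positives = calls & truth
    false_positives = calls - truth
    return {
        # share of true SNPs that the platform recovered
        "sensitivity": len(true_positives) / len(truth),
        # share of the platform's calls that are not in the truth set
        "false_positive_rate": len(false_positives) / len(calls),
    }


# Invented positions: three of four true SNPs found, plus one spurious call.
truth_positions = {101, 202, 303, 404}
platform_calls = {101, 202, 303, 999}
metrics = evaluate_calls(platform_calls, truth_positions)
print(metrics)
```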

The authors claim that a careful sampling strategy designed to capture the full phylogenetic tree, coupled with whole-genome sequencing, allowed them to capture much of the variation present in the Typhi population. Their analysis supports the previously proposed small population size and genetic drift, with little evidence for purifying selection, antigenic variation, or recombination between isolates. The vast majority of genes (72%) contained no SNPs; for the remainder, the distribution of SNPs per gene followed a Poisson distribution. The only gene with a strong signal of positive selection was gyrA, where mutations at codons 83 and 87 are associated with fluoroquinolone resistance; this no doubt reflects selective pressure on Typhi associated with antibiotic use in human populations. However, the sparse evidence for antigenic variation within Typhi suggests that this pathogen is not under strong selective pressure from the human immune system.
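A quick back-of-the-envelope check on the Poisson claim: if SNPs per gene followed a single Poisson distribution across all genes (a simplification, since real genes differ in length), the 72% of genes with no SNPs would pin down the mean, because P(0) = exp(-lambda). The sketch below works that out; the variable names are mine, and only the 72% figure comes from the study.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events under a Poisson with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


zero_fraction = 0.72              # genes with no SNPs, from the study
lam = -math.log(zero_fraction)    # solve exp(-lam) = 0.72 for the mean

p_one = poisson_pmf(1, lam)
p_two_plus = 1.0 - poisson_pmf(0, lam) - p_one

print(f"implied mean SNPs per gene: {lam:.3f}")
print(f"expected share with exactly 1 SNP: {p_one:.3f}")
print(f"expected share with 2 or more SNPs: {p_two_plus:.3f}")
```

Under that assumption the mean works out to about a third of a SNP per gene, with most SNP-bearing genes carrying just one.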

The low levels of purifying selection, antigenic variation, and recombination in Typhi are consistent with the role of human carriers as the main reservoir for the pathogen. In other words, the disease persists because there are a number of people who are infected, but asymptomatic. The authors conclude that vaccination may be a crucial long-term strategy for control of typhoid fever because it would treat asymptomatic carriers as well as the infirm. I might add that the apparently-healthy carriers of the Typhi pathogen might be a promising population for immunogenetics studies.

NextGen Aligner Focus Group

June 23, 2008 by Dan Koboldt

As our genome center makes the transition from capillary-based to massively parallel sequencing platforms, the development of automated pipelines for data processing has become a high priority. Last week we had a visit from Illumina’s informatics group to discuss several issues related to the GA (Solexa) platform, including image compression, data storage, workflow informatics, etc. There was also talk of a downstream analysis tool, called Bullfrog, that will perform SNP/indel/SV detection (though I got the impression that the software’s nowhere near release at present).

But Illumina is not the only platform, and Eland is certainly not the only aligner. Thus we’ve formed a focus group to evaluate the different programs for sequence alignment and variant detection in next-generation sequence data. We met last week and put together a list of aligners that work with Illumina (Solexa) and/or Roche (454) data. We also compiled, separately, a shorter list of external and internal programs that do SNP and indel detection on either platform. Some programs, like Maq, were in both lists because they do alignments and SNP detection. Some tools are feasible for short (Solexa-length) reads but not long (454-length) reads, and vice versa. In the end we had a list of 15 different aligners for Illumina/Roche data. Some are good, some are bad, and some we simply don’t know.

We agreed that the plan was to evaluate each aligner on the same data set, but decisions on which data set to use, and how to compare the different aligners, were matters of more intense debate. Should we work with human data, or focus on less complex genomes like C. elegans or E. coli? Performance metrics like CPU time, memory usage, disk space, and cost (some are non-free) are obvious points for comparison, but what about alignment accuracy? We need some way to determine if a read placement on the genome is correct or erroneous. How do we know? The question of alignment “truth” and how to determine it was not an easy one to answer.

After an hour of discussion, we tentatively agreed on a dataset – Illumina PE runs on the first human samples that we’ve already sequenced in-house for the 1000 Genomes Project. These runs come from one of the HapMap Project trios, which means that we can validate our SNP detection results against the known HapMap genotypes that were generated on a variety of platforms (and predominantly by other centers). Also, the 1000 Genomes Project DCC will be performing its own evaluation of alignment tools and sequence analysis using the same data, so we can compare notes.
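Validation against known genotypes boils down to a concordance calculation over the sites that both sets cover. Here is a minimal sketch of what that could look like; the dictionary layout, site coordinates and genotype strings are invented for illustration and are not the HapMap file format or our actual pipeline.

```python
def normalize(genotype: str) -> str:
    """Order alleles so that 'GA' and 'AG' compare as the same genotype."""
    return "".join(sorted(genotype.upper()))


def genotype_concordance(called: dict, reference: dict) -> float:
    """Fraction of shared sites where the called genotype matches the reference."""
    shared_sites = called.keys() & reference.keys()
    if not shared_sites:
        raise ValueError("no sites in common")
    matches = sum(
        normalize(called[site]) == normalize(reference[site])
        for site in shared_sites
    )
    return matches / len(shared_sites)


# Invented genotypes keyed by (chromosome, position).
called_genotypes = {
    ("chr1", 100): "AG",
    ("chr1", 200): "CC",
    ("chr2", 300): "GT",
    ("chr3", 50): "AA",   # not in the reference set, so ignored
}
known_genotypes = {
    ("chr1", 100): "GA",  # same genotype, alleles in a different order
    ("chr1", 200): "CT",  # disagreement
    ("chr2", 300): "TG",
    ("chr9", 1): "CC",    # not called, so ignored
}
concordance = genotype_concordance(called_genotypes, known_genotypes)
print(f"{concordance:.3f}")
```

Note that this only scores sites present in both sets; sites the aligner and caller missed entirely need a separate sensitivity measure.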

We put together a short list, by platform, of the aligners to evaluate first. Some decisions here were easy – we’re obviously going to look at Maq and Eland for Solexa data, and we’re already evaluating BLAT and cross_match on some of our 454 data. Other decisions were more difficult – should we evaluate RMAP, whose authors [allegedly] don’t plan to continue development? What about SX OligoSearch, which we can currently only run on Itanium servers? We eventually had five or six aligners per platform that made the short list. This week, we’re putting together the data, and next week, the real work begins.