Archives for June 2008

NextGen Aligner Focus Group

June 23, 2008 by Dan Koboldt

As our genome center makes the transition from capillary-based to massively parallel sequencing platforms, the development of automated pipelines for data processing has become a high priority. Last week we had a visit from Illumina’s informatics group to discuss several issues related to the GA (Solexa) platform, including image compression, data storage, and workflow informatics. There was also talk of a downstream analysis tool, called Bullfrog, that will perform SNP/indel/SV detection (though I got the impression that the software’s nowhere near release at present).

But Illumina is not the only platform, and Eland is certainly not the only aligner. Thus we’ve formed a focus group to evaluate the different programs for sequence alignment and variant detection in next-generation sequence data. We met last week and put together a list of aligners that work with Illumina (Solexa) and/or Roche (454) data. We also compiled, separately, a shorter list of external and internal programs that do SNP and indel detection on either platform. Some programs, like Maq, were in both lists because they do alignments and SNP detection. Some tools are feasible for short (Solexa-length) reads but not long (454-length) reads, and vice versa. In the end we had a list of 15 different aligners for Illumina/Roche data. Some are good, some are bad, and some we simply don’t know.

We agreed that the plan was to evaluate each aligner on the same data set, but decisions on which data set to use, and how to compare the different aligners, were matters of more intense debate. Should we work with human data, or focus on less complex genomes like C. elegans or E. coli? Performance metrics like CPU time, memory usage, disk space, and cost (some are non-free) are obvious points for comparison, but what about alignment accuracy? We need some way to determine if a read placement on the genome is correct or erroneous. How do we know? The question of alignment “truth” and how to determine it was not an easy one to answer.

After an hour of discussion, we tentatively agreed on a dataset – Illumina PE runs on the first human samples that we’ve already sequenced in-house for the 1000 Genomes Project. These runs come from one of the HapMap Project trios, which means that we can validate our SNP detection results against the known HapMap genotypes that were generated on a variety of platforms (and predominantly by other centers). Also, the 1000 Genomes Project DCC will be performing its own evaluation of alignment tools and sequence analysis using the same data, so we can compare notes.
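To make the validation idea concrete, here is a minimal Python sketch of a genotype concordance check. It is an illustration only, not our actual pipeline: the function names, the dictionary layout (site mapped to a two-letter genotype), and the toy genotypes are all hypothetical.

```python
def normalize(genotype):
    """Put a two-letter genotype such as 'GA' into canonical order ('AG')."""
    return "".join(sorted(genotype.upper()))


def genotype_concordance(called, known):
    """Compare called genotypes with known genotypes at shared sites.

    Both arguments map (chromosome, position) -> two-letter genotype.
    Returns (sites_compared, sites_matching, concordance_rate).
    """
    # Only sites present in both sets can be compared.
    shared = set(called) & set(known)
    matching = sum(
        1 for site in shared
        if normalize(called[site]) == normalize(known[site])
    )
    rate = matching / len(shared) if shared else float("nan")
    return len(shared), matching, rate


# Toy data (hypothetical): genotypes treated as truth vs. calls from sequencing.
known = {("chr1", 1000): "AG", ("chr1", 2000): "CC", ("chr2", 500): "TT"}
called = {("chr1", 1000): "GA", ("chr1", 2000): "CT", ("chr3", 42): "AA"}

print(genotype_concordance(called, known))  # (2, 1, 0.5)
```

Note that sites called in only one of the two sets are ignored here; a real comparison would also want to count known variant sites that the caller missed.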

We put together a short list, by platform, of the aligners to evaluate first. Some decisions here were easy – we’re obviously going to look at Maq and Eland for Solexa data, and we’re already evaluating BLAT and cross_match on some of our 454 data. Other decisions were more difficult – should we evaluate RMAP, whose authors [allegedly] don’t plan to continue development? What about SX OligoSearch, which we can currently only run on Itanium servers? We eventually had five or six aligners per platform that made the short list. This week, we’re putting together the data, and next week, the real work begins.

Harold Varmus’ New Era of Cancer Research

June 13, 2008 by Dan Koboldt

Yesterday Harold Varmus, Nobel laureate and president of the Memorial Sloan-Kettering Cancer Center, visited WashU and gave a talk on the “New Age of Cancer Research.” The auditorium was packed with just about every key researcher I know on campus. Tim Ley, longtime collaborator and co-director of the Genome Center, made the introduction. Together with J. Michael Bishop, Harold Varmus demonstrated the cellular origins of oncogenes – genes that control cell growth and proliferation that, when mutated, often lead to cancer – for which they won the Nobel Prize in 1989.

Dr. Varmus began with an overview of the three classes of genes involved in oncogenesis:

Proto-oncogenes
Tumor suppressor genes
Genes governing DNA integrity

The protein products of these genes have diverse biochemical and physiological functions; many are enzymes (e.g. tyrosine kinases). The discovery of oncogenes led to some new paradigms of cancer research, like the design of antibodies against oncogenic proteins (e.g. Herceptin), risk assessment by inherited mutation analysis, and occasional gene expression/mutational profiling for diagnosis/prognosis/Rx.

Improved Mouse Models of Human Cancers

Recent improvements in mouse models of human disease, specifically models in which oncogenes can be switched on or off, were the central experimental focus of the talk. Basically, his group creates transgenic mice that express, or do not express, certain genes based on whether they are fed, or not fed, doxycycline. This offers a powerful model to study short- and long-term effects of both oncogenes and cancer drugs.

Tumor Maintenance Genes and “Oncogene Addiction”

One important point Dr. Varmus made is that oncogenes not only initiate the oncogenic state, they maintain it as well. Without continued oncogene expression, cancerous cells die. Mouse models of this oncogene dependence phenomenon have led to the implication of several tumor maintenance genes, including:

C-MYC in T-cell and myeloid leukemias
H-RAS in melanomas
BCR-ABL in B cell tumors
MET in hepatomas
C-MYC, NEU, and WNT-1 in mammary tumors
K-RAS in lung adenocarcinomas

It turns out that many cancer drugs work by targeting these genes. The poster child of such designer drugs is Gleevec, which treats chronic myeloid leukemia (CML). CML is one of the more common adult leukemias, and almost always arises from the “Philadelphia Chromosome”, a somatic translocation between chromosomes 9 and 22 that creates a fusion gene, BCR-ABL, whose protein product is a constitutively active tyrosine kinase. Gleevec is remarkably effective against CML; some patients are disease-free for up to 7 years.

Tyrosine Kinase Inhibitors and Lung Adenocarcinoma

Other tyrosine kinase inhibitors have proven to be potent anti-cancer agents. Dr. Varmus told the well-known story of Iressa and Tarceva (gefitinib and erlotinib), which target mutant epidermal growth factor receptor (EGFR) proteins in lung adenocarcinoma. The before-after slides of the lungs of a patient treated with these drugs are quite dramatic – about 4 days into treatment, the tumors are just gone. Before treatment, she was in a wheelchair and on oxygen because of the tumor load. Two weeks later she walked into the doctor’s office on her own feet, no oxygen.

Drug Resistance

Unfortunately, there’s a sad part to the story, as is often the case with cancer. Gleevec might buy you a few years. Gefitinib and erlotinib are effective for about one year. After that, the tumors become drug resistant, almost always because of secondary mutations in the tyrosine kinase domain. Sometimes, other drugs can treat the resistant tumors, but not always.

Katerina Politi, a talented postdoc in the Varmus lab, developed a mouse model of drug resistance to gefitinib/erlotinib by constitutively expressing mutant EGFR but intermittently treating mice with the drugs (4-week intervals). In the drugs-on phase, most tumor cells are eliminated, but the few that survive grow in the drugs-off phase. It’s a rapid model of selection for drug-resistant tumors. This mouse model led to several revelations about the secondary mutations underlying drug resistance [see Politi et al. 2006]. Almost all are in the tyrosine kinase domain of the targeted protein, but other mechanisms (such as MET amplification) can lead to drug resistance as well. One particular mutation, T790M, is really bad news – tumors bearing it are refractory to virtually all drug alternatives.

The Future of Oncogenic Research

Dr. Varmus left us with a few points about where cancer research should go from here.

Genomics (and Epigenomics) – moving beyond the candidate gene approach to get the full repertoire of somatic changes in cancer. Obviously, the WashU GC is working on this.
Progression and metastasis – there’s more to learn about how tumor cells interact with their micro-environment.
New targets, new drugs, and better understanding of resistance – always more to learn.
Relate cancer to development – studying the vulnerability of certain cells to certain cancer types, and working with “cancer stem cells.”
Extend the mouse models – this guy loves a good mouse model.
Form multidisciplinary teams – bringing together people with different expertise who can all tackle the cancer problem, like the TCGA project. Also, train scientists who work both in the lab and in the clinic to gain a more complete understanding of the disease.

As Tim Ley said, it’s good to see a Nobel laureate not “resting on his laurels.”

Fitness Effects of Amino Acid Mutations in Humans

June 4, 2008 by Dan Koboldt

The current issue of PLoS Genetics has an interesting article on the distribution of fitness effects (DFE) among new amino acid changing (nonsynonymous) mutations.

Adam R. Boyko et al. Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet 4(5): e1000083. May 2008.

Call me old-fashioned, but I’m still impressed by strong datasets. The authors of this study resequenced the exons of 11,404 protein-coding genes in 35 individuals (20 European American, 15 African American), which provided uniform ascertainment and frequency estimates for 47,576 coding SNPs. The paper itself is very statistical in nature, with various “selection models” applied to determine the demographic and selective effects on amino acid variation in the human genome. Let me admit that I understand only the fundamentals of such things. While the authors only look at nonsynonymous and synonymous variants, they’ve done a lot of work to comprehensively investigate evolutionary models with their data. Let me hit you with the highlights:

They investigated the unfolded nonsynonymous site frequency spectra using 13 different selection models, including some complex two-parameter and three-parameter models.
The authors inferred a similar mean selection coefficient (-0.030) for newly arising mutations in European Americans as in African Americans, despite complications of demographic history (admixture) in both groups.
Various manipulations of the data showed that two major potential confounding factors, SNP ascertainment bias and weak selection at presumably-neutral sites, had little influence on the inferences from their data set.
The authors estimate that 10-20% of amino acid divergence between chimps and humans is due to positive selection. This figure holds in both African and European derived samples.
According to best-fit models, 27-29% of nonsynonymous changes are neutral, 30-42% are modestly deleterious, and the remainder highly deleterious. Due to the strength of purifying selection, however, deleterious mutations make up <1% of common segregating SNPs (MAF >= 0.05) in human populations.

It follows from the last point above that the vast majority of common human genetic variation, i.e. SNPs with derived allele frequencies of at least 5%, is neutral or nearly neutral with respect to fitness. If this is true, then there are important implications for genetic association studies, which often rely on surveys of common genetic variation in the human genome. Such studies may miss the rare, highly deleterious mutations that are both evolutionarily and medically relevant.
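For readers who, like me, think better in code than in population genetics, here is a minimal Python sketch of what an unfolded site frequency spectrum is and how a 5% frequency cutoff divides SNPs into common and rare. This is not the authors’ method (their selection models are far more involved); the function names and the toy allele counts are hypothetical, and the 40 chromosomes simply correspond to a sample of 20 diploid individuals.

```python
from collections import Counter


def unfolded_sfs(derived_counts, n_chromosomes):
    """Tabulate the unfolded site frequency spectrum.

    derived_counts holds, for each SNP, the number of sampled chromosomes
    carrying the derived allele. Entry i-1 of the result is the number of
    SNPs with exactly i derived copies, for i = 1 .. n_chromosomes - 1.
    """
    tally = Counter(derived_counts)
    return [tally.get(i, 0) for i in range(1, n_chromosomes)]


def fraction_common(derived_counts, n_chromosomes, min_freq=0.05):
    """Fraction of segregating SNPs with derived allele frequency >= min_freq."""
    # Sites with 0 or n derived copies are not polymorphic in the sample.
    segregating = [c for c in derived_counts if 0 < c < n_chromosomes]
    if not segregating:
        return float("nan")
    common = sum(1 for c in segregating if c / n_chromosomes >= min_freq)
    return common / len(segregating)


# Toy data (hypothetical): derived allele counts for five SNPs
# in a sample of 40 chromosomes.
counts = [1, 1, 1, 4, 20]
sfs = unfolded_sfs(counts, 40)

print(sfs[0], sfs[3], sfs[19])      # 3 1 1
print(fraction_common(counts, 40))  # 0.4
```

In this toy sample the three singletons fall below the 5% cutoff, so a survey restricted to common variation would see only two of the five SNPs.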

The authors conclude that “re-sequencing in large samples of phenotypically extreme individuals, on the other hand, is much more likely to discover rare, large-effect mutations that are predicted… to be deleterious.” As a HapMap consortium member I’m not sure that I agree outright, but as an employee of the WashU Genome Sequencing Center, I have to say, resequencing is not a bad way to go.