Archives for October 2008

TSP and TCGA Cancer Projects in Nature

October 23, 2008 by Dan Koboldt

If you look very closely, possibly with the aid of a magnifying glass, you’ll see my name among the authorship lists for two articles in the October 23 issue of Nature. The first of these is the Cancer Genome Atlas (TCGA) study of glioblastoma that came online a couple of weeks ago. The second is a study of lung adenocarcinoma, through the Tumor Sequencing Project (TSP). Both articles are the fruits of large-scale candidate gene resequencing efforts by genome centers at Washington University, Baylor, and the Broad Institute (MIT/Harvard).

One of the most interesting, if unsurprising, findings of the lung adenocarcinoma study was that the number of mutations per tumor was substantially higher in smokers than in non-smokers. If you ever doubted the evidence that connects smoking to lung cancer, listen up. Tumors from smokers had as many as 49 mutations, whereas those from non-smokers never had more than 10. One basic way to interpret this: if you smoke, your tumor can carry roughly five times as many mutations as a non-smoker's. Five times as many chances to have a mutation that leads to cancer.

Highly-Mutated Genes in the mTOR Pathway

A high-level view of the molecular pathways we studied for TSP reveals that some genes are mutated far more often than others. In the mTOR pathway, for example, highly-mutated genes included both oncogenes (RAS, EGFR, MDM2) and tumor-suppressor genes (TP53, STK11, CDKN2A). More than a third of the patients studied had mutations in this pathway, suggesting that rapamycin (whose target is mTOR) offers a promising therapy.

Sequence-based LOH Analysis

One of the genes we studied (STK11) did not have SNP-chip data. To examine potential loss of heterozygosity (LOH) in STK11, we sequenced both tumors and normals at positions of known high-frequency variants. Then, we compared the tumor genotype to the normal genotype for each sample at each position. Several samples had extensive LOH in the tumor across STK11, suggesting a large-scale structural event (probably a deletion) across the gene region. There were even a couple of samples with both LOH and mutations in STK11; these patients are probably 0 for 2 on functioning proteins encoded by this tumor suppressor.
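
To make the tumor-versus-normal comparison concrete, here is a minimal Python sketch of the idea. It assumes simple two-letter diploid genotype calls keyed by position. The function names, positions, and genotypes are hypothetical and are not taken from the TSP analysis. A position is only informative when the normal is heterozygous, and it counts as LOH when the tumor has become homozygous there.

```python
from typing import Dict, Optional


def is_heterozygous(genotype: str) -> bool:
    """A diploid genotype such as 'AG' is heterozygous when its two alleles differ."""
    return len(genotype) == 2 and genotype[0] != genotype[1]


def call_loh(normal: str, tumor: str) -> Optional[bool]:
    """Return True for LOH, False for retained heterozygosity, None if uninformative."""
    if not is_heterozygous(normal):
        return None  # a homozygous normal cannot reveal LOH
    return not is_heterozygous(tumor)


def loh_fraction(normal_calls: Dict[int, str],
                 tumor_calls: Dict[int, str]) -> Optional[float]:
    """Fraction of informative positions that show LOH in one sample."""
    results = []
    for position, normal_genotype in normal_calls.items():
        tumor_genotype = tumor_calls.get(position)
        if tumor_genotype is None:
            continue  # position not genotyped in the tumor
        call = call_loh(normal_genotype, tumor_genotype)
        if call is not None:
            results.append(call)
    if not results:
        return None
    return sum(results) / len(results)


# Hypothetical genotypes at known variant positions for one tumor/normal pair.
normal = {1001: "AG", 1050: "CC", 1100: "CT", 1200: "GT"}
tumor = {1001: "AA", 1050: "CC", 1100: "TT", 1200: "GT"}

# Three positions are informative here; two of them show LOH.
print(loh_fraction(normal, tumor))
```

Real tumor samples often contain some normal tissue, so a production version would probably need allele frequencies rather than hard genotype calls. That refinement is left out of this sketch.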

The Future of Cancer Genomics

It seems to me that TSP, TCGA, and other large-scale initiatives are yielding a wealth of information about cancer genomics. And these were all 3730-based sequencing efforts. I can hardly imagine the things we’ll learn from studies that apply next-generation sequencing platforms. You can bet they’re on the way.

Dave and Decision Trees for NGS

October 15, 2008 by Dan Koboldt

My colleague David Larson just returned from CSHL’s Personal Genomes meeting, where he presented a poster on decision-tree filtering of variant predictions from Illumina/Solexa data. I don’t know much about machine learning, but I can see that it offers a useful approach in at least one aspect of next-generation sequencing.

From my basic understanding, a decision tree is a machine learning algorithm that you “train” on a dataset where the correct decisions are known, and then apply to another dataset in which decisions are not known.

A sample decision tree that uses weather attributes to determine if a game will be played or not. Image Credit: Wikipedia

For example, Dave’s poster described a decision tree that determines whether SNP predictions from Solexa are real (“Germline”) or false positives (“WildType”). As a training set, Dave used ~650 SNPs whose true status had previously been determined by 3730 sequencing. For each SNP, he provided several attributes (base quality, read count, etc.) as well as the correct “answer” (Germline or WildType) as determined by 3730. These inputs went into the C4.5 program, which generated a decision tree to distinguish Germline from WildType based on these characteristics.
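
To show what “training” means here, the sketch below picks a single split on one attribute by maximizing information gain, which is the basic entropy calculation behind trees of this family. It is not C4.5 itself (C4.5 uses gain ratio, handles many attributes, and prunes the tree), and it is not Dave's pipeline. The read counts, labels, and function names are invented for illustration.

```python
import math
from typing import List, Optional, Tuple


def entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for label in set(labels):
        p = labels.count(label) / total
        result -= p * math.log2(p)
    return result


def best_threshold(values: List[float],
                   labels: List[str]) -> Tuple[Optional[float], float]:
    """Return (threshold, information gain) for the best split 'value <= threshold'."""
    base = entropy(labels)
    best: Tuple[Optional[float], float] = (None, 0.0)
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between adjacent distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical training set: supporting read count per SNP, with the
# "answer" that validation sequencing would have provided.
read_counts = [2, 3, 4, 8, 10, 12]
labels = ["WildType", "WildType", "WildType", "Germline", "Germline", "Germline"]

threshold, gain = best_threshold(read_counts, labels)
print(threshold, gain)
```

A full tree repeats this search over every attribute, splits on the best one, and then recurses into each branch until the labels are pure or the data run out.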

Dave applied the decision tree to whole-genome Solexa data for an individual that we recently sequenced to over 10x coverage with Solexa fragment reads. Maq had predicted ~5 million SNPs; the decision tree filter cut this number in half. Even more promising, Dave’s decision tree filter isolated a substantially better data set. Over 90% of the SNPs detected by array-based genotyping were among the Germline-classified SNPs. Concordance with dbSNP, which is one of our measures of specificity, was over 80% the last I heard.
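
Both of those percentages are set overlaps measured from different directions, which a few lines of Python can illustrate. The sites below are made up and `fraction_in` is a hypothetical helper, so the resulting numbers have nothing to do with the real data set.

```python
from typing import Set, Tuple

Site = Tuple[str, int]  # (chromosome, position)


def fraction_in(sites: Set[Site], reference: Set[Site]) -> float:
    """Fraction of `sites` that also appear in `reference`."""
    if not sites:
        return 0.0
    return len(sites & reference) / len(sites)


# Hypothetical site lists.
germline_calls = {("chr1", 100), ("chr1", 250), ("chr2", 75), ("chr3", 10)}
array_snps = {("chr1", 100), ("chr2", 75)}
dbsnp_sites = {("chr1", 100), ("chr1", 250), ("chr2", 75)}

# Sensitivity proxy: how many array-confirmed SNPs survived the filter?
sensitivity = fraction_in(array_snps, germline_calls)

# Specificity proxy: how many filtered calls are already known to dbSNP?
dbsnp_concordance = fraction_in(germline_calls, dbsnp_sites)

print(sensitivity, dbsnp_concordance)
```

Note that dbSNP concordance is only a proxy: a genuinely novel variant lowers it even though the call is correct.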

It occurred to me that the decision tree approach has numerous applications for next-generation sequencing analysis. It could be used to distinguish true variants from false positives, or somatic mutations from germline variants. Decision trees might also be informative for short read alignments, where a number of attributes (read length, alignment score, alignment quality, mismatches, etc.) could be used to determine whether or not a read was correctly placed.

After talking with Dave, I spent half a day building decision trees that might be useful for 454 variant detection. One thing I realized very quickly is the importance of the training data set. First, I tried a training set of ~75 variants sequenced by 3730. This was way too small, yielding a tree with one decision (allele frequency) to classify the data. Then, I tried a training set of ~400,000 454 read alignments with several attributes. This was far too much, yielding a massive tree with hundreds of branches. Also, I worry about the correctness of the “answers” in my data sets. While 3730 sequencing is a gold standard, it also has a tendency to miss certain kinds of variants, which might be detected in 454. Those would be real variants labeled as WildType in the training data set. I think I’ll have to find a larger, more reliable training set before decision trees bear fruit for 454 variant detection.

Help Wanted at the Genome Center

October 2, 2008 by Dan Koboldt

The WashU Genome Center is hiring! Well, they’re almost always hiring, but one of the current open positions is in my group. So I thought I’d put the word out here on Massgenomics.

The basic requirements of the staff scientist position are outlined on the GC web site. We’re looking for someone with a degree (preferably graduate degree) and 4+ years of experience in computer science, biology, or a similar field. This person must have solid programming abilities, ideally in Perl. Most of these guidelines apply to just about any non-laboratory position at the GC, so they’re not terribly informative. Since we’re hiring someone in my group, however, I can probably offer some advice about what we’re looking for.

Our work centers on analysis. We develop, test, and apply algorithms for sequence analysis, mutation detection, and similar tasks. We work on several projects concurrently. As one of the big three genome centers in the U.S., we play a significant role in major initiatives like the Tumor Sequencing Project (TSP), the Cancer Genome Atlas (TCGA), and the 1000 Genomes Project. Our analysis pipeline for traditional capillary-based resequencing is largely in place, so the focus is on next-gen technologies (Roche/454, Illumina/Solexa, ABI/SOLiD).

For this position, programming abilities are not the only requirement. Simply put, we’re looking for a scientist. This means that the strong candidate will have all three of the following:

Technical skills. Experience with multiple programming languages, including Perl. The experts here will test you and probably ask for some code samples. Familiarity with common bioinformatics tools like BLAST, BLAT, BioPerl, etc.
Scientific rigor. Your CV should list publications in peer-reviewed journals, scientific meetings attended (with talks/posters given), etc. Be prepared to talk about them, and if things go well, to give a brief talk.
An interest in biology. This will come across both in the CV and in the interviews. We’re looking for someone who’s self-motivated and passionate about biological questions.

There are some important realities about working in academia. First, it usually won’t make you a millionaire. Genome Technology’s annual salary survey will tell you that salaries are almost always higher in the private sector. However, pay at the GC is very competitive for an academic setting, and the benefits are excellent. Second, we’re proponents of open source and open access. I hope you know your way around Linux, because we only have one Windows workstation and it’s not allowed to access the internet.

Why Bother?

It’s just my opinion, but I find this a pretty exciting and rewarding place to work. The GC has early access to lots of new cutting-edge technologies. WashU consistently ranks in the top 5 for medical schools and the top 10 for biomedical research. I like to think that we tackle some of the biggest problems in biology and human genetics. If you’re interested, the instructions for applying are on the employment opportunities page.