Archives for June 2009

Variant Detection in Massively Parallel Sequencing

June 22, 2009 by Dan Koboldt

There is at last some evidence that I do real scientific work, and real scientific writing outside of Massgenomics. Online today at Bioinformatics is our publication of variant detection in massively parallel sequencing of individual and pooled samples. In it we present VarScan, the culmination of my work over the past two years to develop algorithms for calling SNPs and indels in deep resequencing data from the 454 and Illumina/Solexa platforms. It’s one of the many cancer genomics tools that we have developed and made available to the research community, and the first to be published.

VarScan variant detection in massively parallel sequencing

The Challenge

With next-generation sequencing, the problem of accurately calling variants faces two key challenges. First, to distinguish between true variants and false positives in shorter, more error-prone sequencing reads. Second, to handle the sheer volume of data, which might be millions of reads from a single run. The 454 and Illumina platforms present their own particularly difficulties as well. On the 454, it’s homopolymers – stretches of A’s, C’s, G’s, or T’s four or more bases long. The exact number of bases in a homopolymeric stretch becomes difficult to resolve by pyrosequencing, and as a result, 454 sequencing tends to overcall or undercall bases in these regions. This presents alignment difficulties and numerous artifact insertions/deletions relative to the reference sequence. On the Illumina platform, reads are shorter, and the base qualities drop sharply after 28 or so bases. Until recent improvements that have yielded 50, 75, and 100 bp read lengths, alignment of Illumina reads was particularly difficult given this base quality paradigm.

Targeted Resequencing, Pooled Samples

For whole genome sequencing of individual samples, we’ve used Maq with some success to map, assemble, and call variants from Illumina/Solexa reads. However, for various projects we’re interested in sequencing only selected regions (targeted by PCR or capture), often in many samples. Deep sequencing of pooled samples is one area, unfortunately, where Maq and RunMapper/Newbler continue to struggle. Here, the individual read counts supporting reference and variant alleles become important in distinguishing true variants and assessing their frequencies in a pool. Also, due to the high-priority of say, validating somatic variants in a tumor sample with 454 sequencing, we wanted something that could leverage faster aligners – parallelized BLAT for 454 and Bowtie for Illumina.

VarScan: SNP and Indel Calling in BLAT, Bowtie, Novoalign…

To address these challenges we developed VarScan, a program for calling SNPs and indels in next-gen sequencing data. VarScan accepts read alignments in several native formats – BLAT and Newbler for 454 data, Bowtie, cross_match, and Novoalign for Illumina/Solexa data. Given a file of read alignments, VarScan detects sequence variants, combines them by position and type, and then computes the read counts, average base quality, and number of strands supporting each allele. As a final step, one can use VarScan to limit variant calls to a set of targeted positions, which removes variants arising from spurious read alignments. Variant calls can also be filtered by base quality, read count, and variant frequency, which is useful for isolating variants at frequencies of say 1% or higher in a pool of samples.

Proof of Principle: Individual 454 and Pooled Illumina Sequencing

To demonstrate VarScan’s capabilities, we used data from a collaboration with Stephen Daiger and colleagues, who are resequencing candidate genes in a cohort of families with Retinitis Pigmentosa (RP). The Genome Center had designed 1,000 amplicons targeting the exons of dozens of genes, and obtained PCR products for these for each sample. Samples were sequenced individually on the 454 platform to ~70x coverage. Then, samples from 42 individuals were pooled and sequenced on the Illumina platform to extremely deep coverage (~6000x, or ~125x per sample). We used BLAT and Bowtie to align the 454 and Illumina reads, respectively, and applied VarScan to the resulting alignments.

Sensitivity, Specificity, SNPs, and Indels

I won’t ruin the whole story for you, but suffice it to say that most of the SNPs called by VarScan in the 454 sequencing were present in dbSNP or supported by VarScan’s Illumina/Solexa calls, or both. Incredibly, when we plotted the estimated allele frequencies from deep Illumina sequencing of the pool to the known frequencies from individual 454 results, the correlation was extremely high (>0.95, Pearson’s). This demonstrates VarScan’s capabilities not only to detect variants with good sensitivity/specificity, but to accurately estimate their frequency in a pool. To evaluate VarScan’s performance on insertion/deletion (indel) variants, we identified 77 high-confidence short (1-5 bp) indels in the 454 data and attempted to validate them with Illumina data. Since Maq, Bowtie, and most other Illumina aligners don’t allow gaps, we used another aligner, Novoalign, which claims to have higher read placement sensitivity and aligns reads with gaps in single-end libraries. We used VarScan to call indels in the Novoalign output, and we were able to validate around 60% of the high-confidence indels from 454 with Novoalign alignments of Illumina data. Taken together, these results support the utility of VarScan for variant detection in next-gen sequencing, and demonstrate the feasibility of large-scale mutational profiling by deep resequencing of pooled samples.

References

Koboldt, D., Chen, K., Wylie, T., Larson, D., McLellan, M., Mardis, E., Weinstock, G., Wilson, R., & Ding, L. (2009). VarScan: Variant detection in massively parallel sequencing of individual and pooled samples Bioinformatics DOI: 10.1093/bioinformatics/btp373

Inferring Function from Mutation in Cancer

June 16, 2009 by Dan Koboldt

As we continue to apply next-generation sequencing technologies to cancer genomes, we’re discovering hundreds of putative somatic mutations. Typically we run these through our annotation pipeline, which identifies variants affecting coding regions, splice sites, and evolutionarily conserved sequences. While a growing body of evidence suggests that much of the functional variation in humans lies outside these regions, inferring the functional impact of non-coding mutations remains challenging.

I just read excellent review by Lee, Yue, and Zhang in Human Genetics on Analytical methods for inferring functional effects of single base pair substitutions in human cancers. In it, the authors review bioinformatics approaches for evaluating single-base substitutions in both coding and noncoding regions, with an emphasis on large cancer resequencing efforts including the Tumor Sequencing Project, the Cancer Genome Atlas (TCGA), and the AML1 cancer genome. Several approaches to evaluating single-base mutations were presented:

Credit: Lee W, Yue P, Zhang Z. Hum Genet. 2009

Coding Mutations: The Frequency Approach

The vast majority of somatic mutations in cancer are believed to be silent “passenger” mutations, which accumulate in cells but do not contribute to tumor growth and development. To find the smaller but more important “driver” mutations, several groups have applied a frequency approach. The idea is that mutations driving cancer are positively selected for during tumor development, and so their frequencies should be higher than those of passenger mutations. By assessing the incidence of a particular mutation or the frequency at which certain genes/pathways are altered across multiple patients (of the same tumor type), it should be possible to identify key mutations. Essentially, this approach uses recurrence as a metric for mutation significance.

Coding Mutations: The Amino Acid Approach

The frequency-based approach requires a significant number of samples, and likely favors driver mutations that are common across a patient population. Another way to go is bioinformatic analysis of individual protein-altering (nonsynonymous) mutations to score their probable effect on the protein. The widely used SIFT algorithm, for example, assesses the impact of NSS mutations using evolutionary conservation. Presumably, deleterious mutations are acted against by natural selection, and thus their amino acid residues should be conserved across species. Another algorithm, PolyPhen, applies protein structural information in a rule-based system to evaluate whether a mutation affects key protein domains. In a similar fashion, other tools assess whether mutations fall within annotated Pfam domains or signaling peptides of the encoded protein.

Coding Mutations: Classification Systems

Distinguishing driver from passenger mutations seems like a problem well suited for machine learning approaches. Indeed, groups have developed random-forest algorithms and support vector machine (SVM) approaches for this task as well. For any given mutation, a number of features (protein domain, conservation score, secondary structure, etc.) can be collected. Then, a positive training set (likely driver mutations from COSMIC) and a negative training set (common, presumably-neutral mutations) are used to build a classification system. Dave Larson in my group has applied a similar approach to distinguish between germline and somatic mutations.

Noncoding Mutations: Regulatory SNPs

Identifying functional SNPs outside of coding regions remains a challenge. One obvious category of such variants is “regulatory SNPs” whose alleles modulate the expression of nearby genes. Recently, groups have leveraged high-throughput gene expression and genotyping technologies to perform genome-wide assocation studies of gene expression, in which gene expression levels are considered a quantitative trait and significant correlations to SNP variation are identified. These studies have identified hundreds of cis-regulatory and trans-regulatory SNPs with moderate to large effects on gene expression. In cancer, some regulatory variants have already been identified. A SNP in the MDM2 gene (SNP309), for example, causes over-expression of this p53-suppressor and increases risk of colorectal cancer. Clearly, noncoding mutations affecting gene expression are of high interest in cancer genetics.

Noncoding Mutations: Post-Transcriptional Modifiers

Mutations that affect post-transcription processes – like splicing, polyadenylation, or the binding of regulatory RNAs – are also likely suspects in cancer development. Several mutations in BRCA1, for example, have been shown to alter that gene’s splicing and increase cancer risk. Mutations in the 3′ UTR of HMGA2 contribute to carcinogenesis by removing the binding sites for the let-7 microRNA, which normally represses this oncogene. To identify mutations in categories like these, we rely largely on computational tools that identify and score “motifs” – DNA sequences bound by micro-RNAs, splice enhancers, and other post-transcriptional players. Unfortunately, successes in this arena are few and far between. Nevertheless, I anticipate that interest in post-transcriptional modifiers will grow substantially as we learn more about alternative splicing, micro-RNA repression, and other relatively new areas of genetics.

The theme of the review article seemed to be this: a plethora of tools exist for evaluating mutations in cancer, and we’d better start leveraging them now. Whole-genome sequencing of cancer genomes is going to provide a flood of candidate somatic mutations. If only there were a single tool that applied all of the approaches above to help us prioritize them.

References:
Lee, W., Yue, P., & Zhang, Z. (2009). Analytical methods for inferring functional effects of single base pair substitutions in human cancers Human Genetics DOI: 10.1007/s00439-009-0677-y

Ding, L., Getz, G., Wheeler, D., Mardis, E., et al. (2008). Somatic mutations affect key pathways in lung adenocarcinoma Nature, 455 (7216), 1069-1075 DOI: 10.1038/nature07423

Ley, T., Mardis, E., Ding, L., et al. (2008). DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome Nature, 456 (7218), 66-72 DOI: 10.1038/nature07485

The Cancer Genome Atlas Research Network. (2008). Comprehensive genomic characterization defines human glioblastoma genes and core pathways Nature, 455 (7216), 1061-1068 DOI: 10.1038/nature07385

Whole Genome Sequencing: How Many SNPs Remain?

June 5, 2009 by Dan Koboldt

This week’s publication of the genome of a Korean individual in Genome Research marks the fifth individual whole genome sequenced with massively parallel sequencing platforms. The fact that this was not a Nature paper speaks as loudly as anything. The window of time when single whole genome sequences merit high-profile publications is slowly closing.

It is not terribly surprising, at least to me. At some point we are bound to reach a saturation, where the whole genome of a disease-free individual no longer yields novel information. The most tangible deliverable of such studies is a set of DNA sequence variants; without a phenotype there’s little relevance to human health, and with a sample size of 1 not much can be said about population genetics. Yes, there is some value in knowing the location and alleles of additional sequence variants, but look at dbSNP’s growth since 2003:

dbsnp-growth-2003-2009

As of the last release there are over 17 million refSNPs (1 every 180 bp on average), where most experts agreed a few years back that there are probably ~10 million total in the human genome. The two rapid jumps in the plot above are the result of two massive NHGRI-driven efforts: The International HapMap Project (2003) and its sequel, the 1,000 Genomes Project (2008). Have we really not found all of the SNPs yet? Probably not. But as far as common SNPs – those prevalent enough to affect a sizeable fraction of the population – we must be getting close.

How Many SNPs Remain to Be Discovered?

The five genomes sequenced on massively parallel platforms to date, happily, have continental origins from Europe (2), Africa (1), and Asia (2). Yet despite this diversity of backgrounds, they all reported similar results for SNP discovery.

Note, for AML I’m only showing SNPs seen in both tumor *and* normal.

On average, WGS studies identified 3.3 million SNPs per genome, of which ~479,000 (15%) are novel to dbSNP. That means about 1 out of every 1,000 bases yields a SNP, but only 1 out of every 6,700 bases yields a novel SNP.

The Future of Whole Genome Sequencing

As high-throughput genome selection technologies (e.g. capture) continue to develop, one begins to question whether or not WGS will retain its cache in the research community. My guess is that for “reference” samples (those without phenotypic information), it will be difficult to justify high-profile papers after the 1,000 Genomes Project. However, for disease and phenotype-related studies, especially cancer, I think that this will remain a hot topic. Even capture technologies rely on the basic assumption that the locations of functional variants are known (i.e. exons). That’s not always true, even for Mendelian disorders. Complex diseases are likely mediated by numerous loci (both coding and noncoding) interacting with many environmental factors. To puzzle these out, we probably will need to have the full genomes of affected individuals in hand.

References

[1] Wheeler, D., et al. (2008). The complete genome of an individual by massively parallel DNA sequencing. Nature, 452 (7189), 872-876 DOI: 10.1038/nature06884

[2] Ley, T., Mardis, E., Ding, L., et al. (2008). DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature, 456 (7218), 66-72 DOI: 10.1038/nature07485

[3] Bentley, D., et al. (2008). Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456 (7218), 53-59 DOI: 10.1038/nature07517

[4] Wang, J., Wang, W., et al. (2008). The diploid genome sequence of an Asian individual. Nature, 456 (7218), 60-65 DOI: 10.1038/nature07484

[5] Ahn, S., et al. (2009). The first Korean genome sequence and analysis: Full genome sequencing for a socio-ethnic group. Genome Research DOI: 10.1101/gr.092197.109