Mappability of the human genome

December 9, 2010 by Dan Koboldt

Accurately mapping reads to a reference sequence is one of the most critical tasks for next-generation sequencing, yet it remains one of the most challenging areas of bioinformatics. Certainly, we’ve come a long way since ELAND and Maq. Dozens of new mapping and alignment algorithms have been published. Speed and sensitivity continue to improve; with BWA, we can map a lane of Illumina data (2×100 bp) to the human genome in about four hours. Nevertheless, it seems like most of the problems that arise in downstream analysis can be traced back to the read alignments.

Mapping Error: Consequences and Causes

Take variant calling, for example. The two central issues we try to address are false positives (variants that aren’t real) and false-negatives (real variants that are missed). Countless hours of manual review and extensive validation have shown that false positives rarely arise from simple sequencing errors. Such errors occur randomly, and with sufficient read depth (8x-10x), they can be readily ignored as background noise. Instead, most false positives are the result of an alignment errors – reads mapped to the wrong place, local mis-alignment around indels, etc. False-negatives are also a concern, particularly for indels. Why? Because the larger the gap between read and reference sequence, the more difficult it is to place correctly.

In my humble opinion, three factors contribute to the problem of accurate read mapping: relatively short read lengths (36-100 bp), unprecedented throughput of NGS instruments, and size/complexity of the genome. As a bioinformatician, I can’t do much about the first two. The size and complexity of the genome, however, remains somewhat fixed. As such, many groups have sought to define those portions of the genome that are “mappable”, so that they can be prioritized for analysis.

Quantifying Short Read Mappability

Recently, I noticed that several such datasets have been incorporated as tracks in the UCSC Genome Browser. For build 18 (NCBI 36), there are about a dozen of them, all under the track name “Mapability”. Here are some of them across the BRCA1 gene region on chromosome 17:

ucsc-mappability

The top three tracks (in green) represent mappability with 36, 75, and 100bp reads as calculated by CRG-GEM, a suite of genome analysis tools by a group in Spain. Then there’s Duke uniqueness (35 bp) in red, and UMass uniqueness (15 bp) in blue. Looking at the tracks together, a few patterns become apparent: first, that mappability calculations remain fairly consistent across datasets generated by different groups. Most of the drops in mappability correspond nicely to RepeatMasker regions. Also, if the CRG results are accurate, the mappability increases substantially with read length (although the near-complete mappability for 100-mers seems a bit optimistic). This dovetails nicely with the theoretical expectation that longer reads can resolve more of the genome.

Utilizing Genome-Wide Mappability Scores

How might this information assist bioinformaticians with the analysis of NGS data? I have a few ideas. It could be used to normalize sequencing coverage by local mappability, for read-depth-sensitive measurements like genome-wide copy number. It can be plotted alongside whole-genome sequencing or capture coverage read depth, to help explain holes in coverage.

It could even be used pre-emptively, for targeted sequencing projects, to assess the regions that will or will not have reads mapped. This sort of analysis is useful to set realistic expectations for collaborators, who might expect that, since they’ve heard so many nice things about next-gen sequencing, it will yield beautiful, even coverage across every single base they want. For time-sensitive projects, you could even adapt a sequencing strategy based on this information. On the Illumina platform, for example, a 36-bp fragment-end run finishes days before a paired-end run of longer reads. If one’s target regions were very mappable, and time was of the essence, a faster and less expensive sequencing run could be ordered.

Genetics and Epigenetics of Leukemia

November 10, 2010 by Dan Koboldt

A study online at the New England Journal of Medicine reports that DNMT3A mutations in acute myeloid leukemia are common and associated with poor outcome for intermediate-risk patients. Previously, our group had characterized the genomes of two patients with cytogenetically normal AML (AML1 and AML2). The first genome (AML1) was initially sequenced with Illumina short reads (1×36 bp), revealing eight novel acquired (somatic) mutations but none that were recurrent. The second genome, AML2, harbored a recurrent mutation in the isocitrate dehydrogenase 1 gene (IDH1), which had recently been implicated in glioblastoma. Subsequent work has demonstrated that mutations in IDH1 and related gene IDH2 highly recurrent in AMLs with intermediate risk karyotypes (20-30% frequency).

Resequencing the Relapse with Current NGS Technology

For this study, we resequenced relapse tumor from patient AML1 using current Illumina technology (2×100 bp paired-end reads), achieving higher diploid coverage and enabling the identification of several novel nonsynonymous mutations. One of these was a 1-bp deletion in the DNA methyltransferase gene DNMT3A predicted to cause a frameshift resulting in a truncated protein.

: DNMT3A/DNMT3A Complex with DNA

Resequencing showed that it was present in the original tumor sample, and probably missed due to alignment difficulties for short reads. Screening the exons of DNMT3A in 281 additional AML tumors revealed that 61 (22.1%) also had DNMT3A mutations with translational consequences. The most common of these was a missense mutation at residue 882, found in 37 tumors.

Mutation, Methylation, and Disease

When we realized how common this mutation was, and considered that the gene involved is a DNA methyltransferase, I have to admit that a tantalizing picture emerged. In my mind, at least. Mutation could lead to aberrant methylation of the tumor genome. Demethylation unmasks oncogene expression. Hyper-methylation leads to genome-wide instability, causing more mutations that activate oncogenes and disable tumor suppressors. DNA methylation has long been suspected in cancer, but the relationship between mutation, methylation, and disease progression has not been definitively established. At last, it seemed like we would bridge that gap.

We performed a number of experiments to determine if DNMT3A mutation status affected mutation rate, genome-wide methylation, or gene expression. First, we examined the 38 AML tumors that had undergone whole-genome sequencing (WGS) to ~25x coverage. Eleven of these carried DNMT3A mutations. There was no apparent correlation, however, between DNMT3A mutation status and the number of high-confidence mutations called genome-wide. Next, we assessed gene expression in 188 AML tumors and matched (normal) controls on microarrays. DNMT3A was expressed in all 188 tumors and matched normals, regardless of mutation status. Unsupervised clustering of gene expression patterns did reveal distinct clusters, but none correlated with DNMT3A status. We further performed targeted cDNA resequencing in tumors with mutations, and confirmed expression of most mutant alleles at the expected 50% frequency (though some were not seen in any cDNA, probably due to nonsense-mediated decay).

So no effect on mutation, and no changes in gene expression. Hold your breath, and let’s look at methylation. MeDIP assays revealed 182 regions that were differentially methylated between DNMT3A-mutated and non-mutated tumors. All were hypomethylated in the mutated samples. But there was no consistent effect on the expression of nearby genes. And, sadly, there was no global effect of DNMT3A mutaiton on DNA methylation. We were 0 for 3. Last but not least, we turned to the clinical data.

Clinical Correlation: DNMT3A and Prognosis

When we stratified AML patients by risk (based on cytogenetics) and DNMT3A mutation status, some interesting patterns emerged. First, DNMT3A mutations were completely absent from the favorable-prognosis group. Mutations were enriched, however, among patients classified as “intermediate risk” – normal or unclear karyotypes. And the outcome for DNMT3A-mutated patients was significantly poorer. The adverse-outcome association was independent of age, although older patients with DNMT3A had the worst outcomes of any group. And the association held true regardless of the presence of other commonly-mutated AML genes (NPM1, FLT3, IDH1/IDH2). Thus, DNMT3A mutation clearly contributes to AML pathogenesis, even if the mechanism by which it does so remains elusive. The fact that DNMT3A mutations are selected against in favorable-outcome patients suggests a true biological association.

A lot of work remains to be done. We still need to uncover the mechanistic effect of DNMT3A mutations that underlies the pathogenesis. But this work has furthered our understanding of AML, by identifying a highly recurrently mutated gene and providing a marker to help stratify patients of intermediate risk. As highlighted in a perspective by Shannon and Armstrong, clinical trials of DNA methlytransferase inhibitors in AML are already under way. It may not be long before genomic discoveries are translated into actionable information for the treatment of cancer patients.

Related Articles

Researchers discovery key mutation in acute myeloid leukemia (NIH News)

Mutations in single gene predict poor outcomes in adult leukemia (WashU Record)

References
Ley, T., Ding, L., Walter, M., McLellan, M., Lamprecht, T., Larson, D., Kandoth, C., Payton, J., Baty, J., Welch, J., Harris, C., Lichti, C., Townsend, R., Fulton, R., Dooling, D., Koboldt, D., Schmidt, H., Zhang, Q., Osborne, J., Lin, L., O’Laughlin, M., McMichael, J., Delehaunty, K., McGrath, S., Fulton, L., Magrini, V., Vickery, T., Hundal, J., Cook, L., Conyers, J., Swift, G., Reed, J., Alldredge, P., Wylie, T., Walker, J., Kalicki, J., Watson, M., Heath, S., Shannon, W., Varghese, N., Nagarajan, R., Westervelt, P., Tomasson, M., Link, D., Graubert, T., DiPersio, J., Mardis, E., & Wilson, R. (2010). DNMT3A Mutations in Acute Myeloid Leukemia
New England Journal of Medicine DOI: 10.1056/NEJMoa1005143

Shannon, K., & Armstrong, S. (2010). Genetics, Epigenetics, and Leukemia New England Journal of Medicine DOI: 10.1056/NEJMe1012071

Great Mutation. Is It Functional?

October 22, 2010 by Dan Koboldt

As promised, NGS instruments are yielding thousands of new genome sequences. Read lengths and throughputs are increasing. Alignment and analysis algorithms are getting more mature. Databases of sequence variants are growing exponentially. Things are looking pretty good, right? Sure, there are lots of variants still waiting to be discovered. Sure, some of those already reported simply aren’t real. But I think we’re rapidly approaching a point where finding the variants won’t be much of a problem.

Instead, we are facing two significant challenges. First, identifying the subset of variants have functional significance – separating the wheat from the chaff, if you will. Second, understanding how these functional variants contribute to a phenotype. This is soon to be the frontier in genetics and genomics. It merits, I think, a discussion of some of the strategies that have been used to go beyond variant detection, to isolate disease-causing variants and assess their functional impact.

Strategy 1: Process of Elimination

This approach (to my knowledge) is best demonstrated in whole-genome, exome, or pooled sequencing of samples from individuals with rare inherited diseases. It’s essentially a filtering strategy where you start with a list of candidate variants and whittle it down using several criteria:

Pedigree information, especially variants that do not segregate with the disease in Mendelian disorders.
Control variants, usually identified in HapMap samples or other individuals not affected by the disease.
Gene structure information, which serves to eliminate synonymous or non-coding variants.
Evolutionary conservation, to prioritize variants in sequences that are conserved across species.

This strategy has worked well for a handful of rare, inherited diseases like Miller syndrome and severe hypercholesterolemia. There are, however, so many things that can go wrong. The pedigree or assumed mode of inheritance could be wrong. The causal variant might be synonymous or even noncoding (e.g. in a transcription factor binding site). The conservation trick in particular worries me. True, many of the known disease-causing mutations map to conserved amino acid residues, but certainly not all of them.

Strategy 2: Recurrence

This is a developing strategy to identify key mutations and pathway alterations in cancer genomes. Because tumors are genetically unique, and often possess thousands of acquired (somatic) mutations, pedigree analysis and control samples are less informative. Instead, we reason that passenger mutations should occur randomly, mutations key to tumor development and progression are likely to be recurrent (i.e. found in other tumors of the same type). By this reasoning, the more important a mutation, the higher its rate of recurrence. TP53 mutations are a good example of this; in ovarian cancer, more than 80% of tumors carry a TP53 mutation. This is why databases like Sanger’s Catalogue of Somatic Mutations in Cancer (COSMIC) are such powerful tools. As these catalogues grow, having an available panel of additional tumors to screen for novel mutations may become less critical.

Strategy 3: Computational Evaluation

A growing suite of tools and annotation databases enable computational assessments of putative variants to predict their effect in vivo. SIFT and Polyphen are well-known examples of these. The UCSC Genome Browser Database contains dozens of genome-wide annotation datasets (both computational and experimental); many of these are presumed-regulatory regions that form the basis for our “Tier 2” classification (non-coding conserved/regulatory variants). There are also motif-scoring algorithms that evaluate a mutation’s effect on the binding affinity of trancription or splicing factors. These types of inferences are both interesting and helpful, when assessing a mutation’s functional effect. They’re not convincing, however, without supporting experimental evidence.

Strategy 4: Molecular Validation

This may be the most difficult strategy, but potentially the most informative one. A myriad of experimental techniques can be applied to assess a mutation’s functional effect in vivo or in vitro. For coding mutations, the first thing we typically assess is mRNA expression (by RT-PCR or RNA-Seq), to determine (1) if the affected gene is expressed in the tissue of interest (e.g. the retina for studies of retinitis pigmentosa) and (2) whether the mutant allele affects it. Many known disease-causing mutations ablate expression of the mature mRNA, because they introduce splicing defects, mRNA instability, or other effects. A number of other molecular biology tools can also be applied:

Western blot, to determine protein expression
Enzyme activity assays, such as the complex I rescue technique that has been applied to characterize mutations in patients with complex I deficiency (see my last post).
Recombinant DNA techniques, such as a luciferase assay to assess mutations in gene promoters
Colony growth assays, especially for somatic mutations, to determine if mutations confer a growth advantage or invasion potential.

Specialized Sequencing Techniques

A number of recently-developed applications of massively parallel sequencing can be used to assess the functional impact of candidate mutations. RNA-Seq can detect allele-specific expression and alternative splicing. ChIP-Seq can assess protein-DNA interactions and theoretically detect allele-specific DNA binding. Methyl-Seq can be used to profile DNA methylation, either at specific loci or (for methylation pathway mutations) genome-wide. MiRNA-Seq and HITS-CLIP, techniques that measure microRNA expression or isolate miRNA-transcript interactions, also have potential for characterizing mutation effects. Many of these high-throughput techniques stand poised to supplant their traditional experimental counterparts.

Given the wide array of experimental tools, it’s disappointing when reports of new (possible) disease-causing mutations lack sufficient functional validation. I find myself unconvinced when the answer is supported by “it segregates with the disease” or worse, “we filtered everything else.” So when I read new papers that claim to have identified disease-causing variants, my answer is this: Great mutation. Is it functional?

A Foundation for Next-Generation Analysis Tools

August 11, 2010 by Dan Koboldt

The emergence of next-generation sequencing has presented numerous significant challenges to the bioinformatics community. NGS instruments have given rise to a new generation of software tools for the alignment, assembly, management, and visualization of incredible amounts of data. New algorithms have also been developed to assess coverage, assess genomic copy number, call variants (SNPs/indels), and infer large-scale structural variation.

Regardless of their purpose, most tools for NGS data analysis are under increased demand for the same things:

Efficiency – in the face of ever-growing throughputs from NGS instruments
Flexibility – to accommodate new sequencing platforms, experimental protocols, and input formats
Scalability – to continually improve upon and enhance their features as needs evolve

The definition and widespread acceptance of the Sequence Alignment Map (SAM) as the standard format for representing NGS data was a key development for the field. Aaron McKenna and colleagues at the Broad Institute have just published another advance – the Genome Analysis Toolkit (GATK), a structured programming framework for NGS data anlysis. Essentially, GATK is a foundation of code that takes advantage of the SAM/BAM input format to simplify many of the common requirements for data analysis tools. The core system can accommodate reads from any sequencing platform, as long as they’ve been converted to SAM/BAM format. It therefore supports most sequence aligners, and also recognizes public database formats (HapMap, dbSNP) and some of the common data-exchange file formats (e.g. GLF and VCF). It’s written in Java, which means that the framework is operating-system-independent as well.

GATK implements something called a “mapreduce” paradigm to allow analysis tasks to be performed in parallel. If you’re developing a new analysis tool, there are a few different ways (traversals) to get to the data that’s in a BAM file. For example, if you wanted to compute the average read length, you could use the TraverseReads scheme to pull out every read and walk through them. Alternatively, if you wanted to calculate the average read depth across the genome, you could use the TraverseLoci scheme to pull out information (reference base, read bases, etc.) at every base in the genome. The best part is that you don’t have to write any of the code for indexing, retrieving, and parsing NGS data – that’s already done. You can focus on your analysis tool, while the GATK developers can continually improve the core engine.

Analysis Tools Built on GATK

The authors demonstrate two simple applications that were developed using the GATK framework. The first, a depth-of-coverage tool, took just 83 lines of code to generate a depth-of-coverage report for every position in a given locus (or the whole genome). This might easily be developed into a highly automated, graphic-supported system for reporting coverage on, say, an exome sequencing project. The second demonstration tool was a simple Bayesian genotyper (57 lines), which uses posterior probability to determine the most likely genotype at each position in the reference.

I’m aware of at least two more valuable NGS data analysis tools that were built on this framework. The first is actually the framework’s foundation, Picard (http://picard.sourceforge.net), which contains a number of SAM/BAM parsing elements, but perhaps more importantly, has the widely used “MarkDuplicates” tool for identifying redundant sequences in NGS data. The second tool, one that I’ve recently been evaluating, is the GATK indel genotyper. Given a pair of BAM files from a tumor sample and matched (normal) control, the GATK indel genotyper implements a stringent algorithm to call indels and determine their somatic status (Germline or Somatic) based on the evidence in both files. Optionally, this can be done with local realignment of reads around indel positions, which helps remove some false positive variant calls. Compared to other tools for indel calling, GATK seems to offer greater precision (fewer false positives), while maintaining sensitivity, in the datasets that I’ve tested.

Next-Generation Informatics

I readily admit that I don’t know enough about parallelization to discuss it in detail, but what I read in the paper seems encouraging. On a single CPU, the simple Bayesian genotyper took something like 14 hours to complete chromosome 1 of a whole-genome sequence using a single CPU. But when offered 12 CPUs, the built-in parallel processing support of GATK brought down execution time almost 12-fold, to about an hour and a half. It strikes me that frameworks such as this, coupled with the latest 4-core, 8-core, even 50-core CPUs, may finally be bioinformatics’ answer to the challenge of massively parallel sequencing.

References
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, & Depristo MA (2010). The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome research PMID: 20644199