Mappability of the human genome

December 9, 2010 by Dan Koboldt

Accurately mapping reads to a reference sequence is one of the most critical tasks for next-generation sequencing, yet it remains one of the most challenging areas of bioinformatics. Certainly, we’ve come a long way since ELAND and Maq. Dozens of new mapping and alignment algorithms have been published. Speed and sensitivity continue to improve; with BWA, we can map a lane of Illumina data (2×100 bp) to the human genome in about four hours. Nevertheless, it seems like most of the problems that arise in downstream analysis can be traced back to the read alignments.

Mapping Error: Consequences and Causes

Take variant calling, for example. The two central issues we try to address are false positives (variants that aren’t real) and false-negatives (real variants that are missed). Countless hours of manual review and extensive validation have shown that false positives rarely arise from simple sequencing errors. Such errors occur randomly, and with sufficient read depth (8x-10x), they can be readily ignored as background noise. Instead, most false positives are the result of an alignment errors – reads mapped to the wrong place, local mis-alignment around indels, etc. False-negatives are also a concern, particularly for indels. Why? Because the larger the gap between read and reference sequence, the more difficult it is to place correctly.

In my humble opinion, three factors contribute to the problem of accurate read mapping: relatively short read lengths (36-100 bp), unprecedented throughput of NGS instruments, and size/complexity of the genome. As a bioinformatician, I can’t do much about the first two. The size and complexity of the genome, however, remains somewhat fixed. As such, many groups have sought to define those portions of the genome that are “mappable”, so that they can be prioritized for analysis.

Quantifying Short Read Mappability

Recently, I noticed that several such datasets have been incorporated as tracks in the UCSC Genome Browser. For build 18 (NCBI 36), there are about a dozen of them, all under the track name “Mapability”. Here are some of them across the BRCA1 gene region on chromosome 17:

ucsc-mappability

The top three tracks (in green) represent mappability with 36, 75, and 100bp reads as calculated by CRG-GEM, a suite of genome analysis tools by a group in Spain. Then there’s Duke uniqueness (35 bp) in red, and UMass uniqueness (15 bp) in blue. Looking at the tracks together, a few patterns become apparent: first, that mappability calculations remain fairly consistent across datasets generated by different groups. Most of the drops in mappability correspond nicely to RepeatMasker regions. Also, if the CRG results are accurate, the mappability increases substantially with read length (although the near-complete mappability for 100-mers seems a bit optimistic). This dovetails nicely with the theoretical expectation that longer reads can resolve more of the genome.

Utilizing Genome-Wide Mappability Scores

How might this information assist bioinformaticians with the analysis of NGS data? I have a few ideas. It could be used to normalize sequencing coverage by local mappability, for read-depth-sensitive measurements like genome-wide copy number. It can be plotted alongside whole-genome sequencing or capture coverage read depth, to help explain holes in coverage.

It could even be used pre-emptively, for targeted sequencing projects, to assess the regions that will or will not have reads mapped. This sort of analysis is useful to set realistic expectations for collaborators, who might expect that, since they’ve heard so many nice things about next-gen sequencing, it will yield beautiful, even coverage across every single base they want. For time-sensitive projects, you could even adapt a sequencing strategy based on this information. On the Illumina platform, for example, a 36-bp fragment-end run finishes days before a paired-end run of longer reads. If one’s target regions were very mappable, and time was of the essence, a faster and less expensive sequencing run could be ordered.

The Fruits of a Thousand Genomes

November 1, 2010 by Dan Koboldt

Last week saw the publication of the 1,000 Genomes Project, which has characterized ~15 million SNPs, 1 million short insertions/deletions (indels), and 20,000 structural variants in seven human populations. This is discovery and genotyping at unprecedented scale, with an astonishing 4.9 terabases (trillion bases) sequenced – the equivalent of about 1,500 human genomes – across three pilot projects:

Deep whole-genome sequencing of trios (mother-father-daughter) from 2 populations
Low-coverage sequencing of 179 unrelated individuals from 4 populations
Exon sequencing of 906 randomly-selected genes in 697 individuals from 7 populations.

The three pilots have shed new light on sequence variation in human genomes and its distribution among human populations. Perhaps unsurprisingly, variation was not evenly distributed in the genome – certain regions (e.g. HLA and sub-telomeres) show high rates of variation, whereas (e.g. a 5 Mbp, gene-dense, highly-conserved region on chromosome 3) show very little. At the chromosomal level, different forms of variation were highly correlated (e.g. SNPs and indels), but there were exceptions for some types of structural variants implicating different mechanisms of mutation.

Novelty and Population-Specificity

The vast majority of SNPs detected were already known to dbSNP. Among known variants, 56% were present in all population panels while 25% were found in only a single panel. In contrast, only 4% of novel variants were found in all panels and 84% were found in only one. This difference supports the notion that the majority of common SNPs in human populations have already been found. There’s more work to do for other forms of variation, though. Many of the novel SVs were detected in all population panels. Half of the common short indels had never been reported.

The smallest two chromosomes – mitochondrial and Y – seemed to benefit the most. There was a lot of heteroplasmy in mitochondrial DNA within individuals – 79% of samples had length heteroplasmy, and 45% had substitution heteroplasmy. On the Y-chromosome, there were 2,870 variable sites, most of which (74%) were novel to public databases. These new variants helped identify several clear, significant sub-clades within the 12 haplotype groups represented in 1,000 Genomes samples.

Coding Regions and Loss-of-Function Variants

In total, the three pilots identified 68,300 non-synonymous variants, almost half of which were novel. Genotyping a subset of these in 620 samples revealed novel NSS variants had dramatically lower minor allele frequency (2.2%) than known ones (26.2%). From this I can draw two conclusions: most novel nonsynonymous variants are rare, and the majority could only have been identified by population-scale sequencing projects like these.

The authors estimate that an individual genome differs from the reference at 10,000 to 11,000 nonsynonymous sites and perhaps 12,000 synonymous sites. A typical genome harbors a much smaller number of loss-of-function (LOF) variants — inframe/frameshift indels, early stops, and splice-site variants — perhaps 340-400 LOF variants per individual, affecting 250-300 genes. Compared to synonymous variants, putative functional variants (nonsynonymous and LOF) tend to have lower allele frequencies and be more population-specific, presumably due to the action of purifying selection against deleterious mutations. Which means, of course, that the really important variants are much harder to find.

Signatures of Natural Selection

Looking in and around genes, the authors found diversity is lowest in exons (50% that of introns) and slightly reduced in 5′ and 3′ UTRs, compared to intronic and intergenic sequences. This signature of natural selection acting upon genes actually has a broad effect; diversity is reduced by 10% in the vicinity of genes compared to gene-distant loci, and that reduction extends up to 85 kbp away. Thus, selection on linked sites appears to restrict variation across the majority of the human genome. Looking across panels, the authors observed that SNPs with large allele frequency differences between populations were enriched for nonsynonymous sites, likely reflecting local adaptation and selection by different continental groups.

Finally, the authors examined the trios to look at a different environment for mutation and selection – immortalized cell lines. Some 952/1001 new mutations in the CEU daughter and 634/669 new mutations in the YRI daughter were not present in the germline, indicating that they occurred either in somatic cells or in the cell lines. Further, the higher number of mutations in the CEU sample may be related to the age of the lines – the CEU line is decades older than the YRI line.

Implications for Future Studies

The findings of the 1,000 Genomes Project thus far have immediate, significant impact on genetic association studies. Using publicly available gene expression data and their expanded catalogue of variants, the authors identified 20-30% more significant expression quantitative trait loci (eQTLs) than had previously been detectable. Thus, it is clear that while existing SNP arrays represent the majority of common variation, a significant amount of rare, phenotypically-relevant variation remains to be incorporated.

References
1000 Genomes Project Consortium (2010). A map of human genome variation from population-scale sequencing. Nature, 467 (7319), 1061-73 PMID: 20981092

More Diagnosis by Whole Genome Resequencing

August 25, 2010 by Dan Koboldt

The cost of DNA sequencing continues to plummet, and while insurance companies might not be ready to get on board, another study has demonstrated how individual genome data can be clinically informative. Jonathan Rios and colleagues took on the case of an 11-month-old girl with severe hypercholesterolemia (1023 mg/dl) whose parents were unaffected, which suggested either an autosomal recessive disorder or a de novo mutation. A normal plasma sitosterol:cholesterol ratio ruled out sitosterolemia, and a screen of commonly-mutated hypercholesterolemia genes (LDLRAP, LDLR, PCSK9, APOE and APOB) came up empty.

Enter Whole-Genome Sequencing

Whole-genome resequencing for the patient was performed by Complete Genomics, who generated 138 Gbp of mappable sequence yielding 49x average coverage. Of the ~3.8 million sequence variants identified, 502,000 were indels or complex rearrangements that were not further evaluated (see Figure 2A). That left ~3.3 million SNPs, of which 9,726 were nonsynonymous or splice site variants. Most of these were present either in dbSNP or in the genomes of 21 unaffected individuals (16 exomes, 5 whole-genomes). About 700 variants remained.

Now the authors considered the disease pedigree:



Rios et al., Hum. Molec. Genetics 2010

The parents were unrelated (non-consanguinous) and both had normal cholesterol. No other relatives (including the patient’s 4-year-old brother) were affected. It fit with a monogenic autosomal recessive disorder, so the authors focused their search on genes with two or more nonsynonymous or splice site variants. Some 42 genes qualified, but 19 of these were known to have multiple copies in the genome. That left 23 genes, and one stood out: ABCG5, an ATP-binding cassette hemitransporter in which the patient had two heterozygous nonsense mutations. A gene, by the way, that had already been linked to the sterol-elimination deficiency (sitosterolemia) that can cause high cholesterol. There was no way the gene could be functional, and targeted sequencing showed that each parent had provided a defunct copy.

New Understanding of the Disease Mechanism

The researchers did some more tests, and the results were consistent with sitosterolemia. So how did they miss it the first time? Well, at the time of diagnosis, 80% of the patient’s diet was breast milk, which is low in plant sterols. At the time of the second test, she was on baby foot (fruits and vegetables), which provided plenty. That explains why the sterol counts were initially low. Why was the cholesterol so high? The authors conclude that in sitosterolemic patients, the severe hypercholesterolemia may reflect an failure to excrete cholesterol into bile, rather than a disruption of cholesterol homeostasis by elevated plant sterols.

Thus, whole-genome sequencing not only provided the correct diagnosis, but may have shed new light on the mechanisms of lipid disease.

References
Rios J, Stein E, Shendure J, Hobbs HH, & Cohen JC (2010). Identification by Whole Genome Resequencing of Gene Defect Responsible for Severe Hypercholesterolemia. Human molecular genetics PMID: 20719861

A Foundation for Next-Generation Analysis Tools

August 11, 2010 by Dan Koboldt

The emergence of next-generation sequencing has presented numerous significant challenges to the bioinformatics community. NGS instruments have given rise to a new generation of software tools for the alignment, assembly, management, and visualization of incredible amounts of data. New algorithms have also been developed to assess coverage, assess genomic copy number, call variants (SNPs/indels), and infer large-scale structural variation.

Regardless of their purpose, most tools for NGS data analysis are under increased demand for the same things:

Efficiency – in the face of ever-growing throughputs from NGS instruments
Flexibility – to accommodate new sequencing platforms, experimental protocols, and input formats
Scalability – to continually improve upon and enhance their features as needs evolve

The definition and widespread acceptance of the Sequence Alignment Map (SAM) as the standard format for representing NGS data was a key development for the field. Aaron McKenna and colleagues at the Broad Institute have just published another advance – the Genome Analysis Toolkit (GATK), a structured programming framework for NGS data anlysis. Essentially, GATK is a foundation of code that takes advantage of the SAM/BAM input format to simplify many of the common requirements for data analysis tools. The core system can accommodate reads from any sequencing platform, as long as they’ve been converted to SAM/BAM format. It therefore supports most sequence aligners, and also recognizes public database formats (HapMap, dbSNP) and some of the common data-exchange file formats (e.g. GLF and VCF). It’s written in Java, which means that the framework is operating-system-independent as well.

GATK implements something called a “mapreduce” paradigm to allow analysis tasks to be performed in parallel. If you’re developing a new analysis tool, there are a few different ways (traversals) to get to the data that’s in a BAM file. For example, if you wanted to compute the average read length, you could use the TraverseReads scheme to pull out every read and walk through them. Alternatively, if you wanted to calculate the average read depth across the genome, you could use the TraverseLoci scheme to pull out information (reference base, read bases, etc.) at every base in the genome. The best part is that you don’t have to write any of the code for indexing, retrieving, and parsing NGS data – that’s already done. You can focus on your analysis tool, while the GATK developers can continually improve the core engine.

Analysis Tools Built on GATK

The authors demonstrate two simple applications that were developed using the GATK framework. The first, a depth-of-coverage tool, took just 83 lines of code to generate a depth-of-coverage report for every position in a given locus (or the whole genome). This might easily be developed into a highly automated, graphic-supported system for reporting coverage on, say, an exome sequencing project. The second demonstration tool was a simple Bayesian genotyper (57 lines), which uses posterior probability to determine the most likely genotype at each position in the reference.

I’m aware of at least two more valuable NGS data analysis tools that were built on this framework. The first is actually the framework’s foundation, Picard (http://picard.sourceforge.net), which contains a number of SAM/BAM parsing elements, but perhaps more importantly, has the widely used “MarkDuplicates” tool for identifying redundant sequences in NGS data. The second tool, one that I’ve recently been evaluating, is the GATK indel genotyper. Given a pair of BAM files from a tumor sample and matched (normal) control, the GATK indel genotyper implements a stringent algorithm to call indels and determine their somatic status (Germline or Somatic) based on the evidence in both files. Optionally, this can be done with local realignment of reads around indel positions, which helps remove some false positive variant calls. Compared to other tools for indel calling, GATK seems to offer greater precision (fewer false positives), while maintaining sensitivity, in the datasets that I’ve tested.

Next-Generation Informatics

I readily admit that I don’t know enough about parallelization to discuss it in detail, but what I read in the paper seems encouraging. On a single CPU, the simple Bayesian genotyper took something like 14 hours to complete chromosome 1 of a whole-genome sequence using a single CPU. But when offered 12 CPUs, the built-in parallel processing support of GATK brought down execution time almost 12-fold, to about an hour and a half. It strikes me that frameworks such as this, coupled with the latest 4-core, 8-core, even 50-core CPUs, may finally be bioinformatics’ answer to the challenge of massively parallel sequencing.

References
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, & Depristo MA (2010). The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome research PMID: 20644199