Archives for August 2010

Rapid Human Adaptation to High Altitudes

August 31, 2010 by Dan Koboldt

Two studies in the journal Science demonstrated that genes in the hypoxia-inducible factor (HIF) oxygen signaling pathway have undergone strong, recent positive selection in Tibetan highlanders. One study was a genome-wide scan using SNP arrays; the other was a large-scale exome sequencing effort. The exome study was particularly interesting: using the NimbleGen 2.1M exon capture array and Illumina GAIIx instruments, Yi et al. sequenced the exons of nearly 20,000 genes (92% of CCDS) in 50 unrelated Tibetans.

Exome Sequencing Summary

To my knowledge, this represents the largest published study of human exome sequencing to date. The main text in the report to Science was necessarily brief, so I used the supplemental materials to glean the following information:

Genes Targeted:	18,654
Total Target Size:	34 Mbp
Number of Samples:	50
Data per Sample:	3.4 Gbp
Avg. Read Length:	71 bp
Reads per Sample:	47.87 million
Map Rate (SOAPaligner):	67.79%
Mapped Reads per Sample:	32.45 million
On Target (+/- 500 bp):	68.1%
Avg. Target Depth:	17.58x
Avg. Target Breadth:	95.48%

The production numbers are consistent with a single lane of 2×75 bp reads (3.4 Gbp) per exome. The low mapping rate (68%) is slightly alarming, but I’d guess (hope) that only uniquely mapped reads are counted here. The on-target mapping rate, a measure of capture specificity, was 68%, well within the expected enrichment of large-scale capture technologies.
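As a quick sanity check, the yield and mapping rate can be recomputed from the other rows of the table. This is only arithmetic on the per-sample averages quoted above; the variable names are mine.

```python
# Per-sample averages copied from the summary table above.
reads_per_sample = 47.87e6
avg_read_length_bp = 71
mapped_reads_per_sample = 32.45e6

# Total yield is reads multiplied by read length.
yield_gbp = reads_per_sample * avg_read_length_bp / 1e9

# Mapping rate is mapped reads divided by total reads.
map_rate_pct = 100 * mapped_reads_per_sample / reads_per_sample

print(f"Yield per sample: {yield_gbp:.2f} Gbp")
print(f"Mapping rate: {map_rate_pct:.2f}%")
```

Both values agree with the table (3.4 Gbp and 67.79%), so the reported rows are at least internally consistent.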

Highly Variable Coverage Across Samples

I do feel obligated to point out that while the average target depth was 18x, which seems appropriate for variant calling, the actual target depth varies widely across the 50 samples. Here’s my plot of target coverage breadth (% of bases) by average target depth (redundancy) using data from supplemental table 1:

[Figure: target coverage breadth (% of bases) by average target depth for the 50 exomes, from supplemental table 1]

Almost every sample reaches 90% coverage breadth, but 7 of them have less than 10x coverage on average. This will undoubtedly affect the ability to call variants accurately, though only a statistician might be able to extrapolate the effects of such variable coverage on the study’s outcome.

Searching for Selection

To look for evidence of positive selection for altitude, they compared SNP allele frequencies between the Tibetans and 40 Han Chinese whose genomes were sequenced to low (4x) coverage as part of the 1000 Genomes Project. About 100,000 high-confidence SNPs (>99% probability) were called in the Tibetan samples. Of a subset tested by Sanger sequencing, 53 of 56 were validated, suggesting that ~95% of sites are valid polymorphisms. Allele frequency estimates showed an excess of low-frequency variants, particularly among nonsynonymous SNPs.

Using synonymous sites in both populations, the population historical modeling estimated that Tibetans and Han Chinese diverged 2,750 years ago, with Han expanding from a small initial population, and Tibetans shrinking from a larger one. Migrational evidence suggests that Han Chinese migrated from the Tibetan region, with recent admixture in the opposite direction.

Exon Targets, Intron Findings

Intriguingly, though the “exome” sequencing strategy focused on coding regions, no amino-acid changing variants differed by more than 6% between Han and Tibetan populations. Fortunately, hybrid selection (capture) also captures some of the noncoding regions that flank target exons. This happens because randomly sheared DNA fragments (200-250 bp) may overlap both exon and intron sequence, yet still have enough sequence overlapping a probe to be captured. This creates a “shoulder” of coverage upstream and downstream of target exons, often in intronic or UTR sequences.
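The width of that shoulder follows from simple geometry. Here is a minimal sketch, assuming a fragment is captured whenever at least some minimum number of its bases overlap a probe. The 60 bp threshold is a made-up value for illustration, not a figure from the paper or the capture vendor.

```python
def max_shoulder_bp(fragment_length, min_probe_overlap):
    """Farthest a captured fragment can extend past the edge of its probe.

    Assumes (hypothetically) that capture only requires min_probe_overlap
    bases of the fragment to sit on the probe.
    """
    if not 0 < min_probe_overlap <= fragment_length:
        raise ValueError("overlap must be between 1 and the fragment length")
    return fragment_length - min_probe_overlap


# Fragment sizes from the text; the 60 bp overlap is an assumed value.
for length in (200, 250):
    print(length, "bp fragment ->", max_shoulder_bp(length, 60), "bp shoulder")
```

Under these assumptions the shoulder can reach well over 100 bp into flanking intron or UTR sequence, which is enough to pick up nearby noncoding SNPs.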

This side-benefit of exome capture proved serendipitous because intronic sequences harbored the most divergent SNP between Han (9% frequency) and Tibetan (87% frequency) populations. The gene in question was endothelial PAS-domain protein 1 (EPAS1), also known as hypoxia-inducible factor 2-alpha (HIF2A). Hypoxia in the name of a candidate gene for high altitude adaptation was a good sign. A protein-stabilizing mutation in EPAS1 had already been linked to erythrocytosis, suggesting a possible link between this gene and red blood cell production.
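To make "most divergent" concrete, here is a minimal sketch that ranks SNPs by the absolute difference in allele frequency between the two populations. This is my own simplification and not necessarily the statistic the authors used. Only the EPAS1 frequencies come from the text; the other two entries are invented for contrast.

```python
def rank_by_frequency_difference(snps):
    """Sort SNPs by absolute Han/Tibetan frequency difference, largest first."""
    return sorted(snps, key=lambda s: abs(s["tibetan"] - s["han"]), reverse=True)


snps = [
    {"id": "hypothetical_snp_A", "han": 0.30, "tibetan": 0.34},  # invented
    {"id": "EPAS1_intronic", "han": 0.09, "tibetan": 0.87},      # from the text
    {"id": "hypothetical_snp_B", "han": 0.55, "tibetan": 0.50},  # invented
]

ranked = rank_by_frequency_difference(snps)
top = ranked[0]
top_diff = abs(top["tibetan"] - top["han"])
print(f"{top['id']}: frequency difference {top_diff:.2f}")
```

A 78-point gap stands out sharply against coding variants that never differed by more than 6%.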

Even more promising was the fact that another study published in the same issue of Science had pinpointed the same gene by high-density SNP array genotyping. The irony here is priceless: an expensive exome sequencing project finds an intronic SNP, implicating a gene that was just as easily identified by genotyping. Of course, if the relevant haplotypes had been composed of rare variants – ones absent from the Han population and not covered by current SNP arrays – only one group would have identified this gene, and the other would have gone home empty-handed.

Perspective
Storz, J. (2010). Genes for High Altitudes. Science, 329 (5987), 40-41 DOI: 10.1126/science.1192481

Reports
Simonson TS, Yang Y, Huff CD, Yun H, Qin G, Witherspoon DJ, Bai Z, Lorenzo FR, Xing J, Jorde LB, Prchal JT, & Ge R (2010). Genetic evidence for high-altitude adaptation in Tibet. Science, 329 (5987), 72-5 PMID: 20466884

Yi X, Liang Y, Huerta-Sanchez E, et al. (2010). Sequencing of 50 human exomes reveals adaptation to high altitude. Science, 329 (5987), 75-8 PMID: 20595611

More Diagnosis by Whole Genome Resequencing

August 25, 2010 by Dan Koboldt

The cost of DNA sequencing continues to plummet, and while insurance companies might not be ready to get on board, another study has demonstrated how individual genome data can be clinically informative. Jonathan Rios and colleagues took on the case of an 11-month-old girl with severe hypercholesterolemia (1023 mg/dl) whose parents were unaffected, which suggested either an autosomal recessive disorder or a de novo mutation. A normal plasma sitosterol:cholesterol ratio ruled out sitosterolemia, and a screen of commonly mutated hypercholesterolemia genes (LDLRAP1, LDLR, PCSK9, APOE and APOB) came up empty.

Enter Whole-Genome Sequencing

Whole-genome resequencing for the patient was performed by Complete Genomics, who generated 138 Gbp of mappable sequence yielding 49x average coverage. Of the ~3.8 million sequence variants identified, 502,000 were indels or complex rearrangements that were not further evaluated (see Figure 2A). That left ~3.3 million SNPs, of which 9,726 were nonsynonymous or splice site variants. Most of these were present either in dbSNP or in the genomes of 21 unaffected individuals (16 exomes, 5 whole-genomes). About 700 variants remained.

Now the authors considered the disease pedigree:

[Pedigree figure: Rios et al., Hum. Molec. Genetics 2010]

The parents were unrelated (non-consanguineous) and both had normal cholesterol. No other relatives (including the patient’s 4-year-old brother) were affected. This fit a monogenic autosomal recessive disorder, so the authors focused their search on genes with two or more nonsynonymous or splice site variants. Some 42 genes qualified, but 19 of these were known to have multiple copies in the genome. That left 23 genes, and one stood out: ABCG5, an ATP-binding cassette hemitransporter in which the patient had two heterozygous nonsense mutations. This gene, by the way, had already been linked to the sterol-elimination deficiency (sitosterolemia) that can cause high cholesterol. There was no way the gene could be functional, and targeted sequencing showed that each parent had provided a defunct copy.
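The filtering logic described here can be sketched in a few lines. This is an illustration of the idea, not the authors' pipeline: the record layout, the effect labels, the positions, and every gene name other than ABCG5 are hypothetical.

```python
from collections import defaultdict


def candidate_recessive_genes(variants, known_variants, multicopy_genes):
    """Return genes carrying two or more novel protein-altering variants."""
    # Effect labels are assumed; nonsense is treated as a nonsynonymous class.
    protein_altering = {"nonsynonymous", "nonsense", "splice_site"}
    per_gene = defaultdict(list)
    for v in variants:
        if v["effect"] not in protein_altering:
            continue
        # Drop anything already seen in dbSNP or the unaffected genomes.
        if (v["chrom"], v["pos"], v["alt"]) in known_variants:
            continue
        per_gene[v["gene"]].append(v)
    # Recessive model: at least two hits, and skip multi-copy genes.
    return {
        gene: hits
        for gene, hits in per_gene.items()
        if len(hits) >= 2 and gene not in multicopy_genes
    }


# Toy data; all positions are placeholders, not real coordinates.
variants = [
    {"gene": "ABCG5", "chrom": "2", "pos": 1001, "alt": "T", "effect": "nonsense"},
    {"gene": "ABCG5", "chrom": "2", "pos": 2002, "alt": "A", "effect": "nonsense"},
    {"gene": "GENE_X", "chrom": "5", "pos": 3003, "alt": "G", "effect": "nonsynonymous"},
    {"gene": "GENE_Y", "chrom": "7", "pos": 4004, "alt": "C", "effect": "nonsynonymous"},
    {"gene": "GENE_Y", "chrom": "7", "pos": 4100, "alt": "T", "effect": "splice_site"},
    {"gene": "GENE_Z", "chrom": "9", "pos": 5005, "alt": "A", "effect": "nonsynonymous"},
    {"gene": "GENE_Z", "chrom": "9", "pos": 5100, "alt": "G", "effect": "nonsynonymous"},
]
known_variants = {("9", 5100, "G")}   # pretend this one is in dbSNP
multicopy_genes = {"GENE_Y"}          # pretend this gene has paralogs

candidates = candidate_recessive_genes(variants, known_variants, multicopy_genes)
print(sorted(candidates))
```

In the toy data only ABCG5 survives: GENE_X has a single hit, GENE_Y is multi-copy, and GENE_Z loses one of its two variants to the known-variant filter.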

New Understanding of the Disease Mechanism

The researchers did some more tests, and the results were consistent with sitosterolemia. So how did they miss it the first time? Well, at the time of diagnosis, 80% of the patient’s diet was breast milk, which is low in plant sterols. At the time of the second test, she was on baby food (fruits and vegetables), which provided plenty. That explains why the sterol counts were initially low. Why was the cholesterol so high? The authors conclude that in sitosterolemic patients, the severe hypercholesterolemia may reflect a failure to excrete cholesterol into bile, rather than a disruption of cholesterol homeostasis by elevated plant sterols.

Thus, whole-genome sequencing not only provided the correct diagnosis, but may have shed new light on the mechanisms of lipid disease.

References
Rios J, Stein E, Shendure J, Hobbs HH, & Cohen JC (2010). Identification by Whole Genome Resequencing of Gene Defect Responsible for Severe Hypercholesterolemia. Human molecular genetics PMID: 20719861

A Foundation for Next-Generation Analysis Tools

August 11, 2010 by Dan Koboldt

The emergence of next-generation sequencing has presented numerous significant challenges to the bioinformatics community. NGS instruments have given rise to a new generation of software tools for the alignment, assembly, management, and visualization of incredible amounts of data. New algorithms have also been developed to assess coverage, assess genomic copy number, call variants (SNPs/indels), and infer large-scale structural variation.

Regardless of their purpose, most tools for NGS data analysis are under increased demand for the same things:

Efficiency – in the face of ever-growing throughputs from NGS instruments
Flexibility – to accommodate new sequencing platforms, experimental protocols, and input formats
Scalability – to continually improve upon and enhance their features as needs evolve

The definition and widespread acceptance of the Sequence Alignment Map (SAM) as the standard format for representing NGS data was a key development for the field. Aaron McKenna and colleagues at the Broad Institute have just published another advance – the Genome Analysis Toolkit (GATK), a structured programming framework for NGS data analysis. Essentially, GATK is a foundation of code that takes advantage of the SAM/BAM input format to simplify many of the common requirements for data analysis tools. The core system can accommodate reads from any sequencing platform, as long as they’ve been converted to SAM/BAM format. It therefore supports most sequence aligners, and also recognizes public database formats (HapMap, dbSNP) and some of the common data-exchange file formats (e.g. GLF and VCF). It’s written in Java, which means that the framework is operating-system-independent as well.

GATK implements something called a “mapreduce” paradigm to allow analysis tasks to be performed in parallel. If you’re developing a new analysis tool, there are a few different ways (traversals) to get to the data that’s in a BAM file. For example, if you wanted to compute the average read length, you could use the TraverseReads scheme to pull out every read and walk through them. Alternatively, if you wanted to calculate the average read depth across the genome, you could use the TraverseLoci scheme to pull out information (reference base, read bases, etc.) at every base in the genome. The best part is that you don’t have to write any of the code for indexing, retrieving, and parsing NGS data – that’s already done. You can focus on your analysis tool, while the GATK developers can continually improve the core engine.
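The map/reduce idea behind those two examples can be sketched without any GATK code. GATK itself is written in Java, and nothing below is its API: the `traverse` helper, the toy reads, and the toy pileups are all my own stand-ins, written in Python for brevity.

```python
from functools import reduce


def traverse(items, map_fn, reduce_fn, initial):
    """Apply map_fn to every item, then fold the results with reduce_fn."""
    return reduce(reduce_fn, map(map_fn, items), initial)


def add_pairs(a, b):
    """Combine two (sum, count) pairs; associative, so shards can be merged."""
    return (a[0] + b[0], a[1] + b[1])


# By-read traversal: average read length over toy read sequences.
reads = ["ACGTACGT", "ACGTAC", "ACGTACGTAC"]
total_len, n_reads = traverse(reads, lambda r: (len(r), 1), add_pairs, (0, 0))
avg_read_length = total_len / n_reads

# By-locus traversal: average depth over toy pileups (position -> read bases).
pileups = {100: "AAA", 101: "CC", 102: "G"}
total_depth, n_loci = traverse(
    pileups.values(), lambda bases: (len(bases), 1), add_pairs, (0, 0)
)
avg_depth = total_depth / n_loci

print(avg_read_length, avg_depth)
```

Because the reduce step only adds partial sums, separate chunks of the genome could be processed independently and merged afterwards, which is what makes this style easy to parallelize.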

Analysis Tools Built on GATK

The authors demonstrate two simple applications that were developed using the GATK framework. The first, a depth-of-coverage tool, took just 83 lines of code to generate a depth-of-coverage report for every position in a given locus (or the whole genome). This might easily be developed into a highly automated, graphic-supported system for reporting coverage on, say, an exome sequencing project. The second demonstration tool was a simple Bayesian genotyper (57 lines), which uses posterior probability to determine the most likely genotype at each position in the reference.

I’m aware of at least two more valuable NGS data analysis tools that were built on this framework. The first is actually the framework’s foundation, Picard (http://picard.sourceforge.net), which contains a number of SAM/BAM parsing elements, but perhaps more importantly, has the widely used “MarkDuplicates” tool for identifying redundant sequences in NGS data. The second tool, one that I’ve recently been evaluating, is the GATK indel genotyper. Given a pair of BAM files from a tumor sample and matched (normal) control, the GATK indel genotyper implements a stringent algorithm to call indels and determine their somatic status (Germline or Somatic) based on the evidence in both files. Optionally, this can be done with local realignment of reads around indel positions, which helps remove some false positive variant calls. Compared to other tools for indel calling, GATK seems to offer greater precision (fewer false positives), while maintaining sensitivity, in the datasets that I’ve tested.

Next-Generation Informatics

I readily admit that I don’t know enough about parallelization to discuss it in detail, but what I read in the paper seems encouraging. On a single CPU, the simple Bayesian genotyper took something like 14 hours to complete chromosome 1 of a whole-genome sequence. But when offered 12 CPUs, the built-in parallel processing support of GATK brought execution time down roughly 10-fold, to about an hour and a half. It strikes me that frameworks such as this, coupled with the latest 4-core, 8-core, even 50-core CPUs, may finally be bioinformatics’ answer to the challenge of massively parallel sequencing.

References
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, & Depristo MA (2010). The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome research PMID: 20644199