Mutation Detection in Rare Disease by Pooled Sequencing

October 13, 2010 by Dan Koboldt

When it comes to massively parallel sequencing, few areas of human health stand to benefit as much as rare genetic diseases. Indeed, both whole-genome and exome sequencing strategies have identified disease-causing mutations in probands with Charcot-Marie-Tooth disease, Miller syndrome, severe brain malformations, and a few other disorders. The Mito10K project took a different approach. They assembled a cohort of mostly unrelated individuals with complex I deficiency (n=103), the most common cause of human respiratory chain diseases.

Mitochondrial Electron Transport Chain (Wikipedia)

Forty-two HapMap samples were included as controls. Instead of employing a whole-genome or exome strategy, they performed deep resequencing of carefully-chosen candidate genes in pools of ~20 samples. And they did it all using a single Illumina flowcell.

Pooled Sequencing of Candidate Genes

The candidates included 103 genes that (i) encoded known complex I proteins, (ii) were implicated in the disease, or (iii) were identified by phylogenetic profiling. The 145 kb target space comprised 653 exons from nuclear genes (138 kb) and two mtDNA regions (7 kb). About 90% of target regions achieved at least 100x coverage; the median redundancy was 3,359x per pool, which works out to ~168x per individual. Next, the authors developed a method (“Syzygy”) to model sequencing error and call variants at very low frequencies. A comparison of calls for the HapMap samples to existing genotype data suggested 92% sensitivity and 99.6% specificity, at sites where coverage was 100x or greater.
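To make the pooling arithmetic concrete, here is a minimal sketch that reproduces the per-individual coverage from the figures quoted above and estimates how rare a single heterozygous carrier looks within a pool. The pool size of exactly 20 and the diploid assumption are simplifications for illustration, not values taken from the paper's methods.

```python
# Back-of-the-envelope pooled-coverage arithmetic using figures quoted in the post.
median_pool_depth = 3359      # median coverage per pool (from the post)
samples_per_pool = 20         # assumption: exactly 20 samples (the post says ~20)
chromosomes_per_sample = 2    # assumption: diploid nuclear targets

# Average coverage attributable to each individual in the pool.
per_individual_depth = median_pool_depth / samples_per_pool

# A variant carried by one heterozygous individual sits on 1 of 40 chromosomes.
singleton_het_fraction = 1 / (samples_per_pool * chromosomes_per_sample)

# Reads expected to support that variant at the median pool depth.
expected_variant_reads = median_pool_depth * singleton_het_fraction

print(f"Depth per individual: ~{per_individual_depth:.0f}x")
print(f"Allele fraction of one heterozygote: {singleton_het_fraction:.1%}")
print(f"Expected variant-supporting reads: ~{expected_variant_reads:.0f}")
```

A variant present at only a few percent of reads is why an explicit sequencing-error model matters in pooled designs: the signal from one carrier is not far above the raw per-base error rate.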

Although the pooling strategy worked well for nuclear DNA, there were some problems with the targeted regions in mtDNA. Basically, the distribution of mtDNA was not uniform between samples. That may be due to the fact that while each cell contains exactly 2 copies of each nuclear chromosome, it contains numerous mitochondria and thus numerous copies of the MT chromosome (possibly 20-25 per cell, by one estimate). The resulting shift in sample representation can be quite dramatic. In one pool, for example, 96% of the mtDNA came from a single individual (5% of the pool). The bottom line is that sensitivity to call mutations in pooled samples is going to be lower for mtDNA.
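The effect of that skew on detection can be sketched with a little arithmetic. This is an illustration only: it borrows the nuclear median pool depth as a stand-in for mtDNA coverage (an assumption; the post does not give the mtDNA depth) and treats the remaining 19 samples as sharing the leftover reads evenly.

```python
# Illustrative only: how a skewed mtDNA pool dilutes the other samples.
pool_size = 20
dominant_share = 0.96    # share of mtDNA from one individual (from the post)
assumed_depth = 3359     # assumption: reuse the nuclear median pool depth

# Share each sample would contribute if the pool were perfectly balanced.
balanced_share = 1 / pool_size

# Share left for each of the other samples, assuming they split it evenly.
share_per_other = (1 - dominant_share) / (pool_size - 1)

balanced_reads = assumed_depth * balanced_share
skewed_reads = assumed_depth * share_per_other

print(f"Balanced pool: {balanced_share:.1%} per sample, ~{balanced_reads:.0f} reads")
print(f"Skewed pool:   {share_per_other:.2%} per sample, ~{skewed_reads:.0f} reads")
```

Under these assumptions, each non-dominant sample drops from well over a hundred reads per site to a handful, which is too few to separate a real variant from sequencing error with any confidence.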

Variant Calling and “Deleteriousness” Prioritization

The unfortunately-named Syzygy method identified 652 variants (high confidence); to boost sensitivity, the authors also employed an ad-hoc approach that called 246 more variants supported by at least 3 reads on each strand (low confidence). The 898 calls were filtered to prioritize variants that seemed likely to underlie a rare and devastating phenotype. In short, the authors removed:

Variants present in healthy individuals (HapMap controls) or public databases (dbSNP, mtDB, 1000 Genomes).
Synonymous or noncoding variants, unless they affected tRNA or splice sites.
Missense variants at positions of low evolutionary conservation.
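The three filters above can be sketched as a simple predicate. Everything here is hypothetical: the `Variant` fields, the consequence labels, and the example records are invented for illustration and are not the authors' actual pipeline or annotation scheme.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical, simplified variant annotation record."""
    name: str
    in_controls_or_databases: bool   # seen in HapMap controls, dbSNP, mtDB, etc.
    consequence: str                 # e.g. "missense", "synonymous", "noncoding"
    affects_trna_or_splice: bool = False
    conserved: bool = True           # position is evolutionarily conserved


def passes_filters(v: Variant) -> bool:
    """Return True if the variant survives all three prioritization filters."""
    # Filter 1: drop anything seen in healthy controls or public databases.
    if v.in_controls_or_databases:
        return False
    # Filter 2: drop synonymous/noncoding changes unless they hit tRNA or splice sites.
    if v.consequence in {"synonymous", "noncoding"} and not v.affects_trna_or_splice:
        return False
    # Filter 3: drop missense changes at poorly conserved positions.
    if v.consequence == "missense" and not v.conserved:
        return False
    return True


# Made-up example records, one per filtering outcome.
variants = [
    Variant("known_snp", True, "missense"),
    Variant("silent_change", False, "synonymous"),
    Variant("splice_site", False, "noncoding", affects_trna_or_splice=True),
    Variant("weak_missense", False, "missense", conserved=False),
    Variant("strong_missense", False, "missense", conserved=True),
    Variant("stop_gain", False, "nonsense"),
]

kept = [v.name for v in variants if passes_filters(v)]
print(kept)
```

Ordering the checks from cheapest to most specific keeps the logic easy to audit, which matters because every filter is also a place where a real causal mutation can be thrown away.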

Of 898 detected variants, 216 remained and were validated by multiplexed Sequenom genotyping. Some 82 sites were also Sanger-sequenced to assess the accuracy of the genotyping platform. The comparison revealed 11% false positives and 2% het/hom miscalls, for an overall error rate of 13% for Sequenom assays. Ouch. As for the variant calls, the validation rate was pretty good for high-confidence calls (91/109, or 84%) but rather abysmal for the low-confidence ones (12/107, or 11%). Intriguingly, validation assays identified 12 additional pathogenic variants that were missed by the discovery screen. Based on these data, the sensitivity of the Syzygy method alone was 79.1% (91/115). That’s not bad, but probably not enough for a study whose goal is to identify rare disease-causing variants.
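The validation numbers are easy to check by hand. The sketch below recomputes them from the counts quoted above; note that reading the denominator of 115 as validated high-confidence calls plus validated low-confidence calls plus the 12 missed variants is my interpretation of how the figure was derived, not something spelled out here.

```python
# Counts quoted in the post.
high_validated, high_tested = 91, 109   # high-confidence (Syzygy) calls
low_validated, low_tested = 12, 107     # low-confidence (ad hoc) calls
missed_by_screen = 12                   # found only during validation

# Validation rate = validated calls / calls tested, per confidence tier.
high_rate = high_validated / high_tested
low_rate = low_validated / low_tested

# Assumption: the "true" set is everything that validated plus what was missed.
true_variants = high_validated + low_validated + missed_by_screen

syzygy_sensitivity = high_validated / true_variants
combined_sensitivity = (high_validated + low_validated) / true_variants

print(f"High-confidence validation rate: {high_rate:.1%}")
print(f"Low-confidence validation rate:  {low_rate:.1%}")
print(f"Syzygy-only sensitivity:         {syzygy_sensitivity:.1%}")
print(f"Sensitivity with ad hoc calls:   {combined_sensitivity:.1%}")
```

Seen this way, the low-confidence calls buy roughly ten points of sensitivity at the cost of a large pile of false positives to chase in validation.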

New Diagnoses from Validated Mutations

Some 60 of the sequenced cases lacked a previous molecular-genetic diagnosis. Among these, the authors were able to provide 11 new diagnoses based on mutations in known disease-causing genes. Several lines of evidence supported the diagnoses:

6 patients had mutations that were previously known to be disease-causing.
3 patients were homozygous for deleterious mutations that caused splicing defects (observed in cDNA) and no detectable protein (by SDS-PAGE and protein blot).
2 patients had mutations in highly conserved protein domains.

Intriguingly, half of the cases with known mutations (3/6) were compound heterozygotes; that is, they inherited a different defect in the same gene from mother and father. This apparent prevalence of compound hets in monogenic disease is unsettling because they tend to make pedigree analysis complicated and require detection of both variants in heterozygous form, which is more difficult to do by sequencing.

Detection and Characterization of Novel Disease Genes

The key finding of this paper (as suggested by the title) was the implication of two new genes in complex I deficiency: NUBPL and FOXRED1. Pathogenicity of each mutated gene was confirmed by a “rescue” assay in which introduction of wild-type cDNA into patient fibroblasts restored complex I activity. In the absence of rescue, residual complex I activity was markedly reduced (19-40%) in the NUBPL-mutated fibroblasts and strikingly reduced (9-15%) in the FOXRED1-mutated fibroblasts.

The case with NUBPL mutations was particularly interesting. RT-PCR showed that the dominant mRNA species was truncated, and the full-length transcript was hardly expressed at all. Sequencing revealed that the shortened fragment had a branch site mutation that likely caused exon 10 skipping, as well as a missense mutation (Gly56Arg), both on the paternal chromosome. The maternal allele wasn’t expressed. Array-based copy number analysis, however, showed that the maternal chromosome had a complex rearrangement of NUBPL in which exons 1-4 were deleted and exon 7 was duplicated. Obviously this structural variation was not detected in the discovery screen. I think this highlights two things: the importance of structural variation in human disease, and the limitations of targeted sequencing on NGS platforms.

Success and Limitations

As the authors note in their discussion, key to the success of this study was the availability of cellular models of disease, with which the pathogenicity of newly discovered mutations in individual patients could be established. With the two new findings, the 11 newly diagnosed cases, and the 40 or so already-diagnosed cases, the authors now have identified the genetic defect for about half of the cases in their cohort. What about the rest? The authors admit that the causal mutations were likely missed because:

They occur in genes not targeted in this study
They affect targeted genes, but reside in noncoding regulatory regions or novel/unknown exons
They were targeted, but not detected due to limited sensitivity (especially in mtDNA)
They were detected, but filtered out as not likely to be deleterious
They are large-scale deletions or rearrangements, which this approach can’t detect

Despite these limitations, the authors have demonstrated that sequencing carefully-chosen candidate genes in pooled samples, with follow-up validation and experimental support, can successfully identify disease-causing mutations in a good-sized patient cohort. Not bad for a single flowcell.

References

Calvo, S., Tucker, E., Compton, A., Kirby, D., Crawford, G., Burtt, N., Rivas, M., Guiducci, C., Bruno, D., Goldberger, O., Redman, M., Wiltshire, E., Wilson, C., Altshuler, D., Gabriel, S., Daly, M., Thorburn, D., & Mootha, V. (2010). High-throughput, pooled sequencing identifies mutations in NUBPL and FOXRED1 in human complex I deficiency. Nature Genetics, 42 (10), 851-858 DOI: 10.1038/ng.659

Ng SB, Buckingham KJ, Lee C, et al (2010). Exome sequencing identifies the cause of a mendelian disorder. Nature genetics, 42 (1), 30-5 PMID: 19915526

Bilgüvar K, Oztürk AK, Louvi A, et al (2010). Whole-exome sequencing identifies recessive WDR62 mutations in severe brain malformations. Nature, 467 (7312), 207-10 PMID: 20729831

Lupski JR, Reid JG, Gonzaga-Jauregui C, et al (2010). Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy. The New England journal of medicine, 362 (13), 1181-91 PMID: 20220177

Lalonde E, Albrecht S, Ha KC, et al (2010). Unexpected allelic heterogeneity and spectrum of mutations in Fowler syndrome revealed by next-generation exome sequencing. Human mutation, 31 (8), 918-23 PMID: 20518025

CSHL 2010: Genomes Get Personal

September 22, 2010 by Dan Koboldt

Last week I attended the third annual “Personal Genomes” meeting at Cold Spring Harbor. The meeting opened with a keynote talk by NHGRI director Eric Green, who reminded us that finding the pathway to genomic medicine is the central mission of NHGRI. He mentioned several of the past successful initiatives that have yielded key findings concerning human genetic variation and its relationship to phenotype: The HapMap Project (common variation), the ENCODE Project (functional variation), and the 1,000 Genomes Project (rare variation), to name a few. He showed the absolutely stunning growth of the NHGRI-hosted genome-wide association study (GWAS) catalog, which currently holds ~2,600 associations from 780 publications.

Dr. Green also discussed the dichotomy of genetic architecture underlying human diseases, and took the position that while we’ve made substantial progress studying rare, monogenic, mendelian disorders (predominantly caused by coding mutations), we face a more daunting task with common, complex, multigenic diseases because he believes that these arise from primarily noncoding mutations.

Theme 1: Human Mutation Rates

Several talks addressed the topic of mutation rate in human genomes. Donald Conrad, who will be joining the WashU Genetics Department next year, presented mutation rate as a quantitative trait based on 1,000 Genomes Project trio data. Three of the primary sources of variation in mutation rate are age (males have 3x-6x higher rates), environment, and genetic variation (e.g. inherited aging disorders).

Lee Hood gave an excellent keynote on “Systems Genetics and P4 Medicine”, part of which was a discussion of mutation rate. His group uses whole-genome sequencing (WGS) of family cohorts (in this case, the Miller syndrome family quartet), focusing on the ~2.3 Gbp of non-repetitive reference sequence. Using the family information and inheritance modeling, they identify de novo mutations in the offspring, which manifest as errors of Mendelian inheritance. Validation using a custom capture array for 60,000 candidate sites followed by deep sequencing showed that only 1/1,000 “new” mutations in the offspring were real; the vast majority proved to be sequencing errors. That works out to a mutation rate of 1.1 x 10^-8 per base per generation, or roughly 70 mutations per child.

Lynn Jorde (Univ. of Utah) later gave a talk on directly estimating human mutation rate by WGS, also using the Miller syndrome quartet. Sequencing by Complete Genomics yielded >50-fold coverage per subject; there were ~4 million positions in the 1.8 Gbp of “useful” reference sequence in which at least one subject differed from the reference. Only 330,000 or so SNPs were novel (not known to dbSNP), and 20% of these proved to be sequencing errors. More array validation, more calculations, and the same answer as given by Dr. Hood: a mutation rate of 1.1 x 10^-8.
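The jump from a per-base rate to "roughly 70 mutations per child" is a one-line calculation. The sketch below assumes a haploid genome size of about 3.2 billion bases, which is a commonly used round figure and not a number given in either talk.

```python
# Converting a per-base mutation rate into expected de novo mutations per child.
mutation_rate = 1.1e-8       # per base per generation (from the talks)
haploid_genome_bp = 3.2e9    # assumption: approximate haploid human genome size

# A child inherits two haploid genomes, one from each parent.
diploid_bp = 2 * haploid_genome_bp

expected_de_novo = mutation_rate * diploid_bp
print(f"Expected de novo mutations per child: ~{expected_de_novo:.0f}")
```

The result is sensitive to the genome size you plug in, so the agreement with the quoted figure is approximate rather than exact.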

Theme 2: Personal Cancer Genomes

Cancer genomes were another focus of the meeting. Sean Grimmond (Univ. of Queensland, Brisbane, Australia) presented some of his group’s work on pancreatic cancer as part of the International Cancer Genome Consortium (ICGC). Pancreatic cancer is one of the deadliest forms of cancer; about 90% of patients diagnosed die within one year. Brisbane has assembled a very nice workflow from sample collection to sequencing that includes pathology review, tumor dissection, QA, and microarray analysis to determine tumor cellularity. The sequencing strategy (WGS, exome, and RNA-seq) differs between high-cellularity (70-100%) and low-cellularity (~30%) tumors. The ultimate deliverable is a “tumor report” documenting cellularity estimates, microarray findings, cytogenetics, what sequencing was done, and what mutations were found.

James Brugarolas (UT Southwestern Medical Center) described the genome evaluation and functional studies of a patient with clear cell renal carcinoma. I learned a bit more about this form of cancer – 85% of tumors prove to be the “clear cell” carcinoma; common lesions include 3p loss (VHL gene) and 5q35 gain. This particular tumor underwent Illumina whole-genome sequencing to 35x coverage; some 46 somatic mutations were validated. One of these was in a gene whose protein product complexes with mTOR, the central player in a known cancer pathway. The tumor was successfully xenografted to a mouse model; some 43/46 somatic mutations were retained, and all had higher frequencies (similar to our findings on basal-like breast cancer). The xenograft let them test a few different cancer drugs – erlotinib (an EGFR inhibitor that had no effect), sunitinib (the front-line therapy for these patients, also no effect), and others. Intriguingly, however, the tumor was sensitive to an mTOR inhibitor compound.

Rick Wilson (The Genome Center at Washington University) gave a talk on whole-genome sequencing of leukemia patients at WashU. Of the 50+ leukemia patients sequenced to date, most have fewer than 20 valid protein-altering mutations. For most patients, low-resolution cytogenetic screens are the paradigm for disease classification and treatment decisions. Favorable-risk patients (17% of cases) undergo light chemotherapy. For adverse-risk patients (22% of cases), an allo-matched bone marrow transplant is the standard of care. That leaves a large body of patients (~61%) with “intermediate” risk according to cytogenetics; here, the correct treatment decision is harder to make. Better stratification of intermediate-risk patients is the first goal. Dr. Wilson related a fascinating case study, a 39-year-old female with suspected acute promyelocytic leukemia, in which rapid-turnaround WGS was able to provide an accurate diagnosis that was not obtained by conventional FISH, and ultimately guided her treatment.

Theme 3: Genome Regulation and Epigenetics

Peter Laird (Univ. Southern California, LA) led us out of the genome to the epigenome with his talk on mining the cancer methylome. He argued that the first steps in oncogenesis may be epigenetic changes, specifically, the dysregulation of genes due to abnormal methylation. Dr. Laird presented what he’s calling the first cancer methylome – a tumor sample and matched normal control that underwent bisulfite treatment and sequencing to ~30x coverage. As expected, bisulfite sequencing yielded very accurate estimates of DNA methylation (r=0.97 with Illumina Infinium) but was able to do so across the complete human genome with base-pair resolution.

Theme 4: Exome Sequencing

There is a ton of exome sequencing going on. I saw at least two posters describing “whole” exome sequencing in 1,000 cases and 1,000 controls. I put “whole” in quotes because it’s not true at this point; people really shouldn’t be going around saying that the “whole exome” was sequenced. It’s more like 80-90% of known genes. Rick Lifton spoke about some of the valuable applications of exome sequencing – finding dominant reproductive lethal mutations, unraveling recessive traits with high locus heterogeneity, characterizing somatic mutations in cancer, and identifying rare variants associated with common disease. He described recently published work in which recessive mutations in WDR62 were linked to severe brain malformations by exome sequencing. Matt Bainbridge gave a nice overview of the exome sequencing currently under way at Baylor. So yes, it turns out that groups outside of WashU are doing exome sequencing too.

This Week at Personal Genomes

September 8, 2010 by Dan Koboldt

Later this week, I’ll attend the Personal Genomes meeting at Cold Spring Harbor Laboratory. This is a smaller meeting (fewer than a hundred participants), but an excellent one by all accounts.

Keynotes

There are three keynotes, including one from NHGRI director Eric Green. His and the keynote from Stanford’s Henry Greely seem focused on applying genomic information in the clinic, a theme that will undoubtedly resonate throughout the meeting.

Greely, H.	Preparing for the coming tsunami of clinical genomic information
Green, E.D.	Genomics in 2K10 and beyond—Charting a course for genomic medicine
Hood, L.	Systems genetics and systems biology

Talks

I’d heard that the talks at this meeting were of exceptional quality, and by the look of the abstracts, this trend seems likely to continue. The diversity of subject matter is impressive: there are updates on sequencing technology (J. Beechem on quantum-dot nanosequencing, Jonathan Rothberg on IonTorrent) and studies of human genetic variation in general (Conrad). I’m looking forward to a talk by my friend Matthew Bainbridge of Baylor College of Medicine, who will report on mutation discovery [likely by exome sequencing] for autosomal dominant diseases. There will be talks on sequencing to study other heritable complex diseases, such as Crohn’s disease and atherosclerosis.

Bainbridge, M.	Mutation discovery for autosomal dominant diseases
Beechem, J.	Single molecule real-time DNA sequencing on the surface of a quantum-dot nanocrystal
Brugarolas, J.B.	Genome evaluation, functional studies, and research translation in renal cell carcinoma
Conrad, D.	Variation in genome-wide mutation rates within and between human families
Dimitrova, N.	Correlating genotyping and gene expression data with next-generation whole genome sequencing data
Gibson, N.W.	A comparison of two methods for digitally quantifying mRNAs
Grimmond, S.M.	Studying pancreatic cancer at single nucleotide resolution
Jones, S.J.	Clinical utility of genomic sequencing of a rare adenocarcinoma
Jorde, L.B.	Direct estimates of the human mutation rate using whole-genome sequence data
Laird, P.W.	Mining the cancer methylome
Lunshof, J.E.	Personal genomes and phenomes—Reframing health
Myers, R.M.	Personal functional genomics
Patil, P.	Refining a method for processing an individual’s whole genome to clinical utility
Pérez-Llamas, C.	IntOGen, Integrative OncoGenomics for personal cancer genomes
Ritz, A.	Algorithms for resequencing and assembly using strobe sequencing data
Rothberg, J.M.	PostLight sequencing with semiconductor chips
Schadt, E.	Enabling a more comprehensive understanding of your risk of infection from viral pathogens via the construction of a real-time disease weather map
Schreiber, S.	Whole genome sequence of a Crohn disease trio—A paradigm for complex disease etiology discovery
Teer, J.K.	Comparison and application of whole exome and genome sequencing on an individual with high risk for atherosclerosis
Trevino, L.R.	Screening for germline variants that predispose to cancer from next-generation sequencing data
Varley, K.	Allele-specific DNA methylation in a three-generation family reveals genetic influence on epigenetic regulation
Wang, J.	Personal genomes are personalized
Wilson, R.	Whole genome sequencing, analysis and diagnosis of a patient with acute promyelocytic leukemia (APL)
Worthey, E.	Personal genomics in a clinical setting—Experience from an academic medical college and children’s hospital
Wyman, S.K.	Post-transcriptional modification of microRNAs is a common, conserved mechanism that increases complexity in the microRNA transcriptome
Yandell, M.D.	Automated high-throughput analysis of personal genome sequences—Towards clinical interpretation

Cancer will feature prominently, with talks on pancreatic cancer, adenocarcinoma, and renal cell carcinoma. Peter Laird of USC, a member of the Cancer Genome Atlas research consortium and methylation expert, has a talk on mining the cancer methylome. Rick Wilson, director of the Genome Center at Washington University in St. Louis, will present some very recent work on sequencing and diagnosis of a patient with acute promyelocytic leukemia (APL).

Posters

No matter the recent debate on the usefulness of poster sessions at scientific conferences, I’m looking forward to this one. I’ll be presenting my group’s work on somatic mutation detection by whole-genome and exome sequencing of five patients with ovarian cancer. These are pre-publication results and (in my opinion) make for an interesting comparison. The question is very pertinent: how do current exome sequencing approaches like Agilent SureSelect perform relative to whole-genome sequencing, when it comes to detecting somatic mutations in coding regions of the genome? Nathan Dees from my group has a poster on another interesting project: whole genome sequencing of a primary breast tumor, liver metastasis, and lung metastasis samples from a single patient. There are many interesting posters, too many to talk about. Here’s the full list:

Adams, D.R.	The NIH Undiagnosed Diseases Program—Application of genome-scale sequencing to diagnostic mysteries in single families
Ahn, S.	Comparing and combining two next-generation sequencing technologies for human genome re-sequencing
Bolser, D.M.	The social, political, and economic impact of personal genomes
Brodzik, A.K.	Recent advances in sequence homology assessment in the difference set space with application to the analysis of human genomes
Brunham, L.	Differential effect of the rs4149056 variant in SLCO1B1 on myopathy associated with simvastatin and atorvastatin
Calvo, S.	Targeted sequencing identifies causal disease genes in individual patients with mitochondrial disease
Caruccio, N.	Improved methods for rRNA removal and mRNA-Seq library preparation
Casals, F.	Medical genomics of primary immunodeficiencies
Cho, V.E.	Identification of individuals within study cohorts with unusual intermediate phenotypes
Choi, M.	A compilation of rare functional variations from human exomes
Craig, D.W.	Whole-genome sequencing of autosomal recessive autism
Decker, B.	Clinical analysis of whole genome sequence data at the Medical College of Wisconsin
Dees, N.	Disease progression from primary breast tumor to liver and lung metastases
Dewal, N.	Haplotype specific amplification in high-throughput tumor sequence data
Dimitrova, N.	Multi-modal suite for disease specific analysis of next-generation sequencing data
Dinwiddie, D.L.	Carrier screening of recessive genetic disorders by target enrichment and next-generation sequencing
Dorkins, H.R.	Personal genomes and tomorrow’s doctors
Gonzaga-Jauregui, C.	Assessment of copy-number variation in a family using both whole genome sequencing and array CGH
Gusev, A.	Whole genome low-pass sequencing combined with GWAS data detects variants associated with cholesterol and hemoglobin levels in individuals from the island of Kosrae, Micronesia
Hall, I.M.	Capturing the full spectrum of coding variation with de novo exon assembly
Hambuch, T.	Experiences of whole genome sequencing in the clinical laboratory
Huang, A.L.	Genetic basis of human sleep behaviors—Studies from familial sleep phase syndromes
Ju, Y.	The fine-scale structure of genomic variants and its functional influence on gene expression
Koboldt, D.C.	Somatic mutation discovery in ovarian cancer by whole genome and exome sequencing
Kuersten, S.	Enhanced method to capture the small RNA transcriptome
Lerner-Ellis, J.	Implementing 2nd generation sequencing in the clinic
Markello, T.C.	Whole exome and whole genome sequencing in the NIH Undiagnosed Diseases Program
Metzker, M.L.	Molecular and biochemical characterization of novel syndromes of ketosis-prone diabetes (KPD)
Parla, J.	A comparative evaluation of SNP discovery in human whole exome sequence data versus human whole genome sequence data
Phan, L.	dbSNP and dbVar—NCBI databases of simple and structural variations
Quinlan, A.R.	The landscape of functional mutation in the human exome
Reid, J.	miRNA precursor variants and their possible effects on expression and function
Repo, S.	CAGI—The Critical Assessment of Genome Interpretation, a community experiment to evaluate phenotype prediction
Ross, M.	An approach to clinical interpretive tools for whole genome sequencing
Sabo, A.	The ARRA autism sequencing collaboration, Phase 1—Deep whole exome sequencing in 1000 autism cases and 1000 matched controls
Saito, T.L.	Managing genome databases with UTGB Toolkit
Sen, S.K.	Transcriptome profiling of cardiovascular disease by massively parallel short-read DNA sequencing
Shah, A.	Massively parallel screening of genetic alterations in common cancers
Stong, N.E.	Telomere analysis using next-gen sequence data
Swan, M.	The application of genome-wide association studies of aging in a patient-driven clinical trial
West, J.	Whole-genome sequencing of a family of four—Educational and ethical perspectives
White, L.D.	The emerging role of core sequencing facilities in the personal genomes era
Xing, E.P.	Exploiting a hierarchical clustering tree of gene-expression traits in eQTL analysis
Xing, E.P.	Leveraging genetic interaction networks for joint mapping of marginal and epistatic eQTLs
Xing, E.P.	MoGUL—Detecting common insertions and deletions in a population
Yan, J.	Using genetic information in risk prediction for alcohol dependence in the Collaborative Study on the Genetics of Alcoholism GWAS sample
Yu, F.	Low coverage personal genomics enabled by an integrative SNP pipeline

As many members of the NGS blogosphere are aware, CSHL has implemented some strict rules of blogging while at their meetings. Thus, I’m likely signing off until next week, when I’ll post a full report.

A Foundation for Next-Generation Analysis Tools

August 11, 2010 by Dan Koboldt

The emergence of next-generation sequencing has presented numerous significant challenges to the bioinformatics community. NGS instruments have given rise to a new generation of software tools for the alignment, assembly, management, and visualization of incredible amounts of data. New algorithms have also been developed to assess coverage, assess genomic copy number, call variants (SNPs/indels), and infer large-scale structural variation.

Regardless of their purpose, most tools for NGS data analysis are under increased demand for the same things:

Efficiency – in the face of ever-growing throughputs from NGS instruments
Flexibility – to accommodate new sequencing platforms, experimental protocols, and input formats
Scalability – to continually improve upon and enhance their features as needs evolve

The definition and widespread acceptance of the Sequence Alignment Map (SAM) as the standard format for representing NGS data was a key development for the field. Aaron McKenna and colleagues at the Broad Institute have just published another advance – the Genome Analysis Toolkit (GATK), a structured programming framework for NGS data analysis. Essentially, GATK is a foundation of code that takes advantage of the SAM/BAM input format to simplify many of the common requirements for data analysis tools. The core system can accommodate reads from any sequencing platform, as long as they’ve been converted to SAM/BAM format. It therefore supports most sequence aligners, and also recognizes public database formats (HapMap, dbSNP) and some of the common data-exchange file formats (e.g. GLF and VCF). It’s written in Java, which means that the framework is operating-system-independent as well.

GATK implements something called a “MapReduce” paradigm to allow analysis tasks to be performed in parallel. If you’re developing a new analysis tool, there are a few different ways (traversals) to get to the data that’s in a BAM file. For example, if you wanted to compute the average read length, you could use the TraverseReads scheme to pull out every read and walk through them. Alternatively, if you wanted to calculate the average read depth across the genome, you could use the TraverseLoci scheme to pull out information (reference base, read bases, etc.) at every base in the genome. The best part is that you don’t have to write any of the code for indexing, retrieving, and parsing NGS data – that’s already done. You can focus on your analysis tool, while the GATK developers can continually improve the core engine.
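To show the general shape of the map/reduce idea for the average-read-length example, here is a minimal sketch in plain Python. It does not use GATK (which is Java) or any of its classes; the reads are made-up strings standing in for records from a BAM file, and the function names are hypothetical.

```python
from functools import reduce

# Hypothetical reads standing in for records pulled from a BAM file.
reads = ["ACGTACGTAC", "ACGTAC", "ACGTACGTACGTAC", "ACGT"]


def map_read(read):
    """Map step: turn one read into a partial result (total_bases, read_count)."""
    return (len(read), 1)


def reduce_pair(acc, item):
    """Reduce step: combine two partial (total_bases, read_count) results."""
    return (acc[0] + item[0], acc[1] + item[1])


def average_read_length(reads):
    """Run map then reduce over all reads and return the mean length."""
    total, count = reduce(reduce_pair, map(map_read, reads), (0, 0))
    return total / count if count else 0.0


# The reduce step is associative, so shards can be processed independently
# (for example on separate CPUs) and their partial results merged afterwards.
shard_a = reduce(reduce_pair, map(map_read, reads[:2]), (0, 0))
shard_b = reduce(reduce_pair, map(map_read, reads[2:]), (0, 0))
merged = reduce_pair(shard_a, shard_b)

print(average_read_length(reads))
print(merged[0] / merged[1])
```

The key design point is that the tool author writes only the per-read map function and the combining reduce function; how the data gets split and scheduled is left to the framework.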

Analysis Tools Built on GATK

The authors demonstrate two simple applications that were developed using the GATK framework. The first, a depth-of-coverage tool, took just 83 lines of code to generate a depth-of-coverage report for every position in a given locus (or the whole genome). This might easily be developed into a highly automated, graphic-supported system for reporting coverage on, say, an exome sequencing project. The second demonstration tool was a simple Bayesian genotyper (57 lines), which uses posterior probability to determine the most likely genotype at each position in the reference.
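For readers who have not seen one, a simple Bayesian genotyper can be sketched in a few dozen lines. This is not the code from the paper: the uniform error model, the error rate, and the unnormalized prior weights are all illustrative assumptions, and the pileup is a plain string of observed bases rather than anything read from a BAM file.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def base_likelihood(observed, allele, error_rate):
    """P(observed base | true allele) under a uniform error model (assumption)."""
    return 1.0 - error_rate if observed == allele else error_rate / 3.0


def genotype_posteriors(pileup, ref_base, error_rate=0.01, het_prior=1e-3):
    """Return posterior probabilities for all 10 diploid genotypes at one site."""
    log_scores = {}
    for a1, a2 in combinations_with_replacement(BASES, 2):
        # Illustrative prior weights (unnormalized): hom-ref is most likely,
        # genotypes carrying one ref allele are rare, everything else rarer.
        if a1 == a2 == ref_base:
            prior = 1.0
        elif ref_base in (a1, a2):
            prior = het_prior
        else:
            prior = het_prior / 2
        # Each read samples either chromosome with probability 1/2.
        log_lik = sum(
            math.log(0.5 * base_likelihood(b, a1, error_rate)
                     + 0.5 * base_likelihood(b, a2, error_rate))
            for b in pileup
        )
        log_scores[a1 + a2] = math.log(prior) + log_lik
    # Normalize in log space to avoid underflow at high depth.
    top = max(log_scores.values())
    total = sum(math.exp(s - top) for s in log_scores.values())
    return {g: math.exp(s - top) / total for g, s in log_scores.items()}


def call_genotype(pileup, ref_base, **kwargs):
    """Pick the genotype with the highest posterior probability."""
    posteriors = genotype_posteriors(pileup, ref_base, **kwargs)
    return max(posteriors, key=posteriors.get)


print(call_genotype("A" * 10 + "G" * 10, "A"))   # balanced A/G pileup
print(call_genotype("A" * 19 + "G", "A"))        # one stray G among 19 A reads
```

With these assumed settings, a balanced pileup is called heterozygous while a single discordant read is absorbed by the error model, which is the basic behavior you want from any genotyper before adding refinements such as base qualities.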

I’m aware of at least two more valuable NGS data analysis tools that were built on this framework. The first is actually the framework’s foundation, Picard (http://picard.sourceforge.net), which contains a number of SAM/BAM parsing elements, but perhaps more importantly, has the widely used “MarkDuplicates” tool for identifying redundant sequences in NGS data. The second tool, one that I’ve recently been evaluating, is the GATK indel genotyper. Given a pair of BAM files from a tumor sample and matched (normal) control, the GATK indel genotyper implements a stringent algorithm to call indels and determine their somatic status (Germline or Somatic) based on the evidence in both files. Optionally, this can be done with local realignment of reads around indel positions, which helps remove some false positive variant calls. Compared to other tools for indel calling, GATK seems to offer greater precision (fewer false positives), while maintaining sensitivity, in the datasets that I’ve tested.

Next-Generation Informatics

I readily admit that I don’t know enough about parallelization to discuss it in detail, but what I read in the paper seems encouraging. On a single CPU, the simple Bayesian genotyper took something like 14 hours to complete chromosome 1 of a whole-genome sequence. But when offered 12 CPUs, the built-in parallel processing support of GATK brought down execution time almost 12-fold, to about an hour and a half. It strikes me that frameworks such as this, coupled with the latest 4-core, 8-core, even 50-core CPUs, may finally be bioinformatics’ answer to the challenge of massively parallel sequencing.

References

McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, & Depristo MA (2010). The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome research PMID: 20644199