June 2010

There is growing interest in applying next-generation sequencing to targeted regions of interest, particularly the “exome” – the set of coding exons in the human genome. A paper in Genome Biology from Matthew Bainbridge and colleagues at Baylor describes solution-phase exome capture and sequencing of a HapMap sample with just 3 Gbp of data. The 1000 Genomes Project recently announced a new pilot study focused on exome sequencing for hundreds of individuals. A few studies of human exome resequencing to identify disease genes have been published, and more are sure to come as genome centers ramp up their exome capabilities.

Yet this week’s In Sequence magazine writes that there are concerns about what exome capture is missing. For example, at CHI’s Beyond Sequencing meeting this week, researchers from NCI reported that current exome capture projects omit some medically important genes, such as insulin, ABO blood group, and HLA. Of course, some of this can be attributed to GC-rich exons and other tough-to-capture regions. The bigger concern is that many RefSeq coding sequences aren’t even targeted by the two commercial platforms – 23% are missing from Nimblegen’s 2.1M array, and 17% are missing from Agilent’s SureSelect (according to the NCI group).

Exome Sequencing on Illumina and SOLiD

Even so, exome sequencing is rapidly reaching maturity. The Baylor study, led by Matt Bainbridge, used a customized Nimblegen solution-phase capture product to target 36 Mbp of consensus coding sequence (CCDS), and sequenced capture libraries on both ABI SOLiD and Illumina GAII platforms. Six individual capture libraries were generated from HapMap sample NA12812. Four were sequenced as technical replicates on SOLiD, while two more libraries went to Illumina single-end and paired-end sequencing.

On average, some 49.6% of mappable reads from the four SOLiD libraries were derived from target regions, with the remainder mapping elsewhere in the genome. The target coverage correlation between the four replicates was 98%, suggesting that reproducibility across capture and SOLiD sequencing was pretty good.

Duplication Rates in Exome Capture

The authors performed a detailed analysis of duplication rates in their data, a metric that directly determines unique coverage and affects downstream analysis. The duplication rate for the three SOLiD libraries with 3 Gbp of data was ~22%, and highly consistent between replicates. Duplication was higher (~33%) in the fourth SOLiD library, which is not surprising since it had more than three times as much data (10 Gbp).

Intriguingly, the authors used simulations to demonstrate that the “expected” duplication rates for 3 Gbp and 10 Gbp of data are 14% and 22% by random chance. Set against the observed rates of ~22% and ~33%, that suggests roughly two-thirds of observed duplicates are not artifactual, but chance events.
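To make that comparison concrete, here is a minimal sketch that divides the simulated chance rate by the observed rate for each data volume. It uses only the rounded percentages quoted in this post, and the function name `chance_share` is a label of my own, not anything from the paper.

```python
# Observed and simulated ("expected by chance") duplication rates quoted above.
observed = {"3 Gbp": 0.22, "10 Gbp": 0.33}
expected_by_chance = {"3 Gbp": 0.14, "10 Gbp": 0.22}


def chance_share(observed_rate, expected_rate):
    """Fraction of the observed duplication rate that chance alone would explain."""
    return expected_rate / observed_rate


shares = {
    label: chance_share(observed[label], expected_by_chance[label])
    for label in observed
}

for label, share in shares.items():
    # The remainder is the excess duplication that chance does not account for.
    print(f"{label}: {share:.0%} by chance, {1 - share:.0%} in excess")
```

Because the inputs are rounded, the shares should be read as approximate.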

Paired-end sequencing offers the opportunity to identify duplicates using both reads in a read pair. Theoretically, this should help distinguish artifacts from chance events. Indeed, the authors observed a dramatic difference in duplication rate between the Illumina fragment (single-end) library (30.97%) and the paired-end library (8.3%), even though both generated about 2.5 Gbp of data. They surmised that the improved identification of duplicates from paired-end sequencing, not a difference in library construction, was the reason. When pairing information was ignored, the duplication rate in the PE library more than tripled, to 27.6%.
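The effect of pairing information on duplicate calls can be sketched with a toy example. This is not the authors' pipeline: the read tuples, the `duplication_rate` function, and the rule that reads sharing an alignment key are duplicates are all simplifying assumptions made for illustration.

```python
from collections import Counter


def duplication_rate(reads, use_mate=True):
    """Fraction of reads that share an alignment key with another read.

    Each read is a tuple (chrom, start, strand, mate_start).
    With use_mate=False the mate position is ignored, which mimics
    fragment (single-end) data.
    """
    if not reads:
        return 0.0
    keys = Counter(
        (chrom, start, strand, mate_start if use_mate else None)
        for chrom, start, strand, mate_start in reads
    )
    # Within each group of identical keys, all but one read count as duplicates.
    duplicates = sum(count - 1 for count in keys.values())
    return duplicates / len(reads)


# Hypothetical reads: four start at the same position, but only two of
# them also share a mate position.
toy_reads = [
    ("chr1", 100, "+", 310),
    ("chr1", 100, "+", 310),  # same start and same mate: likely an artifact
    ("chr1", 100, "+", 352),  # same start, different mate: distinct fragment
    ("chr1", 100, "+", 298),
    ("chr1", 205, "-", 40),
]

pe_rate = duplication_rate(toy_reads, use_mate=True)
se_rate = duplication_rate(toy_reads, use_mate=False)
print(f"paired-end key: {pe_rate:.0%}, single-end key: {se_rate:.0%}")
```

In the toy data, ignoring the mate position triples the apparent duplication rate, which is the same direction of change reported for the PE library.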

SNP Discovery and HapMap Concordance

Because this was a HapMap sample, the authors were able to compare SNPs identified in sequencing to known genotypes from the HapMap Project. Genotype concordance in the target regions was 82% for the 3 Gbp libraries and 92% for the 10 Gbp library, but importantly, this considered all sites regardless of coverage. When the authors limited comparisons to sites with >=9x unique read depth, concordance was ~95%. That’s still a bit low for my taste, but within the realm of expectation for sequence-to-genotype comparisons.
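The depth filter matters because low-coverage sites drag concordance down. Below is a small sketch of a depth-filtered concordance calculation; the site names, genotypes, and depths are invented, and treating a missing call as discordant is my assumption rather than a detail from the paper.

```python
def normalize(genotype):
    """Sort alleles so that 'GA' and 'AG' compare as equal."""
    return "".join(sorted(genotype))


def concordance(calls, truth, depth, min_depth=0):
    """Share of known-genotype sites where the sequence-based call matches.

    calls, truth: dict mapping site -> genotype string such as 'AG'
    depth: dict mapping site -> unique read depth
    Sites below min_depth are skipped; a missing call counts as discordant.
    Returns None when no site passes the filter.
    """
    sites = [site for site in truth if depth.get(site, 0) >= min_depth]
    if not sites:
        return None
    matches = sum(
        site in calls and normalize(calls[site]) == normalize(truth[site])
        for site in sites
    )
    return matches / len(sites)


# Hypothetical data for four genotyped sites.
truth = {"s1": "AG", "s2": "CC", "s3": "TT", "s4": "AG"}
calls = {"s1": "GA", "s2": "CC", "s3": "CT", "s4": "AA"}
depth = {"s1": 30, "s2": 12, "s3": 4, "s4": 9}

all_sites = concordance(calls, truth, depth)
covered_sites = concordance(calls, truth, depth, min_depth=9)
print(f"all sites: {all_sites:.0%}, sites at >=9x: {covered_sites:.0%}")
```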

SOLiD Versus Illumina Sequencing

I was pleased that Bainbridge and his colleagues made some direct comparisons between SOLiD and Illumina sequencing. This is a delicate issue, from the point of view of the sequencing vendors, but one of great interest to the NGS community. The Illumina PE data yielded ~25% more SNP calls in target regions, with higher HapMap concordance (98%) than ABI SOLiD data (95%). The authors attribute this to the better mapping, higher coverage, and lower duplication rate made possible by paired-end sequencing. Considering only HapMap heterozygous SNPs, SOLiD outperformed Illumina at low (<9x) coverage, but Illumina consistently yielded 2-3% higher concordance at high coverage.

In their concluding section, the authors write “Interestingly, Illumina sequencing consistently shows higher levels of enrichment than SOLiD sequencing. This is unexpected because both sequencing platforms yield similar coverage distributions in whole genome sequencing data… therefore we suspect that differences in efficiency are due to an increase in initial library complexity from better annealing efficiencies of the Illumina adapter.”

Such a frank conclusion, from a group that’s highly invested in SOLiD sequencers, is especially telling. When it comes to exome sequencing, Illumina seems to have the advantage.

References
Bainbridge MN, Wang M, Burgess DL, Kovar C, Rodesch MJ, D’Ascenzo M, Kitzman J, Wu YQ, Newsham I, Richmond TA, Jeddeloh JA, Muzny D, Albert TJ, & Gibbs RA (2010). Whole exome capture in solution with 3 Gbp of data. Genome Biology, 11 (6) PMID: 20565776

A letter to Nature this week presents the whole-genome sequencing of a non-small-cell lung cancer tumor. Over 500 validated mutations (530 SNVs and 43 structural variants) offer an unprecedented view of genetic variation and selection in solid tumors.

Using arrays of self-assembling DNA nanoballs (DNBs, i.e., the Complete Genomics platform), Lee et al. sequenced a primary lung tumor (to 60x) and matched normal tissue (to 46x). They also performed SNP genotyping (Affy 6.0) and array-CGH (Agilent 244A) to assess genome-wide DNA copy number, allelic imbalance, and loss of heterozygosity (LOH). The tumor sample “bears many of the hallmark copy number alterations commonly found in smoking-associated lung cancer” including copy number loss of TP53, amplification of CDK4 and KRAS, and copy-number-neutral LOH of chromosome 13 across the RB1 locus.

Somatic Mutations and the Cost of Smoking

A comparison of tumor and normal sequences yielded some 83,000 predicted somatic SNVs. After validating 70% of predicted coding region SNVs, the authors re-tuned their prediction algorithms (to 90% specificity and 82% sensitivity) and called 50,675 high-confidence somatic mutations genome-wide. The patient was a 51-year-old man who’d reported smoking 25 cigarettes a day for 15 years prior to surgery. What was the cost of his unhealthy habit? At $3.50 a pack, it works out to around $24,000. At 50,000 mutations, it works out to one mutation for every 2.7 cigarettes. Consider that, smokers, the next time you decide to light up.
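For anyone who wants to check the back-of-the-envelope numbers, here is the arithmetic spelled out. The 20-cigarette pack size and the 365-day year are my assumptions; everything else comes from the figures above.

```python
CIGARETTES_PER_DAY = 25
YEARS_SMOKED = 15
DAYS_PER_YEAR = 365        # assumption: leap days ignored
CIGARETTES_PER_PACK = 20   # assumption: standard pack size
PRICE_PER_PACK = 3.50      # dollars, as quoted above
SOMATIC_MUTATIONS = 50_675  # high-confidence calls genome-wide

total_cigarettes = CIGARETTES_PER_DAY * DAYS_PER_YEAR * YEARS_SMOKED
total_cost = total_cigarettes / CIGARETTES_PER_PACK * PRICE_PER_PACK
cigarettes_per_mutation = total_cigarettes / SOMATIC_MUTATIONS

print(f"cigarettes smoked: {total_cigarettes:,}")
print(f"cost: ${total_cost:,.0f}")
print(f"cigarettes per mutation: {cigarettes_per_mutation:.1f}")
```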

Mutation Rate and Spectrum

Compared to the observed germline variation, the pattern of somatic mutations was strikingly different, favoring changes at G-C base pairs (78%), the majority of which were G/C->T/A transversions (46%). Similar patterns were observed in the lung cancer cell line recently sequenced by Pleasance et al., and they underscore the strong influence of smoking-induced DNA damage.
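The G/C->T/A notation collapses complementary changes, so that G->T on one strand and C->A on the other are counted as the same event. A minimal sketch of that bookkeeping follows; the function names and the toy SNV list are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
PURINES = {"A", "G"}


def is_transversion(ref, alt):
    """True when a purine is swapped for a pyrimidine or vice versa."""
    return (ref in PURINES) != (alt in PURINES)


def spectrum_class(ref, alt):
    """Return a strand-collapsed class such as 'G/C->T/A'."""
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    # Describe every change from the strand on which the reference base is
    # a purine, so that complementary changes share one label.
    if ref not in PURINES:
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}/{COMPLEMENT[ref]}->{alt}/{COMPLEMENT[alt]}"


# Hypothetical somatic SNVs as (reference base, tumor base).
toy_snvs = [("G", "T"), ("C", "A"), ("G", "A"), ("A", "G"), ("C", "T")]
spectrum = Counter(spectrum_class(ref, alt) for ref, alt in toy_snvs)
at_gc_pairs = sum(n for label, n in spectrum.items() if label.startswith("G/C"))

for label, n in spectrum.most_common():
    print(label, n)
print(f"changes at G-C base pairs: {at_gc_pairs / len(toy_snvs):.0%}")
```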

The authors estimated an overall mutation rate of 17.7 mutations per megabase. Some 17 mutations occurred in the set of 623 genes sequenced by our group (TSP) in 188 lung adenocarcinomas. In that study, non-smokers had fewer than five mutations in the gene set, while smokers had as many as 49. Thus, the authors’ observed mutation rate fits well within the expected range for lung cancers set forth by TSP.

Evidence of Selection in Expressed Genes and Upstream Promoters

The greatest strength of this study was the authors’ analysis of somatic mutation patterns relative to gene structures. They found, for example, that the mutation rate was lower for expressed genes (8.3 per Mb) compared to non-expressed genes (17.5 per Mb), suggesting selective pressure against mutations in active coding regions. Further, mutations were less prevalent on the transcribed strand than the non-transcribed strand, likely due to transcription-coupled DNA repair mechanisms. Intriguingly, the mutation rate in regions 2kb immediately upstream of transcription start sites, i.e., the 2-kb promoters, was 10.5 per Mb, or 40% lower than the genome-wide average. Such an observation suggests that upstream promoters, like coding sequences, are under purifying selection – and supports the notion that these regions harbor key regulatory elements that are disrupted by mutation.
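The relative reductions are easy to verify from the quoted rates. This sketch takes the 17.7 per Mb genome-wide estimate as the baseline for promoters and compares expressed genes with non-expressed genes; `percent_lower` is a helper name of my own.

```python
def percent_lower(rate, baseline):
    """How far a rate falls below a baseline, as a percentage of the baseline."""
    return 100 * (1 - rate / baseline)


# Somatic mutation rates in mutations per megabase, as quoted in this post.
GENOME_WIDE = 17.7
PROMOTER_2KB = 10.5
EXPRESSED = 8.3
NON_EXPRESSED = 17.5

promoter_reduction = percent_lower(PROMOTER_2KB, GENOME_WIDE)
expressed_reduction = percent_lower(EXPRESSED, NON_EXPRESSED)

print(f"2-kb promoters vs genome-wide: {promoter_reduction:.1f}% lower")
print(f"expressed vs non-expressed genes: {expressed_reduction:.1f}% lower")
```

By this measure the depletion in expressed genes is even stronger than the roughly 40% depletion in promoters.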

Genetic Complexity and Redundancy

The authors also validated some 43 somatic structural variations. However, only 27 had breakpoints in genic regions, suggesting that the majority of somatic structural events are passenger mutations. Notably, most somatic SVs map near regions of DNA copy number changes, suggesting that structural events and copy number are inter-related.

Taken together, the results of this study suggest that lung cancer tumors can harbor a surprisingly large number of mutations ranging from single-nucleotide events to megabase-scale structural variation. At least eight genes in the EGFR pathway were mutated or amplified in this tumor sample, indicating a multiplicity of partially redundant mutations. The authors conclude that the tumor tissue might therefore represent a heterogeneous mixture of sub-clonal populations, many of them with distinct mutational landscapes. Unfortunately, the authors did not include deep read count data for the validated mutations, which would have yielded precise mutation frequencies and perhaps given additional support to such a conclusion. If true, however, the genetic complexity and redundancy of lung cancer tumors might help explain why they are so difficult to treat.

We need more studies like these – more patients, more tumor types, more validation – before we can truly get a picture of the full spectrum of mutations that underlie tumor development and progression.

References
Lee W, Jiang Z, Liu J, Haverty PM, Guan Y, Stinson J, Yue P, Zhang Y, Pant KP, Bhatt D, Ha C, Johnson S, Kennemer MI, Mohan S, Nazarenko I, Watanabe C, Sparks AB, Shames DS, Gentleman R, de Sauvage FJ, Stern H, Pandita A, Ballinger DG, Drmanac R, Modrusan Z, Seshagiri S, & Zhang Z (2010). The mutation spectrum revealed by paired genome sequences from a lung cancer patient. Nature, 465 (7297), 473-7 PMID: 20505728

Not-so-whole Exome Sequencing

Mutation and Selection in a Lung Cancer Genome
