The Fruits of a Thousand Genomes

November 1, 2010 by Dan Koboldt

Last week saw the publication of the 1,000 Genomes Project, which has characterized ~15 million SNPs, 1 million short insertions/deletions (indels), and 20,000 structural variants in seven human populations. This is discovery and genotyping at unprecedented scale, with an astonishing 4.9 terabases (trillion bases) sequenced – the equivalent of about 1,500 human genomes – across three pilot projects:

Deep whole-genome sequencing of trios (mother-father-daughter) from 2 populations
Low-coverage sequencing of 179 unrelated individuals from 4 populations
Exon sequencing of 906 randomly-selected genes in 697 individuals from 7 populations.
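
The "about 1,500 human genomes" equivalence is simple division. Here is a quick Python sketch, assuming a haploid genome size of roughly 3.2 billion bases (that figure is my assumption, not something stated in the paper):

```python
# Back-of-the-envelope check of the "about 1,500 human genomes" figure.
TOTAL_BASES_SEQUENCED = 4.9e12  # 4.9 terabases, as reported above
HAPLOID_GENOME_SIZE = 3.2e9     # ASSUMPTION: ~3.2 Gb haploid human genome


def genome_equivalents(total_bases: float, genome_size: float) -> float:
    """Return how many genome-lengths of sequence total_bases represents."""
    return total_bases / genome_size


equivalents = genome_equivalents(TOTAL_BASES_SEQUENCED, HAPLOID_GENOME_SIZE)
print(f"{equivalents:,.0f} genome equivalents")
```

A slightly different genome size shifts the answer by a few percent, which is why "about 1,500" is the right level of precision.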

The three pilots have shed new light on sequence variation in human genomes and its distribution among human populations. Perhaps unsurprisingly, variation was not evenly distributed in the genome – certain regions (e.g. HLA and sub-telomeres) show high rates of variation, whereas others (e.g. a 5 Mbp, gene-dense, highly-conserved region on chromosome 3) show very little. At the chromosomal level, different forms of variation were highly correlated (e.g. SNPs and indels), but there were exceptions for some types of structural variants, implicating different mechanisms of mutation.

Novelty and Population-Specificity

The vast majority of SNPs detected were already known to dbSNP. Among known variants, 56% were present in all population panels while 25% were found in only a single panel. In contrast, only 4% of novel variants were found in all panels and 84% were found in only one. This difference supports the notion that the majority of common SNPs in human populations have already been found. There’s more work to do for other forms of variation, though. Many of the novel SVs were detected in all population panels. Half of the common short indels had never been reported.

The smallest two chromosomes – mitochondrial and Y – seemed to benefit the most. There was a lot of heteroplasmy in mitochondrial DNA within individuals – 79% of samples had length heteroplasmy, and 45% had substitution heteroplasmy. On the Y-chromosome, there were 2,870 variable sites, most of which (74%) were novel to public databases. These new variants helped identify several clear, significant sub-clades within the 12 haplotype groups represented in 1,000 Genomes samples.

Coding Regions and Loss-of-Function Variants

In total, the three pilots identified 68,300 nonsynonymous variants, almost half of which were novel. Genotyping a subset of these in 620 samples revealed that novel nonsynonymous variants had a dramatically lower minor allele frequency (2.2%) than known ones (26.2%). From this I can draw two conclusions: most novel nonsynonymous variants are rare, and the majority could only have been identified by population-scale sequencing projects like these.

The authors estimate that an individual genome differs from the reference at 10,000 to 11,000 nonsynonymous sites and perhaps 12,000 synonymous sites. A typical genome harbors a much smaller number of loss-of-function (LOF) variants — inframe/frameshift indels, early stops, and splice-site variants — perhaps 340-400 LOF variants per individual, affecting 250-300 genes. Compared to synonymous variants, putative functional variants (nonsynonymous and LOF) tend to have lower allele frequencies and be more population-specific, presumably due to the action of purifying selection against deleterious mutations. Which means, of course, that the really important variants are much harder to find.

Signatures of Natural Selection

Looking in and around genes, the authors found diversity is lowest in exons (50% that of introns) and slightly reduced in 5′ and 3′ UTRs, compared to intronic and intergenic sequences. This signature of natural selection acting upon genes actually has a broad effect; diversity is reduced by 10% in the vicinity of genes compared to gene-distant loci, and that reduction extends up to 85 kbp away. Thus, selection on linked sites appears to restrict variation across the majority of the human genome. Looking across panels, the authors observed that SNPs with large allele frequency differences between populations were enriched for nonsynonymous sites, likely reflecting local adaptation and selection by different continental groups.

Finally, the authors examined the trios to look at a different environment for mutation and selection – immortalized cell lines. Some 952/1001 new mutations in the CEU daughter and 634/669 new mutations in the YRI daughter were not present in the germline, indicating that they occurred either in somatic cells or in the cell lines. Further, the higher number of mutations in the CEU sample may be related to the age of the lines – the CEU line is decades older than the YRI line.

Implications for Future Studies

The findings of the 1,000 Genomes Project thus far have immediate, significant impact on genetic association studies. Using publicly available gene expression data and their expanded catalogue of variants, the authors identified 20-30% more significant expression quantitative trait loci (eQTLs) than had previously been detectable. Thus, it is clear that while existing SNP arrays represent the majority of common variation, a significant amount of rare, phenotypically-relevant variation remains to be incorporated.

References
1000 Genomes Project Consortium (2010). A map of human genome variation from population-scale sequencing. Nature, 467 (7319), 1061-73 PMID: 20981092

Transcriptome Genetics with HapMap and RNA-Seq

April 22, 2010 by Dan Koboldt

Two papers in Nature this month leverage the power of second-generation sequencing technologies to investigate gene expression variation in human cell lines. By performing RNA-Seq in HapMap cell lines, the authors generated the most extensive gene expression data to date for these samples, and were able to use publicly available HapMap genotypes to associate expression differences with genetic variation. This strategy was applied to the HapMap samples two years ago using expression microarrays. Using RNA-Seq instead of microarrays, however, offers a few key advantages:

More accurate quantification of highly abundant transcripts, where microarrays reach saturation
Access to rare transcripts below the sensitivity threshold for microarrays
Detection of novel gene structure from alternative splicing and unannotated exons
Identification of allele-specific expression

The first study, from Jonathan Pritchard’s lab at the University of Chicago, sequenced RNA from 69 Yoruban (African) individuals on the Illumina GAII platform. They generated at least two lanes per individual, for a total of 1.2 billion reads, of which 964 million (80%) mapped uniquely to the genome or to exon-exon boundaries. The second study, from Emmanouil Dermitzakis’s group at the Sanger center, sequenced RNA from 60 CEU (CEPH Europeans from Utah) individuals, also on the Illumina GAII platform. They generated one lane of paired-end data per individual, for a total of about 1.0 billion reads. Since neither study provided a table summarizing their data (which I’d have liked), I put one together:

Study	Pickrell et al.	Montgomery et al.
Samples	69, African descent (YRI)	60, European descent (CEU)
Sequencing	Illumina 1X35 or 1X46	Illumina 2X37
Reads/Sample	17.4 million	16.9 million
SNP Dataset	HapMap II/III	HapMap III
Total SNPs	3.8 million	1.2 million
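
The per-sample figures in the table can be roughly reproduced from the totals quoted above. This is a small Python sketch of that arithmetic. The totals are the rounded numbers from this post, so the CEU result lands near, not exactly on, the 16.9 million mean reported by the authors.

```python
def reads_per_sample(total_reads: float, n_samples: int) -> float:
    """Mean number of reads per individual."""
    return total_reads / n_samples


# Totals as quoted in the text (rounded, so results are approximate).
yri_mean = reads_per_sample(1.2e9, 69)   # Pickrell et al.
ceu_mean = reads_per_sample(1.0e9, 60)   # Montgomery et al., "about 1.0 billion"
yri_mapped_fraction = 964e6 / 1.2e9      # uniquely mapped share of YRI reads

print(f"YRI: {yri_mean / 1e6:.1f} million reads/sample")
print(f"CEU: {ceu_mean / 1e6:.1f} million reads/sample")
print(f"YRI uniquely mapped: {yri_mapped_fraction:.0%}")
```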

Up to this point, the two studies sounded nearly identical. For the data analysis, however, each group went in a different (and interesting) direction.

Pooled Data for Discovery of Novel Gene Structures

Novel Exons. The Pritchard group pooled all data to examine the completeness of current gene annotations. Some 86% of uniquely mapped reads corresponded to known exons. Using conservation data from alignments of 28 vertebrate exomes, the authors identified 4,031 regions that are evolutionarily conserved and show evidence of transcription. About one-quarter of these appear to be part of spliced transcripts, but most appear to be novel untranslated regions (UTRs). Some 115 regions, however, had sequences consistent with protein-coding exons. To investigate the possibility that their novel exons are real, the authors used RNA-Seq data from several human tissues and chimpanzee cell lines. The evidence suggests that their regions do represent novel exons, but ones that are expressed in a more tissue-specific fashion than annotated exons.

Novel Poly-A Sites. The authors next screened the ~70 million unmapped sequence reads for long runs of A or T nucleotides, which might indicate novel poly-adenylation sites. Of the ~8,000 novel sites that they identified, some 45% fell within 10 bp of a known cleavage site. To further validate their findings, they screened their poly-A regions for the binding site of the CPSF polyadenylation factor, and found a 32-fold enrichment for the CPSF target hexamer. The net result was a high confidence set of 3,481 cleavage sites that show evidence of poly-A (from RNA-Seq data) and CPSF binding.
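
To make the poly-A screen concrete, here is a minimal Python sketch of the idea: look for unmapped reads that end in a long run of A (or begin with a long run of T, for reads from the opposite strand) and keep the remaining flank, which could then be aligned to the genome to locate the cleavage site. The thresholds `MIN_RUN` and `MIN_FLANK` and the function name are hypothetical. This is not the authors' actual pipeline.

```python
import re

MIN_RUN = 10    # HYPOTHETICAL: minimum length of the A/T run
MIN_FLANK = 8   # HYPOTHETICAL: minimum non-poly-A sequence left to align

# A run of A at the 3' end of the read, or a run of T at the 5' end.
TRAILING_A = re.compile(r"A{%d,}$" % MIN_RUN)
LEADING_T = re.compile(r"^T{%d,}" % MIN_RUN)


def polya_candidate(read: str):
    """Return the flank next to a terminal A/T run, or None if the read
    does not look like it spans a poly-adenylation site."""
    read = read.upper()
    match = TRAILING_A.search(read)
    if match and match.start() >= MIN_FLANK:
        return read[:match.start()]
    match = LEADING_T.search(read)
    if match and len(read) - match.end() >= MIN_FLANK:
        return read[match.end():]
    return None


print(polya_candidate("ACGTACGTACGT" + "A" * 12))  # flank before the A run
print(polya_candidate("ACGTACGTACGTACGT"))         # no run, so None
```

The flank length check matters: a read that is nothing but A's carries no information about where in the genome it came from.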

RNA-Seq: 10 million Reads Is All You Need

The Dermitzakis study generated 16.9 million (+/- 5.9 million) reads per individual, which were mapped to the NCBI 36 reference sequence using Maq (with a maximum insert size of 2 megabases). The resulting alignments were filtered to remove those with low mapping quality and those on the X, Y, or MT chromosomes. Discordant read pairs (by distance or orientation) were also removed. To quantify the expression of known exons/transcripts/genes, the authors scaled read counts for each individual to a theoretical yield of 10 million reads, and only considered exons with data in >90% of individuals. This resulted in data for 90,064 exons from 10,777 genes, of which 95% had at least 10 reads (on average) per individual. While the normalization seems to reduce the dataset to less than half of known genes, it nevertheless provided an extensive view of gene expression across these 60 individuals.
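
The scaling and filtering steps are easy to picture in code. The Python sketch below rests on two assumptions of mine: that each individual's counts are scaled by their total filtered reads, and that "data" for an exon means a nonzero count. The sample names and counts are invented toy values.

```python
TARGET_YIELD = 10_000_000  # theoretical yield of 10 million reads
MIN_FRACTION = 0.90        # exon must have data in >90% of individuals


def scale_counts(exon_counts, total_reads):
    """Scale one individual's per-exon read counts to the target yield."""
    factor = TARGET_YIELD / total_reads
    return {exon: count * factor for exon, count in exon_counts.items()}


def exons_to_keep(scaled_by_individual):
    """Return exons with a nonzero count in more than MIN_FRACTION of
    individuals (ASSUMPTION: 'data' means at least one read)."""
    n = len(scaled_by_individual)
    exons = set().union(*(d.keys() for d in scaled_by_individual.values()))
    keep = []
    for exon in sorted(exons):
        n_with_data = sum(
            1 for d in scaled_by_individual.values() if d.get(exon, 0) > 0
        )
        if n_with_data / n > MIN_FRACTION:
            keep.append(exon)
    return keep


# Toy data: (per-exon counts, total reads) for two made-up individuals.
raw = {
    "sample1": ({"exonA": 40, "exonB": 0}, 20_000_000),
    "sample2": ({"exonA": 15, "exonB": 3}, 5_000_000),
}
scaled = {name: scale_counts(c, total) for name, (c, total) in raw.items()}
print(scaled["sample1"]["exonA"], scaled["sample2"]["exonA"])
print(exons_to_keep(scaled))
```

After scaling, the deeply sequenced sample's count is halved and the shallow sample's is doubled, so the two become comparable.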

Cis-Regulatory Effects on Gene Expression

Using HapMap genotypes for 1.2 million SNPs, the Dermitzakis group identified 836 genes associated with cis-regulatory variants (compared to 539 genes identified in microarray studies of the same individuals). Even when normalized for the number of genes tested, the increased resolution of RNA-Seq over microarrays yielded a larger number of genetic regulatory effects. The RNA-Seq exon eQTLs (expression quantitative trait loci) were enriched for abundant transcripts, suggesting that saturation of highly expressed exons reduces the sensitivity for microarrays to detect some cis-regulatory effects.

The Pritchard group searched for cis-regulatory variation with an even larger dataset – RNA-Seq for 69 individuals and 3.8 million HapMap SNPs. They identified 929 genes with local eQTLs (4.6% of annotated genes); consistent with previous findings, virtually all SNPs associated with expression level were near the corresponding gene. They also reported the overlap with the CEU study results: the top 500 associations reported in CEU samples were enriched 10 to 40-fold for significant eQTLs in YRI samples. Given the marked genetic differences between these two populations, this result suggests that these studies are identifying replicable cis-regulatory events.
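
For readers new to eQTL mapping, the core test is an association between genotype at a SNP and expression of a nearby gene. Below is a bare-bones Python sketch using least-squares regression of expression on allele dosage (0, 1, or 2 copies). The choice of statistic is my assumption for illustration and the values are invented. The two papers used their own, more careful models and multiple-testing corrections.

```python
def cis_association(dosages, expression):
    """Regress expression on allele dosage; return (slope, Pearson r)."""
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    s_gg = sum((g - mean_g) ** 2 for g in dosages)
    s_ee = sum((e - mean_e) ** 2 for e in expression)
    s_ge = sum((g - mean_g) * (e - mean_e)
               for g, e in zip(dosages, expression))
    slope = s_ge / s_gg                  # expression change per allele copy
    r = s_ge / (s_gg * s_ee) ** 0.5      # strength of the association
    return slope, r


# Toy example: six individuals, expression rising with each allele copy.
dosages = [0, 0, 1, 1, 2, 2]
expression = [10, 12, 15, 17, 20, 22]    # INVENTED normalized read counts
slope, r = cis_association(dosages, expression)
print(f"slope = {slope:.2f} per allele, r = {r:.3f}")
```

In a real scan this test is repeated for every SNP near every gene, which is why the number of SNPs (1.2 versus 3.8 million) matters so much for what can be detected.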

Mechanism of Cis-Regulatory Effects

An important feature of RNA-Seq data is that it can be used not only to detect cis-regulatory variation, but to assess the mechanism by which these variants act. The Pritchard group looked at 222 of their 929 eQTLs for which the associated SNPs fell within the gene exons. They classified the RNA-Seq reads as originating from the high-expression haplotype or the low-expression haplotype, and found that for 195 of the genes (88%), more than 50% of the expressed transcripts carried the allele associated with high expression. Therefore, the modulation of gene expression is a direct result of the associated variation (probably by activating nearby cis-regulatory elements). In other words: the eQTL tells us that variants near the gene are associated with its expression. That means something nearby is regulating it. The fact that the haplotype associated with increased expression is the haplotype that predominates tells us that the high-expression allele is what drives the expression of its nearby gene. As opposed to, say, driving expression of the gene from both chromosomes.
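
The allele-specific logic can be sketched in a few lines of Python. For each gene, count reads carrying the high-expression allele versus the low-expression allele at a heterozygous exonic SNP, then ask how often the high allele wins across genes. The binomial sign test below is my own addition to show how lopsided 195 of 222 is. It is not necessarily the test the authors used, and the per-gene read counts are invented.

```python
from math import comb


def high_allele_fraction(high_reads: int, low_reads: int) -> float:
    """Share of allele-informative reads carrying the high-expression allele."""
    return high_reads / (high_reads + low_reads)


def sign_test_p(successes: int, trials: int) -> float:
    """One-sided probability of seeing at least `successes` out of `trials`
    if each gene were equally likely to favor either allele."""
    tail = sum(comb(trials, k) for k in range(successes, trials + 1))
    return tail / 2 ** trials


# Toy read counts for one gene (INVENTED for illustration).
print(high_allele_fraction(30, 10))

# Figures from the text: 195 of 222 genes favored the high-expression allele.
genes_consistent, genes_tested = 195, 222
print(f"{genes_consistent / genes_tested:.0%} of genes")
print(f"sign test p = {sign_test_p(genes_consistent, genes_tested):.2e}")
```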

Allelic Effects on Splicing

Finally, both groups looked at the actual content of expressed transcripts, to find SNPs associated with alternative splicing. The Pritchard group calls these splicing quantitative trait loci (sQTLs), and found 187 genes with significant associations. Binding sites for known splice factors (U1 snRNP and U2AF) were enriched for sQTLs, as were SNPs within 2 bp of a canonical splice site. The Dermitzakis group found 110 genes with significant associations, and stratified splicing-associated variants according to their position in the gene structure. When tested against the exons upstream and downstream of where they resided, splice donor variants were enriched 3.17-fold with the upstream (5′) exon, while splice acceptor variants were enriched 7.02-fold with the downstream (3′) exon. Thus, these SNPs affect the inclusion/exclusion of their exons in the mature transcript.

Dermitzakis’s group visually examined their most significant associations to characterize the mechanism of splicing regulation. Of the 110 significant sQTLs identified in CEU samples:

41% were single exon skipping events
17% created an alternate acceptor
13% were double or triple exon skipping events
6% created an alternate donor
5% were mutually exclusive exons
5% were retained introns.

In summary, these studies establish the feasibility of transcriptome sequencing to assess gene expression and characterize regulatory variation. Indeed, as the title of one study suggests, RNA sequencing is a powerful tool for studying the mechanisms underlying human gene expression variation, and will undoubtedly yield better understanding of the complex relationships between genotype and phenotype.

References
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, & Pritchard JK (2010). Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature, 464 (7289), 768-72 PMID: 20220758

Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, & Dermitzakis ET (2010). Transcriptome genetics using second generation sequencing in a Caucasian population. Nature, 464 (7289), 773-7 PMID: 20220756

The Four Dimensions of a Breast Cancer Genome

April 15, 2010 by Dan Koboldt

Published today in the journal Nature is the whole-genome sequencing of a basal-like breast cancer tumor, metastasis, and xenograft. There’s also a News and Views article by Joe Gray of Lawrence Berkeley National Laboratory, as well as a news feature on large-scale cancer projects.


This study is a bit unlike our previous cancer genomes (AML1 and AML2). By my count it is the sixth cancer genome to be sequenced, and the third to come out of the Genome Center at Washington University. Obviously, it’s our first solid tumor. What’s particularly interesting about this study, however, is that we sequenced four DNA samples from a single patient with “double-negative” breast cancer: the primary tumor, peripheral blood (normal), a brain metastasis, and a mouse xenograft derived from the primary tumor. The xenograft is a success story in itself – we managed to create a human-in-mouse (HIM) transplant of the primary tumor that was >90% pure when harvested 101 days after engraftment.

The genomes of these four samples (tumor, normal, metastasis, and xenograft), examined with the incredible power of Illumina massively parallel sequencing, offer an unprecedented view of the somatic changes that underlie breast cancer development, growth, and metastasis.

Repertoire of Somatic Mutations

We validated a total of 50 somatic sites in at least one of the three cancer genomes, including:

28 missense mutations predicted to alter the sequence of an encoded protein
11 synonymous (silent) mutations in coding sequences
4 small insertions ranging in size from 1 to 6 bp
3 small deletions ranging in size from 1 to 13 bp
2 splice site mutations at intron-exon junctions
1 nonsense mutation predicted to result in a truncated protein
1 RNA mutation in a gene encoding a signal recognition particle (SRP) RNA.

We employed deep Illumina sequencing of PCR amplicons to assess the frequencies of each mutation across all four tissues. Intriguingly, more than half of them exhibited differential frequencies between primary tumor, metastasis, and/or xenograft. Two mutations (a nonsense mutation in MYCBP2 and a missense mutation in TGFBI) were significantly enriched in the primary tumor (88-89% vs 14-44%). Some 26 mutations were significantly enriched in the metastasis and/or xenograft. Perhaps most interesting, however, were two sites (a missense mutation in SNED1 and a silent mutation in FLNC) that appear to be de novo mutations unique to the metastasis.
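
To illustrate what "differential frequency" means here, this Python sketch computes a variant allele frequency (variant reads over total reads) in two tissues and compares them with a pooled two-proportion z statistic. The read counts are invented to resemble the primary-enriched pattern described above, and the z-test is an assumption for illustration, not the statistical method used in the paper.

```python
def variant_allele_frequency(variant_reads: int, total_reads: int) -> float:
    """Fraction of reads at a site that carry the mutant allele."""
    return variant_reads / total_reads


def two_proportion_z(var_a, total_a, var_b, total_b):
    """Pooled two-proportion z statistic comparing tissue A with tissue B."""
    p_a = var_a / total_a
    p_b = var_b / total_b
    pooled = (var_a + var_b) / (total_a + total_b)
    se = (pooled * (1 - pooled) * (1 / total_a + 1 / total_b)) ** 0.5
    return (p_a - p_b) / se


# INVENTED amplicon read counts: (variant reads, total reads) per tissue.
primary = (880, 1000)
metastasis = (140, 1000)

vaf_primary = variant_allele_frequency(*primary)
vaf_metastasis = variant_allele_frequency(*metastasis)
z = two_proportion_z(*primary, *metastasis)
print(f"primary {vaf_primary:.0%}, metastasis {vaf_metastasis:.0%}, z = {z:.1f}")
```

With amplicon depths in the hundreds or thousands of reads, even modest frequency shifts between tissues become easy to distinguish from sampling noise.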

Acquired Structural Variation

Using our internally developed tools for structural variant prediction (BreakDancer) and de novo assembly (TIGRA), we predicted 59 deletions and 18 inversions that were putative somatic events. Validation by PCR and 454/3730 sequencing showed that 73/77 (94.8%) were real structural variants, of which 34 (28 deletions and 6 inversions) were somatic alterations not present in the normal genome. Among them was a 46.5 kbp heterozygous deletion affecting FBXW7 (a known cancer gene) and two overlapping 500-kb deletions affecting CTNNA1 and a handful of other genes. The latter was particularly interesting, because loss of CTNNA1 has been shown to result in global loss of cell adhesion in human breast cancer cell lines.

We also validated seven translocations with a combination of manual review (Pairoscope), assembly, and PCR/3730 sequencing. One translocation that we assembled in all three tumor samples involves a long terminal repeat (LTR) from the ERVL-MaLR family on chromosome 4 and the ABCA2 gene on chromosome 9. Two other validated translocations that assembled in all three tumors are on chromosome 2, and separated only by a 393-bp TcMar-Tigger repeat.

Insights from Comparisons of Tumor, Metastasis, and Xenograft

One of the most intriguing findings from our study was the differential mutation frequencies and structural variation patterns that we observed in the metastasis and xenograft, compared to the primary tumor. More than half of the somatic mutations (26/50) were significantly enriched in the metastasis and xenograft, while observed at relatively low frequencies in the primary tumor. This suggests that a sub-population of tumor cells, not the primary clone, gave rise to the cerebellar metastasis that eventually killed the patient.

Is there a fitness cost to the mutations that enabled metastasis? Can we develop sensitive tests to detect the cells that are likely to spread? Genome sequencing has brought us to a point where we can begin to ask these questions, and answering them brings us one step closer to unraveling the complex, devastating, deadly disease that is cancer.

References
Li Ding, Matthew J. Ellis, Shunqiang Li, David E. Larson, Ken Chen, John W. Wallis, Christopher C. Harris, Michael D. McLellan, Robert S. Fulton, Lucinda L. Fulton, Rachel M. Abbott, Jeremy Hoog, David J. Dooling, Daniel C. Koboldt, Heather Schmidt, Joell (2010). Genome remodelling in a basal-like breast cancer metastasis and xenograft. Nature, 464 (7291), 999-1005 DOI: 10.1038/nature08989

AGBT: PacBio Somewhat Unveiled

February 27, 2010 by Dan Koboldt

Yesterday the Pacific Biosciences commercial instrument was at last unveiled to a packed room of conference attendees. The road to this third generation sequencer’s release has been paved with nearly $300 million of investment capital since leaving a basement at Cornell University. PacBio, in addition to becoming something of a media darling, has quietly swelled to a several-hundred-employee company.

Since last year, PacBio claims to have achieved read lengths of up to 10.3 kbp, although I haven’t spoken to anyone outside the company who has seen reads that long. Even so, a few vignettes presented in the workshop told of how PacBio has been applied to influenza strain identification and detection of structural variants (SVs).

Strobe Sequencing in Real Time

Of particular interest is the “strobe sequencing” mode of the instrument, in which the detection laser is turned off for precise amounts of time to generate mate-pair-like reads spanning large fragments. This feature relies on the real time sequencing, which occurs at a very consistent per-base rate. In fact, it’s possible to infer sequence insertions and deletions as spikes or dips (respectively) in the time required to sequence a template of known size.
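
The timing logic is worth spelling out. If the polymerase moves at a steady rate, the distance covered while the laser is off is just rate times dark time, and a template that takes longer (or shorter) than expected to traverse must contain extra (or missing) bases. Here is a Python sketch of that reasoning. The rate and timings are hypothetical numbers of my own, since PacBio gave no such figures in the workshop.

```python
def dark_gap_bases(rate_bases_per_s: float, dark_seconds: float) -> float:
    """Estimated bases traversed while the detection laser is off."""
    return rate_bases_per_s * dark_seconds


def inferred_indel(observed_seconds: float, reference_length: int,
                   rate_bases_per_s: float) -> float:
    """Bases traversed minus the expected template length.
    Positive suggests an insertion (a spike in time), negative a deletion."""
    return observed_seconds * rate_bases_per_s - reference_length


RATE = 3.0  # HYPOTHETICAL polymerase speed, bases per second

print(dark_gap_bases(RATE, 200))        # gap spanned by a 200 s dark period
print(inferred_indel(400, 1000, RATE))  # took longer than expected
print(inferred_indel(300, 1000, RATE))  # finished early
```

All of this depends on the per-base rate being consistent, which is exactly the property highlighted above.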

Kinetic Variation Applications

The kinetics of real-time sequencing add an informative new dimension to the PacBio data. In a talk today, Eric Schadt of PacBio showed that the kinetics of sequencing vary significantly for “modified” bases, i.e. methylated residues. In a collaboration with Carrie Harwood (UW), PacBio is sequencing the genomes and transcriptomes of 132 isolates of a hydrogen-producing species of Rhodopseudomonas. It turned out that kinetic variation exists at many bases as a “mixture” of sequencing times; by mining these, they identified thousands of methylated bases that caused up to 12-fold variation in sequencing kinetics.

Burning Questions Unanswered

Personally, I was not entirely satisfied with the PacBio workshop. When it opened for questions, I asked the first: whether PacBio had made any progress on the “dark bases” that go undetected in single molecule sequencing. The presenter, Stephen Turner of PacBio, first gave me a nice 2-minute lecture on why there is no such thing as “dark bases” on PacBio’s sequencing platform due to its inherent awesomeness (sarcasm mine). There is still a problem with “missed bases” but Turner was almost comically evasive (as Daniel MacArthur put it) in stating how often they occur. The next question concerned read lengths, a second topic on which Turner refused to provide concrete information.

Thus, I find myself cautious in my excitement about this new platform, and will reserve judgment until later this year, when the first of the golden-ticket early access partners begin generating data on their own PacBio SMRT sequencers.