Archives for August 2008

Ten Favorite Things About Maq

August 26, 2008 by Dan Koboldt

Heng Li’s brilliant short read alignment tool finally went legit with a publication in Genome Research that came online this month. It’s an important milestone for the open source tool that, by most accounts, out-performs just about every next-gen alignment algorithm to come out.

To commemorate the occasion, I decided to put together this list of my Ten Favorite Things About Maq:

10. The map file. This single file is a one-stop shop. It keeps the alignments, sequences, everything you need to process Solexa data.

9. Random placement. Reads in repeats are assigned alignment scores of zero and randomly placed, which helps paint a more accurate picture of the sequencing coverage across a genome.

8. Conversion tools. Although you have to convert just about any input file 2-3 times, at least maq provides all of the conversion scripts.

7. No gaps, please. Maq generally won’t even try for gapped alignments for short reads, a decision that I wholeheartedly support.

6. The version. It’s widely used and well documented, yet the version’s not even to 0.7.

5. Binary files. You know a program’s fast when it won’t touch ASCII input.

4. Alignment qualities. The real reason maq is superior to most aligners: maq uses individual base qualities when searching for a read’s best alignment.

3. Read simulation. Maq will “train” itself on a real data set, then generate simulated Solexa reads from a reference sequence based on the “real” data characteristics.

2. Good docs. For once, software that comes with complete, usable documentation.

1. The name. You might think “maq” is confusing, but it’s better than the old name, mapASS.

It reminds me of this gem of dialogue from “Robin Hood: Men in Tights”:

Prince: Such an unusual name, “Latrine.” How did your family come by it?
Latrine: We changed it in the 9th century.
Prince: You mean you changed it TO “Latrine”?
Latrine: Yeah. Used to be “Shithouse.”

Maq is good stuff. Thanks, Brian, for showing me the light.
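For the curious, items 4 and 9 on the list can be illustrated together with a toy sketch. This is not Maq's code: the function names, the list of candidate positions, and the placeholder mapping quality of 30 are all assumptions made up for illustration. The idea is to score each ungapped placement by the sum of base qualities at mismatched positions, keep the lowest-scoring placement, and, when several placements tie (as they would in a repeat), pick one at random and report a mapping quality of zero.

```python
import random

def mismatch_quality_sum(read, quals, ref, pos):
    """Sum the Phred qualities of read bases that mismatch the reference at pos."""
    window = ref[pos:pos + len(read)]
    return sum(q for r, w, q in zip(read, window, quals) if r != w)

def place_read(read, quals, ref, candidates, rng=random):
    """Pick the candidate position with the lowest mismatch-quality sum.

    Returns (position, mapping_quality). Ties are broken at random and get
    mapping quality 0. The value 30 for a unique best hit is a placeholder,
    not how Maq actually computes mapping quality.
    """
    scored = [(mismatch_quality_sum(read, quals, ref, p), p) for p in candidates]
    best = min(score for score, _ in scored)
    ties = [p for score, p in scored if score == best]
    if len(ties) > 1:
        return rng.choice(ties), 0
    return ties[0], 30

ref = "ACGTACGTTTACGAACGT"
read, quals = "ACGA", [30, 30, 30, 10]
unique_hit = place_read(read, quals, ref, [0, 4, 10])   # perfect match at 10
repeat_hit = place_read(read, quals, ref, [0, 4])       # two equally good hits
```

In the second call the read fits positions 0 and 4 equally well, so it lands on one of them at random with mapping quality 0. That is how randomly placed repeat reads can still contribute to coverage without being trusted for variant calling.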

The False Positives in Deep Resequencing

August 22, 2008 by Dan Koboldt

At last, the PNAS article previewed earlier this week by In Sequence is available on the journal’s site. “Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing” had two aspects that appealed strongly to me: the use of massively parallel sequencing to study leukemia, and a formalized algorithm to distinguish true variants from false positives.

The authors set out to examine clonal evolution in cancer with next-generation sequencing of B-cell chronic lymphocytic leukemia (CLL) samples. CLL was an appealing model for this study because of its high mutation rate in the short stretch of DNA that encodes the IG heavy chain (IGH). The short size of the locus was ideal for 454 sequencing, and because single-molecule reads are generated, the authors were able to identify haplotypes of somatic hypermutations carried by individual leukemic cells.

A key part of this study was the characterization of sequencing error rates and their causes. Three patterns of sequence errors were apparent:

1. Errors found near runs of 4 or more bases of the same nucleotide (homopolymers). This well-known artifact of pyrosequencing accounted for many false indel calls, and created false SNP calls as well.
2. Errors near the end of the sequence. These arise from a reduced signal-to-noise ratio after about 200 bases have been read.
3. Polymerase misincorporation during PCR. These are not sequencing errors, but random polymerase errors that created a low rate of substitutions through the length of the amplicon.
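The first two patterns are positional, so they lend themselves to a simple flagging step. The sketch below is not the authors' algorithm; it is a minimal illustration in which the function names and the `flank` parameter are my own assumptions, while the run length of 4 and the 200-base cutoff come from the description above. The third pattern (PCR misincorporation) is spread randomly along the amplicon and cannot be caught by position alone.

```python
import re

def homopolymer_runs(seq, min_run=4):
    """Return (start, end) spans of runs of min_run or more identical bases."""
    pattern = r"(.)\1{%d,}" % (min_run - 1)
    return [m.span() for m in re.finditer(pattern, seq)]

def flag_call(read_seq, pos, flank=2, max_reliable_pos=200, min_run=4):
    """Return a list of reasons a variant call at read position pos looks suspect.

    flank (bases on either side of a run that count as "near") is an
    assumed value chosen for illustration.
    """
    reasons = []
    for start, end in homopolymer_runs(read_seq, min_run):
        if start - flank <= pos < end + flank:
            reasons.append("near_homopolymer")
            break
    if pos >= max_reliable_pos:
        reasons.append("late_in_read")
    return reasons

short_read = "ACGTAAAAACGT"      # run of five A's at positions 4-8
long_read = "ACGT" * 60          # 240 bases, no homopolymers
```

Here `flag_call(short_read, 3)` flags the call as near a homopolymer, `flag_call(short_read, 0)` returns an empty list, and `flag_call(long_read, 210)` flags the call as late in the read.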

Weeding out false-positives is one of the greatest challenges facing those of us who analyze massively parallel sequencing data. Often this issue is addressed *after* the sequencing is done, with concordance estimates, decision trees, and the like. What I like about this study is that the authors looked at sequencing errors first, to precisely classify the sources of false-positives, and then built their variant-calling algorithm around the results.

The evolutionary biology aspect of this study is fascinating as well. Cancer is a powerful micro-system to study evolution, since subclones of cells have a mixture of shared and private somatic mutations and compete with one another to grow. Subclones with the best evolutionary fitness will, in time, come to dominate the population. It’s Darwinian fitness at its best.

By identifying haplotypes from single-molecule reads, the authors were able to construct phylogenetic trees of the leukemic cells in a single patient, something that could only be done on the 454 platform. Intriguingly, the initiating driver mutation of leukemogenesis occurred before the earliest branching of trees. Yet there were numerous different subclone haplotypes: one came to dominate, but the others persisted as well. This suggests that every subclone persisting in the population picked up at least one additional mutation that gave it a competitive advantage. Thus even the rare subclones carry driver mutations that contribute to cancer cell survival.
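To make the haplotype idea concrete, here is a minimal sketch of how single-molecule reads could be tallied into subclone haplotypes. It assumes each read has already been reduced to the base it carries at a handful of variant sites. The input format, the site names, and the function name are hypothetical, and the sketch stops well short of the phylogenetic tree-building the authors performed.

```python
from collections import Counter

def haplotype_frequencies(reads, sites):
    """Tally haplotypes across reads.

    Each read is a dict mapping a variant site name to the observed base.
    Returns (haplotype, count, fraction) tuples, most common first.
    """
    counts = Counter(tuple(read[s] for s in sites) for read in reads)
    total = sum(counts.values())
    return [(hap, n, n / total) for hap, n in counts.most_common()]

# Ten made-up reads covering two made-up variant sites.
reads = (
    [{"site1": "A", "site2": "G"}] * 6
    + [{"site1": "A", "site2": "T"}] * 3
    + [{"site1": "C", "site2": "T"}]
)
freqs = haplotype_frequencies(reads, ["site1", "site2"])
```

In this toy data one haplotype dominates at 60% of reads, while two rarer haplotypes persist at 30% and 10%. That is the kind of picture that only emerges when each read comes from a single molecule.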

The more rare subclones we can detect, the more mutations we can find, and the better we can come to understand the complex set of disease mechanisms that play a role in cancer.

You there, with the Typhoid!

August 4, 2008 by Dan Koboldt

There’s an interesting study in this month’s Nature Genetics in which the authors performed 454 and Solexa whole-genome sequencing on 19 isolates of Salmonella enterica serovar Typhi, the pathogen that causes typhoid fever. Typhi is different from many other Salmonella pathogens in that it’s human-restricted; the main reservoir driving transmission of Typhi is thought to be human carriers. Also, Typhi isolates exhibit very low levels of genetic variation (something like 1 SNP every 2,300 bp).

Among the 19 isolates selected for this study, 10 were sequenced on 454 (average depth: 10.8x) and 12 were sequenced on Solexa (average depth: 20.4x) with 3 isolates done on both platforms. Following Newbler assembly, 454 contigs were aligned to the finished Typhi sequence with MUMmer. Solexa reads were “too short to be assembled effectively using current software” and thus were mapped directly to the finished sequence with Maq (v0.6.0). Cut-offs for SNP calling were determined by comparing data from 454, Solexa, and published sequences for the three strains done on both platforms.

The authors offered little discussion of the relative performance of 454 and Solexa technology for the three strains sequenced on both platforms. However, I got the scoop from Supplemental Table 1. After applying their filtering criteria, the platform-specific performance was as follows. On 454, the mean false positive rate was 1.8% and the mean sensitivity was 85.0%. On Solexa, the mean false positive rate was higher (2.7%) and the sensitivity lower (77.1%). These estimates, by the way, are based on the assumption that the SNPs detected independently by *both* platforms represent the true set of SNPs for each isolate.
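For readers who want to run this kind of comparison on their own data, here is a rough sketch of the bookkeeping. The definitions are my assumptions, since I have not reproduced the exact formulas behind the supplemental table: sensitivity is taken as the fraction of "true" SNPs that survive a platform's filters, and the false positive rate as the fraction of a platform's filtered calls that fall outside the true set.

```python
def platform_performance(filtered_calls, truth):
    """Return (sensitivity, false_positive_rate) for one platform's filtered SNP calls.

    filtered_calls and truth are collections of SNP positions. Both
    definitions below are assumptions made for illustration.
    """
    filtered_calls, truth = set(filtered_calls), set(truth)
    true_pos = filtered_calls & truth
    false_pos = filtered_calls - truth
    sensitivity = len(true_pos) / len(truth) if truth else float("nan")
    fp_rate = len(false_pos) / len(filtered_calls) if filtered_calls else float("nan")
    return sensitivity, fp_rate

# Made-up example: 10 true SNPs, 8 recovered, plus 2 spurious calls.
truth = set(range(1, 11))
calls = set(range(1, 9)) | {100, 101}
sensitivity, fp_rate = platform_performance(calls, truth)
```

With these made-up numbers the sensitivity is 0.8 and the false positive rate is 0.2. Note that any such estimate is only as good as the truth set it is measured against.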

The authors claim that a careful sampling strategy designed to capture the full phylogenetic tree, coupled with whole-genome sequencing, allowed them to capture much of the variation present in the Typhi population. Their analysis supports the previously proposed small population size and genetic drift, with little evidence for purifying selection, antigenic variation, or recombination between isolates. The vast majority of genes (72%) contained no SNPs; for the remainder, the distribution of SNPs per gene followed a Poisson distribution. The only gene with a strong signal of positive selection was gyrA, where mutations at codons 83 and 87 are associated with fluoroquinolone resistance; this no doubt reflects selective pressure on Typhi associated with antibiotic use in human populations. However, the sparse evidence for antigenic variation within Typhi suggests that this pathogen is not under strong selective pressure from the human immune system.
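As a back-of-the-envelope check on what that 72% implies, suppose (and this is my simplifying assumption, not the authors' model) that SNP counts across all genes follow a single Poisson distribution, ignoring differences in gene length. The fraction of genes with zero SNPs is then e^(-lambda), which pins down the mean number of SNPs per gene.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

frac_zero = 0.72                      # fraction of genes with no SNPs (from the study)
lam = -math.log(frac_zero)            # solve exp(-lam) = 0.72 for the mean
p_one = poisson_pmf(1, lam)           # expected fraction of genes with exactly one SNP
p_two_plus = 1.0 - frac_zero - p_one  # expected fraction with two or more SNPs
```

Under that assumption the mean works out to about 0.33 SNPs per gene, with roughly 24% of genes expected to carry exactly one SNP and roughly 4% expected to carry two or more. A gene with many more SNPs than that, like gyrA, stands out against such a background.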

The low levels of purifying selection, antigenic variation, and recombination in Typhi are consistent with the role of human carriers as the main reservoir for the pathogen. In other words, the disease persists because there are a number of people who are infected, but asymptomatic. The authors conclude that vaccination may be a crucial long-term strategy for control of typhoid fever because it would treat asymptomatic carriers as well as the infirm. I might add that the apparently healthy carriers of the Typhi pathogen might be a promising population for immunogenetics studies.