# Faster and More Accurate Sequence Alignment with SNAP

Matei Zaharia, William J. Bolosky, Kristal Curtis, Armando Fox, David Patterson,
Scott Shenker, Ion Stoica, Richard M. Karp, Taylor Sittler*

*Corresponding author: Department of Laboratory Medicine, University of California San Francisco, 185 Berry Street, Suite 100, San Francisco, CA 94115. taylor.sittler@ucsf.edu

U.C. Berkeley, Microsoft Research, U.C. San Francisco

We present the Scalable Nucleotide Alignment Program (SNAP), a new short and long read aligner that is both more accurate (i.e., aligns more reads with fewer errors) and 10–100× faster than state-of-the-art tools such as BWA. Unlike recent aligners based on the Burrows-Wheeler transform, SNAP uses a simple hash index of short seed sequences from the genome, similar to BLAST's. However, SNAP greatly reduces the number and cost of local alignment checks performed through several measures: it uses longer seeds to reduce the false positive locations considered, leverages larger memory capacities to speed index lookup, and excludes most candidate locations without fully computing their edit distance to the read. The result is an algorithm that scales well for reads from one hundred to thousands of bases long and provides a rich error model that can match classes of mutations (e.g., longer indels) that today's fast aligners ignore. We calculate that SNAP can align a dataset with 30× coverage of a human genome in less than an hour for a cost of $2 on Amazon EC2, with higher accuracy than BWA. Finally, we describe ongoing work to further improve SNAP.

## 1 Introduction

As massively parallel sequencing methods become common in biological research, high throughput sequence information is making its way into a variety of fields, from plant biology to human infectious disease, cancer research, and clinical medicine. With the advent of newer sequencing machines, hundreds of millions to billions of short nucleotide fragments are now generated in a single experiment. Given recent trends and the current demand for these technologies, the cost of sequencing provides a good measure of the amount of sequence generated (lower cost means more sequencing), and that cost is falling rapidly. Resequencing a human genome cost several billion dollars at the turn of the millennium, but only $10,000 at the start of 2011 [15].
By year's end, the equivalent genome is projected to cost only $2,000. Far outstripping Moore's Law, the amount of generated sequence threatens to overwhelm current storage and compute infrastructures. Laboratories and sequencing centers are turning to "the cloud": renting machines and storage from companies like Amazon, Rackspace, and Microsoft. However, given the rate of decrease in cost and increase in the amount of sequence data, simply scaling out across more machines cannot keep pace with these advances. New algorithms are required to handle this data.

The most popular traditional algorithms for sequence alignment are Smith-Waterman and BLAST [1]. These algorithms are quite accurate, but they take an inordinate amount of time to perform adequate alignments. As a result, many new aligners have been developed over the past two to four years to handle large amounts of short read data; BWA, SOAP, Bowtie and SSAHA [7, 10, 6, 13] are but a few recent examples. While these aligners outperform Smith-Waterman and BLAST by several orders of magnitude, they are still computationally expensive, taking multiple CPU-days to align a single human genome. Furthermore, the faster aligners' speed comes at the cost of accuracy: they support only limited numbers and types of errors, e.g., one insertion/deletion (indel) of limited size per read. They can thus miss larger differences between sequences that are biologically significant. For example, short insertions and deletions are thought to comprise 15–30% of known genetic variation [12] and to contribute to common diseases [17]. Additionally, human cancer cells often contain a host of anomalies [3], including important base substitutions and short indels. As DNA repair mechanisms degrade in tumor cells, accumulation of these mutations is thought to drive cancer progression.
Accurate identification of these changes is therefore critical to understanding these diseases.

In this article, we present SNAP, a new aligner that is both substantially faster and more accurate than the current state-of-the-art algorithms. SNAP has several properties that make it attractive and broadly applicable for truly high throughput sequence analysis:

1. It runs 10–100× faster than existing tools on reads from current sequencing technologies, while providing higher accuracy (more reads aligned with fewer errors).
2. It supports a rich error model: SNAP can find alignments with an arbitrary number of substitutions, insertions, or deletions from the reference genome, as long as there is one contiguous "seed" of 20 bases matching exactly.
3. It can run on readily available hardware, such as several Amazon EC2 server types.
4. The same algorithm can be used across a wide range of read lengths (from 100 to 10,000 base pairs) and error rates, making it applicable to both current and upcoming sequencing technologies.

Surprisingly, unlike the fastest current aligners (e.g., Bowtie and BWA), which search a trie index of the genome encoded using the Burrows-Wheeler transform, SNAP uses a simple hash index similar to BLAST's, based on short "seed" substrings of the genome. However, SNAP leverages several observations to run faster than previous hash-based aligners. First, read lengths have increased since the advent of BLAST: whereas sequencers used to produce 25–30 base pair reads, they now produce reads of 100 bp or more. This lets us use longer seeds (20 bases as opposed to BLAST's 10–12), which have a higher probability of containing an error and thus not being found in the index, but match far fewer "false positive" locations by chance.
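To make the seed-index idea concrete, here is a minimal Python sketch of a BLAST-style hash index with 20-base seeds. This is our own illustration, not SNAP's implementation: the function names are ours, and SNAP's real index is far more compact and optimized.

```python
from collections import defaultdict

SEED_LEN = 20  # SNAP uses 20-base seeds, vs. BLAST's 10-12

def build_seed_index(genome: str) -> dict:
    """Map every SEED_LEN-base substring of the genome to the list of
    positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(genome) - SEED_LEN + 1):
        index[genome[pos:pos + SEED_LEN]].append(pos)
    return index

def candidate_locations(index: dict, read: str) -> set:
    """Look up each seed of the read; a candidate alignment location is
    the seed's genome position minus the seed's offset within the read.
    Only these candidates are passed on to local alignment."""
    candidates = set()
    for offset in range(len(read) - SEED_LEN + 1):
        for pos in index.get(read[offset:offset + SEED_LEN], ()):
            candidates.add(pos - offset)
    return candidates
```

The trade-off described above is visible here: a longer `SEED_LEN` makes each lookup more likely to miss (a sequencing error anywhere in the seed prevents a match), but each hit is far more specific, so `candidate_locations` returns far fewer false positives to verify.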
Second, we eliminate most of these locations without fully computing their edit distance from the read, by using a local alignment algorithm that can quickly reject locations with a higher edit distance than the best found so far [16], and by eliminating some locations based on the number of matching seeds. This gives us a 10–50× speedup over the textbook edit distance check [18]. Third, SNAP leverages the higher memory capacities of today's servers to index more seeds and perform fewer hash lookups. (The current version uses 39 GB for an index of the human genome, which is readily available on commodity servers and on cloud services. We believe that the memory usage could be reduced with little loss in speed, as most of the algorithm's time is spent in local alignment rather than index lookup.)

Combined, these optimizations yield an algorithm that would not have performed well on the ultrashort reads available when BLAST was introduced, but is highly effective on the 100+ bp reads available today, as well as on next-generation 1000+ bp long read technology (e.g., Pacific Biosciences). Indeed, we were able to align a 100 bp read dataset with 30-fold coverage of a human genome in 20 minutes on a 32-core server that cost $11,000 in November 2011, using parameters that provide higher accuracy than BWA, and we estimate that the same dataset could be aligned in less than an hour for $2.40 on a "cc2.8xlarge" Amazon EC2 server. More importantly, since trends in both current and next-generation sequencing technologies indicate that read lengths will keep increasing, we believe that the techniques in SNAP will continue to be relevant in the future.

## 2 Results

Before presenting the SNAP algorithm in detail, we start with a speed and accuracy comparison against existing aligners. We evaluated the algorithms using simulated reads, for which we know the true genome location and can compute error rates.
We computed three metrics:

• Aligned reads: Percent of reads that the aligner confidently mapped to a location.
• Errors: Percent of confidently aligned reads that are mapped to the wrong location.
• Speed: Reads per second aligned on a single CPU core. (All of the aligners parallelize on machines with multiple cores.)

To determine confident reads, we used a quality threshold of 10 for the aligners that report quality scores; this value was also used in the evaluation of BWA [7]. We start by showing results for short simulated reads (100 and 200 bp) with mutation frequencies matching human data (Section 2.1). These reads are representative of current sequencing technologies, such as Illumina. Next, we compare performance on simulated long read data with high indel rates from sequencing errors, consistent with "third-generation" sequencing technologies (Section 2.2). Finally, we report preliminary results for a multicore version of SNAP (Section 2.3). Unless otherwise noted, the measurements are from an "m2.4xlarge" Amazon EC2 machine with 68 GB RAM.

### 2.1 Short Reads

We compared SNAP's performance on short reads against Bowtie [6], SOAP2 [10] and BWA [7]. We created one million simulated reads of 100 and 200 bp from the hg19 human reference genome, using the wgsim program in SAMtools [14]. We simulated a 0.1% mutation rate (0.09% SNPs and 0.01% indels) representative of human data, and base sequencing error rates of 2%, 5% and 10%, similar to the evaluation in [7]. Table 1 reports our results. We see that the existing aligners vary in terms of speed and accuracy, with SOAP2 being the fastest but least accurate, and BWA being slightly slower but more accurate. However, SNAP performs 10–50× faster than these tools while also aligning more reads and making fewer errors.
Furthermore, SNAP continues to perform well when the error rate increases, whereas the number of reads aligned by BWA, SOAP2 and Bowtie drops substantially because these algorithms are designed for small numbers of errors per read. In practice, this means that SNAP can also match reads with more mutations collected using lower-error sequencing technologies. For a direct comparison with other reports of aligner performance, we also ran a test with 125 bp reads simulated as in the BWA paper [7]. That paper reports that BWA aligned 93.0% of the reads at 0.05% error and 662 reads/s, while we found that SNAP aligned 94.1% of the reads at 0.05% error and 34,100 reads/s.

### 2.2 Long Reads

Third-generation sequencing technologies, such as Ion Torrent or Pacific Biosciences, will generate long reads of several thousand base pairs with a high frequency of indels due to sequencing errors. Intuitively, SNAP should work well for these types of reads: a larger read length allows us to safely use long seeds in hash table lookups (see Section 3.1) and find good candidate locations quickly, while the shift from substitutions to indels does not affect our edit distance algorithm. To evaluate SNAP for this type of data, we generated reads with varying sequencing error rates, in which 20% of the sequencing errors were indels and the other 80% were substitutions, as in the evaluation of BWA-SW [8]. We then compared SNAP against BWA-SW. Table 2 shows the results. We see that SNAP performs substantially faster than existing tools: from as much as 164× faster for the 1000 bp reads with 2% error, to 3.6× faster for 10,000 bp reads with 10% error. We have not yet optimized SNAP's local alignment performance for very long reads, so we believe that we can further improve performance in these cases. Crucially, however, SNAP's accuracy also exceeds that of existing tools: it often aligns 0.5% to 1% more reads, while achieving a similar or smaller error rate.
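The edit distance check that lets SNAP reject most candidate locations early can be illustrated with a textbook dynamic-programming table that abandons a candidate as soon as no alignment within the current limit remains possible. This is our own simplified sketch of the early-termination idea from [16, 18], not SNAP's optimized implementation.

```python
def edit_distance_within(a: str, b: str, limit: int):
    """Return the edit distance between a and b if it is <= limit,
    otherwise None. A row whose minimum already exceeds `limit`
    abandons the computation early: that candidate location cannot
    beat the best alignment found so far."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match/substitution
        if min(cur) > limit:
            return None  # early rejection: no path within the limit
        prev = cur
    return prev[-1] if prev[-1] <= limit else None
```

Because `limit` shrinks whenever a better alignment is found, most candidate locations are rejected after only a few rows, which is where the reported 10–50× speedup over the unconditional check comes from. Note also that indels are first-class edits here, consistent with the long-read behavior above.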
### 2.3 Multi-Core Implementation

We built a parallel version of SNAP that runs alignments on all of the cores of a multiprocessor. It works by assigning a range of the input file to a core, having the core process the entire range, and then taking another. The size of each range is about half of the total remaining work divided by the number of cores, with a minimum size corresponding to about half a second's work. This scheme balances the efficiency of large chunks (which amortize the per-chunk overhead over more work) against the desire to have all of the cores finish at the same time, without some exiting early for lack of work while others are still processing. Our preliminary implementation works well even though we have observed some cores running slower than others.

Because alignments of different reads do not depend on one another, one might expect that running an alignment on multiple cores of a multiprocessor machine would result in perfect speedup. However, while the individual alignments are logically independent, the computational resources used to process them are not. The cores contend for memory bandwidth to access the index, reference genome and input file; the cores on a single chip contend for L3 cache; and, even though there is almost no shared memory that is written, the memory system still has to run a coherence protocol to verify that there have been no writes. All of these effects reduce the scaling. We have not yet tried to quantify their relative effects, nor have we tuned the parallel version of SNAP.

We ran tests on a Dell machine with four 8-core 2.6 GHz AMD Opteron 6140 processors, for a total of 32 cores. The machine has 256 GB of memory and an I/O system consisting of six 1 TB 2.5" 7200 RPM SAS disks attached to a PERC H700 RAID controller, configured as one drive for the operating system and a 5-disk RAID-0 stripe to hold the indices and read data. This machine cost about $11,000 in November of 2011.
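The chunk-sizing rule described at the start of this section can be sketched as follows. The function name and the concrete floor value are our own: the paper specifies the minimum only as "about half a second's work", which we approximate here from the single-core rate reported below.

```python
# Roughly half a second's work per core at ~37,000 reads/s (our estimate).
MIN_CHUNK = 18_000

def next_chunk(remaining: int, n_cores: int) -> int:
    """Number of reads an idle core should claim next: about half of the
    remaining work divided by the core count, floored at MIN_CHUNK and
    capped at whatever work is actually left."""
    size = remaining // (2 * n_cores)
    return min(remaining, max(size, MIN_CHUNK))
```

Early on, chunks are large (amortizing per-chunk overhead); as the input runs out, chunks shrink toward the floor, so all cores tend to finish within about half a second of each other rather than some idling while one core drains a large final chunk.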

To measure scaling, we ran a simulated dataset containing 100 million 125-base reads with a 2% error rate, while varying the number of cores used. SNAP produced about 94% confident alignments with an error rate of 0.05%. On a single core, SNAP ran at about 37,000 reads/s. On all 32 cores, it ran at 723,000 reads/s, just under a factor of 20 faster, or about 60% of perfect speedup. Figure 1 shows the scaling performance.
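As a quick sanity check on the scaling figures above:

```python
# 37,000 reads/s on one core vs. 723,000 reads/s on all 32 cores.
single, parallel, cores = 37_000, 723_000, 32
speedup = parallel / single   # just under a factor of 20
efficiency = speedup / cores  # about 60% of perfect speedup
```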

## 7 Acknowledgements

We would like to thank Matthew Meyerson for providing motivation and encouragement for this work, and for his feedback on its utility in cancer and pathogen discovery. This research is supported in part by gifts from the following Berkeley AMP Lab sponsors: Google, SAP, Amazon Web Services, Cloudera, Ericsson, General Electric, Huawei, IBM, Intel, Mark Logic, Microsoft, NEC Labs, Network Appliance, Oracle, Quanta Computer, Splunk and VMware, and by DARPA (contract #FA8650-11-C-7136).

## References

• [1] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990.
• [2] R. A. Baeza-Yates and C. H. Perleberg. Fast and practical approximate string matching. In Combinatorial Pattern Matching, pages 185–192. Springer-Verlag, 1992.
• [3] A. Balmain, J. Gray, and B. Ponder. The genetics and genomics of cancer. Nature Genetics, 33:238–244, 2003.
• [4] M. Burrows and D. Wheeler. A block-sorting lossless data compression algorithm. Technical Report SRC-RR-124, Digital Equipment Corporation, 1994.
• [5] CEPH trio dataset from Illumina.
• [6] B. Langmead, C. Trapnell, M. Pop, and S. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3):R25+, 2009.
• [7] H. Li and R. Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.
• [8] H. Li and R. Durbin. Fast and accurate long read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5):589–595, 2010.
• [9] H. Li and N. Homer. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics, 11(5):473–483, 2010.
• [10] R. Li, C. Yu, Y. Li, T.-W. W. Lam, S.-M. M. Yiu, K. Kristiansen, and J. Wang. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966–1967, Aug. 2009.
• [11] Y. Li, A. Terrell, and J. M. Patel. WHAM: a high-throughput sequence alignment method. In ACM SIGMOD, pages 445–456, 2011.
• [12] R. Mills, C. Luttig, C. Larkins, A. Beauchamp, C. Tsui, W. Pittard, and S. Devine. An initial map of insertion and deletion (indel) variation in the human genome. Genome Research, 16(9):1182, 2006.
• [13] Z. Ning, A. J. Cox, and J. C. Mullikin. SSAHA: A fast search method for large DNA databases. Genome Research, 11(10):1725–1729, 2001.
• [14] SAMtools.
• [15] NHGRI data on DNA Sequencing Costs.
• [16] E. Ukkonen. Algorithms for approximate string matching. Information and Control, 64(1–3):100–118, 1985.
• [17] F. Vallania, T. Druley, E. Ramos, J. Wang, I. Borecki, M. Province, and R. Mitra. High-throughput discovery of rare insertions and deletions in large cohorts. Genome Research, 20(12):1711, 2010.
• [18] R. A. Wagner and M. J. Fischer. The string-to-string correction problem. Journal of the ACM, 21:168–173, January 1974.