Matei Zaharia, et al., Faster and More Accurate Sequence Alignment with SNAP, arXiv:1111.5572v1, 2011
SNAP uses a simple hash index of short seed sequences from the genome, similar to BLAST’s. … It uses longer seeds to reduce the false positive locations considered, leverages larger memory capabilities to speed index lookup, and excludes most candidate locations without fully computing their edit distance to the read. The result is an algorithm that scales well for reads from one hundred to thousands of bases long and provides a rich error model that can match classes of mutations (e.g., longer indels) that today’s fast aligners ignore. We calculate that SNAP can align a dataset with 30x coverage of a human genome in less than an hour for a cost of $2 on Amazon EC2.
Like the original BLAST algorithm, SNAP is based on a hash index of short substrings of the genome (or other reference database) called seeds, of a fixed size s. Given a read to align, SNAP draws multiple substrings of length s from it and performs an exact lookup in the hash index to find locations in the database that contain the same substrings. It then compute the edit distance between the read and each of these candidate locations to find the best alignments. Two features differentiate SNAP from previous algorithms, however, and give it to reduce computation, and a set of optimizations for the local alignment step that greatly reduce the cost of testing a read against its candidate locations.