Samia N. Naccache, et al., A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples, Genome Research, 2014, doi:10.1101/gr.171934.113
Conventional diagnostic testing for pathogens is narrow in scope and fails to detect the etiologic agent in a significant percentage of cases.
Unbiased NGS holds the promise of identifying all potential pathogens in a single assay without a priori knowledge of the target. Given sufficiently long read lengths, multiple hits to the microbial genome, and a well-annotated reference database, nearly all microorganisms can be uniquely identified on the basis of their specific nucleic acid sequence.
Computational analysis of metagenomic NGS data for pathogen identification remains challenging for several reasons. First, alignment/classification algorithms must contend with massive amounts of sequence data. Second, only a small fraction of short NGS reads in clinical metagenomic data typically correspond to pathogens. Finally, novel microorganisms with divergent genomes, particularly viruses, are not adequately represented in existing reference databases and often can only be identified on the basis of remote amino acid homology.
To address these challenges, the most widely used approach is computational subtraction of reads corresponding to the host (e.g., human) followed by alignment to reference databases that contain sequences from candidate pathogens.
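The subtract-then-classify strategy described above can be illustrated with a minimal sketch. It is a toy stand-in, not the method of any particular pipeline: shared k-mer fraction replaces a real alignment score, and the function names (`kmers`, `shared_fraction`, `subtract_host`, `classify`), the k-mer size, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from typing import Dict, Iterable, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_fraction(read: str, index: Set[str], k: int) -> float:
    """Fraction of the read's k-mers found in a reference k-mer index.

    Assumption: this crude score stands in for a real alignment score.
    """
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0
    return len(read_kmers & index) / len(read_kmers)


def subtract_host(reads: Iterable[str], host_index: Set[str],
                  k: int = 8, threshold: float = 0.5) -> List[str]:
    """Computational subtraction: drop reads that look like host sequence."""
    return [r for r in reads if shared_fraction(r, host_index, k) < threshold]


def classify(reads: Iterable[str], pathogen_indexes: Dict[str, Set[str]],
             k: int = 8, threshold: float = 0.5) -> Dict[str, str]:
    """Assign each remaining read to its best-scoring candidate reference.

    Reads whose best score is below the threshold stay "unclassified".
    """
    labels: Dict[str, str] = {}
    for read in reads:
        best_name, best_score = "unclassified", 0.0
        for name, index in pathogen_indexes.items():
            score = shared_fraction(read, index, k)
            if score > best_score:
                best_name, best_score = name, score
        labels[read] = best_name if best_score >= threshold else "unclassified"
    return labels
```

Running `subtract_host` before `classify` shrinks the set of reads that must be compared against the larger candidate-pathogen references, which is the reason the subtraction step comes first.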
Traditionally, the BLAST algorithm is used for classification of human and nonhuman reads at the nucleotide level (BLASTn), followed by low-stringency protein alignments using a translated nucleotide query (BLASTx) for detection of divergent sequences from novel pathogens. However, BLAST is too slow for routine analysis of NGS metagenomics data: end-to-end processing, even on multicore computational servers, can take several days to weeks.
Here we describe SURPI (“sequence-based ultrarapid pathogen identification”), a cloud-compatible bioinformatics analysis pipeline that provides extensive classification of reads against viral and bacterial databases in fast mode and against the entire NCBI nt database in comprehensive mode. Novel pathogens are also identified in comprehensive mode by amino acid alignment to viral and/or NCBI nr protein databases. Notably, SURPI generates results in a clinically actionable timeframe of minutes to hours by leveraging two alignment tools, SNAP and RAPSearch, which run orders of magnitude faster than other available algorithms.
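The two modes described above can be summarized as a small data structure. The sketch below is illustrative only and does not reproduce SURPI's actual configuration format or command-line interface: the `Step` type and `plan` function are hypothetical, the database fields are descriptive labels rather than real paths, and the pairing of SNAP with nucleotide steps and RAPSearch with amino acid steps is an inference, since the passage names both tools without assigning them to steps.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    """One alignment stage (hypothetical type, for illustration only)."""
    aligner: str    # tool named in the text
    level: str      # "nucleotide" or "amino acid"
    database: str   # descriptive label, not a real file path


def plan(mode: str) -> List[Step]:
    """Return the alignment stages implied by the description of each mode."""
    if mode == "fast":
        # Fast mode: reads are classified against viral and bacterial databases.
        return [Step("SNAP", "nucleotide", "viral and bacterial")]
    if mode == "comprehensive":
        # Comprehensive mode: nucleotide classification against all of NCBI nt,
        # plus amino acid alignment to catch divergent, novel pathogens.
        return [
            Step("SNAP", "nucleotide", "NCBI nt"),
            Step("RAPSearch", "amino acid", "viral and/or NCBI nr protein"),
        ]
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    for mode in ("fast", "comprehensive"):
        for step in plan(mode):
            print(f"{mode}: {step.aligner} ({step.level}) vs {step.database}")
```

Seen this way, the trade-off between the modes is explicit: fast mode runs one nucleotide stage against a restricted database set, whereas comprehensive mode widens the nucleotide search and adds the amino acid stage needed to detect divergent sequences.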