Jacob J Michaelson & Jonathan Sebat, forestSV: structural variant discovery through statistical learning, Nature Methods 9 (2012), doi:10.1038/NMETH.2085
Detecting genomic structural variants from high-throughput sequencing data is a complex and unresolved challenge. We have developed a statistical learning approach, based on Random Forests, that integrates prior knowledge about the characteristics of structural variants and leads to improved discovery in high-throughput sequencing data.
Although some of the tested methods were clear leaders, no single method produced results with a sensitivity comparable to that of a careful merging of calls from all the methods.
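The text does not describe how calls from the different methods were merged. As a rough illustration only, the sketch below takes the union of overlapping calls of the same type on the same chromosome and records which methods contributed to each merged call. The tuple layout, the function name `merge_calls`, and the caller names are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def merge_calls(calls):
    """Union overlapping calls that share a chromosome and variant type.

    Each call is a tuple (chrom, start, end, sv_type, method).
    Returns tuples (chrom, start, end, sv_type, sorted list of methods).
    """
    groups = defaultdict(list)
    for chrom, start, end, sv_type, method in calls:
        groups[(chrom, sv_type)].append((start, end, method))

    merged = []
    for (chrom, sv_type), items in sorted(groups.items()):
        items.sort()  # order by start coordinate
        cur_start, cur_end, methods = items[0][0], items[0][1], {items[0][2]}
        for start, end, method in items[1:]:
            if start <= cur_end:
                # Overlaps the current merged interval: extend it.
                cur_end = max(cur_end, end)
                methods.add(method)
            else:
                merged.append((chrom, cur_start, cur_end, sv_type, sorted(methods)))
                cur_start, cur_end, methods = start, end, {method}
        merged.append((chrom, cur_start, cur_end, sv_type, sorted(methods)))
    return merged

# Hypothetical calls from three made-up callers.
calls = [
    ("chr1", 100, 500, "DEL", "caller_a"),
    ("chr1", 450, 700, "DEL", "caller_b"),
    ("chr1", 450, 700, "DUP", "caller_c"),
    ("chr2", 10, 50, "DEL", "caller_a"),
    ("chr1", 900, 1000, "DEL", "caller_c"),
]
merged_calls = merge_calls(calls)
for call in merged_calls:
    print(call)
```

A plain union like this maximizes sensitivity at the cost of precision. A more careful merge might require reciprocal overlap or support from several methods before accepting a call.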
We used previous studies as a basis for extracting the multidimensional patterns that discriminate calls that validate experimentally from those that do not. Beyond incorporating the typical read depth and read-pair signatures used by most structural variant discovery approaches, we constructed additional features from the sequencing data, thereby representing the genome in a higher-dimensional space. We trained a Random Forest classifier to partition this space in a way that optimizes the classification of deletions, duplications and false positives, whose identities are stored in the class-label vector Y.
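To make the setup concrete, here is a minimal sketch of the general idea: a feature matrix X with one row per candidate region, a class-label vector Y, and a small forest of decision trees grown on bootstrap samples with a randomly chosen feature at each split. This is not the authors' implementation. The two features (`depth_ratio`, `discordant_fraction`), the simulated values, and all function names are assumptions made for illustration, and the data are synthetic.

```python
import random
from collections import Counter

CLASSES = ("deletion", "duplication", "false_positive")

def simulate_window(label, rng):
    """Hypothetical features for one region: [depth_ratio, discordant_fraction]."""
    depth_mean = {"deletion": 0.5, "duplication": 1.5, "false_positive": 1.0}[label]
    discordant_mean = 0.05 if label == "false_positive" else 0.40
    return [rng.gauss(depth_mean, 0.1), max(0.0, rng.gauss(discordant_mean, 0.05))]

def make_dataset(n_per_class, rng):
    X, Y = [], []  # feature matrix and class-label vector
    for label in CLASSES:
        for _ in range(n_per_class):
            X.append(simulate_window(label, rng))
            Y.append(label)
    return X, Y

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, Y, feature):
    """Threshold on one feature that minimizes the weighted Gini impurity."""
    best = None
    for t in sorted({row[feature] for row in X}):
        left = [y for row, y in zip(X, Y) if row[feature] <= t]
        right = [y for row, y in zip(X, Y) if row[feature] > t]
        if left and right:
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(Y)
            if best is None or score < best[0]:
                best = (score, t)
    return best

def grow_tree(X, Y, rng, depth=0, max_depth=4):
    majority = Counter(Y).most_common(1)[0][0]
    if depth == max_depth or len(set(Y)) == 1:
        return majority  # leaf: majority class
    feature = rng.randrange(len(X[0]))  # one randomly chosen feature per split
    split = best_split(X, Y, feature)
    if split is None:
        return majority
    t = split[1]
    left = [i for i, row in enumerate(X) if row[feature] <= t]
    right = [i for i, row in enumerate(X) if row[feature] > t]
    return (feature, t,
            grow_tree([X[i] for i in left], [Y[i] for i in left], rng, depth + 1, max_depth),
            grow_tree([X[i] for i in right], [Y[i] for i in right], rng, depth + 1, max_depth))

def train_forest(X, Y, n_trees, rng):
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        forest.append(grow_tree([X[i] for i in idx], [Y[i] for i in idx], rng))
    return forest

def predict(forest, x):
    """Majority vote over the trees."""
    votes = Counter()
    for node in forest:
        while isinstance(node, tuple):
            feature, t, left, right = node
            node = left if x[feature] <= t else right
        votes[node] += 1
    return votes.most_common(1)[0][0]

rng = random.Random(0)
X_train, Y_train = make_dataset(40, rng)
X_test, Y_test = make_dataset(20, rng)
forest = train_forest(X_train, Y_train, n_trees=25, rng=rng)
accuracy = sum(predict(forest, x) == y for x, y in zip(X_test, Y_test)) / len(Y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

With only two features the random feature choice is very coarse. In a higher-dimensional feature space like the one described above, each split would typically consider a random subset of several features, and an established Random Forest library would normally be used rather than hand-written trees.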