forestSV: structural variant discovery through statistical learning

Jacob J Michaelson & Jonathan Sebat, forestSV: structural variant discovery through statistical learning, Nature Methods 9 (2012), doi:10.1038/NMETH.2085

Detecting genomic structural variants from high-throughput sequencing data is a complex and unresolved challenge. We have developed a statistical learning approach, based on Random Forests, that integrates prior knowledge about the characteristics of structural variants and leads to improved discovery in high-throughput sequencing data.

Although some of the tested methods were clear leaders, no single method produced results with a sensitivity comparable to that of a careful merging of calls from all the methods.

We used previous studies as a basis for extracting the multidimensional patterns that discriminate calls that validate experimentally from those that do not. Beyond incorporating the typical read depth and read-pair signatures used by most structural variant discovery approaches, we constructed additional features from the sequencing data, thereby representing the genome in a higher-dimensional space. We trained a Random Forest classifier to partition this space in a way that optimizes the classification of deletions, duplications and false positives, whose identities are stored in the class-label vector Y.
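
To make the classification step concrete, here is a minimal sketch of a random forest in plain Python. It is not the authors' implementation. The three features (read-depth ratio, discordant read-pair fraction, mean mapping quality), the synthetic training data, and all function names and parameters (`grow_tree`, `train_forest`, `mtry`, and so on) are assumptions made for illustration. Only the three classes and the class-label vector Y come from the text above.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def best_split(X, y, feature_ids):
    """Return (impurity, feature, threshold) of the best split, or None."""
    best = None
    for f in feature_ids:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def grow_tree(X, y, rng, max_depth=6, mtry=2):
    """Grow one tree; each node considers only `mtry` randomly chosen features."""
    majority = Counter(y).most_common(1)[0][0]
    if max_depth == 0 or len(set(y)) == 1:
        return majority
    split = best_split(X, y, rng.sample(range(len(X[0])), mtry))
    if split is None:  # sampled features are constant in this node
        return majority
    _, f, t = split
    left = [(row, lab) for row, lab in zip(X, y) if row[f] <= t]
    right = [(row, lab) for row, lab in zip(X, y) if row[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [l for _, l in left], rng, max_depth - 1, mtry),
            grow_tree([r for r, _ in right], [l for _, l in right], rng, max_depth - 1, mtry))

def predict_tree(node, row):
    """Walk from the root to a leaf; leaves are plain class labels."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def train_forest(X, Y, n_trees=25, seed=0):
    """Fit each tree on a bootstrap sample (drawn with replacement) of the calls."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(grow_tree([X[i] for i in idx], [Y[i] for i in idx], rng))
    return forest

def predict(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]

# Synthetic candidate calls (made-up numbers, for illustration only):
# (class label, mean read-depth ratio, mean discordant read-pair fraction)
PROFILES = [("deletion", 0.5, 0.40), ("duplication", 1.5, 0.40), ("false_positive", 1.0, 0.02)]

data_rng = random.Random(1)
X, Y = [], []
for label, depth, discordant in PROFILES:
    for _ in range(30):
        X.append([data_rng.gauss(depth, 0.05),                   # read-depth ratio
                  max(0.0, data_rng.gauss(discordant, 0.02)),    # discordant pair fraction
                  data_rng.uniform(20, 60)])                     # mapping quality (noise)
        Y.append(label)                                          # class-label vector Y

forest = train_forest(X, Y)
for candidate in ([0.5, 0.40, 40.0], [1.5, 0.40, 40.0], [1.0, 0.01, 40.0]):
    print(candidate, "->", predict(forest, candidate))
```

The two ingredients that distinguish a random forest from a single decision tree are both visible here: every tree sees a different bootstrap sample of the training calls, and every node chooses its split from a random subset of the features. In practice one would use an established library and train on experimentally validated calls rather than simulated ones.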
