Alexej Abyzov et al., CNVnator: An approach to discover, genotype and characterize typical and atypical CNVs from family and population genome sequencing, Genome Res. 21 (2011), doi:10.1101/gr.114876.110
We have developed a method, CNVnator, for CNV discovery and genotyping from read-depth (RD) analysis of personal genome sequencing.
Overall, for CNVs accessible by RD, CNVnator has high sensitivity (86%-96%), low false-discovery rate (3%-20%), [... high sequencing coverage]. Furthermore, CNVnator is complementary in a straightforward way to split-read and read-pair approaches: it misses CNVs created by retrotransposable elements, but more than half of the validated CNVs that it identifies are not detected by split-read or read-pair.
Originally, CNVs were detected from the analysis of SNP and CGH array data, and this is still a cost-effective method for CNV discovery and genotyping. However, new sequencing-based approaches offer a valuable alternative, as they enable the discovery of more CNVs of all types (e.g., inversions and translocations that are not seen by CGH) and sizes (including indels).
Here we present a novel method, CNVnator, to detect CNVs from a statistical analysis of mapping density, i.e., RD, of short reads from next-generation sequencing platforms. Previous approaches using RD were limited to unique regions of the genome, discovered only large CNVs with poor breakpoint resolution, or could not perform genotyping. CNVnator is able to discover CNVs in a vast range of sizes, from a few hundred bases to megabases in length, in the whole genome.
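To make "mapping density" concrete, the following minimal sketch counts mapped reads in fixed-size bins along a chromosome. It is an illustration only, not CNVnator's implementation: the function name `read_depth_signal`, the 100-bp bin size, the use of 0-based read start positions, and the toy data are all assumptions introduced here.

```python
from collections import Counter


def read_depth_signal(read_starts, chrom_length, bin_size=100):
    """Return the number of reads starting in each consecutive bin.

    read_starts  -- iterable of 0-based mapped start positions (assumed input)
    chrom_length -- length of the chromosome in bases
    bin_size     -- width of each bin in bases (hypothetical default)
    """
    # Ceiling division so a final partial bin is still counted.
    n_bins = (chrom_length + bin_size - 1) // bin_size
    # Ignore positions outside the chromosome, then tally by bin index.
    counts = Counter(
        pos // bin_size for pos in read_starts if 0 <= pos < chrom_length
    )
    return [counts.get(i, 0) for i in range(n_bins)]


# Toy example: nine reads on a 500-bp "chromosome".
starts = [5, 40, 99, 100, 150, 320, 321, 322, 499]
signal = read_depth_signal(starts, chrom_length=500, bin_size=100)
print(signal)  # [3, 2, 0, 3, 1]
```

In this toy signal the empty third bin is the kind of local drop in RD that would be consistent with a deletion, although real data require a statistical treatment of noise before any such call.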
Indeed, we observed that the overall power to discover CNVs, measured as the ratio of the mean to the standard deviation (sigma) of the Gaussian fit to the RD distribution, correlates with sequencing coverage. However, the uniformity of coverage across the genome is also extremely important.
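The mean-to-sigma ratio can be illustrated with simulated data. The sketch below is an assumption-laden toy, not the paper's procedure: it uses the sample mean and standard deviation as a stand-in for a Gaussian fit to the RD histogram, approximates ideal Poisson sampling noise by a Gaussian with variance equal to the mean, and models non-uniform coverage with a hypothetical `extra_sd` term. The names `simulate_rd` and `mean_to_sigma` are introduced here for illustration.

```python
import math
import random
import statistics


def simulate_rd(mean_depth, n_bins=5000, extra_sd=0.0, seed=0):
    """Simulate per-bin RD values.

    Sampling noise is approximated as Gaussian with variance equal to the
    mean (Poisson-like); extra_sd adds bin-to-bin non-uniformity on top.
    """
    rng = random.Random(seed)
    sd = math.sqrt(mean_depth + extra_sd ** 2)
    return [rng.gauss(mean_depth, sd) for _ in range(n_bins)]


def mean_to_sigma(rd_signal):
    """Ratio of mean to standard deviation of the RD values."""
    mu = statistics.fmean(rd_signal)
    sigma = statistics.pstdev(rd_signal)
    return mu / sigma


low = mean_to_sigma(simulate_rd(mean_depth=25))
high = mean_to_sigma(simulate_rd(mean_depth=100))
uneven = mean_to_sigma(simulate_rd(mean_depth=100, extra_sd=20))

print(f"low coverage:          {low:.2f}")
print(f"high coverage:         {high:.2f}")
print(f"high coverage, uneven: {uneven:.2f}")
```

Under these assumptions the ratio grows roughly as the square root of the mean depth, so higher coverage raises it, while added non-uniformity at the same depth lowers it again. This matches the qualitative point that both coverage and its uniformity matter.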
Discovering duplications (compared to deletions) by means of RD represents a greater challenge for several reasons. First, mismapping in repeats can look like a duplication. Second, reads that originate from genomic regions that are not in the reference (i.e., gaps) will map to the homologous regions.