Min Zhao et al., Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives, BMC Bioinformatics 14 (2013)
CNV refers to a type of intermediate-scale structural variant (SV) with copy number changes involving a DNA fragment that is typically greater than one kilobase (Kb) and less than five megabases (Mb).
It is estimated that approximately 12% of the genome in human populations is subject to copy number change.
So far, approximately half of the reported CNVs overlap with protein-coding regions.
Because the lengths of CNVs vary greatly, current computational tools usually target a specific range of CNV sizes. Readers should be aware that this review covers all types of CNVs, including CNV events smaller than 1 Kb, intermediate structural variants larger than 1 Kb, and large chromosomal events over 5 Mb.
In 2003, genome-wide detection of CNVs was achieved using more accurate array-based comparative genomic hybridization (arrayCGH) and single-nucleotide polymorphism (SNP) array approaches; these approaches, however, suffer from several inherent drawbacks, including hybridization noise, limited genome coverage, low resolution, and difficulty in detecting novel and rare mutations.
We mainly focus on: (i) the key features of CNV calling tools using NGS data, (ii) the key factors to consider before pipeline design, and (iii) developing a combinatorial approach for more accurate CNV calling.
So far, the NGS-based CNV detection methods can be categorized into five strategies: (1) paired-end mapping (PEM), (2) split read (SR), (3) read depth (RD), (4) de novo assembly of a genome (AS), and (5) combination of the above approaches (CB).
Most PEM-, SR-, and CB-based tools are not specific to CNV detection but rather for SV identification, while the majority of RD- and AS-based tools are developed for the detection of CNVs instead of SVs.
Notably, PEM is only applicable to paired-end reads but not single-end reads.
PEM identifies SVs/CNVs from discordantly mapped read pairs whose distances are significantly different from the predetermined average insert size. PEM methods can efficiently identify not only insertions and deletions but also mobile element insertions, inversions, and tandem duplications. However, PEM is not applicable to insertion events larger than the average insert size of the genome library. Another limitation of PEM is its inability to detect CNVs in low-complexity regions with segmental duplication.
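The discordant-pair idea can be illustrated with a minimal Python sketch. It assumes that the mapped distance of each read pair is already known and that the library insert-size distribution is summarized by its mean and standard deviation; the function name `classify_pairs` and the three-standard-deviation cutoff `n_sd` are illustrative assumptions, not parameters of any particular tool.

```python
from statistics import mean, stdev

def classify_pairs(mapped_distances, library_sizes, n_sd=3.0):
    """Label each mapped pair distance relative to the library insert sizes.

    A pair that maps much farther apart than expected suggests a deletion in
    the sample; a pair that maps much closer together suggests an insertion.
    The cutoff n_sd is an assumed threshold, not a recommended value.
    """
    mu = mean(library_sizes)
    sd = stdev(library_sizes)
    labels = []
    for distance in mapped_distances:
        if distance > mu + n_sd * sd:
            labels.append("deletion_candidate")
        elif distance < mu - n_sd * sd:
            labels.append("insertion_candidate")
        else:
            labels.append("concordant")
    return labels

library = [290, 300, 310, 295, 305, 300, 298, 302]
print(classify_pairs([300, 900, 150], library))
```

Real PEM callers additionally cluster discordant pairs that support the same event and use read orientation to recognize inversions and tandem duplications; this sketch covers only the distance test.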
SR methods start from read pairs in which one read from each pair is aligned to the reference genome uniquely while the other one fails to map or only partially maps to the genome. Those unmapped or partially mapped reads potentially provide accurate breakpoints at the single base pair level for SVs/CNVs. SR methods split the incompletely mapped reads into multiple fragments. The first and last fragments of each split read are then aligned to the reference genome independently. This remapping step therefore provides the precise start and end positions of the insertion/deletion events. However, the SR-based approach heavily relies on the length of reads and is only applicable to the unique regions in the reference genome.
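The remapping step can be sketched for the simplest case, a read that spans a single deletion. The sketch below is a toy under stated assumptions: exact string matching stands in for a real aligner, the reference is a short Python string, and the names `find_unique`, `deletion_from_split_read`, and `anchor_len` are hypothetical.

```python
def find_unique(reference, fragment):
    """Return the position of fragment in reference if it occurs exactly once."""
    first = reference.find(fragment)
    if first == -1 or reference.find(fragment, first + 1) != -1:
        return None
    return first

def deletion_from_split_read(reference, read, anchor_len=8):
    """Return (start, end) of a deleted reference interval, 0-based, end-exclusive.

    The first and last anchor_len bases of the read are placed independently;
    None is returned when either fragment is missing or not unique.
    """
    head_pos = find_unique(reference, read[:anchor_len])
    tail_pos = find_unique(reference, read[-anchor_len:])
    if head_pos is None or tail_pos is None:
        return None
    # Extend the head match to the right until read and reference disagree.
    matched = anchor_len
    while (matched < len(read) and head_pos + matched < len(reference)
           and read[matched] == reference[head_pos + matched]):
        matched += 1
    remaining = len(read) - matched          # read bases after the breakpoint
    del_start = head_pos + matched           # first deleted reference base
    del_end = tail_pos + anchor_len - remaining  # first base after the deletion
    if remaining == 0 or del_end <= del_start:
        return None
    # Confirm that the rest of the read matches the reference after the gap.
    if reference[del_end:del_end + remaining] != read[matched:]:
        return None
    return del_start, del_end

left, deleted, right = "ACGTTGCAAGGC", "TTTTTCCCCC", "GATCATGGACTA"
reference = left + deleted + right
print(deletion_from_split_read(reference, left + right))
```

The uniqueness check in `find_unique` mirrors the restriction stated above: if either fragment occurs more than once in the reference, the sketch declines to call a breakpoint.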
The underlying hypothesis of RD-based methods is that the depth of coverage in a genomic region is correlated with the copy number of the region, e.g., a gain of copy number should have a higher intensity than expected. Compared to PEM- and SR-based tools, RD-based methods can detect the exact copy numbers, which the former approaches cannot because PEM/SR methods only use the position information. Moreover, RD-based methods can detect large insertions and CNVs in complex genomic region classes, which are difficult to detect using PEM and SR methods.
Compared with model-free approaches, approaches using mathematical models to detect CNVs generate more reliable results after filtering false positive regions.
Different from the RD, PEM and SR approaches that first align NGS reads to a known reference genome before the detection of CNVs, the AS-based methods first reconstruct DNA fragments, i.e., contigs, from short reads by assembling overlapping reads. By comparing the assembled contigs to the reference genome, the genomic regions with discordant copy numbers are then identified. This direct assembly of short reads without using a reference is called de novo assembly. Assembly can also use a reference genome as a guide to improve its computational efficiency and contig quality.
Without relying on read alignment, the AS methods potentially provide an unbiased approach to discover novel genetic variants ranging from the single base pair level to large structural variation. However, AS methods are rarely used in CNV detection in non-human eukaryotic genomes due to the low quality of the assembled contigs and their overwhelming demand on computational resources.
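The read-depth hypothesis can be made concrete with a small Python sketch. It assumes reads have already been counted in fixed-size windows, that most windows are at the normal copy number (so the median count is a usable baseline), and that the baseline is diploid; `estimate_copy_number` and `ploidy` are illustrative names, and no bias correction or segmentation is attempted.

```python
from statistics import median

def estimate_copy_number(window_counts, ploidy=2):
    """Estimate an integer copy number per window from raw read counts.

    The median count is taken as the depth expected at `ploidy` copies,
    which assumes that most windows are copy-number neutral.
    """
    baseline = median(window_counts)
    if baseline == 0:
        raise ValueError("median read count is zero; cannot normalize")
    return [round(ploidy * count / baseline) for count in window_counts]

counts = [100, 98, 102, 150, 151, 100, 50, 99]
print(estimate_copy_number(counts))
```

Windows at roughly 1.5 times the baseline are reported as three copies and windows at half the baseline as one copy, which is the kind of copy-number estimate that position-only PEM/SR signals do not provide.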
With distinct advantages and weaknesses, these four strategies could be complementary to each other. PEM-based methods can detect all types of SVs (e.g., deletion, novel insertion, translocation, inversion, interspersed and tandem duplication), especially deletions of less than 1 Kb. However, PEM tools cannot accurately estimate the actual numbers of copies, and they are not applicable to insertions larger than the average insert size of the library. In contrast, RD-based methods can accurately predict copy numbers, perform well in detecting large CNVs, and are applicable to both WGS and WES data. However, they are not suitable for detecting precise breakpoints or small CNVs (e.g., < 1 Kb). The exclusive advantage of AS-based tools is that they do not require a reference genome as input and, importantly, they allow the discovery of novel mutation sequences. However, AS-based tools require extensive computation and perform poorly on repeat and duplication regions. SR-based tools can detect breakpoints at base pair resolution and perform well on deletions and small insertions. However, SR-based tools have low sensitivity in low-complexity regions, as they rely on unique mapping information to the genome.
Despite improvements to NGS technologies and CNV-detecting tools, the identification of low coverage CNVs still remains a challenge. The RD-based approach in particular has to correct for distortions caused by NGS biases, because the relationship between the read count and true copy number can be distorted by several effects. The PCR process is known to be one major cause of this distortion: genomic fragments with a lower PCR amplification rate often result in fewer reads. Furthermore, the sequencing process can also introduce system noise. It was reported that NGS has lower sequencing coverage in regulatory regions. In particular, the capability of exome capture in the library preparation process complicates the connection between true copy number and read count for WES data.
One well-investigated bias in RD-based methods is related to GC content, the percentage of guanine and cytosine bases in a genomic region. GC content varies markedly along the human genome and across species and has been found to influence read coverage on most sequencing platforms.
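One simple way to reduce this bias, shown here only as a sketch, is to rescale each window's read count by the typical count of windows with similar GC content. The code assumes per-window read counts and GC fractions are available; `gc_fraction`, `gc_correct`, and the 5-percentage-point bin width `bin_pct` are assumptions made for illustration, and the median-ratio scheme is one option among several rather than the method of any specific tool.

```python
from collections import defaultdict
from statistics import median

def gc_fraction(sequence):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

def gc_correct(counts, gc_values, bin_pct=5):
    """Scale each count by (overall median) / (median of its GC bin)."""
    def bin_of(gc):
        # Bin on integer percentages to avoid floating-point edge effects.
        return int(round(gc * 100)) // bin_pct

    by_bin = defaultdict(list)
    for count, gc in zip(counts, gc_values):
        by_bin[bin_of(gc)].append(count)

    overall = median(counts)
    corrected = []
    for count, gc in zip(counts, gc_values):
        bin_median = median(by_bin[bin_of(gc)])
        corrected.append(count * overall / bin_median if bin_median else 0.0)
    return corrected

print(gc_correct([80, 80, 120, 120], [0.31, 0.33, 0.51, 0.53]))
```

In this toy input the low-GC windows are under-covered and the high-GC windows over-covered by the same factor, so the corrected counts come out equal and no spurious gain or loss is implied.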
Another major bias that affects CNV calling when using the RD-based approach originates from read alignment. In the alignment step, a significant portion of reads are mapped to multiple positions because of short read lengths and the presence of repetitive regions in the reference genome.
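How multi-mapped reads are counted changes the per-window depth that RD methods rely on. The sketch below contrasts two possible policies, discarding ambiguous reads or splitting each read's weight evenly across its candidate positions. Both policies, the input layout (one list of candidate positions per read), and the name `window_counts` are assumptions for illustration and are not taken from the text above.

```python
from collections import Counter

def window_counts(alignments, window_size, mode="fractional"):
    """Count reads per window.

    alignments: one list of candidate mapping positions per read.
    mode="unique" drops reads with more than one candidate position;
    mode="fractional" gives each candidate position an equal share of the read.
    """
    counts = Counter()
    for positions in alignments:
        if not positions:
            continue  # unmapped read
        if mode == "unique" and len(positions) > 1:
            continue  # ambiguous read discarded
        weight = 1.0 / len(positions)
        for position in positions:
            counts[position // window_size] += weight
    return dict(counts)

reads = [[10], [20], [30, 130], [150]]
print(window_counts(reads, 100, mode="fractional"))
print(window_counts(reads, 100, mode="unique"))
```

Discarding ambiguous reads lowers depth in repetitive windows, while fractional assignment spreads signal across all copies of a repeat; either choice affects the copy number inferred for those regions.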
In the competition for higher sequencing throughput, greater data accuracy, and longer read lengths, third-generation sequencing (TGS) technologies are now being developed, which promise to provide dramatically longer read lengths.
Longer reads will greatly ease read alignment and CNV detection in repetitive regions of the genome. These longer reads will significantly reduce mapping errors due to incorrect sequencing. The increased read length will also strengthen the statistical power of the RD methods. In addition, all these improvements in mapping quality will help PEM methods avoid false positives caused by chimeras in the genome.
Additionally, longer reads may improve assembly quality when implementing AS methods.