Seungtai Yoon et al., Sensitive and accurate detection of copy number variants using read depth of coverage, Genome Research 19 (2009)
The application of next-generation sequencing platforms to genetic studies promises to improve sensitivity to detect CNVs as well as inversions, indels, and SNPs. New computational approaches are needed to systematically detect these variants from genome sequence data.
We developed a method for CNV detection using read depth of coverage. Event-wise testing (EWT) is a method based on significance testing. In contrast to standard segmentation algorithms, which typically evaluate a likelihood at every point in the genome, EWT works on intervals of data points, rapidly searching for specific classes of events. The overall false-positive rate is controlled by testing the significance of each possible event and adjusting for multiple testing.
Our current power to detect SVs in disease studies is limited by the resolution of microarray analysis. Currently available array platforms that consist of more than 1 million probes have a lower limit of detection of ~10-25 kb.
To date, multiple approaches have been developed for the detection of SVs that are based on paired-end read mapping (PEM), which detects insertions and deletions by comparing the distance between mapped read pairs to the average insert size of the genomic library. Advantages of this approach include sensitivity for detecting deletions <1 kb in size and the ability to localize a breakpoint to within the region of a small fragment.
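The PEM comparison described above can be sketched as a simple classifier. This is a minimal illustration, not the procedure of any particular PEM tool: the function name `classify_pair`, the use of a standard deviation of the insert size, and the cutoff of 3 standard deviations are all assumptions made for the example.

```python
def classify_pair(mapped_distance, mean_insert, sd_insert, n_sd=3.0):
    """Classify one read pair by its mapped distance on the reference.

    A pair that maps farther apart than expected suggests sequence missing
    from the sample (deletion); a pair that maps closer together suggests
    extra sequence in the sample (insertion). n_sd is a hypothetical cutoff.
    """
    if mapped_distance > mean_insert + n_sd * sd_insert:
        return "deletion"
    if mapped_distance < mean_insert - n_sd * sd_insert:
        return "insertion"
    return "concordant"


# Library with a 200-bp mean insert size and a 20-bp standard deviation.
for distance in (1200, 100, 210):
    print(distance, classify_pair(distance, mean_insert=200, sd_insert=20))
```

The sketch also shows why insertions larger than the insert size are hard to detect this way: a pair cannot map closer than zero, so the signal saturates.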
PEM-based methods have poor ascertainment of SVs in complex genomic regions rich in segmental duplications and have limited ability to detect insertions larger than the average insert size of the library.
To detect CNVs based on read depth (RD), we developed a pipeline consisting of three steps, as illustrated in Figure 1: (1) we estimated the coverage, or RD, in nonoverlapping intervals across an individual genome; (2) we implemented a novel CNV-calling algorithm to detect events; and (3) we compared data from multiple individuals to distinguish events that are polymorphic.
For each sample, RD was measured by counting the number of mapped reads in 100-bp windows, assigning each read only once by its start position.
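The counting step can be sketched as follows. This is a minimal illustration assuming 0-based read start coordinates on a single chromosome; the function name `window_read_depth` and its parameters are hypothetical.

```python
def window_read_depth(read_starts, chrom_length, window=100):
    """Count reads per nonoverlapping window, assigning each read exactly
    once by its start position (0-based coordinates are assumed)."""
    n_windows = (chrom_length + window - 1) // window  # ceiling division
    counts = [0] * n_windows
    for start in read_starts:
        if 0 <= start < chrom_length:
            counts[start // window] += 1
    return counts


# Seven reads on a 300-bp toy chromosome, 100-bp windows.
depth = window_read_depth([0, 5, 99, 100, 250, 250, 299], chrom_length=300)
print(depth)  # [3, 1, 3]
```

Assigning a read by its start position alone means a read that straddles a window boundary is never counted twice.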
A larger window size (e.g., 1000 bp) would provide less precision in defining the breakpoints of CNVs. It could also make the detection of small (~1000 bp) CNVs problematic, because in many cases these CNVs would only partially span one or two windows. In addition, at 30x coverage, the distribution of read counts in 100-bp windows is well approximated by a normal distribution, which permits us to assume normality in our statistical calculations.
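A rough calculation suggests why the normal approximation is reasonable at this depth. The sketch below rests on two assumptions that are not stated in the text: a read length of 36 bp, and read counts per window that are approximately Poisson. Under those assumptions a 100-bp window at 30x coverage holds about 83 read starts, and the gap between a Poisson distribution and its matching normal curve shrinks as the mean grows.

```python
import math


def expected_reads_per_window(coverage, window_bp, read_length_bp):
    """Expected number of read starts per window at a given fold coverage."""
    return coverage * window_bp / read_length_bp


def poisson_pmf(k, lam):
    """Poisson probability of k events, computed in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and SD sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def max_gap(lam):
    """Largest pointwise gap between Poisson(lam) and Normal(lam, sqrt(lam))."""
    sigma = math.sqrt(lam)
    upper = int(lam + 10 * sigma) + 1
    return max(abs(poisson_pmf(k, lam) - normal_pdf(k, lam, sigma))
               for k in range(upper))


lam = expected_reads_per_window(30, 100, 36)  # 36-bp reads are an assumption
print(round(lam, 1))
print(max_gap(lam), max_gap(3.0))  # the gap is much smaller at the higher mean
```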
Sequence coverage on the Illumina Genome Analyzer platform is influenced by GC content.
We sought to adjust the 100-bp window read counts based on the observed deviation in coverage for a given G+C percentage.
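One common form of such a correction rescales each window by the ratio of the overall median read count to the median read count of windows with the same G+C percentage. The sketch below assumes that form. The function name `gc_adjust`, the integer GC bins, and the handling of bins with a zero median are illustrative choices rather than details taken from the text.

```python
from collections import defaultdict
from statistics import median


def gc_adjust(read_counts, gc_percent):
    """Rescale window read counts so that every G+C bin has the same median.

    adjusted = count * overall_median / median_of_windows_with_same_GC
    (this formula is assumed for illustration).
    """
    overall = median(read_counts)
    by_gc = defaultdict(list)
    for count, gc in zip(read_counts, gc_percent):
        by_gc[gc].append(count)
    gc_median = {gc: median(values) for gc, values in by_gc.items()}

    adjusted = []
    for count, gc in zip(read_counts, gc_percent):
        m_gc = gc_median[gc]
        # Leave the count unchanged if the bin median is zero (assumption).
        adjusted.append(count * overall / m_gc if m_gc > 0 else float(count))
    return adjusted


# Toy data in which coverage rises with G+C percentage.
counts = [80, 80, 100, 100, 120, 120]
gc = [30, 30, 40, 40, 50, 50]
print(gc_adjust(counts, gc))  # every window is rescaled to 100.0
```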
We use the GC-adjusted RD within 100-bp windows as a quantitative measurement of genome copy number. A deletion or duplication is evident as a decrease or increase in coverage across multiple consecutive windows.
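The idea of calling an event from a run of consecutive windows can be sketched as an interval-based significance test in the spirit of EWT. This is a simplified illustration, not the authors' implementation: the fixed interval length, the threshold `(fpr / (n / interval)) ** (1 / interval)` as the multiple-testing adjustment, and the function name `find_events` are assumptions, and the mean and standard deviation of the adjusted RD are taken as given.

```python
from statistics import NormalDist


def find_events(depth, mu, sigma, interval=3, fpr=0.05):
    """Flag intervals of consecutive windows whose read depth is uniformly
    high (gain) or uniformly low (loss).

    Each window is converted to a Z-score and to upper- and lower-tail
    probabilities. An interval is called only if the least extreme window
    in it still passes the per-window threshold (assumed form).
    """
    norm = NormalDist()
    z = [(d - mu) / sigma for d in depth]
    upper = [1.0 - norm.cdf(v) for v in z]  # small when depth is high
    lower = [norm.cdf(v) for v in z]        # small when depth is low

    n = len(depth)
    threshold = (fpr / (n / interval)) ** (1.0 / interval)

    events = []
    for i in range(n - interval + 1):
        if max(upper[i:i + interval]) < threshold:
            events.append((i, i + interval, "gain"))
        elif max(lower[i:i + interval]) < threshold:
            events.append((i, i + interval, "loss"))
    return events


# Twenty windows at depth 100 with a three-window drop to 40.
depth = [100] * 20
depth[8:11] = [40, 40, 40]
print(find_events(depth, mu=100, sigma=10))  # [(8, 11, 'loss')]
```

Requiring every window in the interval to pass the threshold is what makes a single noisy window insufficient to trigger a call. A fuller version would repeat the search over several interval lengths and merge overlapping calls.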
One feature of the MAQ alignment algorithm is important to point out here. When a single read has multiple exact matches in the genome, it is assigned randomly to a single location. Consequently, coverage across a repetitive or segmentally duplicated region does not differ from the mean if the copy number of that region in the sample is the same as its copy number in the reference genome. Therefore, the observed events in our data constitute regions of copy number difference between the sample and the reference genome. These events may represent CNVs. They may also represent fixed segmental duplications that are not correctly mapped in the genome, or they may represent a region where the reference genome has a rare allele. Therefore, we must compare the RD of the region in multiple genomes in order to distinguish between events that are clearly polymorphic and those that are not.
While RD analysis overcomes some of the limitations of other methods, it has limitations of its own. RD analysis is not able to ascertain balanced rearrangements. In addition, ascertainment of SVs that involve highly repetitive sequences is limited. RD analysis cannot determine the precise location of an insertion, nor can it find novel insertions that are not already in the reference genome. These classes of SVs are more easily detectable using a PEM-based approach. Given the relative strengths of PEM and EWT, the two methods are quite complementary.