Transcript Slide 1
Biostatistics-Lecture 20
Copy number variation detection
Ruibin Xi
Peking University, School of Mathematical Sciences

Copy number variation (CNV)
• CNVs: gains or losses of genomic segments
• CNVs account for a substantial proportion of human genomic variation
• In the Database of Genomic Variants (DGV), over 30% of the human genome can be affected by CNVs

CNV Distributions in 200+ Normal Human Genomes (Nature 2006)

CNVs are associated with many diseases

CNV in cancer genome
Kim et al. 2013, Genome Res.

CNV detection strategies with HTS data

CNV detection using read-depth
• Read-depth: read density in a genomic region
• If there is no bias, the read-depth in a genomic region should be roughly proportional to the copy number
• But there are often biases in the NGS data

Algorithms
• CNV-seq (Xie and Tammi 2009)
• SegSeq (Chiang et al. 2009)
• rSW-seq (Kim et al. 2010)
• BIC-seq (Xi et al. 2011) and NBIC-seq
• FREEC (Boeva et al. 2011)

Circular Binary Segmentation (CBS)
• CBS (Olshen et al. 2004)
• $X_1, X_2, \ldots, X_n$: a sequence of random variables
• An index $\gamma$ is called a change point if $X_1, \ldots, X_\gamma \sim F_0$ and $X_{\gamma+1}, \ldots \sim F_1$, where $F_0 \neq F_1$
• Binary segmentation: test statistic for one change point [equation not preserved in the transcript]

Circular Binary Segmentation (CBS)
• Circular binary segmentation: test statistic [equation not preserved in the transcript]

CNV-seq
• Use a sliding window
• Model the read count in each window as a Poisson distribution with mean $\lambda = NW/G$
  – $N$: total number of sequenced reads
  – $W$: window size
  – $G$: genome size
• When $\lambda$ is large, this can be approximated by a Gaussian distribution
• Given a case and a control, the copy ratio $z$ is the ratio of the case read count to the control read count in a window

CNV-seq (cont.)
• The distribution of a ratio of two Gaussian variables is cumbersome; instead use a transformed statistic [equation not preserved in the transcript]
• The transformed statistic approximately follows a standard Gaussian distribution
• P-value [equation not preserved in the transcript]

SegSeq
• In a window of length $L$, the number of reads from the normal genome follows a Poisson distribution with mean $L N_{\mathrm{normal}} / A$
  – $A$: reference genome size
  – $N_{\mathrm{normal}}$: total normal reads
• The number of reads from the tumor in the window follows a Poisson distribution with mean $r L N_{\mathrm{tumor}} / A$
  – $r$: copy ratio
  – $N_{\mathrm{tumor}}$: total tumor reads

SegSeq (cont.)
• The read count ratio $R$ approximately follows a log-normal distribution, i.e. $\log(R)$ is approximately Gaussian, when [condition not preserved in the transcript]
• To test whether a position $x$ is a change point, use [test statistic not preserved in the transcript]

SegSeq algorithm
1. For each position in the genome, choose left and right windows around it that each contain a fixed number of reads $w$
2. Test whether the window is a possible change point (i.e. [criterion not preserved in the transcript]); if yes, remove all positions in this window and move on to the next position
3. Merge two adjacent initial segments if the p-value for the difference between them is larger than [threshold not preserved in the transcript]

Poisson or negative binomial model?

BIC-seq: Statistical Model
• Given a short read $R$ that is mapped to the reference genome, it consists of two pieces of information
  – The position $S$ on the reference genome
  – The read type $Y$: tumor ($Y = 1$) or normal ($Y = 0$)
• Assume the distribution of $R = (Y, S)$ is $f(y, s)$.
• By Bayes' theorem,
  $$f(y, s) = \Pr(Y = y \mid S = s)\,\Pr(S = s) = \Pr(Y = y \mid S = s)\, f(s),$$
  where $f(s)$ is the marginal distribution of $S$.

Statistical Model (cont. 1)
• Denote by $q_s$ the probability that a read at position $s$ is a tumor read, i.e. $q_s = \Pr(Y = 1 \mid S = s)$.
• Given $N$ mapped short reads $R_1 = (y_1, s_1), \ldots, R_N = (y_N, s_N)$, the joint likelihood is
  $$L_N = \prod_{i=1}^{N} q_{s_i}^{y_i} (1 - q_{s_i})^{1 - y_i} f(s_i)$$
• To identify CNV regions, it is enough to identify the breakpoints.
[Figure: regions labeled "larger $q_s$" and "smaller $q_s$"]

Statistical Model (cont. 2)
• Assume that $q_s$ is constant between any two neighboring breakpoints.
• Given the breakpoints $0 = \tau_0 < \tau_1 < \cdots < \tau_m < \tau_{m+1} = L_c$ on a chromosome $c$, where $L_c$ is the length of chromosome $c$.
• Let $p_j$ be the common probability $q_s$ between the breakpoints $\tau_j$ and $\tau_{j+1}$. The likelihood can be written as
  $$L_N = \prod_{j=0}^{m} \prod_{\tau_j < s_i \le \tau_{j+1}} p_j^{y_i} (1 - p_j)^{1 - y_i} f(s_i)$$
• One set of breakpoints corresponds to one model. Then, we can use a model selection criterion such as the Bayesian information criterion (BIC) to select the breakpoints.

Bayesian information criterion (BIC)
The general definition of the BIC of a model is
  $$\mathrm{BIC} = -2 \log L + k \log n$$
  – $L$: the likelihood function evaluated at the MLE
  – $k$: the number of parameters in the model
  – $n$: the total number of observations

BIC (cont.)
Given the breakpoints $0 = \tau_0 < \tau_1 < \cdots < \tau_m < \tau_{m+1} = L_c$, the BIC is
  $$\mathrm{BIC} = -2 \sum_{j=0}^{m} \left[ k_j \log \hat{p}_j + (n_j - k_j) \log(1 - \hat{p}_j) \right] - 2 \sum_{i=1}^{N} \log f(s_i) + \lambda (m + 1) \log N$$
  – $k_j$: the number of tumor reads between $\tau_j$ and $\tau_{j+1}$
  – $n_j$: the total number of reads between $\tau_j$ and $\tau_{j+1}$
  – $\hat{p}_j = k_j / n_j$: the MLE of the parameter $p_j$
  – $\lambda > 0$: tuning parameter
Note that the term $-2 \sum_{i=1}^{N} \log f(s_i)$ is common to all models. Therefore, we can drop it when comparing different models.

Asymptotic result
Assume $f(s) > 0$ for all $s$. Then the breakpoint set that minimizes the BIC is a consistent estimator of the true breakpoint set, i.e. it converges to the true breakpoint set in probability as $N$, the number of observations, goes to infinity.
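To make the criterion concrete, here is a minimal Python sketch that evaluates the BIC for one fixed breakpoint set from tumor and normal read positions, with the term involving $f(s)$ dropped as noted in the slides. The function name `segment_bic`, its parameters, and the toy data are hypothetical and chosen only for illustration; the sketch scores a given breakpoint set and does not attempt to reproduce the search procedure that BIC-seq uses to find breakpoints.

```python
import math
from bisect import bisect_right


def segment_bic(tumor_pos, normal_pos, breakpoints, chrom_len, lam=1.0):
    """Evaluate the tuned BIC for one breakpoint set (hypothetical helper).

    The term involving f(s) is omitted because it is identical for every
    breakpoint set and therefore does not affect model comparison.
    """
    tumor = sorted(tumor_pos)
    normal = sorted(normal_pos)
    n_reads = len(tumor) + len(normal)                 # N
    bounds = [0] + sorted(breakpoints) + [chrom_len]   # tau_0 .. tau_{m+1}

    log_lik = 0.0
    for left, right in zip(bounds[:-1], bounds[1:]):
        # Count reads whose position s satisfies left < s <= right.
        k = bisect_right(tumor, right) - bisect_right(tumor, left)
        n = k + bisect_right(normal, right) - bisect_right(normal, left)
        if 0 < k < n:
            p_hat = k / n                              # MLE of p_j
            log_lik += k * math.log(p_hat) + (n - k) * math.log(1.0 - p_hat)
        # If k == 0 or k == n the segment contributes 0 (0 * log 0 := 0).

    n_segments = len(bounds) - 1                       # m + 1 parameters
    return -2.0 * log_lik + lam * n_segments * math.log(n_reads)


# Toy data (made up): the tumor has 3x as many reads in (100, 200].
normal = list(range(1, 201))
tumor = list(range(1, 101)) + [s for s in range(101, 201) for _ in range(3)]

bic_without = segment_bic(tumor, normal, [], 200)
bic_with = segment_bic(tumor, normal, [100], 200)
print(bic_with < bic_without)
```

On this toy input the model with a breakpoint at position 100 gets the smaller BIC, so it would be preferred. Increasing `lam` strengthens the penalty per segment and favors fewer breakpoints.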
BIC-seq: an algorithm for detecting somatic CNVs in tumor genomes

Some remarks
• Used a red-black tree to accelerate the algorithm
• Outlier removal: look at a local genomic window to determine whether the read count at a nucleotide position is an outlier
• Assign a credible interval to a breakpoint
  – Gibbs sampling

An outlier example

Application of BIC-seq on a GBM tumor genome
• Applied BIC-seq on a GBM tumor genome
  – Tumor: 10x
  – Normal: 7x
• Detected 291 putative CNVs ranging from 40 bp to 5.7 Mb
• Compare the copy ratio estimates given by BIC-seq and an array-based platform

Application of BIC-seq on a GBM tumor genome (Cont.)
• Selected 16 small CNVs ranging from 110 bp to 14 kb for qPCR validation
• 14 out of 16 (87.5%) were validated

Application of BIC-seq on a GBM tumor genome (Cont.)