Biostatistics-Lecture 20
Copy number variation detection
Ruibin Xi
Peking University
School of Mathematical Sciences
Copy number variation (CNV)
 CNVs: gains or losses of genomic segments
CNVs account for a substantial proportion of human genomic
variation
 In the Database of Genomic Variants (DGV), over 30% of the
human genome can be affected by CNVs
CNV Distributions in 200+ Normal Human Genomes
(Nature 2006)
CNVs are associated with many diseases
CNV in cancer genome
Kim et al. 2013 Genome Res.
CNV detection strategies with HTS data
CNV detection using read-depth
 Read-depth: read density in a genomic region
 If there is no bias, the read-depth in a genomic region should
be roughly proportional to the copy number
But there are often biases in the NGS data.
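As a rough illustration of the read-depth idea, the sketch below bins read start positions and forms a per-bin tumor/normal ratio scaled by the total read counts. The function names, the 0-based positions and the fixed bin size are assumptions made for this example; no bias correction is attempted.

```python
import numpy as np

def binned_read_depth(positions, chrom_length, bin_size):
    """Count reads whose 0-based start position falls in each fixed-size bin."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    idx = np.asarray(positions) // bin_size
    return np.bincount(idx, minlength=n_bins)

def naive_copy_ratio(tumor_pos, normal_pos, chrom_length, bin_size):
    """Per-bin tumor/normal depth ratio, scaled by the total read counts."""
    t = binned_read_depth(tumor_pos, chrom_length, bin_size).astype(float)
    n = binned_read_depth(normal_pos, chrom_length, bin_size).astype(float)
    scale = len(normal_pos) / len(tumor_pos)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Bins without normal reads have no defined ratio.
        return np.where(n > 0, scale * t / n, np.nan)
```

A ratio near 1 suggests no copy number change relative to the normal sample; with real data, GC content and mappability would distort these raw ratios.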
Algorithms
CNV-seq (Xie and Tammi 2009)
SegSeq (Chiang et al. 2009)
rSW-seq (Kim et al. 2010)
BIC-seq (Xi et al. 2011) and NBIC-seq
FREEC (Boeva et al. 2011)
Circular Binary Segmentation (CBS)
 CBS (Olshen et al. 2004)
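$X_1, X_2, \ldots, X_n$: a sequence of random variables
An index $\gamma$ is called a change point if
$X_1, \ldots, X_\gamma \sim F_0$ and
$X_{\gamma+1}, \ldots \sim F_1$, with $F_0 \neq F_1$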
 Binary segmentation
Test statistic for one change point
Circular Binary Segmentation (CBS)
 Circular Binary Segmentation
Test statistic
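The formulas for the two test statistics did not survive in this transcript. The sketch below uses the forms usually quoted for Olshen et al. 2004, assuming unit-variance observations: a standardized difference of means on either side of a split (binary segmentation), and a standardized difference between the mean inside an arc (i, j] and the mean outside it (circular binary segmentation). Function names are illustrative, and no permutation-based significance test is included.

```python
import numpy as np

def binary_segmentation_stat(x):
    """Return (max |Z_i|, i) over split points i = 1..n-1."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = np.cumsum(x)                      # s[i-1] = x_1 + ... + x_i
    i = np.arange(1, n)
    left_mean = s[i - 1] / i
    right_mean = (s[-1] - s[i - 1]) / (n - i)
    z = (left_mean - right_mean) / np.sqrt(1 / i + 1 / (n - i))
    best = int(np.argmax(np.abs(z)))
    return float(abs(z[best])), int(i[best])

def circular_binary_segmentation_stat(x):
    """Return (max |Z_ij|, i, j) over arcs (i, j] with 0 <= i < j <= n."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = np.concatenate(([0.0], np.cumsum(x)))  # s[k] = x_1 + ... + x_k
    best = (0.0, 0, 0)
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i
            if k == n:                    # the arc must not be the whole sequence
                continue
            inside = (s[j] - s[i]) / k
            outside = (s[n] - s[j] + s[i]) / (n - k)
            z = abs(inside - outside) / np.sqrt(1 / k + 1 / (n - k))
            if z > best[0]:
                best = (float(z), i, j)
    return best
```

In Python indexing the arc (i, j] corresponds to the slice x[i:j], so the circular version can pick out an interior segment in one step, where plain binary segmentation needs two successive splits.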
CNV-seq
 Use a sliding window
 Model the read count in each window as a Poisson distribution with mean $\lambda = NW/G$
N: total number of sequenced reads
W: window size
G: genome size
When λ is large, this can be approximated by a Gaussian distribution
Given a case and a control, the copy ratio
z is the read count ratio
CNV-seq
 The ratio of two Gaussian variables has a cumbersome distribution;
instead use
Approximately a standard Gaussian distribution
P-value
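The transformed statistic itself is missing from the transcript. A minimal sketch, assuming the Geary-Hinkley transformation of the ratio z = X/Y with Poisson-derived means and variances (mean = variance = NW/G for each sample, zero correlation), could look like this. The two-sided p-value and all function and parameter names are choices made for this example.

```python
import math

def expected_window_count(n_reads, window, genome_size):
    """Poisson mean for a window: total reads * window size / genome size."""
    return n_reads * window / genome_size

def geary_hinkley_t(z, mu_x, mu_y, var_x, var_y, rho=0.0):
    """Approximately standard Gaussian transform of a Gaussian ratio z = X/Y."""
    num = mu_y * z - mu_x
    den = math.sqrt(var_y * z**2 - 2 * rho * math.sqrt(var_x * var_y) * z + var_x)
    return num / den

def window_pvalue(x_count, y_count, n_x, n_y, window, genome_size):
    """Return (copy ratio, t, two-sided p) for one window; y_count must be > 0."""
    lam_x = expected_window_count(n_x, window, genome_size)   # case
    lam_y = expected_window_count(n_y, window, genome_size)   # control
    z = x_count / y_count
    t = geary_hinkley_t(z, lam_x, lam_y, lam_x, lam_y)
    p = math.erfc(abs(t) / math.sqrt(2))      # equals 2 * (1 - Phi(|t|))
    return z * n_y / n_x, t, p
```

For example, with one million reads per sample, a 1 kb window and a 10 Mb genome, each window expects 100 reads, so observing 200 case reads against 100 control reads gives a large positive t and a very small p-value.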
SegSeq
 In a window of length L, the number of reads for the normal
genome follows a Poisson distribution with mean $\lambda_N = L N_N / A$
A: reference genome size
$N_N$: total normal reads
 The number of reads for the tumor in the window follows a Poisson distribution with mean $\lambda_T = r L N_T / A$
r: copy ratio
$N_T$: total tumor reads
SegSeq
 The read count ratio R
log(R) is approximately normally distributed when the expected read counts are large
To test whether a position x is a change point, use
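The test statistic is not preserved in the transcript. One plausible sketch, not necessarily the statistic used by SegSeq, compares the log ratios in the windows to the left and right of x, using the delta-method variance 1/T + 1/N for the log of a tumor/normal Poisson count ratio. All names below are illustrative.

```python
import math

def log_ratio(tumor_count, normal_count, total_tumor, total_normal):
    """Log of the tumor/normal read count ratio, normalized by library size."""
    return math.log((tumor_count / total_tumor) / (normal_count / total_normal))

def change_point_pvalue(t_left, n_left, t_right, n_right, total_tumor, total_normal):
    """Two-sided p-value for equal copy ratio on both sides of a position.

    Assumes all four counts are positive and large enough for the
    Gaussian approximation of log ratios to be reasonable.
    """
    d = (log_ratio(t_right, n_right, total_tumor, total_normal)
         - log_ratio(t_left, n_left, total_tumor, total_normal))
    var = 1 / t_left + 1 / n_left + 1 / t_right + 1 / n_right  # delta method
    z = d / math.sqrt(var)
    return d, math.erfc(abs(z) / math.sqrt(2))
```

The library sizes cancel in the difference d; they are kept in the signature only so that each log ratio is interpretable on its own.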
SegSeq
 Algorithm:
1. For each position in the genome, choose a window whose left and
right halves each contain a fixed number of reads w
2. Test whether the window contains a possible change point (i.e. the
p-value of the test falls below a chosen threshold); if so, remove all
positions in this window and move on to the next position
3. Merge the initial segments if two adjacent segments have a p-value
greater than a chosen merge threshold
Poisson or negative binomial model?
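One simple way to look at this question on real data is to compare the variance of window counts with their mean: a Poisson model implies a ratio near 1, while a clearly larger ratio points to overdispersion and a negative binomial model. The sketch below is an illustration with hypothetical function names; the size parameter uses the moment relation var = mu + mu^2 / r.

```python
import math
import numpy as np

def dispersion_index(counts):
    """Sample variance divided by sample mean; about 1 under a Poisson model."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()

def nb_size_moment_estimate(counts):
    """Method-of-moments negative binomial size r; inf if not overdispersed."""
    counts = np.asarray(counts, dtype=float)
    m, v = counts.mean(), counts.var(ddof=1)
    if v <= m:
        return math.inf
    return m * m / (v - m)
```

Counts should come from windows believed to share one copy number; mixing regions with different copy numbers inflates the variance on its own.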
BIC-seq: Statistical Model
• Given a short read R that is mapped to the reference
genome, it consists of two pieces of information
– The position S on the reference genome
– The read type Y: tumor ($Y = 1$) or normal ($Y = 0$).
• Assume the distribution of $R = (Y, S)$ is $f(y, s)$.
• By Bayes' theorem
$$f(y, s) = \Pr(Y = y \mid S = s)\,\Pr(S = s) = \Pr(Y = y \mid S = s)\, f(s),$$
where $f(s)$ is the marginal distribution of S.
Statistical Model (cont. 1)
• Let $q_s$ be the probability that a read at position s is a
tumor read, i.e. $q_s = \Pr(Y = 1 \mid S = s)$.
• Given N mapped short reads $R_1 = (y_1, s_1), \ldots, R_N = (y_N, s_N)$, the
joint likelihood is
$$L_N = \prod_{i=1}^{N} q_{s_i}^{y_i} (1 - q_{s_i})^{1 - y_i} f(s_i)$$
• To identify CNV regions, it is enough to identify the
breakpoints.
[Figure labels: larger $q_s$; smaller $q_s$]
Statistical Model (cont. 2)
• Assume that $q_s$ is constant between any two neighboring
breakpoints.
• Let the breakpoints on a chromosome c be $0 = \tau_0 < \tau_1 < \cdots < \tau_m < \tau_{m+1} = L_c$,
where $L_c$ is the length of chromosome c.
• Let $p_j$ be the common value of $q_s$ between the
breakpoints $\tau_j$ and $\tau_{j+1}$. The likelihood can be written as
$$L_N = \prod_{j=0}^{m} \prod_{\tau_j < s_i \le \tau_{j+1}} p_j^{y_i} (1 - p_j)^{1 - y_i} f(s_i).$$
• One set of breakpoints corresponds to one model. Then, we
could use a model selection criterion such as the Bayesian
information criterion (BIC) to select the breakpoints.
Bayesian information criterion (BIC)
 The general definition of the BIC of a model is
$$\mathrm{BIC} = -2 \log L + k \log n$$
– L: the likelihood function evaluated at the MLE
– k: the number of parameters in the model
– n: the total number of observations
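BIC (cont.)
 Given the breakpoints $0 = \tau_0 < \tau_1 < \cdots < \tau_m < \tau_{m+1} = L_c$, the BIC is
$$\mathrm{BIC} = -2 \sum_{j=0}^{m} \left[ k_j \log \hat{p}_j + (n_j - k_j) \log(1 - \hat{p}_j) \right] - 2 \sum_{i=1}^{N} \log f(s_i) + \lambda (m + 1) \log N$$
– $k_j$: the number of tumor reads between $\tau_j$ and $\tau_{j+1}$
– $n_j$: the total number of reads between $\tau_j$ and $\tau_{j+1}$
– $\hat{p}_j = k_j / n_j$: the MLE of the parameter $p_j$
– $\lambda > 0$: tuning parameter
 Note that the term $-2 \sum_{i=1}^{N} \log f(s_i)$
is common to all models.
Therefore, we can drop it when comparing different models.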
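To make the criterion concrete, the sketch below evaluates the reduced BIC (without the term involving f) for a given set of breakpoints. It assumes the form: minus twice the binomial log-likelihood at the MLE in each segment, plus lambda times the number of segments times log N. The function name, 1-based read positions and half-open segments (lower, upper] are assumptions for this example.

```python
import numpy as np

def bicseq_bic(positions, is_tumor, breakpoints, chrom_length, lam=1.0):
    """Reduced BIC of a breakpoint set for reads labeled tumor (1) or normal (0)."""
    s = np.asarray(positions)
    y = np.asarray(is_tumor)
    edges = np.concatenate(([0], np.sort(breakpoints), [chrom_length]))
    loglik = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_seg = (s > lo) & (s <= hi)
        n_j = int(in_seg.sum())               # all reads in the segment
        k_j = int(y[in_seg].sum())            # tumor reads in the segment
        for c in (k_j, n_j - k_j):
            if c > 0:                         # convention: 0 * log(0) = 0
                loglik += c * np.log(c / n_j)
    n_segments = len(edges) - 1               # one parameter p_j per segment
    return -2.0 * loglik + lam * n_segments * np.log(len(s))
```

Comparing the value with and without a candidate breakpoint shows whether the gain in likelihood outweighs the penalty; a larger lambda favors fewer breakpoints.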
Asymptotic result
 Assume $f(s) > 0$ for all s. Then the breakpoint set that
minimizes the BIC is a consistent estimator of the true
breakpoint set, i.e. it converges to the true breakpoint set
in probability as N, the number of observations, goes to
infinity.
BIC-seq: an algorithm for detecting somatic
CNVs in tumor genomes
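The slide showing the algorithm is not preserved in this transcript. One way to search for a low-BIC breakpoint set is bottom-up merging: start from small bins and repeatedly merge the neighboring pair whose merge lowers the BIC the most, stopping when no merge lowers it. The sketch below illustrates that general idea under stated assumptions; it is not the published implementation, uses a plain scan over neighbors with no speedups, and all names are hypothetical.

```python
import math

def _loglik(k, n):
    """Binomial log-likelihood at the MLE k/n, with 0 * log(0) taken as 0."""
    out = 0.0
    for c in (k, n - k):
        if c > 0:
            out += c * math.log(c / n)
    return out

def greedy_bic_merge(tumor_counts, normal_counts, lam=1.0):
    """Merge adjacent bins while the BIC decreases; return (start, end) bin ranges."""
    # Each segment: [first bin, one past last bin, tumor reads, total reads].
    segs = [[i, i + 1, t, t + n]
            for i, (t, n) in enumerate(zip(tumor_counts, normal_counts))]
    penalty = lam * math.log(sum(seg[3] for seg in segs))
    while len(segs) > 1:
        best_delta, best_idx = 0.0, None
        for i in range(len(segs) - 1):
            a, b = segs[i], segs[i + 1]
            merged = _loglik(a[2] + b[2], a[3] + b[3])
            # Change in BIC if a and b are merged (one parameter fewer).
            delta = -2.0 * (merged - _loglik(a[2], a[3]) - _loglik(b[2], b[3])) - penalty
            if delta < best_delta:
                best_delta, best_idx = delta, i
        if best_idx is None:                  # no merge lowers the BIC
            break
        a, b = segs[best_idx], segs[best_idx + 1]
        segs[best_idx:best_idx + 2] = [[a[0], b[1], a[2] + b[2], a[3] + b[3]]]
    return [(seg[0], seg[1]) for seg in segs]
```

For instance, six bins with tumor counts 50, 50, 50, 150, 150, 150 and 50 normal reads each end up as two segments split after the third bin.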
Some remarks
 Used a red-black tree to accelerate the algorithm
 Outlier removal:
   Look at a local genomic window to determine whether the read
   count at a nucleotide position is an outlier
 Assign a credible interval to a breakpoint
   Gibbs sampling
An outlier example
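The transcript does not say how an outlier is decided within the local window. As a purely illustrative sketch, the code below caps a per-position read count when it exceeds a multiple of a high quantile of its neighbors' counts. The window size, quantile, multiplier and function name are all assumptions, not values from the lecture.

```python
import numpy as np

def cap_local_outliers(counts, half_window=100, quantile=0.95, factor=5.0):
    """Cap counts that far exceed a high quantile of their local neighborhood."""
    counts = np.asarray(counts, dtype=float)
    capped = counts.copy()
    for i in range(len(counts)):
        lo = max(0, i - half_window)
        hi = min(len(counts), i + half_window + 1)
        neighbors = np.delete(counts[lo:hi], i - lo)   # exclude the position itself
        if neighbors.size == 0:
            continue
        # Floor the reference level at 1 so sparse regions do not flag everything.
        limit = factor * max(np.quantile(neighbors, quantile), 1.0)
        if counts[i] > limit:
            capped[i] = limit
    return capped
```

A position with 500 reads surrounded by positions with 1 read each would be capped at 5 under these defaults, while its neighbors are left unchanged.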
Application of BIC-seq on a GBM tumor genome
 Applied BIC-seq to a GBM (glioblastoma multiforme) tumor genome
   Tumor: 10x coverage
   Normal: 7x coverage
 Detected 291 putative CNVs ranging from 40 bp to 5.7 Mb
 Compared the copy ratio estimates given by BIC-seq with those from an array-based platform
Application of BIC-seq on a GBM tumor genome (Cont.)
 Selected 16 small CNVs ranging from 110 bp to 14 kb for qPCR
validation
 14 out of 16 (87.5%) were validated
Application of BIC-seq on a GBM tumor genome (Cont.)