Biostatistics-Lecture 20
Copy number variation detection
Ruibin Xi
Peking University
School of Mathematical Sciences
Copy number variation (CNV)
CNVs: gains or losses of genomic segments
CNVs account for a substantial proportion of human genomic
variation
In the Database of Genomic Variants (DGV), over 30% of the
human genome can be influenced by CNVs
CNV Distributions in 200+ Normal Human Genomes
(Nature 2006)
CNVs are associated with many diseases
CNV in cancer genome
Kim et al. 2013 Genome Res.
CNV detection strategies with HTS data
CNV detection using read-depth
Read-depth: read density in a genomic region
If there is no bias, the read-depth in a genomic region should
be roughly proportional to the copy number
But there are often biases in the NGS data.
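A minimal sketch of a read-depth ratio, assuming reads are given as lists of mapped start positions on one chromosome and that both lists are non-empty. The function names and the fixed-window binning are illustrative choices, not part of any of the tools listed below. Dividing by a matched normal sample is one common way to cancel position-specific biases to first order.

```python
from collections import Counter

def window_counts(positions, window_size):
    """Count reads per fixed-size window, keyed by window index."""
    return Counter(pos // window_size for pos in positions)

def depth_ratio(tumor_pos, normal_pos, window_size):
    """Per-window tumor/normal read-count ratio, scaled by library size."""
    t = window_counts(tumor_pos, window_size)
    n = window_counts(normal_pos, window_size)
    # Rescale so that equal copy number gives a ratio near 1
    # even when the two libraries have different total read counts.
    scale = len(normal_pos) / len(tumor_pos)
    # Windows with no normal reads are skipped to avoid division by zero.
    return {w: scale * t.get(w, 0) / n[w] for w in sorted(n)}
```

With no bias and no copy number change the ratio should hover around 1; a window whose ratio is twice that of its neighbors suggests a relative gain.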
Algorithms
CNV-seq (Xie and Tammi 2009)
SegSeq (Chiang et al. 2009)
rSW-seq (Kim et al. 2010)
BIC-seq (Xi et al. 2011) and NBIC-seq
FREEC (Boeva et al. 2011)
Circular Binary Segmentation (CBS)
CBS (Olshen et al. 2004)
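$X_1, X_2, \dots, X_n$: a sequence of random variables
An index $\gamma$ is called a change point if
$X_1, \dots, X_\gamma \sim F_0$
$X_{\gamma+1}, \dots \sim F_1$, with $F_1 \neq F_0$, until the next change point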
Binary segmentation
Test statistic for one change point
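A sketch of the single change-point statistic, assuming independent observations with a common known standard deviation `sigma` (an assumption of this example). For each split point i it compares the mean of the first i values with the mean of the rest, standardizes the difference, and returns the maximum absolute value. The function name is hypothetical.

```python
import math

def max_changepoint_stat(x, sigma=1.0):
    """Return (T, i): the maximum of |T_i| over split points and the best split i.

    T_i compares the mean of x[:i] with the mean of x[i:], standardized
    under the assumption of independent values with common sd `sigma`.
    """
    n = len(x)
    total = sum(x)
    best_t, best_i = 0.0, None
    left = 0.0
    for i in range(1, n):
        left += x[i - 1]  # running sum of x[:i]
        diff = left / i - (total - left) / (n - i)
        t = abs(diff) / (sigma * math.sqrt(1.0 / i + 1.0 / (n - i)))
        if t > best_t:
            best_t, best_i = t, i
    return best_t, best_i
```

Binary segmentation declares a change point at the best split when T exceeds a critical value, then recurses on the two halves.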
Circular Binary Segmentation (CBS)
Circular Binary Segmentation
Test statistic
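A sketch of the circular version, under the same known-`sigma` assumption. Instead of one split point, it scans all arcs (i, j] and compares the mean inside the arc with the mean of everything outside it, which lets a short aberration in the middle of a segment stand out. This brute-force O(n^2) scan is for illustration only; the function name is hypothetical.

```python
import math
from itertools import accumulate

def circular_stat(x, sigma=1.0):
    """Return (T, i, j): the maximum |T_ij| and the arc x[i:j] achieving it."""
    n = len(x)
    s = [0.0] + list(accumulate(x))  # s[k] = x[0] + ... + x[k-1]
    best = (0.0, None, None)
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i
            if k == n:
                continue  # the arc must leave at least one point outside
            inside = (s[j] - s[i]) / k
            outside = (s[n] - s[j] + s[i]) / (n - k)
            t = abs(inside - outside) / (sigma * math.sqrt(1.0 / k + 1.0 / (n - k)))
            if t > best[0]:
                best = (t, i, j)
    return best
```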
CNV-seq
Use sliding window
Model the read count in each window as a Poisson distribution with mean $\lambda = NW/G$
N: total number of reads
W: window size
G: Genome size
When λ is large, this can be approximated by a Gaussian distribution
Given a case and a control, the copy ratio is estimated from z,
the read count ratio between the two samples in a window
CNV-seq
The Gaussian ratio distribution is cumbersome;
instead use a transformed statistic that is
approximately standard Gaussian
The P-value is computed from this statistic
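A hedged sketch of the window-level test. It assumes the transformation is the Geary-Hinkley form for a ratio of two independent Gaussians, with Poisson means and variances both equal to λ = NW/G under the null of no copy number change. The two-sided P-value, the scaling of z by total read counts, and all parameter names are choices of this example.

```python
import math
from statistics import NormalDist

def cnvseq_like_pvalue(x_count, y_count, n_x, n_y, window, genome):
    """Test one window: case count x_count vs control count y_count.

    n_x, n_y: total reads in case and control; window, genome: sizes in bp.
    Returns (copy_ratio, t, p).
    """
    lam_x = n_x * window / genome  # expected case reads per window
    lam_y = n_y * window / genome  # expected control reads per window
    z = x_count / y_count          # observed read count ratio
    # Geary-Hinkley style transform with zero correlation and
    # Poisson variances equal to the means (assumption of this sketch).
    t = (lam_y * z - lam_x) / math.sqrt(lam_y * z ** 2 + lam_x)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    copy_ratio = z * n_y / n_x     # correct for unequal library sizes
    return copy_ratio, t, p
```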
Seg-seq
In a window of length L, the number of reads for the normal
genome follows a Poisson distribution with mean $L N_n / A$
A: reference genome size
$N_n$: total normal reads
The number of reads for tumor in the window follows a Poisson distribution with mean $r L N_t / A$
r: copy ratio
$N_t$: total tumor reads
Seg-seq
The read count ratio
log(R) is approximately normally distributed (R is approximately log-normal) when the read counts are large
To test if a position x is a change point use
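A sketch of one way to test a candidate change point, not necessarily the exact SegSeq statistic. It assumes independent Poisson counts in the left and right windows and uses the delta-method variance Var(log(T/N)) ≈ 1/T + 1/N, so the difference of log ratios across the position is approximately Gaussian. The function name and arguments are hypothetical.

```python
import math
from statistics import NormalDist

def log_ratio_change_test(t_left, n_left, t_right, n_right):
    """Compare log(tumor/normal) in the windows left and right of a position.

    All four arguments are positive read counts. Returns (d, z, p).
    """
    d = math.log(t_right / n_right) - math.log(t_left / n_left)
    # Delta-method variance of a difference of two independent log ratios.
    var = 1.0 / t_left + 1.0 / n_left + 1.0 / t_right + 1.0 / n_right
    z = d / math.sqrt(var)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return d, z, p
```

A small p suggests the copy ratio differs on the two sides of the position.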
Seg-seq
Algorithm:
1. For each position in the genome, choose a window such that its left
and right windows each contain a fixed number of reads w
2. Test whether the position is a possible change point (i.e. its p-value
is below a preset threshold); if yes, remove all positions in this
window and consider the next position
3. Merge the initial segments if two adjacent segments have a p-value
greater than a preset merging threshold
Poisson or negative binomial model?
BIC-seq: Statistical Model
• Given a short read R that is mapped to the reference
genome, it consists of two pieces of information
– The position S on the reference genome
– The read type Y: tumor ($Y = 1$) or normal ($Y = 0$).
• Assume the distribution of $R = (Y, S)$ is $f(y, s)$.
• By Bayes' theorem
$$f(y, s) = \Pr(Y = y \mid S = s) \Pr(S = s) = \Pr(Y = y \mid S = s) f(s),$$
where $f(s)$ is the marginal distribution of S.
Statistical Model (cont. 1)
• Let $q_s$ be the probability of a read at position $s$ being a
tumor read, i.e. $q_s = \Pr(Y = 1 \mid S = s)$.
• Given N mapped short reads $R_1 = (y_1, s_1), \dots, R_N = (y_N, s_N)$, the
joint likelihood is
$$L_N = \prod_{i=1}^{N} q_{s_i}^{y_i} (1 - q_{s_i})^{1 - y_i} f(s_i)$$
• To identify CNV regions, it is enough to identify the
breakpoints.
[Figure: genomic regions labeled "larger $q_s$" and "smaller $q_s$"]
Statistical Model (cont. 2)
• Assume that $q_s$ is constant between any two neighboring
breakpoints.
• Suppose the breakpoints are $0 = \tau_0 < \tau_1 < \dots < \tau_m < \tau_{m+1} = L_c$ on a
chromosome c, where $L_c$ is the length of chromosome c.
• Let $p_j$ be the common probability $q_s$ between the
breakpoints $\tau_j$ and $\tau_{j+1}$. The likelihood can be written as
$$L_N = \prod_{j=0}^{m} \prod_{\tau_j < s_i \le \tau_{j+1}} p_j^{y_i} (1 - p_j)^{1 - y_i} f(s_i)$$
• One set of breakpoints corresponds to one model. Then, we
could use a model selection criterion such as the Bayesian
information criterion (BIC) to select the breakpoints.
Bayesian information criterion (BIC)
The general definition of the BIC of a model is
$$\mathrm{BIC} = -2 \log L + k \log n$$
– L: the likelihood function evaluated at the MLE
– k: the number of parameters in the model
– n: the total number of observations
BIC (cont.)
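Given the breakpoints $0 = \tau_0 < \tau_1 < \dots < \tau_m < \tau_{m+1} = L_c$, the BIC is
$$\mathrm{BIC} = -2 \sum_{j=0}^{m} \left[ k_j \log \hat{p}_j + (n_j - k_j) \log(1 - \hat{p}_j) \right] - 2 \sum_{i=1}^{N} \log f(s_i) + \lambda (m + 1) \log N$$
– $k_j$: the number of tumor reads between $\tau_j$ and $\tau_{j+1}$
– $n_j$: the total number of reads between $\tau_j$ and $\tau_{j+1}$
– $\hat{p}_j = k_j / n_j$: the MLE of the parameter $p_j$
– $\lambda > 0$: tuning parameter
Note that the term $-2 \sum_{i=1}^{N} \log f(s_i)$ is common for all different models.
Therefore, we can drop it when comparing different models.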
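A minimal sketch of evaluating this criterion for one candidate breakpoint set, with the common f(s_i) term dropped: the value computed is -2 times the maximized Bernoulli log-likelihood summed over segments, plus λ(m+1) log N. The input layout (a list of (position, y) pairs with y = 1 for tumor and 0 for normal) and the function name are assumptions of this example.

```python
import math
from bisect import bisect_left

def segmentation_bic(reads, breakpoints, lam=1.0):
    """BIC of a segmentation, up to the term shared by all models.

    reads: list of (position, y), y = 1 for tumor and 0 for normal.
    breakpoints: sorted interior breakpoints; a read at position s falls
    in segment j when breakpoints[j-1] < s <= breakpoints[j].
    """
    m = len(breakpoints)
    k = [0] * (m + 1)  # tumor reads per segment
    n = [0] * (m + 1)  # all reads per segment
    for s, y in reads:
        j = bisect_left(breakpoints, s)
        n[j] += 1
        k[j] += y
    loglik = 0.0
    for kj, nj in zip(k, n):
        # Terms with a zero count contribute 0 (the limit of c * log(c / n)).
        for c in (kj, nj - kj):
            if c > 0:
                loglik += c * math.log(c / nj)
    return -2.0 * loglik + lam * (m + 1) * math.log(len(reads))
```

Comparing `segmentation_bic(reads, [])` with `segmentation_bic(reads, [tau])` shows whether adding a breakpoint at `tau` is worth the extra parameter; a larger `lam` penalizes breakpoints more heavily.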
Asymptotic result
Assume $f(s) > 0$ for all $s$. Then, the breakpoint set that
minimizes the BIC is a consistent estimator of the true
breakpoint set, i.e. it will converge to the true breakpoint set
in probability as N, the number of observations, goes to
infinity.
BIC-seq: an algorithm for detecting somatic
CNVs in tumor genomes
Some remarks
A red-black tree is used to accelerate the algorithm
Outlier removal: look at a local genomic window to determine whether the read count at a nucleotide position is an outlier
Assign a credible interval to each breakpoint using Gibbs sampling
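A simplified sketch of BIC-driven greedy merging, assuming the genome has already been cut into small initial bins with (tumor reads, total reads) counts. At each step it merges the pair of neighboring bins whose merge lowers the BIC the most and stops when no merge lowers it. This version rescans all pairs on every step; a balanced tree keyed on the BIC change (such as the red-black tree mentioned above) would avoid the rescan. Names and input layout are hypothetical.

```python
import math

def _loglik(k, n):
    """Maximized Bernoulli log-likelihood of a bin with k tumor reads out of n."""
    return sum(c * math.log(c / n) for c in (k, n - k) if c > 0)

def greedy_merge(bins, lam=1.0):
    """bins: list of (k, n) per initial bin in genomic order. Returns merged bins."""
    bins = list(bins)
    total = sum(n for _, n in bins)
    penalty = lam * math.log(total)  # BIC cost of one extra segment parameter
    while len(bins) > 1:
        best_delta, best_i = 0.0, None
        for i in range(len(bins) - 1):
            (k1, n1), (k2, n2) = bins[i], bins[i + 1]
            # BIC change from merging: likelihood loss minus one saved parameter.
            loss = -2.0 * (_loglik(k1 + k2, n1 + n2) - _loglik(k1, n1) - _loglik(k2, n2))
            delta = loss - penalty
            if delta < best_delta:
                best_delta, best_i = delta, i
        if best_i is None:
            break  # no merge reduces the BIC
        (k1, n1), (k2, n2) = bins[best_i], bins[best_i + 1]
        bins[best_i:best_i + 2] = [(k1 + k2, n1 + n2)]
    return bins
```

The boundaries between the surviving bins are the estimated breakpoints.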
An outlier example
Application of BIC-seq on a GBM tumor genome
Applied BIC-seq to a GBM tumor genome
Tumor: 10x coverage
Normal: 7x coverage
Detected 291 putative CNVs ranging from 40 bp to 5.7 Mb
Compare the copy ratio estimates given by BIC-seq with those from an array-based platform
Application of BIC-seq on a GBM tumor genome (Cont.)
Selected 16 small CNVs ranging from 110 bp to 14 kb for qPCR
validation
14 out of 16 (87.5%) were validated
Application of BIC-seq on a GBM tumor genome (Cont.)