CLUSTERIFICATION OF THE CLONAL VARIATIONS PRESENT IN A

IDENTIFICATION OF CLONAL VARIATIONS PRESENT IN A TUMOR THROUGH CLUSTERING
Introduction
Cancer is a class of diseases in which a
group of cells displays uncontrolled growth.
We hypothesize that the driver mutations
arise early in the original cancer cells,
giving them a selective advantage to form
distinct clones.
Aim: We try to partition the mutations into
distinct clusters according to the proportion
in which they occur in the tumor, and compare
that with the variation in normal cells
(blood). These clusters will provide insight
into the clones present and hence the driver
mutations.
Description of the Problem
Mutations in blue are expressed in a
proportion α of the tumor cells, so we
cluster them into one clone. We wish to
find the number of clones and also their
proportions. The situation can get
complicated if a particular locus is
affected by more than one clone.
Here is a hypothetical situation with 3 clones and unknown
proportions p1, p2 and p3. We want to estimate these pi's. We do not
even know how many clones there are, so we want to find the
number of clones as well as their proportions.
Some Basic Terminology
• Mutation : Alteration in genome sequence
• Clone : A cluster of mutations occurring in
the same proportion
• Reference base : Ideally what should be
present
• Variation base: What is present instead
• Depth or coverage refers to the number of
times a particular locus is examined.
Description of the data available
• Different locus positions, with the reference base and variation base, are given.
• The coverage and the number of times the variation is expressed are given.
• The actual number of clones and their proportions are unknown.
• Suppose ni → the coverage of the forward and reverse strand.
• Xi → No. of times the variation showed up among the ni coverage.
• So, Xi ~ Bin(ni, pi), where the pi are not known a priori.
• Xi/ni serves as a naïve estimate of pi, the proportion in which the variation is
present in the tumor.
• As the data size is huge, we first cluster the data suitably and then try to
figure out the clones from the initial clusters.
Clustering with sample
estimates
• X/n is a consistent estimator of the unknown
p.
• To obtain the initial clusters we compute the
sample estimates X/n, apply the following two
clustering algorithms and compare their
performance.
1. Use the idea of a dendrogram to merge the two
closest estimates in each step. To determine the
number of clusters use AIC and BIC.
2. Cluster by k-means and determine the number of
clusters by the Gap Statistic.
An illustration of the dendrogram
How to update the estimate in each step
At the very first step we start with n clusters, where n is the total number of
sample points, and we reduce the number of clusters by one in each step.
• We order all the estimated values, say e1 < e2 < … < en.
• Next we compute dist(ei, ei+1) for every i and take the minimizing index, say k.
• We join ek and ek+1 and obtain ek' = (nk ek + nk+1 ek+1)/(nk + nk+1). The reason
behind this choice is that we assume ek and ek+1 are actually sample
fluctuations of the same proportion p, and the MLE of this p is then
ek' as described above.
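The linking step above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function name `merge_closest` and the representation of a cluster as a pair (n, x) of pooled coverage and pooled variant count are assumptions made here. Pooling the counts gives exactly the weighted estimate (nk ek + nk+1 ek+1)/(nk + nk+1).

```python
def merge_closest(clusters):
    """One linking step on a list of (n, x) clusters (hypothetical helper).

    Each cluster carries its pooled coverage n and pooled variant count x,
    so its current estimate is e = x / n.
    """
    # Order the clusters by their current estimate e = x / n.
    ordered = sorted(clusters, key=lambda c: c[1] / c[0])
    # Gap between each pair of neighbouring estimates.
    gaps = [ordered[i + 1][1] / ordered[i + 1][0] - ordered[i][1] / ordered[i][0]
            for i in range(len(ordered) - 1)]
    # Index k of the closest adjacent pair (e_k, e_{k+1}).
    k = min(range(len(gaps)), key=gaps.__getitem__)
    (n1, x1), (n2, x2) = ordered[k], ordered[k + 1]
    # Pooled MLE: (x1 + x2) / (n1 + n2) = (n1*e1 + n2*e2) / (n1 + n2).
    merged = (n1 + n2, x1 + x2)
    return ordered[:k] + [merged] + ordered[k + 2:]
```

Calling the function repeatedly, starting from one cluster per data point, produces the sequence of clusterings from which the number of clusters is then chosen.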
Determining no. of clusters
• The number of unknown parameters decreases with each merge, so Lk > Lk-1 > … > L1,
where Lk is the expected likelihood with k clusters.
• We use the idea of penalized likelihood and obtain the actual
number of clusters with AIC and BIC.
Method   Quantity to be minimized
AIC      2k - 2 ln Lk
BIC      k ln(n) - 2 ln Lk
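As a rough sketch of how these two criteria could be evaluated for a given clustering, the snippet below computes the binomial log-likelihood at the pooled cluster estimates and then the AIC and BIC values. The function names and the (n, x) cluster representation are assumptions for illustration. The binomial coefficients are dropped because they do not depend on the clustering.

```python
import math

def log_likelihood(clusters):
    """Binomial log-likelihood of (n, x) clusters at their pooled estimates.

    Binomial coefficients are omitted: they are identical for every
    candidate clustering and cancel when criteria are compared.
    """
    total = 0.0
    for n, x in clusters:
        p = x / n
        if 0 < p < 1:
            total += x * math.log(p) + (n - x) * math.log(1 - p)
        # p == 0 or p == 1 contributes 0 (convention 0 * log 0 = 0).
    return total

def aic(clusters):
    # 2k - 2 ln Lk, with k = number of clusters.
    return 2 * len(clusters) - 2 * log_likelihood(clusters)

def bic(clusters, n_penalty):
    # k ln(n) - 2 ln Lk; n_penalty is the n used in the penalty term.
    return len(clusters) * math.log(n_penalty) - 2 * log_likelihood(clusters)
```

The clustering with the smallest AIC (or BIC) value among the candidate values of k would be the one selected.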
To compare the two we worked on a simulated dataset of 1000 data points,
generated from 4 (and, separately, 5) different values of p.
Each data point was simulated from Bin(n, p), where n lies in (500, 1000) and
p is chosen at random from the 4 (or 5) values.
Clustering with the linking algorithm, we saw that BIC is more robust than AIC.
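A dataset of this kind could be generated as follows. This is a minimal sketch under stated assumptions: the slides only say that n lies in (500, 1000), so drawing n uniformly over the integers in that range, the function name `simulate` and the fixed seed are all choices made here for illustration.

```python
import random

def simulate(num_points, p_values, n_low=500, n_high=1000, seed=0):
    """Simulate (n, x) pairs with x ~ Bin(n, p), p drawn from p_values."""
    rng = random.Random(seed)
    data = []
    for _ in range(num_points):
        n = rng.randint(n_low, n_high)   # assumed: uniform integer coverage
        p = rng.choice(p_values)         # one of the chosen true proportions
        # A Binomial(n, p) draw as a sum of n Bernoulli(p) trials.
        x = sum(rng.random() < p for _ in range(n))
        data.append((n, x))
    return data

# Example: 1000 points from the four proportions used in the slides.
data = simulate(1000, [0.05, 0.35, 0.65, 0.95])
estimates = [x / n for n, x in data]
```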
• Among the 673 'successful' clusterings by the BIC method (number of clusters
obtained = number of initial values of p), we looked at the average deviation of
the clustered p values from the original p values and also plotted a histogram.
Initial values          .05    .35    .65    .95
Cluster centers (avg)   .055   .343   .638   .961
The n term in BIC penalty
In the BIC method the penalty is
k log n, where n is the number of sample
points. The number of clusters was
determined using both n = number of
data points and n = ∑ni, where ni is
the coverage of each individual
data point. As the penalty is larger
in the latter case, it showed better
results.
Figure: histogram of cluster centers under BIC with 4 initial clusters, in the successful clusterings
Value of n   > actual number   = actual number   < actual number
n = 1000     43                47                10
n = ∑ni      27                63                20
K-means and Gap Statistic
• K-means is used to cluster, and then the Gap Statistic (due to Tibshirani,
Walther and Hastie) is used.
http://gremlin1.gdcb.iastate.edu/MIP/gene/MicroarrayData/gapstatistics.pdf
• A dispersion measure is taken. Then, for a total of k clusters, we define Wk and find
the appropriate number of clusters as described in the paper.
• The relative performance of the linking method with BIC is somewhat better.
• This may be because k-means does not incorporate the ni's when clustering.
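To make the k-means plus Gap Statistic route concrete, here is a small one-dimensional sketch in plain Python. It is an assumption-laden illustration rather than the procedure actually run for the slides: the quantile initialisation, the uniform reference distribution over the data range, the number of reference sets and all function names are choices made here. Wk is taken as the within-cluster sum of squares, and the selection rule is the usual one from the paper (smallest k with Gap(k) ≥ Gap(k+1) - s(k+1)).

```python
import math
import random

def kmeans_wk(values, k, iters=50):
    """1-D Lloyd's algorithm; returns W_k, the within-cluster sum of squares."""
    xs = sorted(values)
    # Deterministic start: centers at evenly spaced quantiles (an assumption).
    centers = [xs[int((j + 0.5) * len(xs) / k)] for j in range(k)]
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in xs:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        # Empty clusters keep their previous center.
        updated = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return sum((v - centers[j]) ** 2 for j, g in enumerate(groups) for v in g)

def gap_curve(values, k_max, n_ref=10, seed=0):
    """Gap(k) and its error term s(k) for k = 1..k_max (assumes W_k > 0)."""
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_w = math.log(kmeans_wk(values, k))
        # Reference sets: uniform draws over the observed range.
        ref = [math.log(kmeans_wk([rng.uniform(lo, hi) for _ in values], k))
               for _ in range(n_ref)]
        mean_ref = sum(ref) / n_ref
        sd = math.sqrt(sum((r - mean_ref) ** 2 for r in ref) / n_ref)
        gaps.append(mean_ref - log_w)
        errs.append(sd * math.sqrt(1 + 1 / n_ref))
    return gaps, errs

def choose_k(gaps, errs):
    """Smallest k with Gap(k) >= Gap(k+1) - s(k+1); falls back to k_max."""
    for k in range(1, len(gaps)):
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k
    return len(gaps)
```

Note that this sketch clusters the estimates x/n only, so, like the k-means step described above, it ignores the coverages ni.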
Frequency table of no. of clusters for 4 initial values of p
(columns: ≤2, 3, 4, 5, ≥6)
Frequency table of no. of clusters for 5 initial values of p
(columns: ≤3, 4, 5, 6, ≥7)
AIC 106 270 327 187 120
AIC 139 178 353 202 128
BIC
52
BIC
13
GAP 301 378 210
38
173 617 120
GAP 318 403 209
57
49
187 673 110
68
81
57
Only initial clustering is not enough
After the initial clustering we need
to figure out the actual clones.
We look back at the previous
hypothetical situation.
We only know the total
proportion of variation present
at each locus.
We know neither the actual number of clones nor the clonal proportions, only the initial
cluster values q1, q2, …, qk. We try to find the minimum m for which there exist
(p1, p2, …, pm) that generate (q1, …, qk).
Mathematically, qj = ∑ ai pi, where each ai is 0 or 1.
If no pi satisfy this exactly, we wish to find the most probable pi such
that a close approximation to the qj's can be generated.
How to solve that???
• Start with the initial qi values and the corresponding ni, xi values (ni → sum of all n
in the cluster centered at qi; xi is defined similarly).
• Find the i, j, k for which |qi+qj-qk| is minimum. qk can then be thought of as
generated by qi and qj.
• Apply the EM algorithm to obtain qi*, qj* maximizing the likelihood under
H0: qi+qj = qk.
• This reduces the number of effective clusters by 1. Calculate the expected
likelihood under each model.
• Keep track of the i, j, k for which i and j generate k. Some extra restrictions
are imposed in every step, as we want the coefficients ai to be only
0 or 1.
• Suppose q3 ≈ q1+q2 and q5 ≈ q3+q4. Then we conclude q5 ≈ q1+q2+q4, and we
replace q5 by q1+q2+q4 and q1, q2, q4 by their corresponding EM estimates.
• Select the best model using the maximum likelihood method (penalized
likelihood if necessary).
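Two of the steps above lend themselves to a short sketch: ranking the triples (i, j, k) by |qi+qj-qk|, and estimating qi, qj under H0: qi+qj = qk. The code below is illustrative only. The function names are hypothetical, and the constrained fit uses a direct numerical maximization (coordinate-wise ternary search on the concave binomial log-likelihood) as a stand-in; it is not the EM algorithm used in the slides.

```python
import math
from itertools import combinations

def ranked_triples(q):
    """All (gap, i, j, k), 1-based, sorted by gap = |q_i + q_j - q_k|."""
    triples = []
    for i, j in combinations(range(len(q)), 2):
        for k in range(len(q)):
            if k not in (i, j):
                triples.append((abs(q[i] + q[j] - q[k]), i + 1, j + 1, k + 1))
    return sorted(triples)

def _loglik(n, x, p):
    p = min(max(p, 1e-12), 1 - 1e-12)  # keep the logarithms finite
    return x * math.log(p) + (n - x) * math.log(1 - p)

def _argmax_1d(f, lo, hi, steps=60):
    # Ternary search; valid because f is concave on [lo, hi].
    for _ in range(steps):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

def fit_sum_constraint(ci, cj, ck, sweeps=50):
    """Maximize the likelihood of three (n, x) clusters under q_k = q_i + q_j."""
    (ni, xi), (nj, xj), (nk, xk) = ci, cj, ck

    def total(a, b):
        return _loglik(ni, xi, a) + _loglik(nj, xj, b) + _loglik(nk, xk, a + b)

    a, b = xi / ni, xj / nj  # start from the unconstrained estimates
    for _ in range(sweeps):
        a = _argmax_1d(lambda t: total(t, b), 0.0, 1.0 - b)
        b = _argmax_1d(lambda t: total(a, t), 0.0, 1.0 - a)
    return a, b

# Initial clusters reported for Model 1 in the simulation check.
q_model1 = [0.1001, 0.1944, 0.2927, 0.3995, 0.5998, 0.7011]
top = [t[1:] for t in ranked_triples(q_model1)[:4]]
```

For the Model 1 cluster values the four smallest gaps belong to the triples (1,5,6), (1,2,3), (2,4,5) and (1,3,4), in that order.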
Simulation model for checking
We need to check whether our method works on simulated data.
Different simulations were done. Two are shown below.
Model 1
• No. of clones: 3
• Initial clone proportions: 0.10, 0.20, 0.40
• Proportions to generate data: 0.1, 0.2, 0.3 (0.1+0.2), 0.4, 0.6 (0.2+0.4), 0.7 (0.1+0.2+0.4)
• Initial clusters obtained: q1 = .1001, q2 = .1944, q3 = .2927, q4 = .3995, q5 = .5998, q6 = .7011
• (i,j,k) in order of |qi+qj-qk|: (1,5,6), (1,2,3), (2,4,5), (1,3,4) [NV]
• Conclusions: q6 = q1+q5, q3 = q1+q2, q5 = q2+q4. So the initial clone proportions were q1, q2 and q4.
Model 2
• No. of clones: 4
• Initial clone proportions: 0.05, 0.10, 0.25, 0.45
• Proportions to generate data: .05, .10, .25, .30 (.25+.05), .45, .55 (.45+.1), .75 (.05+.25+.45)
• Initial clusters obtained: q1 = .0504, q2 = .1002, q3 = .2484, q4 = .3441, q5 = .4468, q6 = .5547, q7 = .7462
• (i,j,k) in order of |qi+qj-qk|: (2,5,6), (3,4,6) [NV], (4,5,7), (1,3,4)
• Conclusions: q6 = q2+q5, q7 = q4+q5, q4 = q1+q3. Hence q7 = q1+q3+q5 and the initial clone proportions are q1, q2, q3 and q5.
NV denotes not valid. For model 1 we cannot assume q4 = q1+q3, as q3 is already q1+q2.
Collection of real data
• After the success on the simulated datasets, it is time to work on real
data. National biomedical institute of genomics provided us with real
data, generated on the 454 platform (Roche sequencing). The data
were collected according to 3 different categorizations:
Normal/Tumor: We collected blood data (Normal) as well as tumor data from the
same patient.
Somatic status: Data were collected on both Germline and Somatic cells.
Mutation type: 2 different types of mutation.
  New-position: a new base replacing the ref. base.
  Insertion-Deletion: insertion or deletion of a base occurred.
• Moreover, in the tumor data, extra information was collected on how
the variation shown is distributed over the forward and reverse strands.
• These categorizations were needed as we wish to run our algorithm
on every combination of them and try to figure out the biological
significance, if any.
Analyzing the real data
First, we reduce the huge data set to 200 clusters by k-means. Empty clusters, if formed,
were removed. The number of remaining clusters is our 'effective' data size. In every cluster the n values
and x values are added up to give (∑ni, ∑xi) as the 'effective' (n, x) for the reduced data.
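The reduction to 'effective' data points can be sketched as below, assuming the k-means step has already produced a cluster label for each data point. The function name and the label format (integers 0..k-1) are assumptions for illustration.

```python
def reduce_to_effective(data, labels, k):
    """Sum n and x within each k-means cluster; drop empty clusters.

    data   : list of (n, x) pairs, one per locus
    labels : cluster index in 0..k-1 for each locus (assumed given)
    """
    sums = [[0, 0] for _ in range(k)]
    for (n, x), label in zip(data, labels):
        sums[label][0] += n
        sums[label][1] += x
    # Clusters that received no points have n == 0 and are removed.
    return [(n, x) for n, x in sums if n > 0]
```

The length of the returned list is the 'effective' data size, and each returned pair plays the role of a single (n, x) observation in the later steps.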
Tumor/Normal  Mutation type       Somatic status  Data size  Effective data size  # initial clusters  Range min  Range max
Normal        New position        Germline        64480      113                  75                  .2019      .9999
Normal        New position        Somatic         4364       38                   10                  .0150      .4151
Normal        Insertion-Deletion  Germline        62595      118                  27                  .1738      .8099
Normal        Insertion-Deletion  Somatic         33122      13                   10                  .0000      .1756
Tumor         New position        Germline        64480      124                  94                  .2012      .9919
Tumor         New position        Somatic         4364       84                   18                  .2120      .9956
Tumor         Insertion-Deletion  Germline        62595      111                  34                  .1713      .9593
Tumor         Insertion-Deletion  Somatic         33122      90                   23                  .1686      .7710
The initial clustering
Figure: circles are cluster centers, dots are initial estimates.
Comparisons
Tumor vs normal data
• The somatic cell variation profile differs significantly between tumor and normal data, while germline
cells show comparable results. So, we can say somatic cells are
those which introduce new variation in a tumor.
Germline vs somatic cells
• The number of clusters, the number of clones and the proportions of variation are significantly lower
in somatic cells than in germline cells (only the tumor insertion data are
more or less comparable).
Insertion-deletion vs new-position data
• Insertion-deletion data showed significantly less variation than
new-position data. In somatic cells, insertion-deletion variation is almost
absent (there were 30898 zero variations among 31222 loci).
Identifying the clones
After obtaining the initial clusters, we try to figure out the clones and their proportions.
Here we show how the clones were obtained in the tumor somatic new-position data.
We classify the initial clusters according to the number of clones that generate them:
• Category 1: clusters that are an individual clone
• Category 2: clusters generated by 2 clones
• Category 3: clusters generated by 3 clones
• Category 4: clusters generated by more than 3 clones
Tumor/Normal  Mutation type       Somatic status  # initial clusters  Cat. 1  Cat. 2  Cat. 3  Cat. 4  Total # clones  Clone range min  Clone range max
Normal        New position        Germline        75                  33      23      14      5       33              .2019            .5432
Normal        New position        Somatic         10                  5       4       1       0       5               .0150            .0849
Normal        Insertion-Deletion  Germline        27                  13      9       4       1       13              .1738            .3368
Normal        Insertion-Deletion  Somatic         10                  4       4       2       0       4               .0000            .0742
Tumor         New position        Germline        94                  41      31      13      7       41              .2012            .5671
Tumor         New position        Somatic         18                  8       7       3       0       8               .2120            .5035
Tumor         Insertion-Deletion  Germline        34                  15      10      7       2       15              .1713            .4122
Tumor         Insertion-Deletion  Somatic         23                  12      8       3       0       12              .1686            .3713
• In almost every case the number of clones is 35 to 50% of the total number of initial clusters,
and the clone proportions range between the lowest and the median cluster value.
• From the table above it is clear that clusters generated by more than 3 clones are
quite rare. This possibly happens because we assume that each clone
is individually expressed at least once. If this is not true, then some internal
clones are mixed into the structure, which is very hard to capture.
Equality of p
• For the tumor data, we have extra information specifying the
number of variations in the forward and backward strand. So, first
we test whether the two proportions are 'statistically'
the same or not.
• Intersection H0: p1i = p2i for i = 1, 2, …, n (data size)
• A conservative Bonferroni test would lead to a very high type-2
error probability, so the LRT was used. As n > 10000,
asymptotically -2 ln Λ ~ χ² with n d.f.
• The real data showed that we have to reject the hypothesis at level
0.05% for both the new-position data and the insertion-deletion
data.
• So, we wish to see whether the clonal proportions or the pattern
of clusters vary significantly between the forward and reverse
strand in the tumor.
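The likelihood ratio statistic for the intersection hypothesis can be sketched as follows. The input format (one tuple of forward coverage, forward count, backward coverage, backward count per locus) and the function names are assumptions made for illustration. Because the standard library has no chi-square quantile function, the critical value here uses the Wilson-Hilferty approximation, with an upper-tail level of 5% chosen purely as an example.

```python
import math

def _xlogp(count, p):
    # Convention 0 * log(0) = 0 keeps boundary estimates finite.
    return 0.0 if count == 0 else count * math.log(p)

def _loglik(n, x, p):
    return _xlogp(x, p) + _xlogp(n - x, 1 - p)

def lrt_statistic(loci):
    """-2 ln(Lambda) for H0: p1i = p2i at every locus.

    `loci` holds (n1, x1, n2, x2): coverage and variant count on the
    forward and backward strand of each locus (n1, n2 > 0 assumed).
    """
    stat = 0.0
    for n1, x1, n2, x2 in loci:
        pooled = (x1 + x2) / (n1 + n2)  # MLE of the common p under H0
        free = _loglik(n1, x1, x1 / n1) + _loglik(n2, x2, x2 / n2)
        null = _loglik(n1, x1, pooled) + _loglik(n2, x2, pooled)
        stat += 2 * (free - null)
    return stat

def chi2_critical(df, z=1.6449):
    """Wilson-Hilferty approximation to the upper chi-square quantile.

    z is the standard normal quantile; 1.6449 gives an upper tail of 5%.
    """
    c = 2 / (9 * df)
    return df * (1 - c + z * math.sqrt(c)) ** 3
```

H0 would be rejected when `lrt_statistic(loci)` exceeds `chi2_critical(len(loci))`, with z replaced by the normal quantile of whatever level is actually wanted.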
Initial clustering in two strands
Figure: circles are cluster centers, dots are initial estimates.
Table for the two strands
Mutation type       Somatic status  Strand    Effective size  # initial clusters  Cluster range min  Cluster range max  No. of clones  Clone range min  Clone range max
New position        Germline        Forward   92              82                  .0032              .9999              37             .0032            .4961
New position        Germline        Backward  94              20                  .4457              .8721              11             .4457            .6518
New position        Somatic         Forward   63              19                  .0015              .9982              9              .0015            .4863
New position        Somatic         Backward  58              18                  .0666              .9979              9              .0666            .5509
Insertion-Deletion  Germline        Forward   85              35                  .0002              .9850              16             .0002            .4583
Insertion-Deletion  Germline        Backward  87              37                  .0005              .9696              15             .0005            .4101
Insertion-Deletion  Somatic         Forward   63              34                  .0001              .9987              17             .0001            .5234
Insertion-Deletion  Somatic         Backward  67              25                  .0001              .9987              12             .0001            .4372
• We see that although at individual loci the proportions in the two strands are not
the same, the variation in the two strands follows a more or less similar pattern,
except for germline cells with new-position mutations.
• We also note that some clusters with small proportions are expressed in the
individual strands, but not when the two strands are seen together.
Summary
• Looking at the performance on various simulated datasets and on real data, we
summarize the best method.
• From the dataset, use xi and ni to obtain the estimates. If necessary, reduce
the data size effectively by k-means clustering.
• Obtain the initial clustering by linking the closest estimates in each step (dendrogram).
• Use the BIC penalized likelihood to determine the number of clusters.
• After the initial clustering, find out which estimates are generated by the sum of
two or more estimates. In each step replace the two generator
estimates by their EM estimates.
• For each step, calculate the expected likelihood with the EM estimates and use
BIC to determine the actual number of steps.
• If, additionally, the forward and reverse strands show 'unequal' proportions, run
the same algorithm on both of them and compare.
Conclusion and application
• We saw that this study of patterns and clones shows some
significant contrasts between tumor cells and normal cells. The
method is applicable to any kind of gene data in general, and might
shed light on some unknown areas in cancer genetics.
• Let's conclude this slideshow with a few of the possible applications
of this study.
Applications
• A better understanding of the mechanism of the disease as well as
of the biology of the system.
• It can help identify novel pathways and explain specific pathways that
provide a distinct selection advantage to the tumor cells.
• Identification of the pathways might lead to better therapeutics for
the disease. We can run our algorithm on the tumor data before
and after applying a drug to assess the effectiveness of
the drug.