Transcript Slide 1

3rd Place Winning Project, 2009 USPROC
Authors: Kinjal Basu, Sujayam Saha
Sponsor Professors: S. Ghosh, A. K. Ghosh
Indian Statistical Institute, Kolkata, India
Statistics: An Integral Part of Genetic Research
Our quest is to estimate p, the frequency of allele A, from a mixture distribution with mixing proportions p², 2pq and q² (where q = 1 − p), due to genotypes AA, Aa and aa.
• The problem is one of localizing a bi-allelic gene controlling a quantitative trait.
• The (unknown) distribution of the trait data depends on genotype, i.e. we have a mixture of 3 distributions, each corresponding to a genotype.
• Cluster analysis gives us estimates which are used either on their own or as initial guesses for other methods.
• For the sake of algebraic simplicity, we begin by assuming that the data follow a Gaussian mixture model. We test two methods, based on EM and CEM respectively.
• We next investigate two categories of departure from normality:
a. Asymmetric distributions
b. Heavy-tailed distributions
Using the 3-means algorithm we find the three clusters. We then need to decide which cluster corresponds to which genotype. We assign the bigger of the two extreme clusters to AA and the smaller one to aa.
If n1, n2 and n3 are the cluster sizes corresponding to the AA, Aa and aa genotypes respectively, then the MLE of p is given by
p = (2n1 + n2) / (2(n1 + n2 + n3))
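The clustering step and the estimate above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the quantile-based initialisation and the use of plain Lloyd iterations on one-dimensional data are all assumptions made for the example.

```python
import numpy as np

def three_means_1d(x, n_iter=100):
    """Lloyd's algorithm with k = 3 on 1-D data; label 0 is the lowest cluster."""
    x = np.asarray(x, dtype=float)
    # Assumed initialisation: centres at the 1/6, 1/2 and 5/6 sample quantiles.
    centres = np.quantile(x, [1 / 6, 1 / 2, 5 / 6])
    for _ in range(n_iter):
        # Assign every point to its nearest centre.
        labels = np.argmin(np.abs(x[:, None] - centres), axis=1)
        # Recompute each centre as its cluster mean (keep it if the cluster is empty).
        new = np.array([x[labels == k].mean() if np.any(labels == k) else centres[k]
                        for k in range(3)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels

def allele_freq_from_sizes(n_AA, n_Aa, n_aa):
    """p = (2 n1 + n2) / (2 (n1 + n2 + n3)), the formula given in the text."""
    return (2 * n_AA + n_Aa) / (2 * (n_AA + n_Aa + n_aa))

def estimate_p_three_means(x):
    """Cluster, then map the bigger extreme cluster to AA and the smaller to aa."""
    low, middle, high = np.bincount(three_means_1d(x), minlength=3)
    n_AA, n_aa = max(low, high), min(low, high)
    return allele_freq_from_sizes(n_AA, middle, n_aa)
```

Because the bigger extreme cluster is always labelled AA, an estimate obtained this way cannot fall below 0.5.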
To analyze the data under an assumed Gaussian mixture distribution, we use the EM and CEM algorithms. The E-step uses the posterior expectations of the indicator variables given the data; the M-step uses the standard results for the Gaussian model, with the mean and variance interpreted as the weighted mean and variance, the indicator variables serving as weights.
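A sketch of how these two steps might look in code is given below. It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the component order is taken to be (AA, Aa, aa) with weights (p², 2pq, q²), and the update of p is the weighted-count analogue of the cluster-size formula. Setting `hard=True` replaces the posteriors by 0/1 indicators, which is the classification step that distinguishes CEM from EM.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def em_allele_freq(x, p0, means0, sds0, hard=False, n_iter=200, tol=1e-8):
    """EM (hard=False) or CEM (hard=True) for a 3-component Gaussian mixture
    whose weights are tied to p through (p^2, 2pq, q^2)."""
    x = np.asarray(x, dtype=float)
    p = float(p0)
    means = np.array(means0, dtype=float)
    sds = np.array(sds0, dtype=float)
    for _ in range(n_iter):
        q = 1.0 - p
        weights = np.array([p * p, 2.0 * p * q, q * q])
        # E-step: posterior expectation of each genotype indicator given the data.
        dens = weights * normal_pdf(x[:, None], means, sds)
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
        if hard:
            # CEM: assign each point to its most probable genotype (0/1 indicators).
            resp = np.eye(3)[resp.argmax(axis=1)]
        # M-step: weighted means and variances, with the indicators as weights.
        nk = resp.sum(axis=0)
        keep = nk > 1e-12  # skip components that received no weight
        means[keep] = (resp * x[:, None]).sum(axis=0)[keep] / nk[keep]
        var = (resp * (x[:, None] - means) ** 2).sum(axis=0)
        sds[keep] = np.sqrt(np.maximum(var[keep] / nk[keep], 1e-6))
        # Update p from the expected genotype counts: (2 n1 + n2) / (2 n).
        p_new = (2.0 * nk[0] + nk[1]) / (2.0 * nk.sum())
        p_new = min(max(p_new, 1e-6), 1.0 - 1e-6)
        converged = abs(p_new - p) < tol
        p = p_new
        if converged:
            break
    return p, means, sds
```

The starting values `p0`, `means0` and `sds0` would typically come from the initial cluster analysis.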
A mixture of N(3,1), N(0,1) and N(-3,1) with p = 0.45
• As the separation between the means increases, the MSE decreases.
• EM gives better results than 3-means.
• CEM is unsatisfactory.

Original p  Method    µ=1     µ=2     µ=3     µ=1     µ=2     µ=3
0.6         3-Means   0.0066  0.0041  0.0006  0.0009  0.0006  0.0008
            EM        0.0067  0.0048  0.0004  0.0162  0.0078  0.0014
            CEM       0.0044  0.0034  0.0003  0.0104  0.0210  0.0046
0.7         3-Means   0.0313  0.0203  0.0027  0.0144  0.0076  0.0028
            EM        0.0157  0.0106  0.0002  0.0132  0.0065  0.0033
            CEM       0.0203  0.0135  0.0003  0.0124  0.0201  0.0058
0.8         3-Means   0.0739  0.0489  0.0339  0.0460  0.0326  0.0219
            EM        0.0471  0.0349  0.0003  0.0333  0.0296  0.0037
            CEM       0.0508  0.0361  0.0135  0.0349  0.0339  0.0138
• As p approaches 1, the performance of all the methods deteriorates. This is probably because the cluster corresponding to q² vanishes at a quadratic rate as q tends to 0.
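The quadratic shrinkage can be made concrete with a small calculation of the expected genotype counts. The sample size of 1000 and the function name below are illustrative assumptions, not values taken from the simulations.

```python
def expected_genotype_counts(p, n):
    """Expected numbers of AA, Aa and aa individuals among n, given allele frequency p."""
    q = 1.0 - p
    return (n * p * p, 2 * n * p * q, n * q * q)

# Illustrative sample size (an assumption for this example).
for p in (0.6, 0.7, 0.8, 0.9):
    n_AA, n_Aa, n_aa = expected_genotype_counts(p, 1000)
    print(f"p = {p}: AA = {n_AA:.0f}, Aa = {n_Aa:.0f}, aa = {n_aa:.0f}")
```

With n = 1000, the expected aa count falls from 160 at p = 0.6 to 40 at p = 0.8, so the smallest cluster quickly becomes too sparse to locate reliably.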
In multi-dimensional data, treating each variable separately means that information on interdependencies between the variables is not used at all. Thus, a vector-valued estimation algorithm is called for. We choose the multivariate normal to model the data and use a multivariate analog of the theory in Slide 5 to estimate p.
Overall, EM was better than the other two methods. EM and CEM mostly gave comparable MSE, but their superiority over 3-means was not evident in some cases, especially for p = 0.6.
• Here we transform the original asymmetric data into symmetric data by using an appropriate value of λ:
  y_original → (y^λ − 1)/λ   if λ ≠ 0
  y_original → ln(y)         if λ = 0
• Criterion for choice of λ: maximizing the between-group to within-group variance ratio.
[Figure: Log-normal distribution to normal distribution by λ = 0]
[Figure: Chi-square distribution to normal distribution by λ = 0.5]
Using a regular grid of points for λ, we see that almost always (more than 95% of the time) the algorithm chooses the correct λ or a nearby value.
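The transformation and the grid search over λ could be sketched as below. This is only one possible reading of the criterion: the slides do not say how the groups are formed for each λ, so the example assumes they come from a simple 3-means clustering of the transformed data. All function names are hypothetical, and the data are assumed to be strictly positive.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transform: (y^lam - 1)/lam for lam != 0, ln(y) for lam = 0."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1.0) / lam

def three_means_labels(z, n_iter=100):
    """Assumed grouping step: 1-D 3-means started at the 1/6, 1/2, 5/6 quantiles."""
    centres = np.quantile(z, [1 / 6, 1 / 2, 5 / 6])
    for _ in range(n_iter):
        labels = np.argmin(np.abs(z[:, None] - centres), axis=1)
        new = np.array([z[labels == k].mean() if np.any(labels == k) else centres[k]
                        for k in range(3)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels

def variance_ratio(z, labels):
    """Between-group over within-group sum of squares. Degrees of freedom are
    omitted: they are constant across lambda and do not change the maximiser."""
    grand = z.mean()
    between, within = 0.0, 0.0
    for k in np.unique(labels):
        g = z[labels == k]
        between += g.size * (g.mean() - grand) ** 2
        within += ((g - g.mean()) ** 2).sum()
    return between / within

def choose_lambda(y, grid):
    """Return the grid value of lambda with the largest variance ratio."""
    def score(lam):
        z = box_cox(y, lam)
        return variance_ratio(z, three_means_labels(z))
    return max(grid, key=score)
```

A call such as `choose_lambda(y, [0.0, 0.25, 0.5, 0.75, 1.0])` returns one value from the supplied grid; how fine the grid should be is left open here.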
• Performance across the different parameter values remains similar after the transformation; however, there is a drop in performance due to the added variability from the choice of λ.

Original p  Method    µ=1     µ=2     µ=3     µ=1     µ=2     µ=3
0.6         3-Means   0.0040  0.0023  0.0007  0.0040  0.0034  0.0016
            EM        0.0039  0.0059  0.0008  0.0109  0.0024  0.0019
            CEM       0.0114  0.0032  0.0004  0.0101  0.0131  0.0045
0.7         3-Means   0.0188  0.0109  0.0038  0.0172  0.0136  0.0034
            EM        0.0170  0.0135  0.0040  0.0122  0.0068  0.0005
            CEM       0.0163  0.0060  0.0002  0.0174  0.0122  0.0026
0.8         3-Means   0.0494  0.0444  0.0258  0.0565  0.0393  0.0256
            EM        0.0414  0.0252  0.0041  0.0392  0.0268  0.0060
            CEM       0.0333  0.0346  0.0105  0.0422  0.0282  0.0220
• Many heavy-tailed distributions, such as the Cauchy and the t distribution with 2 degrees of freedom, do not have finite first two moments. In these cases we cannot use the sample mean and variance to estimate the location and scale parameters of the population.
• Instead we use the sample median and quartile deviation to estimate the location and scale parameters.
• Using quantiles instead of moments also helps increase the robustness of the algorithms to outliers in the data. So this algorithm can also be used when robustness is required, even if the distribution is not suspected to be heavy-tailed.
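As a small illustration of these estimators, the sketch below computes the sample median and the quartile deviation, taken here as half the interquartile range. The function name is hypothetical. For a Cauchy distribution the quartile deviation equals the scale parameter, which makes that case a convenient check.

```python
import numpy as np

def robust_location_scale(x):
    """Return (sample median, quartile deviation = (Q3 - Q1) / 2)."""
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return med, (q3 - q1) / 2.0

# Example on simulated Cauchy data with location 2 and scale 1.5 (illustrative values).
rng = np.random.default_rng(0)
sample = 2.0 + 1.5 * rng.standard_cauchy(20001)
location, scale = robust_location_scale(sample)
print(location, scale)             # quantile-based estimates of location and scale
print(sample.mean(), sample.std())  # moment-based estimates, unstable for Cauchy data
```

For Gaussian data the quartile deviation is about 0.6745 times the standard deviation, so a rescaling would be needed before comparing it with a moment-based scale estimate.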
With p = 0.5, the classification should have been 250, 500 and 250.
With the outlier present, the three clusters are of size 984, 15 and 1; the outlier is the single element of its cluster.
With 3-medoids, the three clusters have comparable numbers of elements (299, 421 and 280) and a genuine classification has been made.
Thus, 3-medoids gives much better results in the presence of outliers.
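A one-dimensional 3-medoids step can be sketched as follows. This is a simplified alternating scheme, not necessarily the variant used for the slides; the function name and the initialisation are assumptions. In one dimension, the cluster member that minimises the total absolute distance to the others is a middle element of the sorted cluster, so each medoid update reduces to taking a median data point.

```python
import numpy as np

def three_medoids_1d(x, n_iter=100):
    """Alternate nearest-medoid assignment and medoid update on sorted 1-D data.
    Returns labels for the SORTED data and the three medoids."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    # Assumed initialisation: the data points at ranks n/6, n/2 and 5n/6.
    medoids = x[[n // 6, n // 2, (5 * n) // 6]].copy()
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - medoids), axis=1)
        new = medoids.copy()
        for k in range(3):
            members = x[labels == k]  # already in sorted order
            if members.size:
                new[k] = members[members.size // 2]  # a median element of the cluster
        if np.array_equal(new, medoids):
            break
        medoids = new
    return labels, medoids

# Illustrative data: three groups of 250, 500 and 250 points plus one extreme outlier.
rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(-6, 1, 250), rng.normal(0, 1, 500),
                       rng.normal(6, 1, 250), [1000.0]])
labels, medoids = three_medoids_1d(data)
print(np.bincount(labels, minlength=3), medoids)
```

Because a medoid must be an actual data point near the middle of its cluster, a single extreme value joins the nearest cluster instead of dragging a centre towards itself.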
• The robust algorithms protect the estimates from being distorted by outliers, but at the cost of some loss of efficiency in the EM algorithm.

Original p  Method    µ=1     µ=2     µ=3     µ=1     µ=2     µ=3
0.6         3-Means   0.0024  0.0035  0.0009  0.0079  0.0054  0.0052
            EM        0.0071  0.0030  0.0004  0.0031  0.0027  0.0046
            CEM       0.0529  0.0406  0.0017  0.0469  0.0439  0.0526
0.7         3-Means   0.0229  0.0139  0.0059  0.0095  0.0231  0.0254
            EM        0.0133  0.0039  0.0007  0.0087  0.0233  0.0228
            CEM       0.0755  0.0781  0.0406  0.0802  0.0649  0.0580
0.8         3-Means   0.0499  0.0477  0.0603  0.0409  0.0542  0.0536
            EM        0.0348  0.0188  0.0236  0.0533  0.0521  0.0439
            CEM       0.0387  0.0406  0.0266  0.0431  0.0440  0.0461
• Data were collected from an ongoing clinical survey on Type 2 Diabetes at the Madras Diabetes Research Foundation, Chennai, India, covering roughly 500 patients and 9 different fields.
• Preliminary analysis revealed some perfect linear dependencies, which helped us reduce the dimensionality of the multivariate estimates.
• We ran the data through both the univariate algorithms, treating each variable separately, and the multivariate routine using 6 fields.
i) Results from multivariate analysis:
3-medoids: 0.6857   EM: 0.6477   CEM: 0.6576
The consistency of the results suggests that the multivariate normal is a good fit for the data.
ii) Results from univariate analysis:

Data        BMI     FBS     FBS-INS  IR      CHO     TRI     LDL     HDL     HBA1C
3-medoids   0.5532  0.7028  0.6325   0.5321  0.5040  0.6938  0.5542  0.5994  0.6847
EM          0.5589  0.7909  0.7801   0.9011  0.8911  0.9728  0.6659  0.9613  0.7524
CEM         0.5958  0.8394  0.7400   0.6496  0.5552  0.9829  0.7329  0.6566  0.8233
Robust EM   0.8250  0.7402  0.5764   0.5400  0.6632  0.5549  0.7354  0.5296  0.7095
Robust CEM  0.5833  0.6647  0.5733   0.5562  0.5572  0.6446  0.7912  0.5783  0.8022
• We see that for the phenotypes FBS-INS, IR, CHO, TRI and HDL, the estimate of p is nearly consistent across methods, except for the EM and CEM algorithms. The likely reason is that the distribution does not follow a Gaussian model or that the data contain extreme outliers.
• For LDL, robust EM and robust CEM give consistent values, but the initial cluster analysis does not, implying that although 3-medoids was not entirely accurate, its initial estimate still led to a consistent solution.
• For BMI and FBS, the EM and CEM algorithms give consistent solutions, but sensitivity decreases under robustification. This implies that the underlying model is most likely Gaussian.
If some phenotypes return the same p, and we have prior biological knowledge that their controlling genes may be the same, it is likely that the same gene controls those specific phenotypes. This work should help considerably in identifying such phenotypes.
Based on the simulation results, we propose the following as the best method for calculating the allele frequency:
We first run the 3-medoids algorithm to estimate the location and scale parameters of the 3 clusters, along with a crude estimate of p. Starting from these crude estimates, we run the EM algorithm over a grid of λ values and choose the λ with the maximum between-group to within-group variance ratio.
We then check graphically whether the data contain outliers. If they do, we use the robust EM; otherwise we use the usual EM to get the final estimate of p, the allele frequency.
Madras Diabetes Research Foundation, Chennai,
India
http://www.mvdsc.org/mdrf/about.htm