
Multifactor Dimensionality Reduction
Laura Mustavich
Introduction to Data Mining
Final Project Presentation
April 26, 2007
The Inspiration For a Method
The Nature of Complex Diseases
Most common diseases are complex:
• Caused by multiple genes
• Often interacting with one another
This interaction is termed epistasis
Epistasis

When an allele at one locus masks the effect of an allele at another locus
The Failure of Traditional Methods
Traditional gene hunting methods are successful for rare Mendelian (single-gene) diseases
They are unsuccessful for complex diseases:
• Since many genes interact to cause the disease, the effect of any single gene is too small to detect
• They do not take this interaction into account
MDR: The Algorithm
Multifactor Dimensionality Reduction
• A data mining approach to identify interactions among discrete variables that influence a binary outcome
• A nonparametric alternative to traditional statistical methods such as logistic regression
• Driven by the need to improve the power to detect gene-gene interactions

MDR Step 0
Divide data (genotypes, discrete environmental factors, and affection status) into 10 distinct subsets for cross-validation
MDR Step 1
Select a set of n genetic or environmental factors (which are suspected of acting epistatically together) from the set of all variables in the training set
MDR Step 2
Create a contingency table for these multilocus genotypes, counting the number of affected and unaffected individuals with each multilocus genotype
MDR Step 3
Calculate the ratio of cases to controls for each multilocus genotype
MDR Step 4
Label each multilocus genotype as “high-risk” or “low-risk”, depending on whether the case-control ratio is above a certain threshold
This is the dimensionality reduction step: it reduces the n-dimensional space to 1 dimension with 2 levels
MDR Step 5
Use the labels to classify individuals as cases or controls, and calculate the misclassification rate
Repeat steps 1-5 for:
• All possible combinations of n factors
• All possible values of n
• Across all 10 training and testing sets
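Steps 2 to 5, and the search over factor combinations, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MDR software's implementation: the function names are hypothetical, genotypes are encoded as small integers, the risk threshold defaults to a case-control ratio of 1.0, and the cross-validation split from Step 0 is omitted, so the error here is computed on the same data used for labeling.

```python
from collections import Counter
from itertools import combinations

def mdr_fit(rows, labels, factors, threshold=1.0):
    """Steps 2-4: count cases/controls per multilocus genotype, label high-risk cells."""
    cases, controls = Counter(), Counter()
    for row, y in zip(rows, labels):
        cell = tuple(row[f] for f in factors)
        (cases if y == 1 else controls)[cell] += 1
    high_risk = set()
    for cell in set(cases) | set(controls):
        # Equivalent to cases/controls > threshold, without dividing by zero.
        if cases[cell] > threshold * controls[cell]:
            high_risk.add(cell)
    return high_risk

def misclassification_rate(rows, labels, factors, high_risk):
    """Step 5: classify each individual by its cell label and count errors."""
    errors = 0
    for row, y in zip(rows, labels):
        predicted = 1 if tuple(row[f] for f in factors) in high_risk else 0
        errors += predicted != y
    return errors / len(labels)

def best_model(rows, labels, n, threshold=1.0):
    """Try every combination of n factors; return (error, factors) with lowest error."""
    scored = []
    for factors in combinations(range(len(rows[0])), n):
        high_risk = mdr_fit(rows, labels, factors, threshold)
        scored.append((misclassification_rate(rows, labels, factors, high_risk), factors))
    return min(scored)

# Toy data (assumed for illustration): case status is 1 only when
# factor 0 and factor 1 differ; factor 2 is noise.
rows = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1),
        (0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
labels = [0, 1, 1, 0, 0, 1, 1, 0]
print(best_model(rows, labels, 2))
```

In this toy data neither factor 0 nor factor 1 separates cases from controls on its own, which is the situation the slides describe: the signal only appears when the two factors are considered jointly.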
The Best Model

• Minimizes prediction error: the average misclassification rate across all 10 cross-validation subsets
• Maximizes cross-validation consistency: the number of times a particular model was the best model across cross-validation subsets
Hypothesis test of best model:
Evaluate the magnitude of the cross-validation consistency and prediction error estimates by permutation testing:
• Randomize the disease labels
• Repeat the MDR analysis many times to get distributions of cross-validation consistencies and prediction errors
• Use these distributions to determine p-values for the actual cross-validation consistencies and prediction errors
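The permutation procedure can be sketched as follows. This is a simplified illustration, not the MDR software's code: `permutation_p_value` and `accuracy` are hypothetical names, the statistic is a stand-in (a real analysis would re-run the whole MDR search on each shuffled label set), and the sketch assumes a statistic where larger values are more extreme, such as accuracy or cross-validation consistency. For prediction error the comparison would be reversed.

```python
import random

def permutation_p_value(statistic, labels, observed, n_permutations=1000, seed=0):
    """Fraction of label permutations whose statistic is at least as large as observed."""
    rng = random.Random(seed)
    shuffled = list(labels)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)              # randomize the disease labels
        if statistic(shuffled) >= observed:
            count += 1
    return count / n_permutations

# Stand-in statistic (assumed for illustration): accuracy of a fixed set of
# predictions against the (possibly shuffled) labels.
predictions = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

def accuracy(lbls):
    return sum(p == y for p, y in zip(predictions, lbls)) / len(lbls)

observed = accuracy(labels)
p = permutation_p_value(accuracy, labels, observed)
print(observed, p)
```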
Permutation Testing: An illustration
Sample Quantiles:
0%       0.045754
25%      0.168814
50%      0.237763
75%      0.321027
90%      0.423336
95%      0.489813
99%      0.623899
99.99%   0.872345
100%     1
[Figure: An Example Empirical Distribution. Histogram with Frequency (0 to 10) on the vertical axis and values up to 1.0 on the horizontal axis; the observed value 0.4500 is marked]
The probability that we would see results as extreme as, or more extreme than, 0.4500 simply by chance is between 5% and 10%, since 0.4500 lies between the 90% and 95% quantiles
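The bracketing argument can be checked with a short sketch. The pairing of each percentage with its value is read from the slide's quantile table, and the function name `p_value_bounds` is hypothetical; the sketch assumes the observed value falls inside the range of the tabulated quantiles.

```python
# Quantile level -> value, as read from the slide's sample quantile table.
QUANTILES = [
    (0.00, 0.045754), (0.25, 0.168814), (0.50, 0.237763),
    (0.75, 0.321027), (0.90, 0.423336), (0.95, 0.489813),
    (0.99, 0.623899), (0.9999, 0.872345), (1.00, 1.0),
]

def p_value_bounds(observed, quantiles=QUANTILES):
    """Return (lower, upper) bounds on P(value >= observed) from tabulated quantiles."""
    below = max(level for level, value in quantiles if value <= observed)
    above = min(level for level, value in quantiles if value >= observed)
    return 1.0 - above, 1.0 - below

lower, upper = p_value_bounds(0.4500)
print(round(lower, 4), round(upper, 4))
```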
Strengths
• Facilitates simultaneous detection and characterization of multiple genetic loci associated with a discrete clinical endpoint by reducing the dimensionality of the multilocus data
• Non-parametric: no parameters are estimated
• Assumes no particular genetic model
• False-positive results due to multiple testing are minimized
Weaknesses
• Computationally intensive (especially with >10 loci)
• The curse of dimensionality: decreased predictive ability with high dimensionality and small sample size, due to cells with no data

MDR Software
The Authors
Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Hahn, Ritchie, Moore, 2003.
www.sourceforge.net
Values Calculated by MDR
Measure: Formula/Interpretation
Balanced Accuracy: (Sensitivity+Specificity)/2; fitness measure. Accuracy is skewed in favor of the larger class, whereas balanced accuracy gives equal weight to each class
Accuracy: (TP+TN)/(TP+TN+FP+FN); proportion of instances correctly classified
Sensitivity: TP/(TP+FN); proportion of actual positives correctly classified
Specificity: TN/(TN+FP); proportion of actual negatives correctly classified
Odds Ratio: (TP*TN)/(FP*FN); compares the odds of a certain event between two groups
χ²: Chi-squared score for the attribute constructed by MDR from this attribute combination
Precision: TP/(TP+FP); proportion of predicted positives that are actual positives
Kappa: 2(TP*TN-FP*FN)/[(TP+FN)(FN+TN)+(TP+FP)(FP+TN)]; a function of total accuracy and random accuracy
F-Measure: 2*TP/(2*TP+FP+FN); a function of sensitivity and precision
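As a worked example, the measures can be computed from a 2x2 confusion matrix. This is a sketch, not the MDR software's code, and the function name `mdr_metrics` is hypothetical. The counts used (TP=14, TN=15, FP=1, FN=0) are an assumption: they are not given in the slides, but were chosen because they are consistent with the training statistics reported for the 30 Ami instances. Kappa is computed with the standard Cohen's kappa numerator, TP*TN minus FP*FN.

```python
def mdr_metrics(tp, tn, fp, fn):
    """Compute the listed measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        # The odds ratio is undefined when FP*FN == 0; reported here as infinity.
        "odds_ratio": (tp * tn) / (fp * fn) if fp * fn else float("inf"),
        "precision": tp / (tp + fp),
        "kappa": 2 * (tp * tn - fp * fn)
                 / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
    }

# Assumed counts (not from the slides), see the note above.
m = mdr_metrics(tp=14, tn=15, fp=1, fn=0)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```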
Sign Test
n = number of cross-validation intervals
c = number of cross-validation intervals with testing accuracy ≥ 0.5
$$p = \sum_{k=c}^{n} \binom{n}{k} \left(\frac{1}{2}\right)^{n}$$
The probability of observing c or more cross-validation intervals with testing accuracy ≥ 0.5 if each case were actually classified randomly
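The sign test p-value is a binomial tail sum and is easy to evaluate directly. The sketch below uses only the Python standard library (`math.comb`, Python 3.8+); the function name `sign_test_p` is hypothetical.

```python
from math import comb

def sign_test_p(n, c):
    """P(at least c of n intervals have testing accuracy >= 0.5) under random classification."""
    return sum(comb(n, k) for k in range(c, n + 1)) / 2 ** n

# With n = 10 cross-validation intervals:
for c in (10, 8, 5, 3):
    print(c, round(sign_test_p(10, c), 4))
```

With n = 10, c = 10 gives about 0.0010 and c = 8 gives about 0.0547, which agrees with the sign test values shown in the results tables.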
The Problem of Alcoholism
A Case Study
Genes Associated With Alcoholism
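The ADH (alcohol dehydrogenase) and ALDH2 (acetaldehyde dehydrogenase 2) genes, which are involved in alcohol metabolism, are associated with alcoholism
[Figure: metabolic pathway. Alcohol is converted to acetaldehyde by the ADH enzymes; acetaldehyde is converted to acetate by the ALDH2 enzyme]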
ADH Genes
[Figure: map of the ADH gene cluster on chromosome 4 (370 kb), 5’ to 3’: ADH7 (Class IV); ADH1C, ADH1B, ADH1A (Class I); ADH6 (Class V); ADH4 (Class II); ADH5 (Class III)]
Taste Receptors and Aversion to Alcohol
• A person must be willing to drink in order to be an alcoholic
• TAS2R38 affects the amount of alcohol a person is willing to drink
• Therefore, it is related to alcoholism, although no direct association has been found
• We hope to provide a direct link between TAS2R38 and alcoholism by demonstrating that it acts epistatically with other genes associated with alcoholism
[Figure: PTC tasting and TAS2R38. Tasters: Alcohol Tastes Bitter, Drink Less Alcohol. Non-Tasters: Alcohol Tastes Sweet, Drink More Alcohol]
Actual Analysis
Data
• A sample of cases and controls (alcoholics and non-alcoholics) from three East Asian populations: the Ami, Atayal, and Taiwanese
• Genotyped for 98 markers within several genes: ALDH2, all ADH genes, and 2 taste receptor genes, TAS2R16 and TAS2R38 (PTC)
Computational Limitations
1. The software package has a problem reading missing data
I was forced to use only complete records, reducing my (already small) sample to 79 complete records
2. The computation time is too long for higher-order models, especially with large numbers of attributes
I was advised to restrict my attributes to markers within ADH1C and the 2 taste receptor genes, which left me with 36 attributes
I considered models only up to order 4
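The growth in computation time follows from the number of attribute combinations that an exhaustive search must evaluate at each order. A quick count, using only the attribute totals stated above (the function name `models_per_order` is hypothetical):

```python
from math import comb

def models_per_order(n_attributes, max_order):
    """Number of attribute combinations to evaluate at each model order."""
    return {k: comb(n_attributes, k) for k in range(1, max_order + 1)}

print(models_per_order(36, 4))   # restricted attribute set
print(models_per_order(98, 4))   # all 98 genotyped markers
```

With 36 attributes there are 58,905 four-factor combinations; with all 98 markers there would be 3,612,280, each of which is evaluated in every cross-validation split.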
Summary of Results: All Populations
Instances: 79; Attributes: 36; Ratio: 1.3235
Order | Model | Training Bal. Acc. | Testing Bal. Acc. | Sign Test (p) | CV Consistency
1 | X.04..ADH1C.dwstrm.Te | 0.6049 | 0.4278 | 0 (1.0000) | 5/10
2 | X.07..TAS2R16.C_11431, X.04..ADH1C.dwstrm.Te | 0.7076 | 0.4438 | 3 (0.9453) | 6/10
3 | X.07..TAS2R16.C_11431, X.04..ADH1C.dwstrm.Te, X.04..ADH1C.rs3762896 | 0.785 | 0.3186 | 1 (0.9990) | 4/10
4 | X.07..TAS2R16.C_11431, X.07..PTC.C_8876291_1, X.07..PTC.C_8876482_1, X.04..ADH1C.dwstrm.Te | 0.8453 | 0.3564 | 2 (0.9893) | 6/10
Summary of Results: Ami
Instances: 30; Attributes: 36; Ratio: 0.8750
Order | Model | Training Bal. Acc. | Testing Bal. Acc. | Sign Test (p) | CV Consistency
1 | X.07..TAS2R16.C_11431 | 0.7331 | 0.4598 | 5 (0.6230) | 5/10
2 | X.07..TAS2R16.C_11431, X.04..ADH1C.C_2688508 | 0.8284 | 0.3476 | 2 (0.9893) | 3/10
3 | X.07..TAS2R16.C_11431, X.07..PTC.C_8876467_1, X.04..ADH1C.C_2688508 | 0.9688 | 0.9545 | 10 (0.0010) | 10/10
4 | X.07..TAS2R16.C_11431, X.07..TAS2R16.C_11431.1, X.07..PTC.C_8876467_1, X.04..ADH1C.C_2688508 | 0.9722 | 0.8712 | 8 (0.0547) | 9/10
Cross Validation Statistics
Measure | Training Set | Testing Set
Balanced Accuracy | 0.9688 | 0.9545
Accuracy | 0.9667 | 0.95
Sensitivity | 1 | 1
Specificity | 0.9375 | 0.9091
Odds Ratio | ∞ | ∞
χ² | 23.6250 (p < 0.0001) | 1.6364 (p = 0.2008)
Precision | 0.9333 | 0.9
Kappa | 0.9333 | 0.9
F-Measure | 0.9655 | 0.9474
Sign Test: 10 (p = 0.0010)
Cross-validation Consistency: 10/10
Whole Dataset Statistics:
Training Balanced Accuracy: 0.9688
Training Accuracy: 0.9667
Training Sensitivity: 1.0000
Training Specificity: 0.9375
Training Odds Ratio: ∞
Training χ²: 26.2500 (p < 0.0001)
Training Precision: 0.9333
Training Kappa: 0.9333
Training F-Measure: 0.9655
Graphical Model
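Classification Rules
IF X.07..TAS2R16.C_11431 AND X.07..PTC.C_8876467_1 AND X.04..ADH1C.C_2688508 THEN Class
X.07..TAS2R16.C_11431 | X.07..PTC.C_8876467_1 | X.04..ADH1C.C_2688508 | Class
A\A | C\G | C\C | 0
A\A | C\G | C\T | 1
A\A | C\G | T\T | 0
A\A | G\G | C\C | 0
A\A | G\G | C\T | 0
A\A | G\G | T\T | 1
A\G | C\C | C\T | 1
A\G | C\G | C\C | 0
A\G | C\G | C\T | 0
A\G | C\G | T\T | 0
A\G | G\G | C\C | 1
A\G | G\G | C\T | 1
A\G | G\G | T\T | 0
G\G | C\G | C\T | 1
G\G | G\G | C\C | 0
G\G | G\G | C\T | 1
G\G | G\G | T\T | 1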
Locus Dendrogram
Future Work
• Simulations to calculate the power of MDR, especially in relation to sample size
• Comparison of MDR with logistic regression, and other proposed methods to detect epistasis, with respect to the current data set and simulated data
• Research how different methods to search the sample space can be incorporated into the MDR implementation to improve computational feasibility