Working with Categorical Data



Working with Categorical Data
Elizabeth Prom-Wormley and
Hermine Maes
Special Thanks to Sarah Medland
Transitioning from Continuous Logic to Categorical Logic
- Ordinal data has 1 less degree of freedom compared to continuous data
  - MZ covariance, DZ covariance, prevalence
  - No information on the variance
- Thinking about our ACE/ADE model
  - 4 parameters being estimated: A, C, E, mean
  - The ACE/ADE model is unidentified without adding a constraint
Two Approaches to the Liability Threshold Model
- Traditional: maps the data to a standard normal distribution; the total variance is constrained to be 1
- Alternate: fixes an alternate parameter (usually E) and estimates the remaining parameters
Time to Look at the Data!
Please open BinaryWarmUp.R
Observed Binary BMI is an Imperfect Measure of the Underlying Continuous Distribution
We are interested in the liability of risk for being in the "high" BMI category.
Mean (bmiB2) = 0.39; SD (bmiB2) = 0.49; prevalence of "low" BMI = 60.6%
[Figure: "Density of BMI", a density plot of twinData$bmiB2]
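A quick way to reproduce these summary statistics in R (a minimal sketch, assuming twinData is loaded as in BinaryWarmUp.R and bmiB2 is coded 0 = "low", 1 = "high"):

mean(twinData$bmiB2, na.rm = TRUE)       # ~0.39, the proportion in the "high" category
sd(twinData$bmiB2, na.rm = TRUE)         # ~0.49
prop.table(table(twinData$bmiB2))        # ~60.6% "low", ~39.4% "high"
plot(density(twinData$bmiB2, na.rm = TRUE), main = "Density of BMI")   # the plot summarized above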
It’s Helpful to Rescale
[Figure: density plot, density.default(x = test1), N = 100000, bandwidth = 0.0444]
Raw Data (Unstandardized): mean = 0.49, SD = 0.39
- Data not mapped to a standard normal
- No easy conversion to %
- Difficult to compare between groups, since the scaling is now arbitrary
Standard Normal (Standardized): mean = 0, SD = 1
- Area under the curve between two z-values is interpreted as a probability or percentage
Binary Review
- Threshold calculated using the cumulative normal distribution (CND)
- We used frequencies and the inverse CND to do our own estimation of the threshold (sketched below):
  qnorm(0.816) = 0.90
- The threshold is the z-value that corresponds to the proportion of the population having "low BMI"
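The same threshold logic in a couple of lines of R (a minimal sketch; the 0.816 proportion is the one quoted above):

pLow <- 0.816    # proportion of the population in the "low BMI" category
qnorm(pLow)      # threshold on the liability scale, ~0.90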
Moving to Ordinal Data!
Getting a Feel for the Data
Open twinSatOrd.R
Calculate the frequencies of the 5 BMI categories for the second twins of the MZ pairs (a base-R alternative is sketched below):
CrossTable(mzDataOrdF$bmi2)
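CrossTable() comes from an add-on package (gmodels); if it is not loaded, the same proportions can be obtained with base R (a minimal sketch):

round(prop.table(table(mzDataOrdF$bmi2)), 3)   # frequencies of the 5 BMI categories
cumsum(prop.table(table(mzDataOrdF$bmi2)))     # cumulative frequencies used for the thresholds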
Estimating MZ Twin 2 Thresholds by Hand
T1 = qnorm(0.124) = -1.155
T2 = qnorm(0.124 + 0.236) = -0.358
T3 = qnorm(0.124 + 0.236 + 0.291) = 0.388
T4 = qnorm(0.124 + 0.236 + 0.291 + 0.175) = 0.939
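A minimal sketch reproducing this hand calculation, using the category frequencies from the slide:

freq <- c(0.124, 0.236, 0.291, 0.175, 0.175)   # 5 BMI categories, MZ twin 2
qnorm(cumsum(freq)[1:4])                       # T1..T4: -1.155, -0.358, 0.388, 0.939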
Estimate Twin Pair Correlations for the Liabilities Too!
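One quick way to estimate the twin-pair correlation of the liabilities outside OpenMx is a polychoric correlation. A minimal sketch, assuming the polycor package is installed and that the ordinal BMI variables are named bmi1 and bmi2 as in the slides (dzDataOrdF is an assumed name for the DZ data frame):

library(polycor)                               # not part of the workshop script; used only for a quick check
polychor(mzDataOrdF$bmi1, mzDataOrdF$bmi2)     # MZ liability correlation
polychor(dzDataOrdF$bmi1, dzDataOrdF$bmi2)     # DZ liability correlation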
Translating Back to the SEM Approach in OpenMx
Handling Ordinal Data in OpenMx
1- Determine the 1st threshold
2- Determine the displacements between the 1st threshold and the subsequent thresholds
3- Add the 1st threshold and the displacements to obtain the subsequent thresholds (a numeric sketch follows)
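The three steps in plain R, using the rounded twin-2 values from the hand calculation above (a minimal sketch):

t1   <- -1.16                  # step 1: first threshold
disp <- c(0.79, 0.76, 0.55)    # step 2: displacements between consecutive thresholds
t1 + cumsum(disp)              # step 3: thresholds 2-4 (-0.37, 0.39, 0.94)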
Ordinal Saturated Code Deconstructed
Defining Threshold Matrices
[Path diagram: liability threshold model for MZ and DZ twin pairs. Each liability (LT1, LT2) has its mean fixed at 0 and its variance constrained to 1; covMZ and covDZ link the two liabilities within a pair, and the threshold model contains the parameters t1MZ1-t4MZ1, t1MZ2-t4MZ2, t1DZ1-t4DZ1, and t1DZ2-t4DZ2.]
threM <- mxMatrix( type="Full", nrow=nth, ncol=ntv, free=TRUE,
values=thVal, lbound=thLB, labels=thLabMZ, name="ThreMZ" )
threD <- mxMatrix( type="Full", nrow=nth, ncol=ntv, free=TRUE,
values=thVal, lbound=thLB, labels=thLabDZ, name="ThreDZ" )
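These calls rely on objects defined earlier in twinSatOrd.R. A plausible version of that setup is sketched below; the start values, bounds, and label-building code are assumptions for illustration, not a verbatim copy of the workshop script:

nv  <- 1        # one ordinal variable (BMI category)
ntv <- nv * 2   # two twins per pair
nth <- 4        # 4 thresholds separate the 5 categories
thVal   <- matrix(rep(c(-1.5, 0.5, 0.5, 0.5), ntv), nrow = nth, ncol = ntv)     # first threshold + positive displacements
thLB    <- matrix(rep(c(-3, 0.001, 0.001, 0.001), ntv), nrow = nth, ncol = ntv) # keep displacements > 0 so thresholds stay ordered
thLabMZ <- c(paste0("t", 1:nth, "MZ1"), paste0("t", 1:nth, "MZ2"))   # t1MZ1..t4MZ1, t1MZ2..t4MZ2
thLabDZ <- c(paste0("t", 1:nth, "DZ1"), paste0("t", 1:nth, "DZ2"))   # t1DZ1..t4DZ1, t1DZ2..t4DZ2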
Ordinal Saturated Code Deconstructed
Defining Threshold Matrices- ThreMZ
ThreMZ (first threshold and displacements, by twin):
                   Tw1     Tw2
threshold 1       -1.19   -1.16
displacement 2     0.81    0.79
displacement 3     0.73    0.76
displacement 4     0.55    0.55
1- Determine the 1st threshold
2- Determine the displacements between the 1st threshold and the subsequent thresholds
Double Check: Moving from Frequencies to Displacements (BMI T2)
Category   Frequency   Cumulative Frequency   Z Value   Displacement
1          0.124       0.124                  -1.16     -
2          0.236       0.360                  -0.37     0.79
3          0.291       0.651                   0.39     0.76
4          0.175       0.826                   0.94     0.55
5          0.175       1.000                   -         -
Ordinal Saturated Code Deconstructed
Estimating Expected Threshold Matrices
threMZ <- mxAlgebra( expression= Inc %*% ThreMZ, name="expThreMZ" )
Inc    <- mxMatrix( type="Lower", nrow=nth, ncol=nth, free=FALSE, values=1, name="Inc" )
| 1  0  0  0 |       | -1.19  -1.16 |     | -1.19  -1.16 |
| 1  1  0  0 |       |  0.81   0.79 |     | -0.38  -0.37 |
| 1  1  1  0 |  %*%  |  0.73   0.76 |  =  |  0.34   0.39 |
| 1  1  1  1 |       |  0.55   0.55 |     |  0.89   0.93 |
     Inc                  ThreMZ              expThreMZ
3- Add the 1st threshold and the displacements to obtain the subsequent thresholds
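A minimal sketch in plain R (not OpenMx) checking this matrix algebra with the rounded values shown above:

Inc <- matrix(1, nrow = 4, ncol = 4)
Inc[upper.tri(Inc)] <- 0                       # lower-triangular matrix of ones
ThreMZ <- matrix(c(-1.19, 0.81, 0.73, 0.55,    # twin 1: first threshold, then displacements
                   -1.16, 0.79, 0.76, 0.55),   # twin 2: first threshold, then displacements
                 nrow = 4)
Inc %*% ThreMZ                                 # rows give T1..T4 per twin; matches expThreMZ up to rounding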
Ordinal Saturated Code Deconstructed
Estimating Correlations & Fixing Variance
corMZ <- mxMatrix( type="Stand", nrow=ntv, ncol=ntv,
free=TRUE, values=corVals, lbound=lbrVal, ubound=ubrVal,
labels="rMZ", name="expCorMZ" )
corDZ <- mxMatrix( type="Stand", nrow=ntv, ncol=ntv,
free=TRUE, values=corVals, lbound=lbrVal, ubound=ubrVal,
labels="rDZ", name="expCorDZ" )
How Many Parameters in this Ordinal Model?
- MZ correlation: rMZ
- DZ correlation: rDZ
- Thresholds:
  t1MZ1, t2MZ1, t3MZ1, t4MZ1
  t1MZ2, t2MZ2, t3MZ2, t4MZ2
  t1DZ1, t2DZ1, t3DZ1, t4DZ1
  t1DZ2, t2DZ2, t3DZ2, t4DZ2
In total: 2 correlations + 16 thresholds = 18 free parameters
Questions to Consider
- Run the script and double-check the results against your previously hand-calculated values
- What are the conclusions regarding the thresholds?
- Is testing an ACE model with the usual assumptions justified?
Univariate Analysis with Ordinal Data
A Roadmap
1- Use the data to test basic assumptions inherent to standard ACE (ADE) models (Saturated Model)
2- Estimate contributions of genetic and environmental effects on the liability of a trait (ADE or ACE Models)
3- Test ADE (ACE) submodels to identify and report significant genetic and environmental contributions (AE or E Only Models)
Open twinAceOrd.R
ACE Model Deconstructed
Path Coefficients
pathA <- mxMatrix( type="Full", nrow=1, ncol=1, free=TRUE, values=.6, label="a11", name="a" )   # a: 1 x 1 matrix
pathC <- mxMatrix( type="Full", nrow=1, ncol=1, free=TRUE, values=.6, label="c11", name="c" )   # c: 1 x 1 matrix
pathE <- mxMatrix( type="Full", nrow=1, ncol=1, free=TRUE, values=.6, label="e11", name="e" )   # e: 1 x 1 matrix
ACE Model Deconstructed
Variance Components
covA <- mxAlgebra( expression=a %*% t(a), name="A" )   # a %*% t(a): 1 x 1 matrix
covC <- mxAlgebra( expression=c %*% t(c), name="C" )   # c %*% t(c): 1 x 1 matrix
covE <- mxAlgebra( expression=e %*% t(e), name="E" )   # e %*% t(e): 1 x 1 matrix
Matrix and Algebra for Expected Means
meanG <- mxMatrix( type="Zero", nrow=1, ncol=nv, name="Mean" )
meanT <- mxAlgebra( expression= cbind(Mean,Mean), name="expMean" )
Matrices for Expected Thresholds
threG <- mxMatrix( type="Full", nrow=nth, ncol=nv, free=TRUE, values=thVal, lbound=thLB, ubound=thUB, labels=thLab, name="Thre" )
Thre is a 4 x 1 matrix:
| t1Z |
| t2Z |
| t3Z |
| t4Z |
Inc <- mxMatrix( type="Lower", nrow=nth, ncol=nth, free=FALSE, values=1, name="Inc" )
Inc is a 4 x 4 lower-triangular matrix of ones:
| 1  0  0  0 |
| 1  1  0  0 |
| 1  1  1  0 |
| 1  1  1  1 |
Algebra for Expected Thresholds
threT <- mxAlgebra( expression= cbind(Inc %*% Thre,
Inc %*% Thre), name="expThre" )
| 1  0  0  0 |     | t1Z |     | T11 |
| 1  1  0  0 |     | t2Z |     | T21 |
| 1  1  1  0 | %*% | t3Z |  =  | T31 |
| 1  1  1  1 |     | t4Z |     | T41 |
The same product is used for both twins, so expThre is a 4 x 2 matrix whose two columns are identical (T11, T21, T31, T41 for each twin).
ACE Liability Model
[Path diagram: ACE liability model for a twin pair. Latent A, C, and E factors (each with variance 1) load on the liabilities LT1 and LT2 through paths a, c, and e; the A factors of the two twins correlate 1 in MZ pairs and 0.5 in DZ pairs, the C factors correlate 1, the liability variance is fixed by the variance constraint, and the threshold model maps each liability onto the observed categories.]
covMZ <- mxAlgebra( expression= rbind( cbind(A+C+E , A+C),
                                       cbind(A+C , A+C+E)), name="expCovMZ" )
covDZ <- mxAlgebra( expression= rbind( cbind(A+C+E , 0.5%x%A+C),
                                       cbind(0.5%x%A+C , A+C+E)), name="expCovDZ" )
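A minimal sketch (plain R) showing what these expected covariance matrices look like for illustrative variance components that sum to 1:

A <- 0.49; C <- 0.16; E <- 1 - A - C           # illustrative values, not estimates from the data
expCovMZ <- rbind(cbind(A + C + E, A + C),
                  cbind(A + C,     A + C + E))
expCovDZ <- rbind(cbind(A + C + E, 0.5 * A + C),
                  cbind(0.5 * A + C, A + C + E))
expCovMZ; expCovDZ                             # diagonals are 1; off-diagonals are the implied twin correlations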
ACE Model Deconstructed
Constraint on Variance of Ordinal
Variables
covP   <- mxAlgebra( expression=A+C+E, name="V" )
matUnv <- mxMatrix( type="Unit", nrow=nv, ncol=1, name="Unv1" )
var1   <- mxConstraint( expression=diag2vec(V)==Unv1, name="Var1" )
V (= A + C + E) and Unv1 are both 1 x 1 matrices, so the constraint fixes the total liability variance at 1.
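With the variance constrained to 1, the squared paths are directly the standardized variance components; a minimal numeric sketch with illustrative path values:

aPath <- 0.70; cPath <- 0.40                   # illustrative, not estimates from the data
ePath <- sqrt(1 - aPath^2 - cPath^2)           # implied by the constraint a^2 + c^2 + e^2 = 1
c(A = aPath^2, C = cPath^2, E = ePath^2)       # standardized components, summing to 1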
How Many Parameters?
- A, C, E path coefficients (a11, c11, e11)
- Thresholds (4): t1Z-t4Z
In total: 3 + 4 = 7 free parameters, subject to the variance constraint
Questions to Consider
- Run the script and double-check the results against your previously hand-calculated values
- Are there any submodels that are appropriate to use instead of ACE?
- What are your conclusions regarding genetic and environmental influences on this measure of obesity?