
Latent Class Analysis in SAS:
Promise, Problems, and Programming
David M. Thompson
Department of Biostatistics and Epidemiology
College of Public Health, OUHSC
Invited paper 192-2007
Copyright © 2007, SAS Institute Inc. All rights reserved. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Latent class analysis (LCA)
• LCA validates classification in the
absence of a gold standard for
decision-making.
• Incorporation into SAS is recent.
LCA and Patient Classification
Patient classification is part of
many clinical decisions.
• Diagnosis
• Prognosis
Patient classification in the
absence of a gold standard
Diagnosis
• Diagnostic categories may be
emerging or unclear.
Prognosis
• predicting rehabilitation outcomes
• counseling patients and families
regarding expectations
Outline
• LCA defined
• SAS approaches to LCA
• Producing standard errors
• Curing the problem of fracturing of estimates
• Limitations of LCA
Latent class analysis (LCA)
• LCA is a parallel to
factor analysis, but
for categorical
responses.
• Like factor analysis,
LCA addresses the
complex pattern of
association that
appears among
observations….
[Diagram: four observed indicators A, B, C, D]
…and attributes the pattern to a set of
latent (underlying, unobserved) factors
or classes.
[Diagram: latent classes 1 and 2, each linked to the indicators A, B, C, D]
A complex pattern of responses
emerged when undergraduates made
ethical decisions in response to four
stimulus scenarios:
• Auto-pedestrian accident
• Insurance physical exam
• Broadway turkey
• Stock market information
Stouffer, S.A., & Toby, J. (1951). Role conflict and personality.
American Journal of Sociology, 56, 395-406.
[Diagram: latent classes LC 1 and LC 2 linked to the four indicators: auto-pedestrian accident, Broadway turkey, insurance physical exam, stock market information]
LCA predicts latent class membership such
that observed responses are independent.
[Diagram: arrows from LC 1 and LC 2 to the four indicators, labeled with conditional probabilities such as P(A-P acc | LC 1) and P(St.Mkt.Info | LC 2)]
LCA estimates:
• latent class prevalences
• conditional probabilities: probabilities of a specific response, given class membership
[Diagram repeated from the previous slide]
Conditional probabilities are analogous to
sensitivities and specificities,
but are calculated in the absence
of a gold standard.
LC parameter estimates for
Stouffer and Toby data
Indicator                     LC X=1 (P(X=1)=.29)   LC X=2 (P(X=2)=.71)
Auto-pedestrian accident             0.99                  0.71
Broadway turkey                      0.93                  0.33
Insurance physical exam              0.92                  0.35
Stock market information             0.76                  0.13
Indicators’ informativeness defined by
differences in conditional probabilities
[Diagram repeated from the previous slide, with the same conditional probability estimates]
LCA works on an unconditional contingency table (a table with no information on LC membership):
A-P accident   Bdwy turkey   Insurance PE   St. Mkt. Info   n_ijkl
     0              0              0               0           20
     0              0              0               1            2
     0              0              1               0            9
     0              0              1               1            2
     0              1              0               0            6
     .              .              .               .            .
     1              1              1               1           42
LCA’s goal is to produce
a complete (conditional) table
that assigns counts for each latent class:
A-P accident   Bdwy turkey   Insurance PE   St. Mkt. Info   Latent class X=t   n_ijklt
     0              0              0               0                1              9
     0              0              0               0                2              6
     0              0              0               1                1              3
     0              0              0               1                2             11
     .              .              .               .                .              .
     1              1              1               1                2              9
Assumptions of LCA
• Exhaustiveness
  $\pi_{ijkl}^{ABCD} = \sum_t \pi_{ijklt}^{ABCDX}$
• Conditional (Local) Independence
  $\pi_{ijklt}^{ABCDX} = \pi_{ijkl|t}^{ABCD|X}\,\pi_t^{X} = \pi_{i|t}^{A|X}\,\pi_{j|t}^{B|X}\,\pi_{k|t}^{C|X}\,\pi_{l|t}^{D|X}\,\pi_t^{X}$
(Goodman's probabilistic parameterization of an LC model with 4 observed variables)
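For instance, taking the estimates shown earlier for the Stouffer and Toby data (and the study's 216 respondents), and reading the conditional probabilities as P(response = 1 | class), conditional independence implies a model probability for the all-1 response profile of
  $P(1,1,1,1) = .29(.99)(.93)(.92)(.76) + .71(.71)(.33)(.35)(.13) \approx .187 + .008 = .194$,
which, multiplied by 216, gives about 42 — matching the observed count for that profile in the contingency table.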
ML approach to LC estimation
• The probability of obtaining the observed count $n_{ijklt}$ for response profile $\{i,j,k,l\}$ and latent class $t$ is $(\pi_{ijklt}^{ABCDX})^{n_{ijklt}}$.
• The likelihood of obtaining the full set of observed counts across response profiles is
  $L = \prod_i \prod_j \prod_k \prod_l \prod_t (\pi_{ijklt}^{ABCDX})^{n_{ijklt}}$
  $\log L = \sum_i \sum_j \sum_k \sum_l \sum_t n_{ijklt}\,\ln(\pi_{ijklt}^{ABCDX})$
ML approach to LC estimation
• Because LC membership (X=t) is
unobserved, the likelihood function and
likelihood surface are complex.
EM algorithm calculates L
when some data (X) are unobserved
“M” step
produces ML estimates
from complete table
“E” step
uses parameter estimates
to update expected values
for cell counts nijklt
in complete contingency table
EM algorithm requires initial estimates
1st “E” step:
provides initial
estimates to “fill in”
missing information
on LC membership
[Diagram: the "E" and "M" steps alternate in a cycle]
These functions are achieved in SAS/IML or conventional DATA steps.
EM algorithm implemented using SAS/IML or conventional DATA steps
1st "E" step: randomly assigns each response profile to one latent class
[Diagram: the "E" and "M" steps alternate in a cycle]
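A minimal sketch of this first "E" step, assuming a data set PROFILES with one row per response profile (variables a, b, c, d, and count); the data set, variable names, and seed are hypothetical, not the author's posted code:

data estep1;
  set profiles;
  /* assign the whole response profile at random to latent class 1 or 2 */
  if ranuni(2007) > 0.5 then x = 1;
  else x = 2;
run;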
Alternative approach using SAS
PROC CATMOD
1st "E" step: a SAS DATA step randomly assigns each response profile to one latent class
"M" step: PROC CATMOD
"E" step: SAS DATA step
moon.ouhsc.edu/dthompso/ latent%20variable%20research/lvr.htm
Other approaches
• PROC LCA, Methodology Center of
Penn State University
methcenter.psu.edu/lca/
• LC regression macros
K. Bandeen-Roche, Johns Hopkins
EM algorithm does not produce
standard errors
Strategies include:
• Converting CATMOD’s loglinear
parameter SE into probabilities
• Bootstrapping SE
• Obtaining SE from multiple solutions
Strategy 1: Convert SE obtained
from CATMOD’s loglinear model
Loglinear SE are convertible to
probabilities (after Heinen, 1996)
But probabilities are complex nonlinear
functions of their loglinear counterparts:
• latent class prevalences:
  $P(X=t) = \exp(\lambda_t^{X}) \Big/ \sum_{t'} \exp(\lambda_{t'}^{X})$
• conditional probabilities:
  $P(A=i \mid X=t) = P(A=i, X=t)\,/\,P(X=t) = \exp(\lambda_i^{A} + \lambda_{it}^{AX}) \Big/ \sum_{i'} \exp(\lambda_{i'}^{A} + \lambda_{i't}^{AX})$
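A minimal sketch of the prevalence conversion, assuming the loglinear estimates for X have already been arranged into variables LAM_X1 and LAM_X2 (the data set and variable names are hypothetical; CATMOD's ODS output must first be reshaped into this form):

data prevalence;
  set lambda_x;
  denom = exp(lam_x1) + exp(lam_x2);   /* sum over the latent classes */
  p_x1  = exp(lam_x1) / denom;         /* P(X=1) */
  p_x2  = exp(lam_x2) / denom;         /* P(X=2) */
run;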
Strategy 2: Bootstrap parameter
estimates and SE
• Generate initial LCA solution and use its
parameter estimates to generate a
complete (conditional) contingency table.
• From complete table, generate B
bootstrapped unconditional tables.
• Perform LCA on each table, producing B
sets of parameter estimates.
• The mean and SD of these constitute,
respectively, parameter estimates and SE.
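One way to carry out the second step (a sketch, not the author's code) is to sample cells of the complete table with replacement, with probability proportional to their expected counts. It assumes a data set COMPLETE holding one row per cell of the complete table with its expected count EXPCOUNT, and the Stouffer and Toby total of 216 respondents:

proc surveyselect data=complete out=boottabs
     method=urs      /* with-replacement sampling                  */
     n=216           /* draws per bootstrapped sample              */
     reps=100        /* B = 100 bootstrapped samples               */
     outhits;        /* one output row per draw                    */
  size expcount;     /* selection probability proportional to the cell's expected count */
run;
/* Within each value of Replicate, counting the rows for each response
   profile (summed over the latent class) yields one bootstrapped
   unconditional table. */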
Bootstrapping
• Creating multiple samples by
resampling repeatedly from original
sample
• Bootstrapped samples typically
chosen randomly, with replacement,
so n equals that of original sample
• Statistical operation repeated on
each bootstrapped sample.
Efficient bootstrapping code
(Barker, 2005)
data boot;
  /* draw 100 bootstrap samples, each the size of the original data set */
  do bootsamp=1 to 100;
    do i=1 to nobs;
      /* ceil() keeps pick in 1..nobs; round() could return 0, an invalid POINT= value */
      pick=ceil(ranuni(0)*nobs);
      set original nobs=nobs point=pick;
      output;
    end;
  end;
  stop;  /* required because POINT= access never reads end-of-file */
run;
Bootstrapped estimates
LC Parameter                          Number of      Latent Class X=1       Latent Class X=2
                                      solutions      Mean       STD         Mean       STD
Conditional Probs   P(A=1|X=t)           100         0.9752     0.0316      0.7022     0.0556
                    P(B=1|X=t)                       0.9143     0.0685      0.2939     0.0772
                    P(C=1|X=t)                       0.8816     0.0825      0.3395     0.0520
                    P(D=1|X=t)                       0.7315     0.1083      0.1103     0.0411
Latent Class Prevalences                             0.3300     0.0870      0.6700     0.0870

LR Statistic: Mean 10.64, Min 2.47, Max 42.93
Number of iterations: Mean 31.02, Min 7, Max 91
Strategy 3: Generate multiple solutions
from different starting values
Estimates and SE from multiple solutions,
each from a different initial assignment of
response profiles
LC Parameter                          Number of      Latent Class X=1       Latent Class X=2
                                      solutions      Mean       STD         Mean       STD
Conditional Probs   P(A=1|X=t)            94         0.9909     0.0018      0.7113     0.0038
                    P(B=1|X=t)                       0.9330     0.0129      0.3256     0.0060
                    P(C=1|X=t)                       0.9210     0.0138      0.3499     0.0053
                    P(D=1|X=t)                       0.7597     0.0114      0.1291     0.0061
Latent Class Prevalences                             0.2873     0.0117      0.7127     0.0117

LR Statistic: Mean 2.84, Min 2.75, Max 4.17
Number of iterations: Mean 31.29, Min 8, Max 90
[Histograms: estimates of the conditional probability P(A=1|X=1), from multiple solutions and from bootstrapped estimates]
Repeated solutions
approach may be more
useful than bootstrapping
because it explicitly
accounts for LCA’s
sensitivity to initial
estimates.
Multiple solutions
and bootstrapping
approaches are
useful, but present a
new challenge.
[Histograms — above: distribution of multiple estimates of the conditional probability P(A=1|X=1); below: P(A=1|X=2)]
“Fracturing” of distributions
of LC estimates.
What fractures the distributions?
[Diagram: the two-class parameter estimates for the Stouffer and Toby data, repeated from the earlier slide]
What fractures the distributions?
Latent classes have no intrinsic meaning.
Identification of LC membership is flexible.
LCA can attribute a vector of parameter estimates to LC X=1 for one solution, and to LC X=2 for the next.
[Diagram: the two-class parameter estimates for the Stouffer and Toby data, repeated from the earlier slide]
How to resolve fracturing
Simulation studies
confirm that vectors of
parameter estimates are
individually coherent.
Consistent assignment of
vectors to the appropriate
latent classes should cure
fracturing.
What rule leads to
consistent assignment?
Rule: Reflect all
estimates in a vector
into the half-plane
most heavily populated
by conditional
probabilities of the
most informative
indicator.
In this example, D is the
most informative
indicator, so estimates for
every parameter are
reflected into indicator
D’s more heavily
populated (upper left) half
plane.
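A minimal sketch of one such consistent-assignment rule (a simplified label swap keyed to the most informative indicator D, offered as an illustration rather than the author's exact reflection routine; the data set and variable names are hypothetical):

data aligned;
  set solutions;   /* one row per solution: pa_x1-pd_x1, pa_x2-pd_x2, prev_x1, prev_x2 */
  array c1{5} pa_x1 pb_x1 pc_x1 pd_x1 prev_x1;
  array c2{5} pa_x2 pb_x2 pc_x2 pd_x2 prev_x2;
  /* if the class labels are flipped on indicator D, swap every
     class-1 estimate with its class-2 counterpart */
  if pd_x1 < pd_x2 then do i=1 to 5;
    _tmp  = c1{i};
    c1{i} = c2{i};
    c2{i} = _tmp;
  end;
  drop i _tmp;
run;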
Distribution of estimates after reflection
[Histograms: P(A=1|X=1) and P(A=1|X=2) after reflection]
With the fracturing problem solved, the multiple solutions approach is an attractive strategy for overcoming the EM algorithm's inability to produce standard errors.
[Table repeated from the earlier slide: estimates and SE from 94 solutions, each from a different initial assignment of response profiles]
Limitations of LCA
• Sample size must support detection
of weak latent structures, those with:
Rare latent class(es)
Uninformative indicators
Limitations of LCA
• Fit statistics primarily assess
conditional independence
and so don’t alert the analyst when
LCA is struggling to characterize
weak latent structure.
Limitations of LCA
• Violations of assumption of
conditional independence
• conditional (or residual) dependence
[Diagram: a latent class with indicators A, B, C, D]
Conditional dependence
• leads to poor estimation
Overestimation of informativeness of
both correlated indicators
Overestimation of prevalence of other LC
• leads to poor model fit
Analyst may respond by positing
additional latent classes, which
complicates interpretation.
Model's applicability is limited when modifications increasingly capitalize on the data's idiosyncrasies.
Assessing conditional dependence
Z scores compare observed log odds ratios for pairs
of indicators with those expected under conditional
independence
(Garrett & Zeger, 2000)
Pairs of                    Log odds
indicators          Expected      Observed        ASE          z
   a  b              0.2993        0.7270        0.3557      1.2024
   a  c              0.3630        0.7953        0.3557      1.2154
   a  d              0.7847        0.5312        0.4796     -0.5285
   b  c              0.6534        0.5586        0.2760     -0.3435
   b  d              1.2871        1.3876        0.3430      0.2929
   c  d              1.6395        1.6994        0.3685      0.1626
Large z scores arouse suspicion that pairs of
indicators are conditionally dependent.
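A minimal sketch for obtaining one observed log odds ratio (the data set name is hypothetical; PROC FREQ's RELRISK option reports the odds ratio for a 2x2 table, and its natural log is the observed quantity compared above):

proc freq data=ethics;
  weight count;            /* profile-level data with a count variable */
  tables a*b / relrisk;    /* odds ratio for indicators A and B        */
run;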
Accounting for conditional dependence
• Pairwise conditional dependence can
be incorporated into a revised model.
• Patterns of dependence and
independence are flexibly expressed in
both LCA parameterizations
Probabilistic (Goodman):
  $\pi_{ijklt}^{ABCDX} = \pi_{i|t}^{A|X}\,\pi_{j|t}^{B|X}\,\pi_{k|t}^{C|X}\,\pi_{l|t}^{D|X}\,\pi_t^{X}$
Loglinear (Haberman):
  $\ln \pi_{ijklt}^{ABCDX} = \lambda + \lambda_i^{A} + \lambda_j^{B} + \lambda_k^{C} + \lambda_l^{D} + \lambda_t^{X} + \lambda_{it}^{AX} + \lambda_{jt}^{BX} + \lambda_{kt}^{CX} + \lambda_{lt}^{DX}$
Accounting for conditional dependence
• Take advantage of CATMOD’s loglinear
modeling capabilities in the M step.
• The standard M step that assumes
conditional independence:
ods output estimates=mu;            /* save the loglinear parameter estimates to data set MU */
proc catmod order=data;
  weight count;                     /* cell counts of the complete contingency table */
  model a*b*c*d*x=_response_ / wls addcell=.1;   /* addcell=.1 guards against zero cells */
  loglin a b c d x a*x b*x c*x d*x; /* main effects and indicator-by-class terms only */
run;
quit;
ods output close;
Accounting for conditional dependence
• Modifying the CATMOD M step to model
conditional dependence between
indicators B and C:
ods output estimates=mu;            /* save the loglinear parameter estimates to data set MU */
proc catmod order=data;
  weight count;
  model a*b*c*d*x=_response_ / wls addcell=.1;
  loglin a b c d x a*x b*x c*x d*x b*c b*c*x;   /* b*c and b*c*x terms model the B-C dependence */
run;
quit;
ods output close;
Concluding remarks
•LCA is a potentially valuable tool in
clinical epidemiology for clarifying
ill-defined diagnostic and prognostic
classifications.
•Recent work brings LCA into SAS’
analytic framework.
• In any approach to LCA, sensitivity to
initial estimates requires caution
• Employ repeated solutions from different initial
estimates
• E-M loop should iterate between 3 and 40 times
• Probe assumption of conditional
independence
• At least four indicators needed
• Expanded model can account for dependence
Acknowledgements
Barbara R. Neas, Ph.D.
Willis Owen, Ph.D.
Dept. of Biostatistics and Epidemiology
University of Oklahoma Health Sciences Center
Thank you!