Transcript Document

Cross-sectional LCA
Patterns of first response to cigarettes
First smoking experience
• Have you ever tried a cigarette (including roll-ups), even a puff?
• How old were you when you first tried a cigarette?
• When you FIRST ever tried a cigarette can you remember how it
made you feel? (tick as many as you want)
– It made me cough
– I felt ill
– It tasted awful
– I liked it
– It made me feel dizzy
Aim
• To categorise the subjects based on their pattern of
responses
• To assess the relationship between first-response
and current smoking behaviour
• To try not to think too much about the possibility
of recall bias
Step 1
Look at your data!!!
Examine your data structure
• LCA converts a large number of response patterns
into a small number of ‘homogeneous’ groups
• If the responses in your data are fairly mutually exclusive then there's no point doing LCA
• Don't just dive in – run the simple checks sketched below first
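A minimal Stata sketch of these initial checks (assuming the five 0/1 items are named cough, ill, taste, liked and dizzy, as in the Mplus syntax later; the polychoric command is the user-written package, not official Stata):

    * how many items did each respondent endorse?
    egen numresp = rowtotal(cough ill taste liked dizzy)
    tab numresp

    * frequency of each item
    tab1 cough ill taste liked dizzy

    * frequency of each response pattern (contract collapses the data, so preserve first)
    preserve
    contract cough ill taste liked dizzy, freq(num)
    gsort -num
    list
    restore

    * polychoric correlation matrix (user-written: ssc install polychoric)
    polychoric cough ill taste liked dizzy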
How many items endorsed?
    numresp |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         69        2.75        2.75
          1 |      1,597       63.70       66.45
          2 |        569       22.70       89.15
          3 |        202        8.06       97.21
          4 |         68        2.71       99.92
          5 |          2        0.08      100.00
------------+-----------------------------------
      Total |      2,507      100.00
Frequency of each item (n ~ 2500)
[Bar chart: number of positive responses (0 to ~1,400) for each item – cough, ill, taste, liked, dizzy]
Examine pattern frequency
     +----------------------------------------------+
     | cough   ill   taste   liked   dizzy     num  |
     |----------------------------------------------|
  1. |     0     0       1       0       0     468  |
  2. |     0     0       0       1       0     452  |
  3. |     1     0       0       0       0     449  |
  4. |     1     0       1       0       0     279  |
  5. |     0     0       0       0       1     194  |
     |----------------------------------------------|
  6. |     1     1       1       0       0      94  |
  7. |     1     0       0       1       0      87  |
  8. |     1     0       0       0       1      76  |
  9. |     0     0       0       0       0      69  |
 10. |     1     1       1       0       1      59  |
     |----------------------------------------------|
 11. |     0     0       0       1       1      56  |
 12. |     1     0       1       0       1      47  |
 13. |     1     0       0       1       1      35  |
 14. |     0     1       0       0       0      34  |
 15. |     0     0       1       0       1      27  |
     |----------------------------------------------|
 16. |     0     1       1       0       0      17  |
 17. |     0     0       1       1       0      13  |
 18. |     1     1       0       0       1       9  |
 19. |     1     1       0       0       0       8  |
 20. |     0     1       1       0       1       7  |
     |----------------------------------------------|
 21. |     1     0       1       1       1       7  |
 22. |     1     0       1       1       0       6  |
 23. |     0     1       0       0       1       5  |
 24. |     1     1       1       1       1       2  |
 25. |     0     1       0       1       1       2  |
     |----------------------------------------------|
 26. |     0     1       0       1       0       1  |
 27. |     1     1       1       1       0       1  |
 28. |     1     1       0       1       1       1  |
 29. |     0     0       1       1       1       1  |
 30. |     1     1       0       1       0       1  |
     +----------------------------------------------+
Examine correlation structure
Polychoric correlation matrix
            cough      ill    taste    liked    dizzy
   cough        1
     ill    0.371        1
   taste    0.049    0.468        1
   liked   -0.510   -0.542   -0.786        1
   dizzy   -0.030    0.246   -0.241   -0.158        1
Step 2
Now you can fit a latent class model
Latent Class models
• Work with observations at the pattern level rather
than the individual (person) level
     +----------------------------------------------+
     | cough   ill   taste   liked   dizzy     num  |
     |----------------------------------------------|
  1. |     0     0       1       0       0     468  |
  2. |     0     0       0       1       0     452  |
  3. |     1     0       0       0       0     449  |
  4. |     1     0       1       0       0     279  |
  5. |     0     0       0       0       1     194  |
     |----------------------------------------------|
Latent Class models
• For a given number of latent classes, applying Bayes' rule plus an assumption of conditional independence lets you calculate the probability that each pattern falls into each class
• Derive the likelihood of the observed data under each model (i.e. assuming different numbers of classes) and use this, plus other fit statistics, to determine the optimal model, i.e. the optimal number of classes
Latent Class models
• Bayes’ rule:
• Conditional independence:
P( pattern = ’01’ | class = i)
= P(pat(1) = ‘0’ | class = i)*P(pat(2) = ‘1’ | class = i)
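Putting the two together (a standard-notation sketch, not copied from the slides), the posterior probability that an observed pattern y = (y1, ..., y5) belongs to class k is

P(\text{class}=k \mid y) \;=\; \frac{P(\text{class}=k)\,\prod_{j=1}^{5} P(y_j \mid \text{class}=k)}{\sum_{m=1}^{K} P(\text{class}=m)\,\prod_{j=1}^{5} P(y_j \mid \text{class}=m)}

where the products are justified by the conditional independence assumption.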
How many classes can I have?
~ degrees of freedom
• 32 possible patterns
• Each additional class requires
– 5 df to estimate the 5 prevalence of each item that class
(i.e. 5 thresholds)
– 1 df for an additional cut of the latent variable defining
the class distribution
• Hence a 5-class model uses up 5*5 + 4 = 29
degrees of freedom leaving 3df to test the model
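In general (a sketch that is consistent with the parameter counts in the fit tables later), a K-class model for J binary items estimates

\text{free parameters} \;=\; \underbrace{K \times J}_{\text{thresholds}} \;+\; \underbrace{(K-1)}_{\text{class cuts}}

which for J = 5 gives 5, 11, 17, 23 and 29 parameters for 1 to 5 classes.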
Standard thresholds
• Mplus thinks of binary variables as being a
dichotomised continuous latent variable
• The point at which a continuous N(0,1) variable
must be cut to create a binary variable is called a
threshold
• A binary variable with 50% cases corresponds to a
threshold of zero
• A binary variable with 2.5% cases corresponds to
a threshold of 1.96
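A quick check of that relationship in Stata (illustrative, not from the slides): the threshold is the point of a standard normal that leaves the stated proportion of cases above it.

    display invnormal(1 - 0.50)     // = 0    : 50% of cases above the threshold
    display invnormal(1 - 0.025)    // = 1.96 : 2.5% of cases above the threshold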
Standard thresholds
Figure from Uebersax webpage
Data:
  File is "..\smoking_experience.dta.dat";
  listwise is on;
Variable:
  Names are sex cough ill taste liked dizzy numresp less_12 less_13;
  categorical are cough ill taste liked dizzy;
  usevariables are cough ill taste liked dizzy;
  Missing are all (-9999);
  classes = c(3);
Analysis:
  proc = 2 (starts);
  type = mixture;
  starts = 1000 500;
  stiterations = 20;
Output:
  tech10;
What you’re actually doing
model:
  %OVERALL%
  [c#1 c#2];       ! defines the latent class variable
  %c#1%
  [cough$1];       ! defines the within-class thresholds,
  [ill$1];         ! i.e. the prevalence of endorsement of each item
  [taste$1];
  [liked$1];
  [dizzy$1];

+ five more threshold parameters for each of %c#2% and %c#3%
SUMMARY OF CATEGORICAL DATA PROPORTIONS
COUGH
  Category 1    0.537
  Category 2    0.463
ILL
  Category 1    0.904
  Category 2    0.096
TASTE
  Category 1    0.590
  Category 2    0.410
LIKED
  Category 1    0.735
  Category 2    0.265
DIZZY
  Category 1    0.789
  Category 2    0.211
RANDOM STARTS RESULTS RANKED FROM THE BEST TO THE WORST
LOGLIKELIHOOD VALUES
Final stage loglikelihood values at local maxima, seeds, and initial stage start numbers:

    -6343.937   685561   9973
    -6343.937   172907   9395
    -6343.937   497824   9464
    -6343.937   770684   7725
    -6343.937   584663   5193
    -6343.937   872295   2899
    -6343.937   116150   3570
    -6343.937   271339   4768
    -6343.937   472383   9650
    -6343.937   707126   3683
    Etc.
How many random starts?
• Depends on
– Sample size
– Complexity of model
• Number of manifest variables
• Number of classes
• Aim to find the best (highest) loglikelihood consistently replicated within each run
Success
Loglikelihood values at local maxima, seeds, and initial stage start numbers:

    -10148.718   987174   1689
    -10148.718   777300   2522
    -10148.718   406118   3827
    -10148.718    51296   3485
    -10148.718   997836   1208
    -10148.718   119680   4434
    -10148.718   338892   1432
    -10148.718   765744   4617
    -10148.718   636396    168
    -10148.718   189568   3651
    -10148.718   469158   1145
    -10148.718    90078   4008
    -10148.718   373592   4396
    -10148.718    73484   4058
    -10148.718   154192   3972
    -10148.718   203018   3813
    -10148.718   785278   1603
    -10148.718   235356   2878
    -10148.718   681680   3557
    -10148.718    92764   2064
Not there yet
Loglikelihood values at local maxima, seeds, and initial stage start numbers:

    -10153.627    23688   4596
    -10153.678   150818   1050
    -10154.388   584226   4481
    -10155.122   735928    916
    -10155.373   309852   2802
    -10155.437   925994   1386
    -10155.482   370560   3292
    -10155.482   662718    460
    -10155.630   320864   2078
    -10155.833   873488   2965
    -10156.017   212934    568
    -10156.231    98352   3636
    -10156.339    12814   4104
    -10156.497   557806   4321
    -10156.644   134830    780
    -10156.741    80226   3041
    -10156.793   276392   2927
    -10156.819   304762   4712
    -10156.950   468300   4176
    -10157.011    83306   2432
Scary “warnings”
IN THE OPTIMIZATION, ONE OR MORE LOGIT THRESHOLDS APPROACHED AND WERE SET
AT THE EXTREME VALUES. EXTREME VALUES ARE -15.000 AND 15.000.
THE FOLLOWING THRESHOLDS WERE SET AT THESE VALUES:
*  THRESHOLD 1 OF CLASS INDICATOR TASTE FOR CLASS 3 AT ITERATION 11
*  THRESHOLD 1 OF CLASS INDICATOR DIZZY FOR CLASS 3 AT ITERATION 12
*  THRESHOLD 1 OF CLASS INDICATOR ILL   FOR CLASS 3 AT ITERATION 16
*  THRESHOLD 1 OF CLASS INDICATOR LIKED FOR CLASS 1 AT ITERATION 34
*  THRESHOLD 1 OF CLASS INDICATOR TASTE FOR CLASS 1 AT ITERATION 93
WARNING: WHEN ESTIMATING A MODEL WITH MORE THAN TWO CLASSES, IT MAY BE
NECESSARY TO INCREASE THE NUMBER OF RANDOM STARTS USING THE STARTS OPTION
TO AVOID LOCAL MAXIMA.
THE MODEL ESTIMATION TERMINATED NORMALLY
TESTS OF MODEL FIT
Loglikelihood
          H0 Value                            -6343.937
          H0 Scaling Correction Factor            1.006
            for MLR

Information Criteria
          Number of Free Parameters                  17
          Akaike (AIC)                        12721.873
          Bayesian (BIC)                      12820.930
          Sample-Size Adjusted BIC            12766.916
            (n* = (n + 2) / 24)

Chi-Square Test of Model Fit for the Binary and Ordered Categorical (Ordinal) Outcomes
          Pearson Chi-Square
            Value                               623.040
            Degrees of Freedom                       14
            P-Value                              0.0000
          Likelihood Ratio Chi-Square
            Value                               563.869
            Degrees of Freedom                       14
            P-Value                              0.0000
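These information criteria follow from the loglikelihood by the standard formulas (a check, not part of the Mplus output): with log L = -6343.937, p = 17 free parameters and n = 2507,

\begin{aligned}
\mathrm{AIC}  &= -2\log L + 2p = 12687.87 + 34 \approx 12721.9\\
\mathrm{BIC}  &= -2\log L + p\ln n = 12687.87 + 17\ln(2507) \approx 12820.9\\
\mathrm{aBIC} &= -2\log L + p\ln n^{*}, \qquad n^{*} = \tfrac{n+2}{24} \approx 104.5, \qquad \mathrm{aBIC} \approx 12766.9
\end{aligned}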
FINAL CLASS COUNTS AND PROPORTIONS FOR THE LATENT
CLASSES BASED ON THE ESTIMATED MODEL
    Latent Classes
        1        600.41143    0.23949
        2       1517.83320    0.60544
        3        388.75538    0.15507

CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY LATENT CLASS MEMBERSHIP

    Latent Classes
        1          630        0.25130
        2         1396        0.55684
        3          481        0.19186
Entropy (fuzziness)
CLASSIFICATION QUALITY

    Entropy    0.832

Average Latent Class Probabilities for Most Likely Latent Class Membership (Row) by Latent Class (Column)

             1        2        3
    1      0.952    0.048    0.000
    2      0.000    0.979    0.021
    3      0.000    0.252    0.748
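For reference, the entropy statistic reported here is usually the relative entropy (a standard definition, stated as background rather than taken from the slides): with posterior probabilities p_{ik} for person i and class k,

E_K \;=\; 1 \;-\; \frac{\sum_{i=1}^{n}\sum_{k=1}^{K} \bigl(-\,p_{ik}\ln p_{ik}\bigr)}{n \ln K}

Values near 1 indicate crisp classification; lower values indicate fuzzier assignment.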
Model results
                    Estimate       S.E.  Est./S.E.  Two-Tailed
                                                      P-Value
Latent Class 1

 Thresholds
    COUGH$1            1.604      0.133     12.103      0.000
    ILL$1              7.371      4.945      1.490      0.136
    TASTE$1           15.000      0.000    999.000    999.000
    LIKED$1          -15.000      0.000    999.000    999.000
    DIZZY$1            1.890      0.139     13.604      0.000
Categorical Latent Variables

                    Estimate       S.E.  Est./S.E.  Two-Tailed
                                                      P-Value
 Means
    C#1                0.435      0.124      3.500      0.000
    C#2                1.362      0.135     10.058      0.000
RESULTS IN PROBABILITY SCALE
Latent Class 1
                    Estimate       S.E.  Est./S.E.  Two-Tailed
                                                      P-Value
 COUGH
    Category 1         0.833      0.018     45.072      0.000
    Category 2         0.167      0.018      9.059      0.000
 ILL
    Category 1         0.999      0.003    321.448      0.000
    Category 2         0.001      0.003      0.202      0.840
 TASTE
    Category 1         1.000      0.000      0.000      1.000
    Category 2         0.000      0.000      0.000      1.000
 LIKED
    Category 1         0.000      0.000      0.000      1.000
    Category 2         1.000      0.000      0.000      1.000
 DIZZY
    Category 1         0.869      0.016     54.848      0.000
    Category 2         0.131      0.016      8.284      0.000
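A quick sanity check (not on the slides): since these are logit thresholds (see the ±15 "logit thresholds" warning earlier), the probability-scale results are simply the inverse logit of the estimated thresholds.

    display invlogit(1.604)    // = 0.833, COUGH category 1 in class 1
    display invlogit(7.371)    // = 0.999, ILL category 1
    display invlogit(15)       // = 1.000, TASTE category 1 (threshold fixed at the +15 bound)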
Class 1 from 3-class model
[Bar chart: class 1 item endorsement probabilities (0 to 1) for COUGH, ILL, TASTE, LIKED, DIZZY]
Conditional independence
• The latent class variable accounts for the
covariance structure in your dataset
• Conditional on C, any pair of manifest variables
should be uncorrelated
• Harder to achieve for a cross-sectional LCA
• With a longitudinal LCA there tends to be a more
ordered pattern of correlations based on proximity
in time
Tech10 – response patterns
MODEL FIT INFORMATION FOR THE LATENT CLASS INDICATOR MODEL PART

RESPONSE PATTERNS

     No.  Pattern    No.  Pattern    No.  Pattern    No.  Pattern
      1   10000       2   00100       3   00010       4   11100
      5   11101       6   00001       7   10101       8   10010
      9   10100      10   00101      11   10001      12   00000
     13   00011      14   01101      15   10011      16   00110
     17   11000      18   10111      19   11011      20   01100
     21   10110      22   01000      23   01001      24   11111
     25   01010      26   11001      27   01011      28   11010
     29   00111      30   11110
Tech10 – Bivariate model fit
• 5 manifest variables → number of pairs = 5! / (2! × 3!) = (5 × 4) / 2 = 10

Overall Bivariate Pearson Chi-Square             215.353
Overall Bivariate Log-Likelihood Chi-Square      214.695

Compare with the 5% critical value of χ² on 10 df = 18.307
Tech10 – Bivariate model fit
Not bad:

                                      Estimated Probabilities
                                                            Standardized
     Variable     Variable              H1          H0      Residual (z-score)
     COUGH        ILL
       Category 1   Category 1        0.511       0.506           0.457
       Category 1   Category 2        0.026       0.031          -1.321
       Category 2   Category 1        0.393       0.398          -0.467
       Category 2   Category 2        0.070       0.065           0.925

       Bivariate Pearson Chi-Square                                2.726
       Bivariate Log-Likelihood Chi-Square                         2.798
Tech10 – Bivariate model fit
Terrible:

                                      Estimated Probabilities
                                                            Standardized
     Variable     Variable              H1          H0      Residual (z-score)
     ILL          TASTE
       Category 1   Category 1        0.566       0.534           3.149
       Category 1   Category 2        0.338       0.370          -3.255
       Category 2   Category 1        0.024       0.056          -6.850
       Category 2   Category 2        0.072       0.040           7.977

       Bivariate Pearson Chi-Square                              116.657
       Bivariate Log-Likelihood Chi-Square                       117.162
Conditional Independence violated
Need more classes
Obtain the ‘optimal’ model
Assess the following for models with increasing classes
• aBIC
• Entropy
• BLRT (Bootstrap LRT)
• Conditional Independence (Tech10)
• Ease of interpretation
• Consistency with previous work / theory
Model fit stats
                     1 class    2 class    3 class    4 class    5 class
Estimated params           5         11         17         23         29
H0 Likelihood        -6962.1    -6458.7    -6343.9    -6200.1    -6100.8
aBIC                 13947.4    12968.5    12766.9    12507.1    12336.5
Entropy                    -      0.944      0.832      0.894      0.844
Tech 10                625.2      228.1      214.7      135.9       17.6
BLRT statistic             -     1006.8      229.5      287.8      198.4
BLRT p-value               -   < 0.0001   < 0.0001   < 0.0001   < 0.0001
5-class model
• aBIC values are still decreasing
• Tech 10 is still quite high – residual correlations between ill and both liked and dizzy
• BLRT rejects the 4-class model in favour of 5 classes
• Not enough df to fit a 6-class model, so we cannot use the BLRT to assess the fit of the 5-class model
• A 6th class seems unlikely to be needed, as the BLRT statistics are decreasing slowly
Cross-sectional LCA
Patterns of first response to cigarettes
Attempt 2
What to do?
• We need more degrees of freedom
• There were only 5 questions on response to smoking
• Add something else:
– How old were you when you first tried a cigarette?
– Split into pre-teen / teen
• 6 binary variables means 2^6 = 64 possible response patterns, i.e. 64 d.f. to play with
Model fit stats – attempt 2
                     3 class    4 class    5 class    6 class    7 class
Estimated params          20         27         34         41         48
H0 Likelihood        -7866.3    -7720.2    -7616.0    -7582.4    -7576.2
aBIC                 15825.6    15565.7    15389.9    15355.1    15375.2
Entropy                0.823      0.893      0.812      0.876      0.850
Tech 10                228.9      144.6       16.8        1.2       0.29
BLRT statistic         123.3      146.1      104.2       67.3       12.4
BLRT p-value        < 0.0001   < 0.0001   < 0.0001   < 0.0001     0.2100
6-class model results
CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASSES BASED ON THE ESTIMATED MODEL

    Latent classes
        1       53.23894     2.1%
        2      541.96140    21.7%
        3      396.04196    15.9%
        4      454.89294    18.2%
        5      750.87470    30.1%
        6      295.99007    11.9%

CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY LATENT CLASS MEMBERSHIP

    Latent classes
        1         34     1.4%
        2        540    21.7%
        3        403    16.2%
        4        447    17.9%
        5        840    33.7%
        6        229     9.2%
Examine entropy in more detail
• Model-level entropy = 0.876
• Class-level entropy (average class probabilities for most likely class membership, row, by latent class, column):

             1        2        3        4        5        6
    1      0.953    0.000    0.000    0.000    0.026    0.020
    2      0.000    0.997    0.000    0.000    0.002    0.001
    3      0.000    0.000    0.958    0.000    0.017    0.025
    4      0.000    0.000    0.000    0.949    0.041    0.011
    5      0.025    0.005    0.000    0.036    0.851    0.083
    6      0.000    0.000    0.043    0.003    0.036    0.918
Pattern level entropy
• Save out the model-based probabilities
• Open in another stats package
• Collapse over response patterns
Save out the model-based probabilities
savedata:
file is "6-class-results.dat";
save cprobabilities;
Varnames shown at end of output
SAVEDATA INFORMATION
Order and format of variables

    COUGH      F10.3
    ILL        F10.3
    TASTE      F10.3
    LIKED      F10.3
    DIZZY      F10.3
    LESS_13    F10.3
    ALN        F10.3
    QLET       F10.3
    SEX        F10.3
    CPROB1     F10.3
    CPROB2     F10.3
    CPROB3     F10.3
    CPROB4     F10.3
    CPROB5     F10.3
    CPROB6     F10.3
    C          F10.3
Open / process in Stata
Remove excess spaces from data file, then:
insheet using 6-class-results.dat, delim(" ")

local i = 1
local varnames "COUGH ILL TASTE LIKED DIZZY LESS_13 ALN QLET SEX CPROB1 CPROB2 CPROB3 CPROB4 CPROB5 CPROB6 C"
foreach x of local varnames {
    rename v`i' `x'
    local i = `i' + 1
}

gen num = 1
collapse (mean) CPROB* C (count) num, by(COUGH ILL TASTE LIKED DIZZY LESS_13)
Check the assignment probabilities for each class
cough  ill  taste  liked  dizzy  <13    P_c1    P_c2    P_c3    P_c4    P_c5    P_c6   Mod class    n
    1    1      1      0      0    0       0       0       0       0   0.052   0.948           6   64
    1    1      1      0      1    0       0       0   0.003       0   0.001   0.996           6   34
    1    1      1      0      0    1       0       0       0       0   0.027   0.973           6   30
    1    0      1      0      1    0       0       0   0.135       0   0.062   0.803           6   29
    1    1      1      0      1    1       0       0   0.003       0   0.001   0.996           6   25
    1    0      1      0      1    1       0       0   0.154       0   0.032   0.815           6   18
    1    1      0      0      0    0       0       0       0   0.071   0.054   0.874           6    6
    0    1      1      0      1    1       0       0   0.073       0   0.012   0.915           6    4
    1    1      0      0      1    0       0       0   0.303       0   0.001   0.696           6    4
    1    1      0      0      1    1       0       0   0.329       0       0   0.671           6    4
    0    1      1      0      0    1       0       0       0       0   0.411   0.589           6    3
    0    1      1      0      1    0       0       0   0.065       0   0.024   0.912           6    3
    1    1      0      0      0    1       0       0       0   0.055   0.029   0.917           6    2
    1    1      1      1      0    1       0   0.001       0       0   0.023   0.977           6    1
    1    1      1      1      1    0       0       0   0.039       0   0.001   0.960           6    1
    1    1      1      1      1    1       0       0   0.044       0       0   0.955           6    1
[Bar charts: item endorsement probabilities (0 to 1) for COUGH, ILL, TASTE, LIKED, DIZZY and LESS_13 within each class]

Bad taste (30.1%)
Positive experience (21.7%)
Coughed (18.2%)
Dizziness (15.9%)
V negative experience (11.9%)
Felt ill (2.1%)
Well that was a complete waste of time!
• You might think that those resulting classes could have been derived just by looking at the response patterns and making some arbitrary decisions, e.g.
– Group all of those who had >1 negative experience
– Keep separate each group who had 1 experience
• You would have ended up with a bunch of weird patterns with no clue what to do with them
• Strange patterns are likely to be measurement error?
• LCA incorporates ALL patterns and deals with uncertainty
through the posterior probabilities
Conclusions / warning
• Like EFA, LCA is an exploratory tool with the aim
of summarising the variability in the dataset in a
simple/interpretable way
• These results do not prove that there are 6 groups
of young people in real life.
• LCA will find groupings in the data even if there
is no reason to think such groups might exist. It’s
just mathematics and it knows no better
Remember, we are dealing with probabilities
                 Model-based            "Modal assignment"
Ill               53.24     2.1%             34     1.4%
Positive         541.96    21.7%            540    21.7%
Dizzy            396.04    15.9%            403    16.2%
Coughed          454.89    18.2%            447    17.9%
Bad taste        750.87    30.1%            840    33.7%
V negative       295.99    11.9%            229     9.2%
• Working with modal assignment is easy
– chuck each pattern into its most likely class and pretend everything is OK (see the sketch below)
– Equivalent to doing a single imputation for missing data – shudder!
• Unless entropy is V high, stick with the probabilities
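A hypothetical Stata sketch of what modal assignment amounts to, using the CPROB variables saved earlier (illustrative only; the saved file already contains the modal class as C):

    * pick the class with the highest posterior probability for each respondent
    egen pmax = rowmax(CPROB1-CPROB6)
    gen modclass = .
    forvalues k = 1/6 {
        * ties, if any, go to the lowest-numbered class
        replace modclass = `k' if CPROB`k' == pmax & modclass == .
    }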
Covariates and outcomes
Merging the classes with other data
• In the “olden days”, you could pass your ID variable
through Mplus so when you saved your class probabilities
you could merge this with other data.
• Now you can pass other data through Mplus as well –
hurrah!
Variable:
<snip>
auxiliary are ID sex;
Reshaping the dataset
• To account for the uncertainty in our class variable
we will need to weight by the posterior
probabilities obtained from Mplus
• Weighted model requires a reshaping of the
dataset so that each respondent has n-rows (for an
n-class model) rather than just 1
Pre-shaped – first 20 kids
     +---------------------------------------------------------------------------------------------+
     |    ID    sex   dev_18   dev_42   pclass1   pclass2   pclass3   pclass4   pclass5   modclass |
     |---------------------------------------------------------------------------------------------|
     | 30004   male        3        .      .001         0      .803         0      .197          3 |
     | 30008   male        2        1      .908         0         0      .007      .085          1 |
     | 30010   male        2        2      .053      .001      .052         0      .894          5 |
     | 30023   male        1        3      .115         0      .596      .001      .288          3 |
     | 30031   male        3        4         0         0      .983         0      .016          3 |
     |---------------------------------------------------------------------------------------------|
     | 30033   male        4        4      .392         0      .397         0      .211          3 |
     | 30042   male        1        3         0         0      .983         0      .016          3 |
     | 30050   male        3        2         0         0      .983         0      .016          3 |
     | 30051   male        2        2         0         0         0         1         0          4 |
     | 30057   male        1        3      .135         0      .002         0      .864          5 |
     |---------------------------------------------------------------------------------------------|
     | 30058   male        1        4         0         0      .958         0      .041          3 |
     | 30064   male        2        4         0         0      .983         0      .016          3 |
     | 30068   male        4        3      .001         0      .803         0      .197          3 |
     | 30070   male        3        4         0         0      .983         0      .016          3 |
     | 30072   male        1        1         0         0      .983         0      .016          3 |
     |---------------------------------------------------------------------------------------------|
     | 30075   male        3        3         0         0      .982         0      .018          3 |
     | 30088   male        3        4       .03      .002      .889      .003      .076          3 |
     | 30095   male        3        .         0         0      .983         0      .016          3 |
     | 30098   male        3        .      .068      .158      .173      .018      .583          5 |
     | 30104   male        4        1      .008         0      .775         0      .217          3 |
     +---------------------------------------------------------------------------------------------+
Columns: sex, dev_18 and dev_42 are covariates; pclass1–pclass5 are the posterior probabilities; modclass is the modal class.
The reshaping
. reshape long pclass, i(id) j(class)
(note: j = 1 2 3 4 5)

Data                                      wide   ->   long
-----------------------------------------------------------
Number of obs.                            5584   ->   27920
Number of variables                         66   ->   63
j variable (5 values)                            ->   class
xij variables:
                 pclass1 pclass2 ... pclass5     ->   pclass
-----------------------------------------------------------
Re-shaped – first 3 kids
     +--------------------------------------------------+
     |    id    sex   dev_18   dev_42   pclass   class  |
     |--------------------------------------------------|
  1. | 30004   male        3        .     .001       1  |
  2. | 30004   male        3        .        0       2  |
  3. | 30004   male        3        .     .803       3  |
  4. | 30004   male        3        .        0       4  |
  5. | 30004   male        3        .     .197       5  |
     |--------------------------------------------------|
  6. | 30008   male        2        1     .908       1  |
  7. | 30008   male        2        1        0       2  |
  8. | 30008   male        2        1        0       3  |
  9. | 30008   male        2        1     .007       4  |
 10. | 30008   male        2        1     .085       5  |
     |--------------------------------------------------|
 11. | 30010   male        2        2     .053       1  |
 12. | 30010   male        2        2     .001       2  |
 13. | 30010   male        2        2     .052       3  |
 14. | 30010   male        2        2        0       4  |
 15. | 30010   male        2        2     .894       5  |
     +--------------------------------------------------+

Each block of 5 rows is one child (first, second, third kid); the covariates are constant within child and the pclass values sum to 1.
Similar with our data:
. list id SEX CPROB class C in 1/12

     +---------------------------------+
     |    id   SEX   CPROB   class   C |
     |---------------------------------|
  1. | 30012     2       0       1   4 |
  2. | 30012     2       0       2   4 |
  3. | 30012     2       0       3   4 |
  4. | 30012     2    .945       4   4 |
  5. | 30012     2    .045       5   4 |
  6. | 30012     2     .01       6   4 |
     |---------------------------------|
  7. | 30024     2       0       1   5 |
  8. | 30024     2       0       2   5 |
  9. | 30024     2       0       3   5 |
 10. | 30024     2       0       4   5 |
 11. | 30024     2    .991       5   5 |
 12. | 30024     2    .009       6   5 |
     +---------------------------------+

(First respondent: rows 1–6; second respondent: rows 7–12)
Simple crosstab
. tab class SEX , row nofreq

           |         SEX
     class |        1          2 |     Total
-----------+----------------------+----------
       Ill |    40.87      59.13 |    100.00
  Positive |    40.87      59.13 |    100.00
     Dizzy |    40.87      59.13 |    100.00
   Coughed |    40.87      59.13 |    100.00
 Bad taste |    40.87      59.13 |    100.00
V negative |    40.87      59.13 |    100.00
-----------+----------------------+----------
     Total |    40.87      59.13 |    100.00

• Oops! Unweighted, every respondent contributes one row to every class, so each row just reproduces the overall sex split.
Simple crosstab – take 2
. tab class SEX [iw = CPROB], row nofreq

           |        SEX
     class |    Male    Female |   Total
-----------+-------------------+--------
       Ill |   52.9%     47.1% |    100%
  Positive |   32.9%     67.1% |    100%
     Dizzy |   43.2%     56.8% |    100%
   Coughed |   40.8%     59.2% |    100%
 Bad taste |   45.2%     54.8% |    100%
V negative |   39.3%     60.7% |    100%
-----------+-------------------+--------
     Total |   40.9%     59.1% |    100%
Compare with modal class assignment
Weighted by posterior probabilities:

. tab class SEX [iw = CPROB], row nofreq

           |       SEX
     class |   Male   Female |
-----------+-----------------+
       Ill |  52.9%    47.1% |
  Positive |  32.9%    67.1% |
     Dizzy |  43.2%    56.8% |
   Coughed |  40.8%    59.2% |
 Bad taste |  45.2%    54.8% |
V negative |  39.3%    60.7% |
-----------+-----------------+
     Total |  40.9%    59.1% |

Modal class assignment:

. tab C SEX if (class==1), row nofreq

           |       SEX
         C |   Male   Female |
-----------+-----------------+
       Ill |  50.0%    50.0% |
  Positive |  33.0%    67.0% |
     Dizzy |  43.4%    56.6% |
   Coughed |  40.7%    59.3% |
 Bad taste |  45.4%    54.6% |
V negative |  37.6%    62.4% |
-----------+-----------------+
     Total |  40.9%    59.1% |
Multinomial logistic
. xi: mlogit class i.SEX [iw = CPROB], rrr

Multinomial logistic regression               Number of obs   =       2493
                                              LR chi2(5)      =      24.52
                                              Prob > chi2     =     0.0002
Log likelihood = -4053.3746                   Pseudo R2       =     0.0030

------------------------------------------------------------------------------
       class |      RRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Ill          |
     _ISEX_2 | .7322787   .2081189   -1.10   0.273     .4195259    1.278186
-------------+----------------------------------------------------------------
Positive     |
     _ISEX_2 | 1.677364   .1965463    4.41   0.000     1.333175    2.110413
-------------+----------------------------------------------------------------
Dizzy        |
     _ISEX_2 | 1.082775   .1355213    0.64   0.525     .8472297    1.383807
-------------+----------------------------------------------------------------
Coughed      |
     _ISEX_2 | 1.194885   .1437877    1.48   0.139     .9438344    1.512712
-------------+----------------------------------------------------------------
V negative   |
     _ISEX_2 | 1.274734   .1782148    1.74   0.083     .9692081    1.676572
------------------------------------------------------------------------------
(class==Bad taste is the base outcome)
Class predicts binary outcome
• Outcome = weekly smoker at age 15

. char class[omit] 5
. xi: logistic sm1100 i.class [iw = CPROB]

Logistic regression                           Number of obs   =       2493
                                              LR chi2(5)      =     229.03
                                              Prob > chi2     =     0.0000
Log likelihood = -1168.697                    Pseudo R2       =     0.0892

------------------------------------------------------------------------------
      sm1100 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         Ill |   2.132652   .9125838    1.77   0.077     .9218961    4.933531
    Positive |   7.190203   1.231216   11.52   0.000     5.140265    10.05766
       Dizzy |   7.899915   1.413907   11.55   0.000     5.562583    11.21937
     Coughed |   3.686492   .6831946    7.04   0.000     2.563689    5.301041
  V negative |   2.243034    .497619    3.64   0.000     1.452099     3.46478
------------------------------------------------------------------------------
Compare with modal class
Posterior probabilities:

------------------------------------------------------------------------------
      sm1100 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         Ill |   2.132652   .9125838    1.77   0.077     .9218961    4.933531
    Positive |   7.190203   1.231216   11.52   0.000     5.140265    10.05766
       Dizzy |   7.899915   1.413907   11.55   0.000     5.562583    11.21937
     Coughed |   3.686492   .6831946    7.04   0.000     2.563689    5.301041
  V negative |   2.243034    .497619    3.64   0.000     1.452099     3.46478
------------------------------------------------------------------------------

Modal assignment:

------------------------------------------------------------------------------
      sm1100 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         Ill |   2.560182   1.291868    1.86   0.062     .9522577     6.88315
    Positive |   7.802047   1.313428   12.20   0.000     5.609367    10.85184
       Dizzy |     8.3454   1.467249   12.07   0.000     5.912796    11.77881
     Coughed |   4.224301   .7686958    7.92   0.000     2.957071    6.034592
  V negative |   2.861537   .6548723    4.59   0.000     1.827254    4.481255
------------------------------------------------------------------------------
Conclusions
• Young people at 15yrs can report a variety of
responses to their first cigarette
• Certain responses are associated with current
regular smoking behaviour
• 15 year-old girls are more likely to retrospectively
report a positive experience
• Recall bias is likely to play a part in these
associations
Conclusions
• LCA is an exploratory tool which can be used to simplify a
set of binary responses
• Extension to ordinal responses is straightforward
• The use of ordinal data is an alternative way to boost degrees of freedom
• Resulting probabilities can be used to model the latent class variable as a risk factor or outcome
• A modal class variable should be used with caution