Analysis of Categorical Data
Nick Jackson
University of Southern California
Department of Psychology
10/11/2013
Overview
Data Types
Contingency Tables
Logit Models
◦ Binomial
◦ Ordinal
◦ Nominal
Things not covered (but still fit into the topic)
Matched pairs/repeated measures
◦ McNemar’s Chi-Square
Reliability
◦ Cohen’s Kappa
◦ ROC
Poisson (Count) models
Categorical SEM
◦ Tetrachoric Correlation
Bernoulli Trials
Data Types (Levels of Measurement)
Running from discrete/categorical/qualitative to continuous/quantitative:

Nominal/Multinomial:
◦ Properties: values arbitrary (no magnitude); no direction (no ordering)
◦ Example: Race: 1=AA, 2=Ca, 3=As
◦ Measures: mode, relative frequency

Rank Order/Ordinal:
◦ Properties: values semi-arbitrary (no magnitude?); have direction (ordering)
◦ Example: Likert scales (pronounced "LICK-urt"): 1-5, Strongly Disagree to Strongly Agree
◦ Measures: mode, relative frequency, median, mean?

Binary/Dichotomous/Binomial:
◦ Properties: 2 levels; special case of Ordinal or Multinomial
◦ Examples: Gender (Multinomial), Disease (Y/N)
◦ Measures: mode, relative frequency, mean?
Code 1.1
Contingency Tables
Often called Two-way tables or Cross-Tabs
Have dimensions I x J
Can be used to test hypotheses of association between categorical variables

2 X 3 Table:
Gender    <40 Years   40-50 Years   >50 Years
Female    25          68            63
Male      240         223           201
Contingency Tables: Test of Independence
Chi-Square Test of Independence (χ²)
◦ Calculate χ²
◦ Determine DF: (I−1) * (J−1)
◦ Compare to χ² critical value for given DF.

2 X 3 Table:
Gender    <40 Years   40-50 Years   >50 Years   Total
Female    25          68            63          R1=156
Male      240         223           201         R2=664
Total     C1=265      C2=291        C3=264      N=820

χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ  (summing over the n cells),  where Eᵢⱼ = (Rᵢ * Cⱼ) / N

Where: Oᵢ = Observed Freq
Eᵢ = Expected Freq
n = number of cells in table
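The statistic above can be recomputed from the cell counts alone (the margins follow from the cells). A minimal plain-Python sketch, separate from the course's own "Code 1.2" example:

```python
# Observed 2x3 table from the slide: rows = gender, cols = age group
observed = [[25, 68, 63],
            [240, 223, 201]]

# Row totals (R_i), column totals (C_j), and grand total N
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
N = sum(row_totals)

# Expected count for each cell: E_ij = R_i * C_j / N
expected = [[r * c / N for c in col_totals] for r in row_totals]

# Pearson chi-square: sum over cells of (O - E)^2 / E
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (I-1)*(J-1) = 2
```

This reproduces the slide's χ²(df=2) = 23.39 up to rounding.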
Code 1.2
Contingency Tables: Test of Independence
Pearson Chi-Square Test of Independence (χ²)
◦ H0: No Association
◦ HA: Association… but where, and how?
χ²(df=2) = 23.39, p < 0.001
Not appropriate when an expected (Eᵢ) cell frequency is < 5
◦ Use Fisher's Exact Test instead

2 X 3 Table:
Gender    <40 Years   40-50 Years   >50 Years   Total
Female    25          68            63          R1=156
Male      240         223           201         R2=664
Total     C1=265      C2=291        C3=264      N=820
Contingency Tables
2x2:
                        Disorder (Outcome)
Risk Factor/Exposure    Yes    No     Total
Yes                     a      b      a+b
No                      c      d      c+d
Total                   a+c    b+d    a+b+c+d
Contingency Tables: Measures of Association

               Depression
Alcohol Use    Yes     No      Total
Yes            a=25    b=10    35
No             c=20    d=45    65
Total          45      55      100

Probability:
Depression given Alcohol Use: P(D|A) = a / (a + b) = 25/35 = 0.714
Depression given NO Alcohol Use: P(D|no A) = c / (c + d) = 20/65 = 0.308

Odds:
Depression given Alcohol Use: Odds(D|A) = P(D|A) / (1 − P(D|A)) = 0.714 / (1 − 0.714) = 2.5
Depression given NO Alcohol Use: Odds(D|no A) = P(D|no A) / (1 − P(D|no A)) = 0.308 / (1 − 0.308) = 0.44

Contrasting Probability:
Relative Risk (RR) = P(D|A) / P(D|no A) = 0.714 / 0.308 = 2.31
Individuals who used alcohol were 2.31 times more likely to have depression than those who did not use alcohol.

Contrasting Odds:
Odds Ratio (OR) = Odds(D|A) / Odds(D|no A) = 2.5 / 0.44 = 5.62
The odds of depression were 5.62 times greater in alcohol users compared to nonusers.
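All four measures fall out of the four cell counts; a quick plain-Python check of the arithmetic above:

```python
# 2x2 table from the slide: depression by alcohol use
a, b = 25, 10   # alcohol users: depressed / not depressed
c, d = 20, 45   # non-users:     depressed / not depressed

p_dep_alc = a / (a + b)          # P(D|A)    = 25/35 ≈ 0.714
p_dep_noalc = c / (c + d)        # P(D|no A) = 20/65 ≈ 0.308

odds_alc = p_dep_alc / (1 - p_dep_alc)        # algebraically a/b
odds_noalc = p_dep_noalc / (1 - p_dep_noalc)  # algebraically c/d

rr = p_dep_alc / p_dep_noalc     # relative risk
oratio = odds_alc / odds_noalc   # odds ratio; equals the cross-product (a*d)/(b*c)
```

Note the OR comes out as exactly (25*45)/(10*20) = 5.625; the slide's 5.62 reflects rounding 0.44 in the denominator.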
Why Odds Ratios?

               Depression
Alcohol Use    Yes     No        Total
Yes            a=25    b=10*i    25 + 10*i
No             c=20    d=45*i    20 + 45*i
Total          45      55*i      45 + 55*i
(scaling the "No" column by i = 1 to 45)

[Figure: OR and RR plotted against the overall probability of depression (0 to .5). The OR stays flat at 5.62 while the RR falls away from it as the overall probability of depression rises; the two converge as the outcome becomes rare.]
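The pattern in the figure can be reproduced numerically. A sketch assuming the table's construction (a=25, c=20 fixed; b and d scaled by i):

```python
# Rebuild the slide's table for i = 1..45: depression becomes rarer as i grows.
results = []
for i in range(1, 46):
    a, b, c, d = 25, 10 * i, 20, 45 * i
    p1 = a / (a + b)                   # P(D|A)
    p2 = c / (c + d)                   # P(D|no A)
    rr = p1 / p2                       # relative risk
    oratio = (a * d) / (b * c)         # odds ratio (cross-product)
    prev = (a + c) / (a + b + c + d)   # overall probability of depression
    results.append((prev, rr, oratio))
```

The OR is identical for every i, while the RR climbs from 2.32 toward the OR as the overall probability of depression shrinks: this invariance to outcome prevalence is why odds ratios are preferred.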
The Generalized Linear Model
General Linear Model (LM)
◦ Continuous Outcomes (DV)
◦ Linear Regression, t-test, Pearson correlation,
ANOVA, ANCOVA
Generalized Linear Model (GLM)
◦ John Nelder and Robert Wedderburn
◦ Maximum Likelihood Estimation
◦ Continuous, Categorical, and Count outcomes
◦ Distribution Family and Link Functions
◦ Error distributions that are not normal
Logistic Regression
“This is the most important model for
categorical response data” –Agresti
(Categorical Data Analysis, 2nd Ed.)
Binary Response
Predicting Probability (related to the Probit
model)
Assume (the usual):
◦ Independence
◦ NOT Homoscedasticity or Normal Errors
◦ Linearity (in the Log Odds)
◦ Also… adequate cell sizes
Logistic Regression
The Model
◦ Y = π(x) = e^(α + β₁x₁) / (1 + e^(α + β₁x₁))
  In terms of probability of success π(x)
◦ logit(π(x)) = ln[ π(x) / (1 − π(x)) ] = α + β₁x₁
  In terms of Logits (Log Odds)
The logit transform gives us a linear equation
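The two forms are inverses of each other; a minimal sketch (the α and β values are illustrative, not from the slides):

```python
import math

def pi_of_x(alpha, beta, x):
    """Probability of success: e^(a+bx) / (1 + e^(a+bx))."""
    eta = alpha + beta * x
    return math.exp(eta) / (1 + math.exp(eta))

def logit(p):
    """Log odds: ln(p / (1 - p))."""
    return math.log(p / (1 - p))

# The logit transform undoes the logistic curve, recovering the
# linear predictor alpha + beta*x.
alpha, beta = -2.0, 0.5
p = pi_of_x(alpha, beta, 3.0)
assert abs(logit(p) - (alpha + beta * 3.0)) < 1e-12
```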
Code 2.1
Logistic Regression: Example
The Output as Logits (intercept-only model)
◦ Logits: H0: β=0

Y=Depressed      Freq.   Percent
Not Depressed    672     81.95
Depressed        148     18.05

                 Coef    SE      Z       P        CI
α (_constant)    -1.51   0.091   -16.7   <0.001   -1.69, -1.34

Conversion to Probability:
e^β / (1 + e^β) = e^(−1.51) / (1 + e^(−1.51)) = 0.1805

What does H0: β=0 mean?
e^0 / (1 + e^0) = 0.5

Conversion to Odds:
e^β = e^(−1.51) = 0.22
Also = 0.1805 / 0.8195 = 0.22
Code 2.2
Logistic Regression: Example
The Output as ORs
◦ Odds Ratios: H0: β=1

Y=Depressed      Freq.   Percent
Not Depressed    672     81.95
Depressed        148     18.05

                 OR      SE      Z       P        CI
α (_constant)    0.220   0.020   -16.7   <0.001   0.184, 0.263

◦ Conversion to Probability: OR / (1 + OR) = 0.220 / (1 + 0.220) = 0.1805
◦ Conversion to Logit (log odds!): ln(OR) = logit; ln(0.220) = −1.51
Code 2.3
Logistic Regression: Example
Logistic Regression w/ Single Continuous Predictor:
◦ log[ π(depressed) / (1 − π(depressed)) ] = α + β(age)

AS LOGITS:
Y=Depressed      Coef    SE      Z       P        CI
α (_constant)    -2.24   0.489   -4.58   <0.001   -3.20, -1.28
β (age)          0.013   0.009   1.52    0.127    -0.004, 0.030

Interpretation:
A 1 unit increase in age results in a 0.013 increase in the log-odds of depression.
Hmmmm… I have no concept of what a log-odds is. Interpret as something else.
Logit > 0, so as age increases the risk of depression increases.
OR = e^0.013 = 1.013
For a 1 unit increase in age, there is a 1.013-fold increase in the odds of depression.
We could also say: for a 1 unit increase in age there is a 1.3% increase in the odds of depression [(OR − 1) * 100 = % change].
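The coefficient-to-percent-change step above is two lines of arithmetic:

```python
import math

coef_age = 0.013                   # logit coefficient from the slide
or_age = math.exp(coef_age)        # odds ratio per 1-year increase in age
pct_change = (or_age - 1) * 100    # percent change in the odds per year

# or_age ≈ 1.013, i.e. about a 1.3% increase in the odds of
# depression for each additional year of age
```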
Logistic Regression: GOF
• Overall Model Likelihood-Ratio Chi-Square
• Omnibus test for the model
• Overall model fit?
• Relative to other models
• Compares specified model with Null model (no
predictors)
• χ² = −2 * (LL0 − LL1), DF = number of parameters added beyond the null model
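The formula itself is simple; a sketch with illustrative stand-in log-likelihoods (not values from the slides):

```python
# Likelihood-ratio chi-square: chi2 = -2 * (LL0 - LL1).
# LL0 is the intercept-only (null) model, LL1 the model with predictors;
# the full model always fits at least as well, so LL1 >= LL0 and chi2 >= 0.
ll_null = -387.1   # LL0 (illustrative)
ll_full = -385.9   # LL1 (illustrative)

lr_chi2 = -2 * (ll_null - ll_full)
# Compare lr_chi2 to a chi-square critical value with DF = parameters added.
```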
Code 2.4
Logistic Regression: GOF
(Summary Measures)
Pseudo-R²
◦ Not the same meaning as in linear regression.
◦ There are many of them (Cox and Snell, McFadden).
◦ Only comparable within nested models of the same outcome.
Hosmer-Lemeshow
◦ For models with continuous predictors.
◦ Compares observed and model-predicted frequencies with a χ² statistic.
◦ H0: Good fit for the data, so we want p > 0.05.
◦ Order the predicted probabilities, group them (g=10) by quantiles, then compute a chi-square of Group * Outcome. DF = g − 2.
◦ Conservative (rarely rejects the null).
Pearson Chi-Square
◦ For models with categorical predictors.
◦ Similar to Hosmer-Lemeshow.
ROC-Area Under the Curve
◦ Predictive accuracy/Classification.
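McFadden's pseudo-R² is just 1 − LL1/LL0. As a sketch, it can even be recovered from numbers reported later in these slides (the 148/672 depressed split and the age model's LR χ² of 2.47), since the null log-likelihood follows from the marginal proportion:

```python
import math

n_dep, n_not = 148, 672          # marginal frequencies from the example
n = n_dep + n_not
p0 = n_dep / n                   # marginal probability of depression

# Null (intercept-only) log-likelihood from the marginal proportion
ll0 = n_dep * math.log(p0) + n_not * math.log(1 - p0)

# LR chi2 = -2*(LL0 - LL1)  =>  LL1 = LL0 + chi2/2
ll1 = ll0 + 2.47 / 2

mcfadden_r2 = 1 - ll1 / ll0      # ≈ 0.003, matching the slide's 0.0030
```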
Code 2.5
Logistic Regression: GOF
(Diagnostic Measures)
Outliers in Y (Outcome)
◦ Pearson Residuals
Square root of the contribution to the Pearson χ2
◦ Deviance Residuals
Square root of the contribution to the likelihood-ratio test statistic of a saturated model vs the fitted model.
Outliers in X (Predictors)
◦ Leverage (Hat Matrix/Projection Matrix)
Maps the influence of observed on fitted values
Influential Observations
◦ Pregibon’s Delta-Beta influence statistic
◦ Similar to Cook’s-D in linear regression
Detecting Problems
◦ Residuals vs Predictors
◦ Leverage vs Residuals
◦ Boxplot of Delta-Beta
Logistic Regression: GOF
log[ π(depressed) / (1 − π(depressed)) ] = α + β₁(age)

L-R χ² (df=1): 2.47, p=0.1162
McFadden's R²: 0.0030
H-L GOF:
  Number of Groups: 10
  H-L χ²: 7.12
  DF: 8
  P: 0.5233

Y=Depressed      Coef    SE      Z       P        CI
α (_constant)    -2.24   0.489   -4.58   <0.001   -3.20, -1.28
β (age)          0.013   0.009   1.52    0.127    -0.004, 0.030
Code 2.6
Logistic Regression: Diagnostics
Linearity in the Log-Odds
◦ Use a lowess (loess) plot
◦ Depressed vs Age
[Figure: lowess smoother (logit-transformed smooth) of Depressed vs age, ages roughly 20-80, bandwidth = 0.8]
Code 2.7
Logistic Regression: Example
Logistic Regression w/ Single Categorical Predictor:
◦ log[ π(depressed) / (1 − π(depressed)) ] = α + β₁(gender)

AS OR:
Y=Depressed      OR      SE      Z       P        CI
α (_constant)    0.545   0.091   -3.63   <0.001   0.392, 0.756
β (male)         0.299   0.060   -5.99   <0.001   0.202, 0.444

Interpretation:
The odds of depression in males are 0.299 times the odds in females.
We could also say: the odds of depression are (1 − 0.299 = 0.701) 70.1% lower in males compared to females.
Or… why not just make males the reference so the OR is greater than 1? Taking the inverse accomplishes the same thing: 1/0.299 = 3.34.
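Flipping the reference group just inverts the OR; the arithmetic above in two lines:

```python
or_male_vs_female = 0.299                    # from the slide (females = reference)
or_female_vs_male = 1 / or_male_vs_female    # same comparison, reference flipped

pct_lower = (1 - or_male_vs_female) * 100    # 70.1% lower odds in males
```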
Ordinal Logistic Regression
Also called Ordered Logistic or Proportional
Odds Model
Extension of Binary Logistic Model
>2 Ordered responses
New Assumption!
◦ Proportional Odds
  The predictor's effect on the outcome is the same across levels of the outcome.
  BMI3GRP (1=Normal Weight, 2=Overweight, 3=Obese):
  Bmi3grp (1 vs 2,3) = β(age)
  Bmi3grp (1,2 vs 3) = β(age)
Ordinal Logistic Regression
The Model
◦ A latent variable model (Y*)
◦ j = number of levels − 1
◦ Y* = logit(p₁ + p₂ + … + pⱼ) = ln[ (p₁ + p₂ + … + pⱼ) / (1 − p₁ − p₂ − … − pⱼ) ] = αⱼ + βx
◦ From the equation we can see that the odds ratio is assumed to be independent of the category j
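The cumulative logits can be inverted to get category probabilities. A sketch assuming the Stata-style cut-point form P(Y ≤ j) = invlogit(cutⱼ − xb), with estimates taken from the example on the next slide; the covariate values (age 50, blood pressure 120) are made up for illustration:

```python
import math

def invlogit(z):
    """Inverse logit: 1 / (1 + e^(-z))."""
    return 1 / (1 + math.exp(-z))

cut1, cut2 = -0.696, 0.773        # thresholds from the example
b_age, b_bp = -0.026, 0.012       # coefficients from the example
xb = b_age * 50 + b_bp * 120      # hypothetical person: age 50, BP 120

p_le_1 = invlogit(cut1 - xb)      # P(normal weight)
p_le_2 = invlogit(cut2 - xb)      # P(normal or overweight)
probs = [p_le_1, p_le_2 - p_le_1, 1 - p_le_2]   # the three BMI categories
```

Only the intercepts (cuts) change between the cumulative splits; the single β per predictor is exactly the proportional-odds assumption.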
Code 3.1
Ordinal Logistic Regression Example
AS LOGITS:
Y=bmi3grp         Coef    SE      Z       P        CI
β1 (age)          -0.026  0.006   -4.15   <0.001   -0.038, -0.014
β2 (blood_press)  0.012   0.005   2.48    0.013    0.002, 0.021
Threshold1/cut1   -0.696  0.6678                   -2.004, 0.613
Threshold2/cut2   0.773   0.6680                   -0.536, 2.082

For a 1 unit increase in Blood Pressure there is a 0.012 increase in the log-odds of being in a higher BMI category.

AS OR:
Y=bmi3grp         OR      SE      Z       P        CI
β1 (age)          0.974   0.006   -4.15   <0.001   0.962, 0.986
β2 (blood_press)  1.012   0.005   2.48    0.013    1.002, 1.022
Threshold1/cut1   -0.696  0.6678                   -2.004, 0.613
Threshold2/cut2   0.773   0.6680                   -0.536, 2.082

For a 1 unit increase in Blood Pressure the odds of being in a higher BMI category are 1.012 times greater.
Code 3.2
Ordinal Logistic Regression: GOF
Assessing Proportional Odds Assumptions
◦ Brant Test of Parallel Regression
H0: Proportional Odds, thus want p >0.05
Tests each predictor separately and overall
◦ Score Test of Parallel Regression
H0: Proportional Odds, thus want p >0.05
◦ Approximate Likelihood-Ratio Test
  H0: Proportional Odds, thus want p > 0.05
Code 3.3
Ordinal Logistic Regression: GOF
Pseudo-R²
Diagnostic Measures
◦ Performed on the j-1 binomial logistic regressions
Multinomial Logistic Regression
Also called multinomial logit/polytomous
logistic regression.
Same assumptions as the binary logistic
model
>2 non-ordered responses
◦ Or you've failed to meet the proportional odds assumption of the Ordinal Logistic model
Multinomial Logistic Regression
The Model
◦ j = levels for the outcome
◦ J = reference level
◦ π_j(x) = P(Y = j | x), where x is a fixed setting of an explanatory variable
◦ logit(π_j(x)) = ln[ π_j(x) / π_J(x) ] = α_j + β_j1 x₁ + … + β_jp x_p
◦ Notice how it appears we are estimating a Relative Risk and not an Odds Ratio. It's actually an OR.
◦ Similar to conducting separate binary logistic models, but with better Type I error control
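The baseline-category logits map back to probabilities by exponentiating each linear predictor and normalizing (with the reference category's predictor fixed at 0). A sketch using the logs of the ORs/constants from the example that follows, evaluated at an illustrative belief score of x = 1:

```python
import math

def multinomial_probs(x, params):
    """params: {category: (alpha_j, beta_j)} on the log-odds scale,
    relative to reference category 1 (Catholic), whose eta is 0."""
    etas = {1: 0.0}
    for j, (a, b) in params.items():
        etas[j] = a + b * x
    denom = sum(math.exp(e) for e in etas.values())
    return {j: math.exp(e) / denom for j, e in etas.items()}

params = {2: (math.log(1.219), math.log(1.126)),   # Protestant
          3: (math.log(0.619), math.log(1.218))}   # Evangelical
probs = multinomial_probs(x=1.0, params=params)    # probabilities sum to 1
```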
Code 4.1
Multinomial Logistic Regression Example
Does degree of supernatural belief indicate a religious preference?

AS OR:
Y=religion (ref=Catholic(1))
                      OR      SE      Z       P        CI
Protestant (2)
β (supernatural)      1.126   0.090   1.47    0.141    0.961, 1.317
α (_constant)         1.219   0.097   2.49    0.013    1.043, 1.425
Evangelical (3)
β (supernatural)      1.218   0.117   2.06    0.039    1.010, 1.469
α (_constant)         0.619   0.059   -5.02   <0.001   0.512, 0.746

For a 1 unit increase in supernatural belief, there is a ((OR − 1) * 100 = % change) 21.8% increase in the odds of being Evangelical compared to Catholic.
Multinomial Logistic Regression
GOF
Limited GOF tests.
◦ Look at LR Chi-square and compare nested
models.
◦ “Essentially, all models are wrong, but some
are useful” –George E.P. Box
Pseudo R2
Similar to Ordinal
◦ Perform tests on the j-1 binomial logistic
regressions
Resources
“Categorical Data Analysis” by Alan Agresti
UCLA Stat Computing:
http://www.ats.ucla.edu/stat/