INTRODUCTION TO
CATEGORICAL DATA ANALYSIS
ODDS RATIO, MEASURE OF ASSOCIATION, TEST OF INDEPENDENCE,
LOGISTIC REGRESSION AND POLYTOMOUS LOGISTIC REGRESSION
DEFINITION
• Categorical data are data whose measurement scale consists of a set of
categories.
• E.g., marital status: never married, married, divorced, widowed (nominal).
• E.g., attitude toward some policy: strongly disapprove, disapprove, approve,
strongly approve (ordinal).
• SOME VISUALIZATION TECHNIQUES: jittering, mosaic plots, bar plots, etc.
• Correlation between ordinal or nominal measurements is usually referred
to as association.
MEASURE OF ASSOCIATION
ODDS RATIO
ODDS RATIO - EXAMPLE
• Chinook salmon captured in 1999.
• VARIABLES:
- SEX: M or F (nominal)
- MODE OF CAPTURE: Hook & line or Net (nominal)
- RUN: Early run (before July 1) or Late run (after July 1) (ordinal)
- AGE: interval (continuous variable)
- LENGTH (eye to fork of tail, in mm): interval (continuous variable)
• What are the odds that a captured fish is a female?
Consider Success = Female (because females are heavier).
CHINOOK SALMON EXAMPLE
EARLY RUN DATA

SEX      Hook&Line   Net   TOTAL
F           172      165     337
M           119      202     321
TOTAL       291      367     658
For Hook&Line: odds of capturing a female fish = 172/119 = 1.45
For Net: odds of capturing a female fish = 165/202 = 0.82

$$\widehat{OR} = \frac{\text{odds of capturing a female fish with hook \& line}}{\text{odds of capturing a female fish with net}} = \frac{1.45}{0.82} \approx 1.77$$

The odds that a captured fish is female are 77% ((1.77 - 1) = 0.77) higher with hook & line compared to net.
ODDS RATIO
• In general, for a 2x2 table with cell counts $Y_{11}, Y_{12}, Y_{21}, Y_{22}$ (Variable 1 in rows, Variable 2 in columns):

$$\widehat{OR} = \frac{\bigl[Y_{11}/(Y_{11}+Y_{21})\bigr] \big/ \bigl[Y_{21}/(Y_{11}+Y_{21})\bigr]}{\bigl[Y_{12}/(Y_{12}+Y_{22})\bigr] \big/ \bigl[Y_{22}/(Y_{12}+Y_{22})\bigr]} = \frac{Y_{11}Y_{22}}{Y_{12}Y_{21}}$$
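As a quick numerical check, here is a minimal Python sketch (assuming numpy is available) that reproduces the two odds and the odds ratio from the salmon counts:

```python
import numpy as np

# 2x2 table from the slide: rows = sex (F, M), columns = capture mode (Hook&Line, Net)
table = np.array([[172, 165],
                  [119, 202]])

odds_hookline = table[0, 0] / table[1, 0]   # 172/119 = 1.45
odds_net = table[0, 1] / table[1, 1]        # 165/202 = 0.82

# OR-hat = Y11*Y22 / (Y12*Y21)
or_hat = table[0, 0] * table[1, 1] / (table[0, 1] * table[1, 0])

print(f"odds (hook&line) = {odds_hookline:.2f}")  # 1.45
print(f"odds (net)       = {odds_net:.2f}")       # 0.82
print(f"estimated OR     = {or_hat:.2f}")         # 1.77
```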
INTERPRETATION OF OR
• What does OR = 1 mean?

$$OR = \frac{P(\text{Success} \mid \text{Condition 1}) / P(\text{Failure} \mid \text{Condition 1})}{P(\text{Success} \mid \text{Condition 2}) / P(\text{Failure} \mid \text{Condition 2})} = 1 \iff \frac{P(\text{Success} \mid C_1)}{P(\text{Failure} \mid C_1)} = \frac{P(\text{Success} \mid C_2)}{P(\text{Failure} \mid C_2)} \;\; (\text{independent events})$$

The odds of success are equal under both conditions, e.g., no matter which
mode of capture is used.
INTERPRETATION OF OR
• OR > 1:

$$\frac{P(\text{Success} \mid C_1)}{P(\text{Failure} \mid C_1)} > \frac{P(\text{Success} \mid C_2)}{P(\text{Failure} \mid C_2)}$$

The odds of success are higher under condition 1.
• OR < 1:

$$\frac{P(\text{Success} \mid C_1)}{P(\text{Failure} \mid C_1)} < \frac{P(\text{Success} \mid C_2)}{P(\text{Failure} \mid C_2)}$$

The odds of success are lower under condition 1.
SHAPE OF OR
• The distribution of OR is non-symmetric; its range is $0 \le OR < \infty$.
• ln(OR) has a more symmetric distribution than OR (i.e., closer to a
normal distribution).
• OR = 1 corresponds to ln(OR) = 0.
• A $(1-\alpha)100\%$ confidence interval for ln(OR):

$$\ln(\widehat{OR}) \pm z_{\alpha/2}\sqrt{\frac{1}{Y_{11}} + \frac{1}{Y_{12}} + \frac{1}{Y_{21}} + \frac{1}{Y_{22}}} = (A, B), \text{ say}$$

• The $(1-\alpha)100\%$ confidence interval for OR: $(\exp(A), \exp(B))$.
CHINOOK SALMON EXAMPLE (Contd.)

$$SE\bigl[\ln(\widehat{OR})\bigr] = \sqrt{\frac{1}{172} + \frac{1}{119} + \frac{1}{165} + \frac{1}{202}} = 0.1589$$

$$\ln(\widehat{OR}) = 0.571$$

$$0.571 \pm 1.96(0.1589) = (0.259, 0.882)$$

$$\text{CI for } OR = \bigl(e^{0.259}, e^{0.882}\bigr) = (1.3, 2.42)$$

• The odds that a captured fish is female are about 30% to 140% greater
with hook & line than with a net.
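A short sketch of the same interval computation (assuming numpy and scipy), reproducing the numbers above:

```python
import numpy as np
from scipy.stats import norm

y11, y12, y21, y22 = 172, 165, 119, 202    # salmon counts

log_or = np.log(y11 * y22 / (y12 * y21))         # ln(OR-hat) = 0.571
se = np.sqrt(1/y11 + 1/y12 + 1/y21 + 1/y22)      # SE[ln(OR-hat)] = 0.1589
z = norm.ppf(0.975)                              # 1.96 for a 95% CI

a = log_or - z * se                              # 0.259
b = log_or + z * se                              # 0.882
print(f"95% CI for OR: ({np.exp(a):.2f}, {np.exp(b):.2f})")   # (1.30, 2.42)
```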
OTHER MEASURES OF ASSOCIATION FOR 2X2
TABLES

• Relative Risk $= \dfrac{Y_{11}/(Y_{11}+Y_{21})}{Y_{12}/(Y_{12}+Y_{22})}$

• Yule's Q $= \dfrac{OR - 1}{OR + 1}$, with $-1 \le Q \le 1$.
When Q = 0, the events are independent.
Its interpretation is close to that of a correlation coefficient.
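The same salmon counts can illustrate both measures; a small sketch in plain Python:

```python
# Relative risk and Yule's Q for the salmon table (plain Python, no imports)
y11, y12, y21, y22 = 172, 165, 119, 202

rr = (y11 / (y11 + y21)) / (y12 / (y12 + y22))   # P(F | hook&line) / P(F | net)
or_hat = y11 * y22 / (y12 * y21)
q = (or_hat - 1) / (or_hat + 1)                  # Yule's Q, always in [-1, 1]

print(f"RR = {rr:.2f}, Yule's Q = {q:.2f}")      # RR = 1.31, Q = 0.28
```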
MEASURE OF ASSOCIATION FOR IxJ TABLES
• Pearson $\chi^2$ in contingency tables:

$$\chi^2 = \sum_{i=1}^{I}\sum_{j=1}^{J}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

• EXAMPLE: Instrument Failure (columns give the location of failure)

Type of Failure    L1    L2    L3   TOTAL
T1                 50    16    31      97
T2                 61    26    16     103
TOTAL             111    42    47     200
PEARSON χ² IN CONTINGENCY TABLES
• Question: Are the type of failure and location of failure independent?
H0: Type and location are independent
H1: They are not independent
• We will compare sample frequencies (i.e. observed values) with the
expected frequencies under H0.
• Remember that if events A and B are independent, then P(A∩B) = P(A)P(B).
• If type and location are independent, then
• P(T1 and L1)=P(T1)P(L1)=(97/200)(111/200)
PEARSON χ² IN CONTINGENCY TABLES
• The cell counts are Multinomial(n, p1, …, p6), so E(Xi) = n·pi.
• Expected frequency: E11 = n·prob. = 200(97/200)(111/200) = 53.84
• E12 = 200(97/200)(42/200) = 20.37
• E13 = (97)(47)/200 = 22.8
• E21 = (103)(111)/200 = 57.17
• E22 = (103)(42)/200 = 21.63
• E23 = (103)(47)/200 = 24.2
$$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{3}\frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \frac{(50 - 53.84)^2}{53.84} + \cdots + \frac{(16 - 24.2)^2}{24.2} = 8.086$$

with df = (#rows − 1)(#columns − 1) = 1 × 2 = 2

$$\chi^2_{2,\,0.05} = 5.99 < 8.086 \;\Rightarrow\; \text{Reject } H_0 \;\Rightarrow\; \text{They are not independent.}$$
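For reference, scipy's built-in test reproduces the hand computation; a minimal sketch:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Instrument-failure table: rows = type (T1, T2), columns = location (L1, L2, L3)
obs = np.array([[50, 16, 31],
                [61, 26, 16]])

chi2_stat, p_value, df, expected = chi2_contingency(obs)
print(f"chi2 = {chi2_stat:.3f}, df = {df}, p = {p_value:.4f}")  # chi2 = 8.086, df = 2
print(np.round(expected, 2))   # matches the hand-computed E_ij above
```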
CRAMER'S V
• It adjusts the Pearson $\chi^2$ for n, I and J:

$$\text{Cramer's } V = \sqrt{\frac{\chi^2/n}{\min(I-1,\, J-1)}}, \qquad 0 \le V \le 1$$

In the previous example,

$$V = \sqrt{\frac{8.086/200}{\min(1, 2)}} = 0.2$$
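Continuing the same example, a self-contained sketch of Cramer's V:

```python
import numpy as np
from scipy.stats import chi2_contingency

obs = np.array([[50, 16, 31],
                [61, 26, 16]])
chi2_stat = chi2_contingency(obs)[0]   # 8.086, as computed above

n = obs.sum()                          # 200
i_dim, j_dim = obs.shape               # I = 2, J = 3
v = np.sqrt((chi2_stat / n) / min(i_dim - 1, j_dim - 1))
print(f"Cramer's V = {v:.2f}")         # 0.20
```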
CORRELATION BETWEEN ORDINAL VARIABLES
• Correlation coefficients are used to quantitatively describe the strength and
direction of a relationship between two variables.
• When both variables are at least interval measurements, one may report the Pearson
product-moment coefficient of correlation, also known simply as the correlation
coefficient and denoted by 'r'.
• The Pearson correlation coefficient is only appropriate for describing linear correlation.
The appropriateness of using this coefficient can be examined through scatter
plots.
• A statistic that measures the correlation between two ‘rank’ measurements is
Spearman’s ρ , a nonparametric analog of Pearson’s r.
• Spearman’s ρ is appropriate for skewed continuous or ordinal measurements. It
can also be used to determine the relationship between one continuous and one
ordinal variable.
• Statistical tests are available to test hypotheses on ρ, e.g., H0: there is no correlation
between the two variables (H0: ρ = 0).
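A small illustration with scipy (the paired data here are hypothetical, for demonstration only):

```python
from scipy.stats import spearmanr

# Hypothetical paired measurements: an ordinal attitude score and a skewed
# continuous income (in $1000s) -- illustration only
attitude = [1, 2, 2, 3, 4, 5, 5, 3]
income = [12, 15, 14, 30, 45, 90, 80, 25]

rho, p_value = spearmanr(attitude, income)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.4f}")  # tests H0: rho = 0
```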
MEASURES OF ASSOCIATION FOR IxJ TABLES
FOR TWO ORDINAL VARIABLES
• Why are there multiple measures of association?
• Statisticians over the years have thought of varying
ways of characterizing what a perfect relationship is:
tau-b = 1, gamma = 1          tau-b < 1, gamma = 1
[Two example cross-tabulations, omitted here, illustrate these two cases.]
Either of these might be considered a perfect relationship,
depending on one’s reasoning about what relationships between
variables look like.
I’m so confused!!
Rule of Thumb
• Gamma tends to overestimate strength but
gives an idea of the upper boundary.
• If the table is square, use tau-b; if rectangular, use
tau-c.
• Pollock (and we agree):
τ < .1 is weak; .1 < τ < .2 is moderate; .2 < τ < .3 is
moderately strong; .3 < τ < 1 is strong.
MEASUREMENT OF AGREEMENT FOR IxI
TABLES
Consider an IxI table in which Judge 1 (rows) and Judge 2 (columns) classify
the same items into categories A, B, C. Kappa compares the observed probability of
agreement, $\sum_i \hat{\pi}_{ii}$, with the agreement expected under independence:

$$\hat{\kappa} = \frac{\sum_i \hat{\pi}_{ii} - \sum_i \hat{\pi}_{i+}\hat{\pi}_{+i}}{1 - \sum_i \hat{\pi}_{i+}\hat{\pi}_{+i}}, \qquad \hat{\kappa} = 1 \text{ (perfect agreement)}$$
EXAMPLE (COHEN'S KAPPA or Index of Interrater Reliability)
• Two pathologists examined 118 samples and categorized them into 4
groups. Below is the 4x4 table for their decisions.

                   Pathologist Y
Pathologist X     1    2    3    4   TOTAL
      1          22    2    2    0      26
      2           5    7   14    0      26
      3           0    2   36    0      38
      4           0    1   17   10      28
   TOTAL         27   12   69   10     118
EXAMPLE (Contd.)

$$\sum_{i=1}^{4}\hat{\pi}_{ii} = \frac{22 + 7 + 36 + 10}{118} = 0.636$$

$$\sum_{i=1}^{4}\hat{\pi}_{i+}\hat{\pi}_{+i} = \frac{(26)(27) + (26)(12) + (38)(69) + (28)(10)}{118^2} = 0.281$$

$$\hat{\kappa} = \frac{0.636 - 0.281}{1 - 0.281} = 0.493$$

The difference between the observed agreement and that expected under independence
is about 50% of the maximum possible difference.
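A sketch that reproduces the kappa computation from the 4x4 agreement table (assuming numpy):

```python
import numpy as np

# Agreement table: rows = Pathologist X, columns = Pathologist Y
table = np.array([[22, 2,  2,  0],
                  [ 5, 7, 14,  0],
                  [ 0, 2, 36,  0],
                  [ 0, 1, 17, 10]])

p = table / table.sum()                            # cell proportions (n = 118)
p_observed = np.trace(p)                           # sum of pi_ii = 0.636
p_chance = (p.sum(axis=1) * p.sum(axis=0)).sum()   # sum of pi_i+ * pi_+i = 0.281

kappa = (p_observed - p_chance) / (1 - p_chance)
print(f"kappa = {kappa:.3f}")                      # 0.493
```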
EVALUATION OF KAPPA
• If the obtained κ is less than .70, conclude that the inter-rater
reliability is not satisfactory.
• If the obtained κ is greater than .70, conclude that the inter-rater
reliability is satisfactory.
• Interpretation of kappa, after Landis and Koch (1977)
PROBABILITY MODELS FOR CATEGORICAL
DATA
• Bernoulli/Binomial
• Multinomial
• Poisson
•…
TEST ON PROPORTIONS AND CONFIDENCE
INTERVALS
• You are already familiar with tests for proportions:
• You are already familiar with tests for proportions:

$$H_0: \pi = \pi_0 \quad\text{or}\quad H_0: \pi_1 = \pi_2 \quad\text{or}\quad H_0: \pi_1 = \pi_2 = \cdots$$

Pearson $\chi^2$ or deviance $G^2$ test
• CI for Y = 0
CONFIDENCE INTERVAL FOR A PROPORTION
• For large sample sizes, we can use the normal approximation to the binomial
(np ≥ 5 and n(1 − p) ≥ 5).

Point estimate: $\hat{\pi} = p = \dfrac{Y}{n}$

CI for $\pi$: $p \pm z_{\alpha/2}\sqrt{\dfrac{p(1-p)}{n}}$

• If np < 5 or n(1 − p) < 5, the normal approximation is not realistic.
CONFIDENCE INTERVAL FOR A PROPORTION
• Consider Y = 0 in n trials. Then p = Y/n = 0.
• The normal-approximation CI is

$$0 \pm 1.96\sqrt{\frac{0(1-0)}{n}} = (0, 0)$$

no matter what n is! But observing 0 successes in 1 trial and observing 0 successes
in 100 trials are different things. Note that np = 0 < 5.
EXACT CONFIDENCE INTERVALS
(Collett, 1991, Modelling Binary Data)
• Lower limit:

$$P_L = \frac{v_1}{v_1 + v_2 F_{v_2, v_1, \alpha/2}}, \quad \text{where } v_1 = 2Y, \; v_2 = 2(n - Y + 1)$$

• Upper limit:

$$P_U = \frac{v_3 F_{v_3, v_4, \alpha/2}}{v_4 + v_3 F_{v_3, v_4, \alpha/2}}, \quad \text{where } v_3 = 2(Y + 1), \; v_4 = 2(n - Y)$$
EXACT CONFIDENCE INTERVALS
• Going back to the example with Y = 0. Let n = 5.
• Y = 0 ⟹ v1 = 0, v2 = 2(5 + 1) = 12, v3 = 2, v4 = 2(5) = 10

$$P_L = \frac{0}{0 + 12 F_{12, 0, 0.025}} = 0$$

$$P_U = \frac{2 F_{2, 10, 0.025}}{10 + 2 F_{2, 10, 0.025}} = \frac{2(5.46)}{10 + 2(5.46)} = 0.5218$$
LOGISTIC REGRESSION
• Used to analyze the relationship between a binary outcome Y and a set of
explanatory variables.
• The assumptions of linear models do not hold.
• Assume $Y_i \sim \text{Ber}(\pi_i)$. Then $E(Y_i) = \pi_i = P(Y_i = 1)$ and $P(Y_i = 0) = 1 - \pi_i$.
• Logistic regression is defined as:

$$\text{logit}(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}$$

The log odds (the odds of $P(Y_i = 1)$ to $P(Y_i = 0)$) are expressed as a linear function
of the x's. Equivalently,

$$E(Y_i) = \pi_i = \frac{\exp(\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi})}{1 + \exp(\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi})}$$
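In practice the model is fit by maximum likelihood; a minimal sketch with statsmodels on simulated data (the data and true coefficients here are hypothetical):

```python
import numpy as np
import statsmodels.api as sm

# Simulated example: two predictors with known (hypothetical) coefficients
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
eta = -0.5 + 1.0 * x[:, 0] - 0.8 * x[:, 1]        # beta_0 + beta_1*x1 + beta_2*x2
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))       # Y_i ~ Ber(pi_i)

X = sm.add_constant(x)                            # prepend the intercept column
fit = sm.Logit(y, X).fit(disp=False)              # maximum likelihood fit
print(fit.params)                                 # estimates of (beta_0, beta_1, beta_2)
print(np.exp(fit.params[1:]))                     # odds ratios for 1-unit changes
```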
Binary Logistic Regression
• Logistic distribution: plotted against x, P(Y = 1) follows an S-shaped curve.
• Transformed, however, the "log odds" ln[p/(1-p)] are linear in x.
INTERPRETATION OF PARAMETERS
• Consider p = 1 (a single predictor). Let $x^* = x + 1$ (i.e., a one-unit increase in X). Then the odds
ratio is:

$$\frac{\pi_i^* / (1 - \pi_i^*)}{\pi_i / (1 - \pi_i)} = \frac{e^{\beta_0 + \beta_1(x_i + 1)}}{e^{\beta_0 + \beta_1 x_i}} = e^{\beta_1}$$

• $\exp(\beta_1)$: the odds ratio for a 1-unit change in X
• $\beta_1$: the log-odds ratio for a 1-unit change in X
MULTIPLE LOGISTIC REGRESSION
ESTIMATION OF PARAMETERS
$Y_i \sim \text{Ber}(\pi_i)$, so the likelihood is

$$L(\boldsymbol{\beta}) = \prod_{i=1}^{n} f(y_i) = \prod_{i=1}^{n} \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}$$

$$\ln L(\boldsymbol{\beta}) = \sum_{i=1}^{n}\bigl[y_i \ln \pi_i + (1 - y_i)\ln(1 - \pi_i)\bigr] = \sum_{i=1}^{n}\left[y_i \ln\left(\frac{\pi_i}{1 - \pi_i}\right) + \ln(1 - \pi_i)\right]$$

$$= \sum_{i=1}^{n}\left[y_i(\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}) + \ln\left(\frac{1}{1 + e^{\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}}}\right)\right]$$

Setting the derivatives to zero gives the score equations:

$$\frac{\partial \ln L(\boldsymbol{\beta})}{\partial \beta_0} = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n}\frac{e^{\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}}}{1 + e^{\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}}} = 0$$

$$\frac{\partial \ln L(\boldsymbol{\beta})}{\partial \beta_k} = \sum_{i=1}^{n} y_i x_{ki} - \sum_{i=1}^{n}\frac{x_{ki}\, e^{\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}}}{1 + e^{\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}}} = 0, \quad k = 1, 2, \ldots, p$$

These are nonlinear equations in the β's with no closed-form solution; iterative
methods on a computer are needed!
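A sketch of the Newton-Raphson iteration that solves these score equations (the function name logit_newton is ours; real software adds safeguards such as step-halving and separation checks):

```python
import numpy as np

def logit_newton(X, y, n_iter=25, tol=1e-10):
    """Newton-Raphson for the logistic score equations above.

    X is the (n, p+1) design matrix including a column of ones; y holds
    the 0/1 responses.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1 / (1 + np.exp(-X @ beta))       # fitted probabilities pi_i
        score = X.T @ (y - pi)                 # gradient of ln L(beta)
        w = pi * (1 - pi)                      # Var(Y_i) under the model
        info = X.T @ (X * w[:, None])          # Fisher information matrix
        step = np.linalg.solve(info, score)    # Newton update direction
        beta += step
        if np.max(np.abs(step)) < tol:         # stop when the update is tiny
            break
    return beta
```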
MODEL CHECK
• Since the errors $\varepsilon_i$ take only two values in logistic regression, the "usual"
residuals will not help with model checks. But there are "deviance
residuals" in this case.

Deviance for subject i = $\widehat{dev}_i$ = −2 log(likelihood contribution):

$$\widehat{dev}_i = -2\bigl[y_i \ln \hat{\pi}_i + (1 - y_i)\ln(1 - \hat{\pi}_i)\bigr]$$
MODEL CHECK
• You can plot $\widehat{dev}_i$ vs. i, which is called an index plot of deviance residuals,
to identify outlying residuals. But this plot does not indicate whether
these residuals should be treated as outliers.
• There are also analogues of the common methods used for linear
regression, such as leverage values and influence diagnostics (DFFITS,
Cook's distance)…
• NOTE: An alternative for predicting a binary response is discriminant
analysis. However, this approach assumes the X's are jointly distributed as a
multivariate normal distribution, so it is more reasonable when the X's
are continuous. Otherwise, logistic regression should be preferred.
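A sketch of the deviance contributions and the index plot described above (assuming numpy and matplotlib; the y and pi_hat values are hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt

def deviance_contributions(y, pi_hat):
    """dev_i = -2[y_i ln(pi_hat_i) + (1 - y_i) ln(1 - pi_hat_i)], as defined above."""
    return -2 * (y * np.log(pi_hat) + (1 - y) * np.log(1 - pi_hat))

# Hypothetical responses and fitted probabilities (in practice these come
# from a fitted model, e.g. the Newton-Raphson sketch earlier)
y = np.array([0, 1, 1, 0, 1, 0])
pi_hat = np.array([0.2, 0.7, 0.9, 0.4, 0.3, 0.8])

dev = deviance_contributions(y, pi_hat)
plt.plot(range(1, len(dev) + 1), dev, "o")     # index plot: dev_i vs. i
plt.xlabel("case index i")
plt.ylabel("deviance contribution")
plt.show()
```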
Binary Logistic Regression
• A researcher is interested in the likelihood of gun ownership in the
US, and what would predict that.
• He uses the 2002 GSS to test the following research hypotheses:
1. Men are more likely to own guns than women
2. The older persons are, the more likely they are to own guns
3. White people are more likely to own guns than those of other races
4. The more educated persons are, the less likely they are to own guns
Binary Logistic Regression
• Variables are measured as such:
Dependent:
Havegun: no gun = 0, own gun(s) = 1
Independent:
1. Sex: men = 0, women = 1
2. Age: entered as number of years
3. White: all other races = 0, white = 1
4. Education: entered as number of years
SPSS: Analyze → Regression → Binary Logistic
Enter your variables and, for the output below, under Options, check
"iteration history".
Binary Logistic Regression
SPSS Output: Some descriptive information first…
The maximum likelihood process stops at the
third iteration and yields an intercept
(-.625) for a model with no
predictors.
A measure of fit, -2 Log Likelihood, is
generated. The equation producing
this:
-2(∑(Yi · ln[P(Yi)] + (1 - Yi) · ln[1 - P(Yi)]))
This is simply the relationship
between the observed values for each
case in your data and the model's
prediction for each case. The
"negative 2" makes this number
distribute as a χ2 distribution.
In a perfect model, -2 log likelihood
would equal 0. Therefore, lower
numbers imply better model fit.
Binary Logistic Regression
Originally, the "best guess" for each
person in the data set is 0, have no gun!
This is the model for the log
odds when every other
potential variable equals
zero (null model). It
predicts P(no gun) = .651, like
above: 1/(1 + e^a) = 1/(1 + .535) = .651.
The real P(own gun) = .349.
Binary Logistic Regression
Next are iterations for our full model…
Binary Logistic Regression
Goodness-of-fit statistics for the new model come next…
Test of the new model vs. the intercept-only model (the null model), based
on the difference of the -2LL of each. The
difference has a χ2 distribution. Is the
new -2LL significantly smaller?
-2(∑(Yi · ln[P(Yi)] + (1 - Yi) · ln[1 - P(Yi)]))
The -2LL number is "ungrounded," but it has a χ2
distribution. Smaller is better. In a perfect model, -2 log
likelihood would equal 0.
These are attempts to
replicate R2 using information
based on -2 log likelihood
(Cox & Snell cannot equal 1).
Assessment of the new model's
predictions.
Binary Logistic Regression
Interpreting Coefficients…
ln[p/(1-p)] = a + b1X1 + b2X2 + b3X3 + b4X4
[SPSS coefficient table: slopes b1–b4 for X1–X4, the constant a, and e^b (Exp(B)) for each.]
Which b's are significant?
Being male, getting older, and being white have a positive effect on the likelihood of
owning a gun. On the other hand, education does not affect owning a gun.
Binary Logistic Regression
• ln[p/(1-p)] = a + b1X1 + … + bkXk is the power to which you need to
raise e to get P/(1 - P).
So… P/(1 - P) = e^(a + b1X1 + … + bkXk)
• Plug in values of x to get the odds (= p/(1-p)).
The coefficients can be manipulated as follows:
Odds = p/(1-p) = e^(a + b1X1 + b2X2 + b3X3 + b4X4) = e^a (e^b1)^X1 (e^b2)^X2 (e^b3)^X3 (e^b4)^X4
Odds = p/(1-p) = e^(-1.864 + .898X1 + .008X2 + 1.249X3 - .056X4) = e^-1.864 (e^.898)^X1 (e^.008)^X2 (e^1.249)^X3 (e^-.056)^X4
Binary Logistic Regression
The coefficients can be manipulated as follows:
Odds = p/(1-p) = e^(a + b1X1 + b2X2 + b3X3 + b4X4) = e^a (e^b1)^X1 (e^b2)^X2 (e^b3)^X3 (e^b4)^X4
Odds = p/(1-p) = e^(-2.246 - .780X1 + .020X2 + 1.618X3 - .023X4) = e^-2.246 (e^-.780)^X1 (e^.020)^X2 (e^1.618)^X3 (e^-.023)^X4
Each coefficient changes the odds by a multiplicative
amount; the amount is e^b. "Every unit increase in X
multiplies the odds by e^b."
In the example above, e^b = Exp(B) in the last column.
Binary Logistic Regression
Each coefficient changes the odds by a multiplicative amount; the amount is e^b.
"Every unit increase in X multiplies the odds by e^b."
In the example above, e^b = Exp(B) in the last column.
For Sex: e^-.780 = .458 … If you subtract 1 from this value, you get the proportional
increase (or decrease) in the odds caused by being female, -.542. In percent terms, the
odds of owning a gun decrease 54.2% for women.
Age: e^.020 = 1.020. A one-year increase in age increases the odds of owning a gun by 2%.
White: e^1.618 = 5.044 … Being white increases the odds of owning a gun by 404%.
Educ: e^-.023 = .977 … Not significant.
Binary Logistic Regression
Age: e^.020 = 1.020. A one-year increase in age increases the odds of owning a gun by 2%.
How would a 10-year increase in age affect the odds? Recall that (e^b)^X is the equation component
for a variable. For 10 years, (1.020)^10 = 1.219. The odds jump by 22% for a ten-year increase
in age.
Note: You'd have to know the current prediction level for the dependent variable to know if this
percent change is actually making a big difference or not!
Binary Logistic Regression
For our problem:
P = e^(-2.246 - .780X1 + .020X2 + 1.618X3 - .023X4) / (1 + e^(-2.246 - .780X1 + .020X2 + 1.618X3 - .023X4))
For a man, aged 30, Latino, with 12 years of education, what does P equal?
Let's solve for e^(-2.246 - .780X1 + .020X2 + 1.618X3 - .023X4) = e^(-2.246 - .780(0) + .020(30) + 1.618(0) - .023(12))
= e^(-2.246 - 0 + .6 + 0 - .276) = e^-1.922 = 2.71828^-1.922 = .146
Therefore,
P = .146 / 1.146 = .127. The probability that the 30-year-old Latino with 12
years of education will own a gun is .127! Or
you could say there is a 12.7% chance.
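The same prediction in a few lines of Python, using the coefficients reported above:

```python
import numpy as np

# Coefficients reported above: constant, sex, age, white, education
coefs = np.array([-2.246, -0.780, 0.020, 1.618, -0.023])
x = np.array([1, 0, 30, 0, 12])   # constant=1, man (0), age 30, non-white (0), 12 yrs educ

eta = coefs @ x                        # -2.246 + .6 - .276 = -1.922
p = np.exp(eta) / (1 + np.exp(eta))    # .146 / 1.146
print(f"eta = {eta:.3f}, P(own gun) = {p:.3f}")   # eta = -1.922, P = 0.127
```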
Binary Logistic Regression
Inferential statistics are as before:
• In model fit, if the χ2 test is significant, the expanded model (with
your variables) improves prediction.
• This chi-squared test tells us that, as a set, the variables improve
classification.
Binary Logistic Regression
Inferential statistics are as before:
• The significance of the coefficients is determined
by a Wald test. Wald is χ2 with 1 df and equals a
two-tailed t2, with exactly the same p-value.
Binary Logistic Regression
So how would I do hypothesis testing? An example (see the sketch after this list):
1. Significance test with α-level = .05
2. Critical χ2 (df = 1) = 3.84 (note that 1.96 × 1.96 = 3.84)
3. To find if there is a significant slope in the population:
Ho: β = 0
Ha: β ≠ 0
4. Collect data
5. Calculate Wald, like t (z): t = (b − βo) / s.e.
6. Make a decision about the null hypothesis
7. Find the p-value
Reject the null for male, age, and white. Fail to reject the null for education.
There is a 24.2% chance that the sample came from a population where
the education coefficient equals 0.
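A sketch of steps 5-7 (the standard error used here is hypothetical, chosen only to illustrate; the real one comes from the SPSS coefficient table):

```python
from scipy.stats import chi2

def wald_test(b, se):
    """Wald chi-square (1 df) and two-tailed p-value for H0: beta = 0."""
    w = (b / se) ** 2
    return w, chi2.sf(w, df=1)

# b is the education coefficient from the output; the standard error below
# is hypothetical, for illustration only
w, p = wald_test(b=-0.023, se=0.020)
print(f"Wald = {w:.2f}, p = {p:.3f}")   # p is roughly .25, near the .242 reported
```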
EXTENSIONS OF LOGISTIC REGRESSION
MULTINOMIAL LOGISTIC REGRESSION
• There are many ways of constructing polytomous regression.
1. Logistic regression with respect to a baseline category (e.g., the last
category). For a nominal response:

$$\log\left(\frac{\pi_{1i}}{\pi_{Ji}}\right) = \beta_{01} + \beta_{11} x_{1i} + \cdots + \beta_{p1} x_{pi}$$

$$\log\left(\frac{\pi_{2i}}{\pi_{Ji}}\right) = \beta_{02} + \beta_{12} x_{1i} + \cdots + \beta_{p2} x_{pi}$$

$$\vdots$$

$$\log\left(\frac{\pi_{J-1,i}}{\pi_{Ji}}\right) = \beta_{0,J-1} + \beta_{1,J-1} x_{1i} + \cdots + \beta_{p,J-1} x_{pi}$$
MULTINOMIAL LOGISTIC REGRESSION
2. Adjacent-categories logits (for ordinal data):

$$\log\left(\frac{\pi_{1i}}{\pi_{2i}}\right) = \beta_{01} + \beta_{11} x_{1i} + \cdots + \beta_{p1} x_{pi}$$

$$\log\left(\frac{\pi_{2i}}{\pi_{3i}}\right) = \beta_{02} + \beta_{12} x_{1i} + \cdots + \beta_{p2} x_{pi}$$

$$\vdots$$

$$\log\left(\frac{\pi_{J-1,i}}{\pi_{Ji}}\right) = \beta_{0,J-1} + \beta_{1,J-1} x_{1i} + \cdots + \beta_{p,J-1} x_{pi}$$
MULTINOMIAL LOGISTIC REGRESSION
3. Cumulative logits for ordinal variables.
4. Continuation-ratio logits for ordinal variables.
5. Proportional odds model for ordinal variables.
(See Agresti!)
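Baseline-category models of type 1 can be fit, e.g., with statsmodels' MNLogit; a minimal sketch on hypothetical data (note that statsmodels takes the lowest-coded category as the baseline rather than the last):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: a 3-category nominal response and two predictors
rng = np.random.default_rng(1)
x = rng.normal(size=(300, 2))
y = rng.integers(0, 3, size=300)          # categories 0, 1, 2

X = sm.add_constant(x)
fit = sm.MNLogit(y, X).fit(disp=False)    # baseline-category logit model
print(fit.params)   # one column of (beta_0k, beta_1k, beta_2k) per non-baseline category
```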