Contingency Tables - Stony Brook University

Contingency Tables
• Chapters Seven, Sixteen, and Eighteen
• Chapter Seven
– Definition of Contingency Tables
– Basic Statistics
– SPSS program (Crosstabulation)
• Chapter Sixteen
– Basic Probability Theory Concepts
– Test of Hypothesis of Independence
Basic Empirical Situation
• Unit of data.
• Two nominal scales measured for each unit.
– Example: in an interview study, record the sex of each
respondent and a variable such as whether or not the
subject has a cellular telephone.
– The objective is to compare males and females with
respect to the fraction that have cellular telephones.
Contingency Table
• One column for each value of the column
variable; C is the number of columns.
• One row for each value of the row variable;
R is the number of rows.
• R x C contingency table.
Contingency Table
• Each entry is the OBSERVED COUNT
O(i,j) of the number of units having the (i,j)
contingency.
• Column of marginal totals.
• Row of marginal totals.
Basic Hypothesis
• ASSUME column variable is the
independent variable.
• Hypothesis is independence.
• That is, the conditional distribution in any
column is the same as the conditional
distribution in any other column.
Expected Count
• The basic idea is proportional allocation: each column
total is allocated across the rows in proportion to the
row totals.
• Expected count in the (i,j) contingency =
E(i,j) = (row i total × column j total) / (total number in table).
• Expected count need not be an integer; one
expected count for each contingency.
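
A minimal sketch of this proportional-allocation computation in Python, assuming numpy is available; the counts come from the marijuana example tabulated on a later slide:

    import numpy as np

    # Observed counts O(i,j): rows index use at time 3,
    # columns index use at time 4 (data from a later slide).
    O = np.array([[120.0,  95.0],
                  [  9.0, 142.0]])

    row_totals = O.sum(axis=1)   # total number in each row i
    col_totals = O.sum(axis=0)   # total number in each column j
    N = O.sum()                  # total number in table

    # E(i,j) = (row i total * column j total) / table total
    E = np.outer(row_totals, col_totals) / N
    print(E)   # expected counts need not be integers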
Residual
• Residual in (i,j) contingency = observed
count in (i,j) contingency - expected count
in (i,j) contingency.
• That is, R(i,j) = O(i,j) - E(i,j)
• One residual for each contingency.
Pearson Chi-squared Component
• Chi-squared component for the (i,j) contingency =
C(i,j) = (residual in the (i,j) contingency)² / (expected
count in the (i,j) contingency).
• C(i,j) = (R(i,j))² / E(i,j)
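
Continuing the sketch above (same arrays O and E), the residuals and components follow directly:

    # Residual R(i,j) = O(i,j) - E(i,j); one per contingency.
    resid = O - E

    # Pearson component C(i,j) = R(i,j)**2 / E(i,j).
    component = resid ** 2 / E
    print(component)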
Assessing Pearson Component
• Rough guides for whether the (i,j) contingency has an
excessively large chi-squared component C(i,j):
– a component of 3.84 has an observed significance level
of about 0.05;
– a component of 6.63, about 0.01;
– a component of 10.83, about 0.001.
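
These cutoffs are upper-tail points of a chi-squared distribution with one degree of freedom; a quick check, assuming scipy is available:

    from scipy.stats import chi2

    # Observed significance level of a single component,
    # treated as chi-squared with 1 degree of freedom.
    for cutoff in (3.84, 6.63, 10.83):
        print(cutoff, chi2.sf(cutoff, df=1))
    # prints approximately 0.05, 0.01, and 0.001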
Pearson Chi-Squared Test
• Sum C(i,j) over all contingencies.
• The Pearson chi-squared statistic has (R-1)(C-1)
degrees of freedom.
• Under the null hypothesis:
– the expected value of the chi-squared statistic equals
its degrees of freedom;
– its variance is twice its degrees of freedom.
Marijuana Use at Time 4 by Marijuana Use at Time 3

                     No use at time 4   Used at time 4   Total
No use at time 3           120                 95          215
Used at time 3               9                142          151
Total                      129                237          366
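
A sketch of the full test on this table using scipy's chi2_contingency routine; Yates' correction is turned off so the result matches the uncorrected Pearson statistic used in the slides that follow:

    from scipy.stats import chi2_contingency

    observed = [[120, 95],
                [  9, 142]]

    # correction=False gives the uncorrected Pearson statistic.
    stat, p, dof, expected = chi2_contingency(observed, correction=False)
    print(stat)   # about 96.6
    print(dof)    # (R-1)(C-1) = 1
    print(p)      # far below 0.001, so independence is rejected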
Contingency Tables
• Chapter Eighteen
– Measures of Association
– For nominal variables
– For ordinal variables
Measures of Association
• Measures strength of an association
– usually, a dimensionless number between 0 and
1 in absolute value.
– Values near 0 indicate no association; values near 1
indicate strong association.
• The correlation coefficient is a measure of
association.
• The chi-squared statistic is not
– its value depends on the number of observations.
Measures of Association for
Nominal Scale Variables
• Chi-square based
– Phi coefficient
– Coefficient of contingency
– Cramer’s V
• Proportional reduction in error
– Lambda, symmetric
– Lambda, not symmetric
Chi-squared Measure: Phi
Coefficient
• Definition of the phi coefficient:
Phi = √(χ² / N)
Phi Coefficient
• Can be greater than one.
• N is the total number of observations in the table.
• For the marijuana at time 3 and time 4 data, the phi
coefficient is √(96.595/366) = 0.51.
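
A one-line check of this calculation, reusing the chi-squared value computed above:

    chi2_stat, N = 96.595, 366
    phi = (chi2_stat / N) ** 0.5
    print(phi)   # about 0.51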
Coefficient of Contingency
• Definition of the coefficient of contingency:
C = √(χ² / (χ² + N))
Coefficient of Contingency
• Can never get as large as one.
• Its largest value depends on the number of rows and
columns in the table.
• For the example given, C = 0.46.
Cramér’s V
• Definition of the statistic; k is the smaller of the
number of rows and the number of columns:
V = √(χ² / (N(k - 1)))
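
A sketch computing both remaining chi-squared-based measures for the same example; note that for a 2 x 2 table k - 1 = 1, so Cramér's V coincides with the phi coefficient:

    chi2_stat, N = 96.595, 366
    k = min(2, 2)   # smaller of the number of rows and columns

    contingency_C = (chi2_stat / (chi2_stat + N)) ** 0.5
    cramers_V = (chi2_stat / (N * (k - 1))) ** 0.5

    print(contingency_C)   # about 0.46
    print(cramers_V)       # about 0.51, equal to phi for a 2 x 2 table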
Interpretation of Chi-squared
measures of association
• An approximate observed level of
significance is given for each measure.
• Use this in the usual way.
Proportional Reduction in Error
(PRE) Measures
• Prediction is the modal category.
• Predict overall:
– predict used marijuana at time 4; correct for 237
cases and wrong for 129.
• The number misclassified is 129.
Proportional Reduction in Error
(PRE) Measures
• Predict for each condition of the
independent variable.
– Predict no use at time 4 for those not using at
time 3
• correct 120 of 215 times
• misclassify 95 times.
– Predict use at time 4 for those using at time 3
• correct 142 of 151 times
• misclassify 9 times.
Proportional Reduction in Error
(PRE) Measures
• Using only the totals, the number misclassified
is 129.
• Using marijuana use at time 3, the number
misclassified is 104.
• The lambda measure is λ = (129 - 104)/129 = 0.19.
Lambda PRE Measures
• There is a lambda measure using marijuana
use at time 4 as the independent variable.
– Total: predict no usage at time 3: 151 errors.
– Conditional:
• no usage at time 4: predict no usage at time 3, with 9 errors
• usage at time 4: predict usage at time 3, with 95 errors
• 104 total errors.
– The lambda measure is λ = (151 - 104)/151 = 0.31.
Lambda PRE Measures
• There is a symmetric lambda measure.
• λ = [(129 - 104) + (151 - 104)] / (129 + 151) = 0.26
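
A sketch reproducing all three lambda values from the error counts in the preceding slides; the helper function lambda_errors is illustrative, not a standard library routine:

    import numpy as np

    # Rows index use at time 3, columns index use at time 4.
    O = np.array([[120,  95],
                  [  9, 142]])

    def lambda_errors(table):
        # Errors when predicting the column variable from its
        # totals alone, and when predicting within each row.
        errors_total = table.sum() - table.sum(axis=0).max()
        errors_conditional = (table.sum(axis=1) - table.max(axis=1)).sum()
        return errors_total, errors_conditional

    e4, c4 = lambda_errors(O)     # predict time 4 from time 3: 129, 104
    e3, c3 = lambda_errors(O.T)   # predict time 3 from time 4: 151, 104

    print((e4 - c4) / e4)                       # 0.19
    print((e3 - c3) / e3)                       # 0.31
    print(((e4 - c4) + (e3 - c3)) / (e4 + e3))  # 0.26, the symmetric lambda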
Text Example Data Set

Subject   Life   Degree
Case 1      1       2
Case 2      2       3
Case 3      3       2
Comparing Pairs of Cases
• Concordant pair of cases: sign of difference
on variable 1 is the same as the sign of the
difference on variable 2.
– Case 1 and Case 2: concordant.
– Case 2 and Case 3: discordant (the signs of the
differences are opposite).
– Case 1 and Case 3: tied (no difference on Degree).
• Let P be number of concordant pairs and Q
be the number of discordant pairs.
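
A small sketch that classifies each pair of cases in the text example and counts P and Q:

    from itertools import combinations

    life   = [1, 2, 3]   # Case 1, Case 2, Case 3
    degree = [2, 3, 2]

    P = Q = ties = 0
    for i, j in combinations(range(3), 2):
        d1 = life[j] - life[i]
        d2 = degree[j] - degree[i]
        if d1 * d2 > 0:      # same sign: concordant
            P += 1
        elif d1 * d2 < 0:    # opposite signs: discordant
            Q += 1
        else:                # a zero difference: tied
            ties += 1

    print(P, Q, ties)   # 1 concordant, 1 discordant, 1 tied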
Measures Based on Concordant
and Discordant Pairs
• Goodman and Kruskal’s Gamma
– (P - Q)/(P + Q); see the sketch after this list.
• Kendall’s Tau-b
• Kendall’s Tau-c
• Somers’ d
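
Using the counts P and Q from the sketch above, gamma is immediate; scipy's kendalltau is shown as a cross-check (its tau-b denominator additionally adjusts for ties):

    from scipy.stats import kendalltau

    gamma = (P - Q) / (P + Q)
    print(gamma)   # 0.0 for the text example

    tau_b, pvalue = kendalltau([1, 2, 3], [2, 3, 2])
    print(tau_b)   # also 0.0 here, since P = Q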
Choosing a measure
• Choose a measure “interpretable for the
purpose in hand”!
• Avoid data dredging (taking the measure
that is largest for the data set that you have).
Other measures
• Correlation based
– Pearson’s correlation
– Spearman correlation: replace values by ranks.
• Measures of agreement
– Cohen’s kappa.
Summary
• Contingency table methods are crucial to the
analysis of market research and social
science data.
• The hypothesis of independence is tested with the
Pearson chi-squared statistic.
• Measures of association describe the
strength of the dependence between two
variables.