Transcript Slide 1
2
Analysis of frequency
counts with Chi square
Dr David Field
Summary
• Categorical data
• Frequency counts
• One variable chi-square
– testing the null hypothesis that frequencies in the sample are
equally divided among the catgegories
– varying the null hypothesis
• Two variable chi-square
– testing the null hypothesis that status on one categorical variable is
independent from status on another categorical variable
• Limitations and assumptions of chi-square
• Andy Field chapter 18 covers chi-square
• There is also a guide online at
– http://davidmlane.com/hyperstat/
– Chi-square is topic 16 in the list
Categorical data
• Each participant is a member of a single
category, and the categories cannot be
meaningfully placed in order
– e.g., nationality = French, German, Italian
• Sometimes chi-square is used with ordered
categories, e.g. age bands
• To perform statistical tests with categorical
data each participant must be a member of
only one category
– Category membership must be mutually exclusive
• You can’t be a smoker and a non-smoker
• This allows frequency counts in each category
to be calculated
Chi square
• If you can express the data as frequency counts in
several categories, then chi square can be used to
test for differences between the categories
• You will also see chi square written as a Greek
letter accompanied by the mathematical symbol
indicating that a number should be squared
2
Chi square with a single categorical variable
• Suppose we are interested in which drink is most popular
• We ask a sample of 100 people if they prefer to drink
coffee, tea, or water
– each respondent is only allowed to select one answer
– this is important: if each person can have membership
of more than one category you can’t use Chi square
• By default, the null hypothesis for chi-square is that each
of the categories is equally frequent in the underlying
population
– it is possible to modify this (see later)
One variable chi-square example
• Let’s say that the preferences expressed by the sample of
100 people result in the following observed frequency
counts:
–
–
–
–
tea 39
coffee 30
Water 31
SUM 100
• The null hypothesis assumes that each category is equally
frequent, and thus provides a model that the data can be
used to test
• Based on the null hypothesis, the expected frequency
counts would 100 / 3 = 33.3 per category
• The Chi square statistic works out the probability that the
observed frequencies could be obtained by random
sampling from a population where the null hyp is true
One variable chi-square example
Observed
Expected
39
33.3
30
33.3
31
33.3
100
100
Difference
Difference
squared
Divide by
expected
One variable chi-square example
Observed
Expected
Difference
39
33.3
5.7
30
33.3
-3.3
31
33.3
-2.3
100
100
Difference
squared
Divide by
expected
One variable chi-square example
Observed
Expected
Difference
Difference
squared
39
33.3
5.7
32.49
30
33.3
-3.3
10.89
31
33.3
-2.3
5.29
100
100
Divide by
expected
One variable chi-square example
Observed
Expected
Difference
Difference
squared
Divide by
expected
39
33.3
5.7
32.49
0.98
30
33.3
-3.3
10.89
0.33
31
33.3
-2.3
5.29
0.16
100
100
One variable chi-square example
Observed
Expected
Difference
Difference
squared
Divide by
expected
39
33.3
5.7
32.49
0.98
30
33.3
-3.3
10.89
0.33
31
33.3
-2.3
5.29
0.16
100
100
SUM
1.47
Converting Chi square to a p value
• SPSS will do this for you
• Chi square has degrees of freedom equal to the
number of categories minus 1
– 2 in the example this is because if you knew the
frequencies of preference for tea and coffee and the
sample size, the frequency of preference for water
would not be free to vary
• “The chi square value of 1.47, df = 2 had an
associated p value of 0.48, so the null hypothesis
that preferences for drinking tea, coffee and water
in the population are equal cannot be rejected.”
One variable chi square with unequal
expected frequencies
• By default, the expected frequencies are just the
sample size divided equally among the number of
categories.
• But, sometimes this is inappropriate
– For example, we know that the % of the population of
the UK that smokes is less than 50%
– Let’s assume for purposes of illustration that 25% of the
UK population are smokers
• We might hypothesise that the smoking rate is
higher in Glasgow than the UK average rate
• The null hypothesis is that it is the same
One variable chi square with unequal
expected frequencies
• We ask 200 adults in Glasgow if they smoke.
– 80 say yes
– 120 say no
• We know that the UK average rate is 25%, and 80
is rather more than 25% of 200
• Chi square can be used to assess the probability
of the above frequencies being obtained by
random sampling if the real smoking rate in
Glasgow was actually 25%
One variable chi-square example with
unequal expected frequencies
Observed
Expected
Difference
Difference
squared
Divide by
expected
120
150
-30
900
6
80
50
30
900
18
200
200
SUM
24
One variable chi square with unequal
expected frequencies
• “80 of the sample of 200 people from Glasgow
classified themselves as smokers. This resulted in
a chi square value of 24.0, df = 1 with an
associated p value of < 0.001, so the null
hypothesis that smoking rates in Glasgow are
equal to the UK average of 25% can be rejected.”
Chi square with two variables
• Usually, it is more interesting to use Chi square to ask
about the relationship between 2 categorical variables.
• For example, what is the relationship between gender and
smoking?
– gender can be male or female
– smoking can be smoker or non-smoker
• If you have smoking data from just men, you can only use
chi-square to ask if the proportion of smokers and nonsmokers is different
• If you have smoking data from men and women you can
use chi-square to ask if the proportion of men who smoke
differs from the proportion of women who smoke
What 2*2 chi square does not do
• It is important to realise that in the 2*2 chi square, having a
big imbalance between the number of men and the
number of women will not increase the value of the chisquare statistic
• Also, having a big imbalance between the number of
smokers and non-smokers will not increase the value of
the chi-square statistic
• This contrasts with the one variable chi-square, where an
imbalance in the numbers of men vs women, or smokers
vs. non-smokers does increase the value of chi-square.
• The value of chi-square for two variables is high if smoking
frequency is contingent on gender, and low if smoking
frequency is independent of gender
• The key to understanding 2*2 chi square is how the expected
frequencies are calculated
• The expected frequencies provide the null hypothesis, or null
model, that the chi square statistic tests
• If there are 200 participants, the simplest null model would be
to expect 50 female smokers, 50 male smokers, 50 female
non smokers, and 50 male non smokers
– but we already know that it is implausible to expect an equal split of
smokers and non-smokers
– the expected frequencies will have to allow for the imbalance of
smokers vs non smokers and a possible imbalance of men vs women
in the sample
– A sample with 20 male smokers, 10 female smokers, 80 male nonsmokers and 40 female non-smokers has an imbalance of gender and
smoking status, but smoking status does not depend on gender and
there is no deviation from the null model
The contingency table of observed
frequencies
Men
Women
Row totals
Smoke
13
31
44
Don’t
smoke
29
86
115
Column
totals
42
117
159
Calculating the expected frequencies
• The key step in the calculation of chi-square is to
estimate the frequency counts that would occur in
each cell if the null hypothesis that the row
frequencies and column frequencies do not
depend upon each other were true
• To calculate the expected frequency of the male
smokers cell, we first need to calculate the
proportion of participants that are male, without
considering if they smoke or not
• This proportion is 42 males out of 159 (the total
number of participants)
– 42 / 159 = 0.26
Calculating the expected frequencies
• If the null hyp is true, and the proportion of female
smokers and male smokers is equal, then the
proportion of the smokers in the sample that are
male should be equal to the overall proportion of
the sample that is male
• Total number of smokers in sample (44) *
proportion of sample that is male (0.26)
• 44 * 0.26 = 11.62
Calculating the expected frequencies
Men
Women
Row totals
Smoke
13
31
44
Expected
smokers
11.62
Don’t
smoke
29
86
115
42
117
159
Expected
non smoke
Column
totals
Calculating the expected frequencies
Men
Women
Row totals
Smoke
13
31
44
Expected
smokers
11.62
32.37
Don’t
smoke
29
86
Expected
non smoke
Column
totals
115
0.74
42
117
159
Calculating the expected frequencies
Men
Women
Row totals
Smoke
13
31
44
Expected
smokers
11.62
32.37
Don’t
smoke
29
86
115
Expected
non smoke
30.37
Column
totals
42
117
159
Calculating the expected frequencies
Men
Women
Row totals
Smoke
13
31
44
Expected
smokers
11.62
32.37
Don’t
smoke
29
86
Expected
non smoke
30.37
84.62
Column
totals
42
117
115
159
Calculating the value of chi square
• Each cell in the contingency table makes a
contribution to the total chi-square
• For each cell you calculate
• (Observed – Expected) and square it
• You then divide by the Expected
• Do this for each cell individually and add up the
results
Calculating chi square
Men
Women
Smoke
13
31
Expected
smokers
11.62
Don’t
smoke
29
86
Expected
non smoke
30.37
84.62
Column
totals
42
117
Row totals
44
2
(13-11.62) = 1.90
32.37/ 11.62 = 0.16
1.90
115
159
Converting chi-square to a p value
• The degrees of freedom for a two way Chi square
depends upon the number of categories in the
contingency table
– (num columns -1) * (num rows -1)
• SPSS will calculate the DF and p value for you
• “The chi square value of 0.31, df = 1 had an
associated p value of 0.58, so the null hypothesis
that the proportion of men and women that smoke
is equal cannot be rejected.”
• Also see
Larger contingency tables
• You can perform chi-square on larger contingency
tables
• For example, we might be interested in whether
the proportion of smokers vs. non smokers differs
according to age, where age is a 3 level
categorical variable
– 20-29 years old
– 30-39 years old
– 40-49 years old
• This results in a 2 * 3 contingency table
• However, there is some uncertainty as to what a
significant chi-square means in this case
Partitioning chi-square
• A statistically significant 2 * 3 chi-square might
have occurred for one of these 3 reasons
– The proportion of 20-29 year olds who smoke differs
from the proportion of 30-39 year olds that smoke
– The proportion of 20-29 year olds that smoke differs
from the proportion of 40-49 year olds that smoke
– The proportion of 30-39 year olds that smoke differs
from the proportion of 40-49 year olds that smoke
– Or all 3 of the above might be true
– Or 2 of the above might be true
• As a researcher, you will want to distinguish
between these possibilities
Partitioning chi-square
• The solution is to break the 2 * 3 contingency
table into smaller 2 * 2 contingency tables to test
each of the comparisons in the list
– The proportion of 20-29 year olds who smoke differs
from the proportion of 30-39 year olds that smoke
– The proportion of 20-29 year olds that smoke differs
from the proportion of 40-49 year olds that smoke
– The proportion of 30-39 year olds that smoke differs
from the proportion of 40-49 year olds that smoke
• Run 3 separate 2 * 2 chi-square tests
Partitioning chi-square
• However, running 3 tests results in 3 chances of a type 1
error occurring
• To maintain the probability of a type 1 error at the
conventional level of 5% you divide the alpha level by the
number of chi-square tests you run
– Effectively, you share the 5% risk of rejecting the null hypothesis
due to sampling error equally among the tests you perform
• For a single chi-square, it is significant if SPSS reports that
p is less than 0.05
• For two chi-square tests, they are significant at the 0.05
level individually if SPSS reports that p is less than 0.025
• For three chi-square tests, they are significant at the 0.05
level individually if SPSS reports that p is less than 0.0166
Warnings about chi-square
• The expected frequency count in any cell must not be less
than 5
– If this occurs then chi-square is not reliable
• If the contingency table is 2 * 2 or 2 * 3 you can use the
Fisher exact probability test instead
– SPSS will report this
• For bigger contingency tables the only solution is to
“collapse” across categories, but only where this is
meaningful
– If you began with age categories 0-4, 5-10, 11-15, 16-20 you could
collapse to 0-10 and 11-20, which would increase the expected
frequencies in each cell
• Finally, remember that the total of frequencies is equal to
the number of participants you have
– each person must only be a member of one cell in the table