Chi-square test or χ2 test


The χ2 (Chi-Squared) Test
Crazy Dice?
You roll a die 60 times and get:
3 ones, 6 twos, 19 threes, 22 fours, 6 fives,
and 4 sixes
 Is this a fair die?
How could we test this to prove it?
 Not a mean number of anything
 Six proportions to test!
Chi-Squared Test
• Tests counts of categorical data
• Does observed data match what we
expect to happen?
• Three types of χ2 tests:
– Goodness of Fit (one variable)
– Independence (two variables)
– Homogeneity (one variable from two
different samples)
χ2 Distribution
[Figure: χ2 density curves for df = 3, df = 5, and df = 10]
• Different df = different curves
• Skewed right
• As df increases, the curve shifts right &
becomes more nearly normal
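The shape claims above can be checked numerically. This is a minimal sketch using only the Python standard library; the helper names `chi2_pdf` and `peak_location` are made up for illustration and are not from the slides. It relies on two standard facts about the χ2 distribution: the density peaks at df – 2 (for df ≥ 2) and the skewness is sqrt(8/df).

```python
import math

def chi2_pdf(x, df):
    """Density of the chi-square distribution with df degrees of freedom (x > 0)."""
    half = df / 2
    return x ** (half - 1) * math.exp(-x / 2) / (2 ** half * math.gamma(half))

def peak_location(df, step=0.01, upper=40.0):
    """Grid-search the x value where the density is highest (hypothetical helper)."""
    grid = [step * i for i in range(1, int(upper / step) + 1)]
    return max(grid, key=lambda x: chi2_pdf(x, df))

for df in (3, 5, 10):
    skewness = math.sqrt(8 / df)  # standard result; shrinks toward 0 as df grows
    print(df, round(peak_location(df), 2), round(skewness, 3))
```

The peak moves right as df grows, and the skewness shrinks toward 0, which is what "becomes more normal" means here.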
χ2 Conditions
• Reasonably random sample
• Counts of categorical data; we expect
each category to happen at least once
• Sample size must be large enough:
we expect at least five in each category
Combine these together: all expected
counts are at least 5
***Be sure to list expected counts!
χ2 Formula
\[ \chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} \]
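As a quick illustration, the formula can be applied to the dice rolls from the opening example. This is a sketch, not part of the original slides; the function name `chi_square_statistic` is an assumption, and the expected counts assume a fair die (60 / 6 = 10 per face).

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

rolls = [3, 6, 19, 22, 6, 4]        # ones through sixes, from the dice example
expected = [sum(rolls) / 6] * 6     # fair die: equal expected count for each face
statistic = chi_square_statistic(rolls, expected)
print(statistic)
```

For the dice counts the statistic comes out to about 34.2, to be compared against a χ2 curve with df = 6 – 1 = 5.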
χ2 Goodness of Fit Test
• Univariate data
• How well do observed counts “fit”
what we expect the counts to be?
• Use χ2cdf to find p-values
• df = number of categories – 1
Hypotheses: Written in words!
H0: The data fits what we expect
Ha: The data doesn't fit what we expect
(In context!)
Does your zodiac sign determine how successful you will be? Fortune
magazine collected the zodiac signs of 256 heads of the largest 400
companies. Is there sufficient evidence to suggest successful people
are more likely to be born under some signs than others?
Aries   23    Libra        18    Leo        20
Taurus  20    Scorpio      21    Virgo      19
Gemini  18    Sagittarius  19    Aquarius   24
Cancer  23    Capricorn    22    Pisces     29
How many would you expect in each sign if there
were no difference between them?
We expect CEOs to be born equally under all signs:
256/12 = 21.333333
How many degrees of freedom?
Since there are 12 signs,
df = 12 – 1 = 11
Conditions:
• Reasonably random sample
• All expected counts (21.33) > 5
H0: The same number of CEOs are born under each sign.
Ha: More CEOs are born under some signs than others.
\[ \chi^2_{11} = \frac{(23 - 21.\overline{3})^2}{21.\overline{3}} + \dots + \frac{(29 - 21.\overline{3})^2}{21.\overline{3}} = 5.094 \]
p-value = χ2cdf(5.094, 1E99, 11) = .9265
α = .05
Since p-value > α, we fail to reject H0. There is not sufficient
evidence to suggest that more CEOs are born under some signs
than others.
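The zodiac test can be reproduced without a calculator. The sketch below uses only the Python standard library; `chi2_upper_tail` is a hypothetical helper that plays the role of χ2cdf(x, 1E99, df). It builds the upper-tail area from the incomplete-gamma recurrence Q(a + 1, z) = Q(a, z) + z^a e^(–z) / Γ(a + 1), starting from e^(–z) for even df or erfc(√z) for odd df. This is one way to get the area, not necessarily how a calculator does it.

```python
import math

def chi2_upper_tail(x, df):
    """P(X >= x) for a chi-square variable with df degrees of freedom."""
    z = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)               # Q(1, z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))    # Q(1/2, z)
    while a < df / 2:                          # step a up to df / 2
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1
    return q

counts = {
    "Aries": 23, "Libra": 18, "Leo": 20, "Taurus": 20,
    "Scorpio": 21, "Virgo": 19, "Gemini": 18, "Sagittarius": 19,
    "Aquarius": 24, "Cancer": 23, "Capricorn": 22, "Pisces": 29,
}
n = sum(counts.values())                       # 256 CEOs
expected = n / len(counts)                     # 256 / 12, same for every sign
statistic = sum((o - expected) ** 2 / expected for o in counts.values())
df = len(counts) - 1
p_value = chi2_upper_tail(statistic, df)
print(round(statistic, 3), df, round(p_value, 4))
```

The printed values should land close to the 5.094 and .9265 shown above.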
A company says its premium mixture of nuts contains
10% Brazil nuts, 20% cashews, 20% almonds, 10%
hazelnuts, and 40% peanuts. You buy a large can and
separate the nuts. Upon weighing them, you find there
are 112g of Brazil nuts, 183g of cashews, 207g of
almonds, 71g of hazelnuts, and 446g of peanuts. You
wonder: Is your mix significantly different from what
the company advertises?
Why is the chi-squared goodness-of-fit test NOT
appropriate here?
We don't have counts of nuts
What could we do to use chi-squared?
Count the nuts instead of weighing them
The can has 300 total nuts. What are the expected
counts of each type of nut?
Type         Brazil   Cashew   Almond   Hazelnut   Peanut
Exp. Count   30       60       60       30         120
Offspring of certain fruit flies may have yellow or ebony
bodies and normal wings or short wings. Genetic theory
predicts that these traits will appear in the ratio 9:3:3:1
(yellow & normal, yellow & short, ebony & normal,
ebony & short). A researcher checks 100 such flies and
finds the distribution of traits to be 59, 20, 11, and 10,
respectively.
What are the expected counts? df?
9 + 3 + 3 + 1 = 16
So we expect 9/16 to be Y & N,
3/16 to be Y & S,
3/16 to be E & N,
and 1/16 to be E & S
Expected counts:
Y & N = 56.25
Y & S = 18.75
E & N = 18.75
E & S = 6.25
4 categories → df = 4 – 1 = 3
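Turning a claimed ratio into expected counts is a single scaling step. This small sketch is illustrative only: the helper name `expected_from_ratio` is an assumption, and it uses exact fractions so the results match the hand calculation.

```python
from fractions import Fraction

def expected_from_ratio(ratio, n):
    """Scale ratio parts (e.g. 9:3:3:1) to expected counts for n observations."""
    total = sum(ratio)
    return [float(Fraction(part, total) * n) for part in ratio]

fly_expected = expected_from_ratio([9, 3, 3, 1], 100)
print(fly_expected)                           # Y&N, Y&S, E&N, E&S
print(all(e >= 5 for e in fly_expected))      # large-sample condition
print(len(fly_expected) - 1)                  # degrees of freedom
```

The same helper handles the nut percentages: `expected_from_ratio([10, 20, 20, 10, 40], 300)` gives the expected counts for a can of 300 nuts.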
Are the results consistent with the genetic model's predictions?
Conditions:
• Reasonably random sample
• All expected counts > 5 (Y/N = 56.25, Y/S = 18.75, E/N = 18.75,
E/S = 6.25)
H0: The distribution of flies is the same as the genetic model.
Ha: The distribution of flies is different from the genetic model.
\[ \chi^2_{3} = \frac{(59 - 56.25)^2}{56.25} + \dots + \frac{(10 - 6.25)^2}{6.25} = 5.671 \]
p-value = χ2cdf(5.671, 1E99, 3) = .129
α = .05
Since p-value > α, we fail to reject H0. There is not sufficient
evidence to suggest that the distribution of fruit flies is different
from the genetic model.
χ2 Test for Independence
• Bivariate data from one sample
• Are two categorical variables
dependent or independent?
• Use χ2-Test to find p-values
Same conditions and formula!
Hypotheses: Written in words!
H0: The variables are independent
Ha: The variables are dependent
(In context!)
A beef distributor wants to determine whether
there is a relationship between geographic region
and preferred cut of meat. If there is no
relationship, we will say that beef preference is
independent of geographic region.
Suppose that, in a random sample of 500
customers, 300 are from the North and 200 from
the South. Also, 150 prefer cut A, 275 prefer cut
B, and 75 prefer cut C.
If beef preference is independent of geographic
region, how would we expect this table to be
filled in?
         North   South   Total
Cut A    90      60      150
Cut B    165     110     275
Cut C    45      30      75
Total    300     200     500
Expected Counts
• Assuming H0 is true,
\[ \text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{table total}} \]
Degrees of Freedom
\[ df = (r - 1)(c - 1) \]
Or:
1. Cover up one row & one column
2. Count the number of cells remaining
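The expected-count rule needs only the marginal totals, which is how the beef table was filled in. This is a minimal sketch; the function names `expected_from_margins` and `degrees_of_freedom` are hypothetical.

```python
def expected_from_margins(row_totals, col_totals):
    """Expected counts under independence: row total * column total / table total."""
    table_total = sum(row_totals)
    if table_total != sum(col_totals):
        raise ValueError("row and column totals must add up to the same table total")
    return [[r * c / table_total for c in col_totals] for r in row_totals]

def degrees_of_freedom(n_rows, n_cols):
    """df = (r - 1)(c - 1) for an r-by-c table of counts."""
    return (n_rows - 1) * (n_cols - 1)

cut_totals = [150, 275, 75]      # Cut A, Cut B, Cut C
region_totals = [300, 200]       # North, South
beef_expected = expected_from_margins(cut_totals, region_totals)
for row in beef_expected:
    print(row)
print(degrees_of_freedom(len(cut_totals), len(region_totals)))
```

Covering one row and one column of a 3-by-2 table leaves 2 cells, matching (3 – 1)(2 – 1) = 2.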
In the actual sample of 500 consumers, the
observed counts were as follows:
         North   South   Total
Cut A    100     50      150
Cut B    150     125     275
Cut C    50      25      75
Total    300     200     500
Is there sufficient evidence to suggest that
geographic regions and beef preference are not
independent?
Conditions:
• Reasonably random sample
• All expected counts > 5
Expected Counts:
     N     S
A    90    60
B    165   110
C    45    30
H0: Geographic region and beef preference are independent
Ha: Geographic region and beef preference are dependent
100 90
2
 
2
2
90
p-value = .0226
50  60
2

60
 ...  7.576
α = .05
Since p-value < α, we reject H0. There is sufficient evidence
to suggest that geographic region and beef preference are
dependent.
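The beef test can be recomputed from the observed table alone. This is a sketch under one stated shortcut: for df = 2 the upper-tail area of the χ2 curve is exactly e^(–x/2), so no special function is needed. That shortcut applies only to df = 2, not to other table sizes.

```python
import math

observed = [
    [100, 50],    # Cut A: North, South
    [150, 125],   # Cut B
    [50, 25],     # Cut C
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

statistic = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n   # expected count under independence
        statistic += (obs - exp) ** 2 / exp

df = (len(observed) - 1) * (len(observed[0]) - 1)
assert df == 2, "the closed-form p-value below is only valid for df = 2"
p_value = math.exp(-statistic / 2)
print(round(statistic, 3), df, round(p_value, 4))
```

The results should agree with the 7.576 and .0226 above, up to rounding.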
χ2 Test for Homogeneity
• One categorical variable from two (or
more) independent samples
• Are the two populations the same
(homogeneous)?
• Use χ2-Test to find p-values
Conditions: THE SAME!
Formula: THE SAME!
Expected counts & df: THE SAME
AS INDEPENDENCE!
Only change? Hypotheses!
Hypotheses: Written in words!
H0: The distributions are the same
Ha: The distributions are different
(In context!)
The following data is on drinks per week for
independently chosen random samples of male
and female students. (low = 1-7, moderate = 8-24, high = 25 or more)
            Men    Women   Total
None        140    186     326
Low         478    661     1139
Moderate    300    173     473
High        63     16      79
Total       981    1036    2017
Does there appear to be a gender difference
with respect to drinking behavior?
Conditions:
• Reasonably random samples
• All expected counts > 5
Expected Counts:
        Men     Women
None    158.6   167.4
Low     554.0   585.0
Mod     230.1   243.0
High    38.4    40.6
H0: Drinking behavior is the same for men & women
Ha: Drinking behavior is not the same for men & women
140158.6
2
 
2
3
p-value = 0
158.6
186167.4
2

167.4
 ...  96.53
α = .05
Since p-value < α, we reject H0. There is sufficient evidence to
suggest drinking behavior is not the same for men & women.
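The homogeneity calculation uses the same mechanics as the independence test, so the drinking data can be checked the same way. The sketch below assumes one closed form that holds for df = 3 only: the upper-tail area is erfc(√(x/2)) + √(2x/π) · e^(–x/2). It is an illustration, not the calculator's method.

```python
import math

levels = ["None", "Low", "Moderate", "High"]
men = [140, 478, 300, 63]
women = [186, 661, 173, 16]

n_men, n_women = sum(men), sum(women)
n = n_men + n_women

statistic = 0.0
for m, w in zip(men, women):
    level_total = m + w
    exp_m = level_total * n_men / n       # expected men at this drinking level
    exp_w = level_total * n_women / n     # expected women at this drinking level
    statistic += (m - exp_m) ** 2 / exp_m + (w - exp_w) ** 2 / exp_w

df = (len(levels) - 1) * (2 - 1)          # 4 rows, 2 samples
assert df == 3, "the closed-form p-value below is only valid for df = 3"
p_value = (math.erfc(math.sqrt(statistic / 2))
           + math.sqrt(2 * statistic / math.pi) * math.exp(-statistic / 2))
print(round(statistic, 2), df, p_value)
```

The p-value is not literally zero, but it is far smaller than a calculator display can show, so the conclusion to reject H0 is the same.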