Chi-square test or χ² test

What if we are interested in seeing if my “crazy” dice are considered “fair”? What can I do?
Chi-square test
• Used to test the counts of categorical data
• Three types
– Goodness of fit (univariate)
– Independence (bivariate)
– Homogeneity (univariate with two samples)
χ² distribution
[Figure: χ² density curves for df = 3, df = 5, and df = 10]
• Different df have different curves
• Skewed right
• As df increases, the curve shifts toward the right & becomes more like a normal curve
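The shape claims above can be checked numerically. The sketch below relies on standard properties of the χ² distribution (mean = df, peak at df − 2 when df ≥ 2, skewness = √(8/df)); the function name chi2_shape is illustrative and does not come from the slides.

```python
import math

def chi2_shape(df):
    """Return (mean, mode, skewness) for a chi-square distribution with df degrees of freedom."""
    mean = df                      # the mean equals df
    mode = max(df - 2, 0)          # the peak sits at df - 2 (at 0 when df < 2)
    skewness = math.sqrt(8 / df)   # positive (right skew), shrinking toward 0 as df grows
    return mean, mode, skewness

for df in (3, 5, 10):
    mean, mode, skewness = chi2_shape(df)
    print(f"df={df:2d}  mean={mean:2d}  mode={mode}  skewness={skewness:.3f}")
```

The peak moves right as df grows while the skewness falls, which matches the three curves described for df = 3, 5, and 10.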
χ² assumptions
• SRS – reasonably random sample
• Have counts of categorical data & we expect each category to happen at least once
• Sample size – to ensure that the sample size is large enough, we should expect at least five in each category.
Combine these together: All expected counts are at least 5.
***Be sure to list expected counts!!
χ² formula
$$ \chi^2 = \sum \frac{(\text{obs} - \text{exp})^2}{\text{exp}} $$
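A minimal sketch of the formula in Python may help before the worked examples. The roll counts are hypothetical (60 rolls of a single die, invented for illustration), and chi_square_statistic is an illustrative name, not something from the slides.

```python
def chi_square_statistic(observed, expected):
    """Sum (obs - exp)^2 / exp over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need one entry per category")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical data: 60 rolls of one die, faces 1 through 6.
observed_rolls = [8, 12, 9, 11, 6, 14]
expected_rolls = [60 / 6] * 6   # a fair die gives 10 per face

statistic = chi_square_statistic(observed_rolls, expected_rolls)
print(f"chi-square = {statistic:.2f}")
```

Every expected count here is 10, so the "at least 5" condition holds for this made-up sample.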
χ² Goodness of fit test
• Uses univariate data
• Want to see how well the observed counts “fit” what we expect the counts to be
• Use χ²cdf function on the calculator to find p-values
Based on df – df = number of categories - 1
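Away from the calculator, the same upper-tail area can be computed with the Python standard library. This is a sketch, not the calculator's algorithm: it assumes an integer df and builds the tail probability from the known closed forms for df = 1 and df = 2, adding one term for each further pair of degrees of freedom. The name chi2_upper_tail and the six-category example are illustrative assumptions.

```python
import math

def chi2_upper_tail(statistic, df):
    """P(X >= statistic) for a chi-square variable with integer df >= 1."""
    if statistic <= 0:
        return 1.0
    t = statistic / 2
    # Starting value: a closed form for the smallest case of each parity.
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-t)                # df = 2
    else:
        a, tail = 0.5, math.erfc(math.sqrt(t))     # df = 1
    # Each pass raises the degrees of freedom by 2 and adds t^a * e^(-t) / Gamma(a + 1).
    while a < df / 2:
        tail += math.exp(a * math.log(t) - t - math.lgamma(a + 1))
        a += 1
    return tail

categories = 6             # hypothetical: the six faces of a die
df = categories - 1        # df = number of categories - 1
print(df, round(chi2_upper_tail(11.07, df), 3))
```

Calling chi2_upper_tail(x, df) plays the role of χ²cdf(x, 10^99, df): the area to the right of x under the curve for that df.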
Hypotheses – written in words
H0: the observed counts equal the expected counts
Ha: the observed counts are not equal to the expected counts
Be sure to write in context!
Let’s test our dice!
Does your zodiac sign determine how successful you will be? Fortune magazine collected the zodiac signs of 256 heads of the largest 400 companies. Is there sufficient evidence to claim that successful people are more likely to be born under some signs than others?
Sign          Count
Aries         23
Taurus        20
Gemini        18
Cancer        23
Leo           20
Virgo         19
Libra         18
Scorpio       21
Sagittarius   19
Capricorn     22
Aquarius      24
Pisces        29
How many would you expect in each sign if there were no difference between them?
I would expect CEOs to be equally born under all signs. So 256/12 = 21.333333
How many degrees of freedom?
Since there are 12 signs – df = 12 – 1 = 11
Assumptions:
• Have a random sample of CEOs
• All expected counts are greater than 5. (I expect 21.33 CEOs to be born under each sign.)
H0: The number of CEOs born under each sign is the same.
Ha: The number of CEOs born under each sign is different.
$$ \chi^2 = \frac{(23 - 21.3)^2}{21.3} + \frac{(20 - 21.3)^2}{21.3} + \dots + \frac{(29 - 21.3)^2}{21.3} = 5.094 $$
P-value = χ²cdf(5.094, 10^99, 11) = .9265
α = .05
Since p-value > α, I fail to reject H0. There is not sufficient evidence to suggest that CEOs are more likely to be born under some signs than others.
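The arithmetic for the zodiac example can be reproduced with a short sketch. The counts are the twelve values from the slide; the variable names are illustrative. Listing each sign's contribution also shows where most of the statistic comes from.

```python
# Observed counts from the slide (256 CEOs across 12 signs).
observed = [23, 18, 20, 20, 21, 19, 18, 19, 24, 23, 29, 22]

total = sum(observed)
expected = total / len(observed)        # equal share per sign under H0
df = len(observed) - 1

# Contribution of each sign to the statistic: (obs - exp)^2 / exp.
contributions = [(o - expected) ** 2 / expected for o in observed]
statistic = sum(contributions)

print(f"expected per sign = {expected:.2f}, df = {df}")
print(f"chi-square = {statistic:.3f}")
print(f"largest single contribution = {max(contributions):.3f}")
```

The count of 29 supplies more than half of the total, yet the statistic is still small for 11 degrees of freedom.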
A company says its premium mixture of nuts contains 10% Brazil nuts, 20% cashews, 20% almonds, 10% hazelnuts and 40% peanuts. You buy a large can and separate the nuts. Upon weighing them, you find there are 112 g of Brazil nuts, 183 g of cashews, 207 g of almonds, 71 g of hazelnuts, and 446 g of peanuts. You wonder whether your mix is significantly different from what the company advertises.
Why is the chi-square goodness-of-fit test NOT appropriate here?
Because we do NOT have counts of the type of nuts.
What might you do instead of weighing the nuts in order to use chi-square?
We could count the number of each type of nut and then perform a χ² test.
Offspring of certain fruit flies may have yellow or ebony bodies and normal wings or short wings. Genetic theory predicts that these traits will appear in the ratio 9:3:3:1 (yellow & normal, yellow & short, ebony & normal, ebony & short). A researcher checks 100 such flies and finds the distribution of traits to be 59, 20, 11, and 10, respectively.
What are the expected counts?
We expect 9/16 of the 100 flies to have yellow and normal wings. (Y & N)
Expected counts:
Y & N = 56.25
Y & S = 18.75
E & N = 18.75
E & S = 6.25
df?
Since there are 4 categories, df = 4 – 1 = 3
Are the results consistent with the theoretical distribution predicted by the genetic model? (see next page)
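Turning a theoretical ratio into expected counts is a one-line calculation. The sketch below applies it to the 9:3:3:1 ratio and the sample of 100 flies; expected_from_ratio is an illustrative helper name, not part of the slides.

```python
def expected_from_ratio(ratio, sample_size):
    """Split sample_size across categories in proportion to the ratio parts."""
    parts_total = sum(ratio)
    return [sample_size * part / parts_total for part in ratio]

labels = ["Y & N", "Y & S", "E & N", "E & S"]
expected = expected_from_ratio([9, 3, 3, 1], 100)
df = len(labels) - 1

for label, count in zip(labels, expected):
    print(f"{label}: {count}")
print("df =", df)
```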
Assumptions:
• Have a random sample of fruit flies
• All expected counts are greater than 5.
Expected counts:
Y & N = 56.25, Y & S = 18.75, E & N = 18.75, E & S = 6.25
H0: The distribution of fruit flies is the same as the theoretical model.
Ha: The distribution of fruit flies is not the same as the theoretical model.
$$ \chi^2 = \frac{(59 - 56.25)^2}{56.25} + \frac{(20 - 18.75)^2}{18.75} + \dots + \frac{(10 - 6.25)^2}{6.25} = 5.671 $$
P-value = χ²cdf(5.671, 10^99, 3) = .129
α = .05
Since p-value > α, I fail to reject H0. There is not sufficient evidence to suggest that the distribution of fruit flies is not the same as the theoretical model.
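The same decision can be cross-checked with a critical value instead of a p-value. This is an alternative route, not the method used on the slides: it compares the statistic with 7.815, the standard table value that cuts off the upper 5% of the χ² distribution with df = 3.

```python
observed = [59, 20, 11, 10]
expected = [56.25, 18.75, 18.75, 6.25]

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Standard table value: upper 5% point of the chi-square distribution with df = 3.
CRITICAL_VALUE_DF3_ALPHA_05 = 7.815

reject_h0 = statistic > CRITICAL_VALUE_DF3_ALPHA_05
print(f"chi-square = {statistic:.3f}, reject H0: {reject_h0}")
```

A statistic below the critical value corresponds to a p-value above .05, so both routes lead to the same conclusion.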
χ² test for independence
• Used with categorical, bivariate data from ONE sample
• Used to see if the two categorical variables are associated (dependent) or not associated (independent)
Assumptions & formula remain the same!
Hypotheses – written in words
H0: two variables are independent
Ha: two variables are dependent
Be sure to write in context!
A beef distributor wishes to
determine whether there is a
relationship between geographic region
and cut of meat preferred. If there
is no relationship, we will say that beef
preference is independent of
geographic region. Suppose that, in a
random sample of 500 customers, 300
are from the North and 200 from the
South. Also, 150 prefer cut A, 275
prefer cut B, and 75 prefer cut C.
If beef preference is independent
of geographic region, how would we
expect this table to be filled in?
         North   South   Total
Cut A    90      60      150
Cut B    165     110     275
Cut C    45      30      75
Total    300     200     500
Expected Counts
• Assuming H0 is true,
$$ \text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{table total}} $$
Degrees of freedom
$$ df = (r - 1)(c - 1) $$
Or cover up one row & one column & count the number of cells remaining!
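Both rules can be tried on the beef example, using only the row and column totals given in the problem. The sketch below is illustrative; expected_counts is a made-up helper name.

```python
def expected_counts(row_totals, column_totals):
    """Expected count per cell = row total * column total / table total."""
    table_total = sum(row_totals)
    if table_total != sum(column_totals):
        raise ValueError("row totals and column totals must add up to the same table total")
    return [[r * c / table_total for c in column_totals] for r in row_totals]

row_totals = [150, 275, 75]      # cut A, cut B, cut C
column_totals = [300, 200]       # North, South

expected = expected_counts(row_totals, column_totals)
df = (len(row_totals) - 1) * (len(column_totals) - 1)

for row in expected:
    print(row)
print("df =", df)
```

The totals row and totals column are left out when counting r and c, which is why a 3 × 2 table of cuts by regions gives df = 2.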
Now suppose that in the actual sample
of 500 consumers the observed
numbers were as follows:
(on your paper)
Is there sufficient evidence to
suggest that geographic regions and
beef preference are not independent?
(Is there a difference between the
expected and observed counts?)
Assumptions:
• Have a random sample of people
• All expected counts are greater than 5.
Expected Counts:
     N      S
A    90     60
B    165    110
C    45     30
H0: geographic region and beef preference are independent
Ha: geographic region and beef preference are dependent
$$ \chi^2 = \frac{(100 - 90)^2}{90} + \frac{(50 - 60)^2}{60} + \dots = 7.576 $$
df = 2
P-value = .0226
α = .05
Since p-value < α, I reject H0. There is sufficient evidence to suggest that geographic region and beef preference are dependent.
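The reported p-value can be checked without the observed table, which is not reproduced in this transcript. For df = 2 the upper-tail area of the χ² distribution has a simple closed form, exp(−χ²/2); the sketch below uses that fact together with the statistic from the slide.

```python
import math

statistic = 7.576   # chi-square value reported on the slide
alpha = 0.05

# For df = 2 the upper-tail probability is exactly exp(-statistic / 2).
p_value = math.exp(-statistic / 2)

decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"p-value = {p_value:.4f} -> {decision}")
```

This shortcut applies only when df = 2; other degrees of freedom need the calculator's χ²cdf or an equivalent routine.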
χ² test for homogeneity
• Used with a single categorical variable from two (or more) independent samples
• Used to see if the two populations are the same (homogeneous)
Assumptions & formula remain the same!
Expected counts & df are found the same way as test for independence.
Only change is the hypotheses!
Hypotheses – written in words
H0: the two (or more) distributions are the same
Ha: the distributions are different
Be sure to write in context!
The following data is on drinking
behavior for independently chosen
random samples of male and female
students. Does there appear to be a
gender difference with respect to
drinking behavior? (Note: low = 1-7
drinks/wk, moderate = 8-24 drinks/wk, high
= 25 or more drinks/wk)
Assumptions:
• Have 2 random samples of students
• All expected counts are greater than 5.
Expected Counts:
     M        F
0    158.6    167.4
L    554.0    585.0
M    230.1    243.0
H    38.4     40.6
H0: drinking behavior is the same for female & male students
Ha: drinking behavior is not the same for female & male
students
$$ \chi^2 = \frac{(140 - 158.6)^2}{158.6} + \frac{(186 - 167.4)^2}{167.4} + \dots = 96.53 $$
df = 3
P-value = .000
α = .05
Since p-value < α, I reject H0. There is sufficient evidence to suggest that drinking behavior is not the same for female & male students.
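A short sketch can confirm the conditions and show why the p-value displays as .000. It uses the expected counts and the statistic from the slide (the observed table is not reproduced in this transcript) and a closed form for the upper-tail area that holds only when df = 3. The variable names are illustrative.

```python
import math

# Expected counts from the slide: rows are none (0), low, moderate, high; columns are male, female.
expected = [
    [158.6, 167.4],
    [554.0, 585.0],
    [230.1, 243.0],
    [38.4, 40.6],
]
statistic = 96.53   # chi-square value reported on the slide

all_large_enough = all(cell >= 5 for row in expected for cell in row)
df = (len(expected) - 1) * (len(expected[0]) - 1)

# Closed form of the upper-tail probability, valid only for df = 3.
half = statistic / 2
p_value = math.erfc(math.sqrt(half)) + math.sqrt(2 * statistic / math.pi) * math.exp(-half)

print(f"expected counts ok: {all_large_enough}, df = {df}")
print(f"p-value = {p_value:.3e}, rounded = {p_value:.3f}")
```

The p-value is not literally zero; it is simply far smaller than three decimal places can show.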