Lecture note

Chapter 16
Chi Squared Tests
16.1 Introduction
• Two statistical techniques are presented, to
analyze nominal data.
– A goodness-of-fit test for the multinomial experiment.
– A contingency table test of independence.
• Both tests use the χ² distribution as the sampling distribution
of the test statistic.
16.2 Chi-Squared Goodness-of-Fit Test
• The hypothesis tested involves the probabilities p1, p2, …,
pk of a multinomial distribution.
• The multinomial experiment is an extension of the binomial
experiment.
– There are n independent trials.
– The outcome of each trial can be classified into one of k
categories, called cells.
– The probability pi that the outcome falls into cell i remains
constant for each trial. Moreover,
p1 + p2 + … + pk = 1.
– Trials of the experiment are independent.
• We test whether there is sufficient evidence to
reject a pre-specified set of values for pi.
• The hypothesis:
H0: p1 = a1, p2 = a2, …, pk = ak
H1: At least one pi ≠ ai
• The test compares the actual frequency with the expected
frequency of occurrences in each of the cells.
The multinomial goodness of fit test Example
• Example 16.1
– Two competing companies, A and B, have enjoyed a
dominant position in the market. The companies
conducted aggressive advertising campaigns.
– Market shares before the campaigns were:
• Company A = 45%
• Company B = 40%
• Other competitors = 15%.
– To study the effect of the campaign on the market
shares, a survey was conducted.
– 200 customers were asked to indicate their preference
regarding the product advertised.
– Survey results:
• 102 customers preferred company A’s product,
• 82 customers preferred company B’s product,
• 16 customers preferred the competitors’ products.
– Can we conclude at the 5% significance level that
the market shares were affected by the
advertising campaigns?
• Solution
– The population investigated is the brand preferences.
– The data are nominal (A, B, or other).
– This is a multinomial experiment (three categories).
– The question of interest: Are p1, p2, and p3 different
after the campaign from their values before the
campaign?
• The hypotheses are:
H0: p1 = .45, p2 = .40, p3 = .15
H1: At least one pi changed.
• The expected frequency for each category (cell) if the null
hypothesis is true, and the actual frequency the sample returned:

Cell    Expected frequency    Actual frequency
1       90 = 200(.45)         102
2       80 = 200(.40)         82
3       30 = 200(.15)         16
• The statistic is
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}, \quad \text{where } e_i = n p_i$$
• The rejection region is
$$\chi^2 > \chi^2_{\alpha,\,k-1}$$
• In Example 16.1:
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \frac{(102-90)^2}{90} + \frac{(82-80)^2}{80} + \frac{(16-30)^2}{30} = 8.18$$
$$\chi^2_{\alpha,\,k-1} = \chi^2_{.05,\,3-1} = 5.99147$$
The p-value = P(χ² > 8.18) = .0167
[from Excel: CHIDIST(8.18,2)]
Conclusion: Since 8.18 > 5.99, there is sufficient
evidence at the 5% significance level to reject the null
hypothesis. At least one of the probabilities pi is
different. Thus, at least two market shares have
changed.
[Figure: χ² density with 2 degrees of freedom. The rejection region (area alpha) lies to the right of 5.99; the p-value is the area to the right of the test statistic, 8.18.]
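The arithmetic of Example 16.1 can be retraced with a short script. This is a minimal sketch using only the Python standard library; the variable names are illustrative, the critical value is simply the one quoted in the example, and the closed-form p-value used here holds only for 2 degrees of freedom.

```python
import math

# Market shares before the campaigns (H0) and survey counts from Example 16.1.
null_probs = [0.45, 0.40, 0.15]   # company A, company B, other competitors
observed = [102, 82, 16]          # actual frequencies f_i
n = sum(observed)                 # 200 customers

# Expected frequencies e_i = n * p_i under H0.
expected = [n * p for p in null_probs]

# Test statistic: sum over the cells of (f_i - e_i)^2 / e_i.
chi2_stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))

# With k - 1 = 2 degrees of freedom the upper-tail probability is
# exp(-x / 2). This shortcut is NOT valid for other degrees of freedom.
p_value = math.exp(-chi2_stat / 2)

critical_value = 5.99147          # value quoted in the example for alpha = .05
reject_h0 = chi2_stat > critical_value

print(round(chi2_stat, 2), round(p_value, 4), reject_h0)
```

The decision can be read either from the comparison with the critical value or from the p-value falling below .05; both lead to rejecting H0 here.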
Required conditions –
the rule of five
• The test statistic used to perform the test is only
approximately Chi-squared distributed.
• For the approximation to apply, the expected cell
frequency has to be at least 5 for all the cells
(npi ≥ 5).
• If the expected frequency in a cell is less than 5,
combine it with other cells.
16.3 Chi-squared Test of a Contingency Table
• This test is used to determine whether
– two nominal variables are related, or
– there are differences between two or more
populations of a nominal variable.
• To accomplish the test objectives, we need to
classify the data according to two different
criteria.
Contingency table χ² test –
Example
• Example 16.2
– In an effort to better predict the demand for courses
offered by a certain MBA program, it was hypothesized
that students’ academic backgrounds affect their choice
of MBA major and thus their course selection.
– A random sample of last year’s MBA students was
selected. The following contingency table summarizes
relevant data.
The observed values

Undergraduate               MBA Major
Degree          Accounting  Finance  Marketing  Total
BA                  31         13       16        60
BENG                 8         16        7        31
BBA                 12         10       17        39
Other               10          5        7        22
Total               61         44       47       152

There are two ways to address the problem:
– If each classification is considered a nominal variable, are these two
variables dependent?
– If each undergraduate degree is considered a population, do
these populations differ?
• Solution
– The hypotheses are:
H0: The two variables are independent
H1: The two variables are dependent
– The test statistic
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$$
k is the number of cells in the contingency table.
Since ei = npi but pi is unknown, we need to estimate the unknown
probability from the data, assuming H0 is true.
– The rejection region
$$\chi^2 > \chi^2_{\alpha,\,(r-1)(c-1)}$$
Estimating the expected frequencies
Marginal totals and the probabilities estimated from them:

MBA Major     Total   Probability
Accounting     61      61/152
Finance        44      44/152
Marketing      47      47/152

Undergraduate
Degree        Total   Probability
BA             60      60/152
BENG           31      31/152
BBA            39      39/152
Other          22      22/152

Sample size: 152
Under the null hypothesis the two variables are independent:
P(Accounting and BA) = P(Accounting)*P(BA) = [61/152][60/152].
The number of students expected to fall in the cell “Accounting - BA” is
eAcct-BA = n(pAcct-BA) = 152(61/152)(60/152) = [61*60]/152 = 24.08
The number of students expected to fall in the cell “Finance - BBA” is
eFinance-BBA = npFinance-BBA = 152(44/152)(39/152) = [44*39]/152 = 11.29
The expected frequencies for a
contingency table
• The expected frequency of the cell in row i and
column j of the contingency table is calculated by
$$e_{ij} = \frac{(\text{Column } j \text{ total})(\text{Row } i \text{ total})}{\text{Sample size}}$$
• The test statistic is
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$$
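The expected-frequency rule can be applied to every cell of Example 16.2 at once. The following is a minimal sketch in plain Python; the function name `expected_frequencies` and the list-of-rows layout are illustrative assumptions, not part of the lecture.

```python
# Observed counts from Example 16.2. Rows are undergraduate degrees
# (BA, BENG, BBA, Other); columns are MBA majors
# (Accounting, Finance, Marketing).
observed = [
    [31, 13, 16],
    [8, 16, 7],
    [12, 10, 17],
    [10, 5, 7],
]

def expected_frequencies(table):
    """Expected counts under independence: (row total)(column total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

expected = expected_frequencies(observed)

# Print one row of expected frequencies per undergraduate degree.
for row in expected:
    print([round(e, 2) for e in row])
```

The BA/Accounting entry is 60(61)/152 and the BBA/Finance entry is 39(44)/152, the two cells worked out by hand in the lecture.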
Calculation of the χ² statistic
• Solution – continued
Observed frequencies, with the expected frequencies in parentheses:

Undergraduate                  MBA Major
Degree          Accounting   Finance      Marketing    Total
BA              31 (24.08)   13 (17.37)   16 (18.55)     60
BENG             8 (12.44)   16 (8.97)     7 (9.59)      31
BBA             12 (15.65)   10 (11.29)   17 (12.06)     39
Other           10 (8.83)     5 (6.37)     7 (6.80)      22
Total           61           44           47            152

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \frac{(31-24.08)^2}{24.08} + \dots + \frac{(5-6.37)^2}{6.37} + \dots + \frac{(7-6.80)^2}{6.80} = 14.70$$
• Solution – continued
– The critical value in our example is:
$$\chi^2_{\alpha,\,(r-1)(c-1)} = \chi^2_{.05,\,(4-1)(3-1)} = 12.5916$$
• Conclusion:
Since χ² = 14.70 > 12.5916, there
is sufficient evidence to infer at the 5% significance
level that students’ undergraduate degree
and MBA students’ course selection
are dependent.
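The whole test for Example 16.2 can be sketched end to end in plain Python. The helper `chi2_upper_tail_even_df` is a hypothetical name; it relies on the closed form of the chi-squared upper tail that exists only for an even number of degrees of freedom, which happens to apply here since (4 - 1)(3 - 1) = 6. The critical value is the one quoted in the lecture.

```python
import math

# Observed counts: rows BA, BENG, BBA, Other; columns Accounting, Finance, Marketing.
observed = [[31, 13, 16], [8, 16, 7], [12, 10, 17], [10, 5, 7]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Accumulate (f - e)^2 / e over all cells, with e = (row total)(column total) / n.
chi2_stat = 0.0
for i, r in enumerate(row_totals):
    for j, c in enumerate(col_totals):
        e = r * c / n
        chi2_stat += (observed[i][j] - e) ** 2 / e

df = (len(row_totals) - 1) * (len(col_totals) - 1)   # (4 - 1)(3 - 1) = 6

def chi2_upper_tail_even_df(x, df):
    """P(chi-squared > x) for a positive, even number of degrees of freedom.

    Uses the closed form exp(-x/2) * sum_{j < df/2} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires positive even degrees of freedom")
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

p_value = chi2_upper_tail_even_df(chi2_stat, df)
critical_value = 12.5916   # value quoted in the lecture for alpha = .05, 6 d.f.

print(round(chi2_stat, 2), df, round(p_value, 4), chi2_stat > critical_value)
```

For tables whose degrees of freedom are odd, this helper does not apply and a statistics library or table lookup would be needed instead.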
Using the computer
Select the Chi squared / raw data
option from Data Analysis Plus
under Tools. See Xm16-02.
Define a code to specify each nominal
value. Input the data in columns, one
column for each category.
Code:
Undergraduate degree
1 = BA
2 = BENG
3 = BBA
4 = OTHERS
MBA Major
1 = ACCOUNTING
2 = FINANCE
3 = MARKETING
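Degree   MBA Major
[Raw data listing: one row per student, with the coded undergraduate degree in the first column and the coded MBA major in the second; see Xm16-02.]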
Contingency Table
            1     2     3   Total
1          31    13    16     60
2           8    16     7     31
3          12    10    17     39
4          10     5     7     22
Total      61    44    47    152
Test Statistic CHI-Squared = 14.7019
P-Value = 0.0227
Required condition – the rule of five
– The χ² distribution provides an adequate approximation to
the sampling distribution under the condition that eij ≥ 5 for
all the cells.
– When eij < 5, rows or columns must be combined so that the
condition is met.
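Example
Observed frequencies, with the expected frequencies in parentheses.
One cell in column 3 has an expected frequency below 5 (3.6):

Column 1     Column 2     Column 3
10 (10.1)    14 (12.8)    4 (5.1)
12 (12.7)    16 (16.0)    7 (6.3)
 8 (7.2)      8 (9.2)     4 (3.6)

We combine columns 2 and 3:

Column 1     Columns 2 + 3
10 (10.1)    14 + 4 = 18   (12.8 + 5.1 = 17.9)
12 (12.7)    16 + 7 = 23   (16.0 + 6.3 = 22.3)
 8 (7.2)      8 + 4 = 12   (9.2 + 3.6 = 12.8)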
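The rule-of-five check and the column merge in this example can be sketched as follows. The helper names `expected_frequencies` and `combine_columns` are illustrative assumptions; the counts are those of the example table.

```python
# Observed counts from the rule-of-five example (3 rows, 3 columns).
observed = [
    [10, 14, 4],
    [12, 16, 7],
    [8, 8, 4],
]

def expected_frequencies(table):
    """Expected counts under independence: (row total)(column total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def combine_columns(table, keep, drop):
    """Return a new table with column `drop` added into column `keep`."""
    merged = []
    for row in table:
        new_row = [v for idx, v in enumerate(row) if idx != drop]
        # After removing `drop`, columns to its right shift left by one.
        target = keep if keep < drop else keep - 1
        new_row[target] += row[drop]
        merged.append(new_row)
    return merged

min_before = min(e for row in expected_frequencies(observed) for e in row)

# Merge the third column (index 2) into the second (index 1).
combined = combine_columns(observed, keep=1, drop=2)
min_after = min(e for row in expected_frequencies(combined) for e in row)

print(round(min_before, 1), combined, round(min_after, 1))
```

Merging reduces the number of columns, so the degrees of freedom (r - 1)(c - 1) must be recomputed from the combined table before looking up the critical value.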
16.5 Chi-Squared test for Normality
• The goodness-of-fit chi-squared test can be used to
determine whether data were drawn from any specified distribution.
• The general procedure:
– Hypothesize on the parameter values of the distribution we test
(e.g. μ = μ0, σ = σ0 for the normal distribution).
– For the variable tested X specify disjoint ranges that cover all its
possible values.
– Build a Chi squared statistic that (aggregately) compares the
expected frequency under H0 and the actual frequency of
observations that fall in each range.
– Run a goodness of fit test based on the multinomial experiment.
• Testing for normality in Example 12.1
For a sample of size n = 50 (see Xm12-01), the sample
mean was 460.38 with a standard deviation of 38.83. Can we
infer from the data provided that this sample was drawn
from a normal distribution with μ = 460.38 and σ =
38.83? Use the 5% significance level.
χ² test for normality
Solution
First let us select z values that define each cell (expected frequency > 5 for each cell.)
z1 = -1; P(z < -1) = p1 = .1587; e1 = np1 = 50(.1587) = 7.94
z2 = 0; P(-1 < z< 0) = p2 = .3413; e2 = np2 = 50(.3413) = 17.07
z3 = 1; P(0 < z < 1) = p3 = .3413; e3 = 17.07
P(z > 1) = p4 = .1587; e4 = 7.94
The cell boundaries are calculated from the corresponding z values
under H0:
z1 = (x1 - 460.38)/38.83 = -1;
x1 = 421.55
The expected frequencies can now be determined for each cell.

Cell                     Probability   Expected frequency
x < 421.55               .1587         e1 = 7.94
421.55 < x < 460.38      .3413         e2 = 17.07
460.38 < x < 499.21      .3413         e3 = 17.07
x > 499.21               .1587         e4 = 7.94
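The cell boundaries and expected frequencies can be reproduced with the standard library's `statistics.NormalDist`. This is a sketch under the lecture's hypothesized values; because it uses the exact normal CDF rather than four-digit table probabilities, the results can differ from the slide values in the second decimal place.

```python
from statistics import NormalDist

n = 50
mu, sigma = 460.38, 38.83          # hypothesized parameters under H0
z_cuts = [-1.0, 0.0, 1.0]          # z values that separate the four cells

# Convert each z value to a boundary on the original scale: x = mu + z * sigma.
x_cuts = [mu + z * sigma for z in z_cuts]

# Cell probabilities are differences of the standard normal CDF
# between consecutive boundaries, with 0 and 1 at the two ends.
std_normal = NormalDist()
cdf_values = [0.0] + [std_normal.cdf(z) for z in z_cuts] + [1.0]
cell_probs = [hi - lo for lo, hi in zip(cdf_values, cdf_values[1:])]

# Expected frequency of each cell: e_i = n * p_i.
expected = [n * p for p in cell_probs]

print([round(x, 2) for x in x_cuts])
print([round(p, 4) for p in cell_probs])
print([round(e, 2) for e in expected])
```

Choosing cuts at z = -1, 0, 1 keeps every expected frequency above 5 for n = 50; wider tails would need fewer or differently placed cuts.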
– The test statistic
Actual frequencies: f1 = 10, f2 = 13, f3 = 19, f4 = 8.
Expected frequencies: e1 = 7.94, e2 = 17.07, e3 = 17.07, e4 = 7.94.
$$\chi^2 = \frac{(10-7.94)^2}{7.94} + \frac{(13-17.07)^2}{17.07} + \frac{(19-17.07)^2}{17.07} + \frac{(8-7.94)^2}{7.94} = 1.72$$
– The rejection region
$$\chi^2 > \chi^2_{\alpha,\,k-1-L}$$
where L is the number of parameters
estimated from the data.
$$\chi^2_{\alpha,\,k-3} = \chi^2_{.05,\,4-3} = 3.84146$$
Conclusion: Since 1.72 < 3.84146, there is insufficient evidence to conclude
at the 5% significance level that the data are not normally
distributed.
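The normality test can be retraced numerically. This is a minimal sketch using the frequencies from the lecture; the p-value line uses the identity P(χ² > x) = erfc(√(x/2)), which holds only for 1 degree of freedom, and the critical value is the one quoted above.

```python
import math

observed = [10, 13, 19, 8]                 # actual frequencies f_1..f_4
expected = [7.94, 17.07, 17.07, 7.94]      # expected frequencies e_1..e_4

# Test statistic: sum over the cells of (f_i - e_i)^2 / e_i.
chi2_stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))

k = len(observed)
params_estimated = 2                       # L: mean and standard deviation
df = k - 1 - params_estimated              # 4 - 1 - 2 = 1

# Upper-tail probability for exactly 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2_stat / 2))

critical_value = 3.84146                   # value quoted for alpha = .05, 1 d.f.
reject_h0 = chi2_stat > critical_value

print(round(chi2_stat, 2), df, round(p_value, 2), reject_h0)
```

Subtracting the two estimated parameters from the degrees of freedom is what separates this test from the plain goodness-of-fit test with fully specified probabilities.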