Chapter 1: Statistics
Chapter 11: Applications of Chi-Square
Chapter Goals
• Investigate two tests: multinomial
experiment, and the contingency table.
• Compare experimental results with expected
results to determine
(1) Preferences
(2) Independence
(3) Homogeneity
• Enumerative data: data that is placed in
categories and counted.
11.1: Chi-Square Statistic
• In many problems the data is categorized and the results
are shown by way of counts.
• Results are often displayed on a chart
showing the number of observations for
each possible category.
Background:
1. Suppose there are n observations.
2. Each observation falls into a cell (or class).
3. Observed frequencies in each cell: O1, O2, O3, … , Ok.
Sum of the observed frequencies is n:
O1 + O2 + O3 + … + Ok = n
4. Expected, or theoretical, frequencies: E1, E2, E3, … , Ek.
E1 + E2 + E3 + … + Ek = n
Summary of notation (k categories):
Category             1st   2nd   3rd   …   kth   Total
Observed Frequency   O1    O2    O3    …   Ok    n
Expected Frequency   E1    E2    E3    …   Ek    n
Goal:
1. Compare the observed frequencies with the expected
frequencies.
2. Decide whether the observed frequencies seem to agree or
seem to disagree with the expected frequencies.
Methodology:
Use a chi-square statistic:
$$\chi^{2*} = \sum_{\text{all cells}} \frac{(O - E)^2}{E}$$
Small values of χ²: Observed frequencies close to expected
frequencies.
Large values of χ²: Observed frequencies do not agree with
expected frequencies.
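A minimal Python sketch of the statistic may make the "small versus large" reading concrete. The function name `chi_square_statistic` and the counts below are hypothetical, chosen only for illustration.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of cells")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for k = 4 cells and n = 80 observations.
expected = [20, 20, 20, 20]
close_fit = [21, 19, 20, 20]   # observed close to expected
poor_fit = [35, 5, 30, 10]     # observed far from expected

small_value = chi_square_statistic(close_fit, expected)
large_value = chi_square_statistic(poor_fit, expected)
print(round(small_value, 2), round(large_value, 2))
```

The first set of counts gives a value near zero, while the second gives a much larger one.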
Sampling Distribution of χ²*:
When n is large and all expected frequencies are greater than
or equal to 5, then χ²* has approximately a chi-square
distribution.
Recall:
Properties of the Chi-Square Distribution:
1. χ² is nonnegative in value; it is zero or positively valued.
2. χ² is not symmetrical; it is skewed to the right.
3. χ² is distributed so as to form a family of distributions, a
separate distribution for each different number of degrees
of freedom.
Various Chi-Square Distributions:
[Figure: chi-square density curves for df = 1, 4, 10, and 20, plotted against χ² from 0 to 25.]
Critical values for chi-square:
1. Table 8, Appendix B.
2. Identified by degrees of freedom (df) and the area under
the curve to the right of the critical value.
3. χ²(df, α): critical value of a chi-square distribution with df
degrees of freedom and α area to the right.
4. Chi-square distribution is not symmetrical: critical values
associated with right and left tails are given separately.
[Figure: chi-square curve with area α shaded to the right of χ²(df, α).]
Example: Find χ²(16, 0.05).
[Figure: chi-square curve with area 0.05 shaded to the right of χ²(16, 0.05).]
Portion of Table 8:
       Area to the right
df     0.05
16     26.3
χ²(16, 0.05) = 26.3
Example: Find χ²(10, 0.99).
[Figure: chi-square curve with area 0.99 to the right of χ²(10, 0.99).]
Portion of Table 8:
       Area to the right
df     0.99
10     2.56
χ²(10, 0.99) = 2.56
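When Table 8 is not at hand, the same critical values can be approximated numerically. The sketch below is one possible approach, not the method behind the table: it assumes integer df, computes the right-tail area from a standard recurrence for the incomplete gamma function, and finds the critical value by bisection. The names `chi2_right_tail` and `chi2_critical` are hypothetical.

```python
import math


def chi2_right_tail(x, df):
    """P(chi-square with integer df exceeds x)."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Closed-form starting point: df = 1 uses erfc, df = 2 uses exp.
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # Recurrence Q(a + 1, y) = Q(a, y) + y^a e^(-y) / Gamma(a + 1).
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q


def chi2_critical(df, alpha):
    """Value with area alpha to its right, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_right_tail(hi, df) > alpha:   # widen until the root is bracketed
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_right_tail(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


crit_16 = chi2_critical(16, 0.05)
crit_10 = chi2_critical(10, 0.99)
print(round(crit_16, 1), round(crit_10, 2))
```

The two results should agree with the table entries in the examples above to the precision the table reports.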
Note:
1. When df > 2, the mean value of the chi-square distribution
is df.
2. The mean is located to the right of the mode (the value
where the curve reaches its high point) and just to the right
of the median (the value that splits the distribution, 50%
on either side).
[Figure: chi-square curve marking the mode, the median, and the mean (equal to df) along the χ² axis.]
Note:
1. There is a separate chi-square distribution for each degree
of freedom, df.
2. Assumptions for this chi-square test:
a. Information is obtained from a random sample.
b. Each observation is classified according to the
categorical variable(s) involved in the test.
3. Categorical Variable: a variable that classifies or
categorizes each individual into exactly one of several cells
or classes; these cells or classes are all inclusive and
mutually exclusive.
4. The null and alternative hypotheses are stated liberally;
they are not simply statements about population parameters.
11.2: Inferences Concerning Multinomial Experiments
• Examine the testing procedure for
multinomial experiments.
• Do the observed frequencies match the
expected frequencies?
• Hypothesis test is based on the χ²* statistic.
Multinomial Experiment:
An experiment with the following characteristics:
1. It consists of n identical independent trials.
2. The outcome of each trial fits into exactly one of k possible
cells.
3. There is a probability associated with each particular cell,
and these individual probabilities remain constant during
the experiment. (It must be true that p1 + p2 + … + pk = 1.)
4. The experiment will result in a set of observed frequencies,
O1, O2, . . . , Ok, where each Oi is the number of times a
trial outcome falls into that particular cell.
(It must be the case that O1 + O2 + … + Ok = n.)
Testing Procedure:
1. H0: The probabilities p1, p2, . . . , pk are correct.
Ha: At least two probabilities are incorrect.
Allow for liberal interpretation of H0 and Ha.
2. Test statistic:
$$\chi^{2*} = \sum_{\text{all cells}} \frac{(O - E)^2}{E}$$
3. Use a one-tailed critical region; the right-hand tail.
4. Degrees of freedom: df = k - 1.
5. Expected frequencies: Ei = n · pi
6. To ensure a good approximation to the chi-square
distribution: Each expected frequency should be at least 5
(Ei ≥ 5).
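The bookkeeping in the testing procedure (expected frequencies, degrees of freedom, and the at-least-5 rule) can be sketched in a few lines of Python. The helper name `multinomial_setup` and the probabilities used here are hypothetical.

```python
def multinomial_setup(n, probabilities):
    """Return (expected frequencies, df, rule_ok) for a multinomial test."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("cell probabilities must sum to 1")
    expected = [n * p for p in probabilities]   # E_i = n * p_i
    df = len(probabilities) - 1                 # df = k - 1
    rule_ok = all(e >= 5 for e in expected)     # each E_i should be at least 5
    return expected, df, rule_ok


# Hypothetical experiment: n = 60 trials, k = 3 cells.
expected, df, rule_ok = multinomial_setup(60, [0.5, 0.3, 0.2])
print(expected, df, rule_ok)
```

If `rule_ok` comes back false, the chi-square approximation may be poor for that design.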
Example: A market research firm conducted a consumer-preference
experiment to determine which of 5 new breakfast cereals was the
most appealing to adults. Each consumer in a sample of 100 tried
every cereal and indicated the one he or she preferred. The
results are given in the following table:
Cereal      A    B    C    D    E    Total
Frequency   25   17   15   22   21   100
Is there any evidence to suggest the consumers had a
preference for one cereal, or did they indicate each cereal was
equally likely to be selected? Use α = 0.05.
Solution:
If no preference was shown, we expect the 100 consumers to be
equally distributed among the 5 cereals. Thus, if no preference is
given, we expect (100)(0.2) = 20 consumers in each class.
1. The Set-up:
a. Population parameter of concern: Preference for each
cereal, the probability that a particular cereal is selected.
b. The null and alternative hypotheses:
H0: There was no preference shown (equally distributed).
Ha: There was a preference shown (not equally distributed).
2. The Hypothesis Test Criteria:
a. Assumptions: The 100 consumers represent a random sample.
b. Test statistic: χ²* with df = k - 1 = 5 - 1 = 4
c. Level of significance: α = 0.05.
3. The Sample Evidence:
a. Sample information: Table given in the statement of the
problem.
b. Calculate the value of the test statistic:
O      E      O - E   (O - E)²/E
25     20      5      1.25
17     20     -3      0.45
15     20     -5      1.25
22     20      2      0.20
21     20      1      0.05
100    100     0      3.20
χ²* = 3.20
4. The Probability Distribution (Classical Approach):
a. Critical value: χ²(k - 1, 0.05) = χ²(4, 0.05) = 9.49
b. χ²* is not in the critical region.
4. The Probability Distribution (p-Value Approach):
a. The p-value: P = P(χ²* > 3.2 | df = 4).
Using computer: P = 0.5249. Using Table 8: P > 0.5
b. The p-value is larger than the level of significance, α.
5. The Results:
a. Decision: Fail to reject H0.
b. Conclusion: At the 0.05 level of significance, there is no
evidence to suggest the consumers showed a preference
for any one cereal.
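The cereal calculation can be checked with a short Python sketch. It relies on one fact not stated on the slides: for df = 4 the right-tail area of the chi-square distribution has the closed form exp(-x/2)(1 + x/2). The variable names are illustrative.

```python
import math

observed = [25, 17, 15, 22, 21]          # cereals A to E, from the table
n = sum(observed)
expected = [n * 0.2] * len(observed)     # no preference: p = 0.2 per cereal

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# For df = 4 the right-tail area has the closed form exp(-x/2) * (1 + x/2).
p_value = math.exp(-chi_sq / 2) * (1 + chi_sq / 2)

alpha = 0.05
reject_h0 = p_value <= alpha
print(round(chi_sq, 2), df, round(p_value, 4), reject_h0)
```

Because the p-value is well above α, the sketch reaches the same decision as the example: fail to reject H0.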
Example: A sample of 200 individuals was tested for blood
type, and the results are used to test the hypothesized
distribution of blood types:
Blood Type   Hypothesized Proportion   Observed Frequency
A            0.41                      74
B            0.09                      25
O            0.46                      86
AB           0.04                      15
At the 0.05 level of significance, is there any evidence to
suggest the stated distribution is incorrect?
Solution:
1. The Set-up:
a. Population parameters of concern: The proportions:
P(A), P(B), P(O), P(AB).
b. The null and alternative hypotheses:
H0: Blood type proportions are 0.41, 0.09, 0.46, 0.04
Ha: Blood type proportions are not 0.41, 0.09, 0.46, 0.04
2. The Hypothesis Test Criteria:
a. Assumptions: The 200 individuals tested form a random
sample.
b. Test statistic: χ²*, df = 4 - 1 = 3
c. Significance level: α = 0.05
3. The Sample Evidence:
a. Sample information: Table given in the statement of the
problem.
b. Calculate the value of the test statistic:
O      E      O - E   (O - E)²/E
74     82     -8      0.78
25     18      7      2.72
86     92     -6      0.39
15      8      7      6.13
200    200     0      10.02
χ²* = 10.02
4. The Probability Distribution (Classical Approach):
a. Critical value: χ²(3, 0.05) = 7.82
b. χ²* is in the critical region.
4. The Probability Distribution (p-Value Approach):
a. The p-value: P = P(χ²* > 10.02 | df = 3)
By computer: P = 0.0184. Table 8: 0.01 < P < 0.025
b. The p-value is smaller than the level of significance, α.
5. The Results:
a. Decision: Reject H0.
b. Conclusion: There is evidence to suggest the
hypothesized proportions for blood types are incorrect.
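The blood-type test can be sketched the same way, this time with unequal hypothesized proportions. The p-value line uses a closed form for df = 3 that is an outside fact, not something taken from the slides: P(χ² > x) = erfc(√(x/2)) + √(2x/π)·exp(-x/2).

```python
import math

proportions = {"A": 0.41, "B": 0.09, "O": 0.46, "AB": 0.04}   # hypothesized
observed = {"A": 74, "B": 25, "O": 86, "AB": 15}
n = sum(observed.values())

expected = {t: n * p for t, p in proportions.items()}          # E_i = n * p_i
chi_sq = sum((observed[t] - expected[t]) ** 2 / expected[t] for t in observed)
df = len(observed) - 1

# For df = 3: P(chi-square > x) = erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
p_value = (math.erfc(math.sqrt(chi_sq / 2))
           + math.sqrt(2 * chi_sq / math.pi) * math.exp(-chi_sq / 2))

print(round(chi_sq, 2), df, round(p_value, 4), p_value <= 0.05)
```

The smallest expected frequency here is 8 (type AB), so the at-least-5 guideline is met.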
11.3: Inference Concerning Contingency Tables
• Contingency table: an arrangement of data
into a two-way classification.
• Data is sorted into cells, and the observed
frequency in each cell is reported.
• A contingency table involves two factors, or
variables.
• Usual question: are the two variables
independent or dependent?
r × c Contingency Table:
1. r: number of rows; c: number of columns.
2. Used to test the independence of the row factor and the
column factor.
3. Degrees of freedom: df = (r - 1)(c - 1)
4. n = grand total.
5. Expected frequency in the ith row and the jth column:
$$E_{i,j} = \frac{\text{Row total} \times \text{Column total}}{\text{Grand total}} = \frac{R_i \times C_j}{n}$$
Each Ei,j should be at least 5.
6. R1, R2, . . . , Rr and C1, C2, . . . , Cc: marginal totals.
Expected Frequencies for an r × c Contingency Table:
                                 Columns
Rows       1             2             …   jth Column    …   c   Total
1          R1 × C1 / n   R1 × C2 / n   …   R1 × Cj / n   …       R1
2          R2 × C1 / n                                           R2
⋮
ith Row    Ri × C1 / n                 …   Ri × Cj / n   …       Ri
⋮
r
Total      C1            C2            …   Cj            …       n
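As a rough illustration of the expected-frequency rule, the Python sketch below builds the full table of expected counts from the marginal totals. The function name `expected_frequencies` and the 2 × 3 counts are hypothetical.

```python
def expected_frequencies(observed):
    """E[i][j] = (row total i) * (column total j) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, df


# Hypothetical 2 x 3 table of counts.
table = [[20, 30, 50],
         [30, 20, 50]]
expected, df = expected_frequencies(table)
print(expected, df)
```

Each row of expected counts sums to the same row total as the observed counts, which is a quick way to check the arithmetic.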
Example: A random sample of registered voters was selected
and each was asked his or her opinion on Proposal 129, a
property tax reform bill. The distribution of responses is
given in the table below.
                    Political Party
Tax Reform   Democrat   Republican   Independent
Yes          34         11           12
No           17         12           18
Unsure       10         16           15
Test the hypothesis “political party is independent of opinion
on Proposal 129.” Use α = 0.01.
Solution:
1. The Set-up:
a. Population parameters of concern: The independence of
variables “political party” and “opinion on tax reform.”
b. The null and the alternative hypotheses:
H0: Opinion on property tax reform is independent of
political party.
Ha: Opinion on property tax reform is not independent of
political party.
2. The Hypothesis Test Criteria:
a. Assumptions: The information was obtained
from a random sample in which each individual was
classified according to political party and tax reform
preference.
b. Test statistic:
χ²* with df = (r - 1)(c - 1) = (3 - 1)(3 - 1) = 4
c. Level of significance: α = 0.01.
3. The Sample Evidence:
a. Sample information: Table given in the statement of the
problem.
b. Calculate the value of the test statistic:
Table with observed frequencies, expected frequencies,
and the test statistic given on the next slide.
Contingency table showing sample results and expected
values (in parentheses):
                         Political Party
Tax Reform   Democrat     Republican   Independent   Total
Yes          34 (23.98)   11 (15.33)   12 (17.69)    57
No           17 (19.77)   12 (12.64)   18 (14.59)    47
Unsure       10 (17.25)   16 (11.03)   15 (12.72)    41
Total        61           39           45            145
$$\chi^{2*} = \sum_{\text{all cells}} \frac{(O - E)^2}{E} = 14.16$$
4. The Probability Distribution (Classical Approach):
a. Critical value: χ²(4, 0.01) = 13.3
b. χ²* is in the critical region.
4. The Probability Distribution (p-Value Approach):
a. The p-value: P = P(χ²* > 14.16 | df = 4)
By computer: P = 0.0068. Table 8: 0.005 < P < 0.01
b. The p-value is smaller than the level of significance, α.
5. The Results:
a. Decision: Reject H0.
b. Conclusion: There is evidence to suggest that opinion on
tax reform and political party are not independent.
Note: Minitab output for the previous Example.
Chi-Square Test
Expected counts are printed below observed counts
            Dem      Rep      Ind    Total
1            34       11       12       57
          23.98    15.33    17.69
2            17       12       18       47
          19.77    12.64    14.59
3            10       16       15       41
          17.25    11.03    12.72
Total        61       39       45      145
Chi-Sq = 4.188 + 1.224 + 1.830 +
         0.389 + 0.033 + 0.799 +
         3.046 + 2.242 + 0.407 = 14.156
DF = 4, P-Value = 0.007
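The per-cell contributions and the total for the voter example can be reproduced with a short Python sketch. It is not Minitab's implementation. It applies E = Ri × Cj / n cell by cell and, as an outside fact, uses the closed form exp(-x/2)(1 + x/2) for the right-tail area when df = 4.

```python
import math

# Rows: Yes, No, Unsure; columns: Democrat, Republican, Independent.
observed = [[34, 11, 12],
            [17, 12, 18],
            [10, 16, 15]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Per-cell contributions (O - E)^2 / E with E = R_i * C_j / n.
contributions = [
    [(o - r * c / n) ** 2 / (r * c / n) for o, c in zip(row, col_totals)]
    for row, r in zip(observed, row_totals)
]
chi_sq = sum(sum(row) for row in contributions)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Closed form for the right-tail area when df = 4.
p_value = math.exp(-chi_sq / 2) * (1 + chi_sq / 2)
print(round(chi_sq, 3), df, round(p_value, 4), p_value <= 0.01)
```

The largest contributions come from the Democrat column (the Yes and Unsure rows), which is where observed and expected counts differ most.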
Test for Homogeneity:
1. Another type of contingency table problem.
2. Used when one of the two variables is controlled by the
experimenter so that the row (or column) totals are
predetermined.
3. Hypothesis test: the distribution of proportions within rows
(or columns) is the same for all rows (or columns).
4. May be thought of as a comparison of several multinomial
experiments.
5. Test procedure for independence and homogeneity with
contingency tables is the same.
Example: A pharmaceutical company conducted an
experiment to determine the effectiveness of three new cough
suppressants. Each cough syrup was given to 100 randomly
selected subjects.
                    Cough Suppressant
                 A      B      C      Total
No relief        23     29     20     72
Some relief      60     56     50     166
Total relief     17     15     30     62
Total            100    100    100    300
Is there any evidence to suggest the syrups act differently to
suppress coughs? Use α = 0.05.
Solution:
1. The Set-up:
a. Population parameters of concern: The proportion of
individuals who receive no relief, some relief, or total
relief for each of the three cough syrups.
b. The null and alternative hypotheses:
H0: The proportion of individuals who receive various
forms of relief is the same for all three cough syrups.
Ha: The proportion of individuals who receive various
forms of relief is not the same for all three cough syrups.
(In at least one group the proportions are different from
the others.)
2. The Hypothesis Test Criteria:
a. Assumptions: The sample information was obtained
using three random samples drawn from three separate
populations in which each individual was classified
according to cough suppressant and relief.
b. Test statistic:
χ²* with df = (r - 1)(c - 1) = (3 - 1)(3 - 1) = 4
c. Level of significance: α = 0.05.
3. The Sample Evidence:
a. Sample information: Table given in the statement of the
problem.
b. Calculate the value of the test statistic:
A portion of the Minitab output:
              A        B        C    Total
1            23       29       20       72
          24.00    24.00    24.00
2            60       56       50      166
          55.33    55.33    55.33
3            17       15       30       62
          20.67    20.67    20.67
Total       100      100      100      300
Chi-Sq = 0.042 + 1.042 + 0.667 +
         0.394 + 0.008 + 0.514 +
         0.651 + 1.554 + 4.215 = 9.085
DF = 4, P-Value = 0.059
4. The Probability Distribution (Classical Approach):
a. Critical value: χ²(4, 0.05) = 9.49
b. χ²* does not lie in the critical region.
4. The Probability Distribution (p-Value Approach):
a. The p-value: P = P(χ²* > 9.085 | df = 4)
By computer: P = 0.059. Table 8: 0.05 < P < 0.10
b. The p-value is larger than the level of significance, α.
5. The Results:
a. Decision: Fail to reject H0.
b. Conclusion: There is no evidence to suggest the three
remedies act differently to suppress coughs.
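For the homogeneity test, a Python sketch can make the "same proportions for every syrup" idea explicit. It pools the three samples to estimate the common proportions, builds the expected counts from them, and compares the statistic with the critical value 9.49 quoted in the example. The variable names are illustrative.

```python
# Rows: no relief, some relief, total relief; columns: syrups A, B, C.
observed = [[23, 29, 20],
            [60, 56, 50],
            [17, 15, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]   # fixed at 100 per syrup
n = sum(row_totals)

# Under H0 every syrup shares the pooled proportions row_total / n.
pooled = [r / n for r in row_totals]

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = pooled[i] * col_totals[j]               # same value as R_i * C_j / n
        chi_sq += (o - e) ** 2 / e

critical_value = 9.49                               # chi-square(4, 0.05), from the example
reject_h0 = chi_sq >= critical_value
print(round(chi_sq, 3), reject_h0)
```

The expected counts are the same ones an independence test would use, which is why the two procedures share a single computation.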