Chi-Square Test


Week 10
Nov 3-7
Two Mini-Lectures
QMM 510
Fall 2014
ML 10.1
Chapter Contents
15.1 Chi-Square Test for Independence
15.2 Chi-Square Tests for Goodness-of-Fit
15.3 Uniform Goodness-of-Fit Test
15.4 Poisson Goodness-of-Fit Test
15.5 Normal Chi-Square Goodness-of-Fit Test
15.6 ECDF Tests (Optional)
Chapter 15
Chi-Square Tests
Chapter 15
Chi-Square Test for Independence
Contingency Tables
•
A contingency table is a cross-tabulation of n paired observations into
categories.
•
Each cell shows the count of observations that fall into the
category defined by its row (r) and column (c) heading.
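As a minimal sketch of the cross-tabulation described above, the Python below tallies paired observations into an r x c table of counts. The data and the helper name cross_tabulate are hypothetical and are used only for illustration.

```python
from collections import Counter

# Hypothetical paired observations: (row category, column category).
pairs = [
    ("Male", "Yes"), ("Male", "No"), ("Female", "Yes"),
    ("Female", "Yes"), ("Male", "Yes"), ("Female", "No"),
    ("Male", "No"), ("Female", "Yes"),
]

def cross_tabulate(pairs):
    """Return (row_labels, col_labels, table), where table[j][k] is a cell count."""
    counts = Counter(pairs)
    row_labels = sorted({r for r, _ in pairs})
    col_labels = sorted({c for _, c in pairs})
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table

row_labels, col_labels, table = cross_tabulate(pairs)
for label, row in zip(row_labels, table):
    print(label, row)
```

Each cell count is the number of pairs that share that row and column heading, so the cell counts sum to n.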
Chi-Square Test
•
In a test of independence for an r x c contingency table, the hypotheses
are
H0: Variable A is independent of variable B
H1: Variable A is not independent of variable B
•
Use the chi-square test for independence to test these hypotheses.
•
This nonparametric test is based on frequencies.
•
The n data pairs are classified into c columns and r rows and then the
observed frequency fjk is compared with the expected frequency ejk.
Chi-Square Distribution
•
The critical value comes from the chi-square probability distribution
with d.f. degrees of freedom.
where
d.f. = degrees of freedom = (r – 1)(c – 1)
r = number of rows in the table
c = number of columns in the table
•
Appendix E contains critical values for right-tail areas of the chi-square
distribution, or use Excel’s =CHISQ.INV.RT(α,d.f.).
•
The mean of a chi-square distribution is d.f. with variance 2d.f.
Consider the shape of the chi-square distribution: it is skewed right, becoming more symmetric as d.f. increases.
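For readers without a table or spreadsheet at hand, the sketch below computes chi-square right-tail areas and critical values using only the Python standard library. The function names chi2_right_tail and chi2_critical are hypothetical; the approach (an incomplete-gamma recurrence plus bisection) is one possible way to do this and assumes integer degrees of freedom.

```python
import math

def chi2_right_tail(x, df):
    """Right-tail area P(X > x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Closed-form starting values: df = 2 (even case) or df = 1 (odd case).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    # Incomplete-gamma recurrence: Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q

def chi2_critical(alpha, df, tol=1e-9):
    """Right-tail critical value found by bisection on chi2_right_tail."""
    lo, hi = 0.0, 1.0
    while chi2_right_tail(hi, df) > alpha:   # expand until the tail area is at most alpha
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if chi2_right_tail(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

crit = chi2_critical(0.05, 6)
print(round(crit, 2))                        # 12.59
print(round(chi2_right_tail(crit, 6), 4))    # 0.05
```

The right-tail area is what a p-value lookup returns; the critical value is its inverse for a chosen α.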
Expected Frequencies
•
Assuming that H0 is true, the expected frequency of row j and column k
is:
ejk = RjCk/n
where
Rj = total for row j (j = 1, 2, …, r)
Ck = total for column k (k = 1, 2, …, c)
n = sample size
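The formula ejk = RjCk/n can be sketched in a few lines of Python. The observed table below is hypothetical (it is not from the lecture), and the function name expected_frequencies is an illustrative choice.

```python
def expected_frequencies(observed):
    """Expected counts e_jk = R_j * C_k / n under the hypothesis of independence."""
    row_totals = [sum(row) for row in observed]          # R_j
    col_totals = [sum(col) for col in zip(*observed)]    # C_k
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Hypothetical 2 x 3 table of observed counts.
observed = [[20, 30, 50],
            [30, 30, 40]]
expected = expected_frequencies(observed)
for row in expected:
    print(row)
```

Here both row totals are 100 and the column totals are 50, 60, and 90, so each row of expected counts is 25, 30, 45.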
Steps in Testing the Hypotheses
• Step 1: State the Hypotheses
•
H0: Variable A is independent of variable B
H1: Variable A is not independent of variable B
• Step 2: Specify the Decision Rule
•
Calculate d.f. = (r – 1)(c – 1)
•
For a given α, look up the right-tail critical value (χ²R) from
Appendix E or by using Excel =CHISQ.INV.RT(α,d.f.).
•
Reject H0 if the test statistic exceeds χ²R.
•
For example, for d.f. = 6 and α = .05, χ².05 = 12.59.
•
The rejection region is the right tail of the chi-square distribution beyond the critical value (here, 12.59).
• Step 3: Calculate the Expected Frequencies
ejk = RjCk/n
• Step 4: Calculate the Test Statistic
•
The chi-square test statistic is
$$\chi^2_{calc} = \sum_{j=1}^{r} \sum_{k=1}^{c} \frac{(f_{jk} - e_{jk})^2}{e_{jk}}$$
• Step 5: Make the Decision
•
Reject H0 if the test statistic χ²calc > χ²R or if the p-value ≤ α.
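The five steps can be sketched end to end as follows. This is an illustrative implementation under stated assumptions, not the lecture's software: the observed table is hypothetical, the function names are invented here, and the p-value helper assumes integer degrees of freedom.

```python
import math

def chi2_right_tail(x, df):
    """Right-tail area of the chi-square distribution (integer df >= 1)."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2.0:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q

def independence_test(observed, alpha=0.05):
    """Return (chi2_calc, df, p_value, reject_H0) for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for j, row in enumerate(observed):
        for k, f in enumerate(row):
            e = row_totals[j] * col_totals[k] / n     # expected frequency e_jk
            chi2 += (f - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    p_value = chi2_right_tail(chi2, df)
    return chi2, df, p_value, p_value <= alpha

# Hypothetical 2 x 3 table of observed counts.
chi2, df, p_value, reject = independence_test([[20, 30, 50], [30, 30, 40]])
print(round(chi2, 3), df, round(p_value, 4), reject)
```

For this table the statistic is about 3.11 with d.f. = 2, which is below the .05 critical value of 5.991, so independence is not rejected.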
Example: MegaStat
All cells have ejk ≥ 5, so Cochran’s Rule is met.
Caution: Don’t highlight row or column totals.
p-value = 0.2154 is not small enough to reject the hypothesis of independence at α = .05.
Test of Two Proportions
• For a 2 × 2 contingency table, the chi-square test is equivalent to a two-tailed z test for two proportions.
• The hypotheses are:
H0: π1 = π2
H1: π1 ≠ π2
Figure 14.6
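The equivalence for a 2 × 2 table can be checked numerically: the squared z statistic from a pooled two-proportion test equals the chi-square statistic, and the two p-values agree. The counts below are hypothetical, and the sketch assumes the usual pooled-proportion form of the z test.

```python
import math

# Hypothetical counts: successes and sample sizes for two groups.
x1, n1 = 30, 100
x2, n2 = 45, 100

# Two-tailed z test for two proportions, using the pooled proportion.
p1, p2 = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)
z = (p1 - p2) / math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
p_value_z = math.erfc(abs(z) / math.sqrt(2))      # two-tailed normal p-value

# Chi-square test on the equivalent 2 x 2 table (successes, failures).
observed = [[x1, n1 - x1], [x2, n2 - x2]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
chi2 = sum(
    (observed[j][k] - row_totals[j] * col_totals[k] / n) ** 2
    / (row_totals[j] * col_totals[k] / n)
    for j in range(2) for k in range(2)
)
p_value_chi2 = math.erfc(math.sqrt(chi2 / 2))     # right-tail area, d.f. = 1

print(round(z ** 2, 4), round(chi2, 4))           # 4.8 4.8
```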
Small Expected Frequencies
• The chi-square test is unreliable if the expected frequencies are
too small.
• Rules of thumb:
• Cochran’s Rule requires that ejk ≥ 5 for all cells.
• A more lenient rule allows up to 20% of the cells to have ejk < 5.
• Most agree that a chi-square test is infeasible if ejk < 1 in any cell.
• If this happens, try combining adjacent rows or columns to enlarge the
expected frequencies.
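A small sketch of how these rules of thumb might be checked, and how combining adjacent columns enlarges the expected frequencies. The table is hypothetical, and check_expected and combine_columns are illustrative helper names rather than functions from any package.

```python
def expected_frequencies(observed):
    """Expected counts e_jk = R_j * C_k / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def check_expected(expected):
    """Summarize the rules of thumb for small expected frequencies."""
    cells = [e for row in expected for e in row]
    return {
        "min_expected": min(cells),
        "share_below_5": sum(1 for e in cells if e < 5) / len(cells),
        "cochran_ok": all(e >= 5 for e in cells),
        "infeasible": any(e < 1 for e in cells),
    }

def combine_columns(observed, k):
    """Merge column k with column k + 1 (two adjacent categories)."""
    return [row[:k] + [row[k] + row[k + 1]] + row[k + 2:] for row in observed]

observed = [[12, 9, 3], [18, 11, 7]]            # hypothetical counts
before = check_expected(expected_frequencies(observed))
merged = combine_columns(observed, 1)           # merge the last two columns
after = check_expected(expected_frequencies(merged))
print(before)
print(after)
```

In this example one cell has an expected frequency of 4 before merging; after merging the last two columns every expected frequency is at least 12.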
Cross-Tabulating Raw Data
•
Chi-square tests for independence can also be used to analyze
quantitative variables by coding them into categories.
•
For example, the variables Infant Deaths per 1,000 and Doctors
per 100,000 can each be coded into various categories.
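The coding step might look like the sketch below. All data values, cut points, and category labels are hypothetical and chosen only to show the mechanics; with so few observations the resulting table would not satisfy Cochran's Rule.

```python
from collections import Counter

def code_value(x, cut_points, labels):
    """Label x by the first cut point it falls below; the last label is open-ended."""
    for cut, label in zip(cut_points, labels):
        if x < cut:
            return label
    return labels[-1]

labels = ["Low", "Med", "High"]
infant_deaths = [5.1, 6.8, 7.9, 9.2, 6.1, 8.4]   # hypothetical, per 1,000
doctors = [310, 250, 210, 180, 290, 200]          # hypothetical, per 100,000

# Code each quantitative variable into categories, then cross-tabulate the pairs.
rows = [code_value(x, [6.5, 8.0], labels) for x in infant_deaths]
cols = [code_value(x, [220, 280], labels) for x in doctors]
counts = Counter(zip(rows, cols))
table = [[counts[(r, c)] for c in labels] for r in labels]

for label, row in zip(labels, table):
    print(label, row)
```

The resulting table of counts can then be analyzed with the same chi-square test for independence used for categorical data.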
Why Do a Chi-Square Test on Numerical Data?
•
The researcher may believe there’s a relationship between X
and Y, but doesn’t want to use regression.
•
There are outliers or anomalies that prevent us from assuming
that the data came from a normal population.
•
The researcher has numerical data for one variable but not
the other.
3-Way Tables and Higher
•
More than two variables can be compared using contingency
tables.
•
However, it is difficult to visualize a higher-order table.
•
For example, you could visualize a cube as a stack of tiled 2-way
contingency tables.
•
Major computer packages permit three-way tables.
Purpose of the Test
•
The goodness-of-fit (GOF) test helps you decide whether your
sample resembles a particular kind of population.
•
The chi-square test is versatile and easy to understand.
Hypotheses for GOF tests:
•
The hypotheses are:
H0: The population follows a _____ distribution
H1: The population does not follow a ______ distribution
•
The blank may contain the name of any theoretical distribution (e.g.,
uniform, Poisson, normal).
Chapter 15
Chi-Square Tests for Goodness-of-Fit
ML 10.2
Test Statistic and Degrees of Freedom for GOF
•
Assuming n observations, the observations are grouped into c classes
and then the chi-square test statistic is found using:
$$\chi^2_{calc} = \sum_{j=1}^{c} \frac{(f_j - e_j)^2}{e_j}$$
where
fj = the observed frequency of observations in class j
ej = the expected frequency in class j if the sample
came from the hypothesized population
•
If the proposed distribution gives a good fit to the sample, the test
statistic will be near zero.
•
The test statistic follows the chi-square distribution with degrees of
freedom d.f. = c – m – 1, where c is the number of classes used in the
test and m is the number of parameters estimated.
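As a minimal sketch of the GOF calculation, the example below tests a uniform distribution over four classes. The observed counts are hypothetical, the function name gof_statistic is illustrative, and no parameters are estimated (m = 0), so d.f. = c – 1 = 3. The critical value 7.815 is the standard table value for α = .05 and d.f. = 3.

```python
def gof_statistic(observed, expected, params_estimated=0):
    """Return (chi2_calc, df) where df = c - m - 1."""
    chi2 = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
    df = len(observed) - params_estimated - 1
    return chi2, df

observed = [18, 25, 22, 15]                        # hypothetical class counts
n = sum(observed)
expected = [n / len(observed)] * len(observed)     # uniform: n/c in each class
chi2, df = gof_statistic(observed, expected)

critical = 7.815                                   # chi-square table, alpha = .05, d.f. = 3
reject = chi2 > critical
print(round(chi2, 2), df, reject)
```

A statistic near zero indicates a good fit; here it is 2.9, well below the critical value, so the uniform hypothesis is not rejected.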
Chapter 15
Normal Chi-Square GOF Test
Is the Sample from a Normal Population?
•
Many statistical tests assume a normal population, so this is the most
common GOF test.
•
Two parameters, the mean μ and the standard deviation σ, fully
describe a normal distribution.
•
Unless μ and σ are known a priori, they must be estimated from a
sample in order to perform a GOF test for normality.
Method 1: Standardize the Data
•
Transform sample observations x1, x2, …, xn into standardized z-values.
•
Count the sample observations within each interval on the z-scale and
compare them with expected normal frequencies ej.
Problem: Frequencies will be small in the end bins yet large in the
middle bins (this may violate Cochran’s Rule and seems inefficient).
Method 2: Equal Bin Widths
•
Step 1: Divide the exact data range into c groups of equal
width, and count the sample observations in each bin to get
observed bin frequencies fj.
•
Step 2: Convert the bin limits into standardized z-values using z = (x – x̄)/s, where x̄ and s are the sample mean and standard deviation.
•
Step 3: Find the normal area within each bin assuming a
normal distribution.
•
Step 4: Find expected frequencies ej by multiplying each
normal area by the sample size n.
Problem: Frequencies will be small in the end bins yet large in the
middle bins (this may violate Cochran’s Rule and seems inefficient).
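Method 2 can be sketched with the standard library's statistics.NormalDist. The sample is hypothetical, the function name is illustrative, and the sketch assumes the first and last bins are treated as open-ended so that the normal areas sum to 1.

```python
from statistics import NormalDist, mean, stdev

def equal_width_expected(data, c):
    """Observed and expected normal frequencies for c equal-width bins."""
    n = len(data)
    xbar, s = mean(data), stdev(data)
    lo, hi = min(data), max(data)
    width = (hi - lo) / c
    # Step 1: count the observations in each bin.
    observed = [0] * c
    for x in data:
        j = min(int((x - lo) / width), c - 1)     # the maximum goes in the last bin
        observed[j] += 1
    # Step 2: convert the c - 1 interior bin limits to z-values.
    limits = [lo + i * width for i in range(1, c)]
    z_limits = [(x - xbar) / s for x in limits]
    # Step 3: normal area within each bin (end bins open-ended, an assumption).
    cdf = [0.0] + [NormalDist().cdf(z) for z in z_limits] + [1.0]
    areas = [cdf[i + 1] - cdf[i] for i in range(c)]
    # Step 4: expected frequency = normal area times n.
    expected = [a * n for a in areas]
    return observed, expected

# Hypothetical sample of 18 observations.
data = [52, 55, 58, 60, 61, 63, 64, 65, 66, 67, 68, 70, 71, 73, 75, 78, 80, 84]
observed, expected = equal_width_expected(data, 4)
print(observed)
print([round(e, 2) for e in expected])
```

The printed expected frequencies illustrate the problem noted above: the end bins receive less expected frequency than the middle bins.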
Method 3: Equal Expected Frequencies
• Define histogram bins in such a way that an equal number of
observations would be expected under the hypothesis of a
normal population, i.e., so that ej = n/c.
• A normal area of 1/c is expected in each bin.
• The first and last classes must be open-ended, so to define c bins
we need c-1 cut points.
• Count the observations fj within each bin.
• Compare the fj with the expected frequencies ej = n/c.
Advantage: Makes efficient use of
the sample.
Disadvantage: Cut points on the
z-scale may seem strange.
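A sketch of Method 3, again using statistics.NormalDist. The c – 1 cut points are the standard normal quantiles at 1/c, 2/c, and so on, rescaled by the sample mean and standard deviation. The sample and the function name equal_expected_bins are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, stdev

def equal_expected_bins(data, c):
    """Return (z_cuts, observed, expected) for c bins with equal normal area 1/c."""
    n = len(data)
    xbar, s = mean(data), stdev(data)
    # c - 1 cut points on the z-scale, each leaving area 1/c per bin.
    z_cuts = [NormalDist().inv_cdf(i / c) for i in range(1, c)]
    x_cuts = [xbar + z * s for z in z_cuts]       # the same cut points on the data scale
    observed = [0] * c
    for x in data:
        observed[bisect_right(x_cuts, x)] += 1    # index of the bin containing x
    expected = [n / c] * c
    return z_cuts, observed, expected

# Hypothetical sample of 18 observations.
data = [52, 55, 58, 60, 61, 63, 64, 65, 66, 67, 68, 70, 71, 73, 75, 78, 80, 84]
z_cuts, observed, expected = equal_expected_bins(data, 4)
print([round(z, 4) for z in z_cuts])
print(observed, expected)
```

With c = 4 the z-scale cut points are the quartiles of the standard normal, roughly –0.6745, 0, and 0.6745, which shows why such cut points can look unusual.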
• Standard normal cut points for equal area bins.
Table 15.16
Critical Values for Normal GOF Test
•
Two parameters, μ and σ, are estimated from the sample, so m = 2 and the
degrees of freedom are d.f. = c – m – 1 = c – 3.
•
We need at least four bins to ensure at least one degree of freedom.
Small Expected Frequencies
•
Cochran’s Rule suggests at least ej ≥ 5 in each bin (e.g., with 4 bins
we would want n ≥ 20, and so on).
Visual Tests
•
The fitted normal superimposed on a histogram gives visual
clues as to the likely outcome of the GOF test.
•
A simple “eyeball” inspection of the histogram may suffice to
rule out a normal population by revealing outliers or other nonnormality issues.
ML 10.3
ECDF Tests for Normality
•
There are alternatives to the chi-square test for normality based
on the empirical cumulative distribution function (ECDF).
•
ECDF tests are done by computer. Details are omitted here.
•
A small p-value casts doubt on normality of the population.
•
The Kolmogorov-Smirnov (K-S) test uses the largest absolute
difference between the actual and expected cumulative relative
frequency of the n data values.
•
The Anderson-Darling (A-D) test is based on a probability plot.
When the data fit the hypothesized distribution closely, the
probability plot will be close to a straight line. The A-D test is
widely used because of its power and attractive visual.
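To make the K-S idea concrete, the sketch below computes the largest absolute difference between the empirical and fitted normal cumulative distributions. It is illustrative only: the function name is hypothetical, the normal parameters are estimated from the sample, and no p-value is computed, because with estimated parameters the standard K-S tables do not apply directly and software uses a corrected procedure.

```python
from statistics import NormalDist, mean, stdev

def ks_statistic_normal(data):
    """Largest absolute gap between the ECDF and a normal CDF fitted to the data."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        expected_cdf = fitted.cdf(x)
        # The ECDF jumps from (i - 1)/n to i/n at x, so check both sides of the step.
        d = max(d, abs(i / n - expected_cdf), abs((i - 1) / n - expected_cdf))
    return d

# Hypothetical sample.
data = [52, 55, 58, 60, 61, 63, 64, 65, 66, 67, 68, 70, 71, 73, 75, 78, 80, 84]
print(round(ks_statistic_normal(data), 4))
```

A small value of the statistic means the sample's cumulative relative frequencies track the fitted normal curve closely.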
Chapter 15
ECDF Tests
Example: Minitab’s Anderson-Darling Test for Normality
Data: weights of 80 babies (in ounces)
Near-linear probability plot suggests good fit to normal distribution
p-value = 0.122 is not small enough to reject normal population at α = .05
Example: MegaStat’s Normality Tests
Data: weights of 80 babies (in ounces)
p-value = 0.2487 is not small enough to reject normal population at α = .05 in this chi-square test
Near-linear probability plot suggests good fit to normal distribution
Note: MegaStat’s chi-square test is not as powerful as the A-D test,
so we would prefer the A-D test if software is available. The
MegaStat probability plot is good, but shows no p-value.