Transcript CHAPTER 10
Chi-Squared Distributions
INFERENCE for CATEGORICAL DATA and MULTIPLE SAMPLES
Inference on Categorical Data
Often we wish to work with categorical data rather than numerical data.
To make categorical data meaningful, we have to look at the counts within each category.
However, with counted data, we need to use another type of distribution, so we use the χ² (or chi-squared) distribution.
Properties of the Chi-square Distributions
The chi-square distributions are a family of distributions that take only positive values and are skewed to the right. A specific chi-square distribution is specified by one parameter, called the degrees of freedom.
The chi-square density curves have the following properties:
1. The total area under a chi-square curve is equal to 1.
2. Each chi-square curve (except when df = 1 or df = 2) begins at 0 on the horizontal axis, increases to a peak, and then approaches the horizontal axis asymptotically from above.
3. Each chi-square curve is skewed right. As the number of degrees of freedom increases, the curve becomes more and more symmetrical and looks more like a normal curve.
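The properties above can be checked numerically. The following is a minimal sketch, not part of the original slides: the helper names `chi2_pdf` and `area_under_curve` are made up for illustration, and the sketch relies on two standard facts that the slides do not state, namely that the peak sits at x = df − 2 (for df ≥ 2) and that the skewness equals √(8/df).

```python
import math

def chi2_pdf(x, df):
    """Chi-square density with `df` degrees of freedom, for x > 0."""
    k = df / 2.0
    return x ** (k - 1.0) * math.exp(-x / 2.0) / (2.0 ** k * math.gamma(k))

def area_under_curve(df, upper=200.0, steps=100_000):
    """Midpoint-rule approximation of the area between 0 and `upper`."""
    width = upper / steps
    return width * sum(chi2_pdf((i + 0.5) * width, df) for i in range(steps))

for df in (3, 5, 10, 30):
    area = area_under_curve(df)
    skewness = math.sqrt(8.0 / df)  # shrinks toward 0 as df grows
    print(f"df={df:2d}  area={area:.4f}  peak near x={df - 2}  skewness={skewness:.2f}")
```

The printed skewness values fall as df grows, which is the "looks more like a normal curve" behavior described in the third property.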
Properties of the Chi-square Distributions
Question:
Without looking at your notes, can you state the 3 properties of a chi-square distribution?
1. The total area under a chi-square curve is equal to 1.
2. Each chi-square curve (except when df = 1 or df = 2) begins at 0 on the horizontal axis, increases to a peak, and then approaches the horizontal axis asymptotically from above.
3. Each chi-square curve is skewed right. As the number of degrees of freedom increases, the curve becomes more and more symmetrical and looks more like a normal curve.
Chi-Squared Tests
All of the tests of means and proportions that we have discussed deal with a maximum of two quantitative variables. But other questions arise:
Suppose we make an assumption about the distribution of certain data, such as: the data values are equally likely to occur. Once we actually gather such data, how do we judge if the observed data are consistent with the expected patterns?
How do we establish the existence of a relationship between two categorical variables?
How do we test for differences in more than two proportions, e.g., the proportions of M&M colors?
Chi-square analysis is used to address such questions.
Chi-square statistic:
$$\chi^2 = \sum_{i=1}^{n} \frac{(\text{observed count}_i - \text{expected count}_i)^2}{\text{expected count}_i}$$
The χ² statistic is a SUM of the weighted differences of observed and expected data values (counts) which takes into account the magnitude of the difference between each observed and expected value relative to the magnitude of the values.
Now, a difference of 10 (observed − expected) is more significant if it comes from 85 and 95 than from 152085 and 152095; the statistic accounts for this by dividing each squared difference by its expected count.
The χ² probability density function is defined on the domain {χ² | χ² ≥ 0}, and its shape depends only on the degrees of freedom (df).
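To see how the statistic weighs a difference relative to the size of the counts, here is a small sketch using the two pairs of numbers mentioned above. The slides do not say which number in each pair is observed and which is expected, so treating the smaller one as the expected count is an assumption; the function name `contribution` is also made up for illustration.

```python
def contribution(observed, expected):
    """One term of the chi-square sum: (observed - expected)^2 / expected."""
    return (observed - expected) ** 2 / expected

# Assumption: the smaller value in each pair is the expected count.
small_counts = contribution(observed=95, expected=85)
large_counts = contribution(observed=152095, expected=152085)

print(f"difference of 10 around 85:     {small_counts:.4f}")
print(f"difference of 10 around 152085: {large_counts:.6f}")
```

The same difference of 10 contributes more than a thousand times as much to the statistic when the expected count is 85 as when it is 152085.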
Inference on Categorical Data
There are three different types of inferences that we can perform on categorical data:
1. The Chi-Squared Goodness of Fit Test (GoF Test)
This test is used when we wish to test how well a distribution fits a model or some expected outcome, or rather how well it does NOT fit the model, since we cannot really prove that it does.
2. The Chi-Squared Test of Homogeneity
This test is used when we wish to test how similar or different groups are to each other.
3. The Chi-Squared Test of Independence
This test is used when we wish to test whether two categorical variables are independent of one another.
The χ² Test for Goodness of Fit
For the χ² test called the Goodness of Fit Test, a specific chi-square distribution is specified by one parameter, the degrees of freedom, n – 1, where n is the number of categories.
Once again, we are generally testing how well our observations “fit” some expected model. Assume that the hypothesized distribution has n outcome categories.
Assumptions and Conditions
The Data are Counts Assumption
Check to make sure that the data are categorical and the numbers are counts.
Independence Assumption
Randomization Condition: The data are random, preferably from an SRS.
10% Condition: You have less than 10% of the population.
Sample Size Assumption
All expected counts in each cell are at least 5.
χ² Hypothesis
We test the hypothesis
H0: actual population data are equal to the hypothesized data.
Calculate the chi-square test statistic:
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
χ² has approximately a χ² distribution with df = (n – 1).
We test H0 against
Ha: actual population data are different from the hypothesized data.
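The test statistic and its degrees of freedom can be sketched in a few lines. This is an illustration only: the function name `gof_statistic` is made up, and the die-roll counts are hypothetical data invented for the example (60 rolls of a die assumed fair, so each face has an expected count of 10).

```python
def gof_statistic(observed, expected):
    """Return (chi-square statistic, degrees of freedom) for a goodness of fit test."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need the same number of categories")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1  # n - 1, where n is the number of categories
    return statistic, df

# Hypothetical data: 60 rolls of a die, tested against "all faces equally likely".
observed_rolls = [8, 12, 9, 11, 6, 14]
expected_rolls = [10] * 6

statistic, df = gof_statistic(observed_rolls, expected_rolls)
print(f"chi-square = {statistic:.2f} with df = {df}")
```

A large statistic means the observed counts sit far from the hypothesized counts, which is evidence against H0.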
Example 1
According to an independent source, you are told that the distribution of M&M’s is as follows: Brown – 13%, Yellow – 14%, Red – 13%, Blue – 24%, Orange – 20%, and Green – 16%. You wish to determine if this distribution is accurate. So, using an SRS, you collect a large sample of M&M’s and determine the following:
Color     Expected Count
Brown     159.25
Yellow    172.5
Red       159.25
Blue      294
Orange    245
Green     196
Sample size: 1225
The Solution
Step 1: Identify the population Parameter, state the null and alternative Hypotheses, and determine what you are trying to do and what the question is asking.
We want to know whether a stated distribution of M&M’s is accurate.
H0: The stated distribution of M&M’s is accurate.
Ha: The stated distribution of M&M’s is not accurate.
The Solution
Step 2: Verify the Assumptions by checking the conditions.
Data are Counts Assumption
Yes, the data are counts.
Independence Assumption
Randomization condition: the data come from an SRS.
10% condition: it is safe to assume that we have less than 10% of all M&M’s.
Sample Size Assumption (not too small)
Expected count condition: We MUST make sure that our expected counts (not the actual counts!) are at least 5. All expected counts are definitely greater than 5!
The Solution
Step 3: If conditions are met, Name the inference procedure, find the Test statistic, and Obtain the p-value in carrying out the inference:
We will perform a χ² Goodness of Fit Test.
$$\chi^2 = \sum \frac{(\text{Obs} - \text{Exp})^2}{\text{Exp}} = \frac{(146 - 159.25)^2}{159.25} + \frac{(185 - 172.5)^2}{172.5} + \dots = 18.103$$
Since there are 6 colors, there are 5 degrees of freedom.
P-value ≈ 0.00282
The Solution
On the older TI-83 or TI-84 calculator:
1. Put the data into L1.
2. Input the expected values in L2 (NOTE: there should be an equal number of cells in L2 as there are in L1; this is very important, because a mismatch can change the solution to the problem).
3. Input the χ² components into L3: (L1 − L2)²/L2
4. Find the sum of L3: sum(L3) ≈ 18.103
5. Find the p-value: χ²cdf(18.103, E99, 5) ≈ .00282
6. OR get the program!
7. On the new calculators, use the χ² GOF-Test function.
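For readers without a calculator, the p-value step can be sketched in Python with only the standard library. This is an illustration under stated assumptions: `chi2_upper_tail` is a made-up helper name, and it builds the upper-tail area from the standard recurrence for the incomplete gamma function, which is valid for whole-number degrees of freedom. It plays the role of χ²cdf(statistic, E99, df).

```python
import math

def chi2_upper_tail(x, df):
    """Area to the right of x under the chi-square curve with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-z)               # starting point for df = 2
    else:
        a, tail = 0.5, math.erfc(math.sqrt(z))    # starting point for df = 1
    # Each pass raises the degrees of freedom by 2 until `df` is reached.
    for _ in range((df - 1) // 2):
        tail += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return tail

p_value = chi2_upper_tail(18.103, 5)
print(f"P-value = {p_value:.5f}")  # about 0.00282, matching the calculator result
```

Only the p-value step is reproduced here, because the full list of observed counts for Example 1 is not available in this transcript.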
The Solution
Step 4: Make a decision and State the Conclusion in context of the problem:
With a p-value of 0.00282, we reject the null hypothesis. There is very strong evidence that the stated distribution of M&M’s is not accurate; in other words, it seems that the independent source did not accurately find the correct distribution of M&M’s.
Example 2
A major car dealer paints new cars based on what they perceive the public likes and dislikes. They wish for you to determine if their color choices are appropriate. You choose an SRS of 1000 prospective customers and gather the following information:
Color     Observed Count   Expected Percentage   Expected Count
White     260              25%                   250
Black     285              25%                   250
Silver    128              15%                   150
Blue      64               5%                    50
Gold      88               10%                   100
Red       175              20%                   200
Example 2
Color     Observed Count   Expected Count
White     260              250
Black     285              250
Silver    128              150
Blue      64               50
Gold      88               100
Red       175              200
A quick glimpse reveals that none of the observed values match the expected values. What are two reasons for this?
1. This could be simply due to sampling variability, or
2. The distribution of colors is not the same as what the company claimed (the distribution may have changed).
Ok, let’s determine if it’s just chance variation or a different distribution of colors…
The Solution
Step 1: Identify the population Parameter, state the null and alternative Hypotheses, and determine what you are trying to do and what the question is asking.
We want to determine if the distribution of car colors matches a perceived proportion.
H0: The proportions of car colors will be consistent with the perceived proportions.
Ha: The proportions of car colors will NOT be consistent with the perceived proportions.
The Solution
Step 2: Verify the Assumptions by checking the conditions.
Data are Counts Assumption
Data are Counts Condition: yes, the data are counts of categorical data.
Independence Assumption
Randomization condition: the data come from an SRS.
10% condition: it is safe to assume that we have less than 10% of all cars.
The Solution
Step 2: Verify the Assumptions by checking the conditions.
Sample Size Assumption
All of the expected counts are greater than 5, so it’s safe to proceed.
Color            White   Black   Silver   Blue   Gold   Red
Expected Count   250     250     150      50     100    200
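The expected counts and the sample size check can be reproduced from the expected percentages. This is a minimal sketch; the variable names are chosen for illustration, and the threshold of 5 is the one stated in the Sample Size Assumption.

```python
sample_size = 1000  # SRS of 1000 prospective customers

# Expected percentages from the dealer's perceived color preferences.
expected_percent = {
    "White": 25, "Black": 25, "Silver": 15,
    "Blue": 5, "Gold": 10, "Red": 20,
}

# Expected count = sample size x expected proportion.
expected_counts = {color: sample_size * pct / 100
                   for color, pct in expected_percent.items()}

# Sample Size Assumption: every expected count must be at least 5.
condition_met = all(count >= 5 for count in expected_counts.values())

for color, count in expected_counts.items():
    print(f"{color:<7}{count:>6.0f}")
print("Expected count condition met:", condition_met)
```

Note that the check is applied to the expected counts, not to the observed counts.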
The Solution
Step 3: If conditions are met, Name the inference procedure, find the Test statistic, and Obtain the p-value in carrying out the inference:
We will perform a χ² Goodness of Fit Test.
$$\chi^2 = \sum \frac{(\text{Obs} - \text{Exp})^2}{\text{Exp}} = \frac{(260 - 250)^2}{250} + \frac{(285 - 250)^2}{250} + \dots = 17.012$$
Since there are 6 colors, there are 5 degrees of freedom.
P-value ≈ .00448
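The full calculation for the car colors can be sketched as follows, using the observed and expected counts from the table in Example 2. One assumption to flag: the p-value line uses a closed-form expression for the upper-tail area that holds only for 5 degrees of freedom; it is a shortcut for this example, not a general routine.

```python
import math

colors   = ["White", "Black", "Silver", "Blue", "Gold", "Red"]
observed = [260, 285, 128, 64, 88, 175]
expected = [250, 250, 150, 50, 100, 200]

# One (Obs - Exp)^2 / Exp term per color.
components = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
statistic = sum(components)
df = len(colors) - 1

# Upper-tail area for df = 5 only (closed form; an assumption of this sketch).
x = statistic
p_value = (math.erfc(math.sqrt(x / 2))
           + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * (1 + x / 3))

for color, term in zip(colors, components):
    print(f"{color:<7}{term:7.3f}")
print(f"chi-square = {statistic:.3f}, df = {df}, P-value = {p_value:.5f}")
```

Listing the individual terms shows which colors drive the result: Black contributes the most (4.9), followed by Blue (3.92).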
The Solution
Step 4: Make a decision and State the Conclusion in context of the problem:
With a p-value of .00448, we reject the null hypothesis at the 0.05 alpha level. There is very strong evidence that the perceived car colors do not follow the actual preferred distribution of colors; in other words, it seems that the car dealer is not using the correct distribution of colors.