Transcript CHAPTER 10

Chi-Squared Distributions

INFERENCE for CATEGORICAL DATA and MULTIPLE SAMPLES


Inference on Categorical Data

• Oftentimes we wish to work with categorical data rather than numerical data.

• In order to make categorical data meaningful, we have to look at the counts within each category.

• However, with counted data, we need to use another type of distribution, so we utilize the χ² (or chi-squared) distribution.

Properties of the Chi-square Distributions

The chi-square distributions are a family of distributions that take only positive values and are skewed to the right. A specific chi-square distribution is specified by one parameter, called the degrees of freedom.

The chi-square density curves have the following properties:

• The total area under a chi-square curve is equal to 1.

• Each chi-square curve (except when df = 1 or 2) begins at 0 on the horizontal axis, increases to a peak, and then approaches the horizontal axis asymptotically from above.

• Each chi-square curve is skewed right. As the number of degrees of freedom increases, the curve becomes more and more symmetrical and looks more like a normal curve.

Properties of the Chi-square Distributions

Question:

Without looking at your notes, can you state the 3 properties of a chi-square distribution?

• The total area under a chi-square curve is equal to 1.

• Each chi-square curve (except when df = 1 or 2) begins at 0 on the horizontal axis, increases to a peak, and then approaches the horizontal axis asymptotically from above.

• Each chi-square curve is skewed right. As the number of degrees of freedom increases, the curve becomes more and more symmetrical and looks more like a normal curve.
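The three properties can be checked numerically. The sketch below is only an illustration, not part of the original slides: it uses the standard chi-square density formula, approximates the area with a midpoint rule, and uses the known skewness formula sqrt(8/df). The function names and the integration limits are assumptions chosen for the example.

```python
import math

def chi2_pdf(x, df):
    """Chi-square density at x > 0 for the given degrees of freedom."""
    half = df / 2.0
    return x ** (half - 1) * math.exp(-x / 2.0) / (2 ** half * math.gamma(half))

def area_under_curve(df, upper=150.0, steps=60_000):
    """Midpoint-rule approximation of the area from 0 to `upper`."""
    width = upper / steps
    return sum(chi2_pdf((i + 0.5) * width, df) for i in range(steps)) * width

def skewness(df):
    """Theoretical skewness of a chi-square distribution: sqrt(8 / df)."""
    return math.sqrt(8.0 / df)

# Area should stay near 1 while skewness shrinks as df grows.
for df in (3, 5, 10, 30):
    print(df, round(area_under_curve(df), 4), round(skewness(df), 3))
```

The printed skewness values shrink toward 0 as df grows, which is the "looks more like a normal curve" behavior described above.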

Chi-Squared Tests

• All of the tests of means and proportions that we have discussed deal with a maximum of two quantitative variables. But other questions arise:

  • Suppose we make an assumption about the distribution of certain data, such as: the data values are equally likely to occur. Once we actually gather such data, how do we judge if the observed data are consistent with the expected patterns?

  • How do we establish the existence of a relationship between two categorical variables?

  • How do we test for differences in more than two proportions (e.g., proportions of M&M colors)?

Chi-square analysis is used to address such questions.

Chi-square statistic:

$$\chi^2 = \sum_{i=1}^{n} \frac{(\text{observed}_i - \text{expected}_i)^2}{\text{expected}_i}$$

The χ² statistic is a SUM of the weighted squared differences between observed and expected data values (counts). It takes into account the magnitude of the difference between each observed and expected value relative to the magnitude of the values.

A difference of 10 (observed − expected) is more significant if it comes from 85 and 95 than from 152085 and 152095. Dividing each squared difference by its expected count accounts for this.

The χ² probability density function is defined on the domain {χ² | χ² ≥ 0}, and its shape depends only on the degrees of freedom (df).
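To see the weighting at work, here is a small sketch that evaluates one term of the sum for each pair of numbers mentioned above. It assumes the smaller number in each pair is the expected count and the larger is the observed count; the slides do not say which is which, and the function name is made up for the illustration.

```python
def contribution(observed, expected):
    """One term of the chi-square sum: (observed - expected)^2 / expected."""
    return (observed - expected) ** 2 / expected

# Assumption: the smaller value is the expected count in each pair.
small = contribution(95, 85)
large = contribution(152095, 152085)
print(round(small, 4), round(large, 6))
```

Both pairs differ by 10, yet the first term is well over a thousand times larger than the second, because the same squared difference is divided by a much smaller expected count.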

Inference on Categorical Data

There are three different types of inferences that we can perform on categorical data:

1. The Chi-Squared Goodness of Fit Test (GoF Test)

• This test is used when we wish to test how well a distribution fits a model or some expected outcome – or rather how well it does NOT fit the model, since we cannot really prove that it does.

2. The Chi-Squared Test of Homogeneity

• This test is used when we wish to test how similar or different groups are to each other.

3. The Chi-Squared Test of Independence

• This test is used when we wish to test whether two categorical variables are independent of one another.

The χ² Test for Goodness of Fit

• For the χ² test called the Goodness of Fit Test, a specific chi-square distribution is specified by one parameter, the degrees of freedom, n – 1, where n is the number of categories.

• Once again, we are generally testing how well our observations “fit” some expected model. Assume that the hypothesized distribution has n outcome categories.

Assumptions and Conditions

• The Data are Counts Assumption
  • Check to make sure that the data are categorical and the numbers are counts.

• Independence Assumption
  • Randomization: the data are random, preferably an SRS.
  • 10% Condition: you have less than 10% of the population.

• Sample Size Assumption
  • All expected counts in each cell are at least 5.
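The sample size condition is easy to check mechanically. The sketch below is an illustration only: the helper names are invented, and the die-rolling numbers are hypothetical data that do not come from the slides.

```python
def expected_counts(total, proportions):
    """Expected count per category: sample size times hypothesized proportion."""
    return [total * p for p in proportions]

def sample_size_ok(expected, minimum=5):
    """True when every expected count is at least `minimum`."""
    return all(count >= minimum for count in expected)

# Hypothetical example: 60 rolls of a fair die give expected counts near 10.
die_expected = expected_counts(60, [1 / 6] * 6)
print(die_expected, sample_size_ok(die_expected))

# Hypothetical example: 24 rolls give expected counts near 4, so the check fails.
print(sample_size_ok(expected_counts(24, [1 / 6] * 6)))
```

Note that the check is applied to the expected counts, not the observed ones.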

Hypothesis

• We test the hypothesis H0: actual population data are equal to the hypothesized data.

• Calculate the chi-square test statistic:

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

• χ² has approximately a χ²-distribution with df = (n – 1).

• We test H0 against Ha: actual population data are different from the hypothesized data.
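As a rough sketch of the calculation just described, the statistic and its degrees of freedom can be computed as follows. The function names are assumptions, and the die-roll counts are hypothetical numbers used only to exercise the formula.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need the same number of categories")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def degrees_of_freedom(observed):
    """Goodness-of-fit df: number of categories minus one."""
    return len(observed) - 1

# Hypothetical counts for 60 rolls of a die (not from the slides).
rolls = [8, 12, 9, 11, 6, 14]
fair = [10] * 6
print(chi_square_statistic(rolls, fair), degrees_of_freedom(rolls))
```

The length check mirrors the requirement that every category has both an observed and an expected count.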

Example 1

According to an independent source, you are told that the distribution of M&M’s is as follows: Brown – 13%, Yellow – 14%, Red – 13%, Blue – 24%, Orange – 20%, and Green – 16%. You wish to determine if this distribution is accurate. So, using an SRS, you collect a large sample of M&M’s and determine the following:

Color     Expected Count
Brown     159.25
Yellow    172.5
Red       159.25
Blue      294
Orange    245
Green     196
Total     1225
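The expected counts come from multiplying the sample size by each stated percentage. The sketch below recomputes them, assuming the total of 1225 is the sample size. One thing to note: 14% of 1225 works out to 171.5, while the slide lists 172.5 for yellow, so the slide value may be a typo. The test statistic quoted later in the example is left as the slides give it.

```python
# Stated percentages from the example; 1225 is assumed to be the sample size.
stated_percent = {"Brown": 13, "Yellow": 14, "Red": 13, "Blue": 24, "Orange": 20, "Green": 16}
sample_size = 1225

# Expected count = sample size x stated proportion.
expected = {color: sample_size * pct / 100 for color, pct in stated_percent.items()}

for color, count in expected.items():
    print(f"{color:<8}{count:>8.2f}")
print(f"{'Total':<8}{sum(expected.values()):>8.2f}")
```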

The Solution

Step 1: Identify the population Parameter, state the null and alternative Hypotheses, determine what you are trying to do and what the question is asking.

• We want to know whether a stated distribution of M&M’s is accurate.

• H0: The stated distribution of M&M’s is accurate.

• Ha: The stated distribution of M&M’s is not accurate.

The Solution

Step 2: Verify the conditions by checking the Assumptions.

• Data are Counts Assumption
  • Yes, the data are counts.

• Independence Assumption
  • Randomization condition: the data come from an SRS.
  • 10% condition: it is safe to assume we have less than 10% of all M&M’s.

• Sample Size Assumption
  • Not-too-small expected count condition: we MUST make sure that our expected counts (not the actual counts!) are at least 5. All expected counts are definitely greater than 5!

The Solution

Step 3: If conditions are met, Name the inference procedure, find the Test statistic, and Obtain the p-value in carrying out the inference:

• We will perform a χ² Goodness of Fit Test.

$$\chi^2 = \sum \frac{(\text{Obs} - \text{Exp})^2}{\text{Exp}} = \frac{(146 - 159.25)^2}{159.25} + \frac{(185 - 172.5)^2}{172.5} + \dots \approx 18.103$$

• Since there are 6 colors, there are 5 degrees of freedom.

• P-value ≈ 0.00282

The Solution

On the older TI-83 or TI-84 calculator:

1. Put the data into L1.

2. Input the expected values in L2. (NOTE: there should be an equal number of cells in L2 as there are in L1. This is very important, and it can change the solution to the problem.)

3. Input each category's contribution into L3: (L1 − L2)²/L2

4. Find the sum of L3: sum(L3) ≈ 18.103

5. Find the p-value: χ²cdf(18.103, E99, 5) ≈ .00282

6. OR get the program!

7. On the new calculators, use the χ² GOF function.
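For readers without a calculator, the upper-tail area found in step 5 can be reproduced in a few lines. This is a sketch, not the calculator's own algorithm: it relies on the closed-form tail expressions that hold for whole-number degrees of freedom, and the function name is made up for the example.

```python
import math

def chi2_upper_tail(x, df):
    """P(X >= x) for a chi-square variable with a positive integer df."""
    if df < 1 or df != int(df):
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half_x = x / 2.0
    if df % 2 == 0:
        # Even df: finite sum of (x/2)^j / j! scaled by exp(-x/2).
        terms = (half_x ** j / math.factorial(j) for j in range(df // 2))
        return math.exp(-half_x) * sum(terms)
    # Odd df: complementary error function plus a finite sum.
    terms = (half_x ** (j - 0.5) / math.gamma(j + 0.5)
             for j in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half_x)) + math.exp(-half_x) * sum(terms)

# Same inputs as the calculator step: statistic 18.103 with 5 degrees of freedom.
print(round(chi2_upper_tail(18.103, 5), 5))
```

The E99 in the calculator command stands in for an infinite upper bound, which is why the function here returns the whole area to the right of the statistic.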

The Solution

Step 4: Make a decision and State the Conclusion in context of the problem:

• With a p-value of 0.00282, we reject the null hypothesis. There is very strong evidence that the stated distribution of M&M’s is not accurate; in other words, it seems that the independent source did not accurately find the correct distribution of M&M’s.

Example 2

A major car dealer paints new cars based on what they perceive the public likes and dislikes. They wish for you to determine if their color choices are appropriate. You choose an SRS of 1000 prospective customers and gather the following information:

Color     Observed Count   Expected Percentage   Expected Count
White     260              25%                   250
Black     285              25%                   250
Silver    128              15%                   150
Blue      64               5%                    50
Gold      88               10%                   100
Red       175              20%                   200

Example 2

Color     Observed Count   Expected Count
White     260              250
Black     285              250
Silver    128              150
Blue      64               50
Gold      88               100
Red       175              200

A quick glimpse reveals that none of the observed values match the expected values. What are two reasons for this?

1. This could be simply due to sampling variability, or

2. The distribution of colors is not the same as what the company claimed (the distribution may have changed).

Ok, let’s determine if it’s just chance variation or a different distribution of colors…

The Solution

Step 1: Identify the population Parameter, state the null and alternative Hypotheses, determine what you are trying to do and what the question is asking.

• We want to determine if the distribution of car colors matches a perceived proportion.

• H0: The proportions of car colors will be consistent with the perceived proportions.

• Ha: The proportions of car colors will NOT be consistent with the perceived proportions.

The Solution

Step 2: Verify the conditions by checking the Assumptions.

• Data are Counts Assumption
  • Data are Counts Condition: yes, the data are counts of categorical data.

• Independence Assumption
  • Randomization condition: the data come from an SRS.
  • 10% condition: it is safe to assume that we have less than 10% of all cars.

The Solution

Step 2: Verify the conditions by checking the Assumptions.

• Sample Size Assumption
  • All of the expected counts are greater than 5, so it’s safe to proceed.

Color     Expected Count
White     250
Black     250
Silver    150
Blue      50
Gold      100
Red       200

The Solution

Step 3: If conditions are met, Name the inference procedure, find the Test statistic, and Obtain the p-value in carrying out the inference:

• We will perform a χ² Goodness of Fit Test.

$$\chi^2 = \sum \frac{(\text{Obs} - \text{Exp})^2}{\text{Exp}} = \frac{(260 - 250)^2}{250} + \frac{(285 - 250)^2}{250} + \dots \approx 17.012$$

• Since there are 6 colors, there are 5 degrees of freedom.

• P-value ≈ .00448
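The full calculation for this example can be sketched in code using the observed and expected counts from the table. This is an illustration under stated assumptions: the variable and function names are invented, and the tail formula is a closed form that is valid only for 5 degrees of freedom.

```python
import math

observed = {"White": 260, "Black": 285, "Silver": 128, "Blue": 64, "Gold": 88, "Red": 175}
expected = {"White": 250, "Black": 250, "Silver": 150, "Blue": 50, "Gold": 100, "Red": 200}

# Each color's term (Obs - Exp)^2 / Exp, then their sum.
contributions = {c: (observed[c] - expected[c]) ** 2 / expected[c] for c in observed}
statistic = sum(contributions.values())
df = len(observed) - 1

def upper_tail_df5(x):
    """P(X >= x) for chi-square with df = 5 (closed form valid only for df = 5)."""
    root = math.sqrt(x / 2.0)
    series = root / math.gamma(1.5) + root ** 3 / math.gamma(2.5)
    return math.erfc(root) + math.exp(-x / 2.0) * series

p_value = upper_tail_df5(statistic)
print(round(statistic, 3), df, round(p_value, 5))
```

Printing `contributions` shows which colors drive the statistic; Black has the largest term, at 4.9.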

The Solution

Step 4: Make a decision and State the Conclusion in context of the problem:

• With a p-value of .00448, we reject the null hypothesis at the 0.05 alpha level. There is very strong evidence that the perceived car colors do not follow the actual preferred distribution of colors; in other words, it seems that the car dealer is not using the correct distribution of colors.