pre12: Stat200 - Chi

Presentation 12

Chi-Square test

  

What does it mean for two categorical variables to be related?

Remember that Chi-Square is used to test for a relationship between 2 Categorical variables.

H0: There is no relationship between the variables.

Ha: There is a relationship between the variables.

If two categorical variables are related, it means the chance that an individual falls into a particular category for one variable depends upon the particular category they fall into for the other variable.

Let’s say that we wanted to determine if there is a relationship between religion (Christian, Jewish, Muslim, Other) and smoking. When we test if there is a relationship between these two variables, we are trying to determine whether belonging to a particular religion goes along with a higher or lower chance of being a smoker. If that is the case, then we can say that Religion and Smoking are related, or associated.

   

Chi-Square test for 2-way tables

Suppose we are studying two categorical variables in a population, where the first variable has r levels (i.e. possible outcomes) and the second one has c levels. We can summarize a sample from this population using a table with r rows and c columns.

A two-way table, also called a contingency table, displays the counts of how many individuals fall into each possible combination of categories of two categorical variables. So, each cell of the table (total number of cells is r x c) represents a combination of categories of the two variables.

The following table presents the data on race and smoking. The two variables of interest, race and smoking, have r = 4 and c = 2 levels, resulting in 4x2=8 combinations of categories.

Race        NSmoke   Smoke
Caucasian      620      75
Black          240      41
Hispanic       130      29
Other          190      38

Chi-Square test for 2-way tables

By considering the number of observations falling into each category, we will see how to test hypotheses of the form:

H0: The two variables are not associated.

Ha: The two variables are associated.

Two different experimental situations will lead to contingency tables:

1. We have two populations under study, and each individual is classified according to the same categorical variable. In this case the null hypothesis is a statement of homogeneity among the two populations.

2. We have one population under study, and we want to check the relationship between two categorical variables. In this case the null hypothesis is a statement of independence between the two variables.

For sufficiently large samples, the same test is appropriate for both of these situations. This test is called the chi-square test, and in the following we will go over the steps for testing the relationship between two variables.

Some Notation!

For i taking values from 1 to r (number of rows) and j taking values from 1 to c (number of columns), denote:

R_i = total count of observations in the i-th row.

C_j = total count of observations in the j-th column.

O_ij = observed count for the cell in the i-th row and the j-th column.

E_ij = expected count for the cell in the i-th row and the j-th column if the two variables were independent, i.e. if H0 was true.

These counts are calculated as

Expected = (Row total x Column total) / Total sample size n, thus

$$E_{ij} = \frac{R_i \, C_j}{n}$$

Example

Race        NSmoke        Smoke        Total
Caucasian   O_11 = 620    O_12 = 75    R_1 = 695
Black       O_21 = 240    O_22 = 41    R_2 = 281
Hispanic    O_31 = 130    O_32 = 29    R_3 = 159
Other       O_41 = 190    O_42 = 38    R_4 = 228
Total       C_1 = 1180    C_2 = 183    n = 1363

E_11 = (695x1180)/1363    E_12 = (695x183)/1363
E_21 = (281x1180)/1363    E_22 = (281x183)/1363
E_31 = (159x1180)/1363    E_32 = (159x183)/1363
E_41 = (228x1180)/1363    E_42 = (228x183)/1363
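To make the arithmetic concrete, here is a small Python sketch of the expected-count formula applied to the race and smoking table. The function name expected_counts and the list-of-rows layout are assumptions chosen for this illustration; they are not part of the course materials.

```python
# Observed counts from the race and smoking example.
# Rows: Caucasian, Black, Hispanic, Other. Columns: NSmoke, Smoke.
observed = [
    [620, 75],
    [240, 41],
    [130, 29],
    [190, 38],
]


def expected_counts(table):
    """Return E_ij = R_i * C_j / n for every cell of a two-way table."""
    row_totals = [sum(row) for row in table]          # R_i
    col_totals = [sum(col) for col in zip(*table)]    # C_j
    n = sum(row_totals)                               # total sample size
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for row in expected:
    print([round(e, 2) for e in row])
```

Each row of expected counts adds up to the corresponding row total, which is a quick way to check the calculation by hand.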

Chi-Square Analysis Details

 The 5 Steps in a Chi-Square Test: 

Step 1: Write the null and alternative hypothesis.

H 0 : There is no relationship between the variables.

H a : There is a relationship between the variables. 

Step 2: Check conditions.

A) All expected counts should be > 1.

B) At least 80% of expected counts should be > 5.
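The two conditions can be checked mechanically. The sketch below is one possible way to do it in Python; the function name conditions_met is hypothetical, the first table holds the expected counts from the Minitab output of the Detailed Example later in this presentation, and the second table is made up purely to show a failing case.

```python
def conditions_met(expected):
    """Check conditions A and B on a table of expected counts."""
    cells = [e for row in expected for e in row]
    all_above_one = all(e > 1 for e in cells)                   # condition A
    share_above_five = sum(e > 5 for e in cells) / len(cells)   # condition B
    return all_above_one and share_above_five >= 0.8


# Expected counts from the Detailed Example (area by drinking status).
example_expected = [
    [9.07, 76.93],
    [14.87, 126.13],
    [22.78, 193.22],
    [40.28, 341.72],
]

# Made-up expected counts: all are above 1, but only half are above 5.
made_up_expected = [
    [3.0, 12.0],
    [4.0, 9.0],
]

print(conditions_met(example_expected))
print(conditions_met(made_up_expected))
```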

Step 3: Calculate Test Statistic and p-value.

The test statistic measures the difference between the observed counts and the expected counts assuming independence.

$$\chi^2 = \sum_{\text{all cells}} \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

This is called the chi-square statistic because if the null hypothesis is true, then it has (approximately, for sufficiently large samples) a chi-square distribution with (r-1)x(c-1) degrees of freedom.
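As an illustration of the formula, the following Python sketch computes the chi-square statistic and its degrees of freedom for the race and smoking table. The function name chi_square_statistic is an assumption made for this example.

```python
def chi_square_statistic(observed):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n   # expected count under H0
            statistic += (o_ij - e_ij) ** 2 / e_ij

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


# Race and smoking table: rows are races, columns are NSmoke and Smoke.
race_table = [[620, 75], [240, 41], [130, 29], [190, 38]]
statistic, df = chi_square_statistic(race_table)
print(f"chi-square = {statistic:.3f}, df = {df}")
```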

Chi-Square Analysis Details

Step 3 Cont. Find the p-value.

If the χ²-statistic is large, it implies that the observed counts are not close to the counts we would expect to see if the two variables were independent. Thus, a "large" χ² gives evidence against the null hypothesis, and supports the alternative.

The p-value of the chi-square test is the probability that the χ²-statistic is as large or larger than the value we obtained if H0 is true. Also, if H0 is true, the χ²-statistic has a chi-square distribution with (r-1)x(c-1) degrees of freedom. Thus, the p-value for the Chi-Square test is ALWAYS the area to the right of the test statistic under the curve, i.e. p-value = P(X > χ²), where X has a chi-square distribution with (r-1)x(c-1) df.

To get this probability we need to use a chi-square distribution with (r-1)x(c-1) df (Table A.4). Using Minitab, or any other statistical software, you can obtain the p-value from the output. Otherwise, you can report a range for the p-value using Table A.4 (since usually you will not be able to find the exact p-value in the table).

Chi-Square Analysis Details

Step 4: Decide whether or not the result is statistically significant.

The results are statistically significant if the p-value is less than alpha, where alpha is the significance level (usually α = 0.05).
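If no statistical package is at hand, the right-tail area can also be computed directly. The sketch below uses only Python's math module and a standard recurrence for the chi-square tail probability; the function name chi_square_p_value is an assumption for this illustration, and the statistic 20.091 with 3 df is taken from the Detailed Example later in this presentation.

```python
import math


def chi_square_p_value(statistic, df):
    """Right-tail area P(X > statistic) for X ~ chi-square with df degrees of freedom.

    Starts from the closed form for df = 1 (odd df) or df = 2 (even df)
    and adds one term per extra 2 degrees of freedom.
    """
    half = statistic / 2.0
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-half)                  # df = 2
    else:
        a, tail = 0.5, math.erfc(math.sqrt(half))       # df = 1
    while a < df / 2.0:
        tail += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return tail


alpha = 0.05
p_value = chi_square_p_value(20.091, 3)
print(f"p-value = {p_value:.5f}")
print("statistically significant" if p_value < alpha else "not statistically significant")
```

The familiar table cut-offs are a handy sanity check: a statistic of 3.841 with 1 df, or 7.815 with 3 df, should give a p-value of about 0.05.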

Step 5: Report the conclusion in the context of the situation.

The p-value is ______ which is < α, this result is statistically significant. Reject H0. Conclude that (the two variables) are related.

The p-value is ______ which is > α, this result is NOT statistically significant. We cannot reject H0. Cannot conclude that (the two variables) are related.

Detailed Example

Derek wants to know if the geographical area that a student grew up in is associated with whether or not the student drinks alcohol. Below are the results he obtained from a random sample of PSU students.

        Big City   Rural   Small Town   Suburban   Total
No            21      11           18         37      87
Yes           65     130          198        345     738
Total         86     141          216        382     825

Detailed Example

1. H0: There is no relationship between the geographical area that a student grew up in and whether or not the student drinks alcohol.

Ha: There is a relationship between the geographical area that a student grew up in and whether or not the student drinks alcohol.

2. To check the conditions we need to calculate the expected counts for each cell. Taking the geographical areas as rows and the answers (No, Yes) as columns, as in the Minitab output:

E_11 = (R_1 x C_1)/n = (86x87)/825 = 9.07,

E_12 = (R_1 x C_2)/n = (86x738)/825 = 76.93, …

E_32 = (R_3 x C_2)/n = ___________________, …

Detailed Example

Here is the Minitab output with the Observed and Expected counts for each cell (observed count on top, expected count underneath). We can see that the conditions are satisfied!

               No      Yes      All
Big_City       21       65       86
             9.07    76.93    86.00

Rural          11      130      141
            14.87   126.13   141.00

SmallTow       18      198      216
            22.78   193.22   216.00

Suburban       37      345      382
            40.28   341.72   382.00

All            87      738      825
            87.00   738.00   825.00

Detailed Example

3. Chi-Square statistic and p-value:

χ² = sum {(Observed - Expected)²/Expected}

= (21-9.07)²/9.07 + (65-76.93)²/76.93

+ (11-14.87)²/14.87 + (130-126.13)²/126.13

+ (18-22.78)²/22.78 + (198-193.22)²/193.22

+ (37-40.28)²/40.28 + (345-341.72)²/341.72

= 20.091

df = (4-1)x(2-1) = 3

p-value = P(X > 20.091) < P(X > 16.27) = 0.001 (Table A.4)

4. Since the p-value < 0.05, the test is significant, and we can reject the null.

5. We can conclude that there is a relationship between the geographical area that a student grew up in and whether or not the student drinks alcohol.
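The whole example can be retraced in a few lines of Python. This is only a sketch under stated assumptions: the variable names are invented for the illustration, and the cut-off 16.27 is the chi-square table value for a right-tail area of 0.001 with 3 df, used here in place of an exact p-value.

```python
# Observed counts from Derek's sample. Rows: Big City, Rural, Small Town,
# Suburban. Columns: No, Yes (drinks alcohol).
observed = [[21, 65], [11, 130], [18, 198], [37, 345]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Step 2: expected counts under H0 and the two conditions.
expected = [[r * c / n for c in col_totals] for r in row_totals]
cells = [e for row in expected for e in row]
conditions_ok = min(cells) > 1 and sum(e > 5 for e in cells) / len(cells) >= 0.8

# Step 3: test statistic and degrees of freedom.
statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Steps 4-5: a statistic beyond the 0.001 cut-off means p-value < 0.001 < 0.05.
CUTOFF_0_001_DF3 = 16.27
print(f"chi-square = {statistic:.2f}, df = {df}, conditions ok: {conditions_ok}")
print("p-value < 0.001" if statistic > CUTOFF_0_001_DF3 else "p-value >= 0.001")
```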

Special Case - Analyzing 2x2 tables

In a lot of cases the categorical variables of interest have two levels each. In this case, we can summarize the data using a contingency table having two rows and two columns (i.e. r=c=2). The general form of a 2x2 table is

         Column 1   Column 2   Total
Row 1    A          B          R_1
Row 2    C          D          R_2
Total    C_1        C_2        n

In this case, the chi-square statistic has the following simplified form,

$$\chi^2 = \frac{n \, (AD - BC)^2}{R_1 R_2 C_1 C_2}.$$

Under the null hypothesis, the χ²-statistic has a chi-square distribution with (2-1)x(2-1)=1 degree of freedom.

Example for 2x2 table: Is there a relationship between gender and smoking habits?

Gender   NSmoke   Smoke   Total
Male        540      52     592
Female      325      31     356
Total       865      83     948

$$\chi^2 = \frac{948 \, (540 \times 31 - 52 \times 325)^2}{592 \times 356 \times 865 \times 83} = 0.0016$$

The difference from the chi-square statistic in the output is because Minitab rounds to 3 decimal places.

Minitab Output

            C1       C2    Total
1          540       52      592
        540.17    51.83

2          325       31      356
        324.83    31.17

Total      865       83      948

Chi-Sq = 0.000 + 0.001 + 0.000 + 0.001 = 0.002

DF = 1, P-Value = 0.968

Minitab uses the general formula of the χ² test statistic.
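To see that the shortcut and the general formula agree, here is a brief Python sketch using the gender and smoking counts. The function names chi_square_2x2 and chi_square_general are assumptions made for this illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Shortcut n(AD - BC)^2 / (R_1 R_2 C_1 C_2) for the table [[A, B], [C, D]]."""
    r1, r2 = a + b, c + d
    c1, c2 = a + c, b + d
    n = r1 + r2
    return n * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)


def chi_square_general(table):
    """General formula: sum of (O - E)^2 / E over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (o - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(table, row_totals)
        for o, c in zip(row, col_totals)
    )


# Rows: Male, Female. Columns: NSmoke, Smoke.
shortcut = chi_square_2x2(540, 52, 325, 31)
general = chi_square_general([[540, 52], [325, 31]])
print(round(shortcut, 4), round(general, 4))  # both round to 0.0016
```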

Relationship Between Chi-Square and 2 Proportions Tests

When do we use Chi-Square and when do we use 2 proportions?

Situation 1: Both categorical variables of interest have exactly 2 levels.

Question - Is there a relationship between the variables, or is there a difference in the proportions?

Answer - Either Chi-Square or a Two-Sided Test of 2 proportions will lead to the same conclusion! In this case, the χ²-statistic = (z-statistic)², and the p-values of the two tests are equal, i.e.

P(X(1 df) > χ²-stat) = 2 P(Z > |z-stat|).

Situation 2: Both categorical variables of interest have exactly 2 levels.

Question - Is one proportion greater/smaller than the other?

Answer - This is a one-sided test and you MUST use a test of 2 proportions.

Situation 3: At least one of the two categorical variables of interest has MORE than 2 levels.

Question - Is there a relationship between the variables?

Answer - MUST use a Chi-Square Test.
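The equality claimed in Situation 1 can be checked numerically. The sketch below uses the gender and smoking counts from the 2x2 example and assumes the usual pooled two-proportion z-statistic; the variable names are made up for the illustration.

```python
import math

# Gender and smoking counts: smokers and group sizes for males and females.
smokers_m, n_m = 52, 592
smokers_f, n_f = 31, 356

# Pooled two-proportion z-statistic (assumes H0: p_m - p_f = 0).
p_m, p_f = smokers_m / n_m, smokers_f / n_f
pooled = (smokers_m + smokers_f) / (n_m + n_f)
z = (p_m - p_f) / math.sqrt(pooled * (1 - pooled) * (1 / n_m + 1 / n_f))

# Chi-square statistic from the 2x2 shortcut formula.
chi_sq = 948 * (540 * 31 - 52 * 325) ** 2 / (592 * 356 * 865 * 83)

# Two-sided z p-value, 2 P(Z > |z|), and chi-square p-value with 1 df.
p_two_sided = math.erfc(abs(z) / math.sqrt(2))
p_chi_sq = math.erfc(math.sqrt(chi_sq / 2))

print(f"z^2 = {z ** 2:.6f}, chi-square = {chi_sq:.6f}")
print(f"two-sided p = {p_two_sided:.3f}, chi-square p = {p_chi_sq:.3f}")
```

Both p-values agree with the P-Value = 0.968 reported in the Minitab output for this table.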

Examples of Chi-Square and 2-Proportions

Gender   NSmoke   Smoke
Male        540      52
Female      325      31

Q1: Is there a difference in the proportion of males and females that smoke?

Solution: Either a Chi-Square or Test of 2 proportions is fine.

2-proportions: H0: p_m - p_f = 0 vs Ha: p_m - p_f ≠ 0

Chi-Square: H0: There is no relationship between Gender and Smoking. Ha: There is a relationship between Gender and Smoking.

Q2: Is the proportion of males who smoke greater than the proportion of females who smoke?

Solution: Test of 2 proportions, because the alternative is one-sided!

2-proportions: H0: p_m - p_f = 0 vs Ha: p_m - p_f > 0
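For Q2, the one-sided p-value is the area to the right of the z-statistic only, which is why the chi-square test (which ignores direction) cannot be used. A minimal Python sketch, again assuming the pooled two-proportion z-statistic and using illustrative variable names:

```python
import math

# Gender and smoking counts: smokers and group sizes for males and females.
smokers_m, n_m = 52, 592
smokers_f, n_f = 31, 356

p_m, p_f = smokers_m / n_m, smokers_f / n_f
pooled = (smokers_m + smokers_f) / (n_m + n_f)
z = (p_m - p_f) / math.sqrt(pooled * (1 - pooled) * (1 / n_m + 1 / n_f))

# Ha: p_m - p_f > 0, so the p-value is the right-tail area P(Z > z).
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))

alpha = 0.05
print(f"z = {z:.3f}, one-sided p-value = {p_one_sided:.3f}")
print("reject H0" if p_one_sided < alpha else "cannot reject H0")
```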

Examples of Chi-Square and 2-Proportions

Race        NSmoke   Smoke
Caucasian      620      75
Black          240      41
Hispanic       130      29
Other          190      38

Q: Is there a relationship between Race and Smoking? Is there a difference in the proportion of smokers among the different races?

Solution: Chi-Square because Race has more than 2 levels!

Chi-Square Test

H0: There is no relationship between Race and Smoking.

Ha: There is a relationship between Race and Smoking.