Transcript Chi-Square

Social Statistics: Chi-square
This week
What is chi-square
CHIDIST
Non-parametric statistics

Parametric statistics

A main branch of statistics:
Assumes the data follow a type of probability distribution (e.g. normal distribution).
Makes inferences about the parameters of the distribution (e.g. mean, standard deviation).
Assumption: the sample is large enough to represent the population (e.g. sample size around 30).
Parametric methods are not distribution-free (they require a probability distribution).
Nonparametric statistics

Nonparametric statistics (distribution-free statistics):
Do not rely on the assumption that the data are drawn from a given probability distribution (the data model is not specified).
Widely used for studying populations that take on a ranked order (e.g. movie reviews from one to four stars, opinions about hotel rankings). A good fit for ordinal data.
Make fewer assumptions, so they can be applied in situations where less is known about the application.
May require a larger sample size than parametric statistics to draw a conclusion with the same degree of confidence.
Data with frequencies or percentages:
Number of kids in different grades
The percentage of people receiving social security
One-sample/Two-sample chi-square

One-sample chi-square includes only one dimension:
Whether the number of respondents is equally distributed across all levels of education.
Whether the voting for the school voucher has a pattern of preference.

Two-sample chi-square includes two dimensions:
Whether preference for the school voucher is independent of political party affiliation and gender.
Compute chi-square
One-sample chi-square test

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

O: the observed frequency
E: the expected frequency
Example
Question: Is the number of respondents equally distributed across all opinions?
One-sample chi-square

Preference for School Voucher
for      maybe    against    total
23       17       50         90
Chi-square steps

Step 1: State the null and research hypotheses

Null hypothesis: there is no difference in the frequency or proportion in each category
$$H_0: P_1 = P_2 = P_3$$
Research hypothesis: there is a difference in the frequency or proportion in each category
$$H_1: P_1 \neq P_2 \neq P_3$$
Step 2: Set the level of risk (the level of significance, or Type I error rate) associated with the null hypothesis: 0.05
Step 3: Select the proper test statistic
Frequency → nonparametric procedures → chi-square
Step 4: Compute the test statistic value (called the obtained value)

category   observed        expected        D              (O-E)^2   (O-E)^2/E
           frequency (O)   frequency (E)   (difference)
for        23              30              7              49        1.63
maybe      17              30              13             169       5.63
against    50              30              20             400       13.33
Total      90              90                                       20.60
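The Step 4 arithmetic can be sketched in a few lines of Python. This is only an illustration of the table above, not part of the original slides; the variable names are made up for the example, and the expected frequency assumes the null hypothesis of equal proportions (total divided by the number of categories).

```python
# Observed counts from the school-voucher example in the text.
observed = {"for": 23, "maybe": 17, "against": 50}

total = sum(observed.values())      # 90 respondents
expected = total / len(observed)    # 30 per category under the null hypothesis

# Contribution of each category: (O - E)^2 / E
contributions = {
    category: (count - expected) ** 2 / expected
    for category, count in observed.items()
}

# The obtained value is the sum of the contributions.
chi_square = sum(contributions.values())

for category, value in contributions.items():
    print(f"{category:8s} {value:.2f}")
print(f"chi-square = {chi_square:.2f}")
```

The printed contributions should match the last column of the table, and their sum is the obtained value used in the later steps.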
Step 5: Determine the value needed for rejection of the null hypothesis, using the appropriate table of critical values for the particular statistic

Distribution of Chi-Square
df = r - 1 (r = number of categories)
If the obtained value > the critical value, reject the null hypothesis
If the obtained value < the critical value, do not reject the null hypothesis

Step 6: Compare the obtained value (20.6) with the critical value (5.991)
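The comparison can also be sketched in Python. The slides read the critical value from a table. The sketch below instead uses a shortcut that holds only when df = 2, where the right-tail probability of the chi-square distribution is exp(-x/2), so the critical value is -2 ln(alpha). That shortcut and the variable names are assumptions of this example, not something the slides state.

```python
import math

alpha = 0.05            # level of significance from Step 2
categories = 3          # for, maybe, against
df = categories - 1     # df = r - 1

# Shortcut valid ONLY for df = 2, where P(X > x) = exp(-x / 2).
# Solving exp(-x / 2) = alpha for x gives the critical value.
if df != 2:
    raise ValueError("this shortcut only holds for df = 2; use a table otherwise")
critical_value = -2 * math.log(alpha)

obtained_value = 20.6   # from Step 4

# Decision rule from Step 5: reject when obtained > critical.
reject_null = obtained_value > critical_value
print(f"critical value = {critical_value:.3f}")
print("reject the null hypothesis" if reject_null else "do not reject the null hypothesis")
```

For any other number of categories the critical value would have to come from a chi-square table or a statistics library.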
Step 7 and 8: Decision time
What is your conclusion, why, and how do you interpret it?
Another example

We'll settle the age-old debate of whether people can actually detect their favorite cola based solely on taste. I blindfold 30 Coke lovers and have them sample 3 colas. Is there a true difference, or are these preference differences explainable by chance?
Hypothesis
Null: There are no preferences. The population is divided evenly among the brands.
Alternate: There are preferences. The population is not divided evenly among the brands.
Chance Model
df = C - 1 = 3 - 1 = 2, set α = .05
For df = 2, $\chi^2_{crit} = 5.99$
Calculate Chi-Square
category   observed        expected        D              (O-E)^2   (O-E)^2/E
           frequency (O)   frequency (E)   (difference)
Coke       13              10              3              9         0.9
Pepsi      9               10              1              1         0.1
RC Cola    8               10              2              4         0.4
Total      30              30                                       1.4
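As a check on the cola numbers, the obtained value can be written out term by term. This only restates the observed counts (13, 9, 8) and the expected count of 10 per brand in the one-sample chi-square formula.

```latex
\[
\chi^2_{obt}
  = \frac{(13-10)^2}{10} + \frac{(9-10)^2}{10} + \frac{(8-10)^2}{10}
  = 0.9 + 0.1 + 0.4
  = 1.4
\]
```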
Decision and Conclusion


$$\chi^2_{crit} = 5.99$$
$$\chi^2_{obt} = 1.40$$
$$\chi^2_{obt} < \chi^2_{crit}$$
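Because the obtained value is smaller than the critical value, do not reject the null hypothesis: there is no evidence that preferences are unevenly divided among the colas when the logos are removed.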
Excel functions
CHIDIST(x, degrees_freedom)
CHIDIST(20.6,2) = 0.0000336 < 0.05
CHIDIST(1.40,2) = 0.496585304 > 0.05
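For readers without Excel, the right-tail probability that CHIDIST returns can be approximated with a small Python sketch. The function name `chidist` is hypothetical and simply mirrors the Excel name. To stay within the standard library it handles only even degrees of freedom, where a closed-form finite sum exists. That restriction is a simplification of this example, not a property of CHIDIST.

```python
import math

def chidist(x: float, degrees_freedom: int) -> float:
    """Right-tail chi-square probability, restricted to even degrees of freedom."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if degrees_freedom <= 0 or degrees_freedom % 2 != 0:
        raise ValueError("this sketch only handles positive, even degrees of freedom")
    half_x = x / 2
    k = degrees_freedom // 2
    # For df = 2k: P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    series = sum(half_x ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_x) * series

# The two examples from the text, both with df = 2.
p_voucher = chidist(20.6, 2)
p_cola = chidist(1.40, 2)
print(f"{p_voucher:.7f}", p_voucher < 0.05)
print(f"{p_cola:.9f}", p_cola > 0.05)
```

A probability below 0.05 corresponds to rejecting the null hypothesis (the voucher example). A probability above 0.05 corresponds to not rejecting it (the cola example).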