
Chapter 14
Goodness-of-Fit Tests and Categorical Data Analysis
14.1 Goodness-of-Fit Tests When Category Probabilities Are Completely Specified
Terminology
A binomial experiment consists of a sequence of independent trials in which each trial can result in one of two possible outcomes. A multinomial experiment generalizes a binomial experiment by allowing each trial to result in one of k outcomes, where k is an integer greater than 2.
Multinomial Experiment
The expected number of trials resulting in category i is $E(N_i) = np_i$. When $H_0: p_1 = p_{10}, \ldots, p_k = p_{k0}$ is true, these expected values become $E(N_1) = np_{10}, \ldots, E(N_k) = np_{k0}$.
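As a quick illustration of $E(N_i) = np_{i0}$, the following sketch computes the expected cell counts under $H_0$ for a hypothetical set of category probabilities (the numbers are invented for illustration, not taken from the slides):

```python
import numpy as np

# Hypothetical null probabilities for k = 4 categories and n = 200 trials
p0 = np.array([0.4, 0.3, 0.2, 0.1])
n = 200

expected = n * p0        # E(N_i) = n * p_i0 under H0
print(expected)          # [80. 60. 40. 20.]
```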
Recall: Chi-squared Critical Value
Let $\chi^2_{\alpha,\nu}$, called a chi-squared critical value, denote the number on the measurement axis such that $\alpha$ of the area under the chi-squared curve with $\nu$ df lies to the right of $\chi^2_{\alpha,\nu}$.
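A chi-squared critical value can be looked up in a table or computed numerically; here is a minimal sketch using SciPy, assuming $\alpha = .05$ and $\nu = 5$ purely as an example:

```python
from scipy.stats import chi2

alpha, df = 0.05, 5
# Upper-tail critical value: area alpha lies to the right of this point
crit = chi2.ppf(1 - alpha, df)
print(round(crit, 3))    # 11.070, i.e. the tabled chi^2_{.05,5}
```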
$\chi^2_{\alpha,\nu}$ Notation Illustrated
[Figure: the chi-squared pdf with $\nu$ df; the shaded area $\alpha$ lies to the right of $\chi^2_{\alpha,\nu}$.]
Multinomial Experiment
Provided that $np_i \ge 5$ for every i, the random variable
$$\chi^2 = \sum_{i=1}^{k} \frac{(N_i - np_i)^2}{np_i} = \sum_{\text{all cells}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
has approximately a chi-squared distribution with k - 1 df.
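A minimal sketch of the statistic for hypothetical counts; scipy.stats.chisquare computes the same quantity:

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical observed counts for k = 4 categories (n = 200) and null probabilities
observed = np.array([84, 55, 42, 19])
p0 = np.array([0.4, 0.3, 0.2, 0.1])
expected = observed.sum() * p0

# chi^2 = sum over cells of (observed - expected)^2 / expected
stat = ((observed - expected) ** 2 / expected).sum()
print(stat, chisquare(observed, f_exp=expected).statistic)   # both give the same value
```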
Test With Significance Level $\alpha$
Null hypothesis: $H_0: p_1 = p_{10}, \ldots, p_k = p_{k0}$
Alternative hypothesis: $H_a$: at least one $p_i \ne p_{i0}$
Test statistic value: $\chi^2 = \sum_{i=1}^{k} \dfrac{(n_i - np_{i0})^2}{np_{i0}}$
Rejection region: $\chi^2 \ge \chi^2_{\alpha, k-1}$
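Putting the pieces together, here is a hedged end-to-end sketch of the level $\alpha$ test for hypothetical data (120 rolls of a die, with $H_0$: all faces equally likely):

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical data: 120 rolls of a die; H0: p_i0 = 1/6 for every face
observed = np.array([25, 18, 16, 21, 19, 21])
p0 = np.full(6, 1 / 6)
n, k, alpha = observed.sum(), 6, 0.05

expected = n * p0
stat = ((observed - expected) ** 2 / expected).sum()
crit = chi2.ppf(1 - alpha, k - 1)            # chi^2_{alpha, k-1}

print(f"chi2 = {stat:.3f}, critical value = {crit:.3f}")
print("reject H0" if stat >= crit else "do not reject H0")
```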
P-Values for Chi-Squared Tests
The P-value for an upper-tailed chi-squared test is the area under the $\chi^2_\nu$ curve to the right of the calculated $\chi^2$.
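In code, the upper-tail area is the survival function of the chi-squared distribution; continuing the hypothetical die example from the previous sketch ($\chi^2 = 2.400$ with 5 df):

```python
from scipy.stats import chi2

# P-value = area under the chi-squared curve (5 df) to the right of the statistic
p_value = chi2.sf(2.400, df=5)
print(round(p_value, 3))    # about 0.791
```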
$\chi^2$ When the $p_i$'s Are Functions of Other Parameters
Frequently the $p_i$'s are hypothesized to depend on a smaller number of parameters $\theta_1, \ldots, \theta_m$ ($m < k$). Then a specific hypothesis involving the $\theta_i$'s yields specific $p_{i0}$'s, which are then used in the $\chi^2$ test.
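As a hypothetical illustration (not an example from the slides), suppose three cell probabilities are determined by a single parameter $\theta$ via $p_1 = \theta^2$, $p_2 = 2\theta(1-\theta)$, $p_3 = (1-\theta)^2$; a value of $\theta$ specified by $H_0$ then pins down the $p_{i0}$'s used in the test:

```python
import numpy as np

# Hypothetical one-parameter model; H0 specifies theta = 0.5, which determines the p_i0's
theta = 0.5
p0 = np.array([theta**2, 2 * theta * (1 - theta), (1 - theta)**2])
print(p0)    # [0.25 0.5  0.25] -- feed these into the chi-squared test above
```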
$\chi^2$ When the Underlying Distribution Is Continuous
Let X denote the variable being sampled. The hypothesized pdf is $f_0(x)$. Subdivide the measurement scale of X into k intervals $[a_0, a_1), [a_1, a_2), \ldots, [a_{k-1}, a_k)$. The cell probabilities specified by $H_0$ are
$$p_{i0} = P(a_{i-1} \le X < a_i) = \int_{a_{i-1}}^{a_i} f_0(x)\,dx$$
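For a continuous null distribution, the cell probabilities are differences of the hypothesized cdf at the chosen boundaries. A sketch assuming a completely specified normal null (so df = k - 1), with invented boundaries and counts:

```python
import numpy as np
from scipy.stats import norm, chisquare

# Hypothesized f0: N(100, 15^2); boundaries a_0 < a_1 < ... < a_k chosen before seeing the data
edges = np.array([-np.inf, 85, 95, 105, 115, np.inf])
p0 = np.diff(norm.cdf(edges, loc=100, scale=15))   # p_i0 = F0(a_i) - F0(a_{i-1})

# Hypothetical observed counts in the k = 5 intervals (n = 100)
observed = np.array([15, 22, 30, 21, 12])
result = chisquare(observed, f_exp=observed.sum() * p0)
print(p0.round(3), round(result.statistic, 3))
```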
14.2 Goodness of Fit for Composite Hypotheses
$\chi^2$ When Parameters Are Estimated
The null hypothesis states that each $p_i$ is a function of a small number of parameters $\theta_1, \ldots, \theta_m$ with the $\theta_i$'s otherwise unspecified.
$H_0: p_1 = \pi_1(\boldsymbol{\theta}), \ldots, p_k = \pi_k(\boldsymbol{\theta})$, where $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_m)$
$H_a$: the hypothesis is not true
Joint Distribution
For general k, the joint distribution of $N_1, \ldots, N_k$ is the multinomial distribution with
$$P(N_1 = n_1, \ldots, N_k = n_k) \propto p_1^{n_1} \cdots p_k^{n_k}$$
When $H_0$ is true this becomes
$$P(N_1 = n_1, \ldots, N_k = n_k) \propto [\pi_1(\boldsymbol{\theta})]^{n_1} \cdots [\pi_k(\boldsymbol{\theta})]^{n_k}$$
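SciPy's multinomial distribution evaluates this joint pmf directly (it includes the multinomial coefficient, which does not involve the $p_i$'s); the numbers below are purely illustrative:

```python
from scipy.stats import multinomial

# Joint pmf of (N1, N2, N3) for n = 10 trials with hypothetical cell probabilities
p = [0.2, 0.5, 0.3]
print(multinomial.pmf([2, 5, 3], n=10, p=p))   # 10!/(2!5!3!) * 0.2^2 * 0.5^5 * 0.3^3 ≈ 0.085
```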
Method of Estimation
Let $n_1, \ldots, n_k$ denote the observed values of $N_1, \ldots, N_k$. Then $\hat{\theta}_1, \ldots, \hat{\theta}_m$ are those values of the $\theta_i$'s that maximize
$$P(N_1 = n_1, \ldots, N_k = n_k) \propto [\pi_1(\boldsymbol{\theta})]^{n_1} \cdots [\pi_k(\boldsymbol{\theta})]^{n_k}$$
$\hat{\theta}_1, \ldots, \hat{\theta}_m$ are the maximum likelihood estimators of $\theta_1, \ldots, \theta_m$.
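When no closed form is convenient, the grouped-data MLE can be found by numerically maximizing the log of $\prod_i [\pi_i(\boldsymbol{\theta})]^{n_i}$. A sketch for the hypothetical one-parameter model used earlier (the counts are invented):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical one-parameter model: pi_1 = t^2, pi_2 = 2t(1-t), pi_3 = (1-t)^2
counts = np.array([30, 50, 20])                 # observed n_1, n_2, n_3

def neg_log_likelihood(t):
    pi = np.array([t**2, 2 * t * (1 - t), (1 - t)**2])
    return -np.sum(counts * np.log(pi))         # maximizing prod pi_i^{n_i} = minimizing this

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(round(res.x, 4))   # numerical MLE; for this particular model it equals (2*n1 + n2)/(2n) = 0.55
```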
Theorem
Under general "regularity" conditions on $\theta_1, \ldots, \theta_m$ and the $\pi_i(\boldsymbol{\theta})$'s, if $\theta_1, \ldots, \theta_m$ are estimated by the method of maximum likelihood as described previously and n is large,
$$\chi^2 = \sum_{i=1}^{k} \frac{[N_i - n\pi_i(\hat{\boldsymbol{\theta}})]^2}{n\pi_i(\hat{\boldsymbol{\theta}})}$$
has approximately a chi-squared distribution with k - 1 - m df when $H_0$ is true.
Level $\alpha$ Test
An approximate level $\alpha$ test of $H_0$ versus $H_a$ is then to reject $H_0$ if $\chi^2 \ge \chi^2_{\alpha, k-1-m}$. In practice, the test can be used if $n\pi_i(\hat{\boldsymbol{\theta}}) \ge 5$ for every i.
Degrees of Freedom
A general rule of thumb for degrees of freedom in a chi-squared test is
df = (number of freely determined cell counts) - (number of independent parameters estimated)
Test Procedure
If $\chi^2 \ge \chi^2_{\alpha, k-1}$, reject $H_0$.
If $\chi^2 \le \chi^2_{\alpha, k-1-m}$, do not reject $H_0$.
If $\chi^2_{\alpha, k-1-m} \le \chi^2 \le \chi^2_{\alpha, k-1}$, withhold judgement.
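The three-way rule translates directly into code; a small helper assuming the statistic and $(\alpha, k, m)$ are given (the values used below are invented):

```python
from scipy.stats import chi2

def decide(stat, alpha, k, m):
    """Three-way decision used when the MLEs come from the full (ungrouped) sample."""
    lo = chi2.ppf(1 - alpha, k - 1 - m)   # chi^2_{alpha, k-1-m}
    hi = chi2.ppf(1 - alpha, k - 1)       # chi^2_{alpha, k-1}
    if stat >= hi:
        return "reject H0"
    if stat <= lo:
        return "do not reject H0"
    return "withhold judgement"

print(decide(stat=10.0, alpha=0.05, k=6, m=1))   # 10.0 lies between 9.488 and 11.070 -> withhold judgement
```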
Goodness of Fit for Discrete Distributions
Let $\hat{\theta}_1, \ldots, \hat{\theta}_m$ be the maximum likelihood estimators of $\theta_1, \ldots, \theta_m$ based on the full sample $X_1, \ldots, X_n$, and let $\chi^2$ denote the statistic based on these estimators. Then the critical value c that specifies a level $\alpha$ upper-tailed test satisfies
$$\chi^2_{\alpha, k-1-m} \le c \le \chi^2_{\alpha, k-1}$$
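A hedged sketch for a discrete family: testing a Poisson fit with $\lambda$ estimated by the full-sample MLE $\hat\lambda = \bar{x}$. The sample below is invented, and the last cell pools the upper tail so the estimated cell probabilities sum to 1:

```python
import numpy as np
from scipy.stats import poisson, chi2

# Hypothetical raw sample of n = 60 counts; H0: the data come from some Poisson distribution
x = np.array([0]*14 + [1]*20 + [2]*14 + [3]*8 + [4]*3 + [5]*1)
n = x.size
lam_hat = x.mean()                                   # full-sample MLE of lambda

# Cells: {0}, {1}, {2}, {>=3}; estimated cell probabilities pi_i(theta_hat)
pi_hat = poisson.pmf(np.arange(3), lam_hat)
pi_hat = np.append(pi_hat, 1 - pi_hat.sum())
observed = np.array([(x == 0).sum(), (x == 1).sum(), (x == 2).sum(), (x >= 3).sum()])

stat = ((observed - n * pi_hat) ** 2 / (n * pi_hat)).sum()
k, m, alpha = 4, 1, 0.05
lo, hi = chi2.ppf(1 - alpha, k - 1 - m), chi2.ppf(1 - alpha, k - 1)
print(f"chi2 = {stat:.2f}; the level-{alpha} critical value c lies between {lo:.2f} and {hi:.2f}")
```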
Goodness of Fit for Continuous Distributions
The chi-squared test can be used to test whether the sample comes from a specified family of continuous distributions. Once the cells are chosen (independent of the observations), it is usually difficult to estimate unspecified parameters from the observed cell counts, so mle's based on the full sample are computed.
Special Test for Normality
Let $y_i = \Phi^{-1}[(i - .375)/(n + .25)]$ and compute r for the pairs $(x_{(1)}, y_1), \ldots, (x_{(n)}, y_n)$. The Ryan-Joiner test of
$H_0$: the population distribution is normal
versus
$H_a$: the population distribution is not normal
consists of rejecting $H_0$ when $r \le c$.
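A minimal sketch of the Ryan-Joiner correlation r, assuming a hypothetical sample; the critical value c depends on $\alpha$ and n and is taken from a Ryan-Joiner table, so it is not computed here:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=25))             # ordered observations x_(1) <= ... <= x_(n)
n = x.size
i = np.arange(1, n + 1)
y = norm.ppf((i - 0.375) / (n + 0.25))       # normal scores y_i

r = np.corrcoef(x, y)[0, 1]                  # sample correlation of the (x_(i), y_i) pairs
print(round(r, 4))                           # reject H0 (normality) when r <= c
```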
14.3 Two-Way Contingency Tables
Data With Counts or Frequencies
1. There are I populations of interest, each corresponding to a different row of the table, and each population is divided into the same J categories. A sample is taken from the ith population, and the counts are entered in the cells in the ith row of the table.
Data With Counts or Frequencies
2. There is a single population of interest, with each individual in the population categorized with respect to two different factors. There are I categories associated with the first factor and J categories associated with the second factor.
Two-Way Contingency Table
n11
n21



ni1



nI1
n12

n1j
 n1J





nij

nIJ
Estimated Expected Counts Under H0 (Homogeneity)
$$\hat{e}_{ij} = \text{estimated expected count in cell } (i, j) = n_{i\cdot} \cdot \frac{n_{\cdot j}}{n} = \frac{(i\text{th row total})(j\text{th column total})}{n}$$
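A sketch computing $\hat{e}_{ij}$ from a hypothetical I × J table of counts (the numbers are invented):

```python
import numpy as np

# Hypothetical 2 x 3 table of observed counts n_ij (rows = independent samples)
n_ij = np.array([[24, 15, 21],
                 [18, 12, 30]])
n = n_ij.sum()

row_totals = n_ij.sum(axis=1, keepdims=True)   # n_i.
col_totals = n_ij.sum(axis=0, keepdims=True)   # n_.j
e_hat = row_totals * col_totals / n            # e_hat_ij = (row total)(column total) / n
print(e_hat)
```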
Test for Homogeneity
Null hypothesis: $H_0: p_{1j} = p_{2j} = \cdots = p_{Ij}$ for $j = 1, \ldots, J$
Alternative hypothesis: $H_a$: $H_0$ is not true
Test statistic value: $\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \dfrac{(n_{ij} - \hat{e}_{ij})^2}{\hat{e}_{ij}}$
Rejection region: $\chi^2 \ge \chi^2_{\alpha, (I-1)(J-1)}$
Apply as long as $\hat{e}_{ij} \ge 5$ for all cells.
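A self-contained sketch of the homogeneity test for the same hypothetical table; scipy.stats.chi2_contingency reproduces the statistic, P-value, and degrees of freedom:

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical counts: I = 2 independent samples classified into J = 3 categories
n_ij = np.array([[24, 15, 21],
                 [18, 12, 30]])
e_hat = n_ij.sum(axis=1, keepdims=True) * n_ij.sum(axis=0, keepdims=True) / n_ij.sum()

stat = ((n_ij - e_hat) ** 2 / e_hat).sum()
I, J = n_ij.shape
alpha = 0.05
crit = chi2.ppf(1 - alpha, (I - 1) * (J - 1))
print(f"chi2 = {stat:.3f}, critical value = {crit:.3f}")

# scipy computes the same statistic and degrees of freedom
chi2_stat, p_value, dof, expected = chi2_contingency(n_ij)
print(round(chi2_stat, 3), round(p_value, 3), dof)
```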
Estimated Expected Counts (Independence)
$$\hat{e}_{ij} = \text{estimated expected count in cell } (i, j) = n \cdot \hat{p}_{i\cdot} \cdot \hat{p}_{\cdot j} = n \cdot \frac{n_{i\cdot}}{n} \cdot \frac{n_{\cdot j}}{n} = \frac{n_{i\cdot}\, n_{\cdot j}}{n} = \frac{(i\text{th row total})(j\text{th column total})}{n}$$
Test for Independence
Null hypothesis: $H_0: p_{ij} = p_{i\cdot} \cdot p_{\cdot j}$ for every pair (i, j)
Alternative hypothesis: $H_a$: $H_0$ is not true
Test statistic value: $\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \dfrac{(n_{ij} - \hat{e}_{ij})^2}{\hat{e}_{ij}}$
Rejection region: $\chi^2 \ge \chi^2_{\alpha, (I-1)(J-1)}$
Apply as long as $\hat{e}_{ij} \ge 5$ for all cells.
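For the independence test the arithmetic is identical; only the sampling scheme and hypotheses differ. A sketch with a hypothetical single sample of n = 200 individuals cross-classified by two factors:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2 x 3 cross-classification of a single sample of n = 200 individuals
n_ij = np.array([[35, 25, 20],
                 [30, 40, 50]])

chi2_stat, p_value, dof, expected = chi2_contingency(n_ij)
print(round(chi2_stat, 3), round(p_value, 4), dof)
print(expected)    # e_hat_ij = n_i. * n_.j / n
```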