Transcript Slide 1
Chapter 14: Goodness-of-Fit Tests and Categorical Data Analysis
Copyright (c) 2004 Brooks/Cole, a division of Thomson Learning, Inc.

14.1 Goodness-of-Fit Tests When Category Probabilities Are Completely Specified

Terminology
A binomial experiment consists of a sequence of independent trials in which each trial can result in one of two possible outcomes. A multinomial experiment generalizes a binomial experiment by allowing each trial to result in one of k outcomes, where k is an integer greater than 2.

Multinomial Experiment
The expected number of trials resulting in category i is $E(N_i) = np_i$. When $H_0: p_1 = p_{10}, \ldots, p_k = p_{k0}$ is true, these expected values become $E(N_1) = np_{10}, \ldots, E(N_k) = np_{k0}$.

Recall: Chi-Squared Critical Value
Let $\chi^2_{\alpha,\nu}$, called a chi-squared critical value, denote the number on the measurement axis such that $\alpha$ of the area under the chi-squared curve with $\nu$ df lies to the right of $\chi^2_{\alpha,\nu}$.

$\chi^2_{\alpha,\nu}$ Notation Illustrated
[Figure: the $\chi^2_\nu$ pdf, with the shaded area to the right of $\chi^2_{\alpha,\nu}$ equal to $\alpha$.]

Multinomial Experiment
Provided that $np_i \ge 5$ for every i, the random variable
$$\chi^2 = \sum_{i=1}^{k} \frac{(N_i - np_i)^2}{np_i} = \sum_{\text{all cells}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
has approximately a chi-squared distribution with $k - 1$ df.

Test With Significance Level $\alpha$
Null hypothesis: $H_0: p_1 = p_{10}, \ldots, p_k = p_{k0}$
Alternative hypothesis: $H_a$: at least one $p_i \ne p_{i0}$
Test statistic value: $\chi^2 = \sum_{i=1}^{k} \frac{(n_i - np_{i0})^2}{np_{i0}}$
Rejection region: $\chi^2 \ge \chi^2_{\alpha,k-1}$

P-Values for Chi-Squared Tests
The P-value for an upper-tailed chi-squared test is the area under the $\chi^2_\nu$ curve to the right of the calculated $\chi^2$.
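The test with completely specified category probabilities can be sketched in a few lines of Python. This is only an illustration: the function name chi_squared_statistic and the counts are hypothetical and do not come from the slides, and the critical value 7.815 is the tabulated $\chi^2_{.05,3}$.

```python
def chi_squared_statistic(observed, null_probs):
    """Return (statistic, df) for H0: p_i = p_i0 with every p_i0 fully specified."""
    if len(observed) != len(null_probs):
        raise ValueError("need one null probability per category")
    if abs(sum(null_probs) - 1.0) > 1e-9:
        raise ValueError("null probabilities must sum to 1")
    n = sum(observed)
    # Expected counts under H0: E(N_i) = n * p_i0
    expected = [n * p0 for p0 in null_probs]
    if min(expected) < 5:
        raise ValueError("chi-squared approximation needs n * p_i0 >= 5 in every cell")
    statistic = sum((n_i - e_i) ** 2 / e_i for n_i, e_i in zip(observed, expected))
    return statistic, len(observed) - 1


# Hypothetical data: n = 100 trials, k = 4 equally likely categories under H0.
observed = [30, 20, 25, 25]
null_probs = [0.25, 0.25, 0.25, 0.25]

statistic, df = chi_squared_statistic(observed, null_probs)
critical_value = 7.815  # tabulated chi-squared critical value for alpha = .05, 3 df
reject_h0 = statistic >= critical_value
print(statistic, df, reject_h0)
```

With these made-up counts every expected count is 25, so the statistic is (25 + 25 + 0 + 0)/25 = 2, which is below the critical value and H0 would not be rejected at level .05.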
When the $p_i$'s Are Functions of Other Parameters
Frequently the $p_i$'s are hypothesized to depend on a smaller number of parameters $\theta_1, \ldots, \theta_m$ ($m < k$). Then a specific hypothesis involving the $\theta_i$'s yields specific $p_{i0}$'s, which are then used in the $\chi^2$ test.

When the Underlying Distribution Is Continuous
Let X denote the variable being sampled. The hypothesized pdf is $f_0(x)$. Subdivide the measurement scale of X into k intervals $[a_0, a_1), \ldots, [a_{k-1}, a_k)$. The cell probabilities specified by $H_0$ are
$$p_{i0} = P(a_{i-1} \le X < a_i) = \int_{a_{i-1}}^{a_i} f_0(x)\,dx$$

14.2 Goodness of Fit for Composite Hypotheses

When Parameters Are Estimated
The null hypothesis states that each $p_i$ is a function of a small number of parameters $\theta_1, \ldots, \theta_m$, with the $\theta_i$'s otherwise unspecified.
$H_0: p_1 = \pi_1(\theta), \ldots, p_k = \pi_k(\theta)$, where $\theta = (\theta_1, \ldots, \theta_m)$
$H_a$: the hypothesis $H_0$ is not true

Joint Distribution
For general k, the joint distribution of $N_1, \ldots, N_k$ is the multinomial distribution with
$$P(N_1 = n_1, \ldots, N_k = n_k) \propto p_1^{n_1} \cdots p_k^{n_k}$$
When $H_0$ is true this becomes
$$P(N_1 = n_1, \ldots, N_k = n_k) \propto [\pi_1(\theta)]^{n_1} \cdots [\pi_k(\theta)]^{n_k}$$

Method of Estimation
Let $n_1, \ldots, n_k$ denote the observed values of $N_1, \ldots, N_k$. Then $\hat\theta_1, \ldots, \hat\theta_m$ are those values of the $\theta_i$'s that maximize
$$P(N_1 = n_1, \ldots, N_k = n_k) \propto [\pi_1(\theta)]^{n_1} \cdots [\pi_k(\theta)]^{n_k}$$
The resulting $\hat\theta_1, \ldots, \hat\theta_m$ are the maximum likelihood estimators of $\theta_1, \ldots, \theta_m$.
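As a rough illustration of the method of estimation, the sketch below maximizes the multinomial log likelihood numerically for a single parameter (m = 1) and k = 3 cells. The cell functions $\pi_1 = \theta^2$, $\pi_2 = 2\theta(1-\theta)$, $\pi_3 = (1-\theta)^2$ are an assumed example, not taken from the slides, and the counts and function names are hypothetical. For this particular model the maximizer also has the closed form $(2n_1 + n_2)/(2n)$, which serves as a check.

```python
import math


def cell_probs(theta):
    # Assumed illustrative model with m = 1 parameter and k = 3 cells.
    return [theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2]


def log_likelihood(theta, counts):
    # Log of pi_1(theta)^n_1 * ... * pi_k(theta)^n_k (constant factor dropped).
    return sum(n_i * math.log(p_i) for n_i, p_i in zip(counts, cell_probs(theta)))


def maximize_on_unit_interval(f, tol=1e-10):
    """Golden-section search for the maximizer of a unimodal f on (0, 1)."""
    inv_phi = (math.sqrt(5) - 1) / 2
    lo, hi = 1e-9, 1 - 1e-9
    while hi - lo > tol:
        x1 = hi - inv_phi * (hi - lo)
        x2 = lo + inv_phi * (hi - lo)
        if f(x1) < f(x2):
            lo = x1  # the maximum lies to the right of x1
        else:
            hi = x2  # the maximum lies to the left of x2
    return (lo + hi) / 2


counts = [20, 50, 30]  # hypothetical observed n_1, n_2, n_3
theta_hat = maximize_on_unit_interval(lambda t: log_likelihood(t, counts))
closed_form = (2 * counts[0] + counts[1]) / (2 * sum(counts))
print(theta_hat, closed_form)
```

Golden-section search is used here only because it needs nothing beyond the standard library; it relies on the log likelihood being unimodal in $\theta$, which holds for this assumed model but would need checking for others.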
Theorem
Under general "regularity" conditions on $\theta_1, \ldots, \theta_m$ and the $\pi_i(\theta)$'s, if $\theta_1, \ldots, \theta_m$ are estimated by the method of maximum likelihood as described previously and n is large,
$$\chi^2 = \sum_{i=1}^{k} \frac{[N_i - n\pi_i(\hat\theta)]^2}{n\pi_i(\hat\theta)}$$
has approximately a chi-squared distribution with $k - 1 - m$ df when $H_0$ is true.

Level $\alpha$ Test
An approximate level $\alpha$ test of $H_0$ versus $H_a$ is then to reject $H_0$ if $\chi^2 \ge \chi^2_{\alpha,k-1-m}$. In practice, the test can be used if $n\pi_i(\hat\theta) \ge 5$ for every i.

Degrees of Freedom
A general rule of thumb for degrees of freedom in a chi-squared test is
$\chi^2$ df = (number of freely determined cell counts) - (number of independent parameters estimated)

Test Procedure
If $\chi^2 \ge \chi^2_{\alpha,k-1}$, reject $H_0$.
If $\chi^2 \le \chi^2_{\alpha,k-1-m}$, do not reject $H_0$.
If $\chi^2_{\alpha,k-1-m} < \chi^2 < \chi^2_{\alpha,k-1}$, withhold judgment.

Goodness of Fit for Discrete Distributions
Let $\hat\theta_1, \ldots, \hat\theta_m$ be the maximum likelihood estimators of $\theta_1, \ldots, \theta_m$ based on the full sample $X_1, \ldots, X_n$, and let $\chi^2$ denote the statistic based on these estimators. Then the critical value $c_\alpha$ that specifies a level $\alpha$ upper-tailed test satisfies
$$\chi^2_{\alpha,k-1-m} \le c_\alpha \le \chi^2_{\alpha,k-1}$$

Goodness of Fit for Continuous Distributions
The chi-squared test can be used to test whether the sample comes from a specified family of continuous distributions. Once the cells are chosen (independent of the observations), it is usually difficult to estimate unspecified parameters from the observed cell counts, so MLEs based on the full sample are computed.

Special Test for Normality
Let $y_i = \Phi^{-1}[(i - .375)/(n + .25)]$ and compute the sample correlation coefficient r for the pairs $(x_{(1)}, y_1), \ldots, (x_{(n)}, y_n)$, where $x_{(i)}$ is the ith smallest observation. The Ryan-Joiner test of $H_0$: the population distribution is normal versus $H_a$: the population
distribution is not normal consists of rejecting $H_0$ when $r \le c_\alpha$.

14.3 Two-Way Contingency Tables

Data With Counts or Frequencies
1. There are I populations of interest, each corresponding to a different row of the table, and each population is divided into the same J categories. A sample is taken from the ith population, and the counts are entered in the cells in the ith row of the table.
2. There is a single population of interest, with each individual in the population categorized with respect to two different factors. There are I categories associated with the first factor and J categories associated with the second factor.

Two-Way Contingency Table
$$\begin{array}{cccccc}
n_{11} & n_{12} & \cdots & n_{1j} & \cdots & n_{1J} \\
n_{21} & & & \vdots & & \\
\vdots & & & & & \\
n_{i1} & & \cdots & n_{ij} & \cdots & \\
\vdots & & & & & \\
n_{I1} & & & \cdots & & n_{IJ}
\end{array}$$

Estimated Expected Counts Under $H_0$ (Homogeneity)
$$\hat{e}_{ij} = \text{estimated expected count in cell } (i, j) = \frac{n_{i\cdot}\, n_{\cdot j}}{n} = \frac{(i\text{th row total})(j\text{th column total})}{n}$$

Test for Homogeneity
Null hypothesis: $H_0: p_{1j} = p_{2j} = \cdots = p_{Ij}$ for $j = 1, \ldots, J$
Alternative hypothesis: $H_a$: $H_0$ is not true
Test statistic value: $\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - \hat{e}_{ij})^2}{\hat{e}_{ij}}$
Rejection region: $\chi^2 \ge \chi^2_{\alpha,(I-1)(J-1)}$
Apply as long as $\hat{e}_{ij} \ge 5$ for all cells.
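The estimated expected counts and the test statistic for a two-way table can be computed as in the following sketch. The function name two_way_chi_squared and the 2 x 2 table are hypothetical, chosen only so the arithmetic is easy to check by hand.

```python
def two_way_chi_squared(table):
    """Return (statistic, df) for an I x J table of counts given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            # Estimated expected count: (ith row total)(jth column total) / n
            e_ij = row_totals[i] * col_totals[j] / n
            if e_ij < 5:
                raise ValueError("need estimated expected count >= 5 in every cell")
            statistic += (n_ij - e_ij) ** 2 / e_ij
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


# Hypothetical 2 x 2 table: two populations (rows), two categories (columns).
table = [[20, 30],
         [30, 20]]
statistic, df = two_way_chi_squared(table)
print(statistic, df)
```

Here every row and column total is 50, so each estimated expected count is 25 and the statistic is 4 with 1 df; it would then be compared with the tabulated critical value $\chi^2_{\alpha,1}$.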
Estimated Expected Counts (Independence)
$$\hat{e}_{ij} = \text{estimated expected count in cell } (i, j) = n \cdot \hat{p}_{i\cdot} \cdot \hat{p}_{\cdot j} = n \cdot \frac{n_{i\cdot}}{n} \cdot \frac{n_{\cdot j}}{n} = \frac{n_{i\cdot}\, n_{\cdot j}}{n} = \frac{(i\text{th row total})(j\text{th column total})}{n}$$

Test for Independence
Null hypothesis: $H_0: p_{ij} = p_{i\cdot} \cdot p_{\cdot j}$ for $i = 1, \ldots, I$ and $j = 1, \ldots, J$
Alternative hypothesis: $H_a$: $H_0$ is not true
Test statistic value: $\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - \hat{e}_{ij})^2}{\hat{e}_{ij}}$
Rejection region: $\chi^2 \ge \chi^2_{\alpha,(I-1)(J-1)}$
Apply as long as $\hat{e}_{ij} \ge 5$ for all cells.
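A short counting argument suggests where the $(I-1)(J-1)$ df in the independence test comes from. It is a sketch based on the general rule that df equals the number of freely determined cell counts minus the number of independent parameters estimated. With a single sample of size n spread over $IJ$ cells, $IJ - 1$ counts are freely determined; the row probabilities $p_{i\cdot}$ sum to 1, so only $I - 1$ of them are independent, and likewise $J - 1$ of the column probabilities $p_{\cdot j}$.

```latex
\begin{align*}
\text{df} &= (IJ - 1) - \bigl[(I - 1) + (J - 1)\bigr] \\
          &= IJ - I - J + 1 \\
          &= (I - 1)(J - 1)
\end{align*}
```

For example, a table with I = 3 rows and J = 4 columns gives (3 - 1)(4 - 1) = 6 df.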