Chapter 1 Looking at Data— Distributions


10 The Analysis of Variance
10.1 Single-Factor ANOVA
Single-factor ANOVA focuses on a comparison of more than two population or treatment means. Let
I = the number of populations or treatments being compared
μ1 = the mean of population 1 or the true average response when treatment 1 is applied
.
.
.
μI = the mean of population I or the true average response when treatment I is applied
The relevant hypotheses are
H0: μ1 = μ2 = ··· = μI
versus
Ha: at least two of the μi’s are different
If I = 4, H0 is true only if all four μi’s are identical. Ha would be true, for example, if μ1 = μ2 ≠ μ3 = μ4, if μ1 = μ3 = μ4 ≠ μ2, or if all four μi’s differ from one another.
The Idea of ANOVA



The sample means for the three samples are the same for each set.
The variation among sample means for (a) is identical to that for (b).
The variation among the individuals within the three samples is much less for (b).
CONCLUSION: the samples in (b) contain a larger amount of variation among the sample means relative to the amount of variation within the samples, so ANOVA will find more significant differences among the means in (b)
– assuming equal sample sizes here for (a) and (b).
– Note: with the same sample means and the same within-sample variation, larger samples yield more significant results.
Comparing Several Means
Do SUVs, trucks, and midsize cars have the same gas mileage?
Response variable: gas mileage (mpg)
Groups: vehicle classification
31 midsize cars
31 SUVs
14 standard-size pickup trucks
Data from the Environmental Protection Agency’s Model Year 2003 Fuel Economy Guide, www.fueleconomy.gov.
Means:
Midsize: 27.903
SUV: 22.677
Pickup: 21.286


Mean gas mileage for
SUVs and pickups
appears less than for
midsize cars.
Are these differences
statistically significant?
Null hypothesis:
The true means (for gas mileage) are the
same for all groups (the three vehicle
classifications).
We could look at separate t tests to compare each pair of
means to see if they are different:
27.903 vs. 22.677, 27.903 vs. 21.286, & 22.677 vs. 21.286
H0: μ1 = μ2
H0: μ1 = μ3
H0: μ2 = μ3
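However, this gives rise to the problem of multiple comparisons: every additional test is another chance of a false positive, so the overall error rate is higher than the level used for any single test.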
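To make the multiple-comparisons problem concrete, the following minimal sketch computes the chance of at least one false positive when several tests are each run at level α. It treats the tests as independent, which is only an approximation here because the three pairwise t tests share groups; the function name is illustrative and not from the slides.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one false positive across independent tests.

    Assumption: the tests are independent. Pairwise t tests on shared
    groups are not exactly independent, so this is a rough guide only.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


# Three pairwise comparisons among three groups, each at the 5% level.
rate = familywise_error_rate(0.05, 3)
print(f"Familywise error rate: {rate:.4f}")
```

Under this approximation the overall error rate for three tests is roughly 14%, well above the nominal 5% of each individual test.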
The One-Way ANOVA Model
Random sampling always produces chance variations. Any “factor
effect” would thus show up in our data as the factor-driven
differences plus chance variations (“error”):
Data = fit + residual
The one-way ANOVA model analyzes situations where chance variations are normally distributed N(0,σ) such that:
xij = μi + εij, where the εij are independent N(0,σ) errors
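The "Data = fit + residual" idea can be sketched with a small simulation: each observation is a group mean plus a normal chance variation with a common σ. The group means, σ, sample size, and seed below are hypothetical values chosen only for illustration.

```python
import random


def simulate_one_way(means, sigma, n_per_group, seed=0):
    """Draw n_per_group observations per group as mean + N(0, sigma) noise."""
    rng = random.Random(seed)
    return [[mu + rng.gauss(0.0, sigma) for _ in range(n_per_group)]
            for mu in means]


true_means = [10.0, 12.0, 15.0]  # hypothetical group means
samples = simulate_one_way(true_means, sigma=1.0, n_per_group=1000)

fits = []
for mu, group in zip(true_means, samples):
    fit = sum(group) / len(group)           # fit = sample mean of the group
    residuals = [x - fit for x in group]    # residual = data - fit
    fits.append(fit)
    print(f"true mean {mu:5.1f}  fit {fit:6.3f}  "
          f"sum of residuals {sum(residuals):.2e}")
```

The fitted value for each group is its sample mean, and the residuals within a group sum to zero up to rounding.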
The ANOVA F Test
To determine statistical significance, we need a test statistic that we can
calculate:
The ANOVA F Statistic
The analysis of variance F statistic for testing the equality of several
means has this form:
F = (variation among the sample means) / (variation among individuals in the same sample)

Difference in means small relative to overall variability: F tends to be small.
Difference in means large relative to overall variability: F tends to be large.

Larger F-values typically yield more significant results. How large F must be depends on the degrees of freedom (I - 1 and N - I).
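The measures of variation in the numerator and denominator are mean squares:

Numerator: Mean Square for Treatments (MSTr)
$$\mathrm{MSTr} = \frac{n_1(\bar{x}_{1\cdot} - \bar{x}_{\cdot\cdot})^2 + n_2(\bar{x}_{2\cdot} - \bar{x}_{\cdot\cdot})^2 + \cdots + n_I(\bar{x}_{I\cdot} - \bar{x}_{\cdot\cdot})^2}{I - 1}$$

Denominator: Mean Square for Error (MSE)
$$\mathrm{MSE} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_I - 1)s_I^2}{N - I}$$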
Notation
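The individual sample means will be denoted by $\bar{X}_{1\cdot}, \bar{X}_{2\cdot}, \ldots, \bar{X}_{I\cdot}$. That is,
$$\bar{X}_{i\cdot} = \frac{\sum_{j=1}^{n_i} X_{ij}}{n_i} \quad \text{for } i = 1, \ldots, I$$
Similarly, the average of all N observations, called the grand mean, is
$$\bar{X}_{\cdot\cdot} = \frac{\sum_{i=1}^{I} \sum_{j=1}^{n_i} X_{ij}}{N}$$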
Additionally, let $S_1^2, S_2^2, \ldots, S_I^2$ denote the sample variances:
$$S_i^2 = \frac{\sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i\cdot})^2}{n_i - 1} \quad \text{for } i = 1, \ldots, I$$
The ANOVA Table
The computations are often summarized in a tabular format, called an ANOVA table, as shown in the table below. Tables produced by statistical software customarily include a P-value column to the right of F.

Source of variation   Sum of squares     Df      Mean square    F          P value             F crit
Treatments            SSTr               I - 1   SSTr/(I - 1)   MSTr/MSE   Tail area above F   Critical value of F for α
Error                 SSE                N - I   SSE/(N - I)
Total                 SST = SSTr + SSE   N - 1

An ANOVA Table
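The first three columns of the table can be checked numerically. The sketch below computes each sum of squares straight from its definition, so the identity SST = SSTr + SSE and the matching degrees-of-freedom identity can be verified rather than assumed. The data are made up, with unequal sample sizes on purpose, and the function name is illustrative.

```python
def anova_table(groups):
    """Return (source, sum of squares, df) rows for a one-way layout."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group variation: group means around the grand mean.
    sstr = sum(len(g) * (m - grand_mean) ** 2
               for g, m in zip(groups, group_means))
    # Within-group variation: observations around their own group mean.
    sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    # Total variation: observations around the grand mean.
    sst = sum((x - grand_mean) ** 2 for g in groups for x in g)
    return [
        ("Treatments", sstr, len(groups) - 1),
        ("Error", sse, n_total - len(groups)),
        ("Total", sst, n_total - 1),
    ]


# Made-up data with unequal sample sizes.
rows = anova_table([[2.0, 4.0], [1.0, 2.0, 3.0], [7.0, 8.0, 9.0, 12.0]])
for source, ss, df in rows:
    print(f"{source:<12}{ss:>10.3f}{df:>5d}")
```

Computing SST independently, instead of defining it as the sum of the other two, is what makes the printed rows a genuine check of the decomposition.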
F Distributions and the F Test
An F distribution has two parameters: the numerator degrees of freedom ν1 and the denominator degrees of freedom ν2. Both ν1 and ν2 are positive integers. Figure 10.3 pictures an F density curve and the corresponding upper-tail critical value.
Appendix Table A.9 gives these critical values for α = .10, .05, .01, and .001.
Values of ν1 are identified with different columns of the table, and the rows are labeled with various values of ν2.
Figure 10.3: An F density curve and critical value
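When a printed table is not at hand, the upper-tail area of an F distribution can be approximated numerically. The sketch below is one possible approach, not the method behind Table A.9: it uses the standard identity that P(F > f) equals a regularized incomplete beta function and evaluates the integral with Simpson's rule. It assumes f > 0 and ν2 ≥ 2; the function name and step count are illustrative choices.

```python
import math


def f_upper_tail(f_value, df1, df2, steps=2000):
    """Approximate P(F > f_value) for an F(df1, df2) distribution.

    Uses P(F > f) = I_u(df2/2, df1/2) with u = df2 / (df2 + df1 * f),
    integrating the beta density kernel on [0, u] by Simpson's rule.
    Assumptions: f_value > 0, df2 >= 2, and steps is even.
    """
    a, b = df2 / 2.0, df1 / 2.0
    upper = df2 / (df2 + df1 * f_value)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def kernel(t):
        return t ** (a - 1.0) * (1.0 - t) ** (b - 1.0)

    h = upper / steps
    total = kernel(0.0) + kernel(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * kernel(k * h)
    return (total * h / 3.0) / math.exp(log_beta)


# Critical values quoted later in this section for 3 and 12 degrees of freedom.
print(f"P(F > 3.49)  = {f_upper_tail(3.49, 3, 12):.4f}")
print(f"P(F > 10.80) = {f_upper_tail(10.80, 3, 12):.5f}")
```

The two printed tail areas should land close to .05 and .001, the significance levels those critical values correspond to.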
Nematodes and plant growth
Do nematodes affect plant growth? A botanist prepares
16 identical planting pots and adds different numbers of
nematodes into the pots. Seedling growth (in mm) is
recorded two weeks later.
Hypotheses: all μi are the same (H0)
versus not all μi are the same (Ha)

Nematodes   Seedling growth (mm)       Mean (x̄i)
0           10.8   9.1   13.5   9.2    10.65
1,000       11.1   11.1  8.2    11.3   10.43
5,000       5.4    4.6   7.4    5.0    5.60
10,000      5.8    5.3   3.2    7.5    5.45
overall mean 8.03
Output for the one-way ANOVA
Menu/Tools/Data Analysis/Anova: Single Factor

Anova: Single Factor

SUMMARY
Groups             Count   Sum    Average   Variance
0 nematode         4       42.6   10.65     4.21667
1000 nematodes     4       41.7   10.425    2.20917
5000 nematodes     4       22.4   5.6       1.54667
10000 nematodes    4       21.8   5.45      3.13667

ANOVA
Source of Variation            SS        df   MS        F         P-value   F crit
Between Groups (numerator)     100.647   3    33.549    12.0797   0.00062   3.4902996
Within Groups (denominator)    33.3275   12   2.77729
Total                          133.974   15
Here, the calculated F-value (12.08) is larger than F critical (3.49) for α = 0.05.
Thus, the test is significant at the 5% level: not all mean seedling lengths are the same; the number of nematodes is an influential factor.
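The spreadsheet output can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not the spreadsheet's own procedure; the data are the sixteen seedling measurements from the example, the critical value 3.49 is taken from the output above, and the variable names are illustrative.

```python
from statistics import mean, variance

# Seedling growth (mm) by number of nematodes per pot, from the example.
growth_mm = {
    0: [10.8, 9.1, 13.5, 9.2],
    1000: [11.1, 11.1, 8.2, 11.3],
    5000: [5.4, 4.6, 7.4, 5.0],
    10000: [5.8, 5.3, 3.2, 7.5],
}

groups = list(growth_mm.values())
num_groups = len(groups)                     # I = 4
n_total = sum(len(g) for g in groups)        # N = 16
grand_mean = sum(sum(g) for g in groups) / n_total

ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((len(g) - 1) * variance(g) for g in groups)
ms_between = ss_between / (num_groups - 1)
ms_within = ss_within / (n_total - num_groups)
f_stat = ms_between / ms_within

F_CRIT_05 = 3.49  # critical value for alpha = 0.05, copied from the output
print(f"SS between = {ss_between:.3f}, SS within = {ss_within:.4f}")
print(f"F = {f_stat:.4f}, significant at 5%: {f_stat > F_CRIT_05}")
```

The sums of squares and the F statistic agree with the Between Groups and Within Groups rows of the output.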
Using F-table
The F distribution is asymmetric and has two distinct degrees of freedom, one for the numerator and one for the denominator. It was discovered by Fisher, hence the label “F.”
Once again, we calculate the value of F for our sample data and then look up the corresponding upper-tail area under the curve in the F table.
ANOVA
Source of Variation   SS     df   MS     F      P-value   F crit
Between Treatments    101    3    33.5   12.1   0.00062   3.4903
Within Treatments     33.3   12   2.78
Total                 134    15
F critical for α = 5% is 3.49.
F = 12.08 > 10.80, the critical value for α = 0.001 with 3 and 12 degrees of freedom.
Thus p < 0.001.