Statistics 846.3(02) Statistics 349.3(02) Lecture Notes


One way Analysis of Variance
(ANOVA)
Comparing k Populations
The F test – for comparing k means
Situation
• We have k normal populations.
• Let $\mu_i$ and $\sigma_i$ denote the mean and standard deviation of population i.
• i = 1, 2, 3, …, k.
• Note: we assume that the standard deviation for each population is the same.
\[ \sigma_1 = \sigma_2 = \dots = \sigma_k = \sigma \]
We want to test
\[ H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k \]
against
\[ H_A: \mu_i \neq \mu_j \text{ for at least one pair } i, j \]
Computing Formulae:
Compute
1) $T_i = \sum_{j=1}^{n_i} x_{ij}$ = Total for sample i
2) $G = \sum_{i=1}^{k} T_i = \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}$ = Grand Total
3) $N = \sum_{i=1}^{k} n_i$ = Total sample size
4) $\sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}^2$
5) $\sum_{i=1}^{k} \frac{T_i^2}{n_i}$
The data
• Assume we have collected data from each of
k populations
• Let $x_{i1}, x_{i2}, x_{i3}, \dots$ denote the $n_i$ observations from population i.
• i = 1, 2, 3, …, k.
Then
1) $SS_{Between} = \sum_{i=1}^{k} \frac{T_i^2}{n_i} - \frac{G^2}{N}$
2) $SS_{Within} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{k} \frac{T_i^2}{n_i}$
3) $F = \frac{SS_{Between}/(k-1)}{SS_{Within}/(N-k)}$
Anova Table
Source    d.f.   Sum of Squares   Mean Square   F-ratio
Between   k-1    SSBetween        MSBetween     MSB/MSW
Within    N-k    SSWithin         MSWithin
Total     N-1    SSTotal
where MS = SS/df
Example
In the following example we are comparing weight
gains resulting from the following six diets
1. Diet 1 - High Protein, Beef
2. Diet 2 - High Protein, Cereal
3. Diet 3 - High Protein, Pork
4. Diet 4 - Low Protein, Beef
5. Diet 5 - Low Protein, Cereal
6. Diet 6 - Low Protein, Pork
Gains in weight (grams) for rats under six diets
differing in level of protein (High or Low)
and source of protein (Beef, Cereal, or Pork)
Diet   Weight gains (grams)                                Mean    Std. Dev.   Σx     Σx²
1       73  102  118  104   81  107  100   87  117  111    100.0   15.14       1000   102062
2       98   74   56  111   95   88   82   77   86   92     85.9   15.02        859    75819
3       94   79   96   98  102  102  108   91  120  105     99.5   10.92        995   100075
4       90   76   90   64   86   51   72   90   95   78     79.2   13.89        792    64462
5      107   95   97   80   98   74   74   67   89   58     83.9   15.71        839    72613
6       49   82   73   86   81   97  106   70   61   82     78.7   16.55        787    64401
Thus
\[ SS_{Between} = \sum_{i=1}^{k} \frac{T_i^2}{n_i} - \frac{G^2}{N} = 467846 - \frac{5272^2}{60} = 4612.933 \]
\[ SS_{Within} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{k} \frac{T_i^2}{n_i} = 479432 - 467846 = 11586 \]
\[ F = \frac{SS_{Between}/(k-1)}{SS_{Within}/(N-k)} = \frac{4612.933/5}{11586/54} = \frac{922.6}{214.56} = 4.3 \]
$F_{0.05} = 2.386$ with $\nu_1 = 5$ and $\nu_2 = 54$
Thus since F > 2.386 we reject $H_0$
Anova Table
Source    d.f.   Sum of Squares   Mean Square   F-ratio
Between   5       4612.933        922.587       4.3** (p = 0.0023)
Within    54     11586.000        214.556
Total     59     16198.933
* - Significant at 0.05 (not 0.01)
** - Significant at 0.01
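The arithmetic in this example can be checked with a short script. The following is a minimal sketch in plain Python, not part of the original notes: the function name `one_way_anova` and the variable names are illustrative assumptions, and the data are the weight gains from the table above.

```python
# Weight gains (grams) for the six diets, copied from the example table.
diets = {
    1: [73, 102, 118, 104, 81, 107, 100, 87, 117, 111],
    2: [98, 74, 56, 111, 95, 88, 82, 77, 86, 92],
    3: [94, 79, 96, 98, 102, 102, 108, 91, 120, 105],
    4: [90, 76, 90, 64, 86, 51, 72, 90, 95, 78],
    5: [107, 95, 97, 80, 98, 74, 74, 67, 89, 58],
    6: [49, 82, 73, 86, 81, 97, 106, 70, 61, 82],
}


def one_way_anova(samples):
    """Return (SS_between, SS_within, F) using the computing formulae."""
    k = len(samples)
    totals = [sum(s) for s in samples]                # T_i
    sizes = [len(s) for s in samples]                 # n_i
    grand_total = sum(totals)                         # G
    n_total = sum(sizes)                              # N
    sum_sq = sum(x * x for s in samples for x in s)   # sum of x_ij^2
    sum_t2_n = sum(t * t / n for t, n in zip(totals, sizes))  # sum of T_i^2 / n_i
    ss_between = sum_t2_n - grand_total ** 2 / n_total
    ss_within = sum_sq - sum_t2_n
    f_ratio = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    return ss_between, ss_within, f_ratio


ss_between, ss_within, f_ratio = one_way_anova(list(diets.values()))
print(round(ss_between, 3), round(ss_within, 3), round(f_ratio, 2))
```

The printed values should agree with the Sum of Squares and F-ratio entries of the Anova Table above, up to rounding.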
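Equivalence of the F-test and the t-test when k = 2

the t-test
\[ t = \frac{\bar{x} - \bar{y}}{s_{Pooled}\sqrt{\frac{1}{n} + \frac{1}{m}}}, \qquad s_{Pooled} = \sqrt{\frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2}} \]

the F-test
\[ F = \frac{s^2_{Between}}{s^2_{Pooled}} = \frac{\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2 / (k-1)}{\sum_{i=1}^{k} (n_i - 1)s_i^2 / \left(\sum_{i=1}^{k} n_i - k\right)} = \frac{n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2}{\left[(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2\right]/(n_1 + n_2 - 2)} \]

denominator $= s^2_{Pooled}$
numerator $= n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2$

\[ n_1(\bar{x}_1 - \bar{x})^2 = n_1\left(\bar{x}_1 - \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2}\right)^2 = \frac{n_1 n_2^2}{(n_1 + n_2)^2}(\bar{x}_1 - \bar{x}_2)^2 \]
\[ n_2(\bar{x}_2 - \bar{x})^2 = n_2\left(\bar{x}_2 - \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2}\right)^2 = \frac{n_1^2 n_2}{(n_1 + n_2)^2}(\bar{x}_1 - \bar{x}_2)^2 \]
\[ n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2 = \frac{n_1 n_2^2 + n_1^2 n_2}{(n_1 + n_2)^2}(\bar{x}_1 - \bar{x}_2)^2 = \frac{n_1 n_2}{n_1 + n_2}(\bar{x}_1 - \bar{x}_2)^2 = \frac{(\bar{x}_1 - \bar{x}_2)^2}{\frac{1}{n_1} + \frac{1}{n_2}} \]

Hence
\[ F = \frac{(\bar{x}_1 - \bar{x}_2)^2}{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)s^2_{Pooled}} = t^2 \]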
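The identity $F = t^2$ for k = 2 can also be checked numerically. The sketch below is an illustration only: it borrows the Diet 1 and Diet 2 samples from the earlier example as the two groups, and the helper names `pooled_t` and `f_two_samples` are assumptions, not names used in the notes.

```python
from math import sqrt
from statistics import mean, variance

# Two samples borrowed from the diet example (Diet 1 and Diet 2).
x = [73, 102, 118, 104, 81, 107, 100, 87, 117, 111]
y = [98, 74, 56, 111, 95, 88, 82, 77, 86, 92]


def pooled_variance(x, y):
    """Pooled sample variance of two groups (statistics.variance divides by n - 1)."""
    n, m = len(x), len(y)
    return ((n - 1) * variance(x) + (m - 1) * variance(y)) / (n + m - 2)


def pooled_t(x, y):
    """Two-sample t statistic with pooled standard deviation."""
    n, m = len(x), len(y)
    return (mean(x) - mean(y)) / (sqrt(pooled_variance(x, y)) * sqrt(1 / n + 1 / m))


def f_two_samples(x, y):
    """One-way ANOVA F statistic for k = 2 groups (so k - 1 = 1)."""
    n1, n2 = len(x), len(y)
    grand_mean = (sum(x) + sum(y)) / (n1 + n2)
    s2_between = n1 * (mean(x) - grand_mean) ** 2 + n2 * (mean(y) - grand_mean) ** 2
    return s2_between / pooled_variance(x, y)


t_stat = pooled_t(x, y)
f_stat = f_two_samples(x, y)
print(t_stat ** 2, f_stat)  # the two numbers should agree up to floating-point error
```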
Factorial Experiments
Analysis of Variance
• Dependent variable Y
• k Categorical independent variables A, B, C,
… (the Factors)
• Let
– a = the number of categories of A
– b = the number of categories of B
– c = the number of categories of C
– etc.
The Completely Randomized Design
• We form the set of all treatment combinations
– the set of all combinations of the k factors
• Total number of treatment combinations
– t = abc…
• In the completely randomized design, n experimental units (test animals, test plots, etc.) are randomly assigned to each treatment combination.
– Total number of experimental units N = nt = nabc…
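The randomization described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the factor levels are taken from the diet example, the units are simply numbered 0 to N - 1, and the seed and variable names are arbitrary choices, not part of the notes.

```python
import itertools
import random

# Assumed factors and levels (from the diet example): A = level, B = source of protein.
levels = {
    "A": ["High", "Low"],
    "B": ["Beef", "Cereal", "Pork"],
}
n = 10  # experimental units per treatment combination

# All treatment combinations: t = a * b * ...
treatments = list(itertools.product(*levels.values()))

# Number the N = n * t experimental units and shuffle them.
units = list(range(n * len(treatments)))
rng = random.Random(0)  # fixed seed only so the sketch is reproducible
rng.shuffle(units)

# Give the first n shuffled units to the first combination, the next n to the second, etc.
assignment = {
    trt: sorted(units[i * n:(i + 1) * n]) for i, trt in enumerate(treatments)
}

for trt, assigned in assignment.items():
    print(trt, assigned)
```

Each treatment combination receives exactly n units and every unit is used once, which is what makes the design balanced.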
The treatment combinations can be thought of as arranged in a k-dimensional rectangular block.
[Figure: rectangular block of treatment combinations with axes A (levels 1, 2, …, a), B (levels 1, 2, …, b) and C]
• The Completely Randomized Design with the same number of observations, n, for every treatment combination is called balanced.
• If the number of observations per treatment combination is unequal the design is called unbalanced. (This results in a mathematically more complex analysis and more complex computations.)
• If for some of the treatment combinations there are no observations the design is called incomplete. (In this case it may happen that some of the parameters - main effects and interactions - cannot be estimated.)
Example
In this example we are examining the effect of the level of protein A (High or Low) and the source of protein B (Beef, Cereal, or Pork) on weight gains (grams) in rats.
We have n = 10 test animals randomly assigned to each of the k = 6 diets.
The k = 6 diets are the 6 = 3×2 Level-Source combinations
1. High - Beef
2. High - Cereal
3. High - Pork
4. Low - Beef
5. Low - Cereal
6. Low - Pork
Table
Gains in weight (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork)

Level of Protein    High Protein               Low Protein
Source of Protein   Beef    Cereal   Pork      Beef    Cereal   Pork
Diet                1       2        3         4       5        6
                     73      98       94        90     107       49
                    102      74       79        76      95       82
                    118      56       96        90      97       73
                    104     111       98        64      80       86
                     81      95      102        86      98       81
                    107      88      102        51      74       97
                    100      82      108        72      74      106
                     87      77       91        90      67       70
                    117      86      120        95      89       61
                    111      92      105        78      58       82
Mean                100.0    85.9     99.5      79.2    83.9     78.7
Std. Dev.            15.14   15.02    10.92     13.89   15.71    16.55
Treatment combinations
                    Source of Protein
Level of Protein    Beef     Cereal   Pork
High                Diet 1   Diet 2   Diet 3
Low                 Diet 4   Diet 5   Diet 6
Summary Table of Means
                    Source of Protein
Level of Protein    Beef     Cereal   Pork     Overall
High                100.00   85.90    99.50    95.13
Low                  79.20   83.90    78.70    80.60
Overall              89.60   84.90    89.10    87.87
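The "Overall" row and column of the summary table are simple averages of the cell means, which is valid here because every cell has the same number of observations. A minimal Python sketch, with illustrative variable names that are not part of the notes:

```python
# Cell means from the Summary Table of Means (level of protein x source of protein).
cell_means = {
    "High": {"Beef": 100.0, "Cereal": 85.9, "Pork": 99.5},
    "Low": {"Beef": 79.2, "Cereal": 83.9, "Pork": 78.7},
}
sources = ["Beef", "Cereal", "Pork"]

# Overall mean for each level of protein (average across sources).
level_means = {lvl: sum(row.values()) / len(row) for lvl, row in cell_means.items()}

# Overall mean for each source of protein (average across levels).
source_means = {
    src: sum(cell_means[lvl][src] for lvl in cell_means) / len(cell_means)
    for src in sources
}

# Grand mean (average of the level means, since the design is balanced).
overall_mean = sum(level_means.values()) / len(level_means)

print({k: round(v, 2) for k, v in level_means.items()})
print({k: round(v, 2) for k, v in source_means.items()})
print(round(overall_mean, 2))
```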
Profiles of the response relative to a factor
A graphical representation of the effect of a factor on a response variable (dependent variable)
[Figure: Profile Y for A. The response Y is plotted against the levels of A (1, 2, 3, …, a).]
• The response Y could be for an individual case or averaged over a group of cases.
• The profile could be for a specific level of another factor or averaged over the levels of another factor.
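[Figure: Profiles of Weight Gain for Source and Level of Protein. Weight Gain (axis from 70 to 110) plotted against source of protein (Beef, Cereal, Pork), with one profile each for High Protein, Low Protein and Overall.]
[Figure: Profiles of Weight Gain for Source and Level of Protein. Weight Gain (axis from 70 to 110) plotted against level of protein (High Protein, Low Protein), with one profile each for Beef, Cereal, Pork and Overall.]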
Example – Four factor experiment
Four factors are studied for their effect on Y (luster
of paint film). The four factors are:
1) Film Thickness (1 or 2 mils)
2) Drying conditions (Regular or Special)
3) Length of wash (20, 30, 40 or 60 minutes), and
4) Temperature of wash (92 ˚C or 100 ˚C)
Two observations of film luster (Y) are taken
for each treatment combination
The data are tabulated below:
             Regular Dry                   Special Dry
Minutes      92 ˚C         100 ˚C          92 ˚C         100 ˚C
1-mil Thickness
20           3.4   3.4     19.6  14.5      2.1   3.8     17.2  13.4
30           4.1   4.1     17.5  17.0      4.0   4.6     13.5  14.3
40           4.9   4.2     17.6  15.2      5.1   3.3     16.0  17.8
60           5.0   4.9     20.9  17.1      8.3   4.3     17.5  13.9
2-mil Thickness
20           5.5   3.7     26.6  29.5      4.5   4.5     25.6  22.5
30           5.7   6.1     31.6  30.2      5.9   5.9     29.2  29.8
40           5.5   5.6     30.5  30.2      5.5   5.8     32.6  27.4
60           7.2   6.0     31.4  29.6      8.0   9.9     33.5  29.5
Definition:
A factor is said to not affect the response if
the profile of the factor is horizontal for all
combinations of levels of the other factors:
No change in the response when you change
the levels of the factor (true for all
combinations of levels of the other factors)
Otherwise the factor is said to affect the
response:
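[Figure: Profile Y for A, where A affects the response. Y plotted against the levels of A (1, 2, 3, …, a), with one profile for each level of B.]
[Figure: Profile Y for A, where A has no effect on the response. Y plotted against the levels of A (1, 2, 3, …, a), with one profile for each level of B.]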
Definition:
• Two (or more) factors are said to interact if
changes in the response when you change
the level of one factor depend on the
level(s) of the other factor(s).
• Profiles of the factor for different levels of
the other factor(s) are not parallel
• Otherwise the factors are said to be additive.
• Profiles of the factor for different levels of
the other factor(s) are parallel.
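[Figure: Interacting factors A and B. Y plotted against the levels of A (1, 2, 3, …, a), with one profile for each level of B.]
[Figure: Additive factors A and B. Y plotted against the levels of A (1, 2, 3, …, a), with one profile for each level of B.]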
• If two (or more) factors interact, each factor affects the response.
• If two (or more) factors are additive, it still remains to be determined whether the factors affect the response.
• In factorial experiments we are interested in determining
– which factors affect the response and
– which groups of factors interact.
The testing in factorial experiments
1. Test first the higher order interactions.
2. If an interaction is present there is no need to test lower order interactions or main effects involving those factors. All factors in the interaction affect the response and they interact.
3. The testing continues with lower order interactions and main effects for factors which have not yet been determined to affect the response.
Models for Factorial Experiments
The Single Factor Experiment
Situation
• We have t = a treatment combinations
• Let $\mu_i$ and $\sigma_i$ denote the mean and standard deviation of observations from treatment i.
• i = 1, 2, 3, …, a.
• Note: we assume that the standard deviation for each population is the same.
\[ \sigma_1 = \sigma_2 = \dots = \sigma_a = \sigma \]
The data
• Assume we have collected data for each of
the a treatments
• Let $y_{i1}, y_{i2}, y_{i3}, \dots, y_{in}$ denote the n observations for treatment i.
• i = 1, 2, 3, …, a.
The model
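Note:
\[ y_{ij} = \mu_i + (y_{ij} - \mu_i) = \mu_i + \varepsilon_{ij} \]
\[ \phantom{y_{ij}} = \mu + (\mu_i - \mu) + \varepsilon_{ij} = \mu + \alpha_i + \varepsilon_{ij} \]
where
$\varepsilon_{ij} = y_{ij} - \mu_i$ has N(0, $\sigma^2$) distribution,
$\mu = \frac{1}{a}\sum_{i=1}^{a} \mu_i$ (overall mean effect),
$\alpha_i = \mu_i - \mu$ (Effect of Factor A).
Note: $\sum_{i=1}^{a} \alpha_i = 0$ by their definition.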
Model 1:
$y_{ij}$ (i = 1, …, a; j = 1, …, n) are independent Normal with mean $\mu_i$ and variance $\sigma^2$.
Model 2:
\[ y_{ij} = \mu_i + \varepsilon_{ij} \]
where $\varepsilon_{ij}$ (i = 1, …, a; j = 1, …, n) are independent Normal with mean 0 and variance $\sigma^2$.
Model 3:
\[ y_{ij} = \mu + \alpha_i + \varepsilon_{ij} \]
where $\varepsilon_{ij}$ (i = 1, …, a; j = 1, …, n) are independent Normal with mean 0 and variance $\sigma^2$ and
\[ \sum_{i=1}^{a} \alpha_i = 0 \]
The Two Factor Experiment
Situation
• We have t = ab treatment combinations
• Let $\mu_{ij}$ and $\sigma$ denote the mean and standard deviation of observations from the treatment combination when A = i and B = j.
• i = 1, 2, 3, …, a; j = 1, 2, 3, …, b.
The data
• Assume we have collected data (n observations)
for each of the t = ab treatment combinations.
• Let $y_{ij1}, y_{ij2}, y_{ij3}, \dots, y_{ijn}$ denote the n observations for the treatment combination A = i, B = j.
• i = 1, 2, 3, …, a; j = 1, 2, 3, …, b.
The model
Note:
\[ y_{ijk} = \mu_{ij} + (y_{ijk} - \mu_{ij}) = \mu_{ij} + \varepsilon_{ijk} \]
\[ \phantom{y_{ijk}} = \mu + (\mu_{i\cdot} - \mu) + (\mu_{\cdot j} - \mu) + (\mu_{ij} - \mu_{i\cdot} - \mu_{\cdot j} + \mu) + \varepsilon_{ijk} \]
\[ \phantom{y_{ijk}} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk} \]
where
$\varepsilon_{ijk} = y_{ijk} - \mu_{ij}$ has N(0, $\sigma^2$) distribution,
\[ \mu = \frac{1}{ab}\sum_{i=1}^{a}\sum_{j=1}^{b} \mu_{ij}, \quad \mu_{i\cdot} = \frac{1}{b}\sum_{j=1}^{b} \mu_{ij} \quad \text{and} \quad \mu_{\cdot j} = \frac{1}{a}\sum_{i=1}^{a} \mu_{ij} \]
\[ \alpha_i = \mu_{i\cdot} - \mu, \quad \beta_j = \mu_{\cdot j} - \mu, \quad \text{and} \quad (\alpha\beta)_{ij} = \mu_{ij} - \mu_{i\cdot} - \mu_{\cdot j} + \mu \]
Note: $\sum_{i=1}^{a} \alpha_i = 0$, $\sum_{j=1}^{b} \beta_j = 0$ and $\sum_{i=1}^{a} (\alpha\beta)_{ij} = \sum_{j=1}^{b} (\alpha\beta)_{ij} = 0$ by their definition.
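Model:
\[ y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk} \]
Here $\mu$ is the Mean Effect, $\alpha_i$ and $\beta_j$ are the Main effects, $(\alpha\beta)_{ij}$ is the Interaction and $\varepsilon_{ijk}$ is the Error,
where $\varepsilon_{ijk}$ (i = 1, …, a; j = 1, …, b; k = 1, …, n) are independent Normal with mean 0 and variance $\sigma^2$ and
\[ \sum_{i=1}^{a} \alpha_i = 0, \quad \sum_{j=1}^{b} \beta_j = 0 \quad \text{and} \quad \sum_{i=1}^{a} (\alpha\beta)_{ij} = \sum_{j=1}^{b} (\alpha\beta)_{ij} = 0 \]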
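Maximum Likelihood Estimates
\[ y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk} \]
where $\varepsilon_{ijk}$ (i = 1, …, a; j = 1, …, b; k = 1, …, n) are independent Normal with mean 0 and variance $\sigma^2$.
\[ \hat{\mu} = \bar{y}_{\cdot\cdot\cdot} = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} y_{ijk} / (abn) \]
\[ \hat{\alpha}_i = \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot} = \sum_{j=1}^{b}\sum_{k=1}^{n} y_{ijk} / (bn) - \bar{y}_{\cdot\cdot\cdot} \]
\[ \hat{\beta}_j = \bar{y}_{\cdot j\cdot} - \bar{y}_{\cdot\cdot\cdot} = \sum_{i=1}^{a}\sum_{k=1}^{n} y_{ijk} / (an) - \bar{y}_{\cdot\cdot\cdot} \]
\[ \widehat{(\alpha\beta)}_{ij} = \bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot\cdot} = \sum_{k=1}^{n} y_{ijk} / n - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot\cdot} \]
\[ \hat{\sigma}^2 = \frac{1}{nab}\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} \left(y_{ijk} - \bar{y}_{ij\cdot}\right)^2 = \frac{1}{nab}\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} \left(y_{ijk} - \left[\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j + \widehat{(\alpha\beta)}_{ij}\right]\right)^2 \]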
This is not an unbiased estimator of $\sigma^2$ (usually the case when estimating variance).
The unbiased estimator results when we divide by ab(n - 1) instead of abn.
The unbiased estimator of $\sigma^2$ is
\[ s^2 = \frac{1}{ab(n-1)}\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} \left(y_{ijk} - \bar{y}_{ij\cdot}\right)^2 = \frac{1}{ab(n-1)}\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} \left(y_{ijk} - \left[\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j + \widehat{(\alpha\beta)}_{ij}\right]\right)^2 \]
\[ \phantom{s^2} = \frac{1}{ab(n-1)} SS_{Error} = MS_{Error} \]
where
\[ SS_{Error} = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} \left(y_{ijk} - \bar{y}_{ij\cdot}\right)^2 \]
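As an illustration, the estimates above can be computed for the diet data with A = level of protein (a = 2) and B = source of protein (b = 3). This is a minimal sketch in plain Python; the nested-list layout `y[i][j]` and all variable names are assumptions made for the example, not notation from the notes.

```python
from statistics import mean

# y[i][j] holds the n = 10 weight gains for level i of A and level j of B.
y = [
    [  # High protein
        [73, 102, 118, 104, 81, 107, 100, 87, 117, 111],  # Beef
        [98, 74, 56, 111, 95, 88, 82, 77, 86, 92],        # Cereal
        [94, 79, 96, 98, 102, 102, 108, 91, 120, 105],    # Pork
    ],
    [  # Low protein
        [90, 76, 90, 64, 86, 51, 72, 90, 95, 78],         # Beef
        [107, 95, 97, 80, 98, 74, 74, 67, 89, 58],        # Cereal
        [49, 82, 73, 86, 81, 97, 106, 70, 61, 82],        # Pork
    ],
]
a, b, n = len(y), len(y[0]), len(y[0][0])

# Cell, row and column means (simple averages are valid because the design is balanced).
cell = [[mean(y[i][j]) for j in range(b)] for i in range(a)]   # ybar_ij.
row = [mean(cell[i]) for i in range(a)]                        # ybar_i..
col = [mean([cell[i][j] for i in range(a)]) for j in range(b)] # ybar_.j.
mu_hat = mean(row)                                             # ybar_...

alpha_hat = [row[i] - mu_hat for i in range(a)]
beta_hat = [col[j] - mu_hat for j in range(b)]
ab_hat = [[cell[i][j] - row[i] - col[j] + mu_hat for j in range(b)] for i in range(a)]

# Residual sum of squares about the cell means.
ss_error = sum(
    (obs - cell[i][j]) ** 2 for i in range(a) for j in range(b) for obs in y[i][j]
)
sigma2_mle = ss_error / (a * b * n)          # maximum likelihood estimate (biased)
s2_unbiased = ss_error / (a * b * (n - 1))   # unbiased estimate = MS_Error

print(round(mu_hat, 3), [round(v, 3) for v in alpha_hat], [round(v, 3) for v in beta_hat])
print(round(sigma2_mle, 3), round(s2_unbiased, 3))
```

Because the cell means here are the six diet means, `ss_error` coincides with the within-diet sum of squares from the one-way analysis of the same data.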
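Testing for Interaction:
We want to test:
$H_0$: $(\alpha\beta)_{ij} = 0$ for all i and j, against
$H_A$: $(\alpha\beta)_{ij} \neq 0$ for at least one i and j.
The test statistic
\[ F = \frac{MS_{AB}}{MS_{Error}} = \frac{\frac{1}{(a-1)(b-1)} SS_{AB}}{MS_{Error}} \]
where
\[ SS_{AB} = n\sum_{i=1}^{a}\sum_{j=1}^{b} \widehat{(\alpha\beta)}_{ij}^{\,2} = n\sum_{i=1}^{a}\sum_{j=1}^{b} \left(\bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot\cdot}\right)^2 \]
We reject
$H_0$: $(\alpha\beta)_{ij} = 0$ for all i and j,
if
\[ F = \frac{MS_{AB}}{MS_{Error}} > F_{\alpha}\left((a-1)(b-1),\ ab(n-1)\right) \]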
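Testing for the Main Effect of A:
We want to test:
$H_0$: $\alpha_i = 0$ for all i, against
$H_A$: $\alpha_i \neq 0$ for at least one i.
The test statistic
\[ F = \frac{MS_A}{MS_{Error}} = \frac{\frac{1}{a-1} SS_A}{MS_{Error}} \]
where
\[ SS_A = nb\sum_{i=1}^{a} \hat{\alpha}_i^2 = nb\sum_{i=1}^{a} \left(\bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot}\right)^2 \]
We reject
$H_0$: $\alpha_i = 0$ for all i,
if
\[ F = \frac{MS_A}{MS_{Error}} > F_{\alpha}\left(a-1,\ ab(n-1)\right) \]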
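Testing for the Main Effect of B:
We want to test:
$H_0$: $\beta_j = 0$ for all j, against
$H_A$: $\beta_j \neq 0$ for at least one j.
The test statistic
\[ F = \frac{MS_B}{MS_{Error}} = \frac{\frac{1}{b-1} SS_B}{MS_{Error}} \]
where
\[ SS_B = na\sum_{j=1}^{b} \hat{\beta}_j^2 = na\sum_{j=1}^{b} \left(\bar{y}_{\cdot j\cdot} - \bar{y}_{\cdot\cdot\cdot}\right)^2 \]
We reject
$H_0$: $\beta_j = 0$ for all j,
if
\[ F = \frac{MS_B}{MS_{Error}} > F_{\alpha}\left(b-1,\ ab(n-1)\right) \]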
The ANOVA Table
Source   S.S.      d.f.             MS = SS/df   F
A        SSA       a - 1            MSA          MSA / MSError
B        SSB       b - 1            MSB          MSB / MSError
AB       SSAB      (a - 1)(b - 1)   MSAB         MSAB / MSError
Error    SSError   ab(n - 1)        MSError
Total    SSTotal   abn - 1
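Computing Formulae
Let
\[ T_{\cdot\cdot\cdot} = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} y_{ijk} \]
\[ T_{i\cdot\cdot} = \sum_{j=1}^{b}\sum_{k=1}^{n} y_{ijk}, \quad T_{\cdot j\cdot} = \sum_{i=1}^{a}\sum_{k=1}^{n} y_{ijk}, \quad T_{ij\cdot} = \sum_{k=1}^{n} y_{ijk} \]
Then
\[ SS_{Total} = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} y_{ijk}^2 - \frac{T_{\cdot\cdot\cdot}^2}{nab} \]
\[ SS_A = \sum_{i=1}^{a} \frac{T_{i\cdot\cdot}^2}{nb} - \frac{T_{\cdot\cdot\cdot}^2}{nab}, \quad SS_B = \sum_{j=1}^{b} \frac{T_{\cdot j\cdot}^2}{na} - \frac{T_{\cdot\cdot\cdot}^2}{nab} \]
\[ SS_{AB} = \sum_{i=1}^{a}\sum_{j=1}^{b} \frac{T_{ij\cdot}^2}{n} - \sum_{i=1}^{a} \frac{T_{i\cdot\cdot}^2}{nb} - \sum_{j=1}^{b} \frac{T_{\cdot j\cdot}^2}{na} + \frac{T_{\cdot\cdot\cdot}^2}{nab}, \]
and
\[ SS_{Error} = SS_{Total} - SS_A - SS_B - SS_{AB} \]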
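The computing formulae can be applied to the diet data, treating level of protein as factor A (a = 2) and source of protein as factor B (b = 3), with n = 10 observations per cell. The following is a sketch in plain Python under those assumptions; the data layout and variable names are illustrative, and no critical values are computed.

```python
# y[i][j] holds the n = 10 weight gains for level i of A and level j of B.
y = [
    [  # High protein
        [73, 102, 118, 104, 81, 107, 100, 87, 117, 111],  # Beef
        [98, 74, 56, 111, 95, 88, 82, 77, 86, 92],        # Cereal
        [94, 79, 96, 98, 102, 102, 108, 91, 120, 105],    # Pork
    ],
    [  # Low protein
        [90, 76, 90, 64, 86, 51, 72, 90, 95, 78],         # Beef
        [107, 95, 97, 80, 98, 74, 74, 67, 89, 58],        # Cereal
        [49, 82, 73, 86, 81, 97, 106, 70, 61, 82],        # Pork
    ],
]
a, b, n = len(y), len(y[0]), len(y[0][0])

# Totals: T_ij. (cell), T_i.. (level of A), T_.j. (level of B), T_... (grand).
t_ij = [[sum(y[i][j]) for j in range(b)] for i in range(a)]
t_i = [sum(t_ij[i]) for i in range(a)]
t_j = [sum(t_ij[i][j] for i in range(a)) for j in range(b)]
t_all = sum(t_i)

correction = t_all ** 2 / (n * a * b)                     # T_...^2 / (nab)
sum_y2 = sum(obs ** 2 for rows in y for cell in rows for obs in cell)
sum_a = sum(t ** 2 for t in t_i) / (n * b)
sum_b = sum(t ** 2 for t in t_j) / (n * a)
sum_ab = sum(t ** 2 for r in t_ij for t in r) / n

ss_total = sum_y2 - correction
ss_a = sum_a - correction
ss_b = sum_b - correction
ss_ab = sum_ab - sum_a - sum_b + correction
ss_error = ss_total - ss_a - ss_b - ss_ab

# Mean squares and F-ratios, each tested against MS_Error.
ms_error = ss_error / (a * b * (n - 1))
f_a = (ss_a / (a - 1)) / ms_error
f_b = (ss_b / (b - 1)) / ms_error
f_ab = (ss_ab / ((a - 1) * (b - 1))) / ms_error

for name, ss, f in [("A", ss_a, f_a), ("B", ss_b, f_b), ("AB", ss_ab, f_ab)]:
    print(name, round(ss, 3), round(f, 3))
print("Error", round(ss_error, 3), "Total", round(ss_total, 3))
```

As a consistency check, `ss_a + ss_b + ss_ab` should equal the Between sum of squares from the one-way analysis of the six diets, and `ss_error` should equal the Within sum of squares.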