Testing and Estimating Variances


Comparing k Populations
Means – One-way Analysis of Variance (ANOVA)
Example
In this example we look at the weight gains (grams) of rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork).
– Ten test animals for each diet
Diets
1. High protein, Beef
2. High protein, Cereal
3. High protein, Pork
4. Low protein, Beef
5. Low protein, Cereal
6. Low protein, Pork
Table
Gains in weight (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork)

Level      High protein               Low protein
Source     Beef     Cereal   Pork     Beef     Cereal   Pork
Diet       1        2        3        4        5        6
           73       98       94       90       107      49
           102      74       79       76       95       82
           118      56       96       90       97       73
           104      111      98       64       80       86
           81       95       102      86       98       81
           107      88       102      51       74       97
           100      82       108      72       74       106
           87       77       91       90       67       70
           117      86       120      95       89       61
           111      92       105      78       58       82
Median     103.0    87.0     100.0    82.0     84.5     81.5
Mean       100.0    85.9     99.5     79.2     83.9     78.7
IQR        24.0     18.0     11.0     18.0     23.0     16.0
PSD        17.78    13.33    8.15     13.33    17.04    11.05
Variance   229.11   225.66   119.17   192.84   246.77   273.79
Std. Dev.  15.14    15.02    10.92    13.89    15.71    16.55
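The summary rows of the table can be checked with a short Python sketch. It uses only the standard `statistics` module; the names `gains` and `summarize` are illustrative. The IQR and PSD rows are left out because they depend on a quartile convention that the slides do not state.

```python
import statistics

# Weight gains (grams) copied from the table, keyed by diet number.
gains = {
    1: [73, 102, 118, 104, 81, 107, 100, 87, 117, 111],
    2: [98, 74, 56, 111, 95, 88, 82, 77, 86, 92],
    3: [94, 79, 96, 98, 102, 102, 108, 91, 120, 105],
    4: [90, 76, 90, 64, 86, 51, 72, 90, 95, 78],
    5: [107, 95, 97, 80, 98, 74, 74, 67, 89, 58],
    6: [49, 82, 73, 86, 81, 97, 106, 70, 61, 82],
}


def summarize(values):
    """Return the median, mean, sample variance and sample std. dev."""
    return {
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "variance": statistics.variance(values),  # divides by n - 1
        "std_dev": statistics.stdev(values),
    }


summary = {diet: summarize(values) for diet, values in gains.items()}
for diet, stats in summary.items():
    print(diet, {name: round(value, 2) for name, value in stats.items()})
```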
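Box Plots: Weight Gains for Six Diets
[Figure: box plots of Weight Gain (vertical axis, 40 to 130 grams) by Diet (horizontal axis, 1 to 6), grouped into High Protein (Beef, Cereal, Pork) and Low Protein (Beef, Cereal, Pork). Each box shows the median and the 25% and 75% quartiles; whiskers show the non-outlier minimum and maximum.]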
Exploratory Conclusions
• Weight gain is higher for the high protein meat diets (Beef and Pork)
• Increasing the level of protein increases weight gain, but only if the source of protein is a meat source
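As a rough numerical check of these exploratory conclusions, the sketch below compares the high and low protein means for each source. The means are taken from the summary table; the variable names are illustrative.

```python
# Diet means (grams) taken from the summary table.
means = {
    ("High", "Beef"): 100.0, ("High", "Cereal"): 85.9, ("High", "Pork"): 99.5,
    ("Low", "Beef"): 79.2, ("Low", "Cereal"): 83.9, ("Low", "Pork"): 78.7,
}

# Difference in mean weight gain (high protein minus low protein) per source.
effect_of_level = {
    source: means[("High", source)] - means[("Low", source)]
    for source in ("Beef", "Cereal", "Pork")
}

for source, diff in effect_of_level.items():
    print(f"{source}: high minus low = {diff:.1f} g")
```

The differences for the two meat sources are about ten times the difference for Cereal, which is what the box plots suggest. This is descriptive only; it is not a formal test.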
The F test – for comparing k means
Situation
• We have k normal populations
• Let $\mu_i$ and $\sigma_i$ denote the mean and standard deviation of population i, for i = 1, 2, 3, …, k.
• Note: we assume that the standard deviation for each population is the same:
$\sigma_1 = \sigma_2 = \dots = \sigma_k = \sigma$
We want to test
$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$
against
$H_A: \mu_i \neq \mu_j$ for at least one pair i, j
The data
• Assume we have collected data from each of the k populations
• Let $x_{i1}, x_{i2}, x_{i3}, \dots$ denote the $n_i$ observations from population i, for i = 1, 2, 3, …, k.
Let
$$\bar{x}_i = \frac{\sum_{j=1}^{n_i} x_{ij}}{n_i}, \qquad s_i = \sqrt{\frac{\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2}{n_i - 1}}$$
The pooled estimate of standard deviation and variance:
$$s_{Pooled} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \dots + (n_k - 1)s_k^2}{(n_1 - 1) + (n_2 - 1) + \dots + (n_k - 1)}}$$
$$s_{Pooled}^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \dots + (n_k - 1)s_k^2}{n_1 + n_2 + \dots + n_k - k} = \frac{\sum_{i=1}^{k} (n_i - 1)s_i^2}{\sum_{i=1}^{k} n_i - k}$$
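A minimal sketch of the pooled estimate, applied to the six diet variances from the table (rounded to two decimals there, so the result is approximate). The function name `pooled_variance` is illustrative.

```python
import math


def pooled_variance(sizes, variances):
    """Weighted average of sample variances with weights n_i - 1."""
    numerator = sum((n - 1) * v for n, v in zip(sizes, variances))
    return numerator / (sum(sizes) - len(sizes))


sizes = [10] * 6  # ten rats per diet
variances = [229.11, 225.66, 119.17, 192.84, 246.77, 273.79]

s2_pooled = pooled_variance(sizes, variances)
s_pooled = math.sqrt(s2_pooled)
print(round(s2_pooled, 2), round(s_pooled, 2))
```

Because all six samples have the same size, the pooled variance here reduces to the plain average of the six variances.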
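Consider the statistic comparing the sample means
$$s_{Between}^2 = \frac{n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2 + \dots + n_k(\bar{x}_k - \bar{x})^2}{k - 1}$$
where
$$\bar{x} = \frac{n_1\bar{x}_1 + n_2\bar{x}_2 + \dots + n_k\bar{x}_k}{n_1 + n_2 + \dots + n_k}$$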
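The between-samples statistic can be sketched in the same way from the six diet means in the table. The function name `between_variance` is illustrative.

```python
def between_variance(sizes, means):
    """Weighted squared deviations of the sample means from the
    overall mean, divided by k - 1."""
    k = len(sizes)
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / sum(sizes)
    ss = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    return ss / (k - 1)


sizes = [10] * 6
means = [100.0, 85.9, 99.5, 79.2, 83.9, 78.7]

s2_between = between_variance(sizes, means)
print(round(s2_between, 3))
```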
To test
$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$
against
$H_A: \mu_i \neq \mu_j$ for at least one pair i, j
use the test statistic
$$F = \frac{s_{Between}^2}{s_{Pooled}^2} = \frac{\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2 \big/ (k - 1)}{\sum_{i=1}^{k} (n_i - 1)s_i^2 \big/ \left(\sum_{i=1}^{k} n_i - k\right)}$$
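Computing Formulae
$$\bar{x}_i = \frac{\sum_{j=1}^{n_i} x_{ij}}{n_i} = \frac{T_i}{n_i}, \qquad T_i = \sum_{j=1}^{n_i} x_{ij} = \text{Total for sample } i$$
$$\bar{x} = \frac{n_1\bar{x}_1 + n_2\bar{x}_2 + \dots + n_k\bar{x}_k}{n_1 + n_2 + \dots + n_k} = \frac{T_1 + T_2 + \dots + T_k}{n_1 + n_2 + \dots + n_k} = \frac{G}{N}$$
$$G = \sum_{i=1}^{k} T_i = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij} = \text{Grand Total}$$
$$N = \sum_{i=1}^{k} n_i = \text{Total sample size}$$
$$SS_{Between} = \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2 = \sum_{i=1}^{k} \frac{T_i^2}{n_i} - \frac{G^2}{N}$$
Also
$$SS_{Within} = \sum_{i=1}^{k} (n_i - 1)s_i^2 = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{k} \frac{T_i^2}{n_i}$$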
To Compute F:
Compute
1) $T_i = \sum_{j=1}^{n_i} x_{ij}$ = Total for sample i
2) $G = \sum_{i=1}^{k} T_i = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}$ = Grand Total
3) $N = \sum_{i=1}^{k} n_i$ = Total sample size
4) $\sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2$
5) $\sum_{i=1}^{k} \frac{T_i^2}{n_i}$
Then
1) $SS_{Between} = \sum_{i=1}^{k} \frac{T_i^2}{n_i} - \frac{G^2}{N}$
2) $SS_{Within} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{k} \frac{T_i^2}{n_i}$
3) $F = \frac{SS_{Between} / (k - 1)}{SS_{Within} / (N - k)}$
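The computing steps above translate almost line for line into code. The sketch below follows the five quantities and three results in order. The function name `one_way_anova` and the three small samples are made up for illustration and are not data from these slides.

```python
def one_way_anova(samples):
    """Return (SS_Between, SS_Within, F) using the computing formulae."""
    k = len(samples)
    totals = [sum(s) for s in samples]                     # 1) T_i
    grand_total = sum(totals)                              # 2) G
    n_total = sum(len(s) for s in samples)                 # 3) N
    sum_sq = sum(x * x for s in samples for x in s)        # 4) sum of x_ij^2
    sum_t2_over_n = sum(t * t / len(s)                     # 5) sum of T_i^2 / n_i
                        for t, s in zip(totals, samples))

    ss_between = sum_t2_over_n - grand_total ** 2 / n_total
    ss_within = sum_sq - sum_t2_over_n
    f_ratio = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    return ss_between, ss_within, f_ratio


# Hypothetical toy data: three samples of three observations each.
toy = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
ss_between, ss_within, f_ratio = one_way_anova(toy)
print(ss_between, ss_within, f_ratio)
```

For the toy data the totals are 6, 9 and 18, so G = 33 and N = 9, which makes the result easy to check by hand.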
We reject
$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$
if
$F \geq F_\alpha$
where $F_\alpha$ is the critical point under the F distribution with $\nu_1 = k - 1$ degrees of freedom in the numerator and $\nu_2 = N - k$ degrees of freedom in the denominator.
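Critical points are normally read from F tables or a statistics package. Purely as an illustration, the sketch below approximates one using only the standard library: it integrates the F density with Simpson's rule and then bisects for the point with upper-tail area alpha. The function names are illustrative, and the sketch assumes more than 2 numerator degrees of freedom, so that the density is zero at the origin.

```python
import math


def f_pdf(x, d1, d2):
    """Density of the F distribution (assumes d1 > 2, so f(0) = 0)."""
    if x <= 0:
        return 0.0
    log_const = (math.lgamma((d1 + d2) / 2) - math.lgamma(d1 / 2)
                 - math.lgamma(d2 / 2) + (d1 / 2) * math.log(d1 / d2))
    return math.exp(log_const + (d1 / 2 - 1) * math.log(x)
                    - ((d1 + d2) / 2) * math.log(1 + d1 * x / d2))


def f_cdf(x, d1, d2, steps=2000):
    """P(F <= x) by composite Simpson's rule on [0, x]; steps must be even."""
    h = x / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return total * h / 3


def f_critical(alpha, d1, d2):
    """Point with upper-tail area alpha, found by bisection on [0, 100]."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if f_cdf(mid, d1, d2) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Degrees of freedom for six samples of ten observations: 5 and 54.
critical = f_critical(0.05, 5, 54)
p_value = 1 - f_cdf(4.3, 5, 54)
print(round(critical, 3), round(p_value, 4))
```

In practice a library routine for the F distribution would be the more robust choice; the numerical integration here is only meant to make the definition of the critical point concrete.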
Example
In the following example we are comparing weight
gains resulting from the following six diets
1. Diet 1 - High protein, Beef
2. Diet 2 - High protein, Cereal
3. Diet 3 - High protein, Pork
4. Diet 4 - Low protein, Beef
5. Diet 5 - Low protein, Cereal
6. Diet 6 - Low protein, Pork
Gains in weight (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork)

Diet       1        2        3        4        5        6
           73       98       94       90       107      49
           102      74       79       76       95       82
           118      56       96       90       97       73
           104      111      98       64       80       86
           81       95       102      86       98       81
           107      88       102      51       74       97
           100      82       108      72       74       106
           87       77       91       90       67       70
           117      86       120      95       89       61
           111      92       105      78       58       82
Mean       100.0    85.9     99.5     79.2     83.9     78.7
Std. Dev.  15.14    15.02    10.92    13.89    15.71    16.55
Σx         1000     859      995      792      839      787
Σx²        102062   75819    100075   64462    72613    64401
Hence

i      1      2     3     4     5     6     Total (G)
T_i    1000   859   995   792   839   787   5272

$N = \sum_{i=1}^{k} n_i$ = Total sample size = 60
$\sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 = 479432$
$\sum_{i=1}^{k} \frac{T_i^2}{n_i} = 467846$
Thus
$$SS_{Between} = \sum_{i=1}^{k} \frac{T_i^2}{n_i} - \frac{G^2}{N} = 467846 - \frac{5272^2}{60} = 4612.933$$
$$SS_{Within} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{k} \frac{T_i^2}{n_i} = 479432 - 467846 = 11586$$
$$F = \frac{SS_{Between} / (k - 1)}{SS_{Within} / (N - k)} = \frac{4612.933 / 5}{11586 / 54} = \frac{922.6}{214.56} = 4.3$$
$F_{0.05} = 2.386$ with $\nu_1 = 5$ and $\nu_2 = 54$
Thus since F > 2.386 we reject $H_0$
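The arithmetic of this example can be reproduced from the column totals alone. The sketch below uses the Σx and Σx² values from the table and the critical value 2.386 quoted above; the variable names are illustrative.

```python
# Column totals from the table: T_i (sum of x) and sum of x^2 for each diet.
totals = [1000, 859, 995, 792, 839, 787]
sums_of_squares = [102062, 75819, 100075, 64462, 72613, 64401]
n_per_diet = 10
k = len(totals)

grand_total = sum(totals)                              # G
n_total = n_per_diet * k                               # N
sum_sq = sum(sums_of_squares)                          # sum of all x_ij^2
sum_t2_over_n = sum(t * t / n_per_diet for t in totals)

ss_between = sum_t2_over_n - grand_total ** 2 / n_total
ss_within = sum_sq - sum_t2_over_n
f_ratio = (ss_between / (k - 1)) / (ss_within / (n_total - k))

f_critical_05 = 2.386  # critical value quoted in the slides (5 and 54 d.f.)
reject_h0 = f_ratio > f_critical_05
print(round(ss_between, 3), round(ss_within, 3), round(f_ratio, 3), reject_h0)
```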
The ANOVA Table
A convenient method for displaying the calculations for the F-test
Anova Table

Source    d.f.    Sum of Squares   Mean Square   F-ratio
Between   k - 1   SS_Between       MS_Between    MS_B / MS_W
Within    N - k   SS_Within        MS_Within
Total     N - 1   SS_Total
Diet Example
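A sketch of how the ANOVA table for the diet example could be assembled from the sums of squares computed earlier (4612.933 and 11586, with k = 6 and N = 60). The function name `anova_table` and the print layout are illustrative.

```python
def anova_table(ss_between, ss_within, k, n_total):
    """Rows of (source, d.f., sum of squares, mean square, F-ratio)."""
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return [
        ("Between", k - 1, ss_between, ms_between, ms_between / ms_within),
        ("Within", n_total - k, ss_within, ms_within, None),
        ("Total", n_total - 1, ss_between + ss_within, None, None),
    ]


rows = anova_table(4612.933, 11586.0, k=6, n_total=60)

print(f"{'Source':<8} {'d.f.':>4} {'Sum of Squares':>15} "
      f"{'Mean Square':>12} {'F-ratio':>8}")
for source, df, ss, ms, f in rows:
    ms_text = f"{ms:12.3f}" if ms is not None else " " * 12
    f_text = f"{f:8.3f}" if f is not None else " " * 8
    print(f"{source:<8} {df:>4} {ss:15.3f} {ms_text} {f_text}")
```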
Equivalence of the F-test and the t-test when k = 2
the t-test
$$t = \frac{\bar{x} - \bar{y}}{s_{Pooled}\sqrt{\frac{1}{n} + \frac{1}{m}}}$$
$$s_{Pooled} = \sqrt{\frac{(n - 1)s_x^2 + (m - 1)s_y^2}{n + m - 2}}$$
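the F-test
$$F = \frac{s_{Between}^2}{s_{Pooled}^2} = \frac{\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2 \big/ (k - 1)}{\sum_{i=1}^{k} (n_i - 1)s_i^2 \big/ \left(\sum_{i=1}^{k} n_i - k\right)} = \frac{n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2}{\left[(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2\right] \big/ (n_1 + n_2 - 2)}$$
denominator $= s_{Pooled}^2$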
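numerator $= n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2$
$$n_1(\bar{x}_1 - \bar{x})^2 = n_1\left(\bar{x}_1 - \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2}\right)^2 = \frac{n_1 n_2^2}{(n_1 + n_2)^2}(\bar{x}_1 - \bar{x}_2)^2$$
$$n_2(\bar{x}_2 - \bar{x})^2 = n_2\left(\bar{x}_2 - \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2}\right)^2 = \frac{n_1^2 n_2}{(n_1 + n_2)^2}(\bar{x}_1 - \bar{x}_2)^2$$
$$n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2 = \frac{n_1 n_2^2 + n_1^2 n_2}{(n_1 + n_2)^2}(\bar{x}_1 - \bar{x}_2)^2 = \frac{n_1 n_2}{n_1 + n_2}(\bar{x}_1 - \bar{x}_2)^2$$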
Hence
$$F = \frac{\dfrac{1}{\frac{1}{n_1} + \frac{1}{n_2}}(\bar{x}_1 - \bar{x}_2)^2}{s_{Pooled}^2} = \frac{(\bar{x}_1 - \bar{x}_2)^2}{s_{Pooled}^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = t^2$$
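The identity F = t² can also be checked numerically. The sketch below does so for two of the diets (Diet 1 and Diet 2, chosen only as an illustration), computing t from the pooled two-sample formula and F from the between and pooled variances with k = 2.

```python
import math
import statistics

x = [73, 102, 118, 104, 81, 107, 100, 87, 117, 111]   # Diet 1
y = [98, 74, 56, 111, 95, 88, 82, 77, 86, 92]         # Diet 2
n, m = len(x), len(y)
x_bar, y_bar = statistics.mean(x), statistics.mean(y)

# Pooled variance, shared by the t-test and the F-test.
s2_pooled = ((n - 1) * statistics.variance(x)
             + (m - 1) * statistics.variance(y)) / (n + m - 2)

# Two-sample t statistic with pooled standard deviation.
t = (x_bar - y_bar) / math.sqrt(s2_pooled * (1 / n + 1 / m))

# F statistic for k = 2 samples (numerator has k - 1 = 1 d.f.).
grand_mean = (n * x_bar + m * y_bar) / (n + m)
s2_between = (n * (x_bar - grand_mean) ** 2
              + m * (y_bar - grand_mean) ** 2) / (2 - 1)
f_ratio = s2_between / s2_pooled

print(round(t, 4), round(t ** 2, 4), round(f_ratio, 4))
```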
Using SPSS
Note: The use of another statistical package
such as Minitab is similar to using SPSS
Assume the data is contained in an Excel file
Each variable is in a column
1. Weight gain (wtgn)
2. diet
3. Source of protein (Source)
4. Level of Protein (Level)
After starting the SPSS program the following dialogue box appears:
If you select Opening an existing file and press OK, the following dialogue box appears:
If the variable names are in the file, ask it to read the names. If you do not specify the Range, the program will identify the Range:
Once you click OK, two windows will appear:
One that will contain the output:
The other containing the data:
To perform ANOVA select Analyze -> General Linear Model -> Univariate
The following dialog box appears
Select the dependent variable and the fixed factors
Press OK to perform the Analysis
The Output
Tests of Between-Subjects Effects
Dependent Variable: wtgn

Source            Type III Sum of Squares   df   Mean Square   F          Sig.
Corrected Model   4612.933(a)               5    922.587       4.300      .002
Intercept         463233.067                1    463233.067    2159.036   .000
diet              4612.933                  5    922.587       4.300      .002
Error             11586.000                 54   214.556
Total             479432.000                60
Corrected Total   16198.933                 59

a R Squared = .285 (Adjusted R Squared = .219)
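Most of the remaining entries in this output can be related back to the hand calculation. The sketch below assumes the usual definitions: the Intercept sum of squares is G²/N, R Squared is the model sum of squares over the corrected total, and Adjusted R Squared compares the error and corrected-total mean squares. These definitions are stated here as assumptions rather than taken from the SPSS documentation; the variable names are illustrative.

```python
ss_model = 4612.933        # SS_Between (Corrected Model / diet row)
ss_error = 11586.0         # SS_Within (Error row)
grand_total = 5272         # G
n_total, k = 60, 6

ss_corrected_total = ss_model + ss_error
ss_intercept = grand_total ** 2 / n_total           # G^2 / N
ss_total = ss_intercept + ss_corrected_total        # uncorrected sum of x^2

ms_error = ss_error / (n_total - k)
f_intercept = ss_intercept / ms_error               # intercept has 1 d.f.

r_squared = ss_model / ss_corrected_total
adj_r_squared = 1 - ms_error / (ss_corrected_total / (n_total - 1))

print(round(ss_intercept, 3), round(ss_total, 3), round(f_intercept, 3))
print(round(r_squared, 3), round(adj_r_squared, 3))
```

Under these assumptions the results agree with the Intercept, Total and R Squared entries of the output to the precision shown.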