Transcript 6.3 - ANOVA
CHAPTER 6
Statistical Inference & Hypothesis Testing
• 6.1 - One Sample
Mean μ, Variance σ², Proportion π
• 6.2 - Two Samples
Means, Variances, Proportions
μ1 vs. μ2; σ1² vs. σ2²; π1 vs. π2
• 6.3 - Multiple Samples
Means, Variances, Proportions
μ1, …, μk; σ1², …, σk²; π1, …, πk
Example: Y = “$ Cost of a certain medical service”
Assume Y is known to be normally distributed at each of k = 2 health care facilities (“groups”).
Hospital: Y1 ~ N(μ1, σ1)
Clinic: Y2 ~ N(μ2, σ2)
• Null Hypothesis H0: μ1 = μ2, i.e., μ1 – μ2 = 0 (“No difference exists.”)
2-sided test at significance level α = .05
• Data: Sample 1 = {667, 653, 614, 612, 604}; n1 = 5
Sample 2 = {593, 525, 520}; n2 = 3
• Analysis via T-test (if equivariance holds): Point estimates
“Group Means” (ȳ = Σ yi / n):
ȳ1 = (667 + 653 + 614 + 612 + 604) / 5 = 630
ȳ2 = (593 + 525 + 520) / 3 = 546
ȳ1 – ȳ2 = 84 (NOTE: > 0)
“Group Variances” (s² = SS/df):
s1² = [(667 – 630)² + … + (604 – 630)²] / (5 – 1) = SS1 / 4 = 788.5
s2² = [(593 – 546)² + … + (520 – 546)²] / (3 – 1) = SS2 / 2 = 1663
F = 1663 / 788.5 = 2.11 < 4
Pooled Variance:
s²pooled = [(n1 – 1) s1² + (n2 – 1) s2²] / (n1 + n2 – 2) = [(5 – 1)(788.5) + (3 – 1)(1663)] / (5 + 3 – 2) = 1080
The pooled variance is a weighted average of the group variances, using the degrees of freedom as the weights.
In the pooled variance, the numerator is SSErr = (5 – 1)(788.5) + (3 – 1)(1663) = 6480 and the denominator is dfErr = 5 + 3 – 2 = 6, so s²pooled = SSErr / dfErr = 6480 / 6 = 1080.
Standard Error:
s.e. = √[ s²pooled (1/n1 + 1/n2) ] = √[ 1080 (1/5 + 1/3) ] = 24
p-value = 2 P(Ȳ1 – Ȳ2 ≥ 84) = 2 P(T6 ≥ (84 – 0) / 24) = 2 P(T6 ≥ 3.5)
> 2 * (1 - pt(3.5, 6))
[1] 0.01282634
Reject H0 at α = .05: statistically significant, Hosp > Clinic.
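The hand calculation above can be retraced step by step in base R. This is only a sketch of the pooled two-sample T-test as described on the slide; the variable names (s1_sq, s_pooled_sq, se, and so on) are illustrative and do not come from the original notes.

```r
# Data from the example (hospital vs. clinic costs)
y1 <- c(667, 653, 614, 612, 604)
y2 <- c(593, 525, 520)
n1 <- length(y1)
n2 <- length(y2)

# Group variances: each is SS / df
s1_sq <- var(y1)
s2_sq <- var(y2)

# Pooled variance: average of the group variances, weighted by degrees of freedom
df_err <- n1 + n2 - 2
s_pooled_sq <- ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / df_err

# Standard error of the difference in means under H0
se <- sqrt(s_pooled_sq * (1 / n1 + 1 / n2))

# T-score and 2-sided p-value on df_err degrees of freedom
t_stat <- (mean(y1) - mean(y2) - 0) / se
p_value <- 2 * (1 - pt(t_stat, df_err))

print(c(s_pooled_sq = s_pooled_sq, se = se, t = t_stat, p = p_value))
```

The intermediate values should agree with the slide (pooled variance 1080, standard error 24, T-score 3.5), which makes it easy to see where each number in the t.test output comes from.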
R code:
> y1 = c(667, 653, 614, 612, 604)
> y2 = c(593, 525, 520)
>
> t.test(y1, y2, var.equal = TRUE)

        Two Sample t-test

data:  y1 and y2
t = 3.5, df = 6, p-value = 0.01283
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  25.27412 142.72588
sample estimates:
mean of x mean of y
      630       546

Formal Conclusion
p-value < α = .05
Reject H0 at this level.
Interpretation
The samples provide evidence that the difference between mean costs is (moderately) statistically significant, at the 5% level, with the hospital being higher than the clinic (by an average of $84).
Alternate method ~
Main Idea: Among several (k ≥ 2) independent, equivariant, normally-distributed “treatment groups”…
Y1, Y2, …, Yk with means μ1, μ2, …, μk and equal standard deviations σ1 = σ2 = … = σk
“Total Variability” = “Variability between groups” + “Variability within groups”
H0: μ1 = μ2 = … = μk
HA: “At least one ‘treatment mean’ μi is significantly different from the others.”
Example (same data): Sample 1 = {667, 653, 614, 612, 604}; n1 = 5 and Sample 2 = {593, 525, 520}; n2 = 3, testing H0: μ1 = μ2 at α = .05.
• ANOVA F-test (if equivariance holds): Point estimates
“Group Means” (ȳ = Σ yi / n):
ȳ1 = (667 + 653 + 614 + 612 + 604) / 5 = 630
ȳ2 = (593 + 525 + 520) / 3 = 546
ȳ1 – ȳ2 = 84 (NOTE: > 0)
“Grand Mean”:
ȳ = (667 + 653 + 614 + 612 + 604 + 593 + 525 + 520) / (5 + 3) = [5 (630) + 3 (546)] / (5 + 3) = 598.50
The grand mean is a weighted average of the group means, using the sample sizes as the weights.
How far is the “total” sample from the grand mean?
SSTot = (667 – 598.5)² + (653 – 598.5)² + (614 – 598.5)² + (612 – 598.5)² + (604 – 598.5)² + (593 – 598.5)² + (525 – 598.5)² + (520 – 598.5)² = 19710
dfTot = (5 + 3) – 1 = 7
“Total Variability” = “Variability between groups” + “Variability within groups”
How can we measure the “variability between groups”? Imagine zero variability within groups…
“The Clonemaster”: replace every observation by its own group mean.
Sample 1 = {630, 630, 630, 630, 630}; n1 = 5
Sample 2 = {546, 546, 546}; n2 = 3
SSTrt = 5 (630 – 598.5)² + 3 (546 – 598.5)² = 13230
dfTrt = (2) – 1 = 1
How far is each sample from its own group mean?
SSErr = (667 – 630)² + (653 – 630)² + (614 – 630)² + (612 – 630)² + (604 – 630)² + (593 – 546)² + (525 – 546)² + (520 – 546)² = 6480
Recall from the T-test: s²pooled = SSErr / dfErr = 6480 / 6 = 1080, the weighted average of the group variances.
SSTot = 19710, dfTot = (5 + 3) – 1 = 7
SSTrt = 5 (630 – 598.5)² + 3 (546 – 598.5)² = 13230, dfTrt = (2) – 1 = 1
SSErr = 4 (788.5) + 2 (1663) = 6480, dfErr = (5 + 3) – 2 = 6
SSTot = SSTrt + SSErr
dfTot = dfTrt + dfErr
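The three sums of squares can be checked numerically. The sketch below uses base R only and follows the definitions on the slides (distance from the grand mean, “cloned” group means, distance from each group's own mean); the variable names are illustrative, not from the original notes.

```r
# Data from the example
y1 <- c(667, 653, 614, 612, 604)
y2 <- c(593, 525, 520)
y  <- c(y1, y2)

# Grand mean: same as the sample-size-weighted average of the group means
grand_mean <- mean(y)
weighted_mean <- (length(y1) * mean(y1) + length(y2) * mean(y2)) / length(y)

# Total: every observation vs. the grand mean
ss_tot <- sum((y - grand_mean)^2)

# Treatment: every observation replaced by its group mean, vs. the grand mean
ss_trt <- length(y1) * (mean(y1) - grand_mean)^2 +
          length(y2) * (mean(y2) - grand_mean)^2

# Error: every observation vs. its own group mean
ss_err <- sum((y1 - mean(y1))^2) + sum((y2 - mean(y2))^2)

print(c(SSTot = ss_tot, SSTrt = ss_trt, SSErr = ss_err))
```

Running it should show SSTot equal to SSTrt + SSErr, matching the decomposition stated above.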
ANOVA Table
SSTot = SSTrt + SSErr
dfTot = dfTrt + dfErr
MS = SS / df
F = MSTrt / MSErr ~ F1,6

Source      df   SS      MS                   F-ratio   p-value
Treatment   1    13230   13230 (s²between)    12.25     .01282634
Error       6    6480    1080 (s²within)
Total       7    19710   –

p-value on F1,6: 1–pf(12.25, 1, 6); F-table: compare with α.
Note: s²within = 1080 is also s²pooled.
Thus, the treatment accounts for 13230 / 19710 = 67.1% of the total variability in the response Y.
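The remaining columns of the ANOVA table follow mechanically from the sums of squares and degrees of freedom. A short base-R sketch, with illustrative variable names that are not part of the original notes:

```r
# Sums of squares and degrees of freedom from the ANOVA table
ss_trt <- 13230; df_trt <- 1
ss_err <- 6480;  df_err <- 6

# Mean squares: MS = SS / df
ms_trt <- ss_trt / df_trt   # between-groups variance
ms_err <- ss_err / df_err   # within-groups (pooled) variance

# F-ratio and its upper-tail p-value on F(1, 6)
f_ratio <- ms_trt / ms_err
p_value <- 1 - pf(f_ratio, df_trt, df_err)

# Share of total variability accounted for by the treatment
prop_explained <- ss_trt / (ss_trt + ss_err)

print(c(F = f_ratio, p = p_value, prop = prop_explained))
```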
R code:
# ANOVA FOR UNBALANCED DESIGN
> y1 = c(667, 653, 614, 612, 604)
> y2 = c(593, 525, 520)
>
> Data = data.frame(
+   Y = c(y1, y2),
+   X = factor(rep(c("y1", "y2"), times = c(length(y1), length(y2))))
+ )
>
> var.test(Y ~ X, data = Data)   # EQUIVARIANCE?

        F test to compare two variances

data:  Y by X
F = 0.4741, num df = 4, denom df = 2, p-value = 0.4738
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.01208057 5.04920249
sample estimates:
ratio of variances
         0.4741431
R code:
# ANOVA FOR UNBALANCED DESIGN
> out = aov(Y ~ X, data = Data)
> anova(out)
Analysis of Variance Table

Response: Y
          Df Sum Sq Mean Sq F value  Pr(>F)
X          1  13230   13230   12.25 0.01283 *
Residuals  6   6480    1080
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Note: Vis-à-vis T-test vs. F-test,
• The p-value is the same using either method (.01283), since the sample is unchanged!
• The square of the Tdf-score (3.5) is equal to the F1,df-score (12.25).
(Recall that the square of the Z-score is equal to the χ²1-score.)
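The equivalence noted above can be verified directly. The sketch below reruns both tests on the example data and compares the statistics; the object names (t_out, f_tab, and so on) are illustrative, and the last lines check the analogous Z² versus chi-square relationship at the 5% critical values.

```r
# Data from the example, in both vector and data-frame form
y1 <- c(667, 653, 614, 612, 604)
y2 <- c(593, 525, 520)
Data <- data.frame(
  Y = c(y1, y2),
  X = factor(rep(c("y1", "y2"), times = c(length(y1), length(y2))))
)

# Pooled two-sample T-test and one-way ANOVA F-test on the same data
t_out <- t.test(y1, y2, var.equal = TRUE)
f_tab <- anova(aov(Y ~ X, data = Data))

t_stat <- unname(t_out$statistic)
f_stat <- f_tab[["F value"]][1]

# T^2 equals F, and the two p-values coincide
same_stat <- isTRUE(all.equal(t_stat^2, f_stat))
same_p    <- isTRUE(all.equal(t_out$p.value, f_tab[["Pr(>F)"]][1]))

# Analogous fact: squared 2-sided Z critical value equals the chi-square(1) critical value
z_sq  <- qnorm(0.975)^2
chi_1 <- qchisq(0.95, df = 1)

print(c(same_stat = same_stat, same_p = same_p))
```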
Alternate method ~
Main Idea: Among several (k ≥ 2) independent, equivariant, normally-distributed “treatment groups”…
H0: μ1 = μ2 = … = μk
• Equivariance can be tested via the very similar “two variances” F-test in 6.2.2 (but this is very sensitive to the normality assumption), or others. If violated, can extend the Welch Test for two means.
• Normality can be tested via usual methods. If violated, use the nonparametric Kruskal-Wallis Test.
• Extensions of ANOVA exist for data in matched “blocks” designs, repeated measures, multiple factor levels within groups, etc.
• How to identify significant group(s)? Pairwise testing, with correction (e.g., Bonferroni) for spurious significance.
• Example: k = 5 groups result in 10 such tests, so let each α* = α / 10.
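The follow-up options listed above have counterparts in base R's stats package. The sketch below is one possible illustration, reusing the two-group cost data as a stand-in: oneway.test with var.equal = FALSE is a Welch-type test for when equivariance is doubtful, and kruskal.test is the Kruskal-Wallis test. The Bonferroni part only reproduces the arithmetic of the k = 5 example. Object names are illustrative and not taken from the notes.

```r
# Example data (two groups only; used here purely as a stand-in)
y1 <- c(667, 653, 614, 612, 604)
y2 <- c(593, 525, 520)
Data <- data.frame(
  Y = c(y1, y2),
  X = factor(rep(c("y1", "y2"), times = c(length(y1), length(y2))))
)

# If equivariance is doubtful: Welch-type one-way test (no equal-variance assumption)
welch_out <- oneway.test(Y ~ X, data = Data, var.equal = FALSE)

# If normality is doubtful: nonparametric Kruskal-Wallis test
kw_out <- kruskal.test(Y ~ X, data = Data)

# Bonferroni correction for all pairwise comparisons among k groups
k <- 5
alpha <- 0.05
n_tests <- choose(k, 2)          # number of distinct pairs of groups
alpha_star <- alpha / n_tests    # per-test significance level

print(c(n_tests = n_tests, alpha_star = alpha_star))
```

With k = 5 this gives 10 pairwise tests and a per-test level of α / 10, as in the example above.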