Mean Comparison Tests - Crop and Soil Science
Treatment Comparisons
ANOVA can determine if there are differences
among the treatments, but what is the nature of
those differences?
Are the treatments measured on a continuous scale?
Look at response surfaces (linear regression, polynomials)
Is there an underlying structure to the treatments?
Compare groups of treatments using orthogonal contrasts
or a limited number of preplanned mean comparison tests
Use simultaneous confidence intervals on preplanned
comparisons
Are the treatments unstructured?
Use appropriate multiple comparison tests (today’s topic)
Variety Trials
In a breeding program, you need to examine
large numbers of selections and then narrow to
the best
In the early stages, based on single plants or
single rows of related plants. Seed and space
are limited, so difficult to have replication
When numbers have been reduced
and there is sufficient seed, you can
conduct replicated yield trials and
you want to be able to “pick the
winner”
Comparison of Means
Pairwise Comparisons
– Least Significant Difference (LSD)
Simultaneous Confidence Intervals
– Tukey’s Honestly Significant Difference (HSD)
– Dunnett Test (making all comparisons to a control)
• May be a one-sided or two-sided test
– Bonferroni Inequality
– Scheffé’s Test – can be used for unplanned comparisons
Other Multiple Comparison Tests - “Data Snooping”
– Fisher’s Protected LSD (FPLSD)
– Student-Newman-Keuls test (SNK)
– Waller and Duncan’s Bayes LSD (BLSD)
– False Discovery Rate Procedure
Often misused - intended to be used only for data from
experiments with unstructured treatments
Multiple Comparison Tests
Fixed Range Tests – a constant value is used
for all comparisons
– Application
• Hypothesis Tests
• Confidence Intervals
Multiple Range Tests – values used for
comparison vary across a range of means
– Application
• Hypothesis Tests
Type I vs Type II Errors
Type I error - saying something is different when it is really the
same (false positive) (Paranoia)
– the rate at which this type of error is made is the significance
level
Type II error - saying something is the same when it is really
different (false negative) (Sloth)
– the probability of committing this type of error is designated β
– the probability that a comparison procedure will pick up a real
difference is called the power of the test and is equal to 1 - β
Type I and Type II error rates are inversely related to each other
For a given Type I error rate, the rate of Type II error depends on
– sample size
– variance
– true differences among means
Nobody likes to be wrong...
Protection against Type I is choosing a significance level
Protection against Type II is a little harder because
– it depends on the true magnitude of the difference
which is unknown
– choose a test with sufficiently high power
Reasons for not using LSD to make all possible
comparisons
– the chance for a Type I error increases dramatically as
the number of treatments increases
Pairwise Comparisons
Making all possible pairwise comparisons
among t treatments
– # of comparisons:
C(t, 2) = t! / (2! (t-2)!) = t(t-1)/2
If you have 10 varieties and want to look at all
possible pairwise comparisons
– that would be t(t-1)/2 or 10(9)/2 = 45
– that’s quite a few more than t-1 df = 9
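The count of pairwise comparisons can be checked with a short sketch (Python here is illustrative, not part of the original slides):

```python
from math import comb

def n_pairwise(t):
    # Number of distinct pairwise comparisons among t treatment means:
    # C(t, 2) = t! / (2! (t-2)!) = t(t-1)/2
    return comb(t, 2)

print(n_pairwise(10))  # 45 comparisons for 10 varieties
```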
Comparisonwise vs Experimentwise Error
Comparisonwise error rate (αC)
– measures the proportion of all differences that are
expected to be declared real when they are not
Experimentwise error rate (αE)
– the risk of making at least one Type I error among the
set (family) of comparisons in the experiment
– measures the proportion of experiments in which one
or more differences are falsely declared to be
significant
– the probability of being wrong increases as the number
of means being compared increases
– Also called familywise error rate (FWE)
Comparisonwise vs Experimentwise Error
Experimentwise error rate (αE)
Probability of no Type I errors = (1 - αC)^x
where x = number of pairwise comparisons
Max x = t(t-1)/2, where t = number of treatments
Probability of at least one Type I error:
αE = 1 - (1 - αC)^x
If t = 10, Max x = 45:
αE = 1 - (1 - 0.05)^45 ≈ 0.90, or 90%
Comparisonwise error rate:
αC = 1 - (1 - αE)^(1/x)
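These two error-rate relations are easy to verify numerically (a minimal Python sketch, not part of the original slides):

```python
def experimentwise_error(alpha_c, x):
    # alpha_E = 1 - (1 - alpha_C)^x for x pairwise comparisons
    return 1 - (1 - alpha_c) ** x

def comparisonwise_error(alpha_e, x):
    # Inverse relation: alpha_C = 1 - (1 - alpha_E)^(1/x)
    return 1 - (1 - alpha_e) ** (1 / x)

# t = 10 treatments -> x = 45 pairwise comparisons
print(round(experimentwise_error(0.05, 45), 2))  # ~0.90
```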
Least Significant Difference
Calculating a t for testing the difference between two
means:
t_calc = (Ȳ1 - Ȳ2) / s(Ȳ1-Ȳ2)
where s(Ȳ1-Ȳ2) is the standard error of the difference between the two means
– Any difference for which t_calc > t_α would be declared
significant
– Further, Ȳ1 - Ȳ2 = t_α * s(Ȳ1-Ȳ2) is the smallest difference for which
significance would be declared
– Therefore
LSD = t_α * s(Ȳ1-Ȳ2)
– For equal replication, where r is the number of observations
forming each mean
LSD = t_α * sqrt(2 * MSE / r)
Do’s and Don’ts of using LSD
LSD is a valid test when
– Making comparisons planned in advance of seeing the
data (this includes the comparison of each treatment
with the control)
– Comparing adjacent ranked means
The LSD should not (unless F test for treatments
is significant**) be used for
– Making all possible pairwise comparisons
– Making more comparisons than df for treatments
**Some would say that LSD should never be used
unless the F test from ANOVA is significant
Pick the Winner
A plant breeder wanted to measure resistance to
stem rust for six wheat varieties
– planted 5 seeds of each variety in each of four pots
– placed the 24 pots randomly on a greenhouse bench
– inoculated with stem rust
– measured seed yield per pot at maturity
Ranked Mean Yields (g/pot)

Variety   Rank   Mean Yield (Ȳi)   Difference (Ȳi-1 - Ȳi)
F         1      95.3
D         2      94.0              1.3
E         3      75.0              19.0
B         4      69.0              6.0
A         5      50.3              18.7
C         6      24.0              26.3
ANOVA

Source    df   MS         F
Variety   5    2,976.44   24.80
Error     18   120.00
Compute LSD at 5% and 1%
LSD = t(0.05, 18 df) * sqrt(2 * MSE / r) = 2.101 * sqrt(2 * 120 / 4) = 16.27
LSD = t(0.01, 18 df) * sqrt(2 * MSE / r) = 2.878 * sqrt(2 * 120 / 4) = 22.29

LSD (α = 0.05) = 16.27
LSD (α = 0.01) = 22.29
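These LSD values can be reproduced in a few lines (a sketch assuming SciPy is available; Python is not part of the original slides):

```python
from math import sqrt
from scipy import stats  # assumed available for the t quantile

def lsd(alpha, df_error, mse, r):
    # LSD = t(alpha, error df) * sqrt(2 * MSE / r), two-sided t critical value
    t_crit = stats.t.ppf(1 - alpha / 2, df_error)
    return t_crit * sqrt(2 * mse / r)

# Wheat example: MSE = 120, r = 4 pots, 18 error df
print(lsd(0.05, 18, 120, 4))  # matches the 16.27 in the slides
print(lsd(0.01, 18, 120, 4))  # matches the 22.29 in the slides
```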
Back to the data...

Variety   Rank   Mean Yield (Ȳi)   Difference (Ȳi-1 - Ȳi)
F         1      95.3
D         2      94.0              1.3
E         3      75.0              19.0*
B         4      69.0              6.0
A         5      50.3              18.7*
C         6      24.0              26.3**
Fisher’s protected LSD (FPLSD)
Uses comparisonwise error rate
Computed just like LSD but you don’t use it
unless the F for treatments tests significant
LSD t
2 * MSE
r
So in our example data, any difference between
means that is greater than 16.27 is declared to
be significant
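The "protected" part of the FPLSD is just a gate on the overall F test; one way to sketch it (Python with SciPy assumed; the function name and argument layout are illustrative, not from the slides):

```python
from math import sqrt
from scipy import stats  # assumed available for F and t quantiles

def fplsd_significant(diff, f_calc, alpha, df_trt, df_error, mse, r):
    # Fisher's protected LSD: only apply the LSD criterion if the
    # overall F test for treatments is itself significant at alpha.
    f_crit = stats.f.ppf(1 - alpha, df_trt, df_error)
    if f_calc <= f_crit:
        return False  # F test not significant: declare no differences
    t_crit = stats.t.ppf(1 - alpha / 2, df_error)
    return abs(diff) > t_crit * sqrt(2 * mse / r)

# Wheat example: F = 24.80 with 5 and 18 df, MSE = 120, r = 4
print(fplsd_significant(19.0, 24.80, 0.05, 5, 18, 120, 4))  # 19.0 > 16.27
```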
Tukey’s Honestly Significant Difference (HSD)
From a table of Studentized range values (see handout),
select a value of Q which depends on p (the number of
means) and v (error df)
Compute:
HSD = Q(α, p, v) * sqrt(MSE / r)
For any pair of means, if the difference is greater than
HSD, it is significant
Uses an experimentwise error rate
Use the Tukey-Kramer test with unequal sample sizes:
HSD = Q(α, p, v) * sqrt((MSE / 2) * (1/r1 + 1/r2))
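Both HSD forms can be sketched with the studentized range quantile (SciPy's `studentized_range`, available from SciPy 1.7, is an assumption; Python is not part of the slides):

```python
from math import sqrt
from scipy.stats import studentized_range  # assumed SciPy >= 1.7

def tukey_hsd(alpha, p, v, mse, r):
    # HSD = Q(alpha, p, v) * sqrt(MSE / r), Q from the studentized range table
    q = studentized_range.ppf(1 - alpha, p, v)
    return q * sqrt(mse / r)

def tukey_kramer(alpha, p, v, mse, r1, r2):
    # Unequal replication: HSD = Q * sqrt((MSE/2) * (1/r1 + 1/r2))
    q = studentized_range.ppf(1 - alpha, p, v)
    return q * sqrt((mse / 2) * (1 / r1 + 1 / r2))

# Wheat example: p = 6 means, v = 18 error df, MSE = 120, r = 4
print(tukey_hsd(0.05, 6, 18, 120, 4))  # close to the tabled 4.49 * 5.477
```

With r1 = r2 = r the Tukey-Kramer form reduces exactly to the equal-replication HSD.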
Student-Newman-Keuls Test (SNK)
Rank the means from high to low
Compute t-1 significant differences, SNKj , using the
studentized values for the HSD
SNK_j = Q(α, k, v) * sqrt(MSE / r)
where j = 1, 2, ..., t-1; k = 2, 3, ..., t
k = number of means in the range
Compare the highest and lowest
– if less than SNK, no differences are significant
– if greater than SNK, compare next highest mean with
next lowest using next SNK
Uses an experimentwise error rate for the extremes
Uses a comparisonwise error rate for adjacent means
Using SNK with example data:
15 comparisons, 18 df for error
se = sqrt(MSE / r) = sqrt(120 / 4) = 5.477
SNK = Q * se

k     2       3       4       5       6
Q     2.97    3.61    4.00    4.28    4.49
SNK   16.27   19.77   21.91   23.44   24.59

Variety   Rank   Mean Yield (Ȳi)
F         1      95.3
D         2      94.0
E         3      75.0
B         4      69.0
A         5      50.3
C         6      24.0
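The table of critical SNK ranges above can be generated directly (a sketch assuming SciPy >= 1.7 for `studentized_range`; not part of the slides):

```python
from math import sqrt
from scipy.stats import studentized_range  # assumed SciPy >= 1.7

def snk_ranges(alpha, t, v, mse, r):
    # One critical range per span k = 2..t of ranked means:
    # SNK_k = Q(alpha, k, v) * sqrt(MSE / r)
    se = sqrt(mse / r)
    return {k: studentized_range.ppf(1 - alpha, k, v) * se
            for k in range(2, t + 1)}

# Wheat example: t = 6 varieties, v = 18 error df, MSE = 120, r = 4
for k, rng in snk_ranges(0.05, 6, 18, 120, 4).items():
    print(k, round(rng, 2))  # reproduces the 16.27 ... 24.59 row
```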
Waller-Duncan Bayes LSD (BLSD)
Do ANOVA and compute F (MST/MSE) with q and f df
(corresponds to table nomenclature)
Choose error weight ratio, k
– k=100 corresponds to 5% significance level
– k=500 for a 1% test
Obtain tb from table (A7 in Petersen)
– depends on k, F, q (treatment df) and f (error df)
Compute:
BLSD = t_b * sqrt(2 * MSE / r)
Any difference greater than BLSD is significant
Does not provide complete control of experimentwise Type
I error
Reduces Type II error
Bonferroni Inequality
Theory:
αE ≤ X * αC
where X = number of pairwise comparisons
To get the critical probability value for significance:
αC = αE / X, where αE = maximum desired experimentwise
error rate
Alternatively, multiply each observed probability value
by X and compare to αE (values > 1 are set to 1)
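The "multiply by X" form of the Bonferroni adjustment is a one-liner (Python sketch, not part of the slides):

```python
def bonferroni_adjust(p_values):
    # Multiply each raw P value by the number of comparisons X;
    # adjusted values greater than 1 are set to 1.
    x = len(p_values)
    return [min(1.0, p * x) for p in p_values]

# With 3 comparisons, a raw P of 0.01 is compared against alpha_E as 0.03
print(bonferroni_adjust([0.01, 0.25, 0.50]))
```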
Advantages
– simple
– strict control of Type I error
Disadvantage
– very conservative, low power to detect differences
False Discovery Rate
Bars show P values for simple t tests among means
– Largest differences have the smallest P values
Line represents critical P values = (i/X) * αE, where i = 1 to X
[Figure: P values for the ranked pairwise differences Ȳi - Ȳj plotted against rank (i = 1 to 21); H0 is rejected for comparisons whose P value falls below the critical line]
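The step shown in the figure is the Benjamini-Hochberg procedure: rank the P values, find the largest rank i with P(i) ≤ (i/X)·αE, and reject H0 for all comparisons up to that rank. A minimal sketch (Python, not part of the slides):

```python
def fdr_reject_count(p_values, alpha_e):
    # Benjamini-Hochberg: sort the X P values ascending; find the largest
    # rank i with P_(i) <= (i/X) * alpha_E; reject H0 for ranks 1..i.
    x = len(p_values)
    ranked = sorted(p_values)
    largest = 0
    for i, p in enumerate(ranked, start=1):
        if p <= (i / x) * alpha_e:
            largest = i
    return largest  # number of comparisons declared significant

# Four comparisons, alpha_E = 0.05: critical values are 0.0125, 0.025, 0.0375, 0.05
print(fdr_reject_count([0.001, 0.012, 0.04, 0.2], 0.05))  # 2
```

Note that the step-up rule rejects everything below the largest qualifying rank, even if some intermediate P value sits above its own critical line.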
More Options!
Duncan’s New Multiple Range Test
– A multiple range test
– Less conservative than the SNK test
– Used to be popular, but no longer recommended
Dunnett’s Test
– Compare all treatments against a control
– Compare all treatments against the best treatment
– Conservative (controls Type I, not Type II error)
Scheffe’s Test
– Can be used for comparisons that are not preplanned
– Very conservative!
Most Popular
FPLSD test is widely used, and widely abused
BLSD is preferred by some because
– It is a single value and therefore easy to use
– Larger when F indicates that the means are homogeneous and
smaller when the means appear to be heterogeneous
The False Discovery Rate (FDR) has nice features
– Good experimentwise Type I error control
– Good power (Type II error control)
– May not be as well-known as some other tests
Tukey’s HSD test
– Widely accepted and often recommended by statisticians
– May be too conservative if Type II error has more serious
consequences than Type I error