Mean Comparison Tests - Crop and Soil Science
Treatment Comparisons
ANOVA can determine if there are differences
among the treatments, but what is the nature of
those differences?
Are the treatments measured on a continuous scale?
Look at response surfaces (linear regression, polynomials)
Is there an underlying structure to the treatments?
Compare groups of treatments using orthogonal contrasts
or a limited number of preplanned mean comparison tests
Use simultaneous confidence intervals on preplanned
comparisons
Are the treatments unstructured?
Use appropriate multiple comparison tests (today’s topic)
Variety Trials
In a breeding program, you need to examine
large numbers of selections and then narrow to
the best
In the early stages, based on single plants or
single rows of related plants. Seed and space
are limited, so difficult to have replication
When numbers have been reduced
and there is sufficient seed, you can
conduct replicated yield trials and
you want to be able to “pick the
winner”
Comparison of Means
Pairwise Comparisons
– Least Significant Difference (LSD)
Simultaneous Confidence Intervals
– Tukey’s Honestly Significant Difference (HSD)
– Dunnett Test (making all comparisons to a control)
• May be a one-sided or two-sided test
– Bonferroni Inequality
– Scheffé’s Test – can be used for unplanned comparisons
Other Multiple Comparison Tests - “Data Snooping”
– Fisher’s Protected LSD (FPLSD)
– Student-Newman-Keuls test (SNK)
– Waller and Duncan’s Bayes LSD (BLSD)
– False Discovery Rate Procedure
Often misused - intended to be used only for data from
experiments with unstructured treatments
Multiple Comparison Tests
Fixed Range Tests – a constant value is used
for all comparisons
– Application
• Hypothesis Tests
• Confidence Intervals
Multiple Range Tests – values used for
comparison vary across a range of means
– Application
• Hypothesis Tests
Type I vs Type II Errors
Type I error - saying something is different when it is really the
same (false positive) (Paranoia)
– the rate at which this type of error is made is the significance
level
Type II error - saying something is the same when it is really
different (false negative) (Sloth)
– the probability of committing this type of error is designated β
– the probability that a comparison procedure will pick up a real
difference is called the power of the test and is equal to 1 - β
Type I and Type II error rates are inversely related to each other
For a given Type I error rate, the rate of Type II error depends on
– sample size
– variance
– true differences among means
Nobody likes to be wrong...
Protection against Type I is choosing a significance level
Protection against Type II is a little harder because
– it depends on the true magnitude of the difference
which is unknown
– choose a test with sufficiently high power
Reasons for not using LSD to make all possible
comparisons
– the chance for a Type I error increases dramatically as
the number of treatments increases
Pairwise Comparisons
Making all possible pairwise comparisons
among t treatments
– # of comparisons:
C(t, 2) = t! / (2! (t-2)!) = t(t-1)/2
If you have 10 varieties and want to look at all
possible pairwise comparisons
– that would be t(t-1)/2 or 10(9)/2 = 45
– that’s quite a few more than t-1 df = 9
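The count of pairwise comparisons can be checked with a short sketch (Python here is illustrative, not part of the original slides):

```python
from math import comb

def n_pairwise(t):
    # Number of distinct pairwise comparisons among t treatment means:
    # C(t, 2) = t! / (2! (t-2)!) = t(t-1)/2
    return comb(t, 2)

print(n_pairwise(10))  # 45 comparisons for 10 varieties
```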
Comparisonwise vs Experimentwise Error
Comparisonwise error rate (αC)
– measures the proportion of all differences that are
expected to be declared real when they are not
Experimentwise error rate (αE)
– the risk of making at least one Type I error among the
set (family) of comparisons in the experiment
– measures the proportion of experiments in which one
or more differences are falsely declared to be
significant
– the probability of being wrong increases as the number
of means being compared increases
– Also called familywise error rate (FWE)
Comparisonwise vs Experimentwise Error
Experimentwise error rate (αE)
Probability of no Type I errors = (1 - αC)^x
where x = number of pairwise comparisons
Max x = t(t-1)/2, where t = number of treatments
Probability of at least one Type I error:
αE = 1 - (1 - αC)^x
If t = 10, Max x = 45:
αE = 1 - (1 - 0.05)^45 ≈ 0.90, or 90%
Comparisonwise error rate:
αC = 1 - (1 - αE)^(1/x)
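These two error-rate relations are easy to verify numerically (a minimal Python sketch, not part of the original slides):

```python
def experimentwise_error(alpha_c, x):
    # alpha_E = 1 - (1 - alpha_C)^x for x pairwise comparisons
    return 1 - (1 - alpha_c) ** x

def comparisonwise_error(alpha_e, x):
    # Inverse relation: alpha_C = 1 - (1 - alpha_E)^(1/x)
    return 1 - (1 - alpha_e) ** (1 / x)

# t = 10 treatments -> x = 45 pairwise comparisons
print(round(experimentwise_error(0.05, 45), 2))  # ~0.90
```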
Least Significant Difference
Calculating a t for testing the difference between two
means:
t_calc = (Ȳ1 - Ȳ2) / s(Ȳ1-Ȳ2)
where s(Ȳ1-Ȳ2) is the standard error of the difference between the two means
– Any difference for which t_calc > t_α would be declared
significant
– Further, Ȳ1 - Ȳ2 = t_α * s(Ȳ1-Ȳ2) is the smallest difference for which
significance would be declared
– Therefore
LSD = t_α * s(Ȳ1-Ȳ2)
– For equal replication, where r is the number of observations
forming each mean
LSD = t_α * sqrt(2 * MSE / r)
Do’s and Don’ts of using LSD
LSD is a valid test when
– Making comparisons planned in advance of seeing the
data (this includes the comparison of each treatment
with the control)
– Comparing adjacent ranked means
The LSD should not (unless F test for treatments
is significant**) be used for
– Making all possible pairwise comparisons
– Making more comparisons than df for treatments
**Some would say that LSD should never be used
unless the F test from ANOVA is significant
Pick the Winner
A plant breeder wanted to measure resistance to
stem rust for six wheat varieties
– planted 5 seeds of each variety in each of four pots
– placed the 24 pots randomly on a greenhouse bench
– inoculated with stem rust
– measured seed yield per pot at maturity
Ranked Mean Yields (g/pot)

Variety   Rank   Mean Yield (Ȳi)   Difference (Ȳi-1 - Ȳi)
F         1      95.3
D         2      94.0              1.3
E         3      75.0              19.0
B         4      69.0              6.0
A         5      50.3              18.7
C         6      24.0              26.3
ANOVA

Source    df   MS         F
Variety   5    2,976.44   24.80
Error     18   120.00
Compute LSD at 5% and 1%
LSD = t(0.05, 18 df) * sqrt(2 * MSE / r) = 2.101 * sqrt(2 * 120 / 4) = 16.27
LSD = t(0.01, 18 df) * sqrt(2 * MSE / r) = 2.878 * sqrt(2 * 120 / 4) = 22.29

LSD (α = 0.05) = 16.27
LSD (α = 0.01) = 22.29
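These LSD values can be reproduced in a few lines (a sketch assuming SciPy is available; Python is not part of the original slides):

```python
from math import sqrt
from scipy import stats  # assumed available for the t quantile

def lsd(alpha, df_error, mse, r):
    # LSD = t(alpha, error df) * sqrt(2 * MSE / r), two-sided t critical value
    t_crit = stats.t.ppf(1 - alpha / 2, df_error)
    return t_crit * sqrt(2 * mse / r)

# Wheat example: MSE = 120, r = 4 pots, 18 error df
print(lsd(0.05, 18, 120, 4))  # matches the 16.27 in the slides
print(lsd(0.01, 18, 120, 4))  # matches the 22.29 in the slides
```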
Back to the data...

Variety   Rank   Mean Yield (Ȳi)   Difference (Ȳi-1 - Ȳi)
F         1      95.3
D         2      94.0              1.3
E         3      75.0              19.0*
B         4      69.0              6.0
A         5      50.3              18.7*
C         6      24.0              26.3**
Fisher’s protected LSD (FPLSD)
Uses comparisonwise error rate
Computed just like LSD but you don’t use it
unless the F for treatments tests significant
LSD t
2 * MSE
r
So in our example data, any difference between
means that is greater than 16.27 is declared to
be significant
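The "protected" part of the FPLSD is just a gate on the overall F test; one way to sketch it (Python with SciPy assumed; the function name and argument layout are illustrative, not from the slides):

```python
from math import sqrt
from scipy import stats  # assumed available for F and t quantiles

def fplsd_significant(diff, f_calc, alpha, df_trt, df_error, mse, r):
    # Fisher's protected LSD: only apply the LSD criterion if the
    # overall F test for treatments is itself significant at alpha.
    f_crit = stats.f.ppf(1 - alpha, df_trt, df_error)
    if f_calc <= f_crit:
        return False  # F test not significant: declare no differences
    t_crit = stats.t.ppf(1 - alpha / 2, df_error)
    return abs(diff) > t_crit * sqrt(2 * mse / r)

# Wheat example: F = 24.80 with 5 and 18 df, MSE = 120, r = 4
print(fplsd_significant(19.0, 24.80, 0.05, 5, 18, 120, 4))  # 19.0 > 16.27
```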
Tukey’s Honestly Significant Difference (HSD)
From a table of Studentized range values (see handout),
select a value of Q which depends on p (the number of
means) and v (error df)
Compute:
HSD = Q(α, p, v) * sqrt(MSE / r)
For any pair of means, if the difference is greater than
HSD, it is significant
Uses an experimentwise error rate
Use the Tukey-Kramer test with unequal sample sizes:
HSD = Q(α, p, v) * sqrt((MSE / 2) * (1/r1 + 1/r2))
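Both HSD forms can be sketched with the studentized range quantile (SciPy's `studentized_range`, available from SciPy 1.7, is an assumption; Python is not part of the slides):

```python
from math import sqrt
from scipy.stats import studentized_range  # assumed SciPy >= 1.7

def tukey_hsd(alpha, p, v, mse, r):
    # HSD = Q(alpha, p, v) * sqrt(MSE / r), Q from the studentized range table
    q = studentized_range.ppf(1 - alpha, p, v)
    return q * sqrt(mse / r)

def tukey_kramer(alpha, p, v, mse, r1, r2):
    # Unequal replication: HSD = Q * sqrt((MSE/2) * (1/r1 + 1/r2))
    q = studentized_range.ppf(1 - alpha, p, v)
    return q * sqrt((mse / 2) * (1 / r1 + 1 / r2))

# Wheat example: p = 6 means, v = 18 error df, MSE = 120, r = 4
print(tukey_hsd(0.05, 6, 18, 120, 4))  # close to the tabled 4.49 * 5.477
```

With r1 = r2 = r the Tukey-Kramer form reduces exactly to the equal-replication HSD.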
Student-Newman-Keuls Test (SNK)
Rank the means from high to low
Compute t-1 significant differences, SNKj , using the
studentized values for the HSD
SNK_j = Q(α, k, v) * sqrt(MSE / r)
where j = 1, 2, ..., t-1; k = 2, 3, ..., t
k = number of means in the range
Compare the highest and lowest
– if less than SNK, no differences are significant
– if greater than SNK, compare next highest mean with
next lowest using next SNK
Uses an experimentwise error rate for the extremes
Uses a comparisonwise error rate for adjacent means
Using SNK with example data:
15 comparisons, 18 df for error
se = sqrt(MSE / r) = sqrt(120 / 4) = 5.477
SNK = Q * se

k     2       3       4       5       6
Q     2.97    3.61    4.00    4.28    4.49
SNK   16.27   19.77   21.91   23.44   24.59

Variety   Rank   Mean Yield (Ȳi)
F         1      95.3
D         2      94.0
E         3      75.0
B         4      69.0
A         5      50.3
C         6      24.0
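The table of critical SNK ranges above can be generated directly (a sketch assuming SciPy >= 1.7 for `studentized_range`; not part of the slides):

```python
from math import sqrt
from scipy.stats import studentized_range  # assumed SciPy >= 1.7

def snk_ranges(alpha, t, v, mse, r):
    # One critical range per span k = 2..t of ranked means:
    # SNK_k = Q(alpha, k, v) * sqrt(MSE / r)
    se = sqrt(mse / r)
    return {k: studentized_range.ppf(1 - alpha, k, v) * se
            for k in range(2, t + 1)}

# Wheat example: t = 6 varieties, v = 18 error df, MSE = 120, r = 4
for k, rng in snk_ranges(0.05, 6, 18, 120, 4).items():
    print(k, round(rng, 2))  # reproduces the 16.27 ... 24.59 row
```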
Waller-Duncan Bayes LSD (BLSD)
Do ANOVA and compute F (MST/MSE) with q and f df
(corresponds to table nomenclature)
Choose error weight ratio, k
– k=100 corresponds to 5% significance level
– k=500 for a 1% test
Obtain tb from table (A7 in Petersen)
– depends on k, F, q (treatment df) and f (error df)
Compute:
BLSD = t_b * sqrt(2 * MSE / r)
Any difference greater than BLSD is significant
Does not provide complete control of experimentwise Type
I error
Reduces Type II error
Bonferroni Inequality
Theory:
αE ≤ X * αC
where X = number of pairwise comparisons
To get the critical probability value for significance:
αC = αE / X, where αE = maximum desired experimentwise
error rate
Alternatively, multiply each observed probability value
by X and compare to αE (values > 1 are set to 1)
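The "multiply by X" form of the Bonferroni adjustment is a one-liner (Python sketch, not part of the slides):

```python
def bonferroni_adjust(p_values):
    # Multiply each raw P value by the number of comparisons X;
    # adjusted values greater than 1 are set to 1.
    x = len(p_values)
    return [min(1.0, p * x) for p in p_values]

# With 3 comparisons, a raw P of 0.01 is compared against alpha_E as 0.03
print(bonferroni_adjust([0.01, 0.25, 0.50]))
```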
Advantages
– simple
– strict control of Type I error
Disadvantage
– very conservative, low power to detect differences
False Discovery Rate
Bars show P values for simple t tests among means
– Largest differences have the smallest P values
Line represents critical P values = (i/X) * αE, where i = 1 to X
[Figure: P values for the ranked pairwise differences Ȳi - Ȳj plotted against rank (i = 1 to 21); H0 is rejected for comparisons whose P value falls below the critical line]
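The step shown in the figure is the Benjamini-Hochberg procedure: rank the P values, find the largest rank i with P(i) ≤ (i/X)·αE, and reject H0 for all comparisons up to that rank. A minimal sketch (Python, not part of the slides):

```python
def fdr_reject_count(p_values, alpha_e):
    # Benjamini-Hochberg: sort the X P values ascending; find the largest
    # rank i with P_(i) <= (i/X) * alpha_E; reject H0 for ranks 1..i.
    x = len(p_values)
    ranked = sorted(p_values)
    largest = 0
    for i, p in enumerate(ranked, start=1):
        if p <= (i / x) * alpha_e:
            largest = i
    return largest  # number of comparisons declared significant

# Four comparisons, alpha_E = 0.05: critical values are 0.0125, 0.025, 0.0375, 0.05
print(fdr_reject_count([0.001, 0.012, 0.04, 0.2], 0.05))  # 2
```

Note that the step-up rule rejects everything below the largest qualifying rank, even if some intermediate P value sits above its own critical line.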
More Options!
Duncan’s New Multiple Range Test
– A multiple range test
– Less conservative than the SNK test
– Used to be popular, but no longer recommended
Dunnett’s Test
– Compare all treatments against a control
– Compare all treatments against the best treatment
– Conservative (controls Type I, not Type II error)
Scheffe’s Test
– Can be used for comparisons that are not preplanned
– Very conservative!
Most Popular
FPLSD test is widely used, and widely abused
BLSD is preferred by some because
– It is a single value and therefore easy to use
– Larger when F indicates that the means are homogeneous and
smaller when the means appear to be heterogeneous
The False Discovery Rate (FDR) has nice features
– Good experimentwise Type I error control
– Good power (Type II error control)
– May not be as well-known as some other tests
Tukey’s HSD test
– Widely accepted and often recommended by statisticians
– May be too conservative if Type II error has more serious
consequences than Type I error