Transcript Slide 1

Beyond Null Hypothesis Testing
Supplementary Statistical Techniques
Limitations of NHT
• Criticisms of NHT date from the 1930s.
– Null hypothesis is rarely true.
– The real question is not about the existence of an effect, but
about the nature of the effect:
• What is the direction of the effect?
• What is the size of the effect?
• How important is it?
• What are the underlying mechanisms (theory)?
PSYC 6130, PROF. J. ELDER
Direction of Effect
• NHT is reasonably well suited to testing the direction of an effect.
– For example, are tall men more or less likely to be wealthy?
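Group 1: $\bar{X}_1 = 69.87"$, $s_1 = 2.63"$, $n_1 = 7586$
Group 2: $\bar{X}_2 = 69.01"$, $s_2 = 2.85"$, $n_2 = 7777$
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{69.87 - 69.01}{\sqrt{\dfrac{2.63^2}{7586} + \dfrac{2.85^2}{7777}}} = 19.44$$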
(Canadian Community Health Survey, 2004)
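As a rough check on the arithmetic, the t statistic can be recomputed from the summary statistics quoted on this slide. This is only a sketch: the function name `two_sample_t` is made up for illustration, and it assumes the separate-variances standard error that the slide's formula appears to use.

```python
import math

# Summary statistics quoted on the slide (heights in inches).
mean1, sd1, n1 = 69.87, 2.63, 7586
mean2, sd2, n2 = 69.01, 2.85, 7777


def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic for a difference of means (hypothetical helper).

    Uses the separate-variances standard error sqrt(s1^2/n1 + s2^2/n2).
    """
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / standard_error


t = two_sample_t(mean1, sd1, n1, mean2, sd2, n2)
print(f"t = {t:.2f}")
```

The sign of t gives the direction of the effect; with samples this large, even a difference of under an inch produces a very large t.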
Magnitude of Effect
• NHT by itself tells us nothing about the magnitude of an effect.
• This is really a problem of descriptive statistics.
• The simplest descriptor of the magnitude of an effect is a point estimate:
$$\bar{X}_1 - \bar{X}_2 = 69.87" - 69.01" = 0.86"$$
Magnitude of Effect
• A problem with a point estimate is that it suggests a certainty we do not really have.
• A more complete and useful description of the magnitude of the effect is provided by a confidence interval.
e.g.,
$$s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{2.63^2}{7586} + \frac{2.85^2}{7777}} = 0.044"$$
→ The 95% confidence interval for $\mu_1 - \mu_2$ is
$$\left[(\bar{X}_1 - \bar{X}_2) - z_{.05}\, s_{\bar{X}_1 - \bar{X}_2},\; (\bar{X}_1 - \bar{X}_2) + z_{.05}\, s_{\bar{X}_1 - \bar{X}_2}\right]$$
$$= [0.86" - 1.96 \times 0.044",\; 0.86" + 1.96 \times 0.044"] = [0.77", 0.95"]$$
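The confidence interval above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the large-sample normal (z) critical value, as the slide does; the helper name `diff_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def diff_ci(mean1, sd1, n1, mean2, sd2, n2, level=0.95):
    """Normal-approximation confidence interval for mu1 - mu2 (hypothetical helper)."""
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    # Two-tailed critical value, e.g. about 1.96 for a 95% interval.
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    diff = mean1 - mean2
    return diff - z_crit * standard_error, diff + z_crit * standard_error


lo, hi = diff_ci(69.87, 2.63, 7586, 69.01, 2.85, 7777)
print(f"95% CI for the height difference: [{lo:.2f}, {hi:.2f}] inches")
```

With samples in the thousands the z and t critical values are practically identical; for small samples a t critical value would be the safer choice.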
Importance of Effect
• However, even a confidence interval does not really tell us whether a treatment or factor is important.
• One way to judge whether a difference of means is ‘big’ is to compare the size of the difference of the means to the values of the means themselves, e.g.,
$$\frac{\bar{X}_1 - \bar{X}_2}{\tfrac{1}{2}\left(\bar{X}_1 + \bar{X}_2\right)} = \frac{0.86"}{69.44"} = 0.012$$
i.e., wealthy men are roughly 1.2% taller.
Importance of Effect
• However it is often more meaningful to compare the treatment
effect to the overall variation in the measured variable.
• We call this normalized measure of the effect the effect size d.
$$d = \frac{\mu_1 - \mu_2}{\sigma} \approx \frac{\bar{X}_1 - \bar{X}_2}{s_p}$$
where
$$s_p^2 = \frac{df_1 s_1^2 + df_2 s_2^2}{df_1 + df_2} = \frac{7585 \times 2.63^2 + 7776 \times 2.85^2}{7585 + 7776} = 7.53$$
Thus $s_p = 2.74"$ and
$$d \approx \frac{0.86"}{2.74"} = 0.31$$
Note: $\dfrac{\bar{X}_1 - \bar{X}_2}{s_p}$ as an estimator of d is sometimes called g.
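The pooled standard deviation and the resulting effect-size estimate can be checked numerically. The sketch below uses the summary statistics from the slide; the function names `pooled_sd` and `effect_size_g` are illustrative, not from the lecture.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation, weighting each variance by its degrees of freedom."""
    df1, df2 = n1 - 1, n2 - 1
    return math.sqrt((df1 * sd1**2 + df2 * sd2**2) / (df1 + df2))


def effect_size_g(mean1, sd1, n1, mean2, sd2, n2):
    """Sample estimate g of the effect size d: difference of means over pooled SD."""
    return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)


sp = pooled_sd(2.63, 7586, 2.85, 7777)
g = effect_size_g(69.87, 2.63, 7586, 69.01, 2.85, 7777)
print(f"s_p = {sp:.2f} inches, g = {g:.2f}")
```

Because g is expressed in standard-deviation units, it can be compared across studies that measure different variables.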
Example Effect Sizes
[Figure: four panels, each showing the Group 1 and Group 2 distributions (horizontal axis from -5 to 5, vertical axis from 0 to 0.4), for effect sizes d = .5, d = 1, d = 2, and d = 4.]
Importance of Effect
• d provides a sense of how much of the variation in the dependent variable is due to the ‘treatment’.
• d is related to the point-biserial coefficient $r_{pb}$.
• $r_{pb}^2$ measures the proportion of variance in the sample due to the treatment.
• $r_{pb}^2$ is an estimator of $\omega^2$, the proportion of variance in the population due to the treatment:
$$\omega^2 = \frac{d^2}{d^2 + 4}$$
We will cover these topics later in the term.
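To get a feel for the relation between d and the proportion of variance due to the treatment, d²/(d² + 4), the sketch below evaluates it for the height example (d ≈ 0.31). It assumes two equal-sized groups, which is the case this formula describes; the function name is hypothetical.

```python
def proportion_of_variance(d):
    """Proportion of population variance due to the treatment, d^2 / (d^2 + 4).

    Assumes two groups of equal size.
    """
    return d**2 / (d**2 + 4)


omega_sq = proportion_of_variance(0.31)
print(f"proportion of variance = {omega_sq:.3f}")
```

On this reading, wealth group accounts for only about 2% of the variance in height, even though the t value for the difference is very large.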
End of Lecture 6
Oct 22, 2008
Theory
• Even when augmented with measures of effect size, NHT does not directly tell us about the mechanism by which the treatment impacts the dependent variable.
• e.g., Wealthy men are taller because…
– Tall men attract wealthy women?
– Wealthy men come from wealthy families that provided better care (e.g., nutrition)?
• To understand these relationships, it is not enough to test the significance of effects and quantify them.
• Ultimately, we require detailed, mechanistic (causal), testable theories, and experiments that test these theories.
• These theories should generate quantitative predictions that can be compared against experimental outcomes.
• The theory that provides the closest quantitative account of the data should be considered our current ‘working hypothesis’ about how the system under study operates.
• When comparing theories, we must be mindful of “Occam’s Razor”: a more complex theory can fit the data more closely simply because it is more flexible.
• This process is less dependent on NHT, and more dependent upon model fitting, analysis of variance and cross-validation techniques.
Planning Experiments: Statistical Power
Planning a Study
• There are many considerations that go into planning an
experiment or study.
• Here we focus on the statistical considerations.
• Some possible questions:
– How many samples (e.g., subjects) will I need for my study?
– I already know that I will only have access to n samples
(subjects). Will this be enough?
• Answering these questions depends on understanding
the relationship between sample size, effect size, and
statistical power.
Sample Size and Effect Size Codetermine Power
[Diagram: sample size n and effect size d each increase (+) power 1 − β.]
Statistical Power
• Power is defined as the complement of the Type II error rate β, i.e., power = 1 − β.
• Thus understanding power means understanding Type II errors.
Type I Errors and the Null Hypothesis Distribution (NHD)
• To understand Type I errors, we considered the situation where the
null hypothesis is true, and modeled the null hypothesis distribution.
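$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}}$$
[Figure: the null hypothesis distribution p(t), centred on 0, with the critical values $-t_{\alpha/2}$ and $t_{\alpha/2}$ marked.]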
Understanding Type II Errors
• To understand the factors that determine Type II errors,
we need to model the situation when the null hypothesis
is false and the alternative hypothesis is true.
• The difficulty is that the alternative hypothesis typically
encompasses a range of possible population means,
and we do not know which one is the correct mean.
• But suppose for the moment we did. This defines the
alternative hypothesis distribution (AHD), which follows a
non-central t distribution.
• We will often approximate this as a normal distribution, in
order to compute rough estimates of power.
Sampling Distributions of the Difference of the Means
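[Figure: sampling distributions of $\bar{X}_1 - \bar{X}_2$ under the null hypothesis (NHD, centred on $\mu_1 - \mu_2 \mid H_0 = 0$) and under the alternative hypothesis (AHD, centred on $\mu_1 - \mu_2 \mid H_a$); vertical axis: probability p.]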
Standardizing the Alternative Hypothesis Distribution
• Just as for the NHD, it is useful to standardize the AHD:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}}$$
$$E[t \mid H_0] = 0$$
$$E[t \mid H_1] = \delta$$
We use the symbol δ to denote the expected t value under the alternate hypothesis.
Standardized Distributions of the Difference of the Means
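[Figure: standardized NHD (centred on $E[t \mid H_0] = 0$) and AHD (centred on $\delta = E[t \mid H_a]$), plotted as probability p(t) against t. The area of the AHD beyond $t_{crit}$ is the power 1 − β; the area of the AHD below $t_{crit}$ is β.]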
Planning an Experiment: Approximations
• Estimates of effect size are always approximate, and so it is
reasonable to make approximations when planning a study.
• For example:
Homogeneity of variance: $\sigma_1 = \sigma_2$
Balanced samples: $n_1 = n_2$
Large samples (large n): $s \to \sigma$, $t \to z$
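Standardized Distributions of the Difference of the Means
$$\delta = E[t \mid H_a] = E\left[\frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}}\right] \approx \frac{\mu_1 - \mu_2}{\sigma_{\bar{X}_1 - \bar{X}_2}} = \frac{\mu_1 - \mu_2}{\sqrt{2\sigma^2 / n}}$$
(Assume homogeneity of variance, equal sample sizes)
$$\delta = \sqrt{\frac{n}{2}}\,\frac{\mu_1 - \mu_2}{\sigma} = \sqrt{\frac{n}{2}}\, d, \quad \text{where } d = \text{effect size.}$$
Recall that $d = \dfrac{\mu_1 - \mu_2}{\sigma}$.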
Standardized Distributions of the Difference of the Means
[Figure: standardized NHD (centred on $E[t \mid H_0] = 0$) and AHD (centred on $\delta = \sqrt{n/2}\, d$), plotted as probability p(t) against t, with $t_{crit}$, β, and power 1 − β marked.]
Estimating Power
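If we have an estimate of the expected t value δ, we can estimate the power 1 − β:
$$1 - \beta = \Pr(t > t_{crit} \mid E[t] = \delta) \quad \text{(non-central t distribution)}$$
$$\approx \Pr(t > t_{crit} - \delta \mid E[t] = 0) \quad \text{(central t distribution)}$$
[Figure: two panels of Pr(t) against expected t value. In the non-central t distribution, centred on δ, the area beyond $t_{crit}$ is 1 − β. In the central t distribution, centred on 0, the same area lies beyond $t_{crit} - \delta$.]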
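The shifted-distribution idea above can be sketched in code. This example goes one step further than the slide and replaces the central t distribution with a standard normal (the large-sample approximation mentioned earlier in the lecture), so the numbers are rough estimates only. The function name `approx_power` is hypothetical.

```python
from statistics import NormalDist


def approx_power(delta, alpha=0.05, two_tailed=True):
    """Approximate power 1 - beta for an expected t value delta.

    Uses a standard normal in place of the t distribution, so it is only a
    rough estimate for small samples. The far (lower) rejection tail is
    ignored; its contribution is negligible when delta is well above zero.
    """
    z = NormalDist()
    tail_area = alpha / 2 if two_tailed else alpha
    z_crit = z.inv_cdf(1 - tail_area)
    # Pr(z > z_crit - delta) under a standard normal centred on 0.
    return 1 - z.cdf(z_crit - delta)


for delta in (1.0, 2.0, 2.8, 4.0):
    print(f"delta = {delta:.1f} -> power ≈ {approx_power(delta):.2f}")
```

An expected t value of about 2.8 gives power of roughly .8 for a two-tailed test at α = .05, which is the value used in the planning examples later in the lecture.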
Calculating Power from Sample Size and Effect Size
[Diagram: sample size n and effect size d each increase (+) the expected t value $\delta = E[t] = \sqrt{n/2}\, d$; δ increases (+) power 1 − β, while $t_{crit}$ decreases (−) it.]
Planning Experiments
• Planning experiments may involve estimating any one of these
variables given knowledge or assumptions about the other two:
n, d → power: $\delta = \sqrt{\dfrac{n}{2}}\, d$
You have already decided on the size of your sample, and you have an estimate of the effect size.
What is the power of your experiment?
d, power → n: $n = 2\left(\dfrac{\delta}{d}\right)^2$
You have an estimate of the effect size, and know the minimum power you want.
What sample size do you need?
n, power → d: $d = \sqrt{\dfrac{2}{n}}\, \delta$
You have already decided on the size of your sample, and you know the power you want.
What does the size of the effect have to be to give you this power?
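The three planning formulas are rearrangements of the same relation, δ = √(n/2)·d, which a short sketch makes explicit. The function names are made up for illustration; n is the per-group sample size, and equal group sizes and equal variances are assumed, as in the lecture's derivation.

```python
import math


def delta_from(n, d):
    """Expected t value from per-group sample size n and effect size d."""
    return math.sqrt(n / 2) * d


def n_from(delta, d):
    """Per-group sample size needed to reach expected t value delta."""
    return 2 * (delta / d) ** 2


def d_from(n, delta):
    """Effect size needed to reach expected t value delta with n per group."""
    return math.sqrt(2 / n) * delta


# Round trip: each formula inverts the others.
delta = delta_from(50, 0.5)
print(f"delta = {delta:.2f}")
print(f"n     = {n_from(delta, 0.5):.1f}")
print(f"d     = {d_from(50, delta):.2f}")
```

In each case the link to power goes through δ: a target power fixes the δ you need, and a known δ tells you the power you have.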
Example: Height Difference between Men and Women
Women: mean 64.0", standard deviation 2.75"
Men: mean 69.4", standard deviation 2.78"
Example 1. n, d → power
$$\delta = \sqrt{\frac{n}{2}}\, d$$
• From this large prior study we know men are on average 5.4" taller than women.
• We wish to see if this also applies to University students, i.e., whether male students are taller on average than female students.
• What power will we obtain if we have a class of 10 males and 10 females?
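One way to work Example 1 numerically is sketched below. It rests on several assumptions not stated on this slide: a common standard deviation of about 2.76" (the value used in Example 2b), a two-tailed test at α = .05, and the normal approximation to the t distribution, which is optimistic for only 10 subjects per group.

```python
import math
from statistics import NormalDist

mean_difference = 5.4  # inches, from the prior study
sigma = 2.76           # assumed common standard deviation, inches
n = 10                 # subjects per group

d = mean_difference / sigma    # effect size
delta = math.sqrt(n / 2) * d   # expected t value

# Normal approximation: power = Pr(z > z_crit - delta).
z = NormalDist()
z_crit = z.inv_cdf(0.975)      # two-tailed, alpha = .05 (assumed)
power = 1 - z.cdf(z_crit - delta)

print(f"d = {d:.2f}, delta = {delta:.2f}, power ≈ {power:.3f}")
```

Because the sex difference in height is so large relative to the spread within each group (d close to 2), even a class of 10 men and 10 women gives power very close to 1.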
Or Use Appendix Power
Example 2a. d, power → n
$$n = 2\left(\frac{\delta}{d}\right)^2$$
• From this large prior study we know men are on average 5.4" taller than women.
• We wish to see if this also applies to University students, i.e., whether male students are taller on average than female students.
• What sample size do we need to obtain power of 0.8?
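Example 2a can be sketched the same way. The assumptions here are a two-tailed test at α = .05, a common standard deviation of about 2.76", and the normal approximation; since the answer turns out to be a very small sample, a t-based calculation would ask for somewhat more subjects.

```python
import math
from statistics import NormalDist

z = NormalDist()
# Expected t value needed: the critical value plus the z score that
# leaves 80% of the alternative distribution above the critical value.
delta = z.inv_cdf(0.975) + z.inv_cdf(0.80)

d = 5.4 / 2.76                     # effect size (sigma of 2.76" is assumed)
n_exact = 2 * (delta / d) ** 2     # per-group sample size, before rounding
n_per_group = math.ceil(n_exact)   # round up to whole subjects

print(f"delta ≈ {delta:.2f}, n ≈ {n_exact:.1f}, so use {n_per_group} per group")
```

The required δ of about 2.8 does not depend on the effect size; only the conversion from δ to n does.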
Example 2b. d, power → n
$$n = 2\left(\frac{\delta}{d}\right)^2$$
• Suppose we only care about differences greater than 1".
• Suppose also that we wish to have power of at least .8 (i.e., 80% chance of rejecting the null hypothesis, given it is false) for a 2-tailed test with α = .05.
• What is the maximum sample size worth collecting?
$$1 - \beta = 0.8 \;\rightarrow\; \beta = 0.2 \;\rightarrow\; t_{crit} - \delta = -0.85$$
$$\alpha = 0.05,\ \text{2-tailed} \;\rightarrow\; t_{crit} = 1.96 \;\rightarrow\; \delta = 2.8$$
$$d = \frac{1}{2.76} = 0.36$$
$$n = 2\left(\frac{2.8}{0.36}\right)^2 = 121$$
Example 3. n, power → d
$$d = \sqrt{\frac{2}{n}}\, \delta$$
• Suppose we are stuck with a sample size of 10 (i.e., 10 men and 10 women). Is it worth doing the study?
• Let’s decide that it is not worth doing the study unless we have power of at least .8 (i.e., 80% chance of rejecting the null hypothesis, given it is false) for a 2-tailed test with α = .05.
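A sketch of the calculation for Example 3 follows. It again assumes the normal approximation, and the conversion to inches assumes a common standard deviation of about 2.76", the value used in Example 2b.

```python
import math
from statistics import NormalDist

z = NormalDist()
n = 10  # subjects per group

# Expected t value needed for power .8, two-tailed alpha = .05.
delta = z.inv_cdf(0.975) + z.inv_cdf(0.80)

d_min = math.sqrt(2 / n) * delta   # smallest effect size detectable with power .8
min_difference = d_min * 2.76      # in inches, assuming sigma ≈ 2.76"

print(f"d must be at least {d_min:.2f} "
      f"(about {min_difference:.1f} inches for the height example)")
```

Under these assumptions the study is worth doing only if the true effect is at least about 1.25 standard deviations. The 5.4" sex difference in height clears that bar easily, but many effects studied in psychology would not.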
Manipulating Power
• In theory, power can be manipulated by changing
– Sample size
– Alpha level
– Effect size
• Increase strength of treatment
• Decrease variability
– Control of nuisance variables
– Matched designs
One-Sample Tests
$$\delta = \frac{\mu_1 - \mu_2}{\sigma_{\bar{X}}} = \frac{\mu_1 - \mu_2}{\sigma / \sqrt{n}} = \sqrt{n}\, d$$
Thus
$$n = \left(\frac{\delta}{d}\right)^2$$
Note the greater power of one-sample tests, relative to two-sample tests!
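The gain from a one-sample design can be made concrete by comparing the sample sizes the two formulas give for the same δ and d. The values δ = 2.8 and d = 0.36 are borrowed from Example 2b purely for illustration, and the function names are hypothetical.

```python
def n_one_sample(delta, d):
    """Sample size for a one-sample test: n = (delta / d)^2."""
    return (delta / d) ** 2


def n_two_sample(delta, d):
    """Per-group sample size for a two-sample test: n = 2 (delta / d)^2."""
    return 2 * (delta / d) ** 2


delta, d = 2.8, 0.36  # illustrative values taken from Example 2b
one = n_one_sample(delta, d)
two = n_two_sample(delta, d)
print(f"one-sample: n ≈ {one:.0f}")
print(f"two-sample: n ≈ {two:.0f} per group, {2 * two:.0f} in total")
```

The two-sample design needs twice as many subjects per group, and therefore four times as many subjects overall, to reach the same expected t value.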
Unequal Sample Sizes
• When samples are of different sizes, apply the same formulas for estimating power, using the average sample size.
• The most accurate method is to use the harmonic mean:
$$\text{harmonic mean of } n_1 \text{ and } n_2 = \frac{2 n_1 n_2}{n_1 + n_2}$$
Example: $n_1 = 10$, $n_2 = 10$
Example: $n_1 = 10$, $n_2 = 2$
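A short sketch shows how the harmonic mean behaves and how it plugs into the formula for δ. The sample sizes 10 and 30 and the effect size 0.5 are hypothetical values chosen for illustration, as are the function names.

```python
import math


def harmonic_mean_n(n1, n2):
    """Harmonic mean of two sample sizes: 2 * n1 * n2 / (n1 + n2)."""
    return 2 * n1 * n2 / (n1 + n2)


def delta_unequal(n1, n2, d):
    """Expected t value using the harmonic mean as the per-group sample size."""
    return math.sqrt(harmonic_mean_n(n1, n2) / 2) * d


print(harmonic_mean_n(10, 10))  # equal groups: harmonic mean equals n
print(harmonic_mean_n(10, 30))  # unequal groups: pulled toward the smaller n
print(f"delta = {delta_unequal(10, 30, 0.5):.2f}")
```

For groups of 10 and 30 the harmonic mean is 15, well below the arithmetic mean of 20: the smaller group limits the power, so adding subjects to the larger group alone gives diminishing returns.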
Effect Size for Paired Sample Designs
• Two methods for computing effect size for paired designs:
1) $g = \dfrac{\bar{D}}{s_p}$ (e.g., Dunlap et al., 1996)
2) $g = \dfrac{\bar{D}}{s_D}$ (e.g., Rosenthal, 1991)
• Either method is fine, as long as you know what it means!
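The difference between the two definitions is easiest to see on data. The sketch below uses a small set of made-up paired scores (not from the lecture) and computes the mean difference divided by the pooled standard deviation of the two conditions (method 1) and by the standard deviation of the difference scores (method 2).

```python
import math
from statistics import mean, stdev

# Hypothetical paired scores, invented purely for illustration.
before = [10.0, 12.0, 9.0, 14.0, 11.0]
after = [12.0, 13.0, 11.0, 15.0, 14.0]

diffs = [a - b for a, b in zip(after, before)]
d_bar = mean(diffs)  # mean difference score

# Method 1: pooled SD of the two conditions (equal n, so a simple average of variances).
s_p = math.sqrt((stdev(before) ** 2 + stdev(after) ** 2) / 2)
g_pooled = d_bar / s_p

# Method 2: SD of the difference scores themselves.
s_d = stdev(diffs)
g_diff = d_bar / s_d

print(f"method 1 (pooled SD):      g = {g_pooled:.2f}")
print(f"method 2 (SD of the diffs): g = {g_diff:.2f}")
```

Here the two methods give very different values for the same data, which is why it matters to report which definition was used.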