#### Transcript P-Values Slides

```P-Values
Stephen Senn
Competence Centre for Methodology and Statistics
(c)Stephen Senn 2012
Outline
• Introduction to P-values
– History
– How they are defined and calculated
– What they are
– What they are not
• P-values as evidence?
– Distribution under null
– Distribution under alternative
– Relevance or irrelevance of stopping rules
• Some (further) controversies
– Multiple endpoints
– One sided versus two-sided tests
– Reproduction probabilities
– Two-trials rules and combining P-values
What am I trying to do?
• Good question!
• It’s a big field that statisticians (and others)
have been arguing about for nearly a century
• All I am going to be able to do is touch on
some issues
Introduction to P-values
– History
– How they are defined and calculated
– What they are
– What they are not
John Arbuthnot (1667-1753)
An argument for divine providence, taken from the constant regularity observ’d in
the births of both sexes.
Male and Female christenings 1629-1710 in London showed an excess of males in
every single one of 82 years.
“This Equality of [adult] Males and Females is not the Effect of Chance but Divine
Providence”*. Arbuthnot
* Quoted by Anders Hald in his book, A History of Probability and Statistics.
Arbuthnot’s Data
[Table of christening counts not reproduced]
Sex Ratio Christenings
[Figure not reproduced]
The Essence of Arbuthnot’s
Argument
• Suppose the null hypothesis is true
– The pattern is a chance pattern
• Calculate the probability of the result
• This probability is very small
– (1/2)^82 (one-tailed)
– 2 × (1/2)^82 = (1/2)^81 (two-tailed)
• Therefore we choose to reject the null
hypothesis
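A quick check of Arbuthnot’s arithmetic, sketched in Python (the fractions are exact):

from fractions import Fraction

# probability of an excess of males in every one of 82 years
# if each year is a fair 'coin' under the null hypothesis
p_one_tailed = Fraction(1, 2) ** 82
p_two_tailed = 2 * p_one_tailed  # allows an excess of either sex in every year

print(float(p_one_tailed))  # roughly 2e-25: very small indeed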
Daniel Bernoulli (1700-1782)
• Nephew of James Bernoulli (1654-1705), a key
figure in the history of probability
• Early example of significance test (1734)
• Planetary orbits appeared to be closely
aligned
– similar planes
D Bernoulli’s Data

Planet     Inclination
Mercury    2°56′
Venus      4°10′
Earth      7°30′
Mars       5°49′
Jupiter    6°21′
Saturn     5°58′
D Bernoulli’s Test
• Most extreme angle is 7°30′
• Probability of any planet having inclination
≤ 7°30′ = 7°30′/90° = 1/12
• Probability of six planets with inclination
≤ 7°30′ = (1/12)^6 = 1/2,985,984
• This result is therefore unlikely if chance is
the explanation
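Bernoulli’s product can be checked the same way (a sketch using exact fractions):

from fractions import Fraction

p_single = Fraction(1, 12)   # inclination no more than 7°30' for one planet
p_all_six = p_single ** 6    # all six planets, treated as independent
print(p_all_six)             # 1/2985984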
Common Features
• A null hypothesis
– No difference in probability of male or female
birth
– Alignment of planets is random
• A test statistic
– Number of years excess male births
– Maximum angle of declination
• The probability under the null hypothesis
Difference Between the Two
• Exact versus tail area
– Arbuthnot: probability of most extreme case
– Bernoulli: probability of all cases as extreme or
more extreme
• Interpretation
– Arbuthnot: P-value & likelihood & likelihood ratio
– Bernoulli: P-value only
What are P-values?
• They are a measure of unusualness of the data
• The probability of a result as extreme or more
extreme than that observed given that the
null-hypothesis is true
• This requires an agreed definition of what is
more extreme
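As an illustrative sketch (not from the slides), a one-sided P-value can be computed by simulation: draw the test statistic many times under the null hypothesis and count the proportion of draws as extreme or more extreme than the observed value.

import random

random.seed(1)

def p_value_one_sided(observed, null_draws):
    # proportion of null draws at least as extreme as the observed statistic
    return sum(d >= observed for d in null_draws) / len(null_draws)

# null distribution of 'male-excess years out of 82' when each year is a fair coin
null = [sum(random.random() < 0.5 for _ in range(82)) for _ in range(20_000)]
p = p_value_one_sided(50, null)  # suppose 50 male-excess years had been observed
print(p)  # about 0.03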
[Figure: distribution of the test statistic under the null hypothesis. The tail area beyond the observed value of the statistic, i.e. the probability of results as extreme or more extreme than that observed, is the one-sided P-value.]
What P-Values are not
• They are not the probability that the null
hypothesis is true
• They are not (usually) the likelihood under the
null hypothesis
Warning!
• A common fallacy is to think that the P-value is a
statement about the probability of the hypothesis.
• It is a statement about the probability of
the data given the hypothesis.
– Strictly speaking of the data plus more extreme
data
• Probability statements are not reversible
– An example is given on the next slide
The Prosecutor’s Fallacy
• There is a one in a million probability of the DNA
matching
• Therefore there are 999,999 chances in a million
that the suspect is guilty.
– Wrong
• In a population of 300 million (as, say, in the USA)
there are 300 similar profiles.
• Therefore we could make the same statement about any of them
The Mistake
• The probability of the observed DNA profile
given innocence is one thing.
• The probability of innocence given the DNA
profile is another.
• They are not the same
NB P-value is not the probability of the
null hypothesis being true
Invalid Inversion
'You might just as well say,' added the March Hare, 'that “I like what I get" is
the same thing as "I get what I like"!'
Lewis Carroll, Alice in Wonderland
• Common fallacy
• Known to philosophers as ‘the error of the
transposed conditional’
• The probability of A given B is not the same as
the probability of B given A
A Simple Example
• Most women do not suffer from breast cancer
• It would be a mistake to conclude, however,
that most breast cancer victims are not
women
• To do so would be to transpose the
conditionals
• This is an example of invalid inversion
Some Plausible Figures
Probability breast cancer given female = 550/31,418=0.018
Probability female given breast cancer =550/553=0.995
A Little Maths

P(A|B) = P(A∩B) / P(B)
P(B|A) = P(A∩B) / P(A)

Unless P(B) = P(A), P(A|B) ≠ P(B|A)

So invalid inversion is equivalent to a confusion of the marginal probabilities. The
same joint probability is involved in the two conditional probabilities but different
marginal probabilities are involved.
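The plausible figures from the breast cancer slide can be plugged straight into this (a sketch; 550 is taken as the joint count of being female and having breast cancer):

# counts from the 'plausible figures' slide
n_female = 31_418          # number of women
n_cancer = 553             # number of breast cancer cases
n_female_and_cancer = 550  # joint count: female AND breast cancer

p_cancer_given_female = n_female_and_cancer / n_female   # about 0.018
p_female_given_cancer = n_female_and_cancer / n_cancer   # about 0.995
# same joint count, different marginals: the two conditionals differ hugely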
P-values as evidence?
– Distribution under null
– Distribution under alternative
– Relevance or irrelevance of stopping rules
Sequential Trial
Definition: A clinical trial in which the results are analysed at various intervals with the
intention of stopping the trial when a 'conclusive result' has been reached. A stopping
rule is usually defined in advance.
In frequentist statistics the result of our analysis is a significance test (or hypothesis
test).
Now suppose that we propose to carry out a clinical trial in which we intend to study
500 patients at the most. However, after 250 patients have been studied we will carry
out a significance test to see if the treatment is ‘already significant’.
If it is we will stop.
Now if we use conventional ‘fixed sample size techniques’ to examine ‘significance’ we
shall claim significance rather more often than if we had run a fixed sample size trial,
even when the null hypothesis is true.
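The inflation is easy to see by simulation (an illustrative sketch, not from the slides). Under the null hypothesis, draw independent z statistics for the two halves of the trial; the fixed-size test looks only at the combined z, while the naive sequential procedure also rejects if the interim z is already significant:

import random

random.seed(42)
sims = 20_000
fixed_hits = seq_hits = 0
for _ in range(sims):
    z_first = random.gauss(0, 1)               # z from the first 250 patients
    z_second = random.gauss(0, 1)              # z from the second, independent 250
    z_full = (z_first + z_second) / 2 ** 0.5   # z from all 500 patients
    fixed_hits += abs(z_full) > 1.96
    seq_hits += abs(z_first) > 1.96 or abs(z_full) > 1.96

fixed_rate = fixed_hits / sims   # close to the nominal 5%
seq_rate = seq_hits / sims       # noticeably above 5%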
Illustration of ‘alpha inflation’
Circumstances under which fixed and sequentially run trials would
achieve significance if there were no adjustment for multiple testing.
It is assumed that 500 patients are studied if the trial runs to
completion but that one look is taken after 250.

First 250         All 500 not significant   All 500 significant
Not significant   Neither                   Both
Significant       Sequential only           Both
Moral
• Other things being equal, the chance of
declaring a significant result is higher if we
analyse sequentially
• Therefore we should use more stringent
significance levels
• Or, equivalently, adjust our P-values upwards
• Or should we…….?
Example
A Theoretical Problem?
• Is there a bias in meta-analysing trials that have
themselves been run sequentially?
• Example on the next slide
– Single look
– No adjustment for repeated testing
– Information fraction at look varies in steps of 0.01
from 0.01 to 0.99
– Treatment effect is 1
– Standard error for full information fraction of 1 is 1
[Figure: ‘Two approaches to weighting sequential trials’. Two curves, ‘Weighted by size’ and ‘Weighted equally’, plotted against the information fraction at the first look (0.0 to 1.0); vertical axis from 0.0 to 2.0.]
Simple proof?
• We can provide a general proof
• The argument runs like this
– Trials that stop early will overestimate the treatment
effect
– Trials that don’t stop early will have an early part
corresponding to stopped trials that underestimates
the treatment effect
– Provided that these two parts are added together in
the proportion in which they arise there can be no
bias
– This involves weighting trials as a whole according to
information provided
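This argument can be checked by simulation (a sketch with assumed numbers matching the earlier example: treatment effect 1, standard error 1 at full information, a single unadjusted look at information fraction 0.5):

import random

random.seed(7)
theta = 1.0
estimates, infos = [], []
for _ in range(100_000):
    x1 = random.gauss(theta, 2 ** 0.5)     # estimate at half information (variance 2)
    if abs(x1) / 2 ** 0.5 > 1.96:          # 'significant' at the look: stop early
        estimates.append(x1)
        infos.append(0.5)
    else:                                  # continue to full information
        x2 = random.gauss(theta, 2 ** 0.5)
        estimates.append((x1 + x2) / 2)
        infos.append(1.0)

equal_mean = sum(estimates) / len(estimates)   # biased upwards
weighted_mean = sum(i * e for i, e in zip(infos, estimates)) / sum(infos)  # near theta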
Some (further) controversies
– Multiple endpoints
– One sided versus two-sided tests
– Reproduction probabilities
– Two-trials rules and combining P-values
Multiple endpoints
• If you carry out many statistical tests then (unless they are all
perfectly concordant) the probability that at least one is
significant is greater than the probability that any one is
• Hence the probability of rejecting at least one null hypothesis
is greater than the probability of rejecting any given null
hypothesis
• This suggests that the P-value for the tests as a family should
not be the minimum (naïve) P-value
• Hence if you are going to judge significance for the family by
using the lowest P-value, the individual P-values should be
adjusted upwards
Types of multiplicity
• Multiple treatments
– Many doses
– Gold standard trials
• Sequential testing
• Multiple outcomes
The first and second usually provide cases of (approximately) known
structural correlation: for example 0.5 for various treatments v placebo in the
case of multiple treatments
The third is (in principle) unstructured and the correlations are not known
Simplest common approach in the third case is the Bonferroni correction
Multiply each P-value by the number of tests
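A minimal sketch of the correction (the cap at 1 keeps the adjusted values valid probabilities):

def bonferroni(p_values):
    # multiply each P-value by the number of tests, capping at 1
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]

# e.g. three outcome-specific P-values
adjusted = bonferroni([0.008, 0.3, 0.9])
# family-wise claim at the 5% level: only the first survives (0.024 < 0.05)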
Types of Power
• Individual power = probability to reject an individual (specific)
false hypothesis
• Disjunctive* power = probability to reject at least one false
hypothesis
• Conjunctive* power = probability to reject all false hypotheses
• Average power = average proportion of rejected false
hypotheses
* These have also been referred to as minimal and maximal
power respectively but caution: in that case minimal power is
greater than maximal power
Bonferroni
Basic Considerations
• Protects you against worst possible configuration
of correlation matrix
• Is conservative
• Therefore the individual power is lower
• However, the disjunctive power may well be
higher
• Of course the conjunctive power is lower
Compound Symmetry for Standardised Variables

( 1  ρ  ρ )
( ρ  1  ρ )
( ρ  ρ  1 )

Of course, this is not very
realistic. It is just being used as
a simple example to get some
sort of impression of the effect
of correlation on power.
Where we have compound symmetry we can use a single
latent variable (effectively the first principal component)
and consider the conditional distribution of all other
variables given this latent variable.
The other variables are, in fact, conditionally independent
given this latent variable and this considerably eases the
calculation of various error rates
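A sketch of that calculation in Python. The set-up is assumed, not from the slides: each standardised statistic is sqrt(ρ)·L + sqrt(1−ρ)·Eᵢ plus a common non-centrality δ, which gives the compound-symmetric correlation ρ; δ = 2.8016 gives roughly 80% power for one outcome at the two-sided 5% level.

import math

def phi(x):
    # standard normal CDF
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def dis_con_power(k, rho, delta=2.8016, z=1.96):
    # integrate over the latent variable L ~ N(0,1) on a grid;
    # given L the k statistics are conditionally independent
    dis = con = 0.0
    n, lim = 2001, 8.0
    step = 2 * lim / (n - 1)
    for i in range(n):
        l = -lim + i * step
        w = math.exp(-0.5 * l * l) / math.sqrt(2 * math.pi) * step
        mean = delta + math.sqrt(rho) * l
        s = math.sqrt(1 - rho)
        p_rej = 1 - phi((z - mean) / s) + phi((-z - mean) / s)  # per-test rejection given L
        dis += w * (1 - (1 - p_rej) ** k)  # at least one rejection
        con += w * p_rej ** k              # all k rejected
    return dis, con

d1, c1 = dis_con_power(1, 0.5)   # both about 0.80 for a single outcome
d5, c5 = dis_con_power(5, 0.5)   # disjunctive rises, conjunctive falls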
Disjunctive power
80% Power for one outcome
[Figure: disjunctive power plotted against correlation (0 to 0.8) for 1, 2, 5 and 10 outcomes. The case shown has 80% power for a single outcome at a 5% two-sided type I error rate; the non-centrality parameter is assumed to be the same for all variables.]
Conjunctive power
80% Power for one outcome
[Figure: conjunctive power plotted against correlation (0 to 0.8) for 1, 2, 5 and 10 outcomes. The case shown has 80% power for a single outcome at a 5% two-sided type I error rate; the non-centrality parameter is assumed to be the same for all variables.]
One sided or two-sided tests?
• Should we carry out one sided or two sided
significance tests?
• The latter require us to consider departures
from the null in both directions
• This suggests that the P-value should be
doubled
• This simple issue has attracted a huge amount
of controversy
Since superiority cannot be excluded
we must use two-sided tests
Because our test situation is not to choose
between
H0: τ = 0
H1: τ > 0
but between
H0: τ = 0
H1: τ ≠ 0
we must have two-sided tests
Because we would never register an
inferior drug we can use one-sided tests
According to this point of view in reality our task is to choose between
H0: τ ≤ 0
H1: τ > 0.
A one-sided test of size α for τ = 0 will have type I error < α for any
other value of τ in H0, therefore there is no need to have a two-sided
test.
Practical Situation           Probability of Registering Drug
The drug is harmful           < 0.05
The drug is useless             0.05
The drug brings benefit       > 0.05
Label Invariance Requires us to Use
Two-Sided Tests
The advantage of a two-sided test is that it is ‘label invariant’. We can
reverse the labels of the drugs and come to the same conclusion.
If we use Two-Sided Tests we Cannot
Conclude Superiority
This is just silly. It is based on the premise that a two-sided test implies that we
have
H0: τ = 0 and H1: τ ≠ 0.
Rejection of H0 implies that only the non-directional H1 can be asserted.
However we could equally well write
H0: τ = 0 and H1a: τ > 0,
H0: τ = 0 and H1b: τ < 0.
Now if we test H0 twice, once against each of these two alternative hypotheses,
our overall type I error rate will be maintained at α provided each test is
carried out at size α/2.
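Numerically (a sketch): under H0 the two rejection regions are disjoint tails, so their probabilities simply add.

import math

def phi(x):
    # standard normal CDF
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

alpha = 0.05
z = 1.96                          # approximate two-sided 5% critical value
p_upper = 1 - phi(z)              # reject in favour of H1a, about alpha/2
p_lower = phi(-z)                 # reject in favour of H1b, about alpha/2
total_type_I = p_upper + p_lower  # about 0.05 under H0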
We Have Used Two-Sided Tests in the
Past and Must Continue to do So
Whatever standard we use is arbitrary. The important thing is to be consistent.
Whether you regard standard tests as being two-sided at the 5% level or
one-sided at the 2.5% level is irrelevant: either way we must continue to do as we
did.
The Whole Debate is Irrelevant
This is the Bayesian view.
What you need is
1) Prior probabilities
2) Consequences of decisions.
All the frequentist confusions arise because one tries to do without one or
other of these elements.
Paper by Goodman
Statistics in Medicine
• Considers the situation where you have
P=0.05 for one trial and you then run another
• What is the probability of repeating a result
that is just significant at the 5% level (p=0.05)?
– If true difference is observed difference
– If uninformative prior for true treatment effect
Goodman’s Criticism
• What is the probability of repeating a result
that is just significant at the 5% level (p=0.05)?
– If true difference is observed difference
– If uninformative prior for true treatment effect
Sauce for the Goose and Sauce for
the Gander
• This property is shared by Bayesian
statements
• Hence, either
– The property is undesirable and hence is a
criticism of Bayesian methods also
– Or it is desirable and is a point in favour of
frequentist methods
Three Possible Questions
• Q1 What is the probability that in a future experiment,
taking that experiment's results alone, the estimate for B
would after all be worse than that for A?
• Q2 What is the probability that, having conducted a further
experiment and pooled its results with the current one,
we would show that the estimate for B was, after all,
worse than that for A?
• Q3 What is the probability that having conducted a future
experiment and then calculated a Bayesian posterior using
a uniform prior and the results of this second experiment
alone, the probability that B would be worse than A
would be less than or equal to 0.05?
Three Kinds of Replication Probability
[Figure: three curves, for probability Q1, probability Q2 and probability Q3, plotting probability (0.0 to 1.0) against the ratio of standard errors.]
Why Goodman’s Criticism is
Irrelevant
“It would be absurd if our inferences about the world, having just completed a
clinical trial, were necessarily dependent on assuming the following. 1. We are
now going to repeat this experiment. 2. We are going to repeat it only once. 3.
It must be exactly the same size as the experiment we have just run. 4. The
inferential meaning of the experiment we have just run is the extent to which
it predicts this second experiment.”
The Two-Trials Rule
The FDA usually requires two trials to be significant.
If only two trials are run this implies a type I error rate of 1/20 × 1/20 =
1/400.
Similarly the consumer’s risk is 1/40 × 1/40 = 1/1600.
If however we wish to have a consumer’s risk of 1/1600 then a more
powerful approach is to require
(z1 + z2)/√2 > 3.227
for two trials of equal size.
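These numbers can be verified directly (a sketch; 2.80 is an assumed per-trial non-centrality, giving roughly 80% one-sided power per trial):

import math

def phi(x):
    # standard normal CDF
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# consumer's risk of the two-trials rule: both one-sided tests at 1/40
risk_two_trials = (1 - phi(1.96)) ** 2     # about 1/1600
# pooled rule with the same consumer's risk: (z1 + z2)/sqrt(2) > 3.227
risk_pooled = 1 - phi(3.227)               # also about 1/1600

# power comparison at the assumed per-trial non-centrality
delta = 2.80
power_two_trials = (1 - phi(1.96 - delta)) ** 2
power_pooled = 1 - phi(3.227 - math.sqrt(2) * delta)
# the pooled rule is the more powerful for the same consumer's risk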
[Figure: z score for trial 2 plotted against z score for trial 1 (each from -10 to 10), showing the two-trials boundary at 1.96 for each trial and the pooled-trials boundary.]
So What’s the Point?
• Could use more powerful pooled trials
approach.
• Reason why not might be random effects
argument.
• Trials can be struck by a ‘random gremlin’.
• Then no logic, however, in making trials
identical.
[Figure: contour plots of power for the Tippett boundary, for Fisher combined P-values, and for the average Z score.]
In Summary
• It is a very bad idea to analyse data only by P-values
• Estimates and confidence intervals should be
given as well
• Be careful in interpreting P-values
• Look at data in lots of different ways
References
1. Senn SJ. Two cheers for P-values. Journal of Epidemiology and
Biostatistics 2001; 6: 193-204.
2. Senn SJ. A comment on replication, p-values and evidence, S. N. Goodman,
Statistics in Medicine 1992; 11: 875-879. Statistics in Medicine 2002; 21: 2437-2444.
3. Senn SJ. P-Values. In Encyclopedia of Biopharmaceutical Statistics, Chow
SC (ed). Marcel Dekker, 2003.
```