P-value A - Ferran Torres


Statistics for non-statisticians
Marco Pavesi
Lead Statistician
Liver Unit – Hospital Clínic i Provincial
Ferran Torres
Statistics and Methodology Support Unit. Hospital Clínic Barcelona
Biostatistics Unit. School of Medicine. Universitat Autònoma Barcelona (UAB)
Outline
• Why Statistics?
• Descriptive Statistics
  - Populations and Samples
  - Types of errors
• Inferential Statistics. Hypothesis testing
  - Statistical errors
  - p-value
  - Confidence Intervals
• Multiplicity issues. Types of tests. Sample size
• Multivariate analysis. More on p-values
• Conclusion: “little shop of horrors”
Intro. Why should we learn statistics?
Induction and Truth
Bertrand Russell presents…
The inductivist turkey
Troubles for the plain researcher:
• Induction and statistics ARE NOT a method for obtaining a sort of mathematical demonstration of Truth
• The results observed in a population sample are not necessarily true for the whole population
Smart turkeys / researchers…
1) …are aware that the relevance (weight) of statistical inferences always depends on the sample size
2) …know that we can only model/estimate the real world with a specific approximation error
3) …understand that true hypotheses do not exist, and that we can only reject or keep a hypothesis based on the available evidence
What is statistics?
• “I know (I’m making the assumption) that these dice are fair: what is the probability of always getting a 1 in 15 rolls?” ==> Probability (mathematics)
• “I have always got a 1 in 15 rolls. Are these dice fair?” ==> Inferential STATISTICS
So, why statistics? To account for chance & variability!
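For the first question, the probability calculation is direct (assuming fair dice and independent rolls):

P(fifteen 1s) = (1/6)^15 ≈ 2.1 × 10^-12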
Why is Statistics needed?
• Statistics tells us whether events are likely to have happened simply by chance
• Statistics is needed because we always work with sample observations (variability) and never with whole populations
• Statistics is the only means to predict what is likely to happen in new situations, and it helps us make decisions
Introduction to descriptive statistics
Population and Samples
[Diagram: nested sets, Sample ⊂ Study Population ⊂ Target Population]
Random vs Systematic error
Example: Systolic Blood Pressure (mm Hg)
[Figure: repeated SBP measurements (130–170 mm Hg) from five observers; random error scatters the values around the true value, while systematic error (bias) shifts them away from it]
What Statistics?
• Descriptive Statistics
  - Position statistics (central tendency measures): mean, median
  - Dispersion statistics: variance, standard deviation, standard error
  - Shape statistics: symmetry, skewness and kurtosis measures
The mean and the median
• Arithmetic mean (average):
  $\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$
• Median: 50% of sample individuals have a value higher than or equal to the median
  1, 3, 3, 4, 6, 13, 14, 14, 18 → Median = 6
  1, 3, 3, 4, 6, 13, 14, 14, 17, 18 → middle values 6 and 13, Median = (6+13)/2 = 9.5
• Unlike the median, the mean is affected by outliers
• Especially relevant for specific distributions (survival times)
[Figure: adding a new outlier shifts the mean (Mean 1 → Mean 2) while the median barely moves (Median 1 ≈ Median 2)]
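A quick sketch in Python, using the slide’s example values plus one hypothetical outlier, makes the contrast concrete:

```python
# Minimal sketch: the mean chases an outlier, the median barely moves.
import statistics

values = [1, 3, 3, 4, 6, 13, 14, 14, 18]
print(statistics.mean(values), statistics.median(values))    # 8.44..., 6

values_with_outlier = values + [500]                         # add an extreme value
print(statistics.mean(values_with_outlier),                  # mean jumps to 57.6
      statistics.median(values_with_outlier))                # median only moves to 9.5
```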
Dispersion measures
• The Variance is the mean of squared differences from the distribution mean:
  $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{X})^2$
• The Standard Deviation is the square root of the Variance:
  $SD = \sqrt{s^2}$
• The Standard Error is the square root of the ratio between the Variance and the sample size:
  $SE = \sqrt{s^2 / n} = SD / \sqrt{n}$
• It can be viewed as the true SD of the population mean estimate (i.e., of the sample mean as an estimator of the population parameter)
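As a rough sketch, the three measures can be computed from sample data (hypothetical values) like this:

```python
# Sample variance (n-1 denominator), standard deviation, and standard error.
import math
import statistics

x = [1, 3, 3, 4, 6, 13, 14, 14, 18]
n = len(x)

variance = statistics.variance(x)       # sample variance
sd = math.sqrt(variance)                # standard deviation
se = sd / math.sqrt(n)                  # standard error of the mean

print(f"variance={variance:.2f}, SD={sd:.2f}, SE={se:.2f}")
```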
Inference & tests
• Inferential Statistics
  - Draw conclusions (inferences) from incomplete (sample) data
  - Allow us to make predictions about the target population based on the results observed in the sample
  - Are computed in hypothesis testing
• Examples
  - 95% CI, t-test, chi-square test, ANOVA, regression
Basic pattern of statistical tests
$\text{Test statistic} = \frac{\text{Observed} - \text{Expected}}{\text{Variability}}$
• Based on the total number of observations and the size of the test statistic, one can determine the P value.
How many noise units?
$\text{Test statistic} = \frac{\text{Signal}}{\text{Noise}}$
• The test statistic & sample size (degrees of freedom) convert to a probability or P value.
Overall hypothesis testing flow chart
Test statistic value → corresponding P-value (from a known distribution) → comparison with the significance level α (previously defined):
• P < α → reject the null hypothesis
• P ≥ α → keep the null hypothesis
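As an illustrative sketch of this flow (hypothetical data; a two-sample t-test stands in for whichever test fits the design):

```python
# Test statistic -> p-value -> compare with the pre-defined significance level.
from scipy import stats

group_a = [5.1, 6.2, 5.8, 7.0, 6.5, 5.9]
group_b = [4.2, 4.8, 5.0, 4.5, 5.3, 4.9]
alpha = 0.05                      # significance level, defined in advance

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Keep the null hypothesis")
```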
Introduction to inferential statistics
The role of statistics
“Thus statistical methods are no substitute for common sense and objectivity. They should never aim to confuse the reader, but instead should be a major contributor to the clarity of a scientific argument.”
The role of statistics. Pocock SJ. Br J Psychiat 1980;137:188-190
Extrapolation
Sample → Study Results → Inferential analysis (Statistical Tests, Confidence Intervals) → Population → “Conclusions”
Statistical Inference
• Statistical Tests => p-value
• Confidence Intervals
Valid samples?
[Diagram: random samples that resemble the population are likely to occur; an unrepresentative (invalid) sample, leading to invalid conclusions, is unlikely to occur]
P-value
• The p-value is a “tool” to answer the question: could the observed results have occurred by chance*?
  p < .05 ⇒ “statistically significant”
• Remember:
  - Decision given the observed results in a SAMPLE
  - Extrapolating results to the POPULATION
*: accounts exclusively for random error, not bias
An intuitive definition
• The p-value is the probability of having observed our data when the null hypothesis is true
• Steps:
  1) Calculate the treatment difference in the sample (A−B)
  2) Assume that both treatments are equal (A=B), and then…
  3) …calculate the probability of obtaining a difference at least as large as the one observed, given assumption 2
  4) Conclude according to that probability:
     a. p<0.05: the differences are unlikely to be explained by chance; we assume that the treatment explains the differences
     b. p>0.05: the differences could be explained by chance; we assume that chance explains the differences
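One way to see steps 1–3 in action is a simple permutation test; this is a sketch with hypothetical data, not the only way a p-value is obtained:

```python
# Under the assumption A = B, group labels are exchangeable, so we shuffle them
# and count how often the shuffled difference is at least as large as observed.
import random

a = [7.1, 6.8, 7.4, 6.9, 7.6, 7.2]
b = [6.4, 6.7, 6.2, 6.9, 6.5, 6.6]

observed = sum(a) / len(a) - sum(b) / len(b)       # step 1: difference A - B

pooled = a + b
n_a = len(a)
extreme = 0
n_perm = 10_000
random.seed(1)

for _ in range(n_perm):                            # step 2: pretend A = B
    random.shuffle(pooled)
    diff = (sum(pooled[:n_a]) / n_a
            - sum(pooled[n_a:]) / (len(pooled) - n_a))
    if abs(diff) >= abs(observed):                 # step 3: at least as extreme
        extreme += 1

print(f"observed difference = {observed:.2f}, p = {extreme / n_perm:.4f}")
```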
HYPOTHESIS TESTING
• Testing two hypotheses
  - H0: A=B (Null hypothesis – no difference)
  - H1: A≠B (Alternative hypothesis)
• Calculate a test statistic based on the assumption that H0 is true (i.e. there is no real difference)
• The test will give us a p-value: how likely are the collected data if H0 is true
• If this is unlikely (small p-value), we reject H0
RCT from a statistical point of view
[Diagram: a sample from the population is split by randomisation into Treatment A and Treatment B (control); if the treatment has no effect the arms remain samples from 1 homogeneous population, otherwise they behave as samples from 2 distinct populations]
Statistical significance/Confidence
• A>B, p<0.05 means: “I can conclude that the higher values observed with treatment A vs treatment B are linked to the treatment rather than to chance, with a risk of error of less than 5%”
Factors influencing statistical significance
• Signal → the difference
• Noise (background) → the variance (SD)
• Quantity → the quantity of data
P-value
• A “very low” p-value does NOT imply:
  - Clinical relevance (NO!!!)
  - Magnitude of the treatment effect (NO!!)
• With ↗ n or ↘ variability ⇒ ↘ p
• Please never compare p-values!! (NO!!!)
P-value
• A “statistically significant” result (p<.05) tells us NOTHING about clinical or scientific importance; only that the results were unlikely to be due to chance.
• A p-value does NOT account for bias, only for random error
THE BASIC IDEA
• Statistics can never PROVE anything beyond any doubt, just beyond reasonable doubt!!
• …because we work with samples and random error
Type I & II Error & Power

Conclusion (sample) | Reality (population): A=B | Reality (population): A≠B
“A=B” (p>0.05)      | OK                        | Type II error (β)
“A≠B” (p<0.05)      | Type I error (α)          | OK
Type I & II Error & Power
• Type I Error (α)
  - False positive
  - Rejecting the null hypothesis when in fact it is true
  - Standard: α = 0.05
  - In words: the chance of finding statistical significance when in fact there truly was no effect
• Type II Error (β)
  - False negative
  - Accepting the null hypothesis when in fact the alternative is true
  - Standard: β = 0.20 or 0.10
  - In words: the chance of not finding statistical significance when in fact there was an effect
Type I & II Error & Power
• Power
  - 1 − Type II Error (β)
  - Usually given as a percentage: 80% or 90% (for β = 0.2 or 0.1, respectively)
  - In words: the chance of finding statistical significance when in fact there is an effect

Conclusion (sample) | Reality (population): A=B | Reality (population): A≠B
“A=B” (p>0.05)      | OK                        | Type II error (β)
“A≠B” (p<0.05)      | Type I error (α)          | OK (POWER)
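Both error rates can be checked by simulation; here is a sketch with hypothetical normal data and an assumed effect size of 0.8 SD:

```python
# Drawing many trials under H0 (A = B) estimates the type I error rate;
# drawing them under a real difference estimates power (1 - beta).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_sim, alpha = 30, 5_000, 0.05

def rejection_rate(true_diff):
    rejections = 0
    for _ in range(n_sim):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(true_diff, 1.0, n)
        if stats.ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    return rejections / n_sim

print("Type I error (A=B):", rejection_rate(0.0))    # close to 0.05
print("Power (A≠B, d=0.8):", rejection_rate(0.8))    # roughly 0.85 for n=30/arm
```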
95%CI
• Better than p-values… use the data collected in the trial to give an estimate of the treatment effect size, together with a measure of how certain we are of our estimate
• A CI is a range of values within which the “true” treatment effect is believed to be found, with a given level of confidence
  - A 95% CI is a range of values within which the ‘true’ treatment effect will lie 95% of the time
• Generally, the 95% CI is calculated as: Sample Estimate ± 1.96 × Standard Error
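A minimal sketch of this formula for a difference in means (hypothetical data; the SE shown is the usual one for two independent samples):

```python
# 95% CI as estimate ± 1.96 × SE.
import math
import statistics

a = [7.1, 6.8, 7.4, 6.9, 7.6, 7.2]
b = [6.4, 6.7, 6.2, 6.9, 6.5, 6.6]

diff = statistics.mean(a) - statistics.mean(b)                  # point estimate
se = math.sqrt(statistics.variance(a) / len(a)
               + statistics.variance(b) / len(b))               # SE of the difference

lower, upper = diff - 1.96 * se, diff + 1.96 * se
print(f"difference = {diff:.2f}, 95% CI: ({lower:.2f}, {upper:.2f})")
```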
Interval Estimation
A probability that the population parameter falls somewhere within the interval:
[Diagram: the confidence interval spans from the lower confidence limit through the sample statistic (point estimate) to the upper confidence limit]
Superiority study
[Diagram: 95% CIs plotted against the treatment difference d; “Control better” lies toward d < 0 (negative effect), d = 0 means no difference, and “Test better” lies toward d > 0 (positive effect)]
Multiplicity
• To say it colloquially: torture the data until they speak...
Lancet 2005; 365: 1591–95
Torturing data…
• Investigators examine additional endpoints, manipulate group comparisons, do many subgroup analyses, and undertake repeated interim analyses.
• Investigators should report all analytical comparisons implemented. Unfortunately, they sometimes hide the complete analysis, handicapping the reader’s understanding of the results.
Lancet 2005; 365: 1591–95
Design → Conduct → Results
Multiplicity
K independent hypotheses: H01, H02, ..., H0K
S = number of significant results (p < α)

$\Pr(S \ge 1 \mid H_{01} \cap H_{02} \cap \dots \cap H_{0K}) = 1 - \Pr(S = 0 \mid H_0) = 1 - (1 - \alpha)^K$

K | Pr(S ≥ 1 | H0)    K  | Pr(S ≥ 1 | H0)
1 | 0.0500            10 | 0.4013
2 | 0.0975            15 | 0.5367
3 | 0.1426            20 | 0.6415
4 | 0.1855            25 | 0.7226
5 | 0.2262            30 | 0.7854
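The probabilities in the table follow directly from the formula; a quick check:

```python
# Family-wise error rate 1 - (1 - alpha)**K for the K values in the table.
alpha = 0.05
for k in (1, 2, 3, 4, 5, 10, 15, 20, 25, 30):
    print(k, round(1 - (1 - alpha) ** k, 4))
```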
Sources of multiplicity in RCT
• Multiple assessment criteria (variables)
• Multiple times of assessment (repeated measurements)
• Multiple inspections (interim analyses)
• Multiple comparisons (more than two treatments)
• Multiple subsets and subgroups
Some examples

       | Variables | Times | Subgroups | Comparisons | Total tests | False-positive rate
case A | 2         | 2     | 2         | 1           | 8           | 33.66%
case B | 5         | 4     | 3         | 1           | 60          | 96.61%
case C | 5         | 4     | 3         | 3           | 180         | 99.99%
Multiplicity
• Bonferroni correction (simplified version)
  - K tests with an overall significance level of α
  - Each test can be tested at the α/K level
• Example:
  - 5 independent tests
  - Global level of significance = 5%
  - Each test should be tested at the 1% level (5% / 5 => 1%)
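A small sketch of the correction applied to a set of hypothetical p-values (statsmodels offers this directly):

```python
# Bonferroni: each of the 5 tests is judged at 0.05 / 5 = 0.01.
from statsmodels.stats.multitest import multipletests

p_values = [0.030, 0.008, 0.200, 0.012, 0.049]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

for p, p_adj, r in zip(p_values, p_adjusted, reject):
    print(f"p={p:.3f}  adjusted={p_adj:.3f}  reject H0: {r}")
```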
Interim Analyses in the CDP
[Figure: Z value (−2 to +2) for the mortality comparison plotted over 100 months of follow-up (Month 0 = March 1966, Month 100 = July 1974)]
Coronary Drug Project Mortality Surveillance. Circulation. 1973;47:I-1
http://clinicaltrials.gov/ct/show/NCT00000483;jsessionid=C4EA2EA9C3351138F8CAB6AFB723820A?order=23
Lancet 2005; 365: 1657–61
Sample Size
• The planned number of participants is calculated on the basis of:
  - Expected effect of treatment(s): ↗ effect ↘ number
  - Variability of the chosen endpoint: ↗ variability ↗ number
  - Accepted risks in conclusion: ↗ risk ↘ number
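A sketch of these trade-offs using statsmodels’ power calculator for a two-sample t-test (the effect sizes, alpha levels, and 80% power are illustrative assumptions):

```python
# Larger effects shrink the required n; more variability (a smaller standardized
# effect) grows it; stricter risks (smaller alpha) grow it.
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()

for effect_size, alpha in [(0.5, 0.05), (0.8, 0.05), (0.3, 0.05), (0.5, 0.01)]:
    n = power_calc.solve_power(effect_size=effect_size, alpha=alpha, power=0.80)
    print(f"effect={effect_size}, alpha={alpha}: n per arm = {n:.0f}")
```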
[Figure: three histograms of height (“ALTURA”), each with N = 2000 and mean ≈ 165, but with increasing standard deviations (25.54, 26.94, 32.27): same average, different variability. Original axis labels: Frecuencia = frequency, Media = mean, Desv. típ. = standard deviation]
Conclusion (sample) | Reality (population): A=B | Reality (population): A≠B
“A=B” (p>0.05)      | OK                        | Type II error (β)
“A≠B” (p<0.05)      | Type I error (α)          | OK (POWER)
Which statistical test
Normal vs. Skewed Distributions
• Parametric statistical tests can be used to assess variables that have a “normal”, symmetrical bell-shaped distribution (histogram).
• Nonparametric statistical tests can be used to assess variables that are skewed or non-normal.
• To decide, use formal normality (“inferential”) tests, or simply look at a histogram.
Examples of Normal and Skewed
[Figure: two histograms. “44-DAYS IN ICU” (N = 925, mean = 0.9, SD = 3.99) is heavily right-skewed; “35-SYSTOLIC BLOOD PRESSURE FIRST ER” (N = 933, mean = 146.9, SD = 27.74) is roughly bell-shaped]
Parametric vs. Nonparametric

Parametric                                    | Nonparametric
Student’s t-test                              | Mann-Whitney U test
One-way ANOVA                                 | Kruskal-Wallis test
Paired t-test                                 | Wilcoxon signed-rank
Pearson correlation                           | Spearman’s r
Correlated F ratio (repeated-measures ANOVA)  | Friedman ANOVA
The type of inferential test depends on the data
• Repeated measures?
  - Unmatched groups (different subsets of the population in each condition): independent (unpaired) data
  - Matched groups (the same individuals in each condition): dependent (paired) data
• Type of data
  - Continuous Gaussian, metric → mean, SD, …
  - Continuous non-Gaussian, ordinal ranks (1,2,3,4,5,6,7,8,9,10) → median, interquartile range
  - Nominal, categories (49% “yes”, 33% “no”, 18% “no opinion”) → frequencies and percentages
Qualitative dependent variable

Independent variable: qualitative (nominal)
No. of groups | Independent data           | Dependent data
2             | Fisher’s test (chi-square) | McNemar’s test
>2            | Chi-square test            | Cochran’s Q
Quantitative dependent variable, independent (unpaired) data

Independent variable: qualitative (nominal)
No. of groups              | Parametric: measurement (from Gaussian population) | Non-parametric: rank, score, or measurement (from non-Gaussian population)
2                          | t-test                                             | Mann-Whitney test
>2                         | One-way ANOVA                                      | Kruskal-Wallis test
2 (quantitative variables) | Pearson correlation                                | Spearman correlation
Quantitative dependent variable, dependent (paired) data

Independent variable: qualitative (nominal)
No. of groups | Parametric: measurement (from Gaussian population) | Non-parametric: rank, score, or measurement (from non-Gaussian population)
2             | t-test (paired)                                    | Wilcoxon test
>2            | One-way ANOVA (paired)                             | Friedman test
Type of Data

Goal | Measurement (from Gaussian population) | Rank, score, or measurement (from non-Gaussian population) | Binomial (two possible outcomes) | Survival time
Describe one group | Mean, SD | Median, interquartile range | Proportion | Kaplan-Meier survival curve
Compare one group to a hypothetical value | One-sample t test | Wilcoxon test | Chi-square or binomial test** |
Compare two unpaired groups | Unpaired t test | Mann-Whitney test | Fisher’s test (chi-square for large samples) | Log-rank test or Mantel-Haenszel*
Compare two paired groups | Paired t test | Wilcoxon test | McNemar’s test | Conditional proportional hazards regression*
Compare three or more unmatched groups | One-way ANOVA | Kruskal-Wallis test | Chi-square test | Cox proportional hazard regression**
Compare three or more matched groups | Repeated-measures ANOVA | Friedman test | Cochran’s Q** | Conditional proportional hazards regression**
Quantify association between two variables | Pearson correlation | Spearman correlation | Contingency coefficients** |
Predict value from another measured variable | Simple linear regression or nonlinear regression | Nonparametric regression** | Simple logistic regression* | Cox proportional hazard regression*
Predict value from several measured or binomial variables | Multiple linear regression* or multiple nonlinear regression** | | Multiple logistic regression* | Cox proportional hazard regression*
• http://statpages.org/
• http://www.microsiris.com/Statistical%20Decision%20Tree/
• http://www.socialresearchmethods.net/selstat/ssstart.htm
• http://www.wadsworth.com/psychology_d/templates/student_resources/workshops/stat_workshp/chose_stat/chose_stat_01.html
• http://www.graphpad.com/www/Book/Choose.htm
A Good Rule to Follow
• Always check your results with a nonparametric test (sensitivity analysis): if you test your null hypothesis with a Student’s t-test, also check it with a Mann-Whitney U test (see the sketch below).
• It will only take an extra 25 seconds.
• Use common sense and prior knowledge!!
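A minimal sketch of the rule with hypothetical data:

```python
# Run the parametric test and its nonparametric counterpart, then check
# that both lead to the same conclusion.
from scipy import stats

a = [5.1, 6.2, 5.8, 7.0, 6.5, 5.9]
b = [4.2, 4.8, 5.0, 4.5, 5.3, 4.9]

t_res = stats.ttest_ind(a, b)                               # Student's t-test
u_res = stats.mannwhitneyu(a, b, alternative="two-sided")   # Mann-Whitney U test

print(f"t-test:       p = {t_res.pvalue:.4f}")
print(f"Mann-Whitney: p = {u_res.pvalue:.4f}")
```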
Multivariate statistics: why and when?
Marco Pavesi
Lead Statistician
Liver Unit – Hospital Clínic i Provincial
Barcelona
2 or 3 more things on p-values
• P-values only depend on the magnitude of the test statistic computed from the observed (sample) data.
• They are related to the evidence against the null hypothesis and tell us how comfortable we should feel when we reject it.
• They are not related in any way to the clinical relevance of the “signal” (or effect, or difference, or whatever result) observed!!
Clinical study design chart
Any intervention applied & studied?
• YES → EXPERIMENTAL STUDY (ex. randomized clinical trial)
• NO → Repeated measurements taken?
  - YES → PROSPECTIVE STUDY (ex. cohort study designs)
  - NO → CROSS-SECTIONAL STUDY (ex. case-control study designs)
Randomization
1. Eliminates assignment bias
2. Tends to produce comparable groups for known and unknown, recorded and unrecorded factors

Design                                     | Sources of imbalance
Randomized                                 | Chance
Concurrent (prospective, non-randomized)   | Chance & selection bias
Historical (retrospective, non-randomized) | Chance, selection bias & time bias

3. Adds validity (extrapolability) to the results of statistical tests
Reference: Byar et al (1976) NEJM
Confounding
• No randomization → lack of homogeneity between groups in the distribution of risk (protection) factors
• A potential confounder is:
  - Associated with the outcome
  - Associated with the main factor studied
  - Not involved in the causal pathway between factor and outcome as a midway step
Example: EXPOSURE (coffee intake) → OUTCOME (stroke), with a CONFOUNDING FACTOR (smoking) associated with both.
Interactions
• Effect modification: different risk (effect) estimates are associated with different strata of a specific factor.
[Figure: the rate of an outcome associated with factor A (ex. death, male vs female strata) differs across strata of factor B (ex. age < 65 vs age ≥ 65); illustrative rates 7%, 10%, and 20%]
Multivariate analysis and statistical models
• A model is “a simplified representation (usually mathematical) used to explain the workings of a real world system or event” (Wikipedia)
• Two types of statistical models are used in clinical research/epidemiology:
  - Predictive models
  - Explanatory models
• Both are fitted by means of multivariate analysis techniques
Predictive models
• Used when we are interested in predicting the probability of a specific outcome, or the value of a specific dependent variable
• Focused on selecting the best subset of predictors and on the highest precision of estimates
• The selection of predictors is based on their contribution to the predictive ability of the model (i.e., on p-values)
• Ex.: Framingham equations to predict the probability of developing coronary events at 10 years (http://www.framinghamheartstudy.org/risk/index.html)
Framingham predictive equation for CHD
Estimated coefficients underlying CHD prediction sheets using total cholesterol categories

Variable                                       | Men      | Women
Age, y                                         | 0.04826  | 0.33766
Age squared, y                                 |          | -0.00268
TC, mg/dL
  <160                                         | -0.65945 | -0.26138
  160-199                                      | Referent | Referent
  200-239                                      | 0.17692  | 0.20771
  240-279                                      | 0.50539  | 0.24385
  >=280                                        | 0.65713  | 0.53513
HDL-C, mg/dL
  <35                                          | 0.49744  | 0.84312
  35-44                                        | 0.2431   | 0.37796
  45-49                                        | Referent | 0.19785
  50-59                                        | -0.05107 | Referent
  >=60                                         | -0.4866  | -0.42951
Blood pressure
  Optimal                                      | -0.00226 | -0.53363
  Normal                                       | Referent | Referent
  High-normal                                  | 0.2832   | -0.06773
  Stage I hypertension                         | 0.52168  | 0.26288
  Stage II-IV hypertension                     | 0.61859  | 0.46573
Diabetes                                       | 0.42839  | 0.59626
Smoker                                         | 0.52337  | 0.29246
Baseline survival function at 10 years, S0(10) | 0.90015  | 0.96246
Linear predictor at risk factor means          | 3.09750  | 9.92545
Explanatory models
• Study objective: to assess (estimate) the effect of a specific factor on the study outcome
• Multivariate analysis aimed at getting the best (most valid) estimate of the studied effect
• Confounders must be accounted for in the model
• Evaluation of confounding variables is based on the change in model estimates, NOT ON STATISTICAL SIGNIFICANCE
• Rule of thumb: add each potential confounder into the model one by one and keep only those that change the estimate of the main factor by more than 10% (see the sketch below)
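A hedged sketch of this rule of thumb (simulated, hypothetical data; the names coffee and smoking echo the earlier confounding example and are purely illustrative):

```python
# Change-in-estimate rule: compare the exposure coefficient with and without
# the candidate confounder; keep the confounder if the change exceeds 10%.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 2_000
smoking = rng.binomial(1, 0.3, n)                  # confounder
coffee = rng.binomial(1, 0.2 + 0.4 * smoking)      # exposure, linked to smoking
lin_pred = -3 + 1.2 * smoking + 0.0 * coffee       # outcome driven by smoking only
stroke = rng.binomial(1, 1 / (1 + np.exp(-lin_pred)))
df = pd.DataFrame({"stroke": stroke, "coffee": coffee, "smoking": smoking})

crude = smf.logit("stroke ~ coffee", data=df).fit(disp=False)
adjusted = smf.logit("stroke ~ coffee + smoking", data=df).fit(disp=False)

b_crude = crude.params["coffee"]
b_adj = adjusted.params["coffee"]
change = abs(b_crude - b_adj) / abs(b_crude)
print(f"crude={b_crude:.3f}, adjusted={b_adj:.3f}, change={change:.0%}")
# change > 10% -> smoking is kept in the model as a confounder
```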
Adjusting for confounders: an example
Outcome variables and statistical models: a summary table
• Continuous (normally distributed) outcome: ANOVA, ANCOVA, or linear regression model
• Binary (YES/NO): logistic regression
• Categorical (with a reference group): multinomial logistic regression
• Time-to-event (different follow-up times & censored cases): survival models (ex. Cox PH)
• Number of counts: Poisson or negative binomial regression (a minimal fitting sketch follows)
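A minimal sketch for the count-outcome row (simulated, hypothetical data; the coefficients are assumptions of the simulation, not results from any study):

```python
# Poisson regression for a count outcome, fitted as a GLM with statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 500
treated = rng.binomial(1, 0.5, n)                  # covariate: treatment indicator
counts = rng.poisson(np.exp(0.5 + 0.3 * treated))  # count outcome

X = sm.add_constant(treated)
model = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(model.params)                                # slope recovers roughly 0.3
```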
Some “take home” hints
Marco Pavesi
Lead Statistician
Liver Unit – Hospital Clínic i Provincial
Barcelona
The p-value…
…is the probability of a result like the one observed in our sample when the null hypothesis is true in the population (i.e., a result due simply to chance)
…is related to the evidence against the null hypothesis and to the reliability of the observed result
…DOES NOT TELL US ANYTHING ABOUT THE CLINICAL RELEVANCE OF THE RESULT WE HAVE OBSERVED!!
Interpretation of a p-value
• The higher the p-value, the higher the probability that the observed result is due simply to chance:
  - p = 0.75 → a 75% probability (3 studies out of 4) of rejecting a true H0
  - p = 0.015 → a 1.5% probability (15 studies out of 1,000) of rejecting a true H0
• A “small” p-value threshold (significance level) is established conventionally as the highest rate of false-positive results that we consider acceptable (for instance, the common 5% rate)
Evidence and p-value: an example (1)
Drug A. Efficacy rate: 22%
Drug B. Efficacy rate: 11%
…observed results:
Drug A. Efficacy rate: 2 / 9
Drug B. Efficacy rate: 1 / 9
P-value = 0.98
Evidence and p-value: an example (2)
Drug A. Efficacy rate: 22%
Drug B. Efficacy rate: 11%
…observed results:
Drug A. Efficacy rate: 35 / 154
Drug B. Efficacy rate: 18 / 158
P-value = 0.008
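A sketch of the contrast between examples (1) and (2) with a chi-square test; the exact p-values depend on which test the slides used, but the pattern is the same: same rates, more data, smaller p.

```python
# 2x2 tables: responders / non-responders per arm.
from scipy.stats import chi2_contingency

small = [[2, 7], [1, 8]]        # 9 per arm
large = [[35, 119], [18, 140]]  # same rates, ~150 per arm

for name, table in [("example 1", small), ("example 2", large)]:
    chi2, p, _, _ = chi2_contingency(table, correction=False)
    print(f"{name}: p = {p:.3f}")
```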
Evidence and p-value: an example (3)
….on the other hand…
Drug A. Known efficacy rate: 50%
Drug B. Expected efficacy rate: 52%
Δ = 2%; Type I error: 0.05; Type II error: 0.20
N (per arm): 9,806
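A sketch reproducing that sample size with the classical two-proportion formula (α = 0.05 two-sided, power = 0.80):

```python
# n per arm = (z_a + z_b)^2 * [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2
from scipy.stats import norm

p1, p2 = 0.50, 0.52
alpha, beta = 0.05, 0.20

z_a = norm.ppf(1 - alpha / 2)   # 1.96
z_b = norm.ppf(1 - beta)        # 0.84

n = (z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
print(f"n per arm = {n:.0f}")   # ~9,800, close to the slide's 9,806
```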
Conclusion: little shop of horrors (1)
• “No significant difference is observed between the treatment arms. Conclusion: the treatments are equally effective…”
…AAAAAARGH!!!!
• “Absence of evidence is not evidence of absence” (Altman DG, Bland JM. BMJ 1995;311:485)
Conclusion: little shop of horrors (2)
• “The p-value of the comparison A vs. Placebo is lower than the p-value for the comparison B vs. Placebo. Conclusion: treatment A is better than B…”
…AAAAAARGH!!!!
• The p-value gives us a measure of the evidence against that specific null hypothesis in that specific hypothesis test.
Conclusion: little shop of horrors (3)
• A clinician speaking to the poor, helpless statistician: “Can we just test variable A vs. the rest of the variables and check if some difference is significant…?”
…AAAAAARGH!!!!
• The Type I error grows rapidly with the number of hypothesis tests performed:
  1 test: Type I error = 5% …… 5 tests: Type I error > 20%
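Using the multiplicity formula from earlier, the 5-test figure checks out:

$1 - (1 - 0.05)^5 = 1 - 0.95^5 \approx 1 - 0.774 = 0.226 > 20\%$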