
Three Common Misinterpretations of Significance Tests and p-values
1. The p-value indicates the probability that the results are due to sampling error or “chance.”
2. A statistically significant result is a “reliable” result.
3. A statistically significant result is a powerful, important result.
Misinterpretation # 1
• The p-value is a conditional probability: the probability of observing a specific range of sample statistics GIVEN (i.e., conditional upon) that the null hypothesis is true. P(D|H0)
• This is not equivalent to the probability of the null hypothesis being true, given the data.
• P(H0|D) ≠ P(D|H0)
Misinterpretation # 1
• It is this latter question (i.e., “How likely is it that the results are due to sampling error or chance?”) that tends to motivate the use of significance tests on the part of researchers. However, these tests do not answer this question directly.
• In order to answer this question, one needs to consider
additional pieces of information: (a) the likelihood that the
null hypothesis is true before doing the study, (b) the
probability of observing the data given other hypotheses
(e.g., the alternative hypothesis), and (c) the probability
that other hypotheses are true before doing the study.
Bayes’ Theorem
• Bayes’ theorem provides a way to combine these different
pieces of information:
P(H0|D) = [P(D|H0) × P(H0)] / [P(D|H0) × P(H0) + P(D|H1) × P(H1)]
Note: You don’t need to memorize this formula, but please be
able to use it and understand it.
P(H0)   P(H1)   P(D|H0)   P(D|H1)   P(H0|D)
.50     .50     .05       .95       .05      Here, P(H0|D) does = P(D|H0)
.50     .50     .05       .01       .83      Here, P(H0|D) > P(D|H0)
.90     .10     .05       .05       .90      Here, P(H0|D) > P(D|H0)
.10     .90     .15       .50       .03      Here, P(H0|D) < P(D|H0)
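To make these numbers concrete, here is a minimal Python sketch (the function name and language are my own, not part of the original slides) that plugs each row of the table above into Bayes’ theorem and reproduces the P(H0|D) column:

```python
def posterior_h0(p_h0, p_h1, p_d_given_h0, p_d_given_h1):
    """Posterior probability of H0 given the data, via Bayes' theorem."""
    numerator = p_d_given_h0 * p_h0
    denominator = p_d_given_h0 * p_h0 + p_d_given_h1 * p_h1
    return numerator / denominator

# Rows of the table above: (P(H0), P(H1), P(D|H0), P(D|H1))
rows = [(.50, .50, .05, .95),
        (.50, .50, .05, .01),
        (.90, .10, .05, .05),
        (.10, .90, .15, .50)]

for p_h0, p_h1, p_d_h0, p_d_h1 in rows:
    print(f"P(H0|D) = {posterior_h0(p_h0, p_h1, p_d_h0, p_d_h1):.2f}")
# Prints 0.05, 0.83, 0.90, 0.03 -- matching the P(H0|D) column
```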
Misinterpretation # 2
• Is a significant result a “reliable,” easily replicated result?
• Not necessarily. The p-value is a poor indicator of the
replicability of a finding.
• Replicability (assuming a real effect exists, that is, that the null hypothesis is false) is primarily a function of statistical power.
Misinterpretation # 2
• If a study had a statistical power equivalent to 80%, what is
the probability of obtaining a “significant” result twice?
• The probability of two independent events both occurring is
the simple product of the probability of each of them
occurring.
• .80 × .80 = .64
• If power = 50%? .50 × .50 = .25
• Bottom line: The likelihood of replicating a result is determined by statistical power, not the p-value derived from a significance test. When the power of the test is low, the likelihood of a long run of successful replications is even lower (see the sketch below).
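As a rough illustration of this point, here is a small Python sketch (a hypothetical helper, assuming independent replications of a real effect, each with the same statistical power) that computes the probability that every replication in a series comes out “significant”:

```python
def p_all_significant(power, n_replications):
    """Probability that all n independent replications are 'significant',
    assuming the effect is real and each study has the stated power."""
    return power ** n_replications

print(round(p_all_significant(0.80, 2), 2))  # 0.64  (the .80 x .80 example above)
print(round(p_all_significant(0.50, 2), 2))  # 0.25
print(round(p_all_significant(0.50, 5), 3))  # 0.031 -> low power makes a long run
                                             # of successful replications unlikely
```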
Misinterpretation # 3
• Is a significant result a powerful, important result?
• Not necessarily. The importance of the result, of course,
depends on the issue at hand, the theoretical context of the
finding, etc.
Misinterpretation # 3
• We can measure the practical or theoretical significance of
an effect using an index of effect size.
• An effect size is a quantitative index of the strength of the
relationship between two variables.
• Some common measures of effect size that we’ve discussed in this class are correlations, regression weights, and R-squared (see the brief sketch below).
• (These same indices can be used when one or more of the
variables of interest is categorical.)
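As a quick illustration of an effect size as a quantitative index of the strength of a relationship, here is a minimal Python sketch (with made-up scores, purely for illustration) that computes a Pearson correlation and the corresponding R-squared:

```python
import numpy as np

# Hypothetical paired scores on two variables (illustration only)
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 10.0])
y = np.array([1.0, 3.0, 6.0, 6.0, 8.0, 11.0])

r = np.corrcoef(x, y)[0, 1]           # Pearson correlation between x and y
print(round(r, 2), round(r ** 2, 2))  # r (effect size) and R-squared (shared variance)
```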
Some common effect sizes in the “real world”
Effect of aspirin on heart attacks: r ≈ .01
Effect of psychotherapy on psychological well-being: r ≈ .25
Correlation between personality as a child and personality as an adult: r ≈ .30
Correlation between SAT and college GPA: r ≈ .30
Misinterpretation # 3
• Importantly, the same effect size can have different p-values, depending on the sample size of the study.
• For example, a correlation of .30 would not be statistically significant with a sample size of 30, but would be statistically significant with a sample size of 130 (as the sketch below illustrates).
• Bottom line: The p-value is a poor way to evaluate the
practical “significance” of a research result.
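To illustrate how the same correlation can be significant or not depending on sample size, here is a small Python sketch (the helper function is my own; it uses the standard t-test for a Pearson correlation) comparing r = .30 at n = 30 and n = 130:

```python
from math import sqrt
from scipy import stats

def p_value_for_r(r, n):
    """Two-tailed p-value for a Pearson correlation r with sample size n,
    using t = r * sqrt(n - 2) / sqrt(1 - r**2) with n - 2 degrees of freedom."""
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    return 2 * stats.t.sf(abs(t), df=n - 2)

print(round(p_value_for_r(0.30, 30), 3))   # ~0.107  -> not significant at the .05 level
print(round(p_value_for_r(0.30, 130), 4))  # ~0.0005 -> significant at the .05 level
```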