Chapter 21 More about tests
Math2200
Identify null and alternative hypotheses
The null must be a statement about the
value of a parameter for a model.
We use this value to compute the
probability that the observed sample
statistic—or something even farther from
the null value—would occur.
Identify null and alternative hypotheses
(cont.)
The appropriate null arises directly from the
context of the problem—it is not dictated by the
data, but instead by the situation.
A good way to identify both the null and
alternative hypotheses is to think about the Why
of the situation.
To write a null hypothesis, you can’t just choose
any parameter value you like.
The null must relate to the question at hand—it is
context dependent.
Remark: “null” does not automatically mean zero.
Example: Clinical trial
A pharmaceutical company wanting to
develop and market a new drug needs to
show the new drug is effective. The FDA
requires that the drug be proven effective in a
double-blind, randomized clinical trial.
Null hypothesis: the proportion of patients
recovering after receiving the new drug is
the same as we would expect of patients
receiving a placebo
Alternative hypothesis?
Clinical trial (cont.)
What if the purpose of the study is to find a
treatment better than the current one?
Null hypothesis: The proportion of patients
cured by the new treatment is the same as the
proportion cured by the current treatment.
Alternative hypothesis: The proportion of
patients recovering after receiving the new
treatment is higher than what we would expect
of patients receiving the current treatment.
Example: Therapeutic touch
Therapeutic touch (TT) practitioners believe that
by adjusting the “human energy field” they can
promote healing.
15 TT practitioners
A screen was placed so that the TT practitioners could not
see the hand of the experimenter (a young girl)
Each TT practitioner attempted 10 trials
Out of 150 trials, 70 were successful (46.7%)
Can we conclude the TT practitioners can
successfully detect a “human energy field”?
TT (cont.)
Hypotheses
p: probability of successful identification
H0 : p=0.5
HA : p>0.5 (one-sided)
This is about a proportion. Let’s use a one-proportion z-test.
Check the conditions first!
Independence
Randomization (the choice of hand was randomized
with a coin flip.)
10% condition
Success/failure condition
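The success/failure condition can be checked directly. A minimal sketch in Python, using the null value p0 = 0.5 and sample size n = 150 from this example:

```python
# Success/failure condition: under the null we expect at least 10
# successes and 10 failures, i.e. n*p0 >= 10 and n*(1 - p0) >= 10.
p0, n = 0.5, 150
expected_successes = n * p0        # 75.0
expected_failures = n * (1 - p0)   # 75.0
print(expected_successes >= 10 and expected_failures >= 10)  # True
```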
TT (cont.)
STAT TESTS
5: 1-PropZTest
p0: 0.5
prop > p0
x: 70 (the number of successes)
n: 150 (the sample size)
Calculate
Calculator output:
1-PropZTest
prop > .5
z = -0.8164965809
p = 0.7928919719
p̂ = 0.4666666667
n = 150
TT (cont.)
One-proportion z-test
The sample proportion has a normal model with mean
0.5 and SD 0.041.
Observed proportion is 0.467.
P-value = P(Z > (0.467 - 0.5)/0.041) ≈ 0.79
Conclusion
The p-value suggests that, under the null hypothesis, an
observed proportion of 46.7% successes or more would
occur at random about 8 times in 10. So, we do not
reject the null hypothesis.
There is insufficient evidence to support that the TT
practitioners are performing better than they would if
they were just guessing.
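The calculator steps above can be reproduced by hand. A minimal Python sketch of the one-proportion z-test for the TT data, using only the standard library’s `statistics.NormalDist`:

```python
from math import sqrt
from statistics import NormalDist

# TT example: 70 successes in 150 trials; H0: p = 0.5 vs HA: p > 0.5
p0, x, n = 0.5, 70, 150
p_hat = x / n                        # observed proportion, ~0.467
sd = sqrt(p0 * (1 - p0) / n)         # sd of p_hat under the null, ~0.041
z = (p_hat - p0) / sd                # ~ -0.816
p_value = 1 - NormalDist().cdf(z)    # upper-tail P-value, ~0.793
print(round(z, 3), round(p_value, 3))
```

The z-score and P-value agree with the 1-PropZTest calculator output.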
Re-think p-values
P-value = P(observed or more extreme statistic value | H0 is true)
P-value does NOT quantify how likely the null
hypothesis is true.
P-value is NOT the probability that the null
hypothesis is true.
P-value is NOT the conditional probability that the
null hypothesis is true given the data.
Significance level
Sometimes we want to make a firm decision
about whether or not to reject the null hypothesis
Significance level (alpha level): Threshold value
to make a decision based on P-value
Often denoted by α. If P-value< α, we reject the null
hypothesis and we call the results statistically significant.
When we reject the null hypothesis, we say that the test
is “significant at that level”.
Often, α = 0.10, 0.05, 0.01
More about α
Even if the null hypothesis is true, we still
have a probability α to reject the null
hypothesis. But this is rare if α is small.
Choose α before you look at the data.
Always report the P-value, so that readers can
apply their own α to reach a conclusion
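As a sketch, the decision rule boils down to one comparison (the function name `decide` is just for illustration):

```python
def decide(p_value, alpha=0.05):
    """Compare a P-value to a pre-chosen alpha; report the P-value regardless."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

print(decide(0.793))        # TT example: fail to reject H0 at alpha = 0.05
print(decide(0.003, 0.01))  # reject H0 at alpha = 0.01
```

Note that alpha is fixed before the data are seen; the same P-value can lead different readers to different decisions.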
What Not to Say About Significance
What do we mean when we say that a test
is statistically significant?
All we mean is that the test statistic had a P-value lower than our alpha level.
Don’t be lulled into thinking that statistical
significance carries with it any sense of
practical importance or impact.
What Not to Say About Significance (cont.)
For large samples, even small, unimportant
(“insignificant”) deviations from the null
hypothesis can be statistically significant.
On the other hand, if the sample is not large
enough, even large, financially or scientifically
“significant” differences may not be statistically
significant.
It’s good practice to report the magnitude of the
difference between the observed statistic value
and the null hypothesis value (in the data units)
along with the P-value on which we base
statistical significance.
Revisit critical value
Critical value
1.96 for a 95% confidence interval
Any z-score larger in magnitude (i.e., more
extreme) than a particular critical value has to
be less likely, so it will have a p-value smaller
than the corresponding probability.
Comparing p-value with α is equivalent to
comparing the observed z-score with the critical
value for a given α level
Revisit Critical Values (cont.)
When the alternative is one-sided, the
critical value puts all of α on one side.
When the alternative is two-sided, the
critical value splits α equally into two tails.
Revisit Critical Values (cont.)
α       One-sided   Two-sided
0.1     1.28        1.645
0.05    1.645       1.96
0.01    2.33        2.575
0.001   3.09        3.29
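These critical values come from the inverse normal CDF; a short check using Python’s standard-library `statistics.NormalDist`:

```python
from statistics import NormalDist

norm = NormalDist()  # standard normal
for alpha in (0.10, 0.05, 0.01, 0.001):
    one_sided = norm.inv_cdf(1 - alpha)      # all of alpha in one tail
    two_sided = norm.inv_cdf(1 - alpha / 2)  # alpha split across two tails
    print(f"alpha={alpha}: one-sided {one_sided:.3f}, two-sided {two_sided:.3f}")
```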
Confidence Intervals and Hypothesis
Tests
Confidence intervals and hypothesis tests
are built from the same calculations.
You can approximate a hypothesis test by
examining a confidence interval.
Confidence Intervals and Hypothesis
Tests (cont.)
Confidence intervals are two-sided. They
correspond to two-sided tests.
In general, a 100(1- α)% confidence interval
corresponds to a two-sided hypothesis test with
significance level α
Example: Click it or ticket!
Goal: achieve at least 80% compliance
with the seatbelt law
Data: a roadblock resulted in 33 tickets out
of 134 drivers stopped for inspection
Does the fact that only (134-33)/134 =
75.4% of these drivers were wearing their
seatbelts prove that the compliance rate
among the driving public is below 80%?
Click it or ticket! (cont.)
Hypotheses
p: compliance rate in the driving public
H0 : p=0.8
HA : p≠0.8 (two-sided)
Check the conditions
Independence
Random sampling
10% condition
Success/failure condition
Click it or ticket! (cont.)
STAT TESTS
5: 1-PropZTest (for hypothesis testing)
p0: 0.8
x: 101 (134 - 33)
n: 134
prop ≠ p0
STAT TESTS
A: 1-PropZInt (for a confidence interval)
x: 101 (134 - 33)
n: 134
C-Level: 0.9
Click it or ticket! (cont.)
We can use a normal model
Instead of a test, we find a one-proportion z-interval
Sample proportion is (134-33)/134 = 75.4%
Standard error is 0.037
Critical value is 1.645 for a 90% confidence
interval
Margin of error = 1.645 * 0.037 = 0.061
Confidence interval
0.754±0.061 = (0.693, 0.815)
Conclusion
Since 80% is in the 90% confidence interval, we do not
reject the null hypothesis at significance level 0.10.
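The interval computation above can be sketched in a few lines of standard-library Python:

```python
from math import sqrt
from statistics import NormalDist

x, n = 134 - 33, 134                   # 101 belted drivers out of 134
p_hat = x / n                          # ~0.754
se = sqrt(p_hat * (1 - p_hat) / n)     # standard error, ~0.037
z_star = NormalDist().inv_cdf(0.95)    # 1.645 for a 90% interval
me = z_star * se                       # margin of error, ~0.061
lo, hi = p_hat - me, p_hat + me
print(f"90% CI: ({lo:.3f}, {hi:.3f})")  # 0.80 lies inside the interval
```

Because 0.80 falls inside the 90% interval, the matching two-sided test at α = 0.10 fails to reject H0.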
Making errors
Type I error (false positive)
Reject the null hypothesis when the null hypothesis is true
The probability of Type I error is controlled by the significance
level α
Type II error (false negative)
Fail to reject the null hypothesis when the null hypothesis is false
Which error is more serious?
Depends on the context. In the classic hypothesis testing
framework, we control Type I error.
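A small simulation illustrates that when H0 is true we still reject about α of the time. This sketch reuses the TT setup (p0 = 0.5, n = 150, one-sided test at α = 0.05) as an assumed example:

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)
p0, n, alpha, reps = 0.5, 150, 0.05, 4000
# one-sided rejection cutoff for the sample proportion
cutoff = p0 + NormalDist().inv_cdf(1 - alpha) * sqrt(p0 * (1 - p0) / n)

# Draw samples where H0 is TRUE and count how often we (wrongly) reject
rejections = 0
for _ in range(reps):
    successes = sum(random.random() < p0 for _ in range(n))
    if successes / n > cutoff:
        rejections += 1
rate = rejections / reps   # Type I error rate; close to alpha
print(rate)
```

The simulated rate sits near (slightly below) 0.05 because the count of successes is discrete.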
Making Errors (cont.)
When H0 is false and we fail to reject it, we
have made a Type II error.
We assign the letter β to the probability of this
mistake.
It’s harder to assess the value of β because we
don’t know what the value of the parameter
really is.
There is no single value for β; we can think of a
whole collection of β’s, one for each incorrect
parameter value.
Making Errors (cont.)
One way to focus our attention on a particular β
is to think about the effect size.
Ask “How big a difference would matter?”
We could reduce β for all alternative parameter
values by increasing α.
This would reduce β but increase the chance of a Type I
error.
This tension between Type I and Type II errors is
inevitable.
The only way to reduce both types of errors is to
collect more data. Otherwise, we just wind up
trading off one kind of error against the other.
Power
When H0 is false and we reject it, we have done
the right thing.
A test’s ability to detect a false hypothesis is called the
power of the test.
The power of a test is the probability that it correctly
rejects a false null hypothesis.
Power = 1- probability of Type II error = 1- β
Power = P (reject H0 | H0 is false)
When the power is high, we can be confident
that we’ve looked hard enough at the situation.
The power of a test is 1 – β.
Power (cont.)
Whenever a study fails to reject its null
hypothesis, the test’s power comes into
question.
When we calculate power, we imagine that
the null hypothesis is false.
The value of the power depends on how far
the truth lies from the null hypothesis value.
The distance between the null hypothesis value,
p0, and the truth, p, is called the effect size.
The larger the effect size, the higher the power.
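The effect-size relationship can be sketched for the one-proportion test. This is a sketch under normal-approximation assumptions, not a general power routine; the function name `power_one_sided` is just for illustration:

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()  # standard normal

def power_one_sided(p0, p_true, n, alpha=0.05):
    """Power of the one-sided (HA: p > p0) one-proportion z-test."""
    # rejection cutoff for the sample proportion, set under the null
    cutoff = p0 + norm.inv_cdf(1 - alpha) * sqrt(p0 * (1 - p0) / n)
    # chance the sample proportion clears the cutoff when the truth is p_true
    return 1 - norm.cdf((cutoff - p_true) / sqrt(p_true * (1 - p_true) / n))

# Larger effect size -> higher power (n = 150, as in the TT example)
for p_true in (0.55, 0.60, 0.70):
    print(p_true, round(power_one_sided(0.5, p_true, 150), 3))
```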
A Picture Worth a Thousand Words (cont.)
This diagram shows the relationship
between these concepts:
Reducing Both Type I and Type II
Error
The previous figure seems to show that if
we reduce Type I error, we must
automatically increase Type II error.
But, we can reduce both types of error by
making both curves narrower.
How do we make the curves narrower?
Increase the sample size.
Reducing Both Type I and Type II Error
(cont.)
This figure has means that are just as far apart as in the
previous figure, but the sample sizes are larger, the
standard deviations are smaller, and the error rates are
reduced:
Reducing Both Type I and Type II Error
(cont.)
Original comparison
With a larger sample size:
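The sample-size effect can be sketched the same way: holding the effect size fixed, a larger n narrows both sampling distributions and lifts the power at the same α. A minimal sketch (the helper `power_one_sided` is illustrative, using the normal approximation):

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()  # standard normal

def power_one_sided(p0, p_true, n, alpha=0.05):
    """Power of the one-sided (HA: p > p0) one-proportion z-test."""
    cutoff = p0 + norm.inv_cdf(1 - alpha) * sqrt(p0 * (1 - p0) / n)
    return 1 - norm.cdf((cutoff - p_true) / sqrt(p_true * (1 - p_true) / n))

# Same effect size (0.50 vs 0.55); only the sample size grows
for n in (100, 400, 1600):
    print(n, round(power_one_sided(0.5, 0.55, n), 3))
```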
What power do we need?
It depends on the application.
It depends on the effect size.
It is also often a financial consideration:
higher power requires a larger sample, and
hence a higher cost.
What Can Go Wrong?
Don’t interpret the P-value as the probability that
H0 is true.
The P-value is about the data, not the hypothesis.
It is about the probability of the data given that H0 is true,
not the other way around.
Don’t believe too strongly in arbitrary alpha
levels.
It’s better to report your P-value and a confidence
interval so that the reader can make her/his own
decision.
What Can Go Wrong? (cont.)
Don’t confuse practical and statistical
significance.
Just because a test is statistically significant
doesn’t mean that it is significant in practice.
And, sample size can impact your decision
about a null hypothesis, making you miss an
important difference or find an “insignificant”
difference.
Don’t forget that in spite of all your care,
you might make a wrong decision.
What have we learned?
Type I and Type II errors
Significance level
Power
Effect size
Relationship between significance level
and power