Chapter 21 More about tests

Math2200
Identify null and alternative hypotheses
The null must be a statement about the
value of a parameter for a model.
We use this value to compute the
probability that the observed sample
statistic—or something even farther from
the null value—would occur.
Identify null and alternative hypotheses
(cont’)
 The appropriate null arises directly from the
context of the problem—it is not dictated by the
data, but instead by the situation.
 A good way to identify both the null and
alternative hypotheses is to think about the Why
of the situation.
 To write a null hypothesis, you can’t just choose
any parameter value you like.
The null must relate to the question at hand—it is
context dependent.
Remark: “null” does not automatically mean zero.
Example: Clinical trial
A pharmaceutical company wanting to
develop and market a new drug needs to
show the new drug is effective. The FDA
requires that the drug be proven effective in a
double-blind, randomized clinical trial.
Null hypothesis: the proportion of patients
recovering after receiving the new drug is
the same as we would expect of patients
receiving a placebo
Alternative hypothesis?
Clinical trial (cont’)
What if the purpose of the study is to find a
treatment better than the current one?
Null hypothesis: the proportion of patients
recovering after receiving the new treatment is
the same as we would expect of patients
receiving the current treatment.
 Alternative hypothesis: the proportion of
patients recovering after receiving the new
treatment is higher than we would expect of
patients receiving the current treatment.
Example: Therapeutic touch
 Therapeutic touch (TT) practitioners believe that
by adjusting the “human energy field” they can
promote healing.
15 TT practitioners participated
A screen was placed so that the TT practitioners could not
see the girl's hand
Each TT practitioner attempted 10 trials
Out of 150 trials, 70 were successful (46.7%)
 Can we conclude the TT practitioners can
successfully detect a “human energy field”?
TT (cont’)
 Hypotheses
p: probability of successful identification
H0 : p=0.5
HA : p>0.5 (one-sided)
 This is about a proportion. Let’s use a one-proportion z-test.
 Check the conditions first!
Independence
Randomization (the choice of hand was randomized
with a coin flip.)
10% condition
Success/failure condition (np0 = 150 × 0.5 = 75 ≥ 10 and n(1 − p0) = 75 ≥ 10)
TT (cont’)
STAT → TESTS → 5: 1-PropZTest
p0: 0.5
prop > p0
x: 70 (the number of successes)
n: 150 (sample size)
Calculate

Output:
1-PropZTest
prop > .5
z = −.8164965809
p = .7928919719
p̂ = .4666666667
n = 150
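The calculator’s output can be reproduced in a few lines of Python using only the standard library. This is a sketch: the helper name `one_prop_ztest` is my own, not a library API.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_ztest(x, n, p0):
    """One-proportion z-test, one-sided alternative prop > p0."""
    p_hat = x / n
    se = sqrt(p0 * (1 - p0) / n)       # SD of p_hat under H0
    z = (p_hat - p0) / se
    p_value = 1 - NormalDist().cdf(z)  # one-sided: P(Z > z)
    return p_hat, z, p_value

# TT data: 70 successes in 150 trials, H0: p = 0.5
p_hat, z, p_value = one_prop_ztest(70, 150, 0.5)
print(round(p_hat, 4), round(z, 4), round(p_value, 4))
# → 0.4667 -0.8165 0.7929, matching the calculator output
```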
TT (cont’)
 One-proportion z-test
The sample proportion has a normal model with mean
0.5 and sd 0.041
Observed proportion is 0.467
P-value = P(Z > (0.467 − 0.5)/0.041) = P(Z > −0.80) ≈ 0.79
 Conclusion
The p-value suggests that, under the null hypothesis, an
observed proportion of 46.7% successes or more would
occur at random about 8 times in 10. So, we do not
reject the null hypothesis.
There is insufficient evidence to support that the TT
practitioners are performing better than they would if
they were just guessing.
Re-think p-values
 P(observed or more extreme statistic values | H0)
 P-value does NOT quantify how likely the null
hypothesis is true.
P-value is NOT the probability that the null
hypothesis is true.
P-value is NOT the conditional probability that the
null hypothesis is true given the data.
Significance level
 Sometimes we want to make a firm decision
about whether or not to reject the null hypothesis
 Significance level (alpha level): Threshold value
to make a decision based on P-value
Often denoted by α. If P-value< α, we reject the null
hypothesis and we call the results statistically significant.
When we reject the null hypothesis, we say that the test
is “significant at that level”.
Often, α = 0.10, 0.05, 0.01
More about α
Even if the null hypothesis is true, we still
have a probability α to reject the null
hypothesis. But this is rare if α is small.
Choose α before you look at the data.
Always report p-value, then people can
use different α to reach their conclusion
What Not to Say About Significance
What do we mean when we say that a test
is statistically significant?
All we mean is that the test statistic had a P-value lower than our alpha level.
Don’t be lulled into thinking that statistical
significance carries with it any sense of
practical importance or impact.
What Not to Say About Significance (cont.)
 For large samples, even small, unimportant
(“insignificant”) deviations from the null
hypothesis can be statistically significant.
 On the other hand, if the sample is not large
enough, even large, financially or scientifically
“significant” differences may not be statistically
significant.
 It’s good practice to report the magnitude of the
difference between the observed statistic value
and the null hypothesis value (in the data units)
along with the P-value on which we base
statistical significance.
Revisit critical value
Critical value
1.96 for a 95% confidence interval
Any z-score larger in magnitude (i.e., more
extreme) than a particular critical value has to
be less likely, so it will have a p-value smaller
than the corresponding probability.
Comparing p-value with α is equivalent to
comparing the observed z-score with the critical
value for a given α level
Revisit Critical Values (cont’)
 When the alternative is one-sided, the
critical value puts all of α on one side.
 When the alternative is two-sided, the
critical value splits α equally into two tails.
Revisit Critical Values (cont’)
α        One-sided   Two-sided
0.10     1.28        1.645
0.05     1.645       1.96
0.01     2.33        2.576
0.001    3.09        3.29
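The critical values in the table above can be recomputed directly from the standard normal inverse CDF; a one-sided value puts all of α in one tail, a two-sided value puts α/2 in each. A minimal sketch using Python’s standard library:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal

for alpha in (0.10, 0.05, 0.01, 0.001):
    one_sided = z.inv_cdf(1 - alpha)      # all of alpha in one tail
    two_sided = z.inv_cdf(1 - alpha / 2)  # alpha split over two tails
    print(f"{alpha:>5}: one-sided {one_sided:.3f}, two-sided {two_sided:.3f}")
```

Note that the two-sided value for level α equals the one-sided value for level α/2, which is why 1.645 appears in both columns.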
Confidence Intervals and Hypothesis
Tests
Confidence intervals and hypothesis tests
are built from the same calculations.
You can approximate a hypothesis test by
examining a confidence interval.
Confidence Intervals and Hypothesis
Tests (cont.)
Confidence intervals are two-sided. They
correspond to two-sided tests.
In general, a 100(1- α)% confidence interval
corresponds to a two-sided hypothesis test with
significance level α
Example: Click it or ticket!
Goal: achieve at least 80% compliance
with the seatbelt law
Data: a roadblock resulted in 33 tickets out
of 134 drivers stopped for inspection
Does the fact that only (134-33)/134 =
75.4% of these drivers were wearing their
seatbelts prove that the compliance rate
among the driving public is below 80%?
Click it or ticket! (cont’)
Hypotheses
p: compliance rate in the driving public
H0 : p=0.8
HA : p≠0.8 (two-sided)
Check the conditions
Independence
Random sampling
10% condition
Success/failure condition
Click it or ticket! (cont’)
 STAT TESTS
5: 1-PropZTest (for hypotheses testing)
p0: 0.8
x: 101 (134-33)
n: 134
prop ≠p0
 STAT TESTS
A: 1-PropZInt (for confidence interval)
x: 101 (134-33)
n:134
C-level: 0.9
Click it or ticket! (cont’)
 We can use a normal model
 Instead of a test, we find a one-proportion z-interval
 Sample proportion is (134-33)/134 = 75.4%
 Standard error is 0.037
 Critical value is 1.645 for a 90% confidence
interval
 Margin of error = 1.645 * 0.037 = 0.061
 Confidence interval
0.754±0.061 = (0.693, 0.815)
 Conclusion
Since 80% is in the 90% confidence interval, we do not
reject the null hypothesis at significance level 0.10.
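The interval calculation above can be checked with a short Python sketch (standard library only; the variable names are my own):

```python
from math import sqrt
from statistics import NormalDist

x, n, conf = 101, 134, 0.90
p_hat = x / n                                      # 101/134 ≈ 0.754
se = sqrt(p_hat * (1 - p_hat) / n)                 # standard error ≈ 0.037
z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # critical value ≈ 1.645
me = z_star * se                                   # margin of error ≈ 0.061
print(f"{p_hat:.3f} ± {me:.3f} → ({p_hat - me:.3f}, {p_hat + me:.3f})")
# 0.8 lies inside the 90% interval, so we do not reject H0 at alpha = 0.10
```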
Making errors
 Type I error (false positive)
 Reject the null hypothesis when the null hypothesis is true
 The probability of Type I error is controlled by the significance
level α
 Type II error (false negative)
 Fail to reject the null hypothesis when the null hypothesis is false
 Which error is more serious?
 Depends on the context. In the classic hypothesis testing
framework, we control Type I error.
Making Errors (cont.)
When H0 is false and we fail to reject it, we
have made a Type II error.
We assign the letter β to the probability of this
mistake.
It’s harder to assess the value of β because we
don’t know what the value of the parameter
really is.
There is no single value for β; we can think of a
whole collection of β’s, one for each incorrect
parameter value.
Making Errors (cont.)
 One way to focus our attention on a particular β
is to think about the effect size.
Ask “How big a difference would matter?”
 We could reduce β for all alternative parameter
values by increasing α.
This would reduce β but increase the chance of a Type I
error.
This tension between Type I and Type II errors is
inevitable.
 The only way to reduce both types of errors is to
collect more data. Otherwise, we just wind up
trading off one kind of error against the other.
Power
 When H0 is false and we reject it, we have done
the right thing.
A test’s ability to detect a false hypothesis is called the
power of the test.
The power of a test is the probability that it correctly
rejects a false null hypothesis.
Power = 1- probability of Type II error = 1- β
Power = P (reject H0 | H0 is false)
 When the power is high, we can be confident
that we’ve looked hard enough at the situation.
 The power of a test is 1 − β.
Power (cont.)
Whenever a study fails to reject its null
hypothesis, the test’s power comes into
question.
When we calculate power, we imagine that
the null hypothesis is false.
The value of the power depends on how far
the truth lies from the null hypothesis value.
The distance between the null hypothesis value,
p0, and the truth, p, is called the effect size.
The larger the effect size, the higher the power.
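The relationship between effect size and power can be sketched with the normal approximation. The function below is illustrative (my own helper, not a library API): it finds the rejection cutoff under H0 at level α, then asks how likely p̂ is to exceed that cutoff when the true proportion is p_true. The numbers echo the TT setup (p0 = 0.5, n = 150).

```python
from math import sqrt
from statistics import NormalDist

def power(p0, p_true, n, alpha=0.05):
    """Approximate power of the one-sided one-proportion z-test."""
    z_star = NormalDist().inv_cdf(1 - alpha)
    sd0 = sqrt(p0 * (1 - p0) / n)       # SD of p_hat under H0
    cutoff = p0 + z_star * sd0          # reject H0 when p_hat > cutoff
    sd_true = sqrt(p_true * (1 - p_true) / n)  # SD under the truth
    return 1 - NormalDist(p_true, sd_true).cdf(cutoff)

# Larger effect size (truth farther from p0 = 0.5) gives higher power
print(round(power(0.5, 0.55, 150), 3))  # small effect: low power
print(round(power(0.5, 0.65, 150), 3))  # large effect: high power
```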
A Picture Worth a Thousand Words
This diagram shows the relationship
between these concepts:
Reducing Both Type I and Type II
Error
The previous figure seems to show that if
we reduce Type I error, we must
automatically increase Type II error.
But, we can reduce both types of error by
making both curves narrower.
How do we make the curves narrower?
Increase the sample size.
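The narrowing-curves idea can be illustrated numerically. In this sketch (assumed values, not from the slides) the decision cutoff is held fixed midway between p0 = 0.5 and the truth p = 0.6; as n grows, both sampling distributions narrow, so both error rates shrink at once.

```python
from math import sqrt
from statistics import NormalDist

p0, p_true, cutoff = 0.5, 0.6, 0.55  # cutoff fixed between H0 and the truth

for n in (50, 200, 800):
    sd0 = sqrt(p0 * (1 - p0) / n)          # H0 curve narrows as n grows
    sd1 = sqrt(p_true * (1 - p_true) / n)  # truth curve narrows too
    type1 = 1 - NormalDist(p0, sd0).cdf(cutoff)  # reject when H0 is true
    type2 = NormalDist(p_true, sd1).cdf(cutoff)  # fail to reject when false
    print(n, round(type1, 4), round(type2, 4))
```

With n = 50 both error rates are near 0.24; by n = 800 both are below 0.01, without moving the cutoff.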
Reducing Both Type I and Type II Error
(cont.)
 This figure has means that are just as far apart as in the
previous figure, but the sample sizes are larger, the
standard deviations are smaller, and the error rates are
reduced:
Reducing Both Type I and Type II Error
(cont.)
 Original comparison
 With a larger sample size:
What power do we need?
It depends on the application
It depends on the effect size
It is also often a financial
consideration: higher power requires a
larger sample, hence higher cost
What Can Go Wrong?
 Don’t interpret the P-value as the probability that
H0 is true.
The P-value is about the data, not the hypothesis.
It is about the probability of the data given that H0 is true,
not the other way around.
 Don’t believe too strongly in arbitrary alpha
levels.
It’s better to report your P-value and a confidence
interval so that the reader can make her/his own
decision.
What Can Go Wrong? (cont.)
Don’t confuse practical and statistical
significance.
Just because a test is statistically significant
doesn’t mean that it is significant in practice.
And, sample size can impact your decision
about a null hypothesis, making you miss an
important difference or find an “insignificant”
difference.
Don’t forget that in spite of all your care,
you might make a wrong decision.
What have we learned?
Type I and Type II errors
Significance level
Power
Effect size
Relationship between significance level
and power