Transcript Document

Big Question:
Is the increase in the average test score
attributable to the new pedagogy?
Solution: Conceptually, there are two possibilities:
• The increase in the average score may be plausibly
attributed to a run of good luck, or
• The average increased thanks to the new pedagogy.
We cannot give a decisive answer to this question,
since both of these explanations are possible. The first
option, which attributes the increase in the average
score to mere chance, is called the null hypothesis.
The other option is called the alternative hypothesis.
We need to assess which option is more plausible.
At the extremes, the choice is fairly obvious.
• If the new average score was 100, then it can be
confidently concluded that student scores really went
up thanks to the new pedagogy. In this case, we would
reject the null hypothesis.
• On the other hand, if the new average was 75.01,
then such an increase is reasonably attributable to
chance. In this case, we would retain (or fail to reject)
the null hypothesis.
Somewhere between these two extremes will be a
certain cut-off point called the critical value. Above
this critical value, it will be more plausible to think
that the average score increased thanks to the new
pedagogy. Below this critical value, it will be more
plausible to think that the increase is simply
attributable to chance.
(Figure: number line from 75 to 100, with 75.01 just to the right of 75 and the critical value marked somewhere between 75.01 and 100.)
Therefore, the question may be rephrased as follows:
Is your sample average close enough to 75 to be
consistent with the assumption that the population
mean is still 75?
This question leads to two hypotheses:
H0 : The average is still 75.
Ha : The average is greater than 75.
(Why greater?)
(Figure: number line starting at 75 with the critical value marked; below the critical value the decision is to retain H0, above it the decision is to reject H0.)
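As a concrete illustration, here is a minimal Python sketch of how such a critical value could be computed for a right-tailed z-test. It assumes a 5% significance level and borrows the population standard deviation σ = 12 and sample size n = 40 that appear in the worked example later in this transcript; scipy is assumed to be available.

    import math
    from scipy.stats import norm

    mu0 = 75        # population mean under the null hypothesis
    sigma = 12      # population standard deviation (assumed known)
    n = 40          # sample size
    alpha = 0.05    # significance level (assumed; set in advance)

    # Standard error of the sample mean
    se = sigma / math.sqrt(n)

    # For a right-tailed z-test, reject H0 when the sample mean
    # exceeds this critical value.
    z_crit = norm.ppf(1 - alpha)          # ≈ 1.645
    critical_value = mu0 + z_crit * se    # ≈ 78.1

    print(f"Reject H0 if the sample average exceeds {critical_value:.2f}")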
No matter where we set the critical value, we will
occasionally make a mistake. First, through sheer
dumb luck, the students’ test scores could have gone
up due to a “good day,” just as a fair coin may land
heads more often than expected.
This situation is called a Type I error, meaning that we
would decide to reject the null hypothesis even though
it was actually true.
The probability of making a Type I error is called the
significance level of the test and is denoted by α.
Second, another error could occur if the new
pedagogy really worked, but the students’ average test
score nevertheless fell below the critical value.
This is called a Type II error, meaning that we fail to
reject the null hypothesis even though it is actually
false. The probability of making a Type II error
is denoted by β.
We define the power of the test to be 1 − β.
Type I Error: The null hypothesis is true but we reject it.
Type II Error: The null hypothesis is false but we retain it.
                                   THE WAY IT IS
                               H0 true            Ha true
 THE WAY WE     Decide to
 THINK IT IS    retain H0      correct decision   Type II Error
                Decide to
                reject H0      Type I Error       correct decision
Note. In a perfect world, we would have α = β = 0,
but this can only happen in trivial cases. For any
realistic scenario of hypothesis testing, decreasing α
will increase β, and vice versa.
In practice, we set the significance level α in
advance, usually at a fairly small number. We then
compute β for this level of α.
(Ideally, we would like to construct a test that makes
β as small as possible. This topic will be considered
in future statistics courses.)
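To make the trade-off concrete, here is a minimal Python sketch of how β and the power could be computed for this right-tailed z-test. It keeps α = 5%, reuses σ = 12 and n = 40 from the worked example below, and assumes, purely for illustration, that the true mean under the alternative is 80; the resulting β depends entirely on that assumed value.

    import math
    from scipy.stats import norm

    mu0 = 75         # mean under H0
    mu_alt = 80      # assumed true mean under Ha (illustrative only)
    sigma = 12       # population standard deviation
    n = 40           # sample size
    alpha = 0.05     # significance level

    se = sigma / math.sqrt(n)
    critical_value = mu0 + norm.ppf(1 - alpha) * se   # ≈ 78.1

    # Type II error: the sample mean falls below the critical value
    # even though the true mean is mu_alt.
    beta = norm.cdf((critical_value - mu_alt) / se)   # ≈ 0.16
    power = 1 - beta                                  # ≈ 0.84

    print(f"beta ≈ {beta:.3f}, power ≈ {power:.3f}")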
• Test statistic: Suppose that your sample average is 80. Then

    Z = (X̄ − μ) / (σ / √n) = (80 − 75) / (12 / √40) ≈ 2.6352
• P-value: Assuming H0, we must find the chance of getting
a test statistic at least this extreme. For this problem, that
means we must find the area to the right of 2.64 under
the standard bell curve. Using either Excel or a TI:

    P ≈ 0.0042
In other words, if we assume the null hypothesis that
the average has not changed, then we must also accept
that an improbable event (about one chance in 240 of
happening) just happened. This probability is less than
our prescribed significance level α.
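For reference, the same test statistic and p-value could be computed with a short Python sketch instead of Excel or a TI; it assumes only the values already given in this example (sample average 80, μ = 75 under H0, σ = 12, n = 40) and uses scipy for the normal tail area.

    import math
    from scipy.stats import norm

    x_bar = 80    # observed sample average
    mu0 = 75      # population mean under H0
    sigma = 12    # population standard deviation
    n = 40        # sample size

    # Test statistic for the one-sample z-test
    z = (x_bar - mu0) / (sigma / math.sqrt(n))   # ≈ 2.6352

    # Right-tailed p-value: area to the right of z under the
    # standard normal curve.
    p_value = norm.sf(z)                         # ≈ 0.0042

    print(f"z ≈ {z:.4f}, p-value ≈ {p_value:.4f}")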
• Conclusion: We reject the null hypothesis. There is
good reason to believe that the average test score
has increased thanks to the new pedagogy.
Observations:
1. This test of significance is called the z-test,
named after the test statistic.
2. The z-test is best used with large samples –
so that the normal approximation may be
safely made.
3. Notice we have not proven beyond a shadow of a
doubt that the new pedagogy was effective in
raising the average test score. We might have
been lucky, and the test was designed so that the
probability of a Type I error was 5%.
4. The alternative hypothesis is that the average test
score is greater than 75. It is not that the new
average is exactly equal to 80.
5. Small values of P are evidence against the
null hypothesis; they indicate that something
besides chance is at work.
6. We are NOT saying that there is 1 chance in
240 for the null hypothesis to be correct. Instead,
this figure is being used to make our decision
about the way we think things are.
7. If P < 5%, the result is often called
statistically significant.
If P < 1%, the result is called highly
statistically significant.
These phrases are often used in media
reports on scientific progress – especially
concerning breakthroughs in medical
research. Try searching for the phrase
“statistically significant” on your favorite
news Web site.
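As a small illustration of these conventional cut-offs, they could be coded as a helper like the one below; the function is purely illustrative and not part of any standard library.

    def significance_label(p):
        """Conventional label for an observed significance level P."""
        if p < 0.01:
            return "highly statistically significant"
        if p < 0.05:
            return "statistically significant"
        return "not statistically significant"

    print(significance_label(0.0042))   # highly statistically significant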
8. The previous problem used a right-tailed test
because the alternative hypothesis was that the
population mean increased thanks to the new
pedagogy.
In other problems, depending on the phrasing of the
alternative hypothesis, a left-tailed test could be used,
or even a two-tailed test. Sometimes, reasonable
people can disagree about the appropriate test to use.
Other times, a test can be chosen somewhat
maliciously.
Secondhand smoke is classified
as a known carcinogen by the
Environmental Protection Agency
(EPA). This classification is based
on many scientific studies which
investigated the question of
whether secondhand smoke was
associated with a higher incidence
of cancer.
The EPA conducted its study
using a 5% significance level and a
one-tailed test. A one-tailed test
was used because it was already
independently determined that
first-hand smoke caused cancer
and the preliminary studies
indicated that secondhand smoke
was a probable cause of cancer.
However, the tobacco industry
argued that a one-tailed test was
inappropriate and that a two-tailed
test should be used. They claimed
that by using a one-tailed test at the
5% significance level, the EPA was
essentially using a two-tailed test at
the 10% significance level, since
each tail would then have an area of
5%. The tobacco industry argued
that this doubled the probability of a
Type I error.
Nevertheless, since there was
good reason to think that secondhand smoke was a carcinogen, the
EPA followed the usual scientific
convention of using a one-tailed test.
“Secondhand Smoke: Is it a Hazard?,” Consumer Reports, January 1995.
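To see the doubling that the tobacco industry pointed to, here is a minimal Python sketch comparing right-tailed, left-tailed, and two-tailed p-values for the test statistic z ≈ 2.6352 from the worked example above; the two-tailed p-value is exactly twice the right-tailed one.

    from scipy.stats import norm

    z = 2.6352   # test statistic from the worked example above

    # Right-tailed test: Ha says the mean increased.
    p_right = norm.sf(z)              # ≈ 0.0042

    # Left-tailed test: Ha says the mean decreased.
    p_left = norm.cdf(z)              # ≈ 0.9958

    # Two-tailed test: Ha says the mean changed in either direction.
    p_two = 2 * norm.sf(abs(z))       # ≈ 0.0084

    print(f"right-tailed: {p_right:.4f}")
    print(f"left-tailed:  {p_left:.4f}")
    print(f"two-tailed:   {p_two:.4f}")

Here both the right-tailed and two-tailed tests reject at the 5% level, but the two-tailed p-value is twice as large, which is the sense in which a one-tailed test at 5% resembles a two-tailed test at 10%.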
9. For simplicity, the normal curve was used in the
previous problem. In practice, we have to use the
sample standard deviation s instead of σ.
If a small sample is taken, then the t-distribution must
be used instead of the normal curve to compute the
observed significance level.
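As a sketch of the small-sample case, the one-sample t-test below uses a purely hypothetical sample of eight test scores (the numbers are illustrative, not from this transcript) and tests H0: μ = 75 against the right-tailed alternative with scipy.

    from scipy.stats import ttest_1samp

    # Hypothetical small sample of test scores (illustrative only)
    scores = [78, 82, 74, 85, 79, 81, 77, 88]

    # One-sample t-test of H0: mu = 75 against Ha: mu > 75.
    result = ttest_1samp(scores, popmean=75, alternative="greater")

    print(f"t ≈ {result.statistic:.3f}, p-value ≈ {result.pvalue:.4f}")

With a small sample like this, the p-value comes from the t-distribution with n − 1 degrees of freedom rather than from the normal curve.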
Conceptual Questions:
1) True or False:
a) The observed significance level of 0.4%
depends on the data (i.e., the sample).
b) There are 996 chances out of 1000 for the
alternative hypothesis to be correct.
2) True or False:
a) A “highly statistically significant” result
cannot possibly be due to chance.
b) If a sample difference is “highly
statistically significant,” there is less than a
1% chance for the null hypothesis to be
correct.
3) True or False:
a) If P = 43%, then the null hypothesis
looks plausible.
b) If P = 0.43%, then the null hypothesis
looks implausible.
Note: Previously, we considered another technique
of inferring information about a population from a
sample – namely, confidence intervals.
Confidence intervals provide a method of estimating
a population average.
Hypothesis testing checks whether the difference
between the supposed population average and the
sample average is real or merely due to chance.
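For comparison, here is a minimal Python sketch of a 95% confidence interval for the population mean, reusing the sample average of 80, σ = 12, and n = 40 from the worked example; the 95% level is an assumption chosen to match the 5% significance level used above.

    import math
    from scipy.stats import norm

    x_bar = 80    # sample average
    sigma = 12    # population standard deviation
    n = 40        # sample size
    conf = 0.95   # confidence level (assumed)

    se = sigma / math.sqrt(n)
    z = norm.ppf(1 - (1 - conf) / 2)   # ≈ 1.96

    lower = x_bar - z * se             # ≈ 76.3
    upper = x_bar + z * se             # ≈ 83.7

    print(f"95% confidence interval: ({lower:.2f}, {upper:.2f})")

Since 75 lies outside this interval, the confidence interval points to the same conclusion as a two-tailed test at the 5% level: the data are not consistent with a population mean of 75.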