Transcript Document
Big Question: Is the increase in the average test scores attributable to the new pedagogy?

Solution: Conceptually, there are two possibilities:
• The increase in the average score may be plausibly attributed to a run of good luck, or
• The average increased thanks to the new pedagogy.

We cannot give a decisive answer to this question, since both of these explanations are possible. The first option, which attributes the increase in the average score to mere chance, is called the null hypothesis. The other option is called the alternative hypothesis. We need to assess which option is more plausible.

Under certain extremes, the choice is fairly obvious.
• If the new average score was 100, then it can be confidently concluded that student scores really went up thanks to the new pedagogy. In this case, we would reject the null hypothesis.
• On the other hand, if the new average was 75.01, then such an increase is reasonably attributable to chance. In this case, we would retain (or fail to reject) the null hypothesis.

Somewhere between these two extremes will be a certain cut-off point called the critical value. Above this critical value, it will be more plausible to think that the average score increased thanks to the new pedagogy. Below this critical value, it will be more plausible to think that the increase is simply attributable to chance.

[Figure: number line marking 75, 75.01, the critical value, and 100]

Therefore, the question may be rephrased as follows: Is your sample average close enough to 75 to be consistent with the assumption that the population mean is still 75?

This question leads to two hypotheses:
H0: The average is still 75.
Ha: The average is greater than 75. (Why greater?)

[Figure: number line starting at 75; the decision is to retain H0 below the critical value and to reject H0 above it]

No matter where we set the critical value, we will occasionally make a mistake.
First, through sheer dumb luck, the students’ test scores could have gone up due to a “good day,” just as a fair coin may, by luck, land heads more times than expected. This situation is called a Type I error, meaning that we would decide to reject the null hypothesis even though the new pedagogy had no effect. The probability of making a Type I error is called the significance level of the test and is denoted by α.

Second, another error that could occur is if the new pedagogy really worked, but the students’ average test score nevertheless was below the critical value. This is called a Type II error, meaning that we fail to reject the null hypothesis even though the new pedagogy really did raise the average. The probability of making a Type II error is denoted by β. We define the power of the test to be 1 − β.

Type I Error: The null hypothesis is true but we reject it.
Type II Error: The null hypothesis is false but we retain it.

                                      THE WAY IT IS
THE WAY WE THINK IT IS      Ha true               H0 true
Decide to retain H0         Type II Error         (correct decision)
Decide to reject H0         (correct decision)    Type I Error

Note. In a perfect world, we would have α = β = 0, but this can only happen in trivial cases. For any realistic scenario of hypothesis testing with a fixed sample size, decreasing α will increase β, and vice versa. In practice, we set the significance level α in advance, usually at a fairly small number. We then compute β for this level of α. (Ideally, we would like to construct a test that makes β as small as possible. This topic will be considered in future statistics courses.)

• Test statistic: Suppose that your sample average is 80. Then

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{80 - 75}{12/\sqrt{40}} \approx 2.63523$$

• P-value: Assuming H0, we must find the chance of getting a test statistic at least this extreme. For this problem, that means that we must find the area to the right of 2.64 under the standard bell curve.
Using either Excel or a TI calculator: P ≈ 0.0042. In other words, if we assume the null hypothesis that the average has not changed, then we must also accept the fact that an improbable event (one with about one chance in 240 of happening) just happened. This is less than our prescribed value of α (here, 5%).

• Conclusion: We reject the null hypothesis. There is good reason to believe that the average test score has increased thanks to the new pedagogy.

Observations:
1. This test of significance is called the z-test, named after the test statistic.
2. The z-test is best used with large samples, so that the normal approximation may be safely made.
3. Notice we have not proven beyond a shadow of a doubt that the new pedagogy was effective in raising test scores. We might have been lucky, and the test was designed so that the probability of a Type I error was 5%.
4. The alternative hypothesis is that the average test score is greater than 75. It is not that the new average is exactly equal to 80.
5. Small values of P are evidence against the null hypothesis; they indicate that something besides chance is at work.
6. We are NOT saying that there is 1 chance in 240 for the null hypothesis to be correct. Instead, this figure is being used to make our decision about the way we think things are.
7. If P < 5%, the result is often called statistically significant. If P < 1%, the result is called highly statistically significant. These phrases are often used in media reports on scientific progress, especially concerning breakthroughs in medical research. Try searching for the phrase “statistically significant” on your favorite Web site for news.
8. The previous problem used a right-tailed test because the alternative hypothesis was that the population mean increased thanks to the new pedagogy. In other problems, depending on the phrasing of the alternative hypothesis, a left-tailed test could be used, or even a two-tailed test.
Sometimes, reasonable people can disagree about the appropriate test to use. Other times, a test can be chosen somewhat maliciously.

Secondhand smoke is classified as a known carcinogen by the Environmental Protection Agency (EPA). This classification is based on many scientific studies which investigated the question of whether secondhand smoke was associated with a higher incidence of cancer. The EPA conducted its study using a 5% significance level and a one-tailed test. A one-tailed test was used because it was already independently determined that firsthand smoke caused cancer and the preliminary studies indicated that secondhand smoke was a probable cause of cancer.

However, the tobacco industry argued that a one-tailed test was inappropriate and that a two-tailed test should be used. They claimed that by using a one-tailed test at the 5% significance level, the EPA was essentially using a two-tailed test at the 10% significance level, since each tail would then have an area of 5%. The tobacco industry argued that this doubled the probability of a Type I error. Nevertheless, since there was good reason to think that secondhand smoke was a carcinogen, the EPA followed the usual scientific convention of using a one-tailed test.

Source: “Secondhand Smoke: Is it a Hazard?,” Consumer Reports, January 1995

9. For simplicity, the normal curve was used in the previous problem, with the population standard deviation σ treated as known. In practice, we have to use the sample standard deviation s instead of σ. If a small sample is taken, then the t-distribution must be used instead of the normal curve to compute the observed significance level.

Conceptual Questions:
1) True or False:
a) The observed significance level of 0.4% depends on the data (i.e., the sample).
b) There are 996 chances out of 1000 for the alternative hypothesis to be correct.
2) True or False:
a) A “highly statistically significant” result cannot possibly be due to chance.
b) If a sample difference is “highly statistically significant,” there is less than a 1% chance for the null hypothesis to be correct.
3) True or False:
a) If P = 43%, then the null hypothesis looks plausible.
b) If P = 0.43%, then the null hypothesis looks implausible.

Note: Previously, we considered another technique of inferring information about a population from a sample, namely confidence intervals. Confidence intervals provide a method of estimating a population average. Hypothesis testing checks whether the difference between the supposed population average and the sample average is real or due to chance.
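
The z-test worked through in this lesson can be checked in a few lines of Python instead of Excel or a TI calculator. The following is only a minimal sketch, assuming the numbers from the example (null mean 75, standard deviation 12, sample size 40, sample average 80) and a 5% significance level. The function names are illustrative, and the "true mean of 80" used for the Type II error is a hypothetical choice made purely to show how β and the power could be computed.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard bell curve: mean 0, standard deviation 1


def z_statistic(sample_mean, mu0, sigma, n):
    """Number of standard errors between the sample mean and the null mean."""
    return (sample_mean - mu0) / (sigma / sqrt(n))


def right_tail_p_value(z):
    """Area to the right of z under the standard normal curve."""
    return 1.0 - STD_NORMAL.cdf(z)


def critical_value(mu0, sigma, n, alpha):
    """Sample mean above which a right-tailed test at level alpha rejects H0."""
    return mu0 + STD_NORMAL.inv_cdf(1.0 - alpha) * sigma / sqrt(n)


def type_ii_error(mu_true, mu0, sigma, n, alpha):
    """Chance the sample mean lands below the critical value when the
    population mean is really mu_true (so H0 is wrongly retained)."""
    cutoff = critical_value(mu0, sigma, n, alpha)
    return NormalDist(mu_true, sigma / sqrt(n)).cdf(cutoff)


# Values taken from the worked example; alpha = 5% as in Observation 3.
mu0, sigma, n, alpha = 75, 12, 40, 0.05

z = z_statistic(80, mu0, sigma, n)
p = right_tail_p_value(z)
cutoff = critical_value(mu0, sigma, n, alpha)
beta = type_ii_error(80, mu0, sigma, n, alpha)  # hypothetical true mean of 80

print(f"test statistic z = {z:.5f}")
print(f"P-value          = {p:.4f}")
print(f"critical value   = {cutoff:.2f}")
print(f"beta = {beta:.3f}, power = {1 - beta:.3f}")
print("Decision:", "reject H0" if p < alpha else "retain H0")
```

Comparing P with α and comparing the sample average with the critical value are two views of the same decision rule, so both lead to the same conclusion here. Lowering alpha in this sketch pushes the critical value up and makes beta larger, which illustrates the trade-off between the two error types described earlier.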