Transcript CHAPTER 10
Inference for the Slope of Regression
Chapter 14

Review
Before we begin the study of inference on regression, let’s try to remember some things about regression. Input the following into your calculator and try to answer the following questions.

Review
How do we make a scatterplot? What do we look for in a scatterplot? How do we find the correlation? What is the symbol we use for correlation? What does the correlation mean? What does R² mean? How do we create a linear model? How do we know if the linear model is appropriate? How do we create a residual plot? What do we look for in a residual plot? What does the slope mean? What does the y-intercept mean?

Inference for the Slope of an LSRL
Remember, the LSRL is the line that best fits the data using the criterion that the sum of the squares of the residuals is minimized: $\hat{y} = a + bx$. The slope b and intercept a of the least-squares line are statistics. That is, we calculate them from sample data. These statistics would take somewhat different values if we repeated the study with a different sample. To do formal inference, we think of a and b as estimates of unknown parameters. We test to determine if there is an association between the two variables. We almost always test the slope: is the true slope different from 0?

Assumptions and Conditions
The Linearity Assumption - The relationship between x and y must be linear to complete this inference procedure (this is the first and most important assumption).
Straight Enough Condition - This condition is satisfied if the scatterplot looks straight. (It is generally NOT a good idea to draw a straight line through the scatterplot when checking: graders see this as making an absolute statement, and it makes the pattern look straighter than it really is.) You can verify by checking the residuals (remember, we are looking for no clear pattern).

Assumptions and Conditions
The Independence Assumption - The data must be independent of one another.
Randomization Condition - Did the data come from a random sample?
10% Condition: Is the sample less than 10% of the population?
Independent Sample Condition: Does the residual plot show clear evidence of dependence in the data (a clear pattern)? If you notice patterns, clumps, or trends, then the data would suggest a failure of independence.

Assumptions and Conditions
Normality Assumption - The responses must vary approximately normally about the regression line.
Nearly Normal Condition - Look at the histogram or normal probability plot of the RESIDUALS.
Equal Variance Assumption - The variability of y should be the same for all values of x.
No Thickening Plot Condition - Does the scatterplot thicken? Be aware of a fan shape or other growing and shrinking parts within the scatterplot.
All four assumptions and their corresponding conditions must be checked in order to perform inference for the LSRL. You should check them in the exact order given in this lesson: 1) Linearity; 2) Independence; 3) Normality; and last 4) Equal Variance, or LINE.

Inference for the Slope of an LSRL
The values of y that we observe vary about their means according to a normal distribution. The first step in inference is to estimate the unknown parameters α, β, and σ. When the regression model describes our data and we calculate the least-squares line $\hat{y} = a + bx$, the slope b of the least-squares line is an unbiased estimator of the true slope β, and the intercept a of the least-squares line is an unbiased estimator of the true intercept α.

Slope and Intercept
Back to our previous example: We can easily find that the LSRL is $\hat{y} = 91.268 + 1.493x$ for the IQ score of infants at three years against the intensity of crying soon after birth. The slope is particularly important. A slope is a rate of change. The true slope β says how much higher average IQ is for children with one more peak in their crying measurement. Because b = 1.493 estimates the unknown β, we estimate that, on average, IQ is about 1.5 points higher for each added crying peak.
We need the intercept a = 91.27 to draw the line, but it has no statistical meaning in this case. No child had fewer than 9 crying peaks, so we have no data near x = 0. We suspect that all normal children would cry when snapped with a rubber band, so we will usually not observe x = 0.

Standard Error
Remember: residual = observed y - predicted y = $y - \hat{y}$. There are n residuals, one for each data point. Because σ is the standard deviation of responses about the true regression line, we estimate it with the standard deviation of the residuals, $s_e$ (many times just labeled s). The standard error for the slope is $SE_b$. The spread of the x-values is $s_x$. The residuals from an LSRL always have mean zero, which simplifies their standard deviation.

$SE_b = \dfrac{s_e}{s_x \sqrt{n-1}}$, where $s_e = \sqrt{\dfrac{1}{n-2} \sum_{i=1}^{n} \text{residual}_i^2} = \sqrt{\dfrac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$

The Test Statistic: Another t-test?
The null hypothesis will be that there is no true linear relationship between x and y (usually that the slope is 0; make sure to put this into the context of the problem). The most common null hypothesis is H0: β = 0. We use the test statistic

$t = \dfrac{b}{SE_b}$

In terms of a random variable T having the t distribution with df = n - 2, the P-value for a test of H0 against
Ha: β > 0 is P(T ≥ t)
Ha: β < 0 is P(T ≤ t)
Ha: β ≠ 0 is 2P(T ≥ |t|)
This is like all of the other t-tests that we have completed. The test statistic is just the standardized version of the least-squares slope b. It is another t statistic.

Confidence Intervals (CI)
A confidence interval is more useful because it shows how accurate the estimate b is. The confidence interval for β has the familiar form estimate ± t* SE_b. Because b is our estimate, the confidence interval becomes $b \pm t^* SE_b$ with df = n - 2, and (once again):

$SE_b = \dfrac{s_e}{s_x \sqrt{n-1}}$, where $s_e = \sqrt{\dfrac{1}{n-2} \sum_{i=1}^{n} \text{residual}_i^2} = \sqrt{\dfrac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$

Example 1
A study conducted on child development took a random sample of 38 children, ranging in age from early infancy to speech, to determine if crying activity could help predict a child’s intellectual development.
Determine if there is a useful linear relationship between crying activity and intellectual development.

Example 1
Step 1: Identify the population Parameter, state the null and alternative Hypotheses, determine what you are trying to do (and determine what the question is asking).
The population is babies. The parameter is the slope of the regression line. We want to determine if there is an association between crying and intellectual development.
H0: β = 0 There is no linear association between crying and intellectual development.
HA: β ≠ 0 There is a linear association between crying and intellectual development.

Example 1
Step 2: Verify the Assumptions by checking the conditions.
Linearity Assumption
Straight Enough Condition: There is no obvious bend in the scatterplot. The residual plot shows a random scatter about the line.
Independence Assumption
Randomization Condition: We are told that the children were chosen randomly.
10% Condition: The sample is less than 10% of the population.
Independent Sample Condition: The residual plot shows no clear evidence of dependence in the data (no clear pattern).

Example 1
Step 2: Verify the Assumptions by checking the conditions.
Normality Assumption
Nearly Normal Condition: A histogram of the residuals shows a right-skewed distribution. Since we have a moderately sized sample, we will proceed with caution.
Equal Variance Assumption
No Thickening Scatterplot Condition: The residual plot shows no obvious trends in the spread.

Example 1
Step 3: If conditions are met, Name the inference procedure, find the Test statistic, and Obtain the p-value in carrying out the inference.
We will use a Linear Regression t-Test for Slope. Use the calculator to determine the estimated regression equation, the t-score, r, r², df, and the p-value.
$\hat{y}$ = IQ based on crying activity: $\hat{y} = 91.268 + 1.493(\text{crying peaks})$
Test statistic: t = 3.065, df = 36
P-value: p = .0041
r = 0.455, r² = 0.207, s = 17.499

Example 1
Step 4: Make a decision and State your conclusion in context of the problem using the p-value.
With a low p-value of .0041, we reject the null hypothesis at the α = .05 level and conclude that there is strong evidence of a useful linear relationship between crying activity and intellectual development; however, r is small, implying that the association is relatively weak.

Example 2
By conventional wisdom, some people with colds avoid dairy products since they are thought to produce extra mucus. This claim was challenged by a group of researchers in Australia in 1990. They gathered a random sample of participants throughout the country and infected them with rhinovirus (the common cold). For 10 days, each participant kept track of their milk consumption and collected all of the tissues used for any nasal discharge of mucus. Once all of the tissues were gathered, the researchers extracted the nasal mucus and measured the nasal mucus secretions.

Example 2
Determine if dairy consumption increases mucus discharge when a person has a cold.

Glasses of Milk per Day    Ounces of Mucus
 0                         0.7
 0                         0.3
 1                         1.1
 1                         0.2
 1                         0.6
 2                         0.7
 3                         0.2
 3                         0.2
 4                         0.9
 6                         0.7
 7                         0.5
 8                         0.9
 9                         0.3
11                         1.0

Example 2
Step 1: Identify the population Parameter, state the null and alternative Hypotheses, determine what you are trying to do (and determine what the question is asking).
The population is people with colds in Australia. The parameter is the slope of the regression line. We want to determine if there is a positive association between milk consumption (or dairy consumption) and mucus discharge when a person has a cold. We will perform a Linear Regression t-Test on the slope:
H0: β = 0 There is no linear association between milk consumption and mucus discharge.
HA: β > 0 There is a positive linear association between milk consumption and mucus discharge.

Example 2
Step 2: Verify the Assumptions by checking the conditions.
Linearity Assumption
Straight Enough Condition: There is no obvious bend in the scatterplot. The residual plot shows a random scatter about the line.
Independence Assumption
Randomization Condition: We are told that the sample was chosen randomly.
10% Condition: The sample is less than 10% of the population.
Independent Sample Condition: The residual plot shows no clear evidence of dependence in the data (no clear pattern).

Example 2
Step 2: Verify the Assumptions by checking the conditions.
Normality Assumption
Nearly Normal Condition: A histogram of the residuals shows a bimodal distribution. Since we have a small sample size, our assumption of normality may be violated. If normality is not appropriate, our results may not be valid. We proceed with caution.
Equal Variance Assumption
No Thickening Scatterplot Condition: The residual plot shows no obvious trends in the spread.

Example 2
Step 3: If conditions are met, Name the inference procedure, find the Test statistic, and Obtain the p-value in carrying out the inference.
We will use a Linear Regression t-Test for Slope. Use the calculator to determine the estimated regression equation, the t-score, r, r², df, and the p-value.
$\hat{y}$ = Mucus Discharge: $\hat{y} = 0.5095 + 0.0208(\text{glasses of milk})$
Test statistic: t = 0.8481, df = 12
P-value: p = .2065
r = 0.2378, r² = 0.0566, s = 0.3184

Example 2
Step 4: Make a decision (reject or fail to reject H0). State your conclusion in context of the problem using the p-value.
With a p-value of 0.2065, we fail to reject the null hypothesis at the α = .05 level and conclude that there is little evidence of a positive linear relationship between dairy consumption and mucus discharge. A sample slope this large could easily arise from sampling variation alone. However, our normality assumption was violated, so our results may not be valid.
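The calculator work in Example 2 can be sketched in plain Python as a check on the formulas for $s_e$, $SE_b$, and t. This is only an illustrative sketch: the pairing of milk and mucus values is a reconstruction of the flattened data table (it is assumed here because it reproduces the reported equation), and the helper names `slope_t_test` and `t_upper_tail` are hypothetical, not calculator functions.

```python
import math

# Assumption: this pairing of the table's values is a reconstruction; it was
# chosen because it reproduces the reported line 0.5095 + 0.0208x.
glasses = [0, 0, 1, 1, 1, 2, 3, 3, 4, 6, 7, 8, 9, 11]
mucus = [0.7, 0.3, 1.1, 0.2, 0.6, 0.7, 0.2, 0.2, 0.9, 0.7, 0.5, 0.9, 0.3, 1.0]


def slope_t_test(x, y):
    """Least-squares fit plus the t statistic for H0: beta = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                      # slope
    a = y_bar - b * x_bar              # intercept
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))     # standard deviation of the residuals
    s_x = math.sqrt(sxx / (n - 1))     # spread of the x-values
    se_b = s_e / (s_x * math.sqrt(n - 1))
    return {"a": a, "b": b, "s_e": s_e, "se_b": se_b, "t": b / se_b, "df": n - 2}


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 using Simpson's rule on the t density."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = t / steps                      # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3         # area beyond t = 0.5 - area from 0 to t


fit = slope_t_test(glasses, mucus)
p_value = t_upper_tail(fit["t"], fit["df"])  # one-sided, matching HA: beta > 0
print(fit, p_value)
```

The numerical integration stands in for the calculator's built-in t distribution; a statistics library would normally be used instead.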
Example 3
Natives of Nenana, Alaska, host a contest to guess the exact minute that a wooden tripod placed on the frozen Tanana River will fall through the breaking ice. The closest guess can win up to $300,000. Determine if there is a linear relationship between year and days to fall through the ice. The following shows the data:

Year after 1900    Days      Year after 1900    Days
17                 119.48    29                 124.65
18                 130.40    30                 127.79
19                 122.61    31                 129.39
20                 131.45    32                 121.43
21                 130.28    33                 127.81
22                 131.63    34                 119.59
23                 128.08    35                 134.56
24                 131.63    36                 120.54
25                 126.77    37                 131.83
26                 115.67    38                 125.84
27                 131.24    39                 118.56
28                 124.65    40                 110.64

Example 3
Step 1: Identify the population Parameter, state the null and alternative Hypotheses, determine what you are trying to do (and determine what the question is asking).
We want to determine if there is an association between the year and the time it takes for the tripod to fall through the ice. We will perform a Linear Regression t-Test on the slope:
H0: β = 0 There is no linear association between the year and the number of days for the tripod to fall through the ice.
HA: β ≠ 0 There is a linear association between the year and the number of days for the tripod to fall through the ice.

Example 3
Step 2: Verify the Assumptions by checking the conditions.
Linearity Assumption
Straight Enough Condition: There is no obvious bend in the scatterplot. The residual plot shows a random scatter about the line.
Independence Assumption
Randomization Condition: The sample is time bound, which raises suspicions about independence. The data are not random.
10% Condition: This data is likely to be more than 10% of the population.
Independent Sample Condition: The residual plot shows no clear evidence of dependence in the data (no clear pattern).
Independence has been violated, so our conclusion may not be valid.
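As a rough sketch, the Example 3 data can be run through the lesson's formulas for the slope, $s_e$, $SE_b$, the t statistic, and a 95% interval. Two assumptions are made here: the calculator is taken to have been given full calendar years (1917 to 1940), which affects only the intercept, and the critical value t* = 2.074 for df = 22 is the one quoted in this lesson. The function name `slope_inference` is hypothetical.

```python
import math

# Year after 1900 and break-up day, as listed in the Example 3 data.
years_after_1900 = list(range(17, 41))
days = [119.48, 130.40, 122.61, 131.45, 130.28, 131.63, 128.08, 131.63,
        126.77, 115.67, 131.24, 124.65, 124.65, 127.79, 129.39, 121.43,
        127.81, 119.59, 134.56, 120.54, 131.83, 125.84, 118.56, 110.64]

# Assumption: full calendar years were used, which changes the intercept
# but leaves the slope, s_e, SE_b, and t untouched.
years = [1900 + y for y in years_after_1900]


def slope_inference(x, y, t_star):
    """Fit the LSRL and return slope inference quantities as a dict."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    s_x = math.sqrt(sxx / (n - 1))
    se_b = s_e / (s_x * math.sqrt(n - 1))
    margin = t_star * se_b
    return {
        "a": a, "b": b, "sse": sse, "s_e": s_e, "s_x": s_x, "se_b": se_b,
        "t": b / se_b, "df": n - 2, "ci": (b - margin, b + margin),
    }


result = slope_inference(years, days, t_star=2.074)
for name, value in result.items():
    print(name, value)
```

Because the interval produced this way contains 0, it agrees with a two-sided test at the 5% level that fails to reject H0: β = 0.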
Example 3
Step 2: Verify the Assumptions by checking the conditions.
Normality Assumption
Nearly Normal Condition: A histogram of the residuals shows a unimodal and symmetric distribution.
Equal Variance Assumption
No Thickening Scatterplot Condition: The residual plot shows no obvious trends in the spread.

Example 3
Step 3: If conditions are met, Name the inference procedure, find the Test statistic, and Obtain the p-value in carrying out the inference.
We will use a Linear Regression t-Test for Slope. Use the calculator to determine the estimated regression equation, the t-score, r, r², df, and the p-value.
$\hat{y}$ = Numbered day of ice break-up: $\hat{y} = 627.9704 - 0.2605(\text{year})$
Test statistic: t = -1.507, df = 22
P-value: p = .1460
r = -0.3059, r² = 0.0936, s = 5.86

Example 3
Step 4: Make a decision (reject or fail to reject H0). State your conclusion in context of the problem using the p-value.
With a p-value of 0.1460, we fail to reject the null hypothesis at the α = .05 level and conclude that there is little evidence of a linear relationship between year and the number of days until the ice breaks. A sample slope this far from 0 could plausibly arise from sampling variation alone. However, our independence assumption is suspect, so our results may not be valid.

Now construct a 95% confidence interval for β.

Example 3 (part 2)
Step 1: Identify the population Parameter that you wish to estimate and determine what you are trying to do (and determine what the question is asking).
We want to estimate the true slope, β, of the linear regression line between year and the number of days until the ice breaks with 95% confidence.

Example 3 (part 2)
Step 2: Verify the Assumptions by checking the conditions.
Linearity Assumption
Straight Enough Condition: There is no obvious bend in the scatterplot. The residual plot shows a random scatter about the line.
Independence Assumption
Independent Sample Condition: The residual plot shows no clear evidence of dependence in the data (no clear pattern).
Randomization Condition: The sample is time bound, which raises suspicions about independence. The data are not random.
10% Condition: This data is likely to be more than 10% of the population.
Independence has been violated, so our conclusion may not be valid.

Example 3 (part 2)
Step 2: Verify the Assumptions by checking the conditions.
Normality Assumption
Nearly Normal Condition: A histogram of the residuals shows a unimodal and symmetric distribution.
Equal Variance Assumption
No Thickening Scatterplot Condition: The residual plot shows no obvious trends in the spread.

Example 3 (part 2)
Step 3: Name the inference, do the work, and state the Interval.
We will use a 95% Confidence Interval for the Slope of a Linear Regression.

$s_e = \sqrt{\dfrac{\sum (y - \hat{y})^2}{n-2}} = \sqrt{\dfrac{755.49}{22}} = 5.86$

$b \pm t^* \dfrac{s_e}{s_x \sqrt{n-1}} = -0.2605 \pm 2.074 \cdot \dfrac{5.86}{7.07\sqrt{23}} = -0.2605 \pm (2.074)(0.1728) = -0.2605 \pm 0.3584 = (-0.619, 0.098)$

Example 3 (part 2)
Step 4: State your conclusion in context of the problem.
I am 95% confident that the ice has been breaking up, on average, between 0.619 days earlier and 0.098 days later each year. However, independence was violated, so our results may not be valid.

Example 3 (part 3)
Suppose data continued to be collected over the next 60 years and the results were found through the computer output below:

$\hat{y}$ = break-up day: $\widehat{\text{Date}} = 128.95 - 0.076(\text{Year since 1900})$
R-Squared = 11.3%
s = 5.673 with 91 - 2 = 89 degrees of freedom

Variable           Coeff       SE(Coeff)   t-ratio   P-Value
Intercept          128.950     1.525       84.6      0.0001
Year Since 1900    -0.07606    0.0226      -3.36     0.0012

If we performed another hypothesis test, what would be our conclusion (assuming all the assumptions are satisfied) with the new output of data?

Example 3 (part 3)
Step 4: Make a decision (reject or fail to reject H0). State your conclusion in context of the problem using the p-value.
With a p-value of 0.0012, we reject the null hypothesis at the α = .05 level and conclude that there is strong evidence that the day of ice breakup is associated with the year after 1900 and that, on average, the ice is breaking up earlier each year.

Example 3 (part 4)
$\hat{y}$ = break-up day: $\widehat{\text{Date}} = 128.95 - 0.076(\text{Year since 1900})$
R-Squared = 11.3%
s = 5.673 with 91 - 2 = 89 degrees of freedom

Variable           Coeff       SE(Coeff)   t-ratio   P-Value
Intercept          128.950     1.525       84.6      0.0001
Year Since 1900    -0.07606    0.0226      -3.36     0.0012

Now, create a 95% confidence interval for the slope and interpret your results.

$CI = b \pm t^*_{n-2} SE_b$, with $t^*_{89} = 1.987$
$CI = -0.07606 \pm 1.987(0.0226) = -0.07606 \pm 0.0449$
$CI = (-0.12096, -0.03116)$

We are 95% confident that the true slope is between -0.12096 and -0.03116; for every increase of one year, the ice appears to break up 0.03116 to 0.12096 days earlier.

Example 3 (part 4)
Using the same computer output: How would you determine the t-score and the p-value if they were not given? Verify that the t-score and p-value are correct.

$t = \dfrac{b - 0}{SE_b} = \dfrac{-0.07606 - 0}{0.0226} = -3.3655$

To find the p-value, we use the tcdf function of the calculator:
p-value = tcdf(min, max, df) = tcdf(-E99, -3.3655, 89) = 0.000575
This is a two-tailed test, so multiply by 2: p-value ≈ 0.00115

Assignment
Lesson: Chapter 27 Inference for Linear Regression
Read: Chapter 27
Problems: 1 - 37 (odd)
WS: Regression And Correlation
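The Example 3 (part 4) checks can be sketched in plain Python, starting only from the slope and standard error in the computer output. The values b = -0.07606, SE_b = 0.0226, df = 89, and t* = 1.987 are taken from the lesson; the function `t_upper_tail` is a hypothetical stand-in for the calculator's tcdf, written here as a simple numerical integration rather than a library call.

```python
import math

# Values read from the computer output in Example 3 (parts 3 and 4).
b = -0.07606      # slope coefficient for Year Since 1900
se_b = 0.0226     # SE(Coeff) for the slope
df = 89           # 91 - 2 degrees of freedom
t_star = 1.987    # critical value quoted in the lesson for 95% confidence


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 using Simpson's rule on the t density."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = t / steps                      # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3


# Test statistic for H0: beta = 0.
t_score = (b - 0) / se_b

# Two-tailed p-value: double the area beyond |t| (the t density is symmetric).
p_two_sided = 2 * t_upper_tail(abs(t_score), df)

# 95% confidence interval for the slope: b +/- t* SE_b.
margin = t_star * se_b
ci = (b - margin, b + margin)

print(t_score, p_two_sided, ci)
```

Doubling the one-tail area mirrors the "multiply by 2" step used with tcdf for a two-sided alternative.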