PowerPoint Show on linear regression


Least Squares Regression
Models
© 2010 Pearson Prentice Hall. All rights reserved
The least-squares regression model is given by
$$y_i = \beta_1 x_i + \beta_0 + \varepsilon_i$$
where
• $y_i$ is the value of the response variable for the ith individual
• $\beta_0$ and $\beta_1$ are the parameters to be estimated based on sample data
• $x_i$ is the value of the explanatory variable for the ith individual
• $\varepsilon_i$ is a random error term with mean 0 and variance $\sigma^2_{\varepsilon_i} = \sigma^2$; the error terms are independent and normally distributed
• $i = 1, \ldots, n$, where n is the sample size (number of ordered pairs in the data set)
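To make the model concrete, here is a minimal sketch that draws responses from it. The function name `simulate_slr`, the parameter values, and the x values are all hypothetical choices for illustration; none of them come from the slides.

```python
import random

def simulate_slr(xs, beta0, beta1, sigma, seed=0):
    """Draw y_i = beta0 + beta1 * x_i + eps_i, with independent eps_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical explanatory values and parameters (assumptions, not slide data).
xs = [10, 20, 30, 40, 50]
ys = simulate_slr(xs, beta0=5.0, beta1=0.5, sigma=1.0)
for x, y in zip(xs, ys):
    print(f"x = {x:>3}  y = {y:.3f}")
```

Setting `sigma=0.0` removes the error term, so every simulated point falls exactly on the line $\beta_0 + \beta_1 x$.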

Formulas for the slope and intercept estimates.
For the estimated regression equation given by the formula:
$$\hat{y}_i = b_0 + b_1 x_i$$
The slope $b_1$ is calculated by:
$$b_1 = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$$
And the intercept $b_0$ can be found with:
$$b_0 = \bar{y} - b_1 \bar{x}$$
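The slope and intercept formulas translate almost directly into code. The sketch below is one possible implementation; the function name `fit_line` and the toy data are hypothetical and are not taken from the slides.

```python
def fit_line(xs, ys):
    """Return (b0, b1) from the computational least-squares formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: (sum xy - (sum x)(sum y)/n) / (sum x^2 - (sum x)^2/n)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Intercept: y-bar - b1 * x-bar
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1

# Hypothetical toy data lying exactly on y = 1 + 2x (not from the slides).
b0, b1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)
```

Because the toy points lie exactly on a line, the fit recovers intercept 1 and slope 2, which gives a quick sanity check on the formulas.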


The standard error of the estimate, $s_e$, is found using the formula
$$s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{\sum \text{residuals}^2}{n-2}}$$
Parallel Example 2: Compute the Standard Error
Compute the standard error of the estimate for the
drilling data which is presented on the next slide.
Depth at Which Drilling Begins, x (in feet)    Time to Drill 5 Feet, y (in minutes)
35                                             5.88
50                                             5.99
75                                             6.74
95                                             6.1
120                                            7.47
130                                            6.93
145                                            6.42
155                                            7.97
160                                            7.92
175                                            7.62
185                                            6.89
190                                            7.9
Solution
Step 1: Using technology (e.g., Minitab), we find the least squares regression line
to be $\hat{y} = 0.0116x + 5.5273$.
Steps 2, 3: The predicted values as well as the residuals for the 12 observations are
given in the table on the next slide.
Depth, x    Time, y    $\hat{y}$    $y - \hat{y}$    $(y - \hat{y})^2$
35          5.88       5.9333       -0.0533          0.0028
50          5.99       6.1073       -0.1173          0.0138
75          6.74       6.3973        0.3427          0.1174
95          6.1        6.6293       -0.5293          0.2802
120         7.47       6.9193        0.5507          0.3033
130         6.93       7.0353       -0.1053          0.0111
145         6.42       7.2093       -0.7893          0.6230
155         7.97       7.3253        0.6447          0.4156
160         7.92       7.3833        0.5367          0.2880
175         7.62       7.5573        0.0627          0.0039
185         6.89       7.6733       -0.7833          0.6136
190         7.9        7.7313        0.1687          0.0285
$\sum \text{residuals}^2 = 2.7012$
Solution
Step 4: We find the sum of the squared residuals by summing the last column
of the table:
$$\sum \text{residuals}^2 = 2.7012$$
Step 5: The standard error of the estimate is then given by:
$$s_e = \sqrt{\frac{\sum \text{residuals}^2}{n-2}} = \sqrt{\frac{2.7012}{10}} = 0.5197$$
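Steps 2 through 5 can be cross-checked with a few lines of code. This sketch uses the drilling data and the rounded coefficients reported in Step 1; the variable names are illustrative choices.

```python
import math

# Drilling data from the example: depth (feet) and time to drill 5 feet (minutes).
depth = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
minutes = [5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.9]

# Rounded coefficients of the fitted line as reported in Step 1.
b0, b1 = 5.5273, 0.0116

predicted = [b0 + b1 * x for x in depth]                       # Step 2
residuals = [y - y_hat for y, y_hat in zip(minutes, predicted)]  # Step 3
sse = sum(r ** 2 for r in residuals)                           # Step 4
n = len(depth)
s_e = math.sqrt(sse / (n - 2))                                 # Step 5: divide by n - 2, not n

print(f"sum of squared residuals = {sse:.4f}")
print(f"s_e = {s_e:.4f}")
```

The results should agree with the values 2.7012 and 0.5197 above, up to rounding.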
CAUTION!
Be sure to divide by n-2 when computing the
standard error of the estimate.
Parallel Example 4: Verify That the Residuals Are Normally Distributed
Verify that the residuals from the drilling
example are normally distributed.
Conclusion: We have insufficient evidence at
the 5% level of significance to support the
claim that the residual errors from this model
are not normally distributed.
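A normality check of this kind can be approximated numerically by pairing the ordered residuals with expected normal scores, which is the idea behind a normal probability plot. The sketch below is an assumption-laden illustration, not the procedure used on the slides: the plotting positions $(i - 0.375)/(n + 0.25)$ are one common convention, and the helper names are hypothetical.

```python
from statistics import NormalDist

# Residuals from the drilling example (Parallel Example 2).
residuals = [-0.0533, -0.1173, 0.3427, -0.5293, 0.5507, -0.1053,
             -0.7893, 0.6447, 0.5367, 0.0627, -0.7833, 0.1687]

def normal_scores(n):
    """Expected z-scores for n ordered observations (assumed plotting positions)."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def correlation(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

ordered = sorted(residuals)
scores = normal_scores(len(ordered))
r = correlation(ordered, scores)
print(f"correlation between ordered residuals and normal scores: {r:.3f}")
```

A correlation close to 1 means the points of the probability plot lie near a straight line, which is consistent with normally distributed residuals. Deciding how close is close enough requires a critical value from a table, which this sketch does not supply.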
Hypothesis Test Regarding the Slope Coefficient, $\beta_1$
To test whether two quantitative variables are linearly
related, we use the following steps provided that
1. the sample is obtained using random sampling.
2. the residuals are normally distributed with
constant error variance.
Step 1: Determine the null and alternative hypotheses. The hypotheses can be
structured in one of three ways:
Two-Tailed               Left-Tailed              Right-Tailed
$H_0: \beta_1 = 0$       $H_0: \beta_1 = 0$       $H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$    $H_1: \beta_1 < 0$       $H_1: \beta_1 > 0$
Step 2: Select a level of significance, $\alpha$, depending on the seriousness of
making a Type I error.
Step 3: Compute the test statistic
$$t_0 = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{b_1}{s_{b_1}}$$
which follows Student's t-distribution with n-2 degrees of freedom. Remember, when
computing the test statistic, we assume the null hypothesis to be true. So, we assume
that $\beta_1 = 0$.
P-Value Approach
Step 4: Use Table VI to estimate the P-value using
n-2 degrees of freedom.
[Figures: P-value regions for the two-tailed, left-tailed, and right-tailed tests]
Step 5: If the P-value < $\alpha$, reject the null hypothesis.
Step 6: State the conclusion.
CAUTION!
Before testing $H_0: \beta_1 = 0$, be sure to draw a residual
plot to verify that a linear model is appropriate.
Parallel Example 5: Testing for a Linear Relation
Test the claim that there is a linear relation between drill
depth and drill time at the $\alpha = 0.05$ level of significance
using the drilling data.
Solution
Verify the requirements:
• We assume that the experiment was randomized so
that the data can be assumed to represent a random
sample.
• In Parallel Example 4 we confirmed that the residuals
were normally distributed by constructing a normal
probability plot.
• To verify the requirement of constant error variance,
we plot the residuals against the explanatory variable,
drill depth.
There is no discernible pattern in the residual plot.
Solution
Step 1: We want to determine whether a linear relation exists between drill depth
and drill time without regard to the sign of the slope. This is a two-tailed
test with
$H_0: \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$
Step 2: The level of significance is $\alpha = 0.05$.
Step 3: Using technology, we obtained an estimate of $\beta_1$ in
Parallel Example 2, $b_1 = 0.0116$. To determine the standard deviation of $b_1$,
we compute $\sum (x_i - \bar{x})^2$.
The calculations are on the next slide.
Depth, x    $x_i - \bar{x}$    $(x_i - \bar{x})^2$
35          -91.25             8326.5625
50          -76.25             5814.0625
75          -51.25             2626.5625
95          -31.25             976.5625
120         -6.25              39.0625
130         3.75               14.0625
145         18.75              351.5625
155         28.75              826.5625
160         33.75              1139.0625
175         48.75              2376.5625
185         58.75              3451.5625
190         63.75              4064.0625
$\sum (x_i - \bar{x})^2 = 30006.25$
Solution
Step 3, cont'd: We have
$$s_{b_1} = \frac{s_e}{\sqrt{\sum (x_i - \bar{x})^2}} = \frac{0.5197}{\sqrt{30006.25}} = 0.0030$$
The test statistic is
$$t_0 = \frac{b_1}{s_{b_1}} = \frac{0.0116}{0.003} = 3.867$$
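The Step 3 arithmetic can be reproduced as follows. This sketch takes $s_e$ and $b_1$ as the rounded values from Parallel Example 2; the variable names are illustrative.

```python
import math

depth = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
s_e = 0.5197   # standard error of the estimate (Parallel Example 2)
b1 = 0.0116    # slope estimate (Parallel Example 2)

x_bar = sum(depth) / len(depth)
sxx = sum((x - x_bar) ** 2 for x in depth)   # sum of squared deviations of x
s_b1 = s_e / math.sqrt(sxx)                  # standard error of the slope
t0 = b1 / s_b1                               # test statistic under H0: beta_1 = 0

print(f"x-bar = {x_bar}, sum of squared deviations = {sxx}")
print(f"s_b1 = {s_b1:.4f}, t0 = {t0:.3f}")
```

Note that the code carries the unrounded $s_{b_1}$ into the division, so its $t_0$ can differ from 3.867 in the third decimal place; the slide divides by the rounded value 0.003.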
Solution: P-Value Approach
Step 4: Since this is a two-tailed test, the P-value is the sum of the area under
the t-distribution with 12 - 2 = 10 degrees of freedom to the left of
$-t_0 = -3.867$ and to the right of $t_0 = 3.867$. Using Table VI we
find that with 10 degrees of freedom, the value
3.867 is between 3.581 and 4.144 corresponding to
right-tail areas of 0.0025 and 0.001, respectively.
Thus, the P-value is between 0.002 and 0.005.
Step 5: Since the P-value is less than the level of
significance, 0.05, we reject the null hypothesis.
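Instead of bracketing the P-value with a table, it can be approximated by integrating the t density numerically. The sketch below uses only the standard library and Simpson's rule; the function names and the step count are hypothetical choices, and a statistics package would normally be used instead.

```python
import math

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t0, df, steps=2000):
    """Two-tailed P-value: 2 * (0.5 - area under the density from 0 to |t0|).

    The area is approximated with Simpson's rule; steps must be even.
    """
    upper = abs(t0)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3
    return 2 * (0.5 - area)

p_value = two_tailed_p(3.867, 10)
print(f"P-value = {p_value:.4f}")
print("reject H0" if p_value < 0.05 else "do not reject H0")
```

The computed value should fall inside the interval obtained from Table VI, between 0.002 and 0.005.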
Solution
Step 6: There is sufficient evidence at the $\alpha = 0.05$
level of significance to conclude that a linear
relation exists between drill depth and drill
time.
Confidence Intervals for the Slope of the Regression Line
A $(1-\alpha) \cdot 100\%$ confidence interval for the slope of the
true regression line, $\beta_1$, is given by the following
formulas:
Lower bound: $b_1 - t_{\alpha/2} \cdot \frac{s_e}{\sqrt{\sum (x_i - \bar{x})^2}} = b_1 - t_{\alpha/2} \cdot s_{b_1}$
Upper bound: $b_1 + t_{\alpha/2} \cdot \frac{s_e}{\sqrt{\sum (x_i - \bar{x})^2}} = b_1 + t_{\alpha/2} \cdot s_{b_1}$
Here, $t_{\alpha/2}$ is computed using n-2 degrees of freedom.
Note: The confidence interval formula for $\beta_1$ can be
computed only if the data are randomly obtained, the
residuals are normally distributed, and there is
constant error variance.
Parallel Example 7: Constructing a Confidence Interval for
the Slope of the True Regression Line
Construct a 95% confidence interval for the slope of the
least-squares regression line for the drilling example.
Solution
The requirements for the usage of the confidence
interval formula were verified in previous
examples.
We also determined
• $b_1 = 0.0116$
• $s_{b_1} = 0.0030$
in previous examples.
Solution
Since $t_{0.025} = 2.228$ for 10 degrees of freedom, we
have
Lower bound = $0.0116 - 2.228 \cdot 0.003 = 0.0049$
Upper bound = $0.0116 + 2.228 \cdot 0.003 = 0.0183$.
We are 95% confident that the mean increase in
the time it takes to drill 5 feet for each additional
foot of depth at which the drilling begins is
between 0.005 and 0.018 minutes.
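The interval arithmetic is short enough to check directly. In this sketch the critical value 2.228 is taken from the solution above rather than computed, and the variable names are illustrative.

```python
b1 = 0.0116      # slope estimate
s_b1 = 0.003     # standard error of the slope (rounded, as in the solution)
t_crit = 2.228   # t_{0.025} with 10 degrees of freedom, taken from the solution

margin = t_crit * s_b1
lower = b1 - margin
upper = b1 + margin

print(f"95% confidence interval for the slope: ({lower:.4f}, {upper:.4f})")
```

Since the interval does not contain 0, it agrees with the earlier decision to reject $H_0: \beta_1 = 0$ at the 0.05 level.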
The Coefficient of Determination
The Coefficient of Determination is the
proportion of the variability in the response
variable that can be attributed to the least
squares regression model.
How to calculate $R^2$
Using the sum of squares technique:
$$R^2 = 1 - \frac{\sum \text{residuals}^2}{\sum (y - \bar{y})^2}$$
But for simple linear regression (SLR) models we can simplify
the calculation slightly to $R^2 = r^2$,
where r is the correlation between the
response and predictor variables.
Parallel Example 8: Calculating the Coefficient of Determination
Using technology for our drilling example we can
calculate the correlation between the response and
predictor to be 0.772822. Using the simplified
calculation for the coefficient of determination that
means:
$$R^2 = (0.772822)^2 = 0.5973$$
Interpretation: Our model using the depth at
which drilling begins as a predictor is able to
explain 59.73% of the natural variability in the
time it takes to drill 5 feet.
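Both routes to $R^2$ can be compared on the drilling data. This sketch computes the correlation from deviation sums and, separately, the sum-of-squares form using the unrounded least-squares fit; the variable names are illustrative, and `sxy / sxx` is algebraically the same slope as the computational formula given earlier.

```python
import math

depth = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
minutes = [5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.9]

n = len(depth)
x_bar = sum(depth) / n
y_bar = sum(minutes) / n
sxx = sum((x - x_bar) ** 2 for x in depth)
syy = sum((y - y_bar) ** 2 for y in minutes)   # total variation in y
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(depth, minutes))

# Route 1: square the correlation.
r = sxy / math.sqrt(sxx * syy)
r_squared_from_r = r ** 2

# Route 2: 1 - (sum of squared residuals) / (total variation), unrounded fit.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(depth, minutes))
r_squared_from_ss = 1 - sse / syy

print(f"r = {r:.6f}")
print(f"R^2 from r = {r_squared_from_r:.4f}, R^2 from sums of squares = {r_squared_from_ss:.4f}")
```

The two values agree to floating-point precision, which illustrates why the shortcut $R^2 = r^2$ is valid for a model with a single predictor.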