Least Squares Line


Summarizing Bivariate Data: Introduction to Linear Regression
Linear Relations
The relationship y = a + bx is the equation of a straight line. The value b, called the slope of the line, is the amount by which y increases when x increases by 1 unit. The value a, called the intercept (or sometimes the vertical intercept) of the line, is the height of the line above the value x = 0.
Example

[Graph: the line y = 7 + 3x for x from 0 to 8, annotated to show the intercept a = 7 and that y increases by b = 3 when x increases by 1.]
Example

[Graph: the line y = 17 - 4x for x from 0 to 8, annotated to show the intercept a = 17 and that y changes by b = -4 (i.e., decreases by 4) when x increases by 1.]
Least Squares Line
The most widely used criterion for measuring the goodness of fit of a line y = a + bx to bivariate data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) is the sum of the squared deviations about the line:

$$\sum \left[y - (a + bx)\right]^2 = \left[y_1 - (a + bx_1)\right]^2 + \cdots + \left[y_n - (a + bx_n)\right]^2$$

The line that gives the best fit to the data is the one that minimizes this sum; it is called the least squares line or sample regression line.
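As a small illustration (a Python sketch, not part of the original slides), this criterion can be computed directly for any candidate line; the least squares line is the pair (a, b) that makes the sum as small as possible:

```python
def sum_squared_deviations(x, y, a, b):
    """Sum of squared vertical deviations of the points (x_i, y_i)
    about the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Greyhound data from the slides: distance (miles) and standard fare (dollars)
x = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
y = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]

# The least squares line found later in the slides minimizes this sum...
print(sum_squared_deviations(x, y, 10.138, 0.15018))  # about 509.1
# ...and any other line does worse (these alternative values are arbitrary):
print(sum_squared_deviations(x, y, 12.0, 0.14))       # larger
```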
Coefficients a and b
The slope of the least squares line is

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

and the y intercept is

$$a = \bar{y} - b\bar{x}$$

We write the equation of the least squares line as

$$\hat{y} = a + bx$$

where the ^ above y emphasizes that ŷ (read as "y-hat") is a prediction of y resulting from the substitution of a particular x value into the equation.
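A minimal Python sketch (not part of the original slides) of these definition formulas, applied to the Greyhound data tabulated on a later slide:

```python
# Greyhound data from the slides: distance (miles) and standard fare (dollars)
x = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
y = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]

n = len(x)
x_bar = sum(x) / n                 # sample mean of x
y_bar = sum(y) / n                 # sample mean of y

# Slope: sum of cross-deviations over sum of squared x-deviations
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar              # intercept

print(b, a)  # about 0.15018 and 10.138, matching the slides
```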
Calculating Formula for b
$$b = \frac{\sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n}}{\sum x^2 - \frac{\left(\sum x\right)^2}{n}}$$
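The same slope can be obtained from this calculating formula; a short Python sketch (again using the Greyhound data, not code from the slides):

```python
# Greyhound data from the slides
x = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
y = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Calculating formula for the slope
b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
print(b)  # about 0.15018, the same slope as the definition formula
```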
Greyhound Example Continued

   x      y      x - x̄      (x - x̄)²     y - ȳ     (x - x̄)(y - ȳ)
  240    39.0    -85.615     7329.994    -20.038       1715.60
  430    81.0    104.385    10896.148     21.962       2292.45
   69    17.0   -256.615    65851.456    -42.038      10787.72
  607    96.0    281.385    79177.302     36.962      10400.41
  257    61.0    -68.615     4708.071      1.962       -134.59
  480    70.5    154.385    23834.609     11.462       1769.49
  340    65.0     14.385      206.917      5.962         85.75
  467    82.0    141.385    19989.609     22.962       3246.41
  335    67.0      9.385       88.071      7.962         74.72
  239    47.0    -86.615     7502.225    -12.038       1042.72
   95    20.0   -230.615    53183.456    -39.038       9002.87
  178    35.0   -147.615    21790.302    -24.038       3548.45
  496    87.0    170.385    29030.917     27.962       4764.22
 4233   767.5               323589.08                 48596.19
Calculations
From the previous slide, we have

$$\sum (x - \bar{x})(y - \bar{y}) = 48596.19 \qquad \text{and} \qquad \sum (x - \bar{x})^2 = 323589.08$$

So

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{48596.19}{323589.08} = 0.15018$$

Also, n = 13, Σx = 4233, and Σy = 767.5, so

$$\bar{x} = \frac{4233}{13} = 325.615 \qquad \text{and} \qquad \bar{y} = \frac{767.5}{13} = 59.0385$$

This gives

$$a = \bar{y} - b\bar{x} = 59.0385 - 0.15018(325.615) = 10.138$$

The regression line is ŷ = 10.138 + 0.15018x.
Minitab Graph
The following graph is a copy of the output from a Minitab command to graph the regression line.

[Minitab regression plot: Standard Fare = 10.1380 + 0.150179 Distance, with S = 6.80319, R-Sq = 93.5%, R-Sq(adj) = 92.9%. Standard Fare is plotted against Distance (0 to 600 miles).]
Greyhound Example Revisited
   x      y        x²         xy
  240    39.0      57600      9360
  430    81.0     184900     34830
   69    17.0       4761      1173
  607    96.0     368449     58272
  257    61.0      66049     15677
  480    70.5     230400     33840
  340    65.0     115600     22100
  467    82.0     218089     38294
  335    67.0     112225     22445
  239    47.0      57121     11233
   95    20.0       9025      1900
  178    35.0      31684      6230
  496    87.0     246016     43152
 4233   767.5    1701919    298506
Greyhound Example Revisited
Using the calculating formula we have:

n = 13, Σx = 4233, Σy = 767.5, Σx² = 1701919, and Σxy = 298506

so

$$b = \frac{\sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n}}{\sum x^2 - \frac{\left(\sum x\right)^2}{n}} = \frac{298506 - \frac{(4233)(767.5)}{13}}{1701919 - \frac{(4233)^2}{13}} = \frac{48596.19}{323589.1} = 0.15018$$

As before, a = ȳ - b x̄ = 59.0385 - 0.15018(325.615) = 10.138, and the regression line is ŷ = 10.138 + 0.15018x.

Notice that we get the same result.
Three Important Questions
To examine how useful or effective the line is in summarizing the relationship between x and y, we consider the following three questions.

1. Is a line an appropriate way to summarize the relationship between the two variables?
2. Are there any unusual aspects of the data set that we need to consider before proceeding to use the regression line to make predictions?
3. If we decide that it is reasonable to use the regression line as a basis for prediction, how accurate can we expect predictions based on the regression line to be?
Terminology
The predicted or fitted values result from substituting each sample x value into the equation for the least squares line. This gives

$$\hat{y}_1 = a + bx_1 = \text{1st predicted value}$$
$$\hat{y}_2 = a + bx_2 = \text{2nd predicted value}$$
$$\vdots$$
$$\hat{y}_n = a + bx_n = \text{nth predicted value}$$

The residuals for the least squares line are the values

$$y_1 - \hat{y}_1,\; y_2 - \hat{y}_2,\; \ldots,\; y_n - \hat{y}_n$$
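A brief Python sketch (not from the slides) that reproduces the fitted values and residuals tabulated on the next slide, using the coefficients found earlier:

```python
# Greyhound data and the least squares coefficients from the slides
x = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
y = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]
a, b = 10.138, 0.15018

fitted = [a + b * xi for xi in x]                     # predicted values y-hat
residuals = [yi - fi for yi, fi in zip(y, fitted)]    # residuals y - y-hat

for xi, yi, fi, ri in zip(x, y, fitted, residuals):
    print(f"{xi:4d}  {yi:6.1f}  {fi:8.2f}  {ri:8.3f}")
```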
Greyhound Example Continued

Predicted values from ŷ = 10.1 + 0.150x.

   x      y     Predicted value ŷ    Residual y - ŷ
  240    39.0        46.18              -7.181
  430    81.0        74.72               6.285
   69    17.0        20.50              -3.500
  607    96.0       101.30              -5.297
  257    61.0        48.73              12.266
  480    70.5        82.22             -11.724
  340    65.0        61.20               3.801
  467    82.0        80.27               1.728
  335    67.0        60.45               6.552
  239    47.0        46.03               0.969
   95    20.0        24.41              -4.405
  178    35.0        36.87              -1.870
  496    87.0        84.63               2.373
Residual Plot
A residual plot is a scatter plot of the data pairs (x, residual). The following plot was produced by Minitab from the Greyhound example.

[Minitab residual plot: residuals plotted against distance for the Greyhound data.]
Residual Plot - What to look for.
[Annotated residual plot: points scattered above and below the zero line, residual vs. x.]

Isolated points or patterns indicate potential problems. Ideally the points should be randomly spread out above and below zero. This residual plot indicates no systematic bias in using the least squares line to predict the y value. Generally, this is the kind of pattern that you would like to see.

Note:
1. Values below 0 indicate overprediction.
2. Values above 0 indicate underprediction.
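The slides show Minitab output; as a substitute sketch (matplotlib is an assumption here, not the tool used in the slides), the same kind of (x, residual) plot can be drawn in Python:

```python
import matplotlib.pyplot as plt

# Greyhound data and the least squares coefficients from the slides
x = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
y = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]
a, b = 10.138, 0.15018

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

plt.scatter(x, residuals)
plt.axhline(0)                 # reference line at residual = 0
plt.xlabel("Distance")
plt.ylabel("Residual")
plt.title("Residual plot for the Greyhound data")
plt.show()
```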
The Greyhound example continued
[Residual plot with annotations: for short distances the predicted fares are too high; for mid-range distances the predicted fares are too low.]

For the Greyhound example, it appears that the line systematically predicts fares that are too high for cities close to Rochester and fares that are too low for most cities between 200 and 500 miles.
More Residual Plots
Another common type of residual plot is a scatter plot of the data pairs (ŷ, residual). The following plot was produced by Minitab for the Greyhound data. Notice that this residual plot shows the same type of systematic problems with the model.

[Minitab residual plot: residuals plotted against the fitted values ŷ.]
Definition formulae
The total sum of squares, denoted by SSTo, is defined as

$$\text{SSTo} = (y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2 = \sum (y - \bar{y})^2$$

The residual sum of squares, denoted by SSResid, is defined as

$$\text{SSResid} = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \cdots + (y_n - \hat{y}_n)^2 = \sum (y - \hat{y})^2$$
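A small Python sketch (not from the slides) computing both quantities directly from these definitions for the Greyhound data:

```python
# Greyhound data and the least squares coefficients from the slides
x = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
y = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]
a, b = 10.138, 0.15018

y_bar = sum(y) / len(y)
ss_to = sum((yi - y_bar) ** 2 for yi in y)                        # SSTo
ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # SSResid

print(ss_to, ss_resid)  # about 7807.2 and 509.1
```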
Calculational formulae
SSTo and SSResid are generally found as part of the standard output from most statistical packages, or can be obtained using the following computational formulas:

$$\text{SSTo} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}$$

$$\text{SSResid} = \sum y^2 - a \sum y - b \sum xy$$

The coefficient of determination, r², can be computed as

$$r^2 = 1 - \frac{\text{SSResid}}{\text{SSTo}}$$
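A Python sketch (not from the slides) of these computational formulas, using sums formed from the Greyhound data:

```python
# Greyhound data from the slides
x = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
y = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]
n = len(y)
a, b = 10.1380, 0.150179

sum_y = sum(y)
sum_y2 = sum(yi ** 2 for yi in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

ss_to = sum_y2 - sum_y ** 2 / n              # SSTo
ss_resid = sum_y2 - a * sum_y - b * sum_xy   # SSResid
r_sq = 1 - ss_resid / ss_to                  # coefficient of determination

print(ss_to, ss_resid, r_sq)  # about 7807.2, 509, and 0.935
```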
Coefficient of Determination
The coefficient of determination, denoted by r², gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y.

Note that the coefficient of determination is the square of the Pearson correlation coefficient.
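As a quick check of this note (an illustration, not from the slides; statistics.correlation requires Python 3.10+):

```python
import statistics

# Greyhound data from the slides
x = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
y = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]

r = statistics.correlation(x, y)   # Pearson correlation coefficient
print(r ** 2)                      # about 0.935, matching r-squared above
```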
Greyhound Example Revisited
n = 13, Σy = 767.5, Σy² = 53119.25, and Σxy = 298506, with b = 0.150179 and a = 10.1380

$$\text{SSTo} = \sum y^2 - \frac{\left(\sum y\right)^2}{n} = 53119.25 - \frac{(767.5)^2}{13} = 7807.23$$

$$\text{SSResid} = \sum y^2 - a \sum y - b \sum xy = 53119.25 - 10.1380(767.5) - 0.150179(298506) = 509.117$$
Greyhound Example Revisited
$$r^2 = 1 - \frac{\text{SSResid}}{\text{SSTo}} = 1 - \frac{509.117}{7807.23} = 0.9348$$

We can say that 93.5% of the variation in the fare (y) can be attributed to the least squares linear relationship between distance (x) and fare.
More on variability
The standard deviation about the least squares line is denoted s_e and given by

$$s_e = \sqrt{\frac{\text{SSResid}}{n - 2}}$$

s_e is interpreted as the "typical" amount by which an observation deviates from the least squares line.
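A one-step Python sketch (not from the slides) of this formula, using the SSResid value found earlier:

```python
import math

ss_resid = 509.117   # SSResid for the Greyhound data (from the slides)
n = 13               # number of observations

s_e = math.sqrt(ss_resid / (n - 2))
print(s_e)  # about 6.80 (dollars)
```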
Greyhound Example Revisited
$$s_e = \sqrt{\frac{\text{SSResid}}{n - 2}} = \sqrt{\frac{509.117}{11}} = \$6.80$$

The "typical" deviation of an actual fare from the prediction is $6.80.
Minitab output for Regression
Regression Analysis: Standard Fare versus Distance

The regression equation is
Standard Fare = 10.1 + 0.150 Distance        <- the least squares regression line

Predictor     Coef      SE Coef     T       P
Constant      10.138    4.327       2.34    0.039    <- a
Distance      0.15018   0.01196     12.56   0.000    <- b

S = 6.803 (se)    R-Sq = 93.5% (r²)    R-Sq(adj) = 92.9%

Analysis of Variance

Source           DF    SS        MS        F         P
Regression        1    7298.1    7298.1    157.68    0.000
Residual Error   11     509.1      46.3                       <- SSResid
Total            12    7807.2                                 <- SSTo