Simple Linear Regression and Correlation


Simple Linear Regression and Correlation
• The Model
• Estimating the Coefficients
• Example 1: Used car sales
• Assessing the model
– t-tests
– R-square
The Model
• The first order linear model:
y = β0 + β1x + ε
y = dependent variable
x = independent variable
β0 = y-intercept
β1 = slope of the line (rise/run)
ε = error variable
• β0 and β1 are unknown population parameters, and are therefore estimated from the data.
[Figure: a straight line in the (x, y) plane with y-intercept β0 and slope β1 = rise/run.]
Estimating the Coefficients
• The estimates are determined by
– drawing a sample from the population of interest,
– calculating sample statistics,
– producing a straight line that runs through the data.
[Figure: a scatter plot of data points. The question is: which straight line fits best?]
The best line is the one that minimizes the sum of squared vertical differences between the points and the line.
For the four points (1, 2), (2, 4), (3, 1.5), (4, 3.2) and the candidate line y = x:
Sum of squared differences = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
The smaller the sum of squared differences, the better the fit of the line to the data.
[Figure: the four points plotted together with the candidate line.]
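The calculation above can be reproduced in a few lines of code. This is a minimal sketch; the candidate line y = x is an assumption inferred from the residuals shown in the sum:

```python
# Sum of squared vertical differences between four sample points
# and a candidate line y = x (an assumed line, for illustration only).
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def line(x):
    return x  # candidate line: y = x

ssd = sum((y - line(x)) ** 2 for x, y in points)
print(ssd)  # ≈ 7.89
```

Trying other candidate lines and comparing their sums is exactly the idea behind least squares: the fitted line is the one with the smallest such sum.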
To calculate the estimates of the coefficients that minimize the sum of squared differences between the data points and the line, use the formulas:
b1 = cov(X, Y) / s²x
b0 = ȳ − b1x̄
The regression equation that estimates the equation of the first order linear model is:
ŷ = b0 + b1x
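As a check on these formulas, a minimal sketch that computes b1 and b0 from the sample covariance and variance (reusing the four illustrative points from the earlier figure):

```python
def least_squares(xs, ys):
    """Estimate b0 and b1 via b1 = cov(X, Y) / s2x and b0 = ybar - b1 * xbar."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    # Sample covariance and sample variance (divisor n - 1).
    cov_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
    s2x = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    b1 = cov_xy / s2x
    b0 = ybar - b1 * xbar
    return b0, b1

b0, b1 = least_squares([1, 2, 3, 4], [2, 4, 1.5, 3.2])
print(b0, b1)  # ≈ 2.4 and 0.11
```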
• Example 1: Relationship between odometer reading and a used car's selling price.
– A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
– A random sample of 100 cars is selected, and the data recorded.
– Find the regression line.

Car   Odometer   Price
1     37388      5318
2     44758      5061
3     45833      5008
4     30862      5795
5     31705      5784
6     34010      5359
...   ...        ...

Independent variable x: odometer reading
Dependent variable y: selling price
• Solution
– Solving by hand
• To calculate b0 and b1 we first need to calculate several statistics:
x̄ = 36,009.45;   s²x = Σ(xᵢ − x̄)² / (n − 1) = 43,528,688
ȳ = 5,411.41;   cov(X, Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1) = −1,356,256
where n = 100.
b1 = cov(X, Y) / s²x = −1,356,256 / 43,528,688 = −.0312
b0 = ȳ − b1x̄ = 5,411.41 − (−.0312)(36,009.45) = 6,533
ŷ = b0 + b1x = 6,533 − .0312x
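Plugging the summary statistics above into the formulas reproduces the coefficients. This sketch uses only the reported summary values, not the raw data of the 100 cars:

```python
# Summary statistics reported for Example 1 (n = 100 cars).
cov_xy = -1_356_256      # cov(X, Y); negative, matching the negative slope
s2x = 43_528_688         # sample variance of the odometer readings
xbar = 36_009.45         # mean odometer reading
ybar = 5_411.41          # mean selling price

b1 = cov_xy / s2x        # slope
b0 = ybar - b1 * xbar    # intercept
print(b1, b0)  # ≈ -0.0312 and ≈ 6,533
```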
Assessing the Model
• The least squares method will produce a
regression line whether or not there is a linear
relationship between x and y.
– Are the coefficients different from zero? (T-stats)
– How closely does the line fit the data? (R-square)
• Sum of squares for errors
– This is the sum of squared differences between the points and the regression line.
– It can serve as a measure of how well the line fits the data:
SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
– A shortcut formula:
SSE = (n − 1)[s²y − (cov(X, Y))² / s²x]
– This statistic plays a role in every statistical technique we employ to assess the model.
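The direct definition and the shortcut formula give the same SSE. A quick numeric check on the four illustrative points used earlier, fitted with their own least-squares line:

```python
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
cov_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
s2x = sum((x - xbar) ** 2 for x in xs) / (n - 1)
s2y = sum((y - ybar) ** 2 for y in ys) / (n - 1)
b1 = cov_xy / s2x
b0 = ybar - b1 * xbar

# Direct definition: sum of squared residuals about the fitted line.
sse_direct = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
# Shortcut: SSE = (n - 1) * (s2y - cov(X, Y)**2 / s2x).
sse_shortcut = (n - 1) * (s2y - cov_xy ** 2 / s2x)
print(sse_direct, sse_shortcut)  # both ≈ 3.807
```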
Testing the Slope
• When no linear relationship exists between two variables, the regression line should be horizontal.
[Figure: two scatter plots.
Left — linear relationship: different inputs (x) yield different outputs (y); the slope is not equal to zero.
Right — no linear relationship: different inputs (x) yield the same output (y); the slope is equal to zero.]
• We can draw inference about β1 from b1 by testing
H0: β1 = 0
H1: β1 ≠ 0 (or < 0, or > 0)
– The test statistic is
t = (b1 − β1) / s_b1
where s_b1 = s_ε / √((n − 1)s²x) is the standard error of b1, and s_ε is the standard error of estimate.
– If the error variable is normally distributed, the statistic has a Student t distribution with d.f. = n − 2.
• Solution
– Solving by hand
• To compute t we need the values of b1 and s_b1:
b1 = −.0312
s_b1 = s_ε / √((n − 1)s²x) = 151.6 / √((99)(43,528,688)) = .00231
t = (b1 − β1) / s_b1 = (−.0312 − 0) / .00231 = −13.49
– Using the computer:

            Coefficients    Standard Error   t Stat     P-value
Intercept    6533.383035    84.51232199      77.30687   1.22E-89
Odometer    -0.031157739    0.002308896      -13.4947   4.44E-24

There is overwhelming evidence to infer that the odometer reading affects the auction selling price.
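The hand calculation can be reproduced from the reported summary values (s_ε = 151.6, n = 100, s²x = 43,528,688, and the slope from the computer output):

```python
import math

s_eps = 151.6              # standard error of estimate
n = 100
s2x = 43_528_688           # sample variance of the odometer readings
b1 = -0.031157739          # slope, from the computer output
beta1_H0 = 0               # hypothesized slope under H0

s_b1 = s_eps / math.sqrt((n - 1) * s2x)   # standard error of b1
t = (b1 - beta1_H0) / s_b1
print(s_b1, t)  # ≈ 0.00231 and ≈ -13.49
```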
Coefficient of Determination
• When we want to measure the strength of the linear relationship, we use the coefficient of determination:
R² = [cov(X, Y)]² / (s²x s²y)   or   R² = 1 − SSE / Σ(yᵢ − ȳ)²
• To understand the significance of this coefficient, note that the overall variability in y splits into the part captured by the regression model and the error.
[Figure: diagram decomposing the overall variability in y into the regression model and the error.]
Two data points (x1, y1) and (x2, y2) of a certain sample are shown.
Total variation in y = (y1 − ȳ)² + (y2 − ȳ)²
= Variation explained by the regression line + Unexplained variation (error)
= [(ŷ1 − ȳ)² + (ŷ2 − ȳ)²] + [(y1 − ŷ1)² + (y2 − ŷ2)²]
[Figure: the two points, the regression line, and the horizontal line y = ȳ.]
Variation in y = SSR + SSE
• R² measures the proportion of the variation in y that is explained by the variation in x:
R² = 1 − SSE / Σ(yᵢ − ȳ)² = [Σ(yᵢ − ȳ)² − SSE] / Σ(yᵢ − ȳ)² = SSR / Σ(yᵢ − ȳ)²
• R² takes on any value between zero and one.
R² = 1: perfect match between the line and the data points.
R² = 0: there is no linear relationship between x and y.
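The two expressions for R² agree, as a quick numeric check on the four illustrative points confirms (their R² happens to be very small, so those points show almost no linear relationship):

```python
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
cov_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
s2x = sum((x - xbar) ** 2 for x in xs) / (n - 1)
s2y = sum((y - ybar) ** 2 for y in ys) / (n - 1)

# First form: squared covariance over the product of variances.
r2_a = cov_xy ** 2 / (s2x * s2y)

# Second form: 1 - SSE / total variation in y.
b1 = cov_xy / s2x
b0 = ybar - b1 * xbar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
syy = sum((y - ybar) ** 2 for y in ys)
r2_b = 1 - sse / syy
print(r2_a, r2_b)  # both ≈ 0.0156
```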
• Example 2
– Find the coefficient of determination for Example 1; what does this statistic tell you about the model?
• Solution
– Solving by hand:
R² = [cov(X, Y)]² / (s²x s²y) = [−1,356,256]² / ((43,528,688)(64,999)) = .6501
– Using the computer. From the regression output we have:

Regression Statistics
Multiple R           0.8063
R Square             0.6501
Adjusted R Square    0.6466
Standard Error       151.57
Observations         100

65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.
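The hand calculation for Example 2 checks out against the regression output; only the reported summary values are used here:

```python
cov_xy = -1_356_256   # cov(X, Y) from Example 1
s2x = 43_528_688      # sample variance of the odometer readings
s2y = 64_999          # sample variance of the selling prices

r2 = cov_xy ** 2 / (s2x * s2y)
print(r2)  # ≈ 0.6501, matching "R Square" in the output above
```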