Linear Regression 1
Sociology 5811 Lecture 19
Copyright © 2005 by Evan Schofer. Do not copy or distribute without permission.
Announcements
• Final Project Proposals Due next week!
• Any questions?
• Today’s Class
• The linear regression model
Review: Linear Functions
• Linear functions can summarize the relationship
between two variables:
– Formula: Happy = 2 + .00005Income
• Linear functions can also be used to “predict” (estimate) a case’s value on one variable (Yi) based on its value on another variable (Xi)
• If you know the constant and slope
• “Y-hat” indicates an estimation function:
$\hat{Y}_i = a + b_{YX} X_i$
• bYX denotes the slope of Y with respect to X
Review: The Linear Regression Model
• The value of any point (Yi) can be modeled as:
$Y_i = a + b_{YX} X_i + e_i$
• The value of Y for case (i) is made up of
• A constant (a)
• A sloping function of the case’s value on
variable X (bYX)
• An error term (e), the deviation from the line
• By adding error (e), an abstract mathematical
function can be applied to real data points
Review: The Linear Regression Model
• Visually: Yi = a + bXi + ei
[Figure: scatterplot with regression line Y = 2 + .5X. For Case 7 (X = 3, Y = 5): constant a = 2, slope component bX = 3(.5) = 1.5, error e = 1.5]
Review: Estimating Linear Equations
• Question: How do we choose the best line to
describe our real data?
• Idea: The best regression line is the one with the
smallest amount of error
• The line comes as close as possible to all points
• Error is simply deviation from the regression line
• Note: to make all deviations positive, we square them, producing the “sum of squares error”:
$\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{N} e_i^2$
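• A minimal sketch of this computation in Python (not from the original lecture; the function name and data points are hypothetical):

```python
# Sum of squared errors for a candidate line y-hat = a + b*x
# (illustrative sketch; the data points are made up)
def sum_squared_error(a, b, xs, ys):
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [-3, -1, 0, 2, 4]
ys = [0.4, 1.7, 2.0, 2.9, 4.1]
print(sum_squared_error(2, 0.5, xs, ys))   # small error for Y = 2 + .5X
print(sum_squared_error(1.5, -1, xs, ys))  # much larger for Y = 1.5 - 1X
```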
Review: Estimating Linear Equations
• A poor estimation (big error)
[Figure: scatterplot with a poorly fitting line, Y = 1.5 - 1X]
Review: Estimating Linear Equations
• Better estimation (less error)
[Figure: the same scatterplot with a better-fitting line, Y = 2 + .5X]
Review: Estimating Linear Equations
• Look at the improvement (reduction) in error:
[Figures: the two estimates side by side, high error vs. low error]
Review: Estimating Linear Equations
• Goal: Find values of constant (a) and slope (b)
that produce the lowest squared error
– The “least squares” regression line
• The formula for the slope (b) that yields the “least
squares error” is:
$b_{YX} = \frac{s_{YX}}{s_X^2}$

• Where $s_X^2$ is the variance of X
• And $s_{YX}$ is the covariance of Y and X
• A concept we must now define and discuss
Covariance
• Variance: sum of squared deviations about Y-bar, divided by N-1

$s_Y^2 = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}$

• Covariance (sYX): sum of deviation about Y-bar multiplied by deviation around X-bar, divided by N-1:

$s_{YX} = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})(X_i - \bar{X})}{N-1}$
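• These formulas translate directly into code; a minimal Python sketch (my own, not the lecture's; helper names are hypothetical):

```python
# Sample mean, variance, and covariance using the N-1 formulas above
def mean(vals):
    return sum(vals) / len(vals)

def variance(ys):
    y_bar = mean(ys)
    return sum((y - y_bar) ** 2 for y in ys) / (len(ys) - 1)

def covariance(ys, xs):
    # sum of (Y deviation) * (X deviation), divided by N - 1
    y_bar, x_bar = mean(ys), mean(xs)
    return sum((y - y_bar) * (x - x_bar) for y, x in zip(ys, xs)) / (len(ys) - 1)
```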
Covariance
• Covariance: a measure of the extent to which cases' deviation in X is accompanied by deviation in Y
• It measures whether deviation (from mean) in X
tends to be accompanied by similar deviation in Y
– Or if cases with positive deviation in X have negative
deviation in Y
– This is summed up for all cases in the data
• The covariance is one numerical measure that
characterizes the extent of linear association
– As is the correlation coefficient (r)
Covariance
• Covariance: based on multiplying deviation in X and Y
[Figure: scatterplot with X-bar = -1 and Y-bar = .5 marked. One point deviates a lot from both means (deviations of 3 and 2.5): (3)(2.5) = 7.5. Another point deviates very little from X-bar and Y-bar: (.4)(-.25) = -.1]
Covariance
• Some points fall above both means (or below both means)
[Figure: scatterplot with X-bar = -1 and Y-bar = .5 marked]
• Points falling above both means (or below both means) contribute positively to the covariance: two positive (or two negative) deviations multiply to give a positive number
Covariance
• Points falling above one mean but below the other = one positive and one negative deviation
[Figure: scatterplot with X-bar = -1 and Y-bar = .5 marked]
• One positive and one negative deviation multiply to give a negative number
Covariance
• Covariance is positive if cases cluster on diagonal
from lower-left to upper-right
– Cases that deviate positively on X also deviate
positively on Y (and negative X with negative Y)
• Covariance is negative if cases cluster on
opposite diagonal (upper-left to lower-right)
– Cases with positive deviation on X are negative on Y
(and negative on X with positive on Y)
• If points are scattered all around, positives and
negatives cancel out – the covariance is near zero
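• A quick numeric check of these sign patterns, using the hypothetical covariance helper sketched above (data made up):

```python
# Covariance sign for the two clustering patterns described above
up_x, up_y = [1, 2, 3, 4], [1.2, 1.9, 3.1, 4.0]      # lower-left to upper-right
down_x, down_y = [1, 2, 3, 4], [4.1, 2.8, 2.2, 0.9]  # upper-left to lower-right
print(covariance(up_y, up_x))      # positive
print(covariance(down_y, down_x))  # negative
```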
Covariance and Slope
• Note that the covariance has properties similar to
the slope
• In fact, the covariance can be used to calculate a regression slope that minimizes squared error across all points
– The “Ordinary Least Squares” (OLS) slope
Covariance and Slope
• The slope formula can be written out as follows:
$b_{YX} = \frac{s_{YX}}{s_X^2} = \frac{\sum_{i=1}^{N}(Y_i - \bar{Y})(X_i - \bar{X}) \,/\, (N-1)}{\sum_{i=1}^{N}(X_i - \bar{X})^2 \,/\, (N-1)} = \frac{\sum_{i=1}^{N}(Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{N}(X_i - \bar{X})^2}$
Computing the Constant
• Once the slope has been calculated, it is simple to
determine the constant (a):
$a = \bar{Y} - b_{YX}\bar{X}$

• Simply plug in the values of Y-bar, X-bar, and b
• Notes:
• The calculated value of b is called a “coefficient”
• The value of a is called the “constant”
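• Putting the slope and constant formulas together, continuing the hypothetical Python helpers above:

```python
# Least-squares slope and constant from covariance and variance
def ols_fit(xs, ys):
    b = covariance(ys, xs) / variance(xs)  # b_YX = s_YX / s_X^2
    a = mean(ys) - b * mean(xs)            # a = Y-bar - b * X-bar
    return a, b
```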
Regression Example
• Example: Study time and student achievement.
– X variable: Average # hours spent studying per day
– Y variable: Score on reading test
Case   X     Y
1      2.6   28
2      1.4   13
3      .65   17
4      4.1   31
5      .25    8
6      1.9   16
[Figure: scatterplot of the six cases, test score (Y axis, 0 to 30) by hours studied (X axis, 0 to 4); X-bar = 1.8, Y-bar = 18.8]
Regression Example
• Slope = covariance (X and Y) / variance of X
– X-bar = 1.8, Y-bar = 18.8
Case   X     Y    X Dev   Y Dev   XD*YD
1      2.6   28    0.8     9.2     7.36
2      1.4   13   -0.4    -5.8     2.32
3      .65   17   -1.15   -1.8     2.07
4      4.1   31    2.3    12.2    28.06
5      .25    8   -1.55  -10.8    16.74
6      1.9   16    0.1    -2.8     -.28

Sum of X deviation * Y deviation = 56.27
Regression Example
• Calculating the Covariance:

$s_{YX} = \frac{\sum_{i=1}^{N}(Y_i - \bar{Y})(X_i - \bar{X})}{N-1} = \frac{56.27}{6-1} \approx 11.25$

• Standard deviation of X = 1.4
• Variance = square of S.D. = 1.96
• Finally:

$b_{YX} = \frac{s_{YX}}{s_X^2} = \frac{11.25}{1.96} \approx 5.7$

$a = \bar{Y} - b_{YX}\bar{X} = 18.8 - 5.7(1.8) \approx 8.5$
Regression Example
• Results: Slope b = 5.7, constant a = 8.5
• Equation: TestScore = 8.5 + 5.7*HrsStudied
• Question: What is the interpretation of b?
• Answer: For every additional hour studied, predicted test scores increase by 5.7 points
• Question: What is the interpretation of the constant?
• Answer: Individuals who studied zero hours are predicted to score 8.5 on the test.
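• A quick check of this example with the hypothetical Python helpers sketched earlier; unrounded values differ slightly from the hand calculation, which used the rounded means X-bar = 1.8 and Y-bar = 18.8:

```python
# Re-running the study-time example
xs = [2.6, 1.4, 0.65, 4.1, 0.25, 1.9]  # hours studied per day
ys = [28, 13, 17, 31, 8, 16]           # reading test scores
a, b = ols_fit(xs, ys)
print(round(a, 2), round(b, 2))  # ~8.43 and ~5.73
```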
Computing Regressions
• Regression coefficients can be calculated in SPSS
– You will rarely, if ever, do them by hand
• SPSS will estimate:
– The value of the constant (a)
– The value of the slope (b)
– Plus, a large number of related statistics and results of
hypothesis testing procedures
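• As an aside (not in the original lecture), similar estimates can be obtained outside SPSS, e.g. in Python with numpy:

```python
import numpy as np

# np.polyfit with degree 1 returns [slope, constant]
xs = np.array([2.6, 1.4, 0.65, 4.1, 0.25, 1.9])
ys = np.array([28, 13, 17, 31, 8, 16])
slope, constant = np.polyfit(xs, ys, 1)
print(constant, slope)  # ~8.43 and ~5.73, as in the example above
```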
Example: Education & Job Prestige
• Example: Years of Education versus Job Prestige
– Previously, we made an “eyeball” estimate of the line
[Figure: scatterplot of job prestige (Y axis, -40 to 100) by years of education (EDUCATN, X axis, 0 to 20), with our “eyeball” estimate line Y = 5 + 3X]
Example: Education & Job Prestige
• The actual SPSS regression results for that data:
Model Summary
Model 1: R = .521, R Square = .272, Adjusted R Square = .271, Std. Error of the Estimate = 12.40
a. Predictors: (Constant), HIGHEST YEAR OF SCHOOL COMPLETED

Coefficients(a)
                                   Unstandardized Coefficients   Standardized
                                   B        Std. Error           Beta     t        Sig.
(Constant)                         9.427    1.418                         6.648    .000
HIGHEST YEAR OF SCHOOL COMPLETED   2.487    .108                 .521     23.102   .000
a. Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE

• Estimates of a and b: “Constant” = a = 9.427; Slope for “Year of School” = b = 2.487
• Equation: Prestige = 9.4 + 2.5 Education
• A year of education adds 2.5 points of job prestige
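• A quick illustration (mine, not from the slides) of using these coefficients for prediction:

```python
# Predicted prestige at 12 vs. 16 years of education,
# using the SPSS estimates above (a = 9.427, b = 2.487)
print(9.427 + 2.487 * 12)  # ~39.3
print(9.427 + 2.487 * 16)  # ~49.2: four more years, ~10 more points
```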
Example: Education & Job Prestige
• Comparing our “eyeball” estimate to the actual
OLS regression line
[Figure: the same scatterplot of prestige by education, showing both our “eyeball” estimate Y = 5 + 3X and the actual OLS regression line computed in SPSS]
Example: Education & Job Prestige
• Much more information is provided:
Model Summary (same SPSS output as above): R = .521, R Square = .272, Adjusted R Square = .271, Std. Error of the Estimate = 12.40
• The R and R-Square indicate how well the line summarizes the data

Coefficients (same SPSS output as above): B and Std. Error for the constant and the slope, plus the standardized coefficient (Beta), t values, and significance levels
• This information allows us to do hypothesis tests about constant & slope
R-Square
• Issue: Even the “best” regression line misses data
points. We still have some error.
• Question: How good is our line at summarizing
the relationship between two variables?
– Do we have a lot of error?
– Or only a little? (i.e., the line closely estimates cases)
• Specifically, does knowledge of X help us
accurately understand values of Y?
• Solution: The R-Square statistic
– Also called “coefficient of determination”
R-Square
• Variance around Y-bar can be split into two parts:
[Figure: scatterplot with a horizontal line at Y-bar and the regression line Y = 2 + .5X; the gap between Y-bar and the line is “explained variance,” the gap between the line and a point is “error variance”]
R-Square
• The total variation of a case Yi around Y-bar can
be partitioned into two parts (like ANOVA):
• 1. Explained variance
– Also called “Regression Variance”
– The variance we predicted based on the line
• 2. Error variance
– The variance not accounted for by the line
• Summing squared deviations for all cases gives us:

$SS_{TOTAL} = SS_{REGRESSION} + SS_{ERROR}$
R-Square
• The R-Square statistic is computed as follows:
$R^2_{YX} = \frac{SS_{REGRESSION}}{SS_{TOTAL}} = \frac{s_{YX}^2}{s_X^2 s_Y^2}$
• Question: What is R-square if the line is perfect?
• (i.e., it hits every point, there is no error)
• Answer: R-square = 1.00
• Question: What is R-square if the line is NO
HELP in estimating points… (lots of error)
• Answer: R-square is zero
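• The same quantity in Python, continuing the earlier hypothetical helpers:

```python
# R-square as explained (regression) variance over total variance
def r_square(xs, ys):
    a, b = ols_fit(xs, ys)
    y_bar = mean(ys)
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return (ss_total - ss_error) / ss_total  # SS_regression / SS_total

print(round(r_square(xs, ys), 2))  # ~0.82 for the study-time example
```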
R-Square
• Properties of R-square:
• 1. Tells us the proportion of all variance in Y
that is explained as a linear function of X
– It measures “how good” our line is at predicting Y
• 2. Ranges from 0 to 1
– 1 indicates perfect prediction of Y by X
– 0 indicates that the line explains no variance in Y
• The R-square indicates how well a variable (or group of variables) accounts for variation in Y.
Interpreting R-Square
• R-square is often used as an overall indicator of the “success” of a regression model
• Higher R-square is considered “better” than lower
• How high an R-square is “good enough”?
– It varies depending on the dependent variable
– Orderly phenomena can yield R-square > .9
– “Messy”, random phenomena can yield values like .05
– Look at the literature to know what you should expect
Interpreting R-Square
• But, finding variables that produce a high R-square is not the only important goal
– Not all variables that generate high R-square are
sensible to include in a regression analysis
– Example: Suppose you want to predict annual income
• Hourly wage is a very good predictor… Because it is
tautologically linked to the dependent variable
• More sociologically interesting predictors would be social
class background, education, race, etc.
– Example: Conservatism predicts approval of Bush.