Regression Basics - University of South Florida


Regression Basics
Predicting a DV with a Single IV
Questions
• What are predictors and criteria?
• Write an equation for the linear regression. Describe each term.
• How do changes in the slope and intercept affect (move) the regression line?
• What does it mean to test the significance of the regression sum of squares? R-square?
• What is R-square?
• What does it mean to choose a regression line to satisfy the loss function of least squares?
• How do we find the slope and intercept for the regression line with a single independent variable? (Either formula for the slope is acceptable.)
• Why does testing for the regression sum of squares turn out to have the same result as testing for R-square?
Basic Ideas
• Jargon
  – IV = X = Predictor (pl. predictors)
  – DV = Y = Criterion (pl. criteria)
  – Regression of Y on X, e.g., GPA on SAT
• Linear Model = the relation between the IV and the DV is represented by a straight line.
  $Y_i = \alpha + \beta X_i + \varepsilon_i$ (population values)
• A score on Y has 2 parts: (1) a linear function of X and (2) error.
Basic Ideas (2)
• Sample value: $Y_i = a + bX_i + e_i$
• Intercept (a) – the value of Y where the line crosses X = 0
• Slope (b) – the change in Y if X changes 1 unit. Rise over run.
• If error is removed, we have a predicted value for each person at X (the line):
  $Y' = a + bX$
Suppose on average houses are worth about $75 a square foot. Then the equation relating price to size would be Y' = 0 + 75X. The predicted price for a 2000 square foot house would be $150,000.
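As a minimal sketch (not part of the original slides), the prediction line is easy to compute in code; the intercept and slope here are the house-price values from the example above:

```python
# Prediction from a regression line: Y' = a + b*X.
# a = 0 and b = 75 ($ per square foot) come from the house example above.
def predict(x, a=0.0, b=75.0):
    """Return the predicted value Y' = a + b*x."""
    return a + b * x

print(predict(2000))  # 150000.0, the predicted price of a 2000 sq ft house
```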
Linear Transformation
$Y' = a + bX$
• 1 to 1 mapping of variables via a line
• Permissible operations are addition and multiplication (interval data)
[Figure: two panels plotting Y against X. Left panel, "Add a constant": adding a constant to Y shifts the line up or down without changing its slope. Right panel, "Multiply by a constant": multiplying Y by a constant changes the slope of the line.]
Linear Transformation (2)
Centigrade to Fahrenheit: $Y' = a + bX$
• Note the 1 to 1 map
• Intercept?
• Slope?
[Figure: Degrees F plotted against Degrees C; the line passes through (0 C, 32 F) and (100 C, 212 F).]
Intercept is 32. When X (Cent) is 0, Y (Fahr) is 32.
Slope is 1.8. When Cent goes from 0 to 100 (run), Fahr goes from 32 to 212 (rise), and 212 - 32 = 180. Then 180/100 = 1.8, rise over run, is the slope. Y = 32 + 1.8X, i.e., F = 32 + 1.8C.
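A small sketch (added, not in the slides) that recovers the slope and intercept from the two anchor points, using exactly the rise-over-run reasoning above:

```python
# Rise over run from the two known points: (0 C, 32 F) and (100 C, 212 F).
c1, f1 = 0, 32      # freezing point of water
c2, f2 = 100, 212   # boiling point of water

slope = (f2 - f1) / (c2 - c1)   # rise / run = 180 / 100 = 1.8
intercept = f1 - slope * c1     # 32, the value of F when C = 0

print(f"F = {intercept} + {slope}C")  # F = 32.0 + 1.8C
```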
Review
• What are predictors and criteria?
• Write an equation for the linear regression with 1 IV. Describe each term.
• How do changes in the slope and intercept affect (move) the regression line?
Regression of Weight on Height

Ht (X)      Wt (Y)
61          105
62          120
63          120
65          160
65          120
68          145
69          175
70          160
72          185
75          210

N = 10      N = 10
M = 67      M = 150
SD = 4.57   SD = 33.99

Correlation (r) = .94
Regression equation: Y' = -316.86 + 6.97X
[Figure: scatterplot of weight against height with the fitted regression line.]
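A quick numeric check in Python (added here; not part of the slides). It recovers r, the slope, and the intercept from the raw data; any tiny differences from the slide's values are rounding:

```python
# Heights (X) and weights (Y) from the slide.
ht = [61, 62, 63, 65, 65, 68, 69, 70, 72, 75]
wt = [105, 120, 120, 160, 120, 145, 175, 160, 185, 210]

n = len(ht)
mx, my = sum(ht) / n, sum(wt) / n          # means: 67 and 150
dx = [x - mx for x in ht]                  # deviation scores for X
dy = [y - my for y in wt]                  # deviation scores for Y

sxx = sum(d * d for d in dx)               # sum of squared X deviations
syy = sum(d * d for d in dy)               # sum of squared Y deviations (10400)
sxy = sum(u * v for u, v in zip(dx, dy))   # sum of cross-products

r = sxy / (sxx * syy) ** 0.5               # correlation, ~.94
b = sxy / sxx                              # slope (same as r * SDy / SDx)
a = my - b * mx                            # intercept

print(f"r = {r:.2f}, Y' = {a:.2f} + {b:.2f}X")  # r = 0.94, Y' = -316.86 + 6.97X
```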
Illustration of the Linear Model. This concept is vital!

$Y_i = \alpha + \beta X_i + \varepsilon_i$
$Y_i = a + bX_i + e_i$
$Y' = a + bX$
$e_i = Y_i - Y_i'$

Consider Y as a deviation from the mean. Part of that deviation can be associated with X (the linear part) and part cannot (the error).
[Figure: the weight-height scatterplot with the regression line and the mean of Y, showing a point's deviation from the mean split into a linear part and an error part.]
Predicted Values & Residuals

$Y' = a + bX$
Numbers for the linear part and the error:

        Ht      Wt       Y'        Resid
        61      105      108.19    -3.19
        62      120      115.16     4.84
        63      120      122.13    -2.13
        65      160      136.06    23.94
        65      120      136.06   -16.06
        68      145      156.97   -11.97
        69      175      163.94    11.06
        70      160      170.91   -10.91
        72      185      184.84     0.16
        75      210      205.75     4.25
M       67      150      150.00     0.00
SD      4.57    33.99    31.85     11.89
V       20.89   1155.56  1014.37   141.32

Note the means of Y' and of the residuals. Note that the variance of Y is V(Y') + V(res).
[Figure: scatterplot with the regression line; the vertical distances from the points to the line are the residuals.]
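The same check extended to predicted values and residuals (added; not in the slides). It confirms that the residuals sum to zero and that V(Y) = V(Y') + V(res), within rounding of the table above:

```python
ht = [61, 62, 63, 65, 65, 68, 69, 70, 72, 75]
wt = [105, 120, 120, 160, 120, 145, 175, 160, 185, 210]

n = len(ht)
mx, my = sum(ht) / n, sum(wt) / n
b = sum((x - mx) * (y - my) for x, y in zip(ht, wt)) / sum((x - mx) ** 2 for x in ht)
a = my - b * mx

pred = [a + b * x for x in ht]                # Y' = a + bX, the linear part
resid = [y - p for y, p in zip(wt, pred)]     # e = Y - Y', the error part

def svar(v):
    """Sample variance (divide by N - 1), matching the slide's tables."""
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

print(round(sum(resid), 6))                   # ~0: residuals sum to zero
print(round(svar(wt), 2))                     # 1155.56 = V(Y)
print(round(svar(pred) + svar(resid), 2))     # 1155.56 = V(Y') + V(res)
```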
Finding the Regression Line

Need to know the correlation, SDs, and means of X and Y. The correlation is the slope when both X and Y are expressed as z scores. To translate to raw scores, just bring back the original SDs for both.

$r_{XY} = \frac{\sum z_X z_Y}{N}$
$b = r_{XY}\frac{SD_Y}{SD_X}$ (rise over run)

To find the intercept, use:
$a = \bar{Y} - b\bar{X}$

Suppose r = .50, SDX = .5, MX = 10, SDY = 2, MY = 5.
Slope: $b = .50 \times \frac{2}{.5} = 2$
Intercept: $a = 5 - 2(10) = -15$
Equation: $Y' = -15 + 2X$
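In code, the same two formulas (a sketch, using the slide's numbers):

```python
# Slope and intercept from summary statistics: r, the SDs, and the means.
r, sd_x, m_x, sd_y, m_y = .50, .5, 10, 2, 5

b = r * sd_y / sd_x   # .50 * (2 / .5) = 2.0
a = m_y - b * m_x     # 5 - 2 * 10 = -15.0

print(f"Y' = {a} + {b}X")  # Y' = -15.0 + 2.0X
```

The review exercise on the Least Squares (2) slide below can be checked with the same three lines.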
Line of Least Squares

We have some points. Assume a linear relation is reasonable, so the variables can be represented by a line. Where should the line go?
[Figure: the weight-height scatterplot with a candidate line drawn through the points.]
Place the line so errors (residuals) are small. The line we calculate has a sum of errors = 0. It has a sum of squared errors that is as small as possible; the line provides the smallest sum of squared errors, or least squares.
Least Squares (2)
Review
• What does it mean to choose a regression line to satisfy the loss function of least squares?
• What are predicted values and residuals?
• Suppose r = .25, SDX = 1, MX = 10, SDY = 2, MY = 5. What is the regression equation (line)?
Partitioning the Sum of Squares

$Y = a + bX + e = Y' + e$
$Y' = a + bX$
$e = Y - Y'$
$Y - \bar{Y} = (Y' - \bar{Y}) + (Y - Y')$

Definitions: $Y - \bar{Y} = y$, the deviation from the mean; $Y' - \bar{Y}$ is the part due to regression; $Y - Y'$ is the error.

$\sum (Y - \bar{Y})^2 = \sum [(Y' - \bar{Y}) + (Y - Y')]^2$
$\sum y^2 = \sum (Y' - \bar{Y})^2 + \sum (Y - Y')^2$ (cross products drop out)

Sum of squared deviations from the mean = sum of squares due to regression + sum of squared residuals.
Analog: SStot = SSB + SSW.
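One step the slide compresses (added here for completeness): expanding the squared bracket gives a cross-product term, and that term is zero for the least-squares line, because its residuals sum to zero and are uncorrelated with X (and hence with Y'):

$$\sum (Y-\bar{Y})^2 = \sum (Y'-\bar{Y})^2 + 2\sum (Y'-\bar{Y})(Y-Y') + \sum (Y-Y')^2, \qquad \sum (Y'-\bar{Y})(Y-Y') = 0 .$$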
Partitioning SS (2)

SSY = SSreg + SSres
$\frac{SS_Y}{SS_Y} = \frac{SS_{reg}}{SS_Y} + \frac{SS_{res}}{SS_Y}$
$1 = R^2 + (1 - R^2)$

Total SS is regression SS plus residual SS. Can also get the proportions of each. Can get variance by dividing SS by N (the tables here divide by N - 1) if you want. Proportion of total SS due to regression = proportion of total variance due to regression = $R^2$ (R-square).
Partitioning SS (3)

Wt (Y), M = 150:

Y       (Y-M)^2    Y'        Y'-M      (Y'-M)^2    Y-Y'      (Y-Y')^2
105     2025       108.19    -41.81    1748.076    -3.19     10.1761
120      900       115.16    -34.84    1213.826     4.84     23.4256
120      900       122.13    -27.87     776.7369   -2.13      4.5369
160      100       136.06    -13.94     194.3236   23.94    573.1236
120      900       136.06    -13.94     194.3236  -16.06    257.9236
145       25       156.97      6.97      48.5809  -11.97    143.2809
175      625       163.94     13.94     194.3236   11.06    122.3236
160      100       170.91     20.91     437.2281  -10.91    119.0281
185     1225       184.84     34.84    1213.826     0.16      0.0256
210     3600       205.75     55.75    3108.063     4.25     18.0625
Sum:
1500    10400      1500.01    0.01     9129.307    -0.01    1271.907
Variance:
        1155.56                        1014.37              141.32
Partitioning SS (4)

           Total      Regress    Residual
SS         10400      9129.31    1271.91
Variance   1155.56    1014.37    141.32

Proportion of SS:
$\frac{10400}{10400} = \frac{9129.31}{10400} + \frac{1271.91}{10400}$, so $1 = .88 + .12$

Proportion of variance:
$\frac{1155.56}{1155.56} = \frac{1014.37}{1155.56} + \frac{141.32}{1155.56}$, so $1 = .88 + .12$

$R^2 = .88$. Note Y' is a linear function of X, so $r_{YY'} = .94 = r_{XY}$ and $r_{Y'X} = 1$. Thus $r^2_{YY'} = .88 = R^2$. Also $r_{YE} = .35$, $r^2_{YE} = .12$, and $r_{Y'E} = 0$.
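A sketch checking these correlations numerically (added; not in the slides):

```python
# Correlations among Y, Y', and the residuals for the weight example.
ht = [61, 62, 63, 65, 65, 68, 69, 70, 72, 75]
wt = [105, 120, 120, 160, 120, 145, 175, 160, 185, 210]

n = len(ht)
mx, my = sum(ht) / n, sum(wt) / n
b = sum((x - mx) * (y - my) for x, y in zip(ht, wt)) / sum((x - mx) ** 2 for x in ht)
a = my - b * mx
pred = [a + b * x for x in ht]
resid = [y - p for y, p in zip(wt, pred)]

def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    du = [x - mu for x in u]
    dv = [x - mv for x in v]
    return sum(p * q for p, q in zip(du, dv)) / (
        sum(x * x for x in du) * sum(x * x for x in dv)) ** 0.5

print(round(corr(wt, pred), 2))       # 0.94 = r(Y, Y') = r(X, Y)
print(round(corr(wt, pred) ** 2, 2))  # 0.88 = R-square
print(round(corr(wt, resid), 2))      # 0.35; squared gives ~.12
print(round(corr(pred, resid), 6))    # ~0: Y' is uncorrelated with the error
```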
Significance Testing

Testing for the SS due to regression = testing for the variance due to regression = testing the significance of $R^2$. All are the same. $H_0: R^2_{population} = 0$.

$F = \frac{SS_{reg}/df_1}{SS_{res}/df_2} = \frac{SS_{reg}/k}{SS_{res}/(N-k-1)}$

$F = \frac{9129.31/1}{1271.91/(10-1-1)} = 57.42$

k = number of IVs (here it's 1) and N is the sample size (# people). F with k and (N-k-1) df.

Equivalent test using R-square instead of SS:

$F = \frac{R^2/k}{(1-R^2)/(N-k-1)}$

$F = \frac{.88/1}{(1-.88)/(10-1-1)} = 58.67$

Results will be the same within rounding error.
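A sketch of the same test in Python (added; assumes SciPy is installed for the p value line):

```python
from scipy import stats

# F test for the regression, using the slide's numbers.
ss_reg, ss_res = 9129.31, 1271.91
N, k = 10, 1                        # sample size and number of IVs
df1, df2 = k, N - k - 1             # 1 and 8

F_ss = (ss_reg / df1) / (ss_res / df2)
print(round(F_ss, 2))               # 57.42

R2 = .88
F_r2 = (R2 / k) / ((1 - R2) / (N - k - 1))
print(round(F_r2, 2))               # 58.67, same within rounding of R-square

print(stats.f.sf(F_ss, df1, df2))   # p value; far below .05, so reject H0
```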
Review
• What does it mean to test the significance of the regression sum of squares? R-square?
• What is R-square?
• Why does testing for the regression sum of squares turn out to have the same result as testing for R-square?