Linear regression models


Simple Linear Regression
Purposes:
• To describe the linear relationship between two continuous variables: the response variable (y-axis) and a single predictor variable (x-axis)
• To determine how much of the variation in Y can be explained by the linear relationship with X, and how much remains unexplained
• To predict new values of Y from new values of X
The linear regression model is:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
• β0 = population intercept (the value of Y when Xi = 0)
• β1 = population slope (the change in Y per unit change in X)
• εi = random or unexplained error associated with the ith observation
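To make the model concrete, here is a minimal numpy sketch that simulates data from it. The parameter values (β0 = 2, β1 = 0.5, σ = 1) and the sample size are assumptions chosen for illustration, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed population parameters (illustrative only)
beta0, beta1, sigma = 2.0, 0.5, 1.0

n = 50
x = rng.uniform(0.0, 10.0, size=n)      # predictor values X_i
eps = rng.normal(0.0, sigma, size=n)    # random errors eps_i
y = beta0 + beta1 * x + eps             # Y_i = beta0 + beta1 * X_i + eps_i
```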
Linear relationship
[Figure: the line $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ plotted in the (X, Y) plane, showing the intercept β0, the slope β1 as the rise per unit run, and the error εi as the vertical distance from the point (Xi, Yi) to the line.]
Linear models may approximate non-linear functions over a limited domain
[Figure: the fitted line estimates the mean response $\mu_{Y_i} = \beta_0 + \beta_1 x_i$; predictions between the observed values x1 and x2 are interpolation, predictions outside that range are extrapolation.]
Fitting data to a linear model
[Figure: for each observation (Xi, Yi), the residual is the vertical distance between the observed and fitted values: $Y_i - \hat{Y}_i = \varepsilon_i$.]
The squared residual:
$$d_i^2 = (Y_i - \hat{Y}_i)^2$$
The residual sum of squares:
$$RSS = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
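In code, the RSS is a one-liner; this sketch assumes `y` and `y_hat` are equal-length arrays of observed and fitted values.

```python
import numpy as np

def rss(y, y_hat):
    """Residual sum of squares: sum over i of (Y_i - Yhat_i)^2."""
    resid = np.asarray(y, float) - np.asarray(y_hat, float)
    return np.sum(resid ** 2)
```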
Estimating Regression Parameters
• The “best fit” estimates of the regression population parameters (β0 and β1) are the values that minimize the residual sum of squares (RSS) between each observed value and the predicted value of the model:
$$\text{minimize} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
Sum of squares
$$SS_Y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (Y_i - \bar{Y})(Y_i - \bar{Y})$$
Sum of cross products
$$SS_{XY} = \sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})$$
Sample variance
$$s_Y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})(Y_i - \bar{Y})$$
Covariance
$$s_{XY} = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})$$
Least-squares parameter estimates
$$\hat{\beta}_1 = \frac{SS_{XY}}{SS_X} = \frac{s_{XY}}{s_X^2}$$
where
$$SS_X = \sum_{i=1}^{n} (X_i - \bar{X})^2$$
Solving for the intercept:
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
Thus, our estimated regression equation is:
$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$
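Putting the two estimators together, a minimal least-squares fit looks like the sketch below (the helper function is hypothetical, not a library API).

```python
import numpy as np

def fit_ols(x, y):
    """Least-squares estimates: b1 = SS_XY / SS_X, b0 = Ybar - b1 * Xbar."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    ss_x = np.sum((x - x.mean()) ** 2)                 # SS_X
    ss_xy = np.sum((y - y.mean()) * (x - x.mean()))    # SS_XY
    b1 = ss_xy / ss_x                                  # slope estimate
    b0 = y.mean() - b1 * x.mean()                      # intercept estimate
    return b0, b1
```

As a check, `np.polyfit(x, y, 1)` should return the same two estimates (slope first, then intercept).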
Hypothesis Tests with Regression
• Null hypothesis is that there is no linear
relationship between X and Y:
H0: β1 = 0 ⇒ Yi = β0 + εi
HA: β1 ≠ 0 ⇒ Yi = β0 + β1 Xi + εi
• We can use an F-ratio (i.e., the ratio of
variances) to test these hypotheses
Variance of the error of regression:
$$\hat{\sigma}^2 = \frac{RSS}{n-2} = \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}$$
NOTE: this is also referred to as the residual variance, the mean squared error (MSE), or the residual mean square (MSResidual)
Mean square of regression:
$$MS_{Regression} = \frac{SS_{Regression}}{1} = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{1}$$
The F-ratio is MSRegression / MSResidual, with 1 and n − 2 degrees of freedom.
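A sketch of the full F-test, using `scipy.stats.f` for the upper-tail p-value; the function name is hypothetical.

```python
import numpy as np
from scipy import stats

def regression_f_test(x, y):
    """F-ratio for H0: beta1 = 0, with df = 1 and n - 2."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    n = len(y)
    b1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x
    ms_reg = np.sum((y_hat - y.mean()) ** 2) / 1     # MS_Regression
    ms_res = np.sum((y - y_hat) ** 2) / (n - 2)      # MS_Residual (MSE)
    f = ms_reg / ms_res
    p = stats.f.sf(f, 1, n - 2)                      # upper-tail p-value
    return f, p
```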
Variance components and the Coefficient of Determination (r² or R²)
$$SS_Y = SS_{reg} + RSS$$
$$SS_{reg} = SS_Y - RSS$$
Coefficient of determination:
$$r^2 = \frac{SS_{reg}}{SS_Y} = \frac{SS_{reg}}{SS_{reg} + RSS}$$
Pearson’s product-moment correlation coefficient (r)
$$r = \frac{SS_{XY}}{\sqrt{SS_X \, SS_Y}} = \frac{s_{XY}}{s_X s_Y}$$
ANOVA table for regression

Source      df     Sum of squares        Mean square    Expected mean square    F-ratio
Regression  1      SSreg = Σ(Ŷi − Ȳ)²    SSreg / 1      σ² + β1² Σ(Xi − X̄)²     (SSreg / 1) / (RSS / (n − 2))
Residual    n − 2  RSS = Σ(Yi − Ŷi)²     RSS / (n − 2)  σ²
Total       n − 1  SSY = Σ(Yi − Ȳ)²      SSY / (n − 1)  σY²

(all sums run over i = 1 … n)
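A sketch that assembles these rows numerically; the function is hypothetical and returns (df, SS, MS, F, p) per source, with F and p only on the regression row.

```python
import numpy as np
from scipy import stats

def anova_table(x, y):
    """Rows of the regression ANOVA table: (df, SS, MS, F, p)."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    n = len(y)
    b1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
    y_hat = y.mean() + b1 * (x - x.mean())   # fitted values
    ss_reg = np.sum((y_hat - y.mean()) ** 2)
    rss = np.sum((y - y_hat) ** 2)
    ss_y = np.sum((y - y.mean()) ** 2)       # SS_Y = SS_reg + RSS
    f = (ss_reg / 1) / (rss / (n - 2))
    p = stats.f.sf(f, 1, n - 2)
    return {
        "Regression": (1, ss_reg, ss_reg / 1, f, p),
        "Residual": (n - 2, rss, rss / (n - 2), None, None),
        "Total": (n - 1, ss_y, ss_y / (n - 1), None, None),
    }
```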
Publication form of ANOVA table for regression

Source      Sum of Squares  df   Mean Square  F       Sig.
Regression  11.479          1    11.479       21.044  0.00035
Residual    8.182           15   0.545
Total       19.661          16
Variance of estimated intercept:
$$\hat{\sigma}_{\hat{\beta}_0}^2 = \hat{\sigma}^2 \left[ \frac{1}{n} + \frac{\bar{X}^2}{SS_X} \right]$$
$$\hat{\beta}_0 - t_{\alpha,n-2}\,\hat{\sigma}_{\hat{\beta}_0} \le \beta_0 \le \hat{\beta}_0 + t_{\alpha,n-2}\,\hat{\sigma}_{\hat{\beta}_0}$$
Variance of the slope estimator:
$$\hat{\sigma}_{\hat{\beta}_1}^2 = \frac{\hat{\sigma}^2}{SS_X}$$
$$\hat{\beta}_1 - t_{\alpha,n-2}\,\hat{\sigma}_{\hat{\beta}_1} \le \beta_1 \le \hat{\beta}_1 + t_{\alpha,n-2}\,\hat{\sigma}_{\hat{\beta}_1}$$
Variance of the fitted value:
$$\hat{\sigma}^2(\hat{Y}|X) = \hat{\sigma}^2 \left[ \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SS_X} \right]$$
$$\hat{Y} - t_{\alpha,n-2}\,\hat{\sigma}(\hat{Y}|X) \le \mu_{Y|X} \le \hat{Y} + t_{\alpha,n-2}\,\hat{\sigma}(\hat{Y}|X)$$
(the interval brackets the true mean response μY|X at the given X)
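The three intervals share the same ingredients (σ̂², SS_X, and a t quantile), so one sketch covers them. The function name is hypothetical; `t` here is the two-sided critical value t(α/2, n−2), and `x_new` is an optional point at which to interval-estimate the fitted mean.

```python
import numpy as np
from scipy import stats

def ols_confidence_intervals(x, y, alpha=0.05, x_new=None):
    """CIs for the intercept, the slope, and optionally the fitted mean at x_new."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    n = len(y)
    ss_x = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((y - y.mean()) * (x - x.mean())) / ss_x
    b0 = y.mean() - b1 * x.mean()
    sigma2 = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)   # sigma-hat^2 (MSE)
    t = stats.t.ppf(1 - alpha / 2, n - 2)                 # two-sided critical value
    se_b0 = np.sqrt(sigma2 * (1 / n + x.mean() ** 2 / ss_x))
    se_b1 = np.sqrt(sigma2 / ss_x)
    out = {"beta0": (b0 - t * se_b0, b0 + t * se_b0),
           "beta1": (b1 - t * se_b1, b1 + t * se_b1)}
    if x_new is not None:
        se_fit = np.sqrt(sigma2 * (1 / n + (x_new - x.mean()) ** 2 / ss_x))
        fit = b0 + b1 * x_new
        out["fit"] = (fit - t * se_fit, fit + t * se_fit)
    return out
```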
Regression
[Figure: scatterplot of the species-area data with the fitted regression line; x-axis: Ln(Island Area), roughly −2 to 10; y-axis ticks 1 to 8.]
Assumptions of regression
• The linear model correctly describes the functional relationship between X and Y
• The X variable is measured without error
• For a given value of X, the sampled Y values are independent with normally distributed errors
• Variances are constant along the regression line
Residual plot for species-area relationship
[Figure: residuals (−1.5 to 1.5) plotted against the unstandardized predicted value (2.5 to 6.0).]
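A minimal matplotlib sketch of such a diagnostic plot, using simulated data in place of the species-area dataset (the generating values are assumptions).

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated stand-in for the species-area data (assumed values)
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 10.0, size=40)            # e.g., Ln(Island Area)
y = 2.0 + 0.4 * x + rng.normal(0.0, 0.7, 40)    # assumed linear relationship

# Fit by least squares and compute residuals
b1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

# Residuals vs. fitted values: a flat, patternless band supports the assumptions
plt.scatter(y_hat, y - y_hat)
plt.axhline(0.0, linestyle="--")
plt.xlabel("Unstandardized Predicted Value")
plt.ylabel("Residual")
plt.title("Residual plot")
plt.show()
```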