R tutorial - Blog of Applied Algorithm Lab., KAIST


Linear Regression using R
Sungsu Lim
Applied Algorithm Lab.
KAIST
1/35
Regression
• Regression analysis answers questions about the dependence of a response variable on one or more predictors, including prediction of future values of the response, discovering which predictors are important, and estimating the impact of changing a predictor or a treatment on the value of the response.
• In linear regression, the unknown model parameters are estimated from the data, and the mean function is linear in those parameters.
(Usually, it models the conditional mean of Y given the value of X.)
2/35
Correlation Coefficient
• The correlation coefficient between two random variables X and Y is defined as
$\rho_{X,Y} = \dfrac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}$
• If we have a series of n measurements of X and Y, then the sample correlation coefficient is defined as
$r = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$
• It has a value between -1 and 1, and it indicates the degree of linear dependence between the variables. It detects only linear dependencies between two variables. (A quick numerical check follows.)
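The sample formula is easy to verify against R's built-in cor(); a minimal sketch on two toy vectors (not from the slides):
> x<-c(1.2,2.5,3.1,4.8,5.0)
> y<-c(2.0,2.9,3.8,5.2,5.1)
> r<-sum((x-mean(x))*(y-mean(y)))/sqrt(sum((x-mean(x))^2)*sum((y-mean(y))^2))
> all.equal(r,cor(x,y))
# TRUE: both compute the same quantity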
3/35
Example
> install.packages("alr3")   # installing a package
> library(alr3)              # loading a package
> data(fuel2001)             # loading a specific data set
> fueldata<-fuel2001[,1:5]
> fueldata[,1]<-fuel2001$Tax
> fueldata[,2]<-1000*fuel2001$Drivers/fuel2001$Pop
> fueldata[,3]<-fuel2001$Income/1000
> fueldata[,4]<-log(fuel2001$Miles,2)
> fueldata[,5]<-1000*fuel2001$FuelC/fuel2001$Pop
> colnames(fueldata)<-c("Tax","Dlic","Income","logMiles","Fuel")
4/35
Example
> cor(fueldata)
                 Tax        Dlic      Income    logMiles       Fuel
Tax       1.00000000 -0.08584424 -0.01068494 -0.04373696 -0.2594471
Dlic     -0.08584424  1.00000000 -0.17596063  0.03059068  0.4685063
Income   -0.01068494 -0.17596063  1.00000000 -0.29585136 -0.4644050
logMiles -0.04373696  0.03059068 -0.29585136  1.00000000  0.4220323
Fuel     -0.25944711  0.46850627 -0.46440498  0.42203233  1.0000000
> round(cor(fueldata),2)
           Tax  Dlic Income logMiles  Fuel
Tax       1.00 -0.09  -0.01    -0.04 -0.26
Dlic     -0.09  1.00  -0.18     0.03  0.47
Income   -0.01 -0.18   1.00    -0.30 -0.46
logMiles -0.04  0.03  -0.30     1.00  0.42
Fuel     -0.26  0.47  -0.46     0.42  1.00
> cor(fueldata$Dlic,fueldata$Fuel)
[1] 0.4685063
5/35
Example
> pairs(fuel2001)   # scatterplot matrix of all pairs of variables
6/35
Simple Linear Regression
• We make n paired observations on two variables:
$(x_1, y_1), \ldots, (x_n, y_n)$
• The objective is to test for a linear relationship between them,
$y_i = \underbrace{\beta_0 + \beta_1 x_i}_{\text{"predictable"}} + \underbrace{\varepsilon_i}_{\text{"random" error}}$
• How to quantify a good fit?
The least squares approach: choose $\beta = (\beta_0, \beta_1)'$ to minimize
$L(\beta) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$
7/35
Simple Linear Regression
• $L(\beta)$ is the sum of squared errors (SSE).
• It is minimized by solving $L'(\beta) = 0$, which gives
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ and $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$
• If we assume $\varepsilon_i \sim N(0, \sigma^2)$ i.i.d. (independently & identically distributed), then the least squares estimates are also the MLE (maximum likelihood estimates).
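These closed-form estimates can be checked against lm(); a minimal sketch, reusing the toy vectors x and y from the earlier correlation check:
> b1<-sum((x-mean(x))*(y-mean(y)))/sum((x-mean(x))^2)
> b0<-mean(y)-b1*mean(x)
> c(b0,b1)
> coef(lm(y~x))
# the two pairs of coefficients agree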
8/35
Simple Linear Regression
• Assumptions of the linear model
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
1. Errors $\varepsilon_i \sim N(0, \sigma^2)$ (normality of errors).
2. Error variances are equal (homoscedasticity).
3. Errors are independent (independence of errors).
4. Y has a linear dependence on X.
9/35
Example
> library(alr3)
> data(forbes)
> forbes
Temp Pressure Lpres
1 194.5 20.79 131.79
…
17 212.2 30.06 147.80
> g<-lm(Lpres~Temp,data=forbes)
> g
Call:
lm(formula = Lpres ~ Temp, data = forbes)

Coefficients:
(Intercept)         Temp
   -42.1378       0.8955

> plot(forbes$Temp,forbes$Lpres)
> abline(g$coeff,col="red")   # overlay the fitted line
10/35
Example
> summary(g)
Call:
lm(formula = Lpres ~ Temp, data = forbes)
Residuals:
     Min       1Q   Median       3Q      Max
-0.32220 -0.14473 -0.06664  0.02184  1.35978

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -42.13778    3.34020  -12.62 2.18e-09 ***
Temp          0.89549    0.01645   54.43  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.379 on 15 degrees of freedom
Multiple R-squared: 0.995, Adjusted R-squared: 0.9946
F-statistic: 2963 on 1 and 15 DF, p-value: < 2.2e-16
11/35
Multiple Linear Regression
• Assumptions of the linear model
$E(Y \mid X) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$
$\mathrm{Var}(Y \mid X) = \sigma^2$
1. Errors $\varepsilon_i \sim N(0, \sigma^2)$.
2. Error variances are equal.
3. Errors are independent.
4. Y has a linear dependence on X.
12/35
Multiple Linear Regression
• Using the matrix representation,
$y_j = \beta_0 + \beta_1 x_{j1} + \cdots + \beta_p x_{jp} + e_j, \quad j = 1, \ldots, n, \qquad e_j \sim \mathrm{NID}(0, \sigma^2)$

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}$$

$Y = X\beta + e, \qquad e \sim N(0, \sigma^2 I_n)$
13/35
Multiple Linear Regression
• The residual sum of squares is
$L(\beta) = \sum_{i=1}^{n} (y_i - x_i'\beta)^2 = (Y - X\beta)'(Y - X\beta) = e'e$
• We can compute the $\hat{\beta}$ that minimizes $L(\beta)$ by using the matrix representation: the OLS (ordinary least squares) estimates.
$L(\beta) = (Y - X\beta)'(Y - X\beta) = Y'Y - \beta'X'Y - Y'X\beta + \beta'X'X\beta = Y'Y + \beta'X'X\beta - 2\beta'X'Y$
$\dfrac{\partial L(\beta)}{\partial \beta}\Big|_{\hat{\beta}} = 2X'X\hat{\beta} - 2X'Y = 0$
(matrix version of the normal equations)
$\Rightarrow \hat{\beta} = (X'X)^{-1}X'Y$
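The closed form can be verified numerically against lm(); a sketch assuming the fueldata frame built earlier:
> X<-model.matrix(Fuel~.,data=fueldata)
> Y<-fueldata$Fuel
> betahat<-solve(t(X)%*%X)%*%t(X)%*%Y
> cbind(betahat,coef(lm(Fuel~.,data=fueldata)))
# the two columns match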
14/35
Multiple Linear Regression
• At the minimum of SSE = $e'e$, the residuals satisfy $X'e = 0$; that is, the residual vector is orthogonal to every column of X.
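This orthogonality is easy to confirm numerically; a sketch assuming the fueldata fit:
> g<-lm(Fuel~.,data=fueldata)
> t(model.matrix(g))%*%resid(g)
# every entry is zero up to floating-point rounding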
15/35
Multiple Linear Regression
• Fact: $\hat{\sigma}^2 = \dfrac{\text{SSE}}{n-(p+1)}$ (= MSE) is an unbiased estimator of $\sigma^2$.
• If e is normally distributed,
$\dfrac{\text{SSE}}{\sigma^2} = \dfrac{(n-(p+1))\,\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-(p+1))$
• Define SSreg = SYY - SSE (SYY = the total sum of squares of Y).
As with the simple regression, the coefficient of determination is
$R^2 = \dfrac{\text{SSreg}}{\text{SYY}} = 1 - \dfrac{\text{SSE}}{\text{SYY}}$
It is also called the multiple correlation coefficient because it is the maximum of the correlation between Y and any linear combination of the terms in the mean function.
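The R-squared reported by summary() can be reproduced from these quantities; a sketch assuming fueldata:
> g<-lm(Fuel~.,data=fueldata)
> SSE<-sum(resid(g)^2)
> SYY<-sum((fueldata$Fuel-mean(fueldata$Fuel))^2)
> 1-SSE/SYY
# matches the Multiple R-squared (about 0.51) on the next slide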
16/35
Example
> summary(lm(Fuel~.,data=fueldata))
# How can we analyze these results?
Call:
lm(formula = Fuel ~ ., data = fueldata)

Residuals:
     Min       1Q   Median       3Q      Max
-163.145  -33.039    5.895   31.989  183.499

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 154.1928   194.9062   0.791 0.432938
Tax          -4.2280     2.0301  -2.083 0.042873 *
Dlic          0.4719     0.1285   3.672 0.000626 ***
Income       -6.1353     2.1936  -2.797 0.007508 **
logMiles     18.5453     6.4722   2.865 0.006259 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 64.89 on 46 degrees of freedom
Multiple R-squared: 0.5105, Adjusted R-squared: 0.4679
F-statistic: 11.99 on 4 and 46 DF, p-value: 9.33e-07

The adjusted R-squared reported here is
$R^2_{adj} = 1 - \dfrac{\text{RSS}/(n-p-1)}{\text{SYY}/(n-1)} = R^2 - \left[\dfrac{p}{n-p-1}\right](1-R^2)$
17/35
t-test
• We want to test
$H_0: \beta_i = b_i$ versus $H_1: \beta_i \neq b_i$
• Assume $e \sim N(0, \sigma^2 I)$; then $\hat{\beta}_i \sim N(\beta_i, c_{ii}\sigma^2)$, where $c_{ii} = [(X'X)^{-1}]_{ii}$.
• Since $\hat{\beta}_i \sim N(\beta_i, c_{ii}\sigma^2)$ and $\dfrac{\text{RSS}}{\sigma^2} = \dfrac{(n-p-1)\,\text{MSE}}{\sigma^2} \sim \chi^2(n-p-1)$,
we have
$t = \dfrac{\hat{\beta}_i - \beta_i}{\sqrt{c_{ii}\,\text{MSE}}} \sim t(n-p-1)$
18/35
t-test
• Hypothesis concerning one of the terms:
$H_0: \beta_i = b_i$ versus $H_1: \beta_i \neq b_i$
• t-test statistic: $t = \dfrac{\hat{\beta}_i - b_i}{\sqrt{c_{ii}\,\text{MSE}}} \sim t(n-p-1)$
• If $H_0$ is true, $t \sim t(n-p-1)$,
so we reject $H_0$ at level $\alpha$ if $|t| \geq t_{\alpha/2}(n-p-1)$.
The confidence interval for $\beta_i$ is $\hat{\beta}_i \pm t_{\alpha/2}\sqrt{c_{ii}\,\text{MSE}}$.
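The t values in the summary() table and the corresponding confidence intervals can be reproduced directly; a sketch assuming the fueldata fit:
> g<-lm(Fuel~.,data=fueldata)
> est<-coef(summary(g))[,"Estimate"]
> se<-coef(summary(g))[,"Std. Error"]
> est/se
# reproduces the t value column (testing H0: beta_i = 0)
> confint(g,level=0.95)
# interval estimates for each beta_i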
19/35
Example: t-test
> summary(lm(Fuel~.,data=fueldata))
Call:
lm(formula = Fuel ~ ., data = fueldata)

Residuals:
     Min       1Q   Median       3Q      Max
-163.145  -33.039    5.895   31.989  183.499

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 154.1928   194.9062   0.791 0.432938
Tax          -4.2280     2.0301  -2.083 0.042873 *
Dlic          0.4719     0.1285   3.672 0.000626 ***
Income       -6.1353     2.1936  -2.797 0.007508 **
logMiles     18.5453     6.4722   2.865 0.006259 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 64.89 on 46 degrees of freedom
Multiple R-squared: 0.5105, Adjusted R-squared: 0.4679
F-statistic: 11.99 on 4 and 46 DF, p-value: 9.33e-07
20/35
F-test
• We refer to the full model with all the predictors as the complete model. A model containing only some of these predictors is called the reduced model. (It is nested within the complete model.)
• Testing whether the complete model is identical to the reduced
model is equivalent to testing whether the extra parameters in
the complete model equal 0. (none of the extra variables
increases the explained variability in Y)
21/35
F-test
• We may assume:
full: $E(Y \mid X = x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r + \cdots + \beta_p x_p$
reduced: $E(Y \mid X = x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r$
• Hypothesis test for the reduced model:
$H_0: \beta_{r+1} = \cdots = \beta_p = 0$ versus $H_1: \text{not } H_0$
• When $H_0$ is true,
$\dfrac{\text{SSE}_R}{\sigma^2} \sim \chi^2(n-r-1), \qquad \dfrac{\text{SSE}_F}{\sigma^2} \sim \chi^2(n-p-1),$
$\dfrac{\text{SSE}_R}{\sigma^2} - \dfrac{\text{SSE}_F}{\sigma^2} = \dfrac{\text{SSE}_R - \text{SSE}_F}{\sigma^2} \sim \chi^2(p-r)$
22/35
F-test
• Hypothesis test for the reduced model:
$H_0: \beta_{r+1} = \cdots = \beta_p = 0$ versus $H_1: \text{not } H_0$
• F test statistic:
$F = \left(\dfrac{\text{SSE}_R - \text{SSE}_F}{p-r}\right) \Big/ \left(\dfrac{\text{SSE}_F}{n-p-1}\right)$
• If $H_0$ is true, $F \sim F(p-r, n-p-1)$,
so we reject $H_0$ at level $\alpha$ if $F \geq F_\alpha(p-r, n-p-1)$.
• From this test, we conclude whether the null hypothesis is plausible and, accordingly, which model is adequate.
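The partial F statistic can be computed by hand from two fits; a sketch using the full and reduced fuel models that reappear in the slide 27 example:
> full<-lm(Fuel~.,data=fueldata)
> reduced<-lm(Fuel~.-Tax,data=fueldata)
> SSEF<-sum(resid(full)^2)
> SSER<-sum(resid(reduced)^2)
> Fstat<-((SSER-SSEF)/1)/(SSEF/df.residual(full))
> Fstat
# about 4.34; compare the anova() output on slide 27
> 1-pf(Fstat,1,df.residual(full))
# p-value, about 0.043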
23/35
Example: F-test
> summary(lm(Fuel~.,data=fueldata))
Call:
lm(formula = Fuel ~ ., data = fueldata)

Residuals:
     Min       1Q   Median       3Q      Max
-163.145  -33.039    5.895   31.989  183.499

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 154.1928   194.9062   0.791 0.432938
Tax          -4.2280     2.0301  -2.083 0.042873 *
Dlic          0.4719     0.1285   3.672 0.000626 ***
Income       -6.1353     2.1936  -2.797 0.007508 **
logMiles     18.5453     6.4722   2.865 0.006259 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 64.89 on 46 degrees of freedom
Multiple R-squared: 0.5105, Adjusted R-squared: 0.4679
F-statistic: 11.99 on 4 and 46 DF, p-value: 9.33e-07
24/35
ANOVA
• In the analysis of variance, the mean function with all the terms,
$E(Y \mid X = x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$,
is compared with the mean function that includes only an intercept,
$E(Y \mid X = x) = \beta_0$.
• For the second case, $\hat{\beta}_0 = \bar{y}$ and the residual sum of squares is SYY.
• We have SSE $\le$ SYY, and the difference between these two, SSreg = SYY - SSE, is the variability explained by the larger mean function that is not explained by the smaller mean function.
25/35
ANOVA
• By an F-test, we measure the goodness of fit of the regression model.

Source       df         SS       MS                     F            P-value
Regression   p          SSreg    MSreg = SSreg/p        MSreg/MSE    from F(p, n-(p+1))
Residual     n-(p+1)    SSE      MSE = SSE/(n-(p+1))
Total        n-1        SYY
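The first row of this table is the overall F-test that summary() reports; it can also be obtained by comparing the intercept-only model with the full model. A sketch assuming fueldata:
> anova(lm(Fuel~1,data=fueldata),lm(Fuel~.,data=fueldata))
# F = 11.99 on 4 and 46 df, matching the F-statistic line in summary()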
26/35
Example: ANOVA
NH: $\beta_1 = 0$ (the coefficient of Tax)
AH: not NH
> g1<-lm(Fuel~.-Tax,data=fueldata)
> g2<-lm(Fuel~.,data=fueldata)
> anova(g1,g2)
# Reduced model vs full model
Analysis of Variance Table

Model 1: Fuel ~ (Tax + Dlic + Income + logMiles) - Tax
Model 2: Fuel ~ Tax + Dlic + Income + logMiles
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1     47 211964
2     46 193700  1     18264 4.3373 0.04287 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
• Since the p-value is 0.04 (< 0.05), we have modest evidence that the coefficient for Tax is different from 0. This is called a partial F-test.
27/35
Example: sequential ANOVA
> f0<-lm(Fuel~1,data=fueldata)
> f1<-lm(Fuel~Dlic,data=fueldata)
> f2<-lm(Fuel~Dlic+Tax,data=fueldata)
> f3<-lm(Fuel~Dlic+Tax+Income,data=fueldata)
> f4<-lm(Fuel~.,data=fueldata)
> anova(f0,f1,f2,f3,f4)
Analysis of Variance Table

Model 1: Fuel ~ 1
Model 2: Fuel ~ Dlic
Model 3: Fuel ~ Dlic + Tax
Model 4: Fuel ~ Dlic + Tax + Income
Model 5: Fuel ~ Tax + Dlic + Income + logMiles
  Res.Df    RSS Df Sum of Sq       F    Pr(>F)
1     50 395694
2     49 308840  1     86854 20.6262 4.019e-05 ***
3     48 289681  1     19159  4.5498 0.0382899 *
4     47 228273  1     61408 14.5833 0.0003997 ***
5     46 193700  1     34573  8.2104 0.0062592 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
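The same sequential table can be produced in one call, since anova() on a single fit adds terms in formula order; note that the per-term sums of squares depend on that order:
> anova(lm(Fuel~Dlic+Tax+Income+logMiles,data=fueldata))
# one row per term, added sequentially; same Sum of Sq values as above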
28/35
Variable Selection
• Usually, we don't expect every candidate predictor to be related to the response. We want to identify a useful subset of the variables.
Forward selection
Backward elimination
Stepwise method
• Using F-tests, we can add variables to or remove variables from the model. The procedure ends when none of the candidate variables has a p-value smaller than the pre-specified value. (A built-in alternative is sketched below.)
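R's built-in step() automates such a search, using AIC rather than F-test p-values; a minimal sketch on the fuel data:
> step(lm(Fuel~.,data=fueldata),direction="backward")
# backward elimination by AIC; direction="both" gives a stepwise search
# (a forward search additionally needs a scope= argument)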
29/35
Multicollinearity
• When the independent variables are correlated among themselves, multicollinearity among them is said to exist.
• Estimated regression coefficients vary widely when the independent variables are highly correlated.
• A warning sign: large changes in the estimated regression coefficients when a predictor variable is added or deleted, or when an observation is altered or deleted. The Variance Inflation Factor (VIF), defined on the next slide, quantifies the problem.
30/35
Multicollinearity
• $\mathrm{VIF}_j = \dfrac{1}{1 - R_j^2}$, where $R_j^2$ is the coefficient of determination when $X_j$ is regressed on the other X variables.
• VIFs measure how much the variances of the estimated
regression coefficients are inflated as compared to when the
predictor variables are not linearly related.
• Generally, a maximum VIF value in excess of 5~10 is taken as an
indication of multicollinearity.
> vif(lm(Fuel~.,data=fueldata))
     Tax     Dlic   Income logMiles
1.010786 1.040992 1.132311 1.099395
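These values can be reproduced from the definition; a sketch for the Tax entry:
> r2<-summary(lm(Tax~Dlic+Income+logMiles,data=fueldata))$r.squared
> 1/(1-r2)
# about 1.01, matching the Tax entry above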
31/35
Model Assumption
> par(mfrow=c(2,2))
> plot(lm(Fuel~.,data=fueldata))
Check the model assumptions.
1. Is the linear model appropriate?
2. Are the errors normally distributed?
3. Do the errors have equal variance?
4. Which observations strongly influence the fitted equation?
=> outlier detection
32/35
Residual Analysis
> result=lm(Fuel~.,data=fueldata)
> plot(resid(result))
> line1=sd(resid(result))*1.96    # upper reference line at +1.96 SD of the residuals
> line2=sd(resid(result))*-1.96   # lower reference line at -1.96 SD
> abline(line1,0)
> abline(line2,0)
> abline(0,0)
> par(mfrow=c(1,2))
> boxplot(resid(result))
> hist(resid(result))
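Beyond these plots, a normal quantile plot and a formal test can check the normality assumption; a minimal sketch:
> qqnorm(resid(result))
> qqline(resid(result))
# points near the line support the normality assumption
> shapiro.test(resid(result))
# Shapiro-Wilk test of normality on the residuals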
33/35
Variable Transformation
• Box-Cox transformation
$Y^{\lambda} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$
Select the $\lambda$ minimizing SSE. (Generally, it lies between -2 and 2.)
> boxcox(lm(Fuel~.,data=fueldata))
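boxcox() comes from the MASS package, so library(MASS) may be needed if it is not already attached. The lambda at the peak of the profile log-likelihood can be read off the returned grid; a sketch:
> library(MASS)
> bc<-boxcox(lm(Fuel~.,data=fueldata))
> bc$x[which.max(bc$y)]
# the lambda with the highest profile log-likelihood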
34/35
Regression Procedure
• Preprocessing
▫ Data analysis
▫ Check the independence of the explanatory variables (e.g., multicollinearity diagnostics)
▫ Examine the normality of the error term
▫ Check the linearity of the data and the homoscedasticity of the errors through residual analysis
▫ Box-Cox transformation
▫ Inspect for outliers
• Regression analysis
▫ Examine scatterplots and the covariance (or correlation) matrix
▫ t-test each variable via a regression fit of the full model
▫ Compare several variable selection methods and choose the optimal set of variables
▫ Interpret the model
▫ Present the final candidate model
35/35