Data Analysis Using R:
Multiple linear regression analysis
Tuan V. Nguyen
Garvan Institute of Medical Research,
Sydney, Australia
Data
• Effects of time (x1) and temperature (x2) on CO2 production.
Id    y      X1    X2   X3     X4     X5       X6       X7
1     36.98   5.1  400  51.37   4.24  1484.83  2227.25  2.06
2     13.74  26.4  400  72.33  30.87   289.94   434.90  1.33
3     10.08  23.8  400  71.44  33.01   320.79   481.19  0.97
4      8.53  46.4  400  79.15  44.61   164.76   247.14  0.62
5     36.42   7.0  450  80.47  33.84  1097.26  1645.89  0.22
6     26.59  12.6  450  89.90  41.26   605.06   907.59  0.76
7     19.07  18.9  450  91.48  41.88   405.37   608.05  1.71
8      5.96  30.2  450  98.60  70.79   253.70   380.55  3.93
9     15.52  53.8  450  98.05  66.82   142.27   213.40  1.97
…      …      …     …    …      …       …        …       …
25    11.19  11.5  450  77.88  25.20   663.09   994.63  1.61
26    75.62   5.2  470  75.50   8.66  1464.11  2196.17  4.78
27    36.03  10.6  470  83.15  22.39   720.07  1080.11  5.88
Note: y = CO2 yield; X1 = time (minutes); X2 = temperature (°C); X3 = percent dissolved; X4
= amount of oil (g/100 g); X5 = amount of coal; X6 = total amount dissolved; X7 = hydrogen consumed.
Read data into R
y <- c(36.98,13.74,10.08, 8.53,36.42,26.59,19.07, 5.96,15.52,56.61,
26.72,20.80, 6.99,45.93,43.09,15.79,21.60,35.19,26.14, 8.60,
11.63, 9.59, 4.42,38.89,11.19,75.62,36.03)
x1 <- c(5.1,26.4,23.8,46.4, 7.0,12.6,18.9,30.2,53.8,5.6,15.1,20.3,48.4,
5.8,11.2,27.9,5.1,11.7,16.7,24.8,24.9,39.5,29.0, 5.5, 11.5,
5.2,10.6)
x2 <- c(400,400, 400, 400, 450, 450, 450, 450, 450, 400, 400, 400,
400, 425, 425, 425, 450, 450, 450, 450, 450, 450, 450, 460,
450, 470, 470)
x3 <- c(51.37,72.33,71.44,79.15,80.47,89.90,91.48,98.60,98.05,55.69,
66.29,58.94,74.74,63.71,67.14,77.65,67.22,81.48,83.88,89.38,
79.77,87.93,79.50,72.73,77.88,75.50,83.15)
x4 <- c(4.24,30.87,33.01,44.61,33.84,41.26,41.88,70.79,66.82,
8.92,17.98,17.79,33.94,11.95,14.73,34.49,14.48,29.69,26.33,
37.98,25.66,22.36,31.52,17.86,25.20, 8.66,22.39)
x5 <- c(1484.83, 289.94, 320.79, 164.76, 1097.26, 605.06, 405.37,
253.70, 142.27,1362.24, 507.65, 377.60, 158.05, 130.66,
682.59, 274.20, 1496.51, 652.43, 458.42, 312.25, 307.08,
193.61, 155.96,1392.08, 663.09,1464.11, 720.07)
x6 <- c(2227.25, 434.90, 481.19, 247.14,1645.89, 907.59, 608.05,
380.55, 213.40,2043.36, 761.48, 566.40, 237.08,1961.49,1023.89,
411.30,2244.77, 978.64, 687.62, 468.38, 460.62, 290.42,
233.95,2088.12, 994.63,2196.17,1080.11)
x7 <- c(2.06,1.33,0.97,0.62,0.22,0.76,1.71,3.93,1.97,5.08,0.60,0.90,
0.63,2.04,1.57,2.38,0.32,0.44,8.82,0.02,1.72,1.88,1.43,
1.35,1.61,4.78,5.88)
REGdata <- data.frame(y, x1,x2,x3,x4,x5,x6,x7)
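The slides go straight from data entry to the model fit; a quick sanity check of the assembled data frame (my addition, not in the original) can be run first:
# Not in the original slides: confirm the data were entered correctly
# before fitting any model.
dim(REGdata)       # should report 27 rows and 8 columns
summary(REGdata)   # ranges and means of y and x1-x7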
Results of analysis
> reg <- lm(y ~ x1+x2+x3+x4+x5+x6+x7, data=REGdata)
> summary(reg)
Residuals:
    Min      1Q  Median      3Q     Max
-20.035  -4.681  -1.144   4.072  21.214

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 53.937016  57.428952   0.939   0.3594
x1          -0.127653   0.281498  -0.453   0.6553
x2          -0.229179   0.232643  -0.985   0.3370
x3           0.824853   0.765271   1.078   0.2946
x4          -0.438222   0.358551  -1.222   0.2366
x5          -0.001937   0.009654  -0.201   0.8431
x6           0.019886   0.008088   2.459   0.0237 *
x7           1.993486   1.089701   1.829   0.0831 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.61 on 19 degrees of freedom
Multiple R-Squared: 0.728,      Adjusted R-squared: 0.6278
F-statistic: 7.264 on 7 and 19 DF,  p-value: 0.0002674
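The analysis above stops at summary(reg); if confidence intervals for the coefficients or the standard residual diagnostics are wanted, the usual follow-up calls are sketched below (my addition, not part of the original slides):
# Not in the original slides: confidence intervals and residual
# diagnostics for the full model fitted above.
confint(reg)        # 95% confidence intervals for each coefficient
par(mfrow=c(2, 2))  # arrange the four diagnostic plots on one page
plot(reg)           # residuals, Q-Q, scale-location and leverage plots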
Inter-correlations
[Figure: scatterplot matrix showing the pairwise relations among y, x1, x2, x3, x4, x5, x6 and x7]
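The slide shows only the resulting scatterplot matrix; a command along the following lines (assumed, since the original does not show it) produces the figure, and cor() gives the same inter-correlations numerically:
# Assumed commands; the original slide shows only the figure.
pairs(REGdata)            # scatterplot matrix of y and x1-x7
round(cor(REGdata), 2)    # pairwise correlation coefficients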
Stepwise regression
reg <- lm(y ~ ., data=REGdata)
step(reg, direction="both")
Start: AIC= 134.07
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7

       Df Sum of Sq     RSS    AIC
- x5    1      4.54 2145.37 132.13
- x1    1     23.17 2164.00 132.36
- x2    1    109.34 2250.18 133.42
- x3    1    130.90 2271.74 133.68
<none>              2140.83 134.07
- x4    1    168.31 2309.14 134.12
- x7    1    377.09 2517.92 136.45
- x6    1    681.09 2821.92 139.53

Step 1: AIC= 132.13
y ~ x1 + x2 + x3 + x4 + x6 + x7

       Df Sum of Sq    RSS   AIC
- x1    1      22.7 2168.1 130.4
- x2    1     113.8 2259.1 131.5
- x3    1     133.5 2278.9 131.8
<none>              2145.4 132.1
- x4    1     170.8 2316.2 132.2
+ x5    1       4.5 2140.8 134.1
- x7    1     375.7 2521.1 134.5
- x6    1    1058.5 3203.8 141.0

Step 2: AIC= 130.42
y ~ x2 + x3 + x4 + x6 + x7

       Df Sum of Sq    RSS   AIC
- x2    1      96.8 2264.9 129.6
- x3    1     122.0 2290.0 129.9
<none>              2168.1 130.4
- x4    1     187.4 2355.5 130.7
+ x1    1      22.7 2145.4 132.1
+ x5    1       4.1 2164.0 132.4
- x7    1     385.0 2553.1 132.8
- x6    1    1526.2 3694.3 142.8

Step 3: AIC= 129.59
y ~ x3 + x4 + x6 + x7

       Df Sum of Sq    RSS   AIC
- x3    1      25.4 2290.3 127.9
- x4    1      90.9 2355.8 128.7
<none>              2264.9 129.6
+ x2    1      96.8 2168.1 130.4
+ x5    1       8.3 2256.5 131.5
+ x1    1       5.7 2259.1 131.5
- x7    1     384.9 2649.7 131.8
- x6    1    2015.6 4280.5 144.8

Step 4: AIC= 127.9
y ~ x4 + x6 + x7

       Df Sum of Sq    RSS   AIC
- x4    1      73.5 2363.8 126.7
<none>              2290.3 127.9
+ x3    1      25.4 2264.9 129.6
+ x1    1      11.3 2279.0 129.8
+ x5    1       6.3 2284.0 129.8
+ x2    1       0.3 2290.0 129.9
- x7    1     486.6 2776.9 131.1
- x6    1    1993.8 4284.1 142.8

Step 5: AIC= 126.75
y ~ x6 + x7

       Df Sum of Sq    RSS   AIC
<none>              2363.8 126.7
+ x4    1      73.5 2290.3 127.9
+ x1    1      33.4 2330.4 128.4
+ x3    1       8.1 2355.8 128.7
+ x5    1       7.7 2356.1 128.7
+ x2    1       7.3 2356.6 128.7
- x7    1     497.3 2861.2 129.9
- x6    1    4477.0 6840.8 153.4

Call:
lm(formula = y ~ x6 + x7, data = REGdata)

Coefficients:
(Intercept)           x6           x7
    2.52646      0.01852      2.18575
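step() also returns the selected model, so it can be captured and inspected directly; the lines below are a sketch of that follow-up, not part of the original slides:
# Not in the original slides: keep the model chosen by step() and
# summarise it directly.
reg.step <- step(reg, direction="both", trace=FALSE)
summary(reg.step)   # final model y ~ x6 + x7, as in the Call above
confint(reg.step)   # 95% confidence intervals for its coefficients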
Bayesian Model Average (BMA) analysis
library(BMA)
xvars <- REGdata[,-1]
co2 <- REGdata[,1]
bma <- bicreg(xvars, co2, strict=FALSE, OR=20)
summary(bma)
16 models were selected
Best 5 models (cumulative posterior probability = 0.6599 ):
           p!=0    EV        SD       model 1   model 2   model 3   model 4   model 5
Intercept  100.0   5.75672   14.6244   2.5264    6.1441    8.6120    7.5936    7.3537
x1          12.4  -0.01807    0.1008     .         .         .       -0.1393     .
x2          10.4  -0.00075    0.0282     .         .         .         .         .
x3          10.7   0.00011    0.0791     .         .         .         .       -0.0572
x4          20.2  -0.03059    0.1020     .         .       -0.1419     .         .
x5          10.5  -0.00023    0.0030     .         .         .         .         .
x6         100.0   0.01815    0.0040    0.0185    0.0193    0.0164    0.0162    0.0179
x7          73.7   1.60766    1.2821    2.1857      .       2.1628    2.1233    2.2382

nVar                                    2         1         3         3         3
r2                                      0.700     0.636     0.709     0.704     0.701
BIC                                   -25.8832  -24.0238  -23.4412  -22.9721  -22.6801
post prob                               0.311     0.123     0.092     0.072     0.063
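Besides summary(bma), the object returned by bicreg() also stores the posterior inclusion probabilities and model probabilities; a minimal sketch of how to extract them, assuming the standard BMA package components probne0 and postprob:
# Assumption: bicreg() returns 'probne0' (posterior probability that each
# coefficient is non-zero) and 'postprob' (posterior model probabilities).
bma$probne0    # should match the p!=0 column above (e.g. x6 = 100, x7 = 73.7)
bma$postprob   # posterior probabilities of the 16 selected models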
imageplot.bma(bma)
[Figure: "Models selected by BMA" image plot, with the predictors x1 to x7 on the vertical axis and Model # (1 to 13) on the horizontal axis]