Regression Model Building

Predicting Number of Crew Members of Cruise Ships
Data Description
• n=158 Cruise Ships
• Dependent Variable – Crew Size (100s)
• Potential Predictor Variables
   - Age (2013 − Year Built)
   - Tonnage (1000s of Tons)
   - Passengers (100s)
   - Length (100s of feet)
   - Cabins (100s)
   - Passenger Density (Passengers/Space)
Data – First 20 Cases

Ship          Cruise Line   Age   Tonnage   Pssngrs   Length   Cabins   PassDens   Crew
Journey       Azamara         6    30.277      6.94     5.94     3.55      42.64    3.55
Quest         Azamara         6    30.277      6.94     5.94     3.55      42.64    3.55
Celebration   Carnival       26    47.262     14.86     7.22     7.43      31.8     6.7
Conquest      Carnival       11   110         29.74     9.53    14.88      36.99   19.1
Destiny       Carnival       17   101.353     26.42     8.92    13.21      38.36   10
Ecstasy       Carnival       22    70.367     20.52     8.55    10.2       34.29    9.2
Elation       Carnival       15    70.367     20.52     8.55    10.2       34.29    9.2
Fantasy       Carnival       23    70.367     20.56     8.55    10.22      34.23    9.2
Fascination   Carnival       19    70.367     20.52     8.55    10.2       34.29    9.2
Freedom       Carnival        6   110.239     37        9.51    14.87      29.79   11.5
Glory         Carnival       10   110         29.74     9.51    14.87      36.99   11.6
Holiday       Carnival       28    46.052     14.52     7.27     7.26      31.72    6.6
Imagination   Carnival       18    70.367     20.52     8.55    10.2       34.29    9.2
Inspiration   Carnival       17    70.367     20.52     8.55    10.2       34.29    9.2
Legend        Carnival       11    86         21.24     9.63    10.62      40.49    9.3
Liberty*      Carnival        8   110         29.74     9.51    14.87      36.99   11.6
Miracle       Carnival        9    88.5       21.24     9.63    10.62      41.67   10.3
Paradise      Carnival       15    70.367     20.52     8.55    10.2       34.29    9.2
Pride         Carnival       12    88.5       21.24     9.63    11.62      41.67    9.3
Sensation     Carnival       20    70.367     20.52     8.55    10.2       34.29    9.2
Full Model (6 Predictors, 7 Parameters, n=158)

Regression Statistics
Multiple R          0.9615
R Square            0.9245
Adjusted R Square   0.9215
Standard Error      0.9819
Observations        158

ANOVA
              df       SS       MS       F   Significance F
Regression     6   1781.5    296.9   308.0           0.0000
Residual     151    145.6      1.0
Total        157   1927.1

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept       -0.52134          1.05703   -0.493    0.6226      -2.610       1.567
Age             -0.01254          0.01420   -0.884    0.3783      -0.041       0.016
Tonnage          0.01324          0.01189    1.113    0.2673      -0.010       0.037
Pssngrs         -0.14976          0.04759   -3.147    0.0020      -0.244      -0.056
Length           0.40348          0.11445    3.525    0.0006       0.177       0.630
Cabins           0.80163          0.08922    8.985    0.0000       0.625       0.978
PassDens        -0.00066          0.01581   -0.042    0.9669      -0.032       0.031
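The summary statistics in the regression output follow directly from the ANOVA sums of squares. The sketch below recomputes them from the rounded values printed on the slide, so small differences in the last digit are expected; the variable names are illustrative only.

```python
import math

# Values copied from the ANOVA table (rounded as printed on the slide).
n = 158
df_reg, df_res = 6, 151
ss_reg, ss_res, ss_tot = 1781.5, 145.6, 1927.1

r_squared = ss_reg / ss_tot                                   # share of variation explained
adj_r_squared = 1 - (ss_res / df_res) / (ss_tot / (n - 1))    # penalized for parameters
ms_reg = ss_reg / df_reg
ms_res = ss_res / df_res
f_stat = ms_reg / ms_res                                      # overall F test
std_error = math.sqrt(ms_res)                                 # residual standard error

print(f"R^2 = {r_squared:.4f}, adjusted R^2 = {adj_r_squared:.4f}")
print(f"F = {f_stat:.1f}, standard error = {std_error:.4f}")
```

The results agree with the printed Regression Statistics to about three decimal places.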
Backward Elimination – Model Based AIC (minimize)

$$\mathrm{AIC}(\text{Model}) = n \ln\left(\frac{SSE(\text{Model})}{n}\right) + 2\,\text{parms}(\text{Model}) + \text{constant}$$

Full Model (7 Parms, constant=0):
$$\mathrm{AIC} = 158 \ln\left(\frac{145.57}{158}\right) + 2(7) = 1.055$$

Round 2:
$$\mathrm{AIC} = 158 \ln\left(\frac{145.57}{158}\right) + 2(6) = -0.943$$

Round 3:
$$\mathrm{AIC} = 158 \ln\left(\frac{146.39}{158}\right) + 2(5) = -2.062$$
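As a quick check of the AIC arithmetic, the formula can be evaluated directly. This is a minimal sketch: the function name `aic` is hypothetical, the constant is taken as 0 as on the slide, and the SSE values are the rounded ones shown, so results match the slide only to about two decimal places.

```python
import math

def aic(sse, n, parms, constant=0.0):
    """AIC = n * ln(SSE / n) + 2 * parms + constant."""
    return n * math.log(sse / n) + 2 * parms + constant

n = 158
aic_full = aic(145.57, n, 7)     # full model, 7 parameters
aic_round2 = aic(145.57, n, 6)   # passenger density dropped (SSE unchanged after rounding)
aic_round3 = aic(146.39, n, 5)   # age dropped as well

print(round(aic_full, 2), round(aic_round2, 2), round(aic_round3, 2))
```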






Full Model
               Df       SS      RSS      AIC
- passdens      1    0.002   145.57   -0.943
- age           1    0.753   146.32   -0.13
- tonnage       1    1.195   146.77    0.347
<none>                       145.57    1.055
- passengers    1    9.548   155.12    9.092
- length        1   11.98    157.55   11.551
- cabins        1   77.821   223.39   66.721

Round 2 (passdens removed)
               Df       SS      RSS      AIC
- age           1    0.815   146.39   -2.062
<none>                       145.57   -0.943
- tonnage       1    2.007   147.58   -0.78
- length        1   12.069   157.64    9.641
- passengers    1   14.027   159.6    11.591
- cabins        1   79.556   225.13   65.944

Round 3 (passdens and age removed)
               Df       SS      RSS      AIC
<none>                       146.39   -2.062
- tonnage       1    3.866   150.25    0.056
- length        1   11.739   158.13    8.126
- passengers    1   14.275   160.66   10.64
- cabins        1   78.861   225.25   64.028
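The backward elimination rounds above follow a simple loop: at each round, compute the AIC of every model with one predictor removed, drop the predictor giving the lowest AIC, and stop when keeping everything ("<none>") is best. The following is a minimal sketch of that loop using NumPy on synthetic data. The cruise ship data set is not reproduced here, and the helper names `rss`, `aic`, and `backward_eliminate` are hypothetical.

```python
import numpy as np

def rss(X, y, cols):
    """Residual SS from OLS of y on an intercept plus the listed columns of X."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def aic(sse, n, parms):
    """AIC = n*ln(SSE/n) + 2*parms, with the additive constant taken as 0."""
    return n * np.log(sse / n) + 2 * parms

def backward_eliminate(X, y):
    n = len(y)
    kept = list(range(X.shape[1]))
    current = aic(rss(X, y, kept), n, len(kept) + 1)
    while kept:
        # AIC of each candidate with one predictor removed; such a model has
        # len(kept) parameters (remaining predictors plus the intercept).
        trials = [(aic(rss(X, y, [k for k in kept if k != j]), n, len(kept)), j)
                  for j in kept]
        best_aic, drop = min(trials)
        if best_aic >= current:      # "<none>" has the lowest AIC: stop
            break
        kept.remove(drop)
        current = best_aic
    return kept, float(current)

# Synthetic example (assumption): only columns 0 and 1 truly affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(158, 4))
y = 2.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=158)
kept, final_aic = backward_eliminate(X, y)
print("kept columns:", kept, "AIC:", round(final_aic, 3))
```

Because AIC penalizes each parameter by only 2, a pure-noise column can occasionally survive; the strong predictors should always be retained.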
Forward Selection (AIC Based)

$$SS_{TOTAL} = 1927.08 \qquad \mathrm{AIC}(\text{Null Model}) = 158 \ln\left(\frac{1927.08}{158}\right) + 2(1) = 397.18$$

Null Model
               Df        SS       RSS      AIC
+ cabins        1   1742.21    184.88    28.82
+ tonnage       1   1658.03    269.05    88.1
+ passengers    1   1614.23    312.86   111.94
+ length        1   1546.6     380.49   142.86
+ age           1    542.66   1384.42   346.93
+ passdens      1     46.6    1880.48   395.32
<none>                        1927.08   397.18

Round 2 (current model: cabins)
               Df        SS       RSS       AIC
+ length        1   22.9636    161.91    9.8661
+ passdens      1   14.9541    169.92   17.4948
+ tonnage       1   12.5135    172.36   19.748
+ passengers    1    7.0656    177.81   24.6647
+ age           1    5.4442    179.43   26.0989
<none>                         184.88   28.8215

Round 3 (current model: cabins, length)
               Df        SS       RSS       AIC
+ passengers    1   11.6609    150.25    0.0565
+ passdens      1    6.3732    155.54    5.5212
<none>                         161.91    9.8661
+ age           1    1.9702    159.94    9.9317
+ tonnage       1    1.2514    160.66   10.6402

Round 4 (current model: cabins, length, passengers)
               Df        SS       RSS        AIC
+ tonnage       1    3.8656    146.39   -2.06164
+ age           1    2.6733    147.58   -0.77996
+ passdens      1    2.5635    147.69   -0.66241
<none>                         150.25    0.0565

Round 5 (current model: cabins, length, passengers, tonnage)
               Df        SS       RSS        AIC
<none>                         146.39   -2.06164
+ age           1   0.81467    145.57   -0.94339
+ passdens      1   0.06366    146.32   -0.13037
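Forward selection mirrors the rounds above in the opposite direction: start from the intercept-only model, add the predictor that lowers AIC the most, and stop when no addition beats "<none>". The sketch below illustrates this with NumPy on synthetic data; the helper names are hypothetical, and the cruise ship data are not used.

```python
import numpy as np

def rss(X, y, cols):
    """Residual SS from OLS of y on an intercept plus the listed columns of X."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def aic(sse, n, parms):
    """AIC = n*ln(SSE/n) + 2*parms, with the additive constant taken as 0."""
    return n * np.log(sse / n) + 2 * parms

def forward_select(X, y):
    n, k = X.shape
    selected = []
    current = aic(rss(X, y, selected), n, 1)      # null model: intercept only
    while len(selected) < k:
        # Each candidate adds one predictor: len(selected) + 1 slopes + intercept.
        trials = [(aic(rss(X, y, selected + [j]), n, len(selected) + 2), j)
                  for j in range(k) if j not in selected]
        best_aic, add = min(trials)
        if best_aic >= current:                   # "<none>" has the lowest AIC: stop
            break
        selected.append(add)
        current = best_aic
    return selected, float(current)

# Synthetic example (assumption): only columns 0 and 1 truly affect y.
rng = np.random.default_rng(1)
X = rng.normal(size=(158, 4))
y = 2.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=158)
selected, final_aic = forward_select(X, y)
print("entry order:", selected, "AIC:", round(final_aic, 3))
```

A stepwise variant would add a deletion check for the predictors already in the model after each addition, using the same `aic` comparison.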
Stepwise Regression (AIC Based)

Null Model
               Df        SS       RSS      AIC
+ cabins        1   1742.21    184.88    28.82
+ tonnage       1   1658.03    269.05    88.1
+ passengers    1   1614.23    312.86   111.94
+ length        1   1546.6     380.49   142.86
+ age           1    542.66   1384.42   346.93
+ passdens      1     46.6    1880.48   395.32
<none>                        1927.08   397.18

Round 2 (current model: cabins)
               Df        SS       RSS      AIC
+ length        1     22.96    161.91     9.87
+ passdens      1     14.95    169.92    17.49
+ tonnage       1     12.51    172.36    19.75
+ passengers    1      7.07    177.81    24.66
+ age           1      5.44    179.43    26.1
<none>                         184.88    28.82
- cabins        1   1742.21   1927.08   397.18

Round 3 (current model: cabins, length)
               Df        SS       RSS       AIC
+ passengers    1    11.661    150.25     0.056
+ passdens      1     6.373    155.54     5.521
<none>                         161.91     9.866
+ age           1     1.97     159.94     9.932
+ tonnage       1     1.251    160.66    10.64
- length        1    22.964    184.88    28.821
- cabins        1   218.571    380.49   142.859

Round 4 (current model: cabins, length, passengers)
               Df        SS       RSS       AIC
+ tonnage       1     3.866    146.39    -2.062
+ age           1     2.673    147.58    -0.78
+ passdens      1     2.563    147.69    -0.662
<none>                         150.25     0.056
- passengers    1    11.661    161.91     9.866
- length        1    27.559    177.81    24.665
- cabins        1    95.781    246.03    75.974

Round 5 (current model: cabins, length, passengers, tonnage)
               Df        SS       RSS       AIC
<none>                         146.39    -2.062
+ age           1     0.815    145.57    -0.943
+ passdens      1     0.064    146.32    -0.13
- tonnage       1     3.866    150.25     0.056
- length        1    11.739    158.13     8.126
- passengers    1    14.275    160.66    10.64
- cabins        1    78.861    225.25    64.028
Summary of Automated Models
• Backward Elimination
   - Drop Passenger Density (AIC drops from 1.055 to -0.943)
   - Drop Age (AIC drops from -0.943 to -2.062)
   - Stop: Keep Tonnage, Passengers, Length, Cabins
• Forward Selection
   - Add Cabins (AIC drops from 397.18 to 28.82)
   - Add Length (AIC drops from 28.82 to 9.8661)
   - Add Passengers (AIC drops from 9.8661 to 0.0565)
   - Add Tonnage (AIC drops from 0.0565 to -2.06)
   - Stop: Keep Tonnage, Passengers, Length, Cabins
• Stepwise – Same as Forward Selection
All Possible (Subset) Regressions

p' = Number of parameters (including intercept) in Model

$$R^2(\text{Model}) = \frac{SS(\text{Regression(Model)})}{SS(\text{Total})} = 1 - \frac{SS(\text{Residual(Model)})}{SS(\text{Total})}$$
Goal: Maximize within reason

$$\text{Adj-}R^2(\text{Model}) = 1 - \left(\frac{n-1}{n-p'}\right)\frac{SS(\text{Residual(Model)})}{SS(\text{Total})}$$
Goal: Maximize

$$C_p(\text{Model}) = \frac{SS(\text{Residual(Model)})}{s^2} + 2p' - n$$
Goal: $C_p \le p'$, where $s^2 = MS(\text{Residual(Full Model)})$

$$\mathrm{BIC}(\text{Model}) = n \ln\left(\frac{SS(\text{Residual(Model)})}{n}\right) + \ln(n)\,p' + \text{constant}$$
Goal: Minimize
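To make the four criteria concrete, the sketch below evaluates them for the four-predictor model (Tonnage, Passengers, Length, Cabins) using residual sums of squares reported earlier in the slides. Two points are assumptions rather than statements from the source: the inputs are rounded, so the last digits may differ, and the BIC constant is taken to be -n ln(SS(Total)/n), a choice made here only because it reproduces the tabulated BIC scale to within rounding.

```python
import math

n = 158
ss_total = 1927.08
sse_full, p_full = 145.57, 7      # full model: 6 predictors + intercept
sse_model, p_model = 146.39, 5    # tonnage, passengers, length, cabins + intercept

s2 = sse_full / (n - p_full)      # MS(Residual) of the full model

r2 = 1 - sse_model / ss_total
adj_r2 = 1 - ((n - 1) / (n - p_model)) * sse_model / ss_total
cp = sse_model / s2 + 2 * p_model - n

# Assumed constant (not stated on the slide): -n * ln(SS(Total) / n).
bic_constant = -n * math.log(ss_total / n)
bic = n * math.log(sse_model / n) + math.log(n) * p_model + bic_constant

print(f"R^2 = {r2:.3f}, Adj-R^2 = {adj_r2:.3f}, Cp = {cp:.2f}, BIC = {bic:.2f}")
```

Under these assumptions the results land close to the values listed for this model in the all-subsets table (R-Sq 0.924, Adj-R2 0.922, Cp 3.847, BIC -381.933).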



All Possible (Subset) Regressions (Best 4 per Grp)

#preds  Int  Age  Ton  Pass  Lngth  Cabin  PassDen   R-Sq   Adj-R2       Cp        BIC
   1     1    0    0    0      0      1       0     0.904    0.903    37.772   -360.238
   1     1    0    1    0      0      0       0     0.86     0.859   125.086   -300.954
   1     1    0    0    1      0      0       0     0.838    0.837   170.523   -277.122
   1     1    0    0    0      1      0       0     0.803    0.801   240.675   -246.201
   2     1    0    0    0      1      1       0     0.916    0.915    15.952   -376.131
   2     1    0    0    0      0      1       1     0.912    0.911    24.261   -368.502
   2     1    0    1    0      0      1       0     0.911    0.909    26.792   -366.249
   2     1    0    0    1      0      1       0     0.908    0.907    32.443   -361.332
   3     1    0    0    1      1      1       0     0.922    0.921     5.857   -382.878
   3     1    0    0    0      1      1       1     0.919    0.918    11.341   -377.413
   3     1    0    1    1      0      1       0     0.918    0.916    14.023   -374.808
   3     1    1    0    0      1      1       0     0.917    0.915    15.909   -373.002
   4     1    0    1    1      1      1       0     0.924    0.922     3.847   -381.933
   4     1    1    0    1      1      1       0     0.923    0.921     5.084   -380.652
   4     1    0    0    1      1      1       1     0.923    0.921     5.197   -380.534
   4     1    0    1    0      1      1       1     0.919    0.917    13.056   -372.631
   5     1    1    1    1      1      1       0     0.924    0.922     5.002   -377.752
   5     1    0    1    1      1      1       1     0.924    0.922     5.781   -376.939
   5     1    1    0    1      1      1       1     0.924    0.921     6.24    -376.462
   5     1    1    1    0      1      1       1     0.92     0.917    14.904   -367.717
   6     1    1    1    1      1      1       1     0.924    0.921     7       -372.692

Slide callouts: BIC, Adj-R2, Cp. In the table, BIC is smallest (-382.878) for the 3-predictor model with Pass, Lngth, and Cabin; Cp is smallest (3.847) for the 4-predictor model with Ton, Pass, Lngth, and Cabin, where Adj-R2 also reaches its highest displayed value (0.922).
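An all-subsets table like the one above can be produced by fitting every combination of predictors and recording the criteria for each. The sketch below does this with `itertools.combinations` and NumPy on synthetic data; the function names are hypothetical, and BIC is computed with its additive constant set to 0, which shifts every value equally and does not change the ranking.

```python
from itertools import combinations
import numpy as np

def rss(X, y, cols):
    """Residual SS from OLS of y on an intercept plus the listed columns of X."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def all_subsets(X, y):
    n, k = X.shape
    sst = float(((y - y.mean()) ** 2).sum())
    s2 = rss(X, y, range(k)) / (n - (k + 1))     # MS(Residual) of the full model
    rows = []
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            sse, p = rss(X, y, cols), size + 1   # p' counts the intercept
            rows.append({
                "cols": cols,
                "r2": 1 - sse / sst,
                "adj_r2": 1 - ((n - 1) / (n - p)) * sse / sst,
                "cp": sse / s2 + 2 * p - n,
                "bic": n * np.log(sse / n) + np.log(n) * p,   # constant = 0
            })
    return rows

# Synthetic example (assumption): only columns 0 and 1 truly affect y.
rng = np.random.default_rng(2)
X = rng.normal(size=(158, 4))
y = 2.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=158)
rows = all_subsets(X, y)
best_bic = min(rows, key=lambda r: r["bic"])
full = rows[-1]                                  # last row is the full model
print("best by BIC:", best_bic["cols"], "| full-model Cp:", round(full["cp"], 3))
```

For the full model, Cp equals p' by construction, which matches the last row of the table (Cp = 7 with 7 parameters).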
Cross-Validation
• Hold-out Sample (Training Sample = 100, Validation = 58)
   - Fit the model on the Training Sample and obtain regression estimates
   - Apply the Training Sample estimates to the Validation Sample X levels to obtain predicted values: MSEP = sum((obs - pred)^2)/n, with n the Validation Sample size
   - Fit the model on the Validation Sample and compare its regression coefficients with those from the Training Sample
• PRESS Statistic (delete observations 1-at-a-time)
   - Fit the model with each observation deleted 1-at-a-time
   - Obtain the residual for each observation when it was deleted
   - PRESS = sum((obs - pred(deleted))^2)
• K-fold Cross-validation
   - Extension of PRESS to where K groups of cases are deleted
   - Useful for computationally intensive models (not OLS)
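K-fold cross-validation, as described in the last bullet, can be sketched as follows: split the cases into K groups, refit the model with each group held out, and average the squared prediction errors over all held-out cases. This is a minimal NumPy sketch on synthetic data; the function name `kfold_mse` and the choice K = 5 are assumptions made for illustration.

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    """Average squared prediction error over k held-out folds (OLS with intercept)."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)               # k nearly equal groups of cases
    A = np.column_stack([np.ones(n), X])
    sq_err = 0.0
    for held_out in folds:
        train = np.setdiff1d(idx, held_out)      # all cases not in this fold
        beta, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
        sq_err += float(((y[held_out] - A[held_out] @ beta) ** 2).sum())
    return sq_err / n

# Synthetic example (assumption): error variance is 1, so the result should be near 1.
rng = np.random.default_rng(3)
X = rng.normal(size=(158, 4))
y = 2.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=158)
cv_mse = kfold_mse(X, y, k=5)
print("5-fold CV mean squared prediction error:", round(cv_mse, 3))
```

Setting `k` equal to the number of cases deletes one observation at a time, which gives PRESS/n.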
Hold-Out Sample – n_in = 100, n_out = 58

Training Sample   Estimate   Std Err   t-stat   P-Value
(Intercept)        -1.1018    0.7735   -1.424    0.1576
tonnage             0.0048    0.0118    0.407    0.6851
passengers         -0.1919    0.0545   -3.525    0.0007
length              0.4565    0.1457    3.132    0.0023
cabins              0.9506    0.1451    6.551    0.0000

Validation Sample   Estimate   Std Err   t-stat   P-Value
(Intercept)          -0.0970    0.9142   -0.106    0.9159
tonnage               0.0286    0.0124    2.303    0.0252
passengers           -0.1234    0.0582   -2.119    0.0388
length                0.2321    0.1917    1.211    0.2313
cabins                0.7058    0.1060    6.656    0.0000

Coefficients keep their signs, but significance levels change a lot between the two samples. See Tonnage and Length.
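$$MSEP = \frac{1}{n_V}\sum_{i=1}^{n_V}\left(y_{iV} - \hat{y}_{iV(T)}\right)^2 = 0.7578$$

Here $\hat{y}_{iV(T)}$ is the prediction for validation case $i$ from the model fit to the Training Sample, and $\bar{d}$ denotes the mean prediction error over the $n_V = 58$ validation cases.

$$\text{Bias}^2 = \bar{d}^{\,2} = 0.0005182738$$

$$\text{Percent Bias of MSEP} = 100\left(\frac{\text{Bias}^2}{MSEP}\right) = 100\left(\frac{0.0005182738}{0.7578}\right) = 0.06838787\ (\%)$$

Testing Bias = 0 from Training data to Validation

$$\bar{d} = -0.02276563 \qquad s = 0.8778456 \qquad s_{\bar{d}} = \frac{s}{\sqrt{n_V}} = \frac{0.8778456}{\sqrt{58}} = 0.1152668$$

$$t = \frac{\bar{d}}{s_{\bar{d}}} = \frac{-0.02276563}{0.1152668} = -0.1975$$

Conclusion: No evidence of systematic bias between the Training and Validation samples.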
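The hold-out calculations can be sketched in a few lines: fit on the training cases, predict the validation cases, then summarize the prediction errors by MSEP, squared bias, and a one-sample t statistic for mean error equal to zero. The example below uses NumPy on synthetic data with the same 100/58 split; the function name `holdout_check` and the returned keys are hypothetical.

```python
import numpy as np

def holdout_check(X, y, n_train):
    """Fit OLS on the first n_train cases; evaluate predictions on the rest."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A[:n_train], y[:n_train], rcond=None)
    d = y[n_train:] - A[n_train:] @ beta          # validation prediction errors
    n_v = len(d)
    msep = float(np.mean(d ** 2))
    bias = float(np.mean(d))                      # mean prediction error
    s = float(np.std(d, ddof=1))                  # SD of prediction errors
    return {
        "msep": msep,
        "bias": bias,
        "pct_bias": 100 * bias ** 2 / msep,       # share of MSEP due to bias
        "t": bias / (s / np.sqrt(n_v)),           # test of mean error = 0
        "s": s,
        "n_v": n_v,
    }

# Synthetic example (assumption): cases are already in random order.
rng = np.random.default_rng(4)
X = rng.normal(size=(158, 4))
y = 2.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=158)
out = holdout_check(X, y, n_train=100)
print({k: round(v, 4) for k, v in out.items()})
```

With real data the cases would normally be shuffled before splitting so that the training and validation samples are comparable.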
PRESS Statistic

$$\hat{Y}_{pred\,i(i)} = \hat{\beta}_{0(i)} + \hat{\beta}_{1(i)} X_{i1} + \cdots + \hat{\beta}_{p(i)} X_{ip}$$
where regression was fit without case i

$$PRESS = \sum_{i=1}^{n}\left(Y_i - \hat{Y}_{pred\,i(i)}\right)^2$$
Compare PRESS/n with MS(Residual) for the full model

Note:
$$Y_i - \hat{Y}_{pred\,i(i)} = \frac{Y_i - \hat{Y}_i}{1 - p_{ii}}$$
where $p_{ii}$ = i-th diagonal element of $P = X(X'X)^{-1}X'$

PRESS/n = 0.9801     MS(Residual) = 0.96

Model appears to be valid, very little difference between PRESS/n and MS(Resid)
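The note about the diagonal of P means PRESS does not require n separate fits. The sketch below checks this numerically on synthetic data by computing PRESS twice: once by refitting with each case deleted, and once from the ordinary residuals and the leverages p_ii. The function names are hypothetical.

```python
import numpy as np

def press_loo(A, y):
    """PRESS by brute force: refit without case i, then predict case i."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        total += float((y[i] - A[i] @ beta) ** 2)
    return total

def press_shortcut(A, y):
    """PRESS from one fit: deleted residual = ordinary residual / (1 - p_ii)."""
    P = A @ np.linalg.inv(A.T @ A) @ A.T          # projection (hat) matrix
    resid = y - P @ y
    return float(np.sum((resid / (1 - np.diag(P))) ** 2))

# Synthetic example (assumption): design matrix with an intercept column.
rng = np.random.default_rng(5)
X = rng.normal(size=(158, 4))
y = 2.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=158)
A = np.column_stack([np.ones(len(y)), X])

press_a = press_loo(A, y)
press_b = press_shortcut(A, y)
ms_resid = float(np.sum((y - A @ np.linalg.lstsq(A, y, rcond=None)[0]) ** 2)) / (158 - 5)
print(round(press_a / 158, 4), round(press_b / 158, 4), round(ms_resid, 4))
```

The two PRESS values agree up to floating-point error, and PRESS/n can then be compared with MS(Residual) as on the slide.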