Selection of predictor variables


Variable selection and model building, Part II
Statement of situation
• A common situation is that there is a large set of candidate predictor variables. (Note: the examples herein are not really that large.)
• The goal is to choose a small subset from the larger set so that the resulting regression model is simple and useful:
– provides a good summary of the trend in the response
– and/or provides good predictions of the response
– and/or provides good estimates of the slope coefficients
Two basic methods of selecting predictors
• Stepwise regression: enter and remove predictors, in a stepwise manner, until there is no justifiable reason to enter or remove any more.
• Best subsets regression: select the subset of predictors that does the best at meeting some well-defined objective criterion.
Two cautions!
• The list of candidate predictor variables
must include all the variables that actually
predict the response.
• There is no single criterion that will always
be the best measure of the “best” regression
equation.
Best subsets regression
… or all possible subsets regression
Best subsets regression
• Consider all of the possible regression
models from all of the possible
combinations of the candidate predictors.
• Identify, for further evaluation, models with
a subset of predictors that do the “best” at
meeting some well-defined criteria.
• Further evaluate the models identified in the
last step. Fine-tune the final model.
Example: Cement data
• Response y: heat evolved in calories during
hardening of cement on a per gram basis
• Predictor x1: % of tricalcium aluminate
• Predictor x2: % of tricalcium silicate
• Predictor x3: % of tetracalcium alumino ferrite
• Predictor x4: % of dicalcium silicate
Why best subsets regression?

# of predictors (p-1)   # of regression models
         1              2:  ( ) (x1)
         2              4:  ( ) (x1) (x2) (x1, x2)
         3              8:  ( ) (x1) (x2) (x3) (x1, x2) (x1, x3) (x2, x3) (x1, x2, x3)
         4              16: 1 none, 4 one, 6 two, 4 three, 1 four
Why best subsets regression?
• If there are p-1 possible predictors, then there are 2^(p-1) possible regression models containing the predictors.
• For example, 10 predictors yields 2^10 = 1024 possible regression models.
• A best subsets algorithm determines the best subsets of each size, so that candidates for a final model can be identified by the researcher (see the sketch below).
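The idea is easy to sketch in code. Here is a minimal illustration (not Minitab's algorithm): it assumes the data sit in a pandas DataFrame, fits an OLS model for every subset of the candidate predictors with statsmodels, and records the selection criteria discussed on the slides that follow. The function and column names are illustrative.

```python
# A minimal "all possible subsets" sketch; it assumes a pandas DataFrame with a
# response column and candidate predictor columns (names are illustrative).
from itertools import combinations

import pandas as pd
import statsmodels.api as sm


def all_subsets(data, response, predictors):
    """Fit an OLS model for every subset of the candidate predictors and
    collect the selection criteria discussed in this lecture."""
    rows = []
    for k in range(len(predictors) + 1):
        for subset in combinations(predictors, k):
            if subset:
                X = sm.add_constant(data[list(subset)])
            else:
                X = pd.DataFrame({"const": 1.0}, index=data.index)  # intercept-only model
            fit = sm.OLS(data[response], X).fit()
            rows.append({
                "predictors": ", ".join(subset) if subset else "(none)",
                "p": k + 1,                      # number of parameters, including the intercept
                "R-Sq": fit.rsquared,
                "R-Sq(adj)": fit.rsquared_adj,
                "S": fit.mse_resid ** 0.5,       # S = square root of MSE
            })
    return pd.DataFrame(rows)


# Hypothetical usage, assuming the cement data sit in a DataFrame named `cement`:
# results = all_subsets(cement, "y", ["x1", "x2", "x3", "x4"])
```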
Common ways of judging “best”
• Different criteria quantify different aspects of the regression model, and so they can lead to different choices for the best set of predictors:
– R-squared
– Adjusted R-squared
– MSE (or S, the square root of MSE)
– Mallow’s Cp
– (PRESS statistic)
Increase in R-squared

R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}

• R-squared can only increase as more variables are added (see the sketch below).
• Use the R-squared values to find the point where adding more predictors is not worthwhile, because it yields only a very small increase in R-squared.
• Most often, R-squared is used in combination with other criteria.
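The first bullet can be verified directly. A small self-contained sketch, using simulated data rather than the cement data: as predictors are added to a nested sequence of models, R-squared never goes down, even when the added predictors are pure noise.

```python
# R-squared is non-decreasing in the number of predictors, even for noise predictors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 4))                  # four candidate predictors
y = 10 + 2 * X[:, 0] + rng.normal(size=n)    # only the first predictor matters

prev = -1.0
for k in range(1, 5):
    fit = sm.OLS(y, sm.add_constant(X[:, :k])).fit()
    assert fit.rsquared >= prev - 1e-12      # monotone, up to rounding
    prev = fit.rsquared
    print(k, "predictor(s): R-Sq =", round(fit.rsquared, 3))
```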
Cement example

Best Subsets Regression: y versus x1, x2, x3, x4

Response is y

Vars  R-Sq  R-Sq(adj)    C-p       S    x1 x2 x3 x4
   1  67.5       64.5  138.7  8.9639                X
   1  66.6       63.6  142.5  9.0771        X
   2  97.9       97.4    2.7  2.4063    X   X
   2  97.2       96.7    5.5  2.7343    X            X
   3  98.2       97.6    3.0  2.3087    X   X        X
   3  98.2       97.6    3.0  2.3121    X   X   X
   4  98.2       97.4    5.0  2.4460    X   X   X    X
Largest adjusted R-squared

R_a^2 = 1 - \left(\frac{n-1}{n-p}\right)\frac{SSE}{SSTO} = 1 - \left(\frac{n-1}{SSTO}\right) MSE

• Makes you pay a penalty for adding more predictors.
• According to this criterion, the best regression model is the one with the largest adjusted R-squared.
Smallest MSE

MSE = \frac{SSE}{n-p} = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-p}

R_a^2 = 1 - \left(\frac{n-1}{n-p}\right)\frac{SSE}{SSTO} = 1 - \left(\frac{n-1}{SSTO}\right) MSE

• According to this criterion, the best regression model is the one with the smallest MSE.
• Adjusted R-squared increases only if MSE decreases, so the adjusted R-squared and MSE criteria yield the same models (see the sketch below).
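Because R-Sq(adj) = 1 - MSE(n - 1)/SSTO and SSTO is the same for every model fit to the same data set, ranking candidate models by largest adjusted R-squared is identical to ranking them by smallest MSE. A tiny check, plugging in values that appear in the cement output in this lecture (SSTO = 2715.76, n = 13):

```python
# Adjusted R-squared is a fixed, decreasing function of MSE for a given data set,
# so "largest adjusted R-squared" and "smallest MSE" select the same model.
def adjusted_r_squared(mse, ssto, n):
    return 1.0 - mse * (n - 1) / ssto

ssto, n = 2715.76, 13                 # SSTO and n from the cement example
for mse in (5.98, 5.8, 5.33):         # MSE values from three of the cement models in this lecture
    print(f"MSE = {mse:4.2f}  ->  R-Sq(adj) = {adjusted_r_squared(mse, ssto, n):.4f}")
```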
Cement example

Best Subsets Regression: y versus x1, x2, x3, x4

Response is y

Vars  R-Sq  R-Sq(adj)    C-p       S    x1 x2 x3 x4
   1  67.5       64.5  138.7  8.9639                X
   1  66.6       63.6  142.5  9.0771        X
   2  97.9       97.4    2.7  2.4063    X   X
   2  97.2       96.7    5.5  2.7343    X            X
   3  98.2       97.6    3.0  2.3087    X   X        X
   3  98.2       97.6    3.0  2.3121    X   X   X
   4  98.2       97.4    5.0  2.4460    X   X   X    X
Mallow’s Cp statistic
• Cp estimates the size of the bias introduced
in the estimates of the responses by having
an underspecified model (a model with
important predictors missing).
Biased prediction
• If there is no bias, the expected value of the observed responses, E(y_i), and the expected value of the predicted responses, E(\hat{y}_i), both equal \mu_{Y|x}.
• Fitting the data with an underspecified model introduces bias, B_i = E(\hat{y}_i) - E(y_i), into the predicted response at the ith data point.
Biased prediction

B_i = E(\hat{y}_i) - E(y_i)

[Diagram: E(\hat{y}_i) and E(y_i) plotted side by side; when they coincide there is no bias, and when they differ the gap between them is the bias B_i.]
Bias from an underspecified model

Weight = -1.22 + 0.283 Height + 0.111 Water,   MSE = 0.017
Weight = -4.14 + 0.389 Height,                 MSE = 0.653

[Scatterplot of Weight versus Height showing both fitted equations.]
Variation in predicted responses
• Because of bias, the variance in the predicted response at data point i is due to two things:
– random sampling variation, \sigma^2_{\hat{y}_i}
– variance associated with the bias, B_i^2
Total variation in predicted responses

Sum the two variance components over all n data points to obtain a measure of the total variation in the predicted responses:

\Gamma_p = \frac{1}{\sigma^2}\left[\sum_{i=1}^{n}\sigma^2_{\hat{y}_i} + \sum_{i=1}^{n}\left(E(\hat{y}_i) - E(y_i)\right)^2\right]

If there is no bias, \Gamma_p achieves its smallest value, p (because the variances of the n fitted values always sum to p\sigma^2):

\Gamma_p = \frac{1}{\sigma^2}\sum_{i=1}^{n}\sigma^2_{\hat{y}_i} + 0 = p
A good measure of an underspecified model

So \Gamma_p seems to be a good measure of an underspecified model:

\Gamma_p = \frac{1}{\sigma^2}\left[\sum_{i=1}^{n}\sigma^2_{\hat{y}_i} + \sum_{i=1}^{n}\left(E(\hat{y}_i) - E(y_i)\right)^2\right]

The best model is simply the one with the smallest value of \Gamma_p. We even know that the theoretical minimum of \Gamma_p is p.
Cp as an estimate of \Gamma_p

If we know the population variance \sigma^2, we can estimate \Gamma_p:

C_p = \frac{MSE_p\,(n-p)}{\sigma^2} - (n - 2p)

where MSE_p is the mean squared error from fitting the model containing the subset of p-1 predictors (p parameters).
Mallow’s Cp statistic

But we don't know \sigma^2, so estimate it with MSE_{all}, the mean squared error obtained from fitting the model containing all of the candidate predictors:

C_p = \frac{MSE_p\,(n-p)}{MSE_{all}} - (n - 2p)

• Estimating \sigma^2 with MSE_{all}:
– assumes that there are no biases in the full model with all of the predictors, an assumption that may or may not be valid but can't be tested without additional information;
– guarantees that Cp = p for the full model (see the sketch below).
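A minimal sketch of this computation; the function and argument names are illustrative, with mse_sub the MSE from a subset model having p parameters and mse_full the MSE_all from the model containing all of the candidate predictors.

```python
# Cp = MSE_p (n - p) / MSE_all - (n - 2p), as defined above.
def mallows_cp(mse_sub, mse_full, n, p):
    sse_sub = mse_sub * (n - p)              # SSE of the subset model
    return sse_sub / mse_full - (n - 2 * p)

# For the full model itself mse_sub equals mse_full, so
# Cp = (n - p) - (n - 2p) = p, the guarantee noted above.
```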
Summary facts about Mallow’s Cp
• Subset models with small Cp values have a small total (standardized) variance of prediction.
• When the Cp value is …
– near p, the bias is small (next to none);
– much greater than p, the bias is substantial;
– below p, the difference is due to sampling error and can be interpreted as indicating no bias.
• For the largest model containing all of the candidate predictors, Cp = p (always).
Using the Cp criterion
• Identify subsets of predictors for which the Cp
value is near p (if possible).
– The full model always yields Cp= p, so don’t select the
full model based on Cp.
– If all models, except the full model, yield a large Cp not
near p, it suggests some important predictor(s) are
missing from the analysis.
– When more than one model has a Cp value near p, in
general, choose the simpler model or the model that
meets your research needs.
Cement example

Best Subsets Regression: y versus x1, x2, x3, x4

Response is y

Vars  R-Sq  R-Sq(adj)    C-p       S    x1 x2 x3 x4
   1  67.5       64.5  138.7  8.9639                X
   1  66.6       63.6  142.5  9.0771        X
   2  97.9       97.4    2.7  2.4063    X   X
   2  97.2       96.7    5.5  2.7343    X            X
   3  98.2       97.6    3.0  2.3087    X   X        X
   3  98.2       97.6    3.0  2.3121    X   X   X
   4  98.2       97.4    5.0  2.4460    X   X   X    X
The regression equation is
y = 62.4 + 1.55 x1 + 0.510 x2 + 0.102 x3 - 0.144 x4

Source          DF       SS      MS       F      P
Regression       4  2667.90  666.97  111.48  0.000
Residual Error   8    47.86    5.98
Total           12  2715.76
The regression equation is y = 52.6 + 1.47 x1 + 0.662 x2

Source          DF      SS      MS       F      P
Regression       2  2657.9  1328.9  229.50  0.000
Residual Error  10    57.9     5.8
Total           12  2715.8

C_p = \frac{MSE_p\,(n-p)}{MSE_{all}} - (n - 2p) = \frac{5.8(13-3)}{5.98} - (13 - 2(3)) = 2.7
The regression equation is
y = 62.4 + 1.55 x1 + 0.510 x2 + 0.102 x3 - 0.144 x4

Source          DF       SS      MS       F      P
Regression       4  2667.90  666.97  111.48  0.000
Residual Error   8    47.86    5.98
Total           12  2715.76
The regression equation is y = 103 + 1.44 x1 - 0.614 x4

Source          DF      SS      MS       F      P
Regression       2  2641.0  1320.5  176.63  0.000
Residual Error  10    74.8     7.5
Total           12  2715.8

C_p = \frac{MSE_p\,(n-p)}{MSE_{all}} - (n - 2p) = \frac{7.5(13-3)}{5.98} - (13 - 2(3)) = 5.5
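The two hand calculations above can be reproduced with the mallows_cp sketch given earlier, using the MSE values from the Minitab output (MSE_all = 5.98 from the full model, n = 13, and p = 3 for a two-predictor model):

```python
# Cp for the two-predictor candidate models, using values from the output above.
print(round(mallows_cp(5.8, 5.98, n=13, p=3), 1))   # x1, x2 model -> 2.7
print(round(mallows_cp(7.5, 5.98, n=13, p=3), 1))   # x1, x4 model -> 5.5
```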
Cement example

Best Subsets Regression: y versus x1, x2, x3, x4

Response is y

Vars  R-Sq  R-Sq(adj)    C-p       S    x1 x2 x3 x4
   1  67.5       64.5  138.7  8.9639                X
   1  66.6       63.6  142.5  9.0771        X
   2  97.9       97.4    2.7  2.4063    X   X
   2  97.2       96.7    5.5  2.7343    X            X
   3  98.2       97.6    3.0  2.3087    X   X        X
   3  98.2       97.6    3.0  2.3121    X   X   X
   4  98.2       97.4    5.0  2.4460    X   X   X    X
The regression equation is
y = 71.6 + 1.45 x1 + 0.416 x2 - 0.237 x4

Predictor      Coef  SE Coef      T      P   VIF
Constant      71.65    14.14   5.07  0.001
x1           1.4519   0.1170  12.41  0.000   1.1
x2           0.4161   0.1856   2.24  0.052  18.8
x4          -0.2365   0.1733  -1.37  0.205  18.9

S = 2.309   R-Sq = 98.2%   R-Sq(adj) = 97.6%

Analysis of Variance
Source          DF       SS      MS       F      P
Regression       3  2667.79  889.26  166.83  0.000
Residual Error   9    47.97    5.33
Total           12  2715.76
The regression equation is
y = 48.2 + 1.70 x1 + 0.657 x2 + 0.250 x3

Predictor      Coef  SE Coef      T      P  VIF
Constant     48.194    3.913  12.32  0.000
x1           1.6959   0.2046   8.29  0.000  3.3
x2          0.65691  0.04423  14.85  0.000  1.1
x3           0.2500   0.1847   1.35  0.209  3.1

S = 2.312   R-Sq = 98.2%   R-Sq(adj) = 97.6%

Analysis of Variance
Source          DF       SS      MS       F      P
Regression       3  2667.65  889.22  166.34  0.000
Residual Error   9    48.11    5.35
Total           12  2715.76
The regression equation is
y = 52.6 + 1.47 x1 + 0.662 x2

Predictor      Coef  SE Coef      T      P  VIF
Constant     52.577    2.286  23.00  0.000
x1           1.4683   0.1213  12.10  0.000  1.1
x2          0.66225  0.04585  14.44  0.000  1.1

S = 2.406   R-Sq = 97.9%   R-Sq(adj) = 97.4%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       2  2657.9  1328.9  229.50  0.000
Residual Error  10    57.9     5.8
Total           12  2715.8
Stepwise Regression: y versus x1, x2, x3, x4

Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15

Response is y on 4 predictors, with N = 13

Step             1        2        3        4
Constant    117.57   103.10    71.65    52.58

x4          -0.738   -0.614   -0.237
T-Value      -4.77   -12.62    -1.37
P-Value      0.001    0.000    0.205

x1                     1.44     1.45     1.47
T-Value               10.40    12.41    12.10
P-Value               0.000    0.000    0.000

x2                              0.416    0.662
T-Value                          2.24    14.44
P-Value                         0.052    0.000

S             8.96     2.73     2.31     2.41
R-Sq         67.45    97.25    98.23    97.87
R-Sq(adj)    64.50    96.70    97.64    97.44
C-p          138.7      5.5      3.0      2.7
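For comparison with the Minitab output above, here is a rough sketch of stepwise selection driven by alpha-to-enter and alpha-to-remove thresholds. It is a simplified illustration, not Minitab's exact algorithm, and it assumes a pandas DataFrame holding the response and the candidate predictor columns (all names are illustrative).

```python
# A simplified stepwise-selection sketch using p-value thresholds.
import pandas as pd
import statsmodels.api as sm


def stepwise(data, response, candidates, alpha_enter=0.15, alpha_remove=0.15):
    selected = []
    while True:
        changed = False
        # Entry step: add the remaining candidate with the smallest p-value below alpha_enter.
        remaining = [c for c in candidates if c not in selected]
        pvals = {}
        for c in remaining:
            fit = sm.OLS(data[response], sm.add_constant(data[selected + [c]])).fit()
            pvals[c] = fit.pvalues[c]
        if pvals:
            best = min(pvals, key=pvals.get)
            if pvals[best] < alpha_enter:
                selected.append(best)
                changed = True
        # Removal step: drop the selected predictor whose p-value exceeds alpha_remove.
        if selected:
            fit = sm.OLS(data[response], sm.add_constant(data[selected])).fit()
            worst = fit.pvalues[selected].idxmax()
            if fit.pvalues[worst] > alpha_remove:
                selected.remove(worst)
                changed = True
        if not changed:
            return selected


# Hypothetical usage, assuming the cement data sit in a DataFrame named `cement`:
# stepwise(cement, "y", ["x1", "x2", "x3", "x4"])
```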
Residual analysis
Example: Modeling PIQ

[Scatterplot matrix of the response PIQ and the candidate predictors MRI, Height, and Weight.]
Best Subsets Regression: PIQ versus MRI, Height, Weight

Response is PIQ

Vars  R-Sq  R-Sq(adj)   C-p       S    MRI Height Weight
   1  14.3       11.9   7.3  21.212    X
   1   0.9        0.0  13.8  22.810         X
   2  29.5       25.5   2.0  19.510    X    X
   2  19.3       14.6   6.9  20.878    X           X
   3  29.5       23.3   4.0  19.794    X    X      X
Stepwise Regression: PIQ versus MRI, Height, Weight

Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15

Response is PIQ on 3 predictors, with N = 38

Step             1        2
Constant     4.652  111.276

MRI           1.18     2.06
T-Value       2.45     3.77
P-Value      0.019    0.001

Height                -2.73
T-Value               -2.75
P-Value               0.009

S             21.2     19.5
R-Sq         14.27    29.49
R-Sq(adj)    11.89    25.46
C-p            7.3      2.0
Example: Modeling BP

[Scatterplot matrix of the response BP and the candidate predictors Age, Weight, BSA, Duration, Pulse, and Stress.]
Best Subsets Regression: BP versus Age, Weight, ...

Response is BP

Vars  R-Sq  R-Sq(adj)    C-p        S
   1  90.3       89.7  312.8   1.7405
   1  75.0       73.6  829.1   2.7903
   2  99.1       99.0   15.1  0.53269
   2  92.0       91.0  256.6   1.6246
   3  99.5       99.4    6.4  0.43705
   3  99.2       99.1   14.1  0.52012
   4  99.5       99.4    6.4  0.42591
   4  99.5       99.4    7.1  0.43500
   5  99.6       99.4    7.0  0.42142
   5  99.5       99.4    7.7  0.43078
   6  99.6       99.4    7.0  0.40723

[The original output also contains indicator columns marking which of Age, Weight, BSA, Dur, Pulse, and Stress appear in each model; the best one-, two-, three-, and four-predictor models are Weight; Age, Weight; Age, Weight, BSA; and Age, Weight, BSA, Dur, as the output that follows shows.]
Stepwise Regression: BP versus Age, Weight, BSA, Duration, Pulse, Stress

Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15

Response is BP on 6 predictors, with N = 20

Step             1        2        3
Constant     2.205  -16.579  -13.667

Weight       1.201    1.033    0.906
T-Value      12.92    33.15    18.49
P-Value      0.000    0.000    0.000

Age                   0.708    0.702
T-Value               13.23    15.96
P-Value               0.000    0.000

BSA                              4.6
T-Value                         3.04
P-Value                        0.008

S             1.74    0.533    0.437
R-Sq         90.26    99.14    99.45
R-Sq(adj)    89.72    99.04    99.35
C-p          312.8     15.1      6.4
The regression equation is
BP = - 12.9 + 0.683 Age + 0.897 Weight + 4.86 BSA + 0.0665 Dur

Predictor      Coef  SE Coef      T      P  VIF
Constant    -12.852    2.648  -4.85  0.000
Age         0.68335  0.04490  15.22  0.000  1.3
Weight      0.89701  0.04818  18.62  0.000  4.5
BSA           4.860    1.492   3.26  0.005  4.3
Dur         0.06653  0.04895   1.36  0.194  1.2

S = 0.4259   R-Sq = 99.5%   R-Sq(adj) = 99.4%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       4  557.28  139.32  768.01  0.000
Residual Error  15    2.72    0.18
Total           19  560.00
The regression equation is
BP = - 13.7 + 0.702 Age + 0.906 Weight + 4.63 BSA

Predictor      Coef  SE Coef      T      P  VIF
Constant    -13.667    2.647  -5.16  0.000
Age         0.70162  0.04396  15.96  0.000  1.2
Weight      0.90582  0.04899  18.49  0.000  4.4
BSA           4.627    1.521   3.04  0.008  4.3

S = 0.4370   R-Sq = 99.5%   R-Sq(adj) = 99.4%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       3  556.94  185.65  971.93  0.000
Residual Error  16    3.06    0.19
Total           19  560.00
The regression equation is
BP = - 16.6 + 0.708 Age + 1.03 Weight

Predictor      Coef  SE Coef      T      P  VIF
Constant    -16.579    3.007  -5.51  0.000
Age         0.70825  0.05351  13.23  0.000  1.2
Weight      1.03296  0.03116  33.15  0.000  1.2

S = 0.5327   R-Sq = 99.1%   R-Sq(adj) = 99.0%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       2  555.18  277.59  978.25  0.000
Residual Error  17    4.82    0.28
Total           19  560.00
Best subsets regression
• Stat >> Regression >> Best subsets …
• Specify the response and all possible predictors.
• If desired, specify predictors that must be included in every model (researcher's knowledge!).
• Select OK. Results appear in the session window.
Model building strategy
The first step
• Decide on the type of model needed
– Predictive: model used to predict the response
variable from a chosen set of predictors.
– Theoretical: model based on theoretical
relationship between response and predictors.
– Control: model used to control a response
variable by manipulating predictor variables.
The first step (cont’d)
• Decide on the type of model needed
– Inferential: model used to explore strength of
relationships between response and predictors.
– Data summary: model used merely as a way to
summarize a large set of data by a single
equation.
The second step
• Decide which predictor variables and which response variable to collect data on.
• Collect the data.
The third step
• Explore the data:
– Check for outliers, gross data errors, and missing values on a univariate basis.
– Study bivariate relationships to reveal other outliers, to suggest possible transformations, and to identify possible multicollinearities.
The fourth step
• Randomly divide the data into a training set and a test set (a minimal sketch follows this list):
– The training set, with at least 15-20 error d.f., is used to fit the model.
– The test set is used for cross-validation of the fitted model.
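A minimal sketch of this step, assuming the assembled data sit in a pandas DataFrame named `data` (the name and the 25% test fraction are illustrative choices, not prescriptions):

```python
# Randomly hold out a test set; candidate models are fit on `train` only.
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.25, random_state=1)
```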
The fifth step
• Using the training set, fit several candidate models:
– Use best subsets regression.
– Use stepwise regression (it gives only one model unless you specify different alpha-to-remove and alpha-to-enter values).
The sixth step
• Select and evaluate a few “good” models:
– Select based on adjusted R2, Mallow’s Cp,
number and nature of predictors.
– Evaluate selected models for violation of model
assumptions.
– If none of the models provide a satisfactory fit,
try something else, such as more data, different
predictors, a different class of model …
The final step
• Select the final model:
– Compare the competing models by cross-validating them against the test data (a minimal sketch follows this list).
– The model with the larger cross-validation R-squared is the better predictive model.
– Consider residual plots, outliers, parsimony, relevance, and ease of measurement of predictors.
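A minimal sketch of the cross-validation R-squared used in this comparison, assuming `train` and `test` DataFrames from the split in the fourth step (the column names in the usage line are illustrative): fit the candidate model on the training set, predict the held-out responses, and compute R-squared as 1 - SSE/SSTO on the test set.

```python
# Cross-validation (prediction) R-squared of a candidate model on held-out data.
import statsmodels.api as sm


def cross_validation_r2(train, test, response, predictors):
    fit = sm.OLS(train[response], sm.add_constant(train[predictors])).fit()
    pred = fit.predict(sm.add_constant(test[predictors]))
    sse = ((test[response] - pred) ** 2).sum()
    ssto = ((test[response] - test[response].mean()) ** 2).sum()
    return 1 - sse / ssto


# Hypothetical usage:
# cross_validation_r2(train, test, "BP", ["Age", "Weight", "BSA"])
```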