Transcript Linear and Logistic Regression using SAS Enterprise Miner
Regression for Data Mining
Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan
Learning Objectives
To understand the application of regression analysis in data mining Linear/nonlinear Logistic (Logit) To understand the key statistical measures of fit To learn how to run and interpret regression analyses using SAS Enterprise Miner software
Analysis of Association
In business problems interests often go beyond the statistical testing of differences (e.g., female versus male preferences) Often interested in degree of association between variables. Regression is one of the techniques that helps uncover those relations.
Linear Regression Analysis
Analysis of the strength of the
linear relationship
between predictor (independent) variables and outcome (dependent/criterion) variables.
In two dimensions (one predictor, one outcome variable) data can be plotted on a scatter diagram.
E(y) = b 0 + b 1 (x) Expected value of y (outcome) Intercept Term coefficient Predictor variable
Estimation Process
Regression Model y = b 0 Regression Equation E(y) = b 0 + b , 0 b b 1 1 x + + b 1 e
x
Unknown Parameters Sample Data:
x y x
1
y
1
. .
. .
x n y n b
0 and b 1 provide estimates of b 0 and b 1 Estimated Regression Equation ˆy b b x Sample Statistics
b
0 , b 1
Simple Linear Regression Equation: Positive Linear Relationship
E(y): Outcome
Regression line
Intercept b 0 Slope b 1 is positive
x : Predictor
Simple Linear Regression Equation: Negative Linear Relationship
E(y): Outcome Intercept b 0
Regression line
Slope b 1 is negative
x: Predictor
Simple Linear Regression Equation: No Relationship
E(y): Outcome • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •
x: Predictor
Simple Linear Regression Equation: No Relationship
Intercept b 0 E(y)
Regression line
Slope is 0 b 1
x
Simple Linear Regression Equation: Parabolic Relationship
E(y): Outcome Intercept b 0 • •• •• • •• •
x: Predictor
Example
List Variables we have Determine a DV of interest Is there a way to predict DV?
Least Squares Method
Least Squares Criterion: minimize error (distance between actual data & estimated line)
i
i
) 2 where:
y i
= observed value of the dependent variable for the
i
th observation
y
^
i
= estimated value of the dependent variable for the
i
th observation
Least Squares Method
Slope for the Estimated Regression Equation
b
1 (
x
i
(
x i
i x
) 2
y
)
Least Squares Method y Intercept for the Estimated Regression Equation
b
0 where:
x i
= value of independent variable for
i
th
y i
= observation value of dependent variable for
i
th _ _
y
observation = mean value for independent variable = mean value for dependent variable n = total number of observations
Least Squares Estimation Procedure
Least Squares Criterion: The sum of the vertical deviations (y axis) of the points from the line is minimal.
Predicted Line Actual Data
Example: Kwatts vs. Temp
Temp Kwatts 59.2
9,730 61.9
55.1
9,750 10,180 66.2
52.1
69.9
46.8
76.8
79.7
79.3
80.2
83.3
10,230 10,800 11,160 12,530 13,910 15,110 15,690 17,020 17,880
Is the Relationship Linear?
KWatts vs. Temp
20,000 18,000 16,000 14,000 12,000 10,000 8,000 6,000 4,000 2,000 0 40 45 50 55 60 65
Temp
70 75 80 85 90 KWatts
Example Results
Let X = Temp, Y = Kwatts Y = 319.04 + 185.27 X
KWatts vs. Temp
20,000 18,000 16,000 14,000 12,000 10,000 8,000 6,000 4,000 2,000 0 40 45 50 55 60 65
Temp
70 75 80 85 90 KWatts Forecast average
Coefficient of Determination
How “strong” is relationship between predictor & outcome? (Fraction of observed variance of outcome variable explained by the predictor variables).
Relationship Among SST, SSR, SSE SST = SSR + SSE ( ) 2 (
i
) 2 + ( ˆ
i
) 2 where: SST = total sum of squares SSR = sum of squares due to regression SSE = sum of squares due to error
Coefficient of Determination (
r
2
)
r
2 = SSR/SST where: SSR = sum of squares due to regression SST = total sum of squares
Kwatts vs. Temp Example
Regression Residual df 1 10 SS 58784708.31
38696916.69
Total 11
r 2 =
0.603033734
97481625 Does the linear regression provide a good fit?
Assumptions About the Error Term
e 1. The error e is a random variable with mean of zero.
2. The variance of e , denoted by 2 , is the same for all values of the independent variable.
3. The values of e are independent.
4. The error e variable.
is a normally distributed random
Significance Test for Regression
Is the value of b 1 zero?
Two tests are commonly used: t Test and F Test Both the variance (
t
test and
F
test require an estimate of the 2 ) of the error ( e) .
As in most of our statistical work, we are working with a sample, not the population, so we use mean square error (
s
2
).
Testing for Significance
An Estimate of
s
2 = MSE = SSE/(n 2) where: SSE (
y i
i
) 2 (
y i
b
0
b
1
x i
) 2
Testing for Significance
An Estimate of • • To estimate we take the square root of 2 .
The resulting
s
the estimate.
is called the standard error of
s
MSE SSE
n
2
Testing for Significance: t Test
Hypotheses: Coefficient ( b 1 ) is 0 (no relationship between predictor & outcome) Calculating t Statistic:
t
b
1
s b
1
Testing for Significance:
t
Test
H
: b 1 0 2. Specify the level of significance.
a = .05
3. Select the test statistic.
t
b
1
s b
1 4. State the rejection rule.
H
0 : b 1 0 if p-value < .05 or |t| > 3.182 (with 3 degrees of freedom)
Alternative Test:
F
Test
H
: b 1 0 Different Test Statistic: F = MSR/MSE
Testing for Significance:
F
Test Reject if:
p
-value < a or F > F a F = MSR/MSE where :
F
a is based on an
F
distribution with 1 degree of freedom in the numerator and
n
- 2 degrees of freedom in the denominator
Testing for Significance:
F
Test
H
: b 1 0 2. Specify the level of significance.
a = .05
3. Select the test statistic. F = MSR/MSE 4. State the rejection rule.
H
0 : b 1 0 if p-value < .05 or F > 10.13 ( with 1 d.f. in numerator and 3 d.f. in denominator)
Standard Error of the Estimate
Standard Error of Estimate has properties analogous to those of standard deviation. How “good” is our “fit”?
Interpretation is similar: ~68% of outcomes/predictions within one s est . ~95% of outcomes/predictions within two s est .
Kwatts vs. Temp Example
ANOVA
df SS MS F F Regression Residual Total 1 10 11 58784708.31
38696916.69
97481625 58784708.31
3869691.669
15.19
Significance 0.002972726
Intercept Temp Coefficients 319.0414124
185.2702073
Standard Error t Stat 3260.412811
47.53479059
0.097853073
3.897570706
P-value 0.923982528
0.002972726
Is the regression model statistically significant? Is the coefficient of Temp significant?
Cautions about Interpreting Significance Tests
Statistical significance does not mean linear relationship between
x
and
y
. Relationship between
x
and y does not
mean
a cause-and-effect relationship is present between
x
and
y
.
SAS Enterprise Miner
These results can be obtained using Excel or using a data mining package such as SAS Enterprise Miner 5.3
Using SAS Enterprise Miner requires the following steps: Convert your data (usually in an Excel file) into a SAS data file Using SAS 9.1
Create a project in Enterprise Miner Within the project: Create a data source using your SAS data file Create a diagram that includes a data node and a regression node and a multiplot node for graphs Run the model in the diagram and review the results
Creating a SAS data file from an Excel file: open SAS 9.1. Select File then Import Data
This opens the import wizard. Since the source file is from Excel, click Next. Then click Browse to find the TempKWatts.xls file
Since the data are on sheet1$, click Next. Then enter
SASUSER
as the Library and
TEMPKILOWATTL
as the Member. Then click
Next
Now click Finish to create your file
Open SAS Enterprise Miner 5.3. Enter the user name and password provided
The Enterprise Window below opens. Select
New Project
The Create New Project dialog box appears. Select the General tab, then type the short name of the project, e.g.,
KWattTemp0.
Keep the default path.
In the Startup code tab, enter: libname Ktemps "C:\Documents and Settings\mliberat\My Documents\My SAS Files\9.1\EM_Projects"; This code will be run each time you open the project
The Enterprise Miner application window opens
Right-click on Data Source, opening the wizard. Source is SAS table, so click Next
Browse the SAS libraries to find the SAS table
Tempkilowattl
found in the SASuser Library (previously created)
Click Next twice. Note that the Table properties shows that we have two variables with 12 observations
The next step controls how Enterprise Miner organizes metadata for the variables in your data. Select advanced, then click next (you can view/change the settings if you click Customize before clicking Next)
Change Role of KWatts to target (outcome variable); change Level of both KWatts and Temp to interval (continuous values); then click Next (Other levels are possible, such as binary). You can click on Explore if you wish to look at some basic stats – we will do this later
Here Role relates to the role of the data set (raw, train, validate, score); raw is fine for our analysis of data, so click
Finish
Tempkilowattl now appears under Data Sources in the top left panel called the Project Panel
We need to create a Diagram for our model. Right-click on Diagrams, then enter
TempKwatts0
in the dialog box. Now the left panel shows TempKwatts0 as a Diagram, and the right hand panel is called the Diagram Workspace. Icons can be dragged and dropped onto the Diagram Workspace.
Now add an Input Data Node to the Diagram. From the Data Sources list in the Project Panel drag and drop the Data Source TempKwatts0 onto the Diagram Workspace. Note that when input data node is highlighted, various properties are displayed on the left-hand panel.
If you wish to see the properties of any or all of the variables, highlight the input data node; then on the left hand Properties Panel under Train, click on the box to the right of Variables; in the screen that opens control-click on KWatts and Temp; then click on Explore in the lower right
Frequency distributions for the variables and the raw data are provided. Right-clicking on observations in the lower-left panel will show where they appear in the bar charts. Cancel when finished.
Click on the Explore tab found over the Diagram Workspace, and then drag and drop the Multiplot icon onto the field. Using your cursor, draw a directed arrow from the TempKwattsl icon to the Multiplot icon. With the Multiplot icon highlighted, its properties are found in the left-hand Properties Panel.
Right-click on the Multiplot icon and select Run. After the run is completed select Results from the Run Status window.
Various charts are available as shown below. Descriptive statistics for each variable are given in the lower pane.
Click on the Model tab and drag the Regression icon onto the Model field. Connect the Tempkwattsl icon to the Regression icon. Highlight the Regression icon and on the Property Panel change Regression Type to linear regression.
Run the Regression and select Results. Starting from the upper left and going clockwise, these windows show the fit between target and predicted in percentile terms, the various fit statistics, model output (estimates, F and t stats, R square), and the two effects (intercept and slope – bars represent size and color represents direction)
For a given percentile, the Target Mean is the actual (or estimated value based on actuals), or what you are trying to predict; the Mean for Predicted is the forecasted values, or the predictions (or estimated values based on forecasts). The results are shown from highest to lowest forecasted values. The distances between the curves shows how well the model predicts the actual data.
A variety of fit statistics are provided. These include SSE, MSE=SSE/(n-2), ASE=SSE/n, RMSE=SQRT(MSE), RASE=SQRT(SSE), FPE = MSE (n+p+1)/n, MAX = largest error in terms of absolute value, where n = no. of observations, p=no. of variables in model (one in our case).
Schwartz’s Bayesian Criterion and Akaike’s Information Criterion are used for model selection (comparing one model to another). Schwartz’s adjusts the residual squared error for the number of parameters estimated, while Akaike’s is a relative measure of information lost from fitting the model.
Kwatts vs. Temp Example 2
Another approach to modeling the relationship between Kwatts and Temp is to use a
nonlinear regression
This is easily accomplished in Enterprise Miner – highlight the regression node, then in the left hand panel select
yes
for polynomial terms
We use the default of two terms
Is the fit any better???
Multiple Regression
Consider the following data relating family size and income to food expenditures: 16 17 18 19 20 8 9 10 11 12 13 14 15 4 5 6 7 family 1 2 3 food $ 5.2
5.1
5.6
4.6
11.3
8.1
7.8
5.8
5.1
18 4.9
11.8
5.2
4.8
7.9
6.4
20 13.7
5.1
2.9
30 40 82 42 58 28 20 42 income $ 28 26 32 24 54 59 44 47 112 85 31 26 1 6 5 2 2 2 1 6 3 4 1 5 3 1 4 2 3 family size 3 3 2
Multiple Regression
We can run this problem in Enterprise Miner using the same approach followed with the previous example On our model field we have placed the data source called
foodexpenditures
, and also both
Multiplot
and
StatExplore
found under the
Explore
tab above the model field Highlight
foodexpenditures
, then in the left-hand panel under
Training
, find
variables
and click on the box to the right to open up the variables Change the
role
of
family
to
rejected
observation) and change the
level
of (it is just the number of the
food_
to
target
, and
income_, food_,
and
fam_size
to
interval,
then click
OK
Foodexpenditures Model
Highlight the StatExplore node, right-click to Run, then select Results. Correlations between the input variables and the target are provided, along with basic statistics. The input variables are ordered by the size of the correlations. Now close out the results window and run the regression node and obtain results
Starting from the upper left and going clockwise, these windows show the fit between target and predicted in percentile terms, the various fit statistics, model output (estimates, F and t stats, R-square), and the three effects (intercept and slopes for the two input variables with bars represent size and color represents direction). The model is significant and is a good fit with the data.
What happens in regression analysis when the target variable is binary?
There are many situations when the target variable is binary – some examples: whether a customer will or will not receive credit whether a customer will or will not response to a promotion Whether a firm will go bankrupt in a year
Whether a student will pass an exam!!!
Passing an Exam Data
6 7 8 9 10 11 12 13 14 3 4 5 Student id 1 2 1 1 1 0 1 0 1 1 0 0 0 0 Outcome 0 1 15 26 29 14 58 2 31 26 11 Study Hours 3 34 17 6 12
Running a linear regression to predict pass/don’t pass as a function of hours of study provides a model that doesn’t correctly model the data. The data are given in exampassing.xls
Passing an Exam
1.6
1.4
1.2
1 0.8
0.6
0.4
0.2
0 0 10 20 30 40
hours of study
50 60 70 Actual Predicted
The Enterprise Miner results show a poor fit on a percentile basis between predicted and target – another modeling approach is needed.
Logistic Regression
Similar to linear regression, two main differences Y (outcome or response) is categorical Yes/No Approve/Reject Responded/Did not respond Result is expressed as a
probability
of being in either group.
Comparing the Logistic & Linear Regression Models
Logisitic regression
p = Prob(y=1|x) = exp(a+bx)/[1+exp(a+bx)] 1-p =1/[1+exp(a+bx)] ln [p/(1-p)] = a + bx where:
exp
or
e
is the exponential function (e=2.71828…)
ln
is the natural logarithm
(ln(e) = 1)
p is probability that the event y occurs given x, and can range between 0 and 1 p/(1-p) is the
"odds ratio"
ln[p/(1-p)] is the log odds ratio, or "logit" all other components of the regression model are the same
Odds Ratio
Frequently used Related to probability of an event as follows:
Odds Ratio = p/(1-p)
Example: Probability of firm going bankrupt = .25
Odds firm will go bankrupt = .25/(1-.25) =
1/3 or 3 to 1
This is how sports books calculate odds (e.g., if odds of VU winning a championship are 2:1, probability is 1/3 ln [p/(1-p)] = a + bx means that as x increases by 1, the natural log of the odds ratio increases by b, or the odds ratio increase by a factor of exp(b)
Probability, Odds Ratio, LN of Odds Ratio
25 20 15 10 5 0 -5 0.
05 0.1 0.15 0.
2 0.
25 0.3 0.35 0.
4 0.
45 0.5 0.55 0.
6 0.
65 0.7 0.75 0.
8 0.
85 0.9 0.95
probability
odds nl(odds)
Running the exam data: Change regression type from linear regression to logistic regression Highlight the data node; on left-hand panel under Train open variables and change the level of outcome to binary
Results show a much better fit (upper left) and only one misclassification (lower right – a false negative).
The results show that the odds ratio = p(1-p) = exp( 8.4962+0.4949x). For every additional hour of study the odds ratio increases by a factor of exp(0.4949)= 1.640
Understanding Response Rate and Lift
To better understand the top left chart, change
cumulative lift
to
cumulative % response
. The observations are ranked by the predicted probability of response (highest to lowest) for each observation (from the fitted model).
Understanding Response Rate and Lift
Since the first 6 passes were correctly classified, the cumulative % response is 100% through the 40 th percentile. At the 50 th percentile the next observation with the highest predicted probability is a non-response, so the cumulative response drops to 6/7 or 85.7%. The 8 th ranked observation, between the 55 th and 60 th percentile, is a positive response, so the cumulative % response is about 7/8 or 87%. Since there are no more positive responses after the 60 th percentile, the cumulative response rate will drop to 50%.
The chart compares how well the cumulative ranked predictions lead to a match between actual and predicted responses
Understanding Response Rate and Lift
Lift calculates the ratio of the actual response rate (passing) of the top n% of the ranked observations to the overall response rate. Cumulative lift is likewise defined. At the 50 th percentile, the cumulative % response is 88.7%, the cumulative base response is 50%, for a lift of 1.7142.
On the Properties Panel, click on Exported Data to see the predicted probabilities and response for each observation and compare to the actual response .
Logistic regression uses maximum likelihood (and not sum of squared errors) to estimate the model parameters. The results below show that the model is highly significant based on a chi-square test. The Wald chi-square statistic tests whether an effect is significant or not.
Bankruptcy Prediction
To predict bankruptcy a year in advance, you might collect: working capital/total assets (WC/TA) retained earnings/total assets (RE/TA) earnings before interest and taxes/total assets (EBIT/TA) market value of equity/total debt (MVE/TD) sales/total assets (S/TA)
Bankruptcy Training Data
7 8 9 10 11 Firm 1 2 3 4 5 6 12 13 14 15 16 17 18 19 20 WC/TA 0.0165
0.1415
0.5804
0.2304
0.3684
0.1527
0.1126
0.0141
0.222
0.2776
0.2689
0.2039
0.5056
0.1759
0.3579
0.2845
0.1209
0.1254
0.1777
0.2409
RE/TA 0.1192
0.3868
0.3331
0.296
0.3913
0.3344
0.3071
0.2366
0.1797
0.2567
0.1729
-0.0476
-0.1951
0.1343
0.1515
0.2038
0.2823
0.1956
0.0891
0.166
EBIT/TA 0.2035
0.0681
0.081
0.1225
0.0524
0.0783
0.0839
0.0905
0.1526
0.1642
0.0287
0.1263
0.2026
0.0946
0.0812
0.0171
-0.0113
0.0079
0.0695
0.0746
MVE/TD 0.813
0.5755
0.5755
0.4102
0.1658
0.7736
1.3429
0.5863
0.3459
0.2968
0.1224
0.8965
0.538
0.1955
0.1991
0.3357
0.3157
0.2073
0.1924
0.2516
S/TA 1.6702
1.0579
1.0579
3.0809
1.1533
1.5046
1.5736
1.4651
1.7237
1.8904
0.9277
1.0457
1.9514
1.9218
1.4582
1.3258
2.3219
1.489
1.6871
1.8524
0 0 0 0 0 0 0 0 0 1 1 1 1 0 BR/NB 1 1 1 1 1 1
Bankruptcy Example
Using the BankruptTrain.xls data create a SAS data file called bankrupt BR_NB:
role
is
target
and
level
is
binary
Firm:
role
is
rejected
and level is
nominal
(it is simply the firm number) Remaining five financial ratio variables:
role
is
input
and
level
is
interval
Create a diagram named bankrupt1. Drag and drop the data node onto the model. Highlight the data node and on the left hand panel under variables click on the box to its right to see the variables data
From the Explore tab drag and drop the StatExplore node onto the diagram and link it to the bankrupt node. Highlight the StatExplore node, right-click and run it, and obtain results. On top, correlations between the five input variables and the target are shown via bars ordered from largest to smallest. Below the mean variable score for bankrupt vs. non bankrupt observations is shown.
From the Model tab drag and drop the regression node onto the diagram and connect it to the bankrupt node. Highlight the regression node and run, and obtain the results
The results show that the model fits the data very well with highly significant overall chi square statistic, low error values, and 0 misclassifications. Cumulative lift shows that for the top 50% of observations that are bankrupt, they are twice as likely to be classified as bankrupt.
Scoring
Once you have specified a model you might wish to apply it to new data whose outcome is unknown -- make predictions This can be easily accomplished in Enterprise Miner using scoring Convert the data set BankruptScore.xls to a SAS file called
bankruptscore
. The
role
of this data is
score
.
Bankruptcy Scoring Data
F G I H J Firm A B C D E WC/TA 0.1759
0.3732
0.1725
0.163
0.1904
0.1123
0.0732
0.2653
0.107
0.2921
RE/TA 0.1343
0.3483
0.3238
0.3555
0.2011
0.2288
0.3526
0.2683
0.0787
0.239
EBIT/TA 0.0956
-0.0013
0.104
0.011
0.1329
0.01
0.0587
0.0235
0.0433
0.9673
MVE/TD 0.1955
0.3483
0.8847
0.373
0.558
0.1884
0.2349
0.5118
0.1083
0.3402
S/TA 1.9218
1.8223
0.5576
2.8307
1.6623
2.7186
1.7432
1.835
1.2051
0.9277
Drag and drop the bankruptscore data node to the bankrupt1 diagram. From the Assess tab, drag and drop the Score node into the diagram. Link the regression and bankruptscore nodes together and connect them to the Score node.
Run the Score node and obtain the Results. Of the 10 firms, 6 are predicted to become bankrupt.
For details about the individual predictions, highlight the Score node and on the left-hand panel click on the square to the right of Exported Data. Then in the box that appears click on the row whose Port entry is Score. Then click on Explore.
The lower portion of the output is shown below. The predictions are given, along with the probabilities of the firm becoming bankrupt or not.
Regression Using Selection Models
When there are a number of possible input variables, procedures are available to sort through them and include those that have a certain level of statistical significance SAS Enterprise Miner 5.3 offers three
selection methods
: Backward Forward Stepwise
Regression Using Selection Models
Backward: training begins with all candidate effects in the model and removes effects until the
stay significance level
or the stop criterion is met Forward: training begins with no candidate effects in the model and adds effects until the
entry significance level
or the stop criterion is met. Stepwise: training begins as in the forward model but may remove effects already in the model. This continues until the
stay significance level
or the stop criterion is met Note that the default significance levels (p values) values are 0.05 and no stop criteria (such as maximum number of steps in the regression) are set
Regression Using Selection Models – Bankruptcy Model
To select
stepwise regression
for the bankruptcy model, highlight the regression node and in the properties panel under
Selection Model
choose
Stepwise
. The default significance level of 0.05 is used
Regression Using Selection Models – Bankruptcy Model
Interestingly, the Training Model only uses RE/TA as a predictor There are 3 misclassifications (.15 rate) in this set vs. 0 in the original model The results are very different: the original model with all 5 input variables predicted bankruptcy for G, E, C, and J, while the stepwise model predicted B, C, D, F, G, H, and J would become bankrupt.
Changing the significance levels to 0.1 (to make it easier for input variables to enter/leave the stepwise model) produces the same results