mondayfeb14lecture.ppt


Business Statistics, Can. ed.
By Black, Chakrapani & Castillo
Chapter 14
Building Multiple Regression Models
Prepared by Dr. Clarence S. Bayne
JMSB, Concordia University
Business Statistics, Can. Ed. © 2010 John Wiley & Sons Canada, Ltd.
Learning Objectives
• Analyze and interpret nonlinear variables in multiple regression analysis.
• Understand the role of qualitative variables and how to use them in multiple regression analysis.
• Learn how to build and evaluate multiple regression models.
• Recognize multicollinearity and know how to deal with it.
Mathematical Transformations:
Recoding Independent Variables to
Create Non-linear Models
Description of Models and Equations
First-order model with two independent variables:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$
Second-order model with one independent variable:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \varepsilon$
Second-order model with an interaction term:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$
Second-order model with two independent variables:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 + \varepsilon$
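Recoding means adding new columns (squares and products of the original variables) so that an ordinary linear regression can fit a curved relationship. The following is a minimal sketch of that recoding for the full second-order model with two independent variables. The function name `second_order_design` and the tiny input values are made up for illustration.

```python
import numpy as np

def second_order_design(x1, x2):
    """Build the design matrix for the full second-order model with two predictors.

    Columns follow the order of beta_0 .. beta_5:
    intercept, X1, X2, X1^2, X2^2, X1*X2.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

# Made-up inputs, only to show what the recoded columns look like.
X = second_order_design([1, 2], [3, 4])
print(X)
```

Dropping the squared columns leaves the interaction model, and dropping every column after X2 leaves the first-order model.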
A Curvilinear Scatter Plot of Sales Data
for 13 Manufacturing Companies
Manufacturer   Sales ($1,000,000)   Number of Manufacturing Representatives
1              2.1                  2
2              3.6                  1
3              6.2                  2
4              10.4                 3
5              22.8                 4
6              35.6                 4
7              57.1                 5
8              83.5                 5
9              109.4                6
10             128.6                7
11             196.8                8
12             280.0                10
13             462.3                11
[Scatter plot: Sales (0 to 500) against Number of Representatives (0 to 12)]
Excel Simple Linear Regression Output for the Manufacturing Example
Regression Statistics
Multiple R          0.933
R Square            0.870
Adjusted R Square   0.858
Standard Error      51.10
Observations        13
ANOVA
             df   SS       MS       F       Significance F
Regression   1    192395   192395   73.69   0.000
Residual     11   28721    2611
Total        12   221117
            Coefficients   Standard Error   t Stat   P-value
Intercept   -107.03        28.737           -3.72    0.003
numbers     41.026         4.779            8.58     0.000
Second-Order Model with One Independent Variable: Manufacturing Sales Data (Table 14.2)
Scatter Plots Showing the Original Curvilinear Data and the More Linear Transformed Data (Figure 14.2)
Computer Output for Quadratic Model to Predict Sales
Regression Statistics
Multiple R          0.986
R Square            0.973
Adjusted R Square   0.967
Standard Error      24.593
Observations        13
ANOVA
             df   SS       MS       F        Significance F
Regression   2    215069   107534   177.79   0.000
Residual     10   6048     605
Total        12   221117
            Coefficients   Standard Error   t Stat   P-value
Intercept   18.067         24.673           0.73     0.481
MfgrRp      -15.723        9.5450           -1.65    0.131
MfgrRpSq    4.750          0.776            6.12     0.000
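The comparison between the straight-line and quadratic fits can be sketched with NumPy, using the 13 manufacturing observations listed earlier. The helper name `fit_ols` is an assumption for illustration. Squaring the number of representatives is the recoding step that creates the MfgrRpSq column.

```python
import numpy as np

# Number of manufacturing representatives and sales ($1,000,000) for the 13 companies.
reps = np.array([2, 1, 2, 3, 4, 4, 5, 5, 6, 7, 8, 10, 11], dtype=float)
sales = np.array([2.1, 3.6, 6.2, 10.4, 22.8, 35.6, 57.1, 83.5,
                  109.4, 128.6, 196.8, 280.0, 462.3])

def fit_ols(X, y):
    """Least-squares coefficients and R^2 for a design matrix X that includes the intercept."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, r2

ones = np.ones_like(reps)
beta_lin, r2_lin = fit_ols(np.column_stack([ones, reps]), sales)
beta_quad, r2_quad = fit_ols(np.column_stack([ones, reps, reps**2]), sales)

print("linear:   ", np.round(beta_lin, 3), round(r2_lin, 3))
print("quadratic:", np.round(beta_quad, 3), round(r2_quad, 3))
```

The fitted coefficients and R-squared values should agree with the two regression outputs above up to rounding.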
Tukey’s Ladder of Transformation
The Four Quadrant Approach
• Move toward $y^2, y^3, \ldots$ or toward $\log x, -1/\sqrt{x}, \ldots$
• Move toward $\log x, -1/\sqrt{x}, \ldots$ or toward $\log y, -1/\sqrt{y}, \ldots$
• Move toward $y^2, y^3, \ldots$ or toward $x^2, x^3, \ldots$
• Move toward $x^2, x^3, \ldots$ or toward $\log y, -1/\sqrt{y}, \ldots$
Regression Models With Interactions
• Often in the real world of business and economics, interaction occurs between two variables
• One variable acts differently over one range of values for the second variable than it does over another range of values for the second variable
• In a manufacturing plant, humidity might affect the hardness of material differently at different temperatures
• The ANOVA model in Chapter 11 addressed this problem by using an interaction variable as a blocking variable
• In regression analysis, interaction can be examined as a separate independent variable
• This is illustrated by using the second-order model design with two independent variables and an interaction term
Table 14.3 Share Prices of Three Stocks over a 15-Month Period
Problem Definition:
The data represent the closing prices of the stocks of three corporations over a 15-month period. An investment firm wants to use the prices of stocks 2 and 3 to develop a regression model to predict the price of stock 1.
Stock 1   Stock 2   Stock 3
41        36        35
39        36        35
38        38        32
45        51        41
41        52        39
43        55        55
47        57        52
49        58        54
41        62        65
35        70        77
36        72        75
39        74        74
33        83        81
28        101       92
31        107       91
Develop Model Using Step by Step
Approach and Explore for Interaction
First-order with Two Independent Variables
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$
where: $Y$ = price of stock 1
$X_1$ = price of stock 2
$X_2$ = price of stock 3
Second-order with an Interaction Term
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$
where: $Y$ = price of stock 1
$X_1$ = price of stock 2
$X_2$ = price of stock 3
$X_3 = X_1 X_2$
Initial Regression First-order Model with
Two Independent Variables
The regression equation is
Stock 1 = 50.9 - 0.119 Stock 2 - 0.071 Stock 3
Predictor   Coef      StDev    T       P
Constant    50.855    3.791    13.41   0.000
Stock 2     -0.1190   0.1931   -0.62   0.549
Stock 3     -0.0708   0.1990   -0.36   0.728
S = 4.570   R-Sq = 47.2%   R-Sq(adj) = 38.4%
Analysis of Variance
Source       DF   SS       MS       F      P
Regression   2    224.29   112.15   5.37   0.022
Error        12   250.64   20.89
Total        14   474.93
Excel Regression Second-order Model with
Interaction Term for the Three Stocks
The regression equation is
Stock 1 = 12.0 + 0.879 Stock 2 + 0.220 Stock 3 - 0.00998 Inter
Predictor   Coef        StDev      T       P
Constant    12.046      9.312      1.29    0.222
Stock 2     0.8788      0.2619     3.36    0.006
Stock 3     0.2205      0.1435     1.54    0.153
Inter       -0.009985   0.002314   -4.31   0.001
S = 2.909   R-Sq = 80.4%   R-Sq(adj) = 75.1%
Analysis of Variance
Source       DF   SS       MS       F       P
Regression   3    381.85   127.28   15.04   0.000
Error        11   93.09    8.46
Total        14   474.93
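The effect of adding the interaction term can be sketched by fitting both models to the 15 monthly prices from Table 14.3. This is a minimal NumPy illustration rather than the Minitab or Excel procedure itself. The helper name `fit_ols` is an assumption, and the interaction column is simply the product of the stock 2 and stock 3 prices.

```python
import numpy as np

# Closing prices from Table 14.3 (15 months).
stock1 = np.array([41, 39, 38, 45, 41, 43, 47, 49, 41, 35, 36, 39, 33, 28, 31], dtype=float)
stock2 = np.array([36, 36, 38, 51, 52, 55, 57, 58, 62, 70, 72, 74, 83, 101, 107], dtype=float)
stock3 = np.array([35, 35, 32, 41, 39, 55, 52, 54, 65, 77, 75, 74, 81, 92, 91], dtype=float)

def fit_ols(X, y):
    """Least-squares coefficients and R^2 for a design matrix X that includes the intercept."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, r2

ones = np.ones_like(stock1)
inter = stock2 * stock3  # recoded interaction variable X1*X2

beta_first, r2_first = fit_ols(np.column_stack([ones, stock2, stock3]), stock1)
beta_inter, r2_inter = fit_ols(np.column_stack([ones, stock2, stock3, inter]), stock1)

print("first-order R^2:     ", round(r2_first, 3))
print("with interaction R^2:", round(r2_inter, 3))
print("interaction model coefficients:", np.round(beta_inter, 4))
```

In the interaction model, the coefficients on stock 2 and stock 3 come out positive and only the interaction coefficient is negative, matching the Coef column of the output.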
Response Surface for the Stock Example Without and With Interaction
Regression Statistics from Two Excel
Output Summaries With and Without
Interaction
Summary Regression Statistics for Share Prices of Three Stocks
                    With No Interaction   With Interaction
Multiple R          0.687213365           0.89666084
R Square            0.47226221            0.804000661
Adjusted R Square   0.384305911           0.750546296
Standard Error      4.570195728           2.90902388
Observations        15                    15
Analysis and Conclusions
• By using the interaction term, the coefficient of determination (R²) increases from 0.47 to 0.80
• The standard error of the estimate decreases from 4.57 in the first model to 2.909 in the second
• The t ratios for the X1 term and the interaction term are statistically significant in the second model: t = 3.36 with a p-value of 0.006 for X1, and t = -4.31 with a p-value of 0.001 for X1X2
• Inclusion of X1X2 helped the model account for a substantially greater amount of the variation in the dependent variable. It is a significant contributor to the model
• The second graph in Figure 14.6 shows how the interaction term bends the curve to fit the data as stock 2 is increased
• Be cautious in interpreting the partial regression coefficients because of the high likelihood of multicollinearity
Model-Building: Search Procedures
• Search procedures are processes whereby more than one multiple regression model is developed for a given database, and the models are compared and sorted by different criteria, depending on the given procedure
• There are many search procedures. Among the most widely known are
– All Possible Regressions
– Stepwise Regression
– Forward Selection
– Backward Elimination
• Which approach is best is subject to much debate and depends on the discipline and the philosophy of enquiry that the researcher brings to the research
All Possible Regressions
• The all possible regressions search procedure computes all possible linear multiple regression models from the data using all variables
• If a data set contains k independent variables, all possible regressions will determine $2^k - 1$ different models
• This produces all possible different models with a single predictor, two predictors, three predictors, and so on up to all k predictors
• The next slide shows the predictors for all possible regressions with five independent variables
• If a research methodology and study design exist that identify all essential variables, the procedure enables the business researcher to examine every model
• Warning: this search through all possible models can be tedious, time consuming, inefficient, and perhaps overwhelming
All Possible Regressions
with Five Independent Variables
Single Predictor: X1; X2; X3; X4; X5
Two Predictors: X1,X2; X1,X3; X1,X4; X1,X5; X2,X3; X2,X4; X2,X5; X3,X4; X3,X5; X4,X5
Three Predictors: X1,X2,X3; X1,X2,X4; X1,X2,X5; X1,X3,X4; X1,X3,X5; X1,X4,X5; X2,X3,X4; X2,X3,X5; X2,X4,X5; X3,X4,X5
Four Predictors: X1,X2,X3,X4; X1,X2,X3,X5; X1,X2,X4,X5; X1,X3,X4,X5; X2,X3,X4,X5
Five Predictors: X1,X2,X3,X4,X5
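The listing above can be generated mechanically. The following short sketch enumerates every non-empty subset of five predictors with the standard library; the function name `all_possible_models` is made up for illustration.

```python
from itertools import combinations

def all_possible_models(predictors):
    """Return every non-empty subset of the predictors, smallest subsets first."""
    models = []
    for size in range(1, len(predictors) + 1):
        models.extend(combinations(predictors, size))
    return models

models = all_possible_models(["X1", "X2", "X3", "X4", "X5"])
print(len(models))  # 2**5 - 1 = 31
for size in range(1, 6):
    print(size, "predictor(s):", sum(1 for m in models if len(m) == size))
```

The counts per group (5, 10, 10, 5, 1) sum to 31. Because the total roughly doubles with each added variable, the procedure quickly becomes tedious.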
Stepwise Regression
• Stepwise regression is a step-by-step process that begins by developing a regression model with a single predictor variable and then adds and deletes predictors one step at a time
• It allows the researcher to examine the fit of the model at each step until no more significant predictors remain outside the model
• The process starts by choosing the single-predictor regression with the highest t or F value that is significant at some predetermined alpha value
• If none of the independent variables meets this criterion, no model is recommended
• Other variables are then added to the equation one at a time, and each is tested for the significance of its contribution to explaining total variation, given the variables already in the model
• This procedure continues until all significant predictors are included
• Stepwise regression allows checks for multicollinearity and the dropping of variables that were included in earlier stages
Forward Selection
Like stepwise, except that variables are
not reevaluated after entering the
model
Backward Elimination
• Start with the “full model” (all k predictors)
• If all predictors are significant, stop
• Otherwise, eliminate the least significant predictor, refit the model, and return to the previous step
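The backward elimination loop can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of any package's routine: significance is judged by comparing t² (the partial F statistic for a single coefficient) with an F-to-remove threshold of 4.0, the function names are hypothetical, and the data are made up so that the third predictor has no effect.

```python
import numpy as np

def t_values(X, y, cols):
    """OLS t statistics for the predictors in `cols` (an intercept is added)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    mse = resid @ resid / (n - design.shape[1])
    se = np.sqrt(mse * np.diag(np.linalg.inv(design.T @ design)))
    return (beta / se)[1:]  # drop the intercept's t value

def backward_elimination(X, y, f_to_remove=4.0):
    """Start from the full model and repeatedly drop the least significant predictor."""
    keep = list(range(X.shape[1]))
    while keep:
        f_vals = t_values(X, y, keep) ** 2
        weakest = int(np.argmin(f_vals))
        if f_vals[weakest] >= f_to_remove:
            break  # every remaining predictor is significant
        del keep[weakest]
    return keep

# Made-up data: y depends on the first two columns only; h4 acts as small noise.
h1 = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
h2 = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
h3 = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
h4 = h1 * h2
X = np.column_stack([h1, h2, h3])
y = 5 + 3 * h1 + 2 * h2 + 0.1 * h4

print(backward_elimination(X, y))  # indices of the retained predictors
```

A p-value cutoff (for example alpha = 0.05) could replace the F threshold; the loop structure stays the same.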
Stepwise Regression
• Perform k simple regressions, and select the best as the initial model
• Evaluate each variable not in the model
– If none meets the criterion, stop
– Otherwise, add the best variable to the model, reevaluate the previous variables, and drop any that are not significant
• Return to the previous step
• The criteria for inclusion and exclusion of variables may be technical, based on common-sense observation, based on a body of theory, or based on the usefulness of newly discovered relationships as insights into meaning
• The researcher has to be keenly aware of the problem of spurious relationships when using these search procedures
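The add-then-reevaluate loop described above can be sketched in a few lines of NumPy. This is a minimal sketch under assumptions: entry and removal are decided by comparing t² with F-to-enter and F-to-remove thresholds of 4.0, the function names are hypothetical, and the data are made up. Forward selection is the same loop without the reevaluation step.

```python
import numpy as np

def t_values(X, y, cols):
    """OLS t statistics for the predictors in `cols` (an intercept is added)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    mse = resid @ resid / (n - design.shape[1])
    se = np.sqrt(mse * np.diag(np.linalg.inv(design.T @ design)))
    return (beta / se)[1:]  # drop the intercept's t value

def stepwise(X, y, f_to_enter=4.0, f_to_remove=4.0, max_steps=100):
    """Add the best remaining predictor, then re-check the ones already in the model."""
    selected = []
    for _ in range(max_steps):
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        # Partial F (t squared) of each candidate when added to the current model.
        scores = {j: t_values(X, y, selected + [j])[-1] ** 2 for j in remaining}
        if not scores or max(scores.values()) < f_to_enter:
            break  # no candidate meets the entry criterion
        selected.append(max(scores, key=scores.get))
        # Reevaluate earlier entries and drop any that are no longer significant.
        while len(selected) > 1:
            f_vals = t_values(X, y, selected) ** 2
            weakest = int(np.argmin(f_vals))
            if f_vals[weakest] >= f_to_remove:
                break
            del selected[weakest]
    return selected

# Made-up data: y depends on the first two columns only; h4 acts as small noise.
h1 = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
h2 = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
h3 = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
h4 = h1 * h2
X = np.column_stack([h1, h2, h3])
y = 5 + 3 * h1 + 2 * h2 + 0.1 * h4

print(stepwise(X, y))  # indices in the order they entered the model
```

The order in which variables enter reflects the search path only and should not be read as a ranking of their importance.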
Choosing the Variables for a Stepwise
Regression: Predicting World Crude Oil
Production Example
Problem Definition: Predicting world crude oil production
• Choice of a method: many different types of prediction models can be constructed. The researcher adopts an econometric approach using multiple regression
• After a preliminary survey of the industry and the factors surrounding it, the researcher realizes that much of the world crude oil market is driven by variables related to usage and production in the USA
• The researcher identifies five independent variables as predictors:
1. U.S. energy consumption
2. Gross U.S. nuclear electricity generation
3. U.S. coal production
4. Total U.S. dry gas (natural gas) production
5. Fuel rate of U.S.-owned automobiles
Systematic Framework Underlying
Data Collection
• A survey of published and other data on energy production and usage suggests that world production of crude oil is driven by the previous years' activities in the U.S.
• It was expected that as U.S. energy consumption increased, so would world production of crude oil
• It seemed reasonable to introduce nuclear electricity generation, coal production, dry gas production, and fuel rates to the study
• Rationale: an increase in their output may be expected to have a negative effect on crude oil production if energy consumption remains fixed
• Data on the five independent variables and the dependent variable (world crude oil production) were gathered and are presented on the next slide
Definition and Measurement of Variables:
Data for Multiple Regression Model to
Predict World Crude Oil Production
Y    World Crude Oil Production (millions of barrels per day)
X1   U.S. Energy Consumption (quadrillion BTUs per year)
X2   U.S. Nuclear Generation (billion kilowatt-hours)
X3   U.S. Coal Production (million short tons)
X4   U.S. Dry Gas Production (trillion cubic feet)
X5   U.S. Fuel Rate for Autos (miles per gallon)
Y      X1     X2      X3       X4     X5
55.7   74.3   83.5    598.6    21.7   13.30
55.7   72.5   114.0   610.0    20.7   13.42
52.8   70.5   172.5   654.6    19.2   13.52
57.3   74.4   191.1   684.9    19.1   13.53
59.7   76.3   250.9   697.2    19.2   13.80
60.2   78.1   276.4   670.2    19.1   14.04
62.7   78.9   255.2   781.1    19.7   14.41
59.6   76.0   251.1   829.7    19.4   15.46
56.1   74.0   272.7   823.8    19.2   15.94
53.5   70.8   282.8   838.1    17.8   16.65
53.3   70.5   293.7   782.1    16.1   17.14
54.5   74.1   327.6   895.9    17.5   17.83
54.0   74.0   383.7   883.6    16.5   18.20
56.2   74.3   414.0   890.3    16.1   18.27
56.7   76.9   455.3   918.8    16.6   19.20
58.7   80.2   527.0   950.3    17.1   19.87
59.9   81.3   529.4   980.7    17.3   20.31
60.6   81.3   576.9   1029.1   17.8   21.02
60.2   81.1   612.6   996.0    17.7   21.69
60.2   82.1   618.8   997.5    17.8   21.68
60.6   83.9   610.3   945.4    18.2   21.04
60.9   85.6   640.4   1033.5   18.9   21.48
Step 1: Stepwise Regression Results
with One Predictor
• Simple regressions using each independent variable in turn to predict oil production produce the initial regression equation
$\hat{y} = 13.075 + 0.580 x_1$
where y is world crude oil production and x1 is U.S. energy consumption
• The t value (11.77) in Table 14.8 is the highest of all the variables tried, and R-squared is 85.2%
Excel Output of Regression
for Crude Oil Production
Step 2: Stepwise Regression Results with
Two Predictors
• X1 is retained in the model, and a search is conducted to determine which of the remaining variables, together with it, produces the highest significant t value (that is, adds most to explaining the variation in Y).
• The new equation emerging from the computer calculation is
$\hat{y} = 7.14 + 0.772 x_1 - 0.517 x_2$
where x2 is now the U.S. fuel rate (X5 in the data table). Its t value is -3.75 (p = 0.001), and R-squared rises to 90.8%. Both predictors are highly significant.
Step 3: Regression Results with Three
Predictors
• Step 3 continues the search for additional predictor variables
• Table 14.10 shows that none of the remaining variables makes a significant contribution when added to the regression obtained at step 2. Their t values are very small.
Minitab Stepwise Output
Stepwise Regression
F-to-Enter: 4.00   F-to-Remove: 4.00
Response is Coiler on 5 predictors, with N = 26
Step        1        2
Constant    13.075   7.140
Seconds     0.580    0.772
T-Value     11.77    11.91
P-value     0.000    0.000
Fuel Rate            -0.52
T-Value              -3.75
P-value              0.001
S           1.52     1.22
R-Sq        85.24    90.83
Key Concerns
• The search procedures provide a framework for analysis and must be applied subject to common sense and an explanatory theory or analysis
• Avoid the mistake of using the strict sequential order in which variables enter a computer printout (in stepwise and forward selection) to rank the importance of the variables
• In multiple regression (unlike simple regression), the importance of an independent variable is ranked in terms of its net contribution to explaining Y when used with the other variables, not in terms of its individual correlation with Y
• Problems of multicollinearity require transformation or omission of variable(s) before or as the analysis proceeds. Adding a variable that is highly correlated with other independent variables is very problematic. It distorts the values of the coefficients and renders all tests unreliable
• An increase in R-squared is not in and of itself a good indicator of the importance of the last variable added
• Common sense and use value are the final arbiters in choosing the final model
Multicollinearity
• Condition that occurs when two or more of the independent variables of a multiple regression model are highly correlated
• Effects of multicollinearity
– Difficult, if not impossible, to interpret the estimates of the regression coefficients
– Inordinately small t values for the regression coefficients
– Standard deviations of the regression coefficients are overestimated, so the t tests and F test may have no meaning
– Algebraic sign of a predictor variable's coefficient is the opposite of what is expected
• In practice, correlations as high as 0.60 to 0.70 may be tolerated without causing a serious problem of multicollinearity
Testing for Multicollinearity
• Two techniques for determining the possible existence of multicollinearity
– Prepare a correlation matrix of the independent variables using Excel or another software program, and identify those pairs of variables that have correlations in excess of 0.70
– The variance inflation factor (VIF): conduct a regression analysis to predict one independent variable from the other independent variables, so that the independent variable being predicted becomes the dependent variable. This is repeated with each independent variable in turn, and the R-squared (coefficient of determination) $R_i^2$ is calculated for each.
$VIF = \frac{1}{1 - R_i^2}$
is the measure that determines whether the standard errors of the estimates are inflated.
• Some researchers follow the guideline that a VIF greater than 10, or an $R_i^2$ greater than 0.90 for the largest VIF, indicates a severe multicollinearity problem
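The VIF calculation can be sketched directly from its definition. The function names `vif_from_r2` and `variance_inflation_factors` are assumptions for illustration, and the small data matrix is made up, with two uncorrelated columns so that each VIF should be 1.

```python
import numpy as np

def vif_from_r2(r2):
    """Variance inflation factor for a given auxiliary R^2."""
    return 1.0 / (1.0 - r2)

def variance_inflation_factors(X):
    """VIF of each column of X, regressing it on all the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for i in range(k):
        target = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(vif_from_r2(r2))
    return vifs

# Guideline values: R^2 = 0.90 corresponds to VIF = 10.
print(vif_from_r2(0.90))
# R^2 = 0.968, the figure quoted in the crude oil example, gives a VIF of about 31.
print(vif_from_r2(0.968))

# Made-up data with two uncorrelated predictors.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
print(variance_inflation_factors(X))
```

If 0.968 is instead read as the pairwise correlation between coal and fuel rate, then R² = 0.968² ≈ 0.937 and the VIF is about 15.9. Both readings exceed the guideline of 10.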
Correlations among Oil Production
Predictor Variables
                     Energy Consumption   Nuclear   Coal     Dry Gas   Fuel Rate
Energy Consumption   1                    0.856     0.791    0.057     0.796
Nuclear              0.856                1         0.952    -0.404    0.972
Coal                 0.791                0.952     1        -0.448    0.968
Dry Gas              0.057                -0.404    -0.448   1         -0.423
Fuel Rate            0.796                0.972     0.968    -0.423    1
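The first screening technique, scanning the correlation matrix for pairs above 0.70, can be sketched as follows. The function name `high_correlation_pairs` is made up for illustration, and the matrix is entered as a symmetric array, which is an assumption about the intended values.

```python
import numpy as np

names = ["Energy Consumption", "Nuclear", "Coal", "Dry Gas", "Fuel Rate"]
corr = np.array([
    [1.000,  0.856,  0.791,  0.057,  0.796],
    [0.856,  1.000,  0.952, -0.404,  0.972],
    [0.791,  0.952,  1.000, -0.448,  0.968],
    [0.057, -0.404, -0.448,  1.000, -0.423],
    [0.796,  0.972,  0.968, -0.423,  1.000],
])

def high_correlation_pairs(corr, names, threshold=0.70):
    """List each pair of predictors whose absolute correlation exceeds the threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

for a, b, r in high_correlation_pairs(corr, names):
    print(f"{a} / {b}: r = {r:.3f}")
```

Every pair that does not involve dry gas is flagged, which is why the coefficient signs in this example need careful interpretation.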
Problem of Interpretation When
Multicollinearity Exists: World Crude Oil
Production Regression
• The algebraic signs in a regression model must conform to common-sense observation or established theory
• Note the following three equations considered at different stages of the stepwise regression analysis
1. Ŷ = 44.869 + 0.7838(fuel rate). The positive fuel rate coefficient can be interpreted in terms of economic theory: the price substitution effect.
2. Ŷ = 45.072 + 0.0157(coal). The positive coal coefficient is explainable in a complementary sense.
3. Ŷ = 45.806 + 0.0227(coal) - 0.3934(fuel rate). The negative fuel rate coefficient is opposite to that in equation 1 and is contrary to what would normally be expected from economic theory or common-sense observation.
• The reason for the apparent contradiction in equation 3 can be attributed to multicollinearity: R² = 0.968, or VIF = 31
Copyright Notice
Copyright © 2010 John Wiley & Sons Canada, Ltd. All rights reserved. Reproduction or
translation of this work beyond that permitted by Access Copyright (The Canadian
Copyright Licensing Agency) is unlawful. Request for further information should be
addressed to the Permissions Department, John Wiley & Sons Canada, Ltd. The
purchaser may make back-up copies for his/her own use only and not for distribution
or resale. The Publisher assumes no responsibility for errors, omissions, or damages
caused by the use of these programs or from the use of the information herein.