Model Building - NC State Department of Statistics


Chapter 9 Supplement
Model Building
1
Introduction
• Regression analysis is one of the most
commonly used techniques in statistics.
• It is considered powerful for several reasons:
– It can cover a variety of mathematical models
• linear relationships.
• non-linear relationships.
• nominal independent variables.
– It provides efficient methods for model building
2
Polynomial Models
• There are models where the independent
variables (xi) may appear as functions of a
smaller number of predictor variables.
• Polynomial models are one such example.
3
Polynomial Models with One Predictor Variable
The general multiple regression model is
y = b0 + b1x1 + b2x2 + … + bpxp + e
When all terms are powers of a single predictor x, it becomes the polynomial model
y = b0 + b1x + b2x^2 + … + bpx^p + e
4
Polynomial Models with One Predictor
Variable
• First order model (p = 1)
y = b0 + b1x + e
• Second order model (p = 2)
y = b0 + b1x + b2x^2 + e
(Figure: the parabola opens downward when b2 < 0 and upward when b2 > 0.)
5
Polynomial Models with One Predictor
Variable
• Third order model (p = 3)
y = b0 + b1x + b2x^2 + b3x^3 + e
(Figure: cubic curves, one for b3 < 0 and one for b3 > 0.)
6
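To make the polynomial models concrete, here is a minimal Python sketch (an added illustration, not part of the original slides; the data are simulated and all names are hypothetical) that fits first-, second-, and third-order models by least squares:

```python
import numpy as np

# Simulated data from a quadratic relationship (hypothetical example)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 5 + 2 * x - 0.3 * x**2 + rng.normal(0, 1, 50)

# Fit polynomial models of increasing order by ordinary least squares.
# np.polyfit returns coefficients from the highest power down to b0.
for p in (1, 2, 3):
    coefs = np.polyfit(x, y, deg=p)
    residuals = y - np.polyval(coefs, x)
    sse = float(np.sum(residuals**2))
    print(f"order {p}: coefficients={np.round(coefs, 3)}, SSE={sse:.1f}")
```

The sum of squared errors always drops as the order grows; the later slides on model selection address how to avoid overfitting by such blind order-raising.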
Polynomial Models with Two Predictor Variables
• First order model
y = b0 + b1x1 + b2x2 + e
(Figures: first-order response surfaces, planes in (x1, x2, y) space.)
7
Polynomial Models with Two Predictor Variables
• First order model
y = b0 + b1x1 + b2x2 + e
The effect of one predictor variable on y
is independent of the effect of the other
predictor variable on y.
(Figure: parallel lines of y versus x1 for X2 = 1, 2, 3.)
• First order model, two predictors, and interaction
y = b0 + b1x1 + b2x2 + b3x1x2 + e
The two variables interact to affect the value of y.
(Figure: non-parallel lines of y versus x1 for X2 = 1, 2, 3.)
8
Polynomial Models with Two Predictor
Variables
• Second order model
y = b0 + b1x1 + b2x2 + b3x1^2 + b4x2^2 + e
Holding X2 fixed gives a quadratic in x1; for example:
X2 = 3: y = [b0 + b2(3) + b4(3^2)] + b1x1 + b3x1^2 + e
X2 = 2: y = [b0 + b2(2) + b4(2^2)] + b1x1 + b3x1^2 + e
X2 = 1: y = [b0 + b2(1) + b4(1^2)] + b1x1 + b3x1^2 + e
• Second order model with interaction
y = b0 + b1x1 + b2x2 + b3x1^2 + b4x2^2 + b5x1x2 + e
(Figures: the curves for X2 = 1, 2, 3 are parallel without the interaction term and non-parallel with it.)
9
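A sketch of how the two-predictor second-order model with interaction can be estimated (an added illustration with simulated data; statsmodels is assumed to be available and all variable names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data with two predictors
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.uniform(0, 5, 80), "x2": rng.uniform(0, 5, 80)})
df["y"] = (1 + 2 * df.x1 - 0.5 * df.x1**2 + df.x2 - 0.3 * df.x2**2
           + 0.4 * df.x1 * df.x2 + rng.normal(0, 1, 80))

# Second-order model with interaction:
#   y = b0 + b1*x1 + b2*x2 + b3*x1^2 + b4*x2^2 + b5*x1*x2 + e
model = smf.ols("y ~ x1 + x2 + I(x1**2) + I(x2**2) + x1:x2", data=df).fit()
print(model.params.round(3))
```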
Selecting a Model
• Several models have been introduced.
• How do we select the right model?
• Selecting a model:
– Use your knowledge of the problem (variables
involved and the nature of the relationship between
them) to select a model.
– Test the model using statistical techniques.
10
Selecting a Model; Example
• Example: The location of a new restaurant
– A fast food restaurant chain tries to identify new
locations that are likely to be profitable.
– The primary market for such restaurants is middle-income adults and their children (between the ages of 5 and 12).
– Which regression model should be proposed to
predict the profitability of new locations?
11
Selecting a Model; Example
• Solution
– The dependent variable will be Gross Revenue
– Quadratic relationships between Revenue and each predictor
variable should be observed. Why?
• Families with very young or older kids will not visit the restaurant as frequently as families with kids in the middle of that age range.
• Members of middle-class families are more likely to visit a fast food restaurant than members of poor or wealthy families.
(Figures: Revenue versus Income and Revenue versus Age; revenue rises and then falls as each variable moves from Low through Middle to High.)
12
Selecting a Model; Example
• Solution
– The quadratic regression model built is
Sales = b0 + b1(INCOME) + b2(AGE) + b3(INCOME^2) + b4(AGE^2) + b5(INCOME)(AGE) + e
Include interaction term when in doubt,
and test its relevance later.
SALES = annual gross sales
INCOME = median annual household income in the
neighborhood
AGE = mean age of children in the neighborhood
13
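Assuming the Xm9-01 workbook has been exported to CSV with columns Revenue, Income, and Age (the file name and column names are assumptions), the proposed model could be fit in Python as a sketch like this:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed CSV export of the Xm9-01 data
df = pd.read_csv("Xm9-01.csv")

# Quadratic model with interaction, matching the slide:
# Sales = b0 + b1*INCOME + b2*AGE + b3*INCOME^2 + b4*AGE^2 + b5*INCOME*AGE + e
model = smf.ols(
    "Revenue ~ Income + Age + I(Income**2) + I(Age**2) + Income:Age",
    data=df,
).fit()
print(model.summary())
```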
Selecting a Model; Example
To verify the validity of the proposed model for
recommending the location of a new fast food
restaurant, 25 areas with fast food restaurants
were randomly selected.
– Each area included one of the firm’s restaurants and three competing restaurants.
– Data collected included (Xm9-01.xls):
• Previous year’s annual gross sales.
• Mean annual household income.
• Mean age of children.
14
Selecting a Model; Example
Xm9-01.xls (first rows shown; Revenue, Income, and Age are the collected data, the remaining columns are added)

Revenue   Income   Age    Income sq   Age sq    (Income)(Age)
1128      23.5     10.5   552.25      110.25    246.75
1005      17.6     7.2    309.76      51.84     126.72
1212      26.3     7.6    691.69      57.76     199.88
.         .        .      .           .         .
15
The Quadratic Relationships – Graphical Illustration
(Figures: scatter plots of REVENUE vs. AGE and REVENUE vs. INCOME; both patterns rise and then fall, consistent with quadratic relationships.)
16
Model Validation
Regression Statistics
Multiple R          0.9521
R Square            0.9065
Adjusted R Square   0.8819
Standard Error      44.70
Observations        25

ANOVA
            df   SS       MS      F       Significance F
Regression   5   368140   73628   36.86   0.0000
Residual    19    37956    1998
Total       24   406096

                Coefficients   Standard Error   t Stat   P-value
Intercept       -1134          320.0            -3.54    0.0022
Income           173.2          28.20            6.14    0.0000
Age               23.55         32.23            0.73    0.4739
Income sq         -3.73          0.54           -6.87    0.0000
Age sq            -3.87          1.18           -3.28    0.0039
(Income)(Age)      1.97          0.94            2.08    0.0509

This is a valid model that can be used to make predictions. But…
17
Model Validation: Reducing Multicollinearity
The model can be used to make predictions…
…but multicollinearity is a problem!!
The t-tests may be distorted; therefore, do not interpret the coefficients or test them.
In Excel: Tools > Data Analysis > Correlation

             INCOME    AGE      INC sq   AGE sq   INC x AGE
INCOME       1
AGE          0.0201    1
INC sq       0.9945   -0.045    1
AGE sq      -0.042     0.9845   -0.099   1
INC x AGE    0.4596    0.8861    0.3968   0.8405   1
18
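The same diagnostic can be reproduced outside Excel. A sketch, assuming the CSV export of Xm9-01 described earlier (file and column names are assumptions):

```python
import pandas as pd

# Assumed CSV export of the Xm9-01 data with columns Income and Age
df = pd.read_csv("Xm9-01.csv")
X = pd.DataFrame({
    "Income": df["Income"],
    "Age": df["Age"],
    "Income_sq": df["Income"] ** 2,
    "Age_sq": df["Age"] ** 2,
    "Income_x_Age": df["Income"] * df["Age"],
})
# High correlations between a variable and its own square signal
# the multicollinearity seen on the slide.
print(X.corr().round(4))
```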
Nominal Independent Variables
• In many real-life situations one or more
independent variables are nominal.
• Including nominal variables in a regression
analysis model is done via indicator variables.
• An indicator variable (I) can assume one of two values, zero or one:
I = 1 if a first condition out of two is met
I = 0 if the second condition out of two is met
For example:
I = 1 if the degree earned is in Finance; I = 0 if the degree earned is not in Finance
I = 1 if the temperature was below 50°; I = 0 if the temperature was 50° or more
I = 1 if the data were collected before 1980; I = 0 if the data were collected after 1980
19
Nominal Independent Variables;
Example: Auction Price of Cars
A car dealer wants to predict the auction price of a
car. Xm9-02a_supp
– The dealer now believes that the odometer reading and the car’s color are variables that affect a car’s price.
– Three color categories are considered:
• White
• Silver
• Other colors
Note: Color is a
nominal variable.
20
Nominal Independent Variables;
Example: Auction Price of Cars
• data - revised (Xm9-02b_supp)
I1 = 1 if the color is white; 0 if the color is not white
I2 = 1 if the color is silver; 0 if the color is not silver
The category “Other colors” is defined by:
I1 = 0; I2 = 0
21
How Many Indicator Variables?
• Note: To represent the situation of three possible
colors we need only two indicator variables.
• Conclusion: To represent a nominal variable
with m possible categories, we must create m-1
indicator variables.
22
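A sketch of the m − 1 rule in code (an added illustration; pandas is assumed and the data are hypothetical):

```python
import pandas as pd

# Hypothetical color data; "Other" is the omitted base category.
df = pd.DataFrame({"Color": ["White", "White", "Other", "Other", "Silver", "Silver"]})

# pd.get_dummies would create m = 3 indicators; to represent a nominal
# variable with m categories we keep only m - 1 of them.
dummies = pd.get_dummies(df["Color"])[["White", "Silver"]].astype(int)
dummies.columns = ["I1", "I2"]   # I1 = white, I2 = silver; Other: I1 = I2 = 0
print(dummies)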
Nominal Independent Variables;
Example: Auction Car Price
• Solution
– the proposed model is
y = b0 + b1(Odometer) + b2I1 + b3I2 + e
– The data:

Price    Odometer   I-1   I-2
14636    37388      1     0    (white car)
14122    44758      1     0    (white car)
14016    45833      0     0    (other color)
15590    30862      0     0    (other color)
15568    31705      0     1    (silver car)
14718    34010      0     1    (silver car)
.        .          .     .
23
Example: Auction Car Price
The Regression Equation
From Excel we get the regression equation
PRICE = 16701 - .0555(Odometer) + 90.48(I-1) + 295.48(I-2)
For one additional mile the auction price decreases by 5.55 cents.
A white car sells, on average, for $90.48 more than a car in the “Other colors” category.
A silver car sells, on average, for $295.48 more than a car in the “Other colors” category.
24
Example: Auction Car Price
The Regression Equation
From Excel (Xm9-02b_supp) we get the regression equation
PRICE = 16701 - .0555(Odometer) + 90.48(I-1) + 295.48(I-2)
Silver cars (I-1 = 0, I-2 = 1): Price = 16701 - .0555(Odometer) + 90.48(0) + 295.48(1)
White cars (I-1 = 1, I-2 = 0): Price = 16701 - .0555(Odometer) + 90.48(1) + 295.48(0)
Other colors (I-1 = 0, I-2 = 0): Price = 16701 - .0555(Odometer) + 90.48(0) + 295.48(0)
(Figure: three parallel lines of Price versus Odometer.)
25
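A small sketch that evaluates the fitted equation for each color category (an added illustration; the 40,000-mile figure is only an example input):

```python
# Predicted auction price from the fitted equation on the slide.
def predicted_price(odometer: float, white: int, silver: int) -> float:
    """PRICE = 16701 - .0555(Odometer) + 90.48(I-1) + 295.48(I-2)"""
    return 16701 - 0.0555 * odometer + 90.48 * white + 295.48 * silver

# Three cars with the same mileage differ only by the color premiums:
for label, i1, i2 in [("white", 1, 0), ("silver", 0, 1), ("other", 0, 0)]:
    print(label, round(predicted_price(40000, i1, i2), 2))
```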
Example: Auction Car Price
The Regression Equation
SUMMARY OUTPUT (Xm9-02b_supp)

Regression Statistics
Multiple R          0.8355
R Square            0.6980
Adjusted R Square   0.6886
Standard Error      284.5
Observations        100

ANOVA
            df   SS         MS        F       Significance F
Regression   3   17966997   5988999   73.97   0.0000
Residual    96    7772564     80964
Total       99   25739561

           Coefficients   Standard Error   t Stat   P-value
Intercept   16701          184.33           90.60   0.0000
Odometer   -0.0555          0.0047         -11.72   0.0000
I-1         90.48          68.17             1.33   0.1876
I-2        295.48          76.37             3.87   0.0002

There is insufficient evidence to infer that a white car and a car of “other color” sell for different auction prices.
There is sufficient evidence to infer that a silver car sells for a higher price than a car in the “other color” category.
26
Nominal Independent Variables;
Example: MBA Program Admission
(MBA II)
• The Dean wants to evaluate applications for the MBA
program by predicting future performance of the applicants.
• The following three predictors were suggested:
– Undergraduate GPA
– GMAT score
– Years of work experience
Note: The undergraduate
degree is nominal data.
• It is now believed that the type of undergraduate degree
should be included in the model.
27
Nominal Independent Variables;
Example: MBA Program Admission
I1 = 1 if B.A.; 0 otherwise
I2 = 1 if B.B.A.; 0 otherwise
I3 = 1 if B.Sc. or B.Eng.; 0 otherwise
The category “Other group” is defined by:
I1 = 0; I2 = 0; I3 = 0
28
Nominal Independent Variables;
Example: MBA Program Admission
SUMMARY OUTPUT (MBA-II)

Regression Statistics
Multiple R          0.7461
R Square            0.5566
Adjusted R Square   0.5242
Standard Error      0.729
Observations        89

ANOVA
            df   SS      MS     F       Significance F
Regression   6   54.75   9.13   17.16   0.0000
Residual    82   43.62   0.53
Total       88   98.37

           Coefficients   Standard Error   t Stat   P-value
Intercept    0.19          1.41             0.13    0.8930
UnderGPA    -0.0061        0.114           -0.05    0.9577
GMAT         0.0128        0.0014           9.43    0.0000
Work         0.098         0.030            3.24    0.0017
I-1         -0.34          0.22            -1.54    0.1269
I-2          0.71          0.24             2.93    0.0043
I-3          0.03          0.21             0.17    0.8684
29
Applications in Human Resources
Management: Pay-Equity
• Pay-equity can be handled in two different forms:
– Equal pay for equal work
– Equal pay for work of equal value.
• Regression analysis is extensively employed in
cases of equal pay for equal work.
30
Human Resources Management:
Pay-Equity
• Example (Xm9-03_supp)
– Is there sex discrimination against female managers
in a large firm?
– A random sample of 100 managers was selected
and data were collected as follows:
• Annual salary
• Years of education
• Years of experience
• Gender
31
Human Resources Management:
Pay-Equity
• Solution
– Construct the following multiple regression model:
y = b0 + b1Education + b2Experience + b3Gender + e
– Note the nature of the variables:
• Education – Interval
• Experience – Interval
• Gender – Nominal (Gender = 1 if male; 0 otherwise).
32
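A sketch of this model in code (assuming a CSV export of Xm9-03 with columns Salary, Education, Experience, and Gender; the file and column names are assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed CSV export of the Xm9-03 data
df = pd.read_csv("Xm9-03.csv")

model = smf.ols("Salary ~ Education + Experience + Gender", data=df).fit()
# The p-value on Gender tests for a salary difference between men and
# women after accounting for education and experience.
print(model.params["Gender"], model.pvalues["Gender"])
```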
Human Resources Management:
Pay-Equity
• Solution – Continued (Xm9-03)
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.8326
R Square            0.6932
Adjusted R Square   0.6836
Standard Error      16274
Observations        100

ANOVA
            df   SS            MS            F       Significance F
Regression   3   57434095083   19144698361   72.29   0.0000
Residual    96   25424794888     264841613
Total       99   82858889971

            Coefficients   Standard Error   t Stat   P-value
Intercept    -5835.1        16082.8          -0.36   0.7175
Education     2118.9         1018.5           2.08   0.0401
Experience    4099.3          317.2          12.92   0.0000
Gender        1851.0         3703.1           0.50   0.6183

Analysis and Interpretation
• The model fits the data quite well.
• The model is very useful.
• Experience is a variable strongly related to salary.
• There is no evidence of sex discrimination.
33
Human Resources Management:
Pay-Equity
• Solution – Continued (Xm9-03)
(Regression output as on the previous slide.)

Analysis and Interpretation
• Studying the data further, we find:
Average experience (years) for women is 12; average experience (years) for men is 17.
• Average salary for a female manager is $76,189; average salary for a male manager is $97,832.
34
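A quick worked check (added here, using only figures from the slides): the raw salary gap is $97,832 − $76,189 = $21,643, while the experience coefficient alone predicts a gap of 4099.3 × (17 − 12) ≈ $20,497. Nearly all of the observed difference is accounted for by the five-year difference in average experience, consistent with the insignificant Gender coefficient.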
Stepwise Regression
• Multicollinearity may prevent the study of the
relationship between dependent and independent
variables.
• The correlation matrix may fail to detect multicollinearity
because variables may relate to one another in various
ways.
• To reduce multicollinearity we can use stepwise
regression.
• In stepwise regression variables are added to or deleted
from the model one at a time, based on their
contribution to the current model.
35
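The slides describe the idea only. Below is a minimal sketch of one common variant, forward selection with a p-value entry threshold; it is an illustrative implementation, not the exact algorithm of any particular package, and the simulated data and names are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(y, X, p_enter=0.05):
    """Add, one at a time, the candidate variable with the smallest
    p-value, stopping when no remaining variable is significant."""
    selected = []
    while True:
        remaining = [c for c in X.columns if c not in selected]
        if not remaining:
            break
        pvals = {}
        for c in remaining:
            Xc = sm.add_constant(X[selected + [c]])
            pvals[c] = sm.OLS(y, Xc).fit().pvalues[c]
        best = min(pvals, key=pvals.get)
        if pvals[best] > p_enter:
            break
        selected.append(best)
    return selected

# Illustrative use with simulated data:
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
y = 2 * X["a"] - 1.5 * X["c"] + rng.normal(size=100)
print(forward_stepwise(y, X))   # typically ['a', 'c']
```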
Model Building
• Identify the dependent variable, and clearly define it.
• List potential predictors.
– Bear in mind the problem of multicollinearity.
– Consider the cost of gathering, processing and
storing data.
– Be selective in your choice (try to use as few
variables as possible).
36
• Gather the required observations (have at least
six observations for each independent variable).
• Identify several possible models.
– A scatter diagram of the dependent variable against each predictor can be helpful in formulating the right model.
– If you are uncertain, start with first order and second
order models, with and without interaction.
– Try other relationships (transformations) if the
polynomial models fail to provide a good fit.
• Use statistical software to estimate the model.
37
• Determine whether the required conditions are
satisfied. If not, attempt to correct the problem.
• Select the best model.
– Use the statistical output.
– Use your judgment!!
38
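To make the “identify several possible models” step concrete, here is a sketch that compares candidate models by adjusted R² (assuming, as before, a CSV export of Xm9-01 with columns Revenue, Income, and Age; these names are assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed CSV export of the Xm9-01 data
df = pd.read_csv("Xm9-01.csv")

# First- and second-order models, with and without interaction,
# following the guideline on the slide.
candidates = {
    "first order": "Revenue ~ Income + Age",
    "first order + interaction": "Revenue ~ Income * Age",
    "second order": "Revenue ~ Income + Age + I(Income**2) + I(Age**2)",
    "second order + interaction":
        "Revenue ~ Income + Age + I(Income**2) + I(Age**2) + Income:Age",
}
for name, formula in candidates.items():
    fit = smf.ols(formula, data=df).fit()
    print(f"{name}: adjusted R^2 = {fit.rsquared_adj:.4f}")
```

Adjusted R² is only one input to the choice; as the slide says, the required conditions and your own judgment matter as well.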