Lecture 26 - University of Pennsylvania


Lecture 26
• Model Building (Chapters 20.2-20.3)
• HW6 due Wednesday, April 23rd by 5 p.m.
• Problem 3(d): Use JMP to calculate the prediction interval rather than computing it by hand.
Curvature: Midterm Problem 10
[Figure: Bivariate Fit of MPG City By Weight(lb) with fitted line and residual plot; the residuals show a curved pattern, indicating that a straight-line model is inadequate.]
Remedy I: Transformations
• Use Tukey’s Bulging Rule to choose a
transformation.
[Figure: Bivariate Fit of 1/MPGCity By Weight(lb) with residual plot; after the reciprocal transformation the relationship is roughly linear and the residuals show no curved pattern.]
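A minimal sketch of this remedy in Python with statsmodels, assuming the car data have been exported to a CSV; the file name and column names (MPG_City, Weight) are placeholders, not part of the course materials:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical export of the JMP data; columns MPG_City and Weight are assumed.
cars = pd.read_csv("cars.csv")

# Straight-line fit: the residuals against Weight show curvature.
linear = smf.ols("MPG_City ~ Weight", data=cars).fit()

# Remedy I: transform the response (here 1/MPG, per Tukey's Bulging Rule)
# and refit; the residual plot should now be patternless.
recip = smf.ols("I(1 / MPG_City) ~ Weight", data=cars).fit()
print(recip.summary())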
Remedy II: Polynomial Models
Multiple regression model: y = b0 + b1x1 + b2x2 + … + bpxp + e
Polynomial model (one predictor): y = b0 + b1x + b2x^2 + … + bpx^p + e
Quadratic Regression
[Figure: Bivariate Fit of MPG City By Weight(lb) with a quadratic fit and residual plot.]

Parameter Estimates
Term                     Estimate    Std Error
Intercept                40.166608   0.902231
Weight(lb)               -0.006894   0.00032
(Weight(lb)-2809.5)^2    0.000003    4.634e-7
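A sketch of the same quadratic fit in Python, again assuming a hypothetical CSV export; centering the squared term at the mean weight, as JMP does with its (Weight(lb)-2809.5)^2 term, reduces the correlation between the linear and squared predictors:

import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("cars.csv")  # hypothetical export; columns MPG_City, Weight

# Center the squared term at the sample mean weight, mirroring JMP's
# (Weight - 2809.5)^2 construction.
cars["WeightC2"] = (cars["Weight"] - cars["Weight"].mean()) ** 2
quad = smf.ols("MPG_City ~ Weight + WeightC2", data=cars).fit()
print(quad.params)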
Polynomial Models with One Predictor
Variable
• First order model (p = 1)
y = b0 + b1x + e
• Second order model (p = 2)
y = b0 + b1x + b2x^2 + e
[Figure: the quadratic curve opens downward when b2 < 0 and upward when b2 > 0.]
Polynomial Models with One Predictor
Variable
• Third order model (p = 3)
y = b0 + b1x + b2x^2 + b3x^3 + e
[Figure: cubic curves for b3 < 0 and b3 > 0.]
Interaction
• Two independent variables x1 and x2
interact if the effect of x1 on y is influenced
by the value of x2.
• Interaction can be brought into the multiple linear regression model by including the product variable x1*x2.
• Example:
Income-hat = 1000 + 2000*Educ + 100*IQ + 10*IQ*Educ
Interaction Cont.
• y = b0 + b1x1 + b2x2 + b3x1x2
• “Slope” for x1 = E(y | x1+1, x2) - E(y | x1, x2) = b1 + b3x2
• Income-hat = 1000 + 2000*Educ + 100*IQ + 10*IQ*Educ
• Is the expected income increase from an
extra year of education higher for people
with IQ 100 or with IQ 130 (or is it the
same)?
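Working this out from the slope formula above: the expected income gain from one extra year of education is 2000 + 10*IQ, i.e., 2000 + 10(100) = 3000 at IQ 100 and 2000 + 10(130) = 3300 at IQ 130, so the return to education is higher for the higher-IQ group.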
Polynomial Models with Two Predictor
Variables
• First order model
y = b0 + b1x1 + b2x2 + e
The effect of one predictor variable on y is independent of the effect of the other predictor variable on y.
[Figure: y against x1 for X2 = 1, 2, 3 (three parallel lines).]
• First order model, two predictors, and interaction
y = b0 + b1x1 + b2x2 + b3x1x2 + e
The two variables interact to affect the value of y.
[Figure: y against x1 for X2 = 1, 2, 3 (the lines are no longer parallel).]
Polynomial Models with Two Predictor
Variables
Second order model
y = b0 + b1x1 + b2x2 + b3x1^2 + b4x2^2 + e
[Figure: conditional curves of y against x1 for X2 = 1, 2, 3:]
X2 = 3: y = [b0 + b2(3) + b4(3^2)] + b1x1 + b3x1^2 + e
X2 = 2: y = [b0 + b2(2) + b4(2^2)] + b1x1 + b3x1^2 + e
X2 = 1: y = [b0 + b2(1) + b4(1^2)] + b1x1 + b3x1^2 + e
Second order model with interaction
y = b0 + b1x1 + b2x2 + b3x1^2 + b4x2^2 + b5x1x2 + e
[Figure: the conditional curves for X2 = 1, 2, 3 are no longer parallel.]
Selecting a Model
• Several models have been introduced.
• How do we select the right model?
• Selecting a model:
– Use your knowledge of the problem (variables
involved and the nature of the relationship
between them) to select a model.
– Test the model using statistical techniques.
Selecting a Model; Example
• Example 20.1 The location of a new
restaurant
– A fast food restaurant chain tries to identify
new locations that are likely to be profitable.
– The primary market for such restaurants is middle-income adults and their children (between the ages of 5 and 12).
– Which regression model should be proposed to
predict the profitability of new locations?
Selecting a Model; Example
• Solution
– The dependent variable will be Gross Revenue
– Quadratic relationships between Revenue and each predictor variable are expected. Why?
• Families with very young or older kids will not visit the restaurant as frequently as families with kids in the middle age range.
• Members of middle-class families are more likely to visit a fast food restaurant than members of poor or wealthy families.
[Figures: Revenue against Income (Low, Middle, High) and Revenue against age (Low, Middle, High), each an inverted U.]
Selecting a Model; Example
• Solution
– The quadratic regression model built is
Sales = b0 + b1INCOME + b2AGE + b3INCOME^2 + b4AGE^2 + b5(INCOME)(AGE) + e
Include the interaction term when in doubt, and test its relevance later.
SALES = annual gross sales
INCOME = median annual household income in the
neighborhood
AGE = mean age of children in the neighborhood
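A sketch of this model in Python with statsmodels, assuming Xm20-02 has been exported to a CSV with columns Revenue, Income, and Age (the file name is a placeholder); patsy's I() and : operators build the squared and interaction terms without adding columns by hand:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("Xm20-02.csv")  # hypothetical export; columns: Revenue, Income, Age

# Quadratic terms via I(...), interaction via Income:Age.
model = smf.ols(
    "Revenue ~ Income + Age + I(Income**2) + I(Age**2) + Income:Age",
    data=df,
).fit()
print(model.summary())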
Selecting a Model; Example
• Example 20.2
– To verify the validity of the model proposed in example
20.1 for recommending the location of a new fast food
restaurant, 25 areas with fast food restaurants were
randomly selected.
– Each area included one of the firm’s and three
competing restaurants.
– Data collected included (Xm20-02.jmp):
• Previous year’s annual gross sales.
• Mean annual household income.
• Mean age of children.
Selecting a Model; Example
Xm20-02 (Revenue, Income, and Age are collected data; the squared and product columns are added):

Revenue   Income   Age    Income sq   Age sq   (Income)(Age)
1128      23.5     10.5   552.25      110.25   246.75
1005      17.6     7.2    309.76      51.84    126.72
1212      26.3     7.6    691.69      57.76    199.88
...       ...      ...    ...         ...      ...
Quadratic Relationships –
Graphical Illustration
[Figures: Bivariate Fit of Revenue By Income and Bivariate Fit of Revenue By Age; both scatterplots show a quadratic relationship.]
Model Validation
Summary of Fit
RSquare                       0.906535
RSquare Adj                   0.881939
Root Mean Square Error        44.69533
Mean of Response              1085.56
Observations (or Sum Wgts)    25

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio
Model      5    368140.38        73628.1       36.8569
Error      19   37955.78         1997.7        Prob > F
C. Total   24   406096.16                      <.0001

Parameter Estimates
Term            Estimate    Std Error   t Ratio   Prob>|t|
Intercept       -1133.981   320.0193    -3.54     0.0022
Income          173.20317   28.20399    6.14      <.0001
Age             23.549963   32.23447    0.73      0.4739
Income sq       -3.726129   0.542156    -6.87     <.0001
Age sq          -3.868707   1.179054    -3.28     0.0039
(Income)(Age)   1.9672682   0.944082    2.08      0.0509
20.3 Nominal Independent Variables
• In many real-life situations one or more independent
variables are nominal.
• Including nominal variables in a regression analysis
model is done via indicator (or dummy) variables.
• An indicator variable (I) can assume one of two values, “zero” or “one”. For example:
I = 1 if the temperature was below 50°; 0 if the temperature was 50° or more
I = 1 if data were collected before 1980; 0 if data were collected after 1980
I = 1 if a degree earned is in Finance; 0 if a degree earned is not in Finance
Nominal Independent Variables;
Example: Auction Car Price (II)
• Example 18.2 - revised (Xm18-02a)
– Recall: A car dealer wants to predict the auction
price of a car.
– The dealer now believes that odometer reading and car color are variables that affect a car’s price.
– Three color categories are considered:
• White
• Silver
• Other colors
Note: Color is a
nominal variable.
Nominal Independent Variables;
Example: Auction Car Price (II)
• Example 18.2 - revised (Xm18-02b)
I1 = 1 if the color is white; 0 if the color is not white
I2 = 1 if the color is silver; 0 if the color is not silver
The category “Other colors” is defined by:
I1 = 0; I2 = 0
How Many Indicator Variables?
• Note: To represent the situation of three possible
colors we need only two indicator variables.
• Conclusion: To represent a nominal variable with
m possible categories, we must create m-1
indicator variables.
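This m-1 rule is what statistical software applies automatically when it dummy-codes a nominal variable; a small illustration in Python with pandas (the data below are made up):

import pandas as pd

# Toy color column with m = 3 categories.
df = pd.DataFrame({"Color": ["white", "silver", "other", "white", "other"]})

# drop_first=True keeps m - 1 = 2 indicators; the dropped (alphabetically
# first) category "other" becomes the baseline with both indicators = 0.
dummies = pd.get_dummies(df["Color"], prefix="I", drop_first=True)
print(dummies)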
Nominal Independent Variables;
Example: Auction Car Price
• Solution
– the proposed model is
y = b0 + b1(Odometer) + b2I1 + b3I2 + e
– The data
Price    Odometer   I-1   I-2
14636    37388      1     0     (white car)
14122    44758      1     0     (white car)
14016    45833      0     0     (other color)
15590    30862      0     0     (other color)
15568    31705      0     1     (silver)
14718    34010      0     1     (silver)
...      ...        ...   ...
Example: Auction Car Price
The Regression Equation
From JMP (Xm18-02b) we get the regression equation
PRICE = 16701 - .0555(Odometer) + 90.48(I-1) + 295.48(I-2)
[Figure: Price against Odometer, showing three parallel lines:]
Silver (I-2 = 1): Price = 16701 - .0555(Odometer) + 90.48(0) + 295.48(1)
White (I-1 = 1): Price = 16701 - .0555(Odometer) + 90.48(1) + 295.48(0)
Other colors: Price = 16701 - .0555(Odometer) + 90.48(0) + 295.48(0)
Example: Auction Car Price
The Regression Equation
From JMP we get the regression equation
PRICE = 16701 - .0555(Odometer) + 90.48(I-1) + 295.48(I-2)
For one additional mile the
auction price decreases by
5.55 cents.
A white car sells, on average, for $90.48 more than a car of the “Other color” category.
A silver car sells, on average, for $295.48 more than a car of the “Other color” category.
Example: Auction Car Price
The Regression Equation
SUMMARY OUTPUT (Xm18-02b)

Regression Statistics
Multiple R          0.8355
R Square            0.6980
Adjusted R Square   0.6886
Standard Error      284.5
Observations        100

There is insufficient evidence to infer that a white car and a car of the “other color” category sell for different auction prices. There is sufficient evidence to infer that a silver car sells for a higher price than a car of the “other color” category.

ANOVA
             df   SS         MS        F       Significance F
Regression   3    17966997   5988999   73.97   0.0000
Residual     96   7772564    80964
Total        99   25739561

             Coefficients   Standard Error   t Stat    P-value
Intercept    16701          184.3330576      90.60     0.0000
Odometer     -0.0555        0.0047           -11.72    0.0000
I-1          90.48          68.17            1.33      0.1876
I-2          295.48         76.37            3.87      0.0002
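A sketch of the same regression in Python, assuming Xm18-02b is exported to a CSV with columns Price, Odometer, and Color (names are placeholders); patsy's C() builds the two indicators, with "other" forced to be the baseline category:

import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("Xm18-02b.csv")  # hypothetical export

# Treatment(reference="other") makes "other" the baseline, so the white
# and silver coefficients are price premiums over "other", as on the slide.
fit = smf.ols(
    'Price ~ Odometer + C(Color, Treatment(reference="other"))',
    data=cars,
).fit()
print(fit.params)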
Nominal Independent Variables;
Example: MBA Program Admission
(MBA II)
• Recall: The Dean wanted to evaluate applications for
the MBA program by predicting future performance of
the applicants.
• The following three predictors were suggested:
– Undergraduate GPA
– GMAT score
– Years of work experience
Note: The undergraduate
degree is nominal data.
• It is now believed that the type of undergraduate degree
should be included in the model.
Nominal Independent Variables;
Example: MBA Program Admission
(II)
I1 = 1 if B.A.; 0 otherwise
I2 = 1 if B.B.A.; 0 otherwise
I3 = 1 if B.Sc. or B.Eng.; 0 otherwise
The category “Other group” is defined by:
I1 = 0; I2 = 0; I3 = 0
MBA Program Admission (II)
Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio
Model      6    54.751842        9.12531       17.1554
Error      82   43.617378        0.53192       Prob > F
C. Total   88   98.369220                      <.0001

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   0.2886998   1.396475    0.21      0.8367
UnderGPA    -0.006059   0.113968    -0.05     0.9577
GMAT        0.0127928   0.001356    9.43      <.0001
Work        0.0981817   0.030323    3.24      0.0017
Degree[1]   -0.443872   0.146288    -3.03     0.0032
Degree[2]   0.6068391   0.160425    3.78      0.0003
Degree[3]   -0.064081   0.138484    -0.46     0.6448
20.4 Applications in Human
Resources Management: Pay-Equity
• Pay-equity can be handled in two different
forms:
– Equal pay for equal work
– Equal pay for work of equal value.
• Regression analysis is extensively
employed in cases of equal pay for equal
work.
Human Resources Management:
Pay-Equity
• Example 20.3 (Xm20-03)
– Is there sex discrimination against female
managers in a large firm?
– A random sample of 100 managers was selected
and data were collected as follows:
• Annual salary
• Years of education
• Years of experience
• Gender
Human Resources Management:
Pay-Equity
• Solution
– Construct the following multiple regression model:
y = b0 + b1Education + b2Experience + b3Gender + e
– Note the nature of the variables:
• Education – Interval
• Experience – Interval
• Gender – Nominal (Gender = 1 if male; =0 otherwise).
Human Resources Management:
Pay-Equity
• Solution – Continued (Xm20-03)
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.8326
R Square            0.6932
Adjusted R Square   0.6836
Standard Error      16274
Observations        100
Analysis and Interpretation
• The model fits the data quite well.
• The model is very useful.
• Experience is a variable strongly related to salary.
• There is no evidence of sex discrimination.
ANOVA
             df   SS            MS            F       Significance F
Regression   3    57434095083   19144698361   72.29   0.0000
Residual     96   25424794888   264841613.4
Total        99   82858889971

             Coefficients   Standard Error   t Stat   P-value
Intercept    -5835.1        16082.8          -0.36    0.7175
Education    2118.9         1018.5           2.08     0.0401
Experience   4099.3         317.2            12.92    0.0000
Gender       1851.0         3703.1           0.50     0.6183
Human Resources Management:
Pay-Equity
• Solution – Continued (Xm20-03)
(The regression output repeats the previous slide.)

Analysis and Interpretation
• Further studying the data we find:
– Average experience (years) for women is 12.
– Average experience (years) for men is 17.
• Average salary for a female manager is $76,189.
• Average salary for a male manager is $97,832.
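Putting the two slides together: the raw salary gap is $97,832 - $76,189 = $21,643, yet the Gender coefficient is only $1,851 and is not significant (p-value = 0.6183). The experience difference accounts for most of the gap: 5 extra years at roughly $4,099 per year is about $20,500, which is why the regression finds no evidence of discrimination.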
20.5 Stepwise Regression
• Purposes of stepwise regression:
– Find strong predictors (stepwise forward)
– Eliminate weak predictors (stepwise backward)
– Prevent highly collinear groups of predictors from collectively entering the model (they degrade p-values)
• The workings of stepwise regression (a forward-selection sketch follows this list):
– Predictors are entered/removed one at a time
– Stepwise forward: given the current model, enter the predictor that increases R^2 the most, provided its p-value < 0.25
– Stepwise backward: given the current model, remove the predictor that decreases R^2 the least, provided its p-value > 0.10
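A minimal sketch of the forward step in Python with statsmodels; it mimics the rule above (enter the predictor that raises R^2 the most, if its p-value is below 0.25) and is not JMP's exact implementation:

import pandas as pd
import statsmodels.api as sm

def forward_stepwise(X: pd.DataFrame, y: pd.Series, p_enter: float = 0.25) -> list:
    # Greedy forward selection: repeatedly add the candidate predictor
    # that increases R^2 the most, as long as its p-value is below p_enter.
    selected, remaining = [], list(X.columns)
    while remaining:
        best_col, best_fit = None, None
        for col in remaining:
            fit = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
            if best_fit is None or fit.rsquared > best_fit.rsquared:
                best_col, best_fit = col, fit
        if best_fit.pvalues[best_col] >= p_enter:
            break  # no remaining predictor qualifies; stop
        selected.append(best_col)
        remaining.remove(best_col)
    return selected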
Stepwise Regression in JMP
• “Analyze” → “Fit Model”
• response → Y, predictors → “add”
• pull-down menu top right: “Standard Least Squares” → “Stepwise”; “Run Model”
• Stepwise Fit window: updates automagically
– manual stepwise: check boxes in “Entered” column to enter and remove predictors
– stepwise forward/backward: “Step” to enter/remove one predictor, “Go” for automatic sequential selection
– “Direction” pull-down for “forward” (default), “backward”, “mixed” selection strategies
Comments on Stepwise Regression
• Stepwise regression might not find the best model; you may find better models with a manual search, where “better” means fewer predictors and larger R^2.
• Forward search stops when no remaining predictor has p-value < 0.25 (can be changed in JMP).
• Backward search stops when no included predictor has p-value > 0.10 (can be changed in JMP).
• Often one wants to search only models with certain
predictors included. Use “Lock” column in JMP.
Practice Problems
• 20.6, 20.8, 20.22, 20.24