Class 20: Thurs., Nov. 18 - University of Pennsylvania

Class 20: Thurs., Nov. 18
• Specially Constructed Explanatory Variables
– Dummy variables for categorical variables
– Interactions involving dummy variables
• I will e-mail you HW8 tomorrow. It will be due
Tuesday, Nov. 30th.
• Schedule:
– Tuesday, Nov. 23rd: One-way ANOVA
– Tuesday, Nov. 30th: Review
– Thursday, Dec. 2nd: Midterm II
– Tuesday, Dec. 7th, Thursday, Dec. 9th: Two-way ANOVA
Categorical variables
• Categorical (nominal) variables: Variables that
define group membership, e.g., sex
(male/female), color (blue/green/red), county
(Bucks County, Chester County, Delaware
County, Philadelphia County).
• How to use categorical variables as explanatory
variables in regression analysis:
– If the variable has two categories (e.g., sex
(male/female), rain or not rain, snow or not snow), we
have defined a variable that equals 1 for one of the
categories and 0 for the other category.
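A minimal sketch of this 0/1 coding for a two-category variable. The function name `to_dummy` and the five sample forecasts are hypothetical, made up for illustration, and are not part of the course data.

```python
def to_dummy(values, one_category):
    """Return 1 where a value equals one_category and 0 otherwise."""
    return [1 if value == one_category else 0 for value in values]


# Hypothetical forecasts for five days (illustration only).
forecasts = ["rain", "no rain", "rain", "no rain", "no rain"]

# Rain forecast = 1 if rain is in the forecast, 0 if not.
rain_forecast = to_dummy(forecasts, "rain")
print(rain_forecast)
```

Which category receives the 1 is a free choice; it changes the sign of the coefficient but not the fitted values.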
Predicting Emergency Calls to the AAA Club
Rain forecast = 1 if rain is in forecast, 0 if not
Snow forecast = 1 if snow is in forecast, 0 if not
Weekday = 1 if weekday, 0 if not
Response Calls
Summary of Fit
RSquare                       0.692384
RSquare Adj                   0.584719
Root Mean Square Error        1735.151
Mean of Response              4318.75
Observations (or Sum Wgts)    28
Parameter Estimates
Term                   Estimate     Std Error   t Ratio   Prob>|t|
Intercept              3628.7902    2153.788    1.68      0.1076
Average Temperature    -35.63182    51.52383    -0.69     0.4972
Range                  133.30434    50.85675    2.62      0.0164
Rain forecast          429.70588    1211.933    0.35      0.7266
Snow forecast          548.80038    1342.27     0.41      0.6870
Weekday                -1603.1      876.7378    -1.83     0.0824
Sunday                 -1847.152    1212.612    -1.52     0.1433
Subzero                3857.6004    1489.803    2.59      0.0175
Comparing Toy Factory Managers
• An analysis has shown that the time
required to complete a production run in a
toy factory increases with the number of
toys produced. Data were collected for
the time required to process 20 randomly
selected production runs as supervised by
three managers (A, B and C). Data in
toyfactorymanager.JMP.
• How do the managers compare?
Marginal Comparison
[Figure: Oneway Analysis of Time for Run By Manager. Vertical axis: Time for Run; horizontal axis: Manager (a, b, c).]
• Marginal comparison could be misleading. We
know that large production runs with more toys
take longer than small runs with few toys.
[Figure: Oneway Analysis of Run Size By Manager. Vertical axis: Run Size; horizontal axis: Manager (a, b, c).]
• How can we be sure that Manager c’s advantage is not
due to simply having supervised smaller production
runs?
• Solution: Run a multiple regression in which we include
size of the production run as an explanatory variable,
along with manager, in order to control for size of the
production run.
Including Categorical Variable in Multiple Regression: Wrong Approach
• We could assign codes to the managers, e.g., Manager A = 0, Manager B = 1, Manager C = 2.
Parameter Estimates
Term             Estimate     Std Error   t Ratio   Prob>|t|
Intercept        211.92804    7.212609    29.38     <.0001
Run Size         0.2233844    0.029184    7.65      <.0001
Managernumber    -31.03612    3.056054    -10.16    <.0001
• This model says that for the same run size, Manager B is
31 minutes faster than Manager A and Manager C is 31
minutes faster than Manager B.
• This model restricts the difference between Manager A
and B to be the same as the difference between
Manager B and C – we have no reason to do this.
• If we use a different coding for Manager, we get different
results, e.g., Manager B=0, Manager A=1, Manager C=2
Parameter Estimates
Term              Estimate     Std Error   t Ratio   Prob>|t|
Intercept         188.63636    12.73082    14.82     <.0001
Run Size          0.2103122    0.048921    4.30      <.0001
Managernumber2    -5.008207    5.122956    -0.98     0.3324
• Under this coding, Manager A is estimated to be 5 min. faster than Manager B.
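The restriction imposed by numeric codes can be made concrete with a short sketch. It uses the two Managernumber slopes reported above; the function name `implied_offsets` is hypothetical. For a fixed run size, the only thing that differs between managers is slope times code, so the spacing between managers is dictated by the arbitrary codes rather than by the data.

```python
# Slope on the numeric manager code from the first fit (A=0, B=1, C=2).
SLOPE_CODING_1 = -31.03612
# Slope on the numeric manager code from the second fit (B=0, A=1, C=2).
SLOPE_CODING_2 = -5.008207


def implied_offsets(codes, slope):
    """Minutes added for each manager, relative to the manager coded 0."""
    return {manager: slope * code for manager, code in codes.items()}


coding_1 = implied_offsets({"A": 0, "B": 1, "C": 2}, SLOPE_CODING_1)
coding_2 = implied_offsets({"B": 0, "A": 1, "C": 2}, SLOPE_CODING_2)

# Consecutive codes are forced to be equally spaced.
gap_a_to_b = coding_1["B"] - coding_1["A"]
gap_b_to_c = coding_1["C"] - coding_1["B"]
print(gap_a_to_b, gap_b_to_c)

# The estimated B-minus-A difference even changes sign between the codings.
b_minus_a_1 = coding_1["B"] - coding_1["A"]
b_minus_a_2 = coding_2["B"] - coding_2["A"]
print(b_minus_a_1, b_minus_a_2)
```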
Including Categorical Variable in
Multiple Regression: Right
Approach
• Create an indicator (dummy) variable for
each category.
• Manager[a] = 1 if Manager is A
0 if Manager is not A
• Manager[b] = 1 if Manager is B
0 if Manager is not B
• Manager[c] = 1 if Manager is C
0 if Manager is not C
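A small sketch of building one indicator column per category. The helper `indicator_columns` and the five-run sample are hypothetical; the column names simply mimic the Manager[a], Manager[b], Manager[c] labels used in these notes.

```python
def indicator_columns(values, prefix="Manager"):
    """Build one 0/1 indicator column for each distinct category."""
    categories = sorted(set(values))
    return {
        f"{prefix}[{category}]": [1 if value == category else 0 for value in values]
        for category in categories
    }


# Hypothetical supervising manager for five production runs.
managers = ["a", "b", "c", "a", "c"]
columns = indicator_columns(managers)

# Every run belongs to exactly one manager, so each row of indicators sums to 1.
row_sums = [sum(column[i] for column in columns.values()) for i in range(len(managers))]
print(columns)
print(row_sums)
```

Because the indicators always sum to 1, they are redundant with the intercept, which is why software must impose some constraint on their coefficients.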
Response Time for Run
Expanded Estimates
Nominal factors expanded to all levels
Term          Estimate     Std Error   t Ratio   Prob>|t|
Intercept     176.70882    5.658644    31.23     <.0001
Run Size      0.243369     0.025076    9.71      <.0001
Manager[a]    38.409663    3.005923    12.78     <.0001
Manager[b]    -14.65115    3.031379    -4.83     <.0001
Manager[c]    -23.75851    2.995898    -7.93     <.0001
• For a run size of 100, the estimated times for runs supervised by Managers A, B and C are
$$\hat{E}(\text{Time} \mid \text{Run Size} = 100, \text{Manager a}) = 176.71 + 0.24 \cdot 100 + 38.41 \cdot 1 - 14.65 \cdot 0 - 23.76 \cdot 0$$
$$\hat{E}(\text{Time} \mid \text{Run Size} = 100, \text{Manager b}) = 176.71 + 0.24 \cdot 100 + 38.41 \cdot 0 - 14.65 \cdot 1 - 23.76 \cdot 0$$
$$\hat{E}(\text{Time} \mid \text{Run Size} = 100, \text{Manager c}) = 176.71 + 0.24 \cdot 100 + 38.41 \cdot 0 - 14.65 \cdot 0 - 23.76 \cdot 1$$
• For the same run size, Manager A is estimated to be on average
38.41-(-14.65)=53.06 minutes slower than Manager B and
38.41-(-23.76)=62.17 minutes slower than Manager C.
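The arithmetic above can be checked with a short sketch that plugs the Expanded Estimates into the fitted equation. The function name `predicted_time` is hypothetical. The code uses the unrounded coefficients, so the predicted times differ slightly from hand calculations with 0.24.

```python
INTERCEPT = 176.70882
RUN_SIZE_COEF = 0.243369
MANAGER_COEF = {"a": 38.409663, "b": -14.65115, "c": -23.75851}


def predicted_time(run_size, manager):
    """Estimated mean time for a run of the given size and manager."""
    # All three indicators enter the equation; only the supervising
    # manager's indicator equals 1, the other two equal 0.
    manager_term = sum(
        coef * (1 if level == manager else 0) for level, coef in MANAGER_COEF.items()
    )
    return INTERCEPT + RUN_SIZE_COEF * run_size + manager_term


predictions = {manager: predicted_time(100, manager) for manager in MANAGER_COEF}
a_minus_b = predictions["a"] - predictions["b"]  # about 53.06 minutes
a_minus_c = predictions["a"] - predictions["c"]  # about 62.17 minutes
print(predictions)
print(round(a_minus_b, 2), round(a_minus_c, 2))
```

The differences between managers do not depend on the run size chosen, because this model has no interaction term.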
Categorical Variables in Multiple
Regression in JMP
• Make sure that the categorical variable is coded
as nominal. To change the coding, right click on
the column of the variable, click Column Info and
change Modeling Type to nominal.
• Use Fit Model and include the categorical
variable in the multiple regression.
• After Fit Model, click red triangle next to
Response and click Estimates, then Expanded
Estimates (the initial output in JMP uses a
different, more confusing, coding of the dummy
variables).
• The coefficients on Manager A, Manager B and Manager
C add up to zero. So the positive coefficient on Manager
A means that Manager A is slower than the average (of
Manager A, B and C) and the negative coefficients on
Manager B and Manager C mean that these two
managers are faster than the average (of Manager A, B
and C).
• The coefficients on the indicator variables will always
add up to zero in JMP.
• Caution: Different software uses different coding for
indicator variables. It doesn’t change the predictions
from the multiple regression but does change the
interpretation.
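To illustrate the caution above, the sketch below converts the sum-to-zero coefficients reported in these notes into reference-level coding with Manager A as the baseline, a common alternative in other packages. The function names are hypothetical, and the conversion is a sketch of the algebra, not output from any particular software.

```python
INTERCEPT = 176.70882
RUN_SIZE_COEF = 0.243369
EFFECT_CODED = {"a": 38.409663, "b": -14.65115, "c": -23.75851}


def to_reference_coding(intercept, effects, baseline):
    """Re-express sum-to-zero coefficients relative to a baseline category."""
    new_intercept = intercept + effects[baseline]
    new_effects = {level: coef - effects[baseline] for level, coef in effects.items()}
    return new_intercept, new_effects


def predict(intercept, effects, run_size, manager):
    """Fitted mean run time under either coding."""
    return intercept + RUN_SIZE_COEF * run_size + effects[manager]


coefficient_sum = sum(EFFECT_CODED.values())  # zero up to rounding
ref_intercept, ref_effects = to_reference_coding(INTERCEPT, EFFECT_CODED, "a")

print(coefficient_sum)
print(ref_intercept, ref_effects)
for manager in EFFECT_CODED:
    print(
        manager,
        predict(INTERCEPT, EFFECT_CODED, 100, manager),
        predict(ref_intercept, ref_effects, 100, manager),
    )
```

Under the baseline coding, a manager's coefficient reads as "difference from Manager A" rather than "difference from the average of the three managers", while the predictions are identical.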
Equivalence of Using One 0/1 Dummy
Variable and Two 0/1 Dummy Variables
when Categorical Variable has two
categories
Parameter Estimates
Term                   Estimate     Std Error   t Ratio   Prob>|t|
Intercept              3628.7902    2153.788    1.68      0.1076
Average Temperature    -35.63182    51.52383    -0.69     0.4972
Range                  133.30434    50.85675    2.62      0.0164
Rain forecast          429.70588    1211.933    0.35      0.7266
Snow forecast          548.80038    1342.27     0.41      0.6870
Weekday                -1603.1      876.7378    -1.83     0.0824
Sunday                 -1847.152    1212.612    -1.52     0.1433
Subzero                3857.6004    1489.803    2.59      0.0175
Expanded Estimates
Nominal factors expanded to all levels
Term                   Estimate
Intercept              4321.7173
Average Temperature    -35.63182
Range                  133.30434
Rain forecast[0]       -214.8529
Rain forecast[1]       214.85294
Snow forecast[0]       -274.4002
Snow forecast[1]       274.40019
Weekday[0]             801.55002
Weekday[1]             -801.55
Sunday[0]              923.57625
Sunday[1]              -923.5762
Subzero[0]             -1928.8
Subzero[1]             1928.8002
The two models give equivalent predictions. The difference in the mean number of
emergency calls between a day with a rain forecast and a day without a rain forecast,
holding all other variables fixed, is 214.85-(-214.85)=429.71, which matches the Rain
forecast coefficient in the model with one 0/1 dummy variable.
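The equivalence can be verified for every dummy variable at once. In the sketch below, the helper `collapse` is hypothetical: it turns each pair of expanded coefficients into a single 0/1 coefficient (level 1 minus level 0) and folds all the level-0 coefficients into the intercept. Small discrepancies are expected because the printed estimates are rounded.

```python
EXPANDED_INTERCEPT = 4321.7173
# (coefficient for level 0, coefficient for level 1) from the Expanded Estimates.
EXPANDED = {
    "Rain forecast": (-214.8529, 214.85294),
    "Snow forecast": (-274.4002, 274.40019),
    "Weekday": (801.55002, -801.55),
    "Sunday": (923.57625, -923.5762),
    "Subzero": (-1928.8, 1928.8002),
}

# Coefficients from the Parameter Estimates (one 0/1 dummy per variable).
ONE_DUMMY_INTERCEPT = 3628.7902
ONE_DUMMY = {
    "Rain forecast": 429.70588,
    "Snow forecast": 548.80038,
    "Weekday": -1603.1,
    "Sunday": -1847.152,
    "Subzero": 3857.6004,
}


def collapse(expanded_intercept, expanded):
    """Convert two-indicator coefficients to single 0/1 dummy coefficients."""
    intercept = expanded_intercept + sum(level0 for level0, _ in expanded.values())
    coefs = {name: level1 - level0 for name, (level0, level1) in expanded.items()}
    return intercept, coefs


collapsed_intercept, collapsed = collapse(EXPANDED_INTERCEPT, EXPANDED)
print(round(collapsed_intercept, 3), ONE_DUMMY_INTERCEPT)
for name, coef in collapsed.items():
    print(name, round(coef, 3), ONE_DUMMY[name])
```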
Effect Tests
Source      Nparm   DF   Sum of Squares   F Ratio    Prob > F
Run Size    1       1    25260.250        94.1906    <.0001
Manager     2       2    44773.996        83.4768    <.0001
• Effect test for Manager: H0: Manager[a] = Manager[b] = Manager[c]
vs. Ha: not all of Manager[a], Manager[b], Manager[c] are equal. The null hypothesis is that all
managers are the same (in terms of mean run time) when run size is held fixed; the
alternative hypothesis is that not all managers are the same (in terms of mean run
time) when run size is held fixed.
• p-value for Effect Test <.0001. Strong evidence that not all managers are the same
when run size is held fixed.
• Note: H0: Manager[a] = Manager[b] = Manager[c] is equivalent to
H0: Manager[a] = Manager[b] = Manager[c] = 0
because JMP imposes the constraint that Manager[a]+Manager[b]+Manager[c]=0.
• Effect test for Run Size tests the null hypothesis that the Run Size coefficient is 0 versus the
alternative hypothesis that the Run Size coefficient is not 0. It has the same p-value as the t-test.
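A rough check of how the Effect Tests numbers fit together, assuming the usual definition F Ratio = (Sum of Squares / DF) / mean squared error. The mean squared error is not printed in these notes, so the sketch backs it out from the Run Size row and then reproduces the Manager F ratio. It also shows that, for a one-degree-of-freedom effect, the F ratio is the square of the t ratio, which is why the two p-values agree.

```python
# Rows of the Effect Tests table: sum of squares, DF, F ratio.
SS_RUN_SIZE, DF_RUN_SIZE, F_RUN_SIZE = 25260.250, 1, 94.1906
SS_MANAGER, DF_MANAGER, F_MANAGER = 44773.996, 2, 83.4768

# Run Size row of the Expanded Estimates table.
RUN_SIZE_ESTIMATE, RUN_SIZE_STD_ERROR = 0.243369, 0.025076

# Back out the mean squared error from the Run Size row (an inferred value).
mse = (SS_RUN_SIZE / DF_RUN_SIZE) / F_RUN_SIZE

# Recompute the Manager F ratio from its sum of squares and DF.
f_manager = (SS_MANAGER / DF_MANAGER) / mse

# For a 1-DF effect, the F ratio equals the squared t ratio.
t_run_size = RUN_SIZE_ESTIMATE / RUN_SIZE_STD_ERROR
t_squared = t_run_size ** 2

print(round(mse, 2), round(f_manager, 4), round(t_squared, 2))
```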
• The effect test shows that the managers are not equal.
• For the same run size, Manager C is best (lowest mean
run time), followed by Manager B and then Manager A.
• The above model assumes no interaction between
Manager and run size – the difference between the
mean run time of the managers is the same for all run
sizes.
Interaction Model
Response Time for Run
Expanded Estimates
Nominal factors expanded to all levels
Term                             Estimate     Std Error   t Ratio   Prob>|t|
Intercept                        179.59191    5.619643    31.96     <.0001
Run Size                         0.2344284    0.024708    9.49      <.0001
Manager[a]                       38.188168    2.900342    13.17     <.0001
Manager[b]                       -13.5381     2.936288    -4.61     <.0001
Manager[c]                       -24.65007    2.887839    -8.54     <.0001
Manager[a]*(Run Size-209.317)    0.0728366    0.035263    2.07      0.0437
Manager[b]*(Run Size-209.317)    -0.097651    0.037178    -2.63     0.0112
Manager[c]*(Run Size-209.317)    0.0248147    0.032207    0.77      0.4444
$$\hat{E}(\text{Time for Run} \mid \text{Run Size} = x, \text{Manager A}) = 179.59 + 0.234x + 38.188 \cdot 1 - 13.538 \cdot 0 - 24.651 \cdot 0 + 0.073 \cdot 1 \cdot (x - 209.317) - 0.098 \cdot 0 \cdot (x - 209.317) + 0.025 \cdot 0 \cdot (x - 209.317)$$
$$= 179.59 + 0.234x + 38.188 + 0.073(x - 209.317)$$
$$\hat{E}(\text{Time for Run} \mid \text{Run Size} = x, \text{Manager A}) = (179.59 + 38.188 - 0.073 \cdot 209.317) + (0.234 + 0.073)x$$
$$\hat{E}(\text{Time for Run} \mid \text{Run Size} = x, \text{Manager B}) = (179.59 - 13.538 + 0.098 \cdot 209.317) + (0.234 - 0.098)x$$
$$\hat{E}(\text{Time for Run} \mid \text{Run Size} = x, \text{Manager C}) = (179.59 - 24.651 - 0.025 \cdot 209.317) + (0.234 + 0.025)x$$
Interaction Model in JMP
• To add interactions involving categorical
variables in JMP, follow the same procedure as
with two continuous variables. Run Fit Model in
JMP, add the usual explanatory variables first,
then highlight one of the variables in the
interaction in the Construct Model Effects box
and highlight the other variable in the interaction
in the Columns box and then click Cross in the
Construct Model Effects box.
Interaction Model
• Interaction between run size and Manager: The effect on
mean run time of increasing run size by one is different
for different managers.
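$$\hat{E}(\text{Time for Run} \mid \text{Run Size} = x+1, \text{Manager A}) - \hat{E}(\text{Time for Run} \mid \text{Run Size} = x, \text{Manager A}) = 0.234 + 0.073 = 0.307$$
$$\hat{E}(\text{Time for Run} \mid \text{Run Size} = x+1, \text{Manager B}) - \hat{E}(\text{Time for Run} \mid \text{Run Size} = x, \text{Manager B}) = 0.234 - 0.098 = 0.136$$
$$\hat{E}(\text{Time for Run} \mid \text{Run Size} = x+1, \text{Manager C}) - \hat{E}(\text{Time for Run} \mid \text{Run Size} = x, \text{Manager C}) = 0.234 + 0.025 = 0.259$$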
• Effect Test for Interaction:
Effect Tests
Source              Nparm   DF   Sum of Squares   F Ratio    Prob > F
Run Size            1       1    22070.614        90.0192    <.0001
Manager             2       2    43981.452        89.6934    <.0001
Manager*Run Size    2       2    1778.661         3.6273     0.0333
• Manager*Run Size Effect test tests null hypothesis that
there is no interaction (effect on mean run time of
increasing run size is same for all managers) vs.
alternative hypothesis that there is an interaction
between run size and managers. p-value =0.0333.
Evidence that there is an interaction.
$$\hat{E}(\text{Time for Run} \mid \text{Run Size} = x, \text{Manager A}) = 202.498 + 0.307x$$
$$\hat{E}(\text{Time for Run} \mid \text{Run Size} = x, \text{Manager B}) = 186.565 + 0.136x$$
$$\hat{E}(\text{Time for Run} \mid \text{Run Size} = x, \text{Manager C}) = 149.706 + 0.259x$$
• The runs supervised by Manager A appear
abnormally time consuming. Manager b
has higher initial fixed setup costs than
Manager c (186.565>149.706) but has
lower per unit production time
(0.136<0.259).
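Since Manager b has the higher intercept but the lower slope, the two fitted lines must cross at some run size. The sketch below takes the three fitted lines quoted above and solves for that crossover. The function names `crossover` and `fitted_time` are hypothetical, and the crossover is an illustrative consequence of the fitted lines, not a result stated in the notes.

```python
# (intercept, slope) of the fitted line for each manager, as quoted above.
LINES = {
    "a": (202.498, 0.307),
    "b": (186.565, 0.136),
    "c": (149.706, 0.259),
}


def fitted_time(manager, run_size):
    """Estimated mean run time from the manager's fitted line."""
    intercept, slope = LINES[manager]
    return intercept + slope * run_size


def crossover(line_1, line_2):
    """Run size at which two lines give the same time (None if parallel)."""
    (intercept_1, slope_1), (intercept_2, slope_2) = line_1, line_2
    if slope_1 == slope_2:
        return None
    # intercept_1 + slope_1 * x = intercept_2 + slope_2 * x, solved for x.
    return (intercept_2 - intercept_1) / (slope_1 - slope_2)


b_c_crossover = crossover(LINES["b"], LINES["c"])
print(round(b_c_crossover, 1))
print(fitted_time("b", 100), fitted_time("c", 100))
print(fitted_time("b", 350), fitted_time("c", 350))
```

Below the crossover run size Manager c's estimated time is lower; above it Manager b's is lower.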
Interaction Profile Plot
[Figure: interaction profile plots of Time for Run against Run Size and against Manager (a, b, c).]
Lower left hand plot shows mean time for run vs. run size for the three managers
a, b and c.
Interactions Involving Categorical
Variables: General Approach
• First fit model with an interaction between
categorical explanatory variable and continuous
explanatory variable. Use effect test on
interaction to see if there is evidence of an
interaction.
• If there is evidence of an interaction (p-value
<0.05 for effect test), use interaction model.
• If there is not strong evidence of an interaction
(p-value >0.05 for effect test), use model without
interactions.
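The decision rule above can be written as a one-line function. The name `choose_model` and the default cutoff argument `alpha` are hypothetical; the two p-values are the interaction effect-test p-values reported in these notes for the toy factory and bank examples.

```python
def choose_model(interaction_p_value, alpha=0.05):
    """Pick a model based on the effect test for the interaction term."""
    if interaction_p_value < alpha:
        return "interaction model"
    return "model without interactions"


# Toy factory: Manager*Run Size effect test, p-value = 0.0333.
print(choose_model(0.0333))
# Bank salaries: SEX*EDUC effect test, p-value = 0.2503.
print(choose_model(0.2503))
```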
Example: A Sex Discrimination
Lawsuit
• Did a bank discriminatorily pay higher starting
salaries to men than to women? Harris Trust and
Savings Bank was sued by a group of female
employees who accused the bank of paying
lower starting salaries to women. The data in
harrisbank.JMP are the starting salaries for all
32 male and all 61 female skilled, entry-level
clerical employees hired by the bank between
1969 and 1977, as well as the education levels
and sex of the employees.
Expanded Estimates
Nominal factors expanded to all levels
Term                           Estimate     Std Error   t Ratio
Intercept                      4257.893     422.8744    10.07
EDUC                           98.923456    31.79614    3.11
SEX[FEMALE]                    -322.6792    68.97647    -4.68
SEX[MALE]                      322.67916    68.97647    4.68
SEX[FEMALE]*(EDUC-12.5054)     -36.7929     31.79614    -1.16
SEX[MALE]*(EDUC-12.5054)       36.792897    31.79614    1.16
Effect Tests
Source      Nparm   DF   Sum of Squares   F Ratio    Prob > F
EDUC        1       1    3159890.9        9.6794     0.0025
SEX         1       1    7144342.4        21.8847    <.0001
SEX*EDUC    1       1    437120.8         1.3390     0.2503
• No evidence of an interaction between
Sex and Education. Fit model without
interactions.
Discrimination Case Regression Results
Expanded Estimates
Nominal factors expanded to all levels
Term           Estimate     Std Error   t Ratio   Prob>|t|   Lower 95%    Upper 95%
Intercept      4519.0292    358.2969    12.61     <.0001     3807.2099    5230.84
EDUC           80.697765    27.67291    2.92      0.0045     25.720708    135.674
SEX[FEMALE]    -345.9041    66.11594    -5.23     <.0001     -477.255     -214.55
SEX[MALE]      345.90413    66.11594    5.23      <.0001     214.55328    477.254
Effect Tests
Source    Nparm   DF   Sum of Squares   F Ratio    Prob > F
EDUC      1       1    2786560.8        8.5038     0.0045
SEX       1       1    8969209.7        27.3715    <.0001
• Strong evidence that there is a difference in the mean starting
salaries of women and men of the same education level.
• Estimated difference: Men have 345.904+345.904=$691.81 higher
mean starting salaries than women of the same education level.
• 95% confidence interval for mean difference =
(2*$214.55,2*$477.25)=($429.10,$954.50).
• Bank’s defense: Omitted variable bias. Variables such as Seniority,
Age, Experience also need to be controlled for.
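The doubling used for the male-female difference follows from the sum-to-zero coding: SEX[MALE] = -SEX[FEMALE], so the difference equals 2 * SEX[MALE], and a confidence interval for twice a coefficient is twice the interval for the coefficient. A short sketch of that arithmetic, using the estimate and 95% limits from the output (the upper limit is taken as the mirror image of the SEX[FEMALE] lower limit, -477.255); the variable names are hypothetical:

```python
# SEX[MALE] estimate and 95% confidence limits from the Expanded Estimates.
SEX_MALE_ESTIMATE = 345.90413
SEX_MALE_LOWER_95 = 214.55328
# Mirror image of the SEX[FEMALE] lower limit, -477.255.
SEX_MALE_UPPER_95 = 477.255

# Difference between men and women at the same education level:
# SEX[MALE] - SEX[FEMALE] = 2 * SEX[MALE] under sum-to-zero coding.
estimated_difference = 2 * SEX_MALE_ESTIMATE

# Doubling a coefficient doubles both ends of its confidence interval.
difference_ci = (2 * SEX_MALE_LOWER_95, 2 * SEX_MALE_UPPER_95)

print(round(estimated_difference, 2))
print(tuple(round(limit, 2) for limit in difference_ci))
```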