Lecture 12 - University of Pennsylvania

Stat 112: Lecture 17 Notes
• Chapter 6.8: Assessing the Assumption
that the Disturbances are Independent
• Chapter 7.1: Using and Interpreting
Indicator Variables.
Time Series Data and
Autocorrelation
• When Y is a variable collected for the same entity
(person, state, country) over time, we call the data time
series data.
• For time series data, we need to consider the
independence assumption for the simple and multiple
regression model.
• Independence Assumption: The residuals are
independent of one another. This means that if the
residual is positive this year, next year's residual must be
equally likely to be positive or negative, i.e.,
there is no autocorrelation.
• Positive autocorrelation: Positive residuals are more
likely to be followed by positive residuals than by
negative residuals.
• Negative autocorrelation: Positive residuals are more
likely to be followed by negative residuals than by
positive residuals.
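The two definitions above can be illustrated with a small sketch. It uses one common definition of the lag-1 autocorrelation of residuals (sum of products of consecutive residuals divided by the sum of squared residuals). The function name and the residual values are made up for illustration and are not from the ski data.

```python
def lag1_autocorrelation(residuals):
    """Lag-1 autocorrelation of time-ordered residuals.

    Assumes the residuals average to zero, as least-squares residuals
    do when the model includes an intercept.
    """
    cross = sum(prev * cur for prev, cur in zip(residuals, residuals[1:]))
    total = sum(e * e for e in residuals)
    return cross / total

# Made-up residuals, for illustration only.
persistent = [2, 3, 1, -1, -3, -2]    # signs tend to repeat
alternating = [2, -3, 1, -1, 3, -2]   # signs tend to flip

print(lag1_autocorrelation(persistent))   # positive value
print(lag1_autocorrelation(alternating))  # negative value
```

A positive result corresponds to positive autocorrelation and a negative result to negative autocorrelation.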
Ski Ticket Sales
• Christmas Week is a critical period for most ski resorts.
• A ski resort in Vermont wanted to
determine the effect that weather
had on its sale of lift tickets during
Christmas week.
• Data from past 20 years.
Yi= lift tickets during Christmas
week in year i
Xi1=snowfall during Christmas week in year i
Xi2= average temperature during Christmas week in year
i.
Data in skitickets.JMP
Response Tickets
Parameter Estimates
Term          Estimate    Std Error   t Ratio   Prob>|t|
Intercept     8308.0114   903.7285    9.19      <.0001
Snowfall      74.593249   51.57483    1.45      0.1663
Temperature   -8.753738   19.70436    -0.44     0.6625
[Figure: Bivariate Fit of Residual Tickets By Year (Residual Tickets vs. Year). Residuals suggest positive autocorrelation.]
Durbin-Watson Test of
Independence
• The Durbin-Watson test is a test of whether the residuals are
independent.
• The null hypothesis is that the residuals are independent and the
alternative hypothesis is that the residuals are autocorrelated
(either positively or negatively).
• The test works by computing the correlation of consecutive
residuals.
• To compute the Durbin-Watson test in JMP, after Fit Model, click the red
triangle next to Response, click Row Diagnostics and click Durbin-Watson
Test. Then click the red triangle next to Durbin-Watson to get the p-value.
• For ski ticket data,
Durbin-Watson
Durbin-Watson   Number of Obs.   AutoCorrelation   Prob<DW
0.5931403       20               0.5914            0.0002
p-value = 0.0002. Strong evidence of autocorrelation.
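For reference, the Durbin-Watson statistic itself is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. Values near 2 are consistent with no lag-1 autocorrelation, values near 0 with positive autocorrelation, and values near 4 with negative autocorrelation. The sketch below computes only the statistic, on made-up residuals. It does not reproduce JMP's p-value calculation, and the function name is an assumption for illustration.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for time-ordered residuals."""
    numerator = sum((cur - prev) ** 2
                    for prev, cur in zip(residuals, residuals[1:]))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# Made-up residuals, for illustration only (not the ski ticket residuals).
persistent = [2, 3, 1, -1, -3, -2]    # neighbours are similar
alternating = [2, -3, 1, -1, 3, -2]   # neighbours have opposite signs

print(durbin_watson(persistent))   # well below 2
print(durbin_watson(alternating))  # well above 2
```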
Remedies for Autocorrelation
• Add time variable to the regression.
• Add lagged dependent (Y) variable to the
regression. We can do this by creating a new
column and right clicking, then clicking Formula,
clicking Row and clicking Lag and then clicking
the Y variables.
• After adding these variables, refit the model and
then recheck the Durbin-Watson statistic to see
if autocorrelation has been removed.
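The Lag formula described above shifts the Y column down by one row, so that each row also carries the previous period's value. A minimal sketch of that idea follows. The function name `lag` and the sales figures are hypothetical, and this is not JMP's implementation.

```python
def lag(values, k=1):
    """Shift a time-ordered column down by k rows.

    The first k rows have no earlier value, so they are filled with None.
    """
    return [None] * k + list(values[:len(values) - k])

sales = [100, 120, 130, 150]   # made-up values, for illustration only
lagged_sales = lag(sales)

# Rows with a missing lagged value cannot be used when fitting the model.
usable_rows = [(y, y_prev) for y, y_prev in zip(sales, lagged_sales)
               if y_prev is not None]

print(lagged_sales)
print(len(usable_rows))
```

Because the first row has no lagged value, a regression that includes the lagged variable is fit on one fewer observation than the original regression.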
Response Tickets
Parameter Estimates
Term          Estimate    Std Error   t Ratio   Prob>|t|
Intercept     5965.5876   631.2518    9.45      <.0001
Snowfall      70.183059   28.85142    2.43      0.0271
Temperature   -9.232802   11.01971    -0.84     0.4145
Year          229.96997   37.13209    6.19      <.0001
Durbin-Watson
Durbin-Watson   Number of Obs.   AutoCorrelation   Prob<DW
1.8849875       20               0.0405            0.3512
[Figure: Bivariate Fit of Residual Tickets 3 By Year (Residual Tickets 3 vs. Year)]
No evidence of autocorrelation once Year has been added as an
explanatory variable.
Example 6.10 in book
Response SALES
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   -632.6945   47.27697    -13.38    <.0001
ADV         0.1772326   0.007045    25.16     <.0001
Durbin-Watson
Durbin-Watson   Number of Obs.   AutoCorrelation   Prob<DW
0.4672937       36               0.7091            <.0001
Strong evidence of autocorrelation.
[Figure: Bivariate Fit of Residual SALES By Year (Residual SALES vs. Year, 1965 to 2005)]
Response SALES
Parameter Estimates
Term           Estimate    Std Error   t Ratio   Prob>|t|
Intercept      -234.4752   78.06875    -3.00     0.0051
ADV            0.0630703   0.020228    3.12      0.0038
Lagged Sales   0.6751139   0.112302    6.01      <.0001
Durbin-Watson
Durbin-Watson   Number of Obs.   AutoCorrelation
2.3330219       35               -0.2063
Adding Lagged Sales removes the autocorrelation.
[Figure: Bivariate Fit of Residual SALES 2 By Year (Residual SALES 2 vs. Year, 1965 to 2005)]
Categorical variables
• Categorical (nominal) variables: Variables that
define group membership, e.g., sex
(male/female), color (blue/green/red), county
(Bucks County, Chester County, Delaware
County, Philadelphia County).
• How to use categorical variables as explanatory
variables in regression analysis:
– If the variable has two categories (e.g., sex
(male/female), rain or not rain, snow or not snow), we
define a variable that equals 1 for one of the
categories and 0 for the other category.
Predicting Emergency Calls to the
AAA Club
Rain forecast=1 if rain is in
forecast, 0 if not
Snow forecast=1 if snow is in
forecast, 0 if not
Weekday=1 if weekday, 0 if
not
Response Calls
Summary of Fit
RSquare                      0.692384
RSquare Adj                  0.584719
Root Mean Square Error       1735.151
Mean of Response             4318.75
Observations (or Sum Wgts)   28
Parameter Estimates
Term                  Estimate    Std Error   t Ratio   Prob>|t|
Intercept             3628.7902   2153.788    1.68      0.1076
Average Temperature   -35.63182   51.52383    -0.69     0.4972
Range                 133.30434   50.85675    2.62      0.0164
Rain forecast         429.70588   1211.933    0.35      0.7266
Snow forecast         548.80038   1342.27     0.41      0.6870
Weekday               -1603.1     876.7378    -1.83     0.0824
Sunday                -1847.152   1212.612    -1.52     0.1433
Subzero               3857.6004   1489.803    2.59      0.0175
Comparing Toy Factory Managers
• An analysis has shown that the time
required to complete a production run in a
toy factory increases with the number of
toys produced. Data were collected for
the time required to process 20 randomly
selected production runs as supervised by
three managers (A, B and C). Data in
toyfactorymanager.JMP.
• How do the managers compare?
Marginal Comparison
[Figure: Oneway Analysis of Time for Run By Manager (Time for Run vs. Manager a, b, c)]
• Marginal comparison could be misleading. We know
that large production runs with more toys take longer
than small runs with few toys. How can we be sure that
Manager c has not simply been supervising very small
production runs?
• Solution: Run a multiple regression in which we include
size of the production run as an explanatory variable
along with manager, in order to control for size of the
production run.
Including Categorical Variable in
Multiple Regression: Wrong
Approach
• We could assign codes to the managers, e.g., Manager
A=0, Manager B=1, Manager C=2.
Parameter Estimates
Term            Estimate    Std Error   t Ratio   Prob>|t|
Intercept       211.92804   7.212609    29.38     <.0001
Run Size        0.2233844   0.029184    7.65      <.0001
Managernumber   -31.03612   3.056054    -10.16    <.0001
• This model says that for the same run size, Manager B is
31 minutes faster than Manager A and Manager C is 31
minutes faster than Manager B.
• This model restricts the difference between Manager A
and B to be the same as the difference between
Manager B and C – we have no reason to do this.
• If we use a different coding for Manager, we get different
results, e.g., Manager B=0, Manager A=1, Manager C=2
Parameter Estimates
Term             Estimate    Std Error   t Ratio   Prob>|t|
Intercept        188.63636   12.73082    14.82     <.0001
Run Size         0.2103122   0.048921    4.30      <.0001
Managernumber2   -5.008207   5.122956    -0.98     0.3324
Under this coding, Manager A is 5 min. faster than Manager B.
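The problem with numeric codes can be seen by working out what each fitted model implies about the managers at a fixed run size. The sketch below multiplies each manager's code by the fitted coefficient from the two outputs above. The function name is an assumption for illustration.

```python
def implied_offsets(codes, coefficient):
    """Contribution of the coded manager variable to predicted run time."""
    return {manager: code * coefficient for manager, code in codes.items()}

# Coding 1: A=0, B=1, C=2 with the fitted Managernumber coefficient.
first = implied_offsets({"A": 0, "B": 1, "C": 2}, -31.03612)
# Coding 2: B=0, A=1, C=2 with the fitted Managernumber2 coefficient.
second = implied_offsets({"B": 0, "A": 1, "C": 2}, -5.008207)

# Estimated difference (B minus A) at the same run size under each coding.
print(first["B"] - first["A"])    # negative: B faster than A
print(second["B"] - second["A"])  # positive: B slower than A

# Coding 1 forces equal spacing between A, B and C.
print(first["B"] - first["A"], first["C"] - first["B"])
```

The two codings disagree about which of Managers A and B is faster, even though they describe the same data, and each forces the managers to be equally spaced in the order of their codes.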
Including Categorical Variable in
Multiple Regression: Right
Approach
• Create an indicator (dummy) variable for
each category.
• Manager[a] = 1 if Manager is A
0 if Manager is not A
• Manager[b] = 1 if Manager is B
0 if Manager is not B
• Manager[c] = 1 if Manager is C
0 if Manager is not C
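As a minimal sketch of the definitions above, the following builds one 0/1 indicator column per category from a categorical column. The function name and the sample manager labels are made up for illustration. JMP builds these columns automatically when the variable is nominal.

```python
def indicator_columns(values):
    """Return one 0/1 column per distinct category in values."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values]
            for level in levels}

managers = ["a", "b", "c", "a", "c"]   # made-up labels, for illustration
columns = indicator_columns(managers)

for level, column in columns.items():
    print(f"Manager[{level}]", column)
```

Each row has exactly one indicator equal to 1, the one for that row's manager.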
Response Time for Run
Expanded Estimates
Nominal factors expanded to all levels
Term         Estimate    Std Error   t Ratio   Prob>|t|
Intercept    176.70882   5.658644    31.23     <.0001
Run Size     0.243369    0.025076    9.71      <.0001
Manager[a]   38.409663   3.005923    12.78     <.0001
Manager[b]   -14.65115   3.031379    -4.83     <.0001
Manager[c]   -23.75851   2.995898    -7.93     <.0001
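• For a run size of 100, the estimated run times for
Managers A, B and C are
$\hat{E}(\text{Time} \mid \text{Run size} = 100, \text{Manager a}) = 176.71 + 0.24 \cdot 100 + 38.41 \cdot 1 - 14.65 \cdot 0 - 23.76 \cdot 0 = 239.12$
$\hat{E}(\text{Time} \mid \text{Run size} = 100, \text{Manager b}) = 176.71 + 0.24 \cdot 100 + 38.41 \cdot 0 - 14.65 \cdot 1 - 23.76 \cdot 0 = 186.06$
$\hat{E}(\text{Time} \mid \text{Run size} = 100, \text{Manager c}) = 176.71 + 0.24 \cdot 100 + 38.41 \cdot 0 - 14.65 \cdot 0 - 23.76 \cdot 1 = 176.95$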
• For the same run size, Manager A is estimated to be on
average 38.41-(-14.65)=53.06 minutes slower than
Manager B and
38.41-(-23.76)=62.17 minutes slower than Manager C.
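The arithmetic above can be checked with a short sketch that uses the rounded coefficients from the slide (176.71, 0.24, 38.41, -14.65, -23.76). The constant and function names are assumptions for illustration.

```python
# Rounded coefficients from the Expanded Estimates output.
INTERCEPT = 176.71
RUN_SIZE_SLOPE = 0.24
MANAGER_EFFECT = {"a": 38.41, "b": -14.65, "c": -23.76}

def predicted_time(run_size, manager):
    """Estimated mean run time; only the chosen manager's indicator is 1."""
    return INTERCEPT + RUN_SIZE_SLOPE * run_size + MANAGER_EFFECT[manager]

for manager in "abc":
    print(manager, round(predicted_time(100, manager), 2))

# Differences between managers at the same run size.
print(round(predicted_time(100, "a") - predicted_time(100, "b"), 2))
print(round(predicted_time(100, "a") - predicted_time(100, "c"), 2))
```

The run size term cancels in the differences, so they depend only on the manager coefficients.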
Categorical Variables in Multiple
Regression in JMP
• Make sure that the categorical variable is coded
as nominal. To change coding, right click on the
column of the variable, click Column Info and
change Modeling Type to nominal.
• Use Fit Model and include the categorical
variable in the multiple regression.
• After Fit Model, click red triangle next to
Response and click Estimates, then Expanded
Estimates (the initial output in JMP uses a
different, more confusing coding of the dummy
variables).
Equivalence of Using One 0/1 Dummy
Variable and Two 0/1 Dummy Variables
when Categorical Variable has two
categories
Parameter Estimates
Term                  Estimate    Std Error   t Ratio   Prob>|t|
Intercept             3628.7902   2153.788    1.68      0.1076
Average Temperature   -35.63182   51.52383    -0.69     0.4972
Range                 133.30434   50.85675    2.62      0.0164
Rain forecast         429.70588   1211.933    0.35      0.7266
Snow forecast         548.80038   1342.27     0.41      0.6870
Weekday               -1603.1     876.7378    -1.83     0.0824
Sunday                -1847.152   1212.612    -1.52     0.1433
Subzero               3857.6004   1489.803    2.59      0.0175
Expanded Estimates
Nominal factors expanded to all levels
Term                  Estimate
Intercept             4321.7173
Average Temperature   -35.63182
Range                 133.30434
Rain forecast[0]      -214.8529
Rain forecast[1]      214.85294
Snow forecast[0]      -274.4002
Snow forecast[1]      274.40019
Weekday[0]            801.55002
Weekday[1]            -801.55
Sunday[0]             923.57625
Sunday[1]             -923.5762
Subzero[0]            -1928.8
Subzero[1]            1928.8002
Two models give equivalent predictions. The difference in mean number of
Emergency calls between a day with a rain forecast and a day without a rain forecast
holding all other variables fixed is 429.71=214.85-(-214.85).
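The equivalence can be verified numerically. In the sketch below, each 0/1 dummy coefficient is recovered as the level-1 estimate minus the level-0 estimate, and the dummy-coded intercept as the expanded intercept plus all of the level-0 estimates (the prediction when every indicator is 0). The estimates are copied from the two outputs above, and the function name is an assumption for illustration.

```python
# Expanded Estimates: (level 0 estimate, level 1 estimate) for each factor.
expanded_intercept = 4321.7173
expanded_effects = {
    "Rain forecast": (-214.8529, 214.85294),
    "Snow forecast": (-274.4002, 274.40019),
    "Weekday": (801.55002, -801.55),
    "Sunday": (923.57625, -923.5762),
    "Subzero": (-1928.8, 1928.8002),
}

def to_dummy_coding(intercept, effects):
    """Convert two-level expanded estimates to 0/1 dummy coefficients."""
    dummy_intercept = intercept + sum(level0 for level0, _ in effects.values())
    dummy_coefs = {name: level1 - level0
                   for name, (level0, level1) in effects.items()}
    return dummy_intercept, dummy_coefs

dummy_intercept, dummy_coefs = to_dummy_coding(expanded_intercept,
                                               expanded_effects)
print(round(dummy_intercept, 2))
for name, coef in dummy_coefs.items():
    print(name, round(coef, 2))
```

Up to rounding in the printed output, the results match the Parameter Estimates table from the fit with one 0/1 variable per factor.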
Effect Tests
Source     Nparm   DF   Sum of Squares   F Ratio   Prob > F
Run Size   1       1    25260.250        94.1906   <.0001
Manager    2       2    44773.996        83.4768   <.0001
Expanded Estimates
Nominal factors expanded to all levels
Term         Estimate    Std Error   t Ratio   Prob>|t|
Intercept    176.70882   5.658644    31.23     <.0001
Run Size     0.243369    0.025076    9.71      <.0001
Manager[a]   38.409663   3.005923    12.78     <.0001
Manager[b]   -14.65115   3.031379    -4.83     <.0001
Manager[c]   -23.75851   2.995898    -7.93     <.0001
• Effect test for Manager: $H_0: \text{Manager}[a] = \text{Manager}[b] = \text{Manager}[c]$
vs. $H_a$: not all of Manager[a], Manager[b], Manager[c] are equal. The null hypothesis is that all
managers are the same (in terms of mean run time) when run size is held fixed; the
alternative hypothesis is that not all managers are the same (in terms of mean run
time) when run size is held fixed.
• p-value for Effect Test <.0001. Strong evidence that not all managers are the same
when run size is held fixed.
• Note: $H_0: \text{Manager}[a] = \text{Manager}[b] = \text{Manager}[c]$ is equivalent to
$H_0: \text{Manager}[a] = \text{Manager}[b] = \text{Manager}[c] = 0$
because JMP imposes the constraint that Manager[a] + Manager[b] + Manager[c] = 0.
• Effect test for Run Size tests the null hypothesis that the Run Size coefficient is 0 versus the
alternative hypothesis that the Run Size coefficient is not 0. Same p-value as the t-test.
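Two consistency checks on the Effect Tests output can be sketched as follows. Each F ratio equals (Sum of Squares / DF) divided by the model's mean squared error, so both rows should imply the same mean squared error. For a one-degree-of-freedom effect such as Run Size, the F ratio equals the squared t ratio. The numbers are copied from the output above. The mean squared error is not printed there, so it is backed out here rather than quoted.

```python
# Values copied from the Effect Tests table.
effect_tests = {
    "Run Size": {"df": 1, "sum_sq": 25260.250, "f_ratio": 94.1906},
    "Manager": {"df": 2, "sum_sq": 44773.996, "f_ratio": 83.4768},
}

def implied_mse(row):
    """Solve F = (SS / DF) / MSE for the mean squared error."""
    return (row["sum_sq"] / row["df"]) / row["f_ratio"]

mse_from_run_size = implied_mse(effect_tests["Run Size"])
mse_from_manager = implied_mse(effect_tests["Manager"])
print(round(mse_from_run_size, 2), round(mse_from_manager, 2))

# t ratio for Run Size, recomputed from Estimate / Std Error.
t_run_size = 0.243369 / 0.025076
f_from_t = t_run_size ** 2
print(round(f_from_t, 2), effect_tests["Run Size"]["f_ratio"])
```

This is why the effect test and the t-test for Run Size give the same p-value.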
• The effect test shows that the managers are not equal.
• For the same run size, Manager C is best (lowest mean
run time), followed by Manager B and then Manager A.
• The above model assumes no interaction between
Manager and run size – the difference between the
mean run time of the managers is the same for all run
sizes.
Election Equation
Goal: Predict the Incumbent Party’s share of the vote
Data in Elections.JMP
Y = Incumbent party’s share of Vote
X_1 = Nominal Variable for party in power
X_2 = Economic Growth
X_3 = Inflation
X_4 = Consecutive Quarters of Good News
X_5 = Duration Value
X_6 = President Running
X_7 = War