Lecture 12 - University of Pennsylvania

Stat 112: Lecture 15 Notes
• Finish Chapter 6:
– Review on Checking Assumptions (Section
6.4-6.6)
– Outliers and Influential Points (Section 6.7)
• Homework 4 is due this Thursday.
• Please let me know of any ideas you want
to discuss for the final project.
Review of Checking and Remedying
Assumptions
1. Linearity: E(Y | X1, …, XK) = β0 + β1X1 + … + βKXK.
Check the residual-by-predicted plot and the residual plot for each variable for a pattern in the mean of the residuals.
Remedies: transformations and polynomials. To see if a remedy works, check the new residual plots for a pattern in the mean of the residuals.
2. Constant variance: The standard deviation of Y for the subpopulation of units with X1 = x1, …, XK = xK is the same for all subpopulations.
Check the residual-by-predicted plot for a pattern in the spread of the residuals.
Remedies: transformation of Y. To see if a remedy works, check the residual-by-predicted plot for the transformed-Y regression.
3. Normality: The distribution of Y for the subpopulation of units with X1 = x1, …, XK = xK is normally distributed for all subpopulations.
Check the histogram of the residuals for a bell-shaped distribution and the normal quantile plot of the residuals for an approximately straight line.
Remedies: transformation of Y. To see if a remedy works, check the histogram and normal quantile plot of the transformed-Y regression residuals.
Checking whether a transformation
of Y works for remedying Nonconstant variance
1. Create a new column with the transformation
of the Y variable: right-click in the new
column, click Formula, and enter the
appropriate formula for the transformation.
(Note: Log is found in the class of
transcendental functions.)
2. Fit the regression of the transformation of Y on
the X variables
3. Check the residual by predicted plot to see if
the spread of the residuals appears constant
over the range of predicted values.
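Steps 1–3 above can be sketched outside JMP as well. Below is a minimal NumPy version using simulated data with multiplicative error (all data and names here are hypothetical, purely to illustrate the check):

```python
import numpy as np

# Hypothetical data: Y has multiplicative error, so its spread grows with X.
rng = np.random.default_rng(0)
x = np.linspace(1, 100, 200)
y = np.exp(5 + 0.02 * x + rng.normal(0, 0.3, size=x.size))

def fit_and_residuals(x, y):
    """Fit y = b0 + b1*x by least squares; return predictions and residuals."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    pred = X @ beta
    return pred, y - pred

# Step 1: create the transformed Y (log, the "transcendental" choice in JMP).
log_y = np.log(y)

# Step 2: fit the regression of the transformed Y on X.
pred, resid = fit_and_residuals(x, log_y)

# Step 3: compare the residual spread over low vs. high predicted values.
lo = resid[pred < np.median(pred)].std()
hi = resid[pred >= np.median(pred)].std()
print(f"residual SD, low half: {lo:.3f}; high half: {hi:.3f}")
```

If the transformation works, the two standard deviations should be of roughly the same size; with the raw y here they would differ by an order of magnitude.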
Example 6.3 from Book:
Telecore is a high-tech company located
in Fort Worth, Texas. The company produces a part called a
fibre-optic connector (FOC) and wants to generate a prediction of the
sales of FOCs based on the week.
Response SALES
Parameter Estimates
Term       Estimate    Std Error   t Ratio   Prob>|t|
Intercept  4703.7694   512.4686    9.18      <.0001
Week       72.458877   3.340064    21.69     <.0001
[Residual by Predicted Plot: SALES Residual (-15000 to 20000) vs. SALES Predicted (0 to 40000)]
Nonconstant variance with spread of residuals increasing as the
predicted values increase.
We try transforming Sales to Log (Sales)
Response Log Sales
Parameter Estimates
Term       Estimate    Std Error   t Ratio   Prob>|t|
Intercept  8.7390327   0.034298    254.80    0.0000
Week       0.0053746   0.000224    24.04     <.0001

Effect Tests
Source  Nparm  DF  Sum of Squares  F Ratio   Prob > F
Week    1      1   44.796634       578.0861  <.0001

[Residual by Predicted Plot: Log Sales Residual vs. Log Sales Predicted (8.0 to 10.5)]

Log Sales has approximately constant variance.
Outliers in Residuals
• Standardized residual: the residual divided by an estimate of its standard deviation (roughly, residual/RMSE).
• Under normality assumption, 95% of
standardized residuals should be between -2
and 2, and 99% should be between -3 and 3.
• An observation with a standardized residual
above 3 or below -3 is considered to be an
outlier in its residual, i.e., its Y value is unusual
given its explanatory variables. It is worth
looking further at the observation to see if any
reasons for the large magnitude residual can be
identified.
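The flagging rule above can be sketched in code. Here is a minimal NumPy version on simulated data with one planted outlier (the data are hypothetical; the standardization divides by the RMSE, the rough version described above):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
y = 5 + 2 * x + rng.normal(0, 1, 100)
y[0] += 8                      # plant one outlier in its residual

# Fit the simple linear regression and compute residuals.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Crude standardization: divide by the root mean square error.
rmse = np.sqrt(resid @ resid / (len(y) - X.shape[1]))
std_resid = resid / rmse

# Flag observations whose standardized residual exceeds 3 in magnitude.
outliers = np.where(np.abs(std_resid) > 3)[0]
print("flagged rows:", outliers)
```

The planted observation (row 0) is flagged; under normality only about 1% of well-behaved points should fall outside ±3.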
philacrimerate.JMP outliers in
residuals
[Bivariate Fit of HousePrice By CrimeRate: HousePrice (0 to 500,000) vs. CrimeRate (0 to 400), with Gladwyne, Haverford, Phila CC, and Phila N labeled]
Gladwyne, Villanova, and Haverford are outliers in residuals (their
house prices are considerably higher than one would expect given
their crime rates).
Influential Points and Leverage
Points
• Influential observation: a point that, if
removed, would markedly change the
statistical analysis. For simple linear
regression, points that are outliers in the X
direction are often influential.
• Leverage point: a point that is an outlier in
the X direction and so has the potential to be
influential. It will be influential if its
residual is of moderately large magnitude.
Which Observations Are Influential?
[Bivariate Fit of HousePrice By CrimeRate: HousePrice (0 to 500,000) vs. CrimeRate (0 to 400), with Gladwyne, Haverford, Phila CC, and Phila N labeled, and three fitted lines:]
All observations: HousePrice = 176629.41 - 576.90813 CrimeRate
Without Center City Philadelphia: HousePrice = 225233.55 - 2288.6894 CrimeRate
Without Gladwyne: HousePrice = 173116.43 - 567.74508 CrimeRate
Center City Philadelphia is influential; Gladwyne is not. In general,
points that have high leverage are more likely to be influential.
Excluding Observations from
Analysis in JMP
• To exclude an observation from the regression
analysis in JMP, go to the row of the
observation, click Rows and then click
Exclude/Unexclude. A red circle with a diagonal
line through it should appear next to the
observation.
• To put the observation back into the analysis, go
to the row of the observation, click Rows and
then click Exclude/Unexclude. The red circle
should no longer appear next to the observation.
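The effect of excluding a row can also be seen directly in code. A sketch with a boolean mask playing the role of JMP's Exclude, on hypothetical data with one planted influential point:

```python
import numpy as np

def ols_coefs(x, y):
    """Least-squares coefficients for y = b0 + b1*x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
y = 2 + 3 * x + rng.normal(0, 1, 30)
# Add one influential point: an outlier in X with an unusual Y.
x = np.append(x, 40.0)
y = np.append(y, 0.0)

keep = np.ones(x.size, dtype=bool)
keep[-1] = False          # "Exclude" the last row, as in JMP

b_all = ols_coefs(x, y)
b_excl = ols_coefs(x[keep], y[keep])
print("slope with point:", round(b_all[1], 2))
print("slope without point:", round(b_excl[1], 2))
```

With the point included, the fitted slope is dragged far from the true value of 3; excluding it restores the fit, which is exactly what makes the point influential.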
Formal measures of leverage
and influence
• Leverage: “Hat values” (JMP calls them hats)
• Influence: Cook’s Distance (JMP calls them
Cook’s D Influence).
• To obtain them in JMP, click Analyze, Fit
Model, put the Y variable in Y and the X
variables in the Model Effects box, and click
Run Model. After the model is fit, click the
red triangle next to Response, click Save
Columns, and then click Hats for leverages
and Cook’s D Influence for Cook’s distances.
• To sort observations in terms of Cook’s
Distance or Leverage, click Tables, Sort and
then put variable you want to sort by in By
box.
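The two measures can be computed directly from the standard formulas: leverages are the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ, and Cook's distance for observation i is Dᵢ = eᵢ²/(p·s²) · hᵢ/(1-hᵢ)². A NumPy sketch on simulated data with one planted bad point (everything here is hypothetical, for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x = rng.uniform(0, 10, n)
y = 10 + 4 * x + rng.normal(0, 2, n)
x[0], y[0] = 30.0, 10.0        # one point with high leverage and an unusual Y

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
hats = np.diag(H)               # leverages ("hats" in JMP)

beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
p = X.shape[1]                  # number of coefficients (2 here)
mse = resid @ resid / (n - p)

# Cook's distance for each observation.
cooks = resid**2 / (p * mse) * hats / (1 - hats)**2

print("high leverage (hat > 3*p/n):", np.where(hats > 3 * p / n)[0])
print("high influence (Cook's D > 1):", np.where(cooks > 1)[0])
```

The planted point is flagged by both rules of thumb. A useful check: the hat values always sum to the number of coefficients p.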
Distributions
[Histograms: Cook’s D Influence HousePrice (0 to 30) and hat values h HousePrice (0 to 0.9), with Gladwyne, Haverford, and Phila CC labeled]
Center City Philadelphia has both high influence (Cook’s Distance much
greater than 1) and high leverage (hat value > 3*2/99 = 0.06). No other
observations have high influence or high leverage.
Rules of Thumb for High
Leverage and High Influence
• High Leverage: Any observation with a leverage
(hat value) greater than (3 * # of coefficients in
the regression model)/n has high leverage, where
the # of coefficients in the regression model is 2
for simple linear regression and n is the number
of observations.
• High Influence: Any observation with a Cook’s
Distance greater than 1 has high influence.
What to Do About Suspected
Influential Observations?
See flowchart attached to end of slides
Does removing the observation change the
substantive conclusions?
• If not, you can say something like “Observation x
has high influence relative to all other
observations, but we tried refitting the
regression without Observation x and our
main conclusions didn’t change.”
• If removing the observation does change
substantive conclusions, is there any reason
to believe the observation belongs to a
population other than the one under
investigation?
– If yes, omit the observation and proceed.
– If no, does the observation have high leverage
(is it an outlier in the explanatory variables)?
• If yes, omit the observation and proceed. Report that
conclusions only apply to a limited range of the
explanatory variable.
• If no, not much can be said. More data (or clarification of
the influential observation) are needed to resolve the
questions.
General Principles for Dealing with
Influential Observations
• General principle: Delete observations
from the analysis sparingly – only when
there is good cause (observation does not
belong to population being investigated or
is a point with high leverage). If you do
delete observations from the analysis, you
should state clearly which observations
were deleted and why.
Influential Points, High Leverage
Points, Outliers in Multiple
Regression
• As in simple linear regression, we identify high leverage
and high influence points by checking the leverages and
Cook’s distances (use Save Columns to save Cook’s D
Influence and Hats).
• High influence points: Cook’s distance > 1.
• High leverage points: hat value greater than (3 * (# of
explanatory variables + 1))/n. These are points for which
the explanatory variables are outliers in a
multidimensional sense.
• Use the same guidelines for dealing with influential
observations as in simple linear regression.
• A point that has an unusual Y given its explanatory
variables: a point with a residual that is more than
3 RMSEs away from zero.
Multiple regression, modeling and outliers, leverage and influential
points
Pollution Example
• Data set pollution.JMP provides information about the
relationship between pollution and mortality for 60
cities between 1959-1961.
• The variables are:
– y (MORT) = total age-adjusted mortality in deaths per 100,000
population;
– PRECIP = mean annual precipitation (in inches);
– EDUC = median number of school years completed for persons
25 and older;
– NONWHITE = percentage of the 1960 population that is nonwhite;
– NOX = relative pollution potential of NOx (related to the amount
of tons of NOx emitted per day per square kilometer);
– SO2 = relative pollution potential of SO2.
Scatterplot Matrix
• Before fitting a multiple linear regression model,
it is a good idea to make scatterplots of the
response variable versus each explanatory
variable. These can suggest transformations of
the explanatory variables that need to be done,
as well as reveal potential outliers and influential
points.
• Scatterplot matrix in JMP: Click Analyze,
Multivariate Methods and Multivariate, and then
put the response variable first in the Y, columns
box and then the explanatory variables in the Y,
columns box.
Scatterplot Matrix
[Scatterplot matrix of MORT, PRECIP, EDUC, NONWHITE, NOX, and SO2]
Crunched Variables
• When an X variable is “crunched” –
meaning that most of its values are
bunched together and a few are far apart
– there will often be influential points. To
reduce the effects of crunching, it is a good
idea to transform the variable to the log of
the variable.
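The link between crunching and leverage can be illustrated in code. A sketch with a hypothetical heavily skewed X (resembling NOX in the example below): the maximum hat value before and after taking logs.

```python
import numpy as np

def leverages(x):
    """Hat values for a simple regression on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

rng = np.random.default_rng(4)
# Heavily right-skewed X: most values crunched near zero, a few far out.
x = rng.lognormal(mean=0.0, sigma=2.0, size=60)

print("max hat, raw x:", round(leverages(x).max(), 3))
print("max hat, log x:", round(leverages(np.log(x)).max(), 3))
```

The few far-out values dominate the raw-scale fit; on the log scale the values spread out evenly and no single point has extreme leverage.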
a) From the scatterplot of MORT vs. NOX we see that the NOX
values are crunched very tightly, so a log transformation of NOX
is needed.
b) The curvature in MORT vs. SO2 indicates that a log
transformation of SO2 may be suitable.
After the two transformations we have the following
correlations:
           MORT     PRECIP   EDUC     NONWHITE  NOX      SO2      Log(NOX)  Log(SO2)
MORT       1.0000   0.5095  -0.5110   0.6437   -0.0774   0.4259   0.2920    0.4031
PRECIP     0.5095   1.0000  -0.4904   0.4132   -0.4873  -0.1069  -0.3683   -0.1212
EDUC      -0.5110  -0.4904   1.0000  -0.2088    0.2244  -0.2343   0.0180   -0.2562
NONWHITE   0.6437   0.4132  -0.2088   1.0000    0.0184   0.1593   0.1897    0.0524
NOX       -0.0774  -0.4873   0.2244   0.0184    1.0000   0.4094   0.7054    0.3582
SO2        0.4259  -0.1069  -0.2343   0.1593    0.4094   1.0000   0.6905    0.7738
Log(NOX)   0.2920  -0.3683   0.0180   0.1897    0.7054   0.6905   1.0000    0.7328
Log(SO2)   0.4031  -0.1212  -0.2562   0.0524    0.3582   0.7738   0.7328    1.0000
[Scatterplot matrix of MORT, PRECIP, EDUC, NONWHITE, NOX, SO2, Log(NOX), and Log(SO2)]
Response MORT
Summary of Fit
RSquare                  0.688278
RSquare Adj              0.659415
Root Mean Square Error   36.30065

Parameter Estimates
Term       Estimate    Std Error   t Ratio   Prob>|t|
Intercept  940.6541    94.05424    10.00     <.0001
PRECIP     1.9467286   0.700696    2.78      0.0075
EDUC       -14.66406   6.937846    -2.11     0.0392
NONWHITE   3.028953    0.668519    4.53      <.0001
Log NOX    6.7159712   7.39895     0.91      0.3681
Log SO2    11.35814    5.295487    2.14      0.0365
Cook’s Distances and Residual by Predicted Plot
[Residual by Predicted Plot: MORT Residual (-100 to 100) vs. MORT Predicted (750 to 1100), with New Orleans, LA labeled]
[Cook’s Distance plot (0 to 2), with New Orleans, LA labeled]
New Orleans has a Cook’s Distance greater than 1, so New Orleans may be influential.
Labeling Observations
• To have points identified by a certain
column, go to the column, click Columns,
and click Label (click Unlabel to unlabel).
• To label a row, go to the row, click Rows,
and click Label.
Multiple Regression with New Orleans

Summary of Fit
RSquare                      0.688278
RSquare Adj                  0.659415
Root Mean Square Error       36.30065
Mean of Response             940.3568
Observations (or Sum Wgts)   60

Analysis of Variance
Source    DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model     5    157115.28        31423.1       23.8462   <.0001
Error     54   71157.80         1317.7
C. Total  59   228273.08

Parameter Estimates
Term       Estimate    Std Error   t Ratio   Prob>|t|
Intercept  940.6541    94.05424    10.00     <.0001
PRECIP     1.9467286   0.700696    2.78      0.0075
EDUC       -14.66406   6.937846    -2.11     0.0392
NONWHITE   3.028953    0.668519    4.53      <.0001
Log NOX    6.7159712   7.39895     0.91      0.3681
Log SO2    11.35814    5.295487    2.14      0.0365

Multiple Regression without New Orleans

Summary of Fit
RSquare                      0.724661
RSquare Adj                  0.698686
Root Mean Square Error       32.06752
Mean of Response             937.4297
Observations (or Sum Wgts)   59

Analysis of Variance
Source    DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model     5    143441.28        28688.3       27.8980   <.0001
Error     53   54501.26         1028.3
C. Total  58   197942.54

Parameter Estimates
Term       Estimate    Std Error   t Ratio   Prob>|t|
Intercept  852.3761    85.9328     9.92      <.0001
PRECIP     1.3633298   0.635732    2.14      0.0366
EDUC       -5.666948   6.52378     -0.87     0.3889
NONWHITE   3.0396794   0.590566    5.15      <.0001
Log NOX    -9.898442   7.730645    -1.28     0.2060
Log SO2    26.032584   5.931083    4.39      <.0001
Removing New Orleans has a large
impact on the coefficients of log NOX
and log SO2; in particular, it reverses the
sign of the log NOX coefficient (from 6.72 to -9.90).
Leverage Plots
• A “simple regression view” of a multiple regression
coefficient. For xj: plot the residual of y (regressed on
all x’s except xj) versus the residual of xj (regressed on
the rest of the x’s), with both axes recentered.
• Slope: the coefficient of that variable in the multiple
regression.
• Distances from the points to the least-squares line are
the multiple regression residuals. The distance from a
point to the horizontal line is its residual if the
explanatory variable is not included in the model.
• Useful for identifying, for xj, outliers, leverage, and
influential points. (Use them the same way as in simple
regression to identify the effect of points on the
regression coefficient of a particular variable.)
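The key property of a leverage (added-variable) plot, that its slope equals the multiple regression coefficient, can be verified numerically. A NumPy sketch on simulated data (the variables x1, x2 are hypothetical):

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares coefficients (X already includes an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(5)
n = 80
x1 = rng.normal(0, 1, n)
x2 = 0.5 * x1 + rng.normal(0, 1, n)   # correlated with x1
y = 1 + 2 * x1 - 3 * x2 + rng.normal(0, 1, n)

ones = np.ones(n)
# Full multiple regression: coefficient of x2 is b_full[2].
b_full = ols_fit(np.column_stack([ones, x1, x2]), y)

# Leverage plot for x2:
# residual of y on the other x's vs. residual of x2 on the other x's.
Xother = np.column_stack([ones, x1])
ry = y - Xother @ ols_fit(Xother, y)
rx = x2 - Xother @ ols_fit(Xother, x2)

slope = (rx @ ry) / (rx @ rx)
print("multiple-regression coef of x2:", round(b_full[2], 4))
print("leverage plot slope:           ", round(slope, 4))
```

The two numbers agree exactly (up to rounding); this is the Frisch–Waugh result underlying JMP's leverage plots.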
[PRECIP Leverage Plot: MORT Leverage Residuals vs. PRECIP Leverage, P=0.0075]
[Log NOX Leverage Plot: MORT Leverage Residuals vs. Log NOX Leverage, P=0.3681]
[EDUC Leverage Plot: MORT Leverage Residuals vs. EDUC Leverage, P=0.0392]
[Log SO2 Leverage Plot: MORT Leverage Residuals vs. Log SO2 Leverage, P=0.0365]
[NONWHITE Leverage Plot: MORT Leverage Residuals vs. NONWHITE Leverage, P<.0001]
The enlarged observation, New Orleans, is
an outlier for estimating each coefficient
and is highly leveraged for estimating the
coefficients of interest on log NOX and log
SO2. Since New Orleans is both highly
leveraged and an outlier, we expect it to be
influential.