Lecture 25 - University of Pennsylvania


Lecture 25
• Regression diagnostics for the multiple
linear regression model
• Dealing with influential observations for
multiple linear regression
• Interaction variables
Distributions: Midterm 2 Scores

Approximate grade guidelines:

Exam raw score   Grade
44+              A range
35+              B range

[Histogram of Midterm 2 scores, roughly 25 to 55]

Moments
Mean             43.916667
Std Dev           4.7580845
Std Err Mean      0.8687034
Upper 95% Mean   45.693365
Lower 95% Mean   42.139969
N                30
Final grade is determined by 40% homework, 20% each midterm, and 20% final. The lower midterm score is replaced by the final exam score if the latter is higher.
Assumptions of Multiple Linear
Regression Model
• Assumptions of multiple linear regression:
– For each subpopulation $x_1,\dots,x_p$,
• (A-1A) $\mu\{Y \mid X_1,\dots,X_p\} = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$
• (A-1B) $\mathrm{Var}(Y \mid X_1,\dots,X_p) = \sigma^2$
• (A-1C) The distribution of $Y \mid X_1,\dots,X_p$ is normal
[the distribution of residuals should not depend on $x_1,\dots,x_p$]
– (A-2) The observations are independent of one
another
Checking/Refining Model
• Tools for checking (A-1A) and (A-1B)
– Residual plots versus predicted (fitted) values
– Residual plots versus explanatory variables $x_1,\dots,x_p$
– If model is correct, there should be no pattern in the
residual plots
• Tool for checking (A-1C)
– Histogram of residuals
• Tool for checking (A-2)
– Residual plot versus time or spatial order of
observations
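These checks are point-and-click in JMP; as a rough analogue, here is a minimal Python sketch. The file name pollutionhc.csv and the column names MORT, PRECIP, EDUC, NONWHITE, HC are assumptions (a hypothetical export of the JMP data set), not part of the lecture.

```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

df = pd.read_csv("pollutionhc.csv")   # hypothetical export of pollutionhc.JMP
fit = smf.ols("MORT ~ PRECIP + EDUC + NONWHITE + HC", data=df).fit()

# (A-1A)/(A-1B): residuals vs. fitted values -- look for curvature or a fan/funnel
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, color="gray")
plt.xlabel("predicted MORT"); plt.ylabel("residual"); plt.show()

# (A-1C): histogram of residuals -- should look roughly normal
plt.hist(fit.resid, bins=15)
plt.xlabel("residual"); plt.show()

# (A-2): residuals vs. observation order -- look for time or spatial trends
plt.plot(fit.resid.values, marker="o")
plt.xlabel("observation order"); plt.ylabel("residual"); plt.show()
```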
Model Building (Display 9.9)
1. Make scatterplot matrix of variables (using
analyze, multivariate). Decide on whether to
transform any of the explanatory variables.
Check for obvious outliers.
2. Fit tentative model.
3. Check residual plots for whether assumptions of
multiple regression model are satisfied. Look
for outliers and influential points.
4. Consider fitting richer model with interactions
or curvature. See if extra terms can be dropped.
5. Make changes to model and repeat steps 2-4
until an adequate model is found.
Multiple regression, modeling and outliers, leverage and influential points
Pollution Example
• The data set pollutionhc.JMP provides information about the relationship between pollution and mortality for 60 cities over 1959-1961.
• The variables are:
– y (MORT) = total age-adjusted mortality in deaths per 100,000 population
– PRECIP = mean annual precipitation (in inches)
– EDUC = median number of school years completed for persons 25 and older
– NONWHITE = percentage of 1960 population that is nonwhite
– HC = relative pollution potential of hydrocarbons (the product of tons emitted per day per square kilometer and a factor correcting for SMSA dimension and exposure)
Scatterplot Matrix
[Scatterplot matrix of MORTALITY, PRECIP, EDUC, NONWHITE, and HC]
Transformations for Explanatory
Variables
• In deciding whether to transform an explanatory variable x, we consider two features of the plot of the response y vs. the explanatory variable x:
1. Is there curvature in the relationship between y and x? This suggests a transformation chosen by Tukey's Bulging Rule.
2. Are most of the x values "crunched together" and a few very spread apart? This will lead to several points being very influential. When this is the case, it is best to transform x to make the x values more evenly spaced and less influential. If the x values are positive, the log transformation is a good idea.
• For the pollution data, reason 2 suggests transforming HC to log HC.
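In the hypothetical Python setup above, the transformation is one line (natural log vs. log10 changes only the coefficient's scale, not the fit):

```python
import numpy as np

# All HC values are positive, so the log transformation is defined.
df["LogHC"] = np.log(df["HC"])
```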
Scatterplot Matrix
[Scatterplot matrix of MORTALITY, PRECIP, EDUC, NONWHITE, and Log HC]
Response MORTALITY

Summary of Fit
RSquare                  0.62584
RSquare Adj              0.598628
Root Mean Square Error   39.40713

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model       4        142862.38       35715.6   22.9990     <.0001
Error      55         85410.70        1552.9
C. Total   59        228273.08

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   1043.1145   96.26515    10.84     <.0001
PRECIP      1.9789185   0.762967     2.59     0.0121
EDUC        -22.93284   7.003909    -3.27     0.0018
NONWHITE    2.8382152   0.692279     4.10     0.0001
Log HC      14.984527   5.401444     2.77     0.0075
If the multiple regression model assumptions are correct, there is strong evidence that mean mortality increases as hydrocarbons increase, holding fixed precipitation, education, and nonwhite. If there are no uncontrolled-for confounding variables, this would suggest that an increase in hydrocarbons causes an increase in mortality.
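For reference, the same fit in the hypothetical Python setup from earlier, using the LogHC column created above; summary() prints the R-square, ANOVA F ratio, and parameter-estimate table reported here:

```python
import statsmodels.formula.api as smf

fit2 = smf.ols("MORT ~ PRECIP + EDUC + NONWHITE + LogHC", data=df).fit()
print(fit2.summary())
```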
Residual vs. Predicted Plot
• Useful for detecting nonconstant variance;
look for fan or funnel pattern.
• Plot of residuals $e_i$ versus predicted values $\hat{\mu}\{Y_i \mid x_{1i},\dots,x_{pi}\}$
• For pollution data, no strong indication of
nonconstant variance.
[Residual by Predicted Plot: MORTALITY residual vs. MORTALITY predicted]
Residual plots vs. each
explanatory variable
• Make plot of residuals vs. an explanatory variable
by using Fit Model, clicking red triangle next to
response, selecting Save Columns and selecting
save residuals. This creates a column of residuals.
Then click Analyze, Fit Y by X and put residuals
in Y and the explanatory variable in X.
• Use these residual plots to check for pattern in the
mean of residuals (suggests that we need to
transform x or use a polynomial in x) or pattern in
the variance of the residuals.
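A rough code analogue of these JMP steps, with the same hypothetical DataFrame and fitted model as in the earlier sketches:

```python
import matplotlib.pyplot as plt

# Save residuals as a column, then plot against each explanatory variable
# (mirrors Save Columns -> residuals followed by Fit Y by X).
df["resid"] = fit2.resid
for x in ["PRECIP", "EDUC", "NONWHITE", "LogHC"]:
    plt.scatter(df[x], df["resid"])
    plt.axhline(0, color="gray")
    plt.xlabel(x); plt.ylabel("Residual MORTALITY"); plt.show()
```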
[Bivariate fits of Residual MORTALITY by PRECIP, EDUC, NONWHITE, and Log HC]
Residual plots look fine. No strong indication of nonlinearity or
nonconstant variance.
Check of normality/outliers
[Histogram of Residual MORTALITY]
Normality looks okay. One residual outlier, Lancaster.
Influential Observations
• As in simple linear regression, one or two
observations can strongly influence the estimates.
• Harder to immediately see the influential
observations in multiple regression.
• Use Cook's distances (Cook's D influence) to look for influential observations. An observation has large influence if its Cook's distance is greater than 1.
• Can use Table, Sort to sort observations by Cook’s
Distance or Leverage.
• For pollution data: no observation has high
influence.
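A sketch of the same check on the fitted model from the earlier Python setup:

```python
# Cook's distance for each observation; values near or above 1 flag high influence.
infl = fit2.get_influence()
cooks_d = infl.cooks_distance[0]   # first element is the array of distances
print(df.assign(cooks=cooks_d).sort_values("cooks", ascending=False).head())
```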
Strategy for dealing with
influential observations
• Use Display 11.8
• Leverage of point: measure of distance between
point’s explanatory variable values and
explanatory variable values in entire data set.
• Two sources of influence: leverage, magnitude of
residual.
• General approach: If an influential point has high leverage, omit the point and report conclusions for the reduced range of explanatory variables. If an influential point does not have high leverage, the point cannot simply be removed; report results with and without the point.
Leverage
• Obtaining leverages from JMP: After Fit Model,
click red triangle next to Response, select Save
Columns, Hats.
• Leverages are between 1/n and 1. Average
leverage is p/n.
• An observation is considered to have high leverage if its leverage is greater than 2p/n, where p = # of explanatory variables. For the pollution data, 2p/n = (2×4)/60 ≈ .133.
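A sketch of the same check in the hypothetical Python setup:

```python
# Hat values (leverages) from the fitted model; flag observations above 2p/n.
hats = fit2.get_influence().hat_matrix_diag
p, n = 4, len(df)              # p = # of explanatory variables, as on the slide
print(df[hats > 2 * p / n])    # threshold (2*4)/60 = .133 for the pollution data
```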
Specially Constructed
Explanatory Variables
• Interaction variables
• Squared and higher polynomial terms for
curvature
• Dummy variables for categorical variables.
Interaction
• Interaction is a three-variable concept. One of
these is the response variable (Y) and the other
two are explanatory variables (X1 and X2).
• There is an interaction between X1 and X2 if the
impact of an increase in X2 on Y depends on the
level of X1.
• To incorporate interaction into the multiple regression model, we add the explanatory variable $X_1 X_2$. There is evidence of an interaction if the coefficient on $X_1 X_2$ is significant (t-test has p-value < .05).
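Written out, the model with the interaction term is

$\mu\{Y \mid X_1, X_2\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$,

so the slope on $X_2$ for fixed $X_1$ is $\beta_2 + \beta_3 X_1$: it depends on the level of $X_1$ exactly when $\beta_3 \neq 0$.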
An experiment to study how noise affects the performance of children tested second
grade hyperactive children and a control group of second graders who were not
hyperactive. One of the tasks involved solving math problems. The children solved
problems under both high-noise and low-noise conditions. Here are the mean scores:
[Bar chart of Mean Mathematics Score under high- and low-noise conditions for the control and hyperactive groups]

Let Y = Mean Mathematics Score, $X_1$ = Type of Child (0 = Control, 1 = Hyperactive), and $X_2$ = Type of Noise (0 = Low Noise, 1 = High Noise). There is an interaction between type of child and type of noise: the impact of increasing noise from low to high depends on the type of child.
Interaction variables in JMP
• To add an interaction variable in Fit Model
in JMP, add the usual explanatory variables
first, then highlight X1 in the Select
Columns box and X 2 in the Construct
Model Effects Box. Then click Cross in the
Construct Model Effects Box.
• JMP creates the explanatory variable $(X_1 - \bar{X}_1)(X_2 - \bar{X}_2)$
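A sketch of the same construction in the hypothetical Python setup; centering shifts the intercept and main-effect estimates but leaves the interaction coefficient and the overall fit unchanged:

```python
# Centered cross product, as JMP's Cross constructs it, then refit with interaction.
df["LogHCxPRECIP"] = (df["LogHC"] - df["LogHC"].mean()) * (df["PRECIP"] - df["PRECIP"].mean())
fit3 = smf.ols("MORT ~ PRECIP + EDUC + NONWHITE + LogHC + LogHCxPRECIP", data=df).fit()
print(fit3.summary())
```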
Interaction Model for Pollution
Data
$\mu\{Y \mid \text{precip}, \text{educ}, \text{nonwhite}, \log\text{HC}\} = \beta_0 + \beta_1\,\text{precip} + \beta_2\,\text{educ} + \beta_3\,\text{nonwhite} + \beta_4\log\text{HC} + \beta_5(\log\text{HC} - 2.75)(\text{precip} - 37.37)$

$\mu\{Y \mid \text{precip} = x_1, \text{educ} = x_2, \text{nonwhite} = x_3, \log\text{HC} = x_4 + 1\} - \mu\{Y \mid \text{precip} = x_1, \text{educ} = x_2, \text{nonwhite} = x_3, \log\text{HC} = x_4\} = \beta_4 + \beta_5(x_1 - 37.37)$
Response MORTALITY
Summary of Fit
RSquare                      0.703313
RSquare Adj                  0.675842
Root Mean Square Error       35.41441
Mean of Response             940.3568
Observations (or Sum Wgts)   60
Parameter Estimates
Term                                Estimate    Std Error   t Ratio   Prob>|t|
Intercept                           989.56166   87.67919    11.29     <.0001
PRECIP                              1.7266652   0.688946     2.51     0.0152
EDUC                                -18.72354   6.393312    -2.93     0.0050
NONWHITE                            2.3882226   0.633573     3.77     0.0004
Log HC                              25.422693   5.593733     4.54     <.0001
(Log HC-2.75329)*(PRECIP-37.3667)   1.2550598   0.334228     3.76     0.0004
There is strong evidence (p-value = .0004) of an interaction between hydrocarbons and precipitation. The impact of an increase in hydrocarbons on mean mortality, holding fixed education, nonwhite, and precipitation, is greater for higher precipitation levels.
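As a worked example with the estimates above: the estimated change in mean mortality per one-unit increase in log HC is $\hat\beta_4 + \hat\beta_5(\text{precip} - 37.37) = 25.42 + 1.255\,(\text{precip} - 37.37)$. For a city with 50 inches of precipitation this is about $25.42 + 1.255(12.63) \approx 41.3$ deaths per 100,000, while for a city with 25 inches it is only about $25.42 + 1.255(-12.37) \approx 9.9$.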