Class 6: Tuesday, Sep. 28 - University of Pennsylvania


Class 6: Tuesday, Sep. 28
• Section 2.4.
• Checking the assumptions of the simple
linear regression model:
– Residual plots
– Normal quantile plots
• Outliers and influential observations
Checking the model
• The simple linear regression model is a great
tool, but its answers will only be useful if it is the
right model for the data. We need to check the
assumptions before using the model.
• Assumptions of the simple linear regression
model:
1. Linearity: The mean of Y|X is a straight line.
2. Constant variance: The standard deviation of
Y|X is constant.
3. Normality: The distribution of Y|X is normal.
4. Independence: The observations are
independent.
Checking that the mean of Y|X is
a straight line
1. Scatterplot: Look at whether the mean of
Y given X appears to increase or decrease
in a straight line.
[Scatterplots: Bivariate Fit of Salary By Years of Experience; Bivariate Fit of Heart Disease Mortality By Wine Consumption]
Residual Plot
• Residuals: Prediction error of using
regression to predict Yi for observation i:
res_i = Y_i − Ŷ_i, where Ŷ_i = β̂_0 + β̂_1 X_i
• Residual plot: Plot with residuals on the y
axis and the explanatory variable (or some
other variable) on the x axis.
[Residual plots: residuals vs. Years of Experience; residuals vs. Wine Consumption]
• Residual Plot in JMP: After doing Fit Line, click red
triangle next to Linear Fit and then click Plot Residuals.
• What should the residual plot look like if the simple linear
regression model holds? Under the simple linear regression
model, the residuals res_i = Y_i − Ŷ_i = Y_i − (β̂_0 + β̂_1 X_i)
should have approximately a normal distribution with mean
zero and a standard deviation which is the same for all X.
• Simple linear regression model: Residuals should appear
as a “swarm” of randomly scattered points about zero.
Ideally, you should not be able to detect any patterns. (Try
not to read too much into these plots – you’re looking for
gross departures from a random scatter).
• A pattern in the residual plot that for a certain range of X
the residuals tend to be greater than zero or tend to be less
than zero indicates that the mean of Y|X is not a straight
line.
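To see what such a pattern looks like numerically, here is a minimal sketch (simulated data, not from the course) where the true mean of Y|X is curved, so a straight-line fit leaves residuals that are systematically positive in the middle of the X range and negative at the extremes:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated curved relationship (like mileage vs. speed): a line is the wrong model.
x = np.linspace(10.0, 110.0, 100)
y = 40.0 - 0.01 * (x - 60.0) ** 2 + rng.normal(0.0, 1.0, x.size)

# Fit a straight line anyway and compute the residuals.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
res = y - (b0 + b1 * x)

# Nonlinearity shows up as a pattern: residuals tend to be above zero for
# middle values of x and below zero near the extremes of x.
mid_mean = res[(x > 40) & (x < 80)].mean()
ends_mean = res[(x <= 40) | (x >= 80)].mean()
print(mid_mean > ends_mean)
```

A random swarm of residuals would make these two averages nearly equal; the large gap here is the numerical footprint of a curved mean.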
Bivariate Fit of Mileage By Speed
[Scatterplot of Mileage vs. Speed with Linear Fit: Mileage = 23.266776 - 0.0012701 Speed]

Data Simulated From A Simple Linear Regression Model (Idealreg.JMP)
Bivariate Fit of Y By X
[Scatterplot of Y vs. X with linear fit, and residual plot of residuals vs. X]
Checking Constant Variance
• Use residual plot of residuals vs. X to check constant
variance assumption.
• Constant variance: Spread of residuals is similar for all
ranges of X.
• Nonconstant variance: Spread of residuals is different for
different ranges of X.
– Fan shaped plot: Residuals are increasing in spread as
X increases
– Horn shaped plot: Residuals are decreasing in spread as
X increases.
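A fan shape can also be checked numerically: compare the spread of residuals over the lower and upper halves of the X range. The sketch below uses simulated data (not from the course) whose noise standard deviation grows with x:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "fan-shaped" data: noise standard deviation is proportional to x.
x = np.linspace(1.0, 100.0, 200)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.05 * x)

# Least-squares fit and residuals.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
res = y - (b0 + b1 * x)

# Crude check of the constant-variance assumption: residual spread over the
# lower half of the x range vs. the upper half.
lo_sd = res[x <= 50.0].std()
hi_sd = res[x > 50.0].std()
print(hi_sd / lo_sd)  # noticeably above 1 for fan-shaped residuals
```

Under constant variance this ratio should be close to 1; a ratio well above 1 is the fan shape, and well below 1 is the horn shape.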
Data Simulated From A Simple Linear Regression Model (Idealreg.JMP) / Simulated Data from a Model with Nonconstant Variance
[Side-by-side Bivariate Fit of Y By X scatterplots with residual plots: random scatter for the simulated model vs. spread that grows with X for the nonconstant-variance model]

Name Game
[Bivariate Fit of Proportion recalled By Position, with residual plot of residuals vs. Position]
Checking Normality
• If the distribution of Y|X is normal, then the
residuals should have approximately a normal
distribution.
• To check normality, make a histogram and a normal
quantile plot of the residuals.
• In JMP, after using Fit Line, click red triangle next
to Linear Fit and click save residuals. Click
Analyze, Distribution, put Residuals in Y, click
OK and then after histogram appears, click red
triangle next to Residuals and click Normal
Quantile Plot.
Name Game / Simulation from Simple Linear Regression Model
[Normal quantile plots of residuals: Residuals Proportion recalled (Name Game) and Residuals Y (simulation), with percentile reference lines .01 through .99]
Normal Quantile Plot
• Section 1.3.
• Most useful tool for assessing normality.
• Plot of residuals (or whatever variable is being
checked for normality) on y-axis versus z-score of
percentile of data point.
• If the true distribution is normal, the normal
quantile plot will be a straight line. Deviations
from a straight line indicate that the distribution is
not normal.
• The dotted red lines are “confidence bands.” If all
the points lie inside the confidence bands, then we
feel that the normality assumption is reasonable.
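The construction above (each sorted residual plotted against the z-score of its percentile) can be sketched without JMP. This is a minimal Python version with made-up residuals; the plotting position `(i − 0.375)/(n + 0.25)` is one common convention for the percentile, not necessarily the one JMP uses:

```python
import numpy as np
from statistics import NormalDist

# Hypothetical residuals to be checked for normality.
res = np.array([0.21, -0.35, 0.05, 0.48, -0.12, -0.60, 0.15, 0.30, -0.08, 0.02])
n = len(res)
sorted_res = np.sort(res)

# Percentile of the i-th smallest residual (Blom-style plotting position),
# converted to a z-score with the inverse normal CDF.
pct = (np.arange(1, n + 1) - 0.375) / (n + 0.25)
z = np.array([NormalDist().inv_cdf(p) for p in pct])

# The normal quantile plot is (z, sorted_res). If the residuals are normal,
# these points fall close to a straight line, so their correlation is high.
corr = np.corrcoef(z, sorted_res)[0, 1]
print(round(corr, 3))
```

Plotting `sorted_res` against `z` reproduces the quantile plot; marked curvature (a low correlation) signals non-normality.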
Independence
• In a problem where the data is collected over time,
plot the residuals vs. time.
• For simple linear regression model, there should
be no pattern in residuals over time.
• Pattern in residuals over time where residuals are
higher or lower in early part of data than later part
of data indicates that relationship between Y and
X is changing over time and might indicate that
there is a lurking variable.
• Lurking variable: A variable that is not among the
explanatory or response variables in a study and
yet may influence the interpretation of
relationships among those variables.
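The residuals-vs-time check can be illustrated numerically. In the sketch below (simulated data, loosely modeled on the enrollment example coming up), a lurking time effect shifts the y-x relationship upward partway through, so residuals from a fit of y on x alone are low early and high late:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated yearly data; x values are hypothetical first-year enrollments.
year = np.arange(1990, 2002)
x = np.array([3800., 4600., 4100., 4900., 3900., 4700., 4300., 4500.,
              4000., 4800., 4200., 4400.])
shift = np.where(year >= 1998, 300.0, 0.0)  # lurking variable tied to time
y = 100.0 + 0.5 * x + shift + rng.normal(0.0, 20.0, year.size)

# Fit y on x only, ignoring time, and compute residuals.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
res = y - (b0 + b1 * x)

# A plot of res vs. year would show a jump at 1998; numerically, the mean
# residual before 1998 sits well below the mean residual from 1998 on.
early = res[year < 1998].mean()
late = res[year >= 1998].mean()
print(late - early)
```

Independence would make the early and late residual averages roughly equal; the large gap is the signature of a relationship changing over time.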
Residual vs. Time Example
• The mathematics dept. at a large state university must
plan the number of instructors required for large
elementary courses and wants to predict
enrollment in elementary math courses (y) based
on the number of first-year students (x).
• Data in mathenroll.JMP
• Residual plot vs. time in JMP: After fit y by x, fit
line, click red triangle next to linear fit and click
save residuals. Then use fit y by x with y =
residuals and x = year.
Residual Plots
[Residuals Math enrollment vs. First year students; Residuals vs. Year, 1992-2000]
Analysis of Math Enrollment
• Residual plot versus time order indicates that there
must be a lurking variable associated with time, in
particular there is a change in the relationship
between y and x between 1997 and 1998.
• In fact, one of the schools in the university changed its
program to require that entering students take
another mathematics course beginning in 1998,
increasing enrollment.
• Implication: Data from before 1998 should not be
used to predict future math enrollment.
What to Do About Violations of
Simple Linear Regression Model
• Coming up in the Future:
• Nonlinearity: Transformations (Chapter 2.6),
Polynomial Regression (Chapter 11)
• Nonconstant Variance: Transformations (Chapter
2.6)
• Nonnormality: Transformations (Chapter 2.6).
• Lack of independence: Incorporate time into
multiple regression (Chapter 11), time series
techniques (Stat 202).
Outliers and Influential
Observations
• Outlier: Any really unusual observation.
• Outlier in the X direction (called high leverage
point): Has the potential to influence the
regression line.
• Outlier in the direction of the scatterplot: An
observation that deviates from the overall pattern
of relationship between Y and X. Typically has a
residual that is large in absolute value.
• Influential observation: Point that if it is removed
would markedly change the statistical analysis.
For simple linear regression, points that are
outliers in the x direction are often influential.
Housing Prices and Crime Rates
• A community in the Philadelphia area is interested
in how crime rates are associated with property
values. If low crime rates increase property
values, the community might be able to cover the
costs of increased police protection by gains in tax
revenues from higher property values.
• The town council looked at a recent issue of
Philadelphia Magazine (April 1996) and found
data for itself and 109 other communities in
Pennsylvania near Philadelphia. Data is in
philacrimerate.JMP. House price = Average house
price for sales during most recent year, Crime
Rate=Rate of crimes per 1000 population.
Bivariate Fit of HousePrice By CrimeRate
[Scatterplot of HousePrice vs. CrimeRate; labeled points: Gladwyne, Haverford, Phila,CC, Phila, N]
Center City Philadelphia is a high leverage point. Gladwyne and
Haverford are outliers in the direction of the scatterplot: their
house price is considerably higher than one would expect given
their crime rate.
Which points are influential?
Bivariate Fit of HousePrice By CrimeRate
[Scatterplot with three fitted lines:
All observations: HousePrice = 176629.41 - 576.90813 CrimeRate
Without Center City Philadelphia: HousePrice = 225233.55 - 2288.6894 CrimeRate
Without Gladwyne: HousePrice = 173116.43 - 567.74508 CrimeRate]
Center City Philadelphia is influential; Gladwyne is not. In general,
points that have high leverage are more likely to be influential.
Formal measures of leverage and
influence
• Leverage: “Hat values” (JMP calls them hats)
• Influence: Cook’s Distance (JMP calls them
Cook’s D Influence).
• To obtain them in JMP, click Analyze, Fit Model,
put Y variable in Y and X variable in Model
Effects box. Click Run Model box. After model
is fit, click red triangle next to Response. Click
Save Columns and then Click Hats for Leverages
and Click Cook’s D Influences for Cook’s
Distances.
• To sort observations in terms of Cook’s Distance
or Leverage, click Tables, Sort and then put
variable you want to sort by in By box.
Distributions
[Histograms: Cook's D Influence HousePrice (Gladwyne, Haverford, Phila,CC labeled) and h HousePrice (Phila,CC labeled)]
Center City Philadelphia has both high influence (Cook’s Distance much
greater than 1) and high leverage (hat value > 3*2/99 = 0.06). No other
observations have high influence or high leverage.
Rules of Thumb for High
Leverage and High Influence
• High Leverage: Any observation with a leverage
(hat value) > (3 * # of coefficients in regression
model)/n has high leverage, where
# of coefficients in regression model = 2 for simple
linear regression, and n = number of observations.
• High Influence: Any observation with a Cook’s
Distance greater than 1 has high influence.
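JMP saves hats and Cook's D Influences with Save Columns; the same quantities can be computed directly from the standard formulas (leverages as the diagonal of the hat matrix, Cook's distance as D_i = res_i² / (p·s²) · h_i / (1 − h_i)²). The data below are made up, with one deliberate outlier in the x direction:

```python
import numpy as np

# Hypothetical data: x = 1..11 plus one outlier in the x direction at x = 20,
# whose y value is also far off the rough line y = 2x.
x = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 20.])
y = np.append(2.0 * np.arange(1, 12), 80.0)
n, p = len(x), 2  # p = number of coefficients in simple linear regression

# Leverages (hat values) are the diagonal of H = X (X'X)^{-1} X'.
X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
hats = np.diag(H)

# Residuals, error-variance estimate, and Cook's distances.
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
res = y - X @ beta_hat
s2 = np.sum(res ** 2) / (n - p)
cooks_d = res ** 2 / (p * s2) * hats / (1.0 - hats) ** 2

# Rules of thumb: leverage > 3*p/n flags high leverage; Cook's D > 1 flags
# high influence. Here only the x-outlier (last point) is flagged on both.
print(np.flatnonzero(hats > 3 * p / n))
print(np.flatnonzero(cooks_d > 1.0))
```

Note how the flags work together: the last point is far from the mean of x (high hat value), and because it also lies off the line, its Cook's distance is large, so removing it would move the fitted line substantially.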
What to Do About Suspected
Influential Observations?
See flowchart handout.
• Does removing the observation change the
substantive conclusions?
• If not, can say something like “Observation x
has high influence relative to all other
observations, but we tried refitting the
regression without Observation x and our
main conclusions didn’t change.”
• If removing the observation does change
substantive conclusions, is there any reason to
believe the observation belongs to a population
other than the one under investigation?
– If yes, omit the observation and proceed.
– If no, does the observation have high leverage (outlier
in the explanatory variable)?
• If yes, omit the observation and proceed. Report that
conclusions only apply to a limited range of the explanatory
variable.
• If no, not much can be said. More data (or clarification of the
influential observation) are needed to resolve the questions.
• General principle: Delete observations from the
analysis sparingly – only when there is good cause
(does not belong to population being investigated
or is a point with high leverage). If you do delete
observations from the analysis, you should state
clearly which observations were deleted and why.
Summary
• Before using the simple linear regression model,
we need to check its assumptions. Check linearity,
constant variance, normality and independence by
using scatterplot, residual plot and normal quantile
plot.
• Influential observations: observations that, if
removed, would have a large influence on the
fitted regression model. Examine influential
observations, remove them only with cause (the
observation belongs to a different population than
the one being studied, or has high leverage) and
explain why you deleted them.
• Next class: Lurking variables, causation (Sections
2.4, 2.5).