Lecture 17: Tues., March 16
• Inference for simple linear regression (Ch. 7.3-7.4)
• $R^2$ statistic (Ch. 8.6.2)
• Association is not causation (Ch. 7.5.3)
• Next class: Diagnostics for assumptions of simple linear regression model (Ch. 8.2-8.3)
Regression
• Goal of regression: Estimate the mean response Y for subpopulations X=x, $\mu\{Y|X\}$
• Example: Y = catheter length required, X = height
• Simple linear regression model:
  $\mu\{Y|X\} = \beta_0 + \beta_1 X$
  Y|X has a normal distribution with mean $\beta_0 + \beta_1 X$ and SD $\sigma$
• Estimate $\beta_0$ and $\beta_1$ by least squares – choose $\hat{\beta}_0, \hat{\beta}_1$ to minimize the sum of squared residuals (prediction errors)
Car Price Example
• A used-car dealer wants to understand how
odometer reading affects the selling price of used
cars.
• The dealer randomly selects 100 three-year-old Ford Tauruses that were sold at auction during the past month. Each car was in top condition and equipped with automatic transmission, AM/FM cassette tape player, and air conditioning.
• carprices.JMP contains the price and number of
miles on the odometer of each car.
Bivariate Fit of Price By Odometer
[Figure: scatterplot of Price (13500-16000) vs. Odometer (15000-45000) with least squares line]
Linear Fit
Price = 17066.766 - 0.0623155 Odometer
Summary of Fit
RSquare                      0.650132
RSquare Adj                  0.646562
Root Mean Square Error       303.1375
Mean of Response             14822.82
Observations (or Sum Wgts)   100
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   17066.766   169.0246    100.97    <.0001
Odometer    -0.062315   0.004618    -13.49    <.0001
Inference for Simple Linear Regression
• Inference is based on the ideal simple linear regression model holding.
• Inference is based on taking repeated random samples $(y_1, \ldots, y_n)$ from the same subpopulations $(x_1, \ldots, x_n)$ as in the observed data.
• Types of inference:
– Hypothesis tests for intercept and slope
– Confidence intervals for intercept and slope
– Confidence interval for mean of Y at $X = X_0$
– Prediction interval for future Y for which $X = X_0$
Ideal Simple Linear Regression Model
• Assumptions of the ideal simple linear regression model:
– There is a normally distributed subpopulation of responses for each value of the explanatory variable.
– The means of the subpopulations fall on a straight-line function of the explanatory variable.
– The subpopulation standard deviations are all equal (to $\sigma$).
– The selection of an observation from any of the subpopulations is independent of the selection of any other observation.
Sampling Distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$
• See handout.
• See Display 7.7
• $SD(\hat{\beta}_1) = \sigma \sqrt{\dfrac{1}{(n-1)s_x^2}}$ (a companion formula for $\hat{\beta}_0$ is noted below)
• Standard deviation is smaller for (i) larger n, (ii) smaller $\sigma$, (iii) larger spread in x (higher $s_x^2$)
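For reference, the companion formula for the intercept (a standard least squares result, presumably among those on the handout, not shown on this slide) is:

$$SD(\hat{\beta}_0) = \sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{(n-1)s_x^2}}$$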
Hypothesis tests for $\beta_0$ and $\beta_1$
• Hypothesis test of $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$
– Based on the t-test statistic,
  $|t| = \dfrac{|\text{Estimate}|}{SE(\text{Estimate})} = \dfrac{|\hat{\beta}_1|}{SE(\hat{\beta}_1)}$
– The p-value has the usual interpretation: the probability, under the null hypothesis, that |t| would be at least as large as its observed value. A small p-value is evidence against the null hypothesis.
– Interpretation of the null hypothesis: X is not a useful predictor of Y; the mean of Y is not associated with X.
• The test for $H_0: \beta_0 = 0$ vs. $H_a: \beta_0 \neq 0$ is based on an analogous test statistic.
• Test statistics and p-values can be found on JMP output under Parameter Estimates, obtained by using Fit Line after Fit Y by X.
• For the car price data, there is convincing evidence that both the intercept and slope are not zero (p-value < .0001 for both); a sketch of the slope calculation follows.
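A minimal sketch (not part of the original slides) reproducing the slope t-test from the JMP Parameter Estimates above; the estimate, standard error, and n are copied from the car price output:

```python
# Reproduce the Odometer slope t-test from the JMP output.
from scipy import stats

n = 100          # number of cars
est = -0.062315  # estimated slope for Odometer
se = 0.004618    # standard error of the slope

t_ratio = est / se                                 # JMP reports -13.49
p_value = 2 * stats.t.sf(abs(t_ratio), df=n - 2)   # two-sided p-value, df = n - 2

print(f"t = {t_ratio:.2f}, p = {p_value:.2g}")     # t = -13.49, p << .0001
```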
Confidence Intervals for $\beta_0$ and $\beta_1$
• Confidence intervals provide a range of plausible values for $\beta_0$ and $\beta_1$.
• 95% Confidence Intervals:
  $\hat{\beta}_0 \pm t_{n-2}(.975)\,SE(\hat{\beta}_0) \approx \hat{\beta}_0 \pm 2\,SE(\hat{\beta}_0)$
  $\hat{\beta}_1 \pm t_{n-2}(.975)\,SE(\hat{\beta}_1) \approx \hat{\beta}_1 \pm 2\,SE(\hat{\beta}_1)$
  Table A.2 lists $t_{n-2}(.975)$. It is approximately 2.
• Finding CIs in JMP: Go to Parameter Estimates, right click, click Columns and then click Lower 95% and Upper 95%.
• For the car price data set, the CIs are $\beta_0$: (16731, 17402) and $\beta_1$: (-0.071, -0.053); a sketch of the computation follows.
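A companion sketch (again not from the original slides) computing these 95% CIs from the Parameter Estimates table:

```python
# 95% confidence intervals for the intercept and slope, car price data.
from scipy import stats

n = 100
t_crit = stats.t.ppf(0.975, df=n - 2)   # about 1.98, close to 2

for term, est, se in [("Intercept", 17066.766, 169.0246),
                      ("Odometer", -0.062315, 0.004618)]:
    lo, hi = est - t_crit * se, est + t_crit * se
    print(f"{term}: ({lo:.6g}, {hi:.6g})")   # matches (16731, 17402) and (-0.071, -0.053)
```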
Two prediction problems
a) The used-car dealer has an opportunity to bid on a lot of cars offered by a rental company. The rental company has 250 Ford Tauruses, all equipped with automatic transmission, air conditioning and AM/FM cassette tape players. All of the cars in this lot have about 40,000 miles on the odometer. The dealer would like an estimate of the average selling price of all cars in this lot (or, virtually equivalently, the average selling price of the population of Ford Tauruses with the above equipment and 40,000 miles on the odometer).
b) The used-car dealer is about to bid on a 3-year-old Ford Taurus equipped with automatic transmission, air conditioner and AM/FM cassette tape player and with 40,000 miles on the odometer. The dealer would like to predict the selling price of this particular car.
Prediction problem (a)
• Goal is to estimate the conditional mean of selling price given odometer reading = 40,000, $\mu\{Y|X=40000\}$
• Point estimate is
  $\hat{\mu}\{Y|X=40000\} = \hat{\beta}_0 + \hat{\beta}_1 \cdot 40000 = 17066.766 - 0.062315 \times 40000 \approx 14{,}574$
• What is a range of plausible values for $\mu\{Y|X=40000\}$?
Confidence Intervals for Mean of Y at $X = X_0$
• What is a plausible range of values for $\mu\{Y|X_0\}$?
• 95% CI for $\mu\{Y|X_0\}$: $\hat{\mu}\{Y|X_0\} \pm t_{n-2}(.975)\,SE[\hat{\mu}\{Y|X_0\}]$, where $\hat{\mu}\{Y|X_0\} = \hat{\beta}_0 + \hat{\beta}_1 X_0$
• $SE[\hat{\mu}\{Y|X_0\}] = \hat{\sigma} \sqrt{\dfrac{1}{n} + \dfrac{(X_0 - \bar{X})^2}{(n-1)s_X^2}}$
• Note about formula
– Precision in estimating $\mu\{Y|X\}$ is not constant for all values of X. Precision decreases as $X_0$ gets farther away from the sample average of the X's.
• JMP implementation: Use the Confid Curves Fit command under the red triangle next to Linear Fit after using Fit Y by X, Fit Line. Use the crosshair tool to find the exact values of the confidence interval endpoints for a given $X_0$; a sketch of the formula in code follows.
Prediction Problem (b)
• Goal is to estimate the selling price of a given car with odometer reading = 40,000.
• What are likely values for a future value $Y_0$ at some specified value of X (= $X_0$)?
• Best prediction is the estimated mean response for $X = X_0$:
  $\hat{\mu}\{Y|X=40000\} = \hat{\beta}_0 + \hat{\beta}_1 \cdot 40000 \approx 14{,}574$
• A prediction interval is an interval of likely values along with a measure of the likelihood that the interval will contain the response.
• 95% prediction interval for $X_0$: If repeated samples are obtained from the subpopulations $(x_1, \ldots, x_n)$ and a prediction interval is formed, the prediction interval will contain the value of $Y_0$ for a future observation from the subpopulation $X_0$ 95% of the time.
Prediction Intervals Cont.
• Prediction interval must account for two sources of uncertainty:
– Uncertainty about the location of the subpopulation mean $\mu\{Y|X_0\}$
– Uncertainty about where the future value will be in relation to its mean
• $Y - \text{Pred}\{Y|X_0\} = Y - \hat{\mu}\{Y|X_0\} = [Y - \mu\{Y|X_0\}] + [\mu\{Y|X_0\} - \hat{\mu}\{Y|X_0\}]$
• Prediction Error = Random Sampling Error + Estimation Error
Prediction Interval Formula
• 95% prediction interval at $X_0$:
  $\hat{\mu}\{Y|X_0\} \pm t_{n-2}(.975) \sqrt{\hat{\sigma}^2 + SE[\hat{\mu}\{Y|X_0\}]^2}$
• Compare to the 95% CI for the mean at $X_0$:
  $\hat{\mu}\{Y|X_0\} \pm t_{n-2}(.975)\,SE[\hat{\mu}\{Y|X_0\}]$
– Prediction interval is wider due to random sampling error in the future response.
– As the sample size n becomes large, the margin of error of the CI for the mean goes to zero, but the margin of error of the PI doesn't.
• JMP implementation: Use the Confid Curves Indiv command under the red triangle next to Linear Fit after using Fit Y by X, Fit Line. Use the crosshair tool to find the exact values of the prediction interval endpoints for a given $X_0$; a code sketch follows.
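A prediction-interval variant of the earlier mean_ci sketch (same assumed x, y arrays), widening the interval by adding $\hat{\sigma}^2$ under the square root, per the formula on this slide:

```python
# 95% prediction interval for a future Y at x0.
import numpy as np
from scipy import stats

def pred_interval(x, y, x0, level=0.95):
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    sigma_hat = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
    se_mean = sigma_hat * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / sxx)
    t_crit = stats.t.ppf((1 + level) / 2, df=n - 2)
    half = t_crit * np.sqrt(sigma_hat ** 2 + se_mean ** 2)  # extra sigma^2 vs. the CI
    fit = b0 + b1 * x0
    return fit - half, fit + half
```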
Bivariate Fit of Price By Odometer
[Figure: scatterplot of Price vs. Odometer with fitted line and 95% confidence/prediction bands]
95% Confidence Interval for $\mu\{Y|X=40000\}$: (14514, 14653)
95% Prediction Interval for X = 40000: (13972, 15194)
R-Squared
• The R-squared statistic, also called the coefficient of determination, is the percentage of response variation explained by the explanatory variable.
  $R^2 = \dfrac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}}$
• Unitless measure of strength of relationship between x and y
• Total sum of squares = $\sum_{i=1}^{n}(Y_i - \bar{Y})^2$. Best sum of squared prediction error without using x.
• Residual sum of squares = $\sum_{i=1}^{n} res_i^2 = \sum_{i=1}^{n}(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$ (a code sketch of $R^2$ follows)
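A short sketch of this definition (added here, not on the slides); y is the observed response and yhat the fitted values $\hat{\beta}_0 + \hat{\beta}_1 x$. With the fitted values from the car price regression, this should return 0.6501, matching the JMP Summary of Fit:

```python
# R^2 as (TSS - RSS) / TSS.
import numpy as np

def r_squared(y, yhat):
    tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
    rss = np.sum((y - yhat) ** 2)       # residual sum of squares
    return (tss - rss) / tss
```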
R-Squared Example
Bivariate Fit of Price By Odometer
[Figure: scatterplot of Price vs. Odometer with fitted line]
Summary of Fit
RSquare                      0.650132
RSquare Adj                  0.646562
Root Mean Square Error       303.1375
Mean of Response             14822.82
Observations (or Sum Wgts)   100
• $R^2$ = .6501. Read as "65.01 percent of the variation in car prices was explained by the linear regression on odometer."
Interpreting $R^2$
• $R^2$ takes on values between 0 and 1, with higher $R^2$ indicating a stronger linear association.
• If the residuals are all zero (a perfect fit), then $R^2$ is 1. If the least squares line has slope 0, $R^2$ will be 0.
• $R^2$ is useful as a unitless summary of the strength of linear association.
Caveats about $R^2$
– $R^2$ is not useful for assessing model adequacy, i.e., whether the simple linear regression model holds (use residual plots) or whether or not there is an association (use the test of $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$).
– A good $R^2$ depends on the context. In precise laboratory work, $R^2$ values under 90% might be too low, but in social science contexts, where a single variable rarely explains a great deal of variation in the response, $R^2$ values of 50% may be considered remarkably good.
Association is not causation
• A high $R^2$ means that x has a strong linear relationship with y – there is a strong association between x and y. It does not imply that x causes y.
• Alternative explanations for a high $R^2$:
– The reverse is true: Y causes X.
– There may be a lurking (confounding) variable related to both x and y which is the common cause of x and y.
• No cause-and-effect relationship can be inferred unless X is randomly assigned to units in a random experiment.
• A researcher measures the number of television sets per person X and the average life expectancy Y for the world's nations. The regression line has a positive slope – nations with many TV sets have higher life expectancies. Could we lengthen the lives of people in Rwanda by shipping them TV sets?
Example
• A community in the Philadelphia area is interested
in how crime rates affect property values. If low
crime rates increase property values, the
community may be able to cover the costs of
increased police protection by gains in tax
revenues from higher property values. Data on the
average housing price and crime rate (per 1000
population) for communities in Pennsylvania near
Philadelphia for 1996 are shown in
housecrime.JMP.
Bivariate Fit of HousePrice By CrimeRate
[Figure: scatterplot of HousePrice vs. CrimeRate with fitted line, and a plot of residuals vs. CrimeRate]
Summary of Fit
RSquare                      0.184229
RSquare Adj                  0.175731
Root Mean Square Error       78861.53
Mean of Response             158464.5
Observations (or Sum Wgts)   98
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   225233.55   16404.02    13.73     <.0001
CrimeRate   -2288.689   491.5375    -4.66     <.0001
Questions
1. Can you deduce a cause-and-effect relationship from these data? What explanations are there for the association between housing prices and crime rate other than that high crime rates cause low housing prices?
2. Does the ideal simple linear regression model appear to hold?