Lecture 6-7 Notes - University of Pennsylvania


Lecture 6 Notes
• Note: I will e-mail homework 2 tonight. It will be
due next Thursday.
• The Multiple Linear Regression model (Chapter
4.1)
• Inferences from multiple regression analysis
(Chapter 4.2)
• In multiple regression analysis, we consider more than one independent variable, x_1, ..., x_K. We are interested in the conditional mean of y given x_1, ..., x_K.
Automobile Example
• A team charged with designing a new automobile
is concerned about the gas mileage that can be
achieved. The design team is interested in two
things:
(1) Which characteristics of the design are likely to
affect mileage?
(2) A new car is planned to have the following
characteristics: weight – 4000 lbs, horsepower –
200, cargo – 18 cubic feet, seating – 5 adults.
Predict the new car’s gas mileage.
• The team has available information about gallons
per 1000 miles and four design characteristics
(weight, horsepower, cargo, seating) for a sample
of cars made in 1989. Data is in car89.JMP.
Multivariate Correlations

                GP1000MHwy  Weight(lb)  Horsepower    Cargo  Seating
GP1000MHwy          1.0000      0.7097      0.6157   0.3405   0.2599
Weight(lb)          0.7097      1.0000      0.7509   0.1816   0.3499
Horsepower          0.6157      0.7509      1.0000  -0.0548  -0.0914
Cargo               0.3405      0.1816     -0.0548   1.0000   0.4894
Seating             0.2599      0.3499     -0.0914   0.4894   1.0000

7 rows not used due to missing values.

Scatterplot Matrix
[Scatterplot matrix: pairwise plots of GP1000MHwy, Weight(lb), Horsepower, Cargo, and Seating.]
Best Single Predictor
• To obtain the correlation matrix and
pairwise scatterplots, click Analyze,
Multivariate Methods, Multivariate.
• If we use simple linear regression with each
of the four independent variables, which
provides the best predictions?
Best Single Predictor
• Answer: The simple linear regression with the highest R^2 gives the best predictions, because

  R^2 = 1 - SSE/SST

• Weight gives the best predictions of GP1000MHwy based on simple linear regression.
• But we can obtain better predictions by using
more than one of the independent variables.
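Since a simple regression's R^2 is just the squared correlation with y, the correlation matrix above already identifies Weight (r = 0.7097, so R^2 ≈ 0.50) as the best single predictor. A minimal sketch that checks this directly, assuming the JMP table has been exported to a CSV file (the file name car89.csv and the exact column names are assumptions):

```python
# Sketch: compare the four single-predictor regressions by R^2.
import numpy as np
import pandas as pd

df = pd.read_csv("car89.csv").dropna()   # drop the rows with missing values
y = df["GP1000MHwy"].to_numpy()
sst = np.sum((y - y.mean()) ** 2)        # total sum of squares

for col in ["Weight(lb)", "Horsepower", "Cargo", "Seating"]:
    x = df[col].to_numpy()
    slope, intercept = np.polyfit(x, y, 1)      # least squares fit
    sse = np.sum((y - (intercept + slope * x)) ** 2)
    print(f"{col}: R^2 = {1 - sse / sst:.4f}")  # R^2 = 1 - SSE/SST
```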
Multiple Linear Regression Model

• E(Y | x_1, \dots, x_K) = \mu_{y|x_1,\dots,x_K} = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K

  y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_K x_{Ki} + e_i

• Assumptions about the disturbances e_i given (x_1, \dots, x_K):
  – The expected value of the disturbances is zero for each e_i: E(e_i | x_{i1}, \dots, x_{iK}) = 0
  – The variance of each e_i is equal to \sigma_e^2, i.e., Var(e_i | x_{i1}, \dots, x_{iK}) = \sigma_e^2
  – The e_i are normally distributed.
  – The e_i are independent.
Point Estimates for Multiple Linear Regression Model
• We use the same least squares procedure as for
simple linear regression.
• Our estimates of \beta_0, \dots, \beta_K are the coefficients b_0, \dots, b_K that minimize the sum of squared prediction errors:

  (b_0, \dots, b_K) = \arg\min_{b_0^*, \dots, b_K^*} \sum_{i=1}^{n} (y_i - b_0^* - b_1^* x_{i1} - \cdots - b_K^* x_{iK})^2

  \hat{y} = b_0 + b_1 x_1 + \cdots + b_K x_K
• Least Squares in JMP: Click Analyze, Fit Model,
put dependent variable into Y and add independent
variables to the construct model effects box.
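The same estimates can be computed outside JMP. A minimal sketch, again assuming the data sit in a hypothetical car89.csv with the column names used above:

```python
# Sketch: least squares coefficients for the four-variable model with numpy.
import numpy as np
import pandas as pd

df = pd.read_csv("car89.csv").dropna()
y = df["GP1000MHwy"].to_numpy()
X = df[["Weight(lb)", "Horsepower", "Cargo", "Seating"]].to_numpy()
X = np.column_stack([np.ones(len(y)), X])   # prepend an intercept column

b, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes the sum of squared errors
for name, coef in zip(["Intercept", "Weight(lb)", "Horsepower", "Cargo", "Seating"], b):
    print(f"{name}: {coef:.7f}")
```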
Response GP1000MHwy
Whole Model

Actual by Predicted Plot
[Plot of actual vs. predicted GP1000MHwy.]

Summary of Fit
RSquare                     0.589015
RSquare Adj                 0.573208
Root Mean Square Error      3.542778
Mean of Response            37.33359
Observations (or Sum Wgts)  109

Analysis of Variance
Source    DF   Sum of Squares  Mean Square  F Ratio
Model       4       1870.7788      467.695  37.2627
Error     104       1305.3330       12.551  Prob > F
C. Total  108       3176.1118               <.0001

Parameter Estimates
Term        Estimate   Std Error  t Ratio  Prob>|t|
Intercept   19.100521  2.098478   9.10     <.0001
Weight(lb)  0.0040877  0.001203   3.40     0.0010
Horsepower  0.0426999  0.01567    2.73     0.0075
Cargo       0.0533     0.013787   3.87     0.0002
Seating     0.0268912  0.428283   0.06     0.9501

Residual by Predicted Plot
[Plot of GP1000MHwy residuals vs. predicted values, scattered around 0.]
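The pieces of this output fit together arithmetically; a quick check using only the numbers printed above:

```python
# Verify the internal arithmetic of the JMP output.
ss_model, ss_error, ss_total = 1870.7788, 1305.3330, 3176.1118
df_model, df_error = 4, 104

print(ss_model / ss_total)                 # 0.589015  -> RSquare
print(ss_model / df_model)                 # 467.695   -> Mean Square (Model)
print(ss_error / df_error)                 # 12.551    -> Mean Square (Error)
print((ss_model / df_model) / (ss_error / df_error))   # 37.26 -> F Ratio
```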
Root Mean Square Error
• Estimate of \sigma_e:

  s_e = \sqrt{ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n - K - 1) }

• s_e = Root Mean Square Error in JMP
• For simple linear regression of GP1000MHwy on Weight, s_e ≈ 3.86. For multiple linear regression of GP1000MHwy on weight, horsepower, cargo, and seating, s_e ≈ 3.54.
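This is exactly the ANOVA table's SSE divided by its error degrees of freedom, then square-rooted:

```python
# Check: Root Mean Square Error from the ANOVA table above.
import math

sse, n, K = 1305.3330, 109, 4
print(math.sqrt(sse / (n - K - 1)))   # ≈ 3.5428, matching the printed 3.542778
```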
Residuals and Root Mean Square Errors

• Estimated conditional mean:
  \hat{E}(Y | X_1 = x_1, \dots, X_K = x_K) = b_0 + b_1 x_1 + \cdots + b_K x_K
• Residual for observation i = prediction error for observation i =
  Y_i - \hat{E}(Y | X_1 = x_{i1}, \dots, X_K = x_{iK}) = Y_i - b_0 - b_1 x_{i1} - \cdots - b_K x_{iK}
• Root mean square error = typical size of the absolute value of the prediction error.
• As with the simple linear regression model, if the multiple linear regression model holds:
  – About 95% of the observations will be within two RMSEs of their predicted value.
• For the car data, about 95% of the time the actual GP1000M will be within 2*3.54 = 7.08 GP1000M of the predicted GP1000M based on the car's weight, horsepower, cargo, and seating.
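This also answers design question (2) from the automobile example. A sketch using the rounded coefficients from the Parameter Estimates table above; the ±2·s_e band is the rough rule just stated and ignores the extra uncertainty from estimating the coefficients:

```python
# Predicted GP1000MHwy for the proposed design: weight 4000 lb, horsepower 200,
# cargo 18 cu ft, seating 5 adults, using the coefficients printed above.
b0, b_wt, b_hp, b_cargo, b_seat = 19.100521, 0.0040877, 0.0426999, 0.0533, 0.0268912

y_hat = b0 + b_wt * 4000 + b_hp * 200 + b_cargo * 18 + b_seat * 5
print(y_hat)                               # ≈ 45.1 gallons per 1000 highway miles
print(y_hat - 2 * 3.54, y_hat + 2 * 3.54)  # rough 95% range: ≈ (38.0, 52.2)
print(1000 / y_hat)                        # ≈ 22.2 mpg equivalent, for reference
```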
Inferences about Regression Coefficients

• Confidence intervals: A (1 - \alpha)100% confidence interval for \beta_k is b_k \pm t_{\alpha/2} s_{b_k}.
  Degrees of freedom for t equals n - (K+1). The standard error of b_k, s_{b_k}, is found on the JMP output.
• Hypothesis test:
  H_0: \beta_k = \beta_k^*  versus  H_a: \beta_k \neq \beta_k^*
  Decision rule: Reject H_0 if t \geq t_{\alpha/2} or t \leq -t_{\alpha/2}, where t = (b_k - \beta_k^*) / s_{b_k}.
• The p-value for testing H_0: \beta_k = 0 is printed in the JMP output under Prob>|t|.
Inference Examples
• Find a 95% confidence interval for the coefficient on horsepower.
• Is seating of any help in predicting gas
mileage once horsepower, weight and cargo
have been taken into account? Carry out a
test at the 0.05 significance level.
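A sketch answering both questions from the Parameter Estimates output above (scipy's t quantile stands in for a t table):

```python
# Worked inference examples for the car regression.
from scipy import stats

n, K = 109, 4
dof = n - (K + 1)                      # 104 degrees of freedom

# (1) 95% CI for the horsepower coefficient: b_k +/- t_{.025} * s_{b_k}
b_hp, se_hp = 0.0426999, 0.01567
t_crit = stats.t.ppf(0.975, dof)       # ≈ 1.983
print(b_hp - t_crit * se_hp, b_hp + t_crit * se_hp)   # ≈ (0.0116, 0.0738)

# (2) Seating: t = 0.0268912 / 0.428283 ≈ 0.06, and Prob>|t| = 0.9501 > 0.05,
# so we cannot reject H0: beta_seating = 0. Seating is of essentially no help
# once horsepower, weight, and cargo are taken into account.
print(0.0268912 / 0.428283)
```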
Partial Slopes vs. Marginal Slopes
• Multiple Linear Regression Model:
  \mu_{y|x_1,\dots,x_K} = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K
• The coefficient \beta_k is a partial slope. It indicates the change in the mean of y that is associated with a one-unit increase in x_k while holding all other variables x_1, \dots, x_{k-1}, x_{k+1}, \dots, x_K fixed.
• A marginal slope is obtained when we perform a
simple regression with only one X, ignoring all
other variables. Consequently the other variables
are not held fixed.
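A small simulation (my own illustration, not from the notes) makes the distinction concrete: when x_1 and x_2 are correlated, the marginal slope of y on x_1 absorbs part of x_2's effect, while the partial slope recovers x_1's own coefficient:

```python
# Simulation: marginal vs. partial slope under correlated predictors.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)              # x2 correlated with x1
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)    # true partial slopes: 1 and 2

marginal = np.polyfit(x1, y, 1)[0]              # ≈ 1 + 2*0.8 = 2.6 (x2 not held fixed)

X = np.column_stack([np.ones(n), x1, x2])
partial = np.linalg.lstsq(X, y, rcond=None)[0][1]   # ≈ 1.0 (x2 held fixed)

print(marginal, partial)
```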
Simple Linear Regression
Bivariate Fit of GP1000MHwy By Seating
[Scatterplot of GP1000MHwy vs. Seating with fitted line.]

Parameter Estimates
Term       Estimate   Std Error  t Ratio  Prob>|t|
Intercept  30.829816  2.277905   13.53    <.0001
Seating    1.3022488  0.442389   2.94     0.0040
Multiple Linear Regression
Response GP1000MHwy
Whole Model
Parameter Estimates
Term        Estimate   Std Error  t Ratio  Prob>|t|
Intercept   19.100521  2.098478   9.10     <.0001
Weight(lb)  0.0040877  0.001203   3.40     0.0010
Cargo       0.0533     0.013787   3.87     0.0002
Seating     0.0268912  0.428283   0.06     0.9501
Horsepower  0.0426999  0.01567    2.73     0.0075

Note the contrast: Seating's marginal slope (1.30, p = 0.0040) is large and significant, but its partial slope (0.027, p = 0.9501) is essentially zero once the other variables are held fixed.
Partial Slopes vs. Marginal Slopes Example
• In order to evaluate the benefits of a
proposed irrigation scheme in a certain
region, suppose that the relation of yield Y
to rainfall R is investigated over several
years.
• Data is in rainfall.JMP.
Bivariate Fit of Yield By Total Spring Rainfall
[Scatterplot of Yield vs. Total Spring Rainfall with fitted line.]

Linear Fit
Yield = 76.666667 - 1.6666667 Total Spring Rainfall

Summary of Fit
RSquare                     0.027778
RSquare Adj                 -0.13426
Root Mean Square Error      13.94433
Mean of Response            60
Observations (or Sum Wgts)  8

Parameter Estimates
Term                   Estimate   Std Error  t Ratio  Prob>|t|
Intercept              76.666667  40.5546    1.89     0.1076
Total Spring Rainfall  -1.666667  4.025382   -0.41    0.6932
Bivariate Fit of Average Spring Temperature By Total Spring Rainfall
[Scatterplot of Average Spring Temperature vs. Total Spring Rainfall.]
Higher rainfall is associated with lower temperature.
Multiple Linear Regression
Response Yield
Parameter Estimates
Term                        Estimate   Std Error  t Ratio  Prob>|t|
Intercept                   -144.7619  55.8499    -2.59    0.0487
Total Spring Rainfall       5.7142857  2.680238   2.13     0.0862
Average Spring Temperature  2.952381   0.692034   4.27     0.0080
Rainfall is estimated to be beneficial once temperature is held fixed.
Multiple regression provides a better picture of the benefits of an irrigation scheme, because an irrigation scheme would add water while holding temperature fixed, which is exactly what the partial slope of rainfall measures.
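The marginal and partial rainfall slopes are tied together by an exact least squares identity: marginal slope = partial slope of rainfall + (temperature coefficient) × (slope from regressing temperature on rainfall). A sketch backing out that implied temperature-on-rainfall slope from the two outputs above (the identity holds exactly for fits on the same sample):

```python
# Omitted-variable check: marginal = partial + b_temp * d, where d is the
# slope from the simple regression of Temperature on Rainfall.
marginal = -1.6666667    # slope of Yield on Rainfall alone
partial = 5.7142857      # Rainfall coefficient in the multiple regression
b_temp = 2.952381        # Temperature coefficient in the multiple regression

d = (marginal - partial) / b_temp
print(d)   # = -2.5: each extra unit of rainfall came with a 2.5-degree drop in
           # temperature, which is what masked rainfall's benefit in the simple fit
```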