Multiple Regression I
4/9/12
• Transformations
• The model
• Individual coefficients
• $R^2$
• ANOVA for regression
• Residual standard error
Section 9.4, 9.5
Professor Kari Lock Morgan
Duke University
To Do
• Project 2 Proposal (due Wednesday, 4/11)
• Homework 9 (due Monday, 4/16)
• Project 2 Presentation (Thursday, 4/19)
• Project 2 Paper (Wednesday, 4/25)
Non-Constant Variability
Non-Normal Residuals
Transformations
• If the conditions are not satisfied, there are
some common transformations you can apply
to the response variable
• You can take any function of y and use it as the
response, but the most common are
• log(y) (natural logarithm - ln)
• $\sqrt{y}$ (square root)
• $y^2$ (squared)
• $e^y$ (exponential)
log(y)
[Plots: original response, y, and logged response, log(y)]
$\sqrt{y}$
[Plots: original response, y, and square root of response, $\sqrt{y}$]
$y^2$
[Plots: original response, y, and squared response, $y^2$]
$e^y$
[Plots: original response, y, and exponentiated response, $e^y$]
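As a rough illustration of why a transformation can help, the sketch below uses made-up, noise-free data (not the course data) in which y grows exponentially with x. A straight line leaves large residuals on the original scale but fits log(y) almost exactly. The helper names `fit_line` and `max_abs_residual` are hypothetical.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def max_abs_residual(x, y):
    """Largest distance between an observed y and the fitted line."""
    b0, b1 = fit_line(x, y)
    return max(abs(yi - (b0 + b1 * xi)) for xi, yi in zip(x, y))

# Made-up data: y = 2 * exp(0.5 * x), so log(y) = log(2) + 0.5 * x.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8]
y = [2 * math.exp(0.5 * xi) for xi in x]
log_y = [math.log(yi) for yi in y]

max_resid_raw = max_abs_residual(x, y)      # line fitted to y
max_resid_log = max_abs_residual(x, log_y)  # line fitted to log(y)
b0_log, b1_log = fit_line(x, log_y)

print(f"largest residual using y:      {max_resid_raw:.3f}")
print(f"largest residual using log(y): {max_resid_log:.3f}")
print(f"slope on the log scale:        {b1_log:.3f}")
```

Real data will not line up this cleanly; the point is only that the residual plot should be rechecked after transforming the response.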
Multiple Regression
• Multiple regression extends simple linear
regression to include multiple explanatory
variables:
y  0  1x1  2 x2  ...  k xk òi
Grade on Final
• We’ll use your current grades to predict final
exam scores, based on a model from last
semester’s students
• Response: final exam score
• Explanatory: hw average, clicker average,
exam 1, exam 2
y  0  1hw  2clicker  3exam1  4exam2  
Grade on Final
What variable is the most significant predictor
of final exam score?
a) Homework average
b) Clicker average
c) Exam 1
d) Exam 2
Inference for Coefficients
The p-value for explanatory variable $x_i$ is
associated with the hypotheses
$H_0: \beta_i = 0$
$H_a: \beta_i \neq 0$
For intervals and p-values of coefficients in
multiple regression, use a t-distribution with
degrees of freedom n – k – 1, where k is the
number of explanatory variables included in
the model
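For reference, the test statistic and interval behind these p-values can be written as follows. The notation is assumed here rather than taken from the slides: $\hat{\beta}_i$ is the estimated coefficient, $SE(\hat{\beta}_i)$ is its standard error from the regression output, and $t^*$ is the critical value from the t-distribution with the degrees of freedom given above.

```latex
\[
t = \frac{\hat{\beta}_i - 0}{SE(\hat{\beta}_i)}, \qquad \text{df} = n - k - 1
\]
\[
\text{Confidence interval: } \hat{\beta}_i \pm t^* \cdot SE(\hat{\beta}_i)
\]
```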
Grade on Final
Estimate your score on the final exam.
What type of interval do you want for this
estimate?
a) Confidence interval
b) Prediction interval
Grade on Final
Estimate your score on the final exam.
(hw average is out of 10, clicker average is out of 2)
Grade on Final
Is the clicker coefficient really negative?!?
Give a 95% confidence interval for the clicker
coefficient (okay to use t* = 2).
Grade on Final
Is your score on exam 2 really not a significant
predictor of your final exam score?!?
Coefficients
• The coefficient (and significance) for each
explanatory variable depend on the other
variables in the model!
• In predicting final exam scores, if you know
someone’s score on Exam 1, it doesn’t provide
much additional information to know their score
on Exam 2 (these two explanatory variables are
highly correlated)
Grade on Final
If you take Exam 1 out of the model…
Now Exam 2 is significant!
Model with Exam 1:
Grade on Final
If you include Project 1 in the model…
Model without Project 1:
Grades
Multiple Regression
• The coefficient for each explanatory variable is the
predicted change in y for a one-unit change in x,
given the other explanatory variables in the
model!
• The p-value for each coefficient indicates whether
it is a significant predictor of y, given the other
explanatory variables in the model!
• If explanatory variables are associated with each
other, coefficients and p-values will change
depending on what else is included in the model
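The effect of correlated explanatory variables can be seen in a small sketch. The data are made up (not the course grades): y is built from x1 alone, and x2 tracks x1 closely. The helpers `solve` and `ols` are hypothetical names for a plain-Python least-squares fit via the normal equations; in practice a statistics package would do this step.

```python
def solve(a, b):
    """Solve the square linear system a * coef = b by Gauss-Jordan elimination."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + [float(v) for v in vals] for vals in zip(*columns)]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

# Made-up data for illustration only.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]        # closely tracks x1
y = [3 + 2 * v for v in x1]    # y depends on x1 only

alone = ols([x2], y)           # model with x2 only
both = ols([x1, x2], y)        # model with x1 and x2

print("x2 coefficient, x2 alone:   ", round(alone[1], 3))
print("x2 coefficient, x1 included:", round(both[2], 3))
```

On its own, x2 gets a clearly positive coefficient because it stands in for x1. Once x1 is in the model, the x2 coefficient drops to essentially zero, which mirrors what happens to Exam 2 when Exam 1 is included.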
Residuals
Are the conditions satisfied?
(a) Yes
(b) No
Evaluating a Model
• How do we evaluate the success of a
model?
• How do we determine the overall
significance of a model?
• How do we choose between two
competing models?
Variability
• One way to evaluate a model is to partition
variability
Total Variability = Variability Explained by the Model + Error Variability
• A good model “explains” a lot of the variability
in Y
Exam Scores
• Without knowing the explanatory variables, we
can say that a person’s final exam score will
probably be between 60 and 98 (the range of Y)
• Knowing hw average, clicker average, exam 1
and 2 grades, and project 1 grades, we can give a
narrower prediction interval for final exam score
• We say that some of the variability in y is
explained by the explanatory variables
• How do we quantify this?
Variability
How do we quantify variability in Y?
a) Standard deviation of Y
b) Sum of squared deviations from the
mean of Y
c) (a) or (b)
d) None of the above
Sums of Squares
Total Variability = Variability Explained by the model + Error variability
$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
SST = SSM + SSE
Variability
Total Sum of Squares:
$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$
Model Sum of Squares:
$SSM = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
Error Sum of Squares:
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
• If SSM is much higher than SSE, then the
model explains a lot of the variability in Y
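The three sums of squares can be computed directly from the observed and fitted values. The sketch below uses a small made-up data set whose least-squares line is ŷ = 2.2 + 0.6x; the function name `sums_of_squares` is hypothetical.

```python
def sums_of_squares(y, y_hat):
    """Return (SST, SSM, SSE) for observed values y and fitted values y_hat."""
    y_bar = sum(y) / len(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total
    ssm = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained by model
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # error
    return sst, ssm, sse

# Made-up data; 2.2 and 0.6 are the least-squares intercept and slope for it.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_hat = [2.2 + 0.6 * xi for xi in x]

sst, ssm, sse = sums_of_squares(y, y_hat)
print(f"SST = {sst:.2f}, SSM = {ssm:.2f}, SSE = {sse:.2f}")
print(f"SSM + SSE = {ssm + sse:.2f}")
```

The identity SST = SSM + SSE holds only when the fitted values come from a least-squares fit with an intercept, which is the case here.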
$R^2$
$R^2 = \frac{SSM}{SST} = \frac{\text{Variability in Y explained by the model}}{\text{Total variability in Y}}$
• $R^2$ is the proportion of the variability in
Y that is explained by the model
$R^2$
• For simple linear regression, $R^2$ is just
the squared correlation between X and Y
• For multiple regression, $R^2$ is the
squared correlation between the actual
values and the predicted values
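Because SST = SSM + SSE, the same proportion can also be written in terms of the error sum of squares. A short restatement, using only quantities defined in these slides:

```latex
\[
R^2 = \frac{SSM}{SST} = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST},
\qquad 0 \le R^2 \le 1
\]
```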
$R^2$
[Scatterplots: $R^2 = 0.67$ and $R^2 = 0.09$]
Final Exam Grade
Is the model significant?
• If we want to test whether the model is
significant (whether the model helps to
predict y), we can test the hypotheses:
H 0 : 1   2  ...   k  0
H a : At least one i  0
• We do this with ANOVA!
ANOVA for Regression
Source   df       Sum of Squares   Mean Square          F         p-value
Model    k        SSM              MSM = SSM/k          MSM/MSE   Use $F_{k,\,n-k-1}$
Error    n-k-1    SSE              MSE = SSE/(n-k-1)
Total    n-1      SST
k: number of explanatory variables
n: sample size
Final Exam Grade
For this model, do the explanatory variables
significantly help to predict final exam score?
(calculate a p-value).
(a) Yes
(b) No
n = 69
SSM = 3125.8
SSE = 1901.4
ANOVA for Regression
Source   df   Sum of Squares   Mean Square   F       p-value
Model    5    3125.8           625.16        20.71   ≈ 0
Error    63   1901.4           30.18
Total    68   5027.2
Final Exam Grade
Simple Linear Regression
• For simple linear regression, the following
tests will all give equivalent p-values:
  • t-test for non-zero correlation
  • t-test for non-zero slope
  • ANOVA for regression
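The equivalence can be checked numerically: with one explanatory variable, the two t statistics are identical and the ANOVA F statistic is their square, so all three give the same p-value. The sketch below uses made-up data and the standard formulas for the slope's standard error and the correlation t statistic, which are not spelled out in these slides.

```python
import math

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

slope = sxy / sxx
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]

ssm = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
mse = sse / (n - 2)   # k = 1, so error df = n - k - 1 = n - 2

# t statistic for the slope: estimate divided by its standard error.
t_slope = slope / math.sqrt(mse / sxx)

# t statistic for the correlation.
r = sxy / math.sqrt(sxx * syy)
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# ANOVA F statistic with 1 and n - 2 degrees of freedom.
f_stat = (ssm / 1) / mse

print(f"t (slope)       = {t_slope:.4f}")
print(f"t (correlation) = {t_corr:.4f}")
print(f"t squared       = {t_slope ** 2:.4f}")
print(f"F               = {f_stat:.4f}")
```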
Mean Square Error (MSE)
• Mean square error (MSE) measures the
average variability in the errors (residuals)
• The square root of MSE gives the standard
deviation of the residuals (giving a typical
distance of points from the line)
• This number is also given in the R output as
the residual standard error, and is known as s
in the textbook
Final Exam Grade
Simple Linear Model
yi  0  1 xi   i
i ~ N  0,  
Residual standard error = MSE = se
estimates the standard deviation of
the residuals (the spread of the
normal distributions around the
predicted values)
Residual Standard Error
• Use the fact that the residual standard error
is 5.494 and your predicted final exam score to
compute an approximate 95% prediction
interval for your final exam score
yˆ  2  5.494
• NOTE: This calculation only takes into account
errors around the line, not uncertainty in the line
itself, so your true prediction interval will be slightly
wider
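A minimal sketch of this calculation, assuming a predicted score of 85, which is a made-up value used only for illustration. The function name `approx_prediction_interval` is hypothetical; 5.494 is the residual standard error quoted above.

```python
def approx_prediction_interval(y_hat, resid_se, multiplier=2.0):
    """Approximate 95% prediction interval: y_hat +/- multiplier * resid_se.

    Ignores uncertainty in the fitted line, so it is slightly too narrow.
    """
    margin = multiplier * resid_se
    return y_hat - margin, y_hat + margin

# 85.0 is a hypothetical predicted final exam score.
low, high = approx_prediction_interval(85.0, 5.494)
print(f"approximate 95% prediction interval: {low:.3f} to {high:.3f}")
```

Replace 85.0 with your own predicted score from the fitted model.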
To Come…
• How do we decide which explanatory
variables to include in the model?
• How do we use categorical explanatory
variables?
• What if the coefficient of one explanatory
variable depends on the value of another
explanatory variable?