Multiple Regression

Multiple linear regression generalizes the simple linear regression model by allowing for many terms in the mean function rather than just one intercept and one slope.

1) Simple linear regression mean function: E(Y) = b0 + b1x1
2) Suppose we have a second variable X2 to predict the response.
3) The mean function will then depend on both the value of X1 and the value of X2.

The main idea in adding X2 is to explain the part of Y that has not already been explained by X1.
Multiple regression is used to:
• Predict a response variable of interest when several explanatory variables are available
• Analyze the relationship between variables while controlling for the possible effects of other variables
The model
We have p independent variables that are correlated with the dependent variable.

Y = b0 + b1x1 + b2x2 + … + bpxp + e

Here Y is the dependent variable, x1, …, xp are the independent variables, b0, b1, …, bp are the parameters, and e is the error term.
Multiple regression - Demonstration

Linear regression model with one independent variable x:
Y = b0 + b1x + e

In the multiple linear regression model there are several independent variables:
Y = b0 + b1x1 + b2x2 + e

[Figure: with two predictors X1 and X2, the regression line becomes a plane.]
Assumptions in multiple regression
1. There is a multiple linear relation between the x variables and the Y variable.
2. The observations are independent of each other.
3. The variance around the hyperplane is the same for all combinations of x values.
4. The variance around the hyperplane can be modeled with a normal distribution.

[Figure: assumptions 1, 3 and 4 illustrated on a plot of Y against X.]
Multiple Correlation (R) and R²
• The strength of association between Y and the set of explanatory variables acting together as predictors
• R = the correlation between the response values (y) and the predicted values (ŷ) from the multiple regression model
• R² = the proportion of variability in Y explained by regression on the x's.
– The larger the R², the better the prediction of y by the set of explanatory variables
Estimation of parameters and evaluation of the model
• Estimate the parameters with some computer program (SPSS, SAS, R, Minitab, …).
• Check if the model assumptions are fulfilled. How? With the help of residual properties.
• Check how well the model fits the data.
• If the model assumptions are fulfilled and the model is acceptable, use the parameter estimates for predictions.
Some “useful” statistics in regression outputs
1. Standard error (SE) of an estimated parameter = the precision of that estimator.
2. T-value: the t-statistic for testing that the coefficient is zero. T-value = estimate / its SE.
3. P-value: for testing the hypotheses
   H0: coefficient = 0
   H1: coefficient ≠ 0
   If the p-value is small, there is evidence that the coefficient is NOT zero. 'Small' is on most occasions defined as "less than 0.05". (A computational sketch follows below.)
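A minimal sketch of where such a p-value comes from, assuming a t-value of 7.53 with 5 residual degrees of freedom (numbers taken from the example later in this section):

```python
# Two-sided p-value for a regression coefficient: the probability of a
# t-statistic at least this extreme under H0 (coefficient = 0).
from scipy.stats import t

p_value = 2 * t.sf(7.53, df=5)   # ≈ 0.0007, which Minitab rounds to 0.001
```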
4. Analysis of Variance table (AOV)
• Mean Square = Sum of Squares / df
• F-statistic = Regression MS / Residual MS
• Used to test whether there is a linear relationship between any of the explanatory variables and Y:
  H0: b1 = 0, …, bp = 0 (p predictors, in this case)
  (None of the X-variables has a linear relationship with Y)
  H1: At least one bi is not zero (at least one x-variable is related to Y)
• There is a p-value for this test (see the sketch below)
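A minimal sketch of computing that p-value, assuming the F-value 28.38 with p = 2 and n - p - 1 = 5 degrees of freedom from the example that follows:

```python
# p-value of the overall F-test: the upper-tail probability of the
# F(p, n-p-1) distribution at the observed F-statistic.
from scipy.stats import f

p_value = f.sf(28.38, dfn=2, dfd=5)   # ≈ 0.002
```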
Example
• "Toulon Theatres" produces commercial advertisements in newspapers and on television.
• Question: we want to understand which kind of advertisement is the better investment.
• During 8 randomly chosen weeks we observed how much was spent on advertising (TVAdv, NewsAdv) and how much income (Revenue) was gained in return.
• Model:
  Revenue = b0 + b1TVAdv + b2NewsAdv + e
Data material

Revenue  TVAdv  NewsAdv
96       5.0    1.5
90       2.0    2.0
95       4.0    1.5
92       2.5    2.5
95       3.0    3.3
94       3.5    2.3
94       2.5    4.2
94       3.0    2.5

Numbers are in 1000 Euro.
Minitab print-out

Regression Analysis: Revenue (dependent variable) versus TVAdv and NewsAdv (explanatory variables)

The regression equation is
Revenue = 83,2 + 2,29 TVAdv + 1,30 NewsAdv

Predictor  Coef    SE Coef  T      P
Constant   83,230  1,574    52,88  0,000
TVAdv      2,2902  0,3041   7,53   0,001
NewsAdv    1,3010  0,3207   4,06   0,010

S = 0,642587  R-Sq = 91,9%  R-Sq(adj) = 88,7%

Analysis of Variance
Source          DF  SS      MS      F      P
Regression      2   23,435  11,718  28,38  0,002
Residual Error  5   2,065   0,413
Total           7   25,500
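As a cross-check, here is a minimal sketch reproducing this fit in Python with statsmodels (the library choice is ours, not the slides'; any of the programs listed earlier gives the same numbers), using the eight observations from the data table above:

```python
import pandas as pd
import statsmodels.api as sm

# The eight weeks of Toulon Theatres data (in 1000 Euro).
data = pd.DataFrame({
    "Revenue": [96, 90, 95, 92, 95, 94, 94, 94],
    "TVAdv":   [5.0, 2.0, 4.0, 2.5, 3.0, 3.5, 2.5, 3.0],
    "NewsAdv": [1.5, 2.0, 1.5, 2.5, 3.3, 2.3, 4.2, 2.5],
})

# Fit Revenue = b0 + b1*TVAdv + b2*NewsAdv + e by ordinary least squares.
X = sm.add_constant(data[["TVAdv", "NewsAdv"]])
fit = sm.OLS(data["Revenue"], X).fit()
print(fit.summary())  # coefficients, t-values, R-Sq and the ANOVA F match the print-out
```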
Minitab print-out

[Figure: Residual Plots for Revenue, four panels: Normal Probability Plot, Residuals Versus Fits, Histogram of the Residuals, Residuals Versus Order.]
Assumptions ok?

[Figure: Residual Plots for Revenue, annotated with the assumption each panel checks.]
• Normal Probability Plot and Histogram: normal distribution.
• Versus Fits: constant variance.
• Versus Order: independence (when plotting the observations by time).
Coefficient of determination, R²

Regression Analysis: Revenue versus TVAdv; NewsAdv

The regression equation is
Revenue = 83,2 + 2,29 TVAdv + 1,30 NewsAdv

Predictor  Coef    SE Coef  T      P
Constant   83,230  1,574    52,88  0,000
TVAdv      2,2902  0,3041   7,53   0,001
NewsAdv    1,3010  0,3207   4,06   0,010

S = 0,642587  R-Sq = 91,9%  R-Sq(adj) = 88,7%

Analysis of Variance
Source          DF  SS            MS      F      P
Regression      2   23,435 (SSR)  11,718  28,38  0,002
Residual Error  5   2,065 (SSE)   0,413
Total           7   25,500 (SST)

R² = SSR/SST = 1 - SSE/SST
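As a quick check against the table: R² = SSR/SST = 23.435/25.500 ≈ 0.919, which matches the reported R-Sq = 91.9%.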
Adjusted R²

Comparing regression models
• Overall predictive fit → R²
• It is possible to show that R² always increases if we add more x-variables, but adjusted R² decreases if the new x-variable is only weakly related to Y.
• One drawback with R²: more variables → an increase in R².
• Solution: adjusted R². To compare models with a different number of predictors, use Adj-R².
Regression Analysis: Revenue versus TVAdv; NewsAdv

The regression equation is
Revenue = 83,2 + 2,29 TVAdv + 1,30 NewsAdv

Predictor  Coef    SE Coef  T      P
Constant   83,230  1,574    52,88  0,000
TVAdv      2,2902  0,3041   7,53   0,001
NewsAdv    1,3010  0,3207   4,06   0,010

S = 0,642587  R-Sq = 91,9%  R-Sq(adj) = 88,7%

Analysis of Variance
Source          DF  SS      MS      F      P
Regression      2   23,435  11,718  28,38  0,002
Residual Error  5   2,065   0,413
Total           7   25,500

R²a (adjusted R²) = 1 - (SSE/(n-p-1)) / (SST/(n-1)) = 1 - MSE/MST
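As a quick check against the table: R²a = 1 - MSE/MST = 1 - 0.413/(25.500/7) ≈ 0.887, which matches the reported R-Sq(adj) = 88.7%.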
• ANOVA table: are any of the explanatory variables statistically related to y?
  The test statistic is called F and is F-distributed with p and n-p-1 degrees of freedom.

Analysis of Variance
Source          DF           SS            MS                          F      P
Regression      2 (= p)      23,435 (SSR)  11,718 (MSR = SSR/p)        28,38  0,002
Residual Error  5 (= n-p-1)  2,065 (SSE)   0,413 (MSE = SSE/(n-p-1))
Total           7            25,500

Here F = Fobs = MSR/MSE, and the P column gives the p-value of the test. Conclusion?
• We can reject H0.
• We have enough evidence to state that at least one x-variable is linearly related to Y.
• But which x-variable (or variables)?
• We need to look at the p-value for each regression coefficient:
  H0: bi = 0 (Xi is not related to Y)
  H1: bi ≠ 0 (Xi is related to Y)
Minitab print-out

Regression Analysis: Revenue versus TVAdv; NewsAdv

The regression equation is
Revenue = 83,2 + 2,29 TVAdv + 1,30 NewsAdv

Predictor  Coef    SE Coef  T      P
Constant   83,230  1,574    52,88  0,000
TVAdv      2,2902  0,3041   7,53   0,001
NewsAdv    1,3010  0,3207   4,06   0,010

S = 0,642587  R-Sq = 91,9%  R-Sq(adj) = 88,7%

Analysis of Variance
Source          DF  SS      MS      F      P
Regression      2   23,435  11,718  28,38  0,002
Residual Error  5   2,065   0,413
Total           7   25,500
Interpretation of the parameter estimates
• b0 = 83.23 is the intercept. This is the expected income if no money is spent on advertising.
• b1 = 2.290. For each 1000 EUR we spend on television advertising, the income increases by 2290 EUR, given that the other x-variables (in this case, newspaper advertising) are held constant.
• b2 = 1.301. For each 1000 EUR we spend on newspaper advertising, the income increases by 1301 EUR, given that the other x-variables (in this case, TV advertising) are held constant.
• We can use the model to predict income, given how much money we spend on television and newspaper advertising.
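As a sketch of such a prediction, take a hypothetical week (not one of the eight observed) with 3.5 thousand EUR spent on TV and 2.3 on newspaper advertising:

```python
# Plug the hypothetical budgets into the fitted equation.
revenue_hat = 83.23 + 2.2902 * 3.5 + 1.3010 * 2.3   # ≈ 94.24 thousand EUR
```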
Multiple regression with nominal variables
Regression with quantitative and categorical predictors

An indicator (dummy) variable for a property A:
xk = 1 if property A; xk = 0 otherwise

Example Johansson:
The company needs a model to predict “repair time”.
Y = repair time (hours)
x = time since the last repair (months)
z = type of repair (0 if mechanical; 1 if electrical); z is the nominal variable

Y = b0 + b1x + b2z + e
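A minimal sketch of creating such an indicator in practice, assuming pandas and a hypothetical column "type" holding the repair categories:

```python
import pandas as pd

# Hypothetical repair records; only the "type" column matters here.
df = pd.DataFrame({"type": ["mechanical", "electrical", "mechanical"]})

# z = 1 for electrical repairs and z = 0 for mechanical ones.
df["z"] = (df["type"] == "electrical").astype(int)
```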
Nominal explanatory variable (z) with two categories

[Figure: two parallel regression lines in the (x, Y) plane; the line for z = 0 has intercept b0 and the line for z = 1 has intercept b0 + b2.]

A regression with one nominal (z) and one quantitative (x) variable:
E(y) = b0 + b1x + b2z

A regression with one nominal (z) and two quantitative (x1 and x2) variables:
E(y) = b0 + b1x1 + b2x2 + b3z
Example ”Johansson”

Regression Analysis: ”Repair Time” by ”Months” and ”Type”

The regression equation is
Time = 0,930 + 0,388 Months + 1,26 Type

Predictor  Coef     SE Coef  T     P
Constant   0,9305   0,4670   1,99  0,087
Months     0,38762  0,06257  6,20  0,000
Type       1,2627   0,3141   4,02  0,005

S = 0,459048  R-Sq = 85,9%  R-Sq(adj) = 81,9%

Analysis of Variance
Source          DF  SS       MS      F      P
Regression      2   9,0009   4,5005  21,36  0,001
Residual Error  7   1,4751   0,2107
Total           9   10,4760
Example ”Johansson”

[Figure: Residual Plots for Time, four panels: Normal Probability Plot, Versus Fits, Histogram, Versus Order.]
Example ”Johansson”

Regression Analysis: ”Repair Time” by ”Months” and ”Type”

The regression equation is
Time = 0,930 + 0,388 Months + 1,26 Type

Predictor  Coef     SE Coef  T     P
Constant   0,9305   0,4670   1,99  0,087
Months     0,38762  0,06257  6,20  0,000
Type       1,2627   0,3141   4,02  0,005

S = 0,459048  R-Sq = 85,9%  R-Sq(adj) = 81,9%

Analysis of Variance
Source          DF  SS       MS      F      P
Regression      2   9,0009   4,5005  21,36  0,001
Residual Error  7   1,4751   0,2107
Total           9   10,4760

• The F-test shows that at least one explanatory variable is related to y.
• Both b1 and b2 are significantly different from zero; both x-variables help to predict Y.
• b0 is not significantly different from zero.
• High coefficient of determination (R²).
Interpretation of the parameter estimates

Regression Analysis: Time versus Months; Type

The regression equation is
Time = 0,930 + 0,388 Months + 1,26 Type

Predictor  Coef     SE Coef  T     P
Constant   0,9305   0,4670   1,99  0,087
Months     0,38762  0,06257  6,20  0,000
Type       1,2627   0,3141   4,02  0,005

• b0 = 0.93: the expected repair time in hours (56 min) for a mechanical repair of a facility that has not been repaired before. However, this interpretation would only be applicable if b0 were statistically different from zero (p-value = 0.087).
• b1 = 0.39: for each month without service, the mean repair time increases by 0.39 hours (23 min).
• b2 = 1.26: if the repair is electrical, the mean repair time increases by 1.26 hours (1 hour 16 min).
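As a sketch, the predicted repair time for a hypothetical electrical repair four months after the last service:

```python
# Fitted equation with Months = 4 and Type = 1 (electrical).
time_hat = 0.9305 + 0.38762 * 4 + 1.2627   # ≈ 3.74 hours
```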
Example ”Johansson”

[Figure: Scatterplot of Time vs Months, with each point labeled mechanical or electrical.]
Nominal explanatory variables with more than two categories

Example: z = type of repair (mechanical / electrical / computer-aided)

A nominal variable with k categories is represented by k-1 dummy variables (a coding sketch follows below):
x2 = 1 if mechanical; x2 = 0 otherwise
x3 = 1 if electrical; x3 = 0 otherwise
x2 = x3 = 0 if computer-aided

A regression with one nominal (3 categories) and one interval (x1) variable:
E(y) = b0 + b1x1 + b2x2 + b3x3

[Figure: three parallel lines in the (x1, Y) plane; the intercept is b0 for x2 = x3 = 0, b0 + b2 for x2 = 1 and x3 = 0, and b0 + b3 for x2 = 0 and x3 = 1.]
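A minimal sketch of building the k-1 dummies with pandas, assuming a hypothetical column "type" holding the three repair categories:

```python
import pandas as pd

df = pd.DataFrame({"type": ["mechanical", "electrical", "computer-aided"]})

# drop_first=True keeps k-1 = 2 indicator columns; the first category in
# sorted order ("computer-aided") is dropped and becomes the baseline.
dummies = pd.get_dummies(df["type"], drop_first=True)
```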
Problems in multiple regression
Collinearity

When two or more predictor variables are highly correlated, it is difficult to separate their effects on the response.
• Sometimes called multicollinearity
• Increases the standard errors of the estimated coefficients (the b's)
This happens if two x-variables are indicators of a common feature.
• Example 1: x1 = length in cm and x2 = length in inches
• Example 2: x1 = household income and x2 = income of the person in the household with the highest salary
Consequences of multicollinearity
• The estimators of the regression parameters get large variances.
• No robustness: the estimates of the regression parameters change a lot if there are minor changes in the observations, or if an observation is removed or added.
• In some cases the F-test (in the ANOVA table) is significant but none of the t-tests (for the parameter estimates) are significant.
Variance Inflation Factor (VIF)
• The VIF of a variable tells us how much the variance of that variable's estimated coefficient increases by having the other predictors in the model.
• Suppose we have predictors X1, X2, …, Xp.
• Regress Xj on the other p-1 predictors and calculate R² for this model; this R²j measures how well Xj can be predicted by the other Xs.
• VIF of Xj = 1 / (1 - R²j)
• High values of R²j → high VIF
• Rule of thumb: VIF > 10 is of concern (a computational sketch follows below)
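A minimal computational sketch with statsmodels, assuming the predictors sit in a DataFrame with hypothetical values (the real house data is not shown on the slides):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical, strongly correlated predictors for illustration only.
df = pd.DataFrame({
    "income":      [30, 40, 55, 60, 80, 95, 110],
    "family_size": [2, 2, 3, 3, 4, 4, 5],
})

X = sm.add_constant(df[["income", "family_size"]])
# One VIF per predictor; the constant column is skipped.
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
```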
How to discover multicollinearity?
• Correlation matrix of the x-variables.

Example:
Y = house living space
x1 = income
x2 = family size

Pearson correlation coefficient (income & family size) = 0.978, p-value < 0.001
→ high correlation between the explanatory variables (Xs)
Regression Analysis: House living space by ”income” and ”family size”

The regression equation is
House living space = -11,5 + 0,568 income + 3,4 Family size

Predictor    Coef    SE Coef  T      P      VIF
Constant     -11,52  70,45    -0,16  0,878
Income       0,5681  0,5504   1,03   0,360  22,750
Family size  3,43    17,41    0,20   0,853  22,750

S = 16,0924  R-Sq = 89,5%  R-Sq(adj) = 84,3%

Analysis of Variance
Source          DF  SS      MS      F      P
Regression      2   8849,9  4424,9  17,09  0,011
Residual Error  4   1035,9  259,0
Total           6   9885,7

The F-test is significant but none of the t-tests are!
VIF = 22.75 > 10!
Solutions when multicollinearity exists
• Remove those variables that explain a small portion of the remaining unexplained variation in y.
• Construct a summary index by combining the responses of correlated variables (e.g., instead of including both "weight" and "height" in the model, use "BMI").
Example with dummy variable
State finance in war and peace
• We want to examine whether the public purchase of premium bonds (x) is related to the national income (Y).
• Data: yearly observations of the variables in Canada from 1933 to 1949.
Observations

year  y     x     D    year  y     x     D
1933  2,6   2,4   0    1942  8,9   8,1   1
1934  3,0   2,8   0    1943  9,7   8,8   1
1935  3,6   3,1   0    1944  10,2  9,6   1
1936  3,7   3,4   0    1945  10,1  9,7   1
1937  3,8   3,9   0    1946  7,9   9,6   0
1938  4,1   4,0   0    1947  8,7   10,4  0
1939  4,4   4,2   0    1948  9,1   12,0  0
1940  7,1   5,1   1    1949  10,1  12,9  0
1941  8,0   6,3   1
Dummy variable
• D is a dummy variable:
  D = 1 if Canada is at war; 0 if Canada is at peace.
[Figure: Scatterplot of y vs x, points marked by D = 0 or 1.]
Regression Analysis: y versus x (without the dummy)

The regression equation is
y = 1,57 + 0,759 x

Predictor  Coef     SE Coef  T     P
Constant   1,5698   0,6337   2,48  0,026
x          0,75936  0,08307  9,14  0,000

S = 1,15623  R-Sq = 84,8%  R-Sq(adj) = 83,8%

Analysis of Variance
Source          DF  SS      MS      F      P
Regression      1   111,71  111,71  83,56  0,000
Residual Error  15  20,05   1,34
Total           16  131,76
[Figure: Scatterplot of y vs x, each point labeled with its D value (0 or 1).]
[Figure: Residual Plots for y (model without the dummy): Normal Probability Plot of the Residuals, Residuals Versus the Fitted Values, Histogram of the Residuals, Residuals Versus the Order of the Data.]
Residuals vs. estimated values

[Figure: Scatterplot of RESI1 vs FITS1 (residuals against fitted values for the model without the dummy), points marked by D = 0 or 1.]
Regression Analysis: y versus x; D (with the dummy)

The regression equation is
y = 1,29 + 0,681 x + 2,30 D

Predictor  Coef     SE Coef  T      P
Constant   1,2897   0,1155   11,16  0,000
x          0,68141  0,01549  43,99  0,000
D          2,3044   0,1094   21,06  0,000

S = 0,209367  R-Sq = 99,5%  R-Sq(adj) = 99,5%

Analysis of Variance
Source          DF  SS       MS      F        P
Regression      2   131,145  65,573  1495,92  0,000
Residual Error  14  0,614    0,044
Total           16  131,759
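A minimal sketch reproducing the with-dummy fit in statsmodels, using the 17 Canadian observations (1933 to 1949) from the data table above:

```python
import pandas as pd
import statsmodels.api as sm

# y = national income, x = premium bond purchases, D = 1 in the war years.
data = pd.DataFrame({
    "y": [2.6, 3.0, 3.6, 3.7, 3.8, 4.1, 4.4, 7.1, 8.0,
          8.9, 9.7, 10.2, 10.1, 7.9, 8.7, 9.1, 10.1],
    "x": [2.4, 2.8, 3.1, 3.4, 3.9, 4.0, 4.2, 5.1, 6.3,
          8.1, 8.8, 9.6, 9.7, 9.6, 10.4, 12.0, 12.9],
    "D": [0, 0, 0, 0, 0, 0, 0, 1, 1,
          1, 1, 1, 1, 0, 0, 0, 0],
})

X = sm.add_constant(data[["x", "D"]])
fit = sm.OLS(data["y"], X).fit()
print(fit.params)   # ≈ 1.29, 0.681, 2.30, matching the print-out
```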
[Figure: Scatterplot of y vs x, points marked by D = 0 or 1.]
[Figure: Residual Plots for y (model with the dummy): Normal Probability Plot of the Residuals, Residuals Versus the Fitted Values, Histogram of the Residuals, Residuals Versus the Order of the Data.]
Residuals vs. estimated values

[Figure: Scatterplot of RESI2 vs FITS2 (residuals against fitted values for the model with the dummy), points marked by D = 0 or 1.]