Multiple Linear Regression
uses 2 or more predictors
General form:
z y  b1zx1  b2 z x2  b3z x3  .... bn zxn
Let us take the simplest multiple regression case, two predictors:
$z_y = b_1 z_{x_1} + b_2 z_{x_2}$
Here, the b's are not simply cor(x1, y) and cor(x2, y), unless
x1 and x2 have zero correlation with one another. Any correlation between x1 and x2 makes determining the b's less simple.
The b's are related to the partial correlation, in which the
value of the other predictor(s) is held constant. Holding the other
predictors constant removes the part of the correlation that is due
to those other predictors rather than to the predictor at hand.
Notation: the partial correlation of y with x1, with x2 held
constant, is written $r_{y,x_1.x_2}$
z y  b1z x1  b2 z x2
For 2 (or any n) predictors, there are 2 (or any n) equations
in 2 (or any n) unknowns to be solved simultaneously.
When n > 3 or so, determinant operations are necessary.
For the case of 2 predictors, and using z values (variables
standardized by subtracting their mean and then dividing
by the standard deviation) for simplicity, the solution can
be done by hand. The two equations to be solved
simultaneously are:
$b_{1.2} + b_{2.1}\,r_{x_1,x_2} = r_{y,x_1}$
$b_{1.2}\,r_{x_1,x_2} + b_{2.1} = r_{y,x_2}$
The goal is to find the two b coefficients, b1.2 and b2.1.
Example of a multiple regression problem with two predictors
The number of Atlantic hurricanes between June and November
is slightly predictable 6 months in advance (in early December)
using several precursor atmospheric and oceanic variables. Two
variables used are (1) 500 millibar geopotential height in November in the polar North Atlantic (67.5°N-85°N latitude, 10°E-50°W
longitude); and (2) sea level pressure in November in the north
tropical Pacific (7.5°N-22.5°N latitude, 125°W-175°W longitude).
[Figure: location of the two long-lead Atlantic hurricane predictor regions (500 mb height region and SLP region). Source: http://www.cdc.noaa.gov/map/images/sst/sst.anom.month.gif]
Physical reasoning behind the two predictors:
(1) 500 millibar geopotential height in November in the polar
north Atlantic. High heights are associated with a negative
North Atlantic Oscillation (NAO) pattern, tending to associate
with a stronger thermohaline circulation, and also tending to be
followed by weaker upper atmospheric westerlies and weaker
low-level trade winds in the tropical Atlantic the following
hurricane season. All of these favor hurricane activity.
(2) sea level pressure in November in the north tropical Pacific.
High pressure in this region in winter tends to be followed by
La Niña conditions in the coming summer and fall, which favor
easterly Atlantic wind anomalies aloft, and hurricane activity.
First step: find the "regular" correlations among all the variables
(x1, x2, y): cor(x1,y), cor(x2,y), cor(x1,x2)
x1: polar North Atlantic 500 millibar height
x2: north tropical Pacific sea level pressure
cor(Atlantic 500 mb, hurricanes) = 0.20   (x1, y)
cor(Pacific SLP, hurricanes) = 0.40   (x2, y)
cor(Atlantic 500 mb, Pacific SLP) = 0.30   (x1, x2: one predictor vs. the other)
Simultaneous equations to be solved:
$b_{1.2} + (0.30)\,b_{2.1} = 0.20$
$(0.30)\,b_{1.2} + b_{2.1} = 0.40$
Solution: multiply the first equation by 3.333 (i.e., 1/0.30), then
subtract the second equation from the first. This gives
$(3.033)\,b_{1.2} = 0.267$
So b1.2 = 0.088, and substituting back gives b2.1 = 0.374.
The regression equation is $z_y = 0.088\,z_{x_1} + 0.374\,z_{x_2}$
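As a cross-check, the same coefficients can be found numerically. A minimal Python sketch, with numpy's linear solver standing in for the hand elimination above:

    import numpy as np

    # Correlations among the variables (from the hurricane example)
    r12 = 0.30                      # cor(x1, x2)
    r_y = np.array([0.20, 0.40])    # cor(y, x1) and cor(y, x2)

    # Normal equations in correlation form: R_xx @ b = r_y
    R_xx = np.array([[1.0, r12],
                     [r12, 1.0]])
    b = np.linalg.solve(R_xx, r_y)
    print(b)                        # approximately [0.088, 0.374]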
The multiple correlation coefficient R is the correlation between the
predicted y and the actual y using multiple regression.
$R = \sqrt{b_{1.2}\,r_{y,x_1} + b_{2.1}\,r_{y,x_2}}$
In the example above,
$R = \sqrt{(0.088)(0.20) + (0.374)(0.40)} = 0.409$
Note that this is only very slightly better than using the second
predictor alone in simple regression. This is not surprising,
since the first predictor’s total correlation with y is only
0.2, and it is correlated 0.3 with the second predictor, so
that the second predictor already accounts for some of what
the first predictor has to offer. A decision would probably
be made concerning whether it is worth the effort to include
the first predictor for such a small gain. Note: the multiple
correlation can never decrease when more predictors are added.
Multiple R is usually inflated somewhat compared with
the true relationship, since additional predictors fit
the accidental variations found in the test sample.
Adjustment (decrease) of R for the existence of multiple
predictors gives a less biased estimate of R:
$\text{Adjusted } R = \sqrt{\dfrac{R^2(n-1) - k}{n - k - 1}}$
where n = sample size and k = number of predictors.
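A quick Python illustration; the sample size n = 50 here is our assumption for the sake of the example, not a value given in the hurricane problem:

    import numpy as np

    def adjusted_R(R, n, k):
        # Adjusted R = sqrt( (R^2 (n-1) - k) / (n - k - 1) )
        return np.sqrt((R**2 * (n - 1) - k) / (n - k - 1))

    print(adjusted_R(0.409, 50, 2))   # ≈ 0.36, noticeably below the raw R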
Sampling variability of a simple (x, y) correlation coefficient
around zero when population correlation is zero is approximately
$\text{StError(zero correl)} \approx \dfrac{1}{\sqrt{n-1}}$
In multiple regression the same approximate relationship
holds except that n must be further decreased by the
number of predictors additional to the first one.
If the number of predictors (x’s) is denoted by k, then
the sampling variability of R around zero, when there is
no true relationship with any of the predictors, is given by
$\text{StError(zero correl)} \approx \dfrac{1}{\sqrt{n-k}}$
It is easier to get a given multiple correlation by chance as
the number of predictors increases.
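A small simulation makes this concrete. The sketch below fits regressions to pure noise, so any multiple R it finds arises by chance alone; the sample size and predictor counts are illustrative choices, not from the lecture:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    for k in (1, 2, 5, 10):
        Rs = []
        for _ in range(1000):
            X = rng.standard_normal((n, k))
            y = rng.standard_normal(n)      # y is unrelated to X by construction
            A = np.column_stack([np.ones(n), X])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            # Multiple R = correlation between fitted and actual y
            Rs.append(np.corrcoef(A @ coef, y)[0, 1])
        print(k, round(np.mean(Rs), 3))     # mean chance R grows with k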
Partial correlation is the correlation between y and x1 when a
variable x2 is not allowed to vary. Example: in an elementary school, reading ability (y) is highly correlated with
the child's weight (x1). But both y and x1 are really caused
by something else: the child's age (call it x2). What would the
correlation be between weight and reading ability if age
were held constant? (Would it drop down to zero?)
$r_{y,x_1.x_2} = \dfrac{r_{y,x_1} - r_{y,x_2}\,r_{x_1,x_2}}{\sqrt{(1 - r_{y,x_2}^2)(1 - r_{x_1,x_2}^2)}}$

$b_1 = r_{y,x_1.x_2}\,\dfrac{\text{StErrorEst}_{y,x_2}}{\text{StErrorEst}_{x_1,x_2}}$
A similar set of equations exists for the second predictor.
Suppose the three correlations are:
reading vs. weight: $r_{y,x_1} = 0.66$
reading vs. age: $r_{y,x_2} = 0.82$
weight vs. age: $r_{x_1,x_2} = 0.71$
The two partial correlations come out to be:
$r_{y,x_1.x_2} = 0.193$
$r_{y,x_2.x_1} = 0.664$
Finally, the two regression weights turn out to be:
$b_1 = 0.157$ and $b_2 = 0.709$, with multiple correlation $R = 0.827$
Weight is seen to be a minor factor compared with age.
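All of these numbers can be reproduced from the three input correlations alone. A minimal Python sketch:

    import numpy as np

    r_y1, r_y2, r_12 = 0.66, 0.82, 0.71   # reading~weight, reading~age, weight~age

    # Partial correlations, holding the third variable constant
    r_y1_2 = (r_y1 - r_y2 * r_12) / np.sqrt((1 - r_y2**2) * (1 - r_12**2))
    r_y2_1 = (r_y2 - r_y1 * r_12) / np.sqrt((1 - r_y1**2) * (1 - r_12**2))

    # Standardized regression weights from the normal equations
    b = np.linalg.solve(np.array([[1, r_12], [r_12, 1]]),
                        np.array([r_y1, r_y2]))
    R = np.sqrt(b @ np.array([r_y1, r_y2]))

    print(r_y1_2, r_y2_1)   # ≈ 0.193, 0.664
    print(b, R)             # ≈ [0.157, 0.709], R ≈ 0.827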
Another Example – Sahel Drying Trend
Suppose 50 years of climate data suggest that the drying of the
Sahel in northern Africa in July to September may be related
both to warming in the tropical Atlantic and Indian oceans
(x1) as well as local changes in land use in the Sahel itself (x2).
x1 is expressed as SST, and x2 is expressed as percentage
vegetation decrease (expressed as a positive percentage) from
the vegetation found at the beginning of the 50 year period.
While both factors appear related to the downward trend in
rainfall, the two predictors are somewhat correlated with one
another. Suppose the correlations come out as follows:
Cor(y,x1)= -0.52 Cor(y,x2)= -0.37 Cor(x1,x2)= 0.50
What would be the multiple regression equation in “unit-free”
standard deviation (z) units?
First we set up the two equations to be solved simultaneously:
$b_{1.2} + b_{2.1}\,r_{x_1,x_2} = r_{y,x_1}$
$b_{1.2}\,r_{x_1,x_2} + b_{2.1} = r_{y,x_2}$

$b_{1.2} + (0.50)\,b_{2.1} = -0.52$
$(0.50)\,b_{1.2} + b_{2.1} = -0.37$
We want to eliminate (cancel) either b1.2 or b2.1. To eliminate b2.1,
multiply the first equation by 2 and subtract the second from it:
$1.5\,b_{1.2} = -0.67$
so $b_{1.2} = -0.447$, and then $b_{2.1} = -0.147$
The regression equation is $z_y = -0.447\,z_{x_1} - 0.147\,z_{x_2}$
If we want to express the above equation in physical units, we
must know the means and standard deviations of y, x1, and x2,
and make substitutions to replace the z's.
y  y  z y SDy
z y  ( y  y) / SDy
x1  x1  z x1SDx1
z x1  ( x1  x1) / SDx1
x2  x2  zx2SDx2
z x2  ( x2  x2 ) / SDx2
When we substitute and simplify, y, x1, and x2 terms appear
instead of z terms. There will generally also be a constant
term that is not found in the z expression, because the original
variables usually do not have means of 0 the way z's always do.
The means and the standard deviations of the three data sets are:
y: Jul-Aug-Sep Sahel rainfall (mm): mean 230 mm, SD 88 mm
x1: tropical Atlantic/Indian ocean SST: mean 28.3°C, SD 1.7°C
x2: deforestation (percent of initial): mean 34%, SD 22%
$z_y = -0.447\,z_{x_1} - 0.147\,z_{x_2}$

$\dfrac{y - \bar{y}}{SD_y} = -0.447\,\dfrac{x_1 - \bar{x}_1}{SD_{x_1}} - 0.147\,\dfrac{x_2 - \bar{x}_2}{SD_{x_2}}$

$\dfrac{y - 230}{88} = -0.447\,\dfrac{x_1 - 28.3}{1.7} - 0.147\,\dfrac{x_2 - 34}{22}$
After simplification, the final form will be:
$y = b_1 x_1 + b_2 x_2 + \text{constant}$ (here, both coefficients < 0)
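The conversion can be sketched in Python: each physical-unit coefficient is the z-unit coefficient times SD_y / SD_x, and the constant absorbs the means. (The values printed below follow from the statistics above; they are not quoted from the lecture.)

    # Convert standardized (z-unit) coefficients to physical units
    SDy, SDx1, SDx2 = 88.0, 1.7, 22.0
    ybar, x1bar, x2bar = 230.0, 28.3, 34.0
    bz1, bz2 = -0.447, -0.147

    b1 = bz1 * SDy / SDx1               # mm of rainfall per degree C of SST
    b2 = bz2 * SDy / SDx2               # mm of rainfall per percent deforestation
    const = ybar - b1 * x1bar - b2 * x2bar

    print(b1, b2, const)                # ≈ -23.1, -0.59, 904.8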
We now compute the multiple correlation R, and the
standard error of estimate for the multiple regression.
Using the two individual correlations and the b terms:
Cor(x1,y) = -0.52, Cor(x2,y) = -0.37, Cor(x1,x2) = 0.50
The regression equation is $z_y = -0.447\,z_{x_1} - 0.147\,z_{x_2}$
$R = \sqrt{b_{1.2}\,r_{y,x_1} + b_{2.1}\,r_{y,x_2}}$
$R = \sqrt{(0.447)(0.52) + (0.147)(0.37)} = 0.535$
The deforestation factor helps the prediction accuracy only
slightly. If there were less correlation between the two
predictors, then the second predictor would be more valuable.
$\text{Standard error of estimate} = \sqrt{1 - R^2_{y,(x_1 x_2)}} = 0.845$
In physical units it is (0.845)(88 mm) = 74.3 mm.
Let us evaluate the significance of the multiple correlation
of 0.535. How likely could it have arisen by chance alone?
First we find the standard error of samples of 50 drawn from
a population having no correlations at all, using 2 predictors:
$\text{StError(zero correl)} \approx \dfrac{1}{\sqrt{n-k}}$
For n = 50 and k = 2 we get $\dfrac{1}{\sqrt{50-2}} = 0.144$
For a 2-sided z test at the 0.05 level, we need 1.96(0.144) = 0.28.
This is easily exceeded, suggesting that the combination of the
two predictors (SST and deforestation) does have an impact on
Sahel summer rainfall. (Using SST alone in simple regression, with
cor = 0.52, would have given nearly the same level of significance.)
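The same check as a Python sketch (1.96 is the usual two-sided 5% critical z value):

    import numpy as np

    n, k, R = 50, 2, 0.535
    stderr = 1 / np.sqrt(n - k)      # ≈ 0.144
    threshold = 1.96 * stderr        # ≈ 0.28 at the 0.05 level, two-sided
    print(R > threshold)             # True: R is unlikely to be chance alone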
Example problem using this regression equation:
Suppose that a climate change model predicts that in year
2050, the SST in the tropical Atlantic and Indian oceans
will be 2.4 standard deviations above the means given for
the 50-year period of the preceding problem. (It is now
about 1.6 standard deviations above that mean.) Assume
that land use practices (percentage deforestation) will be
the same as they are now, which is 1.3 standard deviations
above the mean. Under this scenario, using the multiple
regression relationship above, how many standard deviations
away from the mean will Jul-Aug-Sep Sahel rainfall be,
and what seasonal total rainfall does that correspond to?
The problem can be solved either in physical units or in standard
deviation units, and then the answer can be expressed in either (or
both) kinds of units afterward.
If solved in physical units, the values of the two predictors in SD
units (2.4 and 1.3) can be converted to raw units using the means
and standard deviations of the variables provided previously,
and the raw-units form of the regression equation would be used.
If solved in SD units, the simpler equation can be used:
$z_y = -0.447\,z_{x_1} - 0.147\,z_{x_2}$
The z's of the two predictors, according to the scenario given, will
be 2.4 and 1.3, respectively. Then
$z_y = -0.447(2.4) - 0.147(1.3) = -1.264$
This is how many SDs away from the mean the rainfall would be. Since
the rainfall mean and SD are 230 and 88 mm, respectively, the actual
amount predicted is 230 - 1.264(88) = 230 - 111.2 = 118.8 mm.
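The same arithmetic as a Python sketch:

    # Predicted Sahel rainfall under the 2050 scenario
    z_x1, z_x2 = 2.4, 1.3
    z_y = -0.447 * z_x1 - 0.147 * z_x2   # ≈ -1.264 SDs from the mean
    rain_mm = 230 + z_y * 88             # ≈ 118.8 mm
    print(z_y, rain_mm)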
Collinearity
When the predictors are highly correlated with one another
in multiple regression, a condition of collinearity exists. When
this happens, the coefficients of two highly correlated
predictors may have opposing signs, even when each of them
has the same sign of simple correlation with the predictand.
(Such opposing signed coefficients minimize squared errors.)
The issues and problems with this are that (1) it is counterintuitive,
and (2) the coefficients are very unstable, such that if one
more sample is added to the data, they may change drastically.
When collinearity exists, the multiple regression formula
will often still provide useful and accurate predictions. To
eliminate collinearity, predictors that are highly correlated
can be combined into a single predictor.
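A small simulation illustrates the coefficient instability. The data here are synthetic and purely illustrative, not from the lecture; x2 is constructed to be nearly a copy of x1:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 30
    for trial in range(3):
        x1 = rng.standard_normal(n)
        x2 = x1 + 0.05 * rng.standard_normal(n)   # x2 nearly identical to x1
        y = x1 + rng.standard_normal(n)           # y relates positively to both
        A = np.column_stack([np.ones(n), x1, x2])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        # The intercept is stable, but the x1 and x2 weights swing wildly
        # from sample to sample, often taking opposite signs
        print(coef)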
Overfitting
When too many predictors are included in a multiple
regression equation, random correlations between the
variations of y (the predictand) and one of the predictors
are “explained” by the equation. Then when the equation
is used on independent (e.g. future) predictions, the
results are worse than expected.
Overfitting and colinearity are two different issues.
Overfitting is more serious, since it is “deceptive”.
To reduce the effects of overfitting, cross-validation can be used
(a code sketch follows this list):
--withhold one or more cases when forming the equation,
then predict those cases; rotate the cases withheld
--withhold part of the period when forming the equation,
then predict that part of the period.
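A leave-one-out cross-validation sketch in Python; the function name and array arguments are ours, not from the lecture:

    import numpy as np

    def loo_cv_predictions(X, y):
        # Leave-one-out: refit the regression n times, each time predicting
        # the single withheld case from an equation formed without it
        n = len(y)
        A = np.column_stack([np.ones(n), X])
        preds = np.empty(n)
        for i in range(n):
            keep = np.arange(n) != i
            coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
            preds[i] = A[i] @ coef
        return preds

    # Correlating these predictions with y gives a less inflated
    # estimate of skill than an R computed on the fitting sample.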