Non-linear Regressions


Multiple, stepwise and non-linear regression models
Peter Shaw, RU
[Title figure: an exponential decay curve of activity against time (0–100). A straight-line fit gives Y = 426 – 4.6X, r = 0.956; the exponential fit gives Y = 487*exp(-0.022*X), r = 1.00.]
Introduction
Last week we met and ran bivariate correlation/regression analyses, which were probably simple and familiar to most of you.
This week we extend this concept, looking at variants: MLR (simple and stepwise) and non-linear regressions.
The universals here are linear regression equations, with goodness of fit estimated as R2. The bivariate r we met last week is not meaningful in many cases here.
Multiple Linear Regression, MLR
MLR is just like bivariate regression, except that you have multiple independent (“explanatory”) variables.

Model:

Y = A + B*X1 + C*X2 (+ D*X3…)

Very often the explanatory variables are co-correlated, hence the dislike of the term “independents”.

The goodness of fit is indicated by the R2 value, where R2 = proportion of total variance in Y which is explained by the regression. This is just like bivariate regression, except that we don’t use a signed (+/-) r value.
(Why no r? Because with 2 variables the line either goes up or down. With 3 variables the fitted surface can slope +ve wrt X1 and -ve wrt X2.)
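A minimal sketch of fitting such a model (in Python with numpy, on made-up illustrative data), reading off R2 as the proportion of variance explained:

    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=50)                  # two made-up explanatory variables
    x2 = rng.normal(size=50)
    y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(scale=0.5, size=50)

    # Design matrix with an intercept column: Y = A + B*X1 + C*X2
    X = np.column_stack([np.ones_like(x1), x1, x2])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

    # R2 = proportion of total variance in Y explained by the regression
    resid = y - X @ coefs
    r2 = 1 - resid.var() / y.var()
    print("A, B, C =", coefs, " R2 =", r2)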

A three-dimensional data space
Three variables define a three-dimensional data space, which can be visualised using three-dimensional graphs.

[Figure: the 3-D graph of flower number, leaf length (Leaf 1, mm) and plant height (mm), seen from two rotations.]

The underlying data:

Height, mm   Leaf 1, mm   No. flowers
108          12           18
111          11           10
147          23           22
218          21           14
240          37           12
223          30           25
242          28           45
480          77           40
290          40           50
263          55           30
Figure 3.19. The 3-D graph relating crake density to mowing date and vegetation type. The width of the circles indicates corncrake number, ranging from 0 to the maximum of 10.
Erosion depth on marble tombstones in three Sheffield churchyards as a function of the age of the stone.
Basic regression equation: Y = -0.03 + 0.027*X
[NB there’s an argument for forcing A = 0, but I’ll skip that.]
[Figure: scatterplot of depth (mm, 0–3.0) against age (years, 0–100), points keyed by site: rural site, city edge, city centre.]
Erosion depth on marble tombstones in three Sheffield churchyards as a function of the age of the stone. A separate regression line has been fitted for each of the three churchyards.

erosion = 0.42 + 0.024*age - 0.13*distance, R2 = 0.73 (97 df)

which can be compared with the bivariate equation:

erosion = -0.28 + 0.027*age, R2 = 0.63 (98 df).

There are several philosophical catches here: Intercept? Parallel lines? This leads to ANCOVA.

[Figure: the depth-vs-age scatterplot with a separate fitted line for each site: rural site, city edge, city centre.]
Notice how adding a second explanatory variable increased the % variance explained – this is the whole point of MLR.
In fact adding an explanatory variable is ALMOST CERTAIN to increase the R2 – if it does not, it must be perfectly correlated with the first explanatory variable (or have zero variance). The question is really whether it is worth adding that new variable.
Source   df   SS       MS       F
Age       1   58.995   58.995   167**
Error    98   34.533    0.352
Total    99   93.529

Source   df   SS       MS       F
A&D       2   68.33    34.166   131**
Error    97   25.196    0.26
Total    99   93.529

How does this relate to R2? Recall that R2 = proportion of variance explained = 58.995/93.529 = 0.6308.

H0: MS(A&D - A) = MS error
F(1,97) = ((68.33 - 58.995)/1) / 0.26 = 35.9**
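The same partial F test is easy to reproduce; a sketch in Python (scipy), plugging in the sums of squares from the two tables above:

    from scipy import stats

    # Extra-sum-of-squares F test for adding "distance" to the age-only model
    ss_age, ss_both, ms_error = 58.995, 68.33, 0.26

    F = (ss_both - ss_age) / 1 / ms_error   # (68.33 - 58.995)/1 / 0.26 = 35.9
    p = stats.f.sf(F, 1, 97)                # upper tail of F(1, 97)
    print(F, p)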
This leads to Stepwise MLR
Given 1 dependent variable and lots of explanatory variables, start by finding the single highest correlation, and if significant, iterate:
1: Find the explanatory variable that causes the greatest reduction in error variance.
2: Is this reduction significant? If so, add it to the model and repeat.
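A bare-bones sketch of that loop in Python (numpy/scipy). The function name forward_stepwise and the alpha = 0.05 stopping rule are illustrative choices, not SPSS’s exact algorithm:

    import numpy as np
    from scipy import stats

    def forward_stepwise(y, X, alpha=0.05):
        """At each pass, add the variable giving the biggest drop in the
        error sum of squares, if its partial F test is significant."""
        n, p = X.shape
        chosen, remaining = [], list(range(p))

        def rss(cols):
            # residual SS for a model with an intercept plus these columns
            M = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
            beta, *_ = np.linalg.lstsq(M, y, rcond=None)
            return ((y - M @ beta) ** 2).sum()

        current = rss([])
        while remaining:
            trials = {c: rss(chosen + [c]) for c in remaining}
            best = min(trials, key=trials.get)
            df_err = n - len(chosen) - 2          # error df after adding 'best'
            F = (current - trials[best]) / (trials[best] / df_err)
            if stats.f.sf(F, 1, df_err) > alpha:
                break                             # reduction not significant: stop
            chosen.append(best)
            remaining.remove(best)
            current = trials[best]
        return chosen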
SPSS agrees! Both age and distance are predictors.
Data dredging tip:
If you have (say) 50 spp, and want to find out quickly which one shows the sharpest response to a 1-or-0 factor (M/F, logged vs unlogged etc.), try this:
Set the factor as dependent, then run stepwise MLR using all spp as predictors of that factor.
Eccentric, but works like a charm!
MLR is powerful and dangerous:
MLR is widely used in research, especially for “data dredging”.
It gives accurate R2 values, and valid least-squares estimates of parameters – intercept, gradients wrt X1, X2 etc.
Sometimes its power can seem almost magical – I have a neat little dataset in which MLR can derive parameters of the polynomial function
Y = A + B*X + C*X^2 + D*X^3…
Empirically fitting polynomial functions by MLR: the triangular function
We can find the general relation between number of points and number of rows – empirically! We solve the polynomial by MLR, then:
1+2+3+4+…+99+100 = ?
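A sketch of the trick in Python (numpy); the data are generated triangular numbers, not the lecture’s own file. Solving T = A + B*n + C*n^2 by MLR recovers A = 0, B = 0.5, C = 0.5, and the fitted polynomial then answers the question:

    import numpy as np

    # Row n of the triangle has n points; T(n) = 1+2+...+n is the running total
    n = np.arange(1, 11)
    T = np.cumsum(n)                          # 1, 3, 6, 10, ...

    # Solve the polynomial T = A + B*n + C*n^2 by MLR on n and n^2
    X = np.column_stack([np.ones_like(n), n, n ** 2]).astype(float)
    A, B, C = np.linalg.lstsq(X, T.astype(float), rcond=None)[0]
    print(A, B, C)                            # ~0, 0.5, 0.5: T(n) = n/2 + n^2/2

    print(A + B * 100 + C * 100 ** 2)         # 1+2+...+100 = 5050.0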
The case of the mysterious tipburn on the CEGB’s Scots Pine
Tipburn on Pinus sylvestris in the Liphook open-air fumigation experiment, 1985–1990.
We had some homogeneous-aged pines exposed to SO2 in a sophisticated open-air fumigation system, max c. 200 ppb. Books told us that pine only got visible injury above 100 ppm – so we were well into the invisible damage region. Except that it wasn’t invisible on a few plants! About 5% were sensitive and could show damage in our conditions. Apart from DNA, what was going on?
Data structure
35 rows of data = 7 plots * 5 years.
#plants burnt (= dependent), then columns describing SO2: annual mean; 12 * monthly mean; 52 * weekly mean; 365 * daily mean; rather more hourly means…
Stepwise MLR picked up the mean SO2 conc. during the week of budburst as the best predictor. Bingo!
But beware!
There is a serious pitfall when the explanatory variables are co-correlated (which is frequent, almost routine). In this case the gradients should be more-or-less ignored! They may be accurate, but they are utterly misleading.
An illustration: Y correlates +vely with X1 and with X2. Enter Y, X1 and X2 into an MLR:
Y = A + B*X1 + C*X2
But if B > 0, C may well be < 0! On the face of it this suggests that Y is -vely correlated with X2. In fact this correlation is with the residual information content in X2 after its correlation with X1 is removed.
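A quick demonstration in Python (numpy), on fabricated data constructed so the effect is guaranteed: both bivariate correlations with Y are positive, yet the MLR gradient on X2 comes out negative:

    import numpy as np

    rng = np.random.default_rng(1)
    x1 = rng.normal(size=200)
    x2 = 0.95 * x1 + 0.3 * rng.normal(size=200)    # X2 is mostly X1
    y = x1 - 0.5 * (x2 - 0.95 * x1) + 0.2 * rng.normal(size=200)

    # Bivariate: Y correlates +vely with BOTH X1 and X2
    print(np.corrcoef(y, x1)[0, 1], np.corrcoef(y, x2)[0, 1])

    # ...but in the MLR the gradient on X2 is negative: it reflects only the
    # residual information in X2 once its correlation with X1 is removed
    X = np.column_stack([np.ones(200), x1, x2])
    A, B, C = np.linalg.lstsq(X, y, rcond=None)[0]
    print(B, C)                                    # B > 0, C < 0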
At least one multivariate package (CANOCO) instructs that
prior to analysis you must ensure all explanatory variables
are uncorrelated, by excluding offenders from the analysis.
This caveat is widely ignored, indeed widely unknown.
Collinearity & variance inflation factors
To diagnose collinearity, compute the VIF for each variable:
VIF_j = 1/(1 - R2_j)
where R2_j is the multiple R2 between variable j and all the others. If your maximum VIF > 10, worry! Exclude variables or run a PCA (coming up soon!).
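A minimal sketch of the VIF computation in Python (numpy); the helper name vifs is arbitrary:

    import numpy as np

    def vifs(X):
        """VIF_j = 1/(1 - R2_j), where R2_j comes from regressing
        column j on all the other columns (plus an intercept)."""
        n, p = X.shape
        out = []
        for j in range(p):
            others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
            resid = X[:, j] - others @ beta
            r2_j = 1 - resid.var() / X[:, j].var()
            out.append(1.0 / (1.0 - r2_j))
        return out                       # any value > 10 is a warning sign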
Non-linear regressions
[Figure: two example scatterplots, one with R = 0.95**, the other with R = 0, NS.]
You should now be familiar with the notion of fitting a best-fit line to a scattergraph, and the use of an R2 value to assess the significance of a relationship.
Be aware that linear regression implicitly fits a straight line – it would be possible to have a perfect relationship between dependent & independent variables, but r is non-significant!
Sometimes straight lines just aren’t enough!
Are these data linear?

X   Y
0    0
1    1
2    4
3    9
4   16
5   25
6   36

R = 0.96, P < 0.05
Y = 6X - 5
– but is it?!
There are many cases where the line you want to fit is not a straight one!
[Sketches: temperature against time; BMR against mass.]
Population dynamics are often exponential (up or down), as are many physical processes such as heating/cooling, radioactive decay etc. Body size relationships (allometry) are almost always curvy.
Almost always it is still possible to use familiar, unsophisticated linear regression on these. You just have to apply a simple mathematical trick first.
To illustrate the basic idea…
Take these clearly non-linear data. Y correlates well with X, but by calculating a new variable Y# = square root(Y) the correlation can be made perfect.

X   Y    Y#
0    0   0
1    1   1
2    4   2
3    9   3
4   16   4
5   25   5
6   36   6

It is now evident that the correlation between X and Y# is perfect.
Y# = 1*X + 0, r = 1.0, p < 0.01
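A two-line check in Python (numpy) on exactly these data:

    import numpy as np

    x = np.arange(7)                              # 0..6
    y = x ** 2                                    # 0, 1, 4, 9, 16, 25, 36
    print(np.corrcoef(x, y)[0, 1])                # ~0.96: good but not perfect
    print(np.corrcoef(x, np.sqrt(y))[0, 1])       # 1.0 after the transformation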
Logarithms..



For better or worse, the tricks we need to
straighten lines are based on the
mathematical functions called
logarithms.
When I were a lad… we used “logs” at
school every day for long
multiplications/divisions. Calculators
stopped all that, and I find that I have to
explain the basics here.
Forgive me if you already know this –
someone sitting next to you won’t!
Let’s start with a few powers:

N   2^N   3^N    10^N
0    1     1     1
1    2     3     10
2    4     9     100
3    8    27     1000
4   16    81     10000
5   32   243     100000
6   64   729     1000000

Given these we see that 2^6 = 64, 3^4 = 81 etc.
An alternative way of expressing this is to say that logarithm base 2 of 64 = 6, logarithm base 3 of 81 = 4.
This is abbreviated: log2(64) = 6, log3(81) = 4; similarly log2(16) = 4, log3(9) = 2 etc.
In practice you are not offered log base 2 or 3 on calculators, though they are easy to work out. What you do get is log10 or loge, where e = 2.7182818… Loge is usually written Ln.
The power of logs is that they translate multiplication into addition! This is easy to see, sticking for a moment with logs base 2:
8 * 4 = (2*2*2) * (2*2)   [2^3 * 2^2]
      = (2*2*2*2*2)       [2^5]
      = 32                [note 3 + 2 = 5]
Universally: Logx(A) + Logx(B) = Logx(A*B), for virtually any values of x, A and B (exclude numbers <= 0).

Which log to use?
Answer: it generally makes no difference at all, PROVIDING you are consistent. Throughout this lecture I will use ln = loge, but I could equally have used log10.
In case you wanted to know: loge(X) = log10(X) * loge(10), etc.
The case where you need to be careful is fitting an explicitly exponential model, but if you calculate a half life you will get the same answer in any log base.
Given this we can deal with 2 distinct types of non-linear relationship, which cover >95% of biological/environmental situations.
1: Exponential curves. These come out of population dynamics, where growth and decline typically involve proportionate changes (e.g. 10% per year) and have a doubling time or a half life.
2: Power relationships, where doubling the independent variable changes the dependent by 2^X, where X is the power constant. This gives us the allometric relationships typical of size-scaling functions (body size/BMR, river diameter/discharge etc).
Exponential models
The model can be recognised by having a half life or doubling time, and is defined as follows:
Y = A*exp(B*X)
where A, B are constants and exp(x) is the natural antilog, e^x = 2.718^x.
Transform by taking log(Y):
Ln(Y) = ln(A) + B*X
B < 0 gives exponential decay: this never quite goes flat but heads to an asymptote.
B > 0 gives exponential growth: this is explosive – it never persists long!
For exponential data:
Plot log(Y) vs X, and use regression to get the intercept + gradient: intercept = ln(A), gradient = B.
If you want the half life, use: T½ = ln(½)/B
Notice that the gradient will vary according to the log base, so the half life is safer, as it doesn’t.
Time   Count
0      500
10     400
20     320
30     240
40     200
50     160
60     120
70     100
80     85
90     70

A basic linear fit: Y = 426 – 4.6X, r = 0.956

[Figure: the exponential decay curve of activity (counts) against time, with the linear fit overlaid.]
Log-transform the Y axis!
[Figure: exponential decay after log-transformation – log(Y) against time is now a straight line.]
log(Y) = 6.19 – 0.022*X, r = 1
Hence Y = 487*exp(-0.022*X)
From this we get:
Half life = ln(0.5)/-0.022 = 31.5 sec
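The whole recipe in Python (numpy), using the count/time table above; np.polyfit does the log-linear regression:

    import numpy as np

    time = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90], dtype=float)
    count = np.array([500, 400, 320, 240, 200, 160, 120, 100, 85, 70],
                     dtype=float)

    # Regress ln(Y) on X: gradient = B, intercept = ln(A)
    B, lnA = np.polyfit(time, np.log(count), 1)
    print("A =", np.exp(lnA), " B =", B)      # ~487 and ~-0.022

    print("half life =", np.log(0.5) / B)     # ~31.5 sec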
Power relationships (allometric models)
[Sketch: BMR against mass – a curved relationship.]
Here the model is:
Y = A*X^B
where A, B are constants as before.
Linearise by log-transforming Y AND X:
Ln(Y) = ln(A) + B*ln(X)
Plot ln(Y) vs ln(X): intercept = ln(A), gradient = B.
X    Y      LogX   LogY
29   3.77   1.46   0.58
3    1.21   0.48   0.08
24   3.43   1.38   0.54
88   6.57   1.94   0.82
95   6.82   1.98   0.83
85   6.45   1.93   0.81
49   4.9    1.69   0.69

logY = -0.155 + 0.5*log10(X)
Hence Y = 0.7*sq. root(X)

[Figure: scatterplots of Y against X (curved) and of logY against logX (linear).]
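The same fit in Python (numpy), using the table above:

    import numpy as np

    X = np.array([29, 3, 24, 88, 95, 85, 49], dtype=float)
    Y = np.array([3.77, 1.21, 3.43, 6.57, 6.82, 6.45, 4.9])

    # Allometric model Y = A*X^B: regress log(Y) on log(X)
    B, logA = np.polyfit(np.log10(X), np.log10(Y), 1)
    print("A =", 10 ** logA, " B =", B)    # ~0.7 and ~0.5, i.e. Y = 0.7*sqrt(X)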
To summarise:
1: Plot and inspect the data.
2: Explore by getting the r value for:
 I: Y vs X [linear]
 II: ln(Y) vs X [exponential]
 III: ln(Y) vs ln(X) [allometric]
The best r value implies the best model, though the differences can be very minor.
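A sketch of that exploration in Python (numpy); the helper name best_transform is arbitrary, and both variables must be positive before logging:

    import numpy as np

    def best_transform(x, y):
        """Print r for each of the three candidate models."""
        candidates = {
            "linear:      Y vs X        ": (x, y),
            "exponential: ln(Y) vs X    ": (x, np.log(y)),
            "allometric:  ln(Y) vs ln(X)": (np.log(x), np.log(y)),
        }
        for name, (u, v) in candidates.items():
            print(name, " r =", np.corrcoef(u, v)[0, 1])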
This won’t always work!
[Sketch: a humped relationship between N flux from soil and insect density.]
Some models defy linear transformations. The commonest example is a quadratic relationship:
Y = A + B*X + C*X^2
Here there is no alternative but to call up a very powerful, general procedure, the general linear model (GLM). SPSS offers this as well as quadratic regression: you get an r value + estimates of A, B and C. Handle this just as we solved the triangular function above.
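In practice quadratic regression is just MLR with X and X^2 entered as two predictors; a sketch in Python (numpy), on made-up humped data:

    import numpy as np

    # Made-up humped data
    x = np.array([0, 1, 2, 3, 4, 5, 6], dtype=float)
    y = np.array([1.0, 3.8, 5.2, 6.1, 5.0, 3.9, 1.2])

    # Fit Y = A + B*X + C*X^2 by entering X and X^2 as two predictors
    M = np.column_stack([np.ones_like(x), x, x ** 2])
    coef, *_ = np.linalg.lstsq(M, y, rcond=None)
    A, B, C = coef

    resid = y - M @ coef
    r2 = 1 - resid.var() / y.var()
    print(A, B, C, r2)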