
Linear Regression and Correlation
Peter Shaw
Introduction
[Figure: a scatterplot of Y against X]
These inter-related areas together make up one of the most powerful and useful areas of data analysis. They are simple to understand, given a few easily-learnt facts and – crucially – the knack of thinking of data as points in a 2D space. Do not worry unduly about the formulae or procedures: what matters is the mental imagery.
What you need:
Two variables which are paired together in some way:
 Measures against time, e.g. body height vs time, 1 sample per year
 Responses to treatments, e.g. growth of plants grown in a series of different fertiliser concentrations
One of these is assumed to depend on the other: this is the dependent variable, always shown on the Y axis of a graph. The other is the independent variable, and goes on the X axis.
Time, if measured, is always the independent variable – nothing affects the rate of flow of time!

[Example graph: height (Y) against age (X)]
Golden rules for a scattergraph:
 Title, which should be fully self-explanatory, e.g. "A graph of yield against fertiliser application for this exemplar system"
 Label for the Y axis = dependent variable (here: plant mass, g)
 Label for the X axis = independent variable (here: fertiliser added, g)
 Draw one best-fit line
NEVER join dot – dot!! Unless you are absolutely sure that interpolation is valid.
[Example of what is WRONG: lichen cover on tombstones plotted against year, joined dot-to-dot]
There are 2 questions you can ask here:

1: How likely is it that this pattern could occur by chance?
(e.g. r = 0.95, df = 8, p < 0.01)
Use correlation. This involves calculating a correlation coefficient r, then finding the probability of obtaining this value of r.

2: What is the best description of the relationship between the two variables?
(e.g. Y = 1 + 2*X)
Use regression. This involves calculating the equation of the best fit line. Linear regression tries to explain the relationship as an equation of the form Y = A + B*X.
Correlation

Here we calculate an index which tells us how closely the data approximate to a straight line. These indices are called correlation coefficients.

(Obscure technical tip: rs is simply r applied to the ranked values of the data.)

There are 2 correlation coefficients, depending on whether the data are normally distributed:
 Parametric data: use Pearson's Product Moment Correlation Coefficient, mercifully always known as r
 Non-parametric data: use Spearman's Correlation Coefficient, rs
Both correlation coefficients behave in exactly the same way. They range between 1.0 and –1.0: never >1 nor <–1. The value tells you how closely the data approximate to a straight line.
[Example scatterplots: r = 1.0; r ≈ 0.6; r = –1.0; r ≈ –0.8]
How to calculate?

1: You do not need to know this – it is usually done by PC or calculator.

Parametric data: use
r = (Σxy − ΣxΣy/N) / √[(Σx² − (Σx)²/N) × (Σy² − (Σy)²/N)]

Non-parametric data: rank the X data and the Y data separately, then find the difference between the X and Y ranks for each observation. Call this difference D, then use
rs = 1 − 6ΣD² / [(N−1)N(N+1)]
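For the curious, here is a minimal Python sketch of both formulas. The function names are illustrative, and the ranking step ignores tied values, which a full implementation would need to handle:

```python
import math

def pearson_r(x, y):
    """Pearson's r, coded directly from the parametric formula above."""
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    return sxy / math.sqrt(sxx * syy)

def spearman_rs(x, y):
    """Spearman's rs: r on ranked data, via the D shortcut above."""
    def rank(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for position, i in enumerate(order, start=1):
            r[i] = position        # NB: no tie handling in this sketch
        return r
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / ((n - 1) * n * (n + 1))
```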
Significance testing:
 Define your significance level (p = 0.05).
 H0: There is no association between Y and X: any indication of this is due to chance.
 H1: There is a relationship between Y and X (which may be +ve or –ve). [WARNING: You are not inferring causality]
 Find your df. Here it = N − 2. (Why N − 2? Because with 2 data points your line is certain to be a perfect fit: r = 1.0.)
 Compare your r value with the critical value listed in tables – but ignore any negative signs. Values of r larger than the tabulated critical value are significant.
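Instead of printed tables, the lookup step can be done in software: r converts to a t statistic with df = N − 2 (a standard identity, not given on the slide), and scipy supplies the two-tailed p. A hedged sketch:

```python
import math
from scipy import stats

def r_significance(r, n, alpha=0.05):
    """Two-tailed test of H0 'no association': t = |r| * sqrt(df / (1 - r^2))."""
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))   # ignore the sign, as above
    p = 2 * stats.t.sf(t, df)                  # two-tailed p value
    return p, p < alpha

# The earlier example: r = 0.95 with N = 10 (df = 8) comes out p < 0.01.
print(r_significance(0.95, 10))
```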
Example data – 2 measures of leaf decomposer activity
X = CO2, µL/g/hr; Y = FDA activity, OD/g/hr

X       Y
136.8   40.28
72.0    14.46
68.4    13.73
41.4    8.98
91.8    13.73
115.2   31.17
82.8    23.40
161.0   27.94
93.6    27.94

[Scattergraph: FDA activity (OD/g/hr) against CO2 (µmol/g/hr)]

r = 0.80, df = 7, p < 0.01
rs = 0.86, df = 7, p < 0.01
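The quoted statistics can be checked with scipy's built-in functions (pearsonr and spearmanr return the coefficient together with its p value):

```python
from scipy import stats

co2 = [136.8, 72.0, 68.4, 41.4, 91.8, 115.2, 82.8, 161.0, 93.6]
fda = [40.28, 14.46, 13.73, 8.98, 13.73, 31.17, 23.40, 27.94, 27.94]

r, p = stats.pearsonr(co2, fda)
rs, p_s = stats.spearmanr(co2, fda)
print(f"r  = {r:.2f}, p = {p:.4f}")     # should reproduce r = 0.80, p < 0.01
print(f"rs = {rs:.2f}, p = {p_s:.4f}")  # should reproduce rs = 0.86, p < 0.01
```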
Best fit lines



What is meant by a best fit?
There are infinitely many different
lines that can be fitted to any
dataset, most of which are clearly
not a good fit.
There is a formal definition of a
best-fit line, and it involves the
“residuals”, the deviations of each
data point from the best fit line.
A best fit line …
 Is the one that minimises the sum of residuals squared. This is known as a least-squares best fit.
 The usual best-fit line (supplied by calculators and most PC packages) is the one which minimises the vertical sum of residuals squared for the model Y = A + B*X.
 Intercept, A, = value of Y when X = 0.
 Gradient, B, = extent to which Y increases when X increases by 1.
 Be aware that there are alternative models!
Usual model: Y = A + B*X with vertical residuals
B = (Σxy − ΣxΣy/N) / (Σx² − (Σx)²/N)
A = mean(Y) − B*mean(X)

Alternative models: Y = A + B*X with orthogonal residuals; Y = B*X with vertical residuals; Y = B*X with orthogonal residuals.
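Coded straight from those two formulas, the usual least-squares fit is only a few lines (a sketch; the function name is illustrative):

```python
def least_squares_fit(x, y):
    """Return (A, B) for the usual model Y = A + B*X with vertical residuals."""
    n = len(x)
    b = (sum(a * c for a, c in zip(x, y)) - sum(x) * sum(y) / n) / \
        (sum(a * a for a in x) - sum(x) ** 2 / n)
    a = sum(y) / n - b * sum(x) / n    # A = mean(Y) - B*mean(X)
    return a, b
```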
Fig 3.6 – The standard model for a best fit line: residuals are vertical, and the line passes through the overall mean of the data (μx, μy). The significance of this relationship may be measured by calculating a correlation coefficient.
Gradient: B = Σ(X−μx)(Y−μy) / Σ(X−μx)²
Intercept: A = μy − B*μx (generally not zero)
Fig 3.7 – The zero-intercept model for a best fit line: residuals are vertical, and the line passes through (0,0). No significance testing or correlation coefficient is possible for this model.
Gradient: B = ΣXY / ΣX²
Intercept: A = 0 (exactly zero)
Fig 3.8 – The reduced major axis model for a best fit line: residuals are orthogonal, and the line passes through the overall mean of the data (μx, μy). No significance testing is possible for this model.
Gradient: B = standard deviation(Y) / standard deviation(X)
Intercept: A = μy − B*μx (generally not zero)
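The three gradients of Figs 3.6–3.8 side by side, as a sketch (the statistics module is in the Python standard library; note that sd(Y)/sd(X) as written is always positive, so for negatively-correlated data the RMA line is drawn with the corresponding negative slope):

```python
import statistics

def gradients(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    usual = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))                 # Fig 3.6
    zero_intercept = (sum(a * b for a, b in zip(x, y))
                      / sum(a * a for a in x))                # Fig 3.7
    rma = statistics.stdev(y) / statistics.stdev(x)           # Fig 3.8
    return usual, zero_intercept, rma
```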
When you ask for a best-fit line:
 What you get are 2 numbers, A and B, the intercept and the gradient.
 These are enough to specify the line in 2-dimensional space.
 (Note that you can equally fit a best-fit plane in 3D space, but this needs 3 parameters: an intercept and 2 gradients.)
Why the Y-X choice really matters:

Given 2 variables you can plot 2 apparently equivalent graphs: Y against X, or X against Y. In fact these are not quite the same!
Swapping the axes around has no effect on the correlation (hence the likelihood of the pattern occurring by chance).
BUT it does deeply affect the relationship inferred, the actual best fit line. The Y on X line is NOT just the X on Y line transposed.

[Figure: V2 against V1 (r = –0.9) IS NOT THE SAME AS V1 against V2 (r = –0.9): the two best-fit lines differ]
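A quick numerical illustration (the data are made up, chosen to give r ≈ –0.9 as in the figure). The two slopes are not reciprocals of one another; their product equals r², a standard identity:

```python
from scipy import stats

v1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
v2 = [6.2, 4.0, 4.8, 2.6, 3.1, 1.2]

b_v2_on_v1 = stats.linregress(v1, v2).slope   # slope of V2 regressed on V1
b_v1_on_v2 = stats.linregress(v2, v1).slope   # slope of V1 regressed on V2
r, _ = stats.pearsonr(v1, v2)
print(b_v2_on_v1, 1 / b_v1_on_v2)             # different: not just transposed!
print(b_v2_on_v1 * b_v1_on_v2, r ** 2)        # equal: product of slopes = r^2
```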
Why "Regression"?

The term was coined by Sir Francis Galton in his 1885 address to the BAAS, describing his findings about the relationship between the heights of children and of their parents. He found that the children's heights tended to be less extreme than their parents' – closer to the mean – so tall parents had tall kids, but less extremely tall, while short couples had short kids, but less extremely short. (The gradient of the kids-on-parents line was 0.61.) He called this "Regression Towards Mediocrity In Hereditary Stature" (Galton, F., 1886).

Inverting this logic might predict that parents are more extreme than their children. This is not so: the gradient of the parents-on-kids line was 0.29, even less than the 1.0 expected from equality. Remember: the best fit line does not just transpose when X and Y swap!

[Figure: Galton's data]
This allows us to predict values – an act known as extrapolation.
Given a regression equation Y = A + B*X, we can predict the expected value for Y given any value of X. This is exactly equivalent to drawing a line on the graph.

[Figure: reading the predicted value of Y from an observed value of X via the regression line]
An example: the litter FDA data again.
Equation of the regression line: Y = 0.77 + 0.23*X
Correlation coefficient r = 0.80, p < 0.01
If the line is constrained to pass through 0,0 its equation becomes Y = 0.23*X. In this case the zero-intercept line is visually indistinguishable from the line plotted.
When X = 100, Y should be 22.6 + 0.773 = 23.37.

[Figure: FDA activity (OD g⁻¹ hr⁻¹) against CO2, with the fitted regression line]
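The prediction is just arithmetic once A and B are known. A sketch refitting the line to the data above with scipy's linregress:

```python
from scipy import stats

co2 = [136.8, 72.0, 68.4, 41.4, 91.8, 115.2, 82.8, 161.0, 93.6]
fda = [40.28, 14.46, 13.73, 8.98, 13.73, 31.17, 23.40, 27.94, 27.94]

fit = stats.linregress(co2, fda)         # gives the intercept A and slope B
print(fit.intercept, fit.slope)          # ~0.77 and ~0.23, as quoted above
print(fit.intercept + fit.slope * 100)   # predicted Y at X = 100, cf. the worked example
```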
[Figure: the DCA ordination of annual toadstool data graphed as a function of year, 1986–2001. 1st DCA axis (0–200) against year; Y = 11.88*X − 1003.53, r = 0.977, p < 0.001]
Flowchart for handling regression data:

Are your data normally distributed?
 Yes → Calculate r. Is it significant?
   – Yes → Plot the data and fit a best-fit line. Annotate the graph with r, p and the regression equation.
   – No → Say you have done the work and it was NS (credit where it's due!)
 No → Calculate Spearman's correlation coefficient and assess its significance. Consider how best to graph the data.
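One way to express the flowchart in code, as a hedged sketch (scipy's shapiro is used here as the normality check, which the flowchart itself does not prescribe; the function name is illustrative):

```python
from scipy import stats

def handle_regression_data(x, y, alpha=0.05):
    # Are your data normally distributed? (Shapiro-Wilk on each variable)
    normal = stats.shapiro(x).pvalue > alpha and stats.shapiro(y).pvalue > alpha
    if normal:
        r, p = stats.pearsonr(x, y)            # calculate r; is it significant?
        if p < alpha:
            print(f"Plot + best-fit line; annotate with r = {r:.2f}, p = {p:.3f}"
                  " and the regression equation")
        else:
            print("Say you did the work and it was NS (credit where it's due!)")
    else:
        rs, p = stats.spearmanr(x, y)          # non-parametric branch
        print(f"Spearman's rs = {rs:.2f}, p = {p:.3f}; consider how best to graph")
```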
Reliability

 Inter-Rater or Inter-Observer Reliability: used to assess the degree to which different raters/observers give consistent estimates of the same phenomenon (coefficient of association).
 Test-Retest Reliability: used to assess the consistency of a measure from one time to another (correlation).
 Parallel-Forms Reliability: used to assess the consistency of the results of two tests constructed in the same way from the same content domain (correlation).
 Internal Consistency Reliability: used to assess the consistency of results across items within a test (Cronbach's alpha).
Cronbach's alpha

This is an index of reliability, usually used when comparing a set of questions in a questionnaire to see whether they genuinely seem to be getting the same sort of results. You will need it to analyse the GHQ data.
It is not a statistical test and carries no H0 or significance level, but it does supply a guideline: Cronbach's alpha > 0.7 for a set of questions to be considered reliably consistent.
In SPSS use Scale – Reliability Analysis, then in the options select "scale if item deleted". The pattern to hunt for is alpha values < 0.7 which become > 0.7 if one variable is removed; this is a warning that that variable is unreliable.
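SPSS does this for you, but the formula itself is short: alpha = k/(k−1) × (1 − Σ item variances / variance of total score). A minimal numpy sketch, with made-up scores (not the GHQ data):

```python
import numpy as np

def cronbach_alpha(items):
    """items: 2D array, rows = respondents, columns = questionnaire items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of each person's total
    return k / (k - 1) * (1 - item_vars / total_var)

scores = [[3, 4, 3], [2, 2, 3], [4, 5, 4], [1, 2, 1]]
print(cronbach_alpha(scores))   # ~0.95 here: > 0.7, so reliably consistent
```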