Multivariate stats 1: diversity, MLR

Multivariate data
Peter Shaw
Apposite quotes
“I would always start an analysis by running a PCA
ordination” (Prof John Jeffers).
“Multivariate statistics tell you what you already know”
(Prof. Michael Usher).
Most ecologists use multivariate techniques, but will
admit that they don’t really understand what they are
doing (Peter Shaw).
“Multivariate statistics are the last resort of the scoundrel”
(Forestry Commission statistician).
Introduction


So far we have dealt with cases where we have 1
response variable – height, yield, etc. Typically the
dataset will contain a 2nd variable which classifies the
response variable (treatment, plot, sex etc), and in the
case of bivariate correlation/regression the 2 variables are
analysed together.
In ecology, 2 variable datasets are the exception. Start
including species + habitat data and chances are you’ll
have 40 variables of raw data. And that’s before you
include metadata, log-transformed variables etc.
Ecological data typically comprises several blocks of variables: metadata (site, date, plot etc); raw site data (pH, elevation etc – roughly 4-10 variables); raw species data (10-100 variables); and log-transformed data etc (another 10-100 variables).




It is of course possible and valid to handle each of
these variables independently – but very tedious.
You have 40 species – you run 40 anovas. 10 come
out significant – good. But this took 40 pages of output
– and you still have to present this lot!
Multivariate techniques exist to simplify this sort of
situation. Instead of running 40 analyses on 1 species,
you run 1 analysis on all 40 species at the same time.
Much more elegant!
The catch is that the algorithms are messy – usually PC-only – and the maths scares people.
A summary of the types of statistical analyses, showing where multivariate techniques fit in:

                         Univariate                       Bivariate                    Multivariate
Descriptive statistics   mean, standard deviation, etc    equation of best-fit line    most multivariate techniques come here
Inferential statistics   t test, anova etc                r value                      Monte Carlo permutation tests
In the next 2 lectures:



I will introduce you to a few of the more commonly
used approaches to simplifying multivariate ecological
data.
Not all the techniques I will cover are strictly
multivariate: technically a multivariate analysis gives a
multivariate output. You enter N equally important
variables, and derive N new variables, sorted into
order of decreasing importance.
Today I will mention 2 approaches which don’t do this:
diversity indices, and multiple linear regression. They
ARE useful for ecologists and DO take multivariate
data, so I feel justified in slotting them here.
Diversity indices


Diversity is a deceptive concept. Magurran (1991) likens the
concept of diversity to an optical illusion; the more it is examined,
the less clear it becomes.
Ecologists recognise that the diversity of an ecological system has 2 facets:
- species number (= richness)
- evenness of distribution
Both these systems have 30 individuals and 3 species. Which is more diverse?

Community 1: Sp. A = 10, Sp. B = 10, Sp. C = 10
Community 2: Sp. A = 1, Sp. B = 1, Sp. C = 28

And how about this?

Community 3: Sp. A = 10, Sp. B = 2, Sp. C = 3, Sp. D = 1, Sp. E = 1, Sp. F = 1

This vagueness leads to
a multiplicity of indices..



The simplest and commonest is simply
species richness – too easy to need
explaining. But do remember that
species boundaries are often blurred,
and that the total is consequently not
clear.
Today I only intend to cover 2 indices
(my favourites!). Be aware that there are
many more.
These are:


The Simpson index
The Shannon index
The Simpson index: D



This is an intuitively simple, appealing
index.
It involves sampling individuals from a
population one at a time (like pulling
balls out of a hat).
What is the probability of sampling the
same species twice in 2 consecutive
samples? Call this p.


If there is only 1 species, p = 1.0
If all samples find different species, p =
0.0
Formally:



The probability of sampling species i = pi.
Hence p(sampling species i twice) = pi * pi.
Hence p(sampling any species twice)
= p1*p1 + p2*p2 + … + pN*pN = Σi (pi * pi)
Hence the simplest version of this index = Σi (pi * pi).
This has the counter-intuitive property that 0 = infinite diversity, 1 = no diversity.
Hence the usual formulation is:

D = 1 - Σi (pi * pi)
Applying the Simpson index to the communities listed previously:

Community 1: Sp. A = 10, Sp. B = 10, Sp. C = 10
p = 1/3, p = 1/3, p = 1/3
D = 1 - (1/9 + 1/9 + 1/9) = 0.667

Community 2: Sp. A = 1, Sp. B = 1, Sp. C = 28
p = 1/30, p = 1/30, p = 28/30
D = 1 - (1/900 + 1/900 + 784/900) = 114/900 = 0.127

Applied to the demo communities given previously, this tells us what was already obvious: the first community had the higher diversity (measured as evenness of species distribution).
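As a quick check on the arithmetic, here is a minimal Python sketch (illustrative only; the function name simpson_d is just a placeholder):

```python
# Illustrative sketch: Simpson's D = 1 - sum(pi^2), from raw species counts.
def simpson_d(counts):
    total = sum(counts)
    return 1 - sum((n / total) ** 2 for n in counts)

print(simpson_d([10, 10, 10]))  # 0.667 - the even community
print(simpson_d([1, 1, 28]))    # ~0.127 - the community dominated by one species
```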
The Shannon index


Very often mis-named as the
Shannon-Weaver index, this widely
used index comes out of some
very obscure mathematics in
information theory. It dates back to
work by Claude Shannon in the
Bell telephone company labs in the
1940s.
It is the most commonly used, and
least understood (!) diversity index
To calculate:

H = -Σi [pi * log(pi)]

Note that you have the choice of logarithm base. Really you should use base 2, when H defines the information content of the dataset in bits. To do this use
log2(x) = log10(x) / log10(2)
One of the oddities of the
Shannon index..






Is that the index varies with number of species, as well as
their evenness.
The maximum possible score for a community with N
species is log(N) – this occurs when all species are equally
frequent.
Because of this (and log-base problems, and the difficulty of
working out what the score actually means) I prefer to
convert H into an evenness index; H as a proportion of
what it could be if all species were equally frequent. This is
E, the evenness.
E = H / log(N).
E = 1 implies total evenness, E → 0 implies the opposite.
This index is independent of log base and of species
number.
Applying the Shannon index to the communities listed previously:

Community 1: Sp. A = 10, Sp. B = 10, Sp. C = 10
p = 1/3, p = 1/3, p = 1/3
H10 = -3 * (1/3 * log10(1/3)) = 0.477
E = H / log10(3) = 1, ie this community is as even as it could be

Community 2: Sp. A = 1, Sp. B = 1, Sp. C = 28
p = 1/30, p = 1/30, p = 28/30
H10 = -(1/30 * log10(1/30) + 1/30 * log10(1/30) + 28/30 * log10(28/30)) = 0.126
E = H / log10(3) = 0.265
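The same calculations in a minimal Python sketch (illustrative only; the function names are placeholders). Note that the log base is the user's choice, and that E comes out the same whichever base is used:

```python
import math

# Illustrative sketch: Shannon H = -sum(pi * log(pi)) and evenness E = H / log(N).
def shannon_h(counts, base=10):
    total = sum(counts)
    props = [n / total for n in counts if n > 0]
    return -sum(p * math.log(p, base) for p in props)

def evenness(counts, base=10):
    n_species = sum(1 for n in counts if n > 0)
    return shannon_h(counts, base) / math.log(n_species, base)

print(shannon_h([10, 10, 10]), evenness([10, 10, 10]))  # 0.477, 1.0
print(shannon_h([1, 1, 28]), evenness([1, 1, 28]))      # ~0.126, ~0.265
```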
Multiple Linear
Regression





MLR is just like bivariate regression, except that you have
multiple independent (“explanatory”) variables.
Model:
Y = A + B*X1 + C*X2 (+ D*X3….)
Very often the explanatory variables are co-correlated,
hence the dislike of the term “independents”
The goodness of fit is indicated by the R2 value, where R2 = the proportion of total variance in Y which is explained by the regression. This is just like bivariate regression, except that we don't use a signed (+/-) r value.

(Why no r? Because with 2 variables the line either goes up
or down. With 3 variables the line can go +ve wrt X1 and
-ve wrt X2)
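As a sketch of what an MLR fit involves (synthetic data, invented purely for illustration, fitted by ordinary least squares with numpy):

```python
import numpy as np

# Illustrative sketch: fit Y = A + B*X1 + C*X2 by least squares and report R^2.
rng = np.random.default_rng(1)                    # synthetic data for illustration
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(scale=0.5, size=50)

X = np.column_stack([np.ones_like(x1), x1, x2])   # design matrix with intercept
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)     # [A, B, C]
fitted = X @ coefs
r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
print(coefs, r2)                                  # intercept, gradients, unsigned R^2
```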
MLR is powerful and
dangerous:




MLR is widely used in research,
especially for “data dredging”.
It gives accurate R2 values, and valid
least-squares estimates of parameters –
intercept, gradients wrt X1, X2 etc.
Sometimes its power can seem almost
magical – I have a neat little dataset in
which MLR can derive parameters of the
polynomial function
Y = A + B*X + C*X^2 + D*X^3…
But beware!









There is a serious pitfall when the explanatory variables
are co-correlated (which is frequent, almost routine).
In this case the gradients should be more-or-less ignored!
They may be accurate, but are utterly misleading.
An illustration: Y correlates +vely with X1 and with X2.
Enter Y, X1 and X2 into an MLR
Y = A + B*X1 + C*X2
But if B >0, C may well be < 0! On the face of it this
suggests that Y is –vely correlated with X2.
In fact this correlation is with the residual information
content in X2 after its correlation with X1 is removed.
At least one multivariate package (CANOCO) instructs that
prior to analysis you must ensure all explanatory variables
are uncorrelated, by excluding offenders from the analysis.
This caveat is widely ignored, indeed widely unknown.
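A small synthetic demonstration of the sign-flip effect described above (the data are invented purely to show the behaviour): Y is positively correlated with both X1 and X2, yet the fitted gradient on X2 comes out negative.

```python
import numpy as np

# Synthetic illustration of the collinearity pitfall: X2 is nearly a copy of X1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)          # strongly co-correlated with X1
y = 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.2, size=200)

print(np.corrcoef(y, x1)[0, 1], np.corrcoef(y, x2)[0, 1])  # both strongly positive
X = np.column_stack([np.ones_like(x1), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coefs)        # gradient on X2 is negative despite the positive raw correlation
```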
Introduction to ordination

[Figure: a simple ordination sketch – samples 1 and 2 lie towards "Wet", sample 3 towards "Dry", and samples 4 and 5 towards "Muddy".]
“Ordination” is the term used to
arrange multivariate data in a
rational order, usually into a 2D
space (ie a diagram on paper).
The underlying concept is that of
taking a set of community data (or
chemical or physical…) and
organising them into a picture that
gives you insight into their
organisation.
A direct ordination, projecting communities onto time: the sand dune succession at the southern end of Lake Michigan (re-drawn from Olson 1958).
[Figure: vegetation plotted against years stable (0 to 10,000) and elevation (c. 177-193 m): bare sand stabilised by marram grass, then cottonwood (Populus deltoides), pines (Pinus spp.) and black oak (Quercus velutina).]
A 2-D direct ordination, also called a mosaic diagram, in this case showing the distribution of vegetation types in relation to elevation and moisture in the Sequoia national park. This is an example of a direct ordination, laying out communities in relation to two well-understood axes of variation. Redrawn from Vankat (1982) with kind permission of the California Botanical Society.
[Figure: vegetation types plotted against elevation (1000-4000 m) and a moist-to-dry gradient: alpine dwarf scrub, subalpine conifer, lodgepole pine, red fir, Jeffrey pine, mixed conifers, ponderosa pine, pinyon, wet meadows, montane hardwood, valley-foothill hardwood, chaparral and annual grassland.]
Bray-Curtis ordination



This technique is good for introducing the
concept of ordination, but is almost never used
nowadays.
It dates back to the late 1950s when
computers were unavailable, and had the
advantage that it could be run by hand.
3 steps:
1. Convert raw data to a matrix of distances between samples.
2. Identify end points.
3. Plot each sample on a graph in relation to the end points.

Sample data - a succession:

Year   A     B    C
1      100   0    0
2      90    10   0
3      80    20   5
4      60    35   10
5      50    50   20
6      40    60   30
7      20    30   40
8      5     20   60
9      0     10   75
10     0     0    90

Choose a measure of distance between years: the usual one is the Bray-Curtis index (also known as the Czekanowski index).

Between years 1 & 2:

Species   Y1    Y2    Minimum   Sum
A         100   90    90        190
B         0     10    0         10
C         0     0     0         0
                      Σ = 90    Σ = 200

distance = 1 - 2*90/200 = 0.1
The matrix of Bray-Curtis distances between each of the 10 years. Notice that the matrix is symmetrical about the leading diagonal, so only the lower half is shown.
PS I did this by hand.
PPS it took about 30 minutes!

       1     2     3     4     5     6     7     8     9     10
 1   0.00
 2   0.10  0.00
 3   0.29  0.12  0.00
 4   0.41  0.28  0.15  0.00
 5   0.54  0.45  0.33  0.16  0.00
 6   0.65  0.50  0.45  0.28  0.12  0.00
 7   0.79  0.68  0.54  0.38  0.33  0.27  0.00
 8   0.95  0.84  0.68  0.63  0.56  0.49  0.26  0.00
 9   1.00  0.89  0.74  0.79  0.71  0.63  0.43  0.18  0.00
10   1.00  1.00  0.95  0.90  0.81  0.73  0.56  0.33  0.14  0.00
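A minimal Python sketch that recomputes this matrix from the succession data above (the results should agree closely with the hand-calculated values, allowing for rounding and the odd hand-calculation slip):

```python
import numpy as np

# Bray-Curtis distance: 1 - 2*sum(min) / sum(a + b), applied to the succession data.
data = np.array([        # rows = years 1-10, columns = species A, B, C
    [100, 0, 0], [90, 10, 0], [80, 20, 5], [60, 35, 10], [50, 50, 20],
    [40, 60, 30], [20, 30, 40], [5, 20, 60], [0, 10, 75], [0, 0, 90],
])

def bray_curtis(a, b):
    return 1 - 2 * np.minimum(a, b).sum() / (a.sum() + b.sum())

n = len(data)
dist = np.array([[bray_curtis(data[i], data[j]) for j in range(n)] for i in range(n)])
print(np.round(dist, 2))
```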
Now establish the end points – this is a subjective choice; I choose years 1 and 10 as being extreme values.

Now draw a line with 1 at one end, 10 at the other. The length of this line = a distance of 1.0, based on the matrix above.

[Figure: a line with year 1 at one end and year 10 at the other, its length representing a distance of 1.0.]
Now locate year 2: it lies 0.1 from year 1 and 1.0 from year 10 (distance 1<->2 = 0.1 units, distance 2<->10 = 1.0 units). This can be done with circles: draw a circle of radius 0.1 around year 1 and a circle of radius 1.0 around year 10; year 2 locates where they cross.
The final ordination of the points (by hand).
[Figure: the hand-drawn ordination, with the years 1-10 plotted.]
Principal Components
Analysis, PCA

This is our first true multivariate
technique, and is one of my
favourites.

It is fairly close to multiple linear
regression, with one big
conceptual difference (and a
hugely different output).
The difference lies in the fitting of residuals:
[Figure: two 3-D sketches, one with axes Y, X1 and X2, the other with axes V1, V2 and V3, showing how the residuals are fitted in each case.]
MLR fits them vertically, to one special variable (Y, the dependent).
PCA fits them orthogonally - all variables are equally important, there is no dependent.
Conceptually the
intentions are very
different:



MLR seeks to find the best hyper-plane through the
data.
PCA [can be thought of as] starting off by fitting one
best LINE through the cloud of datapoints, like
shining a laser beam through an array of tethered
balloons. This line passes through the middle of the
dataset (defined as the mean of each variable), and
runs along the axis which explains the greatest
amount of variation within the data.
This first line is known as the first principal axis – it is
the most useful single line that can be fitted through
the data (formally the linear combination of variables
with the greatest variance).
[Figure: the 1st principal axis of a dataset – shown as a laser in a room of tethered balloons, passing through the overall mean of the dataset.]
Often the first axis of a dataset will be an obvious
simple source of variation: overall size
(body/catchment size etc) for allometric data,
experimental treatment in a well designed experiment.
A good indicator of the importance of a principal axis
is the % variance it explains. This will tend to decrease as the number of variables (= the number of axes in the dataspace) increases. For half-decent data with
10-20 variables you expect c. 30% of variation on the
1st principal axis.
Having fitted one principal axis, you can fit a 2nd. It
will explain less variance than the 1st axis.
This 2nd axis explains the maximum possible variation,
subject to 2 constraints:
1: it is orthogonal to (at 90 degrees to) the 1st principal axis
2: it runs through the mean of the dataset.
The 2nd principal axis of a dataset – shown in blue.
This diagram shows a 3D dataspace (axes being V1, V2 and V3), with the overall mean of the dataset marked. The data plotted:

V1    V2    V3
1.0   0.1   0.3
0.3   1.0   2.1
0.5   2.0   2.0
1.3   0.8   1.0
1.5   1.5   1.2
2.5   2.2   2.5
2.5   2.8   2.5
3.4   3.4   3.5
We can now cast a
shadow..





The first 2 principal axes define a 2D
plane, on which the positions of all the
datapoints can be projected.
Note that this is exactly the same as
casting a shadow.
Thus PCA allows us to examine the
shadow of a high-dimensional object, say
a 30 dimensional dataspace (defined by
30 variables such as species density).
Such a projection is a basic
ordination diagram.
I always use such diagrams when
exploring data – the scattergraph of the
1st 2 principal axes of a dataset is the
most genuinely informative description of
the entire data.
More axes:
A PCA can generate as many principal
axes as there are axes in the original
dataspace, but each one is progressively
less important than the one preceding
it.
In my experience the first axis is usually
intelligible (often blindingly obvious), the
second often useful, the third rarely
useful, 4th and above always seem to be
random noise.
To understand a PCA
output..


Inspecting the scattergraph is
useful, but you need to know the
importance of each variable with
respect to each axis. This is the
gradient of the axis with respect to
that variable.
This is given as a standard in PCA
output, but you need to know how
the numbers are arrived at to know
what the gibberish means!
How it’s done..


1: derive the matrix of all correlation coefficients
– the correlation matrix. Note the similarity to
Bray-Curtis ordination: we start with N columns
of data, then derive an N*N matrix informing us
about the relationship between each pair of
columns.
2: Derive the eigenvectors and eigenvalues of
this matrix. It turns out that MOST multivariate
techniques involve eigenvector analysis, so you
may as well get used to the term!
Eigenvectors:







Are intimately linked with matrix multiplication. You don’t
need to know this bit!
Take an [N*N] matrix M, and use it to multiply a [1*N]
vector of 1’s.
This gives you a new [1*N] vector, of different numbers.
Call this V1
Multiply V1 by M, to get V2.
Multiply V2 * M to get V3, etc.
After (in effect) infinite repetitions the elements of V will settle down to a steady pattern – this is the dominant eigenvector of the matrix M.
Each time the vector is multiplied by M it grows by a constant multiple, which is the first eigenvalue of M, E1.
[Diagram: M * (a vector of 1's) gives V1; M * V1 gives V2; M * V2 gives V3; and so on.]
After a while, each successive multiplication preserves the shape (the eigenvector) while increasing the values by a constant multiple (the eigenvalue).
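A minimal Python sketch of this repeated multiplication (power iteration); the matrix M is just an arbitrary symmetric example, and the result is cross-checked against numpy's own eigen-routine:

```python
import numpy as np

# Power iteration: repeatedly multiply a vector of 1's by M to find the dominant
# eigenvector; the growth factor at each step converges to the dominant eigenvalue.
M = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])        # arbitrary symmetric example matrix

v = np.ones(M.shape[0])                # start from a vector of 1's
for _ in range(100):                   # "after (nearly) infinite repetitions"
    v_new = M @ v
    eigenvalue = np.linalg.norm(v_new) / np.linalg.norm(v)
    v = v_new / np.linalg.norm(v_new)  # rescale so the numbers don't explode

print(v, eigenvalue)                   # dominant eigenvector and its eigenvalue
print(np.linalg.eigh(M))               # cross-check with a library routine
```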
The projections are made:

by multiplying the source data by the corresponding eigenvector elements, then adding these together.
Thus the projection of site A on the first principal axis is based on the calculation:
(spp1A*V11 + spp2A*V21 + spp3A*V31 …)
where spp1A = number of species 1 at site A, etc,
and V21 = 1st eigenvector element for species 2.
Luckily the PC does this for you.
There is one added complication: you do not usually use raw data in the
above calculation. It is possible to do so, but the resulting scores are very
dominated by the commonest species.
Instead all species data is first converted to Z scores, so that mean = 0.00
and sd = 1.00
This means that principal axes are always centred on the origin 0,0,0,0,…
It also means that typically half the numbers in a PCA output are negative,
and all are apparently unintelligible!
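Putting the pieces together, a minimal Python sketch of the whole recipe on synthetic data (z-score the variables, take the correlation matrix, extract its eigenvalues and eigenvectors, then project to get the site scores):

```python
import numpy as np

# Illustrative PCA via eigen-analysis of the correlation matrix.
rng = np.random.default_rng(2)
data = rng.normal(size=(30, 5))           # 30 sites x 5 "species" (synthetic)

z = (data - data.mean(axis=0)) / data.std(axis=0)   # Z scores: mean 0, sd 1
corr = np.corrcoef(z, rowvar=False)                 # the N*N correlation matrix
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]               # sort axes by importance
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

scores = z @ eigenvectors        # site scores: spp1A*V11 + spp2A*V21 + spp3A*V31 ...
percent_variance = 100 * eigenvalues / eigenvalues.sum()   # eigenvalues sum to Nvar
print(percent_variance)
print(scores[:, :2])             # axis 1 and axis 2 scores for the ordination plot
```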
The important part of a
PCA..

Is interpreting the output!
DON’T PANIC!!
Stage 1: look at the variances

Stage 2: examine scatter plots



Stage 3 – interpret scatterplots by examination of
eigenvector elements
Stage 1: look at the variances

The first principal axis will explain a proportion of the total variance in the data. The amount explained is usually given by the package, but can in any case be derived as 100 * (1st eigenvalue) / Nvar [Nvar = number of variables]. From this you can deduce that the eigenvalues always sum to Nvar.

Values above 50% on the 1st axis are hopeful, values in
the 20s suggest a poor or noisy dataset, but this varies
systematically with Nvar.

There is no inferential test here, since no null hypothesis is being tested. It is possible to ask how your 1st eigenvalue compares with that found in Nvar columns of random noise, though it is actually surprisingly contentious how best to go about this.

I suggest you compare your eigenvalues against those
predicted under the broken stick model

Expected values for the percentage variance accounted for by the first three principal axes of a PCA under the broken stick model. Axes found to account for less variance than the values tabulated here may be disregarded as random noise.

This model suggests that for random data with N variables, the pth axis should have an eigenvalue λp as follows:

λp = Σ (1/i), summed from i = p to i = N
Nvar   Axis 1   Axis 2   Axis 3
 3     61.1     27.7     11.1
 4     52.0     27.0     14.5
 5     45.6     25.6     15.6
 6     40.8     24.1     15.8
 7     37.0     22.7     15.6
 8     33.9     21.4     15.2
 9     31.4     20.3     14.7
10     29.2     19.2     14.2
11     27.4     18.3     13.8
12     25.8     17.5     13.3
13     24.4     16.7     12.9
14     23.2     16.0     12.5
15     22.1     15.4     12.1
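The tabulated values can be reproduced (to within rounding) with a few lines of Python; the function name here is just a placeholder:

```python
# Broken-stick expected % variance per axis: 100/Nvar * sum_{i=p..Nvar} 1/i.
def broken_stick_percent(n_var):
    return [100 / n_var * sum(1 / i for i in range(p, n_var + 1))
            for p in range(1, n_var + 1)]

print([round(v, 1) for v in broken_stick_percent(10)[:3]])  # ~[29.3, 19.3, 14.3]
```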
And higher axes..
You can continue examining axes up to Nvar – but you hope that only the 1st 2-4 have much useful information in them. When to stop?
The usual default is the ‘Eigenvalues > 1’ rule (which could be
called the %variance >100/Nvar’ rule, but is actually called the
Kaiser-Guttman criterion!), but this tends to give you several useless
small axes. Again I suggest you compare with the broken stick
distribution.
Generally I only bother looking at the 1st 2-3 axes: the PCA usually
tells you something blindingly obvious on the 1st axis, subtler on the
second, then degenerates into random noise. In theory highly
structured data could continue giving interpretable signal into the 4th
axis or higher.
Scree plots

Here you plot out the eigenvalue (or % variance – they are equivalent) against axis number.
[Figure: two example scree plots of % variance (0-100) against axis number (1-5): one labelled "Excellent resolution!!" (most of the variance on the 1st axis, then a steep drop) and one labelled "No pattern at all!!" (variance spread evenly across the axes).]
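A minimal sketch of how such a plot might be drawn (the % variance values here are invented for illustration):

```python
import matplotlib.pyplot as plt

# Illustrative scree plot: % variance against axis number.
percent_variance = [45.0, 25.0, 15.0, 10.0, 5.0]    # e.g. from a PCA; made-up values
axes = range(1, len(percent_variance) + 1)

plt.plot(axes, percent_variance, marker="o")
plt.xlabel("Axis number")
plt.ylabel("% variance")
plt.title("Scree plot")
plt.show()
```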
Stage 2: inspect the ordination
diagrams




Plot the scattergraph of the ordination, starting with Axis 1
against Axis 2, and have a look.
This is the shadow of the dataset.
Look for clusters, outliers, gradients. Each point is one
observation, so you can identify odd points and check the
values in them. (Good way to pick up typing errors).
Overlay the graph with various markers – often this picks out
important trends.
Data on Collembola of industrial waste sites. The 1st axis of a PCA ordination detected habitat type: woodland vs open lagoon, while the second detected succession within the Tilbury trial plots.
[Figure 1: PCA ordination of Collembola – second principal axis against first principal axis, with points marked by habitat (scrub woods vs open lagoon).]
[Figure 2: the same PCA ordination, re-plotted to highlight succession (on the PFA sites / Tilbury trial plots), with points marked by site age in years (4.00-7.00).]
Stage 3: Look at the eigenvector elements to interpret
the scattergraphs

which species / variables have large loadings (positive or
negative). Do some variables have opposed loadings (one very
+ve, one very –ve) suggesting a gradient between the two
extremes?

The actual values of the eigenvector elements are meaningless –
it is their pattern which matters. In fact the sign of elements will
sometimes differ between packages if they use different
algorithms! It’s OK! The diagrams will look the same, it’s just that
the pattern of the points will be reversed.

Biplots



Since there is an intimate
connection between eigenvector
loadings and axis scores, it is
helpful to inspect them together.
There is an elegant solution here,
known as a biplot.
You plot the site scores as points
on a graph, and put eigenvector
elements on the same graph
(usually as arrows).
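A minimal sketch of how such a biplot might be drawn in Python (the site scores and loadings here are invented; the species names are borrowed from the pond example below purely as labels):

```python
import matplotlib.pyplot as plt

# Illustrative biplot: site scores as points, eigenvector loadings as arrows.
site_scores = [(-1.2, 0.3), (-0.8, -0.5), (0.2, 1.1), (0.9, -0.2), (1.4, 0.6)]
loadings = {"Potamogeton": (-0.7, 0.6), "Phragmites": (0.5, -0.4), "Epilobium": (0.6, 0.3)}

xs, ys = zip(*site_scores)
plt.scatter(xs, ys, label="sites")
for name, (lx, ly) in loadings.items():
    plt.arrow(0, 0, lx, ly, head_width=0.05, color="red")
    plt.annotate(name, (lx, ly))
plt.axhline(0, linewidth=0.5)
plt.axvline(0, linewidth=0.5)
plt.xlabel("First principal axis")
plt.ylabel("Second principal axis")
plt.show()
```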
Pond community handout data – site scores plotted by SPSS. These are the new variables which appear after PCA: FAC1 and FAC2. Note aquatic sites at the left hand side.
[Figure: scatterplot of REGR factor score 2 against REGR factor score 1, with points marked by the WATER variable (1.00 vs .00).]
Factor scores (= eigenvector elements) for each species in the pond dataset. Note aquatic species at the left hand side.
[Figure: plot of AX2 against AX1 loadings for the species potamoge, epilob, ranuscle, phragaus and potentil.]
Pond community handout data – the biplot.
Note aquatic sites and species at the left hand side. Note also that this is only a description of the data – no hypotheses are stated and no significance values can be calculated.
[Figure: the biplot – site scores plotted as points (marked by the WATER variable), with the species Potamog, Epilobium, Ranunc, Potent. and Phrag. overlaid.]