Transcript Overview

NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 12 Overview

John Birks

OVERVIEW

• Topics covered
• Exploratory data analysis
• Clustering
• Gradient analysis
• Hypothesis testing
• Principle of parsimony in data analysis
• Possible future developments
• Conventional
• Less conventional
• Some applications
• Volcanic tephras
• Scotland’s most famous product
• Integrated analyses
• Problems of percentage compositional data
• Log-ratios
• Chameleons of CA and CCA
• Software availability
• Web sites
• Final comments

EXPLORATORY DATA ANALYSIS

Essential first step Feel for the data – ranges, need for transformations, rogue or outlying observations

NEVER FORGET THE GRAPH

CLUSTERING

Can be useful for some purposes – basic description, summarisation of large data sets.

Fraught with problems and difficulties – choice of dissimilarity coefficient (DC), choice of clustering method, difficulties of validation and evaluation.

Good general purpose: TWINSPAN – ORBACLAN – COINSPAN

GRADIENT ANALYSIS

Regression, calibration, ordination, constrained ordination, discriminant analysis and canonical variates analysis, analysis of stratigraphical and spatial data.

HYPOTHESIS TESTING

Randomisation tests, Monte Carlo permutation tests.

1987 Wageningen Cajo ter Braak

Classification of gradient analysis techniques by type of problem, response model and method of estimation.

Type of problem | Linear response model: least-squares estimation | Unimodal response model: maximum likelihood estimation | Unimodal response model: weighted averaging estimation
Regression | Multiple regression | Gaussian regression | Weighted averaging of site scores (WA)
Calibration | Linear calibration; ‘inverse regression’ | Gaussian calibration | Weighted averaging of species scores (WA)
Ordination | Principal components analysis (PCA) | Gaussian ordination | Correspondence analysis (CA); detrended CA (DCA)
Constrained ordination (a) | Redundancy analysis (RDA) (d) | Gaussian canonical ordination | Canonical CA (CCA); detrended CCA (DCCA)
Partial ordination (b) | Partial components analysis | Partial Gaussian ordination | Partial CA; partial DCA
Partial constrained ordination (c) | Partial redundancy analysis | Partial Gaussian canonical ordination | Partial CCA; partial detrended CCA

(a) Constrained multivariate regression
(b) Ordination after regression on covariables
(c) Constrained ordination after regression on covariables = constrained partial multivariate regression
(d) “Reduced-rank regression” = “PCA of y with respect to x”

A straight line displays the linear relation between the abundance value (y) of a species and an environmental variable (x), fitted to artificial data. (a = intercept; b = slope or regression coefficient).

A unimodal relation between the abundance value (y) of a species and an environmental variable (x). (u = optimum or mode: t = tolerance; c = maximum).
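The two response models in these captions can be written down directly. The sketch below is illustrative only: the function names and parameter values are assumptions, and the unimodal curve is taken to be the usual Gaussian response form y = c exp(-(x - u)^2 / (2 t^2)).

```python
import numpy as np

def linear_response(x, a, b):
    """Linear model: y = a + b*x (a = intercept, b = slope)."""
    return a + b * np.asarray(x, dtype=float)

def gaussian_response(x, c, u, t):
    """Unimodal model: y = c * exp(-(x - u)^2 / (2 t^2)).
    u = optimum, t = tolerance, c = maximum."""
    x = np.asarray(x, dtype=float)
    return c * np.exp(-0.5 * ((x - u) / t) ** 2)

# Hypothetical gradient and parameter values, chosen only for illustration.
x = np.linspace(0.0, 10.0, 101)
y_lin = linear_response(x, a=1.0, b=0.5)
y_uni = gaussian_response(x, c=20.0, u=5.0, t=1.5)

# The unimodal curve peaks at the optimum u with height c.
peak_x = x[np.argmax(y_uni)]
```

Over a short stretch of the gradient well away from u, the Gaussian curve is close to monotonic, which is one way to see why linear methods can suffice for short gradients.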

GRADIENT ANALYSIS

Linear-based methods or unimodal-based methods?

Critical question, not a matter of personal preference.

If gradients are short, there are sound statistical reasons to use linear methods: Gaussian-based methods break down, edge effects in CA and related techniques become serious, and biplot interpretations are easy.

If gradients are long, linear methods become ineffective (‘horseshoe’ effect).

How to estimate gradient length?

Type of problem | Method
Regression | Hierarchical series of response models: GLM and HOF
Calibration | GLM, DCCA (single x variable)
Ordination | DCA (detrending by segments, non-linear rescaling)
Constrained ordination | DCCA (detrending by segments, non-linear rescaling)
Partial ordination | Partial DCA (detrending by segments, non-linear rescaling)
Partial constrained ordination | Partial DCCA (detrending by segments, non-linear rescaling)

HYPOTHESIS TESTING

Monte Carlo permutation tests and randomisation tests

Distribution free: do not require normality of the error distribution.

Do require INDEPENDENCE or EXCHANGEABILITY.

Validity of permutation test results depends on the validity of the type of permutation for the data set at hand.

Completely randomised observations – completely random permutation is appropriate = randomisation test.

Randomised block design – permutation must be conditioned on blocks, e.g. if type of farm is declared as a covariable and randomisation is conditioned on it, permutations are restricted to within farms.

Time series or line transect – restricted permutations and data kept in order.

Spatial data on grid – restricted permutations and data kept in position.

Repeated measurements – BACI
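A minimal sketch of a Monte Carlo permutation test with an optional block restriction, to make the idea of conditioning permutations on blocks concrete. Everything here is an assumption made for illustration: the test statistic (absolute correlation), the function name, the artificial data and the two "farm" blocks. Restrictions for time series, grids or BACI designs need other permutation schemes and are not covered.

```python
import numpy as np

def permutation_p_value(x, y, blocks=None, n_perm=999, seed=0):
    """Permutation test of |corr(x, y)|.
    With blocks given, y is shuffled only within each block."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if blocks is None:
        blocks = np.zeros(len(y), dtype=int)  # one block = free permutation
    blocks = np.asarray(blocks)
    block_index = [np.flatnonzero(blocks == b) for b in np.unique(blocks)]

    def statistic(values):
        return abs(np.corrcoef(x, values)[0, 1])

    observed = statistic(y)
    n_extreme = 1  # the observed arrangement counts as one permutation
    for _ in range(n_perm):
        y_perm = y.copy()
        for idx in block_index:
            y_perm[idx] = y[rng.permutation(idx)]
        if statistic(y_perm) >= observed:
            n_extreme += 1
    return observed, n_extreme / (n_perm + 1)

# Artificial data: 20 samples, two hypothetical farms of 10 samples each.
rng = np.random.default_rng(1)
x = np.arange(20, dtype=float)
y = 2.0 * x + rng.normal(0.0, 1.0, size=20)
farm = np.repeat([0, 1], 10)

r_obs, p_free = permutation_p_value(x, y)               # unrestricted
_, p_block = permutation_p_value(x, y, blocks=farm)     # within farms only
```

With 999 permutations the smallest attainable p-value is 1/1000, which is why the observed arrangement is included in the count.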

PRINCIPLE OF PARSIMONY IN DATA ANALYSIS

William of Occam (Ockham), 14th-century English nominalist philosopher. Insisted that, given a set of equally good explanations for a given phenomenon, the explanation to be favoured is the SIMPLEST EXPLANATION.

Strong appeal to common sense.

Entities should not be multiplied without necessity.

It is vain to do with more what can be done with less.

An explanation of the facts should be no more complicated than necessary.

Among competing hypotheses or models, favour the simplest one that is consistent with the data.

‘Shaved’ explanations to the minimum.

In data analysis: 1) Models should have as few parameters as possible.

2) Linear models should be preferred to non-linear models.

3) Models relying on few assumptions should be preferred to those relying on many.

4) Models should be simplified/pared down until they are MINIMAL ADEQUATE.

5) Simple explanations should be preferred to complex explanations.

RELEVANCE OF PRINCIPLE OF PARSIMONY TO DATA ANALYSIS

MINIMAL ADEQUATE MODEL (MAM)
- as statistically acceptable as the most complex model
- only contains significant parameters
- high explanatory power
- large number of degrees of freedom
- may not be one MAM

CLUSTERING
- prefer simple cluster analysis methods (few assumptions, simple values of α, β, γ)
- intuitively sensible

REGRESSION
- GAM – GLM
- In GAM, simplest smoothers to be used
- In GLM, model simplification to find MAM (e.g. AIC)

CALIBRATION
- minimum number of components for lowest RMSEP in PLS or WA-PLS

ORDINATION
- retain smallest number of statistically significant axes (broken stick test)
- retain ‘signal’ at expense of noise

PARTIAL ORDINATION
- remove effects of ‘nuisance variables’ (covariables or concomitant variables) by partialling out their effects
- ordination of residuals
- retain smallest number of statistically significant axes (broken stick test)
- ‘signal’ at expense of ‘noise’ and ‘nuisance variables’

CONSTRAINED ORDINATION
- most powerful if the number of predictor variables is small compared to number of samples. Constraints are strong, arch effects avoided, no need for detrending, outlier effects minimised
- minimal adequate model (forward selection, VIF, variable selection, AIC)
- only retain statistically significant axes

PARTIAL CONSTRAINED ORDINATION
- as above + partial ordination

STRATIGRAPHICAL DATA ANALYSIS
- only retain statistically significant zones
- simplify data to major axes or gradients of variation
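The broken-stick criterion mentioned above can be sketched as follows. This is an illustrative version, assuming the usual broken-stick expectation b_k = (1/p) Σ_{j=k..p} 1/j for axis k of p; the eigenvalues are hypothetical, and the rule is a heuristic comparison rather than a formal significance test.

```python
import numpy as np

def broken_stick(p):
    """Expected proportion of variance for axes 1..p under the broken-stick model."""
    return np.array([np.sum(1.0 / np.arange(k, p + 1)) / p for k in range(1, p + 1)])

def n_axes_to_retain(eigenvalues):
    """Count leading axes whose observed proportion exceeds the broken-stick value."""
    ev = np.asarray(eigenvalues, dtype=float)
    observed = ev / ev.sum()
    expected = broken_stick(len(ev))
    n = 0
    for obs, exp in zip(observed, expected):
        if obs <= exp:
            break
        n += 1
    return n

# Hypothetical eigenvalues from an ordination with five axes.
eigenvalues = [4.0, 2.5, 0.8, 0.4, 0.3]
n_keep = n_axes_to_retain(eigenvalues)
```

Here the first two axes carry 50% and 31% of the variance against broken-stick expectations of about 46% and 26%, so two axes are retained.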

CHOICE BETWEEN INDIRECT & DIRECT GRADIENT ANALYSIS

Indirect gradient analysis – two steps.

Direct gradient analysis – one combined step.

If relevant environmental data are to hand, the direct approach is likely to be more effective and simpler than the indirect approach. Direct gradient analysis generally achieves a simpler model.

CHOICE BETWEEN REGRESSION & CONSTRAINED ORDINATION

Both are regression procedures! One Y or many Y.

Depends on purpose – is it an advantage to analyse all species simultaneously or individually?

Community assemblage or individual taxa?

CONSTRAINED ORDINATION | REGRESSION
HOLISTIC | INDIVIDUALISTIC
COMMON GRADIENTS | SEPARATE GRADIENTS
QUICK, SIMPLE | SLOW, COMPLEX, DEMANDING
LITTLE THEORY | MUCH THEORY (GLM)
EXPLORATORY | MORE CONFIRMATORY, IN DEPTH

LIMITING FACTORS

Research questions Hypotheses to be tested and evaluated Data quality

TYPES OF GRADIENT ANALYSIS METHODS BASED ON WEIGHTED AVERAGING

Community data - incidences (1/0) or abundances (≥ 0) of species at sites.

Environmental data - quantitative and/or qualitative (1/0) variables at same sites.

Use weighted averages of species scores (appropriate for unimodal biological data) and linear combinations (weighted sums) of environmental variables (appropriate for linear environmental data)

Method | Abbreviation | Response variables (y) | Predictors (x) | Lecture
Correspondence analysis | CA (also DCA) | Community data | - | 6
Canonical correspondence analysis | CCA (also DCCA) | Community data | Environmental variables | 7
CCA partial least squares | CCA-PLS | Community data | Many environmental variables | 11
Weighted averaging calibration | WA | Environmental variable | Community data | 8
WA partial least squares | WA-PLS | Environmental variable(s) | Community data | 8
Co-correspondence analysis | CO-CA | Community data | Community data | 11

Also partial CA, partial DCA, partial CCA, partial DCCA.
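To make the weighted-averaging idea concrete, here is a minimal sketch of WA regression (species optima as abundance-weighted means of an environmental variable) followed by WA calibration (inferred values as abundance-weighted means of the optima). The data matrix and pH-like values are invented, the function names are assumptions, and the deshrinking step used in practice is deliberately left out.

```python
import numpy as np

def wa_optima(Y, x):
    """WA regression: optimum of each species = abundance-weighted mean of x.
    Assumes every species occurs in at least one sample."""
    Y = np.asarray(Y, dtype=float)
    x = np.asarray(x, dtype=float)
    return (Y * x[:, None]).sum(axis=0) / Y.sum(axis=0)

def wa_infer(Y, optima):
    """WA calibration: inferred x of each sample = abundance-weighted mean of optima.
    Assumes every sample contains at least one species."""
    Y = np.asarray(Y, dtype=float)
    return (Y * optima[None, :]).sum(axis=1) / Y.sum(axis=1)

# Invented training set: 5 samples (rows) x 3 species (columns).
Y = np.array([[10, 0, 0],
              [5, 5, 0],
              [0, 10, 0],
              [0, 5, 5],
              [0, 0, 10]], dtype=float)
x = np.array([4.5, 5.0, 5.5, 6.0, 6.5])

optima = wa_optima(Y, x)
inferred = wa_infer(Y, optima)
```

Because averages are taken twice, the inferred values span a narrower range than the observed ones in this example, which is the shrinkage that deshrinking regressions are meant to correct.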

POSSIBLE FUTURE DEVELOPMENTS - CONVENTIONAL

Lecture | Topic | Possible developments
2 | Exploratory data analysis | Model specific ‘outlier’ detection; interactive graphics
3 | Clustering | COINSPAN; better randomisation tests; CART; latent class analysis
4, 5 | Regression analysis | GLM and GAM framework evaluation by cross validation. Give up SS, deviance, t, etc!
6 | Indirect gradient analysis | ? quest for the ‘ideal’ ordination method, 2-matrix CA and PCA
7 | Direct gradient analysis | 3-matrix CCA and RDA (biology, environment, species attributes); multi-component variance partitioning, vector-based reduced rank models with GAMs
8 | Calibration and reconstruction | WAPLS; non-linear deshrinking; ? ML; mixed response models; chemometrics, Bayesian framework, more consideration of spatial autocorrelation
9 | Classification | ? give up classical methods; use permutation tests; classification and regression trees and random forests
10 | Stratigraphical and spatial data | ? more consideration of temporal and spatial autocorrelation
11 | Hypothesis testing | More realistic permutation tests (restrictions); better p estimation

NEURAL NETWORKS – THE LESS CONVENTIONAL DATA ANALYSIS APPROACH IN THE FUTURE?

Back propagation neural network – layers containing neurons:

input vector → input layer → hidden layer → output layer → output vector

Clearly can have different types of input and output vectors, e.g.

INPUT VECTORS | OUTPUT VECTORS | Corresponding analysis
> 1 Predictor | 1 or more Responses | Regression
> 1 ‘Responses’ | 1 or more ‘Predictors’ | Inverse regression or calibration
> 1 Variables | 2 or more Classes | Discriminant analysis

CALIBRATION (INVERSE REGRESSION) AND ENVIRONMENTAL RECONSTRUCTIONS

Malmgren & Nordlund (1997) Palaeo-3 136, 359–373

Planktonic foraminifera, 54 core-top samples
Summer water and winter water temperatures
Core E48–22: extends to oxygen stage 9, 320,000 years

Compared neural network as a calibration tool with:
- Imbrie & Kipp principal component regression
- Modern analog technique (MAT)
- 2-block PLS (SIMCA)
- WA-PLS

Estimate RMSE

CRITERION FOR NETWORK SUCCESS

Cross-validation leave-one-out (average error rate in training set) RMSEP (predictions based on leave-one-out cross-validation) 3 neurons 600–700 cycles

RMSEP

 | Neural N | PLS | MAT | Imbrie & Kipp | WA-PLS
Summer °C | 0.71 | 1.01 | 1.26 | 1.22 | 1.04
Winter °C | 0.76 | 1.05 | 1.14 | 1.05 | 0.86
r (summer) | 0.99 | 0.98 | 0.97 | 0.97 | 0.97
r (winter) | 0.98 | 0.97 | 0.96 | 0.96 | 0.96

Changes in root-mean-square errors (RMSE) for S in relation to number of training epochs for 3-layer BP neural networks with 1, 2, 3, 4, 5, and 10 neurons in the hidden layer. The networks were trained over 50 intervals of 100 epochs each (in total of 5,000 epochs). As expected, the RMSEs decrease as training proceeds. The minimum RMSE, 0.3539, was obtained after training a network with 10 neurons in the hidden layer over 5,000 epochs. Similar results were obtained also for W (not shown in diagram).

Changes in root-mean-square errors of prediction (RMSEP) for S with increasing number of training epochs in a 3-layer back propagation neural network with 1, 2, 3, 4, 5, and 10 neurons in the hidden layer. These error rates were determined using the Leave-One-Out technique, implying training of the networks over 54 sets consisting of 53 observations each, with one observation left out for later testing. The lowest RMSEPs for both S and W, 0.7176 and 0.7636, respectively, were obtained for a configuration with 3 neurons (only the results for S are shown in the diagram). Note that set-ups with 1, 2, and 3 neurons gave lower RMSEPs than for 4, 5, and 10 neurons.

Relationships between observed and predicted S (summer panel) and W (winter panel) using a 3-layer BP neural network with 3 neurons in the hidden layer. Lines are linear regression lines. The product-moment correlation coefficients (r) are shown in the lower right hand corners.

Prediction errors for different network configurations: root-mean-square errors for the differences between observed and predicted S and W using a 3-layer BP neural network with 1, 2, 3, 4, 5, and 10 neurons in the hidden layer.

No. neurons | 1 | 2 | 3 | 4 | 5 | 10
S: RMSEP | 0.8779 | 0.7850 | 0.7176 | 1.0621 | 1.0032 | 1.2108
S: No. epochs | 500 | 1800 | 600 | 700 | 2200 | 500
W: RMSEP | 0.8796 | 0.9013 | 0.7636 | 0.8776 | 0.9206 | 0.9332
W: No. epochs | 300 | 700 | 700 | 700 | 3600 | 3000

Root-mean-square errors of prediction (RMSEP) are based on the Leave-One-Out technique in which each of the 54 observations in the data set is left out one at a time and the network is trained on the remaining observations. The trained network is then used to predict the excluded observation. The network was run over 50 intervals of 100 epochs each, and the error rates were recorded after each interval.
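The Leave-One-Out procedure described in this caption can be sketched generically. The example below is not the neural network used in the study: it plugs an ordinary least-squares model into the same loop, and the data, function names and model are all assumptions made for illustration.

```python
import numpy as np

def loo_rmsep(X, y, fit, predict):
    """Leave-one-out RMSEP: refit without sample i, then predict sample i."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        model = fit(X[keep], y[keep])
        errors[i] = predict(model, X[i:i + 1])[0] - y[i]
    return float(np.sqrt(np.mean(errors ** 2)))

def fit_ols(X, y):
    """Least-squares fit with an intercept (stand-in for any calibration model)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

# Artificial data set with 54 observations, matching only the sample size above.
rng = np.random.default_rng(0)
X = rng.normal(size=(54, 2))
y = 10.0 + X @ np.array([1.5, -0.5]) + rng.normal(0.0, 0.7, size=54)

apparent_rmse = float(np.sqrt(np.mean((predict_ols(fit_ols(X, y), X) - y) ** 2)))
rmsep = loo_rmsep(X, y, fit_ols, predict_ols)
```

For a least-squares model the leave-one-out error is never smaller than the apparent error, which mirrors the gap between RMSE and RMSEP reported for the networks.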

Prediction error for different methods: root-mean-square errors of prediction (RMSEP) for S and W obtained from a 3-layer BP network, Imbrie-Kipp Transfer Functions (IKTF), the Modern Analog Technique (MAT), Soft Modelling of Class Analogy (SIMCA), and WA-PLS.

Method | BP network | IKTF | MAT | SIMCA | WA-PLS
S | 0.7136 | 1.2224 | 1.2610 | 1.0058 | 1.0419
W | 0.7636 | 1.0550 | 1.1346 | 1.0501 | 0.8560

Figure panels: Neural Network, PLS, WA-PLS. Predictions were made using the Leave-One-Out technique.

Predictions of S and W in core E48-22 from southern Indian Ocean based on a BP network, compared to the oxygen isotope (δ¹⁸O of Globorotalia truncatulinoides) curve presented by Williams (1976) for the uppermost 440 cm of the core. The cross correlation coefficients for the relationships between δ¹⁸O and the predicted S and W are –0.68 and –0.71, respectively, for zero lags (p<0.001). Interglacial isotope stages 1, 5, 7, and 9, as interpreted here, are indicated in the diagram.

Problems with ANN implementation and cross-validation

Easy to over-fit the model.

Leave-one-out cross-validation is not a stringent test as ANN will continue to train and optimise its network to the one sample left out. Need a training set (ca. 80%) and an optimisation (or selection set) (ca. 10%) to select the ANN model with the lowest prediction error AND an independent test set (ca. 10%) whose prediction error is calculated using the model selected by the optimisation set.

Telford et al. (2004) Palaeoceanography 19

947 Atlantic foraminifera data.

Split randomly 100 times into training set (747 samples), optimisation set (100 samples), and test set (100 samples).
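The repeated three-way split can be sketched as below. Only the set sizes (747, 100, 100) and the 100 repetitions come from the text; the function name, the seed and the use of index arrays are assumptions, and no model fitting is shown.

```python
import numpy as np

def three_way_split(n, n_opt, n_test, rng):
    """Randomly partition indices 0..n-1 into training, optimisation and test sets."""
    idx = rng.permutation(n)
    test = idx[:n_test]
    opt = idx[n_test:n_test + n_opt]
    train = idx[n_test + n_opt:]
    return train, opt, test

rng = np.random.default_rng(0)
splits = [three_way_split(947, 100, 100, rng) for _ in range(100)]

# For each split: train candidate models on `train`, pick the one with the
# lowest error on `opt`, and report its error on the untouched `test` set.
train, opt, test = splits[0]
```

The point of the third set is that the model chosen on the optimisation set has never seen the test samples, so its test RMSEP is not biased by the selection step.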

Median RMSEP (ºC)

 | Training set | Optimisation set | Test set
ANN | 0.72 | 0.94 | 1.11
MAT | 0.94 | 0.94 | 1.02

No advantage in the hours of ANN computing when cross validated rigorously. ANN appears to be a very complicated (and slow) way of doing a MAT!

May not be so good after all!

DIATOMS AND NEURAL NETWORKS

Descriptive statistics for the SWAP diatom-pH data set

No. of samples | 167
No. of taxa | 267
% no. of +ve values in data | 18.47
Total inertia | 3.39

 | N2 for samples | N2 for taxa | pH
Min. | 5.13 | 1 | 4.33
Median | 28.58 | 14.99 | 5.27
Mean | 29.22 | 23.76 | 5.56
Max. | 57.18 | 120.86 | 7.25
S.D. | | | 0.77
Range | | | 2.92

Artificial Neural Network, SWAP data-set (167 lakes): convergence. Yves Prairie & Julien Racca (2002)

SWAP data-set (167 lakes): jack-knife predicted pH against observed pH. Yves Prairie & Julien Racca (2002)

pH reconstruction by ANN and WA-PLS (RLGH core). Yves Prairie & Julien Racca (2002)

SKELETONISATION ALGORITHM

Pruning algorithm comparable to BACKWARD ELIMINATION in regression models

1. Measure relevance P_i for each taxon i: P_i = E(without i) – E(with i), where E = RMSE
2. Train network with all taxa using back-propagation
3. Compute relevance P_i based on error propagation and weights
4. Taxon with smallest estimated relevance P_i is deleted [Did this in 5% classes of importance]
5. Re-train the network to a minimum again

[After deleting a taxon, the values of the remaining taxa are not re-calculated, so the input data are always the same original relative abundance values]

Racca et al. (2003)
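The pruning loop can be illustrated with a much simpler stand-in. The sketch below is not the skeletonisation algorithm itself: it replaces the neural network with a least-squares model and computes each relevance P_i by refitting without predictor i, rather than from error propagation and weights. Data, names and the stopping rule (`n_keep`) are assumptions.

```python
import numpy as np

def rmse_of_fit(X, y):
    """Apparent RMSE of a least-squares fit with intercept (stand-in for the trained network)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))

def backward_prune(X, y, n_keep):
    """Repeatedly delete the predictor with the smallest relevance
    P_i = E(without i) - E(with i), then refit."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        e_with = rmse_of_fit(X[:, active], y)
        relevance = []
        for j in active:
            others = [k for k in active if k != j]
            relevance.append(rmse_of_fit(X[:, others], y) - e_with)
        active.pop(int(np.argmin(relevance)))  # drop the least relevant predictor
    return active

# Artificial data: 6 candidate predictors, only columns 0 and 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(0.0, 0.1, size=60)

kept = backward_prune(X, y, n_keep=2)
```

As in the note above, the columns that remain are never rescaled after a deletion; the original values are reused at every step.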

Figure: N2 against ANN functionality.

Figure: leave-one-out predicted pH, ANN.

ROUND LOCH OF GLENHEAD

Figure panels: 0% pruned ANN, 30% pruned ANN, 60% pruned ANN, 85% pruned ANN, all taxa WA, all taxa ML.

General characteristics of the 37 most functional taxa for calibration based on ANN modelling approach.

Summary statistics (apparent and cross-validation) of the SWAP diatom pH inference models according to the classes of taxa included based on the Skeletonisation procedure.

Ideally, apparent RMSE should be a reliable measure of the actual predictive ability of a model; the difference between apparent and cross-validated RMSE indicates the extent to which the model has overfitted the data.

Examples of the recently published diatom-based inference models in palaeolimnology used.

CURSE OF DIMENSIONALITY related to ratio of number of taxa to number of lakes, as this ratio determines the ratio of the dimensional space in which the function is determined to the number of observations for which the function is determined.

MAXIMUM ROBUSTNESS – ratio of taxa : lakes as small as possible
(1) increase the number of lakes
(2) decrease the number of taxa

“Neural networks have the potential for data analysis and represent a viable alternative to more conventional data-analytical methods”. Malmgren & Nordlund (1997)

Advantages:
1) Mixed linear and non-linear responses.
2) Good empirical performance.
3) Wide applicability.
4) Many predictors and many ‘responses’.

Disadvantages:
1) Very much a black box.
2) Conceptually complex.
3) Little underlying theory.
4) Easy to misuse and report erroneous model performance statistics.

PATTERN RECOGNITION

Unsupervised (cluster analysis, indirect gradient analysis) or supervised (discriminant analysis, direct gradient analysis).

Diagram labels: Statistical theory; Linear methods; Discriminants & Decision Theory; Neural network; BELIEF NETWORKS; Non-parametric methods; CART trees; Nearest-neighbour K-NN; LDA

VOLCANIC TEPHRAS IN N.W.EUROPE OF LATE GLACIAL AND EARLY HOLOCENE AGE

Vedde Ash (Rhyolitic type): mid Younger Dryas, ca 10600 ¹⁴C yrs BP
Sites: Kråkenes, Norway; several other sites in W Norway; Borrobol, Scotland; Tynaspirit, Scotland; Whitrig, Scotland

Vedde (Basaltic type)
Sites: Kråkenes, W Norway

Borrobol: Lower LG Interstadial, ca 12500 ¹⁴C yrs BP
Sites: Borrobol, Scotland; Tynaspirit, Scotland; Whitrig, Scotland

Saksunarvatn: early Holocene, ca 9000 ¹⁴C yrs BP = 9930 – 10010 cal yr
Sites: Faeroes; Kråkenes, W Norway; Dallican Water, Shetland

Variables: SiO2, TiO2, Al2O3, FeO, MnO, MgO, CaO, Na2O, K2O

“The way in which correlation by tephrochronology may revolutionise approaches to reconstructing the sequence of events in the N.E.Atlantic region...” Lowe & Turney (1997)

Figure: SiO2, Al2O3, TiO2, FeO, MgO, CaO, K2O and Na2O plotted by tephra group (V = Vedde, VB = Vedde basaltic, B = Borrobol, S = Saksunarvatn).

CANONICAL VARIATES ANALYSIS (= multiple discriminant analysis)

Figure: CVA, group means. λ1 = 0.988 (32.9%), λ2 = 0.841 (28%).

Figure: CVA, individual samples (Borrobol, Vedde, Saksunarvatn, Vedde basaltic).

Figure: CVA, biplot of variables.

Figure: minimum-variance cluster analysis, √% data = chord distance; cophenetic correlation 0.955. Groups: Vedde Scotland + a few Vedde Norway, Vedde Norway, Borrobol, Saksunarvatn, Vedde basaltic.

Figure: PCA of √% data, all samples. λ1 = 0.96 (95.9%), λ2 = 0.016 (1.6%); total 97.4%. Groups: Vedde Norway, Vedde Scotland, Borrobol, Saksunarvatn, Vedde basaltic.

“Tephrochronology offers the potential of overcoming problems of correlation because ash layers provide time parallel markers and therefore precise comparisons between sequences”

“The geochemical signature of each ash is unmistakable”

Lowe & Turney (1997), Turney et al. (1997)

SCOTLAND'S MOST FAMOUS PRODUCT

Lapointe & Legendre (1994) Applied Statistics 43, 237-257

Dendrogram representing the minimum variance hierarchical classification of single-malt Scotch whiskies: two scales are provided at the top of the graph - the number of groups formed by cutting the dendrogram vertically at the given points and the fusion distances of the hierarchical classification (represented by vertical segments in the dendrogram); the vertical order of the whiskies is partly arbitrary - swapping the branches of a dendrogram does not change the corresponding cophenetic matrix (the 12 groups detailed in Appendix A are labelled A-L here)

Map of Scotland showing the positions of the Scottish distilleries, divided into 11 groups (symbols) in the regional classification of single malt whiskies (Appendix B) (the six Speyside groups are deferred to Fig. 3): distillery names are represented by four letter abbreviations (see Fig. 3); the names of regions and of some major cities are also indicated - notice that two Scotches in the present study come from the Springbank distillery; Springbank pertains to the western group whereas Longrow is a member of the Islay group.

Map of the Speyside region showing six of the 11 groups (symbols) of Scotch distilleries of the regional classification of single-malt whiskies (Appendix B) (the names of regions and of some major cities are also indicated) and abbreviations and full names of the distilleries.

Looked at spatially constrained classification and constrained ordination (RDA).

Looked at similarities between results based on:
- Colour
- Nose
- Body
- Palate
- Finish

All give consistent results. Can use one to predict the other, except for finish.

TEST OF CONGRUENCE AMONG DISTANCE MATRICES (CADM)

Legendre & Lapointe (2005)

5 data sets:
1 - colour (14 variables +/-)
2 - nose (12 variables +/-)
3 - body (8 variables +/-)
4 - palate (15 variables +/-)
5 - finish (19 variables +/-)

(1 - Jaccard coefficient)^½ to give 5 distance matrices

Overall CADM test - null hypothesis of incongruence (H0) rejected (p = 0.0001)

Compare:
1 with 2-5 - H0 rejected
2 with 1, 3-5 - H0 rejected
3 with 1, 2, 4, 5 - H0 rejected
4 with 1-3, 5 - H0 rejected
5 with 1-4 - H0 not rejected

Mantel test (2 matrices): Finish not related to Colour, Nose, Body or Palate.

Principal co-ordinates analysis of Mantel-test statistics. Axis 1 = 28.7%, axis 2 = 26.3%.
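A pairwise Mantel test on (1 - Jaccard)^½ distances can be sketched as follows. This is a plain two-matrix Mantel test, not the CADM procedure, and the binary matrices are random stand-ins rather than the whisky descriptors; function names, seeds and the handling of two all-absent rows are assumptions.

```python
import numpy as np

def jaccard_distance(B):
    """(1 - Jaccard similarity)^0.5 between the rows of a presence/absence matrix."""
    B = np.asarray(B, dtype=bool)
    n = len(B)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            either = np.sum(B[i] | B[j])
            # Assumption: two all-absent rows are treated as identical.
            s = np.sum(B[i] & B[j]) / either if either else 1.0
            D[i, j] = D[j, i] = np.sqrt(1.0 - s)
    return D

def mantel(D1, D2, n_perm=999, seed=0):
    """One-sided Mantel test: correlation of off-diagonal distances,
    with rows and columns of D2 permuted together."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(D1, k=1)
    observed = np.corrcoef(D1[iu], D2[iu])[0, 1]
    n_extreme = 1
    for _ in range(n_perm):
        p = rng.permutation(len(D1))
        if np.corrcoef(D1[iu], D2[p][:, p][iu])[0, 1] >= observed:
            n_extreme += 1
    return observed, n_extreme / (n_perm + 1)

# Random stand-in data: B is a noisy copy of A, C is unrelated to A.
rng = np.random.default_rng(1)
A = rng.random((30, 14)) < 0.5
B = A ^ (rng.random(A.shape) < 0.1)
C = rng.random((30, 19)) < 0.5

r_ab, p_ab = mantel(jaccard_distance(A), jaccard_distance(B))
r_ac, p_ac = mantel(jaccard_distance(A), jaccard_distance(C))
```

Permuting rows and columns of one matrix together keeps each distance matrix internally consistent, which is what distinguishes a Mantel test from simply shuffling the distances.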

Why is FINISH so different?

It is important!

How were the whiskies tested by the tasters?

Did they swallow or spit?

If the latter, the finish variables may not be fully detected.

ONLY WHEN SWALLOWING CAN ONE TOTALLY CAPTURE THE AFTERTASTE.

But, “some professional blenders work only with their nose, not finding it necessary to let the whisky pass their lips”.

SINGLE MALTS MUST BE SWALLOWED!

INTEGRATED ANALYSES OF BIOLOGICAL AND ENVIRONMENTAL DATA

For nature conservation and management purposes, useful to have an overview of the natural zonation of the area as a whole. Such zonation should:
1. Have characteristic or indicator species or life-forms
2. Correspond to a circumscribed range of environments
3. Have some geographical coherence

Requires integrated analysis of biological and environmental data.

INDIRECT CLUSTERING APPROACH

Biological data → biological clusters (e.g. TWINSPAN) → relate clusters to environmental data (e.g. DISCRIM, canonical variates analysis, RIVPACS)

cf. Indirect gradient analysis: biological data → PCA or CA → regression with environmental data

DIRECT CLUSTERING APPROACH

1. Latent class analysis with biological data as +/- or counts following binomial or Poisson distribution and environmental data following, after log transformation, normal distribution.

ter Braak et al. (2003) Ecological Modelling 160: 235-248

Biological data + Environmental data → Clusters or Zones

2. CCA, RDA, or DCCA of biological and environmental data combined in multivariate direct gradient analysis, followed by minimum-variance cluster analysis (Ward's method) or k-means minimum-variance cluster analysis.

Estimate characteristic species for each cluster.

Carey et al. (1995) J. Ecology 83: 833–845. Biogeographical zonation of Scotland.

Characteristic species of biogeographical zones

3. Principal co-ordinates analysis of mixed (biological and environmental) data using Gower's (1971) coefficient.

$$s_{ij} = \frac{\sum_{k=1}^{m} w_{ijk}\, s_{ijk}}{\sum_{k=1}^{m} w_{ijk}}$$

where $s_{ijk}$ is the similarity between sites $i$ and $j$ as measured by variable $k$, and $w_{ijk}$ is typically 1 or 0 depending on whether or not the comparison is valid for variable $k$. Weights of zero are assigned when data are unknown for one or both sites, or to binary variables to exclude negative matches. For binary variables $s_{ij}$ is the Jaccard coefficient. For categorical data the component similarity $s_{ijk}$ is one when the two sites have the same value and zero otherwise. For quantitative data

$$s_{ijk} = 1 - \frac{|x_{ik} - x_{jk}|}{R_k}$$

where $R_k$ is the range of variable $k$.

AN EXAMPLE

Site 1 Site 2 Site 3 Altitude 120 150 110 Moisture 1 2 3 Limestone + + Sheep + Age 1 2 3

$$s_{12} = \frac{1 \times \left(1 - \frac{30}{40}\right) + 1 \times 0 + 1 \times 0 + 0 \times 1 + 1 \times 0}{1 + 1 + 1 + 0 + 1} = 0.0625$$

Clusters can then be defined using the principal co-ordinate axes scores in a minimum-variance cluster analysis or a partitioning of the sites on the basis of the ordination scores.
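The arithmetic behind Gower's coefficient is a weighted average of per-variable similarities, as in this small sketch. The component values used for sites 1 and 2 are an assumed reading of the worked example (altitude difference 30 over a range of 40, three weighted variables that do not match, and one variable given zero weight as a double absence); the function names are hypothetical.

```python
import numpy as np

def quantitative_component(x_i, x_j, value_range):
    """Component similarity for a quantitative variable: 1 - |x_i - x_j| / R."""
    return 1.0 - abs(x_i - x_j) / value_range

def gower_similarity(s, w):
    """Gower's coefficient: weighted mean of component similarities s with weights w."""
    s = np.asarray(s, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.sum(w * s) / np.sum(w))

# Assumed components for sites 1 and 2, in the order
# altitude, moisture, limestone, sheep, age.
s_components = [quantitative_component(120, 150, 150 - 110), 0.0, 0.0, 1.0, 0.0]
weights = [1, 1, 1, 0, 1]   # zero weight excludes the double absence

s12 = gower_similarity(s_components, weights)
```

Setting a weight to zero removes that variable from both numerator and denominator, so a shared absence neither raises nor lowers the similarity.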

4. Constrained indicator species analysis (COINSPAN) Carleton, T.J. et al. (1996) J. Vegetation Science 7: 125-130 Like TWINSPAN (biological data only) but uses CCA first axis instead of CA first axis (as in TWINSPAN) as the basis for ordering samples prior to creating dichotomies.

The resulting clustering is based on CCA axis 1, a linear combination of environmental variables that maximises the dispersion of species scores.

COINSPAN clustering thus integrates biology and environment together. Surprisingly little used - has considerable potential.

PROBLEMS OF PERCENTAGE (COMPOSITIONAL) DATA

Jackson D.A. (1997) Ecology 78, 929–940

Simulated data SIM: 200 observations x 5 variables, different means and variances

Variable | x1 | x2 | x3 | x4 | x5
Mean | 30 | 60 | 60 | 120 | 120
Variance | 16 | 16 | 64 | 64 | 4096

Correlations between all variables = 0

Transformed into percentages

Raw data – BASIS
Transformed data – PERCENTAGE or PROPORTIONS

Bivariate casement plots of the basis (lower triangular matrix) and composition (upper triangular matrix) for the simulated data SIM. The basis relationships are independently generated, and correlations approximate zero. Note the strong linear relationships in the composition arising due to the constant-sum constraint, i.e. matrix closure. S1-S5 represent variables.
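The closure effect can be reproduced in a few lines. The sketch below draws five independent variables with the means and variances listed for SIM and converts each row to percentages; the normal distribution and the seed are assumptions (the original simulation may have differed), and with normal draws the most variable column can occasionally go negative, which does not matter for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([30.0, 60.0, 60.0, 120.0, 120.0])
variances = np.array([16.0, 16.0, 64.0, 64.0, 4096.0])

# Basis: 200 observations x 5 independently generated variables.
basis = rng.normal(means, np.sqrt(variances), size=(200, 5))

# Composition: each row rescaled to sum to 100 (matrix closure).
composition = 100.0 * basis / basis.sum(axis=1, keepdims=True)

corr_basis = np.corrcoef(basis, rowvar=False)
corr_composition = np.corrcoef(composition, rowvar=False)

off_diagonal = ~np.eye(5, dtype=bool)
max_abs_basis_corr = np.abs(corr_basis[off_diagonal]).max()
```

The fifth variable dominates the row totals, so after closure the other four are pushed down whenever it is large: they become negatively correlated with it and positively correlated with each other, although nothing links them in the basis.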

Frequency distributions of the bivariate correlations for SIM obtained under randomization. Each plot corresponds to the correlation between two variables from the basis (lower triangular matrix) or the composition (upper triangular matrix) used in the previous figure. The basis matrix was randomized within each column, the composition recalculated, and the correlation recalculated. Each plot is a frequency distribution of the correlations obtained from 10 000 randomized matrices.

Eigenvector coefficients from a principal component analysis of the correlation matrix of SIM. Results from a PCA of the basis and the composition are presented.

Scree plots of the eigenvalues for each component from the (a) simulated data (SIM) and (b) herbivorous zooplankton data (ZOO). The solid line represents the eigenvalues from the basis data (i.e. non standardised), and the dashed line represents the eigenvalues from the compositional data (i.e. proportions).

Scatterplots of the first two components from a principal component analysis of SIM using the (a) basis and (b) composition in calculating the correlation matrix. Letters refer to the points positioned at the ends of axes 1 and 2.

CLUSTER ANALYSIS

UPGMA cluster analysis based on a correlation matrix of the variables (S1-S5 and H1-H5) from: (a) the basis data of the simulation data (SIM); (b) the compositional data of SIM; (c) the basis data of the zooplankton data (ZOO); and (d) the compositional data of ZOO.

POSSIBLE SOLUTIONS

1. CENTRED LOG RATIO

Aitchison (1986)

All variables are retained in analysis but are standardised by dividing each variable by a denominator based on a geometric composite of all variables.

PCA of the covariance matrix

$$Y_{ij} = \operatorname{cov}\left[\log\frac{x_i}{g(x)},\ \log\frac{x_j}{g(x)}\right], \quad i, j = 1, \ldots, m$$

and $g(x)$ is the geometric mean of the variables, i.e.

$$g(x) = \left(\prod_{i=1}^{m} x_i\right)^{1/m}$$

Advantages:
1. All variables are retained.
2. Pairwise relationships are the same regardless of using basis or compositional data.

Problems:
1. With SIM, correlations still very strong! (0.412, 0.843, -0.799, -0.906)
2. Zero values have an undefined log-ratio value. Replace zero values by a small value.
3. Matrix is singular, so only m-1 components.
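A centred log-ratio transform is short to write out. In this sketch the function name, the optional `pseudo` replacement for zeros and the example values are assumptions; it simply subtracts each row's mean log (the log of its geometric mean) from the logged data.

```python
import numpy as np

def clr(X, pseudo=None):
    """Centred log-ratio: log(x_i / g(x)) for each row, g = geometric mean of the row.
    If pseudo is given, zeros are replaced by that small value first;
    without it, zeros give -inf because log(0) is undefined."""
    X = np.asarray(X, dtype=float)
    if pseudo is not None:
        X = np.where(X == 0, pseudo, X)
    log_x = np.log(X)
    return log_x - log_x.mean(axis=1, keepdims=True)

# Hypothetical basis data (2 samples x 5 variables) and its percentage composition.
basis = np.array([[30.0, 60.0, 60.0, 120.0, 120.0],
                  [20.0, 70.0, 50.0, 100.0, 200.0]])
composition = 100.0 * basis / basis.sum(axis=1, keepdims=True)

clr_basis = clr(basis)
clr_composition = clr(composition)
```

The two results are identical because rescaling a row multiplies every value and the geometric mean by the same factor. Each transformed row sums to zero, which is why the covariance matrix is singular and only m-1 components can be extracted.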

2. CORRESPONDENCE ANALYSIS

Only considers proportional relationships between variables; unaffected by using basis or compositional data.

CA/DCA/CCA – focuses on relative abundances
PCA/RDA – focuses on absolute abundance

If an environmental variable influences total biomass, but leaves the species composition unchanged, the variable will be important in PCA/RDA but not at all important in CA/DCA/CCA.

One approach:
- analyse total biomass separately by regression
- analyse species composition by CCA

Analyses are fully complementary. (PCA/RDA would probably give results close to the regression analysis).

UNRESOLVED QUESTION SINCE 1986 IN CA/CCA

How can CA and CCA

1. Model unimodal function (c.f. WA as approximate Gaussian ML regression) and

2. Be linear with fit

$$y_{ik} \approx \frac{y_{i+}\, y_{+k}}{y_{++}} \left(1 + b_{k1} x_{i1} + \ldots\right)$$

Partial answer: CA and CCA model compositional data (proportions). This compares with Aitchison's log-ratio model and the polytomous GLM, which are linear in centred logs but unimodal in the original data.

THE TWO FACES OF CORRESPONDENCE ANALYSIS AND CANONICAL CORRESPONDENCE ANALYSIS

CA and CCA are methods for analysing unimodal data.

CA and CCA are CHAMELEONS:
1) Unimodal methods
2) Linear methods

CCA can be derived as a weighted form of reduced rank regression = redundancy analysis = principal component analysis with respect to instrumental variables. The key element is that the relative abundance is a linear function of the environmental variables (relative here means relative to sample total and species total).

As unimodality and compositional data often go hand in hand, common element is that CCA models compositional (i.e. relative) abundance data instead of the absolute abundance data.

ECOLOGICAL TERMS

CCA (and CA) models relative abundances; takes sample size for granted. Usually the α-diversity of a sample increases with its size. CCA and CA take that aspect of α-diversity for granted and focus, instead, on the β-diversity (dissimilarity between sites). If the trend in α-diversity coincides with β-diversity (e.g. species disappear one by one along a gradient), CA and CCA can extract such trends.

In unimodal context, species scores are weighted averages of sample scores and vice versa. In linear context, species scores are derived from a weighted linear regression of transformed species data on to the sample scores.

Linear context most useful when gradient length is < 3SD. Unimodal context most useful when gradient length is > 4SD. For intermediate lengths, either context may be useful.

Can transform unimodal model into linear model by ‘take logarithms and double centre’ (for data with no zeroes).

If the data contain zeroes, there is no explicit linearising transformation, because we cannot take logarithms of zero. In CA and CCA, a transformation is implicit that is close to the exact transformation.

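EXACT

$$ y_{ik} \rightarrow \log\left( \frac{y_{ik}\, g}{g_{i}\, g_{k}} \right) $$

with $g_{i}$ and $g_{k}$ the geometric averages across species for sample $i$ and across samples for species $k$, respectively, and $g$ the overall geometric average.

CA/CCA

$$ y_{ik} \rightarrow \frac{y_{ik}\, y_{++}}{y_{i+}\, y_{+k}} $$

with $y_{i+}$, $y_{+k}$ and $y_{++}$ the sample total, the species total and the overall total.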
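The closeness of the two transformations can be illustrated numerically. In this sketch the function names and the simulated short-gradient data are hypothetical, and 1 is subtracted from the CA/CCA ratio (an addition made here, not part of the comparison above) so that both quantities are centred on zero; the comparison then rests on log(1 + z) being close to z for small z.

```python
import numpy as np

def exact_transform(Y):
    """Log-transform and double centre (requires all values > 0)."""
    L = np.log(np.asarray(Y, dtype=float))
    return L - L.mean(axis=1, keepdims=True) - L.mean(axis=0, keepdims=True) + L.mean()

def ca_transform(Y):
    """Observed over expected-under-independence, minus 1; zeros are allowed."""
    Y = np.asarray(Y, dtype=float)
    expected = np.outer(Y.sum(axis=1), Y.sum(axis=0)) / Y.sum()
    return Y / expected - 1.0

# Hypothetical short-gradient data (1 SD long, no zeros)
sites = np.linspace(0.0, 1.0, 8)
optima = np.linspace(0.0, 1.0, 5)
Y = 50.0 * np.exp(-0.5 * (sites[:, None] - optima[None, :]) ** 2)

max_gap = np.max(np.abs(exact_transform(Y) - ca_transform(Y)))
agreement = np.corrcoef(exact_transform(Y).ravel(), ca_transform(Y).ravel())[0, 1]
print(max_gap, agreement)
```

Unlike the exact transformation, `ca_transform` stays finite when the table contains zeros, provided every sample and every species has a positive total.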

CA AND CCA INHERIT THEIR TWO FACES FROM MODELS OF COMPOSITIONAL DATA.


DATA TYPE AND CHOICE OF ORDINATION METHOD

Besides gradient length (standard deviations), data type is also important in selecting ordination method.

                               Unconstrained         Constrained
Absolute abundance             PCA (linear)          RDA (linear), (PRC) (linear)
Relative abundance             CA, DCA (unimodal)    CCA, DCCA (unimodal)
(compositional differences)

PCA/RDA are weighted summations; CA/CCA are weighted averages, hence the difference between modelling absolute values (PCA/RDA) or relative values (CA/CCA).

Absolute abundances over long gradients cannot currently be modelled satisfactorily. The data first need to be partitioned into shorter gradients (e.g. with TWINSPAN).
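The difference between a weighted summation and a weighted average can be seen with one sample and a set of species scores. The numbers and function names below are hypothetical and serve only as illustration: doubling every abundance in the sample doubles the weighted sum but leaves the weighted average unchanged.

```python
import numpy as np

def weighted_sum_score(y, species_scores):
    """Sample score as a weighted summation (PCA/RDA style)."""
    return float(np.dot(y, species_scores))

def weighted_average_score(y, species_scores):
    """Sample score as a weighted average (CA/CCA style)."""
    return float(np.dot(y, species_scores) / np.sum(y))

species_scores = np.array([-1.0, 0.5, 2.0])   # hypothetical species scores
y = np.array([4.0, 10.0, 6.0])                # hypothetical abundances in one sample

print(weighted_sum_score(y, species_scores), weighted_sum_score(2 * y, species_scores))
print(weighted_average_score(y, species_scores), weighted_average_score(2 * y, species_scores))
```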

SPECIES ABSENCES IN DATA SETS

Besides removing the absolute abundance effect, CA, DCA, CCA, and partial CCA (and WA and WA-PLS) do not consider species absences or zero values in the biological data.
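For weighted averaging this is easy to verify: a zero abundance contributes nothing to either the numerator or the denominator of the weighted average. A minimal sketch with hypothetical pH values and abundances (the function name `wa_optimum` is also hypothetical):

```python
import numpy as np

def wa_optimum(abundance, env):
    """Weighted-average estimate of a species optimum."""
    abundance = np.asarray(abundance, dtype=float)
    env = np.asarray(env, dtype=float)
    return np.sum(abundance * env) / np.sum(abundance)

# Sites where the species is present
present = wa_optimum([2, 5, 3], [5.0, 6.0, 7.0])

# Same sites plus two extra sites where the species is absent
with_absences = wa_optimum([0, 2, 5, 3, 0], [4.0, 5.0, 6.0, 7.0, 8.5])

print(present, with_absences)
```

Both calls give the same optimum, so the estimate is the same whether the zeros reflect real absence, incomplete sampling, or chance.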

Zero values: real absence? Incomplete sampling? Chance?

Is this an advantage or disadvantage?

SOFTWARE AVAILABILITY

CANOCO & CANODRAW

MicroComputer Power, 111 Clover Lane, Ithaca, NY 14850, USA ([email protected])
http://www.microcomputerpower.com

MAT, ZONE, WINTRAN, C2

Steve Juggins, Geography Department, University of Newcastle, Newcastle upon Tyne NE1 7RH ([email protected])
http://www.campus.ncl.ac.uk/staff/Stephen.Juggins/

HOF

Jari Oksanen, Department of Biology, University of Oulu, Oulu, Finland ([email protected])
http://cc.oulu.fi/~jarioksa/

TWINSPAN (Mark Hill), DISCRIM (Cajo ter Braak), TWINGRP, RATEPOL, SPLIT, etc.

John Birks, Department of Biology, University of Bergen, Allégaten 41, N-5007 Bergen, Norway ([email protected])

QUERIES

[email protected]

John Birks, Department of Biology, University of Bergen, Allégaten 41, N-5007 Bergen, Norway Fax: (+47) 55 58 96 67 [email protected]

Gavin Simpson, Environmental Change Research Centre, University College London, Gower Street, London, WC1E 6BT, UK http://www.homepages.ucl.ac.uk/~ucfagls/ncourse/

VALUABLE WEB SITES FOR NUMERICAL ECOLOGISTS AND PALAEOECOLOGISTS

www.okstate.edu/artsci/botany/ordinate
Mike Palmer's ordination site with masses of documentation, explanatory notes, links, details of software, etc.

www.canoco.com
Cajo ter Braak's site about CANOCO and with answers to many frequently asked questions (FAQ)

www.microcomputerpower.com
Richard Furnas' site about CANOCO and related software availability and ordering

www.canodraw.com
Petr Šmilauer's site about CANODRAW and CANOCO and related software

http://regent.bf.jcu.cz/maed
Details of Petr Šmilauer and Jan Lepš' course and data on multivariate analysis of ecological data.


http://cc.oulu.fi/~jarioksa/
Jari Oksanen's site with his R vegan package, lecture notes, programs (e.g. HOF), documentation, comments, FAQ, and much more

http://www.bio.umontreal.ca/legendre/indexEnglish.html
Pierre Legendre's site with details of publications, software, activities, etc.

http://www.bio.umontreal.ca/Casgrain/en/labo/index.html
Software from Pierre Legendre's lab

http://labdsv.nr.usu.edu/
Dave Roberts' site about quantitative vegetation ecology with lecture notes, software details, etc.

www.nku.edu/~boycer/fso/
Rick Boyce's site about fuzzy set ordination

http://cran.r-project.org/
R website


www.stat.auckland.ac.nz/~mja/
Marti Anderson's site with new software, details of publications, activities, etc.

www.campus.ncl.ac.uk/staff/Stephen.Juggins
Steve Juggins' site for C2, WinTran, ZONE, etc.

www.chrono.qub.ac.uk/psimpoll/psimpoll.html
Keith Bennett's site for palaeoecological software, notes, etc.

www.chrono.qub.ac.uk/inqua
Keith Bennett's site of INQUA Data Analysis Sub-Commission software, newsletters, etc.

www.env.duke.edu/landscape/classes/env358/env358.html
Dean Urban's site with excellent lecture notes on Multivariate Methods for Environmental Applications

FINAL COMMENTS

Numerical Analysis of Biological Data

Basic building-blocks and concepts and the resulting numerical methods:
Continuum concept; niches; weighted averaging; CA/DCA; 'communities'; TWINSPAN; cluster analysis; metric scaling; indicator-species analysis; INDVAL; non-metric scaling.

Numerical Analysis of Environmental Data

Basic building-blocks and concepts and the resulting numerical methods:
GLM; gradients; correlation & covariance; cross validation; permutation tests; regression models; linear combinations; PLS; multiple regression; RDA + PCA; cluster analysis; Procrustes rotation; co-inertia analysis; linear discriminant analysis; canonical correlation analysis.

Numerical Analysis of Biological and Environmental Data

Basic building-blocks and concepts and the resulting numerical methods:
GLM & GAM; niches & gradients; cross validation; WA; regression models; weighted averaging; WA-PLS; CCA-PLS; Co-CA; multiple regression; CA/DCA; TWINSPAN; CCA; DISCRIM; PLS; COINSPAN; permutation tests; cluster analysis; co-inertia analysis; distance-based PCoA; canonical analysis of principal co-ordinates (CAP); multiple discriminant analysis.

Andrew Lang 1844-1912. He uses statistics as a drunken man uses lamp posts – for support rather than illumination. From MacKay, 1977, and reproduced through the courtesy of the Institute of Physics.

Statistics are for illumination!

Sketches illustrating statistical zap and shotgun

THE PEOPLE WHO HAVE MADE THE STATISTICAL ZAP POSSIBLE

Mark O. Hill, Cajo J.F. ter Braak, Pierre Legendre, Marti J. Anderson, Richard Telford, Steve Juggins, Gavin Simpson