Canonical Correlation Analysis,
Redundancy Analysis and Canonical
Correspondence Analysis
Hal Whitehead
BIOL4062/5062
• Canonical Correlation Analysis
• Redundancy Analysis
• Canonical Correspondence Analysis
Multivariate Statistics with Two
Groups of Variables
• Look at relationships
between two groups of
variables
– species variables vs
environment variables
(community ecology)
– genetic variables vs
environmental
variables (population
genetics)
[Diagram: data matrix of Units by Variables, with the variables divided into X’s and Y’s]
Canonical Correlation Analysis
• Multivariate extension of correlation analysis
• Looks at relationship between two sets of
variables
Canonical Correlation Analysis
Given a linear combination of X variables:
F = f1X1 + f2X2 + ... + fpXp
and a linear combination of Y variables:
G = g1Y1 + g2Y2 + ... + gqYq
The first canonical correlation is:
Maximum correlation coefficient between F and G,
for all F and G
F1={f11,f12,...,f1p} and G1={g11,g12,...,g1q}
are corresponding canonical variates
Canonical Correlation Analysis
Maximize r(F,G)
[Figure: two scatterplots of units 1 to 20, one on axes X1 and X2 with the direction F, the other on axes Y1 and Y2 with the direction G; F(7), F(16), G(7) and G(16) mark the values of units 7 and 16 on F and G]
Canonical Correlation Analysis
The first canonical correlation is:
Maximum correlation coefficient between F and G,
for all F and G
F1={f11,f12,...,f1p} and G1={g11,g12,...,g1q}
are corresponding first canonical variates
The second canonical correlation is:
Maximum correlation coefficient between F and G,
for all F orthogonal to (uncorrelated with) F1, and all G orthogonal to (uncorrelated with) G1
F2={f21,f22,...,f2p} and G2={g21,g22,...,g2q}
are corresponding second canonical variates
etc.
Canonical Correlation Analysis
• So each canonical correlation is associated
with a pair of canonical variates
• Canonical correlations decrease from the first pair to the last
• Canonical correlations are generally higher than the
simple correlations between individual X and Y variables
– as coefficients are chosen to maximize
correlations
Canonical Correlation Analysis
Correlation Matrix:
              X1 ... Xp    Y1 ... Yq
X1 ... Xp     A (pxp)      C (pxq)
Y1 ... Yq     C' (qxp)     B (qxq)
Canonical correlations are:
Square roots of the eigenvalues of
B^-1 C' A^-1 C
Coefficients of the canonical variates for the Y variables
are the corresponding eigenvectors
Number of canonical correlations =
min(No. X’s, No. Y’s)
Can test whether canonical
correlations are significantly
different from 0
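The eigenvalue recipe on this slide can be sketched in Python with NumPy. This is a minimal illustration on simulated data, not the software used in the course: the function name canonical_correlations, the simulated X and Y, and the step that recovers the X-side coefficients as A^-1 C g are assumptions added here.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations from the partitioned correlation matrix."""
    p, q = X.shape[1], Y.shape[1]
    R = np.corrcoef(np.hstack([X, Y]), rowvar=False)
    A = R[:p, :p]          # correlations among the X's (p x p)
    B = R[p:, p:]          # correlations among the Y's (q x q)
    C = R[:p, p:]          # correlations between X's and Y's (p x q)
    # M = B^-1 C' A^-1 C  (q x q)
    M = np.linalg.solve(B, C.T) @ np.linalg.solve(A, C)
    eigvals, eigvecs = np.linalg.eig(M)
    # Keep the min(p, q) largest eigenvalues, in decreasing order
    order = np.argsort(eigvals.real)[::-1][:min(p, q)]
    r = np.sqrt(np.clip(eigvals.real[order], 0.0, 1.0))
    g = eigvecs.real[:, order]            # coefficients for the Y variables
    f = np.linalg.solve(A, C @ g)         # matching coefficients for the X variables
    return r, f, g

# Simulated example (hypothetical data): 3 X variables, 2 Y variables
rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=(n, 1))               # shared latent factor
X = z @ np.array([[1.0, 0.5, 0.0]]) + rng.normal(size=(n, 3))
Y = z @ np.array([[0.8, -0.6]]) + rng.normal(size=(n, 2))

r, f, g = canonical_correlations(X, Y)

# First pair of canonical variates, computed on standardized variables
Zx = (X - X.mean(axis=0)) / X.std(axis=0)
Zy = (Y - Y.mean(axis=0)) / Y.std(axis=0)
F1, G1 = Zx @ f[:, 0], Zy @ g[:, 0]

print("canonical correlations:", r)
print("corr(F1, G1):", np.corrcoef(F1, G1)[0, 1])
```

With 3 X variables and 2 Y variables the function returns min(3, 2) = 2 canonical correlations, and the correlation between F1 and G1 reproduces the first of them.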
Canonical Correlation Analysis
What are the canonical correlations?
Are they, in toto, significantly different from zero?
Are some significant, others not? Which ones?
What are the corresponding canonical variates?
How does each original variable contribute towards
each canonical variate (use loadings)?
How much of the joint covariance of the two sets of
variables is explained by each pair of canonical
variates?
Relationship to:
Canonical Variate Analysis
• We can define dummy (1:0) variables to
define groups of units:
– 1 = in group; 0 = out of group
• A canonical correlation analysis between
these dummy grouping variables and the
original variables is equivalent to a
canonical variate analysis
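The equivalence can be checked numerically. The sketch below uses simulated data with three hypothetical groups; one dummy variable is dropped so that the dummies are not collinear (an assumption of this sketch, not stated on the slide). If lambda is an eigenvalue of W^-1 B from the canonical variate analysis (W and B being the within-group and between-group sums of squares and cross-products), the matching squared canonical correlation is lambda / (1 + lambda).

```python
import numpy as np

# Simulated data (hypothetical): 3 groups of 30 units, 3 original variables
rng = np.random.default_rng(1)
groups = np.repeat([0, 1, 2], 30)
means = np.array([[0.0, 0.0, 0.0], [1.5, 0.5, 0.0], [0.0, 2.0, 1.0]])
X = means[groups] + rng.normal(size=(90, 3))

# Dummy (1:0) variables for groups 0 and 1; group 2 is the dropped reference
D = np.column_stack([(groups == k).astype(float) for k in (0, 1)])

# Canonical correlation analysis between X and the dummies
Xc, Dc = X - X.mean(axis=0), D - D.mean(axis=0)
Sxx, Sdd, Sxd = Xc.T @ Xc, Dc.T @ Dc, Xc.T @ Dc
M = np.linalg.solve(Sxx, Sxd) @ np.linalg.solve(Sdd, Sxd.T)
r2 = np.sort(np.linalg.eigvals(M).real)[::-1][:2]   # squared canonical correlations

# Canonical variate analysis: eigenvalues of W^-1 B
W = sum((X[groups == k] - X[groups == k].mean(axis=0)).T
        @ (X[groups == k] - X[groups == k].mean(axis=0)) for k in range(3))
B = Sxx - W                                         # between = total - within
lam = np.sort(np.linalg.eigvals(np.linalg.solve(W, B)).real)[::-1][:2]

print("squared canonical correlations:", r2)
print("lambda / (1 + lambda):        ", lam / (1 + lam))
```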
Redundancy Analysis
y1 <=> y2    Correlation Analysis
x => y       Simple Regression Analysis
X => y       Multiple Regression Analysis (X={x1,x2,...})
Y1 <=> Y2    Canonical Correlation Analysis
X => Y       Redundancy Analysis
How one set of variables (X) may explain
another set (Y)
Redundancy Analysis
• “Redundancy” expresses how much of the
variance in one set of variables can be
explained by the other
Redundancy Analysis
Output:
canonical variates describing how X explains Y
non-canonical variates
(principal components of the residuals of Y)
results may be presented as a biplot:
two types of points representing the units and
X-variables, vectors giving the Y-variables
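One common way to obtain this output is to regress every Y variable on the X variables and then take principal components of the fitted values (canonical axes) and of the residuals (non-canonical axes). The sketch below follows that recipe on simulated data; the function name redundancy_analysis and the data are hypothetical, and dedicated ordination software may scale its results differently.

```python
import numpy as np

def redundancy_analysis(X, Y):
    """Variances of canonical and non-canonical axes, and the share of Y variance explained by X."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Multivariate least-squares regression of the Y's on the X's
    coef, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    fitted = Xc @ coef
    resid = Yc - fitted
    # Canonical axes: principal components of the fitted values
    can_vals = np.linalg.eigvalsh(fitted.T @ fitted / (n - 1))[::-1]
    # Non-canonical axes: principal components of the residuals
    res_vals = np.linalg.eigvalsh(resid.T @ resid / (n - 1))[::-1]
    total = np.trace(Yc.T @ Yc) / (n - 1)
    return can_vals, res_vals, can_vals.sum() / total

# Simulated example (hypothetical data): 2 X variables explaining 4 Y variables
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
effects = np.array([[1.0, 0.5, 0.0, -0.5],
                    [0.0, 1.0, 0.5, 0.2]])
Y = X @ effects + rng.normal(size=(100, 4))

can_vals, res_vals, explained = redundancy_analysis(X, Y)
print("canonical axis variances:    ", can_vals)
print("non-canonical axis variances:", res_vals)
print("proportion of Y variance explained by X:", explained)
```

With 2 X variables only two canonical axes carry variance; the remaining variance of Y sits on the non-canonical axes.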
Hourly records of sperm whale behaviour
• Variables:
– Mean cluster size
– Max. cluster size
– Mean speed
– Heading consistency
– Fluke-up rate
– Breach rate
– Lobtail rate
– Spyhop rate
– Sidefluke rate
– Coda rate
– Creak rate
– High click rate
• Data collected:
– Off Galapagos Islands
– 1985 and 1987
• Units:
– hours spent following
sperm whales
– 440 hours
Hourly records of sperm whale behaviour
• Variables, in two sets:
– Physical: Mean cluster size, Max. cluster size, Mean speed, Heading consistency, Fluke-up rate, Breach rate, Lobtail rate, Spyhop rate, Sidefluke rate
– Acoustic: Coda rate, Creak rate, High click rate
Canonical Correlation Analysis:
Physical vs. Acoustic Behaviour
                               1       2       3
Canonical correlations        0.72    0.49    0.21
P-values                      0.00    0.00    0.06
Redundancies:
V(Acoustic) | V(Physical)     34%     20%     <1%
V(Physical) | V(Acoustic)     32%     8%      <1%
Physical vs. Acoustic Behaviour
Loadings:               Canonical correlations
                           1        2
Mean cluster size        -0.95     0.07
Max. cluster size        -0.85     0.47
Mean speed                0.21     0.06
Heading consistency       0.32    -0.27
Fluke-up rate             0.73     0.23
Breach rate              -0.16     0.02
Lobtail rate             -0.22     0.03
Spyhop rate              -0.18     0.32
Sidefluke rate           -0.21     0.35
Coda rate                -0.64     0.64
Creak rate               -0.50     0.79
High click rate           0.76     0.64
Canonical Correspondence Analysis
• Canonical correlation analysis assumes a
linear relationship between two sets of
variables
• In some situations this is not reasonable
(e.g. community ecology)
• Canonical correspondence analysis
assumes Gaussian (bell-shaped) relationship
between sets of variables
• “Species” variables are Gaussian functions
of “Environmental” variables
CANOCO
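The difference between the two assumptions can be illustrated with a single species and a single environmental variable. In the sketch below the species abundance follows a Gaussian curve with an optimum, a tolerance (width) and a maximum; these parameter names and values are hypothetical choices for illustration. A straight line fitted to the same curve shows what a linear method would assume instead.

```python
import numpy as np

def gaussian_response(x, optimum, tolerance, maximum):
    """Bell-shaped abundance of a species along an environmental variable x."""
    return maximum * np.exp(-0.5 * ((x - optimum) / tolerance) ** 2)

# Hypothetical species: most abundant at x = 4, declining on both sides
x = np.linspace(0.0, 10.0, 101)
abundance = gaussian_response(x, optimum=4.0, tolerance=1.5, maximum=20.0)

# Straight-line fit to the same data, as a linear method would assume
slope, intercept = np.polyfit(x, abundance, 1)
linear_fit = slope * x + intercept

print("peak abundance at x =", x[np.argmax(abundance)])
print("correlation between abundance and x:", np.corrcoef(x, abundance)[0, 1])
```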
[Figure: panels titled Canonical Correlation Analysis and Canonical Correspondence Analysis, showing Species abundance for Species A, Species B and Species C against Environmental variable X, against Environmental variable Y, and against the best combination of X and Y (1.4X + 0.2Y)]
Canonical correspondence
analysis: Dutch spiders
• 26 environmental variables
• 12 spider species
• 100 samples (pit-fall traps)
Axes                                    1       2       3       4
Eigenvalues                            .535    .214    .063    .019
Species-environment correlations       .959    .934    .650    .782
Cumulative percentage variance
  of species data                      46.6    65.2    70.7    72.3
  of species-environment relation      63.2    88.5    95.9    98.2
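The cumulative percentages can be related back to the eigenvalues with a little arithmetic. The sketch below assumes that each cumulative percentage is the running sum of the eigenvalues divided by a total: the total variance of the species data in one case, and the sum of all canonical eigenvalues in the other. Neither total is given on the slide, so both are back-calculated from axis 1 and should be read as implied values only.

```python
import numpy as np

# Values listed for the Dutch spider analysis, axes 1 to 4
eigenvalues = np.array([0.535, 0.214, 0.063, 0.019])
cum_species = np.array([46.6, 65.2, 70.7, 72.3])     # % of species data
cum_relation = np.array([63.2, 88.5, 95.9, 98.2])    # % of species-environment relation

cumulative = np.cumsum(eigenvalues)

# Totals implied by axis 1 (assumption: percentage = cumulative eigenvalue / total)
total_species = eigenvalues[0] / (cum_species[0] / 100)
total_canonical = eigenvalues[0] / (cum_relation[0] / 100)

print("implied total variance of species data:", round(total_species, 3))
print("implied sum of canonical eigenvalues:  ", round(total_canonical, 3))
print("recomputed % of species data:", np.round(100 * cumulative / total_species, 1))
print("recomputed % of relation:    ", np.round(100 * cumulative / total_canonical, 1))
```

The recomputed percentages agree with the listed ones to within rounding, which supports reading the table with axes as columns.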
[Figure: ordination diagram, Axis 1 vs Axis 2]
Canonical correspondence
analysis can be detrended
The ‘Horseshoe effect’
Environmental Gradient
Sp A  Sp B  Sp C  Sp D  Sp E  Sp F  Sp G  Sp H  Sp I
 0     0     0     0     0     0     0     0     1
 0     0     0     0     0     0     0     0     1
 0     0     0     0     0     0     0     1     1
 0     0     0     0     0     1     1     1     0
 0     0     0     0     1     1     1     0     0
 0     0     0     1     1     1     0     0     0
 0     1     1     1     1     0     0     0     0
 1     1     1     0     0     0     0     0     0
 1     0     0     0     0     0     0     0     0
[Figure: Axis 1 vs Axis 2]
[Figure: Detrended Canonical Correspondence Analysis, Detrended Axis 1 vs Detrended Axis 2]
• Canonical Correlation Analysis
– Examines relationship between two sets of variables
• Redundancy Analysis
– Examines how set of dependent variables relates to set
of independent variables
• Canonical Correspondence Analysis
– Counterpart of Canonical Correlation and Redundancy
Analyses when the relationship between sets of variables is
Gaussian, not linear