Diapositiva 1 - IMEDEA Divulga CSIC-UIB


Exercises
Multivariate Data Analysis
Topic 1 Multivariate Data Analysis
Topic 1 Theory: Multivariate Data Analysis
Introduction to Multivariate Data Analysis
Principal Component Analysis (PCA)
Multivariate Linear Regression (MLR, PCR and PLSR)
Laboratory exercises:
Introduction to MATLAB
Examples of PCA (cluster analysis of samples, identification and
geographical distribution of contamination sources/patterns…)
Examples of Multivariate Regression (prediction of concentration
of chemicals from spectral analysis, investigation of correlation
patterns and of the relative importance of variables, …)
Romà Tauler (IDAEA, CSIC, Barcelona)
February 2009
Introduction to MATLAB.
What is MATLAB?
MATLAB is a contraction of "Matrix Laboratory" and, though originally
designed as a tool for the manipulation of matrices, is now capable of
performing a wide range of numerical computations.
MATLAB also possesses extensive graphics capabilities.
Introduction to MATLAB
Command line programming environment
command window prompt (»)
Matrix algebra: scalars, vectors, matrices
Work / use:
•Interactively at the command line
•Create/use programs (functions or scripts)
•Toolboxes add on additional functionality
The MATLAB Workspace
The workspace is where variables are stored, created,
manipulated and operated on
Workspace variables can be saved
Information about the variables in the
workspace: who and whos
»whos
  Name         Size         Bytes    Class
  fsparse      100x100      1604     sparse array
  modstruct    1x1          130      struct array
  my3D         10x20x104    166400   double array
  mymat        5x4          160      double array
  myvect       1x3          24       double array
  somechars    1x8          16       char array
  zcells       2x2          167082   cell array
Grand total is 41766 elements using 335416 bytes
MATLAB Data Types
•double -- double precision floating point number array (this is the traditional MATLAB matrix or array)
•sparse -- 2-D real (or complex) sparse matrix
•struct -- Structure array
•cell -- Cell array
•char -- Character array
•logical -- Logical array (1,0)
•<class_name> -- Custom object class
•dataset -- Standard Data Object
Command Line Help: help functionname; lookfor method; which functionname
helpwin
Importing Data into MATLAB
MATLAB can read flat ASCII files
Import Wizard
A variety of image formats can be imported with
IMREAD function (JPEG, BMP, TIFF, etc.)
Various spreadsheet import functions
Custom developed routines for reading binary
instrument files
Additional Functions for Importing Data
‘xlsfinfo’ - reads sheetnames from .xls file
‘xlsread’ - reads in data from .xls file
Format types
>> A = [1 2 0; 2 5 -1; 4 10 -1]
A =
     1     2     0
     2     5    -1
     4    10    -1
>> B = A'
B =
     1     2     4
     2     5    10
     0    -1    -1
>> C = A .* B
C =
     1     4     0
     4    25   -10
     0   -10     1
The same holds for the ./ and .\ operators: they act element by element.
The NaN concept:
NaN is the IEEE arithmetic representation for Not-a-Number.
A NaN is obtained as a result of mathematically undefined operations like
0.0/0.0 and inf-inf.
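A minimal sketch of how NaN behaves in practice, using only base MATLAB functions. The vector v and its values are made up for illustration.

```matlab
% Mathematically undefined operations return NaN
a = 0/0;
b = Inf - Inf;

% A made-up data vector with one missing value coded as NaN
v = [1 NaN 3];

is_a_nan   = isnan(a);             % true
mask       = isnan(v);             % logical vector: [0 1 0]
equal_self = (a == a);             % false: NaN is not equal to anything, even itself
m_all      = mean(v);              % NaN: NaN propagates through arithmetic
m_valid    = mean(v(~isnan(v)));   % mean of the non-missing values only
```

Filtering with ~isnan(...) before computing statistics is one simple way to handle missing values; whether that is appropriate depends on the data.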
Useful functions for beginners:
HELP: On-line help, display text at command line.
LOOKFOR: Search all M-files for keyword.
WHOS: List current variables, long form.
MAX: Largest component.
MIN: Smallest component.
ROUND, CEIL, FLOOR, FIX: Rounding.
SQUEEZE: Remove singleton dimensions.
FIND: Find indices of nonzero elements.
MEAN: Average or mean value.
ISNAN: True for Not-a-Number.
FLIPUD: Flip matrix in up/down direction.
FLIPDIM: Flip matrix along specified dimension.
RESHAPE: Change size.
PERMUTE: Permute array dimensions.
REPMAT: Replicate and tile an array.
EVAL: Execute string with MATLAB expression.
Indexing into Three-way (and higher) Arrays
MATLAB supports three-way and higher arrays.
Indexing extends easily to multiway:
»x = round(rand(4,5,2)*10)
x(:,:,1) =
    10     9     8     9     9
     2     8     4     7     9
     6     5     6     2     4
     5     0     8     4     9
x(:,:,2) =
     1     1     3     4     8
     4     2     2     9     5
     8     2     0     5     2
     0     6     7     4     7
»x(:,:,2) = ones(4,5)*5
x(:,:,1) =
    10     9     8     9     9
     2     8     4     7     9
     6     5     6     2     4
     5     0     8     4     9
x(:,:,2) =
     5     5     5     5     5
     5     5     5     5     5
     5     5     5     5     5
     5     5     5     5     5
Cell Arrays
Cell arrays are a handy way to store matrices of
different lengths, e.g. from batch process data,
as in the following example:
»x = cell(4,1)
x=
[]
[]
[]
[]
»x{1} = rand(4,5);
»x{2} = rand(10,5);
»x{3} = rand(6,5);
»x{4} = rand(8,5);
»x
x=
[ 4x5 double]
[10x5 double]
[ 6x5 double]
[ 8x5 double]
»x{1}
ans =
0.8381 0.8318 0.3046 0.3028 0.3784
0.0196 0.5028 0.1897 0.5417 0.8600
0.6813 0.7095 0.1934 0.1509 0.8537
0.3795 0.4289 0.6822 0.6979 0.5936
help diary
DIARY Save text of MATLAB session.
DIARY FILENAME causes a copy of all subsequent command window input
and most of the resulting command window output to be appended to the
named file. If no file is specified, the file 'diary' is used.
DIARY OFF suspends it.
DIARY ON turns it back on.
DIARY, by itself, toggles the diary state.
Use the functional form of DIARY, such as DIARY('file'),
when the file name is stored in a string.
See also SAVE.
Reference page in Help browser: doc diary
Introduction to Linear Algebra
•Definitions
•scalar, vector, matrix
•Linear Algebra Operations
•vector and matrix addition
•vector and matrix multiplication
•projection
•Gaussian elimination
•the concept of rank
•matrix inverses
•rank deficiency
•......
Projection of a vector y onto a vector x
Projection of a vector y onto a subspace X (onto the columns of X)
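The two projections can be sketched in base MATLAB as follows. The vectors y and x and the matrix X are made-up values, and the formulas used are the standard least-squares projections, x(x'y)/(x'x) and X(X'X)^(-1)X'y, which the slide titles refer to but do not spell out.

```matlab
% Made-up vectors and subspace basis
y = [3; 4; 5];
x = [1; 0; 0];
X = [1 0; 0 1; 0 0];          % columns span a 2-D subspace of R^3

% Projection of y onto the vector x
yhat_x = x * (x' * y) / (x' * x);

% Projection of y onto the columns of X
P      = X * ((X' * X) \ X');  % projection matrix
yhat_X = P * y;

% The residual is orthogonal to every column of X
r = y - yhat_X;
orth_check = X' * r;           % should be a vector of zeros
```

The backslash operator is used instead of inv(X'*X) because it is usually the more stable way to solve the normal equations in MATLAB.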
Diagonalization of a non-singular symmetric matrix.
Eigenvalues and eigenvectors. Calculation of the principal
components.
(X1, X2, ..., Xn) → Linear Transformation → (PC1, PC2, ..., PCn)
PC1 = l11X1 + l12X2 + .... + l1nXn
PC2 = l21X1 + l22X2 + .... + l2nXn
...................................................
PCn = ln1X1 + ln2X2 + ..... + lnnXn
In matrix form:
$$(PC_1, PC_2, \ldots, PC_n) = (X_1, X_2, \ldots, X_n)
\begin{pmatrix}
l_{11} & l_{21} & \cdots & l_{n1} \\
l_{12} & l_{22} & \cdots & l_{n2} \\
\vdots & \vdots &        & \vdots \\
l_{1n} & l_{2n} & \cdots & l_{nn}
\end{pmatrix}$$
$$PC = X L$$
with the constraints applied in ascending order:
1. Var(PC1) maximum
2. Var(PC2) maximum but with Cov(PC1,PC2) = 0
.....................................................................................
n. Var(PCn) maximum but with
Cov(PC1,PCn) = 0, Cov(PC2,PCn) = 0, Cov(PC3,PCn) = 0, ..........................., Cov(PCn-1,PCn) = 0
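* Diagonalization of a non-singular square symmetric matrix
S = Cov(X1, X2, ...., Xn)
$$S = L \, \mathrm{Diag}(\lambda_1, \lambda_2, \ldots, \lambda_n) \, L^T = L \, D(\lambda) \, L^T$$
L is an orthonormal matrix; its columns are the eigenvectors of S (loadings).
$$S =
\begin{pmatrix}
l_{11} & l_{21} & \cdots & l_{n1} \\
l_{12} & l_{22} & \cdots & l_{n2} \\
\vdots & \vdots &        & \vdots \\
l_{1n} & l_{2n} & \cdots & l_{nn}
\end{pmatrix}
\begin{pmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots &  & \vdots \\
0 & 0 & \cdots & \lambda_n
\end{pmatrix}
\begin{pmatrix}
l_{11} & l_{12} & \cdots & l_{1n} \\
l_{21} & l_{22} & \cdots & l_{2n} \\
\vdots & \vdots &        & \vdots \\
l_{n1} & l_{n2} & \cdots & l_{nn}
\end{pmatrix}$$
The eigenvalues of matrix S are on the diagonal of matrix D:
$$\lambda_1 = Var(PC_1), \; \lambda_2 = Var(PC_2), \; \ldots, \; \lambda_n = Var(PC_n)$$
$$s_{11} + s_{22} + \ldots + s_{nn} = Trace(S) = Trace(D(\lambda)) = \lambda_1 + \lambda_2 + \ldots + \lambda_n$$
$$Det(S) = Det(D(\lambda)) = \lambda_1 \lambda_2 \cdots \lambda_n$$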
Z(n,n) = (zij) is the matrix of scores: the object coordinates on the new axes (new variables,
or PCs)
$$Z_{n,n} = X_{n,n} L_{n,n} ; \qquad Z_{n,n} L^T_{n,n} = X_{n,n}$$
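The relations above can be checked numerically with a small sketch in base MATLAB. The data matrix X is made up, and the columns are mean-centred first (an assumption of this sketch) so that the eigenvalues equal the variances of the scores.

```matlab
% Made-up data: 5 objects (rows) x 3 variables (columns)
X = [2 4 1; 3 6 2; 5 9 2; 6 13 4; 9 18 5];
n = size(X, 1);

% Mean-centre the columns and build the covariance matrix S
Xc = X - repmat(mean(X), n, 1);
S  = (Xc' * Xc) / (n - 1);
S  = (S + S') / 2;                          % enforce exact symmetry before eig

% Diagonalize: S = L * D * L'
[L, D] = eig(S);
[lambda, idx] = sort(diag(D), 'descend');   % largest variance first
L = L(:, idx);                              % loadings, one eigenvector per column

% Scores: coordinates of the objects on the new axes
Z = Xc * L;

disp(lambda');                   % eigenvalues
disp(var(Z));                    % variances of the PCs (same numbers)
disp([trace(S) sum(lambda)]);    % the trace is preserved
disp([det(S) prod(lambda)]);     % the determinant is preserved
disp(norm(Z * L' - Xc));         % Z*L' recovers the centred data (about 0)
```

eig does not guarantee any ordering of the eigenvalues, which is why they are sorted explicitly here.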
Linear Combination of the original variables
Factors
Principal Components (PC)
Canonic Variables
Latent Variables
Discriminant Functions
...............................................
Linear Combination of random variables
$$y = a_1 x_1 + a_2 x_2 + \ldots + a_n x_n = a^T x$$
$$E(y) = a_1 E(x_1) + a_2 E(x_2) + \ldots + a_n E(x_n)$$
$$Var(y) = (a_1, a_2, \ldots, a_n) \, S \, a = a^T S a$$
where S is the variance-covariance matrix of X
$$z = b_1 x_1 + b_2 x_2 + \ldots + b_n x_n = b^T x$$
$$Cov(y, z) = a^T S b$$
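A short numerical check of the variance and covariance rules for linear combinations, in base MATLAB. The data matrix X and the coefficient vectors a and b are made up for illustration.

```matlab
% Made-up data: 5 observations of 2 variables
X = [1 2; 2 1; 3 5; 4 3; 5 6];
a = [2; -1];
b = [1;  3];

S = cov(X);            % variance-covariance matrix of X
y = X * a;             % y = a' * x for every observation
z = X * b;             % z = b' * x for every observation

var_formula = a' * S * a;        % Var(y) from the formula
var_direct  = var(y);            % Var(y) computed directly

Cyz         = cov([y z]);        % 2x2 covariance matrix of y and z
cov_formula = a' * S * b;        % Cov(y,z) from the formula
cov_direct  = Cyz(1, 2);         % Cov(y,z) computed directly

mean_formula = mean(X) * a;      % E(y) = a1*E(x1) + a2*E(x2)
mean_direct  = mean(y);
```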
•Noise Filtering
Selection of the first principal components, e.g. if e
PCs are selected:
$$X_{m,n} = Z_{m,e} L^T_{e,n} + E_{m,n}$$
E(m,n) is the residuals matrix, after subtracting the
contributions of the first e PCs
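A minimal sketch of noise filtering by truncation, assuming a made-up rank-2 data set with a small deterministic perturbation standing in for noise. The singular value decomposition is used here to obtain scores and loadings; it is one common route, not necessarily the one used in the course.

```matlab
% Made-up rank-2 "true" data (6 objects x 4 variables) plus a small perturbation
T     = [1 0; 2 1; 3 1; 4 3; 5 2; 6 5];
P     = [1 2 3 4; 4 3 2 1];
Xtrue = T * P;
N     = reshape(0.01 * sin(1:24), 6, 4);   % stand-in for measurement noise
X     = Xtrue + N;

% Decompose and keep the first e components
[U, Sv, V] = svd(X, 'econ');
e = 2;
Z = U(:, 1:e) * Sv(1:e, 1:e);    % scores    (m x e)
L = V(:, 1:e);                   % loadings  (n x e)

Xhat = Z * L';                   % filtered data
E    = X - Xhat;                 % residuals matrix

disp(diag(Sv)');                 % two large singular values, the rest small
disp(norm(E, 'fro'));            % size of what was filtered out
```

Because the true data have rank 2, the residual of the best rank-2 approximation cannot be larger than the added perturbation (Eckart-Young theorem).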
* Euclidean Distance
$$d^2(O_i, O_j) = d^2\big( (x_{i1}, x_{i2}, \ldots, x_{in}), (x_{j1}, x_{j2}, \ldots, x_{jn}) \big)
= (x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \ldots + (x_{in}-x_{jn})^2$$
$$= (x_{i1}-x_{j1}, \ldots, x_{in}-x_{jn}) \; I \;
\begin{pmatrix} x_{i1}-x_{j1} \\ x_{i2}-x_{j2} \\ \vdots \\ x_{in}-x_{jn} \end{pmatrix}$$
* Mahalanobis Distance
$$d^2_m(O_i, O_j) = (x_{i1}-x_{j1}, \ldots, x_{in}-x_{jn}) \; S^{-1} \;
\begin{pmatrix} x_{i1}-x_{j1} \\ x_{i2}-x_{j2} \\ \vdots \\ x_{in}-x_{jn} \end{pmatrix}$$
where S is the covariance matrix
It takes into account the covariance between variables!
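Both distances can be computed in a few lines of base MATLAB. The data matrix X is made up, and the Mahalanobis distance is written with the inverse of the covariance matrix, which is the standard definition.

```matlab
% Made-up data: 5 objects (rows) x 2 correlated variables (columns)
X = [2 1; 4 3; 6 8; 8 9; 10 14];
S = cov(X);                      % covariance matrix of the variables

% Difference between objects 1 and 2 (row vector)
d = X(1, :) - X(2, :);

d2_euclid = d * eye(2) * d';     % squared Euclidean distance
d2_mahal  = d * (S \ d');        % squared Mahalanobis distance, d * inv(S) * d'

disp([d2_euclid d2_mahal]);
```

With strongly correlated variables the two distances can rank pairs of objects quite differently, which is the point of the remark about covariance.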
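Univariate Statistics
mean
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
variance
$$\sigma_X^2 \approx s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
standard deviation
$$\sigma_X \approx s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
covariance
$$\sigma_{x,y} \approx s_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$
correlation
$$r_{x,y} = \frac{s_{x,y}}{s_x s_y}$$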
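Multivariate Statistics
Matrix X of experimental measures, X(n,m):
$$X(n,m) = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1m} \\
x_{21} & x_{22} & \cdots & x_{2m} \\
\vdots & \vdots &        & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nm}
\end{pmatrix}$$
vector of column means:
$$\bar{x} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_m), \quad \text{where} \quad
\bar{x}_j = \frac{\sum_{i=1}^{n} x_{ij}}{n}, \; j = 1, \ldots, m$$
Matrix of variances-covariances S(m,m) = (s²jl)
It is a square symmetric matrix
$$s^2_{jl} = Cov(x_j, x_l) = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{il} - \bar{x}_l)}{n-1}$$
$$S = \begin{pmatrix}
s^2_{11} & s^2_{12} & \cdots & s^2_{1m} \\
s^2_{21} & \cdots   & \cdots & s^2_{2m} \\
\vdots   & \vdots   &        & \vdots \\
s^2_{m1} & \cdots   & \cdots & s^2_{mm}
\end{pmatrix}$$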
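Multivariate Statistics
$$s^2_{jj} = Var(x_j) = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}{n-1}
= \frac{1}{n-1} \, \tilde{x}_j^T \tilde{x}_j$$
where $\tilde{x}_j$ is the mean-centred column j.
mean centered data matrix
$$\tilde{X}(n,m) = X(n,m) - \mathbf{1}_n (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_m) =
\begin{pmatrix}
x_{11} - \bar{x}_1 & x_{12} - \bar{x}_2 & \cdots & x_{1m} - \bar{x}_m \\
x_{21} - \bar{x}_1 & x_{22} - \bar{x}_2 & \cdots & x_{2m} - \bar{x}_m \\
\vdots & \vdots & & \vdots \\
x_{n1} - \bar{x}_1 & x_{n2} - \bar{x}_2 & \cdots & x_{nm} - \bar{x}_m
\end{pmatrix}$$
covariance matrix
$$S(m,m) = \frac{1}{n-1} \, \tilde{X}^T(m,n) \, \tilde{X}(n,m)$$
Standard deviations
$$(s_1, s_2, \ldots, s_m) = \big( (s^2_{11})^{1/2}, (s^2_{22})^{1/2}, \ldots, (s^2_{mm})^{1/2} \big)$$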
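The mean centring and covariance formulas translate directly into base MATLAB. The data matrix X below is made up; the built-in cov and std are used only to confirm the hand-computed result.

```matlab
% Made-up data: n = 4 objects (rows) x m = 3 variables (columns)
X = [1 2 3; 4 6 5; 7 8 10; 10 11 9];
[n, m] = size(X);

xbar = mean(X);                      % vector of column means (1 x m)
Xc   = X - repmat(xbar, n, 1);       % mean-centred data matrix
S    = (Xc' * Xc) / (n - 1);         % covariance matrix (m x m)
s    = sqrt(diag(S))';               % standard deviations (1 x m)

disp(max(max(abs(S - cov(X)))));     % agreement with the built-in cov (about 0)
disp(max(abs(s - std(X))));          % agreement with the built-in std (about 0)
```

Note the divisor n - 1, the number of objects minus one, not the number of variables.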
Multivariate Statistics
Correlation matrix C(m,m)
X(n,m) => mean centering => $\tilde{X}(n,m)$ => standardizing => Xs(n,m)
$$x_{ij} \;\rightarrow\; x_{ij} - \bar{x}_j \;\rightarrow\; \frac{x_{ij} - \bar{x}_j}{s_j}$$
$$C(m,m) = Corr(X) = \frac{1}{n-1} X_s^T X_s$$
Covariance matrix with respect to the origin:
$$M(m,m) = \frac{1}{n} X^T X$$
Summary, for the data matrix X(n,m) and its mean-centred version $\tilde{X}(n,m)$:
$$s^2_{jl} = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{il} - \bar{x}_l)}{n-1}, \qquad
s^2_{jj} = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}{n-1}$$
$$C = \begin{pmatrix}
r_{11} & r_{12} & \cdots & r_{1m} \\
r_{21} & \cdots & \cdots & r_{2m} \\
\vdots & \vdots &        & \vdots \\
r_{m1} & \cdots & \cdots & r_{mm}
\end{pmatrix}, \qquad
r_{ij} = \frac{s^2_{ij}}{s_i s_j}, \qquad s_i = \sqrt{s^2_{ii}}$$
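A base-MATLAB sketch of standardization and of the correlation matrix computed from the standardized data. The data matrix X is made up; corrcoef is used only as a cross-check.

```matlab
% Made-up data: n = 4 objects (rows) x m = 3 variables (columns)
X = [1 2 3; 4 6 5; 7 8 10; 10 11 9];
n = size(X, 1);

xbar = mean(X);
s    = std(X);                             % column standard deviations
Xc   = X - repmat(xbar, n, 1);             % mean centring
Xs   = Xc ./ repmat(s, n, 1);              % standardizing (autoscaling)

C = (Xs' * Xs) / (n - 1);                  % correlation matrix
M = (X' * X) / n;                          % covariance with respect to the origin

disp(diag(C)');                            % ones on the diagonal
disp(max(max(abs(C - corrcoef(X)))));      % agreement with corrcoef (about 0)
```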
Univariate Normal Distribution with mean μ and
standard deviation σ
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \frac{(x - \mu)^2}{\sigma^2}}$$
The sample mean, m, is an estimate of the population mean μ,
and the standard deviation of the sample, s, is an estimate of
the standard deviation of the population, σ
Multivariate Normal Distribution
μ = (μ1, μ2, ...., μn) population mean
$\bar{x} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n)$ sample mean, as an estimate of μ
covariance matrix Σ (matrix S is an estimate of Σ)
$$f(x_1, x_2, \ldots, x_n) = \frac{1}{|\Sigma|^{1/2} (2\pi)^{n/2}} \,
\exp\left( -\frac{1}{2} \, (x_1 - \mu_1, \ldots, x_n - \mu_n) \; \Sigma^{-1}
\begin{pmatrix} x_1 - \mu_1 \\ \vdots \\ x_n - \mu_n \end{pmatrix} \right)$$
Other subjects to consider (exercises):
•Statistical distributions (with MATLAB)
•Elementary Statistical functions (in MATLAB)
•Statistical tests
•ANOVA
•Experimental design...
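Comparison of sample mean with a
known value (population mean) (μ0):
BEGIN
n ≥ 30?
  yes: $z_{calc} = \dfrac{|\bar{x} - \mu_0|}{\sigma / \sqrt{n}}$
       zcalc < 1.96?
         yes: $\bar{x} = \mu_0$
         no:  $\bar{x} \neq \mu_0$
       END
  no:  $t_{calc} = \dfrac{|\bar{x} - \mu_0|}{s / \sqrt{n}}$
       tcalc < ttab (n − 1 d.f.)?
         yes: $\bar{x} = \mu_0$
         no:  $\bar{x} \neq \mu_0$
       END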
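The large-sample branch of this decision scheme (n ≥ 30) can be sketched in base MATLAB. The sample x and the reference value mu0 are made up, and the sample standard deviation s is used as the estimate of σ, which is an assumption of this sketch. The small-sample branch needs a tabulated t value, which base MATLAB does not provide.

```matlab
% Made-up sample of n = 36 measurements and a known reference value
x   = 10 + 0.5 * sin(1:36);
mu0 = 10;

n     = numel(x);
xbar  = mean(x);
s     = std(x);                          % estimate of sigma (n >= 30)
zcalc = abs(xbar - mu0) / (s / sqrt(n));

% 1.96 is the two-sided critical value of the normal distribution at 95%
same_mean = zcalc < 1.96;                % true: no significant difference found

fprintf('n = %d, zcalc = %.3f, same mean: %d\n', n, zcalc, same_mean);
```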
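Comparison between the means of two samples
BEGIN
n1 and n2 ≥ 30?
  yes: $z_{cal} = \dfrac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$
       zcal < 1.96?
         yes: μ1 = μ2
         no:  μ1 ≠ μ2
       END
  no:  normality?
         no:  transformation; normality after transformation?
                no:  NON-PARAMETRIC TESTS
                yes: continue with the F test
         yes: F test: σ1² = σ2²?
           yes (σ1² = σ2² = σ²):
             $s^2 = \dfrac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$
             $t_{cal} = \dfrac{|\bar{x}_1 - \bar{x}_2|}{s \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
             tcal < ttab (n1 + n2 − 2 d.f.)?
               yes: μ1 = μ2
               no:  μ1 ≠ μ2
             END
           no (σ1² ≠ σ2²):
             $t_{cal} = \dfrac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$
             either
             $t' = \dfrac{t_1 \dfrac{s_1^2}{n_1} + t_2 \dfrac{s_2^2}{n_2}}{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$, with t1 = ttab (n1 − 1 d.f.) and t2 = ttab (n2 − 1 d.f.)
             or
             t' = ttab for d.f.cal, with
             $d.f._{cal} = \dfrac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^2}{\dfrac{(s_1^2 / n_1)^2}{n_1 - 1} + \dfrac{(s_2^2 / n_2)^2}{n_2 - 1}}$
             tcal < t'?
               yes: μ1 = μ2
               no:  μ1 ≠ μ2
             END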
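Comparison between the means of two samples (differences d)
BEGIN
n ≥ 30?
  yes: $z_{cal} = \dfrac{|\bar{d}|}{s_d / \sqrt{n}}$
       zcal < 1.96?
         yes: $\bar{d} = 0$
         no:  $\bar{d} \neq 0$
       END
  no:  normality?
         no:  transformation; normality after the transformation?
                no:  non-parametric tests
                yes: continue with the t test
         yes: $t_{cal} = \dfrac{|\bar{d}|}{s_d / \sqrt{n}}$
              tcal < ttab (n − 1 d.f.)?
                yes: $\bar{d} = 0$
                no:  $\bar{d} \neq 0$
              END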
Principal Component Analysis (PCA)
Unsupervised Pattern Recognition
Supervised Pattern Recognition
>> load arch
>> whos
  Name     Size     Bytes   Class    Attributes
  arch     75x10    6000    double                  Data matrix
  class    75x1     600     double                  Classification index
  samps    75x5     750     char                    Sample labels
  vars     10x2     40      char                    Variable labels
>> plot(arch)
>> plot(arch')
[Figure: plot(arch), values versus sample number, and plot(arch'), values versus variable (Fe, Ti, Ba, Ca, K, Mn, Rb, Sr, Y, Zr)]
Data Statistics
min: 45
max: 1100
mean: 334.7000
median: 131
mode: 45
std: 386.7365
range: 1055
>> hist(arch(:,v))
>> boxplot(arch')
Variables:
1 Fe
2 Ti
3 Ba
4 Ca
5 K
6 Mn
7 Rb
8 Sr
9 Y
10 Zr
[Figure: boxplot (Values versus Column Number) and histograms of the ten variables (Fe, Ti, Ba, Ca, K, Mn, Rb, Sr, Y, Zr)]
Data pretreatment
>> xcal=arch(1:63,:);
>> xtest=arch(64:75,:);
>> axcal=auto(xcal);
>> subplot(1,2,1),plot(axcal);
>> subplot(1,2,2),plot(axcal');
>> boxplot(axcal)
[Figure: plot(axcal), plot(axcal') and boxplot(axcal) of the autoscaled calibration data; boxplot axes: Values versus Column Number]
Nr. of components
>> larch=svd(arch)
>> larch=larch(1:10)
>> plot(larch)
>> laxcal=svd(axcal)
laxcal =
   18.1975
   11.4439
    8.2437
    7.1865
    3.9787
    2.9436
    2.4939
    1.8726
    1.4955
    1.3505
>> plot(laxcal)
[Figure: singular values versus component number for the raw data (larch) and for the autoscaled calibration data (laxcal)]
PCA Principal components analysis
PCA on axcal
I/O: [scores,loads,ssq,res,reslm,tsqlm,tsq] = pca(data,plots,scl,lvs);
The input is the data matrix (data). Outputs are the scores
(scores), loadings (loads), variance info (ssq), residuals
(res), Q limit (reslm), T^2 limit (tsqlm), and T^2's (tsq).
Optional inputs are (plots) plots = 0 suppresses all plots,
plots = 1 [default] produces plots with no confidence limits,
plots = 2 produces plots with limits, plots = -1 plots the
eigenvalues only (without limits), a vector (scl) for
plotting scores against, (if scl = 0 sample numbers will
be used), and a scalar (lv) which specifies the
number of principal components to use in the model and
which suppresses the prompt for number of PCs.
[scores,loads,ssq,res,reslm,tsqlm,tsq]=pca(axcal);
Percent Variance Captured by PCA Model

Principal    Eigenvalue    % Variance    % Variance
Component    of            Captured      Captured
Number       Cov(X)        This PC       Total
---------    ----------    ----------    ----------
    1        5.34e+000       53.41         53.41
    2        2.11e+000       21.12         74.53
    3        1.10e+000       10.96         85.50
    4        8.33e-001        8.33         93.83
    5        2.55e-001        2.55         96.38
    6        1.40e-001        1.40         97.78
    7        1.00e-001        1.00         98.78
    8        5.66e-002        0.57         99.35
    9        3.61e-002        0.36         99.71
   10        2.94e-002        0.29        100.00
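The eigenvalues in this table are linked to the singular values of the autoscaled calibration matrix listed earlier (laxcal) by eigenvalue = (singular value)² / (n − 1), with n = 63 calibration samples. The sketch below redoes that arithmetic in base MATLAB from the printed singular values; it is a consistency check, not the toolbox's own code.

```matlab
% Singular values of axcal as printed above (63 samples x 10 variables)
laxcal = [18.1975 11.4439 8.2437 7.1865 3.9787 ...
          2.9436 2.4939 1.8726 1.4955 1.3505]';
n = 63;                            % number of calibration samples

ev     = laxcal.^2 / (n - 1);      % eigenvalues of Cov(X)
pct    = 100 * ev / sum(ev);       % % variance captured by each PC
cumpct = cumsum(pct);              % % variance captured in total

disp([(1:10)' ev pct cumpct]);     % compare with the table
```

For autoscaled data with 10 variables the eigenvalues sum to about 10, so each percentage is close to ten times its eigenvalue.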
[Figure: Variable Number vs. Loadings for PC# 1, PC# 2, PC# 3 and PC# 4; variables: 1 Fe, 2 Ti, 3 Ba, 4 Ca, 5 K, 6 Mn, 7 Rb, 8 Sr, 9 Y, 10 Zr]
[Figure: Sample Scores with 95% Limits for PC# 1, PC# 2, PC# 3 and PC# 4 versus Sample Number]
PLTLOADS Plots loadings from PCA
This function may be used to make 2-D and 3-D plots
of loadings vectors against each other. The inputs to
the function are the matrix of loadings vectors (loads)
where each column represents a loadings vector from the
PCA function and an optional variable of labels (labels)
which describe the original data variables.
Note: labels must be a "column vector" where each label
is in single quotes and has the same number of letters.
Example: labels = ['Height'; 'Weight'; 'Waist '; 'IQ ']
The function will prompt to select 2 or 3-D plots,
for the numbers of the PCs, and if you would like
"drop lines" and axes on the 3-D plots.
I/O: pltloads(loads,labels)
>> pltloads(loads,vars);
[Figure: Loadings for PC# 1 versus PC# 2, PC# 1 versus PC# 3 and PC# 1 versus PC# 4, points labelled with the variable names (Fe, Ti, Ba, Ca, K, Mn, Rb, Sr, Y, Zr)]
pltscrs(scores,samps(1:63,:),class(1:63,:))
[Figure: Scores for PC# 1 versus PC# 2, PC# 1 versus PC# 3 and PC# 1 versus PC# 4 for the 63 calibration samples, points labelled with the sample names (SH, ANA, BL and K groups)]
PCAPRO Projects new data on old principal components model.
Inputs are the new data (newdata), the old loadings (loads),
the old variance info (ssq), the limit for q (q), the
limit for t^2 (tsq) and an optional variable (plots) which
suppresses the plots when set to 0. Outputs are the new
scores (scores), residuals (res) and t^2 values (tsqvals).
These are plotted as the function proceeds if plots ~= 0.
The I/O format is:
[scores,resids,tsqs] = pcapro(newdata,loads,ssq,q,tsq,plots);
WARNING: Be sure that (newdata) is scaled the same as original data!
AUTO Autoscales matrix to mean zero unit variance
Autoscales a matrix (x) and returns the resulting matrix (ax)
with mean-zero unit variance columns, a vector of means (mx)
and a vector of standard deviations (stdx) used in the scaling.
I/O format is: [ax,mx,stdx] = auto(x);
SCALE Scales matrix as specified.
Scales a matrix (x) using means (mx) and standard
deviations (stds) specified.
I/O format is: sx = scale(x,mx,stdx);
axtest=scale(xtest,mx,stdx);
[scores_xtest]=pcapro(axtest,loads,ssq,reslm,tsqlm);
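What these calls do can be sketched with base MATLAB only. The code below is not the toolbox implementation of auto, scale or pcapro; it is a minimal illustration under stated assumptions: made-up calibration and test matrices, loadings taken from the SVD of the autoscaled calibration data, and k = 2 retained components.

```matlab
% Made-up calibration (6 x 3) and test (2 x 3) data
xcal  = [1 10 5; 2 12 4; 3 15 6; 4 13 8; 5 18 7; 6 20 9];
xtest = [2.5 14 5; 5.5 17 8];
ncal  = size(xcal, 1);
ntest = size(xtest, 1);

% Autoscale the calibration data: mean zero, unit variance columns
mx    = mean(xcal);
stdx  = std(xcal);
axcal = (xcal - repmat(mx, ncal, 1)) ./ repmat(stdx, ncal, 1);

% PCA model of the calibration data with k components
[U, Sv, V] = svd(axcal, 'econ');
k      = 2;
loads  = V(:, 1:k);
scores = axcal * loads;

% Scale the test data with the CALIBRATION means and standard deviations
axtest = (xtest - repmat(mx, ntest, 1)) ./ repmat(stdx, ntest, 1);

% Project the test data onto the old model
scores_test = axtest * loads;
res_test    = axtest - scores_test * loads';   % part not explained by the model
q_test      = sum(res_test.^2, 2);             % sum of squared residuals per sample
```

Scaling the new samples with their own mean and standard deviation would place them on a different scale from the model, which is what the warning in the pcapro help text is about.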
[Figure: New Sample Scores with 95% Limits from Old Model, for PC# 1, PC# 2, PC# 3 and PC# 4 versus Sample Number (12 test samples)]
pltscrs([scores;scores_xtest],samps);
[Figure: Scores for PC# 1 versus PC# 2, PC# 1 versus PC# 3 and PC# 1 versus PC# 4, showing the calibration samples (SH, ANA, BL and K groups) together with the 12 projected test samples, labelled s1 to s12]
Exercise: multivariate data analysis of
environmental samples
• NW Mediterranean contamination by
organic compounds
>> load env
>> whos
  Name         Size     Bytes    Class     Attributes
  sampnames    22x1     1458     cell
  textdata     74x2     10874    cell
  varnames     74x1     6296     cell
  x            22x96    16896    double
>> plot(x)
>> plot(x')
[Figure: plot(x), values (scale x 10^4) versus samples, and plot(x'), values (scale x 10^4) versus variables; the dominant signal is variable 25, UCM; the PCBs region is also marked]
Samples:
1 'Ty27'
2 'BC12'
3 'BC15'
4 'Ty23'
5 'TyK'
6 'Ty8'
7 'Ty17'
8 'BC4'
9 'Ty3'
10 'Ty19'
11 'BC8'
12 'A2'
13 'BC10'
14 'BC6'
15 'BC4'
16 'D3'
17 'BC9'
18 'D2'
19 'C1'
20 'D1'
21 'BC11'
22 'BC7'
Variables:
alkanes 1-24
1 n-C16
2 n-C17
3 n-C18
4 n-C19
5 n-C20
6 n-C21
7 n-C22
8 n-C23
9 n-C24
10 n-C25
11 n-C26
12 n-C27
13 n-C28
14 n-C29
15 n-C30
16 n-C31
17 n-C32
18 n-C33
19 n-C34
20 n-C35
21 n-C36
22 n-C37
23 n-C38
24 n-C39
25 UCM (
PAHs, alkenes 26-65
26 pristane
27 phytane
28 fluoranthene
29 phenanthrene
30 anthracene
31 methylphenanthrene
32 dimethylphenanthrenes
33 fluoranthene
34 acephenantrylene
35 pyrene
36 methylfluoranthenes
37 benzo[a]fluorene
38 benzo[b]fluorene
39 retene
40 benzo[b]phenanthrene
41 benz[a]anthracene
42 chrysene + triphenylene
43 benzo[j+b+k]fluoranthenes
44 benzo[a]fluoranthene
45 benzo[e]pyrene
46 benzo[a]pyrene
47 perylene
48 indeno[7,1,2,3-cdef]chrysene
49 indeno[1,2,3-cd]pyrene
50 benzo[ghi]perylene
51 benzo[ghi]fluoranthene
52 cyclopenta[cd]pyrene
53 dibenzoanthracenes
54 benzo[b]chrysene
55 coronene
56 302 ??
57 naphtho[1,2-b]thiophene
58 dibenzothiophene
59 naphtho[2,1-b]thiophene
60 4-methyldibenzothiophene
61 3,2-methyldibenzothiophene
62 1-methyldibenzothiophene
63 benzo[b]naphtho[2,1-d]thiophene
64 benzo[b]naphtho[1,2-d]thiophene
65 benzo[b]naphtho[2,3-b]thiophene
organochlorine compounds, PCBs 66-84
66 PCB-52
67 PCB-101
68 PCB-118
69 PCB-153
70 PCB-138
71 PCB-187
72 PCB-128
73 PCB-180
74 PCB-170
75 o,p'-DDD
76 o,p'-DDE
77 o,p'-DDT
78 p,p'-DDE
79 p,p'-DDD
80 p,p'-DDT
81 hexachlorobenzene
82 hexachlorohexane
83 lindane
84 octachlorostyrene
sterols 85-96
85 27-nor-24-methylcholesta-5a,22(E)-dien-3β-ol
86 Cholesta-5a,22(E)-dien-3β-ol
87 Cholesterol
88 Cholestanol
89 brassicasterol
90 24-methyl-5α(H)-cholest-22(E)-en-3β-ol
91 24-methylcholest-5-en-3β-ol
92 stigmasterol
93 24-ethyl-5α-cholest-22-en-3β-ol
94 β-sitosterol
95 24-ethyl-5α-cholestan-3β-ol
96 dinosterol
[Figure: boxplots (Values versus Column Number) of variables 1-24, of variables 26-50 (excluding variable 25, UCM), of variables 51-74, and of all 96 variables (scale x 10^4)]
>> stdx=std(x);
>> mx=mean(x);
>> sx=scale(x,zeros(1,96),stdx);
>> plot(sx)
>> plot(sx')
>> lsx=svd(sx);
>> lsx=lsx(1:10)
lsx =
   74.5755
   28.6436
   14.0461
    9.3163
    8.7101
    8.3390
    7.3844
    6.4339
    5.5273
    5.1046
[Figure: plot(sx) versus samples, plot(sx') versus variables, and the singular values lsx versus component number; annotation on the singular value plot: 3 components?]
[scores,loads,ssq,res,reslm,tsqlm,tsq]=pca(sx);
Warning: Data does not appear to be mean centered.
Variance captured table should be read as sum of
squares captured.
Percent Variance Captured by PCA Model

Principal    Eigenvalue    % Variance    % Variance
Component    of            Captured      Captured
Number       Cov(X)        This PC       Total
---------    ----------    ----------    ----------
    1        2.65e+002       78.50         78.50
    2        3.91e+001       11.58         90.08
    3        9.39e+000        2.78         92.86
    4        4.13e+000        1.23         94.09
    5        3.61e+000        1.07         95.16
    6        3.31e+000        0.98         96.14
    7        2.60e+000        0.77         96.91
    8        1.97e+000        0.58         97.49
    9        1.45e+000        0.43         97.92
   10        1.24e+000        0.37         98.29
[Figure: Variable Number vs. Loadings for PC# 1, PC# 2 and PC# 3; annotated variable groups: alkanes, alkanes of higher molecular weight, PAHs, PCBs, sterols]
[Figure: Sample Scores with 95% Limits for PC# 1, PC# 2 and PC# 3 versus Sample Number]
>> pltloads(loads);
[Figure: Loadings for PC# 1 versus PC# 2, points labelled with the variable numbers; annotated groups: alkanes, PCBs, sterols, PAHs]
[Figure: Loadings for PC# 1 versus PC# 3, points labelled with the variable numbers; annotated groups: alkanes, PAHs, PCBs]
>> pltscrs(scores,samp)
[Figure: Scores for PC# 1 versus PC# 2, points labelled with the sample names; annotated group: open sea]
[Figure: Scores for PC# 1 versus PC# 3, points labelled with the sample names]
>> cluster(x,samp)
[Figure: Dendrogram Using Mahalanobis Distance on 3 PCs, leaves labelled with the sample names; x-axis: Distance to K-Nearest Neighbor; annotated groups: BCN, Gulf of Lion, open sea]
autoscaled data
[axscores,axloads,axssq,axres,axreslm,axtsqlm,axtsq]=pca(ax);
Percent Variance Captured by PCA Model

Principal    Eigenvalue    % Variance    % Variance
Component    of            Captured      Captured
Number       Cov(X)        This PC       Total
---------    ----------    ----------    ----------
    1        3.98e+001       41.42         41.42
    2        2.57e+001       26.78         68.21
    3        8.07e+000        8.41         76.62
    4        3.72e+000        3.88         80.50
    5        3.33e+000        3.47         83.97
    6        3.22e+000        3.35         87.32
    7        2.60e+000        2.70         90.02
    8        1.93e+000        2.01         92.03
    9        1.31e+000        1.36         93.39
   10        1.17e+000        1.22         94.62
Variable Number vs. Loadings for PC# 1
0.15
0.1
Loadings for PC# 1
0.05
0
-0.05
-0.1
-0.15
-0.2
0
10
20
30
40
50
60
Variable Number
70
80
90
100
Variable Number vs. Loadings for PC# 2
0.02
Variable Number vs. Loadings for PC# 3
0.3
0
0.2
-0.02
0.1
-0.06
Loadings for PC# 3
Loadings for PC# 2
-0.04
-0.08
-0.1
0
-0.1
-0.12
-0.2
-0.14
-0.3
-0.16
-0.18
0
10
20
30
40
50
60
Variable Number
70
80
90
100
-0.4
0
10
20
30
40
50
60
Variable Number
70
80
90
100
Sample Scores with 95% Limits
20
15
Score on PC# 1
10
5
0
-5
-10
-15
0
5
10
15
20
25
20
25
Sample Number
Sample Scores with 95% Limits
6
10
4
5
2
Score on PC# 3
Score on PC# 2
Sample Scores with 95% Limits
15
0
0
-5
-2
-10
-4
-15
0
5
10
15
Sample Number
20
25
-6
0
5
10
15
Sample Number
Loadings for PC# 1 versus PC# 2
0.02
79
0
75
-0.02
24
23
22
21
-0.04
Loadings for PC# 2
8482
77
87
19
-0.08
20
52
55
-0.1
39
16 91
50
9
2
11
25
62
54
44 30
58
53
4335
38
33
45
63
29
46 59
60
61 28
4241
40
37 65
13
1027 94 70
14
18
15
48
49
-0.14
80
89
47 66
67
85 81
74
71
9573
93
92127868
6 88 26
69
72
90
96
17
34
56
-0.12
86
76
-0.06
PAHS
83
alcanes
8 7
4
57
5
51
13
alcanes
31
32
64
-0.16
36
-0.18
-0.2
-0.15
-0.1
-0.05
0
Loadings for PC# 1
0.05
0.1
0.15
PCBs
esterols
Loadings for PC# 1 versus PC# 3
0.3
96
0.2
91
58
43 28
62
50
45 60
61
64
53
4963
46 65
55
42
29
40 54 36
56
41
30
44 33
59 5231
32
35
34
38
37
48
Loadings for PC# 3
0.1
0
39
1
3 45
76
57
51
25
83
77
9
27
-0.1
12
11
20
1013
24
2322
21
-0.2
93
86
92 95
89
88
2 94
85
47
6 90
87
66
84
7 75 26
6867
79
78
69
70
81
8
7273
71
82
74
80
14
15
19
17
-0.3
-0.4
-0.2
-0.15
-0.1
-0.05
0
Loadings for PC# 1
16
0.05
18
0.1
0.15
Scores for PC# 1 versus PC# 2
15
A2
10
BCN
Gulf of Lion
BC11
BC8
Scores on PC# 2
5
Ebro Delta
D2
D1
0
D3
BC4
Ty8
BC4
BC6
TyKTy23
BC10
C1
Ty17
Ty19
BC9
Ty27
open sea
less contam.
BC12
-10
-15
-10
Ty3
BC7
-5
BC15
-5
0
5
Scores on PC# 1
10
15
20
Sample number and name:
 1  'Ty27'
 2  'BC12'
 3  'BC15'
 4  'Ty23'
 5  'TyK'
 6  'Ty8'
 7  'Ty17'
 8  'BC4'
 9  'Ty3'
10  'Ty19'
11  'BC8'
12  'A2'
13  'BC10'
14  'BC6'
15  'BC4'
16  'D3'
17  'BC9'
18  'D2'
19  'C1'
20  'D1'
21  'BC11'
22  'BC7'
[Figure: Scores for PC# 1 versus PC# 3; x-axis: Scores on PC# 1, y-axis: Scores on PC# 3; points labelled by sample name; annotated groups: BCN, Ebro Delta]
cluster(x,samp)
Dendrogram Using Mahalanobis Distance on 3 PCs
[Figure: dendrogram of the 22 samples; x-axis: Distance to K-Nearest Neighbor (0 to 2); annotated groups: BCN, Gulf of Lion, open sea]
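A minimal sketch of what "Mahalanobis distance on 3 PCs" means, assuming the PC scores are mutually uncorrelated (as PCA scores are), so that the distance between two samples reduces to a Euclidean distance on scores scaled by their variances. The helper name `mahalanobis_on_scores` and the toy numbers are hypothetical; how the toolbox's `cluster` function computes the distance internally is not shown in the slides.

```python
import math

def mahalanobis_on_scores(t_a, t_b, score_variances):
    """Mahalanobis distance between two samples in PC score space.

    Assumes uncorrelated scores, so the covariance matrix is diagonal
    and holds one variance per retained PC.
    """
    return math.sqrt(
        sum((a - b) ** 2 / v for a, b, v in zip(t_a, t_b, score_variances))
    )

# Toy example (hypothetical numbers): two samples described by 3 PC scores.
sample_a = [2.0, 0.0, 0.0]
sample_b = [0.0, 0.0, 3.0]
variances = [4.0, 1.0, 9.0]  # variance of the scores on PC 1, 2 and 3
d = mahalanobis_on_scores(sample_a, sample_b, variances)
```

Scaling by the score variance gives every retained PC the same weight, so a low-variance PC such as PC 3 contributes to the clustering as much as PC 1.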
CHEMOMETRICS STUDY OF CTD ARCTIC SEA WATER DATA
• Introduction
• CTD data description
• PCA results for XTOT, Xdcm, Xsurf, Xdeep
• PLS prediction yfluor = f(Xdcm)
• PARAFAC modelling of X(80,200,10)
• MCR of Xfluor, Xcond, Xtemp,...
• PCA of continuous integrated data
ATOS PROJECT (July 2007)
[Figure: maps of the CTD station positions, labelled by station number; plots of longitude (long) and latitude (latd) versus sample number (10 to 80); annotations on the station map: NE, ICE, colder waters, warmer waters, SW]
CTD data evaluation and integration
10 CTD measured variables:
1 Depth, press
2 Temperature, temp
3 Conductivity, cond
4 Salt concentration, salt
5 Dissolved oxygen, oxyg
6 Beam light transmission, btrm
7 Fluorescence, fluor
8 Turbidity, turb
9 Latitude, latd
10 Longitude, long
[Diagram: 80 stations; one data table per station (Station CTD1, ..., Station CTD20, ..., Station CTD49), each with 10 variables measured at depths (100, ..., 1000 m)]
Fast data acquisition: data should be averaged and filtered
Gross data table: X(53367x10)
81 CTD experiments (49 stations with replicates and depths to 100-1000 m)
Station 19 was removed because it had only 54 depths
80 experiments, 49 stations
Building X(10,80,200)
For each variable: Xvar(80,200), Yvar
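A hedged Python sketch of one way a fast-acquired CTD profile could be averaged onto a fixed number of depth levels before the experiments are stacked into Xvar(80,200). The slides do not say which averaging or filtering was actually used; the function name `bin_average`, the equal-width depth bins and the toy readings are all assumptions made for illustration.

```python
def bin_average(depths, values, n_bins, max_depth):
    """Average `values` into `n_bins` equal-width depth bins on [0, max_depth).

    Returns a list of length n_bins; bins without any reading hold None.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    width = max_depth / n_bins
    for d, v in zip(depths, values):
        if 0 <= d < max_depth:
            # Guard against floating-point rounding at the last bin edge.
            i = min(int(d // width), n_bins - 1)
            sums[i] += v
            counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

# Toy profile (hypothetical numbers): 6 readings averaged onto 3 depth levels.
depths = [1.0, 4.0, 12.0, 18.0, 25.0, 29.0]
temps = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
profile = bin_average(depths, temps, n_bins=3, max_depth=30.0)
```

Applying the same binning to every experiment gives rows of equal length, which is what allows the profiles of one variable to be arranged as a single matrix.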
[Figure: colour maps of temp, cond, salt, temp/salt, oxyg, btrm, turb and fluor over the 80 CTD experiments; annotations: SW, NE, NW and stations 11, 23, 37, 49]
[Figure: four panels showing temp, salt, oxyg and fluor at three depth zones (surf, dcm, deep) versus experiment number (0 to 80)]
Correlation at the DCM (chlorophyll maximum, fluorescence maximum)
Variable number and name:
 1  'press'
 2  'temp'
 3  'cond'
 4  'salt'
 5  'oxyg'
 6  'btra'
 7  'fluo'
 8  'turb'
 9  'long'
10  'latd'
[Figure: Correlation Map, Variables in Original Order; colour scale from -0.8 to 1; Scale Gives Value of R for Each Variable Pair]
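The correlation map displays a value of R for every pair of the 10 CTD variables. As a hedged illustration, assuming R is the usual Pearson correlation coefficient, it can be computed as below. The helpers `pearson_r` and `correlation_matrix` and the toy columns are hypothetical and are not the ATOS data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Matrix of R values, one row and one column per variable."""
    return [[pearson_r(a, b) for b in columns] for a in columns]

# Toy columns (hypothetical numbers) standing in for three variables.
temp = [1.0, 2.0, 3.0, 4.0]
salt = [2.0, 4.0, 6.0, 8.0]   # rises with temp
oxyg = [4.0, 3.0, 2.0, 1.0]   # falls as temp rises
R = correlation_matrix([temp, salt, oxyg])
```

In this toy case `R[0][1]` is close to +1 and `R[0][2]` close to -1, the two ends of the colour scale used in the map.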
PCA of X(80,10) at DCM
Percent Variance Captured by PCA Model
Principal   Eigenvalue   % Variance   % Variance
Component   of           Captured     Captured
Number      Cov(X)       This PC      Total
---------   ----------   ----------   ----------
    1       3.95e+000      39.52        39.52
    2       3.10e+000      31.00        70.53
    3       1.54e+000      15.42        85.95
    4       8.46e-001       8.46        94.41
[Figure: Variables/Loadings Plots for ydcm; three panels showing Loadings on PC 1 (39.52%), Loadings on PC 2 (31.00%) and Loadings on PC 3 (15.42%) versus Variable (1 to 10), points labelled press, temp, cond, salt, oxyg, btra, fluo, turb, long, latd]
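The percent variance columns of the DCM PCA table follow directly from the eigenvalues. With autoscaled data (each of the 10 variables scaled to unit variance, which the table values suggest but the slide does not state), the eigenvalues of Cov(X) sum to the number of variables, so each PC captures its eigenvalue divided by 10. A short sketch that checks this against the rounded eigenvalues printed in the table:

```python
# Rounded eigenvalues as printed for PCs 1 to 4 of the DCM model.
eigenvalues = [3.95, 3.10, 1.54, 0.846]
n_variables = 10  # assumption: autoscaled data, total variance = 10

# Percent variance captured by each PC.
percent = [100.0 * ev / n_variables for ev in eigenvalues]

# Running total, i.e. the "% Variance Captured Total" column.
cumulative = []
total = 0.0
for p in percent:
    total += p
    cumulative.append(total)
```

The results agree with the table to within the rounding of the printed eigenvalues, and the running total shows that four PCs describe about 94% of the variance at the DCM.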
[Figure: Samples/Scores Plot of ydcm, PCA of X(80,10) at DCM; x-axis: Scores on PC 1 (39.52%), y-axis: Scores on PC 2 (31.00%); samples labelled by station (decluttered); variable directions annotated: oxyg, fluor, turb, temp, cond, salt, btrm]
PCA of X(80,10) at surface
Excluded sample 67, 42b
Percent Variance Captured by PCA Model
Principal   Eigenvalue   % Variance   % Variance
Component   of           Captured     Captured
Number      Cov(X)       This PC      Total
---------   ----------   ----------   ----------
    1       3.86e+000      38.55        38.55
    2       2.01e+000      20.11        58.67
    3       1.53e+000      15.27        73.94
    4       1.19e+000      11.87        85.81
[Figure: Variables/Loadings Plots for ysurf; three panels showing Loadings on PC 1 (38.55%), Loadings on PC 2 (20.11%) and Loadings on PC 3 (axis label reads 13.35%) versus Variable (1 to 10), points labelled press, temp, cond, salt, oxyg, btra, fluo, turb, long, latd]
[Figure: Samples/Scores Plot of ysurf, PCA of X(80,10) at surface, excluded station 67; x-axis: Scores on PC 1 (38.55%), y-axis: Scores on PC 2 (20.11%); samples labelled by station (decluttered); variable directions annotated: temp, cond, oxyg, btrm, latd, long, salt, fluor]
Fluorescence PLS prediction from other parameters
(DCM data X(80,9), y(80,1))
Linear regression model using
Partial Least Squares calculated with SIMPLS
Cross validation: random samples w/ 8 splits
Percent Variance Captured by Regression Model
        -----X-Block-----    -----Y-Block-----
Comp     This     Total       This     Total
----    -------  -------     -------  -------
  1      29.00    29.00       76.95    76.95
  2      38.72    67.72        2.51    79.46
[Figure: Variables/Loadings Plots (Reg Vector for Y 1 and VIP Scores for Y 1 versus Variable, points labelled press, temp, cond, salt, oxyg, btra, turb, long, latd) and Samples/Scores Plot of ydcm (Y Predicted 1 versus Y Measured 1, samples labelled by station, decluttered)]
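The "random samples w/ 8 splits" cross-validation used for this model can be pictured as dividing the 80 DCM samples at random into 8 test subsets, each left out once while the model is fitted on the rest. The sketch below shows only that partitioning step; the helper name `random_splits` and the seed are assumptions, the PLS fitting itself is not shown, and the toolbox may additionally repeat the random partition several times.

```python
import random

def random_splits(n_samples, n_splits, seed=0):
    """Randomly partition sample indices into n_splits disjoint test subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices out to the subsets in turn.
    return [sorted(indices[i::n_splits]) for i in range(n_splits)]

splits = random_splits(n_samples=80, n_splits=8)

train_sizes = []
for test_idx in splits:
    held_out = set(test_idx)
    train_idx = [i for i in range(80) if i not in held_out]
    # A PLS model would be fitted on train_idx and used to predict test_idx.
    train_sizes.append(len(train_idx))
```

Every sample is predicted exactly once by a model that never saw it, which is what makes the cross-validated error a fairer estimate than the fit error.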
Fluorescence PLS prediction from other parameters
excluding beam transmission and turbidity
(DCM data X(80,7), y(80,1))
Linear regression model using
Partial Least Squares calculated with SIMPLS
Cross validation: random samples w/ 8 splits
Percent Variance Captured by Regression Model
        -----X-Block-----    -----Y-Block-----
Comp     This     Total       This     Total
----    -------  -------     -------  -------
  1      34.94    34.94       33.23    33.23
  2      39.55    74.49       10.53    43.76
  3      11.72    86.21       12.44    56.20
[Figure: Variables/Loadings Plots for yred11 (Loadings on LV 3 (11.72%) and VIP Scores for Y 1 versus Variable, points labelled press, temp, cond, salt, oxyg, long, latd) and Samples/Scores Plot of ydcm (Y Predicted 1 versus Y Measured 1, samples labelled by station, decluttered)]
Fluorescence PLS prediction from other parameters
(surf data X(80,9), y(80,1))
Percent Variance Captured by Regression Model
        -----X-Block-----    -----Y-Block-----
Comp     This     Total       This     Total
----    -------  -------     -------  -------
  1      33.89    33.89       29.08    29.08
  2      22.29    56.18        3.53    32.61
  3      14.83    71.01        2.76    35.37
  4       4.80    75.81        5.10    40.48
[Figure: Variables/Loadings Plots for ysurf (Reg Vector for Y 1 and VIP Scores for Y 1 versus Variable, points labelled press, temp, cond, salt, oxyg, btra, turb, long, latd) and Samples/Scores Plot of ysurf (Y Predicted 1 versus Y Measured 1, samples labelled by station, decluttered)]