Data Dimensionality Reduction


Data Dimensionality Reduction: Introduction to Principal Component Analysis

Case Study: Multivariate Analysis of Chemistry-Property Data in Molten Salts

C. Suh (1), S. Graduciz (2), M. Gaune-Escard (2), K. Rajan (1)
Combinatorial Sciences and Materials Informatics Collaboratory
(1) Iowa State University; (2) CNRS, Marseilles, France
PRINCIPAL COMPONENT ANALYSIS: PCA
From a set of N correlated descriptors, we can derive a set of N uncorrelated
descriptors (the principal components). Each principal component (PC) is a
suitable linear combination of all the original descriptors. PCA reduces the
information dimensionality that is often needed from the vast arrays of data
in a way that minimizes the loss of information.
(From Nature Reviews Drug Discovery 1, 882-894 (2002): "Integration of Virtual
and High Throughput Screening", Jürgen Bajorath; and Materials Today, October
2005: "Materials Informatics", K. Rajan.)
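As a quick illustration of this idea (not part of the original slides), the following Python sketch builds four strongly correlated synthetic descriptors and checks that the derived principal components are uncorrelated linear combinations of them; scikit-learn and all variable names here are illustrative additions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic example: 200 samples of N = 4 descriptors that are strongly correlated.
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(4)])

pca = PCA()                     # keep all N components
scores = pca.fit_transform(X)   # the N uncorrelated descriptors (PC scores)

# The original descriptors are highly correlated; the PC scores are uncorrelated.
print(np.round(np.corrcoef(X, rowvar=False), 2))
print(np.round(np.corrcoef(scores, rowvar=False), 2))

# Each PC is a linear combination of all the original descriptors; the rows of
# pca.components_ hold the combination weights.
print(np.round(pca.components_, 2))
```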
I.   Functionality 1 = F(x1, x2, x3, x4, x5, x6, x7, x8, …)
     Functionality 2 = F(x1, x2, x3, x4, x5, x6, x7, x8, …)
     …

II.  x1 = f(x2)

III. x2 = g(x3)
     x3 = h(x4)
     …
PC1 = A1 X1 + A2 X2 + A3 X3 + A4 X4 + …
PC2 = B1 X1 + B2 X2 + B3 X3 + B4 X4 + …
PC3 = C1 X1 + C2 X2 + C3 X3 + C4 X4 + …
DIMENSIONALITY REDUCTION: Case study

A database of molten salt properties tabulates numerous properties for each chemistry:
• What can we learn beyond a "search and retrieve" function?
• Can we find multivariate correlation(s) among all chemistries and properties?
• The challenge is to reduce the dimensionality of the data set.
Principal component analysis (PCA) involves a mathematical
procedure that transforms a number of (possibly) correlated variables
into a (smaller) number of uncorrelated variables called principal
components.
The first principal component accounts for as much of the variability
in the data as possible, and each succeeding component accounts for
as much of the remaining variability as possible.
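As an illustration of this ordering (my own sketch, not part of the original report), the fractions of variance explained by successive components are non-increasing; scikit-learn and the synthetic table below are assumptions for demonstration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical stand-in for a chemistry-property table: 100 rows, 5 correlated columns.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(100, 5))

X_scaled = StandardScaler().fit_transform(X)   # unit-variance scaling, as used later
pca = PCA().fit(X_scaled)

# The first component explains the most variance; each succeeding one explains less.
print(np.round(pca.explained_variance_ratio_, 3))
```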
Dimensionality Reduction of Molten Salts Data
(Janz's Molten Salts Database: 1700 chemistries with 7 variables.)

Melting point = F(x1, x2, x3, x4, x5, x6, x7, x8, …)
Density = F(x1, x2, x3, x4, x5, x6, x7, x8, …)
…
where xi = molten salt compound chemistries

x1 = f(x2)
x2 = g(x3)
x3 = h(x4)
…

Eq.wt: equivalent weight
MP: melting point
Temp: temperature of the measurements
Eq.con: equivalent conductance
Spe.con: specific conductance
D: density
V: viscosity

[Figure: matrix of bivariate scatter plots among MP, Temp, Eq.con, Spe.con, D, V, and Eq.wt; the BiCl3 series (high viscosity) stands out.]
Mathematically, PCA relies on the fact that most of the descriptors are
interrelated and that, in some instances, these correlations are high. It results
in a rotation of the coordinate system such that the axes show a maximum of
variation (covariance) along their directions.
This description can be condensed mathematically into a so-called eigenvalue
problem.
• The data manipulation involves decomposition of the data matrix X into two
matrices, T and P. The two matrices P and T are orthogonal. The matrix P is
usually called the loadings matrix, and the matrix T is called the scores matrix.
• The eigenvectors of the covariance matrix constitute the principal
components. The corresponding eigenvalues indicate how much
"information" is contained in the individual components.
• The loadings can be understood as the weights for each original variable
when calculating the principal components. The matrix T contains the
original data in a rotated coordinate system.
• The mathematical analysis involves finding these new "data" matrices T
and P. The dimension of T (i.e., its rank) that captures all the information
of the entire data set A (i.e., the number of variables) is far smaller than that
of X (ideally 2 or 3). One thereby compresses the N-dimensional representation
of the data matrix X into a 2- or 3-dimensional plot of T and P, as in the sketch below.
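A minimal NumPy sketch of this step (my own illustration, not from the report): the loadings P come from the eigenvalue problem on the covariance matrix, the scores T = X P hold the rotated data, and keeping only the first two columns of T gives the compressed representation. All array shapes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative scaled data matrix X: m samples (rows) by N variables (columns).
X = rng.normal(size=(50, 1)) @ rng.normal(size=(1, 6)) + 0.1 * rng.normal(size=(50, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # unit-variance scaling

# Eigenvectors of the covariance matrix are the loadings P; the eigenvalues measure
# how much "information" (variance) each component carries.
cov = np.cov(X, rowvar=False)
eigvals, P = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]                  # sort components by decreasing variance
eigvals, P = eigvals[order], P[:, order]

T = X @ P                                          # scores: data in the rotated coordinate system

# Keeping only the first 2 (or 3) columns of T compresses the N-dimensional data
# into a low-dimensional representation with minimal loss of information.
T2 = T[:, :2]
print(T2.shape, np.round(eigvals / eigvals.sum(), 3))
```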
PC1 = A1 X1 + A2 X2 + A3 X3 + A4 X4 + …
PC2 = B1 X1 + B2 X2 + B3 X3 + B4 X4 + …
PC3 = C1 X1 + C2 X2 + C3 X3 + C4 X4 + …
The first principal component accounts for the maximum variance (eigenvalue) in
the original dataset. The second, third (and higher-order) principal components are
orthogonal (uncorrelated) to the first and account for most of the remaining
variance.
• A new row space is constructed in which to plot the data, where the axes
represent the weighted linear combinations of the variables affecting the data.
Each of these linear combinations is independent of the others and hence
orthogonal.
• The data, when plotted in this new space, form essentially a correlation plot,
where the position of each data point captures not only all the influences of the
variables on that point but also its relative influence compared to the other data points.
There is minimal contribution to additional information content beyond the
higher-order principal components. A "scree" plot helps to identify the number
of PCs needed to capture the reduced dimensionality.

NB: depending upon the nature of the data set, this can be within 2, 3, or higher
principal components, but still fewer than the number of variables in the original
data set.

[Figure: scree plot of eigenvalue versus principal component (PC1, PC2, PC3, PC4, PC5, …).]
Thus the mth PC is orthogonal to all others and has the mth largest variance
in the set of PCs. Once the N PCs have been calculated using eigenvalue/
eigenvector matrix operations, only PCs with variances above a critical level
are retained (scree test), as in the sketch below.
The M-dimensional principal component space retains most of the
information from the initial N-dimensional descriptor space by projecting it
onto orthogonal axes of high variance. The complex tasks of prediction or
classification are made easier in this compressed, reduced-dimensional space.
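A small sketch of this retention step (illustrative code, not from the original report): eigenvalues below a chosen critical level are discarded. The eigenvalues and the mean-eigenvalue cutoff here are hypothetical; the cutoff is a modeling choice.

```python
import numpy as np

# Hypothetical eigenvalues of a scaled 7-variable data set, sorted in decreasing order.
eigenvalues = np.array([3.5, 1.6, 1.4, 0.3, 0.1, 0.06, 0.04])

# Scree-style criterion: keep components whose eigenvalue exceeds a critical level
# (here the mean eigenvalue, one common rule of thumb).
critical_level = eigenvalues.mean()
n_retained = int(np.sum(eigenvalues > critical_level))

print(f"Retain the first {n_retained} of {eigenvalues.size} principal components")
print("Cumulative variance captured:",
      np.round(np.cumsum(eigenvalues) / eigenvalues.sum(), 3)[:n_retained])
```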
PCA: algorithmic summary

1. Generation of the data matrix, A

   The data matrix A has 1700 rows (different molten salts) and 7 columns (properties).
   The properties in this example are:
   1) equivalent weight  2) melting point  3) temperature of the measurements
   4) equivalent conductance  5) specific conductance  6) density  7) viscosity

   A = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix}

2. Scaling (normalization) of the data matrix, X

   X is a scaled version of A. The matrix below is an example of "unit variance"
   scaling; each s_{kj} represents a standard deviation.

   X = \begin{pmatrix} a_{11}/s_{k1} & \cdots & a_{1n}/s_{kn} \\ \vdots & & \vdots \\ a_{m1}/s_{k1} & \cdots & a_{mn}/s_{kn} \end{pmatrix}

3. Covariance matrix of the scaled data matrix, S

   S is the covariance matrix of X:

   S = \mathrm{cov}(X) = \frac{X^{T} X}{m - 1}

4. Eigenvalue decomposition of the covariance matrix

   P is called the loading (or eigenvector) matrix; \Lambda is the eigenvalue matrix
   (eigenvalues on the diagonal of this diagonal matrix):

   S = P \Lambda P^{T} \quad \text{or} \quad \mathrm{cov}(X)\, p_{i} = \lambda_{i} p_{i}

5. Calculation of scores from the loadings

   X = t_{1} p_{1}^{T} + t_{2} p_{2}^{T} + \cdots + t_{k} p_{k}^{T} + E, \quad \text{where } k \le \min\{m, n\}

   t_{i}: scores (orthogonal); p_{i}: loadings (orthonormal)
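As a concrete illustration of these five steps, here is a minimal NumPy sketch (my own addition, not part of the original report). The data matrix is a small synthetic stand-in for the 1700 × 7 Janz table, and all names and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Step 1: data matrix A (20 illustrative rows and 7 columns standing in for the
# 1700 x 7 table: eq. weight, MP, temp, eq. cond., spec. cond., density, viscosity).
A = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 7)) + 0.1 * rng.normal(size=(20, 7))
m, n = A.shape

# Step 2: unit-variance scaling (mean-center each column, then divide by its std. dev.).
X = (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)

# Step 3: covariance matrix of the scaled data, S = X^T X / (m - 1).
S = X.T @ X / (m - 1)

# Step 4: eigenvalue decomposition S = P Lambda P^T (columns of P are the loadings).
eigenvalues, P = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]          # sort by decreasing eigenvalue
eigenvalues, P = eigenvalues[order], P[:, order]

# Step 5: scores T = X P, so that X = T P^T (plus a residual E if components are dropped).
T = X @ P

print("variance captured per PC:", np.round(eigenvalues / eigenvalues.sum(), 3))
print("reconstruction error with 2 PCs:",
      np.round(np.linalg.norm(X - T[:, :2] @ P[:, :2].T), 3))
```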
Dimensionality Reduction of Molten Salts Data
(Janz's Molten Salts Database: 1700 instances with 7 variables.)

[Figure: left, bivariate representation of the data sets as pairwise scatter plots of
the seven variables (Eq.wt, MP, Temp, Eq.con, Spe.con, D, V; abbreviations as defined
above); right, multivariate (PCA) representation of the same data as a three-dimensional
score plot of PC1 (49.89%), PC2 (22.43%), and PC3 (19.47%). The BiCl3 series (high
viscosity) stands out.]
INTERPRETATIONS OF PRINCIPAL COMPONENT PROJECTIONS

Correlations between variables are captured in the loading plot.

[Figure: loading plot of Loadings on PC 1 (49.89%) versus Loadings on PC 2 (22.43%),
showing equivalent weight, density, temperature of the measurement, equivalent
conductance, melting point, and specific conductance.]
Trends in bonding are captured along the PC1 axis of the score plot.

[Figure: score plot of Scores on PC 1 (49.89%) versus Scores on PC 2 (22.43%), with
individual salts labeled, including AgI, TlCl, TlNO3, HgI2, HgBr2, AgBr, CsI, BaI2,
CsF, BaBr2, PbBr2, K2SO4, NaBr, KF, NaI, BaCl2, CdI2, SrI2, LiBr, NaF, CsNO3, SrCl2,
MgI2, CdCl2, Na2SO4, NaCl, MgBr2, GaI2, LiCl, KCl, K2Cr2O7, BiCl3, YCl3, CaCl2, Li2CO3,
AlI3, RbNO3, InCl3, LiF, KNO3, KOH, MgCl2, InCl2, ZnCl2, NaNO2, LiNO3, NaOH, KCNS, and
InCl; the chemistries trend from covalent to ionic along the PC1 axis.]
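The sketch below shows how such score and loading plots can be drawn from the T and P matrices (my own illustration; the actual molten-salt data are not reproduced here, so a synthetic 7-column matrix stands in, and matplotlib is an assumed tool).

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(4)

# Stand-in for the scaled 1700 x 7 molten-salt matrix (illustrative random data).
variable_names = ["Eq.wt", "MP", "Temp", "Eq.con", "Spe.con", "D", "V"]
A = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 7)) + 0.1 * rng.normal(size=(100, 7))
X = (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)

# PCA via eigendecomposition of the covariance matrix (as in the algorithmic summary).
eigenvalues, P = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, P = eigenvalues[order], P[:, order]
T = X @ P
explained = 100 * eigenvalues / eigenvalues.sum()

fig, (ax_scores, ax_loadings) = plt.subplots(1, 2, figsize=(10, 4))

# Score plot: each point is one chemistry in the PC1-PC2 plane.
ax_scores.scatter(T[:, 0], T[:, 1], s=15)
ax_scores.set_xlabel(f"Scores on PC 1 ({explained[0]:.2f}%)")
ax_scores.set_ylabel(f"Scores on PC 2 ({explained[1]:.2f}%)")

# Loading plot: each point is one original variable; variables plotting close together
# are correlated in the original data.
ax_loadings.scatter(P[:, 0], P[:, 1], s=15)
for name, x, y in zip(variable_names, P[:, 0], P[:, 1]):
    ax_loadings.annotate(name, (x, y))
ax_loadings.set_xlabel(f"Loadings on PC 1 ({explained[0]:.2f}%)")
ax_loadings.set_ylabel(f"Loadings on PC 2 ({explained[1]:.2f}%)")

plt.tight_layout()
plt.show()
```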
PCA: summary
To summarize, when we start with a multivariate data matrix, PCA permits us
to reduce the dimensionality of that data set. This reduction in dimensionality
offers better opportunities to:
• identify the strongest patterns in the data;
• capture most of the variability of the data with a small fraction of the total set
of dimensions;
• eliminate much of the noise in the data, making it beneficial for both data
mining and other data analysis algorithms.