Davide Ballabio
Milano Chemometrics and QSAR Research Group
Università Milano - Bicocca
Classification of multiway data
based on the MOLMAP approach
Classification of multiway data
Despite the great interest in the multiway approach, little work has
been dedicated to classification.
Bro R. (2006) Critical Reviews in Analytical Chemistry 36 279-293.
… but classification is one of the fundamental methodologies in
chemometrics, commonly used for the study of bi-dimensional data
Classification of multiway data
• What is the MOLMAP approach? It is an algorithm for
calculating molecular descriptors, used to study chemical
information on molecules organized into three-way data structures.
Zhang QY, Aires-de-Sousa J. (2005) Journal of Chemical
Information and Modeling 45 1775-1783.
• I molecules
• J bonds (the number of bonds can be different for each
molecule)
• K bond properties
i.e. x_ijk represents the value of the k-th property for the j-th bond
of the i-th molecule.
Classification of multiway data
• MOLMAP: molecular descriptors for the study of chemical
information on molecules organized into three-way data structures.
• Major steps:
a) generation of MOLMAP scores by means of Kohonen maps
b) development of classification models with MOLMAP scores as
independent variables.
Zhang QY, Aires-de-Sousa J. (2005) Journal of Chemical Information and Modeling 45
1775-1783.
Extension to analytical multiway data
Theory of MOLMAP approach
[Figure: three-way data array of size I x J x K, with I samples,
J variables (mode 2) and K variables (mode 3)]
• Simulated data: 50 samples (I = 50), each described by a 15 x 25
matrix (15 variables in mode 2, 25 variables in mode 3)
• The 50 samples are divided into 2 classes (Class 1 and Class 2),
plus noise
Theory of MOLMAP approach
How does it work? In 4 steps…
1) unfolding
2) Kohonen maps on the unfolded data
3) MOLMAP score calculation based on Kohonen maps
4) Subsequent classification on MOLMAP scores
Theory of MOLMAP approach
1) data are unfolded: the 15 x 25 matrix of each multiway sample
(1st, 2nd, …, 50th) is stacked row-wise into a single two-way matrix
of 50 x 15 = 750 rows (input vectors) and 25 columns.
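The unfolding step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `unfold`, the nested-list data layout and the placeholder values are hypothetical and are not taken from the MOLMAP toolbox.

```python
def unfold(samples):
    """Stack the J rows of every sample into one list of input vectors.

    samples: list of I samples, each a list of J rows with K values.
    Returns (rows, owner), where owner[r] is the index of the sample
    that row r came from (needed later to map each sample back).
    """
    rows, owner = [], []
    for i, sample in enumerate(samples):
        for row in sample:
            rows.append(list(row))
            owner.append(i)
    return rows, owner


# Dummy three-way array with the sizes used on the slides:
# I = 50 samples, J = 15, K = 25 (placeholder values, not real data).
I, J, K = 50, 15, 25
samples = [[[float(i + j + k) for k in range(K)] for j in range(J)]
           for i in range(I)]

rows, owner = unfold(samples)
print(len(rows), len(rows[0]))   # 750 25
```

Keeping the `owner` list alongside the unfolded rows is one simple way to remember which of the 750 input vectors belong to which multiway sample.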
Theory of MOLMAP approach
2) Kohonen maps are trained with the unfolded data: the 750 rows
(input vectors, 25 variables each) are presented to a square map
with N neurons per side.
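As a rough illustration of the training step, here is a minimal Kohonen map in plain Python. It is a sketch under assumptions, not the toolbox implementation: the function names `winner` and `train_som`, the Gaussian neighbourhood, the linear decay of learning rate and radius, the non-toroidal square grid and all default parameter values are choices made for this example only.

```python
import math
import random


def winner(weights, x):
    """Return (row, col) of the neuron whose weights are closest to x."""
    best, best_pos = None, None
    for r, line in enumerate(weights):
        for c, w in enumerate(line):
            d = sum((a - b) ** 2 for a, b in zip(x, w))
            if best is None or d < best:
                best, best_pos = d, (r, c)
    return best_pos


def train_som(rows, size, epochs=20, lr0=0.5, seed=0):
    """Train a size x size map; returns weights[r][c] (lists of length K)."""
    rng = random.Random(seed)
    k = len(rows[0])
    weights = [[[rng.random() for _ in range(k)] for _ in range(size)]
               for _ in range(size)]
    radius0 = size / 2.0
    for epoch in range(epochs):
        frac = epoch / float(epochs)
        lr = lr0 * (1.0 - frac)                     # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)   # shrinking neighbourhood
        order = list(range(len(rows)))
        rng.shuffle(order)                          # random presentation order
        for idx in order:
            x = rows[idx]
            br, bc = winner(weights, x)
            for r in range(size):
                for c in range(size):
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-d2 / (2.0 * radius ** 2))
                    w = weights[r][c]
                    for n in range(k):
                        w[n] += lr * h * (x[n] - w[n])
    return weights


# Toy input vectors: two well-separated groups of 3-variable rows.
rng = random.Random(1)
cluster_a = [[rng.uniform(0.0, 0.1) for _ in range(3)] for _ in range(20)]
cluster_b = [[rng.uniform(0.9, 1.0) for _ in range(3)] for _ in range(20)]
som = train_som(cluster_a + cluster_b, size=4)
print(winner(som, [0.0, 0.0, 0.0]), winner(som, [1.0, 1.0, 1.0]))
```

After training, input vectors from the two groups should activate different neurons, which is the property the MOLMAP scores rely on.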
Theory of MOLMAP approach
3) MOLMAP score calculation based on Kohonen maps.
Once the map is trained, each multiway sample (its 15 input vectors
out of the 750) is mapped onto it, and its score is calculated from
the pattern of activated neurons: 1 if activated; 0.3 if neighbour.
Theory of MOLMAP approach
3) MOLMAP score calculation based on Kohonen maps.
[Figure: activation pattern of one sample on a 7 x 7 map, unfolded
into a vector of 7*7 = 49 MOLMAP scores; example values 1.6, 1.2,
1.8, 2.1, 1.5, 0.6, 0.9]
Theory of MOLMAP approach
3) MOLMAP score calculation based on Kohonen maps.
Repeating the same procedure on each multiway sample, we’ll
get the MOLMAP score matrix (M), with one row per sample and
size*size columns: M (50 x 49).
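The scoring rule (1 if activated, 0.3 if neighbour) might be sketched as below. Several details are assumptions rather than statements from the slides: contributions are summed over all input vectors of a sample (which would explain figure values above 1, such as 1.6), the neighbours are the 8 surrounding neurons, and the map edges do not wrap around. The function name `molmap_scores` is hypothetical.

```python
def molmap_scores(winners, size, neighbour_weight=0.3):
    """MOLMAP score vector (length size*size) for one multiway sample.

    winners: list of (row, col) positions of the neurons activated by
    the input vectors of this sample.
    """
    scores = [[0.0] * size for _ in range(size)]
    for r, c in winners:
        scores[r][c] += 1.0                        # activated neuron
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue
                nr, nc = r + dr, c + dc
                if 0 <= nr < size and 0 <= nc < size:
                    scores[nr][nc] += neighbour_weight   # neighbour
    # Unfold the size x size pattern into one row of the score matrix M.
    return [round(v, 10) for line in scores for v in line]


# One sample whose input vectors activate three neurons of a 7 x 7 map.
size = 7
m_row = molmap_scores([(3, 3), (3, 4), (0, 0)], size)
print(len(m_row))   # 49
```

Stacking one such row per sample gives a matrix with as many rows as samples and size*size columns, matching the 50 x 49 shape quoted for the simulated data.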
Theory of MOLMAP approach
3) MOLMAP score calculation based on Kohonen maps.
M is a two-way matrix in which the information of the original
multiway dataset is compressed by encoding the positions of the
input vectors in the Kohonen map.
similar samples -> similar mapping -> similar MOLMAP scores
Theory of MOLMAP approach
4) Subsequent classification on MOLMAP scores
The MOLMAP score matrix M (50 x 49) and the class of each sample
(1 or 2) are used to build a Classification and Regression Tree (CART).
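To make the classification step concrete, the following is a compact Gini-based binary tree in the spirit of CART, written in plain Python. It is a simplified sketch (no pruning, exhaustive threshold search), not the classifier used by the authors; the function names and the tiny score matrix are invented for illustration.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(X, y):
    """Return (impurity, feature, threshold) of the best split, or None."""
    best = None
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def grow(X, y, depth=0, max_depth=3):
    """Leaves are class labels; nodes are (feature, threshold, left, right)."""
    split = best_split(X, y) if gini(y) > 0 and depth < max_depth else None
    if split is None:
        return max(set(y), key=y.count)            # majority-class leaf
    _, f, t = split
    left = [(r, l) for r, l in zip(X, y) if r[f] <= t]
    right = [(r, l) for r, l in zip(X, y) if r[f] > t]
    return (f, t,
            grow([r for r, _ in left], [l for _, l in left],
                 depth + 1, max_depth),
            grow([r for r, _ in right], [l for _, l in right],
                 depth + 1, max_depth))


def predict(tree, row):
    """Follow the splits until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


# Tiny made-up score matrix: 6 samples, 3 MOLMAP scores each.
M = [[1.3, 0.0, 0.3], [1.0, 0.3, 0.0], [1.6, 0.0, 0.0],
     [0.0, 1.3, 0.3], [0.3, 1.0, 0.6], [0.0, 1.6, 0.0]]
classes = [1, 1, 1, 2, 2, 2]
tree = grow(M, classes)
print(predict(tree, [1.2, 0.1, 0.0]), predict(tree, [0.1, 1.4, 0.3]))  # 1 2
```

Because each split refers to a single column of M, and each column corresponds to one neuron, the selected splits can be traced back to specific regions of the Kohonen map.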
Theory of MOLMAP approach
• Results on the simulated data: 750 input vectors in the Kohonen map
[Figures: results on the simulated data, with samples labelled as
Class 1 and Class 2]
Electronic nose data
• An E-Nose is made of non-selective gas sensors able to simulate
human sensing
• electronic nose sensors -> non-specific fingerprint of food
products -> analysed by chemometrics
[Figure: three-way data array of size I x J x K, with I samples,
J sensors and K points of the time profile]
Electronic nose data
• The dataset includes 53 samples of olive oil.
• 2 classes:
• Garda (36)
• Spain (6); Sardegna (5); Campania (4); Abruzzo (2),
considered as the not-Garda class
• A further 19 commercial samples were included to test
the classification model.
• The signals were collected by 15 sensors at 100 sampling points.
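The sizes implied by this description can be checked with a few lines of arithmetic. The sketch assumes that the e-nose samples are unfolded sensor-wise, as in the simulated example (one input vector per sensor, each holding the 100-point time profile); the variable names are illustrative.

```python
# Class sizes as listed on the slide.
class_counts = {"Garda": 36, "Spain": 6, "Sardegna": 5,
                "Campania": 4, "Abruzzo": 2}

n_samples = sum(class_counts.values())            # I
n_not_garda = n_samples - class_counts["Garda"]   # size of not-Garda class
n_sensors, n_points = 15, 100                     # J and K

# Assumed unfolding: one input vector per sensor and per sample.
n_input_vectors = n_samples * n_sensors

print(n_samples, n_not_garda, n_input_vectors, n_points)  # 53 17 795 100
```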
Electronic nose data
• Kohonen settings: 21×21 neurons - 100 epochs
• Results (% of correctly classified samples); cross-validation (CV)
with Venetian blinds, 3 groups
Model
MOLMAP
PLS-DA
PARAFAC + LDA
PARAFAC + QDA
NERcv NERtest
93
100
98
64
88
94
88
88
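For readers unfamiliar with the validation scheme, the sketch below shows one common way to build Venetian-blinds cross-validation groups (sample i goes to group i modulo the number of groups) and to compute the percentage of correctly classified samples. The function names are hypothetical, and the order in which samples were arranged in the original study is not known from the slides.

```python
def venetian_blinds(n_samples, n_groups):
    """Yield (train_idx, test_idx) pairs; sample i goes to group i % n_groups."""
    for g in range(n_groups):
        test = [i for i in range(n_samples) if i % n_groups == g]
        train = [i for i in range(n_samples) if i % n_groups != g]
        yield train, test


def percent_correct(true_labels, predicted_labels):
    """Percentage of samples whose predicted class matches the true class."""
    hits = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return 100.0 * hits / len(true_labels)


# 53 olive oil samples split into 3 cross-validation groups.
splits = list(venetian_blinds(53, 3))
print([len(test) for _, test in splits])   # [18, 18, 17]
```

Each sample is left out exactly once, so collecting the predictions over the three groups gives one cross-validated prediction per sample.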
Electronic nose data
• Portion of Kohonen map…
[Figures: discrimination between sensors; discrimination between
classes]
Conclusions (preliminary…)
• Good predictive performance
• Besides the classification performance:
a) the MOLMAP scores appear to be an effective fingerprint
b) the role and importance of each portion of the multiway data
can be analysed in a comprehensive way
• Possible improvements:
a) classifiers other than CART on the MOLMAP scores
b) modification of the original MOLMAP scoring procedure
Davide Ballabio
Milano Chemometrics and QSAR Research Group
Department of Environmental Sciences
Università Milano – Bicocca
You can download the MOLMAP toolbox for MATLAB here:
http://michem.disat.unimib.it/chm
Reference paper for the MOLMAP analytical approach:
D. Ballabio, V. Consonni, R. Todeschini, Analytica Chimica Acta
(2007), 605, 134-146
Thanks for your attention!