Transcript Document
Davide Ballabio
Milano Chemometrics and QSAR Research Group, Università Milano - Bicocca

Classification of multiway data based on the MOLMAP approach

Classification of multiway data
Despite the great interest in the multiway approach, little work has been dedicated to classification.
Bro R. (2006) Critical Reviews in Analytical Chemistry 36 279-293.
… but classification is one of the fundamental methodologies in chemometrics, commonly used for the study of bi-dimensional (two-way) data.

Classification of multiway data
• What is the MOLMAP approach? It is an algorithm for calculating molecular descriptors for the study of molecular chemical information organized into three-way data structures:
• I molecules
• J bonds (the number of bonds can be different for each molecule)
• K bond properties
i.e. xijk represents the value of the k-th property of the j-th bond of the i-th molecule.
• Major steps:
a) generation of MOLMAP scores by means of Kohonen maps
b) development of classification models with MOLMAP scores as independent variables.
Zhang QY, Aires-de-Sousa J. (2005) Journal of Chemical Information and Modeling 45 1775-1783.

Extension to analytical multiway data

Theory of MOLMAP approach
Three-way data array of size I x J x K: I samples, J variables (mode 2), K variables (mode 3).
• simulated data: 50 samples (I = 50), 15 variables in mode 2 (J = 15), 25 variables in mode 3 (K = 25)
• the 50 samples are divided into 2 classes (Class 1, Class 2), plus noise

Theory of MOLMAP approach
How does it work? In 4 steps…
1) unfolding
2) Kohonen maps on the unfolded data
3) MOLMAP score calculation based on Kohonen maps
4) subsequent classification on MOLMAP scores

Theory of MOLMAP approach
1) Data are unfolded: the 15 x 25 matrices of the samples (1st sample, 2nd sample, …, 50th sample) are stacked one below the other, so that each multiway sample contributes 15 rows and the unfolded data have 50 x 15 = 750 rows (input vectors) of 25 values each.

Theory of MOLMAP approach
2) Kohonen maps are trained with the unfolded data: 750 rows (input vectors) of 25 variables, on a Kohonen map of size N.

Theory of MOLMAP approach
3) MOLMAP score calculation based on Kohonen maps.
Once the map is trained, each multiway sample (15 of the 750 input vectors) is mapped, and its score is calculated from the pattern of activated neurons: 1 if activated; 0.3 if neighbour.
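The unfolding, map training and scoring steps described above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the code of the MOLMAP toolbox: the function names (`unfold`, `winner`, `train_som`, `molmap_scores`) are hypothetical, the map is assumed to be a square, non-toroidal grid with a linearly decaying learning rate, the contributions of all input vectors of a sample are assumed to be summed, and the data set is a tiny simulated one (6 samples instead of 50).

```python
import random

def unfold(X):
    """I x J x K nested lists -> list of I*J input vectors of length K."""
    return [row for sample in X for row in sample]

def winner(weights, v):
    """Index of the neuron whose weights are closest to v (squared Euclidean)."""
    return min(range(len(weights)),
               key=lambda n: sum((w - x) ** 2 for w, x in zip(weights[n], v)))

def train_som(rows, size, epochs, seed=0):
    """Train a size x size Kohonen map (square, non-toroidal grid: an assumption)."""
    rng = random.Random(seed)
    k = len(rows[0])
    weights = [[rng.random() for _ in range(k)] for _ in range(size * size)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs              # linear decay (assumed schedule)
        lr = 0.5 * frac
        radius = max(1, round(size / 2 * frac))
        for v in rng.sample(rows, len(rows)):    # shuffled presentation order
            r0, c0 = divmod(winner(weights, v), size)
            for n, w in enumerate(weights):
                r, c = divmod(n, size)
                if max(abs(r - r0), abs(c - c0)) <= radius:
                    for d in range(k):           # move neuron towards the input
                        w[d] += lr * (v[d] - w[d])
    return weights

def molmap_scores(sample, weights, size, hit=1.0, near=0.3):
    """Score vector (size*size values) of one multiway sample.

    Each input vector adds 1 to its winning neuron and 0.3 to the
    adjacent neurons; summing the contributions is an assumption.
    """
    scores = [0.0] * (size * size)
    for v in sample:
        r0, c0 = divmod(winner(weights, v), size)
        for r in range(max(0, r0 - 1), min(size, r0 + 2)):
            for c in range(max(0, c0 - 1), min(size, c0 + 2)):
                scores[r * size + c] += hit if (r, c) == (r0, c0) else near
    return scores

# Tiny simulated data set: samples 0-2 belong to class 1, samples 3-5 to class 2.
rng = random.Random(1)
I, J, K = 6, 4, 3
X = [[[(0.2 if i < I // 2 else 0.8) + rng.gauss(0, 0.05) for _ in range(K)]
      for _ in range(J)] for i in range(I)]

size = 3
weights = train_som(unfold(X), size, epochs=10)
M = [molmap_scores(sample, weights, size) for sample in X]   # I x (size*size)
```

Each row of `M` is the score vector of one sample, so `M` can be passed to any two-way classifier together with the class labels.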
[Figure: 7 x 7 Kohonen map giving 7*7 = 49 MOLMAP scores for one sample, e.g. 1.6 1.2 1.8 2.1 1.5 0.6 0.9 …]
Repeating the same procedure on each multiway sample, we get the MOLMAP score matrix M, with one row per sample and size*size columns: here M is 50 x 49.
M is a two-way matrix in which the information of the original multiway dataset is compressed by encoding the positions of the input vectors in the Kohonen map.
similar samples -> similar mapping -> similar MOLMAP scores

Theory of MOLMAP approach
4) Subsequent classification on MOLMAP scores: a Classification and Regression Tree (CART) is built on M (50 x 49) and the class vector (1 1 2 2 2 …).

Theory of MOLMAP approach
• Results on the simulated data: 750 input vectors in the Kohonen map
[Figures: results on the simulated data for Class 1 and Class 2]

Electronic nose data
• An e-nose is made of non-selective gas sensors able to simulate human sensing
• the electronic nose sensors give an aspecific fingerprint of food products, which is analysed by chemometrics
• data size: I x J x K (I samples, J sensors, K points of the time profile)

Electronic nose data
• The dataset included 53 samples of olive oil.
• 2 classes:
• Garda (36)
• not-Garda class: Spain (6); Sardegna (5); Campania (4); Abruzzo (2)
• The sampling also included 19 commercial samples to test the classification model.
• The signals were collected by 15 sensors at 100 sampling points.

Electronic nose data
• Kohonen settings: 21×21 neurons - 100 epochs
• Results (% of correctly classified samples), CV with venetian blinds on 3 groups
Models: MOLMAP, PLS-DA, PARAFAC + LDA, PARAFAC + QDA
Columns: NERcv, NERtest
Values: 93 100 98 64 88 94 88 88

Electronic nose data
• Portion of Kohonen map…
[Figure: discrimination between sensors]
[Figure: discrimination between classes]

Conclusions (preliminary…)
• Good predictive performances
• Then, besides the classification performances:
a) the MOLMAP scores appear to be an effective fingerprint
b) the role and importance of each portion of the multiway data can be analysed in a comprehensive way
• Improvements:
a) classifiers other than CART on the MOLMAP scores
b) modification of the original MOLMAP scoring procedure

Davide Ballabio
Milano Chemometrics and QSAR Research Group, Department of Environmental Sciences, Università Milano – Bicocca
You can download the MOLMAP toolbox for MATLAB here: http://michem.disat.unimib.it/chm
Reference paper for the MOLMAP analytical approach: D. Ballabio, V. Consonni, R. Todeschini, Analytica Chimica Acta (2007), 605, 134-146

Thanks for your attention!