Statistics for QSPR and QSAR with JChem and WEKA and

Transcript Statistics for QSPR and QSAR with JChem and WEKA and

Statistics evaluation and graphics
with ChemAxon tools and Statistica and WEKA
towards QSPR and QSAR development
Tobias Kind
FiehnLab at UC Davis Genome Center
November 2006
Free Academic Licenses for JChem and
Instant JChem provided by ChemAxon
Academic License for Statistica Dataminer
provided by Statsoft
GNU general public license for WEKA
provided by WEKA Machine Learning Project
1
Metabolomics - The science of the small molecules
Compound Classes:
• sugars
• amino acids
• steroids
• fatty acids
• lipids
• phospholipids
• organic acids ...
Molecules under investigation
Visit us!
www.fiehnlab.ucdavis.edu
3D model of a molecule with surface plot
2
Techniques and tools
• Analytical techniques (LC-MS, GC-MS, FT-MS, NMR, IR)
• BioInformatics, ChomoInformatics
Liquid Chromatography
LC-MS
Gas Chromatography
GC-MS
BioInformatics and Cheminformatics
Statistics (Statistica Dataminer)
Open Source Tools
3
ChemAxon JChem has now PCA and PLS
1) Create new library with JCHEM Manager GUI (testcase here: fingerprints)
2) Exctract fingerprints and do dimension reduction with
principal component analysis (PCA) with command line tool PCA.bat or pca.sh
4
ChemAxon JChem Principal Component Analysis (PCA)
Start PCA by getting information from DB (here Access, but can be Oracle, Derby, MySQL)
Test case 250.000 chemicals from NCI DB
PCA can be done from any descriptor, chemical fingerprints, BCUT etc.
This is just a simple example made from the 16 standard fingerprints.
Be sure only to select descriptors you want (and not the molecule ID)
PCA -d "sun.jdbc.odbc.JdbcOdbcDriver" -u "jdbc:odbc:jchem-z" -l test -p test -P JChemProperties -q "SELECT cd_fp1, cd_fp2, cd_fp3, cd_fp4,
cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM nci99 WHERE cd_id <= 250000" -o
PCA-scores.txt -t PCA-Eigenvalues.txt
TimeThis : Command Line : run-pca.bat
TimeThis : Start Time : Mon Nov 27 17:02:02 2006
TimeThis :
End Time : Mon Nov 27 17:19:52 2006
TimeThis : Elapsed Time : 00:17:49.812
Problem here:
A) JDBC extraction not tuned - DB extraction of values nearly 2 minutes.
B) PCA calculation time too long - 15 minutes for a matrix 250.000 x 16
The current PCA algorithm needs to be changed, its very inefficent (faster matrix routines exist for JAVA)
Database extraction time with Statistica:
The same PCA with Statistica is finished in:
8 seconds.
1 second (no joke – thats a factor of 1:900).
5
JChem PCA output
Eigenvalues, % and Cumulated variance (in rows)=
1.77
1.623
1.518
1.326
1.106
10.409
9.547
8.93
7.798
6.505
10.409
19.957
28.886
36.684
43.189
Loadings (in rows)=
0.191
0
-0.159
-0.17
0.306
0.076
0.085
0.577
-0.117
0
0.182
-0.167
-0.049
0.181
-0.553
0.141
-0.572
0.077
-0.073
PCA scores
-0.873
0.622
-0.672
-0.102
-0.835
-0.947
-0.215
0.353
0.622
0.981
0.597
0.668
0.723
0.304
-1.519
0.455
-1.106
-2.345
0.301
0
1.843
0.233
-1.015
0.81
-0.705
-0.595
-1.638
-0.737
0.928
-0.131
-0.175
-0.089
0.896
1.147
-0.673
-0.673
0.808
-1.174
1.028
6.048
49.236
0.999
5.879
55.115
0.94
5.527
60.643
0.919
5.407
66.05
0.849
4.994
71.043
0.824
4.847
75.891
0.788
4.638
80.528
0.742
4.362
84.89
0.71
4.177
89.068
0.674
3.965
93.033
0.602
3.543
96.576
0.582
3.424
100
0.617
0.128
0.233
-0.419
0.084
0.126
0.105
-0.123
-0.016
0.338
-0.255
-0.286
0.307
-0.146
0.344
-0.304
0.084
-0.535
-0.263
-0.682
-0.055
-0.324
0.335
0.469
0.348
-0.374
-0.035
-0.31
-0.63
0.235
0.101
0.11
-0.442
0.563
0.063
0.29
0.204
0.748
0.477
-1.853
-0.62
-2.836
-1.918
1.778
0.526
-1.141
0.801
1.877
0.435
0.198
0.796
-0.477
0.343
-0.322
1.016
1.087
0.381
0.551
0.492
-0.631
1.168
-0.197
-0.17
0.806
1.366
0.766
0.32
-0.34
0.353
1.835
-0.221
-0.589
-0.263
-0.91
-0.59
-1.083
-0.526
-1.157
-0.755
-0.529
0.233
0.221
-1.369
-0.082
-0.439
0.484
-1.519
-2.252
3.189
-0.149
0.208
0.192
0.877
0.346
0.011
0.957
-0.962
-1.481
-0.783
1.54
1.919
0.466
1.081
-0.456
0.966
-0.515
-1.754
-0.704
1.704
-2.231
0.2
0.557
-0.299
1.113
-0.722
0.152
1.524
-1.382
-0.218
-0.397
-0.624
0.509
0.919
1.023
-0.881
-1.547
0.705
1.043
1.189
-0.042
-1.294
0.368
2.844
-2.449
-1.642
1.397
1.13
1.308
-2.87
-0.801
-1.399
1.349
-0.649
-1.085
The PCA results matrix is inverted and values *(-1) from Statistica.
Problem: Currently no graphics. But multivariate statistics lives from graphics.
Follwing simple graphic examples are made with Statistica or WEKA via DB query.
6
Following slides  „What could be“ in the future.
or
 „What can be done“ right now.
Check the pretty comprehensive statistics link
http://www.statsoft.com/textbook/stathome.html
7
Machine Learning and statistic tools
Response curves
PLS
Tree model
Cluster Analysis
Neural Network
Feature selection
Machine Learning (KNN)
We use Statistica Dataminer as a comprehensive statistics work tool.
WEKA or YALE are free but (not yet :-) as powerful as the Statistica Dataminer.
8
Connection of a JCHEM molecule DB via JDBC with Statistica
Time for query + copy of 4,000,000 values with 250k molecules 16 fingerprints = 8 seconds.
Test system JChem 3.2 with MS Access with Statistica Dataminer 7.1
Dual Opteron 2.8 GHz
9
Statistica with JChem data
10
PCA Scree plot – determine optimal factors to retain
Eigenvalues of correlation matrix
Scree plot
2.0
1.8
10.91%
10.15%
1.6
9.41%
1.4
8.29%
Visible Step
1.2
Eigenvalue
6.89%
6.37%
6.04%
5.79%
5.32%
5.15%
4.94%
4.63%
4.44%
4.25%
3.77%
3.66%
1.0
0.8
0.6
0.4
-2
0
2
4
6
8
10
Eigenvalue number
12
14
16
18
20
Statistica Dataminer 7.1
Four factors can be retained. The 16 dimensional space can be
compressed into a 4-dimensional space. (Scree plot is not optimal here)
11
PCA Loadings plot – which variables are influential?
Projection of the variables on the factor-plane ( 1 x 2)
1.0
cd_fp2
0.5
cd_fp12
cd_fp5
cd_fp15
cd_fp1 cd_fp10
cd_fp6
cd_fp16
cd_fp4
0.0
cd_fp9cd_fp7 cd_fp3
cd_fp8
Factor 2 : 10.15%
cd_fp13
-0.5
cd_fp14
cd_fp11
-1.0
-1.0
-0.5
0.0
Factor 1 : 10.91%
0.5
1.0
Active
Statistica Dataminer 7.1
Which of the 16 fingerprints are similar? Those who “cluster” together are similar (fp_11 and fp_14).
The variables fp_5 and fp_16 influence factor 1 in the same way. Variables inside or near the center (0,0) have no
discrimination power. Remember PCA is no cluster analysis!
12
PCA Scores plot – picture of the reduced dimensionality.
Statistica Dataminer 7.1
The 16 fingerprints are compressed into 2D. We can use other high dimensionality descriptors for
enhanced examples. Cases (molecules) which „cluster“ together may have same properties or functional groups
(depending on input). Here we see the KOW molecule set covers the whole NCI dataset based on 16 pfs.
13
PCA Scores 3D plot – KOWWIN versus silicon compound test set
3D Scatterplot (Score spreadsheet (Spreadsheet215) 7v*22435c)
SILICON compounds (red)
KOWWIN set (blue)
Statistica Dataminer 7.1
The 16 fingerprints are compressed into 3D. The KOWWIN test set does not cover the whole
molecules space of important silicon containing molecules. You can also do an Overlap Analysis
(compare two databases) within the all-new Instant-JChem.
14
Statistica – Random Forest Machine learning
1024-DIM FC descriptor space
Statistica generates all graphical output + SQL code
Importance plot
Dependent variable:
logP
1600
1400
1200
1000
800
600
Importance (F-value)
Histogram of logP (Obs.)
2800
2600
2400
2200
2000
1800
1600
1400
1200
400
200
0
Var418
Var569
Var188
Var561
Var900
Var468
Var787
Var678
Var494
Var739
Var856
Var704
Var585
Var314
Var78
1000
Number of observations
800
600
400
200
0
-7
-6
-5
-4
-3
-2
-1
0
1
2
3
4
logP (Obs.)
5
6
7
8
9
10 11 12 13
Chemical fingerprint descriptors generated with JCHEM GenerateMD
GenerateMD performance 1800 molecules/second for 1024 dimensional fp
On Dual Opteron 2,8 GHz (one core used only).
15
CART tree method for QSPR and QSAR
Tree graph for MTP
Num. of non-terminal nodes: 33, Num. of terminal nodes: 34
Tree number: 1
I D
=1
N=2095
M u=166. 185442
Var =4271. 107556
AM 1_dipole
<= 5. 642730
I D
=2
> 5. 642730
N=1596
I D
=3
M u=193. 646586
Var =4067. 765794
Var =3940. 321058
PEO E_PC
+
PM 3_H
O MO
<= 1. 203697
I D
=4
> 1. 203697
N=422
I D
=5
I D
=56
Var =3704. 634302
Var =3501. 469135
Var =4297. 134484
Var =3320. 597122
logP( o/ w)
> 17. 157779
I D
=7
b_1r ot N
<= 1. 758500
N=66
I D
=20
> 1. 758500
N=190
I D
=21
M u=220. 116162
M u=187. 487117
Var =3150. 895078
Var =4444. 152491
Var =3571. 503041
Var =3374. 917552
Var =3241. 355800
Var =2832. 915478
zagr eb
> 221. 169665
I D
=9
N=51
<= 83. 000000
I D
=22
N=75
CASA-
> 83. 000000
I D
=23
weiner Pat h
<= 430. 315615
N=115
I D
=24
> 430. 315615
N=556
I D
=25
<= 1393. 000000
N=428
I D
=60
I D
=65
N=134
M u=210. 018261
M u=153. 394964
M u=184. 530140
M u=204. 389744
M u=242. 832099
M u=142. 213793
M u=197. 285075
Var =3313. 001513
Var =3073. 243790
Var =3219. 061548
Var =3022. 694602
Var =2683. 945636
Var =2263. 935669
Var =2416. 466792
PM 3_H
F
> 82. 823990
I D
=11
PM 3_E
<= 12. 561065
N=31
I D
=26
> 12. 561065
N=400
I D
=27
PC
-
<= - 321. 575760
N=156
I D
=46
> - 321. 575760
N=15
I D
=47
<= - 1. 366000
N=411
I D
=62
N=99
SM R
> - 1. 366000
I D
=63
N=18
<= 10. 963580
I D
=66
N=65
> 10. 963580
I D
=67
N=69
M u=139. 874194
M u=145. 870000
M u=172. 689744
M u=109. 666667
M u=187. 267397
M u=212. 440404
M u=160. 111111
M u=179. 780000
M u=213. 775362
Var =2163. 339861
Var =2785. 218049
Var =2801. 123252
Var =3253. 507718
Var =1955. 555556
Var =3068. 444876
Var =2663. 669079
Var =2680. 265432
Var =2056. 397600
Var =2195. 069103
a_heavy
> 1. 087000
I D
=13
PEO E_VSA_FPNEG
<= 21. 500000
N=61
I D
=28
> 21. 500000
N=197
I D
=29
<= 0. 069790
N=203
I D
=44
N=88
Kier Flex
> 0. 069790
I D
=45
N=68
<= 3. 576279
I D
=48
> 3. 576279
N=104
I D
=49
N=307
M u=118. 404918
M u=129. 835533
M u=161. 430542
M u=157. 796591
M u=191. 963235
M u=205. 144231
M u=181. 211401
Var =1889. 309660
Var =2688. 035237
Var =2308. 531729
Var =2787. 519957
Var =2934. 696924
Var =3007. 575854
Var =3080. 979197
Var =2919. 261601
E
> 53. 550139
I D
=15
N=10
PEO E_VSA_PN
EG
<= 33. 971064
I D
=30
> 33. 971064
N=31
I D
=31
ASA-
<= 3. 755634
N=166
I D
=38
> 3. 755634
N=5
I D
=39
<= 442. 648105
N=198
I D
=50
> 442. 648105
N=294
I D
=51
N=13
M u=134. 230000
M u=88. 448387
M u=137. 564458
M u=66. 259999
M u=163. 833838
M u=177. 819388
M u=257. 923077
Var =1753. 277876
Var =3021. 656100
Var =983. 583092
Var =2176. 347474
Var =434. 806387
Var =2612. 433045
Var =2739. 987556
Var =828. 724852
ASA_P
> - 8. 795825
<= 126. 019935
N=45
I D
=32
Kier A3
> 126. 019935
N=157
I D
=33
N=9
<= 2. 270166
I D
=40
Kier A2
> 2. 270166
N=54
I D
=41
<= 13. 213366
N=144
I D
=52
> 13. 213366
N=289
I D
=53
N=5
M u=89. 049367
M u=106. 355556
M u=134. 021656
M u=199. 366667
M u=193. 735185
M u=152. 620833
M u=179. 684775
M u=70. 000000
Var =2158. 814916
Var =1922. 462462
Var =2566. 764444
Var =2285. 623032
Var =2273. 971372
Var =2562. 540189
Var =1170. 300000
PEO E_VSA- 0
vsa_don
> 26. 067410
I D
=19
> 1. 912989
N=29
M u=167. 829333
Var =1571. 384272
N=58
<= 1. 912989
I D
=64
Var =2890. 560740
AM 1_H
O MO
<= 26. 067410
N=81
M u=157. 766667
M u=94. 826761
I D
=17
I D
=61
Var =5348. 366548
M u=92. 885714
N=158
SM R
_VSA4
> 1393. 000000
N=117
M u=104. 120984
E_vdw
<= - 8. 795825
N=163
M u=166. 937500
PC
+
N=203
I D
=59
M u=193. 364737
M u=100. 075912
N=213
<= 53. 550139
> 6. 500000
N=198
Var =2371. 171432
N=274
<= 1. 087000
I D
=18
I D
=58
M u=163. 419697
PEO E_VSA- 0
I D
=16
<= 6. 500000
N=984
M u=111. 806180
N=305
<= 82. 823990
I D
=14
N=361
M u=205. 383380
ASA-
I D
=12
I D
=57
M u=162. 941482
N=356
<= 221. 169665
I D
=10
> - 9. 631155
N=135
M u=171. 214480
PEO E_VSA+2
I D
=8
<= - 9. 631155
N=1174
M u=119. 878436
<= 17. 157779
I D
=6
N=498
M u=157. 640664
<= 10. 393096
N=100
I D
=34
PEO E_PC
-
> 10. 393096
N=135
I D
=35
N=22
<= - 1. 716735
I D
=42
N=93
st d_dim 3
> - 1. 716735
I D
=43
N=51
<= 0. 564750
I D
=54
N=26
> 0. 564750
I D
=55
N=263
M u=74. 408621
M u=97. 541000
M u=129. 323704
M u=162. 850000
M u=163. 956989
M u=131. 949020
M u=219. 038462
M u=175. 794297
Var =1203. 690097
Var =1588. 215217
Var =1802. 469217
Var =1692. 277045
Var =2207. 595785
Var =1733. 346415
Var =3853. 460059
Var =2266. 680241
b_r ot R
<= 0. 190983
I D
=36
N=94
> 0. 190983
I D
=37
N=41
M u=140. 021277
M u=104. 797561
Var =1688. 292528
Var =1200. 338274
Classification trees, boosting trees, random forest, regression trees
and honest trees and adaptive trees – lots of wood and forests - did you hear about them?
16
Other machine learning techniques from Statistica Dataminer we use
Most of them work for classification and regression
X loading scatter plot (p1 vs. p2)
0.15
0.10
0.05
0.00
p2
-0.05
-0.10
-0.15
-0.15
-0.10
-0.05
0.00
0.05
0.10
Model class
specific model
#
Generalized Linear Models (GLM)
General Discriminant Analysis
1
Binary logit (logistic) regression
2
Binary probit regression
3
Nonlinear model
Multivariate adaptive regression splines
(MARS)
4
Tree models
Standard Classification Trees (CART)
5
Standard General Chi-square
Interaction Detector (CHAID)
6
0.15
p1
Distance to the model Y
Number of components is 2
The normalized distance to the model is 0.742
Automatic
4.0
3.5
Exhaustive CHAID
7
Boosting classification trees
8
Multilayer Perceptron neural network (MLP)
9
Radial Basis Function neural network (RBF)
10
Support Vector Machines (SVM)
11
Naive Bayes classifier
12
k-Nearest Neighbors (KNN)
13
3.0
2.5
2.0
Distance
Neural Networks
1.5
1.0
0.5
0.0
1
99
197
393
589
785
981 1177 1373 1569 1765 1961 2157 2353
295
491
687
883 1079 1275 1471 1667 1863 2059 2255 2451
Case
Machine Learning
Normal probability plot of residuals
Dependent variable: MTP
Test set sample; Number of trees: 100
4
3
0.99
2
0.95
Normal quantile
1
0.85
0
0.50
-1
0.15
0.05
-2
0.01
-3
-4
-200
-150
-100
-50
0
50
100
150
200
Residual
17
Now with open source datamining tool WEKA
URL
SQL
Data
Yellow =
OK
Easy: enter DB URL, enter SQL statement, import data. Try free AquaStudio for SQL!
18
WEKA - Machine learning algorithms in Java
19
WEKA – fingerprint visualization
20
Conclusions regarding statistics:
JChem PCA and PLS output (Eigenvalues, scores, loadings)
are provided only as textfile. More univariate and multivariate tools needed.
1) JChem PCA and PLS results must have graphical output. (They must)
2) JChem PCA must be made faster (factor 600-1000) by using math routines.
3) Integration into Instant-JChem would be good or ChemAxon provides enhanced
bundled statistics tools.
4) Currently JDBC query from JChem to other statistical packages like WEKA or
Statistica or R or MATLAB or YALE is perfect. Each package works best in the field
it was designed for.
21