Transcript Slide 1

Machine learning for metabolomics
Genome Center Bioinformatics Technology Forum
Tobias Kind – September 2006
• A quick introduction to machine learning
• Algorithms, tools, and the importance of feature selection
• Machine Learning for classification and prediction of cancer data
1
Artificial Intelligence
Machine Learning Algorithms
unsupervised learning:
Clustering methods
supervised learning:
Support vector machines
MARS (multivariate adaptive regression splines)
Neural networks
Random Forest, Boosting trees, Honest trees,
Decision trees
CART (Classification and regression trees)
Genetic programming
transduction:
Bayesian Committee Machine
Transductive Support Vector Machine
...thanks to WIKI and Stuart Gansky
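To make the distinction concrete, here is a minimal Python/scikit-learn sketch (not from the slides, which use Statistica Dataminer): an unsupervised clustering step that never sees the class labels, next to a supervised SVM that is trained on them. Data and parameters are illustrative only.

```python
# Illustrative sketch only: unsupervised clustering vs. supervised classification
# on the same toy data (scikit-learn assumed; not part of the original slides).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Toy two-class data standing in for e.g. sick/healthy metabolite profiles.
X, y = make_blobs(n_samples=200, centers=2, n_features=5, random_state=0)

# Unsupervised: KMeans never sees the labels y.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised: the SVM is trained on labelled examples and evaluated on held-out data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
svm = SVC(kernel="rbf").fit(X_train, y_train)
print("SVM test accuracy:", svm.score(X_test, y_test))
```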
2
Applications in metabolomics
• Classification - genotype/wildtype, sick/healthy, happy*/unhappy, sleepy*/awake, love*/hate, old/young*
[Figure: score scatterplot (t1 vs. t2); SICK and Healthy samples form separate groups]
• Regression - predicting biological activities and molecular properties of unknown substances (QSAR and QSPR)
[Figure: scatterplot (Var1 vs. Var2) with fitted regression line Var2 = 3.3686 + 1.0028*x and 0.95 prediction interval]
• Optimization – using experimental design (DOE) for optimizing experiments (LC, GC, extraction of metabolites) with multiple variables as input (trial and error is not just "old school", it's stupid); a small DOE sketch follows below
* solved at the end
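As an illustration of the optimization point above, a minimal sketch of a small full-factorial design of experiments; the factor names and levels are made up for illustration and are not from the slides.

```python
# Hypothetical full-factorial design of experiments (DOE) for an extraction protocol,
# instead of trial and error. Factors and levels are illustrative assumptions.
from itertools import product

factors = {
    "solvent": ["methanol", "acetonitrile"],
    "temperature_C": [4, 25, 40],
    "extraction_time_min": [5, 15],
}

# Every combination of factor levels = one planned experiment (2 x 3 x 2 = 12 runs).
design = list(product(*factors.values()))
for run, levels in enumerate(design, start=1):
    settings = dict(zip(factors.keys(), levels))
    print(f"Run {run:2d}: {settings}")
```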
3
Algorithms we use in metabolomics
Model class                     | Specific model                                              | #
Generalized Linear Models (GLM) | General Discriminant Analysis                               | 1
                                | Binary logit (logistic) regression                          | 2
                                | Binary probit regression                                    | 3
Nonlinear model                 | Multivariate adaptive regression splines (MARS)             | 4
Tree models                     | Standard Classification Trees (CART)                        | 5
                                | Standard General Chi-square Automatic Interaction Detector (CHAID) | 6
                                | Exhaustive CHAID                                            | 7
                                | Boosting classification trees                               | 8
Neural Networks                 | Multilayer Perceptron neural network (MLP)                  | 9
                                | Radial Basis Function neural network (RBF)                  | 10
Machine Learning                | Support Vector Machines (SVM)                               | 11
                                | Naive Bayes classifier                                      | 12
                                | k-Nearest Neighbors (KNN)                                   | 13
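As a rough orientation, the sketch below maps several of the listed model classes onto scikit-learn estimators. This is an assumption for illustration only; the talk itself uses Statistica Dataminer, and the mapping is approximate.

```python
# Approximate scikit-learn counterparts of some models in the table above.
from sklearn.linear_model import LogisticRegression        # ~ binary logit regression (#2)
from sklearn.tree import DecisionTreeClassifier            # ~ CART (#5)
from sklearn.ensemble import GradientBoostingClassifier    # ~ boosting classification trees (#8)
from sklearn.neural_network import MLPClassifier           # ~ multilayer perceptron (#9)
from sklearn.svm import SVC                                 # ~ support vector machine (#11)
from sklearn.naive_bayes import GaussianNB                  # ~ naive Bayes classifier (#12)
from sklearn.neighbors import KNeighborsClassifier          # ~ k-nearest neighbors (#13)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

models = {
    "logit": LogisticRegression(max_iter=1000),
    "CART": DecisionTreeClassifier(),
    "boosted trees": GradientBoostingClassifier(),
    "MLP": MLPClassifier(max_iter=2000),
    "SVM": SVC(),
    "naive Bayes": GaussianNB(),
    "kNN": KNeighborsClassifier(),
}

# Placeholder dataset standing in for a metabolomics matrix.
X, y = load_breast_cancer(return_X_y=True)
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```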
4
Concept of predictive data mining for classification
Data Preparation: basic statistics, remove extreme outliers, transform or normalize datasets, mark sets with zero variances
Feature Selection: predict important features with MARS, PLS, NN, SVM, GDA, GA; apply voting or meta-learning
Model Training + Cross Validation: use only the important features; apply bootstrapping if only a few datasets are available; use GDA, CART, CHAID, MARS, NN, SVM, Naive Bayes, kNN for prediction
Model Testing: calculate performance with percent disagreement and Chi-square statistics
Model Deployment: deploy the model for unknown data; use PMML, VB, C++, JAVA
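A minimal sketch of this workflow, assuming Python/scikit-learn and a stand-in dataset instead of Statistica Dataminer: prepare the data, select features, train with cross-validation, test on held-out samples, and keep the fitted pipeline for deployment.

```python
# Hedged sketch of the predictive data mining workflow above (scikit-learn assumed).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)                  # stand-in for a metabolomics matrix
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

pipeline = make_pipeline(
    StandardScaler(),                  # data preparation: normalize
    SelectKBest(f_classif, k=10),      # feature selection: keep the 10 most informative variables
    SVC(kernel="rbf"),                 # model
)

# Model training + cross-validation on the training split only.
print("CV accuracy:", cross_val_score(pipeline, X_train, y_train, cv=5).mean())

# Model testing on held-out data; the fitted pipeline can then be deployed.
pipeline.fit(X_train, y_train)
print("Test accuracy:", pipeline.score(X_test, y_test))
```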
5
Automated machine learning workflow implemented in Statistica Dataminer
6
Occam meets Epicurus
aka feature selection
William of Ockham
1285-1349
Occam's Razor:
“Of two equivalent theories or
explanations, all other things being
equal, the simpler one is to be
preferred.”
Ἐπίκουρος
341 BC – 270 BC
Epicurus:
Principle of multiple explanations
“all consistent models should be retained”
...thanks to WIKI
7
What's the deal with feature selection?
• Reduces computational complexity
• Avoids the curse of dimensionality
• Improves accuracy (see the sketch below)
• The selected features can provide insights about the nature of the problem*
* Margin Based Feature Selection Theory and Algorithms; Amir Navot
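A small illustrative sketch (synthetic data, scikit-learn; not from the slides) of the accuracy point: a k-nearest-neighbor classifier on data with many noise variables, with and without a univariate filter.

```python
# Illustrative only: classification accuracy with all variables vs. with selected variables.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# 1000 variables, only 10 of which actually carry class information.
X, y = make_classification(n_samples=200, n_features=1000, n_informative=10,
                           n_redundant=0, random_state=0)

knn = KNeighborsClassifier()
filtered = make_pipeline(SelectKBest(f_classif, k=10), KNeighborsClassifier())

print("all 1000 features   :", cross_val_score(knn, X, y, cv=5).mean())
print("10 selected features:", cross_val_score(filtered, X, y, cv=5).mean())
```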
8
What's the deal with feature selection? Example!
[Figure: score scatterplot (t1 vs. t2) and loading scatterplot (p1 vs. p2) of the ALL/AML microarray samples using all variables; the Class {ALL} and Class {AML} points overlap]
Principal component analysis (PCA) example (here of microarray data)
WITHOUT feature selection → no separation possible (red and green points overlap)
Golub, 1999 Science Mag
9
Feature selection
[Figure: importance plot (Chi-square importance), dependent variable: CLASS; X95735_at ranks highest among the listed probes (M19309_s_at, L10386_at, M61764_at, ...)]
Certain variables are more important/useful than others.
All other variables could be noise or just slow down the computation.
Machine learning (classification/regression) still needs to be applied afterwards.
Feature selection is a powerful pre-filter (based on different algorithms); a small example follows.
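A hedged sketch of such a chi-square importance ranking, assuming scikit-learn and a synthetic stand-in for the expression matrix (the slide itself ranks the Golub probes inside Statistica).

```python
# Chi-square importance ranking on a synthetic placeholder for the 72-sample microarray data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=72, n_features=500, n_informative=15, random_state=0)
X = MinMaxScaler().fit_transform(X)                 # chi2 requires non-negative values

scores, p_values = chi2(X, y)                       # chi-square statistic per variable
top = np.argsort(scores)[::-1][:15]                 # the 15 most important variables
for idx in top:
    print(f"variable {idx:3d}  chi2 = {scores[idx]:.3f}  p = {p_values[idx]:.3g}")
```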
10
With feature selection
[Figure: score scatterplot (t1 vs. t2) and X loading scatterplot (p1 vs. p2) after feature selection; ALL and AML samples now separate; outside variables = high modelling power, inside variables = low modelling power]
The same dataset, but with only the important variables; classification is now possible.
Certain algorithms have built-in feature selection (such as MARS, PLS, NN).
Currently it is always useful to perform feature selection; a sketch of the before/after comparison follows.
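A small sketch of the before/after effect of slides 9 and 11, assuming scikit-learn and synthetic data: PCA class separation computed once on all variables and once on the selected variables only.

```python
# Illustrative only: PCA class separation with and without feature selection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=72, n_features=2000, n_informative=20,
                           n_redundant=0, random_state=1)

def class_separation(scores, y):
    """Distance between class centroids in the first two principal components."""
    c0, c1 = scores[y == 0].mean(axis=0), scores[y == 1].mean(axis=0)
    return np.linalg.norm(c0 - c1)

pca_all = PCA(n_components=2).fit_transform(X)                    # all 2000 variables
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)          # keep 20 selected variables
pca_sel = PCA(n_components=2).fit_transform(X_sel)

print("centroid distance, all variables     :", class_separation(pca_all, y))
print("centroid distance, selected variables:", class_separation(pca_sel, y))
```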
11
Machine Learning and statistics tools
[Screenshots: response curves, PLS, tree model, cluster analysis, neural network, feature selection, machine learning (KNN)]
We use Statistica Dataminer as a comprehensive data mining work tool.
WEKA, YALE, and R are free but currently not as powerful as the Dataminer.
Multiprocessor support is still absent in most versions - that sucks...
12
LIVE Demo with Statistica Dataminer (~10-15 min)
Classification of cancer data
from LC-MS and GC-MS experiments
Click here
13
QUIZ solutions
[Chemical structure drawings (likely serotonin and tryptophan)]
• happy/unhappy - serotonin in bananas makes us happy
• sleepy/awake - tryptophan in turkey and watching sport on TV makes us sleepy
• love/hate - oxytocin makes the baby love the mommy and vice versa
• old/young - secret (mindset + genotype)
14
Machine learning for metabolomics
Genome Center Bioinformatics Technology Forum
Tobias Kind – September 2006
Thank you!
Thanks to the FiehnLab!
15