Prediction of Molecular Bioactivity for Drug Design Sunita Sarawagi, IITB
Prediction of Molecular Bioactivity
for Drug Design
Experiences from the KDD Cup 2001 competition
Sunita Sarawagi, IITB
http://www.it.iitb.ac.in/~sunita
Joint work with
B. Anuradha, IITB
Anand Janakiraman, IITB
Jayant Haritsa, IISc
The dataset
Dataset provided by DuPont Pharmaceuticals
Activity of compounds binding to thrombin
Library of compounds included:
1909 known molecules (42 actively binding
thrombin)
139,351 binary features describe the 3-D
structure of each compound
636 new compounds with unknown capacity
to bind to thrombin
Sample data
0,1,0,0,0,0,… …,0,0,0,0,0,0,I
0,0,0,0,0,0,… …,0,0,0,0,0,1,I
0,0,0,0,0,0,… …,0,0,0,0,0,0,I
0,0,0,0,0,0,… …,0,0,0,0,0,0,I
0,1,0,0,0,1,… …,0,1,0,0,0,1,A
0,1,0,0,0,1,… …,0,1,0,0,0,1,A
0,1,0,0,0,1,… …,0,1,0,0,1,1,?
0,1,1,0,0,1,… …,0,1,1,0,0,1,?
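A minimal sketch of reading records in the format shown above, assuming each line is a comma-separated list of 0/1 features followed by a label (A, I, or ? for unknown). The function names are hypothetical, and the demo lines are short made-up records, not rows from the actual dataset.

```python
from typing import List, Optional, Tuple


def parse_record(line: str) -> Tuple[List[int], Optional[str]]:
    """Split one comma-separated record into (features, label).

    The last field is the label: 'A' (active), 'I' (inactive),
    or '?' (unknown, returned as None).
    """
    fields = [f.strip() for f in line.strip().split(",")]
    *bits, label = fields
    features = [int(b) for b in bits]
    if any(b not in (0, 1) for b in features):
        raise ValueError("features must be binary")
    if label not in ("A", "I", "?"):
        raise ValueError(f"unexpected label: {label!r}")
    return features, (None if label == "?" else label)


def to_sparse(features: List[int]) -> List[int]:
    """Keep only the indices of the 1-bits, since the data is mostly zeros."""
    return [i for i, bit in enumerate(features) if bit == 1]
```

Storing only the indices of the ones is one way to keep roughly 140,000 mostly-zero features per compound manageable in memory.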
Challenges
Large number of binary features,
significantly fewer training instances:
140,000 vs 2000!
Highly skewed:
1867 In-actives, 42 Actives.
Varying degrees of correlation among
features
Differences in the training and test
distributions
Steps
Familiarization with data
Data has noise: four identical records (all 0s) with
different labels
Lots more 0s than 1s
Number of 1s significantly higher for As than Is
Feature selection
Build classifiers
Combine classifiers
Incorporate unlabeled test instances
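The familiarization checks in the first step above can be sketched as follows: the average number of ones per class, and identical feature vectors that carry different labels. This is an illustrative sketch on toy records; the function names and the record layout are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Record = Tuple[Tuple[int, ...], str]  # (binary features, label 'A' or 'I')


def mean_ones_per_class(records: List[Record]) -> Dict[str, float]:
    """Average number of 1-bits per record, computed separately per label."""
    totals: Dict[str, int] = defaultdict(int)
    counts: Dict[str, int] = defaultdict(int)
    for features, label in records:
        totals[label] += sum(features)
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in counts}


def conflicting_duplicates(records: List[Record]) -> List[Tuple[int, ...]]:
    """Feature vectors that occur with more than one label (label noise)."""
    labels_seen: Dict[Tuple[int, ...], Set[str]] = defaultdict(set)
    for features, label in records:
        labels_seen[features].add(label)
    return [f for f, labels in labels_seen.items() if len(labels) > 1]


# Toy records, not taken from the real dataset.
toy = [
    ((0, 0, 0, 0), "I"),
    ((0, 0, 0, 0), "A"),
    ((1, 1, 0, 1), "A"),
    ((0, 1, 0, 0), "I"),
]
```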
First step: feature selection
Most commercial classifiers cannot handle
140,000 features even with 1 GB memory.
Entropy-based individual feature selection
Step-wise feature selection
Does not handle redundant attributes.
Too brittle
Top entropy attribute with a “1” in each active
compound
Exploiting small counts of Actives
Want all important groups of redundant attributes
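A minimal sketch of entropy-based individual feature selection, assuming it means ranking each binary feature by information gain against the A/I label. The slides do not give the exact formula, so this is one standard reading and not necessarily the authors' implementation. Scoring features one at a time like this ignores redundancy among them, which is the weakness noted above.

```python
import math
from typing import List, Sequence


def entropy(pos: int, neg: int) -> float:
    """Binary entropy (in bits) of a set with pos actives and neg inactives."""
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(column: Sequence[int], labels: Sequence[str]) -> float:
    """Reduction in label entropy after splitting on one binary feature."""
    n = len(labels)
    act = sum(1 for y in labels if y == "A")
    on = [y for x, y in zip(column, labels) if x == 1]
    on_act = sum(1 for y in on if y == "A")
    n_off = n - len(on)
    off_act = act - on_act
    conditional = (len(on) / n) * entropy(on_act, len(on) - on_act) + (
        n_off / n
    ) * entropy(off_act, n_off - off_act)
    return entropy(act, n - act) - conditional


def top_k_features(rows: List[List[int]], labels: List[str], k: int) -> List[int]:
    """Indices of the k features with the highest information gain."""
    n_features = len(rows[0])
    gains = [
        information_gain([r[j] for r in rows], labels) for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: gains[j], reverse=True)[:k]
```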
Building classifiers
Partition training data using stratified
sampling
Two-thirds training data
One-third validation data
Classification methods attempted
Decision tree classifiers
Naïve-Bayes
SVMs
Hand-crafted clustering/nearest neighbor hybrid
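The two-thirds / one-third stratified partition described above can be sketched as follows. This is an illustrative version using only the standard library; the function name, the fixed seed, and the rounding rule are assumptions. Stratifying matters here because a plain random split could leave very few of the 42 actives in one of the two parts.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str], train_fraction: float = 2 / 3, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, validation_indices), split class by class."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train: List[int] = []
    valid: List[int] = []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        valid.extend(indices[cut:])
    return sorted(train), sorted(valid)
```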
Decision Tree
C4.5
[Figure: decision tree testing features f88235, f137567, f80106, f26913, f135832 and f25144 (each "= 1"), with leaves A (10), A (5), A (4), A (3), A (2), A (2) and I (338/6)]
Confusion matrix (classes A and I):
     A    I
A    3    7
I    1  459
Naïve Bayes
Data characteristics very similar to text
lots of features, sparse data, few ones
Naïve Bayes found very effective for text
classification
Accuracy: All actives misclassified!
     A    I
A    0   10
I    1  459
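The Naïve Bayes numbers above show why plain accuracy is misleading on data this skewed. The sketch below assumes the rows of the confusion matrix are the true classes and the columns the predicted classes, which is consistent with "all actives misclassified" but is not stated on the slide. The class-averaged accuracy is shown as one common remedy; it is an illustration, not necessarily the competition's scoring rule.

```python
from typing import Dict

Matrix = Dict[str, Dict[str, int]]  # matrix[true_class][predicted_class]


def overall_accuracy(m: Matrix) -> float:
    """Fraction of all instances on the diagonal."""
    correct = sum(m[c][c] for c in m)
    total = sum(sum(row.values()) for row in m.values())
    return correct / total


def per_class_accuracy(m: Matrix) -> Dict[str, float]:
    """Fraction of each true class that was predicted correctly."""
    return {c: m[c][c] / sum(m[c].values()) for c in m}


def class_averaged_accuracy(m: Matrix) -> float:
    """Mean of the per-class accuracies, so each class counts equally."""
    per_class = per_class_accuracy(m)
    return sum(per_class.values()) / len(per_class)


# Naive Bayes matrix from the slide, under the row=true assumption.
naive_bayes: Matrix = {"A": {"A": 0, "I": 10}, "I": {"A": 1, "I": 459}}
```

Under that assumption the overall accuracy is 459/470, about 97.7%, while the accuracy on actives is 0 and the class-averaged accuracy is just under 50%.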
Support vector machines
Have received a lot of attention recently
Requires tuning: which kernel, what parameters?
Several freely available packages: SVMTorch
Accuracy: slightly worse than decision trees
[Figure: plot over two features, fi and fj]
Hand-crafted hybrid
Find features such that actives cluster
together using appropriate distance measure
[Figure: plot over two features, fi and fj; legend: Training active, Training inactive, Test Record]
Incremental Feature Selection
Pick features ONE by ONE
that result in maximum clustering of the actives.
And maximum separation from the inactives.
Objective function:
Maximum separation between centroids of the
Actives and In-actives
Distance function: matching ones
Careful selection of training Actives.
Accuracy: 100%, 493 features
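A rough sketch of picking features one by one, under a stated assumption about the objective: the slides name "matching ones" as the distance function but do not give the exact separation formula. Here separation is taken to be the mean number of matching ones between pairs of actives minus the mean between active/inactive pairs. All function names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence

Row = Sequence[int]


def matching_ones(a: Row, b: Row, features: Sequence[int]) -> int:
    """Number of selected features on which both records have a 1."""
    return sum(1 for j in features if a[j] == 1 and b[j] == 1)


def separation(actives: List[Row], inactives: List[Row], features: List[int]) -> float:
    """Assumed objective: active-active similarity minus active-inactive similarity.

    Needs at least two actives and one inactive.
    """
    within = [matching_ones(a, b, features) for a, b in combinations(actives, 2)]
    across = [matching_ones(a, b, features) for a in actives for b in inactives]
    return sum(within) / len(within) - sum(across) / len(across)


def greedy_select(actives: List[Row], inactives: List[Row], max_features: int) -> List[int]:
    """Add the feature that most improves separation; stop when none helps."""
    n_features = len(actives[0])
    selected: List[int] = []
    best = 0.0
    while len(selected) < max_features:
        candidates = [j for j in range(n_features) if j not in selected]
        if not candidates:
            break
        scored = [(separation(actives, inactives, selected + [j]), j) for j in candidates]
        score, j = max(scored, key=lambda t: t[0])
        if score <= best:
            break
        selected.append(j)
        best = score
    return selected
```

With this particular objective each feature contributes independently, so the greedy loop behaves like a ranking with a stopping rule; a non-additive objective would make the step-by-step search matter more.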
Final approach
Test data: significantly denser
Methods like SVM, NB and the clustering-based hybrid will
not generalize
Preferred a distribution-independent method
Ensemble of Decision Trees
On disjoint attributes --- unconventional
Semi-supervised training
Introduce feedback from the test data in multiple
rounds
Building tree ensembles
Initially picked ~20000 features based on entropy.
More than one tree to take care of large feature
space.
[Figure: sequence of trees with "Remove features" steps between them]
Repeat as long as accuracy on validation data does not
drop
All groups of redundant features exploited.
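The idea of growing several trees on disjoint attributes can be sketched as follows. This is a simplified illustration: the tree learner is a tiny ID3-style stand-in rather than C4.5, and the loop stops when no useful split is left or a hypothetical `max_trees` limit is reached, not on validation accuracy as described above.

```python
import math
from collections import Counter
from typing import List, Set, Union

Tree = Union[str, tuple]  # leaf label, or (feature, subtree_if_0, subtree_if_1)


def _entropy(labels: List[str]) -> float:
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def build_tree(rows, labels, allowed: Set[int], used: Set[int], depth: int = 3) -> Tree:
    """Grow a small tree using only `allowed` features; record them in `used`."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == 0 or len(set(labels)) == 1:
        return majority
    base, best_j, best_gain = _entropy(labels), None, 1e-9
    for j in sorted(allowed):
        parts = [[y for r, y in zip(rows, labels) if r[j] == v] for v in (0, 1)]
        if not parts[0] or not parts[1]:
            continue
        gain = base - sum(len(p) / len(labels) * _entropy(p) for p in parts)
        if gain > best_gain:
            best_j, best_gain = j, gain
    if best_j is None:
        return majority
    used.add(best_j)
    children = []
    for v in (0, 1):
        keep = [i for i, r in enumerate(rows) if r[best_j] == v]
        children.append(build_tree([rows[i] for i in keep], [labels[i] for i in keep],
                                   allowed - {best_j}, used, depth - 1))
    return (best_j, children[0], children[1])


def predict(tree: Tree, row) -> str:
    while not isinstance(tree, str):
        feature, if_zero, if_one = tree
        tree = if_one if row[feature] == 1 else if_zero
    return tree


def build_disjoint_ensemble(rows, labels, n_features: int, max_trees: int) -> List[Tree]:
    """Each new tree may only use features that no earlier tree has used."""
    remaining = set(range(n_features))
    trees: List[Tree] = []
    for _ in range(max_trees):
        used: Set[int] = set()
        tree = build_tree(rows, labels, remaining, used)
        if not used:  # no informative split left among remaining features
            break
        trees.append(tree)
        remaining -= used
    return trees
```

Because the features of each tree are removed before the next one is grown, a redundant copy of a strong feature ends up in a later tree instead of being ignored.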
Incorporating unlabeled instances
Augment training data with sure test
instances.
Re-train another ensemble of trees using
same method
Include more unlabelled instances with sure
predictions
Repeat a few more times...
How to capture drift?
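The rounds described above follow a self-training pattern, sketched below with a pluggable learner. The `train` and `predict_with_confidence` callables, the threshold, and the toy one-dimensional centroid model are all hypothetical stand-ins for the tree ensembles, included only so the loop can run.

```python
from typing import Any, Callable, List, Tuple

Example = Tuple[Any, str]  # (instance, label)


def self_train(
    labeled: List[Example],
    unlabeled: List[Any],
    train: Callable[[List[Example]], Any],
    predict_with_confidence: Callable[[Any, Any], Tuple[str, float]],
    threshold: float,
    max_rounds: int = 3,
) -> Tuple[List[Example], List[Any]]:
    """Move confidently predicted instances into the training set, round by round."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)
        sure: List[Example] = []
        unsure: List[Any] = []
        for x in pool:
            label, confidence = predict_with_confidence(model, x)
            if confidence >= threshold:
                sure.append((x, label))
            else:
                unsure.append(x)
        if not sure:  # nothing more can be added
            break
        labeled.extend(sure)
        pool = unsure
    return labeled, pool


# Toy stand-in model: class means of a single number per instance.
def train_centroids(labeled: List[Example]) -> dict:
    means = {}
    for cls in ("A", "I"):
        values = [x for x, y in labeled if y == cls]
        means[cls] = sum(values) / len(values)
    return means


def centroid_confidence(model: dict, x: float) -> Tuple[str, float]:
    d_a, d_i = abs(x - model["A"]), abs(x - model["I"])
    total = d_a + d_i
    return ("A" if d_a < d_i else "I"), (abs(d_a - d_i) / total if total else 0.0)
```

Instances that never clear the threshold stay in the pool, which corresponds to the test records left undecided at the end.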
Capturing drift
Solution: Validate with independent data
Be sure to include only correctly labeled data
First approach: Same prediction by all trees
Weighted prediction by each tree
On validation data, found errors in this scheme
Pruning not a solution
Weight: fraction of Actives
Pick the right threshold using validation data.
Stop when no more unlabelled data can be
added
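The two ways of deciding which test instances are "sure" can be sketched as follows, assuming each tree reports, for the leaf an instance reaches, how many training actives and how many training instances in total fell into that leaf. The data layout and function names are assumptions, and the threshold search simply minimizes validation errors.

```python
from typing import List, Optional, Sequence, Tuple

LeafCounts = Tuple[int, int]  # (actives at the leaf, total instances at the leaf)


def unanimous(leaf_counts: Sequence[LeafCounts]) -> Optional[str]:
    """First approach: accept a label only if every tree predicts it."""
    votes = {"A" if a / t >= 0.5 else "I" for a, t in leaf_counts}
    return votes.pop() if len(votes) == 1 else None


def ensemble_score(leaf_counts: Sequence[LeafCounts]) -> float:
    """Weighted prediction: average fraction of actives over the trees."""
    return sum(a / t for a, t in leaf_counts) / len(leaf_counts)


def pick_threshold(
    scores: Sequence[float], labels: Sequence[str], candidates: Sequence[float]
) -> float:
    """Choose the candidate threshold with the fewest validation errors."""

    def errors(threshold: float) -> int:
        return sum((s >= threshold) != (y == "A") for s, y in zip(scores, labels))

    return min(candidates, key=errors)
```

Weighting by the fraction of actives lets a leaf that is only weakly active count for less than a pure active leaf.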
Final state
Three rounds each with about 6 trees
Use meta-learner on validation data to pick
final criteria
Unlabelled data included:
126 actives & 311 inactives
Remaining 200 in confusion
Sum of scores times number of trees claiming
Actives
Several other last-minute hacks.
Outcome
Home Team
Winning Entry:
Weighted: 68.4%
Accuracy: 70.03%
Winner’s method
Pre-processing: Feature subset selection
using mutual information (200 of 139,351
features)
Learning Bayesian network models of
different complexity (2 to 12 features)
Choosing a model (ROC area, model
complexity)
Postmortem: Was all this
necessary?
Without semi-supervised learning:
Single decision tree = 49%
6-tree ensemble on training data alone:
Majority = 57%
Confidence weighted = 63%
With unlabelled data: 64.3%
Lessons learnt
Products:
Need tools that scale in number of features
Research problems:
Classifiers that are not tied to distribution
similarity with the training data
More principled way of including unlabelled
instances.