Prediction of Molecular Bioactivity for Drug Design Sunita Sarawagi, IITB


Prediction of Molecular Bioactivity
for Drug Design
Experiences from the KDD Cup 2001 competition
Sunita Sarawagi, IITB
http://www.it.iitb.ac.in/~sunita
Joint work with
B. Anuradha, IITB
Anand Janakiraman, IITB
Jayant Haritsa, IISc
The dataset
- Dataset provided by DuPont Pharmaceuticals
- Activity of compounds binding to thrombin
- Library of compounds included:
  - 1,909 known molecules (42 actively binding thrombin)
  - 139,351 binary features describing the 3-D structure of each compound
  - 636 new compounds with unknown capacity to bind to thrombin
Sample data
0,1,0,0,0,0,… …,0,0,0,0,0,0,I
0,0,0,0,0,0,… …,0,0,0,0,0,1,I
0,0,0,0,0,0,… …,0,0,0,0,0,0,I
0,0,0,0,0,0,… …,0,0,0,0,0,0,I
0,1,0,0,0,1,… …,0,1,0,0,0,1,A
0,1,0,0,0,1,… …,0,1,0,0,0,1,A
0,1,0,0,0,1,… …,0,1,0,0,1,1,?
0,1,1,0,0,1,… …,0,1,1,0,0,1,?
Challenges
- Large number of binary features, significantly fewer training instances: 140,000 vs 2,000!
- Highly skewed: 1,867 Inactives, 42 Actives
- Varying degrees of correlation among features
- Differences in the training and test distributions
Steps
- Familiarization with the data
  - Data has noise: four identical records (all 0s) with different labels
  - Lots more 0s than 1s
  - Number of 1s significantly higher for Actives than Inactives
- Feature selection
- Build classifiers
- Combine classifiers
- Incorporate unlabeled test instances
First step: feature selection
- Most commercial classifiers cannot handle 140,000 features even with 1 GB of memory.
- Entropy-based individual feature selection
  - Does not handle redundant attributes
  - Too brittle
- Step-wise feature selection
  - Top-entropy attribute with a "1" in each active compound
  - Exploits the small count of Actives
  - Want all important groups of redundant attributes
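Entropy-based individual feature selection scores each binary feature by how much knowing its value reduces the entropy of the class label. A minimal sketch of that scoring (function and variable names are my own, not from the talk):

```python
import math

def entropy(labels):
    """Entropy (in bits) of a list of 'A'/'I' class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = labels.count('A') / n
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(feature_col, labels):
    """Reduction in label entropy from splitting on one 0/1 feature."""
    n = len(labels)
    ones = [y for x, y in zip(feature_col, labels) if x == 1]
    zeros = [y for x, y in zip(feature_col, labels) if x == 0]
    return (entropy(labels)
            - (len(ones) / n) * entropy(ones)
            - (len(zeros) / n) * entropy(zeros))

def top_k_features(feature_cols, labels, k):
    """Indices of the k features with the highest individual gain."""
    ranked = sorted(range(len(feature_cols)),
                    key=lambda j: information_gain(feature_cols[j], labels),
                    reverse=True)
    return ranked[:k]
```

Because each feature is scored in isolation, two perfectly redundant copies of a good feature get identical scores, which is exactly the redundancy problem the slide notes.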
Building classifiers
- Partition training data using stratified sampling
  - Two-thirds training data
  - One-third validation data
- Classification methods attempted
  - Decision tree classifiers
  - Naïve Bayes
  - SVMs
  - Hand-crafted clustering/nearest-neighbor hybrid
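The stratified split matters here: with only 42 Actives, a plain random two-thirds/one-third split could leave the validation set with almost no positives. A minimal illustration (not the authors' code):

```python
import random

def stratified_split(labels, train_frac=2/3, seed=0):
    """Return (train_idx, valid_idx) preserving each class's proportion.

    Each class is shuffled and cut separately, so rare Actives appear
    in both partitions at roughly the same rate as in the full data.
    """
    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_frac)
        train_idx += idx[:cut]
        valid_idx += idx[cut:]
    return sorted(train_idx), sorted(valid_idx)
```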
Decision Tree
- C4.5
- [Tree diagram: successive "= 1" splits on features f88235, f137567, f80106, f26913, f135832, f25144; each positive branch ends in an A leaf covering 2 to 10 training records, and the final negative branch in an I leaf covering 338 records with 6 misclassified]
- Confusion matrix on validation data:

             predicted A   predicted I
  true A          3              7
  true I          1            459
Naïve Bayes
- Data characteristics very similar to text: lots of features, sparse data, few ones
- Naïve Bayes found very effective for text classification
- Accuracy: all Actives misclassified!

             predicted A   predicted I
  true A          0             10
  true I          1            459
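The failure mode is easy to reproduce with a toy Bernoulli naive Bayes: with a class prior of roughly 42:1,867, the log-prior term swamps weak per-feature evidence, so every record gets called Inactive. A hedged sketch (not the implementation the team used):

```python
import math

def train_bernoulli_nb(X, y, alpha=1.0):
    """Bernoulli naive Bayes with Laplace smoothing; X rows are 0/1 lists."""
    model = {}
    n_feat = len(X[0])
    for cls in sorted(set(y)):
        rows = [x for x, yi in zip(X, y) if yi == cls]
        log_prior = math.log(len(rows) / len(X))
        # Smoothed estimate of P(feature_j = 1 | class)
        p1 = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
              for j in range(n_feat)]
        model[cls] = (log_prior, p1)
    return model

def predict(model, x):
    """Pick the class with the highest log posterior."""
    best, best_score = None, -math.inf
    for cls, (log_prior, p1) in model.items():
        score = log_prior + sum(math.log(p if xi else 1 - p)
                                for xi, p in zip(x, p1))
        if score > best_score:
            best, best_score = cls, score
    return best
```

With balanced classes and a clean feature the model works; skew the prior and weaken the feature, and the minority class disappears from the predictions.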
Support vector machines
- Have received lots of attention recently
- Require tuning: which kernel, what parameters?
- Several freely available packages: SVMTorch
- Accuracy: slightly worse than decision trees
- [Figure: schematic separating hyperplane in the (fi, fj) feature plane]
Hand-crafted hybrid
- Find features such that the Actives cluster together under an appropriate distance measure
- [Figure: training Actives clustering in the (fi, fj) plane, separated from training Inactives; a test record is classified by proximity]
Incremental Feature Selection
- Pick features one by one
  - that result in maximum clustering of the Actives
  - and maximum separation from the Inactives
- Objective function: maximum separation between the centroids of the Actives and Inactives
- Distance function: matching ones
- Careful selection of training Actives
- Accuracy: 100%, 493 features
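The greedy loop described above can be sketched as follows. The objective here uses the absolute difference between per-feature class centroids as a stand-in for the talk's "matching ones" measure; the names and the exact objective are assumptions:

```python
def centroid(rows, feats):
    """Mean value of each chosen feature over the given rows."""
    return [sum(r[j] for r in rows) / len(rows) for j in feats]

def separation(act_rows, inact_rows, feats):
    """Distance between the Active and Inactive centroids on chosen features.

    Features where Actives are mostly 1 and Inactives mostly 0 push
    the centroids apart.
    """
    ca = centroid(act_rows, feats)
    ci = centroid(inact_rows, feats)
    return sum(abs(a - b) for a, b in zip(ca, ci))

def greedy_select(act_rows, inact_rows, n_feat, k):
    """Add features one at a time, keeping whichever most increases separation."""
    chosen = []
    for _ in range(k):
        best_j = max((j for j in range(n_feat) if j not in chosen),
                     key=lambda j: separation(act_rows, inact_rows,
                                              chosen + [j]))
        chosen.append(best_j)
    return chosen
```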
Final approach
- Test data: significantly denser
- Methods like SVM, NB, and clustering-based will not generalize
- Preferred a distribution-independent method
- Ensemble of decision trees
  - On disjoint attributes --- unconventional
- Semi-supervised training
  - Introduce feedback from the test data in multiple rounds
Building tree ensembles
- Initially picked ~20,000 features based on entropy
- More than one tree to take care of the large feature space
- Build a tree, remove its features from the pool, and repeat as long as accuracy on validation data does not drop
- All groups of redundant features exploited
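The disjoint-attribute loop can be illustrated with one-level decision stumps standing in for C4.5 trees: train on the current feature pool, remove the features the tree used, repeat. Because each round removes the last tree's features, later trees are forced onto redundant copies of the signal. A sketch under those assumptions:

```python
def best_stump(X, y, pool):
    """One-level 'tree': the pooled feature whose value best predicts the label."""
    def acc(j):
        right = sum(1 for x, yi in zip(X, y)
                    if ('A' if x[j] else 'I') == yi)
        return right / len(y)
    return max(sorted(pool), key=acc)

def disjoint_ensemble(X, y, n_feat, n_trees):
    """Train stumps on disjoint feature pools.

    After each stump is trained, its feature is removed from the pool,
    so the next stump must find a different (redundant) signal.
    """
    pool = set(range(n_feat))
    trees = []
    for _ in range(n_trees):
        if not pool:
            break
        j = best_stump(X, y, pool)
        trees.append(j)
        pool.discard(j)  # this "tree" used feature j; remove it
    return trees
```

A real tree uses many features per round; the principle of removing them all before the next round is the same.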
Incorporating unlabeled instances
- Augment training data with "sure" test instances
- Re-train another ensemble of trees using the same method
- Include more unlabelled instances with sure predictions
- Repeat a few more times...
- How to capture drift?
Capturing drift
- Solution: validate with independent data
- Be sure to include only correctly labeled data
- First approach: same prediction by all trees
  - On validation data, found errors in this scheme
  - Pruning not a solution
- Weighted prediction by each tree
  - Weight: fraction of Actives
  - Pick the right threshold using validation data
- Stop when no more unlabelled data can be added
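Picking the threshold on validation data can be sketched as follows; plain accuracy stands in here for whatever criterion the team actually optimized, and the scoring scheme (each tree contributing its leaf's fraction of Actives) follows the slide's weighting idea only loosely:

```python
def pick_threshold(scores, labels):
    """Choose the cutoff on summed tree scores that maximizes
    accuracy on a held-out validation set."""
    best_t, best_acc = 0.0, -1.0
    for t in sorted(set(scores)):
        acc = sum(1 for s, y in zip(scores, labels)
                  if ('A' if s >= t else 'I') == y) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def ensemble_predict(tree_scores, threshold):
    """Each tree contributes a weight (e.g. its leaf's fraction of
    Actives); call the record Active when the sum clears the cutoff."""
    return 'A' if sum(tree_scores) >= threshold else 'I'
```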
Final state
- Three rounds, each with about 6 trees
- Unlabelled data included: 126 Actives & 311 Inactives
  - Remaining ~200 in confusion
- Use a meta-learner on validation data to pick the final criterion:
  - Sum of scores times the number of trees claiming Active
- Several other last-minute hacks
Outcome
- Home Team, weighted: 68.4%
- Winning Entry, accuracy: 70.03%
Winner’s method
- Pre-processing: feature-subset selection using mutual information (200 of 139,351 features)
- Learning Bayesian network models of different complexity (2 to 12 features)
- Choosing a model (ROC area, model complexity)
Postmortem: Was all this necessary?
- Without semi-supervised learning:
  - Single decision tree = 49%
  - 6-tree ensemble on training data alone:
    - Majority = 57%
    - Confidence weighted = 63%
- With unlabelled data: 64.3%
Lessons learnt
- Products: need tools that scale in the number of features
- Research problems:
  - Classifiers that are not tied to distribution similarity with the training data
  - A more principled way of including unlabelled instances