Predicting patterns of biological performance using chemical substructure features
Diego Borges-Rivera
08/04/08
Introduction
cheminformatics: lets us describe similarity between compounds computationally
synthetic chemists: describe similarity through visual inspection
we will describe compounds by the presence of chemical substructures
we will attempt to identify sets of substructures that predict biological performance
Previous work
Clemons/Kahne/Wagner et al.: disaccharide profiling in multiple cell states
found sets of substructures relevant to biological activity patterns
substructures highly specific to disaccharides
[Figure: axis labeled "substructures", ticks from 10 to 60]
Biological performance profile
400 compounds, 8 assays in duplicate
tested for cell proliferation in 8 different cell lines
class labels are active (A) or inactive (I)
[Figure: example of an active compound]
What are fingerprints?
compound collection fed into commercial software
each substructure = 1 bit
the fingerprint shows which substructures are present
[Figure: example substructures #7017, #886, and #1725]
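The idea of one bit per substructure can be sketched in a few lines of Python. This is only an illustration, not the commercial software used in the talk: the function name `make_fingerprint`, the three-entry substructure list, and the example compound are all hypothetical.

```python
def make_fingerprint(present_ids, all_ids):
    """Return one bit per substructure in all_ids: 1 if present, else 0."""
    present = set(present_ids)
    return [1 if sid in present else 0 for sid in all_ids]


# Hypothetical substructure dictionary, reusing the three IDs shown on the slide.
substructure_ids = [886, 1725, 7017]

# Hypothetical compound assumed to contain substructures #7017 and #886.
fp = make_fingerprint([7017, 886], substructure_ids)
print(fp)
```

The bit order follows the order of `substructure_ids`, so two fingerprints are only comparable when they are built from the same list.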
Overview of cheminformatic methods
produced fingerprints: 7700 substructures in total
filtered the set, leaving 2166 substructures
Overview of computational methods
two steps, independent of each other
feature (substructure) selection to find predictive subsets
evaluate methods for predictive value
ReliefF: substructure selection
[Figure: 2166 ReliefF weights on a scale from -1 to +1, with the top 5 and bottom 5 substructures shown]
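A simplified, two-class ReliefF-style weighting can be sketched as follows. This is an illustrative reduction, not necessarily the implementation used in the talk: neighbors are ranked by Tanimoto similarity (the talk states that Tanimoto is used in ReliefF), every compound is visited once, and the names `relieff_weights` and `top_features` plus the toy data are assumptions. A substructure gains weight when it differs between a compound and its nearest neighbors of the other class, and loses weight when it differs from its nearest neighbors of the same class.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two equal-length bit lists."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0  # assumption: 0.0 for two empty fingerprints


def relieff_weights(fps, labels, n_neighbors):
    """Simplified two-class ReliefF: one weight in [-1, +1] per substructure."""
    n, n_feat = len(fps), len(fps[0])
    weights = [0.0] * n_feat
    for i in range(n):
        # Rank all other compounds from most to least similar.
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: tanimoto(fps[i], fps[j]), reverse=True)
        hits = [j for j in others if labels[j] == labels[i]][:n_neighbors]
        misses = [j for j in others if labels[j] != labels[i]][:n_neighbors]
        for f in range(n_feat):
            if hits:  # differing from same-class neighbors lowers the weight
                diff = sum(abs(fps[i][f] - fps[j][f]) for j in hits)
                weights[f] -= diff / (len(hits) * n)
            if misses:  # differing from other-class neighbors raises the weight
                diff = sum(abs(fps[i][f] - fps[j][f]) for j in misses)
                weights[f] += diff / (len(misses) * n)
    return weights


def top_features(weights, n_top):
    """Indices of the n_top highest-weighted substructures."""
    order = sorted(range(len(weights)), key=lambda f: weights[f], reverse=True)
    return order[:n_top]


# Toy data (hypothetical): bit 0 tracks the class label, bit 1 does not.
toy_fps = [[1, 0], [1, 1], [0, 0], [0, 1]]
toy_labels = ["A", "A", "I", "I"]
toy_weights = relieff_weights(toy_fps, toy_labels, n_neighbors=1)
```

On the toy data the class-tracking bit receives the highest possible weight and the other bit the lowest, which is the ranking that a top 5 and bottom 5 listing would be read from.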
K nearest neighbors (knn): predictive accuracy
[Figure: examples with k = 2 and k = 5; the compound being classified is marked "?"]
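A minimal k-nearest-neighbors classifier over fingerprints might look like the sketch below. It assumes Tanimoto similarity for ranking neighbors (as the talk states for knn) and a simple majority vote; the tie-breaking rule, the function name `knn_predict`, and the toy training set are assumptions made for illustration.

```python
from collections import Counter


def tanimoto(a, b):
    """Tanimoto similarity of two equal-length bit lists."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0  # assumption: 0.0 for two empty fingerprints


def knn_predict(query, train_fps, train_labels, k):
    """Label the query by majority vote among its k most similar training compounds."""
    ranked = sorted(
        range(len(train_fps)),
        key=lambda i: tanimoto(query, train_fps[i]),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    # Assumption: on a tied vote, the label seen first among the nearest neighbors wins.
    return votes.most_common(1)[0][0]


# Hypothetical training set: two active (A) and two inactive (I) compounds.
train_fps = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
train_labels = ["A", "A", "I", "I"]

print(knn_predict([1, 1, 0, 1], train_fps, train_labels, k=3))
```

With an even k such as 2, tied votes are possible in a two-class problem, so whichever tie-breaking rule is chosen affects the reported accuracy.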
Similarity between compounds
similarity between two fingerprints: Tanimoto coefficient
this is used twice: (1) in ReliefF, (2) in knn
$$T(a, b) = \frac{|a \cap b|}{|a \cup b|}$$
Example:
Compound a: 0 0 1
Compound b: 1 0 1
Tanimoto coefficient = 1 / 2 = 0.5 (one bit set in both compounds, two bits set in at least one)
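The worked example can be checked with a short function. This is a sketch for bit lists of equal length; the function name and the convention of returning 0.0 when neither fingerprint has any bit set are assumptions.

```python
def tanimoto(a, b):
    """Bits set in both fingerprints divided by bits set in at least one."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumption: define the similarity as 0.0 when no bit is set in either fingerprint.
    return both / either if either else 0.0


compound_a = [0, 0, 1]
compound_b = [1, 0, 1]
print(tanimoto(compound_a, compound_b))  # 0.5
```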
Cross-validation: predictive accuracy
compounds split into 10 subsets
test set: one of the subsets
training set: the remaining subsets
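The 10-subset scheme can be sketched as below. The random shuffle, the fixed seed, and the names `k_fold_splits` and `cross_validated_accuracy` are assumptions; the talk does not say how compounds were assigned to subsets. The `predict` argument stands for any classifier that takes a test item plus the training items and labels.

```python
import random


def k_fold_splits(n_items, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumption: random assignment to subsets
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = folds[f]
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        yield train, test


def cross_validated_accuracy(items, labels, predict, n_folds=10, seed=0):
    """Fraction of items labeled correctly when each fold is held out in turn."""
    correct = 0
    for train, test in k_fold_splits(len(items), n_folds, seed):
        train_x = [items[i] for i in train]
        train_y = [labels[i] for i in train]
        for i in test:
            if predict(items[i], train_x, train_y) == labels[i]:
                correct += 1
    return correct / len(items)


# With 400 compounds, each fold tests on 40 and trains on the other 360.
splits = list(k_fold_splits(400))
```

Every compound appears in exactly one test set, so the accuracy is computed over the whole collection without any compound being used to predict itself.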
Picking parameters for methods
which parameters produce the best predictive accuracies?
number of neighbors used in ReliefF {1, 2, 4, etc.}
number of neighbors used in knn {1, 2, 4, etc.}
number of ReliefF substructures used to predict classes in knn {1, 20, 100, etc.}
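One way to compare parameter combinations is an exhaustive grid search, sketched below. The grid lists only the values named on the slide (the full ranges are not given), and `toy_score` is a made-up stand-in used solely to make the example runnable; in practice the score would be the cross-validated predictive accuracy for that combination. All names here are hypothetical.

```python
from itertools import product


def grid_search(score, grid):
    """Try every combination in grid and return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(**params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Only the values listed on the slide; the remaining values are not specified.
grid = {
    "relieff_neighbors": [1, 2, 4],
    "knn_neighbors": [1, 2, 4],
    "n_substructures": [1, 20, 100],
}


def toy_score(relieff_neighbors, knn_neighbors, n_substructures):
    """Made-up stand-in for cross-validated accuracy; not real results."""
    return -abs(relieff_neighbors - 4) - abs(knn_neighbors - 2) - abs(n_substructures - 20) / 100


best_params, best_score = grid_search(toy_score, grid)
```

Because the three parameters interact, scoring the full product of values avoids tuning one parameter while the others sit at arbitrary defaults, at the cost of one cross-validation run per combination.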
Picking number of substructures
[Bar chart: predictive accuracy (y-axis, 0 to 1.0) versus number of substructures used to predict (x-axis categories: 1, 20, "userDs", all 2166)]
Group of substructures best able to predict
disposition of oxo- functionality in simple fragments
branch topologies of acyclic unsaturated fragments, including heteroatom dispositions
aryl- and heteroaryl branch topologies, including heteroatom disposition in ring systems
[Figures: example substructures for each of the three groups]
Future work
multi-class
different feature selection
Acknowledgements
Computational Chemical Biology
Joshua Gilbert
Paul Clemons
Hyman Carrinski
Summer Research Program in Genomics
Shawna Young
Lucia Vielma
Maura Silverstein