Predicting patterns of biological performance using chemical substructure features
Diego Borges-Rivera
08/04/08

Introduction

cheminformatics: lets us describe similarity between compounds computationally

synthetic chemists: describe similarity through visual inspection

we will describe compounds by the presence of chemical substructures

we will attempt to identify sets of substructures that predict biological performance

Previous work

Clemons/Kahne/Wagner et al.: disaccharide profiling in multiple cell states

found sets of substructures relevant to biological activity patterns

substructures highly specific to disaccharides

[Figure: axis labeled "substructures", ticks from 10 to 60]

Biological performance profile

400 compounds, 8 assays in duplicate

tested for cell proliferation in 8 different cell lines

class labels are active (A) or inactive (I)

[Figure: biological performance profile; legend: active compound]

What are fingerprints?

compound collection fed into commercial software

each substructure = 1 bit

the fingerprint shows which substructures are present

[Figure: example substructures #7017, #886, and #1725]
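
A minimal sketch of the "each substructure = 1 bit" idea. The three substructure IDs echo the slide, but which compound contains which substructure is made up here, and `make_fingerprint` is a hypothetical helper, not the commercial software's interface.

```python
# Substructure IDs taken from the slide; the ordering of bits is an assumption.
SUBSTRUCTURE_IDS = [886, 1725, 7017]


def make_fingerprint(present_ids, all_ids=SUBSTRUCTURE_IDS):
    """Return one bit per substructure: 1 if present in the compound, else 0."""
    present = set(present_ids)
    return [1 if sid in present else 0 for sid in all_ids]


# Hypothetical compound containing substructures #886 and #7017.
fingerprint = make_fingerprint({886, 7017})
print(fingerprint)  # [1, 0, 1]
```

With the real data each fingerprint would have one bit per substructure in the set rather than three.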

Overview of cheminformatic methods

produced fingerprints: 7700 total substructures

filtered the set

2166 substructures left

Overview of computational methods

two steps, independent of each other:

feature (substructure) selection to find predictive subsets

evaluate methods for predictive value

ReliefF: substructure selection

[Figure: each substructure gets a weight on a scale from -1 to +1 (2166 weights); the top 5 and bottom 5 substructures are drawn]
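
A rough sketch of how ReliefF-style weights could be computed for two-class bit fingerprints. It assumes every compound is visited once (rather than a random sample), that neighbors are ranked by Tanimoto similarity, and that ties are broken by input order. The function names and the toy data are illustrative, not the code used in this work.

```python
def tanimoto(a, b):
    """Shared bits divided by bits set in either fingerprint."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0  # assumption: 0 when no bits are set


def relieff_weights(fingerprints, labels, k=1):
    """Return one weight per substructure; higher means more predictive."""
    n = len(fingerprints)
    n_bits = len(fingerprints[0])
    weights = [0.0] * n_bits
    for i in range(n):
        # Rank all other compounds from most to least similar.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: tanimoto(fingerprints[i], fingerprints[j]),
            reverse=True,
        )
        hits = [j for j in others if labels[j] == labels[i]][:k]
        misses = [j for j in others if labels[j] != labels[i]][:k]
        for f in range(n_bits):
            # A bit that differs from a same-class neighbor is penalized.
            for j in hits:
                weights[f] -= abs(fingerprints[i][f] - fingerprints[j][f]) / (n * k)
            # A bit that differs from an other-class neighbor is rewarded.
            for j in misses:
                weights[f] += abs(fingerprints[i][f] - fingerprints[j][f]) / (n * k)
    return weights


# Toy data: bit 0 separates the classes perfectly, bit 1 is noise.
toy_fps = [[1, 0], [1, 1], [0, 0], [0, 1]]
toy_labels = ["A", "A", "I", "I"]
print(relieff_weights(toy_fps, toy_labels, k=1))  # [1.0, -1.0]
```

Sorting the weights and keeping the highest ones would give the "top" substructures; the number kept is a separate parameter.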

K nearest neighbors (knn): predictive accuracy

[Figure: examples with k = 2 and k = 5; the compound being classified is marked "?"]
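
A small sketch of knn classification over bit fingerprints, assuming Tanimoto similarity as the neighbor ranking and a simple majority vote. How vote ties are resolved is an assumption here (the label seen first among the nearest neighbors wins), and the training data below is invented.

```python
from collections import Counter


def tanimoto(a, b):
    """Shared bits divided by bits set in either fingerprint."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0  # assumption: 0 when no bits are set


def knn_predict(query, train_fps, train_labels, k=5):
    """Label the query by majority vote among its k most similar compounds."""
    ranked = sorted(
        range(len(train_fps)),
        key=lambda j: tanimoto(query, train_fps[j]),
        reverse=True,
    )
    votes = Counter(train_labels[j] for j in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented training set: two active (A) and two inactive (I) compounds.
train_fps = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
train_labels = ["A", "A", "I", "I"]
print(knn_predict([0, 0, 1], train_fps, train_labels, k=3))  # I
```

An odd k avoids vote ties when there are only two classes.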

Similarity between compounds

similarity between two fingerprints

Tanimoto coefficient

this is used twice: (1) in ReliefF, (2) in knn

$$T(a, b) = \frac{|a \cap b|}{|a \cup b|}$$

Example:
Compound a: 0 0 1
Compound b: 1 0 1
Tanimoto coefficient = 1 / 2 = 0.5
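
The worked example can be checked with a few lines of code. This is a sketch for bit lists of equal length; returning 0 when neither fingerprint has any bit set is an assumption, since the slide does not cover that case.

```python
def tanimoto(a, b):
    """Bits set in both fingerprints divided by bits set in either one."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0  # assumption for all-zero inputs


compound_a = [0, 0, 1]
compound_b = [1, 0, 1]
print(tanimoto(compound_a, compound_b))  # 0.5
```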

Cross-validation: predictive accuracy

10 subsets

test set: one of the subsets

training set: the remaining subsets

[Figure: test set and training set]
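
One way the 10-subset split could be sketched. Shuffling before splitting and the fixed seed are assumptions for reproducibility, and `cross_validation_splits` is an illustrative name.

```python
import random


def cross_validation_splits(n_items, n_folds=10, seed=0):
    """Yield (training indices, test indices), one pair per subset."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumption: random assignment to subsets
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for test in folds:
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


# 400 compounds, as in the performance profile.
for train, test in cross_validation_splits(400):
    print(len(train), len(test))  # prints "360 40" ten times
```

Each compound lands in the test set exactly once, so averaging accuracy over the 10 rounds uses every compound for evaluation.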

Picking parameters for methods

which parameters produce the best predictive accuracies?

number of neighbors used in ReliefF {1, 2, 4, etc.}

number of neighbors used in knn {1, 2, 4, etc.}

number of ReliefF substructures used to predict classes in knn {1, 20, 100, etc.}
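
A sketch of an exhaustive search over the three parameters, assuming some scoring function that returns cross-validated accuracy for one combination. `pick_parameters` and `toy_score` are hypothetical; the toy score is a stand-in so the example runs, not a result from this work, and only the values listed on the slide are used.

```python
from itertools import product


def pick_parameters(score, relieff_neighbors, knn_neighbors, n_substructures):
    """Try every combination and return (best parameters, best score)."""
    best_params, best_score = None, float("-inf")
    for params in product(relieff_neighbors, knn_neighbors, n_substructures):
        current = score(*params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


def toy_score(relieff_k, knn_k, n_sub):
    """Made-up stand-in for cross-validated accuracy; peaks at (2, 4, 20)."""
    return -abs(relieff_k - 2) - abs(knn_k - 4) - abs(n_sub - 20)


best, value = pick_parameters(toy_score, [1, 2, 4], [1, 2, 4], [1, 20, 100])
print(best)  # (2, 4, 20)
```

In practice `score` would run substructure selection and knn inside cross-validation for each combination.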

Picking number of substructures

[Bar chart: predictive accuracy (y-axis, 0.0 to 1.0) against the number of substructures used to predict (x-axis: 1, 20, all 2166)]

Group of substructures best able to predict

disposition of oxo- functionality in simple fragments
[Figure: example substructures]

branch topologies of acyclic unsaturated fragments, including heteroatom dispositions
[Figure: example substructures]

aryl- and heteroaryl branch topologies, including heteroatom disposition in ring systems
[Figure: example substructures]

Future work

multi-class prediction

different feature selection methods

Acknowledgements
Computational Chemical Biology
Joshua Gilbert
Paul Clemons
Hyman Carrinski
Summer Research Program in Genomics
Shawna Young
Lucia Vielma
Maura Silverstein