Predicting patterns of biological performance using chemical substructure features Diego Borges-Rivera 08/04/08 Introduction  cheminformatics – allow us to computationally describe similarity  synthetic chemists – describe through.

Predicting patterns of
biological performance using
chemical substructure features
Diego Borges-Rivera
08/04/08
Introduction

cheminformatics – lets us describe similarity computationally

synthetic chemists – describe similarity through visual inspection

we will describe compounds by the presence of chemical substructures

we will attempt to identify sets of substructures that predict biological performance
Previous work

Clemons/Kahne/Wagner et al. -- disaccharide profiling in multiple cell states

found sets of substructures relevant to biological activity patterns

substructures highly specific to disaccharides
[Figure: axis labeled "substructures", ticks 10 to 60]
Biological performance profile

400 compounds, 8 assays in duplicate

tested for cell proliferation in 8 different cell lines

class labels are active (A) or inactive (I)
[Figure label: active compound]
What are fingerprints?

compound collection fed into commercial software

each substructure = 1 bit

the fingerprint shows which substructures are present

[Figure: example structures labeled substructure #7017, substructure #886, and substructure #1725]
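To make "each substructure = 1 bit" concrete, here is a minimal sketch of a fingerprint as a list of 0/1 bits. The three substructure IDs echo the figure labels, but the helper name `to_fingerprint` and the example compound are hypothetical; the commercial software's actual output format is not given on the slides.

```python
# Hypothetical substructure dictionary. The IDs echo the slide's figure
# labels; the example compound below is invented for illustration.
SUBSTRUCTURE_IDS = [886, 1725, 7017]


def to_fingerprint(present_ids, all_ids=SUBSTRUCTURE_IDS):
    """Return one 0/1 bit per substructure in all_ids (1 = present)."""
    present = set(present_ids)
    return [1 if sid in present else 0 for sid in all_ids]


# A made-up compound containing substructures #886 and #7017 but not #1725.
fingerprint = to_fingerprint({886, 7017})
print(fingerprint)  # [1, 0, 1]
```

With the full dictionary, the same idea would give one bit per substructure for each of the 400 compounds.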
Overview of cheminformatic methods

produced fingerprints: 7700 total substructures

filtered set: left 2166 substructures
Overview of computational methods

two steps independent of each other

feature (substructure) selection to find predictive subsets

evaluate methods for predictive value
ReliefF: substructure selection
[Figure: ReliefF assigns one weight to each of the 2166 substructures on a scale from -1 to +1; structures of the top 5 and bottom 5 substructures are shown]
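The weighting idea can be sketched as follows. This is a simplified two-class Relief-style update over binary fingerprints, not the implementation used in the talk: the function names, the use of Tanimoto similarity to find neighbors, and the toy data are all assumptions made for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length 0/1 fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0  # assumption: empty vs empty = 0


def relieff_weights(fps, labels, k):
    """Simplified two-class ReliefF sketch: one weight per bit, in [-1, +1].

    A bit gains weight when it differs between a compound and its nearest
    neighbors of the other class (misses), and loses weight when it differs
    from its nearest neighbors of the same class (hits).
    """
    n, n_bits = len(fps), len(fps[0])
    weights = [0.0] * n_bits
    for i in range(n):
        # Other compounds, most similar first.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: tanimoto(fps[i], fps[j]), reverse=True)
        hits = [j for j in others if labels[j] == labels[i]][:k]
        misses = [j for j in others if labels[j] != labels[i]][:k]
        for group, sign in ((hits, -1.0), (misses, +1.0)):
            for j in group:
                step = sign / (n * len(group))
                for f in range(n_bits):
                    weights[f] += step * abs(fps[i][f] - fps[j][f])
    return weights


# Toy data: bit 0 separates the classes perfectly, bit 1 is noise.
toy_fps = [[1, 0], [1, 1], [0, 0], [0, 1]]
toy_labels = ["A", "A", "I", "I"]
toy_weights = relieff_weights(toy_fps, toy_labels, k=1)
print(toy_weights)
```

Sorting the weights and keeping the highest-ranked bits gives a candidate predictive subset, which is how a "top 5" list like the one in the figure could be read off.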
K nearest neighbors (knn): predictive accuracy

Examples: k = 2, 5
compound being classified = ?
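A minimal sketch of the classification step, assuming a majority vote among the k most similar training fingerprints, with similarity measured by the Tanimoto coefficient. The function names and toy fingerprints are hypothetical, and the tie-breaking behavior (a tie goes to the label seen first among the nearest neighbors) is an assumption rather than something stated on the slides.

```python
from collections import Counter


def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length 0/1 fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0  # assumption: empty vs empty = 0


def knn_predict(query, train_fps, train_labels, k):
    """Return the majority label among the k training compounds most similar to query."""
    ranked = sorted(range(len(train_fps)),
                    key=lambda i: tanimoto(query, train_fps[i]),
                    reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    # Assumption: on a tied vote, the label encountered first (nearest) wins.
    return votes.most_common(1)[0][0]


# Invented training set: two active (A) and two inactive (I) compounds.
train_fps = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
train_labels = ["A", "A", "I", "I"]

print(knn_predict([1, 1, 0, 1], train_fps, train_labels, k=3))  # A
print(knn_predict([0, 0, 1, 1], train_fps, train_labels, k=3))  # I
```

An even k such as 2 can produce tied votes, so the tie-breaking rule matters more there than for odd k.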
Similarity between compounds

similarity between two fingerprints

Tanimoto coefficient

this is used twice:
(1) in ReliefF
(2) in knn

$$T(a, b) = \frac{|a \cap b|}{|a \cup b|}$$

Example:
Compound a: 0 0 1
Compound b: 1 0 1
Tanimoto coefficient = 1 / 2 = 0.5
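The worked example can be checked with a short sketch: count the bits set in both fingerprints, then divide by the number of bits set in either. The function name and the handling of two all-zero fingerprints (returned as 0.0) are assumptions; the slides do not cover that edge case.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length 0/1 fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in a AND b
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in a OR b
    # Assumption: two all-zero fingerprints get similarity 0.0.
    return both / either if either else 0.0


compound_a = [0, 0, 1]
compound_b = [1, 0, 1]
print(tanimoto(compound_a, compound_b))  # 0.5
```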
Cross-validation: predictive accuracy

10 subsets

test set: one of the subsets

training set: the remaining subsets

[Figure: data split into a test set and a training set]
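The 10-subset scheme can be sketched as below. The function names, the random shuffling with a fixed seed, and the baseline predictor are assumptions for illustration; the slides do not say how compounds were assigned to subsets.

```python
import random


def k_fold_splits(n_items, n_folds=10, seed=0):
    """Yield (train_indices, test_indices), holding out each subset once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumption: random assignment
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


def cross_validated_accuracy(items, labels, predict, n_folds=10):
    """Fraction of items predicted correctly while held out of training."""
    correct = 0
    for train, test in k_fold_splits(len(items), n_folds):
        train_x = [items[i] for i in train]
        train_y = [labels[i] for i in train]
        correct += sum(predict(items[i], train_x, train_y) == labels[i]
                       for i in test)
    return correct / len(items)


# Hypothetical baseline: always predict "A", ignoring the training set.
def always_active(item, train_x, train_y):
    return "A"


demo_labels = ["A"] * 200 + ["I"] * 200
demo_items = list(range(400))
print(cross_validated_accuracy(demo_items, demo_labels, always_active))  # 0.5
```

Any classifier with the same `predict(item, train_x, train_y)` shape, such as a knn vote over fingerprints, could be dropped in place of the baseline.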
Picking parameters for methods

which parameters produce the best predictive accuracies?

number of neighbors used in ReliefF {1, 2, 4, etc.}

number of neighbors used in knn {1, 2, 4, etc.}

number of ReliefF substructures used to predict classes in knn {1, 20, 100, etc.}
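One way to search the three parameters is an exhaustive grid, sketched below. The function `best_parameters`, the parameter names, and the stand-in `toy_score` are all hypothetical; in practice the score would be a cross-validated predictive accuracy, and the slides do not state that a full grid was used.

```python
from itertools import product


def best_parameters(score, relieff_ks, knn_ks, n_substructures):
    """Try every combination of the three grids; return (best_score, params)."""
    best = None
    for relieff_k, knn_k, n_top in product(relieff_ks, knn_ks, n_substructures):
        s = score(relieff_k=relieff_k, knn_k=knn_k, n_top=n_top)
        if best is None or s > best[0]:
            best = (s, {"relieff_k": relieff_k, "knn_k": knn_k, "n_top": n_top})
    return best


# Stand-in score with an arbitrary peak, used only to exercise the search.
# A real score would run feature selection plus knn under cross-validation.
def toy_score(relieff_k, knn_k, n_top):
    return -abs(relieff_k - 2) - abs(knn_k - 4) - abs(n_top - 20)


best = best_parameters(toy_score,
                       relieff_ks=[1, 2, 4],
                       knn_ks=[1, 2, 4],
                       n_substructures=[1, 20, 100])
print(best)
```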
Picking number of substructures
[Bar chart: predictive accuracy (0.0 to 1.0) by number of substructures used to predict (1, 20, all 2166)]
Group of substructures best able to predict
disposition of oxo- functionality in simple fragments
[Figure: example substructures]

branch topologies of acyclic unsaturated fragments, including heteroatom dispositions
[Figure: example substructures]

aryl- and heteroaryl branch topologies, including heteroatom disposition in ring systems
[Figure: example substructures]
Future work

multi-class

different feature selection
Acknowledgements
Computational Chemical Biology
Joshua Gilbert
Paul Clemons
Hyman Carrinski
Summer Research Program in Genomics
Shawna Young
Lucia Vielma
Maura Silverstein
