E-Chemistry and Web 2.0 Marlon Pierce mpierce@cs.indiana.edu Community Grids Lab Indiana University One Talk, Two Projects  NIH funded Chemical Informatics and Cyberinfrastructure Collaboratory (CICC) @ IU.  Geoffrey Fox 


E-Chemistry and Web 2.0
Marlon Pierce
mpierce@cs.indiana.edu
Community Grids Lab
Indiana University
One Talk, Two Projects
• NIH funded Chemical Informatics and Cyberinfrastructure Collaboratory (CICC) @ IU.
  • Geoffrey Fox
  • Gary Wiggins
  • Rajarshi Guha
  • David Wild
  • Mookie Baik
  • Kevin Gilbert
  • And others
• Proposed Microsoft-Funded Project: E-Chemistry
  • Carl Lagoze (Cornell)
  • Lee Giles (PSU)
  • Steve Bryant (NIH)
  • Jeremy Frey (Soton)
  • Peter Murray-Rust (Cambridge)
  • Herbert Van de Sompel (Los Alamos)
  • Geoffrey Fox (Indiana)
  • And others
CICC Infrastructure Vision
• Chemical Informatics: drug discovery and other academic chemistry, pharmacology, and bioinformatics research will be aided by powerful, modern, open, information technology.
• NIH PubChem and PubMed provide unprecedented open, free data and information.
• We need a corresponding open service architecture (i.e. avoid stove-piped applications)
• CICC set up as distributed cyberinfrastructure in eScience model
• Web clients (user interfaces) to distributed databases, results of high throughput screening instruments, results of computational chemical simulations and other analyses.
• Composed of clients to open service APIs (mash-ups)
• Aggregated into portals
• Web services manipulate this data and are combined into workflows.
• So our main agenda items: create interesting databases and build lots of Web services and clients.
CICC Databases
Most of our databases aim to add value to
PubChem or link into PubChem
1D (SMILES) and 2D structures
3D structures (MMFF94)
Searchable by CID, SMARTS, 3D similarity
Docked ligands (FRED, Autodock)
906K drug-like compounds into 7 ligands
Will eventually cover ~2000 targets
Philosophy: we have big computers, so let’s
calculate everything ahead of time and put the
results in a DB.
Building Up the Infrastructure
Our SOA philosophy: use standard Web services.
Mostly stateless
Some cluster, HPC work needed but these populate
databases
Services are aggregate-able into different
workflows.
Taverna, Pipeline Pilot, …
You can also build lots of Web clients.
See http://www.chembiogrid.org/wiki/index.php/CICC_Web_Resources for links and details.
Not so far from Web 2.0….
Sample Services
Type | Service | Functionality | Source | License
Database | Docking | Provides access to the results of docking a subset of PubChem into a set of ligands. Searchable by 2D structure and docking score | Indiana University | Freely accessible
Database | 3D Structure | Provides access to 3D structure generated for most of PubChem | Indiana University | Freely accessible
Cheminformatics | OSCAR3 | Extract chemical structures from text | Cambridge University | Freely accessible
Cheminformatics | InChiGoogle | Uses Google to search for an InChI | Cambridge University | Freely accessible
Cheminformatics | CMLRSSServer | Generates a CMLRSS feed from CML data | Cambridge University | Freely accessible
Cheminformatics | OpenBabel | Converts chemical file formats | Cambridge University | Freely accessible
Type | Service | Functionality | Source | License
Cheminformatics | ToxTreeServer | Obtains toxicity hazard predictions | Indiana University & European Chemical Bureau | Freely accessible
Cheminformatics | DBUtil | Generates 166 bit MACCS keys | Indiana University & gNova Consulting | Freely accessible
Cheminformatics | Molecular Similarity | Evaluates 2D/3D similarity and evaluates distance moments for 3D similarity calculations | Indiana University & CDK | Freely accessible
Cheminformatics | Molecular Descriptors | Generates various descriptors including TPSA, XLogP, surface areas | Indiana University & CDK | Freely accessible
Cheminformatics | 2D Structure Diagrams | Generates 2D structure diagrams from SMILES | Indiana University & CDK | Freely accessible
Cheminformatics | Druglikeness Methods | Evaluates measures of druglikeness | Indiana University & CDK | Freely accessible
Cheminformatics | Utility Methods | Generates hashed fingerprints, 2D coordinate generation etc. | Indiana University & CDK | Freely accessible
Type | Service | Functionality | Source | License
Statistics | Sampling Distributions | Samples from several distributions (normal, uniform, Weibull etc) | Indiana University | Freely accessible
Statistics | Linear Regression | Builds linear regression models | Indiana University | Freely accessible
Statistics | CNN Regression | Builds neural network regression models | Indiana University | Freely accessible
Statistics | RF Regression | Builds random forest regression models | Indiana University | Freely accessible
Statistics | LDA | Builds linear discriminant analysis models | Indiana University | Freely accessible
Statistics | K-Means | Performs K-means clustering | Indiana University | Freely accessible
Statistics | Feature Selection | Performs feature selection using stepwise regression | Indiana University | Freely accessible
Statistics | XY Plots | Generates 2D scatter plots | Indiana University | Freely accessible
Statistics | Histogram Plots | Generates histograms | Indiana University | Freely accessible
Type | Service | Functionality | Source | License
Data Exchange | TabToVOTables | Converts tab delimited files to VOTables | Indiana University | Freely accessible
Data Exchange | VOTablesToTab | Converts VOTables to tab delimited files | Indiana University | Freely accessible
Data Exchange | VOTablesToXLS | Converts VOTables to Excel spreadsheet | Indiana University | Freely accessible
Data Exchange | VOTable Retrieve | Retrieves field names and data types from a VOTables document | Indiana University | Freely accessible
Data Exchange | VOTableExtract | Extracts columns from a VOTables document | Indiana University | Freely accessible
Computational Chemistry | Varuna File Format | Handles file formats for QM/MM packages | Indiana University | Freely accessible
Computational Chemistry | Varuna Analysis | Performs analysis of results from Jaguar and ADF | Indiana University | Freely accessible
Computational Chemistry | Varuna Query | Searches the Varuna database | Indiana University | Freely accessible
Computational Chemistry | Varuna Submit | Submits input data for calculation on a local cluster | Indiana University | Freely accessible
Type | Service | Functionality | Source | License
Application | Fred | Performs docking | Openeye Software | Commercial
Application | Filter | Property calculation and filtering | Openeye Software | Commercial
Application | Omega | Generates 3D conformers | Openeye Software | Commercial
Application | BCI Fingerprint | Generates 1052 BCI structural keys | Digital Chemistry | Commercial
Application | BCI Clustering | Performs divisive k-means clustering | Digital Chemistry | Commercial
Application | PkCell | Evaluates pharmacokinetic parameters for druglike molecules | Indiana University & University of Michigan | Freely accessible
Application | Scripps MLSCN Toxicity | Gets toxicity predictions for RF models built using MLSCN cell-line data | Indiana University & Scripps, FL. | Freely accessible
Application | NTP DTP Anti-cancer activity | Gets anti-cancer activity predictions for the 60 NCI cell lines | Indiana University | Freely accessible
Application | Ames Mutagenicity | Gets mutagenicity predictions | Indiana University | Freely accessible
Web Client Interfaces
Name | Functionality | Type | Links
PubDock | Interface to the docking database | Web | http://www.chembiogrid.org/cheminfo/dock/
Pub3D | Interface to the 3D structure database | Web | http://www.chembiogrid.org/cheminfo/p3d/
Frequent Hitters | Identify compounds that occur in multiple assays, with links to individual assays | Web | http://www.chembiogrid.org/cheminfo/freqhit/fh
MLSCN Toxicity Predictions | Predict whether a compound will be toxic or not | Web and Pipeline Pilot | http://www.chembiogrid.org/cheminfo/rws/scripps
ToxTree | Predict toxicity hazard class | Web | http://cheminfo.informatics.indiana.edu/~rguha/code/java/cdkws/cdkws.html#tox
DTP AntiCancer Predictions | Predict whether a compound exhibits anti-cancer activity against the 60 NCI cell lines | Web | http://www.chembiogrid.org/cheminfo/ncidtp/dtp
More Clients…
Name | Functionality | Type | Links
Ames Mutagenicity Predictions | Predict whether a compound is mutagenic or not in the Ames test | Web | http://www.chembiogrid.org/cheminfo/rws/ames
PkCell | Evaluate pharmacokinetic parameters | Web | http://www.chembiogrid.org/cheminfo/pkcell/
Kemo | Natural language interface to PubChem | Web | http://cheminfo.informatics.indiana.edu:8080/kemo/
RSS Feeds | Generate RSS feeds for various PubChem related queries | Web and RSS feed | http://www.chembiogrid.org/cheminfo/rssint.html
Statistical Model Download | Download statistical models as R binary files | Web | http://www.chembiogrid.org/cheminfo/rws/mlist
Cheminformatics | Miscellaneous functions such as structure diagrams, similarity etc. | Web | http://cheminfo.informatics.indiana.edu/~rguha/code/java/cdkws/cdkws.html
More Clients…
Name | Functionality | Type | Links
Varuna | File operations and result analysis | Web | http://129.79.139.29/filecon/Default.aspx and http://129.79.139.29/utilityclient/Default.aspx
VOTables | Plotting data using VOTables as well as using Excel files via VOTables | Web | http://gf1.ucs.indiana.edu:9080/axis/VOTables.html and http://www.chembiogrid.org/cheminfo/rws/xlsvor
PubChemSR | .Net interface to PubChem | Desktop application | http://darwin.informatics.indiana.edu/juhur/Tools/PubChemSR/
rpubchem and rcdk | R packages to interface with the CDK and access PubChem | Desktop application | http://cran.r-project.org/src/contrib/Descriptions/rcdk.html and http://cran.r-project.org/src/contrib/Descriptions/rpubchem.html
Chimera plugin | A plugin to allow Chimera to utilize the PubDock database | Desktop application (requires Chimera) | http://poincare.uits.iupui.edu/~heiland/cicc/code/
PubChem 3D View | A Greasemonkey script that shows 3D structures when viewing PubChem pages | Web (requires Firefox and Greasemonkey) | http://rna.informatics.indiana.edu/hgopalak/3DStructView.user.js
Example: PubDock
• Database of approximately 1 million PubChem structures (the most drug-like) docked into proteins taken from the PDB
• Available as a web service, so structures can be accessed in your own programs, or using workflow tools like Pipeline Pilot
• Several interfaces developed, including one based on Chimera (right) which integrates the database with the PDB to allow browsing of compounds in different targets, or different compounds in the same target
• Can be used as a tool to help understand molecular basis of activity in cellular or image based assays
Example: R Statistics applied to PubChem data
• By exposing the R statistical package, and the Chemistry Development Kit (CDK) toolkit as web services and integrating them with PubChem, we can quickly and easily perform statistical analysis and virtual screening of PubChem assay data
• Predictive models for particular screens are exposed as web services, and can be used either as simple web tools or integrated into other applications
• Example uses DTP Tumor Cell Line screens - a predictive model using Random Forests in R makes predictions of probability of activity across multiple cell lines.
Example assay screening workflow: finding cell-protein relationships
A protein implicated in tumor growth with known ligand is selected (in this case HSP90 taken from the PDB 1Y4 complex)
The screening data from a cellular HTS assay is similarity searched for compounds with similar 2D structures to the ligand.
Similar structures to the ligand can be browsed using client portlets.
Similar structures are filtered for drugability, are converted to 3D, and are automatically passed to the OpenEye FRED docking program for docking into the target protein.
Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the JMOL applet.
Docking results and activity patterns fed into R services for building of activity models and correlations (Least Squares Regression, Random Forests, Neural Nets)
Relevance to Web 2.0
Some Web 2.0 Key Features
REST Services
Use of RSS/Atom feeds
Client interfaces are “mashups”
Gadgets, widgets for portals aggregate clients
So…
We provide RSS as an alternative WS format.
We have experimented with RSS feeds, using Yahoo
Pipes to manipulate multiple feeds.
CICC Web interfaces can be easily wrapped as
universal gadgets in iGoogle, Netvibes.
Alternative to classic science gateways.
RSS Feeds/REST Services
Provide access to DB's via RSS feeds
Feeds include 2D/3D structures in CML
Viewable in Bioclipse, Jmol as well as Sage etc.
Two feeds currently available
SynSearch – get structures based on full or partial
chemical names
DockSearch – get best N structures for a target
Really hampered by size of DB and Postgres
performance.
Tools and mashups based on web service
infrastructure
http://www.chembiogrid.org/projects/proj_tools.html
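As a rough illustration of how a client might consume a feed that carries structures in CML, the sketch below parses an RSS document with the Python standard library. The feed layout, the placement of the CML `molecule` element inside each `item`, and the sample data are assumptions made for this example; they are not taken from the actual SynSearch or DockSearch feeds.

```python
import xml.etree.ElementTree as ET

# Namespace used by CML documents.
CML_NS = "http://www.xml-cml.org/schema"

# Hypothetical feed: one <item> per hit, with the structure embedded as CML.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example structure feed</title>
    <item>
      <title>water</title>
      <molecule xmlns="http://www.xml-cml.org/schema" id="m1">
        <atomArray>
          <atom id="a1" elementType="O"/>
          <atom id="a2" elementType="H"/>
          <atom id="a3" elementType="H"/>
        </atomArray>
      </molecule>
    </item>
  </channel>
</rss>"""


def read_structures(feed_xml):
    """Return (title, [element symbols]) for every item in the feed."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        atoms = [
            atom.get("elementType")
            for atom in item.iter("{%s}atom" % CML_NS)
        ]
        results.append((title, atoms))
    return results


structures = read_structures(SAMPLE_FEED)
print(structures)
```

A real client would fetch the feed over HTTP and hand the CML fragment to a viewer such as Jmol rather than listing element symbols.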
Mining information from journal articles
• Until now SciFinder / CAS only chemistry-aware portal into journal information
• We can access full text of journal articles online (with subscription)
• ACS does not make full text available … but there are ways round that!
• RSC is now marking up with SMILES and GO/Goldbook terms!
  • www.projectprospect.org
• Having SMILES or InChI means that we can build a similarity/structure searchable database of papers: e.g. “find me all the papers published since 2000 which contain a structure with >90% similarity to this one”
• In the absence of full text, we can at least use the abstract
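The similarity query quoted above can be sketched in a few lines. The slides do not say which similarity measure is used; this example assumes the Tanimoto coefficient over fingerprint bits, a common choice in cheminformatics. The paper records, DOIs, and fingerprints are invented for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of 'on' bits."""
    if not bits_a and not bits_b:
        return 0.0
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def similar_papers(query_bits, papers, threshold=0.9, since=2000):
    """Return DOIs of papers from `since` onward with a structure above the threshold."""
    hits = []
    for paper in papers:
        if paper["year"] < since:
            continue
        if any(tanimoto(query_bits, fp) > threshold for fp in paper["fingerprints"]):
            hits.append(paper["doi"])
    return hits


# Hypothetical index: each paper lists the fingerprints of its structures.
PAPERS = [
    {"doi": "10.0000/example.1", "year": 2005, "fingerprints": [set(range(0, 20))]},
    {"doi": "10.0000/example.2", "year": 2006, "fingerprints": [set(range(10, 30))]},
    {"doi": "10.0000/example.3", "year": 1998, "fingerprints": [set(range(0, 20))]},
]

query = set(range(0, 19))
hits = similar_papers(query, PAPERS)
print(hits)
```

In a database-backed service the threshold test would normally run inside the database (for example through a chemistry cartridge) rather than in client code.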
Text Mining: OSCAR
• A tool for shallow, chemistry-specific natural language parsing of chemical documents (e.g. journal articles).
• It identifies (or attempts to identify):
  • Chemical names: singular nouns, plurals, verbs etc., also formulae and acronyms.
  • Chemical data: Spectra, melting/boiling point, yield etc. in experimental sections.
  • Other entities: Things like N(5)-C(3) and so on.
• Part of the larger SciBorg effort
  • See http://www.cl.cam.ac.uk/~aac10/escience/sciborg.html
• http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3
Mash-Up: What published compounds might bind to this protein?
Create a database containing the text of all recent PubMed abstracts (2006-2007 = ~500,000)
Use OSCAR to extract all of the chemical names referred to in the abstracts and convert to SMILES
DATABASE SERVICE + DOCKING SERVICE
Convert molecules to 3D and dock into a protein of interest
Visualize top docked molecules in a Google-like interface
E-Chemistry and Digital Libraries
We can’t wait to get started….
E-Chemistry and Digital Libraries
Key problem with our SOA-based e-Science is
information management.
Where is the service that I need?
What does it do?
We may consider our data-centric services to be
digital libraries.
Data is diverse
Documents
Not just computational information like structures.
Another point of view: how can I link together
publications, results, workflows, etc?
That is, I need to manage digital documents.
Digital Libraries
• Open Archives Initiative Object Reuse and Exchange Project (OAI-ORE)
• Developing standardized, interoperable, and machine-readable mechanisms to express information about compound information objects on the web.
• Graph-based representations of connected digital objects.
• Objects may be encoded in (for example) RDF or XML
• Retrievable via repositories with REST service interfaces (cf. Atom Publishing Protocol)
• Obtain, harvest, and register
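To make the idea of a graph-based compound object concrete, here is a minimal sketch that links a paper, a workflow, and a result file into one aggregation using subject-predicate-object triples. The predicate names `describes` and `aggregates` loosely follow ORE terminology, every URI is invented, and the dictionary layout is an assumption for this example rather than an OAI-ORE serialization.

```python
import json

# All URIs below are made up for illustration.
AGGREGATION = "http://example.org/aggregations/study-1"
RESOURCE_MAP = "http://example.org/rem/study-1"

triples = [
    (RESOURCE_MAP, "describes", AGGREGATION),
    (AGGREGATION, "aggregates", "http://example.org/papers/conference-paper.pdf"),
    (AGGREGATION, "aggregates", "http://example.org/workflows/docking-workflow.xml"),
    (AGGREGATION, "aggregates", "http://example.org/data/docking-results.cml"),
]


def to_graph(triples):
    """Group triples into {subject: {predicate: [objects]}}."""
    graph = {}
    for subject, predicate, obj in triples:
        graph.setdefault(subject, {}).setdefault(predicate, []).append(obj)
    return graph


def aggregated_resources(graph, aggregation):
    """List the resources that belong to one compound object."""
    return graph.get(aggregation, {}).get("aggregates", [])


graph = to_graph(triples)
members = aggregated_resources(graph, AGGREGATION)
print(json.dumps(graph, indent=2))
```

A repository could return such a description from a REST endpoint, and a harvester could follow the aggregated URIs to retrieve each part.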
Challenges for E-Chemistry
Can digital library principles be applied to data as well as documents?
Can you link your workflow to your conference paper?
Can we engineer a publishing framework and message formats around Web 2.0 principles?
REST, Atom Publishing Protocol, Atom Syndication Format, JSON, Microformats
Can we do this securely?
Access control, provenance, identity federation are key problems.
Institution | Project Focus
Cambridge | Retrospective Data Extraction; Searching and Indexing; Data Models/Ontologies; Tools and Applications
Cornell | Data Models; Interoperability infrastructure; Project Management; Publicity and outreach
Indiana | Infrastructure Integration; Trust and Provenance; Tools and Applications
LANL | Data Models; Interoperability infrastructure
PubChem | Chemical Structure Archive; Results of Experimental Biological Activity Testing; Cross References to BioMedical Databases
Penn State | Retrospective Data Extraction; Searching and Indexing; Analysis
Southampton | Prospective & Retrospective Data Provision; Tools and Applications; In-process capture of eChemistry data; Data Linking - in analysis and publication
More Information
Project Web Site: www.chembiogrid.org
Project Wiki: www.chembiogrid.org/wiki
Contact me: mpierce@cs.indiana.edu
CICC
Chemical Informatics and Cyberinfrastructure Collaboratory
Funded by the National Institutes of Health
www.chembiogrid.org
CICC Combines Grid Computing with Chemical Informatics
Large Scale Computing Challenges
Chemical informatics is a non-traditional area of high performance computing, but many new, challenging problems may be investigated.
Chemical informatics text analysis programs can process 100,000’s of abstracts of online journal articles to extract chemical signatures of potential drugs.
[Diagram boxes: NIH PubMed DataBase; OSCAR Text Analysis; Initial 3D Structure Calculation; Molecular Mechanics Calculations; Cluster Grouping; Toxicity Filtering]
Science and Cyberinfrastructure
CICC is an NIH funded project to support chemical informatics needs of High Throughput Cancer Screening Centers. The NIH is creating a data deluge of publicly available data on potential new drugs.
OSCAR-mined molecular signatures can be clustered, filtered for toxicity, and docked onto larger proteins. These are classic “pleasingly parallel” tasks. Top-ranking docked molecules can be further examined for drug potential.
[Diagram boxes: Docking; Quantum Mechanics Calculations; NIH PubChem DataBase; POVRay Parallel Rendering; IU’s Varuna DataBase]
Big Red (and the TeraGrid) will
also enable us to perform time
consuming, multi-stepped
Quantum Chemistry
calculations on all of PubMed.
Results go back to public
databases that are freely
accessible by the scientific
community.
CICC supports the NIH mission by combining state of
the art chemical informatics techniques with
• World class high performance computing
• National-scale computing resources (TeraGrid)
• Internet-standard web services
• International activities for service orchestration
• Open distributed computing infrastructure for scientists
world wide
Indiana University Department of Chemistry, School of Informatics, and Pervasive Technology Laboratories
MLSCN Post-HTS Biology Decision
Support
Percent Inhibition or
IC50 data is retrieved
from HTS
Question: Was this
screen successful?
Workflows encoding plate
& control well statistics,
distribution analysis, etc
Question: What should the
active/inactive cutoffs be?
Workflows encoding
distribution analysis of
screening results
Question: What can we learn
about the target protein or cell
line from this screen?
Workflows encoding
statistical comparison of
results to similar screens,
docking of compounds
into proteins to correlate
binding, with activity,
literature search of active
compounds, etc
Compounds submitted to
PubChem
PROCESS
CHEMINFORMATICS
Grids can link data analysis (e.g. image processing developed in existing Grids), traditional Cheminformatics tools, as well as annotation tools (Semantic Web, del.icio.us) and enhance lead ID and SAR analysis
A Grid of Grids linking
collections of services at
PubChem
ECCR centers
MLSCN centers
GRIDS
R Web Services
Why?
Need access to math and stat
functionality
Did not want to recode algorithms
Wanted latest methods
Needed a distributed approach to
computation
Keep computation on a powerful machine
Access it from a smaller machine
Why R?
Free, open-source
Many cutting edge methods available
Flexible programming language
Interfaces with many languages
Python
Perl
Java
C
The R Server
R can be run as a remote compute
server
Requires the Rserve package
Allows authenticated access over
TCP/IP
Connections can maintain state
Client libraries for Java & C
R as a Web Service
On its own the R server is not a web
service
We provide Java frontends to specific
functionalities
The frontend classes are hosted in a
Tomcat web container
Accessible via SOAP
Full Javadocs for all available WS’s
Flowchart
Functionality
Two classes of functionality
General functions
Allows you to supply data and build a
predictive model
Sample from various distributions
Obtain scatter plots and histograms
Model development functions use a Java frontend to encapsulate model specific information
Functionality
Two classes of functionality
Model deployment
Allows you to build a model outside of the
infrastructure
Place the final model in the infrastructure
Becomes available as a web service
Each model deployed requires its own front
end class
In general, these classes are identical - could
be autogenerated
Available Functionality
Predictive models - OLS, RF, CNN, LDA
Clustering - k-means
Statistical distributions
XY plot and scatter plots
Model deployment for single model
types and ensemble model types
Deployed Models
Since deployed models are visible as
web services we can build a simple web
front end for them
Examples
NCI anti-cancer predictions
Ames mutagenicity predictions
Applications
The R WS is not restricted to ‘atomic’
functionality
Can write a whole R program
Load it on the R compute server
Provide a Java WS frontend
Examples
Feature selection
Automated model generation
Pharmacokinetic parameter calculation
Data Input/Output
Most modeling applications require data
matrices
Depending on client language we can
use
SOAP array of arrays (2D matrices)
SOAP array (1D vector form of a 2D
matrix)
VOTables
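The "1D vector form of a 2D matrix" option can be illustrated with a small round trip. This sketch assumes row-major ordering and that the column count travels alongside the vector; the slides do not specify either detail, and R itself stores matrices column by column, so a real service would have to document its convention.

```python
def flatten(matrix):
    """Row-major 1D form of a rectangular 2D matrix, plus its column count."""
    ncol = len(matrix[0]) if matrix else 0
    if any(len(row) != ncol for row in matrix):
        raise ValueError("matrix rows must all have the same length")
    return [value for row in matrix for value in row], ncol


def unflatten(vector, ncol):
    """Rebuild the 2D matrix from its row-major 1D form."""
    if ncol <= 0 or len(vector) % ncol != 0:
        raise ValueError("vector length must be a positive multiple of ncol")
    return [vector[i:i + ncol] for i in range(0, len(vector), ncol)]


descriptors = [[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]]
vector, ncol = flatten(descriptors)
restored = unflatten(vector, ncol)
print(vector, ncol)
```

The flat form is useful for client languages whose SOAP toolkits handle simple arrays more reliably than nested ones.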
Data Input/Output
Some R web services can take a URL
to a VOTables document
Conversion to R or Java matrices is done
by a local VOTables Java library
R also has basic support for VOTables
directly
Ignores binary data streams
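For readers unfamiliar with VOTables, the sketch below shows roughly what a tab-delimited-to-VOTable conversion involves. It is not the TabToVOTables service or the local Java library; it is a standard-library approximation that declares every column as text and omits the XML namespace and metadata a full VOTable document would carry.

```python
import csv
import io
import xml.etree.ElementTree as ET


def tab_to_votable(tab_text):
    """Convert tab-delimited text (first row = column names) to a VOTable string."""
    rows = list(csv.reader(io.StringIO(tab_text), delimiter="\t"))
    header, records = rows[0], rows[1:]

    votable = ET.Element("VOTABLE", version="1.1")
    table = ET.SubElement(ET.SubElement(votable, "RESOURCE"), "TABLE")
    for name in header:
        # Every column is declared as variable-length text to keep the sketch simple.
        ET.SubElement(table, "FIELD", name=name, datatype="char", arraysize="*")
    tabledata = ET.SubElement(ET.SubElement(table, "DATA"), "TABLEDATA")
    for record in records:
        tr = ET.SubElement(tabledata, "TR")
        for value in record:
            ET.SubElement(tr, "TD").text = value
    return ET.tostring(votable, encoding="unicode")


def field_names(votable_xml):
    """Return the FIELD names of a VOTable document, in column order."""
    return [f.get("name") for f in ET.fromstring(votable_xml).iter("FIELD")]


sample = "cid\tsmiles\n2244\tCC(=O)OC1=CC=CC=C1C(=O)O\n"
votable_xml = tab_to_votable(sample)
print(field_names(votable_xml))
```

The `field_names` helper mirrors, in spirit, what the VOTable Retrieve service is described as doing for field names.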
Interacting With R WS’s
Traditional WS’s do not maintain state
Predictive models are different
A model is built at one time
May be used for prediction at another time
Need to maintain state
State is maintained by serialization to R
binary files on the compute server
Clients deal with model ID’s
Interacting with R WS’s
Protocol
Send data to model WS
Get back model ID
Get various information via model ID
Fitted values
Training statistics
New predictions
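The model-ID protocol can be mimicked in miniature as follows. This is a sketch of the pattern only: it uses Python's `pickle` in place of R binary files, a hand-written least-squares line in place of the R modeling functions, and a hypothetical `ModelStore` class in place of the Java web service frontend.

```python
import os
import pickle
import tempfile
import uuid


class ModelStore:
    """Keeps fitted models on disk so later calls can refer to them by ID."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, model_id):
        return os.path.join(self.directory, model_id + ".bin")

    def build(self, xs, ys):
        """Fit y = a + b*x by least squares, save the model, and return its ID."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        model = {"intercept": mean_y - slope * mean_x, "slope": slope}
        model["fitted"] = [model["intercept"] + slope * x for x in xs]
        model_id = uuid.uuid4().hex
        with open(self._path(model_id), "wb") as handle:
            pickle.dump(model, handle)
        return model_id

    def _load(self, model_id):
        with open(self._path(model_id), "rb") as handle:
            return pickle.load(handle)

    def fitted_values(self, model_id):
        """Fitted values for the training data of a stored model."""
        return self._load(model_id)["fitted"]

    def predict(self, model_id, new_xs):
        """New predictions from a stored model, looked up by ID."""
        model = self._load(model_id)
        return [model["intercept"] + model["slope"] * x for x in new_xs]


store = ModelStore(tempfile.mkdtemp())
model_id = store.build([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
predictions = store.predict(model_id, [4.0])
print(model_id, predictions)
```

Because the model lives on the server side, the client only needs to remember a short string between the build call and any later prediction call.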
Cheminformatics at Indiana
University School of Informatics
David J. Wild
[email protected]
Associate Director of Chemical Informatics &
Assistant Professor
Indiana University School of Informatics,
Bloomington
http://djwild.info
Cheminformatics education at Indiana
• M.S. in Chemical Informatics
  • 2 years, 36 semester hours
  • Includes a 6-hour capstone / research project
  • Opportunity to work in Laboratory Informatics (IUPUI) or closely with Bioinformatics (IUB)
  • Currently 9 students enrolled
• Ph.D. in Informatics, Cheminformatics Specialty
  • 90 credit hours, including 30 hours dissertation research. Usually 4 years.
  • Research rotations expose students to research in related areas
  • Currently 4 students enrolled
• Graduate Certificate
  • 4 courses, all available by Distance Education
Distance Education for
Cheminformatics
Uses Breeze + teleconference for live sharing
of classes: all that is required is a P.C. and a
telephone. Optional Polycom
videoconferencing.
Lectures are recorded for easy playback
through a web browser
Wiki or similar webpage for dissemination of
course materials
Also participate in CIC courseshare to give
class at University of Michigan
Of 75 students taking our courses since fall …
Current research in the Wild
lab
Integration of cheminformatics tools and data
sources
A web service infrastructure for cheminformatics
Compound information & aggregation web service
and interface (“by the way box”)
An enhanced chatbot for exploiting chemical information & web services
Semantically-aware workflow tools for cheminformatics
Data mining the NIH DTP tumor cell line database
PubDock: a docking database for PubChem
Current research in the Guha
lab
Predictive Modeling
Interpretation, validation, domain applicability
Generalization to other ‘models’ such as docking,
pharmacophore etc
Integration of multiple data types
Addressing imbalanced and noisy datasets
Analysis of Chemical Spaces
Quantify distributions in spaces
Investigation of density approaches
Applications to lead hopping, model domains
Methods to summarize & compare data
Applications to HTS and smaller lead series type
Cheminformatics web service infrastructure
Cheminformatics services
Docking (FRED)
3D structure generation (OMEGA)
Filtering (FRED, etc)
OSCAR3
Fingerprints (BCI, CDK)
Clustering (BCI)
Toxicity prediction (ToxTree)
R-based predictive models
Similarity calculations (CDK)
Descriptor calculation (CDK)
2D structure diagrams (CDK)
Database Services
PostgreSQL + gNova
PubChem mirror (augmented)
Pub3D - 3D structures for PubChem
PubDock - Bound 3D structures
Compound-indexed journal article DB
NIH Human Tumor Cell Line
Local PubChem mirror
Xiao Dong, Kevin E. Gilbert, Rajarshi Guha, Randy Heiland, Jungkee Kim, Marlon E. Pierce, Geoffrey C. Fox and David J. Wild, Web service infrastructure for chemoinformatics, Journal of Chemical Information and Modeling, 2007; 47(4) pp 1303-1307
RSC Project Prospect - what
can we do with the
information?
www.projectprospect.org
>100 papers marked up with SMILES/InChI
(using OSCAR3), plus Gene Ontology and
Goldbook Ontology terms
Created similarity searchable PostgreSQL /
gNova database with paper DOIs, SMILES,
and ontology terms
Web service and simple HTML interfaces for
searching … “which papers reference
compounds similar to this one in the scope of
these ontological terms?”
Greasemonkey / OSCAR
script
http://cheminfo.informatics.indiana.edu:8080/ChemGM/index.jsp
By the way…
By the way… annotation
(mock-up!)
This compound is very similar to a
prescription drug, Tamoxifen.
This compound is referenced in 20 journal
articles published in the last 5 years
Similar compounds are associated with the
words “toxic” and “death” in 280 web pages
It appears to be covered under 3 patents
It has been shown to be active in 5 screens
Computer models predict it to show some
activity against 8 protein targets
Here are some comments on this
compound:
David Wild: don’t take any notice of the
computational models - they are rubbish
Cheminformatics aware simple lab notebook (mock up!)
Plug-in allows structures to be drawn with the pen and cleaned up
Free text input can be converted to machine readable form by electrovaya ….
Web service interface provides access to computation and searching. Page is marked up by what is possible
Automatic detection of data fields (yield, etc) where possible
[Notebook page mock-up: heading “Some useful chemical reactions”; Iodoacetate and Iodoacetamide (I-CH2COO-, ICH2CONH2); hand-drawn reaction scheme; button “FIND INFO ABOUT THIS REACTION”; note “This may also react, chem favored by alkaline pH”]
Automatic workflow generation and natural language queries
Develop service ontology using OWL-S or similar language
Allows service interoperability, replacement and input/output compatibility
We can then use generic reasoning and network analysis tools to find paths from inputs to desired outputs
Natural language can be parsed to inputs and desired outputs
Smart Clients <--> Agents <--> Services
Possible “supercharged life science Google?”
[Diagram: services (2D similarity, 2D -> 3D, structure crawler, P’phore search, dock, 3D search) connected by typed inputs and outputs (2D structures, 3D structures, 3D structures & complexes, 3D protein structure, result), with ontology statements “2D structures are compounds”, “3D structures are compounds”, “dock = bind”]
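The idea of finding paths from inputs to desired outputs can be sketched as a graph search over service descriptions. The example below assumes each service takes one input type and produces one output type, which is a simplification (docking, for instance, also needs a protein structure), and uses a plain dictionary where the slide proposes an OWL-S style ontology. The type labels are illustrative.

```python
from collections import deque

# Hypothetical service descriptions: name -> (input type, output type).
SERVICES = {
    "2D similarity": ("2D structure", "2D structures"),
    "2D -> 3D": ("2D structures", "3D structures"),
    "dock": ("3D structures", "3D structures & complexes"),
    "P'phore search": ("3D structures", "3D structures"),
}


def find_workflow(start_type, goal_type, services=SERVICES):
    """Breadth-first search for the shortest chain of services from start to goal type."""
    queue = deque([(start_type, [])])
    seen = {start_type}
    while queue:
        data_type, path = queue.popleft()
        if data_type == goal_type:
            return path
        for name, (needs, gives) in services.items():
            if needs == data_type and gives not in seen:
                seen.add(gives)
                queue.append((gives, path + [name]))
    return None  # no chain of services connects the two types


workflow = find_workflow("2D structure", "3D structures & complexes")
print(workflow)
```

A natural language front end would only need to map a question onto a start type and a goal type; the search then proposes the service chain.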
