E-Chemistry and Web 2.0
Marlon Pierce
[email protected]
Community Grids Lab
Indiana University

One Talk, Two Projects
• NIH-funded Chemical Informatics and Cyberinfrastructure Collaboratory (CICC) @ IU: Geoffrey Fox, Gary Wiggins, Rajarshi Guha, David Wild, Mookie Baik, Kevin Gilbert, and others
• Proposed Microsoft-funded project, E-Chemistry: Carl Lagoze (Cornell), Lee Giles (PSU), Steve Bryant (NIH), Jeremy Frey (Soton), Peter Murray-Rust (Cambridge), Herbert Van de Sompel (Los Alamos), Geoffrey Fox (Indiana), and others

CICC Infrastructure Vision
• Chemical informatics: drug discovery and other academic chemistry, pharmacology, and bioinformatics research will be aided by powerful, modern, open information technology.
• NIH PubChem and PubMed provide unprecedented open, free data and information.
• We need a corresponding open service architecture (i.e., avoid stove-piped applications).
• CICC is set up as distributed cyberinfrastructure in the eScience model:
  • Web clients (user interfaces) to distributed databases, results of high-throughput screening instruments, results of computational chemical simulations and other analyses
  • Composed of clients to open service APIs (mash-ups)
  • Aggregated into portals
  • Web services manipulate this data and are combined into workflows.
• So our main agenda items: create interesting databases and build lots of Web services and clients.

CICC Databases
• Most of our databases aim to add value to PubChem or link into PubChem.
• 1D (SMILES) and 2D structures
• 3D structures (MMFF94), searchable by CID, SMARTS, 3D similarity
• Docked ligands (FRED, Autodock): 906K drug-like compounds into 7 ligands; will eventually cover ~2000 targets
• Philosophy: we have big computers, so let's calculate everything ahead of time and put the results in a DB.

Building Up the Infrastructure
• Our SOA philosophy: use standard Web services.
  • Mostly stateless
  • Some cluster, HPC work needed, but these populate databases
• Services are aggregate-able into different workflows.
  • Taverna, Pipeline Pilot, …
• You can also build lots of Web clients.
• See http://www.chembiogrid.org/wiki/index.php/CICC_Web_Resources for links and details.
• Not so far from Web 2.0….

Sample Services
Type | Service | Functionality | Source | License
Database | Docking | Provides access to the results of docking a subset of PubChem into a set of ligands. Searchable by 2D structure and docking score | Indiana University | Freely accessible
Database | 3D Structure | Provides access to 3D structures generated for most of PubChem | Indiana University | Freely accessible
Cheminformatics | OSCAR3 | Extracts chemical structures from text | Cambridge University | Freely accessible
Cheminformatics | InChiGoogle | Uses Google to search for an InChI | Cambridge University | Freely accessible
Cheminformatics | CMLRSSServer | Generates a CMLRSS feed from CML data | Cambridge University | Freely accessible
Cheminformatics | OpenBabel | Converts chemical file formats | Cambridge University | Freely accessible
Cheminformatics | ToxTreeServer | Obtains toxicity hazard predictions | Indiana University & European Chemical Bureau | Freely accessible
Cheminformatics | DBUtil | Generates 166-bit MACCS keys | Indiana University & gNova Consulting | Freely accessible
Cheminformatics | Molecular Similarity | Evaluates 2D/3D similarity and evaluates distance moments for 3D similarity calculations | Indiana University & CDK | Freely accessible
Cheminformatics | Molecular Descriptors | Generates various descriptors including TPSA, XLogP, surface areas | Indiana University & CDK | Freely accessible
Cheminformatics | 2D Structure Diagrams | Generates 2D structure diagrams from SMILES | Indiana University & CDK | Freely accessible
Cheminformatics | Druglikeness Methods | Evaluates measures of druglikeness | Indiana University & CDK | Freely accessible
Cheminformatics | Utility Methods | Generates hashed fingerprints, 2D coordinate generation etc. | Indiana University & CDK | Freely accessible
Statistics | Sampling Distributions | Samples from several distributions (normal, uniform, Weibull etc.) | Indiana University | Freely accessible
Statistics | Linear Regression | Builds linear regression models | Indiana University | Freely accessible
Statistics | CNN Regression | Builds neural network regression models | Indiana University | Freely accessible
Statistics | RF Regression | Builds random forest regression models | Indiana University | Freely accessible
Statistics | LDA | Builds linear discriminant analysis models | Indiana University | Freely accessible
Statistics | K-Means | Performs K-means clustering | Indiana University | Freely accessible
Statistics | Feature Selection | Performs feature selection using stepwise regression | Indiana University | Freely accessible
Statistics | XY Plots | Generates 2D scatter plots | Indiana University | Freely accessible
Statistics | Histogram Plots | Generates histograms | Indiana University | Freely accessible
Data Exchange | TabToVOTables | Converts tab-delimited files to VOTables | Indiana University | Freely accessible
Data Exchange | VOTablesToTab | Converts VOTables to tab-delimited files | Indiana University | Freely accessible
Data Exchange | VOTablesToXLS | Converts VOTables to Excel spreadsheet | Indiana University | Freely accessible
Data Exchange | VOTable Retrieve | Retrieves field names and data types from a VOTables document | Indiana University | Freely accessible
Data Exchange | VOTableExtract | Extracts columns from a VOTables document | Indiana University | Freely accessible
Computational Chemistry | Varuna File Format | Handles file formats for QM/MM packages | Indiana University | Freely accessible
Computational Chemistry | Varuna Analysis | Performs analysis of results from Jaguar and ADF | Indiana University | Freely accessible
Computational Chemistry | Varuna Query | Searches the Varuna database | Indiana University | Freely accessible
Computational Chemistry | Varuna Submit | Submits input data for calculation on a local cluster | Indiana University | Freely accessible
Application | Fred | Performs docking | Openeye Software | Commercial
Application | Filter | Property calculation and filtering | Openeye Software | Commercial
Application | Omega | Generates 3D conformers | Openeye Software | Commercial
Application | BCI Fingerprint | Generates 1052 BCI structural keys | Digital Chemistry | Commercial
Application | BCI Clustering | Performs divisive k-means clustering | Digital Chemistry | Commercial
Application | PkCell | Evaluates pharmacokinetic parameters for druglike molecules | Indiana University & University of Michigan | Freely accessible
Application | Scripps MLSCN Toxicity | Gets toxicity predictions for RF models built using MLSCN cell-line data | Indiana University & Scripps, FL | Freely accessible
Application | NTP DTP Anti-cancer Activity | Gets anti-cancer activity predictions for the 60 NCI cell lines | Indiana University | Freely accessible
Application | Ames Mutagenicity | Gets mutagenicity predictions | Indiana University | Freely accessible

Web Client Interfaces
Name | Functionality | Type | Links
PubDock | Interface to the docking database | Web | http://www.chembiogrid.org/cheminfo/dock/
Pub3D | Interface to the 3D structure database | Web | http://www.chembiogrid.org/cheminfo/p3d/
Frequent Hitters | Identify compounds that occur in multiple assays, with links to individual assays | Web | http://www.chembiogrid.org/cheminfo/freqhit/fh
MLSCN Toxicity Predictions | Predict whether a compound will be toxic or not | Web and Pipeline Pilot | http://www.chembiogrid.org/cheminfo/rws/scripps
ToxTree | Predict toxicity hazard class | Web | http://cheminfo.informatics.indiana.edu/~rguha/code/java/cdkws/cdkws.html#tox
DTP AntiCancer Predictions | Predict whether a compound exhibits anti-cancer activity against the 60 NCI cell lines | Web | http://www.chembiogrid.org/cheminfo/ncidtp/dtp

More Clients…
Name | Functionality | Type | Links
Ames Mutagenicity Predictions | Predict whether a compound is mutagenic or not in the Ames test | Web | http://www.chembiogrid.org/cheminfo/rws/ames
PkCell | Evaluate pharmacokinetic parameters | Web | http://www.chembiogrid.org/cheminfo/pkcell/
Kemo | Natural language interface to PubChem | Web | http://cheminfo.informatics.indiana.edu:8080/kemo/
RSS Feeds | Generate RSS feeds for various PubChem-related queries | Web and RSS feed | http://www.chembiogrid.org/cheminfo/rssint.html
Statistical Model Download | Download statistical models as R binary files | Web | http://www.chembiogrid.org/cheminfo/rws/mlist
Cheminformatics | Miscellaneous functions such as structure diagrams, similarity etc. | Web | http://cheminfo.informatics.indiana.edu/~rguha/code/java/cdkws/cdkws.html
Varuna | File operations and result analysis | Web | http://129.79.139.29/filecon/Default.aspx and http://129.79.139.29/utilityclient/Default.aspx
VOTables | Plotting data using VOTables as well as using Excel files via VOTables | Web | http://gf1.ucs.indiana.edu:9080/axis/VOTables.html and http://www.chembiogrid.org/cheminfo/rws/xlsvor
PubChemSR | .Net interface to PubChem | Desktop application | http://darwin.informatics.indiana.edu/juhur/Tools/PubChemSR/
rpubchem and rcdk | R packages to interface with the CDK and access PubChem | Desktop application | http://cran.r-project.org/src/contrib/Descriptions/rcdk.html and http://cran.r-project.org/src/contrib/Descriptions/rpubchem.html
Chimera plugin | A plugin to allow Chimera to utilize the PubDock database | Desktop application (requires Chimera) | http://poincare.uits.iupui.edu/~heiland/cicc/code/
PubChem 3D View | A Greasemonkey script that shows 3D structures when viewing PubChem pages | Web (requires Firefox and Greasemonkey) | http://rna.informatics.indiana.edu/hgopalak/3DStructView.user.js

Example: PubDock
• Database of approximately 1 million PubChem structures (the most drug-like) docked into proteins taken from the PDB
• Available as a web service, so structures can be accessed in your own programs, or using workflow tools like Pipeline Pilot
• Several interfaces developed, including one based on Chimera (right) which integrates the database with the PDB to allow browsing of compounds in different targets, or different compounds in the same target
• Can be used as a tool to help understand the molecular basis of activity in cellular or image-based assays

Example: R Statistics applied to PubChem data
• By exposing the R statistical package and the Chemistry Development Kit (CDK) toolkit as web services and integrating them with PubChem, we can quickly and easily perform statistical analysis and virtual screening of PubChem assay data.
• Predictive models for particular screens are exposed as web services, and can be used either as simple web tools or integrated into other applications.
• Example uses DTP Tumor Cell Line screens: a predictive model using Random Forests in R makes predictions of probability of activity across multiple cell lines.

Example assay screening workflow: finding cell-protein relationships
• A protein implicated in tumor growth with a known ligand is selected (in this case HSP90 taken from the PDB 1Y4 complex).
• The screening data from a cellular HTS assay is similarity searched for compounds with similar 2D structures to the ligand.
• Similar structures to the ligand can be browsed using client portlets.
• Similar structures are filtered for druggability, are converted to 3D, and are automatically passed to the OpenEye FRED docking program for docking into the target protein.
• Docking results and activity patterns are fed into R services for building of activity models and correlations: Least Squares Regression, Random Forests, Neural Nets.
• Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the JMOL applet.

Relevance to Web 2.0
• Some Web 2.0 key features:
  • REST services
  • Use of RSS/Atom feeds
  • Client interfaces are "mashups"
  • Gadgets, widgets for portals aggregate clients
• So…
  • We provide RSS as an alternative WS format.
  • We have experimented with RSS feeds, using Yahoo Pipes to manipulate multiple feeds.
  • CICC Web interfaces can be easily wrapped as universal gadgets in iGoogle, Netvibes.
  • Alternative to classic science gateways.

RSS Feeds/REST Services
• Provide access to DBs via RSS feeds
• Feeds include 2D/3D structures in CML
• Viewable in Bioclipse, Jmol as well as Sage etc.
• Two feeds currently available:
  • SynSearch: get structures based on full or partial chemical names
  • DockSearch: get best N structures for a target
• Really hampered by size of DB and Postgres performance.
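
A client for feeds like these could look roughly like the sketch below. It is only an illustration: the feed layout, the example.org links, and the compound names are assumptions made for this example, not the actual SynSearch or DockSearch output. It shows how an RSS item carrying an embedded CML molecule can be read with Python's standard library.

```python
import xml.etree.ElementTree as ET

# CML namespace; the feed layout below is a made-up illustration,
# not the real SynSearch/DockSearch format.
CML_NS = "http://www.xml-cml.org/schema"

SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:cml="http://www.xml-cml.org/schema">
  <channel>
    <title>DockSearch results (illustrative)</title>
    <item>
      <title>Compound A</title>
      <link>http://example.org/compound/A</link>
      <cml:molecule id="A">
        <cml:atomArray>
          <cml:atom id="a1" elementType="C"/>
          <cml:atom id="a2" elementType="O"/>
        </cml:atomArray>
      </cml:molecule>
    </item>
    <item>
      <title>Compound B</title>
      <link>http://example.org/compound/B</link>
      <cml:molecule id="B">
        <cml:atomArray>
          <cml:atom id="a1" elementType="N"/>
        </cml:atomArray>
      </cml:molecule>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return one dict per RSS item with its title, link and CML atoms."""
    root = ET.fromstring(xml_text)
    ns = {"cml": CML_NS}
    entries = []
    for item in root.iter("item"):
        molecule = item.find("cml:molecule", ns)
        elements = []
        molecule_id = None
        if molecule is not None:
            molecule_id = molecule.get("id")
            # Collect the element symbol of every atom in the molecule.
            elements = [atom.get("elementType")
                        for atom in molecule.iterfind(".//cml:atom", ns)]
        entries.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "molecule_id": molecule_id,
            "elements": elements,
        })
    return entries


entries = parse_feed(SAMPLE_FEED)
for entry in entries:
    print(entry["title"], entry["molecule_id"], entry["elements"])
```

Because the structures travel inside ordinary feed items, any feed reader can list the results, while a chemistry-aware client picks out the CML payload.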

Tools and mashups based on web service infrastructure
http://www.chembiogrid.org/projects/proj_tools.html

Mining information from journal articles
• Until now, SciFinder / CAS has been the only chemistry-aware portal into journal information.
• We can access full text of journal articles online (with subscription).
• ACS does not make full text available … but there are ways round that!
• RSC is now marking up with SMILES and GO/Goldbook terms! www.projectprospect.org
• Having SMILES or InChI means that we can build a similarity/structure searchable database of papers: e.g. "find me all the papers published since 2000 which contain a structure with >90% similarity to this one"
• In the absence of full text, we can at least use the abstract.

Text Mining: OSCAR
• A tool for shallow, chemistry-specific natural language parsing of chemical documents (e.g. journal articles).
• It identifies (or attempts to identify):
  • Chemical names: singular nouns, plurals, verbs etc., also formulae and acronyms.
  • Chemical data: spectra, melting/boiling point, yield etc. in experimental sections.
  • Other entities: things like N(5)-C(3) and so on.
• Part of the larger SciBorg effort
• See http://www.cl.cam.ac.uk/~aac10/escience/sciborg.html and http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3

Mash-Up: What published compounds might bind to this protein?
• Create a database containing the text of all recent PubMed abstracts (2006-2007 = ~500,000)
• Use OSCAR to extract all of the chemical names referred to in the abstracts and convert to SMILES
• DATABASE SERVICE + DOCKING SERVICE: convert molecules to 3D and dock into a protein of interest
• Visualize top docked molecules in a Google-like interface

E-Chemistry and Digital Libraries
We can't wait to get started….

E-Chemistry and Digital Libraries
• Key problem with our SOA-based e-Science is information management.
  • Where is the service that I need? What does it do?
• We may consider our data-centric services to be digital libraries.
• Data is diverse:
  • Documents
  • Not just computational information like structures.
• Another point of view: how can I link together publications, results, workflows, etc.? That is, I need to manage digital documents.

Digital Libraries
• Open Archives Initiative Object Reuse and Exchange Project (OAI-ORE)
• Developing standardized, interoperable, and machine-readable mechanisms to express information about compound information objects on the web.
• Graph-based representations of connected digital objects.
• Objects may be encoded in (for example) RDF or XML.
• Retrievable via repositories with REST service interfaces (cf. Atom Publishing Protocol)
• Obtain, harvest, and register

Challenges for E-Chemistry
• Can digital library principles be applied to data as well as documents?
  • Can you link your workflow to your conference paper?
• Can we engineer a publishing framework and message formats around Web 2.0 principles?
  • REST, Atom Publishing Protocol, Atom Syndication Format, JSON, Microformats
• Can we do this securely?
  • Access control, provenance, identity federation are key problems.
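
To make the idea of a graph of connected digital objects a little more concrete, here is a minimal sketch of one possible JSON message that links a paper to the workflow and data behind it. The field names, relation labels, and identifiers are hypothetical and chosen only for illustration; this is not the OAI-ORE vocabulary or any format defined by the project.

```python
import json


def make_aggregation(agg_id, resources, links):
    """Bundle resources and typed links between them into one compound object.

    resources: list of dicts, each with at least an "id" key.
    links: list of (source_id, relation, target_id) tuples.
    """
    known = {resource["id"] for resource in resources}
    for source, relation, target in links:
        # Every edge of the graph must point at an aggregated resource.
        if source not in known or target not in known:
            raise ValueError(f"link {source} -{relation}-> {target} "
                             "refers to an unknown resource")
    return {
        "id": agg_id,
        "aggregates": resources,
        "links": [{"source": s, "relation": r, "target": t}
                  for s, r, t in links],
    }


# Hypothetical identifiers and relation names, for illustration only.
resources = [
    {"id": "paper-1", "type": "publication"},
    {"id": "workflow-1", "type": "workflow"},
    {"id": "dataset-1", "type": "docking-results"},
]
links = [
    ("paper-1", "describes", "workflow-1"),
    ("workflow-1", "produced", "dataset-1"),
]

aggregation = make_aggregation("agg-1", resources, links)
message = json.dumps(aggregation, indent=2)
print(message)
```

A message of this kind could be posted to or fetched from a repository through a REST interface; the questions of access control and provenance raised above are deliberately left out of the sketch.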

Institution | Project Focus
Cambridge | Retrospective Data Extraction; Searching and Indexing; Data Models/Ontologies; Tools and Applications
Cornell | Data Models; Interoperability infrastructure; Project Management; Publicity and outreach
Indiana | Infrastructure Integration; Trust and Provenance; Tools and Applications
LANL | Data Models; Interoperability infrastructure
PubChem | Chemical Structure Archive; Results of Experimental Biological Activity Testing; Cross References to BioMedical Databases
Penn State | Retrospective Data Extraction; Searching and Indexing; Analysis
Southampton | Prospective & Retrospective Data Provision; Tools and Applications; In-process capture of eChemistry data; Data Linking in analysis and publication

More Information
• Project Web Site: www.chembiogrid.org
• Project Wiki: www.chembiogrid.org/wiki
• Contact me: [email protected]

CICC: Chemical Informatics and Cyberinfrastructure Collaboratory
Funded by the National Institutes of Health
www.chembiogrid.org
CICC Combines Grid Computing with Chemical Informatics

Large Scale Computing Challenges
Chemical informatics is a non-traditional area of high performance computing, but many new, challenging problems may be investigated.
Chemical informatics text analysis programs can process 100,000's of abstracts of online journal articles to extract chemical signatures of potential drugs.

Science and Cyberinfrastructure
CICC is an NIH-funded project to support chemical informatics needs of High Throughput Cancer Screening Centers. The NIH is creating a data deluge of publicly available data on potential new drugs.

[Pipeline diagram labels: NIH PubMed DataBase; OSCAR Text Analysis; Initial 3D Structure Calculation; Molecular Mechanics Calculations; Cluster Grouping; Toxicity Filtering; Docking]

OSCAR-mined molecular signatures can be clustered, filtered for toxicity, and docked onto larger proteins. These are classic "pleasingly parallel" tasks. Top-ranking docked molecules can be further examined for drug potential.
[Pipeline diagram labels, continued: Quantum Mechanics Calculations; NIH PubChem DataBase; POVRay Parallel Rendering; IU's Varuna DataBase]

Big Red (and the TeraGrid) will also enable us to perform time-consuming, multi-stepped Quantum Chemistry calculations on all of PubMed. Results go back to public databases that are freely accessible by the scientific community.

CICC supports the NIH mission by combining state-of-the-art chemical informatics techniques with
• World class high performance computing
• National-scale computing resources (TeraGrid)
• Internet-standard web services
• International activities for service orchestration
• Open distributed computing infrastructure for scientists worldwide

Indiana University Department of Chemistry, School of Informatics, and Pervasive Technology Laboratories

MLSCN Post-HTS Biology Decision Support
• Percent Inhibition or IC50 data is retrieved from HTS
• Question: Was this screen successful?
  • Workflows encoding plate & control well statistics, distribution analysis, etc.
• Question: What should the active/inactive cutoffs be?
  • Workflows encoding distribution analysis of screening results
• Question: What can we learn about the target protein or cell line from this screen?
  • Workflows encoding statistical comparison of results to similar screens, docking of compounds into proteins to correlate binding with activity, literature search of active compounds, etc.
• Compounds submitted to PubChem
[Diagram labels: PROCESS; CHEMINFORMATICS; GRIDS]
• Grids can link data analysis (e.g. image processing developed in existing Grids), traditional Cheminformatics tools, as well as annotation tools (Semantic Web, del.icio.us) and enhance lead ID and SAR analysis
• A Grid of Grids linking collections of services at PubChem, ECCR centers, MLSCN centers

R Web Services

Why?
• Need access to math and stat functionality
  • Did not want to recode algorithms
  • Wanted latest methods
• Needed a distributed approach to computation
  • Keep computation on a powerful machine
  • Access it from a smaller machine

Why R?
• Free, open-source
• Many cutting-edge methods available
• Flexible programming language
• Interfaces with many languages: Python, Perl, Java, C

The R Server
• R can be run as a remote compute server
• Requires the Rserve package
• Allows authenticated access over TCP/IP
• Connections can maintain state
• Client libraries for Java & C

R as a Web Service
• On its own the R server is not a web service
• We provide Java frontends to specific functionalities
• The frontend classes are hosted in a Tomcat web container
• Accessible via SOAP
• Full Javadocs for all available WSs

Flowchart
[Figure: flowchart]

Functionality
• Two classes of functionality
• General functions
  • Allow you to supply data and build a predictive model
  • Sample from various distributions
  • Obtain scatter plots and histograms
• Model development functions use a Java frontend to encapsulate model-specific information

Functionality
• Two classes of functionality
• Model deployment
  • Allows you to build a model outside of the infrastructure
  • Place the final model in the infrastructure
  • Becomes available as a web service
• Each model deployed requires its own frontend class
• In general, these classes are identical and could be autogenerated

Available Functionality
• Predictive models: OLS, RF, CNN, LDA
• Clustering: k-means
• Statistical distributions
• XY plots and scatter plots
• Model deployment for single model types and ensemble model types

Deployed Models
• Since deployed models are visible as web services, we can build a simple web frontend for them
• Examples:
  • NCI anti-cancer predictions
  • Ames mutagenicity predictions

Applications
• The R WS is not restricted to 'atomic' functionality
  • Can write a whole R program
  • Load it on the R compute server
  • Provide a Java WS frontend
• Examples:
  • Feature selection
  • Automated model generation
  • Pharmacokinetic parameter calculation

Data Input/Output
• Most modeling applications require data matrices
• Depending on client language we can use
  • SOAP array of arrays (2D matrices)
  • SOAP array (1D vector form of a 2D matrix)
  • VOTables

Data Input/Output
• Some R web services can take a URL to a VOTables document
• Conversion to R or Java matrices is done by a local VOTables Java library
• R also has basic support for VOTables directly
  • Ignores binary data streams

Interacting With R WSs
• Traditional WSs do not maintain state
• Predictive models are different
  • A model is built at one time
  • May be used for prediction at another time
• Need to maintain state
  • State is maintained by serialization to R binary files on the compute server
  • Clients deal with model IDs

Interacting With R WSs
• Protocol
  • Send data to model WS
  • Get back model ID
  • Get various information via model ID: fitted values, training statistics, new predictions

Cheminformatics at Indiana University School of Informatics
David J. Wild
[email protected]
Associate Director of Chemical Informatics & Assistant Professor
Indiana University School of Informatics, Bloomington
http://djwild.info

Cheminformatics education at Indiana
• M.S. in Chemical Informatics
  • 2 years, 36 semester hours
  • Includes a 6-hour capstone / research project
  • Opportunity to work in Laboratory Informatics (IUPUI) or closely with Bioinformatics (IUB)
  • Currently 9 students enrolled
• Ph.D. in Informatics, Cheminformatics Specialty
  • 90 credit hours, including 30 hours dissertation research. Usually 4 years.
  • Research rotations expose students to research in related areas
  • Currently 4 students enrolled
• Graduate Certificate
  • 4 courses, all available by Distance Education

Distance Education for Cheminformatics
• Uses Breeze + teleconference for live sharing of classes: all that is required is a P.C. and a telephone. Optional Polycom videoconferencing.
• Lectures are recorded for easy playback through a web browser
• Wiki or similar webpage for dissemination of course materials
• Also participate in CIC courseshare to give class at University of Michigan
• Of 75 students taking our courses since fall

Current research in the Wild lab
• Integration of cheminformatics tools and data sources
  • A web service infrastructure for cheminformatics
  • Compound information & aggregation web service and interface ("by the way box")
  • An enhanced chatbot for exploiting chemical information & web services
  • Semantically aware workflow tools for cheminformatics
• Data mining the NIH DTP tumor cell line database
• PubDock: a docking database for PubChem

Current research in the Guha lab
• Predictive Modeling
  • Interpretation, validation, domain applicability
  • Generalization to other 'models' such as docking, pharmacophore etc.
  • Integration of multiple data types
  • Addressing imbalanced and noisy datasets
• Analysis of Chemical Spaces
  • Quantify distributions in spaces
  • Investigation of density approaches
  • Applications to lead hopping, model domains
• Methods to summarize & compare data
  • Applications to HTS and smaller lead series type

Cheminformatics web service infrastructure
• Cheminformatics services
  • Docking (FRED)
  • 3D structure generation (OMEGA)
  • Filtering (FRED, etc.)
  • OSCAR3
  • Fingerprints (BCI, CDK)
  • Clustering (BCI)
  • Toxicity prediction (ToxTree)
  • R-based predictive models
  • Similarity calculations (CDK)
  • Descriptor calculation (CDK)
  • 2D structure diagrams (CDK)
• Database Services
  • PostgreSQL + gNova
  • PubChem mirror (augmented)
  • Pub3D: 3D structures for PubChem
  • PubDock: bound 3D structures
  • Compound-indexed journal article DB
  • NIH Human Tumor Cell Line
  • Local PubChem mirror
Xiao Dong, Kevin E. Gilbert, Rajarshi Guha, Randy Heiland, Jungkee Kim, Marlon E. Pierce, Geoffrey C. Fox and David J. Wild, Web service infrastructure for chemoinformatics, Journal of Chemical Information and Modeling, 2007; 47(4) pp 1303-1307

RSC Project Prospect - what can we do with the information?
• www.projectprospect.org
• >100 papers marked up with SMILES/InChI (using OSCAR3), plus Gene Ontology and Goldbook Ontology terms
• Created similarity-searchable PostgreSQL / gNova database with paper DOIs, SMILES, and ontology terms
• Web service and simple HTML interfaces for searching … "which papers reference compounds similar to this one in the scope of these ontological terms?"

Greasemonkey / OSCAR script
http://cheminfo.informatics.indiana.edu:8080/ChemGM/index.jsp

By the way… annotation (mock-up!)
By the way…
• This compound is very similar to a prescription drug, Tamoxifen.
• This compound is referenced in 20 journal articles published in the last 5 years
• Similar compounds are associated with the words "toxic" and "death" in 280 web pages
• It appears to be covered under 3 patents
• It has been shown to be active in 5 screens
• Computer models predict it to show some activity against 8 protein targets
• Here are some comments on this compound:
  • David Wild: don't take any notice of the computational models - they are rubbish

Cheminformatics-aware simple lab notebook (mock-up!)
• Plug-in allows structures to be drawn with the pen and cleaned up
[Notebook page mock-up: "Some useful chemical reactions", with hand-drawn structures for iodoacetate (I-CH2COO-) and iodoacetamide (ICH2CONH2), a "FIND INFO ABOUT THIS REACTION" button, and the note "This may also react, favored by alkaline pH"]
• Free text input can be converted to machine-readable form by electrovaya ….
• Web service interface provides access to computation and searching.
• Page is marked up by what is possible
• Automatic detection of data fields (yield, etc.) where possible

Automatic workflow generation and natural language queries
• Develop service ontology using OWL-S or similar language
• Allows service interoperability, replacement and input/output compatibility
• We can then use generic reasoning and network analysis tools to find paths from inputs to desired outputs
• Natural language can be parsed to inputs and desired outputs
• Smart Clients <--> Agents <--> Services
• Possible "supercharged life science Google?"
[Diagram: network of services (2D similarity search, 2D -> 3D conversion, 3D search, pharmacophore search, docking, structure crawler) connecting 2D structures, 3D structures, 3D protein structure, and 3D structures & complexes; ontology annotations: "2D structures are compounds", "3D structures are compounds", "dock = bind"]
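
The path-finding idea above can be sketched in a few lines. This is only an illustration under strong simplifying assumptions: each service is reduced to a single input type and a single output type, the service names are hypothetical, and a plain breadth-first search stands in for the OWL-S based reasoning the slide proposes.

```python
from collections import deque

# Hypothetical service catalogue: name -> (input type, output type).
# The data types loosely follow the labels in the diagram above.
SERVICES = {
    "similarity_2d": ("2D structure", "2D structures"),
    "convert_2d_to_3d": ("2D structures", "3D structures"),
    "search_3d": ("3D structures", "3D structures"),
    "dock": ("3D structures", "3D structures & complexes"),
}


def find_workflow(services, start_type, goal_type):
    """Breadth-first search for the shortest chain of services that turns
    start_type into goal_type. Returns a list of service names, or None
    if no chain exists."""
    queue = deque([(start_type, [])])
    seen = {start_type}
    while queue:
        current, path = queue.popleft()
        if current == goal_type:
            return path
        for name, (input_type, output_type) in services.items():
            # A service can follow if it accepts the type we currently hold.
            if input_type == current and output_type not in seen:
                seen.add(output_type)
                queue.append((output_type, path + [name]))
    return None


workflow = find_workflow(SERVICES, "2D structure", "3D structures & complexes")
print(workflow)
```

A natural language front end would only need to map a query onto a start type and a goal type; the search then proposes the chain of services that connects them.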