Use of Semantic Technologies at Eli Lilly and Company J Phil Brooks Information Consultant, SE Data Team Discover IT Eli Lilly and Company.
Use of Semantic Technologies at Eli Lilly and Company
J Phil Brooks Information Consultant, SE Data Team Discover IT Eli Lilly and Company
Agenda
• Project Overviews
  • Discovery Metadata
  • Integrative Informatics
    • POC 4
    • POC 1
  • Metadata Repository
• External Collaborations
• Conclusions
• Acknowledgements
5/6/2020 Copyright© 2008 Eli Lilly and Company Semantic Technologies (from an old DBA’s perspective) 2
Project Overview: Discovery Metadata
Discovery Metadata: Goals
Integrate Master Data throughout the pharmaceutical discovery process to enable information sharing and integration for the scientific community
• Model key relationships between Master Data classes
• Provide the ability to integrate disparate data sets more quickly than the normal warehouse paradigm typically allows
• Create a reusable and sustainable semantic implementation
• Allow for user-driven, manual curation of key data relationships
• Develop core competencies in Semantic Web technologies within Eli Lilly
• Position the Semantic Web within Eli Lilly
  • Strengths
  • Weaknesses
  • When to use?
Discovery Metadata: Ontology
[Figure: Discovery Metadata ontology diagram. Recoverable source labels: SAP, REFDB, NCBI, Legacy, GSM, Manual Curation]
Discovery Metadata: Architecture
[Figure: layered architecture diagram. Recoverable labels:]
• APPS: Application 1, Application 2, Application 3, …
• SOA: SOA Layer/Enterprise Service Bus (Web Services, Visualizers, Data Access Components); Authentication; SQL; SPARQL
• DATA: Source Model 1, Source Model 2, Source Model 3, Source Model 4, …; Local Assertions; Top Level Ontology; Provenance
• ETL from sources: RDBMS, Spreadsheets, Other Tools
Discovery Metadata: Implementation
• Oracle Semantic Technologies 11g
• TopBraid Composer, Maestro Edition v2.6.2
• Multiple Oracle models segregated by source
  • Top-Level Ontology
  • Enterprise data sources (3)
  • External data sources (NCBI)
  • Custom/Local assertions (2)
• ~4.4M triples
  • Loaded triples: 2.1M
  • Inferred triples: 2.3M
• Custom-developed browser
• Metadata-driven web service providing cross-application access to master data
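The split between loaded and inferred triples can be illustrated with a small sketch. This is not Oracle's inference engine; it is a plain-Python toy, assuming hypothetical model names and `ex:` identifiers, that keeps triples in per-source sets and derives extra `rdf:type` triples from `rdfs:subClassOf` links.

```python
from collections import defaultdict

# Hypothetical miniature of "models segregated by source":
# each model name maps to its own set of (subject, predicate, object) triples.
models = {
    "top_level_ontology": {
        ("ex:Kinase", "rdfs:subClassOf", "ex:Protein"),
        ("ex:Protein", "rdfs:subClassOf", "ex:MasterData"),
    },
    "enterprise_source_1": {("ex:P001", "rdf:type", "ex:Kinase")},
    "ncbi": {("ex:P001", "ex:hasGeneId", "ex:Gene_207")},
}


def infer_types(triples):
    """Return rdf:type triples implied by rdfs:subClassOf and not already asserted."""
    parents = defaultdict(set)
    for s, p, o in triples:
        if p == "rdfs:subClassOf":
            parents[s].add(o)

    inferred = set()
    for s, p, o in triples:
        if p != "rdf:type":
            continue
        # Walk up the class hierarchy from the asserted class.
        stack, seen = list(parents[o]), set()
        while stack:
            cls = stack.pop()
            if cls in seen:
                continue
            seen.add(cls)
            inferred.add((s, "rdf:type", cls))
            stack.extend(parents[cls])
    return inferred - triples


# Union of all source models is the "loaded" set; inference adds new triples on top.
loaded = set().union(*models.values())
inferred = infer_types(loaded)
print(len(loaded), "loaded;", len(inferred), "inferred")
```

Keeping each source in its own set mirrors the idea of one model per source: a source can be reloaded or dropped without touching the others, and inference is rerun over the union.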
Discovery Metadata: Future Work
• Implement provenance at the instance level
• Integrate additional data sources (MeSH, Gene Ontology, KEGG, internal data sources)
• Operationalize load processes
• Finalize visualization standards
• Performance reviews (scalability)
Project Overview: Integrative Informatics
Integrative Informatics: Overview
• The focus of Integrative Informatics is to facilitate data integration between the discovery and medical components within Eli Lilly
• Their methodology is to execute Proof of Concept (POC) projects to identify, construct, and test various solutions to the integration problem
• Efforts:
  • POC1: CATIE project
  • POC4: Endocrine PI Competitive Intelligence
  • Generic Browser efforts
Integrative Informatics: POC1 - CATIE Semantic Integration
What is the CATIE study?
• Clinical Antipsychotic Trials of Intervention Effectiveness
• Was the most comprehensive independent trial ever completed to examine existing antipsychotic therapies for schizophrenia
• Provides detailed information comparing the effectiveness and side effects of five medications currently used to treat schizophrenia
  • Olanzapine
  • Quetiapine
  • Risperidone
  • Ziprasidone
  • Perphenazine
• Greatly enhances the knowledge available to guide treatment choices for people with schizophrenia
Source: Chandra Ranga Gudivada
Integrative Informatics: POC1 Goals
• Determine whether semantic integration and analysis of the CATIE data set, in the context of metabolic and signal transduction pathways with receptor affinities, can provide answers to specific scientific questions:
  • Which pathways are associated with response to the 5 different schizophrenia drugs?
  • How do these pathways compare between treatment arms?
  • Which receptors are associated with response to the 5 schizophrenia drugs?
  • How are the pathways, receptors, and drug response genes from the CATIE data set related?
Source: Chandra Ranga Gudivada
Integrative Informatics: POC1 Data Aggregation
• CATIE Drugs:
  • Olanzapine
  • Perphenazine
  • Quetiapine
  • Risperidone
  • Ziprasidone
• Datasets:
  • Entrez Gene
  • PubChem (for CATIE Drugs)
  • Assay (Receptor Affinity Data for CATIE)
  • KEGG
  • Reactome
  • BioCyc
  • TRANSPATH
Source: Chandra Ranga Gudivada
Integrative Informatics: POC1 Architecture
[Figure: data flow diagram. Recoverable steps:]
• Data in multiple formats (flat file, tab-delimited, XML)
• RDF conversion using the Jena programming API; Top-Level Ontology
• Two stores: Oracle 11g RDF store and AllegroGraph native RDF triple store
• SPARQL querying performed against each store
Source: Chandra Ranga Gudivada
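The conversion step can be sketched in a few lines. The POC used the Jena API (Java); the version below swaps in plain Python with only the standard library, and the column names, the `http://example.org/poc1/` namespace, and the sample rows are all hypothetical. It turns tab-delimited rows into N-Triples lines that an RDF store could load.

```python
import csv
import io

BASE = "http://example.org/poc1/"  # hypothetical namespace, not from the slides


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def row_to_ntriples(row):
    """Yield N-Triples lines for one row with gene_id, symbol, and pathway_id columns."""
    subject = "<" + BASE + "gene/" + row["gene_id"] + ">"
    symbol = escape_literal(row["symbol"])
    pathway = "<" + BASE + "pathway/" + row["pathway_id"] + ">"
    yield subject + " <" + BASE + "symbol> \"" + symbol + "\" ."
    yield subject + " <" + BASE + "inPathway> " + pathway + " ."


def convert(tsv_text):
    """Convert tab-delimited text (with a header row) into a list of N-Triples lines."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [line for row in reader for line in row_to_ntriples(row)]


# Hypothetical sample input: two genes in the same pathway.
SAMPLE = "gene_id\tsymbol\tpathway_id\n1\tGENE_A\tPW1\n2\tGENE_B\tPW1\n"
ntriples = convert(SAMPLE)
for line in ntriples:
    print(line)
```

Because both genes point at the same pathway resource, the two rows become connected in the graph without any explicit join key, which is the property the integration relies on.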
Integrative Informatics: POC1 Conclusions
• Efficient semantic integration can be accomplished by using RDF
• Powerful complex data modeling can be achieved by using graph principles inherent in RDF
• Easy translation of scientific questions to graph queries can be accomplished using SPARQL and SEM_MATCH
• Customized outputs can easily be generated by making slight changes in the SPARQL query pattern
Source: Chandra Ranga Gudivada
Integrative Informatics: POC4 - Endocrine PI Competitive Intelligence
Competitive Intelligence (CI) is the purposeful, ethical, and coordinated monitoring of competitors within a specific marketplace in order to:
• Strategically gain foreknowledge of recent developments in your competitors' plans
• Make calculated, informed business decisions and formulate operational strategy
The purpose of the Endocrine Public Information (PI) project is to provide a mechanism for actively surveying public information for competitive intelligence in the Endocrine area
Source: Chandra Ranga Gudivada
Integrative Informatics: POC4 Goals
• Does such a competitive intelligence effort significantly benefit from a semantic component?
• Does the Endocrine PI project significantly benefit from semantic integration?
• Are there pre-existing ontologies for Company and method of action (MOA) domains?
• Do natural language processing (NLP) or text mining methods work for this kind of data?
• Does “buried” knowledge exist within the datasets that can be discovered using inference and reasoning?
Source: Chandra Ranga Gudivada
Integrative Informatics: POC4 Integration Challenges
Syntactic Variations
• Company
  • Merck & Co
  • Merck & Co Inc
  • Merck
  • Merck & Co Ltd
• MOA
  • Alpha-glucosidase inhibitor / Glucosidase inhibitor alpha
  • IGF binding protein-3 stimulator / IGF binding protein stimulator-3

Semantic Variations
• Company (parent-child relations)
  • Amgen Boulder Inc
  • Applied Molecular Genetics Inc
  • Synergen Inc
  • Amgen
• MOA
  • Serotonin 2A receptor antagonists / 5-HT 2 receptor antagonist / 5-HT2a antagonist
  • Peroxisome proliferator-activated receptor delta antagonist / PPAR delta antagonist
  • Melanin concentrating hormone receptor 1 antagonists / MCH receptor-1 antagonist
Source: Chandra Ranga Gudivada
Integrative Informatics: POC4 NLP and Semantic Integration
Raw Endocrine Data | Terms from Thomson Pharma
Bayer Corp | Bayer Corp
Dopamide receptor agonist | Dopamide receptor agonist
SGLT inhibitor | SGLT inhibitor
Eli Lilly | Eli Lilly & Co Ltd
STAT transcription factor stimulant | STAT stimulator
Alpha-glucosidase inhibitors | Glucosidase inhibitor-alpha
Peroxisome proliferator-activated receptor delta antagonist | PPAR delta antagonist
5 Hydroxytryptamine 2C agonist | 5HT 1c agonist
Opioid kappa receptor antagonists | Kappa opioid antagonist
Serotonin 1B receptor agonists | 5-HT 1d beta agonist

NLP Methods Used:
• Semantic Normalization
• Fuzzy Distance
• Ignoring Stop Words
• Regular Expressions
• Tokenization
• Rule-based Mapping
Source: Chandra Ranga Gudivada
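The listed methods can be combined into a small matching pipeline. The sketch below is an assumption about how such a pipeline might look, not the project's actual code: the stop-word list, the rewrite rules, and the crude plural stripping are all illustrative, and Python's `difflib.SequenceMatcher` stands in for whatever fuzzy-distance measure was used.

```python
import re
from difflib import SequenceMatcher

# Illustrative stop words for company names (an assumption, not the project's list).
STOP_WORDS = {"inc", "ltd", "co", "corp"}

# Hypothetical rule-based mappings from long forms to abbreviations.
RULES = {
    "peroxisome proliferator activated receptor": "ppar",
    "melanin concentrating hormone": "mch",
}


def normalize(term):
    """Lowercase, strip punctuation, apply rules, drop stop words, sort tokens."""
    text = re.sub(r"[^a-z0-9]+", " ", term.lower())  # regular expression cleanup
    for long_form, short_form in RULES.items():      # rule-based mapping
        text = text.replace(long_form, short_form)
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # tokenization, stop words
    # Crude plural stripping so "inhibitors" and "inhibitor" compare equal.
    tokens = [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]
    return " ".join(sorted(tokens))  # sorting makes word order irrelevant


def similarity(a, b):
    """Fuzzy similarity in [0, 1] between two terms after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


pairs = [
    ("Merck & Co Inc", "Merck"),
    ("Alpha-glucosidase inhibitors", "Glucosidase inhibitor-alpha"),
    ("Peroxisome proliferator-activated receptor delta antagonist", "PPAR delta antagonist"),
    ("Melanin concentrating hormone receptor 1 antagonists", "MCH receptor-1 antagonist"),
]
for raw, reference in pairs:
    print(raw, "<->", reference, round(similarity(raw, reference), 2))
```

Normalization handles the syntactic variations outright; the fuzzy score is a fallback for residual differences and would normally be paired with a threshold and manual review of borderline matches.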
Integrative Informatics: POC4 Knowledge Representation
[Figure: RDF graph for knowledge representation. Recoverable labels:]
• Classes: MOA, Drug, Disease, Company; one blank node
• Properties: rdf:type, hasSubClass, hasDrug, hasStatus, hasTherapeuticArea, alternativeLabel
• MOA terms: Melanin-concentrating hormone receptor antagonists; Melanin concentrating hormone receptor 1 antagonists; MCH 1 antagonists; GPR-24 antagonist; MCH receptor-1 antagonist
• Status: Phase 2
• Disease: Obesity
• Company names: Amgen; Applied Molecular Genetics Inc; Amgen Boulder Inc; Abgenix; Synergen Inc; Avidia Inc
Source: Chandra Ranga Gudivada
Integrative Informatics: POC4 Inferencing
Given Company Name: Applied Molecular Genetics Inc
Get MOAs that this company is working on

PREFIX TLO:
WHERE {
  ?Company_Res TLO:SynonymousLabels "Applied Molecular Genetics Inc"^^xsd:string .
  ?Company_Res TLO:preferredLabel ?Company_Pref_Label .
  ?Company_Res TLO:SynonymousLabels ?Company_All_Labels .
  ?Drug_Info TLO:hasAssociatedCompany ?Company_All_Labels .
  ?Drug_Info TLO:hasMOA ?Endo_MOA .
}

Case 1: Without Semantic Integration and Inference: 0 results
Case 2: With Semantic Integration and Inference: 18 results, including
  Company labels: Amgen, Amgen Inc
  MOAs: Leptin stimulator, Agouti related protein inhibitor, Neuropeptide Y antagonist, Melanocortin MC4 antagonist
Source: Chandra Ranga Gudivada
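The difference between the two cases can be reproduced on a toy graph. The sketch below is a plain-Python illustration, not the project's query engine: the `co:` and `drug:` identifiers and the company-to-MOA pairings are hypothetical, and only the property names are taken from the query. A direct lookup by the historical company name finds nothing, while expanding the name through the synonym labels reaches the drug records filed under other labels.

```python
# Toy triple set; identifiers and pairings are hypothetical.
TRIPLES = [
    ("co:1", "TLO:preferredLabel", "Amgen"),
    ("co:1", "TLO:SynonymousLabels", "Amgen"),
    ("co:1", "TLO:SynonymousLabels", "Amgen Inc"),
    ("co:1", "TLO:SynonymousLabels", "Applied Molecular Genetics Inc"),
    ("drug:1", "TLO:hasAssociatedCompany", "Amgen"),
    ("drug:1", "TLO:hasMOA", "Leptin stimulator"),
    ("drug:2", "TLO:hasAssociatedCompany", "Amgen Inc"),
    ("drug:2", "TLO:hasMOA", "Neuropeptide Y antagonist"),
    ("drug:3", "TLO:hasAssociatedCompany", "Bayer Corp"),
    ("drug:3", "TLO:hasMOA", "SGLT inhibitor"),
]


def objects(subject, predicate):
    """All objects for a given subject and predicate."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def subjects(predicate, obj):
    """All subjects that have the given predicate and object."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def moas_direct(company_name):
    """Case 1: match the company name literally against drug records."""
    drugs = subjects("TLO:hasAssociatedCompany", company_name)
    return sorted(m for d in drugs for m in objects(d, "TLO:hasMOA"))


def moas_with_synonyms(company_name):
    """Case 2: expand the name to every synonymous label before matching."""
    companies = subjects("TLO:SynonymousLabels", company_name)
    labels = {lab for c in companies for lab in objects(c, "TLO:SynonymousLabels")}
    drugs = {d for lab in labels for d in subjects("TLO:hasAssociatedCompany", lab)}
    return sorted({m for d in drugs for m in objects(d, "TLO:hasMOA")})


name = "Applied Molecular Genetics Inc"
print("direct:", moas_direct(name))
print("with synonyms:", moas_with_synonyms(name))
```

The second function follows the same join path as the graph pattern in the query: name to company resource, company resource to all labels, labels to drug records, drug records to MOAs.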
Integrative Informatics: POC4 Conclusions
• Semantic Integration (instance mapping using NLP) coupled with the RDF data model was successful in answering questions in Competitive Intelligence
• Ontologies provide a powerful framework of dictionaries and taxonomical relations that helps to reason over the data and draw inferences for knowledge discovery
• Manual curation is a tedious, error-prone, and labor-intensive task
• A semi-automated, intelligent, computer-based solution that utilizes Ontologies, Semantic Integration, and NLP could drastically reduce the manual curation effort while maintaining high-quality information
Source: Chandra Ranga Gudivada
Project Overview: Metadata Repository
Metadata Repository: Goals
Aggregate experiment metadata from a diverse set of LSCDD relational databases into an Oracle Semantic Technologies repository for LSCDD scientific investigation
• Provide a unified vocabulary for LSCDD scientific investigation
• Avoid a complex architecture and extended development effort
• Realize benefits in the near term
• Preprocess metadata to improve efficiency
• Characterize the types of questions that the ontology should answer
• Identify stable semantic technologies; do not employ parsers
• Allow semantic and relational databases to work together
• Provide browser, visualization, and query access into the repository
Source: Maurice Manning
Metadata Repository: Ontology
[Figure: Metadata Repository ontology diagram. Recoverable labels:]
• Classes: Project, Study, Experiment, Assay, DiseaseState, Plate, Plate Well, Protocol, Chip, Chip Type, Probe, Compound, Reagent, DNA Reagent, RNA Reagent, Protein Reagent, Software, Hardware, Gene, GeneList, Treatment, ViralBatch, ClinicalData, Sample, Tissue, Model, CellLine, MESH, GO
• Properties: hasProject, hasStudy, hasDiseaseState, hasPlate, hasAssay, hasChip, hasChipType, hasProtocol, hasCompound, hasReagent, hasGene, IsPartOf, hasSample, hasTreatment, hasSource, hasSourceTissue, hasModel, hasTissue, hasCellline, hasGOId, hasMESHId, subclass
Source: Maurice Manning
Metadata Repository: High-level Architecture
• Iterative queries on metadata define items of interest
• Metadata and raw data are then aggregated to provide additional context for analysis
[Figure: architecture diagram. Recoverable labels: Query, Visualization, and Annotation Services over the Experimental Metadata Repository; data sources: Agilent Expression, aCGH, RNAi Database, Affy Expression, Illumina Expression, Screening, Mutation, SNP, TMA, Analysis Results]
Source: Maurice Manning
Metadata Repository: Implementation
• Protégé Ontology Editor
• Oracle Semantic Technologies 11g
• D2R Map (Database to RDF Mapping)
• C# development in Visual Studio 2005
• Current data sources include:
  • Expression Data: Affymetrix, Illumina, Agilent
  • aCGH Data
  • RNAi Screening Data
  • Reagent Data
  • Gene Ontology (GO)
  • Medical Subject Headings (MeSH)
• Currently ~30 million triples
Source: Maurice Manning
Metadata Repository: Conclusion
With the implementation of the Metadata Repository, it is now possible for users to ask questions such as:
• Get all the interactions for methylases that are involved in colon cancer. For all these genes, get the expression and aCGH values for all LSCDD colon cancer samples
• Find cell lines in which RNAi data has been generated using Dharmacon reagents
• Retrieve the antibodies that have been used to assess the AKT1 pathway activity in MCF7
• Find all the experiments that were done using my sample
• Find all samples which are grade III colorectal cancer. For these samples, retrieve the expression, mutation, and aCGH data
Source: Maurice Manning
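One of these questions, finding cell lines with RNAi data generated using Dharmacon reagents, can be expressed as a small graph pattern. The sketch below is a minimal pattern matcher in plain Python, not the repository's query layer. The property names `ex:hasReagent` and `ex:hasCellline` echo the ontology slide, but `ex:hasAssay` as used here, `ex:vendor`, and all the data are assumptions made for illustration.

```python
def match(patterns, triples, binding=None):
    """Yield variable bindings that satisfy every pattern (terms starting with '?' are variables)."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, candidate)


# Hypothetical experiment metadata.
TRIPLES = [
    ("exp:1", "ex:hasAssay", "assay:RNAi"),
    ("exp:1", "ex:hasReagent", "reagent:r1"),
    ("exp:1", "ex:hasCellline", "MCF7"),
    ("reagent:r1", "ex:vendor", "Dharmacon"),
    ("exp:2", "ex:hasAssay", "assay:RNAi"),
    ("exp:2", "ex:hasReagent", "reagent:r2"),
    ("exp:2", "ex:hasCellline", "CellLineB"),
    ("reagent:r2", "ex:vendor", "OtherVendor"),
]

QUERY = [
    ("?exp", "ex:hasAssay", "assay:RNAi"),
    ("?exp", "ex:hasReagent", "?reagent"),
    ("?reagent", "ex:vendor", "Dharmacon"),
    ("?exp", "ex:hasCellline", "?cell"),
]

cell_lines = sorted({b["?cell"] for b in match(QUERY, TRIPLES)})
print(cell_lines)
```

Each line of `QUERY` corresponds to one triple pattern in a SPARQL `WHERE` clause; shared variables such as `?exp` are what join the patterns together.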
Metadata Repository: Future Work
• Ability to ask more complicated scientific queries.
• Query results will be integrated with raw data in relational data sources to provide the user with a single platform for detailed analysis.
• With user input, the ontology will evolve to include additional entities and attributes as well as links to other public ontologies.
Source: Maurice Manning
External Collaborations
The Open Innovation Center
A non-profit organization led by Dr. Susie Stephens and focused on enabling pre-competitive collaborations across the pharmaceutical industry with the following goals:
• To increase health and well-being by enabling pharmaceutical companies to make better decisions during drug discovery and development
• To provide an independent non-profit center for knowledge gathering, representation, and mining
• To create an ecosystem of organizations that adopt the same data standards and terminology, thereby simplifying collaboration
• To reduce risk and minimize cost
• To bring together leading technologists to enable rapid sharing of knowledge and skills
• For the benefit of organizations around the world in biopharmaceuticals, healthcare, payers, information technology, and academia
Source: Susie Stephens
External Collaborations
Participation in W3C’s HCLS group

RDF Access to Relational Databases - Chris Bizer, Eric Prud'hommeaux
• Scalability testing of relational-to-RDF mapping approaches

End User Semantic Web Authoring - David Karger
• Enhancing the scalability and robustness of the Exhibit and Potluck tools (e.g., integrating the tools together, supporting more file types)

Scientist-Driven Semantic Integration of Knowledge in Alzheimer's Disease - Tim Clark, June Kinoshita
• Project to develop an integrated knowledge infrastructure for the neuro-medical research community, pairing rich digital semantic context with the ever-growing digital scientific content on the web

Provenance Collection and Management - Carole Goble, Beth Plale
• Project to develop a metadata taxonomy for global data at Lilly that enables the rapid integration of data and mining/analysis algorithms into dataflows that support clinical and discovery decisions
Conclusions
• Data integration needs (and issues) abound at Lilly!
• Eli Lilly and Company is seeing tangible benefits in multiple projects from semantic integration as a means of helping to solve this problem
• The trend has been to build “semantic warehouses” due to federation challenges
• Thus far, data volumes are low to moderate
• Areas needing alignment must be identified and aligned as necessary (both internally and externally)
• Still searching for the “best” methods for accessing semantic data holistically within the enterprise
• Provenance is a challenge but is required
• Tools are improving, but more are needed (especially in the area of visualization)
• Working to operationalize semantic processes
Acknowledgements
Rosalyn Adams-Smith, Amit Aggarwal, Rakhi Bhat, Phil Brooks, Steven Cao, Hans Constandt, William D Craun, Mahesh Kumar Guzuva Desikan, Ernst Dow, AnnCatherine Downing, Mark Farmen, Kevin Gao, Young Gong, David Greenen, Ranga Chandra Gudivada, Jacob Koehler, Srinivasulu Kota, Michael Lajiness, Maurice Manning, Michael Martin, Mamatha Naik, Laura Nisenbaum, Pavel Pilar, James E Scherschel, Sean Spillane, Susie Stephens, Jeffrey Sutherland, Dirk Tomandl, Jason Wang, Bill Yan, Harold Yin, Yijing Zhou