Use of Semantic Technologies at Eli Lilly and Company J Phil Brooks Information Consultant, SE Data Team Discover IT Eli Lilly and Company.


Use of Semantic Technologies at Eli Lilly and Company

J Phil Brooks Information Consultant, SE Data Team Discover IT Eli Lilly and Company

Agenda

• Project Overviews
  • Discovery Metadata
  • Integrative Informatics
    • POC 1
    • POC 4
  • Metadata Repository
• External Collaborations
• Conclusions
• Acknowledgements

5/6/2020 Copyright© 2008 Eli Lilly and Company Semantic Technologies (from an old DBA’s perspective) 2


Project Overview: Discovery Metadata

Discovery Metadata: Goals

Integrate Master Data throughout the pharmaceutical discovery process to enable information sharing and integration for the scientific community

• Model key relationships between Master Data classes
• Provide the ability to integrate disparate data sets more quickly than the normal warehouse paradigm typically allows
• Create a reusable and sustainable semantic implementation
• Allow for user-driven, manual curation of key data relationships
• Develop core competencies in Semantic Web technologies within Eli Lilly
• Position the Semantic Web within Eli Lilly
  • Strengths
  • Weaknesses
  • When to use?


Discovery Metadata: Ontology

[Figure: Discovery Metadata ontology diagram. Source labels shown: SAP, REFDB, NCBI, Legacy, GSM, Manual Curation]

Discovery Metadata: Architecture

[Figure: layered architecture diagram]

• Apps: Application 1, Application 2, Application 3
• SOA: SOA Layer/Enterprise Service Bus (web services, visualizers, data access components)
• Access labels shown between the SOA and data layers: Authentication, SQL, SPARQL
• Data: Source Model 1, Source Model 2, Source Model 3, Source Model 4, …, Local Assertions, Top-Level Ontology, Provenance
• ETL inputs shown: RDBMS, spreadsheets, other tools

Discovery Metadata: Implementation

• Oracle Semantic Technologies 11g
• TopBraid Composer, Maestro Edition v2.6.2
• Multiple Oracle models segregated by source
  • Top-Level Ontology
  • Enterprise data sources (3)
  • External data sources (NCBI)
  • Custom/Local assertions (2)
• ~4.4M triples
  • Loaded triples: 2.1M
  • Inferred triples: 2.3M
• Custom-developed browser
• Metadata-driven web service providing cross-application access to master data
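The distinction between loaded and inferred triples, and the idea of keeping one model per source, can be illustrated with a small sketch. This is not Oracle's API: it assumes a toy in-memory store, hypothetical model names and `ex:` identifiers, and a single RDFS-style rule set (subclass transitivity plus type propagation).

```python
from collections import defaultdict

# Hypothetical models, one per source, mirroring "segregated by source".
# All ex: identifiers are illustrative and do not come from the slides.
models = defaultdict(set)
models["top_level_ontology"].update({
    ("ex:Kinase", "rdfs:subClassOf", "ex:Protein"),
    ("ex:Protein", "rdfs:subClassOf", "ex:MasterData"),
})
models["ncbi"].add(("ex:AKT1", "rdf:type", "ex:Kinase"))


def infer(loaded):
    """Return triples entailed by subclass transitivity and type propagation."""
    closure = set(loaded)
    changed = True
    while changed:
        changed = False
        sub = [(s, o) for s, p, o in closure if p == "rdfs:subClassOf"]
        new = set()
        for s, p, o in closure:
            if p not in ("rdfs:subClassOf", "rdf:type"):
                continue
            for child, parent in sub:
                if o == child:
                    new.add((s, p, parent))
        if not new <= closure:
            closure |= new
            changed = True
    # Only report what was not asserted directly.
    return closure - set(loaded)


loaded = set().union(*models.values())
inferred = infer(loaded)
print(len(loaded), "loaded,", len(inferred), "inferred")
```

Keeping asserted and entailed triples in separate sets, as here, makes it straightforward to report the two counts independently and to recompute the entailments when a single source model is reloaded.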

Discovery Metadata: Future Work

• Implement provenance at the instance level
• Integrate additional data sources (MeSH, Gene Ontology, KEGG, internal data sources)
• Operationalize load processes
• Finalize visualization standards
• Performance reviews (scalability)


Project Overview: Integrative Informatics

Integrative Informatics: Overview

• The focus of Integrative Informatics is to facilitate data integration between the discovery and medical components within Eli Lilly
• Their methodology is to execute proof-of-concept (POC) projects to identify, construct, and test various solutions to the integration problem
• Efforts:
  • POC1: CATIE project
  • POC4: Endocrine PI Competitive Intelligence
  • Generic Browser efforts

Integrative Informatics: POC1 - CATIE Semantic Integration

What is the CATIE study?

• Clinical Antipsychotic Trials of Intervention Effectiveness
• Was the most comprehensive independent trial ever completed to examine existing anti-psychotic therapies for schizophrenia
• Provides detailed information comparing the effectiveness and side effects of five medications currently used to treat schizophrenia
  • Olanzapine
  • Quetiapine
  • Risperidone
  • Ziprasidone
  • Perphenazine
• Greatly enhances the knowledge available to guide treatment choices for people with schizophrenia

Source: Chandra Ranga Gudivada

Integrative Informatics: POC1 Goals

• Determine whether semantic integration and analysis of the CATIE data set, in the context of metabolic and signal transduction pathways with receptor affinities, can provide answers to specific scientific questions:
  • Which pathways are associated with response to the 5 different schizophrenia drugs?
  • How do these pathways compare between treatment arms?
  • Which receptors are associated with response to the 5 schizophrenia drugs?
  • How are the pathways, receptors and the drug response genes from the CATIE data set related?

Source: Chandra Ranga Gudivada

Integrative Informatics: POC1 Data Aggregation

• CATIE Drugs:
  • Olanzapine
  • Perphenazine
  • Quetiapine
  • Risperidone
  • Ziprasidone
• Datasets:
  • Entrez Gene
  • PubChem (for CATIE drugs)
  • Assay (receptor affinity data for CATIE)
  • KEGG
  • Reactome
  • BioCyc
  • TRANSPATH

Source: Chandra Ranga Gudivada

Integrative Informatics: POC1 Architecture

[Figure: data flow diagram]

• Input: data in multiple formats (flat file, tab-delimited, XML)
• RDF conversion using the Jena programming API
• Top-Level Ontology
• Storage: Oracle 11g RDF store and AllegroGraph native RDF triple store
• Perform SPARQL querying (shown for both stores)

Source: Chandra Ranga Gudivada
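The conversion step can be sketched in a few lines. The POC used the Jena API (Java); the version below swaps in plain Python with only the standard library to show the idea of turning tab-delimited rows into N-Triples statements. The `http://example.org/` namespace, the column names, and the sample rows are all assumptions made for illustration.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace, not from the slides


def rows_to_ntriples(tsv_text, subject_column):
    """Turn each non-empty cell of a tab-delimited table into one N-Triples line."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = []
    for row in reader:
        # Assumes key values are already safe to embed in an IRI.
        subject = f"<{BASE}{row[subject_column]}>"
        for column, value in row.items():
            if column == subject_column or not value:
                continue
            # Escape backslashes first, then double quotes, for the literal.
            literal = value.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{subject} <{BASE}{column}> "{literal}" .')
    return lines


# Illustrative rows only; the values are placeholders, not CATIE data.
sample = "gene_id\tsymbol\tpathway\nG1\tGENE_A\tpathway_x\nG2\tGENE_B\t\n"
triples = rows_to_ntriples(sample, "gene_id")
print("\n".join(triples))
```

Because every source is reduced to the same subject-predicate-object shape, the output of a converter like this can be bulk-loaded into either triple store without a source-specific schema.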

Integrative Informatics: POC1 Conclusions

• Efficient semantic integration can be accomplished by using RDF
• Powerful complex data modeling can be achieved by using graph principles inherent in RDF
• Easy translation of scientific questions to graph queries can be accomplished using SPARQL and SEM_MATCH
• Customized outputs can easily be generated by making slight changes in the SPARQL query pattern

Source: Chandra Ranga Gudivada

Integrative Informatics: POC4 - Endocrine PI Competitive Intelligence

Competitive Intelligence (CI) is the purposeful, ethical, and coordinated monitoring of competitors within a specific marketplace in order to:
• Strategically gain foreknowledge of competitors' plans and recent developments
• Make calculated, informed business decisions and formulate operational strategy

The purpose of the Endocrine Public Information (PI) project is to provide a mechanism for actively surveying public information for competitive intelligence in the Endocrine area.

Source: Chandra Ranga Gudivada

Integrative Informatics: POC4 Goals

• Does such a competitive intelligence effort significantly benefit from a semantic component?

• Does the Endocrine PI project significantly benefit from semantic integration?

• Are there pre-existing ontologies for Company and method of action (MOA) domains?

• Do natural language processing (NLP) or text mining methods work for this kind of data?

• Does “buried” knowledge exist within the datasets that can be discovered using inference and reasoning?

Source: Chandra Ranga Gudivada

Integrative Informatics: POC4 Integration Challenges

Syntactic Variations

• Company:
  • Merck & Co
  • Merck & Co Inc
  • Merck
  • Merck & Co Ltd
• MOA:
  • Alpha-glucosidase inhibitor / Glucosidase inhibitor alpha
  • IGF binding protein-3 stimulator / IGF binding protein stimulator-3

Parent-Child Relations

• Company:
  • Amgen Boulder Inc
  • Applied Molecular Genetics Inc
  • Synergen Inc
  • Amgen

Semantic Variations

• MOA:
  • Serotonin 2A receptor antagonists / 5-HT 2 receptor antagonist / 5-HT2a antagonist
  • Peroxisome proliferator-activated receptor delta antagonist / PPAR delta antagonist
  • Melanin concentrating hormone receptor 1 antagonists / MCH receptor-1 antagonist

Source: Chandra Ranga Gudivada

Integrative Informatics: POC4 NLP and Semantic Integration

Raw Endocrine Data → Terms from Thomson Pharma

• Bayer Corp → Bayer Corp
• Dopamide receptor agonist → Dopamide receptor agonist
• SGLT inhibitor → SGLT inhibitor
• Eli Lilly → Eli Lilly & Co Ltd
• STAT transcription factor stimulant → STAT stimulator
• Alpha-glucosidase inhibitors → Glucosidase inhibitor-alpha
• Peroxisome proliferator-activated receptor delta antagonist → PPAR delta antagonist
• 5 Hydroxytryptamine 2C agonist → 5HT 1c agonist
• Opioid kappa receptor antagonists → Kappa opioid antagonist
• Serotonin 1B receptor agonists → 5-HT 1d beta agonist

NLP Methods Used:
• Semantic Normalization
• Fuzzy Distance
• Ignoring Stop Words
• Regular Expressions
• Tokenization
• Rule-based Mapping

Source: Chandra Ranga Gudivada
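Several of the listed methods (tokenization, regular expressions, stop-word removal, rule-based mapping, fuzzy distance) can be combined in a short sketch. The stop-word list, the rewrite rules, and the 0.8 threshold below are assumptions chosen for illustration, not the rules used in the POC. The sketch also does not attempt semantic normalization: a mapping such as "5 Hydroxytryptamine 2C agonist" to "5HT 1c agonist" would need a curated synonym table.

```python
import re
from difflib import SequenceMatcher

STOP_WORDS = {"receptor", "transcription", "factor"}  # hypothetical list
RULES = {  # hypothetical rule-based rewrites, applied per token
    "stimulant": "stimulator",
    "inhibitors": "inhibitor",
    "antagonists": "antagonist",
    "agonists": "agonist",
}


def normalize(term):
    """Lowercase, tokenize, apply rules, drop stop words, sort the tokens."""
    tokens = re.findall(r"[a-z0-9]+", term.lower())
    tokens = [RULES.get(t, t) for t in tokens]
    return " ".join(sorted(t for t in tokens if t not in STOP_WORDS))


def best_match(raw, vocabulary, threshold=0.8):
    """Return the closest vocabulary term, or None if below the threshold."""
    key = normalize(raw)
    scored = [
        (SequenceMatcher(None, key, normalize(v)).ratio(), v) for v in vocabulary
    ]
    score, term = max(scored)
    return term if score >= threshold else None


vocabulary = [
    "STAT stimulator",
    "Glucosidase inhibitor-alpha",
    "Kappa opioid antagonist",
]
print(best_match("Alpha-glucosidase inhibitors", vocabulary))
print(best_match("Opioid kappa receptor antagonists", vocabulary))
print(best_match("STAT transcription factor stimulant", vocabulary))
```

Sorting the tokens makes word-order variants compare as identical strings, so the fuzzy score only has to absorb residual spelling differences.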

Integrative Informatics: POC4 Knowledge Representation

[Figure: RDF graph linking an MOA to a drug, a disease, and a company]

• Edge labels: rdf:type, hasSubClass, hasDrug, hasStatus, hasTherapeuticArea, alternativeLabel
• Class nodes: MOA, Drug, Disease, Company
• MOA labels: Melanin-concentrating hormone receptor antagonists; Melanin concentrating hormone receptor 1 antagonists; MCH 1 antagonists; GPR-24 antagonist; MCH receptor-1 antagonist
• Other nodes: Phase 2, Blank Node, Obesity
• Company labels: Amgen; Applied Molecular Genetics Inc; Amgen Boulder Inc; Abgenix; Synergen Inc; Avidia Inc

Source: Chandra Ranga Gudivada

Integrative Informatics: POC4 Inferencing

Given the company name "Applied Molecular Genetics Inc", get the MOAs that this company is working on:

    PREFIX TLO:
    PREFIX xsd:
    SELECT DISTINCT ?Endo_MOA ?Company_All_Labels
    WHERE {
      ?Company_Res TLO:SynonymousLabels "Applied Molecular Genetics Inc"^^xsd:string .
      ?Company_Res TLO:preferredLabel ?Company_Pref_Label .
      ?Company_Res TLO:SynonymousLabels ?Company_All_Labels .
      ?Drug_Info TLO:hasAssociatedCompany ?Company_All_Labels .
      ?Drug_Info TLO:hasMOA ?Endo_MOA .
    }

• Case 1, without semantic integration and inference: 0 results
• Case 2, with semantic integration and inference: 18 results, including the company labels Amgen and Amgen Inc and the MOAs Leptin stimulator, Agouti related protein inhibitor, Neuropeptide Y antagonist, and Melanocortin MC4 antagonist

Source: Chandra Ranga Gudivada
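The difference between the two cases can be reproduced on toy data. The sketch below is a plain label-expansion join over an in-memory set of triples, not a reasoner and not the POC's implementation. The property names follow the query on the slide; the resource identifiers (`company:1`, `drug:1`, `drug:2`) and the pairing of drugs with MOAs are hypothetical.

```python
# Toy triples: property names follow the slide's query, everything else
# (identifiers and drug-to-MOA pairings) is illustrative only.
TRIPLES = {
    ("company:1", "preferredLabel", "Amgen"),
    ("company:1", "SynonymousLabels", "Amgen"),
    ("company:1", "SynonymousLabels", "Amgen Inc"),
    ("company:1", "SynonymousLabels", "Applied Molecular Genetics Inc"),
    ("drug:1", "hasAssociatedCompany", "Amgen"),
    ("drug:1", "hasMOA", "Leptin stimulator"),
    ("drug:2", "hasAssociatedCompany", "Amgen Inc"),
    ("drug:2", "hasMOA", "Neuropeptide Y antagonist"),
}


def objects(subject, predicate):
    """All objects of triples matching (subject, predicate, ?)."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def subjects(predicate, obj):
    """All subjects of triples matching (?, predicate, obj)."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def moas_for(company_name, use_synonyms):
    """Return (MOA, company label) pairs for drugs tied to the company."""
    if use_synonyms:
        labels = set()
        for company in subjects("SynonymousLabels", company_name):
            labels |= objects(company, "SynonymousLabels")
    else:
        labels = {company_name}
    results = set()
    for label in labels:
        for drug in subjects("hasAssociatedCompany", label):
            for moa in objects(drug, "hasMOA"):
                results.add((moa, label))
    return results


name = "Applied Molecular Genetics Inc"
print(len(moas_for(name, use_synonyms=False)))
print(sorted(moas_for(name, use_synonyms=True)))
```

Without the synonym hop, the search string never matches a label attached to a drug record, which is why the first case comes back empty.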

Integrative Informatics: POC4 Conclusions

• Semantic integration (instance mapping using NLP) coupled with the RDF data model was successful in answering questions in Competitive Intelligence
• Ontologies provide a powerful framework of dictionaries and taxonomic relations that supports reasoning and inference over the data for knowledge discovery
• Manual curation is a tedious, error-prone, and labor-intensive task
• A semi-automated, intelligent computer-based solution that utilizes ontologies, semantic integration, and NLP could drastically reduce the manual curation effort while maintaining high-quality information

Source: Chandra Ranga Gudivada


Project Overview: Metadata Repository

Metadata Repository: Goals

Aggregate experiment metadata from a diverse set of LSCDD relational databases into an Oracle Semantic Technologies repository for LSCDD scientific investigation

• Provide a unified vocabulary for LSCDD scientific investigation
• Avoid a complex architecture and extended development effort
• Realize benefits in the near-term
• Preprocess metadata to improve efficiency
• Characterize the type of questions that ontology should answer
• Identify stable semantic technologies, do not employ parsers
• Allow semantic and relational databases to work together
• Provide browser, visualization, and query access into repository

Source: Maurice Manning

Metadata Repository: Ontology

[Figure: Metadata Repository ontology diagram]

• Classes shown: Project, Study, Experiment, Assay, DiseaseState, Plate, Plate Well, Protocol, Chip, Chip Type, Probe, Compound, Reagent, DNA Reagent, RNA Reagent, Protein Reagent, Software, Hardware, Gene, GeneList, Treatment, ViralBatch, ClinicalData, Sample, Tissue, Model, CellLine, MESH, GO
• Properties shown: hasProject, hasStudy, hasDiseaseState, hasPlate, hasAssay, hasChip, hasChipType, hasProtocol, hasCompound, hasReagent, hasGene, IsPartOf, hasSample, hasTreatment, hasSource, hasSourceTissue, hasModel, hasTissue, hasCellline, hasGOId, hasMESHId, subclass

Source: Maurice Manning

Metadata Repository: High-level Architecture

• Iterative queries on metadata define items of interest
• Metadata and raw data are then aggregated to provide additional context for analysis

[Figure: high-level architecture diagram]

• Access components: Query, Visualization, Annotation Services
• Central store: Experimental Metadata Repository
• Data sources: Agilent Expression, aCGH, RNAi Database, Affy Expression, Illumina Expression, Screening, Mutation, SNP, TMA, Analysis Results

Source: Maurice Manning

Metadata Repository: Implementation

• Protégé Ontology Editor
• Oracle Semantic Technologies 11g
• D2R Map (Database to RDF Mapping)
• C# development in Visual Studio 2005
• Current data sources include:
  • Expression Data: Affymetrix, Illumina, Agilent
  • aCGH Data
  • RNAi Screening Data
  • Reagent Data
  • Gene Ontology (GO)
  • Medical Subject Headings (MeSH)
• Currently ~30 million triples

Source: Maurice Manning
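The database-to-RDF step can be pictured with a small sketch. This is not D2R Map's own mapping syntax and not the project's C# code: it is a Python illustration using `sqlite3`, in which the table layout, the `MAPPING` dictionary, and the `http://example.org/lscdd/` namespace are all assumptions. The class and property names (Experiment, hasAssay, hasSample) are borrowed from the ontology slide.

```python
import sqlite3

BASE = "http://example.org/lscdd/"  # hypothetical namespace

# Hypothetical mapping: table -> (class name, key column, {column: property}).
MAPPING = {
    "experiment": ("Experiment", "id", {"assay": "hasAssay", "sample": "hasSample"}),
}


def table_to_triples(conn, table):
    """Emit (subject, predicate, object) tuples for every row of a mapped table."""
    class_name, key, columns = MAPPING[table]
    selected = ", ".join([key, *columns])
    triples = []
    for row in conn.execute(f"SELECT {selected} FROM {table} ORDER BY {key}"):
        subject = f"{BASE}{table}/{row[0]}"
        triples.append((subject, "rdf:type", BASE + class_name))
        for prop, value in zip(columns.values(), row[1:]):
            if value is not None:  # NULL columns produce no triple
                triples.append((subject, BASE + prop, str(value)))
    return triples


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiment (id INTEGER PRIMARY KEY, assay TEXT, sample TEXT)")
conn.execute("INSERT INTO experiment VALUES (1, 'assay_a', 'sample_x')")
conn.execute("INSERT INTO experiment VALUES (2, 'assay_b', NULL)")
triples = table_to_triples(conn, "experiment")
for triple in triples:
    print(triple)
```

Driving the conversion from a declarative mapping keeps the relational sources untouched, which fits the stated goal of letting semantic and relational databases work together.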

Metadata Repository: Conclusion

With the implementation of the Metadata Repository, it is now possible for users to ask questions such as:
• Get all the interactions for methylases that are involved in colon cancer. For all these genes, get the expression and aCGH values for all LSCDD colon cancer samples
• Find cell lines in which RNAi data has been generated using Dharmacon reagents
• Retrieve the antibodies that have been used to assess the AKT1 pathway activity in MCF7
• Find all the experiments that were done using my sample
• Find all samples which are grade III colorectal cancer. For these samples, retrieve the expression, mutation and aCGH data

Source: Maurice Manning

Metadata Repository: Future Work

• Ability to ask more complicated scientific queries.
• Query results will be integrated with raw data in relational data sources to provide the user with a single platform for detailed analysis.
• With user input, the ontology will evolve to include additional entities and attributes as well as links to other public ontologies.

Source: Maurice Manning


External Collaborations

The Open Innovation Center

A non-profit organization led by Dr. Susie Stephens and focused on enabling pre-competitive collaborations across the pharmaceutical industry with the following goals:
• To increase health and well-being by enabling pharmaceutical companies to make better decisions during drug discovery and development
• To provide an independent non-profit center for knowledge gathering, representation and mining
• To create an ecosystem of organizations that adopt the same data standards and terminology, thereby simplifying collaboration
• To reduce risk and minimize cost
• To bring together leading technologists to enable rapid sharing of knowledge and skills
• For the benefit of organizations around the world in biopharmaceuticals, healthcare, payers, information technology, and academia

Source: Susie Stephens

External Collaborations

Participation in W3C’s HCLS group

RDF Access to Relational Databases - Chris Bizer, Eric Prud'hommeaux
• Scalability testing of relational to RDF mapping approaches

End User Semantic Web Authoring - David Karger
• Enhancing the scalability and robustness of the Exhibit and Potluck tools (i.e., integrating the tools together, supporting more file types, etc.)

Scientist-Driven Semantic Integration of Knowledge in Alzheimer's Disease - Tim Clark, June Kinoshita
• Project to develop an integrated knowledge infrastructure for the neuro-medical research community, pairing rich digital semantic context with the ever-growing digital scientific content on the web

Provenance Collection and Management - Carole Goble, Beth Plale
• Project to develop a metadata taxonomy for global data at Lilly which enables the rapid integration of data and mining/analysis algorithms into dataflows which support clinical and discovery decisions


Conclusions

Conclusions

• Data integration needs (and issues) abound at Lilly!

• Eli Lilly and Company is seeing tangible benefits in multiple projects from semantic integration as a means of helping to solve this problem
• The trend has been to build “semantic warehouses” due to federation challenges
• Thus far, data volumes are low to moderate
• Areas needing alignment must be identified and aligned as necessary (both internally and externally)
• Still searching for the “best” methods for accessing semantic data holistically within the enterprise
• Provenance is a challenge but is required
• Tools are improving, but more are needed (especially in the area of visualization)
• Working to operationalize semantic processes

Acknowledgements

Rosalyn Adams-Smith, Amit Aggarwal, Rakhi Bhat, Phil Brooks, Steven Cao, Hans Constandt, William D Craun, Mahesh Kumar Guzuva Desikan, Ernst Dow, AnnCatherine Downing, Mark Farmen, Kevin Gao, Young Gong, David Greenen, Ranga Chandra Gudivada, Jacob Koehler, Srinivasulu Kota, Michael Lajiness, Maurice Manning, Michael Martin, Mamatha Naik, Laura Nisenbaum, Pavel Pilar, James E Scherschel, Sean Spillane, Susie Stephens, Jeffrey Sutherland, Dirk Tomandl, Jason Wang, Bill Yan, Harold Yin, Yijing Zhou