Using ontologies to make sense of unstructured medical data
Nigam Shah, MBBS, PhD [email protected]

NCBO: Key activities
• We create and maintain a library of biomedical ontologies.
• We build tools and Web services to enable the use of ontologies and their derivatives.
• We collaborate with scientific communities that develop and use ontologies.

Ontology Services
• Download
• Traverse
• Search
• Comment
Mapping Services
• Create
• Download
• Upload
Widgets
• Tree-view
• Auto-complete
• Graph-view
Views
Annotation
• Term recognition
Data Access
• Fetch “data” annotated with a given term
http://rest.bioontology.org
http://bioportal.bioontology.org

Annotation service
Process textual metadata to automatically tag text with as many ontology terms as possible.
90 million calls, ~700 GB of data

Resource index
• Pubmed Abstracts
• Adverse Events (AERS)
• GEO
• Clinical Trials
• Drug Bank
Won 1st prize at the 2010 Semantic Web Challenge @ ISWC

Creating Lexicons
[Figure: sentences in clinical notes 1 to m pass through a frequency counter. Each term (Term-1 to Term-n) is listed with its ID, frequency (tf, df) and syntactic types (NN, JJ, VP). Values shown: 150,879; 90,000; 0.90; 0.05; 0.03.]

Annotation Analytics
Analyzing tagged data for hypothesis generation in bioinformatics

Generic GO based analysis routine
• Get annotations for each gene in a set
• Count the occurrence of each annotation term in the study set
• Count the occurrence of that term in some reference set (whole genome?)
• P-value for how surprising their overlap is.
[Figure labels: Genome, Study Set, Reference set]

Annotation Analytics Landscape
[Figure labels: SNOMED-CT, NCIT, ICD-9, MeSH, ?, Genes2MSH, Drugs, Chemicals, Cell Type, Human Disease, Gene Ontology, GOPubMed, Gene Sets, Patient Sets, Paper Sets, Grant Sets, Drug Sets, Health Indicator Warehouse datasets]

Open questions
1. Can we use something other than the GO?
2. Lack of annotations: even today, roughly 20% of genes lack any GO annotation.
3. Annotation bias: annotation with certain ontology terms is not independent of each other.
4. Lack of a systematic mechanism to define a level of abstraction.

Profiling a set of Aging genes
261 Age-related genes
[Figure labels: Genome, Disease Ontology, ~30% of genome]

Using ontologies other than GO
ERCC6    nucleoplasm
PARP1    protein N-terminus binding
ERCC6    <disease term?>
PARP1    <disease term?>

Enrichment Analysis with the DO
GO annotations (http://www.geneontology.org/GO.downloads.annotations.shtml):
Gene     GO term       Source
ERCC6    GO:0005654    PMID:16107709
ERCC6    GO:0008094    PMID:16107709
PARP1    GO:0047485    PMID:16107709
ERCC6    GO:0005730    PMID:16107709
PARP1    GO:0003950    PMID:16107709
PMID:16107709 (www.ncbi.nlm.nih.gov/pubmed/16107709): {ERCC6, PARP1}
NCBO Annotator (http://bioportal.bioontology.org): {ERCC6, PARP1} {Cockayne syndrome, DNA damage}

Annotation Analytics on EMR data
Analysis of tagged data from electronic health records

Profiling patient sets
ICD9 789.00 (Abdominal pain, unspecified site)
86k patient Reports
Patient records processed from U. Pittsburgh NLP Repository with IRB approval.

Annotation (Clinical Text): Generation of tagged data
[Figure: Annotation Workflow. Components shown: text of the clinical note; BioPortal – knowledge graph; creating clean lexicons (Term-1 to Term-n with frequency and syntactic types; Diseases, Procedures, Drugs); term recognition tool (NCBO Annotator); NegEx Patterns and NegEx Rules for negation detection; terms recognized per patient P1, P2, P3 ... Pn alongside ICD9 codes (for example "T1, T2, no T4", "T5, T4, T3", "T4, T3, T1", "T8, T9, T4", "T6, T8, T10"); Cohort of Interest; Further Analysis.]
Terms form a temporal series of tags

Detecting the Vioxx Risk Signal
Vioxx Patients (1,560)
Vioxx → MI (339)
ROR of 2.058, CI of [1.804, 2.349]. The χ² statistic has p-value < 10^-7.
ROR = 1.524, CI = [0.872, 2.666]. χ² p-value = 0.06816.
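
As a worked illustration of the ROR figures on the Vioxx slide, here is a minimal sketch of a reporting odds ratio with a Wald 95% confidence interval. The 2x2 table layout is an assumption, not something the slides state: it treats the RA patients (14,079) as the whole cohort and the Vioxx patients (1,560) and MI patients (1,827) as subsets that overlap in 339 patients. The function name and the use of a Wald interval are likewise illustrative choices.

```python
import math

def reporting_odds_ratio(a, b, c, d, z=1.96):
    """Return (ROR, lower, upper) for a 2x2 table.

    a: exposed, with outcome      b: exposed, without outcome
    c: unexposed, with outcome    d: unexposed, without outcome
    The interval is a Wald interval on the log odds ratio (an assumption;
    the slides do not say how the CI was computed).
    """
    ror = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ror) - z * se)
    upper = math.exp(math.log(ror) + z * se)
    return ror, lower, upper

# Counts taken from the slides; how they combine is assumed.
cohort = 14079        # RA patients, assumed to be the full cohort
vioxx = 1560          # patients with a Vioxx mention
mi = 1827             # patients with an MI mention
vioxx_then_mi = 339   # patients with Vioxx followed by MI

a = vioxx_then_mi                 # Vioxx and MI
b = vioxx - vioxx_then_mi         # Vioxx, no MI
c = mi - vioxx_then_mi            # MI, no Vioxx
d = cohort - vioxx - c            # neither

ror, lower, upper = reporting_odds_ratio(a, b, c, d)
print(f"ROR = {ror:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Under this assumed layout the result comes out close to the ROR of 2.058 and CI of [1.804, 2.349] quoted on the slide, which suggests, but does not confirm, that the counts were combined this way.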
Cohort counts (Vioxx risk signal):
MI Patients (1,827)
RA Patients (14,079)
p-value < 1.3 × 10^-24

Detecting Adverse Events
Linear Space Features
• Drug frequency
• Disease frequency
• Observed drug-first fraction
• Drug-first fraction z-score (fixed drug)
• Drug-first fraction z-score (fixed disease)
Logarithmic Space Features
• Drug frequency
• Disease frequency
• Observed co-mention count
• Co-mention count z-score (fixed drug)
• Co-mention count z-score (fixed disease)

Detecting Off-label use

Annotation Analytics Landscape: What questions can we ask?
[Figure labels: SNOMED-CT, NCIT, ICD-9, MeSH, Genes2MSH, Drugs, Chemicals, Cell Type, Human Disease, Gene Ontology, GOPubMed, Aging, EMRs, Gene Sets, Patient Sets, Paper Sets, Grant Sets, Drug Sets, Health Indicator Warehouse datasets]

Associations and outcomes: What questions can we ask?
[Figure labels: entity types Gene, Disease, Drug, Device, Procedure, Environment; association types Enrichment, Off-label, Indications, Side effects]

Acknowledgements
• Paea LePendu
• Yi Liu
• Srinivasan Iyer
• Steve Racunas
• Anna Bauer-Mehren
• Clement Jonquet
• Rong Xu
• Mark Musen
• NIH – NCBO funding
• Mayo Team
• Hongfang Liu
• Stephen Wu
• Sylvia Holland
• Alex Skrenchuk

Mining Annotations of Grants, Publications
• Grants from 1972 to 2007
• 30 funding agencies
• Publications from Medline
• Only “Journal articles”
Sponsorship and Allocation: Who funds what
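
The four steps of the generic GO based analysis routine (get annotations, count each term in the study set, count it in the reference set, compute a p-value for the overlap) can be sketched as follows. This is only an illustration under stated assumptions: the slides do not name the statistical test, so a one-sided hypergeometric test is used here as a common choice, the gene and term names are toy data, and `hypergeom_sf` and `enrichment` are hypothetical helper names. No multiple-testing correction is applied.

```python
from collections import Counter
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the term (one-sided hypergeometric test)."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

def enrichment(study_set, reference_annotations):
    """reference_annotations maps every gene in the reference set to its
    set of annotation terms. Returns (term, p-value) pairs, smallest first."""
    N = len(reference_annotations)
    # Step 1: keep only study genes that have annotations.
    study = [g for g in study_set if g in reference_annotations]
    n = len(study)
    # Step 2: count each term in the study set.
    study_counts = Counter(t for g in study for t in reference_annotations[g])
    # Step 3: count each term in the reference set.
    ref_counts = Counter(t for terms in reference_annotations.values()
                         for t in terms)
    # Step 4: p-value for how surprising the overlap is.
    pvalues = {term: hypergeom_sf(k, N, ref_counts[term], n)
               for term, k in study_counts.items()}
    return sorted(pvalues.items(), key=lambda item: item[1])

# Toy reference set of 10 hypothetical genes (not real annotations).
reference = {f"g{i}": {"transport"} for i in range(4, 11)}
reference.update({
    "g1": {"DNA repair"},
    "g2": {"DNA repair"},
    "g3": {"DNA repair", "transport"},
})

results = enrichment({"g1", "g2", "g3"}, reference)
for term, p in results:
    print(f"{term}: p = {p:.4f}")
```

The same routine applies unchanged to other ontologies and other kinds of sets (patients, papers, grants, drugs): only the mapping from set members to annotation terms changes.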