Molecular Biology Databases

Research in the Verspoor Lab
Karin Verspoor, Ph.D.
Faculty, Computational Bioscience Program
University of Colorado School of Medicine
http://compbio.ucdenver.edu/Hunter_lab/Verspoor
Text Mining
• Information extraction from the biomedical
literature
– Entity recognition and normalization
– Relation and event extraction
• Last time, I promised that we would look at:
– Ontologies as constraints for information
extraction
Making BioNLP relevant
• Recognition of OBO terms, relations
• CRAFT corpus (first release later this year)
OpenDMAP extracts typed
relations from the literature
• Concept recognition tool
– Connect ontological terms to literature instances
– Built on Protégé knowledge representation system
• Language patterns associated with concepts and slots
– Patterns can contain text literals, other concepts, constraints
(conceptual or syntactic), ordering information, or outputs of
other processing.
– Linked to many text analysis engines via UIMA
• Best performance in BioCreative II IPS task
• >500,000 instances of three predicates (with arguments) extracted from Medline abstracts
[Hunter, et al., 2008] http://bionlp.sourceforge.net
OpenDMAP
[Diagram: free text, ontology, and patterns feed into OpenDMAP, which outputs extracted information]
OpenDMAP
Free text: Cyclin E2 interacts with Cdk2 in a functional kinase complex.
Ontology: <ontology>
Patterns: Protein protein interaction := [int1] interacts with [int2]
Extracted information:
protein protein interaction:
interactor1: cyclin E2
interactor2: cdk2
OpenDMAP
PROTÉGÉ ONTOLOGY
CLASS: protein protein interaction
SLOT: interactor1
TYPE: molecule
SLOT: interactor2
TYPE: molecule
PATTERNS
{c-interact} := [interactor1] interacts with [interactor2]
{c-interact} := [interactor1] is bound by [interactor2]
…
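A minimal sketch of how the two {c-interact} patterns above could fill the slots of one concept. This is not OpenDMAP itself: the patterns are approximated with Python regular expressions, and the ENTITY sub-pattern is a crude, hypothetical stand-in for real entity recognition and for the TYPE: molecule constraint.

```python
import re

# Hypothetical stand-in for a recognized molecule: capitalized tokens such as
# "Cyclin E2" or "Cdk2". A real system would check the slot type in the ontology.
ENTITY = r"[A-Z][\w-]*(?: [A-Z0-9][\w-]*)*"

# Both surface patterns map onto the same concept and the same two slots.
C_INTERACT_PATTERNS = [
    re.compile(rf"(?P<interactor1>{ENTITY}) interacts with (?P<interactor2>{ENTITY})"),
    re.compile(rf"(?P<interactor1>{ENTITY}) is bound by (?P<interactor2>{ENTITY})"),
]


def extract_interaction(sentence):
    """Return the filled slots for the first pattern that matches, else None."""
    for pattern in C_INTERACT_PATTERNS:
        match = pattern.search(sentence)
        if match:
            return {
                "concept": "protein protein interaction",
                "interactor1": match.group("interactor1"),
                "interactor2": match.group("interactor2"),
            }
    return None


result = extract_interaction(
    "Cyclin E2 interacts with Cdk2 in a functional kinase complex."
)
print(result)
```

Keeping the patterns in a list attached to the concept mirrors the slide: adding a new phrasing means adding a pattern, not a new concept.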
OpenDMAP
BioCreative II Example
• Some BioCreative patterns for interact
{c-interact} := [interactor1] {w-is} {w-interact-verb1} {w-preposition} the? [interactor2];
{w-is} := is, are, was, were;
{w-interact-verb1} := co-immunoprecipitate, co-immunoprecipitates, co-immunoprecipitated, co-localize, co-localizes, co-localized;
{w-preposition} := among, between, by, of, with, to;
• Matched text:
PMID 16494873, SENT_ID 16494873_114
Upon precipitation of the SOX10 protein with anti-HA antibody, Western blot
detection revealed expression of UBC9-V5 (25 kDa) in the sample (Fig. 1, line 6),
indicating that {UBC9 was co-immunoprecipitated with SOX10}.
INTERACTOR_1: UBC9 resolved to UniProtID: UBC9_RAT
INTERACTOR_2: SOX10 resolved to UniProtID: SOX10_RAT
{c-interact} := [UBC9_RAT]interactor_1, [SOX10_RAT]interactor_2
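The word classes in the pattern above can be read as named alternations. The sketch below expands them into a single Python regular expression and applies it to the matched sentence. It is an approximation under stated assumptions, not the OpenDMAP pattern engine: the PROTEIN sub-pattern is a hypothetical placeholder for entity recognition, and the resolution of names to UniProt identifiers is not attempted.

```python
import re

# Word classes as listed on the slide.
WORD_CLASSES = {
    "w-is": ["is", "are", "was", "were"],
    "w-interact-verb1": [
        "co-immunoprecipitate", "co-immunoprecipitates", "co-immunoprecipitated",
        "co-localize", "co-localizes", "co-localized",
    ],
    "w-preposition": ["among", "between", "by", "of", "with", "to"],
}


def alternation(name):
    """Turn a word class into a non-capturing regex alternation."""
    # Longest alternatives first, so "co-localizes" is tried before "co-localize".
    words = sorted(WORD_CLASSES[name], key=len, reverse=True)
    return "(?:" + "|".join(re.escape(word) for word in words) + ")"


# Hypothetical placeholder for a recognized protein mention.
PROTEIN = r"(?P<{slot}>[A-Z][A-Za-z0-9-]*)"

C_INTERACT = re.compile(
    PROTEIN.format(slot="interactor1")
    + r"\s+" + alternation("w-is")
    + r"\s+" + alternation("w-interact-verb1")
    + r"\s+" + alternation("w-preposition")
    + r"\s+(?:the\s+)?"  # corresponds to the optional "the?" in the pattern
    + PROTEIN.format(slot="interactor2")
)


def match_interaction(sentence):
    """Return the two slot fillers if the pattern matches, else None."""
    match = C_INTERACT.search(sentence)
    return match.groupdict() if match else None


sentence = (
    "Upon precipitation of the SOX10 protein with anti-HA antibody, Western blot "
    "detection revealed expression of UBC9-V5 (25 kDa) in the sample (Fig. 1, line 6), "
    "indicating that UBC9 was co-immunoprecipitated with SOX10."
)
slots = match_interaction(sentence)
print(slots)
```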
BioCreative Results
• 359 full-text articles in the test set
• 385 interaction assertions produced
• Performance averaged per article (to avoid
dominance of a few assertion-heavy articles)
P = 0.39, R = 0.31, F = 0.29
• Best result in the evaluation!
– F score 10% higher than next-scoring system
– F score > 3 standard deviations above mean
– Recall 20% higher than next-scoring system
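Because the scores are averaged per article, the reported F (0.29) can be lower than both the reported P (0.39) and R (0.31): F is computed inside each article and only then averaged. A minimal sketch of that averaging follows. The per-article counts are hypothetical and chosen only to make the effect visible; they are not BioCreative data.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and balanced F for one article's counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def macro_average(per_article_counts):
    """Average P, R, and F over articles, giving every article equal weight."""
    scores = [precision_recall_f(*counts) for counts in per_article_counts]
    n = len(scores)
    return tuple(sum(score[i] for score in scores) / n for i in range(3))


# Hypothetical (true positives, false positives, false negatives) per article.
articles = [
    (1, 0, 9),  # precise but low recall
    (1, 9, 0),  # high recall but imprecise
]
macro_p, macro_r, macro_f = macro_average(articles)
print(round(macro_p, 2), round(macro_r, 2), round(macro_f, 2))
```

In this toy case both averaged P and averaged R are 0.55, while the averaged F is about 0.18, because each individual article has an unbalanced P and R.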
BioCreative conclusions
• Information extraction in biomedical text is hard
– Linguistic variability in how concepts are expressed
– Complex concepts with multiple “slots”
• OpenDMAP advances the state of the art
– Use of an ontology grounds the search for information
– Flexibility of the pattern language to incorporate
constraints at different levels (conceptual, lexical, word
order, linguistic)
BioNLP’09: Methods
Protein_transport :=
[TRANSPORTED-ENTITY] translocation
@(from {DET}? [TRANSPORT-ORIGIN])
@(to {DET}? [TRANSPORT-DESTINATION])
Bax translocation to mitochondria from the cytosol
Bax translocation from the cytosol to the
mitochondria
Slide credit: Kevin B. Cohen
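The two example phrases suggest that the @(...) groups may appear in either order after the head. A minimal sketch of that behaviour, assuming the groups are also optional (an assumption, not something the slide states) and approximating {DET} with the single word "the":

```python
import re

DET = r"(?:the\s+)?"  # assumed stand-in for {DET}?
HEAD = re.compile(r"(?P<entity>\w+) translocation")
FROM_GROUP = re.compile(rf"\bfrom\s+{DET}(?P<origin>\w+)")
TO_GROUP = re.compile(rf"\bto\s+{DET}(?P<destination>\w+)")


def match_protein_transport(text):
    """Fill the Protein_transport slots; the from/to groups may come in any order."""
    head = HEAD.search(text)
    if head is None:
        return None
    rest = text[head.end():]
    frame = {"TRANSPORTED-ENTITY": head.group("entity")}
    # Each group is searched independently in the remaining text,
    # so their relative order does not matter.
    origin = FROM_GROUP.search(rest)
    destination = TO_GROUP.search(rest)
    if origin:
        frame["TRANSPORT-ORIGIN"] = origin.group("origin")
    if destination:
        frame["TRANSPORT-DESTINATION"] = destination.group("destination")
    return frame


first = match_protein_transport("Bax translocation to mitochondria from the cytosol")
second = match_protein_transport("Bax translocation from the cytosol to the mitochondria")
print(first == second)
```

One pattern therefore covers both word orders, instead of one pattern per ordering.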
BioNLP’09: Methods
Protein_transport :=
[TRANSPORTED-ENTITY] translocation
@(from {DET}? [TRANSPORT-ORIGIN])
@(to {DET}? [TRANSPORT-DESTINATION])
[Slide annotations on the pattern: Protein (Sequence Ontology); Cellular Component (Gene Ontology)]
Slide credit: Kevin B. Cohen
BioNLP’09: Methods
• All event types represented as frames
– Elements from ontology constrain every slot
EVENT TYPE: REGULATION
AtLoc: instance of biological_entity
Cause: instance of protein
CSite: instance of biological_concept or polypeptide_region
Event_action: instance of trigger_word or detection_method
Site: instance of biological_concept or polypeptide_region
Theme: instance of protein or biological_process
ToLoc: instance of biological_entity
[Ontologies shown on slide: Sequence Ontology, Cell Cycle Ontology, Gene Ontology, Molecular Interaction Ontology]
Slide credit: Kevin B. Cohen
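A minimal sketch of how the REGULATION frame can constrain its slot fillers. The slot table is copied from the slide. The small is-a hierarchy (PARENT) is hypothetical and exists only so the check has something to walk; it is not taken from any of the ontologies named above.

```python
# Allowed classes per slot, as listed for EVENT TYPE: REGULATION.
REGULATION_SLOTS = {
    "AtLoc": {"biological_entity"},
    "Cause": {"protein"},
    "CSite": {"biological_concept", "polypeptide_region"},
    "Event_action": {"trigger_word", "detection_method"},
    "Site": {"biological_concept", "polypeptide_region"},
    "Theme": {"protein", "biological_process"},
    "ToLoc": {"biological_entity"},
}

# Hypothetical is-a links, for illustration only.
PARENT = {
    "kinase": "protein",
    "nucleus": "biological_entity",
}


def ancestors(cls):
    """Yield the class itself and every class above it in the toy hierarchy."""
    while cls is not None:
        yield cls
        cls = PARENT.get(cls)


def accepts(slot, filler_class):
    """True if the filler's class, or one of its ancestors, is allowed in the slot."""
    allowed = REGULATION_SLOTS[slot]
    return any(cls in allowed for cls in ancestors(filler_class))


print(accepts("Cause", "kinase"), accepts("Cause", "nucleus"))
```

A candidate match whose filler fails this check would be rejected before any event is emitted, which is one way the ontology can trade a little recall for precision.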
BioNLP’09: Methods
[Diagram: partial view of the ontology; reality is a little bit less clean]
Biological Concept
– Gene or gene product (Sequence Ontology)
– Cellular Component (Gene Ontology)
Anatomical Part
– Cell (Cell Type Ontology)
– Tissue (BRENDA Tissue Ontology)
– Organ (Foundational Model of Anatomy)
Slide credit: Kevin B. Cohen
BioNLP’09: Methods
Ontology constraints on the Site, AtLoc, and ToLoc slots, by event type:
Event type: Binding
– Site: protein domain (SO), binding site (SO), DNA (SO), chromosome (SO)
Event type: Gene expression
– Site: gene (SO), biological entity (CCO)
– AtLoc: tissue (BTO), cell type (CTO), cellular component (GO)
Event type: Localization
– AtLoc: cellular component (GO)
– ToLoc: cellular component (GO)
Event type: Phosphorylation
– Site: amino acid (FMA), polypeptide region (SO)
Event type: Protein catabolism
– AtLoc: cellular component (GO)
Event type: Transcription
– Site: gene (SO), biological entity (CCO)
BTO: BRENDA Tissue Ontology
CCO: Cell Cycle Ontology
CTO: Cell Type Ontology
FMA: Foundational Model of Anatomy
GO: Gene Ontology
SO: Sequence Ontology
Slide credit: Kevin B. Cohen
BioNLP’09: Methods
• Manual pattern-writing
– Before availability of training data: based on native speaker
intuitions, examples from PubMed, and variations on same,
as in Cohen et al. (2004)
– After release of training data: based on examination of
corpus data, targeting high-frequency predicates only
– Nominalizations predominated; used insights from Cohen
et al. (2008) regarding Theme placement
– Protein binding rules re-used from BioCreative II protein-protein interaction task
– Eschewed use of wildcards
Slide credit: Kevin B. Cohen
BioNLP’09: Results
         Our system             Best team              Best P/R/F
         P      R      F        P      R      F        P      R      F
Task 1   71.81  13.45  22.66    58.48  46.73  51.95    71.81  46.73  51.95
Task 2   70.97  13.25  43.12*   54.08  35.86  43.12    70.97  35.86  43.12
Task 3   57.40  12.33  20.30    60.83  32.68  42.52    60.83  32.68  42.52
* As printed on the slide; P = 70.97 and R = 13.25 correspond to F ≈ 22.33.
Task 1: P 10 points higher than second-highest
Task 2: P 14 points higher than second-highest
Task 3: P 3.4 points lower than highest (3/6)
Slide credit: Kevin B. Cohen
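The F columns in the results are the balanced F-measure, the harmonic mean of P and R. A minimal sketch that recomputes F from the P and R values reported for Task 1 and Task 3 (the function name is illustrative):

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs as reported on the slide, in percent.
reported = {
    "Task 1, our system": (71.81, 13.45),
    "Task 1, best team": (58.48, 46.73),
    "Task 3, our system": (57.40, 12.33),
    "Task 3, best team": (60.83, 32.68),
}

for label, (p, r) in reported.items():
    print(f"{label}: F = {f_measure(p, r):.2f}")
```

The harmonic mean is dominated by the smaller of the two values, which is why a high-precision, low-recall system ends up with an F close to its recall.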
BioNLP’09: Results
Unofficial results: contribution of bug repairs
                   P      R      F
Official results   71.81  13.45  22.66
With bug fixes     67.19  17.38  27.62
Still the highest precision (#2 was 62.21)
Slide credit: Kevin B. Cohen
BioNLP’09: Results
• Contribution of coördination-handling
– Bug-fixed results: F 27.62 (Task 1)
– Without coördination-handling: F 24.72
– Decrease in F of 2.9 without coördination-handling
Slide credit: Kevin B. Cohen
Syntax helps
• 125I-labeled C3b was covalently deposited on CR2, when
hemolytically active 125I-labeled C3 was added to Raji cells
preincubated with iC3, factor B, properdin, and factor D,
thus proving functionality of CR2-bound C3 convertase.
<cr2> BINDS <c3 convertase>
• CD8alpha(alpha) binds one HLA-A2/peptide molecule,
interfacing with the alpha2 and alpha3 domains of HLA-A2 and
also contacting beta2-microglobulin.
<cd8alpha ( alpha )> BINDS <hla a2 / peptide molecule>
• The binding of 109Cd to metallothionein and the thiol density
of the protein were determined after incubation of a purified
Zn/Cd-metallothionein preparation with either hydrogen
peroxide alone, or with a number of free radical generating
systems.
<109cd> BINDS <metallothionein>
• Although these shifts in alpha3 may provide a synergistic
modulation of affinity, the binding of CD8 to MHC is clearly
consistent with an avidity-based contribution from CD8 to
TCR-peptide-MHC interactions.
<Cd8> BINDS <major histocompatibility complex>
More complex examples
• Complex noun phrases
– The inactive C3 (iC3), which forms spontaneously in serum in low
amounts by reaction of native C3 with H2O, binds noncovalently to
the N-terminal part of CR2.
<inactive c3> BINDS <cr2>
– RelB binds transcriptionally active kappaB motifs in the TNF-alpha
promoter in normal cells, and in vitro studies with macrophages
isolated from RelB-deficient animals revealed impaired production
of TNF-alpha in response to LPS and IFN-gamma.
<relb> BINDS <tnf - alpha promoter>
• Negation
– TNP-BSA, however, did not bind to the CD4 receptor.
<trinitrophenyl-bovine serum albumin> DOES_NOT_BIND <cd4 receptor>
– Similarly, when cells expressing the wild type FSHR were treated
with tunicamycin to prevent N-linked glycosylation, the resulting
nonglycosylated FSHR was not able to bind FSH.
<resulting nonglycosylated fsh receptor> DOES_NOT_BIND <follicle-stimulating hormone>
Coordination is particularly hard
In contrast both the S4GGnM-R and the Man-R are able to bind Man-BSA.
<mannose receptor> BINDS <man bsa>
<s4ggnm - r> BINDS <man bsa>
Purified recombinant NC1, like authentic NC1, also bound specifically
to fibronectin, collagen type I, and a laminin 5/6 complex.
<authentic nc1> BINDS <laminin 5 / 6 complex>
<authentic nc1> BINDS <collagen type I>
<authentic nc1> BINDS <fibronectin>
<purified recombinant nc1> BINDS <laminin 5 / 6 complex>
<purified recombinant nc1> BINDS <collagen type I>
<purified recombinant nc1> BINDS <fibronectin>
The nonvisual arrestins, beta-arrestin and arrestin3, but not visual
arrestin, bind specifically to a glutathione S-transferase-clathrin
terminal domain fusion protein. *
<Arrestin3> BINDS <glutathione s-transferase-clathrin terminal domain
fusion protein>
<beta arrestin> BINDS <glutathione s-transferase-clathrin terminal
domain fusion protein>
<nonvisual arrestin> BINDS <glutathione s-transferase-clathrin
terminal domain fusion protein>
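In the examples above, one coordinated sentence yields one binding assertion per subject/object pair. A minimal sketch of that expansion step, assuming the coordinated conjuncts have already been identified (finding them is the hard part and is not shown here):

```python
from itertools import product


def expand_coordination(subjects, predicate, objects):
    """Emit one (subject, predicate, object) assertion per pair of conjuncts."""
    return [(subj, predicate, obj) for subj, obj in product(subjects, objects)]


# Conjuncts taken from the NC1 example sentence.
assertions = expand_coordination(
    ["purified recombinant nc1", "authentic nc1"],
    "BINDS",
    ["fibronectin", "collagen type I", "laminin 5 / 6 complex"],
)
for subj, predicate, obj in assertions:
    print(f"<{subj}> {predicate} <{obj}>")
```

Two subjects and three objects give the six assertions listed for that sentence. The arrestin example shows why the conjunct list itself needs care: "but not visual arrestin" must be excluded before the expansion.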
BioNLP Shared Task ’11
• Extension of BioNLP’09 tasks
– Generalization to full text (from abstracts)
– Additional event types: post-translational
modifications and catalysis
• Methods:
– Based on empirically derived patterns
– Derived from training data + manual refinement
– Using dependency relations (syntax)
– Work of Haibin Liu (postdoc)
Integrating background knowledge
• Can improve OpenDMAP precision with
minimal cost to recall
– Take advantage of background knowledge
– Tighten constraints on slot fillers in the ontology
– No change to existing patterns
• Proof of concept:
– Distinguish among several types of protein
activation (enzyme and receptor) in GeneRIFs
– Utilize Gene Ontology annotations
Refining selectional restrictions
TP: [GeneRIF 104155]
an ER stress induces the activation of [caspase-12_protein catalytic activity]activated_entity via [caspase-3_protein]activator
Prevented FP: [GeneRIF 105594]
factor Xa can induce mesangial cell proliferation through the
activation of ERK_protein via PAR2_protein in mesangial cells
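A minimal sketch of tightening a slot constraint with background knowledge: a candidate activation event is kept only if the activated entity carries an annotation that fits the event type. Everything concrete here is an assumption for illustration: the annotation table is hypothetical (not real Gene Ontology annotation data), the protein names are placeholders, and the mapping from event type to required annotation is one plausible reading of the slide. The patterns that produce the candidates stay unchanged.

```python
# Hypothetical annotations; a real system would load Gene Ontology annotations.
ANNOTATIONS = {
    "protein_A": {"catalytic activity"},
    "protein_B": {"receptor activity"},
    "protein_C": set(),
}

# Assumed requirement per event type (illustrative labels).
REQUIRED_ANNOTATION = {
    "enzyme_activation": "catalytic activity",
    "receptor_activation": "receptor activity",
}


def keep_candidate(event_type, activated_entity):
    """Keep a pattern match only if the slot filler has the required annotation."""
    required = REQUIRED_ANNOTATION[event_type]
    return required in ANNOTATIONS.get(activated_entity, set())


candidates = [
    ("enzyme_activation", "protein_A"),
    ("enzyme_activation", "protein_C"),
    ("receptor_activation", "protein_B"),
]
kept = [c for c in candidates if keep_candidate(*c)]
print(kept)
```

Filtering after matching can only remove candidates, which is consistent with the expectation of higher precision at some cost to recall.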
Results
                             Original   Additional Memory   Difference
Enzyme Events    Precision   0.24       0.37                0.13
                 Recall      0.27       0.20                -0.07
                 F-measure   0.26       0.26                0.00
Receptor Events  Precision   0.08       0.34                0.26
                 Recall      0.17       0.12                -0.05
                 F-measure   0.11       0.18                0.07
Total            Precision   0.16       0.36                0.20
                 Recall      0.24       0.18                -0.06
                 F-measure   0.19       0.24                0.05
Biological entities
• Genes (and their products) are particularly
valuable to recognize, but are not the only
entities of interest:
– Diseases
– Drugs, Chemicals, and other treatments
– Anatomical and other locations
– Time and temporal relationships
– Methods and evidence
– Molecular functions, biological processes
Biological Concept Recognition
Entities: ontology terms as string literals
– ChEBI molecular structure
– GO cellular component
– Cell Type
Events: ontology terms as abstract concepts
– GO molecular function
– GO biological process
[Figure: word clouds of example terms, e.g. argon, nucleus, calcium (entities); regulation of locomotion, alcohol binding (events)]
Two dictionary-based tools tested against CRAFT
• UIMA ConceptMapper
http://incubator.apache.org/uima/sandbox.html#concept.mapper.annotator
– stemming and case matching relaxation
– non-contiguous spans
– ignore stopwords
– order-independent lookup
• Open Biomedical Annotator
http://bioportal.bioontology.org/annotator
– ignore stopwords
– partial word matches
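A minimal sketch of how the matching options listed above change what a dictionary lookup accepts. This is not ConceptMapper or the Open Biomedical Annotator: the function names, the stopword list, and the term identifier "TERM:1" are hypothetical, and stemming and non-contiguous spans are left out.

```python
STOPWORDS = {"of", "the", "in", "a", "an"}  # illustrative list only


def normalize(text, ignore_case=True, ignore_stopwords=True, order_independent=True):
    """Reduce a term or mention to a lookup key under the chosen options."""
    tokens = text.split()
    if ignore_case:
        tokens = [token.lower() for token in tokens]
    if ignore_stopwords:
        tokens = [token for token in tokens if token.lower() not in STOPWORDS]
    if order_independent:
        tokens = sorted(tokens)
    return tuple(tokens)


def build_index(terms, **options):
    """Map each normalized term name to its identifier."""
    return {normalize(name, **options): term_id for term_id, name in terms.items()}


def lookup(mention, index, **options):
    """Return the identifier whose normalized name equals the normalized mention."""
    return index.get(normalize(mention, **options))


terms = {"TERM:1": "induction of apoptosis"}  # hypothetical identifier

relaxed = build_index(terms)
strict = build_index(terms, order_independent=False)

print(lookup("Apoptosis induction", relaxed))
print(lookup("Apoptosis induction", strict, order_independent=False))
```

The index and the lookup must use the same options, otherwise the keys are not comparable. Each relaxation raises recall for variant mentions and also raises the risk of spurious matches.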
Best run results
• CM/CTO: stemming + FindAllMatches: false
• OBA/CTO: using default stop words
• CM/GO_CC: stemming + caseMatch: insensitive
• CM/ChEBI: caseMatch: sensitive
Concept Matching Conclusions
• The kinds of terms in the ontology matter
• The strategies used in the dictionary matching
tools matter
• OpenDMAP will support strategies that go
beyond dictionary matching …
Evaluation via Test Suite
• Big picture: How to evaluate ontology concept recognition systems?
• Traditional approach: “corpus”
– Expensive
– Time-consuming to produce
– Redundancy for some things…
– …underrepresentation of others
• Immediate (narrow) goal of this work: Use techniques from
software testing and descriptive linguistics to build test suites that:
– Control test data
– Eliminate redundancy
– Systematic coverage (Oepen 1998)
• Immediate (broad) goal of this work: Are there general principles
for test suite design?
Slide credit: Kevin B. Cohen
Methods
• Steps: develop “catalogue” of dimensions along which terms vary
• Use insights from linguistics and from how we
know concept recognition systems work
– Structural aspects: length
– Content aspects: typography, orthography, lexical
contents (function words)…
• …to build a structured set of test cases
• Also compare to other test suite work (Cohen et
al. 2004) to look for common principles
Slide credit: Kevin B. Cohen
Structured test suite
Canonical
• GO:0000133 Polarisome
• GO:0000108 Repairosome
• GO:0000786 Nucleosome
• GO:0001660 Fever
• GO:0001726 Ruffle
• GO:0005623 Cell
• GO:0005694 Chromosome
• GO:0005814 Centriole
• GO:0005874 Microtubule
Non-canonical
• GO:0000133 Polarisomes
• GO:0000108 Repairosomes
• GO:0000786 Nucleosomes
• GO:0001660 Fevers
• GO:0001726 Ruffles
• GO:0005623 Cells
• GO:0005694 Chromosomes
• GO:0005814 Centrioles
• GO:0005874 Microtubules
induction of apoptosis -> apoptosis induction (Syntax)
cell migration -> cell migrated (Part of speech)
ensheathment of neurons -> ensheathment of some neurons
Slide credit: Kevin B. Cohen
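A minimal sketch of how such a structured test suite could be generated and scored. The canonical terms are the nine from the slide. The pluralizer is deliberately naive (it only appends "s", which happens to suffice for these nine terms), and the exact-match recognizer is a hypothetical stand-in, not the publicly available system that was evaluated.

```python
CANONICAL = {
    "GO:0000133": "polarisome",
    "GO:0000108": "repairosome",
    "GO:0000786": "nucleosome",
    "GO:0001660": "fever",
    "GO:0001726": "ruffle",
    "GO:0005623": "cell",
    "GO:0005694": "chromosome",
    "GO:0005814": "centriole",
    "GO:0005874": "microtubule",
}


def pluralize(term):
    """Naive plural; adequate only for regular nouns like the nine above."""
    return term + "s"


def build_test_suite(canonical):
    """One canonical and one non-canonical (plural) test case per term."""
    cases = []
    for go_id, name in canonical.items():
        cases.append({"id": go_id, "text": name, "kind": "canonical"})
        cases.append({"id": go_id, "text": pluralize(name), "kind": "non-canonical"})
    return cases


def make_exact_matcher(canonical):
    """Hypothetical recognizer that only knows the exact canonical strings."""
    by_name = {name: go_id for go_id, name in canonical.items()}
    return lambda text: by_name.get(text.lower())


def recognized_fraction(cases, recognizer):
    """Fraction of test cases recognized correctly, per kind of test case."""
    hits, totals = {}, {}
    for case in cases:
        kind = case["kind"]
        totals[kind] = totals.get(kind, 0) + 1
        if recognizer(case["text"]) == case["id"]:
            hits[kind] = hits.get(kind, 0) + 1
    return {kind: hits.get(kind, 0) / totals[kind] for kind in totals}


suite = build_test_suite(CANONICAL)
scores = recognized_fraction(suite, make_exact_matcher(CANONICAL))
print(scores)
```

Because every test case records which dimension was varied, a failure points directly at the kind of variation the recognizer cannot handle.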
Methods/Results
• Gene Ontology, revision 9/24/2009
• Canonical: 188
• Non-canonical: 117
• Observation:
– A 5:1 ratio of “dirty” to “clean” tests is a mark of “mature”
testing
• Applied publicly available concept recognition
system
Slide credit: Kevin B. Cohen
Results
• 97.9% of canonical terms were recognized
– All exceptions contain the word “in”
• No non-canonical terms were recognized
• What would it take to recognize the error pattern with canonical
terms with a corpus-based approach?
• General principles: length, ortho/typography
(numerals/punctuation), function/stopwords,
syntactic context
Slide credit: Kevin B. Cohen