
Effective Reranking for Extracting
Protein-protein Interactions from
Biomedical Literature
Deyu Zhou, Yulan He and Chee Keong Kwoh
School of Computer Engineering
Nanyang Technological University, Singapore
30 August 2007
Outline
• Protein-protein interactions (PPIs) extraction
• Hidden Vector State (HVS) model for PPIs extraction
• Reranking approaches
• Experimental results
• Conclusions
Protein-Protein Interactions Extraction
[Diagram: Interact node connected to Protein nodes]
Example: "Spc97p interacts with Spc98 and Tub4 in the two-hybrid system"
Extracted interactions:
• Spc97p interact Spc98
• Spc97p interact Tub4
Existing Approaches
From simple to complicated:
• Statistics methods
• Pattern matching
• Parsing-based
An example
However, unlike another tumor suppressor protein, p53, Rb did not have any significant effect
on basal levels of transcription, suggesting that Rb specifically interacts with IE2 rather ...
Part-of-speech tagging
However/RB ,/, unlike/IN another/DT tumor/NN suppressor/NN protein/NN ,/, p53/NN ,/,
Rb/NN did/VBD not/RB have/VB any/DT significant/JJ effect/NN on/IN basal/JJ levels/NNS
of/IN transcription/NN ,/, suggesting/VBG that/IN Rb/NN specifically/RB interacts/VBZ
with/IN IE2/NN rather/RB ...
Protein name identification
However/RB ,/, unlike/IN another/DT tumor/NN suppressor/NN protein/NN ,/,
PROTEIN(p53/NN) ,/, PROTEIN(Rb/NN) did/VBD not/RB have/VB any/DT significant/JJ
effect/NN on/IN basal/JJ levels/NNS of/IN transcription/NN ,/, suggesting/VBG that/IN
PROTEIN(Rb/NN) specifically/RB interacts/VBZ with/IN PROTEIN(IE2/NN) rather/RB ...
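The protein name identification step can be illustrated with a minimal sketch. The slides do not say how names are recognised, so the dictionary lookup below (the `protein_names` set and the `tag_proteins` helper) is purely an assumption for illustration.

```python
def tag_proteins(tagged_text, protein_names):
    """Wrap word/POS tokens whose word is a known protein name in PROTEIN(...)."""
    out = []
    for token in tagged_text.split():
        # Split on the last "/" so that tokens such as ",/," are handled too.
        word, _, _pos = token.rpartition("/")
        out.append(f"PROTEIN({token})" if word in protein_names else token)
    return " ".join(out)


# Hypothetical name dictionary; a real system would use a dedicated recogniser.
names = {"p53", "Rb", "IE2"}
print(tag_proteins("Rb/NN specifically/RB interacts/VBZ with/IN IE2/NN", names))
```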
Statistics-Based Approaches
Sentence level statistic

Relation      Occurrence
(Rb, IE2)     +1
(p53, IE2)    +1
(p53, Rb)     +1

Corpus level statistic

Relation      Occurrence
...           1
(p53, IE2)    8
...           6

Predefined threshold a = 7

Relation      Confidence
...           ...
(p53, IE2)    75%
...           ...
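A minimal sketch of the counting idea: every pair of protein names co-occurring in a sentence gets +1, counts are summed over the corpus, and pairs reaching the threshold are kept. The function names are hypothetical, and whether the threshold comparison is strict or inclusive is an assumption (inclusive here). The confidence figure on the slide is not reproduced, since its formula is not given.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """sentences: one list of protein names per sentence. Returns pair counts."""
    counts = Counter()
    for proteins in sentences:
        # Each unordered pair counts once per sentence (the "+1" on the slide).
        for pair in combinations(sorted(set(proteins)), 2):
            counts[pair] += 1
    return counts


def select_relations(counts, threshold):
    """Keep pairs whose corpus-level count reaches the threshold (assumed >=)."""
    return {pair: n for pair, n in counts.items() if n >= threshold}


# Toy corpus: one sentence mentioning all three proteins, seven with p53 and IE2.
corpus = [["p53", "Rb", "IE2"]] + [["p53", "IE2"]] * 7
counts = cooccurrence_counts(corpus)
print(select_relations(counts, threshold=7))
```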
Pattern Matching Approaches
Pattern 1: Protein [*] interact[s] with protein
  Matches: Rb interact IE2; p53 interact IE2
Pattern 2: protein RB VBZ WITH protein
  Matches: Rb interact IE2
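The two patterns can be sketched as regular expressions over the POS-tagged, protein-marked example sentence. This is an illustration under assumptions: `[*]` is read as "any intervening tokens", and the regexes and helper names are hypothetical rather than the patterns used by any particular system.

```python
import re

TAGGED = (
    "However/RB ,/, unlike/IN another/DT tumor/NN suppressor/NN protein/NN ,/, "
    "PROTEIN(p53/NN) ,/, PROTEIN(Rb/NN) did/VBD not/RB have/VB any/DT "
    "significant/JJ effect/NN on/IN basal/JJ levels/NNS of/IN transcription/NN ,/, "
    "suggesting/VBG that/IN PROTEIN(Rb/NN) specifically/RB interacts/VBZ "
    "with/IN PROTEIN(IE2/NN) rather/RB"
)

ANY_PROT = re.compile(r"PROTEIN\(([^/\s]+)/\w+\)")
INTERACT_WITH = re.compile(r"interacts?/\w+ with/IN PROTEIN\(([^/\s]+)/\w+\)")
PROT = r"PROTEIN\((?P<{}>[^/\s]+)/\w+\)"
PATTERN2 = re.compile(PROT.format("a") + r" \S+/RB \S+/VBZ with/IN " + PROT.format("b"))


def pattern1(tagged):
    """Protein [*] interact[s] with protein: any earlier protein pairs up."""
    pairs = []
    for m in INTERACT_WITH.finditer(tagged):
        for a in ANY_PROT.finditer(tagged, 0, m.start()):
            pair = (a.group(1), m.group(1))
            if pair not in pairs:
                pairs.append(pair)
    return pairs


def pattern2(tagged):
    """protein RB VBZ WITH protein: strict adjacency on POS tags."""
    return [(m.group("a"), m.group("b")) for m in PATTERN2.finditer(tagged)]


print(pattern1(TAGGED))
print(pattern2(TAGGED))
```

The looser pattern also pairs p53 with IE2, which the sentence does not assert; the stricter pattern avoids this at the cost of coverage.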
Parsing-Based Approaches
Input: ... Rb specifically interacts with IE2 ...
Syntactic processing: parse tree over the input with nodes VP, VP, NP, PP and pre-terminals N, ADV, V, P, N
Semantic processing: (<INTERACT> <THE Rb PROTEIN> <THE IE2 PROTEIN>)
Output: Rb interact IE2
Semantic Parser
For each candidate word string Wn, need to compute most
likely set of embedded concepts
\hat{C} = \arg\max_{C} P(C \mid W_n) = \arg\max_{C} P(C)\, P(W_n \mid C)
where P(C) is the semantic model and P(Wn|C) is the lexical model
We could use a simple finite state tagger …
P(C):     SS   PROTEIN  INTERACT   DUMMY  PROTEIN  DUMMY  PROTEIN  DUMMY                     SE
P(Wn|C):  <s>  Spc97p   interacts  with   Spc98    and    Tub4     in the two-hybrid system  </s>
… can be robustly trained using EM, but model is too
weak to represent embeddings in natural language
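A flat finite state tagger of this kind can be decoded with the Viterbi algorithm, which finds the concept sequence maximising P(C) P(Wn|C). The sketch below uses made-up toy probabilities and omits the SS/SE boundary states for brevity; all table values and names are assumptions for illustration only.

```python
def viterbi(words, states, start, trans, emit):
    """Return (probability, concept sequence) maximising P(C) * P(W|C)."""
    # best[s] holds the best-scoring path that ends in state s.
    best = {s: (start.get(s, 0.0) * emit[s].get(words[0], 0.0), [s]) for s in states}
    for word in words[1:]:
        step = {}
        for s in states:
            prob, path = max(
                ((best[r][0] * trans[r].get(s, 0.0), best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            step[s] = (prob * emit[s].get(word, 0.0), path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])


STATES = ["PROTEIN", "INTERACT", "DUMMY"]
START = {"PROTEIN": 0.6, "DUMMY": 0.4}  # toy values
TRANS = {  # semantic model P(C), toy values
    "PROTEIN": {"INTERACT": 0.5, "DUMMY": 0.5},
    "INTERACT": {"DUMMY": 0.7, "PROTEIN": 0.3},
    "DUMMY": {"PROTEIN": 0.6, "DUMMY": 0.4},
}
EMIT = {  # lexical model P(W|C), toy values
    "PROTEIN": {"Spc97p": 0.4, "Spc98": 0.3, "Tub4": 0.3},
    "INTERACT": {"interacts": 1.0},
    "DUMMY": {"with": 0.5, "and": 0.5},
}

words = "Spc97p interacts with Spc98 and Tub4".split()
prob, concepts = viterbi(words, STATES, START, TRANS, EMIT)
print(concepts)
```

The output is a flat label sequence: nothing in it records that Spc98 and Tub4 are both arguments of the same INTERACT concept, which is the weakness the slide points to.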
Perhaps use some form of hierarchical HMM in which each
state is a terminal or a nested HMM …
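[Figure: hierarchical parse tree, P(C), over the word string, P(Wn|C), "Spc97p interacts with Spc98 and Tub4 in the two-hybrid system". The tree is rooted at S with INTERACTION beneath it and contains the nodes SUBJECT, INTERACT, OBJECT, OBJECT, PREP, PROTEIN, AND and DUMMY.]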
… but when using EM, models rarely converge
on good solutions and, in practice, direct
maximum-likelihood estimation from “tree-bank” data is
needed to train models
Hidden Vector State Model
[Figure: parse tree for "<s> Spc97p interacts with Spc98 and Tub4 in the two-hybrid system </s>", rooted at SS, with PROTEIN, INTERACT and the pre-terminals DUMMY and PROTEIN nested beneath it. Under each word is its vector state: the stack of concepts from the word's pre-terminal up to the root SS, e.g. PROTEIN, INTERACT, PROTEIN, SS for Spc98.]
The HVS model is an HMM in which the states correspond to the
stack of a push-down automaton with a bounded stack size …
P(Wn|C) word                P(C) vector state (top of stack first)
<s>                         SS
Spc97p                      PROTEIN, SS
interacts                   INTERACT, PROTEIN, SS
with                        DUMMY, INTERACT, PROTEIN, SS
Spc98                       PROTEIN, INTERACT, PROTEIN, SS
and                         DUMMY, INTERACT, PROTEIN, SS
Tub4                        PROTEIN, INTERACT, PROTEIN, SS
in the two-hybrid system    DUMMY, SS
</s>                        SE, SS
… this is a very convenient framework for applying constraints
HVS model transition constraints:
• finite stack depth – D
• push only one non-terminal semantic concept onto the stack at
each step
\hat{C} = \arg\max_{C,N} \prod_{t} P(n_t \mid C_{t-1})\, P(C_t[1] \mid C_t[2..D_t])\, P(W_t \mid C_t)
… model defined by three simple probability tables
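As a rough sketch of how the three tables combine, the function below scores a single transition in log space. Stacks are tuples with the top element first; the dictionary layout of `p_pop`, `p_push` and `p_word`, and the toy probabilities, are assumptions for illustration and not the model's actual storage format.

```python
import math


def transition_log_prob(prev, cur, word, p_pop, p_push, p_word):
    """log [ P(n_t | C_{t-1}) * P(C_t[1] | C_t[2..D_t]) * P(W_t | C_t) ]."""
    # n_t: how many elements were popped from prev before one concept was pushed.
    n = len(prev) - (len(cur) - 1)
    if n < 0 or cur[1:] != prev[n:]:
        raise ValueError("cur is not reachable from prev by pop-n-then-push-one")
    return (
        math.log(p_pop[(n, prev)])
        + math.log(p_push[(cur[0], cur[1:])])
        + math.log(p_word[(word, cur)])
    )


# Toy tables for the step from <s> to Spc97p (values are made up).
prev, cur = ("SS",), ("PROTEIN", "SS")
p_pop = {(0, prev): 0.5}
p_push = {("PROTEIN", ("SS",)): 0.5}
p_word = {("Spc97p", cur): 0.25}
print(math.exp(transition_log_prob(prev, cur, "Spc97p", p_pop, p_push, p_word)))
```

Summing these per-step log probabilities over t gives the log of the product inside the argmax.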
Parsing with the HVS model
1) Pop n elements from the previous stack state (here n = 1): P(nt|Ct-1)
   DUMMY, INTERACT, PROTEIN, SS  →  INTERACT, PROTEIN, SS
2) Push 1 pre-terminal semantic concept onto the stack: P(Ct[1]|Ct[2..Dt])
   INTERACT, PROTEIN, SS  →  PROTEIN, INTERACT, PROTEIN, SS
3) Generate the next word: P(Wt|Ct)
   Word sequence: … with Spc98 and Tub4 …
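The stack mechanics of one parsing step can be sketched as a small function. Stacks are tuples with the top first; the helper name `shift`, the choice to keep the root SS unpoppable, and `max_depth=4` are assumptions for illustration (the slides only state that the depth D is finite).

```python
def shift(stack, n_pop, concept, max_depth=4):
    """Pop n_pop elements from the stack, then push exactly one concept."""
    if not 0 <= n_pop < len(stack):
        raise ValueError("cannot pop the root or a negative number of elements")
    new_stack = (concept,) + stack[n_pop:]
    if len(new_stack) > max_depth:
        raise ValueError("stack depth limit exceeded")
    return new_stack


# Replay the example sentence: one (n_pop, pushed concept) move per word after <s>.
moves = [
    (0, "PROTEIN"),   # Spc97p
    (0, "INTERACT"),  # interacts
    (0, "DUMMY"),     # with
    (1, "PROTEIN"),   # Spc98
    (1, "DUMMY"),     # and
    (1, "PROTEIN"),   # Tub4
    (3, "DUMMY"),     # in the two-hybrid system
    (1, "SE"),        # </s>
]
states = [("SS",)]
for n_pop, concept in moves:
    states.append(shift(states[-1], n_pop, concept))
for state in states:
    print(", ".join(state))
```

Because each step pushes exactly one concept, the whole parse is determined by the sequence of pop counts and pushed concepts.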
Train using EM and apply constraints
• Training text: "CUL-1 was found to interact with SKR-1, SKR-2, SKR-3, and SKR-7 in yeast two-hybrid system"
• Abstract semantic annotation: PROTEIN ( INTERACT ( PROTEIN ) )
• Constraints: limit forward-backward search to only include states which are consistent with the constraints
[Diagram components: Data, Constraints, Parse Statistics, EM Parameter Estimation, HVS Model Parameters]
Reranking Methodology
• Reranking approaches attempt to improve upon an
existing probabilistic parser by reranking the output of
the parser.
• Reranking has benefited applications such as named-entity
extraction, semantic parsing and semantic labeling.
• Goal: rerank parses generated by the HVS model for
protein-protein interaction extraction.
Architecture
[Figure: system architecture]
• Training: Annotated Corpus E; HVS model; Reranking Model
• Testing: Test Data → Semantic Parsing → Parse results → Reranking → Ranked 1st parse → Extracted protein-protein interactions
• Features: Parsing Information IP, Structure Information IS, Complexity Information IC, ...
Reranking approaches
• Features for Reranking
Suppose sentence Si has a corresponding parse set Ci = {Cij, j = 1, ..., N}
– Parsing Information
– Structure Information
– Complexity Information
Reranking approaches
Score is defined as
• log-linear regression model
• Neural Network
• Support Vector Machines
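The score formula itself did not survive in this transcript, so the following is only a generic sketch of linear (log-linear style) reranking: each candidate parse gets a weighted sum of its feature values and the candidates are sorted by that score. The feature names, values and weights are all hypothetical.

```python
def rerank(candidates, weights):
    """Sort candidate parses by a weighted sum of their feature values."""
    def score(features):
        return sum(weights.get(name, 0.0) * value for name, value in features.items())

    return sorted(candidates, key=lambda c: score(c["features"]), reverse=True)


# Two toy candidates; "parsing" stands in for the parser's own log score.
candidates = [
    {"parse": "parse A", "features": {"parsing": -12.0, "structure": 1.0, "complexity": 3.0}},
    {"parse": "parse B", "features": {"parsing": -12.5, "structure": 2.0, "complexity": 2.0}},
]
weights = {"parsing": 1.0, "structure": 0.8, "complexity": -0.5}  # assumed, not learned
best = rerank(candidates, weights)[0]
print(best["parse"])
```

Here the parser alone would prefer parse A, while the additional features move parse B to the top. In practice the weights would be learned from training data, and a neural network or SVM would replace the linear scoring function.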
Experiments
• Setup
– Corpus I
• comprises 300 abstracts randomly retrieved from
the GENIA corpus
• GENIA is a collection of research abstracts selected
from the search results of the MEDLINE database with
the keywords (MeSH terms) “human, blood cells and
transcription factors”
• split into two parts:
– Part I contains 1500 sentences (training data)
– Part II consists of 1000 sentences (test data)
Experimental Results
Figure 1: F-measure vs number of candidate parses.
Experimental Results
(cont’d)
Experiments   Recall (%)   Precision (%)   F-Score (%)
Baseline      55.8         55.6            55.7
SVM           59.1         60.2            59.7
NN            57.9         61.8            59.8
LLR           58.5         61.2            59.8
Table 3: Results based on the interaction category.
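Assuming the F-score reported here is the balanced F-measure (the harmonic mean of precision and recall), the table entries can be recomputed with a few lines. The function name is hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure, in the same units as the inputs (here percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision) pairs copied from the table.
rows = {
    "Baseline": (55.8, 55.6),
    "SVM": (59.1, 60.2),
    "NN": (57.9, 61.8),
    "LLR": (58.5, 61.2),
}
for name, (recall, precision) in rows.items():
    print(name, round(f_measure(precision, recall), 1))
```

The Baseline, NN and LLR rows reproduce the reported values. The SVM row recomputes to 59.6 rather than 59.7, presumably because the displayed precision and recall are themselves rounded.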
Conclusions
• Three reranking methods for the HVS model in the
application of extracting protein-protein interactions from
biomedical literature.
• Experimental results show that 4% relative improvement
in F-measure can be obtained through reranking on the
semantic parse results
• Incorporating other semantic or syntactic information
might be able to give further gains.