When Corpus Meets Theory

Models and Data
When Corpus Meets Theory
James Pustejovsky
TSD 2002
September 10, 2002
Talk Outline
• Goals for Language Modeling
• The Role of Corpus in Theory
– Disambiguation
– Selection discovery
– Clustering
– Category modification and formation
– Grammar induction
• The Role of Theory in Corpus
Goals of Language Modeling
• Statistically informed models improve application performance
– Speech
– Search
– Clustering
– Parsing
– Machine translation
– Summarization
– Question answering
Theory Drives the Model
• Corpus Behavior of words is determined by their
type.
• You can’t find what you can’t model.
• But, you don’t want to find only what you model!
• Theory allows a model of reality, but …
• Corpus brings reality to the model.
Language Modeling with
Generative Lexicon
• Selection integrates paradigmatics and
syntagmatics
• Models the relationship between selectional
contexts
– Coercion in typing
– Complex type (Dot Objects)
• All major categories behave functionally
– Qualia structure models much of this behavior
• Semantic Types are differentiated and ranked:
– Grammatical behavior follows (generally) from type
Quine’s Gambit in Corpora
• Co-occurrence reveals surface relations.
– Paradigmatics is first order.
– Syntagmatics is first order.
• LSA and other techniques create nonsuperficial associations.
• Model Bias is necessary to create decision
procedures
Example: Complex Types
Recognizing Selection
1. a. The man fell/died.
b. The rock fell/!died.
2. a. John forced/!convinced the door to open.
b. John forced/convinced the guests to leave.
3. a. John poured milk into /!on his coffee.
b. John poured milk into/on the bowl.
Modeling Paradigmatic Systems
Integrating Selection into Grammars
Qualia Structure
• Qualia are used to create new types:
• They are generative coherence relations between
types.
Three Ranks of Type
Entities
Events
System of Generating Types
Qualia are incorporated into Type Itself
Qualia as Types
Functional Selection
Functional Type Coercion
Co-composition
Coercion in Function Composition
Selection and Coercion
Type Specification
Type Determines Grammatical Behavior
Corpus Distribution of different types should
correlate strongly with their type.
Behavior is measurable in corpus
Corpus Analysis provides probable
values for Coercion
Drinking, sipping, cooling,?pouring,?spilling,
Complements of “begin” in AP:
(Pustejovsky and Rooth, 1991 ms)
Complements of “veto” in AP
Limitations of this Approach:
Fuzzy Selection
Dependencies that require Models:
Complex Types
Complex Types
Contexts Introducing Complex Types
1. a. John read the story/the book.
b. John told the story/!the book.
2. Mary read the subway wall.
When Paradigmatic systems are modeled,
Syntagmatic Processes are affected
•The specificity of argument selection by a predicate;
•The treatment of verbal polysemy and multiple
subcategorization
• The treatment of type mismatches and the semantics of
solidarities
Types of Properties
Natural Binary Predicate
Polar Predicates
•hot/cold
•big/small
•short/tall
•clean/dirty
Lexical Asymmetries
•Preferences and Defaults:
• clean/dirty, empty/full, pretty/ugly
•Lexical Gaps:
•bald/(hairy), toothless/(toothed)
•Lexical Perfectives:
•dead/alive
Sortal Opposition:
• External Negation points up in the Type system:
• Internal Negation points down in the Type System:
(1) a. Rocks are not alive.
b. !Rocks are dead.
(2) a. The Pope is not married.
b. !The Pope is a bachelor.
(3) a. Bill did not run the race.
b. Hence, Bill did not win the race.
c. !Bill lost the race.
Case Study I:
Corpus Drives Lexical Acquisition
Text Mining the Biobibliome
• 40,000 papers published each month in
Medline
• 11 million abstracts currently in Medline
Database
• 36 GB of text
Robust Extraction of Relations
from Biomedical Texts
• Statistical techniques are too coarse-grained
– “SU6656 does not inhibit the PDGF receptor.”
• ‘Local’ Named Entity Extraction is not informative enough
– “This protein binds to Src.”
• ‘Bag of words’ and ‘bag of entities’ approaches are too
weak
– “p16 inhibits Cdk4.”
– “Cdk4 is inhibited by p16.”
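The weakness of the bag-of-words view can be shown with a minimal Python sketch. The tokenizer `bag_of_words` is a hypothetical helper, and the sentence "Cdk4 inhibits p16." is a constructed contrast for illustration only, not a claim from the slides.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Lower-case, drop the final period, and count tokens; word order is discarded."""
    return Counter(sentence.lower().rstrip(".").split())


active = "p16 inhibits Cdk4."
passive = "Cdk4 is inhibited by p16."
swapped = "Cdk4 inhibits p16."  # constructed contrast: roles reversed

# Same bag, opposite relation: the representation cannot tell who inhibits whom.
same_bag_opposite_meaning = bag_of_words(active) == bag_of_words(swapped)

# Different bag, same relation: the passive paraphrase looks unrelated.
different_bag_same_meaning = bag_of_words(active) != bag_of_words(passive)

print(same_bag_opposite_meaning, different_bag_same_meaning)
```

Recovering the direction of the relation requires at least the syntactic roles of the two entities, which is what the parsing methodology that follows supplies.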
Parsing Methodology
• Identify Targets of Interest
– Entities and relations to be extracted
• Perform Corpus Analysis over targets
– Cluster corpus occurrences by syntactic behavior and semantic
type
• Generate Patterns for extraction
– Test and modify patterns against development corpus
Possible Selectional Frames
Arg-types            Obj = Bio-entity     Obj = Process
Subj = Bio-entity    (entity,entity)      (entity,process)
Subj = Process       (process,entity)     (process,process)
• “p16 inhibits Cdk4.” (entity,entity)
• “p16 inhibits cell growth.” (entity,process)
• “Methylation inhibits HDAC1.” (process,entity)
• “Cell growth inhibits apoptosis.” (process,process)
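The four frames above can be sketched as a lookup from argument types to a frame. This is an illustration only: the two small sets `ENTITIES` and `PROCESSES` are hypothetical stand-ins for a real semantic typing resource, filled with the arguments from the example sentences.

```python
# Hypothetical toy type lexicon built from the example sentences above.
ENTITIES = {"p16", "Cdk4", "HDAC1"}
PROCESSES = {"cell growth", "methylation", "apoptosis"}


def arg_type(phrase: str) -> str:
    """Return 'entity' or 'process' for a known argument phrase."""
    key = phrase.strip()
    if key in ENTITIES:
        return "entity"
    if key.lower() in PROCESSES:
        return "process"
    raise ValueError(f"untyped argument: {phrase!r}")


def selectional_frame(subject: str, obj: str) -> tuple:
    """Pair the subject type with the object type, as in the table of frames."""
    return (arg_type(subject), arg_type(obj))


for subj, obj in [("p16", "Cdk4"), ("p16", "cell growth"),
                  ("Methylation", "HDAC1"), ("Cell growth", "apoptosis")]:
    print(subj, "inhibits", obj, "->", selectional_frame(subj, obj))
```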
Corpus Pattern Analysis
• Create concordances over target elements
• Automatically cluster complementation
patterns
• Semi-automatically verify patterns and amend
grammar rules accordingly.
Getting the Lexicon out of the Corpus
• Preliminary examination of the text
• Sort concordances according to semantic patterns
• One-sense-per-domain doesn't cut it
• Complementation patterns emerge from the corpus, with and without realization
• Semantic patterns are a first step towards identifying lexical sets
• Semantic patterns identified with specific lexical sets yield co-specifications
• Implicatures can be identified with co-specifications for a
very high proportion of uses of all predicators.
Corpus-derived Grammars distinguish
Textual Function
• Tensed Sentence-based relational information
conveys new information.
– A peptide representing the carboxyl-terminal tail of the
met receptor inhibits kinase activity.
• Nominalization functions to:
– Allow further predication and modification;
– Bridge the new information with acceptance as given.
– Provide economy of expression in text;
• Agentive Nominal conveys a relation as a given fact.
– The protein kinase C inhibitor staurosporine , inhibited
actin assembly
Probable Syntactic Patterns:
Sentential Forms
• A peptide representing the carboxyl-terminal tail of the met
receptor inhibits kinase activity.
• Whereas phosphorylation of the IRK by ATP is inhibited by
the nonhydrolyzable competitor adenylyl-imidodiphosphate,
...
• The Met tail peptide inhibits the closely related Ron
receptor but does not affect …
•Although the ability of individual trichothecenes to inhibit protein synthesis and
activate JNK/p38 kinases are dissociable , both effects contribute to the induction of
apoptosis .
Probable Syntactic Patterns:
Nominal Forms
• 12S E1A , an inhibitor of p300-dependent transcription ,
reduces the binding of TFIIB , but not that of cyclin E- Cdk2 , to
p300.
• The protein kinase C inhibitor staurosporine , inhibited actin
assembly and platelet aggregation induced by thrombin or PMA.
Probable Syntactic Patterns:
Nominalizations
•Structural basis for inhibition of protein tyrosine phosphatases by
Keggin compounds phosphomolybdate and phosphotungstate.
• Previous reports raised question as to whether 8-Cl-cAMP is a
prodrug for its metabolite, 8-Cl-adenosine which exerts growth
inhibition in a broad spectrum of cancer cells.
Case Study II:
Theory Drives Corpus Analysis
Semantic Rerendering
• A general technique for adapting and
modifying an existing ontology
• Types are extended and created through:
– corpus analysis of patterns implicated with
type structures
– Ad hoc database projections over a relational
database
Specialized Ontologies in the
Biomedical Domain
The UMLS from National Library of Medicine
– wide coverage
– shallow semantic type structure
• 180,998 instances of Amino Acid, Peptide, or
Protein in UMLS
• Chemical Viewed Functionally and Chemical Viewed
Structurally
– These 2 subtrees cover a large number of all types in the
UMLS
– The UMLS gives semantic type bindings to 1.5 million
entities
NLP Applications using Semantic Typing
• Statistical Categorization and Disambiguation Tasks
– Resolution of Prepositional Attachment
– Relations between Constituents in Nominal Compounds
Generalizing across semantic classes
= make up for the sparseness of data
• IR Tasks
– Query Reformulation
– Filtering & Ranking of Retrieved Results
• Information Extraction Tasks
– Coreference Resolution
– Relation Extraction (via Anaphora Resolution)
– Entity Identification
GL as Modeling Bias in Rerendering
• Structural subtyping (Formal)
• Functional subtyping (Telic)
• Activation relations (Agentive)
• Molecular analysis (Const)
Syntactic Rerendering Algorithm (I)
Syntactic Rerendering Algorithm (II)
Syntactic Rerendering Algorithm (III)
Evaluating Results
• Comparison against Existing Ontologies
– overlap with Gene Ontology (GO) for select
categories
• Receptor: 17.5% of 2nd level extension phrases are in GO
• Improved P&R for the client NLP Applications
– Coreference Resolution Application
• Sortal Anaphora:
– “the enzyme”, “the protease”, “the same
solvent”, etc.
Derivation of Instances for the
Proposed Subtypes
Syntactic templates (inhibitor, solvent) :
– definitional constructions: “X is a Y inhibitor”
– aliasing constructions: “X (the solvent)”
– appositions: “X, the inhibitor of Y,”
– nominal compounds: “the solvent X”
– enumerations: “the following solvents: X, Y, ..”
– relative clauses
– adjuncts: “X and Y as solvents”
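A few of these templates can be approximated with regular expressions. The sketch below is a rough illustration, not the algorithm from the talk: the pattern list `TEMPLATES`, the function `find_instances`, and the test sentences are all hypothetical, and real templates would operate over parsed text rather than raw strings.

```python
import re

# (template name, sortal, pattern); the named group "x" captures the proposed instance.
TEMPLATES = [
    ("definitional", "inhibitor",
     re.compile(r"(?P<x>[\w-]+) is an? (?:[\w-]+ )?inhibitor")),
    ("apposition", "inhibitor",
     re.compile(r"(?P<x>[\w-]+), the inhibitor of [\w-]+,")),
    ("aliasing", "solvent",
     re.compile(r"(?P<x>[\w-]+) \(the solvent\)")),
]


def find_instances(text: str) -> list:
    """Return (template, sortal, instance) triples for every template match in text."""
    found = []
    for name, sortal, pattern in TEMPLATES:
        for match in pattern.finditer(text):
            found.append((name, sortal, match.group("x")))
    return found


# Invented test sentences, for illustration only.
print(find_instances("Staurosporine is a PKC inhibitor."))
print(find_instances("DMSO (the solvent) was removed."))
```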
Semantic (Database) Rerendering
• Database of relations
– extracted from the Medline corpus
– inhibit, block, phosphorylate
• Typed projection from relations table
– induces an ad hoc category
subtype of T1
X = {X : T1| R(X,Y) T1UMLS1}
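One way to read the typed projection is as a filter over the relations table: keep every X of type T1 that stands in relation R to some Y. The sketch below assumes that reading. The rows reuse relations quoted elsewhere in this talk, while the type labels in `TOY_TYPES` are hypothetical placeholders, not actual UMLS bindings.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    predicate: str
    subject: str
    obj: str


# Rows modeled on example sentences quoted in this talk.
RELATIONS = [
    Relation("inhibit", "p16", "Cdk4"),
    Relation("inhibit", "p21(WAF-1)", "CDK2"),
    Relation("inhibit", "p21(WAF-1)", "CDK4"),
    Relation("inhibit", "staurosporine", "actin assembly"),
]

# Hypothetical type assignments, for illustration only.
TOY_TYPES = {"p16": "Protein", "p21(WAF-1)": "Protein", "staurosporine": "Chemical"}


def typed_projection(relations, types, predicate, base_type):
    """Collect every subject of `predicate` whose type is `base_type`.

    The resulting set is the extension of an ad hoc subtype of `base_type`,
    e.g. "Protein that inhibits something".
    """
    return {r.subject for r in relations
            if r.predicate == predicate and types.get(r.subject) == base_type}


protein_inhibitors = typed_projection(RELATIONS, TOY_TYPES, "inhibit", "Protein")
print(sorted(protein_inhibitors))
```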
Syntactic vs. Semantic Rerendering
• Sortals with no corresponding relational form
solvent
• Sortal and relation predicates
inhibitor/inhibit
kinase/phosphorylate
• Relation predicates with no corresponding nominal forms
bind with
increase
Syntactic vs. Semantic Rerendering (II)
• Overlap of derived subtypes
– CDK inhibitor
– p21(WAF-1) inhibited CDK2 and CDK4
• Recover different types of information
– Syntactic templates for sortal predicates :
old information
– Typed projections of database relations :
new information
Case Study III:
Applying Lexical Semantic Knowledge
TERQAS: Time and Event Recognition for
Question Answering Systems
Relevance to Question Answering Systems
• Is Gates currently CEO of Microsoft?
• Were there any meetings between the terrorist hijackers and Iraq before the WTC event?
• Did the Enron merger with Dynegy take place?
• How long did the hostage situation in Beirut last?
Questions over TIMEBANK Corpus
• When did the war between Iran and Iraq end?
• When did John Sununu travel to a fundraiser for John Ashcroft?
• How many Tutsis were killed by Hutus in Rwanda in 1994?
• Who was Secretary of Defense during the Gulf War?
• What was the largest U.S. military operation since Vietnam?
• When did the astronauts return from the space station on the
last shuttle flight?
Workshop Goals
• TimeML: Define and Design a Metadata Standard for
Markup of events, their temporal anchoring, and how they
are related to each other in News articles.
• TIMEBANK: Given the specification of TimeML,
create a gold standard corpus of 300 articles marked up for
temporal expressions, events, and basic temporal relations.
TERQAS Participants
– James Pustejovsky, PI
– Rob Gaizauskas
– Graham Katz
– Bob Ingria
– José Castaño
– Inderjeet Mani
– Antonio Sanfilippo
– Dragomir Radev
– Patrick Hanks
– Marc Verhagen
– Beth Sundheim
– Andrea Setzer
– Jerry Hobbs
– Bran Boguraev
– Andy Latto
– John Frank
– Lisa Ferro
– Marcia Lazo
– Roser Saurí
– Anna Rumshisky
– David Day
– Luc Belanger
– Harry Wu
– Andrew See
Supported by
How TimeML Differs from Previous
Markups
• Extends TIMEX2 annotation;
– Temporal Functions: three years ago
– Anchors to events and other temporal expressions:
• Identifies signals determining interpretation of temporal expressions;
– Temporal Prepositions: for, during, on, at;
– Temporal Connectives: before, after, while.
• Identifies event expressions;
– tensed verbs; has left, was captured, will resign;
– stative adjectives; sunken, stalled, on board;
– event nominals; merger, Military Operation, Gulf War;
• Creates dependencies between events and times:
– Anchoring; John left on Monday.
– Orderings; The party happened after midnight.
– Embedding; John said Mary left.
<EVENT>
attributes ::= eid class tense aspect
eid ::= ID
{eid ::= EventID
EventID ::= e<integer>}
class ::= 'OCCURRENCE' | 'PERCEPTION' | 'REPORTING' |
'ASPECTUAL' | 'STATE' | 'I_STATE' |
'I_ACTION' | 'MODAL'
tense ::= 'PAST' | 'PRESENT' | 'FUTURE' | 'NONE'
aspect ::= 'PROGRESSIVE' | 'PERFECTIVE' |
'PERFECTIVE_PROGRESSIVE' | 'NONE'
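The grammar above translates directly into a small attribute check. This is a minimal sketch: the function name `invalid_attributes` is hypothetical, and treating `tense` and `aspect` as optional is an assumption based on the annotated examples in this talk that omit them.

```python
import re

# Value sets copied from the <EVENT> grammar above.
EVENT_CLASSES = {"OCCURRENCE", "PERCEPTION", "REPORTING", "ASPECTUAL",
                 "STATE", "I_STATE", "I_ACTION", "MODAL"}
TENSES = {"PAST", "PRESENT", "FUTURE", "NONE"}
ASPECTS = {"PROGRESSIVE", "PERFECTIVE", "PERFECTIVE_PROGRESSIVE", "NONE"}
EVENT_ID = re.compile(r"e\d+")  # EventID ::= e<integer>


def invalid_attributes(attrs: dict) -> list:
    """Return the names of EVENT attributes that violate the grammar."""
    bad = []
    if not EVENT_ID.fullmatch(attrs.get("eid", "")):
        bad.append("eid")
    if attrs.get("class") not in EVENT_CLASSES:
        bad.append("class")
    # Assumption: tense and aspect may be omitted, but must be valid if present.
    if "tense" in attrs and attrs["tense"] not in TENSES:
        bad.append("tense")
    if "aspect" in attrs and attrs["aspect"] not in ASPECTS:
        bad.append("aspect")
    return bad


print(invalid_attributes({"eid": "e1", "class": "OCCURRENCE"}))
print(invalid_attributes({"eid": "x1", "class": "EVENT", "tense": "PAST"}))
```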
TimeML Event Classes
• Occurrence:
– die, crash, build, merge, sell, take advantage of, ..
• State:
– Be on board, kidnapped, recovering, love, ..
• Reporting:
– Say, report, announce,
• I-Action:
– Attempt, try, promise, offer
• I-State:
– Believe, intend, want, …
• Aspectual:
– begin, start, finish, stop, continue.
• Perception:
– See, hear, watch, feel.
The young industry's rapid growth also is attracting regulators eager to
police its many facets.
The young industry's rapid
<EVENT eid="e1" class="OCCURRENCE">
growth
</EVENT>
also is
<EVENT eid="e2" class="OCCURRENCE">
attracting
</EVENT>
regulators
<EVENT eid="e4" class="I_STATE">
eager
</EVENT>
to
<EVENT eid="e5" class="OCCURRENCE">
police
</EVENT>
its many facets.
Israel will ask the United States to delay a military strike against Iraq
until the Jewish state is fully prepared for a possible Iraqi attack.
Israel will
<EVENT eid="e1" class="I_ACTION">
ask
</EVENT>
the United States to
<EVENT eid="e2" class="I_ACTION">
delay
</EVENT>
a military
<EVENT eid="e3" class="OCCURRENCE">
strike
</EVENT>
against Iraq until the Jewish state is fully
<EVENT eid="e4" class="I_STATE">
prepared
</EVENT>
for a possible Iraqi
<EVENT eid="e5" class="OCCURRENCE">
attack
</EVENT>
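Inline annotation of this kind can be read back with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a shortened copy of the sentence above; wrapping the fragment in a `<s>` root element is an assumption made only so that the fragment parses as a single document, and `extract_events` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the annotated sentence above, joined onto fewer lines.
FRAGMENT = (
    'Israel will <EVENT eid="e1" class="I_ACTION">ask</EVENT> '
    'the United States to <EVENT eid="e2" class="I_ACTION">delay</EVENT> '
    'a military <EVENT eid="e3" class="OCCURRENCE">strike</EVENT> against Iraq'
)


def extract_events(fragment: str) -> list:
    """Return (eid, class, text) for each EVENT element, in document order."""
    root = ET.fromstring(f"<s>{fragment}</s>")  # artificial root element
    return [(event.get("eid"), event.get("class"), (event.text or "").strip())
            for event in root.iter("EVENT")]


for eid, event_class, text in extract_events(FRAGMENT):
    print(eid, event_class, text)
```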
<TIMEX3>
• Fully Specified Temporal Expressions
– June 11, 1989
– Summer, 2002
• Underspecified Temporal Expressions
– Monday
– Next month
– Last year
– Two days ago
• Durations
– Three months
– Two years
functionInDocument allows for relative anchoring of
temporal expression values
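Relative anchoring can be illustrated by resolving a few of the underspecified expressions above against an anchor date, such as a document's creation time. This is a sketch of the general idea only, not TimeML's own mechanism: the function `resolve`, its three hard-coded expressions, and the ISO-style output strings are all assumptions.

```python
from datetime import date, timedelta


def resolve(expression: str, anchor: date) -> str:
    """Resolve a relative temporal expression against an anchor date."""
    text = expression.lower()
    if text == "two days ago":
        return (anchor - timedelta(days=2)).isoformat()
    if text == "last year":
        return str(anchor.year - 1)
    if text == "next month":
        if anchor.month == 12:
            year, month = anchor.year + 1, 1
        else:
            year, month = anchor.year, anchor.month + 1
        return f"{year:04d}-{month:02d}"
    raise ValueError(f"unsupported expression: {expression!r}")


anchor = date(2002, 9, 10)  # the date of this talk, used as a sample anchor
for expression in ("Two days ago", "Last year", "Next month"):
    print(expression, "->", resolve(expression, anchor))
```

The same expression yields a different value under a different anchor, which is why the anchor has to be identified in the document before values can be computed.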
TLINK
TLINK or Temporal Link represents the temporal relationship holding between events or
between an event and a time, and establishes a link between the involved entities,
making explicit if they are:
1. Simultaneous (happening at the same time)
2. Identical (referring to the same event):
   John drove to Boston. During his drive he ate a donut.
3. One before the other:
   The police looked into the slayings of 14 women. In six of the cases suspects
   have already been arrested.
4. One after the other:
5. One immediately before the other:
   All passengers died when the plane crashed into the mountain.
6. One immediately after the other:
7. One including the other:
   John arrived in Boston last Thursday.
8. One being included in the other:
9. One holding during the duration of the other:
10. One being the beginning of the other:
    John was in the gym between 6:00 p.m. and 7:00 p.m.
11. One being begun by the other:
12. One being the ending of the other:
    John was in the gym between 6:00 p.m. and 7:00 p.m.
13. One being ended by the other:
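Most of the relations in this list come in converse pairs, so a link stated in one direction implies a link in the other. The sketch below encodes that pairing. Only `IS_INCLUDED` is attested in the annotated examples of this talk; the other relation names follow common TimeML usage and should be treated as assumptions. The "holding during" relation is left out because its converse is not given in the list.

```python
# Converse pairs for items 1-8 and 10-13 of the list above.
INVERSE = {
    "SIMULTANEOUS": "SIMULTANEOUS",
    "IDENTITY": "IDENTITY",
    "BEFORE": "AFTER",
    "AFTER": "BEFORE",
    "IBEFORE": "IAFTER",
    "IAFTER": "IBEFORE",
    "INCLUDES": "IS_INCLUDED",
    "IS_INCLUDED": "INCLUDES",
    "BEGINS": "BEGUN_BY",
    "BEGUN_BY": "BEGINS",
    "ENDS": "ENDED_BY",
    "ENDED_BY": "ENDS",
}


def invert(link: tuple) -> tuple:
    """Turn (source, relation, target) into the equivalent (target, converse, source)."""
    source, relation, target = link
    return (target, INVERSE[relation], source)


# "John arrived in Boston last Thursday": the arrival is included in the Thursday.
print(invert(("ei1", "IS_INCLUDED", "t1")))
```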
SLINK
SLINK or Subordination Link is used for contexts introducing relations between two
events, or an event and a signal, of the following sort:
1. Modal: Relation introduced mostly by modal verbs (should, could, would, etc.) and
events that introduce a reference to a possible world --mainly I_STATEs:
John should have bought some wine.
Mary wanted John to buy some wine.
2. Factive: Certain verbs introduce an entailment (or presupposition) of the argument's
veracity. They include forget (with a tensed complement), regret, and manage:
John forgot that he was in Boston last year.
Mary regrets that she didn't marry John.
John managed to leave the party.
3. Counterfactive: The event introduces a presupposition about the non-veracity of its
argument: forget (to), unable to (in past tense), prevent, cancel, avoid, decline, etc.
John forgot to buy some wine.
Mary was unable to marry John.
John prevented the divorce.
4. Evidential: Evidential relations are introduced by REPORTING or PERCEPTION:
John said he bought some wine.
Mary saw John carrying only beer.
5. Negative evidential: Introduced by REPORTING (and PERCEPTION?) events
conveying negative polarity:
John denied he bought only beer.
6. Negative: Introduced only by negative particles (not, nor, neither, etc.), which will be
marked as SIGNALs, with respect to the events they are modifying:
John didn't forget to buy some wine.
John did not want to marry Mary.
ALINK
ALINK or Aspectual Link represents the relationship between an aspectual event
and its argument event. Examples of the possible aspectual relations we will
encode are:
1. Initiation:
John started to read.
2. Culmination:
John finished assembling the table.
3. Termination:
John stopped talking.
4. Continuation:
John kept talking.
SLINK
(15) Bill wants to teach on Monday.
Bill
<EVENT eid="e1" class="I_STATE" tense="PRESENT" aspect="NONE">
wants
</EVENT>
<MAKEINSTANCE eiid="ei1" eventID="e1"/>
<SLINK eventInstanceID="ei1" signalID="s1" subordinatedEvent="e2" relType="MODAL"/>
<SIGNAL sid="s1">
to
</SIGNAL>
<EVENT eid="e2" class="OCCURRENCE" tense="NONE" aspect="NONE">
teach
</EVENT>
<MAKEINSTANCE eiid="ei2" eventID="e2"/>
<SIGNAL sid="s2">
on
</SIGNAL>
<TIMEX3 tid="t1" type="DATE" temporalFunction="true" value="XXXX-WXX-1">
Monday
</TIMEX3>
<TLINK eventInstanceID="ei2" relatedToTime="t1" relType="IS_INCLUDED"/>
ALINK
(18) The search party stopped looking for the survivors.
The search party
<EVENT eid="e1" class="ASPECTUAL" tense="PAST" aspect="NONE">
stopped
</EVENT>
<MAKEINSTANCE eiid="ei1" eventID="e1"/>
<EVENT eid="e2" class="OCCURRENCE" tense="NONE" aspect="PROGRESSIVE">
looking
</EVENT>
<ALINK eventInstanceID="ei1" relatedToEvent="e2" relType="TERMINATES"/>
for the survivors
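Once a text carries link tags like these, the links can be collected into a list of edges, which is the starting point for a temporal graph. The sketch below reads a condensed copy of the "Bill wants to teach on Monday" annotation. The `<doc>` wrapper and the helper name `collect_links` are assumptions for illustration; the attribute names are the ones used in the examples above.

```python
import xml.etree.ElementTree as ET

# Condensed copy of the SLINK example, with the link tags kept intact.
ANNOTATED = (
    'Bill <EVENT eid="e1" class="I_STATE" tense="PRESENT" aspect="NONE">wants</EVENT>'
    '<MAKEINSTANCE eiid="ei1" eventID="e1"/>'
    '<SLINK eventInstanceID="ei1" signalID="s1" subordinatedEvent="e2" relType="MODAL"/>'
    '<SIGNAL sid="s1">to</SIGNAL>'
    '<EVENT eid="e2" class="OCCURRENCE" tense="NONE" aspect="NONE">teach</EVENT>'
    '<MAKEINSTANCE eiid="ei2" eventID="e2"/>'
    '<SIGNAL sid="s2">on</SIGNAL>'
    '<TIMEX3 tid="t1" type="DATE" temporalFunction="true" value="XXXX-WXX-1">Monday</TIMEX3>'
    '<TLINK eventInstanceID="ei2" relatedToTime="t1" relType="IS_INCLUDED"/>'
)

LINK_TAGS = {"TLINK", "SLINK", "ALINK"}
TARGET_ATTRIBUTES = ("subordinatedEvent", "relatedToEvent", "relatedToTime")


def collect_links(fragment: str) -> list:
    """Return (tag, source, relType, target) for each link element, in document order."""
    root = ET.fromstring(f"<doc>{fragment}</doc>")  # artificial root element
    links = []
    for element in root.iter():
        if element.tag in LINK_TAGS:
            # The target lives in a different attribute depending on the link type.
            target = next((element.get(name) for name in TARGET_ATTRIBUTES
                           if element.get(name) is not None), None)
            links.append((element.tag, element.get("eventInstanceID"),
                          element.get("relType"), target))
    return links


for link in collect_links(ANNOTATED):
    print(link)
```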
Multi-Document TimeML
Annotation for Summarization
Even this simple summary
is only possible using TimeML
Multi-doc TimeML anchors single-doc events, and
merges events across multiple docs (via TimeML graphs)
TimeML for Multi-lingual
Information Access
• Extend to multilingual annotation (re:
TIMEX2 results on Spanish, French, and
Korean)
• Address translation of specialized TimeML
constructs
Open Problems in LKB Design
• Robust acquisition of semantic classes
– Classes modifiable by composition/context
• Persistence and Entailed Events:
– The terrorists kidnapped the journalist.
– The President resigned.
• Event Normalization and Quantification:
– Three deaths occurred.
– Three people died.
• Generalizing the Treatment of Negation:
– No survivors were found.
– The plane did not crash.
Conclusion
The Open Texture of Words
• Language is constructed by partial generating
functions.
• There is inherent incompleteness of terms in
language
• Richer modes of composition are used in
determining sense and fixing reference
• Corpus data and statistical techniques determine
the texture and completeness of the language in
use.
Acknowledgements
• Brandeis University
– José Castaño
– Wei Luo
– Roser Saurí
– Anna Rumshisky
– James Pustejovsky
– medstract.org
Supported by
• Tufts University
– Maciej Kotecki
– Brent Cochran
• TERQAS Workshop
– time2002.org