Annotation Types for UIMA

Edward Loper
UIMA
• Unstructured Information Management Architecture
• Analytics framework
– Consists of components that perform specific tasks
(tagging, parsing, etc.)
– Each component declares its own interface
(input/output, requirements, workflow metadata, etc.)
– All information is communicated using a single
standard data format: CAS
– Built-in support for network distribution, clustering,
etc.
CAS
• Common Analysis Structure
• Tends to fall on the “weakly-merged” side of the
spectrum (does not require annotations to be
modified to ensure consistency).
• Annotations are encoded using typed feature
structures.
• But the type definitions are left unspecified.
• Cf. XML
• Components can only work together if they use
the same type system.
Standard CAS Types
• Goal: design standard CAS types for ULA
annotations.
– In particular, we’re currently looking at
Treebank, Propbank, & Timebank.
• Issues:
– Redundancy of information
– Coupling between annotations
– Discontinuous constituents
CAS Types: background
• UIMA does provide a couple of top-level
types. (e.g. Annotation)
• These make it clear that UIMA intends:
– Standoff annotations…
defined using spans…
with character-based offsets
• Cf. AGTK
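A minimal sketch of what a standoff, span-based annotation with character offsets looks like. The class below is illustrative Python, not UIMA's API; the feature names begin and end mirror those of UIMA's built-in Annotation type, while covered_text and the example sentence are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Standoff annotation: a typed span that points into the text."""
    type: str
    begin: int  # character offset of the first covered character (inclusive)
    end: int    # character offset just past the last covered character

    def covered_text(self, document: str) -> str:
        # The annotation never stores or modifies the text itself.
        return document[self.begin:self.end]


document = "The cat sat on the mat."
noun_phrase = Annotation("NP", 0, 7)
print(noun_phrase.covered_text(document))
```

Because the text is never altered, several annotation layers can point into the same document independently.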
Treebank
• Typical representation for treebank:
<TreeBankConstituent id="8"
start="5" end="23" type="NP"
children="12 28 38" parent="94">
• Questions:
– Should children be explicitly marked?
– Should parents be explicitly marked?
• These questions have consequences…
Treebank: Explicit children?
• How could we not mark children?
• They can be mostly reconstructed, if we assume…
– All constituents are properly nested
– Unary branch direction can be determined based on
node type.
• Not quite true: SBAR/FRAG; S/NP; NP/FRAG; NP/PRN.
• Theoretical consequences of (not) marking
children.
– Have to assume proper nesting of constituents
– Alternatively, allow for multiple coexisting bracketings
(a la chart parse) -- probably not what we want.
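One way the reconstruction could work, sketched under the proper-nesting assumption stated above. The function name and the (id, start, end) tuple layout are hypothetical. Constituents with identical spans (unary branches) come out in input order, which is exactly the case where node type would be needed to decide the direction.

```python
def reconstruct_children(constituents):
    """Rebuild child lists from spans alone.

    constituents: list of (id, start, end) tuples whose spans are
    assumed to be properly nested (no crossing brackets).
    Returns a dict mapping each id to the ids of its children.
    """
    # Enclosing spans first: start ascending, then end descending.
    ordered = sorted(constituents, key=lambda c: (c[1], -c[2]))
    children = {cid: [] for cid, _, _ in constituents}
    stack = []  # chain of currently open ancestors
    for cid, start, end in ordered:
        # Close every constituent that does not enclose the current span.
        while stack and not (stack[-1][1] <= start and end <= stack[-1][2]):
            stack.pop()
        if stack:
            children[stack[-1][0]].append(cid)
        stack.append((cid, start, end))
    return children


spans = [(1, 0, 23), (2, 0, 7), (3, 8, 23), (4, 11, 23)]
print(reconstruct_children(spans))
```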
Treebank: Explicit parents?
• Parent pointers are redundant -- they can be
reconstructed.
• But it can be very handy to have when
working with structures.
• Theoretical consequence of marking
parents:
– Every constituent has exactly one parent.
– Rules out multi-parented trees. (fine.)
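A short sketch of why parent pointers are redundant once children are marked: they can be derived by inverting the child lists. The function name is hypothetical; the ids reuse the example constituent shown earlier (id 8, children 12 28 38, parent 94). The check also enforces the "exactly one parent" consequence.

```python
def parent_map(children):
    """Invert a dict of id -> child ids into a dict of child id -> parent id."""
    parents = {}
    for parent_id, child_ids in children.items():
        for child_id in child_ids:
            if child_id in parents:
                # A second parent would make this a multi-parented tree.
                raise ValueError(f"constituent {child_id} has two parents")
            parents[child_id] = parent_id
    return parents


children = {94: [8], 8: [12, 28, 38]}
print(parent_map(children))
```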
Propbank
• Propbank’s current annotation…
– Is strongly coupled to treebank
• Argument locations are specified using “tree
pointers”
– Includes trace chain information
Propbank: Tree Pointers
• Each propbank argument is specified using a
tree pointer w:h
– The hth constituent above the wth word.
• Problems with this strong coupling:
– Propbank can’t be used without trees.
– New propbanking can’t be done unless parsing has
been done.
– Changes to trees are annoying to propagate to propbank.
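A sketch of how a w:h tree pointer might be resolved, which also shows why a tree is required. The Node class and function names are hypothetical. The sketch assumes words are counted from 0 and that height 0 is the part-of-speech node directly above the word; both conventions are assumptions here, not something the slides specify.

```python
class Node:
    def __init__(self, label, children=(), word=None):
        self.label = label
        self.children = list(children)
        self.word = word      # set only on part-of-speech (preterminal) nodes
        self.parent = None
        for child in self.children:
            child.parent = self


def preterminals(node):
    """Return the part-of-speech nodes under node, in sentence order."""
    if node.word is not None:
        return [node]
    found = []
    for child in node.children:
        found.extend(preterminals(child))
    return found


def resolve(root, pointer):
    """Resolve 'w:h': start above word w and climb h levels."""
    w, h = (int(part) for part in pointer.split(":"))
    node = preterminals(root)[w]
    for _ in range(h):
        if node.parent is None:
            raise ValueError(f"pointer {pointer} climbs above the root")
        node = node.parent
    return node


tree = Node("S", [
    Node("NP", [Node("DT", word="The"), Node("NN", word="cat")]),
    Node("VP", [Node("VBD", word="sat")]),
])
print(resolve(tree, "1:1").label)
```

Any edit to the tree (a re-attached node, an added unary level) can silently change what the same pointer selects, which is the propagation problem noted above.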
Propbank: spans
• Can we get away with using spans instead
(UIMA’s preferred approach)?
• Do we lose any information?
– Potentially yes -- for binary branching nodes.
– In practice:
• 99.92% of non-trace args select the low constituent.
• 97.9% of trace args select the high constituent.
• The differences appear to just be errors.
• … so no (important) lost info!
• About 50-55% of split arguments go away.
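A sketch of where a span can be less specific than a tree pointer: several constituents may cover exactly the same span, and the span alone does not say which one was meant. The dict layout, the depth field, and the function name are assumptions made for illustration.

```python
def constituents_for_span(constituents, start, end):
    """Return every constituent covering exactly [start, end), highest first."""
    matches = [c for c in constituents
               if c["start"] == start and c["end"] == end]
    return sorted(matches, key=lambda c: c["depth"])


constituents = [
    {"label": "S", "start": 0, "end": 3, "depth": 0},
    {"label": "NP", "start": 0, "end": 2, "depth": 1},
    {"label": "VP", "start": 2, "end": 3, "depth": 1},
    {"label": "VBD", "start": 2, "end": 3, "depth": 2},
]
matches = constituents_for_span(constituents, 2, 3)
high, low = matches[0], matches[-1]
print(high["label"], low["label"])
```

Given the figures above, a default rule (low constituent for non-trace arguments, high constituent for trace arguments) would recover nearly all of the original pointers.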
Propbank: trace chains
• For arguments that have undergone
movement, propbank explicitly marks the
trace chain.
– But isn’t this something the tree should give us
anyway?
– Treebank & propbank have somewhat different
notions of what gets included in trace chains.
• 1/3 of the Propbank annotation guidelines talk about
null elements.
Propbank: trace chains
• How much can we recover?
– Using very simple heuristics (e.g., link “NP-2”
with “*t*-2”), ~60%
– Using more advanced heuristics, maybe 80%.
– Not close enough to 100% to throw them away.
– Some differences harder to automate: e.g.,
propbank (usually) only marks traces that
interact with the predicate in some way.
• “Asbestos_i was used t_i … and replaced t_i …”
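A sketch of the "very simple heuristic" mentioned above: link each trace to the constituent carrying the same numeric index. The regular expressions, the (node id, label) input layout, and the function name are assumptions; this is not the procedure behind the ~60% figure.

```python
import re
from collections import defaultdict

# Co-indexed constituent labels such as "NP-2" or "NP-SBJ-1".
INDEXED_LABEL = re.compile(r"^[A-Z]+(?:-[A-Z]+)*-(?P<index>\d+)$")
# Indexed null elements such as "*T*-2" or "*-1".
TRACE = re.compile(r"^\*(?:[A-Za-z]+\*)?-(?P<index>\d+)$")


def link_traces(nodes):
    """nodes: list of (node_id, label). Returns (antecedent_id, trace_id) pairs."""
    antecedents = {}
    traces = defaultdict(list)
    for node_id, label in nodes:
        trace = TRACE.match(label)
        if trace:
            traces[int(trace.group("index"))].append(node_id)
            continue
        indexed = INDEXED_LABEL.match(label)
        if indexed:
            antecedents[int(indexed.group("index"))] = node_id
    return [(antecedents[index], trace_id)
            for index in sorted(traces) if index in antecedents
            for trace_id in traces[index]]


nodes = [(0, "NP-SBJ-1"), (3, "*-1"), (5, "WHNP-2"), (8, "*T*-2"), (9, "NP")]
print(link_traces(nodes))
```

A heuristic like this links every co-indexed trace, so it cannot reproduce the restriction to traces that interact with the predicate.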
Propbank: trace chains
(?s for discussion)
• Should marking trace chains be part of the
propbanking task?
– Or should we leave it up to the treebankers?
• If it should be part of propbanking, should it
be split off as a separate subtask?
– Would that help annotation speed any?
• Should the annotation be split off as a
separate layer?
Discontinuous constituents
• Propbank has provisions for discontinuous
constituents: w1:h1,w2:h2
• Discontinuous constituents can appear
almost anywhere
– Temporal expressions
– Named entities
– Parse constituents (?)
• Want: a uniform way to handle them.
Discontinuous constituents
• Goals:
– Make the common case easy
– Make the uncommon case possible
• Preferred approach:
– Add an optional property (e.g. “pieces”) that can be used
to specify discontinuous chunks.
– If used, then the start/end properties should be treated
with appropriate care
• Open question:
– Should this property be defined on the top-level type, or
on individual types (e.g. PropBankArgument)?
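A sketch of the preferred approach: an optional pieces property that most annotations leave unset. The class name, the method names, and the convention that start/end hold the overall extent when pieces is used are all assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SpanAnnotation:
    type: str
    start: int  # overall extent; not every character inside need be covered
    end: int
    pieces: Optional[List[Tuple[int, int]]] = None  # set only if discontinuous

    def spans(self):
        # Common case: one contiguous span. Uncommon case: explicit pieces.
        return list(self.pieces) if self.pieces else [(self.start, self.end)]

    def covered_text(self, document):
        return " ".join(document[s:e] for s, e in self.spans())


document = "He looked the word up."
contiguous = SpanAnnotation("NP", 10, 18)
split = SpanAnnotation("VERB", 3, 21, pieces=[(3, 9), (19, 21)])
print(contiguous.covered_text(document))
print(split.covered_text(document))
```

Code that reads start and end directly would treat "looked the word up" as covered, which is why those properties need care once pieces is set.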
A note on consistency
• CAS is “weakly merged” -- it doesn’t enforce
consistency.
• But that doesn’t mean we can’t enforce consistency
ourselves.
• For weakly merged formats, it will be important to:
– Define consistencies that we want
• Both within annotations & between annotations
– Actively check those consistencies during annotation.
• Weakly coupled annotations are a good thing.
– But the more weakly coupled the annotations are, the
more we’ll need to check consistency.
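A sketch of one between-annotation consistency check: every span-based propbank argument should line up with some treebank constituent. The function name and the (start, end) tuple representation are hypothetical, and this particular rule is just one example of a consistency one might choose to define.

```python
def unaligned_arguments(argument_spans, constituent_spans):
    """Return the argument spans that match no constituent span exactly."""
    constituents = set(constituent_spans)
    return [span for span in argument_spans if span not in constituents]


constituent_spans = [(0, 7), (8, 23), (11, 23)]
argument_spans = [(0, 7), (11, 22)]
print(unaligned_arguments(argument_spans, constituent_spans))
```

Run during annotation, a check like this would flag the second argument for review without forcing either layer to be modified.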
Questions/discussion
• Strongly vs weakly merged
• (when) is redundancy good?
• How strongly coupled should annotations be?
• Handling discontinuous constituents?
• Where is there information overlap between
annotations (e.g. coref chains)? What should be
done about it?
• Any principled way to decide when to mark heads
vs spans?
• Token offset vs character offset vs tree pointer