MASC
The Manually Annotated SubCorpus of American English
Nancy Ide, Collin Baker, Christiane Fellbaum,
Charles Fillmore, Rebecca Passonneau
MASC
• Manually Annotated Sub-Corpus
• NSF-funded project to provide a sharable, reusable annotated
resource with rich linguistic annotations
• Vassar, ICSI, Columbia, Princeton
• texts from diverse genres
• manual annotations or manually-validated annotations for multiple
levels
– WordNet senses
– FrameNet frames and frame elements
– shallow parses
– named entities
• Enables linking WordNet senses and FrameNet frames into more complex
semantic structures
• Enriches semantic and pragmatic information
• detailed inter-annotator agreement measures
Contents
• Texts drawn from the Open ANC
– Several genres
• Written (travel guides, blog, fiction, letters, newspaper,
non-fiction, technical, journal, government documents)
• Spoken (face-to-face, academic, telephone)
– Free of license restrictions, redistributable
– Download from ANC website
• All MASC data and annotations will be freely
downloadable
Annotation Process
• Smaller portions of the sub-corpus manually annotated for specific
phenomena
– Maintain representativeness
– Include as many annotations of different types as possible
• Apply (semi)-automatic annotation techniques to determine the reliability
of their results
• Study inter-annotator agreement on manually-produced annotations
– Determine benchmark of accuracy
– Fine-tune annotator guidelines
• Consider if accurate annotations for one phenomenon can improve
performance of automatic annotation systems for another
– e.g., validated WN sense tags and noun chunks may improve automatic
semantic role labeling
Process (continued)
• Apply iterative process to maximize
performance of automatic taggers:
– Manual annotation
– Retrain automatic annotation software
• Improved annotation software can later be
applied to the entire ANC
– Provide more accurate automatically-produced
annotation of full corpus
Composition Relative to Whole OANC
[Figure: composition of MASC relative to the whole OANC. Labels: training examples; WordNet annotations; FrameNet and WordNet full annotation; genre-representative core with validated entity and shallow parse annotations; WSJ with PropBank, NomBank, PTB, TimeBank and PDTB annotations]
MASC Core
• Includes
– 25K fully annotated (“all words”) for FrameNet
frames and WordNet senses
– ~40K corpus annotated by Unified Linguistic
Annotation project
• PropBank, NomBank, Penn Treebank, Penn Discourse
Treebank, TimeBank
– Small subset of WSJ with many annotations
• Other annotations rendered into GrAF for
compatibility
Representation
• ISO TC37 SC4 Linguistic Annotation Framework
– Graph of feature structures (GrAF)
– isomorphic to other feature structure-based representations
(e.g. UIMA CAS)
• Each annotation in a separate stand-off document linked to
primary data or other annotations
• Merge annotations with ANC API
– Output in any of several formats
• XML
• non-XML for use with systems such as NLTK and concordancing tools
• UIMA CAS
• Input to GraphViz
• …
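As a rough illustration of the stand-off idea on this slide, the sketch below leaves the primary text untouched and keeps each annotation layer as separate nodes that point at character spans. The names (Node, merge_layers, covered_text) and the sample sentence are hypothetical; this is not the GrAF schema or the ANC API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One annotation: a character span in the primary text plus features."""
    start: int
    end: int
    layer: str
    features: dict = field(default_factory=dict)


# Primary data; it is never modified by annotation (sample text is made up).
PRIMARY = "Vassar hosts the corpus."

# Each layer would live in its own stand-off document; here they are lists.
entities = [Node(0, 6, "ne", {"type": "org"})]
chunks = [
    Node(0, 6, "chunk", {"cat": "NP"}),
    Node(13, 23, "chunk", {"cat": "NP"}),
]


def merge_layers(*layers):
    """Combine layers into one list ordered by span, then by layer name."""
    merged = [node for layer in layers for node in layer]
    return sorted(merged, key=lambda n: (n.start, n.end, n.layer))


def covered_text(node, text=PRIMARY):
    """Return the stretch of primary text that a node points at."""
    return text[node.start:node.end]


if __name__ == "__main__":
    for node in merge_layers(entities, chunks):
        print(node.layer, repr(covered_text(node)), node.features)
```

Because the layers only refer to offsets, a new layer can be added or dropped without touching the text or the other layers.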
WordNet annotation
• Updating WSD systems to use WordNet version 3.0
– Pedersen’s SenseRelate
– Mihalcea et al.’s SenseLearner
• Apply to automatically assign WN sense tags to all
content words (nouns, verbs, adjectives, and
adverbs) in the entire OANC
• Manually validate a set of words from whole OANC
• Manually validate all words in 25K FN-annotated
subset
FrameNet Annotation
• Full manual annotation of 25K in FrameNet full-text manner
• Application of automatic semantic role labeling
software over entire MASC
• Improve automatic semantic role labeling (ASRL)
– Use active learning
• ASRL system results evaluated to determine where the most
errors occur
• Extra manual annotation done to improve performance
– Draw from entire OANC, possibly even other sources for
examples
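The step of finding where the most errors occur can be sketched as a per-frame error count over validated output. This is a simplified, error-driven selection rather than the project's actual active-learning setup; the function names and the sample results are illustrative assumptions.

```python
from collections import Counter


def error_rate_by_frame(results):
    """results: iterable of (frame, gold_role, predicted_role) triples."""
    totals, errors = Counter(), Counter()
    for frame, gold_role, predicted_role in results:
        totals[frame] += 1
        if gold_role != predicted_role:
            errors[frame] += 1
    return {frame: errors[frame] / totals[frame] for frame in totals}


def frames_to_annotate(results, k=1):
    """Return the k frames with the highest error rate (ties broken by name)."""
    rates = error_rate_by_frame(results)
    return sorted(rates, key=lambda frame: (-rates[frame], frame))[:k]


# Made-up labeler output checked against manual annotation.
sample = [
    ("Motion", "Theme", "Theme"),
    ("Motion", "Goal", "Source"),
    ("Commerce_buy", "Buyer", "Buyer"),
    ("Commerce_buy", "Goods", "Goods"),
]

if __name__ == "__main__":
    print(error_rate_by_frame(sample))
    print(frames_to_annotate(sample, k=1))
```

The frames returned are the ones for which extra manually annotated examples would be collected before retraining.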
Alignment of Lexical Resources
• Concurrent project investigating how and to
what extent WordNet and FrameNet can be
aligned
• MASC annotations of 25K for FrameNet
frames and frame elements and WordNet
senses provide a ready-made testing ground
Interannotator agreement
• Use a suite of metrics that measure
different characteristics
– Interannotator agreement coefficients such
as Cohen’s Kappa
– Average F-measure to determine
proportion of the annotated data all
annotators agree on
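A minimal sketch of the two kinds of measure, assuming two annotators labeling the same items for Cohen's kappa, and F-measure computed pairwise over each annotator's set of annotations and then averaged. The pairwise averaging is one common formulation and an assumption here; the slides do not specify the exact computation.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_o - p_e) / (1 - p_e)


def f_measure(set_a, set_b):
    """Balanced F-measure between two sets of annotations."""
    if not set_a and not set_b:
        return 1.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def average_f(annotation_sets):
    """Mean F-measure over all pairs of annotators (needs at least two)."""
    pairs = list(combinations(annotation_sets, 2))
    return sum(f_measure(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    print(cohens_kappa(["s1", "s1", "s2", "s2"], ["s1", "s2", "s2", "s2"]))
    print(average_f([{1, 2}, {1}, {1, 2}]))
```

Kappa corrects raw agreement for chance, while F-measure also applies when annotators mark different spans and there is no fixed set of items to compare.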
IAA
• Determine impact of these two measures
– consider the relation between the agreement
coefficient values / F-measure and potential users of
the planned annotations
• Simultaneous investigations of interannotator
agreement and measurable results of using
different annotations of the same data provide a
stronger picture of the integrity of annotated
data (Passonneau et al. 2005; Passonneau et al.
2006)
Overall Goal
• Continually augment MASC with contributed
annotations from the research community
– Discourse structure, additional entities, events,
opinions, etc.
• Distribution of effort and integration of currently
independent resources such as the ANC, WordNet, and
FrameNet will enable progress in resource
development
– Less cost
– No duplication of effort
– Greater degree of accuracy and usability
– Harmonization
Conclusion
• MASC will provide a much-needed resource for computational
linguistics research aimed at the development of robust
language processing systems
• MASC’s availability should have a major impact on the speed
with which similar resources can be reliably annotated
• MASC will be the largest semantically annotated corpus of
English in existence
• WN and FN annotation of the MASC will immediately create a
massive multi-lingual resource network
– Both WN and FN linked to corresponding resources in
other languages
– No existing resource approaches this scope