The Manually Annotated Sub-Corpus of American English
Nancy Ide, Collin Baker, Christiane Fellbaum,
Charles Fillmore, Rebecca Passonneau
• Manually Annotated Sub-Corpus
• NSF-funded project to provide a sharable, reusable annotated
resource with rich linguistic annotations
• Vassar, ICSI, Columbia, Princeton
• texts from diverse genres
• manual annotations or manually-validated annotations for multiple
annotation types
– WordNet senses
– FrameNet frames and frame elements
– shallow parses
– named entities
• Enables linking WordNet senses and FrameNet frames into more complex
semantic structures
• Enriches semantic and pragmatic information
• detailed inter-annotator agreement measures
• Texts drawn from the Open ANC
– Several genres
• Written (travel guides, blog, fiction, letters, newspaper,
non-fiction, technical, journal, government documents)
• Spoken (face-to-face, academic, telephone)
– Free of license restrictions, redistributable
– Download from ANC website
• All MASC data and annotations will be freely available
Annotation Process
• Smaller portions of the sub-corpus manually annotated for specific
– Maintain representativeness
– Include as many annotations of different types as possible
• Apply (semi)-automatic annotation techniques to determine the reliability
of their results
• Study inter-annotator agreement on manually-produced annotations
– Determine benchmark of accuracy
– Fine-tune annotator guidelines
• Consider if accurate annotations for one phenomenon can improve
performance of automatic annotation systems for another
– e.g., validated WN sense tags and noun chunks may improve automatic
semantic role labeling
Process (continued)
• Apply iterative process to maximize
performance of automatic taggers:
– Manual annotation
– Retrain automatic annotation software
• Improved annotation software can later be
applied to the entire ANC
– Provide more accurate automatically-produced
annotation of full corpus
Composition Relative to Whole OANC
[Figure: composition of MASC relative to the whole OANC. Region labels:
training examples; WordNet annotations; FrameNet and WordNet full
annotation; genre-representative core with validated entity and shallow
parse annotations; WSJ with PropBank, NomBank, PTB, TimeBank, and PDTB
annotations]
• Includes
– 25K fully annotated (“all words”) for FrameNet
frames and WordNet senses
– ~40K corpus annotated by Unified Linguistic
Annotation project
• PropBank, NomBank, Penn Treebank, Penn Discourse
Treebank, TimeBank
– Small subset of WSJ with many annotations
• Other annotations rendered into GrAF for
• ISO TC37 SC4 Linguistic Annotation Framework
– Graph of feature structures (GrAF)
– isomorphic to other feature structure-based representations
(e.g. UIMA CAS)
• Each annotation in a separate stand-off document linked to
primary data or other annotations
• Merge annotations with ANC API
– Output in any of several formats
• Non-XML for use with systems such as NLTK and concordancing tools
• Input to GraphViz
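A minimal sketch of the stand-off idea described above, assuming a toy text and made-up labels. It is not the ANC API or the GrAF XML format: `Region`, `Node`, `merge_layers`, and `to_inline` are hypothetical names, and the sense and frame values are placeholders. Each annotation layer lives apart from the primary data and points into it by character offsets; merging collects the feature structures that share a region and renders them in a simple non-XML form.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    """Character span in the primary data (end is exclusive)."""
    start: int
    end: int


@dataclass
class Node:
    """One annotation: a feature structure anchored to a region."""
    node_id: str
    region: Region
    features: dict = field(default_factory=dict)


def merge_layers(layers):
    """Merge stand-off layers into {region: {layer_name: features}}."""
    merged = {}
    for layer_name, nodes in layers.items():
        for node in nodes:
            merged.setdefault(node.region, {})[layer_name] = dict(node.features)
    return merged


def to_inline(text, merged):
    """Render merged annotations as tab-separated lines, one per region."""
    lines = []
    for region in sorted(merged, key=lambda r: (r.start, r.end)):
        token = text[region.start:region.end]
        feats = "; ".join(
            f"{layer}:{key}={value}"
            for layer, fs in sorted(merged[region].items())
            for key, value in sorted(fs.items())
        )
        lines.append(f"{token}\t{feats}")
    return "\n".join(lines)


# Toy primary data and three separate layers (all labels are placeholders).
text = "Ide visited Vassar"
layers = {
    "ne": [Node("ne1", Region(0, 3), {"type": "person"}),
           Node("ne2", Region(12, 18), {"type": "org"})],
    "wordnet": [Node("wn1", Region(4, 11), {"sense": "SENSE_A"})],
    "framenet": [Node("fn1", Region(4, 11), {"frame": "FRAME_X"})],
}
print(to_inline(text, merge_layers(layers)))
```

Because each layer only references offsets, a new layer can be added or dropped without touching the primary text or the other layers.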
WordNet annotation
• Updating WSD systems to use WordNet version 3.0
– Pedersen’s SenseRelate
– Mihalcea et al.’s SenseLearner
• Apply to automatically assign WN sense tags to all
content words (nouns, verbs, adjectives, and
adverbs) in the entire OANC
• Manually validate a set of words from whole OANC
• Manually validate all words in 25K FN-annotated
FrameNet Annotation
• Full manual annotation of 25K in FrameNet full-text manner
• Application of automatic semantic role labeling
software over entire MASC
• Improve automatic semantic role labeling (ASRL)
– Use active learning
• ASRL system results evaluated to determine where the most
errors occur
• Extra manual annotation done to improve performance
– Draw from entire OANC, possibly even other sources for
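The error-driven step above can be sketched as follows. This is an illustrative stand-in for active learning, not the project's actual procedure: the function names, the `(category, is_correct)` input shape, and the frame labels are all assumptions. It ranks categories by the error rate of the automatic labeler and picks unannotated items from the worst categories for extra manual annotation.

```python
from collections import Counter


def error_rates(evaluated):
    """evaluated: iterable of (category, is_correct) pairs from a scored run."""
    totals, errors = Counter(), Counter()
    for category, is_correct in evaluated:
        totals[category] += 1
        if not is_correct:
            errors[category] += 1
    return {category: errors[category] / totals[category] for category in totals}


def pick_for_manual_annotation(pool, rates, budget):
    """pool: list of (item_id, category); return up to `budget` item ids,
    highest-error categories first. Unseen categories count as rate 0.0."""
    ranked = sorted(pool, key=lambda item: rates.get(item[1], 0.0), reverse=True)
    return [item_id for item_id, _ in ranked[:budget]]


# Hypothetical evaluation results and candidate pool.
evaluated = [("frame_A", True), ("frame_A", False),
             ("frame_B", True), ("frame_B", True),
             ("frame_C", False)]
pool = [("s1", "frame_B"), ("s2", "frame_A"), ("s3", "frame_C"), ("s4", "frame_A")]

rates = error_rates(evaluated)
print(pick_for_manual_annotation(pool, rates, budget=2))
```

After the selected items are annotated by hand, the labeler would be retrained and the evaluation repeated.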
Alignment of Lexical Resources
• Concurrent project investigating how and to
what extent WordNet and FrameNet can be aligned
• MASC annotations of 25K for FrameNet
frames and frame elements and WordNet
senses provide a ready-made testing ground
Interannotator agreement
• Use a suite of metrics that measure
different characteristics
– Interannotator agreement coefficients such
as Cohen’s Kappa
– Average F-measure to determine
proportion of the annotated data all
annotators agree on
• Determine impact of these two measures
– consider the relation between the agreement
coefficient values / F-measure and potential users of
the planned annotations
• Simultaneous investigations of interannotator
agreement and measurable results of using
different annotations of the same data provide a
stronger picture of the integrity of annotated
data (Passonneau et al. 2005; Passonneau et al.
2006)
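The two kinds of measure named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's evaluation code: Cohen's kappa is computed for two annotators labeling the same items, and the F-measure treats each annotator's output as a set of annotations (here, hypothetical character spans) and averages over all annotator pairs.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def f_measure(set_a, set_b):
    """Harmonic mean of precision and recall, treating set_a as reference."""
    if not set_a and not set_b:
        return 1.0
    overlap = len(set_a & set_b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(set_b)
    recall = overlap / len(set_a)
    return 2 * precision * recall / (precision + recall)


def average_f_measure(annotation_sets):
    """Mean pairwise F-measure over all annotator pairs."""
    pairs = list(combinations(annotation_sets, 2))
    return sum(f_measure(a, b) for a, b in pairs) / len(pairs)


# Hypothetical sense labels from two annotators on four tokens.
print(cohens_kappa(["s1", "s1", "s2", "s2"], ["s1", "s1", "s2", "s1"]))

# Hypothetical annotated spans from three annotators.
spans = [{(0, 3), (4, 11)}, {(0, 3), (12, 18)}, {(0, 3)}]
print(average_f_measure(spans))
```

Kappa needs a fixed set of items that every annotator labels, while the F-measure also works when annotators choose which spans to mark, which is one reason to report both.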
Overall Goal
• Continually augment MASC with contributed
annotations from the research community
• Discourse structure, additional entities, events,
opinions, etc.
• Distribution of effort and integration of currently
independent resources such as the ANC, WordNet, and
FrameNet will enable progress in resource
– Less cost
– No duplication of effort
– Greater degree of accuracy and usability
• MASC will provide a much-needed resource for computational
linguistics research aimed at the development of robust
language processing systems
• MASC’s availability should have a major impact on the speed
with which similar resources can be reliably annotated
• MASC will be the largest semantically annotated corpus of
English in existence
• WN and FN annotation of the MASC will immediately create a
massive multi-lingual resource network
– Both WN and FN linked to corresponding resources in
other languages
– No existing resource approaches this scope