Transcript Slide 1
MASC: The Manually Annotated Sub-Corpus of American English
Nancy Ide, Collin Baker, Christiane Fellbaum, Charles Fillmore, Rebecca Passonneau

MASC
• Manually Annotated Sub-Corpus
• NSF-funded project to provide a sharable, reusable annotated resource with rich linguistic annotations
• Vassar, ICSI, Columbia, Princeton
• Texts from diverse genres
• Manual annotations or manually validated annotations for multiple levels
  – WordNet senses
  – FrameNet frames and frame elements
  – Shallow parses
  – Named entities
• Enables linking WordNet senses and FrameNet frames into more complex semantic structures
• Enriches semantic and pragmatic information
• Detailed inter-annotator agreement measures

Contents
• Texts drawn from the Open ANC
  – Several genres
    • Written (travel guides, blogs, fiction, letters, newspaper, non-fiction, technical, journal, government documents)
    • Spoken (face-to-face, academic, telephone)
  – Free of license restrictions, redistributable
  – Downloadable from the ANC website
• All MASC data and annotations will be freely downloadable

Annotation Process
• Smaller portions of the sub-corpus manually annotated for specific phenomena
  – Maintain representativeness
  – Include as many annotations of different types as possible
• Apply (semi-)automatic annotation techniques and determine the reliability of their results
• Study inter-annotator agreement on manually produced annotations
  – Determine a benchmark of accuracy
  – Fine-tune annotator guidelines
• Consider whether accurate annotations for one phenomenon can improve the performance of automatic annotation systems for another
  – E.g., validated WordNet sense tags and noun chunks may improve automatic semantic role labeling

Process (continued)
• Apply an iterative process to maximize the performance of automatic taggers:
  – Manual annotation
  – Retrain automatic annotation software
• Improved annotation software can later be applied to the entire ANC
  – Provides more accurate automatically produced annotation of the full corpus

Composition Relative to Whole
[Figure: nested composition diagram — OANC; within it, MASC; within MASC, a genre-representative core with validated entity and shallow parse annotations; WordNet annotations; FrameNet and WordNet full annotation; training examples; and a WSJ subset with PropBank, NomBank, PTB, TimeBank, and PDTB annotations]

MASC Core
• Includes
  – 25K fully annotated ("all words") for FrameNet frames and WordNet senses
  – ~40K sub-corpus annotated by the Unified Linguistic Annotation project
    • PropBank, NomBank, Penn Treebank, Penn Discourse Treebank, TimeBank
  – Small subset of the WSJ with many annotations
• Other annotations rendered into GrAF for compatibility

Representation
• ISO TC37 SC4 Linguistic Annotation Framework
  – Graph of feature structures (GrAF)
  – Isomorphic to other feature-structure-based representations (e.g., UIMA CAS)
• Each annotation in a separate stand-off document linked to primary data or other annotations
• Merge annotations with the ANC API
  – Output in any of several formats
    • XML
    • Non-XML, for use with systems such as NLTK and concordancing tools
    • UIMA CAS
    • Input to GraphViz
    • …

WordNet Annotation
• Updating WSD systems to use WordNet version 3.0
  – Pedersen's SenseRelate
  – Mihalcea et al.'s SenseLearner
• Apply them to automatically assign WordNet sense tags to all content words (nouns, verbs, adjectives, and adverbs) in the entire OANC
• Manually validate a set of words from the whole OANC
• Manually validate all words in the 25K FrameNet-annotated subset

FrameNet Annotation
• Full manual annotation of 25K in the FrameNet full-text manner
• Application of automatic semantic role labeling software over the entire MASC
• Improve automatic semantic role labeling (ASRL)
  – Use active learning
    • ASRL system results evaluated to determine where the most errors occur
    • Extra manual annotation done to improve performance
  – Draw from the entire OANC, possibly even other sources, for examples

Alignment of Lexical Resources
• Concurrent project investigating how, and to what extent, WordNet and FrameNet can be aligned
• MASC annotations of the 25K for FrameNet frames and frame elements and WordNet senses provide a
ready-made testing ground

Interannotator Agreement
• Use a suite of metrics that measure different characteristics
  – Interannotator agreement coefficients such as Cohen's kappa
  – Average F-measure to determine the proportion of the annotated data on which all annotators agree

IAA
• Determine the impact of these two measures
  – Consider the relation between the agreement coefficient / F-measure values and potential users of the planned annotations
• Simultaneous investigation of interannotator agreement and measurable results of using different annotations of the same data provides a stronger picture of the integrity of annotated data (Passonneau et al. 2005; Passonneau et al. 2006)

Overall Goal
• Continually augment MASC with contributed annotations from the research community
  – Discourse structure, additional entities, events, opinions, etc.
• Distribution of effort and integration of currently independent resources such as the ANC, WordNet, and FrameNet will enable progress in resource development
  – Less cost
  – No duplication of effort
  – Greater degree of accuracy and usability
  – Harmonization

Conclusion
• MASC will provide a much-needed resource for computational linguistics research aimed at the development of robust language processing systems
• MASC's availability should have a major impact on the speed with which similar resources can be reliably annotated
• MASC will be the largest semantically annotated corpus of English in existence
• WordNet and FrameNet annotation of the MASC will immediately create a massive multilingual resource network
  – Both WordNet and FrameNet are linked to corresponding resources in other languages
  – No existing resource approaches this scope
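The stand-off, graph-of-feature-structures model described under "Representation" above can be sketched in a few lines of Python. This is a minimal illustration in the spirit of GrAF, not the actual ANC API or GrAF schema: the class names, the sample sentence, and the sense/frame labels are all invented for the example. The points it illustrates are from the slides: primary data is never modified, each annotation layer stands off from the text via character regions, and annotations may target other annotations rather than the text itself.

```python
# Illustrative sketch (NOT the ANC API or the GrAF XML format): a minimal
# stand-off annotation graph. Primary text stays untouched; annotation
# nodes point at character regions or at other annotation nodes.
from dataclasses import dataclass, field

primary = "The committee will meet on Friday."

@dataclass
class Region:
    """A span of the primary data, identified only by character offsets."""
    start: int
    end: int
    def text(self):
        return primary[self.start:self.end]

@dataclass
class Node:
    """An annotation node carrying a feature structure."""
    id: str
    label: str
    features: dict = field(default_factory=dict)
    targets: list = field(default_factory=list)  # Regions or other Nodes

# Layer 1: shallow parse (noun chunk), anchored directly to the text.
nc = Node("nc1", "NounChunk", {"cat": "NP"}, [Region(0, 13)])

# Layer 2: a WordNet sense tag on a token region (sense key is invented).
wn = Node("wn1", "WordNetSense", {"sense": "committee%1:14:00::"}, [Region(4, 13)])

# Layer 3: a FrameNet-style annotation that targets OTHER annotations,
# not the text itself -- this is the "graph" part of the model
# (frame and frame-element names here are illustrative only).
fn = Node("fn1", "FrameElement", {"frame": "Meeting", "fe": "Group"}, [wn, nc])

def anchored_text(node):
    """Follow targets down to regions and recover the covered text."""
    parts = []
    for t in node.targets:
        parts.append(t.text() if isinstance(t, Region) else anchored_text(t))
    return " | ".join(parts)

print(anchored_text(fn))  # the frame annotation resolves through two layers
```

Because every layer is a separate set of nodes over shared offsets, layers produced by different tools (or rendered from PropBank, TimeBank, etc.) can be merged without touching one another, which is the compatibility property the slides attribute to GrAF.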
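The two agreement measures listed under "Interannotator Agreement" can be sketched concretely. The toy sense labels and annotator data below are invented for the example; Cohen's kappa is the standard chance-corrected coefficient, and "average F-measure" is rendered here as average pairwise agreement, which is what micro-averaged F reduces to when every item receives exactly one label from each annotator.

```python
# Illustrative sketch (not project code): two inter-annotator agreement
# measures over invented WordNet-style sense labels for ten tokens.
from collections import Counter
from itertools import combinations

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each annotator's label distribution.
    expected = sum(ca[lab] * cb[lab] for lab in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

def avg_pairwise_f(annotations):
    """Average pairwise F over all annotator pairs. With exhaustive,
    single-label annotation, per-pair micro-F equals simple agreement."""
    scores = [sum(x == y for x, y in zip(a, b)) / len(a)
              for a, b in combinations(annotations, 2)]
    return sum(scores) / len(scores)

# Three annotators assign sense tags to the same ten tokens (toy data).
ann1 = ["bank.n.01", "bank.n.01", "run.v.02", "bank.n.09", "run.v.02",
        "cold.a.01", "bank.n.01", "run.v.02", "cold.a.01", "bank.n.09"]
ann2 = ["bank.n.01", "bank.n.09", "run.v.02", "bank.n.09", "run.v.02",
        "cold.a.01", "bank.n.01", "run.v.02", "cold.a.02", "bank.n.09"]
ann3 = ["bank.n.01", "bank.n.01", "run.v.02", "bank.n.01", "run.v.02",
        "cold.a.01", "bank.n.01", "run.v.02", "cold.a.01", "bank.n.09"]

print(round(cohens_kappa(ann1, ann2), 3))          # ≈ 0.74
print(round(avg_pairwise_f([ann1, ann2, ann3]), 3))  # 0.8
```

Using both measures together, as the slides propose, guards against each one's blind spot: raw agreement ignores chance, while kappa can look harsh when the label distribution is skewed.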