Nancy Ide • Vassar College
Catherine Macleod • New York University
Corpus Linguistics 2000
American National Corpus
Lancaster, England
Why we need an ANC
• Brown Corpus of American English
– Too small to provide representative examples
– Texts from 1961 only
– No spoken data
• British National Corpus
– Not representative of American English
– Texts up to 1993 only
British vs. American English
• Lexical Items
• Bobby vs. cop, underground vs. subway, lorry vs. truck, pavement
vs. sidewalk, football vs. soccer…
• Grammatical structures
• “She could not endure to live with him” vs. “She could not endure
living with him.”
• “Have you a pen?” vs. “Do you have a pen?”
• Modals
• “shall” vs. “should” vs. “ought” vs. “will” vs. “would”
• Adverbial Usage
• “Immediately I get home” vs. “As soon as I get home”
• Support Verbs
• “take a decision” vs. “make a decision”
ANC Background
• June 1998
– ANC proposed at LREC’98 by Charles Fillmore, Nancy Ide,
Daniel Jurafsky, Catherine Macleod
• May 1999
– Publisher’s Day in Berkeley in conjunction with DSNA
• November 1999
– Organizational meeting, New York University
ANC Consortium
Pearson Education
Random House Publishers
Langenscheidt Publishing Group
HarperCollins Publishers
Cambridge University Press
LexiQuest
Microsoft Corporation
Shogakukan, Inc.
Associated Liberal Creators Press
Taishukan Publishers
Oxford University Press
Kenkyusha Publishers
IBM Corporation
Contributors
• “Founding” consortium members
– $21,000 over 3 years
– Texts
• Linguistic Data Consortium
– Management and distribution of the ANC
– Manpower and expertise to create initial version
• NYU and Vassar
– Expertise and manpower for corpus creation and
annotation
ANC Makeup
• Core “static” corpus
• Texts and transcriptions of spoken data
• 1990 onwards
• Comparable in balance to BNC
• Enables comparative studies
• At least 100 million words
• Snapshot of American English at the end of the
millennium
“Dynamic” component
• Not necessarily balanced
• Dictated by availability
• Includes email, ephemera, rap lyrics, newsgroups,
etc. plus historically important works from various
time periods
• Add 10% every five years
• Layered organization
• Dynamic component layered chronologically as
added
Eventual components
• annotated and aligned speech data
• dialects of American and Canadian English
• other major languages of North America
– Spanish, French Canadian
– aligned to parallel translations in English.
High costs of production prevent inclusion at this stage
Encoding and annotation
• Markup compliant with the XML Corpus
Encoding Standard (XCES)
• Annotation
– part of speech
– Sub-paragraph elements
• E.g., tokens, names, dates, numbers
• Produced in a two-stage process
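As a rough illustration of the automatic sub-paragraph markup listed above, the following Python sketch wraps tokens, dates, and numbers in XML elements using regular expressions. The element names (`s`, `tok`, `date`, `num`) and the patterns are assumptions chosen for the example; they are not taken from the XCES specification, and a real tagger would need to handle many more cases.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical patterns: ISO-style dates, plain numbers, then any other
# word or punctuation mark. Alternatives are tried in this order.
ELEMENT_RE = re.compile(
    r"(?P<date>\b\d{4}-\d{2}-\d{2}\b)"
    r"|(?P<num>\b\d+(?:\.\d+)?\b)"
    r"|(?P<tok>\w+|[^\w\s])"
)

def mark_up(sentence):
    """Wrap each date, number, and token of `sentence` in an XML element."""
    parts = []
    for match in ELEMENT_RE.finditer(sentence):
        kind = match.lastgroup  # name of the alternative that matched
        parts.append(f"<{kind}>{escape(match.group(0))}</{kind}>")
    return "<s>" + "".join(parts) + "</s>"

print(mark_up("The meeting on 1999-11-05 drew 40 people."))
```

Output of this kind would still need spot checking or hand validation, since the patterns above recognize only one date format and no names.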
Stage 1: Base level corpus
• Produced after year 1, using limited resources
• XML markup compliant with XCES level 0
• Markup produced by automatic transduction from
original formats
• Automatically tagged for part of speech
– Only spot checking for validity
• Minimal header
– hand-produced
– Includes domain information
• Useful for concordance generation, collocation
analysis
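To make the two uses named above concrete, here is a minimal Python sketch of keyword-in-context (KWIC) concordance generation and a simple window-based collocate count. The function names, the sample sentence, and the window sizes are assumptions for illustration only; this is not the ANC project's search software.

```python
import re
from collections import Counter

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of `keyword`."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

def collocates(tokens, keyword, window=2):
    """Count tokens occurring within `window` positions of `keyword`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

sample = ("I took the subway to work. The subway was late, "
          "so I walked on the sidewalk instead.")
for line in kwic(sample, "subway", width=20):
    print(line)

tokens = re.findall(r"\w+", sample.lower())
print(collocates(tokens, "subway", window=1).most_common())
```

With part-of-speech tags available, the same collocate count could be restricted to particular tag pairs, which is one reason even spot-checked tagging is useful at this stage.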
Stage 2: Final corpus
• Available after year 3
• XML markup conformant to XCES level 1
• Full header
• Markup for major structural divisions, paragraphs,
sentence boundaries
• Markup for some sub-paragraph elements, where
this can be done automatically
– E.g., tokens, names, dates, numbers
• 10% markup and annotation hand-validated
– “gold standard” corpus
Data architecture
• Follow XCES specifications for “stand-off”
markup
– Annotations in separate XML documents, linked to
original
– Easy to modify and/or add to
• Enables a distributed development model
• Different sites independently add annotation
– Suitable for delivery over the WWW
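The stand-off idea described above can be sketched in Python as follows: the primary text is left untouched, and a separate XML document points into it by character offsets. The element and attribute names (`annotations`, `tok`, `from`, `to`, `pos`, `target`) and the offset-based linking are assumptions made for this example; the actual XCES linking mechanism may differ.

```python
import xml.etree.ElementTree as ET

# The primary document is never modified by annotators.
primary_text = "She took the subway home."

# Hypothetical stand-off annotation document: each <tok> refers to a
# character span of the primary text and carries a part-of-speech tag.
annotation_xml = """
<annotations target="primary.txt">
  <tok from="0" to="3" pos="PRP"/>
  <tok from="4" to="8" pos="VBD"/>
  <tok from="9" to="12" pos="DT"/>
  <tok from="13" to="19" pos="NN"/>
  <tok from="20" to="24" pos="RB"/>
</annotations>
"""

def resolve(text, xml_string):
    """Pair each annotated span of `text` with its part-of-speech tag."""
    root = ET.fromstring(xml_string)
    pairs = []
    for tok in root.iter("tok"):
        start, end = int(tok.get("from")), int(tok.get("to"))
        pairs.append((text[start:end], tok.get("pos")))
    return pairs

print(resolve(primary_text, annotation_xml))
```

Because the annotations live in their own document, another site could add a further layer (for example, names or dates) as a second file pointing at the same offsets, without editing either the text or the part-of-speech layer.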
Software
• ANC project will provide search and access
software
• Encoding via XML and layered architecture
enables exploiting the evolving XML
environment for search, access, manipulation of
ANC data
– XML Transformation Language (XSLT)
– Resource Description Framework (RDF)
Availability
• Freely available to non-profit educational and
research organizations from the outset
• No restrictions on obtaining the corpus based
on geographical location
• Consortium members have exclusive access
for commercial exploitation for 5 years
• Distributed by LDC
Licensing
• LDC
– obtains licenses from text providers
– issues licenses to users
• no redistribution without publisher’s
permission
• “open sub-corpus” portion of the ANC
– licensed on the model of open-source software
ANC Status
• Founding memberships closed March 31 2001
– Consortium membership now $40K
• Text gathering, format transduction, header production
underway
– Base corpus due March 31 2002
• Preparing production of level 1 corpus
– Gathering technical input from research community
• ANLP/NAACL workshop (Seattle, April 2000)
• LREC workshop (Athens, June 2000)
– Seeking major funding
– Final core corpus due March 31 2004
Information
• ANC:
– http://AmericanNationalCorpus.org
– Project Director:
• Catherine Macleod
– Technical Director:
• Nancy Ide
• XCES:
– http://www.cs.vassar.edu/XCES