Using Corpora For Language Research


COGS 523 - Lecture 3: Corpus Annotation
17.07.2015
Bilge Say
Related Readings
Course Pack: Meyer (2002) Ch 4; Sampson and McCarthy (2005) Ch 39; Garside et al. (1997) Chs 4, 5, 16
Optional: McEnery et al. (2006): A3, A4, A8, A9

For your reference, the rest of Garside et al. (1997) is relatively old but still useful.
Slides with tagged text are adapted from McEnery and Wilson (2001) or McEnery et al. (2006), except the TEI encodings (see http://www.tei-c.org/Support/Learn/).
Mark-up and Annotation
Corpus mark-up: a system of codes inserted into a document stored in electronic form to provide information about the text itself and to govern formatting. Example: the Text Encoding Initiative (TEI).
Corpus annotation: the addition of interpretive, linguistic information to an electronic corpus of spoken and/or written data.
The two terms are sometimes used interchangeably.
A recurring conflict: utility of annotations vs. ease of annotation.
Other Issues
Standards vs. guidelines
Manual vs. automatic annotation
Documentation
Evaluation of annotation schemes
See the LREC conferences...
Maxims in Annotation of Text Corpora (Leech, 1993)
Annotations should be removable/revertable
Annotations should be extractable
End-user guidelines should be available
The annotation mode and annotator information should be made clear
Reliability measures should be available
Annotation schemes: theory-neutral or widely agreed upon?
No annotation scheme has an a priori claim to be a standard
Cross-Linguistic Annotation Standards
Reusability and shareability
Ease and efficiency in building a corpus
Cross-linguistic comparability
Examples: TEI, CES; EAGLES (Expert Advisory Group on Language Engineering Standards)
Problems with Standardization
Applicability of standards to existing or ongoing corpus research
Acceptability of standards by the general linguistic community
Task dependency of corpora
Applicability to a wide range of languages
Documentation of Markup/Annotation Guidelines
What should be specified in an annotation guidelines document?
Levels and layers of annotation
The set of annotation devices used and their meanings
Conventions for applying those devices, supplemented with examples or a reference corpus
Granularity of annotation
The disambiguation process applied (if any)
Measurable quality of annotation (accuracy rate, consistency rate, extent of manual checking)
Any incompleteness, known errors, etc.
Markup
A.k.a. structural annotation
Different conventions exist for line breaks, sections, lists, etc. What does that imply?
Character sets (e.g. Unicode); language codes (ISO 639-3)
Textual information
COCOA references: <A Charles Dickens>
Standard Generalized Markup Language (SGML)
Hypertext Markup Language (HTML)
Extensible Markup Language (XML)
XML
Three characteristics of XML distinguish it from other markup languages:
its emphasis on descriptive rather than procedural markup;
its notion of documents as instances of a document type;
its independence of any one hardware or software system.
Text Encoding Initiative (TEI)
Objective: the development of an interchange language for textual data
Started in 1987
Original P3 documentation: 1,400 pages
Currently at P5, with extensive web support (see Links)
TEI Lite: simplified by a factor of 3
Moved from SGML to XML:
  Flexible tagset
  Document Type Definitions (DTDs: the rules for a particular markup language, i.e. its elements, attributes, entities) made more flexible and optional
  XSL (Extensible Stylesheet Language)
  Simpler and better syntax
Corpus Encoding Standard (CES) and XCES: an attempt to specialize XML for corpora (not currently fully compliant with TEI P5, but with many commonalities) (see Links)
TEI
Alternative customizations:
tei_bare: TEI Absolutely Bare
teilite: TEI Lite
tei_corpus: TEI for Linguistic Corpora
tei_ms: TEI for Manuscript Description
tei_drama: TEI with Drama
tei_speech: TEI for Speech Representation
An example of a feature system declaration (FSD)
<fs id=vvd type=word-form>
<f name=verb-class><sym value=verb>
<f name=base><sym value=verb>
<f name=verb-form><sym value=lexical>
<f name=verb-class><sym value=past>
</fs>
Examples of SGML tags
<Q></Q> encloses a question
<EX></EX> encloses an expansion of an abbreviation in the original
manuscript
<LB> indicates a line break
<FRN></FRN> encloses words in another language; Lang="LA" indicates Latin
Example of an XML breakfast food menu (figure not reproduced in this transcript)
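The menu figure itself did not survive extraction; a breakfast-menu document of the kind typically used in introductory XML examples looks roughly like this (element names and content are illustrative, not taken from the original slide):

```xml
<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <description>Two waffles with maple syrup</description>
    <calories>650</calories>
  </food>
</breakfast_menu>
```

The point of the example is descriptive markup again: the tags name the kind of information each element holds.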
TEI P5 structure
The TEI encoding scheme consists of a number of modules, each of which declares particular XML elements and their attributes. (from the TEI Guidelines)
Modules: core, header, textstructure, corpus, ...
TEI for Language Corpora – text descriptions
channel (primary channel): describes the medium or channel by which a text is delivered or experienced. For a written text this might be print, manuscript, e-mail, etc.; for a spoken one, radio, telephone, face-to-face, etc. mode specifies the mode of this channel with respect to speech and writing.
constitution: describes the internal composition of a text or text sample, for example as fragmentary, complete, etc. type specifies how the text was constituted.
derivation: describes the nature and extent of the originality of this text. type categorizes the derivation of the text.
domain (domain of use): describes the most important social context in which the text was realized or for which it is intended, for example private vs. public, education, religion, etc. type categorizes the domain of use.
factuality: describes the extent to which the text may be regarded as imaginative or non-imaginative, that is, as describing a fictional or a non-fictional world. type categorizes the factuality of the text. ...
(from the Guidelines)
TEI Elements
Elements:
  Major structuring elements: text, body, front, back, ...
  Paragraph-level elements: citation, speaker, ...
  Lists, tables, figures
  Phrase-level elements: date, emph, foreign
  Bibliographic elements: author, publisher
  Others: file description, revision description
Attributes:
  <div type="chapter" n="1"> ... </div>
Entities:
  &auml; for ä
A Text Description Example

<textDesc n="Informal domestic conversation">
 <channel mode="s">informal face-to-face conversation</channel>
 <constitution type="single">each text represents a continuously recorded interaction among the specified participants</constitution>
 <derivation type="original"/>
 <domain type="domestic">plans for coming week, local affairs</domain>
 <factuality type="mixed">mostly factual, some jokes</factuality>
 <interaction type="complete" active="plural" passive="many"/>
 <preparedness type="spontaneous"/>
 <purpose type="entertain" degree="high"/>
 <purpose type="inform" degree="medium"/>
</textDesc>
A Sample Participant Description

<person sex="2" age="mid">
 <birth when="1950-01-12">
  <date>12 Jan 1950</date>
  <name type="place">Shropshire, UK</name>
 </birth>
 <langKnowledge tags="en fr">
  <langKnown level="first" tag="en">English</langKnown>
  <langKnown tag="fr">French</langKnown>
 </langKnowledge>
 <residence>Long term resident of Hull</residence>
 <education>University postgraduate</education>
 <occupation>Unknown</occupation>
 <socecStatus scheme="#pep" code="#b2"/>
</person>
Example of a TEI header from the University of Michigan Library (figure not reproduced in this transcript)
Adopting XML-based linguistic annotation
Technical difficulties; human perceptual difficulties
Not conformant to how linguistic knowledge is expressed in many layers of linguistic annotation...
Types of Annotation
Morphosyntactic: part-of-speech tagging; partial or full parse
Semantic: word senses, thematic roles
Discourse: information structure, anaphoric relations, discourse relations
Prosodic (e.g. intonation)
Pragmatic (e.g. speech acts)
Problem understanding (see the Message Understanding Conferences (MUC) or Document Understanding Conferences (DUC))
POS Tagging
Obligatory attributes or values: major word categories
Recommended attributes or values: type, gender, case
Optional: semantic classes, language-specific information, derivational morphology
Tagsets
Issues: conciseness, ease of interpretation, analysability, disambiguatability, linguistic quality vs. computational tractability; trade-offs...
Size of tagsets: English 30-200 tags; Spanish 475; Turkish 6,000 distinct morphological feature combinations for 250,000 words
What to do with multiwords: "in spite of" (ditto tags), mergers (clitics, e.g. "hasn't"), compounds ("eye strain" vs. "eyestrain")
Tagging Accuracy
Amount of training data available
The size of the tagset
Differences between the training data/dictionary and the real corpus
Unknown words
Recall and precision
2-6% error rate for English
Examples of codes of tag sets

LOB Corpus (C1 tag set):
IN   preposition
JJ   adjective
NN   singular common noun
NNS  plural common noun
NP   singular proper noun
NP$  genitive proper noun
PP$  possessive pronoun
RP   adverbial adjective

SEC (C7 tagset):
IF   for
IO   of
NN1  singular common noun
NN2  plural common noun
NNJ  singular group noun
RL   locative adverb
RR   general adverb
RT   temporal adverb

BNC (C5 tagset):
DPS  possessive determiner
NN1  singular common noun
NN2  plural common noun
NP0  proper noun
PNP  personal pronoun
POS  genitive marker ('s)
PRF  of
PRP  preposition
Example of part-of-speech tagging from LOB corpus (C1 tagset)
P05 32  ^ Joanna_NP stubbed_VBD out_RP her_PP$ cigarette_NN with_IN
P05 32  unnecessary_JJ fierceness_NN ._.
P05 33  ^ her_PP$ lovely_JJ eyes_NNS were_BED defiant_JJ above_IN
P05 33  cheeks_NNS whose_WP$ colour_NN had_HVD deepened_VBN
P05 34  at_IN Noreen's_NP$ remark_NN ._.
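Tagged text in the word_TAG format above is easy to post-process. A small sketch that splits such a line into (word, tag) pairs, splitting on the last underscore since words themselves may contain one:

```python
def split_tagged(line):
    """Turn 'word_TAG' tokens into (word, tag) pairs."""
    return [tuple(tok.rsplit("_", 1)) for tok in line.split()]

pairs = split_tagged("Joanna_NP stubbed_VBD out_RP her_PP$ cigarette_NN")
print(pairs[0])  # ('Joanna', 'NP')
```

From such pairs one can build tag frequency lists or feed the text to a concordancer.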
Example of part-of-speech tagging from Spoken English Corpus (C7 tagset)
^ For_IF the_AT members_NN2 of_IO this_DD1 university_NN1 this_DD1 character_NN1
enshrines_VVZ a_AT1 victorious_JJ principle_NN1 ;_; and_CC the_AT fruits_NN2 of_IO
that_DD1 victory_NN1 can_VM immediately_RR be_VBI seen_VVN in_II the_AT
international_JJ community_NNJ of_IO scholars_NN2 that_CST has_VHZ graduates_VVN
here_RL today_RT ._.
Example of part-of-speech tagging from the British National Corpus (C5 tagset in TEI-conformant layout)
Predita&NN1-NP0; ,&PUN; covering&VVG; the&AT0; bottom&NN1; of&PRF;
the&AT0; lorries&NN2; with&PRP; straw&NN1; to&TO0; pritect&VVI;
&bquo;&PUQ; I&PNP; 'll&VM0; polish&VVI; your&DPS;
boots&NN2; ,&PUN; &equo;&PUQ; he&PNP; offered&VVD; .&PUN;
Example from the CLAWS system
0000117 040 I     03 PPIS1
0000117 050 do    03 VD0
0000117 051 n't   03 XX
0000117 060 think 99 VVI
Syntactic Annotation
More problematic than POS tagging. Can you guess why?
Proposed levels:
  Bracketing of segments
  Labelling of segments
  Marking of dependency relations, e.g. complements
  Indicating functional labels, e.g. subject, object
  Extra: ellipsis, traces...
Treebanks
Penn Treebank – the initiator
Treebanks for Swedish, Danish, German, Dutch, French, Turkish, Czech, Spanish, Basque, Russian, Chinese, Portuguese, Italian...
Sizes: 700 to 90,000 sentences
Automated and manual annotation
Grammar formalisms: context-free grammar trees, dependency, LFG, HPSG, CCG
Example of full parsing from the Lancaster-Leeds treebank
[S[Ncs another_DT new_JJ style_NN feature_NN Ncs] [Vzb
is_BEZ Vzb] [Ns the_ATI [NN/JJ& wine-glass_NN [JJ+ or_CC
flared_JJ JJ+]NN/JJ&] heel_NN ,_, [Fr[Nq which_WDT Nq]
…
Fr]Ns] ._. S]
Example of skeleton parsing from Spoken English Corpus
[S&[P For_IF [N the_AT members_NN2 [P of_IO [N this_DD1
university_NN1 N]P]N]P] [N this_DD1 character_NN1 N] [V
enshrines_VVZ [N a_AT1 victorious_JJ principle_NN1
N]V]S&] ._.
From Penn Treebank
((S (NP-SBJ-1
(NP Yields)
(PP on
(NP money-market mutual funds)))
(VP continued
(S (NP-SBJ *-1)
(VP to
(VP slide)))
(PP-LOC amid
(NP signs
(SBAR that
(S (NP-SBJ portfolio managers)
(VP expect
(NP (NP further declines)
(PP-LOC in
(NP interest rates)))))))))
Tiger Treebank – A German treebank
<n id="n1_500" cat="S">
<edge href="#id(w1)"/>
<edge href="#id(w2)"/>
</n>
<w id="w1" word="the"/>
<w id="w2" word="boy"/>
Semantic Annotation
Makes sense in linguistic or psycholinguistic terms
Applicable to the whole corpus
Flexible, with the right level of granularity
Hierarchical structure (?)
Conforming to standards
(Schmidt, 1988)
Other Issues
Harder to annotate
Can be computer-assisted if appropriate interfaces to lexical resources are developed
General frequency information can help in disambiguation
Example of semantic text analysis, based upon Wilson (1996)

And       00000000
the       00000000
soldiers  23241000
platted   21072000
a         00000000
crown     21110400
of        00000000
thorns    13010000
and       00000000
put       21072000
it        00000000
on        00000000
his       00000000
head      21030000

Key:
00000000  Low content word
13010000  Plant life in general
21030000  Body and body parts
21072000  Object-oriented physical activity
21110321  Men's clothing: outer clothing
21110400  Headgear
23241000  War and conflict: general
31241100  Colour
Example of anaphoric annotation from Lancaster Anaphoric Treebank.
A039 1 v
(1 [N Local_JJ atheists_NN2 N] 1) [V want_VVO (2 [N the_AT (9
Charlotte_NP1 9) Police_NN2 Department_NNJ N] 2) [Ti to_TO
get_VV0 rid_VVN of_IO [N (3 <REF=2 its_APP$ chaplain_NN1 3)
,_, [N {{3 the_AT Rev._NNSB1 Dennis_NP1 Whitaker_NP1 3} ,_,
38_MC N]N]Ti]V] ._.
Example of prosodic annotation from London-Lund corpus
1 8 14 1470 1 1A 11 / ^ what a_bout a cigar\ette# .
1 8 14 1480 1 1A 20 / *((4 sylls))*
1 8 14 1490 1 1B 11 / *I ^w\on't have one th/anks#* - -
1 8 14 1500 1 1A 11 / ^ aren't you 'going to sit d/own#
1 8 14 1510 1 1B 11 / ^ [/\m] #
1 8 14 1520 1 1A 11 / ^have my _coffee in p=eace# - - -
Example of codes of prosodic annotation
#    end of tone group
^    onset
/    rising nuclear tone
\    falling nuclear tone
/\   rise-fall nuclear tone
_    level nuclear tone
[ ]  enclose partial words and phonetic symbols

Also represented (conventions exist): unintelligible speech, background noise, overlapping speech; names changed for privacy.
Changing names for
privacy
Lecture 4
Using corpora with other resources and corpus query tools (general); corpus/treebank quality control.
Readings: Buchholz and Green (2006); Miller and Fellbaum (2007); Sampson and McCarthy Ch 29.
Due: Project proposals