Week 8: Corpus linguistics (2)


Annotation of corpora
• A. Part-of-speech tagging
• B. Syntactic annotation
• C. Semantic annotation
• D. Discourse annotation
• E. Pragmatic annotation
Annotation of corpora
• perfectly plain: produced by scanning; no information
about the text (usually, not even the edition)
• marked up for formatting attributes: e.g. page breaks,
paragraphs, font sizes, italics, etc.
• annotated with identifying information, e.g. edition date,
author, genre, register, etc.
• annotated for part of speech, syntactic structure, discourse
information, etc.
A. Part-of-speech tagging
LOB sample with POS tagging
A01 2 ^ *'_*' stop_VB electing_VBG life_NN peers_NNS **'_**' ._.
A01 3 ^ by_IN Trevor_NP Williams_NP ._.
A01 4 ^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN
A01 4 nominating_VBG any_DTI more_AP labour_NN
A01 5 life_NN peers_NNS is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN
A01 5 of_IN labour_NN \0MPs_NPTS tomorrow_NR ._.
A. Part-of-speech tagging
• Main steps:
– Divide the text into word tokens (tokenization)
– Select a set of tags
– Apply tag set to tokens
• Tokenization:
– orthographic word vs. morpho-syntactic unit?
– multiwords, e.g., in spite of, labeled as
in_PREP31 spite_PREP32 of_PREP33
– mergers, e.g., clitics as in hasn’t, je t’aime, vendetelo,
labeled as vendete_VERB lo_PRON
– compounds, e.g., tag set: label as
tagset_NOUN or as tag_NOUN set_NOUN? (a tokenization sketch follows this list)
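To make these decisions concrete, here is a minimal tokenization sketch in Python; the multiword list and the clitic rule are hypothetical simplifications, not the rules of any actual tagger:

import re

MULTIWORDS = {("in", "spite", "of")}   # hypothetical multiword inventory

def tokenize(text):
    # Separate the clitic "n't", then split words from punctuation.
    words = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text.replace("n't", " n't"))
    tokens, i = [], 0
    while i < len(words):
        # Re-merge known multiwords into a single token.
        if tuple(w.lower() for w in words[i:i+3]) in MULTIWORDS:
            tokens.append(" ".join(words[i:i+3])); i += 3
        else:
            tokens.append(words[i]); i += 1
    return tokens

print(tokenize("He went on in spite of the rain, hasn't he?"))
# ['He', 'went', 'on', 'in spite of', 'the', 'rain', ',', 'has', "n't", 'he', '?']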
A. Part-of-speech tagging
• Choice of tag set
• a sophisticated, linguistically well-grounded set of tags…
• BUT: not automatically applicable without loss of
accuracy
• example: come - present plural indicative, imperative,
subjunctive; the Lancaster corpus distinguishes these from
the to-infinitive; the LOB and Brown corpora do not
A. Part-of-speech tagging
• tag = word class
• label = alphanumeric characters
• examples (word class: possible labels):
– preposition: prep, IN
– singular proper noun: NOUN:prop:sing, N-p-sg, NP1
• logically organized (taxonomy), e.g., in Lancaster, BNC, C7
• presentation: horizontal or vertical
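The two layouts carry the same information and are trivially interconvertible; a small Python sketch (format details assumed for illustration):

def horizontal_to_vertical(line):
    # Horizontal: "word_TAG word_TAG ..." -> vertical: one "word<TAB>tag" per line.
    rows = []
    for tok in line.split():
        word, _, tag = tok.rpartition("_")
        rows.append(word + "\t" + tag)
    return "\n".join(rows)

print(horizontal_to_vertical("a_AT move_NN to_TO stop_VB"))
# a     AT
# move  NN
# to    TO
# stop  VB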
A. Part-of-speech tagging
• encoding of tags
• TEI (SGML), e.g., BNC
<w AV0>Even <w AT0>the <w AJ0>old
<w NN2>women <w VVB>manage <c PUN>, <w AV0>just
<w CJS>as <w PNP>they <w VVB>’re <w VVG>passing
<w PNP>you <c PUN>. (Garside et al., 1997)
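Word/tag pairs can be recovered from such SGML-style markup with a regular expression; the sketch below is a rough simplification (a real SGML parser would be more robust):

import re

sgml = "<w AV0>Even <w AT0>the <w AJ0>old <w NN2>women <w VVB>manage <c PUN>,"

# Each <w TAG> (word) or <c TAG> (punctuation) element precedes its text.
pairs = [(text.strip(), tag) for tag, text in re.findall(r"<[wc] (\w+)>([^<]+)", sgml)]
print(pairs)
# [('Even', 'AV0'), ('the', 'AT0'), ('old', 'AJ0'), ('women', 'NN2'),
#  ('manage', 'VVB'), (',', 'PUN')]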
A. Part-of-speech tagging
• Applying tags to words
• tagging scheme should include a procedure of how to
assign tags to words (both for humans and machines)
• need a lexicon: it will say which tags are assignable to
which words
• again: ambiguity is a problem (see the lexicon-lookup sketch below)
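A minimal sketch of lexicon lookup with ambiguity, using a hypothetical toy lexicon; a real tagger would then resolve the remaining ambiguity from context (e.g., probabilistically):

LEXICON = {                    # hypothetical toy lexicon: word -> possible tags
    "stop": ["VB", "NN"],      # ambiguous: verb or noun
    "peers": ["NNS", "VBZ"],   # ambiguous: plural noun or 3sg verb
    "the": ["AT"],
}

def candidate_tags(word):
    # Unknown words get an open-class guess; real systems also use morphology.
    return LEXICON.get(word.lower(), ["NN?"])

for w in ["stop", "the", "peers", "Gaitskell"]:
    print(w, candidate_tags(w))
# stop ['VB', 'NN']  /  the ['AT']  /  peers ['NNS', 'VBZ']  /  Gaitskell ['NN?']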
B. Syntactic annotation
• syntactic annotation = parsed corpora
• purposes:
– training automatic parsers (computational linguistics,
e.g. probabilistic parsers - inductive training through
extraction of frequency counts)
– extracting information (linguistics, e.g., building a
lexicon, investigating subcategorization frames,
collocations or other linguistic things, describing
sublanguages)
B. Syntactic annotation
• a parsing scheme needs (cf. POS tagging):
– a list of symbols
– definitions of symbols
– description of how to apply symbols to text
• syntactically annotated corpora: tree banks
• examples of tree banks: Penn Treebank, Nijmegen
Treebank, Susanne Corpus, Helsinki Constraint
Grammar (ENGCG), Lancaster/IBM SEC treebank
B. Syntactic annotation
• Parsing
• the (automatic) analysis of texts (sentences) in terms of
syntactic categories
[Parse tree (figure): “Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov 29”, with S dominating NP, ADJP, and VP constituents]
B. Syntactic annotation
• Penn Treebank
• skeleton parsing: partial parse, leaving out the “hard”
things (such as PP-attachment)
• phrase structure model (Garside et al., 1997, p.42)
((S (NP (NP Pierre Vinken)
,
(ADJP (NP 61 years)
old
,))
will
(VP join
(NP the board)
(PP as
(NP a nonexecutive director))
(NP Nov 29)))
.)
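Bracketed trees like this are s-expressions and can be read with a small recursive parser. A generic Python sketch (not the Penn Treebank’s own tools); it also tolerates the unlabeled outermost bracket in the example above:

def parse_brackets(s):
    # Parse "(LABEL child ...)" into (label, [children]); bare tokens are leaves.
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0
    def read():
        nonlocal pos
        if tokens[pos] != "(":
            tok = tokens[pos]; pos += 1
            return tok                       # a leaf (word)
        pos += 1                             # skip "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]; pos += 1    # node label (may be absent)
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1                             # skip ")"
        return (label, children)
    return read()

print(parse_brackets("(S (NP Pierre Vinken) will (VP join (NP the board)))"))
# ('S', [('NP', ['Pierre', 'Vinken']), 'will', ('VP', ['join', ('NP', ['the', 'board'])])])

From such trees one can, for instance, count phrase-structure rule frequencies to train a probabilistic parser, as mentioned above.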
B. Syntactic annotation
• Penn Treebank
• available through the LDC
• size: 3,300,000 words (Feb 97)
• Brown corpus, Wall Street Journal
• in the current phase:
– add function labels (Subj, Obj etc.)
– add null constituents or traces (e.g., It’s easy [t] to eat)
– add indices for coreferences (e.g., Mary[i] saw herself[i] in the
mirror)
– discontinuous constituents
– add semantic roles (Agent, Goal etc)
• may get too complex for large-scale reliable analysis…
B. Syntactic annotation
• Susanne Corpus
• part of the Brown corpus, 128,000 words
• result of manual analysis
• parsing scheme specified in great detail
• available from the Oxford Text Archive:
– sable.ox.ac.uk/ota (http)
– ota.ox.ac.uk/pub/ota/public (ftp)
A./B. Demo
• TIGER
• NEGRA
C. Semantic annotation
• problem (1): more than one way of referring to a
concept, e.g.,
– text analysis: the choice of expression may reflect
ideologies in the text or relationships between
participants in conversation, for example, in doctor-patient interaction:
abdomen --- tummy
– information retrieval: a historian of fashion seeks
information about trousers:
trousers --- slacks, shorts, leggings, breeches
--> cf. RECALL in IR
C. Semantic annotation
• problem (2): one single word can refer to different
concepts, e.g.,
– information retrieval: a historian of fashion wants to
know about boots:
boot --- may refer to a shoe, a computer, a kick, a car
--> cf. PRECISION in IR
• so:
– need to identify related words (problem 1)
– need to identify the different senses of a word (problem 2)
(a worked recall/precision example follows)
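To make the IR connection concrete (the counts below are invented for illustration): suppose 10 documents in a collection are actually about boots as footwear, and a query on the string “boot” retrieves 8 documents, 5 of them about footwear.

relevant_in_collection = 10   # invented counts
retrieved = 8
relevant_retrieved = 5

recall = relevant_retrieved / relevant_in_collection   # missed synonyms -> lower recall
precision = relevant_retrieved / retrieved             # other senses retrieved -> lower precision
print(f"recall={recall:.2f}, precision={precision:.2f}")
# recall=0.50, precision=0.62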
C. Semantic annotation
• labeling words according to semantic field (word senses) so that you can…
• extract all the related words by querying on the semantic field
• extract only those instances of ambiguous words with the specific
senses you want by querying on the combination of word and
semantic field (see the query sketch below)
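Both query types are straightforward once tokens carry a semantic-field label; a sketch over hypothetical data (the field names and annotations are invented):

corpus = [("trousers", "CLOTHING"), ("boot", "CLOTHING"),
          ("boot", "COMPUTING"), ("slacks", "CLOTHING"),
          ("boot", "CLOTHING"), ("kernel", "COMPUTING")]

# Query 1 (recall): all words in a field, i.e. the related words.
print([w for w, f in corpus if f == "CLOTHING"])
# ['trousers', 'boot', 'slacks', 'boot']

# Query 2 (precision): word + field, i.e. only the wanted sense.
print([(w, f) for w, f in corpus if w == "boot" and f == "CLOTHING"])
# [('boot', 'CLOTHING'), ('boot', 'CLOTHING')]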
C. Semantic annotation
• semantic fields: sense relations and other kinds of relations
(e.g., part-of, related-to etc.)
• annotation (cf. PoS tagging):
– definition of the tagging scheme (labels and their meanings)
– guidelines for applying the tagging scheme
– in semantics: this is not as easy and straightforward as for PoS
tagging!
– requirements:
• should make linguistic/psycholinguistic sense
• should be able to account for the vocabulary in the corpus
exhaustively
• should be suitable for texts from different periods and registers
(comprehensiveness)
• should preferably have a hierarchical structure
C. Semantic annotation
• multiple membership, e.g.,
deepened: color and change/remain
• multiword units, e.g.,
stubbed out: encoded as two separate words, but
belonging together
• one recent ambitious attempt at a taxonomy of such
semantic relations (sense relations, thesaurus-type
relations, semantic fields etc.): WORDNET at
www.cogsci.princeton.edu/~wn/
• you can try it online:
www.cogsci.princeton.edu/~wn/online/
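WordNet can also be queried programmatically, for example through NLTK’s interface (this assumes NLTK and its WordNet data are installed; the exact sense inventory depends on the WordNet version):

# Assumes: pip install nltk, then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

# Problem 2: one word, several senses (cf. boot above).
for synset in wn.synsets("boot"):
    print(synset.name(), "-", synset.definition())

# Problem 1: several words for one concept (synonyms via a synset's lemmas).
print(wn.synsets("trousers")[0].lemma_names())
# e.g. ['trouser', 'pant'], depending on the WordNet version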
C. Semantic annotation
• How to do it?
– manually
– computer-assisted (needs at least a computer-readable lexicon and a disambiguation process similar to PoS tagging)
– fully automatic (not really feasible):
• semantic analysis is even harder than syntactic parsing
• no integrated ‘parse’ of meaning possible at the present
time
D. Discourse annotation
• discourse features: what are they?
• Typically: cohesion and coherence
• coherence: what makes a text hang together
in terms of content
• cohesion: the means of making a text hang
together
• reference, substitution, ellipsis, conjunctive
relations (cause, result, effect etc.), thematic
development
• Halliday & Hasan, 1976
D. Discourse annotation
• example: anaphoric relations in the
IBM/Lancaster corpus (UCREL)
• try to build up sth. like an ‘anaphoric
treebank’
• what are anaphoric relations?
– links between a proform and an antecedent
– example:
The married couple said that they were happy
with their lot.
(the proforms “they” and “their” link back to the
antecedent “the married couple”)
D. Discourse annotation
• anaphoric annotation in UCREL: the categories used are
based on Halliday & Hasan, 1976
• example of annotation:
(1 Feodor Baumenk 1), a former Nazi death
camp guard, has asked the U.S. Supreme
Court to allow <REF=1 him to retain <REF=1
his American citizenship. (2 The Hartford
Courant 2) said…
• symbols:
– (1), (2)… = antecedent
– < = anaphoric
– (> = cataphoric)
– REF = central pronoun
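The links can be recovered mechanically; a rough Python sketch (the patterns simplify the real UCREL scheme considerably):

import re

text = ("(1 Feodor Baumenk 1), a former Nazi death camp guard, has asked "
        "the U.S. Supreme Court to allow <REF=1 him to retain <REF=1 his "
        "American citizenship.")

# Antecedents look like "(N ... N)"; anaphors like "<REF=N word".
antecedents = dict(re.findall(r"\((\d+) (.+?) \1\)", text))
for index, word in re.findall(r"<REF=(\d+) (\w+)", text):
    print(word, "->", antecedents[index])
# him -> Feodor Baumenk
# his -> Feodor Baumenk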
D. Discourse annotation
• few corpora annotated for discourse features…
• how to do it?
– manually
– computer-assisted: either interactive hand annotation,
using some kind of specialized editor or automatic
annotation with the possibility of hand correction or
disambiguation
– a tool supporting annotation of anaphora: XANADU in
Lancaster
E. Pragmatic annotation
• anything beyond sentences and discourse: contexts of
situation and culture
• examples of things people look at in pragmatics
– carry-on signals in conversation (e.g., Stenstroem, 1987):
what functions do carry-on signals such as “well”, “you
know” etc. have in conversation?
– speech acts (e.g., Stiles, 1992): speech act types in
conversation, e.g., in doctor-patient interactions
PATIENT: I have the headaches to the point that I have to vomit (D)
DOCTOR: Mm-hm (K)
PATIENT: Then I have to go to bed and I sleep for a while (E)
D = Disclosure, K = Acknowledgment, E = Edification
E. Pragmatic annotation
• how to do it?
– manually
– computer-assisted: ?
– fully-automatic: -
• You have to use your imagination!
• Stenstroem example: can be done with a concordance
program because it’s essentially word-based (a minimal KWIC sketch follows below)
• Stiles example: would probably have to be done manually
(then use a concordance program on the annotated texts?)
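For word-based studies like Stenstroem’s, a few lines of Python already give a usable KWIC (key word in context) concordance; this is a generic sketch, not any particular tool:

def kwic(tokens, keyword, window=4):
    # Print every occurrence of keyword with `window` tokens of context.
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left:>30} [{tok}] {right}")

tokens = "well I mean you know it was well worth the trip you know".split()
kwic(tokens, "well")
#                                [well] I mean you know
#                you know it was [well] worth the trip you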
Higher-level annotation: tools
• Tools that support specialized analysis, such as
– specialized editors, e.g., Xanadu for anaphoric relations
– specialized in terms of linguistic models,
• e.g., Sys-Tools for Systemic Functional Grammar
(minerva.ling.mq.edu.au/)
(http://cirrus.dai.ed.ac.uk:8000/Coder/index.html)
• e.g., RSTTools for rhetorical relations analysis
(www.dai.ed.ac.uk/daidb/people/homes/micko/RSTTool/index.html)
• Tools that support various kinds of analysis (but not
quite everything you might want to do):
– TATOE (www.darmstadt.gmd.de/~rostek/tatoe.htm)
References
• Garside, R., G. Leech & A. McEnery (eds.), 1997. Corpus
Annotation: Linguistic Information from Computer Text
Corpora. London: Longman.
• Fellbaum, C. (ed.), 1998. WordNet: An Electronic Lexical
Database. Cambridge, MA: MIT Press.
• Halliday, M.A.K. & R. Hasan, 1976. Cohesion in English.
London: Longman.
• Mindt, D., 1991. Syntactic evidence for semantic distinctions in
English. In Aijmer & Altenberg (eds.), English Corpus
Linguistics: Studies in Honour of Jan Svartvik. London:
Longman.
• Stenstroem, A.-B., 1987. Carry-on signals in English conversation. In
Meijs (ed.), Corpus Linguistics and Beyond. Amsterdam:
Rodopi.
• Stiles, W.B., 1992. Describing Talk: A Taxonomy of Verbal Response
Modes. Beverly Hills: Sage.