Using Corpora for Language Research
COGS 523 - Lecture 4
Using Corpora with Other Resources; Corpus Software
06.07.2015
COGS 523 - Bilge Say
1
Related Readings
Readings:
Buchholz and Green (2006); Miller and
Fellbaum (2007); Sampson and McCarthy
Ch 29.
Extra – Information sheet for Resources
Optional (can be used in software reviews!!)
Garretson, G. (2008) Desiderata for
Linguistics Software Design. International
Journal of English Studies 8(1), 67-74. (The
link is available on METU Online)
Lexical and Ontological
Resources
Useful for Natural Language
Processing, Psycholinguistics,
Corpus Annotation (e.g. automating
semantic annotation)
A selected review is to follow, but
there are others...
WordNet - Preliminaries
Lexeme vs Sense
Homonyms (Homophones or
homographs): Words that have the same
form with unrelated meanings
Polysemy: Multiple related meanings with
a single lexeme (e.g. bank, as in sperm bank)
Hard to distinguish between polysemy
and homonymy sometimes.
WordNet - Preliminaries
Synonymy: Different lexemes, same
(or nearly same) meanings
Hyponymy: A subclass of: poodle -> dog; car -> vehicle (the opposite
direction is hypernymy)
Meronymy: A part of: leg -> table
Antonymy: Opposites
WordNet
A lexical database for English (and 30 other
languages, see Balkanet and EuroWordnet
projects); most extensive use: word sense
disambiguation (Wordnet book available at the
library)
Synsets: A set of synonyms
Each sense entry contains synsets, a dictionary style
definition, some example uses (and a frequency
number)
Four separate databases: nouns (hyponymy,
meronymy), verbs (hyponymy, manner, causation, etc.),
adjectives and adverbs
Synsets are chained together by hyponym and
hypernym links; multiple chains are possible
Bass -> musical instrument -> instrument -> device ....-> entity
Bass -> singer, vocalist -> musician -> performer ....-> entity
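The two chains for "bass" can be reproduced with a small sketch. The graph below is a hand-made toy, not the WordNet database: the node labels are assumptions, and the elided middle parts of each chain ("....") are simply skipped rather than filled in.

```python
# Toy hypernym links (assumed for illustration; real WordNet chains are longer).
# Each key is a sense; each value lists its direct hypernyms.
HYPERNYMS = {
    "bass (instrument)": ["musical instrument"],
    "bass (singer)": ["singer"],
    "musical instrument": ["instrument"],
    "instrument": ["device"],
    "device": ["entity"],
    "singer": ["musician"],
    "musician": ["performer"],
    "performer": ["entity"],
    "entity": [],
}

def hypernym_chains(node, graph=HYPERNYMS):
    """Return every path from node up to a root (a node with no hypernyms)."""
    parents = graph.get(node, [])
    if not parents:
        return [[node]]
    chains = []
    for parent in parents:
        for chain in hypernym_chains(parent, graph):
            chains.append([node] + chain)
    return chains

for sense in ("bass (instrument)", "bass (singer)"):
    for chain in hypernym_chains(sense):
        print(" -> ".join(chain))
```

Because a node may list several hypernyms, the same function returns more than one chain when a sense has multiple parents.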
Extensions
WordNetPlus: A dense, weighted database of
automatically learned evocation ratings
(how much one concept brings another to mind).
As a first step, humans rated 120,000
pairs from 1000 synsets, the most
frequent concepts in the BNC.
ImageNet: Enhancing WordNet with
images and icons.
An example of Wordnet Query
Turkish WordNet project
http://www.hlst.sabanciuniv.edu/TL/
Combined with phonetic rendering,
morphological analysis, English
equivalent etc.
http://www.ceid.upatras.gr/Balkanet/index.htm
Part of Balkanet project for 6 Balkan languages
12,000 synsets
An example of Turkish Wordnet Query
An Alternative to Turkish
WordNet
60000 hypernyms, 72 layers
Machine learning from TDK dictionary
Ongoing work, needs disambiguation
More coverage than Turkish WordNet
By Tunga Güngör and Onur Güngör, Boğaziçi Univ
Ontologies - Cyc
A knowledge base of human commonsense and
associated inference engine.
http://www.opencyc.org/ (Free version)
http://research.cyc.com/ (Academic version)
Doug Lenat’s project – 1984+
300,000 concepts
Nearly 3,000,000 assertions (facts and rules),
using 26,000+ relations, that interrelate,
constrain, and, in effect, (partially) define the
concepts.
Natural Language Query and Information Entry
Tools
The graph representation of the Cyc Knowledge Base
http://www.cyc.com/cyc/technology/whatiscyc_dir/whatdoescycknow
An example of a knowledge representation sample
coded with CycL
ConceptNet
http://web.media.mit.edu/~hugo/conceptnet/
Part of Open Mind Initiative
A huge wiki type of effort to create a commonsense
knowledgebase represented as a semantic network
1.6 million edges (assertions) connecting more than
300 000 nodes, where nodes are semi-structured English
fragments,
interrelated by an ontology of twenty semantic relations
such as EffectOf (causality), SubeventOf (event hierarchy),
CapableOf (agent’s ability), PropertyOf, LocationOf,
and MotivationOf (affect).
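A semantic network of this kind can be sketched as a list of labelled edges. The assertions below are invented for illustration and are not quoted from ConceptNet; only the relation names come from the slide, and the (relation, source, target) layout is an assumption.

```python
from collections import defaultdict

# Hypothetical assertions in (relation, source, target) form; the node texts
# are made up for this sketch and do not come from the ConceptNet data.
ASSERTIONS = [
    ("EffectOf", "drink coffee", "feel awake"),
    ("SubeventOf", "drink coffee", "pour coffee into cup"),
    ("LocationOf", "coffee", "kitchen"),
    ("CapableOf", "person", "drink coffee"),
]

def build_network(assertions):
    """Index edges by source node so a node's outgoing relations can be listed."""
    network = defaultdict(list)
    for relation, source, target in assertions:
        network[source].append((relation, target))
    return network

network = build_network(ASSERTIONS)
for relation, target in network["drink coffee"]:
    print(f"drink coffee --{relation}--> {target}")
```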
An excerpt from ConceptNet’s semantic network
From Liu, H. & Singh, P. (2004) ConceptNet: A Practical Commonsense
Reasoning Toolkit. BT Technology Journal
FrameNet
FrameNet is a lexicon-building project for
English, based on frame semantics, carried out
at the International Computer Science Institute
(ICSI) in Berkeley.
Frame: a schematic representation of a situation
type (eating, spying, removing, classifying, etc.)
together with lists of the kinds of participants,
props, and other conceptual roles that are seen
as components of such situations. The semantic
arguments of a predicating word correspond to
what we call the frame elements (FE) of the
frame associated with that word.
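As a rough illustration of the frame idea, a frame can be modelled as a named situation type with a list of roles. The frame name, frame elements, and lexical units below are hypothetical and are not taken from the FrameNet database.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    """A situation type plus the roles (frame elements) that take part in it."""
    name: str
    frame_elements: list
    lexical_units: list = field(default_factory=list)

# Hypothetical frame for the 'eating' situation; all names are illustrative only.
eating = Frame(
    name="Eating",
    frame_elements=["Eater", "Food", "Utensil"],
    lexical_units=["eat", "devour", "nibble"],
)

def annotate(frame, role_fillers):
    """Pair each frame element with the text filling it; reject unknown roles."""
    unknown = set(role_fillers) - set(frame.frame_elements)
    if unknown:
        raise ValueError(f"not frame elements of {frame.name}: {sorted(unknown)}")
    return {fe: role_fillers.get(fe) for fe in frame.frame_elements}

print(annotate(eating, {"Eater": "The children", "Food": "the cake"}))
```

Roles left unfilled in a sentence come back as None, which mirrors the fact that not every frame element is expressed in every annotated sentence.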
FrameNet
Uses BNC and ANC
Currently (version 1.3), there are more
than 10,000 lexical units, more than
6,000 of which are fully annotated, in
more than 800 hierarchically-related
semantic frames, exemplified in more
than 135,000 annotated sentences in the
database.
WordNet – ConceptNet hybrid, with a
grammar theory in the background
(Fillmore’s Frame Semantics).
Interface of the Frame Grapher
Sample Output From Frame Grapher
input: Crime_Scenario
Software for Working with
Corpora
“Corpus Linguistics in its current form
cannot work without the help of the
computer.” (Mason)
According to Function: Corpus Building
Software vs Corpus Query Software
According to Design: Standard Software for
Non-Technical Users vs Specialized
Toolkits Providing Standard Functions vs
Using Non-Corpus Specific Tools and
Programming Languages (e.g. grep,
egrep, Perl, Python, Tcl/Tk, Java)
Corpus Software
Standard Software: MonoConcPro,
WConcord, Wordsmith, IMS CQP
(Corpus Query Processor), Qwick,
Xaira, Gsearch
More General Purpose NLP
Suites/Toolkits for Programmers:
CUE (Corpus Universal Examiner),
NLTK, GATE
Corpus Query/Analysis
Software
Text Analysis Software -> Corpus
Query Software -> Concordancers
Collocations in KWIC format
(Keyword in Context)
General Features
Search
Display, Save, Export
Statistics
Features
Search
Display, save, export
Word, phrase, POS etc search
Regular expression search
Context-sensitive search
Header info search
KWIC or sentence format
Sorting
Saving results or search patterns
Statistics
Frequency and various statistics
Plotting graphs
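The core features listed above (regular expression search, KWIC display, sorting, frequency counts) can be sketched in a few lines. This is a minimal toy concordancer, not the behaviour of any of the packages named in this lecture; the function name, the sample text, and the crude tokenisation are all assumptions.

```python
import re
from collections import Counter

def kwic(text, pattern, width=3):
    """Return (left, keyword, right) triples for tokens fully matching the regex."""
    tokens = re.findall(r"\w+", text.lower())   # crude tokenisation, lowercased
    rx = re.compile(pattern)
    lines = []
    for i, tok in enumerate(tokens):
        if rx.fullmatch(tok):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

sample = ("The corpus was tagged. A tagged corpus is easier to query "
          "than raw corpora, and large corpora need indexing.")

hits = kwic(sample, r"corp(us|ora)")

# Sort by right context, one of the sorting options a concordancer offers.
for left, key, right in sorted(hits, key=lambda hit: hit[2]):
    print(f"{left:>20} | {key:^7} | {right}")

# Simple frequency statistics over the matched forms.
print(Counter(key for _, key, _ in hits))
```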
A Comparison Framework
Platform/Operating System
Price
Ease of Installation
User friendliness
Speed
Ease of setting up a corpus/texts
Query syntax
Query search power (collocational, discontinuous constituents)
Statistical Analysis
Standard markup scheme handling
Whole text browsing
Character set handling
Output for presentation
Desiderata – some maxims
Do not build linguistic theory into the program
any more than necessary
Do separate markup from annotation
Do not gloss over complexities in data – sensible
defaults that can be overridden are fine
Allow users to supply their own analytical
categories – e.g. Annotation of concordance
lines
Make use of standards
Use Unicode
IMS Corpus Workbench
(CWB)
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/
IMS Corpus Query Processor (CQP): query
system for CWB
Allowing use of multiple knowledge sources
(corpora, machine readable dictionaries etc)
Allowing the use of stored information and
calculating information on-line (from remote
corpora)
For both human and machine use, but not really for
novice users...
Regular Expression based syntax.
From CWB web site
Query language
unrestricted number of attributes per corpus position
regular expressions over attribute values of individual
corpus positions (e.g. wild cards for word forms, part-of-speech values)
regular expressions over sequences of corpus positions
(partial) support of structural annotations (e.g. SGML)
incremental concordancing
application of a query to all items of a list
'virtual attributes', i.e. runtime access to external
applications (e.g. WordNet)
queries on parallel translated texts
From CWB web site
Display of results
user-definable size of 'keyword in context'
display
'keyword in context' lines can be sorted in
various ways
frequency counts, e.g. for word combinations
multilingual concordances from aligned corpora
html and latex output supported
query history
From CWB web site
registration of corpora
'encoding' of corpora, i.e. indexing (and
compression)
(for text sources in one-word-per-line format,
using ISO8859/Latin-1 8bit character sets, and
maybe others)
For example, the BNC corpus with part-of-speech
and lemma annotation will need about 1
GB of disk space.
incremental addition of types of corpus
annotations ('attributes'). E.g. add part-of-speech
values to a corpus once you have access
to a POS-tagger.
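A one-word-per-line source file can be produced from tagged text with a short script. The sketch below assumes that the extra annotations (part of speech, lemma) sit in tab-separated columns after the word form; that column layout and the function name are assumptions to be checked against the CWB documentation before encoding a real corpus.

```python
def to_vertical(tagged_tokens):
    """Render (word, pos, lemma) triples one token per line, attributes tab-separated."""
    return "\n".join("\t".join(token) for token in tagged_tokens)

# A hand-tagged toy sentence; the tags are Penn Treebank style by assumption.
sentence = [("The", "DT", "the"), ("dogs", "NNS", "dog"), ("barked", "VBD", "bark")]
print(to_vertical(sentence))
```

Adding a new annotation type later then amounts to adding one more column per token.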
Regular Expressions
Equivalent in expressive power to regular
languages and finite automata
Take the empty language and the languages
with a single string, and apply
concatenation, union or Kleene star
operations to them. Everything you
can generate in this way is a
regular language. (Partee et al.,
1993)
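The three building operations can be tried out with Python's re module. This is only an illustration: Python's regex engine supports more than regular languages, so the sketch restricts itself to concatenation, union, and Kleene star, and the helper name accepts is an assumption.

```python
import re

# Two single-string languages, written as regular expressions.
a, b = "ab", "c"

concatenation = f"(?:{a})(?:{b})"   # a string of the first language, then one of the second
union = f"(?:{a})|(?:{b})"          # strings in either language
kleene_star = f"(?:{a})*"           # zero or more repetitions

def accepts(regex, string):
    """True if the whole string belongs to the language the regex denotes."""
    return re.fullmatch(regex, string) is not None

print(accepts(concatenation, "abc"))   # True
print(accepts(union, "c"))             # True
print(accepts(kleene_star, ""))        # True: the empty string is always in L*
print(accepts(kleene_star, "aba"))     # False
```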
Regular Expressions
From CQP Tutorial...
Basic syntax of regular expressions
letters and digits are matched literally (including all non-ASCII
characters)
word -> word; C3PO -> C3PO; déjà -> déjà
. matches any single character ("matchall")
r.ng -> ring, rung, rang, rkng, r3ng, ...
character set: [...] matches any of the characters listed
moderni[sz]e -> modernise, modernize
[a-c5-9] -> a, b, c, 5, 6, 7, 8, 9
[^aeiou] -> b, c, d, f, ..., 1, 2, 3, ..., ä, à, á, ...
repetition of the preceding element (character or group):
? (0 or 1), * (0 or more), + (1 or more), {n} (exactly n), {n,m} (n to m)
colou?r -> color, colour; go{2,4}d -> good, goood, gooood
[A-Z][a-z]+ -> "regular" capitalised word such as British
grouping with parentheses: (...)
(bla)+ -> bla, blabla, blablabla, ...
(school)?bus(es)? -> bus, buses, schoolbus, schoolbuses
| separates alternatives (use parentheses to limit scope)
mouse|mice -> mouse, mice; corp(us|ora) -> corpus, corpora
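The patterns above can be checked outside CQP with Python's re module, whose basic syntax coincides for these operators. The sketch uses re.fullmatch on the assumption that a pattern has to cover the whole word form, as it does for a token in CQP; the negative examples are additions for illustration.

```python
import re

# (pattern, strings that should match, strings that should not)
CASES = [
    (r"r.ng",              ["ring", "rung", "r3ng"],        ["rng", "wrong"]),
    (r"moderni[sz]e",      ["modernise", "modernize"],      ["modernite"]),
    (r"[^aeiou]",          ["b", "1", "ä"],                 ["a", "bb"]),
    (r"colou?r",           ["color", "colour"],             ["colouur"]),
    (r"go{2,4}d",          ["good", "goood", "gooood"],     ["god", "goooood"]),
    (r"[A-Z][a-z]+",       ["British"],                     ["british", "BBC"]),
    (r"(bla)+",            ["bla", "blabla"],               ["", "blab"]),
    (r"(school)?bus(es)?", ["bus", "buses", "schoolbuses"], ["schoolbu"]),
    (r"corp(us|ora)",      ["corpus", "corpora"],           ["corp"]),
]

def check(cases):
    """Return the (pattern, string) pairs whose match result is unexpected."""
    failures = []
    for pattern, good, bad in cases:
        for s in good:
            if not re.fullmatch(pattern, s):
                failures.append((pattern, s))
        for s in bad:
            if re.fullmatch(pattern, s):
                failures.append((pattern, s))
    return failures

print(check(CASES))   # an empty list means every example behaved as listed
```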
Regular Expressions
Complex regular expressions can be used
to model (regular) inflection:
ask(s|ed|ing)? -> ask, asks, asked,
asking
(equivalent to the less compact
expression ask|asks|asked|asking)
sa(y(s|ing)?|id) -> say, says, saying,
said
[a-z]+i[sz](e[sd]?|ing) -> any form of a
verb with -ise or -ize suffix
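A quick way to see what these inflection patterns catch is to run them over a tokenised sentence. The sentence and the helper name forms_found are made up for this sketch; matching is done on whole tokens, on the assumption that this mirrors how a pattern applies to a word form.

```python
import re

INFLECTION = {
    "ask": r"ask(s|ed|ing)?",
    "say": r"sa(y(s|ing)?|id)",
    "-ise/-ize verbs": r"[a-z]+i[sz](e[sd]?|ing)",
}

def forms_found(pattern, tokens):
    """Return the tokens that the pattern matches in full."""
    return [t for t in tokens if re.fullmatch(pattern, t)]

tokens = "she said they were asking us to organise and then modernized the task".split()

for label, pattern in INFLECTION.items():
    print(label, "->", forms_found(pattern, tokens))
```

The last pattern over-generates: a word such as "rise" also matches, so hits still need manual checking or a part-of-speech constraint.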
Some examples from CQP
the specified word is interpreted as a regular expression
>"interest(s|(ed|ing)(ly)?)?";
> [(lemma="under.+") & (pos="V.*")];
a noun, followed by either is or was, followed by a verb ending
in ed:
[pos="N.*"] "is|was" [pos="V.*" & word=".*ed"];
similar, but is or was followed by a past participle (which is
described by a special POS tag; VBN in the Penn Treebank tagset):
[pos="N.*"] "is|was" [pos="VBN"];
catch or caught, followed by a determiner, any number of
adjectives and a noun, or
a noun, followed by was or were, followed by caught:
"catch|caught" [pos="DT"] [pos="JJ"]* [pos="N.*"] |
[pos="N.*"] "was|were" "caught";
look or bring, followed by either up or down with at most 10
non-verbs in between:
"look|bring" [pos != "VB.*"]{0,10} "up|down";
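The idea behind the CQP token queries above, regular expressions over the attributes of each corpus position, can be imitated in plain Python. This is a simplified sketch and not CQP itself: it handles only fixed-length sequences (no * or {0,10} over tokens), and the tagged sentence, its Penn Treebank style tags, and the helper names are assumptions.

```python
import re

# A tiny POS-tagged text as (word, pos) pairs; tags are Penn Treebank style by assumption.
TOKENS = [
    ("the", "DT"), ("thief", "NN"), ("was", "VBD"), ("arrested", "VBN"),
    ("and", "CC"), ("the", "DT"), ("case", "NN"), ("is", "VBZ"), ("closed", "VBN"),
]

def token_test(**constraints):
    """Build a predicate: every given attribute must fully match its regex."""
    def test(token):
        attrs = {"word": token[0], "pos": token[1]}
        return all(re.fullmatch(rx, attrs[name]) for name, rx in constraints.items())
    return test

def find_sequences(tokens, tests):
    """Return start indices where consecutive tokens satisfy the tests in order."""
    n = len(tests)
    return [i for i in range(len(tokens) - n + 1)
            if all(t(tok) for t, tok in zip(tests, tokens[i:i + n]))]

# Rough counterpart of: [pos="N.*"] "is|was" [pos="V.*" & word=".*ed"];
query = [token_test(pos="N.*"), token_test(word="is|was"),
         token_test(pos="V.*", word=".*ed")]

for start in find_sequences(TOKENS, query):
    print(" ".join(word for word, _ in TOKENS[start:start + len(query)]))
```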
Searching for more complex
patterns
Gsearch Corpus Query System
http://www.hcrc.ed.ac.uk/gsearch/
Facilitating the investigation of lexical and
syntactic phenomena in unparsed but tagged
corpora (can work with external taggers too)
Users specify their own context free grammar
A search over the 100-million-word BNC can take
something like 167 minutes
False positives should be manually eliminated
Visualization tools to display tree structures
Alternative: Using a class
library
Mason, O. Programming for Corpus
Linguistics: How to do text analysis with
Java, Edinburgh University Press, 2000.
CUE (Corpus Universal Examiner): class
library in Java that takes care of
indexing, compressing large corpora,
support for XML and Unicode
Qwick: a concordancing application that
is developed using CUE
A Professional Alternative
http://athel.com/
MonoConcPro ($95)
Features: Context Search, Regular
Expression search, Part-of-Speech Tag
Search, Collocations, and Corpus
Comparison.
Not language specific
You can also buy a Chinese (and other
languages) concordance T-shirt
From an older version of MonoConc Pro
Quality Control in Corpora
Format: Punctuation, delimiters,
character encoding,
Presence and order of all fields,
Typos in labels and annotation.
Explicit Documentation
Format Checker – Structure Checker
Solution: Versioning and Patching
mechanism in Treebanks and Corpora
Inter-rater agreement and reliability
Cochran’s Q test – binary values
Kappa – multivalued (Carletta, 1996)
Sensibly chosen unit of agreement
Expert vs naive coders
K > 0.8: good
Generalizability Theory (G-Theory)
(Bayerl and Paul, 2007) – finer grained
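To make the kappa threshold concrete, here is a minimal sketch of Cohen's kappa for two coders, K = (P(A) - P(E)) / (1 - P(E)), where P(A) is the observed agreement and P(E) the agreement expected by chance. Choosing the two-coder Cohen variant is an assumption; the kappa discussed by Carletta (1996) estimates chance agreement slightly differently, and the labels below are invented.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each coder's own label proportions, summed.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0   # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)

# Invented annotations of ten items by two coders.
coder1 = ["N", "N", "V", "V", "ADJ", "N", "V", "N", "ADJ", "N"]
coder2 = ["N", "N", "V", "N", "ADJ", "N", "V", "N", "V",   "N"]

print(round(cohen_kappa(coder1, coder2), 3))
```

Here the coders agree on 8 of 10 items, yet kappa comes out well below 0.8 because a good part of that agreement is expected by chance.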
Lecture 5
See the articles on the METU Turkish Corpus and
the METU-Sabancı Treebank under Lecture
Notes.