Using Corpora For Language Research


Introduction to Corpora and Corpus Linguistics
COGS 523 - Lecture 2
Corpus Design Issues I
18.07.2015
COGS 523 - Bilge Say
1
Related Readings
Readings (Course Pack):
• Tognini-Bonelli (2001) Corpus Issues. Ch 3
• McEnery et al. (2006) Units A7-A9, B1 (all appear to be one article in the course pack)
• Meyer (2002) Planning the Construction of a Corpus. Ch 2
Optional: Penn Treebank and Czech National Corpus articles from Course Pack
• McEnery and Wilson (2001) Chs 2 and 3
• Also available in Sampson and McCarthy (2005) Anthology:
  • Biber (1993) Representativeness in Corpus Design. Literary and Linguistic Computing 8(4)
  • Atkins, Clear and Ostler (1992) Corpus Design Criteria. Literary and Linguistic Computing 7(1)
What is a Corpus?
Turkish terms: Derlem (alt. Bütünce)
Text/Speech/Video + Annotation
Digital media
Written/Spoken Language
Design Criteria
Stages of Corpus Building-I (aka Corpus Compilation)
• Specifications and Design
  • Develop Infrastructure and Find Funding!!!
  • Sampling, Representativeness, Balance, Copyright issues
  • Piloting
  • Planning Manpower
  • Preparation of an Annotation Manual
  • Acquisition or Development of Software for Annotation
  • Technical Equipment Acquisition
  • Design and Development of Corpus Query Tools
  • Design of Change Management Processes
Stages of Corpus Building-II
• Data Capture and Preprocessing
  • Transcription, Tokenization, Error Correction
• Annotation (Markup)
• User Documentation
All of these are accompanied by cyclic quality control processes and beta releases for user feedback.
Representativeness and Balance
• Balance: Weightings between different sections of a corpus, according to its design purpose
• Representativeness: The findings from an idealized representative corpus should be generalizable to the whole language or a specified part of it.
• What is the relationship between balance and representativeness?
• Is ideal representativeness possible?
Ways to Approach Sampling
• Elitist: based on literary and academic merit
• Popularity
• Typicalness
• Availability
• Random (or sampling out of a National Library's holdings, for example)
More about sampling
• Choose a sampling frame: identify a specific population to make generalizations about
  • For the BNC spoken part: the United Kingdom was divided into 12 regions, with 30 sampling points selected based on their demographic profile.
  • Gender balance: may be hard to get in some genres
  • Who is native? ICE-US: had lived in the USA and spoken American English since 10-12 years of age
  • Education levels, age, dialect variation
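
One way to picture drawing speakers from a sampling frame is a stratified random draw. This is a minimal sketch under assumptions, not the procedure used for the BNC or ICE: the frame, the `region` and `gender` fields, and the `stratified_sample` helper are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(frame, stratum_of, per_stratum, seed=0):
    """Draw up to `per_stratum` random items from every stratum of the frame."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    strata = defaultdict(list)
    for item in frame:
        strata[stratum_of(item)].append(item)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        # A stratum with too few members contributes everything it has.
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Hypothetical sampling frame: 3 regions x 2 genders x 5 speakers each.
frame = [
    {"id": f"{region}-{gender}-{n}", "region": region, "gender": gender}
    for region in ("north", "south", "west")
    for gender in ("f", "m")
    for n in range(5)
]

chosen = stratified_sample(
    frame, lambda s: (s["region"], s["gender"]), per_stratum=2
)
print(len(chosen))  # 3 regions x 2 genders x 2 speakers each
```

Stratifying on region and gender guarantees that every combination is represented, which a simple random draw from the whole frame does not.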
Spoken Data Sampling
• Elicited: MapTask corpus
• Natural: self-recording
• Origins (immigrancy/nativeness, age, gender, geographic district, dialect)
• Dialogues vs Monologues
Something in between
• Netspeak: blogs, chatrooms, SMSs...
• Pre-prepared speeches...
Minimal Criteria for a Balanced General Corpus
Suggested by Sinclair (1991):
• Fiction vs nonfiction
• Book, journal vs newspaper
• Formal vs informal
• Control of age, gender, and origin of authors
Idealized vs Opportunistic Representativeness
• Measuring exposure (perception)
• Measuring production
• Purely frequency-based estimate: 90% conversation, 3% letters or notes, 7% press reportage, fiction, lectures etc.
• Distinguishing genre, register, text type
The size and frequency of exposures of Czech speakers to various topics and kinds of written language (Kucera, 2002)
I. Specialized (technical) texts: 33.50%
II. Non-specialized texts: 66.50%
  • Journals: 56%
  • Fiction and Poetry: 10%
  • Letters or chronicles: 0.50%
Size
• How many tokens are enough to discover the patterns of collocation, polysemy, morphology, syntax, discourse etc.?
• 10-20 million words suggested by Sinclair in 1991 for a general, small, useful corpus
• 100 million words: CNC, BNC
• 100 million words core, several hundred million more as periphery for ANC
Types vs Tokens
• Hapax legomena (Greek for “said only once”)
  • Almost half of the word types occur only once in the corpus
• 1 million word corpus: 100 word types occur more than 1000 times
• 100 million word corpus: 8000 word types can be expected to occur more than 1000 times, covering 95% of tokens. The remaining 5% of tokens are spread over ½ million word types.
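
The type/token distinction and the notion of a hapax legomenon can be made concrete with a small counting sketch. It assumes a deliberately naive tokenizer (lowercased runs of word characters); a real corpus would need language-specific tokenization, and the helper name `type_token_profile` is only illustrative.

```python
import re
from collections import Counter


def type_token_profile(text):
    """Count tokens, types, and hapax legomena in a text."""
    # Naive tokenization: lowercase, then take runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # A hapax legomenon is a type that occurs exactly once.
    hapaxes = sorted(word for word, n in counts.items() if n == 1)
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "hapax_legomena": hapaxes,
        "hapax_share_of_types": len(hapaxes) / len(counts) if counts else 0.0,
    }


sample = "the cat sat on the mat and the dog sat too"
profile = type_token_profile(sample)
print(profile["tokens"], profile["types"], profile["hapax_legomena"])
```

In this toy sentence "the" and "sat" repeat, so there are 11 tokens but only 8 types, 6 of which are hapax legomena.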
General Guidelines
• Prosody: 100,000 words of spontaneous speech
• 1 million words: verb form morphology, some syntactic processes, high-frequency vocabulary
• Cross-linguistic and scientific studies are rare!
• Always collect ~10% more than your aim. Despite best efforts at quality control, you may have to discard some data.
Individual Sample Size
• 2000 words (first-generation corpora)
• Varied vs fixed: BNC varies, as much as 40,000.
  • Fixed size: what if something is too small or too big?
  • Newspapers: “constructed week” concept
  • 20,000 words (Oostdijk, 1988)
  • 2000-5000 words from 20-80 texts from each genre (based on Biber’s 1990 study of 10 linguistic features from 55 pairs of samples from LOB and LLC)
  • May be an issue for copyright!
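
The “constructed week” mentioned above for newspapers is commonly understood as picking one Monday, one Tuesday, and so on, each drawn at random from anywhere in the sampling period, so that no single real week dominates. A minimal sketch of that reading follows; the function name, the date range, and the fixed seed are illustrative assumptions.

```python
import random
from datetime import date, timedelta


def constructed_week(start, end, seed=0):
    """Pick one random date per weekday (Monday..Sunday) within [start, end]."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    by_weekday = {d: [] for d in range(7)}  # 0 = Monday ... 6 = Sunday
    day = start
    while day <= end:
        by_weekday[day.weekday()].append(day)
        day += timedelta(days=1)
    if any(not days for days in by_weekday.values()):
        raise ValueError("period must contain every weekday at least once")
    return [rng.choice(by_weekday[d]) for d in range(7)]


# Hypothetical sampling period: one calendar year of a daily newspaper.
week = constructed_week(date(2015, 1, 1), date(2015, 12, 31))
for issue_date in week:
    print(issue_date.strftime("%A %Y-%m-%d"))
```

The issues published on the seven selected dates would then be sampled, which spreads the newspaper material across the whole period.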
Brown University Standard Corpus of Present-Day American English (Francis & Kucera) (Brown Corpus)
1 million words, 1961-1964, 500 samples of 2000 words each
Structure:
Informative Prose          75.0%    Imaginative                       25%
A. Press: reportage         8.8%    K. General Fiction                5.8%
B. Editorial (Press)        5.4%    L. Mystery and Detective Fiction  4.8%
C. Reviews (Press)          3.4%    M. Science Fiction                1.2%
D. Religion                 3.4%    N. Adventure & Western            5.8%
E. Skills & hobbies         7.2%    P. Romance & Love Story           5.8%
F. Popular lore             9.6%    R. Humor                          1.8%
G. Learned (academic)      16%
(Meyer, 2002)
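
The weightings above can be turned into sample counts with simple arithmetic. The sketch below assumes the figures are percentages of the 500 samples and uses only the categories listed on the slide (so it is a partial listing); the `allocate` helper is hypothetical.

```python
TOTAL_SAMPLES = 500      # from the slide: 500 samples
WORDS_PER_SAMPLE = 2000  # from the slide: 2000 words each

# Category shares as listed on the slide, read as percentages (assumption).
brown_shares = {
    "A. Press: reportage": 8.8,
    "B. Editorial (Press)": 5.4,
    "C. Reviews (Press)": 3.4,
    "D. Religion": 3.4,
    "E. Skills & hobbies": 7.2,
    "F. Popular lore": 9.6,
    "K. General Fiction": 5.8,
    "M. Science Fiction": 1.2,
    "R. Humor": 1.8,
}


def allocate(shares, total_samples=TOTAL_SAMPLES, words_per_sample=WORDS_PER_SAMPLE):
    """Map each category to (number of samples, number of words)."""
    plan = {}
    for category, percent in shares.items():
        n_samples = round(total_samples * percent / 100)
        plan[category] = (n_samples, n_samples * words_per_sample)
    return plan


plan = allocate(brown_shares)
for category, (n_samples, n_words) in plan.items():
    print(f"{category:<24} {n_samples:>3} samples {n_words:>7} words")
```

For example, 8.8% of 500 samples is 44 samples, or 88,000 words of press reportage.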
The division of text types and domains in the Czech synchronic corpus of written texts (Kucera, 2002)
I. Imaginative texts: 15%
  1. Fiction: 11.02%
  2. Poetry: 0.81%
  3. Drama: 0.21%
  4. Other literary texts: 0.36%
  5. Transitional types of texts: 2.60%
II. Informative texts: 85%
  1. Journals: 60%
  2. Technical and specialized texts: 25%
    a. Lifestyle: 5.55%
    b. Technology: 4.61%
    c. Social sciences: 3.67%
    d. Arts: 3.48%
    e. Natural sciences: 3.37%
    f. Economics and management: 2.27%
    g. Law and security: 0.82%
    h. Belief and religion: 0.74%
    i. Administrative texts: 0.49%
The composition of the British National Corpus (part of Table 2.1 in Meyer (2002))
Speech
Type                      Number of Texts   Number of Words   % of Spoken Corpus
Demographically Sampled               153         4,211,216                  41%
Educational                           144         1,265,318                  12%
Business                              136         1,321,844                  13%
Institutional                         241         1,345,694                  13%
Leisure                               187         1,459,419                  14%
Unclassified                           54           761,973                   7%
Total                                 915        10,365,464                 100%
The composition of the British National Corpus (part of Table 2.1 in Meyer (2002))
Writing
Type                      Number of Texts   Number of Words   % of Written Corpus
Imaginative                           625        19,664,309                   22%
Natural Science                       144         3,752,659                    4%
Applied Science                       364         7,369,290                    8%
Social Science                        510        13,290,441                   15%
World Affairs                         453        16,507,399                   18%
Commerce                              284         7,118,321                    8%
Arts                                  259         7,523,846                    8%
Belief & Thought                      146         3,053,672                    3%
Leisure                               374         9,990,080                   11%
Unclassified                           50         1,740,527                    2%
Total                                3209        89,740,554                   99%
Composition of the ICE (part of Table 2.2 in Meyer (2002))
Speech
Type            Number of Texts   Number of Words   % of Spoken Corpus
Dialogues                   180           360,000                  59%
  Private                   100           200,000                  33%
  Public                     80           160,000                  26%
Monologues                  120           240,000                  40%
  Unscripted                 70           140,000                  23%
  Scripted                   50           100,000                  17%
Total                       300           600,000                  99%
Private: direct conversations, distance conversations
Public: class lessons, broadcast discussions, broadcast interviews, parliamentary debates, legal cross-examinations, business transactions
Unscripted: spontaneous commentaries, speeches, demonstrations, legal presentations
Scripted: broadcast news, broadcast talks, speeches (not broadcast)
Copyright Issues
• Publishers
  • Scientific vs commercial aims conflict
  • Check who has the copyright
  • Have written, signed agreements
  • Status of some sources might be disputable: still have written and signed agreements
• Individuals
  • Get their informed consent; give a guarantee of being non-identified
Collecting and Computerizing Samples
• Written Text
  • Scanning (introduces OCR errors)
  • Electronic documents (different formats, different character sets)
  • Uploading documents (see ANC web site)
• Spoken Text
  • Inform participants of your aim and that there is no linguistically “correct” Turkish etc.
  • Record longer than needed (a 2000-word sample needs 10-20 minutes; collect 30 minutes) so that you can cut off unnatural parts at the beginning
  • Record in natural environments
  • Invest in good equipment and good software
  • Even so, 4 out of 10 samples may be unusable (Meyer, 2002)
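
The recording advice above implies some planning arithmetic. The sketch below treats "4 out of 10 unusable" as an average loss rate, which is an assumption (real loss varies from project to project), and uses a hypothetical helper `recordings_to_plan`.

```python
from fractions import Fraction
from math import ceil


def recordings_to_plan(usable_needed, unusable_rate=Fraction(4, 10), minutes_each=30):
    """Return (recordings to attempt, total recording minutes).

    Assumes `unusable_rate` is the average share of recordings that must
    be discarded; exact fractions avoid floating-point rounding surprises.
    """
    attempts = ceil(usable_needed / (1 - unusable_rate))
    return attempts, attempts * minutes_each


# Hypothetical target: 100 usable spoken samples of about 2000 words each.
attempts, minutes = recordings_to_plan(100)
print(attempts, "recordings,", minutes / 60, "hours of audio")
```

With a 40% loss rate, 100 usable samples call for 167 recording sessions, or 83.5 hours of audio at 30 minutes each.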
Recording Information About Samples
• File headings: annotation schemes like TEI account for that
  • Bibliographical info, ethnographic info, recording info, annotation info etc.
• Directory structures and file names
  • Usable: for the builders, for the users?
Partial directory structure of American component of ICE (Figure 3.1 of Meyer (2002))
Spoken
  Dialogues
    Business transactions
      Draft: S1B-071d, S1B-072d, etc.
      Lexical version: S1B-071l, S1B-072l, etc.
      Proofread version (I): S1B-071p1, S1B-072p1, etc.
      Proofread version (II): S1B-071p2, S1B-072p2, etc.
    Classroom discussions
    Political debates
    Spontaneous conversations
    Broadcast discussions
    Broadcast interviews
    Legal conversations
  Monologues
Written
  Printed
  Non-Printed
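
File names like those in the figure encode the processing stage in a suffix, which makes them easy to check by program. The sketch below infers the pattern only from the names shown (a text ID such as S1B-071 followed by d, l, p1, or p2); the regular expression and the `parse_name` helper are assumptions, not part of the ICE specification.

```python
import re

# Suffix meanings as listed in the figure.
VERSIONS = {
    "d": "Draft",
    "l": "Lexical version",
    "p1": "Proofread version (I)",
    "p2": "Proofread version (II)",
}

# Assumed shape: letter, digit, letter, hyphen, three digits, version suffix.
NAME_PATTERN = re.compile(r"^(?P<text_id>[A-Z]\d[A-Z]-\d{3})(?P<suffix>p1|p2|d|l)$")


def parse_name(name):
    """Return (text ID, version label), or None if the name does not match."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    return match.group("text_id"), VERSIONS[match.group("suffix")]


for name in ("S1B-071d", "S1B-072l", "S1B-071p1", "S1B-072p2", "notes.txt"):
    print(name, "->", parse_name(name))
```

A check like this can flag misnamed files before they reach users, which speaks to the question of whether a naming scheme is usable for builders and users alike.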
Lecture 3
Corpus Design II (Annotation)
• Readings: Meyer (2002) Ch 4; Sampson and McCarthy (2005) Ch 39; Garside (1997) Chs 4, 5, 16
• Inform me and Ayisigi (in writing) of your chosen corpus tool for software review by 17 March. Precheck with Ayisigi that the tool suits the task criteria.