Large numbers and indexing the British National Corpus

Paul Rayson, Geoffrey Leech, Andrew Wilson
Depts. of Computing and Linguistics, Lancaster University.
Dept. of Linguistics, University of Wales, Bangor.
Data & Software
• British National Corpus Version 2
• Lemmatised
• Matrix software on Sun Workstation
Types & Single Tokens
                Types     Hapax Legomena    % Hapax
Entire BNC      757097    397045            52.4
Informative     675497    353638            52.4
Imaginative     194385     85359            43.9
Context Gov      64568     23208            35.9
Demographic      44532     16508            37.1
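The percentage column above is the number of hapax legomena divided by the number of types. The sketch below shows one way such counts could be derived from a token stream. It is an illustration only: the function name `type_hapax_summary` is hypothetical, and it assumes each token is already a hashable item such as a (lemma, tag) pair. It is not the software named on the slides.

```python
from collections import Counter


def type_hapax_summary(tokens):
    """Return (types, hapax, percentage) for an iterable of tokens.

    Assumption: each token is already the unit being counted,
    for example a (lemma, POS tag) pair from a lemmatised corpus.
    """
    counts = Counter(tokens)
    types = len(counts)                                  # distinct items
    hapax = sum(1 for c in counts.values() if c == 1)    # items seen exactly once
    pct = 100.0 * hapax / types if types else 0.0
    return types, hapax, pct


# Checking the "Entire BNC" row of the table with the same arithmetic:
entire_bnc_pct = round(100 * 397045 / 757097, 1)  # 52.4
```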
Comparison
• Take two lemmatised grammatical word frequency lists
• Apply log-likelihood measure
• Sort by log-likelihood
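The three steps above can be sketched as follows. The slides do not give the formula, so this assumes the two-term log-likelihood statistic commonly used for comparing word frequencies across two corpora: the expected frequencies come from pooling both corpora, and LL = 2 * sum(O * ln(O / E)). The function names and the toy data are hypothetical. The "+" or "-" sign marks whether the item is relatively more frequent in the first list.

```python
import math


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Two-term log-likelihood for one item observed in corpora A and B.

    Assumed formula (not stated on the slides):
    E_a = size_a * (freq_a + freq_b) / (size_a + size_b), likewise E_b,
    LL = 2 * (freq_a * ln(freq_a / E_a) + freq_b * ln(freq_b / E_b)).
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    if freq_a > 0:  # a zero observed count contributes nothing
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2.0 * ll


def compare(list_a, list_b):
    """Compare two frequency lists; return rows sorted by LL, highest first."""
    size_a = sum(list_a.values())
    size_b = sum(list_b.values())
    rows = []
    for key in set(list_a) | set(list_b):
        a = list_a.get(key, 0)
        b = list_b.get(key, 0)
        sign = "+" if a / size_a > b / size_b else "-"
        rows.append((key, a, sign, log_likelihood(a, b, size_a, size_b), b))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


# Toy frequency lists keyed by (lemma, tag); the counts are invented.
spoken = {("the", "Det"): 50, ("oh", "Int"): 30, ("of", "Prep"): 20}
written = {("the", "Det"): 500, ("oh", "Int"): 10, ("of", "Prep"): 490}
ranked = compare(spoken, written)
```

In this toy data ("oh", "Int") ranks first with a "+" sign, because it is heavily over-represented in the first list relative to its size.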
Three views
Spoken vs. Written
All frequencies are per million words.
POS     FqSpoken    +/-    LL          FqWritten
Int        29015    +      980618.0          930
Pron      130374    +      776397.7        50610
Uncl       20283    +      775237.2          347
NoC       134567    -      428646.5       228509
NoP        16354    -      212676.4        43499
Adj        41980    -      188679.9        77896
Prep       72969    -      182661.4       117270
Det        71881    -      122250.0       106907
Adv        84809    +      120251.9        55650
Neg        17279    +       97049.2         6929
Verb      206114    +       86674.6       165180
DetP       45468    +       79025.1        28402
Lett        4974    +       54964.1         1238
Gen         1748    -       32983.8         5463
VMod       21400    +       30854.7        13988
Ex          4067    +        9284.1         2355
Fore         103    -        2935.7          394
ClO           57    -         830.2          159
Conj       58148    +         449.5        56487
Num        21794    +          97.4        21319
Inf        16616    +          14.4        16456