
Corpora and Statistical Methods –
Lecture 3
Albert Gatt
Part 1
Zipf’s law and the Zipfian distribution
Identifying words
Words
 Levels of identification:
 Graphical word (a token)
 Dependent on surface properties of text
 Underlying word (stem, root…)
 Dependent on some form of morphological analysis
 Practical definition: A word…
 is an indivisible (!) sequence of characters
 carries elementary meaning
 is reusable in different contexts
Indivisibility
 Words can have compositional meaning from parts that are
either words themselves, or prefixes and suffixes
 colour + -less = colourless (derivation)
 data + base = database (compounding)
 The notion of “atomicity” or “indivisibility” is a matter of
degree.
Problems with indivisibility
 Definite article in Maltese
 il-kelb
DEF-dog
“the dog”
 phonologically dependent on word
 German compounding:
 Lebensversicherungsgesellschaftsangestellter
“life insurance company employee”
 Arabic conjunctions:
 waliy
 One possible gloss: and I follow (w- is “and”)
Reusability
 Words become part of the lexicon of a language, and can be
reused.
 But some words can be formed on the fly using productive
morphological processes.
 Many words are used very rarely
 A large majority of the lexicon is inaccessible to native speakers
 Approximately 50% of the words in a novel will be used only
once within that novel (hapax legomena)
The graphic definition
 Many corpora, starting with Brown, use a definition of a
graphic word:
 sequence of letters/numbers
 possibly some other symbols
 separated by whitespace or punctuation
 But even here, there are exceptions.
 Not much use for tokenisation of languages like Arabic.
Non-alphanumeric characters
 Numbers such as 22.5
 in word frequency counts, typically mapped to a single type ##
 Other characters:
 Abbreviations: U.S.A.
 Apostrophes: O’Hara vs. John’s
 Whitespace: New Delhi
 A problem for tokenisation
 Hyphenated compounds:
 so-called, A-1-plus vs. aluminum-export industry
 How many words do we have here?
Tokenisation
 Task of breaking up running text into component words.
 Crucial for most NLP tasks, as parameters typically estimated based
on words.
 Can be statistical or rule-based. Often, simple regular
expressions will go a long way (see the sketch below).
 Some practical problems:
 Whitespace: very useful in Indo-European languages. In others (e.g.
East Asian languages, ancient Greek) no space is used.
 Non-alphanumeric symbols: need to decide if these are part of a word
or not.
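As a rough illustration, the sketch below shows a minimal regex-based tokeniser in Python. It is a toy under stated assumptions (the pattern, and its handling of apostrophes and hyphens, are simplifications), not a production tokeniser:

```python
import re

# Minimal regex tokeniser (a sketch): a token is either a run of word
# characters, possibly joined by internal apostrophes/hyphens, or a
# single non-space, non-word symbol (punctuation).
TOKEN_RE = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")

def tokenise(text):
    return TOKEN_RE.findall(text)

print(tokenise("John's so-called 'friend' lives in New Delhi."))
# ["John's", 'so-called', "'", 'friend', "'", 'lives', 'in', 'New', 'Delhi', '.']
```

Note that it still treats New Delhi as two tokens: whitespace-based heuristics cannot recover multi-word units.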
Types and tokens
Running example
 Throughout this lecture, data is taken from a corpus of
Maltese texts:
 ca. 51,000 words
 all from Maltese-language newspapers
 various topics and article types
 Compared to data from English corpora taken from Baroni
2007
Definitions (I)
 token = any word in the corpus
 (also counting words that occur more than once)
 type = all the individual, different words in the corpus
 (grouping words together as representatives of a single type)
 Example:
 I spoke to the chap who spoke to the child
 10 tokens
 7 types (I, spoke, to, the, chap, who, child)
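A minimal Python sketch of the same count, using whitespace splitting as the (crude) notion of a graphic word:

```python
sentence = "I spoke to the chap who spoke to the child"
tokens = sentence.split()   # every running word, repeats included
types = set(tokens)         # distinct word forms only

print(len(tokens))          # 10 tokens
print(len(types))           # 7 types
print(sorted(types))        # ['I', 'chap', 'child', 'spoke', 'the', 'to', 'who']
```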
Definitions (II)
 The number of tokens in the corpus is an estimate of overall
corpus size
 Maltese corpus: 51,000 tokens
 The number of types is an estimate of vocabulary size
 gives an idea of the lexical richness of the corpus
 Maltese corpus: 8193 types
Relative measures of frequency
 Relative frequency of a type:
 no. of occurrences of the type / corpus size
 In very large corpora, this is typically multiplied by a
constant
 e.g. multiplying by 1 million gives frequency per million
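A quick sketch of the computation, using two figures from this lecture (the Maltese corpus size and the frequency of aħħar):

```python
corpus_size = 51_000   # tokens in the Maltese corpus
freq = 97              # occurrences of "aħħar" ("last")

rel_freq = freq / corpus_size
print(round(rel_freq, 6))            # 0.001902
print(round(rel_freq * 1_000_000))   # 1902, i.e. ~1902 per million
```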
Type/token ratio
 Ratio (no. of types / no. of tokens) varies enormously depending on corpus size!
 If the corpus is 1000 words, it’s easy to see a TTR of, say,
40%.
 With 4 million words, it’s more likely to be in the region of
2%.
 Reasons:
 vocabulary size grows with corpus size, but much more slowly
 large corpora contain many repeated tokens of the same types
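The computation itself is trivial; a sketch using the Maltese corpus figures given earlier:

```python
n_types, n_tokens = 8_193, 51_000   # Maltese corpus, from this lecture

ttr = n_types / n_tokens
print(f"TTR = {ttr:.1%}")           # TTR = 16.1%
```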
Frequency lists (BNC)
 A simple list, pairing each word with its frequency
type               frequency
the                6054231
in                 1931797
time               149487
year               73167
man                57699
…
monarch            744
cumin              51
prestidigitation   3
Frequency lists (MT)
type                     frequency
aħħar (“last”)           97
jkun (“be.IMPERF.3SG”)   96
ukoll (“also”)           93
bħala (“as”)             91
dak (“that.SGM”)         86
tat- (“of.DEF”)          86
Frequency ranks
 Word counts can get very big.
 most frequent word in the Maltese corpus occurs 2195 times (and the
corpus is small)
 Raw frequency lists can be hard to process.
 Useful to represent words in terms of rank:
 count the words
 sort by frequency (most frequent first)
 assign a rank to the words:
 rank 1 = most frequent
 rank 2 = next most frequent
 …
Rank/frequency profile (BNC)
 rank 1 goes to the most frequent type
 all ranks are unique
 ties in frequency are given arbitrary rank
rank (r)   freq (f)
1          6054231
2          1931797
3          149487
…

Note the large differences in frequency from one rank to another.
Rank-frequency profile (MT)
Rank (r)   Frequency (f)
1          2195
2          2080
3          1277
4          1264

Differences in frequency from one rank to another are smaller than in the BNC.
Frequency spectrum (MT)
 A representation that shows, for each frequency value, the number of different types that occur with that frequency.

frequency   types
1           4382
2           1253
3           661
4           356
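A sketch of the computation in Python: applying a second Counter to the frequency values gives, for each frequency, the number of types occurring with that frequency:

```python
from collections import Counter

tokens = "I spoke to the chap who spoke to the child".split()
type_freqs = Counter(tokens)              # type -> frequency
spectrum = Counter(type_freqs.values())   # frequency -> no. of types

for freq, n_types in sorted(spectrum.items()):
    print(freq, n_types)
# 1 4   (I, chap, who, child: the hapax legomena)
# 2 3   (spoke, to, the)
```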
Word distributions (few giants, many midgets)
Non-linguistic case study
 Suppose we are interested in measuring people’s height.
 population = adult, male/female, European
 sample: N people from the relevant population
 measure height of each person in the sample
 Results:
 person 1: 1.6 m
 person 2: 1.5 m
 …
Measures of central tendency
 Given the height of individuals in our sample, we can
calculate some summary statistics:
 mean (“average”): sum of all heights in sample, divided by N
 mode: most frequent value
 What are your expectations?
 will most people be extremely tall?
 extremely short?
 more or less average?
Plotting height/frequency
Observations:
1. Extreme values are less frequent.
2. Most people fall near the mean.
3. The mode is approximately the same as the mean.
4. The curve is bell-shaped (a “normal” distribution).
Distributions of words
 Out of 51,000 tokens in the Maltese corpus:
 8016 tokens belong to just the 5 most frequent types (the types at ranks 1–5)
 ca. 15% of the corpus is made up of only 5 different words!
 Out of 8193 types:
 4382 are hapax legomena, occurring only once (bottom ranks)
 1253 occur only twice
 …
 In this data, the mean won’t tell us very much.
 it hides huge variations!
Ranks and frequencies (MT)
rank   freq
1      2195
2      2080
3      1277
…
2298   1
2299   1
…

Among the top ranks, frequency drops very dramatically (but this depends on corpus size). Among the bottom ranks, frequency drops very gradually.
General observations
 There are always a few very high-frequency words, and many
low-frequency words.
 Among the top ranks, frequency differences are big.
 Among bottom ranks, frequency differences are very small.
So what are the high-frequency words?
 Top-ranked words in the Maltese data:
 li (“that”), l- (DEF), il- (DEF), u (“and”), ta’ (“of”), tal- (“of the”)
 Bottom-ranked words:
 żona (“zone”), f = 1
 yankee, f = 1
 żwieten (“Zejtun residents”), f = 1
 xortih (“luck.3SGM”), f = 1
 widnejhom (“ear.POSS.3PL”), f = 1
Frequency distributions in corpora
 The top few frequency ranks are taken up by function words.
 In the Brown corpus, the 10 top-ranked words make up 23% of total
corpus size (Baroni, 2007)
 Bottom-ranked words display lots of ties in frequency.
 Lots of words occurring only once (hapax legomena)
 In Brown, ca. ½ of vocabulary size is made up of words that occur
only once.
Implications
 The mean or average frequency hides huge deviations.
 In Brown, average frequency of a type is 19 tokens. But:
 the mean is inflated by a few very frequent types
 most words will have frequency well below the mean
 Mean will therefore be higher than median (the middle value)
 not a very meaningful indicator of central tendency
 Mode (most frequent frequency value) is usually 1.
 This is typical of most large corpora. Same happens if we look at n-grams rather
than words.
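A toy sketch of this effect; the frequency values below are hypothetical, chosen only to mimic the skew described for Brown:

```python
import statistics

# Hypothetical type frequencies: a few giants, many midgets.
freqs = [6054, 1931, 149, 5, 3, 2, 1, 1, 1, 1]

print(statistics.mean(freqs))    # 814.8 -- inflated by a few frequent types
print(statistics.median(freqs))  # 2.5   -- far below the mean
print(statistics.mode(freqs))    # 1     -- most types occur just once
```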
Typical shape of a rank/frequency curve
Actual example (MT)
[Plot: frequency (0–2500) against rank (0–9000) for the Maltese corpus: a few high-frequency, low-rank words; hundreds of low-frequency, high-rank words.]
Zipf’s law
 Observation: Frequency decreases non-linearly with rank.
$f(w) = \frac{C}{r(w)^a}$

where:
 C: a constant, determined from the data; roughly the frequency of the most frequent word
 a: a constant, determined from the data
 Suppose a = 1, and C = 60,000.
 The model predicts:
 2nd most frequent word will have frequency C/2 = 30,000
 3rd most frequent: C/3 = 20,000
 20th most frequent: C/20 = 3,000
 So frequency decreases very rapidly as rank increases (a power-law decay).
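The predictions above can be reproduced directly; a sketch with the slide's values C = 60,000 and a = 1:

```python
C, a = 60_000, 1

def zipf_freq(rank):
    # Zipf's law: f(w) = C / r(w)**a
    return C / rank ** a

for r in (1, 2, 3, 20):
    print(r, zipf_freq(r))
# 1 60000.0
# 2 30000.0
# 3 20000.0
# 20 3000.0
```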
Things to note
 The law doesn’t predict frequency ties
 there are no ties among ranks
 The law is a power law: frequency is a function of a negative power of rank
 Taking the log of both sides gives us a linear function:
$\log f(w) = \log C - a \log r(w)$
 Basically a straight-line plot in log-log space.
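Because the relationship is linear in log-log space, C and a can be estimated by least squares; a sketch with numpy (the first four rank/frequency pairs are the Maltese figures above, the remaining pairs are hypothetical):

```python
import numpy as np

ranks = np.array([1, 2, 3, 4, 10, 100, 1000])
freqs = np.array([2195, 2080, 1277, 1264, 500, 60, 6])

# Fit log f = log C - a log r by ordinary least squares.
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
a_hat, C_hat = -slope, np.exp(intercept)
print(f"a = {a_hat:.2f}, C = {C_hat:.0f}")
```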
Log-log plot for MT data (a=1)
[Plot: observed frequencies deviate from the Zipf prediction both at high frequencies (top ranks) and at low frequencies (bottom ranks).]
Log-log plot for data from Baroni 2007
Some observations
 Empirical work has shown that the law doesn’t perfectly
predict frequencies:
 at the bottom ranks (low frequencies), actual frequency drops
more rapidly than predicted
 at the top ranks (high frequencies), the model predicts higher
frequencies than actually attested
Mandelbrot’s law
 Mandelbrot proposed a version of Zipf’s law as follows:
$f(w) = \frac{C}{(r(w) + b)^a}$
 (Note: Zipf’s original law is Mandelbrot’s law with b = 0)
 If b is a small value, it makes the predicted frequency of the top-ranked items (rank 1, 2, etc.) significantly smaller, but barely affects the lower ranks.
Comparison
 Let C = 60,000, a = 1 and b = 1
 Then, for a word of rank 1:
 Zipf’s law predicts f(w) = 60,000/1 = 60,000
 Mandelbrot’s law predicts f(w) = 60,000/(1+1) = 30,000
 For a word of rank 1000:
 Zipf predicts: f(w) = 60,000/1000 = 60
 Mandelbrot: f(w) = 60,000/1001 = 59.94
 So differences are bigger at the top than at the bottom.
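The comparison on this slide, reproduced as a sketch (same C, a, b):

```python
C, a, b = 60_000, 1, 1

def zipf(r):
    return C / r ** a

def mandelbrot(r):
    return C / (r + b) ** a

for r in (1, 1000):
    print(r, zipf(r), round(mandelbrot(r), 2))
# 1 60000.0 30000.0
# 1000 60.0 59.94
```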
Linear version of Mandelbrot
$\log f(w) = \log C - a \log(r(w) + b)$
 Note: plotted against log r(w), this is no longer a straight line, so it should fit our data better.
Consequences of the law
 Data sparseness: no matter how big your corpus, most of the
words in it will be of very low frequency.
 You can’t exhaust the vocabulary of a language: new words
will crop up as corpus size increases.
 implication: you can’t directly compare the vocabulary richness of corpora
of different sizes
Explanation for Zipfian distributions
 Zipf’s own explanation (“least effort” principle):
 Speaker’s goal is to minimise effort by using a few distinct
words as frequently as possible
 Hearer’s goal is to maximise clarity by having as large a
vocabulary as possible
Other Zipfian distributions
 Zipf’s law crops up in other domains (e.g. distribution of
incomes)
 Even randomly generated character strings show the same
pattern!
 short strings will be few, but likely to crop up by chance
 more long strings, but each one less likely individually
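A small simulation sketch of this "random typing" effect (the alphabet size, text length and random seed are arbitrary choices): generate random characters over a few letters plus space, split on spaces, and rank the resulting "words". The rank/frequency profile comes out Zipf-like even though the process is random.

```python
import random
from collections import Counter

random.seed(0)
alphabet = "abcd "   # the space character ends a "word"
text = "".join(random.choice(alphabet) for _ in range(100_000))
counts = Counter(text.split())

# Short strings dominate the top ranks; long strings are rare.
for rank, (word, freq) in enumerate(counts.most_common(5), start=1):
    print(rank, word, freq)
```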