Slide 1 - Foreign Language Teaching and Research Press (FLTRP)


Analysing vocabulary in learner corpora
Liang Maocheng (梁茂成)
National Research Centre for Foreign Language Education (中国外语教育研究中心)
Outline
What’s in a wordlist?
Types of wordlists
Comparing wordlists
Word frequencies, text difficulty, and learners’ writing quality
What’s in a wordlist?
Zipf’s law
Major sections of a wordlist
The effect of text homogeneity
Zipf’s law
Zipf’s law:
Given some corpus of natural language
utterances, the frequency of any word is
inversely proportional to its rank in the
frequency table. Thus the most frequent word
will occur approximately twice as often as the
second most frequent word, three times as often
as the third most frequent word, etc.
Zipf’s law
In the Brown Corpus:
"the" is the most frequently occurring word,
and by itself accounts for nearly 7% of all
word occurrences. True to Zipf's Law, the
second-place word "of" accounts for slightly
over 3.5% of words, followed by "and".
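As a rough check of the figures above, the sketch below computes the share that an idealised Zipf's law would predict for a given rank from the share of the top-ranked word. The function name `zipf_share` is an illustrative assumption, and the 7% input is the approximate Brown Corpus figure quoted above.

```python
def zipf_share(top_share: float, rank: int) -> float:
    """Share of tokens predicted for `rank` under an idealised Zipf's law.

    Assumes frequency is exactly inversely proportional to rank,
    which real corpora only approximate.
    """
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return top_share / rank


# "the" accounts for nearly 7% of the Brown Corpus (figure quoted above).
for rank in (1, 2, 3):
    print(rank, f"{zipf_share(7.0, rank):.2f}%")
```

The prediction for rank 2 is 3.5%, in line with the share quoted for "of".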
Zipf’s law
Zipf's law is most easily observed by plotting
the data on a log-log graph, with the axes
being log(rank order) and log(frequency).
[Figure: log-log plot of a wordlist; axes: log(rank order) and log(frequency)]
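One way to put a number on the log-log plot is to fit a straight line to it. The sketch below is a minimal least-squares fit, assuming the input is a list of frequencies already sorted from most to least frequent. The function name `loglog_slope` and the test data are illustrative assumptions, not part of the original slides.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    `frequencies` must be sorted from most to least frequent, so that
    the first item has rank 1. A slope near -1 is what Zipf's law predicts.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Idealised Zipfian counts (not real corpus data): frequency = 1200 / rank.
ideal = [1200 / rank for rank in range(1, 51)]
print(round(loglog_slope(ideal), 3))
```

The slope does not depend on the base of the logarithm, so natural logs are used here for convenience.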
Zipf’s law
For different corpora, Zipf’s curve remains
pretty much the same, except that the ranking
of the words is different.
The difference in the rank order of the words
in the lists reflects the difference between the
corpora.
Major sections of a wordlist
Rank   Word   Frequency   Share
1      the    69387       6.71%
2      of     36355       3.52%
3      and    28827       2.79%
4      to     26128       2.53%
5      a      23452       2.27%
6      in     21277       2.06%
7      that   10750       1.04%
…      …      …           …
Top section:
Mostly function words
Familiar to most people
Sensitive to text types?
Hardly meaningful
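A table with these four columns (rank, word, frequency, share) can be produced from raw text in a few lines. The sketch below is one minimal way to do it, assuming a deliberately crude tokeniser. The function name `build_wordlist` and the sample sentence are illustrative assumptions.

```python
import re
from collections import Counter


def build_wordlist(text):
    """Return (rank, word, frequency, percent) rows, most frequent first."""
    # Deliberately simple tokenisation: lower-cased runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, 100 * freq / total))
    return rows


sample = "the cat sat on the mat and the dog sat too"
for rank, word, freq, pct in build_wordlist(sample)[:3]:
    print(f"{rank}\t{word}\t{freq}\t{pct:.2f}%")
```

Words with equal frequency receive consecutive ranks in this sketch. A concordancer may handle ties and tokenisation differently.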
Major sections of a wordlist
Near-top section:
Rank   Word     Frequency   Share
199    always   456         0.04%
213    almost   431         0.04%
215    enough   430         0.04%
270    often    369         0.04%
275    ever     363         0.04%
…      …        …           …
Major sections of a wordlist
Near-top section:
Rank   Word   Frequency   Share
122    make   803         0.08%
156    take   618         0.06%
208    put    438         0.04%
212    set    433         0.04%
231    look   401         0.04%
237    find   397         0.04%
247    give   392         0.04%
…      …      …           …
Major sections of a wordlist
Near-top section:
Rank   Word      Frequency   Share
201    fact      447         0.04%
211    head      434         0.04%
218    end       428         0.04%
221    system    419         0.04%
273    things    365         0.04%
314    problem   315         0.03%
787    trouble   137         0.01%
789    game      137         0.01%
Major sections of a wordlist
Near-top section:
Rank   Word        Frequency   Share
203    less        446         0.04%
223    better      417         0.04%
240    next        396         0.04%
258    large       378         0.04%
264    important   373         0.04%
266    possible    372         0.04%
268    big         370         0.04%
309    whole       318         0.03%
Major sections of a wordlist
Near-top section:
Frequency adverbs and intensifiers;
Delexicalised verbs;
Shell nouns;
Procedural vocabulary;
General adjectives
Major sections of a wordlist
Near-top section:
Meaningful, but often ambiguous;
Seldom function independently;
Distributed across all text types
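Whether a word is "distributed across all text types" can be checked by counting how many texts it appears in, a measure often called range. The sketch below assumes each text is already a list of tokens. The function name `range_counts` and the toy texts are invented purely for illustration.

```python
from collections import Counter


def range_counts(texts):
    """Map each word to the number of texts it appears in (its range)."""
    ranges = Counter()
    for tokens in texts:
        # set() ensures a word is counted once per text.
        ranges.update(set(tokens))
    return ranges


# Toy token lists, invented purely for illustration.
texts = [
    ["make", "a", "plan"],
    ["make", "the", "enzyme", "assay"],
    ["take", "a", "look", "and", "make", "notes"],
]
ranges = range_counts(texts)
print(ranges["make"], ranges["enzyme"])
```

A word with a range equal to the number of texts occurs everywhere. A word with a range of 1 is confined to a single text.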
Major sections of a wordlist
In-the-middle-of-the-list words:
Genre-specific;
Distributed evenly across texts in a certain
genre.
Mostly AWL (Academic Word List) words
Major sections of a wordlist
Low-frequency words:
Technical vocabulary;
Different across subject domains or regions.
Does anyone know this word?
kumara
Major sections of a wordlist
Long tail:
Hapax legomena;
Coinages;
Errors;
What about proper names?
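Hapax legomena are simply the words with a frequency of 1, so they are easy to pull out of a frequency count. The sketch below is a minimal illustration. The function name `hapax_legomena` and the sample sentence are assumptions. Separating coinages, errors, and proper names within the result would still need manual inspection.

```python
from collections import Counter


def hapax_legomena(tokens):
    """Return the sorted list of words that occur exactly once."""
    counts = Counter(tokens)
    return sorted(word for word, freq in counts.items() if freq == 1)


# Toy sentence, invented for illustration.
tokens = "we seen the the kumara and we liked the kumara pie".split()
print(hapax_legomena(tokens))
```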
Major sections of a wordlist
Wordlist derived from learner corpora:
High frequency words;
Errors;
Much more to be found.
Zipf’s curve from a learner corpus
Types of wordlists
Depending on what the entries are, a wordlist
can be:
a raw wordlist
a lemmatised wordlist
a word-family list
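The difference between a raw and a lemmatised wordlist can be sketched as a merging step: counts for inflected forms are added to the count of their lemma. The lemma map below is a hypothetical hand-made example. A real one would come from a lemma list or a lemmatiser. A word-family list would apply the same idea with a family map in place of the lemma map.

```python
from collections import Counter

# Hypothetical lemma map for illustration only.
LEMMAS = {
    "makes": "make", "made": "make", "making": "make",
    "took": "take", "takes": "take",
}


def lemmatised_counts(raw_counts, lemma_map):
    """Merge raw word-form counts under their lemmas."""
    merged = Counter()
    for form, freq in raw_counts.items():
        # Forms missing from the map are kept as their own lemma.
        merged[lemma_map.get(form, form)] += freq
    return merged


# Toy raw wordlist, invented for illustration.
raw = Counter({"make": 5, "made": 3, "makes": 2, "took": 1, "plan": 4})
print(lemmatised_counts(raw, LEMMAS).most_common())
```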
Types of wordlists
Depending on the source from which a
wordlist is derived, a wordlist can be:
from a general corpus
from a specialised corpus
texts of a shared topic
from one particular text
from learner corpora
Comparing wordlists
Keywords
Statistical tests
Comparing wordlists
Keywords: words whose frequency in one
corpus (the observed corpus) is unusually high
or low compared with their frequency in another
corpus (the reference corpus).
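The slides do not say which statistical test is used to decide what counts as "unusually" frequent. One common choice in keyword analysis is the log-likelihood statistic, sketched below as an assumption rather than as the method used in the talk. The function names and the example counts are illustrative.

```python
import math


def log_likelihood(freq_obs, size_obs, freq_ref, size_ref):
    """Log-likelihood keyness score for one word in two corpora.

    freq_* are the word's frequencies; size_* are corpus sizes in tokens.
    """
    total_freq = freq_obs + freq_ref
    total_size = size_obs + size_ref
    expected_obs = size_obs * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    score = 0.0
    # A zero count contributes nothing (0 * log 0 is taken as 0).
    if freq_obs > 0:
        score += freq_obs * math.log(freq_obs / expected_obs)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score


def is_overused(freq_obs, size_obs, freq_ref, size_ref):
    """True if the relative frequency is higher in the observed corpus."""
    return freq_obs / size_obs > freq_ref / size_ref


# Invented counts: 100 hits in 10,000 tokens versus 50 hits in 10,000 tokens.
print(round(log_likelihood(100, 10_000, 50, 10_000), 2))
print(is_overused(100, 10_000, 50, 10_000))
```

The score itself is always non-negative, so the direction (higher or lower than the reference corpus) has to be read from the relative frequencies. Scores above 3.84 correspond to p < 0.05 with one degree of freedom.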
Interpreting keywords
Word frequencies, text difficulty, and learners’ writing quality
Paul Nation’s great invention:
ADAPT 0
ADAPTABILITY 0
ADAPTABLE 0
ADAPTATION 0
ADAPTATIONS 0
ADAPTED 0
ADAPTING 0
ADAPTIVE 0
ADAPTS 0
Word frequencies, text difficulty, and learners’ writing quality
Token frequencies
Type frequencies
Family frequencies
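The three counts differ in what is treated as "the same word". The sketch below counts tokens, types, and families for a short token list, using the ADAPT family shown earlier. The mapping format, the function name `frequency_profile`, and the sample sentence are assumptions made for illustration.

```python
# Members of the ADAPT word family, lower-cased.
ADAPT_FAMILY = {
    "adapt", "adaptability", "adaptable", "adaptation", "adaptations",
    "adapted", "adapting", "adaptive", "adapts",
}


def frequency_profile(tokens, families):
    """Count tokens, types, and word families in a token list.

    `families` maps a headword to the set of its member forms; words
    not covered by any family are treated as one-member families.
    """
    form_to_head = {
        form: head for head, forms in families.items() for form in forms
    }
    types = set(tokens)
    family_heads = {form_to_head.get(word, word) for word in types}
    return {
        "tokens": len(tokens),
        "types": len(types),
        "families": len(family_heads),
    }


families = {"adapt": ADAPT_FAMILY}
tokens = "they adapted the plan and the adaptation adapts well".split()
print(frequency_profile(tokens, families))
```

In this toy sentence, "adapted", "adaptation", and "adapts" are three types but one family, and the repeated "the" is two tokens but one type.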
Word frequencies, text difficulty, and learners’ writing quality
Word frequencies and text difficulty
The more high-level words a text contains, the
more difficult the text tends to be.
The more types a text contains, the more
difficult the text tends to be.
Word frequencies, text difficulty, and learners’ writing quality
Type-token ratio
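The type-token ratio is the number of types divided by the number of tokens. A minimal sketch follows, assuming the text has already been tokenised. The function name and the sample sentence are illustrative.

```python
def type_token_ratio(tokens):
    """Number of distinct words divided by the number of running words."""
    if not tokens:
        raise ValueError("cannot compute the TTR of an empty text")
    return len(set(tokens)) / len(tokens)


tokens = "the cat sat on the mat".split()
print(round(type_token_ratio(tokens), 3))
```

Because the ratio tends to fall as texts get longer, it is usually compared only across texts of similar length.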
Word frequencies, text difficulty, and learners’ writing quality
Word frequencies and learners’ writing quality:
A series of correlation analyses
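The slides do not specify which correlation coefficient is used. The sketch below assumes Pearson's r between one lexical measure per essay and the essay's score. The variable names are hypothetical, and the numbers are placeholders invented for illustration, not results from any learner corpus.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder values invented for illustration, not results from any corpus.
ttr_per_essay = [0.40, 0.45, 0.50, 0.55]
score_per_essay = [8, 9, 10, 11]
print(round(pearson_r(ttr_per_essay, score_per_essay), 2))
```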
Thanks