Analyzing a Japanese Reading Text
as a Vocabulary Learning Resource
by Lexical Profiling and Indices
Tatsuhiko Matsushita (松下達彦)
PhD candidate
Victoria University of Wellington
[email protected]
The First Extensive Reading World Congress
4 September, Kyoto Sangyo University
Motivation
• How can we control the vocabulary of a
reading text to maximize the vocabulary
learning effect?
– Too easy: few words to learn
– Too many unknown words: no learning/inference possible
Goals
• To show methods to assess a (Japanese)
reading text as a vocabulary learning resource
by exploiting lexical profiling and indices
Conclusion = Main Points
The simplest way to rewrite a reading text (of
2,000 words or fewer) into a better resource for
vocabulary learning:
I. Delete the one-timers (or the words occurring
fewer times than the set level) at the
lowest frequency levels in the text, or
II. make them occur more often in the text by adding
words or replacing other words with the one-timer.
→ The index (LEPIX) figure will improve.
• These methods make it possible to predict and
compare the efficiency of second language
vocabulary learning with a reading text.
Similar Previous Ideas and Attempts
• Nation & Deweerdt (2001)
• Ghadirian (2002)
• Cobb (2007)
*No integrated index is shown in previous studies
Lexical Profiling
• Basically the same idea as the Lexical Frequency
Profile (LFP) (Laufer, 1994):
“the percentage of words … at different
vocabulary frequency levels” (p. 23)
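For illustration, a lexical frequency profile can be computed in a few lines of Python. This is only a sketch, not AntWordProfiler itself; the `level_of` mapping (word → frequency band) is a hypothetical stand-in for a real baseword list.

```python
from collections import Counter

def lexical_frequency_profile(tokens, level_of):
    """Percentage of tokens at each frequency level (cf. Laufer, 1994).

    tokens:   lemmatized words of the text
    level_of: word -> frequency band (1 = most frequent 1K, 2 = next 1K, ...);
              words missing from the baseword list count as off-list.
    """
    n = len(tokens)
    bands = Counter(level_of.get(w, "off-list") for w in tokens)
    return {band: round(100 * count / n, 1) for band, count in bands.items()}

# e.g. lexical_frequency_profile(["今", "は", "五月", "だ"],
#                                {"今": 1, "は": 1, "だ": 1, "五月": 3})
# -> {1: 75.0, 3: 25.0}
```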
The Baseword Lists for Lexical Profiling
• VDRJ: Vocabulary Database for Reading Japanese
(Matsushita, 2010; 2011) http://www.geocities.jp/tatsum2003/
– All words are ranked by the Usage Coefficient (Juilland & Chang-Rodrigues, 1964):
U = Frequency × Dispersion (see the sketch below)
– Three types of word rankings:
• For General Learners
• For International Students (used for this study)
• For General Written Japanese
• Japanese Character Frequency List
(Matsushita, unpublished)
– Created from the same corpus (BCCWJ) as the VDRJ
* When analyzing Japanese texts, it is necessary to set a
certain level of known characters (kanji) as well as
vocabulary
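Juilland's usage coefficient is straightforward to compute. The sketch below uses the textbook definition U = F × D with D = 1 − CV/√(n − 1) over n corpus parts; the exact computation behind the VDRJ rankings may differ in detail.

```python
import math

def juilland_u(subfreqs):
    """Usage coefficient U = F * D (Juilland & Chang-Rodrigues, 1964).

    subfreqs: the word's frequency in each of the n parts of the corpus.
    D = 1 - CV / sqrt(n - 1), where CV is the coefficient of variation
    of the subfrequencies; D is 1 for perfectly even dispersion.
    """
    n = len(subfreqs)
    f = sum(subfreqs)
    if f == 0 or n < 2:
        return 0.0
    mean = f / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in subfreqs) / n)
    d = 1 - (sd / mean) / math.sqrt(n - 1)
    return f * max(d, 0.0)

# An evenly dispersed word keeps its full frequency:
# juilland_u([10, 10, 10, 10]) -> 40.0
```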
Assumptions
I. Required Level of Text Coverage
The words assumed known to the reader must
cover the text up to a certain level.
(e.g. Hu & Nation, 2000)
II. Minimum Occurrences of Target Words
Among the words assumed unknown, those which
occur at least a certain number of times
can be the learning target words.
(e.g. Waring & Takaki, 2003)
III. More Types of Target Words
A text in which more types of target words
occur is a better vocabulary learning resource.
IV. Density of Target Words (%)
A text in which the target words occur at a
higher ratio is a better vocabulary learning resource.
Methods
The main software: AntWordProfiler Ver. 1.200W (Anthony, 2009)
I. To identify the lexical level of the text by lexical profiling,
set the threshold level of (assumed) known words.
In this study, the levels are:
A) 98% for an extensive reading text → Lexical Level of Text (LLT98)
B) 95% for an instructional material → Lexical Level of Text (LLT95)
(Hu & Nation, 2000)
II. To identify the target words, set the minimum
occurrences of target words:
A) twice or more for an extensive reading text
(the set occurrence will depend on the text length);
B) twice for a short instructional material.
*6-10 occurrences are required for learning a word
incidentally through reading (e.g. Waring & Takaki, 2003);
however, a word is not learned by reading one short text.
III. Count T, the number of types of the target words.
IV. Calculate W × 100 / N, where:
W is the number of tokens of the target words, and
N is the total number of tokens of the text.
Lexical Learning Possibility Index for a
Reading Text (LEPIX)
*Simply multiply the factors of III and IV:
LEPIX = (T × W × 100) / N
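As a minimal sketch (not the AntWordProfiler workflow actually used in the study), steps I-IV could be implemented as follows; the `level_of` mapping from a word to its frequency level is again a hypothetical stand-in for a baseword list such as the VDRJ.

```python
from collections import Counter

OFF_LIST = 10**9  # words missing from the baseword list sort last (hardest)

def lexical_level(tokens, level_of, coverage=95.0):
    """Step I: the frequency level at which cumulative text coverage
    first reaches the threshold (LLT95 / LLT98)."""
    n = len(tokens)
    per_level = Counter(level_of.get(w, OFF_LIST) for w in tokens)
    covered = 0
    for level in sorted(per_level):
        covered += per_level[level]
        if 100 * covered / n >= coverage:
            return level
    return OFF_LIST

def lepix(tokens, level_of, known_level, min_occurrences=2):
    """Steps II-IV: LEPIX = (T * W * 100) / N, where the target words are
    the words above the known level occurring at least
    `min_occurrences` times."""
    n = len(tokens)
    counts = Counter(tokens)
    targets = [c for w, c in counts.items()
               if level_of.get(w, OFF_LIST) > known_level
               and c >= min_occurrences]
    t = len(targets)   # T: types of target words
    w = sum(targets)   # W: tokens of target words
    return t * w * 100 / n
```

Here `known_level` would be set from `lexical_level(tokens, level_of, 95.0)` (or 98.0), so that the target words are exactly those beyond the chosen coverage point.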
Sample Text (original)
人知のシミュレーションが人工知能だとすれば、コンピュータのなかに
「知をあつかうメカニズム」を作り込まなければならない。
ところでコンピュータとは、要するに〈記号処理マシン〉である。だから
この場合の〈知〉とは、「記号で表された知」ということになる。記号と
いっても色々あるが、人工知能が得意なのは、いわゆる言語記号である。た
とえば、「今は五月だ」「五月は春だ」「楓の葉は、春と夏には緑色、秋に
は赤色である」などというのがその守備範囲ということになる。
ところでこういった例は、少しばかり興ざめではなかろうか? というの
は、〈知〉とは、単なる知識の断片ではなく、それらを包括し、横断しなが
ら世界に光を当てていく精神のダイナミズムのように思えるからである。
〈知〉はイマジネーションの能力を持たなければならない。さらに〈知〉は、
スポーツのような身体の所作にうめこまれている、明言化されない暗黙知の
領域をもカバーしなければならない。それこそが、知の知たるゆえんではな
いだろうか?
残念ながら、現在の人工知能技術は、この期待に応えるすべを知らない。
それはいまだに、図像さえ自由自在には扱えないのである。英語や日本語な
どの〈自然言語〉を操作するだけでも四苦八苦なのである。
(出典:西垣 通『秘術としてのAI思考』)
Sample Text (modified)
人間の頭脳を模倣して作ったものが人工知能だとすれば、コンピュータの
中に「知をあつかうメカニズム」をていねいに作っていかなければならない。
しかしそこへの道はまだ程遠い。
コンピュータとは、要するに〈記号処理のメカニズム〉である。だからこ
の場合の知とは、「記号で表された知」ということになる。記号といっても
いろいろあるが、人工知能が得意なのは、いわゆる言語記号である。例えば、
「今は五月だ」「五月は春だ」「カエデの葉は、春と夏には緑、秋には赤で
ある」などという人工言語的表現は処理しやすいのである。
しかし、こういった例は、少しばかりつまらないのではないだろうか?
というのは、知とは、一つ一つの知識がバラバラに存在するのではなく、そ
れらを一つにまとめたり、横断したりしながら、世界に光を当てていく精神
の力強い働きのように思えるからである。知は想像力を持たなければならな
い。さらに知は、スポーツのような身体の動きの中にある、はっきりとした
言葉にならない知の領域もカバーしなければならない。カエデといえば私た
ちが紅葉を見て感じる気持ちまで横断的にカバーしなければならないのだ。
それこそが、知を知として成り立たせているものではないだろうか。
残念ながら、現在の人工知能技術は、この期待に応えるすべを知らない。
人間の頭脳の模倣にはまだ程遠いレベルだ。英語や日本語などの〈自然言
語〉を操作するだけでも非常に苦労しているのである。
Treatment for Low Frequency Words

[Table: for each low-frequency word in the text, with its VDRJ level for international students (IS_05K up to IS_21K+), the table gives its frequency and cumulative text coverage in the original text, its frequency and cumulative coverage in the modified text, and the treatment applied. The words covered are 記号, マシン, メカニズム, 横断, 緑色, 断片, 自在, 知, 紅葉, 頭脳, 包括, 暗黙, 楓, 模倣, 知能, 程遠い, 守備, シミュレーション, 埋め込む, 明言, 赤色, 所作, 図像, 四苦, 八苦, ダイナミズム, イマジネーション, 人知, 作り込む, 由縁 and 興醒め. Most one-timers at the lowest frequency levels were deleted; a few words (e.g. 紅葉, 頭脳, 模倣, 程遠い) occur only in the modified text.]

Explanation of Treatment
A: Changed from assumed known words to target words due to the change of the Lexical Level of the Text
B: Changed from non-target words to target words by adding occurrences to a one-timer
C: Newly added target words, introduced by replacing original sentences with new expressions
Deleted: one-timers removed from the modified text

*Check the level of characters (kanji) as well, and avoid low-frequency ones.
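The treatments above suggest a simple automated check. The sketch below, under the same assumptions as earlier (a hypothetical `level_of` baseword mapping), lists the one-timers above the assumed known level, i.e. the candidates for deletion or for boosting to the set occurrence level.

```python
from collections import Counter

def one_timers_to_treat(tokens, level_of, known_level, set_level=2):
    """Words above the known level occurring fewer than `set_level` times.

    Each of these should either be deleted from the text or made to
    occur at least `set_level` times, so that it counts as a target word.
    """
    counts = Counter(tokens)
    return sorted(
        (w for w, c in counts.items()
         if level_of.get(w, known_level + 1) > known_level
         and c < set_level),
        key=lambda w: level_of.get(w, float("inf")),
        reverse=True,  # start with the lowest-frequency (hardest) words
    )
```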
Comparison between the Original and the Modified Texts

Item                                                                Original  Modified
Text Length (= Total Number of Tokens) (N)                               275       339
Total Number of Types                                                    118       130
Number of Tokens over 95% Text Coverage                                   14        19
Number of Types over 95% Text Coverage                                    14         8
95% Text Coverage Level = Lexical Level of the Text (LLT95)              10K       05K
Minimum Occurrences of Target Words over 95% Text Coverage                 2         2
Number of Target Tokens over 95% Text Coverage (W95)                       0        19
Number of Target Types over 95% Text Coverage (T95)                        0         8
Density of Target Words (%) (W95×100/N)                                  0.0       5.6
Average Occurrences of Target Words (W95/T95)                            0.0       2.4
Lexical Learning Possibility Index over 95% Text Coverage
(LEPIX95) ((T95×W95×100)/N)                                              0.0      44.8
Number of Tokens over 98% Text Coverage                                    6         7
Number of Types over 98% Text Coverage                                     6         3
98% Text Coverage Level = Lexical Level of the Text (LLT98)              20K       08K
Minimum Occurrences of Target Words over 98% Text Coverage                 2         2
Number of Target Tokens over 98% Text Coverage (W98)                       0         7
Number of Target Types over 98% Text Coverage (T98)                        0         3
Density of Target Words (%) (W98×100/N)                                 0.00       2.1
Average Occurrences of Target Words (W98/T98)                           0.00      2.33
Lexical Learning Possibility Index over 98% Text Coverage
(LEPIX98) ((T98×W98×100)/N)                                              0.0       6.2
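Reading the modified-text column off the table: LEPIX95 = (T95 × W95 × 100) / N = (8 × 19 × 100) / 339 ≈ 44.8, while the original text has no word beyond its 95% coverage point occurring twice or more, so its LEPIX95 stays at 0.0.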
For Learning Domain-Specific Words
I. The target domain is set first.
II. The domain-specific words included in the
text are identified by checking them against a
list of domain-specific words.
III. The levels of the identified domain-specific
words are checked by lexical profiling, to see
how many unknown domain-specific words the
text contains.
IV. The indices are calculated (see the sketch below).
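A sketch of the technical-word variant follows, assuming the domain-specific word list is available as a plain set; it simply restricts the target words to that set before applying the same LEPIX formula.

```python
from collections import Counter

def lepix_technical(tokens, level_of, known_level, domain_words,
                    min_occurrences=2):
    """LEPIXt = (Tt * Wt * 100) / N, counting only domain-specific words
    that are above the known level and occur often enough."""
    n = len(tokens)
    counts = Counter(tokens)
    targets = [c for w, c in counts.items()
               if w in domain_words
               and level_of.get(w, known_level + 1) > known_level
               and c >= min_occurrences]
    t, wt = len(targets), sum(targets)
    return t * wt * 100 / n
```

With the figures for text #6-1 in the table below, this yields LEPIX95t = (7 × 22 × 100) / 1193 ≈ 12.9.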
Indices, Text Coverage and Numbers of Tokens and Types for
the Whole Text and the Target Words (Technical Words)

Item                                                                    #6-1       #6-2
Text Length (= Total Number of Tokens) (N)                              1193       2823
Total Number of Types                                                    250        690
Target Domain                                                      Economics  Economics
Number of Tokens over 95% Text Coverage                                   60        142
Number of Types over 95% Text Coverage                                    24         87
95% Text Coverage Level = Lexical Level of the Text (LLT95)              04K        08K
Number of Technical Word Tokens over 95% Text Coverage                    25         35
Number of Technical Word Types over 95% Text Coverage                     10         15
Number of Technical Target Word Tokens over 95% Text Coverage (W95t)      22         27
Number of Technical Target Word Types over 95% Text Coverage (T95t)        7          7
Density of Technical Target Words (%) (W95t×100/N)                      1.84       0.96
Average Occurrences of Technical Target Words (W95t/T95t)               3.14       3.86
Lexical Learning Possibility Index over 95% Text Coverage
(LEPIX95t) ((T95t×W95t×100)/N)                                          12.9        6.7
Number of Tokens over 98% Text Coverage                                   12         52
Number of Types over 98% Text Coverage                                     8         37
98% Text Coverage Level = Lexical Level of the Text (LLT98)              09K        12K
Number of Technical Word Tokens over 98% Text Coverage                     7          9
Number of Technical Word Types over 98% Text Coverage                      4          6
Number of Technical Target Word Tokens over 98% Text Coverage (W98t)       5          5
Number of Technical Target Word Types over 98% Text Coverage (T98t)        2          2
Density of Technical Target Words (%) (W98t×100/N)                      0.42       0.18
Average Occurrences of Technical Target Words (W98t/T98t)               2.50       2.50
Lexical Learning Possibility Index over 98% Text Coverage
(LEPIX98t) ((T98t×W98t×100)/N)                                          0.84       0.35
More Examples of Analysis
Indices, Text Coverage and Numbers of Tokens and Types for the Whole Text and the Target Words
* Data from passages in a textbook which are mostly authentic but slightly modified for advanced learners of Japanese
** Minimum Occurrences of Target Words over 95%/98% Text Coverage = 2
(Tok/Typ>95, Tok/Typ>98 = tokens/types over 95%/98% text coverage; Dens = density of target words (W×100/N, %); Avg = average occurrences of target words (W/T))

Text   N      Types  Tok>95 Typ>95 LLT95  W95   T95   Dens95 Avg95 LEPIX95  Tok>98 Typ>98 LLT98  W98    T98   Dens98 Avg98 LEPIX98
#5-1   504    226    26     24     07K    4     2     0.8    2.0   1.6      11     10     13K    2      1     0.4    2.0   0.4
#4-3   616    246    31     19     08K    14    2     2.3    7.0   4.5      13     13     10K    0      0     0.0    0.0   0.0
#3-1   959    358    50     37     08K    20    7     2.1    2.9   14.6     21     16     18K    8      3     0.8    2.7   2.5
#4-1   1055   296    53     43     04K    17    7     1.6    2.4   11.3     22     15     11K    11     4     1.0    2.8   4.2
#8-2   1092   282    64     39     05K    36    11    3.3    3.3   36.3     24     17     11K    12     5     1.1    2.4   5.5
#6-1   1193   250    60     24     04K    47    11    3.9    4.3   43.3     24     9      09K    19     4     1.6    4.8   6.4
#1-3   1210   335    61     37     06K    33    9     2.7    3.7   24.5     25     12     18K    15     2     1.2    7.5   2.5
#2-2   1317   409    68     48     09K    33    13    2.5    2.5   32.6     27     21     18K    10     4     0.8    2.5   3.0
#9-2   1416   383    71     53     06K    27    9     1.9    3.0   17.2     30     22     11K    11     3     0.8    3.7   2.3
#1-2   1418   406    71     51     07K    25    5     1.8    5.0   8.8      30     21     12K    11     2     0.8    5.5   1.6
#8-1   1455   400    80     32     10K    58    10    4.0    5.8   39.9     30     16     15K    18     4     1.2    4.5   4.9
#1-1   1592   540    83     73     06K    15    5     0.9    3.0   4.7      32     29     12K    4      1     0.3    4.0   0.3
#3-3   1717   530    86     82     07K    7     3     0.4    2.3   1.2      36     34     13K    3      1     0.2    3.0   0.2
#2-1   1785   560    91     62     08K    33    4     1.8    8.3   7.4      36     35     13K    2      1     0.1    2.0   0.1
#2-3   1959   528    99     67     07K    45    13    2.3    3.5   29.9     40     36     11K    8      4     0.4    2.0   1.6
#3-2   2035   621    102    84     07K    27    9     1.3    3.0   11.9     41     39     11K    4      2     0.2    2.0   0.4
#9-1   2241   533    113    77     06K    55    19    2.5    2.9   46.6     45     34     11K    20     9     0.9    2.2   8.0
#4-2   2342   535    118    82     05K    50    14    2.1    3.6   29.9     48     40     09K    14     6     0.6    2.3   3.6
#7-1   2361   555    120    68     06K    68    16    2.9    4.3   46.1     50     25     16K    30     5     1.3    6.0   6.4
#6-2   2823   690    142    87     08K    81    26    2.9    3.1   74.6     57     38     12K    31     12    1.1    2.6   13.2
#7-2   2964   628    149    71     08K    99    21    3.3    4.7   70.1     61     33     13K    39     11    1.3    3.5   14.5
#5-2   3754   849    227    138    06K    115   26    3.1    4.4   79.6     76     66     11K    18     8     0.5    2.3   3.8
#9-3   4344   923    297    138    07K    184   25    4.2    7.4   105.9    87     72     12K    23     8     0.5    2.9   4.2
M      1832.7 481.9  98.3   62.4   -      47.5  11.6  2.4    4.0   32.3     37.7   28.4   -      13.61  4.35  0.7    3.2   3.9
SD     928.0  180.8  60.1   31.0   -      40.1  7.3   1.0    1.6   27.6     18.5   15.9   -      9.92   3.23  0.4    1.6   3.8
How does the text length work for LEPIX?

[Two charts: Total Number of Tokens (text length), Total Number of Types, LEPIX95 and LEPIX98 plotted per text, one chart for the texts with 500-4000 running words and one for the texts with 1000-2000 running words.]

Correlation Coefficients (Pearson) between Total Number of Tokens/Types
and LEPIX, from Texts with 500-4000 Running Words

              Total Token   Total Type   LEPIX95     LEPIX98
Total Token   1             .956 ***     .837 ***    .438 *
Total Type    .956 ***      1            .685 ***    .271 n.s.
LEPIX95       .837 ***      .685 ***     1           .691 ***
LEPIX98       .438 *        .271 n.s.    .691 ***    1

***: p < .001, *: p < .05, n.s.: not significant

Correlation Coefficients (Pearson) between Total Number of Tokens/Types
and LEPIX, from Texts with 1000-2000 Running Words

              Total Token   Total Type   LEPIX95     LEPIX98
Total Token   1             .858 ***     .222 n.s.   .061 n.s.
Total Type    .858 ***      1            -.195 n.s.  -.372 n.s.
LEPIX95       .222 n.s.     -.195 n.s.   1           .877 ***
LEPIX98       .061 n.s.     -.372 n.s.   .877 ***    1

***: p < .001, n.s.: not significant

→ Over the full 500-4000 word range, LEPIX is strongly correlated with text length; within the narrower 1000-2000 word band, the correlation disappears.
LEPIX values cannot be compared when one text is more than double the length of the other.
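The correlation check can be reproduced in a few lines, assuming SciPy is available (any statistics package would do); the two lists below are the N and LEPIX95 columns of the 23-text table above, and the exact selection of the 1000-2000 word band is an assumption.

```python
from scipy.stats import pearsonr

# Text lengths (N) and LEPIX95 for the 23 texts analysed above
lengths = [504, 616, 959, 1055, 1092, 1193, 1210, 1317, 1416, 1418, 1455,
           1592, 1717, 1785, 1959, 2035, 2241, 2342, 2361, 2823, 2964,
           3754, 4344]
lepix95 = [1.6, 4.5, 14.6, 11.3, 36.3, 43.3, 24.5, 32.6, 17.2, 8.8, 39.9,
           4.7, 1.2, 7.4, 29.9, 11.9, 46.6, 29.9, 46.1, 74.6, 70.1, 79.6,
           105.9]

# Full 500-4000 word range (cf. r = .837 *** reported above)
r, p = pearsonr(lengths, lepix95)
print(f"all texts:       r = {r:.3f}, p = {p:.5f}")

# Restricting to a 1000-2000 word band weakens the correlation
band = [(n, x) for n, x in zip(lengths, lepix95) if 1000 <= n <= 2000]
r2, p2 = pearsonr([n for n, _ in band], [x for _, x in band])
print(f"1000-2000 words: r = {r2:.3f}, p = {p2:.5f}")
```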
Remaining Issues
• If a repeatedly used essential key word in the
text is at the lowest frequency level, the index
does not work well. → There are solutions for
this, but they make the procedure/calculation
more complicated.
• The minimum occurrence level of target words will
differ according to the text length. Twice will be
enough for a short instructional text, but the
appropriate level is not clear for a longer
extensive reading text.
• The indices still need to be validated through
empirical studies.
Conclusion = Main Points
The simplest way to rewrite a reading text (of
2,000 words or fewer) into a better resource for
vocabulary learning:
I. Delete the one-timers (or the words occurring
fewer times than the set level) at the
lowest frequency levels in the text, or
II. make them occur more often in the text by adding
words or replacing other words with the one-timer.
→ The index (LEPIX) figure will improve.
• These methods make it possible to predict and
compare the efficiency of second language
vocabulary learning with a reading text.
References
Anthony, L. (2009). AntWordProfiler 1.200w program.
Downloaded from http://www.antlab.sci.waseda.ac.jp/software.html
Cobb, T. (2007). Computing the vocabulary demands of L2 reading. Language Learning and Technology,
11(3), 38-63.
Ghadirian, S. (2002). Providing controlled exposure to target vocabulary through the screening and
arranging of texts. Language Learning and Technology, 6(1), 147-164.
Hu, M., & Nation, I. S. P. (2000). Unknown vocabulary density and reading comprehension. Reading in a Foreign
Language, 13(1), 403-430.
Juilland, A., & Chang-Rodrigues, E. (1964). Frequency Dictionary of Spanish Words. London: Mouton & Co.
Laufer, B. (1994). The lexical profile of second language writing: does it change over time? RELC Journal,
25(2), 21-33.
Matsushita, T. (松下達彦). (2010). 日本語を読むために必要な語彙とは? -書籍とインターネットの大
規模コーパスに基づく語彙リストの作成- [What words are essential to read Japanese? Making word
lists from a large corpus of books and internet forum sites]. 2010年度日本語教育学会春季大会予稿
集 [Proceedings of the Conference of the Society for Teaching Japanese as a Foreign Language, Spring
2010], 335-336.
Matsushita, T. (松下達彦). (2011). 日本語を読むための語彙データベース (The Vocabulary Database for
Reading Japanese) Ver. 1.01. Downloaded from http://www.geocities.jp/tatsum2003/
Nation, I. S. P., & Deweerdt, J. (2001). A defence of simplification. Prospect, 16(3), 55-67.
Waring, R., & Takaki, M. (2003). At what rate do learners learn and retain new vocabulary from reading a
graded reader? Reading in a Foreign Language, 15(2), 130-163.