School of Computing
FACULTY OF ENGINEERING
Linguistically Informed and Corpus Informed Morphological Analysis of Arabic
Majdi Sawalha & Eric Atwell
School of Computing, University of Leeds, Leeds, LS2 9JT, UK
Outline
Introduction
Arabic Morphological Analyzers
• Arabic Corpora & Lexicons
• Analytical Study of Tri-literal Roots of Arabic
• Specifications of the Morphological Analyzer
Morphological Features of Arabic Words and Tag Set
Evaluation and Results
• Gold Standard for Evaluation
• Morphochallenge 2009 Qur’an Gold Standard
Introduction
Methodologies for developing a robust Arabic morphological analyzer
Roots, stems, patterns and affixes are pre-stored. Grammar and linguistic information are encoded in the analyzers.
• Syllable-based Morphology (SBM)
• Root-Pattern Methodology
• Lexeme-based Morphology
• Stem-based Arabic lexicon with grammar and lexis specifications
• Using tagged corpora and computer algorithms to build a morphological database of the tagged words
Arabic Morphological Analyzers
Buckwalter Morphological Analyzer
• Uses manually constructed, pre-stored dictionaries of words, stems and affixes.
Khoja’s Stemmer
• Removes the longest prefix and suffix of the word.
• Matches the remaining word against lists of noun and verb patterns to extract the root of the word.
Al-Shalabi et al.
• Depends on mathematical calculations of weights assigned to the letters of the word.
• The algorithm selects the letters with lower weights as root letters.
Comparative Evaluation of Arabic Morphological Analyzers
Studying freely available morphological analyzers and stemmers.
Developing a gold standard for evaluation.
Results:
• More work is needed on the development of morphological analysis of Arabic.
• We cannot rely on such analyzers for further analysis such as part-of-speech tagging and parsing.
Arabic Corpora
The Qur’an
78,000 tokens, 19,000 vowelized word types, 15,000 non-vowelized word types.
The Corpus of Contemporary Arabic (CCA)
A Modern Standard Arabic text corpus consisting of 1 million words.
The Penn Arabic Treebank
734 files, 166,000 words of written Modern Standard Arabic.
The text of 15 traditional Arabic lexicons as corpora
About 11 million words and 2 million word types of both modern and classical Arabic text.
Arabic Lexicons
Methodologies of ordering lexical entries in the Arabic lexicons
• Al-Khalil methodology (listed the lexical entries by where their letters are pronounced, starting from the farthest point in the mouth to the nearest)
• Abi Obaid methodology (listed the lexical entries by similarity in meaning)
• Al-Jawhari methodology (listed the lexical entries by the last letter of the word)
• Al Barmaki methodology (listed the lexical entries alphabetically)
Arabic Lexicons
A sample of Arabic lexicon
كتب: الكِتابُ: معروف، والجمع كُتُبٌ وكُتْبٌ. كَتَبَ الشيءَ يَكْتُبه كَتْباً وكِتاباً وكِتابةً، وكَتَّبَه: خَطَّه؛ قال أَبو النجم: أَقْبَلْتُ من عِنْدِ زيادٍ كالخَرِفْ، تَخُطُّ رِجْلايَ بخَطٍّ مُخْتَلِفْ، تُكَتِّبانِ في الطَّريقِ لامَ أَلِفْ. قال: ورأَيت في بعض النسخِ تِكِتِّبانِ، بكسر التاء، وهي لغة بَهْرَاءَ، يَكْسِرون التاء، فيقولون: تِعْلَمُونَ ...
k t b: [Alkitab] “the book” is well known. Its plural forms are [kutubun] and [kutbun]. [kataba Alshay’] means “he wrote something”, and [yaktubuhu] is the action of writing it. [katban], [kitaban] and [kitabatan] mean the act of writing. [kattabahu] means “he drew it up”. Abu Al-Najim said: “I came back from Ziyad’s place [after meeting him] like a senile man, my legs drawing different lines (meaning he walked unsteadily), writing [tukattibani] the letters Lam Alif on the road” (describing his erratic walk). He said: “In some copies I saw the word [tikittibani], with the short vowel kasrah on the first letter [taa’], as in the dialect of Bahraa’ [an Arab tribe]. They say [ti’lamuwn] (you know).”
[Image] A sample of the Arabic-English dictionary by Edward Lane
Analytical Study of Tri-literal Roots of Arabic
Tri-literal roots were classified into 3 main groups and 22 detailed groups.
Experiment 1: Qur’an words derived from tri-literal roots were analyzed (45,534 words and 1,610 tri-literal roots).
[Pie chart] Qur’an tokens: Intact 61.06%, Defective 32.12%, Compound 6.82%
[Pie chart] Tri-literal roots of Qur’an: Intact 1097 (68.14%), Defective 468 (29.07%), Compound 45 (2.80%)
Analytical Study of Tri-literal Roots of Arabic
Experiment 2: Word types of the broad-lexical resource constructed by analyzing 15 Arabic lexicons, which contains 376,167 word types.
[Pie chart] Word types of broad-lexical resource: Intact 68.25%, Defective 29.42%, Compound 2.33%
[Pie chart] Roots of broad-lexical resource: Intact 5368 (63.14%), Defective 2825 (33.23%), Compound 309 (3.63%)
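As a quick check on the root figures quoted for the two experiments, the percentages can be recomputed from the raw counts. This is only an arithmetic sketch; the dictionary and function names are illustrative and not part of the original system.

```python
# Root counts as quoted on the slides (Intact, Defective, Compound).
QURAN_ROOTS = {"Intact": 1097, "Defective": 468, "Compound": 45}
LEXICON_ROOTS = {"Intact": 5368, "Defective": 2825, "Compound": 309}


def shares(counts):
    """Return each category's share of the total as a percentage (2 decimals)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


quran = shares(QURAN_ROOTS)      # shares of the 1,610 Qur'an roots
lexicon = shares(LEXICON_ROOTS)  # shares of the roots in the lexicon resource
```

The recomputed shares agree with the percentages shown in the pie charts, and the Qur'an counts add up to the 1,610 tri-literal roots stated for Experiment 1.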
Specifications of the Morphological Analyzers - Inputs
Input: single words or text (fully vowelized, partially vowelized, or non-vowelized)
Tokenization: Arabic word, number, currency or punctuation mark.
Processing Arabic words:
• Resolving doubled letters marked with Shaddah: وَصَّى waS~aY → وَصْصَى waSoSaY
• Resolving the extension (Maddah): آمَنوا |manuwA → ءامَنوا 'AmanuwA
Only one short vowel may appear on any letter of the Arabic word.
Each letter occupies an odd position; the following even position holds its short vowel, or - if it has none.
Position            | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
وصصى waSoSaY       | w | a | S | o | S | a | Y | - |   |    |    |
ءامنوا 'AmanuwA     | ' | - | A | - | m | a | n | u | w | -  | A  | -
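The two normalisation steps and the letter/vowel position layout can be sketched on Buckwalter-transliterated input. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and it assumes `~` marks Shaddah, `|` marks Alef with Maddah, `o` is sukun and `'` is hamza, as in the slide's examples.

```python
SHORT_VOWELS = set("aiuoFNK")  # Buckwalter short vowels, sukun and tanween


def resolve_shaddah(word):
    """Expand 'C~' into 'CoC': the doubled letter is written twice,
    the first copy carrying sukun ('o')."""
    out = []
    for ch in word:
        if ch == "~" and out:
            out.append("o")
            out.append(out[-2])  # repeat the letter that carried the shaddah
        else:
            out.append(ch)
    return "".join(out)


def resolve_maddah(word):
    """Expand alef-with-maddah '|' into hamza followed by alef."""
    return word.replace("|", "'A")


def letter_vowel_slots(word):
    """Lay the word out as alternating (letter, short vowel) slots,
    using '-' when a letter carries no short vowel."""
    slots = []
    i = 0
    while i < len(word):
        slots.append(word[i])
        if i + 1 < len(word) and word[i + 1] in SHORT_VOWELS:
            slots.append(word[i + 1])
            i += 2
        else:
            slots.append("-")
            i += 1
    return slots


normalized = resolve_shaddah("waS~aY")   # the slide's Shaddah example
extended = resolve_maddah("|manuwA")     # the slide's Maddah example
```

Laying the normalised words out with `letter_vowel_slots` reproduces the 8-position and 12-position rows of the table.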
Stop Words (Unambiguous Words)
A stop word has only one morphological analysis wherever it appears in the text.
About 40% of the tokens in any text are stop words.
The system contains a list of 1,368 stop words.
Personal pronouns: أنا “>nA” I, هي “hy” she
Relative pronouns: الذي “Al*y” who (sm), التي “Alty” who (sf)
Demonstrative pronouns: هذا “h*A” this (sm), هذه “h*h” this (sf)
Prepositions: في “fy” in, على “ElY” on, إلى “<lY” to
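Because each stop word has exactly one analysis, handling them can be pictured as a plain table lookup done before any further processing. The sketch below uses only the nine examples listed above; the dictionary layout and function names are assumptions for illustration, and the real list has 1,368 entries with full morphological tags.

```python
# Hypothetical excerpt of the stop-word list, keyed by Buckwalter transliteration.
STOP_WORDS = {
    ">nA": ("personal pronoun", "I"),
    "hy": ("personal pronoun", "she"),
    "Al*y": ("relative pronoun", "who (sm)"),
    "Alty": ("relative pronoun", "who (sf)"),
    "h*A": ("demonstrative pronoun", "this (sm)"),
    "h*h": ("demonstrative pronoun", "this (sf)"),
    "fy": ("preposition", "in"),
    "ElY": ("preposition", "on"),
    "<lY": ("preposition", "to"),
}


def lookup_stop_word(token):
    """Return the single stored analysis, or None if the token is not a stop word."""
    return STOP_WORDS.get(token)


def stop_word_share(tokens):
    """Fraction of tokens that can be analysed by lookup alone."""
    if not tokens:
        return 0.0
    return sum(t in STOP_WORDS for t in tokens) / len(tokens)
```

Tokens for which the lookup returns `None` would go on to affix, root and pattern analysis.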
Clitics, Prefixes and Suffixes
Proclitics, prefixes, suffixes and enclitics were collected from traditional Arabic grammar books.
The clitic and affix lists were checked using four Arabic corpora:
• The Qur’an
• The Corpus of Contemporary Arabic (CCA)
• The Penn Arabic Treebank
• The text of the 15 traditional Arabic lexicons as a corpus
Clitics, Prefixes and Suffixes
215 Proclitics & Prefixes
Prefix | Example | P1 | Tag | P2 | Tag | P3 | Tag
فست fst | فستذكرون fst*krwn | ف f | p--t--------------- | س s | p--i--------------- | ت t | r---s-nus---------
وال wAl | والسماء wAlsmA’ | و w | p--t--------------- | ال Al | r---d-----d------- | |
127 Suffixes & Enclitics
Suffix | Example | P1 | Tag | P2 | Tag | P3 | Tag
تموهما tmwhA | اورثتموها >wrvtmwhA | تم tm | r---&-mps??----h--- | و w | r---l-mp-n?----?--- | هما hmA | r---&-ndt??----h---
يون ywn | الحواريون AlHwArywn | ي y | r---j-------------- | ون wn | r---l-mp-n?----?--- | |
Clitics, Prefixes and Suffixes
Words are divided into three parts of different sizes.
The first part is searched for in the proclitics & prefixes list.
The third part is searched for in the suffixes & enclitics list.
Analyzed Word: يعملون yaEomaluwna
First Part | Second Part | Third Part | Prefixes & Suffixes analyses
(none) | يعملون yEmlwn | (none) | Candidate analysis
ي y | عمل Eml | ون wn | Candidate analysis
(none) | يعملو yEmlw | ن n | Not accepted
يع yE | م m | لون lwn | Not accepted
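The three-part division can be sketched as an enumeration of every split of the word, keeping only splits whose first part is a known prefix and whose third part is a known suffix. The two tiny lists below are hypothetical stand-ins for the 215-entry and 127-entry lists, and the function name is illustrative.

```python
# Hypothetical excerpts of the affix lists; "" stands for "no affix".
PREFIXES = {"", "y"}
SUFFIXES = {"", "wn"}


def split_candidates(word):
    """Enumerate (first, second, third) splits whose first part is a listed
    prefix and whose third part is a listed suffix. The second part is
    never empty."""
    candidates = []
    n = len(word)
    for i in range(n):                 # end of the first part
        for j in range(i + 1, n + 1):  # end of the second part
            first, second, third = word[:i], word[i:j], word[j:]
            if first in PREFIXES and third in SUFFIXES:
                candidates.append((first, second, third))
    return candidates


candidates = split_candidates("yEmlwn")
```

With these sample lists the split y + Eml + wn is kept as a candidate, while yE + m + lwn is rejected because neither outer part is a listed affix.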
Root or Stem
The system uses a list of about 12,000 roots extracted by analyzing 15 traditional Arabic language lexicons.
The second part of the word is searched for in the root list.
Analyzed Word: يَعْمَلُونَ yaEomaluwna
First part | Second part | Third part | Affixes analyses | Affixes and Root analyses
(none) | يعملو yEmlw | ن n | Candidate analysis | Not accepted analysis
(none) | يعمل yEml | ون wn | Candidate analysis | Not accepted analysis
ي y | عملو Emlw | ن n | Candidate analysis | Not accepted analysis
ي y | عمل Eml | ون wn | Candidate analysis | Accepted Analysis
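The root check can be pictured as a filter over the affix candidates: a candidate survives only if its second part is found in the root list. The three-entry root set below is a hypothetical excerpt of the list of about 12,000 roots, and the function name is illustrative.

```python
# Hypothetical excerpt of the root list (Buckwalter transliteration).
ROOTS = {"Eml", "ktb", "rHm"}

# The four affix candidates shown for yaEomaluwna, as (first, second, third).
AFFIX_CANDIDATES = [
    ("", "yEmlw", "n"),
    ("", "yEml", "wn"),
    ("y", "Emlw", "n"),
    ("y", "Eml", "wn"),
]


def accept_by_root(candidates, roots=ROOTS):
    """Keep only the candidates whose second part is a listed root."""
    return [c for c in candidates if c[1] in roots]


accepted = accept_by_root(AFFIX_CANDIDATES)
```

Only the split y + Eml + wn survives, matching the single accepted analysis in the table.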
Word Pattern
Different words are derived from their roots using certain patterns.
Derived words inherit the morphological features of their derivation patterns.
The system has a list of patterns extracted from traditional Arabic language grammar books.
• 2730 verb patterns
• 985 noun patterns
• A morphological-feature POS tag is assigned to each pattern in the list.
• Patterns are fully vowelized.
Verb Patterns | POS Tag
فَعَلْتُ faEalotu | v-p---nsf---an?-st?
فَعَلْنَا faEalonaA | v-p---npf---an?-st?
فَعَلْتَ faEalota | v-p---mss---an?-st?-
Noun Patterns | POS Tag
أُفْعُلاوَى >ufoEulAwaY | nw----??-??----?qt-?
اِفْعِيلال AifoEiylAl | nw----??-??----?qt-?
فاعُولاء fAEuwlA’ | nw----??-??----?qt-?
Pattern Matching Algorithms
First algorithm: depends on the word and its root as inputs.
• The root letters of the word are replaced by the letters (Fa’, Ain, Lam, [Lam]) (ف، ع، ل، [ل]).
Replacing the root letters is not an easy task!
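A naive version of the first algorithm shows both the idea and why the replacement is hard. The sketch below substitutes root letters greedily from left to right in Buckwalter transliteration; it is an illustrative simplification, not the authors' algorithm, and the function name is hypothetical.

```python
SLOTS = "fEll"  # Fa', Ain, Lam, and a second Lam for four-letter roots


def word_to_pattern(word, root):
    """Replace the root letters of `word`, in order, by the pattern letters.
    Returns None when the root letters do not all occur in order."""
    if not 3 <= len(root) <= 4:
        return None
    out = []
    k = 0  # index of the next root letter to find
    for ch in word:
        if k < len(root) and ch == root[k]:
            out.append(SLOTS[k])
            k += 1
        else:
            out.append(ch)
    return "".join(out) if k == len(root) else None


pattern = word_to_pattern("yEmlwn", "Eml")    # regular case
dropped = word_to_pattern("yEd", "wEd")       # root letter missing from the word
ambiguous = word_to_pattern("ttbE", "tbE")    # prefix letter equals a root letter
```

The regular case works, but the greedy scan fails when a weak root letter is dropped from the surface word (`dropped` is `None`) and picks the wrong letter when an affix letter is identical to a root letter (`ambiguous` comes out as `ftEl` although the prefix is the first `t`, so the intended pattern is `tfEl`).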
Second algorithm: depends on a pre-stored list of patterns.
• Searches the pattern list for patterns of the same size as the analyzed word, after removing its affixes.
• E.g.: the word كتب ktb matches the following patterns:
فَعْل faEol, فَعَل faEal, فَعُل faEul, فَعِل faEil, فُعْل fuEol, فُعَل fuEal, فُعُل fuEul, فُعِل fuEil, فِعْل fiEol, فِعِل fiEil
• Replaces the letters of the word corresponding to the letters (Fa’, Ain, Lam, [Lam]) (ف، ع، ل، [ل]) of the pattern.
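The pattern-list search can be sketched as follows: strip the short vowels from each stored pattern, keep patterns of the same length as the word, require the non-slot letters to agree, and read the root off the slot positions. The pattern list is a small excerpt taken from the slides; the function names are hypothetical, and the sketch assumes the letters f, E and l occur in a pattern only as root slots.

```python
DIACRITICS = set("aiuo~FNK")  # Buckwalter short vowels, sukun, shaddah, tanween

# Fully vowelized patterns quoted on the slides.
PATTERNS = [
    "faEol", "faEal", "faEul", "faEil", "fuEol",
    "fuEal", "fuEul", "fuEil", "fiEol", "fiEil",
    "yafoEuluwna", "yafoEiluwna", "yafoEaluwna", "yufoEiluwna", "yufoEaluwna",
]


def strip_vowels(text):
    """Remove diacritics so a pattern can be compared with a bare word."""
    return "".join(c for c in text if c not in DIACRITICS)


def match_patterns(word, patterns=PATTERNS):
    """Return (pattern, root) pairs for every pattern the bare word fits."""
    matches = []
    for pattern in patterns:
        skeleton = strip_vowels(pattern)
        if len(skeleton) != len(word):
            continue
        root = []
        for w, p in zip(word, skeleton):
            if p in "fEl":
                root.append(w)   # slot letter: the word supplies a root letter
            elif w != p:
                break            # fixed pattern letter does not agree
        else:
            matches.append((pattern, "".join(root)))
    return matches


ktb_matches = match_patterns("ktb")
verb_matches = match_patterns("yEmlwn")
```

For `ktb` all ten three-letter patterns match, which is why vowelization and context are needed to choose among them; for `yEmlwn` the five verb patterns match with root `Eml`.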
Word Pattern: The second algorithm (Example)
Analyzed Word: يعملون yaEomaluwna
Matched Patterns | Tag
يَفْعُلون yafoEuluwna | v-c---mpt--ian?-st?
يَفْعِلون yafoEiluwna | v-c---mpt--ian?-st?
يَفْعَلون yafoEaluwna | v-c---mpt--ian?-st?
يُفْعِلون yufoEiluwna | v-c---mpt--ipn?-at?
يُفْعَلون yufoEaluwna | v-c---mpt--ipn?-tt?
Vowelization
Helps in determining some morphological features of the words.
Analyzed Word: كتب ktb
Pattern | Vowelization
فَعْل faEol | كَتْب katob
فَعَل faEal | كَتَب katab
فَعُل faEul | كَتُب katub
فَعِل faEil | كَتِب katib
فُعْل fuEol | كُتْب kutob
فُعَل fuEal | كُتَب kutab
فُعُل fuEul | كُتُب kutub
فُعِل fuEil | كُتِب kutib
فِعْل fiEol | كِتْب kitob
فِعِل fiEil | كِتِب kitib
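Vowelizing a bare word from a matched pattern amounts to copying the pattern's short vowels onto the word's own letters. A minimal sketch in Buckwalter transliteration, with a hypothetical function name; it assumes the pattern has already been matched, so it does not re-check that fixed pattern letters agree with the word.

```python
DIACRITICS = set("aiuo~FNK")  # Buckwalter short vowels, sukun, shaddah, tanween


def vowelize(word, pattern):
    """Copy the diacritics of a vowelized pattern onto a bare word.
    Each non-diacritic pattern character consumes one letter of the word.
    Returns None if the letter counts differ."""
    letters = iter(word)
    out = []
    try:
        for ch in pattern:
            out.append(ch if ch in DIACRITICS else next(letters))
    except StopIteration:
        return None              # pattern has more letters than the word
    if next(letters, None) is not None:
        return None              # word has more letters than the pattern
    return "".join(out)


forms = [vowelize("ktb", p) for p in ("faEol", "faEal", "faEul", "faEil")]
verb = vowelize("yEmlwn", "yafoEaluwna")
```

Each candidate vowelization inherits the POS tag of the pattern that produced it, which is how vowelization helps to determine morphological features.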
Morphological Features of Arabic Words and Tag Set
http://www.comp.leeds.ac.uk/sawalha/tagset
The part-of-speech tag set is designed following the traditional grammar classifications.
The tag set has 22 morphological features of Arabic words.
Each tag consists of 22 characters.
E.g.
• v at the first position indicates a verb; n at the second position indicates a proper name. At the seventh position, m indicates masculine and f indicates feminine.
• “-” is used if a feature is not applicable to the tagged word.
• “?” is used if a feature applies to the word but its value is not yet available or the automatic tagger could not guess it.
Morphological Features of Arabic Words and Tag Set
http://www.comp.leeds.ac.uk/sawalha/tagset
Position | Morphological Feature Category | Arabic
1 | Main POS | أقسام الكلام الرئيسية
2 | POS of Noun | أقسام فرعية (الاسم)
3 | POS of Verb | أقسام فرعية (الفعل)
4 | POS of Particle | أقسام فرعية (الحرف)
5 | Residuals | أقسام فرعية (أخرى)
6 | Punctuation | علامات الترقيم
7 | Gender | الجنس
8 | Number | العدد
9 | Person | الشخص
10 | Morphology | الصرف
11 | Case and Mood | الحالة الإعرابية للاسم أو الفعل
12 | Case and Mood marks | علامة الإعراب أو البناء
13 | Definiteness | المعرفة والنكرة
14 | Voice | المبني للمعلوم والمبني للمجهول
15 | Emphasis | المؤكد وغير المؤكد
16 | Transitivity | اللازم والمتعدي
17 | Humanness | العاقل وغير العاقل
18 | Variability & Conjugation | التصريف
19 | Augmented & Unaugmented | المجرد والمزيد
20 | Root letters | عدد أحرف الجذر
21 | Verb Internal Structure | بنية الفعل
22 | Noun finals | أقسام الاسم تبعاً للفظ آخره
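Reading a positional tag can be sketched as a walk over its characters, skipping “-” and flagging “?”. The sketch below decodes only the values the slides state explicitly (v at position 1, n at position 2, m and f at position 7); every other character is passed through uninterpreted, since the full value inventory is defined by the tag set itself. The names are hypothetical, and the sample tag is one of the tags quoted on the slides, which are shorter than 22 characters as extracted here.

```python
# Only the values stated explicitly on the slides, keyed by (position, char).
STATED_VALUES = {
    (1, "v"): "verb",
    (2, "n"): "proper name",
    (7, "m"): "masculine",
    (7, "f"): "feminine",
}


def describe_tag(tag):
    """Map 1-based positions to descriptions, skipping inapplicable features."""
    described = {}
    for position, ch in enumerate(tag, start=1):
        if ch == "-":
            continue  # feature not applicable to this word
        if ch == "?":
            described[position] = "applicable, value not yet known"
        else:
            # Fall back to the raw character when the slides give no gloss.
            described[position] = STATED_VALUES.get((position, ch), ch)
    return described


sample = describe_tag("v-p---mss---an?-st?")
```

Keeping the raw character for unglossed values avoids guessing at meanings the slides do not give.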
Morphological Features of Arabic Words and Tag Set
Sample of a tagged document using the morphological feature tag set
وَوَصَّيْنَا الْإِنسَانَ بِوَالِدَيْهِ حُسْنًا
We have instructed every person to take good care of their parents.
Word | Transliteration | Gloss | Tag
وَ | wa | And | p--t-----a---------
وَصَّيْ | waS~ayo | Recommended | v-p---npf--iano-at&
نَا | naA | We | p--&---p-n---------
الْإِنْسَانَ | Alo<insaAna | human | nq----np-ad-----bt-
بِ | bi | to | p--r-----g---------
وَالِدَيْ | waAlidayo | parents | nw----nd-gd-----at-
هِ | hi | his | p--&-----g---------
Evaluation and Results: Gold Standard for Evaluation
Gold standards are used to evaluate and measure the actual accuracy of automatic systems.
To construct a gold standard for evaluation, we need to determine:
• The Problem Domain
  • Evaluating morphological analyzers and part-of-speech taggers.
• The Corpora
  • Corpora of different text domains, formats and genres of both vowelized and non-vowelized Arabic text.
  • Two versions of the Qur’an text: vowelized and non-vowelized.
  • The Corpus of Contemporary Arabic (Al-Sulaiti & Atwell, 2006).
Gold Standard for Evaluation
Gold Standard Format
• Each word of the gold standard is on its own line, with its morphological and part-of-speech information separated by tabs.
• Contains the root and the pattern information of the words.
• The gold standard will be stored in flat text files using Unicode UTF-8 encoding, or in XML.
Gold Standard Size
• It must be relatively large.
• It should cover most cases that morphological analyzers have to handle.
• It is measured by the number of words it contains.
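A tab-separated gold standard of this kind can be read and used for scoring in a few lines. The column order (word, root, pattern) and the header row are assumptions made for this sketch; the four rows reuse the Qur'an words that appear in the Morphochallenge sample, and the function names are hypothetical.

```python
import csv
import io

# Hypothetical gold-standard excerpt: one word per line, tab-separated fields.
GOLD_TSV = (
    "word\troot\tpattern\n"
    "bsm\tsm\tNone\n"
    "Allh\tNone\tNone\n"
    "AlrHm_n\trHm\tfElAn\n"
    "AlrHym\trHm\tfEyl\n"
)


def read_gold(text):
    """Parse the tab-separated gold standard into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def root_accuracy(gold_rows, predict_root):
    """Share of gold words whose predicted root equals the gold root."""
    correct = sum(1 for row in gold_rows if predict_root(row["word"]) == row["root"])
    return correct / len(gold_rows)


gold = read_gold(GOLD_TSV)
# A toy "analyzer" that gets one of the four roots wrong.
toy_output = {"bsm": "sm", "Allh": "None", "AlrHm_n": "rHm", "AlrHym": "rHy"}
accuracy = root_accuracy(gold, toy_output.get)
```

The same loop extends to pattern or tag accuracy by comparing a different column.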
Morphochallenge 2009 Gold Standard
http://www.cis.hut.fi/morphochallenge2009/
MorphoChallenge aims to develop an unsupervised morphological analyzer to be used for different languages, including Arabic.
A gold standard of the Qur’an has been constructed for evaluating morphological analyzers in the Morphochallenge 2009 competition.
• Its size is 78,004 words.
• It contains the full morphological analysis for each word, according to the morphological analysis in the tagged database of the Qur’an developed at the University of Haifa (Dror et al., 2004).
Morphochallenge 2009 Qur’an Gold Standard
Word | Root | Pattern | Morphological analysis
Vowelized Arabic script:
بِسْمِ | سم | None | ب+Prep , سم+Noun+Triptotic+Sg+Masc+Gen
اللَّهِ | None | None | لَله+Noun+ProperName+Gen+Def
الرَّحْمـَنِ | رحم | فَعْلَان | رحْمَان+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
الرَّحِيمِ | رحم | فَعِيل | رَحِيم+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
Non-vowelized Arabic script:
بسم | سم | None | ب+Prep , سم+Noun+Triptotic+Sg+Masc+Gen
الله | None | None | لَله+Noun+ProperName+Gen+Def
الرحمـن | رحم | فعلان | رحمان+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
الرحيم | رحم | فعيل | رحيم+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
Vowelized transliteration:
bisomi | sm | None | b+Prep , sm+Noun+Triptotic+Sg+Masc+Gen
All~hi | None | None | llaah+Noun+ProperName+Gen+Def
Alr~aHom_ani | rHm | faElaAn | raHmaan+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
Alr~aHiymi | rHm | faEiyl | raHiim+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
Non-vowelized transliteration:
bsm | sm | None | b+Prep , sm+Noun+Triptotic+Sg+Masc+Gen
Allh | None | None | llAh+Noun+ProperName+Gen+Def
AlrHm_n | rHm | fElAn | rHmAn+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
AlrHym | rHm | fEyl | rHym+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
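The analysis strings in this gold standard follow a simple layout: morphemes separated by commas, and each morpheme followed by its features joined with “+”. A minimal parser for that layout, inferred from the sample rows only (the function name is hypothetical, and the real files may contain cases this sketch does not cover):

```python
def parse_analysis(analysis):
    """Split an analysis string into (morpheme, [features]) pairs."""
    morphemes = []
    for part in analysis.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate trailing separators such as "... ,"
        form, *features = part.split("+")
        morphemes.append((form, features))
    return morphemes


bsm = parse_analysis("b+Prep , sm+Noun+Triptotic+Sg+Masc+Gen ,")
rahman = parse_analysis("rHmAn+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def")
```

Here `bsm` splits into the preposition `b` and the noun `sm`, each with its own feature list, which is the unit a morpheme-level evaluation would compare.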
Thank you!
Questions?