School of Computing
FACULTY OF ENGINEERING

Linguistically Informed and Corpus Informed Morphological Analysis of Arabic
Majdi Sawalha & Eric Atwell
School of Computing, University of Leeds, Leeds, LS2 9JT, UK
[email protected], [email protected]

Outline
Introduction
Arabic Morphological Analyzers
  • Arabic Corpora & Lexicons
  • Analytical Study of Tri-literal Roots of Arabic
  • Specifications of the Morphological Analyzer
Morphological Features of Arabic Words and Tag Set
Evaluation and Results
  • Gold Standard for Evaluation
  • Morphochallenge 2009 Qur’an Gold Standard

Introduction
Methodologies for developing a robust Arabic morphological analyzer
Roots, stems, patterns and affixes are prestored. Grammar and linguistic information is encoded in the analyzers.
• Syllable-based Morphology (SBM)
• Root-Pattern Methodology
• Lexeme-based Morphology
• Stem-based Arabic lexicon with grammar and lexis specifications
• Using tagged corpora and computer algorithms to build a morphological database of the tagged words

Arabic Morphological Analyzers
Buckwalter Morphological Analyzer
  • Uses pre-stored dictionaries of words, stems and affixes constructed manually.
Khoja’s Stemmer
  • Removes the longest prefix and suffix of the word.
  • Matches the processed word against lists of noun and verb patterns to extract the correct root of the word.
Al-Shalabi et al.
  • Depends on mathematical calculations of weights assigned to the letters of the word.
  • The algorithm selects the letters with lower weights as root letters.

Comparative Evaluation of Arabic Morphological Analyzers
Studying freely available morphological analyzers and stemmers.
Developing a gold standard for evaluation.
Results:
• More work is needed on the development of morphological analysis of Arabic.
• We cannot rely on such analyzers for further analysis such as part-of-speech tagging and parsing.
Arabic Corpora
The Qur’an
  • 78,000 tokens, 19,000 vowelized word types, 15,000 non-vowelized word types.
The Corpus of Contemporary Arabic (CCA)
  • A Modern Standard Arabic text corpus of 1 million words.
The Penn Arabic Treebank
  • 734 files, 166,000 words of written Modern Standard Arabic.
The text of 15 traditional Arabic lexicons as corpora
  • About 11 million words and 2 million word types of both modern and classical Arabic text.

Arabic Lexicons
Methodologies of ordering lexical entries in the Arabic lexicons
• Al-Khalil methodology (lists the lexical entries based on the pronunciation of the letters, starting from the farthest in the mouth to the nearest)
• Abi Obaid methodology (lists the lexical entries based on similarity in meaning)
• Al-Jawhari methodology (lists the lexical entries based on the last letter of the word)
• Al Barmaki methodology (lists the lexical entries alphabetically)

Arabic Lexicons
A sample of Arabic lexicon:

كتب: الكِتاب: معروف، والجمع كُتُب وكُتْب. كَتَبَ الشيءَ يَكْتُبه كَتْباً وكِتاباً وكِتابةً، وكَتَّبَه: خَطَّه؛ قال أَبو النجم: أَقْبَلْتُ من عِنْدِ زيادٍ كالخَرِفْ، تَخُطُّ رِجْلايَ بخَطٍّ مُخْتَلِفْ، تُكَتِّبانِ في الطَّريقِ لامَ أَلِفْ. قال: ورأَيت في بعض النسخِ تِكِتِّبانِ، بكسر التاء، وهي لغة بَهْرَاء، يَكْسِرون التاء، فيقولون: تِعْلَمون ...

k t b: [Alkitab] the book; it is well known. The plural forms are [kutubun] and [kutbun]. [kataba Alshay’] he wrote something; [yaktubuhu] the action of writing something. [katban], [kitaban] and [kitabatan] mean the art of writing. And [kattabahu], writing it, means to draw it up. Abu Al-Najim said: I returned from Ziyad’s place [after meeting him] like a senile man, my legs drawing different drawings (meaning walking in a different way); they wrote [tukattibani] on the road the letters Lam Alif (describing how he was walking crazily and in a different way).
He said: I saw in a different version the word “they wrote” as [tikittibani], with the short vowel kasrah on the first letter [taa], as it is used in the dialect of Bahraa’ [an Arab tribe]. They say: [ti’lamuwn] (you know).

A sample of Arabic-English Dictionary by Edward Lane

Analytical Study of Tri-literal Roots of Arabic
Tri-literal roots were classified into 3 main groups and 22 detailed groups.
Experiment 1: Qur’an words derived from tri-literal roots were analyzed (45,534 words and 1,610 tri-literal roots).

Qur’an tokens
Intact | 61.06%
Defective | 32.12%
Compound | 6.82%

Tri-literal roots of Qur’an
Intact | 1097 | 68.14%
Defective | 468 | 29.07%
Compound | 45 | 2.80%

Analytical Study of Tri-literal Roots of Arabic
Experiment 2: Word types of a broad-lexical resource constructed by analyzing 15 Arabic lexicons, which contains 376,167 word types.

Word types of broad-lexical resource
Intact | 68.25%
Defective | 29.42%
Compound | 2.33%

Roots of broad-lexical resource
Intact | 5368 | 63.14%
Defective | 2825 | 33.23%
Compound | 309 | 3.63%

Specifications of the Morphological Analyzers - Inputs
Input: single words or text (fully vowelized, partially vowelized, or non-vowelized)
Tokenization: Arabic word, number, currency or punctuation mark.
Processing Arabic words:
• Resolving a doubled letter marked with Shaddah: وَصَّى waS~aY → وَصْصَى waSoSaY
• Resolving the Extension (maddah): آمَنوا |manuwA → ءامَنوا ‘AmanuwA
Only one short vowel might appear on any letter of the Arabic word.

Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
وصصى waSoSaY | w | a | S | o | S | a | Y | - | | | |
ءامنوا ‘AmanuwA | ‘ | - | A | - | m | a | n | u | w | - | A | -

Stop Words (Unambiguous Words)
A stop word has only one morphological analysis wherever it appears in the text.
About 40% of the tokens of any text are stop words.
The system contains a list of 1,368 stop words.
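The shaddah and maddah resolution steps listed under Inputs can be sketched in a few lines. This is only an illustration that works on the transliterated forms shown in the examples (where `~` marks shaddah, `o` marks sukun and `|` marks alif maddah); the function names are hypothetical, and the real analyzer presumably operates on Arabic script rather than on transliteration.

```python
import re

# Transliteration symbols treated as short vowels / sukun / tanween (assumed set).
SHORT_VOWELS = set("aiuoFNK")

def resolve_shaddah(word):
    """Expand a doubled letter: X~ becomes X + sukun (o) + X."""
    return re.sub(r"(.)~", r"\1o\1", word)

def resolve_maddah(word):
    """Expand alif maddah (|) into hamza followed by alif.
    An ASCII apostrophe stands in for the hamza symbol here."""
    return word.replace("|", "'A")

def normalize(word):
    """Apply both resolution steps."""
    return resolve_maddah(resolve_shaddah(word))

def letter_vowel_slots(word):
    """Pair every letter with at most one short vowel ('-' if none),
    mirroring the alternating letter/vowel positions in the table above."""
    slots = []
    for ch in word:
        if ch in SHORT_VOWELS and slots and slots[-1][1] == "-":
            slots[-1] = (slots[-1][0], ch)   # attach vowel to the previous letter
        else:
            slots.append((ch, "-"))          # new letter with an empty vowel slot
    return slots

print(normalize("waS~aY"))
print(normalize("|manuwA"))
print(letter_vowel_slots(normalize("waS~aY")))
```

Keeping one vowel slot per letter makes partially vowelized and non-vowelized input look the same to later stages: missing vowels simply stay as `-`.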
• Personal pronouns: أنا “>nA” I, هي “hy” she
• Relative pronouns: الذي “Al*y” who (sm), التي “Alty” who (sf)
• Demonstrative pronouns: هذا “h*A” this (sm), هذه “h*h” this (sf)
• Prepositions: في “fy” in, على “ElY” on, إلى “<lY” to

Clitics, Prefixes and Suffixes
Proclitics, prefixes, suffixes and enclitics were collected from traditional Arabic grammar books.
Clitics and affixes lists were checked using four Arabic corpora:
• The Qur’an
• The Corpus of Contemporary Arabic (CCA)
• The Penn Arabic Treebank
• The text of the 15 traditional Arabic lexicons as a corpus

Clitics, Prefixes and Suffixes
215 Proclitics & Prefixes
Prefix | Example | P1 | Tag | P2 | Tag | P3 | Tag
فست fst | فستذكرون fst*krwn | ف f | p--t--------------- | س s | p--i--------------- | ت t | r---s-nus---------
وال wAl | والسماء wAlsmA’ | و w | p--t--------------- | ال Al | r---d-----d------- | |

127 Suffixes & Enclitics
Suffix | Example | P1 | Tag | P2 | Tag | P3 | Tag
تموهما tmwhA | اورثتموها >wrvtmwhA | تم tm | r---&-mps??----h--- | و w | r---l-mp-n?----?--- | هما hmA | r---&-ndt??----h---
يون ywn | الحواريون AlHwArywn | ي y | r---j-------------- | ون wn | r---l-mp-n?----?--- | |

Clitics, Prefixes and Suffixes
Words are divided into three parts of different sizes.
The first part is looked up in the proclitics & prefixes list.
The third part is looked up in the suffixes & enclitics list.

Analyzed word: يعملون yaEomaluwna
First part | Second part | Third part | Prefixes & suffixes analysis
(none) | يعملون yEmlwn | (none) | Candidate analysis
ي y | عمل Eml | ون wn | Candidate analysis
(none) | يعملو yEmlw | ن n | Not accepted
يع yE | م m | لون lwn | Not accepted

Root or Stem
The system uses a list of about 12,000 roots extracted by analyzing 15 traditional Arabic language lexicons.
The second part of the word is looked up in the root list.
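The three-part split and the root lookup described above can be sketched as follows. The tiny `PREFIXES`, `SUFFIXES` and `ROOTS` sets are hypothetical stand-ins for the system's lists of 215 proclitics and prefixes, 127 suffixes and enclitics, and about 12,000 roots, and the function names are illustrative only; this is a minimal sketch, not the analyzer's actual implementation.

```python
# Toy lists in transliteration (assumed for illustration; "" means "no affix").
PREFIXES = {"", "y"}
SUFFIXES = {"", "n", "wn"}
ROOTS = {"Eml"}

def candidate_splits(word):
    """Return every (first, second, third) split whose first part is a known
    prefix and whose third part is a known suffix; the second part is non-empty."""
    candidates = []
    for i in range(len(word)):                 # i = end of the first part
        for j in range(i + 1, len(word) + 1):  # j = end of the second part
            first, second, third = word[:i], word[i:j], word[j:]
            if first in PREFIXES and third in SUFFIXES:
                candidates.append((first, second, third))
    return candidates

def accepted_splits(word):
    """Keep only the candidates whose second part is found in the root list."""
    return [split for split in candidate_splits(word) if split[1] in ROOTS]

for split in candidate_splits("yEmlwn"):
    status = "accepted" if split[1] in ROOTS else "not accepted"
    print(split, status)
```

With these toy lists, the affix check alone leaves several candidates for `yEmlwn`, and only the root lookup narrows them down, which is the point of searching the second part separately.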
Analyzed word: يَعْمَلُونَ yaEomaluwna
First part | Second part | Third part | Affixes analysis | Affixes and root analysis
(none) | يعملو yEmlw | ن n | Candidate analysis | Not accepted analysis
(none) | يعمل yEml | ون wn | Candidate analysis | Not accepted analysis
ي y | عملو Emlw | ن n | Candidate analysis | Not accepted analysis
ي y | عمل Eml | ون wn | Candidate analysis | Accepted analysis

Word Pattern
Different words are derived from their roots using certain patterns.
Derived words inherit the morphological features of their derivation patterns.
The system has a list of patterns extracted from traditional Arabic language grammar books.
• 2730 verb patterns
• 985 noun patterns
• Morphological feature POS tags are assigned to each pattern in the list.
• Patterns are fully vowelized.

Verb pattern | POS tag
فعلت faEalotu | v-p---nsf---an?-st?-
فعلنا faEalonaA | v-p---npf---an?-st?-
فعلت faEalota | v-p---mss---an?-st?-

Noun pattern | POS tag
أفعلاوى >ufoEulAwaY | nw----??-??----?qt-?
افعيلال AifoEiylAl | nw----??-??----?qt-?
فاعولاء fAEuwlA’ | nw----??-??----?qt-?

Pattern Matching Algorithms
First algorithm: depends on the word and its root as inputs.
• The root letters of the word are replaced by the letters (Fa’, Ain, Lam, [Lam]) (ف، ع، ل، [ل]). Replacing the root letters is not an easy task!
Second algorithm: depends on a pre-stored list of patterns.
• Searches the pattern list for patterns of the same size as the analyzed word, after removing its affixes.
• E.g. the word كتب ktb matches the following patterns:
  فَعْل faEol, فَعَل faEal, فَعُل faEul, فَعِل faEil, فُعْل fuEol, فُعَل fuEal, فُعُل fuEul, فُعِل fuEil, فِعْل fiEol, فِعِل fiEil
• Replaces the letters of the pattern corresponding to (Fa’, Ain, Lam, [Lam]) (ف، ع، ل، [ل]) with the letters of the word.

Word Pattern: The second algorithm (Example)
Analyzed word: يعملون yaEomaluwna
Matched pattern | Tag
يفعلون yafoEuluwna | v-c---mpt--ian?-st?
يفعلون yafoEiluwna | v-c---mpt--ian?-st?
يفعلون yafoEaluwna | v-c---mpt--ian?-st?
يفعلون yufoEiluwna | v-c---mpt--ipn?-at?
يفعلون yufoEaluwna | v-c---mpt--ipn?-tt?

Vowelization
Helps in determining some morphological features of the words.
Analyzed word: كتب ktb
Pattern | Vowelization
فَعْل faEol | كَتْب katob
فَعَل faEal | كَتَب katab
فَعُل faEul | كَتُب katub
فَعِل faEil | كَتِب katib
فُعْل fuEol | كُتْب kutob
فُعَل fuEal | كُتَب kutab
فُعُل fuEul | كُتُب kutub
فُعِل fuEil | كُتِب kutib
فِعْل fiEol | كِتْب kitob
فِعِل fiEil | كِتِب kitib

Morphological Features of Arabic Words and Tag Set
http://www.comp.leeds.ac.uk/sawalha/tagset
The Part-of-Speech Tag Set is designed following the traditional grammar classifications.
The Tag Set has 22 morphological features of Arabic words.
The tag consists of 22 characters. E.g.
• v at the first position indicates verb; n at the second position indicates proper name. At the seventh position, m indicates masculine and f indicates feminine.
• “-” is used if a certain feature is not applicable to the tagged word.
• “?” is used if a certain feature applies to the word but its value is not available at the moment, or the automatic tagger could not guess it.
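To make the second pattern-matching algorithm, the vowelization step and the positional tag more concrete, here is a minimal sketch over the transliterated examples from the slides. It assumes that a pattern matches when its unvowelized skeleton has the same length as the word and its non-root letters agree with the word; the function names, the diacritic set and the way tags are read are illustrative assumptions, and the tag strings are used exactly as they appear in the examples.

```python
DIACRITICS = set("aiuoFNK~")   # short vowels, sukun, tanween, shaddah (transliteration)
ROOT_SLOTS = set("fEl")        # pattern letters Fa', Ain, Lam

def skeleton(pattern):
    """Pattern letters with the diacritics removed."""
    return [ch for ch in pattern if ch not in DIACRITICS]

def matches(word, pattern):
    """Same length, and every non-root pattern letter equals the word letter."""
    skel = skeleton(pattern)
    return len(skel) == len(word) and all(
        p in ROOT_SLOTS or p == w for p, w in zip(skel, word))

def vowelize(word, pattern):
    """Copy the pattern's diacritics onto the letters of the word."""
    letters = iter(word)
    return "".join(ch if ch in DIACRITICS else next(letters) for ch in pattern)

def match_patterns(word, patterns):
    """Return (pattern, vowelized word) for every matching pattern."""
    return [(p, vowelize(word, p)) for p in patterns if matches(word, p)]

def read_tag(tag):
    """Read a few features by position; only values named in the slides are decoded."""
    main_pos = {"v": "verb"}
    gender = {"m": "masculine", "f": "feminine"}
    return {
        "main_pos": main_pos.get(tag[0], tag[0]),   # position 1
        "gender": gender.get(tag[6], tag[6]),       # position 7
        "number": tag[7],                           # position 8, left undecoded
        "person": tag[8],                           # position 9, left undecoded
    }

KTB_PATTERNS = ["faEol", "faEal", "faEul", "faEil", "fuEol",
                "fuEal", "fuEul", "fuEil", "fiEol", "fiEil"]
print(match_patterns("ktb", KTB_PATTERNS))
print(match_patterns("yEmlwn", ["yafoEaluwna", "faEal"]))
print(read_tag("v-c---mpt--ian?-st?"))
```

Because each pattern carries its own tag, selecting a pattern also selects the morphological features of the word, which is why a non-vowelized word such as ktb stays ambiguous until context or vowels rule patterns out.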
Morphological Features of Arabic Words and Tag Set
http://www.comp.leeds.ac.uk/sawalha/tagset

Position | Morphological feature category | Arabic name
1 | Main POS | أقسام الكلام الرئيسية
2 | POS of Noun | أقسام فرعية (الاسم)
3 | POS of Verb | أقسام فرعية (الفعل)
4 | POS of Particle | أقسام فرعية (الحرف)
5 | Residuals | أقسام فرعية (أخرى)
6 | Punctuations | علامات الترقيم
7 | Gender | الجنس
8 | Number | العدد
9 | Person | الشخص
10 | Morphology | الصرف
11 | Case and Mood | الحالة الإعرابية للاسم أو الفعل
12 | Case and Mood marks | علامة الإعراب أو البناء
13 | Definiteness | المعرفة والنكرة
14 | Voice | المبني للمعلوم والمبني للمجهول
15 | Emphasize | المؤكد وغير المؤكد
16 | Transitivity | اللازم والمتعدي
17 | Humanness | العاقل وغير العاقل
18 | Variability & Conjugation | التصريف
19 | Augmented & Unaugmented | المجرد والمزيد
20 | Root letters | عدد أحرف الجذر
21 | Verb Internal Structure | بنية الفعل
22 | Noun finals | أقسام الاسم تبعا للفظ آخره

Morphological Features of Arabic Words and Tag Set
Sample of a tagged document using the morphological feature Tag Set:

وَوَصَّيْنَا الْإِنسَانَ بِوَالِدَيْهِ حُسْنًا
We have recommended that a person must take good care of their parents.

Word | Transliteration | Gloss | Tag
وَ | wa | And | p--t-----a---------
وَصَّيْ | waS~ayo | Recommended | v-p---npf--iano-at&
نَا | naA | We | p--&---p-n---------
الإِنسَان | Alo<insaAn | a human | nq----np-ad-----bt-
بِ | bi | to | p--r-----g---------
وَالِدَيْ | waAlidayo | parents | nw----nd-gd-----at-
هِ | hi | his | p--&-----g---------

Evaluation and Results: Gold Standard for Evaluation
Gold standards are used to evaluate and measure the actual accuracy of automatic systems.
To construct a gold standard for evaluation, we need to determine:
• The Problem Domain
  • Evaluating morphological analyzers and part-of-speech taggers.
• The Corpora
  • Corpora of different text domains, formats and genres of both vowelized and non-vowelized Arabic text.
  • Two versions of the Qur’an text: vowelized and non-vowelized.
  • The Corpus of Contemporary Arabic (Al-Sulaiti & Atwell, 2006).

Gold Standard for Evaluation
Gold Standard Format
• Each word of the gold standard is on its own line, with its morphological and part-of-speech information separated by tabs.
• Contains the root and the pattern information of the words.
• The gold standard will be stored in flat text files with Unicode UTF-8 encoding, or in XML.
Gold Standard Size
• It must be relatively large.
• It must cover most cases that morphological analyzers have to handle.
• It is measured by the number of words it contains.

Morphochallenge 2009 Gold Standard
http://www.cis.hut.fi/morphochallenge2009/
MorphoChallenge aims to develop an unsupervised morphological analyzer to be used for different languages including Arabic.
A gold standard of the Qur’an has been constructed to evaluate morphological analyzers in the Morphochallenge 2009 competition.
• Its size is 78,004 words.
• It contains the full morphological analysis for each word, according to the morphological analysis of the Qur’an in the tagged database of the Qur’an developed at the University of Haifa (Dror et al, 2004).
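A tab-separated gold standard of this kind could be read and used for scoring roughly as below. The column order (word, root, pattern, analysis), the `None` placeholder handling and the accuracy measure are assumptions made for this sketch, not the documented file layout or the official Morphochallenge metric; the two sample lines reuse values from the Qur’an sample in transliteration.

```python
from typing import Dict, List, NamedTuple, Optional

class GoldEntry(NamedTuple):
    word: str
    root: Optional[str]
    pattern: Optional[str]
    morphemes: List[List[str]]   # each morpheme as [form, feature, feature, ...]

def parse_line(line: str) -> GoldEntry:
    """Parse one hypothetical gold-standard line: word, root, pattern, analysis."""
    word, root, pattern, analysis = line.rstrip("\n").split("\t")
    morphemes = [m.strip().split("+") for m in analysis.split(",") if m.strip()]
    return GoldEntry(
        word,
        None if root == "None" else root,
        None if pattern == "None" else pattern,
        morphemes,
    )

def root_accuracy(gold: List[GoldEntry], predicted: Dict[str, Optional[str]]) -> float:
    """Fraction of gold words whose predicted root equals the gold root."""
    correct = sum(1 for entry in gold if predicted.get(entry.word) == entry.root)
    return correct / len(gold)

SAMPLE = [
    "bsm\tsm\tNone\tb+Prep , sm+Noun+Triptotic+Sg+Masc+Gen",
    "AlrHym\trHm\tfEyl\trHym+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def",
]
gold = [parse_line(line) for line in SAMPLE]
print(gold[0].morphemes)
print(root_accuracy(gold, {"bsm": "sm", "AlrHym": "rHm"}))
```

Scoring roots alone is the simplest comparison; the same loop could compare patterns or full feature lists if an analyzer outputs them.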
Morphochallenge 2009 Qur’an Gold Standard

Vowelized Arabic script: بِسْمِ اللَّهِ الرَّحْمـنِ الرَّحِيمِ
Word | Root | Pattern | Analysis
بِسْمِ | سم | None | ب+Prep , سم+Noun+Triptotic+Sg+Masc+Gen
اللَّهِ | None | None | لَله+Noun+ProperName+Gen+Def
الرَّحْمـنِ | رحم | فَعْلَان | رَحْمَان+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
الرَّحِيمِ | رحم | فَعِيل | رَحِيم+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def

Non-vowelized Arabic script: بسم الله الرحمـن الرحيم
Word | Root | Pattern | Analysis
بسم | سم | None | ب+Prep , سم+Noun+Triptotic+Sg+Masc+Gen
الله | None | None | لَله+Noun+ProperName+Gen+Def
الرحمـن | رحم | فعلان | رحمان+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
الرحيم | رحم | فعيل | رحيم+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def

Vowelized transliteration: bisomi All~hi Alr~aHom_ani Alr~aHiymi
Word | Root | Pattern | Analysis
bisomi | sm | None | b+Prep , sm+Noun+Triptotic+Sg+Masc+Gen
All~hi | None | None | llaah+Noun+ProperName+Gen+Def
Alr~aHom_ani | rHm | faElaAn | raHmaan+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
Alr~aHiymi | rHm | faEiyl | raHiim+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def

Non-vowelized transliteration: bsm Allh AlrHm_n AlrHym
Word | Root | Pattern | Analysis
bsm | sm | None | b+Prep , sm+Noun+Triptotic+Sg+Masc+Gen
Allh | None | None | llAh+Noun+ProperName+Gen+Def
AlrHm_n | rHm | fElAn | rHmAn+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
AlrHym | rHm | fEyl | rHym+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def

Thank you!
Questions?