Rule-based approach in Arabic NLP: Tools, Systems and
Prof. Khaled Shaalan
[email protected]
Faculty of Engineering & IT, The British University in
Dubai, UAE
Keynote Speak
Workshop in Plagiarism Detection in Arabic
Department of Computer Science, King
Abdulaziz University, Jeddah – KSA
1
Overview
Challenges in Arabic Computational Linguistics
Research Opportunities
2
“Computational linguistics is the study of
computer systems for understanding and
generating natural language”
“Computational linguistics is the scientific
study of language from a computational
perspective”
Translation: “اللسانيات الحاسوبية”
3
The following terms are used to refer to the same or closely related disciplines:
◦ Computational Linguistics (CL)
◦ Natural Language Processing (NLP)
◦ Natural Language Understanding (NLU)
◦ Language Engineering (LE)
◦ Human Language Technologies (HLT)
4
To implement a fully automatic computer system that
can understand and express itself in human language.
The dream: computers and robots that can converse with
us as in science-fiction films (Star Wars, Alien, Star
Trek, The Terminator, etc.), and yet we don’t see any
really robust or effective ones in 20DD.
5
You could tell the computer what you wanted and it
understood you?
◦ Tasks: information retrieval, question answering, dialogue systems,
etc.
You could dictate a letter to the computer, and it printed it and
then saved it as a file?
◦ Task: speech to text (speech recognition)
Having to mark essays or reports and wanting to locate
instances of plagiarism, you could ask the computer to
compare each suspicious document with likely similar
reports?
◦ Task: plagiarism detection (direct changes vs style changes)
6
Having no time to read a 100 page book, you could ask
the computer to summarise it for you and it produced a
one page summary in a few minutes
◦ Tasks: Automatic summarisation
You could ask the computer to translate for you a text
in Japanese which you did not understand?
◦ machine translation
7
You could ask your computer for legal advice in
ordinary language and it gave you the answer straight
away?
◦ information extraction, summarisation, text generation
When you feel unwell you could tell the computer what
your symptoms were and it could diagnose your
condition?
◦ information extraction, summarisation, text generation,
question answering, knowledge management.
8
Because natural language is extremely rich in form
and structure, and very ambiguous.
The degree varies from one language to another: Chinese
and Arabic are among the most difficult languages to
learn and process.
9
Can we formulate theorems for language?
◦ Each word has only one meaning.
◦ Each sentence can be interpreted in only one way.
◦ The utterances produced by humans are sincere.
◦ The size of the vocabulary is fixed, and there is a linear function
between each pair of languages, e.g. y = 2x as the relation between the
vocabulary sizes of English and Arabic.
10
Ambiguity is a fundamental problem of computational
linguistics
Resolving ambiguity, called disambiguation, is a
crucial goal
11
Arabic Script
Language in Use
Lack of Capitalization
Optional short vowels
Lack of uniformity in writing styles
Systematic spelling mistakes
Lack of resources
Agglutination
12
Genuine Arabic Script
Transliterated
Bismi Allāhi Ar-Raĥmāni Ar-Raĥīmi
In the name of Allah, the Beneficent,
the Merciful.
13
An Arabic word is defined as a string of characters
delimited by spaces.
Arabic is written right to left, and letters change shape with their position in the word:
Stand-alone: غ ش م ك ب ن
Initial: ﻏ ﺷ ﻣ ﻛ ﺑ ﻧ
Medial: ﻐ ﺸ ﻤ ﻜ ﺒ ﻨ
Final: ﻎ ﺶ ﻢ ﻚ ﺐ ﻦ
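The positional letter shapes also exist in Unicode as separate "presentation form" code points, which can break string matching if they leak into text. A minimal sketch, assuming Python's standard `unicodedata` module and a hypothetical helper name `fold_presentation_forms`, of folding them back to the base letters:

```python
import unicodedata


def fold_presentation_forms(text: str) -> str:
    """Map positional glyph variants (initial/medial/final) to base letters."""
    # NFKC applies compatibility decompositions, which is how Unicode
    # relates Arabic presentation forms to their base letters.
    return unicodedata.normalize("NFKC", text)


# Initial, medial and final presentation forms of the letter ghain.
shapes = ["\uFECF", "\uFED0", "\uFECE"]
base_letters = {fold_presentation_forms(shape) for shape in shapes}
```

All three shapes fold to the single base letter U+063A, so later lookups only need to know one form per letter.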
14
Buckwalter Encoding
Romanization
◦ One-to-one mapping
to Arabic script spelling
◦ Left-to-right
◦ Easy to learn/use
◦ Human & machine compatible
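The one-to-one property can be illustrated with a small transliterator. This is a sketch of the commonly published Buckwalter table, not code from the talk; the function names `to_buckwalter` and `from_buckwalter` are hypothetical, and characters outside the table are passed through unchanged.

```python
def _build_table():
    # Each block maps a run of consecutive Unicode code points to
    # the Buckwalter symbols in the same order.
    blocks = (
        (0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"),  # hamza .. ghain
        (0x0640, "_fqklmnhwYyFNKaui~o"),         # tatweel .. sukun
        (0x0670, "`{"),                          # dagger alif, alif wasla
    )
    return {
        chr(start + offset): symbol
        for start, symbols in blocks
        for offset, symbol in enumerate(symbols)
    }


AR2BW = _build_table()
BW2AR = {symbol: letter for letter, symbol in AR2BW.items()}


def to_buckwalter(text: str) -> str:
    """Transliterate Arabic script into Buckwalter symbols."""
    return "".join(AR2BW.get(ch, ch) for ch in text)


def from_buckwalter(text: str) -> str:
    """Map Buckwalter symbols back to Arabic script."""
    return "".join(BW2AR.get(ch, ch) for ch in text)
```

Because every Arabic character has exactly one symbol, a round trip through both functions returns the original spelling, which is what makes the encoding convenient for machines as well as for people.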
15
Social Media: Arabic characters are nowhere near
Latin-Based Characters.
16
Segmentation is a required preprocessing step because
Arabic allows fully concatenated phrases to be written:
“الدكتورمحمدوزيرالخارجية” (Dr-Mohammed-the-Minister-of-Foreign-Affairs).
Segmentation during analysis leads to ambiguity that
can be resolved by grammatical rules or normalization.
Word: وجد
  Stem readings: he found; and grandfather
  Possible analyses: Conj+Verb: و+جد; Conj+Noun: و+جد
Word: للغة
  Stem reading: for a language
  Possible analyses: ل+لغة; ل+اللغة
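The ambiguity in the segmentation table can be reproduced with a few lines of rule-based code. The sketch below is an assumption-laden toy: `PREFIXES`, `LEXICON` and `analyses` are hypothetical names, the stem list holds only the words from the table, and the orthographic rule that merges ل with the article ال is not modelled.

```python
PREFIXES = ["و", "ل", "ال"]      # conjunction, preposition, definite article
LEXICON = {"وجد", "جد", "لغة"}   # toy stem list (hypothetical)


def analyses(word, prefixes=()):
    """Return every reading of `word` as zero or more prefixes plus a known stem."""
    results = []
    if word in LEXICON:
        results.append(list(prefixes) + [word])
    for prefix in PREFIXES:
        # Try peeling one prefix off and analysing the remainder.
        if word.startswith(prefix) and len(word) > len(prefix):
            results.extend(analyses(word[len(prefix):], prefixes + (prefix,)))
    return results
```

For وجد the function returns two readings, the whole word as a stem and the split و+جد, which is exactly the ambiguity that grammatical rules then have to resolve.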
17
Classical (Quranic) Arabic
Modern Standard Arabic
Colloquial Arabic Dialects (Local spoken Arabic-no
linguistic rules, written in social media)
Classical Arabic, Modern Standard Arabic, Dialects
Words for "table" across these varieties:
مِنضَدة (mindada)
طاولة (ţawile)
طربيزة (ţarabēza)
ميدة (mida)
18
Non-casing language: Arabic writing systems do not
exhibit differences in orthographic case such as
capitalized initial letter which leads to ambiguity
(mainly with proper names).
Arabic word: meaning or function in a sentence
أشرف (Ashraf)
• A given name
• An inflected verb (he-supervised)
• A superlative (the-most-honorable)
مسقط رأسه بجدة (Maskat rAsah fi Jeddah)
• Literal reading: the falling of his head in grandfather/Jeddah
• MWE reading (مسقط رأسه = place of birth): his place of birth is Jeddah
This ambiguity can be resolved by analyzing the
context surrounding the ambiguous word(s).
19
Arabic uses diacritics to represent most vowels. They affect the
phonetic representation and give different meanings to the same
lexical form. Because they are usually omitted in writing, this
leads to unvocalized-to-vocalized, or orthographic, ambiguity.
Unvocalized Arabic word: possible meanings
قطر
• Country name (a location) if transliterated as Qatar
• the literal meaning of country
• radius (a measure) if transliterated as qutr, or
• the literal meaning of distill if transliterated as qat~ar.
كتب
• كَتَب katab, a verb, wrote, if we use the vowel (a)
• كُتُب kutub, plural noun, books, if we use the vowel (u)
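The collapse from vocalized to unvocalized forms is easy to demonstrate. A minimal sketch, assuming the short-vowel and related marks in the Unicode range U+064B to U+0652; the name `strip_diacritics` is hypothetical:

```python
import re

# Tanween, short vowels, shadda and sukun.
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(word: str) -> str:
    """Remove diacritics, producing the usual unvocalized spelling."""
    return DIACRITICS.sub("", word)


katab = "\u0643\u064E\u062A\u064E\u0628"  # "wrote", vocalized with fatha
kutub = "\u0643\u064F\u062A\u064F\u0628"  # "books", vocalized with damma
```

Both `katab` and `kutub` strip to the same three letters, so an analyzer that only sees the unvocalized string has to recover the intended reading from context.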
20
Word: يعد
Different interpretations and their lemmas:
(bring back) يُعِد, lemma أعاد
(return) يَعُد, lemma عاد
(promise) يَعِد, lemma وعد
(count) يَعُدّ, lemma عد
(prepare) يُعِدّ, lemma أعد
English: How are you?
Arabic translation:
كَيفَ حَالُكَ؟ Kaifa haloka (the suffix “a” indicates male gender ♂)
كَيفَ حَالُكِ؟ Kaifa haloki (the suffix “i” indicates female gender ♀)
Arabic has a high level of
transcriptional ambiguity
because it has more speech
sounds than Western
European languages: an
Arabic word can be
transliterated in a multitude
of ways.
23
Increase dictionary size: retain all versions of the name
variants in a dictionary with a possibility of linking
them together.
Increase processing time: normalize each occurrence of
the variant to a canonical form which requires a
mechanism (such as string distance calculation) for
name variant matching between a name variant and its
normalized representation.
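The second option, normalizing each variant to a canonical form, can be sketched with the standard library. This is an illustration only: the canonical list, the 0.8 cutoff and the name `canonical_name` are assumptions, and `difflib`'s similarity ratio stands in for whatever string distance a real system would use.

```python
import difflib

# Hypothetical canonical spellings kept in the dictionary.
CANONICAL_NAMES = ["Mohammed", "Khaled", "Ibrahim"]


def canonical_name(variant: str, cutoff: float = 0.8):
    """Return the closest canonical spelling, or None if nothing is close enough."""
    matches = difflib.get_close_matches(variant, CANONICAL_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

The cutoff controls the trade-off named on the slide: a lower value links more variants at the risk of merging distinct names, while a higher value leaves more variants unmatched and pushes the burden back onto the dictionary.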
24
Frequent typographic errors made by Arabic writers in
regard to certain characters.
◦ Hamza dropping: أ, إ → ا
◦ Undotted ta-marbuta: ة → ه
◦ Undotted final ya: ي → ى
Typo: المملكه العربيه السعوديه
Correct form: المملكة العربية السعودية (The Kingdom of Saudi Arabia)
Typo: ابوظبى
Correct form: أبوظبي (Abu Dhabi)
An edit-distance technique (shortest edit sequence) can be
used to recover some of the deleted or modified letters.
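Both ideas, folding the confusable letters and measuring edit distance, fit in a short sketch. The function names are hypothetical, the folding table covers only the three error classes listed above, and the distance is the standard Levenshtein measure rather than anything specific to the talk.

```python
# Fold each confusable pair to a single form:
# alif with hamza -> bare alif, ta-marbuta -> ha, dotted ya -> alif maqsura.
FOLD = str.maketrans({
    "\u0623": "\u0627",
    "\u0625": "\u0627",
    "\u0629": "\u0647",
    "\u064A": "\u0649",
})


def normalize(word: str) -> str:
    """Fold the three systematic spelling variations to one form."""
    return word.translate(FOLD)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


typo = "\u0627\u0628\u0648\u0638\u0628\u0649"     # Abu Dhabi with both typos
correct = "\u0623\u0628\u0648\u0638\u0628\u064A"  # Abu Dhabi, standard spelling
```

The typo is two substitutions away from the correct form, and after folding the two strings compare equal, which is why normalization is often applied before any dictionary lookup.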
25
Most resources were developed outside the Arab world and
focus on the news genre.
Dictionaries: lists of words along with their morphological
and orthographic features, meanings, and uses.
Annotated linguistic resources (manually annotated and
verified rich resources to train on and evaluate against)
◦ Corpus: a large collection of tagged documents.
◦ Treebank: a large collection of syntactic analyses of sentences.
26
<?xml version="1.0" encoding="UTF-8"?>
<file language="ar">
…وكان الحاكم العسكري الباكستاني الجنرال <person>برويز مشرف</person> 57
عاماً استعد في غير مناسبة إلى تقليص القوات العسكرية على الحدود
مع <location>الهند</location> وتوقيع اتفاقية عدم اعتداء ووقف سباق التسلح
في المنطقة، غير أن رفع موازنة <location>الهند</location>
</file>
<?xml version="1.0" encoding="UTF-8"?>
<file language="ar">
…and the Pakistani military governor general<person>Pervez
Musharraf</person> 57 years he got ready in other than an occasion to the
reduction of military forces on the border with <location>India</location>
And signing a non-aggression pact and stopping the arms race in the region,
however raising the budget in <location>India</location> …
</file>
(S (VP rafaDat رَفَضَت
       (NP-SBJ Al+suluTAtu السُلُطاتُ)
       (S-NOM-OBJ
         (VP manoHa مَنْحَ
             (NP-SBJ *)
             (NP-DTV Al>amiyri الأَميرِ
                     AlhAribi الهارِبِ)
             (NP-OBJ (NP jawAza جَوازَ
                         (NP safarK سَفَرٍ))
                     (ADJP dyblwmAsy~AF ديبلوماسياً))))))
رفضت السلطات منح الأمير الهارب جواز سفر ديبلوماسيا
The authorities refused to give the escaping prince a diplomatic passport
Arabic constructs complex words that often contain affixes and
clitics representing various parts of speech.
Inflected forms are systematically analyzed by rules.
Arabic word = sentence
Decomposition into tokens: ورأيتهم /wra’aytuhum/ "and I saw them"
• و /wa/ Conjunction "and"
• رأى /r'aa/ Past tense Verb "saw"
• تُ /tu/ Subject Pronoun "I"
• هُم /hum/ Object Pronoun "them"
Arabic word decomposition requires tokenization and
morphological analysis (along with part-of-speech tagging), usually combined.
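The decomposition above can be approximated by stripping clitics with ordered rules. This is a deliberately naive sketch: the three tables hold only the morphemes from the example, the names are hypothetical, and greedy stripping like this overgenerates on real text (it would also split a word whose first letter merely looks like a conjunction), which is why a lexicon and morphological analysis are normally combined with it.

```python
PROCLITICS = {"و": "Conjunction 'and'"}
ENCLITICS = {"هم": "Object pronoun 'them'"}
SUBJECT_SUFFIXES = {"ت": "Subject pronoun 'I'"}


def decompose(word):
    """Split a word into (token, label) pairs: proclitic, stem, suffixes."""
    tokens = []
    for clitic, label in PROCLITICS.items():
        if word.startswith(clitic) and len(word) > len(clitic):
            tokens.append((clitic, label))
            word = word[len(clitic):]
            break
    tail = []
    # Strip from the outside in: object pronoun first, then subject suffix.
    for table in (ENCLITICS, SUBJECT_SUFFIXES):
        for suffix, label in table.items():
            if word.endswith(suffix) and len(word) > len(suffix):
                tail.insert(0, (suffix, label))
                word = word[:-len(suffix)]
                break
    tokens.append((word, "Verb stem"))
    tokens.extend(tail)
    return tokens


pieces = [token for token, _ in decompose("ورأيتهم")]
```

The result lists the conjunction, the verb stem, the subject suffix and the object pronoun in reading order, mirroring the four bullets in the decomposition.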
29
Example 1
Surface form: وبِحَسَنَاتِهِم
Clitics: و بـ حسنات ـهم
Inflection: حسنـ ات
Lemma: حَسَنَة
Pattern: فَعَلَة
Root: حسن
Example 2
Surface form: ورأيتهم
Clitics: و رأيت ـهم
Inflection: رأيـ ت
Lemma: رأي
Pattern: فَعَل
Root: رأي
30
An affix acting as a function word might be ambiguous, but
resolving it is possible.
Suffix: هم (hm)
Stem: unambiguous analysis
• رأيتهم (I saw them): Verb: object pronoun
• كتابهم (noun): Noun: pronoun
• وهم (And they): Conjunction: subject
31
Linguistic Features
◦ Part-of-speech
Traditional: Noun, Verb, Particle
Computational: N, PN, V, Adj, Adv, P, Pron, Num, Conj, Det, Aux, Pun, IJ, and others
◦ Noun-specific
Number: singular, dual, plural, collective
Gender: masculine, feminine, neutral
Definiteness: definite, indefinite
Case: nominative, accusative, genitive
Possessive clitic
◦ Verb-specific
Aspect: perfective, imperfective, imperative
Voice: active, passive
Tense: past, present, future
Mood: indicative, subjunctive, jussive
Subject (Person, Number, Gender)
Object clitic
◦ Others
Single-letter conjunctions
Single-letter prepositions
32
Phonology
Morphology
Syntax
Semantics
Pragmatics
Discourse
Each kind of knowledge has associated with it an
encapsulated set of processes that make use of it.
Interfaces are defined that allow the various levels
to communicate.
This often leads to a pipeline architecture:
Morphological Processing → Syntactic Analysis → Semantic Interpretation → Context
33
[Diagram: Morphological Processing, Syntactic Analysis, Semantic Interpretation]
34
Four possible disambiguation approaches:
1. Tightly coupled interaction among processing levels;
knowledge from other levels can help decide among choices
at ambiguous levels.
2. Pipeline processing that ignores ambiguity as it occurs and
hopes that other levels can eliminate incorrect structures.
35
3. Probabilistic approaches based on making the most likely
choices, or passing along n-best choices
4. Don’t do anything, maybe it won’t matter
English example:
1. We’ll leave when the duck is ready to eat.
2. The duck is ready to eat now.
Does the “duck” ambiguity matter with respect to whether we can
leave?
Arabic example:
1. سأرحل عندما تعلن المدرسة نتائج الامتحان.
2. أعلنت المدرسة نتائج الامتحان.
المدرسة: teacher or school
Does the “المدرسة” ambiguity matter with respect to whether we can
leave?
36
37
Rule-Based Systems
◦ Explicit encoding of linguistic knowledge
◦ Usually consisting of a set of hand-crafted, grammatical rules
◦ Easy to test and debug
◦ Require considerable human effort
◦ Often based on limited inspection of the data with an emphasis
on prototypical examples
◦ Often fail to reach sufficient domain coverage
◦ Often lack sufficient robustness when input data are noisy
38
Rule:
NP = DET + NOUN
NP = DET + NOUN + DET + ADJ
A computer can identify that الحاسوب and الحاسوب الجديد
are noun phrases
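The two NP rules can be applied mechanically to part-of-speech tagged tokens. A minimal sketch, assuming the input has already been tokenized so that the article ال is its own token; `NP_RULES` and `find_noun_phrases` are hypothetical names.

```python
# Longer rule first so that DET+NOUN+DET+ADJ is preferred over DET+NOUN.
NP_RULES = [
    ("DET", "NOUN", "DET", "ADJ"),
    ("DET", "NOUN"),
]


def find_noun_phrases(tagged):
    """Scan (token, tag) pairs left to right and collect spans matching a rule."""
    phrases = []
    i = 0
    while i < len(tagged):
        for rule in NP_RULES:
            window = tagged[i:i + len(rule)]
            if tuple(tag for _, tag in window) == rule:
                phrases.append([token for token, _ in window])
                i += len(rule)
                break
        else:
            i += 1  # no rule starts here; move on
    return phrases


simple = [("ال", "DET"), ("حاسوب", "NOUN")]
modified = simple + [("ال", "DET"), ("جديد", "ADJ")]
```

The rules are explicit and easy to debug, but anything not covered by a rule, such as an indefinite noun phrase, is silently missed, which illustrates the coverage problem of hand-crafted grammars.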
39
Implicit encoding of linguistic knowledge
Often using statistical methods or machine learning
methods
Require less human effort
Are data-driven and require large-scale data sources
Achieve coverage directly proportional to the richness
of the data source
Are more adaptive to noisy data
40
The computer is told that الحاسوب, الصندوق الأحمر,
الكتاب الأحمر, etc. are noun phrases
And it learns that الحاسوب الأحمر is a noun phrase
41
Features:
◦ properties or characteristic attributes of words designed for
consumption by a computational system.
four major steps:
1. feature selection,
2. algorithm selection or the decision of which ML algorithm(s)
to use for training and classification,
3. training, the actual learning of distinguishing patterns using
the selected feature list, and
4. classification, applying these patterns to the input text to
detect and classify the NEs.
42
43
Dr Khaled Shaalan
Sentence
ساعدت الهيئة الفلسطينيين
Tokenization: breaking words into smaller units by separating articles, prepositions
and conjunctions
helped@the@agency@the@Palestinians
Tokenizer
44
Morphological analysis: All possible interpretations
helped
the
agency
Palestinians
45
Morph.
Analyzer
Lexicon: Lexical properties and subcategorization frames
helped
agency
Palestinian
Lexicon
46
Grammar Rules: Phrase Structure rules and functional equations
Grammar
47
Full parse: constituent-structures and functional-structures
helped / the agency / the Palestinians
48
An anaphor is a word that acquires its meaning from the linguistic
and/or pragmatic context. Usually an anaphor refers to
a preceding antecedent.
قال الرجل أن الوزير قد استدعاه
قال الرجل أن الوزير قد أقاله الرئيس
49
There are usually only a few people working on the grammar.
Development is hampered by linguistic (theoretical)
issues that pop up frequently.
Speed also depends on what tools are already available.
Grammar writers are usually researchers who are more
interested in linguistic phenomena than in coverage.
No formal guidelines, training, or project management.
50
Overgeneration (too many outputs)
Ambiguity
Reconstruction of vowels
Multiword/compound expressions: بالحديد والنار
Out-of-Vocabulary (OOV): استلال
Handling ill-formed input
◦ Detection (spell checking)
◦ Correction by relaxation: “ه” instead of “ة”
Prevent ill-formed output
◦ Check the compatibility (the prefix “ف” cannot come after the
prefix “ب” or “ك”).
◦ Word sense: الجزيرة, حدود, رجل
◦ Complex word ambiguity: بعقوبة, وهمي, وفي
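The prefix compatibility check mentioned above is a natural fit for a small rule table. The sketch below encodes only the single constraint stated on the slide; `FORBIDDEN_AFTER` and `prefixes_compatible` are hypothetical names, and a real generator would carry a much larger table.

```python
# For each prefix, the prefixes that must not directly follow it.
FORBIDDEN_AFTER = {
    "ب": {"ف"},
    "ك": {"ف"},
}


def prefixes_compatible(prefixes):
    """Return True if no adjacent prefix pair violates the rule table."""
    for earlier, later in zip(prefixes, prefixes[1:]):
        if later in FORBIDDEN_AFTER.get(earlier, set()):
            return False
    return True
```

Running the check before emitting a generated word keeps ill-formed prefix sequences out of the output without having to enumerate every valid combination.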
51
Ambiguity (more than one parse tree)
◦ Disambiguation techniques
Syntactic freedom
◦ VSO or SVO: ذهب الأولاد إلي المدرسة / الأولاد ذهبوا إلي المدرسة
Handling ill-formed input
◦ Detection (grammar checking)
◦ Recovering (partial parsing: parses = chunks to be related)
Anaphora resolution (matching indices mark co-referring expressions):
◦ Gender agreement
قال احمد(1) لسميرة(2) أنه(1) أحضر لها(2) هدية من مكة
◦ Number agreement
كل السيارات(1) في المعرض ماعدا سيارة(2) واحدة لونهم(1) احمر
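Agreement-based filtering of antecedents, as in the two examples above, can be sketched as follows. The `Mention` class, its feature values and the `resolve` function are illustrative assumptions; a real system would obtain gender and number from morphological analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str
    gender: str  # "m" or "f"
    number: str  # "sg", "du" or "pl"


def resolve(pronoun, candidates):
    """Return preceding mentions that agree with the pronoun, nearest first."""
    return [
        candidate
        for candidate in reversed(candidates)
        if candidate.gender == pronoun.gender and candidate.number == pronoun.number
    ]


ahmed = Mention("احمد", "m", "sg")
samira = Mention("سميرة", "f", "sg")
he = Mention("ـه", "m", "sg")    # pronoun in أنه
her = Mention("ها", "f", "sg")   # pronoun in لها
```

With one masculine and one feminine candidate, gender agreement alone is enough to pick the antecedent; when several candidates agree, the nearest-first ordering is only a heuristic and further rules are needed.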
52
◦ Attia, M., Pecina, P., Samih, Y., Shaalan, K., Van Genabith, J. Arabic Spelling Error Detection
and Correction, Journal of Natural Language Engineering (JNLE), Cambridge University Press,
UK, Sept. March 2015. DOI: 10.1017/S1351324915000030 (IF 0.463)
◦ Attia, M., Samih, Y., Shaalan, K., Genabith, J., The Floating Arabic Dictionary: An Automatic
Method for Updating a Lexical Database through the detection and lemmatization of the
Unknown Words, The International Conference on Computational Linguistics (COLING),
PP. 83-96, Mumbai, India, 8-15 December, 2012.
◦ Attia, M., Pecina, P., Samih, Y., Shaalan, K., Genabith, J., Improved Spelling Error Detection and
Correction for Arabic, The International Conference on Computational Linguistics
(COLING), PP. 103-112, Mumbai, India, 8-15 December, 2012.
◦ Shaalan, K., and Attia, M. Handling Unknown Words in Arabic FST Morphology. The 10th
edition of the International Workshop on Finite State Methods and Natural Language
Processing (FSMNLP) 2012, Donostia - San Sebastian, Spain, July 23-25, 2012.
◦ Shaalan, K., Samih, Y., Attia, M., Pecina, P., Genabith, J., 2012. Arabic Word Generation and
Modelling for Spell Checking, In the Proceedings of The eighth international conference on
Language Resources and Evaluation (LREC'12), PP. 719-725, Istanbul, Turkey, 21-27 May
2012.
◦ Attia, M., Shaalan, K., Tounsi, L., Genabith, J., Automatic Extraction and Evaluation of Arabic
LFG Resources, In the Proceedings of The eighth international conference on Language
Resources and Evaluation (LREC'12), PP. 1947-1954, Istanbul, Turkey, 21-27 May 2012.
53
These are core Computational Linguistics tools, useful
for:
Machine Translation
CALL
Information Extraction
Question-answering
Proofing tools (spell and grammar checking and correction)
Text understanding and reasoning
54
NER is the task of detecting and classifying
named entities (i.e. proper names) within
unstructured and structured texts into
predefined classes (e.g. person, location and
organization).
Rule-based + Machine Learning = Hybrid
Examples:
الملك عبدالله (King Abdullah).
الرئيس الفخري للجمعية اللواء أبراهيم الباري
(The honorary president of the society
general Ibrahim AlBari)
الزعيم الديني الشيخ الاردني أبوبكر البدوى
(The Jordanian religious leader Sheikh
Abu Baker AlBadawi)
Arabic name elements fall into four main categories.
Persons are named by:
◦ Ism: (pronounced IZM), a personal, proper name given
shortly after birth. E.g., Muhammad [Mohammed], Musa
[Moses]
◦ Kunya: (pronounced COON-yah), an honorific name or
surname, as the father or mother of someone; e.g., abu
Da'ud [the father of David], umm Salim [the mother of
Salim].
◦ Nasab: (pronounced NAH-sahb), a pedigree, as the son or
daughter of someone. The nasab follows the ism in usage.
e.g., Hasan ibn Faraj [Hasan the son of Faraj]
◦ Laqab: (pronounced LAH-kahb), a combination of words
into a byname either religious, relating to nature, a
descriptive, or some admirable quality of the person.
Laqabs follow the ism: Yasir Harun al-Rashid [Yasir
Haroon the Rightly-guided].
56
Types:
◦ Word-level features
◦ List lookup features
◦ Contextual features
◦ Linguistic features
57
Feature: Description
Special markers: A binary feature indicating the presence of
punctuation marks and special characters in a word
Word length: A binary feature indicating whether the
length of the word is greater than a predefined threshold.
Capitalization: A binary feature indicating the existence of
capitalization information on the gloss
corresponding to the Arabic word
Lexical: The surface features of a character n-gram
up to a range of characters from 1 to n that
indicate prefix and suffix attachment.
58
Feature: Description
Gazetteer: A binary feature indicating the existence of
the word in an individual gazetteer
Lexical trigger: A binary feature indicating the existence of
the word in the individual lexical trigger list
Blacklist: A binary feature indicating the non-existence
of the word in an individual blacklist
Nationality: A binary feature indicating the existence of
the word in the nationality list
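The four list-lookup features are simple set-membership tests. The sketch below uses tiny hypothetical lists built from words that appear in this talk's examples; the list contents, the feature names and the `lookup_features` function are assumptions for illustration.

```python
# Hypothetical resource lists; real systems load these from files.
GAZETTEER = {"عبدالله", "القاهرة"}
LEXICAL_TRIGGERS = {"الملك", "الرئيس", "الشيخ"}
BLACKLIST = {"فندق"}
NATIONALITIES = {"الاردني"}


def lookup_features(word):
    """Return the binary list-lookup features for a single word."""
    return {
        "in_gazetteer": word in GAZETTEER,
        "is_lexical_trigger": word in LEXICAL_TRIGGERS,
        "not_in_blacklist": word not in BLACKLIST,
        "is_nationality": word in NATIONALITIES,
    }
```

Each word yields one small dictionary of booleans, which can be concatenated with the word-level, contextual and linguistic features before being passed to a classifier.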
59
Feature: Description
Word n-gram: The features of a sliding window
comprising a word n-gram that
includes the candidate word, along
with preceding and succeeding words
Rule-based: The features of a sliding window
derived from rule-based NER decisions
60
Feature: Description
POS: The label identifying the part of speech
category of a word
Morphological: A set of morphological information
(excluding POS)
BPC: Phrase-level labels identifying syntactic
chunks such as noun phrases (NPs) and
verb phrases (VPs) within a text
61
Determine the optimized feature set for Arabic NER
Ambiguity
Washington (a person name, a city name or a political entity)
أعلنت القاهرة ان طائراتها لم تقصف مواقع ليبية
Determining the beginning and end of a named entity
Stanford University bookshop
Co-reference
السيسي – رئيس الجمهورية – القائد الأعلى للقوات المسلحة –
عبد الفتاح السيسي – الرئيس
They need to be recognized as mentions of the same entity
Entities within entities
Chicago Pizza
فندق القاهرة
63
◦ Shaalan, K., A Survey of Arabic Named Entity Recognition and
Classification, Computational Linguistics, MIT Press. (IF 1.468)
◦ Shaalan, K., Oudah, M., A Hybrid Approach to Arabic Named Entity
Recognition, Journal of Information Science (JIS), SAGE
Publications Ltd, UK. (IF 1.238)
◦ NERA: Named Entity Recognition for Arabic, Journal of the
American Society for Information Science and Technology
(JASIST), John Wiley & Sons, Inc., NJ, USA, 60(8): 1652–1663,
July 2009. (IF 2.23).
◦ Oudah, M., Shaalan, K., A Pipeline Arabic Named Entity
Recognition Using a Hybrid Approach, The International
Conference on Computational Linguistics (COLING), PP. 2159-2176,
Mumbai, India, 8-15 December, 2012.
Lexical Error diagnosis and feedback for second language learners of Arabic
within intelligent language tutoring framework.
◦ Improve language skills of
◦ Identify the cause of error rather than providing the correct version directly
Example:
Question: choose the correct answer
....... سقوط أمطار اليوم (أتوقع - أوقع - توقع)
I …. rain today (expect - sign - expected)
Learner Answer: أوقع
Correct Answer: أتوقع (see next slide)
Diagnosis: incorrectly used the imperfect tense of the root و-ق-ع /w-q-E/ with the
pattern أفعل /<afoEal/ instead of the intended pattern تفعل /tafaE~al/
Word: أتوقع
Root: و-ق-ع
Lexical category: verb
Pattern: تفعل
Tense: imperfect
Voice: active
Mood: indicative
Subject person: 1
Subject num: sg
Subject gender: neutral
◦ Shaalan, K., Magdy, M., Fahmy, A. Analysis and Feedback of
Erroneous Arabic Verbs, Journal of Natural Language
Engineering (JNLE), 21(2):271-323, Cambridge University
Press, UK, March 2015. (IF 0.463)
◦ Shaalan K. Arabic GramCheck: A Grammar Checker for
Arabic, Software Practice and Experience, John Wiley &
sons Ltd., UK, 35(7):643-665, June 2005. (IF 1.148)
◦ Shaalan K. An Intelligent Computer Assisted Language
Learning System for Arabic Learners, Computer Assisted
Language Learning: An International Journal, Taylor &
Francis Group Ltd., 18(1 & 2): 81-108, February 2005. (IF
0.880)
Grammar checking is very useful for automating the proofreading of
human-typed Arabic text.
The use of word processors and text editors leads
to a whole class of writing errors.
A grammar checker is capable of checking user writing for certain
common grammatical errors, describing the problem
to the user, and offering suggestions for improvement.
BUiD - Dr. Khaled Shaalan
68
Agreement errors
Wrong constituent forms
Missing sentence fragments
Wrong word order
69
Error type: Number and gender agreement between the inchoative and the enunciative
Example: الجنود *يدافعان عن الوطن (The soldiers defend the country.)
Correct version: الجنود يدافعون عن الوطن (The soldiers defend the country.)

Error type: Number and gender agreement between the circumstantial accusative and the subject it modifies
Example: جاءت بعض السيدات *تحمل أطفالهن (Some ladies came carrying their children.)
Correct version: جاءت بعض السيدات يحملن أطفالهن (Some ladies came carrying their children.)

Error type: Gender agreement between a verb and the subject
Example: *شرب البنت عصير البرتقال (The girl drank (m) orange juice.)
Correct version: شربت البنت عصير البرتقال (The girl drank (f) orange juice.)

Error type: Agreement between a verb tense and the use of specific particles
Example: الرجال لن *ذهبوا إلى القرية (The men will not go to the village.)
Correct version: الرجال لن يذهبوا إلى القرية (The men will not go to the village.)

Error type: Case ending agreement between a number and its following descriptor
Example: الفلاح زرع فدانين *قمح (The farmer grew two feddans of wheat.)
Correct version: الفلاح زرع فدانين قمحا (The farmer grew two feddans of wheat.)
70
Error type: Case ending of inchoative or enunciative
Example: *المعلمين ضربا الولد (The two teachers hit the boy.)
Correct version: المعلمان ضربا الولد (The two teachers hit the boy.)

Error type: Number and case ending of the noun that follows the interrogative particle kam (How many)
Example: كم *تلاميذ الفصل؟ (How many students are there in the classroom?)
Correct version: كم تلميذا في الفصل؟ (How many students are there in the classroom?)

Error type: The verb should remain singular even though the subject is dual or plural
Example: *يلعبون الأولاد في الحديقة (The boys play in the garden.)
Correct version: يلعب الأولاد في الحديقة (The boys play (sg) in the garden.)

Error type: Definition of inchoative or (compound) noun in genitive
Example: *رجل مهذب (A man polite.)
Correct version: الرجل مهذب (The man is polite.)

Error type: Declension of the simple and compound number
Example: كتبت الرسالة *السادس *عشر لأختها في العراق (She wrote the sixteenth message to her sister in Iraq.)
Correct version: كتبت الرسالة السادسة عشرة لأختها في العراق (She wrote the sixteenth message to her sister in Iraq.)
71
Error type: Missing a pronoun that relates the enunciative with the inchoative
Example: الرجال *يثق فيه (The men trust him.)
Correct version: الرجال يثقون فيه (The men trust him.)

Error type: Missing the subject of a verbal sentence
Example: *ذهب إلى الدار (Went to the house.)
Correct version: ذهب الغلام إلى الدار (The boy (or any other animated masculine entity) went to the house.)

Error type: Missing the object of a verbal sentence
Example: *فتح الولد (The boy opened.)
Correct version: فتح الولد الباب (The boy (or any other animated masculine entity) opened the door.)
72
Error type: Wrong order between the adjective and the noun it describes
Example: اشتريت جميلة قطة (I bought a cat beautiful.)
Correct version: اشتريت قطة جميلة (I bought a beautiful cat.)
73
Transfer
Interlingua
75
Source sentence (English)
→ Sentence Analysis: morphological & syntactic analysis (resources: English dictionary, rules of English)
→ English parse tree
→ Transfer: English-to-Arabic transformation rules (resource: bilingual dictionary)
→ Arabic parse tree
→ Sentence Synthesis: morphological generation & synthesis rules of Arabic (resource: Arabic dictionary)
→ Target sentence (Arabic)
76
Involves analysis, transfer, and generation components
If you have an Arabic parser and an Arabic syntactic
generator, all you need is to acquire the transfer rules
and build the transfer component
77
(1) [wi:$1, wi+1:$2, …, wk:$k] ⇒ [wk:$k, wk-1:$k-1, …, wi:$1] (1 ≤ i ≤ k)
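Read as a reordering rule, (1) reverses a sequence of matched words. The sketch below applies that reading to the English noun compound "Networks performance evaluation" used in this talk; the tiny bilingual dictionary and the function name `transfer_noun_compound` are assumptions, and number agreement and definiteness are not handled.

```python
# Hypothetical bilingual dictionary with the three words of the example.
BILINGUAL = {
    "networks": "شبكة",
    "performance": "أداء",
    "evaluation": "تقييم",
}


def transfer_noun_compound(words):
    """Reverse the word order, then translate each word by dictionary lookup."""
    reordered = list(reversed(words))
    # Unknown words are passed through untranslated.
    return [BILINGUAL.get(word.lower(), word) for word in reordered]


arabic = transfer_noun_compound(["Networks", "performance", "evaluation"])
```

The output order (evaluation, performance, networks) matches the Arabic construct chain, which is why a purely structural rule is enough for this class of noun phrases once the lexical lookup is available.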
78
Networks performance evaluation → تقييم أداء شبكة
English parse tree: np(noun: networks, pl) np(noun: performance, sg) np(noun: evaluation, sg)
transfer
Arabic parse tree: np(noun: تقييم, sg) np(noun: أداء, sg) np(noun: شبكة, pl)
79
Synonyms of a word
◦ Acquisition: “اكتساب” or “استخلاص”.
Agreement
◦ intelligent tutoring systems: “نظم التعليم الذكية” or “نظم التعليم الذكي”
Problems with prepositions
◦ did you do fungal analysis? “هل قمت بـتحليل الفطر؟”
…
80
Multilingual Machine Translation
Interlingua (semantic)-based approach
Interlingua = Semantic Representation: [CAUSE (X, [BE (Y, [PLEASED])])]
English: I like Mary
Arabic: أنا أحب مارى
Spanish: Mary me gustar
French: ?
Deep analysis:
◦ No need for a transfer component
◦ Only analysis and generation components
Add Arabic analyzer to translate to other languages
Add Arabic generator to translate from other languages
82
Input: العميل: أنا أرغب في حجز غرفة في الفندق
→ Preprocessor
→ Sentence Analyzer (resources: Arabic Lexicon, Arabic Grammar Rules)
   with Morphological Analyzer (resource: Arabic Morphology Rules)
→ Parse Tree
→ Mapper (resources: Map Lexicon, Ontology)
→ Interlingua (IF):
c:introduce-topic+reservation+disposition+room (room-spec=(room,
specifier=hotel,identifiability=yes),disposition=(desire,who=i))
83
Interlingua (IF):
c:introduce-topic+reservation+disposition+room (room-spec=(room,
specifier=hotel,identifiability=yes),disposition=(desire,who=i))
→ Mapper (resources: Map Lexicon, Ontology, Map Rules)
→ Feature Structure
→ Sentence Generator (resources: Arabic Lexicon, Arabic Grammar Rules)
→ Morphological Generator (resource: Arabic Morphology Rules)
→ Output: العميل: أنا أرغب في حجز غرفة في الفندق
84
Interlingua:
◦ language-neutral representation
◦ captures the intended meaning of the source sentence
Requires a fully-disambiguating parser
85
◦ Abdel Monem, A., Shaalan, K., Rafea, A., Baraka, H., Generating
Arabic Text in Multilingual Speech-to-Speech Machine Translation
Framework, Machine Translation, Springer, Netherlands, 20(4):
205-258, December 2008.
◦ Shaalan, K., Monem, A. Rafea, A., Arabic Morphological Generation
from Interlingua: A Rule-based Approach, in IFIP International
Federation for Information Processing, Vol. 228, Intelligent
Information Processing IIP, eds. Z. Shi, Shimohara K., Feng D.,
(Boston:Springer), USA, PP. 441-451, 2006.
◦ Shaalan K., Rafea, A., Abdel Monem, A., Baraka, H., Machine
Translation of English Noun Phrases into Arabic, The International
Journal of Computer Processing of Oriental Languages
(IJCPOL), World Scientific Publishing Company, 17(2):121-134,
2004.
86
Be able to reuse MSA processing tools with colloquial
Arabic by transferring colloquial Arabic words into
their corresponding MSA words.
Facilitate the communication with colloquial Arabic
speakers
Restore the Arabic dialect to the standard language in
use nowadays.
87
امتي؟ → (Mapping) → متي؟ (when?)
88
عال (On-the) → (Mapping) → علي (on) + ال (the)
89
جيت امتي؟ (You-came when?)
→ (Mapping) → جئت متي؟
→ (Reordering) → متي جئت؟ (When did-you-come?)
• Step (1): word mapping
  • جيت → جئت
  • امتي → متي
• Step (2): reordering
  • the new segment position for the word “امتى” is
    start of sentence (SoS)
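The two steps, word mapping followed by reordering, can be sketched directly. The mapping table holds only the two words from the example, the fronting rule is limited to the one question word shown, and all names are hypothetical.

```python
# Step 1 resource: colloquial word -> MSA word (toy table).
WORD_MAP = {"جيت": "جئت", "امتي": "متي"}
# Step 2 resource: MSA words that move to the start of the sentence.
FRONTED = {"متي"}


def colloquial_to_msa(sentence):
    """Map each colloquial word to MSA, then front question words."""
    mark = "؟" if sentence.endswith("؟") else ""
    words = sentence.rstrip("؟").split()
    mapped = [WORD_MAP.get(word, word) for word in words]
    fronted = [word for word in mapped if word in FRONTED]
    rest = [word for word in mapped if word not in FRONTED]
    return " ".join(fronted + rest) + mark
```

Keeping the lexical table and the reordering rule separate mirrors the two-step description and lets either resource grow independently as more dialect data is collected.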
90
Lack of standard linguistic resources
More investigations are needed for deriving/learning
transfer rules
91
◦ Abo Bakr, H., Shaalan, K., Ziedan, I., A Hybrid Approach for
Converting Written Egyptian Colloquial Dialect into
Diacritized Arabic, In the Proceedings of The 6th International
Conference on Informatics and Systems, INFOS2008, the
special track on Natural Language Processing, 27-29 March,
Cairo, Egypt, 2008.
◦ Shaalan, K., Abo Bakr, H., Ziedan, I., Transferring Egyptian
Colloquial into Modern Standard Arabic, International
Conference on Recent Advances in Natural Language
Processing (RANLP – 2007), PP. 525-529, September 27-29,
Borovets, Bulgaria, 2007.
92
Obtaining a brief and concise answer for Arabic questions
extracted from internet corpus.
Question Analysis
Passage Retrieval
Answer Extraction
[email protected]
◦ http://mailman.uib.no/listinfo/corpora
[email protected]
◦ http://www.linguistlist.org/
[email protected]
◦ http://www.semitic.tk/
[email protected]
◦ http://www.arabicscript.org/CAASL3/index.html
96
Advances in human language technology
require an ever-increasing amount of data
and annotation. The number of current
state-of-the-art Arabic linguistic
resources is still insufficient compared to
Arabic’s actual importance as a language.
Many existing Arabic NLP resources are
only available at significant expense.
The required Arabic NLP technologies are
sparse, or even nonexistent, in many areas,
and researchers have to develop their own
tools.
Conferences: ACL/NAACL, SIGIR, AAAI/IJCAI, ANLP,
Coling, HLT, EACL/NAACL, AMTA/MT Summit,
ICSLP/Eurospeech
Journals: Computational Linguistics, Natural Language
Engineering, Information Retrieval, Information Processing and
Management, ACM Transactions on Information Systems,
ACM TALIP, ACM TSLP
University centers: Columbia, CMU, JHU, Brown, UMass,
MIT, UPenn, USC/ISI, NMSU, Michigan, Maryland,
Edinburgh, Cambridge, Saarland, Sheffield, and many others
Industrial research sites: IBM, SRI, BBN, MITRE, MSR,
(AT&T, Bell Labs, PARC)
The Anthology: http://www.aclweb.org/anthology
98
Full Professor, Faculty of Engineering & IT, BUiD
(Also Professor of Computer Science, FCI, Cairo Univ.)
Email: [email protected]
Web:
www.buid.ac.ae/shaalan
http://scholar.cu.edu.eg/?q=shaalan/
Personal Website: http://sites.google.com/site/khaledshaalan
99
Thank you