Rule-based approach in Arabic NLP: Tools, Systems and



Prof. Khaled Shaalan
[email protected]
Faculty of Engineering & IT, The British University in
Dubai, UAE
Keynote Speech
Workshop in Plagiarism Detection in Arabic
Department of Computer Science, King
Abdulaziz University, Jeddah – KSA



Overview
Challenges in Arabic Computational Linguistics
Research Opportunities

“Computational linguistics is the study of
computer systems for understanding and
generating natural language”

“Computational linguistics is the scientific
study of language from a computational
perspective”

Translation: “اللسانيات الحاسوبية”

The following terms are used to refer to the same or closely related disciplines:
◦ Computational Linguistics (CL)
◦ Natural Language Processing (NLP)
◦ Natural Language Understanding (NLU)
◦ Language Engineering (LE)
◦ Human Language Technologies (HLT)

To implement a fully automatic computer system that
can understand and express itself in human Language

The dream: Computers/Robots who can converse with
us as in science-fiction films (Star Wars, Alien, Star
Trek, The Terminator, etc.), and yet we don’t see any
really robust or effective ones in 20DD.

You could tell the computer what you wanted and it
understood you?
◦ Tasks: information retrieval, question answering, dialogue systems,
etc.

You could dictate a letter to the computer, it printed it and
then saved it as a file?
◦ Task: speech recognition (speech to text)

If you have to mark essays or reports and want to locate
instances of plagiarism, you could ask the computer to
compare each suspicious document with likely similar
reports.
◦ Task: Plagiarism Detection (direct changes vs style changes)

Having no time to read a 100 page book, you could ask
the computer to summarise it for you and it produced a
one page summary in a few minutes
◦ Tasks: Automatic summarisation

You could ask the computer to translate for you a text
in Japanese which you did not understand?
◦ machine translation

You could ask your computer for legal advice in
ordinary language and it gave you the answer straight
away?
◦ information extraction, summarisation, text generation

When you feel unwell you could tell the computer what
your symptoms were and it could diagnose your
condition?
◦ information extraction, summarisation, text generation,
question answering, knowledge management.

Because Natural Language is extremely rich in form
and structure, and very ambiguous.

The degree varies from one language to another: Chinese
and Arabic are among the most difficult languages to
learn and process.

Can we formulate theorems for language?
◦ Each word has only one meaning.
◦ Each sentence can be interpreted in only one way.
◦ The utterances produced by humans are sincere.
◦ The size of the vocabulary is fixed, and there is a linear function
between each pair of languages, e.g. y = 2x as a relation between the
vocabulary sizes of English and Arabic.


Ambiguity is a fundamental problem of computational linguistics.
Resolving ambiguity, called disambiguation, is a crucial goal.








Arabic Script
Language in Use
Lack of Capitalization
Optional short vowels
Lack of uniformity in writing styles
Systematic spelling mistakes
Lack of resources
Agglutination

Genuine Arabic Script

Transliterated
Bismi Allāhi Ar-Raĥmāni Ar-Raĥīmi
In the name of Allah, the Beneficent,
the Merciful.


An Arabic word is defined as a string of characters
delimited by spaces.
Right-to-left
‫ﻍشمكبن‬
‫ﻏ ﺷ ﻣ ﻛ ﺑ ﻧ‬
‫ﻐ ﺸ ﻤ ﻜ ﺒ ﻨ‬
‫ﻎﺶﻢﻚﺐﻦ‬
Rows above, from top to bottom: stand-alone, initial, medial, and final letter forms.
Buckwalter Encoding

Romanization
◦ One-to-one mapping
to Arabic script spelling
◦ Left-to-right
◦ Easy to learn/use
◦ Human & machine compatible
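
The one-to-one mapping can be sketched as a pair of lookup tables. This is a minimal illustration, assuming the commonly published Buckwalter table for the basic letters and diacritics (U+0621 to U+063A and U+0640 to U+0652); the function names `to_buckwalter` and `from_buckwalter` are hypothetical, and extended characters are simply passed through unchanged.

```python
# Arabic code points in Unicode order, paired with their Buckwalter symbols.
ARABIC = [chr(c) for c in range(0x0621, 0x063B)] + [chr(c) for c in range(0x0640, 0x0653)]
LATIN = "'|>&<}AbptvjHxd*rzs$SDTZEg" + "_fqklmnhwYyFNKaui~o"
assert len(ARABIC) == len(LATIN)  # the mapping must be one-to-one

AR2BW = dict(zip(ARABIC, LATIN))
BW2AR = {latin: arabic for arabic, latin in AR2BW.items()}


def to_buckwalter(text):
    """Romanize Arabic script; characters outside the table are kept as-is."""
    return "".join(AR2BW.get(ch, ch) for ch in text)


def from_buckwalter(text):
    """Map a Buckwalter string back to Arabic script."""
    return "".join(BW2AR.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(to_buckwalter("\u0643\u062a\u0628"))  # the three letters kaf, ta, ba
```

Because every symbol maps to exactly one letter, the conversion is reversible, which is what makes the encoding both human and machine compatible.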

Social Media: Arabic characters are nowhere near
Latin-Based Characters.

Segmentation is a required preprocessing step because
Arabic allows fully concatenated phrases such as
“الدكتورمحمدوزيرالخارجية” (Dr-Mohammed-the-Minister-of-Foreign-Affairs).

Segmentation during analysis leads to ambiguity that
can be resolved by grammatical rules or normalization.
Word: وجد
◦ Stem: وجد (he found)
◦ Conj+Verb: و+جد
◦ Conj+Noun: و+جد (and grandfather)
Word: للغة (for a language)
◦ ل+لغة
◦ ل+اللغة
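
The segmentation ambiguity above can be reproduced with a very small rule-based enumerator. This is only a sketch under stated assumptions: words are written in Buckwalter transliteration (`wjd` for وجد, `llgp` for للغة), and the toy `PROCLITICS` and `STEMS` tables and the function `analyses` are hypothetical, not the lexicon of any real analyzer.

```python
# Toy lexicon in Buckwalter transliteration (illustrative entries only).
PROCLITICS = {"w": "Conj", "l": "Prep"}
STEMS = {"wjd": ["Verb"], "jd": ["Verb", "Noun"], "lgp": ["Noun"]}


def analyses(word):
    """Enumerate every reading licensed by the toy lexicon."""
    # Reading 1: the whole word is a stem.
    results = [f"{pos}:{word}" for pos in STEMS.get(word, [])]
    head, rest = word[:1], word[1:]
    if head in PROCLITICS:
        label = PROCLITICS[head]
        # Reading 2: single-letter proclitic + stem.
        results += [f"{label}+{pos}:{head}+{rest}" for pos in STEMS.get(rest, [])]
        # Reading 3: preposition l + definite article Al. The alef is not
        # written after l, and a lam-initial stem drops one more lam.
        if head == "l" and rest.startswith("l"):
            for stem in (rest[1:], rest):
                results += [f"{label}+Det+{pos}:l+Al+{stem}" for pos in STEMS.get(stem, [])]
    return results


if __name__ == "__main__":
    for w in ("wjd", "llgp"):
        print(w, analyses(w))
```

The enumerator deliberately returns all readings; choosing among them is left to grammatical rules or context, as the slide notes.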



Classical (Quranic) Arabic
Modern Standard Arabic
Colloquial Arabic dialects (local spoken Arabic with no
linguistic rules, written in social media)
Example, the word for "table":
◦ Classical Arabic: مِنضَدة (mindada)
◦ Modern Standard Arabic: طاولة (ţawile)
◦ Dialects: طربيزة (ţarabēza), ميدة (mida)

Non-casing language: Arabic writing systems do not
exhibit differences in orthographic case such as
capitalized initial letter which leads to ambiguity
(mainly with proper names).
Arabic word, with its meaning or function in a sentence:
أشرف (Ashraf)
◦ A given name
◦ An inflected verb (he-supervised)
◦ A superlative (the-most-honorable)
مسقط رأسه بجدة (Maskat rAsah fi Jeddah)
◦ Literal reading: the falling of his head in grandfather/Jeddah
◦ MWE (مسقط رأسه) = place of birth → Jeddah

This ambiguity can be resolved by analyzing the
context surrounding the ambiguous word(s).

Arabic diacritics represent most vowels; they affect the phonetic
representation and give different meanings to the same lexical form.
Because they are usually omitted in writing, this leads to
unvocalized-to-vocalized, or orthographic, ambiguity.
Unvocalized Arabic word: قطر
◦ Country name (a location) if transliterated as Qatar
◦ the literal meaning of country
◦ diameter (a measure) if transliterated as qutr
◦ the literal meaning of distill if transliterated as qat~ar
Unvocalized Arabic word: كتب
◦ كَتَبَ – katab, a verb, wrote, if we use the vowel (a)
◦ كُتُب – kutub, plural noun, books, if we use the vowel (u)
Word: يعد
Different interpretations and their lemmas:
◦ يُعِد (bring back), lemma أعاد
◦ يَعُد (return), lemma عاد
◦ يَعِد (promise), lemma وعد
◦ يَعُد (count), lemma عد
◦ يُعِد (prepare), lemma أعد
English: How are you?
Arabic translation:
◦ كَيفَ حَالُكَ؟ Kaifa haloka (the suffix “a” indicates male gender ♂)
◦ كَيفَ حَالُكِ؟ Kaifa haloki (the suffix “i” indicates female gender ♀)

Arabic has a high level of
transcriptional ambiguity
because it has more speech
sounds than Western
European languages: an
Arabic word can be
transliterated in a multitude
of ways.

Increase dictionary size: retain all versions of the name
variants in a dictionary with a possibility of linking
them together.

Increase processing time: normalize each occurrence of
the variant to a canonical form which requires a
mechanism (such as string distance calculation) for
name variant matching between a name variant and its
normalized representation.
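
The second option, matching a name variant against its canonical form by string distance, might look like the sketch below. It uses plain Levenshtein distance; the helper names `edit_distance` and `canonical_form`, the threshold of 2, and the romanized sample names are assumptions chosen for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def canonical_form(variant, canonical_names, max_distance=2):
    """Return the closest canonical name, or None if nothing is close enough."""
    best = min(canonical_names, key=lambda name: edit_distance(variant, name))
    return best if edit_distance(variant, best) <= max_distance else None


if __name__ == "__main__":
    names = ["Mohammed", "Khaled"]
    print(canonical_form("Mohamed", names))
```

The trade-off named on the slide is visible here: the dictionary stays small, but every lookup costs one distance computation per canonical entry.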

Frequent typographic errors made by Arabic writers involve
certain characters.
◦ Hamza dropping: أ, إ → ا
◦ Undotted ta-marbuta: ة → ه
◦ Undotted final ya: ي → ى
Typos and their correct forms:
◦ المملكه العربيه السعوديه → المملكة العربية السعودية (The Kingdom of Saudi Arabia)
◦ ابوظبى → أبوظبي (Abu Dhabi)
An edit-distance technique (shortest edit sequence) can be
used to recover some deleted/modified letters.
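
Because these three confusions are systematic, a simple character-level normalization can make a typo and its correct form compare equal before any edit-distance step is needed. The following is a sketch, not the author's method: the table `NORMALIZE` and the functions `normalize` and `same_word` are hypothetical, and the normalized string is only a matching key, not a corrected spelling.

```python
# Collapse the commonly confused characters onto a single representative.
NORMALIZE = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0629": "\u0647",  # ta marbuta -> ha
    "\u0649": "\u064a",  # undotted final ya (alef maqsura) -> ya
})


def normalize(text):
    """Return a matching key in which the systematic typos are neutralized."""
    return text.translate(NORMALIZE)


def same_word(a, b):
    """True if two spellings differ only by the systematic confusions."""
    return normalize(a) == normalize(b)


if __name__ == "__main__":
    typo = "\u0627\u0628\u0648\u0638\u0628\u0649"     # Abu Dhabi, with both typos
    correct = "\u0623\u0628\u0648\u0638\u0628\u064a"  # Abu Dhabi, correct form
    print(same_word(typo, correct))
```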



Most of them were developed outside the Arab world and
focus on the news genre.
Dictionaries: lists of words along with their morphological
and orthographic features, meanings, and uses.
Annotated linguistic resources (manually annotated &
verified rich resources to train on and evaluate against)
◦ Corpus: a large collection of tagged documents.
◦ Treebank: a large collection of syntactic analyses of sentences.
<?xml version="1.0" encoding="UTF-8"?>
<file language="ar">
…وكان الحاكم العسكري الباكستاني الجنرال <person>برويز مشرف</person> 57
عاماً استعد في غير مناسبة إلى تقليص القوات العسكرية على الحدود
مع <location>الهند</location> وتوقيع اتفاقية عدم اعتداء ووقف سباق التسلح في
المنطقة، غير أن رفع موازنة <location>الهند</location>
</file>
<?xml version="1.0" encoding="UTF-8"?>
<file language="ar">
…and the Pakistani military governor general<person>Pervez
Musharraf</person> 57 years he got ready in other than an occasion to the
reduction of military forces on the border with <location>India</location>
And signing a non-aggression pact and stopping the arms race in the region,
however raising the budget in <location>India</location> …
</file>
(S (VP rafaDat رَفَضَت
       (NP-SBJ Al+suluTAtu السُلُطات)
       (S-NOM-OBJ
         (VP manoHa مَنْحَ
             (NP-SBJ *)
             (NP-DTV Al>amiyri الأَميرِ AlhAribi الهارِبِ)
             (NP-OBJ (NP jawAza جَوازَ
                         (NP safarK سَفَر))
                     (ADJP dyblwmAsy~AF ديبلوماسياً))))))
رفضت السلطات منح الأمير الهارب جواز سفر ديبلوماسيا
The authorities refused to give the escaping prince a diplomatic passport


Arabic constructs complex words that often contain affixes and
clitics representing various parts of speech.
Inflected forms are systematically analyzed by rules.
Arabic word = sentence: ورأيتهم /wra’aytuhum/ (and I saw them)
Decomposition into tokens:
◦ و /wa/ Conjunction "and"
◦ رأى /r'aa/ Past tense Verb "saw"
◦ تُ /tu/ Subject Pronoun "I"
◦ هُم /hum/ Object Pronoun "them"

Arabic word decomposition requires tokenization and
morphological analysis (along with part-of-speech tagging), usually combined.
Surface form: وبِحَسَناتِهم
◦ Clitics: و بِـ حَسَنـ اتـ ـهم
◦ Inflection: حَسَنـ ات
◦ Lemma: حَسَنة
◦ Pattern: فَعَلة
◦ Root: حسن
Surface form: ورأيتهم
◦ Clitics: و رأيــ تـ ـهم
◦ Inflection: رأيــ تـ
◦ Lemma: رأي
◦ Pattern: فَعَل
◦ Root: رأي

An affix used as a function word might be ambiguous, but
resolving it is possible.
Suffix: هم (hm)
Unambiguous analysis, by the category of the stem:
◦ رأيتهم (I saw them). Verb: object pronoun
◦ كتابهم. Noun: pronoun
◦ وهم (And they). Conjunction: subject

Linguistic Features
◦ Part-of-speech
  ▪ Traditional: Noun, Verb, Particle
  ▪ Computational: N, PN, V, Adj, Adv, P, Pron, Num, Conj, Det, Aux, Pun, IJ, and others
◦ Noun-specific
  ▪ Number: singular, dual, plural, collective
  ▪ Gender: masculine, feminine, neutral
  ▪ Definiteness: definite, indefinite
  ▪ Case: nominative, accusative, genitive
  ▪ Possessive clitic
◦ Verb-specific
  ▪ Aspect: perfective, imperfective, imperative
  ▪ Voice: active, passive
  ▪ Tense: past, present, future
  ▪ Mood: indicative, subjunctive, jussive
  ▪ Subject (Person, Number, Gender)
  ▪ Object clitic
◦ Others
  ▪ Single-letter conjunctions
  ▪ Single-letter prepositions






Phonology
Morphology
Syntax
Semantics
Pragmatics
Discourse
Each kind of knowledge has associated with it an
encapsulated set of processes that make use of it.
Interfaces are defined that allow the various levels
to communicate.
This often leads to a pipeline architecture:
Morphological Processing → Syntactic Analysis → Semantic Interpretation → Context
[Figure: Morphological Processing, Syntactic Analysis, and Semantic Interpretation]

Four possible disambiguation approaches:
1. Tightly coupled interaction among processing levels;
knowledge from other levels can help decide among choices
at ambiguous levels.
2. Pipeline processing that ignores ambiguity as it occurs and
hopes that other levels can eliminate incorrect structures.
3. Probabilistic approaches based on making the most likely
choices, or passing along n-best choices.
4. Don’t do anything; maybe it won’t matter.
◦ We’ll leave when the duck is ready to eat.
◦ The duck is ready to eat now.
Does the “duck” ambiguity matter with respect to whether we can
leave?
◦ سأرحل عندما تعلن المدرسة نتائج الامتحان.
◦ أعلنت المدرسة نتائج الامتحان.
المدرسة: teacher or school
Does the “المدرسة” ambiguity matter with respect to whether we can
leave?

Rule-Based Systems
◦ Explicit encoding of linguistic knowledge
◦ Usually consisting of a set of hand-crafted grammatical rules
◦ Easy to test and debug
◦ Require considerable human effort
◦ Often based on limited inspection of the data with an emphasis
on prototypical examples
◦ Often fail to reach sufficient domain coverage
◦ Often lack sufficient robustness when input data are noisy
Rule:

NP = DET + NOUN
NP = DET + NOUN + DET + ADJ
A computer can identify that الحاسوب and الحاسوب الجديد
are noun phrases.
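
The two NP rules can be applied mechanically to a part-of-speech tagged token sequence. The sketch below assumes the input has already been tokenized and tagged (the tag names `DET`, `NOUN`, `ADJ`, `VERB` and the Buckwalter-style tokens are illustrative), and the functions `match_np` and `find_nps` are hypothetical helpers rather than part of any described system.

```python
# Hand-crafted rules, longest first so that DET+NOUN+DET+ADJ wins over DET+NOUN.
NP_RULES = [
    ["DET", "NOUN", "DET", "ADJ"],
    ["DET", "NOUN"],
]


def match_np(tags, start):
    """Return the length of the first rule matching at `start`, or 0."""
    for rule in NP_RULES:
        if tags[start:start + len(rule)] == rule:
            return len(rule)
    return 0


def find_nps(tagged):
    """Scan (token, tag) pairs left to right and collect noun phrases."""
    tags = [tag for _, tag in tagged]
    phrases, i = [], 0
    while i < len(tags):
        n = match_np(tags, i)
        if n:
            phrases.append([tok for tok, _ in tagged[i:i + n]])
            i += n
        else:
            i += 1
    return phrases


if __name__ == "__main__":
    # Al+HAswb Al+jdyd, "the new computer", split into article and stem.
    sentence = [("Al", "DET"), ("HAswb", "NOUN"), ("Al", "DET"), ("jdyd", "ADJ")]
    print(find_nps(sentence))
```

The rule list is explicit and easy to debug, but any construction not covered by a rule is silently missed, which illustrates the coverage limitation listed for rule-based systems.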






Implicit encoding of linguistic knowledge
Often using statistical methods or machine learning methods
Require less human effort
Are data-driven and require large-scale data sources
Achieve coverage directly proportional to the richness of the data source
Are more adaptive to noisy data


The computer is told that الحاسوب, الصندوق الأحمر,
الكتاب الأحمر, etc. are noun phrases,
and it learns that الحاسوب الأحمر is a noun phrase.

Features:
◦ properties or characteristic attributes of words designed for
consumption by a computational system.

Four major steps:
1. feature selection,
2. algorithm selection, i.e. deciding which ML algorithm(s)
to use for training and classification,
3. training, the actual learning of distinguishing patterns using
the selected feature list, and
4. classification, applying these patterns to the input text to
detect and classify the named entities (NEs).
Dr Khaled Shaalan
Sentence: ساعدت الهيئة الفلسطينيين
◦ Tokenizer. Tokenization: breaking words into smaller units by separating articles, prepositions
and conjunctions: helped@the@agency@the@Palestinians
◦ Morph. Analyzer. Morphological analysis: all possible interpretations (helped, the, agency, Palestinians)
◦ Lexicon: lexical properties and subcategorization frames (helped, agency, Palestinian)
◦ Grammar. Grammar rules: phrase structure rules and functional equations
◦ Full parse: constituent-structures and functional-structures (helped, the agency, the Palestinians)

A word that acquires its meaning from the linguistic
and/or pragmatic context. Usually an anaphor refers to
a preceding antecedent.
‫قال الرجل أن الوزير قد استدعاه‬
‫قال الرجل أن الوزير قد أقاله الرئيس‬





There are usually only a few people working on the grammar.
Development is hampered by linguistic (theoretical)
issues that pop up frequently.
Speed also depends on what tools are already available.
Grammar writers are usually researchers who are more
interested in linguistic phenomena than in coverage.
There are no formal guidelines, training, or project management.

Overgeneration (too many outputs)
Ambiguity
◦ Word sense: الجزيرة - حدود - رجل
◦ Complex word ambiguity: بعقوبة - وهمي - وفي
Reconstruction of vowels
Multiword/compound expressions: بالحديد والنار
Out-of-Vocabulary (OOV): استلال
Handling ill-formed input
◦ Detection (spell checking)
◦ Correction: relaxation, e.g. accepting “ه” instead of “ة”
Prevent ill-formed output
◦ Check the compatibility (the prefix “ف” cannot come after the
prefix “ب” (or “ك”)).

Ambiguity (more than one parse tree)
◦ Disambiguation techniques

Syntactic freedom
◦ VSO or SVO: ذهب الأولاد إلى المدرسة / الأولاد ذهبوا إلى المدرسة

Handling ill-formed input
◦ Detection (grammar checking)
◦ Recovering (partial parsing: parses = chunks to be related)

Anaphora resolution:
◦ Gender agreement: قال احمد(1) لسميرة(2) أنه(1) أحضر لها(2) هدية من مكة
◦ Number agreement: كل السيارات(1) في المعرض ماعدا سيارة(2) واحدة لونهم(1) احمر
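
Agreement-based anaphora resolution, as in the two examples above, can be sketched as a filter over candidate antecedents. This is a simplified illustration: the `Mention` class, the `resolve` function, and the romanized mention texts are hypothetical, and real systems combine agreement with many other constraints.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    text: str
    gender: str  # "m" or "f"
    number: str  # "sg" or "pl"


def resolve(pronoun, candidates, features=("gender", "number")):
    """Return the nearest preceding candidate that agrees on `features`."""
    for cand in reversed(candidates):
        if all(getattr(cand, f) == getattr(pronoun, f) for f in features):
            return cand
    return None


if __name__ == "__main__":
    ahmed = Mention("Ahmed", "m", "sg")
    samira = Mention("Samira", "f", "sg")
    # Gender agreement: the masculine and feminine clitics pick different antecedents.
    print(resolve(Mention("-hu", "m", "sg"), [ahmed, samira]).text)
    print(resolve(Mention("-hA", "f", "sg"), [ahmed, samira]).text)
```

In the second slide example only number is decisive, so a caller would pass `features=("number",)` to select the plural antecedent.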
◦ Attia, M., Pecina, P., Samih, Y., Shaalan, K., Van Genabith, J. Arabic Spelling Error Detection
and Correction, Journal of Natural Language Engineering (JNLE), Cambridge University Press,
UK, Sept. March 2015. DOI: 10.1017/S1351324915000030 (IF 0.463)
◦ Attia, M., Samih, Y., Shaalan, K., Genabith, J., The Floating Arabic Dictionary: An Automatic
Method for Updating a Lexical Database through the detection and lemmatization of the
Unknown Words, The International Conference on Computational Linguistics (COLING),
PP. 83-96, Mumbai, India, 8-15 December, 2012.
◦ Attia, M., Pecina, P., Samih, Y., Shaalan, K., Genabith, J., Improved Spelling Error Detection and
Correction for Arabic, The International Conference on Computational Linguistics
(COLING), PP. 103-112, Mumbai, India, 8-15 December, 2012.
◦ Shaalan, K., and Attia, M. Handling Unknown Words in Arabic FST Morphology. The 10th
edition of the International Workshop on Finite State Methods and Natural Language
Processing (FSMNLP) 2012, Donostia - San Sebastian, Spain, July 23-25, 2012.
◦ Shaalan, K., Samih, Y., Attia, M., Pecina, P., Genabith, J., 2012. Arabic Word Generation and
Modelling for Spell Checking, In the Proceedings of The eighth international conference on
Language Resources and Evaluation (LREC'12), PP. 719-725, Istanbul, Turkey, 21-27 May
2012.
◦ Attia, M., Shaalan, K., Tounsi, L., Genabith, J., Automatic Extraction and Evaluation of Arabic
LFG Resources, In the Proceedings of The eighth international conference on Language
Resources and Evaluation (LREC'12), PP. 1947-1954, Istanbul, Turkey, 21-27 May 2012.

These are core Computational Linguistics tools, useful for:
◦ Machine Translation
◦ CALL
◦ Information Extraction
◦ Question-answering
◦ Proofing tools (spell and grammar checking and correction)
◦ Text understanding and reasoning

NER is the task of detecting and classifying
named entities (i.e. proper names) within
unstructured and structured texts into
predefined classes (e.g. person, location and
organization).

Rule-based + Machine Learning = Hybrid

Examples:

الملك عبدالله (King Abdullah).

الرئيس الفخري للجمعية اللواء أبراهيم الباري
(The honorary president of the society
General Ibrahim AlBari)

الزعيم الديني الشيخ الاردني أبوبكر البدوى
(The Jordanian religious leader Sheikh
Abu Baker AlBadawi)

Arabic name elements fall into four main categories.
Persons are named by:
◦ Ism: (pronounced IZM), a personal, proper name given
shortly after birth. E.g., Muhammad [Mohammed], Musa
[Moses]
◦ Kunya: (pronounced COON-yah), an honorific name or
surname, as the father or mother of someone; e.g., abu
Da'ud [the father of David], umm Salim [the mother of
Salim].
◦ Nasab: (pronounced NAH-sahb), a pedigree, as the son or
daughter of someone. The nasab follows the ism in usage.
e.g., Hasan ibn Faraj [Hasan the son of Faraj]
◦ Laqab: (pronounced LAH-kahb), a combination of words
into a byname either religious, relating to nature, a
descriptive, or some admirable quality of the person.
Laqabs follow the ism: Yasir Harun al-Rashid [Yasir
Haroon the Rightly-guided].

Types:
◦ Word-level features
◦ List lookup features
◦ Contextual features
◦ Linguistic features
Feature and description:
◦ Special markers: a binary feature indicating the presence of
punctuation marks and special characters in a word.
◦ Word length: a binary feature indicating whether the
length of the word is greater than a predefined threshold.
◦ Capitalization: a binary feature indicating the existence of
capitalization information on the gloss corresponding to the Arabic word.
◦ Lexical: the surface features of a character n-gram, over a range
of 1 to n characters, that indicate prefix and suffix attachment.
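
The features in this table are simple enough to compute directly from the surface string. The sketch below is an assumption-laden illustration: the function name `word_level_features`, the default threshold of 3, the default n of 3, and the way the English gloss is passed in are all hypothetical choices, since the slide does not specify them.

```python
import string

# ASCII punctuation plus the Arabic comma, semicolon and question mark.
SPECIALS = set(string.punctuation) | {"\u060c", "\u061b", "\u061f"}


def word_level_features(word, gloss="", length_threshold=3, n=3):
    """Compute the four word-level features for one token."""
    k_max = min(n, len(word))
    return {
        "special_marker": any(ch in SPECIALS for ch in word),
        "long_word": len(word) > length_threshold,
        # Capitalization is read off the English gloss, not the Arabic word.
        "capitalized_gloss": bool(gloss) and gloss[0].isupper(),
        # Leading and trailing character n-grams of length 1..n.
        "prefixes": [word[:k] for k in range(1, k_max + 1)],
        "suffixes": [word[-k:] for k in range(1, k_max + 1)],
    }


if __name__ == "__main__":
    word = "\u0643\u062a\u0627\u0628"  # four letters: kaf, ta, alef, ba
    print(word_level_features(word, gloss="book"))
```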
Features and description:
◦ Gazetteer: a binary feature indicating the existence of
the word in an individual gazetteer.
◦ Lexical trigger: a binary feature indicating the existence of
the word in the individual lexical trigger list.
◦ Blacklist: a binary feature indicating the non-existence
of the word in an individual blacklist.
◦ Nationality: a binary feature indicating the existence of
the word in the nationality list.
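
Each list lookup feature reduces to a set membership test. A minimal sketch follows; the function `list_lookup_features`, the feature names, and the tiny romanized word lists are hypothetical and stand in for real gazetteers.

```python
def list_lookup_features(word, gazetteers, triggers, blacklist, nationalities):
    """Binary membership features for one token.

    `gazetteers` maps a gazetteer name (e.g. "person") to a set of entries,
    giving one binary feature per individual gazetteer.
    """
    feats = {f"in_gazetteer_{name}": word in entries
             for name, entries in gazetteers.items()}
    feats["is_lexical_trigger"] = word in triggers
    feats["not_in_blacklist"] = word not in blacklist
    feats["is_nationality"] = word in nationalities
    return feats


if __name__ == "__main__":
    gazetteers = {"person": {"Khaled"}, "location": {"Cairo", "Jeddah"}}
    print(list_lookup_features("Cairo", gazetteers,
                               triggers={"King", "Sheikh"},
                               blacklist={"said"},
                               nationalities={"Jordanian"}))
```

Using sets keeps each lookup constant-time, which matters when the features are computed for every token of a large corpus.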
Features and description:
◦ Word n-gram: the features of a sliding window comprising a word
n-gram that includes the candidate word, along with preceding and
succeeding words.
◦ Rule-based: the features of a sliding window derived from
rule-based NER decisions.
Feature and description:
◦ POS: the label identifying the part of speech category of a word.
◦ Morphological: a set of morphological information (excluding POS).
◦ BPC: phrase-level labels identifying syntactic chunks such as noun
phrases (NPs) and verb phrases (VPs) within a text.


Determine the optimized feature set for Arabic NER
Ambiguity
◦ Washington (a person name, a city name or a political entity)
◦ أعلنت القاهرة ان طائراتها لم تقصف مواقع ليبية
Determining the beginning and end of a named entity
◦ Stanford University bookshop

Co-reference
◦ القائد الأعلى للقوات المسلحة، رئيس الجمهورية، السيسي، عبد الفتاح السيسي، الرئيس
◦ They need to be recognized as mentions of the same entity

Entities within entities
◦ Chicago Pizza
◦ فندق القاهرة
◦ Shaalan, K., A Survey of Arabic Named Entity Recognition and
Classification, Computational Linguistics, MIT Press. (IF 1.468)
◦ Shaalan, K., Oudah, M., A Hybrid Approach to Arabic Named Entity
Recognition, Journal of Information Science (JIS), SAGE
Publications Ltd, UK. (IF 1.238)
◦ NERA: Named Entity Recognition for Arabic, Journal of the
American Society for Information Science and Technology
(JASIST), John Wiley & Sons, Inc., NJ, USA, 60(8): 1652–1663,
July 2009. (IF 2.23).
◦ Oudah, M., Shaalan, K., A Pipeline Arabic Named Entity
Recognition Using a Hybrid Approach, The International
Conference on Computational Linguistics (COLING), PP. 2159-2176, Mumbai, India, 8-15 December, 2012.

Lexical Error diagnosis and feedback for second language learners of Arabic
within intelligent language tutoring framework.
◦ Improve language skills of
◦ Identify the cause of error rather than providing the correct version directly
Example:
Question: choose the correct answer

....... سقوط أمطار اليوم (أتوقع - أوقع - توقع)
I …. rain today (expect - sign - expected)
Learner Answer: أوقع
Correct Answer: أتوقع (see next slide)
Diagnosis: incorrectly used the imperfect tense of the root و-ق-ع /w-q-E/ with the
pattern أفعل /<afoEal/ instead of the intended pattern تفعل /tafaE~al/
Word: ‫أتوقع‬
Root: ‫ ع‬-‫ ق‬-‫و‬
Lexical category: verb
Pattern: ‫تفعل‬
Tense: imperfect
Voice: active
Mood: indicative
Subject person: 1
Subject num: sg
Subject gender: neutral
◦ Shaalan, K., Magdy, M., Fahmy, A. Analysis and Feedback of
Erroneous Arabic Verbs, Journal of Natural Language
Engineering (JNLE), 21(2):271-323, Cambridge University
Press, UK, March 2015. (IF 0.463)
◦ Shaalan K. Arabic GramCheck: A Grammar Checker for
Arabic, Software Practice and Experience, John Wiley &
sons Ltd., UK, 35(7):643-665, June 2005. (IF 1.148)
◦ Shaalan K. An Intelligent Computer Assisted Language
Learning System for Arabic Learners, Computer Assisted
Language Learning: An International Journal, Taylor &
Francis Group Ltd., 18(1 & 2): 81-108, February 2005. (IF
0.880)

Very useful for automating the proofreading of
human-typed Arabic text.

The use of word processors and text editors leads
to a whole class of writing errors.

Capable of checking user writing for certain
common grammatical errors, describing the problem
to the user, and offering suggestions for improvement.
BUiD - Dr. Khaled Shaalan




Agreement errors
Wrong constituent forms
Missing sentence fragments
Wrong word order
Error type: Number and gender agreement between the inchoative and the enunciative
◦ Example: الجنود *يدافعان عن الوطن (The soldiers defend the country.)
◦ Correct version: الجنود يدافعون عن الوطن (The soldiers defend the country.)
Error type: Number and gender agreement between the circumstantial accusative and the subject it modifies
◦ Example: جاءت بعض السيدات *تحمل أطفالهن (Some ladies came carrying their children.)
◦ Correct version: جاءت بعض السيدات يحملن أطفالهن (Some ladies came carrying their children.)
Error type: Gender agreement between a verb and the subject
◦ Example: *شرب البنت عصير البرتقال (The girl drank (m) orange juice.)
◦ Correct version: شربت البنت عصير البرتقال (The girl drank (f) orange juice.)
Error type: Agreement between a verb tense and the use of specific particles
◦ Example: الرجال لن *ذهبوا إلى القرية (The men will not go to the village.)
◦ Correct version: الرجال لن يذهبوا إلى القرية (The men will not go to the village.)
Error type: Case ending agreement between a number and its following descriptor
◦ Example: الفلاح زرع فدانين *قمح (The farmer grew two fedans of wheat.)
◦ Correct version: الفلاح زرع فدانين قمحا (The farmer grew two fedans of wheat.)
Error type: Case ending of inchoative or enunciative
◦ Example: *المعلمين ضربا الولد (The two teachers hit the boy.)
◦ Correct version: المعلمان ضربا الولد (The two teachers hit the boy.)
Error type: Number and case ending of the noun that follows the interrogative particle kam (How many)
◦ Example: كم *تلاميذ الفصل؟ (How many students are there in the classroom?)
◦ Correct version: كم تلميذا في الفصل؟ (How many students are there in the classroom?)
Error type: The verb should remain singular even though the subject is dual or plural
◦ Example: *يلعبون الأولاد في الحديقة (The boys play in the garden.)
◦ Correct version: يلعب الأولاد في الحديقة (The boys play (sg) in the garden.)
Error type: Definiteness of inchoative or (compound) noun in genitive
◦ Example: *رجل مهذب (A man polite.)
◦ Correct version: الرجل مهذب (The man is polite.)
Error type: Declension of the simple and compound number
◦ Example: كتبت الرسالة *السادس *عشر لأختها في العراق (She wrote the sixteenth message to her sister in Iraq.)
◦ Correct version: كتبت الرسالة السادسة عشرة لأختها في العراق (She wrote the sixteenth message to her sister in Iraq.)
Error type: Missing a pronoun that relates the enunciative with the inchoative
◦ Example: الرجال *يثق فيه (The men trust him.)
◦ Correct version: الرجال يثقون فيه (The men trust him.)
Error type: Missing the subject of a verbal sentence
◦ Example: *ذهب إلى الدار (Went to the house.)
◦ Correct version: ذهب الغلام إلى الدار (The boy (or any other animate masculine entity) went to the house.)
Error type: Missing the object of a verbal sentence
◦ Example: *فتح الولد (The boy opened.)
◦ Correct version: فتح الولد الباب (The boy (or any other animate masculine entity) opened the door.)
Error type: Wrong order between the adjective and the noun it describes
◦ Example: اشتريت جميلة قطة (I bought a cat beautiful.)
◦ Correct version: اشتريت قطة جميلة (I bought a beautiful cat.)
◦ Shaalan K. Arabic GramCheck: A Grammar Checker for
Arabic, Software Practice and Experience, John Wiley & sons
Ltd., UK, 35(7):643-665, June 2005. (IF 1.148)


Transfer
Interlingua
Source sentence (English)
→ Sentence Analysis: morphological & syntactic analysis (English Dic., rules of English)
→ English Parse Tree
→ Transfer: English-to-Arabic transformation rules (Bi-ling Dic.)
→ Arabic Parse Tree
→ Sentence Synthesis: morphological generation & synthesis rules of Arabic (Arabic Dic.)
→ Target sentence (Arabic)


Involves analysis, transfer, and generation components.
If you have an Arabic parser and an Arabic syntactic
generator, all you need is to acquire the transfer rules
and build the transfer component.
(1) [w_i:$1, w_{i+1}:$2, …, w_k:$k] → [w_k:$k, w_{k-1}:$k-1, …, w_i:$1]   (1 ≤ i ≤ k)
Networks performance evaluation → تقييم أداء شبكة
English np: networks (noun, pl), performance (noun, sg), evaluation (noun, sg)
transfer
Arabic np: تقييم (noun, sg), أداء (noun, sg), شبكة (noun, pl)
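
The reordering performed by this transfer rule on an English noun compound can be sketched in a few lines. This is an illustration only: the three-entry `BILINGUAL` dictionary reuses the glosses from the example above, the function `transfer_noun_compound` is hypothetical, and morphological generation (number, definiteness, case) is deliberately left out.

```python
# Toy bilingual dictionary built from the slide example.
BILINGUAL = {
    "networks": "شبكة",
    "performance": "أداء",
    "evaluation": "تقييم",
}


def transfer_noun_compound(english_nouns):
    """Reverse the noun sequence and replace each noun by its Arabic gloss."""
    return [BILINGUAL[noun.lower()] for noun in reversed(english_nouns)]


if __name__ == "__main__":
    source = ["Networks", "performance", "evaluation"]
    print(" ".join(transfer_noun_compound(source)))
```

A real transfer component would operate on parse trees and carry features such as number across, which is why the slide shows `pl` and `sg` on each node.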

Synonyms of a word
◦ Acquisition → “اكتساب” or “استخلاص”.

Agreement
◦ intelligent tutoring systems → “نظم التعليم الذكية” or “نظم التعليم الذكي”

Problems with prepositions
◦ did you do fungal analysis? → “هل قمت بـتحليل الفطر؟”

…


Multilingual Machine Translation
Interlingua (semantic)-based approach
Interlingua: [CAUSE (X, [BE (Y, [PLEASED])])]
◦ English: I like Mary
◦ Arabic: أنا أحب مارى
◦ Spanish: Mary me gustar
◦ French: ?


Interlingua = Semantic Representation
Deep analysis
◦ No need for a transfer component
◦ Only analysis and generation components


Add an Arabic analyzer to translate to other languages
Add an Arabic generator to translate from other languages
العميل: أنا أرغب في حجز غرفة في الفندق
→ Preprocessor
→ Sentence Analyzer and Morphological Analyzer (Arabic Lexicon, Arabic Grammar Rules, Arabic Morphology Rules)
→ Parse Tree
→ Mapper (Map Lexicon, Ontology)
→ Interlingua (IF):
c:introduce-topic+reservation+disposition+room (room-spec=(room,
specifier=hotel,identifiability=yes),disposition=(desire,who=i))
Interlingua (IF):
c:introduce-topic+reservation+disposition+room (room-spec=(room,
specifier=hotel,identifiability=yes),disposition=(desire,who=i))
→ Mapper (Map Lexicon, Ontology, Map Rules)
→ Feature Structure
→ Sentence Generator and Morphological Generator (Arabic Lexicon, Arabic Grammar Rules, Arabic Morphology Rules)
→ العميل: أنا أرغب في حجز غرفة في الفندق

Interlingua:
◦ language-neutral representation
◦ captures the intended meaning of the source sentence

Requires a fully-disambiguating parser
◦ Abdel Monem, A., Shaalan, K., Rafea, A., Baraka, H., Generating
Arabic Text in Multilingual Speech-to-Speech Machine Translation
Framework, Machine Translation, Springer, Netherlands, 20(4):
205-258, December 2008.
◦ Shaalan, K., Monem, A. Rafea, A., Arabic Morphological Generation
from Interlingua: A Rule-based Approach, in IFIP International
Federation for Information Processing, Vol. 228, Intelligent
Information Processing IIP, eds. Z. Shi, Shimohara K., Feng D.,
(Boston:Springer), USA, PP. 441-451, 2006.
◦ Shaalan K., Rafea, A., Abdel Monem, A., Baraka, H., Machine
Translation of English Noun Phrases into Arabic, The International
Journal of Computer Processing of Oriental Languages
(IJCPOL), World Scientific Publishing Company, 17(2):121-134,
2004.



Be able to reuse MSA processing tools with colloquial Arabic by transferring colloquial Arabic words into their corresponding MSA words.
Facilitate communication with colloquial Arabic speakers.
Restore the Arabic dialect to the standard language in use nowadays.
امتي؟ → Mapping → متي؟ (when?)
عال (on-the) → Mapping → علي (on) + ال (the)
جيت امتي؟ (You-came when?) → Mapping → جئت متي؟ → Reordering → متي جئت؟ (When did-you-come?)
◦ Step (1)
  ▪ جيت → جئت
  ▪ امتي → متي
◦ Step (2)
  ▪ the new segment position for the word “امتى” is start of sentence (SoS)
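
The two steps above, lexical mapping followed by reordering, can be sketched as follows. The tables `WORD_MAP` and `SENTENCE_INITIAL` contain only the entries shown on the slides, and the function `to_msa` is a hypothetical name; a real system would need a far larger lexicon and position rules learned or written per word class.

```python
# Step 1 table: colloquial token -> one or more MSA tokens (slide examples only).
WORD_MAP = {
    "جيت": ["جئت"],
    "امتي": ["متي"],
    "عال": ["علي", "ال"],
}

# Step 2 table: MSA tokens whose new position is start of sentence (SoS).
SENTENCE_INITIAL = {"متي"}


def to_msa(tokens):
    """Map colloquial tokens to MSA, then move SoS words to the front."""
    mapped = [m for tok in tokens for m in WORD_MAP.get(tok, [tok])]
    front = [t for t in mapped if t in SENTENCE_INITIAL]
    rest = [t for t in mapped if t not in SENTENCE_INITIAL]
    return front + rest


if __name__ == "__main__":
    print(" ".join(to_msa(["جيت", "امتي"])))
```

Unknown tokens pass through unchanged, so MSA words that already appear in colloquial text are left as they are.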


Lack of standard linguistic resources
More investigations are needed for deriving/learning
transfer rules
◦ Abo Bakr, H., Shaalan, K., Ziedan, I., A Hybrid Approach for
Converting Written Egyptian Colloquial Dialect into
Diacritized Arabic, In the Proceedings of The 6th International
Conference on Informatics and Systems, INFOS2008, the
special track on Natural Language Processing, 27-29 March,
Cairo, Egypt, 2008.
◦ Shaalan, K., Abo Bakr, H., Ziedan, I., Transferring Egyptian
Colloquial into Modern Standard Arabic, International
Conference on Recent Advances in Natural Language
Processing (RANLP – 2007), PP. 525-529, September 27-29, Borovets, Bulgaria, 2007.

Obtaining a brief and concise answer for Arabic questions
extracted from internet corpus.

Question Analysis
Passage Retrieval
Answer Extraction



[email protected]
◦ http://mailman.uib.no/listinfo/corpora

[email protected]
◦ http://www.linguistlist.org/

[email protected]
◦ http://www.semitic.tk/

[email protected]
◦ http://www.arabicscript.org/CAASL3/index.html

Advances in human language technology
require an ever increasing amount of data
and annotation. The number of current
state-of-the-art Arabic linguistic
resources is still insufficient compared to
Arabic’s actual importance as a language.
Many existing Arabic NLP resources are
only available at significant expense.

The required Arabic NLP technologies are
sparse, if not nonexistent, in many areas,
and researchers have to develop their own
tools.





Conferences: ACL/NAACL, SIGIR, AAAI/IJCAI, ANLP,
Coling, HLT, EACL/NAACL, AMTA/MT Summit,
ICSLP/Eurospeech
Journals: Computational Linguistics, Natural Language
Engineering, Information Retrieval, Information Processing and
Management, ACM Transactions on Information Systems,
ACM TALIP, ACM TSLP
University centers: Columbia, CMU, JHU, Brown, UMass,
MIT, UPenn, USC/ISI, NMSU, Michigan, Maryland,
Edinburgh, Cambridge, Saarland, Sheffield, and many others
Industrial research sites: IBM, SRI, BBN, MITRE, MSR,
(AT&T, Bell Labs, PARC)
The Anthology: http://www.aclweb.org/anthology
◦ Full Professor, Faculty of Engineering & IT, BUiD
(also Professor of Computer Science, FCI, Cairo Univ.)
◦ Email: [email protected]
◦ Web: www.buid.ac.ae/shaalan, http://scholar.cu.edu.eg/?q=shaalan/
◦ Personal Website: http://sites.google.com/site/khaledshaalan
Thank you