Issues in Arabic NLP

Arabic NLP: Challenges &
Opportunities
Dr. Samir Tartir
Scientific Day
Faculty of Information
Philadelphia University
May 15th 2013
‫ثمن‬
‫علم‬
‫قِ‬
General Information
• History
– (Classical) Arabic has remained unchanged, intelligible
and functional for more than fifteen centuries.
• Strategically important
– 330 million speakers living in an important region
• huge oil reserves, sacred sites.
– 1.4 billion Muslims use it in their prayers.
• Cultural and literary heritage
– Closely associated with Islam
Distribution
Versions
• Classical
• Modern
• Dialects
Arabic Language Characteristics
• Highly structured
• Highly derivational language
– Morphology
• Free word order
• Modern Arabic lacks diacritics (short vowels)
Example*
*Microsoft Arabic NLP Toolkit (ATK) For Academia in the Arab World Presentation,
11/2012
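A minimal sketch of why missing diacritics matter, assuming Python 3 and only the standard library. The three vocalized words and their glosses are illustrative choices, not taken from the slides; once the short-vowel marks are removed, all three collapse to the same written form.

```python
import unicodedata

# Code points spelled out so the combining marks are visible in the source.
FATHA, KASRA, SHADDA = "\u064E", "\u0650", "\u0651"
QAF, DAL, MEEM = "\u0642", "\u062F", "\u0645"

# Illustrative vocalized words (assumed examples) with English glosses.
VOCALIZED = {
    QAF + FATHA + DAL + FATHA + MEEM: "foot",
    QAF + FATHA + DAL + SHADDA + FATHA + MEEM + FATHA: "he presented",
    QAF + FATHA + DAL + KASRA + MEEM + FATHA: "he arrived",
}


def strip_diacritics(text: str) -> str:
    """Drop combining marks (Unicode category Mn), which covers the
    Arabic short vowels, shadda and sukun."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    for word, gloss in VOCALIZED.items():
        # Every line prints the same undiacritized form next to its gloss.
        print(strip_diacritics(word), "<-", gloss)
```

A reader of Modern Arabic text sees only the stripped form and has to recover the intended word from context.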
Arabic Language Characteristics
• Synonymy and confusion of non-standardized
terms
– Thermometer: ميزان حرارة، مقياس حرارة، محرار، محر، ترمومتر
• Technical translation
– Hydrometer: جهاز قياس كثافة السوائل (literally "device for measuring the density of liquids")
• Uncle, parent…
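One possible way to cope with non-standardized terms is to fold known variants onto a single concept label before indexing or searching. The sketch below assumes Python 3; the table `CONCEPTS` and the helper `concept_for` are hypothetical names, and the only data used is the thermometer list above.

```python
from typing import Optional

# Hypothetical concept table: one English label per group of Arabic variants.
CONCEPTS = {
    "thermometer": ["ميزان حرارة", "مقياس حرارة", "محرار", "محر", "ترمومتر"],
}

# Invert the table so each variant points to its concept label.
TERM_TO_CONCEPT = {
    term: concept for concept, terms in CONCEPTS.items() for term in terms
}


def concept_for(term: str) -> Optional[str]:
    """Return the concept label for a term, or None if the term is unknown.
    Runs of whitespace are collapsed so multi-word terms match reliably."""
    cleaned = " ".join(term.split())
    return TERM_TO_CONCEPT.get(cleaned)


if __name__ == "__main__":
    for variant in CONCEPTS["thermometer"]:
        print(variant, "->", concept_for(variant))
```

A hand-built table like this only covers terms someone has already listed, which is part of why shared lexical resources matter.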
Letters
• One letter, one sound
• Letters change shape
• Hamza
• No capital letters
• Can use normalization
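A rough sketch of what such normalization might look like, assuming Python 3 and the standard library only. The specific rules (folding hamza-carrying alef forms to bare alef, alef maqsura to ya, dropping tatweel and diacritics) are common choices in Arabic text processing, not rules stated in the slides, and some of them are lossy.

```python
import re
import unicodedata

ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # alef with madda / hamza above / hamza below
DIACRITICS = re.compile("[\u064B-\u0652]")          # tanween, short vowels, shadda, sukun
TATWEEL = "\u0640"                                  # elongation character
ALEF, ALEF_MAQSURA, YA = "\u0627", "\u0649", "\u064A"


def normalize(text: str) -> str:
    """Fold common spelling variants onto one form (assumed rule set)."""
    # NFKC maps positional presentation forms (initial, medial, final,
    # isolated shapes) back to the base letters.
    text = unicodedata.normalize("NFKC", text)
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = ALEF_VARIANTS.sub(ALEF, text)
    # Lossy: after this step a word ending in alef maqsura is spelled
    # the same as one ending in ya.
    text = text.replace(ALEF_MAQSURA, YA)
    return text


if __name__ == "__main__":
    samples = ["\u0623\u062D\u0645\u062F", "\uFE91", "\u0639\u0640\u0640\u0644\u0645"]
    for sample in samples:
        print(sample, "->", normalize(sample))
```

Normalization makes matching easier but can merge distinct words, so which rules to apply depends on the task.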
Ambiguity
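• Homographs
– قدم ("foot", "he presented", or "he arrived", depending on the unwritten vowels)
• Internal word structure ambiguity
– بعقوبة (the city of Baqubah, or "with a punishment": ب + عقوبة)
• Syntactic ambiguity
– قابلت مدير البنك الجديد ("I met the new bank manager" or "I met the manager of the new bank")
• Semantic ambiguity
– يحب علي احمد اكثر من ابراهيم ("Ali likes Ahmad more than Ibrahim does" or "Ali likes Ahmad more than he likes Ibrahim")
• Anaphoric ambiguity
– قابل الصحفي الوزير الذي انتقده ("The journalist met the minister whom he criticized" or "The journalist met the minister who criticized him")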
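The internal word structure case can be sketched in a few lines, assuming Python 3. The prefix list and two-entry lexicon below are toy, hypothetical data; a real morphological analyzer uses far larger resources. The point is only that the same string has more than one valid split.

```python
# Toy list of attachable prefixes (assumed, far from complete).
PREFIXES = ["", "ب", "و", "ل", "ال", "بال"]

# Toy lexicon with English glosses (assumed entries).
LEXICON = {
    "بعقوبة": "Baqubah (city)",
    "عقوبة": "punishment",
}


def segmentations(word):
    """Return every (prefix, stem) split whose stem is in the lexicon."""
    results = []
    for prefix in PREFIXES:
        if not word.startswith(prefix):
            continue
        stem = word[len(prefix):]
        if stem in LEXICON:
            results.append((prefix, stem))
    return results


if __name__ == "__main__":
    word = "بعقوبة"
    for prefix, stem in segmentations(word):
        # One analysis has no prefix (the city name), the other splits
        # off the preposition and leaves the noun "punishment".
        print(repr(prefix), "+", stem, "=", LEXICON[stem])
```

Both analyses are legitimate, so choosing between them needs context from the rest of the sentence.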
NLP
• Automatic summarization
• Machine translation
• Named entity recognition (NER)
• Natural language generation
• Natural language understanding
• Optical character recognition (OCR)
• Question answering
• Sentiment analysis
• Speech recognition
• Word sense disambiguation
• Information retrieval (IR)
• Speech processing
• Text-to-speech
• Natural language search
• Automated essay scoring
• etc.
Question Answering**
**Hammo et al., "QARAB: A Question Answering System to Support the Arabic Language," Workshop on Computational Approaches to Semitic Languages, ACL 2002.
Arabic NLP Issues
• Lack of tools
• Lack of linguistic references
• Lack of training data
Available Tools
• Arabic Treebank
• Arabic WordNet
– MySQL database
– SUMO Ontology
– Java
• Microsoft Arabic Toolkit (ATK)
Summary
• Arabic is difficult to deal with
• Progress has been made
• More work is done on different parts
• Any progress is valuable
– Business
– Personal
– Governmental
Thank you