Corpus Linguistics for Understanding the Quran


Eric Atwell, Kais Dukes, Nora Abbas,
Abdul-Baquee Muhammad
I-AIBS Institute for Artificial Intelligence
and Biological Systems
School of Computing
University of Leeds
The Challenge: An interdisciplinary
approach to understanding the Quran
(1) Quranic Studies
(2) Traditional Arabic Linguistics
(3) Computational Linguistics
(1) What is the Quran?
The last in a series of 5 religious texts
Holy Book | Prophet | Text Dated
Suhuf Ibrahim (Scrolls) | Abraham | ?
The Tawrat (Torah) | Moses | 1500 BCE?
The Zabur (Psalms) | David | 1000 BCE?
The Injil (Gospel) | Jesus | 1 CE
The Quran | Muhammad (PBUH) | 610-632 CE
(1) What is the Quran?
The central religious text of Islam
- Classical Arabic
- Islamic Law (legal logic)
- Divine guidance & direction
- Scientific & philosophical knowledge
- Has inspired many scientific
achievements, e.g. Algebra and
linguistics
(2) Traditional Arabic Linguistics
Originated with Arab scholars studying the language of
the Quran (detailed analysis for at least 1000
years):
- Orthography (diacritics and vowelization)
- Etymology (Semitic roots)
- Morphology (derivation and inflection)
- Syntax (origins of dependency grammar)
- Discourse Analysis & Rhetoric
- Semantics & Pragmatics
(3) Computational Linguistics
Where are we now?
Current use of computing to analyze the
Quran is mostly…
- Keyword search (useful)
- Frequency analysis (numerology?)
(3) Computational Linguistics
- How far can we go?
- Is an artificial intelligence system realistic?
Example question-answering dialog system:
Question: How long should I breastfeed my child for?
Answer: Mothers should suckle their offspring
for two years, if the father wishes to complete
the term (The Holy Quran, Verse 2:233).
An AI approach to understanding
the Quran
Central Hypothesis
Augmenting the text of the Quran with rich
annotation will lead to a more accurate AI
system.
- Prepare the data by annotating the Quran.
- Use the data to build an AI system for
concept search and question-answering.
Annotating the Quran
Challenges
Orthography - Complex script verified in
Unicode?
Morphology - Arabic is highly inflected and this
is challenging to model by computer
Syntax - Phrase structure or dependency
grammar?
Semantics – lexical semantics, ontology, logic,
lexical frames?
Annotating the Quran
Solutions
- Recent computational advances have made it
possible to annotate the Quran to very high
accuracy
- Community effort using volunteers
- Leverage existing resources from Traditional
Arabic Grammar
- Automatic annotation followed by manual
verification
Recent Advances: Orthography
Does an accurate digital copy of the
Quran exist?
Encoding Issues
- Missing diacritics
- Simplified script (not Uthmani)
- Windows code page 1256, not Unicode
Google Search for verse (68:38) on Jan 21, 2008 shows many typos
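The encoding issues listed here can be checked mechanically. A minimal Python sketch, assuming a legacy copy arrives as Windows code page 1256 bytes; the helper names are illustrative and not taken from any project named in these slides:

```python
import unicodedata

def decode_legacy(raw: bytes) -> str:
    """Decode Windows code page 1256 bytes into a Unicode string."""
    return raw.decode("cp1256")

def strip_diacritics(text: str) -> str:
    """Drop combining marks (Arabic vowel diacritics have category Mn)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

def has_diacritics(text: str) -> bool:
    """True if the text carries at least one combining mark."""
    return strip_diacritics(text) != text

def fits_cp1256(text: str) -> bool:
    """True if every character can be represented in code page 1256."""
    try:
        text.encode("cp1256")
    except UnicodeEncodeError:
        return False
    return True

# beh + kasra, seen + sukun, meem + kasra (a vowelized word)
vowelized = "\u0628\u0650\u0633\u0652\u0645\u0650"
plain = strip_diacritics(vowelized)
round_trip = decode_legacy(vowelized.encode("cp1256"))

# U+06D6 is one of the Quranic annotation signs that exist only in Unicode
quranic_mark_ok = fits_cp1256("\u06d6")
```

A copy for which `has_diacritics` is false, or which only survives encoding because such marks were removed, shows the first and third issues in the list.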
Recent Advances: Orthography
Tanzil Project (http://tanzil.info)
- Stable version released May 2008
- Uses Unicode XML encoding, including
the special characters designed for the
complex Arabic script of the Quran
- Manually verified to 100% accuracy by a
group of experts who have memorized the
entire text of the Quran
Recent Advances: Orthography
Java Quran API (http://jqurantree.org), March 2009
- Java classes for querying the Tanzil XML of the Quran
- First step towards software package for analyzing the Quran
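The idea of querying an XML copy of the text can be sketched in a few lines. This is not the Java Quran API; it is a Python illustration that assumes a simple layout of `sura` elements containing `aya` elements with `index` and `text` attributes, and it uses placeholder strings rather than real verse text:

```python
import xml.etree.ElementTree as ET

# Assumed layout (an assumption, not confirmed by the slides):
# <quran><sura index=".." name=".."><aya index=".." text=".."/></sura></quran>
SAMPLE = """<quran>
  <sura index="1" name="placeholder chapter">
    <aya index="1" text="placeholder verse one"/>
    <aya index="2" text="placeholder verse two"/>
  </sura>
</quran>"""

def load_verses(xml_text):
    """Return a dict mapping (chapter, verse) to the verse text."""
    root = ET.fromstring(xml_text)
    verses = {}
    for sura in root.iter("sura"):
        for aya in sura.iter("aya"):
            key = (int(sura.get("index")), int(aya.get("index")))
            verses[key] = aya.get("text")
    return verses

verses = load_verses(SAMPLE)
second_verse = verses[(1, 2)]
```

Keying the dictionary on (chapter, verse) mirrors the usual Chapter:Verse way of citing the text.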
Recent Advances: Morphology
- Buckwalter Arabic Morphological Analyzer
(2002)
- Morphological Analysis of the Quran at the
University of Haifa, Israel (2004)
- Lexeme & feature based morphological
representation of Arabic (Nizar Habash, 2006)
The Haifa Corpus (2004)
Multiple analyses for each word (up to 5)
rbb+fa&l+Noun+Triptotic+Masc+Sg+Pron+Dependent+1P+Sg
rbb+fa&l+Noun+Triptotic+Masc+Sg+Gen
Not a manually verified corpus
Authors report an F-measure of 86%
Non-standard annotation scheme not familiar to
traditional Arabic linguists (e.g. extracting a list of all
verbs in the corpus is non-trivial)
Arabic text is only encoded phonetically instead of
using the original Arabic. Searching for the possible
morphological analyses for a specific word is not easy
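To work with analyses in this format, each string first has to be split into fields. A minimal Python sketch, assuming (this is a reading of the two examples shown, not a documented specification) that the first field is the root, the second the pattern, and the rest are feature tags:

```python
def parse_analysis(analysis: str) -> dict:
    """Split a '+'-delimited analysis into root, pattern and feature tags."""
    root, pattern, *features = analysis.split("+")
    return {"root": root, "pattern": pattern, "features": features}

def is_noun(analysis: str) -> bool:
    """True if the analysis carries the 'Noun' feature tag."""
    return "Noun" in parse_analysis(analysis)["features"]

# The two candidate analyses quoted on the slide
candidates = [
    "rbb+fa&l+Noun+Triptotic+Masc+Sg+Pron+Dependent+1P+Sg",
    "rbb+fa&l+Noun+Triptotic+Masc+Sg+Gen",
]
parsed = [parse_analysis(a) for a in candidates]
```

Because several candidates remain for each word, even a simple filter such as `is_noun` returns ambiguous results until one analysis is chosen.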
The Quranic Arabic Corpus
http://corpus.quran.com
- Manually verified (99% accuracy)
- Popular website with very positive feedback
- million(s) of visitors
1. Initial tagging using Buckwalter Analyzer
2. Paid annotator working for 3 months
3. Community of volunteers verifying against existing books of
Traditional Arabic Grammar which analyse the Quran
Shows Arabic and English morphological analysis side-by-side,
with phonetic transcription, search and translation.
The Quranic Arabic Corpus
http://corpus.quran.com/
• Kais Dukes: Arabic Language Computing Applied to the Quran, PhD (part-time)
An open-source online focus for linguistic research on Classical Arabic:
morphology - each word shows colour-coded morphological analysis
syntax - each verse shows a dependency parse following Arabic tradition
semantics - entities and concepts are linked to an ontology
translation - word-for-word English translations to aid understanding
machine learning - annotations provide training data for a parser
impact on society - dozens of researchers collaborated on the analysis
and over a million visitors have used the website this year.
The Quranic Arabic Corpus
Part-of-speech Tagging
Part-of-speech Tag | Name | Arabic Name
N | Noun | اسم
PN | Proper noun | اسماء علم
PRON | Personal pronoun | ضمير
DEM | Demonstrative pronoun | اسم اشارة
REL | Relative pronoun | اسم موصول
ADJ | Adjective | صفة
V | Verb | فعل
P | Preposition | حرف جر
PART | Particle | حرف
INTG | Interrogative particle | حرف استفهام
VOC | Vocative particle | حرف نداء
NEG | Negative particle | حرف نفي
FUT | Future particle | حرف استقبال
CONJ | Conjunction | حرف عطف
NUM | Number | رقم
T | Time adverb | ظرف زمان
LOC | Location adverb | ظرف مكان
EMPH | Emphatic lām prefix | لام التوكيد
PRP | Purpose lām prefix | لام التعليل
IMPV | Imperative lām prefix | لام الامر
INL | Quranic initials | حروف مقطعة
- Part-of-speech tags adapted from Traditional Arabic Grammar, and mapped to English equivalents (not the other way around)
- These tags apply to words in the Quran, as well as to individual morphological segments in the text
The Quranic Arabic Corpus
Verified Uthmani Script
- Unicode Uthmani Script
- Sourced from the verified Tanzil project
The Quranic Arabic Corpus
Phonetics (faja'alnāhumu)
- Phonetic transcription generated algorithmically
- Guided by Arabic vowelized diacritics
The Quranic Arabic Corpus
Interlinear translation
- Word-for-word translation from accepted sources
- Interlinear translation scheme
The Quranic Arabic Corpus
Location Reference (21:70:4)
- Common standard for verses (Chapter:Verse)
- Extended in the QAC corpus to include word numbers
and segment numbers, e.g. (21:70:4:2)
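The reference scheme is easy to handle in code. A minimal Python sketch (the class and function names are illustrative, not part of the corpus software) that accepts Chapter:Verse references and the extended forms with word and segment numbers:

```python
from typing import NamedTuple, Optional

class Location(NamedTuple):
    chapter: int
    verse: int
    word: Optional[int] = None
    segment: Optional[int] = None

def parse_location(ref: str) -> Location:
    """Parse '(21:70:4:2)', '(21:70:4)' or '21:70' into a Location."""
    parts = [int(p) for p in ref.strip("()").split(":")]
    if not 2 <= len(parts) <= 4:
        raise ValueError(f"expected 2 to 4 fields, got {len(parts)}: {ref!r}")
    return Location(*parts)

def format_location(loc: Location) -> str:
    """Render a Location back into the bracketed colon-separated form."""
    fields = [f for f in loc if f is not None]
    return "(" + ":".join(str(f) for f in fields) + ")"

segment_ref = parse_location("(21:70:4:2)")
verse_ref = parse_location("2:233")
```

Missing trailing fields stay `None`, so a verse-level reference and a segment-level reference share one type.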
The Quranic Arabic Corpus
Morphological Segmentation
- Division of a single word into multiple segments
- Part-of-speech tag assigned to each segment
- Traditional Arabic Grammar rules used for division
The Quranic Arabic Corpus
Morphological segment features
The Quranic Arabic Corpus
Arabic Grammar Summary
The Quranic Arabic Treebank
Syntactic Annotation
- Dependency Grammar based on إعراب (i'rāb)
- Syntactico-semantic roles for each word
The Quranic Arabic Treebank
What’s new about this research?
- First Treebank of Classical Arabic
- Free Treebank of the Quran
- Well-defined formal representation
of Traditional Arabic Grammar using
hybrid constituency/dependency
graphs
Automatic Annotation
Classical Arabic Dependency Parser
- Joakim Nivre (2009): dependency parsing using a shift/reduce queue/stack architecture with machine learning
- Following a similar architecture, but with hand-written rules, the custom parser has an F-measure of 77.2%
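The queue/stack idea can be illustrated with a toy. The Python sketch below is neither Nivre's parser nor the custom parser reported here: it is a small shift/reduce loop whose decisions come from two hand-written rule sets of (head tag, dependent tag) pairs, and the rules themselves are invented for illustration using tags from the corpus tag set.

```python
def parse(tags, left_rules, right_rules):
    """Return (head_index, dependent_index) arcs for a tagged sentence."""
    stack, queue, arcs = [], list(range(len(tags))), []
    while queue or len(stack) > 1:
        if len(stack) >= 2:
            below, top = stack[-2], stack[-1]
            # LEFT-ARC: the top of the stack governs the item beneath it.
            if (tags[top], tags[below]) in left_rules:
                arcs.append((top, below))
                del stack[-2]
                continue
            # RIGHT-ARC: the item beneath governs the top, but only once the
            # top cannot take the next queued word as its own dependent.
            waits = bool(queue) and (tags[top], tags[queue[0]]) in right_rules
            if (tags[below], tags[top]) in right_rules and not waits:
                arcs.append((below, top))
                stack.pop()
                continue
        if not queue:
            break  # no rule applies: return a partial parse
        stack.append(queue.pop(0))  # SHIFT the next word onto the stack
    return arcs

# Hypothetical rules: (head tag, dependent tag)
LEFT_RULES = {("V", "NEG")}                         # dependent precedes head
RIGHT_RULES = {("V", "N"), ("V", "P"), ("P", "N")}  # dependent follows head

tags = ["NEG", "V", "N", "P", "N"]
arcs = parse(tags, LEFT_RULES, RIGHT_RULES)
```

For this tag sequence the verb at index 1 ends up as the root, with the preposition at index 3 governing the final noun before attaching to the verb.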
Quran ‘Search for a Concept’ Tool
Nora Abbas developed the first Quran
"search for a concept" tool and
website, Qurany;
Noorhan Abbas. Qurany: A Tool to
Search for Concepts in the Quran
(PDF). MSc by Research Thesis,
School of Computing, Leeds
University, 2009
Quran ‘Search for a Concept’ Tools
What do the available Quran tools on the net provide?
What is the main problem with these tools?
What about the Recall value of their results?
What is the main reason for these poor results?
The SearchTruth tool 48%
• Search Truth http://www.searchtruth.com/
The Holy Quran Viewer tool 34%
• Holy Quran Viewer http://www.2muslims.com/directory/Detailed/223253.shtml
The University of Southern California tool 49%
• MSA-USC Qur’an Database http://www.usc.edu/dept/MSA/reference/searchquran.html
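Recall, raised on this slide, and the F-measure quoted elsewhere in the talk can be computed from a set of retrieved verses and a set of relevant verses. A minimal Python sketch with made-up verse labels, not the evaluation data behind the percentages above:

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-measure) for one search request."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical example: 3 verses returned, 4 verses actually relevant
retrieved = {"v1", "v2", "v3"}
relevant = {"v1", "v2", "v4", "v5"}
precision, recall, f_measure = precision_recall_f(retrieved, relevant)
```

A keyword tool can score well on precision while missing most relevant verses, which is why recall is the figure to ask about.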
Quran ‘Search for a Concept’ Tool
• What is a CONCEPT?
• NOT just a “keyword”
• “index term” in a textbook?
Quran ‘Search for a Concept’ Tool
• General/Abstract Concepts:
– Women’s financial status
– Main pillars of Islam
– Characteristics of Paradise
• Concrete Concepts:
– Names of places
• (Makkah, Mecca, Meccah)
– Names of prophets, angels,…etc.
• (Musa, Moses)
– Names of Holy Books
• (The Book (Bible), Bible, New Testament)
Quran ‘Search for a Concept’ Tool
• What does my tool look like?
[Figure: the tool's interface, with callouts numbered 1 to 6]
Quran ‘Search for a Concept’ Tool
Handling the Concrete Concepts
– Eight Parallel English Translations
– Search for one English word or a
group of words in one search
request
– Search for one Arabic word or a
group of words in one search
request
– Search for a mixed list of Arabic
and English words in one search
request
– Offers a list of synonyms for the
English words
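The handling of concrete concepts can be sketched as a search that expands each query word with its spelling variants and looks in every parallel translation. The Python below is an illustration only, not the Qurany implementation; the synonym table uses the variants named on the earlier slide, and the sample sentences and verse labels are invented placeholders, not real translations:

```python
# Spelling variants from the slide on concrete concepts
SYNONYMS = {
    "makkah": {"makkah", "mecca", "meccah"},
    "musa": {"musa", "moses"},
}

def expand(terms):
    """Expand each query term with its known spelling variants."""
    expanded = set()
    for term in terms:
        key = term.lower()
        group = next((g for g in SYNONYMS.values() if key in g), {key})
        expanded |= group
    return expanded

def search(translations, terms):
    """Return sorted verse labels matched in at least one translation."""
    wanted = expand(terms)
    hits = set()
    for verses in translations.values():
        for ref, text in verses.items():
            words = {w.strip(".,;:!?()\"'").lower() for w in text.split()}
            if words & wanted:
                hits.add(ref)
    return sorted(hits)

# Invented placeholder sentences standing in for parallel translations
translations = {
    "translation_a": {"v1": "They travelled to Makkah.",
                      "v2": "Moses spoke to his people."},
    "translation_b": {"v1": "They journeyed to Mecca.",
                      "v2": "Musa addressed his people.",
                      "v3": "A sentence about patience."},
}
place_hits = search(translations, ["Meccah"])
mixed_hits = search(translations, ["Musa", "Makkah"])
```

Since matching is done on whole words after lower-casing, Arabic and English query words can be mixed in one request.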
Quran ‘Search for a Concept’ Tool
• General/Abstract Concepts
– The list of concepts is imported from the ‘Mushaf Al Tajweed’ index of topics
published by Dar Al-Maarifa in Syria.
– The tool has 15 main concepts.
– The tool covers all the concepts in both languages Arabic
and English.
– The total number of concepts covered is 1170.
– For example, to represent:
• Women’s financial status
• Main pillars of Islam
• Characteristics of Paradise
Knowledge representation and text
mining of the Qur'an
• Abdul-Baquee Muhammad
• http://www.comp.leeds.ac.uk/scsams/
• http://www.textminingthequran.com/wiki
Qur'anic Applications
Text Mining The Quran
Verse similarity: Allows you to see all verses that share a certain percent of characters with your input verse.
Quranic Chapter Relatives: Allows you to see the strongest relatives of a given Quran chapter.
Word Cloud: See word clouds of a sura or group of suras of the Qur'an.
Qur'an Concordance: Concordance over lemma.
Part-of-Speech Display of Sura: View a sura of the Qur'an with color-coded part-of-speech tags.
Quranic word co-occurrence: Allows you to enter a Quranic term to find its most frequent neighbors.
N-gram Search: Search up to 5-gram phrases of the Quran with a frequency of 5 or more.
Pronoun References: Given a verse, see all pronoun references within this verse.
List of Concepts: See a list of concepts arising from pronoun referents in the Quran.
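As an illustration of one of these applications, an n-gram search with a frequency threshold can be built from simple counts. A minimal Python sketch, assuming whitespace-separated words and using placeholder verses; it is not the code behind the website:

```python
from collections import Counter

def ngram_counts(verses, max_n=5):
    """Count every word n-gram of length 1..max_n within each verse."""
    counts = Counter()
    for verse in verses:
        words = verse.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

def frequent_ngrams(counts, n, min_freq=5):
    """Return the n-grams of length n that occur at least min_freq times."""
    return {" ".join(gram): c for gram, c in counts.items()
            if len(gram) == n and c >= min_freq}

# Placeholder "verses": five copies of one phrase plus a variant
verses = ["a b c"] * 5 + ["a b d"]
counts = ngram_counts(verses)
bigrams = frequent_ngrams(counts, 2)
trigrams = frequent_ngrams(counts, 3)
```

The defaults of `max_n=5` and `min_freq=5` follow the limits stated for the N-gram Search tool.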
AI for understanding the Quran
Kais Dukes developed the first online annotated linguistic
resource which shows the Arabic "irab" morphology and
grammar for each word and verse in the Holy Quran: the
Quranic Arabic Corpus, including word-by-word morphology
and English gloss, and an ontology of Quranic concepts;
Nora Abbas developed the first Quran "search for a concept"
tool and website, Qurany;
Abdul-Baquee Sharaf developed tools and resources for text
mining the Quran, including verse similarity, lemma
concordance and collocation, and text mining the Hadeeth.