A Formalism of Arabic Phonetic Grammar, and Application on


Theory and Implementation of a Large-Scale
Arabic Phonetic Transcriptor, and Applications
PhD Thesis Presentation
By
Muhammad Atiyya
Under The Supervision Of
Prof. Mohsen A. A. Rashwan
Dept. of Electronics & Electrical Communications
Faculty of Engineering, Cairo University
Aug. 2005
Presentation Agenda
Live Show: 20 min.
Thesis Review: 60 min.
Publications & Industrial Apps.: 10 min.
Discussion: 15 min. (+)
Live Show
…
20 min.
Thesis Review
...
60 min.
The Overall Goal
Input-Output diagram of the targeted Automatic Arabic Phonetic Transcriptor
Important fact
Although Arabic is an intensively diacritized language,
Modern Standard Arabic (MSA) is typically written without
diacritics!
Challenges
Modern standard Arabic (MSA) is typically written without
diacritics.
MSA script is typically full of common spelling
mistakes.
The highly derivative and inflective nature of Arabic, which
is a morpheme-based language.
Part of the phonetic transcription of 65% of Arabic text
depends on the syntactic case of the words.
Lexical and syntactic grammars alone produce multiple
solutions at each word of the text (high ambiguity).
7.5% of open-domain Arabic text consists of transliterated words,
which lack a constraining Arabic model. Moreover, some of
these words are analyzable as normal Arabic words!
The NLP problem should theoretically be tackled
combinatorially at all the NLP layers, which is far beyond the
reach of the current state of the art.
The NLP Layers Ladder
Solutions And Points of Innovation
One of only a few exemplar cases of marrying rule-based and statistical methods in the field of HLT.
A novel methodology for stochastically inferring the
syntactically dependent diacritics using POS tags as input
features, without the need for an Arabic syntax analyzer.
A large-scale Arabic POS tagger has been developed, with a
compact Arabic POS tag set that is naturally derived from an
Arabic lexicon and is also proposed as a standard serving
higher-level Arabic NLP tasks.
A novel APG-constrained stochastic inference methodology
for the diacritization of transliterated Arabic strings.
Formalizing the Arabic Phonetic Grammar (APG) assertively
and proposing it as a bottom layer of the NLP ladder.
The first academic documentation of a large-scale Arabic
phonetic transcriptor.
A schematic diagram of
the proposed automatic
Arabic diacritization system
Statistical Disambiguation
A- The Search Lattice.
Statistical Disambiguation
B- The target of the search process is to get
the solution path with maximum likelihood,
that is:

$$
Q = \arg\max_{S} P\left(q_{1,j_1}^{L,j_L}\right)
= \arg\max_{S} \prod_{i=1}^{L} P\left(q_{i,j_i} \,\middle|\, q_{(i-h),j_{(i-h)}}^{(i-1),j_{(i-1)}}\right)
= \arg\max_{S} \sum_{i=1}^{L} \log P\left(q_{i,j_i} \,\middle|\, q_{(i-h),j_{(i-h)}}^{(i-1),j_{(i-1)}}\right)
$$

Statistical Disambiguation
C- To get the most likely path in the minimum
number of expansions, the A* search
algorithm expands next the path on top of
the stack, which is sorted in descending
order of the estimated likelihood of the
whole path, expressed by:

$$
f^{*}(k, q_{k,j_k}, L) = g(k, q_{k,j_k}) + h^{*}(k, q_{k,j_k}, L)
$$


Statistical Disambiguation
- The g function can be computed as:

$$
g(k, q_{k,j_k}) = \sum_{i=1}^{k} \log P\left(q_{i,j_i} \,\middle|\, q_{(i-N+1),j_{(i-N+1)}}^{(i-1),j_{(i-1)}}\right)
$$

The conditional probabilities of long m-grams are calculated
using the Bayes, Good-Turing discount, and back-off methods.
- The h* function can only be heuristically estimated. A
safe estimation (that keeps the admissibility) is:

$$
h^{*}(k, q_{k,j_k}, L) =
\begin{cases}
\sum_{i=k+1}^{L} \log(P_{\max,N}) = (L-k)\cdot\log(P_{\max,N}); & L \ge N,\ k \ge N-1 \\
\sum_{i=N}^{L} \log(P_{\max,N}) + \sum_{i=k+1}^{N-1} \log(P_{\max,i}) = (L-N+1)\cdot\log(P_{\max,N}) + \sum_{i=k+1}^{N-1} \log(P_{\max,i}); & L \ge N,\ k < N-1 \\
\sum_{i=k+1}^{L} \log(P_{\max,i}); & L < N
\end{cases}
$$

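The interplay of g and h* can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes a bigram model (N = 2) with a hypothetical start symbol `<s>`, so every step is scored by one conditional probability and the admissible estimate reduces to (L - k)·log(P_max). The function name `a_star_best_path`, the floor probability for unseen pairs, and the toy probabilities are all assumptions made for the example.

```python
import heapq
import math


def a_star_best_path(lattice, bigram_logp, floor_logp=math.log(1e-6)):
    """Return (path, log-likelihood) of the most likely path through a lattice.

    lattice:     list of candidate lists, one list per word position.
    bigram_logp: dict mapping (previous, current) -> log P(current | previous);
                 "<s>" is an assumed start symbol.
    """
    L = len(lattice)
    # Upper bound on the log-probability of any single step.
    log_p_max = max(max(bigram_logp.values(), default=floor_logp), floor_logp)

    def h(k):
        # Optimistic score of the remaining L - k steps; it never falls below
        # the true best continuation, which keeps the search admissible.
        return (L - k) * log_p_max

    # heapq is a min-heap, so f is negated to pop the largest f first.
    heap = [(-h(0), 0, 0.0, ("<s>",))]
    while heap:
        _neg_f, k, g, path = heapq.heappop(heap)
        if k == L:
            return list(path[1:]), g  # first complete path popped is optimal
        for candidate in lattice[k]:
            step = bigram_logp.get((path[-1], candidate), floor_logp)
            g_new = g + step
            heapq.heappush(
                heap, (-(g_new + h(k + 1)), k + 1, g_new, path + (candidate,))
            )
    return [], float("-inf")


# Toy lattice with made-up probabilities: the locally best first choice ("a1")
# does not lie on the globally best path.
toy_lattice = [["a1", "a2"], ["b1", "b2"]]
toy_logp = {
    ("<s>", "a1"): math.log(0.6), ("<s>", "a2"): math.log(0.4),
    ("a1", "b1"): math.log(0.1), ("a1", "b2"): math.log(0.2),
    ("a2", "b1"): math.log(0.9), ("a2", "b2"): math.log(0.05),
}
best_path, best_score = a_star_best_path(toy_lattice, toy_logp)  # -> ['a2', 'b1']
```

Because the heuristic is an upper bound on the remaining score, the first complete path popped from the heap is the maximum-likelihood one, and partial paths that cannot beat it are never expanded.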

The Arabic Lexicon
Arabic is primarily an inflective language which can be
decomposed into a compact set of morphemes.
The Arabic lexicon is the repository where the linguistic
description of all the Arabic morphemes, along with their
mutual interactions, is registered as extensively as possible
in a compact structured format.
Kinds of Arabic morphemes
Morphemes
P: 260 prefixes.
Rd: 4,600 derivative roots.
Frd: 1,000 regular derivative patterns.
Fid: 300 irregularly derived words.
Rf: 260 roots of fixed words.
Ff: 300 fixed words.
Ra: 240 roots of Arabized words.
Fa: 290 Arabized words.
S: 550 Arabic suffixes.
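[Figure: tree of Arabic morpheme kinds. A word consists of a prefix (P), a body, and a suffix (S). The body is either derivative (Rd, Frd, Fid) or non-derivative; non-derivative bodies are either fixed (Rf, Ff) or Arabized (Ra, Fa).]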
Canonical morphological structure of Arabic words
$w \rightarrow q = (t : p, r, f, s)$
The multiplicity of possible lexical analyses of a sample input Arabic word
The Arabic lexical disambiguation trellis
The Arabic POS Tags are the tokens that
convey the basic context-free syntactic
features of input surface text words.
The criteria for extracting the Arabic POS tag
set from the Arabic lexicon:
All the existing morpho-syntactic features must be named and
registered, which aims at the completeness of the resulting POS tag
set.
All the named and registered features must be atomic, which aims at
compactness and avoids redundancy in the resulting tag set.
All the named and registered features must be ones that can be ensured
when POS labeling the morphemes in our Arabic lexical knowledge base.
POS Labeling of the Arabic Lexicon
The root morphemes of all kinds do not participate in tagging, and are hence not
Arabic POS labeled.
POS labels of Arabic morphemes are vectors, not simple scalars.
Only ensured Arabic POS tags are considered in the POS labeling of morphemes.
Arabic POS Tagging
APOSw  Concat APOS( p), APOS(t : f ), APOS(s)
POS tags are the most essential input features for all
kinds of natural language computational syntax parsers
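The concatenation rule for a word's POS tags vector can be sketched in a few lines of Python. This is an illustration only: the class names, field names, and tag strings below are hypothetical and are not taken from the thesis tag set. It shows two points from the slides: morpheme labels are vectors, and the root contributes nothing to the word's tags.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Morpheme:
    surface: str  # placeholder surface form
    pos_tags: List[str] = field(default_factory=list)  # a vector, not a scalar


@dataclass
class LexicalAnalysis:
    """One candidate analysis q = (t : p, r, f, s) of a word w."""
    kind: str          # t: type of the body (e.g. derivative, fixed, Arabized)
    prefix: Morpheme   # p
    root: Morpheme     # r
    form: Morpheme     # f
    suffix: Morpheme   # s

    def pos_tags(self) -> List[str]:
        # Roots are not POS labeled, so only p, f and s contribute.
        return self.prefix.pos_tags + self.form.pos_tags + self.suffix.pos_tags


# Hypothetical tags, for illustration only.
analysis = LexicalAnalysis(
    kind="derivative",
    prefix=Morpheme("p", ["Conj", "Definit"]),
    root=Morpheme("r"),
    form=Morpheme("f", ["Noun"]),
    suffix=Morpheme("s", ["Fem"]),
)
word_tags = analysis.pos_tags()  # -> ['Conj', 'Definit', 'Noun', 'Fem']
```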
The Arabic POS Tags–Syntactic Diacritics search trellis
The need for diacritizing transliterated words:
Foreign names and terminology frequently
appear as transliterated Arabic strings in real-life
Arabic text especially in news domain at a rate of
7.5% (i.e. 1/14).
Why is it difficult?
These words are not governed by Arabic
linguistic models (e.g. morphological, syntactic).
Why is the look-up-table-based approach
insufficient?
Time-varying nature of the transliterated words.
Lack of completeness and poor coverage.
Lack of tolerance to spelling variation.
Attaching Arabic infixes is not supported.
Compliance with the Arabic phonology is not
guaranteed.
Statistical approach to the diacritization of
transliterated Arabic words.
The Search Lattice
Advantages of the Statistical Approach:
Reducing manual intervention to cleanly
and economically building and refining a diacritized
training corpus.
The completeness, coverage, and infix
attachment problems of the look-up table
approach are overcome thanks to the ability to
back off to shorter m-grams.
Dominantly frequent words (long m-grams) are
retrieved exactly, as in the look-up table
approach.
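The back-off behaviour behind these advantages can be sketched as follows. This is a simplified stand-in: it applies a fixed back-off penalty instead of Good-Turing discounted back-off weights, and the function name `backoff_logp`, the table layout, the penalty value, and all probabilities are assumptions made for the example.

```python
import math


def backoff_logp(history, symbol, logp_table,
                 backoff_penalty=math.log(0.4), floor_logp=math.log(1e-6)):
    """Score `symbol` given `history`, backing off to shorter contexts.

    logp_table maps (context_tuple, symbol) -> log-probability, where
    context_tuple holds the preceding symbols (oldest first).
    """
    context = tuple(history)
    penalty = 0.0
    while True:
        key = (context, symbol)
        if key in logp_table:
            # Longest seen context wins; a fully seen long m-gram is returned
            # unpenalized, which mimics an exact table look-up.
            return penalty + logp_table[key]
        if not context:
            # Even the context-free entry is unseen: fall back to a floor.
            return penalty + floor_logp
        context = context[1:]       # drop the oldest symbol
        penalty += backoff_penalty  # pay a fixed price for each back-off


# Made-up table over abstract symbols.
table = {
    (("a", "b"), "c"): math.log(0.5),
    (("b",), "c"): math.log(0.3),
    ((), "c"): math.log(0.1),
}
seen_score = backoff_logp(("a", "b"), "c", table)     # full context is known
partial_score = backoff_logp(("x", "b"), "c", table)  # backs off once
```

An unseen spelling variant still receives a score from its shorter sub-sequences, which a pure look-up table cannot provide.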
Ensuring the compliance with Arabic phonology
Our training diacritized corpus is validated against
the Arabic phonology.
During the search process, any intermediate path
that does not comply with the Arabic phonology is
pruned.
The S/W component which is used for these two
purposes is called Arabic Phonetic Grammar (APG)
checker.
While the APG checker is necessary for correct
syllabification done by Arabic TTS systems, APG
checking also enhances the computational
efficiency of A* search.
Formalized APG in BNF format
(Terminals are written in italic capitals)
W := ystart[ymid#][yend]
ystart := cstart fvowel
ymid := ymid,regular|ymid,sokoon|ymid,silent
yend := yend,sokoon|yend,silent|yend,layyina|yend,tanween
ymid,regular := cmid[SHADDA]fvowel
ymid,sokoon := cmid SOKOON cmid fvowel
ymid,silent := cmid BYPASS
yend,sokoon := (cend SOKOON)|(cmid SOKOON cend SOKOON)|(cmid SHADDA SOKOON)
yend,silent := cmid (SOKOON|fvowel|ftanween|(SHADDA ftanween)) cend BYPASS
yend,layyina := cmid[SHADDA]flayyina
yend,tanween := cend[SHADDA]ftanween
cstart := (HMZA|BAA|TAA|...|HA|WAW|YAA)|(ALIF|HMZe)
cmid := (cstart - {ALIF,HMZe})|(HMZs|HMZy|HMZw)
cend := cmid|Yend|TAAM
fvowel := (FATEHA[ALIF VWL])|(KASRA[YAA VWL])|(DHAMMA[WAW VWL])
flayyina := FATEHA YAA YAAL
ftanween := TNWa|TNWo|TNWe
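To make the grammar above more concrete, here is a minimal Python sketch of a checker for a deliberately reduced subset of it. It covers only ystart, ymid,regular, ymid,sokoon, the first alternative of yend,sokoon, and yend,tanween. The simplifications are assumptions of the sketch: long vowels, silent letters (BYPASS), and the layyina ending are left out, and the distinctions between cstart, cmid, and cend are collapsed into one consonant class, so any token that is not a listed diacritic is treated as a consonant. The name `is_apg_compliant` is hypothetical.

```python
import re

SHORT_VOWELS = {"FATEHA", "KASRA", "DHAMMA"}
TANWEEN = {"TNWa", "TNWo", "TNWe"}


def token_class(token):
    """Map a grammar terminal to a one-letter class code."""
    if token in SHORT_VOWELS:
        return "V"
    if token in TANWEEN:
        return "T"
    if token == "SHADDA":
        return "S"
    if token == "SOKOON":
        return "O"
    return "C"  # assumption: every other terminal is a consonant


# W := ystart [ymid#] [yend], restricted to the subset described above:
#   ystart        -> C V
#   ymid,regular  -> C [S] V
#   ymid,sokoon   -> C O C V
#   yend,sokoon   -> C O          (first alternative only)
#   yend,tanween  -> C [S] T
_WORD = re.compile(r"CV(?:CS?V|COCV)*(?:CO|CS?T)?")


def is_apg_compliant(tokens):
    """Return True if the token sequence is accepted by the reduced grammar."""
    classes = "".join(token_class(t) for t in tokens)
    return _WORD.fullmatch(classes) is not None


# A word may not begin with an unvowelled consonant:
ok = is_apg_compliant(["BAA", "FATEHA", "TAA", "SOKOON", "HA", "FATEHA", "BAA", "SOKOON"])
bad = is_apg_compliant(["BAA", "SOKOON", "TAA", "FATEHA"])
```

In a search setting, a partial path would be pruned as soon as its prefix can no longer be extended into an accepted word. Because of the omitted rules, this sketch rejects some words that the full grammar accepts.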
This approach is also valid for transliterated words in other languages
and dialects, provided that the corresponding phonetic grammar checker
is available.
Adding a layer of Phonetic grammar to the NLP layers ladder is
proposed.
Arabic Text Normalization
(A Pre Processing)
Arabic text normalization formal grammar.
Phonetic Concatenation
(A Post Processing)
Overall performance evaluation
$$
A = (1 - f_T) \cdot A_L \cdot \left[(1 - f_S) + f_S \cdot A_S\right] + f_T \cdot A_T
$$

$$
f_S = 0.65, \qquad f_T = 0.075
$$

$$
A = 0.32375 \cdot A_L + 0.60125 \cdot A_L \cdot A_S + 0.07500 \cdot A_T
$$
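The arithmetic behind the three coefficients can be checked with a short Python sketch. The function names are assumptions of the example; only the formula and the values of f_S and f_T come from the slide, and no accuracy figures are implied.

```python
def overall_accuracy(a_l, a_s, a_t, f_s=0.65, f_t=0.075):
    """Overall word accuracy A from lexical (a_l), syntactic (a_s) and
    transliteration (a_t) accuracies, as combined on the slide."""
    return (1 - f_t) * a_l * ((1 - f_s) + f_s * a_s) + f_t * a_t


def expanded_coefficients(f_s=0.65, f_t=0.075):
    """Coefficients of A_L, A_L*A_S and A_T in the expanded form of A."""
    return ((1 - f_t) * (1 - f_s), (1 - f_t) * f_s, f_t)


c_l, c_ls, c_t = expanded_coefficients()
# c_l, c_ls and c_t match 0.32375, 0.60125 and 0.075 up to floating-point rounding.
```

The three coefficients sum to one, so a system that is perfect on all three components reaches A = 1.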
Performance Evaluation: AL
Performance Evaluation: As
Performance Evaluation: AT
Strict performance estimation
The resulting word diacritization is assumed wrong if “any
lexical or syntactic diacritic is incorrectly inferred for Arabic
words, or if the evaluation rank of diacritization is less than
very good for transliterated words”.
$A_{\text{Strict}} \approx 0.895$
Lenient performance estimation
The resulting word diacritization is assumed correct if “all its
lexical diacritics are correctly inferred (regardless of the
inferred syntactic diacritics) for Arabic words, or if the
evaluation rank of diacritization is intelligible or higher for
transliterated words”.
$A_{\text{Lenient}} \approx 0.97$
$A_{\text{Strict}} \le A_{\text{Perceived}} \le A_{\text{Lenient}}$
Building annotated corpora for
supervised training
(The Language Resources (LR) Issue)
Size
Quality
Automation
Standards
Validation
LRs being the driver of HLT.
The NEMLAR project; a relevant example.
Proposals for future work
Perceived HLT performance as a reference for
evaluation. (*)
Building phonetic transcriptors of language
dialects using the same methodology as that of
transliterated words. (**)
Lexical semantic tagging. (***)
Modeling the mutual interactive relations among
the multiple Arabic NLP layers for a
combinatorial analysis.(*****)
Publications
&
Industrial Applications
(10 min.)
Discussion
?!
15 min. (+)
To probe further…
Visit
http://www.RDI-eg.com/Technologies/NLP
Contact the author at
[email protected]
or at
[email protected]
Thanks for your attention