ACL 4
NCLT Seminar Presentation, 7th June 2006
John Tinsley
Morphological Analysis of Spanish Using Finite-State Transducers
Introduction
What is this project about?
Provide morphological information on Spanish strings
Generate strings from morphological descriptions
What were my aims?
A robust, fast application that is easily integrated into other systems
80% token coverage on unrestricted text
100% coverage of Spanish morphology
Design Methodology
Formalisation
Discovery of Spanish morphological rules
Implementation
Coding of morphological model with Xerox Finite-State Tools
Evaluation
Check for accuracy & well-formedness
Assess language coverage
Formalisation
Spanish Morphology
- Verbs
Inflected for person, tense/mood, number
Regular verbs
3 regular conjugations identified by infinitive endings: ‘-ar’, ‘-er’, and ‘-ir’
Irregular verbs
66 distinct irregularities
Varying degrees of irregularity
Spanish Morphology
- Nouns
Inflected for number, gender
7 types of noun
Feminine, masculine, neutral, derivative,
profession, number invariant, proper
Irregularities
All arise via pluralisation
Accentuation, character alterations
Spanish Morphology
- Adjectives
Inflected for number, gender
4 types of adjective
Neutral, derivative, profession, irregular
Adverbs derived from adjectives by addition of the suffix ‘-mente’
Implementation
Xerox Finite-State Tools
- lexc
Lexicon compiler
Compiles ‘continuation classes’ into lexical
transducers
Xerox Finite-State Tools
- xfst
Xerox finite-state tool
Compiles regular expressions into
networks
Regular expression replace rules
[ String -> Replacement || left-context _ right-context ]
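The replace-rule template can be imitated outside xfst with an ordinary regular expression, using a lookbehind for the left context and a lookahead for the right context. This is a minimal Python sketch, not xfst itself; the helper name replace_rule is hypothetical and the contexts are restricted to plain strings.

```python
import re

def replace_rule(target, replacement, left="", right=""):
    """Return a function that rewrites `target` as `replacement`
    only when it stands between the `left` and `right` contexts."""
    pattern = re.compile(
        (f"(?<={re.escape(left)})" if left else "")
        + re.escape(target)
        + (f"(?={re.escape(right)})" if right else "")
    )
    # A callable replacement avoids any backslash interpretation.
    return lambda s: pattern.sub(lambda m: replacement, s)

# Analogue of [ z -> c || _ e s ]: rewrite z only when "es" follows.
z_to_c = replace_rule("z", "c", right="es")
print(z_to_c("luzes"))  # luces
print(z_to_c("luz"))    # luz (context not met, so unchanged)
```

An empty context simply leaves that side unconstrained, which mirrors omitting a context in the rule.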
Xerox Finite-State Tool
- example
conocer - ‘to know’
1st person, pres. ind. ‘conozco’
Lexical transducer mappings
conoc:conoc
er+Verb:ε
+PresInd:^PresInd
+1P+Sg:o
Xerox Finite-State Tool
- example cont…
Lexical: conocer+Verb+PresInd+1P+Sg
Surface: conoc^PresIndo
Composed replace rule
[ c -> {zc} || _ %^PresInd ]
Triggered by the ^PresInd tag
Makes the required change (conozc^PresIndo), then the trigger is removed to give ‘conozco’
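The whole conocer derivation can be traced step by step in a few lines of Python. This is only an illustrative sketch of the mappings and the replace rule shown on the slides; the function names are hypothetical and plain string operations stand in for the composed transducers.

```python
# Lexical-to-intermediate mappings from the slide; "" stands for ε.
MAPPINGS = [
    ("conoc", "conoc"),
    ("er+Verb", ""),
    ("+PresInd", "^PresInd"),
    ("+1P+Sg", "o"),
]

def to_intermediate(lexical):
    """Consume the lexical string left to right, emitting the lower side of each mapping."""
    out, rest = [], lexical
    for upper, lower in MAPPINGS:
        if not rest.startswith(upper):
            raise ValueError(f"no path for {lexical!r}")
        out.append(lower)
        rest = rest[len(upper):]
    if rest:
        raise ValueError(f"unconsumed input: {rest!r}")
    return "".join(out)

def apply_rules(intermediate):
    """Apply [ c -> {zc} || _ ^PresInd ], then remove the trigger."""
    s = intermediate.replace("c^PresInd", "zc^PresInd")
    return s.replace("^PresInd", "")

step1 = to_intermediate("conocer+Verb+PresInd+1P+Sg")
print(step1)               # conoc^PresIndo
print(apply_rules(step1))  # conozco
```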
Verb Lexicon
Coded in lexc
Model has 3 regular paths
66 varieties of irregularity
e.g. poder ‘to be able to’
LEXICON Irreg43
0:^UE^VSoue^PRET1^FR    ErV ;
[ o -> {ue} || _ Consonant^<4 [ %^UE ?* [ [%^PresInd | %^PresSubj] ?* [%^1PSg | %^2PSg | %^3PSg | %^3PPl] ] ] ]
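The effect of the o -> ue rule can be approximated with a regular-expression lookahead. The sketch below is a simplification under stated assumptions: the intermediate strings are hypothetical (only the ^UE trigger is kept, and endings are placed as in the regular verb lexicon), the consonant class is a rough guess at what Consonant covers, and the function name surface is invented for illustration.

```python
import re

CONSONANT = "[b-df-hj-np-tv-zñ]"                       # assumed consonant inventory
TENSES = r"(?:\^PresInd|\^PresSubj)"
PERSONS = r"(?:\^1PSg|\^2PSg|\^3PSg|\^3PPl)"

# o -> ue before fewer than four consonants, the ^UE trigger,
# a present-tense trigger and one of the listed person triggers.
DIPHTHONG = re.compile(rf"o(?={CONSONANT}{{0,3}}\^UE.*{TENSES}.*{PERSONS})")

# Triggers are listed explicitly because they are multi-character symbols.
TRIGGERS = ["^UE", "^PresInd", "^PresSubj", "^1PSg", "^2PSg", "^3PSg", "^1PPl", "^3PPl"]

def surface(intermediate):
    """Apply the diphthongisation rule, then strip all triggers."""
    s = DIPHTHONG.sub("ue", intermediate)
    for trigger in TRIGGERS:
        s = s.replace(trigger, "")
    return s

print(surface("pod^UE^PresIndo^1PSg"))     # puedo
print(surface("pod^UE^PresIndemos^1PPl"))  # podemos (1st person plural is not in the rule)
```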
Noun Lexicon
LEXICON NounFem              ! Feminine Nouns
!STEM     !CONT. CLASS       ! GLOSS
acción    fIsNounEs ;        ! action

LEXICON fIsNounEs            ! feminine pluralised with 'es'
+Noun:0   fNounPluralES ;

LEXICON fNounPluralES
+Sg+Fem:0            # ;
+Pl+Fem:^NZ^NOes     # ;

[z -> c || _ %^NZ]
[ó -> o || _ ?^<5 %^NO ]
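The two pluralisation rules can be tried out on concrete stems with a short Python sketch. It is an approximation only: characters are counted instead of transducer symbols, so the ^NZ trigger is removed before the accent rule is applied, and the function name pluralise and the extra stem luz are illustrative choices.

```python
import re

PLURAL_SUFFIX = "^NZ^NOes"  # lower side of +Pl+Fem in fNounPluralES

def pluralise(stem):
    """Attach the plural suffix with its triggers, apply both rules, strip the triggers."""
    s = stem + PLURAL_SUFFIX
    # [z -> c || _ %^NZ], then drop the ^NZ trigger
    s = s.replace("z^NZ", "c^NZ").replace("^NZ", "")
    # [ó -> o || _ ?^<5 %^NO ]: ó followed by fewer than five characters, then ^NO
    s = re.sub(r"ó(?=.{0,4}\^NO)", "o", s)
    return s.replace("^NO", "")

print(pluralise("acción"))  # acciones
print(pluralise("luz"))     # luces
```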
Adjective Lexicon
Same process as noun lexicon
Uses the same replace rules
One exception for adverbs
LEXICON nIsAdjS
+Adj:0                  nAdjPluralS ;
+Adj|+Adv:^AAOmente     # ;

[o -> a || _ %^NAO %^AAO {mente}]
Other Transducers
Overgeneration Filter
llover ‘to rain’
~[ $[ {llov} ?* [ [%+1P | %+2P] [%+Sg | %+Pl] | [%+3P %+Pl] ] ] ]
Capitalisation
[ a (->) A || .#. _ ]
Trigger Remover
[ %^IE -> 0 ]
Execution script
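The overgeneration filter for llover amounts to rejecting any analysis that contains the stem followed by a first- or second-person tag, or by third person plural. A minimal Python sketch of the same idea follows; the function name filter_overgeneration is hypothetical, and a list of analysis strings stands in for the transducer's lexical side.

```python
import re

# Analyses of llover that should not exist: 1st/2nd person (any number) or 3rd person plural.
BLOCKED = re.compile(r"llov.*(?:(?:\+1P|\+2P)(?:\+Sg|\+Pl)|\+3P\+Pl)")

def filter_overgeneration(analyses):
    """Keep only the analyses that do not contain a blocked pattern."""
    return [a for a in analyses if not BLOCKED.search(a)]

candidates = [
    "llover+Verb+PresInd+1P+Sg",
    "llover+Verb+PresInd+3P+Sg",
    "llover+Verb+PresInd+3P+Pl",
    "conocer+Verb+PresInd+1P+Sg",
]
print(filter_overgeneration(candidates))
# ['llover+Verb+PresInd+3P+Sg', 'conocer+Verb+PresInd+1P+Sg']
```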
Evaluation
Testing
Accuracy
Maintaining integrity of existing rules
Projection
Subtraction
Well-formedness
Ensuring tag order
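One plausible reading of these checks, sketched in Python: subtraction compares the analyses produced before and after a change, and well-formedness verifies that tags appear in a fixed order. The tag order assumed here (stem, +Verb, tense/mood, person, number) is taken from the verb examples in this talk; the function names and the sample data are hypothetical.

```python
import re

# Assumed tag order for verb analyses: stem, +Verb, tense/mood, person, number.
WELL_FORMED = re.compile(
    r"^[a-záéíóúüñ]+\+Verb\+(?:PresInd|PretInd)\+[123]P\+(?:Sg|Pl)$"
)

def ill_formed(analyses):
    """Return the analyses whose tags are missing or out of order."""
    return sorted(a for a in analyses if not WELL_FORMED.match(a))

def lost_analyses(before, after):
    """Subtraction: analyses the old model produced that the new one no longer does."""
    return sorted(set(before) - set(after))

old = {"conocer+Verb+PresInd+1P+Sg", "abordar+Verb+PresInd+3P+Sg"}
new = {"conocer+Verb+PresInd+1P+Sg", "abordar+Verb+3P+PresInd+Sg"}
print(lost_analyses(old, new))  # ['abordar+Verb+PresInd+3P+Sg']
print(ill_formed(new))          # ['abordar+Verb+3P+PresInd+Sg']
```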
Assessing Coverage
Aim – 80% on unrestricted text
Statistical predictions (Crystal 1997)
Corpus compilation and processing
Europarl, 3 corpora (http://people.csail.mit.edu/koehn/publications/europarl/)
Phase 1 – augmentation
Phase 2 – 81% coverage
Final assessment – 84.15% coverage
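Token coverage of this kind is the percentage of corpus tokens that receive at least one analysis. The sketch below shows the calculation in Python on made-up data; token_coverage, the toy analyser and the four-token sample are illustrative assumptions, not the evaluation script used for the figures above.

```python
def token_coverage(tokens, analyse):
    """Percentage of tokens for which `analyse` returns at least one analysis."""
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if analyse(token))
    return 100.0 * covered / len(tokens)

# Toy analyser: a word is "analysed" if it is in a small known set.
KNOWN = {"conozco", "acciones", "abordo"}

def toy_analyse(token):
    return [token] if token in KNOWN else []

sample = ["conozco", "las", "acciones", "abordo"]
print(token_coverage(sample, toy_analyse))  # 75.0
```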
Further Details
Class         # of forms
Nouns         547
Verbs         304
Adjectives    183
Other         378
• Generates approx. 44,000 unique morphological descriptions
• Evaluation corpus – 1.26 analyses per input token on average
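An average such as analyses per input token can be computed as below. This Python sketch assumes the average is taken over tokens that received at least one analysis, which the slide does not state; the function name and the placeholder data are hypothetical.

```python
def mean_analyses_per_token(analyses_by_token):
    """Average number of analyses, counting only tokens with at least one analysis."""
    counts = [len(analyses) for analyses in analyses_by_token if analyses]
    return sum(counts) / len(counts) if counts else 0.0

# Placeholder output for five tokens: one ambiguous, one unanalysed.
results = [["a1"], ["b1", "b2"], ["c1"], ["d1"], []]
print(mean_analyses_per_token(results))  # 1.25
```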
Possible improvements
Increase coverage
lexicon augmentation
Disambiguation using POS tagger
More derivational morphology
Deal with different dialects of Spanish
References
(Beesley & Karttunen 2003) Beesley, K. and Karttunen, L. Finite State Morphology. CSLI Publications, United States, 2003.
(Claret 2005) Los Verbos Castellanos Conjugados, Sexta Edición. Editorial Claret, Barcelona, 2005.
(Crystal 1997) Crystal, D. The Cambridge Encyclopedia of Language (2nd ed.). Cambridge University Press, 1997.
(Europarl) Europarl Parallel Corpus. http://people.csail.mit.edu/koehn/publications/europarl/ - Last accessed 19/05/2006.
(Kendris 1990) Kendris, C. Spanish Grammar. Barron’s, 1990.
(Mateo & Rojo Sastre 1997) Mateo, F. and Rojo Sastre, A.J. Collection Bescherelle - Les verbes espagnols. Hatier, 1997.
(Real Academia Española) http://www.rae.es/ - Last accessed 25/05/2006.
Conclusions
Demonstration
LEXICON ArVerbs
!STEM     !CONT. CLASS     !GLOSS
abord     ArV ;            !to approach

LEXICON ArV
ar+Verb:0     ArConj ;

LEXICON ArConj
!TAGS                 !CONT.CLASS
+PresInd:^PresInd     ArPresInd ;
+PretInd:^PretInd     ArPretInd ;

LEXICON ArPresInd          ! Present Indicative
+1P+Sg:o^1PSg      # ;
+2P+Sg:as^2PSg     # ;
+3P+Sg:a^3PSg      # ;
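The way continuation classes chain together can be seen by walking the demonstration lexicon in Python. This sketch encodes each LEXICON as a list of (upper, lower, continuation) entries and enumerates every complete path; it is an illustration rather than what lexc does internally, the PretInd branch is left out because ArPretInd is not shown, and the names LEXICONS, paths and strip_triggers are hypothetical.

```python
# Each entry: (lexical side, intermediate side, continuation class); "#" ends a word.
LEXICONS = {
    "ArVerbs":   [("abord", "abord", "ArV")],
    "ArV":       [("ar+Verb", "", "ArConj")],
    "ArConj":    [("+PresInd", "^PresInd", "ArPresInd")],
    "ArPresInd": [("+1P+Sg", "o^1PSg", "#"),
                  ("+2P+Sg", "as^2PSg", "#"),
                  ("+3P+Sg", "a^3PSg", "#")],
}
TRIGGERS = ["^PresInd", "^1PSg", "^2PSg", "^3PSg"]

def paths(lexicon, upper="", lower=""):
    """Yield every (lexical, intermediate) pair reachable from `lexicon`."""
    for up, low, continuation in LEXICONS[lexicon]:
        u, l = upper + up, lower + low
        if continuation == "#":
            yield u, l
        else:
            yield from paths(continuation, u, l)

def strip_triggers(intermediate):
    """Remove the trigger symbols to obtain the surface form."""
    for trigger in TRIGGERS:
        intermediate = intermediate.replace(trigger, "")
    return intermediate

for lexical, intermediate in paths("ArVerbs"):
    print(lexical, "->", strip_triggers(intermediate))
# abordar+Verb+PresInd+1P+Sg -> abordo
# abordar+Verb+PresInd+2P+Sg -> abordas
# abordar+Verb+PresInd+3P+Sg -> aborda
```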