Transcript Slide 1

ACL 4
NCLT Seminar Presentation, 7th June 2006
John Tinsley
Morphological Analysis of Spanish Using Finite-State Transducers
Introduction


What is this project about?
• Provide morphological information on Spanish strings
• Generate strings from morphological descriptions
What were my aims?
• A robust, fast application that is easily integrated into other systems
• 80% token coverage on unrestricted text
• 100% coverage of Spanish morphology
Design Methodology
Formalisation
• Discovery of Spanish morphological rules
Implementation
• Coding of morphological model with Xerox Finite-State Tools
Evaluation
• Check for accuracy & well-formedness
• Assess language coverage
Formalisation
Spanish Morphology - Verbs
• Inflected for person, tense/mood, number
• Regular verbs
  • 3 regular conjugations identified by infinitive endings ‘-ar’, ‘-er’, and ‘-ir’
• Irregular verbs
  • 66 distinct irregularities
  • Varying degrees of irregularity
Spanish Morphology - Nouns
• Inflected for number, gender
• 7 types of noun
  • Feminine, masculine, neutral, derivative, profession, number invariant, proper
• Irregularities
  • All arise via pluralisation
  • Accentuation, character alterations
Spanish Morphology - Adjectives
• Inflected for number, gender
• 4 types of adjective
  • Neutral, derivative, profession, irregular
• Adverbs derived from adjectives by addition of suffix ‘mente’
Implementation
Xerox Finite-State Tools - lexc
• Lexicon compiler
• Compiles ‘continuation classes’ into lexical transducers
Xerox Finite-State Tools - xfst
• Xerox finite-state tool
• Compiles regular expressions into networks
• Regular expression replace rules
[ String -> Replacement || left-context _ right-context ]
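As a rough illustration of what a rule of this shape does to a string, the sketch below emulates a contextual replace rule in Python using regular-expression lookarounds. It is not xfst: `make_rule` and its parameters are hypothetical names, the contexts are plain literal strings, and unlike a compiled xfst network the result only runs in one direction.

```python
import re

def make_rule(target, replacement, left="", right=""):
    """Emulate [ target -> replacement || left _ right ] on plain strings."""
    pattern = re.escape(target)
    if left:
        # The left context must precede the target but is not consumed.
        pattern = "(?<=" + re.escape(left) + ")" + pattern
    if right:
        # The right context must follow the target but is not consumed.
        pattern = pattern + "(?=" + re.escape(right) + ")"
    compiled = re.compile(pattern)

    def apply(text):
        # A function replacement avoids backslash interpretation in `replacement`.
        return compiled.sub(lambda match: replacement, text)

    return apply

# Replace "a" with "b" only between "x" and "y".
rule = make_rule("a", "b", left="x", right="y")
print(rule("xay xaz"))  # xby xaz
```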
Xerox Finite-State Tool - example
• conocer - ‘to know’
• 1st person, pres. ind. ‘conozco’
• Lexical transducer mappings
  conoc:conoc
  er+Verb:ε
  +PresInd:^PresInd
  +1P+Sg:o
Xerox Finite-State Tool - example cont…
Lexical: conocer+Verb+PresInd+1P+Sg
Surface: conoc^PresIndo
Composed replace rule
[ c -> {zc} || _ %^PresInd ]
• Triggered by the ^PresInd tag
• Makes the required changes, then the trigger is removed
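The conocer example can be traced end to end with a small Python sketch. It assumes the four mapping pairs simply concatenate along one path and emulates the replace rule and the trigger removal with string operations. The names `PAIRS`, `apply_rule`, and `remove_trigger` are illustrative and are not part of the Xerox tools.

```python
import re

# (lexical side, surface side) pairs along the path for conocer;
# the empty string stands in for the epsilon on the slide.
PAIRS = [
    ("conoc", "conoc"),
    ("er+Verb", ""),
    ("+PresInd", "^PresInd"),
    ("+1P+Sg", "o"),
]

lexical = "".join(upper for upper, _ in PAIRS)
intermediate = "".join(lower for _, lower in PAIRS)

def apply_rule(text):
    # Emulates [ c -> {zc} || _ ^PresInd ]: only a "c" directly before the trigger changes.
    return re.sub(r"c(?=\^PresInd)", "zc", text)

def remove_trigger(text):
    # The trigger has done its job and is deleted from the surface string.
    return text.replace("^PresInd", "")

surface = remove_trigger(apply_rule(intermediate))
print(lexical)       # conocer+Verb+PresInd+1P+Sg
print(intermediate)  # conoc^PresIndo
print(surface)       # conozco
```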
Verb Lexicon
• Coded in lexc
• Model has 3 regular paths
• 66 varieties of irregularity
e.g. poder ‘to be able to’
LEXICON Irreg43
0:^UE^VSoue^PRET1^FR    ErV ;
[o -> {ue} || _ Consonant^<4 [%^UE ?* [[%^PresInd | %^PresSubj] ?* [%^1PSg | %^2PSg | %^3PSg | %^3PPl]]]]
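To make the effect of the o -> ue rule concrete, here is a hedged Python emulation. The intermediate strings are assumptions: they are built by analogy with the ^PresInd and ^1PSg triggers shown elsewhere in the slides, and the ‘emos^1PPl’ ending is invented for contrast. `CONSONANT`, `TRIGGERS`, and `surface` are hypothetical names.

```python
import re

CONSONANT = "bcdfghjklmnñpqrstvwxyz"

# o -> ue when followed by fewer than 4 consonants, the ^UE trigger, a present
# indicative or subjunctive trigger, and one of the listed person/number triggers.
DIPHTHONG = re.compile(
    r"o(?=[" + CONSONANT + r"]{0,3}\^UE.*\^(?:PresInd|PresSubj).*"
    r"\^(?:1PSg|2PSg|3PSg|3PPl))"
)

# Trigger symbols deleted once the rules have applied (assumed inventory).
TRIGGERS = ["^PresSubj", "^PresInd", "^VSoue", "^PRET1",
            "^1PSg", "^2PSg", "^3PSg", "^3PPl", "^1PPl", "^UE", "^FR"]

def surface(intermediate):
    text = DIPHTHONG.sub("ue", intermediate)
    for trigger in TRIGGERS:
        text = text.replace(trigger, "")
    return text

# Assumed intermediate strings for poder (not taken from the slides).
print(surface("pod^UE^VSoue^PRET1^FR^PresIndo^1PSg"))     # puedo
print(surface("pod^UE^VSoue^PRET1^FR^PresIndemos^1PPl"))  # podemos
```

The second call shows the point of the long right context: without a matching person/number trigger the stem vowel is left alone.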
Noun Lexicon
LEXICON NounFem    ! Feminine Nouns
!STEM      !CONT. CLASS     ! GLOSS
acción     fIsNounEs ;      ! action
LEXICON fIsNounEs
+Noun:0    fNounPluralES ;  ! feminine pluralised with 'es'
LEXICON fNounPluralES
+Sg+Fem:0           # ;
+Pl+Fem:^NZ^NOes    # ;
[z -> c || _ %^NZ]
[ó -> o || _ ?^<5 %^NO ]
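The two noun replace rules on this slide can be approximated in Python as follows. This is a sketch under assumptions: each trigger is treated as a single symbol when counting the ‘fewer than 5 symbols’ context, and ‘luz’ is an illustrative stem that does not appear in the slides and is assumed to follow the same continuation class. `pluralise` is a hypothetical helper name.

```python
import re

def pluralise(stem):
    # Lower side of the +Pl+Fem entry: stem followed by ^NZ^NOes.
    text = stem + "^NZ^NOes"
    # [z -> c || _ ^NZ]: a stem-final z becomes c before the plural ending.
    text = re.sub(r"z(?=\^NZ)", "c", text)
    # [ó -> o || _ ?^<5 ^NO]: drop the accent when ^NO follows within 4 symbols;
    # ^NZ counts as one symbol, any other non-caret character as one symbol.
    text = re.sub(r"ó(?=(?:\^NZ|[^\^]){0,4}\^NO)", "o", text)
    # Remove the triggers to obtain the surface form.
    return text.replace("^NZ", "").replace("^NO", "")

print(pluralise("acción"))  # acciones
print(pluralise("luz"))     # luces
```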
Adjective Lexicon
• Same process as noun lexicon
• Uses the same replace rules
• One exception for adverbs
LEXICON nIsAdjS
+Adj:0                  nAdjPluralS ;
+Adj|+Adv:^AAOmente     # ;
[o -> a || _ %^NAO %^AAO {mente}]
Other Transducers
Overgeneration Filter
• llover ‘to rain’
~[ $[{llov} ?* [[%+1P | %+2P] [%+Sg | %+Pl] | [%+3P %+Pl] ] ] ]
Capitalisation
[ a (->) A || .#. _ ]
Trigger Remover
[ %^IE -> 0 ]
Execution script
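The three regular expressions on this slide can be read as a filter, an optional rewrite, and a deletion. The Python sketch below mirrors that reading on plain strings. It assumes lexical strings use the tag layout from the conocer example, and all function names are hypothetical.

```python
import re

# Lexical strings the filter rejects: llover with 1st or 2nd person (either number)
# or with 3rd person plural, since 'to rain' is only used in the 3rd person singular.
FORBIDDEN = re.compile(r"llov.*(?:\+(?:1P|2P)\+(?:Sg|Pl)|\+3P\+Pl)")

def passes_filter(lexical):
    return FORBIDDEN.search(lexical) is None

def capitalisation_variants(surface):
    # [ a (->) A || .#. _ ] is optional, so both spellings are accepted.
    variants = {surface}
    if surface.startswith("a"):
        variants.add("A" + surface[1:])
    return variants

def remove_trigger(text):
    # [ ^IE -> 0 ] deletes the trigger symbol wherever it occurs.
    return text.replace("^IE", "")

print(passes_filter("llover+Verb+PresInd+3P+Sg"))  # True
print(passes_filter("llover+Verb+PresInd+1P+Sg"))  # False
```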
Evaluation
Testing
• Accuracy
  • Maintaining integrity of existing rules
    • Projection
    • Subtraction
• Well-formedness
  • Ensuring tag order
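One plausible reading of the projection and subtraction steps is a regression check: project one side of the old and new transducers and subtract one language from the other to see which forms were lost or gained. The slides do not spell out the procedure, so the Python sketch below is an assumption. It uses finite sets of (lexical, surface) pairs in place of real networks, and the ‘conoco’ form is an invented error for illustration.

```python
def project(pairs, side):
    """Return the set of strings on one side ('upper' or 'lower') of the relation."""
    index = 0 if side == "upper" else 1
    return {pair[index] for pair in pairs}

def lost_and_gained(old_pairs, new_pairs, side="lower"):
    """Subtract the projected languages in both directions."""
    old, new = project(old_pairs, side), project(new_pairs, side)
    return old - new, new - old

# Toy data: a rule change that breaks the c -> zc alternation.
old_pairs = {("conocer+Verb+PresInd+1P+Sg", "conozco")}
new_pairs = {("conocer+Verb+PresInd+1P+Sg", "conoco")}

lost, gained = lost_and_gained(old_pairs, new_pairs)
print(lost)    # {'conozco'}
print(gained)  # {'conoco'}
```

Both differences being empty would suggest that a new rule left the existing forms intact.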
Assessing Coverage
• Aim: 80% on unrestricted text
• Statistical predictions (Crystal 1997)
• Corpus compilation and processing
  • Europarl, 3 corpora (http://people.csail.mit.edu/koehn/publications/europarl/)
• Phase 1: augmentation
• Phase 2: 81% coverage
• Final assessment: 84.15% coverage
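Token coverage here is taken to mean the share of corpus tokens that receive at least one analysis. The slides do not define the measure, so this minimal Python sketch rests on that assumption. `token_coverage` and the toy analyser are hypothetical stand-ins for running the real transducer over Europarl.

```python
def token_coverage(tokens, analyse):
    """Percentage of tokens for which `analyse` returns at least one analysis."""
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if analyse(token))
    return 100.0 * covered / len(tokens)

# Toy analyser: a lookup table standing in for the morphological transducer.
KNOWN = {
    "conozco": ["conocer+Verb+PresInd+1P+Sg"],
    "acciones": ["acción+Noun+Pl+Fem"],
}

def toy_analyse(token):
    return KNOWN.get(token, [])

tokens = ["conozco", "acciones", "xyz", "conozco"]
print(token_coverage(tokens, toy_analyse))  # 75.0
```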
Further Details
Class         # of forms
Nouns         547
Verbs         304
Adjectives    183
Other         378
• Generates approx. 44,000 unique morphological descriptions
• Evaluation corpus: 1.26 analyses per input token on average
Possible improvements
• Increase coverage
  • Lexicon augmentation
• Disambiguation using POS tagger
• More derivational morphology
• Deal with different dialects of Spanish
References
(Beesley & Karttunen 2003) Beesley, K. and Karttunen, L., Finite State Morphology, CSLI Publications, United States, 2003.
(Claret 2005) Los Verbos Castellanos Conjugados, Sexta Edición, Editorial Claret, Barcelona, 2005.
(Crystal 1997) Crystal, D., The Cambridge Encyclopedia of Language (2nd ed.), Cambridge University Press, 1997.
Europarl - Europarl Parallel Corpus, http://people.csail.mit.edu/koehn/publications/europarl/ - Last Accessed 19/05/2006
(Kendris 1990) Kendris, C., Spanish Grammar, Barron’s, 1990.
(Mateo & Rojo Sastre 1997) Mateo, F. and Rojo Sastre, A.J., Collection Bescherelle - Les verbes espagnols, Hatier, 1997.
Real Academia Española - http://www.rae.es/ - Last Accessed 25/05/2006
Conclusions
Demonstration
LEXICON ArVerbs
!STEM      !CONT. CLASS    !GLOSS
abord      ArV ;           !to approach
LEXICON ArV
ar+Verb:0  ArConj ;
LEXICON ArConj
!TAGS                 !CONT.CLASS
+PresInd:^PresInd     ArPresInd ;
+PretInd:^PretInd     ArPretInd ;
LEXICON ArPresInd     ! Present Indicative
+1P+Sg:o^1PSg     # ;
+2P+Sg:as^2PSg    # ;
+3P+Sg:a^3PSg     # ;
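The way continuation classes chain into complete lexical/surface pairs can be sketched in Python by walking the lexicons recursively. This only illustrates the idea and is not how lexc compiles networks. `LEXICONS` and `paths` are hypothetical names, and the ArPretInd branch is left out because its entries are not shown on the slide.

```python
# Each lexicon maps to entries of (lexical side, surface side, continuation class);
# "#" marks the end of a word, and "0" on the slide becomes the empty string.
LEXICONS = {
    "ArVerbs": [("abord", "abord", "ArV")],
    "ArV": [("ar+Verb", "", "ArConj")],
    "ArConj": [("+PresInd", "^PresInd", "ArPresInd")],
    "ArPresInd": [
        ("+1P+Sg", "o^1PSg", "#"),
        ("+2P+Sg", "as^2PSg", "#"),
        ("+3P+Sg", "a^3PSg", "#"),
    ],
}

def paths(lexicon="ArVerbs", upper="", lower=""):
    """Yield every (lexical, intermediate surface) pair reachable from `lexicon`."""
    if lexicon == "#":
        yield upper, lower
        return
    for entry_upper, entry_lower, continuation in LEXICONS[lexicon]:
        yield from paths(continuation, upper + entry_upper, lower + entry_lower)

for lexical, intermediate in paths():
    print(lexical, "->", intermediate)
# abordar+Verb+PresInd+1P+Sg -> abord^PresIndo^1PSg
# abordar+Verb+PresInd+2P+Sg -> abord^PresIndas^2PSg
# abordar+Verb+PresInd+3P+Sg -> abord^PresInda^3PSg
```

The intermediate strings still carry their triggers. In the full system the replace rules and the trigger remover would be applied afterwards to give the surface forms.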