Language Resources for Maltese


Language Resources
for Maltese
Mike Rosner
Dept. Artificial Intelligence
University of Malta
[email protected]
Jan 2008
Malta
Team
• Mike Rosner, Dept AI, UoM
• Ray Fabri, Inst. Linguistics, UoM
• Duncan Attard, RA, Dept AI, UoM
• Albert Gatt, Aberdeen and UoM
… and others
Outline
• Maltese Language
• MLRS
– Corpus
– Lexicon
• Conclusion
• Demo
Maltese Language
• National language of the Maltese Islands (along with
English).
– c.1M native speakers (Malta, Australia, Canada, UK)
• Real language
• Mixed Language
– Arabic: kelb (dog)
– Romance: karozza (car)
– English: swiċċ; ners; owkej
• Latin script + some special characters
– ċ, ġ, ħ, ż, għ, ie
• Vowels are written (unlike Arabic)
– kiteb
Semitic Morphology
• Root-and-template based
• Root has 3 consonants
e.g. "k t b"
• Template is a pattern of consonants and
vowels e.g. CVCVC
• Vocalism = 2 vowels e.g. "i e"
• Word formed by interdigitation
• interdigitate(ktb, ie, CVCVC) → kiteb
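The interdigitation step above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function signature mirrors the slide's interdigitate(root, vocalism, template) notation, and the assumption that templates contain only the slot symbols C and V is mine.

```python
def interdigitate(root: str, vocalism: str, template: str) -> str:
    """Fill each C slot with the next root consonant and each V slot with the next vowel."""
    consonants = iter(root)
    vowels = iter(vocalism)
    out = []
    for slot in template:
        if slot == "C":
            out.append(next(consonants))
        elif slot == "V":
            out.append(next(vowels))
        else:
            # Assumption: templates use only the symbols C and V.
            raise ValueError(f"unknown template slot: {slot!r}")
    return "".join(out)


print(interdigitate("ktb", "ie", "CVCVC"))  # kiteb
```

The same call with the root "ħdm" and the vocalism "ae" gives ħadem, the form listed on the next slide.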
Semitic Morphology
• ħadem: work (verb);
• ħaddiem: worker;
• ħidma: work (noun);
• ħadem: be worked (verb passive);
• ħaddem: caused to work.
Plural Formation
• Sound Plural: formed by suffixes
– (a) Romance: karozza/karozzi (car), tappit/tappiti (carpet)
– (b) Semitic: ikla/ikliet (food)
• Broken Plural: change of stem, drop of vowel
– qamar/qmura
– tifel/tfal
– ġdid/ġodda (new)
– tappit/twapet (carpet)
Morpho-Syntactic Features
• Verb-less sentences
Il-karozza ġdid/the car is new
• Construct state (inalienable possession)
Id it-tifel/the boy's hand
• Sun-letters
ix-xemx/the sun
it-tifel/the boy
Construct State
• Id it-tifel fil-but
• Gloss: Id = hand (def); it-tifel = the boy; fil-but = in the pocket
• The boy's hand (is) in the pocket
Verbs with Semitic Inflections
Italian Borrowing
• spjega: explain (It. spiegare)
• jispjega: he explains
• nispjegaw: we explain
• spjegat: she explained
• spjegajt: I explained, etc.
English Borrowing
• ixxuttja: kick a football (Eng. shoot)
• jixxuttja: he kicks
• nixxuttjaw: we kick
• ixxuttjat: she kicked
• ixxuttjajt: I kicked, etc.
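The two borrowed paradigms on this slide follow the same affix pattern, which can be sketched as below. The rule for the linking "i" (inserted only when the stem begins with a consonant) is an assumption generalised from just these two verbs, and the function names are hypothetical.

```python
VOWELS = set("aeiou")


def imperfect(stem: str, prefix: str, plural: bool = False) -> str:
    """Prefix conjugation: person prefix, optional linking 'i', stem, plural -w."""
    # Assumption from spjega -> jispjega vs ixxuttja -> jixxuttja:
    # a linking "i" is needed only before a consonant-initial stem.
    link = "" if stem[0] in VOWELS else "i"
    return prefix + link + stem + ("w" if plural else "")


def perfect(stem: str, suffix: str) -> str:
    """Suffix conjugation: the stem followed by a person suffix such as -t or -jt."""
    return stem + suffix


for stem in ("spjega", "ixxuttja"):
    print(
        imperfect(stem, "j"),               # he ...
        imperfect(stem, "n", plural=True),  # we ...
        perfect(stem, "t"),                 # she ...ed
        perfect(stem, "jt"),                # I ...ed
    )
```

The loop reproduces the eight inflected forms listed on the slide for the two verbs.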
Clitic Pronouns
• bgħatthielux
• bgħat − t − hie − lu − x
• send − past 1SM − to her − it − not
• I didn't send it to her
Summary
• Mixed language
• Morphology and syntax more mixed together than in other European
languages (typical of Semitic languages)
• Empirical work needs to be carried out to
establish correct morphosyntactic
description.
• Lack of systematic language resources
Language Resources
• Natural language processing systems and
tools,
• Linguistic research that yields new
knowledge about the language itself, and
• Language-related industries such as
software localization, translation,
publishing etc.
Maltese Language
Resource Server (MLRS)
• RTDI National Project
• Main Deliverables:
– Maltese National Corpus (Server)
– Computational Lexicon (Server)
• Subsidiary Deliverables - tools for access,
creation and maintenance of resources
– Tokeniser
– Part of Speech Tagger
– NP Chunker
Same Data, Different Services
[Diagram: Lexicon Server, API, and three interfaces: Lexicographer Interface, Teacher Interface, Researcher Interface]
Corpus
• Representative
• Accessible to
– contributors
– editors
– other users
• Multiple levels of annotation
• Word extraction
2 Dimensional Corpus
[Figure: grid of Annotation Level (Level 1, Level 2, Level 3) against Text Category (Law, News, Gov)]
Levels of Annotation
• Level 0, Source: PDF, HTML, Word, .. any!
• Level 1, Text: ASCII, UTF8
• Level 2, Text Structure: token, sentence, paragraph
• Level 3, Syntax: POS tags, NP chunks, parse trees
• Level 4, Semantics: entities, relations
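As an illustration of the step from Level 1 (plain text) to Level 2 (token, sentence, paragraph), here is a minimal regex-based sketch. It is not the MLRS tokeniser: the function name to_level2 is hypothetical, and keeping hyphenated forms such as it-tifel as a single token is an assumption made for brevity.

```python
import re


def to_level2(text: str) -> list[list[list[str]]]:
    """Return paragraphs, each a list of sentences, each a list of tokens."""
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    structured = []
    for paragraph in paragraphs:
        # Naive sentence split: whitespace that follows ., ! or ?
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        # A token is a word (optionally joined by - or ') or one punctuation mark.
        structured.append(
            [re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", s) for s in sentences]
        )
    return structured


sample = "Id it-tifel fil-but. Il-karozza ġdid.\n\nIx-xemx."
for paragraph in to_level2(sample):
    print(paragraph)
```

Python's \w is Unicode-aware for str patterns, so letters such as ġ and ħ are treated as word characters without extra configuration.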
c. 20 Text Categories
Academic
Blog
Chat
Religion
Corpus Website
Wordlist Management
• User submits text, files or page URLs.
• These resources are scanned and the words extracted from them and displayed.
• User edits the resulting lists of extracted words manually.
• User submits final version for incorporation into the wordlist database.
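The scanning step of this workflow, extracting words from submitted text and presenting them as a list, might look like the sketch below. The function name and the choice to lowercase and count occurrences are assumptions; the manual editing and database submission steps are outside its scope.

```python
import re
from collections import Counter


def extract_wordlist(text: str) -> list[tuple[str, int]]:
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    # [^\W\d_] matches any Unicode letter; hyphenated forms stay together.
    words = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())
    counts = Counter(words)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


print(extract_wordlist("Kelb, karozza, kelb. Ix-xemx!"))
# [('kelb', 2), ('ix-xemx', 1), ('karozza', 1)]
```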
Current Corpus
• 50M words at level 0, predominantly news,
legal, government. Some fiction.
• Submission requires a signed agreement
from contributors.
• Level 0
– catalogue: visible to all
– contents: only visible to submitter.
• Level 1 and higher
– catalogue and contents: visible to all
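The visibility rules above reduce to a small predicate. The sketch below is only an illustration of those rules; the function and parameter names are hypothetical and say nothing about how the MLRS server actually enforces access.

```python
def can_view(part: str, level: int, user: str, submitter: str) -> bool:
    """Decide whether `user` may see the catalogue entry or the contents of a text."""
    if part == "catalogue":
        # Catalogue entries are visible to all at every level.
        return True
    if part == "contents":
        # Level 0 contents are visible only to the submitter; level 1 and higher to all.
        return level >= 1 or user == submitter
    raise ValueError(f"unknown part: {part!r}")


print(can_view("contents", 0, user="anna", submitter="mark"))  # False
print(can_view("contents", 1, user="anna", submitter="mark"))  # True
```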
Morphosyntactic Annotation
Level III
• Tagset: a predetermined collection of tags for
Maltese (Albert Gatt/Ray Fabri)
• Brill Tagger (Brill 1996)
• Training phase – hand tagging.
• Each tag can be regarded as a set of
attribute/value pairs
• For example, the tag NCS stands for
{Cat=noun, Type=common, Num=sing}
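Reading a tag as a set of attribute/value pairs can be sketched as a positional lookup. Only the tag NCS and its expansion come from the slide; the other letter codes in the tables below are hypothetical placeholders, and the real tagset may not be strictly positional.

```python
# One (attribute, code table) pair per tag position.
TAG_FIELDS = [
    ("Cat", {"N": "noun"}),
    ("Type", {"C": "common", "P": "proper"}),  # "P" is a hypothetical code
    ("Num", {"S": "sing", "P": "plur"}),       # "P" is a hypothetical code
]


def decode_tag(tag: str) -> dict[str, str]:
    """Expand a compact tag such as NCS into attribute/value pairs."""
    if len(tag) != len(TAG_FIELDS):
        raise ValueError(f"expected {len(TAG_FIELDS)} letters, got {tag!r}")
    # An unknown letter raises KeyError from the code-table lookup.
    return {name: codes[letter] for (name, codes), letter in zip(TAG_FIELDS, tag)}


print(decode_tag("NCS"))  # {'Cat': 'noun', 'Type': 'common', 'Num': 'sing'}
```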
Tagset table. Columns: category, type, attributes (person, number, gender, attachment), example
1. noun
– common: person undefined; number singular, plural, dual, collective, unspecified; gender masculine, feminine, unspecified; attachment none, bound-t, clitic1, bound-t + clitic1. Examples: kelb, kelba, klieb, ħbiżtejn, ħobż, martri, mart, sieqek, martek
– proper: person undefined; number singular, plural, unspecified; gender masculine, feminine, unspecified; attachment undefined. Examples: Iżrael, Spanja, Il-Maltin
2. verb
– indicative: person 1, 2, 3; number singular, plural; gender masculine, feminine; attachment none, clitic1, clitic2, clitic1 + clitic2. Examples: kiteb, kitibni, kitibli, kitibhuli
– imperative: person 2; number singular, plural; gender undefined; attachment none, clitic1, clitic2, clitic1 + clitic2. Examples: ikteb, iktibni, iktibli, iktibhuli
3. pseudo-predicate: type, person, number and gender undefined; attachment clitic1. Examples: għandek, fini
4. modifier
– adjective: person undefined; number singular, plural, unspecified; gender masculine, feminine, unspecified; attachment undefined. Examples: kbir, kbira, kbar, intelliġenti
– adverb: person, number, gender and attachment undefined. Example: immedjatament
5. participle
– active: person undefined; number singular, plural; gender masculine, feminine; attachment undefined. Examples: rieqed, rieqda, reqdin
– passive: person undefined; number singular, plural; gender masculine, feminine; attachment undefined. Examples: miktub, armat, pinġut, garantit
Lexicon - Aims
• Broad coverage
• Support for different kinds of lexical
information
– Syntactic (Part of Speech + other)
– Phonetic Spelling
– Translation (En)
• Interaction with linguist over Internet
Lexicon Construction: Workflow
• Extract wordlists from text (automatic)
• Identify/correct headwords (semiautomatic)
– Alignment techniques (Dalli 2001)
– Automatic prefix/suffix recognition (Attard
2004)
• For each headword, construct lexical entry
(manual)
• Led (Lexicon Editor)
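To give a feel for the semiautomatic headword step, the sketch below groups extracted word forms by a candidate stem obtained through naive suffix stripping. It is not the method of Dalli (2001) or Attard (2004): the suffix list, the minimum stem length and the function names are all hypothetical, and broken plurals such as tifel/tfal are out of its reach.

```python
from collections import defaultdict

SUFFIXES = ("iet", "i", "a")  # hypothetical list, taken from the sound plurals shown earlier


def candidate_stem(word: str, suffixes=SUFFIXES, min_stem: int = 3) -> str:
    """Strip the longest matching suffix that leaves at least `min_stem` letters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def group_by_stem(words) -> dict[str, list[str]]:
    """Group distinct word forms under their candidate stem for manual review."""
    groups = defaultdict(list)
    for word in sorted(set(words)):
        groups[candidate_stem(word)].append(word)
    return dict(groups)


forms = ["karozza", "karozzi", "tappit", "tappiti", "ikla", "ikliet"]
for stem, members in group_by_stem(forms).items():
    print(stem, members)
```

Each group is only a suggestion for the linguist, who still picks the headword and corrects wrong groupings by hand.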
Lexicon Editor
Object Description Language
• OO language for handling dependencies
between lexical fields.
• Primarily affects linguist interface.
• An ODL description contains the following
parts in order:
– Enumeration Declarations
– Class Declarations
– Rules (Optional)
– Macro Definitions (Optional)
ODL Example
enum Number { Singular, Plural, Dual }
class NOUN
{
    Cat    = noun;
    Type   = common | proper;
    Number = *;
}
class PRONOUN: NOUN
{ Case = *; }
if (Number == Plural){ !Gender }
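One possible reading of the ODL example, expressed in Python, is sketched below. The semantics are assumptions rather than documented ODL behaviour: "*" is taken to mean "any value", "PRONOUN: NOUN" to mean that PRONOUN inherits NOUN's fields, and "!Gender" to mean that a plural entry carries no Gender field. The validate function is hypothetical.

```python
from enum import Enum


class Number(Enum):
    SINGULAR = "Singular"
    PLURAL = "Plural"
    DUAL = "Dual"


# Each class maps a field to its allowed values; None stands in for ODL's "*".
NOUN = {"Cat": {"noun"}, "Type": {"common", "proper"}, "Number": None}
PRONOUN = {**NOUN, "Case": None}  # inherits NOUN's fields and adds Case


def validate(cls: dict, entry: dict) -> list[str]:
    """Return a list of problems found in a lexical entry (empty if none)."""
    errors = []
    for field, allowed in cls.items():
        if field not in entry:
            errors.append(f"missing {field}")
        elif allowed is not None and entry[field] not in allowed:
            errors.append(f"bad value for {field}")
    # Assumed reading of: if (Number == Plural){ !Gender }
    if entry.get("Number") == Number.PLURAL and "Gender" in entry:
        errors.append("Gender not allowed when Number is Plural")
    return errors


entry = {"Cat": "noun", "Type": "common", "Number": Number.PLURAL, "Gender": "fem"}
print(validate(NOUN, entry))  # ['Gender not allowed when Number is Plural']
```

In an editor, a rule of this kind would most naturally drive the interface, for example by hiding the Gender field once Plural is selected, rather than rejecting entries after the fact.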
Current Status
• Website (http://mlrs.cs.um.edu.mt)
– User Classes (public; linguist; administrator)
• Corpus
– Web interface
– Tools level0; level1; level 2
– Collection approx 50MB @ level 0
• Lexicon
– Editor/Browser
– ODL version 0
Future Work
• Manual annotation:
– POS annotation to train tagger
– Migration of level 0 to level 1
• Morphological component
– Morphological analyser/synthesiser
– Relationships between lexical entries
• HPSG integration. Stefan Muller, Saarbruecken.
• Compatibility/Integration with existing lexical resources
(cf WordNet)
• Language-enabled tools.
– Spellchecker
– IE
– Translation
Inheritance and Morphology
jaslu
{ pers=1, mamma=wasal, num=plur }
Conclusion
• Cross-disciplinary (Ling/CLing/CS) project
presents challenges.
• Training of automatic tagger has been a
bottleneck.
• Stable funding/support required beyond
life of project
Valletta in Winter