UGTag: morphological analyzer and tagger for Ukrainian

Natalia Kotsyba
Andriy Mykulyak
Igor V. Shevchenko
UGTag as a set of NLP tools
• developed within the Polish-Ukrainian parallel corpus project to
provide grammatical annotation for its Ukrainian part
• inspired by TaKIPI, a functionally similar toolset for Polish
• unified output format for both language parts of the corpus
• suitable for searching with programmes such as Poliqarp
Differences from TaKIPI:
• interactive annotation of texts with manual disambiguation
• modular design allows plugging in additional grammatical
dictionaries as well as modifying the existing ones
• code for UGTag was written from scratch
Programme architecture
UGTag package
• enriches raw texts with grammatical information taken
by default from UGD (Ukrainian Grammatical
Dictionary)
• data in UGD are stored in a relational database
• 180 thousand lemmas
• 56 thousand endings
• more than 2000 paradigmatic classes
• major part of the data was transformed into a set of
XML files and adjusted for specific UGTag needs
• any compatible dictionary can be used instead of or alongside UGD
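The relationship between lemmas, endings and paradigmatic classes can be illustrated with a minimal sketch. It assumes that each lemma is stored as a stem plus a paradigm class and that each class lists endings with their tags. The slides do not describe the actual UGD schema, so the names `PARADIGMS`, `LEMMAS` and `build_index`, the class name `f_a_hard` and the tag labels are all hypothetical.

```python
from collections import defaultdict

# Hypothetical paradigm class: endings paired with illustrative tag labels.
PARADIGMS = {
    "f_a_hard": [
        ("а", "noun:sg:nom:f"),
        ("и", "noun:sg:gen:f"),
        ("у", "noun:sg:acc:f"),
    ],
}

# Hypothetical lemma entries: (lemma, stem, paradigm class).
LEMMAS = [("мама", "мам", "f_a_hard")]


def build_index(lemmas, paradigms):
    """Expand every lemma into its wordforms: wordform -> [(lemma, tag), ...]."""
    index = defaultdict(list)
    for lemma, stem, paradigm_class in lemmas:
        for ending, tag in paradigms[paradigm_class]:
            index[stem + ending].append((lemma, tag))
    return dict(index)


index = build_index(LEMMAS, PARADIGMS)
print(index["маму"])
```

Storing endings once per class rather than once per lemma is what lets a comparatively small set of endings and classes cover a large number of lemmas.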
Stages of analysis
• pre-processing stage: tokenization and
chunking
• morphological tagging
• disambiguation
Process of analysis
Stage: reader
• Role: separates the different external representations of the text from
its internal representation (one or more character sequences); in other
words, it converts the text to a standard format
• Input: text in different formats
• Output: one or more sequences of characters (usually one sequence per
line of the input file)
Stage: tokenizer
• Role: splits sequences into tokens (the smallest meaningful pieces of
text)
• Input: character sequence
• Output: list of tokens
Stage: tagger
• Role: adds morphological information to tokens (additionally, it can
split or group some tokens based on their meaning, e.g. abbreviations or
complex words like „зелено-червоний”, "green-red")
• Input: list of tokens
• Output: list of tokens with morphological information attached to each
token
Stage: sentencer
• Role: groups tokens into sentences
• Input: list of annotated tokens
• Output: list of sentences
Stage: disambiguator
• Role: chooses the appropriate grammatical interpretation of each token
• Input: list of annotated tokens (optionally augmented with the list of
sentences)
• Output: list of tokens with the most probable annotation
Stage: writer
• Role: converts the list of tokens to the format most appropriate for the
user
• Input: list of tokens, list of sentences
• Output: file with annotations in the specified format
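The chain of stages can be sketched as a sequence of small functions, each consuming the output of the previous one. This is only an illustration of the data flow, not UGTag's actual code: the function names mirror the stage names, while the `Token` class, the toy dictionary, the tag labels and the word/tag output format are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str
    # Candidate (lemma, tag) pairs; the tag labels used here are illustrative.
    interpretations: list = field(default_factory=list)


def reader(raw_text):
    """Convert the input to character sequences, one per non-empty line."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


def tokenizer(sequences):
    """Split sequences into tokens (whitespace split only, for brevity)."""
    return [Token(piece) for seq in sequences for piece in seq.split()]


def tagger(tokens, dictionary):
    """Attach every interpretation found in the dictionary to each token."""
    for token in tokens:
        token.interpretations = list(dictionary.get(token.text.lower(), []))
    return tokens


def sentencer(tokens):
    """Group tokens into sentences at sentence-final punctuation."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token.text in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def disambiguator(tokens, sentences=None):
    """Keep one interpretation per token (here simply the first one listed)."""
    for token in tokens:
        token.interpretations = token.interpretations[:1]
    return tokens


def writer(tokens, sentences):
    """Render one sentence per line as word/tag pairs."""
    lines = []
    for sentence in sentences:
        parts = []
        for token in sentence:
            tag = token.interpretations[0][1] if token.interpretations else "unknown"
            parts.append(f"{token.text}/{tag}")
        lines.append(" ".join(parts))
    return "\n".join(lines)


def analyse(raw_text, dictionary):
    tokens = tagger(tokenizer(reader(raw_text)), dictionary)
    sentences = sentencer(tokens)
    tokens = disambiguator(tokens, sentences)
    return writer(tokens, sentences)


toy_dictionary = {"кіт": [("кіт", "noun")], "спить": [("спати", "verb")]}
print(analyse("Кіт спить .", toy_dictionary))
```

Because every stage only depends on the shape of its input, any one of them can be swapped for a more capable implementation without touching the others, which matches the modular design described earlier.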
Premorphological analysis
Procedures that do not involve the use of the grammatical dictionary
Reading phase and input formats
• plain, HTML or XML texts → XML files structured according to the
XCES standard
• strips all tags from input HTML or XML files and turns them into raw
texts
• user-defined file readers that take into account logical mark-up of
input XML files and incorporate it into the output XML format
• file reader separates the external representation of texts from their
unified internal representation fed to the tokenizer
• extracts the text itself, possibly splitting it into chunks for further
processing
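Tag stripping during the reading phase can be sketched with Python's standard `html.parser` module. The class `TextExtractor` and the function `read_markup` are hypothetical names, and the choice of one chunk per non-empty line is an assumption based on the remark that the reader usually produces one sequence per input line.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the character data of a document, discarding all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def read_markup(markup):
    """Strip tags and return the raw text as a list of non-empty chunks."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    text = "".join(parser.parts)
    return [line.strip() for line in text.splitlines() if line.strip()]


chunks = read_markup("<p>Перший рядок</p>\n<p>Другий <b>рядок</b></p>")
print(chunks)
```

A user-defined reader that preserves logical mark-up would instead record selected tags in `handle_starttag` and `handle_endtag` and pass them on to the writer.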
Tokenizer
• first divides chunks into blocks delimited by whitespace
characters
• block can consist of one or more tokens, e.g. a quote and a
word with no white space in between (”token).
• next divides blocks into tokens that are minimal structural
units
• five categories of tokens: words, numbers, punctuation marks,
whitespace characters and unrecognized tokens
• word is a sequence of alphabetical characters with an optional
hyphen
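The two-step procedure (blocks first, then tokens) can be sketched with regular expressions. The category names and the function `tokenize` are assumptions for illustration. The word pattern follows the definition given above, letters with optional internal hyphens, so other word-internal characters such as apostrophes are not covered by this sketch.

```python
import re

# One alternative per non-whitespace token category; alternatives are tried in order.
TOKEN_RE = re.compile(
    r"(?P<word>[^\W\d_]+(?:-[^\W\d_]+)*)"  # letters, optionally joined by hyphens
    r"|(?P<number>\d+)"                     # digit sequences
    r"|(?P<punctuation>[^\w\s])"            # a single punctuation mark or symbol
    r"|(?P<unrecognized>.)"                 # anything else, e.g. an underscore
)


def tokenize(chunk):
    """Split a chunk into whitespace-delimited blocks, then blocks into tokens."""
    tokens = []
    for block in re.split(r"(\s+)", chunk):
        if not block:
            continue
        if block.isspace():
            tokens.append(("whitespace", block))
            continue
        for match in TOKEN_RE.finditer(block):
            tokens.append((match.lastgroup, match.group()))
    return tokens


print(tokenize("”зелено-червоний 23_"))
```

In this example the first block, a quotation mark immediately followed by a word, yields two tokens, which is the case mentioned in the list above.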
Grammatical dictionary
• grammatical information in UGD was restructured and subdivided into
finer categories to meet the requirements of the intended tagsets:
• compatible with MULTEXT-East, Version 4
• common tagset for Polish and Ukrainian [Kotsyba, Turska, Shypnivska
2008] slightly modified and simplified to achieve this compatibility
• the category of degree of comparison for adjectives and adverbs was
reintroduced, and adjectives and adverbs were regrouped and
relemmatized accordingly
• category of predicatives was regrouped based on the conclusions in
[Derzhanski, Kotsyba 2008]
• word splitting: original UGD collocations containing white space characters
or hyphens are split and their parts treated as individual units
• information about those combinations is preserved and can be used for
syntactic analysis in the future
Morphological analysis
• users can watch the progress of tagging as it goes
• tagged tokens of different categories are displayed on
screen, colour-coded:
• unrecognized tokens (red)
• words with only one available grammatical
interpretation (green)
• words with multiple grammatical interpretations (blue)
• panel in the top right corner displays grammatical
characteristics of the selected item
• manual disambiguation is possible for words with
multiple available interpretations
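The colour scheme depends only on how many interpretations the tagger found for a token, as the following small sketch shows. The function name `colour_for` is hypothetical.

```python
def colour_for(interpretations):
    """Map the number of available interpretations to a display colour."""
    if not interpretations:
        return "red"      # unrecognized token
    if len(interpretations) == 1:
        return "green"    # exactly one interpretation
    return "blue"         # ambiguous: a candidate for manual disambiguation


print(colour_for([]), colour_for(["prep"]), colour_for(["prep", "noun"]))
```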
Automatic disambiguation
• rudimentary automatic disambiguation based on
statistical analysis for a small but frequently used
word class of prepositions
• “до”: 15 grammatical interpretations, one as a
preposition and 14 covering all possible grammatical
characteristics of the invariable noun “до”
(musical note)
• “на”: colloquial use as an interjection
• further disambiguation policy foresees a
combination of rules and statistical analysis of
manually disambiguated data
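One simple way to use statistics from manually disambiguated data is a frequency baseline: for each wordform, prefer the interpretation that annotators chose most often. The slides do not specify which statistic UGTag uses, so this is an assumed stand-in; the function names and the tag labels `prep`, `noun` and `interj` are illustrative only.

```python
from collections import Counter, defaultdict


def train(disambiguated):
    """Count how often each tag was chosen for each wordform."""
    counts = defaultdict(Counter)
    for form, tag in disambiguated:
        counts[form.lower()][tag] += 1
    return counts


def disambiguate(form, candidates, counts):
    """Pick the candidate tag seen most often for this wordform.

    Ties, including the case of a wordform never seen in training,
    are resolved in favour of the first candidate listed.
    """
    if not candidates:
        return None
    seen = counts.get(form.lower(), Counter())
    return max(candidates, key=lambda tag: seen[tag])


# Toy training data: "до" chosen as a preposition three times, as a noun once.
counts = train([("до", "prep"), ("до", "prep"), ("До", "prep"), ("до", "noun")])
print(disambiguate("до", ["noun", "prep"], counts))
```

Such a baseline suits words like “до”, where one reading dominates overwhelmingly; rarer readings would need the contextual rules mentioned in the last bullet.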
Enriching the dictionary database
• during annotation UGTag automatically creates a list of
words not found in the dictionary and displays it to the
user, who can add them to one of the user
dictionaries
• list of words not recognized by the active built-in
dictionary is displayed
• user can select a word from this list and add it to the
dictionary
• programme gives hints as to the paradigm of the word
• definition of the wordforms can be done manually
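The collection of unknown words and the paradigm hints can be sketched as follows. How UGTag actually derives its hints is not stated in the slides; matching a word against the endings of each paradigm class, longest ending first, is an assumption made for this example, and all names and sample classes are hypothetical.

```python
def unknown_words(words, dictionary):
    """Return words missing from the dictionary, de-duplicated, in text order."""
    seen = {}
    for word in words:
        key = word.lower()
        if key not in dictionary and key not in seen:
            seen[key] = None
    return list(seen)


def paradigm_hints(word, paradigms):
    """Suggest paradigm classes with an ending that matches the end of the word.

    Classes are ranked by the length of their longest matching ending.
    """
    hints = []
    for paradigm_class, endings in paradigms.items():
        matching = [e for e in endings if e and word.endswith(e)]
        if matching:
            hints.append((max(len(e) for e in matching), paradigm_class))
    return [cls for _, cls in sorted(hints, key=lambda hint: -hint[0])]


sample_paradigms = {"f_a": ["а", "и", "у", "ою"], "m_zero": ["", "а", "у", "ом"]}
print(unknown_words(["Мама", "бачить", "мама"], {"бачить": []}))
print(paradigm_hints("мамою", sample_paradigms))
```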
Adding a new word
Sentencing
• sentence splitting is rule-based and some of
those rules require grammatical information
• the rules implemented so far are partially based
on Rudolf’s work for Polish [Rudolf 2004]
• heuristics that use popular abbreviations and
words starting with a capital letter, whose
meaning is also taken into account
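The abbreviation and capital-letter heuristics can be illustrated with a minimal rule-based splitter. It is a sketch of the general idea only, not a reimplementation of the rules from [Rudolf 2004]: the abbreviation list is a small hypothetical sample, whitespace tokens are omitted, and no grammatical information is consulted.

```python
# Hypothetical sample of common abbreviations written with a final full stop.
ABBREVIATIONS = {"вул", "м", "р", "див"}
TERMINATORS = {".", "!", "?"}


def split_sentences(tokens):
    """Group a flat token list into sentences using two simple rules."""
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if token not in TERMINATORS:
            continue
        previous = tokens[i - 1].lower() if i > 0 else ""
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        # Rule 1: a full stop right after a known abbreviation does not end a sentence.
        after_abbreviation = token == "." and previous in ABBREVIATIONS
        # Rule 2: the next token, if there is one, must start with a capital letter.
        next_is_capitalised = following is None or following[:1].isupper()
        if not after_abbreviation and next_is_capitalised:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


tokens = ["Він", "живе", "на", "вул", ".", "Шевченка", ".", "Це", "далеко", "."]
print(split_sentences(tokens))
```

The example also shows why such rules benefit from grammatical information: a sentence that genuinely ends in an abbreviation and is followed by a proper name cannot be split correctly by these two rules alone.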
Writing phase and writing format
• two output tag formats for resulting XML files
• default format is based on TaKIPI 1.8 for Polish,
extended for Ukrainian specific features
[Kotsyba, Turska, Shypnivska 2008]
• retains maximum grammatical information that
can be provided by Polish and Ukrainian
grammatical dictionaries
• MULTEXT-East compatible tagset, which has a
coarser granularity of grammatical information
[Derzhanski, Kotsyba 2009]
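The writing phase can be sketched with Python's standard `xml.etree.ElementTree` module. The element names below (`chunk`, `tok`, `orth`, `lex`, `base`, `ctag`) follow a layout commonly seen in TaKIPI-style corpus files, but they are assumed here and not taken from the UGTag documentation. The tag strings are placeholders, not values from either tagset.

```python
import xml.etree.ElementTree as ET


def write_tokens(tokens):
    """Serialise (orth, [(lemma, tag), ...]) pairs as one XML sentence chunk."""
    chunk = ET.Element("chunk", {"type": "s"})
    for orth, interpretations in tokens:
        tok = ET.SubElement(chunk, "tok")
        ET.SubElement(tok, "orth").text = orth
        for lemma, tag in interpretations:
            lex = ET.SubElement(tok, "lex")
            ET.SubElement(lex, "base").text = lemma
            ET.SubElement(lex, "ctag").text = tag
    return ET.tostring(chunk, encoding="unicode")


xml_text = write_tokens([("до", [("до", "prep"), ("до", "noun")])])
print(xml_text)
```

Supporting a second tagset then amounts to mapping each internal tag to its coarser equivalent before the `ctag` element is written.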
Plans for further development
• depend on results of extensive experimenting
with real corpus texts
• enriching the dictionary database using both
manual and automatic ways
• enhancing the quality of automatic
disambiguation
• preliminary syntactic parsing and word grouping:
complex words like numerals, e.g. “двадцять три”
(twenty-three), currently recognized as separate
words (“двадцять” and “три”), complex passive
structures, prepositional phrases, etc.