Building of the Polish Wordnet The First Steps of The Project


Ewa Rudnicka, Marek Maziarz, Maciej Piasecki
G4.19 Research Group
Institute of Informatics,
Wrocław University of Technology
nlp.pwr.wroc.pl
plwordnet.pwr.wroc.pl
What is a wordnet?
WordNet – a lexico-semantic
database
Princeton WordNet (Fellbaum 1998)
a huge electronic lexical database –
a kind of thesaurus, yet of a much more advanced
structure
Words grouped into synonym sets called synsets
Synsets linked via different lexico-semantic
relations (such as synonymy, near-synonymy,
hypernymy/hyponymy, meronymy/holonymy, antonymy,
fuzzynymy)
the integration of lexical data gathered from
existing resources such as traditional and electronic
dictionaries as well as from corpora
psycholinguistic principles – the structure of
human lexical memory (cf. Miller 1998)
taxonomic hierarchies for nouns, entailment
relations for verbs
Multi-lingual wordnets
Multilingual databases consisting of interlinked national monolingual wordnets:
EuroWordNet - transfer method: translation from Princeton WordNet
Dutch, Spanish, Italian, French, German, Czech and Estonian (cf. Vossen 2002)
MultiWordNet - semi-automatic acquisition method from the Princeton WordNet
Italian, Spanish, Portuguese, Romanian and Latin (Bentivogli et al.)
IndoWordNet (Sinha et al. 2006, Bhattacharyya 2010)
expansion approach from the Hindi wordnet; 16 out of 22 languages of India
plWordNet (Słowosieć)
plWordNet – developed fairly independently of
Princeton WordNet by applying a unique corpus-based method
one of the biggest existing wordnets
the emphasis is on relations between lexical units,
not between synsets
many more relations, some of them specially
designed to cover the peculiarities of the
morphosyntactic structure of Polish
(cf. Piasecki et al. 2009, Maziarz et al. 2012)
plWordNet vs. Princeton WordNet
Basic common concepts:
Lemma – a base form representing different inflectional forms and
different meanings
Lexical unit – a lemma-sense pair (in wordnets the sense is marked with a number)
Synset – a set of synonymous lexical units
Differences:
plWN – synsets built of lexical units sharing the same constitutive
relations (such as hyponymy, hypernymy, meronymy, holonymy)
PWN – a synset represents a 'lexicalised concept' (cf. Miller 1998); synsets are
built of lexical units linked by the synonymy relation, understood as a
conceptual relation established on the basis of linguists' intuitions and
dictionary definitions
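The three basic concepts can be sketched as a small data model. This is only an illustration: the class names `LexicalUnit` and `Synset` are assumptions and do not reflect the actual plWordNet or Princeton WordNet storage format. The example synset is the one used later in the mapping procedure.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LexicalUnit:
    """A lemma paired with a sense number, e.g. 'zagranica 1'."""
    lemma: str
    sense: int

    def __str__(self) -> str:
        return f"{self.lemma} {self.sense}"


@dataclass(frozen=True)
class Synset:
    """A set of synonymous lexical units (kept in a tuple to preserve order)."""
    units: Tuple[LexicalUnit, ...]

    def lemmas(self) -> Tuple[str, ...]:
        return tuple(unit.lemma for unit in self.units)

    def __str__(self) -> str:
        return "{" + ", ".join(str(unit) for unit in self.units) + "}"


# Hypothetical helper for building a synset from (lemma, sense) pairs.
def make_synset(*pairs: Tuple[str, int]) -> Synset:
    return Synset(tuple(LexicalUnit(lemma, sense) for lemma, sense in pairs))


abroad = make_synset(("zagranica", 1), ("obczyzna", 1), ("obce terytorium", 1))
print(abroad)  # {zagranica 1, obczyzna 1, obce terytorium 1}
```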
Mapping plWordNet on Princeton WordNet
Linking plWordNet synsets with Princeton Wordnet synsets
Defining a set of inter-lingual relations
Setting a hierarchy of inter-lingual relations
Designing mapping procedure
Mapping direction: plWordNet > Princeton WordNet
Domains selected for mapping:
person, artefact, location, family relationships, food, time,
vocabulary connected with thinking and communication
a novel perspective – linking two independent systems
the main challenge – different philosophical, theoretical and
methodological assumptions
Inter-lingual relations hierarchy
A set of inter-lingual relations inspired by:
- inter-lingual relations from EuroWordNet (Vossen 2002)
- intra-lingual relations from plWordNet (Maziarz et al. 2011)
1. Synonymy
2. Partial synonymy
3. Inter-register synonymy
4. Hyponymy
5. Hypernymy
6. Meronymy
7. Holonymy
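A minimal sketch of the hierarchy as an ordered enumeration. It assumes that the numbering above is an order of preference (lower number = tried first); the names `InterLingualRelation` and `preferred` are illustrative only.

```python
from enum import IntEnum


class InterLingualRelation(IntEnum):
    # Values follow the numbering of the list above (assumed order of preference).
    SYNONYMY = 1
    PARTIAL_SYNONYMY = 2
    INTER_REGISTER_SYNONYMY = 3
    HYPONYMY = 4
    HYPERNYMY = 5
    MERONYMY = 6
    HOLONYMY = 7


def preferred(candidates):
    """Return the highest-ranked relation among those that qualify."""
    return min(candidates)


best = preferred({InterLingualRelation.MERONYMY, InterLingualRelation.HYPONYMY})
print(best.name)  # HYPONYMY
```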
Inter-lingual relations (1)
Synonymy (only one per synset) - for large correspondence in sense and
position in the source wordnet structure, combined with many indirect inter-lingual links between the source and target synsets
Inter-register synonymy - for inter-lingual synonyms as defined above, but differing in
stylistic register
Partial synonymy - for partial correspondence of meanings and/or
structures
Partial synonymy
Inter-lingual relations (2)
Inter-lingual hyponymy defined in terms of inclusion of set denotation:
a hyponym refers to an object which is included in the denotation set
of a hypernym
Inter-lingual hypernymy defined in terms of inclusion of set denotation;
a hypernym refers to an object that includes hyponyms in its
denotation set
Inter-lingual meronymy for parts, elements or materials of bigger wholes
Inter-lingual holonymy for a whole made of smaller parts, elements or materials
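The set-inclusion definitions of hyponymy and hypernymy can be illustrated with ordinary Python sets. The denotation sets below are toy placeholders invented for this sketch, not wordnet data, and the function names are assumptions.

```python
def is_interlingual_hyponym(source_denotation, target_denotation):
    """The source is a hyponym if its denotation is properly included in the target's."""
    return source_denotation < target_denotation  # proper subset


def is_interlingual_hypernym(source_denotation, target_denotation):
    """Hypernymy is the inverse: the source denotation includes the target's."""
    return is_interlingual_hyponym(target_denotation, source_denotation)


# Toy denotations (hypothetical objects a, b, c, d).
narrow = {"a", "b"}
broad = {"a", "b", "c", "d"}

print(is_interlingual_hyponym(narrow, broad))   # True
print(is_interlingual_hypernym(narrow, broad))  # False
```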
Mapping procedure (1)
Recognizing the sense of a source synset:
- checking its position in the network structure
(all existing relations, with an emphasis on hypernym(s) and hyponyms;
definitions and commentaries; comparison with other synsets containing the given
lemma)
Example:
{zagranica 1, obczyzna 1, obce terytorium 1}:
- is a hyponym of {obszar 1, terytorium 1, obręb 1, strefa 1, zona 1, rejon 3}
commentary: 'ograniczona część przestrzeni, zwykle dużych rozmiarów, określona
powierzchnia czegoś (np. obszar państwa)'
'a limited part of space, usually of large size, a defined surface of something (e.g. the territory of a state)'
- is a meronym of {świat 3, nieznane 1} – 'world, unknown territory'
- is a fuzzynym of {granica państwa 1} – 'state border'
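Checking a synset's position in the network amounts to collecting its outgoing relations. A minimal sketch, using only the three links listed in the example above; the edge-list representation and the function name `position` are assumptions, not the plWordNet tooling.

```python
from collections import defaultdict

SOURCE = "{zagranica 1, obczyzna 1, obce terytorium 1}"

# (source synset, relation, related synset), copied from the example above.
EDGES = [
    (SOURCE, "hyponym of",
     "{obszar 1, terytorium 1, obręb 1, strefa 1, zona 1, rejon 3}"),
    (SOURCE, "meronym of", "{świat 3, nieznane 1}"),
    (SOURCE, "fuzzynym of", "{granica państwa 1}"),
]


def position(synset, edges):
    """Group the synsets related to `synset` by relation name."""
    grouped = defaultdict(list)
    for src, relation, dst in edges:
        if src == synset:
            grouped[relation].append(dst)
    return dict(grouped)


for relation, targets in position(SOURCE, EDGES).items():
    print(relation, "->", ", ".join(targets))
```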
Mapping procedure (2)
Searching for a target synset:
- choosing candidates for a target synset with the help of intuitions,
automatic prompts and dictionaries:
e.g. {foreign country 1} - 'any state of which one is not a citizen' - is a hyponym of
{state 1, nation 1, country 1, land 9, commonwealth 2, res publica 1, body politic 1}
('a politically organized body of people under a single government')
- verifying candidates for a target synset: comparing their hypernymic and hyponymic
structures (and others, if they exist) with those of the source synset; checking the existing
and/or potential inter-lingual relations, definitions, commentaries and dictionaries
{state 1, ..} is an inter-lingual hyponym of {państwo 1, kraj 1} - 'zorganizowana politycznie społeczność, zamieszkująca określone terytorium, z
niepodległą formą rządów' - 'a politically organised community, inhabiting a certain
territory, with an independent form of government'
Mapping procedure (3)
Choosing a target synset and an inter-lingual relation: {foreign country 1}
Synonymy – no (different meaning, structures and relations)
Hyponymy – no (meaning, structures and relations do not qualify as a subtype)
Meronymy – yes (meaning, structures and relations qualify as a part)
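Linking the source synset with the target synset: {zagranica 1, obczyzna 1, obce terytorium 1} is linked to {foreign country 1} by inter-lingual meronymy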
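The choice of relation can be sketched as trying the relations in hierarchy order and taking the first one that qualifies. In the project the judgement is made by a linguist, so the `qualifies` callback below merely stands in for that decision. The verdicts are the three stated on this slide; treating the relations not mentioned on the slide as "no" is an assumption of this sketch.

```python
from typing import Callable, Optional, Sequence

# Order taken from the inter-lingual relations hierarchy.
RELATION_ORDER = (
    "synonymy",
    "partial synonymy",
    "inter-register synonymy",
    "hyponymy",
    "hypernymy",
    "meronymy",
    "holonymy",
)


def choose_relation(
    qualifies: Callable[[str], bool],
    order: Sequence[str] = RELATION_ORDER,
) -> Optional[str]:
    """Return the first relation in `order` that the judge accepts, else None."""
    for relation in order:
        if qualifies(relation):
            return relation
    return None


# Verdicts from the slide for {zagranica 1, ...} -> {foreign country 1}.
slide_verdicts = {"synonymy": False, "hyponymy": False, "meronymy": True}

# Relations not mentioned on the slide default to False (assumption).
chosen = choose_relation(lambda relation: slide_verdicts.get(relation, False))
print(chosen)  # meronymy
```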
Results of inter-lingual mapping
About 46 500 inter-lingual links/relations between synsets which
amounts to about 50 000 relations between lexical units
• Synonymy - 15268
• Partial synonymy - 971
• Inter-register synonymy - 676
• Hyponymy - 23677
• Hypernymy - 3526
• Meronymy - 1898
• Holonymy - 555
• Mapped branches: people, artefacts, places, food, time units,
communication (partly), states and processes (partly), body parts
(partly), group names (partly)
Mapping direction: plWordNet – Princeton WordNet
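As a quick arithmetic check, the per-relation counts above can be summed and converted to shares. The figures are those on this slide; the variable names are illustrative.

```python
# Inter-lingual links per relation, as reported above.
counts = {
    "Synonymy": 15268,
    "Partial synonymy": 971,
    "Inter-register synonymy": 676,
    "Hyponymy": 23677,
    "Hypernymy": 3526,
    "Meronymy": 1898,
    "Holonymy": 555,
}

total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total)  # 46571, consistent with "about 46 500"
for name, share in shares.items():
    print(f"{name}: {share}%")
```

Hyponymy accounts for roughly half of all links and synonymy for roughly a third.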
Types of differences between plWN and PWN
Inter-lingual lexico-grammatical differences:
- marked forms (diminutives, augmentatives)
- lexicalised gender
- lexical gaps
Differences in the definition of synonymy and synset:
- 'Mixed' PWN synsets: marked and unmarked forms, feminine and masculine,
countable and uncountable, hypernym and hyponym
- hypernymy: 'and' (plWN) vs. 'and/or' (PWN)
Other differences:
- synset definitions incompatible with relations (PWN)
- different relations used for coding the same conceptual dependencies
- more fine-grained meaning differentiation
- differences boiling down to the content and size of resources
Marked forms
Differences in lexicalisation
Hyponymy
Different relations for coding the same conceptual dependencies
References
Fellbaum, Ch. (ed). 1998. WordNet: An Electronic Lexical Database. MIT Press:
Cambridge, Massachusetts.
Maziarz, M., Piasecki, M. and S. Szpakowicz. 2012. Approaching plWordNet
2.0. Proceedings of the 6th Global Wordnet Conference, Matsue, pp. 189-196. Accepted for publication.
Piasecki, M., Szpakowicz, S. and B. Broda. 2009. A Wordnet from the Ground
Up. Oficyna Wydawnicza Politechniki Wrocławskiej: Wrocław.
Princeton WordNet http://wordnet.princeton.edu/wordnet/
Słowosieć http://plwordnet.pwr.wroc.pl/wordnet/
Vossen, P. (ed). 2002. EuroWordNet. General Document. Amsterdam.