Transcript Slide 1 - Max Planck Institute for Psycholinguistics
Nicoletta Calzolari
Istituto di Linguistica Computazionale - CNR - Pisa
Nijmegen, August 2010 1
After the “Grosseto Workshop” (1985): a turning point
It all started with the situation we had in the late ‘80s – early ‘90s, with all the Xxx-Lex projects: GeneLex, AcquiLex, MultiLex, ...
A. Zampolli: “Let’s be coherent”: EAGLES, ISLE
Key issues: Do conditions exist for standardisation effort?
Reusability as key concept (true also today)
To avoid duplication of efforts, costs, etc.
To allow synergies, integration, exchange of data, ...
To provide a model for new data creation & acquisition
Decide on “feasible” areas & state priorities (this is changing over time)
The feasibility of formulation of consensual standards as a strong sign of maturity in the field: we can’t propose standards if there are not enough results on which to base them
EAGLES was launched in ‘93
Main Results in Lexicon & Corpus WGs
First Phase (www.ilc.pi.cnr.it/EAGLES96/home.html)
Standard for morphosyntactic encoding of lexical entries, in a multi-layered structure, with applications for all the EU languages
Standard for subcategorisation in the lexicon: a set of standardised basic notions using a frame-based structure
Proposal for a basic set of notions in lexical semantics: focus on requirements of Information Systems and MT
Corpus Encoding Standard (CES), from TEI
Standard for morphosyntactic annotation of corpora, to ensure compatibility/interchangeability of concrete annotation schemata
Preliminary recommendations for syntactic annotation of corpora
Dialogue annotation, for integration of written and spoken annotation
Content vs. Format/Representation
Work on lexical description deals with two aspects:
Linguistic description (content) of lexical items
Formal representation (format) of lexical descriptions
[…] concentrated on content, not disregarding the formal representation of the proposal
In LMF: more on format/representation issues, on the abstract meta-model
Flexibility in the Recommendations
e.g. Morphosyntax
Level   Information Type                           Recommendation
L-0     Part-of-Speech                             Obligatory
L-1     Morphosyntactic features (agreement)       Recommended
L-2     Language-specific (or refined) features    Optional
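One way to read the three levels is as a validation policy: an entry must carry L-0 information, should carry L-1 information, and may carry anything at L-2. The sketch below is illustrative only. The feature names (`pos`, `gender`, `number`), the `Requirement` enum and the function `check_entry` are assumptions made for this example, not part of the EAGLES recommendations.

```python
from enum import Enum


class Requirement(Enum):
    OBLIGATORY = "obligatory"    # L-0
    RECOMMENDED = "recommended"  # L-1
    OPTIONAL = "optional"        # L-2


# Hypothetical assignment of features to levels; the real inventory is
# language-dependent and is defined by the recommendations themselves.
FEATURE_LEVELS = {
    "pos": Requirement.OBLIGATORY,
    "gender": Requirement.RECOMMENDED,
    "number": Requirement.RECOMMENDED,
}


def check_entry(entry):
    """Return (errors, warnings) for a dict of morphosyntactic features.

    Features not listed in FEATURE_LEVELS are treated as L-2 (optional)
    and accepted without comment.
    """
    errors, warnings = [], []
    for feature, level in FEATURE_LEVELS.items():
        if feature in entry:
            continue
        if level is Requirement.OBLIGATORY:
            errors.append(f"missing obligatory feature: {feature}")
        elif level is Requirement.RECOMMENDED:
            warnings.append(f"missing recommended feature: {feature}")
    return errors, warnings
```

With this reading, an entry without a part of speech is rejected, an entry without agreement features is only flagged, and language-specific refinements never block acceptance.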
MERITS
Strengths (from EAGLES-ISLE)
Standardisation as a necessary component of any strategic programme to create a coherent market
Leading […] & academics participated (> 150 EU groups)
Bottom-up community created standards
To avoid wasting time reinventing basic/consolidated knowledge
May be true also for many “humanities” users, not interested in debates on specific lexical approaches
Work otherwise duplicated among many projects, done just once in a collaborative manner (overall cost-effectiveness)
Allows the field to be more competitive:
Concentrate efforts on innovative areas
Engage in new/advanced technology
Why Standards for Language Resources?
(from EAGLES-ISLE)
To ensure:
[…] of systems (& data), through compatible interfaces and […]; important for workflows
[…] of components
[…] training
[…] essential for a LR Infrastructure, based on consensual technical specifications and models
(“gold standards”) for evaluation campaigns
evaluation & validation based on agreed criteria
transition from prototypes to […]
Applications: requirements for systems & enabling technologies
Machine Translation, Information Extraction, Information Retrieval, Summarisation, Natural Language Generation, Word Clustering, Multiword Recognition + Extraction, Word Sense Disambiguation, Proper Noun Recognition, Parsing, Coreference, …
For HLT, knowledge of […] is essential
The Multilingual ISLE Lexical Entry (MILE)
(from EAGLES)
Basic requirements for the design of the MILE:
Discover and list the (maximal) set of basic notions needed to describe the MILE (up to which level is standardisation feasible?)
Granularity
The leading principle: the edited union (redundancy is not a problem) of existing lexicons/models
Modular & layered
Allow for under-specification (& hierarchical structure)
MILE – Modularity: the building-block model
Allow to express different dimensions of lexical entries
Enable modular specification of lexical entries
Create ready-to-use packages to be combined in different ways
Lexical Classes as the main building blocks of the lexical architecture
[Figure: Lexical entries 1, 2 and 3 assembled from shared Lexical Objects: syntactic frame, slot, phrase, Syn feature, Sem feature]
Done in LMF
The MILE Data Categories: user-adaptability and extensibility
[Figure: MLC:SemanticFeature with instance_of links to the values HUMAN, ARTIFACT, EVENT, ANIMAL, GROUP, AGE and MAMMAL, divided into Core and UserDefined]
OK in ISOCat
MILE Lexical Data Category Registry
A library of pre-instantiated objects
Define (an ontology of) […]:
represent lexical notions such as semantic unit, syntactic feature, syntactic frame, semantic predicate, semantic relation, synset, etc.
specify the relevant attributes
define the relations with other classes
hierarchically structured
To be done … in ISOCat
Can be used “off the shelf” or as a departure point for the definition of new or modified categories
DC Selections
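The idea of a registry that can be used off the shelf and also extended can be sketched as follows. This is a toy illustration, not the MILE or ISOCat data model: the class `DataCategoryRegistry`, its methods, and the choice to place MAMMAL under ANIMAL as a user-defined value are all assumptions made for the example (the value names themselves come from the MILE Data Categories slide).

```python
class DataCategoryRegistry:
    """Toy registry: core values are fixed, user-defined values can be added."""

    def __init__(self, core):
        # Maps a data category name to its immutable set of core values.
        self._core = {name: frozenset(values) for name, values in core.items()}
        # Maps a data category name to {user_value: parent_value_or_None}.
        self._user = {}

    def is_valid(self, category, value):
        """True if the value is a core or user-defined value of the category."""
        return (value in self._core.get(category, ())
                or value in self._user.get(category, {}))

    def add_user_value(self, category, value, instance_of=None):
        """Register a user-defined value, optionally under an existing one."""
        if category not in self._core:
            raise KeyError(f"unknown data category: {category}")
        if instance_of is not None and not self.is_valid(category, instance_of):
            raise ValueError(f"unknown parent value: {instance_of}")
        self._user.setdefault(category, {})[value] = instance_of

    def parent_of(self, category, value):
        """Return the parent of a user-defined value, or None."""
        return self._user.get(category, {}).get(value)


registry = DataCategoryRegistry(
    {"SemanticFeature": ["HUMAN", "ARTIFACT", "EVENT", "ANIMAL", "GROUP"]}
)
# Hypothetical extension: a user adds MAMMAL as a refinement of ANIMAL.
registry.add_user_value("SemanticFeature", "MAMMAL", instance_of="ANIMAL")
```

The core set stays untouched, so resources built by different groups still agree on the shared values while each group can refine them locally.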
ISO - LMF Lexical Markup Framework
Designed to accommodate as many models of lexical representation as possible
Its pros:
ISO24613: abstract high-level specification
Based on constants defined in Data Category Registry (ISO12620): low-level specifications
Not a monolithic model, rather a modular framework
LMF library provides the hierarchy of lexical objects (with structural relations among them)
Data Category Registry provides a library of descriptors to encode linguistic information associated to lexical objects (N.B. Data Categories can be also user-defined)
Builds on EAGLES/ISLE
ISO LMF
Structural skeleton, with the basic hierarchy of information in a lexical entry
Core Package + various extensions: Constraint Expression, Morphology, NLP Syntax, NLP Semantic, NLP Paradigm class, NLP MWE pattern, NLP Multilingual notations, MRD
Modular framework
LMF specs comply with UML modelling principles; an XML DTD allows implementation
[Figure: the LMF packages surrounded by projects and initiatives: LIRICS, NEDO Asian Lang., NICT Language Grid Service Ontology, ICT KYOTO, LexInfo, new initiatives …]
Principles of LMF:
[Figure: UML class diagram of the LMF core and morphology classes, with multiplicities: Lexicon, Lexical Entry, Lemma, Sense, Form, Form Representation, Word Form, Stem Or Root, Related Form, Derived Form, Referred Root, List Of Components, Component ({ordered}), Morphological Features]
Note: insert a PAROLE entry in LMF-compliant XML
from Monica Monachini (Barcelona, IEC, 7-8 July 2009)
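The core of the class diagram can be approximated with a few plain data classes. This is a minimal sketch, not the normative ISO 24613 model: the attribute names, the `lookup` helper and the sample entry are assumptions made for illustration, and the multiplicities in the comments (one Lemma, zero or more Word Forms and Senses per Lexical Entry) are a reading of the diagram that should be checked against the standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FormRepresentation:
    written_form: str


@dataclass
class Lemma:
    representations: List[FormRepresentation]


@dataclass
class WordForm:
    representations: List[FormRepresentation]
    # Morphological features as data-category/value pairs (names assumed).
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class Sense:
    id: str


@dataclass
class LexicalEntry:
    lemma: Lemma                                               # exactly one
    part_of_speech: Optional[str] = None
    word_forms: List[WordForm] = field(default_factory=list)  # 0..*
    senses: List[Sense] = field(default_factory=list)         # 0..*


@dataclass
class Lexicon:
    language: str
    entries: List[LexicalEntry] = field(default_factory=list)

    def lookup(self, written_form: str) -> List[LexicalEntry]:
        """Return entries whose lemma or word forms match the string."""
        hits = []
        for entry in self.entries:
            forms = list(entry.lemma.representations)
            for word_form in entry.word_forms:
                forms.extend(word_form.representations)
            if any(f.written_form == written_form for f in forms):
                hits.append(entry)
        return hits


# Invented sample entry, used only to exercise the classes.
dog = LexicalEntry(
    lemma=Lemma([FormRepresentation("dog")]),
    part_of_speech="noun",
    word_forms=[WordForm([FormRepresentation("dogs")], {"number": "plural"})],
    senses=[Sense("dog_1")],
)
lexicon = Lexicon(language="en", entries=[dog])
```

Keeping the structural classes free of linguistic vocabulary mirrors the split described earlier: the skeleton comes from LMF, while values such as "noun" or "plural" would come from the Data Category Registry.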
DCR
Mapping experiment
To prove that the model is able to represent many best practices
To test the expressive potentialities, the adequacy of architectural model & linguistic objects
Major best practices:
OLIF
PAROLE/SIMPLE
LC-Star (Speech Lexicon)
WordNet - EuroWordNet
FrameNet
BDef: formal database of lexicographic definitions derived from Explanatory Dictionary of Contemporary French
from Monica Monachini
BioLexicon (BL)
SIMPLE model & ISO-LMF standard
A unique large-scale computational lexicon in the biomedical domain in terms of coverage & typology of information
Designed to meet Bio-Text Mining requirements
Populated with info from available biomedical resources
Including both domain specific & general language words
Semi-automatically populated from corpora: Population toolkit available
Rich linguistic information ranging over different linguistic description levels
Conformant to international lexical representation standards
from Monica Monachini
The BioLexicon: why
LMF proved to be able to provide systems in the biomedical domain with a substantial lexicon covering:
Biomedical term variants (orthographic, semantic, geographical, …)
Terminological verbs and their combinatorial properties (subcategorization frames and predicate-argument structure)
Word derivations (e.g. activation vs activate)
KYOTO: the lexical resource perspective
KYOTO objectives
“… facilitating the exchange of information across languages, domains and cultures”
“… allow definition of word meaning in a shared Wiki platform”
from the point of view of linguistic resources …
needs to […], both general & domain-related, under the form of lexical repositories and ontologies
ICT-211423
KYOTO SYSTEM
[Figure: the KYOTO system. Components: Source Documents; Linear MAF/SYNAF; Term extraction (Tybot); Generic TMF; Semantic annotation; Linear SEMAF; Fact extraction (Kybot); Linear Generic FACTAF; Domain editing (Wikyoto); LMF API; OWL API; Wordnet; Domain Wordnet; Ontology; Domain ontology; Concept User; Fact User]
from Piek Vossen
A common representation format for WordNets
Seven WordNets (Wn IT, Wn JP, Wn ES, Wn EU, Wn CH, Wn NL, Wn EN): similar but not identical
hampered interoperability
to be accessed both intra- and inter-linguistically
to support easier integration
Endow WordNet with a representation format allowing easy access, integration & interoperability among resources
A common representation format:
[Figure: WordNet-LMF structure, with multiplicities. LexicalResource contains GlobalInformation, Lexicon and SenseAxes. Lexicon contains LexicalEntry (Lemma; Sense with MonolingualExternalRefs / MonolingualExternalRef and Meta) and Synset (Definition with Statement; SynsetRelations / SynsetRelation; MonolingualExternalRefs / MonolingualExternalRef; Meta). SenseAxes contains SenseAxis (InterlingualExternalRefs / InterlingualExternalRef; Meta)]
Centralized Data Categories: a list of 85 sem.rels as a result of a mapping of the KYOTO WordNet grid (Inter-WN, Intra-WN)
from Monica Monachini
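To make the element hierarchy more concrete, the sketch below builds a very small XML tree using the element names from the structure above. It is only an illustration: the attribute names (`label`, `language`, `id`, `writtenForm`, `partOfSpeech`, `synset`, `example`) and the function `build_toy_resource` are assumptions, not taken from the WordNet-LMF DTD, and the output is not claimed to validate against it. The synset identifier and its three senses are the ones quoted on the cross-lingual relations slide.

```python
import xml.etree.ElementTree as ET


def build_toy_resource():
    """Build a tiny WordNet-LMF-like tree (attribute names are assumptions)."""
    resource = ET.Element("LexicalResource")
    ET.SubElement(resource, "GlobalInformation", label="toy example")
    lexicon = ET.SubElement(resource, "Lexicon", language="en")

    synset_id = "13480848-n"
    for sense_id in ("fire_1", "flame_1", "flaming_1"):
        # "fire_1" -> lemma "fire"; one LexicalEntry per lemma.
        lemma = sense_id.rsplit("_", 1)[0]
        entry = ET.SubElement(lexicon, "LexicalEntry", id=lemma)
        ET.SubElement(entry, "Lemma", writtenForm=lemma, partOfSpeech="n")
        ET.SubElement(entry, "Sense", id=sense_id, synset=synset_id)

    # The synset groups the three senses above; the statement is a placeholder.
    synset = ET.SubElement(lexicon, "Synset", id=synset_id)
    definition = ET.SubElement(synset, "Definition")
    ET.SubElement(definition, "Statement", example="(placeholder)")
    return resource


resource = build_toy_resource()
xml_text = ET.tostring(resource, encoding="unicode")
```

Printing `xml_text` shows the nesting directly; a real resource would additionally carry Meta, external references and synset relations.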
WordNet-LMF Multilingual level - Cross-lingual Relations
[…] groups monolingual synsets (IWN, SWN, WN3.0)
Example synset: < fire_1 flame_1 flaming_1 > 13480848-n
[…] specifies the type of correspondence
[…] link to ontology/(ies)
from Monica Monachini
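The cross-lingual level can be pictured as a set of axes, each grouping synsets from different monolingual wordnets and recording the type of correspondence. The following sketch assumes a simple in-memory representation: the class `SenseAxis`, its fields, the helper `synsets_linked_to` and the Italian and Spanish identifiers are hypothetical; `eq_synonym` is used here as an example relation type.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SenseAxis:
    id: str
    rel_type: str              # type of correspondence between the synsets
    targets: Tuple[str, ...]   # identifiers of the grouped monolingual synsets


def synsets_linked_to(axes, synset_id):
    """Return, sorted, all synsets sharing at least one axis with synset_id."""
    linked = set()
    for axis in axes:
        if synset_id in axis.targets:
            linked.update(t for t in axis.targets if t != synset_id)
    return sorted(linked)


axes = [
    SenseAxis(
        id="sa_1",
        rel_type="eq_synonym",
        # Only "13480848-n" comes from the slide; the other two identifiers
        # are invented placeholders for Italian and Spanish synsets.
        targets=("13480848-n", "ita-HYPOTHETICAL-1", "spa-HYPOTHETICAL-1"),
    )
]
equivalents = synsets_linked_to(axes, "13480848-n")
```

Because the axis is stored outside any single wordnet, each monolingual resource can be maintained independently while the links remain in one place.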
Kyoto Knowledge Base
[Figure: a central Ontology and Domain Ontology linked to the wordnets WnES, WnJP, WnIT, WnNL, WnEN, WnEU and WnCH, each with its own Domain extension]
LMF and Named Entity Lexicon
[…] can be useful within […] to:
Find answers
Validate answers
Construction of a multilingual NE lexicon automatically acquired
Dynamic source, huge amount of NEs, some degree of structure → NEs extracted from Wikipedia and linked to entries of LRs and ontologies
from Monica Monachini
Named Entity Lexicon
from Monica Monachini
LexInfo & Previous Models
LingInfo: modeling morphosyntactic decomposition of (complex) terms [Buitelaar et al. 2006]
LexOnto: capturing syntactic behaviour and syntax-semantics links [Cimiano et al. 2007]
Lexical Markup Framework (LMF): ISO standardised model for representing machine readable lexica (agnostic about connection with ontology) [Francopoulo et al. 2007]
LexInfo: building on LMF as a core, develop a model which “subsumes” LingInfo and LexOnto for flexibly associating linguistic information to ontologies [Buitelaar, Cimiano, Haase, Sintek 2009]
From Paul Buitelaar
LMF: ILC infrastructure
Desiderata for Semantic Roles
Martha Palmer
First step:
What are semantic roles?
Why do we need standards?
Start with Lirics
Consistently recognizable
Learnable
Some steps for a “new generation” of LRs
From huge efforts in building static, large-scale, general-purpose LRs
To dynamic LRs rapidly built on-demand, tailored to specific user needs
From closed, locally developed and centralized resources
To LRs residing over distributed places, accessible on the web, choreographed by agents acting over them
From Language Resources
To Language Services
BUT
• Need of tools to make this vision operational & concrete
Lexical WEB & Content Interoperability
As a critical step for semantic mark-up in the SemWeb
[Figure: lexical resources (NomLex, ComLex, FrameNet, WordNets, Global WordNet GRID, SIMPLE, SIMPLE-WEB, BioLexicon, Lex_x, Lex_y) interconnected, with intelligent agents]
Standards for Interoperability: LMF. Enough??
A new paradigm of R&D in LRs & LT: Distributed Language Services
Open & distributed infrastructures for LRs & LT
Adopting the paradigm of accumulation of knowledge so successful in more mature disciplines, based on sharing LRs & LTs
Ability to build on previous achievements, allowing effective cooperation of many groups on common tasks
Exchange and integrate information across repositories
Create new resources on the basis of existing ones, on demand
A new scenario implying:
content interoperability standards
development of architectures enabling accessibility
supra-national cooperation
A few Issues for discussion: “content”, guidelines, tools, priorities, ...
For Semantic Web & “content” interoperability: is the field ‘mature’ enough to converge also for the semantic/conceptual level (e.g. to automatically establish links among different languages)?
For the standards to have impact, ensure their usability & gain industry support, focusing on requirements of industrial applications
To have Guidelines which are a “usable product” (to assist in creation or adaptation of lexicons, …)
Facilitate acceptance of the standards providing an open-source reference implementation platform & tools, related web services and test suites
Relation with Spoken language community
Define further steps necessary to converge on common priorities
Limits observed & needs of further work
Data Categories ( & others):
From Japanese NEDO: DC not defined in LMF & LMF non operational
Asian, African DCs
Need of DC organised (easy to use): IsoCat & Profiles
Need of an ontology of DCs with structure/dependencies, and constraints
Otherwise the model remains too abstract, and doesn’t say anything on how to implement concretely the different layers
Link with Ontologies: relations
Need of easy […]
Need of […] to make it operational, also for creating standard compliant resources: more important than the model!
More […], also […]
Linguists may be (rightly for certain purposes) not interested
Younger colleagues not aware of the past work on standards
Need of […]
Need of […] to produce standard-compliant resources (unless differently motivated)
Strengths
Good set of … : Granularity of basic notions, Many languages already compliant with EAGLES morpho-syntax, etc.
Many projects today using LMF
Unified Lexicon experiment between Speechdat & Parole, at ELRA (possible because EAGLES compliant)
[…] to access LRs based on standards
An open infrastructure of LRT needs standards
New topics being constantly added: […]
Future requirements & planning
To make LMF […]:
[…] of commonly used lexicons into LMF
[…] for LMF lexicons
[…] related to LMF, with particular reference to the Lexus tool
Need to address another layer: the ontological layer in a lexicon
How other […]
An […] in a […] environment to […], to allow broad discussion on these topics, to ease dissemination of LMF and information mapped from each […]
FLaReNet Mission: structure the area of LR & LT of the future
Worldwide Forum for LRs & LTs
Individual Subscribers, Institutional Members from […]
Consolidate methods, approaches, common practices, architectures
Integrate so far partial solutions into broader infrastructures
A “roadmap”: a plan of coherent actions as input to policy development
For the EU, national organisations & industry
As a model for the LRs/LTs of the next years
Strengthening the language product market, e.g. for new products & innovative services
Indicating […]
Some results from FLaReNet Vienna Forum: International Cooperation
Standards & Interoperability: topics for cooperation
A metadata catalogue should involve every party
Common repositories for LRT, universally & easily accessible
Try to connect ongoing work done by many groups
A […], where to find the most frequently used and preferred schemes: major help to achieve standardisation
For a new world-wide language infrastructure
Create the means to plug together different LR & LT, in a web-based resource and technology grid
With the possibility to easily create new workflows
Access to LRT is critical: involves, and has impact on, all the community
Create conditions to easily share and re-use technologies, to have more open (source) tools available for use also to under-funded groups
FLaReNet & the LRE MAP … at […]
Special Highlight: Contribute to building the […]
Time is ripe to launch an important initiative, the LREC2010 Map of Language Resources, Technologies and Evaluation.
The Map will be a […] of the LREC community, as a first step towards the creation of a very broad, community-built, Open Resource Infrastructure. First in a series, it will become an essential instrument to monitor the field and to identify shifts in the production, use and evaluation of LRs and LTs over the years.
When submitting a paper (< 900!), from the START page fill in a very simple template to provide essential information about resources (in a broad sense, also technologies, standards, evaluation kits) either used for the work described or a new result of your research
Go to […]!