Slide 1 - Max Planck Institute for Psycholinguistics


Nicoletta Calzolari

Istituto di Linguistica Computazionale - CNR - Pisa

Nijmegen, August 2010

 

After the “Grosseto Workshop” (1985): a turning point

It all started with the situation we had in the late ‘80s – early ‘90s, with all the Xxx-Lex projects: GeneLex, AcquiLex, MultiLex, ...

A. Zampolli: “Let’s be coherent”: EAGLES, then ISLE

Key issues: Do conditions exist for standardisation effort?

Reusability as key concept → true also today

To avoid duplication of efforts, costs, etc.

To allow synergies, integration, exchange of data, ...

To provide a model for new data creation & acquisition

Decide on “feasible” areas & state priorities → this is changing over time

The feasibility of formulation of consensual standards as a strong sign of maturity in the field → we can’t propose standards if there are not enough results on which to base them

EAGLES was launched in ‘93

Main Results in Lexicon & Corpus WGs

First Phase (www.ilc.pi.cnr.it/EAGLES96/home.html)

Standard for morphosyntactic encoding of lexical entries, in a multi-layered structure, with applications for all the EU languages

Standard for subcategorisation in the lexicon: a set of standardised basic notions using a frame-based structure

Proposal for a basic set of notions in lexical semantics: focus on requirements of Information Systems and MT

Corpus Encoding Standard (CES) from TEI

Standard for morphosyntactic annotation of corpora, to ensure compatibility/interchangeability of concrete annotation schemata

Preliminary recommendations for syntactic annotation of corpora

Dialogue annotation, for integration of written and spoken annotation

Content vs. Format/Representation

Work on lexical description deals with two aspects:

Linguistic description (content) of lexical items

Formal representation (format) of lexical descriptions

[...] concentrated on [...], not disregarding the formal representation of the proposal

[...] more on format/representation issues

In LMF: [...] on the abstract meta-model

Flexibility in the Recommendations

e.g. Morphosyntax

Level   Information Type                          Recommendation
L-0     Part-of-Speech                            Obligatory
L-1     Morphosyntactic features (agreement)      Recommended
L-2     Language-specific (or refined) features   Optional
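The three recommendation levels can be read as a conformance rule: an entry is acceptable as long as its obligatory information is present, while missing recommended information is only reported. The sketch below is one possible reading, not part of the EAGLES recommendations; the attribute names (`pos`, `gender`, `number`, `clitic_type`) and their assignment to levels are hypothetical.

```python
from enum import Enum


class Level(Enum):
    OBLIGATORY = "L-0"
    RECOMMENDED = "L-1"
    OPTIONAL = "L-2"


# Hypothetical assignment of attributes to levels; only the three levels
# themselves come from the slide.
SPEC = {
    "pos": Level.OBLIGATORY,
    "gender": Level.RECOMMENDED,
    "number": Level.RECOMMENDED,
    "clitic_type": Level.OPTIONAL,
}


def check_entry(entry: dict) -> dict:
    """Report missing attributes per level and attributes unknown to SPEC."""
    report = {"missing_obligatory": [], "missing_recommended": []}
    for attr, level in SPEC.items():
        if attr in entry:
            continue
        if level is Level.OBLIGATORY:
            report["missing_obligatory"].append(attr)
        elif level is Level.RECOMMENDED:
            report["missing_recommended"].append(attr)
        # Missing optional attributes are never reported.
    report["unknown"] = sorted(a for a in entry if a not in SPEC)
    # Only obligatory information decides conformance.
    report["conformant"] = not report["missing_obligatory"]
    return report
```

For instance, `check_entry({"pos": "noun", "gender": "feminine"})` is conformant and only flags `number` as missing recommended information.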


MERITS

Strengths (from EAGLES-ISLE)

Standardisation as a necessary component of any strategic programme to create a coherent market

Leading [...] & academics participated (> 150 EU groups) → bottom-up, community-created standards

To avoid wasting time reinventing basic/consolidated knowledge

May be true also for many “humanities” users, not interested in debates on specific lexical approaches

Work otherwise duplicated among many projects, done just once in a collaborative manner (overall cost-effectiveness)

Allows the field to be more competitive:

Concentrate efforts on innovative areas

Engage in new/advanced technology

Why Standards for Language Resources?

(from EAGLES-ISLE)

To ensure:

[...] of systems (& data), through compatible interfaces and [...] → important for workflows

[...] of components

training [...] → essential for a LR Infrastructure

[...] based on consensual technical specifications and models (“gold standards”) → for evaluation campaigns

evaluation & validation based on agreed criteria

transition from prototypes to [...]

Applications: requirements for systems & enabling technologies

Machine Translation

Information Extraction

Information Retrieval

Summarisation

Natural Language Generation

Word Clustering

Multiword Recognition + Extraction

Word Sense Disambiguation

Proper Noun Recognition

Parsing

Coreference

…

→ For HLT, knowledge of [...] is essential

The Multilingual ISLE Lexical Entry (MILE)

(from EAGLES)

Basic requirements for the design of the MILE:

Discover and list the (maximal) set of basic notions needed to describe the MILE (up to which level is standardisation feasible?)

Granularity

The leading principle: the edited union (redundancy is not a problem) of existing lexicons/models

Modular & layered

Allow for under-specification (& hierarchical structure)

MILE – Modularity: the building-block model

Allows different dimensions of lexical entries to be expressed

Enable modular specification of lexical entries

Create ready-to-use packages to be combined in different ways

Lexical Classes as the main building blocks of the lexical architecture

[Diagram: Lexical entry 1, Lexical entry 2 and Lexical entry 3 built from shared Lexical Objects: syntactic frame, slot, phrase, Syn feature, Sem feature]

→ Done in LMF
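The building-block idea can be illustrated with a small sketch in which lexical objects are defined once and then shared by several lexical entries. This is an illustration under assumptions, not the MILE specification: the class names loosely follow the labels on the slide, while the field names and the sample frame are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Slot:
    function: str  # e.g. "subject" (hypothetical value)
    phrase: str    # e.g. "NP" (hypothetical value)


@dataclass(frozen=True)
class SyntacticFrame:
    label: str
    slots: tuple   # ordered Slot objects


@dataclass
class LexicalEntry:
    lemma: str
    frames: list = field(default_factory=list)
    sem_features: list = field(default_factory=list)


# A ready-to-use building block, defined once ...
TRANSITIVE = SyntacticFrame(
    "transitive", (Slot("subject", "NP"), Slot("object", "NP"))
)

# ... and combined into different entries.
eat = LexicalEntry("eat", frames=[TRANSITIVE], sem_features=["EVENT"])
build = LexicalEntry("build", frames=[TRANSITIVE], sem_features=["EVENT"])
```

Because both entries point to the same frame object, a change to the shared block is made in one place only, which is the practical benefit of the modular design.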


The MILE Data Categories: user-adaptability and extensibility

[Diagram: MLC:SemanticFeature with instance_of values HUMAN, ARTIFACT, EVENT, ANIMAL, GROUP, AGE, MAMMAL, divided into Core and UserDefined]

→ OK in ISOCat

MILE Lexical Data Category Registry

A library of pre-instantiated objects

Define (an ontology of) [...]:

represent lexical notions such as semantic unit, syntactic feature, syntactic frame, semantic predicate, semantic relation, synset, etc.

specify the relevant attributes

define the relations with other classes

hierarchically structured

→ To be done … in ISOCat

Can be used “off the shelf” or as a departure point for the definition of new or modified categories → DC Selections
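A minimal sketch of such a registry follows: a hierarchically structured set of core categories that can be used off the shelf, extended with user-defined categories, and narrowed to a selection for one lexicon. It is not the ISOCat or MILE API; the class and method names are hypothetical, and treating MAMMAL as a user-defined child of ANIMAL is an assumption about the diagram.

```python
class DataCategoryRegistry:
    """Toy registry: each category has a parent and an origin (core/user)."""

    def __init__(self, core: dict):
        # core maps category name -> parent name (None for the root)
        self._cats = {name: (parent, "core") for name, parent in core.items()}

    def define(self, name: str, parent: str) -> None:
        """Add a user-defined category below an existing one."""
        if name in self._cats:
            raise ValueError(f"{name} is already defined")
        if parent not in self._cats:
            raise KeyError(f"unknown parent category {parent}")
        self._cats[name] = (parent, "user")

    def origin(self, name: str) -> str:
        return self._cats[name][1]

    def ancestors(self, name: str) -> list:
        """Walk up the hierarchy from a category to the root."""
        chain = []
        parent = self._cats[name][0]
        while parent is not None:
            chain.append(parent)
            parent = self._cats[parent][0]
        return chain

    def select(self, names) -> set:
        """A 'DC selection': the subset of categories one lexicon uses."""
        unknown = [n for n in names if n not in self._cats]
        if unknown:
            raise KeyError(f"not in registry: {unknown}")
        return set(names)


CORE = {
    "SemanticFeature": None,
    "HUMAN": "SemanticFeature",
    "ARTIFACT": "SemanticFeature",
    "EVENT": "SemanticFeature",
    "ANIMAL": "SemanticFeature",
    "GROUP": "SemanticFeature",
}
registry = DataCategoryRegistry(CORE)
registry.define("MAMMAL", parent="ANIMAL")  # assumed user-defined extension
```

A lexicon would then call `registry.select([...])` with the categories it actually needs, which fails early if a name is not registered.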


ISO - LMF Lexical Markup Framework

Designed to accommodate as many models of lexical representation as possible

Its pros:

[...]: abstract high-level specification (ISO 24613)

Based on constants defined in Data Category Registry: low-level specifications (ISO 12620)

Not a monolithic model, rather a modular framework

LMF library provides the hierarchy of lexical objects (with structural relations among them)

Data Category Registry provides a library of descriptors to encode linguistic information associated to lexical objects (N.B. Data Categories can be also user-defined)

ISO LMF: builds on EAGLES/ISLE

Core Package: structural skeleton, with the basic hierarchy of information in a lexical entry

+ various extensions: Constraint Expression, Morphology, MRD, NLP Syntax, NLP Semantic, NLP Paradigm class, NLP MWE pattern, NLP Multilingual notations

Modular framework

LMF specs comply with UML modelling principles

An XML DTD allows implementation

[Diagram labels: LIRICS, NEDO Asian Lang., NICT Language Grid Service Ontology, ICT KYOTO, LexInfo, New initiatives …]

Principles of LMF:

[UML class diagram with cardinalities. Classes: Lexicon, Lexical Entry, Lemma, Sense, Form, Form Representation, Word Form, Stem Or Root, Morphological Features, Related Form, Derived Form, Referred Root, List Of Components, Component ({ordered})]

[Note on the slide: insert a PAROLE entry in LMF-compliant XML]

Monica Monachini, Barcelona, IEC, 7-8 July 2009
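To make the class diagram more concrete, here is a minimal sketch that builds a Lexicon containing one Lexical Entry with a Lemma, two Word Forms and a Sense, and serialises it to XML. The element names mirror the class names on the slide; the `feat att/val` encoding, the attribute names (`partOfSpeech`, `writtenForm`, `grammaticalNumber`) and the sample Italian entry are assumptions for illustration and should be checked against the LMF DTD before use.

```python
import xml.etree.ElementTree as ET


def feat(parent: ET.Element, att: str, val: str) -> None:
    """Attach one attribute-value pair to an LMF-style element."""
    ET.SubElement(parent, "feat", att=att, val=val)


def lexical_entry(lemma, pos, word_forms=(), senses=()):
    """Build a LexicalEntry with one Lemma, 0..* WordForm and 0..* Sense."""
    entry = ET.Element("LexicalEntry")
    feat(entry, "partOfSpeech", pos)
    lemma_el = ET.SubElement(entry, "Lemma")
    feat(lemma_el, "writtenForm", lemma)
    for written, features in word_forms:
        wf = ET.SubElement(entry, "WordForm")
        feat(wf, "writtenForm", written)
        for att, val in features.items():
            feat(wf, att, val)
    for sense_id in senses:
        ET.SubElement(entry, "Sense", id=sense_id)
    return entry


lexicon = ET.Element("Lexicon")
lexicon.append(
    lexical_entry(
        "casa",
        "noun",
        word_forms=[
            ("casa", {"grammaticalNumber": "singular"}),
            ("case", {"grammaticalNumber": "plural"}),
        ],
        senses=["casa_1"],  # hypothetical sense identifier
    )
)
xml_text = ET.tostring(lexicon, encoding="unicode")
```

Keeping all linguistic values in `feat` pairs leaves the structural skeleton fixed while the descriptors themselves come from a separate registry of data categories.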

DCR


Mapping experiment

To prove that the model is able to represent many best practices

To test the expressive potentialities, the adequacy of architectural model & linguistic objects

Major best practices:

OLIF

PAROLE/SIMPLE

LC-Star (Speech Lexicon)

WordNet - EuroWordNet

FrameNet

BDef: formal database of lexicographic definitions derived from the Explanatory Dictionary of Contemporary French

from Monica Monachini

BioLexicon

SIMPLE model & ISO-LMF standard

A unique large-scale computational lexicon in the biomedical domain in terms of coverage & typology of information

Designed to meet Bio-Text Mining requirements

Populated with info from available biomedical resources

Including both domain specific & general language words

Semi-automatically populated from corpora: Population toolkit available

Rich linguistic information ranging over different linguistic description levels

Conformant to international lexical representation standards

from Monica Monachini

The BioLexicon: why

LMF proved to be able to provide systems in the biomedical domain with a substantial lexicon covering:

Biomedical term variants (orthographic, semantic, geographical, …)

Terminological verbs and their combinatorial properties (subcategorization frames and predicate-argument structure)

Word derivations (e.g. activation vs activate)

KYOTO: the lexical resource perspective

KYOTO objectives

“ … facilitating the exchange of information across languages, domains and cultures”

“ … allow definition of word meaning in a shared Wiki platform”

→ from the point of view of linguistic resources, needs to [...], both general & domain-related, under the form of lexical repositories and ontologies

ICT-211423

KYOTO SYSTEM (from Piek Vossen)

[Architecture diagram. Components: Source Documents; Linear MAF/SYNAF; Term extraction (Tybot); Generic TMF; Semantic annotation; Linear SEMAF; Fact extraction (Kybot); Linear Generic FACTAF; Domain editing (Wikyoto); Concept User; Fact User; Wordnet; Domain Wordnet; Ontology; Domain ontology; LMF API; OWL API]

  

A common representation format for WordNets

Seven WordNets (Wn IT, Wn JP, Wn ES, Wn EU, Wn CH, Wn NL, Wn EN): similar but not identical, which hampered interoperability

to be accessed both intra- and inter-linguistically

to support easier integration

→ endow WordNet with a representation format allowing easy access, integration & interoperability among resources

A common representation format:

[Class diagram with cardinalities. Classes: LexicalResource, GlobalInformation, Lexicon, LexicalEntry, Lemma, Sense, MonolingualExternalRefs, MonolingualExternalRef, Meta, Synset, Definition, Statement, SynsetRelations, SynsetRelation, SenseAxes, SenseAxis, InterlingualExternalRefs, InterlingualExternalRef]

Data Categories: centralized. A list of 85 semantic relations (inter-WN and intra-WN) as a result of a mapping of the KYOTO WordNet grid

from Monica Monachini

WordNet-LMF Multilingual level - Cross-lingual Relations

[...] groups monolingual synsets [...] and sharing the same [...]

[...] specifies the type of correspondence

[Diagram: IWN 00001251-n; SWN 09686541-n; WN3.0 13480848-n < fire_1 flame_1 flaming_1 >]

[DTD fragment on the slide: ID CDATA #REQUIRED>]
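One way to picture the cross-lingual level is as a record that groups synsets from different monolingual wordnets, carries a correspondence type, and points to a shared reference synset. The sketch below reuses the three identifiers from the slide, but how they relate to each other is an assumption; the field names, the axis identifier `sa_001` and the relation label `eq_synonym` are hypothetical and not taken from the WordNet-LMF DTD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SenseAxis:
    axis_id: str
    rel_type: str            # type of correspondence (hypothetical label)
    targets: frozenset       # (wordnet, synset id) pairs grouped by the axis
    interlingual_ref: tuple  # (resource, synset id) shared by the targets


axis = SenseAxis(
    axis_id="sa_001",
    rel_type="eq_synonym",
    targets=frozenset({("IWN", "00001251-n"), ("SWN", "09686541-n")}),
    interlingual_ref=("WN3.0", "13480848-n"),
)


def counterparts(axes, wordnet: str, synset_id: str) -> set:
    """Return the synsets in other wordnets grouped with the given one."""
    key = (wordnet, synset_id)
    found = set()
    for ax in axes:
        if key in ax.targets:
            found |= ax.targets - {key}
    return found
```

With this layout, moving from a synset in one wordnet to its counterparts in the others is a lookup over the axes, without touching the monolingual lexicons.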

from Monica Monachini

[Diagram: Kyoto Knowledge Base. The wordnets WnES, WnJP, WnIT, WnNL, WnEN, WnEU and WnCH, each with a Domain extension, link to the ontology/(ies) and to the Domain Ontology]

LMF and Named Entity Lexicon

[...] can be useful within [...] to:

Find answers

Validate answers

Construction of a multilingual NE lexicon, automatically acquired

NEs extracted from Wikipedia (→ dynamic source, huge amount of NEs, some degree of structure) and linked to entries of LRs and ontologies

from Monica Monachini

Named Entity Lexicon

[Diagram linking Wikipedia (Wikip), language resources (LR) and ontologies (Onto)]

from Monica Monachini

LexInfo & Previous Models

LingInfo: modeling morphosyntactic decomposition of (complex) terms [Buitelaar et al. 2006]

LexOnto: capturing syntactic behaviour and syntax-semantics links [Cimiano et al. 2007]

Lexical Markup Framework (LMF): ISO standardised model for representing machine readable lexica (agnostic about connection with ontology) [Francopoulo et al. 2007]

LexInfo: building on LMF as a core, develop a model which “subsumes” LingInfo and LexOnto for flexibly associating linguistic information to ontologies [Buitelaar, Cimiano, Haase, Sintek 2009]

From Paul Buitelaar

LMF: ILC infrastructure


Desiderata for Semantic Roles

Martha Palmer

First step:

What are semantic roles?

Why do we need standards?

Start with Lirics

Consistently recognizable

Learnable

Some steps for a “new generation” of LRs

From huge efforts in building static, large-scale, general-purpose LRs

To dynamic LRs rapidly built on-demand, tailored to specific user needs

From closed, locally developed and centralized resources

To LRs residing over distributed places, accessible on the web, choreographed by agents acting over them

From Language Resources

To Language Services

BUT

Need of tools to make this vision operational & concrete

Lexical WEB & Content Interoperability

As a critical step for semantic mark-up in the SemWeb

[Diagram: lexical resources (ComLex, NomLex, FrameNet, WordNets, Global WordNet GRID, SIMPLE, SIMPLE-WEB, BioLexicon, Lex_x, Lex_y) connected through Standards for Interoperability (LMF), with intelligent agents]

Enough??

A new paradigm of R&D in LRs & LT: Distributed Language Services

Open & distributed infrastructures for LRs & LT

Adopting the paradigm of accumulation of knowledge so successful in more mature disciplines, based on sharing LRs & LTs

Ability to build on previous achievements, allowing effective cooperation of many groups on common tasks

Exchange and integrate information across repositories

Create new resources on the basis of existing ones, on demand

A new scenario implying:

content interoperability standards

development of architectures enabling accessibility

supra-national cooperation

A few Issues for discussion: “content”, guidelines, tools, priorities, ...

For Semantic Web & “content” interoperability: is the field ‘mature’ enough to converge also for the semantic/conceptual level (e.g. to automatically establish links among different languages)?

For the standards to have impact, ensure their usability & gain industry support, focusing on requirements of industrial applications

To have Guidelines which are a “usable product” (to assist in creation or adaptation of lexicons, …)

Facilitate acceptance of the standards by providing an open-source reference implementation platform & tools, related web services and test suites

Relation with Spoken language community

Define further steps necessary to converge on common priorities

        

Limits observed & needs of further work

[...]: Data Categories (& others: [...])

From Japanese NEDO: DC not defined in LMF & LMF non operational → Asian, African DCs

Need of DC organised (easy to use) → IsoCat & Profiles

Need of an ontology of DCs with structure/dependencies, and constraints

Otherwise the model remains too abstract, and doesn’t say anything on how to implement concretely the different layers

Link with Ontologies: relations

Need of easy, [...]

Need of [...] to make it operational, also for creating standard compliant resources: more important than the model!

More [...], also [...]

Linguists may be (rightly for certain purposes) not interested

Younger colleagues not aware of the past work on standards

Need of [...]

Need of [...] to produce standard-compliant resources (unless differently motivated)

Strengths

Good set of … : Granularity of basic notions, Many languages already compliant with EAGLES morpho-syntax, etc.

Many projects today using LMF

Unified Lexicon experiment between Speechdat & Parole, at ELRA (possible because EAGLES compliant)

[...] to access LRs based on standards

An open infrastructure of LRT needs standards

New topics being constantly added: [...]

Future requirements & planning

To make LMF [...]

[...] of commonly used lexicons into LMF

[...] for LMF lexicons

[...] related to LMF, with particular reference to the Lexus tool

Need to address another layer

The ontological layer in a lexicon

How other [...]

An [...] in a [...] environment to [...]

[...] to allow broad discussion on these topics

[...] to ease dissemination of LMF and information mapped from each [...]

FLaReNet Mission: structure the area of LR & LT of the future

Worldwide Forum for LRs & LTs: Individual Subscribers, Institutional Members from [...]

Consolidate methods, approaches, common practices, architectures

Integrate so far partial solutions into broader infrastructures

A “roadmap”: a plan of coherent actions as input to policy development

For the EU, national organisations & industry

As a model for the LRs/LTs of the next years

Strengthening the language product market, e.g. for new products & innovative services

Indicating [...]

Some results from FLaReNet Vienna Forum: International Cooperation

Standards & Interoperability: topics for cooperation

A metadata catalogue should involve every party

Common repositories for LRT universally & easily accessible

Try to connect ongoing work done by many groups

A [...] – where to find the most frequently used and preferred schemes – major help to achieve standardisation

For a new world-wide language infrastructure

Create the means to plug together different LR & LT, in a web-based resource and technology grid

With the possibility to easily create new workflows

Access to LRT is critical: involves – and has impact on – all the community

Create conditions to easily share and re-use technologies, to have more open (source) tools available for use also to under-funded groups

FLaReNet & the LRE MAP … at [...]

Special Highlight: Contribute to building the [...]

Time is ripe to launch an important initiative, the LREC2010 Map of Language Resources, Technologies and Evaluation.

The Map will be a [...] of the LREC community, as a first step towards the creation of a very broad, community-built, Open Resource Infrastructure. First in a series, it will become an essential instrument to monitor the field and to identify shifts in the production, use and evaluation of LRs and LTs over the years.

When submitting a paper (< 900!), from the START page fill in a very simple template to provide essential information about resources (in a broad sense, also technologies, standards, evaluation kits) either used for the work described or a new result of your research.

Go to !
