Transcript: ELSE

ELSE
Evaluation in Language and Speech Engineering
January 98 - April 99
ELSE Participants
• MIP - U. of Odense (Denmark)
• UDS (Germany)
• U. di Pisa (Italy)
• EPFL (Switzerland)
• XRCE (France)
• U. of Sheffield (United Kingdom)
• Limsi (CNRS) (France)
• CECOJI (CNRS) (France)
• ELRA & ELSNET
Comparative Technology Evaluation Paradigm
• Successfully used by US DARPA (since 1984)
• Shorter scale in Europe (Sqale, Grace…)
• Choose task / system or component
• Gather participants
• Organize campaign (protocols/metrics/data)
• Mandatory if technology insufficient:
  – MT, IR, summarization… (cf. recognition in the 80s)
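The paradigm's steps (choose a task, gather participants, organize protocols/metrics/data) can be sketched as a small data model. This is purely illustrative: all class and field names below are mine, not anything defined in the ELSE slides.

```python
from dataclasses import dataclass, field

@dataclass
class Campaign:
    """Minimal model of the evaluation-campaign steps listed above.
    All names are illustrative, not from the ELSE slides."""
    task: str                                  # chosen task / system or component
    participants: list = field(default_factory=list)
    protocol: str = ""                         # how systems are run on the data
    metrics: list = field(default_factory=list)
    data: dict = field(default_factory=dict)   # training / dev / test sets

    def register(self, site: str):
        # "Gather participants"
        self.participants.append(site)

    def organize(self, protocol: str, metrics: list, data: dict):
        # "Organize campaign (protocols/metrics/data)"
        self.protocol, self.metrics, self.data = protocol, metrics, data

c = Campaign(task="Broadcast News Transcription")
c.register("site-A")
c.register("site-B")
c.organize("blind test, single submission", ["WER"], {"test": "eval98"})
print(len(c.participants))  # 2
```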
Limsi-CNRS
Knowledge gained from
evaluation campaigns
• Knowledge shared by participants in workshops:
  – How to get the best results?
  – Methodology advantages / disadvantages
• Funding agencies (DARPA / others)
  – Level of technology / applications
  – Progress vs investment
  – Set priorities
Knowledge gained from
evaluation campaigns
• Industry
  – Compare with state-of-the-art (developers)
  – Select technologies (integrators)
  – Easier market intelligence (SMEs)
  – Consider applications (end-users)
[Figure: error rates (Taux d'Erreur) by speaker (Locuteurs) on the TEST Q0 set, for four systems Q0-1 to Q0-4; y-axis 0-60%]
Powerful tool
• Go deeper into conceptual background
  – Metrics, protocols...
• Contrastive evaluation scheme
• Accompany research
• Problem-solving approach
• Interest for speech and NL communities
Resources & evaluation by-products
• Training and test data
  – Must be of high quality (used in test)
• Evaluation toolkits
  – Expensive: of interest for all
  – Interest for remote users (domain, country)
  – Compare with state-of-the-art
  – Induce participation in evaluation campaigns
  – Measure progress
Relationship / usage-oriented
evaluation
• Technology evaluation
– Generic task
– Attract enough participants
– Close enough to practical application
• Usage evaluation
– Specific application / specific language
– User satisfaction criteria
Relationship / usage-oriented
evaluation
• Technology insufficient: no application
• Technology sufficient: possible application
• Efforts for usage evaluation are larger than
for technology evaluation
• Technology evaluation (10s): generic and organized centrally
• Usage evaluation (1000s): specific, organized by each application developer / user
e
l s e
Limsi-CNRS
Relationship / Long Term Research
• Different objectives / time scale
• Meeting points placed in the future
• LTR: high risk but high profit investment
ELSE results
• What does ELSE propose?
   an abstract architecture (generic IR/IE)
    (profiling, querying and presentation)
   control tasks:
    1) can be easily performed by a human
    2) arbitrary composite functionality possible
    3) formalism for task result description
    4) measures easy to understand
   6 tasks or a global task to start with...
6 Control tasks to start with...
1. Broadcast News Transcription
2. Cross-Lingual IR / IE
3. Text-To-Speech Synthesis
4. Text Summarization
5. Language Model Evaluation
6. Word Annotation task (POS, Lemma, Syntactic Roles, Senses etc.)
...or a global task to start with...
• "TV News on Demand" (NOD)
  (inspired by BBN's "Rough'n'Ready")
  – segments radio and TV broadcasts
  – combines several recognition techniques (speaker ID, OCR, speech transcription, Named Entities etc.)
  – detects topics
  – summarizes
  – searches/browses and retrieves information
Multilingualism
• 15 Countries
• 2 possible solutions:
  – 1) Cross-Lingual functionality requirement
  – 2) All participants evaluate on 2 languages:
     - their own
     - one common pivotal language (English?)
Results Computation
• Multidimensional evaluation (multiple mixed evaluation criteria)
• Baseline performance (contrastive)
• Dual result computation (quality)
• Reproducible (automated evaluation toolkit needed)
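The slides ask for a reproducible, automated evaluation toolkit without naming a metric. As an illustrative sketch only, the word error rate used in speech-transcription campaigns such as Broadcast News can be computed with a plain word-level edit distance (the function name `wer` is mine; real toolkits also produce alignment reports):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over word tokens,
    divided by the reference length. Illustrative sketch."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = min edits turning the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution (sat->sit) and one deletion (the) over 6 reference words
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

Because scoring is deterministic, shipping such a script with the reference data is what makes the campaign results reproducible by any remote participant.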
Language Resources
• Human-Built Reference Data (cost + consistency check + guidelines)
• Minimal Size (chunk selective evaluation)
• Minimal Quality Requirement
• Language Phenomena Representativeness
• Reusable & Multilingual
• By-products of evaluation become Evaluation Resources
Actors in the infrastructure
• ELRA
• European Commission
• Evaluators
• Participants (EU / non-EU)
• L. R. Producers
• Research
• Industry
• Citizens
• Users & Customers
Need for a Permanent Infrastructure?
• Problem with Call for Proposals mechanism
  – Limited duration (FPs) / share of cost by participants
• Permanent organization
  – General policy / strategy / ethical aspects
  – Scoring software
  – Label attribution / quality assurance & control
  – Production of Language Resources (dev, test)
  – Distribution of Language Resources (ELRA)
  – Cross-over FPs
Evaluation in the Call for Proposals
• Evaluation campaigns: 2 years
• Proactive scheme: select topics (research / industry), e.g. TV News on Demand or several tasks (BNT, CLIM, etc.)
• Reactive scheme: select projects, identify generic technologies among projects (clusters?), resources contracted out of project budgets, a posteriori negotiation
Multilinguality
• Each participant should address at least two languages (own + common language)
• One language common to all participants
  – Compare technologies on same language/data
  – Compare languages on same technology
  – English: spoken by many people, large market, cooperation with USA
  – Up to 4 languages for each consortium
  – Other languages in future actions
Proactive vs Reactive?
• ELSE views:
  – Proactive
  – Single Consortium
  – Permanent Organization (Association + Agency)
  – English as common language
Estimated Cost
• 100% EC funding for infrastructure org, LR
• Participants: share of system development
• Reactive: extra funding for evaluation
• Proactive:
  – 600 Keuro average each topic (3.6 Meuro total)
    • 90 Keuro organization
    • 180 Keuro LR production
    • 300 Keuro participants (up to 10)
    • 30 Keuro supervision of permanent organization
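The per-topic figure is the sum of the four listed items, and the overall total is consistent with running one campaign per control task. A quick arithmetic check (the assumption that the 3.6 Meuro total covers exactly the 6 control tasks proposed earlier is mine):

```python
# Per-topic budget items from the slide, in Keuro
per_topic_items = {
    "organization": 90,
    "LR production": 180,
    "participants (up to 10)": 300,
    "supervision of permanent organization": 30,
}
per_topic = sum(per_topic_items.values())
print(per_topic)  # 600 — matches "600 Keuro average each topic"

# Assumption: the 3.6 Meuro total corresponds to the 6 control tasks
topics = 6
print(per_topic * topics / 1000)  # 3.6 (Meuro)
```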
Questions?
– Are you interested in the concept?
– Would you be interested in participating?
– Would you be interested in providing data?
– Would you be ready to pay for participating?
– Would you be ready to pay for accessing the results (and by-products, e.g. data and tools) of an evaluation?
– Would you be interested in paying for specific evaluation services?