Transcript: ELSE

ELSE: Evaluation in Language and Speech Engineering
January 98 - April 99

ELSE Participants
• MIP - U. of Odense (Denmark)
• UDS (Germany)
• U. di Pisa (Italy)
• EPFL (Switzerland)
• XRCE (France)
• U. of Sheffield (United Kingdom)
• Limsi (CNRS) (France)
• CECOJI (CNRS) (France)
• ELRA & ELSNET

Comparative Technology Evaluation Paradigm
• Successfully used by US DARPA (since 1984)
• Used on a shorter scale in Europe (Sqale, Grace...)
• Choose a task / system or component
• Gather participants
• Organize the campaign (protocols / metrics / data)
• Mandatory when the technology is insufficient:
  – MT, IR, summarization... (cf. speech recognition in the 80s)

Knowledge gained from evaluation campaigns
• Knowledge shared by participants in workshops:
  – How to get the best results?
  – Advantages / disadvantages of the methodology
• Funding agencies (DARPA / others):
  – Level of technology / applications
  – Progress vs investment
  – Set priorities

Knowledge gained from evaluation campaigns
• Industry:
  – Compare with the state of the art (developers)
  – Select technologies (integrators)
  – Easier market intelligence (SMEs)
  – Consider applications (end-users)

[Figure: error rate ("Taux d'Erreur", 0-60) per speaker ("Locuteurs": l11f, l07f, l19m, l05f, l14f, l13f, l01m, l17f, l06f, l02m) on the TEST Q0 set, for four configurations Q0-1 to Q0-4.]

Powerful tool
• Go deeper into the conceptual background
  – Metrics, protocols...
• Contrastive evaluation scheme
• Accompanies research
• Problem-solving approach
• Of interest to the speech and NL communities

Resources & evaluation by-products
• Training and test data
  – Must be of high quality (used in tests)
• Evaluation toolkits (see the scoring sketch below)
  – Expensive: of interest to all
  – Of interest to remote users (domain, country)
  – Compare with the state of the art
  – Induce participation in evaluation campaigns
  – Measure progress
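An automated evaluation toolkit reduces, at its core, to computing the task metric reproducibly over reference/hypothesis pairs. Purely as an illustration (the transcript names no specific toolkit), here is a minimal Python sketch of the standard metric for broadcast news transcription, word error rate; every name in it is ours, not an ELSE deliverable:

```python
# Minimal sketch of the scoring core of an evaluation toolkit:
# word error rate (WER) via edit distance between a reference
# transcript and a system hypothesis. Illustrative only.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

if __name__ == "__main__":
    ref = "the commission funds the evaluation campaign"
    hyp = "the commission fund evaluation campaign"
    # 1 substitution + 1 deletion over 6 reference words -> 33.3%
    print(f"WER: {word_error_rate(ref, hyp):.1%}")
```

Shipping such a scorer together with the reference data is what makes results reproducible and comparable across sites, which is the point of the by-products slide above.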
Relationship with usage-oriented evaluation
• Technology evaluation
  – Generic task
  – Attracts enough participants
  – Close enough to a practical application
• Usage evaluation
  – Specific application / specific language
  – User satisfaction criteria

Relationship with usage-oriented evaluation
• Technology insufficient: no application
• Technology sufficient: possible application
• Efforts for usage evaluation are larger than for technology evaluation
• Technology evaluation (10s): generic, organized centrally
• Usage evaluation (1000s): specific, organized by each application developer / user

Relationship with Long-Term Research
• Different objectives / time scales
• Meeting points placed in the future
• LTR: a high-risk but high-profit investment

ELSE results
• What does ELSE propose?
  – An abstract architecture (generic IR/IE: profiling, querying and presentation)
  – Control tasks that:
    1) can easily be performed by a human
    2) allow arbitrary composite functionality
    3) come with a formalism for describing task results
    4) use measures that are easy to understand
  – 6 tasks or a global task to start with...

6 control tasks to start with...
1. Broadcast News Transcription
2. Cross-Lingual IR / IE
3. Text-To-Speech Synthesis
4. Text Summarization
5. Language Model Evaluation
6. Word Annotation task (POS, lemma, syntactic roles, senses, etc.)

...or a global task to start with...
• "TV News on Demand" (NOD), inspired by BBN's "Rough'n'Ready"; see the pipeline sketch below. It:
  – segments radio and TV broadcasts
  – combines several recognition techniques (speaker ID, OCR, speech transcription, named entities, etc.)
  – detects topics
  – summarizes
  – searches/browses and retrieves information
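NOD chains several recognition components whose outputs feed one another, which is precisely the "arbitrary composite functionality" the control-task criteria call for. Below is a hypothetical sketch of such a chain; every stage is a trivial stub and every name is ours (the slide lists functionalities, not an implementation, and none of this comes from ELSE or from Rough'n'Ready):

```python
# Hypothetical NOD processing chain: each stub stands in for a real
# component (segmentation, speaker ID, OCR, transcription, NE tagging,
# topic detection, summarization).

def segment_broadcast(stream):
    # Split the broadcast into homogeneous segments (stubbed: one segment).
    return [{"audio": stream}]

def identify_speaker(seg):
    seg["speaker"] = "<speaker id>"       # stub for a speaker-ID component
    return seg

def read_captions(seg):
    seg["captions"] = "<OCR output>"      # stub for OCR on video frames
    return seg

def transcribe(seg):
    seg["transcript"] = "<ASR output>"    # stub for speech transcription
    return seg

def tag_entities(seg):
    seg["entities"] = ["<named entity>"]  # stub for NE recognition
    return seg

def detect_topic(seg):
    seg["topic"] = "<topic>"              # stub for topic detection
    return seg

def summarize(seg):
    seg["summary"] = "<summary>"          # stub for summarization
    return seg

PIPELINE = [identify_speaker, read_captions, transcribe,
            tag_entities, detect_topic, summarize]

def process(stream):
    """Run every recognition stage on every segment; the enriched
    segments are then indexed for search, browsing and retrieval."""
    segments = segment_broadcast(stream)
    for stage in PIPELINE:
        segments = [stage(seg) for seg in segments]
    return segments

print(process(b"<broadcast audio/video>")[0])
```

Composing the stages as a simple list makes it easy to score each component in isolation as well as the end-to-end chain.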
Multilingualism
• 15 countries
• 2 possible solutions:
  1) a cross-lingual functionality requirement
  2) all participants evaluate on 2 languages:
    – their own
    – one common pivot language (English?)

Results Computation
• Multidimensional evaluation (multiple mixed evaluation criteria)
• Baseline performance (contrastive)
• Dual result computation (quality)
• Reproducible (an automated evaluation toolkit is needed)

Language Resources
• Human-built reference data (cost + consistency checks + guidelines)
• Minimal size (chunk-selective evaluation)
• Minimal quality requirement
• Representativity of language phenomena
• Reusable & multilingual
• By-products of evaluation become evaluation resources

Actors in the infrastructure
• ELRA
• European Commission
• Evaluators
• Participants (EU / non-EU)
• Language Resource producers
• Research
• Industry
• Citizens
• Users & customers

Need for a Permanent Infrastructure?
• Problems with the Call for Proposals mechanism:
  – Limited duration (FPs) / cost shared by participants
• A permanent organization could handle:
  – General policy / strategy / ethical aspects
  – Scoring software
  – Label attribution / quality assurance & control
  – Production of Language Resources (dev, test)
  – Distribution of Language Resources (ELRA)
  – Cross-over between FPs

Evaluation in the Call for Proposals
• Evaluation campaigns: 2 years
• Proactive scheme: select topics (research / industry), e.g. TV News on Demand, or several tasks (BNT, CLIM, etc.)
• Reactive scheme: select projects and identify generic technologies among them (clusters?); resources contracted out of project budgets; a posteriori negotiation

Multilinguality
• Each participant should address at least two languages (own + common language)
• One language common to all participants:
  – Compare technologies on the same language/data
  – Compare languages with the same technology
  – English: spoken by many people, large market, cooperation with the USA
  – Up to 4 languages for each consortium
  – Other languages in future actions

Proactive vs Reactive?
• ELSE's views:
  – Proactive
  – Single consortium
  – Permanent organization (association + agency)
  – English as the common language

Estimated Cost
• 100% EC funding for the infrastructure: organization, LRs
• Participants: share of system development
• Reactive: extra funding for evaluation
• Proactive: 600 Keuro on average per topic (3.6 Meuro total, i.e. 6 topics at 600 Keuro each):
  – 90 Keuro organization
  – 180 Keuro LR production
  – 300 Keuro participants (up to 10)
  – 30 Keuro supervision by the permanent organization
  (90 + 180 + 300 + 30 = 600 Keuro per topic)

Questions?
• Are you interested in the concept?
• Would you be interested in participating?
• Would you be interested in providing data?
• Would you be ready to pay to participate?
• Would you be ready to pay for access to the results (and by-products, e.g. data and tools) of an evaluation?
• Would you be interested in paying for specific evaluation services?