Main Mono and Bilingual Tasks:
Track Organisation and Results Analysis
Giorgio M. Di Nunzio, University of Padua, Italy, [email protected]
Nicola Ferro, University of Padua, Italy, [email protected]
Carol Peters, ISTI-CNR, Area di Ricerca Pisa, Italy, [email protected]
CLEF 2007 Workshop
Budapest, Hungary, 19–21 September 2007
Outline
① CLEF Infrastructure: DIRECT
② Track Overview
③ Monolingual Tasks
④ Bilingual Tasks
Information Hierarchy
Figure: pyramid with four levels, bottom to top: Data (Experimental Collections and Experiments), Information (Measures), Knowledge (Statistics), Wisdom (Papers).
• experimental collections and the experiments are data, since they are the raw, basic elements needed for any further investigation
• performance measurements are information, since they are the result of computations and processing on the data
• descriptive statistics and the hypothesis tests are knowledge, since they are a further elaboration of the information carried by the performance measurements
• theories, models, algorithms, and techniques are wisdom, since they provide interpretation, explanation, and formalization of the content of the previous levels
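The step from data to information to knowledge can be sketched in a few lines. The example below is only an illustration: the slides do not say which performance measure is used, so average precision is an assumption, and the run, the relevance judgments, and all document and topic identifiers are hypothetical.

```python
def average_precision(ranked_docs, relevant_docs):
    """Average precision of one ranked list against a set of relevant documents."""
    if not relevant_docs:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant_docs:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / len(relevant_docs)


# Data: a hypothetical experiment (ranked lists) and relevance judgments.
run = {"T1": ["d1", "d2", "d3", "d4"], "T2": ["d7", "d5", "d6"]}
qrels = {"T1": {"d1", "d3"}, "T2": {"d5"}}

# Information: one performance measurement per topic.
ap_per_topic = {topic: average_precision(run[topic], qrels[topic]) for topic in run}

# Knowledge: a descriptive statistic over the measurements (here, the mean).
map_score = sum(ap_per_topic.values()) / len(ap_per_topic)
```

Each level is computed only from the level below it, which is the dependency the hierarchy describes.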
Approach to the Evaluation (1/2)
• Introduce a conceptual model
  - it makes clear which entities are entailed by the information space of an evaluation campaign, along with their features and relationships
  - logical models can be derived from it to manage and preserve the experimental data
  - commonly agreed data formats for exchanging information can be derived from it
• Develop common metadata formats
  - they provide meaning to the data, and thereby enable their sharing and re-use
  - they make it possible to keep track of the lineage of the managed information
• Adopt a unique identification mechanism
  - it allows for explicit citation of and easy access to the scientific data, and it supports the enrichment of the scientific data
Approach to the Evaluation (2/2)
• Provide common tools for statistical analyses
  - they allow for judging whether measured differences between retrieval methods can be considered statistically significant
  - a uniform way of performing statistical analyses on experiments makes the analysis and assessment of the experiments comparable too
• Design and develop a Digital Library System (DLS) for IR scientific data
  - it is well suited for managing and making accessible the scientific data and the experiments produced during the course of an evaluation campaign
  - it also provides tools for analyzing, comparing, and citing the scientific data of an evaluation campaign, as well as curating, preserving, annotating, enriching, and promoting their re-use
• Give organizations responsible for evaluation initiatives an active role in this process
  - they should take a leadership role in developing a comprehensive strategy for long-lived digital data collections and drive the research community through this process in order to improve the way of doing research
  - they should also take care of defining guiding principles, policies, and best practices for making use of the scientific data produced during the evaluation campaign itself
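As an illustration of judging whether a measured difference between two retrieval methods is statistically significant, the sketch below runs a paired randomization test on per-topic scores. The slides do not name the tests provided by the infrastructure, so the choice of test is an assumption, and the function name and scores are hypothetical.

```python
import random


def paired_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization test; returns an estimated p-value."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both runs must cover the same topics")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis, the sign of each per-topic difference is arbitrary.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed - 1e-12:
            hits += 1
    return hits / trials


# Hypothetical per-topic average precision for two runs over the same ten topics.
run_a = [0.42, 0.55, 0.31, 0.68, 0.47, 0.59, 0.36, 0.72, 0.50, 0.44]
run_b = [0.32, 0.45, 0.21, 0.58, 0.37, 0.49, 0.26, 0.62, 0.40, 0.34]
p_value = paired_randomization_test(run_a, run_b)
```

Applying the same test, with the same settings, to every pair of experiments is what keeps the resulting assessments comparable.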
Internationalization of the User Interface
Bulgarian: Petya Osenova, Kiril Simov
Czech: Pavel Pecina
English: Marco Dussin
French: Jacques Savoy
German: Thomas Mandl
Indonesian: Mirna Adriani
Italian: Marco Dussin
Portuguese: Paulo Rocha, Diana Santos
Spanish: Julio Villena Román
Identification: Digital Object Identifiers (DOI)
10.2415/AH-BILI-X2BG-CLEF2007.JHU-APL.APLBIENBGTD4
DOIs
• allow us to uniquely identify a digital object
• are persistent and actionable
• are aimed especially at intellectual property
We assign DOIs to:
• collections − prefix 10.2453
• topics − prefix 10.2452
• experiments − prefix 10.2415
• pools − prefix 10.2454
• statistical tests − prefix 10.2455
http://www.medra.org
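The prefix scheme above can be used to tell what kind of object a DOI refers to. The following is a minimal sketch: the prefix table and the example DOI are taken from this slide, while the function name and the "unknown" fallback are assumptions made for illustration.

```python
# Prefix-to-object-type table, as listed on the slide.
DOI_PREFIXES = {
    "10.2453": "collection",
    "10.2452": "topic",
    "10.2415": "experiment",
    "10.2454": "pool",
    "10.2455": "statistical test",
}


def classify_doi(doi):
    """Split a DOI into prefix and suffix and look up the object type."""
    prefix, separator, suffix = doi.partition("/")
    if not separator or not suffix:
        raise ValueError(f"not a DOI of the form prefix/suffix: {doi!r}")
    # Hypothetical fallback for prefixes not listed on the slide.
    return DOI_PREFIXES.get(prefix, "unknown"), suffix


kind, suffix = classify_doi("10.2415/AH-BILI-X2BG-CLEF2007.JHU-APL.APLBIENBGTD4")
```

For the example DOI shown on the slide, the prefix 10.2415 identifies an experiment.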
DOI Resolution
http://dx.doi.org
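Resolution works by appending the DOI to the resolver address, which then redirects to the registered location of the object. A minimal sketch of building such a link follows; the function name is hypothetical, and the DOI is the example from the identification slide.

```python
from urllib.parse import quote

RESOLVER = "http://dx.doi.org"


def doi_to_url(doi, resolver=RESOLVER):
    """Build an actionable link for a DOI by appending it to the resolver address."""
    # Percent-encode unusual characters, but keep the prefix/suffix slash intact.
    return f"{resolver.rstrip('/')}/{quote(doi, safe='/')}"


url = doi_to_url("10.2415/AH-BILI-X2BG-CLEF2007.JHU-APL.APLBIENBGTD4")
```

The sketch only builds the link; following it requires network access and depends on the DOI still being registered.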
Experiment Metrics
Experiment Statistics
Experiment Plots
Task Statistics
Task Plots
Appendices (1/2)
Appendices (2/2)
Participation
Participation by Country
Tasks and Collections
Monolingual and bilingual tasks were principally offered for Central European languages: Bulgarian, Czech and Hungarian.

Language | Task | Collection
Bulgarian | Monolingual BG, Bilingual X2BG | Sega 2002, Standart 2002, Novinar 2002*
Czech* | Monolingual CS, Bilingual X2CS | Mlada fronta DNES 2002, Lidové Noviny 2002
Hungarian | Monolingual HU, Bilingual X2HU | Magyar Hirlap 2002
English | Bilingual X2EN (Indian sub-task) | LA Times 2002*

Topics in 16 languages
• European languages: Bulgarian, Czech, English, French, Hungarian, Italian and Spanish
• non-European languages (for X2EN): Amharic, Chinese, Indonesian, Oromo
• Indian sub-task: Bengali, Hindi, Marathi, Tamil and Telugu
Participation by Task
172 submitted runs
Runs by Source Language
Monolingual Bulgarian
Monolingual Czech
Monolingual Hungarian
Monolingual English*
Approaches to Monolingual Retrieval
Main emphasis: stemming
• Linguistic NLP techniques: morphological lemmatizer, morphological analysis, Named Entity Recognition, word decompounding
• Stemmers: both light and aggressive
• Stemming vs 4-grams: impact on individual topics but not on average
• Relevance feed-back (RF): probabilistic RF, mutual information RF, blind relevance feedback; relevance feed-back can be detrimental
• Indexing: word-based or 4-grams
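Indexing on 4-grams, listed above as an alternative to word-based indexing, replaces each word with its overlapping four-character substrings. The sketch below is one simple variant; the slides do not say how participants built their n-grams, so keeping short words whole and not spanning word boundaries are assumptions.

```python
def char_ngrams(text, n=4):
    """Split text into lowercase, within-word character n-grams."""
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)  # assumption: short words are indexed whole
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


terms = char_ngrams("retrieval")
# terms == ["retr", "etri", "trie", "riev", "ieva", "eval"]
```

Because no language-specific resources are needed, the same function applies unchanged to any of the languages in the track, which is the usual appeal of n-grams compared with stemmers.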
Bilingual X English
Approaches to Bilingual X2EN
Main emphasis: bilingual dictionaries, machine translation, parallel corpora, use of pivot languages, query expansion with RF
• Best bilingual English system is about 88% of the best monolingual system
• Coverage of lexicons
• Translation ambiguity resolution with a graph-based approach
• Afaan Oromo: stemmer, stop list creation, bilingual Oromo-English dictionary creation
• Bilingual Hungarian to English: bilingual dictionary, lexicon coverage with a pattern-based approach, exploiting Wikipedia to remove improbable translations
Bilingual X2EN: Indian Subtask
• limited linguistic resources
• bilingual dictionaries; Hindi-English and Telugu-English dictionaries created in one week
• statistical MT system trained on parallel aligned sentences
• OOV terms: rule-based approach for transliteration; phoneme-based transliterations to generate equivalent English queries
• transliteration and edit distances
• stop list creation
• stemming and n-gram approach
• language models combined with TFIDF
• boolean operators
• stemmers and morphological analyzers if available
• translation disambiguation via a page-rank style algorithm