Training presentation

RECENT TRENDS IN METADATA GENERATION
Milena Dobreva, Nikola Ikonomov
IMI-BAS
SEEDI Conference, Cetinje, September 2007
Warm-up question:
• Do we need this talk?
We are all metadata experts!
A Metadata Metaphor?
• We know there are
− Many standards
− Too many ad-hoc solutions
• We even know how to use some of them (or at least which is the right one for our project)
• BUT we typically do not know
− How to save time and human effort in creating and editing metadata?
The current picture
• We cannot avoid looking for answers to the question of saving time and effort, because
− We live in a time of data deluge
− The number of born-digital objects grows rapidly
i.e. the demand for metadata and for its quality grows
Metadata in the Digital Library Context: the DELOS project reference model
Metadata seems to be part only of the CONTENT, but it influences all core concepts
• Content is the entry point for all the concepts related to the content
that is managed and disseminated by the DL e.g. collections,
information space model, metadata, ontologies;
• User is the root for concepts like roles, communities, profiles, etc., that
represent aspects of the DL users;
• Functionality is the entrance to that part of the model which concerns
DL functions;
• Architecture regards software components, hosting nodes and how
these are linked and constrained;
• Quality groups qualitative parameters characterizing the digital library
behavior within a given operational domain;
• Policy covers all the concepts that are related to established procedures
or plans of actions governing the DL, such as collection management,
preservation, access rights, etc.
Definitions
• Recall: the proportion of relevant documents that are retrieved, out of all relevant documents;
• Precision: the proportion of retrieved documents that are relevant;
• Accuracy: the quantity of retrieved documents that match the topic exactly.
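A minimal sketch of how recall and precision are usually computed, assuming the retrieved and the relevant documents are both available as sets of identifiers. The function name and the sample identifiers are illustrative and do not come from the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # documents that are both retrieved and relevant
    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 of them relevant, 6 relevant overall.
p, r = precision_recall(
    {"d1", "d2", "d3", "d4"},
    {"d1", "d2", "d3", "d5", "d6", "d7"},
)
print(p, r)  # 0.75 0.5
```

Precision drops when irrelevant documents are retrieved; recall drops when relevant documents are missed, which is why the two are normally reported together.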
Automatic extraction of metadata
• A group of NLP methods: text analysis aimed at extraction of specific metadata elements
• Various elements
• Measurement: through information retrieval measures (accuracy, recall, precision)
Current research
| Extracted elements for metadata | Supported file formats | Source | Evaluation of the results |
|---|---|---|---|
| document titles | HTML files where the <title> tag is not used properly | [Hu et al. 2005a] | 20.9%–32.6% improvement over the baseline method (using lines in largest size) |
| document titles | Word, PowerPoint | [Hu et al. 2005], [Hu et al. 2006] | 0.810 precision and 0.875 recall for Word documents; 0.837 precision and 0.845 recall for PowerPoint documents |
| headline titles | financial articles | [Debnath, Giles 2005] | n.a. |
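The baseline mentioned for [Hu et al. 2005a] takes the lines in the largest size as the title. A minimal sketch of such a baseline follows, reading "largest size" as largest font size and assuming each line is already available as a (text, font size) pair. How those pairs are obtained from a real file is not covered, and the function name and sample data are hypothetical.

```python
def largest_font_title(lines):
    """Baseline title guess: join the lines set in the largest font size.

    `lines` is a list of (text, font_size) pairs in reading order.
    """
    # Ignore empty lines so whitespace never wins the size comparison.
    candidates = [(text.strip(), size) for text, size in lines if text.strip()]
    if not candidates:
        return ""
    largest = max(size for _, size in candidates)
    return " ".join(text for text, size in candidates if size == largest)


# Hypothetical input modelled on the opening slide of this talk.
slide = [
    ("RECENT TRENDS IN METADATA", 32),
    ("GENERATION", 32),
    ("Milena Dobreva, Nikola Ikonomov", 18),
    ("IMI-BAS", 18),
]
print(largest_font_title(slide))  # RECENT TRENDS IN METADATA GENERATION
```

A rule like this fails whenever a logo, a section heading or a decorative line is set larger than the real title, which is the kind of case the learned methods in the table are reported to improve on.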
Current research (cont’d)
| Extracted elements for metadata | Supported file formats | Source | Evaluation of the results |
|---|---|---|---|
| descriptive metadata: title, author, affiliation, and abstract | scanned medical journal articles at NLM (US National Library of Medicine) | [Mao et al. 2004a] | n.a. |
| document titles | postscript files | [Giuffrida et al. 2000] | Title with 92% accuracy, author(s) with 87% accuracy, affiliation(s) with 75% accuracy, author-affiliations with 71% accuracy and table of contents with 76% accuracy |
| assigns Dublin Core + GEM (Gateway to Educational Materials) tags: Title, Keyword, Description, Grade, Duration, Essential Resources, Pedagogy-Teaching Method, and Pedagogy-Group | educational materials | [Yilmazel et al. 2004] | web-based survey (users evaluate the metadata quality) |
Current research (cont’d)
Title, Author, Affiliation, Address, Note, Email, Date, Abstract, Phone, Keyword, Web, Degree, Pubnum
header part of research papers
reference metadata in scholarly publications: author, title, journal, volume, number (issue), year, and page information
92.9% accuracy; precision for various elements is between 0.795 and 0.969, recall between 0.622 and 0.991
[Council et al. 2005]
precision 0.7845, recall 0.8955
[Liu et al. 2006]
n.a.
[Day et al. 2005]
0.9239 accuracy
DublinCore
acknowledgements in research papers
tables
[Han et al. 2003]
PDF
Conclusions
• These tools are all used for processing English texts; the Balkan languages pose additional challenges
• The quality of the results achieved is not yet high enough, but this is a field of active work
• Integration of image and text processing is another direction for future work.
Thank you for your attention!