Aligning Thesauri for an
integrated Access to Cultural
Heritage Collections
Antoine ISAAC
(including slides by Frank van Harmelen)
STITCH Project
UDC Conference
June 5th, 2007
Aligning Thesauri for an integrated Access to CH Collections
Background
• CATCH
• Continuous Access To Cultural Heritage
• Funded by NWO
• 10 computer science research projects applied to the Cultural
Heritage field
• STITCH
• SemanTic Interoperability To access Cultural Heritage
• Exchanging and integrating metadata
Beware: this is research!
Agenda
• The Semantic Interoperability problem
• Demo
• Semantic Web solutions for interoperability
• Conceptual vocabulary alignment
• Conceptual vocabulary representation
KB Illustrated Manuscripts
BNF Mandragore
The Semantic Interoperability Problem
• Trend: simultaneous access to different collections
• Problem: conceptual heterogeneity
• No standard vocabulary/thesaurus
• “classical ruins” vs. “landscape with ruins”
• “the Virgin Mary” vs. “Saint Mary”
• We don’t really want one: different vocabularies serve different domains, traditions, tasks
• Practical consequence:
• Searching for “the Virgin Mary” misses “Saint Mary”
• Unless we know both vocabularies
Old situation
Vocabulary alignment
• Find semantic correspondences between vocabulary
elements
• “classical ruins” ≈ “landscape with ruins”
• “the Virgin Mary” = “Saint Mary”
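A minimal sketch of how such correspondences could be used at search time, so that a query on one vocabulary also reaches the other. The alignment table, the vocabulary names `voc1`/`voc2`, the relation names `exact`/`close`, and the function `expand_query` are all hypothetical illustrations, not part of the STITCH system.

```python
# Hypothetical alignment table: each entry links a concept in one
# vocabulary to a corresponding concept in the other vocabulary.
ALIGNMENT = [
    (("voc1", "the Virgin Mary"), "exact", ("voc2", "Saint Mary")),
    (("voc1", "classical ruins"), "close", ("voc2", "landscape with ruins")),
]


def expand_query(vocabulary, label, relations=("exact", "close")):
    """Return the query concept followed by every concept aligned to it."""
    query = (vocabulary, label)
    results = [query]
    for source, relation, target in ALIGNMENT:
        if relation not in relations:
            continue  # skip correspondences that are too weak for this search
        if source == query:
            results.append(target)
        elif target == query:
            results.append(source)
    return results


print(expand_query("voc1", "the Virgin Mary"))
```

Restricting `relations` to `("exact",)` would keep only the “=” correspondences and drop the “≈” ones.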
New situation
Demo
• http://stitch.cs.vu.nl/rp33333/MANDRA-SV-ICEmandraNewNONE , amphibians
• Wheat
[Screenshots at the end of these slides]
Vocabulary alignment
• Find correspondences between vocabulary elements
• “klassieke ruïnes” (classical ruins) ≈ “landschap met ruïnes” (landscape with ruins)
• “maagd Maria” (Virgin Mary) = “Heilige Moeder” (Holy Mother)
• STITCH aim: doing it (semi-)automatically
• Vocabularies are big
• They evolve over time
• Using techniques from Semantic Web research domain
• Problem comparable to ontology alignment
• Techniques already investigated there
• Linguistics, statistics
Automatic alignment techniques
• Lexical
• Structural
• Statistical
• Background knowledge
Lexical alignment
• Labels of entities, textual definitions
• “Long brain tumor” is more specific than “Long tumor”
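The label comparison above can be sketched as a word-containment test: if one label contains every word of another plus extra modifiers, it is taken to be more specific. This is a simplified illustration under that assumption; the function names are hypothetical, and real lexical matchers typically add stemming, stop-word handling, and synonym lookup.

```python
def tokens(label):
    """Lower-case a label and split it into a set of words."""
    return set(label.lower().split())


def lexical_relation(label_a, label_b):
    """Compare two labels by word containment.

    Returns 'equivalent', 'narrower' (label_a is more specific than
    label_b), 'broader' (label_a is more general), or None.
    """
    a, b = tokens(label_a), tokens(label_b)
    if a == b:
        return "equivalent"
    if b < a:  # label_a has every word of label_b plus extra modifiers
        return "narrower"
    if a < b:
        return "broader"
    return None


print(lexical_relation("Long brain tumor", "Long tumor"))
```

Treating labels as unordered word sets ignores word order, which is a deliberate simplification here.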
Statistical alignment
• Object information (e.g. book indexing)
Statistical alignment: KB collections
(4951 1152 613) Nederlands (Dutch) - Nederlandse taalkunde (Dutch linguistics)
(280 714 243) Diabetes mellitus - suikerziekte (diabetes)
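One way to turn such counts into a similarity score is the Jaccard measure. The sketch below ASSUMES that each triple reads (objects indexed with the first concept, objects indexed with the second concept, objects indexed with both); the slide does not state this, nor which measure STITCH actually used, so both the interpretation and the choice of Jaccard are illustrative only.

```python
def jaccard(count_a, count_b, count_both):
    """Jaccard overlap of two sets of indexed objects, computed from sizes."""
    union = count_a + count_b - count_both
    return count_both / union if union else 0.0


# ASSUMPTION: each triple is (objects indexed with concept A,
# objects indexed with concept B, objects indexed with both).
candidates = {
    ("Nederlands", "Nederlandse taalkunde"): (4951, 1152, 613),
    ("Diabetes mellitus", "suikerziekte"): (280, 714, 243),
}

for (concept_a, concept_b), counts in candidates.items():
    print(f"{concept_a} - {concept_b}: {jaccard(*counts):.3f}")
```

Under this reading, a pair whose shared objects make up a larger fraction of the union scores higher, independent of how many objects each concept indexes overall.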
Alignment using shared background knowledge
• Using a shared conceptual reference to find links
[Diagram: thesaurus 1 and thesaurus 2, each linked to the shared background knowledge]
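A minimal sketch of the idea, assuming each thesaurus has already been linked to a shared background resource: two concepts become candidate correspondences when they point at the same background concept. All identifiers (`t1:`, `t2:`, `bg:` and the concept names) are hypothetical examples, not data from the project.

```python
# Hypothetical links from each thesaurus to a shared background resource.
THESAURUS_1_TO_BACKGROUND = {
    "t1:frogs and toads": "bg:amphibians",
    "t1:wheat": "bg:cereals",
}
THESAURUS_2_TO_BACKGROUND = {
    "t2:amphibia": "bg:amphibians",
    "t2:grain crops": "bg:cereals",
    "t2:castles": "bg:buildings",
}


def align_via_background(links_1, links_2):
    """Pair concepts from two thesauri that share a background concept."""
    by_anchor = {}
    for concept, anchor in links_2.items():
        by_anchor.setdefault(anchor, []).append(concept)
    return sorted(
        (concept_1, concept_2)
        for concept_1, anchor in links_1.items()
        for concept_2 in by_anchor.get(anchor, [])
    )


print(align_via_background(THESAURUS_1_TO_BACKGROUND, THESAURUS_2_TO_BACKGROUND))
```

The pairs produced this way are only candidates: sharing an anchor shows that two concepts are connected through the background resource, not that they are equivalent.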
Alignment: no universal solution
• No single technique gives an ideal solution
• Different techniques have to be selected/combined,
depending on the application case
• Poor vs. rich semantic structure
• Extensive vs. limited lexical coverage
• Existence of collections described by several vocabularies
• Alignment is a difficult research problem
Representing Vocabularies
Many different models and formats exist for representing
vocabularies
• Need for standard formats to develop standardized
tools and methods
• Alignment process
• Browsing/information retrieval tools using vocabularies
• Need to represent features commonly used by these
tools
• Especially lexical information and semantic links
SKOS (Simple Knowledge Organisation System)
• World Wide Web Consortium (W3C)
• Model to represent simple conceptual vocabularies
(thesauri, classification schemes) on the Semantic Web
• Comparable to Dublin Core, for conceptual vocabularies
• SKOS offers building blocks to create XML/RDF data
• Concepts and ConceptSchemes
• Lexical properties (prefLabel, altLabel)
• Semantic relations (broader, related)
• Notes (scopeNote, definition)
SKOS: Small UDC Example
http://www.udcc.org/udc/class_512 (a skos:Concept)
  skos:prefLabel 512@zxx
  skos:prefLabel Algebra@en
  skos:broader http://www.udcc.org/udc/class_51
• Beware: this is a standard, not everything can be represented!
E.g. for UDC, difficult to represent all types of auxiliaries
Is -2 Evidence of religion a standard concept?
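The small UDC example can be written out as RDF triples. The sketch below serializes it as N-Triples using only the Python standard library; the helper names (`uri`, `literal`, `to_ntriples`) are hypothetical, the concept URIs are those shown on the slide, and the namespace URIs are the published SKOS and RDF ones.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
UDC = "http://www.udcc.org/udc/"  # URI base as shown on the slide


def uri(value):
    """Format a URI reference in N-Triples syntax."""
    return f"<{value}>"


def literal(text, lang):
    """Format a language-tagged literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


algebra = uri(UDC + "class_512")
triples = [
    (algebra, uri(RDF_TYPE), uri(SKOS + "Concept")),
    (algebra, uri(SKOS + "prefLabel"), literal("512", "zxx")),
    (algebra, uri(SKOS + "prefLabel"), literal("Algebra", "en")),
    (algebra, uri(SKOS + "broader"), uri(UDC + "class_51")),
]


def to_ntriples(rows):
    """Serialize (subject, predicate, object) rows, one triple per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in rows)


print(to_ntriples(triples))
```

In practice an RDF library would handle serialization and validation; plain strings are used here only to keep the example self-contained.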
Conclusion: New opportunities for making
knowledge accessible
• Integration of collections at the semantic level
• Semantic integration and vocabulary alignment
• Representation and publication of conceptual
vocabularies
• SKOS is an open, web-compatible standard
• Semantic Web research can help Cultural
Heritage
• Vision: a global network of interconnected collections and
vocabularies that can be exploited by standard tools?
• Or somewhere in-between present situation and the vision
Discussion: UDC and Semantic Interoperability?
• UDC as pivot language (spine) for multilingual access
• Ideal for multilingual scenarios
• Compatible with common information needs
• “Front-office” scenario
• Aligning initial vocabularies to UDC
• Using UDC in the access system
• MSAC
• Multilingual Subject Access to Catalogues of National Libraries
• UDC as a searching/browsing means, with other vocabularies
• “Back-office” scenario?
• UDC as a background resource for automatic pairwise
alignment between the initial vocabularies
• Multilingual information, rich semantic structure
• Both scenarios require more accessible UDC
• And experimentation…
Thanks!
Links
• STITCH http://stitch.cs.vu.nl
• Demo collections
• BNF Mandragore http://mandragore.bnf.fr
• KB illuminated manuscripts http://www.kb.nl/manuscripts/
• Library-originated integration projects:
• MSAC search interface http://sigma.nkp.cz
• MACS project http://macs.cenl.org
• Semantic web links
• Semantic Web at W3C http://www.w3.org/2001/sw/
• SKOS http://www.w3.org/2004/02/skos/
• Semantic Web projects dealing with Cultural Heritage
• MuseumFinland http://www.museosuomi.fi/
• eCulture http://e-culture.multimedian.nl/
Demo (1)
[Screenshot] Subject vocabulary, collection 1; subjects
Demo (2)
[Screenshot] Hierarchical path from root to selected subject; possible specialization for selected subject
Demo (3)
[Screenshot] Semantic alignment of subjects activated; document from Collection 2
Demo (4)
[Screenshot] Subject from voc2 aligned to “voc1:amphibians”