Authority services Interthesaurus correlations


Integrated Information Management and Access
- new chances for museums, archives and libraries
Martin Doerr
Center for Cultural Informatics
Institute of Computer Science
Foundation for Research and Technology - Hellas
Singapore,
August 1, 2008
ICS-FORTH August 1, 2008
Integrated Information Management
Overview
- Information integration: a utility perspective
- Museum and library information
- Key-words, finding aids and thesauri
- Do we talk about the same thing?
- Understanding events, contexts and stories
- CIDOC CRM, simple implementations
Information Integration Management
A Perspective of Utility
Memory institutions maintain Digital Repositories (“Digital Memories”):
information systems that preserve and provide access to primary
information sources, scientific and scholarly information and literature,
such as digital libraries of publications, indices of archives of social or
scientific activities, or documentation of physical collections.
Digital Repositories are necessarily heterogeneous to optimize their
function for different information forms and access needs, but the
knowledge they contain forms a logical whole.
To get information and learn from it, we need:
- uniform access,
- retrieval by human criteria, and
- connection of disparate information assets (e.g., a painting and a biography).
Information Integration Management
A Perspective of Utility
Information integration provides a syntactically and semantically
homogeneous layer on top, be it physical or virtual, manual or
automated.
Multiple standard formats can coexist, if information can be transformed or
merged. One format does not ensure that the information is connected!
Standardization and transformation go hand in hand. For both,
documentation (metadata) needs to be provided, adapted or “cleaned”:
- Legacy data must be brought into standard form or moved from one standard to another, and data must be “tuned” so that they can be transformed.
- Ultimate integration cost: manual creation or adaptation of metadata.
- Better integration is not always more work, but it needs more foresight.
- Bad decisions cost the most.
Information Integration Management
A Perspective of Utility
Levels of integration: from one platform, I can…
1. read everything, if I have the ID: syntactic integration (the Web);
2. get everything that refers to the words X, Y, Z (Google and other search engines);
3. get everything about a particular person, thing, place, fact or concept;
4. learn whether there are things or facts with given characteristics;
5. learn about associations and contexts of things across documents.
For instance:
- What species is this object?
- Which professions did the relatives of van Gogh have? Who were the clients for van Gogh’s paintings?
- Were German soldiers in Russia before WWII?
- Which antique art objects may Michelangelo have seen? (A 25-year project!)
Information Integration Management
A Perspective of Utility
- The traditional library task:
  - Collect and preserve documents and provide finding aids.
  - The job is done when the (one, best) document is handed out: “All you need is in this document”.
- But understanding depends on relationships. Museum information has complex relationships, which may be categorical or factual:
  - Categorical relationships (e.g., “smoking causes cancer”) are richly exploited by Semantic Web technology. Their use and integration are limited to research results; they are not useful for primary research itself.
  - Factual associations concatenate information assets into meaningful (“epistemic”) networks (“stories”). They support context-based hypothesis building, cross-disciplinary search, etc. (e.g., “John smoked at 20, … 30, … 40”; “John had lung cancer at 60”).
Information Integration Management
Library, Archive, Museum Information
- The typical library contents: “The whole stories”
  - Secondary literature (research results)
  - Facts brought into causal context
  - Categorical: theories and hypotheses
  - Fiction
- The typical archive contents: “The needle in the haystack”
  - Primary sources, “bits and pieces” (letters, legal documents, administrative acts, images, scientific records)
  - Factual, kept in the sequence of creation, as arranged by the creator or the responsible party
- The typical museum information: “Museum objects rarely talk”
  - Factual documentation of properties and context per object, references, classification
  - Highly heterogeneous, disparate
Museum Information
“A Monet is not like a Dinosaur”

Museum objects may be:
- Unique in form, valuable out of context:
  - Valued art objects (“La Pie” by Monet), aesthetic minerals, exceptional life forms, curiosities.
- Unique by particular context, not valuable out of context, valuable only as illustration or symbol:
  - Historical heirlooms, relics of saints, “John Lennon’s T-Shirt”.
- Not unique, not particularly valuable, used as an example of a category outside the particular context:
  - Most objects in natural history, ethnology, archeology.
- Unique by rarity, valuable as evidence out of a particular context:
  - Most objects in paleontology, many unique archeological objects (“6th left rib from a T. Rex”).
Information Integration Management
The Museum Information Problem
The ultimate goal of users seeking information is not to get an “object”
but to understand a topic.
- Understanding depends on relationships:
  - objects are interpreted by context (e.g., bone finds in Evans’s “bathtubs”)
  - contexts are interpreted by objects (e.g., many arrowheads in Troy IV)
  - objects are interpreted by categories (e.g., Evans’s Minoan “bathtubs”)
  - categories are supported by examples (e.g., the shape of a kris)
  - categories may be based on rare evidence (e.g., a hominid tooth)
- We need to integrate museums, archives and libraries in a sensible way in order to find integrated knowledge and produce new knowledge, and to provide evidence that supports new hypotheses or verifies or challenges old ones.
Information Integration Management
Library and Museum Information
Museum and library information has complex interrelations. The two
overlap in part and differ otherwise.
- Libraries document literature in order to facilitate access to it.
- Museum documentation classifies and describes museum objects, their context and relevance. It refers to literature. Museums regularly produce (secondary) literature.
- Museum objects are referred to and published in literature. Literature may describe museum objects, their context, and theories about and related to them. Literature describes concepts that are exemplified or illustrated by museum objects. There is no standard documentation format for that yet!
- Libraries may also produce literature. Libraries may document and curate rare objects as museums do. Most museums maintain libraries.
Information Integration Management
Archive, Library and Museum Information
[Diagram relating Libraries, Museums and Archives to Books, Objects and primary Documents. Edge labels: provide finding aids; publish; document features & context; illustrate, exemplify; using; refer to; are about; make narratives from.]
Key-words, Finding Aids and Thesauri
The second level of integration
- Why is Google (i.e., search engines!) good?
  - Low cost, no data tuning, scalable
  - Easily finds secondary literature, especially if it is abundant
  - Finds things by their usual category names
  - No user training, no access language
=> Recommendation: You should always provide a good search engine!
- Why is Google bad?
  - The user must know all synonyms
  - Names are not things: rare things are hidden under frequent names (e.g., “George Bush”, a software package called “Volcano”)
  - Relations are found only by aggregation of terms appearing in the source (e.g., “First known Turkish - Greek marriage in Crete” (1635))
  - No control over relevance, no statistics possible, no related sources
Key-words, Finding Aids and Thesauri
The second level of integration
- Finding aids:
  - Assumption: the user knows a topic characterized by a noun, or knows associations of the topic that are uncorrelated to the problem to be solved (e.g., “organic farming” for “host-parasite studies”, an author for a topic, or a search for an object by date of acquisition because I don’t remember the name).
  - The Dublin Core Metadata Element Set makes 15 relationships to terms explicit (type, classification, creator, publisher, date, format, etc.).
  - It increases precision.
  - It increases recall if additional terms are added to the metadata.
Key-words, Finding Aids and Thesauri
The second level of integration
- Is Dublin Core better than Google?
  - Literature search by author and title: Google is sufficient or better.
  - Type, format, subject, coverage: DC is better only if the terms are not in the content.
  - Relationship: DC is better if the items are not connected by a relevant term cluster.
  - Non-verbose, non-digital objects: DC provides the minimal metadata!
  - By Shakespeare or about Shakespeare: DC disambiguates!
- What does Dublin Core not do?
  - It is not appropriate for museum objects (no place, finding info, material).
  - No typed relationships, no context information.
  - No notion of identity (separation of URI and name, American library tradition).
=> DC has significant benefit for non-verbose digital objects.
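The “by Shakespeare or about Shakespeare” case can be sketched in a few lines of Python. This is only an illustration: the two records and the helper names `keyword_search` and `field_search` are invented here; only the element names `title`, `creator` and `subject` come from Dublin Core.

```python
# Hypothetical records; only the field names follow Dublin Core elements.
records = [
    {"title": "Hamlet", "creator": "Shakespeare, William", "subject": "Tragedy"},
    {"title": "A Life of Shakespeare", "creator": "Example, Author",
     "subject": "Shakespeare, William"},
]

def keyword_search(term):
    """Search-engine style: match the term in any field value."""
    term = term.lower()
    return [r["title"] for r in records
            if any(term in value.lower() for value in r.values())]

def field_search(field, term):
    """Metadata style: match the term only in the named element."""
    term = term.lower()
    return [r["title"] for r in records if term in r.get(field, "").lower()]

print(keyword_search("shakespeare"))           # both records match
print(field_search("creator", "shakespeare"))  # works by Shakespeare
print(field_search("subject", "shakespeare"))  # works about Shakespeare
```

The keyword search cannot tell the two roles apart, while the fielded search separates them because the relationship between the record and the term is explicit.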
Key-words, Finding Aids and Thesauri
The second level of integration
- Thesauri of controlled terms (categories):
  - Cover subjects, object types, place types, person roles, event types.
  - Good for secondary literature search and metadata fields (libraries!).
  - Bad: a “new language” that users must learn; expensive to create.
  - Invisible thesauri enhance search engines.
- “Museums do not like thesauri”:
  - Thesauri are not suited for factual knowledge!
  - Cultural terminology is a dynamic research tool (“every PhD a new typology”) used to infer function, time, etc. from form.
  - Only a few high-level terms are stable and useful for finding aids.
Recommendation: Small thesauri for museums (that users can see on
one page) increase the power of metadata and improve search results.
Do we talk about the same Thing?
Co-reference can connect documents!
Such networks hide stories!
(complementary information)
Do we talk about the same Thing?
Hypertext is wrong: Documents contain links!
[Diagram: documents in digital libraries are linked by co-reference to real-world nodes (KOS), each primary link corresponding to one document. Nodes shown: Donald Johanson, Johanson's Expedition, Discovery of Lucy, Lucy (AL 288-1), Hadar, Ethiopia, Cleveland Museum of Natural History. The nodes are instances of the CIDOC CRM core ontology; integration is by factual relations, which allow deductions.]
Do we talk about the same Thing?
Co-reference links via authority files
[Diagram: Source 1 and Source 2 each hold content with local ids. A query with input “Martin” is matched in Source 1; its local ids are matched against a dynamic link table in an authority service (e.g., “Κώστας” / “Kostas”), and a join through that link yields the output “George” from Source 2. Captions: Join across sources by transitivity of co-reference; Find “friends of a friend”; Not scalable!]
Do we talk about the same Thing?
Co-reference links without authority files
[Diagram: Source 1 and Source 2 each hold content with local ids. A query with input “Martin” is matched in Source 1; co-references are made directly between the local ids of the sources (e.g., “Κώστας” / “Kostas”), and a join yields the output “George” from Source 2. Captions: Join across sources by transitivity of co-reference; Find “friends of a friend”; make a co-reference.]
Do we talk about the same Thing?
The third level of integration
- Do we talk about the same thing?
  - Documents are connected if they refer to the same things (people, places, events): “co-reference”. The hypertext model is wrong.
  - Authority files cannot catch up; they simplify the procedure but do not solve the problem. The scale is incredible.
  - Curation of direct co-reference links (co-reference clusters) is needed.
  - This is not more expensive than a search engine index.
  - Duplicate detection, data cleaning and Web 2.0 methods can help generate co-reference links on a massive scale.
Recommendation: Prepare for co-reference in documentation practice
(tag names, link locally, etc.)!
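One way to picture co-reference clusters and the “join by transitivity” is a union-find structure over (source, local id) pairs. The sketch below is an assumption about how such clusters could be held in memory, not a description of any existing system; the class name, the source names and the local id `p17` are invented.

```python
class CoreferenceClusters:
    """Union-find over (source, local_id) pairs (illustrative sketch)."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        """Return the representative of the cluster that contains item."""
        self.parent.setdefault(item, item)
        while self.parent[item] != item:
            # Path halving keeps later lookups short.
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def link(self, a, b):
        """Record that a and b refer to the same real-world thing."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def same(self, a, b):
        return self.find(a) == self.find(b)


clusters = CoreferenceClusters()
clusters.link(("source1", "p17"), ("source2", "Kostas"))
clusters.link(("source2", "Kostas"), ("source3", "Κώστας"))

# Source 1 and source 3 were never linked directly, only via source 2.
print(clusters.same(("source1", "p17"), ("source3", "Κώστας")))
print(clusters.same(("source1", "p17"), ("source2", "George")))
```

Each curated link is a local statement between two identifiers; the cluster, and with it the join across sources, follows from transitivity without a central authority record.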
Understanding Events, Contexts, Stories
The Fourth Level of Integration
- So far, integration has taught us nothing beyond what I collect manually from each source.
- Co-reference allows for tracing stories, but not for querying stories.
- Understanding depends on relationships.
- Is there a global model of relationships (social, economic, material, geographic, biological relations…, thousands of documentation formats)?
  - The mesoscopic scale of human activity dominates.
  - Identification, classification, part-whole, reference, participation in meetings: these relations integrate museum and library information!
  - Confirmed by museums, e-science, historians.
Information Integration Management
Context as a network of related “meetings”
[Space-time diagram with axes “space” (Germany, Rome, Greece) and “time”. Labels: Winkelmann’s mother and Winkelmann’s birth (archive information); unknown Roman copies “Laokoon”; “LAOKOON” and “LAOKOON” (copy) (in Vatican museum); Winkelmann sees “Laokoon” (archive information); Winkelmann writes “…noble simplicity, silent grandeur…” (published, in a library); Winkelmann’s death; Inference.]
The CIDOC CRM
ISO21127
The CIDOC Conceptual Reference Model (ISO 21127:2006)
- Developed by the CRM Special Interest Group of the International Committee for Documentation (CIDOC) of the International Council of Museums (ICOM), following an initiative of ICS-FORTH, Heraklion, Crete.
- It is an extensible core ontology describing the underlying semantics of over a hundred database schemata and structures from all museum disciplines, archives and libraries. (Now extended by FRBRoo, modeling IFLA’s FRBR.)
- It is the result of 15 years of interdisciplinary work and agreement.
- In essence, it is a generic model for recording “what has happened” at human scale, i.e. a class of discourse.
- With it we can generate huge, meaningful networks of knowledge by a simple abstraction: history as meetings of people, things and information.
- A surprise: minimal or no specialization is needed to cover new domains.
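The abstraction “history as meetings of people, things and information” can be tried out on the Winkelmann example from the earlier slide. The sketch below is a deliberately naive reading, not the CRM itself: the `Meeting` class, the `met` helper and the participant lists are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Meeting:
    """An event in which persistent items are present together."""
    label: str
    participants: frozenset

# Participant lists are an illustrative reading of the slide, not source data.
meetings = [
    Meeting("unknown Roman copies Laokoon",
            frozenset({"unknown Roman", "Laokoon (original)", "Laokoon (copy)"})),
    Meeting("Winkelmann sees Laokoon",
            frozenset({"Winkelmann", "Laokoon (copy)"})),
    Meeting("Winkelmann's birth",
            frozenset({"Winkelmann", "Winkelmann's mother"})),
]

def met(item):
    """Everything that shared at least one meeting with item."""
    others = set()
    for meeting in meetings:
        if item in meeting.participants:
            others |= meeting.participants
    others.discard(item)
    return others

print(sorted(met("Winkelmann")))
print(sorted(met("Laokoon (copy)")))
```

Chaining such meetings gives a path from Winkelmann through the copy to the unknown Roman, which is the kind of context network the model is meant to make queryable.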
The CIDOC CRM
Historical Archives….
Metadata (about the document):
Type: Text
Title: Protocol of Proceedings of Crimea Conference
Title.Subtitle: II. Declaration of Liberated Europe
Date: February 11, 1945.
Creator: The Premier of the Union of Soviet Socialist Republics; The Prime Minister of the United Kingdom; The President of the United States of America
Publisher: State Department
Subject: Postwar division of Europe and Japan

Document:
“The following declaration has been approved:
The Premier of the Union of Soviet Socialist Republics,
the Prime Minister of the United Kingdom and the President
of the United States of America have consulted with each
other in the common interests of the people of their countries
and those of liberated Europe. They jointly declare their mutual
agreement to concert…
….and to ensure that Germany will never again be able to
disturb the peace of the world…… “
The CIDOC CRM
Images, non-verbose objects…
Metadata (about the photo and the persons shown):
Type: Image
Title: Allied Leaders at Yalta
Date: 1945
Publisher: United Press International (UPI)
Source: The Bettmann Archive
Copyright: Corbis
References: Churchill, Roosevelt, Stalin
The CIDOC CRM
Places and Objects
About the place:
TGN Id: 7012124
Names: Yalta (C,V), Jalta (C,V)
Types: inhabited place (C), city (C)
Position: Lat: 44 30 N, Long: 034 10 E
Hierarchy: Europe (continent) <– Ukrayina (nation) <– Krym (autonomous republic)
Note: …Site of conference between Allied powers in WW II in 1945; ….
Source: TGN, Thesaurus of Geographic Names

About the object:
Title: Yalta, Crimean Peninsula
Publisher: Kurgan-Lisnet
Source: Liaison Agency
The CIDOC CRM
Explicit Events, Object Identity, Symmetry
[Diagram: CRM graph connecting the preceding records. Nodes: E7 Activity “Crimea Conference”; three E39 Actor nodes; E53 Place 7012124; E52 Time-Span February 1945; E65 Creation Event; E31 Document “Yalta Agreement”; E52 Time-Span 11-2-1945; E38 Image. Property labels: P82 at some time within; P86 falls within; P81 ongoing throughout.]
The CIDOC CRM
Data Example (e.g. from Extraction)
Epitaphios GE34604 (entity E22 Man-Made Object)
  P30 custody transferred through, P24 changed ownership through:
    Transfer of Epitaphios GE34604 (entity E10 Transfer of Custody, E8 Acquisition Event) [Multiple instantiation!]
      P28 custody surrendered by: Metropolitan Church of the Greek Community of Ankara (entity E39 Actor)
      P23 transferred title from: Metropolitan Church of the Greek Community of Ankara (entity E39 Actor)
      P29 custody received by: Museum Benaki (entity E39 Actor)
      P22 transferred title to: Exchangeable Fund of Refugees (entity E40 Legal Body)
        P2 has type: national foundation (entity E55 Type)
      P14 carried out by: Exchangeable Fund of Refugees (entity E39 Actor)
      P4 has time-span: GE34604_transfer_time (entity E52 Time-Span)
        P82 at some time within: 1923 - 1928 (entity E61 Time Primitive)
      P7 took place at: Greece (entity E53 Place) [TGN data]
        P2 has type: nation (entity E55 Type), republic (entity E55 Type)
        P89 falls within: Europe (entity E53 Place)
          P2 has type: continent (entity E55 Type)
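The data example above can be held as plain (subject, property, object) triples. The following Python sketch assumes such a flat representation; the property labels are copied from the example, while the `"instance of"` label and the helper names `objects` and `places_of` are placeholders introduced here, not CRM identifiers.

```python
TRANSFER = "Transfer of Epitaphios GE34604"
CHURCH = "Metropolitan Church of the Greek Community of Ankara"

triples = {
    ("Epitaphios GE34604", "instance of", "E22 Man-Made Object"),
    ("Epitaphios GE34604", "P30 custody transferred through", TRANSFER),
    ("Epitaphios GE34604", "P24 changed ownership through", TRANSFER),
    # Multiple instantiation: one event is an instance of two classes.
    (TRANSFER, "instance of", "E10 Transfer of Custody"),
    (TRANSFER, "instance of", "E8 Acquisition Event"),
    (TRANSFER, "P28 custody surrendered by", CHURCH),
    (TRANSFER, "P23 transferred title from", CHURCH),
    (TRANSFER, "P29 custody received by", "Museum Benaki"),
    (TRANSFER, "P22 transferred title to", "Exchangeable Fund of Refugees"),
    (TRANSFER, "P7 took place at", "Greece"),
    ("Greece", "P89 falls within", "Europe"),
}

def objects(subject, prop):
    """All objects o such that (subject, prop, o) is stated."""
    return {o for s, p, o in triples if s == subject and p == prop}

def places_of(event):
    """Places of an event, following 'P89 falls within' transitively."""
    found = set()
    todo = list(objects(event, "P7 took place at"))
    while todo:
        place = todo.pop()
        if place not in found:
            found.add(place)
            todo.extend(objects(place, "P89 falls within"))
    return found

print(sorted(objects(TRANSFER, "instance of")))
print(sorted(places_of(TRANSFER)))
```

Because the event is a node of its own, a query for “events in Europe” finds the transfer even though the record only names Greece.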
The CIDOC CRM
Top-level Entities relevant for Integration
[Diagram of the top-level entities relevant for integration: E55 Types, E39 Actors, E28 Conceptual Objects, E18 Physical Thing, E2 Temporal Entities, E52 Time-Spans, E53 Places. Edge labels: refer to / identify; refer to / refine; participate in; affect or / refer to; location; at.]
The CIDOC CRM
Example: The Temporal Entity Hierarchy
The CIDOC CRM
A Classification of its Relationships
- Identification of real world items by real world names.
- Classification of real world items.
- Part-decomposition and structural properties of Conceptual & Physical Objects, Periods, Actors, Places and Times.
- Participation of persistent items in temporal entities.
  - This creates a notion of history: “world-lines” meeting in space-time.
- Location of periods in space-time and physical objects in space.
- Influence of objects on activities and products and vice-versa.
- Reference of information objects to any real-world item.
The CIDOC CRM
What is an ontology?
- Ontologies are formalized knowledge: clearly defined concepts and relationships about possible real states of affairs of a domain. “Semantics” is the world they refer to (“ontological commitment”), not a set of logical rules! (E.g., what is an event?)
- Ontologies describe a reality, independent of context and performance! Information models are not ontologies! They abbreviate, denormalize, select. E.g.: “DC.creator”, “DC.Date”, “birthday/birthplace”, “destination” in the MIDAS schema (UK monuments records).
- Ontologies can be understood by people and processed by machines to enable data exchange, data integration and query mediation:
  - Local information systems may export information in a CRM-compatible form (CRM Core or more).
  - Local information systems may answer queries by a subset of CRM concepts.
  - Exported information may be merged in another database (“data warehouse”). Complementary information can thus be easily integrated.
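A minimal sketch of the warehouse idea, assuming that each source exports triples and that the sources already share identifiers for the same item. The three exports are invented from the Yalta records shown earlier; the link from the photo to the conference and the `"name"` label are illustrative assumptions, and `reachable` is a hypothetical helper.

```python
from collections import deque

# Hypothetical exports from three local systems that share identifiers.
archive_export = {("Crimea Conference", "P7 took place at", "TGN 7012124")}
image_export = {("Allied Leaders at Yalta", "P62 depicts", "Crimea Conference")}
gazetteer_export = {("TGN 7012124", "name", "Yalta")}

# Merging is a set union once the exports use the same model.
warehouse = archive_export | image_export | gazetteer_export

def reachable(triples, start):
    """Nodes connected to start, ignoring edge direction."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for s, _, o in triples:
            for a, b in ((s, o), (o, s)):
                if a == node and b not in seen:
                    seen.add(b)
                    queue.append(b)
    seen.discard(start)
    return seen

print(sorted(reachable(image_export, "Allied Leaders at Yalta")))
print(sorted(reachable(warehouse, "Allied Leaders at Yalta")))
```

On its own the image export leads only to the conference; in the merged set the same photo is connected to the place record and its name, which none of the sources states by itself.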
Interoperability of Museum Information
towards a network of knowledge
There cannot be one database schema for all ALM information. A global core
ontology is a high-level explanation, not a format, allowing for automated
correlation, mediation, transformation, generation of integrated views.
A particular installation should have a core schema that is compatible with the core
ontology, following an informed decision about its integration and access
capabilities: for instance CRM Core, MuseumDat, or a similar CRM-compatible
schema. DC and CRM Core can be combined.
With CRM, we know at any time what extension to more functionality means,
e.g., FRBRoo / FRBRCore. (The DC extension simply failed!)
CRM Core (or MuseumDat): a low-cost entry to CRM compatibility.
- As easy as Dublin Core, but appropriate for relating archives, libraries and museums (ALM).
- Start with finding aids.
- Add co-reference: manual, automated, Web 2.0.
- Add NLP to recover more events.
- Add more sophisticated relationships.
Interoperability of Museum Information
CRM Core metadata elements
Interoperability of Museum Information
Integration with CRM Core (Network View)
[Network diagram of the CRM Core records for Auguste Rodin and the “Monument to Balzac”. Nodes: E21 Person Auguste Rodin (P2 has type: E55 Type sculptors); E67 Birth “Rodin’s birth” (E52 Time-Span 1840); E69 Death “Rodin’s death” (E52 Time-Span 1917); E84 Information Carrier The “Monument to Balzac” (plaster) (P2 has type: plaster); E12 Production “Rodin making ‘Monument to Balzac’ in 1898” (E52 Time-Span 1898, E53 Place France (nation)); E84 Information Carrier The “Monument to Balzac” (S1296) (P2 has type: bronze); E12 Production “Bronze casting ‘Monument to Balzac’ in 1925” (E52 Time-Span 1925); E40 Legal Body Rudier (Vve Alexis) et Fils (P2 has type: companies); E21 Person Honoré de Balzac. Property labels: P62 depicts; P108B was produced by; P16B was used for; P14 carried out by; P4 has time-span; P7 took place at; P98B was born; P100B died in; P120B occurs after; P134 continued.]
Metadata View
Artist (CRM Core)
  Category = E21 Person
  Classification = artists
  Classification = sculptors
  Identification = Rodin, Auguste
  Identification = ID: 500016619
  Event
    Role in Event = P98B was born
    Identification = Rodin's birth
    Event Type = E67 Birth
    Date = 1840
  Event
    Role in Event = P100B died in
    Identification = Rodin's death
    Event Type = E69 Death
    Date = 1917
    Related event
      Role in Event = P120 occurs before
      Identification = Bronze casting Monument to Balzac in 1925

Work (CRM Core)
  Category = E84 Information Carrier
  Classification = sculpture (visual work)
  Classification = plaster
  Identification = The Monument to Balzac (plaster)
  Description = Commissioned to honor one of France's greatest novelists, Rodin spent seven years preparing for Monument to Balzac. When the plaster original was exhibited in Paris in 1898, it was widely attacked. Rodin retired the plaster model to his home in the Paris suburbs. It was not cast in bronze until years after his death.
  Event
    Role in Event = P108B was produced by
    Identification = Rodin making Monument to Balzac in 1898
    Event Type = E12 Production
    Participant
      Identification = Rodin, Auguste
      Identification = ID: 500016619
      Participant Type = artists
      Participant Type = sculptors
    Date = 1898
    Place = France (nation)
    Related event
      Role in Event = P134B was continued by
      Identification = Bronze casting Monument to Balzac in 1925
  Event
    Role in Event = P16B was used for
    Identification = Bronze casting Monument to Balzac in 1925
    Event Type = E12 Production
    Participant
      Identification = Rudier (Vve Alexis) et Fils
      Participant Type = companies
    Thing Present
      Identification = The Monument to Balzac (S.1296)
      Thing Present Type = bronze
      Thing Present Type = sculpture (visual work)
    Date = 1925
    Related event
      Role in Event = P120B occurs after
      Identification = Rodin's death
  Relation
    To = Honore de Balzac
    Relation type = refers to
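The step from the metadata view to the network view can be sketched as a small conversion from a nested record to triples. The dictionary layout (including the `"Events"` key), the `"instance of"` label and the function `to_triples` are assumptions for illustration; the field names and values are taken from the artist record above.

```python
# A cut-down version of the artist record; the dict layout is assumed.
artist = {
    "Identification": "Rodin, Auguste",
    "Category": "E21 Person",
    "Events": [
        {"Role in Event": "P98B was born", "Identification": "Rodin's birth",
         "Event Type": "E67 Birth", "Date": "1840"},
        {"Role in Event": "P100B died in", "Identification": "Rodin's death",
         "Event Type": "E69 Death", "Date": "1917"},
    ],
}

def to_triples(record):
    """Flatten one record into (subject, property, object) triples."""
    subject = record["Identification"]
    triples = [(subject, "instance of", record["Category"])]
    for event in record.get("Events", []):
        name = event["Identification"]
        # The role links the item to the event; the event is its own node.
        triples.append((subject, event["Role in Event"], name))
        triples.append((name, "instance of", event["Event Type"]))
        if "Date" in event:
            triples.append((name, "P4 has time-span", event["Date"]))
    return triples

for triple in to_triples(artist):
    print(triple)
```

Since “Rodin's death” becomes a node with its own identifier, the work record's related event “P120B occurs after Rodin's death” attaches to the same node once both records are converted.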
The CIDOC CRM
Why an Integration layer on Top?
- Information acquisition needs:
  - sequence and order, completeness, case-specific language and constraints to guide and control data entry;
  - ergonomic documentation units, optimized to specialist needs;
  - item-centric workflow on series of analogous items;
  - low interoperability (only the capability to be mapped!).
- Integration / comprehension needs epistemic networks:
  - break up document boundaries, relate facts to wider context;
  - match shared identifiers of items, aggregate alternatives;
  - no preferred direction of search, no cardinality constraints;
  - high interoperability (mapping to a global schema).
- Interpretation, story-telling, hypothesis building:
  - explore context, paths, analogies (orthogonal to data acquisition);
  - present in order, resolve alternatives (enforce constraints);
  - deduction and induction.
Epistemic Networks on DLs
Metadata at sources and indirect co-reference links
• Easy update
• Scalable, peer-to-peer
• Slow querying
• Concatenation of facts
• Alternatives management
[Diagram: sources carry extracted, normalized metadata; indirect co-reference links connect them to surrogate nodes (Donald Johanson, Johanson's Expedition, Lucy, Hadar, Ethiopia?), which are organized by a core ontology (e.g., CIDOC CRM).]
Interoperability of Museum Information
Conclusions
- Historical information is factual and contextual. Metadata formats for cultural heritage data must be adequate to the scientific discourse.
- We need small thesauri for museums. It is better to invest in gazetteers (place names) and authority files.
- CRM Core already captures a first sensible museum-archive-library connection: an immense benefit over Dublin Core, with similar effort.
- The co-reference problem is widely ignored (or even feared?). Its scale is extraordinary. Traditional KOS and data cleaning are not enough. We need Web 2.0 methods.
- The capacity to link and transform information is crucial for integrating information in the long term, beyond platforms. The CRM shows how to do that. Understand the historical perspective of information.