Transcript Slide 1

The Dream of a
Global Network of Knowledge
Martin Doerr
Center for Cultural Informatics
Institute of Computer Science
Foundation for Research and Technology - Hellas
Amsterdam, Netherlands
November 17, 2011
1
A Global Network
Introduction
Digital Libraries take on different forms and roles.
Initially collection management systems; in addition, data services:
• literature collections
• digitized resources
• resource libraries (Perseus etc.), on-line corpora
• scientific data collections
• research systems (e.g., GIS-integrated data)
“Metadata” Aggregation Services: a new paradigm using semantic networks
• integrate diverse forms of information assets, and pointers to them, in support of research and the interested public
• New grand challenges
Library access paradigm still dominates!
ICS-FORTH November 17, 2011
2
Library, Archive, Museum Information

The typical library contents: “The whole stories”, access largely solved!
• Primary literature: Fiction.
• Categorical: theories and hypotheses
• Secondary literature (research results)
• Facts brought into causal context
The typical museum information: “Museum objects rarely talk”
• Factual documentation of properties and context per object, references, classification
• Highly heterogeneous
• About things taken out of their original context, distributed over the world
3
Library, Archive, Museum Information

The typical archive contents: “The needle in the haystack”
• Primary sources, “bits and pieces” (letters, legal documents, administrative acts, images, scientific records).
• Factual, kept in the contextual sequence of creation, as arranged by the creator or the party responsible.
• Kept due to a mandate related to functions.
Similarly, library content itself: “What is in the book?”
• Parts of book content (citations!) as primary source of investigation
• Access: not much more than keyword search, if a digital form exists…
4
Epistemology of Integration
[Diagram: how the holdings of memory institutions relate to one another. Node labels: Libraries, Museums, Archives, SMRs, Books, primary Documents, Objects, Sites. Edge labels: exhibit; publish; document features & context; provide finding aids; illustrate, exemplify; using; are about; refer to; contain narratives; made from; manage.]
5
Traditional Information Access
The traditional library task:
• Collect and preserve documents and provide finding aids.
• The job is done when the (one, best) document is handed out: “All you want is in this document”.
The digital analogue: implementing “finding aids”:
• Assumption: the user knows a topic, characterized by a noun, or knows associations of a thing known to exist. Associations may be known properties that are not directly correlated with the problem to be solved (e.g. “organic farming” for “host-parasite studies”).
• Semantic interoperability is limited to the aggregation task: metadata are mainly homogeneous (DC, VRA, etc.); the only challenge discussed is the matching of terminologies (KOS).
…still THE dominant global information integration paradigm
6
Problems

No support for learning from the aggregated sources or for retrieving by context:
• e.g., Who was the employer of Donald Johanson when he found Lucy?
• e.g., Which plant species are documented for the Black Sea coast for 6000 BC? (Critical climate hypothesis connected to detecting the Black Sea flood in 5600 BC)
• e.g., What resolution did Galileo’s telescope have when he observed...
But understanding lives on relationships. Cultural information has complex relationships. Relationships may be categorical or factual:
• Categorical (e.g., “smoking causes cancer”): richly exploited by Semantic Web technology. Use and integration are limited to research results; not useful for primary research itself.
• Factual associations concatenate information assets into meaningful (“epistemic”) networks (“stories”): they support context-based hypothesis building, cross-disciplinary search, etc. (e.g. “John smoked at 20, …30, …40”; “John had lung cancer at 60”).
Knowledge of factual associations is the “food” of scholarly research
7
What Can IT Do Now?

Access to categorical knowledge is well solved, if hypotheses have names: subject search, keyword search.
• content management systems & search engines
Increasing amount of structured categorical knowledge built in the form of thesauri and ontologies (life sciences!)
• access by terms and by browsing broader/narrower terms
• access by categorical relationships more rarely touched
Access to facts is idiosyncratic to diverse systems and limited to:
• structured data services, with no general access paradigm
• KOS (author lists, gazetteers)
• “surfing and browsing” on the Internet or in Digital Libraries
8
What Can IT Do Now?

New promises: Semantic Networks, Semantic Web
• RDF triple stores
• Open World Systems: billions of facts under any number of schemata in one database
• Linked Open Data (LoD): thousands of triple stores to be accessed
Shift to metadata rich in facts
• from Archives, Libraries, Museums, Digital Libraries
• from research databases -> the difference between data and metadata blurs
A global network of knowledge?... or a perfect intellectual chaos…?
9
Semantic Networks
[Diagram: a semantic network drawn over a time axis and a space axis (Greece, Rome, Germany). Labels shown: “LAOKOON”; unknown Roman copies “Laokoon” (published inference, in a library?); “LAOKOON” (copy, in Vatican museum); Winckelmann’s mother (archive information?); Winckelmann’s birth; Winckelmann sees “Laokoon” (archive information?); Winckelmann writes “…noble simplicity, silent grandeur…” (1755, in a library); Winckelmann’s death.]
10

3 Grand Challenges
We need a rich, integrating global schema: a core and extensions of any depth
• Con: impossible; everybody has their own conceptualization
• Pro: the CIDOC-FRBR work empirically proves the opposite

“Knitting” the network: without co-reference resolution, facts/triples do not connect
• Con: impossible; automatic means are limited, human labor is not scalable
• Pro or Con?: LoD
• Pro: human labor scales if massively organized

End-users need to query large triple stores effectively
• Con: impossible to write ad hoc rich SPARQL statements, impossible to memorize hundreds of properties
• Pro: use another, simple global schema for querying
11
A Global Schema: The CIDOC CRM
• Developed by the CRM Special Interest Group of the International Committee for Documentation (CIDOC) of the International Council of Museums (ICOM).
• An extensible core ontology of 86 classes and 137 properties describing the underlying semantics of over a hundred database schemata and structures from all museum disciplines, archives and libraries.
• Extended by FRBRoO, modelling IFLA’s FRBR, and soon FRSAD and FRAD (RDFS integration with DC, Europeana EDM and ORE exists).
• It is the result of 15 years of interdisciplinary work and agreement.
• In essence, it is a generic model of recording “what has happened” at human scale, i.e. a class of discourse.
• With it we can generate huge, meaningful networks of knowledge by a simple abstraction: history as meetings of people, things and information.
• An interlingua to transform, transport and merge information from most data structures with clear meaning.
12
Explicit Events, Object Identity, Symmetry
[Diagram: CRM instance graph. Node labels: E7 Activity “Crimea Conference”; E65 Creation Event; E31 Document “Yalta Agreement”; E38 Image; E39 Actor (three nodes); E53 Place 7012124; E52 Time-Span “February 1945”; E52 Time-Span “1945-02-11”. Property labels: P82 at some time within; P81 ongoing throughout; P86 falls within.]
13
Data example (RDF-like form)
Epitaphios GE34604 (entity E22 Man-Made Object)
  P30 custody transferred through, P24 changed ownership through:
    Transfer of Epitaphios GE34604 (entity E10 Transfer of Custody, E8 Acquisition Event) [multiple instantiation]
      P28 custody surrendered by: Metropolitan Church of the Greek Community of Ankara (entity E39 Actor)
      P23 transferred title from: Metropolitan Church of the Greek Community of Ankara (entity E39 Actor)
      P29 custody received by: Museum Benaki (entity E39 Actor)
      P22 transferred title to: Exchangeable Fund of Refugees (entity E40 Legal Body)
        P2 has type: national foundation (entity E55 Type)
      P14 carried out by: Exchangeable Fund of Refugees (entity E39 Actor)
      P4 has time-span: GE34604_transfer_time (entity E52 Time-Span)
        P82 at some time within: 1923 – 1928 (entity E61 Time Primitive)
      P7 took place at: Greece (entity E53 Place) [TGN data]
        P2 has type: nation (entity E55 Type), republic (entity E55 Type)
        P89 falls within: Europe (entity E53 Place)
          P2 has type: continent (entity E55 Type)
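The data example above can also be read as a flat list of subject, property, object statements. The following is a minimal sketch of that reading, assuming a plain Python list of tuples as the store; the names `TRIPLES`, `objects` and `follow` are illustrative and are not part of the CIDOC CRM or of any RDF library.

```python
# Statements transcribed from the slide's data example.
# The tuple layout and the helper names are assumptions made for illustration.
TRANSFER = "Transfer of Epitaphios GE34604"
CHURCH = "Metropolitan Church of the Greek Community of Ankara"

TRIPLES = [
    ("Epitaphios GE34604", "P30 custody transferred through", TRANSFER),
    ("Epitaphios GE34604", "P24 changed ownership through", TRANSFER),
    (TRANSFER, "P28 custody surrendered by", CHURCH),
    (TRANSFER, "P23 transferred title from", CHURCH),
    (TRANSFER, "P29 custody received by", "Museum Benaki"),
    (TRANSFER, "P22 transferred title to", "Exchangeable Fund of Refugees"),
    (TRANSFER, "P14 carried out by", "Exchangeable Fund of Refugees"),
    (TRANSFER, "P4 has time-span", "GE34604_transfer_time"),
    ("GE34604_transfer_time", "P82 at some time within", "1923 – 1928"),
    (TRANSFER, "P7 took place at", "Greece"),
    ("Greece", "P89 falls within", "Europe"),
]


def objects(subject, prop):
    """Return every object recorded for the given subject and property."""
    return [o for s, p, o in TRIPLES if s == subject and p == prop]


def follow(start, *props):
    """Walk a chain of properties from a start node and return the end nodes."""
    frontier = [start]
    for prop in props:
        frontier = [o for node in frontier for o in objects(node, prop)]
    return frontier


# Who received custody of the object?
print(follow("Epitaphios GE34604",
             "P30 custody transferred through",
             "P29 custody received by"))
```

The point of the sketch is that the object is connected to actors, dates and places only through the explicit transfer event, so every question about it becomes a path through that event.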
14
CRM Top-level classes useful for integration
[Diagram: CRM top-level classes and the relationships between them. Classes: E55 Types, E39 Actors, E28 Conceptual Objects, E18 Physical Thing, E2 Temporal Entities, E52 Time-Spans, E53 Places. Relationship labels: refer to / refine; participate in; affect or / refer to; location; at.]
15
The CIDOC CRM
The types of relationships

• Identification of real-world items by real-world names
• Observation and classification of real-world items
• Part-decomposition and structural properties of Conceptual & Physical Objects, Periods, Actors, Places and Times
• Participation of persistent items in temporal entities
  (creates a notion of history: “world-lines” meeting in space-time)
• Location of periods in space-time and of physical objects in space
• Influence of objects on activities and products, and vice versa
• Reference of information objects to any real-world item
16
The Hierarchy of Participation Properties
[Diagram: tree of CRM participation properties; arrows point in the direction of generalization, toward P12 occurred in the presence of (was present at). Properties shown:]
P12 occurred in the presence of (was present at)
P11 had participant (participated in)
P14 carried out by (performed)
P22 transferred title to (acquired title through)
P23 transferred title from (surrendered title through)
P28 custody surrendered by (surrendered custody through)
P29 custody received by (received custody through)
P96 by mother (gave birth)
P143 joined (was joined by)
P144 joined with (gained member by)
P145 separated (left by)
P146 separated from (lost member by)
P99 dissolved (was dissolved by)
P16 used specific object (was used for)
P33 used specific technique (was used by)
P142 used constituent (was used in)
P25 moved (moved by)
P13 destroyed (was destroyed by)
P93 took out of existence (was taken out of existence by)
P124 transformed (was transformed by)
P100 was death of (died in)
P31 has modified (was modified by)
P110 augmented (was augmented by)
P112 diminished (was diminished by)
P108 has produced (was produced by)
P92 brought into existence (was brought into existence by)
P123 resulted in (resulted from)
P95 has formed (was formed by)
P98 brought into life (was born)
P94 has created (was created by)
P135 created type (was created by)
17
Schema Integration by Property Generalization
[Diagram: data from source schemata (Dublin Core, CDWA, MIDAS) reach the CIDOC Conceptual Reference Model (CRM) by automatic data export. All data can be accessed from any level by CRM property generalization: few concepts give high recall (Thing, Actor, Event; “was present at”, “happened at”), special concepts give high precision (Acquisition; “used object”).]
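Property generalization can be sketched as a lookup over a subproperty table: a fact recorded with a specific property also answers a query posed with any of its generalizations. This is a minimal sketch under stated assumptions: the edges in `SUBPROPERTY_OF` are one reading of a small part of the CRM participation hierarchy and should be checked against the CRM definition, and the names `generalizations` and `query` are illustrative.

```python
# Assumed excerpt of the CRM property hierarchy (specific -> direct generalizations).
# Verify the edges against the CRM definition before relying on them.
SUBPROPERTY_OF = {
    "P22 transferred title to": ["P14 carried out by"],
    "P29 custody received by": ["P14 carried out by"],
    "P14 carried out by": ["P11 had participant"],
    "P11 had participant": ["P12 occurred in the presence of"],
    "P108 has produced": ["P31 has modified", "P92 brought into existence"],
    "P31 has modified": ["P12 occurred in the presence of"],
    "P92 brought into existence": ["P12 occurred in the presence of"],
}

# Facts recorded at the most specific level (taken from the data example).
FACTS = [
    ("Transfer of Epitaphios GE34604", "P29 custody received by", "Museum Benaki"),
    ("Transfer of Epitaphios GE34604", "P22 transferred title to",
     "Exchangeable Fund of Refugees"),
    ("Transfer of Epitaphios GE34604", "P7 took place at", "Greece"),
]


def generalizations(prop):
    """Return prop together with all of its direct and indirect generalizations."""
    seen = set()
    stack = [prop]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBPROPERTY_OF.get(current, []))
    return seen


def query(facts, prop):
    """Return the facts whose property is prop or a specialization of prop."""
    return [fact for fact in facts if prop in generalizations(fact[1])]


# A coarse question ("who or what was present?") finds both specific facts.
for fact in query(FACTS, "P12 occurred in the presence of"):
    print(fact)
```

Querying with the general property trades precision for recall, while querying with the specific property does the reverse, which is the contrast the diagram draws.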
18
Knitting the Network: Extracted Relations & Co-reference
[Diagram: documents, data and metadata feed fact extraction; each primary link is extracted from one document. Fact integration then links documents via co-reference, not hyperlinks! Example nodes: Discovery of Lucy, Johanson's Expedition, Donald Johanson, Cleveland Museum of Natural History, AL 288-1, Lucy, Hadar, Ethiopia. Deductions rest on the CRM as a global classification of relationships over Thing, Actor, Event, Place and TimeSpan.]
19
Co-reference Knowledge and Reality
[Diagram: three levels: the symbolic level (“vocabulary”), interpretation (“speakers”) and the real world (“objects”), illustrated by two occurrences of “M.Smith, born 2-5-65”.]
20
Theory of Co-reference

• A group of “speakers” (a database) shares unique identifiers for a set of things. Another group “matches” their identifiers to mean the “same as”.
• The transitive closure of “same as” and “not same as” exhibits “impossible worlds”, the only indication of false knowledge at the data level.
• Ultimate knowledge is what the author meant by “her/him/it”: a part of speech, a database key, an occurrence of a name or URI.
• Co-reference is primary knowledge, true research, not a “cleaning” issue.
  • Co-reference is more fundamental than schema integration: it supports integration without a schema. Schema integration can be seen as a co-reference problem.
  • Co-reference is more fundamental than Reference KOS: no description elements are needed. Reference KOS can help co-reference. Co-reference can be distributed!
  • Automatic “duplicate detection” is based on / improved by co-reference.
  • “Negotiation with the speakers” is the ultimate confirmation = scholarly research.
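The role of the transitive closure can be sketched with a union-find structure: “same as” claims merge clusters, and a “not same as” claim whose two ends land in one cluster marks an “impossible world”. This is a minimal sketch, not the speaker's implementation; the class name `CoreferenceIndex`, its methods and the identifier strings are hypothetical.

```python
class CoreferenceIndex:
    """Union-find over reference occurrences, plus recorded 'not same as' claims."""

    def __init__(self):
        self.parent = {}    # reference -> parent reference in its cluster tree
        self.distinct = []  # explicit 'not same as' claims

    def find(self, ref):
        """Return the representative of the cluster that ref belongs to."""
        self.parent.setdefault(ref, ref)
        root = ref
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[ref] != root:
            self.parent[ref], ref = root, self.parent[ref]
        return root

    def same_as(self, a, b):
        """Record that a and b refer to the same thing (merges their clusters)."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_a] = root_b

    def not_same_as(self, a, b):
        """Record that a and b refer to different things."""
        self.find(a)
        self.find(b)
        self.distinct.append((a, b))

    def contradictions(self):
        """Return 'not same as' claims that the 'same as' closure contradicts."""
        return [(a, b) for a, b in self.distinct if self.find(a) == self.find(b)]


# Hypothetical identifiers: two sources both match the same authority entry,
# while a curator has stated that the two local references differ.
index = CoreferenceIndex()
index.same_as("source1:M. Doerr", "authority:Doerr, Martin")
index.same_as("source2:M. Dörr", "authority:Doerr, Martin")
index.not_same_as("source1:M. Doerr", "source2:M. Dörr")
print(index.contradictions())
```

A contradiction found this way says only that at least one of the claims involved is false; deciding which one is the scholarly part.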
21
Co-reference Problem
Query “Friends of a Friend”
[Diagram: 1. Query Source 1 (content: has friend) with input “Martin”; read the output, find “Kostas”, guess “Κώστας”. 2. Query Source 2 (content: has friend) with input “Κώστας”; output: “George”.]
22
Co-reference via Authority
Join across sources by transitivity of co-reference
[Diagram: a friend-of-a-friend query with input “Martin” is answered across Source 1 and Source 2. An authority service matches the local ids of each source (first match, second match); the matches for “Κώστας” / “Kostas” are held in a link table of ids, and the resulting link yields the output “George”.]
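The join through an authority can be sketched as two lookups: from a local id to the authority id, and from the authority id back to the local ids of the other source. This is a minimal sketch; the dictionaries, the id `auth:42` and the function names are hypothetical stand-ins for the two sources and the link table in the diagram.

```python
# "has friend" content of each source, keyed by that source's local names.
SOURCE_1 = {"Martin": ["Kostas"]}
SOURCE_2 = {"Κώστας": ["George"]}
SOURCES = {"source1": SOURCE_1, "source2": SOURCE_2}

# Link table: (source, local id) -> authority id. "auth:42" is a made-up id.
LINK_TABLE = {
    ("source1", "Kostas"): "auth:42",
    ("source2", "Κώστας"): "auth:42",
}


def coreferent_ids(source, local_id, target_source):
    """Return the local ids in target_source matched to the same authority id."""
    authority_id = LINK_TABLE.get((source, local_id))
    if authority_id is None:
        return []
    return [
        other_id
        for (other_source, other_id), other_auth in LINK_TABLE.items()
        if other_source == target_source and other_auth == authority_id
    ]


def friends_of_friends(person, first="source1", second="source2"):
    """Join 'has friend' across two sources via the co-reference link table."""
    result = []
    for friend in SOURCES[first].get(person, []):
        for same_friend in coreferent_ids(first, friend, second):
            result.extend(SOURCES[second].get(same_friend, []))
    return result


print(friends_of_friends("Martin"))
```

Without the two entries in `LINK_TABLE` the join returns nothing, which is the sense in which facts do not connect without co-reference resolution.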
23
Curating Co-reference without Authority
Join across sources by transitivity of co-reference
[Diagram: the same friend-of-a-friend query (input “Martin”, output “George”) without an authority service: each source makes a co-reference between its own local ids and those of the other source, matching “Κώστας” / “Kostas”.]
24
Managing Co-reference Clusters
[Diagram: co-reference clusters of reference occurrences such as “M. Doerr” and “M. Dörr”. Legend: explicit initial “same as” links (n-1 per cluster); explicit redundant “same as” links; implicit links (n(n-1)/2 per cluster). A new link connects two clusters: what happens?]
Authority files are good “attractors” of co-reference links, but do not solve co-reference!
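The counts in the legend can be worked out directly. As a short worked derivation (the symbols m and k for the sizes of the two clusters being joined are introduced here for illustration):

```latex
% A cluster of n reference occurrences:
\text{explicit links needed} = n - 1, \qquad
\text{implied pairs} = \binom{n}{2} = \frac{n(n-1)}{2}

% One new explicit link joining clusters of sizes m and k adds
\binom{m+k}{2} - \binom{m}{2} - \binom{k}{2} = mk \quad \text{implied pairs.}

% Example: m = 4,\ k = 3 \;\Rightarrow\; 21 - 6 - 3 = 12 = 4 \cdot 3
```

A single link between two large clusters therefore asserts many identities at once, which is why one false link can propagate widely.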
25
A New Service: Global Co-reference Indices
• Co-reference links should be persistent and public. Primary co-reference links should be curated and preserved in local databases: “co-reference indices”.
• Use NER and duplicate-detection algorithms to prepopulate co-reference indices. Use appropriate belief values for generated data.
• Automated, global, distributed consistency control services are feasible.
• Co-reference indices are much larger than ontologies, but not larger than search engines.
• Mobilize general users and domain experts to enhance and verify co-reference information by social tagging, to scale up human labor and precision.
• Install global supervision by open consortia that set the rules and run central services.
• Then the network may converge to consistent global knowledge.
Linked Open Data has no co-reference concept so far. It will lead to a proliferation of URIs.
26
Last Problem: How to query 250 properties?
• Humans think consciously in “compressed relations” (G. Fauconnier, “The Way We Think”), in particular omitting events: “What do we have from New Guinea?”
• There are a few “Fundamental Categories” that partition our concepts (Ranganathan; “Who, When, Where, What...”) and disambiguate most words, e.g., a “museum” is a “who”, a “where” or a “what”.
• If we implement a simple semantic network with few compressed relationships, we cannot integrate knowledge, because the intermediates are missing, and we cannot manage the immense number of redundant relations.
• If we implement a CIDOC CRM network, end-users cannot write queries.
Solution:
• Define a new “data model” of “Fundamental Categories” and “Fundamental Relationships” for querying only!
• Implemented as automated deductions from a CRM-based network.
27
How to query with 250 properties?
• Fundamental Categories:
  Thing, Actor, Time, Place, Event (E2), Type
• Fundamental Relationships:
  • has type / is type of
  • is similar to or same with
  • is part of (is member of) / has part (has member)
  • has met
  • from (has founder or has parent) / is origin, founder, parent, provider or creator of
  • had (= owns, keeps) / were owned/kept by
  • at
  • refers to or is about / is referred by / is referred to at
• Relationships change interpretation depending on the category of domain and range.
28
Thing is about Thing Path Expression

Following this schema, we have implemented over a hundred deductions such as:

Thing -> P130F.shows_features_of (0,n) OR P130B.features_are_also_found_on (0,n) ->
{
  E24.Physical_Man-Made_Thing -> P62F.depicts -> Thing
  OR
  E24.Physical_Man-Made_Thing -> P128F.carries (0,n) -> E73.Information_Object -> P67F.refers_to -> Thing
  OR
  D1.Digital_Object -> { L11B.was_output_of -> D3.Formal_Derivation -> L10F.had_input -> D1.Digital_Object -> } (0,n)
    L11B.was_output_of -> { D7.Digital_Machine_Event -> P9B.forms_part_of (0,n) -> } (0,1)
    D2.Digitization_Process -> L1F.digitized -> E18.Physical_Thing
}
It works!!!
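How such a deduction could be evaluated is sketched below for the first two alternatives of the path expression only (depiction, and carried information that refers to something); the digital-object branch and all class checks are left out. This is a sketch under stated assumptions, not the implementation mentioned on the slide: the function names and the sample triples are hypothetical.

```python
from collections import defaultdict

SIMILAR = "P130 shows features of"


def build_index(triples):
    """Index triples by (subject, property); also store each inverse direction."""
    index = defaultdict(set)
    for s, p, o in triples:
        index[(s, p)].add(o)
        index[(o, p + " (inverse)")].add(s)
    return index


def closure(index, start, props):
    """Nodes reachable from start by zero or more steps along any of props."""
    seen = set(start)
    stack = list(start)
    while stack:
        node = stack.pop()
        for prop in props:
            for nxt in index.get((node, prop), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return seen


def step(index, nodes, prop):
    """Nodes reachable from nodes by exactly one step along prop."""
    return {o for n in nodes for o in index.get((n, prop), ())}


def is_about(triples, thing):
    """Compressed 'Thing is about Thing' relation, deduced from long paths."""
    index = build_index(triples)
    # P130 in either direction, zero or more times.
    similar = closure(index, {thing}, [SIMILAR, SIMILAR + " (inverse)"])
    depicted = step(index, similar, "P62 depicts")
    carried = closure(index, similar, ["P128 carries"])
    referred = step(index, carried, "P67 refers to")
    return depicted | referred


# Hypothetical sample data, for illustration only.
SAMPLE = [
    ("Postcard", "P130 shows features of", "Photo print"),
    ("Photo print", "P62 depicts", "Laokoon group"),
    ("Book copy", "P128 carries", "Essay text"),
    ("Essay text", "P67 refers to", "Laokoon group"),
]

print(is_about(SAMPLE, "Postcard"))
print(is_about(SAMPLE, "Book copy"))
```

The end-user asks a single compressed question, while the intermediate nodes that make integration possible stay in the underlying network.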
29
Conclusions
After 50 years of “Artificial Intelligence” research and 15 years of “Semantic Web”, the Global Network of Knowledge is still a dream.
Today, we have the chance to lay foundations for global knowledge network(s!) with
• a limited consistency,
• with a tendency to converge to something more consistent,
• a limited common language,
• a limited way to globally explore deep relationships.
For that, we have to
• Overcome intellectual barriers in conceptual modelling (“quick & dirty”, W3C “beliefs”, ignoring empirical scientific methods, political thinking, domain blindness)
• Organize domain communities to collectively curate data and co-reference through new reward methods
• Invest in technology and methodology for a long data life-cycle by mapping and transforming data “for ever”, as we have done since antiquity…
30