Developing a Metadata Infrastructure for Information Access: What, Where, When and Who?


Prof. Ray R. Larson
University of California, Berkeley
School of Information
Overview
• Metadata as Infrastructure
– What, Where, When and Who?
• What are Entry Vocabulary Indexes?
– Notion of an EVI
– How are EVIs Built
• Time Period Directories
– Mining Metadata for new metadata
• 4W Demo
• New Project: Bringing Lives to Light
Metadata as Infrastructure
• The difference between memorization and understanding lies in knowing the context and relationships of whatever is of interest. When setting out to learn about a new topic, a well-tested practice is to follow the traditional “5Ws and the H”: Who?, What?, When?, Where?, Why?, and How?
Metadata as Infrastructure
• The reference collections of paper-based libraries provide a structured environment for resources, with encyclopedias and subject catalogs, gazetteers, chronologies, and biographical dictionaries, offering direct support for at least What, Where, When, and Who.
• The digital environment does not yet provide an effective, and easily exploited, infrastructure comparable to the traditional reference library.
What?
Searching texts by topic, e.g. Dewey, LCSH, any subject
index, or category scheme applied to documents.
Two kinds of mapping in every search:
• Documents are assigned to topic categories, e.g. Dewey
• Queries have to map to topic categories, e.g. Dewey’s
Relativ Index from ordinary words/phrases to Decimal
Classification numbers.
Also mapping between topic systems, e.g. US Patent
classification and International Patent Classification.
‘What’ searches involve mapping
to controlled vocabularies
Texts
Thesaurus/
Ontology
Building a Search Term Recommender
Start with a collection of documents. Classify and index with a controlled vocabulary, or use a pre-indexed collection.
Problem: Controlled vocabularies can be difficult for people to use.
– For “Wirtschaftspolitik”, in Library of Congress subject headings use “Economic Policy”.
– “pass mtr veh spark ign eng” = “Automobile”
Solution: Entry Level Vocabulary Indexes (EVI).
“What” and Entry Vocabulary Indexes
• EVIs are a means of mapping from the user’s vocabulary to the controlled vocabulary of a collection of documents…
Building and Searching EVIs
User has a question but is unfamiliar with the domain he wants to search.
User selects a subject domain of interest. Domains to select from: Engineering, Medicine, Biology, Social science, etc.
Has an Entry Vocabulary Module been built?
– YES: Use an existing EVI.
– NO: Build an Entry Vocabulary Module (EVI).
Building an Entry Vocabulary Module (EVI)
– Download a set of training data from an Internet DB indexed with a controlled vocabulary.
– Extract terms (words and noun phrases) from titles and abstracts, using part-of-speech tagging for noun phrases.
– Build associations between extracted terms & controlled vocabularies.
Searching
– Map user’s query to a ranked list of controlled vocabulary terms.
– User selects search terms from the ranked list of terms returned by the EVI.
Technical Details
Building an Entry Vocabulary Module (EVI)
– Download a set of training data from an Internet DB indexed with a controlled vocabulary.
– Extract terms (words and noun phrases) from titles and abstracts, using part-of-speech tagging for noun phrases.
– Build associations between extracted terms & controlled vocabularies.
Association Measure
        t     ¬t
C       a     c
¬C      b     d
Where t is the occurrence of a term and C is the occurrence of a class in the training set.
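As a rough illustration of where the four counts could come from, the sketch below tallies them for one term/class pair over a toy training set. It reads the table as a = term and class together, b = term without the class, c = class without the term, d = neither. The records, the `tokens` helper, and the `contingency` function are hypothetical, and plain word matching stands in for real term extraction.

```python
import re

# Hypothetical training records: (title/abstract text, assigned class labels).
TRAINING = [
    ("Spark ignition engines for passenger cars", {"Automobiles"}),
    ("Fuel economy of passenger cars", {"Automobiles"}),
    ("Economic policy and car taxation", {"Economic policy"}),
    ("Monetary and economic policy in Europe", {"Economic policy"}),
]


def tokens(text):
    """Lower-cased word set; a crude stand-in for real term extraction."""
    return set(re.findall(r"[a-z]+", text.lower()))


def contingency(records, term, cls):
    """Return the counts (a, b, c, d) for one term/class pair."""
    a = b = c = d = 0
    for text, classes in records:
        has_term = term in tokens(text)
        in_class = cls in classes
        if has_term and in_class:
            a += 1  # term present, class assigned
        elif has_term:
            b += 1  # term present, class not assigned
        elif in_class:
            c += 1  # term absent, class assigned
        else:
            d += 1  # neither
    return a, b, c, d


print(contingency(TRAINING, "cars", "Automobiles"))
```

In a real EVI the counts would be gathered for every term/class pair in one pass over the training data rather than one pair at a time.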
Association Measure
• Maximum likelihood ratio (after Dunning):
$$W(C,t) = 2\,[\log L(p_1, a, a+b) + \log L(p_2, c, c+d) - \log L(p, a, a+b) - \log L(p, c, c+d)]$$
where
$$\log L(p, k, n) = k \log p + (n - k)\log(1 - p)$$
and
$$p_1 = \frac{a}{a+b}, \qquad p_2 = \frac{c}{c+d}, \qquad p = \frac{a+c}{a+b+c+d}$$
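The measure can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: it takes the second argument of logL as the count k and the third as the total n, which is what the calls logL(p1, a, a+b) imply, and it treats 0·log(0) as 0 so that empty cells do not raise errors. The function names `log_l` and `association` are hypothetical.

```python
import math


def log_l(p, k, n):
    """Binomial log-likelihood k*log(p) + (n - k)*log(1 - p)."""
    def xlogy(x, y):
        # Convention: 0 * log(0) = 0, so empty cells contribute nothing.
        return 0.0 if x == 0 else x * math.log(y)
    return xlogy(k, p) + xlogy(n - k, 1.0 - p)


def association(a, b, c, d):
    """W(C, t) computed from the four contingency counts."""
    p1 = a / (a + b) if a + b else 0.0
    p2 = c / (c + d) if c + d else 0.0
    p = (a + c) / (a + b + c + d)
    return 2.0 * (log_l(p1, a, a + b) + log_l(p2, c, c + d)
                  - log_l(p, a, a + b) - log_l(p, c, c + d))


# A term that always co-occurs with the class scores high;
# a term that is independent of the class scores zero.
print(association(2, 0, 0, 2))
print(association(1, 1, 1, 1))
```

Ranking the classes by W for each extracted term gives the term-to-class associations that the EVI returns for a query.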
Alternatively
• Because the “evidence” terms in EVIs can be considered a document, you can also use IR techniques and use the top-ranked classes for classification or query expansion.
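One way to picture this alternative is to pool the evidence terms of each class into a pseudo-document and rank the classes against the query with an ordinary vector-space model. The sketch below uses TF-IDF weights and cosine similarity as an assumed stand-in for whatever IR technique is actually used; the evidence strings and function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical evidence text pooled per controlled-vocabulary class.
CLASS_EVIDENCE = {
    "Automobiles": "car cars engine spark ignition passenger vehicle engine",
    "Economic policy": "economy policy tax government economy market",
    "Nuclear energy": "plutonium reactor uranium fuel reactor",
}


def words(text):
    return re.findall(r"[a-z]+", text.lower())


def build_index(evidence):
    """Return per-class TF-IDF vectors and the IDF table."""
    tf = {cls: Counter(words(text)) for cls, text in evidence.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    n = len(tf)
    idf = {term: math.log(1 + n / count) for term, count in df.items()}
    vectors = {cls: {t: f * idf[t] for t, f in counts.items()}
               for cls, counts in tf.items()}
    return vectors, idf


def rank_classes(query, vectors, idf):
    """Rank classes by cosine similarity between query and class evidence."""
    q = Counter(w for w in words(query) if w in idf)
    qvec = {t: f * idf[t] for t, f in q.items()}
    qnorm = math.sqrt(sum(v * v for v in qvec.values()))
    scored = []
    for cls, vec in vectors.items():
        dot = sum(w * vec.get(t, 0.0) for t, w in qvec.items())
        if dot > 0:  # dot > 0 implies both norms are non-zero
            norm = math.sqrt(sum(v * v for v in vec.values()))
            scored.append((dot / (qnorm * norm), cls))
    return [cls for _, cls in sorted(scored, reverse=True)]


VECTORS, IDF = build_index(CLASS_EVIDENCE)
print(rank_classes("engine tax", VECTORS, IDF))
```

The top-ranked classes can then be offered to the user as controlled-vocabulary search terms or added to the query as expansion terms.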
Find Plutonium in Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil
Digital library resources
Statistical association: W(C, t) = 2[logL(p1, a, a+b) + ...
EVI example
User Query: “Automobile”
EVI 1: Index term: “pass mtr veh spark ign eng”
EVI 2: Index term: “automobiles” OR “internal combustion engines”
But why stop there?
“Which EVI do I use?” (diagram: several indexes, each with its own EVI)
EVI to EVIs
(diagram: EVI2 placed over several Index/EVI pairs)
Why not treat language the same way?
Find Plutonium in Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil
Support for the Learner with a Query
Facet    Vocabulary                         Displays
WHAT     Thesaurus, e.g. LCSH               Cross-references
WHERE    Gazetteer                          Map
WHEN     Period directory                   Timeline
WHO      Biograph. dict., e.g. Who’s Who    Personal relations
Any catalog: Archives, Libraries, Museums, TV, Publishers
Any resource: Audio, Images, Texts, Numeric data, Objects, Virtual reality, Webpages
It is also difficult to move
between different media forms
Texts
EVI
Thesaurus/
Ontology
Numeric
datasets
Searching across data types
• Different media can be linked indirectly via metadata, but often (e.g. for socio-economic numeric data series) you also need to specify WHERE to get correct results.
But texts associated with numeric
data can be mapped as well…
Texts
EVI
Thesaurus/
Ontology
EVI
captions
Numeric
datasets
But there are also geographic
dependencies…
Texts
EVI
Thesaurus/
Ontology
EVI
Maps/
Geo Data
captions
Numeric
datasets
WHERE: Place names are problematic…
• Variant forms: St. Petersburg, Санкт Петербург, Saint-Pétersbourg, . . .
• Multiple names: Cluj, in Romania / Roumania / Rumania, is also called Klausenburg and Kolozsvar.
• Name changes: Bombay → Mumbai.
• Homographs: Vienna, VA, and Vienna, Austria; 50 Springfields.
• Anachronisms: No Germany before 1870.
• Vague: e.g. Midwest, Silicon Valley.
• Unstable boundaries: 19th century Poland; Balkans; USSR.
• Use a gazetteer!
WHERE. Geo-temporal search interface. Place names found in documents. Gazetteer provided lat. & long. Places displayed on map.
Timebar
Zoom on map. Click on place for a list of records. Click on record to display text.
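A gazetteer lookup that copes with variant forms, changed names, and homographs might look like the sketch below. It is only an illustration: the `Place` record, the helper functions, and the tiny table are hypothetical, and the coordinates are approximate values included for demonstration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    preferred: str
    country: str
    lat: float
    lon: float
    variants: tuple = ()


# Illustrative entries only; coordinates are approximate.
GAZETTEER = [
    Place("Saint Petersburg", "Russia", 59.93, 30.34,
          ("St. Petersburg", "Санкт Петербург", "Saint-Pétersbourg")),
    Place("Cluj", "Romania", 46.77, 23.62, ("Klausenburg", "Kolozsvar")),
    Place("Mumbai", "India", 19.08, 72.88, ("Bombay",)),
    Place("Vienna", "Austria", 48.21, 16.37),
    Place("Vienna", "USA", 38.90, -77.27),
]


def build_lookup(places):
    """Index every preferred name and variant, case-insensitively."""
    lookup = {}
    for place in places:
        for name in (place.preferred, *place.variants):
            lookup.setdefault(name.casefold(), []).append(place)
    return lookup


def resolve(name, lookup, country=None):
    """Return all candidate places; homographs yield several candidates."""
    candidates = lookup.get(name.casefold(), [])
    if country is not None:
        candidates = [p for p in candidates if p.country == country]
    return candidates


LOOKUP = build_lookup(GAZETTEER)
for place in resolve("Vienna", LOOKUP):
    print(place.preferred, place.country, place.lat, place.lon)
```

Returning every candidate rather than a single match leaves the homograph decision to the interface, for example by showing both Viennas on the map.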
So geographic search becomes
part of the infrastructure
Texts
Maps/
Geo Data
EVI
Thesaurus/
Ontology
Gazetteers
captions
Numeric
datasets
WHEN: Search by time is also weakly supported…
• Calendars are the standard for time.
• But people use the names of events to refer to time periods.
• Named time periods resemble place names in being:
– Unstable: European War, Great War, First World War
– Multiple: Second World War, Great Patriotic War
– Ambiguous: “Civil war” in different centuries in England, USA, Spain, etc.
• Places have temporal aspects & periods have geographical aspects: when the Stone Age was varies by region.
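The same lookup idea applies to named periods. The sketch below resolves a period name to calendar years and uses a place to disambiguate; the `Period` record, the `find_periods` function, and the small table are hypothetical illustrations, not the structure of the actual Time Period Directory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Period:
    name: str
    start: int  # first calendar year
    end: int    # last calendar year
    where: str
    aliases: tuple = ()


# A tiny illustrative directory.
PERIODS = [
    Period("First World War", 1914, 1918, "World", ("European War", "Great War")),
    Period("Second World War", 1939, 1945, "World"),
    Period("Great Patriotic War", 1941, 1945, "USSR"),
    Period("Civil war", 1642, 1651, "England"),
    Period("Civil war", 1861, 1865, "USA"),
    Period("Civil war", 1936, 1939, "Spain"),
]


def find_periods(name, where=None):
    """Match a period name or alias; optionally disambiguate by place."""
    key = name.casefold()
    hits = [p for p in PERIODS
            if key == p.name.casefold()
            or key in (alias.casefold() for alias in p.aliases)]
    if where is not None:
        hits = [p for p in hits if p.where == where]
    return hits


for period in find_periods("Civil war"):
    print(period.where, period.start, period.end)
```

An ambiguous name such as “Civil war” returns several candidates until a place is supplied, which mirrors how homographs behave in a gazetteer.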
Vocabularies are the key!
Want: Kung-fu movies?
Use LCSH: Hand-to-hand fighting, oriental, in motion
pictures.
Linking vocabularies WHAT, WHERE, WHEN
Library subject headings
Topic – Geographic subdivision – Chronological subdivision
Place name gazetteer:
Place name – Type – Spatial markers (Lat & long) – When
Time Period Directory
Period name – Type – Time markers (Calendar) – Where
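Because each record carries a pointer into the other vocabulary, the two directories can be joined on place. The sketch below is one hypothetical way to model the two record types and follow the link in both directions; the field names, sample entries, and rough coordinates are assumptions made for illustration.

```python
from typing import NamedTuple, Optional, Tuple


class GazetteerEntry(NamedTuple):
    name: str
    type: str
    lat: float
    lon: float
    when: Optional[Tuple[int, int]] = None  # years the name applies, if known


class PeriodEntry(NamedTuple):
    name: str
    type: str
    start: int
    end: int
    where: str  # place name, the link into the gazetteer


# Illustrative entries; coordinates are rough country-level values.
GAZETTEER = {entry.name: entry for entry in [
    GazetteerEntry("England", "country", 52.5, -1.5),
    GazetteerEntry("Spain", "country", 40.0, -4.0),
]}

PERIODS = [
    PeriodEntry("English Civil War", "war", 1642, 1651, "England"),
    PeriodEntry("Spanish Civil War", "war", 1936, 1939, "Spain"),
]


def locate_period(period_name):
    """Follow a period's Where marker to map coordinates."""
    for period in PERIODS:
        if period.name == period_name and period.where in GAZETTEER:
            place = GAZETTEER[period.where]
            return (place.lat, place.lon)
    return None


def periods_at(place_name, year):
    """Find the named periods that apply to a place in a given year."""
    return [p.name for p in PERIODS
            if p.where == place_name and p.start <= year <= p.end]


print(locate_period("Spanish Civil War"))
print(periods_at("England", 1645))
```

The first function lets a period be plotted on a map; the second lets a place and date retrieve the period names a catalog search should try.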
Time period directories link via
the place (or time)
Texts
Maps/
Geo Data
EVI
Thesaurus/
Ontology
Gazetteers
captions
Time Period Directory
Numeric
datasets
Time lines, Chronologies
WHEN: Time Period Directory Timeline
Link to Catalog
Link to Wikipedia
WHO: Biographical Dictionary
Complex relationships
Life events metadata
WHAT (Actions): prisoner
WHERE (Places): Holstein
WHEN (Times): 1261-1262
WHO (People): Margaret Sambiria
Need external links
Any document, object, or performance
Connect it with its context – and other resources.
Facet    Vocabulary                         Displays
WHAT     Thesaurus, e.g. LCSH               Cross-references
WHERE    Gazetteer                          Map
WHEN     Period directory                   Timeline
WHO      Biograph. dict., e.g. Who’s Who    Personal relations
Any catalog: Archives, Libraries, Museums, TV, Publishers
Any resource: Audio, Images, Texts, Numeric data, Objects, Virtual reality, Webpages
Demo of search interface
Entry Vocabulary
Index suggests correct
LCSH with different
spelling
Related places
Potentially related
people
Potentially related
periods
Mostly in India, 16th-18th century
Find out more about
this area.
Different Browsing Options!
Zooming in to
South Asia
Select
Restricting time
frame
More information about
the country of India…
Berkeley Natural History Museums
Wikipedia
BBC
CIA Factbook
Ethnologue
Historical events, linked to Library catalog & Wikipedia: none available for this time period
ECAI Cultural Atlases:
presenting history in its geographical
& chronological contexts
Mongol Empire Video
Demo Interface
• http://ecai.berkeley.edu/imls2004/imls4w/
New Project: Bringing Lives to Light:
Biography in Context
Ray R. Larson, Michael Buckland, Fredric Gey
University of California, Berkeley
Overview
• Focussing on the Who in Who, What, Where and When
• Types of Biographical Markup
WHEN, WHERE and WHO
• Catalog records found from a time period search commonly include names of persons important at that time. Their names can be forwarded to, e.g., biographies in the Wikipedia encyclopedia.
Place and time are broadly important across numerous tools and genres including, e.g., language atlases, library catalogs, biographical dictionaries, bibliographies, archival finding aids, museum records, etc.
Biographical dictionaries are also heavy on place and time:
Emanuel Goldberg, Born Moscow 1881. PhD under Wilhelm
Ostwald, Univ. of Leipzig, 1906. Director, Zeiss Ikon,
Dresden, 1926-33. Moved to Palestine 1937. Died Tel Aviv,
1970.
Life as a series of episodes involving Activity (WHAT),
WHERE, WHEN, and WHO else.
A new form of biographical
dictionary would link to all
Biographical Dictionary
Texts
Maps/
Geo Data
EVI
Thesaurus/
Ontology
Gazetteers
captions
Time Period Directory
Numeric
datasets
Time lines, Chronologies
Projected Work
• Develop XML markup for Biographical Events
• Most likely to be an adaptation and extension of existing biographical event markup
– Example: EAC/EAD
• Harvest biographical resources
– Wikipedia, etc.
• Integrate as next generation of current interface
EAC/EAD
<bioghist>
  <head>Biographical Note</head>
  <chronlist>
    <chronitem>
      <date>1892, May 7</date>
      <event>Born, <geogname>Glencoe, Ill.</geogname></event>
    </chronitem>
    <chronitem>
      <date>1915</date>
      <event>A.B., <corpname>Yale University, </corpname>New Haven, Conn.</event>
    </chronitem>
    <chronitem>
      <date>1916</date>
      <event>Married <persname>Ada Hitchcock</persname></event>
    </chronitem>
    <chronitem>
      <date>1917-1919</date>
      <event>Served in <corpname>United States Army</corpname></event>
    </chronitem>
  </chronlist>
</bioghist>
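Markup of this kind can be turned into life-event records with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a shortened copy of the example above; mapping `date` to WHEN, `geogname` to WHERE, and `persname` to WHO is an assumption made here for illustration, and the `life_events` function is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample markup, without namespaces.
BIOGHIST = """
<bioghist>
  <head>Biographical Note</head>
  <chronlist>
    <chronitem>
      <date>1892, May 7</date>
      <event>Born, <geogname>Glencoe, Ill.</geogname></event>
    </chronitem>
    <chronitem>
      <date>1916</date>
      <event>Married <persname>Ada Hitchcock</persname></event>
    </chronitem>
  </chronlist>
</bioghist>
"""


def life_events(xml_text):
    """Extract one WHAT/WHERE/WHEN/WHO record per chronitem."""
    root = ET.fromstring(xml_text)
    events = []
    for item in root.iter("chronitem"):
        event = item.find("event")
        if event is None:
            continue
        events.append({
            "when": item.findtext("date", default="").strip(),
            # Full event text with nested tags flattened and spaces normalized.
            "what": " ".join("".join(event.itertext()).split()),
            "where": [(g.text or "").strip() for g in event.iter("geogname")],
            "who": [(p.text or "").strip() for p in event.iter("persname")],
        })
    return events


for record in life_events(BIOGHIST):
    print(record)
```

Records in this shape could then be matched against a gazetteer, a time period directory, and a biographical dictionary to supply the external links.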
Wikipedia data
Life events metadata
WHAT (Actions): prisoner
WHERE (Places): Holstein
WHEN (Times): 1261-1262
WHO (People): Margaret Sambiria
Need external links
A Metadata Infrastructure
CATALOGS: Archives, Historical Societies, Libraries, Museums, Public Television, Publishers, Booksellers
INTERMEDIA INFRASTRUCTURE
Facet    Authority Control          Special Display Tools
WHAT     Thesaurus                  Syndetic Structure
WHERE    Gazetteer                  Maps
WHEN     Time Period Directory      Timelines
WHO      Biographical Dictionary    Dossiers
RESOURCES: Audio, Images, Numeric Data, Objects, Texts, Virtual Reality, Webpages
Learners
Acknowledgements
• Electronic Cultural Atlas Initiative project
• This work is being supported by the Institute of Museum and Library Services through a National Leadership Grant for Libraries
• Contact: [email protected]