Institutional Archives
Technology Overview
Michael L. Nelson
Old Dominion University
[email protected]
http://www.cs.odu.edu/~mln/
Institutional Archives & Repositories: What this digital movement
means for Federal Libraries
Library of Congress Workshop
September 12, 2003
Acknowledgements
• ODU: K. Maly, M. Zubair, J. Bollen
• LANL: R. Luce, X. Liu
• NASA: G. Roncaglia, J. Rocker
• Cornell: C. Lagoze, S. Warner
• MAGiC (UK): Paul Needham
• and, of course, Herbert Van de Sompel (LANL)
– the OpenURL slides are nicked from his presentations
Outline
• A bit of history
• Core technologies
– OAI-PMH
– OpenURL
• Example implementations
• Download and go…
OAI-PMH
Background
• I met Herbert Van de Sompel in April 1999...
– we spoke of a demonstration project he had in mind, for which he
had received sponsorship from Paul Ginsparg and Rick Luce
– We wanted to demonstrate a multi-disciplinary DL that
leveraged the large number of high quality, yet often
isolated, tech report servers, e-print servers, etc.
• most digital libraries (DLs) had grown up along single disciplines
or institutions
– little to no interoperability; isolated DL “gardens”
Universal Preprint Service
• A cross-archive DL that provides services on a collection of
metadata harvested from multiple archives
– Nelson: NCSTRL+; a modified version of Dienst
• support for “clustering”
• support for “buckets”
– Krichel: ReDIF metadata format
– Van de Sompel: SFX Linking
• Demonstrated at Santa Fe NM, October 21-22, 1999
– http://web.archive.org/web/*/http://ups.cs.odu.edu/
– D-Lib Magazine, 6(2) 2000 (2 articles)
• http://www.dlib.org/dlib/february00/02contents.html
– UPS was soon renamed the Open Archives Initiative (OAI)
http://www.openarchives.org/
Data and Service Providers
• Self-describing archives
– Much of the learning about the constituent UPS
archives occurred out of band…
• Data Providers
– publishing into an archive
– providing methods for metadata “harvesting”
• also provide non-technical context for sharing information
• Service Providers
– harvest metadata from providers
– implement user interface to data
Even if these are done by the same DL, these are distinct roles
Metadata Harvesting
• Move away from distributed searching
• Extract metadata from various sources
• Build services on local copies of metadata
– data remains at remote repositories
[Figure: metadata harvesting. Each node is independently maintained, and its metadata is harvested offline into a local copy of metadata. A user's search (e.g. for “cfd applications”) runs against that local copy: all searching, browsing, etc. is performed on the metadata there. Individual nodes can still support direct user interaction.]
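As a rough illustration of "build services on local copies of metadata", the sketch below indexes a few already-harvested records and answers a query without contacting any remote repository. The identifiers and titles are made-up placeholders, and the word-level index is only one simple way a service provider might search its local copy.

```python
from collections import defaultdict


def build_index(records):
    """Map each lowercase word to the set of record identifiers containing it."""
    index = defaultdict(set)
    for identifier, text in records.items():
        for word in text.lower().split():
            index[word].add(identifier)
    return index


def search(index, query):
    """Return identifiers whose metadata contains every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)


# Hypothetical metadata, as if harvested offline from two repositories.
harvested = {
    "oai:repoA:1": "CFD applications in wing design",
    "oai:repoB:7": "Wind tunnel data archive",
    "oai:repoB:9": "Applications of CFD to rotorcraft",
}

index = build_index(harvested)
results = search(index, "cfd applications")
```

The search touches only the local index; the resources themselves stay at the remote repositories.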
Result… OAI
• The OAI was the result of the demonstration and discussion during the Santa
Fe meeting
– OAI = a bunch of people, a religion, a cult, etc.
– OAI Protocol For Metadata Harvesting (OAI-PMH) = the protocol created and
maintained by the OAI
• Initial focus was on federating collections of scholarly e-print materials…
• …however, interest grew and the scope and application of OAI-PMH
expanded to become a generic bulk metadata transport protocol
• Note:
– OAI-PMH is only about metadata -- not full text!
• but what is metadata vs. full-text?
– OAI is neutral with respect to the nature of the metadata or the resources the
metadata describes
• read: commercial publishers have an interest in OAI-PMH too...
Open Archives Initiative
• Open: the protocol is openly documented, and metadata is “exposed” to at least some peer group (note: rights management still applies!)
• Archives: archive defined as a “collection of stuff”, not the archivist’s definition of “archive”; “Repository” used in most OAI documents
• Initiative: TLA; needed another vowel...
• Request is encoded in http
• Response is encoded in XML
• XML Schemas for the responses are defined in the OAI-PMH document
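To make "request is encoded in http" concrete, here is a minimal sketch that builds an OAI-PMH request URL from a baseURL, a verb, and its arguments. The helper name `oai_request_url` is an assumption for illustration; the baseURL is the NACA one that appears later in these slides, and `oai_dc` is the Dublin Core metadata prefix defined by the protocol.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, arguments=None):
    """Encode an OAI-PMH request as an HTTP GET URL.

    `arguments` is a dict because some argument names (e.g. "from")
    are reserved words in Python and cannot be keyword arguments.
    """
    params = {"verb": verb}
    params.update(arguments or {})
    return base_url + "?" + urlencode(params)


url = oai_request_url(
    "http://naca.larc.nasa.gov/oai2.0/",
    "ListRecords",
    {"metadataPrefix": "oai_dc"},
)
```

The response to such a request would be an XML document; this sketch only constructs the URL and sends nothing.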
OAI-PMH Mechanics
Overview of OAI-PMH Verbs
Verb                  Function
Identify              description of archive
ListMetadataFormats   metadata formats supported by archive
ListSets              sets defined by archive
ListIdentifiers       OAI unique ids contained in archive
ListRecords           listing of N records
GetRecord             listing of a single record
(row groupings on the slide: archival; metadata harvesting verbs)
• most verbs take arguments: dates, sets, ids, metadata formats
and resumption token (for flow control)
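The resumption token is what lets a repository hand out a long list in pages. The sketch below follows tokens across pages of a ListIdentifiers response. The `fetch` function is supplied by the caller (here a stand-in that returns canned XML), and the identifiers and token value are invented for the example.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_identifiers(fetch, metadata_prefix="oai_dc"):
    """Collect identifiers across pages, following resumption tokens.

    `fetch` takes a dict of request arguments and returns response XML text.
    """
    identifiers = []
    args = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(args))
        for header in root.iter(OAI_NS + "header"):
            identifiers.append(header.findtext(OAI_NS + "identifier"))
        token = root.find(".//" + OAI_NS + "resumptionToken")
        # A missing or empty token means the list is complete.
        if token is None or not (token.text or "").strip():
            break
        # When resuming, the token is the only argument besides the verb.
        args = {"verb": "ListIdentifiers", "resumptionToken": token.text.strip()}
    return identifiers


# Canned two-page response standing in for a real repository.
PAGES = {
    None: (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
        "<header><identifier>oai:example:1</identifier></header>"
        "<resumptionToken>page2</resumptionToken>"
        "</ListIdentifiers></OAI-PMH>"
    ),
    "page2": (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
        "<header><identifier>oai:example:2</identifier></header>"
        "<resumptionToken/>"
        "</ListIdentifiers></OAI-PMH>"
    ),
}


def fake_fetch(args):
    return PAGES[args.get("resumptionToken")]


ids = harvest_identifiers(fake_fetch)
```

A real harvester would replace `fake_fetch` with an HTTP GET and add error handling; both are left out here.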
OAI-PMH Data Model
[Figure: the OAI-PMH data model, using “David” as the example resource.]
• resource
• item = identifier; all available metadata about David
• records: Dublin Core metadata, MARC metadata, SPECTRUM metadata
• record = identifier + metadata format + datestamp
• set-membership is an item-level property
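One way to read the data model is as two small types: an item that owns the identifier and set membership, and one record per metadata format. The class and field names below are assumptions chosen for illustration, not names from the OAI-PMH specification, and the sample values are placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """record = identifier + metadata format + datestamp (plus the metadata itself)."""
    identifier: str
    metadata_prefix: str
    datestamp: str
    metadata: str


@dataclass
class Item:
    """All available metadata about one resource, under a single identifier."""
    identifier: str
    set_specs: set = field(default_factory=set)   # set membership lives on the item
    records: dict = field(default_factory=dict)   # one record per metadata format

    def add_record(self, metadata_prefix, datestamp, metadata):
        record = Record(self.identifier, metadata_prefix, datestamp, metadata)
        self.records[metadata_prefix] = record
        return record


# Hypothetical item with the same metadata exposed in two formats.
item = Item("oai:example:david", {"sculpture"})
dc = item.add_record("oai_dc", "2003-09-12", "<dc>...</dc>")
marc = item.add_record("marc", "2003-09-12", "<marc>...</marc>")
```

Both records share the item's identifier; only the metadata format distinguishes them.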
Data Providers / Service Providers
[Figure: data providers (repositories) and service providers (harvesters).]
Aggregators
aggregators allow for:
• scalability for OAI-PMH
• load balancing
• community building
• discovery
[Figure: data providers (repositories), an aggregator, and service providers (harvesters).]
Aggregators
• Frequently interchangeable terms:
– aggregators: likely to be community / institutionally focused
– caches: stores a copy, less likely to be community-oriented
– proxies: less likely to store a copy, may gateway between OAI-PMH and
other protocols
• Dienst / OAI Gateway; Harrison, Nelson, Zubair, JCDL 03
• To learn more about aggregators, caches & proxies:
– http://www.openarchives.org/OAI/2.0/guidelines-aggregator.htm
– http://www.cs.odu.edu/~mln/jcdl03/
Example Aggregators
• Arc - http://arc.cs.odu.edu/
– first described “hierarchical harvesting” in D-Lib Magazine, 7(4) 2001
• http://www.dlib.org/dlib/april01/liu/04liu.html
• Celestial - http://celestial.eprints.org/
– among other services, it provides a history of
harvests (successful vs. errors)
• http://celestial.eprints.org/cgi-bin/status
OAI-PMH 2.0 Registration
• 75 repositories registered
• ??? unregistered repositories
• unregistered because:
– testing / development
– not for public harvesting
– public, but “low-profile”
– never got around to it…
– ???
• DP:SP ~= 5:1
Data Providers: http://www.openarchives.org/Register/BrowseSites.pl
Service Providers: http://www.openarchives.org/service/listproviders.html
Registration is Nice…
…But Not Required
• OAI-PMH is (becoming) the “http” for digital libraries
– there is no central registry of http servers
• remember the NCSA “What’s New” page? (ca. 1994)
• There will never be “registration support” in OAI-PMH
– registries are a type of service provider, built on top of OAI-PMH
– registration will be an integral part of community building
– friends…
NASA <friends> example
[Figure: a harvester issues an Identify request and receives a <friends>…</friends> container listing other baseURLs:]
• http://techreports.larc.nasa.gov/ltrs/oai2.0/
• http://naca.larc.nasa.gov/oai2.0/
• http://ston.jsc.nasa.gov/collections/TRS/oai/
• http://ntrs.nasa.gov/oai2.0/
• http://horus.riacs.edu/perl/oai/
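A harvester can discover further repositories by reading the friends container out of an Identify response. The sketch below assumes the container uses the `http://www.openarchives.org/OAI/2.0/friends/` namespace with one `baseURL` element per friend, as in the OAI implementation guidelines. The sample response is trimmed and hand-written for illustration, and the repository name and its own baseURL are placeholders.

```python
import xml.etree.ElementTree as ET

FRIENDS_NS = "{http://www.openarchives.org/OAI/2.0/friends/}"


def friend_base_urls(identify_xml):
    """Return the baseURLs listed in the <friends> container, in document order."""
    root = ET.fromstring(identify_xml)
    return [el.text.strip() for el in root.iter(FRIENDS_NS + "baseURL") if el.text]


# Trimmed, hand-written Identify response (not a real server reply).
IDENTIFY_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <repositoryName>Example repository</repositoryName>
    <baseURL>http://repository.example.org/oai2.0/</baseURL>
    <description>
      <friends xmlns="http://www.openarchives.org/OAI/2.0/friends/">
        <baseURL>http://techreports.larc.nasa.gov/ltrs/oai2.0/</baseURL>
        <baseURL>http://naca.larc.nasa.gov/oai2.0/</baseURL>
      </friends>
    </description>
  </Identify>
</OAI-PMH>
"""

friends = friend_base_urls(IDENTIFY_RESPONSE)
```

The repository's own `baseURL` sits in the OAI-PMH namespace, so filtering on the friends namespace keeps it out of the result.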
Field of Dreams
• It should be easy to be a data provider, even if it
makes more work for the service provider.
– if enough data providers exist, the service providers
will come (DPs >> SPs)
• Open-source / freely available tools
– “drop-in” data providers
• at the end of this presentation
– tools to make your existing DL a data provider:
• http://www.openarchives.org/tools/tools.htm
• also: OAI-implementers mailing list / mail archive!
– service providers:
• http://oaiarc.sourceforge.net/
OAI-PMH Meeting History
                                          OAI Open Day,          2nd OAI Workshop,
Topic                                     Washington DC 1/2001   CERN 10/2002
Protocol definition, development tools    4                      1
DPs, retrofitting existing DLs            5                      4
SPs, new services                         1                      11
Socio-Economic-Political Issues           0                      6
Shift of Topics
• From the protocol itself, supporting &
debugging tools and how to retrofit
(existing) DLs…
• …to building (new) services that use
the OAI-PMH as a core technology and
reporting on their impact to the
institution/community
Arc
• http://arc.cs.odu.edu/
• harvests all known archives
• first end-user service provider
• source available through SourceForge
• hierarchical harvesting
NCSTRL
• http://www.ncstrl.org/
• metadata harvesting replacement for Dienst-based NCSTRL
• based on Arc
• computer science metadata
Archon
• http://archon.cs.odu.edu/
• physics metadata
• based on Arc
• features:
– citation indexing
– equation-based searching
Torii
• http://torii.sissa.it/
• physics metadata
• features
– personalization
– recommendations
– WAP access
iCite
• http://icite.sissa.it/
• physics metadata
• features
– citation-based access to arXiv metadata
my.OAI
• http://www.myoai.com/
• covers all registered metadata
• features
– result sets
– personalization
– many other advanced features
Cyclades
• http://www.ercim.org/cyclades
• scientific metadata
• features
– personalization
– recommendations
– collaboration
• status?
citebase
• http://citebase.eprints.org/
• arXiv metadata
• citation-based indexing, reporting
OAIster
• http://oaister.umdl.umich.edu/
• harvests all known archives
Others…
• Commercial publishers
– American Physical Society (APS)
– Institute of Physics
– Elsevier / Scirus (www.scirus.com)
• Department of Energy
– OSTI
– LANL
• Institutional servers
– DSpace (MIT; www.dspace.org)
– Eprints (www.eprints.org)
– DARE (All Dutch universities)
NACA Technical Report Server
• publicly available
– began in 1996
– details in NASA TM-1999-209127
• scanned reports from 1917-1958
– NACA = predecessor to NASA
• contents mirrored with the MAGiC project
– a UK-based grey-literature
preservation project
– OAI-PMH used to mirror
contents
http://naca.larc.nasa.gov/
http://naca.larc.nasa.gov/oai2.0/
NACA Report 1345
as seen through its native DL
http://naca.larc.nasa.gov/
NACA Report 1345
as seen through MAGiC
http://www.magic.ac.uk/
NACA Report 1345
as seen through Scirus (Elsevier)
http://www.scirus.com/
NACA Report 1345
as seen through my.OAI
(FS Consulting)
http://www.myoai.com/
NASA Technical Report Server
• replacement for the previous
distributed searching version of
NTRS
– MySQL
– Va Tech harvester
– modified “bucket”
– details in Nelson, Rocker, Harrison, Library Hi-Tech, 21(2) (March 2003)
• a service provider & aggregator
– same OAI baseURL as used
for interactive searching
http://ntrs.nasa.gov/
NASA Technical Report Server
• advanced, fielded
search
• explicit query routing
– 10 NASA repositories
– 4 non-NASA
repositories
• turned “off” by default
non-NASA
repositories
> 0.5M records
NASA DLs in the Larger STI Realm
[Figure: NTRS, with LTRS, ATRS, …, CASITRS, shown alongside other STI sources: Publishers, Universities, International, DOD, DOE, ...]
• this could be a fully connected graph
• NTRS could also be a data provider from the point of view of other DLs, allowing the harvesting of NASA report metadata.
• NTRS could also harvest metadata from other DLs, and provide access to non-NASA content.
• We hope to influence the direction of the science.gov effort to use OAI-PMH
Service Providers
• It is clear that SPs are proliferating, despite
(because of?) the inherent bias toward DPs in the
protocol
– easy to be a DP -> many DPs -> SPs eventually emerge
– hard to be a DP -> SPs starve
– currently 5x as many DPs as SPs
• SPs are beginning to offer increasingly
sophisticated services
– competitive market originally envisioned for SPs is
emerging
OpenURL
Origins & Motivation
The Context: Library Automation Environment anno 1998
• distributed information environment
• local & remote A&I databases
• rapidly growing e-journal collection
• need to interlink the available information
The Problem:
• links are delivered by info providers
• links are not sensitive to user’s context
• appropriate copy problem
• links dependent on business agreements between
information vendors
• links don’t cover the complete collection
Origins & Motivation
The Context: Library Automation Environment anno 1998
• distributed information environment
• local & remote A&I databases
• rapidly growing e-journal collection
• need to interlink the available information
The REAL Problem:
• libraries have no say in linking
• libraries are losing core part of the “organizing
information” task
• expensive collection is not used optimally
• users are not well served
Origins & Motivation
The Solution:
In information services:
• DO NOT provide a link which is an actual service
related to a referenced item (e.g. a link from a record
in an A&I database to the corresponding full-text)
• BUT rather provide
• a link (the OpenURL) that transports metadata about the referenced item
to
• others that are better placed to provide service links (a linking server operated by the library)
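The idea of "a link that transports metadata" can be sketched as a URL whose query string carries the citation and whose base address is the library's own linking server. The resolver address and all citation values below are hypothetical placeholders; the key names (`genre`, `issn`, `volume`, `issue`, `spage`) follow the OpenURL 0.1 style of key/value pairs.

```python
from urllib.parse import urlencode


def make_openurl(resolver_base_url, **metadata):
    """Build an OpenURL: the library's resolver address plus citation metadata."""
    return resolver_base_url + "?" + urlencode(metadata)


# Hypothetical resolver and placeholder citation values.
link = make_openurl(
    "http://resolver.example.edu/menu",
    genre="article",
    issn="1234-5679",
    volume="12",
    issue="3",
    spage="1",
)
```

The information provider only emits this link; deciding which services to offer for the cited item is left to whoever operates the resolver.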
non-OpenURL linking
[Figure: non-OpenURL linking. In the link source resource, metadata for a reference is resolved into a link; that link points to the referenced work in the link destination resource.]
OpenURL linking
[Figure: OpenURL linking. The link source provides an OpenURL with each reference; the OpenURL transports metadata & identifiers to a user-specific linking server, which resolves the metadata & identifiers into services, each a link to a different link destination.]
Evolution ~ 1998
• Nature of solution determined
• Experiment with local databases at Ghent University
• Demonstrated October 1998 at Belgian Library meeting
• Problem statement & Experiment described in 2 D-Lib
Magazine papers, April 1999
Evolution ~ 1999
• Feasibility of solution tested in 2 complex
environments
• Experiments:
• SFX@Ghent & SFX@LANL: LANL, Ghent, APS, Wiley,
SilverPlatter, Ex Libris
• UPS Prototype: arXiv, SLAC/SPIRES, LANL, Ghent, …
• Demonstrated:
• June 1999 at ALA LiTA session, New Orleans
• October 1999 at OAI meeting, Santa Fe
• Experiments described in 2 D-Lib Magazine papers,
October 1999 and February 2000
Evolution ~ 2000
• OpenURL 0.1 released
• Quick adoption of OpenURL 0.1 in information community
• SFX linking server goes beta
Evolution ~ 2001
• Integration of OpenURL Framework and DOI/CrossRef
framework
• Experiment involving CNRI, LANL, OhioLink, Academic
Press, Ex Libris, …
• DOI/OpenURL integration described in 2 D-Lib Magazine
papers, March 2001 and September 2001
• First non-SFX linking servers appear
Evolution ~ 2001
• Proposal to standardize OpenURL
• Generalization of OpenURL Framework concepts
beyond scholarly information community
• Described in:
Van de Sompel, Herbert and Beit-Arie, Oren. Generalizing
the OpenURL Framework beyond References to Scholarly
Works: the Bison-Futé model. July/August 2001. D-Lib
Magazine.
• NISO AX Committee starts standardization of the
OpenURL Framework using the Bison-Futé model as the
basis of its work.
NISO OpenURL Standardization Charge
• Use existing “OpenURL Framework” as starting point
• notion of context-sensitive services
• notion of transporting “contextual” metadata packages
to obtain context-sensitive services
• Define syntax and transport-method for “contextual”
metadata packages
• Ensure extensibility:
• must support future applications
• must support other information communities
=> Generalize and Standardize
NISO OpenURL Standardization Charge
Therefore, to be addressed were:
• OpenURL Framework beyond scholarly resources
• “contextual” metadata packages
• Syntax for “contextual” metadata packages
• Transport of “contextual” metadata packages
default links:
• restricted in nature
• action-radius restricted by business agreements
• not context-sensitive
[Figure, in two steps (credit: herbert van de sompel): first, a metadata plane in which resource1, resource2 and resource3 are connected by default links; then the same metadata plane together with an extended services plane containing service component1 and service component2.]
Download and Go!
Where Do You Want to Build?
[Figure: user, service provider, a row of data providers (...), and local context-sensitive services.]
Fedora
• joint project between Cornell & UVa
– funded by the Mellon Foundation
• a repository management system
– focuses on complex digital objects and their behaviors
• more info:
– http://www.fedora.info/
– D-Lib Magazine, 9(4)
• http://www.dlib.org/dlib/april03/staples/04staples.html
DSpace
• MIT + HP Labs
• constructed to capture all the output of MIT’s
faculty
• now generalized to the DSpace Federation
– 8 top universities in the US & Canada
• More info:
– http://www.dspace.org/
– http://sourceforge.net/projects/dspace/
– D-Lib Magazine 9(1)
• http://www.dlib.org/dlib/january03/smith/01smith.html
EPrints.org
• developed at Southampton University
– part of larger suite of institutional/author self-archiving
tools and services
• e.g.: citebase; paracite
• widely adopted -- 100+ sites
– http://software.eprints.org/#ep2
• more info
– http://www.eprints.org/
– http://www.arl.org/sparc/core/index.asp?page=g20#6
Kepler
• P2P publishing for academia
– community servers for coordination,
management
– archivelets for individual laptops, PCs
• more info:
– http://kepler.cs.odu.edu/
– D-Lib Magazine 7(4)
• http://www.dlib.org/dlib/april01/maly/04maly.html
OpenResolver
• developed by UKOLN
– open source
• OpenURL 0.1 format resolver
– NISO 1.0 format???
• more info:
– Ariadne, 28
• http://www.ariadne.ac.uk/issue28/resolver/
• ftp://ftp.ukoln.ac.uk/metadata/tools/openresolver/
• http://www.ukoln.ac.uk/distributed-systems/openurl/
Conclusions
Why The OAI-PMH
is NOT Important
• Users don’t care
• OAI-PMH is middleware
– if done right, the uninterested user should never have to
know
• Using OAI-PMH does not ensure a good SP
• OAI-PMH is (or is becoming) HTTP for DLs
– few people get excited about http now
• http & OAI-PMH are core technologies whose
presence is now assumed
Other Uses For the OAI-PMH
• Assumptions:
– Traditional DLs / SPs will continue on their present path of
increasing sophistication
• citation indexing, search results viz, personalization, recommendations,
subject-based filtering, etc.
– growth rates remain the same (5x DPs as SPs)
• Premise: OAI-PMH is applicable to any scenario that needs to
update / synchronize distributed state
– Future opportunities are possible by creatively interpreting the
OAI-PMH data model
• See Van de Sompel, Young & Hickey, D-Lib Magazine July 2003,
http://www.dlib.org/dlib/july03/young/07young.html
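The "update / synchronize distributed state" premise can be illustrated without any network code: a mirror keeps records by identifier and applies only those whose datestamp is at or after the last synchronization point, much as a harvester would with a date-selective request. The function name, record tuples, and dates below are assumptions for the sketch; datestamps are assumed to be ISO `YYYY-MM-DD` strings, which compare correctly as text.

```python
def synchronize(mirror, changed_records, last_sync):
    """Apply records changed since `last_sync` to a local mirror.

    `changed_records` is an iterable of (identifier, datestamp, metadata).
    Returns the newest datestamp seen, for use as the next starting point.
    """
    newest = last_sync
    for identifier, datestamp, metadata in changed_records:
        if datestamp < last_sync:
            continue  # older than the last synchronization; already held
        mirror[identifier] = (datestamp, metadata)
        newest = max(newest, datestamp)
    return newest


# Hypothetical state: one record already mirrored, three reported by the source.
mirror = {"oai:example:1": ("2003-01-10", "old title")}
changes = [
    ("oai:example:1", "2003-09-01", "new title"),
    ("oai:example:2", "2003-09-05", "another record"),
    ("oai:example:3", "2002-12-31", "stale record"),
]
next_from = synchronize(mirror, changes, "2003-01-10")
```

Nothing here is specific to bibliographic metadata, which is the point of the premise: any state keyed by identifier and datestamp can be kept in step this way.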
OpenURL Framework evolution
A spec based on HTTP GET to transport metadata about
• a scholarly referent &
• the context in which the referent is referenced
Draft Van de Sompel, Beit-Arie, Hochstenbach 05/2001
A framework Standard that enables different Communities
to:
• describe a referent
• describe the context in which the referent is referenced
• transport these descriptions
NISO Draft Standard 04/2003
The Future: Community Building
• Ultimately, protocols and metadata formats are not what
makes a difference
• Rather, the critical mass afforded by a common set of
utilities (cf. http, Dublin Core, XML)
• The best current example: The Open Language Archives
Community
– http://www.language-archives.org/
• OAI-PMH provides the basis for communication between
strangers, but allows even richer communication between
friends