A Review of Institutional Repository Projects and Technologies Michael L. Nelson Old Dominion University [email protected] http://www.cs.odu.edu/~mln/ Texas A&M University May 6, 2004



Acknowledgements

• ODU: K. Maly, M. Zubair, J. Bollen • LANL: R. Luce, X. Liu • NASA: G. Roncaglia, J. Rocker, C. Mackey • Cornell: C. Lagoze, S. Warner • MAGiC (UK): Paul Needham • and, of course, Herbert Van de Sompel (LANL) – the OpenURL slides are nicked from his presentations

Outline

• A bit of history • Core technologies & Issues – OAI-PMH • deep web – OpenURL – Handles / DOIs – Object Models • Example implementations • Download and go…

covered only briefly

OAI-PMH

Background

• I met Herbert Van de Sompel in April 1999...

– we spoke of a demonstration project he had in mind, for which he had received sponsorship from Paul Ginsparg and Rick Luce – we wanted to demonstrate a multi-disciplinary DL that leveraged the large number of high quality, yet often isolated, tech report servers, e-print servers, etc.

• most digital libraries (DLs) had grown up along single disciplines or institutions – little to no interoperability; isolated DL “gardens”

Universal Preprint Service

• A cross-archive DL that provides services on a collection of metadata harvested from multiple archives – Nelson: NCSTRL+; a modified version of Dienst • support for “clustering” • support for “buckets” – Krichel: ReDIF metadata format – Van de Sompel: SFX Linking • Demonstrated at Santa Fe NM, October 21-22, 1999 – http://web.archive.org/web/*/http://ups.cs.odu.edu/ – D-Lib Magazine, 6(2) 2000 (2 articles) • http://www.dlib.org/dlib/february00/02contents.html

– UPS was soon renamed the Open Archives Initiative (OAI) http://www.openarchives.org/

Data and Service Providers

• Self-describing archives – Much of the learning about the constituent UPS archives occurred out of band… • Data Providers – publishing into an archive – providing methods for metadata “harvesting” • provide non-technical context for sharing information also • Service Providers – harvest metadata from providers – implement user interface to data

Even if these are done by the same DL, these are distinct roles

Metadata Harvesting

• Move away from distributed searching – the return of union catalogs • Extract metadata from various sources • Build services on local copies of metadata – data remains at remote repositories

[Figure: a user searches for “cfd applications” against a local copy of metadata; all searching, browsing, etc. is performed on the metadata there. Metadata is harvested offline from each node; each node is independently maintained and can still support direct user interaction.]
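A minimal sketch of "services on local copies of metadata", under loud assumptions: the two records, their field names and the `search` helper are hypothetical, and matching is a plain case-insensitive keyword test rather than what any real service provider does.

```python
# Hypothetical harvested records: only the metadata is held locally;
# the reports themselves stay at the remote repositories.
LOCAL_COPY = [
    {"identifier": "oai:a.example.org:1",
     "title": "CFD applications for wing design",
     "url": "http://a.example.org/reports/1"},
    {"identifier": "oai:b.example.org:7",
     "title": "Wind tunnel calibration",
     "url": "http://b.example.org/reports/7"},
]


def search(local_copy, query):
    """Return records whose title contains every query term (case-insensitive)."""
    terms = query.lower().split()
    return [rec for rec in local_copy
            if all(term in rec["title"].lower() for term in terms)]


for rec in search(LOCAL_COPY, "cfd applications"):
    # Each hit points back to the remote repository that holds the resource.
    print(rec["identifier"], rec["url"])
```

The search never touches the remote nodes; only following a hit's URL does.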

Result… OAI

• The OAI was the result of the demonstration and discussion during the Santa Fe meeting – OAI = a bunch of people, a religion, a cult, etc.

– OAI Protocol For Metadata Harvesting (OAI-PMH) = the protocol created and maintained by the OAI • Initial focus was on federating collections of scholarly e-print materials… • …however, interest grew and the scope and application of OAI-PMH expanded to become a generic bulk metadata transport protocol

• Note: – OAI-PMH is only about metadata -- not full text!

• but what is metadata vs. full-text?

– OAI is neutral with respect to the nature of the metadata or the resources the metadata describes • read: commercial publishers have an interest in OAI-PMH too...

Open Archives Initiative

• Open: the protocol is openly documented, and metadata is “exposed” to at least some peer group (note: rights management still applies!) • Archives: archive is defined as a “collection of stuff”, not the archivist’s definition of “archive”; “Repository” is used in most OAI documents • Initiative: TLA; needed another vowel...

OAI-PMH Mechanics

• Request is encoded in http • Response is encoded in XML • XML Schemas for the responses are defined in the OAI-PMH document
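A rough sketch of these mechanics, assuming a placeholder base URL (`http://example.org/oai`) and a hand-written sample response rather than real repository output. The helper names `build_request` and `parse_identify` are made up for illustration; only the standard library is used and no network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_request(base_url, verb, **args):
    """Encode an OAI-PMH request as an HTTP GET URL (verb plus arguments)."""
    params = {"verb": verb}
    params.update(args)
    return base_url + "?" + urlencode(params)


# Hand-written stand-in for a repository's XML response (not real data).
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2004-05-06T12:00:00Z</responseDate>
  <request verb="Identify">http://example.org/oai</request>
  <Identify>
    <repositoryName>Example Repository</repositoryName>
    <baseURL>http://example.org/oai</baseURL>
    <protocolVersion>2.0</protocolVersion>
  </Identify>
</OAI-PMH>"""


def parse_identify(xml_text):
    """Pull a couple of fields out of an Identify response."""
    root = ET.fromstring(xml_text)
    ident = root.find("oai:Identify", OAI_NS)
    return {
        "repositoryName": ident.findtext("oai:repositoryName", namespaces=OAI_NS),
        "protocolVersion": ident.findtext("oai:protocolVersion", namespaces=OAI_NS),
    }


print(build_request("http://example.org/oai", "ListRecords", metadataPrefix="oai_dc"))
print(parse_identify(SAMPLE_IDENTIFY))
```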

Overview of OAI-PMH Verbs

Verb                  Function
archival verbs:
Identify              description of archive
ListMetadataFormats   metadata formats supported by archive
ListSets              sets defined by archive
metadata harvesting verbs:
ListIdentifiers       OAI unique ids contained in archive
ListRecords           listing of N records
GetRecord             listing of a single record

most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
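Resumption-token flow control can be sketched as a loop that keeps asking until no token comes back. The `fetch` callable and the in-memory `make_fake_repository` are hypothetical stand-ins for an HTTP call and a real repository; its tokens are just page offsets, which a real repository need not use.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_identifiers(fetch, metadata_prefix="oai_dc"):
    """Yield every identifier, following resumption tokens until exhausted."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(params))
        for header in root.iter(NS + "header"):
            yield header.findtext(NS + "identifier")
        token = root.findtext(f".//{NS}resumptionToken")
        if not token:  # absent or empty token: the list is complete
            return
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListIdentifiers", "resumptionToken": token}


def make_fake_repository(ids, page_size=2):
    """Hypothetical in-memory repository that pages its answers."""
    def fetch(params):
        start = int(params.get("resumptionToken", 0))
        page = ids[start:start + page_size]
        headers = "".join(
            f"<header><identifier>{i}</identifier></header>" for i in page)
        nxt = start + page_size
        token = f"<resumptionToken>{nxt}</resumptionToken>" if nxt < len(ids) else ""
        return ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
                f"<ListIdentifiers>{headers}{token}</ListIdentifiers></OAI-PMH>")
    return fetch


ids = [f"oai:foo.edu:{n}" for n in range(1, 6)]
print(list(list_identifiers(make_fake_repository(ids))))
```

The harvester never needs to know the page size; the repository controls the flow.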

OAI-PMH Data Model

• resource: e.g. David
• item = identifier: all available metadata about David – set-membership is item-level property
• records: Dublin Core metadata, MARC metadata, SPECTRUM metadata – record = identifier + metadata format + datestamp
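One way to picture the item / record split is with two small classes. This is an illustrative sketch only: the class and field names, the set name and the metadata prefixes (`oai_dc`, `marc`) are assumptions, not part of the protocol text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """record = identifier + metadata format + datestamp"""
    identifier: str
    metadata_prefix: str
    datestamp: str
    metadata: str = ""


@dataclass
class Item:
    """item = identifier; it owns set membership and one record per format."""
    identifier: str
    sets: set = field(default_factory=set)
    records: dict = field(default_factory=dict)

    def add_record(self, metadata_prefix, datestamp, metadata=""):
        rec = Record(self.identifier, metadata_prefix, datestamp, metadata)
        self.records[metadata_prefix] = rec
        return rec


item = Item("oai:foo.edu:1234")
item.sets.add("sculpture")            # set membership lives on the item
item.add_record("oai_dc", "2004-05-06")
item.add_record("marc", "2004-05-01")
print(sorted(item.records))           # same item, several metadata formats
```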

Data Providers / Service Providers

[Figure: data providers (repositories) are harvested by service providers (harvesters)]

Aggregators

aggregators allow for: • scalability for OAI-PMH • load balancing • community building • discovery

[Figure: an aggregator sits between data providers (repositories) and service providers (harvesters)]

Aggregators

• Frequently interchangeable terms: – aggregators: likely to be community / institutionally focused – caches: stores a copy, less likely to be community-oriented – proxies: less likely to store a copy, may gateway between OAI-PMH and other protocols • Dienst / OAI Gateway; Harrison, Nelson, Zubair, JCDL 03 • To learn more about aggregators, caches & proxies: – http://www.openarchives.org/OAI/2.0/guidelines-aggregator.htm – http://www.cs.odu.edu/~mln/jcdl03/
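An aggregator is a harvester whose merged result can itself be harvested. The sketch below shows only the merge step, under these assumptions: records are plain dicts, datestamps are ISO 8601 strings (so string comparison orders them), and the base URLs are placeholders. `aggregate` is a hypothetical helper, not code from Arc or Celestial.

```python
def aggregate(harvests):
    """Merge per-repository harvests into one collection keyed by identifier.

    When the same identifier arrives from more than one source (for example
    from a repository and from a cache of it), the newest datestamp wins.
    """
    merged = {}
    for base_url, records in harvests.items():
        for rec in records:
            current = merged.get(rec["identifier"])
            if current is None or rec["datestamp"] > current["datestamp"]:
                # Remember where this copy came from (provenance).
                merged[rec["identifier"]] = dict(rec, source=base_url)
    return merged


harvests = {
    "http://a.example.org/oai": [
        {"identifier": "oai:a.example.org:1", "datestamp": "2004-01-10"},
    ],
    "http://cache.example.org/oai": [
        {"identifier": "oai:a.example.org:1", "datestamp": "2004-03-02"},
        {"identifier": "oai:b.example.org:9", "datestamp": "2004-02-01"},
    ],
}
for identifier, rec in sorted(aggregate(harvests).items()):
    print(identifier, rec["datestamp"], rec["source"])
```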

Example Aggregators

• Arc - http://arc.cs.odu.edu/ – first described “hierarchical harvesting” in D-Lib Magazine, 7(4) 2001 • http://www.dlib.org/dlib/april01/liu/04liu.html

• Celestial - http://celestial.eprints.org/ – among other services, it provides a history of harvests (successful vs. errors) • http://celestial.eprints.org/cgi-bin/status

OAI-PMH 2.0 Registration

• 75 repositories registered • ??? unregistered repositories, unregistered because: – testing / development – not for public harvesting – public, but “low-profile” – never got around to it… – ??? • DP:SP ~= 5:1

Data Providers: http://www.openarchives.org/Register/BrowseSites.pl

Service Providers: http://www.openarchives.org/service/listproviders.html

Registration is Nice… …But Not Required

• OAI-PMH is (becoming) the “http” for digital libraries – there is no central registry of http servers • remember the NCSA “What’s New” page? (ca. 1994) • There will never be “registration support” in OAI-PMH – registries are a type of service provider, built on top of OAI-PMH – registration will be an integral part of community building – friends…

NASA example

[Figure: a harvester sends Identify to each repository]
• http://techreports.larc.nasa.gov/ltrs/oai2.0/
• http://ston.jsc.nasa.gov/collections/TRS/oai/
• http://naca.larc.nasa.gov/oai2.0/
• http://ntrs.nasa.gov/oai2.0/
• http://horus.riacs.edu/perl/oai/

Scientific Communication

• With only some exceptions, which interface is used for discovery is not as important as the fact that discovery occurred in the first place… – “control” of the discovered objects is not “lost” by data providers • however, higher level mirroring services can be built on top of OAI (cf. NACA & ARC mirroring between NASA LaRC and MAGiC) • The real power of OAI-PMH derives as much from what it does not do as what it actually does

What Does OAI-PMH Mean for Authors?

• On the surface, absolutely nothing!

– the ideal OAI deployment should be absolutely invisible to normal DL operations – uninterested users should not even notice or care • Indirectly, they should enjoy the benefits of the critical mass of current and developing DL tools & systems – personal, institutional data providers – proliferation of targeted, value-added service providers

What Does OAI-PMH Mean For Editors?

• Absolutely everything… • The decoupling of SPs and DPs will have significant and profound implications on scientific and technical information exchange – OAI-PMH is actually just one component in a larger engineering effort for scholarly communication (e.g. OpenURL) • Service and resource integration will be the focus of journals, professional societies, universities, etc.

– OAI-PMH will be as basic and core a technology for scientific publishing as http & XML

Field of Dreams

• It should be easy to be a data provider, even if it makes more work for the service provider.

– if enough data providers exist, the service providers will come (DPs >> SPs) • Open-source / freely available tools – “drop-in” data providers • at the end of this presentation – tools to make your existing DL a data provider: • http://www.openarchives.org/tools/tools.htm

• also: OAI-implementers mailing list / mail archive!

– service providers: • http://oaiarc.sourceforge.net/

OAI-PMH Meeting History

OAI Open Day, Washington DC 1/2001 4 Protocol definition, development tools 2nd OAI Workshop CERN 10/2002 1 5 1 0 DPs, retrofitting existing DLs SPs, new services Socio-Economic Political Issues 4 11 6

Shift of Topics

• From the protocol itself, supporting & debugging tools and how to retrofit (existing) DLs… • …to building (new) services that use the OAI-PMH as a core technology and reporting on their impact to the institution/community

Arc
• http://arc.cs.odu.edu/ • harvests all known archives • first end-user service provider • source available through SourceForge • hierarchical harvesting

NCSTRL
• http://www.ncstrl.org/ • metadata harvesting replacement for Dienst-based NCSTRL • based on Arc • computer science metadata

Archon
• http://archon.cs.odu.edu/ • physics metadata • based on Arc • features: – citation indexing – equation-based searching

Torii
• http://torii.sissa.it/ • physics metadata • features – personalization – recommendations – WAP access

iCite
• http://icite.sissa.it/ • physics metadata • features – citation based access to arXiv metadata

Citebase
• http://citebase.eprints.org/ • arXiv metadata • citation based indexing, reporting

my.OAI
• http://www.myoai.com/ • covers all registered metadata • features – result sets – personalization – many other advanced features

Cyclades
• http://www.ercim.org/cyclades • scientific metadata • features – personalization – recommendations – collaboration • status?

OAIster
• http://oaister.umdl.umich.edu/ • harvests all known archives • Mellon Foundation funded project • Content-sharing agreement with Yahoo! – http://www.openarchives.org/pipermail/oai-general/2004-March/000371.html

Others…

• Commercial publishers – American Physical Society (APS) – Institute of Physics – Elsevier / Scirus (www.scirus.com) – BioMed Central • US Govt – OSTI – LANL – PubMed Central • Institutional servers – DARE (All Dutch universities) – California Digital Library

NACA Technical Report Server

http://naca.larc.nasa.gov/ http://naca.larc.nasa.gov/oai2.0/ • publicly available – began in 1996 – details in NASA TM-1999-209127 • scanned reports from 1917-1958 – NACA = predecessor to NASA • contents mirrored with the MAGiC project – a UK-based grey literature preservation project – OAI-PMH used to mirror contents

NACA Report 1345 as seen through its native DL http://naca.larc.nasa.gov/

NACA Report 1345 as seen through MAGiC http://www.magic.ac.uk/

NACA Report 1345 as seen through Scirus (Elsevier) http://www.scirus.com/

NACA Report 1345 as seen through my.OAI (FS Consulting) http://www.myoai.com/

NASA Technical Report Server

• replacement for the previous distributed searching version of NTRS – MySQL – Va Tech harvester – modified “bucket” – details in Nelson, Rocker, Harrison, Library Hi-Tech, 21(2) (March 2003) • a service provider & aggregator – same OAI baseURL as used for interactive searching http://ntrs.nasa.gov/

NASA Technical Report Server

• advanced, fielded search • explicit query routing – 12 NASA repositories – 4 non-NASA repositories • turned “off” by default • >600k abstracts; >300k full-text

NASA DLs in the Larger STI Realm

[Figure: NASA DLs (LTRS, NTRS, ATRS, …, CASITRS) alongside other STI sources (DOE, Publishers, Universities, DOD, International, . . .); this could be a fully connected graph]

• NTRS could also be a data provider from the point of view of other DLs, allowing the harvesting of NASA report metadata. • NTRS could also harvest metadata from other DLs, and provide access to non-NASA content. • We hope to influence the direction of the science.gov effort to use OAI-PMH

Service Providers

• It is clear that SPs are proliferating, despite (because of?) the inherent bias toward DPs in the protocol – easy to be a DP -> many DPs -> SPs eventually emerge – hard to be a DP -> SPs starve – currently 5x more DPs than SPs • SPs are beginning to offer increasingly sophisticated services – competitive market originally envisioned for SPs is emerging

OAI-PMH & The Deep Web

Exposing Repository Contents

• DP9: Webcrawler access to OAI-PMH repositories – http://dlib.cs.odu.edu/dp9/ – JCDL 02 http://www.cs.odu.edu/~liu_x/dp9/dp9.pdf • An Apache module for OAI-PMH http://www.modoai.org/ • Extensible Repository Resource Locators (ERRoLs) for OAI Identifiers http://www.oclc.org/research/projects/oairesolver/default.htm

Race for This New Market…

• Yahoo! & University of Michigan – http://www.umich.edu/news/index.html?Releases/2004/Mar04/r031004 • Google & CrossRef – http://www.nature.com/nature/focus/accessdebate/17.html

OpenURL

slides from Herbert Van de Sompel, LANL

Origins & Motivation The Context: Library Automation Environment anno 1998 • distributed information environment • local & remote A&I databases • rapidly growing e-journal collection • need to interlink the available information The Problem: • links are delivered by info providers • links are not sensitive to user’s context • appropriate copy problem • links dependent on business agreements between information vendors • links don’t cover the complete collection

Origins & Motivation The Context: Library Automation Environment anno 1998 • distributed information environment • local & remote A&I databases • rapidly growing e-journal collection • need to interlink the available information The REAL Problem: • libraries have no say in linking • libraries are losing core part of the “organizing information” task • expensive collection is not used optimally • users are not well served

Origins & Motivation The Solution: In information services: • DO NOT provide a link which is an actual service related to a referenced item (e.g. a link from a record in an A&I database to the corresponding full-text) • BUT rather provide • a link that transports metadata about the referenced item to • others that are better placed to provide service links

non-OpenURL linking

[Figure: the link source resolves the metadata of a reference into a single link to the referenced work at the link destination]

OpenURL linking

[Figure: the link source provides an OpenURL for the reference; the OpenURL transports metadata & identifiers to a user-specific linking server operated by the library, which resolves the metadata & identifiers into services linking to several link destinations]
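The two halves of OpenURL linking can be sketched as below. Everything concrete is an assumption made for illustration: the resolver address, the ISSN, the holdings table and the `make_openurl` / `resolve` helpers are hypothetical, and the key names (`genre`, `issn`, `volume`, `spage`, `aulast`, `sid`) follow the OpenURL 0.1 style as best recalled and should be checked against the spec.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def make_openurl(resolver_base, citation, sid=None):
    """Link source side: carry metadata to the user's own linking server."""
    params = dict(citation)
    if sid:
        params["sid"] = sid  # identifies the source that produced the link
    return resolver_base + "?" + urlencode(params)


def resolve(openurl, holdings):
    """Linking server side: turn metadata into context-sensitive service links."""
    query = {k: v[0] for k, v in parse_qs(urlsplit(openurl).query).items()}
    links = []
    if query.get("issn") in holdings:
        links.append(("full text", holdings[query["issn"]]))
    if "aulast" in query:
        links.append(("more by author", "author-search:" + query["aulast"]))
    return links


citation = {"genre": "article", "issn": "1234-5678", "volume": "6",
            "spage": "1", "aulast": "Smith"}
url = make_openurl("http://resolver.example.edu/menu", citation, sid="example:db")

# Same OpenURL, two libraries with different subscriptions.
print(resolve(url, {"1234-5678": "http://fulltext.example.edu/1234-5678"}))
print(resolve(url, {}))
```

The link source never decides where full text lives; that choice is made by whichever library's server receives the metadata.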

Evolution ~ 1998 • Nature of solution determined • Experiment with local databases at Ghent University • Demonstrated October 1998 at Belgian Library meeting • Problem statement & Experiment described in 2 D-Lib Magazine papers, April 1999

Evolution ~ 1999 • Feasibility of solution tested in 2 complex environments • Experiments: – SFX@Ghent & SFX@LANL: LANL, Ghent, APS, Wiley, SilverPlatter, Ex Libris – UPS Prototype: arXiv, SLAC/SPIRES, LANL, Ghent, … • Demonstrated: – June 1999 at ALA LiTA session, New Orleans – October 1999 at OAI meeting, Santa Fe • Experiments described in 2 D-Lib Magazine papers, October 1999 and February 2000

Evolution ~ 2000 • OpenURL 0.1 released • Quick adoption of OpenURL 0.1 in information community • SFX linking server goes beta

Evolution ~ 2001 • Integration of OpenURL Framework and DOI/CrossRef framework • Experiment involving CNRI, LANL, OhioLink, Academic Press, Ex Libris, … • DOI/OpenURL integration described in 2 D-Lib Magazine papers, March 2001 and September 2001 • First non-SFX linking servers appear

Evolution ~ 2001 • Proposal to standardize OpenURL • Generalization of OpenURL Framework concepts beyond scholarly information community • Described in: Van de Sompel, Herbert and Beit-Arie, Oren. Generalizing the OpenURL Framework beyond References to Scholarly Works: the Bison-Futé model. July/August 2001. D-Lib Magazine. • NISO AX Committee starts standardization of the OpenURL Framework using the Bison-Futé model as the basis of its work.

NISO OpenURL Standardization Charge • Use existing “OpenURL Framework” as starting point • notion of context-sensitive services • notion of transporting “contextual” metadata packages to obtain context-sensitive services • Define syntax and transport-method for “contextual” metadata packages • Ensure extensibility: • must support future applications • must support other information communities => Generalize and Standardize

NISO OpenURL Standardization Charge Therefore, to be addressed were: • OpenURL Framework beyond scholarly resources • “contextual” metadata packages • Syntax for “contextual” metadata packages • Transport of “contextual” metadata packages

default links: • restricted in nature • action-radius restricted by business agreements • not context-sensitive

[Figure: metadata plane with default links to resource1, resource2 and resource3]

[Figure: extended services plane (service component1, service component2) layered over the metadata plane and its default links]

Naming: Handles & DOIs

Naming

• Fundamental to other technologies (OAI-PMH, OpenURL, etc.) • Options – URNs – Persistent URLs (PURLs) • http://purl.org/ – Handles • http://www.handle.net/ – Digital Object Identifiers • http://www.doi.org/ – ARK • http://www.cdlib.org/inside/diglib/ark/

“Inverted Archives”

• Unit of discourse is no longer an archive or service, but a DOI which has services linked from it – cf.: • UPS demonstration prototype • “Smart Objects, Dumb Archives” (SODA) model

Object Models

Popular Object Models

• METS – http://www.loc.gov/standards/mets/ – used in DSpace, Fedora • MPEG-21 DIDL – http://xml.coverpages.org/mpeg21-didl.html – used in LANL DLs • http://www.dlib.org/dlib/november03/bekaert/11bekaert.html • http://www.dlib.org/dlib/february04/bekaert/02bekaert.html • http://lib-www.lanl.gov/~herbertv/papers/jcdl2004-submitted-draft.pdf

Object Models & OAI-PMH

• Move from simple metadata files “pointing” to resources… • …to records as “modeled representations” of resources

[Figure: resource / item / records diagram; the item oai:foo.edu:1234 has a METS record]
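A small sketch of what that shift looks like from the harvester's side: the same item identifier is requested once as simple metadata and once as a complex-object record. The base URL `http://foo.edu/oai` and the `mets` prefix are assumptions (a repository chooses its own prefixes and advertises them via ListMetadataFormats), and both helpers are hypothetical.

```python
from typing import NamedTuple
from urllib.parse import urlencode


class OAIIdentifier(NamedTuple):
    scheme: str
    namespace: str
    local_id: str


def parse_oai_identifier(identifier):
    """Split 'oai:<namespace>:<local id>' into its three parts."""
    parts = identifier.split(":", 2)
    if len(parts) != 3 or parts[0] != "oai" or not parts[1] or not parts[2]:
        raise ValueError(f"not an oai identifier: {identifier!r}")
    return OAIIdentifier(*parts)


def get_record_url(base_url, identifier, metadata_prefix):
    """Build a GetRecord request for one item in one metadata format."""
    query = urlencode({"verb": "GetRecord",
                       "identifier": identifier,
                       "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


ident = "oai:foo.edu:1234"
print(parse_oai_identifier(ident))
# Same item, two representations: a flat description vs. a modeled object.
for prefix in ("oai_dc", "mets"):
    print(get_record_url("http://foo.edu/oai", ident, prefix))
```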

Download and Go!

Where Do You Want to Build?

[Figure: user, service provider, data providers (. . .) and local context sensitive services, labeled with software that can fill each role, e.g. CDSware and EPrints.org]

Fedora

• joint project between Cornell & UVa – funded by the Mellon Foundation • a repository management system – focuses on complex digital objects and their behaviors • more info: – http://www.fedora.info/ – D-Lib Magazine, 9(4) • http://www.dlib.org/dlib/april03/staples/04staples.html

DSpace

• MIT + HP Labs • constructed to capture all the output of MIT’s faculty • now generalized to the DSpace Federation – 8 top universities in the US & Canada • More info: – http://www.dspace.org/ – http://sourceforge.net/projects/dspace/ – D-Lib Magazine 9(1) • http://www.dlib.org/dlib/january03/smith/01smith.html

EPrints.org

• developed at Southampton University – part of larger suite of institutional/author self-archiving tools and services • e.g.: citebase; paracite • widely adopted -- 100+ sites • more info – http://www.eprints.org/ – http://software.eprints.org/#ep2 – http://www.arl.org/sparc/core/index.asp?page=g20#6

CDSware

• developed at CERN • data provider & service provider • large-scale use @ CERN (> 600k records) – in use at a few non-CERN sites • free & paid support models • more info – http://cdsware.cern.ch/

Kepler

• P2P publishing for academia – community servers for coordination, management – archivelets for individual laptops, PCs • more info: – http://kepler.cs.odu.edu/ – D-Lib Magazine 7(4) • http://www.dlib.org/dlib/april01/maly/04maly.html

OpenResolver

• developed by UKOLN – open source • OpenURL 0.1 format resolver – NISO 1.0 format??? • more info: – Ariadne, 28 • http://www.ariadne.ac.uk/issue28/resolver/ – ftp://ftp.ukoln.ac.uk/metadata/tools/openresolver/ – http://www.ukoln.ac.uk/distributed-systems/openurl/

Conclusions

Why The OAI-PMH is NOT Important

• Users don’t care • OAI-PMH is middleware – if done right, the uninterested user should never have to know • Using OAI-PMH does not ensure a good SP • OAI-PMH is (or is becoming) HTTP for DLs – few people get excited about http now • http & OAI-PMH are core technologies whose presence is now assumed

Other Uses For the OAI-PMH

• Assumptions: – Traditional DLs / SPs will continue on their present path of increasing sophistication • citation indexing, search results viz, personalization, recommendations, subject-based filtering, etc.

– growth rates remain the same (5x as many DPs as SPs) • Premise: OAI-PMH is applicable to any scenario that needs to update / synchronize distributed state – Future opportunities are possible by creatively interpreting the OAI-PMH data model • See Van de Sompel, Young & Hickey, D-Lib Magazine July 2003, http://www.dlib.org/dlib/july03/young/07young.html • Nelson, 2nd OAI Workshop, http://agenda.cern.ch/askArchive.php?base=agenda&categ=a02333&id=a02333s5t8/transparencies
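The "synchronize distributed state" reading can be sketched as an incremental harvest: ask only for what changed since the last visit, apply additions and deletions, and remember the newest datestamp. This is an illustration under assumptions: `changes_since` stands in for a ListRecords request with a `from` date, the change records are plain dicts, and `make_remote` is a hypothetical in-memory remote.

```python
def make_remote(changes):
    """Hypothetical remote: returns changes with datestamp >= since (inclusive)."""
    def changes_since(since):
        return [c for c in changes if c["datestamp"] >= since]
    return changes_since


def sync(local, changes_since, last_sync):
    """Apply remote changes to the local copy; return the new high-water mark."""
    newest = last_sync
    for change in changes_since(last_sync):
        if change.get("deleted"):
            local.pop(change["identifier"], None)   # remote reports a deletion
        else:
            local[change["identifier"]] = change["metadata"]
        newest = max(newest, change["datestamp"])   # ISO dates sort as strings
    return newest


remote = make_remote([
    {"identifier": "a", "datestamp": "2004-01-01", "metadata": "v1"},
    {"identifier": "b", "datestamp": "2004-02-01", "metadata": "v1"},
    {"identifier": "c", "datestamp": "2004-03-01", "deleted": True},
])
local = {"c": "stale copy"}
mark = sync(local, remote, "2004-01-01")
print(local, mark)
# Running again from the new mark re-reads only the newest change; the
# operations are idempotent, so the local state does not change.
mark = sync(local, remote, mark)
```

Nothing in the loop cares whether the payload is bibliographic metadata, which is the point of the premise.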

OpenURL Framework evolution • Draft (Van de Sompel, Beit-Arie, Hochstenbach, 05/2001): a spec based on HTTP GET to transport metadata about – a scholarly referent & – the context in which the referent is referenced • NISO Draft Standard (04/2003): a framework Standard that enables different Communities to: – describe a referent – describe the context in which the referent is referenced – transport these descriptions

The Future: Community Building

• Ultimately, protocols and metadata formats are not what makes a difference • Rather, the critical mass afforded by a common set of utilities (cf. http, Dublin Core, XML) • The best current example: The Open Language Archives Community – http://www.language-archives.org/ • OAI-PMH provides the basis for communication between strangers, but allows even richer communication between friends

Further Reading

• Gerry McKiernan, Library Hi-Tech News – http://www.public.iastate.edu/~gerrymck/OAI-SP-I.pdf – http://www.public.iastate.edu/~gerrymck/OAI-SP-II.pdf – http://www.public.iastate.edu/~gerrymck/OAI-SP-III.pdf • Open Archives Forum OAI-PMH Tutorial – http://www.oaforum.org/tutorial/ • “A Survey of Digital Library Aggregation Services” – http://www.diglib.org/pubs/brogan/ • Open Access News – http://www.earlham.edu/~peters/fos/fosblog.html • Guide To Institutional Repository Software – http://www.soros.org/openaccess/software/

Great Stuff I Did Not Cover…

• OAI-PMH – Static Repositories • http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm – OAI-Rights • http://www.openarchives.org/documents/OAIRightsWhitePaper.html • http://www.openarchives.org/news/oairightspress030929.html • Digital Preservation – http://www.digitalpreservation.gov/