Repositories
COMP3016
Public, managed, web collections of knowledge
Repositories & Green OA
• Open Archives Initiative - October 1999
– Agreed OAI-PMH for metadata sharing
– (2008 OAI-ORE for data exchange)
• Among the Participants
– Paul Ginsparg (arXiv)
– Carl Lagoze (NCSTRL)
– Stevan Harnad (Cogprints)
• EPrints
– proposed as a ‘build your own repository’ solution
– enables institutions and groups to participate in the OAI metadata sharing initiative
Example Repository
http://eprints.ecs.soton.ac.uk/
A repository for a school of
Electronics and Computer Science.
It achieves 80-100% full-text self-deposit.
Looking at the Differences between a Repository and a Website through a Whistlestop Tour of the ECS Repository
• Repository provides:
– Different views
– Different ways of exporting data
– Metadata capture
EPrints Walkthrough: Browse
• Browse Views aka “Collections”
– Subdivisions
– Ordering
EPrints Walkthrough: Views
• View content lists as “tag clouds”
or “communities of practice”
EPrints Walkthrough: Searches
Advanced search allows useful reports to be
generated:
• journal articles funded by NIH published in
2007
• conference posters with a PowerPoint file in
the Maths department
• refereed conference papers or journal
articles with full text
• old journal articles that haven’t been cited
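The reports above are essentially conjunctions of field tests over the captured metadata. A minimal sketch of that idea, assuming a hypothetical in-memory record list and field names (`type`, `refereed`, `full_text`) that are illustrative only and not EPrints' actual schema or search API:

```python
# Hypothetical metadata records; field names are assumptions for illustration.
records = [
    {"type": "article", "refereed": True, "full_text": True, "year": 2007},
    {"type": "conference_item", "refereed": True, "full_text": False, "year": 2006},
    {"type": "conference_item", "refereed": True, "full_text": True, "year": 2008},
    {"type": "thesis", "refereed": False, "full_text": True, "year": 2007},
]


def search(items, **criteria):
    """Return items matching every criterion.

    A criterion value may be a single value or a set of acceptable alternatives.
    """
    def matches(item, key, wanted):
        if isinstance(wanted, (set, frozenset)):
            return item.get(key) in wanted
        return item.get(key) == wanted

    return [i for i in items if all(matches(i, k, v) for k, v in criteria.items())]


# "Refereed conference papers or journal articles with full text"
hits = search(records, type={"article", "conference_item"}, refereed=True, full_text=True)
print(len(hits))
```

Each keyword argument narrows the result set, which mirrors how an advanced search form combines its fields.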
EPrints Walkthrough: Search Results
EPrints Walkthrough: Exporting Search Results
• The output from any search can be exported…
– as RSS feeds
– as METS, Dublin Core or other DL interoperability formats
– as BibTeX, refer, EndNote & other bibliography formats
– to Google Earth, SIMILE Timeline or other web services and mashups
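An export format is just a different serialisation of the same metadata record. A minimal sketch of a BibTeX export, assuming a hypothetical record layout and placeholder data (this is not EPrints' own export plugin):

```python
def to_bibtex(record):
    """Render a journal-article record as a BibTeX @article entry."""
    fields = {
        "author": " and ".join(record["creators"]),
        "title": record["title"],
        "journal": record["journal"],
        "year": str(record["year"]),
    }
    body = ",\n".join(f"  {key} = {{{value}}}" for key, value in fields.items())
    return f"@article{{{record['id']},\n{body}\n}}"


# Placeholder record, invented purely for illustration.
record = {
    "id": "example2007",
    "creators": ["Doe, J.", "Roe, R."],
    "title": "An Example Article",
    "journal": "Journal of Examples",
    "year": 2007,
}
entry = to_bibtex(record)
print(entry)
```

Other formats (RSS, Dublin Core, EndNote) would be further functions over the same record, which is why a repository can offer many exports from one deposit.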
EPrints Walkthrough: Infrastructure Exports
Publication lists and data can be imported and branded by other research group portals.
EPrints Walkthrough: Depositing a New Item
EPrints Walkthrough: Import Items from Various Sources
Reference Model for a Web Site
[Diagram: REQUEST, UPLOAD, DOWNLOAD]
• A web site is very simple in its functionality; a
repository (as we have seen) is more complex
Reference Model for an Open Archival Information System (OAIS)
• SIP/DIP/AIP = Submission/Dissemination/Archival Information Package
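The three package types can be read as stages in a pipeline: a producer submits a SIP, the archive stores an AIP, and a consumer receives a DIP. A rough sketch under simplifying assumptions: the class fields and the `ingest` / `disseminate` function names are hypothetical, and a SHA-256 checksum stands in for the much richer preservation information a real AIP carries:

```python
import hashlib
from dataclasses import dataclass


@dataclass
class SIP:  # Submission Information Package: what the producer hands over
    title: str
    content: bytes


@dataclass
class AIP:  # Archival Information Package: what the archive stores
    title: str
    content: bytes
    sha256: str  # fixity value recorded at ingest (illustrative assumption)


@dataclass
class DIP:  # Dissemination Information Package: what a consumer receives
    title: str
    content: bytes


def ingest(sip):
    """Turn a submission into an archival package, recording a checksum."""
    return AIP(sip.title, sip.content, hashlib.sha256(sip.content).hexdigest())


def disseminate(aip):
    """Check fixity, then build the package handed to the consumer."""
    if hashlib.sha256(aip.content).hexdigest() != aip.sha256:
        raise ValueError("content no longer matches its recorded checksum")
    return DIP(aip.title, aip.content)


aip = ingest(SIP("Field notes", b"raw observations"))
dip = disseminate(aip)
print(dip.title)
```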
What is a Repository?
• A repository is a platform that allows you to capture items in any format:
– text
– video
– audio
– data
• It distributes them over the web, mainly via Google.
• It indexes your work, so users can search and retrieve your items.
• It preserves your digital work over the long term.
What are the benefits of using a repository?
• Some example benefits:
– Getting your research results out quickly, to a worldwide audience
– Reaching a worldwide audience through exposure to search engines such as Google
– Storing reusable teaching materials that you can use with course management systems
– Archiving and distributing material you would currently put on your personal website
– Storing examples of students’ projects (with the students’ permission)
– Showcasing students’ theses (again with permission)
– Keeping track of your own publications/bibliography
– Having a persistent network identifier for your work, that never changes or breaks
– No more page charges for images. You can point to your images’ persistent identifiers in your published articles.
What does a Repository look like?
http://www.dspace.org/images/stories/dspace-diagram.pdf
Application Architecture
• Repository systems are organised into three tiers, each consisting of a number of components
• Each layer only invokes the layer below it, i.e. the application layer may not use the storage layer directly
The Storage Layer
• The storage layer is responsible for physical storage of metadata and content
• Repositories use relational databases to store all information about the organization of content, metadata about the content, information about e-people and authorization, and the state of currently-running workflows.
The Business Logic Layer
• The business logic layer deals with managing the content of
the archive, users of the archive (e-people), authorization,
and workflow
The Application Layer
• The application layer contains components that communicate with the world outside of the individual repository, for example the Web user interface and the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) service
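The layering rule can be made concrete by giving each layer a reference only to the layer directly beneath it. A minimal sketch, assuming hypothetical class and method names that are not taken from DSpace or any other real repository codebase:

```python
class StorageLayer:
    """Holds metadata and content; knows nothing about users or the Web."""

    def __init__(self):
        self._items = {}

    def put(self, item_id, record):
        self._items[item_id] = record

    def get(self, item_id):
        return self._items.get(item_id)


class BusinessLogicLayer:
    """Applies authorization rules, then delegates to the storage layer."""

    def __init__(self, storage, depositors):
        self._storage = storage
        self._depositors = set(depositors)

    def deposit(self, user, item_id, record):
        if user not in self._depositors:
            raise PermissionError(f"{user} may not deposit")
        self._storage.put(item_id, record)

    def fetch(self, item_id):
        return self._storage.get(item_id)


class ApplicationLayer:
    """Faces the outside world; only ever talks to the business logic layer."""

    def __init__(self, logic):
        self._logic = logic

    def handle_deposit(self, user, item_id, title):
        self._logic.deposit(user, item_id, {"title": title})
        return f"deposited {item_id}"

    def handle_view(self, item_id):
        record = self._logic.fetch(item_id)
        return record["title"] if record else "not found"


app = ApplicationLayer(BusinessLogicLayer(StorageLayer(), depositors=["alice"]))
print(app.handle_deposit("alice", "item-1", "A paper"))
print(app.handle_view("item-1"))
```

Because `ApplicationLayer` never receives the storage object, it cannot bypass the authorization check in the business logic layer.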
The Problem of Long-Term Data
• Researchers have hard disks which are just organised enough to support daily activity, but researchers’ careers last for forty years
– Disk crashes
– Stolen laptops
– Software upgrades that go wrong
– Backups that never quite get restored
– Drawers and folders full of old stuff that eventually fall off the radar
• “Lost in some research assistant’s computer, the data are
often irretrievable or an undecipherable string of digits”
Lost in a Sea of Science Data. S.Carlson,
The Chronicle of Higher Education (23/06/2006)
Where Are My Files Now?
Preservation, Persistence and
Sustainability
• Persistent URLs are needed that last across many generations of organisation (e.g. CS Group, CS Dept, Dept of ECS, School of ECS)
– PURLs, DOIs or Handles
– Or just persistent policies for URL naming!
• Persistent storage / across many generations of
hardware (e.g. desktop vs cloud)
• Persistent readability / across many generations
of software
– Format migration
– WordPerfect – Word 5.1 – Office 2007
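Schemes such as PURLs, DOIs and Handles share one idea: the published identifier never changes, and a resolver maps it to wherever the item currently lives. A toy sketch of that indirection, assuming a hypothetical `Resolver` class, identifier and URLs (real resolvers are network services, not in-memory dictionaries):

```python
class Resolver:
    """Maps a never-changing identifier to the item's current location."""

    def __init__(self):
        self._targets = {}

    def register(self, persistent_id, url):
        self._targets[persistent_id] = url

    def move(self, persistent_id, new_url):
        # Only the target changes; the identifier that others cite stays the same.
        if persistent_id not in self._targets:
            raise KeyError(persistent_id)
        self._targets[persistent_id] = new_url

    def resolve(self, persistent_id):
        return self._targets[persistent_id]


resolver = Resolver()
resolver.register("id:1234", "https://cs-dept.example.org/papers/1234")
# The department is reorganised and its web site moves.
resolver.move("id:1234", "https://school-of-ecs.example.org/eprints/1234")
print(resolver.resolve("id:1234"))
```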
Open Archives Initiative Protocol for
Metadata Harvesting (OAI-PMH)
• A way of asking an archive about the stuff it’s got in it.
• Allows services to harvest metadata from many archives
– Google harvests data, OAI-PMH harvests metadata
• Allows services to provide search and other functionality
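In practice, "asking an archive" means sending an HTTP request with a `verb` parameter to the archive's OAI-PMH base URL. A small sketch that builds `ListRecords` requests; the `verb`, `metadataPrefix`, `from` and `set` parameters come from the OAI-PMH protocol, while the base URL and the helper name `list_records_url` are hypothetical:

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date  # only records added or changed since this date
    if set_spec is not None:
        params["set"] = set_spec    # only records in this set
    return f"{base_url}?{urlencode(params)}"


BASE = "https://repository.example.org/oai"  # hypothetical endpoint

# "Give me everything!"
print(list_records_url(BASE))
# "Give me everything in set 'physics' added or changed since 1 January 2008"
print(list_records_url(BASE, from_date="2008-01-01", set_spec="physics"))
```

The archive answers with an XML document of metadata records, which the harvester parses and stores; that step is omitted here.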
[Diagram: archives and the harvesters that collect metadata from them]
• Archives
– CogPrints (GNU EPrints): 1600 records
– www.orgprints.org (GNU EPrints): 264 records
– arXiv (custom software): 230,000 records
– D-Space @ MIT (D-Space software): 769 records
• Harvester #1 (Psychology Service)
– 500 CogPrints
– 169 D-Space
• Harvester #2 (Physics Aggregator)
– 150,000 arXiv
– 162 D-Space
• Harvester #3 (General Service)
– 230,000 arXiv
– 769 D-Space
– 264 OrgPrints
– 1600 CogPrints
– 150,162 “Improved” records from physics aggregator
Day 1
• Harvester to Archive Service A (1403 records): “Give me everything!”
• Archive Service A: “OK!” (1403 records)
• Harvester now holds 1403 records
Day 2
• Harvester to Archive Service A (now 1501 records): “Give me all records which were added or changed since yesterday”
• Archive Service A: “OK!” (102 new records, 4 deleted records, 23 changed records)
• Harvester to Archive Service B (123 records): “Give me everything in set ‘physics’”
• Archive Service B: “OK!” (15 records)
• Harvester now holds 1501 records from A (up from 1403) and 15 records from B
Day 3
• Harvester to Archive Service A (now 1490 records): “Give me all records which were added or changed since yesterday”
• Archive Service A: “OK!” (25 new records, 36 deleted records, 3 changed records)
• Harvester to Archive Service B (123 records): “Give me everything in set ‘physics’ which was added or changed since yesterday”
• Archive Service B: “OK!” (0 new records, 1 record changed)
• Harvester now holds 1490 records from A (down from 1501) and 15 records from B
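The record counts in the three-day walkthrough follow from applying each day's additions, changes and deletions to the harvester's local copy. A sketch that replays Archive Service A's numbers, assuming a hypothetical `apply_delta` helper and synthetic record identifiers (a real harvester would parse these from OAI-PMH responses):

```python
def apply_delta(local, added, changed, deleted):
    """Apply one incremental harvest response to the harvester's local copy."""
    for record_id, metadata in {**added, **changed}.items():
        local[record_id] = metadata
    for record_id in deleted:
        local.pop(record_id, None)
    return local


# Day 1: full harvest of 1403 records (synthetic identifiers for illustration).
local = {f"rec{i}": {"version": 1} for i in range(1403)}
day1_count = len(local)

# Day 2: 102 new, 23 changed, 4 deleted.
apply_delta(
    local,
    added={f"rec{i}": {"version": 1} for i in range(1403, 1505)},
    changed={f"rec{i}": {"version": 2} for i in range(23)},
    deleted=[f"rec{i}" for i in range(100, 104)],
)
day2_count = len(local)

# Day 3: 25 new, 3 changed, 36 deleted.
apply_delta(
    local,
    added={f"rec{i}": {"version": 1} for i in range(1505, 1530)},
    changed={f"rec{i}": {"version": 3} for i in range(3)},
    deleted=[f"rec{i}" for i in range(200, 236)],
)
day3_count = len(local)

print(day1_count, day2_count, day3_count)
```

Changed records overwrite their earlier copies, so only additions and deletions move the total: 1403 + 102 - 4 = 1501, then 1501 + 25 - 36 = 1490.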
Now, OAI-ORE (Object Reuse and Exchange)
• Repositories are being filled with complex sets of data and metadata.
• ORE is a set of protocols that allows repositories, agents, and services to use and reuse compound digital objects beyond the boundaries of the holding repositories:
– to facilitate discovery of objects,
– to reference (link to) objects (and their parts),
– to obtain a variety of disseminations of objects,
– to aggregate and disaggregate objects,
– to harvest and deposit (register, put) objects,
– to enable processing by automated agents
ORE: Compound Information Objects
• Identified, bounded aggregations of
distinct information units that when
combined form a logical whole
– Scholarly publication with an article
and supporting information including
dataset, video, etc.
– Digitized book with multiple chapters,
each chapter containing multiple
scanned pages.
– Archaeological assemblies of images,
maps, charts, and find lists
– Flickr ‘sets’, comments/annotations etc.
ORE: Publishing compound objects to the
Web (1)
• Web graph without any explicit compound objects
• each information object identified with a URI
• and there are links between them
ORE: Publishing compound objects to the
Web (2)
• Compound object and its parts are published to the Web with URIs
• Links indicate relationships but cannot show boundaries and true
structure in a machine context
ORE: Publishing compound objects to the
Web (3)
• This time an added layer publishes the compound object and its parts, with their relationships and boundary, as a ‘named graph’
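The ‘named graph’ idea can be illustrated with plain triples: a resource map (the named graph) describes an aggregation, and the aggregation lists the parts inside its boundary. In this sketch the `describes` and `aggregates` terms come from the ORE vocabulary, while every example.org URI and the `parts_of` helper are hypothetical:

```python
ORE = "http://www.openarchives.org/ore/terms/"

REM = "https://repository.example.org/rem/42"          # resource map (the named graph)
AGG = "https://repository.example.org/aggregation/42"  # the compound object
PARTS = [
    "https://repository.example.org/42/article.pdf",
    "https://repository.example.org/42/dataset.csv",
    "https://repository.example.org/42/video.mp4",
]

# The named graph: every triple below is asserted by the resource map REM.
named_graph = {
    REM: [(REM, ORE + "describes", AGG)]
    + [(AGG, ORE + "aggregates", part) for part in PARTS]
}


def parts_of(graph_name, aggregation):
    """Return the resources inside the aggregation's boundary."""
    return [
        obj
        for subj, pred, obj in named_graph[graph_name]
        if subj == aggregation and pred == ORE + "aggregates"
    ]


print(parts_of(REM, AGG))
```

An ordinary hyperlink from the article to some other web page would not appear in this graph, which is how a machine can tell where the compound object ends.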
Summary
• Repository adds management services to
basic architectural model
– ingest, dissemination
– management
– preservation