A New Model for Web Resource Harvesting
Michael Nelson
Computer Science Department
Old Dominion University
Herbert Van de Sompel
Digital Library Research & Prototyping Team
Research Library, Los Alamos National Laboratory
This work was supported in part by the Andrew W. Mellon Foundation and the Library of Congress
OAI-PMH for Resource Harvesting Tutorial
OAI4, October 20th 2005, CERN, Geneva, Switzerland
Outline
(0) The Problem
(1) mod_oai
(2) Future Research
WWW and DL: Separated at Birth
1994: WWW and DL
Today:
WWW
The Good: XML, BitTorrent, Web Services
The Bad: RSS
The Ugly: Semantic Web
DL
The Good: OAIS, DOI, OAI-PMH
The Bad: Dublin Core
The Ugly: SRU/W
The problem is not that the WWW doesn’t work; it clearly does.
The problem is that our expectations have been lowered.
Web Robots
Questions a robot has for www.getty.edu:
• what documents have been modified since 2003-11-15?
• what is this file?
• what are its relationships to other files?
• how often does it change?
www.getty.edu:
doc1; last mod 2003-03-12
doc2; last mod 2002-07-19
…
doc100; last mod 2003-09-11
robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG
A More Efficient Way
<co>
<metadata/>
<link/>
<link/>
<change/>
…
</co>
what documents have been modified since 2003-11-15?
www.getty.edu with mod_oai:
doc1; last mod 2003-03-12
doc2; last mod 2002-07-19
…
doc100; last mod 2003-09-11
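The server already knows each document's modification date, so it can answer the robot's question in one pass. A minimal Python sketch using the three documents listed above; the function name `modified_since` and the in-memory list are illustrative assumptions, not part of mod_oai:

```python
from datetime import date

# Last-modification dates copied from the slide; the elided documents are omitted.
DOCS = [
    ("doc1", date(2003, 3, 12)),
    ("doc2", date(2002, 7, 19)),
    ("doc100", date(2003, 9, 11)),
]


def modified_since(docs, since):
    """Return the names of documents last modified on or after `since`."""
    return [name for name, last_mod in docs if last_mod >= since]


# None of the listed documents changed after 2003-11-15, so nothing needs fetching.
print(modified_since(DOCS, date(2003, 11, 15)))  # prints []
print(modified_since(DOCS, date(2003, 1, 1)))    # prints ['doc1', 'doc100']
```

A robot without this facility would have to request every document to reach the same conclusion.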
Outline
(0) The Problem
(1) mod_oai
(2) Future Research
mod_oai approach
• Goal: integrate OAI-PMH functionality into the web server itself…
• mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server
  o written in C
  o respects values in .htaccess, httpd.conf
• compile mod_oai on http://www.foo.edu/
  o baseURL is now http://www.foo.edu/modoai
• Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)
  - http://www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=mime:video:mpeg
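The request above is an ordinary HTTP GET with OAI-PMH arguments in the query string. A minimal Python sketch of how a harvester might assemble it; the helper name `list_identifiers_url` is hypothetical, and www.foo.edu is the placeholder host from the slide:

```python
from urllib.parse import urlencode

BASE_URL = "http://www.foo.edu/modoai"  # placeholder baseURL from the slide


def list_identifiers_url(base_url, metadata_prefix, from_date=None, set_spec=None):
    """Build a ListIdentifiers request URL (hypothetical helper)."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    # urlencode percent-encodes the colons in the set name (":" becomes "%3A").
    return base_url + "?" + urlencode(params)


url = list_identifiers_url(BASE_URL, "oai_dc", "2004-09-15", "mime:video:mpeg")
print(url)
```

The printed URL carries the same four arguments as the slide's example, with the set name percent-encoded.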
OAI-PMH data model in mod_oai
resource: http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf
OAI-PMH sets: MIME type
item: OAI-PMH identifier = entry point to all records pertaining to the resource
records: metadata pertaining to the resource
• Dublin Core metadata
• HTTP header metadata
• MPEG-21 DIDL
OAI-PMH concepts : typical repository
OAI-PMH Entity |                | value          | description
Resource       |                | URL            | PDF, PS, XML, HTML or other file
Item           | identifier     | OAI Identifier | DNS-based name of metadata about resource
               | set membership | LCSH           | Library of Congress Subject Heading
Record         | metadataPrefix | oai_dc         | bibliographic metadata in Dublin Core
               | datestamp      | 2004-10-18     | modification date of DC record
Record         | metadataPrefix | oai_marc       | bibliographic metadata in MARC
               | datestamp      | 2004-07-31     | modification date of MARC record
OAI-PMH concepts : mod_oai
OAI-PMH Entity |                | value       | description
Resource       |                | URL         | HTML, GIF, PDF or other web file
Item           | identifier     | URL         | same URL as the resource
               | set membership | MIME type   | MIME type of the resource
Record         | metadataPrefix | http_header | the http headers that would have been returned via HTTP GET/HEAD
               | datestamp      | 2004-07-31  | modification date of resource
Record         | metadataPrefix | oai_dc      | a subset of http_header in DC
               | datestamp      | 2004-07-31  | modification date of resource
Record         | metadataPrefix | oai_didl    | MPEG-21 DIDL: base64 encoded resource + http_header metadata
               | datestamp      | 2004-07-31  | modification date of resource
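The mapping in this table can be sketched in a few lines of Python: the URL doubles as the identifier, the MIME type gives the set, and the file's modification date becomes the datestamp of every record. The function names and the dictionary layout are assumptions made for illustration; mod_oai itself is written in C and its internals are not shown here.

```python
import mimetypes
from datetime import datetime, timezone

METADATA_PREFIXES = ("http_header", "oai_dc", "oai_didl")


def set_spec_for(url):
    """Derive a set name such as 'mime:video:mpeg' from the guessed MIME type."""
    mime, _encoding = mimetypes.guess_type(url)
    if mime is None:
        mime = "application/octet-stream"  # assumed fallback, not from the slides
    return "mime:" + mime.replace("/", ":")


def describe(url, mtime_epoch):
    """Return the item for a web resource: identifier, set, and record datestamps."""
    stamp = datetime.fromtimestamp(mtime_epoch, tz=timezone.utc).strftime("%Y-%m-%d")
    return {
        "identifier": url,  # same URL as the resource
        "set": set_spec_for(url),
        "records": {prefix: stamp for prefix in METADATA_PREFIXES},
    }


mtime = datetime(2004, 7, 31, 12, 0, tzinfo=timezone.utc).timestamp()
item = describe(
    "http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf",
    mtime,
)
print(item["set"], item["records"]["oai_dc"])
```

Because all three records share the resource's modification date, a change to the file updates every metadata format at once.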
Resource Discovery: ListIdentifiers
harvester
• issues a ListIdentifiers
• finds URLs of updated resources
• does HTTP GETs on updated resources only
• can get URLs of resources with specified MIME types
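The harvester's side of this exchange can be sketched in Python as follows. The sample response is hand-written for illustration (its URLs and dates are made up); only the OAI-PMH 2.0 namespace and element names are taken from the protocol. The function name `updated_urls` is hypothetical.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample response; identifiers and datestamps are illustrative.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/a.html</identifier>
      <datestamp>2004-09-20</datestamp>
      <setSpec>mime:text:html</setSpec>
    </header>
    <header>
      <identifier>http://www.foo.edu/movie.mpeg</identifier>
      <datestamp>2004-10-02</datestamp>
      <setSpec>mime:video:mpeg</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def updated_urls(response_xml, set_spec=None):
    """Return identifiers (URLs) from a ListIdentifiers response, optionally by set."""
    root = ET.fromstring(response_xml)
    urls = []
    for header in root.iterfind("oai:ListIdentifiers/oai:header", OAI_NS):
        specs = [s.text for s in header.iterfind("oai:setSpec", OAI_NS)]
        if set_spec is None or set_spec in specs:
            urls.append(header.findtext("oai:identifier", namespaces=OAI_NS))
    return urls


print(updated_urls(SAMPLE_RESPONSE))
print(updated_urls(SAMPLE_RESPONSE, set_spec="mime:video:mpeg"))
```

Each returned identifier is itself a URL, so the harvester can follow up with plain HTTP GETs on just those resources.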
Preservation: ListRecords
harvester
• issues a ListRecords
• gets updates as MPEG-21 DIDL documents (HTTP headers, resource By Value or By Reference)
• can get resources with specified MIME types
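The difference between By Value and By Reference can be sketched in Python. The sample document below is a simplified, hand-written DIDL fragment; the exact layout that mod_oai emits is not shown in these slides, so the structure, the sample payload, and the function name `resources` are assumptions.

```python
import base64
import xml.etree.ElementTree as ET

DIDL = "urn:mpeg:mpeg21:2002:02-DIDL-NS"  # MPEG-21 DIDL namespace (assumed edition)

# Simplified illustrative document: one resource by value, one by reference.
SAMPLE_DIDL = """<didl:DIDL xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS">
  <didl:Item>
    <didl:Component>
      <didl:Resource mimeType="text/plain" encoding="base64">aGVsbG8gd29ybGQ=</didl:Resource>
    </didl:Component>
    <didl:Component>
      <didl:Resource mimeType="video/mpeg" ref="http://www.foo.edu/movie.mpeg"/>
    </didl:Component>
  </didl:Item>
</didl:DIDL>"""


def resources(didl_xml):
    """Yield (mime type, mode, payload) for every Resource element."""
    root = ET.fromstring(didl_xml)
    found = []
    for res in root.iter(f"{{{DIDL}}}Resource"):
        ref = res.get("ref")
        if ref is not None:
            # By Reference: the harvester must fetch the URL separately.
            found.append((res.get("mimeType"), "by-reference", ref))
        else:
            # By Value: the bytes travel inside the record, base64 encoded.
            found.append((res.get("mimeType"), "by-value", base64.b64decode(res.text.strip())))
    return found


for entry in resources(SAMPLE_DIDL):
    print(entry)
```

By Value makes the record self-contained for preservation, at the cost of larger responses; By Reference keeps responses small but requires a second request.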
performance of mod_oai and wget on www.cs.odu.edu
        |                          | # of files in baseline | # of files in update (25%)
wget    | index.html as seed       | 709                    | 114
wget    | "find . -type f" as seed | 5739                   | 1318
mod_oai | files                    | 5268                   | 1335
Readings
• Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Terry L. Harrison, Nathan McFarland. mod_oai: An Apache Module for Metadata Harvesting. http://arxiv.org/abs/cs.DL/0503069
Outline
(0) The Problem
(1) mod_oai
(2) Future Research
Issues and Future Work
• For a given server, there is a set of URLs, U, and a set of files, F
  o Apache maps U → F
  o mod_oai maps F → U
• Neither function is 1-1 nor onto
  o We can easily check if a single u maps to F, but given F we cannot (easily) generate U
• Short-term issues:
  o dynamic files
    - exporting unprocessed server-side files would be a security hole
  o IndexIgnore
    - httpd will “hide” valid URLs
  o File permissions
    - httpd will advertise files it cannot read
• Long-term issues
  o Alias, Location
    - files can be covered up by the httpd
  o UserDir
    - interactions between the httpd and the filesystem
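The claim that the U to F mapping is neither 1-1 nor onto can be made concrete with a toy example in Python. The URLs, file names, and helper functions below are all hypothetical and chosen only to exhibit both failures.

```python
# Hypothetical URL -> file mapping, standing in for what Apache resolves.
URL_TO_FILE = {
    "/": "htdocs/index.html",
    "/index.html": "htdocs/index.html",  # two URLs reach one file: not 1-1
    "/paper.pdf": "htdocs/paper.pdf",
}

# Hypothetical files on disk; the last one is reachable by no URL: not onto.
FILES = {"htdocs/index.html", "htdocs/paper.pdf", "private/notes.txt"}


def is_one_to_one(mapping):
    """True if no two URLs map to the same file."""
    return len(set(mapping.values())) == len(mapping)


def is_onto(mapping, files):
    """True if every file is reached by at least one URL."""
    return set(mapping.values()) == set(files)


def urls_for(mapping, target_file):
    """Invert the mapping for one file; possible here only because U is fully known."""
    return sorted(u for u, f in mapping.items() if f == target_file)


print(is_one_to_one(URL_TO_FILE), is_onto(URL_TO_FILE, FILES))  # prints False False
print(urls_for(URL_TO_FILE, "htdocs/index.html"))  # prints ['/', '/index.html']
print(urls_for(URL_TO_FILE, "private/notes.txt"))  # prints []
```

The inversion works in this sketch because the whole table is in hand. A module that starts from the files alone has no such table, which is why generating U from F is hard.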
IndexIgnore & File Permissions
Alias: Covering Up Files
httpd.conf:
Alias /A /usr/local/web/htdocs/B
Alias /B /usr/local/web/htdocs/A
the files “A” and “B” will be different from the URLs
http://server/A
http://server/B
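The effect of these two directives can be traced with a simplified Python model of alias resolution. This is a sketch only: real Apache matching has more rules than a prefix test, and the document root shown is an assumption inferred from the alias targets.

```python
import posixpath

DOCUMENT_ROOT = "/usr/local/web/htdocs"  # assumed; not stated on the slide

# The two Alias directives from httpd.conf above, as (URL prefix, file path) pairs.
ALIASES = [
    ("/A", "/usr/local/web/htdocs/B"),
    ("/B", "/usr/local/web/htdocs/A"),
]


def url_path_to_file(url_path):
    """Resolve a URL path to a filesystem path, checking aliases first."""
    for prefix, target in ALIASES:
        if url_path == prefix or url_path.startswith(prefix + "/"):
            return target + url_path[len(prefix):]
    return posixpath.join(DOCUMENT_ROOT, url_path.lstrip("/"))


print(url_path_to_file("/A"))         # prints /usr/local/web/htdocs/B
print(url_path_to_file("/B/x.html"))  # prints /usr/local/web/htdocs/A/x.html
print(url_path_to_file("/C"))         # prints /usr/local/web/htdocs/C
```

A module that derives URLs from file paths alone would advertise htdocs/A as http://server/A, which the aliases route to a different file.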
UserDir: “Just in Time” mounting of directories
whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/ mln/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls -d /home/tharriso
/home/tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf % ls /home
liu_x/ mln/ tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf %
Looking Further Down the Road for mod_oai
• “Reverse” the method of URL discovery
  o cannot look to the files;
  o listen to incoming requests and build a list of valid URLs
    - could be seeded with files at start
    - also the method for handling server processed files / URLs
• Plug-ins for descriptive metadata
  o DC tags in HTML
  o MS Office formats, PDF
  o Tags from JPEG, TIFF, MP3, etc.
• Additional metadata in the DIDL
  o technical metadata from JHOVE
  o estimated change rate
    - cf. Cho & Garcia-Molina, ACM TOIT 28(4)
• http log access as separate metadata formats
    - cf. Van de Sompel, Young & Hickey, D-Lib 9(7/8)
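One way to approximate "listening to incoming requests" is to read the server's access log and keep every URL that was answered successfully. The Python sketch below assumes Common Log Format lines and treats status 200 as "valid"; the sample log entries, the seed, and the function name `valid_urls` are made up for illustration.

```python
import re

# Matches the request and status fields of a Common Log Format line.
LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) ')

# Illustrative log entries; addresses and paths are invented.
SAMPLE_LOG = [
    '192.0.2.1 - - [20/Oct/2005:10:00:00 +0200] "GET /index.html HTTP/1.1" 200 1043',
    '192.0.2.1 - - [20/Oct/2005:10:00:01 +0200] "GET /missing.html HTTP/1.1" 404 209',
    '192.0.2.7 - - [20/Oct/2005:10:00:05 +0200] "GET /cgi-bin/search?q=oai HTTP/1.1" 200 5120',
]


def valid_urls(log_lines, seed=()):
    """Collect URL paths that were served with status 200, starting from a seed set."""
    urls = set(seed)
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and match.group("status") == "200":
            urls.add(match.group("path"))
    return urls


# Seed with a URL derived from a known file, then add what the log reveals.
print(sorted(valid_urls(SAMPLE_LOG, seed={"/paper.pdf"})))
```

The dynamic URL /cgi-bin/search?q=oai appears in the result even though no static file corresponds to it, which is the case file-based discovery cannot cover.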
Expanding OAI-PMH / Complex Object Access
• OAI-PMH / CO access for:
  o blogs
  o message boards
  o native file systems
    - e.g. Mac OS X “Spotlight”
• More aggressive use of OAI-PMH / CO for preservation
  o recently funded NSF DIGARCH program
  o use for preservation:
    - Usenet
    - Email
    - Multicasting
OAI-PMH + Complex Objects:
A New Model for Web Resource Harvesting
• Better web harvesting can be achieved through:
  o OAI-PMH: structured access to updates
  o Complex object formats: modeled representation of digital objects
• Use cases:
  o Preservation (ListRecords)
  o Web crawling (ListIdentifiers)
• mod_oai: reference implementation
  o Better performance than wget
  o static files only; dynamic files in the future
  o not a replacement for DSpace, Fedora, eprints.org, etc.
• More info:
  o http://www.modoai.org/
  o http://whiskey.cs.odu.edu/