A New Model for Web Resource Harvesting

Michael Nelson
Computer Science Department, Old Dominion University

Herbert Van de Sompel
Digital Library Research & Prototyping Team, Research Library, Los Alamos National Laboratory

This work supported in part by the Andrew Mellon Foundation & Library of Congress

OAI-PMH for Resource Harvesting Tutorial OAI4, October 20th 2005, CERN, Geneva, Switzerland

Outline
(0) The Problem
(1) mod_oai
(2) Future Research

WWW and DL: Separated at Birth

[Diagram: WWW and DL, from 1994 to Today]

WWW
• The Good: XML, BitTorrent, Web Services
• The Bad: RSS
• The Ugly: Semantic Web

DL
• The Good: OAIS, DOI, OAI-PMH
• The Bad: Dublin Core
• The Ugly: SRU/W

The problem is not that the WWW doesn’t work; it clearly does. The problem is that our expectations have been lowered.

Web Robots

The robot asks:
• what documents have been modified since 2003-11-15?
• what is this file?
• what are its relationships to other files?
• how often does it change?

www.getty.edu:
doc1; last mod 2003-03-12
doc2; last mod 2002-07-19
…
doc100; last mod 2003-09-11

robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG

A More Efficient Way

<co> <metadata/> <link/> <link/> <change/> … </co>

what documents have been modified since 2003-11-15?
www.getty.edu with mod_oai:
doc1; last mod 2003-03-12
doc2; last mod 2002-07-19
…
doc100; last mod 2003-09-11

Outline
(0) The Problem
(1) mod_oai
(2) Future Research

mod_oai approach
• Goal: integrate OAI-PMH functionality into the web server itself…
• mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server
   o written in C
   o respects values in .htaccess, httpd.conf
• compile mod_oai on http://www.foo.edu/
• baseURL is now http://www.foo.edu/modoai
   o Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)
      - http://www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=mime:video:mpeg

OAI-PMH data model in mod_oai
• resource: http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf
• item: OAI-PMH identifier = entry point to all records pertaining to the resource
• OAI-PMH sets: MIME type
• records: metadata pertaining to the resource
   o Dublin Core metadata
   o HTTP header metadata
   o MPEG-21 DIDL

OAI-PMH concepts: typical repository

OAI-PMH Entity            value            description
Resource                  URL              PDF, PS, XML, HTML or other file
Item    identifier        OAI Identifier   DNS-based name of metadata about resource
        set membership    LCSH             Library of Congress Subject Heading
Record  metadataPrefix    oai_dc           bibliographic metadata in Dublin Core
        datestamp         2004-10-18       modification date of DC record
Record  metadataPrefix    oai_marc         bibliographic metadata in MARC
        datestamp         2004-07-31       modification date of MARC record

OAI-PMH concepts: mod_oai

OAI-PMH Entity            value            description
Resource                  URL              HTML, GIF, PDF or other web file
Item    identifier        URL              same URL as the resource
        set membership    MIME type        MIME type of the resource
Record  metadataPrefix    http_header      the http headers that would have been returned via HTTP GET/HEAD
        datestamp         2004-07-31       modification date of resource
Record  metadataPrefix    oai_dc           a subset of http_header in DC
        datestamp         2004-07-31       modification date of resource
Record  metadataPrefix    oai_didl         MPEG-21 DIDL: base64 encoded resource + http_header metadata
        datestamp         2004-07-31       modification date of resource

Resource Discovery: ListIdentifiers

harvester
• issues a ListIdentifiers,
• finds URLs of updated resources
• does HTTP GETs on updates only
• can get URLs of resources with specified MIME types

Preservation: ListRecords

harvester
• issues a ListRecords,
• gets updates as MPEG-21 DIDL documents (HTTP headers, resource By Value or By Reference)
• can get resources with specified MIME types

performance of mod_oai and wget on www.cs.odu.edu

                                   # of files in baseline   # of files in update (25%)
wget, index.html as seed           709                      114
wget, "find . -type f" as seed     5739                     1318
mod_oai                            5268                     1335

Readings
• Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Terry L. Harrison, Nathan McFarland. mod_oai: An Apache Module for Metadata Harvesting.
http://arxiv.org/abs/cs.DL/0503069

Outline
(0) The Problem
(1) mod_oai
(2) Future Research

Issues and Future Work
• For a given server, there is a set of URLs, U, and a set of files, F
   o Apache maps U → F
   o mod_oai maps F → U
• Neither function is 1-1 nor onto
   o We can easily check if a single u maps to F, but given F we cannot (easily) generate U
• Short-term issues:
   o dynamic files - exporting unprocessed server-side files would be a security hole
   o IndexIgnore - httpd will “hide” valid URLs
   o File permissions - httpd will advertise files it cannot read
• Long-term issues:
   o Alias, Location - files can be covered up by the httpd
   o UserDir - interactions between the httpd and the filesystem

IndexIgnore & File Permissions

Alias: Covering Up Files

httpd.conf:
Alias /A /usr/local/web/htdocs/B
Alias /B /usr/local/web/htdocs/A

the files “A” and “B” will be different from the URLs http://server/A http://server/B

UserDir: “Just in Time” mounting of directories

whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/ mln/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls -d /home/tharriso
/home/tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/ mln/ tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf%

Looking Further Down the Road for mod_oai
• “Reverse” the method of URL discovery
   o cannot look to the files
   o listen to incoming requests and build a list of valid URLs
      - could be seeded with files at start
      - also the method for handling server processed files / URLs
• Plug-ins for descriptive metadata
   o DC tags in HTML
   o MS Office formats, PDF
   o Tags from JPEG, TIFF, MP3, etc.
• Additional metadata in the DIDL
   o technical metadata from JHOVE
   o estimated change rate
      - cf. Cho & Garcia-Molina, ACM TOIT 28(4)
• http log access as separate metadata formats
   - cf. Van de Sompel, Young & Hickey, D-Lib 9(7/8)

Expanding OAI-PMH / Complex Object Access
• OAI-PMH / CO access for:
   o blogs
   o message boards
   o native file systems
      - e.g. Mac OS X “Spotlight”
• More aggressive use of OAI-PMH / CO for preservation
   o recently funded NSF DIGARCH program
   o use for preservation:
      - Usenet
      - Email
      - Multicasting

OAI-PMH + Complex Objects: A New Model for Web Resource Harvesting
• Better web harvesting can be achieved through:
   o OAI-PMH: structured access to updates
   o Complex object formats: modeled representation of digital objects
• Use cases:
   o Preservation (ListRecords)
   o Web crawling (ListIdentifiers)
• mod_oai: reference implementation
   o Better performance than wget
   o static files only; dynamic files in the future
   o not a replacement for DSpace, Fedora, eprints.org, etc.
• More info:
   o http://www.modoai.org/
   o http://whiskey.cs.odu.edu/
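
Example: a ListIdentifiers harvest request

The resource discovery use case (a ListIdentifiers request restricted by from and set) can be sketched in Python as below. This is a minimal sketch under stated assumptions, not part of mod_oai: the helper names list_identifiers_url and parse_headers are hypothetical, the baseURL is the example one from the mod_oai approach slide, and the sample response is hand-written and trimmed for illustration rather than captured from a real server.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_identifiers_url(base_url, metadata_prefix="oai_dc",
                         from_date=None, set_spec=None):
    """Build a ListIdentifiers request URL (hypothetical helper)."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    # urlencode percent-encodes reserved characters such as ':' in the set name.
    return base_url + "?" + urlencode(params)


def parse_headers(xml_text):
    """Return (identifier, datestamp) pairs from a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for header in root.iter(OAI_NS + "header"):
        identifier = header.findtext(OAI_NS + "identifier")
        datestamp = header.findtext(OAI_NS + "datestamp")
        pairs.append((identifier, datestamp))
    return pairs


# Hand-written, trimmed sample (assumption): with mod_oai the identifier
# is the URL of the resource, so it can be fetched with a plain HTTP GET.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/a.mpg</identifier>
      <datestamp>2004-09-20</datestamp>
      <setSpec>mime:video:mpeg</setSpec>
    </header>
    <header>
      <identifier>http://www.foo.edu/b.mpg</identifier>
      <datestamp>2004-10-02</datestamp>
      <setSpec>mime:video:mpeg</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""

url = list_identifiers_url("http://www.foo.edu/modoai",
                           from_date="2004-09-15",
                           set_spec="mime:video:mpeg")
updated = parse_headers(SAMPLE_RESPONSE)
```

A harvester would then issue HTTP GETs only for the URLs in updated, rather than re-crawling the whole site.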
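
Example: why F → U is hard with Alias

The Alias issue from the "Covering Up Files" slide can be illustrated with a toy model in Python. This is a deliberately simplified sketch, not Apache's actual resolution logic: the document root /usr/local/web/htdocs is an assumption, the function names are hypothetical, and only prefix-style Alias matching is modeled. It shows that checking a single URL against the files is easy, while guessing a URL from a file path found on disk can point at a different file.

```python
# Assumed document root (not stated on the slide).
DOCROOT = "/usr/local/web/htdocs"

# The two Alias directives from the slide, as (URL prefix, target path).
ALIASES = [
    ("/A", "/usr/local/web/htdocs/B"),
    ("/B", "/usr/local/web/htdocs/A"),
]


def url_to_file(url_path):
    """U -> F: simplified model of how the server maps a URL path to a file."""
    for prefix, target in ALIASES:
        if url_path == prefix or url_path.startswith(prefix + "/"):
            return target + url_path[len(prefix):]
    return DOCROOT + url_path


def naive_file_to_url(file_path):
    """F -> U: what a filesystem walk would guess, ignoring Alias."""
    return file_path[len(DOCROOT):]


def is_consistent(file_path):
    """True if the guessed URL really leads back to the same file."""
    return url_to_file(naive_file_to_url(file_path)) == file_path


served = url_to_file("/A/index.html")
covered = is_consistent("/usr/local/web/htdocs/A/index.html")
plain = is_consistent("/usr/local/web/htdocs/C/index.html")
```

In this model the file under htdocs/A is "covered up": the URL guessed from its path is served from htdocs/B instead, so a walk of the filesystem alone would advertise the wrong resource.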