Presentation: Francis Cave (ACAP)


Communicating with crawlers
What ACAP has to offer
WEBCONTENT: Te Mooi Om Weg Te Geven
NUV, Amsterdam
Francis Cave, EDItEUR
ACAP Technical Project Manager
May 2008
Communicating with crawlers
What ACAP has to offer…
• What is ACAP (Version 1.0)?
• What has been the experience so far?
• What should publishers do now?
The ACAP Technical Framework
“ACAP Version 1.0”
What it is…
• a toolkit to enable communication of content access and usage policies
• adopting and building upon existing standards
• rooted in the requirements of real use cases
• a “proof of concept”
What it isn’t…
• a formal standard (at this stage)
• a technical enforcement mechanism
“ACAP Version 1.0”
What is it?
• Protocols for machine-to-machine messaging
  • using a common vocabulary of access and usage terminology
• Guidance on methods of communication and access control
• Software tools to support implementation
“ACAP Version 1.0”
What kinds of “protocols”?
• business layer protocols…
• machines already know how to talk to one another…
  • physical layer: PPP, ATM, …
  • network layer: TCP/IP
  • application layer: HTTP, HTTPS, SMTP, FTP…
  • business layer: RSS, ebXML, EDIINT, SOAP, web services, …
• they just don’t know what to say in the business of communicating access and usage policies
“ACAP Version 1.0”
• We need to tell the machines what to say to one another…
  • we need a common vocabulary
    • so they know what to say…
    • …and how to interpret it
• …and tell them how to say it
  • using whatever protocols they already use to talk to one another
“ACAP Version 1.0”
• But machines aren’t going to do this on their own
  • we need to provide guidance on how to implement the protocols
  • we need to provide tools to support implementation
“ACAP Version 1.0”
How has it been developed?
• We started with a set of real business use cases:
  • Nine publishers looking for ways of communicating access and use policies for their online content
  • A national archive looking for ways of finding out what it is allowed to do with the content that it is preserving for posterity
  • A search engine looking for ways to include more high-quality content in its index
“ACAP Version 1.0”
What does ACAP Version 1.0 include?
• Extensions to the Robots Exclusion Protocol (REP)
  • Part 1 specifies extensions to the “robots.txt” format
    • enables policies to be expressed for an entire website
    • leverages the established protocol for web server-crawler communication
    • the existing format is used on millions of websites and understood by hundreds of crawlers
  • Part 2 specifies extensions to the Robots META Tags format
    • enables policies to be expressed within individual HTML pages
    • existing format understood by major search engines
• Dictionary of access and usage terminology
• robots.txt conversion tool
“ACAP Version 1.0”
Why does REP need to be extended?
• conventional REP has only a very limited vocabulary
  • even if we include non-standard extensions that not every search engine has implemented
• conventional REP is inconsistently interpreted
  • e.g. “Disallow” means different things to different crawlers:
    • don’t crawl?
    • don’t index?
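As a rough illustration of the limited vocabulary, the sketch below feeds a conventional robots.txt to the parser in Python's standard library. The only question that parser answers is whether a crawler may fetch a URL; nothing in it distinguishes crawling from indexing. The crawler name "ExampleBot" and the example.com URLs are invented for the illustration, and this shows one parser's reading of REP, not how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# A conventional robots.txt with a single Disallow rule (illustrative content).
conventional = """\
User-agent: *
Disallow: /archive/
"""

parser = RobotFileParser()
parser.parse(conventional.splitlines())

# The parser exposes one yes/no decision per URL: may this agent fetch it?
blocked = parser.can_fetch("ExampleBot", "http://www.example.com/archive/2007.html")
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/news/today.html")

print(blocked)  # False: the path falls under /archive/
print(allowed)  # True: no rule matches this path
```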
“ACAP Version 1.0”
ACAP Version 1.0 has been tested by four publishers against their priority use cases:
• De Persgroep – major Flemish news publisher
• Media 24 – global news / media publisher based in South Africa
• Macmillan – online book content hosting service
• Reed Elsevier – scientific and business information publisher
• all the tested use cases concern text resources
• current technical work includes extension of ACAP Version 1.0 to enable communication of policies relating specifically to non-text resources such as images and video
ACAP Version 1.0 has been implemented in a test crawler by search engine operator Exalead
“ACAP Version 1.0”
Tool for converting existing robots.txt files
• converts conventional robots.txt files so that existing policies are expressed using ACAP terminology
    User-agent:  →  ACAP-crawler:
    Disallow:    →  ACAP-disallow-crawl:
    Allow:       →  ACAP-allow-crawl:
• is implemented in Perl
• can be used from the ACAP website
  • http://www.the-acap.org/convert-robots-txt-to-acap.php
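The field-name mapping on this slide can be sketched in a few lines of Python. This is not the ACAP conversion tool (which is written in Perl), only a minimal illustration that assumes a one-for-one renaming of the three fields shown above. The function names and the sample file are hypothetical, and any field not on the slide is passed through unchanged.

```python
# Field-name mapping taken from the slide; everything else here is illustrative.
FIELD_MAP = {
    "user-agent": "ACAP-crawler",
    "disallow": "ACAP-disallow-crawl",
    "allow": "ACAP-allow-crawl",
}


def convert_line(line: str) -> str:
    """Rewrite one robots.txt line with the ACAP field name, if one is mapped."""
    stripped = line.strip()
    # Leave blank lines and comments untouched.
    if not stripped or stripped.startswith("#"):
        return line
    name, sep, value = stripped.partition(":")
    acap_name = FIELD_MAP.get(name.strip().lower())
    # Lines without a colon, or with an unmapped field, pass through unchanged.
    if not sep or acap_name is None:
        return line
    value = value.strip()
    return f"{acap_name}: {value}" if value else f"{acap_name}:"


def convert_robots_txt(text: str) -> str:
    """Apply convert_line to every line of a robots.txt file."""
    return "\n".join(convert_line(line) for line in text.splitlines())


example = """User-agent: *
Disallow: /private/
Allow: /private/summary.html
Sitemap: http://www.example.com/sitemap.xml"""

print(convert_robots_txt(example))
```

An empty "Disallow:" line is renamed like any other here; whether the real tool treats that case specially is not stated in the slides.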
“ACAP Version 1.0”
Guidance on crawler authentication
• How to identify crawler names and IP addresses by analysing web server access log files
• How to configure a server so that you can deliver different ‘robots.txt’ files to different crawlers
  • examples are based upon the Apache web server
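The log-analysis step might look something like the following sketch. It assumes the log is in Apache's "combined" format and uses a simple heuristic that is not taken from the ACAP guidance: a client that requests /robots.txt is probably a crawler. The function name, the sample log lines, the "ExampleBot" agent and the IP addresses (from documentation ranges) are all invented for the illustration.

```python
import re
from collections import Counter

# Apache "combined" log format: host, identity, user, [time], "request",
# status, bytes, "referer", "user-agent".
COMBINED = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<request>[^"]*)" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def likely_crawlers(log_lines):
    """Return request counts per (agent, ip) for clients that fetched /robots.txt."""
    hits = Counter()
    fetched_robots = set()
    for line in log_lines:
        match = COMBINED.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        key = (match.group("agent"), match.group("ip"))
        hits[key] += 1
        parts = match.group("request").split()
        # The request line looks like: METHOD PATH PROTOCOL
        if len(parts) >= 2 and parts[1].split("?")[0] == "/robots.txt":
            fetched_robots.add(key)
    return {key: hits[key] for key in fetched_robots}


# Invented sample lines, for illustration only.
sample = [
    '192.0.2.10 - - [12/May/2008:10:00:00 +0200] "GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [12/May/2008:10:00:05 +0200] "GET /news/1.html HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [12/May/2008:10:01:00 +0200] "GET /news/1.html HTTP/1.1" 200 5120 "http://www.example.com/" "Mozilla/5.0"',
]

for (agent, ip), count in likely_crawlers(sample).items():
    print(agent, ip, count)
```

User-agent strings can be forged, so a list produced this way is a starting point for checking IP addresses rather than proof of a crawler's identity.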
ACAP Version 1.0 Implementation Guide
• Step-by-step guide on how to make full use of the extensions to REP proposed in ACAP Version 1.0
• Illustrated with many examples
“ACAP Version 1.0”
Review of test results
• We have tested ACAP Version 1.0 REP extensions in a range of use cases
  • for most of the tested use cases there are no unresolved issues
  • but protected content use cases
    • have been particularly challenging to implement
    • have highlighted the need for further work on some terminology
• ACAP Version 1.0 is ready to implement…
  • for use cases in unprotected online content delivery
  • for some use cases in protected online content delivery
• …but ACAP needs further development
  • all specifications will continue to be revised and extended
Future plans…
To be added in future
• corrections and clarifications of a few points in ACAP Version 1.0
• additional vocabulary required for expressing policies specific to
  • the creation and use of web archives
  • the presentation of images and other media content
  • the communication of policies associated with page fragments
• mechanisms for embedding ACAP policies in PDF and media resources
• an XML format for policy expression
  • based upon ONIX for Licensing Terms developed by EDItEUR
  • required for news and web syndication use cases
“ACAP Version 1.0”
Experience to date:
• ACAP Version 1.0 works
  • it enables a richer form of expression of policies than is possible using conventional REP ...
  • it doesn’t interfere with current crawler activity ...
  • ... but it only goes so far.
• ACAP Version 1.0 needs to be extended
  • ACAP Version 1.1 (June/July 2008)
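The non-interference point can be checked, for one parser at least, with the sketch below. It compares how Python's standard-library robots.txt parser answers the same questions for a conventional file and for that file with ACAP fields appended. The ACAP field names come from the conversion-tool slide; their values, the crawler name "ExampleBot" and the URLs are assumptions made for the illustration. It says nothing about how other crawlers handle unknown fields.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Parse a robots.txt string and ask whether the agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


conventional = "User-agent: *\nDisallow: /private/\n"

# Same file with ACAP fields appended (the values are assumed, not from the spec).
with_acap = conventional + "\nACAP-crawler: *\nACAP-disallow-crawl: /private/\n"

urls = [
    "http://www.example.com/private/report.html",
    "http://www.example.com/news/today.html",
]

before = [can_fetch(conventional, "ExampleBot", url) for url in urls]
after = [can_fetch(with_acap, "ExampleBot", url) for url in urls]

# A parser that only knows conventional REP skips the fields it does not
# recognise, so both files yield the same answers.
print(before == after)
```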
“ACAP Version 1.0”
What should publishers do now?
• ACAP Version 1.0 needs to be implemented!
  • use the conversion tool to convert existing ‘robots.txt’ files to use ACAP forms of expression
  • use the Implementation Guide to refine policy expressions
  • consider creating crawler-specific policies in separate ‘robots.txt’ files
• give us your feedback, to help us improve future versions of ACAP
Thank you!
Questions…?
WEBCONTENT: Te Mooi Om Weg Te Geven
NUV, Amsterdam
Francis Cave, EDItEUR
ACAP Technical Project Manager
May 2008