Presentation: Francis Cave (ACAP)
Communicating with crawlers
What ACAP has to offer
WEBCONTENT: Te Mooi Om Weg Te Geven
NUV, Amsterdam
Francis Cave, EDItEUR
ACAP Technical Project Manager
May 2008
Communicating with crawlers
What ACAP has to offer…
What is ACAP (Version 1.0)?
What has been the experience so far?
What publishers should do now...
The ACAP Technical Framework
“ACAP Version 1.0”
What it is…
a toolkit to enable communication of
content access and usage policies
adopting and building upon existing standards
rooted in the requirements of real use cases
a “proof of concept”
What it isn’t…
at this stage ACAP is not a formal standard
a technical enforcement mechanism
“ACAP Version 1.0”
What is it?
Protocols for machine-to-machine messaging
using a common vocabulary
of access and usage terminology
Guidance on methods of
communication and access control
Software tools to support implementation
“ACAP Version 1.0”
What kinds of “protocols”?
business layer protocols…
machines already know how to
talk to one another…
physical layer: PPP, ATM, …
network layer: TCP/IP
application layer: HTTP, HTTPS, SMTP, FTP…
business layer: RSS, ebXML, EDIINT, SOAP,
web services, …
they just don’t know what to say in the business
of communicating access and usage policies
“ACAP Version 1.0”
We need to tell the machines
what to say to one another…
we need a common vocabulary
so they know what to say…
…and how to interpret it
…and tell them how to say it
using whatever protocols they already use
to talk to one another
“ACAP Version 1.0”
But machines aren’t going to
do this on their own
we need to provide guidance on how to
implement the protocols
we need to provide tools to support
implementation
“ACAP Version 1.0”
How has it been developed?
We started with a set of real business use cases:
Nine publishers looking for ways of communicating
access and use policies for their online content
A national archive looking for ways of finding out what
they were allowed to do with the content that they are
preserving for posterity
A search engine looking for ways to include more
high-quality content in their index
“ACAP Version 1.0”
What does ACAP Version 1.0 include?
Extensions to the Robots Exclusion Protocol (REP)
Part 1 specifies extensions to the “robots.txt” format
enables policies to be expressed for an entire website
leverages the established protocol for web server-crawler communication
the existing format is used on millions of websites
and understood by hundreds of crawlers
Part 2 specifies extensions to the Robots META Tags format
enables policies to be expressed within individual
HTML pages
existing format understood by major search engines
Dictionary of access and usage terminology
robots.txt conversion tool
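As an illustration of the page-level mechanism that Part 2 builds on, the following is a minimal Python sketch that reads the conventional Robots META tag from a single HTML page. It shows only the existing format (for example `noindex, nofollow`); the ACAP extensions to that format are not reproduced here. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Attribute names arrive lowercased; values keep their original case.
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.append(part)


def robots_directives(html: str) -> list:
    """Return the conventional robots directives declared in one page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(robots_directives(page))
```

Because the tag lives inside the page itself, a policy expressed this way applies to that page only, in contrast to the site-wide scope of robots.txt.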
“ACAP Version 1.0”
Why does REP need to be extended?
conventional REP has only a very limited vocabulary
even if we include non-standard extensions that
not every search engine has implemented
conventional REP is inconsistently interpreted
e.g. “Disallow” means different things to different
crawlers:
don’t crawl?
don’t index?
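The limited vocabulary of conventional REP can be seen with the robots.txt parser in Python's standard library. The sketch below uses a hypothetical robots.txt and a hypothetical crawler name, `examplebot`. The parser answers only one question, whether a URL may be fetched; nothing in the conventional format says whether the content may also be indexed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Ask the conventional REP question: may this crawler fetch this URL?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/archive/2008/story.html",
        "https://example.com/news/story.html",
    ):
        print(url, may_fetch(ROBOTS_TXT, "examplebot", url))
```

A yes/no answer on fetching is all a crawler gets from this format, which is why each crawler has to decide for itself what “Disallow” implies for indexing.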
“ACAP Version 1.0”
ACAP Version 1.0 has been tested by four publishers
against their priority use cases:
De Persgroep – major Flemish news publisher
Media 24 – global news / media publisher based in South Africa
Macmillan – online book content hosting service
Reed Elsevier – scientific and business information publisher
all the tested use cases concern text resources
current technical work includes extension of ACAP Version 1.0 to
enable communication of policies relating specifically to non-text
resources such as images and video
ACAP Version 1.0 has been implemented in a test
crawler by search engine operator Exalead
“ACAP Version 1.0”
Tool for converting existing robots.txt files
converts conventional robots.txt files so that existing
policies are expressed using ACAP terminology
User-agent: → ACAP-crawler:
Disallow: → ACAP-disallow-crawl:
Allow: → ACAP-allow-crawl:
is implemented in Perl
can be used from the ACAP website
http://www.the-acap.org/convert-robots-txt-to-acap.php
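The field substitutions listed on this slide can be sketched in a few lines of Python. This is not the ACAP conversion tool, which is written in Perl and may behave differently (for instance in how it treats the original lines); it is a minimal illustration that applies only the three field-name mappings shown above and leaves every other line unchanged. All function names are hypothetical.

```python
import re

# The three field-name mappings shown on the slide.
FIELD_MAP = {
    "user-agent": "ACAP-crawler",
    "disallow": "ACAP-disallow-crawl",
    "allow": "ACAP-allow-crawl",
}

# Optional indentation, a field name, then the colon and the rest of the line.
_FIELD_RE = re.compile(r"^(\s*)([A-Za-z-]+)(\s*:.*)$")


def convert_line(line: str) -> str:
    """Rename one conventional REP field; return other lines untouched."""
    match = _FIELD_RE.match(line)
    if match is None:
        return line  # comments, blank lines, anything unrecognised
    indent, name, rest = match.groups()
    acap_name = FIELD_MAP.get(name.lower())
    if acap_name is None:
        return line  # fields outside the three mappings are kept as-is
    return f"{indent}{acap_name}{rest}"


def convert_robots_txt(text: str) -> str:
    """Apply the field renaming to every line of a robots.txt file."""
    return "\n".join(convert_line(line) for line in text.splitlines())


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /private/\nAllow: /"
    print(convert_robots_txt(sample))
```

The sketch translates field names only. It does not add any of the richer policy expressions that ACAP makes possible, which is the refinement step the Implementation Guide covers.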
“ACAP Version 1.0”
Guidance on crawler authentication
How to identify crawler names and IP addresses by analysing
web server access log files
How to configure a server so that you can deliver different
‘robots.txt’ files to different crawlers
examples are based upon the Apache web server
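The log analysis step can be sketched as follows. This is a minimal Python illustration, not taken from the ACAP guidance: it assumes the access log uses Apache's combined log format, and it uses a simple heuristic (clients that request /robots.txt are probably crawlers). The function name and the sample data are hypothetical.

```python
import re
from collections import defaultdict

# Apache combined log format: host, identity, user, [time], "request",
# status, bytes, "referer", "user-agent".
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<request>[^"]*)" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def robots_txt_clients(log_lines):
    """Map each user-agent that fetched /robots.txt to its sorted IP addresses."""
    clients = defaultdict(set)
    for line in log_lines:
        match = LOG_RE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        parts = match.group("request").split()
        # A request line looks like: GET /robots.txt HTTP/1.1
        if len(parts) >= 2 and parts[1].split("?")[0] == "/robots.txt":
            clients[match.group("agent")].add(match.group("ip"))
    return {agent: sorted(ips) for agent, ips in clients.items()}


if __name__ == "__main__":
    sample = [
        '192.0.2.10 - - [12/May/2008:10:00:00 +0200] '
        '"GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleBot/1.0"',
    ]
    print(robots_txt_clients(sample))
```

A user-agent string is supplied by the client and can be spoofed, so the IP addresses collected alongside each name are what make it possible to check that a request really comes from the crawler it claims to be.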
ACAP Version 1.0 Implementation Guide
Step-by-step guide on how to make full use of the extensions
to REP proposed in ACAP Version 1.0
Illustrated with many examples
“ACAP Version 1.0”
Review of test results
We have tested ACAP Version 1.0 REP extensions in a
range of use cases
for most of the tested use cases there are no unresolved issues
but protected content use cases
have been particularly challenging to implement
have highlighted need for further work on some terminology
ACAP Version 1.0 is ready to implement…
for use cases in unprotected online content delivery
for some use cases in protected online content delivery
…but ACAP needs further development
all specifications will continue to be revised and extended
Future plans…
To be added in future
corrections and clarifications of a few points in ACAP Version 1.0
additional vocabulary required for expressing policies specific to
the creation and use of web archives
the presentation of images and other media content
the communication of policies associated with page fragments
mechanisms for embedding ACAP policies in PDF and media
resources.
an XML format for policy expression
based upon ONIX for Licensing Terms developed by EDItEUR
required for news and web syndication use cases
“ACAP Version 1.0”
Experience to date:
ACAP Version 1.0 works
it enables a richer form of expression of policies than is
possible using conventional REP ...
it doesn’t interfere with current crawler activity ...
... but it only goes so far.
ACAP Version 1.0 needs to be extended
ACAP Version 1.1 (June/July 2008)
“ACAP Version 1.0”
What should publishers do now?
ACAP Version 1.0 needs to be implemented!
use the conversion tool to convert existing ‘robots.txt’ files
to use ACAP forms of expression
use the Implementation Guide to refine policy expressions
consider creating crawler-specific policies in separate
‘robots.txt’ files
give us your feedback, to help us improve future
versions of ACAP
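The idea of crawler-specific policies in separate ‘robots.txt’ files can be sketched as a selection rule. The ACAP guidance bases its examples on Apache server configuration; the Python below is only a stand-in for that logic, and every crawler name, address range and file name in it is hypothetical. A crawler-specific file is chosen only when both the declared name and the source address match.

```python
import ipaddress

# Hypothetical table: (user-agent token, trusted networks, policy file).
CRAWLER_POLICIES = [
    ("examplebot", ["192.0.2.0/24"], "robots-examplebot.txt"),
]
DEFAULT_FILE = "robots.txt"


def select_robots_file(user_agent, remote_ip,
                       policies=CRAWLER_POLICIES, default=DEFAULT_FILE):
    """Pick the robots.txt variant to serve for one request."""
    address = ipaddress.ip_address(remote_ip)
    agent = user_agent.lower()
    for token, networks, filename in policies:
        name_matches = token in agent
        # Requiring a known address range guards against spoofed names.
        ip_matches = any(
            address in ipaddress.ip_network(network) for network in networks
        )
        if name_matches and ip_matches:
            return filename
    return default


if __name__ == "__main__":
    print(select_robots_file("ExampleBot/1.0", "192.0.2.10"))
    print(select_robots_file("ExampleBot/1.0", "198.51.100.7"))
```

Falling back to the default file for any unverified request means that unknown or spoofed crawlers still receive the general site policy.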
Thank you!
Questions…?
[email protected]
Francis Cave, EDItEUR
ACAP Technical Project Manager
May 2008