
CpSc 881: Information Retrieval
How hard can crawling be?
Web search engines must crawl their documents.
Getting the content of the documents is easier for many other IR systems.
E.g., indexing all files on your hard disk: just do a recursive descent on your file system
OK: for web IR, getting the content of the documents takes longer . . . because of latency.
But is that really a design/systems challenge?
Basic crawler operation
Initialize queue with URLs of known seed pages
Repeat:
    Take URL from queue
    Fetch and parse page
    Extract URLs from page
    Add URLs to queue
Fundamental assumption: The web is well linked.
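A minimal sketch of this loop in Python (fetch and extract_urls are passed in as stand-ins for the real fetching and parsing; it ignores all the issues raised on the following slides):

from collections import deque

def crawl(seed_urls, fetch, extract_urls):
    # Basic crawl loop: start from the seed URLs, then repeatedly take a URL,
    # fetch and parse the page, and enqueue newly discovered URLs.
    queue = deque(seed_urls)
    seen = set(seed_urls)
    while queue:
        url = queue.popleft()                  # take URL from queue
        page = fetch(url)                      # fetch and parse page
        for new_url in extract_urls(page):     # extract URLs from page
            if new_url not in seen:
                seen.add(new_url)
                queue.append(new_url)          # add URLs to queue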
Exercise: What’s wrong with this crawler?
urlqueue := (some carefully selected set of seed urls)
while urlqueue is not empty:
    myurl := urlqueue.getlastanddelete()
    mypage := myurl.fetch()
    fetchedurls.add(myurl)
    newurls := mypage.extracturls()
    for myurl in newurls:
        if myurl not in fetchedurls and not in urlqueue:
            urlqueue.add(myurl)
    addtoinvertedindex(mypage)
What’s wrong with the simple crawler
Scale: we need to distribute.
We can’t index everything: we need to subselect. How?
Duplicates: need to integrate duplicate detection
Spam and spider traps: need to integrate spam detection
Politeness: we need to be “nice” and space out all requests for a site over a longer period (hours, days)
Freshness: we need to recrawl periodically.
Because of the size of the web, we can do frequent recrawls only for a small subset.
Again, a subselection or prioritization problem
Magnitude of the crawling problem
To fetch 20,000,000,000 pages in one month . . .
. . . we need to fetch almost 8000 pages per second!
Actually: many more, since many of the pages we attempt to crawl will be duplicates, unfetchable, spam, etc.
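A rough back-of-the-envelope check, assuming a 30-day month:
20,000,000,000 pages / (30 × 24 × 3,600 s) = 20,000,000,000 / 2,592,000 s ≈ 7,716 pages per second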
What a crawler must do
Be polite
    Don’t hit a site too often
    Only crawl pages you are allowed to crawl: robots.txt
Be robust
    Be immune to spider traps, duplicates, very large pages, very large websites, dynamic pages, etc.
Robots.txt
Protocol for giving crawlers (“robots”) limited access to a website, originally from 1994
Examples:
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
Disallow: /
Important: cache the robots.txt file of each site we are crawling
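A minimal sketch of checking and caching robots.txt with Python’s standard urllib.robotparser (the user-agent name “searchengine” comes from the example above; the one-parser-per-host cache is an illustrative assumption):

from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

robots_cache = {}   # host -> parsed robots.txt, fetched at most once per host

def allowed(url, user_agent="searchengine"):
    host = urlsplit(url).netloc
    if host not in robots_cache:
        rp = RobotFileParser()
        rp.set_url("http://" + host + "/robots.txt")
        rp.read()                               # fetch and parse the site's robots.txt
        robots_cache[host] = rp
    return robots_cache[host].can_fetch(user_agent, url)

# A site whose robots.txt contains "User-agent: searchengine / Disallow: /"
# makes allowed(url, "searchengine") return False for every URL on that site.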
Example of a robots.txt (nih.gov)
User-agent: PicoSearch/1.0
Disallow: /news/information/knight/
Disallow: /nidcd/
...
Disallow: /news/research_matters/secure/
Disallow: /od/ocpl/wag/
User-agent: *
Disallow: /news/information/knight/
Disallow: /nidcd/
...
Disallow: /news/research_matters/secure/
Disallow: /od/ocpl/wag/
Disallow: /ddir/
Disallow: /sdminutes/
What any crawler should do
Be capable of distributed operation
Be scalable: need to be able to increase crawl rate by adding more machines
Fetch pages of higher quality first
Continuous operation: get fresh version of already crawled pages
URL frontier
The URL frontier is the data structure that holds and manages URLs we’ve seen, but that have not been crawled yet.
Can include multiple pages from the same host
Must avoid trying to fetch them all at the same time
Must keep all crawling threads busy
Basic crawl architecture
URL normalization
Some URLs extracted from a document are relative URLs.
E.g., at http://mit.edu, we may have aboutsite.html
This is the same as: http://mit.edu/aboutsite.html
During parsing, we must normalize (expand) all relative URLs.
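A small sketch of expanding a relative URL with Python’s urllib.parse.urljoin, using the example from this slide:

from urllib.parse import urljoin

base = "http://mit.edu"          # URL of the page being parsed
relative = "aboutsite.html"      # relative URL extracted from that page

print(urljoin(base, relative))   # -> http://mit.edu/aboutsite.html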
Content seen
For each page fetched: check if the content is already in the index
Check this using document fingerprints or shingles
Skip documents whose content has already been indexed
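A rough sketch of a shingle-based fingerprint check (the shingle length k = 5 and the single-hash fingerprint are illustrative assumptions; a real near-duplicate test would compare shingle sets, e.g. via min-hashing, rather than one hash per document):

import hashlib

seen_fingerprints = set()        # fingerprints of documents already indexed

def shingles(text, k=5):
    # Set of k-word shingles of the document.
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def content_seen(text):
    # True if a document with identical shingle content was already indexed.
    fp = hashlib.sha1("\n".join(sorted(shingles(text))).encode("utf-8")).hexdigest()
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False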
Distributing the crawler
Run multiple crawl threads, potentially at different nodes
Usually geographically distributed nodes
Partition hosts being crawled into nodes
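A minimal sketch of such a partition, hashing each host name to one of the crawler nodes (the node count and hash function are assumptions for illustration):

import hashlib
from urllib.parse import urlsplit

NUM_NODES = 4                    # assumed number of crawler nodes

def node_for_url(url):
    # All URLs of one host are assigned to the same node.
    host = urlsplit(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES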
Google data centers (wayfaring.com)
Distributed crawler
URL frontier: Two main considerations
Politeness: Don’t hit a web server too frequently
Only one connection is open at a time to any host
Insert a time gap between successive requests to the same server
Freshness: Crawl some pages (e.g., news sites) more often than others
High-quality pages that change frequently should be prioritized for frequent crawling.
Not an easy problem: simple priority queue fails.
Mercator URL frontier
URLs flow in from the top into the frontier.
Front queues manage prioritization.
Back queues enforce politeness.
Back queues enforce politeness.
Each queue is FIFO.
Mercator URL frontier: Front queues
Prioritizer assigns each URL an integer priority between 1 and F.
Then appends the URL to the corresponding front queue.
Heuristics for assigning priority: refresh rate, PageRank, etc.
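A minimal sketch of the front-queue side, with F FIFO queues and a toy priority heuristic (a real prioritizer would use refresh rate, PageRank, etc.):

from collections import deque

F = 3                                                 # assumed number of priority levels
front_queues = {i: deque() for i in range(1, F + 1)}  # one FIFO queue per priority

def prioritize(url):
    # Toy heuristic standing in for refresh-rate / PageRank based priorities.
    return 1 if "news" in url else F

def add_to_frontier(url):
    # Append the URL to the front queue of its priority.
    front_queues[prioritize(url)].append(url)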
Mercator URL frontier: Front queues
Selection from front queues is initiated by back queues.
Pick a front queue from which to select next URL: round robin, randomly, or more sophisticated variant.
But with a bias in favor of high-priority front queues.
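One way to realize the biased pick, a sketch that chooses among the non-empty front queues with weights favoring high priority (the weighting scheme is an assumption, not Mercator’s exact policy):

import random
from collections import deque

F = 3
front_queues = {i: deque() for i in range(1, F + 1)}  # as in the previous sketch

def next_url_from_front():
    # Pick a non-empty front queue at random, biased towards high priority (small i).
    candidates = [i for i in range(1, F + 1) if front_queues[i]]
    if not candidates:
        return None
    weights = [F + 1 - i for i in candidates]         # priority 1 gets the largest weight
    i = random.choices(candidates, weights=weights, k=1)[0]
    return front_queues[i].popleft()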
Mercator URL frontier: Back queues
Invariant 1. Each back queue is kept non-empty while the crawl is in progress.
Invariant 2. Each back queue only contains URLs from a single host.
Maintain a table from hosts to back queues.
Mercator URL frontier: Back queues
In the heap:
One entry for each back queue
The entry is the earliest time t_e at which the host corresponding to the back queue can be hit again.
The earliest time t_e is determined by (i) last access to that host (ii) time gap heuristic
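A sketch of that bookkeeping, where t_e is the last access time of the host plus a politeness gap (the 10-second gap is an illustrative assumption):

import heapq
import time

TIME_GAP = 10.0                  # assumed gap in seconds between hits to one host
heap = []                        # entries: (t_e, back_queue_id), one per back queue

def reschedule(back_queue_id, last_access):
    # The host of this back queue may be hit again no earlier than t_e.
    t_e = last_access + TIME_GAP
    heapq.heappush(heap, (t_e, back_queue_id))

reschedule(7, time.time())       # e.g., the host of back queue 7 was just accessed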
Mercator URL frontier: Back queues
How fetcher interacts with back queue:
Repeat (i) extract current root q of the heap (q is a back queue) and (ii) fetch URL u at head of q . . .
. . . until we empty the q we get.
(i.e.: u was the last URL in q)
Mercator URL frontier: Back queues
When we have emptied a back queue q:
Repeat (i) pull URLs u from front queues and (ii) add u to its corresponding back queue . . .
. . . until we get a u whose host does not have a back queue.
Then put u in q and create heap entry for it.
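Putting the back-queue pieces together, a sketch of the fetch-and-refill cycle (fetch_page and next_url_from_front are simplified stand-ins, and the gap value and queue bookkeeping are assumptions):

import heapq
import time
from collections import deque
from urllib.parse import urlsplit

TIME_GAP = 10.0
heap = []                        # (t_e, back_queue_id): earliest next access per back queue
back_queues = {}                 # back_queue_id -> deque of URLs, all from one host
host_to_queue = {}               # host -> back_queue_id

def fetch_page(url):
    print("fetching", url)       # stand-in for the actual HTTP fetch and parse

def next_url_from_front():
    return None                  # stand-in: pull from the biased front queues (earlier sketch)

def fetch_next():
    # Pop the back queue whose host may be hit soonest, wait if necessary,
    # then fetch the URL at the head of that queue.
    t_e, qid = heapq.heappop(heap)
    delay = t_e - time.time()
    if delay > 0:
        time.sleep(delay)
    url = back_queues[qid].popleft()
    fetch_page(url)
    if back_queues[qid]:
        heapq.heappush(heap, (time.time() + TIME_GAP, qid))
    else:
        refill(qid)              # the back queue we got is now empty

def refill(qid):
    # Pull URLs from the front queues until one has a host with no back queue yet;
    # that host takes over the emptied queue and gets a fresh heap entry.
    # (A full implementation would also drop the old host's table entry.)
    while True:
        url = next_url_from_front()
        if url is None:
            return
        host = urlsplit(url).netloc
        if host in host_to_queue:
            back_queues[host_to_queue[host]].append(url)
        else:
            host_to_queue[host] = qid
            back_queues[qid] = deque([url])
            heapq.heappush(heap, (time.time(), qid))
            return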
Spider trap
Malicious server that generates an infinite sequence of linked pages
Sophisticated spider traps generate pages that are not easily identified as dynamic.