Transcript Document

Overview of Web-Crawlers
Neal Richter & Anthony Arnone
• Nov 30, 2005 – CS Conference Room
• These slides are edited versions of the chapter 2 lecture notes from “Mining the Web” by Soumen Chakrabarti.
• “Programming Spiders, Bots, and Aggregators in Java” by Jeff Heaton is also a good reference.
Crawling the Web
Web pages
• a few thousand characters long
• served over the Internet using the HyperText Transfer Protocol (HTTP)
• viewed at the client end using `browsers'
Crawler
• fetches the pages to the computer
• at the computer, automatic programs can analyze hypertext documents
HTML
 HyperText Markup Language
 Lets the author
• specify layout and typeface
• embed diagrams
• create hyperlinks
 A hyperlink is expressed as an anchor tag with an HREF attribute
 HREF names another page using a Uniform Resource Locator (URL)
• URL =
 protocol field (“HTTP”) +
 server hostname (“www.cse.iitb.ac.in”) +
 file path (“/”, the `root' of the published file system)
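Extracting anchor tags and their HREF targets is the crawler's basic parsing step. Below is a minimal sketch (not from the slides) using Python's standard html.parser module; the HTML snippet is just an illustration.

    from html.parser import HTMLParser

    class AnchorExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":                        # anchor tag
                for name, value in attrs:
                    if name == "href" and value:  # HREF names another page (a URL)
                        self.links.append(value)

    parser = AnchorExtractor()
    parser.feed('<a href="http://www.cse.iitb.ac.in/">CSE home</a>')
    print(parser.links)    # ['http://www.cse.iitb.ac.in/']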
HTTP (HyperText Transfer Protocol)
 Built on top of the Transmission Control Protocol (TCP)
 Steps (from the client end)
• resolve the server host name to an Internet address (IP)
 use the Domain Name System (DNS)
 DNS is a distributed database of name-to-IP mappings maintained at a set of known servers
• contact the server using TCP
 connect to the default HTTP port (80) on the server
 send the HTTP request header (e.g.: GET)
 fetch the response header
– MIME (Multipurpose Internet Mail Extensions)
– a meta-data standard for email and Web content transfer
 fetch the HTML page
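To make the client-side steps concrete, here is a minimal sketch of one fetch over a raw socket: resolve the name, connect to port 80, send a GET request header, and read the response header plus page. It is illustrative only (HTTP/1.0, no error handling); the host name is the example from the slide.

    import socket

    host = "www.cse.iitb.ac.in"              # example host from the slide
    path = "/"

    ip = socket.gethostbyname(host)            # 1. DNS: resolve name to IP address
    sock = socket.create_connection((ip, 80))  # 2. TCP connection to HTTP port 80
    sock.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii"))
                                               # 3. send the HTTP request header (GET)
    response = b""
    while True:                                # 4. read response header + HTML page
        chunk = sock.recv(4096)
        if not chunk:                          # HTTP/1.0 server closes when done
            break
        response += chunk
    sock.close()

    header, _, body = response.partition(b"\r\n\r\n")
    print(header.decode("iso-8859-1"))         # MIME-style response header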
Crawl “all” Web pages?
 Problem: no catalog of all accessible URLs on the Web
 Solution:
• start from a given set of URLs
• progressively fetch and scan them for new outlinking URLs
• fetch these pages in turn…
• submit the text in each page to a text indexing system
• and so on…
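A rough sketch of this fetch-scan-repeat loop, assuming nothing beyond the Python standard library: the seed URL, the page limit, and the regex-based link scan are illustrative placeholders, and indexing is stubbed out.

    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    MAX_PAGES = 100                              # stop once "enough" pages are collected
    seeds = ["http://example.com/"]              # placeholder seed set
    frontier = deque(seeds)
    seen = set(seeds)
    href_re = re.compile(rb'href="([^"#]+)"', re.IGNORECASE)

    def index_text(page_bytes):
        pass                                     # stand-in for a text indexing system

    while frontier and len(seen) < MAX_PAGES:
        url = frontier.popleft()
        try:
            page = urlopen(url, timeout=10).read()       # fetch
        except (OSError, ValueError):
            continue
        index_text(page)                                 # submit text for indexing
        for link in href_re.findall(page):               # scan for outlinking URLs
            new_url = urljoin(url, link.decode("ascii", "ignore"))
            if new_url not in seen:                      # fetch these pages in turn...
                seen.add(new_url)
                frontier.append(new_url)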
Crawling procedure
 Simple procedure, but…
• a great deal of engineering goes into industrial-strength crawlers
• industry crawlers crawl a substantial fraction of the Web
• E.g.: AltaVista, Northern Light, Inktomi
 No guarantee that all accessible Web pages will be located in this fashion
 Crawler may never halt…
• pages will be added continually even as it is running
Crawling overheads
 Delays involved in
• resolving the host name in the URL to an IP address using DNS
• connecting a socket to the server and sending the request
• receiving the requested page in response
 Solution: overlap the above delays by
• fetching many pages at the same time
Anatomy of a crawler
 Page fetching threads
• start with DNS resolution
• finish when the entire page has been fetched
 Each page
• stored in compressed form to disk/tape
• scanned for outlinks
 Work pool of outlinks
• maintain network utilization without overloading it
• dealt with by a load manager
 Continue till the crawler has collected a sufficient number of pages
Typical anatomy of a large-scale crawler.
Large-scale crawlers: performance and reliability considerations
 Need to fetch many pages at the same time
• to utilize the network bandwidth
• a single page fetch may involve several seconds of network latency
 Highly concurrent and parallelized DNS lookups
 Use of asynchronous sockets
• multi-processing or multi-threading: impractical
• explicit encoding of the state of a fetch context in a data structure
• polling sockets to check for completion of network transfers
 Care in URL extraction
• eliminating duplicates to reduce redundant fetches
• avoiding “spider traps”
DNS caching, pre-fetching and resolution
 A customized DNS component with…
1. Custom client for address resolution
 tailored for concurrent handling of multiple outstanding requests
 allows issuing of many resolution requests together
– polling at a later time for completion of individual requests
 facilitates load distribution among many DNS servers
2. Caching server
 with a large cache, persistent across DNS restarts
 residing largely in memory if possible
3. Prefetching client
 parse a page that has just been fetched & extract host names from HREF targets
 query the DNS cache via UDP
 don't wait for the answer; check later (when you need it)
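A rough sketch of the "issue many resolutions together, collect them later" idea with an in-memory cache. It uses a thread pool from Python's standard library rather than a custom UDP client, and the function names (prefetch, lookup) are illustrative, not from the book.

    import socket
    from concurrent.futures import ThreadPoolExecutor

    dns_cache = {}                             # host name -> IP address, kept in memory
    resolver = ThreadPoolExecutor(max_workers=50)

    def prefetch(hosts):
        """Issue resolution requests for uncached hosts; don't wait for answers."""
        pending = {}
        for host in hosts:
            if host not in dns_cache:
                pending[host] = resolver.submit(socket.gethostbyname, host)
        return pending                         # futures to poll for completion later

    def lookup(host, pending):
        """Check later, when the address is actually needed."""
        if host in dns_cache:
            return dns_cache[host]
        future = pending.get(host)
        ip = future.result() if future else socket.gethostbyname(host)
        dns_cache[host] = ip
        return ip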
Multiple concurrent fetches
• Managing multiple concurrent connections
• a single download may take several seconds
• open many socket connections to different HTTP servers simultaneously
• Multi-CPU machines not useful
• crawling performance limited by network and disk
• Two approaches
1. using multi-threading
2. using non-blocking sockets with event handlers
Multi-threading
• logical threads
• physical threads of control provided by the operating system (e.g.: pthreads), OR
• concurrent processes
• fixed number of threads allocated in advance
• programming paradigm
• create a client socket
• connect the socket to the HTTP service on a server
• send the HTTP request header
• read the socket (recv) until no more characters are available
• close the socket
• use blocking system calls
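A rough sketch of this paradigm, assuming a fixed pool of worker threads that each use the blocking socket calls listed above; the hosts in the work pool are placeholders.

    import socket
    import threading
    import queue

    work_pool = queue.Queue()
    for host, path in [("example.com", "/"), ("example.org", "/")]:   # placeholders
        work_pool.put((host, path))

    def worker():
        while True:
            try:
                host, path = work_pool.get_nowait()
            except queue.Empty:
                return
            sock = socket.create_connection((host, 80))    # create + connect client socket
            sock.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
            page = b""
            while True:                                     # blocking recv until EOF
                chunk = sock.recv(4096)
                if not chunk:
                    break
                page += chunk
            sock.close()
            print(host, len(page), "bytes")

    threads = [threading.Thread(target=worker) for _ in range(4)]      # fixed thread count
    for t in threads:
        t.start()
    for t in threads:
        t.join()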
Multi-threading: Problems
• performance penalty of
• mutual exclusion for concurrent access to data structures
• slow disk seeks
• a great deal of interleaved, random input-output on disk
• due to concurrent modification of the document repository by multiple threads
Non-blocking sockets and event handlers
• non-blocking sockets
• a connect, send or recv call returns immediately without waiting for the network operation to complete
• poll the status of the network operation separately
• “select” system call
• lets the application suspend until more data can be read from or written to the socket
• times out after a pre-specified deadline
• monitor polls several sockets at the same time
• More efficient memory management
• code that completes processing is not interrupted by other completions
• no need for locks and semaphores on the work pool
• only append complete pages to the log
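A condensed sketch of this approach under the stated assumptions: each fetch's state lives in a small data structure, sockets are non-blocking, and one loop polls select() for sockets that are ready. The hosts are placeholders and error handling is omitted.

    import select
    import socket

    class Fetch:                         # explicit state of one fetch context
        def __init__(self, host, path="/"):
            self.host, self.buf = host, b""
            self.sock = socket.socket()
            self.sock.setblocking(False)
            try:
                self.sock.connect((host, 80))          # returns immediately
            except BlockingIOError:
                pass                                   # completion is polled via select
            self.request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode()

    fetches = {f.sock: f for f in (Fetch("example.com"), Fetch("example.org"))}
    unsent = dict(fetches)               # sockets whose request is not yet sent

    while fetches:
        rlist, wlist, _ = select.select(list(fetches), list(unsent), [], 5.0)
        for sock in wlist:                             # connected: send the request
            sock.sendall(unsent.pop(sock).request)
        for sock in rlist:
            f = fetches[sock]
            chunk = sock.recv(4096)
            if chunk:
                f.buf += chunk
            else:                                      # complete: append page to the "log"
                print(f.host, len(f.buf), "bytes")
                sock.close()
                del fetches[sock]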
Link extraction and normalization
• Goal: obtaining a canonical form of each URL
• URL processing and filtering
• avoid multiple fetches of pages known by different URLs
• relative URLs need to be interpreted w.r.t. a base URL
• many IP addresses / mirrors???
Canonical URL
Formed by
• using a standard string for the protocol
• canonicalizing the host name
• adding an explicit port number
• normalizing and cleaning up the path
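A sketch of one way to build such a canonical form with the standard urllib.parse module: a lower-case protocol string, a lower-cased host, an explicit port, and a normalized path. The exact normalization rules here are illustrative choices, not the book's definition.

    import posixpath
    from urllib.parse import urljoin, urlsplit, urlunsplit

    DEFAULT_PORTS = {"http": 80, "https": 443}

    def canonical_url(url, base=None):
        if base:                                       # resolve relative URLs first
            url = urljoin(base, url)
        scheme, netloc, path, query, _ = urlsplit(url)
        scheme = scheme.lower()                        # standard protocol string
        host, _, port = netloc.partition(":")
        host = host.lower().rstrip(".")                # canonical host name
        port = port or str(DEFAULT_PORTS.get(scheme, 80))   # explicit port number
        path = posixpath.normpath(path) if path else "/"    # normalize / clean up the path
        if path == ".":
            path = "/"
        return urlunsplit((scheme, f"{host}:{port}", path, query, ""))

    # e.g. canonical_url("a/../b.html", base="HTTP://WWW.CSE.IITB.AC.IN/")
    #      -> "http://www.cse.iitb.ac.in:80/b.html"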
Robot exclusion
• Check whether the server prohibits crawling a normalized URL
• the robots.txt file in the HTTP root directory of the server specifies a list of path prefixes which crawlers should not attempt to fetch
• meant for crawlers only
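Python's standard library already implements this check; a brief sketch (the host and the user-agent string are placeholders):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("http://example.com/robots.txt")    # robots.txt at the HTTP root
    rp.read()                                      # fetch and parse the file

    if rp.can_fetch("MyCrawler", "http://example.com/private/page.html"):
        print("allowed to fetch")
    else:
        print("path prefix disallowed for crawlers")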
Eliminating already-visited URLs
 Checking if a URL has already been fetched
• before adding a new URL to the work pool
• needs to be very quick
• achieved by computing an MD5 hash function on the URL
 Exploiting spatio-temporal locality of access
 two-level hash function
– most significant bits (say, 24) derived by hashing the host name plus port
– lower-order bits (say, 40) derived by hashing the path
 concatenated bits used as a key in a B-tree
 qualifying URLs added to the frontier of the crawl
 hash values added to the B-tree
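A sketch of building that two-level key: 24 bits hashed from the host plus port, concatenated with 40 bits hashed from the path (the bit widths follow the "say, 24 / say, 40" above). A Python set stands in for the on-disk B-tree.

    import hashlib
    from urllib.parse import urlsplit

    def md5_bits(text, bits):
        digest = int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)
        return digest & ((1 << bits) - 1)              # keep the low `bits` bits

    def url_key(url):
        parts = urlsplit(url)
        host_bits = md5_bits(f"{parts.hostname}:{parts.port or 80}", 24)
        path_bits = md5_bits(parts.path or "/", 40)
        return (host_bits << 40) | path_bits           # 64-bit key: host/port bits on top

    visited = set()                                    # stand-in for the B-tree of hash values

    def already_seen(url):
        key = url_key(url)
        if key in visited:
            return True
        visited.add(key)      # new (qualifying) URL: record its key, add URL to the frontier
        return False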
Spider traps
 Protecting the crawler from crashing on
• ill-formed HTML
 E.g.: a page with 68 kB of null characters
• misleading sites
 an indefinite number of pages dynamically generated by CGI scripts
 paths of arbitrary depth created using soft directory links and path remapping features in the HTTP server
Spider Traps: Solutions
 No automatic technique can be foolproof
 Check for URL length
 Guards
• preparing regular crawl statistics
• adding dominating sites to a guard module
• disabling crawling of active content such as CGI form queries
• eliminating URLs with non-textual data types
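A trivial sketch of the kind of guard checks listed above (URL length, dominating hosts, CGI-style queries, non-textual extensions); the thresholds and lists are illustrative, not values from the book.

    from urllib.parse import urlsplit

    MAX_URL_LEN = 256
    BLOCKED_HOSTS = {"spammy.example.com"}        # filled in from regular crawl statistics
    NON_TEXT_EXTS = (".jpg", ".gif", ".png", ".zip", ".exe", ".pdf")

    def passes_guards(url):
        if len(url) > MAX_URL_LEN:
            return False                          # suspiciously long URL (possible trap)
        parts = urlsplit(url)
        if parts.hostname in BLOCKED_HOSTS:
            return False                          # dominating / trapped site
        if parts.query or "cgi-bin" in parts.path:
            return False                          # active content such as CGI form queries
        if parts.path.lower().endswith(NON_TEXT_EXTS):
            return False                          # non-textual data type
        return True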
Avoiding repeated expansion of links on duplicate pages
 Reduce redundancy in crawls
 Duplicate detection
• mirrored Web pages and sites
 Detecting exact duplicates
• checking the page's MD5 digest against digests of stored pages
• representing a relative link v (relative to aliases u1 and u2) as tuples (h(u1), v) and (h(u2), v)
 Detecting near-duplicates
• even a single altered character will completely change the digest!
 E.g.: date of update / name and email of the site administrator
• Solution: shingling
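Shingling is only named here; as a rough illustration, one common formulation takes the set of w-word shingles of each page and compares the sets with Jaccard resemblance (real systems sample shingles with min-hashing rather than comparing full sets):

    import re

    def shingles(text, w=4):
        words = re.findall(r"\w+", text.lower())
        return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

    def resemblance(a, b, w=4):
        sa, sb = shingles(a, w), shingles(b, w)
        return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

    # Near-duplicate pages (e.g. only the "last updated" date differs)
    # get a resemblance close to 1.0, even though their MD5 digests differ.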
Text repository
 Crawler's last task
 dumping fetched pages into a repository
 Decoupling the crawler from other functions is preferred for efficiency and reliability
 Page-related information stored in two parts
 meta-data
 page contents
Large-scale crawlers often use multiple ISPs and a bank of local storage servers to store the pages crawled.
Refreshing crawled pages
 Search engine's index should be fresh
 Web-scale crawler never `completes' its job
 High variance in the rate of page changes
 “If-Modified-Since” request header in the HTTP protocol
 impractical for a large web crawler
 Solution
 at the commencement of a new crawling round, estimate which pages have changed
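For reference, a brief sketch of the conditional request mentioned above: a 304 response means the stored copy is still current. The URL and timestamp are placeholders.

    from urllib.request import Request, urlopen
    from urllib.error import HTTPError

    req = Request("http://example.com/page.html",
                  headers={"If-Modified-Since": "Sat, 01 Jan 2005 00:00:00 GMT"})
    try:
        page = urlopen(req, timeout=10).read()      # 200: page changed, re-index it
        print("modified,", len(page), "bytes")
    except HTTPError as e:
        if e.code == 304:
            print("not modified since last crawl")  # keep the stored copy
        else:
            raise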
Determining page changes
 “Expires” HTTP response header
 for pages that come with an expiry date
 Otherwise, need to guess whether revisiting the page will yield a modified version
 a score reflects the probability of the page being modified
 the crawler fetches URLs in decreasing order of score
 assumption: the recent past predicts the future
Estimating page change rates
 Brewington and Cybenko & Cho
 algorithms for maintaining a crawl in which most pages are fresher than a specified epoch
 Prerequisite
 the average interval at which the crawler checks for changes is smaller than the inter-modification time of a page
 Small-scale intermediate crawler runs
 to monitor fast-changing sites
 E.g.: current news, weather, etc.
 intermediate indices patched into the master index
Questions?
Lots of programming detail missing!!!!