Transcript Document

Crawling the Web
Web pages
• A few thousand characters long
• Served through the Internet using the Hypertext Transfer Protocol (HTTP)
• Viewed at the client end using `browsers'
Crawler
• Fetches the pages to a computer
• At that computer, automatic programs can analyze the hypertext documents
HTML
• HyperText Markup Language
• Lets the author
  – specify layout and typeface
  – embed diagrams
  – create hyperlinks
• A hyperlink is expressed as an anchor tag with an HREF attribute
• HREF names another page using a Uniform Resource Locator (URL)
• URL (see the example below) =
  – protocol field ("http") +
  – server hostname ("www.cse.iitb.ac.in") +
  – file path (/, the `root' of the published file system)
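As an illustrative aside (not from the original slides), the standard library's urllib.parse splits a URL into exactly these protocol, host, and path fields:

    from urllib.parse import urlsplit

    # Example URL from the slide; urlsplit separates the protocol (scheme),
    # server hostname (netloc) and file path components.
    parts = urlsplit("http://www.cse.iitb.ac.in/")
    print(parts.scheme)   # 'http'  -> protocol field
    print(parts.netloc)   # 'www.cse.iitb.ac.in' -> server hostname
    print(parts.path)     # '/' -> file path (the `root' of the published file system)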
HTTP (Hypertext Transfer Protocol)
• Built on top of the Transmission Control Protocol (TCP)
• Steps (from the client end)
  – Resolve the server host name to an Internet (IP) address
    • using the Domain Name Service (DNS), a distributed database of name-to-IP mappings maintained at a set of known servers
  – Contact the server using TCP
    • connect to the default HTTP port (80) on the server
    • send the HTTP request header (e.g., GET)
    • fetch the response header
      – MIME (Multipurpose Internet Mail Extensions): a meta-data standard for email and Web content transfer
    • fetch the HTML page
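A minimal sketch of these client-side steps using Python's standard socket module; the host name example.com and the plain-HTTP port 80 are assumptions for illustration:

    import socket

    host = "example.com"                          # assumed example host
    addr = socket.gethostbyname(host)             # step 1: DNS resolution to an IP address
    sock = socket.create_connection((addr, 80))   # step 2: TCP connection to the HTTP port
    # step 3: send the HTTP request header (a GET request)
    sock.sendall(f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
    response = b""
    while True:                                   # step 4: fetch response header + HTML page
        chunk = sock.recv(4096)
        if not chunk:
            break
        response += chunk
    sock.close()
    header, _, body = response.partition(b"\r\n\r\n")
    print(header.decode(errors="replace"))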
Crawl “all” Web pages?
• Problem: no catalog of all accessible URLs on the Web
• Solution (see the sketch below):
  – start from a given set of URLs
  – progressively fetch and scan them for new outlinking URLs
  – fetch these pages in turn
  – submit the text in each page to a text indexing system
  – and so on...
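A minimal sketch of this fetch-scan-enqueue loop; the helper index_text() is a stand-in for a text indexing system, and link extraction uses a simple regular expression rather than a full HTML parser:

    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def index_text(url, text):
        print("indexed", url, len(text), "characters")   # stub for the text indexing system

    def crawl(seed_urls, max_pages=100):
        frontier = deque(seed_urls)          # URLs waiting to be fetched
        seen = set(seed_urls)                # avoid re-adding known URLs
        while frontier and max_pages > 0:
            url = frontier.popleft()
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue                     # skip pages that fail to fetch
            max_pages -= 1
            index_text(url, html)            # submit the text in the page for indexing
            # scan for outlinking URLs and fetch these pages in turn
            for href in re.findall(r'href=["\'](.*?)["\']', html, re.I):
                link = urljoin(url, href)
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    frontier.append(link)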
Crawling procedure
• The procedure is simple, but
  – a great deal of engineering goes into industry-strength crawlers
  – industry crawlers crawl a substantial fraction of the Web
  – e.g.: Alta Vista, Northern Lights, Inktomi
• No guarantee that all accessible Web pages will be located in this fashion
• The crawler may never halt
  – pages will be added continually even as it is running
Crawling overheads
• Delays involved in
  – resolving the host name in the URL to an IP address using DNS
  – connecting a socket to the server and sending the request
  – receiving the requested page in response
• Solution: overlap the above delays by
  – fetching many pages at the same time
Anatomy of a crawler
• Page fetching threads
  – start with DNS resolution
  – finish when the entire page has been fetched
• Each page
  – stored in compressed form to disk/tape
  – scanned for outlinks
• Work pool of outlinks
  – maintain network utilization without overloading it
  – dealt with by a load manager
• Continue till the crawler has collected a sufficient number of pages
Typical anatomy of a large-scale crawler.
Large-scale crawlers: performance and reliability considerations
• Need to fetch many pages at the same time
  – to utilize the network bandwidth
  – a single page fetch may involve several seconds of network latency
• Highly concurrent and parallelized DNS lookups
• Use of asynchronous sockets
  – explicit encoding of the state of a fetch context in a data structure
  – polling sockets to check for completion of network transfers
  – multi-processing or multi-threading: impractical
• Care in URL extraction
  – eliminating duplicates to reduce redundant fetches
  – avoiding "spider traps"
DNS caching, pre-fetching and resolution
• A customized DNS component with:
  1. Custom client for address resolution
  2. Caching server
  3. Prefetching client
Custom client for address resolution
• Tailored for concurrent handling of multiple outstanding requests
• Allows issuing many resolution requests together
  – polling at a later time for completion of individual requests
• Facilitates load distribution among many DNS servers
Caching server
• With a large cache, persistent across DNS restarts
• Residing largely in memory if possible
Prefetching client
• Steps
  1. Parse a page that has just been fetched
  2. Extract host names from HREF targets
  3. Make DNS resolution requests to the caching server
• Usually implemented using UDP
  – User Datagram Protocol: a connectionless, packet-based communication protocol
  – does not guarantee packet delivery
• Does not wait for the resolution to be completed
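A minimal sketch of a prefetching client built on a thread pool and an in-process cache; the real component described above talks to a caching DNS server over UDP, whereas this illustration simply warms a local dictionary using the blocking resolver (the helper names are assumptions):

    import re
    import socket
    from concurrent.futures import ThreadPoolExecutor
    from urllib.parse import urljoin, urlsplit

    dns_cache = {}                                   # hostname -> IP address (warm cache)
    resolver_pool = ThreadPoolExecutor(max_workers=32)

    def _resolve(host):
        try:
            dns_cache[host] = socket.gethostbyname(host)   # blocking lookup, off the main thread
        except OSError:
            pass                                           # prefetching is best-effort, like UDP

    def prefetch_dns(base_url, html):
        # 1. parse the page that has just been fetched, 2. extract host names from HREF targets
        for href in re.findall(r'href=["\'](.*?)["\']', html, re.I):
            host = urlsplit(urljoin(base_url, href)).hostname
            # 3. issue the resolution request without waiting for it to complete
            if host and host not in dns_cache:
                resolver_pool.submit(_resolve, host)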
Multiple concurrent fetches
• Managing multiple concurrent connections
  – a single download may take several seconds
  – open many socket connections to different HTTP servers simultaneously
• Multi-CPU machines not very useful
  – crawling performance is limited by network and disk
• Two approaches
  1. using multi-threading
  2. using non-blocking sockets with event handlers
Multi-threading
• Logical threads
  – physical threads of control provided by the operating system (e.g., pthreads), OR
  – concurrent processes
• Fixed number of threads allocated in advance
• Programming paradigm (sketched below)
  – create a client socket
  – connect the socket to the HTTP service on a server
  – send the HTTP request header
  – read the socket (recv) until no more characters are available
  – close the socket
• Uses blocking system calls
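A minimal sketch of this paradigm, assuming a shared work queue of (host, path) pairs; each worker thread follows exactly the five blocking steps above (the queue contents are illustrative):

    import socket
    import threading
    from queue import Queue

    work = Queue()
    work.put(("example.com", "/"))           # illustrative work item

    def fetch_worker():
        while True:
            host, path = work.get()
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # create a client socket
            sock.connect((host, 80))                                    # connect to the HTTP service
            sock.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())  # request header
            page = b""
            while True:
                data = sock.recv(4096)                                  # blocking read ...
                if not data:                                            # ... until no more characters
                    break
                page += data
            sock.close()                                                # close the socket
            print(host, path, len(page), "bytes")
            work.task_done()

    # a fixed number of threads allocated in advance
    for _ in range(4):
        threading.Thread(target=fetch_worker, daemon=True).start()
    work.join()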
Multi-threading: Problems
• Performance penalty
  – mutual exclusion for concurrent access to data structures
  – slow disk seeks
    • a great deal of interleaved, random input-output on disk
    • due to concurrent modification of the document repository by multiple threads
Non-blocking sockets and event handlers
• Non-blocking sockets
  – connect, send or recv calls return immediately without waiting for the network operation to complete
  – poll the status of the network operation separately
• "select" system call
  – lets the application suspend until more data can be read from or written to a socket
  – timing out after a pre-specified deadline
  – monitors several sockets at the same time
• More efficient memory management
  – code that completes processing is not interrupted by other completions
  – no need for locks and semaphores on the pool
  – only complete pages are appended to the log
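A minimal sketch of this event-driven style using Python's select on non-blocking sockets; the fetch state (one buffer per socket) is kept explicitly in a dictionary, and a page is processed only after its transfer completes (the target hosts are illustrative):

    import select
    import socket

    hosts = ["example.com", "example.org"]      # illustrative targets
    buffers = {}                                 # explicit fetch state: socket -> bytes received

    for host in hosts:
        s = socket.socket()
        s.setblocking(False)                     # non-blocking: connect returns immediately
        s.connect_ex((host, 80))
        buffers[s] = b""

    pending_request = {s: f"GET / HTTP/1.0\r\nHost: {h}\r\n\r\n".encode()
                       for s, h in zip(buffers, hosts)}

    while buffers:
        readable, writable, _ = select.select(list(buffers), list(pending_request), [], 5.0)
        for s in writable:                       # connection completed: send the request header
            s.sendall(pending_request.pop(s))
        for s in readable:
            data = s.recv(4096)
            if data:
                buffers[s] += data               # accumulate into this socket's own buffer
            else:                                # transfer complete: process without interruption
                print(len(buffers.pop(s)), "bytes fetched")
                s.close()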
Link extraction and normalization
• Goal: obtaining a canonical form of a URL
• URL processing and filtering
  – avoid multiple fetches of pages known by different URLs
• A host may have many IP addresses
  – for load balancing on large sites
• Mirrored contents / contents on the same file system
• "Proxy pass"
  – mapping of different host names to a single IP address
  – needed to publish many logical sites
• Relative URLs
  – need to be interpreted w.r.t. a base URL
Canonical URL
• Formed by
  – using a standard string for the protocol
  – canonicalizing the host name
  – adding an explicit port number
  – normalizing and cleaning up the path
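A minimal sketch of such a canonicalization step under simple assumptions (lower-case scheme and host, explicit port, '.'/'..' path segments collapsed); real crawlers apply many more site-specific rules:

    from posixpath import normpath
    from urllib.parse import urlsplit, urlunsplit

    def canonical_url(url):
        parts = urlsplit(url)
        scheme = parts.scheme.lower() or "http"               # standard string for the protocol
        host = (parts.hostname or "").lower()                 # canonicalize the host name
        port = parts.port or (443 if scheme == "https" else 80)   # explicit port number
        path = normpath(parts.path) if parts.path else "/"    # normalize and clean up the path
        if parts.path.endswith("/") and path != "/":
            path += "/"
        return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))

    print(canonical_url("HTTP://www.CSE.iitb.ac.in/a/b/../c"))
    # -> http://www.cse.iitb.ac.in:80/a/c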
Robot exclusion
• Check whether the server prohibits crawling a normalized URL
  – in the robots.txt file in the HTTP root directory of the server
  – specifies a list of path prefixes which crawlers should not attempt to fetch
• Meant for crawlers only
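A minimal sketch of this check using the standard library's urllib.robotparser (the crawler name and URLs are illustrative):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("http://www.example.com/robots.txt")  # robots.txt at the HTTP root
    rp.read()                                                   # fetch and parse the path prefixes

    if rp.can_fetch("MyCrawler", "http://www.example.com/private/page.html"):
        print("allowed to fetch")
    else:
        print("disallowed by robot exclusion")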
Eliminating already-visited URLs
• Checking if a URL has already been fetched
  – before adding a new URL to the work pool
  – needs to be very quick
  – achieved by computing an MD5 hash function on the URL
• Exploiting spatio-temporal locality of access
  – two-level hash function
    • most significant bits (say, 24) derived by hashing the host name plus port
    • lower-order bits (say, 40) derived by hashing the path
    • concatenated bits used as a key in a B-tree
• Qualifying URLs added to the frontier of the crawl
• Hash values added to the B-tree
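A minimal sketch of the two-level key, assuming 24 bits from the host-plus-port hash and 40 bits from the path hash; an in-memory set stands in for the B-tree so URLs from the same host share a key prefix:

    import hashlib
    from urllib.parse import urlsplit

    def visit_key(url):
        parts = urlsplit(url)
        host = f"{parts.hostname}:{parts.port or 80}"
        # most significant 24 bits: hash of the host name plus port
        hi = int.from_bytes(hashlib.md5(host.encode()).digest(), "big") >> (128 - 24)
        # lower-order 40 bits: hash of the path
        lo = int.from_bytes(hashlib.md5(parts.path.encode()).digest(), "big") >> (128 - 40)
        return (hi << 40) | lo          # 64-bit key; same-host URLs share a common prefix

    visited = set()                      # stand-in for the B-tree of hash values

    def already_seen(url):
        key = visit_key(url)
        if key in visited:
            return True
        visited.add(key)                 # qualifying URL: record its hash, add the URL to the frontier
        return False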
Spider traps
• Protecting from crashing on
  – ill-formed HTML
    • e.g., a page with 68 kB of null characters
  – misleading sites
    • an indefinite number of pages dynamically generated by CGI scripts
    • paths of arbitrary depth created using soft directory links and path remapping features in the HTTP server
Spider Traps: Solutions
• No automatic technique can be foolproof
• Check for URL length
• Guards (a simple filter is sketched below)
  – preparing regular crawl statistics
  – adding dominating sites to a guard module
  – disabling crawling of active content such as CGI form queries
  – eliminating URLs with non-textual data types
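A minimal sketch of such a guard, with illustrative limits on URL length and path depth plus an assumed blocklist of extensions and query strings; real guard modules are driven by crawl statistics:

    from urllib.parse import urlsplit

    MAX_URL_LEN = 256                      # assumed limit on URL length
    MAX_PATH_DEPTH = 10                    # assumed limit against arbitrarily deep paths
    NON_TEXTUAL = (".jpg", ".png", ".gif", ".zip", ".exe", ".mp3")
    BLOCKED_HOSTS = set()                  # dominating sites added here from crawl statistics

    def passes_guard(url):
        if len(url) > MAX_URL_LEN:
            return False
        parts = urlsplit(url)
        if parts.hostname in BLOCKED_HOSTS:
            return False
        if parts.query or "cgi-bin" in parts.path:      # active content such as CGI form queries
            return False
        if parts.path.lower().endswith(NON_TEXTUAL):    # non-textual data types
            return False
        return parts.path.count("/") <= MAX_PATH_DEPTH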
Avoiding repeated expansion of links on duplicate pages
• Reduce redundancy in crawls
• Duplicate detection
  – mirrored Web pages and sites
• Detecting exact duplicates
  – checking against MD5 digests of stored pages
  – representing a relative link v (relative to aliases u1 and u2) as tuples (h(u1), v) and (h(u2), v)
• Detecting near-duplicates
  – even a single altered character will completely change the digest!
    • e.g., date of update, name and email of the site administrator
  – solution: shingling
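A minimal sketch of shingling: each document is reduced to the set of its overlapping w-word windows, and resemblance is the Jaccard overlap of those sets (the window size 4 is an illustrative choice; production systems compare sampled shingle fingerprints rather than full sets):

    def shingles(text, w=4):
        words = text.lower().split()
        return {tuple(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}

    def resemblance(a, b, w=4):
        sa, sb = shingles(a, w), shingles(b, w)
        return len(sa & sb) / len(sa | sb)

    page1 = "Last updated 1 Jan by the site administrator. Welcome to our mirror of the docs."
    page2 = "Last updated 2 Feb by the site administrator. Welcome to our mirror of the docs."
    print(resemblance(page1, page2))   # fraction of shared shingles; near-duplicates score high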
Load monitor
• Keeps track of various system statistics
  – recent performance of the wide area network (WAN) connection
    • e.g., latency and bandwidth estimates
  – operator-provided/estimated upper bound on open sockets for a crawler
  – current number of active sockets
Thread manager
• Responsible for
  – choosing units of work from the frontier
  – scheduling the issue of network resources
  – distribution of these requests over multiple ISPs if appropriate
• Uses statistics from the load monitor
Per-server work queues
• Denial of service (DoS) attacks
  – servers protect themselves by limiting the speed or frequency of responses to any fixed client IP address
• To avoid looking like a DoS attack
  – limit the number of active requests to a given server IP address at any time
  – maintain a queue of requests for each server (as sketched below)
    • use the HTTP/1.1 persistent socket capability
  – distribute attention relatively evenly between a large number of sites
• Access locality vs. politeness dilemma
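A minimal sketch of per-server queues with a politeness delay between requests to the same host; the 1-second gap and the single active request per host are illustrative policy choices:

    import time
    from collections import defaultdict, deque

    POLITENESS_DELAY = 1.0                    # assumed minimum gap between hits on one server

    queues = defaultdict(deque)               # host -> queue of pending URLs
    last_hit = defaultdict(float)             # host -> time of the most recent request

    def enqueue(host, url):
        queues[host].append(url)

    def next_url():
        """Pick a URL from some host that has not been contacted too recently."""
        now = time.monotonic()
        for host, q in queues.items():
            if q and now - last_hit[host] >= POLITENESS_DELAY:
                last_hit[host] = now
                return q.popleft()            # at most one request per host per delay window
        return None                           # every eligible host is still in its politeness window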
Text repository
• Crawler's last task
  – dumping fetched pages into a repository
• Decoupling the crawler from other functions is preferred for efficiency and reliability
• Page-related information stored in two parts
  – meta-data
  – page contents
Storage of page-related information
• Meta-data
  – relational in nature
    • usually managed by custom software to avoid relational database system overheads
    • the text index involves bulk updates
  – includes fields like content-type, last-modified date, content-length, HTTP status code, etc.
Page contents storage
• A typical HTML Web page compresses to 2-4 kB (using zlib)
• File systems have a 4-8 kB file block size
  – too large!
• Page storage managed by a custom storage manager
  – simple access methods for
    • the crawler to add pages
    • subsequent programs (the indexer etc.) to retrieve documents
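A minimal sketch of such a storage manager: pages are zlib-compressed and appended to a single sequential log file, with an in-memory index from URL to (offset, length) so that later programs can retrieve documents (the file name is illustrative):

    import zlib

    class PageStore:
        """Append-only log of compressed page records, keyed by URL."""

        def __init__(self, path="pages.log"):
            self.log = open(path, "ab+")
            self.index = {}                           # url -> (offset, compressed length)

        def add(self, url, html):                     # used by the crawler to add pages
            data = zlib.compress(html.encode())
            self.log.seek(0, 2)                       # append at the end of the log
            offset = self.log.tell()
            self.log.write(data)
            self.index[url] = (offset, len(data))

        def get(self, url):                           # used by the indexer to retrieve documents
            offset, length = self.index[url]
            self.log.seek(offset)
            return zlib.decompress(self.log.read(length)).decode()

    store = PageStore()
    store.add("http://www.example.com/", "<html>hello</html>")
    print(store.get("http://www.example.com/"))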
Page Storage
• Small-scale systems
  – repository fits within the disks of a single machine
  – use of a storage manager (e.g., Berkeley DB)
    • manages disk-based databases within a single file
    • configured as a hash-table/B-tree keyed on the URL, for keyed access to pages
    • or configured as a sequential log of page records
      – since the indexer can handle pages in any order
Page Storage
• Large-scale systems
  – repository distributed over a number of storage servers
  – storage servers
    • connected to the crawler through a fast local network (e.g., Ethernet)
    • hashed by URLs
  – `T3' grade leased lines
    • to handle 10 million pages (40 GB) per hour
Large-scale crawlers often use multiple ISPs and a bank of local storage
servers to store the pages crawled.
Refreshing crawled pages
• The search engine's index should be fresh
• A Web-scale crawler never `completes' its job
• High variance in the rate of page changes
• "If-modified-since" request header in the HTTP protocol
  – impractical for a crawler to issue one such request per page just to check
• Solution
  – at the commencement of a new crawling round, estimate which pages have changed
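A minimal sketch of a conditional re-fetch with the If-Modified-Since header using urllib (the URL and timestamp are illustrative); a 304 status means the stored copy is still fresh:

    import urllib.request
    from urllib.error import HTTPError

    req = urllib.request.Request(
        "http://www.example.com/",                       # illustrative URL
        headers={"If-Modified-Since": "Sat, 01 Jan 2022 00:00:00 GMT"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            page = resp.read()                           # page changed: 200 with fresh contents
            print("modified, fetched", len(page), "bytes")
    except HTTPError as err:
        if err.code == 304:
            print("not modified since last crawl")       # keep the stored copy
        else:
            raise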
Determining page changes
• "Expires" HTTP response header
  – for pages that come with an expiry date
• Otherwise, need to guess whether revisiting the page will yield a modified version
  – maintain a score reflecting the probability of the page having been modified
  – the crawler fetches URLs in decreasing order of score
  – assumption: the recent past predicts the future
Estimating page change rates
• Brewington and Cybenko; Cho
  – algorithms for maintaining a crawl in which most pages are fresher than a specified epoch
• Prerequisite
  – the average interval at which the crawler checks for changes is smaller than the inter-modification time of a page
• Small-scale intermediate crawler runs
  – to monitor fast-changing sites
    • e.g., current news, weather, etc.
  – intermediate indices patched into the master index
Putting together a crawler
• Reference implementation of the HTTP client protocol
  – World Wide Web Consortium (http://www.w3c.org/)
  – w3c-libwww package
Design of the core components: Crawler class
• To copy bytes from network sockets to storage media
• Three methods express the Crawler's contract with the user
  – pushing a URL to be fetched to the Crawler (fetchPush)
  – termination callback handler (fetchDone) called with the same URL
  – a method (start) which starts the Crawler's event loop
• Implementation of the Crawler class
  – needs two helper classes called DNS and Fetch
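A minimal sketch of this contract in Python rather than the book's C++-style interface: fetchPush queues a URL, fetchDone is invoked as a callback when the fetch terminates, and start runs the event loop (the fetching itself is stubbed out; the DNS and Fetch helpers are not modeled):

    from collections import deque

    class Crawler:
        """Skeleton of the Crawler contract: fetchPush, fetchDone, start."""

        def __init__(self):
            self.frontier = deque()

        def fetchPush(self, url):
            """Push a URL to be fetched to the Crawler."""
            self.frontier.append(url)

        def fetchDone(self, url, page, success):
            """Termination callback, called with the same URL once its fetch ends."""
            print("done:", url, "ok" if success else "failed", len(page), "bytes")

        def start(self):
            """Start the Crawler's event loop."""
            while self.frontier:
                url = self.frontier.popleft()
                # a real implementation would hand the URL to DNS and Fetch helpers
                # and invoke fetchDone asynchronously when the transfer completes
                self.fetchDone(url, page=b"", success=True)

    c = Crawler()
    c.fetchPush("http://www.example.com/")
    c.start()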