Transcript Document

These slides are based on the book “Mining the Web”
by Soumen Chakrabarti
Refer to “Crawling the Web” Chapter for more information
The amount of information
the World Wide Web Worm indexed 110K pages
: millions of pages
: billions of pages
: ???? of pages
Complexity of the link graph
HTTP : Hypertext Transfer Protocol
TCP : Transmission Control Protocol
IP : Internet Protocol
HTML : Hypertext Markup Language
URL : Uniform Resource Locator
<a href="">The IIT Bombay Computer Science Department</a>
Server host
File path
A click on the hyperlink is converted to a network request by the
browser, which then fetches and displays the web page pointed to by the URL
The server host name must first be translated into
an IP address before the browser can contact the server using TCP
DNS (Domain Name Service) is a distributed
database of name-to-IP address mappings
This database is maintained by known servers
A click on the hyperlink is translated into
telnet 80
80 is the default HTTP port
MIME Header
MIME: Multipurpose Internet Mail
Extensions, a standard for email
and web content transfer.
There is no directory of all accessible URLs
The main strategy is to
start from a set of seed web pages
Extract URLs from those pages
Apply the same technique to the pages fetched from those URLs
It may not be possible to retrieve all the pages on the Web with
this technique, since new pages are added every day
Use a queue structure and mark
Visited nodes
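A minimal sketch of this seed-and-expand strategy in Python, assuming a hypothetical fetch_links(url) helper that downloads a page and returns the URLs it contains:

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: a FIFO queue holds the frontier and a
    visited set marks nodes so each page is fetched at most once."""
    frontier = deque(seeds)
    visited = set()
    order = []                          # pages in the order they were crawled
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue                    # already reached via another path
        visited.add(url)
        order.append(url)
        for link in fetch_links(url):   # download page, extract its URLs
            if link not in visited:
                frontier.append(link)
    return order
```

Injecting fetch_links keeps the traversal logic separate from network I/O, so it can be exercised against a fake link graph.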
Writing a basic crawler is easy
Writing a large-scale crawler is challenging
Following are the basic steps of crawling
URL to IP conversion using the DNS server
Socket connection to the server and sending the request
Receiving the requested page
For small pages, DNS lookup and socket connection take more
time than receiving the requested page
We need to overlap the processing and waiting times for the
above three steps.
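The three steps can be sketched with raw sockets; this is an illustrative fragment (not the book's code), fetch performs real network I/O, and error handling is omitted:

```python
import socket

def build_request(host, path):
    # Minimal HTTP/1.0 GET request; HTTP/1.0 means the server closes
    # the connection at the end of the body, so reading until EOF works.
    return (f"GET {path} HTTP/1.0\r\n"
            f"Host: {host}\r\n"
            f"Connection: close\r\n\r\n").encode("ascii")

def fetch(host, path="/", port=80):
    ip = socket.gethostbyname(host)                          # step 1: DNS lookup
    sock = socket.create_connection((ip, port), timeout=10)  # step 2: connect,
    sock.sendall(build_request(host, path))                  #         send request
    chunks = []
    while True:                                              # step 3: receive page
        data = sock.recv(4096)
        if not data:                                         # EOF from server
            break
        chunks.append(data)
    sock.close()
    return b"".join(chunks)
```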
Storage requirements are huge
Need to store the list of URLs and the retrieved pages in the repository
Storing the URLs on disk is also needed for persistence
Pages are stored in compressed form (Google uses zlib for
compression, achieving roughly 3-to-1)
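Python's standard zlib module illustrates the compression involved; the roughly 3-to-1 figure is for typical HTML, and the highly repetitive sample below compresses far better than that:

```python
import zlib

# A page with lots of repeated markup, as real HTML tends to have.
page = b"<html><body>" + b"<p>the quick brown fox</p>" * 200 + b"</body></html>"
compressed = zlib.compress(page)         # deflate-compressed bytes for the repository
restored = zlib.decompress(compressed)   # lossless round trip on retrieval
```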
Large Scale Crawler Tips
Fetch hundreds of pages at the same time to increase bandwidth
Use more than one DNS server for concurrent DNS lookup
Using asynchronous sockets is better than multi-threading
Eliminate duplicates to reduce the number of redundant fetches and to
avoid spider traps (infinite set of fake URLs)
DNS Caching
Address mapping is a significant bottleneck
A crawler can generate more requests per unit time than a DNS server
can handle
Caching the DNS entries helps
DNS cache needs to be refreshed periodically (e.g., whenever the crawler is idle)
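A cache in front of the resolver can be sketched as follows; the TTL value is an arbitrary choice, and the resolver is injectable so the caching logic can be exercised without a live DNS server:

```python
import socket
import time

class DNSCache:
    """Cache hostname -> IP mappings with a time-to-live, so repeated
    lookups for the same host do not hit the DNS server."""

    def __init__(self, ttl=3600, resolver=socket.gethostbyname):
        self.ttl = ttl
        self.resolver = resolver
        self.entries = {}              # host -> (ip, expiry time)

    def lookup(self, host):
        now = time.monotonic()
        entry = self.entries.get(host)
        if entry and entry[1] > now:
            return entry[0]            # fresh cached answer
        ip = self.resolver(host)       # miss or stale: ask the DNS server
        self.entries[host] = (ip, now + self.ttl)
        return ip
```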
Concurrent page requests
Can be achieved by
Non-blocking sockets with event handlers
A set of threads are created
After the server name is translated to an IP address,
a thread creates a client socket,
connects to the HTTP service on the server,
sends the HTTP request header,
reads the socket until EOF, and
closes the socket
Blocking system calls suspend the thread until the requested
data is available
A fixed number of worker threads share a work-queue of pages to fetch
Handling concurrent access to shared data structures is a problem: mutual
exclusion needs to be handled properly
Disk accesses from multiple concurrent threads cannot easily be
orchestrated
Non-blocking sockets could be a better approach!
Non-blocking sockets
Connect, send, and receive calls will return immediately without
blocking for network data
The status of the network can be polled later on
The select() system call lets the application wait for data to be available on
any of the sockets
This way, completion of page fetching is serialized
No need for locks or semaphores
Completed pages can be appended to the file on disk without interleaved writes
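One way to sketch this event-driven loop with Python's select module; the sockets are assumed to be already connected, and error handling is omitted:

```python
import select

def fetch_all(requests):
    """Drive many page fetches from a single thread using non-blocking
    sockets and select(). Each completed read is handled one at a time,
    so the response buffers need no locks or semaphores.
    `requests` maps an already-connected socket to its request bytes."""
    pending = dict(requests)               # socket -> bytes left to send
    responses = {s: b"" for s in requests}
    open_socks = set(requests)
    for s in open_socks:
        s.setblocking(False)               # calls return immediately
    while open_socks:
        want_write = [s for s in open_socks if pending.get(s)]
        readable, writable, _ = select.select(list(open_socks), want_write, [], 5)
        for s in writable:                 # socket can accept data now
            sent = s.send(pending[s])
            pending[s] = pending[s][sent:]
        for s in readable:                 # data (or EOF) is available
            data = s.recv(4096)
            if data:
                responses[s] += data
            else:                          # EOF: this page fetch is complete
                s.close()
                open_socks.discard(s)
    return responses
```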
Link Extraction and Normalization
An HTML page is searched for links to add to the work-pool
URLs extracted from pages need to be preprocessed before they are
added to the work-pool
Duplicate elimination is necessary but difficult,
since the mapping between hostnames and IP addresses is many-to-many, i.e., a computer
may have many IP addresses and many hostnames
Extracted URLs are converted to canonical form by
Using the canonical hostname provided by the DNS response
Adding an explicit port number
Converting the relative addresses to absolute addresses
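A canonicalization sketch using Python's urllib.parse; the default-port table and the choice to drop fragments are assumptions here, and a real crawler would also substitute the canonical hostname from the DNS response:

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}   # assumed default-port table

def canonicalize(base_url, href):
    """Resolve a relative link against the page it appears on,
    lower-case the host, make the port explicit, default an empty
    path to "/", and drop the fragment."""
    absolute = urljoin(base_url, href)       # relative -> absolute
    parts = urlsplit(absolute)
    host = parts.hostname or ""              # .hostname lower-cases the host
    port = parts.port or DEFAULT_PORTS.get(parts.scheme, 0)
    path = parts.path or "/"
    return urlunsplit((parts.scheme, f"{host}:{port}", path, parts.query, ""))
```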
Some more tips
A server may disallow crawling via a “robots.txt” file found in its HTTP root directory
Robots.txt specifies a list of path prefixes that crawlers should not try
to fetch
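Python's standard urllib.robotparser can enforce these rules; the robots.txt content below is a made-up example, and a real crawler would first download the file from the server's root:

```python
from urllib import robotparser

# Hypothetical robots.txt content; a real crawler fetches it from
# http://<server>/robots.txt before crawling that server.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""
rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())   # rp.can_fetch(agent, url) now applies the rules
```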
Eliminating already visited URLs
The IsUrlVisited module in the architecture does that job
The same page could be linked from many different sites
Checking whether the page has already been visited eliminates redundant page fetches
Comparing raw URL strings may take a long time, since it involves disk
access and checking against all the stored URLs
Eliminating already visited URLs
Duplicate checking is done by applying a hash function, MD5, originally
designed for digital signature applications
MD5 algorithm takes a message of arbitrary length as input and
produces a 128-bit "fingerprint" or "message digest" as output
“it is computationally infeasible to produce two messages having the
same message digest”
Even the hashed URLs need to be stored in disk due to storage and
persistency requirements
Spatial and temporal locality of URL accesses means fewer disk
accesses when URL hashes are cached
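A sketch of MD5-based duplicate detection with Python's hashlib; the in-memory set stands in for the disk-backed store the slides describe:

```python
import hashlib

def fingerprint(url):
    """128-bit MD5 digest of the URL as a 32-hex-character string;
    fixed-length fingerprints are stored instead of raw URLs."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

seen = set()   # stands in for the disk-backed store of visited hashes

def is_url_visited(url):
    """Return True if the URL was seen before; otherwise record it."""
    h = fingerprint(url)
    if h in seen:
        return True
    seen.add(h)
    return False
```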
Eliminating already visited URLs
We need to utilize spatial locality as much as possible
But MD5 distributes even similar URL strings uniformly over
its range
A two-block (two-level) hash function is used:
use different hash functions for the host address and the path
A B-tree could be used to index the hostnames, so that a retrieved disk page
contains the URLs belonging to the same host
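One possible shape for such a two-level hash; the 8-/16-hex-digit split is an arbitrary choice for illustration:

```python
import hashlib

def two_level_key(host, path):
    """Hash host and path separately and concatenate them: all URLs
    from the same host share a key prefix, so they sort next to each
    other (e.g. as adjacent B-tree keys), preserving spatial locality."""
    host_part = hashlib.md5(host.encode()).hexdigest()[:8]    # host prefix
    path_part = hashlib.md5(path.encode()).hexdigest()[:16]   # path suffix
    return host_part + path_part
```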
Spider Traps
Malicious pages designed to crash the crawlers
Simply adding 64K of null characters in the middle of a URL can crash the crawler's lexical analyzer
Infinitely deep web sites
Using dynamically generated links via CGI scripts
Need to check the link length
No technique is foolproof
Generate periodic statistics for the crawler to eliminate dominating sites
Disable crawling active content
Avoiding duplicate pages
A page can be accessed via different URLs
Eliminating duplicate pages will also help eliminate spider traps
MD5 can be used for that purpose
Minor changes cannot be detected with a whole-page MD5
Can divide the page into blocks and hash each block instead
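A fixed-size-block sketch of this idea; the block size is arbitrary, and real systems use more robust schemes (such as shingling), since fixed offsets shift under insertions:

```python
import hashlib

def block_digests(text, block_size=64):
    """Split the page into fixed-size blocks and hash each one, so two
    pages sharing most of their blocks are flagged as near-duplicates
    even when a whole-page MD5 would differ."""
    blocks = [text[i:i + block_size] for i in range(0, len(text), block_size)]
    return {hashlib.md5(b.encode()).hexdigest() for b in blocks}

def similarity(a, b):
    """Jaccard overlap of the two pages' block-digest sets, in [0, 1]."""
    da, db = block_digests(a), block_digests(b)
    return len(da & db) / max(len(da | db), 1)
```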
Denial of Service
HTTP servers protect themselves against denial of service (DoS)
A DoS attack sends frequent requests to the same server to slow
down its operation
Therefore, frequent requests from the same IP address may be blocked
Crawlers need to take this into account, both out of courtesy and to avoid being blocked or facing legal action
Need to limit the active requests to a given server IP address at any time
Maintain a queue of requests for each server
This will also reduce the effect of spider traps
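The per-server queues and the cap on in-flight requests might be sketched like this; the limit of 2 is an arbitrary choice:

```python
from collections import defaultdict, deque

class PerServerQueues:
    """One FIFO queue per server IP, plus a cap on in-flight requests
    to each, so the crawler never hammers a single host and a spider
    trap cannot monopolize the fetch pool."""

    def __init__(self, max_active_per_server=2):
        self.max_active = max_active_per_server
        self.queues = defaultdict(deque)   # ip -> pending URLs
        self.active = defaultdict(int)     # ip -> requests in flight

    def enqueue(self, ip, url):
        self.queues[ip].append(url)

    def next_request(self):
        """Pick a URL from any server that is under its limit."""
        for ip, q in self.queues.items():
            if q and self.active[ip] < self.max_active:
                self.active[ip] += 1
                return ip, q.popleft()
        return None                        # every pending server is saturated

    def done(self, ip):
        self.active[ip] -= 1               # a fetch to this server finished
```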
Text Repository
The pages that are fetched are dumped into a text repository
The text repository is significantly large
Needs to be compressed (Google uses zlib for roughly 3-to-1 compression)
Google implements its own file system
Berkeley DB can also be used
Stores a database within a single file
Provides several access methods such as B-tree or sequential
Refreshing Crawled Pages
The HTTP protocol can be used to check whether a page has changed since the last time
it was crawled
But issuing an HTTP request per page just to check for modification takes a lot of time
If a page expires after a certain time, this can be extracted from the
HTTP header
If we had a score reflecting the probability that a page has changed since it
was last visited,
we could sort the pages w.r.t. that score and crawl them in that order
Use the past behavior to model the future!
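Both ideas can be illustrated with small helpers: a conditional-GET header (the server answers 304 Not Modified, with no body, if the page is unchanged) and a refresh ordering by change score; how the scores are estimated from past behavior is outside this sketch:

```python
from email.utils import formatdate

def conditional_get_headers(last_crawled_epoch):
    """If-Modified-Since header for an HTTP conditional GET: the server
    replies 304 Not Modified (no body) if the page is unchanged since
    the given Unix time, saving the whole transfer."""
    return {"If-Modified-Since": formatdate(last_crawled_epoch, usegmt=True)}

def refresh_order(change_scores):
    """Sort URLs by estimated probability of change, highest first, so
    the refresh crawl visits the most volatile pages soonest.
    `change_scores` maps url -> score in [0, 1]."""
    return sorted(change_scores, key=change_scores.get, reverse=True)
```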
Your crawler
Use the w3c-libwww API to implement your crawler
Start from a very simple implementation and build up from there!
Sample codes and algorithms are provided in the handouts