
Crawling the Web: problems and techniques
Claudio Scordino
Ph.D. Student
Computer Science Department - University of Pisa
May 2004
Outline
• Introduction
• Crawler architectures
- Increasing the throughput
• What pages we do not want to fetch
- Spider traps
- Duplicates
- Mirrors
Introduction
Job of a crawler (or spider): fetching the Web
pages to a computer where they will be analyzed
The algorithm is conceptually simple, but…
…it’s a complex and underestimated activity
Famous Crawlers
• Mercator (Compaq, Altavista)
 Java
 Modular (components loaded dynamically)
 Priority-based scheduling of URL downloads
- The algorithm is a pluggable component
 Different processing modules for different contents
 Checkpointing
- Allows the crawler to recover its state after a failure
- In a distributed crawler, checkpointing is performed by the Queen
Famous Crawlers
• GoogleBot (Stanford, Google)
 C/C++
• WebBase (Stanford)
• HiWE: Hidden Web Exposer (Stanford)
• Heritrix (Internet Archive)
 http://www.crawler.archive.org/
Famous Crawlers
• Sphinx
 Java
 Visual and interactive environment
 Relocatable: capable of executing on a remote host
 Site-specific
- Customizable crawling
- Classifiers: site-specific content analyzers
1. Links to follow
2. Parts to process
- Not scalable
Crawler Architecture
[Figure: seed URLs feed the URL FRONTIER; the SCHEDULER (with Load Monitor and Crawl Metadata) dispatches URLs to DNS/HTTP RETRIEVERS; the PARSER's HREFs extractor and normalizer produces Citations and new URLs, which pass through the URL Filter and the Duplicate URL Eliminator back into the frontier; host information comes from the Internet.]
Annoyed Web masters
Web server administrators could be annoyed by:
1. Server overload
- Solution: per-server queues
2. Fetching of private pages
- Solution: Robot Exclusion Protocol (see the sketch below)
- File: /robots.txt
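A minimal sketch of honouring the Robot Exclusion Protocol with Python's standard urllib.robotparser; the crawler name and URLs below are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's /robots.txt once per host.
rp = RobotFileParser()
rp.set_url("http://example.com/robots.txt")   # placeholder host
rp.read()

# Ask before enqueueing a URL for download.
if rp.can_fetch("MyCrawler", "http://example.com/private/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")
```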
Crawler Architecture
[Figure: the previous architecture extended with per-server queues and a Robots (robots.txt) component.]
Mercator’s scheduler
• FRONT-END: prioritizes URLs with a value between 1 and k
• BACK-END: ensures politeness (no server overload)
- Queues containing URLs of only a single host
- Specifies when a server may be contacted again (see the sketch below)
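A rough sketch of the back-end idea only (per-host queues plus a "next allowed contact" time), not Mercator's actual implementation; the courtesy delay is an arbitrary parameter:

```python
import heapq
import time
from collections import defaultdict, deque

class PoliteBackEnd:
    """Back-end politeness: per-host FIFO queues plus a heap of (next allowed contact time, host)."""

    def __init__(self, delay=2.0):
        self.delay = delay                      # courtesy pause between two hits to the same host
        self.queues = defaultdict(deque)        # host -> URLs of that single host only
        self.ready = []                         # heap of (time when the host may be contacted, host)

    def add(self, host, url):
        if not self.queues[host]:               # host becomes non-empty: make it schedulable
            heapq.heappush(self.ready, (time.time(), host))
        self.queues[host].append(url)

    def next_url(self):
        """Wait until some host may be contacted again, then return one of its URLs."""
        when, host = heapq.heappop(self.ready)
        time.sleep(max(0.0, when - time.time()))
        url = self.queues[host].popleft()
        if self.queues[host]:                   # reschedule the host after the courtesy delay
            heapq.heappush(self.ready, (time.time() + self.delay, host))
        return url
```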
Increasing the throughput
Parallelize the process to fetch many pages at
the same time (~thousands per second).
Possible levels of parallelization:
• DNS
• HTTP
• Parsing
Domain Name resolution
Problem: DNS requires time to resolve the server
hostname
Domain Name resolution
1. Asynchronous DNS resolver:
• Concurrent handling of multiple outstanding requests
• Not provided by most UNIX implementations of gethostbyname
• GNU ADNS library
• http://www.chiark.greenend.org.uk/~ian/adns/
• Mercator reduced the DNS share of each thread’s elapsed time from 87% to 25%
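GNU ADNS is a C library, but the idea of keeping many lookups outstanding at once can be sketched with Python's asyncio; the hostnames below are placeholders:

```python
import asyncio

async def resolve_all(hostnames):
    """Issue all DNS lookups concurrently instead of one blocking gethostbyname at a time."""
    loop = asyncio.get_running_loop()
    lookups = [loop.getaddrinfo(h, 80) for h in hostnames]
    results = await asyncio.gather(*lookups, return_exceptions=True)
    return dict(zip(hostnames, results))

addresses = asyncio.run(resolve_all(["www.unipi.it", "archive.org", "www.google.com"]))
print({h: (r if isinstance(r, Exception) else r[0][4][0]) for h, r in addresses.items()})
```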
Domain Name resolution
2. Customized DNS component:
• Caching server with a persistent cache largely residing in memory
• Prefetching:
- Hostnames extracted from HREFs and requests made to the caching server
- Does not wait for resolution to be completed (see the sketch below)
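A toy version of such a component, assuming a plain in-memory cache and one background thread per prefetch (the real Mercator component is more sophisticated):

```python
import socket
import threading

class CachingResolver:
    """In-memory DNS cache; hostnames extracted from HREFs are resolved ahead of time."""

    def __init__(self):
        self.cache = {}
        self.lock = threading.Lock()

    def prefetch(self, hostname):
        """Called by the parser for every extracted HREF; returns without waiting for resolution."""
        threading.Thread(target=self._resolve, args=(hostname,), daemon=True).start()

    def _resolve(self, hostname):
        try:
            addr = socket.gethostbyname(hostname)
        except socket.gaierror:
            addr = None
        with self.lock:
            self.cache[hostname] = addr

    def lookup(self, hostname):
        """Used by the retriever: hit -> no network round trip, miss -> resolve synchronously."""
        with self.lock:
            if hostname in self.cache:
                return self.cache[hostname]
        self._resolve(hostname)
        return self.cache[hostname]
```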
Crawler Architecture
[Figure: the architecture extended with a DNS cache, an asynchronous DNS prefetch client and a DNS resolver, next to the per-server queues and the Robots component.]
Page retrieval
Problem: HTTP requires time to fetch a page
1. Multithreading
• Blocking system calls (synchronous I/O)
• pthreads multithreading library
• Used in Mercator, Sphinx, WebRace
• Sphinx uses a monitor to determine the optimal number of threads at runtime
• Mutual exclusion overhead
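A minimal multithreaded retriever in the same spirit (blocking, synchronous I/O in every worker); the thread count and seed URLs are arbitrary:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    """Blocking download of a single page (synchronous I/O)."""
    with urlopen(url, timeout=10) as resp:
        return url, resp.read()

urls = ["http://example.com/", "http://example.org/"]    # placeholder seed list
with ThreadPoolExecutor(max_workers=50) as pool:          # many blocked threads overlap their waits
    for url, body in pool.map(fetch, urls):
        print(url, len(body), "bytes")
```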
Page retrieval
2. Asynchronous sockets
• Not blocking the process/thread
• select monitors several sockets at the same time
• Does not need mutual exclusion, since it performs a serialized completion of threads (i.e. the code that completes the processing of a page is not interrupted by other completions)
• Used in IXE (1024 connections at once)
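A compact sketch of the select-based approach: one thread drives many non-blocking sockets, and each page's completion code runs serialized inside the loop. The hostnames, the HTTP/1.0 request and the omitted error handling are all simplifications:

```python
import select
import socket

def fetch_all(hosts, path="/"):
    """Drive many non-blocking sockets with a single select() loop (one thread, no locks).
    Error handling (failed connects, partial sends) is omitted for brevity."""
    socks, pending_send, pages = {}, set(), {}
    for host in hosts:
        s = socket.socket()
        s.setblocking(False)
        s.connect_ex((host, 80))            # non-blocking connect: returns immediately
        socks[s], pages[host] = host, b""
        pending_send.add(s)

    while socks:
        readable, writable, _ = select.select(list(socks), list(pending_send), [], 10)
        if not readable and not writable:   # overall timeout: give up on the remaining sockets
            break
        for s in writable:                  # connection established: send the request once
            s.sendall(f"GET {path} HTTP/1.0\r\nHost: {socks[s]}\r\n\r\n".encode())
            pending_send.discard(s)
        for s in readable:
            data = s.recv(4096)
            if data:
                pages[socks[s]] += data     # completion code is never interrupted by another socket
            else:                           # server closed the connection: this page is complete
                s.close()
                pending_send.discard(s)
                del socks[s]
    return pages

print({h: len(b) for h, b in fetch_all(["example.com", "example.org"]).items()})
```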
Page retrieval
3. Persistent connections
• Multiple documents requested on a single connection
• Feature of HTTP 1.1
• Reduces the number of HTTP connection setups
• Used in IXE
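With Python's http.client, reusing one HTTP/1.1 connection for several documents looks roughly like this; the host and paths are placeholders, and the server must agree to keep the connection alive:

```python
from http.client import HTTPConnection

conn = HTTPConnection("example.com", 80)          # single TCP connection, HTTP/1.1 by default
for path in ["/", "/a.html", "/b.html"]:          # several documents, one connection setup
    conn.request("GET", path)
    resp = conn.getresponse()
    body = resp.read()                            # drain the body before issuing the next request
    print(path, resp.status, len(body), "bytes")
conn.close()
```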
IXE Crawler
[Figure: IXE crawler architecture: a CrawlInfo thread holds the Table<UrlInfo>, Citations, Hosts and Robots data (partly in memory, partly on disk); a Feeder and a Scheduler with per-host queues supply select()-driven Retrievers; a Parser and a UrlEnumerator are connected through a Cache and synchronization objects.]
IXE Parser
• Problem: parsing requires 30% of the execution time
• Possible solution: distributed parsing
IXE Parser
[Figure: the URL Table Manager ("Crawler") owns the Table<UrlInfo> mapping URLs (URL1, URL2) to DocIDs (DocID1, DocID2); the Parser sends extracted URLs through a Cache and receives the corresponding DocIDs, which are stored as Citations.]
A distributed parser
[Figure: distributed parsing. The URL table is partitioned across managers (Table 1 Manager and Table 2 Manager, each with its own Table<UrlInfo>); a parser hashes each extracted URL to find the responsible manager; a cache HIT returns the DocID locally, while a MISS asks the manager, which may assign a new DocID; the Scheduler assigns URLs to the parsers (Sched() → Parser1) and Citations are recorded.]
A distributed parser
• Does this solution scale?
- High traffic on the main link
• Suppose that:
- Average page size = 10KB
- Average out-links per page = 10
- URL size = 40 characters (40 bytes)
- DocID size = 5 bytes
• X = throughput (pages per second)
• N = number of parsers
A distributed parser
• Bandwidth for web pages:
- X · 10 · 1024 · 8 = 81920 · X bps
(pages per second × 10 KB page size in bytes × byte → bit)
• Bandwidth for messages (hit):
- (X/N) · 10 · (40+5) · 8 · N = 3600 · X bps
(pages per parser × out-links per page × (40-byte DocID request + 5-byte DocID reply) × byte → bit × number of parsers)
• Using 100 Mbps: X ≈ 1226 pages per second
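The same arithmetic in a few lines, using the slide's assumptions (10 KB pages, 10 out-links per page, 40-byte URLs, 5-byte DocIDs, 100 Mbps taken as 100·2^20 bps):

```python
page_bits = 10 * 1024 * 8          # 10 KB per page, bytes -> bits
msg_bits  = 10 * (40 + 5) * 8      # per page: 10 out-links x (40-byte DocID request + 5-byte reply)
link_bps  = 100 * 2**20            # 100 Mbps main link

X = link_bps / (page_bits + msg_bits)   # pages per second the link can sustain
print(round(X))                          # ~1226
```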
What we don’t want to fetch
1. Spider traps
2. Duplicates
2.1 Different URLs for the same page
2.2 Already visited URLs
2.3 Same document on different sites
2.4 Mirrors
• At least 10% of the hosts are mirrored
Spider traps
• Spider trap: hyperlink graph constructed unintentionally or malevolently to keep a crawler trapped
1. Infinitely “deep” Web sites
• Problem: using CGI it is possible to generate an infinite number of pages
• Solution: check the URL length
Spider traps
2. Large number of dummy pages
• Example:
http://www.troutbums.com/Flyfactory/flyfactory/flyfactory/hatchline/hatchline/flyfactory/hatchline/flyfactory/hatchline/flyfactory/flyfactory/flyfactory/hatchline/flyfactory/hatchline/
• Solution: disable crawling of the offending site
- A guard removes from consideration any URL from a site that dominates the collection (see the sketch below)
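A hedged sketch of both defences: a cap on URL length against infinitely deep sites, and a guard that drops URLs from any host that starts to dominate the collection. The thresholds are arbitrary:

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_URL_LEN = 256          # defence against infinitely "deep" CGI-generated URLs
MAX_HOST_SHARE = 0.05      # no single host may exceed 5% of the crawled collection

pages_per_host = Counter()
total_pages = 0

def admit(url):
    """Return True if the URL should be kept in the frontier."""
    if len(url) > MAX_URL_LEN:
        return False
    host = urlsplit(url).netloc
    if total_pages > 1000 and pages_per_host[host] / total_pages > MAX_HOST_SHARE:
        return False       # this site dominates the collection: likely a trap
    return True

def record_fetch(url):
    """Call after a page has actually been crawled."""
    global total_pages
    pages_per_host[urlsplit(url).netloc] += 1
    total_pages += 1
```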
Avoid duplicates
• Problem almost nonexistent in classic IR
• Duplicate content:
- wastes resources (index space)
- annoys users
Virtual Hosting
• Problem: Virtual Hosting
- Allows mapping different sites to a single IP address
- Could be used to create duplicates
- Feature of HTTP 1.1
[Figure: both http://www.cocacola.com and http://www.coke.com map to the single IP address 129.33.45.163.]
• Rely on the canonical hostnames (CNAMEs) provided by DNS
Already visited URLs
• Problem: how to recognize an already visited URL?
• The page is reachable by many paths
• We need an efficient Duplicate URL Eliminator
Already visited URLs
1. Bloom Filter
• Probabilistic data structure for set membership testing
[Figure: the URL is fed to n hash functions; each one sets/tests a bit (0/1) in a bit vector.]
• Problem: false positives
- New URLs marked as already seen
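A minimal Bloom filter sketch; deriving the hash functions from salted MD5 and the sizes chosen here are arbitrary implementation choices:

```python
import hashlib

class BloomFilter:
    """Probabilistic set membership: no false negatives, a tunable rate of false positives."""

    def __init__(self, bits=8 * 1024 * 1024, hashes=4):
        self.size = bits
        self.hashes = hashes
        self.vector = bytearray(bits // 8)       # the bit vector

    def _positions(self, url):
        for i in range(self.hashes):             # n hash functions via salted MD5
            digest = hashlib.md5(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, url):
        for p in self._positions(url):
            self.vector[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.vector[p // 8] & (1 << (p % 8)) for p in self._positions(url))

seen = BloomFilter()
seen.add("http://example.com/a.html")
print("http://example.com/a.html" in seen)   # True
print("http://example.com/b.html" in seen)   # almost surely False (false positives are possible)
```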
Already visited URLs
2. URL hashing
• MD5: each URL is hashed to a 128-bit digest
• Using a 64-bit hash function, a billion URLs require 8 GB
- Does not fit in memory
- Using the disk limits the crawling rate to 75 downloads per second
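Computing such a digest is a one-liner with Python's hashlib; keeping only the first 8 bytes gives the 64-bit fingerprint discussed above (the URL is a placeholder):

```python
import hashlib

url = "http://www.example.com/index.html"       # placeholder URL
digest = hashlib.md5(url.encode()).digest()      # 128-bit MD5 digest
fingerprint = digest[:8]                         # truncated 64-bit fingerprint: 8 bytes per URL
print(fingerprint.hex())
```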
Already visited URLs
3. Two-level hash function
• The crawler is likely to explore URLs within the same site
• Relative URLs create a spatiotemporal locality of access
• Exploit this kind of locality using a cache
• Hash split: 24 bits for hostname+port, 40 bits for the path (see the sketch below)
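One possible reading of the two-level scheme, with 24 bits for hostname+port and 40 bits for the path; the hash function and the bit packing are assumptions, not the original implementation:

```python
import hashlib
from urllib.parse import urlsplit

def _hash_bits(text, bits):
    """Hash a string down to the requested number of bits (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(text.encode()).digest()
    return int.from_bytes(digest, "big") % (1 << bits)

def url_key(url):
    """24-bit hostname+port part followed by a 40-bit path part."""
    parts = urlsplit(url)
    host_bits = _hash_bits(parts.netloc, 24)          # hostname + port
    path_bits = _hash_bits(parts.path or "/", 40)     # path
    return (host_bits << 40) | path_bits

# URLs of the same site share the 24 high-order bits, so a small cache of recently
# seen keys exploits the locality created by relative URLs.
print(hex(url_key("http://www.example.com/a/b.html")))
```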
Content based techniques
• Problem: how to recognize duplicates based on the page contents?
1. Edit distance
• Number of replacements required to transform one document into the other
• Cost: l1·l2, where l1 and l2 are the lengths of the documents: impractical!
Content based techniques
2. Hashing
• A digest associated with each crawled page
• Used in Mercator
• Cost: one seek in the index for each newly crawled page
• Problem: pages could have minor syntactic differences!
- site maintainer’s name, latest update
- anchors modified
- different formatting
Content based techniques
3. Shingling
• Shingle (or q-gram): contiguous subsequence of tokens taken from document d
- Representable by a fixed-length integer
• w-shingle: shingle of width w
• S(d,w): w-shingling of document d
- Unordered set of distinct w-shingles contained in document d
Content based techniques
Sentence: a rose is a rose is a rose
Tokens: a, rose, is, a, rose, is, a, rose
4-shingles: (a,rose,is,a) (rose,is,a,rose) (is,a,rose,is) (a,rose,is,a) (rose,is,a,rose)
S(d,4): { (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) }
Content based techniques
• Each token = 32 bits, so a w-shingle = 320 bits
• w = 10 (suitable value)
• S(d,10) = set of 320-bit numbers
• We can hash the w-shingles and keep 500 bytes of digests for each document
Content based techniques
• Resemblance of documents d1 and d2 (Jaccard coefficient):
r(d1,d2) = |S(d1,w) ∩ S(d2,w)| / |S(d1,w) ∪ S(d2,w)|
• Eliminate pages that are too similar (pages whose resemblance value is close to 1)
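A small sketch of w-shingling and the resemblance measure; whitespace tokenization and w = 4 (from the example above) are simplifications:

```python
def shingles(text, w=4):
    """Unordered set of distinct w-shingles (sequences of w consecutive tokens)."""
    tokens = text.split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(d1, d2, w=4):
    """Jaccard coefficient of the two shingle sets."""
    s1, s2 = shingles(d1, w), shingles(d2, w)
    return len(s1 & s2) / len(s1 | s2)

print(shingles("a rose is a rose is a rose"))
# {('a','rose','is','a'), ('rose','is','a','rose'), ('is','a','rose','is')}
print(resemblance("a rose is a rose is a rose", "a rose is a rose"))   # 1/3
```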
Mirrors
[Figure: anatomy of the URL http://www.research.digital.com/SRC/: access method, hostname and path.]
• Precision = relevant retrieved docs / retrieved docs (the percentages quoted below are precision values)
Mirrors
1. URL String based
• Vector Space model: term vector matching to compute the likelihood that a pair of hosts are mirrors
• Terms with df(t) < 100
Mirrors
a) Hostname matching (27%)
• Terms: substrings of the hostname
• Term weighting: log(len(t)) / (1 + log(df(t)))
- len(t) = number of segments obtained by breaking the term at ‘.’ characters
• This weighting favours substrings composed of many segments, which are very specific
Mirrors
b) Full path matching (59%)
• Terms: entire paths
• Term weighting: 1 + log(mdf / df(t))
- mdf = max df(t) over all terms t in the collection
• Connectivity based filtering stage: +19%
- Idea: mirrors share many common paths
- Test, for each common path, whether it has the same set of out-links on both hosts
- Remove hostnames from local URLs
Mirrors
c) Positional word bigram matching (72%)
• Terms creation:
- Break the path into a list of words by treating ‘/’ and ‘.’ as breaks
- Eliminate non-alphanumeric characters
- Replace digits with ‘*’ (effect similar to stemming)
- Combine successive pairs of words in the list
- Append the ordinal position of the first word
Mirrors
Example: conferences/d299/advanceprogram.html
Words: conferences, d*, advanceprogram, html
Positional word bigrams: conferences_d*_0, d*_advanceprogram_1, advanceprogram_html_2
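A sketch of the term construction for a single path; it reproduces the example above, though the original implementation's details may differ:

```python
import re

def positional_word_bigrams(path):
    """Turn a URL path into positional word bigram terms such as 'conferences_d*_0'."""
    words = re.split(r"[/.]", path)                        # break at '/' and '.'
    words = [re.sub(r"[^0-9A-Za-z]", "", w) for w in words]
    words = [re.sub(r"\d+", "*", w) for w in words if w]   # replace digits with '*' (stemming-like)
    return [f"{a}_{b}_{i}" for i, (a, b) in enumerate(zip(words, words[1:]))]

print(positional_word_bigrams("conferences/d299/advanceprogram.html"))
# ['conferences_d*_0', 'd*_advanceprogram_1', 'advanceprogram_html_2']
```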
Mirrors
2. Host connectivity based (45%)
• Consider all documents on a host as a single large document
• Graph:
- host → node
- document on host A pointing to a document on host B → directed edge from A to B
• Idea: two hosts are likely to be mirrors if their nodes point to the same nodes
• Term vector matching
- Terms: set of nodes that a host’s node points to
References
S. Chakrabarti, Mining the Web: Analysis of Hypertext and Semi Structured Data, Morgan Kaufmann, 2002. Pages 17-43, 71-72.
S. Brin and L. Page, The anatomy of a large-scale hypertextual Web search engine, Proceedings of the 7th World Wide Web Conference (WWW7), 1998.
A. Heydon and M. Najork, Mercator: A scalable, extensible Web crawler, World Wide Web Conference, 1999.
K. Bharat, A. Broder, J. Dean, M. R. Henzinger, A Comparison of Techniques to Find Mirrored Hosts on the WWW, Journal of the American Society for Information Science, 2000.
References
A. Heydon and M. Najork, High Performance Web Crawling, SRC Research Report 173, Compaq Systems Research Center, 26 September 2001.
R. C. Miller and K. Bharat, SPHINX: a framework for creating personal, site-specific Web crawlers, Proceedings of the 7th World Wide Web Conference, 1998.
D. Zeinalipour-Yazti and M. Dikaiakos, Design and Implementation of a Distributed Crawler and Filtering Processor, Proceedings of the 5th Workshop on Next Generation Information Technologies and Systems (NGITS 2002), June 2002.