Crawling - MediaLab
Crawling
Giuseppe Attardi
Dipartimento di Informatica
Università di Pisa
Crawling
Download a set of Web pages
Set consists typically of all pages
reachable following links from a root
set
Crawling is performed periodically
Goals:
– Find new pages
– Keep pages fresh
– Select “good” pages
Web Dynamics
Size
– ~10 billion publicly indexable pages
– ~10 kB/page ⇒ ~100 TB
– Doubles every 18 months
Dynamics
– 33% change weekly
– 8% new pages every week
– 25% new links every week
Weekly change
Fetterly, Manasse, Najork, Wiener 2003
Crawling Issues
How to crawl?
– Quality: “Best” pages first
– Efficiency: Avoid duplication (or near
duplication)
– Etiquette: Robots.txt, Server load concerns
How much to crawl? How much to
index?
– Coverage: How big is the Web? How much do
we cover?
– Relative Coverage: How much do competitors
have?
Basic crawler operation
Begin with known “seed” pages
Add them to a queue
Extract URL from queue
– Fetch it
– Parse it and extract URLs it points to
– Add any unvisited extracted URLs to
the queue
Repeat until queue is empty
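The loop above can be sketched as follows. The Web is simulated here by an in-memory link graph; `LinkGraph` and `crawl` are illustrative names, not part of any crawler described later.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <set>
#include <string>
#include <vector>

// Stand-in for "fetch the URL, parse it, extract URLs":
// the Web is simulated by an in-memory link graph.
using LinkGraph = std::map<std::string, std::vector<std::string>>;

// Breadth-first crawl: seed the queue, repeatedly pop a URL,
// "fetch" it, and enqueue any not-yet-visited extracted URLs.
std::vector<std::string> crawl(const LinkGraph& web,
                               const std::vector<std::string>& seeds) {
    std::deque<std::string> queue(seeds.begin(), seeds.end());
    std::set<std::string> seen(seeds.begin(), seeds.end());
    std::vector<std::string> order;          // pages in fetch order
    while (!queue.empty()) {
        std::string url = queue.front();
        queue.pop_front();
        order.push_back(url);                // "fetch and parse" happens here
        auto it = web.find(url);
        if (it == web.end()) continue;       // page has no outlinks
        for (const std::string& out : it->second)
            if (seen.insert(out).second)     // only unvisited URLs
                queue.push_back(out);
    }
    return order;
}
```

Using a FIFO queue makes this exactly the breadth-first order discussed below; a priority queue in its place yields the "best-first" strategies.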
Simple picture – complications
Web crawling isn’t feasible with one machine
– All of the above steps distributed
Even non-malicious pages pose challenges
– Latency/bandwidth to remote servers vary
– Robots.txt stipulations
• How “deep” should you crawl a site’s URL hierarchy?
– Site mirrors and duplicate pages
Malicious pages
– Spam pages
– Spider traps – incl dynamically generated
Politeness – don’t hit a server too often
Robots.txt
Protocol for giving spiders (“robots”)
limited access to a website, originally
from 1994
– www.robotstxt.org/wc/norobots.html
Website announces its request on
what can(not) be crawled
– A site places a file robots.txt at the
root of its URL hierarchy
– This file specifies access restrictions
Robots.txt example
No robot should visit any URL
starting with "/yoursite/temp/",
except the robot called “IxeBot":
User-agent: *
Disallow: /yoursite/temp/
User-agent: IxeBot
Disallow:
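A minimal sketch of how a crawler might interpret such records, simplified to exact agent-name matching and plain prefix Disallow rules as in the original 1994 convention (the `Rules` and `allowed` names are illustrative):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// rules[agent] holds the Disallow prefixes of that agent's record;
// rules["*"] holds those of the default record.
using Rules = std::map<std::string, std::vector<std::string>>;

// A path is allowed unless a Disallow prefix of the applicable
// record (the agent's own record, else "*") matches its beginning.
bool allowed(const Rules& rules, const std::string& agent,
             const std::string& path) {
    auto it = rules.find(agent);
    if (it == rules.end()) it = rules.find("*");
    if (it == rules.end()) return true;      // no applicable record
    for (const std::string& prefix : it->second)
        if (!prefix.empty() && path.compare(0, prefix.size(), prefix) == 0)
            return false;                    // Disallow prefix matched
    return true;
}
```

With the example above, an empty record for IxeBot overrides the default record, so IxeBot may fetch everything while other robots must skip /yoursite/temp/.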
Crawling Issues
Crawl strategies
Distributed crawling
Refresh strategies
Filtering duplicates
Mirror detection
Crawl Strategies
Where do we crawl next?
[Figure: the Web, with the URLs already crawled and parsed versus the URLs still in the queue]
Crawl Order
Want
best pages first
Potential quality measures:
• Final In-degree
• Final PageRank
Crawl heuristics:
• Breadth First Search (BFS)
• Partial Indegree
• Partial PageRank
• Random walk
Breadth-First Crawl
Basic idea:
start at a set of known URLs
explore in “concentric circles” around these URLs
start pages
distance-one pages
distance-two pages
used by broad web search engines
balances load between servers
Web Wide Crawl (328M pages) [Najo01]
BFS crawling brings in high quality
pages early in the crawl
Stanford WebBase (179K pages) [Cho98]
[Chart: overlap with the best x% of pages by in-degree, versus the fraction x% crawled in order O(u)]
Queue of URLs to be fetched
What constraints dictate which queued
URL is fetched next?
Politeness – don’t hit a server too often,
even from different threads of your spider
How far into a site you’ve crawled already
– Most sites, stay at ≤ 5 levels of URL hierarchy
Which URLs are most promising for
building a high-quality corpus
– This is a graph traversal problem:
– Given a directed graph you’ve partially visited,
where do you visit next?
Where do we crawl next?
Keep all crawlers busy
Keep crawlers from treading on each
others’ toes
– Avoid fetching duplicates repeatedly
Respect politeness/robots.txt
Avoid getting stuck in traps
Detect/minimize spam
Get the “best” pages
– What’s best?
– Best for answering search queries
Where do we crawl next?
Complex scheduling optimization
problem, subject to all the constraints
listed
– Plus operational constraints (e.g., keeping all
machines load-balanced)
Scientific study – limited to specific
aspects
– Which ones?
– What do we measure?
What are the compromises in distributed
crawling?
Page selection
Importance metric
Web crawler model
Crawler method for choosing page to
download
Importance Metrics
Given a page P, define how “good”
that page is
Several metric types:
– Interest driven
– Popularity driven
– Location driven
– Combined
Interest Driven
Define a driving query Q
Find textual similarity between P and Q
Define a word vocabulary t1…tn
Define a vector for P and Q:
– Vp, Vq = <w1,…,wn>
• wi = 0 if ti does not appear in the document
• wi = IDF(ti) = 1 / number of pages containing ti, otherwise
Importance: IS(P) = Vp · Vq (cosine product)
Computing IDF exactly requires going over the entire Web
Estimate IDF from the pages already visited, to calculate
IS’
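The estimate IS’ can be sketched as follows; IDF is computed only from the pages visited so far, as the slide suggests (the `Doc`, `idf` and `is_score` names are illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <set>
#include <string>
#include <vector>

using Doc = std::set<std::string>;            // terms of a page or query

// IDF(t) estimated from the pages visited so far:
// 1 / number of visited pages containing t.
double idf(const std::vector<Doc>& visited, const std::string& t) {
    int df = 0;
    for (const Doc& d : visited) df += (int)d.count(t);
    return df == 0 ? 0.0 : 1.0 / df;
}

// IS'(P): cosine of the IDF-weighted vectors of page P and query Q.
double is_score(const std::vector<Doc>& visited,
                const Doc& p, const Doc& q) {
    double dot = 0, np = 0, nq = 0;
    std::set<std::string> vocab(p.begin(), p.end());
    vocab.insert(q.begin(), q.end());
    for (const std::string& t : vocab) {
        double w = idf(visited, t);           // shared weight wi
        double wp = p.count(t) ? w : 0, wq = q.count(t) ? w : 0;
        dot += wp * wq; np += wp * wp; nq += wq * wq;
    }
    return (np == 0 || nq == 0) ? 0.0 : dot / std::sqrt(np * nq);
}
```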
Popularity Driven
How popular a page is:
– Backlink count
• IB(P) = the number of pages containing a
link to P
• Estimate by previous crawls: IB’(P)
– More sophisticated metric, e.g.
PageRank: IR(P)
Location Driven
IL(P): a function of the URL of P
– Words appearing in the URL
– Number of “/” in the URL
Easily evaluated, requires no data
from previous crawls
Combined Metrics
IC(P): a function of several other
metrics
Allows using local metrics for first
stage and estimated metrics for
second stage
IC(P) = a*IS(P) + b*IB(P) + c*IL(P)
Focused Crawling
Focused Crawling (Chakrabarti)
Distributed federation of focused
crawlers
Supervised topic classifier
Controls priority of unvisited frontier
Trained on document samples from
Web directory (Dmoz)
Basic Focused Crawler
DOM tree analysis
Learning Algorithm
Naïve Bayes
Page u is modeled as bag of words
{ <feature, freq> }
feature = < term, distance >
Comparisons
Start the baseline crawler from the URLs
in one topic
Fetch up to 20000-25000 pages
For each pair of fetched pages (u,v), add an
item to the training set of the apprentice
Train the apprentice
Start the enhanced crawler from the same
set of pages
Fetch about the same number of pages
Results
Controversy
Chakrabarti claims the focused crawler is
superior to breadth-first
Suel claims the contrary, arguing that the
claim was based on experiments with
poorly performing crawlers
Distributed Crawling
Approaches
Centralized Parallel Crawler
Distributed
P2P
Parallel Crawlers
A parallel crawler consists of multiple
crawling processes communicating via
local network (intra-site parallel crawler) or
Internet (distributed crawler)
– http://www2002.org/CDROM/refereed/108/index.html
Setting: we have a number of c-proc’s
– c-proc = crawling process
Goal: we wish to crawl the best pages with
minimum overhead
Crawler-process distribution
– at geographically distant locations: distributed crawler
– on the same local network: central parallel crawler
Distributed model
Crawlers may be running in diverse
geographic locations
– Periodically update a master index
– Incremental update so this is “cheap”
• Compression, differential update etc.
– Focus on communication overhead
during the crawl
Also results in dispersed WAN load
Issues and benefits
Issues:
– overlap: minimization of multiple downloaded
pages
– quality: depends on the crawl strategy
– communication bandwidth: minimization
Benefits:
– scalability: for large-scale web-crawls
– costs: use of cheaper machines
– network-load dispersion and reduction: by
dividing the web into regions and crawling
only the nearest pages
Coordination
A parallel crawler consists of multiple crawling
processes communicating via local network
(intra-site parallel crawler) or Internet (distributed
crawler)
Coordination among c-proc's:
1. Independent: no coordination, every process follows its
extracted links
2. Dynamic assignment: a central coordinator dynamically
divides the web into small partitions and assigns each
partition to a process
3. Static assignment: the Web is partitioned and assigned
without a central coordinator before the crawl starts
c-proc’s crawling the web
[Figure: several c-proc's drawing from the Web, each with its URLs crawled and URLs in queues — which c-proc gets a newly discovered URL? Communication: by URLs passed between c-proc's.]
Static assignment
Links from one partition to another (inter-partition links) can be handled in:
1. Firewall mode: a process does not follow any inter-partition link
2. Cross-over mode: a process also follows inter-partition links and thereby discovers more pages in its partition
3. Exchange mode: processes exchange inter-partition URLs; this mode needs communication
[Figure: pages a–d in Partition 1 and e–i in Partition 2, with intra- and inter-partition links]
Classification of parallel crawlers
If exchange mode is used, communication can be
limited by:
– Batch communication: every process collects some URLs
and sends them in a batch
– Replication: the k most popular URLs are replicated at each
process and are not exchanged (known from a previous crawl or on the fly)
Some ways to partition the Web:
– URL-hash based: many inter-partition links
– Site-hash based: reduces the inter-partition links
– Hierarchical: .com domain, .net domain …
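The effect of site-hash partitioning can be sketched as follows: hashing only the host of a URL (a hypothetical `assign` helper, assuming simple "http://host/path" URLs) keeps every page of a site in one partition, so intra-site links never become inter-partition links.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Crude host extraction, assuming "http://host/path"-shaped URLs.
std::string host_of(const std::string& url) {
    auto start = url.find("//");
    start = (start == std::string::npos) ? 0 : start + 2;
    auto end = url.find('/', start);             // npos => rest of string
    return url.substr(start, end - start);
}

// Site-hash based assignment of a URL to one of num_procs c-proc's.
int assign(const std::string& url, int num_procs) {
    return (int)(std::hash<std::string>{}(host_of(url)) % num_procs);
}
```

URL-hash based assignment would instead hash the full URL, scattering a site's pages across all c-proc's and turning most intra-site links into inter-partition links.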
Static assignment: comparison

Mode       Coverage   Overlap   Quality   Communication
Firewall   Bad        Good      Bad       Good
Crossover  Good       Bad       Bad       Good
Exchange   Good       Good      Good      Bad
UBI Crawler
Features:
– Full distribution: identical agents / no central coordinator
– Balanced locally computable assignment:
• each URL is assigned to one agent
• each agent can compute the URL assignment locally
• distribution of URLs is balanced
– Scalability:
• number of crawled pages per second and per agent are
independent of the number of agents
– Fault tolerance:
• URLs are not statically distributed
• distributed reassignment protocol not reasonable
UBI Crawler: Assignment Function
A: set of agent identifiers
L ⊆ A: set of alive agents
m: total number of hosts
δL: assigns host h to an alive agent: δL(h) ∈ L
Requirements:
Balance: each agent should be responsible for approximately the
same number of hosts:
|δL⁻¹(a)| ≈ m / |L|
Contravariance: if the number of agents grows, the portion of the
web crawled by each agent must shrink:
L ⊆ L' ⇒ δL'⁻¹(a) ⊆ δL⁻¹(a)
L ⊆ L' ∧ δL'(h) ∈ L ⇒ δL'(h) = δL(h)
Consistent Hashing
Each bucket is replicated k times and each replica is mapped
randomly on the unit circle
Hashing a key: compute a point on the unit circle and find the nearest
replica
Example: L = {a,b}, L‘ = {a,b,c}, k = 3, hosts = {0,1,…,9}
[Figure: two unit circles showing the replicas of the agents and the hash points of the hosts]
δL⁻¹(a) = {1,4,5,6,8,9}
δL⁻¹(b) = {0,2,3,7}
δL'⁻¹(a) = {4,5,6,8}
δL'⁻¹(b) = {0,2,7}
δL'⁻¹(c) = {1,3,9}
Contravariance: L ⊆ L' ⇒ δL'⁻¹(a) ⊆ δL⁻¹(a)
Balancing: by the hash function and random
number generator
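A minimal consistent-hashing sketch, using the standard library hash in place of a true random mapping onto the unit circle (the `ConsistentHash` class is illustrative):

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Each bucket (agent) gets k replicas on the circle, here the space
// of size_t hash values; a key is served by the first replica at or
// after its own hash point, wrapping around at the end.
class ConsistentHash {
    std::map<size_t, std::string> circle_;   // replica point -> agent
public:
    void add(const std::string& agent, int k = 3) {
        for (int i = 0; i < k; ++i)
            circle_[std::hash<std::string>{}(
                agent + "#" + std::to_string(i))] = agent;
    }
    std::string lookup(const std::string& key) const {
        size_t h = std::hash<std::string>{}(key);
        auto it = circle_.lower_bound(h);    // nearest replica clockwise
        if (it == circle_.end()) it = circle_.begin();  // wrap around
        return it->second;
    }
};
```

The contravariance property falls out by construction: adding agent c leaves the replicas of a and b where they were, so a host either keeps its owner or moves to c, never between a and b.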
UBI Crawler: fault tolerance
Up to now: no metrics for estimating the fault tolerance of
distributed crawlers
– Each agent has its own view of the set of alive agents (views
can be different) but two agents will never dispatch hosts to
two different agents.
[Figure: agents a, b, c, d; agent b dies and its hosts are redistributed among the surviving agents]
– Agents can be added dynamically in a self-stabilizing
way
Evaluation metrics (1)
1. Overlap:
Overlap = (N − I) / N
N: total number of fetched pages
I: number of distinct fetched pages
• minimize the overlap
2. Coverage:
Coverage = I / U
U: total number of Web pages
• maximize the coverage
Evaluation metrics (2)
3. Communication overhead:
Overhead = M / P
M: number of exchanged messages
(URLs)
P: number of downloaded pages
• minimize the overhead
4. Quality:
Quality = (1/N) Σi PageRank(pi)
• maximize the quality
• backlink count / oracle crawler
Experiments
40M URL graph – Stanford Webbase
– Open Directory (dmoz.org) URLs as
seeds
Should be considered a small Web
Firewall mode coverage
The price of crawling in firewall mode
Crossover mode overlap
Demanding coverage drives up
overlap
Exchange mode communication
Communication overhead sublinear
per downloaded URL
Cho’s conclusion
With < 4 crawling processes run in parallel, firewall
mode provides good coverage
Firewall mode is not appropriate when:
– there are > 4 crawling processes
– only a small subset of the Web is downloaded and the
quality of the downloaded pages is important
Exchange mode:
– consumes < 1% of network bandwidth for URL exchanges
– maximizes the quality of the downloaded pages
By replicating 10,000 - 100,000 popular URLs,
communication overhead is reduced by 40%
Resources
www.robotstxt.org/wc/norobots.html
www2002.org/CDROM/refereed/108/index.html
www2004.org/proceedings/docs/1p595.pdf
Crawler Models
A crawler
– Tries to visit more important pages first
– Only has estimates of importance metrics
– Can only download a limited amount
How well does a crawler perform?
– Crawl and Stop
– Crawl and Stop with Threshold
Crawl and Stop
A crawler stops after visiting K pages
A perfect crawler
– Visits pages with ranks R1,…,Rk
– These are called Top Pages
A real crawler
– Visits only M < K top pages
Performance rate: Pcs = M / K
For a random crawler: Pcs = K / N
(N: total number of pages)
Crawl and Stop with Threshold
A crawler stops after visiting K pages
Top pages are the T pages with a metric higher than G
A crawler visits V top pages
Metric: percent of top pages visited
Perfect crawler: Pst = K / T (= 1 when K ≥ T)
Random crawler: Pst = (T · K/N) / T = K / N
Real crawler: Pst = V / T
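The threshold metric can be sketched as follows (a hypothetical `pst` helper; page names and scores are made up for illustration):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Crawl-and-stop with threshold: pages with importance > G are the
// T "top" pages; after fetching the first K pages of the crawl
// order, performance is Pst = V / T, where V counts the top pages
// among those fetched.
double pst(const std::map<std::string, double>& importance,
           const std::vector<std::string>& crawl_order,
           double G, size_t K) {
    size_t T = 0, V = 0;
    for (const auto& kv : importance)
        if (kv.second > G) ++T;              // count all top pages
    for (size_t i = 0; i < K && i < crawl_order.size(); ++i)
        if (importance.at(crawl_order[i]) > G) ++V;  // top pages fetched
    return T == 0 ? 1.0 : double(V) / double(T);
}
```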
Ordering Metrics
The crawlers queue is prioritized
according to an ordering metric
The ordering metric is based on an
importance metric
– Location metrics - directly
– Popularity metrics – via estimates according to
previous crawls
– Similarity metrics – via estimates according to
anchor text
When to stop
Up to a certain depth
Up to a certain amount of pages from each
site
Up to a maximum number of pages
Literature
Mercator (Altavista, Java)
WebBase (Garcia-Molina, Cho)
PolyBot (Suel, NY Polytechnic)
Grub (grub.org)
UbiCrawler (Vigna, Uni. Milano)
Crawler Architecture
PolyBot
• application issues
requests to manager
• manager does DNS
and robot exclusion
• manager schedules
URL on downloader
• downloader gets file
and puts it on disk
• application is notified
of new files
• application parses new
files for hyperlinks
• application sends data
to storage component
Scaling up
• 20 machines
• 1500 pages/s?
• depends on crawl strategy
• hash URLs to nodes based on site (because of robot exclusion)
Scheduler Data Structures
• when to insert new URLs into internal structures?
IXE Crawler
[Architecture diagram: a Feeder and a Scheduler feed per-host queues; Retriever threads (using select()) fetch pages; a Parser extracts citations; persistent Table<UrlInfo> (with CrawlInfo), Citations, Hosts and Robots tables, a cache and a UrlEnumerator back the in-memory structures, with synchronization objects mediating between memory and disk]
Scheduler
Splits URLs by hosts
Distributes to Retrievers
Keeps hosts on a wait queue for a grace
period
Retriever
Modified version of cURL library
Asynchronous DNS resolver
Handles up to 1024 transfers at once
Keeps connections alive, transfers
multiple files on one connection
(connection setup can be 25% of
retrieval time)
Persistent Storage
Implemented by metaprogramming
as IXE object-relational tables
Tuned to handle concurrent access,
with low granularity locking
Locking completely transparent to
application
< 1% of overall runtime
Parser
> 30% of execution time
Uses a ThreadPool
Locking and Concurrency
Portable layer (Unix, Windows):
– Thread
– ThreadGroup
– ThreadPool
– Lock
– Condition
– ConditionWait
Pipe
bool Get(T& res) {
LockUp our(_lock);
// wait in a loop: guards against spurious wakeups and
// consumers racing for the same item; items still in the
// pipe are delivered even after it is closed
while (empty()) {
if (closed)
return false;
ConditionWait(not_empty, &_lock);
}
res = front();
pop_front();
return true;
}
In Memory Communication
Channel<T>
Pipe<T, Q = queue<T> >
MultiPipe<T, Q = queue<T> >
Asynchronous Put(obj)
Synchronous Get(obj)
Peek()
Close()
Performance Comparison
                 nutch           IXE
Language         Java            C++
Parallelism      threads: 300    async IO: 1 thread, 500 connections
DNS resolution   synchronous     asynchronous
Page analysis    stages          concurrent
Download speed   19 pages/sec    120 pages/sec
Bandwidth        1.3 Mb/s peak   24 Mb/s peak
Crawl Performance
Single PC:
– Peak 120 pages per second (24 Mb/s)
– 5 million pages/day
Full crawl of 2 billion pages:
– 400 days on 1 PC at 25 Mb/s, or
– 4 days on 100 PCs, with 2.5 Gb/s bandwidth
Crawler Console
Web pushlet (aka AJAX) graphical
monitor
Real time bandwidth usage graph
Scheduled Windows Service
URL compression
10 billion URLs = 550 GB
Must map URL to unique Ids
Simple trick: use an undocumented
feature of zlib
– Supply a dictionary to decompress()
– Create the dictionary by compressing a
number of URLs with typical patterns
– Save the dictionary for separate use
– Achieves > 50% compression rate
Statistics
Page avg. 13KB, 8 links per page
From 20 million pages >100 million
links: several days to crawl
Related work
Mercator (Heydon/Najork, DEC/Compaq)
– used in AltaVista
– centralized system (2-CPU Alpha with RAID
disks)
– URL-seen test by fast disk access and caching
– one thread per HTTP connection
– completely in Java, with pluggable
components
Atrax: distributed extension to Mercator
– combines several Mercators
– URL hashing and off-line URL check
Related Work (cont.)
early Internet Archive crawler (circa 96)
– uses hashing to partition URLs between
crawlers
– bloom filter for “URL seen” structure
early Google crawler (1998)
P2P crawlers (grub.org and others)
Cho/Garcia-Molina (WWW 2002)
– study of overhead/quality tradeoff in parallel
crawlers
– difference: we scale services separately, and
focus on single-node performance
– in our experience, parallel overhead low
Open Issues
Measuring and tuning peak performance
– need simulation environment
– eventually reduces to parsing and network
– to be improved: space, fault-tolerance (Xactions?)
Highly Distributed crawling
– highly distributed (e.g., grub.org) ? (maybe)
– hybrid? (different services)
– few high-performance sites? (several Universities)
Recrawling and focused crawling strategies
– what strategies?
– how to express?
– how to implement?
Refresh Strategies
Page Refresh
Make sure pages are up-to-date
Many possible strategies:
– Uniform refresh: fi = fj for all i, j
– Proportional to change frequency: fi / fj = λi / λj
(fi: refresh frequency of page i, λi: its change frequency)
Need to define a metric
Freshness Metric
Freshness of a page:
F(ei, t) = 1 if ei is fresh at time t, 0 otherwise
Freshness of the collection S of N pages:
F(S, t) = (1/N) Σ i=1..N F(ei, t)
Age of a page:
A(ei, t) = time since ei was modified (0 if fresh)
Age of the collection:
A(S, t) = (1/N) Σ i=1..N A(ei, t)
Average Freshness
Freshness changes over time
Take the average freshness over a
long period of time
F̄(S) = lim t→∞ (1/t) ∫ 0..t F(S, t) dt
Refresh Strategy
Crawlers can refresh only a certain
amount of pages in a period of time
The page download resource can be
allocated in many ways
The proportional refresh policy
allocates the resource proportionally
to the pages’ change rates
Example
The collection contains 2 pages
– E1 changes 9 times a day
– E2 changes once a day
– Simplified change model
• The day is split into 9 equal intervals, and E1
changes once on each interval
• E2 changes once during the entire day
• The only unknown is when the pages change within
the intervals
The crawler can download a page a day
Goal is to maximize the freshness
Example (2)
Example (3)
Which page do we refresh?
– If we refresh E2 at midday:
• If E2 changed in the first half of the day, it remains
fresh for the remaining half of the day
– 50% chance of a 0.5 day freshness increase
– 50% chance of no increase
– Expected freshness increase: 0.25 day
– If we refresh E1 at midday:
• If E1 changed in the first half of its interval, and we
refresh at midday (the middle of an interval), it
remains fresh for the remaining half of the
interval = 1/18 of a day
– 50% chance of a 1/18 day freshness increase
– 50% chance of no increase
– Expected freshness increase: 1/36 day
Example (4)
This gives a nice estimation
But things are more complex in real
life
– Not sure that a page will change within
an interval
– Have to worry about age
Using a Poisson model shows a
uniform policy always performs
better than a proportional one
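Under the Poisson model this comparison can be made exact: a page changing at rate λ and refreshed periodically with frequency f has time-average freshness F = (f/λ)(1 − e^(−λ/f)), the standard Cho/Garcia-Molina result. Applying it to the two-page example (λ1 = 9, λ2 = 1 changes/day, a budget of one refresh per day):

```cpp
#include <cassert>
#include <cmath>

// Time-average freshness of a page with Poisson change rate lambda,
// refreshed periodically with frequency f (interval 1/f):
// F = (f / lambda) * (1 - exp(-lambda / f))
double freshness(double lambda, double f) {
    return (f / lambda) * (1.0 - std::exp(-lambda / f));
}

// Average freshness of a two-page collection under a refresh policy
// assigning frequencies f1 and f2 to pages with rates l1 and l2.
double avg_freshness(double l1, double f1, double l2, double f2) {
    return (freshness(l1, f1) + freshness(l2, f2)) / 2.0;
}
```

Splitting the daily budget uniformly (f1 = f2 = 0.5) beats splitting it proportionally to the change rates (f1 = 0.9, f2 = 0.1): the proportional policy wastes nearly all refreshes on the fast-changing page, which is stale again almost immediately.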
Comparing Policies
Policy        Freshness   Age
Proportional  0.12        400 days
Uniform       0.57        5.6 days
Optimal       0.62        4.3 days

Based on statistics from an experiment
with a revisit frequency of one month
Duplicate Detection
Duplicate/Near-Duplicate Detection
Duplicate: Exact match with fingerprints
Near-Duplicate: Approximate match
– Overview
• Compute syntactic similarity with an edit-distance measure
• Use a similarity threshold to detect near-duplicates
– e.g., similarity > 80% ⇒ documents are “near
duplicates”
– Not transitive, though sometimes used transitively
Computing Near Similarity
Features:
– Segments of a document (natural or artificial
breakpoints) [Brin95]
– Shingles (word N-Grams) [Brin95, Brod98]
“a rose is a rose is a rose” =>
a_rose_is_a
rose_is_a_rose
is_a_rose_is
Similarity Measure
– TF*IDF [Shiv95]
– Set intersection [Brod98]
|Intersection| / |Union|
Jaccard measure
Shingles + Set Intersection
Computing the exact set intersection of
shingles between all pairs of
documents is infeasible
– Approximate using a cleverly chosen
subset of shingles from each (a sketch)
Shingles + Set Intersection
Estimate |intersection| / |union| based on
a short sketch ([Brod97, Brod98])
– Create a “sketch vector” (e.g. of size 200) for
each document
– Documents which share more than t (say 80%)
corresponding vector elements are similar
– For doc D, sketch[i] is computed as follows:
• Let f map all shingles in the universe to 0..2^m (e.g. f =
fingerprinting)
• Let pi be a specific random permutation on 0..2^m
• Pick sketch[i] := MIN pi(f(s)) over all shingles s in D
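The sketch computation can be implemented as follows, with seeded hashes standing in for the random permutations pi and 4-word shingles as in the “rose” example (all names illustrative):

```cpp
#include <cassert>
#include <algorithm>
#include <functional>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Word 4-grams ("a_rose_is_a_", ...) as shingles.
std::vector<std::string> shingles(const std::string& text, size_t n = 4) {
    std::vector<std::string> words, out;
    std::istringstream in(text);
    for (std::string w; in >> w; ) words.push_back(w);
    for (size_t i = 0; i + n <= words.size(); ++i) {
        std::string s;
        for (size_t j = 0; j < n; ++j) s += words[i + j] + "_";
        out.push_back(s);
    }
    return out;
}

// sketch[i] = min over all shingles of the i-th seeded hash;
// the seeded hashes approximate the random permutations pi.
std::vector<size_t> sketch(const std::string& text, size_t k = 200) {
    std::vector<size_t> sk(k, std::numeric_limits<size_t>::max());
    for (const std::string& s : shingles(text))
        for (size_t i = 0; i < k; ++i)
            sk[i] = std::min(sk[i],
                std::hash<std::string>{}(s + "#" + std::to_string(i)));
    return sk;
}

// Fraction of matching sketch positions estimates |intersection| / |union|.
double resemblance(const std::string& a, const std::string& b) {
    std::vector<size_t> sa = sketch(a), sb = sketch(b);
    size_t eq = 0;
    for (size_t i = 0; i < sa.size(); ++i) eq += (sa[i] == sb[i]);
    return double(eq) / double(sa.size());
}
```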
Computing Sketch[i] for Doc1
[Figure: Document 1 — start with its 64-bit shingles on the number line 0..2^64, permute them with pi, pick the min value]
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
[Figure: Documents 1 and 2 — permute the shingles of each with pi; A = minimum for Doc1, B = minimum for Doc2; are these equal?]
Test for 200 random permutations:
p1, p2,… p200
However…
[Figure: the shingles of Doc1 and Doc2 permuted onto 0..2^64, with minima A and B]
A = B iff the shingle with the MIN value in the union of Doc1 and
Doc2 is common to both (i.e. lies in the intersection)
This happens with probability:
|intersection| / |union|
Mirror Detection
Mirror Detection
Mirroring is systematic replication of Web
pages across hosts
– Single largest cause of duplication on the Web
URL1 and URL2 are mirrors iff
for all (or most) paths p, whenever
URL1/p exists,
URL2/p exists as well
with identical (or near identical) content, and
vice versa
Example of mirrors
http://www.elsevier.com/ and
http://www.elsevier.nl/
Structural Classification of Proteins
– http://scop.mrc-lmb.cam.ac.uk/scop
– http://scop.berkeley.edu/
– http://scop.wehi.edu.au/scop
– http://pdb.weizmann.ac.il/scop
– http://scop.protres.ru/
Approaches
Bottom-up [Cho 2000]
– Group near duplicates into clusters
– Merge clusters of same cardinality and
corresponding linkage
Top-down [Bhar99, Bhar00c]
– Select features
– Compute list of pairs
– Host pair validation by sampling
Mirror detection benefits
Smart crawling
– Fetch from the fastest or freshest server
– Avoid duplication
Better connectivity analysis
– Combine inlinks
– Avoid double counting outlinks
Avoid redundancy in query results
Proxy caching
References
Mercator:
– http://research.compaq.com/SRC/mercator
K. M. Risvik and R. Michelsen, Search engines and web
dynamics, Computer Networks, vol. 39, pp. 289--302, June
2002, http://citeseer.ist.psu.edu/risvik02search.html
WebBase:
– http://www-diglib.stanford.edu/~testbed/doc2/WebBase
PolyBot:
– http://cis.poly.edu/polybot
Grub:
– www.grub.org
UbiCrawler:
– http://ubi.imc.pi.cnr.it/projects/ubicrawler
Lots of practical issues
URL normalization
Embedded session ID
Embedded URLs
Cookie loops (e.g. inner page
redirects to main page + cookie)
Malformed HTML
Avoid spam
Access free registration sites
Handling URL queries
Questions?