Crawling - MediaLab
Crawling
Giuseppe Attardi
Dipartimento di Informatica
Università di Pisa
Crawling
Download a set of Web pages
Set consists typically of all pages
reachable following links from a root
set
Crawling is performed periodically
Goals:
– Find new pages
– Keep pages fresh
– Select “good” pages
Web Dynamics
Size
– ~10 billion publicly indexable pages
– ~10 kB/page ⇒ ~100 TB
– Doubles every 18 months
Dynamics
– 33% change weekly
– 8% new pages every week
– 25% new links every week
Weekly change
Fetterly, Manasse, Najork, Wiener 2003
Crawling Issues
How to crawl?
– Quality: “Best” pages first
– Efficiency: Avoid duplication (or near
duplication)
– Etiquette: Robots.txt, Server load concerns
How much to crawl? How much to
index?
– Coverage: How big is the Web? How much do
we cover?
– Relative Coverage: How much do competitors
have?
Basic crawler operation
Begin with known “seed” pages
Add them to a queue
Extract URL from queue
– Fetch it
– Parse it and extract URLs it points to
– Add any unvisited extracted URLs to
the queue
Repeat until queue is empty
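The loop above can be sketched as follows. The Web is simulated here by an in-memory link graph; `LinkGraph` and `crawl` are illustrative names, not part of any crawler described later.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <set>
#include <string>
#include <vector>

// Stand-in for "fetch the URL, parse it, extract URLs":
// the Web is simulated by an in-memory link graph.
using LinkGraph = std::map<std::string, std::vector<std::string>>;

// Breadth-first crawl: seed the queue, repeatedly pop a URL,
// "fetch" it, and enqueue any not-yet-visited extracted URLs.
std::vector<std::string> crawl(const LinkGraph& web,
                               const std::vector<std::string>& seeds) {
    std::deque<std::string> queue(seeds.begin(), seeds.end());
    std::set<std::string> seen(seeds.begin(), seeds.end());
    std::vector<std::string> order;          // pages in fetch order
    while (!queue.empty()) {
        std::string url = queue.front();
        queue.pop_front();
        order.push_back(url);                // "fetch and parse" happens here
        auto it = web.find(url);
        if (it == web.end()) continue;       // page has no outlinks
        for (const std::string& out : it->second)
            if (seen.insert(out).second)     // only unvisited URLs
                queue.push_back(out);
    }
    return order;
}
```

Using a FIFO queue makes this exactly the breadth-first order discussed below; a priority queue in its place yields the "best-first" strategies.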
Simple picture – complications
Web crawling isn’t feasible with one machine
– All of the above steps distributed
Even non-malicious pages pose challenges
– Latency/bandwidth to remote servers vary
– Robots.txt stipulations
• How “deep” should you crawl a site’s URL hierarchy?
– Site mirrors and duplicate pages
Malicious pages
– Spam pages
– Spider traps – incl dynamically generated
Politeness – don’t hit a server too often
Robots.txt
Protocol for giving spiders (“robots”)
limited access to a website, originally
from 1994
– www.robotstxt.org/wc/norobots.html
Website announces its request on
what can(not) be crawled
– A site places a file robots.txt at the
root of its URL hierarchy
– This file specifies access restrictions
Robots.txt example
No robot should visit any URL
starting with "/yoursite/temp/",
except the robot called “IxeBot":
User-agent: *
Disallow: /yoursite/temp/
User-agent: IxeBot
Disallow:
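A minimal sketch of how a crawler might interpret such records, simplified to exact agent-name matching and plain prefix Disallow rules as in the original 1994 convention (the `Rules` and `allowed` names are illustrative):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// rules[agent] holds the Disallow prefixes of that agent's record;
// rules["*"] holds those of the default record.
using Rules = std::map<std::string, std::vector<std::string>>;

// A path is allowed unless a Disallow prefix of the applicable
// record (the agent's own record, else "*") matches its beginning.
bool allowed(const Rules& rules, const std::string& agent,
             const std::string& path) {
    auto it = rules.find(agent);
    if (it == rules.end()) it = rules.find("*");
    if (it == rules.end()) return true;      // no applicable record
    for (const std::string& prefix : it->second)
        if (!prefix.empty() && path.compare(0, prefix.size(), prefix) == 0)
            return false;                    // Disallow prefix matched
    return true;
}
```

With the example above, an empty record for IxeBot overrides the default record, so IxeBot may fetch everything while other robots must skip /yoursite/temp/.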
Crawling Issues
Crawl strategies
Distributed crawling
Refresh strategies
Filtering duplicates
Mirror detection
Crawl Strategies
Where do we crawl next?
[Figure: the Web, with the URLs already crawled and parsed versus the URLs still in the queue]
Crawl Order
Want
best pages first
Potential quality measures:
• Final In-degree
• Final PageRank
Crawl heuristics:
• Breadth First Search (BFS)
• Partial Indegree
• Partial PageRank
• Random walk
Breadth-First Crawl
Basic idea:
start at a set of known URLs
explore in “concentric circles” around these URLs
start pages
distance-one pages
distance-two pages
used by broad web search engines
balances load between servers
Web Wide Crawl (328M pages) [Najo01]
BFS crawling brings in high quality
pages early in the crawl
Stanford WebBase (179K pages) [Cho98]
[Chart: overlap with the best x% of pages by in-degree, versus the fraction x% crawled in order O(u)]
Queue of URLs to be fetched
What constraints dictate which queued
URL is fetched next?
Politeness – don’t hit a server too often,
even from different threads of your spider
How far into a site you’ve crawled already
– Most sites, stay at ≤ 5 levels of URL hierarchy
Which URLs are most promising for
building a high-quality corpus
– This is a graph traversal problem:
– Given a directed graph you’ve partially visited,
where do you visit next?
Where do we crawl next?
Keep all crawlers busy
Keep crawlers from treading on each
others’ toes
– Avoid fetching duplicates repeatedly
Respect politeness/robots.txt
Avoid getting stuck in traps
Detect/minimize spam
Get the “best” pages
– What’s best?
– Best for answering search queries
Where do we crawl next?
Complex scheduling optimization
problem, subject to all the constraints
listed
– Plus operational constraints (e.g., keeping all
machines load-balanced)
Scientific study – limited to specific
aspects
– Which ones?
– What do we measure?
What are the compromises in distributed
crawling?
Page selection
Importance metric
Web crawler model
Crawler method for choosing page to
download
Importance Metrics
Given a page P, define how “good”
that page is
Several metric types:
– Interest driven
– Popularity driven
– Location driven
– Combined
Interest Driven
Define a driving query Q
Find textual similarity between P and Q
Define a word vocabulary t1…tn
Define a vector for P and Q:
– Vp, Vq = <w1,…,wn>
• wi = 0 if ti does not appear in the document
• wi = IDF(ti) = 1 / number of pages containing ti, otherwise
Importance: IS(P) = Vp · Vq (cosine product)
Computing IDF exactly requires going over the entire Web
Estimate IDF from the pages already visited, to calculate
IS’
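The estimate IS’ can be sketched as follows; IDF is computed only from the pages visited so far, as the slide suggests (the `Doc`, `idf` and `is_score` names are illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <set>
#include <string>
#include <vector>

using Doc = std::set<std::string>;            // terms of a page or query

// IDF(t) estimated from the pages visited so far:
// 1 / number of visited pages containing t.
double idf(const std::vector<Doc>& visited, const std::string& t) {
    int df = 0;
    for (const Doc& d : visited) df += (int)d.count(t);
    return df == 0 ? 0.0 : 1.0 / df;
}

// IS'(P): cosine of the IDF-weighted vectors of page P and query Q.
double is_score(const std::vector<Doc>& visited,
                const Doc& p, const Doc& q) {
    double dot = 0, np = 0, nq = 0;
    std::set<std::string> vocab(p.begin(), p.end());
    vocab.insert(q.begin(), q.end());
    for (const std::string& t : vocab) {
        double w = idf(visited, t);           // shared weight wi
        double wp = p.count(t) ? w : 0, wq = q.count(t) ? w : 0;
        dot += wp * wq; np += wp * wp; nq += wq * wq;
    }
    return (np == 0 || nq == 0) ? 0.0 : dot / std::sqrt(np * nq);
}
```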
Popularity Driven
How popular a page is:
– Backlink count
• IB(P) = the number of pages containing a
link to P
• Estimate by previous crawls: IB’(P)
– More sophisticated metric, e.g.
PageRank: IR(P)
Location Driven
IL(P): a function of the URL of P
– Words appearing in the URL
– Number of “/” in the URL
Easily evaluated, requires no data
from previous crawls
Combined Metrics
IC(P): a function of several other
metrics
Allows using local metrics for first
stage and estimated metrics for
second stage
IC(P) = a*IS(P) + b*IB(P) + c*IL(P)
Focused Crawling
Focused Crawling (Chakrabarti)
Distributed federation of focused
crawlers
Supervised topic classifier
Controls priority of unvisited frontier
Trained on document samples from
Web directory (Dmoz)
Basic Focused Crawler
DOM tree analysis
Learning Algorithm
Naïve Bayes
Page u is modeled as bag of words
{ <feature, freq> }
feature = < term, distance >
Comparisons
Start the baseline crawler from the URLs
in one topic
Fetch up to 20000-25000 pages
For each pair of fetched pages (u,v), add an
item to the training set of the apprentice
Train the apprentice
Start the enhanced crawler from the same
set of pages
Fetch about the same number of pages
Results
Controversy
Chakrabarti claims the focused crawler is
superior to breadth-first
Suel claims the contrary, arguing that the
claim was based on experiments with
poorly performing crawlers
Distributed Crawling
Approaches
Centralized Parallel Crawler
Distributed
P2P
Parallel Crawlers
A parallel crawler consists of multiple
crawling processes communicating via
local network (intra-site parallel crawler) or
Internet (distributed crawler)
– http://www2002.org/CDROM/refereed/108/index.html
Setting: we have a number of c-proc’s
– c-proc = crawling process
Goal: we wish to crawl the best pages with
minimum overhead
Crawler-process distribution
– at geographically distant locations: distributed crawler
– on the same local network: central parallel crawler
Distributed model
Crawlers may be running in diverse
geographic locations
– Periodically update a master index
– Incremental update so this is “cheap”
• Compression, differential update etc.
– Focus on communication overhead
during the crawl
Also results in dispersed WAN load
Issues and benefits
Issues:
– overlap: minimization of multiple downloaded
pages
– quality: depends on the crawl strategy
– communication bandwidth: minimization
Benefits:
– scalability: for large-scale web-crawls
– costs: use of cheaper machines
– network-load dispersion and reduction: by
dividing the web into regions and crawling
only the nearest pages
Coordination
A parallel crawler consists of multiple crawling
processes communicating via local network
(intra-site parallel crawler) or Internet (distributed
crawler)
Coordination among c-proc's:
1. Independent: no coordination, every process follows its
extracted links
2. Dynamic assignment: a central coordinator dynamically
divides the web into small partitions and assigns each
partition to a process
3. Static assignment: the Web is partitioned and assigned
without a central coordinator before the crawl starts
c-proc’s crawling the web
[Figure: several c-proc's drawing from the Web, each with its URLs crawled and URLs in queues — which c-proc gets a newly discovered URL? Communication: by URLs passed between c-proc's.]
Static assignment
Links from one partition to another (inter-partition links) can be handled in:
1. Firewall mode: a process does not follow any inter-partition link
2. Cross-over mode: a process also follows inter-partition links and thereby discovers more pages in its partition
3. Exchange mode: processes exchange inter-partition URLs; this mode needs communication
[Figure: pages a–d in Partition 1 and e–i in Partition 2, with intra- and inter-partition links]
Classification of parallel crawlers
If exchange mode is used, communication can be
limited by:
– Batch communication: every process collects some URLs
and sends them in a batch
– Replication: the k most popular URLs are replicated at each
process and are not exchanged (known from a previous crawl or on the fly)
Some ways to partition the Web:
– URL-hash based: many inter-partition links
– Site-hash based: reduces the inter-partition links
– Hierarchical: .com domain, .net domain …
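The effect of site-hash partitioning can be sketched as follows: hashing only the host of a URL (a hypothetical `assign` helper, assuming simple "http://host/path" URLs) keeps every page of a site in one partition, so intra-site links never become inter-partition links.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Crude host extraction, assuming "http://host/path"-shaped URLs.
std::string host_of(const std::string& url) {
    auto start = url.find("//");
    start = (start == std::string::npos) ? 0 : start + 2;
    auto end = url.find('/', start);             // npos => rest of string
    return url.substr(start, end - start);
}

// Site-hash based assignment of a URL to one of num_procs c-proc's.
int assign(const std::string& url, int num_procs) {
    return (int)(std::hash<std::string>{}(host_of(url)) % num_procs);
}
```

URL-hash based assignment would instead hash the full URL, scattering a site's pages across all c-proc's and turning most intra-site links into inter-partition links.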
Static assignment: comparison

Mode       Coverage   Overlap   Quality   Communication
Firewall   Bad        Good      Bad       Good
Crossover  Good       Bad       Bad       Good
Exchange   Good       Good      Good      Bad
UBI Crawler
Features:
– Full distribution: identical agents / no central coordinator
– Balanced locally computable assignment:
• each URL is assigned to one agent
• each agent can compute the URL assignment locally
• distribution of URLs is balanced
– Scalability:
• number of crawled pages per second and per agent are
independent of the number of agents
– Fault tolerance:
• URLs are not statically distributed
• distributed reassignment protocol not reasonable
UBI Crawler: Assignment Function
A: set of agent identifiers
L ⊆ A: set of alive agents
m: total number of hosts
δL: assigns host h to an alive agent: δL(h) ∈ L
Requirements:
Balance: each agent should be responsible for approximately the
same number of hosts:
|δL⁻¹(a)| ≈ m / |L|
Contravariance: if the number of agents grows, the portion of the
web crawled by each agent must shrink:
L ⊆ L' ⇒ δL'⁻¹(a) ⊆ δL⁻¹(a)
L ⊆ L' ∧ δL'(h) ∈ L ⇒ δL'(h) = δL(h)
Consistent Hashing
Each bucket is replicated k times and each replica is mapped
randomly on the unit circle
Hashing a key: compute a point on the unit circle and find the nearest
replica
Example: L = {a,b}, L‘ = {a,b,c}, k = 3, hosts = {0,1,…,9}
[Figure: two unit circles showing the replicas of the agents and the hash points of the hosts]
δL⁻¹(a) = {1,4,5,6,8,9}
δL⁻¹(b) = {0,2,3,7}
δL'⁻¹(a) = {4,5,6,8}
δL'⁻¹(b) = {0,2,7}
δL'⁻¹(c) = {1,3,9}
Contravariance: L ⊆ L' ⇒ δL'⁻¹(a) ⊆ δL⁻¹(a)
Balancing: by the hash function and random
number generator
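A minimal consistent-hashing sketch, using the standard library hash in place of a true random mapping onto the unit circle (the `ConsistentHash` class is illustrative):

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Each bucket (agent) gets k replicas on the circle, here the space
// of size_t hash values; a key is served by the first replica at or
// after its own hash point, wrapping around at the end.
class ConsistentHash {
    std::map<size_t, std::string> circle_;   // replica point -> agent
public:
    void add(const std::string& agent, int k = 3) {
        for (int i = 0; i < k; ++i)
            circle_[std::hash<std::string>{}(
                agent + "#" + std::to_string(i))] = agent;
    }
    std::string lookup(const std::string& key) const {
        size_t h = std::hash<std::string>{}(key);
        auto it = circle_.lower_bound(h);    // nearest replica clockwise
        if (it == circle_.end()) it = circle_.begin();  // wrap around
        return it->second;
    }
};
```

The contravariance property falls out by construction: adding agent c leaves the replicas of a and b where they were, so a host either keeps its owner or moves to c, never between a and b.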
UBI Crawler: fault tolerance
Up to now: no metrics for estimating the fault tolerance of
distributed crawlers
– Each agent has its own view of the set of alive agents (views
can be different) but two agents will never dispatch hosts to
two different agents.
[Figure: agents a, b, c, d; agent b dies and its hosts are redistributed among the surviving agents]
– Agents can be added dynamically in a self-stabilizing
way
Evaluation metrics (1)
1. Overlap:
Overlap = (N − I) / N
N: total number of fetched pages
I: number of distinct fetched pages
• minimize the overlap
2. Coverage:
Coverage = I / U
U: total number of Web pages
• maximize the coverage
Evaluation metrics (2)
3. Communication overhead:
Overhead = M / P
M: number of exchanged messages
(URLs)
P: number of downloaded pages
• minimize the overhead
4. Quality:
Quality = (1/N) Σi PageRank(pi)
• maximize the quality
• backlink count / oracle crawler
Experiments
40M URL graph – Stanford Webbase
– Open Directory (dmoz.org) URLs as
seeds
Should be considered a small Web
Firewall mode coverage
The price of crawling in firewall mode
Crossover mode overlap
Demanding coverage drives up
overlap
Exchange mode communication
Communication overhead sublinear
per downloaded URL
Cho’s conclusion
With < 4 crawling processes run in parallel, firewall
mode provides good coverage
Firewall mode is not appropriate when:
– there are > 4 crawling processes
– only a small subset of the Web is downloaded and the
quality of the downloaded pages is important
Exchange mode:
– consumes < 1% of network bandwidth for URL exchanges
– maximizes the quality of the downloaded pages
By replicating 10,000 - 100,000 popular URLs,
communication overhead is reduced by 40%
Resources
www.robotstxt.org/wc/norobots.html
www2002.org/CDROM/refereed/108/index.html
www2004.org/proceedings/docs/1p595.pdf
Crawler Models
A crawler
– Tries to visit more important pages first
– Only has estimates of importance metrics
– Can only download a limited amount
How well does a crawler perform?
– Crawl and Stop
– Crawl and Stop with Threshold
Crawl and Stop
A crawler stops after visiting K pages
A perfect crawler
– Visits pages with ranks R1,…,Rk
– These are called Top Pages
A real crawler
– Visits only M < K top pages
Performance rate: Pcs = M / K
For a random crawler: Pcs = K / N
(N: total number of pages)
Crawl and Stop with Threshold
A crawler stops after visiting K pages
Top pages are the T pages with a metric higher than G
A crawler visits V top pages
Metric: percent of top pages visited
Perfect crawler: Pst = K / T (= 1 when K ≥ T)
Random crawler: Pst = (T · K/N) / T = K / N
Real crawler: Pst = V / T
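The threshold metric can be sketched as follows (a hypothetical `pst` helper; page names and scores are made up for illustration):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Crawl-and-stop with threshold: pages with importance > G are the
// T "top" pages; after fetching the first K pages of the crawl
// order, performance is Pst = V / T, where V counts the top pages
// among those fetched.
double pst(const std::map<std::string, double>& importance,
           const std::vector<std::string>& crawl_order,
           double G, size_t K) {
    size_t T = 0, V = 0;
    for (const auto& kv : importance)
        if (kv.second > G) ++T;              // count all top pages
    for (size_t i = 0; i < K && i < crawl_order.size(); ++i)
        if (importance.at(crawl_order[i]) > G) ++V;  // top pages fetched
    return T == 0 ? 1.0 : double(V) / double(T);
}
```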
Ordering Metrics
The crawlers queue is prioritized
according to an ordering metric
The ordering metric is based on an
importance metric
– Location metrics - directly
– Popularity metrics – via estimates according to
previous crawls
– Similarity metrics – via estimates according to
anchor text
When to stop
Up to a certain depth
Up to a certain amount of pages from each
site
Up to a maximum number of pages
Literature
Mercator (Altavista, Java)
WebBase (Garcia-Molina, Cho)
PolyBot (Suel, NY Polytechnic)
Grub (grub.org)
UbiCrawler (Vigna, Uni. Milano)
Crawler Architecture
PolyBot
• application issues
requests to manager
• manager does DNS
and robot exclusion
• manager schedules
URL on downloader
• downloader gets file
and puts it on disk
• application is notified
of new files
• application parses new
files for hyperlinks
• application sends data
to storage component
Scaling up
• 20 machines
• 1500 pages/s?
• depends on crawl strategy
• hash URLs to nodes based on site (because of robot exclusion)
Scheduler Data Structures
• when to insert new URLs into internal structures?
IXE Crawler
[Architecture diagram: a Feeder and a Scheduler feed per-host queues; Retriever threads (using select()) fetch pages; a Parser extracts citations; persistent Table<UrlInfo> (with CrawlInfo), Citations, Hosts and Robots tables, a cache and a UrlEnumerator back the in-memory structures, with synchronization objects mediating between memory and disk]
Scheduler
Splits URLs by hosts
Distributes to Retrievers
Keeps hosts on a wait queue for a grace
period
Retriever
Modified version of cURL library
Asynchronous DNS resolver
Handles up to 1024 transfers at once
Keeps connections alive, transfers
multiple files on one connection
(connection setup can be 25% of
retrieval time)
Persistent Storage
Implemented by metaprogramming
as IXE object-relational tables
Tuned to handle concurrent access,
with low granularity locking
Locking completely transparent to
application
< 1% of overall runtime
Parser
> 30% of execution time
Uses a ThreadPool
Locking and Concurrency
Portable layer (Unix, Windows):
– Thread
– ThreadGroup
– ThreadPool
– Lock
– Condition
– ConditionWait
Pipe
bool Get(T& res) {
LockUp our(_lock);
// wait in a loop: guards against spurious wakeups and
// consumers racing for the same item; items still in the
// pipe are delivered even after it is closed
while (empty()) {
if (closed)
return false;
ConditionWait(not_empty, &_lock);
}
res = front();
pop_front();
return true;
}
In Memory Communication
Channel<T>
Pipe<T, Q = queue<T> >
MultiPipe<T, Q = queue<T> >
Asynchronous Put(obj)
Synchronous Get(obj)
Peek()
Close()
Performance Comparison
                 nutch           IXE
Language         Java            C++
Parallelism      threads: 300    async IO: 1 thread, 500 connections
DNS resolution   synchronous     asynchronous
Page analysis    stages          concurrent
Download speed   19 pages/sec    120 pages/sec
Bandwidth        1.3 Mb/s peak   24 Mb/s peak
Crawl Performance
Single PC:
– Peak 120 pages per second (24 Mb/s)
– 5 million pages/day
Full crawl of 2 billion pages:
– 400 days on 1 PC at 25 Mb/s, or
– 4 days on 100 PCs, with 2.5 Gb/s bandwidth
Crawler Console
Web pushlet (aka AJAX) graphical
monitor
Real time bandwidth usage graph
Scheduled Windows Service
URL compression
10 billion URLs = 550 GB
Must map URL to unique Ids
Simple trick: use an undocumented
feature of zlib
– Supply a dictionary to decompress()
– Create the dictionary by compressing a
number of URLs with typical patterns
– Save the dictionary for separate use
– Achieves > 50% compression rate
Statistics
Page avg. 13KB, 8 links per page
From 20 million pages >100 million
links: several days to crawl
Related work
Mercator (Heydon/Najork, DEC/Compaq)
– used in AltaVista
– centralized system (2-CPU Alpha with RAID
disks)
– URL-seen test by fast disk access and caching
– one thread per HTTP connection
– completely in Java, with pluggable
components
Atrax: distributed extension to Mercator
– combines several Mercators
– URL hashing and off-line URL check
Related Work (cont.)
early Internet Archive crawler (circa 96)
– uses hashing to partition URLs between
crawlers
– bloom filter for “URL seen” structure
early Google crawler (1998)
P2P crawlers (grub.org and others)
Cho/Garcia-Molina (WWW 2002)
– study of overhead/quality tradeoff in parallel
crawlers
– difference: we scale services separately, and
focus on single-node performance
– in our experience, parallel overhead low
Open Issues
Measuring and tuning peak performance
– need simulation environment
– eventually reduces to parsing and network
– to be improved: space, fault-tolerance (Xactions?)
Highly Distributed crawling
– highly distributed (e.g., grub.org) ? (maybe)
– hybrid? (different services)
– few high-performance sites? (several Universities)
Recrawling and focused crawling strategies
– what strategies?
– how to express?
– how to implement?
Refresh Strategies
Page Refresh
Make sure pages are up-to-date
Many possible strategies:
– Uniform refresh: fi = fj for all i, j
– Proportional to change frequency: fi / fj = λi / λj
(fi: refresh frequency of page i, λi: its change frequency)
Need to define a metric
Freshness Metric
Freshness of a page:
F(ei, t) = 1 if ei is fresh at time t, 0 otherwise
Freshness of the collection S of N pages:
F(S, t) = (1/N) Σ i=1..N F(ei, t)
Age of a page:
A(ei, t) = time since ei was modified (0 if fresh)
Age of the collection:
A(S, t) = (1/N) Σ i=1..N A(ei, t)
Average Freshness
Freshness changes over time
Take the average freshness over a
long period of time
F̄(S) = lim t→∞ (1/t) ∫ 0..t F(S, t) dt
Refresh Strategy
Crawlers can refresh only a certain
amount of pages in a period of time
The page download resource can be
allocated in many ways
The proportional refresh policy
allocates the resource proportionally
to the pages’ change rates
Example
The collection contains 2 pages
– E1 changes 9 times a day
– E2 changes once a day
– Simplified change model
• The day is split into 9 equal intervals, and E1
changes once on each interval
• E2 changes once during the entire day
• The only unknown is when the pages change within
the intervals
The crawler can download a page a day
Goal is to maximize the freshness
Example (2)
Example (3)
Which page do we refresh?
– If we refresh E2 at midday:
• If E2 changed in the first half of the day, it remains
fresh for the remaining half of the day
– 50% chance of a 0.5 day freshness increase
– 50% chance of no increase
– Expected freshness increase: 0.25 day
– If we refresh E1 at midday:
• If E1 changed in the first half of its interval, and we
refresh at midday (the middle of an interval), it
remains fresh for the remaining half of the
interval = 1/18 of a day
– 50% chance of a 1/18 day freshness increase
– 50% chance of no increase
– Expected freshness increase: 1/36 day
Example (4)
This gives a nice estimation
But things are more complex in real
life
– Not sure that a page will change within
an interval
– Have to worry about age
Using a Poisson model shows a
uniform policy always performs
better than a proportional one
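Under the Poisson model this comparison can be made exact: a page changing at rate λ and refreshed periodically with frequency f has time-average freshness F = (f/λ)(1 − e^(−λ/f)), the standard Cho/Garcia-Molina result. Applying it to the two-page example (λ1 = 9, λ2 = 1 changes/day, a budget of one refresh per day):

```cpp
#include <cassert>
#include <cmath>

// Time-average freshness of a page with Poisson change rate lambda,
// refreshed periodically with frequency f (interval 1/f):
// F = (f / lambda) * (1 - exp(-lambda / f))
double freshness(double lambda, double f) {
    return (f / lambda) * (1.0 - std::exp(-lambda / f));
}

// Average freshness of a two-page collection under a refresh policy
// assigning frequencies f1 and f2 to pages with rates l1 and l2.
double avg_freshness(double l1, double f1, double l2, double f2) {
    return (freshness(l1, f1) + freshness(l2, f2)) / 2.0;
}
```

Splitting the daily budget uniformly (f1 = f2 = 0.5) beats splitting it proportionally to the change rates (f1 = 0.9, f2 = 0.1): the proportional policy wastes nearly all refreshes on the fast-changing page, which is stale again almost immediately.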
Comparing Policies
Policy        Freshness   Age
Proportional  0.12        400 days
Uniform       0.57        5.6 days
Optimal       0.62        4.3 days

Based on statistics from an experiment
with a revisit frequency of one month
Duplicate Detection
Duplicate/Near-Duplicate Detection
Duplicate: Exact match with fingerprints
Near-Duplicate: Approximate match
– Overview
• Compute syntactic similarity with an edit-distance measure
• Use a similarity threshold to detect near-duplicates
– e.g., similarity > 80% ⇒ documents are “near
duplicates”
– Not transitive, though sometimes used transitively
Computing Near Similarity
Features:
– Segments of a document (natural or artificial
breakpoints) [Brin95]
– Shingles (word N-Grams) [Brin95, Brod98]
“a rose is a rose is a rose” =>
a_rose_is_a
rose_is_a_rose
is_a_rose_is
Similarity Measure
– TF*IDF [Shiv95]
– Set intersection [Brod98]
|Intersection| / |Union|
Jaccard measure
Shingles + Set Intersection
Computing the exact set intersection of
shingles between all pairs of
documents is infeasible
– Approximate using a cleverly chosen
subset of shingles from each (a sketch)
Shingles + Set Intersection
Estimate |intersection| / |union| based on
a short sketch ([Brod97, Brod98])
– Create a “sketch vector” (e.g. of size 200) for
each document
– Documents which share more than t (say 80%)
corresponding vector elements are similar
– For doc D, sketch[i] is computed as follows:
• Let f map all shingles in the universe to 0..2^m (e.g. f =
fingerprinting)
• Let pi be a specific random permutation on 0..2^m
• Pick sketch[i] := MIN pi(f(s)) over all shingles s in D
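The sketch computation can be implemented as follows, with seeded hashes standing in for the random permutations pi and 4-word shingles as in the “rose” example (all names illustrative):

```cpp
#include <cassert>
#include <algorithm>
#include <functional>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Word 4-grams ("a_rose_is_a_", ...) as shingles.
std::vector<std::string> shingles(const std::string& text, size_t n = 4) {
    std::vector<std::string> words, out;
    std::istringstream in(text);
    for (std::string w; in >> w; ) words.push_back(w);
    for (size_t i = 0; i + n <= words.size(); ++i) {
        std::string s;
        for (size_t j = 0; j < n; ++j) s += words[i + j] + "_";
        out.push_back(s);
    }
    return out;
}

// sketch[i] = min over all shingles of the i-th seeded hash;
// the seeded hashes approximate the random permutations pi.
std::vector<size_t> sketch(const std::string& text, size_t k = 200) {
    std::vector<size_t> sk(k, std::numeric_limits<size_t>::max());
    for (const std::string& s : shingles(text))
        for (size_t i = 0; i < k; ++i)
            sk[i] = std::min(sk[i],
                std::hash<std::string>{}(s + "#" + std::to_string(i)));
    return sk;
}

// Fraction of matching sketch positions estimates |intersection| / |union|.
double resemblance(const std::string& a, const std::string& b) {
    std::vector<size_t> sa = sketch(a), sb = sketch(b);
    size_t eq = 0;
    for (size_t i = 0; i < sa.size(); ++i) eq += (sa[i] == sb[i]);
    return double(eq) / double(sa.size());
}
```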
Computing Sketch[i] for Doc1
[Figure: Document 1 — start with its 64-bit shingles on the number line 0..2^64, permute them with pi, pick the min value]
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
[Figure: Documents 1 and 2 — permute the shingles of each with pi; A = minimum for Doc1, B = minimum for Doc2; are these equal?]
Test for 200 random permutations:
p1, p2,… p200
However…
[Figure: the shingles of Doc1 and Doc2 permuted onto 0..2^64, with minima A and B]
A = B iff the shingle with the MIN value in the union of Doc1 and
Doc2 is common to both (i.e. lies in the intersection)
This happens with probability:
|intersection| / |union|
Mirror Detection
Mirror Detection
Mirroring is systematic replication of Web
pages across hosts
– Single largest cause of duplication on the Web
URL1 and URL2 are mirrors iff
for all (or most) paths p, whenever
URL1/p exists,
URL2/p exists as well
with identical (or near identical) content, and
vice versa
Example of mirrors
http://www.elsevier.com/ and
http://www.elsevier.nl/
Structural Classification of Proteins
– http://scop.mrc-lmb.cam.ac.uk/scop
– http://scop.berkeley.edu/
– http://scop.wehi.edu.au/scop
– http://pdb.weizmann.ac.il/scop
– http://scop.protres.ru/
Approaches
Bottom-up [Cho 2000]
– Group near duplicates into clusters
– Merge clusters of same cardinality and
corresponding linkage
Top-down [Bhar99, Bhar00c]
– Select features
– Compute list of pairs
– Host pair validation by sampling
Mirror detection benefits
Smart crawling
– Fetch from the fastest or freshest server
– Avoid duplication
Better connectivity analysis
– Combine inlinks
– Avoid double counting outlinks
Avoid redundancy in query results
Proxy caching
References
Mercator:
– http://research.compaq.com/SRC/mercator
K. M. Risvik and R. Michelsen, Search engines and web
dynamics, Computer Networks, vol. 39, pp. 289--302, June
2002, http://citeseer.ist.psu.edu/risvik02search.html
WebBase:
– http://www-diglib.stanford.edu/~testbed/doc2/WebBase
PolyBot:
– http://cis.poly.edu/polybot
Grub:
– www.grub.org
UbiCrawler:
– http://ubi.imc.pi.cnr.it/projects/ubicrawler
Lots of practical issues
URL normalization
Embedded session ID
Embedded URLs
Cookie loops (e.g. inner page
redirects to main page + cookie)
Malformed HTML
Avoid spam
Access free registration sites
Handling URL queries
Questions?