Transcript slides
User-Centric Web Crawling*
Christopher Olston
CMU & Yahoo! Research**
* Joint work with Sandeep Pandey
** Work done at Carnegie Mellon
Distributed Sources of Dynamic Information
[Diagram: a central monitoring node connected to source A, source B, and source C, subject to resource constraints]
• Support integrated querying
• Maintain historical archive
Sources include:
• Sensors
• Web sites
Workload-driven Approach
Goal: meet usage needs, while adhering to resource constraints
Tactic: pay attention to workload
• workload = usage + data dynamics
Thesis work: cooperative sources [VLDB’00, SIGMOD’01, SIGMOD’02, SIGMOD’03a, SIGMOD’03b]
Current focus: autonomous sources
– Data archival from Web sources [VLDB’04]
– Supporting Web search [WWW’05] ← this talk
Outline
• Introduction: monitoring distributed sources
• User-centric web crawling
  – Model + approach
  – Empirical results
  – Related & future work
Web Crawling to Support Search
[Diagram: users issue search queries to a search engine backed by an index and a repository; a crawler refreshes the repository from web sites A, B, and C under a resource constraint]
Q: Given a full repository, when to refresh each page?
Approach
Faced with an optimization problem
• Others:
  – Maximize freshness, age, or similar
  – Boolean model of document change
• Our approach:
  – User-centric optimization objective
  – Rich notion of document change, attuned to the user-centric objective
Web Search User Interface
1. User enters keywords
2. Search engine returns ranked list of results
3. User visits subset of results
[Illustration: a ranked result list (1, 2, 3, 4, …) linking to documents]
Objective: Maximize Repository Quality, from Search Perspective
Suppose a user issues search query q:
Quality_q = Σ_{documents d} (likelihood of viewing d) × (relevance of d to q)
Given a workload W of user queries:
Average quality = 1/K × Σ_{queries q ∈ W} (freq_q × Quality_q)
Viewing Likelihood
• Depends primarily on rank in list [Joachims KDD’02]
• From AltaVista data [Lempel et al. WWW’03]:
[Plot: probability of viewing vs. rank; view probability drops steeply with rank, roughly ViewProbability(r) ∝ r^(–1.5)]
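A minimal sketch of the power-law curve above, normalized over the top result positions. The function name, the 150-position cutoff, and the exact normalization are illustrative assumptions, not taken from the talk.

def view_probability(rank, max_rank=150, exponent=1.5):
    """Approximate probability of viewing the result at position `rank` (1-based),
    assuming ViewProbability(r) ∝ r**(-exponent), truncated at max_rank."""
    if rank < 1 or rank > max_rank:
        return 0.0
    normalizer = sum(r ** -exponent for r in range(1, max_rank + 1))
    return (rank ** -exponent) / normalizer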
Relevance Scoring Function
Search engines’ internal notion of how well a document matches a query
• Each D/Q pair → a numerical score in [0,1]
• Combination of many factors, e.g.:
  – Vector-space similarity (e.g., TF.IDF cosine metric)
  – Link-based factors (e.g., PageRank)
  – Anchortext of referring pages
(Caveat)
Using the scoring function for absolute relevance (normally it is only used for relative ranking)
– Need to ensure the scoring function has meaning on an absolute scale
  • Probabilistic IR models, PageRank: okay
  • Unclear whether TF.IDF does (still debated, I believe)
– Bottom line: a stricter interpretability requirement than “good relative ordering”
Measuring Quality
Avg. Quality = Σ_q (freq_q × Σ_d (likelihood of viewing d) × (relevance of d to q))
where, in practice:
• freq_q comes from query logs
• likelihood of viewing d = ViewProb( Rank(d, q) ), with ranks computed by the scoring function over the (possibly stale) repository and view probabilities estimated from usage logs
• relevance of d to q is the scoring function applied to the “live” copy of d
Lessons from Quality Metric
Avg. Quality = Σ_q (freq_q × Σ_d (ViewProb( Rank(d, q) ) × Relevance(d, q)))
• ViewProb(r) is monotonically nonincreasing
• Quality is maximized when the ranking function orders documents in descending order of true relevance
• Out-of-date repository: scrambles ranking → lowers quality
Let ΔQ_D = loss in quality due to inaccurate information about D
• Alternatively: the improvement in quality if we (re)download D
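As a concrete reading of the quality metric above, here is a minimal Python sketch. All of the names (query_freq, results, relevance_live, view_prob) are assumptions for illustration, not interfaces from the talk.

def average_quality(query_freq, results, relevance_live, view_prob):
    """query_freq: dict query -> frequency (from query logs).
    results(q): ranked list of documents the repository currently returns for q.
    relevance_live(d, q): relevance of the live copy of d to q.
    view_prob(r): probability that a user views the result at rank r (1-based)."""
    total = 0.0
    for q, freq_q in query_freq.items():
        quality_q = sum(view_prob(r) * relevance_live(d, q)
                        for r, d in enumerate(results(q), start=1))
        total += freq_q * quality_q
    return total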
ΔQ_D: Improvement in Quality
[Diagram: re-downloading D replaces the stale repository copy of D with the fresh web copy; Repository Quality += ΔQ_D]
Formula for Quality Gain (ΔQ_D)
Re-download document D at time t.
Quality beforehand:
Q(t–) = Σ_q (freq_q × Σ_d (ViewProb( Rank_{t–}(d, q) ) × Relevance(d, q)))
Quality after re-download:
Q(t) = Σ_q (freq_q × Σ_d (ViewProb( Rank_t(d, q) ) × Relevance(d, q)))
Quality gain:
ΔQ_D(t) = Q(t) – Q(t–) = Σ_q (freq_q × Σ_d (ΔVP(d, q) × Relevance(d, q)))
where ΔVP(d, q) = ViewProb( Rank_t(d, q) ) – ViewProb( Rank_{t–}(d, q) )
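A hedged sketch of the gain formula, assuming we could enumerate each query's ranking just before and just after the re-download (which, as the later slides discuss, is exactly what is too expensive to do directly). The parameter names are illustrative.

def quality_gain(query_freq, ranks_before, ranks_after, relevance, view_prob):
    """ranks_before[q], ranks_after[q]: dict doc -> rank for query q,
    just before and just after re-downloading D.
    relevance(d, q): relevance of d to q; view_prob(r): viewing probability at rank r."""
    gain = 0.0
    for q, freq_q in query_freq.items():
        for d, r_after in ranks_after[q].items():
            r_before = ranks_before[q].get(d, r_after)
            delta_vp = view_prob(r_after) - view_prob(r_before)
            gain += freq_q * delta_vp * relevance(d, q)
    return gain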
Download Prioritization
Idea: given ΔQ_D for each doc., prioritize (re)downloading accordingly
Three difficulties:
1. ΔQ_D depends on the order of downloading
2. Given both the “live” and repository copies of D, measuring ΔQ_D is computationally expensive
3. Live copy usually unavailable
Difficulty 1: Order of Downloading Matters
ΔQ_D(t) = Σ_q (freq_q × Σ_d (ΔVP(d, q) × Relevance(d, q)))
where ΔVP(d, q) = ViewProb( Rank_t(d, q) ) – ViewProb( Rank_{t–}(d, q) )
• ΔQ_D depends on the relative rank positions of D
• Hence, ΔQ_D depends on the order of downloading
• To reduce implementation complexity, avoid tracking inter-document ordering dependencies
  – Assume ΔQ_D is independent of the downloading of other docs.
Difficulty 3: Live Copy Unavailable
• Take measurements upon re-downloading D (live copy available at that time)
• Use forecasting techniques to project forward
[Timeline: past re-downloads give measurements ΔQ_D(t1), ΔQ_D(t2), …; forecast ΔQ_D(t_now)]
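The talk leaves the forecasting model open; one simple option (an assumption here, not the talk's stated method) is an exponentially weighted moving average over the past ΔQ_D measurements, with an assumed smoothing parameter alpha.

def forecast_delta_q(past_delta_qs, alpha=0.3):
    """past_delta_qs: ΔQ_D values measured at past re-downloads of D, oldest first.
    Returns a forecast of ΔQ_D for the next (potential) re-download."""
    if not past_delta_qs:
        return 0.0
    estimate = past_delta_qs[0]
    for dq in past_delta_qs[1:]:
        # Weight recent measurements more heavily than older ones.
        estimate = alpha * dq + (1 - alpha) * estimate
    return estimate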
Ability to Forecast ΔQ_D
[Scatter plot: avg. weekly ΔQ_D (log scale), first 24 weeks vs. second 24 weeks, shown for the top 50%, top 80%, and top 90% of documents]
• Data: 15 web sites sampled from OpenDirectory topics
• Queries: AltaVista query log
• Docs downloaded once per week, in random order
Strategy So Far
• Measure the shift in quality (ΔQ_D) each time we re-download document D
• Forecast future ΔQ_D
  – Treat each D independently
• Prioritize re-downloading by ΔQ_D
Remaining difficulty:
2. Given both the “live” and repository copies of D, measuring ΔQ_D is computationally expensive
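Putting the strategy together, a minimal sketch of the prioritization step: each crawl cycle, re-download the documents with the largest forecast ΔQ_D, up to the resource budget. The function names and the notion of a fixed per-cycle budget are assumptions for illustration.

import heapq

def choose_downloads(candidates, forecast_delta_q_of, budget):
    """candidates: documents eligible for re-download.
    forecast_delta_q_of(d): forecast quality gain from re-downloading d.
    budget: number of downloads allowed this cycle (the resource constraint)."""
    return heapq.nlargest(budget, candidates, key=forecast_delta_q_of)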
Difficulty 2: Metric Expensive to Compute
One problem: measurements of other documents are required.
Example: the “live” copy of D becomes less relevant to query q than before
• Now D is ranked too high
• Some users visit D in lieu of Y, which is more relevant
• Result: less-than-ideal quality
Results for q:
  Actual: 1. X   2. D   3. Y   4. Z
  Ideal:  1. X   2. Y   3. Z   4. D
Upon re-downloading D, measuring the quality gain requires knowing the relevancy of Y, Z
Solution: estimate!
• Use approximate relevance→rank mapping functions, fit in advance for each query
DETAIL
Estimation Procedure
Focus on query q (later we’ll see how to sum across all affected queries)
Q_{D,q} = Σ_d (ViewProb(d,q) × Relevance(d,q))
        = VP(D,q) × Rel(D,q) + Σ_{d≠D} (VP(d,q) × Rel(d,q))
• Let F_q(rel) be the relevance→rank mapping for q
  – We use a piecewise linear function in log-log space
  – Let r1 = D’s old rank (r1 = F_q(Rel(D_old, q))), r2 = new rank
• Shift in the Σ_{d≠D} term ≈ Σ_{r=r1+1…r2} (VP(r–1) – VP(r)) × F_q^{–1}(r)
  – Use an integral approximation of the summation
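A hedged sketch of the per-query estimate above: when D moves from old rank r1 to new rank r2, the documents in between shift by one position, and their relevance is approximated by the inverse of the fitted relevance→rank mapping F_q. The helper names (view_prob, inverse_rank_to_rel) are illustrative, and the sign handling for upward moves is my assumption.

def other_docs_shift(r1, r2, view_prob, inverse_rank_to_rel):
    """Approximate change in Σ_{d≠D} VP(d,q) × Rel(d,q) when D moves
    from rank r1 to rank r2 for query q.
    inverse_rank_to_rel(r): F_q^{-1}(r), the relevance typically found at rank r."""
    lo, hi = sorted((r1, r2))
    shift = sum((view_prob(r - 1) - view_prob(r)) * inverse_rank_to_rel(r)
                for r in range(lo + 1, hi + 1))
    # If D moved down (r2 > r1), the documents in between move up one slot and
    # gain view probability; if D moved up, they lose the same amount.
    return shift if r2 > r1 else -shift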
DETAIL
Where we stand …
Context: ΔQ_D = Σ_q (freq_q × ΔQ_{D,q})
Q_{D,q} = VP(D,q) × Rel(D,q) + Σ_{d≠D} (VP(d,q) × Rel(d,q))
• shift in VP(D,q) ≈ VP( F_q(Rel(D, q)) ) – VP( F_q(Rel(D_old, q)) )
• shift in the Σ_{d≠D} term ≈ f(Rel(D,q), Rel(D_old,q))
• overall: ΔQ_{D,q} ≈ g(Rel(D,q), Rel(D_old,q))
Difficulty 2, continued
Additional problem: must measure the effect of the shift in rank across all queries.
Solution: couple measurements with index updating operations
Sketch:
– Basic index unit: posting. Conceptually: [ term ID | document ID | scoring factors ]
– Each time we insert/delete/update a posting, compute the old & new relevance contribution from the term/document pair*
– Transform using the estimation procedure, and accumulate across postings touched to get ΔQ_D
* assumes the scoring function treats term/document pairs independently
DETAIL
Background: Text Indexes
Basic index unit: posting
– One posting for each term/document pair
– Contains the information needed by the scoring function (number of occurrences, font size, etc.)
Example:
  Dictionary (Term, # docs, Total freq) → Postings (Doc #, Freq)
  aid    1   1   →  (58, 1)
  all    2   2   →  (37, 1), (62, 1)
  cold   1   1   →  (15, 1)
  duck   1   2   →  (41, 2)
DETAIL
Pre-Processing: Approximate the Workload
Break multi-term queries into a set of single-term queries
– Now, term = query
– The index has one posting for each query/document pair
(Same example dictionary and postings as on the previous slide, with each Term now treated as a query.)
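A tiny sketch of this pre-processing step, assuming the query log is an iterable of query strings; the whitespace tokenization and lowercasing are assumptions.

from collections import Counter

def single_term_workload(query_log):
    """query_log: iterable of query strings.
    Returns a Counter mapping single-term 'query' -> frequency."""
    freq = Counter()
    for query in query_log:
        freq.update(query.lower().split())
    return freq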
DETAIL
Taking Measurements During Index Maintenance
While updating the index:
– Initialize a bank of ΔQ_D accumulators, one per document (actually materialized on demand using a hash table)
– Each time we insert/delete/update a posting:
  • Compute the new & old relevance contributions for the query/document pair: Rel(D,q), Rel(D_old,q)
  • Compute ΔQ_{D,q} using the estimation procedure, and add it to the accumulator: ΔQ_D += freq_q × g(Rel(D,q), Rel(D_old,q))
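A minimal sketch of the accumulator bank described above. Here `g` stands for the estimation procedure of the previous slides (old/new relevance contributions → estimated per-query quality change) and `query_freq` for the approximated single-term workload; both names, and the class itself, are illustrative assumptions.

from collections import defaultdict

class QualityGainAccumulators:
    def __init__(self, query_freq, g):
        self.query_freq = query_freq        # term/query -> frequency
        self.g = g                          # estimation procedure g(rel_new, rel_old)
        self.delta_q = defaultdict(float)   # doc id -> accumulated ΔQ_D, materialized on demand

    def on_posting_update(self, doc_id, term, rel_new, rel_old):
        """Call for every posting that is inserted, deleted, or updated."""
        freq = self.query_freq.get(term, 0)
        self.delta_q[doc_id] += freq * self.g(rel_new, rel_old)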
Measurement Overhead
• Implemented in Lucene
• Caveat: does not handle factors that do not depend on a single term/doc. pair, e.g. term proximity and anchortext inclusion
Summary of Approach
• User-centric metric of search repository quality
• (Re)downloading a document improves quality
• Prioritize downloading by expected quality gain
• Metric adaptations to enable a feasible + efficient implementation
Next: Empirical Results
• Introduction: monitoring distributed sources
• User-centric web crawling
  – Model + approach
  – Empirical results
  – Related & future work
Overall Effectiveness
• Staleness = fraction of out-of-date documents* [Cho et al. 2000]
• Embarrassment = probability that a user visits an irrelevant result* [Wolf et al. 2002]
  * Used “shingling” to filter out “trivial” changes
• Scoring function: PageRank (similar results for TF.IDF)
[Plot: quality (fraction of ideal) vs. resource requirement, comparing Min. Staleness, Min. Embarrassment, and User-Centric policies]
Reasons for Improvement
• Does not rely on the size of a text change to estimate importance
• Example (boston.com): tagged as important by staleness- and embarrassment-based techniques, although it did not match many queries in the workload
Reasons for Improvement
• Accounts for “false negatives”
• Does not always ignore frequently-updated pages
• Example (washingtonpost.com): user-centric crawling repeatedly re-downloads this page
Related Work (1/2)
• General-purpose web crawling [Cho, Garcia-Molina, SIGMOD’00], [Edwards et al., WWW’01]
  – Maximize average freshness or age
  – Balance new downloads vs. re-downloading old documents
• Focused/topic-specific crawling [Chakrabarti, many others]
  – Select a subset of documents that match user interests
  – Our work: given a set of docs., decide when to (re)download
Most Closely Related Work
• [Wolf et al., WWW’02]:
  – Maximize weighted average freshness
  – Document weight = probability of “embarrassment” if not fresh
• User-Centric Crawling:
  – Measure the interplay between update and query workloads
    • When document X is updated, which queries are affected by the update, and by how much?
  – Metric penalizes false negatives
    • A doc. ranked #1000 for a popular query that should be ranked #2: small embarrassment, but a big loss in quality
Future Work: Detecting Change-Rate Changes
• Current techniques schedule monitoring to exploit existing change-rate estimates (e.g., ΔQ_D)
• No provision to explore change-rates explicitly
• Bad case: change-rate = 0, so never monitor
  – Won’t notice a future increase in change-rate
• Explore/exploit tradeoff
  – Ongoing work on a Bandit Problem formulation
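The bandit formulation itself is ongoing work; as a placeholder only, here is one textbook way (ε-greedy, an assumption, not the talk's method) to mix a little exploration into an otherwise exploit-only schedule, so that documents with a current forecast of zero still get monitored occasionally.

import random

def choose_with_exploration(candidates, forecast_delta_q_of, budget, epsilon=0.1):
    """With probability epsilon, each download slot goes to a random document
    (exploration); otherwise to the highest-forecast document (exploitation)."""
    pool = list(candidates)
    chosen = []
    for _ in range(min(budget, len(pool))):
        if random.random() < epsilon:
            pick = random.choice(pool)
        else:
            pick = max(pool, key=forecast_delta_q_of)
        chosen.append(pick)
        pool.remove(pick)
    return chosen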
Summary
• Approach:
  – User-centric metric of search engine quality
  – Schedule downloading to maximize quality
• Empirical results:
  – High quality with few downloads
  – Good at picking the “right” docs. to re-download