Transcript slides

User-Centric Web Crawling*
Christopher Olston
CMU & Yahoo! Research**
* Joint work with Sandeep Pandey
** Work done at Carnegie Mellon
Distributed Sources of Dynamic Information

[Diagram: a central monitoring node, operating under resource constraints, collects data from source A, source B, and source C]

Central monitoring node:
• Support integrated querying
• Maintain historical archive

Sources:
• Sensors
• Web sites
Workload-driven Approach

• Goal: meet usage needs, while adhering to resource constraints
• Tactic: pay attention to workload
  – workload = usage + data dynamics
• Thesis work: cooperative sources [VLDB’00, SIGMOD’01, SIGMOD’02, SIGMOD’03a, SIGMOD’03b]
• Current focus: autonomous sources
  – Data archival from Web sources [VLDB’04]
  – Supporting Web search [WWW’05]  ← this talk
Outline

• Introduction: monitoring distributed sources
• User-centric web crawling
  – Model + approach
  – Empirical results
  – Related & future work
Web Crawling to Support Search

[Diagram: users issue search queries to a search engine (index + repository); a crawler, operating under a resource constraint, refreshes the repository from web site A, web site B, and web site C]

Q: Given a full repository, when to refresh each page?
Approach

• Faced with optimization problem
• Others:
  – Maximize freshness, age, or similar
  – Boolean model of document change
• Our approach:
  – User-centric optimization objective
  – Rich notion of document change, attuned to user-centric objective
Web Search User Interface

1. User enters keywords
2. Search engine returns ranked list of results
3. User visits subset of results

[Illustration: ranked result list (1, 2, 3, 4, …) linking to documents]
Objective: Maximize Repository Quality, from Search Perspective

• Suppose a user issues search query q:

  Quality_q = Σ_documents d (likelihood of viewing d) x (relevance of d to q)

• Given a workload W of user queries:

  Average quality = 1/K x Σ_queries q ∈ W (freq_q x Quality_q)
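To make the metric concrete, here is a minimal Python sketch of average quality over a workload. It assumes the normalization constant K is the total query frequency and borrows the power-law view-probability model from the next slide; the function names and data layout are illustrative, not from the talk.

```python
# Sketch of the user-centric quality metric (illustrative names/layout).

def view_prob(rank, exponent=1.5):
    """Likelihood of viewing the result at `rank` (1-based); power-law model."""
    return rank ** -exponent

def query_quality(ranked_docs, relevance):
    """Quality_q = sum over d of ViewProb(rank of d) x Relevance(d, q).

    ranked_docs: doc ids in the order the engine currently returns them.
    relevance:   doc id -> relevance of the live copy of d to q.
    """
    return sum(view_prob(r) * relevance[d]
               for r, d in enumerate(ranked_docs, start=1))

def average_quality(workload):
    """workload: list of (freq_q, ranked_docs, relevance) per query.

    Assumes K (the normalizer) is the total query frequency.
    """
    total_freq = sum(freq for freq, _, _ in workload)
    return sum(freq * query_quality(docs, rel)
               for freq, docs, rel in workload) / total_freq
```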
Viewing Likelihood

• Depends primarily on rank in list [Joachims KDD’02]
• From AltaVista data [Lempel et al. WWW’03]:

  ViewProbability(r) ∝ r^(–1.5)

[Plot: view probability (0 to 1.2) vs. rank (0 to 150), showing the power-law fit]
Relevance Scoring Function

• Search engines’ internal notion of how well a document matches a query
• Each D/Q pair → numerical score ∈ [0,1]
• Combination of many factors, e.g.:
  – Vector-space similarity (e.g., TF.IDF cosine metric)
  – Link-based factors (e.g., PageRank)
  – Anchortext of referring pages
(Caveat)

• Using scoring function for absolute relevance (normally only used for relative ranking)
  – Need to ensure scoring function has meaning on an absolute scale
    • Probabilistic IR models, PageRank okay
    • Unclear whether TF.IDF does (still debated, I believe)
    • Bottom line: stricter interpretability requirement than “good relative ordering”
Measuring Quality

Avg. Quality = Σ_q (freq_q x Σ_d (likelihood of viewing d) x (relevance of d to q))

• freq_q: from query logs
• likelihood of viewing d: ViewProb( Rank(d, q) ), with ranks computed over the (possibly stale) repository and view probabilities from usage logs
• relevance of d to q: scoring function over the “live” copy of d
Lessons from Quality Metric

Avg. Quality = Σ_q (freq_q x Σ_d (ViewProb( Rank(d, q) ) x Relevance(d, q)))

• ViewProb(r) monotonically nonincreasing
• Quality maximized when ranking function orders documents in descending order of true relevance
• Out-of-date repository: scrambles ranking → lowers quality
• Let ΔQ_D = loss in quality due to inaccurate information about D
  – Alternatively, improvement in quality if we (re)download D
ΔQ_D: Improvement in Quality

[Diagram: re-downloading replaces the (stale) Repository Copy of D with the (fresh) Web Copy of D; Repository Quality += ΔQ_D]
Formula for Quality Gain (ΔQ_D)

Re-download document D at time t.

• Quality beforehand:
  Q(t–) = Σ_q (freq_q x Σ_d (ViewProb( Rank_t–(d, q) ) x Relevance(d, q)))

• Quality after re-download:
  Q(t) = Σ_q (freq_q x Σ_d (ViewProb( Rank_t(d, q) ) x Relevance(d, q)))

• Quality gain:
  ΔQ_D(t) = Q(t) – Q(t–) = Σ_q (freq_q x Σ_d (ΔVP x Relevance(d, q)))
  where ΔVP = ViewProb( Rank_t(d, q) ) – ViewProb( Rank_t–(d, q) )
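Read literally, the gain can be computed from rank snapshots taken just before and just after re-downloading D. A hedged Python sketch (the data layout is illustrative; view_prob is the power-law model from the earlier slide):

```python
def view_prob(rank, exponent=1.5):
    """Power-law view model from the earlier slide."""
    return rank ** -exponent

def quality_gain(affected_queries):
    """Delta-Q_D(t) = sum over q of freq_q x sum over d of
    (ViewProb(rank after) - ViewProb(rank before)) x Relevance(d, q).

    affected_queries: list of (freq_q, ranks_before, ranks_after, relevance),
    where ranks_* map doc id -> rank for query q and relevance maps
    doc id -> relevance of the live copy to q.
    """
    gain = 0.0
    for freq, ranks_before, ranks_after, relevance in affected_queries:
        for d, rel in relevance.items():
            delta_vp = view_prob(ranks_after[d]) - view_prob(ranks_before[d])
            gain += freq * delta_vp * rel
    return gain
```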
Download Prioritization

Idea: Given ΔQ_D for each doc., prioritize (re)downloading accordingly

Three difficulties:
1. ΔQ_D depends on order of downloading
2. Given both the “live” and repository copies of D, measuring ΔQ_D is computationally expensive
3. Live copy usually unavailable
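One plausible way to act on the priorities, once the three difficulties are addressed, is a per-cycle top-k selection over forecast gains; a sketch in which the `budget` parameter and names are assumptions standing in for the crawler's resource constraint:

```python
import heapq

def choose_downloads(forecast_gain, budget):
    """Pick the `budget` documents with the largest forecast Delta-Q_D.

    forecast_gain: dict doc id -> forecast quality gain for this crawl cycle.
    budget:        number of (re)downloads allowed per cycle.
    """
    return heapq.nlargest(budget, forecast_gain, key=forecast_gain.get)
```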
Difficulty 1: Order of Downloading Matters

ΔQ_D(t) = Σ_q (freq_q x Σ_d (ΔVP x Relevance(d, q)))
where ΔVP = ViewProb( Rank_t(d, q) ) – ViewProb( Rank_t–(d, q) )

• ΔQ_D depends on relative rank positions of D
• Hence, ΔQ_D depends on order of downloading
• To reduce implementation complexity, avoid tracking inter-document ordering dependencies
• Assume ΔQ_D independent of downloading of other docs.
Difficulty 3: Live Copy Unavailable

• Take measurements upon re-downloading D (live copy available at that time)
• Use forecasting techniques to project forward

[Timeline: past re-downloads yield ΔQ_D(t1), ΔQ_D(t2); forecast ΔQ_D(t_now) at the present time]
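The slides do not pin down the forecasting model; as one simple possibility, an exponentially weighted moving average over the measured gains gives a cheap per-document forecast (the smoothing weight is an arbitrary illustrative value):

```python
def update_forecast(prev_forecast, measured_gain, alpha=0.3):
    """EWMA forecast of Delta-Q_D, updated each time D is re-downloaded.

    prev_forecast: forecast carried over from earlier re-downloads
                   (None if D has not been re-downloaded before).
    measured_gain: Delta-Q_D measured at this re-download.
    """
    if prev_forecast is None:
        return measured_gain
    return alpha * measured_gain + (1 - alpha) * prev_forecast
```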
Ability to Forecast ΔQ_D

[Scatter plot: avg. weekly ΔQ_D (log scale) in the first 24 weeks vs. the second 24 weeks, with markers for the top 50%, 80%, and 90% of documents]

• Data: 15 web sites sampled from OpenDirectory topics
• Queries: AltaVista query log
• Docs downloaded once per week, in random order
Strategy So Far

• Measure shift in quality (ΔQ_D) each time we re-download document D
• Forecast future ΔQ_D
  – Treat each D independently
• Prioritize re-downloading by ΔQ_D

Remaining difficulty:
2. Given both the “live” and repository copies of D, measuring ΔQ_D is computationally expensive
Difficulty 2: Metric Expensive to Compute

One problem: measurements of other documents required.

Example:
• “Live” copy of D becomes less relevant to query q than before
  – Now D is ranked too high
  – Some users visit D in lieu of Y, which is more relevant
  – Result: less-than-ideal quality
• Upon re-downloading D, measuring the quality gain requires knowing the relevance of Y, Z

  Results for q — Actual: 1. X, 2. D, 3. Y, 4. Z   Ideal: 1. X, 2. Y, 3. Z, 4. D

Solution: estimate!
• Use approximate relevance→rank mapping functions, fit in advance for each query
DETAIL
Estimation Procedure

• Focus on query q (later we’ll see how to sum across all affected queries)

  ΔQ_D,q = Σ_d (ΔViewProb(d,q) x Relevance(d,q))
         = ΔVP(D,q) x Rel(D,q) + Σ_d≠D (ΔVP(d,q) x Rel(d,q))

• Let F_q(rel) be the relevance→rank mapping for q
  – We use a piecewise linear function in log-log space
  – Let r1 = D’s old rank (r1 = F_q(Rel(D_old, q))), r2 = D’s new rank

  Σ_d≠D (ΔVP(d,q) x Rel(d,q)) ≈ Σ_r=r1+1…r2 (VP(r–1) – VP(r)) x F_q^–1(r)
  – Use integral approximation of summation
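A sketch of the per-query estimate in Python. It assumes F_q has already been fit as a piecewise linear function in log-log space (represented here by knot arrays), uses the power-law view model, and evaluates the displaced-documents term with an explicit sum rather than the integral approximation; all names are illustrative.

```python
import numpy as np

def estimate_delta_q_for_query(rel_new, rel_old, log_rel_knots, log_rank_knots):
    """Estimate Delta-Q_{D,q} from D's old/new relevance to q alone.

    log_rel_knots, log_rank_knots: knots of the piecewise linear mapping F_q,
    fit in advance in log-log space, with log_rel_knots increasing (so
    log_rank_knots is decreasing: higher relevance -> better rank).
    """
    def vp(rank):                              # power-law view model
        return rank ** -1.5

    def rank_of(rel):                          # F_q(rel)
        log_rank = np.interp(np.log(rel), log_rel_knots, log_rank_knots)
        return max(1, int(round(np.exp(log_rank))))

    def rel_at(rank):                          # F_q^{-1}(rank)
        log_rel = np.interp(np.log(rank),
                            log_rank_knots[::-1], log_rel_knots[::-1])
        return float(np.exp(log_rel))

    r_old, r_new = rank_of(rel_old), rank_of(rel_new)

    # Term 1: D itself moves from r_old to r_new.
    gain = (vp(r_new) - vp(r_old)) * rel_new

    # Term 2: documents displaced between the two ranks each shift by one
    # position (slide's case: D became less relevant, so r_new > r_old).
    for r in range(r_old + 1, r_new + 1):
        gain += (vp(r - 1) - vp(r)) * rel_at(r)
    return gain
```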
DETAIL
Where we stand …

Context: ΔQ_D = Σ_q (freq_q x ΔQ_D,q)

ΔQ_D,q = ΔVP(D,q) x Rel(D,q) + Σ_d≠D (ΔVP(d,q) x Rel(d,q))
  where ΔVP(D,q) ≈ VP( F_q(Rel(D, q)) ) – VP( F_q(Rel(D_old, q)) )
  and Σ_d≠D (…) ≈ f(Rel(D,q), Rel(D_old,q))

⇒ ΔQ_D,q ≈ g(Rel(D,q), Rel(D_old,q))
Difficulty 2, continued

Additional problem: must measure effect of shift in rank across all queries.

Solution: couple measurements with index updating operations

Sketch:
– Basic index unit: posting. Conceptually: [term ID | document ID | scoring factors]
– Each time we insert/delete/update a posting, compute old & new relevance contributions from the term/document pair*
– Transform using the estimation procedure, and accumulate across postings touched to get ΔQ_D

* assumes scoring function treats term/document pairs independently
DETAIL
Background: Text Indexes

Basic index unit: posting
– One posting for each term/document pair
– Contains information needed for scoring function
  • Number of occurrences, font size, etc.

Dictionary                              Postings (Doc #, Freq)
Term    # docs    Total freq
aid     1         1                     (58, 1)
all     2         2                     (37, 1), (62, 1)
cold    1         1                     (15, 1)
duck    1         2                     (41, 2)
DETAIL
Pre-Processing: Approximate the Workload

• Break multi-term queries into set of single-term queries
  – Now, term = query
  – Index has one posting for each query/document pair

Dictionary (Term = query)               Postings (Doc #, Freq)
Term    # docs    Total freq
aid     1         1                     (58, 1)
all     2         2                     (37, 1), (62, 1)
cold    1         1                     (15, 1)
duck    1         2                     (41, 2)
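A sketch of this pre-processing step over a query log, assuming the log is a list of (query string, frequency) pairs; the whitespace tokenization is an illustrative simplification.

```python
from collections import Counter

def single_term_workload(query_log):
    """Approximate a multi-term query workload by per-term frequencies.

    query_log: iterable of (query_string, frequency) pairs.
    Returns term -> accumulated frequency, so each index term can be
    treated as a single-term query (term = query).
    """
    freq = Counter()
    for query, count in query_log:
        for term in query.lower().split():
            freq[term] += count
    return freq
```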
DETAIL
Taking Measurements During Index Maintenance

While updating the index:
– Initialize bank of ΔQ_D accumulators, one per document (actually, materialized on demand using a hash table)
– Each time we insert/delete/update a posting:
  • Compute new & old relevance contributions for the query/document pair: Rel(D,q), Rel(D_old,q)
  • Compute ΔQ_D,q using the estimation procedure, and add to the accumulator:
    ΔQ_D += freq_q x g(Rel(D,q), Rel(D_old,q))
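A sketch of the accumulator bank. The `estimate_gain` callback stands in for the g(·,·) estimate from the preceding slides and `term_freq` for the approximated single-term workload; both names are placeholders rather than Lucene APIs.

```python
from collections import defaultdict

class QualityGainAccumulators:
    """Accumulates Delta-Q_D per document while the index is being updated."""

    def __init__(self, term_freq, estimate_gain):
        self.term_freq = term_freq            # term (= query) -> freq_q
        self.estimate_gain = estimate_gain    # g(rel_new, rel_old)
        self.delta_q = defaultdict(float)     # doc id -> accumulated gain,
                                              # materialized on demand

    def on_posting_change(self, doc_id, term, rel_new, rel_old):
        """Call on every posting insert/delete/update for (term, doc)."""
        freq = self.term_freq.get(term, 0)
        self.delta_q[doc_id] += freq * self.estimate_gain(rel_new, rel_old)
```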
Measurement Overhead

• Implemented in Lucene
• Caveat: does not handle factors that do not depend on a single term/doc. pair, e.g. term proximity and anchortext inclusion
Summary of Approach

• User-centric metric of search repository quality
• (Re)downloading a document improves quality
• Prioritize downloading by expected quality gain
• Metric adaptations to enable feasible + efficient implementation
Next: Empirical Results

• Introduction: monitoring distributed sources
• User-centric web crawling
  – Model + approach
  – Empirical results
  – Related & future work
Overall Effectiveness

• Staleness = fraction of out-of-date documents* [Cho et al. 2000]
• Embarrassment = probability that user visits irrelevant result* [Wolf et al. 2002]
• Scoring function: PageRank (similar results for TF.IDF)

* Used “shingling” to filter out “trivial” changes

[Plot: quality (fraction of ideal) vs. resource requirement for Min. Staleness, Min. Embarrassment, and User-Centric crawling]
Reasons for Improvement

• Does not rely on size of text change to estimate importance

[Example (boston.com): page tagged as important by staleness- and embarrassment-based techniques, although it did not match many queries in the workload]
Reasons for Improvement

• Accounts for “false negatives”
• Does not always ignore frequently-updated pages

[Example (washingtonpost.com): page that user-centric crawling repeatedly re-downloads]
Related Work (1/2)

• General-purpose web crawling
  – [Cho, Garcia-Molina, SIGMOD’00], [Edwards et al., WWW’01]
    • Maximize average freshness or age
    • Balance new downloads vs. redownloading old documents
• Focused/topic-specific crawling
  – [Chakrabarti, many others]
    • Select subset of documents that match user interests
    • Our work: given a set of docs., decide when to (re)download
Most Closely Related Work

• [Wolf et al., WWW’02]:
  – Maximize weighted average freshness
  – Document weight = probability of “embarrassment” if not fresh
• User-Centric Crawling:
  – Measure interplay between update and query workloads
    • When document X is updated, which queries are affected by the update, and by how much?
  – Metric penalizes false negatives
    • Doc. ranked #1000 for a popular query should be ranked #2
    • Small embarrassment, but big loss in quality
Future Work: Detecting Change-Rate Changes

• Current techniques schedule monitoring to exploit existing change-rate estimates (e.g., ΔQ_D)
• No provision to explore change-rates explicitly
• Bad case: change-rate = 0, so never monitor
  – Won’t notice future increase in change-rate
• ⇒ Explore/exploit tradeoff
  – Ongoing work on Bandit Problem formulation
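The slides leave the bandit formulation open; purely to illustrate the explore/exploit tradeoff, an ε-greedy scheduler that reserves a small slice of the download budget for randomly chosen pages might look like this (the names and the ε value are assumptions, not the authors' method):

```python
import heapq
import random

def choose_downloads_eps_greedy(forecast_gain, budget, eps=0.1):
    """Spend ~(1 - eps) of the budget exploiting forecast gains and ~eps
    exploring uniformly at random, so pages whose estimated change-rate is
    zero still get revisited occasionally."""
    explore_n = int(round(eps * budget))
    exploit = heapq.nlargest(budget - explore_n, forecast_gain,
                             key=forecast_gain.get)
    remaining = [d for d in forecast_gain if d not in set(exploit)]
    explore = random.sample(remaining, min(explore_n, len(remaining)))
    return exploit + explore
```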
Summary

• Approach:
  – User-centric metric of search engine quality
  – Schedule downloading to maximize quality
• Empirical results:
  – High quality with few downloads
  – Good at picking “right” docs. to re-download