Transcript slides
User-Centric Web Crawling*
Christopher Olston, CMU & Yahoo! Research**
* Joint work with Sandeep Pandey
** Work done at Carnegie Mellon

Distributed Sources of Dynamic Information
• Sources: sensors, web sites, … (source A, source B, source C)
• A central monitoring node must:
  – support integrated querying
  – maintain a historical archive
  – operate under resource constraints

Workload-driven Approach
• Goal: meet usage needs, while adhering to resource constraints
• Tactic: pay attention to workload
  – workload = usage + data dynamics
• Thesis work: cooperative sources [VLDB'00, SIGMOD'01, SIGMOD'02, SIGMOD'03a, SIGMOD'03b]
• Current focus: autonomous sources
  – Data archival from Web sources [VLDB'04]
  – Supporting Web search [WWW'05]  ← this talk

Outline
• Introduction: monitoring distributed sources
• User-centric web crawling
  – Model + approach
  – Empirical results
  – Related & future work

Web Crawling to Support Search
• [Diagram: users issue search queries to a search engine, which answers from an index built over a repository; a crawler refreshes the repository from web sites A, B, C under a resource constraint]
• Q: Given a full repository, when to refresh each page?

Approach
• Faced with an optimization problem
• Others:
  – Maximize freshness, age, or similar
  – Boolean model of document change
• Our approach:
  – User-centric optimization objective
  – Rich notion of document change, attuned to the user-centric objective

Web Search User Interface
1. User enters keywords
2. Search engine returns ranked list of results
3. User visits subset of results

Objective: Maximize Repository Quality, from Search Perspective
• Suppose a user issues search query q:
  Quality_q = Σ_documents d [ (likelihood of viewing d) × (relevance of d to q) ]
• Given a workload W of user queries:
  Average quality = 1/K × Σ_queries q ∈ W (freq_q × Quality_q)

Viewing Likelihood
• Depends primarily on rank in the result list [Joachims KDD'02]
• From AltaVista data [Lempel et al. WWW'03]: ViewProbability(r) ∝ r^–1.5
• [Plot: view probability vs. rank, decaying steeply over ranks 1–150]

Relevance Scoring Function
• Search engines' internal notion of how well a document matches a query
• Each document/query pair → numerical score in [0, 1]
• Combination of many factors, e.g.:
  – Vector-space similarity (e.g., TF.IDF cosine metric)
  – Link-based factors (e.g., PageRank)
  – Anchortext of referring pages

(Caveat)
• We use the scoring function for absolute relevance (normally it is only used for relative ranking)
  – Need to ensure the scoring function has meaning on an absolute scale
  – Probabilistic IR models and PageRank are okay; unclear whether TF.IDF is (still debated, I believe)
• Bottom line: a stricter interpretability requirement than "good relative ordering"

Measuring Quality
• Avg. Quality = Σ_q freq_q × Σ_d [ (likelihood of viewing d) × (relevance of d to q) ]
  – freq_q comes from query logs
  – likelihood of viewing d = ViewProb( Rank(d, q) ), estimated from usage logs, where Rank is computed by the scoring function over the (possibly stale) repository
  – relevance of d to q: scoring function over the "live" copy of d
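To make the quality metric above concrete, here is a minimal Python sketch. It assumes relevance scores are already computed on an absolute [0, 1] scale and supplied as dictionaries (repo_scores for the possibly stale repository used for ranking, live_relevance for the live copies), and it takes the normalizing constant K to be the total query frequency; these names and choices are illustrative, not from the talk.

```python
def view_prob(rank, c=1.0, alpha=1.5):
    """Viewing likelihood as a function of result rank: ViewProb(r) ~ r^-1.5
    (the AltaVista-derived curve cited above), with an assumed constant c."""
    return c * rank ** (-alpha)

def query_quality(q, repo_scores, live_relevance):
    """Quality_q = sum over d of ViewProb(Rank(d, q)) * Relevance(d, q).
    Rank comes from the (possibly stale) repository scores; Relevance is
    measured against the live copy of each document."""
    ranked = sorted(repo_scores[q], key=repo_scores[q].get, reverse=True)
    return sum(view_prob(r) * live_relevance[q][d]
               for r, d in enumerate(ranked, start=1))

def average_quality(workload, repo_scores, live_relevance):
    """Average quality = (1/K) * sum over queries q of freq_q * Quality_q,
    with K taken here as the total query frequency."""
    k = sum(workload.values())
    return sum(freq * query_quality(q, repo_scores, live_relevance)
               for q, freq in workload.items()) / k
```

Note that a stale repository changes Rank(d, q) but not the live relevance; that gap is exactly what the ΔQ_D machinery on the following slides quantifies.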
Lessons from Quality Metric
• Avg. Quality = Σ_q freq_q × Σ_d [ ViewProb( Rank(d, q) ) × Relevance(d, q) ]
• ViewProb(r) is monotonically nonincreasing
• Quality is maximized when the ranking function orders documents in descending order of true relevance
• An out-of-date repository scrambles the ranking → lowers quality
• Let ΔQ_D = loss in quality due to inaccurate information about D
  – Alternatively: the improvement in quality if we (re)download D

ΔQ_D: Improvement in Quality
• [Diagram: re-downloading replaces the stale repository copy of D with the fresh web copy; repository quality += ΔQ_D]

Formula for Quality Gain (ΔQ_D)
• Re-download document D at time t.
• Quality beforehand:
  Q(t–) = Σ_q freq_q × Σ_d [ ViewProb( Rank_t–(d, q) ) × Relevance(d, q) ]
• Quality after re-download:
  Q(t) = Σ_q freq_q × Σ_d [ ViewProb( Rank_t(d, q) ) × Relevance(d, q) ]
• Quality gain:
  ΔQ_D(t) = Q(t) – Q(t–) = Σ_q freq_q × Σ_d [ ΔVP × Relevance(d, q) ]
  where ΔVP = ViewProb( Rank_t(d, q) ) – ViewProb( Rank_t–(d, q) )

Download Prioritization
• Idea: given ΔQ_D for each document, prioritize (re)downloading accordingly
• Three difficulties:
  1. ΔQ_D depends on the order of downloading
  2. Given both the "live" and repository copies of D, measuring ΔQ_D is computationally expensive
  3. The live copy is usually unavailable

Difficulty 1: Order of Downloading Matters
• ΔQ_D(t) = Σ_q freq_q × Σ_d [ ΔVP × Relevance(d, q) ], where ΔVP = ViewProb( Rank_t(d, q) ) – ViewProb( Rank_t–(d, q) )
• ΔQ_D depends on the relative rank positions of D, hence on the order of downloading
• To reduce implementation complexity, avoid tracking inter-document ordering dependencies:
  – Assume ΔQ_D is independent of the downloading of other documents

Difficulty 3: Live Copy Unavailable
• Take measurements upon re-downloading D (the live copy is available at that time)
• Use forecasting techniques to project forward:
  – Past re-downloads yield measurements ΔQ_D(t1), ΔQ_D(t2), …; forecast ΔQ_D(t_now)

Ability to Forecast ΔQ_D
• [Plot: average weekly ΔQ_D (log scale), second 24 weeks vs. first 24 weeks, with top 50% / 80% / 90% cutoffs]
• Data: 15 web sites sampled from OpenDirectory topics
• Queries: AltaVista query log
• Documents downloaded once per week, in random order

Strategy So Far
• Measure the shift in quality (ΔQ_D) each time document D is re-downloaded
• Forecast future ΔQ_D
  – Treat each D independently
• Prioritize re-downloading by forecast ΔQ_D
• Remaining difficulty:
  2. Given both the "live" and repository copies of D, measuring ΔQ_D is computationally expensive
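The "Strategy So Far" slide compresses to a small scheduling loop: measure ΔQ_D at each re-download, forecast it, and re-download the documents with the largest forecast gain within the budget. Below is a minimal sketch; the exponentially weighted moving average is only a stand-in for the talk's (unspecified) forecasting technique, and the class and method names are invented for illustration.

```python
import heapq
from collections import defaultdict

class CrawlScheduler:
    """Sketch of dQ_D-driven re-download prioritization."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha                  # EWMA smoothing factor (assumed)
        self.forecast = defaultdict(float)  # forecast quality gain per document

    def record_gain(self, doc_id, measured_gain):
        """Fold a freshly measured dQ_D into that document's forecast."""
        old = self.forecast[doc_id]
        self.forecast[doc_id] = self.alpha * measured_gain + (1 - self.alpha) * old

    def next_batch(self, budget):
        """Pick the `budget` documents with the highest forecast quality gain."""
        return heapq.nlargest(budget, self.forecast, key=self.forecast.get)

# Usage sketch:
# scheduler = CrawlScheduler()
# for doc in scheduler.next_batch(budget=100):
#     gain = measure_delta_q(doc)      # hypothetical measurement hook
#     scheduler.record_gain(doc, gain)
```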
Difficulty 2: Metric Expensive to Compute
• One problem: measurements of other documents are required.
• Example: the "live" copy of D becomes less relevant to query q than before
  – Now D is ranked too high
  – Some users visit D in lieu of Y, which is more relevant
  – Result: less-than-ideal quality
  – Results for q: actual ranking 1. X, 2. D, 3. Y, 4. Z; ideal ranking 1. X, 2. Y, 3. Z, 4. D
• Upon re-downloading D, measuring the quality gain requires knowing the relevance of Y and Z
• Solution: estimate!
  – Use approximate relevance→rank mapping functions, fit in advance for each query

Estimation Procedure (detail)
• Focus on query q (later we'll see how to sum across all affected queries)
• Q_{D,q} = Σ_d [ ViewProb(d, q) × Relevance(d, q) ] = VP(D, q) × Rel(D, q) + Σ_{d≠D} [ VP(d, q) × Rel(d, q) ]
• Let F_q(rel) be the relevance→rank mapping for q
  – We use a piecewise-linear function in log-log space
  – Let r1 = D's old rank (r1 = F_q(Rel(D_old, q))), r2 = its new rank
• Change in the Σ_{d≠D} term ≈ Σ_{r = r1+1 … r2} (VP(r–1) – VP(r)) × F_q^–1(r)
  – Use an integral approximation of the summation

Where We Stand (detail)
• Context: ΔQ_D = Σ_q (freq_q × ΔQ_{D,q})
• ΔQ_{D,q} = ΔVP(D, q) × Rel(D, q) + Σ_{d≠D} [ ΔVP(d, q) × Rel(d, q) ]
  – first term: ΔVP(D, q) ≈ VP( F_q(Rel(D, q)) ) – VP( F_q(Rel(D_old, q)) )
  – second term: ≈ f(Rel(D, q), Rel(D_old, q))
• So ΔQ_{D,q} ≈ g(Rel(D, q), Rel(D_old, q))

Difficulty 2, continued
• Additional problem: must measure the effect of the shift in rank across all queries.
• Solution: couple measurements with index-updating operations
• Sketch:
  – Basic index unit: posting. Conceptually: term ID, document ID, scoring factors
  – Each time a posting is inserted/deleted/updated, compute the old & new relevance contribution from the term/document pair*
  – Transform using the estimation procedure, and accumulate across the postings touched to get ΔQ_D
  * assumes the scoring function treats term/document pairs independently

Background: Text Indexes (detail)
• Basic index unit: posting
  – One posting for each term/document pair
  – Contains the information needed for the scoring function (number of occurrences, font size, etc.)
• Example dictionary and postings:

  Term  | # docs | Total freq | Postings (Doc #, Freq)
  aid   | 1      | 1          | (58, 1)
  all   | 2      | 2          | (37, 1), (62, 1)
  cold  | 1      | 1          | (15, 1)
  duck  | 1      | 2          | (41, 2)

Pre-Processing: Approximate the Workload (detail)
• Break multi-term queries into a set of single-term queries
  – Now, term = query
  – The index has one posting for each query/document pair (same dictionary and postings as above)

Taking Measurements During Index Maintenance (detail)
• While updating the index:
  – Initialize a bank of ΔQ_D accumulators, one per document (actually materialized on demand using a hash table)
  – Each time a posting is inserted/deleted/updated:
    • Compute the new & old relevance contributions for the query/document pair: Rel(D, q), Rel(D_old, q)
    • Compute ΔQ_{D,q} using the estimation procedure and add it to the accumulator: ΔQ_D += freq_q × g(Rel(D, q), Rel(D_old, q))
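Here is a sketch of the g(·, ·) estimate described on the "Estimation Procedure" and "Where We Stand" slides. It assumes the per-query relevance↔rank mappings (F_q and F_q^–1) and the view-probability curve have been fit offline and are passed in as callables; it sums the displaced-documents term directly rather than using the talk's integral approximation, and all function and variable names are illustrative.

```python
from collections import defaultdict

def other_docs_term(r_old, r_new, vp, rel_at_rank):
    """Estimated quality change for the documents displaced when D moves
    from rank r_old to rank r_new under query q (ranks are 1-based).
    rel_at_rank approximates F_q^-1: rank -> relevance."""
    if r_new > r_old:
        # D moved down: docs formerly at ranks r_old+1 .. r_new each move up one slot.
        return sum((vp(r - 1) - vp(r)) * rel_at_rank(r)
                   for r in range(r_old + 1, r_new + 1))
    # D moved up (or stayed): docs formerly at ranks r_new .. r_old-1 each move down one slot.
    return -sum((vp(r) - vp(r + 1)) * rel_at_rank(r)
                for r in range(r_new, r_old))

def delta_q_for_query(rel_new, rel_old, freq_q, rank_of_rel, rel_at_rank, vp):
    """freq_q * g(Rel(D,q), Rel(D_old,q)): estimated change in quality for
    query q, using only the fitted relevance<->rank mappings for q."""
    r_old, r_new = rank_of_rel(rel_old), rank_of_rel(rel_new)
    own_term = (vp(r_new) - vp(r_old)) * rel_new   # D's own contribution change
    return freq_q * (own_term + other_docs_term(r_old, r_new, vp, rel_at_rank))

# Accumulation during index maintenance (cf. the last slide above):
delta_q = defaultdict(float)   # bank of dQ_D accumulators, materialized on demand
# for each posting update touching (query term q, document D):
#     delta_q[D] += delta_q_for_query(rel_new, rel_old, freq[q],
#                                     F_q, F_q_inv, view_prob)
```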
Measurement Overhead (detail)
• Implemented in Lucene
• Caveat: does not handle factors that do not depend on a single term/document pair, e.g., term proximity and anchortext inclusion

Summary of Approach
• User-centric metric of search repository quality
• (Re)downloading a document improves quality
• Prioritize downloading by expected quality gain
• Metric adaptations to enable a feasible, efficient implementation

Next: Empirical Results
• Introduction: monitoring distributed sources
• User-centric web crawling
  – Model + approach
  – Empirical results
  – Related & future work

Overall Effectiveness
• Staleness = fraction of out-of-date documents* [Cho et al. 2000]
• Embarrassment = probability that a user visits an irrelevant result* [Wolf et al. 2002]
  * used "shingling" to filter out "trivial" changes
• Scoring function: PageRank (similar results for TF.IDF)
• [Chart: user-centric quality (fraction of ideal) vs. resource requirement for the Min. Staleness, Min. Embarrassment, and User-Centric policies]

Reasons for Improvement (boston.com example)
• Does not rely on the size of a text change to estimate importance
• This page was tagged as important by the staleness- and embarrassment-based techniques, although it did not match many queries in the workload

Reasons for Improvement (washingtonpost.com example)
• Accounts for "false negatives"
• Does not always ignore frequently-updated pages: user-centric crawling repeatedly re-downloads this page

Related Work (1/2)
• General-purpose web crawling: [Cho, Garcia-Molina, SIGMOD'00], [Edwards et al., WWW'01]
  – Maximize average freshness or age
  – Balance new downloads vs. re-downloading old documents
• Focused / topic-specific crawling: [Chakrabarti, many others]
  – Select the subset of documents that match user interests
• Our work: given a set of documents, decide when to (re)download each one

Most Closely Related Work
• [Wolf et al., WWW'02]:
  – Maximize weighted average freshness
  – Document weight = probability of "embarrassment" if not fresh
• User-centric crawling:
  – Measures the interplay between the update and query workloads: when document X is updated, which queries are affected, and by how much?
  – The metric penalizes false negatives: a document ranked #1000 for a popular query that should be ranked #2 causes small embarrassment but a big loss in quality

Future Work: Detecting Change-Rate Changes
• Current techniques schedule monitoring to exploit existing change-rate estimates (e.g., ΔQ_D)
• No provision to explore change rates explicitly
• Bad case: change rate = 0, so the page is never monitored
  – We won't notice a future increase in its change rate
• Explore/exploit tradeoff
  – Ongoing work on a bandit-problem formulation

Summary
• Approach:
  – User-centric metric of search engine quality
  – Schedule downloading to maximize quality
• Empirical results:
  – High quality with few downloads
  – Good at picking the "right" documents to re-download
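The explore/exploit tradeoff raised on the "Future Work" slide above can be illustrated with a simple epsilon-greedy variant of the scheduler sketched earlier: spend most of the crawl budget on the highest forecast ΔQ_D, but reserve a small fraction for documents whose forecast is zero so they are still revisited occasionally. The talk mentions a bandit-problem formulation but does not specify one; this sketch is an assumed stand-in, and all names here are illustrative.

```python
import random

def choose_batch(forecast, budget, epsilon=0.1, rng=random):
    """Mostly exploit the forecast quality gains, but reserve an epsilon
    fraction of the budget for documents with zero forecast gain, so a page
    whose observed change rate is 0 still gets re-downloaded now and then.
    `forecast` maps doc id -> forecast dQ_D (seed never-seen docs with 0)."""
    n_explore = int(epsilon * budget)
    exploit = sorted(forecast, key=forecast.get, reverse=True)[:budget - n_explore]
    cold = [d for d in forecast if forecast[d] == 0 and d not in exploit]
    explore = rng.sample(cold, min(n_explore, len(cold)))
    return exploit + explore
```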