
Improved Techniques for Result Caching
in Web Search Engines
Qingqing Gan
Torsten Suel
CSE Department
Polytechnic Institute of NYU
Content of this Talk
Result caching in web search engines
(1) The case of weighted caching: some queries
more expensive to recompute than others
- investigate algorithms for this case
- hybrid algorithms, and impact of power laws
(2) Feature-based approach to caching
- improvements for result and index caching
Caching
• Query processing is a major performance bottleneck
• Common performance optimizations: caching,
index compression, index pruning and early
termination, parallel processing
• Multi-level caching: result caching vs. index caching
• Mostly focus on result caching (but also index)
Query Processing
• Inverted index can efficiently identify pages that contain a
particular word or set of words
• Main challenge for query processing is the significant size of
the index data for a query
• Need to optimize to scale with users and data
• Caching is one such optimization
• Result caching: has this query occurred before?
• List caching: has the index data for this term been accessed
before?
Related Work
• Markatos (WCW 2000) studies query log distributions and
compares several basic caching algorithms
• Number of subsequent papers on result caching:
  • Baeza-Yates et al. (SPIRE 2003, 2007; SIGIR 2003)
  • Fagni et al. (TOIS 2006)
  • Lempel/Moran (WWW 2003)
  • Saraiva et al. (SIGIR 2001)
  • Xie/O'Hallaron (Infocom 2002)
• Fagni et al. propose hybrid methods that combine a dynamic
cache with a more static cache
• Baeza-Yates et al. (SPIRE 2007) use some features for a cache
admission policy
Basics
• Sequence of queries q_1 to q_n
• LRU: evict the least recently used entry
• LFU: evict the least frequently used entry
• Can be implemented using basic data structures: each entry gets a
score, defined as the time of the last occurrence of the same query in
LRU, or the frequency of the query so far in LFU; evict the query with
the smallest score
• Recency (LRU) vs. frequency (LFU)
• Various hybrids
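The score-based eviction just described can be sketched as follows. This is a minimal illustrative implementation, not the authors' code; the class name and interface are made up for the example:

```python
import itertools

class ScoreCache:
    """Score-based eviction sketch (illustrative only).

    policy="lru": score = time of last access (evict the oldest).
    policy="lfu": score = frequency so far (evict the least frequent).
    In both cases we evict the entry with the smallest score.
    """
    def __init__(self, capacity, policy="lru"):
        self.capacity = capacity
        self.policy = policy
        self.clock = itertools.count()   # logical time for LRU scores
        self.entries = {}                # query -> (score, frequency)

    def access(self, query):
        hit = query in self.entries
        freq = self.entries[query][1] + 1 if hit else 1
        score = next(self.clock) if self.policy == "lru" else freq
        if not hit and len(self.entries) >= self.capacity:
            # Evict the entry with the smallest score.
            victim = min(self.entries, key=lambda q: self.entries[q][0])
            del self.entries[victim]
        self.entries[query] = (score, freq)
        return hit
```

A linear scan finds the victim here for brevity; a heap or ordered structure would give logarithmic-time eviction.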
SDC (Static and Dynamic Caching)
Fagni et al. (TOIS 2006)
[Diagram: a static cache portion (filled by frequency, LFU-style) next to
a dynamic LRU portion; alpha = 0.7 of the capacity is static]
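A minimal sketch of the SDC idea, assuming the static part is pre-filled with the historically most frequent queries and the dynamic part is plain LRU (class name and interface are hypothetical; the authors' implementation may differ in detail):

```python
class SDC:
    """Static + Dynamic Caching sketch (after Fagni et al., TOIS 2006).

    A fraction alpha of the capacity is a static cache, filled offline
    with the most frequent past queries and never evicted from; the rest
    is a dynamic cache managed with LRU.
    """
    def __init__(self, capacity, frequent_queries, alpha=0.7):
        static_size = int(alpha * capacity)
        self.static = set(frequent_queries[:static_size])
        self.dynamic_cap = capacity - static_size
        self.dynamic = []                # LRU order, most recent last

    def access(self, query):
        if query in self.static:
            return True
        if query in self.dynamic:
            self.dynamic.remove(query)   # refresh recency
            self.dynamic.append(query)
            return True
        if self.dynamic and len(self.dynamic) >= self.dynamic_cap:
            self.dynamic.pop(0)          # evict least recently used
        self.dynamic.append(query)
        return False
```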
Characteristics of Queries
• Query frequencies follow Zipf distribution
• While a few queries are quite frequent, most queries occur
only once or a few times
Characteristics of Queries
• Query traces exhibit some amount of burstiness, i.e.,
occurrences of queries are often clustered
• A significant part of this burstiness is due to the same user
reissuing a query to the engine.
Contributions
• Study result caching as a weighted caching problem
- Hit ratio
- Cost saving
• Hybrid algorithms for weighted caching
• Caching and power laws
• Feature-based cache eviction policies
Weighted Caching
• Assume all cache entries have the same size
• Standard caching: all entries also have the same cost
• Weighted caching: different costs
• Result caching: some queries are more expensive to recompute
than others
• In fact, costs are highly skewed
• Should keep expensive results longer
• Note: throughput vs. latency
Weighted Caching Algorithms
• LFU_w: evict the entry with the smallest value of
past frequency * cost (weighted version of LFU)
• Landlord (Young 1998; Cao/Irani 1998): weighted version of LRU
  • On insertion, give the entry a deadline equal to its cost
  • Evict the entry with the smallest deadline, and deduct this
deadline from all other deadlines in the cache
• Clairvoyant: no polynomial-time optimal offline algorithm is
known, so we cook up an estimate
• Assume the system returns the cost of each query it computes
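The Landlord rule above can be sketched compactly. Storing deadlines against a running "rent" counter makes the deduct-from-everyone step O(1); the class name and interface are made up for illustration:

```python
class Landlord:
    """Landlord eviction sketch (illustrative only).

    Each entry's deadline is stored as rent + cost, where rent is the
    total deadline mass deducted so far; evicting the entry with the
    smallest absolute deadline and raising rent to that value is
    equivalent to deducting its deadline from every other entry.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.rent = 0.0        # total deadline mass deducted so far
        self.deadlines = {}    # query -> absolute deadline

    def access(self, query, cost):
        hit = query in self.deadlines
        if not hit and len(self.deadlines) >= self.capacity:
            victim = min(self.deadlines, key=self.deadlines.get)
            # Deduct the victim's deadline from all remaining entries.
            self.rent = self.deadlines.pop(victim)
        # On insertion (or hit, in this variant) the deadline is the cost.
        self.deadlines[query] = self.rent + cost
        return hit
```

With uniform costs this degenerates to LRU, which is why Landlord is the natural weighted counterpart of LRU.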
Dataset
• 2006 AOL query log with 36 million queries
• Queries consisting only of stop words are removed
• Requests for further result pages are removed
Hit Ratio of Basic Algorithms
Cost Reduction
New Hybrid Algorithms
• SDC
• lru_lfu
• landlord_lfu_w
Weighted Caching and Power Laws
• Problem with weighted caching under high skew
• Suppose q_1 has occurred once and has cost 10,
and q_2 has occurred 10 times and has cost 1
• LFU_w gives both the same priority. Is that right?
• Lottery analogy:
  • Multiple rounds, one winner per round
  • Some people buy more tickets than others
  • But each person buys the same number each week
  • Given past history, guess future winners
  • Suppose ticket sales are Zipfian
Weighted Caching and Power Laws
• Compare: smoothing techniques in language models
• Three solutions:
• Good-Turing estimator
• Estimator derived from power law
• Pragmatic: fit correction factors from real data
• Last solution subsumes others
Weighted Zipfian Caching
E.g., in LFU_w, priority score = cost * frequency * g(frequency)

  Frequency | g()
  ----------+------
      1     | 0.05
      2     | 0.25
      3     | 0.35
      4     | 0.75
    >=5     | 1.0
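Applying these correction factors to the LFU_w priority looks like this (illustrative Python; the factor values come from the slide, the function name is made up):

```python
# Correction factors g() keyed by observed frequency, from the slide.
G = {1: 0.05, 2: 0.25, 3: 0.35, 4: 0.75}

def corrected_priority(cost, frequency):
    """Priority score = cost * frequency * g(frequency).

    g() dampens entries seen only once or a few times: under a Zipfian
    query distribution, a single occurrence of an expensive query
    overstates the expected future cost savings from caching it.
    """
    g = G.get(frequency, 1.0)    # frequency >= 5 gets g = 1.0
    return cost * frequency * g
```

On the earlier example, q_1 (cost 10, frequency 1) now scores 10 * 1 * 0.05 = 0.5, while q_2 (cost 1, frequency 10) scores 1 * 10 * 1.0 = 10, so q_2 is correctly preferred.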
Hybrid Algorithms After Adding Correction
Feature-Based Caching
• Most standard algorithms view input as sequence of object IDs
• Hides many application details!
• E.g., query length, frequency of query terms in query logs or in
the collection, click behavior, navigational vs. informational query
• But these could be very useful for caching!
• So, can/should we use more features in caching?
• … and, should we keep using “explicit” algorithms, or rely on
machine learning?
• Compare: ranking functions in IR
• Previous work: Baeza-Yates et al. (SPIRE 2007)
Features
• F1: steps to last occurrence of this query;
• F2: steps between last two occurrences of this query, if a query
occurs at least twice;
• F3: query frequency so far;
• F4: query length;
• F5: length of shortest inverted index list of all query terms in the
query;
• F6: the frequency of the rarest query term;
• F7: the number of users who issue this query;
• F8: among F7, the gap between the last two queries issued by
the most recently active user;
• F9: average number of clicks per query;
• F10: the query frequency of the rarest pair of terms in the query.
Caching Algorithm
• Trivial machine learning approach (i.e., counting)
• Split each feature into a few bins, thus placing each
cache entry into one bin
• For each bin, estimate likelihood of reoccurrence
using past queries
• During caching (online), can efficiently move entries
between bins until eviction
• O(log c) cost per element (c is the cache size)
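The counting step can be sketched as below: each bin accumulates how many of its entries were seen again, and that ratio serves as the re-occurrence probability. This is a minimal illustration of the idea; the class name, interface, and any particular feature binning are assumptions, not the authors' code:

```python
from collections import defaultdict

class BinEstimator:
    """Sketch of the 'trivial machine learning' (counting) approach.

    Each cache entry is mapped to a bin by its (discretized) feature
    values; the bin's re-occurrence probability is estimated from past
    queries as (entries that occurred again) / (entries observed).
    """
    def __init__(self):
        self.seen = defaultdict(int)        # bin -> entries observed
        self.reoccurred = defaultdict(int)  # bin -> entries seen again

    def record(self, bin_id, did_reoccur):
        self.seen[bin_id] += 1
        if did_reoccur:
            self.reoccurred[bin_id] += 1

    def probability(self, bin_id):
        if self.seen[bin_id] == 0:
            return 0.0
        return self.reoccurred[bin_id] / self.seen[bin_id]
```

During online caching, an entry's priority would then be this probability (times the query cost, in the weighted setting), and entries move between bins as their features change.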
Experimental Results – Hit Ratio
Experimental Results – Hit Ratio
(cont.)
Experimental Results – Cost Savings
Priority score = probability score * cost of this query
Experimental Results – List Caching
Discussion
• A bunch of results on caching, in two parts …
• Note: the feature-based approach beats the methods in the first part!
• Open: the cache size vs. cache freshness trade-off
• Other applications of the feature-based approach
Questions?