
Improved Techniques for Result Caching
in Web Search Engines
Qingqing Gan
Torsten Suel
Presenters: Arghyadip ● Konark
Summary:
Result caching in web search engines
(1) Query result caching in search engines to improve query processing performance.
(2) Increasing the effective throughput of the entire search engine system.
(3) Discussion of various weighted, unweighted, and hybrid query result caching techniques.
(4) Performance evaluation.
Query Processing

The main challenge in query processing is the significant size of the index data accessed per query

Need to optimize to scale with users and data

Caching is one such optimization


Result caching: has this query occurred before?
List caching: has the index data for this term been accessed before?
(Diagram: query coordinator architecture with result cache and list caches)
Related Work
• Number of subsequent papers on result caching (cache hit ratio only):
  • Baeza-Yates et al. (SPIRE 2003, 2007, SIGIR 2003)
  • Fagni et al. (TOIS 2006)
  • Lempel/Moran (WWW 2003)
  • Saraiva et al. (SIGIR 2001)
  • Xie/O'Hallaron (Infocom 2002)
• Fagni et al. propose hybrid methods that combine a dynamic cache with a more static cache
• Baeza-Yates et al. (SPIRE 2007) use some features for a cache admission policy
Caching Basics
• LRU: least recently used
• LFU: least frequently used
• Can be implemented using basic data structures:
  score defined as the time of the last occurrence of the query in LRU, or the frequency of the query in LFU; evict the query with the smallest score (see the sketch below)
• Recency (LRU) vs. frequency (LFU)
• Various hybrids: combine two or more of these
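A minimal sketch (ours, not from the paper) of this "evict the entry with the smallest score" view of LRU and LFU, written in Python; ScoredCache, get(), and the recompute callback are names made up for illustration:

import time

class ScoredCache:
    # Toy result cache; policy is either "lru" or "lfu".
    def __init__(self, capacity, policy="lru"):
        self.capacity = capacity
        self.policy = policy
        self.results = {}   # query -> cached result
        self.score = {}     # query -> eviction score

    def get(self, query, recompute):
        if query in self.results:
            self._touch(query)                  # hit: refresh the score
            return self.results[query]
        result = recompute(query)               # miss: recompute from the index
        if len(self.results) >= self.capacity:
            victim = min(self.score, key=self.score.get)   # smallest score
            del self.results[victim], self.score[victim]
        self.results[query] = result
        self.score[query] = 0
        self._touch(query)
        return result

    def _touch(self, query):
        if self.policy == "lru":
            self.score[query] = time.monotonic()   # time of last occurrence
        else:
            self.score[query] += 1                 # frequency so far
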
SDC (Static and Dynamic Caching)
Fagni et al. (TOIS 2006)
• A static part holding the results of the most frequent past queries (filled LFU-style from a past log), plus a dynamic part managed by LRU
• The parameter alpha (here 0.7) sets the fraction of the cache devoted to the static part (a rough sketch follows below)
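A rough Python sketch of SDC as described above (our reading of Fagni et al., not their code); training_queries, alpha, and recompute are assumed inputs, and filling the static part lazily is a simplification:

from collections import Counter, OrderedDict

class SDCCache:
    def __init__(self, capacity, training_queries, alpha=0.7):
        static_size = int(alpha * capacity)
        top = Counter(training_queries).most_common(static_size)
        self.static = {q: None for q, _ in top}   # most frequent past queries, never evicted
        self.dynamic = OrderedDict()              # LRU-managed dynamic part
        self.dynamic_capacity = capacity - static_size

    def get(self, query, recompute):
        if query in self.static:                  # static hit
            if self.static[query] is None:
                self.static[query] = recompute(query)
            return self.static[query]
        if query in self.dynamic:                 # dynamic (LRU) hit
            self.dynamic.move_to_end(query)
            return self.dynamic[query]
        result = recompute(query)                 # miss
        if self.dynamic and len(self.dynamic) >= self.dynamic_capacity:
            self.dynamic.popitem(last=False)      # evict the least recently used entry
        self.dynamic[query] = result
        return result
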
Characteristics of Queries (AOL Query Log)
• Query frequencies follow Zipf distribution
• While a few queries are quite frequent, most queries occur
only once or a few times
(Plot of query frequencies on a double logarithmic scale)
Characteristics of Queries
• Query traces exhibit some amount of burstiness, i.e., repetitions of the same query tend to occur close together in time
• A significant part of this burstiness is due to the same user reissuing a query to the engine
• With an assumed query arrival rate of 132 queries per minute, most queries repeat within a few minutes to an hour
Only Cache Hit?
• On a miss, the query result must be recomputed, and recomputation costs differ across queries
• Frequent admission and eviction occurs
Approach:
• Study result caching as a weighted caching problem
  - Hit ratio
  - Cost saving
• Hybrid algorithms for weighted caching
Weighted Caching
• Assume all cache entries have same size.
• Standard caching: all entries also same cost
• Weighted caching: different costs.
• Result caching: some queries more expensive to
recompute than others
• In fact, costs highly skewed
• Should keep expensive results longer
Weighted Caching Algorithms
• LFU_w: evict the entry with the smallest value of past frequency * cost (weighted version of LFU)
• Landlord (Young, Cao/Irani 1998), a weighted version of LRU (see the sketch below):
  • On insertion, give the entry a deadline equal to its cost
  • Evict the entry with the smallest deadline, and deduct this deadline from all other deadlines in the cache
• SDC_w: combination of LFU_w and Landlord
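A minimal Landlord sketch following the two rules above; the refresh-on-hit step (resetting the deadline to the full cost) and the cost_of callback are our assumptions, not spelled out on this slide:

class LandlordCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.results = {}    # query -> result
        self.deadline = {}   # query -> remaining deadline
        self.cost = {}       # query -> cost to recompute the result

    def get(self, query, recompute, cost_of):
        if query in self.results:
            self.deadline[query] = self.cost[query]    # assumed: refresh deadline on a hit
            return self.results[query]
        result = recompute(query)
        if len(self.results) >= self.capacity:
            victim = min(self.deadline, key=self.deadline.get)
            paid = self.deadline[victim]
            for q in self.deadline:                    # deduct the evicted deadline from all entries
                self.deadline[q] -= paid
            del self.results[victim], self.deadline[victim], self.cost[victim]
        self.results[query] = result
        self.cost[query] = cost_of(query)
        self.deadline[query] = self.cost[query]        # deadline = cost on insertion
        return result
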
Hit Ratio of Basic Algorithms
Cost Reduction
New Hybrid Algorithms
• SDC
• lru_lfu
• landlord_lfu_w
Weighted Caching and Power Laws
• Problem: weighted caching under highly skewed costs
• Suppose q_1 has occurred once and has cost 10, and q_2 has occurred 10 times and has cost 1
• LFU_w gives both the same priority (10 * 1 = 1 * 10); is that right?
• Lottery:
  • Multiple rounds, one winner per round
  • Some people buy more tickets than others
  • But each person buys the same number each week
  • Given past history, guess future winners
  • Suppose ticket sales are Zipfian
Weighted Caching and Power Laws
• Compare: smoothing techniques in language models
• Three solutions:
• Good-Turing estimator
• Estimator derived from power law
• Pragmatic: fit correction factors from real data
• Last solution subsumes others
Weighted Zipfian Caching
E.g., in LFU_w, priority score = cost * frequency * g(frequency), where g() is a correction factor fitted from real query log data (illustrated below):

Frequency | g()
1         | 0.05
2         | 0.25
3         | 0.35
4         | 0.75
>= 5      | 1.0
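To make the correction concrete, a small illustration in Python (the numbers reuse the q_1/q_2 example from the previous slide and the g() table above; everything else is hypothetical):

# correction factors g(frequency) from the table; >= 5 maps to 1.0
G = {1: 0.05, 2: 0.25, 3: 0.35, 4: 0.75}

def priority(cost, frequency):
    g = 1.0 if frequency >= 5 else G[frequency]
    return cost * frequency * g

# q_1: occurred once,     cost 10  ->  10 * 1 * 0.05 = 0.5
# q_2: occurred 10 times, cost 1   ->   1 * 10 * 1.0 = 10.0
# With the correction, the frequently repeated q_2 now outranks q_1.
print(priority(10, 1), priority(1, 10))
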
Hybrid Algorithms After Adding Correction
Dataset and Evaluations
• 2006 AOL query log with 36 million queries
• 4 GB of data collected as HTML pages from Quora
• The Lemur search engine has no built-in support for result caching
• Plan to develop weighted LRU, LFU, and SDC result caching on top of Lemur
• Compare performance with different weights assigned to hit ratio and load across all of the above caching variants
• Evaluate which weight metric works best
Evaluation Methodology
Questions?