March 15, 2004
Caching in Web Memory Hierarchies
Dimitrios Katsaros
Yannis Manolopoulos
Data Engineering Lab
Department of Informatics
Aristotle Univ. of Thessaloniki, Greece
http://delab.csd.auth.gr
ACM Symposium on Applied Computing (ACM SAC) 2004
Web performance: the ubiquitous content cache
[Diagram: content caches across the Web delivery path: proxy caches near the clients, a reverse-proxy cache in front of the origin server, and cooperating/hierarchical cache configurations in between]
Web caching benefits
• Caching is important because, by reducing the number of requests:
  – network bandwidth consumption is reduced
  – user-perceived delay is reduced (popular objects are moved closer to clients)
  – the load on the origin servers is reduced (servers handle fewer requests)
Content caching is still strategic
Is the optimization or fine-tuning of cache replacement a
“moot point” due to the ever-decreasing prices of memory?
Such a conclusion is misguided for several reasons:
• First, studies have shown that the cache HR and BHR grow in a log-like
  fashion as a function of cache size [3]. Thus, a better algorithm that increases
  HR by only a few percentage points is equivalent to a several-fold
  increase in cache size.
• Second, the growth rate of Web content is much higher than the rate at
  which memory sizes for Web caches are likely to grow.
• Finally, even a slight improvement in cache performance may have an
  appreciable effect on network traffic, especially when such gains are
  compounded through a hierarchy of caches.
Web cache performance metrics
Replacement policies aim at improving cache effectiveness by
optimising two performance measures:
• the hit ratio: HR = (Σi hi) / (Σi ri)
• the cost savings ratio: CSR = (Σi ci·hi) / (Σi ci·ri)
where
• hi is the number of references to object i satisfied by the cache,
• ri is the total number of references to object i, and
• ci is the cost of fetching object i into the cache.
The cost can be defined as:
• the object size si; then CSR coincides with BHR (byte hit ratio)
• the downloading latency; then CSR coincides with DSR (delay savings ratio)
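To make the definitions concrete, here is a minimal Python sketch (variable names are ours) that computes HR and CSR from per-object counters; choosing ci = si yields the BHR, while choosing ci = downloading latency yields the DSR.

# Minimal sketch: HR and CSR from per-object counters (illustrative only).
# hits[i] = h_i (references to object i satisfied by the cache)
# refs[i] = r_i (total references to object i)
# cost[i] = c_i (fetch cost of object i: size -> BHR, latency -> DSR)

def hit_ratio(hits, refs):
    # HR = sum_i h_i / sum_i r_i
    return sum(hits.values()) / sum(refs.values())

def cost_savings_ratio(hits, refs, cost):
    # CSR = sum_i c_i*h_i / sum_i c_i*r_i
    saved = sum(cost[i] * hits[i] for i in refs)
    total = sum(cost[i] * refs[i] for i in refs)
    return saved / total

# Example: a small popular page and a large, rarely hit object.
hits = {"index.html": 9, "big.iso": 1}
refs = {"index.html": 10, "big.iso": 2}
size = {"index.html": 4_000, "big.iso": 700_000_000}
print(hit_ratio(hits, refs))                 # HR ~ 0.83
print(cost_savings_ratio(hits, refs, size))  # BHR, dominated by the large object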
Challenges for a caching strategy
Several factors distinguish Web caching from caching
in traditional computer architectures
(a) the heterogeneity in objects' sizes,
(b) the heterogeneity in objects' fetching costs,
(c) the depth of the Web caching hierarchy, and
(d) the access patterns, which are not generated by a few
    programmed processes, but mainly originate from large
    human populations with diverse and varying interests
What has been done to address them? (1)
The majority of the replacement policies proposed so far fail to
achieve a balance between (or optimize both) HR and CSR:
• The recency-based policies favour the HR, e.g., the family of
  GreedyDualSize algorithms [3, 7] (see the sketch after this list)
• The frequency-based policies favour the CSR (BHR or DSR),
  e.g., LFUDA [5]
Exceptions are LUV [2] and GD* [7], which combine recency and frequency:
• The drawback of LUV is the existence of a manually tunable
  parameter λ, used to “select” the recency-based or frequency-based
  behaviour of the algorithm.
• GD* has a similar drawback, since it requires manual tuning of
  the parameter β.
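As a point of reference for the GreedyDualSize family mentioned in the first bullet, below is a minimal Python sketch of the basic GreedyDual-Size idea (priority H(p) = inflation value L plus cost/size, evict the smallest priority). It is an illustration, not the authors' code, and the heap bookkeeping is simplified.

import heapq

class GreedyDualSize:
    """Minimal GreedyDual-Size sketch: H(p) = L + cost(p)/size(p).
    The object with the smallest H is evicted and L is inflated to that
    value, so long-resident objects age. With cost = 1 this favours the
    hit ratio; other cost functions bias it toward cost savings."""

    def __init__(self, capacity):
        self.capacity, self.used, self.L = capacity, 0, 0.0
        self.H, self.size = {}, {}
        self.heap = []                      # (H value, object); may hold stale entries

    def access(self, obj, size, cost=1.0):
        if obj not in self.H:               # miss: make room (objects > capacity not handled)
            while self.used + size > self.capacity and self.heap:
                h, victim = heapq.heappop(self.heap)
                if victim in self.H and self.H[victim] == h:    # skip stale heap entries
                    self.L = h              # inflate L to the evicted priority
                    self.used -= self.size.pop(victim)
                    del self.H[victim]
            self.size[obj] = size
            self.used += size
        self.H[obj] = self.L + cost / size  # (re)compute the priority on every access
        heapq.heappush(self.heap, (self.H[obj], obj))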
What has been done to address them? (2)
Regarding the depth of the caching hierarchy:
Carey Williamson [15]
• demonstrated an alteration in the access pattern, which is
  characterized by weaker temporal locality
• proposed the use of different replacement policies (LRU,
  LFU, GD-Size) at different levels of the caching hierarchy
This solution, though, is not feasible and/or acceptable:
• the caches are administratively independent
• the adoption of a replacement policy (e.g., LFU) at any level
  of the hierarchy favours one performance metric (CSR)
  over the other (HR)
What has been done to address them? (3)
The origin of the request streams has received little attention
• It is (in combination with the caching hierarchy depth)
  responsible for the large number of one-timers, i.e., objects
  requested only once
• Only SLRU [1] deals explicitly with this factor:
  – it proposed the use of a small auxiliary cache to maintain
    metadata for past evicted objects
• This approach:
  – needs to heuristically determine the size of the auxiliary cache
  – precludes some objects from entering the cache;
    thus, it may result in slow adaptation of the cache to a
    changing request pattern
Why do we need a new caching policy?
• Need to optimize more than just one of the two performance
  metrics in a heterogeneous environment such as the Web.
  We would like a balance between HR and CSR
  (a balance between the average latency that the user sees and the
  traffic performance)
• Need to deal with the weak temporal locality in Web request streams
• Need to eliminate any “administratively” tunable parameters.
  The existence of parameters whose values are derived from statistical
  information extracted from Web traces (e.g., LNC-R-W3 [14] or LRV [12])
  is not desirable, due to the difficulty of tuning these parameters
• Our contribution: CRF, a new caching policy dealing
  with all the particularities of the Web environment
CRF’s design principles: BHR vs. DSR
• The delay savings ratio is strongly affected by
  transient network and Web server conditions
• Two more reasons bring about significant variation in
  the connection time for identical connections:
  – persistent HTTP connections, which avoid reconnection costs, and
  – connection caching [4], which reduces connection costs
• We therefore favour the size (BHR) instead of the
  latency (DSR) of fetching an object as the measure of cost
CRF’s design principles: One-timers
• We partition the cache space
  – Cache partitioning has been followed by prior algorithms, e.g.,
    FBR [13], but not for the purpose of isolating one-timers
  – Only Segmented LRU [8] adopted partitioning for isolating one-timers;
    experiments showed that (in the Web) it suffers from cache pollution
• The cache has two segments: the R-segment and the I-segment
  (see the structural sketch after this list)
  – The cache segments are allowed to grow and shrink dynamically,
    depending on the characteristics of the request stream
  – One-timers are accommodated in the R-segment. We do not further
    partition the I-segment, since that would make it very difficult to
    decide the segment from which the victim should be selected, and it
    would incur maintenance costs for moving objects from one segment
    to the other
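A minimal structural sketch (ours, in Python) of the layout just described: one-timers enter the R-segment and move to the I-segment on their second reference, and neither segment has a fixed size. Ranking and victim selection are sketched later, on the pseudocode slide.

class TwoSegmentCache:
    """Structural sketch of CRF's cache layout (illustrative, not the
    paper's code). The R-segment holds objects referenced exactly once;
    the I-segment holds re-referenced objects. Segment sizes are not
    fixed: each simply reflects what the request stream has put into it."""

    def __init__(self):
        self.r_segment = {}   # obj -> metadata, objects seen once so far
        self.i_segment = {}   # obj -> metadata, objects seen at least twice

    def on_reference(self, obj, metadata):
        if obj in self.r_segment:
            # second reference: the object is no longer a one-timer
            self.i_segment[obj] = self.r_segment.pop(obj)
        elif obj not in self.i_segment:
            # first reference: admit into the R-segment
            self.r_segment[obj] = metadata
        # ranking updates and victim selection are handled separately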
CRF’s design principles: Ranking (1)
Two decisions must be made, regarding:
• the ranking of objects within each segment, and
• the selection of replacement victims
These decisions must satisfy three constraints/targets:
(a) balance between hit and byte hit ratio,
(b) protect the cache from one-timers, but without preventing
    the cache from adapting to a changing access pattern, and
(c) because of the weak temporal locality, exploit frequency-based
    replacement criteria
CRF’s design principles: Ranking (2)
• Aim for the R-segment (one-timers):
  – accommodate as many objects as possible
  – exploit any short-term temporal locality of the request stream
• The ranking function for the R-segment:
  the ratio of the object’s entry time to its size
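A one-line Python sketch of this ranking function; treating the object with the smallest value as the R-victim is our assumption, chosen because old entry times and large sizes both push an object toward eviction, matching the two aims above.

def r_rank(entry_time, size):
    # R-segment ranking: the object's entry time divided by its size.
    # Old entries and large objects get small ranks; under the assumption
    # used here, the smallest-ranked object is the candidate R-victim.
    return entry_time / size

# e.g. a large one-timer cached long ago ranks lowest and is evicted first
r_segment = {"ad.gif": (100.0, 2_000), "trailer.mpg": (90.0, 50_000_000)}
r_victim = min(r_segment, key=lambda obj: r_rank(*r_segment[obj]))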
CRF’s design principles: Ranking (3)
• Aim for the I-segment (the heart of the cache):
  – provide a balance between HR and BHR
  – deal with the weak temporal locality
• The ranking function for the I-segment:
  the product of the last inter-reference time of an object times its recency
  – the inter-reference time stands for the steady-state popularity
    (frequency of reference) of an object
  – the recency stands for a transient preference for an object
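Similarly, a short Python sketch of the I-segment ranking; taking the object with the largest product as the I-victim is our assumption, since a large product means the object is both infrequently and not recently referenced.

def i_rank(now, last_ref, penultimate_ref):
    # I-segment ranking: last inter-reference time times recency.
    inter_reference = last_ref - penultimate_ref   # steady-state popularity
    recency = now - last_ref                       # transient preference
    return inter_reference * recency

# i_segment: obj -> (last_ref, penultimate_ref)
# I-victim = object with the largest rank (our assumption)
now = 100.0
i_segment = {"news.html": (95.0, 90.0), "logo.png": (99.0, 98.5)}
i_victim = max(i_segment, key=lambda obj: i_rank(now, *i_segment[obj]))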
CRF’s design principles: Replacement victim (1)
• R-victim : the candidate victim from the R-segment
• I-victim : the candidate victim from the I-segment
• tc : the current time
• R1 : the reference time of the R-victim
• I1 : the time of the penultimate reference to the I-victim
• I2 : the time of the last reference to the I-victim
• δ1 (= tc - I2) : the reference recency of the I-victim
• δ2 (= tc - R1) : the reference recency of the R-victim
• δ3 (= I2 - I1) : the last inter-reference time of the I-victim
Estimate whether or not the I-victim has lost its popularity, and
also the potential of the R-victim to get a second reference
CRF’s design principles: Replacement victim (2)
[Decision table, based on δ1, δ2 and δ3, choosing between the R-victim and the I-victim; in most of the listed cases the R-victim is evicted. The exact conditions are not recoverable from the transcript.]
CRF’s pseudocode (1)
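The pseudocode itself was a figure and is not in the transcript. The Python sketch below only reconstructs the flow from the design slides above: admission into the R-segment, promotion to the I-segment on a second reference, the two ranking functions, and a victim choice based on δ1, δ2, δ3. The concrete decision rule used here (prefer the I-victim only when δ1 > δ3 and δ2 ≤ δ3) is our assumption, not the rule from the paper.

class CRFCache:
    """Reconstruction sketch of the CRF flow (illustrative). The victim
    rule in _evict() is an assumption: evict the I-victim only when it
    seems to have lost its popularity (d1 > d3) while the R-victim still
    looks likely to receive a second reference soon (d2 <= d3)."""

    def __init__(self, capacity):
        self.capacity, self.used = capacity, 0
        self.R = {}   # obj -> (entry_time, size), referenced once
        self.I = {}   # obj -> (penultimate_ref, last_ref, size), referenced >= twice

    def access(self, obj, size, now):
        if obj in self.I:                         # repeated reference in the I-segment
            _, last, _ = self.I[obj]
            self.I[obj] = (last, now, size)
        elif obj in self.R:                       # second reference: promote to the I-segment
            entry, _ = self.R.pop(obj)
            self.I[obj] = (entry, now, size)
        else:                                     # miss: make room, then admit to the R-segment
            while self.used + size > self.capacity and (self.R or self.I):
                self.used -= self._evict(now)
            if size <= self.capacity - self.used:
                self.R[obj] = (now, size)
                self.used += size

    def _evict(self, now):
        # R-victim: smallest entry_time / size; I-victim: largest
        # (last inter-reference time) * (recency), as on the ranking slides.
        r_victim = min(self.R, key=lambda o: self.R[o][0] / self.R[o][1]) if self.R else None
        i_victim = max(self.I, key=lambda o: (self.I[o][1] - self.I[o][0]) * (now - self.I[o][1])) if self.I else None
        if r_victim is None:
            choose_i = True
        elif i_victim is None:
            choose_i = False
        else:
            pen, last, _ = self.I[i_victim]
            d1, d3 = now - last, last - pen       # recency and last inter-reference time of the I-victim
            d2 = now - self.R[r_victim][0]        # recency of the R-victim
            choose_i = d1 > d3 and d2 <= d3       # assumed rule, not the paper's decision table
        if choose_i:
            return self.I.pop(i_victim)[2]        # free the I-victim's bytes
        return self.R.pop(r_victim)[1]            # free the R-victim's bytes

The assumed rule defaults to evicting from the R-segment, which keeps one-timers from polluting the I-segment; any other rule satisfying the constraints on the Ranking slides could be substituted.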
CRF’s performance evaluation
• Examined CRF against LRU, LFU, Size, LFUDA,
  GDS, SLRU, LUV, HLRU, LNCRW3
  – GDS is the representative of the family that includes GDS and GDSF
  – HLRU(6) is the representative of the HLRU family
  – LNCRW3 was implemented so as to optimise the BHR instead of the DSR
  – LUV tuning: we tried several values for the λ parameter and selected
    0.01, because it gave the best performance for small caches and in
    most of the other cases
• Generated synthetic Web request streams with
  the ProWGen tool [15] (see the sketch below)
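ProWGen itself is not reproduced here; as a rough, hedged stand-in, the Python sketch below generates a request stream with a tunable Zipf slope and one-timer fraction, the two workload characteristics varied in the following experiments. Function and parameter names are ours, not ProWGen's, and object sizes (which ProWGen also models) are omitted.

import random

def synthetic_stream(num_objects=10_000, num_requests=200_000,
                     zipf_slope=0.8, one_timer_fraction=0.6, seed=42):
    """Rough stand-in for a ProWGen-style workload (names are ours):
    popular objects are drawn from a Zipf-like law with the given slope,
    while a fraction of the object population is referenced exactly once."""
    rng = random.Random(seed)
    num_one_timers = int(num_objects * one_timer_fraction)
    popular = range(num_one_timers, num_objects)
    # Zipf-like weights: the i-th most popular object has weight ~ 1 / i^slope
    weights = [1.0 / (i + 1) ** zipf_slope for i in range(len(popular))]
    stream = [f"obj{o}" for o in rng.choices(popular, weights=weights,
                                             k=num_requests - num_one_timers)]
    stream += [f"obj{o}" for o in range(num_one_timers)]   # one-timers, once each
    rng.shuffle(stream)
    return stream

requests = synthetic_stream()   # feed to the cache sketches above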
CRF’s performance evaluation
Input parameters to the ProWGen tool
Sensitivity to one-timers (aggregate)
CRF’s gain-loss wrt one-timers
Sensitivity to Zipfian slope (aggregate)
CRF’s gain-loss wrt Zipfian slope
Conclusions
• We proposed a new replacement policy for Web caches, the CRF policy
• CRF was designed to address all the particularities of the Web environment
• The performance evaluation confirmed that CRF is a hybrid between
  recency-based and frequency-based policies
• CRF exhibits stable and overall improved performance
Thank you for
your attention
References (1)
1. C. Aggrawal, J. Wolf and P.S. Yu. Caching on the World Wide Web. IEEE Transactions on Knowledge and Data Engineering, 11(1):94–107, 1999.
2. H. Bahn, K. Koh, S.H. Noh and S.L. Min. Efficient replacement of nonuniform objects in Web caches. IEEE Computer, 35(6):65–73, 2002.
3. L. Breslau, P. Cao, L. Fan, G. Phillips and S. Shenker. Web caching and Zipf-like distributions: Evidence and implications. Proceedings IEEE INFOCOM Conf., pp. 126–134, 1999.
4. P. Cao and S. Irani. Cost-aware WWW proxy caching algorithms. Proceedings USITS Conf., pp. 193–206, 1997.
5. E. Cohen, H. Kaplan and U. Zwick. Connection caching: model and algorithms. Journal of Computer and System Sciences, 67(1):92–126, 2003.
6. J. Dilley and M. Arlitt. Improving proxy cache performance: analysis of three replacement policies. IEEE Internet Computing, 3(6):44–50, 1999.
7. S. Jiang and X. Zhang. LIRS: an efficient low inter-reference recency set replacement policy to improve buffer cache performance. Proceedings ACM SIGMETRICS Conf., pp. 31–42, 2002.
8. S. Jin and A. Bestavros. GreedyDual* Web caching algorithm: exploiting the two sources of temporal locality in Web request streams. Computer Communications, 24(2):174–183, 2001.
References (2)
9. R. Karedla, J.S. Love and B.G. Wherry. Caching strategies to improve disk system performance. IEEE Computer, 27(3):38–46, 1994.
10. N. Megiddo and D.S. Modha. ARC: a self-tuning, low overhead replacement cache. Proceedings USENIX FAST Conf., 2003.
11. A. Nanopoulos, D. Katsaros and Y. Manolopoulos. A data mining algorithm for generalized Web prefetching. IEEE Transactions on Knowledge and Data Engineering, 15(5):1155–1169, 2003.
12. L. Rizzo and L. Vicisano. Replacement policies for a proxy cache. IEEE/ACM Transactions on Networking, 8(2):158–170, 2000.
13. J. Shim, P. Scheuermann and R. Vingralek. Proxy cache algorithms: design, implementation and performance. IEEE Transactions on Knowledge and Data Engineering, 11(4):549–562, 1999.
14. A. Vakali. Proxy cache replacement algorithms: a history-based approach. World Wide Web Journal, 4(4):277–297, 2001.
15. C. Williamson. On filter effects in Web caching hierarchies. ACM Transactions on Internet Technology, 2(1):47–77, 2002.