
The Performance Impact of Kernel Prefetching on Buffer Cache Replacement Algorithms
(ACM SIGMETRICS '05) ACM International Conference on Measurement & Modeling of Computer Systems
Ali R. Butt, Chris Gniady, Y. Charlie Hu
Purdue University
Presented by Hsu Hao Chen
Outline
- Introduction
- Motivation
- Replacement Algorithms
  - OPT
  - LRU
  - LRU-2
  - 2Q
  - LIRS
  - LRFU
  - MQ
  - ARC
- Performance Evaluation
- Conclusion
Introduction
- Improving file system performance
  - Design effective block replacement algorithms for the buffer cache
- Almost all buffer cache replacement algorithms have been proposed and studied comparatively without taking into account the file system prefetching that exists in all modern operating systems
- The cache hit ratio is used as the sole performance metric
  - What about the actual number of disk I/O requests?
  - What about the actual running time of applications?
Introduction (Cont.)
- Kernel prefetching in Linux
  - Performed by various kernel components on the path from a file system operation to the disk
  - Beneficial for sequential accesses
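To make the mechanism concrete, here is a minimal sketch of the sequential-readahead idea (an illustration only: the class name, window sizes, and grow-on-sequential/reset-on-random policy are assumptions, not the actual Linux implementation):

```python
# Illustrative readahead sketch: a read that continues the previous one
# grows the prefetch window; a random read resets it to zero.

class ReadaheadState:
    def __init__(self, max_window=32):
        self.next_expected = None   # block expected if access is sequential
        self.window = 0             # current prefetch window, in blocks
        self.max_window = max_window

    def on_read(self, block):
        """Return the blocks to prefetch after a demand read of `block`."""
        if block == self.next_expected:
            # Sequential access detected: double the window, up to a cap.
            self.window = min(max(1, self.window * 2), self.max_window)
        else:
            # Random access: stop prefetching.
            self.window = 0
        self.next_expected = block + 1
        return list(range(block + 1, block + 1 + self.window))

ra = ReadaheadState()
for b in [0, 1, 2, 3, 100]:
    print(b, "->", ra.on_read(b))   # window grows on 1, 2, 3; resets at 100
```

Sequential runs thus trigger progressively larger prefetches, which is why prefetching interacts with the replacement policy: prefetched blocks occupy cache space before they are ever demanded.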
Motivation
- The goals of a buffer replacement algorithm
  - Minimize the number of disk I/Os
  - Reduce the running time of applications
- Example
  - Without prefetching, Belady's algorithm results in 16 misses while LRU results in 23 misses
  - With prefetching, Belady's algorithm is no longer optimal!
Replacement Algorithm
- OPT
  - Evicts the block that will be referenced farthest in the future
  - Often used for comparative studies
  - Prefetched blocks are assumed to be accessed most recently, so OPT can immediately determine whether a prefetch was right or wrong
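A compact sketch of Belady's OPT on a demand-only trace follows (the helper and trace are illustrative). The paper's point is that once kernel prefetching changes which blocks enter the cache and when, this offline policy is no longer guaranteed optimal:

```python
# Belady's OPT: evict the resident block whose next use is farthest away.

def opt_misses(trace, cache_size):
    cache, misses = set(), 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) >= cache_size:
            def next_use(b):
                # Index of b's next reference, or infinity if never reused.
                for j in range(i + 1, len(trace)):
                    if trace[j] == b:
                        return j
                return float("inf")
            cache.discard(max(cache, key=next_use))
        cache.add(block)
    return misses

print(opt_misses([1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5], cache_size=3))  # 7
```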
Replacement Algorithm
- LRU
  - Replaces the page that has not been accessed for the longest time
  - Prefetched blocks are inserted at the MRU position just like regular blocks
Replacement Algorithm
- LRU pathological case
  - The working set size is larger than the cache
  - The application has a looping access pattern
  - In this case, LRU replaces all blocks before they are used again (demonstrated in the sketch below)
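A minimal LRU sketch (an OrderedDict stands in for the recency list; prefetched blocks would enter through the same access() path), which also reproduces the looping pathology:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, size):
        self.size = size
        self.blocks = OrderedDict()  # LRU block first, MRU block last

    def access(self, block):
        """Touch a block (demand read or prefetch); return True on a hit."""
        hit = block in self.blocks
        if hit:
            self.blocks.move_to_end(block)       # promote to MRU position
        else:
            if len(self.blocks) >= self.size:
                self.blocks.popitem(last=False)  # evict the LRU block
            self.blocks[block] = None            # insert at MRU position
        return hit

# Looping pattern over 4 blocks with a 3-block cache: every block is
# evicted just before it is needed again, so there are zero hits.
cache = LRUCache(3)
hits = sum(cache.access(b) for _ in range(5) for b in range(4))
print("hits:", hits)  # 0
```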
Replacement Algorithm
- LRU-2
  - Tries to avoid the pathological cases of LRU
  - LRU-K replaces a block based on its Kth-to-last reference
    - The authors recommend K = 2
  - LRU-2 can quickly remove cold blocks from the cache
  - Each block access requires log(N) operations to manipulate a priority queue, where N is the number of blocks in the cache
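A simplified LRU-2 sketch (illustrative; stale heap entries are skipped lazily rather than updated in place, which keeps the per-access cost at log N as the slide states):

```python
import heapq

class LRU2Cache:
    def __init__(self, size):
        self.size = size
        self.hist = {}   # block -> (second-to-last, last) reference times
        self.heap = []   # (second-to-last time, block); may hold stale entries
        self.clock = 0

    def access(self, block):
        self.clock += 1
        hit = block in self.hist
        if not hit and len(self.hist) >= self.size:
            # Evict the block whose second-to-last reference is oldest,
            # skipping heap entries that no longer match the history.
            while True:
                t, victim = heapq.heappop(self.heap)
                if victim in self.hist and self.hist[victim][0] == t:
                    del self.hist[victim]
                    break
        prev_last = self.hist[block][1] if hit else -1  # -1 = no earlier reference
        self.hist[block] = (prev_last, self.clock)
        heapq.heappush(self.heap, (prev_last, block))
        return hit
```

Blocks seen only once carry a second-to-last time of -1, so cold blocks sort to the front of the heap and are removed quickly, as the slide notes.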
Replacement Algorithm
- 2Q
  - Proposed to achieve page replacement performance similar to LRU-2 at a low, LRU-like constant overhead per access
  - All missed blocks go into the A1in queue
  - Addresses of blocks replaced from A1in go into the A1out queue
  - Re-referenced blocks go into the Am queue
  - Prefetched blocks are treated as on-demand blocks; if a prefetched block is evicted from A1in before its on-demand access, it is simply discarded (see the sketch after the next slide)
Replacement Algorithm
- 2Q (diagram of the A1in, A1out, and Am queues)
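A simplified 2Q sketch (the queue-size fractions are the usual tunables and are assumptions here; prefetched blocks enter through the same cold-miss path into A1in):

```python
from collections import OrderedDict, deque

class TwoQCache:
    def __init__(self, size, a1in_frac=0.25, a1out_frac=0.5):
        self.a1in_size = max(1, int(size * a1in_frac))
        self.am_size = size - self.a1in_size
        self.a1out_size = max(1, int(size * a1out_frac))  # addresses only
        self.a1in = deque()          # FIFO of blocks seen once
        self.a1out = deque()         # ghost queue of evicted addresses
        self.am = OrderedDict()      # LRU of re-referenced blocks

    def access(self, block):
        if block in self.am:                 # hit in Am: promote to MRU
            self.am.move_to_end(block)
            return True
        if block in self.a1in:               # hit in A1in: left in place
            return True
        if block in self.a1out:              # ghost hit: re-referenced block
            self.a1out.remove(block)
            if len(self.am) >= self.am_size:
                self.am.popitem(last=False)  # evict the LRU block of Am
            self.am[block] = None
            return False                     # address only, still a miss
        # Cold miss (demand or prefetch): insert into A1in; the oldest
        # A1in block spills its address into A1out. A prefetched block
        # evicted here before any demand access is simply discarded.
        if len(self.a1in) >= self.a1in_size:
            self.a1out.append(self.a1in.popleft())
            if len(self.a1out) > self.a1out_size:
                self.a1out.popleft()
        self.a1in.append(block)
        return False
```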
Replacement Algorithm
- LIRS (Low Inter-reference Recency Set)
  - LIR block: has been accessed again since being inserted on the LRU stack
  - HIR block: referenced less frequently
  - Prefetched blocks are inserted into the part of the cache that maintains HIR blocks
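A much-simplified LIRS sketch (assumption: non-resident history blocks are not kept on the stack, so this illustrates only the LIR/HIR bookkeeping, not the full algorithm). The point relevant to prefetching: after warm-up, every newly inserted block, prefetched blocks included, enters as a HIR block in Q rather than in the protected LIR set:

```python
from collections import OrderedDict

class SimpleLIRS:
    def __init__(self, size, hir_frac=0.1):
        self.hir_size = max(1, int(size * hir_frac))
        self.lir_size = size - self.hir_size
        self.stack = OrderedDict()  # recency stack, MRU last; value = is_lir
        self.q = OrderedDict()      # resident HIR blocks, FIFO
        self.n_lir = 0

    def _prune(self):
        # Invariant: the block at the stack bottom is always a LIR block.
        while self.stack and not next(iter(self.stack.values())):
            self.stack.popitem(last=False)

    def access(self, block):
        if self.stack.get(block):                  # LIR hit
            self.stack.move_to_end(block)
            self._prune()
            return True
        if block in self.q:                        # resident HIR hit
            if block in self.stack:
                # Recency is now low: promote to LIR and demote the
                # LIR block at the stack bottom into the HIR queue.
                del self.q[block]
                self.stack[block] = True
                self.stack.move_to_end(block)
                self.n_lir += 1
                if self.n_lir > self.lir_size:
                    victim, _ = self.stack.popitem(last=False)
                    self.n_lir -= 1
                    self.q[victim] = None          # demoted, still resident
                    self._prune()
            else:                                  # no stack history: stays HIR
                self.stack[block] = False
                self.stack.move_to_end(block)
                self.q.move_to_end(block)
            return True
        # Miss: fill the LIR set during warm-up, then insert every new
        # block -- prefetched blocks included -- as a HIR block in Q.
        if self.n_lir < self.lir_size:
            self.stack[block] = True
            self.n_lir += 1
            return False
        if len(self.q) >= self.hir_size:
            victim, _ = self.q.popitem(last=False) # evict a HIR block
            self.stack.pop(victim, None)
            self._prune()
        self.stack[block] = False
        self.q[block] = None
        return False
```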
Replacement Algorithm
- LRFU (Least Recently/Frequently Used)
  - Replaces the block with the smallest C(x) value, computed for every block x at every time step, with λ a tunable parameter:
    C(x) = 1 + 2^(-λ)·C(x) if x is referenced at the current time, and C(x) = 2^(-λ)·C(x) otherwise
  - Initially, C(x) = 0
  - Prefetched blocks are treated as the most recently accessed
    - Problem: how to assign the initial weight C(x) to a prefetched block
    - Solution: a prefetched flag is set, and the initial value is assigned when the block is accessed on demand
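An LRFU sketch with lazy decay (storing each C(x) with its last-update time avoids touching every block at every step; the linear scan at eviction stands in for the heap a real implementation would use). The prefetch flag defers the initial weight until the first demand access, as described above; the exact flag handling here is an illustrative guess at that fix:

```python
class LRFUCache:
    def __init__(self, size, lam=0.001):
        self.size = size
        self.lam = lam          # tunable: 0 -> pure LFU, large -> pure LRU
        self.crf = {}           # block -> (C value, time of last update)
        self.clock = 0

    def _decayed(self, block):
        # Apply the pending decay factor 2^(-lambda * elapsed) lazily.
        value, t = self.crf[block]
        return value * 2 ** (-self.lam * (self.clock - t))

    def access(self, block, prefetch=False):
        self.clock += 1
        hit = block in self.crf
        if not hit and len(self.crf) >= self.size:
            victim = min(self.crf, key=self._decayed)  # smallest C(x)
            del self.crf[victim]
        old = self._decayed(block) if hit else 0.0     # initially C(x) = 0
        # A prefetched block gets no weight yet; its first demand access
        # supplies the usual reference weight of 1.
        weight = 0.0 if (prefetch and not hit) else 1.0
        self.crf[block] = (weight + old, self.clock)
        return hit
```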
Replacement Algorithm
- MQ (Multi-Queue)
  - Uses m LRU queues (typically m = 8): Q0, Q1, ..., Qm-1, where Qi contains blocks that have been accessed at least 2^i times but no more than 2^(i+1) - 1 times recently
  - The reference counter is not incremented when a block is prefetched (sketch after the next slide)
Replacement Algorithm
- MQ (diagram of the multi-queue structure)
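A simplified MQ sketch (the ghost queue and lifetime-based demotion of real MQ are omitted; queue selection by reference count follows the rule two slides back). Prefetched blocks keep a count of zero, so they start in Q0 and are among the first eviction candidates:

```python
from collections import OrderedDict

class MQCache:
    def __init__(self, size, m=8):
        self.size = size
        self.m = m
        self.queues = [OrderedDict() for _ in range(m)]  # Q0 .. Q(m-1)
        self.refs = {}   # block -> reference count

    def _idx(self, count):
        # A count c belongs to Qi with 2^i <= c <= 2^(i+1) - 1.
        return min(max(count, 1).bit_length() - 1, self.m - 1)

    def access(self, block, prefetch=False):
        hit = block in self.refs
        if hit:
            del self.queues[self._idx(self.refs[block])][block]
        elif len(self.refs) >= self.size:
            for q in self.queues:        # evict the LRU block of the
                if q:                    # lowest non-empty queue
                    victim, _ = q.popitem(last=False)
                    del self.refs[victim]
                    break
        # A prefetch does not increment the reference counter.
        count = self.refs.get(block, 0) + (0 if prefetch else 1)
        self.refs[block] = count
        self.queues[self._idx(count)][block] = None  # MRU of its queue
        return hit
```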
Replacement Algorithm
- ARC (Adaptive Replacement Cache)
  - Maintains two LRU lists
    - L1: pages that have been referenced only once
    - L2: pages that have been referenced at least twice
  - Each list has the same length c as the cache
  - The cache contains the tops of both lists, T1 (top of L1) and T2 (top of L2), with |T1| + |T2| = c
Replacement Algorithm
- ARC attempts to maintain a target size B_T1 for list T1
- When the cache is full, ARC replaces:
  - the LRU page of T1, if |T1| > B_T1
  - the LRU page of T2, otherwise
- If a prefetched block is already in a ghost queue, it is not moved to the second queue but to the first queue
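A simplified ARC sketch following the published algorithm (the adaptation step and list trimming are condensed; treat those details as assumptions). The prefetch modification described above appears in the ghost-hit branches: a prefetched block found in a ghost list re-enters T1 instead of T2, since a prefetch is not a genuine reuse:

```python
from collections import OrderedDict

class ARCCache:
    def __init__(self, c):
        self.c = c
        self.p = 0                                        # adaptive target |T1|
        self.t1, self.t2 = OrderedDict(), OrderedDict()   # resident lists
        self.b1, self.b2 = OrderedDict(), OrderedDict()   # ghost lists

    def _replace(self, in_b2):
        # Evict from T1 if it exceeds its target, else from T2; evicted
        # addresses are remembered in the matching ghost list.
        if self.t1 and (len(self.t1) > self.p or (in_b2 and len(self.t1) == self.p)):
            old, _ = self.t1.popitem(last=False)
            self.b1[old] = None
        else:
            old, _ = self.t2.popitem(last=False)
            self.b2[old] = None

    def access(self, block, prefetch=False):
        if block in self.t1 or block in self.t2:   # real hit: MRU of T2
            self.t1.pop(block, None)
            self.t2.pop(block, None)
            self.t2[block] = None
            return True
        if block in self.b1:                       # ghost hit: grow T1 target
            self.p = min(self.c, self.p + max(1, len(self.b2) // len(self.b1)))
            del self.b1[block]
            self._replace(in_b2=False)
            (self.t1 if prefetch else self.t2)[block] = None
            return False
        if block in self.b2:                       # ghost hit: shrink T1 target
            self.p = max(0, self.p - max(1, len(self.b1) // len(self.b2)))
            del self.b2[block]
            self._replace(in_b2=True)
            (self.t1 if prefetch else self.t2)[block] = None
            return False
        # Cold miss: trim the lists, then insert at the MRU end of T1.
        if len(self.t1) + len(self.b1) == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False)
                self._replace(in_b2=False)
            else:
                self.t1.popitem(last=False)        # drop the LRU of T1 outright
        elif len(self.t1) + len(self.t2) + len(self.b1) + len(self.b2) >= self.c:
            if len(self.t1) + len(self.t2) + len(self.b1) + len(self.b2) >= 2 * self.c:
                self.b2.popitem(last=False)
            self._replace(in_b2=False)
        self.t1[block] = None
        return False
```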
Performance Evaluation
- Simulation environment
  - The authors implemented a buffer cache simulator that functionally models Linux kernel behavior (prefetching, I/O clustering)
  - With DiskSim, they simulate the I/O time of applications
- Applications
  - Sequential access: cscope, glimpse
  - Random access: tpc-h, tpc-r
  - Multi1: workload in a code development environment
  - Multi2: workload in graphic development and simulation
  - Multi3: workload with a database and a web index server
Performance Evaluation (Cont.)
- cscope (sequential): figures for hit ratio, number of clustered disk requests, and execution time
Performance Evaluation (Cont.)
- glimpse (sequential): figures for hit ratio, number of clustered disk requests, and execution time
Performance Evaluation (Cont.)
- tpc-h (random): figures for hit ratio, number of clustered disk requests, and execution time
Performance Evaluation (Cont.)
- tpc-r (random): figures for hit ratio, number of clustered disk requests, and execution time
Performance Evaluation (Cont.)
- Concurrent applications
  - Multi1: hit ratios and disk requests with and without prefetching exhibit behavior similar to cscope
  - Multi2: behavior is similar to Multi1, but prefetching does not improve the execution time (viewperf is CPU-bound)
  - Multi3: behavior is similar to tpc-h
- Synchronous vs. asynchronous prefetching
  - With prefetching, the number of disk requests is at least 30% lower than without prefetching for all algorithms except OPT, especially when asynchronous prefetching is used
  - Figure: number and size of disk I/Os (cscope at a 128 MB cache size)
Conclusion
- Kernel prefetching can have a significant performance impact on different replacement algorithms
- Application file access patterns (sequential vs. random) matter greatly for prefetching disk data
- With or without prefetching, the hit ratio is not the sole performance metric