Transcript Slide 1
Tivoli Software
Using Machine Learning Techniques to Enhance
The Performance of an Automatic Backup and
Recovery System
Amir Ronen, Dan Pelleg, Machine Learning Group, HRL
Eran Raichstein (IBM Software Group)
Amir Ronen
© 2010 IBM Corporation
Motivation
IBM’s FastBack: an automatic backup and recovery system
Incremental backup of disk volumes to a repository
Instant restore (IR): allows applications to start working
immediately after recovery
Xpress mount: allows access to backed-up data without
recovering it (e.g. for taking tape dumps)
Goal
Accelerate IR and mount via machine learning and algorithmic
techniques
Minimal intervention in FastBack’s internals
Benefits: minimize bugs, easy upgrading, generality, …
Outline
The FastBack system
Algorithm for automatic determination of read-ahead
– Basic observations
– The algorithm
– Experiments in the FastBack system
Prefetching
– Theoretical model and observation
– Basic prefetching algorithms
– Frequent pattern based algorithms
– Controlling and combining prefetch algorithms
Summary
FastBack’s Instant Restore and Mount
Instant Restore allows users to start
using applications on the same disk
to which the volume is being
restored, while the restore operation
is still in process.
From an architectural perspective,
mount is somewhat similar
1. Activate Instant Restore
2. Read IOs from un-recovered areas
trigger block fetch from the repository
3. All other reads are performed as usual
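The read path in steps 1 to 3 can be pictured with a minimal Python sketch. The class name, the method names and the dictionary standing in for the repository are hypothetical and only illustrate the on-demand fetch; they are not FastBack's actual interfaces.

```python
from typing import Dict


class InstantRestoreVolume:
    """Toy model of a volume that is usable while it is still being restored."""

    def __init__(self, repository: Dict[int, bytes]) -> None:
        self.repository = repository       # stand-in for the backup repository
        self.local: Dict[int, bytes] = {}  # blocks already recovered to the new disk
        self.repository_fetches = 0

    def read(self, block: int) -> bytes:
        # Step 2: a read from an un-recovered area triggers a block fetch.
        if block not in self.local:
            self.local[block] = self.repository[block]
            self.repository_fetches += 1
        # Step 3: all other reads are served from the local disk as usual.
        return self.local[block]


volume = InstantRestoreVolume({0: b"boot", 1: b"data"})
volume.read(1)                     # fetched from the repository
volume.read(1)                     # now served locally
print(volume.repository_fetches)   # 1
```

Every repository fetch in this picture pays a network round trip, which is why the rest of the talk looks at how much to fetch per request and which blocks to fetch early.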
[Figure labels: Production server, Typical Production Disk, Xpress Restore Server, repository, New Production server, New Production Disk]
CNF: An Algorithm for Readahead
Amount Determination
The problem
A block is needed from repository
[Figure: Xpress Restore Server, New Production server, repository]
Suppose that we are allowed to
bring additional subsequent blocks
How many to bring?
- too many may slow down the system
(in particular if they will not be used)
- too few will cause high total latency
Simple cost model: T ~ T1 + nT2 + ε
T1: “fixed” latency
T2: time to bring one block
n: number of blocks
ε: noise (assumed zero)
Key idea: suppose that we choose n such that T1 = nT2, i.e. n = T1/T2
The request then costs T1 + nT2 = 2T1, so the cost never more than doubles
In many settings n can be large
The algorithm is 2-competitive
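As a small numeric illustration of the key idea, the sketch below computes the break-even n and the resulting request cost under the slide's cost model with zero noise. The function names and the values 6.0 and 3.0 are made up for the example, and rounding to a whole number of blocks (at least 1) is an assumption.

```python
def break_even_readahead(t1: float, t2: float) -> int:
    """Number of blocks n for which the transfer time n*T2 matches the latency T1."""
    if t1 < 0 or t2 <= 0:
        raise ValueError("expected T1 >= 0 and T2 > 0")
    # Assumption: round to whole blocks and always bring at least one.
    return max(1, round(t1 / t2))


def request_cost(t1: float, t2: float, n: int) -> float:
    """Cost model from the slide with the noise term taken as zero."""
    return t1 + n * t2


n = break_even_readahead(6.0, 3.0)
print(n, request_cost(6.0, 3.0, n))   # 2 12.0, i.e. twice the fixed latency of 6.0
```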
Problem 1
The latency T1 and the block cost T2 are not known
May vary over time
Solution
Hold a window of last k requests (e.g. 200)
Use linear regression to estimate T1 and T2
Update can be done in O(1)
Example fit: latency T1 ~ 6.5, block cost T2 ~ 3
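One way to get the O(1) update is to keep running sums over the window and adjust them as requests enter and leave. The sketch below is a plain least-squares fit of T = T1 + nT2 written for illustration; the class name, the field names and the degeneracy threshold are assumptions, not taken from the FastBack implementation.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class WindowedRegression:
    """Least-squares fit of T = T1 + n*T2 over the last `window` requests."""

    def __init__(self, window: int = 200) -> None:
        self.window = window
        self.samples: Deque[Tuple[float, float]] = deque()
        # Running sums of n, t, n*n and n*t over the current window.
        self.sn = self.st = self.snn = self.snt = 0.0

    def _update(self, n: float, t: float, sign: float) -> None:
        self.sn += sign * n
        self.st += sign * t
        self.snn += sign * n * n
        self.snt += sign * n * t

    def add(self, n: float, t: float) -> None:
        """Record a request of n blocks that took time t, in O(1)."""
        self.samples.append((n, t))
        self._update(n, t, +1.0)
        if len(self.samples) > self.window:
            old_n, old_t = self.samples.popleft()
            self._update(old_n, old_t, -1.0)
        # Recomputing the sums from scratch now and then would limit
        # floating-point drift; omitted here for brevity.

    def estimate(self) -> Optional[Tuple[float, float]]:
        """Return (T1, T2), or None when the n values do not vary enough."""
        k = len(self.samples)
        denom = k * self.snn - self.sn * self.sn
        if k < 2 or denom < 1e-9:   # threshold is an arbitrary assumption
            return None
        t2 = (k * self.snt - self.sn * self.st) / denom
        t1 = (self.st - t2 * self.sn) / k
        return t1, t2


reg = WindowedRegression(window=200)
for n in (1, 2, 4, 8):
    reg.add(n, 6.5 + 3 * n)   # synthetic, noise-free measurements
print(reg.estimate())
```

When all n values in the window are equal the denominator is zero and `estimate` returns None, which is exactly the situation the next slide addresses.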
Problem 2
What if the n values in the window are too similar for the
regression to estimate T1 and T2?
Sampling ideas
We only need a few samples
If mean(n) is large we sample small values
If mean(n) is small, we sample 2*mean(n)
Low amortized cost
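The sampling rule can be sketched as a single function. The slides do not say what counts as a "large" mean or which small value is sampled, so the threshold `large_mean` and the value `small_n` below are placeholders chosen only for illustration.

```python
from typing import Sequence


def sample_readahead(window_ns: Sequence[int],
                     large_mean: float = 8.0,
                     small_n: int = 1) -> int:
    """Pick an n that differs from the recent ones so the regression can work.

    large_mean and small_n are hypothetical parameters, not values from the talk.
    """
    if not window_ns:
        raise ValueError("need at least one recent request size")
    mean_n = sum(window_ns) / len(window_ns)
    if mean_n >= large_mean:
        return small_n                    # mean(n) is large: sample a small value
    return max(2, round(2 * mean_n))      # mean(n) is small: sample 2*mean(n)


print(sample_readahead([16, 16, 16]))   # 1
print(sample_readahead([2, 2, 2]))      # 4
```

Only a few such samples are needed before the window contains enough variation in n, which is where the low amortized cost comes from.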
The Algorithm
Hold a window of the last k requests
At each step update the linear regression
(Refresh from time to time)
If regression is possible:
– Estimate T1, T2
– Compute desired n value
– If the system asked for less, recommend readahead
Otherwise
– Sample as described
Additional heuristics: unreasonable values, smoothing,
mis-estimation…
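The decision step of the algorithm can be sketched as below. It assumes the regression estimate is passed in as a (T1, T2) pair or None, and that the sampling rule is supplied as a callable; the function name, the `max_n` cap and the sanity check on the estimates are illustrative assumptions. How a real system reconciles a sampled size with the size the caller asked for is not specified in the slides, so the sampled value is returned as-is.

```python
from typing import Callable, Optional, Tuple


def recommend_readahead(requested_n: int,
                        estimate: Optional[Tuple[float, float]],
                        sample: Callable[[], int],
                        max_n: int = 1024) -> int:
    """Return the number of blocks to bring for this request."""
    if estimate is not None:
        t1, t2 = estimate
        # Heuristic guard: ignore fits with unreasonable (non-positive) values.
        if t1 > 0 and t2 > 0:
            desired = min(max_n, max(1, round(t1 / t2)))
            # Recommend read-ahead only if the system asked for less.
            return max(requested_n, desired)
    # Regression not possible (or not trusted): fall back to sampling.
    return sample()


print(recommend_readahead(1, (6.0, 3.0), sample=lambda: 4))   # 2
print(recommend_readahead(1, None, sample=lambda: 4))         # 4
```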
Impact on FastBack
Added latency per request
Outperformed the predetermined values
Speedup of up to 4x
mounting continuous and fragmented data
Comments & open issues
The algorithm may be applicable elsewhere
Extensions to more complicated cost models
Analyzing executions of parallel copies of the
algorithm
Block Prediction and Prefetching for
Enhancing Instant Restore
Motivation
IR needs to fetch blocks from the
repository according to its workload
[Figure: Xpress Restore Server, New Production server, repository]
Ideally, blocks will be predicted and
brought before they are needed
Comments
The network is not preemptive
so prefetching can also be harmful
Typical workloads are parallel
processes, each with some locality
of reference
A model for the prefetch problem
The workload is an unknown sequence of events L1, …, Ln. Each
Lj is either:
An access to a block Bj
A process event
The system is composed of a CPU and a network that can run in
parallel. At each step j the system can do one of the following:
1. Process (Lj is a process event, cost = 1 unit)
2. Access its local memory (If Lj is an access event and Bj is
already in the local memory, cost = 1 unit)
3. Fetch a block from the repository (this occupies the network
for C time units, can be done in parallel to 1 or 2)
A model for the prefetch problem (cont.)
Slowdown: let L = L1, …, Ln be a workload. The slowdown of the
system on L is the ratio between the total system time and the
time to perform the workload locally, i.e. Tsys / n.
[Figure: example timeline with rows Workload, CPU and Network; accesses to B17 and B18 interleaved with process events, Fetch 17 and Fetch 18 on the network; C=2, Delta rule]
Slowdown is ~1
Without prefetching, slowdown is around 2
Simple prefetch algorithms
Delta rule
Whenever Bj is accessed, put Bj+1 in the queue
Whenever the network is idle, prefetch in LIFO order
Very effective rule, simple to implement
No prefetch
Can be shown to be 2-competitive!
Order by frequency
At training time, order blocks by their frequency
OPT: hypothetical optimal offline algorithm
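To make the comparison between the Delta rule and no prefetching concrete, here is a small discrete-time simulation of the model from the earlier slides. It is a sketch under stated assumptions: time advances in unit steps, the network carries one fetch at a time and is never preempted, a demand fetch takes priority over prefetching when the network is idle, and Bj+1 is queued when Bj is first requested. The function name and the synthetic workload are made up for the example.

```python
from typing import List, Optional


def slowdown(workload: List[Optional[int]], fetch_cost: int, delta_rule: bool) -> float:
    """Simulate the model and return Tsys / n.

    workload: None is a process event, an int is an access to that block.
    """
    if not workload:
        raise ValueError("workload must not be empty")
    local = set()      # blocks already in local memory
    queue = []         # prefetch candidates; the newest is taken first (LIFO)
    in_flight = None   # block currently occupying the network
    done_at = 0        # time at which the in-flight fetch completes
    queued_for = -1    # index of the last access event that queued a candidate
    t = 0
    j = 0
    while j < len(workload):
        # A fetch started fetch_cost units ago has now arrived.
        if in_flight is not None and t >= done_at:
            local.add(in_flight)
            in_flight = None
        event = workload[j]
        if delta_rule and event is not None and queued_for != j:
            queue.append(event + 1)   # Delta rule: Bj requested, queue Bj+1
            queued_for = j
        missing = event is not None and event not in local
        if in_flight is None:
            if missing:
                in_flight, done_at = event, t + fetch_cost   # demand fetch
            else:
                while queue:                                  # idle network: prefetch
                    candidate = queue.pop()
                    if candidate not in local:
                        in_flight, done_at = candidate, t + fetch_cost
                        break
        if not missing:
            j += 1   # the CPU processes or reads local memory (1 unit)
        t += 1       # when the block is missing, the CPU stalls for this unit
    return t / len(workload)


# Synthetic workload: sequential block accesses, each followed by one process event.
trace: List[Optional[int]] = []
for block in range(17, 67):
    trace += [block, None]
print(slowdown(trace, fetch_cost=2, delta_rule=False))   # 2.0
print(slowdown(trace, fetch_cost=2, delta_rule=True))    # 1.02
```

On this purely sequential trace with C=2 the Delta rule hides almost all of the fetch time, while on a trace with no sequential locality the same rule would only keep the non-preemptive network busy with useless fetches.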
Frequent pattern mining based algorithms
CMiner (Li et al., FAST 2004)
Identifies recurring block sub-sequences at training time
Problematic runtime and space complexity in our settings
[Figure: example pattern over items A, E, L and Z; labels "Hot item" and "B-tree"]
Novel variants of CMiner
CMiner()
Identifies generic frequent delta rules
[Formula garbled in extraction: a "fetch" rule, for all j, over block Bj and the blocks at offsets 1, 2 and 3 from it]
Efficient runtime and space complexity
CMiner-OBF
A two-level variant of CMiner
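The slides do not spell out how the generic frequent delta rules are mined. As a rough illustration of the idea only, and not the authors' algorithm, the sketch below counts the differences between consecutive block numbers in a training trace and keeps those that occur often enough; at run time each kept delta yields one prefetch candidate. The function names and the `min_support` parameter are hypothetical.

```python
from collections import Counter
from typing import List


def mine_frequent_deltas(trace: List[int], min_support: float = 0.2) -> List[int]:
    """Return deltas between consecutive accesses, most frequent first.

    A delta is kept when it accounts for at least min_support of all
    consecutive pairs in the training trace.
    """
    if len(trace) < 2:
        return []
    counts = Counter(b - a for a, b in zip(trace, trace[1:]))
    total = len(trace) - 1
    return [d for d, c in counts.most_common() if c / total >= min_support]


def prefetch_candidates(block: int, deltas: List[int]) -> List[int]:
    """Blocks to consider prefetching after `block` has been accessed."""
    return [block + d for d in deltas]


training = [10, 11, 12, 50, 51, 52, 90, 91]
deltas = mine_frequent_deltas(training)
print(deltas)                             # [1, 38]
print(prefetch_candidates(100, deltas))   # [101, 138]
```

Because the rules are expressed as offsets rather than as concrete block numbers, the table stays small regardless of the volume size, which is one plausible reading of the efficiency claim above.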
Simulations
Setup
Used traces of OLTP financial transactions and of an SQL
stress tool.
Simulated the system under various parameters and
measured slowdown at various time points
Simulations (cont)
Simple delta rules were hard to beat
CMiner() often, but not always, improves upon them
Some schemes are harmful
Summary and open issues
Automatic read-ahead determination
Highly effective
May be applicable elsewhere
Calls for more generalized cost models
Block prediction and prefetch
Simple delta rules seem hard to beat
Potential for improvement
Novel frequent pattern mining based algorithms. Might
be interesting in other contexts (e.g. caching)