Transcript Slide 1

Tivoli Software
Using Machine Learning Techniques to Enhance the Performance of an Automatic Backup and Recovery System
Amir Ronen, Dan Pelleg, Machine Learning Group, HRL
Eran Raichstein (IBM Software Group)
Amir Ronen
© 2010 IBM Corporation
Motivation
IBM’s FastBack: an automatic backup and recovery system
• Incremental backup of disk volumes to a repository
• Instant Restore (IR): allows applications to start working immediately after recovery
• Xpress Mount: allows access to backup data without recovering it (e.g. for taking tape dumps)
Goal
• Accelerate IR and mount via machine learning and algorithmic techniques
• Minimum intervention in FastBack’s internals
Benefits: minimize bugs, easy upgrading, generality, …
Outline
• The FastBack system
• Algorithm for automatic determination of read-ahead
– Basic observations
– The algorithm
– Experiments in the FastBack system
• Prefetching
– Theoretical model and observation
– Basic prefetching algorithms
– Frequent pattern based algorithms
– Controlling and combining prefetch algorithms
• Summary
FastBack’s Instant Restore and Mount
Instant Restore allows users to start using applications on the same disk to which the volume is being restored, while the restore operation is still in progress.
From an architectural perspective, mount is somewhat similar.
1. Activate Instant Restore
2. Read IOs from unrecovered areas trigger a block fetch from the repository
3. All other reads are performed as usual
[Diagram labels: Production server, Typical Production Disk, Xpress Restore Server, repository, New Production server, New Production Disk]
CNF: An Algorithm for Readahead Amount Determination
The problem
• A block is needed from the repository
• Suppose that we are allowed to bring additional subsequent blocks
• How many to bring?
– too many may slow down the system (in particular if they will not be used)
– too few will cause high total latency
[Diagram labels: Xpress Restore Server, New Production server, repository]
Simple cost model: T ≈ T1 + n·T2 + ε
• T1: “fixed” latency
• T2: time to bring one block
• n: number of blocks
• ε: noise (assumed zero)
Key idea: suppose that we choose n such that T1 = n·T2
• The cost never more than doubles (T = T1 + n·T2 = 2·T1)
• In many settings n can be large
The algorithm is 2-competitive
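The key idea can be illustrated with a minimal sketch, assuming the noise term is zero and that T1 and T2 are already known. The function names and the `max_blocks` cap are hypothetical and not part of FastBack. The sketch rounds n down so that the transfer time n·T2 never exceeds the fixed latency T1.

```python
def request_cost(t1: float, t2: float, n: int) -> float:
    """Cost of one request under the simple model T = T1 + n*T2 (noise ignored)."""
    return t1 + n * t2


def readahead_blocks(t1: float, t2: float, max_blocks: int = 1024) -> int:
    """Choose n so that n*T2 is as close to T1 as possible without exceeding it.

    max_blocks is a hypothetical safety cap, not something the slides specify.
    """
    if t1 <= 0 or t2 <= 0:
        return 1  # degenerate inputs: fall back to the single block that is needed
    n = int(t1 // t2)  # floor, so n*T2 <= T1 and the total stays below 2*T1
    # At least the one requested block is always fetched, even when T1 < T2.
    return max(1, min(max_blocks, n))
```

With the example values latency 6.5 and block cost 3, this picks n = 2, for a total cost of 12.5, which is below 2·T1 = 13.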
Problem 1
• The latency T1 and the block cost T2 are not known
• They may vary over time
Solution
• Hold a window of the last k requests (e.g. 200)
• Use linear regression to estimate T1 and T2
• Each update can be done in O(1)
Example: latency ~ 6.5, block cost ~ 3
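One way to get an O(1) update is to keep running sums over the window and solve the least-squares line in closed form. The following is a minimal sketch under that assumption. The class and method names are hypothetical, and the slides do not say how FastBack implements the regression.

```python
from collections import deque


class WindowedCostEstimator:
    """Least-squares fit of T = T1 + n*T2 over the last `window` requests."""

    def __init__(self, window: int = 200):
        self.window = window
        self.samples = deque()  # (n, observed_time) pairs, oldest first
        self.sn = self.st = self.snn = self.snt = 0.0  # running sums

    def _accumulate(self, n: float, t: float, sign: int) -> None:
        self.sn += sign * n
        self.st += sign * t
        self.snn += sign * n * n
        self.snt += sign * n * t

    def add(self, n: float, t: float) -> None:
        """Record one request; constant work regardless of window size."""
        self.samples.append((n, t))
        self._accumulate(n, t, +1)
        if len(self.samples) > self.window:
            old_n, old_t = self.samples.popleft()
            self._accumulate(old_n, old_t, -1)

    def estimate(self):
        """Return (T1, T2), or None when the n-values do not vary enough."""
        k = len(self.samples)
        denom = k * self.snn - self.sn ** 2
        if k < 2 or abs(denom) < 1e-9:
            return None
        t2 = (k * self.snt - self.sn * self.st) / denom  # slope: per-block cost
        t1 = (self.st - t2 * self.sn) / k                # intercept: fixed latency
        return t1, t2
```

Adding and subtracting floating-point values indefinitely lets rounding error build up in the running sums, so a long-running version would need to recompute them from the stored samples now and then.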
Problem 2
• What if the n-values are all similar, so that T1 and T2 cannot be estimated?
Sampling ideas
• We only need a few samples
• If mean(n) is large, we sample small values
• If mean(n) is small, we sample 2*mean(n)
• Low amortized cost
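The sampling ideas might look like the sketch below. The threshold that separates "large" from "small" means, the sampling probability, and the choice of 1 as the "small value" are all assumptions made for illustration, since the slides do not give them.

```python
import math
import random


def exploratory_readahead(recent_n, large_threshold=8, sample_prob=0.05, rng=random):
    """Occasionally return an unusual n so the regression sees varied inputs.

    Returns None when no exploration should happen for this request.
    large_threshold and sample_prob are hypothetical tuning knobs.
    """
    if not recent_n or rng.random() >= sample_prob:
        return None  # most requests are left alone, keeping the amortized cost low
    mean_n = sum(recent_n) / len(recent_n)
    if mean_n >= large_threshold:
        return 1  # mean is large: probe a small value
    return max(2, math.ceil(2 * mean_n))  # mean is small: probe about twice the mean
```

Because only a small fraction of requests are perturbed, the extra cost is spread thinly across the whole request stream.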
The Algorithm
• Hold a window of the last k requests
• At each step, update the linear regression (refresh from time to time)
• If regression is possible:
– Estimate T1, T2
– Compute the desired n value
– If the system asked for less, recommend readahead
• Otherwise:
– Sample as described
Additional heuristics: unreasonable values, smoothing, mis-estimation…
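The decision step above can be sketched as a single function. It takes the regression result and an optional exploratory value as inputs, so it stays independent of how they are produced. The function name, the `max_blocks` cap, and the handling of unreasonable estimates are assumptions; smoothing is left out.

```python
def recommend_readahead(requested_n, estimate, exploratory_n=None, max_blocks=1024):
    """Return how many blocks to fetch for a request that asked for requested_n.

    estimate      -- (T1, T2) from the regression, or None if it was not possible
    exploratory_n -- value proposed by the sampling step, or None
    max_blocks    -- hypothetical upper bound on the recommendation
    """
    if estimate is None:
        # Regression not possible: sample if a probe was proposed.
        return exploratory_n if exploratory_n is not None else requested_n
    t1, t2 = estimate
    if t1 <= 0 or t2 <= 0:
        return requested_n  # unreasonable estimate: do not interfere
    desired = max(1, min(max_blocks, int(t1 // t2)))
    # Only ever add readahead; never shrink what the system asked for.
    return max(requested_n, desired)
```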
Impact on FastBack
• Added latency per request
• Outperformed the predetermined values
• Speedup of up to 4x, mounting continuous and fragmented data
Comments & open issues
• The algorithm may be applicable elsewhere
• Extensions to more complicated cost models
• Analyzing executions of parallel copies of the algorithm
Block Prediction and Prefetching for Enhancing Instant Restore
Motivation
• IR needs to fetch blocks from the repository according to its workload
• Ideally, blocks will be predicted and brought before they are needed
Comments
• The network is not preemptive, so prefetching can also be harmful
• Typical workloads are parallel processes, each with some locality of reference
[Diagram labels: Xpress Restore Server, New Production server, repository]
A model for the prefetch problem
Workload is an unknown sequence of events L1, …, Ln. Each Lj is either:
• An access to a block Bj
• A process event
System is composed of a CPU and a network that can run in parallel. At each step j the system can do one of the following:
1. Process (Lj is a process event, cost = 1 unit)
2. Access its local memory (if Lj is an access event and Bj is already in the local memory, cost = 1 unit)
3. Fetch a block from the repository (this occupies the network for C time units and can be done in parallel with 1 or 2)
A model for the prefetch problem (cont.)
Slowdown: let L = L1, …, Ln be a workload. The slowdown of the system on L is the ratio between the total system time and the time to perform the workload locally, i.e. Tsys / n.
[Diagram: example timeline with C = 2, labelled Delta, showing the workload (accesses to B17 and B18 interleaved with process events), the CPU row (Access, Process, …) and the network row (Fetch 17, Fetch 18)]
• With prefetching, slowdown is ~1
• Without prefetching, slowdown is around 2
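To make the model and the slowdown measure concrete, here is a small time-stepped simulator. It is a sketch under several assumptions of its own: a stalled access costs one unit per waiting step, only one fetch can be in flight, and a fetch may start in the same time unit as the access that triggered it. The function names and the `pick_prefetch` hook are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Set

Event = Optional[int]  # None = process event, int = access to that block id


def slowdown(workload: Sequence[Event], fetch_cost: int,
             pick_prefetch: Optional[Callable[[List[int], Set[int]], Optional[int]]] = None) -> float:
    """Return Tsys / n for the workload under the simple CPU + network model."""
    if not workload:
        raise ValueError("workload must contain at least one event")
    local: Set[int] = set()    # blocks already in local memory
    history: List[int] = []    # blocks accessed so far, oldest first
    in_flight = None           # (block, completion_time) while the network is busy
    t = 0                      # elapsed time units
    i = 0                      # index of the next event to execute
    while i < len(workload):
        if in_flight is not None and in_flight[1] <= t:
            local.add(in_flight[0])
            in_flight = None
        event = workload[i]
        missing = event is not None and event not in local
        if not missing:        # process event or local hit: costs this time unit
            if event is not None:
                history.append(event)
            i += 1
        if in_flight is None:  # the network is not preemptive: start only when idle
            if missing:
                in_flight = (event, t + fetch_cost)      # demand fetch
            elif pick_prefetch is not None:
                block = pick_prefetch(history, local)
                if block is not None and block not in local:
                    in_flight = (block, t + fetch_cost)  # speculative fetch
        t += 1
    return t / len(workload)


def next_block_policy(history: List[int], local: Set[int]) -> Optional[int]:
    """Tiny example policy: prefetch the block after the most recent access."""
    return history[-1] + 1 if history else None
```

For the five-event workload `[17, None, None, 18, None]` with C = 2, this sketch gives a slowdown of 9/5 = 1.8 without prefetching and 7/5 = 1.4 with `next_block_policy`. The remaining gap to 1 is the first access, which no policy could have predicted.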
Simple prefetch algorithms
Delta rule
• Whenever Bj is accessed, put Bj+1 in the queue
• Whenever the network is idle, prefetch in LIFO order
• Very effective rule, simple to implement
No prefetch
• Can be shown to be 2-competitive!
Order by frequency
• At training time, order blocks by their frequency
OPT: hypothetical optimal offline algorithm
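The Delta rule needs little more than a stack of candidates. The sketch below is a minimal illustration. The class and method names are hypothetical, and skipping candidates that are already local is an added assumption.

```python
class DeltaRulePrefetcher:
    """Queue B(j+1) on every access to B(j); hand candidates out newest first."""

    def __init__(self):
        self.pending = []  # used as a stack, so the latest candidate is on top

    def on_access(self, block: int) -> None:
        """Called whenever the workload accesses a block."""
        self.pending.append(block + 1)

    def next_prefetch(self, local: set):
        """Called when the network is idle; returns a block id or None."""
        while self.pending:
            candidate = self.pending.pop()  # LIFO: most recent access wins
            if candidate not in local:      # assumption: skip blocks already fetched
                return candidate
        return None
```

LIFO order favours the process that accessed a block most recently, which fits workloads made of parallel processes that each have some locality of reference.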
Frequent pattern mining based algorithms
CMiner (Li et al., FAST 2004)
• Identifies recurring block sub-sequences at training time
• Problematic runtime and space complexity in our settings
Example rule: A, E, L → Z
[Diagram labels: Hot item, B-tree]
Novel variants of CMiner
CMiner(Δ)
• Identifies generic frequent delta rules, e.g. for all j: Bj, Bj+1, Bj+2 → fetch Bj+3
• Efficient runtime and space complexity
CMiner-OBF
• A two-level variant of CMiner
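The slides do not describe how the delta-rule variant of CMiner works internally. As a rough illustration of what a "generic frequent delta rule" could mean, the sketch below counts which difference between block ids tends to follow a short run of previous differences in a training trace. This is an assumed simplification, not the authors' algorithm, and `context_len` and `min_support` are hypothetical parameters.

```python
from collections import Counter


def mine_delta_rules(trace, context_len=2, min_support=3):
    """Map each frequent tuple of consecutive deltas to its most common next delta."""
    deltas = [b - a for a, b in zip(trace, trace[1:])]
    counts = Counter()
    for i in range(len(deltas) - context_len):
        context = tuple(deltas[i:i + context_len])
        counts[(context, deltas[i + context_len])] += 1
    best = {}  # context -> (next_delta, support)
    for (context, next_delta), support in counts.items():
        if support >= min_support and support > best.get(context, (None, 0))[1]:
            best[context] = (next_delta, support)
    return {context: next_delta for context, (next_delta, _) in best.items()}


def predict_next(rules, recent_blocks, context_len=2):
    """Apply a mined rule to the most recent accesses; None if no rule matches."""
    if len(recent_blocks) < context_len + 1:
        return None
    tail = recent_blocks[-(context_len + 1):]
    context = tuple(b - a for a, b in zip(tail, tail[1:]))
    delta = rules.get(context)
    return None if delta is None else recent_blocks[-1] + delta
```

Because the rules are expressed in deltas rather than absolute block ids, a rule learned on one region of the disk applies to any other region, which is one plausible reading of "generic". A trace with repeated sequential runs yields the rule `(1, 1) → 1`, matching the Bj, Bj+1, Bj+2 → Bj+3 example.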
Simulations
Setup
• Used traces of OLTP financial transactions and of an SQL stress tool
• Simulated the system under various parameters and measured slowdown at various time points
Simulations (cont.)
• Simple delta rules were hard to beat
• CMiner(Δ) often improves upon them, but not always
• Some schemes are harmful
Summary and open issues
Automatic read-ahead determination
• Highly effective
• May be applicable elsewhere
• Calls for more generalized cost models
Block prediction and prefetch
• Simple delta rules seem hard to beat
• Potential for improvement
• Novel frequent pattern mining based algorithms; might be interesting in other contexts (e.g. caching)