Transcript Slide 1
Tivoli Software
Using Machine Learning Techniques to Enhance the Performance of an Automatic Backup and Recovery System
Amir Ronen, Dan Pelleg, Machine Learning Group, HRL
Eran Raichstein (IBM Software Group)
© 2010 IBM Corporation

Slide 2
Motivation
IBM’s FastBack
- Automatic backup and recovery system
- Incremental backup of disk volumes to a repository
- Instant Restore (IR): allows applications to start working immediately after recovery
- Xpress mount: allows access to backed-up data without recovering it (e.g. for taking tape dumps)
Goal
- Accelerate IR and mount via machine learning and algorithmic techniques
- Minimum intervention in FastBack’s internals
- Benefits: minimize bugs, easy upgrading, generality, …

Slide 3
Outline
- The FastBack system
- Algorithm for automatic determination of read-ahead
  – Basic observations
  – The algorithm
  – Experiments in the FastBack system
- Prefetching
  – Theoretical model and observation
  – Basic prefetching algorithms
  – Frequent pattern based algorithms
  – Controlling and combining prefetch algorithms
- Summary

Slide 4
FastBack’s Instant Restore and Mount
Instant Restore allows users to start using applications on the same disk to which the volume is being restored, while the restore operation is still in progress. From an architectural perspective, mount is somewhat similar.
1. Activate Instant Restore
2. Read IOs from un-recovered areas trigger a block fetch from the repository
3. All other reads are performed as usual
[Diagram: Production server with Typical Production Disk; Xpress Restore Server; New Production server with New Production Disk; repository]

Slide 5
CNF: An Algorithm for Readahead Amount Determination

Slide 6
The problem
A block is needed from the repository.
[Diagram: Xpress Restore Server, New Production server]
Suppose that we are allowed to bring additional subsequent blocks. How many to bring?
- too many may slow down the system (in particular if they will not be used)
- too few will cause high total latency

Slide 7
Simple cost model: T ~ T1 + n·T2 + ε
- T1: “fixed” latency
- T2: time to bring one block
- n: number of blocks
- ε: noise (assumed zero)
Key idea
- Suppose that we choose n such that T1 = n·T2
- The cost never more than doubles (the request then costs T1 + n·T2 = 2·T1, and T1 must be paid in any case)
- In many settings n can be large
- The algorithm is 2-competitive

Slide 8
Problem 1
- The latency T1 and the block cost T2 are not known
- They may vary over time
Solution
- Hold a window of the last k requests (e.g. 200)
- Use linear regression to estimate T1 and T2
- The update can be done in O(1)
[Plot: example regression fit; latency ~ 6.5, block cost ~ 3]

Slide 9
Problem 2
What if the n-values are so similar that we cannot estimate?
Sampling ideas
- We only need a few samples
- If mean(n) is large, we sample small values
- If mean(n) is small, we sample 2*mean(n)
- Low amortized cost

Slide 10
The Algorithm
- Hold a window of the last k requests
- At each step update the linear regression (refresh from time to time)
- If regression is possible:
  – Estimate T1, T2
  – Compute the desired n value
  – If the system asked for less, recommend readahead
- Otherwise:
  – Sample as described
Additional heuristics
- Unreasonable values, smoothing, mis-estimation, …

Slide 11
Impact on FastBack
- Added latency per request
- Outperformed the predetermined values
- Speedup of up to 4x when mounting continuous and fragmented data

Slide 12
Comments & open issues
- The algorithm may be applicable elsewhere
- Extensions to more complicated cost models
- Analyzing executions of parallel copies of the algorithm

Slide 13
Block Prediction and Prefetching for Enhancing Instant Restore

Slide 14
Motivation
IR needs to fetch blocks from the repository according to its workload.
Ideally, blocks will be predicted and brought before they are needed.
Comments
- The network is not preemptive, so prefetching can also be harmful
- Typical workloads are parallel processes, each with some locality of reference

Slide 15
A model for the prefetch problem
Workload is an unknown sequence of events L1, …, Ln. Each Lj is either:
- An access to a block Bj
- A process event
The system is composed of a CPU and a network that can be run in parallel. At each step j the system can do one of the following:
1. Process (Lj is a process event, cost = 1 unit)
2. Access its local memory (if Lj is an access event and Bj is already in the local memory, cost = 1 unit)
3. Fetch a block from the repository (this occupies the network for C time units and can be done in parallel with 1 or 2)

Slide 16
A model for the prefetch problem (cont.)
Slowdown
Let L1, …, Ln be a workload. The slowdown of the system on L is the ratio between the total system time and the time to perform the workload locally, i.e. Tsys / n.
[Figure: example timeline of workload, CPU, and network rows for accesses to B17 and B18 with process events in between; C = 2, delta rule; network row shows Fetch 17 and Fetch 18]
Slowdown is ~1. Without prefetching, slowdown is around 2.

Slide 17
Simple prefetch algorithms
Delta rule
- Whenever Bj is accessed, put Bj+1 in the queue
- Whenever the network is idle, prefetch in LIFO order
- Very effective rule, simple to implement
No prefetch
- Can be shown to be 2-competitive!
Order by frequency
- At training time, order blocks by their frequency
OPT
- Hypothetical optimal offline algorithm

Slide 18
Frequent pattern mining based algorithms
CMiner (Li et al.
FAST 2004)
- Identifies recurring block sub-sequences at training time
- Problematic runtime and space complexity in our settings
[Figure labels: A,E,L; Z; Hot item; B-tree]

Slide 19
Novel variants of CMiner
CMiner()
- Identifies generic frequent delta rules
- [Rule example, garbled in extraction: fetch j B j B j 1 B j 2 B j 3]
- Efficient runtime and space complexity
CMiner-OBF
- A two-level variant of CMiner

Slide 20
Simulations
Setup
- Used traces from OLTP financial transactions and from an SQL stress tool
- Simulated the system under various parameters and measured slowdown at various points in time

Slide 21
Simulations (cont.)
- Simple delta rules were hard to beat
- CMiner() often improves upon them, but not always
- Some schemes are harmful

Slide 22
Summary and open issues
Automatic read-ahead determination
- Highly effective
- May be applicable elsewhere
- Calls for more generalized cost models
Block prediction and prefetch
- Simple delta rules seem hard to beat
- Potential for improvement
- Novel frequent pattern mining based algorithms; might be interesting in other contexts (e.g. caching)
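
The read-ahead slides describe a cost model T ~ T1 + n·T2, a sliding window of the last k requests, a linear regression that can be updated in O(1), and the choice of n such that T1 = n·T2. The following Python sketch shows one way these pieces could fit together. It is an illustration under stated assumptions, not the FastBack implementation: the class name `ReadaheadEstimator`, its methods, and the fallback of returning the requested amount when regression is impossible are all hypothetical (the slides call for sampling in that case, and mention additional heuristics that are omitted here).

```python
from collections import deque


class ReadaheadEstimator:
    """Sliding-window least-squares fit of T = T1 + n*T2 (hypothetical helper)."""

    def __init__(self, window=200):
        self.window = window          # k, the number of requests kept
        self.samples = deque()        # (n, t) pairs currently in the window
        # Running sums make each regression update O(1).
        self.sn = self.st = self.snn = self.snt = 0.0

    def _update(self, n, t, sign):
        self.sn += sign * n
        self.st += sign * t
        self.snn += sign * n * n
        self.snt += sign * n * t

    def add(self, n, t):
        """Record a request that fetched n blocks and took t time units."""
        self.samples.append((n, t))
        self._update(n, t, +1)
        if len(self.samples) > self.window:
            old_n, old_t = self.samples.popleft()
            self._update(old_n, old_t, -1)
        # Assumption: a real system would recompute the sums from the
        # window occasionally to limit floating-point drift.

    def estimate(self):
        """Return (T1, T2), or None if the n-values are too similar to fit."""
        k = len(self.samples)
        denom = k * self.snn - self.sn ** 2
        if k < 2 or abs(denom) < 1e-9:
            return None
        t2 = (k * self.snt - self.sn * self.st) / denom
        t1 = (self.st - t2 * self.sn) / k
        return t1, t2

    def recommend(self, requested):
        """Recommend a block count: at least `requested`, ideally T1 / T2."""
        est = self.estimate()
        if est is None:
            return requested          # placeholder for the sampling step
        t1, t2 = est
        if t1 <= 0 or t2 <= 0:
            return requested          # ignore unreasonable estimates
        return max(requested, max(1, round(t1 / t2)))


if __name__ == "__main__":
    est = ReadaheadEstimator(window=200)
    for n in range(1, 11):            # noise-free synthetic timings
        est.add(n, 6.5 + 3.0 * n)
    print(est.estimate(), est.recommend(1))
```

The synthetic timings in the usage example are made up for illustration; they only borrow the latency and block-cost magnitudes shown on the regression slide.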
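
The prefetch model and the delta rule from the prefetching slides can be explored with a small simulation. The sketch below is a simplified reading of that model and makes several assumptions that are not in the slides: time advances in unit steps, only one fetch is on the network at a time and it cannot be interrupted, the CPU stalls until a missing block arrives, and a demand fetch is issued whenever the needed block is absent and the network is idle. The function name `simulate` and its parameters are hypothetical.

```python
def simulate(workload, fetch_cost, use_delta_rule):
    """Return the slowdown (total time / number of events) for a workload.

    workload: list whose items are "P" (process event) or an int block id.
    fetch_cost: C, the time units a fetch occupies the network.
    """
    local = set()        # blocks already in local memory
    stack = []           # prefetch candidates, served in LIFO order
    in_flight = None     # block currently on the network, if any
    done_at = 0          # time at which the in-flight fetch completes
    t = 0
    i = 0
    while i < len(workload):
        # A fetch that has finished makes its block available locally.
        if in_flight is not None and done_at <= t:
            local.add(in_flight)
            in_flight = None

        event = workload[i]
        needs_fetch = event != "P" and event not in local

        # The network is non-preemptive: start new work only when idle.
        if in_flight is None:
            if needs_fetch:
                in_flight, done_at = event, t + fetch_cost
            else:
                while stack:
                    candidate = stack.pop()
                    if candidate not in local:
                        in_flight, done_at = candidate, t + fetch_cost
                        break

        if needs_fetch:
            t += 1       # CPU stalls for one time unit
            continue

        # CPU performs the event (process or local access) in one unit.
        if event != "P" and use_delta_rule:
            stack.append(event + 1)   # delta rule: queue B(j+1)
        t += 1
        i += 1
    return t / len(workload)


if __name__ == "__main__":
    # Sequential accesses, each followed by two process events (made up).
    workload = []
    for block in range(17, 117):
        workload += [block, "P", "P"]
    print("delta rule :", simulate(workload, 2, True))
    print("no prefetch:", simulate(workload, 2, False))
```

On this made-up sequential workload with C = 2, the delta rule keeps the slowdown close to 1, while running without prefetching gives roughly 1.67 under these assumptions. The exact numbers depend on the ratio of process events to accesses, so they are not expected to match the figure on the slides.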