Transcript Document
Split Snapshots and Skippy Indexing: Long Live the Past!
Ross Shaull <[email protected]>
Liuba Shrira <[email protected]>
Brandeis University

Our Idea of a Snapshot
• A window to the past in a storage system
• Access data as it was at the time the snapshot was requested
• System-wide
• Snapshots may be kept forever
  – I.e., “long-lived” snapshots
• Snapshots are consistent
  – Whatever that means…
• High frequency (up to continuous data protection, CDP)

Why Take Snapshots?
• Fix operator errors
• Auditing
  – When did Bob’s salary change, and who made the changes?
• Analysis
  – How much capital was tied up in blue shirts at the beginning of this fiscal year?
• We don’t necessarily know now what will be interesting in the future

BITE
• Give the storage system a new capability: Back-in-Time Execution (BITE)
• Run read-only code against the current state and any snapshot
• After issuing a request for BITE, no special code is required for accessing data in the snapshot

Other Approaches: Databases
• ImmortalDB, Time-Split B-tree (Lomet)
  – Reorganizes the current state
  – Complex
• Snapshot isolation (PostgreSQL, Oracle)
  – Extension to transactions
  – Only for the recent past
• Oracle FlashBack
  – Page-level copy of the recent past (not forever)
  – Interface seems similar to BITE

Other Approaches: FS
• WAFL (Hitz), ext3cow (Peterson)
  – Limited on-disk locality
  – Application-level consistency is a challenge
• VSS (Sankaran)
  – Blocks disk requests
  – Suitable for backup-type frequency

A Different Approach
• Goals:
  – Avoid declustering the current state
  – Don’t change how the current state is accessed
  – Application requests snapshots
  – Snapshots are “on-line” (not in a warehouse)
• Split Snapshots
  – Copy the past out incrementally
  – Snapshots available through a virtualized buffer manager

Our Storage System Model
• A “database”
  – Has transactions
  – Has a recovery log
  – Organizes data in pages on disk

Our Storage System Model
[Diagram: an application asks the access methods for record R (and can request a snapshot “now”); the access methods find the root and search for R through the cache; the page table (P1 → address X, P2 → address Y, …) locates pages P1…Pn on disk, and R is returned.]

Our Consistency Model
• Crash consistency
  – Imagine that a snapshot is declared, but then, before any modifications can be made, the system crashes
  – After restart, recovery kicks in and the current state is restored to *some* consistent point
  – All snapshots will have this same consistency guarantee after a crash

Retaining the Past Versus Copy-on-Write (COW)
[Diagram: with conventional COW snapshots, declaring snapshot “S” turns the old page table into the snapshot page table; after the operations “Snapshot S” and “Modify P1”, it is expensive to keep both page tables up to date.]

Split-COW
[Diagram: the current page table keeps pointing at the current pages P1 and P2, while snapshot page tables SPT(S) and SPT(S+1) point at pre-state copies of pages that were copied out before being overwritten. A minimal code sketch of this step follows the next slide.]

What’s next
1. How to manage the metadata?
2. How will snapshot pages be accessed?
3. Can we be non-disruptive?
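Before turning to those three questions, here is a minimal Python sketch of the Split-COW step just described. It is not the authors’ code; the names (SplitCOWStore, snapstore, integer snapshot ids) and the byte-string pages are invented for illustration. It shows the key property: the current state is updated in place, while each page’s pre-state is copied out once per snapshot, on first modification, and recorded in that snapshot’s SPT.

    class SplitCOWStore:
        def __init__(self, num_pages):
            self.pages = [b""] * num_pages   # current state, updated in place
            self.snapstore = []              # append-only copies of pre-state pages
            self.spts = {}                   # snapshot id -> {page id: snapstore index}
            self.current_snap = None         # most recently declared snapshot

        def declare_snapshot(self, snap_id):
            """Declare a snapshot; nothing is copied until pages are modified."""
            self.spts[snap_id] = {}
            self.current_snap = snap_id

        def write_page(self, page_id, data):
            """Copy the pre-state on first modification after a snapshot, then update in place."""
            if self.current_snap is not None:
                spt = self.spts[self.current_snap]
                if page_id not in spt:                        # first write since the snapshot
                    self.snapstore.append(self.pages[page_id])
                    spt[page_id] = len(self.snapstore) - 1
            self.pages[page_id] = data

        def read_page(self, page_id, snap_id=None):
            """Read as of a snapshot: the first pre-state copied at or after it, else current."""
            if snap_id is not None:
                for s in sorted(s for s in self.spts if s >= snap_id):
                    if page_id in self.spts[s]:
                        return self.snapstore[self.spts[s][page_id]]
            return self.pages[page_id]

    store = SplitCOWStore(num_pages=4)
    store.write_page(0, b"v1")
    store.declare_snapshot(1)
    store.write_page(0, b"v2")                 # pre-state b"v1" is copied for snapshot 1
    assert store.read_page(0, snap_id=1) == b"v1"
    assert store.read_page(0) == b"v2"

Note that read_page scans forward over snapshot ids: a page left unmodified between snapshot S and some later snapshot is represented by the later snapshot’s copy. The same observation motivates the maplog and Skippy structures that come next.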
Metadata Solution
• Metadata (page tables) is created incrementally
• Keeping many SPTs is costly
• Instead, write “mappings” into a log
• Materialize an SPT on demand
[Diagram: snapshots Snap 1 through Snap 6 over a log of page mappings, with a Start pointer per snapshot.]

Maplog
• Mappings are created incrementally
• Added to an append-only log
• Start points to the first mapping created after a snapshot is declared
[Diagram: the maplog as an append-only sequence of mappings for pages P1, P2, P3, with Start pointers for Snap 1 through Snap 6.]

Maplog
• Materialize an SPT with a scan
• The scan for SPT(S) begins at Start(S)
• Notice that we read some mappings that we do not need

Cost of Scanning Maplog
• Let the overwrite cycle length L be the number of page updates required to overwrite the entire database
• A maplog scan cannot be longer than an overwrite cycle
• Let N be the number of pages in the database
• For a uniformly random update workload, L ≈ N ln N (by the “coupon collector’s waiting time” problem); for example, with N = 10,000 pages, L ≈ 10,000 × ln 10,000 ≈ 92,000 updates
• Skew in the update workload lengthens the overwrite cycle
• A skew of 80/20 (80% of updates go to 20% of the pages) increases L by a factor of 4
• Skew hurts

Skippy Level 1
• Copy the first-encountered mapping (FEM) within each node to the next level
• Cuts the redundant mapping count in half
[Diagram: maplog nodes for Snap 1 through Snap 6, with each node’s first-encountered mappings copied up into a Skippy Level 1 node and pointers linking the levels.]

Skippy
[Diagram: a scan for an old snapshot starts at its Start pointer, finishes its maplog node, and then continues over the Skippy Level 1 copies, skipping redundant mappings in the maplog.]

K-Level Skippy
• Can eliminate the effect of skew, or more
• Enables ad-hoc, on-line access to snapshots, whether they are old or young

  Skew    # Skippy Levels   Time to Materialize SPT (s)
  50/50   0                 13.8
  80/20   0                 19.0
  80/20   1                 15.8
  80/20   2                 14.7
  80/20   3                 13.9
  99/1    0                 33.3
  99/1    1                  6.69

Accessing Snapshots
• Transparent to layers above the cache
• An indirection layer redirects page requests from a BITE transaction into the snapstore
[Diagram: the page cache serves current-state reads from the current pages and BITE reads from the snapshot copies of P1 and P2.]

Non-Disruptiveness
• Can we create Skippy and COW pre-states without disrupting the current state?
• Key idea:
  – Leverage recovery to defer all snapshot-related writes
  – Write snapshot data in the background to a secondary disk

Implementation
• BDB 4.6.21
• Page cache augmented
  – COWs write-locked pages
  – Trickles COW’d pages out over time
• Leverage recovery
  – Metadata is created in memory at transaction commit time, but only written at checkpoint time
  – After a crash, snapshot pages and metadata can be recovered in one log pass
• Costs
  – A snapshot log record
  – Extra memory
  – Longer checkpoints

Early Disruptiveness Results
• Single-threaded updating workload of 100,000 transactions
• 66 MB database
• We can retain a snapshot after every transaction for a 6–8% penalty to writers
• Tests with readers show little impact on sequential scans (not depicted)
[Bar chart: running time (s) at skews 50/50, 80/20, and 99/1 for No Snapshots, Snapshots Every Other Transaction, and Snapshots Every Transaction; the bars range from roughly 472 s to 674 s.]

Paper Trail
• Upcoming poster and short paper at ICDE08
• “Skippy: A New Snapshot Indexing Method for Time Travel in the Storage Manager” to appear in SIGMOD08
• Poster and workshop talks
  – NEDBDay08, SYSTOR08

Questions?

Backups…

Recovery Sketch 1
• Snapshots are crash consistent
• Must recover data and metadata for all snapshots since the last checkpoint
• Pages might have been trickled, so the snapstore must be truncated back to the last mapping before the previous checkpoint
• We require only that a snapshot log record be forced into the log with a group commit; no other data or metadata must be logged until checkpoint
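Returning for a moment to the indexing part of the talk, below is a small illustrative Python sketch of materializing an SPT by scanning the maplog, and of how a one-level Skippy shortens that scan. It is not the SIGMOD08 implementation; the node size, function names, and list-based maplog of (page id, snapstore address) pairs are invented for the example.

    NODE_SIZE = 4  # mappings per maplog node (illustrative only)

    def materialize_spt(maplog, start):
        """Scan the maplog from Start(S), keeping the first-encountered mapping for each page."""
        spt = {}
        for page_id, addr in maplog[start:]:
            if page_id not in spt:            # later mappings for the same page are redundant
                spt[page_id] = addr
        return spt

    def build_skippy_level(maplog):
        """For each maplog node, collect its first-encountered mappings (FEMs)."""
        level = []
        for node_start in range(0, len(maplog), NODE_SIZE):
            fems, seen = [], set()
            for page_id, addr in maplog[node_start:node_start + NODE_SIZE]:
                if page_id not in seen:
                    seen.add(page_id)
                    fems.append((page_id, addr))
            level.append(fems)
        return level

    def materialize_spt_skippy(maplog, level1, start):
        """Finish the Start(S) node in the maplog, then scan the FEM copies of the later nodes."""
        spt = {}
        node = start // NODE_SIZE
        for page_id, addr in maplog[start:(node + 1) * NODE_SIZE]:
            if page_id not in spt:
                spt[page_id] = addr
        for fems in level1[node + 1:]:        # fewer mappings than the raw maplog when pages repeat
            for page_id, addr in fems:
                if page_id not in spt:
                    spt[page_id] = addr
        return spt

    # Both scans produce the same SPT; the Skippy scan simply reads fewer redundant mappings.
    maplog = [(1, "a0"), (2, "b0"), (1, "a1"), (1, "a2"),
              (2, "b1"), (1, "a3"), (3, "c0"), (1, "a4")]
    level1 = build_skippy_level(maplog)
    assert materialize_spt(maplog, 2) == materialize_spt_skippy(maplog, level1, 2)

Stacking further levels, each built from the first-encountered mappings of the level below, keeps cutting the redundant mappings a scan must read, which is how a K-level Skippy compensates for update skew.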
Recovery Sketch 2
• Walk backward through the WAL, applying UNDOs
• When a snapshot record is encountered, copy the “dirty” pages and create a mapping
• The trouble is that snapshots can be concurrent with transactions
• Cope with this by “COWing” a page when an UNDO for a different transaction is applied to that page

The Future
• Sometimes we want to scrub the past
  – Running out of space?
  – Retention windows for SOX compliance
• Change the past-state representation
  – Deduplication
  – Compression
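As a closing illustration of Recovery Sketch 2, here is a deliberately simplified Python skeleton of the backward log pass. It assumes each WAL update record carries a before-image and that declaring a snapshot leaves a snapshot record in the log; the record fields are invented, and the concurrent-transaction case that the slide handles by COWing pages is only noted in a comment, not implemented.

    from collections import namedtuple

    # Hypothetical WAL records: updates carry a before-image for UNDO.
    Update = namedtuple("Update", "kind txn page_id before_image")   # kind == "update"
    SnapRec = namedtuple("SnapRec", "kind snap_id")                  # kind == "snapshot"

    def rebuild_snapshots(wal_records):
        """Walk the WAL backward applying UNDOs; at each snapshot record, the pages
        undone so far are that snapshot's pre-state copies."""
        undone = {}      # page_id -> page image as of the oldest update seen so far
        snapshots = {}   # snap_id -> {page_id: page image at snapshot time}
        for rec in reversed(wal_records):
            if rec.kind == "update":
                # Applying the UNDO: just before this update, the page held before_image.
                # A fuller sketch would COW the page here when rec.txn is a transaction
                # concurrent with an already-encountered snapshot (see the slide above).
                undone[rec.page_id] = rec.before_image
            elif rec.kind == "snapshot":
                # Pages never updated after this point need no mapping; they can be
                # read from the current database.
                snapshots[rec.snap_id] = dict(undone)
        return snapshots

    wal = [Update("update", "T1", 7, b"old7"),
           SnapRec("snapshot", 42),
           Update("update", "T2", 7, b"mid7"),
           Update("update", "T2", 9, b"old9")]
    # Both pages were modified after snapshot 42 was declared, so both get pre-state
    # copies holding their values at snapshot time.
    assert rebuild_snapshots(wal)[42] == {7: b"mid7", 9: b"old9"}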