Transcript Document

Split Snapshots and Skippy Indexing:
Long Live the Past!
Ross Shaull <[email protected]>
Liuba Shrira <[email protected]>
Brandeis University
Our Idea of a Snapshot
• A window to the past in a storage system
• Access data as it was at time snapshot was
requested
• System-wide
• Snapshots may be kept forever
– I.e., “long-lived” snapshots
• Snapshots are consistent
– Whatever that means…
• High frequency (up to CDP)
Why Take Snapshots?
• Fix operator errors
• Auditing
– When did Bob’s salary change, and who made the
changes?
• Analysis
– How much capital was tied up in blue shirts at the
beginning of this fiscal year?
• We don’t necessarily know now what will be
interesting in the future
BITE
• Give the storage system a new
capability: Back-in-Time Execution
• Run read-only code against current
state and any snapshot
• After issuing request for BITE, no
special code required for accessing
data in the snapshot
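The following is a toy, in-memory sketch of what BITE looks like from the application's point of view. The names (ToyStore, request_snapshot, bite) are illustrative, not the system's actual interface, and the real system copies page pre-states out incrementally inside a BDB-based store rather than copying whole dictionaries.

```python
# A toy illustration of the BITE idea, not the actual interface.

class ToyStore:
    def __init__(self):
        self.current = {}       # current state: key -> value
        self.snapshots = []     # frozen past states, one per snapshot

    def put(self, key, value):
        self.current[key] = value

    def get(self, key):
        return self.current.get(key)

    def request_snapshot(self):
        self.snapshots.append(dict(self.current))   # toy: full copy
        return len(self.snapshots) - 1               # snapshot id

    def bite(self, snap_id):
        # Read-only view of a past state with the same get() interface,
        # so unmodified reader code runs against the snapshot.
        view = ToyStore()
        view.current = self.snapshots[snap_id]
        return view

def salary_report(view):            # ordinary read-only code
    return view.get("bob.salary")

db = ToyStore()
db.put("bob.salary", 70000)
s = db.request_snapshot()
db.put("bob.salary", 85000)
print(salary_report(db))            # 85000: current state
print(salary_report(db.bite(s)))    # 70000: back-in-time execution
```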
Other Approaches: Databases
• ImmortalDB, Time-Split BTree (Lomet)
– Reorganizes current state
– Complex
• Snapshot isolation (PostgreSQL, Oracle)
– Extension to transactions
– Only for recent past
• Oracle FlashBack
– Page-level copy of recent past (not forever)
– Interface seems similar to BITE
Other Approaches: FS
• WAFL (Hitz), ext3cow (Peterson)
– Limited on-disk locality
– Application-level consistency a challenge
• VSS (Sankaran)
– Blocks disk requests
– Suitable for backup-type frequency
A Different Approach
• Goals:
– Avoid declustering current state
– Don't change how current state is accessed
– Application requests snapshot
– Snapshots are "on-line" (not in warehouse)
• Split Snapshots
– Copy past out incrementally
– Snapshots available through virtualized buffer
manager
Our Storage System Model
• A “database”
– Has transactions
– Has recovery log
– Organizes data in pages on disk
Our Consistency Model
• Crash consistency
– Imagine that a snapshot is declared, but
then before any modifications can be
made, the system crashes
– After restart, recovery kicks in and the
current state is restored to *some*
consistent point
– All snapshots will have this same
consistency guarantee after a crash
Our Storage System Model
[Figure: the current-state read path. The application asks for record R ("I want record R in snapshot 'Now'"); the access methods find the table and its root and search for R; the page table (P1 → Address X, P2 → Address Y, …) and the cache locate pages P1 … Pn on disk; R is returned to the application.]
Retaining the Past
Versus
Copy-on-Write (COW)
[Figure: classic copy-on-write. Operations: declare Snapshot "S", then modify P1. The old page table became the Snapshot "S" page table and a new current page table references the new P1; P2 is referenced by both tables, so it is expensive to update P2 in both page tables.]
Split-COW
[Figure: Split-COW. The current page table continues to reference P1 and P2 in place; the copied pre-states of P1 and P2 live in the snapshot store and are referenced by the snapshot page tables SPT(S) and SPT(S+1).]
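A minimal sketch of the split-COW write path under toy assumptions (whole-page values in dictionaries, no transactions, no recovery); the names SplitCowStore, maplog, and snapstore are illustrative.

```python
# Toy split-COW: the current page table is updated in place (no
# declustering), while the overwritten pre-state is copied out to a
# separate snapstore and recorded in an append-only maplog.

class SplitCowStore:
    def __init__(self, pages):
        self.pages = dict(pages)   # current state: page_no -> contents
        self.snapstore = []        # copies of pre-state pages
        self.maplog = []           # append-only (page_no, snapstore_pos)
        self.starts = []           # Start(S): maplog index per snapshot
        self.copied_for = {}       # page_no -> latest snapshot it was copied for

    def declare_snapshot(self):
        self.starts.append(len(self.maplog))   # first mapping after this snapshot
        return len(self.starts) - 1            # snapshot id

    def write_page(self, page_no, contents):
        latest = len(self.starts) - 1
        # Copy the pre-state only on the first overwrite after the most
        # recent snapshot declaration; scans for earlier snapshots that
        # still need this page pick up the same mapping.
        if latest >= 0 and self.copied_for.get(page_no) != latest:
            self.snapstore.append(self.pages[page_no])
            self.maplog.append((page_no, len(self.snapstore) - 1))
            self.copied_for[page_no] = latest
        self.pages[page_no] = contents   # current state updated in place

store = SplitCowStore({1: "P1 v0", 2: "P2 v0"})
s = store.declare_snapshot()
store.write_page(1, "P1 v1")   # P1's pre-state is copied out for snapshot s
store.write_page(1, "P1 v2")   # no copy: snapshot s already has P1's pre-state
```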
What’s next
1. How to manage the metadata?
2. How will snapshot pages be accessed?
3. Can we be non-disruptive?
Metadata Solution
• Metadata (page tables) created
incrementally
• Keeping many SPTs costly
• Instead, write “mappings” into log
• Materialize SPT on-demand
[Figure: the Maplog, with Start pointers for Snap 1 through Snap 6 into an append-only sequence of page mappings (P1, P2, …).]
Maplog
• Mappings created incrementally
• Added to append-only log
• Start points to first mapping created
after a snapshot is declared
[Figure: the same Maplog; Start for each of Snap 1 through Snap 6 points to the first mapping (P1, P2, P3, …) written after that snapshot was declared.]
Maplog
• Materialize SPT with scan
• Scan for SPT(S) begins at Start(S)
• Notice that we read some mappings
that we do not need
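A sketch of on-demand SPT materialization over the toy maplog and Start pointers from the SplitCowStore sketch above; names are illustrative.

```python
# Toy SPT materialization: scan the maplog forward from Start(S) and keep
# the first mapping encountered for each page; the redundant mappings
# read along the way are the ones Skippy prunes.

def materialize_spt(maplog, start_index, db_pages):
    """maplog: list of (page_no, snapstore_pos) in append order.
    start_index: Start(S), index of the first mapping written after snapshot S.
    db_pages: number of pages in the database."""
    spt = {}
    for page_no, pos in maplog[start_index:]:
        if page_no not in spt:            # first-encountered mapping wins
            spt[page_no] = pos
        if len(spt) == db_pages:          # every page mapped: stop early
            break
    # Pages absent from spt were never overwritten after S, so snapshot
    # reads for them simply go to the current state.
    return spt

print(materialize_spt([(1, 0), (1, 1), (2, 2), (1, 3)], start_index=1, db_pages=3))
# {1: 1, 2: 2}  -- page 3 unchanged since S, read from the current state
# Usage with the SplitCowStore sketch above:
#   spt = materialize_spt(store.maplog, store.starts[s], len(store.pages))
```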
Cost of Scanning Maplog
• Let overwrite cycle length L be the number of page updates required to overwrite the entire database
• A Maplog scan cannot be longer than an overwrite cycle
• Let N be the number of pages in the database
• For a uniformly random workload, L ≈ N ln N (by the "coupon collector's waiting time" problem)
• Skew in the update workload lengthens the overwrite cycle
• A skew of 80/20 (80% of updates to 20% of pages) increases L by a factor of 4
Skew hurts
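A quick simulation of the overwrite-cycle argument, purely for illustration: with uniformly random updates the measured cycle length tracks N ln N, and an 80/20 skew stretches it by roughly the factor of 4 quoted above.

```python
# Simulation of the overwrite cycle (illustration only). Uniform updates
# give L ~ N ln N; with 80/20 skew the rarely-updated 80% of pages take
# much longer to overwrite, stretching L by roughly a factor of 4.

import math
import random

def overwrite_cycle_length(n_pages, hot_fraction=1.0, hot_weight=1.0):
    """Updates needed until every page has been overwritten at least once.
    hot_fraction/hot_weight model skew, e.g. 0.2 and 0.8 for '80/20'."""
    hot_pages = int(n_pages * hot_fraction)
    seen, updates = set(), 0
    while len(seen) < n_pages:
        if random.random() < hot_weight:
            page = random.randrange(hot_pages)            # hot set
        else:
            page = random.randrange(hot_pages, n_pages)   # cold set
        seen.add(page)
        updates += 1
    return updates

N = 10_000
uniform = overwrite_cycle_length(N)
skewed = overwrite_cycle_length(N, hot_fraction=0.2, hot_weight=0.8)
print(f"N ln N = {N * math.log(N):.0f}")
print(f"uniform L = {uniform}, 80/20 L = {skewed}, ratio = {skewed / uniform:.1f}")
```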
Skippy
Skippy Level 1
• Copy first-encountered mapping
(FEM) within node to next level
[Figure: Skippy Level 1 over the Maplog. Each Maplog node contributes copies of its first-encountered mappings, plus a pointer, to the level above; in this example the redundant mapping count is cut in half.]
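A sketch of building one Skippy level over the toy maplog used above: the log is cut into fixed-size nodes and only the first-encountered mapping for each page within a node is copied up. The node size is an arbitrary choice here, and the inter-level pointers a real scan follows back down into the Maplog are omitted.

```python
# Toy construction of one Skippy level: within each node of the level
# below, copy only the first-encountered mapping (FEM) per page upward.

def build_skippy_level(level, node_size):
    """level: list of (page_no, snapstore_pos) mappings in append order."""
    next_level = []
    for node_start in range(0, len(level), node_size):
        node = level[node_start:node_start + node_size]
        seen = set()
        for page_no, pos in node:
            if page_no not in seen:       # keep FEMs only
                seen.add(page_no)
                next_level.append((page_no, pos))
    return next_level

# A K-level Skippy repeats this, each level shorter than the one below:
maplog = [(1, 0), (2, 1), (2, 2), (1, 3), (1, 4), (3, 5), (1, 6), (2, 7)]
levels = [maplog]
for _ in range(2):
    levels.append(build_skippy_level(levels[-1], node_size=4))
print([len(lvl) for lvl in levels])   # [8, 5, 4]
```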
Skippy
[Figure: the complete structure, Skippy Level 1 laid out above the Maplog with a Start pointer; the materialization scan uses the Skippy level to skip over redundant Maplog mappings.]
K-Level Skippy
• Can eliminate effect of skew — or more
• Enables ad-hoc, on-line access to snapshots,
whether they are old or young
Skew    # Skippy Levels    Time to Materialize SPT (s)
50/50   0                  13.8
80/20   0                  19.0
80/20   1                  15.8
80/20   2                  14.7
80/20   3                  13.9
99/1    0                  33.3
99/1    1                  6.69
Accessing Snapshots
• Transparent to layers above cache
• Indirection layer to redirect page requests
from a BITE transaction into the snapstore
[Figure: the cache serves current-state reads as before, while reads from a BITE transaction are redirected through the indirection layer to the snapshot pre-states of P1 and P2 in the snapstore.]
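A sketch of the cache-level indirection, assuming the toy structures from the earlier sketches (a materialized SPT dictionary and a snapstore list); in the implementation the redirection happens inside BDB's page cache, not in Python.

```python
# Toy page-fetch indirection: current-state reads take the normal cache
# path, while reads issued by a BITE transaction are redirected to the
# snapstore whenever the materialized SPT has a mapping for the page.

def fetch_page(page_no, txn, cache, disk, snapstore):
    spt = txn.get("bite_spt")             # set only for BITE transactions
    if spt is not None and page_no in spt:
        return snapstore[spt[page_no]]    # page was overwritten: read its pre-state
    # Page unchanged since the snapshot, or a current-state transaction:
    # the current copy is the right one, served through the normal path.
    if page_no not in cache:
        cache[page_no] = disk[page_no]
    return cache[page_no]

# Hypothetical wiring with the sketches above:
#   spt = materialize_spt(store.maplog, store.starts[s], len(store.pages))
#   fetch_page(1, {"bite_spt": spt}, cache={}, disk=store.pages,
#              snapstore=store.snapstore)
```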
Non-Disruptiveness
• Can we create Skippy and COW pre-states without disrupting the current state?
• Key idea:
– Leverage recovery to defer all snapshot-related writes
– Write snapshot data in the background to a secondary disk
Implementation
• BDB 4.6.21
• Page cache augmented
– COWs write-locked pages
– Trickle COW’d pages out over time
• Leverage recovery
– Metadata created in-memory at transaction
commit time, but only written at checkpoint time
– After crash, snapshot pages and metadata can be
recovered in one log pass
• Costs
– Snapshot log record
– Extra memory
– Longer checkpoints
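A sketch of the background trickling described above, heavily simplified (no WAL, checkpoints, or real page latching); the queue-and-thread structure and file format are assumptions for illustration only.

```python
# Toy background trickler: pre-states COW'd in memory at write-lock time
# are queued and written to the snapshot disk by a background thread, so
# the foreground writer never waits on snapshot I/O.

import queue
import threading

trickle_q = queue.Queue()

def on_write_lock(page_no, pre_state):
    # Called when a writer latches a page: remember its pre-state in memory.
    trickle_q.put((page_no, pre_state))

def trickler(snapstore_path):
    # Background thread: drain queued pre-states to the secondary disk.
    with open(snapstore_path, "a") as snapstore:
        while True:
            page_no, pre_state = trickle_q.get()
            snapstore.write(f"{page_no}\t{pre_state}\n")
            snapstore.flush()
            trickle_q.task_done()

threading.Thread(target=trickler, args=("snapstore.log",), daemon=True).start()
on_write_lock(1, "P1 v0")   # foreground: queue the pre-state and move on
trickle_q.join()            # demo only: wait for the trickle to drain
```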
Early Disruptiveness Results
• Single-threaded updating workload of 100,000 transactions
• 66 MB database
• We can retain a snapshot after every transaction for a 6–8% penalty to writers
• Tests with readers show little impact on sequential scans (not depicted)

[Chart: workload completion time (s) versus update skew (50/50, 80/20, 99/1) for No Snapshots, Snapshots Every Other Transaction, and Snapshots Every Transaction; within each skew the three bars differ by roughly 6–8%, with times ranging from about 472 s to 674 s.]
Paper Trail
• Upcoming poster and short paper at
ICDE08
• “Skippy: a New Snapshot Indexing
Method for Time Travel in the Storage
Manager” to appear in SIGMOD08
• Poster and workshop talks
– NEDBDay08, SYSTOR08
Questions?
Backups…
Recovery Sketch 1
• Snapshots are crash consistent
• Must recover data and metadata for all
snapshots since last checkpoint
• Pages might have been trickled, so must
truncate snapstore back to last mapping
before previous checkpoint
• We require only that a snapshot log record be forced into the log with a group commit; no other data or metadata must be logged until the checkpoint.
Recovery Sketch 2
• Walk backward through WAL, applying
UNDOs
• When snapshot record is encountered, copy
the “dirty” pages and create a mapping
• Trouble is that snapshots can be concurrent
with transactions
• Cope with this by “COWing” a page when an
UNDO for a different transaction is applied to
that page
The Future
• Sometimes we want to scrub the past
– Running out of space?
– Retention windows for SOX compliance
• Change past state representation
– Deduplication
– Compression