SNAP, BITE, and Filter: A Long

A chicken in every pot:
a persistent snapshot memory
scaled in time
Liuba Shrira and Hao Xu
Brandeis University
Storage systems: the 7 year itch
1984: rotational delay – FFS
1991: large memory - LFS
1998: cheaper disk - Elephant
2005: .. a chicken in every pot :
snapshot box on the side..
Trends
Hardware: Disk
Cheap ($1/GB) and getting cheaper
Software Industry: Forbes (12/2004) says:
the need for keeping past state is growing
Trends cont.
- A casino chases a card counter
- An IT dept. chased by Sarbanes-Oxley
- A Hippocratic DB audited about patient
privacy preservation
Need to analyze past activity
SNAP: a snapshot system
for an object storage system
Goal:
Storage system capability for
back-in-time execution (BITE):
application runs against
read-only snapshots
without synchronization
analysis in retrospect
Baseline Requirements
for BITE
Consistent snapshots: same (old) invariants hold
BITE of general code: after-the-fact ad-hoc analysis
( vs predefined SQL access methods)
App chooses the snapshot: snapshot state meaningful
to app (vs “some time in the past” )
High time “resolution”: fine-grained past analysis (vs
backup for recovery)
Over long time-scales..
Living with the past: how close?
today: too close (Temporal DB, CVFS)
or too far (warehouse - Netezza)
Snapshots can be of long-term importance, or transient
today: uniform - apps cannot discriminate
Inherent tension:
latency of access vs
cost of representation (space and time)
today: limited adaptation - compress or not
Capturing past states
Two ways:
Cheap - no-overwrite update
past stays put, copy new :
less to write, but
bloated DB, past inherits same rep
Opportunistic- in-place update
past is copied-out, separated:
more to write but can write smartly, can
tailor past rep, and DB stays clustered (vigor)
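The contrast between the two update strategies can be sketched as follows (a toy Python sketch; `db` and `archive` are illustrative stand-ins, not the system's actual structures):

```python
def update_no_overwrite(db, archive, page, new_contents):
    """No-overwrite: the past stays put in the DB, the new version is
    appended. Less to write, but the DB bloats and past pages inherit
    the live representation."""
    db.setdefault(page, []).append(new_contents)  # versions pile up in the DB

def update_in_place(db, archive, page, new_contents):
    """In-place + copy-out: the old version is copied out to a separate
    archive first, then overwritten. More to write, but the past rep can
    be tailored and the DB stays clustered."""
    if page in db:
        archive.append((page, db[page]))  # copy out the pre-state
    db[page] = new_contents               # overwrite in place
```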
Our requirements:
Non-disruptive past: just right distance - separated
At adaptive distance:
e.g. faster BITE on more recent states
Discriminated past:
application classifies, snapshot system filters:
Some snapshots outlive others,
some can be accessed faster
Flexible classification: e.g. after the fact
Snapshot system operations
Request to take a snapshot (declaration):
sid: snapshot_request (filter_spec)
Request to access a snapshot v:
snapshot_access (sid)
Request to specify a filter for a snapshot v:
lazy_filter (sid,filter_spec)
T1, T2, S1, T3, T4, T5, S2,…
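As an API, the three operations might look like this (a hypothetical sketch; only the operation names come from the talk, the class and its internals are invented for illustration):

```python
class SnapshotSystem:
    def __init__(self):
        self.next_sid = 0
        self.filters = {}    # sid -> filter_spec (relative lifetime class)
        self.snapshots = {}  # sid -> snapshot state, filled in copy-on-write

    def snapshot_request(self, filter_spec=None):
        """Declare a snapshot and return its sid. The filter spec may be
        given now, or supplied lazily later."""
        sid = self.next_sid
        self.next_sid += 1
        self.filters[sid] = filter_spec
        self.snapshots[sid] = {}  # pages copied out later, as the DB overwrites them
        return sid

    def lazy_filter(self, sid, filter_spec):
        """Supply or change the filter spec after the fact."""
        self.filters[sid] = filter_spec

    def snapshot_access(self, sid):
        """Open snapshot sid for read-only back-in-time execution."""
        return self.snapshots[sid]
```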
Baseline storage system
General interface:
pages and a page table
transactions access objects on pages
Server:
DB disk: slotted pages of objects
physical oid (page#,o#)
and a page table
Transaction Log
Cache: pages and modified object cache
Storage system, cont.
optimistic CC+ARIES
Clients
fetch pages, run transactions
send modified objects to server
Server
validates, commits (WAL)
caches committed modifications
no-force, no-STEAL
The snapshot system
Archive separated from DB:
Archive i/o sequential, DB random
Copy-on-write (COW):
copy out snapshot states into archive
just before updating DB
during cleaning.
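A minimal sketch of the copy-on-write rule, assuming a pre-state is recorded at most once per (snapshot, page) pair; all names are illustrative:

```python
def cow_update(db, archive, recorded, current_sid, page, new_contents):
    """Just before overwriting a DB page, copy its pre-state out to the
    archive for the current snapshot -- but only on the first overwrite
    after that snapshot was declared."""
    if page in db and (current_sid, page) not in recorded:
        archive.append(((current_sid, page), db[page]))  # copy out pre-state
        recorded.add((current_sid, page))
    db[page] = new_contents  # in-place update proceeds
```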
Snapshot interface
Same as DB: Snapshot Pages
and a Snapshot Page Table
So BITE is transparent:
BITE on snapshot S(v) uses PageTable(v)
Snapshot system:
below the interface:
Some S(v) pages are in the archive,
some in DB
and pages in the archive can have
a different representation
BITE (v): namespace redirection
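The namespace redirection can be sketched as a single lookup (illustrative names; in the real system the redirection happens through the snapshot page table, transparently to the application):

```python
def read_page(pageno, snapshot_page_table, db, archive):
    """BITE page read for snapshot S(v): a page not modified since the
    snapshot is still in the DB; a modified page's pre-state lives at
    an archive address recorded in PageTable(v)."""
    loc = snapshot_page_table.get(pageno)
    if loc is None:
        return db[pageno]   # unmodified since the snapshot: read the live DB
    return archive[loc]     # modified since: read the archived pre-state
```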
Creating non-disruptive snapshots:
(i/o bound system)
Archiving snapshot states when cleaning
can slow down cleaning
compared to a system without snapshots.
Copying to the archive disk (sequential I/O)
in parallel
to database I/O (random)
can partially hide archiving cost
behind database I/O.
Creating snapshots: how well can
you hide?
Is determined by:
how much is archived:
compactness of snapshot representation,
snapshot frequency,
update workload (overwriting)
cost of archiving:
sequential, other archive traffic – BITE
Creating snapshots: some issues
Issue:
avoid overwriting snapshot states
(without blocking, pinning etc)
Issue:
update snapshot meta data efficiently
(large, dynamic page tables )
Issue:
filter out long-lived snaps (focus here)
New techniques
for copy-out snapshots:
- VMOB: in-memory versioned data
structure preserves snapshot states w/out
blocking
- LPT: incrementally archived page table
with logarithmic reconstruction cost
- Filtering: exploit smart representation for
past states (focus here)
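A toy sketch of the VMOB idea under stated assumptions (one object version per snapshot epoch, drained by the archiver); the real structure is more elaborate, this only shows why updaters need not block:

```python
class VMOB:
    """Versioned modified-object buffer: writers always write into the
    current epoch; the archiver drains completed epochs independently."""
    def __init__(self):
        self.epoch = 0
        self.versions = {}  # oid -> {epoch: value}

    def declare_snapshot(self):
        self.epoch += 1     # later writes go to the new epoch

    def write(self, oid, value):
        # preserves the previous epoch's version; no blocking, no pinning
        self.versions.setdefault(oid, {})[self.epoch] = value

    def drain(self, epoch):
        """Archiver: remove and return all versions of a completed epoch."""
        return {oid: vs.pop(epoch)
                for oid, vs in self.versions.items() if epoch in vs}
```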
Filtering: motivation
Want unlimited past at high resolution
but
some snapshots are transient
others of long-term interest to application
application needs to discriminate
between snapshots
Thresher: a filtering system for
SNAP
Snapshot representation
What can representation do for filtering?
life-time based allocation –
avoids fragmentation
diff-based encoding –
reduces cost of copying
the adaptive combination is the real winner
Example: hierarchical snapshots at
multiple time granularity
ICU patient monitoring DB takes snapshots:
minute by minute vital sign monitor readings
hourly includes nurse’s writeup summarizing
monitor readings
daily includes doctor’s notes summarizing
nurse’s checkups
Doctor’s notes have a longer lifetime than nurse’s writeups
…
Brief overview: snapshot creation
Some notation:
Snapshot span
Recorded pages
example:
.. v4, T: w (x_P), T’: w (y_S), v5, T’’..
Span of v4 : T, T’
Pages recorded by snapshot v4: P, S
Incremental snapshot creation:
Archived snapshot pages: dispersed:
v4 P S
v5 P Q
…-|-----------------------|------------------------
Archived snapshot page tables (PT):
PT(v4): addr (P4), addr(S4); PT(v5): addr(P5), addr(Q5)..
…-|-----------------------|-------------------------
Another talk: how to construct archived page tables:
Construct APT(v4) = recorded(v4) + Construct APT(v5)
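The recurrence can be sketched directly (a naive linear-cost sketch; the LPT achieves logarithmic reconstruction cost, which this does not model):

```python
def construct_apt(v, recorded, latest):
    """Archived page table of snapshot v: a page recorded by v maps to
    v's copy; otherwise it maps to the first later snapshot that
    recorded it (APT(v) = recorded(v) overlaid on APT(v+1))."""
    if v > latest:
        return {}                    # beyond the archive: pages still live in DB
    apt = dict(construct_apt(v + 1, recorded, latest))
    apt.update(recorded.get(v, {}))  # v's own recorded pages win
    return apt
```

On the slide's example (v4 records P, S; v5 records P, Q), v4's table takes P and S from v4's copies and Q from v5's.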
Filtering example:
filter out short-lived v5 (Nurse’s), keep long-lived v4 (Doctor’s)
Archive: v4 records P, S; v5 records P, Q; then v6 …
Filter: long-lived v4, reclaim v5:
reclaim P5
retain Q5 (v4 needs it)
filtering incremental snapshots creates fragmentation
Problem: fragmentation
• fragmented archive, over time:
non sequential archive writes
or
random reads to copy out long lived states
Our approach: filter-spec
Filter spec determines
relative snapshot lifetime
“App knows best”:
the app supplies a filter spec
the system filters
avoid fragmentation with filter-spec
Known at snapshot declaration –
use lifetime-based allocation
After the fact use a flexible rep to filter lazily
rep allows adaptive trade-off:
cost of filtering vs cost of BITE
App specifies filter at declaration
Long-lived pages allocated in one archive region: P4, S4, Q5
Short-lived pages in another: P5
Invariant: to reclaim w/out fragmentation,
short-lived areas store no long-lived pages
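Lifetime-based allocation can be sketched as appending into per-lifetime regions (illustrative; the point is that a short-lived region is reclaimed wholesale, so reclamation never fragments the archive):

```python
def archive_page(regions, lifetime_class, page, contents):
    """Append an archived page to the region for its lifetime class.
    A page recorded by a short-lived snapshot but needed by a
    long-lived one (like Q5) must go in the long-lived region."""
    regions.setdefault(lifetime_class, []).append((page, contents))

def reclaim(regions, lifetime_class):
    """Reclaim a whole region at once -- sequential, no fragmentation."""
    return regions.pop(lifetime_class, [])
```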
FilterTree: filter pages for free
After-the-fact (lazy) filtering
Some applications want
to defer filter specification
Lazy filtering requires copying
We can specialize representation (compact)
to reduce copying cost
Compact representation: diffs
Two components filtered separately:
compact diffs – reduce cost of copying
(diffs clustered by page)
checkpoints – accelerate BITE
(page-based snapshots
system-declared, can use FilterTree)
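One way to sketch the diff/checkpoint split, assuming simple forward diffs represented as {offset: value} maps (the diff format here is an assumption, not the system's):

```python
def reconstruct(page_id, v, checkpoints, diffs):
    """Rebuild page page_id as of snapshot v: start from the latest
    checkpoint at or before v, then replay the diffs up to v.
    checkpoints: {snap: {page_id: full page as {offset: value}}}
    diffs:       {snap: {page_id: changes since the previous snap}}"""
    cv = max(c for c in checkpoints if c <= v and page_id in checkpoints[c])
    page = dict(checkpoints[cv][page_id])     # nearest checkpoint
    for u in range(cv + 1, v + 1):
        page.update(diffs.get(u, {}).get(page_id, {}))  # replay forward
    return page
```

The trade-off is visible in the loop: sparser checkpoints mean a more compact archive but a longer replay per BITE.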
Adaptive trade-off
Like recovery log:
less frequent checkpoints
increase compactness
more frequent checkpoints
accelerate BITE
Lazy filtering:
checkpoints filtered for free
FilterTree for checkpoints
Archive regions for diff extents
(figure: diff generations G1, G2; buckets B1–B3; extents E1–E3)
But some applications want more:
lazy filtering
and
faster BITE
e.g.
- app runs BITE on a batch of recent snapshots
to decide which ones to retain; needs fast BITE to keep up..
Combined hybrid
Faster BITE in recent window
and
Lazy filtering
Hybrid: checkpoints, and checkpoints
filtered for free
Status
Implemented:
SNAP and Thresher for Thor storage system
Performance results –
encouraging.
here is a 5,000-foot view:
Performance metrics
Cost of filtering:
non-disruptiveness = rate-of-drain / rate-of-pour
t_clean determines rate-of-drain
workload parameter: overwriting
Compactness of diff-based rep:
retention relative to page-based rep
R_diff - fixed
R_ckp - tunable by frequency of checkpoints
workload parameter: density
BITE - page-based snapshots, vs diff-based vs DB
Non-disruptiveness
Storage system w/hybrid snapshots vs
w/out snapshots (Thor)
How much drop in
rate-of-drain / rate-of-pour
Experimental configuration
Workloads:
extend multiuser 007 to control
density
overwriting
System configuration:
single client, medium 007 – small DB 185MB
multiple clients – large DB 140GB
FilterTree
Free!
Non-disruptiveness/ single client
“summertime …life is easy”
Non-disruptiveness/multi user:
“DB works harder”
Summary:
non-disruptive snapshot memory
Unlimited filtered past
is cheaper than you may think.
.. A chicken in every pot..
Every storage system
can have a snapshot box on the side..
To get there:
Generalize:
ARIES/ STEAL / underway
file systems / need extended interfaces
Beyond:
upgrades/ have techniques
provenance / need ideas..