Toward Recovery-Oriented Computing
A Recovery-Friendly, Self-Managing Session State Store
Benjamin Ling and Armando Fox
{bling,fox}@cs.stanford.edu
Outline
Motivation: What is Session State?
Existing solutions
SSM: Architecture and Algorithm
SSM: Recovery-friendly
SSM: Self-Managing
Related and Future Work
Conclusion
© 2003 Benjamin Ling
Example of Session State
Session State and Existing Solutions
We focus on a subcategory of session state
Single-user, serial access, semi-persistent data
Examples: temporary application data, application workflow
Example of usage (e.g. J2EE):
[Figure: a Browser and an App Server, with labels 1 to 6]
Existing solutions:
File System and Databases
Poor failure behavior
Lose data (FS)
Slow recovery (Both)
Difficult to administer (DB)
Difficult to tune (Both)
In-memory replication using primary/secondary:
Performance coupling
Poor failover (uneven load balancing)
Goal
Build a session state store that is:
Failure-friendly
Does not lose data on crash
Degrades gracefully
Recovery-friendly
Recovers fast
Self-Managing
High performance
Avoids performance coupling
Session State Manager (SSM)
Redundant, in-memory hash table distributed across nodes
[Figure: two AppServers, each with a STUB, and Bricks 1 to 5 (RAM, network interface)]
Algorithm: redundancy similar to quorums
• Write to many random nodes, wait for few (avoids performance coupling)
• Read one
Write example: “Write to Many, Wait for Few”
Try to write to W random bricks, W = 4
Must wait for WQ bricks to reply, WQ = 2
[Figure, repeated over four animation frames: Browser, AppServer with STUB, and Bricks 1 to 5; the last frame adds the labels 1 and 4]
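The write path in the example above can be sketched roughly as follows. This is a minimal illustration, not SSM's actual code: the Brick class, the thread pool, and the name write_many_wait_few are assumptions made for the example. W = 4 and WQ = 2 come from the slide, and the 60 ms default mirrors the request timeout quoted later in the talk.

```python
import concurrent.futures as cf
import random
import time


class Brick:
    """Hypothetical in-memory brick: a dict plus a simulated network delay."""

    def __init__(self, brick_id, delay_s=0.0):
        self.brick_id = brick_id
        self.delay_s = delay_s
        self.table = {}

    def put(self, key, value):
        time.sleep(self.delay_s)  # stand-in for network and queueing delay
        self.table[key] = value
        return self.brick_id


def write_many_wait_few(bricks, key, value, w=4, wq=2, timeout_s=0.06):
    """Send the write to w random bricks; return the ids of the first wq to ack."""
    targets = random.sample(bricks, w)
    # One pool per write keeps the sketch short; a real stub would reuse one.
    pool = cf.ThreadPoolExecutor(max_workers=w)
    futures = [pool.submit(b.put, key, value) for b in targets]
    acked = []
    try:
        for fut in cf.as_completed(futures, timeout=timeout_s):
            acked.append(fut.result())
            if len(acked) >= wq:
                break  # enough replies: do not wait for the slow bricks
    except cf.TimeoutError:
        pass  # fall through and check how many acks arrived in time
    finally:
        pool.shutdown(wait=False)  # return without joining slow writers
    if len(acked) < wq:
        raise TimeoutError(f"only {len(acked)} of {wq} required acks arrived")
    return acked
```

Because the caller returns after the first WQ replies, a single slow brick does not set the latency of the write, which is the point of "wait for few". The returned brick ids are what a client would need to remember in order to read the state back.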
Algorithm Properties
Client remembers metadata
Fate sharing
Stubs are stateless
Negative feedback loop
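One way to picture "client remembers metadata" together with "stubs are stateless" is the read path sketched below. It is only an illustration under assumptions: the cookie layout (a key plus the ids of the bricks that acknowledged the last write), the Brick class, and the function read_one are hypothetical and are not taken from the SSM implementation.

```python
import random


class Brick:
    """Hypothetical brick: an in-memory table that may be down."""

    def __init__(self, brick_id):
        self.brick_id = brick_id
        self.table = {}
        self.alive = True

    def get(self, key):
        if not self.alive:
            raise ConnectionError(f"brick {self.brick_id} is down")
        return self.table[key]


def read_one(bricks_by_id, cookie):
    """Read the state from any one brick named in the client's cookie.

    The stub keeps no per-session state: the key and the candidate
    brick ids both arrive with the request.
    """
    candidates = list(cookie["brick_ids"])
    random.shuffle(candidates)  # spread reads across the copies
    for brick_id in candidates:
        brick = bricks_by_id.get(brick_id)
        if brick is None:
            continue  # brick was removed from the cluster
        try:
            return brick.get(cookie["key"])
        except (ConnectionError, KeyError):
            continue  # brick is down or restarted empty; try another copy
    raise LookupError("no brick listed in the cookie holds the state")
```

In this sketch any stub can serve any request, since nothing about the session lives in the stub itself, and a read succeeds as long as one listed brick still holds the data.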
SSM: Recovery-Friendly
Failure
No data is lost, WQ-1 copies of the data remain
State is available for R/W during failure
Recovery
Start a new brick – don’t need to recover anything
No special case recovery code (restart=recovery)
State is available for R/W during brick restart
Repair phase does not reduce throughput/performance
Session state is self-recovering
User’s access pattern will cause data to be rewritten
SSM: Self-Managing
Adaptive:
Stub maintains count of maximum allowable in-flight requests to each brick
Additive increase on successful request
Multiplicative decrease on timeout
Stubs discover load capacity of each brick
Self-Tuning
Admission control
Stubs say “no” if insufficient bricks
Propagate backpressure from bricks to clients
Turn users away under overload
Self-Protecting
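The adaptation and admission-control bullets above can be sketched as follows. This is a minimal sketch under assumptions: the class BrickWindow, the function admit, the step sizes (add 1 on success, halve on timeout), and the initial and bounding limits are illustrative choices, not values from the talk.

```python
import random


class BrickWindow:
    """Per-brick cap on in-flight requests, adjusted by AIMD (hypothetical)."""

    def __init__(self, initial=4.0, minimum=1.0, maximum=64.0):
        self.limit = initial
        self.minimum = minimum
        self.maximum = maximum
        self.in_flight = 0

    def try_acquire(self):
        """Reserve one slot if the brick is below its current limit."""
        if self.in_flight >= int(self.limit):
            return False
        self.in_flight += 1
        return True

    def release(self):
        """Give a slot back without changing the limit."""
        self.in_flight -= 1

    def on_success(self):
        self.release()
        self.limit = min(self.maximum, self.limit + 1.0)  # additive increase

    def on_timeout(self):
        self.release()
        self.limit = max(self.minimum, self.limit * 0.5)  # multiplicative decrease


def admit(windows, needed):
    """Reserve a slot on `needed` bricks, or say "no" by returning None."""
    granted = []
    for brick_id in random.sample(list(windows), len(windows)):
        if windows[brick_id].try_acquire():
            granted.append(brick_id)
            if len(granted) == needed:
                return granted
    for brick_id in granted:  # insufficient bricks: undo the reservations
        windows[brick_id].release()
    return None
```

When admit returns None, the caller can reject the request immediately, which is how backpressure from overloaded bricks would reach clients in this sketch.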
Self-Tuning and Self-Protecting
[Figure: throughput (# req/s) versus time (s), 250 senders (windowing). Panels: without additive-increase/multiplicative-decrease (AI/MD) adaptation, under normal load and under overload; overload with AI/MD adaptation.]
Other implementation details
Garbage collection
Generational hash table
Hash table of hash tables
Each hash table has an associated time range
When a table's time range has passed, garbage-collect the whole table
No reference counting, scanning, etc.
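A generational hash table along these lines might look like the sketch below. The class name GenerationalTable, the 60-second bucket width, and the choice to bucket entries by expiry time are assumptions made for illustration; the talk states only that each inner table covers a time range and is collected as a unit.

```python
import time


class GenerationalTable:
    """Hash table of hash tables, one inner table per expiry-time bucket."""

    def __init__(self, bucket_s=60.0, clock=time.monotonic):
        self.bucket_s = bucket_s
        self.clock = clock
        self.generations = {}  # bucket index -> {key: value}

    def _bucket(self, t):
        return int(t // self.bucket_s)

    def put(self, key, value, ttl_s):
        """Store value in the generation covering its expiry time."""
        gen = self._bucket(self.clock() + ttl_s)
        self.generations.setdefault(gen, {})[key] = value
        return gen  # the caller keeps this alongside the key

    def get(self, key, gen):
        return self.generations.get(gen, {}).get(key)

    def collect(self):
        """Drop every generation whose time range has fully passed."""
        now_bucket = self._bucket(self.clock())
        expired = [g for g in self.generations if g < now_bucket]
        for g in expired:
            del self.generations[g]  # whole table goes at once, no scanning
        return len(expired)
```

Collection cost here depends on the number of generations rather than the number of entries, and an entry is never dropped before its expiry, although it may linger for up to one bucket width afterwards.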
Is it cheap? Is it fast? Is it easy to use?
How much does replication cost?
With 10 bricks, 1 GB of memory each, 8 KB of state per user, and a replication factor of 3
Serve around 416,000 concurrent users
Configurable request timeout – currently 60 ms
Dwarfed by computation time and client round-trip time
Easy to add a brick, kill a brick
System continues running
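The capacity figure above follows from simple arithmetic, reproduced here as a check. The assumptions are that the 1 GB is per brick, that units are decimal (1 GB = 10^9 bytes, 8 KB = 8,000 bytes), and that per-entry overhead is ignored; the function name concurrent_users is illustrative.

```python
def concurrent_users(bricks, mem_per_brick_bytes, state_bytes, replication):
    """Number of sessions that fit when each is stored `replication` times."""
    total_bytes = bricks * mem_per_brick_bytes
    return total_bytes // (state_bytes * replication)


# Decimal units reproduce the "around 416,000" on the slide.
users = concurrent_users(10, 10**9, 8 * 10**3, 3)
print(users)  # 416666
```

With binary units (2^30 and 8,192 bytes) the same formula gives 436,906, so the slide's figure appears to use decimal units.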
Publications
The Case for a Session State Storage Layer
Ben Ling, Armando Fox
9th Workshop on Hot Topics in Operating Systems (HotOS IX), Lihue, HI, May 2003
A Self-Managing Session State Layer
Ben Ling, Armando Fox
Accepted to the 5th Annual Workshop on Active Middleware Services (AMS 2003), Seattle, WA, June 2003
http://swig.stanford.edu/public/publications
Related Work
Palimpsest – Timothy Roscoe, Intel
Temporal storage
Erasure coding
No guarantees, just estimates
DeStor – Andy Huang, Stanford
Persistent, multi-user, non-transactional data
FAB – HP Labs
Enterprise disk storage
Redundancy at disk block level
Future Work
Do fault analysis and model failure
Memory and network failure modes
Performance faults?
How to choose replication factor?
10 bricks, WQ of 3, inter-request interval of 5 minutes ->
“5 nines” of availability if MTTF of bricks > 22 minutes
Adaptively change replication factor?
SSM: Relaxing ACID
A – we guarantee
C – guaranteed by workload (full rewrite of state)
I – guaranteed by workload (single user, serial-access)
D – relaxed (ephemeral guarantee; RAM is enough)
Fast, simple, clean recovery
No data loss on failure
Data can be R/W during failure/recovery
Self-Managing
Summary
We have built a system for:
Semi-persistent storage for single-user, serial-access data
Recovery friendly:
Crash Only – Crash-safe, fast recovery
No special case recovery code
Reboot any individual node
Continuous data availability
Self-Managing:
Self-Tuning and Self-Protecting
Simple management and fault enforcement model
Benjamin Ling
http://swig.stanford.edu/
SSM: Recovery-Friendly, Self-Managing Store
Questions or Comments?