Toward Recovery-Oriented Computing

A Recovery-Friendly, Self-Managing Session State Store
Benjamin Ling and Armando Fox
{bling,fox}@cs.stanford.edu
Outline

Motivation: What is Session State?

Existing solutions

SSM: Architecture and Algorithm

SSM: Recovery-friendly

SSM: Self-Managing

Related and Future Work

Conclusion
© 2003 Benjamin Ling
Example of Session State
Session State and Existing Solutions

We focus on a subcategory of session state

Single-user, serial access, semi-persistent data

Examples: Temporary application data, application workflow

Example of usage (e.g. J2EE):
[Figure: Browser and App Server, with numbered steps 1 to 6]
Existing solutions:

File System and Databases

Poor failure behavior


Lose data (FS)

Slow recovery (both)

Difficult to administer (DB)

Difficult to tune (both)
In-memory replication using primary/secondary:

Performance coupling

Poor failover (uneven load balancing)
Goal

Build a session state store that is:

Failure-friendly
 Does not lose data on crash
 Degrades gracefully

Recovery-friendly
 Recovers fast

Self-Managing

High performance
 Avoids performance coupling
Session State Manager (SSM)
Redundant, in-memory hash table distributed across nodes
[Figure: two AppServers, each with a STUB, and Bricks 1 to 5; labels: RAM, Network Interface]
Algorithm: Redundancy similar to quorums
• Write to many random nodes, wait for few (avoid performance coupling)
• Read one
Write example: “Write to Many, Wait for Few”
Try to write to W random bricks, W = 4
Must wait for WQ bricks to reply, WQ = 2
[Figure: Browser, AppServer with STUB, and Bricks 1 to 5]
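
The "write to many, wait for few" step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SSM implementation: the `Brick` class, the `write` function, and the simulated delay are hypothetical names introduced here, bricks are modeled as in-process objects rather than networked nodes, and the 60 ms default is the request timeout quoted elsewhere in the talk. The values W = 4 and WQ = 2 come from the slide.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout


class Brick:
    """Hypothetical in-memory brick: a dict plus a simulated network delay."""

    def __init__(self, brick_id, delay_s=0.0):
        self.brick_id = brick_id
        self.delay_s = delay_s
        self.table = {}

    def put(self, key, value):
        time.sleep(self.delay_s)  # stand-in for network and queueing delay
        self.table[key] = value
        return self.brick_id      # the acknowledgement


def write(bricks, key, value, w=4, wq=2, timeout_s=0.06, rng=random):
    """Send the write to w random bricks and return once wq have acknowledged."""
    targets = rng.sample(bricks, w)
    pool = ThreadPoolExecutor(max_workers=w)
    futures = [pool.submit(brick.put, key, value) for brick in targets]
    acked = []
    try:
        for future in as_completed(futures, timeout=timeout_s):
            acked.append(future.result())
            if len(acked) >= wq:
                break             # enough replies: stop waiting for the rest
    except FuturesTimeout:
        pass                      # fewer than wq bricks replied in time
    finally:
        pool.shutdown(wait=False)  # slow bricks finish in the background
    if len(acked) < wq:
        raise TimeoutError("only %d of %d required bricks replied" % (len(acked), wq))
    return acked                  # ids the caller may keep as client-side metadata
```

Because the caller returns after the first WQ replies, one slow brick among the W targets does not slow the request down, which is the sense in which the scheme avoids performance coupling.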
Algorithm Properties

Client remembers metadata

Fate sharing

Stubs are stateless

Negative feedback loop
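
One way to picture "client remembers metadata" together with stateless stubs is the read path below. It is a sketch under assumptions: the `Cookie` fields, the `BrickDown` exception, and the one-replica-at-a-time retry order are hypothetical choices made for illustration, and the source does not specify the metadata format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cookie:
    """Hypothetical client-side metadata: the key and the bricks that acknowledged the last write."""
    key: str
    brick_ids: tuple


class BrickDown(Exception):
    """Raised by a brick that is not currently reachable."""


class Brick:
    """Hypothetical in-memory brick; a restarted brick comes back with an empty table."""

    def __init__(self, brick_id):
        self.brick_id = brick_id
        self.table = {}
        self.alive = True

    def get(self, key):
        if not self.alive:
            raise BrickDown(self.brick_id)
        return self.table[key]


def read(bricks_by_id, cookie):
    """Stateless read: ask the bricks named in the cookie, one at a time, until one answers."""
    for brick_id in cookie.brick_ids:
        brick = bricks_by_id.get(brick_id)
        if brick is None:
            continue              # brick was removed from the cluster
        try:
            return brick.get(cookie.key)
        except (BrickDown, KeyError):
            continue              # brick is down or restarted empty: try the next replica
    raise LookupError("no replica of %r is reachable" % cookie.key)
```

The function keeps nothing between calls; everything needed to find the data travels with the client, so any stub can serve any request.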
SSM: Recovery-Friendly


Failure

No data is lost, WQ-1 copies of the data remain

State is available for R/W during failure
Recovery

Start a new brick – don’t need to recover anything

No special case recovery code (restart=recovery)

State is available for R/W during brick restart


Repair phase does not reduce throughput/performance
Session state is self-recovering

User’s access pattern will cause data to be rewritten
SSM: Self-Managing

Adaptive:

Stub maintains count of maximum allowable in-flight requests to each brick
 Additive increase on successful request
 Multiplicative decrease on timeout

Stubs discover load capacity of each brick
 Self-Tuning

Admission control

Stubs say “no” if insufficient bricks

Propagate backpressure from bricks to clients
 Turn users away under overload
 Self-Protecting
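
The additive-increase/multiplicative-decrease window and the admission check can be sketched as follows. The class and function names, the initial window, and the step sizes (add 1 on success, halve on timeout) are assumptions for illustration; the talk states the policy but not its constants, and it does not say how many bricks count as "sufficient", so that number is left as the `needed` parameter.

```python
import random


class BrickWindow:
    """Per-brick cap on in-flight requests, adjusted by AI/MD."""

    def __init__(self, initial=8.0, minimum=1.0, maximum=1024.0,
                 increase=1.0, decrease=0.5):
        self.limit = initial
        self.minimum = minimum
        self.maximum = maximum
        self.increase = increase
        self.decrease = decrease
        self.in_flight = 0

    def has_room(self):
        return self.in_flight < int(self.limit)

    def on_send(self):
        self.in_flight += 1

    def on_success(self):
        self.in_flight -= 1
        # Additive increase: probe for a little more capacity.
        self.limit = min(self.maximum, self.limit + self.increase)

    def on_timeout(self):
        self.in_flight -= 1
        # Multiplicative decrease: back off quickly when the brick is slow.
        self.limit = max(self.minimum, self.limit * self.decrease)


class Overloaded(Exception):
    """Raised when too few bricks have spare capacity; the caller turns the user away."""


def admit(windows, needed, rng=random):
    """Pick `needed` bricks with spare capacity, or refuse the request."""
    ready = [brick_id for brick_id, window in windows.items() if window.has_room()]
    if len(ready) < needed:
        raise Overloaded("only %d bricks have room, %d needed" % (len(ready), needed))
    chosen = rng.sample(ready, needed)
    for brick_id in chosen:
        windows[brick_id].on_send()
    return chosen
```

A stub would call `admit` before each write, then `on_success` or `on_timeout` for every brick it contacted. Refusing at the stub is what pushes backpressure from the bricks toward the clients.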
Self-Tuning and Self-Protecting
[Figure: throughput plots, # req/s versus time in s. Panels: "Without Add Inc/Mult Dec adaptation…" (labels: NORMAL LOAD, OVERLOAD), "Overload with AI/MD adaptation", and "Throughput 250 senders (windowing)".]
Other implementation details

Garbage collection

Generational hash table
 Hash table of hash tables
 Each hash table has an associated time range
 When that time range has passed, GC the whole table

No reference counting, scanning, etc.
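
A generational hash table along these lines might look like the sketch below. Two details are assumptions rather than facts from the talk: entries are bucketed by their expiry time, and the caller remembers which generation holds its entry (for example in client-side metadata). All names are hypothetical.

```python
import time


class GenerationalTable:
    """Hash table of hash tables; each inner table covers one time range."""

    def __init__(self, generation_s=60.0, clock=time.monotonic):
        self.generation_s = generation_s
        self.clock = clock
        self.generations = {}  # generation index -> {key: value}

    def _index(self, t):
        return int(t // self.generation_s)

    def put(self, key, value, ttl_s):
        """Store value in the generation covering its expiry time; return that index."""
        gen = self._index(self.clock() + ttl_s)
        self.generations.setdefault(gen, {})[key] = value
        return gen

    def get(self, key, gen):
        """Look the key up in the generation the caller was told about."""
        return self.generations.get(gen, {}).get(key)

    def collect(self):
        """Drop every generation whose whole time range lies in the past."""
        current = self._index(self.clock())
        expired = [gen for gen in self.generations if gen < current]
        for gen in expired:
            del self.generations[gen]  # one delete frees all entries in that range
        return len(expired)
```

Collection never inspects individual entries: a generation is dropped as a unit once the clock has moved past its range, so no entry is discarded before its own expiry time.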
Is it cheap? Is it fast? Is it easy to use?


How much does replication cost?

With 10 bricks, 1 GB of memory each, a state size of 8 KB, and a replication factor of 3

Serve around 416,000 concurrent users
Configurable request timeout – currently 60 ms


Dwarfed by computation time and client RT time
Easy to add a brick, kill a brick

System continues running
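
The capacity figure on this slide can be checked with a short calculation. The sketch assumes that the 1G of memory is per brick, that sizes are decimal (1 GB = 10^9 bytes, 8 KB = 8,000 bytes), and that there is no per-entry overhead; the function name is hypothetical.

```python
def concurrent_users(bricks, bytes_per_brick, state_bytes, replication):
    """Number of sessions that fit when every session is stored `replication` times."""
    total_bytes = bricks * bytes_per_brick
    bytes_per_user = state_bytes * replication
    return total_bytes // bytes_per_user


users = concurrent_users(bricks=10, bytes_per_brick=10**9,
                         state_bytes=8 * 10**3, replication=3)
print(users)
```

Under these assumptions the result is about 416,000, in line with the number quoted above.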
Publications
The Case for a Session State Storage Layer
Ben Ling, Armando Fox
9th Workshop on Hot Topics in Operating Systems (HotOS IX), Lihue, HI, May 2003
A Self-Managing Session State Layer
Ben Ling, Armando Fox
Accepted to the 5th Annual Workshop On Active Middleware Services (AMS 2003), Seattle, WA, June 2003
http://swig.stanford.edu/public/publications
Related Work


Palimpsest – Timothy Roscoe, Intel

Temporal storage

Erasure coding

No guarantees, just estimates
DeStor – Andy Huang, Stanford


Persistent, multi-user, non-transactional data
FAB – HP Labs

Enterprise disk storage

Redundancy at disk block level
Future Work


Do fault analysis and model failure

Memory and network failure modes

Performance faults?
How to choose replication factor?


10 bricks, WQ of 3, inter-request interval of 5 minutes -> “5 nines” of availability if MTTF of bricks > 22 minutes
Adaptively change replication factor?
SSM: Relaxing ACID

A – we guarantee

C – guaranteed by workload (full rewrite of state)

I – guaranteed by workload (single user, serial-access)

D – relaxed (ephemeral guarantee, RAM enough)



Fast, simple, clean recovery

No data loss on failure

Data can be R/W during failure/recovery
Self-Managing
Summary

We have built a system for:

Semi-persistent storage for single-user, serial-access data

Recovery friendly:
 Crash Only – Crash-safe, fast recovery
 No special case recovery code
 Reboot any individual node
 Continuous data availability

Self-Managing:
 Self-Tuning and Self-Protecting
 Simple management and fault enforcement model
Benjamin Ling
bling@cs.stanford.edu
http://swig.stanford.edu/
SSM: Recovery-Friendly, Self-Managing Store
Questions or Comments?
