Crash-Only Software
Why Recovery Should Be Free,
And Often Can Be
Armando Fox, Stanford University
June 2003 ROC Retreat
Recovery Should Be Free, and Can Be
We already espouse arguments for lowering MTTR:
Mitigates impact on service as a whole [Fox & Patterson, 2002]
Results in higher end-user-perceived availability, given same
overall availability [Xie et al. 2002]
etc
Tim Chou, Oracle: maybe more important to make recovery
predictable (so can plan provisioning, anticipate impact of
outage, etc.)...if we understand it, we can optimize its speed
© 2003 Armando Fox
Real win: Recovery management is hard
Determining when to recover is hard
How to detect that something’s wrong?
How do you know when recovery is really necessary? (fail-stutter, etc.)
Will recovery make things worse? (cascading recovery)
Knowing what happens when you recover is hard
Will a particular recovery technique work? (the machinery needed to
perform the recovery may also be broken)
What is the effect on online performance? (recovery can be expensive)
What if you needlessly “over-recover”? (cost of making a mistake is
high)
If recovery were predictable and fast, it would simplify both failure
detection and recovery management.
Simplifying Recovery Management: Crash-Only Software
Goal: enforce simple invariants on recovery behavior, from outside
the component(s) being recovered
Crash-only component provides a PWR switch: stop = crash
clean shutdown = loss of power = kernel panic = ...
One way to go down, one way to come up: start = recover
Power switch is external, with uniform behavior
kill -9, “turning off” (process kill) a VM, pull power cord
Intuition: the “infrastructure” supporting the power switch is usually
simpler than the applications using it, and common across all those
applications
Can crash-only software actually be built, and if so, how?
(a) provide building blocks
(b) formalize C/O definition and provide developer
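To make "stop = crash, start = recover" concrete, here is a minimal sketch of a crash-only component. It is an illustration only, not one of the building blocks from this talk: the class name `CrashOnlyStore`, the append-only log format, and the `demo` helper are all assumptions made for the example. The component has no shutdown method; its only startup path replays the log, so recovery code runs on every start.

```python
import json
import os
import tempfile


class CrashOnlyStore:
    """Toy key-value store (hypothetical) whose only startup path is recovery."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        self._recover()  # start = recover: there is no separate "clean start"
        self.log = open(self.path, "ab")

    def _recover(self):
        if not os.path.exists(self.path):
            return
        good = 0  # byte offset of the end of the last complete record
        with open(self.path, "rb") as f:
            for raw in f:
                if not raw.endswith(b"\n"):
                    break  # torn final record left by a crash
                try:
                    key, value = json.loads(raw.decode("utf-8"))
                except ValueError:
                    break  # corrupt record: stop replaying here
                self.data[key] = value
                good += len(raw)
        # Discard the torn tail so later appends start on a record boundary.
        with open(self.path, "r+b") as f:
            f.truncate(good)

    def put(self, key, value):
        # Make the update durable before acknowledging it, so a crash at any
        # later point loses nothing the caller was told had succeeded.
        self.log.write(json.dumps([key, value]).encode("utf-8") + b"\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


def demo():
    path = os.path.join(tempfile.mkdtemp(), "store.log")
    store = CrashOnlyStore(path)
    store.put("cart", ["book"])
    store.put("cart", ["book", "pen"])
    # Simulate a crash: leave a half-written record behind and drop the
    # object without calling any shutdown routine (none exists).
    with open(path, "ab") as f:
        f.write(b'["cart", ["bo')
    del store
    recovered = CrashOnlyStore(path)
    return recovered.get("cart")
```

Calling `demo()` restarts the store after the simulated crash and reads back the last acknowledged value. The synchronous `fsync` on every write is the performance cost that clean-shutdown designs try to avoid; the sketch accepts it in exchange for having a single code path.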
Crash-only Building Blocks
JAGR/ROC-2, a self-recovering J2EE app server [Candea et al.,
WIAPP 2003]
Micro-reboots used for recovery, application-generic failure-path
inference used for determining recovery strategy
Significantly improves performability relative to whole-app redeploy
SSM: a CO session state manager [Ling, Fox, AMS 2003]
DStore: a CO persistent single-key state manager [Huang, Fox,
submitted to SRDS 2003]
Similar in spirit to HP Labs FAB [Frolund, Saito et al., 2003]
Common features of both SSM and DStore:
Redundancy used for persistence
Workload semantics exploited to simplify consistency model & recovery
Recovery=restart, safe to reboot any node at any time
Safe to coerce any failure to a crash (fail-stop) at any time
Building blocks, cont.
Pinpoint, statistical-anomaly-based failure detection
Standard tension: accuracy vs. precision (false positives problem)
Different clustering techniques seem to be good at detecting
different kinds of problems
Surprising result from a CS241 project: character-frequency histograms
are a good app-generic way to detect end-user-visible failures
Mostly integrated with JAGR and SSM
On burner: discussions with BEA Systems for integrating into WebLogic
Server
Insight: if cost of “over-recovering” is low, aggressive statistics-based failure detection becomes more appealing
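The character-frequency idea can be sketched in a few lines. This is not Pinpoint's implementation or the class project's code: the cosine-distance metric, the function names, and the `threshold` value of 0.1 are all assumptions chosen for illustration. The sketch compares the character histogram of a response body against that of a known-good page and flags large deviations.

```python
import math
from collections import Counter


def char_histogram(text):
    """Relative frequency of each character in a response body."""
    counts = Counter(text)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


def cosine_distance(h1, h2):
    """1 - cosine similarity between two sparse histograms."""
    norm1 = math.sqrt(sum(p * p for p in h1.values()))
    norm2 = math.sqrt(sum(p * p for p in h2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 1.0  # treat an empty page as maximally different
    dot = sum(p * h2.get(ch, 0.0) for ch, p in h1.items())
    return 1.0 - dot / (norm1 * norm2)


def looks_anomalous(baseline, response, threshold=0.1):
    """Flag a response whose histogram is far from the baseline's.

    The threshold is a hypothetical value; a real detector would have to
    tune it against the false-positive rate it can tolerate.
    """
    distance = cosine_distance(char_histogram(baseline), char_histogram(response))
    return distance > threshold


BASELINE = "<html><body><h1>Your cart</h1><p>3 items, total $42.00</p></body></html>"
NORMAL = "<html><body><h1>Your cart</h1><p>5 items, total $17.50</p></body></html>"
ERROR = "HTTP 500 INTERNAL SERVER ERROR"
```

With these sample pages, `looks_anomalous(BASELINE, NORMAL)` is false and `looks_anomalous(BASELINE, ERROR)` is true. The detector knows nothing about the application, which is what makes it application-generic; it will also misfire on legitimate pages that look unusual, which is why a low cost of over-recovery matters.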
Toward a crash-only formalism
Component frameworks force you into certain app-writing patterns
Inter-EJB calls through runtime-managed level of indirection
Restrictions on how persistent state mgt can be expressed
Restrictions on state sharing: difficult to do without using explicit
external store
Hypothesis: these are the elements that allow C/O to work
Ongoing work: formalize crash-only SW
One possibility: observational equivalence with respect to a request
stream
Can be expressed using a design pattern or denotational semantics
Ideally, will lead to a tool (“co-lint”) telling you whether your component
is crash-only
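One way to read "observational equivalence with respect to a request stream" is as a test: replay the same requests with a crash and restart injected at each point, and check that the responses never change. The sketch below is only that reading, under stated assumptions; it is not the formalism or the “co-lint” tool, and `CartService`, `LeakyCartService`, `run`, and `equivalent_under_crashes` are hypothetical names. It also injects crashes only between requests, not in the middle of one.

```python
class CartService:
    """Keeps all session state in an external store, none in the process."""

    def __init__(self, store):
        self.store = store  # stands in for an external session state store

    def handle(self, request):
        op, arg = request
        if op == "add":
            self.store.setdefault("items", []).append(arg)
            return "ok"
        if op == "list":
            return list(self.store.get("items", []))
        raise ValueError(op)


class LeakyCartService(CartService):
    """Same interface, but keeps items in process memory only."""

    def __init__(self, store):
        super().__init__(store)
        self.items = []

    def handle(self, request):
        op, arg = request
        if op == "add":
            self.items.append(arg)
            return "ok"
        if op == "list":
            return list(self.items)
        raise ValueError(op)


def run(component_cls, requests, crash_before=()):
    """Feed a request stream to a component, crashing it at chosen points."""
    store = {}  # survives crashes, like an external store would
    component = component_cls(store)
    responses = []
    for i, request in enumerate(requests):
        if i in crash_before:
            # Crash + restart: the old instance and its memory are discarded.
            component = component_cls(store)
        responses.append(component.handle(request))
    return responses


def equivalent_under_crashes(component_cls, requests):
    """True if a crash before any single request leaves responses unchanged."""
    reference = run(component_cls, requests)
    return all(
        run(component_cls, requests, crash_before={i}) == reference
        for i in range(len(requests))
    )


REQUESTS = [("add", "book"), ("add", "pen"), ("list", None)]
```

Here `equivalent_under_crashes(CartService, REQUESTS)` holds, while `LeakyCartService` fails the check because a restart loses the items held in memory. Passing for one request stream is evidence, not proof; a formal definition would have to quantify over all streams and crash points.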
Summary: Toward a Crash-only World
Goal: simplify recovery management
diagnosis: statistical methods even more appealing if the cost of
making a mistake is low
recovery: crash-only enforces invariants about what happens when
recovery is attempted
allows aggressive use of fault model enforcement [Martin et al 2002]
Good progress on providing building blocks for app writers
JAGR: J2EE app server that allows fast recovery via micro-reboots and
application-generic fault injection
SSM: a crash-only session state store (in process of integrating with
JAGR)
DStore: a crash-only persistent single-key store
Pinpoint: statistics-based failure detection (integrated with JAGR, mostly
integrated with SSM)
Xie et al: MTTR and End-User Availability
Let AU=user-perceived unavailability, AS=system unavailability
Hypothesis: if users retry failed requests, and retry succeeds
because system had fast recovery, they will perceive higher
availability
When the retry rate is sufficiently high, AU approaches AS (for a system
availability of 99.3%, this threshold is 200-300 sec)
Method: model user retry behavior and system failure/recovery
using Markov models; solve using numerical methods
Finding: Given 2 systems with same AS, the one with shorter MTTR
(even though it also has lower MTTF) appears better to the user.
Goal of this project: validate that result empirically (Jeff Raymakers,
Yee-Jiun Song, Wendy Tobagus)
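The hypothesis can be illustrated with a much simpler model than the one in Xie et al. The sketch below is an assumption-laden toy, not their method: the system is a two-state (up/down) continuous-time Markov chain with exponential failure and repair times, the user retries exactly once after a fixed `retry_delay`, and a request counts as user-visible failure only if both attempts find the system down. The function names and the example numbers are hypothetical.

```python
import math


def system_unavailability(mttf, mttr):
    """Steady-state probability that the system is down (AS)."""
    return mttr / (mttf + mttr)


def perceived_unavailability(mttf, mttr, retry_delay):
    """P(first request fails and the single retry also fails) (AU)."""
    fail_rate = 1.0 / mttf
    repair_rate = 1.0 / mttr
    down = system_unavailability(mttf, mttr)
    # Two-state Markov chain transient solution:
    # P(down at t + d | down at t) = down + (1 - down) * exp(-(lambda + mu) * d)
    still_down = down + (1.0 - down) * math.exp(
        -(fail_rate + repair_rate) * retry_delay
    )
    return down * still_down


# Two hypothetical systems with the same AS = 0.7% (times in seconds).
SLOW = dict(mttf=99300.0, mttr=700.0)
FAST = dict(mttf=9930.0, mttr=70.0)
```

With a 60-second retry delay, `perceived_unavailability(**FAST, retry_delay=60.0)` is lower than the same call for `SLOW`, even though both systems have identical AS and `FAST` fails ten times as often. As the retry delay shrinks toward zero, the retry almost always finds the system still down and AU approaches AS. This toy does not capture the empirical effects the project set out to measure.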
User perceived unavailability vs retry rate
[Figure: user-perceived unavailability vs. user retry rate, with a “sweet spot” marked]
Beyond the sweet spot, higher user retry rates yield little improvement in perceived availability.
Surprise! MTTF eventually catches up with you
[Figure: user-perceived unavailability for variable MTTR but fixed system availability (low MTTR -> low MTTF), with a “sweet spot” marked]
At low MTTR, lowering MTTR and MTTF at the same time results in worse user-perceived unavailability!
Optimization Choices
[Figure: user-perceived unavailability plotted against system unavailability; curves: fixed MTTF, fixed MTTR]
Results Summary
We can find a “sweet spot” (for a given system
availability) beyond which higher user retry rates yield
little benefit.
For two systems of a given availability, the one with
lower MTTR does not always yield better user perceived
availability.
For a given system, we can determine whether improving
MTTR or MTTF will yield more user-visible benefits.
“Clean” shutdown vs. restart?
Impractical to guarantee zero crashes, so robust systems
must be crash-safe anyway
In that case, why support any other kind of shutdown?
Historically, for performance (avoid synchronous writes, do
buffering/caching, etc) - leads to replicated/mirrored state, more
code, special recovery code paths...
Total recovery time may be shorter even if crash is forced
WinXP can be (mostly) crash-rebooted for upgrades
VMS sysadmins would sometimes crash the system rather than shut it down (if no users were logged on)
Crash-only software must:
(a) be crash-safe & (b) recover quickly
Why Crash-Only Simplifies Recovery
“Hardware works, software doesn’t”
Hardware interlocks, timers, etc. have small state spaces of
behavior, hence high confidence they will work as designed
Crash-only PWR switch is a way to approach that same property
for software
Crash-only makes recovery policies easier to reason
about
Opportunity to aggressively apply SW rejuvenation
“Recovery” code exercised on every restart; no exotic-but-rarely-used code paths
“Over-recovery” may be OK from performability standpoint: if
recovery is free (performance & correctness), you stop thinking
about it as recovery and start thinking about it as normal aspect
of operation
Towards a Crash-Only World
Existing software that is crash-only or near-crash-only
Stateless apps: most Web servers
Most RDBMS’s: crash-safe, but long recovery
Postgres, BerkeleyDB/Sleepycat: “recovery” codepath is the main
codepath
Some appliance storage devices: separate but pretty fast recovery path
Our goals...
Focus on Internet (“3 tier”) applications; already “crash-mostly” except
for persistence tier(s)
Make the app server, middle-tier persistence, and back-end tier (to the
extent possible) truly crash-only
Deploy application-generic failure detection techniques (which may over-recover, but the goal is to make that OK)
Quantify improvement (we hope!) in performability resulting from these
changes
By doing it in the middleware, any app on that middleware can benefit