Crash-Only Software


Why Recovery Should Be Free,
And Often Can Be
Armando Fox, Stanford University
June 2003 ROC Retreat
Recovery Should Be Free, and Can Be

Already espouse arguments about lowering MTTR:

Mitigates impact on service as a whole [Fox & Patterson, 2002]

Results in higher end-user-perceived availability, given same
overall availability [Xie et al. 2002]

etc

Tim Chou, Oracle: maybe more important to make recovery
predictable (so can plan provisioning, anticipate impact of
outage, etc.)...if we understand it, we can optimize its speed
© 2003 Armando Fox
Real win: Recovery management is hard



Determining when to recover is hard

How to detect that something’s wrong?

How do you know when recovery is really necessary? (fail-stutter, etc.)

Will recovery make things worse? (cascading recovery)
Knowing what happens when you recover is hard

Will a particular recovery technique work? (the machinery needed to
perform the recovery may also be broken)

What is the effect on online performance? (recovery can be expensive)

What if you needlessly “over-recover”? (cost of making a mistake is
high)
If recovery were predictable and fast, it would simplify both failure
detection and recovery management.
Simplifying Recovery Management: Crash-Only Software

Goal: enforce simple invariants on recovery behavior, from outside
the component(s) being recovered

Crash-only component provides PWR switch: stop = crash:




clean shutdown = loss of power = kernel panic = ...
One way to go down -> one way to come up: start = recover
Power switch is external -> uniform behavior

kill -9, “turning off” (process kill) a VM, pull power cord

Intuition: the “infrastructure” supporting the power switch is usually
simpler than the applications using it, and common across all those
applications
Can crash-only software actually be built, and if so, how?

(a) provide building blocks

(b) formalize C/O definition and provide developer
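
The external power switch described on this slide can be sketched as follows. This is a minimal illustration, assuming a POSIX system and a component that runs as a child process; the function names `start` and `power_cycle` are hypothetical. The only stop operation is SIGKILL (the equivalent of kill -9), and the only way up is the normal start path.

```python
import subprocess
import sys

def start(cmd):
    """Start the component; by assumption its startup path is also its recovery path."""
    return subprocess.Popen(cmd)

def power_cycle(proc, cmd):
    """Crash the component from outside, then bring it back up.

    There is no graceful-shutdown request: Popen.kill() sends SIGKILL on POSIX,
    which the component cannot catch or ignore.
    """
    proc.kill()
    proc.wait()          # reap the crashed process before restarting
    return start(cmd)

if __name__ == "__main__":
    # Stand-in component (an assumption for the demo): a child that just sleeps.
    cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
    old = start(cmd)
    new = power_cycle(old, cmd)
    print("old exit status:", old.returncode)   # negative signal number on POSIX
    new.kill()
    new.wait()
```

Because the switch lives outside the component, the same few lines work for any application that can be started from a command line, which is the intuition behind keeping the infrastructure simpler than the applications that use it.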
Crash-only Building Blocks

JAGR/ROC-2, a self-recovering J2EE app server [Candea et al.,
WIAPP 2003]

Micro-reboots used for recovery, application-generic failure-path
inference used for determining recovery strategy

Significantly improves performability relative to whole-app redeploy

SSM: a CO session state manager [Ling, Fox, AMS 2003]

DStore: a CO persistent single-key state manager [Huang, Fox,
submitted to SRDS 2003]


Similar in spirit to HP Labs FAB [Frolund, Saito et al., 2003]
Common features of both SSM and DStore:

Redundancy used for persistence

Workload semantics exploited to simplify consistency model & recovery

Recovery=restart, safe to reboot any node at any time

Safe to coerce any failure to a crash (fail-stop) at any time
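
The common features listed above can be illustrated with a toy replicated store. This is a sketch under stated assumptions, not the SSM or DStore protocol: the `Brick` and `ReplicatedStore` classes, the write quorum size, and the logical-timestamp scheme are all hypothetical. State lives only in node memory, a crash wipes the node, and restart is the only recovery action.

```python
import itertools

class Brick:
    """One storage node holding state only in memory."""
    def __init__(self):
        self.data = {}
        self.up = True

    def crash(self):
        # Any failure is coerced to a crash: all local state is lost.
        self.data = {}
        self.up = False

    def restart(self):
        # Recovery = restart: come back empty, with no special repair path.
        self.up = True

class ReplicatedStore:
    def __init__(self, n=3, write_quorum=2):
        self.bricks = [Brick() for _ in range(n)]
        self.write_quorum = write_quorum
        self.clock = itertools.count(1)   # logical timestamps used as versions

    def put(self, key, value):
        ts = next(self.clock)
        acks = 0
        for brick in self.bricks:
            if brick.up:
                brick.data[key] = (ts, value)
                acks += 1
        if acks < self.write_quorum:
            raise RuntimeError("not enough live bricks to persist the write")

    def get(self, key):
        # Ask every live brick and keep the newest version seen.
        versions = [b.data[key] for b in self.bricks if b.up and key in b.data]
        if not versions:
            raise KeyError(key)
        return max(versions, key=lambda v: v[0])[1]
```

A short usage example: rebooting one node at an arbitrary moment loses nothing, because the surviving replicas carry the data.

```python
store = ReplicatedStore()
store.put("session:42", "cart=3")
store.bricks[0].crash()
store.bricks[0].restart()
print(store.get("session:42"))
```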
Building blocks, cont.

Pinpoint, statistical-anomaly-based failure detection



Standard tension: accuracy vs. precision (false positives problem)
Different clustering techniques seem to be good at detecting
different kinds of problems

Surprising result from a CS241 project: character-frequency histograms
are a good app-generic way to detect end-user-visible failures

Mostly integrated with JAGR and SSM

On burner: discussions with BEA Systems for integrating into WebLogic
Server
Insight: if cost of “over-recovering” is low, aggressive statistics-based failure detection becomes more appealing
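
The character-frequency idea can be sketched as follows. This is only an illustration of the general technique, not the CS241 project's or Pinpoint's actual method: the L1 distance metric, the 0.5 threshold, and all function names are assumptions.

```python
from collections import Counter

def char_histogram(text):
    """Relative frequency of each character in a response body."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}

def histogram_distance(h1, h2):
    """L1 distance between two histograms: 0.0 if identical, 2.0 if disjoint."""
    return sum(abs(h1.get(ch, 0.0) - h2.get(ch, 0.0)) for ch in set(h1) | set(h2))

def looks_anomalous(response, baseline, threshold=0.5):
    """Flag a response whose character mix is far from a known-good baseline.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    distance = histogram_distance(char_histogram(response), char_histogram(baseline))
    return distance > threshold
```

A detector this crude will produce false positives. That is acceptable only if the recovery it triggers is cheap, which is the point of the insight above.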
Toward a crash-only formalism


Component frameworks force you into certain app-writing patterns

Inter-EJB calls through runtime-managed level of indirection

Restrictions on how persistent state mgt can be expressed

Restrictions on state sharing: difficult to do without using explicit
external store

Hypothesis: these are the elements that allow C/O to work
Ongoing work: formalize crash-only SW

One possibility: observational equivalence with respect to a request
stream

Can be expressed using a design pattern or denotational semantics

Ideally, will lead to a tool (“co-lint”) telling you whether your component
is crash-only
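
One informal reading of observational equivalence with respect to a request stream: the responses a client sees should be the same whether or not the component was crashed and restarted between requests. The sketch below tests that property for two toy components. It is an assumption-laden illustration, not the formalism under development and not the proposed “co-lint” tool; all class and function names are hypothetical, and crashes in the middle of a request are not modeled.

```python
class DurableCounter:
    """Keeps its state in an external store that survives a crash."""
    def __init__(self, store):
        self.store = store            # start = recover: just reattach to the store

    def handle(self, request):
        self.store[request] = self.store.get(request, 0) + 1
        return self.store[request]

class VolatileCounter:
    """Keeps its state in process memory, so a crash silently resets it."""
    def __init__(self, store):
        self.counts = {}              # ignores the external store

    def handle(self, request):
        self.counts[request] = self.counts.get(request, 0) + 1
        return self.counts[request]

def run(component_cls, requests, crash_before=()):
    """Feed the request stream, crash-restarting before the given request indices."""
    store = {}                        # stands in for the external persistent store
    component = component_cls(store)
    responses = []
    for i, request in enumerate(requests):
        if i in crash_before:
            component = component_cls(store)   # crash + restart between requests
        responses.append(component.handle(request))
    return responses

def is_crash_equivalent(component_cls, requests):
    """True if a crash before any single request leaves the responses unchanged."""
    reference = run(component_cls, requests)
    return all(run(component_cls, requests, {i}) == reference
               for i in range(len(requests)))
```

With the stream `["a", "b", "a", "a"]`, `DurableCounter` passes the check and `VolatileCounter` fails it, which matches the hypothesis that pushing state into an explicit external store is one of the elements that lets crash-only work.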
Summary: Toward a Crash-only World

Goal: simplify recovery management

diagnosis: statistical methods even more appealing if the cost of
making a mistake is low

recovery: crash-only enforces invariants about what happens when
recovery is attempted


allows aggressive use of fault model enforcement [Martin et al 2002]
Good progress on providing building blocks for app writers

JAGR: J2EE app server that allows fast recovery via micro-reboots and
application-generic fault injection

SSM: a crash-only session state store (in process of integrating with
JAGR)

DStore: a crash-only persistent single-key store

Pinpoint: statistics-based failure detection (integrated with JAGR, mostly
integrated with SSM)
Xie et al: MTTR and End-User Availability
Let AU=user-perceived unavailability, AS=system unavailability

Hypothesis: if users retry failed requests, and retry succeeds
because system had fast recovery, they will perceive higher
availability

When retries are sufficiently frequent, AU approaches AS (for a system
availability of 99.3%, this threshold is 200-300 sec)

Method: model user retry behavior and system failure/recovery
using Markov models; solve using numerical methods

Finding: Given 2 systems with same AS, the one with shorter MTTR
(even though it also has lower MTTF) appears better to the user.

Goal of this project: validate that result empirically (Jeff Raymakers,
Yee-Jiun Song, Wendy Tobagus)
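
To see why two systems with the same availability must trade MTTR against MTTF, recall the standard steady-state relation: availability = MTTF / (MTTF + MTTR). The sketch below computes the MTTF implied by a fixed availability for several MTTR values. The function names and the MTTR values are illustrative assumptions, not figures from the study; only the 99.3% availability comes from the slide.

```python
def availability(mttf, mttr):
    """Steady-state availability of a system that alternates between up and down."""
    return mttf / (mttf + mttr)

def mttf_for(target_availability, mttr):
    """MTTF required to hold availability fixed at a given MTTR (same time unit)."""
    return mttr * target_availability / (1.0 - target_availability)

if __name__ == "__main__":
    for mttr in (10.0, 100.0, 1000.0):   # seconds; illustrative values only
        mttf = mttf_for(0.993, mttr)
        print(f"MTTR={mttr:7.0f}s  MTTF={mttf:10.1f}s  A={availability(mttf, mttr):.4f}")
```

At a fixed availability, MTTF scales linearly with MTTR, so cutting recovery time tenfold also means failing ten times as often.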
User perceived unavailability vs retry rate
[Figure: user-perceived unavailability as a function of user retry rate, with a marked “sweet spot”. Beyond it, higher user retry rates yield little improvement in perceived availability.]
Surprise! MTTF eventually catches up with you
[Figure: user-perceived unavailability for variable MTTR but fixed system availability (low MTTR -> low MTTF), with a marked “sweet spot”. At low MTTR, lowering MTTR and MTTF at the same time results in worse user-perceived unavailability!]
Optimization Choices
[Figure: user-perceived unavailability vs. system unavailability, with curves for fixed MTTF and fixed MTTR]
Results Summary

We can find a “sweet spot” (for a given system
availability) beyond which higher user retry rates yield
little benefit.

For two systems of a given availability, the one with
lower MTTR does not always yield better user perceived
availability.

For a given system, we can determine whether improving
MTTR or MTTF will yield more user-visible benefits.
“Clean” shutdown vs. restart?

Impractical to guarantee zero crashes -> robust systems
must be crash-safe anyway



In that case, why support any other kind of shutdown?
Historically, for performance (avoid synchronous writes, do
buffering/caching, etc) - leads to replicated/mirrored state, more
code, special recovery code paths...
Total recovery time may be shorter even if crash is forced

WinXP can be (mostly) crash-rebooted for upgrades

VMS sysadmins would sometimes crash the system rather than shut it
down (if no users were logged on)
Crash-only software must:
(a) be crash-safe & (b) recover quickly
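
A minimal sketch of a component that meets both requirements, assuming a POSIX-like filesystem where `os.replace` is an atomic rename. The `CrashOnlyStore` class is hypothetical. It has no close or shutdown method, so the only way to stop it is to crash it, and its constructor is the single recovery path.

```python
import json
import os

class CrashOnlyStore:
    """Tiny key-value store with no shutdown method: stopping it means crashing it."""

    def __init__(self, path):
        self.path = path
        self.tmp = path + ".tmp"
        # Start = recover: the same code runs after a crash and after any other stop.
        if os.path.exists(self.tmp):
            os.remove(self.tmp)            # discard a write that a crash interrupted
        if os.path.exists(self.path):
            with open(self.path) as f:
                self.data = json.load(f)
        else:
            self.data = {}

    def put(self, key, value):
        new_data = dict(self.data)
        new_data[key] = value
        with open(self.tmp, "w") as f:
            json.dump(new_data, f)
            f.flush()
            os.fsync(f.fileno())           # synchronous write: on disk before we acknowledge
        os.replace(self.tmp, self.path)    # atomic swap: readers see old or new, never a mix
        self.data = new_data

    def get(self, key):
        return self.data[key]
```

The synchronous write on every `put` is exactly the performance cost that clean-shutdown designs historically tried to avoid with buffering and caching. The trade made here is slower writes in exchange for one code path and a recovery that is just a file read. A production version would likely also need to fsync the containing directory; that step is omitted here.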
Why Crash-Only Simplifies Recovery


“Hardware works, software doesn’t”

Hardware interlocks, timers, etc. have small state spaces of
behavior, hence high confidence they will work as designed

Crash-only PWR switch is a way to approach that same property
for software
Crash-only makes recovery policies easier to reason
about

Opportunity to aggressively apply SW rejuvenation

“Recovery” code exercised on every restart; no exotic-but-rarely-used code paths

“Over-recovery” may be OK from performability standpoint: if
recovery is free (performance & correctness), you stop thinking
about it as recovery and start thinking about it as normal aspect
of operation
Towards a Crash-Only World


Existing software that is crash-only or near-crash-only

Stateless apps: most Web servers

Most RDBMSs: crash-safe, but long recovery

Postgres, BerkeleyDB/Sleepycat: “recovery” codepath is the main
codepath

Some appliance storage devices: separate but pretty fast recovery path
Our goals...

Focus on Internet (“3 tier”) applications; already “crash-mostly” except
for persistence tier(s)

Make the app server, middle-tier persistence, and back-end tier (to the
extent possible) truly crash-only

Deploy application-generic failure detection techniques (which may over-recover, but the goal is to make that OK)

Quantify improvement (we hope!) in performability resulting from these
changes

By doing it in the middleware, any app on that middleware can benefit