Recovery Oriented Computing: Overview


Toward Recovery-Oriented Computing
Armando Fox, Stanford University
David Patterson, UC Berkeley
and a cast of tens
Outline


Whither recovery-oriented computing?

research/industry agenda of last 15 years

today’s pressing problem: availability (we knew that) - but what
is new/different compared to previous F/T work, databases, etc?
Recovery-Oriented Computing as an approach to
availability

Motivation and philosophy

sampling of research avenues

what ROC is not
© 2002 Armando Fox
Reevaluating goals & assumptions


Goals of last 15 years

Goal #1: Improve performance

Goal #2: Improve performance

Goal #3: Improve cost-performance
Assumptions

Humans are perfect (they don’t make mistakes during
installation, wiring, upgrade, maintenance or repair)

Software will eventually be bug free
(good programmers will write bug-free code, debugging works)

Hardware MTBF is already very large (~100 years between
failures), and will continue to increase
Results of this successful agenda


Good news: faster computers, denser disks, cheaper $

computation faster by >3 orders of magnitude

disk capacity greater by >3 orders of magnitude

Result: TCO dominated by administration, not hardware cost
Bad news: complex, brittle systems that fail frequently

65% of IT managers report that their websites were unavailable
to customers over a 6-month period (25%: 3 or more outages)
[Internet Week, 4/3/2000]


outage costs: negative press, “click overs” to competitor, stock
price, market cap…
Yet availability is key metric for online services!
Direct Downtime Costs (per Hour)
Brokerage operations: $6,450,000
Credit card authorization: $2,600,000
Ebay (22-hour outage): $225,000
Amazon.com: $180,000
Package shipping services: $150,000
Home shopping channel: $113,000
Catalog sales center: $90,000
Airline reservation center: $89,000
Cellular service activation: $41,000
On-line network fees: $25,000
ATM service fees: $14,000
Sources: InternetWeek 4/3/2000 and Fibre Channel: A Comprehensive Introduction, R. Kembel 2000, p. 8: “...based on a survey done by Contingency Planning Research.”
So, what are today’s challenges?


We all seem to agree on goals

Dave Patterson, IPTS 2002: ACME “availability, change,
maintenance, evolution”

Jim Gray, HPTS 2001: FAASM “functionality, availability, agility,
scalability, manageability”

Butler Lampson, SOSP 1999: “Always available, evolving while
they run, growing without practical limit”

John Hennessy, FCRC 1999: “Availability, maintainability and ease
of upgrades, scalability”

Fox & Brewer, HotOS 1997: BASE “best-effort service, availability,
soft state, eventual consistency”
We’re all singing the same tune, but what is new?…
What’s New and Different


Evolution and change are integral

not true of many “traditional” five nines systems: long design
cycle, changes incur high overhead for design/spec/testing

Last version of space shuttle software: 1 bug in 420 KLOC, cost
$35M/yr to maintain (good quality commercial SW: 1 bug/KLOC)

But, recent upgrade for GPS support required generating 2,500
pages of specs before changing anything in 6.3 KLOC (1.5%)
Performance still important, but focus changed

Interactive performance and availability to end users is key

Users appear willing to occasionally tolerate temporary
degradation (“service quality”) in exchange for improved
availability

How to capture this tradeoff: soft/stale state, partial performance
degradation, imprecise answers…
ROC Philosophy

ROC philosophy (“Peres’s Law”):
“If a problem has no solution, it may not be a problem, but a fact; not to be
solved, but to be coped with over time”
Shimon Peres


Failures (hardware, software, operator-induced) are a fact;
recovery is how we cope with them over time

Availability = MTTF/MTBF = MTTF / (MTTF + MTTR)
Rather than just making MTTF very large, make MTTR << MTTF
Why?
1. Human errors will still cause outages => minimize recovery time
2. Recovery time is directly measurable, and directly captures impact on users of a specific outage incident (MTTF doesn’t)
3. Rapid evolution makes exhaustive testing/validation impossible => unexpected/transient failures will still occur
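The availability formula above can be tried with numbers. This is a minimal arithmetic sketch (the figures are illustrative, not from the talk) of why shrinking MTTR moves availability at least as effectively as growing MTTF:

```python
# Availability = MTTF / (MTTF + MTTR). All times in hours.

def availability(mttf_hours, mttr_hours):
    return mttf_hours / (mttf_hours + mttr_hours)

# Baseline: one failure a month, four hours to recover.
base = availability(30 * 24, 4)

# Option A: 10x the MTTF (hard to achieve, and "measuring" it
# would take hundreds of system-years).
better_mttf = availability(300 * 24, 4)

# Option B: cut MTTR to ~10 seconds via redundancy and failover
# (directly measurable after every incident).
better_mttr = availability(30 * 24, 10 / 3600)

print(base, better_mttf, better_mttr)
# Option B yields higher availability than Option A here.
```

The point of the sketch: the MTTR lever is both cheaper to pull and directly verifiable, while the MTTF lever cannot even be measured on realistic timescales.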
1. Human Error Is Inevitable


Human error major factor in downtime…

PSTN: Half of all outage incidents and outage-minutes from
1992-1994 were due to human error (including errors by phone
company maintenance workers)

Oracle: up to half of DB failures due to human error (1999)

Microsoft blamed human error for ~24-hour outage in Jan 2001
Approach:

Learn from psychology of human error and disaster case studies

Build in system support for recovery from human errors

Use tools such as error injection, virtual machine technology to
provide “flight simulator” training for operators
The 3R undo model

Undo == time travel for system operators

Three R’s for recovery

Rewind: roll system state backwards in time

Repair: change system to prevent failure



e.g., edit history, fix latent error, retry unsuccessful operation,
install preventative patch
Replay: roll system state forward, replaying end-user
interactions lost during rewind
All three R’s are critical



rewind enables undo
repair lets user/administrator fix problems
replay preserves updates, propagates fixes forward
Example e-mail scenario

Before undo:

virus-laden message arrives

user copies it into a folder without looking at it

Operator invokes undo (rewind) to install virus filter
(repair)

During replay:

message is redelivered but now discarded by virus filter

copy operation is now unsafe (source message doesn’t exist)

compensating action: insert placeholder for message

now copy command can be executed, making history replay acceptable
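The e-mail scenario above can be sketched in code. This is a toy model of the 3R cycle (hypothetical class and operation names; it is not the actual IMAP wrapper): rewind restores a snapshot, repair installs a virus filter, and replay re-executes the logged user operations, applying a compensating action when an operation is no longer valid.

```python
import copy

class UndoableStore:
    """Toy 3R sketch: rewind restores a snapshot, repair changes the
    system (e.g., installs a filter), replay re-applies logged end-user
    operations with compensation where needed."""

    def __init__(self):
        self.state = {"inbox": [], "saved": []}
        self.filters = []   # repair step can install filters here
        self.log = []       # end-user operations, oldest first
        self.snaps = []     # snapshot taken before each operation

    def do(self, op):
        self.snaps.append(copy.deepcopy(self.state))
        self.log.append(op)
        self._exec(op)

    def _exec(self, op):
        kind, arg = op
        if kind == "deliver":
            # Filters installed during repair now discard the message.
            if not any(f(arg) for f in self.filters):
                self.state["inbox"].append(arg)
        elif kind == "copy":
            if arg not in self.state["inbox"]:
                # Compensating action: source message no longer exists,
                # so insert a placeholder to keep the history replayable.
                arg = f"[placeholder for {arg}]"
            self.state["saved"].append(arg)

    def rewind(self, n):
        """Rewind: roll system state back n operations."""
        self.state = self.snaps[-n]
        del self.snaps[-n:]
        tail, self.log = self.log[-n:], self.log[:-n]
        return tail   # operations lost during rewind, to be replayed

    def replay(self, ops):
        """Replay: roll forward, re-executing lost user interactions."""
        for op in ops:
            self.do(op)

# The scenario from the slide:
s = UndoableStore()
s.do(("deliver", "virus.exe"))            # virus-laden message arrives
s.do(("copy", "virus.exe"))               # user copies it into a folder
lost = s.rewind(2)                        # Rewind
s.filters.append(lambda m: "virus" in m)  # Repair: install virus filter
s.replay(lost)                            # Replay with compensation
print(s.state)  # inbox empty; saved folder holds a placeholder
```

The compensating action is what makes replay safe: history is not replayed verbatim, but transformed just enough to remain consistent with the repair.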
First implementation attempt

Undo wrapper for open source IMAP email store
[Architecture diagram: a 3R layer (3R proxy, state tracker, undo log) sits in front of the email server, intercepting SMTP and IMAP traffic and recording changes to non-overwriting storage. Tracked state includes user state, mailboxes, the application, and the operating system.]
3. Handling Transient Failures via Restart

Many failures are either (a) transient and fixable through reboot, or (b) non-transient, but reboot is the lowest-MTTR fix

Recursive Restarts: to minimize MTTR, restart the minimal set of subsystems that could cure a failure; if that doesn’t help, restart the next-higher containing set, etc.

Partial restarts/reboots
 Return system (mostly) to well-tested, well-understood start state
 High-confidence way to reclaim stale/leaked resources
 Unlike true checkpointing, reboot more likely to avoid repeated failure due to corrupted state
 We focus on proactive restarts; can also be reactive (SW rejuvenation)
 “Easier to run a system 365 times for 1 day than 365 days”

Goals:
 What software structure can best accommodate such failure management while still preserving all other requirements (functionality, performance, consistency, etc.)?
 Develop methodology for building and managing RR systems (concrete engineering methods)
 Develop tools for building, testing, deploying, and managing RR systems
 Design for fast restartability in online-service building blocks
A Hierarchy of Restartable Units


Siblings highly fault-isolated
 low level: by high-confidence, low-level, HW-assisted machinery (e.g., MMU, physical isolation)
 higher level: by VM-level abstractions based on the above machinery (e.g., JVM, HW VM, process)
R-map (= hierarchy of restartable component groups) captures restart dependencies
 Groups of restart units can be restarted by common parent
 Restarting a node restarts everything in its subtree
 A failure is minimally curable at a specific node
 Restarts farther up the tree are more expensive, but higher confidence for curing transients
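The escalation policy described above can be sketched as a small tree walk. This is a hypothetical illustration (names like `RUnit` and the frontend/http/cache r-map are invented for the example, not from the ROC codebase): restart the smallest unit first; if the transient persists, escalate to the parent, which restarts its entire subtree.

```python
# Toy r-map: a tree of restartable units. A failure is first cured at
# the smallest node; if that doesn't help, restart escalates upward.

class RUnit:
    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)
        self.parent = None
        for c in self.children:
            c.parent = self

    def restart(self, trace):
        # Restarting a node restarts everything in its subtree.
        for c in self.children:
            c.restart(trace)
        trace.append(self.name)

def recursive_restart(unit, is_cured, trace):
    """Restart `unit`; escalate to its parent until the failure is cured.
    Restarts farther up the tree cost more (bigger subtree restarted)
    but give higher confidence of curing a transient."""
    while unit is not None:
        unit.restart(trace)
        if is_cured():
            return True
        unit = unit.parent
    return False

# Example r-map: frontend contains {http, cache}.
cache = RUnit("cache")
http = RUnit("http")
frontend = RUnit("frontend", [http, cache])

attempts = iter([False, True])    # cured only after one escalation
trace = []
recursive_restart(cache, lambda: next(attempts), trace)
print(trace)  # ['cache', 'http', 'cache', 'frontend']
```

Restarting `cache` alone fails to cure the fault, so the walk escalates to `frontend`, whose restart takes the whole subtree down with it, exactly the MTTR-vs-confidence tradeoff the slide describes.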
RR-ifying a satellite ground station


Biggest improvement: MTTF/MTTR-based boundary redrawing

Ability to isolate unstable components without penalizing whole system

Achieve a balanced MTTF/MTTR ratio across components at the same level
Lower MTTR may be strictly better than higher MTTF

unplanned downtime is more expensive than planned downtime, and
downtime under a heavy/critical workload (e.g., satellite pass) is
more expensive than downtime under a light/non-critical workload.

high MTTF doesn’t guarantee failure-free operation interval, but sufficiently
low MTTR may mitigate impact of failure

Current work is applying RR to a ubiquitous computing environment, a
J2EE application server, and an OSGi-based platform for cars → new
lessons will emerge (e.g., the r-tree needs to be an r-DAG)

Most of these lessons are not surprising, but RR provides a uniform
framework within which to discuss them
MTTR Captures Outage Costs

Recent software-related outages at Ebay: 4.5 hours in
Apr02, 22 hours Jun99, 7 hours May99, 9 hours Dec98

Assume two 4-hour (“newsworthy”) outages/year


A=(182*24 hours)/(182*24 + 4 hours) = 99.9%

Dollar cost: Ebay policy for >2 hour outage, fees credited to all
affected users (US$3-5M for Jun99)

Customer loyalty: after Jun99 outage, Yahoo Auctions reported
statistically significant increase in users

Ebay’s market cap dropped US$4B after Jun99 outage, stock
price dropped 25%
Newsworthy due to number of users affected, given
length of outage
Outage costs, cont.



What about a 10-minute outage once per week?

A=(7*24 hours)/(7*24 + 1/6 hours) = 99.9% - the same

Can we quantify “savings” over the previous scenario?
Shorter outages affect fewer users at a time

Typical AOL email “outage” affects 1-2% of users

Many short outages may affect different subsets of users
Shorter outages typically not news-worthy
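The arithmetic from the two preceding slides makes the point concrete: the availability number is blind to outage shape, yet the user impact differs sharply. A minimal sketch of both computations:

```python
# Two outage patterns with identical availability but very
# different user impact.

def availability(up_hours, down_hours):
    return up_hours / (up_hours + down_hours)

# Scenario 1: a "newsworthy" 4-hour outage every half-year (182 days).
a_long = availability(182 * 24, 4)

# Scenario 2: a 10-minute outage once per week.
a_short = availability(7 * 24, 10 / 60)

print(round(a_long, 3), round(a_short, 3))  # both round to 0.999
```

Both scenarios come out at "three nines", which is why MTTR (and who is affected during it) captures outage cost in a way the availability percentage alone does not.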
When Low MTTR Trumps High MTTF


MTTR is directly measurable; MTTF usually not

Component MTTF’s -> tens of years

Software MTTF ceiling -> ~30 yrs (Gray, HDCC 01)

Result: “measuring” MTTF requires 100’s of system-years

But, MTTR’s are minutes to hours, even for complex SW
components
MTTR more directly captures impact of a specific outage

Very low MTTR (~10 seconds) achievable with redundancy and
failover

Keeps response time below user threshold of distraction [Miller
1968, Bhatti et al 2001, Zona Research 1999]
Degraded Service vs. Outage

How about longer MTTR’s (minutes or hours)?

Can service be designed so that “short” outages appear
to users as temporary degradation instead?


How much degradation will users tolerate?

For how long (until they abandon the site because it feels like a
true outage - abandonment can be measured)

How frequently?
Even if above thresholds can be deduced, how to design
service so that transient failures can be mapped onto
degraded quality?
Examples of degraded service
(Each example: nature of degradation; users affected; thresholds; mechanism)
 AOL: see only headers (not body) for some email messages. Users affected: 1.5-2% typical. Threshold: if >1 minute, treated as outage. Mechanism: messages on failed servers unavailable, but metadata kept on Tandem cluster.
 Inktomi: reduced search harvest. Users affected: all users. Threshold: varies. Mechanism: lossy reads to avoid failed servers.
 CNN.com: above-the-fold content only. Users affected: all users. Threshold: varies. Mechanism: fast but “manual” reconfiguration of front page on dynamic content server.
 Anonymizer: slower service. Users affected: non-paying users. Threshold: indefinite. Mechanism: ?
Goal: derive a set of service “primitives” that directly reflect
parameterizable degradation due to transient failure (“theory” is too
strong…)
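One of the mechanisms in the table, "lossy reads to avoid failed servers", can be sketched as a degradation primitive. This is a hypothetical illustration of the idea (the function names and partition model are invented, not Inktomi's implementation): query every index partition, skip the ones that fail, and report the fraction of the index actually consulted as the "harvest".

```python
# Lossy reads over a partitioned search index: failed partitions are
# skipped, so users see partial results instead of an outage.

def lossy_search(query, partitions):
    """Query every index partition; degrade by skipping failed ones."""
    results, reached = [], 0
    for part in partitions:
        try:
            results.extend(part(query))
            reached += 1
        except ConnectionError:
            continue  # degrade: silently drop this partition's results
    harvest = reached / len(partitions)  # fraction of index consulted
    return results, harvest

# One healthy partition stub and one failed one, for illustration.
def ok(query):
    return [query + "-hit"]

def down(query):
    raise ConnectionError("partition unreachable")

results, harvest = lossy_search("roc", [ok, down, ok, ok])
print(results, harvest)  # 3 hits, harvest = 0.75
```

The `harvest` value is the parameter the slide asks for: a directly measurable knob that expresses transient failure as degraded quality rather than unavailability.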
Two Frequently Asked Questions
1. Is ROC the same as autonomic computing™?
2. Are you saying we should build lousy hardware and software and mask all those failures with ROC mechanisms?
1. Does ROC==autonomic computing?



Self-administering?

For now, focus on empowering administrators, not eliminating them

Humans are good at detecting and learning from own mistakes, so
why not? (avoiding automation irony)

We’re not sure we understand sysadmins’ current techniques well
enough to think about automation
Self-healing, self-reprovisioning, self-load-balancing…?

Sure - Web services and datacenters already do this for many
situations; many techniques and tools are “well known”

But - do we know how (“theory”) to design the app software to make
these techniques possible?
Digital immune system - it’s in WinXP
2. What ROC is not


We do not advocate for…

producing buggy software

building lousy hardware

slacking on design, testing, or careful administration

discarding existing useful techniques or tools
We do advocate for…

an increased focus on lowering MTTR specifically

increased examination of when some guarantees can be traded
for lower MTTR

systematic exploration of “design for fast recovery” in the context
of a variety of applications

stealing great ideas from systems, Internet protocols, psychology,
safety-critical systems design
Summary: ROC and Online Services


Current software realities lead to new foci

Rapid evolution => traditional FT methodologies difficult to apply

Human error inevitable, but humans are good at identifying own
errors => provide facilities to allow recovery from these

HW and SW failure inevitable => use redundancy and designed-in ability to substitute temporary degradation for outages
(“design for recovery”)
Trying to stay relevant via direct contact with
designers/operators of large systems

Need real data on how large systems fail

Need real data on how different kinds of failures are perceived by
users
Interested in ROCing?

Are you willing to anonymously share failure data?

Already great relationships (and in some cases data-sharing
agreements) with BEA, IBM, HP, Keynote, Microsoft, Oracle,
Tellme, Yahoo!, others

See http://roc.stanford.edu or http://roc.cs.berkeley.edu
for publications, talks, research areas, etc.

Contact Armando Fox ([email protected])
or Dave Patterson ([email protected])
Discussion Question
[For discussion] So what if you pick the low-hanging fruit? The challenge is in reaching the highest leaves.