Reliable Multicast for
Time-Critical Systems
Mahesh Balakrishnan
Ken Birman
Cornell University
Mission-Critical Datacenters
COTS Datacenters
Online e-tailers, search engines, corporate applications
Web-services
Mission-Critical Apps
Need:
Scalability, Availability, Fault-Tolerance
… Timeliness!
The Time-Critical Datacenter
Migrating time-critical applications to commodity datacenters…
… and conversely, providing datacenter web services with time-critical performance.
What’s a Time-Critical System?
Not ‘real time’, but ‘real fast’!
Financial calculators, military command and control… air traffic control (ATC)
… foobooks.com!
Technology Gap: Real-Time research focuses on determinism and scale-up architectures
The French ATC System
Mid-to-late 1990s
Teams of 3-5 air traffic controllers on a cluster of desktop consoles
50-200 of these console clusters in an air traffic control center
Why study the French ATC?
ATC Subsystems
Radar Image
Weather Alert
Track Updates
Updates to Flight Plans
Console to Console State Updates
System Management and Monitoring
ATC center to center Updates
Multicast ubiquitous…
Two Kinds of Multicast
Virtually Synchronous Multicast: very reliable, not particularly fast
Unreliable Multicast: very fast, not particularly reliable
Nothing in between!
Two Kinds of Subsystems
Category 1: Complete reliability (virtual synchrony), e.g., Routing decisions
Category 2: Careful application design + natural hardware properties + management policies, e.g., Radar
Multicast in the French ATC
Engineering Lessons:
Structure the application to tolerate partial failures
Exploit natural hardware properties
Can we generalize to modern systems?
Research Direction: Time-Critical Reliability
Can we design communication primitives that encapsulate these lessons?
Anatomy of a Cloned Service
Updates multicast to the whole group
RACS
Queries unicast to single nodes
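The split on this slide can be sketched in a few lines. This is a minimal in-memory illustration with hypothetical class and method names, not code from the talk: updates are delivered to every clone in the group, while any single clone can answer a query.

```python
import random

# Sketch of a cloned service (hypothetical names): writes are
# "multicast" to every clone, reads are "unicast" to one clone.

class Clone:
    def __init__(self):
        self.state = {}

    def apply_update(self, key, value):
        self.state[key] = value

    def query(self, key):
        return self.state.get(key)

class ClonedService:
    def __init__(self, n_clones):
        self.clones = [Clone() for _ in range(n_clones)]

    def update(self, key, value):
        # Update path: deliver to the whole group of clones.
        for c in self.clones:
            c.apply_update(key, value)

    def query(self, key):
        # Query path: any single clone can answer.
        return random.choice(self.clones).query(key)

svc = ClonedService(n_clones=3)
svc.update("sku-42", "in stock")
assert svc.query("sku-42") == "in stock"
```

The pattern only works if every clone really sees every update, which is exactly why the reliability and timeliness of the multicast primitive matter.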
Services
An Amazon web-page is constructed by 100s of co-operating services*
Multicast is used for:
Updating
Cloned Services
Publish-Subscribe / Eventing
Datacenter Management/Monitoring
* Werner Vogels, CTO of amazon.com, at SOSP 2005
Multicast in the Datacenter
A node is in many multicast groups:
One for each service it hosts
One for each topic it subscribes to
One or more administration groups
Large Numbers of Overlapping Groups!
Service Semantics
[Diagram: example services, including Product Popularity Service, User History Service, Store Inventory, Shipping Scheduler, User Profile, and Product Recommendations]
Data Store Services: stale data can result in overselling / underselling, i.e., loss of real-world dollars
Data Cache Services: updated periodically by back-end data-stores
The Challenge
Datacenter blades are failure-prone:
Crash failures
Byzantine behavior
Bursty packet loss: end-host kernels drop packets when subjected to traffic spikes.
A New Reliability Model
Rapid delivery is more important than perfect reliability
Probabilistic Timeliness
Graceful Degradation
Wanted: a multicast primitive that
1. Scales to large numbers of arbitrarily overlapping multicast groups
2. Delivers multicasts quickly
3. Tolerates datacenter failure modes: bursty packet loss, node failures
4. Offers probabilistic properties
5. ‘Gives up’ on lost data after a threshold period
Ricochet: Lateral Error Correction
Receivers exchange error-correction XORs of multicast traffic
Works very well with multiple groups: scales up to a thousand groups per node
Probabilistic Timeliness: probability distribution of delivery latencies
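The XOR idea underlying this slide can be sketched as follows. This is a minimal illustration with hypothetical function names, not the actual Ricochet protocol: a repair packet that XORs a window of equal-length data packets lets a receiver rebuild any single packet lost from that window, and in Ricochet such repairs are exchanged laterally between receivers.

```python
# Sketch: recovering one lost packet from an XOR repair packet.
# Hypothetical names; Ricochet's real protocol exchanges such XOR
# repairs between receivers across overlapping groups.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def make_repair(packets: list) -> bytes:
    """XOR of a window of equal-length data packets."""
    repair = packets[0]
    for p in packets[1:]:
        repair = xor_bytes(repair, p)
    return repair

def recover(received: list, repair: bytes) -> bytes:
    """Reconstruct the single missing packet in the window."""
    missing = repair
    for p in received:
        missing = xor_bytes(missing, p)
    return missing

window = [b"pkt1", b"pkt2", b"pkt3"]
repair = make_repair(window)
# Suppose pkt2 is lost in a burst; XOR the survivors with the repair:
assert recover([window[0], window[2]], repair) == b"pkt2"
```

Because recovery needs only packets the receivers already have, a loss can be repaired in one local exchange rather than a round-trip to the sender, which is what keeps recovery latency in the tens of milliseconds.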
Predictive Total Ordering (Plato)
Delivers messages to applications with no ordering delay in most cases
Orders messages only if there is a high probability of out-of-order delivery across different nodes
Probabilistic Timeliness: probability distribution of ordered delivery latency
Performance
SRM takes seconds to recover lost packets
Ricochet recovers almost all packets within ~70 milliseconds
Conclusion
Moving from real-time (R/T) to time-critical (T/C) yields huge benefits!
Ricochet is faster… slashes latency… scalable…
A clean delivery-delay curve is a powerful design tool, replacing traditional hard (but conservative) limits
We’re open for business:
Software and detailed paper available for download
Give it a try… tell us what you think!
www.cs.cornell.edu/projects/quicksilver/ricochet.html