Reliable Multicast for Time-Critical Systems Mahesh Balakrishnan Ken Birman Cornell University Mission-Critical Datacenters  COTS Datacenters  Online e-tailers, search engines, corporate applications  Web-services  Mission-Critical Apps  Need: Scalability, Availability, Fault-Tolerance … Timeliness!

Reliable Multicast for
Time-Critical Systems
Mahesh Balakrishnan
Ken Birman
Cornell University
Mission-Critical Datacenters

COTS Datacenters
• Online e-tailers, search engines, corporate applications
• Web-services

Mission-Critical Apps
• Need: Scalability, Availability, Fault-Tolerance … Timeliness!
The Time-Critical Datacenter

Migrating time-critical applications to commodity datacenters…

… conversely, providing datacenter web-services with time-critical performance.
What’s a Time-Critical System?

Not ‘real time’, but ‘real fast’!

Financial calculators, military command and control… air traffic control (ATC)

… foobooks.com!

Technology Gap: Real-Time focuses on determinism, scale-up architectures
The French ATC System
Mid to late 1990s
• Teams of 3-5 air traffic controllers on a cluster of desktop consoles
• 50-200 of these console clusters in an air traffic control center
• Why study the French ATC?

ATC Subsystems

Radar Image
Weather Alert
Track Updates
Updates to Flight Plans
Console to Console State Updates
System Management and Monitoring
ATC center to center Updates

Multicast ubiquitous…
Two Kinds of Multicast

Virtually Synchronous Multicast: very reliable, not particularly fast

Unreliable Multicast: very fast, not particularly reliable

Nothing in between!
Two Kinds of Subsystems

Category 1: Complete reliability (virtual synchrony), e.g., routing decisions

Category 2: Careful application design + natural hardware properties + management policies, e.g., radar
Multicast in the French ATC

Engineering Lessons:
• Structure application to tolerate partial failures
• Exploit natural hardware properties

Can we generalize to modern systems?

Research Direction: Time-Critical Reliability
• Can we design communication primitives that encapsulate these lessons?
Anatomy of a Cloned Service
[Figure: RACS. Updates are multicast to the whole group; queries are unicast to single nodes.]
Services

An Amazon web page is constructed by 100s of co-operating services*

Multicast is used for:
• Updating Cloned Services
• Publish-Subscribe / Eventing
• Datacenter Management/Monitoring

* Werner Vogels, CTO of amazon.com, at SOSP 2005
Multicast in the Datacenter

A node is in many multicast groups:
• One for each service it hosts
• One for each topic it subscribes to
• One or more administration groups

Large Numbers of Overlapping Groups!
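
To make the overlap concrete, here is a minimal sketch of how one node's memberships could be derived from the three sources listed on this slide. The `Node` class, the `svc/`, `topic/` and `admin/` name prefixes, and the `overlap` helper are all hypothetical names chosen for illustration; the slides do not describe a membership API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A datacenter node; all field and group names are illustrative."""
    services: set = field(default_factory=set)      # services hosted on this node
    topics: set = field(default_factory=set)        # pub-sub topics it subscribes to
    admin_groups: set = field(default_factory=set)  # management/monitoring groups

    def multicast_groups(self):
        """One group per hosted service, one per topic, plus the admin groups."""
        groups = {f"svc/{s}" for s in self.services}
        groups |= {f"topic/{t}" for t in self.topics}
        groups |= {f"admin/{a}" for a in self.admin_groups}
        return groups


def overlap(a, b):
    """Groups that two nodes share, i.e. multicasts both of them receive."""
    return a.multicast_groups() & b.multicast_groups()
```

With hundreds of services and topics per datacenter, most pairs of nodes share some groups but not all of them, which is the "arbitrarily overlapping" case a multicast primitive has to handle.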
Service Semantics
[Figure: service boxes labeled Product Popularity Service, User History Service, Store Inventory, Shipping Scheduler, User Profile, Product Recommendations]
Data Store Services: stale data can result in overselling / underselling → loss of real-world dollars
Data Cache Services: updated periodically by back-end data-stores
The Challenge

Datacenter Blades are failure-prone:
• Crash failures
• Byzantine behavior
• Bursty Packet Loss: End-host kernels drop packets when subjected to traffic spikes.
A New Reliability Model
Rapid delivery is more important than perfect reliability
• Probabilistic Timeliness
• Graceful Degradation

Wanted: a multicast primitive that
1. Scales to large numbers of arbitrarily overlapping multicast groups
2. Delivers multicasts quickly
3. Tolerates datacenter failure modes: bursty packet loss, node failures
4. Offers probabilistic properties
5. ‘Gives up’ on lost data after a threshold period
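
The last requirement, giving up on lost data after a threshold, can be sketched as a small receiver-side tracker. This is only an illustration under stated assumptions: `LossTracker`, its method names, per-sender sequence numbers, and the `give_up_after` parameter are hypothetical and are not taken from the slides.

```python
class LossTracker:
    """Tracks missing sequence numbers and abandons them after a deadline.

    Hypothetical helper: the class, its methods, and the threshold are
    assumptions made for illustration only.
    """

    def __init__(self, give_up_after):
        self.give_up_after = give_up_after  # seconds to wait before giving up
        self.next_seq = 0                   # next sequence number expected
        self.missing = {}                   # seq -> time the gap was first noticed

    def on_receive(self, seq, now):
        """Record an arrival; any skipped sequence numbers become 'missing'."""
        for gap in range(self.next_seq, seq):
            self.missing.setdefault(gap, now)
        self.missing.pop(seq, None)         # a late or repaired packet fills its gap
        self.next_seq = max(self.next_seq, seq + 1)

    def expire(self, now):
        """Return (and forget) sequence numbers whose deadline has passed."""
        expired = sorted(s for s, t in self.missing.items()
                         if now - t >= self.give_up_after)
        for s in expired:
            del self.missing[s]
        return expired
```

An application would call `expire` periodically and treat the returned sequence numbers as permanently lost, which lets it degrade gracefully instead of stalling on a retransmission.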
Ricochet: Lateral Error Correction
Receivers exchange error correction XORs of multicast traffic
• Works very well with multiple groups: scales up to a thousand groups per node
• Probabilistic Timeliness: probability distribution of delivery latencies
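
The XOR idea can be illustrated with a minimal sketch: one receiver XORs several packets it has received and sends the result to a peer, and a peer missing exactly one of those packets can rebuild it. The function names and the `{seq: payload}` representation are assumptions; the slide does not specify which packets are combined, the wire format, or how peers are chosen. Payloads are assumed to be of equal length (shorter ones are zero-padded, so a recovered payload may carry trailing zero bytes).

```python
def xor_bytes(a, b):
    """XOR two byte strings, padding the shorter one with zero bytes."""
    n = max(len(a), len(b))
    a, b = a.ljust(n, b"\x00"), b.ljust(n, b"\x00")
    return bytes(x ^ y for x, y in zip(a, b))


def make_repair(packets):
    """Build a repair from {seq: payload}: the covered seqs and the XOR of payloads."""
    xor = b""
    for payload in packets.values():
        xor = xor_bytes(xor, payload)
    return sorted(packets), xor


def recover(repair, received):
    """Rebuild the one missing packet covered by a repair.

    Returns (seq, payload), or None when zero or several covered
    packets are missing, since a single XOR can restore only one.
    """
    seqs, xor = repair
    missing = [s for s in seqs if s not in received]
    if len(missing) != 1:
        return None
    for s in seqs:
        if s in received:
            xor = xor_bytes(xor, received[s])
    return missing[0], xor
```

For example, a repair built over packets 1, 2 and 3 lets a peer holding only 1 and 3 reconstruct packet 2 without contacting the sender.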

Predictive Total Ordering (Plato)
Delivers messages to applications with no ordering delay in most cases
• Orders messages only if there is a high probability of out-of-order delivery across different nodes
• Probabilistic Timeliness: probability distribution of ordered delivery latency
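
A rough sketch of the "order only when needed" decision follows. The slide does not say how the probability of out-of-order delivery is estimated, so the predictor here is purely an assumption: it flags a message as risky when it arrives within `risk_window` seconds of a message from a different sender. `PredictiveDelivery`, `risk_window`, and the `"deliver"` / `"order"` return values are hypothetical names.

```python
class PredictiveDelivery:
    """Decides per message whether to deliver immediately or order first.

    The inter-arrival heuristic is an illustrative assumption, not the
    published predictor.
    """

    def __init__(self, risk_window):
        self.risk_window = risk_window  # seconds; closer arrivals count as risky
        self.last_arrival = None        # (time, sender) of the previous message

    def on_message(self, sender, now):
        """Return 'deliver' for immediate delivery, 'order' to run ordering first."""
        prev = self.last_arrival
        self.last_arrival = (now, sender)
        if prev is None:
            return "deliver"
        prev_time, prev_sender = prev
        close = (now - prev_time) < self.risk_window
        # Messages from different senders arriving close together may be
        # seen in a different order on other nodes, so they are held back.
        if close and sender != prev_sender:
            return "order"
        return "deliver"
```

Under light or well-spaced traffic every message takes the fast path; only closely spaced messages from different senders pay the ordering delay.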

Performance

SRM takes seconds to recover lost packets

Ricochet recovers almost all packets within ~70 milliseconds
Conclusion

Move from real-time (R/T) to time-critical (T/C) yields huge benefits!
Ricochet is faster… slashes latency… scalable…
Clean delivery delay curve is a powerful design tool, replacing traditional hard (but conservative) limits

We’re open for business:
Software and detailed paper available for download
Give it a try… tell us what you think!
www.cs.cornell.edu/projects/quicksilver/ricochet.html
