Programming Model and Protocols
for Reconfigurable Distributed Systems
COSMIN IONEL ARAD
Doctoral Thesis Defense, 5th June 2013
KTH Royal Institute of Technology
https://www.kth.se/profile/icarad/page/doctoral-thesis/
Presentation Overview
• Context, Motivation, and Thesis Goals
• Introduction & design philosophy
• Distributed abstractions & P2P framework
• Component execution & scheduling
• Distributed systems experimentation
– Development cycle: build, test, debug, deploy
• A scalable & consistent key-value store
– System architecture and testing using Kompics
– Scalability, Elasticity, and Performance Evaluation
• Conclusions
2
Trend 1: Computer systems are
increasingly distributed
• For fault-tolerance
– E.g.: replicated state machines
• For scalability
– E.g.: distributed databases
• Due to inherent geographic distribution
– E.g.: content distribution networks
3
Trend 2: Distributed systems are
increasingly complex
connection management, location and routing,
failure detection, recovery, data persistence,
load balancing, scheduling, self-optimization,
access-control, monitoring, garbage collection,
encryption, compression, concurrency control,
topology maintenance, bootstrapping, ...
4
Trend 3: Modern Hardware is
increasingly parallel
• Multi-core and many-core processors
• Concurrent/parallel software is needed to leverage hardware parallelism
• Major software concurrency models
– Message-passing concurrency
• Data-flow concurrency viewed as a special case
– Shared-state concurrency
5
Distributed Systems are still Hard…
• … to implement, test, and debug
• Sequential sorting is easy
– Even for a first-year computer science student
• Distributed consensus is hard
– Even for an experienced practitioner with all the necessary expertise
6
Experience from building Chubby,
Google’s lock service, using Paxos
“The fault-tolerance computing community
has not developed the tools to make it
easy to implement their algorithms.
The fault-tolerance computing community has
not paid enough attention to testing,
a key ingredient for building fault-tolerant
systems.” [Paxos Made Live]
Tushar Deepak Chandra
Edsger W. Dijkstra Prize in Distributed Computing 2010
7
A call to action
“It appears that the fault-tolerant distributed
computing community has not developed the
tools and know-how to close the gaps
between theory and practice with the same
vigor as for instance the compiler community.
Our experience suggests that these gaps are
non-trivial and that they merit attention by
the research community.” [Paxos Made Live]
Tushar Deepak Chandra
Edsger W. Dijkstra Prize in Distributed Computing 2010
8
Thesis Goals
• Raise the level of abstraction in
programming distributed systems
• Make it easy to implement, test, debug,
and evaluate distributed systems
• Attempt to bridge the gap between the
theory and the practice of fault-tolerant
distributed computing
9
We want to build distributed systems
10
by composing distributed protocols
[Diagram: several nodes of the distributed system, each composed of the same stack of protocols: Application, Consensus, Broadcast, Failure Detector, Network, and Timer.]
11
implemented as reactive,
concurrent components
[Diagram: the same nodes, with each protocol implemented as a reactive, concurrent component.]
12
with asynchronous communication
and message-passing concurrency
[Diagram: one node's component stack (Application, Consensus, Broadcast, Failure Detector, Network, Timer); components interact only through asynchronous events, handled with message-passing concurrency.]
13
Design principles
• Tackle increasing system complexity through
abstraction and hierarchical composition
• Decouple components from each other
– publish-subscribe component interaction
– dynamic reconfiguration for always-on systems
• Decouple component code from its executor
– same code executed in different modes:
production deployment, interactive stress testing,
deterministic simulation for replay debugging
14
Nested hierarchical composition
• Model entire sub-systems as first-class
composite components
– Richer architectural patterns
• Tackle system complexity
– Hiding implementation details
– Isolation
• Natural fit for developing distributed systems
– Virtual nodes
– Model entire system: each node as a component
15
Message-passing concurrency
• Compositional concurrency
– Free from the idiosyncrasies of locks and threads
• Easy to reason about
– Many concurrency formalisms: the Actor model (1973),
CSP (1978), CCS (1980), π-calculus (1992)
• Easy to program
– See the success of Erlang, Go, Rust, Akka, ...
• Scales well on multi-core hardware
– Almost all modern hardware
16
Loose coupling
• “Where ignorance is bliss, 'tis folly to be wise.”
– Thomas Gray, Ode on a Distant Prospect of Eton College (1742)
• Communication integrity
– Law of Demeter
• Publish-subscribe communication
• Dynamic reconfiguration
17
Design Philosophy
1. Nested hierarchical composition
2. Message-passing concurrency
3. Loose coupling
4. Multiple execution modes
18
Component Model
• Event
• Port
• Component
• Channel
• Handler
• Subscription
• Publication / event trigger
[Diagram: components with ports connected by channels; handlers subscribed to ports. See the sketch below.]
19
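To make these concepts concrete, here is a minimal sketch of how an event and a port could be declared in Kompics-style Java. The Ping, Pong, and PingPongPort names are invented for illustration; the KompicsEvent/PortType shapes follow the framework's general Java API, though exact class names may differ between framework versions.

    import se.sics.kompics.KompicsEvent;
    import se.sics.kompics.PortType;

    // Events are plain, passive message objects.
    class Ping implements KompicsEvent { }
    class Pong implements KompicsEvent { }

    // A port is a typed, bidirectional event interface: it declares which event
    // types may travel in each direction. Components then provide or require it.
    class PingPongPort extends PortType {
        {
            request(Ping.class);     // sent by components that require the port
            indication(Pong.class);  // sent by the component that provides the port
        }
    }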
A simple distributed system
[Diagram: two processes, Process1 and Process2. Each runs an Application component, with handlers for Ping and Pong messages, on top of a Network component; the Application on one process sends a Ping over the Network and handles the Pong reply, and the other process does the converse.]
22
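Continuing the sketch, the ping-pong Application shown in the diagram above could be written as two Kompics-style components, reusing the Ping, Pong, and PingPongPort declarations from the earlier sketch. For brevity they exchange events over that local port rather than over the Network abstraction; the names and details are illustrative assumptions, not the thesis code.

    import se.sics.kompics.ComponentDefinition;
    import se.sics.kompics.Handler;
    import se.sics.kompics.Negative;
    import se.sics.kompics.Positive;
    import se.sics.kompics.Start;

    // Sends a Ping when started and reacts to the Pong indication.
    class Pinger extends ComponentDefinition {
        Positive<PingPongPort> port = requires(PingPongPort.class);

        Handler<Start> onStart = new Handler<Start>() {
            public void handle(Start event) {
                trigger(new Ping(), port);       // publish a request on the port
            }
        };
        Handler<Pong> onPong = new Handler<Pong>() {
            public void handle(Pong event) {
                System.out.println("received Pong");
            }
        };
        {
            subscribe(onStart, control);         // 'control' carries lifecycle events
            subscribe(onPong, port);
        }
    }

    // Provides the port and answers every Ping with a Pong.
    class Ponger extends ComponentDefinition {
        Negative<PingPongPort> port = provides(PingPongPort.class);

        Handler<Ping> onPing = new Handler<Ping>() {
            public void handle(Ping event) {
                trigger(new Pong(), port);
            }
        };
        { subscribe(onPing, port); }
    }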
A Failure Detector Abstraction using
a Network and a Timer Abstraction
[Diagram: a Ping Failure Detector component provides the Eventually Perfect Failure Detector abstraction (indication events: Suspect, Restore; request events: StartMonitoring, StopMonitoring) and requires the Network and Timer abstractions, provided here by MyNetwork and MyTimer.]
23
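The port in the diagram above can be rendered directly from the event names on the slide. This is a sketch in the same Kompics-style Java as before; the single int field standing in for a node address is an assumption.

    import se.sics.kompics.KompicsEvent;
    import se.sics.kompics.PortType;

    // Indication events, emitted by the Ping Failure Detector.
    class Suspect implements KompicsEvent { final int node; Suspect(int n) { node = n; } }
    class Restore implements KompicsEvent { final int node; Restore(int n) { node = n; } }

    // Request events, sent by the components that use the detector.
    class StartMonitoring implements KompicsEvent { final int node; StartMonitoring(int n) { node = n; } }
    class StopMonitoring  implements KompicsEvent { final int node; StopMonitoring(int n)  { node = n; } }

    class EventuallyPerfectFailureDetector extends PortType {
        {
            indication(Suspect.class);
            indication(Restore.class);
            request(StartMonitoring.class);
            request(StopMonitoring.class);
        }
    }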
A Leader Election Abstraction using
a Failure Detector Abstraction
[Diagram: an Ω Leader Elector component provides the Leader Election abstraction (indication event: Leader) and requires the Eventually Perfect Failure Detector abstraction, provided by the Ping Failure Detector.]
24
A Reliable Broadcast Abstraction using
a Best-Effort Broadcast Abstraction
[Diagram: a Reliable Broadcast component provides the Broadcast abstraction (request: RbBroadcast; indication: RbDeliver) and requires a Broadcast abstraction from a Best-Effort Broadcast component (request: BebBroadcast; indication: BebDeliver), which in turn requires the Network abstraction.]
25
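The component stack above corresponds to the textbook eager reliable-broadcast algorithm: relay every message the first time it is beb-delivered, so it reaches all correct nodes even if the original sender crashes mid-broadcast. Below is a self-contained sketch; the event and port declarations are invented for illustration (the thesis code may organize them differently).

    import se.sics.kompics.*;
    import java.util.*;

    // Illustrative event and port declarations, named after the slide's arrows.
    class RbBroadcast implements KompicsEvent {
        final UUID id; final String payload;
        RbBroadcast(UUID i, String p) { id = i; payload = p; } }
    class RbDeliver implements KompicsEvent {
        final UUID id; final String payload;
        RbDeliver(UUID i, String p) { id = i; payload = p; } }
    class BebBroadcast implements KompicsEvent {
        final UUID id; final String payload;
        BebBroadcast(UUID i, String p) { id = i; payload = p; } }
    class BebDeliver implements KompicsEvent {
        final UUID id; final String payload;
        BebDeliver(UUID i, String p) { id = i; payload = p; } }

    class ReliableBroadcastPort extends PortType {
        { request(RbBroadcast.class); indication(RbDeliver.class); } }
    class BestEffortBroadcastPort extends PortType {
        { request(BebBroadcast.class); indication(BebDeliver.class); } }

    // Eager reliable broadcast: deliver each message once, then relay it.
    class EagerReliableBroadcast extends ComponentDefinition {
        Negative<ReliableBroadcastPort> rb = provides(ReliableBroadcastPort.class);
        Positive<BestEffortBroadcastPort> beb = requires(BestEffortBroadcastPort.class);
        final Set<UUID> delivered = new HashSet<>();

        Handler<RbBroadcast> onBroadcast = new Handler<RbBroadcast>() {
            public void handle(RbBroadcast e) {
                trigger(new BebBroadcast(e.id, e.payload), beb);
            }
        };
        Handler<BebDeliver> onDeliver = new Handler<BebDeliver>() {
            public void handle(BebDeliver e) {
                if (delivered.add(e.id)) {                           // first delivery only
                    trigger(new RbDeliver(e.id, e.payload), rb);
                    trigger(new BebBroadcast(e.id, e.payload), beb); // relay for reliability
                }
            }
        };
        { subscribe(onBroadcast, rb); subscribe(onDeliver, beb); }
    }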
A Consensus Abstraction using a
Broadcast, a Network, and a
Leader Election Abstraction
[Diagram: a Paxos Consensus component provides the Consensus abstraction and requires the Broadcast, Network, and Leader Election abstractions, provided by Best-Effort Broadcast, MyNetwork, and the Ω Leader Elector.]
26
A Shared Memory Abstraction
[Diagram: an ABD component provides the Atomic Register abstraction (requests: ReadRequest, WriteRequest; indications: ReadResponse, WriteResponse) and requires the Broadcast and Network abstractions, provided by Best-Effort Broadcast and MyNetwork.]
27
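At the heart of ABD is a quorum rule rather than component plumbing: a read collects (timestamp, writer-id, value) triples from a majority, picks the freshest, and writes it back to a majority before returning. Below is a plain-Java sketch of just that selection step, with names chosen for illustration.

    import java.util.*;

    // One replica's answer to a ReadRequest: a logical timestamp, the id of the
    // writer that produced it (to break ties), and the stored value.
    record RegisterReply(long timestamp, int writerId, String value) { }

    class AbdReadLogic {
        // Given replies from a majority quorum, select the freshest value.
        static RegisterReply freshest(List<RegisterReply> quorumReplies) {
            return quorumReplies.stream()
                .max(Comparator.comparingLong(RegisterReply::timestamp)
                               .thenComparingInt(RegisterReply::writerId))
                .orElseThrow();
        }
        // The reader must then impose the chosen (timestamp, value) on a majority
        // (the write-back phase) before returning it; that second phase is what
        // makes reads atomic rather than merely "read your own last write".
    }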
A Replicated State Machine using
a Total-Order Broadcast Abstraction
[Diagram: a State Machine Replication component provides the Replicated State Machine abstraction (request: Execute; indication: Output) and requires Total-Order Broadcast, provided by a Uniform Total-Order Broadcast component (request: TobBroadcast; indication: TobDeliver), which in turn requires the Consensus abstraction (request: Propose; indication: Decide), provided by Paxos Consensus.]
28
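The composition above is the classic state-machine-replication pattern: client commands (Execute) are not applied locally but fed into Total-Order Broadcast, and every replica applies commands in the single delivery order, so all replicas move through the same sequence of states. A minimal plain-Java sketch of that logic follows, with a toy key=value command format chosen for illustration.

    import java.util.*;

    class StateMachineReplica {
        private final Map<String, String> state = new HashMap<>();   // the replicated state
        private final List<String> toBroadcast = new ArrayList<>();  // stand-in for TobBroadcast

        // Execute request from a client: hand the command to total-order broadcast.
        void onExecute(String command) {
            toBroadcast.add(command);   // in the real component: trigger a TobBroadcast event
        }

        // TobDeliver indication: every replica applies commands in the same order,
        // so deterministic commands yield identical state everywhere.
        String onTobDeliver(String command) {
            String[] kv = command.split("=", 2);   // toy format "key=value"
            return state.put(kv[0], kv[1]);        // Output: the previous value, if any
        }
    }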
Probabilistic Broadcast and Topology
Maintenance Abstractions using a
Peer Sampling Abstraction
[Diagram: an Epidemic Dissemination component provides Probabilistic Broadcast and a T-Man component provides Topology; both require the Network and Peer Sampling abstractions. Peer Sampling is provided by a Cyclon Random Overlay component, which requires Network and Timer.]
29
A Structured Overlay Network
implements a Distributed Hash Table
[Diagram: a Structured Overlay Network composite component provides the Distributed Hash Table abstraction. Inside it, a One-Hop Router provides the Overlay Router and a Chord Periodic Stabilization component provides the Consistent Hashing Ring Topology; they require the Network, Timer, Failure Detector (Ping Failure Detector), and Peer Sampling (Cyclon Random Overlay) abstractions.]
30
A Video on Demand Service using
a Content Distribution Network
and a Gradient Topology Overlay
[Diagram: a Video On-Demand component uses a Content Distribution Network abstraction, provided by a BitTorrent component, and a Gradient Topology abstraction, provided by a Gradient Overlay component. BitTorrent requires Network, Timer, and a Tracker abstraction, provided either by a Distributed Tracker (using Peer Exchange, the Distributed Hash Table, and Peer Sampling) or by a Centralized Tracker Client (using the Network).]
31
Generic Bootstrap and Monitoring
Services provided by the Kompics
Peer-to-Peer Protocol Framework
[Diagram: three process types: PeerMain hosts a Peer component, BootstrapServerMain a BootstrapServer, and MonitorServerMain a MonitorServer. Each requires the Web, Network, and Timer abstractions, provided by MyWebServer, MyNetwork, and MyTimer.]
32
Whole-System Repeatable Simulation
[Diagram: the whole system is executed by a deterministic simulation scheduler, driven by a network model and an experiment scenario.]
33
Experiment scenario DSL
• Define parameterized scenario events
– Node failures, joins, system requests, operations
• Define “stochastic processes”
– Finite sequence of scenario events
– Specify distribution of event inter-arrival times
– Specify type and number of events in sequence
– Specify distribution of each event parameter value
• Scenario: composition of “stochastic processes”
– Sequential, parallel (see the sketch below)
34
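To give a feel for the DSL, here is a sketch of what a scenario combining a churn process and a workload process might look like. The class and method names (SimulationScenario, StochasticProcess, eventInterArrivalTime, raise, startAfterTerminationOf) are written from the description in the bullets above and should be read as assumptions about the API rather than its exact signature.

    SimulationScenario scenario = new SimulationScenario() {{
        // A stochastic process of 100 node joins: exponentially distributed
        // inter-arrival times, node identifiers drawn uniformly at random.
        StochasticProcess churnJoin = new StochasticProcess() {{
            eventInterArrivalTime(exponential(500));      // milliseconds
            raise(100, joinEvent, uniform(0, 1 << 16));   // count, event type, parameter distribution
        }};
        // A second process issuing 1000 put/get operations at a steady rate.
        StochasticProcess workload = new StochasticProcess() {{
            eventInterArrivalTime(constant(10));
            raise(1000, putGetOperation, uniform(0, 1 << 16));
        }};
        // The scenario is a sequential/parallel composition of such processes.
        churnJoin.start();
        workload.startAfterTerminationOf(2000, churnJoin);
    }};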
Local Interactive Stress Testing
[Diagram: the whole system is executed by a work-stealing multi-core scheduler, driven by a network model and an experiment scenario.]
35
Execution profiles
• Distributed Production Deployment
– One distributed system node per OS process
– Multi-core component scheduler (work stealing)
• Local / Distributed Stress Testing
– Entire distributed system in one OS process
– Interactive stress testing, multi-core scheduler
• Local Repeatable Whole-system Simulation
– Deterministic simulation component scheduler
– Correctness testing, stepped / replay debugging
36
Incremental Development & Testing
• Define emulated network topologies
– processes and their addresses: <id, IP, port>
– properties of links between processes
• latency (ms)
• loss rate (%)
• Define small-scale execution scenarios
– the sequence of service requests initiated by
each process in the distributed system
• Experiment with various topologies / scenarios
– Launch all processes locally on one machine
37
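Below is a plain-Java sketch of the kind of emulated topology the bullets above describe: a set of processes plus per-link latency and loss rate. NodeSpec and LinkSpec are hypothetical helper types invented for the example, not the framework's classes.

    import java.util.List;

    record NodeSpec(int id, String ip, int port) { }
    record LinkSpec(int from, int to, long latencyMs, double lossRate) { }

    class EmulatedTopology {
        // Three processes launched locally on one machine.
        static final List<NodeSpec> NODES = List.of(
            new NodeSpec(1, "127.0.0.1", 22031),
            new NodeSpec(2, "127.0.0.1", 22032),
            new NodeSpec(3, "127.0.0.1", 22033));

        // Per-link properties: latency in milliseconds, loss rate as a fraction.
        static final List<LinkSpec> LINKS = List.of(
            new LinkSpec(1, 2, 50, 0.00),    // 50 ms, lossless
            new LinkSpec(1, 3, 120, 0.02),   // 120 ms, 2% message loss
            new LinkSpec(2, 3, 80, 0.00));
    }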
Distributed System Launcher
38
[Screenshot: the script of service requests of each process is shown in one panel; after the Application completes the script, it can process further commands typed in interactively.]
39
Programming in the Large
• Events and ports are interfaces
– service abstractions
– packaged together as libraries
• Components are implementations
– provide or require interfaces
– dependencies on provided / required
interfaces
• expressed as library dependencies [Apache Maven]
• multiple implementations for an interface
– separate libraries
• deploy-time composition
40
Kompics Scala, by Lars Kroll
41
Kompics Python, by Niklas Ekström
42
Case study
A Scalable, Self-Managing Key-Value Store
with
Atomic Consistency and Partition Tolerance
43
Key-Value Store?
• Store.Put(key, value) → OK   [write]
• Store.Get(key) → value   [read]

Put(”www.sics.se”, ”193.10.64.51”) → OK
Get(”www.sics.se”) → ”193.10.64.51”
44
Consistent Hashing
[Diagram: a consistent-hashing ring, as used by Dynamo and Project Voldemort; properties: incremental scalability, self-organization, simplicity.]
45
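As a reminder of the mechanism shown above, here is a small, self-contained Java sketch of consistent hashing: nodes and keys are hashed onto the same ring, each key is served by the first node at or after its position (wrapping around), and adding or removing a node only moves the keys between that node and its ring predecessor. The hash function below is a stand-in chosen for brevity.

    import java.util.*;

    class ConsistentHashRing {
        private final TreeMap<Long, String> ring = new TreeMap<>();

        void addNode(String node)    { ring.put(hash(node), node); }
        void removeNode(String node) { ring.remove(hash(node)); }

        // Each key is owned by the first node clockwise from its hash position.
        String nodeFor(String key) {
            Map.Entry<Long, String> e = ring.ceilingEntry(hash(key));
            return (e != null ? e : ring.firstEntry()).getValue();  // wrap around
        }

        private long hash(String s) {
            return Integer.toUnsignedLong(s.hashCode());  // stand-in for a real hash
        }
    }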
Single client, Single server
[Diagram: one client, one server holding X=0. The client's Put(X, 1) is acknowledged with Ack(X), the server's state becomes X=1, and a subsequent Get(X) returns 1.]
46
Multiple clients, Multiple servers
[Diagram: two clients and two replica servers, both initially holding X=0. Client 1's Put(X, 1) is acknowledged after updating Server 1, but Client 2's subsequent Get(X), served by Server 2, still returns 0 before a later Get returns 1: the new value is not immediately visible at all replicas.]
47
Atomic Consistency Informally
• put/get ops appear to occur instantaneously
• Once a put(key, newValue) completes
– new value immediately visible to all readers
– each get returns the value of the last completed
put
• Once a get(key) returns a new value
– no other get may return an older, stale value
48
Distributed Hash Table
[Diagram: the CATS Node composite component provides the Distributed Hash Table, Web, and Status abstractions. Internally it composes a CATS Web Application, Load Balancer, Operation Coordinator, Replication (Group Member, Reconfiguration Coordinator), Bulk Data Transfer, Local Store with Persistent Storage, Consistent Hashing Ring topology, One-Hop overlay Router, Cyclon Random Overlay peer sampling, Epidemic Dissemination broadcast, Ping Failure Detector, Bootstrap Client, Aggregation and Status monitoring, and a Garbage Collector, all wired over the Network and Timer abstractions.]
49
Simulation and Stress Testing
[Diagram: two harnesses host many CATS Node instances inside a single OS process. CATS Simulation Main runs the CATS Simulator under the deterministic simulation scheduler, with a Discrete-Event Simulator providing the Network, Timer, network model, and Experiment Scenario. CATS Stress Testing Main runs the same CATS Simulator under the multi-core scheduler, with a Generic Orchestrator providing the network model and Experiment Scenario.]
50
Example Experiment Scenario
51
Reconfiguration Protocols
Testing and Debugging
• Use whole-system repeatable simulation
• Protocol Correctness Testing
– Each experiment scenario is a Unit test
• Regression test suite
– Covered all “types” of churn scenarios
– Tested each scenario for 1 million RNG seeds
• Debugging
– Global state snapshot on every change
– Traverse state snapshots
• Forward and backward in time
52
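The "1 million RNG seeds" regression idea can be summarized in a few lines: because a whole-system simulation is deterministic for a given seed, sweeping seeds and recording the first one whose run violates an invariant yields a repeatable failing execution that can then be replayed and stepped through. A sketch follows, with the simulation abstracted as a predicate from seed to "all invariants held".

    import java.util.OptionalLong;
    import java.util.function.LongPredicate;

    class SeedSweep {
        // Run the deterministic simulation once per seed; return the first seed
        // whose execution violates an invariant, so that it can be replayed under
        // the debugger with exactly the same interleaving.
        static OptionalLong firstFailingSeed(LongPredicate simulationHolds, long seedCount) {
            for (long seed = 0; seed < seedCount; seed++) {
                if (!simulationHolds.test(seed)) {
                    return OptionalLong.of(seed);
                }
            }
            return OptionalLong.empty();
        }
    }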
Global State Snapshot: 25 joined
53
Snapshot During Reconfiguration
54
Reconfiguration Completed OK
Distributed Systems Debugging Done Right!
55
CATS Architecture for
Distributed Production Deployment
56
Demo: SICS Cluster Deployment
57
An Interactive Put Operation
58
An Interactive Get Operation
59
CATS Architecture
for Production Deployment
and Performance Evaluation
[Diagram: production deployment. CATS Client Main runs the YCSB Benchmark application against a CATS Client component via the Distributed Hash Table port; CATS Peer Main runs a CATS Node behind a Jetty Web Server, exposing the DHT and Web abstractions; Bootstrap Server Main runs the CATS Bootstrap Server behind a Jetty Web Server. All processes use Grizzly Network and MyTimer for the Network and Timer abstractions.]
60
Experimental Setup
• 128 Rackspace Cloud Virtual Machines
– 16GB RAM, 4 virtual cores
– 1 client for every 3 servers
• Yahoo! Cloud Serving Benchmark (YCSB)
– Read-intensive workload: 95% reads, 5% writes
– Write-intensive workload: 50% reads, 50% writes
• CATS nodes placed equidistantly on the ring
– Avoids load imbalance
61
Performance (50% reads, 50% writes)
62
Performance (95% reads, 5% writes)
63
Scalability (50% reads, 50% writes)
64
Scalability (95% reads, 5% writes)
65
Elasticity (read-only workload)
* Experiment ran on SICS cloud machines [1 YCSB client, 32 threads]
66
Overheads (50% reads, 50% writes)
24%
67
Overheads (95% reads, 5% writes)
4%
68
CATS vs Cassandra (50% read, 50% write)
69
CATS vs Cassandra (95% reads, 5% writes)
70
Summary
[Diagram: CATS combines atomic data consistency, scalability, elasticity, decentralization, network partition tolerance, self-organization, and fault tolerance.]
Atomic data consistency is affordable!
71
Related work
• Key-value stores: Dynamo [SOSP’07], Cassandra, Riak, Voldemort
– scalable, not consistent
• Meta-data stores: Chubby [OSDI’06], ZooKeeper
– consistent, not scalable, not auto-reconfigurable
• Replication systems: RAMBO [DISC’02], RAMBO II [DSN’03], SMART [EuroSys’06], RDS [JPDC’09], DynaStore [JACM’11]
– reconfigurable, consistent, not scalable
• Scatter [SOSP’11]
– scalable and linearizable DHT
– reconfiguration needs distributed transactions
72
Kompics is practical
• scalable key-value stores
• structured overlay networks
• gossip-based protocols
• peer-to-peer media streaming
• video-on-demand systems
• NAT-aware peer-sampling services
• Teaching: broadcast, concurrent objects, consensus, replicated state machines, etc.
73
Related work
• Component models and ADLs: Fractal,
OpenCom, ArchJava, ComponentJ, …
– blocking interface calls vs. message passing
• Protocol composition frameworks: x-Kernel,
Ensemble, Horus, Appia, Bast, Live Objects, …
– static, layered vs. dynamic, hierarchical
composition
• Actor models: Erlang, Kilim, Scala, Unix pipes
– flat / stacked vs. hierarchical architecture
• Process calculi: π-calculus, CCS, CSP, Oz/K
– synchronous vs. asynchronous message passing
74
Summary
• Message-passing, hierarchical component
model facilitating concurrent programming
• Good for distributed abstractions and systems
• Multi-core hardware exploited for free
• Hot upgrades by dynamic reconfiguration
• Same code used in production deployment,
deterministic simulation, local execution
• DSL to specify complex simulation scenarios
• Battle-tested in many distributed systems
75
Acknowledgements
• Seif Haridi
• Jim Dowling
• Tallat M. Shafaat
• Muhammad Ehsan ul Haque
• Frej Drejhammar
• Lars Kroll
• Niklas Ekström
• Alexandru Ormenișan
• Hamidreza Afzali
76
http://kompics.sics.se/
BACKUP SLIDES
Sequential consistency
• A concurrent execution is sequentially
consistent if there is a sequential way to
reorder the client operations such that:
– (1) it respects the semantics of the objects,
as defined by their sequential specification
– (2) it respects the order of operations at the
client that issued the operations
79
Linearizability
• A concurrent execution is linearizable if
there is a sequential way to reorder the
client operations such that:
– (1) it respects the semantics of the objects,
as defined by their sequential specification
– (2) it respects the order of non-overlapping
operations among all clients
80
Consistency: naïve solution
Replicas act as a distributed shared-memory register
[Diagram: a key replicated at three replicas, r1, r2, and r3, on the ring.]
81
The problem
Asynchrony
Impossible to accurately detect process failures
82
Incorrect failure detection
[Diagram: a ring segment with nodes 42, 48, 50, and 52 (successor and predecessor pointers shown); key 45 lies in the range (42, 48], and node 50 has incorrectly marked node 48 as failed, producing non-intersecting quorums.]
• 50 thinks 48 has failed
• 48 thinks the replication group for (42, 48] is {48, 50, 52}
• 50 thinks the replication group for (42, 48] is {50, 52, 60}
• PUT(45) may contact majority quorum {48, 50}
• GET(45) may contact majority quorum {52, 60}
83
Solution: Consistent Quorums
• A consistent quorum is a quorum of nodes
that are in the same view when the quorum
is assembled
– Maintain consistent view of replication group
membership
– Modified Paxos using consistent quorums
– Essentially a reconfigurable RSM (state == view)
• Modified ABD using consistent quorums
– Dynamic linearizable read-write register
84
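The consistent-quorum test itself is small: a set of replies counts only if it is a majority of the replication group and every replier reported the same view of that group when it answered. Below is a plain-Java sketch with illustrative types (View, ViewedReply), not the CATS implementation.

    import java.util.*;

    record View(long id, Set<String> members) { }
    record ViewedReply(String from, View view) { }

    class ConsistentQuorum {
        // True iff the replies form a majority of the coordinator's view AND all
        // repliers were in that same view when the quorum was assembled.
        static boolean isConsistentQuorum(Collection<ViewedReply> replies, View view) {
            boolean sameView = replies.stream().allMatch(r -> r.view().equals(view));
            boolean majority = replies.size() > view.members().size() / 2;
            return sameView && majority;
        }
    }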
Guarantees
• Concurrent reconfigurations are applied at
every node in a total order
• For every replication group, any two
consistent quorums always intersect
– whether assembled in the same view, in consecutive views, or in non-consecutive views
• In a partially synchronous system,
reconfigurations and operations terminate
– once network partitions cease
• Consistent mapping from key-ranges to
replication groups
85