Transcript PPT
Dynamo
Highly Available Key-Value Store
Dennis Kafura – CS5204 – Operating Systems
Dynamo
Context
Core e-commerce services need scalable and reliable storage for massive amounts of data
n × 100 services
n × 100,000 concurrent sessions on key services
Size and scalability require a storage architecture that is
highly decentralized
built from a very large number of components
running on commodity hardware
High component count creates reliability problems (“treats failure handling as the normal case”)
Address reliability problems by replication
Replication raises issues of:
Consistency (replicas differ after failure)
Performance
When to enforce consistency (on read, on write)
Who enforces consistency (client, storage system)
Dynamo
System Elements
Maintains the state of services with
high reliability requirements
latency-sensitive performance
Used only internally
Control tradeoff between consistency and performance
Can leverage characteristics of services and workloads
Non-hostile environment (no security requirements)
Simple key-value interface
Applications do not require more complicated (e.g. database) semantics or a hierarchical name space
Key is unique identifier for data item; Value is a binary object (blob)
No operations over multiple data items
Adopts a weaker model of consistency (eventual consistency) in favor of higher availability
Service level agreements (SLAs)
Measured at the 99.9th percentile
Key factor: service latency at a given request rate
Example: response time of 300 ms for 99.9% of requests at a peak client load of 500 requests per second
State management (storage) efficiency is a key factor in meeting SLAs
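To make the SLA concrete, a minimal sketch (in Python, with simulated latencies; the 300 ms budget is the example figure above, not a measured value) of checking a 99.9th-percentile latency target:

    # Sketch: check a 99.9th-percentile latency SLA over a window of request samples.
    # The latencies are simulated; 300 ms is the example budget from the slide.
    import random

    def percentile(samples, p):
        """Nearest-rank p-th percentile (p in 0..100) of a list of samples."""
        ordered = sorted(samples)
        rank = max(1, int(round(p / 100.0 * len(ordered))))
        return ordered[rank - 1]

    latencies_ms = [random.expovariate(1 / 40.0) for _ in range(100_000)]
    p999 = percentile(latencies_ms, 99.9)
    print(f"99.9th-percentile latency: {p999:.1f} ms; SLA met: {p999 <= 300}")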
Dynamo
Design Considerations
Consistency vs. availability
Strict consistency means that data is unavailable in case of failure of one of the replicas
To improve availability,
use weaker form of consistency (eventual consistency)
allow optimistic updates (changes propagate in the background)
Conflicts
Can lead to conflicting changes which must be detected and resolved
Dynamo applications require “always writeable” storage
Perform conflict detection/resolution on reads
Other factors
Incremental scalability
Symmetry/decentralization (P2P organization/control)
Heterogeneity (not all servers the same)
Dynamo
Design Overview
Dynamo
Partitioning
Interface
get(key)
Returns a context and either a single object or a list of conflicting objects
put(key, context, object)
Context from previous read
Object placement/replication
MD5 hash of the key yields a 128-bit identifier
Consistent hashing maps the key to a position on the ring
The N nodes responsible for the key form its preference list
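A minimal sketch of this placement scheme (node names, a single token per node, and N = 3 are illustrative choices, not Dynamo's actual configuration):

    # Sketch: MD5-based consistent hashing and a per-key preference list of N nodes.
    import bisect
    import hashlib

    def ring_position(value: str) -> int:
        """MD5 yields a 128-bit identifier; treat it as a position on the ring."""
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes, n=3):
            self.n = n
            # One token per node here; real systems place many virtual nodes per host.
            self.tokens = sorted((ring_position(name), name) for name in nodes)

        def preference_list(self, key: str):
            """Walk clockwise from the key's position, collecting N distinct nodes."""
            start = bisect.bisect(self.tokens, (ring_position(key), ""))
            picked = []
            for i in range(len(self.tokens)):
                node = self.tokens[(start + i) % len(self.tokens)][1]
                if node not in picked:
                    picked.append(node)
                if len(picked) == self.n:
                    break
            return picked

    ring = Ring(["A", "B", "C", "D", "E"])
    print(ring.preference_list("shopping-cart-42"))   # the N = 3 nodes storing this key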
Dynamo
Versioning
Failure-free operation: a put updates all replicas
What to do in case of failure, when some replicas cannot be reached?
[Figure: a put propagating to replicas, with and without a failed replica]
Dynamo
Versioning
Object content is treated as immutable; an update operation creates a new version
[Figure: a put creates a new version v2 from v1 at every replica]
Dynamo
Versioning
Versioning can lead to inconsistency
Due to network partitioning
[Figure: during a network partition, a put creates v2 on reachable replicas while partitioned replicas still hold v1]
Dynamo
Versioning
Versioning can lead to inconsistency
Due to concurrent updates
[Figure: concurrent operations put_a and put_b at different replicas create conflicting versions v2a and v2b from v1]
Dynamo
Object Resolution
Uses vector clocks
Conflicting versions are passed to the application as output of a get operation
The application resolves the conflicts and puts a new (consistent) version
Inconsistent versions are rare: 99.94% of get operations saw exactly one version
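A minimal sketch of vector-clock comparison and reconciliation (the node ids Sx, Sy, Sz and the merge step are illustrative; the actual reconciliation of the conflicting values is left to the application):

    # Sketch: vector clocks for detecting conflicting versions.
    def descends(a: dict, b: dict) -> bool:
        """True if clock a is equal to or newer than clock b on every node."""
        return all(a.get(node, 0) >= count for node, count in b.items())

    def conflicting(a: dict, b: dict) -> bool:
        """Two versions conflict when neither clock descends from the other."""
        return not descends(a, b) and not descends(b, a)

    v1  = {"Sx": 1}                 # original version, written at node Sx
    v2a = {"Sx": 1, "Sy": 1}        # concurrent update handled by Sy
    v2b = {"Sx": 1, "Sz": 1}        # concurrent update handled by Sz

    print(conflicting(v2a, v1))     # False: v2a descends from v1
    print(conflicting(v2a, v2b))    # True: must be reconciled by the application

    # After the application reconciles, the new version's clock merges both histories
    # and the coordinating node (Sx here) increments its own entry on the new put.
    merged = {n: max(v2a.get(n, 0), v2b.get(n, 0)) for n in set(v2a) | set(v2b)}
    merged["Sx"] += 1
    print(merged)                   # {'Sx': 2, 'Sy': 1, 'Sz': 1} (key order may vary)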
Dynamo
Handling get/put operations
Operations are handled by a coordinator:
the first among the top N nodes in the preference list
Coordinator located by
call to a load balancer (no Dynamo-specific code needed in the application, but may require an extra level of indirection)
direct call to the coordinator (via a Dynamo-specific client library)
Quorum voting
R nodes must agree to a get operation
W nodes must agree to a put operation
R + W > N
(N, R, W) can be chosen to achieve the desired tradeoff
Common configuration is (3, 2, 2)
“Sloppy quorum”
Top N’ healthy nodes in the preference list
Coordinator is first in this group
Replicas sent to a stand-in node contain a “hint” indicating the (unavailable) original node that should hold the replica
Hinted replicas are stored by the available node and forwarded when the original node recovers
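A rough sketch of a put under a sloppy quorum with hinted handoff (the Node class, in-memory stores, and failure model are simplifying assumptions for illustration, not Dynamo's implementation):

    # Sketch: coordinator-side put using a sloppy quorum with hinted handoff.
    N, W = 3, 2   # example (N, W); with R = 2 this satisfies R + W > N

    class Node:
        def __init__(self, name, up=True):
            self.name, self.up, self.store, self.hinted = name, up, {}, []

    def sloppy_put(key, value, preference_list):
        """Write to N nodes. Each unreachable node in the top N is replaced by the
        next healthy node further down the list, which stores the value plus a hint."""
        acks = 0
        standins = iter(n for n in preference_list[N:] if n.up)    # nodes beyond the top N
        for node in preference_list[:N]:
            if node.up:
                node.store[key] = value
                acks += 1
            else:
                backup = next(standins, None)
                if backup is not None:
                    backup.hinted.append((node.name, key, value))  # hint names the intended holder
                    acks += 1
        return acks >= W   # the put succeeds once W replicas acknowledge

    a, b, c, d = Node("A"), Node("B", up=False), Node("C"), Node("D")
    print(sloppy_put("cart:42", {"items": 3}, [a, b, c, d]))       # True; D holds a hinted replica for B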
Dynamo
Replica synchronization
Accelerates detection of inconsistent replicas using a Merkle tree
A separate tree is maintained by each node for each key range
Adds overhead to maintain the Merkle trees
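A compressed sketch of the idea (two hash levels instead of a full Merkle tree; the bucket count and SHA-1 are arbitrary choices): replicas exchange the root hash first, and only buckets whose hashes differ need their keys transferred.

    # Sketch: a two-level hash tree over one key range; differing roots narrow the
    # comparison down to the buckets (subtrees) that actually disagree.
    import hashlib

    def digest(data: bytes) -> bytes:
        return hashlib.sha1(data).digest()

    def build_tree(kv: dict, buckets: int = 4):
        """Hash each bucket of keys, then hash the bucket hashes into a root."""
        leaves = [b""] * buckets
        for key in sorted(kv):
            i = int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets
            leaves[i] = digest(leaves[i] + key.encode() + kv[key].encode())
        return digest(b"".join(leaves)), leaves

    def out_of_sync_buckets(tree_a, tree_b):
        """Equal roots mean the range is in sync; otherwise compare bucket hashes."""
        (root_a, leaves_a), (root_b, leaves_b) = tree_a, tree_b
        if root_a == root_b:
            return []
        return [i for i, (x, y) in enumerate(zip(leaves_a, leaves_b)) if x != y]

    replica1 = {"k1": "v1", "k2": "v2", "k3": "v3"}
    replica2 = {"k1": "v1", "k2": "stale", "k3": "v3"}
    print(out_of_sync_buckets(build_tree(replica1), build_tree(replica2)))  # bucket holding k2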
Dynamo
Ring membership
Nodes are explicitly added to/removed from a ring
Membership, partitioning, and placement information propagates via periodic pairwise exchanges (a gossip protocol)
Existing nodes transfer key ranges to a newly added node or receive key ranges from exiting nodes
Nodes eventually know the key ranges of their peers and can forward requests to them
Some “seed” nodes are well known
Node failures are detected by lack of responsiveness and recovery is detected by periodic retry
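A toy sketch of the gossip exchange (versioned membership entries; the reconciliation rule and round structure are simplifications, not the actual anti-entropy protocol):

    # Sketch: periodic pairwise exchange of ring membership; the newest entry per
    # node wins, so every view converges to the full membership over a few rounds.
    import random

    class Member:
        def __init__(self, name):
            self.name = name
            self.view = {name: (1, "joined")}    # node name -> (version, status)

        def gossip_with(self, peer):
            """Merge the two views, keeping the higher-versioned entry for each node."""
            for node in set(self.view) | set(peer.view):
                entries = [v for v in (self.view.get(node), peer.view.get(node)) if v]
                self.view[node] = peer.view[node] = max(entries)

    nodes = [Member(f"N{i}") for i in range(5)]
    for _ in range(6):                            # a few rounds suffice in practice
        for m in nodes:
            m.gossip_with(random.choice([p for p in nodes if p is not m]))
    print(sorted(nodes[0].view))                  # N0 now knows every node's membership entry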
Dynamo
Partition/Placement Strategies
Strategy | Placement | Partition
1 | T random tokens per node | consecutive tokens create a partition
2 | T random tokens per node | Q equal-sized partitions
3 | Q/S tokens per node (S = number of nodes) | Q equal-sized partitions
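A small sketch of strategy 3's layout (the values of Q and S and the round-robin assignment are illustrative; Dynamo's actual assignment preserves the Q/S-tokens-per-node invariant as nodes join and leave):

    # Sketch of strategy 3: Q fixed, equal-sized partitions; each of S nodes owns Q/S.
    Q, S = 16, 4                          # so each node holds Q/S = 4 partitions (tokens)
    RING = 2 ** 128                       # identifier space (MD5 gives 128-bit ids)
    partition_size = RING // Q

    nodes = [f"N{i}" for i in range(S)]
    ownership = {p: nodes[p % S] for p in range(Q)}   # round-robin assignment of partitions

    def partition_of(key_id: int) -> int:
        """Equal-sized partitions make lookup a single division."""
        return key_id // partition_size

    print(sum(1 for owner in ownership.values() if owner == "N0"))   # 4 partitions per node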
Dynamo
Strategy Performance Factors
Strategy 1
Bootstrapping of a new node is lengthy
It must acquire its key ranges from other nodes
Other nodes process scanning/transmission of key ranges for the new node as background activities
Bootstrapping has taken a full day during peak periods
Numerous nodes may have to adjust their Merkle trees when a node joins/leaves the system
Archival process is difficult
Key ranges may be in transit
No obvious synchronization/checkpointing structure
Dynamo
Strategy Performance Factors
Strategy 2
Decouples partition and placement
Allows changing of placement scheme at run-time
Strategy 3
Decouples partition and placement
Faster bootstrapping/recovery and easier archiving because key ranges can be segregated into different files that can be shared/archived separately
Dynamo
Partition Strategies - Performance
Strategies have different tuning parameters
Fair comparison: evaluate the skew in their load distributions for a fixed amount of space used to maintain membership information
Strategy 3 is superior
Dynamo
Client- vs Server-Side Coordination
Any node can coordinate read requests; write requests are handled by a coordinator
The state machine for coordination can run in the load balancing server or be incorporated into the client
Client-driven coordination has lower latency because it avoids an extra network hop (redirection)