CSCI-572 Presentation - Center for Software Engineering


Dynamo: Amazon’s Highly Available Key-value Store
Giuseppe DeCandia et al.
[Amazon.com]
Jagrut Sharma
[email protected]
CSCI-572 (Prof. Chris Mattmann)
20-Jul-2010
Outline of Talk
• Motivation (1)
• Contribution (1)
• Context (1)
• Background (3)
• Related Work (2)
• System Architecture (7)
• Implementation (1)
• Experiences, Results & Lessons Learnt (4)
• Conclusion (1)
• Pros (1)
• Cons (1)
• Questions (1)
Motivation
• Tens of millions of customers
• Tens of thousands of servers
• Globally distributed data centers
• 24 * 7 * 365 operations
• Financial consequences and customer trust at stake
• Data management goals: performance, reliability, efficiency, scalability
Contribution
• Evaluation of how different techniques can be combined to provide a highly available system
• Demonstration that an eventually-consistent storage system (like Dynamo) can be used in a production environment with demanding applications
• Provision of tuning methods to meet the requirements of production systems with very strict performance demands
Context
• Amazon’s e-commerce platform:
  • Highly decentralized
  • Loosely coupled
  • Service-oriented architecture
  • Hundreds of services
  • Millions of components
  • Failure is a way of life
• Critical requirement: always-available storage
• Storage techniques:
  • S3 (Amazon Simple Storage Service)
  • Dynamo
    • Highly available and scalable distributed data store for Amazon’s platform
    • Provides a primary-key-only interface for selected applications (e.g. shopping cart)
    • Combines multiple high-performance techniques & algorithms
    • Excellent performance in real-world scenarios
Background (1 of 3)
• E-commerce platform services: stateless & stateful
• Relational databases are overkill for stateful lookups by primary key
• Dynamo:
  • Simple key/value interface
  • Highly available
  • Efficient in resource usage
  • Scalable
  • Each service that uses Dynamo runs its own Dynamo instances
• Dynamo’s target applications:
  • Store small-sized objects (< 1 MB)
  • Operate with weaker consistency if this gives higher availability
• Simple read-write access to a data item uniquely identified by a key
• No query operations span multiple data items
• Services use Dynamo to give priority to latency & throughput
• Amazon’s SLAs are expressed and measured at the 99.9th percentile of the distribution (in contrast to the common industry approach of using average, median and expected variance)
Background (2 of 3)
Assumptions About Dynamo
• Used only by Amazon’s internal services
• Operation environment is non-hostile
• There are no security-related requirements (e.g. authentication, authorization)
• Each service uses its distinct instance of Dynamo
• Dynamo’s initial design targets a scale of up to hundreds of storage hosts
Background (3 of 3)
[Figure: SOA of Amazon’s platform]

Dynamo Design Considerations
• How to balance consistency against availability when data is replicated?
  • Eventually consistent data store
• When to resolve update conflicts?
  • “Always writeable” data store
• Who performs conflict resolution?
  • Both the data store & the application are allowed to
• Incremental scalability at the node level
• Symmetry among nodes
• Favors decentralization
• Capable of exploiting infrastructure heterogeneity
Related Work (1 of 2)
• Peer-to-Peer Systems
  • Tackle problems of data storage and distribution
  • Only support flat namespaces
  • Unstructured P2P: Freenet, Gnutella
    • Search query floods the network
  • Structured P2P systems: Pastry, Chord, Oceanstore, PAST
    • Employ a globally consistent query routing protocol
    • Bounded number of hops
    • Maintain local routing tables
    • Provide rich storage services with conflict resolution
• Distributed File Systems and Databases
  • Support both flat & hierarchical namespaces
  • Ficus, Coda: high availability at the expense of consistency
  • Farsite: high availability and scalability using replication
  • Google File System: master server, chunkservers
  • Bayou: distributed RDBMS, disconnected operations
  • Antiquity: wide-area distributed storage system
  • BigTable: distributed storage system for structured data
Related Work (2 of 2)
Dynamo vs. Other Systems
1. Targeted mainly at applications that need an “always writeable” data store
2. Built for an infrastructure within a single administrative domain where all nodes are assumed to be trusted
3. Applications using Dynamo do not require support for hierarchical namespaces or complex relational schemas
4. Built for latency-sensitive applications that require at least 99.9% of read and write operations to be performed within a few hundred milliseconds
5. Avoids routing requests through multiple nodes; hence it is similar to a zero-hop Distributed Hash Table
System Architecture (1 of 7)
List Of Techniques Used By Dynamo & Their Advantages
Problem: Partitioning
Technique: Consistent hashing
Advantage: Incremental scalability

Problem: High availability for writes
Technique: Vector clocks with reconciliation during reads
Advantage: Version size is decoupled from update rates

Problem: Handling temporary failures
Technique: Sloppy quorum and hinted handoff
Advantage: Provides high availability and durability guarantee when some of the replicas are not available

Problem: Recovering from permanent failures
Technique: Anti-entropy using Merkle trees
Advantage: Synchronizes divergent replicas in the background

Problem: Membership and failure detection
Technique: Gossip-based membership protocol and failure detection
Advantage: Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information
System Architecture (2 of 7)
System Interface
• get(key)
  • Locates the object replicas associated with the key in the storage system
  • Returns a single object, or a list of objects with conflicting versions, plus a context
• put(key, context, object)
  • Determines where the replicas of the object should be placed based on the associated key
  • Writes the replicas to disk
• context
  • Encodes system metadata about the object
  • Includes additional information (e.g. object version)
• key, object: considered as an opaque array of bytes
• MD5 hash(key) -> 128-bit identifier, used to determine the storage nodes that are responsible for serving the key
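To make the key-to-identifier step concrete, here is a minimal Java sketch (class and method names are illustrative, not from the paper) that hashes a key with MD5 and interprets the 16-byte digest as the 128-bit ring position used to locate the responsible storage nodes.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch: hash a client-supplied key to a 128-bit ring position.
public class KeyHasher {
    public static BigInteger ringPosition(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            // Interpret the 16-byte digest as an unsigned 128-bit integer.
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical key for a shopping-cart object.
        System.out.println(ringPosition("cart:customer-12345"));
    }
}
```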
System Architecture (3 of 7)
Partitioning Algorithm
• Provides a mechanism to dynamically partition the data over the set of nodes (i.e. storage hosts)
• Uses a variant of consistent hashing: the output range of a hash function is treated as a fixed circular space or “ring” (the largest hash value wraps around to the smallest hash value)
  • Advantage: departure or arrival of a node only affects its immediate neighbors
  • Limitation 1: leads to non-uniform data and load distribution
  • Limitation 2: oblivious to heterogeneity in the performance of nodes
• Each physical node is mapped to multiple points in the ring, i.e. virtual nodes
• Advantages of virtual nodes:
  • Graceful handling of the failure of a node
  • Easy accommodation of a new node
  • Heterogeneity in the physical infrastructure can be exploited
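The sketch below illustrates the consistent-hashing-with-virtual-nodes idea in Java. The class name and the number of tokens per host are assumptions for illustration; Dynamo’s actual partitioning strategies are discussed later in the talk.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

// Illustrative consistent-hashing ring: each host owns several "virtual node"
// positions, and a key is served by the first host clockwise from its hash.
public class ConsistentHashRing {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();
    private final int virtualNodesPerHost;

    public ConsistentHashRing(int virtualNodesPerHost) {
        this.virtualNodesPerHost = virtualNodesPerHost;
    }

    private static BigInteger hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, d);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Map each physical host to several positions ("tokens") on the ring.
    public void addHost(String host) {
        for (int i = 0; i < virtualNodesPerHost; i++) {
            ring.put(hash(host + "#" + i), host);
        }
    }

    public void removeHost(String host) {
        ring.values().removeIf(h -> h.equals(host));
    }

    // Walk clockwise from the key's position; wrap around at the end of the ring.
    public String hostFor(String key) {
        Map.Entry<BigInteger, String> e = ring.ceilingEntry(hash(key));
        return (e != null) ? e.getValue() : ring.firstEntry().getValue();
    }

    public static void main(String[] args) {
        ConsistentHashRing r = new ConsistentHashRing(8); // 8 tokens per host (assumed)
        r.addHost("hostA");
        r.addHost("hostB");
        r.addHost("hostC");
        System.out.println(r.hostFor("cart:customer-12345"));
    }
}
```

Adding or removing a host only moves the key ranges adjacent to its tokens, which is the incremental-scalability property claimed above.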
System Architecture (4 of 7)
Replication
• Each data item is replicated at N hosts
• N is configured per instance
• Each node is responsible for the region of the ring between it and its Nth predecessor
• Preference list: the list of nodes responsible for storing a particular key

Data Versioning
• Eventual consistency: allows updates to be propagated to all replicas asynchronously
• put() may return to the caller before the update has been applied at all replicas
• get() may return an object that does not have the latest updates
• Multiple versions of an object can be present in the system at the same time
• Syntactic reconciliation: performed by the system
• Semantic reconciliation: performed by the client
• Vector clock: a list of (node, counter) pairs used for capturing causality between different versions of the same object; one vector clock per version per object
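A vector clock is a list of (node, counter) pairs; the toy Java sketch below (illustrative names, not Dynamo’s API) shows how a coordinator bumps its own counter on a write and how two versions are compared to detect conflicts that need semantic reconciliation.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative vector clock: one counter per coordinating node.
public class VectorClock {
    private final Map<String, Long> counters = new HashMap<>();

    // The coordinator node increments its own counter when it handles a write.
    public void increment(String node) {
        counters.merge(node, 1L, Long::sum);
    }

    // true if this clock causally descends from (i.e. supersedes) 'other'.
    public boolean descends(VectorClock other) {
        for (Map.Entry<String, Long> e : other.counters.entrySet()) {
            if (counters.getOrDefault(e.getKey(), 0L) < e.getValue()) {
                return false;
            }
        }
        return true;
    }

    // Two versions conflict when neither descends from the other; Dynamo then
    // returns both versions and lets the client reconcile semantically.
    public boolean conflictsWith(VectorClock other) {
        return !this.descends(other) && !other.descends(this);
    }

    public static void main(String[] args) {
        VectorClock v1 = new VectorClock();
        v1.increment("Sx");              // write handled by node Sx
        VectorClock v2 = new VectorClock();
        v2.increment("Sx");
        v2.increment("Sy");              // later write handled by node Sy
        System.out.println(v2.descends(v1));      // true: v2 supersedes v1
        System.out.println(v1.conflictsWith(v2)); // false: no conflict
    }
}
```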
System Architecture (5 of 7)
Execution of get() and put() Operations
• Any storage node in Dynamo is eligible to receive client get() and put() operations for any key
• A client can select a node using:
  • A generic load balancer
  • A partition-aware client library
• Coordinator:
  • The node handling a read or write operation
  • Typically the first among the top N nodes in the preference list
• A consistency protocol is used to maintain consistency among replicas. Two key configurable values are:
  • R: minimum number of nodes that must participate in a successful read operation
  • W: minimum number of nodes that must participate in a successful write operation
  • R + W > N is preferable
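The R/W/N relationship can be captured in a few lines. This illustrative sketch (not Dynamo code) encodes the per-instance configuration and the R + W > N check, using the commonly cited (3, 2, 2) configuration mentioned later in the talk as an example.

```java
// Illustrative per-instance quorum settings: N replicas per key, R readers and
// W writers required for a successful operation.
public class QuorumConfig {
    final int n; // number of nodes each data item is replicated at
    final int r; // minimum nodes that must participate in a successful read
    final int w; // minimum nodes that must participate in a successful write

    QuorumConfig(int n, int r, int w) {
        this.n = n;
        this.r = r;
        this.w = w;
    }

    // R + W > N is preferable: the read set and write set then overlap in at
    // least one node, giving a quorum-like system.
    boolean hasOverlappingQuorums() {
        return r + w > n;
    }

    public static void main(String[] args) {
        QuorumConfig common = new QuorumConfig(3, 2, 2);
        System.out.println(common.hasOverlappingQuorums()); // true
    }
}
```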
System Architecture (6 of 7)
Handling Failures: Hinted Handoff
• Mechanism to ensure that read and write operations do not fail due to temporary node or network failures
• All read and write operations are performed on the first N healthy nodes from the preference list, which may NOT always be the first N nodes encountered while walking the consistent hashing ring
• Each object is replicated across multiple data centers, which are connected through high-speed network links

Handling Permanent Failures: Replica Synchronization
• Dynamo implements an anti-entropy protocol, based on Merkle trees, to keep replicas synchronized
• Merkle tree: a hash tree whose leaves are hashes of the values of individual keys
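A simplified, in-memory illustration of the Merkle-tree idea follows: leaves hash individual key/value entries, parents hash their children, and two replicas whose roots match hold the same data for that key range. Dynamo’s actual implementation keeps one tree per key range and walks subtrees to locate differences; this sketch only computes and compares roots.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

// Simplified Merkle-tree sketch for anti-entropy replica comparison.
public class MerkleTreeSketch {
    private static byte[] sha(byte[]... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (byte[] p : parts) md.update(p);
            return md.digest();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Build the tree bottom-up over a sorted key range and return the root hash.
    public static byte[] rootHash(SortedMap<String, String> keyRange) {
        List<byte[]> level = new ArrayList<>();
        for (Map.Entry<String, String> e : keyRange.entrySet()) {
            level.add(sha(e.getKey().getBytes(StandardCharsets.UTF_8),
                          e.getValue().getBytes(StandardCharsets.UTF_8)));
        }
        if (level.isEmpty()) return sha(new byte[0]);
        while (level.size() > 1) {
            List<byte[]> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                byte[] left = level.get(i);
                byte[] right = (i + 1 < level.size()) ? level.get(i + 1) : left;
                next.add(sha(left, right));
            }
            level = next;
        }
        return level.get(0);
    }

    public static void main(String[] args) {
        SortedMap<String, String> replicaA = new TreeMap<>(Map.of("k1", "v1", "k2", "v2"));
        SortedMap<String, String> replicaB = new TreeMap<>(Map.of("k1", "v1", "k2", "stale"));
        // Equal roots mean the key range is in sync; different roots trigger a
        // recursive comparison of child hashes (omitted here).
        System.out.println(Arrays.equals(rootHash(replicaA), rootHash(replicaB))); // false
    }
}
```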
System Architecture (7 of 7)
Membership and Failure Detection
• An explicit mechanism is available to initiate the addition and removal of nodes from a Dynamo ring
• To prevent logical partitions, some Dynamo nodes play the role of seed nodes
• Seeds: nodes that are discovered by an external mechanism and are known to all nodes
• Failure detection of communication is done in a purely local manner
• Membership changes propagate via a gossip-based distributed failure detection and membership protocol
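The gossip idea can be pictured as each node keeping a version-stamped membership view and periodically reconciling it with a randomly chosen peer. The toy Java example below (node names and the reconciliation rule are assumptions) shows how a membership change spreads transitively without any central registry.

```java
import java.util.HashMap;
import java.util.Map;

// Toy gossip-based membership: host -> version of its membership record;
// on each exchange, both peers keep the newer version of every record.
public class GossipMembership {
    final Map<String, Long> view = new HashMap<>();
    final String self;

    GossipMembership(String self) {
        this.self = self;
        view.put(self, 1L);
    }

    // Merge the two views so both nodes end up with the union, newest versions winning.
    void gossipWith(GossipMembership peer) {
        Map<String, Long> merged = new HashMap<>(view);
        peer.view.forEach((host, version) -> merged.merge(host, version, Math::max));
        view.clear();
        view.putAll(merged);
        peer.view.clear();
        peer.view.putAll(merged);
    }

    public static void main(String[] args) {
        GossipMembership a = new GossipMembership("A");
        GossipMembership b = new GossipMembership("B");
        GossipMembership c = new GossipMembership("C");
        a.gossipWith(b);   // A and B now know about each other
        b.gossipWith(c);   // C learns about A transitively
        System.out.println(c.view.keySet()); // contains A, B and C
    }
}
```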
Implementation
Each storage node has three main software components: request coordination, membership & failure detection, and a local persistence engine with pluggable storage engines.

Request Coordination
• Built on top of an event-driven messaging substrate
• Uses Java NIO
• The coordinator executes client read & write requests
• State machines are created on the nodes serving requests
• Each state machine instance handles exactly one client request
• The state machine contains the entire process and failure-handling logic

Local Persistence Engine (Pluggable Storage Engines)
• Berkeley Database (BDB) Transactional Data Store
• BDB Java Edition
• MySQL
• In-memory buffer with persistent backing store
• Engine chosen based on the application’s object size distribution
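The “pluggable” aspect can be pictured as a narrow persistence interface behind which BDB, MySQL or an in-memory buffer can sit. The sketch below uses assumed interface and class names, not Dynamo’s actual code, with a simple in-memory engine standing in for a real backend.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Assumed narrow interface the request-coordination layer would program against.
interface LocalPersistenceEngine {
    void put(byte[] key, byte[] value);
    Optional<byte[]> get(byte[] key);
}

// In-memory stand-in; a production engine would wrap BDB or MySQL instead.
class InMemoryEngine implements LocalPersistenceEngine {
    private final Map<String, byte[]> store = new ConcurrentHashMap<>();

    private static String hex(byte[] key) {
        StringBuilder sb = new StringBuilder();
        for (byte b : key) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    @Override
    public void put(byte[] key, byte[] value) {
        store.put(hex(key), value.clone());
    }

    @Override
    public Optional<byte[]> get(byte[] key) {
        return Optional.ofNullable(store.get(hex(key)));
    }

    public static void main(String[] args) {
        LocalPersistenceEngine engine = new InMemoryEngine();
        engine.put("cart:42".getBytes(StandardCharsets.UTF_8),
                   "two items".getBytes(StandardCharsets.UTF_8));
        byte[] value = engine.get("cart:42".getBytes(StandardCharsets.UTF_8))
                             .orElse(new byte[0]);
        System.out.println(new String(value, StandardCharsets.UTF_8));
    }
}
```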
Experiences, Results & Lessons Learnt (1 of 4)
• Main Dynamo usage patterns:
  1. Business-logic-specific reconciliation
     • E.g. merging different versions of a customer’s shopping cart
  2. Timestamp-based reconciliation
     • E.g. maintaining a customer’s session information
  3. High-performance read engine
     • E.g. maintaining the product catalog and promotional items
• Client applications can tune parameters to achieve specific objectives:
  • N: performance {number of hosts a data item is replicated at}
  • R: availability {minimum number of participating nodes in a successful read operation}
  • W: durability {minimum number of participating nodes in a successful write operation}
  • Commonly used configuration: (N, R, W) = (3, 2, 2)
• Dynamo exposes data consistency & reconciliation logic to developers
• Dynamo adopts a full membership model: each node is aware of the data hosted by its peers
Experiences, Results & Lessons Learnt (2 of 4)
• Typical SLA of a service using Dynamo: 99.9% of read and write requests execute within 300 ms
• Balancing performance and durability

[Figure: Average & 99.9th percentile latencies of Dynamo’s read and write operations over a period of 30 days]
[Figure: Comparison of 99.9th percentile latencies for buffered vs. non-buffered writes over 24 hours]
Experiences, Results & Lessons Learnt (3 of 4)
• Ensuring uniform load distribution
  • Dynamo uses consistent hashing to partition its key space across its replicas and to ensure uniform load distribution
  • A node is “in-balance” if its request load deviates from the average load by less than a certain threshold; otherwise it is “out-of-balance”
  • Imbalance ratio = (nodes out-of-balance) / (total nodes)

[Figure: Node imbalance & workload]
[Figure: Comparison of load distribution efficiency of different strategies]
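The imbalance-ratio definition above translates directly into code. In this illustrative sketch the threshold is a parameter; the 15% value used in the demo call is only an example, not a figure taken from the slides.

```java
import java.util.Map;

// Sketch of the imbalance-ratio metric: fraction of nodes whose request load
// deviates from the average load by at least the given threshold.
public class LoadImbalance {
    public static double imbalanceRatio(Map<String, Double> requestLoadPerNode,
                                        double threshold) {
        double average = requestLoadPerNode.values().stream()
                .mapToDouble(Double::doubleValue).average().orElse(0.0);
        long outOfBalance = requestLoadPerNode.values().stream()
                .filter(load -> Math.abs(load - average) >= threshold * average)
                .count();
        return (double) outOfBalance / requestLoadPerNode.size();
    }

    public static void main(String[] args) {
        // Hypothetical per-node request rates.
        Map<String, Double> load = Map.of("n1", 110.0, "n2", 115.0, "n3", 160.0);
        System.out.println(imbalanceRatio(load, 0.15)); // only n3 is out-of-balance
    }
}
```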
Experiences, Results & Lessons Learnt (4 of 4)
• Three strategies for load distribution:
  1. T random tokens per node and partition by token value
  2. T random tokens per node and equal-sized partitions
  3. Q/S tokens per node, equal-sized partitions (S = number of nodes, Q = number of partitions)
• Divergent versions of a data item (rarely) arise in two scenarios:
  1. The system is facing failure scenarios (node / data center / network)
  2. A large number of concurrent writers to a single data item
• Server-driven coordination: client requests are uniformly assigned to nodes in the ring by a load balancer
• Client-driven coordination: client applications use a library to perform request coordination locally

                   99.9th percentile     99.9th percentile     Average read     Average write
                   read latency (ms)     write latency (ms)    latency (ms)     latency (ms)
Server-driven            68.9                  68.5                3.9              4.02
Client-driven            30.4                  30.4                1.55             1.9
Conclusion
Dynamo:
• Is a highly available and scalable data store
• Is used for storing the state of a number of core services of Amazon.com’s e-commerce platform
• Has provided the desired levels of availability and performance and has been successful in handling:
  • Server failures
  • Data center failures
  • Network partitions
• Is incrementally scalable
• Sacrifices consistency under certain failure scenarios
• Extensively uses object versioning and application-assisted conflict resolution
• Allows service owners to:
  • Scale up and down based on their current request load
  • Customize their storage system to meet desired performance, durability and consistency SLAs by tuning the N, R, W parameters
• Shows that decentralized techniques can be combined to provide a single highly available system
Pros
• Excellent description of the core distributed systems techniques used in Dynamo:
  • Partitioning, replication, versioning, membership, failure handling, scaling
• Liberal use of diagrams, charts and tables to explain concepts
• Real-world examples are provided to help the reader understand and appreciate the theoretical concepts
• Theoretical and implementation-level differences are clearly explained
• Exhaustive list of references for the interested researcher
• Well-written paper with logical transitions from one topic to the next
Cons
• Little description of the supporting techniques used in Dynamo for:
  • State transfer, concurrency & job scheduling, request marshalling, request routing, system monitoring and alarming
• Certain problems that are theoretically possible have not been investigated in detail, since they have not been encountered in production systems
• A sophisticated comparison with existing systems has not been provided
• To protect Amazon.com’s business interests, certain parts of the system have either not been entirely described or have been described only at a very high level
• Future work and possible extensions have not been clearly laid out
Questions