Transcript Slide 1

DYNAMO: AMAZON'S HIGHLY AVAILABLE KEY-VALUE STORE

Professor: Dr. Sheykh Esmaili

Presenters:
 Pourya Aliabadi
 Boshra Ardallani
 Paria Rakhshani
INTRODUCTION
 Amazon runs a world-wide e-commerce platform that serves tens of millions of customers at peak times, using tens of thousands of servers located in many data centers around the world
 Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust
INTRODUCTION
 One of the lessons our organization has learned from operating Amazon's platform is that the reliability and scalability of a system depend on how its application state is managed
 To meet these reliability and scaling needs, Amazon has developed a number of storage technologies, of which the Amazon Simple Storage Service (S3) is probably the best known
 There are many services on Amazon's platform that only need primary-key access to a data store
SYSTEM ASSUMPTIONS AND REQUIREMENTS

Query Model
 Operations to a data item that is uniquely identified by a key
 State is stored as binary objects
 No operations span multiple data items
 Dynamo targets applications that need to store objects that are relatively small (less than 1 MB)
SYSTEM ASSUMPTIONS AND REQUIREMENTS

ACID Properties
 ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably
 Dynamo targets applications that operate with weaker consistency
 Dynamo does not provide any isolation guarantees and permits only single-key updates
SYSTEM ASSUMPTIONS AND REQUIREMENTS

Efficiency
 The system needs to function on a commodity hardware infrastructure
 Services must be able to configure Dynamo such that they consistently achieve their latency and throughput requirements
 The tradeoffs are in performance, cost efficiency, availability, and durability guarantees
SYSTEM ASSUMPTIONS AND REQUIREMENTS
 Dynamo is used only by Amazon's internal services
 We will discuss the scalability limitations of Dynamo and possible scalability-related extensions
SERVICE LEVEL AGREEMENTS (SLA)
 To guarantee that the application can deliver its functionality in a bounded time, each and every dependency in the platform needs to deliver its functionality with even tighter bounds
 An example of a simple SLA is a service guaranteeing that it will provide a response within 300 ms for 99.9% of its requests, for a peak client load of 500 requests per second
 For example, a page request to one of the e-commerce sites typically requires the rendering engine to construct its response by sending requests to over 150 services
 These services often have multiple dependencies
[Figure: an abstract view of the architecture of Amazon's platform]
DESIGN CONSIDERATIONS
 Incremental scalability: Dynamo should be able to scale out one storage host (henceforth referred to as a "node") at a time, with minimal impact on both operators of the system and the system itself
 Symmetry: Every node in Dynamo should have the same set of responsibilities as its peers; there should be no distinguished node or nodes that take on special roles or an extra set of responsibilities
DESIGN CONSIDERATIONS
 Decentralization: An extension of symmetry; the design should favor decentralized peer-to-peer techniques over centralized control. In the past, centralized control has resulted in outages, and the goal is to avoid it as much as possible. This leads to a simpler, more scalable, and more available system.
 Heterogeneity: The system needs to be able to exploit heterogeneity in the infrastructure it runs on, e.g. the work distribution must be proportional to the capabilities of the individual servers. This is essential for adding new nodes with higher capacity without having to upgrade all hosts at once.
SYSTEM ARCHITECTURE
 The Dynamo data storage system contains items that are associated with a single key
 Two operations are implemented: get() and put()
 get(key): locates the object with the given key and returns the object, or a list of conflicting objects, along with a context
 put(key, context, object): places an object at a replica along with the key and context
 Context: metadata about the object
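As a rough sketch of this interface (not Amazon's actual API; the `Store` and `Context` names and the in-memory dict are illustrative assumptions), the two operations might look like this:

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    # Opaque per-object metadata (in Dynamo this carries version
    # information); the client passes it back unchanged on put().
    version: dict = field(default_factory=dict)


class Store:
    def __init__(self):
        self._data = {}  # key -> (object, context)

    def get(self, key):
        # Locate the object with this key; return the object (in real
        # Dynamo, possibly a list of conflicting objects) plus its context.
        return self._data.get(key, (None, Context()))

    def put(self, key, context, obj):
        # Place the object along with the key and the context from get().
        self._data[key] = (obj, context)


store = Store()
obj, ctx = store.get("cart:42")      # miss: obj is None
store.put("cart:42", ctx, b"items")
obj, ctx = store.get("cart:42")      # obj is now b"items"
```

Note that put() always takes the context from a preceding get(); this is how the system threads version information through client updates.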
PARTITIONING
 Provides a mechanism to dynamically partition the data over the set of nodes
 Uses consistent hashing, similar to Chord:
 Each node gets an ID from the space of keys
 Nodes are arranged in a ring
 Data is stored on the first node clockwise of the current placement of the data key
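A minimal consistent-hash ring can be sketched as follows. Dynamo does apply an MD5 hash to the key, but everything else here (the `Ring` class, the way node IDs are derived) is a simplified assumption, not the paper's implementation:

```python
import hashlib
from bisect import bisect_right


def ring_hash(key: str) -> int:
    # Map a key (or a node ID) onto the ring's key space via MD5.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        # Each node gets a position (an ID from the space of keys).
        self.tokens = sorted((ring_hash(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self.tokens]

    def owner(self, key: str) -> str:
        # Data is stored on the first node clockwise of the key's position;
        # wrap around to the start of the ring if we fall off the end.
        i = bisect_right(self._positions, ring_hash(key)) % len(self.tokens)
        return self.tokens[i][1]


ring = Ring(["node-a", "node-b", "node-c"])
ring.owner("cart:12345")  # always the same node for the same key
```

The key property is that adding or removing one node only moves the keys in that node's ring segment, rather than rehashing everything.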
VIRTUAL NODE
 A single physical node maps to multiple points in the ring, i.e. virtual nodes
 Advantages of virtual nodes:
 Graceful handling of the failure of a node
 Easy accommodation of a new node
 Heterogeneity in the physical infrastructure can be exploited
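One way to sketch virtual nodes is to hash each physical node under several synthetic names, giving it many positions on the ring; the `VNodeRing` class and the `"name#i"` token scheme below are hypothetical illustrations, not the paper's scheme:

```python
import hashlib
from bisect import bisect_right
from collections import Counter


def ring_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class VNodeRing:
    def __init__(self, nodes, tokens_per_node=64):
        # Each physical node appears at many points (virtual nodes), so
        # its load averages out over many small ring segments, and a
        # bigger machine can simply be given more tokens.
        self.tokens = sorted(
            (ring_hash(f"{n}#{i}"), n)
            for n in nodes for i in range(tokens_per_node)
        )
        self._positions = [pos for pos, _ in self.tokens]

    def owner(self, key: str) -> str:
        i = bisect_right(self._positions, ring_hash(key)) % len(self.tokens)
        return self.tokens[i][1]


ring = VNodeRing(["node-a", "node-b", "node-c"])
load = Counter(ring.owner(f"key-{k}") for k in range(9000))
# With enough virtual nodes, each physical node owns roughly 1/3 of keys.
```

This also explains the failure-handling advantage: when a node fails, its many small segments are absorbed by many different peers instead of dumping its entire range onto one successor.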
REPLICATION
 Each data item is replicated at N hosts
 N is configured per-instance
 Each node is responsible for the region of the ring between it and its Nth predecessor
 Preference list: the list of nodes responsible for storing a particular key
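Assuming a sorted token ring like the sketches above, a preference list can be built by walking clockwise from the key and collecting the first N distinct physical nodes (the function name and `tokens` layout are illustrative):

```python
from bisect import bisect_right


def preference_list(tokens, key_hash, n):
    # tokens: sorted (position, physical_node) pairs on the ring.
    # Walk clockwise from the key's position, collecting the first N
    # distinct physical nodes; skip additional virtual nodes of a host
    # already in the list so replicas land on distinct machines.
    positions = [pos for pos, _ in tokens]
    start = bisect_right(positions, key_hash)
    nodes = []
    for i in range(len(tokens)):
        node = tokens[(start + i) % len(tokens)][1]
        if node not in nodes:
            nodes.append(node)
        if len(nodes) == n:
            break
    return nodes


tokens = [(10, "a"), (20, "b"), (30, "a"), (40, "c")]
preference_list(tokens, 15, 2)  # ["b", "a"]
```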
VERSIONING
 Multiple versions of an object can be present in the system at the same time
 Vector clocks are used for version control
 Vector clock size can become an issue
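A vector clock can be sketched as a dict from node ID to a counter; one version descends from another if it dominates it element-wise, and two versions that do not dominate each other are concurrent conflicts that must both be kept. The helper names here are assumptions for illustration:

```python
def vc_descends(a: dict, b: dict) -> bool:
    # True if clock `a` dominates clock `b`, i.e. version `a` is a
    # descendant of `b` and `b` can be safely discarded.
    return all(a.get(node, 0) >= count for node, count in b.items())


def vc_merge(a: dict, b: dict) -> dict:
    # Element-wise maximum: the smallest clock descending from both,
    # used when the application reconciles two conflicting versions.
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


v1 = {"sx": 2}
v2 = {"sx": 1, "sy": 1}
vc_descends(v1, v2)  # False
vc_descends(v2, v1)  # False: neither dominates, so both versions are kept
vc_merge(v1, v2)     # {"sx": 2, "sy": 1}
```

The size issue mentioned above follows directly from this representation: every coordinator that touches an object adds an entry, so the clock can grow over time.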
EXECUTION OF GET() AND PUT() OPERATIONS
 Operations can originate at any node in the system
 Coordinator: the node handling a read or write operation
 The coordinator contacts R nodes for reading and W nodes for writing, where R + W > N
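The condition R + W > N guarantees that every read quorum overlaps every write quorum in at least one node, so a read always sees at least one up-to-date replica. A tiny sketch of the check (function name assumed):

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    # Any R-node read set and any W-node write set drawn from the same
    # N replicas must intersect when R + W > N (pigeonhole argument).
    return 0 < r <= n and 0 < w <= n and r + w > n


quorums_overlap(3, 2, 2)  # True: read and write sets must share a node
quorums_overlap(3, 1, 1)  # False: a read may miss the latest write
```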
HANDLING FAILURES
 Temporary failures: Hinted Handoff
 A mechanism to ensure that read and write operations do not fail due to temporary node or network failures
 Permanent failures: Replica Synchronization
 Synchronize with another node
 Use Merkle trees
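A Merkle tree lets two replicas compare a key range cheaply: each hashes its keys into leaves, builds a tree of hashes, and compares roots; only if the roots differ do they descend to locate the divergent leaves. A minimal sketch (all names illustrative, not Dynamo's implementation):

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaf_hashes):
    # Combine hashes pairwise, level by level, until one root remains.
    level = list(leaf_hashes)
    if not level:
        return h(b"")
    while len(level) > 1:
        if len(level) % 2:          # duplicate the last hash on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


replica_a = [h(k.encode()) for k in ["k1", "k2", "k3", "k4"]]
replica_b = [h(k.encode()) for k in ["k1", "k2", "kX", "k4"]]
merkle_root(replica_a) == merkle_root(replica_b)  # False: ranges diverge
```

Equal roots mean the whole range is in sync after exchanging a single hash, which is what makes anti-entropy cheap in the common case.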
MEMBERSHIP AND FAILURE DETECTION
 An explicit mechanism is available to initiate the addition and removal of nodes from a Dynamo ring
 To prevent logical partitions, some Dynamo nodes play the role of seed nodes
 A gossip-based distributed failure detection and membership protocol is used
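A toy gossip round can illustrate how membership information spreads: each node reconciles its view with one random peer, and because views only ever grow, every node quickly learns the full membership. This is a heavy simplification of Dynamo's actual protocol; the `views` structure is an assumption:

```python
import random


def gossip_round(views):
    # views: node -> set of members that node currently knows about.
    # Each node contacts one random peer and the pair merge their views.
    for node in list(views):
        peer = random.choice([n for n in views if n != node])
        merged = views[node] | views[peer]
        views[node] = set(merged)
        views[peer] = set(merged)


random.seed(1)
views = {n: {n} for n in ["a", "b", "c", "d"]}
views["a"].add("b")  # "a" learned about "b" out of band (e.g. via a seed)
for _ in range(10):
    gossip_round(views)
# Views grow monotonically, so after a few rounds every node almost
# certainly knows the full membership {"a", "b", "c", "d"}.
```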
IMPLEMENTATION
Each storage node has three main software components:
 Request Coordination
 Built on top of an event-driven messaging substrate
 State machines are created on the nodes serving requests
 Each state machine instance handles exactly one client request
 The coordinator executes client read and write requests
 The state machine contains the entire process and failure-handling logic
 Membership & Failure Detection
 Local Persistence Engine
 Pluggable storage engines: Berkeley Database (BDB) Transactional Data Store, BDB Java Edition, MySQL, and an in-memory buffer with a persistent backing store
 The engine is chosen based on the application's object size distribution
EXPERIENCES, RESULTS & LESSONS LEARNT
 Main Dynamo usage patterns:
 1. Business logic specific reconciliation, e.g. merging different versions of a customer's shopping cart
 2. Timestamp based reconciliation, e.g. maintaining a customer's session information
 3. High performance read engine, e.g. maintaining the product catalog and promotional items
 Client applications can tune parameters to achieve specific objectives:
 N (performance): the number of hosts a data item is replicated at
 R (availability): the minimum number of nodes participating in a successful read operation
 W (durability): the minimum number of nodes participating in a successful write operation
 Commonly used configuration: (N, R, W) = (3, 2, 2)
EXPERIENCES, RESULTS & LESSONS LEARNT
 Balancing Performance and Durability
 [Figure: average and 99.9th percentile latencies of Dynamo's read and write operations during a period of 30 days]
 [Figure: comparison of 99.9th percentile latencies for buffered vs. non-buffered writes over 24 hours]
CONCLUSION
Dynamo:
 Is a highly available and scalable data store
 Is used for storing the state of a number of core services of Amazon.com's e-commerce platform
 Has provided the desired levels of availability and performance and has been successful in handling:
 Server failures
 Data center failures
 Network partitions
 Is incrementally scalable
 Sacrifices consistency under certain failure scenarios
 Extensively uses object versioning
 Shows that decentralized techniques can be combined to provide a single highly-available system
Thanks!