NoSQL Oakland
Project Voldemort: What’s New
Alex Feinberg
The plan
Introduction
Motivation
Inspiration
Implementation
Present day
New features within the last few months
New features in active development
The roadmap
Wanted Features
Q&A
Introduction
Project Voldemort: a scalable, highly available,
distributed, key/value store
Data Platform team at LinkedIn
– Data driven features
– The infrastructure to run them
Original work by Jay Kreps, Bhupesh Bansal
The presenter: just hired a month ago to work full time on
Voldemort
Data Driven Features…
Motivation
Data driven features are data intensive in terms of reads,
writes and the size of the datasets
Scaling a relational database: if data can’t be federated,
an RDBMS becomes a de facto K/V store
SQL
– Relational algebra is a powerful tool, but not a universal solution
– Passing strings around is cumbersome, ORMs can be leaky abstractions
“The Exploits of a Mom” © XKCD
Non-relational Alternatives
Memcached is an excellent in-memory key/value cache
– Used extensively by high traffic websites, including LinkedIn
– High throughput, low latency
– Excellent scalability
Hadoop
– Used extensively by the Data Platform team
– High average throughput, but high latency
– Excellent scalability
Wanted
– Persistence and replication
– Low latency
– No single points of failure
– Scalable: accommodate more data by adding more machines
Inspiration
Amazon’s Dynamo
SOSP paper late 2007
Key-value store
Consistent hashing, vector clocks
Gossip protocol
Hinted handoff, Merkle Trees
Consistent Hashing
A key belongs to a partition
A node can hold multiple partitions
There is a tunable replication factor
(N)
If N is 3, a key mapped to partition
P is written to P-1, P and P+1
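The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not Voldemort's routing code: the MD5-based hash, the partition count of 8, the `PARTITION_TO_NODE` layout and all function names are assumptions made for the example. It follows the slide's "P-1, P and P+1" description; an actual routing strategy may choose replicas differently.

```python
import hashlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to its primary partition using a stable hash."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def replica_partitions(primary: int, n: int, num_partitions: int) -> list:
    """Return n neighbouring partitions around the primary, wrapping on the ring.

    For n = 3 this yields P-1, P and P+1, as described on the slide.
    """
    start = primary - (n - 1) // 2
    return [(start + i) % num_partitions for i in range(n)]


# Hypothetical layout: 8 partitions spread over 4 nodes, 2 partitions per node.
PARTITION_TO_NODE = {p: p % 4 for p in range(8)}


def nodes_for(key: bytes, n: int = 3, num_partitions: int = 8) -> list:
    """Return the nodes that hold the replicas of a key."""
    primary = partition_for(key, num_partitions)
    partitions = replica_partitions(primary, n, num_partitions)
    return [PARTITION_TO_NODE[p] for p in partitions]
```

Because a node holds several partitions, adding capacity means moving whole partitions between nodes rather than rehashing every key.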
Vector Clocks
Build on Leslie Lamport’s logical clocks (Lamport is also the author of LaTeX)
Want to determine the order of writes
Total order demands strong consistency
– Partial ordering: determine “x came before y” relation in most cases
Associate a vector clock with a value
– Versioned value is a (value, vector clock) tuple
– Multiple versioned values can exist for a key
– We can use a vector clock to determine causality
– If two versioned values aren’t causally related, allow application to
reconcile
– Shopping cart example
Vector Clocks: Initial State
Vector Clocks: Event Occurs
Vector Clocks: Multi-cast the Vector Clock
Vector Clocks: Node Becomes Partitioned
Vector Clocks: Causality Determined
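The sequence on the preceding slides can be sketched as follows. This is an illustrative sketch, not Voldemort's implementation: clocks are plain dictionaries from node name to counter, and the function names `increment`, `compare` and `merge` are assumptions made for the example.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with the given node's counter advanced."""
    new_clock = dict(clock)
    new_clock[node] = new_clock.get(node, 0) + 1
    return new_clock


def compare(a: dict, b: dict) -> str:
    """Classify clock a relative to b: before, after, equal or concurrent."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # no causal relation: the application must reconcile


def merge(a: dict, b: dict) -> dict:
    """Combine two clocks by taking the per-node maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


# Shopping cart example: two writes derived from the same version.
base = ({"book"}, increment({}, "A"))
left = (base[0] | {"pen"}, increment(base[1], "B"))
right = (base[0] | {"lamp"}, increment(base[1], "C"))

if compare(left[1], right[1]) == "concurrent":
    # Application-level reconciliation: keep every item from both carts.
    reconciled = (left[0] | right[0], merge(left[1], right[1]))
```

Taking the union of the carts is one possible reconciliation policy; the store only reports that the two versions are not causally related.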
Implementation
Customization at all layers
– Pluggable serialization (JSON, protocol
buffers, Thrift) allows keys and values to
be structures rather than just strings
Tunable R, W, N parameters
Storage engines
– No persistent data structure that is good at
everything
– BDB is most popular
– Read only stores
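To illustrate why R, W and N are worth tuning, here is a small sketch of the usual Dynamo-style quorum rule. It assumes N is the replication factor, R the number of replicas that must answer a read and W the number that must acknowledge a write; the function names are made up for the example.

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Dynamo-style rule: reads see the latest write when R + W > N."""
    return r + w > n


def every_read_meets_every_write(n: int, r: int, w: int) -> bool:
    """Brute-force check that any R replicas share a member with any W replicas."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )
```

With N = 3, choosing R = 2 and W = 2 guarantees overlap, while R = 1 and W = 1 trades that guarantee for lower latency.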
Present day
Production use at LinkedIn
– Multiple clusters
– Data Platform usage
– Other teams’ usage
– Read only stores for data built out in Hadoop
Production use outside of LinkedIn
– Gilt Group, KaChing, others
Revision control through git
– Hosted on github
Active developer community, inside and outside LinkedIn
Recently Added: Read Only Stores
Motivation
Offline batch/computing
Optimize the store for atomic swaps
and rollbacks
Leverage what Hadoop provides
Implementation
Memory mapped files
Integration with Hadoop
Driver program to initiate fetch
and swap in parallel
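The swap and rollback idea can be sketched with in-memory data. This is only a toy model: the real store serves memory mapped files built in Hadoop, whereas the `ReadOnlyStore` class and its method names below are hypothetical and use plain dictionaries.

```python
class ReadOnlyStore:
    """Toy read-only store that keeps every data set version it was given."""

    def __init__(self, initial: dict):
        self._versions = [dict(initial)]  # oldest first
        self._current = self._versions[-1]

    def get(self, key):
        """Serve reads from whichever version is currently live."""
        return self._current.get(key)

    def swap(self, new_data: dict) -> None:
        """Make a freshly built data set live with one reference assignment."""
        snapshot = dict(new_data)
        self._versions.append(snapshot)
        self._current = snapshot

    def rollback(self) -> None:
        """Discard the live version and return to the previous one."""
        if len(self._versions) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._versions.pop()
        self._current = self._versions[-1]
```

Readers never see a half-loaded data set because the new version is built completely before the reference changes.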
Recently Added: NIO
Non-blocking IO, why?
– Scalability and the c10k problem
Java’s NIO framework
– Added in 1.4, greatly improved in 1.5 and 1.6
– Will use native scalable poll implementation
Tricky to get good performance
Contributed by Kirk True
NIO Performance and Scalability
Recently Added: Data Compression
Motivation: smaller data size
– Denormalized data leads to big blobs
– Less to transfer between client and server
– More of the data can be stored in main memory
– Less to transfer from disk to memory
– Compression/decompression is fast
– If we’re I/O bound, fewer bytes to express the same data implies better
performance
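A quick way to see the effect on a denormalized blob is sketched below. The record layout is invented for the example, and zlib is used only because it ships with Python; it is an assumption, not a statement about which codecs Voldemort supports.

```python
import json
import zlib

# Hypothetical denormalized value: the same fields repeated for every entry.
record = {
    "member_id": 42,
    "connections": [
        {"id": i, "title": "Software Engineer", "industry": "Internet"}
        for i in range(200)
    ],
}

raw = json.dumps(record).encode("utf-8")
packed = zlib.compress(raw)                      # what would be stored or sent
restored = json.loads(zlib.decompress(packed))   # what the reader gets back

print(len(raw), "bytes raw,", len(packed), "bytes compressed")
```

Repeated field names and values are exactly what general-purpose compressors exploit, which is why denormalized data tends to shrink well.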
Implementation
Usage
Monitoring and Administration
In place: JMX hooks
– View statistics (how many queries are made? How long are they taking?)
– Perform operations (analogous to SNMP traps)
Admin Server
– Functionality which is needed, but shouldn’t be performed by regular store
clients
– Ability to update and retrieve cluster/store metadata
– Functionality to efficiently stream keys and values in a partition
Network class loader/server side filtering
On The Roadmap
Failure detection
Large value support
Publish/subscribe
Rebalancing
On The Roadmap: Rebalancing
Rebalancing: ability to add a server to a cluster while the
cluster is still running
Node enters a cluster, “steals” a partition from other
nodes (fetches it as a stream using the admin protocol)
Pull-based gossip protocol to let other nodes know that
it’s in the cluster
– Metadata about cluster membership treated as data, conflicts reconciled
using vector clocks
While the new node is transferring the partitions, gets
sent to it are redirected to the donor node(s)
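The redirect behaviour during a transfer can be sketched as follows. The `Node` class and its method names are hypothetical, and the copy is an ordinary loop rather than a stream over the admin protocol; the point is only to show gets being forwarded to the donor until the partition has fully arrived.

```python
class Node:
    """Toy node holding partitions as dictionaries of key/value pairs."""

    def __init__(self, name: str):
        self.name = name
        self.data = {}        # partition id -> {key: value}
        self.migrating = {}   # partition id -> donor Node still being copied from

    def get(self, partition: int, key):
        """Answer locally, or forward to the donor while a transfer is running."""
        donor = self.migrating.get(partition)
        if donor is not None:
            return donor.get(partition, key)
        return self.data.get(partition, {}).get(key)

    def begin_steal(self, partition: int, donor: "Node") -> None:
        """Claim a partition; reads are redirected from this point on."""
        self.migrating[partition] = donor
        self.data[partition] = {}

    def copy_entries(self, partition: int, limit=None) -> None:
        """Copy entries from the donor, optionally stopping after `limit` items."""
        donor = self.migrating[partition]
        for i, (key, value) in enumerate(donor.data[partition].items()):
            if limit is not None and i >= limit:
                break
            self.data[partition][key] = value

    def finish_steal(self, partition: int) -> None:
        """End the transfer: stop redirecting and release the donor's copy."""
        donor = self.migrating.pop(partition)
        donor.data.pop(partition, None)
```

Writes arriving during the transfer are left out of this sketch; a real implementation has to handle them as well.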
Stability and Infrastructure
Testing “in the cloud”
Distributed systems have to be tested on multinode clusters
Distributed systems have complex failure
scenarios
A storage system, above all, must be stable
Automated testing allows rapid iteration while
maintaining confidence in systems’ correctness
and stability
EC2-based testing framework
Tests are invoked programmatically
Contributed by Kirk True
Adaptable to other cloud hosting providers
Will run on a regular basis
Regular releases for new features
and bug fixes
Trunk stays stable
Wanted Features
Clients for other languages
Outside of the JVM
Ruby, PHP (popular for web development)
On the JVM
JRuby, Scala, Clojure
Different languages have different idioms
Java’s idiom is objects with mutable state
Views
Inspired by CouchDB
Want to change a value for a key without transferring that value back and
forth
Example: adding to a list, incrementing a counter
Fewer collisions/conflicts
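As a rough sketch of the idea, the store could run a small transformation next to the data so that only the change travels over the network. The `Store` class and its `apply` method are hypothetical names for this example and are not part of Voldemort's API.

```python
class Store:
    """Toy key/value store with a server-side update hook."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value) -> None:
        self._data[key] = value

    def apply(self, key, transform, default=None) -> None:
        """Replace the stored value with transform(current value).

        The current value never leaves the store; only the transformation
        and its small argument need to be sent by the client.
        """
        self._data[key] = transform(self._data.get(key, default))


store = Store()
store.apply("page_views", lambda n: n + 1, default=0)           # increment a counter
store.apply("recent", lambda xs: xs + ["item-1"], default=[])   # add to a list
```

Without such a hook, a client has to get the whole value, modify it and put it back, which widens the window for conflicting writes.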
Contributions are Welcome
Thriving open source community
– Fork us on Github: http://github.com/voldemort/voldemort
– Wiki: http://wiki.github.com/voldemort/voldemort
Fun projects: http://wiki.github.com/voldemort/voldemort/fun-projects
– IRC channel: #Voldemort on Freenode (irc.freenode.org)
Want to work on this full time? LinkedIn is hiring!
Just in the Data Platform group
Other technologies: Scala, Hadoop, ZooKeeper, Lucene, Netty
Projects: real time faceted search, distributed graph databases, machine
learning, data mining, information retrieval / extraction, NLP
Open source projects: Zoie, Bobo, Sensei-search, decomposer, kamikaze (three
more on the way!)
More elsewhere!
Contact me
http://www.linkedin.com/in/alexfeinberg
[email protected]
Questions?