NoSQL Oakland


Project Voldemort: What’s New
Alex Feinberg
The plan
 Introduction
 Motivation
 Inspiration
 Implementation
 Present day
 New features from the last few months
 New features in active development
 The roadmap
 Wanted Features
 Q&A
Introduction
 Project Voldemort: a scalable, highly available,
distributed, key/value store
 Data Platform team at LinkedIn
– Data driven features
– The infrastructure to run them
 Original work by Jay Kreps, Bhupesh Bansal
 The presenter: just hired a month ago to work full time on
Voldemort
Data Driven Features…
Motivation
 Data driven features are data intensive in terms of reads,
writes and the size of the datasets
 Scaling a relational database: if data can’t be federated,
RDBMS becomes a de-facto K/V store
 SQL
– Relational algebra is a powerful tool, but not a universal solution
– Passing strings around is cumbersome, ORMs can be leaky abstractions
“The Exploits of a Mom” © XKCD
Non-relational Alternatives
 Memcached is an excellent in-memory key/value cache
– Used extensively by high traffic websites, including LinkedIn
– High throughput, low latency
– Excellent scalability
 Hadoop
– Used extensively by the Data Platform team
– High average throughput, but high latency
– Excellent scalability
 Wanted
– Persistence and replication
– Low latency
– No single points of failure
– Scalable: accommodate more data by adding more machines
Inspiration
 Amazon’s Dynamo
 SOSP paper late 2007
 Key-value store
 Consistent hashing, vector clocks
 Gossip protocol
 Hinted handoff, Merkle Trees
Consistent Hashing
 A key belongs to a partition
 A node can hold multiple partitions
 There is a tunable replication factor
(N)
 If N is 3, a key mapped to partition
P is written to P-1, P and P+1
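The placement rule on this slide can be sketched in a few lines of Python. This is only an illustration, not Voldemort's code: the six-partition, three-node layout, the node names, and the choice of MD5 as the hash are all assumptions made for the example, and only the N = 3 case from the slide is covered.

```python
import hashlib

# Hypothetical cluster layout (an assumption for this sketch): 6 partitions
# spread over 3 nodes, so each node holds multiple partitions.
PARTITION_TO_NODE = {0: "node-a", 1: "node-b", 2: "node-c",
                     3: "node-a", 4: "node-b", 5: "node-c"}
NUM_PARTITIONS = len(PARTITION_TO_NODE)


def partition_for(key: bytes) -> int:
    """Hash a key onto the ring of partitions (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def replica_partitions(p: int) -> list:
    """For N = 3, return P-1, P and P+1, wrapping around the ring."""
    return [(p - 1) % NUM_PARTITIONS, p, (p + 1) % NUM_PARTITIONS]


def nodes_for(key: bytes) -> list:
    """Nodes that receive a write for the given key."""
    return [PARTITION_TO_NODE[p] for p in replica_partitions(partition_for(key))]
```

With this layout any three neighbouring partitions live on three different nodes, so `nodes_for(b"member:42")` always names three distinct machines.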
Vector Clocks
 Build on Leslie Lamport’s logical clocks (Lamport is also the author of LaTeX)
 Want to determine the order of writes
 Total order demands strong consistency
– Partial ordering: determine “x came before y” relation in most cases
 Associate a vector clock with a value
– Versioned value is a (value, vector clock) tuple
– Multiple versioned values can exist for a key
– We can use a vector clock to determine causality
– If two versioned values aren’t causally related, allow application to reconcile
– Shopping cart example
Vector Clocks: Initial State
Vector Clocks: Event Occurs
Vector Clocks: Multi-cast the Vector Clock
Vector Clocks: Node Becomes Partitioned
Vector Clocks: Causality Determined
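The causality check and the shopping cart example can be sketched as follows. This is a minimal illustration, assuming a vector clock is a plain dict from node name to counter; the function names, the node names and the "union of items" reconciliation rule are choices made for this sketch rather than anything taken from Voldemort.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with the given node's counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def compare(a: dict, b: dict) -> str:
    """Return 'before', 'after', 'equal' or 'concurrent' for clocks a and b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither write saw the other


def merge(a: dict, b: dict) -> dict:
    """Pointwise maximum: a clock that descends from both inputs."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


# Shopping cart: two clients update the same cart through different nodes.
base_cart, base_clock = {"book"}, increment({}, "node1")
cart_a, clock_a = base_cart | {"pen"}, increment(base_clock, "node1")
cart_b, clock_b = base_cart | {"lamp"}, increment(base_clock, "node2")

if compare(clock_a, clock_b) == "concurrent":
    # The application reconciles: here it simply keeps every item.
    reconciled_cart = cart_a | cart_b
    reconciled_clock = merge(clock_a, clock_b)
```

Keeping every item is only one possible reconciliation policy; the point of handing both versions to the application is that it can pick whatever rule suits its data.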
Implementation
 Customization at all layers
– Pluggable serialization (JSON, protocol
buffers, Thrift) allows keys and values to
be structures rather than just strings
 Tunable R, W, N parameters
 Storage engines
– No persistent data structure that is good at
everything
– BDB is most popular
– Read only stores
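The slides do not define R, W and N. In the Dynamo design that inspired Voldemort, N is the replication factor, R is the number of replicas that must answer a read, and W is the number that must acknowledge a write. Under that assumption, a read is guaranteed to touch at least one replica holding the latest successful write when R + W > N. The sketch below states the rule and checks it by enumerating every possible quorum; the function names are made up for this example.

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when every read quorum shares a replica with every write quorum."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must be between 1 and N")
    return r + w > n


def overlap_by_enumeration(n: int, r: int, w: int) -> bool:
    """Same answer, found the slow way: try every pair of quorums."""
    replicas = range(n)
    return all(set(reads) & set(writes)
               for reads in combinations(replicas, r)
               for writes in combinations(replicas, w))
```

For example, N = 3 with R = 2 and W = 2 overlaps, while N = 3 with R = 1 and W = 1 trades that guarantee for lower latency.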
Present day
 Production use at LinkedIn
– Multiple clusters
– Data Platform usage
– Other teams’ usage
– Read only stores for data built out in Hadoop
 Production use outside of LinkedIn
– Gilt Group, KaChing, others
 Revision control through git
– Hosted on github
 Active developer community, inside and outside LinkedIn
Recently Added: Read Only Stores
 Motivation
 Offline batch/computing
 Optimize the store for atomic swaps
and rollbacks
 Leverage what Hadoop provides
 Implementation
 Memory mapped files
 Integration with Hadoop
 Driver program to initiate fetch
and swap in parallel
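The swap-and-rollback idea can be sketched with a small in-memory stand-in. This is an assumption-laden illustration, not the real engine: the class and method names are hypothetical, plain dicts take the place of the memory mapped files, and the Hadoop build and parallel fetch steps are not modeled.

```python
import threading


class ReadOnlyStore:
    """Hypothetical store that serves one immutable data version at a time."""

    def __init__(self, initial: dict):
        self._lock = threading.Lock()
        self._current = dict(initial)
        self._previous = None

    def get(self, key):
        """Reads always see one complete version, never a mix of two."""
        return self._current.get(key)

    def swap(self, new_data: dict) -> None:
        """Switch to a freshly built data set, keeping the old one for rollback."""
        fresh = dict(new_data)  # prepare fully before taking the lock
        with self._lock:
            self._previous, self._current = self._current, fresh

    def rollback(self) -> None:
        """Return to the data set that was serving before the last swap."""
        with self._lock:
            if self._previous is None:
                raise RuntimeError("no previous version to roll back to")
            self._current, self._previous = self._previous, None
```

Because a version is never modified after it is built, a swap is just a reference change, which is what makes rolling back a bad batch cheap.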
Recently Added: NIO
 Non-blocking IO, why?
– Scalability and the c10k problem
 Java’s NIO framework
– Added in 1.4, greatly improved in 1.5 and 1.6
– Uses the platform’s native scalable poll implementation where one is available (e.g. epoll on Linux)
 Tricky to get good performance
 Contributed by Kirk True
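The core idea behind non-blocking IO, one thread serving many connections by asking the operating system which of them are ready, can be sketched with Python's `selectors` module, which plays a role similar to Java NIO's `Selector`. This is an analogy only, not Voldemort's server code: the function name is made up, local socket pairs stand in for client connections, and payloads are assumed to be small and non-empty.

```python
import selectors
import socket


def echo_all(payloads):
    """Echo each payload over its own connection using one thread and one selector."""
    selector = selectors.DefaultSelector()
    clients = []
    for payload in payloads:
        client, server = socket.socketpair()
        server.setblocking(False)
        selector.register(server, selectors.EVENT_READ)
        client.sendall(payload)
        clients.append(client)

    pending = len(clients)
    while pending:
        # Wait until at least one connection is readable, then serve only those.
        for key, _events in selector.select():
            conn = key.fileobj
            data = conn.recv(4096)
            conn.sendall(data)  # small payloads fit in the socket buffer
            selector.unregister(conn)
            conn.close()
            pending -= 1

    replies = []
    for client in clients:
        replies.append(client.recv(4096))
        client.close()
    selector.close()
    return replies
```

No thread is parked on an idle connection here, which is the property that matters when the number of open connections grows into the thousands.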
NIO Performance and Scalability
Recently Added: Data Compression
 Motivation: smaller data size
– Denormalized data leads to big blobs
– Less to transfer between client and server
– More of the data can be stored in main memory
– Less to transfer from disk to memory
– Compression/decompression is fast
– If we’re I/O bound, fewer bytes to express the same data implies better performance
 Implementation
 Usage
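The implementation and usage details from the original slides did not survive in this transcript, and the following does not try to reconstruct them. It is only a rough illustration of the motivation, assuming JSON-serialized values and using Python's zlib as a stand-in for whatever codec a store is configured with; the function names and the sample record are invented for the example.

```python
import json
import zlib


def compress_value(value: dict) -> bytes:
    """Serialize a value to JSON and compress it before it is stored or sent."""
    return zlib.compress(json.dumps(value).encode("utf-8"))


def decompress_value(blob: bytes) -> dict:
    """Reverse of compress_value."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# A denormalized record: the same field names repeat for every connection.
profile = {
    "member_id": 42,
    "connections": [
        {"member_id": i, "company": "Example Corp", "industry": "Internet"}
        for i in range(50)
    ],
}

raw_size = len(json.dumps(profile).encode("utf-8"))
compressed_size = len(compress_value(profile))
```

Repetitive, denormalized blobs like this one are where compression tends to pay off most, since the repeated field names and values compress well.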
Monitoring and Administration
 In place: JMX hooks
– View statistics (how many queries are made? How long are they taking?)
– Perform operations (analogous to SNMP traps)
 Admin Server
– Functionality which is needed, but shouldn’t be performed by regular store
clients
– Ability to update and retrieve cluster/store metadata
– Functionality to efficiently stream keys and values in a partition
 Network class loader/server side filtering
On The Roadmap
 Failure detection
 Large value support
 Publish/subscribe
 Rebalancing
On The Roadmap: Rebalancing
 Rebalancing: ability to add a server to a cluster while the
cluster is still running
 Node enters a cluster, “steals” a partition from other
nodes (fetches it as a stream using the admin protocol)
 Pull-based gossip protocol to let other nodes know that
it’s in the cluster
– Metadata about cluster membership treated as data, conflicts reconciled
using vector clocks
 While the new node is transferring the partitions, gets
sent to it are redirected to the donor node(s)
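The redirect step can be sketched as below. Everything here is a simplified assumption: the class and method names are hypothetical, a dict copy stands in for streaming a partition over the admin protocol, and gossip and cluster metadata are left out entirely.

```python
class Node:
    """Hypothetical storage node holding a few partitions."""

    def __init__(self, name: str):
        self.name = name
        self.data = {}    # partition id -> {key: value}
        self.donors = {}  # partition id -> donor Node while a transfer is running

    def get(self, partition: int, key: str):
        donor = self.donors.get(partition)
        if donor is not None:
            # Transfer still in progress: redirect the get to the donor node.
            return donor.get(partition, key)
        return self.data.get(partition, {}).get(key)

    def begin_steal(self, partition: int, donor: "Node") -> None:
        """Claim a partition; reads are redirected until the transfer completes."""
        self.donors[partition] = donor

    def finish_steal(self, partition: int) -> None:
        """Copy the partition over (a stand-in for streaming) and stop redirecting."""
        donor = self.donors.pop(partition)
        self.data[partition] = dict(donor.data.pop(partition, {}))


old_node = Node("node-a")
old_node.data[2] = {"member:42": "profile-blob"}
new_node = Node("node-d")
new_node.begin_steal(2, old_node)
value_during_transfer = new_node.get(2, "member:42")
new_node.finish_steal(2)
value_after_transfer = new_node.get(2, "member:42")
```

Clients asking the new node get the same answer before and after the transfer completes, which is what allows the cluster to keep running while it grows.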
Stability and Infrastructure
 Testing “in the cloud”
 Distributed systems have to be tested on multinode clusters
 Distributed systems have complex failure
scenarios
 A storage system, above all, must be stable
 Automated testing allows rapid iteration while
maintaining confidence in systems’ correctness
and stability
 EC2-based testing framework
 Tests are invoked programmatically
 Contributed by Kirk True
 Adaptable to other cloud hosting providers
 Will run on a regular basis
 Regular releases for new features
and bug fixes
 Trunk stays stable
Wanted Features
 Clients for other languages
 Outside of the JVM
 Ruby, PHP (popular for web development)
 On the JVM
 JRuby, Scala, Clojure
 Different languages have different idioms
 Java’s idiom is objects with mutable state
 Views
 Inspired by CouchDB
 Want to change a value for a key without transferring that value back and
forth
 Example: adding to a list, incrementing a counter
 Fewer collisions/conflicts
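Since views are only a wanted feature here, there is no API to show. One way the idea could look, purely as a sketch with hypothetical names, is a store that accepts a small transformation and applies it next to the data, so the client sends "append this item" or "add this amount" instead of fetching, changing and rewriting the whole value.

```python
class ViewStore:
    """Hypothetical store that can apply a transformation where the value lives."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value) -> None:
        self._data[key] = value

    def update(self, key, transform, default=None) -> None:
        """Apply transform to the stored value in place; the value itself is not returned."""
        self._data[key] = transform(self._data.get(key, default))


def append_item(item):
    """Transformation: add one item to a list value."""
    return lambda current: (current or []) + [item]


def increment_by(amount: int):
    """Transformation: add an amount to a counter value."""
    return lambda current: (current or 0) + amount


store = ViewStore()
store.update("followers:42", append_item("alice"))
store.update("followers:42", append_item("bob"))
store.update("page-views:42", increment_by(1))
store.update("page-views:42", increment_by(2))
```

Because the read-modify-write happens in one place, two clients appending to the same list no longer race each other with full copies of the value, which is where the reduction in conflicts would come from.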
Contributions are Welcome
 Thriving open source community
– Fork us on Github: http://github.com/voldemort/voldemort
– Wiki: http://wiki.github.com/voldemort/voldemort
 Fun projects: http://wiki.github.com/voldemort/voldemort/fun-projects
– IRC channel: #Voldemort on Freenode (irc.freenode.org)
 Want to work on this full time? LinkedIn is hiring!
 Just in the Data Platform group
 Other technologies: Scala, Hadoop, ZooKeeper, Lucene, Netty
 Projects: real time faceted search, distributed graph databases, machine
learning, data mining, information retrieval / extraction, NLP
 Open source projects: Zoie, Bobo, Sensei-search, decomposer, kamikaze (three
more on the way!)
 More elsewhere!
 Contact me
 http://www.linkedin.com/in/alexfeinberg
Questions?