Transcript 2011SS-03
Cassandra – A Decentralized
Structured Storage System
A. Lakshman1, P. Malik1
1Facebook
SIGOPS ’10
2011. 03. 18.
Summarized and Presented by Sang-il Song, IDS Lab., Seoul National University
The Rise of NoSQL
Eric Evans, a Rackspace employee, reintroduced the term NoSQL in early 2009 when Johan Oskarsson of Last.fm wanted to organize an event to discuss open-source distributed databases.
The name attempted to label the emergence of a growing number of distributed data stores that often did not attempt to provide ACID guarantees.
Refer to http://www.google.com/trends?q=nosql
NoSQL Database
Based on Key-Value
memcached, Dynamo, Voldemort, Tokyo Cabinet
Based on Column
Google BigTable, Cloudata, HBase, Hypertable, Cassandra
Based on Document
MongoDB, CouchDB
Based on Graph
Neo4j, FlockDB, InfiniteGraph
Refer to http://blog.nahurst.com/visual-guide-to-nosql-systems
Contents
Introduction
Remind: Dynamo
Cassandra
Data Model
System Architecture
– Partitioning
– Replication
– Membership
– Bootstrapping
Operations
– WRITE
– READ
– Consistency Level
Performance Benchmark
Case Study
Conclusion
Remind: Dynamo
Distributed Hash Table
BASE
– Basically Available
– Soft-state
– Eventually Consistent
Client-tunable consistency/availability
NRW configuration
– W=N, R=1: read-optimized strong consistency
– W=1, R=N: write-optimized strong consistency
– W+R ≤ N: weak eventual consistency
– W+R > N: strong consistency
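A minimal sketch of the quorum arithmetic behind the NRW configurations above (illustrative only; the function name is not from Dynamo or Cassandra):

```python
# Quorum check: with N replicas, W write acks and R read responses,
# R + W > N guarantees the read and write quorums overlap, so a read
# observes the latest acknowledged write.
def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    return r + w > n

N = 3
print(is_strongly_consistent(N, r=1, w=N))  # W=N, R=1 -> True  (read-optimized)
print(is_strongly_consistent(N, r=N, w=1))  # W=1, R=N -> True  (write-optimized)
print(is_strongly_consistent(N, r=1, w=1))  # W+R <= N -> False (weak/eventual)
```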
Cassandra
Dynamo-Bigtable lovechild
– Column-based data model
– Distributed Hash Table
– Tunable tradeoff: consistency vs. latency
Properties
– No single point of failure
– Linearly scalable
– Flexible partitioning, replica placement
– High availability (eventual consistency)
Data Model
Cluster
Keyspace corresponds to a database or tablespace
Column Family corresponds to a table
Column is the unit of data stored in Cassandra
Example: Column Family "User" (row key -> columns)
"userid1": (name: Username, value: uname1), (name: Email, value: [email protected]), (name: Tel, value: 123-4567)
"userid2": (name: Username, value: uname2), (name: Email, value: [email protected]), (name: Tel, value: 123-4568)
"userid3": (name: Username, value: uname3), (name: Email, value: [email protected]), (name: Tel, value: 123-4569)
Example: Column Family "Article"
(name: ArticleId, value: userid2-1), (name: ArticleId, value: userid2-2), (name: ArticleId, value: userid2-3)
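The layout above can be pictured as nested maps. A rough conceptual sketch in Python (plain dicts, not the Cassandra client API; values taken from the "User" example):

```python
# Keyspace -> column family -> row key -> column name -> value.
# Conceptual illustration only.
keyspace = {
    "User": {                       # column family
        "userid1": {                # row key
            "Username": "uname1",
            "Tel": "123-4567",
        },
        "userid2": {
            "Username": "uname2",
            "Tel": "123-4568",
        },
    },
}
print(keyspace["User"]["userid1"]["Tel"])   # -> 123-4567
```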
System Architecture
Partitioning
Replication
Membership
Bootstrapping
Partitioning Algorithm
Distributed Hash Table
– Data and servers are located in the same address space
Consistent Hashing
– Key space partitioning: arrangement of the keys
– Overlay networking: routing mechanism
[Figure: consistent hashing ring with nodes N1, N2, N3; hash(key1) is mapped onto the ring and N2 is deemed the coordinator of key1]
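A minimal consistent-hashing sketch (illustrative, not Cassandra's implementation): nodes and keys are hashed into one address space, and a key is served by the first node clockwise from its hash, i.e. its coordinator.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    # Hash both node names and keys into the same address space.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.tokens = sorted((ring_hash(n), n) for n in nodes)
        self.positions = [t for t, _ in self.tokens]

    def coordinator(self, key: str) -> str:
        # First node clockwise from the key's position (wrapping around).
        i = bisect.bisect_right(self.positions, ring_hash(key)) % len(self.tokens)
        return self.tokens[i][1]

ring = Ring(["N1", "N2", "N3"])
print(ring.coordinator("key1"))   # the node deemed the coordinator of key1
```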
Partitioning Algorithm (cont’d)
Challenges
– Non-uniform data and load distribution
– Oblivious to the heterogeneity in the performance of nodes
Solutions
– Nodes get assigned to multiple positions in the circle (like Dynamo); see the sketch after this slide
– Analyze load information on the ring and have lightly loaded nodes move on the ring to alleviate heavily loaded nodes (like Cassandra)
[Figure: ring where each node (N1, N2, N3) is assigned multiple positions (virtual nodes)]
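A sketch of the first, Dynamo-style solution referenced above: each physical node claims several positions (virtual nodes) on the same ring, which evens out key and load distribution. Illustrative only; the number of positions per node is an assumption.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class VirtualNodeRing:
    def __init__(self, nodes, positions_per_node=8):
        # Each node gets several tokens, e.g. "N1#0", "N1#1", ...
        self.tokens = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(positions_per_node)
        )
        self.positions = [t for t, _ in self.tokens]

    def coordinator(self, key: str) -> str:
        i = bisect.bisect_right(self.positions, ring_hash(key)) % len(self.tokens)
        return self.tokens[i][1]

ring = VirtualNodeRing(["N1", "N2", "N3"])
print(ring.coordinator("key1"))
```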
Replication
RackUnaware
[Figure: ring of nodes A through J showing the coordinator of data1 and the replicas placed on the ring]
RackAware
DataCenterShared
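A sketch of rack-unaware replica placement on the ring (the coordinator of a key plus the next replication_factor - 1 nodes clockwise); the rack-aware and datacenter-aware strategies additionally constrain which successors may be chosen. Node names and the helper are illustrative, not Cassandra code.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def replicas(nodes, key, replication_factor=3):
    ring = sorted((ring_hash(n), n) for n in nodes)
    positions = [t for t, _ in ring]
    start = bisect.bisect_right(positions, ring_hash(key)) % len(ring)
    # Walk clockwise from the coordinator, collecting successive nodes.
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]

print(replicas(list("ABCDEFGHIJ"), "data1"))   # coordinator of data1 plus two successors
```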
Cluster Membership
Gossip protocol is used for cluster membership
– Super lightweight with mathematically provable properties
– State disseminated in O(log N) rounds
Every T seconds each member increments its heartbeat counter and selects one other member to send its list to
A member merges the received list with its own list
Gossip Protocol
[Figure: gossip example over timestamps t1–t6. server 1 and server 2 start with only their own heartbeat entries (server1: t1, server2: t2); after a gossip exchange both hold the merged list (server1: t4, server2: t2); server 3 joins with (server3: t5), and by t6 all three servers converge on the same list (server1: t6, server2: t6, server3: t5)]
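A minimal heartbeat-gossip sketch of the mechanism illustrated above (illustrative, not Cassandra's Gossiper): each round a member bumps its own heartbeat and pushes its membership list to one random peer, which keeps the highest heartbeat seen for each member.

```python
import random

class Member:
    def __init__(self, name):
        self.name = name
        self.heartbeats = {name: 0}              # member name -> latest heartbeat seen

    def gossip_round(self, peers):
        self.heartbeats[self.name] += 1          # "every T seconds" tick
        random.choice(peers).merge(self.heartbeats)

    def merge(self, remote):
        for member, beat in remote.items():
            if beat > self.heartbeats.get(member, -1):
                self.heartbeats[member] = beat

members = [Member("server1"), Member("server2"), Member("server3")]
for _ in range(10):
    for m in members:
        m.gossip_round([p for p in members if p is not m])
print(members[0].heartbeats)                     # all members known after a few rounds
```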
Accrual Failure Detector
Valuable for system management, replication, load balancing
Designed to adapt to changing network conditions
The value output, PHI, represents a suspicion level
Applications set an appropriate threshold, trigger suspicions and perform appropriate actions
In Cassandra the average time taken to detect a failure is 10-15 seconds with the PHI threshold set at 5
PHI(t_now) = -log10( P_later(t_now - t_last) ), where for exponentially distributed inter-arrival times P_later(t) = e^(-λt)
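A sketch of a PHI accrual failure detector under the exponential model above (illustrative; Cassandra maintains a sliding window of heartbeat inter-arrival times rather than the single running mean assumed here):

```python
import math
import time

class PhiFailureDetector:
    def __init__(self, threshold=5.0):
        self.threshold = threshold
        self.last_heartbeat = None
        self.mean_interval = 1.0                 # estimated arrival interval (s)

    def heartbeat(self, now=None):
        now = time.time() if now is None else now
        if self.last_heartbeat is not None:
            sample = now - self.last_heartbeat
            self.mean_interval = 0.9 * self.mean_interval + 0.1 * sample
        self.last_heartbeat = now

    def phi(self, now=None):
        now = time.time() if now is None else now
        t = now - self.last_heartbeat
        # PHI = -log10(P_later(t)) with P_later(t) = e^(-t / mean_interval),
        # so suspicion grows linearly with the time since the last heartbeat.
        return t / (self.mean_interval * math.log(10))

    def suspect(self, now=None):
        return self.phi(now) > self.threshold

d = PhiFailureDetector()
d.heartbeat(now=0.0); d.heartbeat(now=1.0)
print(d.suspect(now=1.5), d.suspect(now=20.0))   # False, True
```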
Bootstrapping
A new node gets assigned a token such that it can alleviate a heavily loaded node
[Figure: ring before and after a new node joins]
WRITE
Interface
– Simple: put(key, col, value)
– Complex: put(key, [col:val, …, col:val])
– Batch
WRITE operation
– Commit log for durability: configurable fsync, sequential writes only
– MemTable: no disk access (no reads or seeks)
– SSTables are final: read-only, with indexes
– Always writable
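A rough sketch of the write path just described (commit log append, in-memory MemTable, flush to an immutable SSTable); class and method names and the flush threshold are illustrative, not Cassandra code.

```python
class Store:
    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit
        self.commit_log = []                    # sequential, append-only writes
        self.memtable = {}                      # (key, col) -> value
        self.sstables = []                      # immutable, sorted tables (oldest first)

    def put(self, key, col, value):
        self.commit_log.append((key, col, value))    # durability first
        self.memtable[(key, col)] = value            # in-memory update, no disk reads
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # SSTables are final: written once, then read-only.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

store = Store()
store.put("userid1", "Tel", "123-4567")
```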
READ
Interface
– get(key, column)
– get_slice(key, SlicePredicate)
– get_range_slices(KeyRange, SlicePredicate)
READ operation
– Practically lock-free
– SSTable proliferation
– Row cache
– Key cache
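An illustrative read path matching the write-path sketch above: check the MemTable first, then SSTables from newest to oldest, returning the most recently written value (row/key caches omitted). Not Cassandra code.

```python
def get(memtable, sstables, key, col):
    """memtable: dict of (key, col) -> value; sstables: list of such dicts, oldest first."""
    if (key, col) in memtable:
        return memtable[(key, col)]
    for sstable in reversed(sstables):          # newest SSTable first
        if (key, col) in sstable:
            return sstable[(key, col)]
    return None

print(get({("userid1", "Tel"): "123-4567"}, [], "userid1", "Tel"))   # -> 123-4567
```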
Consistency Level
Tuning the consistency level for each WRITE/READ operation
Write operation
– ZERO: Hail Mary
– ANY: 1 replica
– ONE: 1 replica
– QUORUM: (N/2)+1 replicas
– ALL: all replicas
Read operation
– ZERO: N/A
– ANY: N/A
– ONE: 1 replica
– QUORUM: (N/2)+1 replicas
– ALL: all replicas
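A sketch of how each level in the table above maps to the number of replica acknowledgements required, given N replicas (illustrative only):

```python
def required_acks(level: str, n: int) -> int:
    return {
        "ZERO": 0,              # fire-and-forget (write path only)
        "ANY": 1,
        "ONE": 1,
        "QUORUM": n // 2 + 1,
        "ALL": n,
    }[level]

print(required_acks("QUORUM", 3))   # -> 2
print(required_acks("ALL", 3))      # -> 3
```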
Performance Benchmark
Random and sequential writes: limited by bandwidth
Facebook Inbox Search
– Two kinds of search: term search and interactions
– 50+ TB on a 150-node cluster
Latency Stat    Search Interactions    Term Search
Min             7.69 ms                7.78 ms
Median          15.69 ms               18.27 ms
Max             26.13 ms               44.41 ms
vs MySQL with 50GB Data
MySQL: ~300 ms write, ~350 ms read
Cassandra: ~0.12 ms write, ~15 ms read
Case Study
Cassandra as the primary data store
– Datacenter- and rack-aware replication
– ~1,000,000 ops/s
– High sharding and low replication
Inbox Search
– 100 TB
– 5,000,000,000 writes per day
Conclusions
Cassandra
– Scalability
– High performance
– Wide applicability
Future work
– Compression
– Atomicity
– Secondary indexes