Transcript 2011SS-03

Cassandra – A Decentralized Structured Storage System
A. Lakshman1, P. Malik1
1Facebook
SIGOPS '10
2011. 03. 18.
Summarized and Presented by Sang-il Song, IDS Lab., Seoul National University
The Rise of NoSQL

Eric Evans, a Rackspace employee, reintroduced the term NoSQL in early 2009 when Johan Oskarsson of Last.fm wanted to organize an event to discuss open-source distributed databases.

The name attempted to label the emergence of a growing number of distributed data stores that often did not attempt to provide ACID guarantees.

Refer to http://www.google.com/trends?q=nosql
NoSQL Database
 Based on Key-Value
   memcached, Dynamo, Voldemort, Tokyo Cabinet
 Based on Column
   Google BigTable, Cloudata, HBase, Hypertable, Cassandra
 Based on Document
   MongoDB, CouchDB
 Based on Graph
   Neo4j, FlockDB, InfiniteGraph
[Figure: Visual Guide to NoSQL Systems – refer to http://blog.nahurst.com/visual-guide-to-nosql-sy]
Contents
 Introduction
   Remind: Dynamo
   Cassandra
 Data Model
 System Architecture
   Partitioning
   Replication
   Membership
   Bootstrapping
 Operations
   WRITE
   READ
   Consistency Level
 Performance Benchmark
 Case Study
 Conclusion
Remind: Dynamo
 Distributed Hash Table
 BASE
   Basically Available
   Soft-state
   Eventually Consistent
 Client-tunable consistency/availability
NRW Configuration
W=N, R=1  →  read-optimized strong consistency
W=1, R=N  →  write-optimized strong consistency
W+R ≤ N   →  weak eventual consistency
W+R > N   →  strong consistency
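To make the arithmetic concrete, here is a minimal sketch (illustrative only, not Dynamo's API) of how the R+W vs. N comparison decides the guarantee:

```python
# Classify an NRW configuration (illustrative sketch, not Dynamo's API).
def classify_nrw(n: int, r: int, w: int) -> str:
    """n = replicas, r = read quorum size, w = write quorum size."""
    if r + w > n:
        # Every read quorum overlaps every write quorum, so a read
        # always sees at least one copy of the latest acknowledged write.
        return "strong consistency"
    # Quorums can miss each other; stale reads are possible until
    # replicas converge in the background.
    return "weak eventual consistency"

assert classify_nrw(n=3, r=1, w=3) == "strong consistency"         # W=N, R=1: read-optimized
assert classify_nrw(n=3, r=3, w=1) == "strong consistency"         # W=1, R=N: write-optimized
assert classify_nrw(n=3, r=1, w=1) == "weak eventual consistency"  # W+R ≤ N
```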
Cassandra
 Dynamo-BigTable lovechild
   Column-based data model
   Distributed Hash Table
   Tunable tradeoff
    – Consistency vs. Latency
 Properties
   No single point of failure
   Linearly scalable
   Flexible partitioning, replica placement
   High availability (eventual consistency)
Data Model
 Cluster
 A keyspace corresponds to a database or table space
 A column family corresponds to a table
 A column is the unit of data stored in Cassandra
Column Family "User":
Row Key "userid1" → (name: Username, value: uname1), (name: Email, value: [email protected]), (name: Tel, value: 123-4567)
Row Key "userid2" → (name: Username, value: uname2), (name: Email, value: [email protected]), (name: Tel, value: 123-4568)
Row Key "userid3" → (name: Username, value: uname3), (name: Email, value: [email protected]), (name: Tel, value: 123-4569)

Column Family "Article":
(name: ArticleId, value: userid2-1), (name: ArticleId, value: userid2-2), (name: ArticleId, value: userid2-3)
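As a rough mental model (not Cassandra's API; the keyspace name and email values below are made up, and real columns also carry a timestamp), the hierarchy can be pictured as nested maps:

```python
# Illustrative only: cluster -> keyspace -> column family -> row key -> columns.
cluster = {
    "MyKeyspace": {                    # keyspace ~ database / table space (name assumed)
        "User": {                      # column family ~ table
            "userid1": {"Username": "uname1", "Email": "uname1@example.com", "Tel": "123-4567"},
            "userid2": {"Username": "uname2", "Email": "uname2@example.com", "Tel": "123-4568"},
        },
        "Article": {                   # second column family in the same keyspace
            "userid2": {"ArticleId": "userid2-1"},
        },
    }
}
# A column is the smallest unit: a (name, value) pair inside a row.
print(cluster["MyKeyspace"]["User"]["userid1"]["Tel"])  # 123-4567
```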
System Architecture
 Partitioning
 Replication
 Membership
 Bootstrapping
Partitioning Algorithm
 Distributed Hash Table
   Data and servers are located in the same address space
   Consistent Hashing
   Key space partition: arrangement of the keys
   Overlay networking: routing mechanism
[Figure: consistent hashing ring – hash(key1) is placed on the ring of nodes N1, N2, N3 (addresses run from low to high); walking clockwise, N2 is deemed the coordinator of key1]
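A minimal sketch of that ring lookup, assuming MD5 as the hash function and ignoring virtual nodes and routing (real Cassandra uses tokens and a pluggable partitioner):

```python
import bisect
import hashlib

def h(s: str) -> int:
    # Hash nodes and keys into the same address space (MD5 assumed here).
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Sorted (position, node) pairs form the ring.
        self.ring = sorted((h(n), n) for n in nodes)

    def coordinator(self, key: str) -> str:
        # First node at or after hash(key), wrapping around, coordinates the key.
        pos = bisect.bisect_left(self.ring, (h(key), ""))
        return self.ring[pos % len(self.ring)][1]

ring = Ring(["N1", "N2", "N3"])
print(ring.coordinator("key1"))  # whichever node follows hash("key1") on the ring
```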
Partitioning Algorithm (cont'd)
 Challenges
   Non-uniform data and load distribution
   Oblivious to the heterogeneity in the performance of nodes
 Solutions
   Nodes get assigned to multiple positions in the circle (like Dynamo)
   Analyze load information on the ring and have lightly loaded nodes move on the ring to alleviate heavily loaded nodes (like Cassandra)
[Figure: ring diagrams illustrating the two approaches – nodes appearing at multiple positions vs. nodes repositioned to balance load]
Replication
 RackUnaware
[Figure: ring of nodes A–J showing the coordinator of data1; under RackUnaware, replicas are placed on the nodes that follow the coordinator on the ring – see the placement sketch after this list]
 RackAware
 DatacenterShard
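A toy sketch of RackUnaware placement (function and variable names invented): the coordinator plus the next N-1 nodes clockwise hold the replicas, while the rack- and datacenter-aware strategies instead skip along the ring to cross racks or datacenters.

```python
# Sketch: RackUnaware replica placement on a ring (illustrative only).
def replicas(ring_nodes, coordinator_index, n):
    """ring_nodes: nodes in ring order; returns the n nodes storing the key."""
    return [ring_nodes[(coordinator_index + i) % len(ring_nodes)]
            for i in range(n)]

nodes = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
print(replicas(nodes, nodes.index("B"), 3))  # ['B', 'C', 'D']
```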
Cluster Membership
 A gossip protocol is used for cluster membership
 Super lightweight, with mathematically provable properties
 State is disseminated in O(log N) rounds
 Every T seconds, each member increments its heartbeat counter and selects one other member to send its list to
 A member merges the received list with its own list
Gossip Protocol
[Figure: gossip example over rounds t1–t6 – each server keeps a list of (server, heartbeat) entries such as "server1: t4, server2: t2, server3: t5"; at each round a server increments its own heartbeat and sends its list to a peer, which merges it into its own list, keeping the freshest entry per server]
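A sketch of one such exchange (data layout and names are illustrative): each node tracks one heartbeat per member, and a merge keeps the freshest value.

```python
# One gossip round between two members (illustrative sketch).
def gossip_round(sender, receiver, now):
    sender["heartbeats"][sender["name"]] = now  # sender bumps its own heartbeat
    for member, hb in sender["heartbeats"].items():
        # Receiver merges the incoming list, keeping the freshest entry.
        if hb > receiver["heartbeats"].get(member, -1):
            receiver["heartbeats"][member] = hb

s1 = {"name": "server1", "heartbeats": {"server1": 1}}
s2 = {"name": "server2", "heartbeats": {"server2": 2}}
gossip_round(s1, s2, now=4)
print(s2["heartbeats"])  # {'server2': 2, 'server1': 4}
```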
Accrual Failure Detector
 Valuable for system management, replication, and load balancing
 Designed to adapt to changing network conditions
 The output value, PHI, represents a suspicion level
 Applications set an appropriate threshold, trigger suspicions, and perform appropriate actions
 In Cassandra the average time taken to detect a failure is 10–15 seconds with the PHI threshold set at 5
Φ(t_now) = -log10( P_later(t_now - t_last) ), where P_later(t) = e^(-λt)

(P_later(t) is the probability that the next heartbeat arrives more than t after the last one, so Φ grows as the silence since t_last lengthens.)
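A toy version of the Φ computation, assuming exponentially distributed heartbeat inter-arrivals with rate λ estimated from observed arrivals (the rate value below is made up):

```python
import math

def phi(t_now: float, t_last: float, lam: float) -> float:
    # P_later(t) = e^(-lam*t): chance the next heartbeat is still on its way.
    t = t_now - t_last
    return -math.log10(math.exp(-lam * t))  # = lam * t * log10(e)

# Suspicion grows linearly with silence; declaring failure once PHI crosses
# a threshold (Cassandra uses 5) yields detection in roughly 10-15 s here.
lam = 1.0  # ~1 heartbeat per second on average (assumed)
for silence in (1.0, 5.0, 12.0):
    print(silence, round(phi(silence, 0.0, lam), 2))  # 0.43, 2.17, 5.21
```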
Bootstrapping
 A new node gets assigned a token such that it can alleviate a heavily loaded node
[Figure: a new node joins the ring at a position that splits the range of a heavily loaded node]
WRITE
 Interface
   Simple: put(key, col, value)
   Complex: put(key, [col:val, …, col:val])
   Batch
 WRITE Operation
   Commit log for durability
    – Configurable fsync
    – Sequential writes only
   MemTable
    – No disk access (no reads or seeks)
   SSTables are final
    – Read-only
    – Indexes
   Always writable
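A minimal sketch of this write path (class shape and flush threshold invented for illustration; real SSTables live on disk with indexes and bloom filters):

```python
# Illustrative write path: commit log -> MemTable -> immutable SSTable.
class Node:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only, sequential writes for durability
        self.memtable = {}     # in-memory: no disk reads or seeks on write
        self.sstables = []     # flushed MemTables, read-only once written
        self.limit = memtable_limit

    def put(self, key, col, value):
        self.commit_log.append((key, col, value))      # durability first
        self.memtable.setdefault(key, {})[col] = value
        if len(self.memtable) >= self.limit:           # flush when full
            self.sstables.append(self.memtable)        # becomes immutable
            self.memtable = {}

node = Node()
node.put("userid1", "Tel", "123-4567")
```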
READ
 Interface
   get(key, column)
   get_slice(key, SlicePredicate)
   get_range_slices(keyRange, SlicePredicate)
 READ Operation
   Practically lock-free
   SSTable proliferation
   Row cache
   Key cache
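Continuing the write-path sketch above, a read checks the MemTable first and then SSTables from newest to oldest, which is why SSTable proliferation slows reads until compaction merges files (illustrative only; the row and key caches are not modeled):

```python
def get(node, key, col):
    # MemTable holds the freshest data.
    row = node.memtable.get(key, {})
    if col in row:
        return row[col]
    # Otherwise scan SSTables newest-to-oldest; more SSTables, more lookups.
    for sstable in reversed(node.sstables):
        if col in sstable.get(key, {}):
            return sstable[key][col]
    return None

print(get(node, "userid1", "Tel"))  # 123-4567
```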
Consistency Level
 Tuning the consistency level for each WRITE/READ operation
Write Operation                   Read Operation
Level    Description              Level    Description
ZERO     Hail Mary                ZERO     N/A
ANY      1 replica                ANY      N/A
ONE      1 replica                ONE      1 replica
QUORUM   (N/2)+1                  QUORUM   (N/2)+1
ALL      All replicas             ALL      All replicas
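Read as "acknowledgements required before the operation succeeds", the write-side table becomes a small lookup (sketch only; N is the replication factor, and ANY is counted as one ack that hinted handoff may satisfy):

```python
def required_acks(level: str, n: int) -> int:
    # Acks a write coordinator waits for at each consistency level (sketch).
    return {
        "ZERO": 0,            # fire-and-forget ("Hail Mary")
        "ANY": 1,             # one node; hinted handoff counts
        "ONE": 1,
        "QUORUM": n // 2 + 1,
        "ALL": n,
    }[level]

print(required_acks("QUORUM", 3))  # 2
```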
Performance Benchmark
 Random and Sequential Writes
   Limited by bandwidth
 Facebook Inbox Search
   Two kinds of search
    – Term Search
    – Interactions
   50+ TB on a 150-node cluster
Latency Stat    Search Interactions    Term Search
Min             7.69 ms                7.78 ms
Median          15.69 ms               18.27 ms
Max             26.13 ms               44.41 ms
vs. MySQL with 50 GB of Data
 MySQL
   ~300 ms write
   ~350 ms read
 Cassandra
   ~0.12 ms write
   ~15 ms read
Case Study
 Cassandra as primary data store
 Datacenter- and rack-aware replication
 ~1,000,000 ops/s
 High sharding and low replication
 Inbox Search
   100 TB
   5,000,000,000 writes per day
Conclusions
 Cassandra
   Scalability
   High performance
   Wide applicability
 Future work
   Compression
   Atomicity
   Secondary indexes