.NET and NoSQL Introducing Cassandra


.NET and NoSQL
Introducing Cassandra
John Zablocki
Development Manager, HealthcareSource
Organizer, Beantown ALT.NET
Beantown ALT.NET
2011-10-26



New England Code Camp – 10/29/2011
WP7 Location @ Dev Boston Meetup –
11/3/2011
DDD w/ Steve Bohlen @ Beantown ALT.NET –
11/28/2011
Shameless Plugs






NoSQL Overview
Cassandra Basic Concepts
Cassandra Data Model
Client API
Cassandra and .NET
Questions?
Agenda
NoSQL
Not Only SQL




Coined in 1998 by Carlo Strozzi to describe a
database that did not expose a SQL interface
In 2009, Eric Evans reintroduced the term to
describe the growing non-RDBMS movement
Broadly refers to a set of data stores that do not
use SQL or a relational data model
Popularized by large web presences such as
Google, Facebook and Amazon
What is NoSQL?
NoSQL Databases

NoSQL databases come in a variety of flavors *







XML (myXMLDB, Tamino, Sedna)
Tabular (HBase, Bigtable)
Key/Value (Redis, Memcached with BerkeleyDB)
Object (db4o, JADE)
Graph (Trinity, neo4j, InfoGrid)
Document store (CouchDB, MongoDB)
Eventually Consistent Key/Value Store
(Cassandra, Dynamo)
* loose taxonomies
NoSQL Databases
Why NoSQL?




RDBMS Administrators are highly paid
Highly paid individuals often buy larger than
average homes or cars
Larger than average homes and cars require
more energy than smaller homes and cars
Therefore RDBMSs contribute to global
warming more than NoSQL databases, which
typically do not require the addition of a DBA
RDBMSs and the Environment




RDBMSs often require high-end servers and
are taxing on disks
High-end servers consume more electricity than
mid-range servers
Taxed disks fail more often than untaxed disks
Therefore RDBMSs require more energy and
produce more waste (lots of hard drives in
landfills) than NoSQL DBs, which run on mid-range servers.
Even More Why NoSQL?




The current healthcare crisis requires talented
software engineers to fix the outdated or nonexistent IT systems of the hospital system
Talented software engineers spend a great deal
of time mapping objects to tables in RDBMSs
Talented software engineers are unable to fix
healthcare because they are too busy mapping
objects to tables
Therefore RDBMSs are causing illnesses
NoSQL and Healthcare

Three disruptive technologies you should be
paying attention to today…



NoSQL databases and big data technologies – especially MongoDB, CouchDB, Cassandra,
HBase, MapReduce and Hadoop
Evented I/O Web Servers – especially Node.js
and to a lesser extent Tornado
Functional programming languages – especially
Scala, F# and Erlang
Please Pardon the Interruption…
Introducing Cassandra


Open source, Apache supported project
Originally written by Facebook for Inbox search feature





Written in Java. Yes, Java.
Column-oriented with row-oriented properties
Schemaless
Data stored in sparse, multidimensional hashtables






FB now uses a proprietary fork
Sparse meaning that rows may have one or more columns, and need not all have the same ones
Distributed and Decentralized
Highly Available and Fault Tolerant
Elastic Scalability
Tunable Consistency
MapReduce via Hadoop
About Cassandra

Column-Oriented

Content stored by column, rather than by row:






1,2,3;
Smith,Jones,Johnson;
Joe,Mary,Cathy;
40000,50000,44000;
More efficient when an aggregate needs to be computed
over many rows
More efficient when writing new values for a column to
all rows at once
Better compression is possible, due to the fact that modern
compression schemes make use of the similarity of
adjacent data (column data is uniform)
Less efficient for multi-column reads
Less efficient for multi-column writes
Column-Oriented
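A minimal sketch of the trade-off described above, using the sample values from the slide. The field names (id, last_name, first_name, salary) and the function names are assumptions made for illustration; nothing here reflects how Cassandra lays out data on disk.

```python
# Hypothetical employee records built from the sample values on the slide.
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]

# Row-oriented layout: each record is stored together.
row_store = list(rows)

# Column-oriented layout: each column is stored together.
COLUMN_NAMES = ("id", "last_name", "first_name", "salary")
column_store = {
    name: [record[position] for record in rows]
    for position, name in enumerate(COLUMN_NAMES)
}


def total_salary_row_store(store):
    # Every full record is visited just to read one field.
    return sum(record[3] for record in store)


def total_salary_column_store(store):
    # The aggregate reads a single list of uniform values.
    return sum(store["salary"])


def read_record_column_store(store, index):
    # A multi-column read has to visit every column list.
    return tuple(store[name][index] for name in COLUMN_NAMES)
```

The aggregate touches one list in the column layout, while reassembling a single record touches all four, which is the efficiency trade-off the slide lists.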

Cassandra is meant to run on multiple nodes


Single node is possible, but Cassandra’s benefits
will not be realized
Every node is identical


No Master/Slave
Peer-to-peer protocol keeps data in sync (gossip)
Distributed and Decentralized






Periodic, pairwise interactions
Bounded size information exchange
One agent changes the state of another
Reliable communication is not assumed
Low frequency of interactions to minimize
protocol costs
Some form of randomness in peer selection
Gossip Protocol
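The properties above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the function names, the (version, value) state and the fixed random seed are all hypothetical, every exchange is assumed to succeed, and it is not Cassandra's gossiper.

```python
import random


def gossip_round(states, rng):
    """One periodic round: every node contacts one randomly chosen peer."""
    nodes = list(states)
    for node in nodes:
        peer = rng.choice([n for n in nodes if n != node])
        # Bounded exchange: only one (version, value) pair is compared,
        # and whichever side is older adopts the newer state.
        if states[node][0] > states[peer][0]:
            states[peer] = states[node]
        elif states[peer][0] > states[node][0]:
            states[node] = states[peer]


def rounds_until_converged(node_count, seed=0, max_rounds=1000):
    """Count rounds until every node holds the update that node 0 started with."""
    rng = random.Random(seed)
    states = {i: (0, "old") for i in range(node_count)}
    states[0] = (1, "new")  # a single node learns something new
    for round_number in range(1, max_rounds + 1):
        gossip_round(states, rng)
        if all(state == (1, "new") for state in states.values()):
            return round_number
    return None
```

Calling rounds_until_converged(10) reports how many pairwise rounds it took for one node's update to reach all ten nodes without any coordinator.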

Vertical Scaling



Horizontal Scaling (Clustering)



Throw hardware at the problem
More memory, faster CPU, etc.
Add more machines
Possibly partition the data across machines
Elastic Scaling



Horizontal cluster that can scale up and scale down
seamlessly
New nodes can be brought online and begin
serving requests with partial data
New nodes come online without service disruption
Elastic Scalability


Consistency - ensures transactions move a database
from one consistent state to another
Cassandra supports tunable consistency

Strict (sequential) consistency – all nodes see all
writes in the same order
A read always returns the most recent write
Causal consistency – potentially causally related
operations seen by all nodes in the same order
Concurrent writes are not causally related
Timestamps used to determine the cause of events
Weak (eventual) consistency – all updates will
propagate to all nodes, but not immediately
See Eric Brewer’s CAP Theorem
Tunable Consistency
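One way to picture "tunable" is as simple arithmetic over replicas. The sketch below assumes the consistency level names ONE, QUORUM and ALL (Cassandra levels not listed on the slide) and uses the common rule of thumb that a read is guaranteed to overlap the latest write when the replicas read plus the replicas written exceed the replication factor. The function names are hypothetical.

```python
def replicas_required(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def read_sees_latest_write(read_level, write_level, replication_factor):
    """True when the read set and the write set must share at least one replica."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor
```

With a replication factor of 3, QUORUM reads and QUORUM writes overlap, while ONE and ONE trade that guarantee for lower latency and higher availability.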

Large-scale distributed systems have three
competing requirements




Consistency – all nodes see the same data at the
same time
Availability – All clients will always be able to read
and write data and all requests will receive a
response of success or failure
Partition Tolerance – The system will continue to
function, even in the face of network segmentation
failures
Theorem states that a distributed system can satisfy
only 2 of these 3 properties at the same time
Brewer’s CAP Theorem

Consistency and Availability
 Two-phase commit for distributed transactions


Consistency and Partition Tolerance



System blocks on a network partition
Pessimistic locking
Node failure hinders availability
Availability and Partition Tolerance



System always returns data, even if inaccurate
Optimistic locking
DNS, web caching
Brewer’s CAP Theorem

Clusters (rings)




Set of nodes that appear as a single server
Single node is still a cluster
Container for keyspaces
Keyspaces


Analogous to a relational database
Has name and set of attributes to define keyspace-wide behavior



Replication factor (number of nodes that will hold a copy of each row)
Replica placement strategy (how rows are copied)
Column Families
Cassandra Data Model
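To make the ring and replication factor concrete, here is a small sketch of one possible placement rule: hash the row key to a token, find the first node clockwise from it, then copy to the next nodes in ring order. The Ring class, the node names and the MD5-based token function are assumptions for illustration, not Cassandra's implementation.

```python
import bisect
import hashlib


def token_for(value):
    # MD5-derived integer token; the hash choice is an assumption of this sketch.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, node_names, replication_factor):
        if replication_factor > len(node_names):
            raise ValueError("replication factor exceeds node count")
        self.replication_factor = replication_factor
        # Nodes are ordered around the ring by their tokens.
        self.ring = sorted((token_for(name), name) for name in node_names)
        self.tokens = [token for token, _ in self.ring]

    def replicas_for(self, row_key):
        # First node clockwise from the key's token, wrapping past the end,
        # followed by the next nodes in ring order.
        start = bisect.bisect_left(self.tokens, token_for(row_key)) % len(self.ring)
        return [
            self.ring[(start + offset) % len(self.ring)][1]
            for offset in range(self.replication_factor)
        ]
```

For example, Ring(["node-a", "node-b", "node-c", "node-d"], 3).replicas_for("Goodfellas") returns three distinct node names, and the same key always maps to the same three.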

Column Families



Analogous to a relational table
Container for an ordered collection of rows
Columns



Basic data structure in Cassandra
Consists of a name, value and clock (timestamp)
Defined with a key name sorting rule (ascii, integer, etc.)




Value sorting is not possible
Names and values stored as Java byte arrays
May be indexed for queries
Super Columns



A special column with values that are maps of subcolumns
(standard columns)
Single level of nesting only
Subcolumns are not indexed – read a supercolumn and all of
its columns are read as well
Cassandra Data Model
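The nesting described above can be sketched with plain dictionaries: a keyspace holds column families, a column family holds rows, and each row maps column names to a value and a timestamp. The ColumnFamily class and its method names are hypothetical, and the "newest timestamp wins" rule is shown in its simplest form.

```python
class ColumnFamily:
    def __init__(self):
        self.rows = {}  # row key -> {column name: (value, timestamp)}

    def set(self, row_key, column_name, value, timestamp):
        row = self.rows.setdefault(row_key, {})
        current = row.get(column_name)
        # The clock resolves conflicts: a write with an older timestamp is ignored.
        if current is None or timestamp >= current[1]:
            row[column_name] = (value, timestamp)

    def get(self, row_key, column_name=None):
        row = self.rows.get(row_key, {})
        if column_name is not None:
            return row[column_name][0]
        # Columns come back ordered by name; sorting by value is not offered.
        return [(name, row[name][0]) for name in sorted(row)]


# A keyspace is a named container of column families.
keyspace = {"movies": ColumnFamily()}
movies = keyspace["movies"]
movies.set("Goodfellas", "Year", 1990, timestamp=1)
movies.set("Goodfellas", "Genre", "Drama", timestamp=2)
movies.set("Goodfellas", "Genre", "Crime", timestamp=1)  # stale write, ignored
```

Different rows may carry entirely different column names, which is what makes the structure sparse and schemaless.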


System keyspace stores metadata about the cluster,
similar to the master db in SQL Server
Peer-to-peer distribution model where behavior of
each node is identical (no Master/Slave)



Gossip protocol where gossiper runs every second
on a timer


New node added to cluster without disruption
Accepts requests only after learning topology
Each node has information about the others
Anti-entropy is the replica synchronization
mechanism in Cassandra

Nodes exchange hashes of column family data in
order to determine whether read-repair is needed
Cassandra Architecture
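The hash exchange can be sketched as follows. This simplified version compares a single digest per replica and then reconciles column by column using timestamps; the function names, the SHA-256 choice and the dictionary layout are assumptions, and a real cluster exchanges far more compact summaries than this.

```python
import hashlib


def digest(replica_rows):
    """Hash a replica's rows in a stable order so equal data gives equal digests."""
    hasher = hashlib.sha256()
    for row_key in sorted(replica_rows):
        for column_name, (value, timestamp) in sorted(replica_rows[row_key].items()):
            hasher.update(repr((row_key, column_name, value, timestamp)).encode("utf-8"))
    return hasher.hexdigest()


def needs_repair(replica_a, replica_b):
    # Only the digests need to cross the network to detect a difference.
    return digest(replica_a) != digest(replica_b)


def repair(replica_a, replica_b):
    """Bring both replicas to the newest version of every column."""
    for source, target in ((replica_a, replica_b), (replica_b, replica_a)):
        for row_key, columns in source.items():
            for column_name, (value, timestamp) in columns.items():
                current = target.setdefault(row_key, {}).get(column_name)
                if current is None or timestamp > current[1]:
                    target[row_key][column_name] = (value, timestamp)
```

Two replicas that disagree produce different digests; after repair both hold the newest timestamped value for each column and the digests match again.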



Writes are immediately written to a commit log
and subsequently written to an in-memory
store called the memtable
At a specified threshold objects in the
memtable are flushed to disk to an immutable
structure called a sorted string table (SSTable)
Hinted handoffs allow nodes to receive a write
intended for another node if that other node
goes offline. The hint tells the receiving node
to update the offline node when back online
Cassandra Architecture
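A toy version of the write path described above: append to a commit log, update the memtable, and flush to an immutable sorted table once a threshold is reached. The Node class, the entry-count threshold and the read method are assumptions added for illustration; the slide does not describe the read path.

```python
class Node:
    def __init__(self, flush_threshold=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory store
        self.sstables = []     # immutable, sorted tables flushed from memory
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # logged first for durability
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # An SSTable is written once, sorted by key, and never modified.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Hypothetical read path: memtable first, then SSTables newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for stored_key, value in sstable:
                if stored_key == key:
                    return value
        return None
```

Because flushed tables are never rewritten in place, an update to an existing key simply lands in a newer structure that is consulted first.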

Compaction is the operation of merging SSTables
Keys are merged
Columns are combined
Tombstones are discarded
New index created
Merged data are sorted
Bloom filters are used to reduce disk access
Fast probabilistic structures for testing whether an element is
a member of a set (false positives are possible, false negatives are not)
Tombstones are deletion markers on records
All delete commands in Cassandra are soft deletes
Cassandra Architecture
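The compaction steps can be sketched in a few lines. This is a simplification with hypothetical names: tables are tuples of (key, value) pairs, a sentinel object stands in for a tombstone, and tombstones are dropped immediately, whereas a real cluster keeps them for a grace period first.

```python
TOMBSTONE = object()  # deletion marker written by a soft delete


def compact(sstables):
    """Merge SSTables (given oldest first) into one sorted table plus a key index."""
    merged = {}
    for sstable in sstables:
        for key, value in sstable:
            merged[key] = value  # entries from newer tables overwrite older ones
    # Discard tombstoned keys, then sort what is left by key.
    live = sorted(
        (key, value) for key, value in merged.items() if value is not TOMBSTONE
    )
    # Build a fresh index from key to position in the merged table.
    index = {key: position for position, (key, _) in enumerate(live)}
    return tuple(live), index
```

A delete therefore costs only a marker write at first; the space is reclaimed later, when compaction rewrites the data without the deleted keys.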
Using Cassandra
The Windows Experience






Install the Java 1.6 (or later) SDK
Set environment variable JAVA_HOME to
the install path of the JDK
Download the binaries from
http://cassandra.apache.org/download/
Unzip to Program Files (x86) or some other
directory, optionally set PATH
Set environment variable
CASSANDRA_HOME to directory above
In command line, navigate to bin under
CASSANDRA_HOME and run cassandra
Installing Cassandra on Windows





Command line interface
Navigate to bin, under CASSANDRA_HOME
and run cassandra-cli
Generally useful for development, but not
meant to be a full-blown client
Allows for basic administration (creating
keyspaces, column management, etc.)
Commands must be terminated with a ;
Cassandra-CLI

Connect to a server
connect localhost/9160;
Connect to a server at CLI start
cassandra-cli localhost/9160
System information commands



show cluster name;
show keyspaces;
show api version;
Cassandra-CLI

Create a keyspace
create keyspace BeantownAltNet;
Switch to keyspace
use BeantownAltNet;
Create a column family
create column family movies with
comparator=UTF8Type and
key_validation_class=UTF8Type;
View information about column family
describe keyspace BeantownAltNet;
Cassandra-CLI

See this JIRA issue and then run (v0.8):
assume movies keys as ascii;
Add a row of data
set movies['Goodfellas']['Genre'] = 'Drama';
set movies['Goodfellas']['Year'] = 1990;
Count the columns
count movies['Goodfellas'];
Get the row and column
get movies['Goodfellas'];
get movies['Goodfellas']['Genre'];
Create an index on Genre
update column family movies with
column_metadata=[{column_name:Genre,
index_type:0, index_name:IdxGenre,
validation_class:UTF8Type}];
Query by genre
get movies where Genre = 'Drama';
Remove a column
del movies['Goodfellas']['Year'];
Remove a row
del movies['Goodfellas'];
Cassandra-CLI





Used for Cassandra’s client API
Effectively an RPC serialization mechanism
Software framework for scalable, cross-language
services development
Combines software stack with code generation to
build services
Support for C++, Java, Python, PHP, Erlang, Perl,
Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk
and OCaml
struct UserProfile {
1: i32 uid,
2: string name,
3: string blurb
}
service UserStorage {
void store(1: UserProfile user),
UserProfile retrieve(1: i32 uid)
}
Thrift
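To give a feel for what the IDL above describes, here is a hand-written Python stand-in for the struct and the service. It is not output of the Thrift compiler and does not use the Thrift library; the InMemoryUserStorage class and the sample profile values are hypothetical and only mirror the shape of UserProfile and UserStorage.

```python
from dataclasses import dataclass


@dataclass
class UserProfile:
    # Mirrors the three numbered fields of the IDL struct.
    uid: int
    name: str
    blurb: str


class InMemoryUserStorage:
    """Hypothetical handler offering the two methods of the UserStorage service."""

    def __init__(self):
        self._profiles = {}

    def store(self, user):
        # Corresponds to: void store(1: UserProfile user)
        self._profiles[user.uid] = user

    def retrieve(self, uid):
        # Corresponds to: UserProfile retrieve(1: i32 uid)
        return self._profiles[uid]
```

In a real deployment the generated code would serialize the struct and carry the call over the network, while a handler along these lines would supply the behavior on the server side.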







CQL is a DSL similar to SQL, meant to better abstract the
details of the server operations from the clients (still
requires Thrift)
Currently, CQL drivers exist only for Java and Python
CREATE KEYSPACE BeantownAltNet with
replication_factor=1;
CREATE COLUMNFAMILY movies (
key VARCHAR PRIMARY KEY,
genre VARCHAR,
year INT);
INSERT INTO movies (key, genre, year) VALUES
('Zoolander', 'Comedy', 2001);
SELECT key, genre, year FROM movies;
SELECT key, genre, year FROM movies WHERE
genre='Comedy';
Cassandra Query Language (CQL)


Command line CQL tool that ships with the Python
CQL driver
Windows installation




Grab the precompiled windows Thrift binaries for
Python and copy to site-packages
http://www.dreamcubes.com/b2/softwaredevelopment/20/thrift-with-python-on-win32/
Download cassandra-dbapi2 from
http://code.google.com/a/apacheextras.org/p/cassandra-dbapi2/source/checkout
and run - setup.py install
easy_install pyreadline
Run - python cqlsh localhost 9160
CQLSH




CREATE KEYSPACE foo WITH
strategy_class='SimpleStrategy' AND
strategy_options:replication_factor=1;
CREATE COLUMNFAMILY users (key
VARCHAR PRIMARY KEY, nickname
VARCHAR);
INSERT INTO users (key, nickname) VALUES
('jzablocki', 'zblock');
SELECT * FROM users;
CQLSH
.NET and
Cassandra
The Client Libraries

Currently, there are three well-maintained,
community-sponsored client libraries




Cassandra-Sharp - http://code.google.com/p/cassandra-sharp/
Aquiles - http://aquiles.codeplex.com/
FluentCassandra - https://github.com/managedfusion/fluentcassandra
No official Apache client
.NET Client Libraries




Configured in App/Web.config
Simple API over most common Thrift calls
Additional support for Cassandra commands
via Execute method and Client class
Support for executing CQL
Cassandra-Sharp
https://bitbucket.org/johnzablocki/codevoyeursamples/src/1e2aeb969518/src/PresentationSamples/NoSQLAndDotNet/CassandraQuickStartCassandraSharp
Cassandra-Sharp Demo



Configured in App/Web.config
Simple wrapper over most common Thrift calls
No direct support for executing CQL (though
an internal class does have CQL execution)
Aquiles

https://bitbucket.org/johnzablocki/codevoyeursamples/src/1e2aeb969518/src/PresentationSamples/NoSQLAndDotNet/CassandraQuickStartAquiles
Aquiles Demo





Intended to be an idiomatic .NET Cassandra
framework (i.e., more like .NET than Java)
Makes use of the .NET 4.0 dynamic feature
Raw Thrift commands are abstracted
No current support for CQL
Developed by Nick Berardi
FluentCassandra

https://bitbucket.org/johnzablocki/codevoyeursamples/src/1e2aeb969518/src/PresentationSamples/NoSQLAndDotNet/CassandraQuickStart
FluentCassandra Demo
Non-Relational
Design
Codd is Dead

Materialized View
Store redundant data for more efficient queries
MovieGenres['Drama']['Goodfellas'] = null;
MovieGenres['Drama']['Casino'] = null;
Valueless Column
All data necessary to satisfy a query is in the
column name. No value needed (see above)
Aggregate Key
Combine values with a delimiter to create a
composite key
ZipCodes['Wethersfield:CT'] = '06109';
ZipCodes['Cambridge:MA'] = '02140';
Design Patterns
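The three patterns can be tried out with ordinary dictionaries standing in for column families. The helper names (add_movie, titles_in, aggregate_key) are hypothetical; the data values are the ones from the slide.

```python
# Materialized view with valueless columns: one row per genre, and the
# column names themselves (movie titles) carry all the data.
movie_genres = {}


def add_movie(genre, title):
    movie_genres.setdefault(genre, {})[title] = None  # no value needed


def titles_in(genre):
    # The query is answered from column names alone.
    return sorted(movie_genres.get(genre, {}))


# Aggregate key: several values joined with a delimiter form one row key.
zip_codes = {}


def aggregate_key(*parts, delimiter=":"):
    return delimiter.join(parts)


add_movie("Drama", "Goodfellas")
add_movie("Drama", "Casino")
zip_codes[aggregate_key("Wethersfield", "CT")] = "06109"
zip_codes[aggregate_key("Cambridge", "MA")] = "02140"
```

The redundant genre-to-title row exists purely so that "all dramas" is a single row read rather than a scan over every movie.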








Links
http://dllhell.net – my blog
http://codevoyeur.com – code projects
http://linkedin.com/in/johnzablocki
http://twitter.com/codevoyeur
http://cassandra.apache.org/
http://bitbucket.org/johnzablocki/codevoyeursamples - code from this presentation
http://shop.oreilly.com/product/0636920010852.do
- O’Reilly’s Cassandra - The Definitive Guide
http://about.me/johnzablocki
Questions?