.NET and NoSQL Introducing Cassandra
.NET and NoSQL
Introducing Cassandra
John Zablocki
Development Manager, HealthcareSource
Organizer, Beantown ALT.NET
Beantown ALT.NET
2011-10-26
New England Code Camp – 10/29/2011
WP7 Location @ Dev Boston Meetup –
11/3/2011
DDD w/ Steve Bohlen @ Beantown ALT.NET –
11/28/2011
Shameless Plugs
NoSQL Overview
Cassandra Basic Concepts
Cassandra Data Model
Client API
Cassandra and .NET
Questions?
Agenda
NoSQL
Not Only SQL
Coined in 1998 by Carlo Strozzi to describe a
database that did not expose a SQL interface
In 2009, Eric Evans reintroduced the term to
describe the growing non-RDBMS movement
Broadly refers to a set of data stores that do not
use SQL or a relational data model
Popularized by large web presences such as
Google, Facebook and Amazon
What is NoSQL?
NoSQL Databases
NoSQL databases come in a variety of flavors *
XML (myXMLDB, Tamino, Sedna)
Tabular (HBase, Bigtable)
Key/Value (Redis, Memcached with BerkeleyDB)
Object (db4o, JADE)
Graph (Trinity, Neo4j, InfoGrid)
Document store (CouchDB, MongoDB)
Eventually Consistent Key/Value Store
(Cassandra, Dynamo)
* loose taxonomies
NoSQL Databases
Why NoSQL?
RDBMS Administrators are highly paid
Highly paid individuals often buy larger than
average homes or cars
Larger than average homes and cars require
more energy than smaller homes and cars
Therefore RDBMSs contribute to global
warming more than NoSQL databases which
typically do not require the addition of a DBA
RDBMSs and the Environment
RDBMSs often require high-end servers and
are taxing on disks
High end servers consume more electricity than
mid-range servers
Taxed disks fail more often than untaxed disks
Therefore RDBMSs require more energy and
produce more waste (lots of hard drives in
landfills) than NoSQL DBs, which run on mid-range servers.
Even More Why NoSQL?
The current healthcare crisis requires talented
software engineers to fix the outdated or nonexistent IT systems of the hospital system
Talented software engineers spend a great deal
of time mapping objects to tables in RDBMSs
Talented software engineers are unable to fix
healthcare because they are too busy mapping
objects to tables
Therefore RDBMSs are causing illnesses
NoSQL and Healthcare
Three disruptive technologies you should be
paying attention to today…
NoSQL databases and big data technologies – especially MongoDB, CouchDB, Cassandra,
HBase, MapReduce and Hadoop
Evented I/O Web Servers – especially Node.js
and to a lesser extent Tornado
Functional programming languages – especially
Scala, F# and Erlang
Please Pardon the Interruption…
Introducing Cassandra
Open source, Apache supported project
Originally written by Facebook for Inbox search feature
Written in Java. Yes, Java.
Column-oriented with row-oriented properties
Schemaless
Data stored in sparse, multidimensional hashtables
FB now uses a proprietary fork
Sparse meaning that rows may have one or more columns
Distributed and Decentralized
Highly Available and Fault Tolerant
Elastic Scalability
Tunable Consistency
MapReduce via Hadoop
About Cassandra
Column-Oriented
Content stored by column, rather than by row:
1,2,3;
Smith,Jones,Johnson;
Joe,Mary,Cathy;
40000,50000,44000;
More efficient when an aggregate needs to be computed
over many rows
More efficient when writing new values for a column to
all rows at once
Better compression is possible, due to the fact that modern
compression schemes make use of the similarity of
adjacent data (column data is uniform)
Less efficient for multi-column reads
Less efficient for multi-column writes
Column-Oriented
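The layout above can be made concrete with a small sketch. This is only an illustration in plain Python, not how Cassandra stores data on disk; the column meanings (id, last name, first name, salary) and the function names are assumptions made for the example.

```python
# Hypothetical sample table, using the values shown on the slide.
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]


def to_row_store(rows):
    """Serialize one record after another (row-oriented)."""
    return ";".join(",".join(str(v) for v in row) for row in rows) + ";"


def to_column_store(rows):
    """Serialize one column after another (column-oriented)."""
    columns = zip(*rows)  # transpose rows into columns
    return ";".join(",".join(str(v) for v in col) for col in columns) + ";"


row_store = to_row_store(rows)
column_store = to_column_store(rows)

# An aggregate over one column reads a single contiguous segment of the
# column layout instead of visiting every record.
salary_segment = column_store.rstrip(";").split(";")[3]
total_salary = sum(int(v) for v in salary_segment.split(","))
```

Reading one full record from the column layout, by contrast, has to touch every segment, which is the multi-column read cost noted above.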
Cassandra is meant to run on multiple nodes
Single node is possible, but Cassandra’s benefits
will not be realized
Every node is identical
No Master/Slave
Peer-to-peer protocol keeps data in sync (gossip)
Distributed and Decentralized
Periodic, pairwise interactions
Bounded size information exchange
One agent changes the state of another
Reliable communication is not assumed
Low frequency of interactions to minimize
protocol costs
Some form of randomness in peer selection
Gossip Protocol
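A toy simulation can show how periodic, pairwise exchanges with random peer selection spread a piece of state through a cluster. This is a minimal sketch under simplifying assumptions, not Cassandra's gossiper: the class and function names are hypothetical, each exchange sends the full state (so the bounded-size property is not modeled), and no messages are lost.

```python
import random


class GossipNode:
    def __init__(self, name):
        self.name = name
        # key -> (version, value); the higher version wins on merge.
        self.state = {}

    def update(self, key, value):
        version = self.state.get(key, (0, None))[0] + 1
        self.state[key] = (version, value)

    def merge_from(self, other):
        """One agent changes the state of another: adopt newer entries."""
        for key, (version, value) in other.state.items():
            if version > self.state.get(key, (0, None))[0]:
                self.state[key] = (version, value)


def gossip_round(nodes, rng):
    """Each node picks one random peer and the pair exchanges state."""
    for node in nodes:
        peer = rng.choice([n for n in nodes if n is not node])
        node.merge_from(peer)
        peer.merge_from(node)


def run_until_converged(nodes, rng, max_rounds=100):
    """Return the number of rounds needed, or None if not converged."""
    for round_number in range(1, max_rounds + 1):
        gossip_round(nodes, rng)
        if all(n.state == nodes[0].state for n in nodes):
            return round_number
    return None


rng = random.Random(42)  # fixed seed so the run is repeatable
nodes = [GossipNode(f"node{i}") for i in range(8)]
nodes[0].update("node0:status", "UP")
rounds = run_until_converged(nodes, rng)
```

Because every informed node keeps telling a random peer, the number of informed nodes grows quickly from round to round, which is why a low interaction frequency is usually enough.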
Vertical Scaling
Throw hardware at the problem
More memory, faster CPU, etc.
Horizontal Scaling (Clustering)
Add more machines
Possibly partition the data across machines
Elastic Scaling
Horizontal cluster that can scale up and scale down
seamlessly
New nodes can be brought online and begin
serving requests with partial data
New nodes come online without service disruption
Elastic Scalability
Consistency - ensures transactions move a database
from one consistent state to another
Cassandra supports tunable consistency
Strict (sequential) consistency – all nodes see all
writes in the same order
A read always returns the most recent write
Causal consistency – potentially causally related
operations seen by all nodes in the same order
Concurrent writes are not causally related
Timestamps used to determine the cause of events
Weak (eventual) consistency – all updates will
propagate to all nodes, but not immediately
See Eric Brewer’s CAP Theorem
Tunable Consistency
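One common way to reason about tuning consistency is the quorum rule: with N replicas, a write acknowledged by W of them and a read that consults R of them must overlap on at least one up-to-date replica whenever R + W > N. The slides do not state this rule, so treat the sketch below as an assumption-laden illustration with hypothetical names, not as Cassandra's implementation.

```python
import itertools


class Replica:
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, timestamp, value):
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        return self.data.get(key)


def write(replicas, key, timestamp, value, w):
    """Apply the write to only the first w replicas; the rest lag behind."""
    for replica in replicas[:w]:
        replica.write(key, timestamp, value)


def read(contacted, key):
    """Return the value with the newest timestamp among contacted replicas."""
    answers = [a for a in (r.read(key) for r in contacted) if a is not None]
    return max(answers, key=lambda a: a[0])[1] if answers else None


def always_fresh(n, w, r):
    """True if every possible set of r replicas returns the latest write."""
    replicas = [Replica() for _ in range(n)]
    write(replicas, "k", 1, "old", n)   # everyone has the old value
    write(replicas, "k", 2, "new", w)   # only w replicas see the new one
    return all(
        read(subset, "k") == "new"
        for subset in itertools.combinations(replicas, r)
    )


quorum_result = always_fresh(n=3, w=2, r=2)   # R + W > N
weak_result = always_fresh(n=3, w=2, r=1)     # R + W == N
```

Lower W and R favor availability and latency at the cost of possibly stale reads; higher values favor consistency.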
Large-scale distributed systems have three
competing requirements
Consistency – all nodes see the same data at the
same time
Availability – All clients will always be able to read
and write data and all requests will receive a
response of success or failure
Partition Tolerance – The system will continue to
function, even in the face of network segmentation
failures
Theorem states that a distributed system can satisfy
only 2 of these 3 properties at the same time
Brewer’s CAP Theorem
Consistency and Availability
Two-phase commit for distributed transactions
Consistency and Partition Tolerance
System blocks on a network partition
Pessimistic locking
Node failure hinders availability
Availability and Partition Tolerance
System always returns data, even if inaccurate
Optimistic locking
DNS, web caching
Brewer’s CAP Theorem
Clusters (rings)
Set of nodes that appear as a single server
Single node is still a cluster
Container for keyspaces
Keyspaces
Analogous to a relational database
Has name and set of attributes to define keyspace-wide behavior
Replication factor (# of nodes that will have a copy of each row)
Replica placement strategy (how rows are copied)
Column Families
Cassandra Data Model
Column Families
Analogous to a relational table
Container for an ordered collection of rows
Columns
Basic data structure in Cassandra
Consists of a name, value and clock (timestamp)
Defined with a key name sorting rule (ascii, integer, etc.)
Value sorting is not possible
Names and values stored as Java byte arrays
May be indexed for queries
Super Columns
A special column with values that are maps of subcolumns
(standard columns)
Single level of nesting only
Subcolumns are not indexed – read a supercolumn and all of
its columns are read as well
Cassandra Data Model
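The hierarchy of keyspace, column family, row, and column can be pictured as nested maps. The following is a rough in-memory sketch with hypothetical class names; it ignores super columns, byte-array encoding, and indexing, and it assumes that the column with the newest timestamp wins when the same column is written twice.

```python
import time
from collections import namedtuple

# A column is a (name, value, timestamp) triple.
Column = namedtuple("Column", ["name", "value", "timestamp"])


class ColumnFamily:
    def __init__(self, name):
        self.name = name
        self.rows = {}  # row key -> {column name -> Column}

    def set(self, row_key, column_name, value, timestamp=None):
        if timestamp is None:
            timestamp = time.time_ns()
        row = self.rows.setdefault(row_key, {})
        existing = row.get(column_name)
        # Assumption: the newest timestamp wins on conflicting writes.
        if existing is None or timestamp >= existing.timestamp:
            row[column_name] = Column(column_name, value, timestamp)

    def get(self, row_key):
        """Return the row's columns sorted by column name, not by value."""
        row = self.rows.get(row_key, {})
        return [row[name] for name in sorted(row)]


# A keyspace is modeled here as a plain dict of column families.
keyspace = {"movies": ColumnFamily("movies")}
movies = keyspace["movies"]
movies.set("Goodfellas", "Year", 1990, timestamp=1)
movies.set("Goodfellas", "Genre", "Drama", timestamp=1)
movies.set("Casino", "Genre", "Drama", timestamp=1)  # sparse: no Year column
```

Rows in the same column family do not need the same set of columns, which is what makes the structure sparse.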
System keyspace stores metadata about the cluster,
similar to the master db in SQL Server
Peer-to-peer distribution model where behavior of
each node is identical (no Master/Slave)
Gossip protocol where gossiper runs every second
on a timer
New node added to cluster without disruption
Accepts requests only after learning topology
Each node has information about the others
Anti-entropy is the replica synchronization
mechanism in Cassandra
Nodes exchange hashes of column family data in
order to determine whether read-repair is needed
Cassandra Architecture
Writes are immediately written to a commit log
and subsequently written to an in-memory
store called the memtable
At a specified threshold, objects in the
memtable are flushed to disk to an immutable
structure called a sorted string table (SSTable)
Hinted handoffs allow nodes to receive a write
intended for another node if that other node
goes offline. The hint tells the receiving node
to update the offline node when back online
Cassandra Architecture
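The write path described above (commit log, then memtable, then a flush to an immutable SSTable) can be sketched in a few lines. This is a simplified model with hypothetical names: the threshold is a count of keys rather than a size in bytes, the "disk" is a Python list, and hinted handoff is not modeled.

```python
class StorageNode:
    def __init__(self, memtable_threshold=2):
        self.commit_log = []   # append-only log of every write
        self.memtable = {}     # in-memory store: key -> value
        self.sstables = []     # flushed tables, each sorted and immutable
        self.memtable_threshold = memtable_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. commit log first
        self.memtable[key] = value             # 2. then the memtable
        if len(self.memtable) >= self.memtable_threshold:
            self.flush()

    def flush(self):
        # 3. Freeze the memtable as a sorted, immutable tuple of pairs.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest SSTable first
            for k, v in sstable:
                if k == key:
                    return v
        return None


node = StorageNode(memtable_threshold=2)
node.write("b", "1")
node.write("a", "2")   # reaches the threshold and triggers a flush
node.write("a", "3")   # newer value stays in the memtable for now
```

Because SSTables are never modified in place, a newer value for a key simply shadows the older one until the tables are merged.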
Compaction is the operation of merging SSTables
Keys are merged
Columns are combined
Tombstones are discarded
New index created
Merged data are sorted
Bloom filters are used to reduce disk access
Fast nondeterministic algorithms to determine
whether an element is a member of a set
Tombstones are deletion markers on records
All delete commands in Cassandra are soft deletes
Cassandra Architecture
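Compaction, tombstones, and Bloom filters can be illustrated together. The sketch below is a simplification with hypothetical names: SSTables are plain dicts, the tombstone is a sentinel object, tombstones are dropped immediately (real systems typically keep them for a grace period first), and the Bloom filter uses SHA-256 only because it is available in the standard library.

```python
import hashlib

TOMBSTONE = object()  # hypothetical deletion marker


def compact(sstables):
    """Merge SSTables: newest timestamp wins, tombstones are discarded."""
    merged = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    # Emit surviving keys in sorted order, skipping deleted entries.
    return {
        key: merged[key]
        for key in sorted(merged)
        if merged[key][1] is not TOMBSTONE
    }


class BloomFilter:
    """May report false positives, never false negatives."""

    def __init__(self, size=64, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


old = {"Casino": (1, "Drama"), "Goodfellas": (1, "Drama")}
new = {"Goodfellas": (2, TOMBSTONE), "Zoolander": (2, "Comedy")}
compacted = compact([old, new])

bloom = BloomFilter()
for key in compacted:
    bloom.add(key)
```

A read can consult the filter before touching an SSTable: a negative answer means the key is definitely absent, so the disk access can be skipped.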
Using Cassandra
The Windows Experience
Install the Java 1.6 (or later) SDK
Set environment variable JAVA_HOME to
the install path of the JDK
Download the binaries from
http://cassandra.apache.org/download/
Unzip to Program Files (x86) or some other
directory, optionally set PATH
Set environment variable
CASSANDRA_HOME to directory above
In command line, navigate to bin under
CASSANDRA_HOME and run cassandra
Installing Cassandra on Windows
Command line interface
Navigate to bin, under CASSANDRA_HOME
and run cassandra-cli
Generally useful for development, but not
meant to be a full-blown client
Allows for basic administration (creating
keyspaces, column management, etc.)
Commands must be terminated with a ;
Cassandra-CLI
Connect to a server
connect localhost/9160;
Connect to a server at CLI start
cassandra-cli localhost/9160
System information commands
show cluster name;
show keyspaces;
show api version;
Cassandra-CLI
Create a keyspace
create keyspace BeantownAltNet;
Switch to keyspace
use BeantownAltNet;
Create a column family
create column family movies with
comparator=UTF8Type and
key_validation_class=UTF8Type;
View information about column family
describe keyspace BeantownAltNet;
Cassandra-CLI
See this JIRA issue and then run (v0.8):
assume movies keys as ascii;
Add a row of data
set movies['Goodfellas']['Genre'] = 'Drama';
set movies['Goodfellas']['Year'] = 1990;
Count the columns
count movies['Goodfellas'];
Get the row and column
get movies['Goodfellas'];
get movies['Goodfellas']['Genre'];
Create an index on Genre
update column family movies with
column_metadata=[{column_name:Genre,
index_type:0, index_name:IdxGenre,
validation_type:UTF8Type}];
Query by genre
get movies where Genre = 'Drama';
Remove a column
del movies['Goodfellas']['Year'];
Remove a row
del movies['Goodfellas'];
Cassandra-CLI
Used for Cassandra’s client API
Effectively an RPC serialization mechanism
Software framework for scalable, cross-language
services development
Combines software stack with code generation to
build services
Support for C++, Java, Python, PHP, Erlang, Perl,
Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk
and Ocaml
struct UserProfile {
1: i32 uid,
2: string name,
3: string blurb
}
service UserStorage {
void store(1: UserProfile user),
UserProfile retrieve(1: i32 uid)
}
Thrift
CQL is a DSL similar to SQL, meant to better abstract the
details of the server operations from the clients (still
requires Thrift)
Currently, CQL drivers exist only for Java and Python
CREATE KEYSPACE BeantownAltNet with
replication_factor=1;
CREATE COLUMNFAMILY movies (
key VARCHAR PRIMARY KEY,
genre VARCHAR,
year INT);
INSERT INTO movies (key, genre, year) VALUES
('Zoolander', 'Comedy', 2001);
SELECT key, genre, year FROM movies;
SELECT key, genre, year FROM movies WHERE
genre='Comedy';
Cassandra Query Language (CQL)
Command line CQL tool that ships with the Python
CQL driver
Windows installation
Grab the precompiled Windows Thrift binaries for
Python and copy to site-packages
http://www.dreamcubes.com/b2/softwaredevelopment/20/thrift-with-python-on-win32/
Download cassandra-dbapi2 from
http://code.google.com/a/apacheextras.org/p/cassandra-dbapi2/source/checkout
and run - setup.py install
easy_install pyreadline
Run - python cqlsh localhost 9160
CQLSH
CREATE KEYSPACE foo WITH
strategy_class='SimpleStrategy' AND
strategy_options:replication_factor=1;
CREATE COLUMNFAMILY users (key
VARCHAR PRIMARY KEY, nickname
VARCHAR);
INSERT INTO users (key, nickname) VALUES
('jzablocki', 'zblock');
SELECT * FROM users;
CQLSH
.NET and
Cassandra
The Client Libraries
Currently, there are three well-maintained,
community-sponsored client libraries
Cassandra-Sharp - http://code.google.com/p/cassandra-sharp/
Aquiles - http://aquiles.codeplex.com/
FluentCassandra - https://github.com/managedfusion/fluentcassandra
No official Apache client
.NET Client Libraries
Configured in App/Web.config
Simple API over most common Thrift calls
Additional support for Cassandra commands
via Execute method and Client class
Support for executing CQL
Cassandra-Sharp
https://bitbucket.org/johnzablocki/codevoyeursamples/src/1e2aeb969518/src/PresentationSamples/NoSQLAndDotNet/CassandraQuickStartCassandraSharp
Cassandra-Sharp Demo
Configured in App/Web.config
Simple wrapper over most common Thrift calls
No direct support for executing CQL (though
an internal class does have CQL execution)
Aquiles
https://bitbucket.org/johnzablocki/codevoyeursamples/src/1e2aeb969518/src/PresentationSamples/NoSQLAndDotNet/CassandraQuickStartAquiles
Aquiles Demo
Intended to be an idiomatic .NET Cassandra
framework (i.e., more like .NET than Java)
Makes use of .NET 4.0 dynamic feature
Raw Thrift commands are abstracted
No current support for CQL
Developed by Nick Berardi
FluentCassandra
https://bitbucket.org/johnzablocki/codevoyeursamples/src/1e2aeb969518/src/PresentationSamples/NoSQLAndDotNet/CassandraQuickStart
FluentCassandra Demo
Non-Relational
Design
Codd is Dead
Materialized View
Store redundant data for more efficient queries
MovieGenres['Drama']['Goodfellas'] = null;
MovieGenres['Drama']['Casino'] = null;
Valueless Column
All data necessary to satisfy a query is in the
column name. No value needed (see above)
Aggregate Key
Combine values with a delimiter to create a
composite key
ZipCodes['Wethersfield:CT'] = '06109';
ZipCodes['Cambridge:MA'] = '02140';
Design Patterns
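The valueless column and aggregate key patterns can be mimicked with plain dictionaries. This is only an in-memory sketch with hypothetical helper names; it stands in for column families and does not talk to a Cassandra cluster.

```python
# Hypothetical in-memory stand-ins for two column families.
movie_genres = {}  # row key: genre -> {column name: movie title -> None}
zip_codes = {}     # row key: "City:State" -> zip code


def add_movie_to_genre(genre, title):
    """Valueless column: the column name itself carries the data."""
    movie_genres.setdefault(genre, {})[title] = None


def movies_in_genre(genre):
    """Answer the query from column names alone, in sorted order."""
    return sorted(movie_genres.get(genre, {}))


def aggregate_key(*parts, delimiter=":"):
    """Aggregate key: join values with a delimiter into one row key."""
    return delimiter.join(parts)


add_movie_to_genre("Drama", "Goodfellas")
add_movie_to_genre("Drama", "Casino")
zip_codes[aggregate_key("Wethersfield", "CT")] = "06109"
zip_codes[aggregate_key("Cambridge", "MA")] = "02140"

dramas = movies_in_genre("Drama")
cambridge_zip = zip_codes[aggregate_key("Cambridge", "MA")]
```

The delimiter must not appear inside the values being combined, otherwise two different value pairs could map to the same key.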
Links
http://dllhell.net – my blog
http://codevoyeur.com – code projects
http://linkedin.com/in/johnzablocki
http://twitter.com/codevoyeur
http://cassandra.apache.org/
http://bitbucket.org/johnzablocki/codevoyeursamples - code from this presentation
http://shop.oreilly.com/product/0636920010852.do
- O’Reilly’s Cassandra - The Definitive Guide
http://about.me/johnzablocki
Questions?