BI in the Cloud
NoSQL for the SQL Server Pro
Lynn Langit Feb 2013 – SDC, Sweden
Is NoSQL just Hadoop?
• HUGE hype factor over the last few years
• Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license
• Enables applications to work with thousands of nodes and petabytes of data
• Was inspired by Google's MapReduce and Google File System (GFS) papers
Hadoop in the Enterprise
Working with Hadoop
Common Tools / Languages
• Java (JDK) / Eclipse
• MapReduce
  – Map (query/format)
  – Reduce (aggregate)
  – plug-in for Eclipse (Java)
• Pig (ETL -- Java)
• Hive (HQL Query)
  – HBase tables
• Others
  – Mahout (analyze)
  – Karmasphere (analyze)
  – R (analyze)
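To make the Map (query/format) and Reduce (aggregate) split concrete, here is a minimal single-process sketch of a word count. It is plain Python, not the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real cluster would run the map and reduce steps in parallel across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (the framework does this between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: aggregate all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling `word_count(["big data big hype", "data in the cloud"])` returns a dictionary of counts per word. The point of the pattern is that each map call and each reduce call is independent, which is what lets the framework spread them over many machines.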
Demo - HDInsight - Cluster Allocation
What is the relationship?
NoSQL BigData
BigData = Exponentially More Data
• Retail Example -> ‘Feedback Economy’
  – Number of transactions
  – Number of behaviors (collected every minute)
[Chart: Purchases, Locations and Phone data over time, 12:00 to 2:30]
BigData = ‘Next State’ Questions
Collecting Behavioral data
• What could happen?
• Why didn’t this happen?
• When will the next new thing happen?
• What will the next new thing be?
• What happens?
Demo - HDInsight - MapReduce
Hitting (Relational) Walls
• CA – Highly-available consistency
• CP – Enforced consistency
• AP – Eventual consistency
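Eventual consistency (the AP case) is the least familiar of the three for a relational pro, so a small sketch may help. It assumes a last-write-wins rule based on a timestamp, which is only one of several ways real systems reconcile replicas; `Replica` and `sync` are hypothetical names, not any product's API.

```python
class Replica:
    """One node that accepts writes locally, without coordinating with its peers."""

    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        # Assumption: last write wins, so keep the entry with the newest timestamp.
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """Exchange entries in both directions; the newest timestamp wins on each side."""
    for key in set(a.data) | set(b.data):
        for src, dst in ((a, b), (b, a)):
            if key in src.data:
                ts, value = src.data[key]
                dst.write(key, value, ts)
```

Until `sync` runs, two replicas can return different answers for the same key. After it runs they agree. That window of disagreement is the price paid for staying available while partitioned.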
So many NoSQL options
• More than just the Elephant in the room
• 120+ types of NoSQL databases
Flavors of NoSQL
Key / Value Database
• Schema-less
• State (Persistent or Volatile)
• Examples
  – AWS DynamoDB
  – Riak
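A toy sketch of the key/value idea, assuming nothing beyond the Python standard library: values are opaque and schema-less, lookups are by key only, and state is either volatile (memory) or persistent (a JSON file here). `KeyValueStore` is a hypothetical class for illustration and does not mirror the DynamoDB or Riak APIs.

```python
import json
import os
import tempfile


class KeyValueStore:
    """Toy key/value store: no schema, and the only access path is the key."""

    def __init__(self, path=None):
        # path=None keeps state volatile (in memory only);
        # giving a path makes state persistent in a JSON file.
        self.path = path
        self.data = {}
        if path and os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def put(self, key, value):
        self.data[key] = value
        if self.path:
            with open(self.path, "w") as f:
                json.dump(self.data, f)

    def get(self, key, default=None):
        return self.data.get(key, default)
```

Two values stored under different keys need not share any fields, which is what schema-less means in practice. A store opened on the same path later sees earlier writes; a store created without a path starts empty every time.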
Column Database
• Wide, sparse column sets
• Examples:
  – Cassandra
  – HBase
  – BigTable
  – GAE HR DS
  – Azure Tables
  – SQL 2012 Tabular Model
More about Column Databases
• Type A – Column-families
  – Non-relational
  – Sparse
  – Examples: HBase, Cassandra, xVelocity (SQL 2012 Tabular)
• Type B – Column-stores
  – Relational
  – Dense
  – Example: SQL Server 2012 Columnstore index
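The sparse versus dense distinction can be sketched with plain Python data structures. This is only a mental model under simplifying assumptions; the sample rows and the helper names `columns_for` and `total_sales` are invented for illustration and say nothing about how HBase or a Columnstore index is implemented.

```python
# Type A, column-family: each row key carries its own set of columns (sparse).
column_family = {
    "row1": {"name": "Ada", "email": "ada@example.com"},
    "row2": {"name": "Bob", "phone": "555-0100", "city": "Malmo"},
}

# Type B, column-store: a relational table stored one column at a time,
# so every column holds a value for every row (dense).
column_store = {
    "id": [1, 2, 3],
    "region": ["north", "south", "north"],
    "sales": [100, 250, 75],
}


def columns_for(row_key):
    """List the columns that actually exist for one row in the column-family."""
    return sorted(column_family[row_key])


def total_sales(region):
    """Aggregate by scanning only the two columns the query needs."""
    pairs = zip(column_store["region"], column_store["sales"])
    return sum(sales for reg, sales in pairs if reg == region)
```

In the column-family, `row1` and `row2` have different columns and nothing is stored for the ones that are absent. In the column-store, an aggregate touches only the `region` and `sales` columns and never reads `id`, which is why the layout suits analytic queries.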
Demo - Document Database (MongoDB)
• document-oriented (collection of JSON documents)
• w/ semi-structured data
  – Encodings include BSON, JSON, XML…
  – binary forms (PDF, Microsoft Office documents - Word, Excel…)
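As a rough picture of "a collection of JSON documents with semi-structured data", the sketch below keeps documents as Python dictionaries and filters them by field. It is deliberately not the MongoDB API; the sample documents and the `find` helper are assumptions made for illustration.

```python
import json

# Three documents in one collection; note that they do not share the same fields.
collection = [
    json.loads(text)
    for text in (
        '{"_id": 1, "type": "post", "author": "lynn", "tags": ["nosql", "cloud"]}',
        '{"_id": 2, "type": "post", "author": "lynn", "title": "Hadoop"}',
        '{"_id": 3, "type": "file", "encoding": "PDF", "pages": 12}',
    )
]

_MISSING = object()  # sentinel so an absent field never equals a queried value


def find(docs, **criteria):
    """Return the documents whose fields equal every given criterion."""
    return [
        doc
        for doc in docs
        if all(doc.get(field, _MISSING) == value for field, value in criteria.items())
    ]
```

A query such as `find(collection, author="lynn")` simply skips documents that have no `author` field. There is no schema to migrate when a new kind of document shows up.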
Demo - Graph Database (Neo4j)
• a lot of many-to-many relationships
• recursive self-joins
• when your primary objective is quickly finding connections, patterns and relationships between the objects within lots of data
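"Finding connections" is the kind of question that needs recursive self-joins in SQL but is a simple traversal on a graph. The sketch below uses a breadth-first search over an in-memory adjacency list; it is plain Python rather than Neo4j or Cypher, and the `follows` data and `shortest_path` name are invented for illustration.

```python
from collections import deque

# Who follows whom: a many-to-many relationship stored as an adjacency list.
follows = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan", "eve"],
    "dan": ["fay"],
    "eve": [],
    "fay": [],
}


def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest chain of hops, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None
```

Each extra hop here would be one more self-join in a relational query. In a graph model the traversal just follows edges, whatever the depth.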
So which type of NoSQL? Back to CAP…
[Diagram: CAP triangle of Consistency, Availability, Partitioning]
• CA = SQL/RDBMS – SQL Server, Oracle, MySQL
• CP = NoSQL/column – Hadoop, BigTable, HBase, MemCacheDB
• AP = NoSQL/document or key/value – DynamoDB, CouchDB, Cassandra, Voldemort
Which type of NoSQL for which type of data?
Type of Data | Type of NoSQL solution | Example
Log files | Wide Column | HBase
Product Catalogs | Key Value on disk | DynamoDB
User profiles | Key Value in memory | Redis
Startups | Document | MongoDB
Social media connections | Graph | Neo4j
LOB w/Transactions | NONE! Use RDBMS | SQL Server
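The same guidance can be written as a small lookup, which is handy if you want to keep the rule of thumb next to your code. The entries are taken directly from the slide; the `RECOMMENDATIONS` dictionary and `recommend` function are hypothetical names.

```python
# Data type -> (type of NoSQL solution, example product), as listed on the slide.
RECOMMENDATIONS = {
    "log files": ("Wide Column", "HBase"),
    "product catalogs": ("Key Value on disk", "DynamoDB"),
    "user profiles": ("Key Value in memory", "Redis"),
    "startups": ("Document", "MongoDB"),
    "social media connections": ("Graph", "Neo4j"),
    "lob w/transactions": ("NONE! Use RDBMS", "SQL Server"),
}


def recommend(data_type):
    """Look up the suggested store for a kind of data; None if it is not listed."""
    return RECOMMENDATIONS.get(data_type.strip().lower())
```

Note the last entry: for line-of-business data with transactions, the recommendation is to stay with a relational database.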
Cloud-hosted NoSQL up to 50x CHEAPER
The reality…two pivots
Storage Methods
• SQL (RDBMS)
• NoSQL
Storage Locations
• On premises
• Cloud-hosted
NoSQL (Cloud) BLOB Storage Buckets
• Amazon – S3 or Glacier – The gold standard
• Google – Cloud Storage – Free for developers
• Microsoft Azure BLOBS
• DropBox, Box…
Cloud-hosted RDBMS
• AWS RDS – SQL Server, mySQL, Oracle
  – Medium cost
  – Solid feature set, e.g. backup, snapshot
  – Use existing tooling
• Google – mySQL
  – Lowest cost
  – Most limited RDBMS functionality
• Microsoft – SQLAzure
  – Highest cost
Demo - AWS RDS
• SQL Server, MySQL or Oracle
• Essential to understand pricing models
Cloud Offerings – RDBMS AND NoSQL
Offering | AWS | Google | Microsoft
RDBMS | RDS – all major | mySQL | SQL Azure
NoSQL buckets | S3 or Glacier | Cloud Storage | Azure Blobs
NoSQL databases | DynamoDB | H/R Data on GAE | Azure Tables
Streaming ML or (Mahout) | Custom EC2 | Prospective Search & Prediction API | StreamInsight
Document or Graph | MongoDB on EC2 | Freebase | MongoDB on Windows Azure
Hadoop | Elastic MapReduce using S3 & EC2 | none | HDInsight
Dremel/Warehousing | RedShift | BigQuery | none
Data Scientists…
Karmasphere Studio for AWS
Hadoop Connector to Excel
Google BigQuery
• Hadoop-like (Dremel) based service
• For massive amounts of data
• SQL-like query language
Dremel Realized => Impala
• Interactive Hadoop?
Other types of cloud data services
Hosting public datasets
• Pay to read
• Earn revenue by offering for read
Cleaning / matching (your) data
• ETL – Microsoft Data Explorer, Google Refine
• Data Quality – Windows Azure Data Market, InfoChimps, DataMarket.com
NoSQL To-Do List
Understand CAP & types of NoSQL databases
• Use NoSQL when business needs dictate
• Use the right type of NoSQL for your business problem
Try out NoSQL on the cloud
• Quick and cheap for behavioral data
• Mashup cloud datasets
• Good for specialized use cases, e.g. dev, test, training environments
Learn NoSQL access technologies
• New query languages, e.g. MapReduce, R, Infer.NET
• New query tools (vendor-specific) – Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc…
The Changing Data Landscape
www.TeachingKidsProgramming.org
• Free Courseware (recipes)
• Do a Recipe
• Teach a Kid (Ages 10++)
• Java or Microsoft SmallBasic
Toward Data Craftsmanship…
Follow me @LynnLangit
RSS my blog www.LynnLangit.com
Hire me
• To help build your BI/Big Data solution
• To teach your team next gen BI
• To learn more about using NoSQL solutions