BI in the Cloud


NoSQL for the SQL Server Pro

Lynn Langit Feb 2013 – SDC, Sweden

Is NoSQL just Hadoop?

• HUGE Hype factor over last few years
• Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license
• Enables applications to work with thousands of nodes and petabytes of data
• Inspired by Google's MapReduce and Google File System (GFS) papers

Hadoop in the Enterprise

Working with Hadoop

Common Tools / Languages
• Java (JDK) / Eclipse
• MapReduce
– Map (query/format)
– Reduce (aggregate)
– plug-in for Eclipse (Java)
• Pig (ETL -- Java)
• Hive (HQL Query)
– HBase tables
• Others
– Mahout (analyze)
– Karmasphere (analyze)
– R (analyze)
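The Map (query/format) and Reduce (aggregate) steps can be sketched in a few lines of plain Python. This is a single-process illustration of the idea only, not Hadoop's Java API; the function names and the sample input are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: turn one input line into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    # Hypothetical sample input
    print(word_count(["big data big hype", "data in the cloud"]))
```

On a real cluster the map and reduce functions run in parallel on many nodes; only the shape of the computation is the same.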

Demo - HDInsight - Cluster Allocation

What is the relationship?

NoSQL BigData

BigData = Exponentially More Data

• Retail Example -> ‘Feedback Economy’
– Number of transactions
– Number of behaviors (collected every minute)
[Chart: purchases, locations, and phone data collected between 12:00 and 2:30; vertical axis from 0 to 2500]

BigData = ‘Next State’ Questions

Collecting Behavioral data
• What could happen?

• Why didn’t this happen?

• When will the next new thing happen?

• What will the next new thing be?

• What happens?

Demo - HDInsight - MapReduce

Hitting (Relational) Walls
• CA – Highly-available consistency
• CP – Enforced consistency
• AP – Eventual consistency

So many NoSQL options

• More than just the Elephant in the room
• 120+ types of NoSQL databases

Flavors of NoSQL

Key / Value Database

• Schema-less
• State (Persistent or Volatile)
• Examples
– AWS DynamoDB
– Riak
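A toy key/value store shows what "schema-less" and "persistent or volatile" mean in practice. This is a minimal sketch, not the API of DynamoDB or Riak; the `KeyValueStore` class, its methods, and the JSON file format are assumptions made for illustration.

```python
import json
import os
import tempfile


class KeyValueStore:
    """Toy key/value store: values are looked up by key and have no fixed schema."""

    def __init__(self, path=None):
        # path=None keeps state volatile (memory only); a path makes it persistent.
        self.path = path
        self.data = {}
        if path and os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def put(self, key, value):
        self.data[key] = value
        if self.path:
            # Rewrite the whole file on each put; fine for a sketch, not for production.
            with open(self.path, "w") as f:
                json.dump(self.data, f)

    def get(self, key, default=None):
        return self.data.get(key, default)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "kv.json")  # hypothetical location
    store = KeyValueStore(path)
    store.put("user:1", {"name": "Ada"})   # one value is a dict...
    store.put("user:2", ["a", 1])          # ...another is a list: no schema enforced
    print(KeyValueStore(path).get("user:1"))  # reopened from disk
```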

Column Database

• Wide, sparse column sets
• Examples:
– Cassandra
– HBase
– BigTable
– GAE HR DS
– Azure Tables
– SQL 2012 Tabular Model

More about Column Databases

• Type A – Column-families
– Non-relational
– Sparse
– Examples: HBase, Cassandra, xVelocity (SQL 2012 Tabular)
• Type B – Column-stores
– Relational
– Dense
– Example: SQL Server 2012 Columnstore index
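The sparse versus dense distinction between the two types can be pictured with plain Python structures. This is only a rough mental model with made-up rows and column names; it does not reflect how HBase or a columnstore index lays out data on disk.

```python
# Type A, column family: each row key maps to its own sparse set of columns.
# Rows may carry completely different columns (hypothetical sample data).
column_family = {
    "row1": {"name": "Ann", "city": "Malmo"},
    "row2": {"name": "Bo", "phone": "555-0100", "twitter": "@bo"},
}

# Type B, column store: every column is a dense array with one slot per row.
column_store = {
    "name": ["Ann", "Bo"],
    "city": ["Malmo", None],
    "sales": [120, 80],
}


def family_get(rows, row_key, column):
    """Look up one cell; a missing column is simply absent for that row."""
    return rows.get(row_key, {}).get(column)


def store_sum(columns, column):
    """Aggregate one column by scanning only that column's array."""
    return sum(v for v in columns[column] if v is not None)


if __name__ == "__main__":
    print(family_get(column_family, "row2", "twitter"))
    print(store_sum(column_store, "sales"))
```

The column-store layout is what makes aggregates cheap: summing `sales` never touches `name` or `city`.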

Demo - Document Database (MongoDB)
• Document-oriented (collection of JSON documents)
• Semi-structured data
– Encodings include BSON, JSON, XML…
– Binary forms: PDF, Microsoft Office documents (Word, Excel…)
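A "collection of JSON documents" can be sketched without a database at all. The example below uses plain Python lists and dicts rather than the MongoDB driver; the `find` helper and the sample documents are assumptions made for illustration.

```python
import json

# A collection is a list of documents; documents need not share the same fields.
collection = [
    {"_id": 1, "name": "Ann", "tags": ["bi", "sql"]},
    {"_id": 2, "name": "Bo", "address": {"city": "Malmo"}},  # nested sub-document
]


def find(docs, **criteria):
    """Return documents whose top-level fields equal every given criterion."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


if __name__ == "__main__":
    print(find(collection, name="Bo"))
    # Documents round-trip through JSON text unchanged.
    print(json.dumps(collection[0]))
```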

Demo - Graph Database (Neo4j)
• A lot of many-to-many relationships
• Recursive self-joins
• When your primary objective is quickly finding connections, patterns and relationships between the objects within lots of data
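"Quickly finding connections" is the kind of question that needs recursive self-joins in SQL but is a simple traversal over a graph. The sketch below uses a breadth-first search in plain Python, not Neo4j or its query language; the `friends` data and the `shortest_path` function are hypothetical.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest chain of connections, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical many-to-many "knows" relationships stored as an adjacency list.
friends = {
    "ann": ["bo", "cy"],
    "bo": ["ann", "di"],
    "cy": ["ann"],
    "di": ["bo", "ed"],
    "ed": ["di"],
}

if __name__ == "__main__":
    print(shortest_path(friends, "ann", "ed"))
```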

So which type of NoSQL? Back to CAP…
• CA (Consistency + Availability) = SQL/RDBMS: SQL Server, Oracle, MySQL
• CP (Consistency + Partitioning) = NoSQL/column: Hadoop, Big Table, HBase, MemcacheDB
• AP (Availability + Partitioning) = NoSQL/document or key/value: DynamoDB, CouchDB, Cassandra, Voldemort

Which type of NoSQL for which type of data?

Type of Data             | Type of NoSQL solution | Example
Log files                | Wide Column            | HBase
Product Catalogs         | Key Value on disk      | DynamoDB
User profiles            | Key Value in memory    | Redis
Startups                 | Document               | MongoDB
Social media connections | Graph                  | Neo4j
LOB w/Transactions       | NONE! Use RDBMS        | SQL Server

Cloud-hosted NoSQL up to 50x CHEAPER

The reality…two pivots

Storage Methods
• SQL (RDBMS)
• NoSQL
Storage Locations
• On premises
• Cloud-hosted

NoSQL (Cloud) BLOB Storage Buckets
• Amazon – S3 or Glacier – The gold standard
• Google – Cloud Storage – Free for developers
• Microsoft Azure BLOBS
• DropBox, Box…

Cloud-hosted RDBMS

• AWS RDS – SQL Server, MySQL, Oracle
– Medium cost
– Solid feature set, e.g. backup, snapshot
– Use existing tooling
• Google – MySQL
– Lowest cost
– Most limited RDBMS functionality
• Microsoft – SQL Azure
– Highest cost

Demo - AWS RDS
• SQL Server, MySQL or Oracle
• Essential to understand pricing models

Cloud Offerings – RDBMS AND NoSQL

Offering                 | AWS                              | Google                              | Microsoft
RDBMS                    | RDS – all major                  | MySQL                               | SQL Azure
NoSQL buckets            | S3 or Glacier                    | Cloud Storage                       | Azure Blobs
NoSQL databases          | DynamoDB                         | H/R Data on GAE                     | Azure Tables
Streaming ML or (Mahout) | Custom EC2                       | Prospective Search & Prediction API | StreamInsight
Document or Graph        | MongoDB on EC2                   | Freebase                            | MongoDB on Windows Azure
Hadoop                   | Elastic MapReduce using S3 & EC2 | none                                | HDInsight
Dremel/Warehousing       | RedShift                         | BigQuery                            | none

Data Scientists…

Karmasphere Studio for AWS

Hadoop Connector to Excel

Google BigQuery
• Hadoop-like (Dremel) based service
• For massive amounts of data
• SQL-like query language

Dremel Realized => Impala

• Interactive Hadoop?

Other types of cloud data services

Hosting public datasets
• Pay to read
• Earn revenue by offering for read
Cleaning / matching (your) data
• ETL – Microsoft Data Explorer, Google Refine
• Data Quality – Windows Azure Data Market, InfoChimps, DataMarket.com

NoSQL To-Do List

Understand CAP & types of NoSQL databases
• Use NoSQL when business needs dictate
• Use the right type of NoSQL for your business problem
Try out NoSQL on the cloud
• Quick and cheap for behavioral data
• Mashup cloud datasets
• Good for specialized use cases, e.g. dev, test, training environments
Learn NoSQL access technologies
• New query languages, e.g. MapReduce, R, Infer.NET
• New query tools (vendor-specific) – Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc.

The Changing Data Landscape

www.TeachingKidsProgramming.org

• Free Courseware (recipes)
• Do a Recipe -> Teach a Kid (Ages 10+)
• Java or Microsoft SmallBasic

Toward Data Craftsmanship…

Follow me @LynnLangit
RSS my blog www.LynnLangit.com

Hire me
• To help build your BI/Big Data solution
• To teach your team next gen BI
• To learn more about using NoSQL solutions