
Introduction to
Windows Azure HDInsight
Jon Tupitza
Principal, Solution Architect
[email protected]
Agenda
• What is Big Data?
…and how does it affect our business?
• What is Hadoop?
…and how does it work?
• What is Windows Azure HDInsight?
…and how does it fit into the Microsoft BI Ecosystem?
• What tools are used to work with Hadoop & HDInsight?
• How do I get started using HDInsight?
What is Big Data?
Datasets that, due to their size and complexity, are difficult to store, query, and
manage using existing data management tools or data processing applications.
Volume (Size): The explosion in social media, mobile apps, digital sensors, RFID, GPS, and more has caused exponential data growth.

Variety (Structure): Traditionally BI has sourced structured data, but now insight must be extracted from unstructured data like large text blobs, digital media, sensor data, etc.

Velocity (Speed): Sources like social networking and sensor signals create data at a tremendous rate, making it a challenge to capture, store, and analyze that data in a timely or economical manner.
Key Trends Causing Data Explosion
• Device Explosion: 5.5 billion+ devices, with over 70% of the global population using them.
• Social Networks: Over 2 billion users worldwide.
• Cheap Storage: In 1990, 1MB cost $1.00; today 1MB costs .01 cent.
• Ubiquitous Connection: Web traffic to generate over 1.6 zettabytes of data by 2015.
• Sensor Networks: Over 10 billion networked sensors.
• Inexpensive Computing: 1980: 10 MIPS/sec; 2005: 10M MIPS/sec.
Big Data is Creating Big Opportunities!
• Big Data technologies are a top priority for most institutions, both corporate and government.
• Currently, 49% of CEOs and CIOs claim they are undertaking Big Data projects.
• Big Data software is estimated to experience a 34% YOY compound growth rate, reaching $4.6B by 2015.
• Big Data services are estimated to experience a 39% YOY compound growth rate, reaching $6.5B by 2015.
Uses for Big Data technologies
Data Warehousing:
• Operational Data: new user registrations, purchasing, new product offerings.
• Data Exhaust: by-products like log files have a low density of useful data, but the data they do contain is very valuable if we can extract it at low cost and in a timely manner.

Devices and the Internet of Things:
• Trillions of Internet-connected devices
• GPS data
• Cell phone data
• Automotive engine performance data

Collective Intelligence:
• Social Analytics: What is the social sentiment for my product(s)?
• Live Data Feed/Search: How do I optimize my services based on weather or traffic patterns? How do I build a recommendations engine (Mahout)?
• Advanced Analytics: How do I better predict future outcomes?
What is Hadoop and How does it Work?
• Storage Layer (HDFS)
• Programming Layer (Map/Reduce)
• Schema on Read vs. Schema on Write
• Hadoop Distributed Architecture: implements a divide-and-conquer algorithm to achieve greater parallelism.
• Traditionally we have always brought the data to the schema and code; Hadoop sends the schema and the code to the data.
• We don't have to pay the cost, or live with the limitations, of moving the data: IOPS, network traffic.
[Diagram: a Map/Reduce layer (Name Node coordinating Task Trackers) running over an HDFS layer (Data Nodes).]
HDFS (Hadoop Distributed File System)
• Fault Tolerant:
  • Data is distributed across the Data Nodes in the cluster (like RAID 5).
  • Three copies of the data are stored in case of storage failures.
  • Data faults can be quickly detected and repaired due to data redundancy.
• High Throughput:
  • Favors batch over interactive operations to support streaming large datasets.
  • Data files are written once and then closed, never to be updated.
  • Supports data locality: HDFS facilitates moving the application code (query) to the data rather than moving the data to the application (schema on read).
Map/Reduce
• Like the “Assembly” language of Hadoop / Big Data: very low level.
• Users can interface with Hadoop using higher-level languages like Hive and Pig.
• Schema on Read, not Schema on Write.
• Moving the Code to the Data:
  • First, store the data.
  • Second, the Map function moves the programming to the data, loading the code onto each machine where the data already resides.
  • Third, the Reduce function collects statistics back from each of the machines.
Map/Reduce: Logical Process
The example query "SELECT WHERE Key=A, then SUM the values of each property" runs in two phases:

1. SELECT WHERE Key=A (Map): The MAP function runs on each data node, extracting the rows that match the query.
   • Node 1 holds: A: X=2, Y=3 | B: X=1, Z=2 | C: X=3 | A: Y=1, Z=4 → Map output: A: X=2, Y=3 and A: Y=1, Z=4
   • Node 2 holds: E: Y=3 | A: X=1, Z=2 | A: Z=5 | D: Y=2, Z=1 → Map output: A: X=1, Z=2 and A: Z=5
2. SUM Values of Each Property (Reduce): The REDUCE function runs on one node, combining the results from all the MAP components into the final result set: A: X=3, Y=4, Z=11.
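To make the two phases concrete, here is a minimal PowerShell sketch that simulates the same logic locally. It is illustrative only: it runs in a single process rather than on a Hadoop cluster, and the sample rows are the ones from the example above.

  # Sample "Key Value" rows, as held by two (hypothetical) data nodes
  $node1 = 'A X=2,Y=3', 'B X=1,Z=2', 'C X=3', 'A Y=1,Z=4'
  $node2 = 'E Y=3', 'A X=1,Z=2', 'A Z=5', 'D Y=2,Z=1'

  # MAP phase: each node filters its own rows for Key = 'A'
  $mapped = foreach ($row in ($node1 + $node2)) {
      $key, $values = $row -split ' '
      if ($key -eq 'A') { $values }
  }

  # REDUCE phase: sum each property across all mapped rows
  $totals = @{}
  foreach ($valueList in $mapped) {
      foreach ($pair in $valueList -split ',') {
          $name, $value = $pair -split '='
          $totals[$name] = [int]$totals[$name] + [int]$value
      }
  }
  $totals   # X=3, Y=4, Z=11: the final result set shown above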
Yet Another Resource Negotiator (YARN)
• Second-generation Hadoop.
• Extends capabilities beyond Map/Reduce, beyond batch processing.
• Makes good on the promise of (near) real-time Big Data processing.
[Diagram: clients submit work to a central Resource Manager, which schedules an App Master and task Containers on each Node Manager.]
Data Movement and Query Processing
RDBMS require a schema to be applied when the data is written:
• The data is transformed to accommodate the schema.
• Some information hidden in the data may be lost at write time.

Hadoop/HDInsight applies a schema only when the data is read:
• The schema doesn't change the structure of the underlying data.
• The data is stored in its original (raw) format so that all hidden information is retained.

RDBMS perform query processing in a central location:
• Data is moved from storage to a central location for processing.
• More central processing capacity is required to move data and execute the query.

Hadoop performs query processing at each storage node:
• Data doesn't need to be moved across the network for processing.
• Only a fraction of the central processing capacity is required to execute the query.
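A toy illustration of schema on read, assuming a web-server log line as the raw data: the line is stored exactly as it arrived, and structure is imposed only by the query that reads it.

  # Stored as-is at write time; no schema applied, so nothing is lost
  $logLine = '2014-03-01 12:00:03 GET /products/42 200'

  # Schema applied at read time: this query names only the fields it needs
  $date, $time, $verb, $url, $status = $logLine -split ' '
  "$verb request returned status $status"   # a different query could parse the same line differently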
Major Differences: RDBMS vs. Big Data
Feature                      | Relational Database             | Hadoop / HDInsight
Data Types and Formats       | Structured                      | Semi-Structured or Unstructured
Data Integrity               | High: Transactional Updates     | Low: Eventually Consistent
Schema                       | Static: Required on Write       | Dynamic: Optional on Read & Write
Read and Write Pattern       | Fully Repeatable Read/Write     | Write Once; Repeatable Read
Storage Volume               | Gigabytes to Terabytes          | Terabytes, Petabytes and Beyond
Scalability                  | Scale Up with More Powerful HW  | Scale Out with Additional Servers
Data Processing Distribution | Limited or None                 | Distributed Across Cluster
Economics                    | Expensive Hardware & Software   | Commodity Hardware & Open Source Software
Solution Architecture: Big Data or RDBMS?
Where will the source data come from?
• Perhaps you already have the data that contains the information you need, but you can't analyze it with your existing tools. Or is there a source of data you think will be useful, but you don't yet know how to collect it, store it, and analyze it?

What is the format of the data?
• Is it highly structured, in which case you may be able to load it into your existing database or data warehouse and process it there? Or is it semi-structured or unstructured, in which case a Big Data solution that is optimized for textual discovery, categorization, and predictive analysis will be more suitable?

What are the delivery and quality characteristics of the data?
• Is there a huge volume? Does it arrive as a stream or in batches? Is it of high quality, or will you need to perform some type of data cleansing and validation of the content?

Do you want to combine the results with data from other sources?
• If so, do you know where this data will come from, how much it will cost if you have to purchase it, and how reliable this data is?

Do you want to integrate with an existing BI system?
• Will you need to load the data into an existing database or data warehouse, or will you just analyze it and visualize the results separately?
What is Windows Azure HDInsight?
• Windows Azure: Microsoft’s online storage and compute services
• HDInsight: Microsoft’s implementation of Apache Hadoop
(Hortonworks Data Platform) as an online service
Makes Apache Hadoop readily available to the Windows community:
• Enables Windows Azure subscribers to quickly and easily provision an HDInsight cluster across Windows Azure's pool of storage and compute resources.
• Also enables them to quickly de-provision those clusters when they're not needed.
• Allows subscribers to continuously store their data for later use.

Exposes Apache Hadoop services to the Microsoft programming ecosystem:
• SQL Server: Analysis Services
• PowerShell and the Cross-platform Command-line tools
• Visual Studio: CLR (C#, F#, etc.)
• JavaScript
• ODBC / JDBC / REST API
• Excel Self-Service BI (SSBI): PowerQuery, PowerPivot, PowerView and PowerMap
HDInsight Data Storage
• Azure Blob Storage: the default storage system for HDInsight.
• Enables you to persist your data even when you're not running an HDInsight cluster.
• Enables you to leverage your data using HDInsight, Azure SQL Database & PDW.
[Diagram: the HDFS API (Name Node, Data Nodes) layered over Azure Blob Storage (Front End, Partition Layer, Stream Layer), exposed as Azure Storage (ASV).]
Azure Storage Vault (ASV)
• The default file system for the HDInsight Service.
• Provides scalable, persistent, sharable, highly-available storage.
• Fast data access for compute nodes residing in the same data center.
• Addressable via: asv[s]://<container>@<account>.blob.core.windows.net/<path>
• Requires the storage key in core-site.xml:

<property>
  <!-- replace 'accountname' with your storage account name -->
  <name>fs.azure.account.key.accountname</name>
  <!-- the storage account's access key -->
  <value>keyvaluestring</value>
</property>
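For illustration, composing an ASV path in PowerShell (the account and container names here are hypothetical; later HDInsight releases use wasb:// as the name for the same scheme):

  $account   = 'myaccount'     # hypothetical storage account
  $container = 'mycontainer'   # hypothetical blob container
  $inputPath = "asv://$container@$account.blob.core.windows.net/example/data"
  # asvs:// is the SSL form of the same address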
The Zoo: HDInsight / Hadoop Ecosystem
[Diagram, rendered as a list.] Core Hadoop:
• Distributed Storage (HDFS), backed by Azure Storage Vault (ASV)
• Distributed Processing (Map/Reduce)

Built on top of the core:
• Query (Hive)
• Scripting (Pig)
• Stats processing (RHadoop)
• Machine Learning (Mahout)
• Graph (Pegasus)
• NoSQL Database (HBase)
• Log file Aggregation (Flume)
• Metadata (HCatalog)
• Pipeline / Workflow (Oozie)
• Event Driven Processing

Integration and tooling around it:
• Programming: C#, F#, .NET, JavaScript
• Data Integration (ODBC / SQOOP / REST)
• Relational (SQL Server) and Parallel Data Warehouse (PDW) / PolyBase
• World's Data (Azure Data Marketplace)
• Business Intelligence (Excel, Power View, SSAS)
• Monitoring & Deployment (System Center)
• Security (Active Directory)
Programming HDInsight
Since HDInsight is a service-based implementation, you get immediate access
to the tools you need to program against HDInsight/Hadoop
• Existing Ecosystem: Hive, Pig, Sqoop, Mahout, Cascading, Scalding, Scoobi, Pegasus, etc.
• .NET: C#, F# Map/Reduce, LINQ to Hive, .NET Management Clients, etc.
• JavaScript: JavaScript Map/Reduce, Browser-hosted Console, Node.js management clients
• DevOps/IT Pros: PowerShell, Cross-Platform CLI Tools (see the job-submission sketch below)
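As an example of the DevOps/IT Pro path, a Map/Reduce job can be driven end-to-end from PowerShell. A minimal sketch, assuming the Windows Azure HDInsight cmdlets are installed; the cluster name is a placeholder, and the JAR and class names refer to the stock Hadoop samples:

  $clusterName = "mycluster"   # hypothetical HDInsight cluster

  # Define a Map/Reduce job that runs the sample word-count class
  $jobDef = New-AzureHDInsightMapReduceJobDefinition `
      -JarFile "wasb:///example/jars/hadoop-examples.jar" `
      -ClassName "wordcount" `
      -Arguments "wasb:///example/data/input", "wasb:///example/data/output"

  # Submit it, wait for completion, then fetch the job's standard output
  $job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
  Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600
  Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardOutput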
Microsoft’s Vision
To provide insight to users by activating new types of data…
Broader Access:
• Easy installation of Hadoop on Windows
• Simplified programming via integration with .NET and JavaScript
• Integration with SQL Server Data Warehouses

Enterprise Ready:
• Choice of deployment on Windows Server or Windows Azure
• Integration with other Windows components like Active Directory and System Center

Insights:
• Integration with the Microsoft BI Stack: SQL Server, SharePoint, Excel & PowerBI

And contribute back to the community distribution of Hadoop.
How do I get started with HDInsight?
1. Create a Windows Azure account (subscription)
2. Create an Azure Storage account
3. Create an Azure Blob Storage container
4. Provision an HDInsight Service cluster
5. Install Windows Azure PowerShell
6. Install Windows Azure HDInsight PowerShell
7. Set up your environment (a scripted sketch of steps 2 through 4 follows below)
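Once the PowerShell modules from steps 5 and 6 are installed, steps 2 through 4 can be scripted. A minimal sketch, assuming the 2014-era Azure cmdlets; every name and the location below are placeholders to replace with your own:

  $storageAccount = "mystorageaccount"   # hypothetical storage account name
  $container      = "mycontainer"        # hypothetical blob container name
  $clusterName    = "mycluster"          # hypothetical cluster name
  $location       = "East US"

  Add-AzureAccount   # sign in to the subscription

  # Step 2: create the storage account and capture its access key
  New-AzureStorageAccount -StorageAccountName $storageAccount -Location $location
  $key = (Get-AzureStorageKey $storageAccount).Primary

  # Step 3: create the blob container that will back the cluster
  $ctx = New-AzureStorageContext -StorageAccountName $storageAccount -StorageAccountKey $key
  New-AzureStorageContainer -Name $container -Context $ctx

  # Step 4: provision a 4-node HDInsight cluster over that storage
  New-AzureHDInsightCluster -Name $clusterName -Location $location `
      -DefaultStorageAccountName "$storageAccount.blob.core.windows.net" `
      -DefaultStorageAccountKey $key `
      -DefaultStorageContainerName $container `
      -ClusterSizeInNodes 4 -Credential (Get-Credential)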
Architectural Models
Standalone Data Analysis & Visualization: Experiment with data sources to discover whether they provide useful information. Handle data that can't be processed using existing systems.

Data Transfer, Cleansing or ETL: Extract and transform data before loading it into existing databases. Categorize, normalize, and extract summary results to remove duplication and redundancy.

Data Warehouse or Data Storage: Create robust data repositories that are reasonably inexpensive to maintain. Especially useful for storing and managing huge data volumes.

Integrate with Existing EDW and BI Systems: Integrate Big Data at different levels: EDW, OLAP, Excel PowerPivot. Also, PDW enables querying HDInsight to integrate Big Data with existing dimension & fact data.
Resources
• http://www.windowsazure.com
• http://hadoopsdk.codeplex.com
• http://nuget.org/packages?q=hadoop
• PolyBase (David DeWitt): http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/polybase.aspx
• PDW website: http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/pdw.aspx
• http://blogs.technet.com/b/dataplatforminsider/archive/2013/04/25/insight-through-integration-sql-server-2012-parallel-data-warehouse-polybase-demo.aspx
Tools
• http://azurestorageexplorer.codeplex.com
Key Takeaways
• HDInsight clusters can be provisioned when needed and then de-provisioned without losing the data or metadata they have processed.
• Azure Storage Vault allows you to maintain that state, paying only for the storage and not the cluster(s).
• Since stream data often arrives in massive bursts, HDInsight can provide a buffer between that data generation and existing data warehouse/BI infrastructures.