Petabytes BIG DATA Transactions + Interactions + Observations = BIG DATA Mobile Web Sentiment SMS/MMS Speech to Text User Click Stream Social Interactions & Feeds Terabytes WEB Web logs Spatial & GPS Coordinates A/B testing Sensors /


Petabytes
BIG DATA
Transactions +
Interactions +
Observations
= BIG DATA
Mobile Web
Sentiment
SMS/MMS
Speech to Text
User Click Stream
Social Interactions & Feeds
Terabytes
WEB
Web logs
Spatial & GPS Coordinates
A/B testing
Sensors / RFID / Devices
Behavioral Targeting
Gigabytes
CRM
Business Data Feeds
Dynamic Pricing
Segmentation
External Demographics
Search Marketing
Megabytes
ERP
Purchase detail
Customer Touches
Support Contacts
Purchase record
Payment record
User Generated Content
Affiliate Networks
Offer details
Dynamic Funnels
Offer history
HD Video, Audio, Images
Product/Service Logs
Increasing Data Variety and Complexity
APPLICATIONS
OLTP, ERP, CRM Systems
Custom Applications
Business Analytics
Packaged Applications
Unstructured documents, emails
Server logs
DATA SYSTEM
2.8 ZB in 2012
85% from New Data Types
RDBMS
EDW
Sentiment, Web Data
MPP
REPOSITORIES
15x Machine Data by 2020
40 ZB by 2020
Sensor, Machine Data
Source: IDC
SOURCES
Geolocation
Existing Sources
(CRM, ERP, Clickstream, Logs)
Clickstream
APPLICATIONS
OLTP, ERP, CRM Systems
Custom Applications
Business Analytics
Packaged Applications
DEV & DATA TOOLS
Server logs
EDW
MPP
REPOSITORIES
Data Management
Operations
RDBMS
Data Access
Security
OPERATIONS TOOLS
Governance & Integration
DATA SYSTEM
Build & Test
Unstructured documents, emails
Sentiment, Web Data
Provision, Manage & Monitor
SOURCES
Sensor, Machine Data
Geolocation
OLTP, ERP, CRM Systems
Documents, Emails
Web Logs, Click Streams
Social Networks
Machine Generated
Sensor Data
Geolocation Data
Clickstream
Hortonworks Data Platform 2.2
GOVERNANCE
Data Workflow, Lifecycle & Governance
Falcon
Sqoop
Flume
WebHDFS
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
Script
SQL
Java
Scala
NoSQL
Stream
Pig
Hive
Cascading
HBase
Storm
Tez
Tez
Spark
ISV Engines
Solr
YARN: Data Operating System
(Cluster Resource Management)
HDFS (Hadoop Distributed File System)
Linux
Windows
Deployment Choice
OPERATIONS
Authentication
Authorization
Accounting
Data Protection
Provision, Manage & Monitor
Storage: HDFS
Resources: YARN
Access: Hive, …
Pipeline: Falcon
Cluster: Knox
Cluster: Ranger
Slider
Slider
Tez
Others
In-Memory Search
SECURITY
On-Premises
Cloud
Ambari
Zookeeper
Scheduling
Oozie
Hortonworks Data Platform (HDP)
The Only Completely Open Distribution for Apache Hadoop
Fundamentally Versatile and Comprehensive enterprise capabilities
Wholly Integrated for deep ecosystem interoperability
HDP certifies the most recent & stable community innovation
1.2.0
0.98.4
2.6.0
4.2
0.60
0.5.1
0.6.0
0.12.0
Data
Management
4.0.0
1.4.0
0.9.1
3.4.5
1.4.4
0.4.0
1.4.4
Data Access
Governance
& Integration
Operations
Ranger
Knox
Oozie
3.3.2
Ambari
Falcon
Slider
Tez
Solr
1.3.1
Zookeeper
0.96.1
HBase
2013
Pig
October
2.2.0
Hadoop
&YARN
HDP 2.0
0.12.0
Hive & HCatalog
2014
3.4.5
0.5.0
0.4.0
4.0.0
Spark
April
0.98.0
Storm
0.12.1
0.5.0
1.4.5
1.5.1
Phoenix
2.4.0
4.1.0
1.5.0
4.7.2
0.13.0
HDP 2.1
0.4.0
0.9.3
Flume
0.14.0
October
4.10.0
Sqoop
HDP 2.2
2014
1.7.0
0.14.0
Security
Hortonworks Data Platform 2.2
* version numbers are targets and subject to change at time of general availability in accordance with ASF release process
DEV & DATA TOOLS
OPERATIONAL TOOLS
HDInsight
Azure
SOURCES
DATA SYSTEM
APPLICATIONS
New!
Power BI
INFRASTRUCTURE
Open Source Data Management
Scalable
Linearly scale to store Petabytes of data
Reliable
• Distributed across “nodes”
• Natively redundant
• Single File System
Redundant storage protects against node failures
Flexible
Store all types of data, apply flexible schemas for analysis and sharing
• Cluster Resource Manager
• Built in Fault Tolerance
• High Cluster Utilization
Economical
Utilize cost-efficient commodity hardware
Achieve high cluster utilization
ResourceManager (Scheduler)
NodeManagers (12 shown), hosting:
Batch: map1.1, map1.2, reduce1.1
Interactive SQL: vertex1.1.1, vertex1.1.2, vertex1.2.1, vertex1.2.2
Real-Time: nimbus0, nimbus1, nimbus2
SCALE (storage & processing): Traditional Database, EDW, MPP Analytics, NoSQL, Hadoop Platform
Traditional Database | Hadoop Platform
schema: Required on write | Required on read
speed: Reads are fast | Writes are fast
governance: Standards and structured | Loosely structured
processing: Limited, no data processing | Processing coupled with data
data types: Structured | Multi and unstructured
best fit use: Interactive OLAP Analytics, Complex ACID Transactions, Operational Data Store | Data Discovery, Processing unstructured data, Massive Storage/Processing
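The "Required on write" versus "Required on read" contrast in the comparison above can be illustrated with a small Python sketch. Everything here is hypothetical and for illustration only: the `SCHEMA` dictionary, the function names, and the sample records are assumptions, not part of any Hadoop or database API.

```python
import json

# Hypothetical schema used by both approaches (an assumption for this sketch).
SCHEMA = {"id": int, "state": str}


def write_with_schema(store, record):
    """Schema on write: validate first, so bad records are rejected at write time."""
    for field, expected_type in SCHEMA.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"field {field!r} does not match the schema")
    store.append(record)


def read_with_schema(raw_lines):
    """Schema on read: raw text is stored as-is and interpreted only when read."""
    for line in raw_lines:
        record = json.loads(line)
        try:
            yield {field: cast(record[field]) for field, cast in SCHEMA.items()}
        except (KeyError, ValueError, TypeError):
            continue  # this reader skips records that do not fit its schema


table = []
write_with_schema(table, {"id": 1, "state": "CA"})

raw = ['{"id": "2", "state": "NY"}', '{"note": "no id here"}']
parsed = list(read_with_schema(raw))
```

In this sketch the write path does the checking up front (slower writes, simpler reads), while the read path accepts anything at write time and pays the interpretation cost per query.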
Hortonworks Data Platform (HDP) for Windows
Microsoft Azure HDInsight
Microsoft Analytics Platform System (APS)
All offerings co-engineered by Hortonworks and Microsoft
Enjoy seamless interoperability across on-premises and cloud
Data Operating System of
Hadoop
DATA ACCESS
Batch: Map Reduce
Script: Pig
SQL: Hive/Tez, HCatalog
NoSQL: HBase, Accumulo
Stream: Storm
Search: Solr
Others: In-Memory Analytics, ISV engines
YARN : Data Operating System
HDFS (Hadoop Distributed File System), nodes 1 … N
DATA MANAGEMENT
HORTONWORKS
DATA PLATFORM (HDP)
For Windows
Sqoop
• RPC
• REST (HTTP)
• C LibHDFS
Flume
Stinger Initiative
Custom Apps
Business Analytics
SQL
Apache Hive
Apache Tez
Apache MapReduce
Apache YARN
HDFS (Hadoop Distributed File System), nodes 1 … N
Apache Hive Contribution… an Open Community at its finest
1,672 Jira Tickets Closed
145 Developers
44 Companies
~390,000 Lines Of Code Added… (2x)
13 Months
Replaces MapReduce as primitive for Hive, Pig, etc
Task with pluggable Input, Processor and Output
Input
Processor
Output
Task
Tez Task - <Input, Processor, Output>
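The idea of a task assembled from a pluggable Input, Processor and Output can be sketched in a few lines of Python. This is only an illustration of the composition pattern named on the slide; the `Task` class and its field names are hypothetical and do not reproduce the actual Tez API, which is a Java library.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Task:
    """Hypothetical task built from three interchangeable parts."""
    read_input: Callable[[], Iterable[str]]               # Input
    process: Callable[[Iterable[str]], Iterable[str]]     # Processor
    write_output: Callable[[Iterable[str]], None]         # Output

    def run(self) -> None:
        # Records flow Input -> Processor -> Output.
        self.write_output(self.process(self.read_input()))


def uppercase(records: Iterable[str]) -> Iterable[str]:
    """Example processor: upper-case each record."""
    return (record.upper() for record in records)


sink: List[str] = []
task = Task(
    read_input=lambda: ["ca", "ny"],   # in-memory stand-in for a real input
    process=uppercase,
    write_output=sink.extend,          # in-memory stand-in for a real output
)
task.run()
```

Swapping any one of the three callables changes where data comes from, what is done to it, or where it goes, without touching the other two.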
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
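To see what this query shape returns, it can be tried locally with Python's built-in `sqlite3` module standing in for Hive. This is a minimal sketch under stated assumptions: the table layouts and sample rows are invented for illustration, the standard `AVG` aggregate is used, and an `ORDER BY` is added only to make the result order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables and rows; only the column names come from the query.
conn.executescript("""
    CREATE TABLE a (id INTEGER, itemId INTEGER, state TEXT);
    CREATE TABLE b (id INTEGER);
    CREATE TABLE c (itemId INTEGER, price REAL);
    INSERT INTO a VALUES (1, 10, 'CA'), (2, 11, 'CA'), (3, 10, 'NY');
    INSERT INTO b VALUES (1), (2), (3);
    INSERT INTO c VALUES (10, 4.0), (11, 6.0);
""")

rows = conn.execute("""
    SELECT a.state, COUNT(*), AVG(c.price)
    FROM a
    JOIN b ON (a.id = b.id)
    JOIN c ON (a.itemId = c.itemId)
    GROUP BY a.state
    ORDER BY a.state  -- added for a deterministic result order
""").fetchall()

for state, count, avg_price in rows:
    print(state, count, avg_price)
```

SQLite runs this in a single process, so it says nothing about how Hive plans the two joins and the aggregation across a cluster; it only shows the result the query computes.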
Tez avoids unneeded writes to HDFS
[Figure: Hive – MR and Hive – Tez execution plans for the query above, drawn as map (M) and reduce (R) stages. Stage labels: SELECT a.state, c.itemId; SELECT b.id; SELECT c.price; JOIN (a, c); JOIN (a, b); GROUP BY a.state; COUNT(*); AVG(c.price); HDFS.]
SQL Compliance
Hive provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop
Hive SQL Datatypes
INT
TINYINT/SMALLINT/BIGINT
BOOLEAN
FLOAT
DOUBLE
STRING
TIMESTAMP
BINARY
DECIMAL
ARRAY, MAP, STRUCT, UNION
DATE
VARCHAR
CHAR
Hive SQL Semantics
SELECT, INSERT
GROUP BY, ORDER BY, SORT BY
JOIN on explicit join key
Inner, outer, cross and semi joins
Sub-queries in FROM clause
ROLLUP and CUBE
UNION
Windowing Functions (OVER, RANK, etc)
Custom Java UDFs
Standard Aggregation (SUM, AVG, etc.)
Advanced UDFs (ngram, Xpath, URL)
Sub-queries for IN/NOT IN, HAVING
Expanded JOIN Syntax
INTERSECT / EXCEPT
Versions shown: Hive 0.11, Hive 0.12 (HDP 2.0), Hive 0.13 (HDP 2.1)
Apache Storm
NoSQL
HBase
YARN : Data Operating System
HDFS (Permanent Data Storage), nodes 1 … N
Apache Solr
MapReduce Indexing Job
HDFS (Hadoop Distributed File System)
SQL
Java
Scala
Others
HBase
Accumulo
Storm
Spark
Others
Solr
Others
Pig
Hive
Cascading
Engines
NoSQL
NoSQL
Stream
In-Memory
Engines
Search
ISV Engines
Tez
Tez
Tez
Tez
Slider
Slider
Slider
Kafka
Script
Slider
YARN: Data Operating System
(Cluster Resource Management)
HDFS (Hadoop Distributed File System)
1st Gen of Hadoop: Single Use System
Batch Apps
MapReduce (cluster resource management & data processing)
HDFS (redundant, reliable storage)
2nd Gen of Hadoop: Multi Use Data Platform
Batch, Interactive, Online, Streaming, …
Classic Hadoop Apps: Batch (MapReduce)
Flexible Data Processing: Hive, Pig, others… (Batch & Interactive, Tez)
Online Data Processing: HBase, Accumulo
Stream Processing: Storm
others …
Efficient Cluster Resource Management & Shared Services (YARN)
Redundant, Reliable Storage (HDFS)
Apache Falcon
Provides key governance framework for:
Define sophisticated Workflows and DLM Policies
Enable audit, compliance, and data re-processing
Staged Data: Retain 5 Years
Cleansed Data: Retain 3 Years
Conformed Data: Retain 3 Years
Presented Data: Retain Last Copy Only
Disaster Recovery and Backup between environments
Publishing data between environments for Discovery
Site to Site
Site to Cloud
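The retention periods listed for the data stages above can be read as simple per-stage rules. The Python sketch below shows one way such rules might be evaluated. It is not Falcon's configuration format or API: the function names are hypothetical, the pairing of each stage with a retention period follows the order in which they appear on the slide and is an assumption, and a year is approximated as 365 days.

```python
from datetime import date, timedelta

# Assumed pairing of stage to retention period, taken from the slide order.
RETENTION_YEARS = {"staged": 5, "cleansed": 3, "conformed": 3}


def expired(stage, created, today):
    """True if data created on `created` is past its stage's retention period."""
    limit = timedelta(days=365 * RETENTION_YEARS[stage])  # approximate years
    return created < today - limit


def keep_last_copy_only(copy_dates):
    """'Retain Last Copy Only': keep just the most recent copy."""
    return [max(copy_dates)] if copy_dates else []


today = date(2014, 10, 1)
staged_expired = expired("staged", date(2010, 1, 1), today)
cleansed_expired = expired("cleansed", date(2010, 1, 1), today)
presented = keep_last_copy_only([date(2014, 1, 1), date(2014, 6, 1)])
```

With these sample dates, data from January 2010 is still inside a five-year window but outside a three-year one, and only the June 2014 copy of the presented data is kept.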
HDP Advanced Security
Apache Knox
Enterprise Identity Provider
LDAP/AD
Browser
Firewall
Firewall
Identity Providers
HDP Cluster 1
Masters
NN
Web
HCat
JT
DN
DMZ
REST
Client
TT
YARN
HBase
Hive
Knox Gateway
GW
HDP Hadoop Cluster 2
JDBC
Client
Masters
NN
JT
DN
A stateless reverse proxy instance deployed in DMZ
- Requests streamed through GW to Hadoop services after auth.
- URLs rewritten to refer to gateway
Oozie
Hive
Web
HCat
Oozie
TT
HBase
YARN
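The gateway behaviour described on this slide, where URLs are rewritten to refer to the gateway, can be sketched with Python's standard `urllib.parse`. This is an illustration of the rewriting idea only, not Knox's implementation: the gateway address, the `/gateway/<cluster>` path layout, the internal host name, and the function name are all assumptions made for the example.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical public address of the gateway in the DMZ.
GATEWAY = "https://gateway.example.com:8443"


def rewrite_to_gateway(internal_url, cluster):
    """Replace an internal service address with the gateway's address.

    The original path and query string are kept, prefixed by an assumed
    /gateway/<cluster> path, so clients never see the internal host name.
    """
    internal = urlsplit(internal_url)
    gateway = urlsplit(GATEWAY)
    path = f"/gateway/{cluster}{internal.path}"
    return urlunsplit((gateway.scheme, gateway.netloc, path, internal.query, ""))


public_url = rewrite_to_gateway(
    "http://nn.internal:50070/webhdfs/v1/tmp?op=LISTSTATUS",  # made-up internal URL
    "cluster1",
)
print(public_url)
```

A real gateway would also authenticate the caller before forwarding the request, which this sketch leaves out.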
Ambari: Deploy, Manage, Monitor
AMBARI WEB
REST APIs
AMBARI SERVER
PROVISION | MANAGE | MONITOR
compute & storage
Ambari SCOM Server aggregates + exposes Hadoop metrics
Ambari SCOM Mgmt Pack
Ambari SCOM Server
Ambari SCOM monitors health + alerts in case of problems
HADOOP
Storage & Process at Scale
microsoft.com/sqlserver and Amazon Kindle Store
microsoftvirtualacademy.com
Azure Machine Learning, DocumentDB, and Stream Analytics
http://channel9.msdn.com/Events/TechEd
www.microsoft.com/learning
http://microsoft.com/technet
http://developer.microsoft.com