0 to 60 with HDInsight


Transcript: 0 to 60 with HDInsight

What’s New in SQL Server 2014 since SQL Server 2005
Mission Critical Performance
PERFORMANCE & SCALE
In-Memory OLTP (see the T-SQL sketch after this list)
Enhanced In-Memory ColumnStore for DW
Support for 640 logical proc. & 4 TB memory
Support for 15,000 partitions
Resource Governor IO governance
Buffer Pool Extension to SSDs
Query optimization enhancements
SysPrep at cluster level
Predictable performance with tiering of compute, network, and storage with Windows Server 2012 R2
Data Compression with UCS-2 Unicode support
Backup Compression
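As a rough illustration of the In-Memory OLTP item above: in SQL Server 2014 a memory-optimized table is declared with the MEMORY_OPTIMIZED option. The table, columns, and bucket count below are invented for the sketch, and the database first needs a MEMORY_OPTIMIZED_DATA filegroup.

-- Minimal sketch (hypothetical table); requires a database with a MEMORY_OPTIMIZED_DATA filegroup.
CREATE TABLE dbo.SessionState
(
    SessionId   INT       NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    UserId      INT       NOT NULL,
    LastUpdated DATETIME2 NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);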
PROGRAMMABILITY
SQL Server Data Tools
Local DB runtime (Express)
Data-tier Application Component project template
Data-Tier Application Framework (DAC Fx)
Query optimization enhancements
Interoperability support (ADO.NET, ODBC, JDBC, PDO, ADO APIs and .NET, C/C++, Java, Linux, and PHP platforms)
HIGH AVAILABILITY
SQL Server AlwaysOn
Database Mirroring
Failover Clustering
Database Snapshots
Delayed Durability
Recovery Advisor
Windows Server Core
Live Migration
Online Operations
Cluster Shared Volume support, VHDX support (Windows Server 2012 R2)
Manage on-premises and cloud apps (System Center 2012 R2)
Data Support
FILESTREAM data type
FileTable built on FILESTREAM
Remote Blob Storage with SharePoint 2010
Spatial data support
Full Text Search for unstructured files
Statistical Semantic Search
Large user-defined data types
SECURITY
User-Defined Server Roles
Default Schema for Groups
SQL Server Audit
Transparent Data Encryption
Extensible Key Management
Standards-based Encryption
SQL Server Fine-grained Auditing
Enhanced separation of duty
CC certification at High Assurance Level
Backup encryption support
T-SQL enhancements
Enhanced support for ANSI SQL standards
Transact-SQL Static Code Analysis tools
Transact-SQL code snippets
Intellisense
Programmability Support
Support for LINQ and ADO.NET Entity Framework
CLR Integration and ADO.NET Object Services
MANAGEABILITY
Distributed Replay
Contained Database Authentication
System Center Management Pack for SQL Server 2012
Windows PowerShell 2.0 support
Multi-server Management with SQL Server Utility Control Point
Data-Tier Application Component
Policy-Based Management
SQL Server Performance Data Collector
Query enhancements
SMTP mail for secure DB email w/o Outlook
Faster Insights from Any Data
Platform for Hybrid Cloud
ACCESS ANY DATA
Power Query
Windows Azure HDInsight Service
Analytics Platform System (PDW V2)
Mash up data from different sources, such as Oracle & Hadoop
Analysis Services
Import PowerPivot models into Analysis Services
Enhancements on productivity, performance
Cube design tools, block computations, and writeback to MOLAP
HYBRID CLOUD SOLUTIONS
Backup to Windows Azure (see the T-SQL sketch after this list)
Cloud Disaster Recovery
Extend on-premises apps to the cloud
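A minimal sketch of the "Backup to Windows Azure" item above as it looks in SQL Server 2014 T-SQL; the database name, storage account, container, and credential are placeholders.

-- One-time: a credential holding the storage account name and access key (placeholder values).
CREATE CREDENTIAL AzureBackupCred
    WITH IDENTITY = 'mystorageaccount',
         SECRET   = '<storage account access key>';

-- Back up straight to Windows Azure Blob Storage over a URL.
BACKUP DATABASE AdventureWorks2014
    TO URL = 'https://mystorageaccount.blob.core.windows.net/backups/AdventureWorks2014.bak'
    WITH CREDENTIAL = 'AzureBackupCred', COMPRESSION, STATS = 10;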
INSIGHTS WITH FAMILIAR TOOLS
Power BI in Office 365
Power Map for Excel
Mobile interfaces for Power BI
Reporting Services
Power View
Configurable reporting alerts
Reporting as SharePoint Shared Service
Report Builder 3.0
Report Designer
Report Manager
COMPLETE BI SOLUTION
SQL Server BI Edition
StreamInsight
BI Semantic Model
SQL Server Data Tools
BI Development Studio
Microsoft Visual Studio-based report dev tools
Change Data Capture for Oracle
Data Quality Services
Build organizational knowledge base
Connect to 3rd-party data cleansing providers
EASY ON-RAMP TO THE CLOUD
New Windows Azure Deployment UI for SQL Server
Larger SQL Server VMs and memory sizes now available in Windows Azure
DAC enhancements: Import/export with Windows Azure SQL Database
COMPLETE AND CONSISTENT FROM ON-PREM TO CLOUD
SQL Server Data Tools
License Mobility (with SA)
Resource Governor enhancements
Snapshot backups to Windows Azure via SQL Server Management Studio
Master Data Services
Master Data Hub
Master Data Services Add-in for Microsoft Excel
Integration Services
Graphical tools in SSIS
Extensible object model
SSIS as a Server
Broader data integration with more sources: DB vendors, cloud, Hadoop
Pipeline improvements
Persistent lookups
High-performance connectors
Data profiling tool
Particle collider produces 1 petabyte of data a second… THAT IS BIG DATA.

[Slide: "… Has Changed": Cloud, Volume, Velocity, Variety, Relational Data.]
[Diagram: a Hadoop cluster consists of one Name Node (the master) and a set of worker Data Nodes; file blocks 1, 2, and 3 are each replicated across several of the five Data Nodes (Data Node 1 through Data Node 5).]
The first replica is written to the node creating the file; the second is written to a data node within the same rack; the third is written to a data node in a different rack.
Default replication factor = 3
Default block size = 64 MB
[Diagram: the NameNode owns the file system namespace and coordinates the DataNodes (heartbeat, balancing, replication, etc.); a Standby NameNode mirrors it; the DataNodes write block data to local disk.]
[Diagram: the MapReduce layer and the HDFS layer: hadoop-namenode runs the JobTracker (MapReduce layer) and the NameNode (HDFS layer); each worker node (hadoopdatanode1 through hadoopdatanode4) runs a TaskTracker and a DataNode.]
[Diagram: MapReduce data flow: a Mapper on each DataNode reads its <key, value> input and emits intermediate pairs such as <keyA, valuea>, <keyB, valueb>, <keyC, valuec>, …; the framework sorts and groups the pairs by key; each Reducer receives a <key, list(value, value, …)> pair and produces the output.]
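This map / sort-and-group-by-key / reduce pattern is exactly what Hive generates behind an ordinary aggregate query; the table and columns below are hypothetical.

-- Hive compiles this into a MapReduce job: the map phase emits (country, 1) pairs,
-- the shuffle sorts and groups them by country, and the reduce phase counts each group.
SELECT country, COUNT(*) AS visits
FROM   weblogs
GROUP  BY country;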
Hortonworks Data Platform (HDP)
• 100% open source and complete distribution
• Hadoop Core: distributed storage & processing
• Data services to store, process, and access data: Hive, Pig, HBase, HCatalog, Sqoop, Mahout, Python
• Operational services: Ambari, Oozie
• Platform services for enterprise readiness
• Ecosystem endorsed to ensure interoperability
• Roadmap for continued enhancements, including operational services
• Runs on the Windows OS, in cloud VMs, and as an appliance
(© Hortonworks Inc. 2013)
• Highly fault tolerant
• Distributed batch processing at the largest possible scale today
• Relatively easy to write distributed computations over very large amounts of data
• MR framework removes the burden of dealing with failures from the programmer
• Lots of tools in the ecosystem
• Open source / no licensing costs
• Schema on read vs. write (just put it there; see the HiveQL sketch after these lists)

• Schema embedded in application code
• A lack of shared schema makes sharing data between applications difficult
• Lacks DBMS strengths such as a cost-based optimizer, indexes, deep statistics, optimized hardware, advanced buffer pools, and consistency
• No declarative query language
• Large clusters require a lot of management
• High datacenter costs for large clusters
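A sketch of what "schema on read" looks like with Hive, assuming tab-delimited log files have already been dropped into an HDFS directory; the path and columns are made up. No schema is needed to land the data; a table definition is layered over the files only when someone wants to query them.

-- The files were loaded as-is; structure is declared only at query time.
CREATE EXTERNAL TABLE weblogs (
    log_time  STRING,
    user_id   STRING,
    url       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/weblogs/';   -- directory of raw files already sitting in HDFS

SELECT user_id, url FROM weblogs LIMIT 10;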
• Background
• Research done by the Gray Systems Lab, led by Technical Fellow David DeWitt
• High-level goals
• Seamless integration with Hadoop via regular T-SQL (sketch below)
• Enhancing the PDW query engine to process data coming from HDFS
• Fully parallelized query processing, bi-directional import and export between PDW and HDFS
• Integration with various Hadoop implementations
• Hadoop on Windows Server, Hortonworks, and Cloudera
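A rough sketch of the "regular T-SQL" experience PolyBase aims for: declare structure over files in HDFS, then query and join them like any table. Every object name, path, and option here is illustrative rather than exact PDW syntax, which differs slightly across releases.

-- Put "structure" on "unstructured" HDFS data (illustrative names and options).
CREATE EXTERNAL TABLE dbo.ClickStream (
    user_id    INT,
    url        VARCHAR(500),
    click_time DATETIME2
)
WITH (
    LOCATION    = '/data/clickstream/',   -- HDFS directory
    DATA_SOURCE = MyHadoopCluster,        -- points at the Hadoop Name Node
    FILE_FORMAT = TabDelimitedText
);

-- Query and join HDFS data with relational tables, entirely in T-SQL.
SELECT c.CustomerId, COUNT(*) AS clicks
FROM   dbo.Customers   AS c
JOIN   dbo.ClickStream AS s ON s.user_id = c.CustomerId
GROUP  BY c.CustomerId;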
Parallel data transfers
[Diagram: a PDW appliance (Control Node plus Compute Nodes) connected to a Hadoop cluster (Name Node plus Data Nodes), with data transferred between the two in parallel.]
• Direct parallel data access between PDW Compute Nodes and Hadoop Data Nodes
• Introducing “structure” on the “unstructured” data
[Diagram: two query modes over HDFS and the PDW database: "SQL in, results out" and "SQL in, results stored in HDFS".]
• Cost-based decision on how much data needs to be pushed to PDW (statistics kept on the PDW Control Node)
• SQL operations on HDFS data pushed into Hadoop as MapReduce jobs (sketch below)
[Diagram: a SQL query submitted to PDW generates a Map job; Hadoop runs it as MapReduce over the data in HDFS, and the results flow back into the PDW database and out to the client.]
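Continuing the illustrative names from the earlier sketch, the two bullets above might look like this: a selective predicate PolyBase can choose to evaluate in Hadoop as a MapReduce job, and a query whose results are written back out to HDFS.

-- The WHERE clause is a candidate for pushdown: it can run in Hadoop as a MapReduce job
-- so that only qualifying rows are transferred to PDW (a cost-based decision).
SELECT user_id, url
FROM   dbo.ClickStream
WHERE  click_time >= '2014-01-01';

-- "SQL in, results stored in HDFS": materialize query results back into Hadoop.
CREATE EXTERNAL TABLE dbo.ClickSummary
WITH (
    LOCATION    = '/data/clicksummary/',
    DATA_SOURCE = MyHadoopCluster,
    FILE_FORMAT = TabDelimitedText
)
AS
SELECT url, COUNT(*) AS clicks
FROM   dbo.ClickStream
GROUP  BY url;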
• Store first, come up with the questions later
• Easily track gameplay quality, user behavior, etc., by time
• Short timeline, small team, not Hadoop experts
• Results in common tools like Excel
• Start at a scale of 100s of TB
A New Approach to Customer Connection
[Diagram: events arrive through a REST endpoint into Blob Storage (the landing zone), which is optimized for write throughput: raw/binary format, many small blobs, data kept until curated. A Curator step moves the data into persistent storage optimized for query efficiency: small blobs combined into optimized sizes, PII scrubbed to meet data governance requirements, data cleansed/masked, aggregated for efficient storage, partitioned, and kept as well-defined, semi-structured data. Use-case-specific and general processing then runs against it: Hadoop for long-term storage, Azure Queues & Workers for in-memory work (Azure Blob Storage if persisted), with Sqoop moving data between stores and results published to real-time consumers, self-service reporting, analytics, and the DW.]
Architecture – Use Cloud Building Blocks
HDInsight clusters (Hive, Pig, etc.), provided by the Windows Azure HDInsight Service as "on demand, dedicated virtual clusters".
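One way to read the architecture above in code, with every account, container, path, and column invented for illustration: Hive on HDInsight can define tables directly over Azure Blob Storage (wasb://), so the raw landing zone can be curated into a partitioned layout that is cheaper to query.

-- Raw events exactly as they landed in the Blob Storage landing zone.
CREATE EXTERNAL TABLE raw_events (event_json STRING)
LOCATION 'wasb://[email protected]/raw/';

-- Curated copy: parsed and partitioned by day for query efficiency.
CREATE TABLE curated_events (user_id STRING, event_type STRING)
PARTITIONED BY (event_date STRING);

-- One curation pass over a day's worth of raw data.
INSERT OVERWRITE TABLE curated_events PARTITION (event_date = '2014-06-01')
SELECT get_json_object(event_json, '$.user_id'),
       get_json_object(event_json, '$.type')
FROM   raw_events;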




• http://azure.microsoft.com/en-us/pricing/member-offers/msdn-benefits/ ($150/mo credits with MSDN)
• http://azure.microsoft.com/en-us/documentation/articles/hdinsight-getstarted/
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual
• www.pluralsight.com
[email protected]