UW-Madison CS professor (1976-2008); Microsoft Technical Fellow (2008 - ?? ); manages the Jim Gray Systems Lab (graysystemslab.com)


Transcript

UW-Madison CS professor (1976-2008)
Microsoft Technical Fellow (2008 - ?? )
Manages the Jim Gray Systems Lab (graysystemslab.com)
what? why? how? what’s next
what?
• Customers need it for their businesses to ‘survive’ in the 21st century
• It relieves customers’ pain
• If we do it right, it will make customers feel ‘great’ too
“PolyBase is an ‘aspirin’, ‘air’ and ‘jewel’ in data management for Big Data”
- Anonymous @Microsoft
PolyBase
Provides a scalable, T-SQL compatible query processing framework for combining data from both universes.
[Diagram: SELECT queries submitted to SQL Server 16 return results drawn from both Windows Azure Blob Storage (WASB) and Hadoop (non-relational data)]
Allows SQL 16 customers to execute T-SQL queries against relational data in SQL Server and “semistructured” data in HDFS and/or Azure.
why?
• Increased number and variety of data sources that generate large quantities of data:
  • Sensors (e.g. location, speed, acceleration rates, acoustical, …)
  • Web 2.0 (e.g. twitter, wikis, …)
  • Web clicks
• Realization that data is “too valuable” to delete
• Dramatic decline in the cost of hardware, especially storage: if storage were still $100/GB, there would be no big data revolution underway
The Hadoop ecosystem:
• Machine Learning (Mahout)
• Query (Hive)
• Scripting (Pig)
• NoSQL Database (HBase)
• Distributed Processing (MapReduce)
• Distributed Storage (HDFS)
• Data Integration (Sqoop/REST/ODBC)
• Workflow & Scheduling (Oozie)
• Coordination (ZooKeeper)
• Management & Monitoring (Ambari)
HDFS: the underpinnings of the entire Hadoop ecosystem; highly fault-tolerant.
Parallel SQL systems for scalable distributed data warehousing on the HDFS file system: Hive, Impala, Spark/Shark, HAWQ.
Hadoop community world view: Big Data goes HERE, in HDFS. Append only – no updates!
Rest of the world’s view: relational data plus semi-structured data in HDFS. “Can I join you?”
Polybase Insight (“You sure CAN!”):
• Two universes of data:
  • Structured relational data
  • Semistructured data in HDFS for “big data”
• Polybase Goal: provide a scalable, T-SQL compatible query processing framework for combining data from both universes
• e.g., cleansing data before loading it
• e.g., joining relational tables w/ streams of tweets
• e.g., mining sensor data for predictive analytics
Example #1: Price policies (based on driver behavior)
• Non-relational: sensor data from cars (kept in Hadoop)
• Structured: customer data, relational (kept in SQL Server PDW/APS)
Example #2: Basket analysis of online shoppers (based on social media behavior)
• Non-relational: social media (kept in Hadoop)
• Structured: product data (kept in SQL Server PDW/APS)
Example #3: Drilling rig analysis
• Rig old sensor data: monitoring & functioning of rig (kept in Hadoop)
• Rig new (‘hot’) data: more recent data (kept in SQL Server PDW/APS)
PolyBase is now part of SQL 16 (CTP2)
[Timeline: 2012, 2013, 2014, 2015 (TODAY), 2016, and beyond]
what? why? how? what’s next
how?
PolyBase = SQL Server PDW V2 querying HDFS/Azure data, in-situ
• Leverages PDW’s parallel query execution framework
• Exploits PDW’s parallel query optimizer to selectively push computations on HDFS data as MapReduce jobs
• Data moves in parallel directly between Hadoop’s data nodes and PDW’s compute nodes
• Standard T-SQL query language; eliminates the need for writing MapReduce jobs
PDW architecture:
Control node (receives user queries via client connections: JDBC, OLEDB, ODBC, ADO.NET):
• Parses SQL
• Validates and authorizes
• Optimizes and builds the query plan
• Executes the parallel query
• Returns results to the client
Data Movement Service (DMS):
• Separate process on each node
• Shuffles intermediate tables among compute nodes during query execution
[Diagram: control node plus compute nodes, each compute node running a SQL Server instance over its DB, with DMS alongside]
Scalable SQL Server DW offering
• Highly competitive performance and cost
• Currently only available in appliance form factor
• But soon available as SQL DW Service in Azure
KEY COMPONENTS
One control node:
 Engine Service + DMS
 Compiles and controls execution of SQL queries in parallel
Multiple compute nodes:
 Each with a SQL Server instance + DMS
 Execute SQL queries in parallel
[Diagram: Engine Service + DMS on the control node; SQL Server DB + DMS on each compute node]
Hadoop cluster can be either:
• Hortonworks or Cloudera Hadoop distributions
• Windows cluster or Linux cluster
• On premise or in the cloud
Many popular file formats supported (text, RC, ORC, … )
[Diagram: Hadoop cluster with a Namenode (HDFS) and data nodes whose file systems hold Text, RCFile, ORCFile, and Parquet (future) format files]
Key Technical Challenges
1. Supporting arbitrary file formats in HDFS (e.g., Text, RC, ORC, Parquet, … )
2. Parallelizing data transfers between compute nodes and HDFS data nodes
3. Imposing structure on unstructured data in HDFS, using the external table concept
4. Exploiting the computational resources of Hadoop clusters
HDFS Bridge (DMS augmented w/ an HDFS Bridge):
• Hides the complexity of HDFS
• Uses Hadoop “RecordReaders/Writers” so standard HDFS file types can be read/written
• Used to transfer data in parallel to & from Hadoop
[Diagram: compute nodes (SQL Server + DMS) exchanging data in parallel with HDFS data nodes in the Hadoop cluster]
Key Technical Challenges
1. Supporting arbitrary file formats in HDFS (e.g., Text, RC, ORC, Parquet, … )
2. Parallelizing data transfers between compute nodes and HDFS data nodes
3. Imposing structure on unstructured data in HDFS, using the external table concept
4. Exploiting the computational resources of Hadoop clusters
[Diagram: PDW Engine Service, DBs, and DMS instances moving data in parallel between the compute nodes and the Hadoop cluster’s Namenode (HDFS) and data node file systems]
Key Technical Challenges
1. Supporting arbitrary file formats in HDFS (e.g., Text, RC, ORC, Parquet, … )
2. Parallelizing data transfers between compute nodes and HDFS data nodes
3. Imposing structure on unstructured data in HDFS, using the external table concept
4. Exploiting the computational resources of Hadoop clusters
CREATE EXTERNAL DATA SOURCE GSL_HDFS_CLUSTER
WITH (TYPE = HADOOP,
      LOCATION = 'hdfs://10.xxx.xx.xx:8020',
      JOB_TRACKER_LOCATION = '10.xxx.xx.xx:5020');

CREATE EXTERNAL FILE FORMAT TEXT_FORMAT
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec',
      FORMAT_OPTIONS (FIELD_TERMINATOR = '\t'));

CREATE EXTERNAL TABLE CUSTOMER
( c_custkey    bigint not null,
  c_name       varchar(25) not null,
  c_address    varchar(40) not null,
  c_nationkey  integer not null,
  …
)
WITH (LOCATION = '/tpch1gb/customer.tbl',  -- HDFS file path
      DATA_SOURCE = GSL_HDFS_CLUSTER,
      FILE_FORMAT = TEXT_FORMAT);
Selection on an external table (Customer) in HDFS. A possible execution plan:
1. CREATE temp table T (execute on compute nodes)
2. IMPORT FROM HDFS: HDFS Customer file read into T
3. EXECUTE QUERY: Select * from T where T.c_nationkey = 3 and T.c_acctbal < 0
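For concreteness, a minimal sketch of the kind of user query that would produce this plan, assuming the CUSTOMER external table defined above plus the standard TPC-H c_acctbal column (elided from the DDL):

-- Runs against the external table; PolyBase imports the HDFS file into a
-- temp table on the compute nodes, then evaluates the predicate there.
SELECT *
FROM CUSTOMER
WHERE c_nationkey = 3
  AND c_acctbal < 0;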
Key Technical Challenges
1. Supporting arbitrary file formats in HDFS (e.g., Text, RC, ORC, Parquet, … )
2. Parallelizing data transfers between compute nodes and HDFS data nodes
3. Imposing structure on unstructured data in HDFS, using the external table concept
4. Exploiting the computational resources of Hadoop clusters
Leverage the computational capabilities of the Hadoop cluster:
• Query is parsed; “external tables” stored on HDFS are identified
• Parallel query optimization is performed; statistics on HDFS tables are used in the standard fashion (see the sketch below)
• Query plan generator walks the optimized query plan, converting subtrees whose inputs are all HDFS files into a sequence of MapReduce jobs
• Engine Service submits the MapReduce jobs (as a JAR file) to the Hadoop cluster
• SQL operations on HDFS data are pushed into Hadoop as MapReduce jobs
• Cost-based decision on how much computation to push
[Diagram: SQL query → Parser → logical operator tree → Query Optimizer → Query Plan Generator → physical operator tree; Engine Service submits MapReduce jobs to Hadoop MapReduce over HDFS]
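Because those cost-based decisions rely on statistics, it helps to create them explicitly on external tables. A minimal sketch, assuming the CUSTOMER external table from the earlier DDL (the statistics name here is hypothetical):

-- Builds optimizer statistics by scanning the HDFS data;
-- the optimizer then uses them in the standard fashion.
CREATE STATISTICS customer_nationkey_stats
ON CUSTOMER (c_nationkey) WITH FULLSCAN;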
Selection and aggregate on an external table in HDFS (avg over Customer, with group by). Execution plan: what really happens here?
Step 1) QO compiles the predicate into Java and generates a MapReduce (MR) job
Step 2) QE submits the MR job to the Hadoop cluster: apply filter and compute aggregate on Customer; output left in hdfsTemp
hdfsTemp:
<US, $-975.21>
<UK, $-63.52>
<FRA, $-119.13>
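A plausible shape for the query behind this plan; again only a sketch assuming the TPC-H c_acctbal column (the <nation, average> pairs above suggest grouping by nation):

-- Both the filter and the aggregate can be evaluated inside Hadoop:
SELECT c_nationkey, AVG(c_acctbal) AS avg_acctbal
FROM CUSTOMER
WHERE c_acctbal < 0
GROUP BY c_nationkey;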
Selection and aggregate on an HDFS table (continued). Execution plan:
1. Run MR job on Hadoop: apply filter and compute aggregate on Customer; output left in hdfsTemp
2. CREATE temp table T (on compute nodes)
3. IMPORT hdfsTemp: read hdfsTemp into T
4. RETURN OPERATION: Select * from T
Notes:
1. Predicate and aggregate are pushed into the Hadoop cluster as a MapReduce job
2. The query optimizer makes a cost-based decision on what operators to push (hints for overriding it are sketched below)
hdfsTemp:
<US, $-975.21>
<UK, $-63.52>
<FRA, $-119.13>
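In SQL Server 16 the optimizer’s pushdown choice can also be overridden per query with the external pushdown hints; a brief sketch reusing the query shape from above:

-- Force (or forbid) evaluating the predicate/aggregate in Hadoop as an MR job:
SELECT c_nationkey, AVG(c_acctbal) AS avg_acctbal
FROM CUSTOMER
WHERE c_acctbal < 0
GROUP BY c_nationkey
OPTION (FORCE EXTERNALPUSHDOWN);  -- or: OPTION (DISABLE EXTERNALPUSHDOWN)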
PolyBase (in SQL Server PDW):
• Simplicity: query data in Hadoop and/or data in PDW via standard T-SQL
• Highest possible performance: parallelized data transfers between the PDW appliance and Hadoop clusters; push down of SQL operations to Hadoop
• Open: supports the most popular Hadoop distributions, for both Linux and Windows
• Full integration with Microsoft Office & BI: Excel’s PowerPivot, PowerView, Tableau, Cognos, SQL Server Reporting & Analysis Services
• HDFS Bridge in DMS: used to read/write files/directories in HDFS or Azure
• Engine Service: used to parse, optimize, & orchestrate parallel execution of queries over relational tables and HDFS data
• DMS: used to move data between compute nodes and Hadoop data nodes
• External table construct: used to “surface” records in external tables in HDFS or Azure files to SQL
• MapReduce job pushdown: used by the Engine Service to push computation to the Hadoop cluster
how?
What Does It Mean?
• Clusters of SQL Server 16 instances can be used to process data residing in HDFS in parallel
• Queries can arbitrarily mix relational data in SQL Server with data in HDFS or Azure (see the sketch below)
• Full T-SQL interface for HDFS-resident data (RTM)
• All standard BI tools supported
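To make the “arbitrarily mix” bullet concrete, a minimal sketch; the local Orders table and its columns are hypothetical, while CUSTOMER is the external table defined earlier:

-- One T-SQL query joining an ordinary SQL Server table with an
-- HDFS-resident external table:
SELECT o.OrderId, c.c_name
FROM dbo.Orders AS o   -- local relational table
JOIN CUSTOMER AS c     -- PolyBase external table over HDFS
  ON o.CustId = c.c_custkey
WHERE c.c_nationkey = 3;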
Step 1: Set up a Hadoop cluster (if you don’t have one already)
• Hortonworks or Cloudera distributions
• Hadoop 2.0 or above
• Linux or Windows
• On premise or in Azure
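Once SQL Server 16 is installed with the PolyBase feature, the instance must be told which distribution it will talk to. A minimal sketch using the ‘hadoop connectivity’ server configuration option; the value 7 is only an illustration, since the correct value depends on your distribution and version:

-- Select the expected Hadoop distribution, then restart the PolyBase services:
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
RECONFIGURE;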
• Head node: the SQL Server instance to which queries are submitted
• Compute nodes: used for scale-out query processing for data in HDFS or Azure
• The head node is actually also always a compute node
[Diagram: a PolyBase group in which every node runs the PolyBase DLLs and a DMS instance]
CREATE EXTERNAL DATA SOURCE GSL_HDFS_CLUSTER WITH (TYPE = HADOOP, …)
CREATE EXTERNAL DATA SOURCE AV_DFS_CLUSTER WITH (TYPE = HADOOP, …)
[Diagram: one external data source pointing at Hadoop Cluster #2, another at Azure Storage Volumes]
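For the Azure storage case, a hedged sketch of what such a data source could look like; the account and container names are placeholders, and secured storage would additionally need a database-scoped credential:

-- External data source over an Azure blob storage container:
CREATE EXTERNAL DATA SOURCE AV_AZURE_STORE
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://mycontainer@myaccount.blob.core.windows.net');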
• T-SQL queries are submitted to the head node
• Queries can only refer to tables on the head node and/or external tables
• Compute nodes are used for scale-out query processing on external tables in HDFS/Azure
• Tables on compute nodes cannot be referenced by queries submitted to the head node
• The number of compute nodes can be dynamically adjusted by the DBA (see the sketch after this list)
• Hadoop clusters can be shared between different SQL 16 clusters
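The deck doesn’t show how a node is attached to a head node; as an assumption, SQL Server 16 exposes sp_polybase_join_group / sp_polybase_leave_group for this, roughly as follows (machine and instance names are placeholders; 16450 is the default DMS control channel port):

-- Run on the prospective compute node to attach it to a head node:
EXEC sp_polybase_join_group 'HeadNodeMachine', 16450, 'MSSQLSERVER';
-- To detach it again:
-- EXEC sp_polybase_leave_group;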
Key Differences

                   APS/PDW                   Azure SQL DW         SQL Server 16
Scaleout Storage   Relational, HDFS, Azure   Relational & Azure   HDFS & Azure
Scaleout QP        Relational, HDFS, Azure   Relational & Azure   Relational, HDFS, Azure (CTP2)
Language Surface   T-SQL Subset              T-SQL Subset         Full T-SQL (RTM)
Delivery Vehicle   Appliance                 Cloud (PaaS)         “Box” & Cloud (IaaS)
Current limitations (CTP2):
1. Each SQL 16 instance can participate in a single PolyBase cluster – a limitation of the current DMS software
2. Multiple compute nodes are not “production ready”
3. No support for varchar(max) or unique column types
4. No scale out for joins between tables on the head node and external tables on HDFS; this should be fixed by CTP3 (for sure by RTM) – example on the next slide
Select C.Name from Orders O, Customers C
where C.id = O.CustId and O.price > $1000

Probable CTP2 query plan:
1) Use MR job to find Orders > $1000
2) Import result (tmp) into the SQL 16 instance on the head node
3) Perform the join on the head node

CTP3 plan likely to be:
1) Use MR job to find Orders > $1000
2) Import result into the SQL 16 instances on the compute nodes
3) Redistribute the Customers table from the head node to the compute nodes
4) Perform the join in parallel on the compute nodes
1. Will AlwaysOn (Hadron) be supported?
2. Can a compute node be used for other SQL workloads?
3. Can a compute node share a machine with a Hadoop data node?
4. What SQL Server editions will I need?
5. Will the MapR Hadoop distribution be supported?
6. Is there a limit on the number of compute nodes?
7. Why is it not possible to support different Hadoop distros simultaneously?
what? why? how? what’s next
what’s next
1. “Local” table scale-out query processing
2. “Official” support for multiple compute nodes
3. Enhanced performance mechanisms:
• ES and DMS running inside SQL Server instead of as separate processes
• Streaming of data from HDFS directly into the leaves of executing SQL query plans
• Native (i.e. C++) HDFS libraries for Text, ORC, and Parquet file formats
• Indexing of HDFS data
1. Simplicity - query data in Hadoop, Azure, and SQL Server via standard T-SQL
2. Highest possible performance - parallelized data transfers between SQL 16 instances and Hadoop data nodes; push down of SQL operations to Hadoop using MR jobs
3. Open - supports the most popular Hadoop distributions, for both Linux and Windows
4. Full integration with Microsoft Office & BI - Excel’s PowerPivot, PowerView, Tableau, Cognos, SQL Server Reporting & Analysis Services