Danny Tambs Solution Architect VOLUME (Size) VARIETY (Structure) VELOCITY (Speed) ODBC Distributed Processing (Map Reduce) Distributed Storage (HDFS) World’s Data (Azure Data Marketplace) Windows Azure Storage.

Transcript Danny Tambs Solution Architect VOLUME (Size) VARIETY (Structure) VELOCITY (Speed) ODBC Distributed Processing (Map Reduce) Distributed Storage (HDFS) World’s Data (Azure Data Marketplace) Windows Azure Storage.

Danny Tambs
Solution Architect
VOLUME
(Size)
VARIETY
(Structure)
VELOCITY
(Speed)
ODBC
Distributed Processing
(Map Reduce)
Distributed Storage
(HDFS)
World’s Data (Azure Data
Marketplace)
Windows Azure Storage
PDW 2012 (V2)
10X Faster & Lower Capital Cost
Control Node
Mgmt. Node
LZ
Backup Node
Infiniband
& Ethernet
Infiniband & Ethernet
Fiber Channel
Per RACK details
• 160 cores on 10 compute nodes
• 1.28 TB of RAM on compute
• Up to 30 TB of temp DB
• Up to 150 TB of user data
Per RACK Details
• 128 cores on 8 compute nodes
• 2TB of RAM on compute
• Up to 168 TB of temp DB
• Up to 1PB of user data
Built in block fashion (Scale Units) to support easy scale-out. From
¼ Rack system to 7 Racks.
Host 0
One standard node type. 256GB Ram per node.
Host 1 (Redundant)
Using SQL Server 2012 underneath.
Host 2
Infiniband Connectivity between nodes.
JBOD
IB &
Ethernet
Host 3
Direct attached SAS
Moving from SAN to JBODs
Significant reduction in costs
Leverage Windows Server 2012 technologies to achieve the
same level of reliability and robustness.
Scale unit concept
Capacity scale unit: adding 2/3 compute nodes and related
storage
Spare scale unit
Base scale unit: min populated rack w/ networking
HP
Quarter-rack
Half
Three-quarters
Full rack
One-&-quarter
One-&-half
Two racks
Two and a half
Three racks
Four racks
Five racks
Six racks
Base
1
1
1
1
2
2
2
3
3
4
5
6
Compute
2
4
6
8
10
12
16
20
24
32
40
48
Spare
1
1
1
1
2
2
2
3
3
4
5
6
Total
4
6
8
10
13
15
19
24
28
37
46
55
Raw disk: 1TB
15.1
30.2
45.3
60.4
75.5
90.6
120.8
151
181.2
241.6
302
362.4
Raw disk: 3TB
45.3
90.6
135.9
181.2
226.5
271.8
362.4
453
543.6
724.8
906
1087.2
Capacity
227 TB
453 TB
680 TB
906 TB
1133 TB
1359 TB
1812 TB
2265 TB
2718 TB
3624 TB
4530 TB
5436 TB
Seven racks
7
56
7
64
422.8
1268.4
6342 TB
DELL
Quarter-rack
2 thirds
Full rack
One and third
One and 2 third
2 racks
2 and a third
2 and 2 thirds
Three racks
Four racks
Five racks
Base
1
1
1
2
2
2
3
3
3
4
5
Compute
3
6
9
12
15
18
21
24
27
36
45
Spare
1
1
1
2
2
2
3
3
3
4
5
Total
5
8
11
15
18
21
25
28
31
41
51
Raw disk: 1TB
22.65
45.3
67.95
90.6
113.25
135.9
158.55
181.2
203.85
271.8
339.75
Raw disk: 3TB
67.95
135.9
203.85
271.8
339.75
407.7
475.65
543.6
611.55
815.4
1019.25
Capacity
340 TB
680 TB
1019 TB
1359 TB
1699 TB
2039 TB
2378 TB
2718 TB
3058 TB
4077 TB
5096 TB
Six racks
6
54
6
61
407.7
1223.1
6116 TB
2 to 56 nodes
Up to 6 PB user
data
2 (HP) or 3 (DELL)
node increments
for small
topologies
Plan
Injection
DSQL
plan
‘Optimizable query’
DMS op
(e.g. SELECT)
…
Control Node
[Shell DB]
Compute
Node 1
Compute
Node 2
Compute
Node n
Table Distribution Definition
Compute Nodes - VHDX
SMP System
PDW
DD
SD
Date Dim
Item Dim
Date Dim ID
Calendar Year
Calendar Qtr
Calendar Mo
Calendar Day
Prod
Prod
Prod
Prod
Dim ID
Category
Sub Cat
Desc
DD
SD
SF
1
SF
2
ID
PD
ID
PD
Sales Fact
DD
Date Dim ID
Store Dim ID
Prod Dim ID
Mktg Camp Id
Qty Sold
Dollars Sold
SD
DD
Store Dim
Store
Store
Store
Store
Dim ID
Name
Mgr
Size
Promo Dim
Mktg Camp ID
Camp Name
Camp Mgr
Camp Start
Camp End
SD
DD
SD
SF
3
SF
4
SF
5
ID
PD
ID
PD
ID
PD
CREATE TABLE myTable
(column Defs)
WITH ( DISTRIBUTION = HASH (id));
PDW Node 1
Create Table <table GUID>_a
Create Table <table GUID>_b
…
Create Table <table GUID>_h
8 Tables per Node
PDW Node 2
Create Table <table GUID>_a
Create Table <table GUID>_b
…
Create Table <table GUID>_h
PDW Node …
PDW Node 8
Create Table <table GUID>_a
Create Table <table GUID>_b
…
Create Table <table GUID>_h
• High-level goals for V2
Data Scientists
BI Users
DB Admins
Social
Apps
Sensor
& RFID
Mobile
Apps
Web
Apps
Hadoop
Non-relational data
Regular
T-SQL
Results
Traditional schemabased DW applications
External Table
Enhanced
PDW query
engine
PDW V2
Relational data
• Parallel Data Transfers
PDW Appliance
Hadoop Cluster
CREATE EXTERNAL TABLE table_name ({<column_definition>} [,...n ])
{WITH (LOCATION =‘<URI>’,[FORMAT_OPTIONS = (<VALUES>)])}
[;]
1.
Indicates
‘External’ Table
2.
Required location of
Hadoop cluster and file
3.
Optional Format Options
associated with data import
from HDFS
<Format Options> :: = [,FIELD_TERMINATOR= ‘Value’], [,STRING_DELIMITER = ‘Value’], [,DATE_FORMAT =
‘Value’], [,REJECT_TYPE = ‘Value’], [,REJECT_VALUE = ‘Value’] [,REJECT_SAMPLE_VALUE = ‘Value’,],
[USE_TYPE_DEFAULT = ‘Value’]
• FIELD_TERMINATOR
 to indicate a column delimiter
• STRING_DELIMITER
 to specify the delimiter for string data type fields
• DATE_FORMAT
 for specifying a particular date format
• REJECT_TYPE
 for specifying the type of rejection, either value or percentage
• REJECT_SAMPLE_VALUE
 for specifying the sample set – for reject type percentage
• REJECT_VALUE
 for specifying a particular value/threshold for rejected rows
• USE_TYPE_DEFAULT
 for specifying how missing entries in text files are treated
• Concept of External Tables
• Direct and parallelized HDFS access
• Enhancing PDW’s Data Movement Service (DMS) to allow direct communication
between HDFS data nodes and PDW compute nodes
Social
Apps
Sensor
& RFID
Mobile
Apps
Web
Apps
Regular
T-SQL
Results
External Table
Traditional schemabased DW applications
Enhanced
PDW query
engine
HDFS bridge
HDFS data nodes
Non-Relational data
PDW V2
Relational data
‘design time’
CREATE
EXTERNAL TABLE
Statement
Tabular view on
hdfs://../employee.tbl
Parsing of
format options
File binding & split
generation
HDFS
bridge
process
part of DMS process
Parser
process
Hadoop
Name Node
maintains metadata (file
location, file size …)
part of ‘regular’ T-SQL
parsing process
SHELL-only
plan
External Table
Shell Object
No actual physical tables
on compute nodes
CREATE EXTERNAL
TABLE
…
Hadoop Name Node
Control Node
[Shell DB]
Compute
Node 1
Compute
Node 2
Compute
Node n
Querying non-relational data in HDFS via T-SQL
I.
II.
Query data in HDFS and display results in table form (via external tables)
Join data from HDFS with relational PDW data
Running Example – Creating external table ‘ClickStream’:
CREATE EXTERNAL TABLE ClickStream(url varchar(50), event_date date, user_IP
varchar(50)), WITH (LOCATION =‘hdfs://MyHadoop:8020/tpch1GB/employee.tbl’,
FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));
Query Examples
1.
2.
SELECT top 10 (url) FROM ClickStream where user_IP =
‘192.168.0.1’
SELECT url.description FROM ClickStream cs, Url_Descr* url
WHERE cs.url = url.name and cs.url=’www.cars.com’;
3.
SELECT user_name FROM ClickStream cs, User* u WHERE
cs.user_IP = u.user_IP and cs.url=’www.microsoft.com’;
Text file in HDFS with | as field
delimiter
Filter query against data in HDFS
Join data from various files in HDFS
(*Url_Descr is a second text file)
Join data from HDFS with data in PDW
(*User is a distributed PDW table)
Social
Apps
Sensor
& RFID
Mobile
Apps
Web
Apps
Parallel
HDFS Reads
HDFS data nodes
Non-Relational data
SELECT
Results
External Table
Enhanced
PDW query
engine
Traditional schema-based
DW applications
Parallel
Importing
HDFS bridge
DMS
DMS
Reader Reader
1
… N
PDW V2
Relational data
DSQL plan with
external DMS move
SELECT FROM
EXTERNAL TABLE
Plan Injection
External Table
Shell Object
Control Node
[Shell DB]
…
Compute
Node 1
HFDS
Readers
Compute
Node n
HFDS
Readers
…
Hadoop
Data Node 1
Hadoop
Data Node n
• Parallel Import of HDFS data & Export into HDFS
Example
CREATE TABLE ClickStream_PDW WITH DISTRIBUTION = HASH(url)
AS SELECT url, event_date, user_IP FROM ClickStream
Social
Apps
Sensor
& RFID
Mobile
Apps
Web
Apps
Parallel
HDFS Reads
CTAS
Results
External Table
Enhanced
PDW query
engine
Traditional schema-based
DW applications
Parallel
Importing
HDFS bridge
HDFS data nodes
Non-Relational data
DMS
DMS
Reader Reader
1 … N
PDW V2
Relational data
Retrieval of data in
HDFS ‘on-the-fly’
parallelized
Example
CREATE EXTERNAL TABLE ClickStream WITH(LOCATION
=‘hdfs://MyHadoop:5000/users/outputDir’,FORMAT_OPTIONS (FIELD_TERMINATOR = '|'))
AS SELECT url, event_date, user_IP FROM ClickStream_PDW
Retrieval of PDW data
Social
Apps
Sensor
& RFID
Mobile
Apps
Web
Apps
Parallel
HDFS Writes
HDFS data nodes
Non-relational data
CETAS
Results
External Table
Enhanced
PDW query
engine
HDFS bridge
HDFS HDFS
Writer Writer
1 … N
Traditional schema-based
DW applications
Parallel
Exporting
PDW V2
Relational data
Example
CREATE EXTERNAL TABLE ClickStream WITH (LOCATION
=‘hdfs://MyHadoop:5000/users/outputDir’, FORMAT_OPTIONS (FIELD_TERMINATOR =
'|')) AS SELECT url, event_date,user_IP FROM ClickStream_PDW
1.
PDW table (can be either distributed or replicated)
2. Output directory in HDFS
3.
Example
New external table created
with results of the join
CREATE EXTERNAL TABLE ClickStream_UserAnalytics WITH (LOCATION
=‘hdfs://MyHadoop:5000/users/outputDir’, FORMAT_OPTIONS (FIELD_TERMINATOR = '|'))
AS SELECT user_name, user_location, event_date, user_IP FROM ClickStream c,
User_PDW u where c.user_id = u.user_ID
2.
PDW data
2.
Joining incoming data from
HDFS with PDW data
1.
External table referring to
data in HDFS
• Configuration & Prerequisites for enabling Polybase
Java RunTime Environment
sp_configure & Reconfigure
exec
exec
exec
exec
sp_configure
sp_configure
sp_configure
sp_configure
‘Hadoop
‘Hadoop
‘Hadoop
‘Hadoop
connectivity,
connectivity,
connectivity,
connectivity,
1’
2’
3’
0’
-----
connectivity to HDP 1.1 on Wind. Server
connectivity to HDP 1.1 on Linux
connectivity to CHD 4.0 on Linux
disabling Polybase (default)
http://channel9.msdn.com/Events/TechEd/Australia/2013
http://www.microsoftvirtualacademy.com/
http://technet.microsoft.com/en-au/
http://msdn.microsoft.com/en-au/
http://technet.microsoft.com/en-us/evalcenter/dn205290.aspx
www.windowsazure.com
http://www.windowsazure.com/en-us/documentation/services/hdinsight/?fb=en-us
www.powerbi.com

Danny Tambs Solution Architect VOLUME (Size) VARIETY (Structure) VELOCITY (Speed) ODBC Distributed Processing (Map Reduce) Distributed Storage (HDFS) World’s Data (Azure Data Marketplace) Windows Azure Storage.

Transcript Danny Tambs Solution Architect VOLUME (Size) VARIETY (Structure) VELOCITY (Speed) ODBC Distributed Processing (Map Reduce) Distributed Storage (HDFS) World’s Data (Azure Data Marketplace) Windows Azure Storage.

Directory