Polybase Index - Microsoft Gray Systems Lab

Indexing HDFS Data in PDW:
Splitting the data from index
Vinitha Gankidi#, Nikhil Teletia*, Jignesh M. Patel#,
Alan Halverson*, David J. DeWitt*
# University of Wisconsin-Madison
* Microsoft Jim Gray Systems Lab
1
Motivation
Data lives in two worlds
• Hot Data: RDBMS
  • Familiar SQL interface
  • Decades of research and optimization
• Cold Data: HDFS
  • Load first, schema later
  • Cheap and scalable data store
Hybrid SQL-On-Hadoop solutions (Microsoft PolyBase, Teradata QueryGrid, IBM Big SQL etc.)
[Figure: SQL → SQL Server PDW with Polybase → Result]
2
Query Execution over External Data
SELECT * FROM hdfs_Employee WHERE DeptID = 1
IMPORT PATH
1 Import HDFS files into PDW
2 Run the rest of the query inside PDW
• The HDFS files have to be entirely imported
PUSH-DOWN PATH
1 Run a Map job to filter
2 Import the result of the Map job into PDW
3 Run the rest of the query inside PDW
• Significant startup overhead for MAP task
• All the HDFS files are scanned entirely
[Figure: example hdfs_Employee data in HDFS]
ID   Name  DeptID
101  A     1
102  B     2
103  C     3
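The two paths above can be contrasted with a small sketch. This is a toy model, not PolyBase code: the row list, the function names, and the "rows moved" counter are all assumptions made for illustration. It only shows that both paths read every HDFS row, while the push-down path moves fewer rows into the RDBMS.

```python
# Toy stand-in for the rows of hdfs_Employee: (ID, Name, DeptID).
HDFS_ROWS = [(101, "A", 1), (102, "B", 2), (103, "C", 3)]

def import_path(hdfs_rows, predicate):
    """Import every row into the RDBMS, then filter there."""
    imported = list(hdfs_rows)                       # 1: import the HDFS files
    result = [r for r in imported if predicate(r)]   # 2: run the query in the RDBMS
    return result, len(imported)

def push_down_path(hdfs_rows, predicate):
    """Filter on the Hadoop side first, then import only the survivors."""
    filtered = [r for r in hdfs_rows if predicate(r)]  # 1: Map job, still scans every row
    imported = list(filtered)                          # 2: import the Map output
    return imported, len(imported)                     # 3: nothing left to do for this query

def in_dept_1(row):
    # Hypothetical predicate for WHERE DeptID = 1.
    return row[2] == 1

import_result, import_moved = import_path(HDFS_ROWS, in_dept_1)
push_result, push_moved = push_down_path(HDFS_ROWS, in_dept_1)
print(import_moved, push_moved)  # rows moved into the RDBMS: 3 1
```

The sketch leaves out the Map job startup overhead, which the slide names as the main cost of the push-down path.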
3
What is a Split-Index?
1. Index is stored in RDBMS, while the data is in HDFS
2. Index is stored as an RDBMS table
  • Hash-partitioned across multiple nodes
  • Each partition has a clustered B+ tree
Split-Index is similar to a materialized view (with an external pointer)
Split-Index can be out-of-sync with the data
[Figure: Split-Index stored in the RDBMS, pointing to records in HDFS]
Index (RDBMS):
DeptID  HDFS File Name  HDFS REC offset  Len
1       file1           0                10
Data (HDFS):
ID   Name  DeptID
101  A     1
102  B     2
103  C     3
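A minimal sketch of how such an index could be built, assuming newline-delimited, comma-separated records. The `HDFS` dictionary, `build_split_index`, and the use of a sorted list in place of a clustered B+ tree are illustrative assumptions, not the PDW implementation. Offsets and lengths come from the toy data, so they differ from the values on the slide.

```python
import bisect

# Hypothetical stand-in for HDFS: file name -> raw bytes of newline-delimited rows.
HDFS = {"file1": b"101,A,1\n102,B,2\n103,C,3\n"}

def build_split_index(files, key_col, num_partitions):
    """Return num_partitions sorted lists of (key, file, offset, length) entries."""
    partitions = [[] for _ in range(num_partitions)]
    for name, data in files.items():
        offset = 0
        for line in data.splitlines(keepends=True):
            key = int(line.decode().rstrip("\n").split(",")[key_col])
            entry = (key, name, offset, len(line))
            # Hash-partition on the key; keeping each partition sorted by key
            # stands in for the clustered B+ tree of a real partition.
            bisect.insort(partitions[hash(key) % num_partitions], entry)
            offset += len(line)
    return partitions

index = build_split_index(HDFS, key_col=2, num_partitions=2)
for node, entries in enumerate(index):
    print(node, entries)
```

Only the key and the record pointer are stored in the index; the record itself stays in the file.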
4
Query Execution using Split-Index
SELECT name FROM hdfs_Employee
WHERE DeptID = 1
1 Query the index in the RDBMS to find the qualifying tuples
SELECT [HDFS File Name], [HDFS Offset], [Rec Len]
FROM index_Employee
WHERE DeptID = 1
2 Retrieve qualifying tuples from HDFS files.
3 Return the result
Using the index, we can answer queries without having to sequentially scan each HDFS file.
[Figure: Index_Emp (Index on DeptID) in the RDBMS, the qualifying tuples, and the Employee data in HDFS]
Index_Emp (RDBMS):
DeptID  HDFS File Name  HDFS REC offset  Len
1       file1           0                10
2       file1           10               10
3       file1           20               10
Qualifying Tuples:
HDFS File Name  HDFS REC offset  Len
file1           0                10
file2           100              10
Data (HDFS):
ID   Name  DeptID
101  A     1
102  B     2
103  C     3
Result:
Name
A
…
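The lookup described on this slide can be sketched as follows. The in-memory `HDFS` dictionary, the `INDEX_EMPLOYEE` list, and the function names are assumptions for illustration; a real system would issue the index query in the RDBMS and read the byte ranges from HDFS. The offsets match the toy data rather than the slide.

```python
import io

# Hypothetical stand-in for HDFS: file name -> raw bytes of newline-delimited rows.
HDFS = {"file1": b"101,A,1\n102,B,2\n103,C,3\n"}

# Hypothetical index rows: (DeptID, HDFS file name, offset, record length).
INDEX_EMPLOYEE = [(1, "file1", 0, 8), (2, "file1", 8, 8), (3, "file1", 16, 8)]

def probe_index(index, dept_id):
    """Return the record pointers of the qualifying tuples."""
    return [(name, off, length) for key, name, off, length in index if key == dept_id]

def fetch_records(files, pointers):
    """Seek to each pointer and read only that record, with no sequential scan."""
    rows = []
    for name, offset, length in pointers:
        stream = io.BytesIO(files[name])
        stream.seek(offset)
        rows.append(stream.read(length).decode().rstrip("\n").split(","))
    return rows

def select_name_where_dept(dept_id):
    """Rough equivalent of SELECT name FROM hdfs_Employee WHERE DeptID = dept_id."""
    pointers = probe_index(INDEX_EMPLOYEE, dept_id)
    return [row[1] for row in fetch_records(HDFS, pointers)]

print(select_name_where_dept(1))  # ['A']
```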
5
Incremental Index Update
• Given the append-only property of the HDFS data, the index can be updated incrementally
• A new HDFS file is added
  • Append the rows of the new file to the existing index
• An HDFS file is deleted
  • Delete the rows of the deleted file from the existing index
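The two maintenance cases above can be sketched as below. The entry layout `(key, file, offset, length)` and the function names are hypothetical, and the index is a plain Python list rather than an RDBMS table.

```python
def index_file(name, data, key_col):
    """Build index entries (key, file, offset, length) for one file."""
    entries, offset = [], 0
    for line in data.splitlines(keepends=True):
        key = int(line.decode().rstrip("\n").split(",")[key_col])
        entries.append((key, name, offset, len(line)))
        offset += len(line)
    return entries

def on_file_added(index, name, data, key_col):
    # Only the new file is read; existing index rows are left untouched.
    index.extend(index_file(name, data, key_col))

def on_file_deleted(index, name):
    # Drop every index row that points into the deleted file.
    index[:] = [entry for entry in index if entry[1] != name]

index = []
on_file_added(index, "file1", b"101,A,1\n102,B,2\n", key_col=2)
on_file_added(index, "file2", b"104,D,1\n", key_col=2)
on_file_deleted(index, "file1")
print(index)  # [(1, 'file2', 0, 8)]
```

Because HDFS files are append-only, neither case requires rebuilding index rows for files that did not change.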
6
Hybrid Scan
• A stale Split-Index can still be used during query execution
• Examples:
  • An HDFS file is added
    • Scan the new file using the non-index approach
    • Process existing files using the index
  • An HDFS file is deleted
    • When probing the index, remove the rows associated with the deleted file
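A hybrid scan over a stale index might look like the sketch below. It assumes the system knows which files the index was built over (`indexed_files`) and which files exist now (`current_files`); those names, the entry layout, and the toy data are illustrative assumptions.

```python
def hybrid_scan(index, indexed_files, current_files, key_col, wanted):
    """Answer key == wanted using a possibly stale index plus a scan of new files."""
    results = []
    # Existing files: probe the index, skipping rows that point into deleted files.
    for key, name, offset, length in index:
        if key == wanted and name in current_files:
            record = current_files[name][offset:offset + length]
            results.append(record.decode().rstrip("\n"))
    # New files: not covered by the index, so scan them in full.
    for name in sorted(current_files.keys() - indexed_files):
        for line in current_files[name].decode().splitlines():
            if int(line.split(",")[key_col]) == wanted:
                results.append(line)
    return results

# The index was built over file1 and file2; since then file2 was deleted and file3 added.
stale_index = [(1, "file1", 0, 8), (2, "file1", 8, 8), (1, "file2", 0, 8)]
indexed = {"file1", "file2"}
current = {"file1": b"101,A,1\n102,B,2\n", "file3": b"105,E,1\n106,F,3\n"}

print(hybrid_scan(stale_index, indexed, current, key_col=2, wanted=1))
# ['101,A,1', '105,E,1']
```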
7
Experiments
8
Split-Index Performance
• Cluster
• 9 Node SQL Server PDW cluster (8 compute nodes + 1 control node)
• 29 Node Windows HDP 2.0 cluster (28 data nodes + 1 name node)
• Data Set
• 10 TB Scale Lineitem table
• Compare Push-Down approach with Split-Index approach
• Cost components
  • Push-Down Approach: Map Cost + Data Import Cost
  • Split-Index Approach: RID Materialize Cost + Data Import Cost
9
Split-Index Performance
SELECT * FROM lineitem WHERE l_orderkey <= [Variable]
Split-Index on l_orderkey
Data Size: ~800GB
Index Size: ~80GB
[Bar chart: Execution Time (in seconds) by Predicate Selectivity, at Rifle-Shot (1 tuple), 1% (600M tuples), and 10% (6B tuples), for the Push-Down Approach and the Split-Index Approach, with bars broken down into Map Cost, RID Materialize Cost, and Data Import Cost. Bar labels in the extracted text, not attributable to individual bars: 16209, 12248, 11694, 6563, 4, 617.]
Index performance is sensitive to the access pattern.
10
Space vs. Time Trade off
• Cost of storing the data in RDBMS is higher compared to HDFS
• Split-Index can be used as a covering index
• Quantify the performance and space trade-off as we move columns from HDFS to PDW
• Experiment Setup
  • 1 TB Scale Lineitem
  • Modified Query 6
SELECT SUM(l_extendedprice*l_discount) AS REVENUE
FROM lineitem
WHERE l_shipdate >= '1994-01-01'
AND l_shipdate < dateadd(mm, 1, cast('1994-01-01' as date))
AND l_discount BETWEEN .06 - 0.01 AND .06 + 0.01 AND l_quantity < 24
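To illustrate the covering-index idea, here is a sketch in which the index carries all four columns that the modified Query 6 needs, so the query is answered without reading HDFS at all. The sample rows, the tuple layout, and `modified_q6` are made up for illustration; `Decimal` is used so that the discount bounds compare exactly.

```python
from datetime import date
from decimal import Decimal

# Hypothetical covering Split-Index rows:
# (l_shipdate, l_discount, l_quantity, l_extendedprice, hdfs_file, offset, length)
COVERING_INDEX = [
    (date(1994, 1, 15), Decimal("0.06"), 10, Decimal("1000.00"), "file1", 0, 120),
    (date(1994, 1, 20), Decimal("0.05"), 23, Decimal("200.00"), "file1", 120, 120),
    (date(1994, 2, 1), Decimal("0.06"), 10, Decimal("500.00"), "file1", 240, 120),
    (date(1994, 1, 10), Decimal("0.06"), 24, Decimal("300.00"), "file2", 0, 120),
    (date(1994, 1, 5), Decimal("0.08"), 5, Decimal("400.00"), "file2", 120, 120),
]

def modified_q6(index_rows):
    """Compute REVENUE from the index columns alone; the HDFS pointers go unused."""
    ship_lo = date(1994, 1, 1)
    ship_hi = date(1994, 2, 1)  # dateadd(mm, 1, '1994-01-01')
    disc_lo = Decimal("0.06") - Decimal("0.01")
    disc_hi = Decimal("0.06") + Decimal("0.01")
    revenue = Decimal("0")
    for shipdate, discount, quantity, price, _file, _offset, _length in index_rows:
        if ship_lo <= shipdate < ship_hi and disc_lo <= discount <= disc_hi and quantity < 24:
            revenue += price * discount
    return revenue

revenue = modified_q6(COVERING_INDEX)
print(revenue)
```

The price of skipping HDFS is that every added column increases the index footprint in the RDBMS, which is the trade-off this experiment quantifies.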
11
Space vs. Time Trade off
Split-Index can be used to balance the query execution time and the PDW disk footprint
[Scatter plot: PDW Disk footprint (GB) vs. Execution time (in seconds). Points:
• The Lineitem table is in PDW. No index.
• The Lineitem table is in HDFS. Split-Index on l_shipdate, l_discount, l_quantity, l_extendedprice
• The Lineitem table is in HDFS. Split-Index on l_shipdate
• The Lineitem table is in HDFS. No index. (Push-Down)]
12
Conclusions and Future Work
• A simple “Split-Index” mechanism can be used to achieve low latency on highly selective queries, with minimal system changes
• Incremental index update reduces the cost of maintaining the Split-Index; Hybrid scan allows using a stale Split-Index
• Future Work: Query optimization to use the Split-Index, and an automatic physical schema designer for the Split-Index(es).
13