Multi-Core Processors for In-Memory Databases

SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units
Thomas Willhalm and Nicolae Popovici, Intel GmbH
Yazan Boshmaf, SAP AG
Hasso Plattner, Alexander Zeier, Jan Schaffner
Hasso-Plattner-Institute, University of Potsdam
VLDB 2009
August 25, 2009
Agenda
• Column-Store and Light-Weight Database Compression
• Single Instruction Multiple Data (SIMD)
• Using SIMD for Decompression
• Using SIMD for Predicate Handling
Acknowledgement: We would like to thank Franz Färber, Günter
Radestock, Tobias Mindnich, and Christoph Weyerhäuser from SAP for
the fruitful discussion and the tremendous help in integrating and
testing the SIMD routines.
SIMD-Scan
VLDB 2009
2
Column-Store
• Column-oriented
• Columns compressed independently
• Completely in main memory
[Figure: a table with columns A, B, C and D; each row is split by column, so the data is stored in memory as columns.]
• For each query, do a full-table scan, i.e.
− Decompress required columns
− Aggregate data according to predicate
− Further processing
SAP* NetWeaver Business Warehouse Accelerator (BWA)
• Processing highly parallelized across multiple cores and blades
• Shared-nothing approach
• Public demo available at http://microfinance.sap.com/
Light-Weight Database Compression
Sales Table
ID    Amount
1     12.23
2     1.02
3     132.13
…     …

Tables for storing “Amount” attribute

Attribute Table
DocId    ValueId
1        42
2        4
3        100000
4        128
5        31455

Dictionary
ValueId    Value
1          0.02
2          0.10
3          1.00
…          …
100000     132.13
• Our focus: loading the ValueId column of the Attribute Table
• From “Dictionary”, those values are (0, 1, 2, 3, …, 100000)
• Max is 100000, which needs 17 bits to represent (100000 ≤ 2^17 − 1)
• Idea: instead of 32 bits, use 17 bits to store each ValueId
• Accessing “Value” needs decompression into 32 bits
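The bit-width arithmetic above can be checked with a few lines of C. This is a minimal sketch; the function name bits_needed is an illustrative assumption and does not come from the slides.

```c
#include <stdint.h>

/* Number of bits needed to represent every value in 0..max_value.
   Returns 1 for max_value == 0 so that a field is never empty. */
unsigned bits_needed(uint32_t max_value)
{
    unsigned bits = 1;
    while (max_value >>= 1) {
        ++bits;
    }
    return bits;
}
```

For the dictionary above, bits_needed(100000) gives 17, so each ValueId takes 17 bits instead of 32.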
Integers are compressed as packed bit-fields
Example: packed 17-bit fields
[Figure: a 128-bit segment (bytes F..0) holds the packed 17-bit fields 42, 2, 2702, 1772, 65536, 110300, ... starting at byte 0. DECOMPRESS expands each 17-bit field into a 32-bit value (42, 2, 2702, 1772, …), and each value is used as an index into the Dictionary (example entries: 0.02, 1.23, 2.73, 2.55, 3.14).]
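As a scalar reference for the layout on this slide, the sketch below packs and unpacks fields of arbitrary width one bit at a time. It assumes that fields are stored back to back with the least significant bit first, which matches byte 0 holding the value 42 in the figure; the names pack_field and unpack_field are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Write `value` as the index-th `width`-bit field of a bit-packed buffer.
   Fields are stored back to back, least significant bit first (assumed).
   The buffer must be zero-initialised and large enough; width <= 32. */
void pack_field(uint8_t *buf, unsigned width, size_t index, uint32_t value)
{
    size_t bit = index * width;
    for (unsigned i = 0; i < width; ++i, ++bit) {
        if ((value >> i) & 1u) {
            buf[bit / 8] |= (uint8_t)(1u << (bit % 8));
        }
    }
}

/* Read the index-th `width`-bit field back into a full 32-bit integer. */
uint32_t unpack_field(const uint8_t *buf, unsigned width, size_t index)
{
    size_t bit = index * width;
    uint32_t value = 0;
    for (unsigned i = 0; i < width; ++i, ++bit) {
        uint32_t b = (uint32_t)((buf[bit / 8] >> (bit % 8)) & 1u);
        value |= b << i;
    }
    return value;
}
```

This bit-by-bit loop is only a correctness baseline; the SIMD routines in the following slides process several fields per instruction.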
Using SIMD for Full-Table Scans
Single Instruction Multiple Data (SIMD)
• Scalar processing
− traditional mode
− one instruction produces one result
• SIMD processing
− with Intel® SSE(2,3,4)
− one instruction produces multiple results
[Figure: a scalar OP combines the sources X and Y into the destination X op Y; an SSE/2/3 OP combines the 128-bit sources (bits 127..0) X4 X3 X2 X1 and Y4 Y3 Y2 Y1 into the destination X4opY4 X3opY3 X2opY2 X1opY1.]
Single Instruction Multiple Data (SIMD)
SSE Operation
• 128-bit wide with Intel® SSE(2,3,4)
− 2 64-bit integer ops/cycle
− 4 32-bit integer ops/cycle
− 8 16-bit integer ops/cycle
− 16 8-bit integer ops/cycle
• 256-bit with AVX (Sandy Bridge)
• 512-bit with Larrabee
[Figure: within clock cycle 1, an SSE2 OP combines the 128-bit sources (bits 127..0) X4 X3 X2 X1 and Y4 Y3 Y2 Y1 into the destination X4opY4 X3opY3 X2opY2 X1opY1.]
Vector-processing units are built into standard processors
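The case of four 32-bit integer operations per instruction can be illustrated with SSE2 intrinsics. This is a small sketch that assumes an x86 compiler with SSE2 (the baseline on x86-64); the function name add4_epi32 is made up for the example.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* out[i] = x[i] + y[i] for four 32-bit lanes with a single SIMD add. */
void add4_epi32(const int32_t x[4], const int32_t y[4], int32_t out[4])
{
    __m128i vx = _mm_loadu_si128((const __m128i *)x);
    __m128i vy = _mm_loadu_si128((const __m128i *)y);
    __m128i sum = _mm_add_epi32(vx, vy);      /* four additions at once */
    _mm_storeu_si128((__m128i *)out, sum);
}
```

The scalar equivalent would need four separate add instructions for the same result.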
Using SIMD for Decompression
DECOMPRESS unaligned bit fields
Example: packed 17-bit fields
1. Load a pre-fetched 128-bit segment of input data into an SSE register.
   Bytes F..0: 110300 | ... | 65536 | 1772 | 2702 | 2 | 42
2. Copy compressed values to target DWORDs (32-bit segments) with _mm_shuffle_epi8.
   DWORDs: 1772 | 2702 | 2 | 42
3. Align the values from unequally shifted DWORDs.
   DWORDs: 1772 | 2702 | 2 | 42
4. Store uncompressed values.
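The four steps can be modelled lane by lane in plain C. The sketch below is a scalar model, not the SSE routine from the talk: the inner byte loop stands in for _mm_shuffle_epi8, and the per-lane shift stands in for the alignment step. It assumes least-significant-bit-first packing and a segment whose first field starts at bit 0; the name decompress4_17 is hypothetical.

```c
#include <stdint.h>

enum { WIDTH = 17 };  /* bit width of one packed field */

/* Scalar model of the four steps for one group of four 17-bit fields.
   `in` points at 16 input bytes whose first field starts at bit 0. */
void decompress4_17(const uint8_t in[16], uint32_t out[4])
{
    uint8_t reg[16];
    for (int i = 0; i < 16; ++i) {            /* 1. load the 128-bit segment */
        reg[i] = in[i];
    }
    for (int lane = 0; lane < 4; ++lane) {
        unsigned first_bit  = (unsigned)lane * WIDTH;
        unsigned first_byte = first_bit / 8;  /* 0, 2, 4, 6 */
        unsigned shift      = first_bit % 8;  /* 0, 1, 2, 3 */

        /* 2. "shuffle": gather the bytes holding this field into a DWORD */
        uint32_t dword = 0;
        for (unsigned b = 0; b < 4; ++b) {
            dword |= (uint32_t)reg[first_byte + b] << (8 * b);
        }
        /* 3. align: drop the leading bits and the neighbouring field */
        dword = (dword >> shift) & ((1u << WIDTH) - 1u);

        out[lane] = dword;                    /* 4. store */
    }
}
```

Step 3 is the hard part in SSE, because each lane needs a different shift amount; the following slides deal with that.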
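Problem: There are values that span across 5 Bytes
Example: packed 27-bit fields
[Figure: a 128-bit segment (bytes F..0) holds the packed 27-bit fields 42, 127321873, 32766, 17 and a field marked “128~” that is cut off at the end of the segment; a Shuffle into DWORDs works for 42 and 127321873 but not for 32766 (marked “??”).]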
• The 3rd value spans across 5 Bytes.
• Shuffle alone cannot copy ALL of its bits into a 4-Byte space directly.
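Which fields run into this problem follows from their bit offsets. The helper below is an illustrative sketch (the name bytes_spanned is an assumption) that counts the bytes a field touches when the stream starts at bit 0.

```c
#include <stddef.h>

/* Number of bytes touched by the index-th `width`-bit field of a packed
   stream whose first field starts at bit 0. */
unsigned bytes_spanned(unsigned width, size_t index)
{
    size_t first_bit = index * width;
    size_t last_bit  = first_bit + width - 1;
    return (unsigned)(last_bit / 8 - first_bit / 8 + 1);
}
```

For 27-bit fields, the fields with index 0, 1 and 3 touch 4 bytes each, while the field with index 2 (the 3rd value) occupies bits 54 to 80 and therefore touches bytes 6 to 10, which is 5 bytes.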
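Solution: Shift 5-Byte values into 4 Bytes and blend
Example: packed 27-bit fields
[Figure: three steps on the fields 42, 4, 27 and 32766. Shuffle (_mm_shuffle_epi8) copies the fields towards their target DWORDs; Shift(64) (_mm_slli_epi64, _mm_srli_epi64) moves the field that spans 5 Bytes (32766) into a 4-Byte space; Blend (_mm_blend_epi16) merges the shifted field with the DWORDs that were already in place, giving 32766 | 27 | 4 | 42.]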
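A scalar model may make the idea easier to follow. In the sketch below, path A plays the role of the plain 32-bit shuffle result, path B the role of the 64-bit shifted copy, and the final selection plays the role of the blend. It assumes least-significant-bit-first packing with the first field at bit 0; the function names are hypothetical and the code is not the SSE routine from the talk.

```c
#include <stdint.h>

/* Little-endian load of n bytes (n <= 8) starting at p. */
uint64_t load_le(const uint8_t *p, unsigned n)
{
    uint64_t v = 0;
    for (unsigned i = 0; i < n; ++i) {
        v |= (uint64_t)p[i] << (8 * i);
    }
    return v;
}

/* Scalar model: decode four 27-bit fields from a 16-byte segment whose
   first field starts at bit 0. */
void decompress4_27(const uint8_t in[16], uint32_t out[4])
{
    const uint32_t mask = (1u << 27) - 1u;
    for (int lane = 0; lane < 4; ++lane) {
        unsigned first_bit  = (unsigned)lane * 27u;
        unsigned first_byte = first_bit / 8;
        unsigned shift      = first_bit % 8;

        /* path A: 32-bit result, complete only if the field fits 4 bytes */
        uint32_t narrow = (uint32_t)(load_le(in + first_byte, 4) >> shift);
        /* path B: 64-bit lane shifted so that a 5-byte field fits 4 bytes */
        uint32_t wide = (uint32_t)(load_le(in + first_byte, 5) >> shift);

        /* "blend": take path B only where the field spans 5 bytes */
        int spans5 = (shift + 27u) > 32u;
        out[lane] = (spans5 ? wide : narrow) & mask;
    }
}
```

Only the lane with bit offset 6 takes path B here; for the other lanes the 4-byte path already contains all 27 bits.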
Different workarounds for “independent shift” are used
1. Direct shuffle for nicely aligned values
2. Integer multiplication (16-bit and 32-bit) to simulate independent shift
3. Use 2 shifts and blend results
4. Integer comparison to propagate value (1-bit compression)
Shift-1: Direct shuffle of aligned values
Example: packed 24-bit fields
• Data is nicely aligned (bit cases 8, 16 and 24).
• Copy interesting parts only and “zero” out the others.
[Figure: a 128-bit segment (bytes F..0) holds the packed 24-bit fields 114, 5, 31415, 128, 32766, …; one Shuffle places 128 | 31415 | 5 | 114 into separate DWORDs.]
Shift-2: Use multiplication to simulate independent left shift
Example: packed 15-bit fields
[Figure: after the shuffle, the fields 32766 | 27 | 4 | 42 sit in their DWORDs at different offsets (5-, 6-, 7- and 0-bit shift, respectively).]
Multiply to shift left, so that every field has a 7-bit shift:
    __m128i mult_msk  = _mm_set_epi32(0x04, 0x02, 0x01, 0x80);
    __m128i mult_rslt = _mm_mullo_epi32(shfl_rslt, mult_msk);
Shift right to align all fields:
    _mm_srli_epi32(mult_rslt, 7);
Result: 32766 | 27 | 4 | 42
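The multiplication trick can be verified with a scalar model of the four lanes. In this sketch, an unsigned 32-bit multiplication keeps the low 32 bits of each product, like _mm_mullo_epi32 does. The final mask with 0x7FFF is an assumption added to remove bits of neighbouring fields, and the name align4_15 is hypothetical.

```c
#include <stdint.h>

/* Scalar model of the multiply trick for four packed 15-bit fields.
   dwords[k] is the shuffle result for lane k: the 4 input bytes starting at
   byte 0, 1, 3 and 5, so the fields sit at bit offsets 0, 7, 6 and 5. */
void align4_15(const uint32_t dwords[4], uint32_t out[4])
{
    /* Multiplying by 2^s shifts left by s; s is chosen so that every field
       lands at bit offset 7. Lane 0 comes first here, which corresponds to
       _mm_set_epi32(0x04, 0x02, 0x01, 0x80). */
    static const uint32_t mult[4] = { 0x80, 0x01, 0x02, 0x04 };
    for (int lane = 0; lane < 4; ++lane) {
        uint32_t product = dwords[lane] * mult[lane]; /* low 32 bits only */
        out[lane] = (product >> 7) & 0x7FFFu;         /* common shift, mask */
    }
}
```

Because all lanes end up at the same offset, one ordinary shift instruction with a single shift amount finishes the alignment.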
Shift-3: Blend results of different shift amounts
Example: packed 4-bit fields
[Figure: bit-level view (bits 31..0 of a DWORD within bytes F..0) of the steps: Shift right by 4; Shuffle; Blend; Mask with “and”; Shuffle. The first pass yields the fields 3, 2, 1, 0.]
Use second Blend for remaining values
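The core of this workaround, two uniformly shifted copies merged by a blend, can be sketched in scalar C. This is a simplified model under stated assumptions and not the exact step order of the figure: each lane is assumed to already hold the input byte that contains its field, and the name align4_4 is hypothetical.

```c
#include <stdint.h>

/* Scalar model of "two shifts and blend" for four packed 4-bit fields.
   dwords[k] holds the input byte that contains field k (byte k / 2), so
   even lanes need no shift and odd lanes need a right shift by 4. */
void align4_4(const uint32_t dwords[4], uint32_t out[4])
{
    uint32_t unshifted[4], shifted[4];
    for (int lane = 0; lane < 4; ++lane) {
        unshifted[lane] = dwords[lane];       /* shift amount 0 for all lanes */
        shifted[lane]   = dwords[lane] >> 4;  /* shift amount 4 for all lanes */
    }
    for (int lane = 0; lane < 4; ++lane) {
        /* blend: odd lanes take the shifted copy, even lanes the other one */
        uint32_t picked = (lane & 1) ? shifted[lane] : unshifted[lane];
        out[lane] = picked & 0xFu;            /* mask with "and" */
    }
}
```

Each shift uses one amount for all lanes, which SSE supports directly; the per-lane choice is moved into the blend.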
Shift-4: Use integer comparison to propagate values
Packed 1-bit fields ONLY
[Figure: steps applied to a 128-bit segment (bytes F..0): shuffle; and; compare (bytes become 0x00 or 0xFF, e.g. 0x00 0xFF 0x00 0xFF); and (bytes become 0x00 or 0x01, e.g. 0x00 0x01 0x00 0x01); shuffle (values 0 1 0 1).]
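A reduced version of the and / compare / and sequence can be written with SSE2 alone. The sketch below expands 16 packed bits into 16 bytes rather than DWORDs and replaces the initial shuffle with _mm_set_epi8, so it is a simplification of the slide and assumes an x86 compiler with SSE2; the name expand16_1bit is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Expand the 16 bits of `packed` into 16 bytes holding 0 or 1
   (bit 0 goes to out[0]). */
void expand16_1bit(uint16_t packed, uint8_t out[16])
{
    char lo = (char)(packed & 0xFF);
    char hi = (char)(packed >> 8);
    /* "shuffle": replicate each source byte into the 8 lanes it feeds */
    __m128i bytes = _mm_set_epi8(hi, hi, hi, hi, hi, hi, hi, hi,
                                 lo, lo, lo, lo, lo, lo, lo, lo);
    /* one distinct bit per lane; 0x80 is written as -128 for the char type */
    __m128i bit = _mm_set_epi8(-128, 64, 32, 16, 8, 4, 2, 1,
                               -128, 64, 32, 16, 8, 4, 2, 1);
    __m128i isolated = _mm_and_si128(bytes, bit);        /* and */
    __m128i hit = _mm_cmpeq_epi8(isolated, bit);         /* compare: 0xFF/0x00 */
    __m128i ones = _mm_and_si128(hit, _mm_set1_epi8(1)); /* and: 0x01/0x00 */
    _mm_storeu_si128((__m128i *)out, ones);
}
```

The comparison turns a bit at any position within a lane into an all-ones lane, so no per-lane shift is needed.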
Decompression is 1.58x faster with SIMD
[Figure: bar chart “Decompression Scan on Intel® Xeon™ Processor X5560”; y-axis: speed-up SSE vs. scalar (0.0 to 3.0); x-axis: bit case (1 to 31, plus avg).]
Using SIMD for Predicate Handling
COMPRESSEDSEARCH searches on compressed values
Algorithmic optimization by only decompressing the range of values that are of interest:
• DECOMPRESS
• Return indexes of “Index Values” instead of decompressed “Index Values”
[Figure: COMPRESSEDSEARCH(1,30) on an input vector (bytes F..0) holding 42 (Index=115), 4 (Index=114), 27 (Index=113) and 32766 (Index=112); after Decompress, “Compare and store the index” writes 114 and 113 to the Result Buffer.]
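A scalar reference version of this operation could look as follows. It is a sketch under assumptions: least-significant-bit-first packing, a half-open range lo <= value < hi, and a hypothetical function name and signature that are not taken from the SAP routines.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: scan `count` packed `width`-bit fields and write the
   positions of all fields v with lo <= v < hi to `hits`.
   Returns the number of hits; `hits` must have room for `count` entries. */
size_t compressed_search(const uint8_t *packed, unsigned width, size_t count,
                         uint32_t lo, uint32_t hi, uint32_t *hits)
{
    size_t n = 0;
    for (size_t index = 0; index < count; ++index) {
        /* decompress one field, least significant bit first */
        size_t bit = index * width;
        uint32_t value = 0;
        for (unsigned i = 0; i < width; ++i, ++bit) {
            value |= (uint32_t)((packed[bit / 8] >> (bit % 8)) & 1u) << i;
        }
        /* compare and store the index, not the value */
        if (value >= lo && value < hi) {
            hits[n++] = (uint32_t)index;
        }
    }
    return n;
}
```

The decompressed value never leaves the loop body; only the positions of matching fields are written to memory.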
Basic idea of COMPRESSEDSEARCH
Example: COMPRESSEDSEARCH(3,30) for packed 17-bit fields
1. Load a pre-fetched 128-bit segment of input data into an SSE register.
   Bytes F..0: ... | 49 | 270 | 42 | 4 | 27 | 32766
2. Copy compressed values to target DWORDs (32-bit segments).
   DWORDs: 42 | 4 | 27 | 32766
3. Make a parallel comparison of each DWORD (3 <= Value < 30).
   Result: 0x00000000 | 0xFFFFFFFF | 0xFFFFFFFF | 0x00000000
4. Store the indexes.
   Result buffer: 114, 113, …
Compare shifted values
Example: COMPRESSEDSEARCH(3,30) for packed 15-bit fields
[Figure: the fields 32766 | 27 | 4 | 42 stay at their offsets inside the DWORDs (5-, 6-, 7- and 0-bit shift); the other bits are masked with pand. The lower bound 3 and the upper bound 30 are shifted by the same amounts. A “greater than” comparison with the shifted lower bound and a “less than” comparison with the shifted upper bound each give a 0xFFFFFFFF/0x00000000 mask per DWORD, and the two masks are combined with “and”.]
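The reason this works is that shifting a field and both bounds by the same amount preserves their order, so the alignment step can be skipped. A scalar sketch for one lane follows; the function name and the all-ones/all-zero return convention are illustrative assumptions that mimic a SIMD comparison result.

```c
#include <stdint.h>

/* Range test lo <= field < hi on a field that still sits at bit offset
   `shift` inside its DWORD: mask out the neighbours, then compare against
   bounds shifted by the same amount. Returns an all-ones or all-zero mask.
   Assumes shift + width <= 31 and hi < 2^width, so nothing overflows and
   signed SSE comparisons would see non-negative numbers. */
uint32_t in_range_shifted(uint32_t dword, unsigned shift, unsigned width,
                          uint32_t lo, uint32_t hi)
{
    uint32_t field_mask = ((1u << width) - 1u) << shift;
    uint32_t masked   = dword & field_mask;   /* pand */
    uint32_t lo_shift = lo << shift;          /* shifted lower bound */
    uint32_t hi_shift = hi << shift;          /* shifted upper bound */
    int hit = (masked >= lo_shift) && (masked < hi_shift);
    return hit ? 0xFFFFFFFFu : 0x00000000u;
}
```

The shifted bounds depend only on the bit case and the lane, so they can be prepared once outside the scan loop.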
Hits are stored with look-up table
• Test first, if there are any hits with _mm_testz_si128
− Implicit “pand” (saves 1 instruction)
[Figure, in four steps:
 Extract bits with “movemask” and use this for table look-up: 0x00000000 | 0xFFFFFFFF | 0xFFFFFFFF | 0x00000000 gives 0b0110.
 Maintain loop variable with current indexes: 115 | 114 | 113 | 112.
 Shuffle indexes of hits (shuffle mask from look-up table): 114 | 113.
 Append result to list of hits.]
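The movemask and look-up-table steps can be sketched with SSE2 intrinsics. This is a simplified variant under assumptions: it skips the _mm_testz_si128 pre-test (an SSE4.1 instruction), and it copies the hit indexes with a short scalar loop driven by the table instead of a shuffle. The table layout and all names are hypothetical, and an x86 compiler with SSE2 is assumed.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* For each 4-bit hit pattern: the lanes that are set, lowest lane first. */
static const uint8_t HIT_LANES[16][4] = {
    {0,0,0,0}, {0,0,0,0}, {1,0,0,0}, {0,1,0,0},
    {2,0,0,0}, {0,2,0,0}, {1,2,0,0}, {0,1,2,0},
    {3,0,0,0}, {0,3,0,0}, {1,3,0,0}, {0,1,3,0},
    {2,3,0,0}, {0,2,3,0}, {1,2,3,0}, {0,1,2,3}
};
static const uint8_t HIT_COUNT[16] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};

/* Compare four decompressed values against [lo, hi) and append the indexes
   of the hits to `hits`; `base` is the index of lane 0.
   Values and bounds must be below 2^31 (the SSE2 compares are signed).
   Returns the new number of entries in `hits`. */
size_t append_hits(const uint32_t values[4], uint32_t lo, uint32_t hi,
                   uint32_t base, uint32_t *hits, size_t n)
{
    __m128i v  = _mm_loadu_si128((const __m128i *)values);
    __m128i ge = _mm_cmpgt_epi32(v, _mm_set1_epi32((int)lo - 1)); /* v >= lo */
    __m128i lt = _mm_cmplt_epi32(v, _mm_set1_epi32((int)hi));     /* v <  hi */
    __m128i hit = _mm_and_si128(ge, lt);

    /* movemask: one bit per 32-bit lane, taken from the lane's top bit */
    int pattern = _mm_movemask_ps(_mm_castsi128_ps(hit));
    for (unsigned k = 0; k < HIT_COUNT[pattern]; ++k) {
        hits[n++] = base + HIT_LANES[pattern][k];
    }
    return n;
}
```

With the values 32766, 27, 4, 42 in lanes 0 to 3, the bounds 3 and 30, and base index 112, the pattern is 0b0110 and the indexes 113 and 114 are appended.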
Full-table Scan is 1.63x faster with SIMD
[Figure: bar chart “Full-Table Scan on Intel® Xeon™ Processor X5560”; y-axis: speed-up SSE vs. scalar code (0.00 to 2.50); x-axis: bit case (1 to 31, plus avg).]
Best performance is achieved for small result sets
[Figure: bar chart “Full-Table Scan on Intel® Xeon™ Processor X5560”; y-axis: speed-up SSE vs. scalar code (0.0 to 2.5); x-axis: selectivity of query (0% to 90%, plus average).]
SIMD-Scan scales with the number of cores
[Figure: chart “Throughput of Full-Table Scan”, Intel® Xeon™ Processor X5560 (2.8GHz, 2 CPUs with 4 cores each); y-axis: GB/s (0 to 16); x-axis: number of threads (0 to 10); data labels: 2.4, 4.2, 7.2 and 13.1 GB/s.]
Summary
• Data is stored in memory as columns
• Integers are compressed as packed bit-fields
• Use vector-processing units built into standard processors
• Decompression is 1.58x faster with SIMD
• Full-table Scan is 1.63x faster with SIMD
• Best performance is achieved for small result sets
• SIMD-Scan scales with the number of cores
Trademarks
•Intel and Xeon are trademarks or registered trademarks of
Intel Corporation or its subsidiaries in the United States or
other countries.
•SAP, SAP NetWeaver, BusinessObjects, BusinessObjects
Explorer, and other SAP products and services mentioned
herein as well as their respective logos are trademarks or
registered trademarks of SAP AG in Germany and other
countries.
•Other names and brands may be claimed as the property of
others.
Load problem of unaligned data
Example: packed 15-bit fields
[Figure: the packed stream is consumed in 128-bit groups (“Load the first group”, “Load the second group”), shown as segments with bytes F..0. Fields marked “~” are cut at a group boundary, so the rest of the field lies in the next group (“What about this value?”).]
Solution: Load using palign
[Figure: the stream between “Start” and “END” is read as 128-bit segments (bits 127..0) with _mm_load_si128. _mm_alignr_epi8 (palign) aligns the data to a new 128-bit SSE register and neglects the already processed part by shifting (15 bytes in the example).]
• Use the same technique for further loads with a “suitable” shift amount.
• Unroll the loop because the shift amount is an immediate
(loop unrolling can be done by the Intel compiler with a pragma).
• Use unaligned loads on Intel® Xeon® processor 5500 series.
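What the palign step computes can be shown with a scalar model. The sketch below mirrors the byte semantics of _mm_alignr_epi8 for shift amounts from 0 to 16; the function name alignr_bytes is hypothetical, and the shift is an ordinary run-time parameter here, unlike the immediate operand of the real instruction.

```c
#include <stdint.h>

/* Scalar model of _mm_alignr_epi8(hi, lo, n) for 0 <= n <= 16: treat hi:lo
   as one 32-byte value, drop the lowest n bytes, keep the next 16. */
void alignr_bytes(const uint8_t hi[16], const uint8_t lo[16], unsigned n,
                  uint8_t out[16])
{
    for (unsigned i = 0; i < 16; ++i) {
        unsigned src = i + n;                 /* position in the 32-byte value */
        out[i] = (src < 16) ? lo[src] : hi[src - 16];
    }
}
```

With n = 15, the result starts with the last byte of the first segment followed by the first 15 bytes of the second segment, so a field that was cut at the segment boundary ends up inside one register.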