TPC-H Studies

Download Report

Transcript TPC-H Studies

TPC-H Studies
Joe Chang
[email protected]
www.qdpma.com
About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools
ExecStats – cross-reference index use by SQLexecution plan
Performance Monitoring,
Profiler/Trace aggregation
TPC-H
TPC-H
DSS – 22 queries, geometric mean
60X range plan cost, comparable actual range
Power – single stream
Tests ability to scale parallel execution plans
Throughput – multiple streams
Scale Factor 1 – Line item data is 1GB
875MB with DATE instead of DATETIME
Only single column indexes allowed, Ad-hoc
SF 10, test studies
Not valid for publication
Auto-Statistics enabled,
Excludes compile time
Big Queries – Line Item Scan
Super Scaling – Mission Impossible
Small Queries & High Parallelism
Other queries, negative scaling
Did not apply T2301, or disallow page locks
Big Q: Plan Cost vs Actual
3,500
3,000
DOP 1
DOP 2
DOP 8
DOP 16
Plan Cost
reduction from
DOP1 to 16/32
Q1
28%
Q9
44%
Q18
70%
Q21
20%
DOP 4
2,500
2,000
Plan Cost @ 10GB
1,500
1,000
500
0
Q1
Q9
Q13
Q18
Q21
75
60
memory affects
Hash IO onset
DOP 1
DOP 2
DOP 4
DOP 8
DOP 30
DOP 16
DOP 32
DOP 24
Plan Cost says
scaling is poor
except for Q18,
Q18 & Q 21
> 3X Q1, Q9
Actual Query time
In seconds
45
30
15
0
Q1
Q9
Q13
Q18
Q21
Plan Cost is poor
indicator of true
parallelism scaling
Big Query: Speed Up and CPU
34
32
30
28
26
24
22
20
18
16
14
12
10
8
6
4
2
0
DOP 1
DOP 2
DOP 4
DOP 8
DOP 16
DOP 24
DOP 30
DOP 32
Q1
Q9
Holy Grail
Speed up relative to
DOP 1
Q13
Q18
Q21
90
80
DOP 1
DOP 2
DOP 4
DOP 8
DOP 16
DOP 24
DOP 30
DOP 32
70
CPU time
In seconds
60
50
40
30
20
10
0
Q1
Q9
Q13
Q18
Q21
Q13 has slightly
better than perfect
scaling?
In general,
excellent scaling to
DOP 8-24, weak
afterwards
Super Scaling
Suppose at DOP 1, a query runs for 100
seconds, with one CPU fully pegged
CPU time = 100 sec, elapse time = 100 sec
What is best case for DOP 2?
Assuming nearly zero Repartition Threads cost
CPU time = 100 sec, elapsed time = 50?
Super Scaling: CPU time decreases going from
Non-Parallel to Parallel plan!
No, I have not started drinking, yet
Super Scaling
2.5
CPU normalized to DOP 1
2.0
DOP 1
DOP 2
DOP 4
DOP 8
DOP 16
DOP 24
DOP 30
DOP 32
1.5
1.0
0.5
CPU-sec goes
down from DOP
1 to 2 and
higher (typically
8)
0.0
Q7
Q8
Q11
Q21
Q22
26
24
DOP 1
DOP 2
DOP 4
DOP 8
22
DOP 16
DOP 24
DOP 30
DOP 32
20
3.5X speedup
from DOP 1 to 2
(Normalized to
DOP 1)
Speed up relative to DOP 1
18
16
14
12
10
8
6
4
2
0
Q7
Q8
Q11
Q21
Q22
CPU and Query time in seconds
20
18
16
CPU time
DOP 1
DOP 2
DOP 4
DOP 8
DOP 16
DOP 24
DOP 30
DOP 32
14
12
10
8
6
4
2
0
Q7
Q8
Q11
Q21
Q22
Q21
Q22
12
10
DOP 1
DOP 2
DOP 4
DOP 8
DOP 16
DOP 24
DOP 30
DOP 32
8
Query time
6
4
2
0
Q7
Q8
Q11
Super Scaling Summary
Most probable cause
Bitmap Operator in Parallel Plan
Bitmap Filters are great,
Question for Microsoft:
Can I use Bitmap Filters in OLTP systems with
non-parallel plans?
Small Queries – Plan Cost vs Act
250
200
DOP 1
DOP 2
DOP 4
DOP 8
DOP 16
Query 3 and 16
have lower plan
cost than Q17, but
not included
Plan Cost
150
100
50
0
Q2
Q4
Q6
Q15
Q17
Q20
Q4,6,17 great
scaling to DOP 4,
then weak
3.5
3.0
DOP 1
DOP 2
DOP 4
DOP 8
DOP 16
DOP 24
DOP 30
DOP 32
2.5
Query time
2.0
Negative scaling
also occurs
1.5
1.0
0.5
0.0
Q2
Q4
Q6
Q15
Q17
Q20
Small Queries CPU & Speedup
6
CPU time
DOP 1
DOP 2
DOP 4
DOP 8
DOP 16
DOP 24
DOP 30
DOP 32
5
4
3
2
1
0
Q2
Q4
Q6
Q15
Q17
Q20
What did I get for all
that extra CPU?,
Interpretation: sharp
jump in CPU means
poor scaling,
disproportionate
means negative
scaling
18
16
Speed up
14
DOP 1
DOP 2
DOP 4
DOP 8
DOP 16
DOP 24
DOP 30
DOP 32
12
10
8
6
4
2
0
Q2
Q4
Q6
Q15
Q17
Q20
Query 2 negative at
DOP 2, Q4 is good,
Q6 get speedup, but
at CPU premium,
Q17 and 20
negative after DOP
8
High Parallelism – Small Queries
Why? Almost No value Sometimes you do get lucky
TPC-H geometric mean scoring
Small queries have as much impact as large
Linear sum of weights large queries
OLTP with 32, 64+ cores
Parallelism good if super-scaling
Default max degree of parallelism 0
Seriously bad news, especially for small Q
Increase cost threshold for parallelism?
Q that go Negative
4.0
3.5
3.0
DOP 1
DOP 2
DOP 4
DOP 8
DOP 16
DOP 24
DOP 30
DOP 32
Query time
2.5
2.0
1.5
1.0
0.5
0.0
Q17
14
Q19
“Speedup”
12
10
Q20
Q22
DOP 1
DOP 2
DOP 4
DOP 8
DOP 16
DOP 24
DOP 30
DOP 32
8
6
4
2
0
Q17
Q19
Q20
Q22
CPU
12
10
DOP 1
DOP 2
DOP 4
DOP 8
DOP 16
DOP 24
DOP 30
DOP 32
8
6
4
2
0
Q17
Q19
Q20
Q22
Other Queries – CPU & Speedup
22
20
DOP 1
DOP 2
DOP 4
DOP 8
18
DOP 16
DOP 24
DOP 30
DOP 32
CPU time
16
14
12
10
8
6
4
2
0
Q3
Q5
Q10
Q12
22
20
18
16
DOP 1
DOP 2
DOP 4
DOP 8
DOP 16
DOP 24
DOP 30
DOP 32
Q14
Q16
Speedup
14
12
10
8
6
4
2
0
Q3
Q5
Q10
Q12
Q14
Q16
Q3 has problems
beyond DOP 2
Other - Query Time seconds
16
14
DOP 1
DOP 2
DOP 4
DOP 8
DOP 16
DOP 24
DOP 30
DOP 32
Query time
12
10
8
6
4
2
0
Q3
Q5
Q10
Q12
Q14
Q16
Scaling Summary
Some queries show excellent scaling
Super-scaling, better than 2X
Sharp CPU jump on last DOP doubling
Need strategy to cap DOP
To limit negative scaling
Especially for some smaller queries?
Other anomalies
Compression
PAGE
Compression Overhead - Overall
1.5
Query time compressed
relative to uncompressed
1.4
40% overhead for
compression at low
DOP,
10% overhead at
max DOP???
1.3
1.2
1.1
1.0
DOP 1
DOP 2
DOP 4
1.5
DOP 8
DOP 16
DOP 24
DOP 30
DOP 32
CPU time compressed
relative to uncompressed
1.4
1.3
1.2
1.1
1.0
DOP 1
DOP 2
DOP 4
DOP 8
DOP 16
DOP 24
DOP 30
DOP 32
2.0
Query time compressed
relative to uncompressed
1.8
1.6
DOP 1
DOP 2
DOP 4
DOP 8
DOP 16
DOP 24
DOP 32
1.4
1.2
1.0
0.8
0.6
0.4
0.2
0.0
Q1
Q2
Q3
Q4
Q5
Q6
Q7
Q8
Q9
Q10
Q11
Q12
Q13
Q14
Q15
Q16
Q17
Q18
Q19
Q20
Q21
Q22
2.0
CPU time compressed
relative to uncompressed
1.8
1.6
DOP 1
DOP 2
DOP 4
DOP 16
DOP 24
DOP 32
DOP 8
1.4
1.2
1.0
0.8
0.6
0.4
0.2
0.0
Q1
Q2
Q3
Q4
Q5
Q6
Q7
Q8
Q9
Q10
Q11
Q12
Q13
Q14
Q15
Q16
Q17
Q18
Q19
Q20
Q21
Q22
Compressed Table
LINEITEM – real data may be more compressible
Uncompressed: 8,749,760KB, Average Bytes per row: 149
Compressed: 4,819,592KB, Average Bytes per row: 82
Partitioning
Orders and Line Item on Order Key
Partitioning Impact - Overall
1.8
1.7
1.6
1.5
Query time partitioned
relative to not partitioned
1.4
1.3
1.2
1.1
1.0
0.9
0.8
DOP 1
1.15
1.10
DOP 2
DOP 4
DOP 8
DOP 16
DOP 24
DOP 30
DOP 32
DOP 16
DOP 24
DOP 30
DOP 32
CPU time partitioned
relative to not partitioned
1.05
1.00
0.95
0.90
DOP 1
DOP 2
DOP 4
DOP 8
6
5
Query time partitioned
relative to not partitioned
4
DOP 1
DOP 2
DOP 4
DOP 8
DOP 16
DOP 24
DOP 32
3
2
1
0
Q1
Q2
Q3
Q4
Q5
Q6
Q7
Q8
Q9
Q10
Q11
Q12
Q13
Q14
Q15
Q16
Q17
Q18
Q19
Q20
Q21
Q22
Q17
Q18
Q19
Q20
Q21
Q22
4.0
3.5
CPU time partitioned
relative to not partitioned
3.0
DOP 1
DOP 2
DOP 4
DOP 8
DOP 16
DOP 24
DOP 32
2.5
2.0
1.5
1.0
0.5
0.0
Q1
Q2
Q3
Q4
Q5
Q6
Q7
Q8
Q9
Q10
Q11
Q12
Q13
Q14
Q15
Q16
Plan for Partitioned Tables
Scaling DW Summary
Massive IO bandwidth
Parallel options for data load, updates etc
Investigate Parallel Execution Plans
Scaling from DOP 1, 2, 4, 8, 16, 32 etc
Scaling with and w/o HT
Strategy for limiting DOP with multiple
users
Fixes from Microsoft Needed
Contention issues in parallel execution
Table scan, Nested Loops
Better plan cost model for scaling
Back-off on parallelism if gain is negligible
Fix throughput degradation with multiple
users running big DW queries
Sybase and Oracle, Throughput is close to
Power or better
Query Plans
Big Queries
Q1 Pricing Summary Report
Q1 Plan
Non-Parallel
Parallel plan 28% lower
than scalar, IO is 70%,
no parallel plan cost
reduction
Parallel
Q9 Product Type Profit Measure
Non-Parallel
Parallel
IO from 4 tables contribute
58% of plan cost, parallel
plan is 39% lower
Q9 Non-Parallel Plan
Table/Index Scans comprise 64%, IO
from 4 tables contribute 58% of plan
cost
Join sequence: Supplier, (Part,
PartSupp), Line Item, Orders
Q9 Parallel Plan
Non-Parallel: (Supplier), (Part, PartSupp), Line Item, Orders
Parallel: Nation, Supplier, (Part, Line Item), Orders, PartSupp
Q9 Non-Parallel Plan details
Table Scans comprise 64%,
IO from 4 tables contribute
58% of plan cost
Q9 Parallel reg vs Partitioned
Q13 Why does Q13 have perfect scaling?
Q18 Large Volume Customer
Non-Parallel
Parallel
Q18 Graphical Plan
Non-Parallel Plan: 66% of cost in Hash Match, reduced to 5% in Parallel Plan
Q18 Plan Details
Non-Parallel
Parallel
Non-Parallel Plan Hash Match cost is 1245 IO, 494.6 CPU
DOP 16/32: size is below IO threshold, CPU reduced by >10X
Q21 Suppliers Who Kept Orders Waiting
Non-Parallel
Note 3 references to Line Item
Parallel
Q21 Non-Parallel Plan
H3
H2
H3
H1
H2
H1
Q21 Parallel
Q21
3 full Line Item clustered index scans
Plan cost is approx 3X Q1, single “scan”
Super Scaling
Q7 Volume Shipping
Non-Parallel
Parallel
Q7 Non-Parallel Plan
Join sequence: Nation, Customer, Orders, Line Item
Q7 Parallel Plan
Join sequence: Nation, Customer, Orders, Line Item
Q8 National Market Share
Non-Parallel
Parallel
Q8 Non-Parallel Plan
Join sequence: Part, Line Item, Orders, Customer
Q8 Parallel Plan
Join sequence: Part, Line Item,
Orders, Customer
Q11 Important Stock Identification
Non-Parallel
Parallel
Q11
Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp
Q11
Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp
Small Queries
Query 2 Minimum Cost Supplier
Wordy, but only touches the small tables, second lowest plan cost (Q15)
Q2
Clustered Index Scan on Part and PartSupp have highest cost (48%+42%)
Q2
PartSupp is now Index Scan + Key Lookup
Q6 Forecasting Revenue Change
Note sure why this blows CPU
Scalar values are pre-computed, pre-converted
Q20?
Date functions are usually written as
because Line Item date columns are “date” type
CAST helps DOP 1 plan, but get bad plan for parallel
This query may get a poor execution plan
Q20
Q20
Q20 alternate - parallel
Statistics estimation
error here
Penalty for mistake
applied here
Other Queries
Q3
Q3
Q12 Random IO?
Will this generate random IO?
Query 12 Plans
Parallel
Non-Parallel
Queries that go Negative
Q17 Small Quantity Order Revenue
Q17
Table Spool is concern
Q17
the usual suspects
Q19
Q19
Q22
Q22
32
30
DOP 2
DOP 4
DOP 8
28
DOP 16
DOP 24
DOP 32
Speedup from DOP 1 query time
26
24
22
20
18
16
14
12
10
8
6
4
2
0
Q1
Q2
Q3
Q4
Q5
Q6
Q7
Q8
Q9
Q10
Q11
Q12
Q13
Q14
Q15
Q16
Q17
Q18
Q19
Q20
Q21
Q22
Tot
Q17
Q18
Q19
Q20
Q21
Q22
Tot
3.0
2.5
DOP 2
DOP 4
DOP 8
DOP 16
DOP 24
DOP 32
CPU relative to DOP 1
2.0
1.5
1.0
0.5
0.0
Q1
Q2
Q3
Q4
Q5
Q6
Q7
Q8
Q9
Q10
Q11
Q12
Q13
Q14
Q15
Q16