Parallel Execution Plans

Download Report

Transcript Parallel Execution Plans

Parallel Execution Plans

Joe Chang [email protected]

www.qdpma.com

About Joe Chang

SQL Server Execution Plan Cost Model True cost structure by system architecture Decoding statblob (distribution statistics) SQL Clone – statistics-only database Tools ExecStats – cross-reference index use by SQL execution plan Performance Monitoring, Profiler/Trace aggregation

So you bought a 64+ core box

Now

Learn all about Parallel Execution All guns (cores) blazing Negative scaling

Yes, this can happen, how will you know

Super-scaling

No I have not been smoking pot

High degree of parallelism & small SQL Anomalies, execution plan changes etc Compression

How much in CPU do I pay for this?

Partitioning

Great management tool, what else?

Parallel Execution Plans

Reference: Adam Machanic PASS

Execution Plan Quickie

I/O and CPU Cost components Estimated Execution Plan Cost is duration in seconds on some reference platform IO Cost for scan: 1 = 10,800KB/s, 810 implies 8,748,000KB IO in Nested Loops Join: 1 = 320/s, multiple of 0.003125

F4

Index + Key Lookup - Scan

Actual LU Scan CPU 1919 8736 Time (Data in memory) 1919 8727 (926.67- 323655 * 0.0001581) / 0.003125 = 280160 (86.6%) True cross-over approx 1,400,000 rows 1 row : page 1,093,729 pages/1350 = 810.17 (8,748MB)

Index + Key Lookup - Scan

Actual LU Scan CPU 2138 18622 Time 321 658 8748000KB/8/1350 = 810 (817- 280326 * 0.0001581) / 0.003125 = 247259 (88%)

Actual Execution Plan

Actual Estimated Estimated Actual Note Actual Number of Rows, Rebinds, Rewinds

Row Count and Executions

Outer Inner Source For Loop Join inner source and Key Lookup, Actual Num Rows = Num of Exec × Num of Rows

Parallel Plans

Parallelism Operations

Distribute Streams Non-parallel source, parallel destination Repartition Streams Parallel source and destination Gather Streams Destination is non-parallel

Parallel Execution Plans

Note: gold circle with double arrow, and parallelism operations

Parallel Scan (and Index Seek)

DOP 1 DOP 4

4X 2X

DOP 2

IO Cost same CPU reduce by degree of parallelism, except no reduction for DOP 16 8X

DOP 8

IO contributes most of cost!

Parallel Scan 2

DOP 16

Hash Match Aggregate

CPU cost only reduces By 2X,

Parallel Scan

IO Cost is the same CPU cost reduced in proportion to degree of parallelism, last 2X excluded? On a weak storage system, a single thread can saturate the IO channel, Additional threads will not increase IO (reduce IO duration).

A very powerful storage system can provide IO proportional to the number of threads. It might be nice if this was optimizer option?

The IO component can be a very large portion of the overall plan cost Not reducing IO cost in parallel plan may inhibit generating favorable plan, i.e., not sufficient to offset the contribution from the Parallelism operations.

A parallel execution plan is more likely on larger systems (-P to fake it?)

Actual Execution Plan - Parallel

More Parallel Plan Details

Parallel Plan - Actual

Parallelism – Hash Joins

Hash Join Cost

DOP 4 DOP 1

Search: Understanding Hash Joins For In-memory, Grace, Recursive

DOP 2 DOP 8

Hash Join Cost

CPU Cost is linear with number of rows, outer and inner source See BOL on Hash Joins for In-Memory, Grace, Recursive IO Cost is zero for small intermediate data size, beyond set point proportional to server memory(?) IO is proportional to excess data (beyond in-memory limit) Parallel Plan: Memory allocation is per thread! Summary: Hash Join plan cost depends on memory if IO component is not zero, in which case is disproportionately lower with parallel plans. Does not reflect real cost?

Parallelism Repartition Streams

DOP 2 DOP 4 DOP 8

Bitmap

BOL: Optimizing Data Warehouse Query Performance Through Bitmap Filtering A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. Essentially, the filter performs a semi-join reduction; that is, only the rows in the second table that qualify for the join to the first table are processed. SQL Server uses the Bitmap operator to implement bitmap filtering in parallel query plans. Bitmap filtering speeds up query execution by eliminating rows with key values that cannot produce any join records before passing rows through another operator such as the

Parallelism

operator. A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. By removing unnecessary rows early in the query, subsequent operators have fewer rows to work with, and the overall performance of the query improves. The optimizer determines when a bitmap is selective enough to be useful and in which operators to apply the filter. For more information, see Optimizing Data Warehouse Query Performance Through Bitmap Filtering.

Parallel Execution Plan Summary

Queries with high IO cost may show little plan cost reduction on parallel execution Plans with high portion hash or sort cost show large parallel plan cost reduction Parallel plans may be inhibited by high row count in Parallelism Repartition Streams Watch out for (Parallel) Merge Joins!

Scaling Theory

Parallel Execution Strategy

Partition work into little pieces Ensures each thread has same amount High overhead to coordinate Partition into big pieces May have uneven distribution between threads Small table join to big table Thread for each row from small table Partitioned table options

What Should Scale?

3 2

Trivially parallelizable: 1) Split large chunk of work among threads, 2) Each thread works independently, 3) Small amount of coordination to consolidate threads

2

More Difficult?

4 3 3

Parallelizable: 1) Split large chunk of work among threads, 2) Each thread works on first stage 3) Large coordination effort between threads 4) More work … Consolidate

2 2

Partitioned Tables

No Repartition Streams

Regular Table Partitioned Tables

No Repartition Streams operations!

Scaling Reality

8-way Quad-Core Opteron Windows Server 2008 R2 SQL Server 2008 SP1 + HF 27

Test Queries

TPC-H SF 10 database Standard, Compressed, Partitioned (30) Line Item Table SUM, 59M rows, 8.75GB

Orders Table 15M rows

32 28 24 20 16 12 8 4 0 10 5 0 35 30 25 20 15

CPU-sec

Sum 1 column Sum 2 columns DOP 1 DOP 2 DOP 4 Sum 1 column Sum 2 columns DOP 1 DOP 2 DOP 4 DOP 8 DOP 16 DOP 24 DOP 30 DOP 32 DOP 8 DOP 16 DOP 24 DOP 30 DOP 32 Standard CPU-sec to SUM 1 or 2 columns in Line Item Compressed

12 8 4 0 32 28 24 20 16 6 4 2 0 14 12 10 8 26 24 22 20 18 16

Speed Up

Sum 1 Sum 2 S2 Group S2 Join DOP 1 DOP 2 DOP 4 Sum 1 column Sum 2 columns S2 Group S2 Join DOP 1 DOP 2 DOP 4 DOP 8 DOP 16 DOP 24 DOP 32 DOP 8 DOP 16 DOP 24 DOP 32 Standard Compressed

12 8 4 0 32 28 24 20 16 15 10 5 0 20 25 30

Line Item sum 1 column

Sum 1 Std Compressed Partitioned CPU-sec DOP 1 DOP 2 DOP 4 DOP 8 DOP 16 DOP 24 DOP 30 DOP 32 Speed up relative to DOP 1 Sum 1 Std Compressed Partitioned DOP 1 DOP 2 DOP 4 DOP 8 DOP 16 DOP 24 DOP 30 DOP 32

8 6 4 2 0 16 14 12 10 26 24 22 20 18 20 10 0 50 40 30 60

Line Item Sum w/Group By

Group Std Compressed Hash CPU-sec DOP 1 DOP 2 DOP 4 DOP 8 DOP 16 DOP 24 DOP 32 Speedup Group Std Compressed Hash DOP 1 DOP 2 DOP 4 DOP 8 DOP 16 DOP 24 DOP 32

15 10 5 0 30 25 20 80 60 40 20 0 120

Hash Join

100 Join Std Compressed Partitioned DOP 1 DOP 2 DOP 4 Join Std Compressed Partitioned DOP 1 DOP 2 DOP 4 DOP 8 DOP 8 DOP 16 DOP 24 DOP 32 DOP 16 DOP 24 DOP 32 CPU-sec Speedup

32 30 28 26 24 8 6 4 2 0 22 20 18 16 14 12 10 4 2 8 6 0 20 18 16 14 12 10

Key Lookup and Table Scan

Key Lookup std Table Scan uncmp Key Lookup compr Table Scan cmpr CPU-sec 1.4M rows DOP 1 DOP 2 DOP 4 DOP 8 DOP 16 DOP 24 DOP 30 DOP 32 Key Lookup std Key Lookup compr Table Scan uncmp Table Scan cmpr Speedup DOP 1 DOP 2 DOP 4 DOP 8 DOP 16 DOP 24 DOP 30 DOP 32

Parallel Execution Summary

Contention in queries w/low cost per page Simple scan, High Cost per Page – improves scaling!

Multiple Aggregates, Hash Join, Compression Table Partitioning – alternative query plans Loop Joins – broken at high DOP Merge Join – seriously broken (parallel)

Scaling DW Summary

Massive IO bandwidth Parallel options for data load, updates etc Investigate Parallel Execution Plans Scaling from DOP 1, 2, 4, 8, 16, 32 etc Scaling with and w/o HT Strategy for limiting DOP with multiple users

Fixes from Microsoft Needed

Contention issues in parallel execution Table scan, Nested Loops Better plan cost model for scaling Back-off on parallelism if gain is negligible Fix throughput degradation with multiple users running big DW queries Sybase and Oracle, Throughput is close to Power or better

Test Systems

Test Systems

2-way quad-core Xeon 5430 2.66GHz

Windows Server 2008 R2, SQL 2008 R2 8-way dual-core Opteron 2.8GHz

Windows Server 2008 SP1, SQL 2008 SP1 8-way quad-core Opteron 2.7GHz Barcelona Windows Server 2008 R2, SQL 2008 SP1 Build 2789 8-way systems were configured for AD-

not good!

Test Methodology

Boot with all processors Run queries at MAXDOP 1, 2, 4, 8, etc Not the same as running on 1-way, 2-way, 4-way server Interpret results with caution

References

Search Adam Machanic PASS