Transcript Storage

Storage Performance for Data Warehousing

Joe Chang [email protected]

www.qdpma.com

About Joe Chang

SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools: ExecStats – cross-reference index use by SQL execution plan
Performance Monitoring, Profiler/Trace aggregation

Storage

Organization Structure

In many large IT departments, DB and Storage are in separate groups. Storage usually has its own objectives: bring all storage into one big system under full management (read: control). Storage as a Service, in the Cloud. One size fits all needs. Usually zero DB knowledge. "Of course we do high bandwidth – 600MB/sec good enough for you?"

Data Warehouse Storage

OLTP – throughput with fast response. DW – flood the queues for maximum throughput.

Do not use shared storage for data warehouse!

Storage system vendors like to give the impression the SAN is a magical, immensely powerful box that can meet all your needs. Just tell us how much capacity you need and don’t worry about anything else.

My advice: stay away from shared storage controlled by a different team.

Nominal and Net Bandwidth

PCI-E Gen 2 – 5 Gbit/s signaling: x8 = 5GB/s nominal, 4GB/s net; x4 = 2GB/s net
SAS 6Gbit/s – x4 port: 3GB/s nominal, ~2.2GB/s net?

Fibre Channel – 8 Gbit/s nominal: ~780MB/s point-to-point, ~680MB/s from host through SAN to back-end loop
SAS RAID controller, x8 PCI-E G2 with 2 x4 6G ports: ~2.8GB/s

Depends on the controller, will change!
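A rough sketch of where the net numbers come from (these links use 8b/10b encoding, and real adapters add further protocol and controller overhead):

PCI-E Gen 2: 5 Gbit/s x 8/10 = 4 Gbit/s = 500 MB/s per lane; x8 lanes ≈ 4 GB/s net, x4 ≈ 2 GB/s
SAS 6G x4: 4 x 6 Gbit/s = 3 GB/s nominal; x 8/10 = 2.4 GB/s payload ceiling, ~2.2 GB/s observed
8Gb FC: ~8.5 Gbit/s line rate x 8/10 ≈ 850 MB/s payload ceiling, ~780 MB/s observed point-to-point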

Storage

SAS Direct-Attach

[Diagram: SAS direct-attach – multiple RAID controllers in PCI-E x8 slots (two x4 SAS ports each) plus RAID controllers in PCI-E x4 slots, connected to many disk enclosures; two dual-port 10GbE NICs in PCI-E x4 slots. Many fat pipes, very many disks.]

Option A: 24 disks in one enclosure for each x4 SAS port, two x4 SAS ports per controller. Option B: split a 24-disk enclosure over 2 x4 SAS ports on one controller. Balance by pipe bandwidth.
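Rough per-disk arithmetic, assuming the ~2.8 GB/s SAS RAID controller figure from earlier:

Option A: 2 ports x 24 = 48 disks per controller; 2,800 / 48 ≈ 58 MB/s per disk at the controller
Option B: 24 disks per controller; 2,800 / 24 ≈ 117 MB/s per disk at the controller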

Don’t forget fat network pipes

Storage

FC/SAN

[Diagram: FC/SAN – FC HBAs in two PCI-E x8 slots and several PCI-E x4 slots, with many 8Gb FC paths from the host to the SAN; two dual-port 10GbE NICs in PCI-E x4 slots.]

PCI-E x8 Gen 2 slot with quad-port 8Gb FC. If a quad-port 8Gb HBA is not supported, consider a system with many x4 slots, or consider SAS!

SAN systems typically offer 3.5in 15-disk enclosures, making it difficult to get a high spindle count with density.

1-2 15-disk enclosures per 8Gb FC port, 20-30MB/s per disk?
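Rough arithmetic, using the ~680 MB/s net figure per 8Gb FC path from earlier: 680 / 15 disks ≈ 45 MB/s per disk with one enclosure, 680 / 30 ≈ 23 MB/s per disk with two, hence the rough 20-30 MB/s per disk estimate.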


Storage

SSD / HDD Hybrid

[Diagram: SSD/HDD hybrid – SAS controllers in PCI-E x8 slots with x4 SAS ports to enclosures holding both SSD and HDD, one RAID controller in a PCI-E x4 slot, two dual-port 10GbE NICs.]

No RAID w/SSD?

Storage enclosures typically have 12 disks per channel; a channel can only support the bandwidth of a few SSDs, so use the remaining bays for extra HDD storage. No point expending valuable SSD space on backups and flat files. Log: single DB – HDD, unless rollbacks or T-log backups disrupt log writes; multiple DBs – SSD, otherwise too many RAID 1 pairs for logs.
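A minimal sketch of the hybrid layout in T-SQL, with hypothetical database name and volume paths (S: and T: on SSD, H: on an HDD RAID 1 pair for the log):

-- Data files on SSD volumes, log on an HDD RAID 1 pair (single-DB case; paths are hypothetical)
CREATE DATABASE DW
ON PRIMARY
    (NAME = DW_data1, FILENAME = 'S:\DW_data1.mdf', SIZE = 100GB),
    (NAME = DW_data2, FILENAME = 'T:\DW_data2.ndf', SIZE = 100GB)
LOG ON
    (NAME = DW_log, FILENAME = 'H:\DW_log.ldf', SIZE = 50GB);

For the multi-DB case described above, the log files would go to SSD instead.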

PDW

Control node, Management node, Landing Zone, Backup node, Compute nodes

[Diagram: each node attached through dual FC paths to paired storage processors (SP).]

ODM X2-2

[Diagram: ODM X2-2 – database servers and Exadata Storage cells connected over InfiniBand (IB), with SSD in the storage layer.]


SSD

Current: mostly 3Gbps SAS/SATA SSD, some 6Gbps SATA SSD. Fusion-io – direct PCI-E Gen2 interface, 320GB-1.2TB capacity, 200K IOPS, 1.5GB/s. No RAID?

HDD is fundamentally a single point of failure. SSD could be built with redundant components. HP reports problems with SSD on RAID controllers; Fujitsu did not?

Big DW Storage – iSCSI

Are you nuts?

Well, maybe if you like frequent long coffee-cigarette breaks

Storage Configuration - Arrays

Very Many Spindles

Shown: two 12-disk arrays per 24-disk enclosure. Options: between 6 and 16 disks per array. SAN systems may recommend R10 4+4 or R5 7+1.

Comment on Meta LUN

Data Consumption Rate: Xeon

TPC-H Query 1 (Lineitem scan; SF1 ≈ 1GB, 875MB in SQL Server 2008 format)

Processors    GHz   Total Cores  Mem GB  SQL    Q1 sec  SF    Total MB/s  MB/s per core  Architecture
2 Xeon 5355   2.66  8            64      5sp2   85.4    100   1,165.5     145.7          Conroe
2 Xeon 5570   2.93  8            144     8sp1   42.2    100   2,073.5     259.2          Nehalem
2 Xeon 5680   3.33  12           192     8r2    21.0    100   4,166.7     347.2          Westmere
4 Xeon 7560   2.26  32           640     8r2    37.2    300   7,056.5     220.5          Nehalem-EX
8 Xeon 7560   2.26  64           512     8r2    183.8   3000  14,282      223.2          Nehalem-EX

Data consumption rate is much higher for the current-generation Nehalem and Westmere processors than for the Core 2 systems referenced in the Microsoft FTDW document. TPC-H Q1 is more compute-intensive than the FTDW light query.
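The MB/s columns in these tables follow directly from the scan size and the query time:

Total MB/s ≈ (SF x 875 MB) / Q1 seconds for the SQL Server 2008 results (≈ 1 GB per SF for SQL Server 2005); MB/s per core = Total MB/s / total cores.
Example: 2 Xeon 5570 at SF100 – 100 x 875 / 42.2 ≈ 2,073 MB/s, and 2,073 / 8 cores ≈ 259 MB/s per core.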

Data Consumption Rate: Opteron

TPC-H Query 1 (Lineitem scan; SF1 ≈ 1GB, 875MB in SQL Server 2008 format)

Processors    GHz   Total Cores  Mem GB  SQL    Q1 sec  SF    Total MB/s  MB/s per core  Architecture
4 Opt 8220    2.8   8            128     5rtm   309.7   300   868.7       121.1          -
8 Opt 8360    2.5   32           256     8rtm   91.4    300   2,872.0     89.7           Barcelona
8 Opt 8384    2.7   32           256     8rtm   72.5    300   3,620.7     113.2          Shanghai
8 Opt 8439    2.8   48           256     8sp1   49.0    300   5,357.1     111.6          Istanbul
8 Opt 8439    2.8   48           512     8rtm   166.9   1000  5,242.7     109.2          Istanbul
2 Opt 6176    2.3   24           192     8r2    20.2    100   4,331.7     180.5          Magny-Cours
4 Opt 6176    2.3   48           512     8r2    31.8    300   8,254.7     172.0          Magny-Cours

Expected Istanbul to have better performance per core than Shanghai due to HT Assist. Magny-Cours has much better performance per core (at 2.3GHz versus 2.8GHz for Istanbul) – or is this Windows/SQL Server 2008 R2?

Data Consumption Rate

TPC-H Query 1 (Lineitem scan; SF1 ≈ 1GB, 875MB in SQL Server 2008 format)

Processors    GHz   Total Cores  Mem GB  SQL    Q1 sec  SF    Total MB/s  MB/s per core  Architecture
2 Xeon 5355   2.66  8            64      5sp2   85.4    100   1,165.5     145.7
2 Xeon 5570   2.93  8            144     8sp1   42.2    100   2,073.5     259.2
2 Xeon 5680   3.33  12           192     8r2    21.0    100   4,166.7     347.2
2 Opt 6176    2.3   24           192     8r2    20.2    100   4,331.7     180.5
4 Opt 8220    2.8   8            128     5rtm   309.7   300   868.7       121.1
8 Opt 8360    2.5   32           256     8rtm   91.4    300   2,872.0     89.7           Barcelona
8 Opt 8384    2.7   32           256     8rtm   72.5    300   3,620.7     113.2          Shanghai
8 Opt 8439    2.8   48           256     8sp1   49.0    300   5,357.1     111.6          Istanbul
4 Opt 6176    2.3   48           512     8r2    31.8    300   8,254.7     172.0          Magny-Cours
8 Xeon 7560   2.26  64           512     8r2    183.8   3000  14,282      223.2

Storage Targets

2U disk enclosure, 24 x 73GB 15K 2.5in disks: ~$14K, ~$600 per disk

Processors     Total Cores  BW/Core (MB/s)  Target MB/s  PCI-E x8 - x4  SAS HBA  Storage Units - Disks  Storage Units - Disks  Actual Bandwidth
2 Xeon X5680   12           350             4,200        5 - 1          2        2 - 48                 4 - 96                 5 GB/s
4 Opt 6176     48           175             8,400        5 - 1          4        4 - 96                 8 - 192                10 GB/s
4 Xeon X7560   32           250             8,000        6 - 4          6        6 - 144                12 - 288               15 GB/s
8 Xeon X7560   64           225             14,400       9 - 5          11       10 - 240               20 - 480               26 GB/s

8-way: 9 controllers in x8 slots with 24 disks per x4 SAS port, plus 2 controllers in x4 slots with 12 disks each.
With 24 15K disks per enclosure: 12 disks per x4 SAS port requires 100MB/sec per disk – possible but not always practical; 24 disks per x4 SAS port requires 50MB/sec – more achievable in practice.
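Worked example for the 8-way Xeon X7560 row: 64 cores x 225 MB/s per core = 14,400 MB/s target. At 50 MB/s per disk that is roughly 288 disks (24 per x4 SAS port); at 100 MB/s per disk, roughly 144 disks (12 per port). The 20-enclosure / 480-disk configuration with ~26 GB/s of controller bandwidth leaves headroom over the target.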

Think: Shortest path to metal (iron-oxide)

Your Storage and the Optimizer

Model      Disks  BW (KB/s)   Sequential IOPS  "Random" IOPS  Sequential/Random IO ratio
Optimizer  -      10,800      1,350            320            4.22
SAS 2x4    24     2,800,000   350,000          9,600          36.5
SAS 2x4    48     2,800,000   350,000          19,200         18.2
FC 4G      30     360,000     45,000           12,000         3.75
SSD        8      2,800,000   350,000          280,000        1.25

Assumptions

2.8GB/sec per SAS 2 x4 adapter; could be 3.2GB/sec per PCI-E G2 x8.
HDD: 400 IOPS per disk – big-query key lookup or loop join at high queue depth, short-stroked, possibly skip-seek.
SSD: 35,000 IOPS.
The SQL Server query optimizer makes key lookup versus table scan decisions based on a 4.22 sequential-to-random IO ratio. A DW-configured storage system has an 18-36 ratio; 30 disks per 4Gb FC port roughly matches the optimizer model; SSD is skewed in the other direction.
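The ratios in the table are simply sequential IOPS divided by random IOPS:

Optimizer model: 10,800 KB/s at 8KB pages = 1,350 sequential IOPS; 1,350 / 320 ≈ 4.22
SAS 2x4, 24 disks: 2,800,000 / 8 = 350,000 sequential IOPS; 24 x 400 = 9,600 random IOPS; 350,000 / 9,600 ≈ 36.5
SSD: 8 x 35,000 = 280,000 random IOPS; 350,000 / 280,000 = 1.25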


Data Consumption Rates

[Charts: TPC-H SF100 and SF300 data consumption rates for Queries 1, 9, 18, 21 – Xeon 5355/5570/5680 and 2-socket Opteron 6176 at SF100; Opteron DC/QC/6C/12C and Xeon 7560 configurations at SF300.]

Fast Track Reference Architecture

My complaints: several expensive SAN systems (11 disks each), each must be configured independently, $1,500-2,000 amortized per disk. Scripting?

Too many 2-disk arrays, 2 LUNs per array, too many data files. Build indexes with MAXDOP 1 – is this brain dead?

Designed around 100MB/sec per disk. Not all DW work is a single scan, or sequential.

Fragmentation

Weak Storage System

1) Fragmentation could degrade IO performance. 2) Defragmenting a very large table on a weak storage system could render the database anywhere from marginally functional to completely non-functional for a very long time.

[Diagram: storage stack – Table, File, Partition, LUN, Disk]

Powerful Storage System

3) Fragmentation has very little impact.

4) Defragmenting has mild impact, and completes within night time window.

What is the correct conclusion?
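A small T-SQL sketch for measuring actual fragmentation before deciding whether a defragmentation pass is worth the IO (table name is hypothetical):

-- Logical fragmentation and size per index; LIMITED mode reads the least IO
SELECT index_id, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.BigTable'), NULL, NULL, 'LIMITED');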

Operating System View of Storage

Operating System Disk View

Disks 2-7: Basic, 396GB, Online – one per controller port (Controller 1 Ports 0/1, Controller 2 Ports 0/1, Controller 3 Ports 0/1)

Additional disks not shown, Disk 0 is boot drive, 1 – install source?

File Layout

Each File Group is distributed across all data disks

Each data disk (Disks 2-7) is partitioned the same way:
Partition 0 – File Group for the big table, Files 1-6 (one file per disk)
Partition 1 – File Group for all other (small) tables, Files 1-6
Partition 2 – Tempdb, Files 1-6
Partition 4 – Backup and Load, Files 1-6

Log disks not shown; tempdb shares a common pool with the data.
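A sketch of this layout in T-SQL, with hypothetical database, file group, and volume names (one file per data disk, as in the layout above):

-- Dedicated file group for the big table, one file on each data disk's Partition 0
ALTER DATABASE DW ADD FILEGROUP FG_BigTable;
ALTER DATABASE DW ADD FILE
    (NAME = BigTable_1, FILENAME = 'E:\P0\BigTable_1.ndf', SIZE = 200GB),
    (NAME = BigTable_2, FILENAME = 'F:\P0\BigTable_2.ndf', SIZE = 200GB),
    (NAME = BigTable_3, FILENAME = 'G:\P0\BigTable_3.ndf', SIZE = 200GB)
    -- ...repeat for the remaining data disks
TO FILEGROUP FG_BigTable;
-- The small-table, tempdb, and backup/load file groups follow the same pattern on Partitions 1, 2, and 4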

File Groups and Files

Dedicated file group for the largest table – never defragment. One file group for all other regular tables. Load file group?

Rebuild indexes to different file group
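A T-SQL sketch of rebuilding an index into a different file group (index, table, column, and file group names are hypothetical; assumes the clustered index already exists):

-- DROP_EXISTING rebuilds the clustered index, and with it the table's data, into FG_Rebuild
CREATE CLUSTERED INDEX CI_BigTable
ON dbo.BigTable (DateKey)
WITH (DROP_EXISTING = ON, SORT_IN_TEMPDB = ON, MAXDOP = 8)
ON FG_Rebuild;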

Partitioning - Pitfalls

[Diagram: Disks 2-7 hold File Groups 1-6 respectively, with Table Partitions 1-6 mapped one-to-one onto the file groups.]

Common partitioning strategy: the partition scheme maps partitions to file groups. What happens in a table scan?

Read first from Part 1 then 2, then 3, … ?

SQL 2008 HF to read from each partition in parallel?

What if partitions have disparate sizes?
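A minimal T-SQL sketch of the strategy in the diagram – one file group per partition (function, scheme, table, column, and file group names are hypothetical):

-- RANGE RIGHT with 3 boundary values yields 4 partitions, so 4 file groups
CREATE PARTITION FUNCTION pf_Month (date)
AS RANGE RIGHT FOR VALUES ('2011-01-01', '2011-02-01', '2011-03-01');

CREATE PARTITION SCHEME ps_Month
AS PARTITION pf_Month TO (FG1, FG2, FG3, FG4);

-- Partitioned clustered index places each partition in its mapped file group
CREATE CLUSTERED INDEX CI_Fact
ON dbo.FactSales (OrderDate)
ON ps_Month (OrderDate);

Mapping every partition to one common file group instead (AS PARTITION pf_Month ALL TO (FG_Data)) avoids the scan-ordering question above.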