The Grade of the Steel
Finding the Limits of SQL Server
Thomas Kejser: [email protected]
Agenda
The Story So Far
BULK load
OLTP Workload
A summary of wait stats
Building Bridges
The New World – get used to it
BULK INSERT reloaded
SELECT: Getting the data out
INSERT: OLTP style
UPDATE: Big ones!
The Tuning Story so Far...
2006: Pretty big cube
  5M rows/sec
  4TB cube on dedicated SAN
2007: ETL World Record
  1TB in 28 minutes, 64 core run
  Four full racks of storage and machines
2008: New ETL WR / Early Fast Tracks
  1TB in 10 minutes, 96 core run, half a rack
  25GB/sec scan speeds, SuperDome, 3 server racks
  6GB/sec backup speed, early SSD drives
  16M rows/sec
2009: Big OLTP
  13,000 credit card tx/sec
  50K write tx/sec
  150 SQL tx/sec
  48 cores
  2 racks of storage
2010: Big Cube
  6.1M rows/sec processing on half-rack server (incl. storage)
  50K IOPS in SSAS on a 2-socket server! (2U)
  24TB cube on NAS
  1000 concurrent cube users on 2 x 2U servers
(chart: Throughput vs. DOP, DOP 1 to 41, throughput axis 0 to 60)
Cubes – Been there, done that!
Tested the limits
  24TB MOLAP
  ROLAP beating MOLAP
  6.1M rows/sec (record at 7.5M held by customer)
  50K IOPS, 1000 concurrent users (MOLAP)
A bit more to be said about ROLAP
  Currently at 1-sec latency cubes with 300 concurrent users
  But not quite happy yet...
Knowledge transferred
  SSAS Performance Guide 2008
  SSAS Operations Guide 2008R2
  SSAS Performance Guide 2008R2 (soon to be released)
  SSAS Maestro Training
Some interesting waits Analysed...
Wait/latch/spin | Likely reason (at scale) | Remedy
PAGEIOLATCH | Waiting for IOPS | More IOPS
PAGELATCH_UP | PFS contention | Add files
PAGELATCH_EX | Last-page insert | Hash partition, change indexing structures
SOS_SCHEDULER_YIELD | NUMA imbalance | Soft NUMA
ACCESS_METHOD_DATASET_PARENT | Table scan bottleneck | Hash partition
IMPROVIO_WAIT | Bulk source high latency | Lower latency of bulk source
OLEDB | Waiting for source | Tune source
ASYNC_NETWORK_IO | Network or client too slow | Tune client
SOS_CACHE_STORE | Security lookup caches | Prefix tables with schema
LOCK_HASH | Lock bottleneck | Adjust data model
ACCESS_METHOD_HOBT_VIRTUAL_ROOT | B-tree page splits expensive | Hash partition or re-index
ALLOC_FREESPACE_CACHE | Heap insert bottleneck | Hash partition
RESOURCE_SEMAPHORE | Memory pressure from large queries | Use Resource Governor to lower max memory per query
CXPACKET | Other bottleneck; INSERT backpressure | Look elsewhere for issues; spawn multiple inserters at lower DOP
LCK_X / LCK_U | Poor concurrency design | Optimize data model
LCK_SCH_M | Metadata operation blocking | Apply workarounds, partition data
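A quick way to see which of the waits in the table dominate a given server is the wait-stats DMV (a standard SQL Server view; this query is a generic diagnostic, not one from the talk):

```sql
-- Top waits by accumulated wait time since the last restart/clear.
SELECT TOP (20)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms   -- time spent runnable, not waiting on the resource
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0
ORDER BY wait_time_ms DESC;
```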
Building Bridges
Business Requirements
  How many cars?
  How to secure against disaster?
  Disruption of local ecosystem
Drive Engineering
  How strong is the steel?
  How tall do the pylons need to be?
  Which cranes/tools will I need?
Image attribution: wikipedia.com
Core Revolution – Cheap Steel
Old CPU Knowledge
  Cores (~ sockets) are expensive
  Bigger hardware makes bad systems scale better
  Memory can feed the cores fast enough
New CPU Knowledge
  Cores are dirt cheap
  Bigger hardware makes bad systems scale worse
  Memory is too slow
  L2 caches really matter
  NUMA really matters
  Performance/Watt is becoming a concern
NAND revolution – Tough Concrete
Old I/O knowledge
  Deep queues are good
  Fewer threads are better
  Sequential access is king
  Sharing makes storage cheap (SAN)
New I/O knowledge
  Small queue depths make the system faster
  Wide and multi-threaded is best
  Random = Sequential
  Sharing makes storage slow
(chart: 8K IOPS on a Fusion-io drive; IOPS and latency (ms) vs. queue depth 32 to 256, total outstanding I/O kept constant at 256)
BULK INSERT - Reloaded
"Thomas, you might have gotten 16M rows/sec at 3GB/sec insert speed. But this was on heaps; I have a clustered table."
Alright then, let us hit a clustered index
Clustered and partitioned
(diagram: X locks taken on key ranges 1-1000, 1001-2000, 2001-3000, 3001-4000)
Cluster Bulk – It seemed so plausible!
Cluster Bulk – Stage and Switch
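The stage-and-switch pattern can be sketched as follows. All names are hypothetical: the staging table must match the target's schema, clustered index and filegroup, and carry a CHECK constraint proving its rows belong to the target partition:

```sql
-- Bulk load into an empty, identically-structured staging table...
BULK INSERT dbo.Fact_Stage
FROM 'D:\load\chunk2.dat'
WITH (TABLOCK);

-- ...then switch it in as partition 2 of the target: a
-- metadata-only operation, independent of the data volume.
ALTER TABLE dbo.Fact_Stage
    SWITCH TO dbo.Fact PARTITION 2;
```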
Bulking at Concurrency
What’s that spin?
SELECT * FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC

DBCC SQLPERF (spinlockstats)

xperf -on latency -stackwalk profile
xperf -d trace.etl
xperfview trace.etl
SOS_OBJECT_STORE
In this case: Caused by the lock manager being
too busy acquiring locks
Need: Reduce locking overhead
Fixes that work well here:
x 8 Speedup!
End result: 16-core server, bulk insert at 4M rows/sec
…simpler data structure than the TPC-H LINEITEM table
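The slide does not spell out the fixes; one common way to cut lock-manager overhead on bulk loads (an assumption here, not necessarily the exact fix from the talk, and with hypothetical names) is to take a single table lock instead of many row/page locks:

```sql
-- Stop row/page lock acquisition on the target's indexes...
ALTER INDEX ALL ON dbo.Fact
    SET (ALLOW_PAGE_LOCKS = OFF, ALLOW_ROW_LOCKS = OFF);

-- ...and load under one X table lock.
BULK INSERT dbo.Fact
FROM 'D:\load\chunk1.dat'
WITH (TABLOCK);
```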
SELECT – Get me Hardware!
Big, well-designed Fact Table
  Billions of rows
  9 x INT
  1 x SMALLINT
  2 x NUMERIC(28,8)
  1 x CHAR(10)
Big Server
  64 cores
  2 TB memory
  My own little 10Gbit network
How fast can SQL Server extract rows?
SELECT – Naïve, First tries
Run 1 x SELECT
  Doesn't work: single-threaded bottleneck in the network stack
Run 64 x SELECT
  Doesn't work: SOS_SCHEDULER_YIELD overload
Run 64 SELECTs at DOP 1
  SQLCMD.exe > NUL
  Looks promising (5M rows/sec), but VERY expensive at the client
  Burning 16 cores on the client for every 2 cores on the server
  Rendering is REALLY expensive! SQLCMD still consumes rows before piping to NUL
Observe: LARGE usage of CPU time to acquire locks (LOCK_HASH spins)
  Put a NOLOCK hint in there: 30% throughput gain!
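The 30% gain comes from skipping shared-lock acquisition on every read; with the usual dirty-read caveats, the hint looks like this (table name hypothetical):

```sql
-- Read uncommitted: no S locks taken, so no LOCK_HASH spinlock
-- traffic, at the cost of possibly reading uncommitted rows.
SELECT *
FROM dbo.Fact WITH (NOLOCK);
```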
Building the test harness
One Harness.exe affinitized per client NUMA node…
.NET doesn't do affinity (though you can hack it..)
START /AFFINITY F000 MyApp.exe
SELECT - Time for Xperf
Getting 20M rows/sec with two clients
Very high kernel times on cores 0,1 on the clients
Can't push more, even though the network can do more
Where is the kernel time going?
xperf -on latency
Optimizing the Network
Enabled Jumbo Frames (9014 bytes)
Changed the DB connection string: network packet=32767 (not a typo, this is 2**15-1)
The two-core bottleneck can be remedied with more RSS queues
Eureka: getting 30M rows/sec
But alas, out of client cores
… and client NIC
Adding the final clients
End Configuration
4 x 16 Core Server
1 x 64 Core Server with HT Enabled (about 10% extra throughput)
152 x MAXDOP 1 queries with .NET test harness
NOLOCK hint
Result: 41M rows/sec
Full Server utilization, near linear scale
INSERT Statement
The Paul Randal INSERT Challenge…
160M rows, executing at concurrency, commit every 1K
EASY tuning?
Expectations?
Transaction log bottleneck
Need super low latency I/O
26µs latency Fusion
Checkpoints need to be fast
Lots of page splits with NEWID()
Could we tune for a faster key type?
Plenty of IOPS here
Add cores until it tanks…
What can possibly go wrong?
But first, PAGELATCH
(chart: Runtime, approx. 260 to 400 s, and PAGELATCH waits, log scale 1 to 10,000,000, vs. number of data files 0 to 60; series: PAGELATCH, Runtime, Runtime -E, PAGELATCH_UP)
… Life at 600K inserts/sec
Transaction Log – A Complicated Slide…
The Log Writer is single threaded but uses async I/O completion ports
  One thread per instance
Waits for WRITELOG are measured by the threads waiting for the commit
  Note: if these threads take time to get through the runnable queue back onto the CPU, WRITELOG may not accurately reflect the I/O subsystem
  When in doubt, compare WRITELOG waits against the wait times exposed by virtual file stats
Log buffers are flushed to disk when either a "commit" is issued or the buffer is full
The semaphore of the buffer offset is protected by a spinlock
Spinning: LOGCACHE_ACCESS
  Spin: another thread has the offset
  A buffer being marked as done by another thread is reported as a backoff
(diagram: at T0 a thread allocates a slot in the log buffer under the LOGCACHE_ACCESS spinlock, memcpys the slot content into LOGBUFFER (slots 1..n); at Tn the buffer is placed on the log flush queue (LOGFLUSHQ); the Log Writer drains the writer queue, issues the write through the async I/O completion port, and signals the thread that issued the commit (WRITELOG))
Why Did I put in the Logging Slide?
Should be the same as this… or?
No! Because:
Takeaway: on high-scale systems, you can't trust WRITELOG to report the I/O latency of the log file. Use sys.dm_io_virtual_file_stats instead!
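Log-file write latency as the I/O layer actually saw it can be read from the virtual file stats DMV (a standard function; this query is a generic sketch, not taken from the deck):

```sql
-- Average write stall per log write; compare against the average
-- WRITELOG wait to spot time lost in the runnable queue instead.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.name AS log_file,
       vfs.num_of_writes,
       vfs.io_stall_write_ms,
       vfs.io_stall_write_ms * 1.0
           / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.file_id
WHERE mf.type_desc = 'LOG';
```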
Another look at this test…
= BAD!
= Better!...
Should be faster? Right?
WRONG!
And the score… (smaller is better)
(chart: runtime, smaller is better, y-axis 0 to 35,000; bars for newguid(), newsequentialid(), IDENTITY)
INSERT end page – Hash it?
Hash partition the table
Create multiple B-trees
Round robin between the B-trees: more resources and less contention
(diagram: IDs hashed modulo 8 into buckets 0-7; e.g. IDs 0,8,16 land in bucket 0, IDs 1,9,17 in bucket 1, and so on)
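One way to build the eight hash-partitioned B-trees (a sketch with hypothetical names, not the talk's exact DDL) is a persisted computed column on ID % 8 used as the partition key:

```sql
-- 7 boundary values = 8 partitions, one B-tree per hash bucket.
CREATE PARTITION FUNCTION pf_hash8 (TINYINT)
    AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6);

CREATE PARTITION SCHEME ps_hash8
    AS PARTITION pf_hash8 ALL TO ([PRIMARY]);

CREATE TABLE dbo.MyBigTable
(
    ID      BIGINT IDENTITY NOT NULL,
    Payload CHAR(10) NOT NULL,
    -- Partitioning column must be persisted and part of the key.
    HashKey AS CAST(ID % 8 AS TINYINT) PERSISTED NOT NULL,
    CONSTRAINT PK_MyBigTable
        PRIMARY KEY CLUSTERED (HashKey, ID)
) ON ps_hash8 (HashKey);
```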
Spinning on IDENTITY
(chart: runtime, y-axis 0 to 30,000, for IDENTITY + Unique, IDENTITY + Hash24, IDENTITY + Hash48; the Current++ increment of the IDENTITY counter hits a single cache line)
Why is 48 not that much better than 24?
SELECT *
FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC
Insert end Page – Partition It
Do not use a sequential key
Distribute the inserts all over the B-tree
(diagram: concurrent INSERTs hitting key ranges 0-1000, 1001-2000, 2001-3000, 3001-4000)
(chart: runtime in seconds, smaller is better, y-axis 0 to 35,000; annotations: 830K inserts/sec, 600K inserts/sec, all cores at ~100%; key strategies: SPID + Offset, IDENTITY + Hash48, IDENTITY + Hash24, IDENTITY + Unique + Hash8, IDENTITY + Unique, IDENTITY, newsequentialid(), newguid())
But Wait – There is more!
CoreInfo.exe
sp_configure 'affinity I/O mask'
= 1.1M / sec
(chart: runtime, SPID + Offset vs. SPID + Affinity, y-axis 0 to 250)
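Setting the I/O affinity mask ('affinity I/O mask' is the sp_configure option name; it is an advanced option and only takes effect after an instance restart) pins I/O completion work to specific CPUs:

```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- Bitmask of CPUs dedicated to I/O completion (here CPUs 0 and 1).
EXEC sp_configure 'affinity I/O mask', 3;
RECONFIGURE;  -- requires an instance restart to take effect
```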
UPDATE
UPDATE MyBigTable
SET c6 = 43

UPDATE MyBigTable
SET c6 = 43
WHERE key BETWEEN 10**9 AND 2 * 10**9 - 1

(chart: runtime, y-axis 0 to 300; bars: Insert (Optimized), Naive UPDATE, Parallel Update, Parallel Update (Clean buffer))
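The parallel update splits the table into disjoint key ranges, one session per range, so the sessions never contend on the same pages ("10**9" on the slide is shorthand for 10^9; table and column names as on the slide):

```sql
-- One of N concurrent sessions, each owning a disjoint key range.
UPDATE dbo.MyBigTable
SET c6 = 43
WHERE [key] BETWEEN 1000000000 AND 1999999999;  -- 10^9 .. 2*10^9 - 1
```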
Summary
The basic statements scale
  SELECT (and network)
  BULK INSERT (with clustered index)
  INSERT, UPDATE
  Must generally parallelize manually: build a crane to lift it!
The transaction log is the final bottleneck
  Latency really matters
  And you can't trust WRITELOG with it
For max scale, keys must be crafted carefully
  IDENTITY and NEWSEQUENTIALID are NOT optimal!
Image attribution: civilengineering.com
THANK YOU!
For attending this session and
PASS SQLRally Nordic 2011, Stockholm