The Grade of the Steel
Finding the Limits of SQL Server
Thomas Kejser: [email protected]
Agenda
The Story So Far
BULK load
OLTP Workload
A summary of wait stats
Building Bridges
The New World – get used to it
BULK INSERT reloaded
SELECT: Getting the data out
INSERT: OLTP style
UPDATE: Big ones!
The Tuning Story so Far...
2006: Pretty big cube
  5M rows/sec
  4TB cube on dedicated SAN
2007: ETL World Record
  1TB in 28 minutes, 64 core run
  Four full racks of storage and machines
2008: New ETL WR / Early Fast Tracks
  1TB in 10 minutes, 96 core run, half a rack
  25GB/sec scan speeds, SuperDome, 3 server racks
  6GB/sec backup speed, early SSD drives
  16M rows/sec
2009: Big OLTP
  13,000 credit card tx/sec
  50K write tx/sec
  150 SQL tx/sec
  48 cores
  2 racks of storage
2010: Big Cube
  6.1M rows/sec processing on half-rack server (incl. storage)
  50K IOPS in SSAS on a 2-socket server! (2U)
  24TB cube on NAS
  1000 concurrent cube users on 2 x 2U servers
(chart: Throughput vs. DOP, DOP 1 to 41, throughput axis 0 to 60)
Cubes – Been there, done that!
Tested the limits
  24TB MOLAP
  ROLAP beating MOLAP
  6.1M rows/sec (record at 7.5M held by customer)
  50K IOPS, 1000 concurrent users (MOLAP)
A bit more to be said about ROLAP
  Currently at 1-sec latency cubes with 300 concurrent users
  But not quite happy yet...
Knowledge transferred
  SSAS Performance Guide 2008
  SSAS Operations Guide 2008R2
  SSAS Performance Guide 2008R2 (soon to be released)
  SSAS Maestro Training
Some interesting waits Analysed...
Wait/latch/spin | Likely reason (at scale) | Remedy
PAGEIOLATCH | Waiting for IOPS | More IOPS
PAGELATCH_UP | PFS contention | Add files
PAGELATCH_EX | Last-page insert | Hash partition, change indexing structures
SOS_SCHEDULER_YIELD | NUMA imbalance | Soft NUMA
ACCESS_METHOD_DATASET_PARENT | Table scan bottleneck | Hash partition
IMPROVIO_WAIT | Bulk source high latency | Lower latency of bulk source
OLEDB | Waiting for source | Tune source
ASYNC_NETWORK_IO | Network or client too slow | Tune client
SOS_CACHE_STORE | Security lookup caches | Prefix tables with schema
LOCK_HASH | Lock bottleneck | Adjust data model
ACCESS_METHOD_HOBT_VIRTUAL_ROOT | B-tree page splits expensive | Hash partition or re-index
ALLOC_FREESPACE_CACHE | Heap insert bottleneck | Hash partition
RESOURCE_SEMAPHORE | Memory pressure from large queries | Use Resource Governor to lower max memory per query
CXPACKET | Other bottleneck; INSERT backpressure | Look elsewhere for issues; spawn multiple inserters at lower DOP
LCK_X / LCK_U | Poor concurrency design | Optimize data model
LCK_SCH_M | Metadata operation blocking | Apply workarounds, partition data
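A quick way to see which of the waits in the table dominate a given server is the wait-stats DMV (a standard SQL Server view; this query is a generic diagnostic, not one from the talk):

```sql
-- Top waits by accumulated wait time since the last restart/clear.
SELECT TOP (20)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms   -- time spent runnable, not waiting on the resource
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0
ORDER BY wait_time_ms DESC;
```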
Building Bridges
Business Requirements
  How many cars?
  How to secure against disaster?
  Disruption of local ecosystem
Drive Engineering
  How strong is the steel?
  How tall do the pylons need to be?
  Which cranes/tools will I need?
Image attribution: wikipedia.com
Core Revolution – Cheap Steel
Old CPU Knowledge
  Cores (~ sockets) are expensive
  Bigger hardware makes bad systems scale better
  Memory can feed the cores fast enough
New CPU Knowledge
  Cores are dirt cheap
  Bigger hardware makes bad systems scale worse
  Memory is too slow
  L2 caches really matter
  NUMA really matters
  Performance/Watt is becoming a concern
NAND revolution – Tough Concrete
Old I/O knowledge
  Deep queues are good
  Fewer threads are better
  Sequential access is king
  Sharing makes storage cheap (SAN)
New I/O knowledge
  Small queue depths make the system faster
  Wide and multi-threaded is best
  Random = Sequential
  Sharing makes storage slow
(chart: 8K IOPS on a Fusion-io drive; IOPS and latency (ms) vs. queue depth 32 to 256, total outstanding I/O kept constant at 256)
BULK INSERT - Reloaded
"Thomas, you might have gotten 16M rows/sec at 3GB/sec insert speed. But this was on heaps; I have a clustered table."
Alright then, let us hit a clustered index
Clustered and partitioned
(diagram: X locks taken on key ranges 1-1000, 1001-2000, 2001-3000, 3001-4000)
Cluster Bulk – It seemed so plausible!
Cluster Bulk – Stage and Switch
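The stage-and-switch pattern can be sketched as follows. All names are hypothetical: the staging table must match the target's schema, clustered index and filegroup, and carry a CHECK constraint proving its rows belong to the target partition:

```sql
-- Bulk load into an empty, identically-structured staging table...
BULK INSERT dbo.Fact_Stage
FROM 'D:\load\chunk2.dat'
WITH (TABLOCK);

-- ...then switch it in as partition 2 of the target: a
-- metadata-only operation, independent of the data volume.
ALTER TABLE dbo.Fact_Stage
    SWITCH TO dbo.Fact PARTITION 2;
```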
Bulking at Concurrency
What’s that spin?
SELECT * FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC

DBCC SQLPERF (spinlockstats)

xperf -on latency -stackwalk profile
xperf -d trace.etl
xperfview trace.etl
SOS_OBJECT_STORE
In this case: Caused by the lock manager being
too busy acquiring locks
Need: Reduce locking overhead
Fixes that work well here:
x 8 Speedup!
End result: 16-core server, bulk insert at 4M rows/sec
…simpler data structure than the TPC-H LINEITEM table
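The slide does not spell out the fixes; one common way to cut lock-manager overhead on bulk loads (an assumption here, not necessarily the exact fix from the talk, and with hypothetical names) is to take a single table lock instead of many row/page locks:

```sql
-- Stop row/page lock acquisition on the target's indexes...
ALTER INDEX ALL ON dbo.Fact
    SET (ALLOW_PAGE_LOCKS = OFF, ALLOW_ROW_LOCKS = OFF);

-- ...and load under one X table lock.
BULK INSERT dbo.Fact
FROM 'D:\load\chunk1.dat'
WITH (TABLOCK);
```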
SELECT – Get me Hardware!
Big, well-designed Fact Table
  Billions of rows
  9 x INT
  1 x SMALLINT
  2 x NUMERIC(28,8)
  1 x CHAR(10)
Big Server
  64 cores
  2 TB memory
  My own little 10Gbit network
How fast can SQL Server extract rows?
SELECT – Naïve, First tries
Run 1 x SELECT
  Doesn't work: single-threaded bottleneck in the network stack
Run 64 x SELECT
  Doesn't work: SOS_SCHEDULER_YIELD overload
Run 64 SELECTs at DOP 1
  SQLCMD.exe > NUL
  Looks promising (5M rows/sec), but VERY expensive at the client
  Burning 16 cores on the client for every 2 cores on the server
  Rendering is REALLY expensive! SQLCMD still consumes rows before piping to NUL
Observe: LARGE usage of CPU time to acquire locks (LOCK_HASH spins)
  Put a NOLOCK hint in there: 30% throughput gain!
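The 30% gain comes from skipping shared-lock acquisition on every read; with the usual dirty-read caveats, the hint looks like this (table name hypothetical):

```sql
-- Read uncommitted: no S locks taken, so no LOCK_HASH spinlock
-- traffic, at the cost of possibly reading uncommitted rows.
SELECT *
FROM dbo.Fact WITH (NOLOCK);
```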
Building the test harness
One Harness.exe affinitized per client NUMA node…
.NET doesn't do affinity (though you can hack it..)
START /AFFINITY F000 MyApp.exe
SELECT - Time for Xperf
Getting 20M rows/sec with two clients
Very high kernel times on cores 0,1 on the clients
Can't push more, even though the network can do more
Where is the kernel time going?
xperf -on latency
Optimizing the Network
Enabled Jumbo Frames (9014 bytes)
Changed the DB connection string: network packet=32767 (not a typo, this is 2**15-1)
The two-core bottleneck can be remedied with more RSS queues
Eureka: getting 30M rows/sec
But alas, out of client cores
… and client NIC
Adding the final clients
End Configuration
4 x 16 Core Server
1 x 64 Core Server with HT Enabled (about 10% extra throughput)
152 x MAXDOP 1 queries with .NET test harness
NOLOCK hint
Result: 41M rows/sec
Full Server utilization, near linear scale
INSERT Statement
The Paul Randal INSERT Challenge…
160M rows, executing at concurrency, commit every 1K
EASY tuning?
Expectations?
Transaction log bottleneck
Need super low latency I/O
26µs latency Fusion
Checkpoints need to be fast
Lots of page splits with NEWID()
Could we tune for a faster key type?
Plenty of IOPS here
Add cores until it tanks…
What can possibly go wrong?
But first, PAGELATCH
(chart: Runtime, approx. 260 to 400 s, and PAGELATCH waits, log scale 1 to 10,000,000, vs. number of data files 0 to 60; series: PAGELATCH, Runtime, Runtime -E, PAGELATCH_UP)
… Life at 600K inserts/sec
Transaction Log – A Complicated Slide…
The Log Writer is single threaded but uses async I/O completion ports
  One thread per instance
Waits for WRITELOG are measured by the threads waiting for the commit
  Note: if these threads take time to get through the runnable queue back onto the CPU, WRITELOG may not accurately reflect the I/O subsystem
  When in doubt, compare WRITELOG waits against the wait times exposed by virtual file stats
Log buffers are flushed to disk when either a "commit" is issued or the buffer is full
The semaphore of the buffer offset is protected by a spinlock
Spinning: LOGCACHE_ACCESS
  Spin: another thread has the offset
  A buffer being marked as done by another thread is reported as a backoff
(diagram: at T0 a thread allocates a slot in the log buffer under the LOGCACHE_ACCESS spinlock, memcpys the slot content into LOGBUFFER (slots 1..n); at Tn the buffer is placed on the log flush queue (LOGFLUSHQ); the Log Writer drains the writer queue, issues the write through the async I/O completion port, and signals the thread that issued the commit (WRITELOG))
Why Did I put in the Logging Slide?
Should be the same as this… or?
No! Because:
Takeaway: on high-scale systems, you can't trust WRITELOG to report the I/O latency of the log file. Use sys.dm_io_virtual_file_stats instead!
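Log-file write latency as the I/O layer actually saw it can be read from the virtual file stats DMV (a standard function; this query is a generic sketch, not taken from the deck):

```sql
-- Average write stall per log write; compare against the average
-- WRITELOG wait to spot time lost in the runnable queue instead.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.name AS log_file,
       vfs.num_of_writes,
       vfs.io_stall_write_ms,
       vfs.io_stall_write_ms * 1.0
           / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.file_id
WHERE mf.type_desc = 'LOG';
```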
Another look at this test…
= BAD!
= Better!...
Should be faster? Right?
WRONG!
And the score… (smaller is better)
(chart: runtime, smaller is better, y-axis 0 to 35,000; bars for newguid(), newsequentialid(), IDENTITY)
INSERT end page – Hash it?
Hash partition the table
Create multiple B-trees
Round robin between the B-trees: more resources and less contention
(diagram: IDs hashed modulo 8 into buckets 0-7; e.g. IDs 0,8,16 land in bucket 0, IDs 1,9,17 in bucket 1, and so on)
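One way to build the eight hash-partitioned B-trees (a sketch with hypothetical names, not the talk's exact DDL) is a persisted computed column on ID % 8 used as the partition key:

```sql
-- 7 boundary values = 8 partitions, one B-tree per hash bucket.
CREATE PARTITION FUNCTION pf_hash8 (TINYINT)
    AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6);

CREATE PARTITION SCHEME ps_hash8
    AS PARTITION pf_hash8 ALL TO ([PRIMARY]);

CREATE TABLE dbo.MyBigTable
(
    ID      BIGINT IDENTITY NOT NULL,
    Payload CHAR(10) NOT NULL,
    -- Partitioning column must be persisted and part of the key.
    HashKey AS CAST(ID % 8 AS TINYINT) PERSISTED NOT NULL,
    CONSTRAINT PK_MyBigTable
        PRIMARY KEY CLUSTERED (HashKey, ID)
) ON ps_hash8 (HashKey);
```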
Spinning on IDENTITY
(chart: runtime, y-axis 0 to 30,000, for IDENTITY + Unique, IDENTITY + Hash24, IDENTITY + Hash48; the Current++ increment of the IDENTITY counter hits a single cache line)
Why is 48 not that much better than 24?
SELECT *
FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC
Insert end Page – Partition It
Do not use a sequential key
Distribute the inserts all over the B-tree
(diagram: concurrent INSERTs hitting key ranges 0-1000, 1001-2000, 2001-3000, 3001-4000)
(chart: runtime in seconds, smaller is better, y-axis 0 to 35,000; annotations: 830K inserts/sec, 600K inserts/sec, all cores at ~100%; key strategies: SPID + Offset, IDENTITY + Hash48, IDENTITY + Hash24, IDENTITY + Unique + Hash8, IDENTITY + Unique, IDENTITY, newsequentialid(), newguid())
But Wait – There is more!
CoreInfo.exe
sp_configure 'affinity I/O mask'
= 1.1M / sec
(chart: runtime, SPID + Offset vs. SPID + Affinity, y-axis 0 to 250)
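Setting the I/O affinity mask ('affinity I/O mask' is the sp_configure option name; it is an advanced option and only takes effect after an instance restart) pins I/O completion work to specific CPUs:

```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- Bitmask of CPUs dedicated to I/O completion (here CPUs 0 and 1).
EXEC sp_configure 'affinity I/O mask', 3;
RECONFIGURE;  -- requires an instance restart to take effect
```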
UPDATE
UPDATE MyBigTable
SET c6 = 43

UPDATE MyBigTable
SET c6 = 43
WHERE key BETWEEN 10**9 AND 2 * 10**9 - 1

(chart: runtime, y-axis 0 to 300; bars: Insert (Optimized), Naive UPDATE, Parallel Update, Parallel Update (Clean buffer))
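The parallel update splits the table into disjoint key ranges, one session per range, so the sessions never contend on the same pages ("10**9" on the slide is shorthand for 10^9; table and column names as on the slide):

```sql
-- One of N concurrent sessions, each owning a disjoint key range.
UPDATE dbo.MyBigTable
SET c6 = 43
WHERE [key] BETWEEN 1000000000 AND 1999999999;  -- 10^9 .. 2*10^9 - 1
```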
Summary
The basic statements scale
  SELECT (and network)
  BULK INSERT (with clustered index)
  INSERT, UPDATE
  Must generally parallelize manually: build a crane to lift it!
The transaction log is the final bottleneck
  Latency really matters
  And you can't trust WRITELOG with it
For max scale, keys must be crafted carefully
  IDENTITY and NEWSEQUENTIALID are NOT optimal!
Image attribution: civilengineering.com
THANK YOU!
For attending this session and
PASS SQLRally Nordic 2011, Stockholm