Evolving SQL Server for Modern Hardware


Paul Larson, Eric N. Hanson, Mike Zwilling
Microsoft
plus the many members of the Apollo and Hekaton teams
Paul Larson, ICDE 2015
Evolving the architecture of SQL Server
• Hardware has changed drastically since the 1980s
• Large, cheap memory; plenty of cores; SSDs
• Deep architectural changes needed
• To fully exploit the changing hardware
• To adapt to different workloads
• Apollo: column store optimized for data warehousing
• Hekaton: in-memory row store optimized for OLTP
• Queries and transactions can cross all three engines
[Figure: SQL Server 2014 architecture. A common front end (API, catalog, language, management, security, …) sits on top of three engines: the Apollo column-store engine (optimized for data warehousing), the classical row-store engine (general purpose), and the Hekaton row-store engine (optimized for OLTP). All three share a common back end (storage, backup, high availability, resource management, …).]
Agenda
• Apollo column store engine
• Hekaton in-memory OLTP engine
• Looking ahead
Why add columnar storage?
1. Column stores beat the pants off row stores on analytical queries
2. Increased importance of analytical workloads
• Analytical queries scan lots of rows but access only a few columns
• Row stores excel at OLTP-type workloads
• Short requests accessing a few rows
• Lots of overhead when scanning many rows
• Indexes and materialized views help
• But they are expensive to store and maintain
• Column stores excel at analytical queries
• Fast scans, reduced storage, reduced IO, less memory wasted
• But updates tend to be slow, small lookups are expensive
What’s in Apollo?
• Column store indexes
• An index that stores data column-wise (instead of row-wise)
• Can be used as a primary index (base storage) or as a secondary index
• Compressed to save space
• Optimized for scans
• Batch mode (vectorized) operators
• Process batches of rows (~1000) instead of one row at a time
• Scan operator with filtering and aggregation on compressed data
• Operators for select, hash join, hash aggregation, union all
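The batch-at-a-time idea can be sketched in plain Python. This is a toy model only: the data, the `filtered_sum` function, and the batch size mirror the bullets above, but SQL Server's batch operators are of course compiled engine internals, not Python.

```python
# Sketch of batch-mode (vectorized) execution: operators exchange
# batches of ~1000 column values instead of single rows.
# Illustrative model; not SQL Server code.

BATCH_SIZE = 1000

def batches(column_a, column_b, size=BATCH_SIZE):
    """Yield (vector_a, vector_b) column batches of up to `size` rows."""
    for i in range(0, len(column_a), size):
        yield column_a[i:i + size], column_b[i:i + size]

def filtered_sum(column_a, column_b, threshold):
    """SELECT SUM(b) WHERE a > threshold, processed one batch at a time."""
    total = 0
    for vec_a, vec_b in batches(column_a, column_b):
        # The filter produces a selection vector; aggregation consumes it.
        selected = [i for i, a in enumerate(vec_a) if a > threshold]
        total += sum(vec_b[i] for i in selected)
    return total

a = list(range(5000))
b = [2] * 5000
print(filtered_sum(a, b, 4000))  # rows 4001..4999 qualify: 999 * 2 = 1998
```

The per-batch loop amortizes interpretation overhead over ~1000 rows, which is the main reason batch mode beats row-at-a-time execution on scans.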
Index creation and storage
[Figure: rows are divided into row groups (row group 1, 2, 3, …). Within each row group, each column (A, B, C) is encoded and compressed into a segment; segments are stored as blobs, with a per-column dictionary and a directory that locates the segments.]
• Also have a global dictionary per column (not shown)
Column store compression
1. Encoding – convert to integers
• Value-based encoding (linear transformation)
• Dictionary encoding
2. Row reordering
• Find optimal permutation of rows (best compression)
• Proprietary algorithm
3. Compression
• Run length encoding (value + number of consecutive repeats)
• Bit packing (use min number of bits)
4. Optional on-disk archival compression (Lempel-Ziv)
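Two of the steps above, run-length encoding and bit packing, can be sketched in a few lines of Python. These are simplified textbook versions; SQL Server's actual segment formats differ.

```python
# Sketch of two compression steps from the list above.
# Simplified; not SQL Server's on-disk format.

def rle(values):
    """Run-length encode: list of (value, consecutive-repeat-count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def bit_pack(values):
    """Pack non-negative ints using the minimum number of bits per value."""
    width = max(values).bit_length() or 1
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)
    return packed, width

print(rle([5, 5, 5, 7, 7, 5]))       # [(5, 3), (7, 2), (5, 1)]
packed, width = bit_pack([3, 1, 2])  # max value 3 fits in 2 bits
print(width)                         # 2
```

Step 2 (row reordering) matters because it lengthens the runs that `rle` sees; sorted or clustered columns compress far better than random ones.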
Observed compression ratios
Database      Raw data     Compression ratio
name          size (GB)    No archival   With archival   GZIP
EDW           95.4         5.84          9.33            4.85
Sim           41.3         2.2           3.65            3.08
Telco         47.1         3.0           5.27            5.1
SQM           1.3          5.41          10.37           8.07
MS Sales      14.7         6.92          16.11           11.93
Hospitality   1.0          23.8          70.4            43.3
Supporting updates
• Delete bitmap
• Stored as a B-tree on disk, as a bitmap in memory
• Delta stores
• B-trees holding newly inserted rows
• Up to 1M rows per delta store; may have several
• Tuple mover
• Converts a delta store to a row group
• Runs automatically or on demand
[Figure: row groups plus delta stores (B-trees), with a delete bitmap (B-tree) covering the row groups and a tuple mover converting delta stores into row groups.]
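The read path these structures imply can be sketched as follows. The data layout is hypothetical (lists standing in for compressed row groups and B-trees); the point is only that a scan unions row groups and delta stores while skipping deleted rows.

```python
# Sketch of scanning a column store index that has pending updates:
# compressed row groups plus delta stores, minus the delete bitmap.
# Illustrative structures; not SQL Server internals.

def scan(row_groups, delta_stores, deleted):
    """Yield visible rows: row-group rows not marked in the delete
    bitmap, plus all rows still sitting in delta stores."""
    for gid, rows in enumerate(row_groups):
        for rid, row in enumerate(rows):
            if (gid, rid) not in deleted:
                yield row
    for store in delta_stores:
        yield from store

row_groups = [["a", "b"], ["c", "d"]]
delta_stores = [["e"]]           # newly inserted, not yet compressed
deleted = {(0, 1)}               # row "b" was deleted
print(list(scan(row_groups, delta_stores, deleted)))  # ['a', 'c', 'd', 'e']
```

This is why deletes stay cheap (a bitmap entry, no segment rewrite) and why the tuple mover exists: rows in delta stores miss out on compression and fast scans until they are converted.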
Record performance on TPC-H
TPC-H, 10,000 GB scale

Date        CS indexes   Sockets/        QphH      Price/QphH
published   used?        cores/threads
4/15/13     No           8/80/160        158,108   $6.49
4/16/14     Yes          4/60/120        404,005   $2.34   (25% fewer cores, 2.5X faster, 64% cheaper)
4/06/15     Yes          8/120/240       652,239   $2.43   (50% more cores, 4.1X faster, 63% cheaper)
Customer Experiences
• Bwin.Party
• Time to prepare 50 reports reduced by 92%
• Best reduction: 17 min to 3 sec, 340X faster
• Clalit Health
• 48 out of 50 problem queries ran faster
• Average speedup 400X
• Average wait time reduced from 20 min to 3 sec
• DevCon Security
• Reports went from 10-12 sec to 1 sec, >10X
• Ad hoc queries went from 5-7 min to 1-2 sec, > 100X
Agenda
• Apollo column store engine
• Hekaton in-memory OLTP engine
• Looking ahead
Hekaton: what and why
• Hekaton is a high performance, memory-optimized OLTP engine
architected for modern HW trends and integrated into SQL Server
• Market need for ever higher throughput and lower latency OLTP at
lower cost
• HW trends demanded architectural changes
• Large main memories, lots of cores, SSDs
• Data doesn’t live on disk anymore!
Hekaton Architectural Pillars

1. Main-Memory Optimized (principle: performance-critical data fits in memory)
• Direct pointers to rows
• Indexes exist only in memory
• No buffer pool
• No write-ahead logging
• Stream-based storage
Result: speed of an in-memory cache with the capabilities of a database

2. T-SQL Compiled to Native Machine Code (principle: push decisions to compilation time)
• T-SQL compiled to machine code leveraging the VC compiler
• A procedure and its queries become a C function
• Aggressive optimizations at compile time
Result: queries & business logic run at native-code speed

3. Non-Blocking Execution (principle: conflicts are rare)
• Lock-free data structures
• Multi-version optimistic concurrency control with full ACID support
• No locks, latches, or spinlocks
• No I/O during a transaction
Result: transactions execute to completion without blocking

4. SQL Server Integration (principle: built-in)
• Same manageability, administration & development experience
• Integrated queries & transactions
• Integrated HA and backup/restore
Result: hybrid engine but integrated experience
Record and index structure
[Figure: row format is (begin timestamp, end timestamp, chain pointers, Name, City). Example versions such as (John, Paris, valid 100–200) and (John, Beijing, valid 200–∞) sit on version chains; a hash index on Name and a Bw-tree range index on City both point into those chains.]
• Rows are multi-versioned
• Each row version has a valid time range indicated by two timestamps
• A version is visible if the transaction's read time falls within the version's valid time
• A table can have multiple indexes
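The visibility rule can be stated in a couple of lines of Python, using the John/Paris/Beijing versions and timestamps from the figure. This is a toy model, not Hekaton's actual record layout.

```python
# Sketch of multi-version visibility: a version with valid time
# [begin, end) is visible to a transaction whose read time falls
# inside that range. Infinity marks a still-current version.
# Illustrative model; not Hekaton internals.

INF = float("inf")

def visible(version, read_time):
    """True if read_time falls within the version's valid time range."""
    return version["begin"] <= read_time < version["end"]

versions = [
    {"name": "John", "city": "Paris",   "begin": 100, "end": 200},
    {"name": "John", "city": "Beijing", "begin": 200, "end": INF},
]

# A transaction reading at time 150 sees the Paris version;
# one reading at time 250 sees the current Beijing version.
print([v["city"] for v in versions if visible(v, 150)])  # ['Paris']
print([v["city"] for v in versions if visible(v, 250)])  # ['Beijing']
```

Because visibility is decided purely by comparing timestamps, readers never block writers: an update just installs a new version with a fresh begin timestamp and closes the old one.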
Transaction validation (for update transactions)
• Read stability
• Check that each version read is still visible as of the end of the transaction
• Phantom avoidance
• Repeat each scan checking whether new versions have become visible since the
transaction began
• Extent of validation depends on isolation level
• Snapshot isolation: no validation required
• Repeatable read: read stability
• Serializable: read stability and phantom avoidance
Details in “High-Performance Concurrency Control Mechanisms for Main-Memory Databases”, VLDB 2011
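The validation rules above can be sketched as one function keyed by isolation level. The transaction and version representations here are hypothetical simplifications, not Hekaton's actual data structures.

```python
# Sketch of commit-time validation for update transactions.
# Versions are dicts with [begin, end) timestamps; a transaction
# carries its read set and its scans. Illustrative model only.

INF = float("inf")

def visible(version, ts):
    return version["begin"] <= ts < version["end"]

def validate(txn, table, isolation):
    """Return True if txn may commit at txn['end_ts']."""
    if isolation == "snapshot":
        return True                      # no validation required
    # Read stability: every version read must still be visible.
    if not all(visible(v, txn["end_ts"]) for v in txn["read_set"]):
        return False
    if isolation == "serializable":
        # Phantom avoidance: repeating each scan must find no new versions.
        for predicate, seen_ids in txn["scans"]:
            now = {id(v) for v in table
                   if visible(v, txn["end_ts"]) and predicate(v)}
            if now - seen_ids:
                return False             # a phantom appeared
    return True

old = {"k": 1, "begin": 0, "end": INF}
new = {"k": 2, "begin": 50, "end": INF}   # committed after txn began
table = [old, new]
txn = {"end_ts": 100, "read_set": [old],
       "scans": [(lambda v: v["k"] < 5, {id(old)})]}

print(validate(txn, table, "repeatable read"))  # True: read is stable
print(validate(txn, table, "serializable"))     # False: 'new' is a phantom
```

The example shows why serializable costs more than repeatable read: the same transaction passes read-stability checking but fails once its scan is repeated and finds the newly committed row.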
Non-blocking execution
• Goal: enable highly concurrent execution
• no thread switching, waiting, or spinning during execution of a transaction
• This led to three design choices
• Use only latch-free data structures
• Multi-version optimistic concurrency control
• Allow certain speculative reads (with commit dependencies)
• Result:
• Read-only transactions run without blocking or waiting
• Update transactions block only on final log write
• Exception: speculative reads may force a transaction to wait before
returning a result (rare)
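The commit-dependency mechanism behind speculative reads can be sketched as follows. The `Txn` class and its fields are hypothetical; the point is only that a dependent transaction does all its work without blocking and waits, if at all, only before returning results.

```python
# Sketch of a commit dependency from a speculative read: T2 read a
# version created by T1 before T1 finished validating, so T2 may
# proceed but must wait for T1's outcome before returning results.
# Illustrative model; not Hekaton internals.

class Txn:
    def __init__(self, name):
        self.name = name
        self.outcome = None      # set to "commit" or "abort" later
        self.dependencies = []   # transactions we speculatively read from

    def can_return_results(self):
        """True once every commit dependency has committed."""
        return all(d.outcome == "commit" for d in self.dependencies)

t1, t2 = Txn("T1"), Txn("T2")
t2.dependencies.append(t1)      # T2 speculatively read T1's new version
print(t2.can_return_results())  # False: T1 is still validating
t1.outcome = "commit"
print(t2.can_return_results())  # True: T2 may return results
```

If T1 instead aborted, T2 would have to abort too; the slide notes this waiting case is rare in practice.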
Durability and availability
• Logging changes before transaction commit
• All new versions, keys of old versions in a single IO
• Aborted transactions write nothing to the log
• Checkpoint - maintained by rolling log forward
• Organized for fast, parallel recovery
• Requires only sequential IO
• Recovery – rebuild in-memory database from checkpoint and log
• Scan checkpoint files (in parallel), insert records, and update indexes
• Apply tail of the log
• High availability (HA) – based on replicas and automatic failover
• Integrated with AlwaysOn (SQL Server’s HA solution)
• Up to 8 synch and asynch replicas
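The two-phase recovery path (load checkpoint, then apply the log tail) can be sketched as below. The record and log-entry formats are hypothetical, and the real system scans checkpoint files in parallel rather than sequentially.

```python
# Sketch of recovery: rebuild the in-memory table from the latest
# checkpoint, then apply the tail of the log on top of it.
# Hypothetical record formats; not Hekaton's on-disk layout.

def recover(checkpoint, log_tail):
    table = {}
    for key, value in checkpoint:        # scan checkpoint (parallel in reality)
        table[key] = value
    for op, key, value in log_tail:      # then apply the tail of the log
        if op == "put":
            table[key] = value
        elif op == "delete":
            table.pop(key, None)
    return table

checkpoint = [("a", 1), ("b", 2)]
log_tail = [("put", "c", 3), ("delete", "a", None)]
print(recover(checkpoint, log_tail))  # {'b': 2, 'c': 3}
```

Rolling the checkpoint forward from the log is what keeps the log tail short, which in turn bounds recovery time.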
Hekaton Engine Performance (micro-benchmark)
Speedup over the classical engine:

Transaction size        Lookups   Updates
(#lookups/#updates)
1                       10.8X     20.2X
10                      18.4X     23.4X
100                     18.1X     31.4X
1,000                   18.9X     27.9X
10,000                  20.4X     30.5X
Scalability under extreme contention (1000-row table)

[Figure: throughput in millions of tx/sec vs. number of threads (0 to 24). Workload: 80% read transactions (R=10), 20% update transactions (R=10, W=2). Multiversion optimistic concurrency control (MV/O) scales with thread count to over 3 million tx/sec, roughly 5× the throughput of single-version locking (1V), which flattens out under the contention.]
Performance does make a difference!
• Bwin.party – large online gaming site
• ASP.NET session state repository
• Went from 15,000 requests/sec to 250,000 requests/sec, 17X
• Achieved 450,000 requests/sec in testing, 30X
• Replaced 18 servers with one server
• Edgenet – provides real-time price/availability data for retailers
• 8X-11X faster data ingestion enabled huge service improvements
• Moved from once-a-day batch ingestion to continuous data ingestion
• Consolidated multiple servers into a single database server
• Removed application caching layer
• Samsung Electro-Mechanics – statistical process control system
• Improved OLTP performance by 24X and DW performance by 22X
• Now able to ingest and analyze all sensor data from manufacturing lines
• Improved quality control, better products
Agenda
• Apollo column store engine
• Hekaton in-memory OLTP engine
• Looking ahead
In the not-so-distant future
• Support for real-time analytics
• Column store indexes on Hekaton tables
• Making secondary CS indexes updatable
• Column store enhancements
• B-tree indexes on primary CS
• Even faster scans
Thank you for your attention