Hekaton, the New SQL Server In

Download Report

Transcript Hekaton, the New SQL Server In

Hekaton
The New SQL Server
In-Memory OLTP
Engine
Mladen Prajdić
SQL Server MVP
[email protected]
About me
2
Agenda
• Why we need it?
• What is it?
• Storage engine
• Relational engine
3
Why we need it?
Intel CPU trends and Memory prices($/GB)
10000000
Computing power holds Moore Law due to parallelism
1000000
100000
10000
CPU clock frequency stalled
1000
100
10
Memory has gotten a LOT cheaper
1
1970
1975
1980
1985
1990
1995
2000
2005
4
2010
2015
What is it?
• Hekaton = Greek for 100x
• In-Memory OLTP
• Built directly into DB engine
• High performance
• Fully ACID
5
What is it?
6
Rough OLTP workload distribution
Relational Engine
Storage Engine
7
How do we start?
• Special MEMORY_OPTIMIZED_DATA filegroup
• All Hekaton stuff goes in there
• The filegroup can’t be removed after creation for now
• But you can removes all objects and make it empty
• Only allowed Windows (non-SQL) BIN2 collations
• 129 collations
SELECT
FROM
WHERE
*
fn_helpcollations()
name like '%BIN2%'
and name not like '%SQL%'
8
Storage Engine
• Tables
• Indexes
• Transaction Log
• Checkpoints
9
Tables
• MEMORY_OPTIMIZED = ON clause
• No LOB or off-row data allowed
• 8060 byte limit enforced at creation time
• Needs at least 1 index
• Data DURABILITY
• SCHEMA_AND_DATA (default)
• Must have PK
• SCHEMA_ONLY
• Ludicrous speed
10
Tables
• On CREATE TABLE
• Schema compiled to C DLL and written to DB metadata
• DLL loaded to memory
• All table index and row access done through DLL
11
Tables - Allowed data types
• bit
• tinyint, smallint, int, bigint
• money, smallmoney
• float, real
• date/time types: datetime, smalldatetime, datetime2, date, time
• numeric and decimal types
• char(n), varchar(n), nchar(n), nvarchar(n)
• binary(n), varbinary(n)
• uniqueidentifier
12
Tables - Limitations
• No DML triggers
• No FOREIGN KEY or CHECK constraints
• No IDENTITY columns
• No UNIQUE indexes other than for the PRIMARY KEY
• A maximum of 8 indexes, including the index supporting the PRIMARY KEY
• NO SCHEMA CHANGES
• For table changes you’ll have to DROP and RECREATE
• Indexes are created at table creation time, can’t be changed
13
Rows
• TxID global DB counter
Row Header
• Reset on restart
• Incremented on TX start
Begin TimeStamp (8 bytes)
• TxTimeStamp global DB counter
• Not reset on restart
• Incremented on TX end
End TimeStamp (8 bytes)
• Row Header
Statement ID (4 bytes)
• BeginTS
• TX TS that created the row
Index Link Count (2 bytes)
• TX TS that deleted the row
• Infinity
8 bytes * (Number of indexes)
• EndTS
• StmtID
• Statement that created the row
• IdxLinkCount
• Number of indexes that reference this row
• Payload = actual data
14
Indexes
• Hash Indexes
• Range Indexes
• Built on MVCC method
• Table needs at least one index
• They connect the rows together
• No pages or extents as we know them now
• Never written to disk, always only in memory
• Recreated on database recovery
• Recovery is parallelized
15
Indexes - Hash Indexes
• 1D Array of pointers
• Collection of hashed values of key columns
• Bucket count (array length) defined at table creation
• Can’t be changed
• Rounds to the nearest higher power of 2
• 1000 -> 1024 (8KB), 500000 -> 524288 (4MB)
• Trade off between average row chain size and available memory
• Only useful for equality and non-equality checks
• WHERE col1 = ‘value’ uses index
• WHERE col1 like ‘val%’ won’t use index
• Hash collisions
• Multiple column values hash to same integer value
• Forward only linked list
16
Indexes - Hash Indexes
Hash Index on Name
TimeStamps
Index ptr
Name
30, ∞
null
Matija
…
5
6
1
7
…
5,6,7 are
hashed values
Insert First Row To Table: Hash Matija = 5
17
Indexes - Hash Indexes
Hash Index on Name
TimeStamps
Index ptr
Name
2
150, ∞
1
Mladen
1
30, ∞
null
Matija
…
5
6
7
…
5,6,7 are
hashed values
Insert Second Row To Table: Hash Mladen = 5
18
Indexes - Hash Indexes
Hash Index on Name
TimeStamps
Index ptr
Name
2.1
250, ∞
2
Mladen
2
150, 250
1
Mladen
1
30, ∞
null
Matija
…
5
6
7
…
5,6,7 are
hashed values
Update Second Row: Change non Name data for Mladen
19
Indexes - Range Indexes
• When having no clue about number of buckets
• Searches based on data range
• Bw-tree
20
Indexes - Range Indexes: Bw-tree
• Discovered by Microsoft Research in 2011
• Lock and Latch free version of B-tree
• Index pages are not fixed size (max size still 8k)
• Can’t be changed
• Page pointer (PID)
• Location in the Page Mapping Table (1D array of page locations)
• For disk based table it’s a physical location
21
Indexes - Range Indexes: Bw-tree
• Delta pages
• Because existing pages can’t be changed
• Three page reorganization options
• Consolidation of delta records
• Delta chain > 16: Page rebuilt
• Old pages (original + delta) marked for garbage collection
• Splitting of a full index page
• When page has no more space (original+delta) to add new rows
• Merging of adjacent index pages
• Page has 1 row or less than 10% max size
• Merge with neighbor page if it has enough space (new page created)
• In the end always same PID but different memory address
22
Indexes - Range Indexes: Pages
Page Mapping Table
PID 0
0
8
20
Index Root
31
1
2
PID 6
3
5
8
PID 2 12 15 20
PID 8 24 28 31
Non-Leaf Pages
3
4
5
6
PID 10
1
2
PID 7
3
7
10
…
7
PID 4 29 30 31
8
Leaf Pages
Memory
address
8
9
6
30, ∞
null
1
20, 45
null
Key
3
Key
23
Data rows
Indexes - Range Indexes: Delta Rows
Page Mapping Table
0
PID 2
Original rows
1
2
3
…
Page with existing rows. PID = 2
24
Indexes - Range Indexes: Delta Rows
Page Mapping Table
0
PID 2.1
1
Inserted row
Delta record 1
Original rows
2.1
3
…
Insert one row to the page
25
Indexes - Range Indexes: Delta Rows
Page Mapping Table
0
Deleted row
Delta record 2
1
Inserted row
Delta record 1
2.2
Original rows
PID 2.2
3
…
Delete a row from the page
26
Transaction log
• Still needed to guarantee full ACID
• Data written to it only at TX commit
• In-Memory is optimized for multiple concurrent log streams
• Optimized writes, much less data written
• Can still be a bottleneck
• Use fast drives!
• Fully skipped for non-durable tables
• Because gives us ->
27
Transaction log
DEMO
28
Checkpoints
• Needed to reduce recovery time
• Same as for disk based tables
• Multiple Data and Delta files
• 1-1 mapping
• Checkpoint inventory file
29
Checkpoints - Data Files
• Contains only inserted versions of rows
• INSERTs + UPDATEs (MVCC)
• Append only
• Each file covers specific timestamp range
• Max 128MB per file
30
Checkpoints - Delta files
• Info about deleted rows in the Data file
• Append only
• Mapping to a single data file
31
Checkpoints - Checkpoint Issued
• T-LOG scanned from previous checkpoint and converted to Data/Delta files
• Checkpoint Inventory File created
• Locations of all Data/Delta files
• Saved to the transaction log
• Location of the Checkpoint Inventory File saved to transaction log
• Multiple data-delta files allow for high database recovery parallelization
• In-Memory recovery happens in parallel with normal disk based recovery
32
Checkpoints – Merging Multiple Data/Delta Files
• Automatic
• Manual
• SP:
sys.sp_xtp_merge_checkpoint_files
• DMV: sys.dm_db_xtp_checkpoint_files
• DMV: sys.dm_db_xtp_merge_requests
• Only adjacent less than 50% full files can be merged
Data files
Merge Selection
After Merge
D1 (30%), D2 (30%), D3 (30%), D4 (40%)
D1, D2, D3
D123, D4
D1 (30%), D2 (50%), D3 (50%), D4 (40%)
D1, D2 and D3, D4
D12, D34
D1 (50%), D2 (70%), D3 (60%), D4 (50%)
/
D1, D2, D3, D4
33
Relational Engine
• T-SQL
• Stored procedures
• Isolation levels
34
T-SQL
• 2 ways to access In-Memory tables
• T-SQL interop
• Natively compiled stored procedures
• T-SQL interop is usually slower than compiled stored procedures
• But it depends on the operation
• Complex business logic is faster in compiled procedures
35
T-SQL: Not allowed on In-Memory tables
• TRUNCATE TABLE
• MERGE (when a memory-optimized table is the target)
• Cross-database queries and transactions
• Linked servers
• Locking hints: TABLOCK, XLOCK, PAGLOCK, etc.
• NOLOCK is supported, but is quietly ignored
• Isolation level hints
• READUNCOMMITTED, READCOMMITTED, READCOMMITTEDLOCK
• Dynamic and keyset cursors
• degraded to static cursors
36
Stored procedures
• Only allowed access to In-Memory OLTP tables
• Compiled to C DLL
create procedure
• Atomic block
dbo.spCompiledProc
@param1 int not null,
@param2 varchar(100)
with native_compilation,
schemabinding,
execute as owner
as
begin atomic with (
transaction isolation level=snapshot,
language=N'us_english‘)
• All statements in single TX
• TX savepoint if in existing TX
• No parameter sniffing
• UNKNOWN by default
• Execution Plan embedded in DLL
-- stored procedure body
end
37
Stored procedures: T-SQL support
• TRY, CATCH, error functions, IF, WHILE
• Memory-optimized table types and table variables
• Math, Date, String functions
• SCOPE_IDENTITY, ISNULL, NEWID, NEWSEQUENTIALID
•…
• No DISTINCT, WITH TIES, PERCENT
• TOP with ORDER BY returns max 8192 rows for single table
• 4096 if two joined tables, 2730 if three joined tables, ...
• BOL: “Supported Constructs in Natively Compiled Stored Procedures”
38
Stored procedures: Compilation
T-SQL stored
procedure
Parser
Algebrizer
Processing flow
Query Trees
Query
optimizer
Processing flow
with
Optimized Query
Plans
Runtime
DLL created
loaded to memory
39
Compiler
Stored procedures: Execution
EXEC
spDoStuff
Parser extracts
the name and
parameters
Runtime
locates the sp
DLL entry
point
Results
returned to
the client
DLL executes
the sp logic
40
Data operations and Isolation levels
• SNAPSHOT
• Read data must be the same as at the start of the transaction
• REPEATABLE READ
• Can’t read modified non-committed data from other transactions
• SERIALIZABLE
• SNAP+RR+No phantom reads (new row inserts in the queried key range)
• READ COMMITTED
• Only for autocommit (single statement) transactions
• Any violation of the upper rules results in current transaction fail
41
Maintenance of DLLs
• DLL’s saved on the filesystem
• No need for DBA’s to do anything
• Automatic removal when object or db is dropped or server restarted
SELECT name, description
FROM
sys.dm_os_loaded_modules
WHERE description = 'XTP Native DLL'
42
Performance comparisons: Transaction/Sec
Normal stored procedure
and
table
Normal stored procedure
and
Hekaton table
Hekaton stored procedure
and
Hekaton table
1x
8x
27x
Scales linearly with the number of CPUs
43
Performance comparisons: Execution time
Disk vs Hekaton Execution Time
1,800
1,600
Disk Based Time (sec)
Hekaton Time (sec)
1,400
1,200
1,000
800
600
400
200
0
500000 (50x10000)
Speed: 3,8
100000 (100x1000)
Speed: 3,9
44
10000000 (100x100000)
Speed: 2,6
45