Knowing the Internals, Who Needs SQL Server Anyway?

Download Report

Transcript Knowing the Internals, Who Needs SQL Server Anyway?

Mark S. Rasmussen – improve.dk

Tech Lead @ iPaper

Developer/DBA/Sysadmin/Project manager/*

Comp.Sci @ Aarhus University

Blogging at improve.dk

@improvedk

Author of the OrcaMDF project
2

Level 650 – meant to inspire, not teach!

Based on 2008 R2

I have no idea...

Most of what I say is incorrect
3

Presentation at Miracle Open World

Formally started OrcaMDF
4
private static void oldschool()
{
using (var conn = new SqlConnection("Data Source=.;Initial Catalog=QFD;…")
{
conn.Open();
var cmd = new SqlCommand("SELECT * FROM Persons", conn);
var reader = cmd.ExecuteReader();
while(reader.Read())
Console.WriteLine(reader["ID"] + ": " + reader["Name"] + " (" + reader["A
}
}
5
using (var mdf = new MdfFile(mdfPath))
{
var scanner = new DataScanner(mdf);
var rows = scanner.ScanTable("Persons");
EntityPrinter.Print(rows);
}
using (var mdf = new MdfFile(mdfPath))
{
var scanner = new DataScanner(mdf);
var rows = scanner.ScanTable("Persons")
.Where(x => x.Field<short>(”Age”) < 40);
EntityPrinter.Print(rows);
}
6
The foundation of SQL Server storage

8192 bytes
Header (96 bytes)
Body (8096 bytes)

Everything stored
as pages
Record offset array
8

DBCC IND

DBCC PAGE

DBCC TRACEON (3604)

Documented in 6.5 & 7.0

Unofficially documented
9

Absolutely no documentation

Absolutely necessary for parsing
10

DEMO

OrcaMDF: PageHeader
11

Points to beginning
Header (96 bytes)
of records in body

Defines logical
order of records
5
4
3
2
1
0
12

Data records
 Stores table data

Index records
 Stores nonclustered index data, as well as non-leaf
level clustered index data

Stored in the FixedVar format
13
1 byte
1 byte
2 bytes
n bytes
2 bytes
n bytes
2 bytes
2 x #var
columns
n bytes
Variable length column data
Variable column offset array
Number of variable length columns
NULL bitmap – 1 bit per column
Number of columns in the NULL bitmap
Fixed-length data
Total length of fixed-length data (including status bytes)
Status bits B
Status bits A
14
0
1-3
4
5
6
7
Not used
Has versioning information
Has variable length columns
Has NULL bitmap
Record type
Version (always 0 in 2k8+)
15
0
1-7
Not used
Is ghost forwarded record
16

Bitmap tracking whether columns are NULL

CEIL(#Cols / 8) bytes

Always present on data pages, except when
it’s not

Only trust defined bits – rest may be garbage
17
1 byte
1 byte
10 bytes
2 bytes
0x0200
11
n bytes
0x1500
13
2 bytes
0x1800
15
n bytes
2 x #var
columns
2 bytes
4 bytes
17
n bytes
3 bytes
21
24
18
0
4
30 00 11 00
17
19 20
05 00 00 02 00
8
12
17
19 00 00 00
26 00 00 00
41 42 43 44 20
22
26
30
24
1e 00
30
25 00
37
44 65 6e 6d 61 72 6b
4d 61 72 6b
CREATE TABLE RecordTest
(
A int,
B int,
C char(5),
D varchar(10),
E varchar(20)
)
INSERT INTO RecordTest VALUES (25, 38, 'ABCD', 'Mark', 'Denmark')
19

Fixed length data always present
 Even if null
▪ Though not necessarily tail columns!

Variable length data only present when not null

Adding nullable columns is a metadata op
 Denali default value columns is metadata too!
20
How are data types stored within a record?

Fixed length data types
 bit, char, int, decimal, date, datetime, float, etc.

Variable length data types
 (n)varchar, varbinary, varchar(MAX), text, etc.

sql_variant
 Please just stay away from it
22

SLOBs
 varchar(x), nvarchar(x), varbinary(x)

LOBs
 text, ntext, image, varchar(MAX), nvarchar(MAX),
varbinary(MAX), xml

vardecimal
23
2
0
30 00
4
04 00
6
01 00
7
00
9
01 00
11
00 F0
15
4D 61 72 6B
Varlength 1 data
Varlength offset 1 col
Number of varlength cols
NULL bitmap
Number of cols
Fixed length
Status byte B
Status byte A
CREATE TABLE VarcharTest
(
A varchar(4)
)
INSERT INTO VarcharTest VALUES ('Mark')
24

DEMO

Identified using the sign bit

0b1001001110010101 = 37.781

0b0001001110010101 = 5.013

Use cases
 Row-overflow/LOB pointers
 Sparse vectors
 Back pointers
25

Varchar, nvarchar, varbinary

DEMO
26

Column data moved to new page, pointer left
behind
27
28
B = [BLOB Inline Root] Slot 0 Column 2 Offset 0x11a1 Length (physical) 24
Level = 0
TimeStamp = 1298595840
RowId = (1:21:0)
Unused = 0
Link 0
0
4
30 00 04 00
0
1
7
4
02 00 00 00
11
A1 11
01 00 00 00
4513
41 41 ... 41 (4500)
12
FA 1B 00 00
20
15 00 00 00
13
B9 91
8
16
94 11 00 00
9
02 00 00 02 00
3
12
6
UpdateSeq = 1
Size = 4500
22
01 00
24
00 00
timestamp = BitConverter.ToInt64(data, 8) << 16;
29
Blob row at: Page (1:21) Slot 0 Length: 4514 Type: 3 (DATA)
Blob Id:469368832
0
1
2
08 00 A2 11
00 00 FA 1B 00 00 00 00
4514
14
12
4
03 00
42 42 ... 42 (4500)
Data
LOB type
Blob ID (timestamp)
BLOB_FRAGMENT

Stored on shared (obj-level) TextMix pages
30

Allways stored in-row if < 24 bytes

24 byte [BLOB Inline Root] pointer

Data stored in BLOB_FRAGMENT on TextMix page

Timestamp == Blob ID

Performance prediction is tough
31

varchar(MAX), nvarchar(MAX), varbinary(MAX)

The LOB that wanted to be a SLOB

Three scenarios
 [BLOB Inline Data]
 [BLOB Inline Root]
 [Textpointer]

DEMO
32

Used when data fits in record

Not an official LOB structure
Slot 0 Column 1 Offset 0x0 Length 0 Length (physical) 0
A = [NULL]
B = [BLOB Inline Data] Slot 0 Column 2 Offset 0x1393 Length
4 Length (physical) 4
B = 0x41424344
33

Can reference up to 5 pages – data, roots, trees, etc.

12 byte header
0
4
30 00 04 00
0
1
7
4
02 00 00 00
11
A1 11
01 00 00 00
4513
41 41 ... 41 (4500)
12
FA 1B 00 00
20
15 00 00 00
13
B9 91
8
16
94 11 00 00
9
02 00 00 02 00
3
12
6
22
01 00
24
00 00

Array of 12 byte references

Only used by SLOBs & (MAX) LOBs

Also not a LOB structure (by my definition)
34

Wrapped in a single-column ”meta” record
0
1
A
2
B
4
FSize
X
LOB Structure
35
Blob row at: Page (1:176) Slot 0 Length: 8054 Type: 3 (DATA)
Blob Id:1210253312

Type 3

Where data is actually stored

Size always > 64 bytes (SMALL_ROOT)
36
0
8
Blob ID

10
Type
X
Data
How much data can we store in a DATA
record?
 8096 – Page body size
 8080 (8094)– Theoretical max
 8040 (8054)- Reality
37
Blob row at: Page (1:55) Slot 0 Length: 324 Type: 2 (INTERNAL)
Blob Id: 1210253312 Level: 0 MaxLinks: 501 CurLinks: 19
Child 0 at Page (1:176) Slot 0 Size: 8040 Offset: 8040
Child 1 at Page (1:177) Slot 0 Size: 8040 Offset: 16080

Type 2

CurLinks = number of references

MaxLinks = ?

Level = tree level

Size = computed
38
Blob ID
Type
MaxLinks
CurLinks
Level
Offset
Page ID
File ID
Slot ID
Offset
Page ID
File ID
Slot ID
Total record size = 20 + X * 16
39
Normal record with in-row data
Level = 0
Record with <40 KB of data
DATA LOB
DATA LOB
DATA LOB
Level = 1
Level = 0
Record with >40 KB of data
INTERNAL LOB
DATA LOB
DATA LOB
DATA LOB
DATA LOB
DATA LOB
40

In theory (INTERNAL)... 8096  20  2  504
16

In reality... 500
500 8040 4,020,000
2  500 8040 8,040,000
5  500 8040 20,100,000
6  500 8040 24,120,000
41
DATA LOB
Level = 0
Level = 1
Record with >~4 MB of data
INTERNAL LOB
INTERNAL LOB
INTERNAL LOB
DATA LOB
DATA LOB
DATA LOB
DATA LOB
DATA LOB
42
DATA LOB
Level = 0
Level = 2
Level = 1
Record with >~16 MB of data
INTERNAL LOB
INTERNAL LOB
INTERNAL LOB
INTERNAL LOB
DATA LOB
DATA LOB
DATA LOB
DATA LOB
DATA LOB
43
Level = 2
4
Level = 1
X
500
Level = 0
X

8,040,000,000 bytes / 7,48 GB

(MAX) limit is 231-1

Many permutations
500
X
8040
44

sp_tableoption ‘MyTable’, ‘Option’, ‘ON/OFF’

Even more permutations

“text in row” 24-7000, default 256
45

Used for classic LOB types & MAX LOB types
with ’large value types out of row’ ON
 text, ntext, image

Complex column
0
8
Timestamp
12
Page ID
14
16
File ID Slot ID
46

You thought (MAX) was complex?

Textpointer = evil
47
0
8
Blob ID
10
Type
12
Length
16
?

Type = 0

Used when data <= 64 bytes

Min size = 84

Data > length = garbage
84
Data ...
48
0
8
10
Blob ID
20
Type
24
Size
28
Page ID
12
14
MaxLinks CurLinks
30
16
Level
20
?
32
File ID Slot ID
...

Type = 5

Min size = 84

Part of LOB tree
49
Record with <65 bytes of data
SMALL_ROOT
DATA LOB
DATA LOB
Level = 0
Record with <40 KB of data
LARGE_ROOT_YUKON
DATA LOB
DATA LOB
DATA LOB
Record with >40 KB of data
Level = 1
Level = 0
LARGE_ROOT_YUKON
INTERNAL LOB
DATA LOB
DATA LOB
DATA LOB
50
DATA LOB
Record with >~4 MB of data
Level = 2
Level = 1
LARGE_ROOT_YUKON
INTERNAL LOB
Level = 0
DATA LOB
INTERNAL LOB
DATA LOB
INTERNAL LOB
DATA LOB
INTERNAL LOB
DATA LOB
DATA LOB
51
Data
Varchar(X)
Varchar(MAX)
Text
NULL
0
0
0
0-64
0-64
0-64
100 (16 + 84)
65-8000
65-8000
65-8000 * (+ 24 + 14)
100 + 14 + 65-8000
8kb +
N/A
24 + X
100 + X

Extreme impact on small data

The more data, the less of a diff

Performance differences

http://sqlblog.com/blogs/paul_white/archive/2011/02/23/Advanced-TSQL-Tuning-Why-Internals-Knowledge-Matters.aspx
52
Type
Name
0
SMALL_ROOT
1
?
2
INTERNAL
3
DATA
4
?
5
LARGE_ROOT_YUKON
6
?
7
?
8
?
9+
?
53
0
8
Blob ID
10
Type
12
Length
16
?
84
Data ...
54
Type
Name
0
SMALL_ROOT
1
LARGE_ROOT
2
INTERNAL
3
DATA
4
LARGE_ROOT_SHILOH
5
LARGE_ROOT_YUKON
6
SUPER_LARGE_ROOT
7
8
NULL
9+
INVALID
55

< 8000 => (MAX) = (X)

> 8000 => Tree is built

Text/ntext/image horribly inefficient
 Lots of legacy details
56
How pages are organized

Defines how data is *physically* stored

Clustered index
 Guarantees physical order of data
 Row identified by ”clustered key”

Heap
 Data stored whereever SQL Server wants to
 Row identified by ”RID”
58
Ø 10 20
Ø 4
1 3
4 5
17
23
17 18
23 25 26
59

Relies on IAM pages

Leaf pages not linked
 Except...
1 3
·· ·
4 5
·
17 18
23 25 26
60
Extents, pages & objects

All pages allocated as part of an extent

Mixed extents

Uniform extents

First 8 = mixed, rest uniform
Pages 72-79
Pages 64-71
62

Global Allocation Map

1 = Free, 0 = Allocated

Bitmap tracks 63,904 extents, almost 4GB

Present every 511,232 pages

GAM interval

2 / 511232, every 511232 pages
~4GB
~4GB
...
2
...
511232
~4GB
...
1022464
...
63

Shared Global Allocation Map

1 = Mixed & > 0 free pages

0 = Either uniform or mixed w/no free pages

Structure identical to GAM

3 / 511233, every 511232 pages
64

Index Allocation Map

1 = Uniformly allocated to IAM chain / allocation unit

0 = Not owned by IAM chain / AU

No fixed positioning!

Tracks a GAM interval

Structure (almost) identical to GAM
65
Sequence number
Status
Object ID
Index ID
Page count
Start page ID
Start page file ID
0
42
4
14
int
?
int
short
46
16
short
28
32
?
int
34
short
36
short
40
int
42
short
X 8
48
90
Slot 0 file ID
Slot 0 page ID
66
GAM
SGAM
Any IAM
Status
0
0
0
1
0
1
0
0
0
0
1
0
Mixed, all pages allocated
0
1
1
1
0
1
1
1
0
Invalid
Invalid
Invalid
1
1
1
Invalid
Mixed, > 0 free pages
Uniform
Unallocated
67
001

1
0
1
1
0
Page Free Space
Not used

Bytemap

1 / 8088, every 8088 pages

PFS interval
Page allocated?
Mixed page?
IAM page?
Ghost records?
How much free space?

Only tracks fullness where necessary
68
GAM interval
0
1
2
3
4 5
6
7
8
9
...
8088
...
511232 511233
...
SGAM
Boot page
PFS
GAM
ML map
DIFF map
Unused
SGAM
GAM
PFS
File header
69
Heap /
Index
Objects
Partitions
Allocation units
HOBT
SLOB
Partition 1
Partition N
LOB
LOB
SLOB
HOBT
70

Page 9 in primary data file
0-1
Version
2-3
CreateVersion
...
?

DBCC PAGE == DBCC DBINFO
32-35
Status

Lots of interesting info
36-39
NextID
...
?
48-303
Database name
...
?
308-309
Database ID
...
?
312-319
Max DB Timestamp
 Name + ID
...
?
512-515
First sys indexes Page ID
 FirstSysIndexes
516-517
First sys indexes File ID
518 - 1440
?
 Physical version
 Log rebuild count
 Last OK CHECKDB
 Last LOG backup
71
The source of our parsing metadata

Schema
 sys.tables + sys.columns / sys.indexes + sys.index_columns

Indexes
 Root page

Heaps
 IAM chain root

DEMO
73
sys.tables (t)
object_id
1
t.object_id = p.object_id
*
sys.partitions (p)
partition_id == hobt_id
1
p.hobt_id = au.container_Id
1..2
sys.system_internals_allocation_units (au)
(HOBT, SLOB)
container_id = sys.partitions.hobt_id
1
p.partition_id = au.container_Id
0..1
sys.system_internals_allocation_units (au)
(LOB)
container_id = sys.partitions.partition_Id
74

Just views, no physical storage

Chicken or the egg

How about we take a look at those views?

DEMO
75

The basis for DMV data

Can only be queried through the DAC
 Here be dragons!

Confusing column names

Utilizes internal functions
76

SELECT * FROM sys.sysschobjs

sysschobjs



sys.tables (t)
object_id
syscolpars
1
t.object_id = p.object_id
sysrowsets
sysallocunits
*
sys.partitions (p)
partition_id == hobt_id
1
p.hobt_id = au.container_Id
1..2

DEMO
sys.system_internals_allocation_units (au)
(HOBT, SLOB)
container_id = sys.partitions.hobt_id
1
p.partition_id = au.container_Id
0..1
sys.system_internals_allocation_units (au)
(LOB)
container_id = sys.partitions.partition_Id
77

Boot page points to sysallocunits (FirstSysIndexes)

Constant partition ID leads us to sysrowsets

Constant object ID leads us to sysschobjs

Using the above we can find syscolpars

DEMO
78
SELECT
CASE c.maxinrowlen
WHEN 0 THEN p.length
ELSE c.maxinrowlen
END AS max_inrow_length,
p.xtype AS system_type_id,
p.length AS max_length,
p.prec AS PRECISION,
p.scale AS scale,
FROM
sys.sysrscols c –- sys.system_internals_partition_columns
OUTER APPLY
OPENROWSET(TABLE RSCPROP, c.ti) p
79
80
CREATE TABLE TITest
(
A binary(50),
B char(10),
C datetime2(5),
D decimal(12, 5),
E float,
F int,
G numeric(11, 4),
H nvarchar(50),
I nvarchar(max),
J time(3),
K tinyint,
L varbinary(max),
M varchar(75),
N text
)
81
SELECT
t.name,
r.ti,
p.scale,
p.precision,
p.max_length,
p.system_type_id,
p.max_inrow_length
FROM
sys.system_internals_partition_columns p
INNER JOIN
sys.sysrscols r ON
r.rscolid = p.partition_column_id AND
r.rsid = p.partition_id
INNER JOIN
sys.types t ON
t.system_type_id = p.system_type_id AND
t.user_type_id = p.system_type_id
WHERE
partition_id = 72057594040614912
82
83

12973
= 0x000032AD

173
= 0xAD == 12973 & 0xFF

50
= 0x32 == (12973 & 0xFFFF00) >> 8
84

1322
= 0x0000052A

42
= 0x2A == 1322 & 0xFF

5
= 0x5 == (1322 & 0xFF00) >> 8

25
= 20 + scale
85

330858
= 0x00050C6A

106
= 0x6A == 330858 && 0xFF

12
= 0x0C == (330858 && 0xFF00) >> 8

5
= 0x05 == (330858 && 0xFF0000) >> 16
86
[Test]
public void Decimal()
{
var parser = new SysrscolTIParser(330858);
Assert.AreEqual(5, parser.Scale);
Assert.AreEqual(12, parser.Precision);
Assert.AreEqual(9, parser.MaxLength);
Assert.AreEqual(106, parser.TypeID);
Assert.AreEqual(9, parser.MaxInrowLength);
parser = new SysrscolTIParser(396138);
Assert.AreEqual(6, parser.Scale);
Assert.AreEqual(11, parser.Precision);
Assert.AreEqual(9, parser.MaxLength);
Assert.AreEqual(106, parser.TypeID);
Assert.AreEqual(9, parser.MaxInrowLength);
}
87
When everything else fails

You should always have backups available

Make sure to test your backups

Run regular consistency checks

This is a last resort measure
89

Torn pages

Corrupt pages

Bad metadata

Accidental deletes & truncations

How does OrcaMDF differ?
90

One page = 16 disk sectors

First and last sector most important

No header
 Identify object from IAM, linked list

No slot array
 Slot count in header
 Identify record formats in body
91

Checksum doesn’t match content

Could be minor issue, probably major

Treat like torn page
92

SQL Server bugs

Corrupt/torn pages

Scan pages and identify object in header

Scan pages and look for IAM chain

Deduce schema
 App
 Docs
 Record format
93

Accidental delete
 Records may be ghosted
 Records removed from slot array
 STOP!

Accidental truncation
 Pages deallocated, physically intact
 Scan pages, linked list
94

Garbage may be mistaken for data

Was page allocated?
 Look for clues in salvagable allocation structures
95

CorruptMdf class

Verifying tornbits / checksum

Utility methods
 Scan for pages belonging to object
 Scan for IAM pages

Best-effort parsing of pages
96
Now’s the chance
Blog:
Twitter:
Email:
improve.dk
@improvedk
[email protected]