Knowing the Internals, Who Needs SQL Server Anyway?
Download
Report
Transcript Knowing the Internals, Who Needs SQL Server Anyway?
Mark S. Rasmussen – improve.dk
Tech Lead @ iPaper
Developer/DBA/Sysadmin/Project manager/*
Comp.Sci @ Aarhus University
Blogging at improve.dk
@improvedk
Author of the OrcaMDF project
2
Level 650 – meant to inspire, not teach!
Based on 2008 R2
I have no idea...
Most of what I say is incorrect
3
Presentation at Miracle Open World
Formally started OrcaMDF
4
private static void oldschool()
{
using (var conn = new SqlConnection("Data Source=.;Initial Catalog=QFD;…")
{
conn.Open();
var cmd = new SqlCommand("SELECT * FROM Persons", conn);
var reader = cmd.ExecuteReader();
while(reader.Read())
Console.WriteLine(reader["ID"] + ": " + reader["Name"] + " (" + reader["A
}
}
5
using (var mdf = new MdfFile(mdfPath))
{
var scanner = new DataScanner(mdf);
var rows = scanner.ScanTable("Persons");
EntityPrinter.Print(rows);
}
using (var mdf = new MdfFile(mdfPath))
{
var scanner = new DataScanner(mdf);
var rows = scanner.ScanTable("Persons")
.Where(x => x.Field<short>(”Age”) < 40);
EntityPrinter.Print(rows);
}
6
The foundation of SQL Server storage
8192 bytes
Header (96 bytes)
Body (8096 bytes)
Everything stored
as pages
Record offset array
8
DBCC IND
DBCC PAGE
DBCC TRACEON (3604)
Documented in 6.5 & 7.0
Unofficially documented
9
Absolutely no documentation
Absolutely necessary for parsing
10
DEMO
OrcaMDF: PageHeader
11
Points to beginning
Header (96 bytes)
of records in body
Defines logical
order of records
5
4
3
2
1
0
12
Data records
Stores table data
Index records
Stores nonclustered index data, as well as non-leaf
level clustered index data
Stored in the FixedVar format
13
1 byte
1 byte
2 bytes
n bytes
2 bytes
n bytes
2 bytes
2 x #var
columns
n bytes
Variable length column data
Variable column offset array
Number of variable length columns
NULL bitmap – 1 bit per column
Number of columns in the NULL bitmap
Fixed-length data
Total length of fixed-length data (including status bytes)
Status bits B
Status bits A
14
0
1-3
4
5
6
7
Not used
Has versioning information
Has variable length columns
Has NULL bitmap
Record type
Version (always 0 in 2k8+)
15
0
1-7
Not used
Is ghost forwarded record
16
Bitmap tracking whether columns are NULL
CEIL(#Cols / 8) bytes
Always present on data pages, except when
it’s not
Only trust defined bits – rest may be garbage
17
1 byte
1 byte
10 bytes
2 bytes
0x0200
11
n bytes
0x1500
13
2 bytes
0x1800
15
n bytes
2 x #var
columns
2 bytes
4 bytes
17
n bytes
3 bytes
21
24
18
0
4
30 00 11 00
17
19 20
05 00 00 02 00
8
12
17
19 00 00 00
26 00 00 00
41 42 43 44 20
22
26
30
24
1e 00
30
25 00
37
44 65 6e 6d 61 72 6b
4d 61 72 6b
CREATE TABLE RecordTest
(
A int,
B int,
C char(5),
D varchar(10),
E varchar(20)
)
INSERT INTO RecordTest VALUES (25, 38, 'ABCD', 'Mark', 'Denmark')
19
Fixed length data always present
Even if null
▪ Though not necessarily tail columns!
Variable length data only present when not null
Adding nullable columns is a metadata op
Denali default value columns is metadata too!
20
How are data types stored within a record?
Fixed length data types
bit, char, int, decimal, date, datetime, float, etc.
Variable length data types
(n)varchar, varbinary, varchar(MAX), text, etc.
sql_variant
Please just stay away from it
22
SLOBs
varchar(x), nvarchar(x), varbinary(x)
LOBs
text, ntext, image, varchar(MAX), nvarchar(MAX),
varbinary(MAX), xml
vardecimal
23
2
0
30 00
4
04 00
6
01 00
7
00
9
01 00
11
00 F0
15
4D 61 72 6B
Varlength 1 data
Varlength offset 1 col
Number of varlength cols
NULL bitmap
Number of cols
Fixed length
Status byte B
Status byte A
CREATE TABLE VarcharTest
(
A varchar(4)
)
INSERT INTO VarcharTest VALUES ('Mark')
24
DEMO
Identified using the sign bit
0b1001001110010101 = 37.781
0b0001001110010101 = 5.013
Use cases
Row-overflow/LOB pointers
Sparse vectors
Back pointers
25
Varchar, nvarchar, varbinary
DEMO
26
Column data moved to new page, pointer left
behind
27
28
B = [BLOB Inline Root] Slot 0 Column 2 Offset 0x11a1 Length (physical) 24
Level = 0
TimeStamp = 1298595840
RowId = (1:21:0)
Unused = 0
Link 0
0
4
30 00 04 00
0
1
7
4
02 00 00 00
11
A1 11
01 00 00 00
4513
41 41 ... 41 (4500)
12
FA 1B 00 00
20
15 00 00 00
13
B9 91
8
16
94 11 00 00
9
02 00 00 02 00
3
12
6
UpdateSeq = 1
Size = 4500
22
01 00
24
00 00
timestamp = BitConverter.ToInt64(data, 8) << 16;
29
Blob row at: Page (1:21) Slot 0 Length: 4514 Type: 3 (DATA)
Blob Id:469368832
0
1
2
08 00 A2 11
00 00 FA 1B 00 00 00 00
4514
14
12
4
03 00
42 42 ... 42 (4500)
Data
LOB type
Blob ID (timestamp)
BLOB_FRAGMENT
Stored on shared (obj-level) TextMix pages
30
Allways stored in-row if < 24 bytes
24 byte [BLOB Inline Root] pointer
Data stored in BLOB_FRAGMENT on TextMix page
Timestamp == Blob ID
Performance prediction is tough
31
varchar(MAX), nvarchar(MAX), varbinary(MAX)
The LOB that wanted to be a SLOB
Three scenarios
[BLOB Inline Data]
[BLOB Inline Root]
[Textpointer]
DEMO
32
Used when data fits in record
Not an official LOB structure
Slot 0 Column 1 Offset 0x0 Length 0 Length (physical) 0
A = [NULL]
B = [BLOB Inline Data] Slot 0 Column 2 Offset 0x1393 Length
4 Length (physical) 4
B = 0x41424344
33
Can reference up to 5 pages – data, roots, trees, etc.
12 byte header
0
4
30 00 04 00
0
1
7
4
02 00 00 00
11
A1 11
01 00 00 00
4513
41 41 ... 41 (4500)
12
FA 1B 00 00
20
15 00 00 00
13
B9 91
8
16
94 11 00 00
9
02 00 00 02 00
3
12
6
22
01 00
24
00 00
Array of 12 byte references
Only used by SLOBs & (MAX) LOBs
Also not a LOB structure (by my definition)
34
Wrapped in a single-column ”meta” record
0
1
A
2
B
4
FSize
X
LOB Structure
35
Blob row at: Page (1:176) Slot 0 Length: 8054 Type: 3 (DATA)
Blob Id:1210253312
Type 3
Where data is actually stored
Size always > 64 bytes (SMALL_ROOT)
36
0
8
Blob ID
10
Type
X
Data
How much data can we store in a DATA
record?
8096 – Page body size
8080 (8094)– Theoretical max
8040 (8054)- Reality
37
Blob row at: Page (1:55) Slot 0 Length: 324 Type: 2 (INTERNAL)
Blob Id: 1210253312 Level: 0 MaxLinks: 501 CurLinks: 19
Child 0 at Page (1:176) Slot 0 Size: 8040 Offset: 8040
Child 1 at Page (1:177) Slot 0 Size: 8040 Offset: 16080
Type 2
CurLinks = number of references
MaxLinks = ?
Level = tree level
Size = computed
38
Blob ID
Type
MaxLinks
CurLinks
Level
Offset
Page ID
File ID
Slot ID
Offset
Page ID
File ID
Slot ID
Total record size = 20 + X * 16
39
Normal record with in-row data
Level = 0
Record with <40 KB of data
DATA LOB
DATA LOB
DATA LOB
Level = 1
Level = 0
Record with >40 KB of data
INTERNAL LOB
DATA LOB
DATA LOB
DATA LOB
DATA LOB
DATA LOB
40
In theory (INTERNAL)... 8096 20 2 504
16
In reality... 500
500 8040 4,020,000
2 500 8040 8,040,000
5 500 8040 20,100,000
6 500 8040 24,120,000
41
DATA LOB
Level = 0
Level = 1
Record with >~4 MB of data
INTERNAL LOB
INTERNAL LOB
INTERNAL LOB
DATA LOB
DATA LOB
DATA LOB
DATA LOB
DATA LOB
42
DATA LOB
Level = 0
Level = 2
Level = 1
Record with >~16 MB of data
INTERNAL LOB
INTERNAL LOB
INTERNAL LOB
INTERNAL LOB
DATA LOB
DATA LOB
DATA LOB
DATA LOB
DATA LOB
43
Level = 2
4
Level = 1
X
500
Level = 0
X
8,040,000,000 bytes / 7,48 GB
(MAX) limit is 231-1
Many permutations
500
X
8040
44
sp_tableoption ‘MyTable’, ‘Option’, ‘ON/OFF’
Even more permutations
“text in row” 24-7000, default 256
45
Used for classic LOB types & MAX LOB types
with ’large value types out of row’ ON
text, ntext, image
Complex column
0
8
Timestamp
12
Page ID
14
16
File ID Slot ID
46
You thought (MAX) was complex?
Textpointer = evil
47
0
8
Blob ID
10
Type
12
Length
16
?
Type = 0
Used when data <= 64 bytes
Min size = 84
Data > length = garbage
84
Data ...
48
0
8
10
Blob ID
20
Type
24
Size
28
Page ID
12
14
MaxLinks CurLinks
30
16
Level
20
?
32
File ID Slot ID
...
Type = 5
Min size = 84
Part of LOB tree
49
Record with <65 bytes of data
SMALL_ROOT
DATA LOB
DATA LOB
Level = 0
Record with <40 KB of data
LARGE_ROOT_YUKON
DATA LOB
DATA LOB
DATA LOB
Record with >40 KB of data
Level = 1
Level = 0
LARGE_ROOT_YUKON
INTERNAL LOB
DATA LOB
DATA LOB
DATA LOB
50
DATA LOB
Record with >~4 MB of data
Level = 2
Level = 1
LARGE_ROOT_YUKON
INTERNAL LOB
Level = 0
DATA LOB
INTERNAL LOB
DATA LOB
INTERNAL LOB
DATA LOB
INTERNAL LOB
DATA LOB
DATA LOB
51
Data
Varchar(X)
Varchar(MAX)
Text
NULL
0
0
0
0-64
0-64
0-64
100 (16 + 84)
65-8000
65-8000
65-8000 * (+ 24 + 14)
100 + 14 + 65-8000
8kb +
N/A
24 + X
100 + X
Extreme impact on small data
The more data, the less of a diff
Performance differences
http://sqlblog.com/blogs/paul_white/archive/2011/02/23/Advanced-TSQL-Tuning-Why-Internals-Knowledge-Matters.aspx
52
Type
Name
0
SMALL_ROOT
1
?
2
INTERNAL
3
DATA
4
?
5
LARGE_ROOT_YUKON
6
?
7
?
8
?
9+
?
53
0
8
Blob ID
10
Type
12
Length
16
?
84
Data ...
54
Type
Name
0
SMALL_ROOT
1
LARGE_ROOT
2
INTERNAL
3
DATA
4
LARGE_ROOT_SHILOH
5
LARGE_ROOT_YUKON
6
SUPER_LARGE_ROOT
7
8
NULL
9+
INVALID
55
< 8000 => (MAX) = (X)
> 8000 => Tree is built
Text/ntext/image horribly inefficient
Lots of legacy details
56
How pages are organized
Defines how data is *physically* stored
Clustered index
Guarantees physical order of data
Row identified by ”clustered key”
Heap
Data stored whereever SQL Server wants to
Row identified by ”RID”
58
Ø 10 20
Ø 4
1 3
4 5
17
23
17 18
23 25 26
59
Relies on IAM pages
Leaf pages not linked
Except...
1 3
·· ·
4 5
·
17 18
23 25 26
60
Extents, pages & objects
All pages allocated as part of an extent
Mixed extents
Uniform extents
First 8 = mixed, rest uniform
Pages 72-79
Pages 64-71
62
Global Allocation Map
1 = Free, 0 = Allocated
Bitmap tracks 63,904 extents, almost 4GB
Present every 511,232 pages
GAM interval
2 / 511232, every 511232 pages
~4GB
~4GB
...
2
...
511232
~4GB
...
1022464
...
63
Shared Global Allocation Map
1 = Mixed & > 0 free pages
0 = Either uniform or mixed w/no free pages
Structure identical to GAM
3 / 511233, every 511232 pages
64
Index Allocation Map
1 = Uniformly allocated to IAM chain / allocation unit
0 = Not owned by IAM chain / AU
No fixed positioning!
Tracks a GAM interval
Structure (almost) identical to GAM
65
Sequence number
Status
Object ID
Index ID
Page count
Start page ID
Start page file ID
0
42
4
14
int
?
int
short
46
16
short
28
32
?
int
34
short
36
short
40
int
42
short
X 8
48
90
Slot 0 file ID
Slot 0 page ID
66
GAM
SGAM
Any IAM
Status
0
0
0
1
0
1
0
0
0
0
1
0
Mixed, all pages allocated
0
1
1
1
0
1
1
1
0
Invalid
Invalid
Invalid
1
1
1
Invalid
Mixed, > 0 free pages
Uniform
Unallocated
67
001
1
0
1
1
0
Page Free Space
Not used
Bytemap
1 / 8088, every 8088 pages
PFS interval
Page allocated?
Mixed page?
IAM page?
Ghost records?
How much free space?
Only tracks fullness where necessary
68
GAM interval
0
1
2
3
4 5
6
7
8
9
...
8088
...
511232 511233
...
SGAM
Boot page
PFS
GAM
ML map
DIFF map
Unused
SGAM
GAM
PFS
File header
69
Heap /
Index
Objects
Partitions
Allocation units
HOBT
SLOB
Partition 1
Partition N
LOB
LOB
SLOB
HOBT
70
Page 9 in primary data file
0-1
Version
2-3
CreateVersion
...
?
DBCC PAGE == DBCC DBINFO
32-35
Status
Lots of interesting info
36-39
NextID
...
?
48-303
Database name
...
?
308-309
Database ID
...
?
312-319
Max DB Timestamp
Name + ID
...
?
512-515
First sys indexes Page ID
FirstSysIndexes
516-517
First sys indexes File ID
518 - 1440
?
Physical version
Log rebuild count
Last OK CHECKDB
Last LOG backup
71
The source of our parsing metadata
Schema
sys.tables + sys.columns / sys.indexes + sys.index_columns
Indexes
Root page
Heaps
IAM chain root
DEMO
73
sys.tables (t)
object_id
1
t.object_id = p.object_id
*
sys.partitions (p)
partition_id == hobt_id
1
p.hobt_id = au.container_Id
1..2
sys.system_internals_allocation_units (au)
(HOBT, SLOB)
container_id = sys.partitions.hobt_id
1
p.partition_id = au.container_Id
0..1
sys.system_internals_allocation_units (au)
(LOB)
container_id = sys.partitions.partition_Id
74
Just views, no physical storage
Chicken or the egg
How about we take a look at those views?
DEMO
75
The basis for DMV data
Can only be queried through the DAC
Here be dragons!
Confusing column names
Utilizes internal functions
76
SELECT * FROM sys.sysschobjs
sysschobjs
sys.tables (t)
object_id
syscolpars
1
t.object_id = p.object_id
sysrowsets
sysallocunits
*
sys.partitions (p)
partition_id == hobt_id
1
p.hobt_id = au.container_Id
1..2
DEMO
sys.system_internals_allocation_units (au)
(HOBT, SLOB)
container_id = sys.partitions.hobt_id
1
p.partition_id = au.container_Id
0..1
sys.system_internals_allocation_units (au)
(LOB)
container_id = sys.partitions.partition_Id
77
Boot page points to sysallocunits (FirstSysIndexes)
Constant partition ID leads us to sysrowsets
Constant object ID leads us to sysschobjs
Using the above we can find syscolpars
DEMO
78
SELECT
CASE c.maxinrowlen
WHEN 0 THEN p.length
ELSE c.maxinrowlen
END AS max_inrow_length,
p.xtype AS system_type_id,
p.length AS max_length,
p.prec AS PRECISION,
p.scale AS scale,
FROM
sys.sysrscols c –- sys.system_internals_partition_columns
OUTER APPLY
OPENROWSET(TABLE RSCPROP, c.ti) p
79
80
CREATE TABLE TITest
(
A binary(50),
B char(10),
C datetime2(5),
D decimal(12, 5),
E float,
F int,
G numeric(11, 4),
H nvarchar(50),
I nvarchar(max),
J time(3),
K tinyint,
L varbinary(max),
M varchar(75),
N text
)
81
SELECT
t.name,
r.ti,
p.scale,
p.precision,
p.max_length,
p.system_type_id,
p.max_inrow_length
FROM
sys.system_internals_partition_columns p
INNER JOIN
sys.sysrscols r ON
r.rscolid = p.partition_column_id AND
r.rsid = p.partition_id
INNER JOIN
sys.types t ON
t.system_type_id = p.system_type_id AND
t.user_type_id = p.system_type_id
WHERE
partition_id = 72057594040614912
82
83
12973
= 0x000032AD
173
= 0xAD == 12973 & 0xFF
50
= 0x32 == (12973 & 0xFFFF00) >> 8
84
1322
= 0x0000052A
42
= 0x2A == 1322 & 0xFF
5
= 0x5 == (1322 & 0xFF00) >> 8
25
= 20 + scale
85
330858
= 0x00050C6A
106
= 0x6A == 330858 && 0xFF
12
= 0x0C == (330858 && 0xFF00) >> 8
5
= 0x05 == (330858 && 0xFF0000) >> 16
86
[Test]
public void Decimal()
{
var parser = new SysrscolTIParser(330858);
Assert.AreEqual(5, parser.Scale);
Assert.AreEqual(12, parser.Precision);
Assert.AreEqual(9, parser.MaxLength);
Assert.AreEqual(106, parser.TypeID);
Assert.AreEqual(9, parser.MaxInrowLength);
parser = new SysrscolTIParser(396138);
Assert.AreEqual(6, parser.Scale);
Assert.AreEqual(11, parser.Precision);
Assert.AreEqual(9, parser.MaxLength);
Assert.AreEqual(106, parser.TypeID);
Assert.AreEqual(9, parser.MaxInrowLength);
}
87
When everything else fails
You should always have backups available
Make sure to test your backups
Run regular consistency checks
This is a last resort measure
89
Torn pages
Corrupt pages
Bad metadata
Accidental deletes & truncations
How does OrcaMDF differ?
90
One page = 16 disk sectors
First and last sector most important
No header
Identify object from IAM, linked list
No slot array
Slot count in header
Identify record formats in body
91
Checksum doesn’t match content
Could be minor issue, probably major
Treat like torn page
92
SQL Server bugs
Corrupt/torn pages
Scan pages and identify object in header
Scan pages and look for IAM chain
Deduce schema
App
Docs
Record format
93
Accidental delete
Records may be ghosted
Records removed from slot array
STOP!
Accidental truncation
Pages deallocated, physically intact
Scan pages, linked list
94
Garbage may be mistaken for data
Was page allocated?
Look for clues in salvagable allocation structures
95
CorruptMdf class
Verifying tornbits / checksum
Utility methods
Scan for pages belonging to object
Scan for IAM pages
Best-effort parsing of pages
96
Now’s the chance
Blog:
Twitter:
Email:
improve.dk
@improvedk
[email protected]