CS186: Introduction to Database Systems
Download
Report
Transcript CS186: Introduction to Database Systems
CS 405G: Introduction to
Database Systems
21 Storage
Chen Qian
University of Kentucky
Review
Disk management
Basic disk (physical) unit: page
Basic DB unit: record (row of a table)
How to connect page with record?
File!
7/18/2015
Chen Qian @ University of Kentucky
2
A row in a table, when located on disks, is called
A record
Two types of record:
Fixed-length
Variable-length
7/18/2015
Chen Qian @ University of Kentucky
3
In an abstract sense, a file is
In reality, a file is
A set of disk pages
Each record lives on
A set of “records” on a disk
A page
Physical Record ID (RID)
A tuple of <page#, slot#>
7/18/2015
Chen Qian @ University of Kentucky
4
Files
Higher levels of DBMS operate on records, and files of
records.
FILE: A collection of pages, containing a collection of
records.
Record = row in a table
Must support:
insert/delete/modify record
fetch a particular record (specified using record id)
scan all records (possibly with some conditions on the records
to be retrieved)
7/18/2015
Chen Qian @ University of Kentucky
5
Record layout
Record = row in a table
Variable-format records
Rare in DBMS—table schema dictates the format
Relevant for semi-structured data such as XML
Focus on fixed-format records
With fixed-length fields only, or
With possible variable-length fields
7/18/2015
Chen Qian @ University of Kentucky
6
Record Formats: Fixed Length
F1
F2
F3
F4
L1
L2
L3
L4
Base address (B)
All field lengths and offsets are constant
Address = B+L1+L2
Computed from schema, stored in the system catalog
Finding i’th field done via arithmetic.
7/18/2015
Chen Qian @ University of Kentucky
7
Fixed-length fields
Example: CREATE
TABLE Student(SID INT, name
CHAR(20), age INT, GPA FLOAT);
0
Watch out for alignment
4
24 28
36
142 Bart (padded with space) 10
2.3
May need to pad; reorder columns if that helps
What about NULL?
Add a bitmap at the beginning of the record
7/18/2015
Chen Qian @ University of Kentucky
8
Record Formats: Variable Length
Two alternative formats (# fields is fixed):
F1
F2
F3
$
F4
$
$
$
Fields Delimited by Special Symbols
F1
F2
F3
F4
Array of Field Offsets
Second offers direct access to i’th field, efficient storage
of nulls; extra but small directory overhead.
7/18/2015
Chen Qian @ University of Kentucky
9
Trade-off
Compared to the 1st formats, the offset based format has:
Pros: Fast access to a particular field; efficient storage of
nulls
Cons: Extra space
7/18/2015
Chen Qian @ University of Kentucky
10
Large Object (LOB) fields
Example: CREATE
Student records get “de-clustered”
TABLE Student(SID INT, name
CHAR(20), age INT, GPA FLOAT, picture BLOB(32000));
Bad because most queries do not involve picture
Decomposition (automatically done by DBMS and
transparent to the user)
Student(SID, name, age, GPA)
StudentPicture(SID, picture)
7/18/2015
Chen Qian @ University of Kentucky
11
Page layout
How do you organize records in a page?
Fixed length records
Variable length records
NSM (N-ary Storage Model) is used in most commercial
DBMS
7/18/2015
Chen Qian @ University of Kentucky
12
Page Formats: Fixed Length Records
Slot 1
Slot 2
Slot 1
Slot 2
Free
Space
...
...
Slot N
Slot N
Slot M
1 . . . 0 1 1M
N
PACKED
number
of records
M ... 3 2 1
UNPACKED, BITMAP
number
of slots
Record id = <page id, slot #>. In first alternative, moving
records for free space management changes rid; may not be
acceptable.
7/18/2015
Chen Qian @ University of Kentucky
13
Trade-off
Compared to the packed formats, the unpacked format
has:
Pros: No need to move records into vacated slots after
delete. No need to change the slot number.
Cons: To insert a record, one needs to find an empty slot,
by scanning the bitmap.
7/18/2015
Chen Qian @ University of Kentucky
14
Variable-Length Records
Store records from the beginning of each page
Use a directory at the end of each page
To locate records and manage free space
Necessary for variable-length records
142 Bart
857 Lisa
10 2.3
123 Milhouse 10 3.1
8 4.3
456 Ralph
8 2.3
Why store data and directory
at two different ends?
Both can grow easily
7/18/2015
Chen Qian @ University of Kentucky
15
Options
Reorganize after every update/delete to avoid
fragmentation (gaps between records)
Need to rewrite half of the block on average
What if records are fixed-length?
Reorganize after delete
Only need to move one record
Need a pointer to the beginning of free space
Do not reorganize after update
7/18/2015
Need a bitmap indicating which slots are in use
Chen Qian @ University of Kentucky
16
System Catalogs
For each relation:
For each index:
structure (e.g., B+ tree) and search key fields
For each view:
name, file location, file structure (e.g., Heap file)
attribute name and type, for each attribute
index name, for each index
integrity constraints
view name and definition
Plus statistics, authorization, buffer pool size, etc.
Catalogs are themselves stored as relations!
7/18/2015
Chen Qian @ University of Kentucky
17
Indexes
A Heap file allows us to retrieve records:
Sometimes, we want to retrieve records by
specifying the values in one or more fields, e.g.,
by specifying the rid, or
by scanning all records sequentially
Find all students in the “CS” department
Find all students with a gpa > 3
Indexes are file structures that enable us to answer
such value-based queries efficiently.
7/18/2015
Chen Qian @ University of Kentucky
18
Basics
Given a value, locate the record(s) with this value
SELECT * FROM R WHERE A = value;
SELECT * FROM R, S WHERE R.A = S.B;
Other search criteria, e.g.
Range search
SELECT * FROM R WHERE A > value;
Keyword search
database indexing
7/18/2015
Chen Qian @ University of Kentucky
Search
19
Dense and sparse indexes
Dense: one index entry for each search key value
Sparse: one index entry for each block
Records must be clustered according to the search key
7/18/2015
Chen Qian @ University of Kentucky
20
Dense versus sparse indexes
Index size
Requirement on records
Records must be clustered for sparse index
Lookup
Sparse index is smaller
Sparse index is smaller and may fit in memory
Dense index can directly tell if a record exists
Update
Easier for sparse index
7/18/2015
Chen Qian @ University of Kentucky
21
Primary and secondary indexes
Primary index
Secondary index
Created for the primary key of a table
Records are usually clustered according to the primary key
Can be sparse
Usually dense
SQL
PRIMARY KEY declaration automatically creates a primary
index, UNIQUE key automatically creates a secondary index
Additional secondary index can be created on non-key
attribute(s)
CREATE INDEX StudentGPAIndex ON
Student(GPA);
7/18/2015
Chen Qian @ University of Kentucky
22
Tree-Structured Indexes: Introduction
Tree-structured indexing techniques support both range
selections and equality selections.
ISAM =Indexed Sequential Access Method
static structure; early index technology.
B+ tree: dynamic, adjusts gracefully under inserts and
deletes.
7/18/2015
Chen Qian @ University of Kentucky
23
Motivation for Index
``Find all students with gpa > 3.0’’
If data file is sorted, do binary search
Cost of binary search in a database can be quite high,
Why?
Simple idea: Create an `index’ file.
index entry
P0
Page 1
K 1
P1
Index File
kN
k1 k2
K 2
Page 2
P 2
Page 3
K m Pm
Page N
Data File
Can do binary search on (smaller) index file!
7/18/2015
Chen Qian @ University of Kentucky
24
ISAM
What if an index is still too big?
Put a another (sparse) index on top of that!
ISAM (Index Sequential Access Method), more or less
Example: look up 197
100, 200, …, 901
Index blocks 100, 123, …, 192 200, …
…
901, …, 996
100, 108,123, 129,
192, 197,200, 202,
901, 907, 996, 997,
…
…
……
119, 121 …
…
…
…
Data blocks
7/18/2015
Chen Qian @ University of Kentucky
25
How many steps?
For a balanced tree of N pages, the num of search steps
is O(logN)
Less than a constant times logN
7/18/2015
Chen Qian @ University of Kentucky
26
Updates with ISAM
Example: insert 107
Example: delete 129
100, 200, …, 901
Index blocks 100, 123, …, 192 200, …
…
901, …, 996
100, 108,123, 129,
192, 197,200, 202,
901, 907, 996, 997,
…
…
……
119, 121 …
…
…
…
Data blocks
107
Overflow block
Overflow chains and empty data blocks degrade
performance
Worst case: most records go into one long chain
7/18/2015
Chen Qian @ University of Kentucky
27
How many steps?
For a chain of N pages, the worst-case num of search
steps is N
Much larger than O(logN)!
7/18/2015
Chen Qian @ University of Kentucky
28
A Note of Caution
ISAM is an old-fashioned idea
B+-trees are usually better, as we’ll see
But, ISAM is a good place to start to understand the
idea of indexing
Upshot
7/18/2015
Don’t brag about being an ISAM expert on your
resume
Do understand how they work, and tradeoffs with B+trees
Chen Qian @ University of Kentucky
29
Summary (Contd.)
DBMS vs. OS File Support
7/18/2015
DBMS needs features not found in many OSes, e.g.,
forcing a page to disk, controlling the order of page
writes to disk, files spanning disks, ability to control
pre-fetching and page replacement policy based on
predictable access patterns, etc.
Variable length record format with field offset
directory offers support for direct access to ith field
and null values.
Chen Qian @ University of Kentucky
30