CSE 331. Computer Organization

Download Report

Transcript CSE 331. Computer Organization

ECE 569
Database System Engineering
Fall 2004
Yanyong Zhang www.ece.rutgers.edu/~yyzhang
Course URL www.ece.rutgers.edu/~yyzhang/fall04
ECE569 Lecture 04-2.1
Fall 2004
Associative access

The system is not asked to retrieve tuples based
on information about their storage location; rather,
it has to find all tuples the attribute values of which
fulfill certain conditions – associative access.

Associative access can be realized by sequential
scanning, which happens for complicated queries.

select
R.x, S.y,
from
R,S
where
R.k = S.f and R.b < 12;
But for simple selection predicates, this is very
slow (even for an in-memory database)
ECE569 Lecture 04-2.2
Fall 2004
Access Path

The class of algorithms and data structures
designed for translating attribute values into TID,
or into other types of internal addresses of tuples
having those attribute values, is called access
paths.

Depending on what kind of selection predicate is
to be supported, the techniques for associative
access vary greatly.
ECE569 Lecture 04-2.3
Fall 2004
Content addressability techniques


Primary key access.

A tuple of a relation must be retrieved efficiently via the
value of its primary (unique) key(s). e.g., key-sequenced
files and hased files.

Point query vs. range query
Secondary key access


A set of tuples are produced
Multi-table access

Tuple access is often based on relationships between
different tuples. E.g., all orders placed by a given
customer have to be found.
ECE569 Lecture 04-2.4
Fall 2004
Associative access path techniques

Hashing (key transformation)


Using the primary key value as a parameter to a function,
which returns the storage location of the tuple.
Key comparison

Maintaining a dynamic search structure on the set of
values in the key attribute. These values can be
organized into tables, lists, trees, and so on

e.g, B+ tree
ECE569 Lecture 04-2.5
Fall 2004
Operations on files (heap files)


Assumptions

n = number of records in file

R = number of records that can fit in block
Lookup – Given a key find corresponding record


Insertion – add record to file (allows duplicates)


On average, n / (2R) block accesses.
Read last block; it may need to allocate a new block.
Approximately, requires 2 accesses
Deletion – delete record

look up record n / (2R)

Write back to disk (1 access)

Reorganize (unpinned) – move tuple from last page to utilize
space (2 disk accesses)
ECE569 Lecture 04-2.6
Fall 2004
Hashed Files

File is divided into B buckets

Hash function h maps elements of the key space to
range [0, B)

Key space is large and unevenly distributed
- SSNs as character strings
- Each character takes on at most 10 of the possible 256
values

Hash function h must map key values evenly among a
relatively small number of values.
ECE569 Lecture 04-2.7
Fall 2004
Range of
Potential
Key Values
(the shaded
areas denote
used key
values)
ECE569 Lecture 04-2.8
FOLDING
Range of positive integers
Hash-based associative access
HASHING
tuple
address
space
Fall 2004
Folding

Convert arbitrary data types to a positive integer h can be
applied to.

Reduce number of bits so that arithmetic is efficient.

Example: Key is “Keefe” and 16803

Key value is the concatenation of byte representation of
individual fields

Folded value of key is
0x4b 0x65 0x65 0x66 0x65 0x0 0x0 0x0 0x41 0xa3

Partition result into words and combine using XOR
0x4b 0x65 0x65 0x66
0x65 0x0
0x0
0x0
0x41 0xa3 0x0
0x0

0x6f 0xc6 0x65 0x66 = 1875273062
ECE569 Lecture 04-2.9
Fall 2004
Hashing

goal of hashing

How to choose hash function if all the key values
are uniformly distributed?

The critical issue is to produce 1:1 mapping

Collision: different inputs are mapped to the same
output.

The criteria of a good hash function is to keep the
collision as small as possible.
ECE569 Lecture 04-2.10
Fall 2004
Static Hashing

Input: folded key values

Output: bytes (relative to the beginning of the file),
blocks ??


Bytes are not good because of the varying tuple size.

A block/page is called a bucket.
H: {0 … 232-1} -> {0, B-1}

Continuous allocation

Fixed size: B pages are allocated at file creation time.

Insert
- Determine the bucket
- Check the bucket ( collision may happen)
ECE569 Lecture 04-2.11
Fall 2004
How to find a good hash function

Division / remainder (Congruential hashing)


H(Kb) = kb mod B where kb is folded key value and B
is the number of buckets.
Nth power

Compute kbN, and from the resulting bit string (n x 31
bits) take log2B bits from the middle.

Base transformation

Polynomial division

Numerical analysis

encryption
ECE569 Lecture 04-2.12
Fall 2004
Performance

Assumption




Perfect hash function (tuples are uniformly distributed
over B buckets)
Lookup

½  n/R  1/B
To finish first match

n/R  1/B
If tuple does not exist
Insertion

n/R  1/B + 1
Test for duplicates

1
Otherwise
Deletion

 ½  n/R  1/B delete first match
ECE569 Lecture 04-2.13
Fall 2004
Collision

Two keys collide if they hash to same value

A bucket with room for R tuples can accommodate
R – 1 collisions before it overflows

Internal resolution: Place overflow blocks in another
bucket
- (h(K) + 1) mod B
linear probing
- (h2(h1(K))
multiple hashing
ECE569 Lecture 04-2.14
Fall 2004
Collision - continued

External resolution: Allocation overflow block, link to
overflow chain
buckets
Overflow
pages
ECE569 Lecture 04-2.15
Fall 2004
Discussion

What are the disadvantages of static hashing?

How do you limit the number of pages accessed
when retrieving a tuple, for both external and
internal resolution?
ECE569 Lecture 04-2.16
Fall 2004
Trie

The buckets will dynamically grow/shrink/balance

Fundamental: trie
A
0
A
0
0
0
0
B
1
ECE569 Lecture 04-2.17
B
C
(b)
(a)
000
001
010
011
100
101
110
111
1
1
C
1
1
D
A
D
B
C
(c)
Fall 2004
Dynamic hashing function

We need a hash function whose range of values
can change dynamically

One such hash function can be constructed using
a series of functions hi(k), i = 0, 1, …., such that for
any k, either hi(k) = hi-1(k) or hi(k) = hi-1(k)+2i-1.

Choose H(k), which maps the key space into
random bit patterns of length m, for m sufficiently
large. Then hi(k) may be defined as the integers
formed by the last i bits of H(k).
ECE569 Lecture 04-2.18
Fall 2004
Extendible Hashing

The number of buckets can grow/shrink.

An intermediate data structure translates the hash
results into page addresses. This data structure
needs to be as compact as possible.

Hashes into an array of pointer to buckets (directory).

The array is small enough to be kept in memory.
ECE569 Lecture 04-2.19
Fall 2004
Directory Growth
• To adapt to dynamically varying size of hash filemodify directory size
• Assume a hash function h(Kb) that produces a bit
string s.
• The directory is of size 2d. d is called the global
depth and is initially 0.
• Use least significant d bits of s to determine
bucket to access
• Each bucket has a corresponding local depth in
the range [0, d] which indicates the difference
between all the records in this bucket
ECE569 Lecture 04-2.20
Fall 2004
Example
Insert
Each
0x13, 0x10, 0x07, 0x00, 0x1f
page can contain no more than 2 tuples
0x13,0x10
0
0
0
0x10,0x00
1
0
1
d=1
d=0
0x10,0x00
1
00
0
01
1
10
d=1
0x13,0x10
0x13,0x07
1
2
11
d=2
0x13,0x07
ECE569 Lecture 04-2.21
2
Fall 2004
Example – insert 0x1f
0x10,0x00
1
000
local degree =
001
global degree –
010
log2(# of arrows pointing to
this bucket)
2
011
100
101
0x13
3
0x07,0x1f
3
110
111
d=3
ECE569 Lecture 04-2.22
Fall 2004
Performance

2 steps for retrieving a tuple

If we can keep the directory in memory, each
retrieval is one page access

Assuming 4 bytes per entry, 4KB pages, 1GB hash
files, and we want to keep the entire directory in
memory, what is the minimum buffer size?
ECE569 Lecture 04-2.23
Fall 2004
Discussion

How easy is it to keep the directory in the
memory?

How do we reduce the structure when the file
shrinks?

How do you make the directory small, and increase
space utilization? (deferred splitting)
ECE569 Lecture 04-2.24
Fall 2004
Linear hashing
(a)
a
b
c
d
d=2
00 01 10 11
w
(b)
a
b
000 01
c
d
10 11
d=2
100
x
(c)
a
b
c
d
000 001 10 11
w
d=2
100 101
y
x
(c)
a
b
c
d
000 001 010 11
d=2
ECE569 Lecture 04-2.25
y
w
100 101 110
(d)
a
b
c
d
w
x
000 001 010 011 100 101 110 111
d=3
Fall 2004
Addressing in linear hashing

Which hash function should be used?

d is the degree, p is the address of the next page to split

The algorithm is as follows:
begin
if (hd(k) >= p) then page:= hd(k)
else
page := hd+1(k)
if necessary, chase the overflow chain
end
ECE569 Lecture 04-2.26
Fall 2004
Discuss

Where to store the overflow?
ECE569 Lecture 04-2.27
Fall 2004