Why Simple Hash Functions Work


Why Simple Hash Functions Work:
Exploiting the Entropy
in a Data Stream
Michael Mitzenmacher
Salil Vadhan
And improvements with Kai-Min Chung
The Question
• Traditional analyses of hashing-based
algorithms & data structures assume a
truly random hash function.
• In practice: simple (e.g. universal) hash
functions perform just as well.
• Why?
Outline
• Three hashing applications
• The new model and results
• Proof ideas
Bloom Filters
To approximately store S = {x1,…,xT}[N]:
• Start with array of M=O(T) zeroes.
• Hash each item k=O(1) times to [M] using
h : [N] → [M]^k, put a one in each location.
To test yS:
• Hash & accept if ones in all k locations.
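A minimal Python sketch of this construction; the salted SHA-256 hashing and the parameters are illustrative stand-ins for h : [N] → [M]^k, not the hash families analyzed later in this talk:

    import hashlib

    class BloomFilter:
        def __init__(self, m, k):
            self.m, self.k = m, k
            self.bits = [0] * m              # array of M zeroes

        def _locations(self, x):
            # k hash locations in [M]; salted SHA-256 stands in for h : [N] -> [M]^k
            for i in range(self.k):
                d = hashlib.sha256(f"{i}:{x}".encode()).digest()
                yield int.from_bytes(d[:8], "big") % self.m

        def add(self, x):
            for loc in self._locations(x):
                self.bits[loc] = 1           # put a one in each location

        def query(self, y):
            # accept iff all k locations hold ones (false positives possible, no false negatives)
            return all(self.bits[loc] for loc in self._locations(y))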
Bloom Filter Analysis
Thm [B70]:  S  yS,
if h is a truly random hash function,
Prh[accept y] = 2-(ln 2)·M/T+o(1).
for an optimal choice of k.
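For example, with M/T = 8 array bits per stored item, the optimal choice is k = (ln 2)·8 ≈ 5.5 hashes and the ideal false positive probability is roughly 2^(−5.5) ≈ 2%; a quick check of the formula (the ratio 8 is illustrative):

    import math

    M_over_T = 8                              # array bits per stored item (illustrative)
    k_opt = math.log(2) * M_over_T            # optimal number of hashes, ~5.5
    fp = 2 ** (-math.log(2) * M_over_T)       # ideal false positive probability, ~0.021
    print(k_opt, fp)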
Balanced Allocations
• Hashing T items into T buckets
– What is the maximum number of items, or
load, of any bucket?
– Assume buckets chosen independently &
uniformly at random.
• Well-known result:
Θ(log T / log log T) maximum load w.h.p.
Power of Two Choices
• Suppose each ball can pick two bins
independently and uniformly and choose the
bin with less load.
• Thm [ABKU94]: maximum load
log log n / log 2 + Θ(1) w.h.p.
Linear Probing
• Hash elements into an array of length M.
• If location h(x) is already full, try h(x)+1, h(x)+2,…
until an empty spot is found; place x there.
• Thm [K63]: Expected insertion time for the
T’th item is 1/(1 − T/M)² + o(1).
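A minimal open-addressing sketch of this insertion rule (fixed-size table with wrap-around, no deletions; the table size and the identity "hash" are illustrative and assume the table is not full):

    def insert(table, x, h):
        """Place x at the first empty slot among h(x), h(x)+1, ... (mod table size)."""
        M = len(table)
        i = h(x) % M
        while table[i] is not None:            # try h(x)+1, h(x)+2, ... until an empty spot
            i = (i + 1) % M
        table[i] = x

    # usage sketch
    table = [None] * 16
    for x in [3, 19, 35, 7]:
        insert(table, x, h=lambda v: v)        # identity "hash", just for illustration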
Explicit Hash Functions
Can sometimes analyze for explicit
(e.g. universal [CW79]) hash functions, but
• performance somewhat worse, and/or
• hash functions complex/inefficient.
Noted since the 1970s that simple hash functions
match idealized performance on real data.
Simple Hash Functions
Don’t Always Work
• ∃ pairwise independent hash families & inputs s.t.
Linear Probing has Ω(log T) insertion time [PPR07].
• ∃ k-wise independent hash families & inputs s.t.
Bloom Filter error prob. higher than ideal [MV08].
• Open for Balanced Allocations.
• Worst case does not match practice.
Average-Case Analysis?
• Data uniform & independent in [N].
– Not a good model for real data.
– Trivializes hashing.
• Need an intermediate model between worst-case and average-case analysis.
Our Model: Block Sources [CG85]
• Data is a finite stream, modeled by a sequence of
random variables X1,X2,…,XT ∈ [N].
• Each stream element has some k bits of (Renyi)
entropy, conditioned on previous elements:
∀ x1,…,xi−1: cp(Xi | X1 = x1,…,Xi−1 = xi−1) ≤ 2^(−k),
where cp(X) = Σx Pr[X = x]².
• Similar spirit to semi-random graphs [BS95],
smoothed analysis [ST01].
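A small sketch of the quantity cp(X) and the corresponding Renyi (order-2) entropy; the example conditional distribution is illustrative:

    import math

    def collision_prob(dist):
        # cp(X) = sum over x of Pr[X = x]^2
        return sum(p * p for p in dist.values())

    def renyi_entropy(dist):
        # order-2 Renyi entropy H_2(X) = -log2 cp(X); a block source has k bits per
        # block if every such conditional distribution has H_2 >= k
        return -math.log2(collision_prob(dist))

    # e.g. the conditional distribution of the next stream element
    dist = {"a": 0.5, "b": 0.25, "c": 0.25}
    print(collision_prob(dist), renyi_entropy(dist))   # 0.375, ~1.42 bits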
An Approach
• H truly random: for all distinct x1,…,xT,
(H(x1),…,H(xT)) is uniform in [M]^T.
• Goal: if H is a random universal hash function
and X1,X2,…,XT is a block source, then
(H(X1),…,H(XT)) is “close” to uniform.
Randomness extractors!
Classic Extractor Results
[BBR88,ILL89,CG85,Z90]
• Leftover Hash Lemma:
If H : [N] → [M] is a random universal hash function
and X has Renyi entropy at least log M + 2 log(1/ε),
then (H, H(X)) is ε-close to uniform.
• Thm: If H : [N]  [M] is a random universal hash
function and X1,X2,…XT is a block source with
Renyi entropy at least log M + 2log(T/) per block,
then (H,H(X1),.. H(XT)) is -close to uniform.
Sample Parameters
• Network flows (IP addresses, ports, transport
protocol): N = 2^104
• Number of items: T = 2^16
• Hash range (2 values per item): M = 2^32.
• Entropy needed per item: 64 + 2 log(1/ε).
• Can we do better?
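Where the 64 comes from, as a quick check against the theorem above (ε left as a parameter):

    import math

    log_M, log_T = 32, 16
    # per-block entropy needed: log M + 2*log(T/eps) = 32 + 2*(16 + log(1/eps))
    entropy_needed = lambda eps: log_M + 2 * (log_T + math.log2(1 / eps))
    print(entropy_needed(2 ** -10))    # 84.0 bits when eps = 2^-10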
Improved Bounds I
Thm [CV08]: If H : [N]  [M] is a random
universal hash function and X1,X2,…XT is a
block source with Renyi entropy at least
log M+log T+2log(1/)+O(1) per block, then
(H,H(X1),.. H(XT)) is -close to uniform.
Tight up to additive constant [CV08].
Improved Bounds II
Thm [MV08,CV08]:
If H : [N] → [M] is a random universal hash
function and X1,X2,…,XT is a block source with
Renyi entropy at least
log M + log T + log(1/ε) + O(1) per block, then
(H, H(X1),…,H(XT)) is ε-close to a distribution with
collision probability O(1/M^T).
Tight upto dependence on  [CV08].
Proof Ideas: Upper Bounds
1. Bound average conditional collision probs:
cp(H(Xi)| H,H(X1),.. H(Xi-1))  1/M+1/2k.
2a. Statistical closeness to uniform: inductively bound
“Hellinger distance” from uniform.
2b. Close to small collision prob: by Markov, get
(1/T)·Σi cp(H(Xi) | H=h, H(X1)=y1,…,H(Xi−1)=yi−1)
≤ 1/M + 1/(ε·2^k) w.p. 1−ε over h, y1,…,yi−1
Proof Ideas: Lower Bounds
• Lower bound for randomness extractors [RT97]:
if k not large enough, then  X of min-entropy k s.t.
h(X) “far” from uniform for most h.
• Take X1,X2,…XT to be iid copies of X.
• Show that the error accumulates, e.g. statistical distance
grows by a factor of Ω(√T) [R04,CV08].
Open Problems
• Tightening connection to practice.
– How to estimate relevant entropy of data streams?
– Cryptographic hash functions (MD5,SHA-1)?
– Other data models?
• Block source data model.
– Other uses, implications?