Algoritmi per IR
Prologo
What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]
Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NP-complete means.
Mathematics: …
Operating Systems: …
Coding: …
References
Managing Gigabytes
A. Moffat, T. Bell and I. Witten, Morgan Kaufmann Publishers, 1999.
Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan Kaufmann Publishers, 2003.
A bunch of scientific papers available on the course site !!
About this course
It is a mix of algorithms for
data compression
data indexing
data streaming (and sketching)
data searching
data mining
Massive data !!
Paradigm shift...
Web 2.0 is about the many
Big DATA ⟹ Big PC ?
We have three types of algorithms:
  T1(n) = n,   T2(n) = n²,   T3(n) = 2^n
... and assume that 1 step = 1 time unit.
How many input data n may each algorithm process within t time units?
  n1 = t,   n2 = √t,   n3 = log2 t
What about a k-times faster processor?
...or, what is n when the available time units are k*t ?
  n1 = k*t,   n2 = √k * √t,   n3 = log2(kt) = log2 k + log2 t
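The three growth rates above can be checked with a tiny sketch: for a budget of t steps, it computes the largest n each algorithm can handle (the labels and function are illustrative, not from the slides).

```python
import math

def max_input_size(t):
    """Largest input n each algorithm can process within t time steps,
    assuming 1 step = 1 time unit (a sketch of the slide's argument)."""
    return {
        "T1(n)=n":   t,                   # n1 = t
        "T2(n)=n^2": int(math.isqrt(t)),  # n2 = sqrt(t)
        "T3(n)=2^n": int(math.log2(t)),   # n3 = log2(t)
    }

# A k-times faster processor gives k*t steps in the same wall-clock time:
# the linear algorithm gains a factor k, the quadratic one only sqrt(k),
# the exponential one just an additive log2(k).
```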
A new scenario
Data are more available than ever before:
  n → ∞ ... is more than a theoretical assumption.
The RAM model is too simple:
  step cost is Ω(1) time, so minimizing the number of steps is not enough…
The memory hierarchy (CPU with registers → L1/L2 cache → RAM → HD → net):

  Level       Size       Access time       Unit fetched
  Cache       few MBs    some nanosecs     few words
  RAM         few GBs    tens of nanosecs  some words
  Disk (HD)   few TBs    few millisecs     B = 32K page
  Net         many TBs   even secs         packets
You should be “??-aware programmers”
I/O-conscious Algorithms
(Figure: disk anatomy — tracks, read/write head and arm over the magnetic surface.)
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O [10^5–10^6 (Hennessy-Patterson)]
If N = (1+f)M, then the disk-average cost per step is:
  C * p * f/(1+f)
This is at least 10^4 * f/(1+f).
If we fetch B ≈ 4KB in time C, and the algorithm uses all of them:
  (1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)
Space-conscious Algorithms
Compressed data structures: support search and access operations while reducing the I/Os.
Streaming Algorithms
Data arrive continuously, or we wish FEW scans.
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock and its daily performance over time, find the time
window in which it achieved the best "market performance".
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running times for input size n (cubic vs. quadratic algorithm):

  n     4K    8K   16K   32K    128K   256K   512K   1M
  n^3   22s   3m   26m   3.5h   28h    --     --     --
  n^2   0     0    0     1s     26s    106s   7m     28m
An optimal solution
We assume every subsum ≠ 0.
(Picture: A is split into a prefix of sum < 0 followed by a region of sum > 0 that contains the optimum.)
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
  sum = 0; max = -1;
  for i = 1,...,n do
    if (sum + A[i] ≤ 0) then sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }
Note:
• sum < 0 right before OPT starts;
• sum > 0 within OPT.
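The linear-time scan above can be sketched as follows (a standard Kadane-style formulation that also reports the window; the function name and the returned triple are illustrative choices, not from the slides):

```python
def max_subarray(A):
    """Maximum-sum contiguous subarray in one linear scan.
    Returns (best_sum, start, end) with A[start:end] the best window."""
    best, best_start, best_end = A[0], 0, 1
    cur, cur_start = 0, 0
    for i, x in enumerate(A):
        if cur <= 0:               # a non-positive running sum never helps:
            cur, cur_start = 0, i  # restart the window here
        cur += x
        if cur > best:
            best, best_start, best_end = cur, cur_start, i + 1
    return best, best_start, best_end
```

On the slide's array the best window is 6 1 -2 4 3, of sum 12.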
Toy problem #2: sorting
How to sort tuples (objects) on disk?
Memory contains the array A, an "array of pointers to objects".
Key observation: for each object-to-object comparison A[i] vs A[j], we pay
2 random accesses to the memory locations pointed to by A[i] and A[j].
MergeSort ⟹ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting?
Using a well-tuned B-tree library (Berkeley DB): after n insertions, the data
get distributed arbitrarily across the B-tree leaves ("tuple pointers") !!!
What about listing the tuples in order?
Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months.
Binary Merge-Sort
  Merge-Sort(A,i,j)
  01 if (i < j) then
  02   m = (i+j)/2;          // Divide
  03   Merge-Sort(A,i,m);    // Conquer
  04   Merge-Sort(A,m+1,j);
  05   Merge(A,i,m,j)        // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
  n = 10^9 tuples ⟹ a few GBs
Typical disk (Seagate Cheetah, 150GB): seek time ≈ 5ms.
Analysis of mergesort on disk:
it is an indirect sort: Θ(n log2 n) random I/Os
  [5ms] * n log2 n ≈ 1.5 years
In practice it is faster because of caching: the merge levels whose runs fit
in memory cost only 2 passes (R/W) overall.
Merge-Sort Recursion Tree
The recursion tree has log2 N levels.
(Figure: the recursion tree over the input, with sorted runs such as
1 2 5 7 9 10 and 3 4 8 11 12 15 17 19 merged pairwise, level by level.)
If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help.
How do we deploy the disk/mem features?
Produce N/M runs, each sorted in internal memory (no I/Os)
⟹ I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge.
Sort N items with main-memory M and disk-pages of B items:
  Pass 1: produce N/M sorted runs.
  Pass i: merge X = M/B runs ⟹ log_{M/B} (N/M) merge passes.
(Figure: X input buffers of B items in main memory, one per run on disk,
feeding one output buffer that is flushed back to disk.)
Multiway Merging
Keep one current page Bf_i (with pointer p_i) per run, plus an output page
Bf_o (with pointer p_o):
  repeatedly move min(Bf_1[p_1], Bf_2[p_2], …, Bf_X[p_X]) into Bf_o;
  fetch the next page of run i when p_i = B;
  flush Bf_o to the merged-run output file when it is full;
  stop at EOF of all X = M/B runs.
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os
In practice
  M/B ≈ 1000 ⟹ #passes = log_{M/B} (N/M) ≈ 1
  One multiway merge ⟹ 2 passes (R/W) = few mins
Tuning depends on disk features:
  a large fan-out (M/B) decreases #passes;
  compression would decrease the cost of a pass!
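The scheme above can be sketched in a few lines: split the input into runs of at most M items, sort each run "in internal memory", then perform one multi-way merge (here runs are plain Python lists standing in for run files on disk; `heapq.merge` plays the role of the per-run buffer pages).

```python
import heapq

def multiway_mergesort(items, M):
    """Sketch of multi-way merge-sort: N/M sorted runs, then ONE
    multi-way merge (valid when #runs <= M/B, as in the slides)."""
    runs = [sorted(items[i:i + M]) for i in range(0, len(items), M)]
    # heapq.merge keeps one 'current element' per run, mimicking
    # the one-page-per-run input buffers.
    return list(heapq.merge(*runs))
```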
Can compression help?
Goal: enlarge M (more items fit in memory) and reduce N (fewer bits per pass):
  #passes = O(log_{M/B} (N/M)),  cost of a pass = O(N/B)
See Vitter's paper for issues related to:
  Disk Striping: sorting easily on D disks
  Distribution sort: top-down sorting
  Lower Bounds: how far we can go
Toy problem #3: Top-freq elements
Goal: top queries over a stream of N items (drawn from a large universe S).
Math Problem: find the item y whose frequency is > N/2, using the smallest
space (i.e., the mode, provided it occurs > N/2 times).
A = b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables ⟨X, C⟩:
  for each item s of the stream:
    if (X == s) then C++;
    else { C--; if (C ≤ 0) { X = s; C = 1; } }
  return X;
Proof idea: if X ≠ y at the end, every occurrence of y was cancelled by a
"negative" mate, so the mates would number ≥ #occ(y); but then
2 * #occ(y) > N items would not fit in the stream. Problems arise if the top
frequency is ≤ N/2: the returned X is then arbitrary.
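The two-variable scan above (the Boyer–Moore majority vote) can be written directly; note that when no item exceeds N/2 the candidate is meaningless and a second verification pass is needed.

```python
def majority_candidate(stream):
    """One-pass majority vote: keeps only the pair (X, C).
    If some item occurs > N/2 times, it is returned; otherwise the
    result is arbitrary and must be verified with a second pass."""
    X, C = None, 0
    for s in stream:
        if X == s:
            C += 1
        else:
            C -= 1
            if C <= 0:      # adopt the current item as new candidate
                X, C = s, 1
    return X
```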
Toy problem #4: Indexing
Consider the following TREC collection:
  N = 6 * 10^9 characters ⟹ size = 6GB
  n = 10^6 documents
  TotT = 10^9 word occurrences (avg term length is 6 chars)
  t = 5 * 10^5 distinct terms
What kind of data structure should we build to support word-based searches?
Solution 1: Term-Doc matrix
t = 500K terms (rows) × n = 1 million docs (columns); an entry is 1 if the
play contains the word, 0 otherwise:

             Antony&Cleopatra  JuliusCaesar  TheTempest  Hamlet  Othello  Macbeth
  Antony           1               1             0          0       0        1
  Brutus           1               1             0          1       0        0
  Caesar           1               1             0          1       1        1
  Calpurnia        0               1             0          0       0        0
  Cleopatra        1               0             0          0       0        0
  mercy            1               0             1          1       1        1
  worser           1               0             1          1       1        0

Space is 500GB !
Solution 2: Inverted index
  Brutus    → 2 4 8 16 32 64 128
  Calpurnia → 1 2 3 5 8 13 21 34
  Caesar    → 13 16
1. Typically use about 12 bytes per posting
2. We have 10^9 total terms ⟹ at least 12GB space
3. Compressing the 6GB of documents gets 1.5GB of data
A better index, but yet it is >10 times the (compressed) text !!!!
We can still do better: i.e. 30–50% of the original text.
Please!! Do not underestimate the features of disks in algorithmic design.
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly)
compressed, some other must expand.
Take all messages of length n. Is it possible to compress ALL of them into
fewer bits? NO: they are 2^n, but the compressed messages of length < n are
fewer:
  ∑_{i=1}^{n-1} 2^i = 2^n - 2
We need to talk about stochastic sources.
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:
  i(s) = log2 (1/p(s)) = -log2 p(s)
Lower probability ⟹ higher information.
Entropy is the weighted average of i(s):
  H(S) = ∑_{s∈S} p(s) log2 (1/p(s))   bits
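The definition translates directly into code; the sketch below computes H(S) from a list of probabilities (the function name is an illustrative choice).

```python
import math

def entropy(probs):
    """H(S) = sum_s p(s) * log2(1/p(s)): the weighted average
    self-information, in bits per symbol."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# e.g. a uniform pair {.5,.5} costs 1 bit/symbol; a skewed source less.
```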
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ? It decodes both as "ca" (101·1) and as
"ad" (1·011).
A uniquely decodable code can always be uniquely decomposed into its
codewords.
Prefix Codes
A prefix code is a variable-length code in which no codeword is a prefix of
another one.
e.g. a = 0, b = 100, c = 101, d = 11
It can be viewed as a binary trie: branch on 0/1 at each node, with the
symbols at the leaves (a at 0, b at 100, c at 101, d at 11).
Average Length
For a code C with codeword lengths L[s], the average length is defined as
  La(C) = ∑_{s∈S} p(s) L[s]
We say that a prefix code C is optimal if for all prefix codes C',
La(C) ≤ La(C').
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there
exists a prefix code with the same codeword lengths, and thus the same
(optimal) average length. And vice versa…
Theorem (golden rule). If C is an optimal prefix code for a source with
probabilities {p1, …, pn}, then pi < pj ⟹ L[si] ≥ L[sj].
Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any
uniquely decodable code C, we have
  H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any probability distribution, there
exists a prefix code C such that
  La(C) ≤ H(S) + 1
(The Shannon code assigns s a codeword of ⌈log2 1/p(s)⌉ bits.)
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Repeatedly merge the two least-probable trees:
  a(.1) + b(.2) → (.3);   (.3) + c(.2) → (.5);   (.5) + d(.5) → (1)
Labeling branches 0/1 gives: a = 000, b = 001, c = 01, d = 1.
There are 2^(n-1) "equivalent" Huffman trees (flip the 0/1 labels at each of
the n-1 internal nodes).
What about ties (and thus, tree depth) ?
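The greedy merging above can be sketched with a binary heap. Ties are broken arbitrarily (by insertion order here), so the codewords may differ from the slide's a=000, b=001, c=01, d=1, but the codeword lengths — and hence the average length — are always optimal.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a (binary) Huffman code from {symbol: probability}.
    Returns {symbol: bitstring}; lengths are optimal, labels arbitrary."""
    tiebreak = count()  # keeps heap comparisons away from the dicts
    heap = [(p, next(tiebreak), {s: ""}) for s, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # two least-probable trees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]
```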
Encoding and Decoding
Encoding: emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take the branch for each bit received. When
at a leaf, output its symbol and return to the root.
With the tree above:  abc… → 00000101   and   101001… → dcb…
A property on tree contraction
Substituting the two least-probable symbols x, y with one new symbol x+y
preserves optimality; by induction, the optimality of Huffman follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store, for any level L:
  firstcode[L] — the numerically smallest codeword on level L (on the
  deepest level it is 00.....0);
  Symbol[L,i], for each leaf i on level L.
This is ≤ h² + |S| log |S| bits, where h is the height of the tree.
Canonical Huffman: Encoding and Decoding
Example with 5 levels; the stored firstcodes are:
  firstcode[1]=2, firstcode[2]=1, firstcode[3]=1, firstcode[4]=2, firstcode[5]=0
To decode T = …00010…, accumulate bits level by level into a value v,
stopping at the first level L with v ≥ firstcode[L]; the symbol is then
Symbol[L, v - firstcode[L]].
Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
  -log2(.999) ≈ .00144 bits
If we were to send 1000 such symbols we might hope to use
1000 * .00144 ≈ 1.44 bits. Using Huffman, we take at least 1 bit per symbol,
so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols:
  1 extra bit per macro-symbol = 1/k extra bits per symbol,
  but a larger model must be transmitted.
(Shannon took infinite sequences, i.e. k → ∞ !!)
In practice we have:
  the model takes |S|^k * (k * log |S|) + h² bits (where h might be |S|),
  and H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L.
Compress + Search ?  [Moura et al, 98]
Compressed text derived from a word-based Huffman:
  symbols of the Huffman tree are the words of T;
  the Huffman tree has fan-out 128;
  codewords are byte-aligned and tagged.
Each byte carries 7 bits of the codeword plus 1 tag bit marking whether it is
the first byte of a codeword ("tagging").
(Figure: the 128-ary Huffman tree over the words of T = "bzip or not bzip" —
leaves [bzip], [or], [not], [space] — with byte-aligned codewords such as
bzip = 1a 0b, i.e. a first byte tagged 1 with payload a, then a byte tagged 0
with payload b.)
CGrep and other ideas...
To grep for a word, compress the pattern with the same dictionary,
P = bzip = 1a 0b, and scan the bytes of C(T) directly: the tag bits prevent
false matches across codeword boundaries.
T = "bzip or not bzip" ⟹ yes at the two occurrences of [bzip], no elsewhere.
Speed ≈ Compression ratio
You find this under my Software projects.
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary = {a, bzip, not, or, space, …}; P = bzip = 1a 0b.
(Figure: scanning C(S), S = "bzip or not bzip", for the 2-byte codeword of P;
each codeword-aligned position answers yes/no.)
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the pattern
P[1,m] in the text T[1,n].
(Picture: the pattern P slid along the text T.)
Naïve solution: for any position i of T, check if T[i,i+m-1] = P[1,m].
Complexity: O(nm) time.
(Classical) optimal solutions based on comparisons:
  Knuth-Morris-Pratt, Boyer-Moore.
Complexity: O(n + m) time.
Semi-numerical pattern matching
We show methods in which arithmetic and bit operations replace comparisons.
We will survey two examples of such methods:
  the random fingerprint method due to Karp and Rabin;
  the Shift-And method due to Baeza-Yates and Gonnet.
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in order to obtain:
  an efficient randomized algorithm that makes an error with small
  probability;
  a randomized algorithm that never errs, whose running time is efficient
  with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}^n).
Arithmetic replaces Comparisons
Strings are also numbers: H: strings → numbers. Let s be a string of length m:
  H(s) = ∑_{i=1}^{m} 2^{m-i} * s[i]
P = 0101 ⟹ H(P) = 2³*0 + 2²*1 + 2¹*0 + 2⁰*1 = 5
s = s' if and only if H(s) = H(s').
Definition: let Tr denote the m-length substring of T starting at position r
(i.e., Tr = T[r, r+m-1]).
Arithmetic replaces Comparisons
Exact match = scan T and compare H(Tr) and H(P): there is an occurrence of P
starting at position r of T if and only if H(P) = H(Tr).
T = 10110101, P = 0101, H(P) = 5
  H(T2) = H(0110) = 6 ≠ H(P)
  H(T5) = H(0101) = 5 = H(P)  ⟹ Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1):
  H(Tr) = 2 * H(Tr-1) - 2^m * T(r-1) + T(r+m-1)
T = 10110101: T1 = 1011, T2 = 0110
  H(T1) = H(1011) = 11
  H(T2) = 2 * 11 - 2^4 * 1 + 0 = 22 - 16 = 6 = H(0110)
Arithmetic replaces Comparisons
A simple efficient algorithm:
  compute H(P) and H(T1);
  run over T, computing H(Tr) from H(Tr-1) in constant time, and make the
  comparisons H(P) = H(Tr).
Total running time O(n+m)?  NO! Why?
The problem is that when m is large, it is unreasonable to assume that each
arithmetic operation can be done in O(1) time: values of H() are m-bit
numbers, in general too BIG to fit in a machine word.
IDEA! Let's use modular arithmetic: for some prime q, the Karp-Rabin
fingerprint of a string s is defined by Hq(s) = H(s) (mod q).
An example
P = 101111, q = 7
H(P) = 47, so Hq(P) = 47 (mod 7) = 5.
Hq(P) can be computed incrementally, bit by bit:
  1*2 (mod 7) + 0 = 2
  2*2 (mod 7) + 1 = 5
  5*2 (mod 7) + 1 = 4
  4*2 (mod 7) + 1 = 2
  2*2 (mod 7) + 1 = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr-1), since
  2^m (mod q) = 2 * (2^{m-1} (mod q)) (mod q).
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic: there is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr).
Modular arithmetic: if there is an occurrence of P starting at position r of
T, then Hq(P) = Hq(Tr).
False match! There are values of q for which the converse is not true
(i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that:
  q is small enough to keep computations efficient (i.e., the Hq() values fit
  in a machine word);
  q is large enough so that the probability of a false match is kept small.
Karp-Rabin fingerprint algorithm
Choose a positive integer I.
Pick a random prime q ≤ I, and compute P's fingerprint Hq(P).
For each position r in T, compute Hq(Tr) and test whether it equals Hq(P).
If the numbers are equal, either:
  declare a probable match (randomized algorithm), or
  check and declare a definite match (deterministic algorithm).
Running time, excluding verification: O(n+m).
The randomized algorithm is correct w.h.p.; the deterministic algorithm has
expected running time O(n+m). (Proof on the board.)
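The deterministic (verify-on-equality) variant can be sketched as follows. Two liberties with respect to the slides: base 256 replaces base 2 so that arbitrary characters work, and the small list of primes to pick q from is purely illustrative.

```python
import random

def karp_rabin_matches(P, T, q=None):
    """Karp–Rabin with verification: rolling fingerprint Hq, every
    fingerprint hit is checked, so no false matches are reported."""
    m, n = len(P), len(T)
    if m == 0 or m > n:
        return []
    if q is None:
        q = random.choice([999983, 999979, 999961])  # illustrative primes
    base = 256
    hp = ht = 0
    for i in range(m):                      # Hq(P) and Hq(T_1)
        hp = (hp * base + ord(P[i])) % q
        ht = (ht * base + ord(T[i])) % q
    pow_m = pow(base, m - 1, q)             # base^(m-1) mod q
    out = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:    # verify: definite match
            out.append(r)
        if r + m < n:                       # O(1) rolling update
            ht = ((ht - ord(T[r]) * pow_m) * base + ord(T[r + m])) % q
    return out
```

Positions are 0-based; the slide's match of P = 0101 at T5 shows up as index 4.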
Problem 1: Solution
(Same picture as before: P = bzip = 1a 0b is searched byte-wise in C(S),
S = "bzip or not bzip", answering yes/no at each codeword start.)
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m-by-n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i characters of
T ending at character j, i.e., M(i,j) = 1 iff P[1…i] = T[j-i+1…j].
Example: T = california and P = for

  j      1 2 3 4 5 6 7 8 9 10
  T      c a l i f o r n i a
  f      0 0 0 0 1 0 0 0 0 0
  o      0 0 0 0 0 1 0 0 0 0
  r      0 0 0 0 0 0 1 0 0 0

How does M solve the exact match problem?
How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the
(j-1)-th one. Machines can perform bit and arithmetic operations between two
words in constant time. Examples:
  And(A,B) is the bit-wise and between A and B.
  BitShift(A) is the value derived by shifting A's bits down by one and
  setting the first bit to 1:
    BitShift( (0,1,1,0)ᵗ ) = (1,0,1,1)ᵗ
Let w be the word size (e.g., 32 or 64 bits). We'll assume m = w.
NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the
(j-1)-th one.
We define the m-length binary vector U(x) for each character x in the
alphabet: U(x) is set to 1 at the positions in P where character x appears.
Example: P = abaac
  U(a) = (1,0,1,1,0)ᵗ   U(b) = (0,1,0,0,0)ᵗ   U(c) = (0,0,0,0,1)ᵗ
How to construct M
Initialize column 0 of M to all zeros. For j > 0, the j-th column is
obtained by
  M(j) = BitShift( M(j-1) ) & U( T[j] )
For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at
      character j-1  ⟺  M(i-1,j-1) = 1, and
  (2) P[i] = T[j]  ⟺  the i-th bit of U(T[j]) = 1.
BitShift moves bit M(i-1,j-1) into the i-th position; AND-ing with the i-th
bit of U(T[j]) establishes whether both conditions hold.
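The column update translates into a few lines of code. Here columns are Python integers (bit i-1 holds M(i,j)); since Python integers are unbounded, this sketch sidesteps the m ≤ w restriction discussed below.

```python
def shift_and(P, T):
    """Bit-parallel Shift-And matcher; returns the 0-based starting
    positions of the occurrences of P in T."""
    m = len(P)
    U = {}                               # U[c]: bitmask of c's positions in P
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    goal = 1 << (m - 1)                  # m-th bit set => full match
    M, out = 0, []
    for j, c in enumerate(T):
        # BitShift: shift and set the first bit, then AND with U(T[j])
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & goal:
            out.append(j - m + 1)        # occurrence starts here
    return out
```

On the running example, P = abaac in T = xabxabaaca matches ending at the 9th character, i.e. 0-based start 4.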
An example: j=1
T = xabxabaaca, P = abaac, U(x) = (0,0,0,0,0)ᵗ
  M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵗ & (0,0,0,0,0)ᵗ = (0,0,0,0,0)ᵗ
An example: j=2
T = xabxabaaca, P = abaac, U(a) = (1,0,1,1,0)ᵗ
  M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵗ & (1,0,1,1,0)ᵗ = (1,0,0,0,0)ᵗ
An example: j=3
T = xabxabaaca, P = abaac, U(b) = (0,1,0,0,0)ᵗ
  M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵗ & (0,1,0,0,0)ᵗ = (0,1,0,0,0)ᵗ
An example: j=9
T = xabxabaaca, P = abaac, U(c) = (0,0,0,0,1)ᵗ
  M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵗ & (0,0,0,0,1)ᵗ = (0,0,0,0,1)ᵗ
The m-th bit is set: an occurrence of P ends at position 9. The matrix so far:

  j     1 2 3 4 5 6 7 8 9
  T     x a b x a b a a c
  i=1   0 1 0 0 1 0 1 1 0
  i=2   0 0 1 0 0 1 0 0 0
  i=3   0 0 0 0 0 0 1 0 0
  i=4   0 0 0 0 0 0 0 1 0
  i=5   0 0 0 0 0 0 0 0 1
Shift-And method: Complexity
If m ≤ w, any column and any vector U() fit in a memory word: each step
requires O(1) time.
If m > w, any column and vector U() can be divided into ⌈m/w⌉ memory words:
each step requires O(m/w) time.
Overall O(n(1 + m/w) + m) time.
Thus it is very fast when the pattern length is close to the word size —
very often the case in practice, recalling that w = 64 bits in modern
architectures.
Some simple extensions
We want to allow the pattern to contain special symbols, like the [a-f]
classes of chars.
P = [a-b]baac
  U(a) = (1,0,1,1,0)ᵗ   U(b) = (1,1,0,0,0)ᵗ   U(c) = (0,0,0,0,1)ᵗ
What about '?', '[^…]' (not) ?
Problem 1: Another solution
(Same setting as before: dictionary {a, bzip, not, or, space},
P = bzip = 1a 0b, S = "bzip or not bzip" — now the scan of C(S) is done with
the Shift-And machinery.)
Speed ≈ Compression ratio
Problem 2
Given a pattern P, find all the occurrences in S of all dictionary terms
containing P as a substring.
Dictionary = {a, bzip, not, or, space}; P = o; S = "bzip or not bzip".
Both matching terms contain P:
  not = 1g 0g 0a
  or  = 1g 0a 0b
Speed ≈ Compression ratio? No! Why? One scan of C(S) is needed for each term
that contains P.
Multi-Pattern Matching Problem
Given a set of patterns P = {P1, P2, …, Pl} of total length m, we want to
find all the occurrences of those patterns in the text T[1,n].
Naïve solution: use an (optimal) exact matching algorithm, searching for
each pattern of P separately. Complexity: O(nl + m) time — not good with
many patterns.
Optimal solution due to Aho and Corasick. Complexity: O(n + l + m) time.
A simple extension of Shift-And
S is the concatenation of the patterns in P; R is a bitmap of length m with
R[i] = 1 iff S[i] is the first symbol of a pattern.
Use a variant of the Shift-And method, searching for S:
  for any symbol c, U'(c) = U(c) AND R, so that U'(c)[i] = 1 iff S[i] = c
  and S[i] is the first symbol of a pattern;
  for any step j, compute M(j), then M(j) OR U'(T[j]). Why? This sets to 1
  the first bit of each pattern that starts with T[j].
  Check whether there are occurrences ending in j. How?
Problem 3
Given a pattern P, find all the occurrences in S of all dictionary terms
containing P as a substring, allowing at most k mismatches.
Dictionary = {a, bzip, not, or, space}; S = "bzip or not bzip";
P = bot, k = 2.
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact occurrences of a pattern
in a text.
Example: T = aatatccacaa, P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with
4 mismatches starting at position 2.
  aatatccacaa        aatatccacaa
     atcgaa           atcgaa
Agrep
Our current goal: given k, find all the occurrences of P in T with up to k
mismatches.
We define the matrix M^l to be an m-by-n binary matrix such that:
M^l(i,j) = 1 iff there are no more than l mismatches between the first i
characters of P and the i characters of T up through character j.
What is M^0? How does M^k solve the k-mismatch problem?
Computing M^k
We compute M^l for all l = 0, …, k: for each j, compute M(j), M^1(j), …,
M^k(j); for all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff one of two
cases holds.
Computing M^l: case 1
The first i-1 characters of P match a substring of T ending at j-1 with at
most l mismatches, and the next pair of characters in P and T are equal:
  BitShift( M^l(j-1) ) & U( T[j] )
Computing M^l: case 2
The first i-1 characters of P match a substring of T ending at j-1 with at
most l-1 mismatches (position i,j absorbs one more mismatch):
  BitShift( M^(l-1)(j-1) )
Computing M^l
Combining the two cases:
  M^l(j) = [ BitShift( M^l(j-1) ) & U(T[j]) ]  OR  BitShift( M^(l-1)(j-1) )
Example
T = xabxabaaca, P = abaad

M^0:
  j     1 2 3 4 5 6 7 8 9 10
  i=1   0 1 0 0 1 0 1 1 0 1
  i=2   0 0 1 0 0 1 0 0 0 0
  i=3   0 0 0 0 0 0 1 0 0 0
  i=4   0 0 0 0 0 0 0 1 0 0
  i=5   0 0 0 0 0 0 0 0 0 0

M^1:
  j     1 2 3 4 5 6 7 8 9 10
  i=1   1 1 1 1 1 1 1 1 1 1
  i=2   0 0 1 0 0 1 0 1 1 0
  i=3   0 0 0 1 0 0 1 0 0 1
  i=4   0 0 0 0 1 0 0 1 0 0
  i=5   0 0 0 0 0 0 0 0 1 0

M^1(5,9) = 1: P occurs ending at position 9 with at most 1 mismatch
(T[5,9] = abaac vs. P = abaad).
How much do we pay?
The running time is O(kn(1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time; hence the space used by
the algorithm is O(k) memory words.
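The two-case recurrence above can be sketched directly, keeping one integer column per mismatch level l = 0..k (a mismatches-only variant, as in the slides — no insertions/deletions):

```python
def shift_and_mismatches(P, T, k):
    """Agrep-style Shift-And: M[l] holds the current column of M^l.
    Recurrence: M^l(j) = (BitShift(M^l(j-1)) & U(T[j]))
                         | BitShift(M^(l-1)(j-1)).
    Returns 0-based starts of occurrences with <= k mismatches."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    goal = 1 << (m - 1)
    M = [0] * (k + 1)
    out = []
    for j, c in enumerate(T):
        prev_shifted = 0                      # BitShift(M^(l-1)(j-1))
        for l in range(k + 1):
            shifted = (M[l] << 1) | 1
            M[l] = (shifted & U.get(c, 0)) | prev_shifted
            prev_shifted = shifted            # shift of the OLD column
        if M[k] & goal:
            out.append(j - m + 1)
    return out
```

On the slide's example (T = aatatccacaa, P = atcgaa, k = 2) the only hit is the 2-mismatch occurrence at position 4, i.e. 0-based start 3.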
Problem 3: Solution
(Same setting: dictionary {a, bzip, not, or, space}, S = "bzip or not bzip",
P = bot, k = 2; the agrep machinery is run over C(S), reporting e.g.
not = 1g 0g 0a.)
Agrep: more sophisticated operations
The Shift-And method can solve other ops.
The edit distance between two strings p and s is d(p,s) = the minimum number
of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol in p into a different one
Example: d(ananas, banane) = 3
Search by regular expressions, e.g. (a|b)?(abc|a)
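For reference, the edit distance itself is the classic dynamic program below (a standard O(|p|·|s|)-time, O(|s|)-space sketch — not the slides' bit-parallel variant):

```python
def edit_distance(p, s):
    """d(p,s): minimum number of insertions, deletions and
    substitutions transforming p into s (rolling one-row DP)."""
    m, n = len(p), len(s)
    D = list(range(n + 1))        # row 0: turn "" into s[:j] by j insertions
    for i in range(1, m + 1):
        diag, D[0] = D[0], i      # D[0]: delete all of p[:i]
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == s[j - 1] else 1
            diag, D[j] = D[j], min(D[j] + 1,       # delete p[i-1]
                                   D[j - 1] + 1,   # insert s[j-1]
                                   diag + cost)    # substitute / match
    return D[n]
```

It confirms the slide's example: d(ananas, banane) = 3 (insert b, delete s, substitute a→e).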
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword lengths, and thus to
build the tree… This may be extremely time/space-costly when you deal with
GBs of textual data.
A simple algorithm: sort the pi in decreasing order, and encode si via a
variable-length code for the integer i (its rank).
γ-code for integer encoding
γ(x), for x > 0: write Length-1 zeroes, followed by x in binary, where
Length = ⌊log2 x⌋ + 1.
e.g., 9 is represented as <000, 1001>.
The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal).
It is optimal for Pr(x) = 1/(2x²) and i.i.d. integers, and it is a
prefix-free encoding…
Exercise: given the following sequence of γ-coded integers, reconstruct the
original sequence:
  0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7.
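Both directions of the γ-code fit in a few lines; the decoder exploits the prefix-free property to consume a whole concatenated bit-string (function names are illustrative).

```python
def gamma_encode(x):
    """Gamma code of x > 0: (Length-1) zeroes, then x in binary."""
    assert x > 0
    b = bin(x)[2:]                        # binary of x, leading bit is 1
    return "0" * (len(b) - 1) + b

def gamma_decode_stream(bits):
    """Decode a concatenation of gamma codes (prefix-free)."""
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":             # count the unary length prefix
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out
```

Round-tripping the slide's exercise reproduces its bit-string exactly.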
Analysis
Sort the pi in decreasing order, and encode si via the variable-length code
γ(i). Recall that |γ(i)| ≤ 2 log i + 1.
How good is this approach wrt Huffman? Compression ratio ≤ 2 H0(s) + 1.
Key fact:
  1 ≥ ∑_{i=1..x} pi ≥ x * px  ⟹  x ≤ 1/px
How good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2 log i + 1.
The cost of the encoding is (recall i ≤ 1/pi):
  ∑_{i=1..|S|} pi * |γ(i)| ≤ ∑_{i=1..|S|} pi * [2 log(1/pi) + 1]
                           = 2 H0(X) + 1
Not much worse than Huffman, and improvable to H0(X) + 2 + …
A better encoding
Byte-aligned and tagged Huffman:
  128-ary Huffman tree; the first bit of the first byte is tagged;
  configurations on 7 bits: just those of Huffman.
End-tagged dense code (ETDC):
  the rank r is mapped to the r-th binary sequence on 7*k bits;
  the first bit of the last byte is tagged.
Surprising changes: it is a prefix code, and it gets better compression
because it uses all the 7-bit configurations.
(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2.
A new concept: continuers vs stoppers. Previously we used s = c = 128; the
main idea is to tune s + c = 256 (we are playing with 8 bits):
  s items are encoded with 1 byte, s*c with 2 bytes, s*c² with 3 bytes, …
An example: 5000 distinct words.
  ETDC encodes 128 + 128² = 16512 words on up to 2 bytes.
  A (230,26)-dense code encodes 230 + 230*26 = 6210 words on up to 2 bytes —
  hence more words on 1 byte, which wins if the distribution is skewed...
Optimal (s,c)-dense codes
Find the optimal s (with c = 256 - s, so that s + c = 256):
  brute-force approach, or
  binary search (on real distributions there seems to be one unique minimum).
Notation: Ks = max codeword length; Fsk = cumulative probability of the
symbols whose |codeword| ≤ k.
Experiments make (s,c)-DC quite interesting: search is 6% faster than
byte-aligned Huffword.
Streaming compression
Still, you need to determine and sort all terms… Can we do everything in one
pass?
Move-to-Front (MTF): as a freq-sorting approximator, as a caching strategy,
as a compressor.
Run-Length Encoding (RLE): FAX compression.
Move-to-Front Coding
Transforms a char sequence into an integer sequence, which can then be
var-length coded.
  Start with the list of symbols L = [a,b,c,d,…]
  For each input symbol s:
    1) output the position of s in L;
    2) move s to the front of L.
There is a memory: MTF exploits temporal locality, and it is dynamic.
  X = 1^n 2^n 3^n … n^n ⟹ Huff = O(n² log n) bits, MTF = O(n log n) + n² bits
Not much worse than Huffman... but it may be far better.
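The two-step loop above is a one-liner per symbol (positions are emitted 1-based here; the starting alphabet is passed in explicitly):

```python
def mtf_encode(text, alphabet):
    """Move-to-Front: output the current 1-based position of each
    symbol in list L, then move that symbol to the front of L."""
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)
        out.append(i + 1)
        L.insert(0, L.pop(i))   # move s to the front (the 'memory')
    return out
```

Repeated symbols compress to runs of 1s — exactly the temporal locality the slide mentions.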
MTF: how good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2 log i + 1.
Put S at the front of the list and consider the cost of encoding. For a
symbol x occurring n_x times at positions p_1^x < p_2^x < …, the cost is at
most
  O(|S| log |S|) + ∑_{x∈S} ∑_{i=2..n_x} |γ( p_i^x - p_{i-1}^x )|
By Jensen's inequality:
  ≤ O(|S| log |S|) + ∑_{x∈S} n_x * [ 2 log(N/n_x) + 1 ]
  = O(|S| log |S|) + N * [ 2 H0(X) + 1 ]
Hence La[mtf] ≤ 2 H0(X) + O(1).
MTF: higher compression
To achieve higher compression we consider words (and separators) as the
symbols to be encoded.
How to maintain the MTF-list efficiently:
  a search tree whose leaves contain the symbols, ordered as in the
  MTF-list, and whose nodes contain the size of their descending subtree;
  a hash table whose key is a symbol and whose data is a pointer to the
  corresponding tree leaf.
Each tree operation takes O(log |S|) time; the total is O(n log |S|), where
n = #symbols to be compressed.
Run-Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca ⟹ (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one initial bit suffice.
There is a memory: RLE exploits spatial locality, and it is a dynamic code.
  X = 1^n 2^n 3^n … n^n ⟹ Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
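The run-collapsing step is equally short (a minimal sketch returning (symbol, run-length) pairs):

```python
def rle_encode(s):
    """Run-Length Encoding: collapse maximal runs into (symbol, count)."""
    out = []
    for c in s:
        if out and out[-1][0] == c:
            out[-1] = (c, out[-1][1] + 1)   # extend the current run
        else:
            out.append((c, 1))              # start a new run
    return out
```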
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using "fractional" parts of bits!!
Used in PPM, JPEG/MPEG (as an option), bzip.
More time-costly than Huffman, but the integer implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol i to an interval range from 0 (inclusive) to 1
(exclusive), of width p(i) and starting at
  f(i) = ∑_{j=1}^{i-1} p(j)
e.g. with p(a) = .2, p(b) = .5, p(c) = .3:
  f(a) = .0, f(b) = .2, f(c) = .7
  a = [0, .2),   b = [.2, .7),   c = [.7, 1.0)
The interval for a particular symbol will be called the symbol interval
(e.g. for b it is [.2,.7)).
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
  start with [0, 1);
  b ⟹ [.2, .7);
  a ⟹ [.2, .3)   (the a-subinterval of [.2,.7));
  c ⟹ [.27, .3).
The final sequence interval is [.27,.3)
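The interval-narrowing above is just the recurrence l_i = l_{i-1} + s_{i-1}·f[c_i], s_i = s_{i-1}·p[c_i], which can be sketched as follows (symbol order in the dict fixes the cumulative f; encoding to bits is left out):

```python
def sequence_interval(msg, probs):
    """Arithmetic-coding sequence interval [l, l+s) of a message.
    probs: symbol -> probability; iteration order defines f."""
    f, acc = {}, 0.0
    for c in probs:                 # cumulative prob below each symbol
        f[c] = acc
        acc += probs[c]
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * probs[c]
    return l, l + s
```

With p(a)=.2, p(b)=.5, p(c)=.3, the message "bac" yields [.27, .3), matching the example.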
Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities p[c], use:
  l0 = 0,  s0 = 1
  li = l_{i-1} + s_{i-1} * f[ci]
  si = s_{i-1} * p[ci]
where f[c] is the cumulative probability up to symbol c (not included).
The final interval size is
  sn = ∏_{i=1}^{n} p[ci]
The interval for a message sequence will be called the sequence interval.
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
  .49 ∈ [.2, .7)   ⟹ b;  subdivide: a=[.2,.3), b=[.3,.55), c=[.55,.7)
  .49 ∈ [.3, .55)  ⟹ b;  subdivide: a=[.3,.35), b=[.35,.475), c=[.475,.55)
  .49 ∈ [.475,.55) ⟹ c
The message is bbc.
Representing a real number
Binary fractional representation:
. 75
. 11
1/ 3
. 01 01
11 / 16
. 1011
Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1
So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01
[.33,.66) = .1 [.66,1) = .11
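The doubling algorithm above can be sketched as follows (a plain illustration that emits the first k bits of some x in [0,1)):

```python
# Emit the first k bits of the binary fractional representation of x in [0,1).
def binary_fraction(x, k):
    bits = ""
    for _ in range(k):
        x *= 2                    # step 1: x = 2*x
        if x < 1:                 # step 2: if x < 1 output 0
            bits += "0"
        else:                     # step 3: else x = x-1; output 1
            x -= 1
            bits += "1"
    return bits

print(binary_fraction(0.75, 2))   # 11      (.75 = .11)
print(binary_fraction(11/16, 4))  # 1011    (11/16 = .1011)
print(binary_fraction(1/3, 6))    # 010101  (1/3 is periodic)
```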
Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
code    min (…000)    max (…111)    interval
.11     .1100…        .1111…        [.75, 1.0)
.101    .1010…        .1011…        [.625, .75)
We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).
Sequence Interval
.79
.75
Code Interval (.101)
.61
.625
Can use L + s/2 truncated to 1 + log (1/s) bits
Bound on Arithmetic length
Note that 1 – log s = log (2/s)
Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 + log ∏_{i=1..n} (1/p_i)
≤ 2 + ∑_{i=1..n} log (1/p_i)
= 2 + ∑_{k=1..|S|} n·p_k · log (1/p_k)
= 2 + n·H_0 bits
nH0 + 0.02 n bits in practice
because of rounding
Integer Arithmetic Coding
The problem is that operations on arbitrary-
precision real numbers are expensive.
Key Ideas of integer version:
Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever the sequence interval falls into the top,
bottom or middle half, expand the interval
by a factor of 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s; m = 0
Message interval is expanded by 2
If u < R/2 then (bottom half)
Output 0 followed by m 1s; m = 0
Message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2
All other cases: just continue...
You find this at
Arithmetic ToolBox
As a state machine
[Figure: the ATB as a state machine: from state (L,s), reading symbol c
under distribution (p1,....,pS), it moves to state (L',s')]
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
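A minimal sketch of the context statistics PPM keeps (counts only; the escape-probability heuristics and the arithmetic coder are omitted). It is run on the string ACCBACCACBA used in the example that follows, keeping statistics for every context size up to k, as the slide prescribes.

```python
from collections import defaultdict

# Order-<=k context counts: counts[ctx][c] = how often char c followed ctx.
def build_contexts(text, k):
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text)):
        for order in range(k + 1):        # keep stats for each context size <= k
            if i - order >= 0:
                ctx = text[i - order:i]
                counts[ctx][text[i]] += 1
    return counts

counts = build_contexts("ACCBACCACBA", 2)
print(dict(counts["C"]))                  # {'C': 2, 'B': 2, 'A': 1}
```

For instance p(C|AC) = 2/3 before escaping, matching the counts of the example table.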
PPM + Arithmetic ToolBox
[Figure: the ATB driven by PPM: the next symbol s = c or esc is coded with
probability p[ s|context ], moving the interval from (L,s) to (L',s')]
Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)
PPM: Example Contexts
String = ACCBACCACBA B,  k = 2

Context    Counts
Empty      A = 4,  B = 2,  C = 5,  $ = 3
A          C = 3,  $ = 1
B          A = 2,  $ = 1
C          A = 1,  B = 2,  C = 2,  $ = 3
AC         B = 1,  C = 2,  $ = 2
BA         C = 1,  $ = 1
CA         C = 1,  $ = 1
CB         A = 2,  $ = 1
CC         A = 1,  B = 1,  $ = 2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
[Figure: the Dictionary = all substrings starting before the Cursor;
here the next output would be ⟨2,3,c⟩]
Algorithm's step:
Output ⟨d, len, c⟩ where
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match
Advance by len + 1
A buffer “window” has fixed length and moves
Example: LZ77 with window
T = a a c a a c a b c a b a a a c    (window size = 6)
Output steps (longest match within W, then the next character):
(0,0,a)  (1,1,c)  (3,4,b)  (3,3,a)  (1,2,c)
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if l > d? (overlap with text to be compressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use a shorter match so
that the next match is better
Hash table to speed up searches on triples
Triples are coded with Huffman codes
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
T = a a b a a c a b c a b c b
Output    Dict.
(0,a)     1 = a
(1,b)     2 = ab
(1,a)     3 = aa
(0,c)     4 = c
(2,c)     5 = abc
(5,b)     6 = abcb
LZ78: Decoding Example
Input    Decoded so far               Dict.
(0,a)    a                            1 = a
(1,b)    a a b                        2 = ab
(1,a)    a a b a a                    3 = aa
(0,c)    a a b a a c                  4 = c
(2,c)    a a b a a c a b c            5 = abc
(5,b)    a a b a a c a b c a b c b    6 = abcb
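The coding loop can be sketched as follows. A plain dict stands in for the trie of the real implementation, and the flush of a possible pending match at end-of-text is an implementation detail the slides do not show.

```python
# LZ78 encoder sketch: dictionary maps substrings to ids (a trie in practice).
def lz78_encode(text):
    dictionary = {"": 0}
    output, cur = [], ""
    for c in text:
        if cur + c in dictionary:
            cur += c                               # extend the current match S
        else:
            output.append((dictionary[cur], c))    # output (id of S, next char c)
            dictionary[cur + c] = len(dictionary)  # add Sc with a fresh id
            cur = ""
    if cur:  # flush a pending match (its prefix is in the prefix-closed dict)
        output.append((dictionary[cur[:-1]], cur[-1]))
    return output

print(lz78_encode("aabaacabcabcb"))
# [(0, 'a'), (1, 'b'), (1, 'a'), (0, 'c'), (2, 'c'), (5, 'b')]
```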
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (in this example a = 112, b = 113, c = 114)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
LZW: Encoding Example
T = a a b a a c a b a b a c b
Output    Dict.
112       256 = aa
112       257 = ab
113       258 = ba
256       259 = aac
114       260 = ca
257       261 = aba
261       262 = abac
114       263 = cb
LZW: Decoding Example
Input    Decoded so far             Dict.
112      a
112      a a                        256 = aa
113      a a b                      257 = ab
256      a a b a a                  258 = ba
114      a a b a a c                259 = aac
257      a a b a a c a b            260 = ca
261      a a b a a c a b a b a      261 = aba  (code unknown: resolved one step later!)
114      a a b a a c a b a b a c    262 = abac
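A sketch of the decoder, including the "code not yet in the dictionary" special case caused by being one step behind the coder. The codes 112/113/114 for a/b/c follow the slides' toy numbering, not real ASCII.

```python
# LZW decoder sketch; handles the code-not-yet-in-dictionary special case.
def lzw_decode(codes):
    dictionary = {112: "a", 113: "b", 114: "c"}
    next_id = 256
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                                  # special case: code == next_id
            entry = prev + prev[0]             # the string is S + S[0]
        dictionary[next_id] = prev + entry[0]  # the entry the coder added one step ago
        next_id += 1
        out.append(entry)
        prev = entry
    return "".join(out)

print(lzw_decode([112, 112, 113, 256, 114, 257, 261, 114]))  # aabaacababac
```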
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
Consider the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows   (Burrows-Wheeler, 1994)

F               L
#  mississipp   i
i  #mississip   p
i  ppi#missis   s
i  ssippi#mis   s
i  ssissippi#   m
m  ississippi   #
p  i#mississi   p
p  pi#mississ   i
s  ippi#missi   s
s  issippi#mi   s
s  sippi#miss   i
s  sissippi#m   i

L is the BWT of T
A famous example
Much
longer...
A useful tool: L → F mapping
[Same sorted-rotations matrix as above, with the F and L columns shown]
How do we map L's chars onto F's chars ?
... Need to distinguish equal chars in F...
Take two equal L’s chars
Rotate rightward their rows
Same relative order !!
The BWT is invertible
F = # i i i i m p p s s s s      L = i p s s m # p i s s i i
[Same sorted-rotations matrix as above]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i ppi #
InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
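A runnable version of InvertBWT: the LF-mapping is obtained by stably sorting L (the F-position of L[r] is its rank among equal chars plus the number of smaller chars), then T is rebuilt backward exactly as in the pseudocode.

```python
# Invert the BWT: LF[r] = position in F of the char L[r]; walk backward.
def invert_bwt(L):
    order = sorted(range(len(L)), key=lambda r: (L[r], r))  # stable sort of L
    LF = [0] * len(L)
    for f_pos, r in enumerate(order):
        LF[r] = f_pos
    chars, r = [], 0
    for _ in range(len(L)):          # T[i] = L[r]; r = LF[r]
        chars.append(L[r])
        r = LF[r]
    t = "".join(reversed(chars))     # the sentinel comes out first...
    return t[1:] + t[0]              # ...rotate it back to the end

print(invert_bwt("ipssm#pissii"))    # mississippi#
```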
How to compute the BWT ?
SA    BWT matrix     L
12    #mississipp    i
11    i#mississip    p
 8    ippi#missis    s
 5    issippi#mis    s
 2    ississippi#    m
 1    mississippi    #
10    pi#mississi    p
 9    ppi#mississ    i
 7    sippi#missi    s
 4    sissippi#mi    s
 6    ssippi#miss    i
 3    ssissippi#m    i
We said that: L[i] precedes F[i] in T
L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
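The identity L[i] = T[SA[i]-1] makes the BWT a one-liner once SA is available. Naive sorting of the suffixes is enough for a toy input (and Python's T[-1] conveniently wraps to the sentinel when SA[i] = 1):

```python
# BWT via the suffix array: L[i] = T[SA[i]-1] (positions are 1-based, as in
# the slides; SA is built by naive suffix sorting, fine for a toy example).
def bwt(T):
    n = len(T)
    SA = sorted(range(1, n + 1), key=lambda i: T[i - 1:])
    return "".join(T[i - 2] for i in SA)   # T[SA[i]-1] in 1-based notation

print(bwt("mississippi#"))                 # ipssm#pissii
```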
How to construct SA from T ?
SA    suffix
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#
Elegant but inefficient
Input: T = mississippi#
Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii    (# at position 16)
Mtf-list = [i,m,p,s]
Mtf  = 020030000030030200300300000100000
Mtf' = 030040000040040300400400000200000   (Bin(6)=110, Wheeler's code)
RLE0 = 03141041403141410210
Alphabet of |S|+1 symbols
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
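The Move-to-Front step can be sketched as follows. Run on the part of L that precedes the '#' (whose position, 16, is transmitted apart), it reproduces the first digits of the Mtf string above.

```python
# Move-to-Front: output each symbol's current rank, then move it to the front.
def mtf(s, alphabet):
    lst = list(alphabet)
    out = []
    for c in s:
        r = lst.index(c)
        out.append(r)
        lst.insert(0, lst.pop(r))    # move c to the front of the list
    return out

ranks = mtf("ipppssssssmmmii", "imps")   # Mtf-list = [i,m,p,s]
print("".join(map(str, ranks)))          # 020030000030030
```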
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google, 7/2008)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of a page is about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node one can reach any other via an
undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node one can reach any other via a
directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
Largest artifact ever conceived by humans
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph
V = routers
E = communication links
The "cosine" graph (undirected, weighted)
V = static web pages
E = semantic distance between pages
Query-Log graph (bipartite, weighted)
V = queries and URLs
E = (q,u) if u is a result for q, and has been clicked by some
user who issued q
Social graph (undirected, unweighted)
V = users
E = (x,y) if x knows y (facebook, address book, email,..)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has an hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: Pr that a node has x links is ∝ 1/x^α, α ≈ 2.1
The In-degree distribution
[Plots: Altavista crawl 1999; WebBase crawl 2001]
Indegree follows a power law distribution:
Pr[ in-degree(u) = k ]  ∝  1/k^α,   α ≈ 2.1
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has an hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: Pr that a node has x links is ∝ 1/x^α, α ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
[Figure: adjacency-matrix plot, axes i and j]
21 million pages, 150 million links
URL-sorting
[Figure: URL-sorting groups pages of the same host, e.g. Berkeley, Stanford]
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s_1 − x, s_2 − s_1 − 1, ..., s_k − s_{k−1} − 1}
For negative entries:
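A sketch of the gap encoding of a successor list per the formula above. The node id and successor values are illustrative numbers (in the style of the WebGraph examples), and the signed-to-unsigned mapping needed before writing a universal code for negative first gaps is left out.

```python
# Gap-encode a sorted successor list: S(x) = {s1-x, s2-s1-1, ..., sk-s(k-1)-1}.
def encode_gaps(x, successors):
    gaps = [successors[0] - x]    # first gap is relative to x (may be negative)
    gaps += [successors[i] - successors[i - 1] - 1
             for i in range(1, len(successors))]
    return gaps

def decode_gaps(x, gaps):
    succ = [x + gaps[0]]
    for g in gaps[1:]:
        succ.append(succ[-1] + g + 1)
    return succ

g = encode_gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203])
print(g)    # [-2, 1, 0, 0, 0, 0, 3, 0, 178]
```

Runs of consecutive successors become runs of zeros, which is what makes the gapped lists highly compressible.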
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
Adjacency list with
copy lists
(similarity)
Each bit informs whether the corresponding successor of the reference x is
also a successor of y;
The reference index is chosen in [0,W] so as to give the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.
Consecutivity in extra-nodes
Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or wrt the source; e.g.:
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
[Figure: sender → network → receiver; the receiver holds the data and some
knowledge about the data]
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
What if the sender has never seen the data at the receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression   [diff, zdelta, REBL,…]
Compress file f deploying file f'
Compress a group of files
Speed-up web access by sending differences between the requested
page and the ones available in cache
File synchronization   [rsync, zsync]
Client updates old file f_old with f_new available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file f_old with f_new available on a server
Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
The LZ77-scheme provides an efficient, optimal solution
f_known is the "previously encoded text": compress f_known·f_new starting from f_new
zdelta is one of the best implementations
Emacs       size     time
uncompr     27Mb     ---
gzip        8Mb      35 secs
zdelta      1.5Mb    42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
[Figure: Client ⇄ slow link (delta-encoding) ⇄ Proxy ⇄ fast link ⇄ web;
references and requests flow one way, pages flow back]
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f ∈ F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, whose edge weights are the gzip sizes
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
[Figure: weighted graph over files 1, 2, 3, 5 plus a dummy node 0; zdelta
sizes (e.g. 20, 123, 220, 620, 2000) label the edges]
            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic
Improvement
What about
many-to-one compression?
(group of files)
Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files
Assign weights: Estimate appropriate edge weights for G'_F thus saving
zdelta executions. Nonetheless, still Θ(n²) time
            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
[Figure: the Client (f_old) sends a request; the Server (f_new) sends back
an update]
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
[Figure: the Client sends block hashes of f_old; the Server (f_new) sends
back the encoded file]
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
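The rolling hash matters because both of its sums can be updated in O(1) when the window slides by one byte, so every alignment of f_new can be checked cheaply. Below is a simplified additive checksum in that spirit; it is an illustrative sketch, not rsync's exact Adler-32-style function.

```python
# Simplified rolling checksum: a = sum of bytes, b = position-weighted sum.
M = 1 << 16

def block_hash(data):
    a = sum(data) % M
    b = sum((len(data) - i) * x for i, x in enumerate(data)) % M
    return a, b

def roll(a, b, out_byte, in_byte, blocksize):
    # slide the window: drop out_byte on the left, add in_byte on the right
    a = (a - out_byte + in_byte) % M
    b = (b - blocksize * out_byte + a) % M
    return a, b

data = bytes(range(50))
bs = 8
a, b = block_hash(data[0:bs])
for i in range(1, 10):
    a, b = roll(a, b, data[i - 1], data[i - 1 + bs], bs)
    assert (a, b) == block_hash(data[i:i + bs])   # rolling == recomputing
print("ok")
```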
Rsync: some experiments
            gcc      emacs
total       27288    27326
gzip        7563     8577
zdelta      227      1431
rsync       964      4452
Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
Server sends hashes (unlike the client in rsync), the client checks them
Server deploys the common f_ref to compress the new f_tar (rsync just compresses it).
A multi-round protocol
k blocks of n/k elems
log(n/k) levels
If the distance is k, then on each level at most k hashes find no match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
(recurring minimum for improving the estimate + 2 SBF)
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (ie. T[i,N])
[Figure: P aligned on T at position i, matching a prefix of T[i,N]]
Occurrences of P in T = All suffixes of T having P as a prefix
Example: P = si, T = mississippi → occurrences at 4, 7
SUF(T) = Sorted set of suffixes of T
Reduction
From substring search
To prefix search
The Suffix Tree
[Figure: the suffix tree of T# = mississippi# (positions 1..12); the root
branches on #, i, m, p, s; internal edges carry labels such as "ssi",
"ppi#", "si"; the leaves store the starting positions of the suffixes]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space if SUF(T) is stored explicitly; the SA keeps only suffix pointers:

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

T = mississippi#
(P = si falls in the contiguous region of suffixes prefixed by si)
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp
[Figure: binary search on SA = 12 11 8 5 2 1 10 9 7 4 6 3 over
T = mississippi# for P = si; the middle comparison says "P is larger"]
2 accesses per step
Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp
[Figure: the search range halves; now the middle comparison says
"P is smaller"]
Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char cmp
overall, O(p log₂ N) time
O(p + log N) is achievable [Manber-Myers, '90]
and even O(p + log |S|) [Cole et al, '06]
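The binary search over SA can be sketched as follows. For brevity the suffixes are materialized and compared via bisect (which hides the O(p)-per-step comparison; a real index never materializes them), and '\uffff' is assumed to compare larger than any text character. Positions are 1-based as in the slides.

```python
from bisect import bisect_left

def suffix_array(T):
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def occurrences(T, SA, P):
    suff = [T[i - 1:] for i in SA]          # conceptual only, for the sketch
    lo = bisect_left(suff, P)               # first suffix >= P
    hi = bisect_left(suff, P + "\uffff")    # just past suffixes prefixed by P
    return sorted(SA[lo:hi])

T = "mississippi#"
SA = suffix_array(T)
print(occurrences(T, SA, "si"))             # [4, 7]
```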
Locating the occurrences
occ = 2
T = mississippi#
[Figure: binary searches for si# and si$ delimit in SA the range of the
suffixes prefixed by si, namely sippi# (pos 7) and sissippi# (pos 4)]
Suffix Array search
• O(p + log₂ N + occ) time
(assuming # < c < $ for every char c ∈ S)
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree
[Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#

Lcp    SA    suffix
 -     12    #
 0     11    i#
 1      8    ippi#
 1      5    issippi#
 4      2    ississippi#
 0      1    mississippi#
 0     10    pi#
 1      9    ppi#
 0      7    sippi#
 2      4    sissippi#
 1      6    ssippi#
 3      3    ssissippi#

(e.g. Lcp = 4 between the adjacent suffixes issippi# and ississippi#)
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
Slide 2
Algoritmi per IR
Prologo
Paradigm shift...
Web 2.0 is about the many
Big DATA ⇒ Big PC ?
We have three types of algorithms:
T1(n) = n,  T2(n) = n²,  T3(n) = 2ⁿ
... and assume that 1 step = 1 time unit
How many input data n can each algorithm process
within t time units?
n1 = t,  n2 = √t,  n3 = log₂ t
What about a k-times faster processor?
...or, what is n, when the time units are k*t ?
n1 = k·t,  n2 = √k · √t,  n3 = log₂(kt) = log₂ k + log₂ t
A new scenario
Data are more available than ever before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Step cost is Ω(1) time
Not just MIN #steps…

The memory hierarchy (CPU side to network side):
registers + L1/L2 cache:  few Mbs,   some nanosecs,    few words fetched
RAM:                      few Gbs,   tens of nanosecs, some words fetched
HD:                       few Tbs,   few millisecs,    B = 32K page
net:                      many Tbs,  even secs,        packets

You should be "??-aware programmers"
I/O-conscious Algorithms
[Figure: a disk platter with track, read/write head, read/write arm,
magnetic surface]
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O [10⁵–10⁶ (Hennessy-Patterson)]
If N = (1+f)·M, then the avg cost per step is:
C · p · f/(1+f)
This is at least 10⁴ · f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algo uses all of them:
(1/B) · (p · f/(1+f) · C) ≈ 30 · f/(1+f)
Space-conscious Algorithms
[Diagram: reduce I/Os for search and access → compressed data structures]
Streaming Algorithms
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
[Same memory-hierarchy figure as before: registers/caches, RAM, HD, net]
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Brute-force running times:

n     4K    8K   16K   32K    128K   256K   512K   1M
n³    22s   3m   26m   3.5h   28h    --     --     --
n²    0     0    0     1s     26s    106s   7m     28m
An optimal solution
We assume every subsum ≠ 0
[Figure: A partitioned into stretches with sum < 0 and sum > 0; the Optimum
starts right after a negative stretch]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
sum = 0; max = -1;
For i=1,...,n do
  If (sum + A[i] ≤ 0) sum = 0;
  else { sum += A[i]; max = MAX{max, sum}; }
Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
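The scan above, made runnable (this is the Kadane-style linear scan; this version returns 0 on an all-negative array, a case the slide's assumption rules out):

```python
# One-pass max-subarray-sum scan: reset the running sum when it drops <= 0.
def max_subarray_sum(A):
    best = run = 0
    for x in A:
        run += x
        if run <= 0:
            run = 0            # the optimum never starts inside a negative run
        best = max(best, run)
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray_sum(A))     # 12  (the window 6 1 -2 4 3)
```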
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort → Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions Data get distributed arbitrarily !!!
B-tree internal nodes
B-tree leaves
(“tuple pointers")
What about
listing tuples
in order ?
Tuples
Possibly 10⁹ random I/Os = 10⁹ × 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;           // Divide
03   Merge-Sort(A,i,m);     // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)         // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10⁹ tuples → few Gbs
Typical disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log₂ n) random I/Os
[5ms] × n log₂ n ≈ 1.5 years
In practice, it is faster because of caching
(each merging level makes 2 passes (R/W) over the data)
Merge-Sort Recursion Tree
log₂ N merging levels
[Figure: the recursion tree of Merge-Sort: sorted runs are merged pairwise,
level by level]
If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help
How do we deploy the disk/mem features ?
With internal memory M: N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log₂ (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
Pass 1: Produce (N/M) sorted runs.
Pass i: merge X ≤ M/B runs → log_{M/B}(N/M) passes
[Figure: X input buffers and one output buffer, each of B items, in main
memory; runs stream in from disk, the merged run streams back to disk]
Multiway Merging
[Figure: buffers Bf1..Bfx with pointers p1..pX over runs 1..X=M/B; repeatedly
move min(Bf1[p1], Bf2[p2], …, Bfx[pX]) into Bfo; fetch a new page of run i
when pi = B, flush Bfo when full; the merged run grows in the out file]
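The buffer dance can be sketched in memory with a heap of "current elements", one per run; heapq plays the role of the min over Bf1[p1..pX]. (In-memory lists stand in for the disk-resident runs.)

```python
import heapq

# Merge X sorted runs by always extracting the minimum current element.
def multiway_merge(runs):
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)                       # goes to the output buffer Bfo
        if j + 1 < len(runs[i]):              # fetch next element of run i
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

runs = [[1, 2, 5, 10], [2, 7, 8, 13], [3, 4, 11, 12]]
print(multiway_merge(runs))   # [1, 2, 2, 3, 4, 5, 7, 8, 10, 11, 12, 13]
```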
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≈ log_{M/B}(N/M)
Optimal cost = Θ((N/B) log_{M/B}(N/M)) I/Os
In practice
M/B ≈ 1000 → #passes = log_{M/B}(N/M) ≈ 1
One multiway merge → 2 passes = few mins
(tuning depends on disk features)
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
Can compression help?
Goal: enlarge M and reduce N
#passes = O(logM/B N/M)
Cost of a pass = O(N/B)
Part of Vitter’s paper…
In order to address issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how far down we can go
Toy problem #3: Top-freq elements
Goal: Top queries over a stream of N items (|S| large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. find the mode, if it occurs > N/2 times)
A = b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables <X,C>, initialized as X = first item, C = 1
For each subsequent item s of the stream,
  if (X==s) then C++
  else { C--; if (C==0) { X=s; C=1; } }
Return X;
Proof (sketch)
Problems only if #occ(y) ≤ N/2:
if X≠y at the end, then every one of y's
occurrences has a "negative" mate.
Hence these mates are ≥ #occ(y) many, so N ≥ 2 · #occ(y),
contradicting #occ(y) > N/2.
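A runnable version of the majority-vote scan above (often credited to Boyer-Moore); it returns the right answer whenever some item occurs more than N/2 times.

```python
# Majority-vote scan: one candidate X and one counter C, O(1) space.
def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1        # adopt a new candidate
        elif X == s:
            C += 1
        else:
            C -= 1             # cancel one occurrence of X against s
    return X

print(majority_candidate(list("bacccdcbaaaccbccc")))   # c
```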
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 · 10⁹ → size = 6Gb
n = 10⁶ documents
TotT = 10⁹ term occurrences (avg term length is 6 chars)
t = 5 · 10⁵ distinct terms
What kind of data structure do we build to support
word-based searches ?
Solution 1: Term-Doc matrix
t = 500K terms × n = 1 million docs:

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

1 if the play contains the word, 0 otherwise.
Space is 500Gb !
Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 13 16
Caesar    → 1 2 3 5 8 13 21 34

We can still do better, i.e. 30–50% of the original text:
1. Typically a posting uses about 12 bytes
2. We have 10⁹ total term occurrences → at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better than the matrix, but yet it is >10 times the text !!!!
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2ⁿ but we have fewer compressed msgs:
∑_{i=1..n−1} 2^i = 2ⁿ − 2
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log₂ (1/p(s)) = −log₂ p(s)
Lower probability → higher information
Entropy is the weighted average of i(s):
H(S) = ∑_{s∈S} p(s) · log₂ (1/p(s))  bits
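H(S) computed straight from the definition; for p = (1/2, 1/4, 1/4) the entropy is exactly 1.5 bits per symbol.

```python
from math import log2

# Entropy H(S) = sum over symbols of p(s) * log2(1/p(s)).
def entropy(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.25]))   # 1.5
```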
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be
uniquely decomposed into its codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:
[Figure: binary trie with leaves a (0), b (100), c (101), d (11)]
Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = ∑_{s∈S} p(s) · L[s]
We say that a prefix code C is optimal if for
all prefix codes C', La(C) ≤ La(C')
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same symbol lengths and thus the
same optimal average length. And vice versa…
Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}
then pi < pj ⟹ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have
H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that
La(C) ≤ H(S) + 1
Shannon code
takes ⌈log₂ (1/p)⌉ bits
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
Merge the two least probable trees repeatedly:
a(.1) + b(.2) → (.3);   (.3) + c(.2) → (.5);   (.5) + d(.5) → (1)
Resulting codewords: a=000, b=001, c=01, d=1
There are 2^(n-1) "equivalent" Huffman trees, obtained by swapping the 0/1 labels of the n-1 internal nodes
What about ties (and thus, tree depth) ?
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Example (a=000, b=001, c=01, d=1):
Encoding:  abc → 000·001·01 = 00000101
Decoding:  101001 → 1·01·001 = dcb
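The greedy construction (repeatedly merge the two least probable trees) can be prototyped in a few lines of Python; `huffman_codes` is our name, and tie-breaking means the codewords may differ from the slide's while the lengths stay optimal:

```python
import heapq

def huffman_codes(probs):
    # probs: dict symbol -> probability.  Repeatedly merge the two least
    # probable trees, prefixing 0 to one side and 1 to the other.
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    codes = {s: "" for s in probs}
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, i, right = heapq.heappop(heap)
        for s in left:
            codes[s] = "0" + codes[s]
        for s in right:
            codes[s] = "1" + codes[s]
        heapq.heappush(heap, (p1 + p2, i, left + right))
    return codes

codes = huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5})
print(codes)   # one optimal assignment; lengths are a:3, b:3, c:2, d:1
```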
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store, for each level L of the tree:
firstcode[L] = the (numeric value of the) smallest codeword of length L
Symbol[L,i], for each symbol i on level L
This takes ≤ h² + |S| log |S| bits, where h is the height of the tree
Canonical Huffman: Encoding and Decoding
Example for a 5-level tree:
firstcode[1]=2, firstcode[2]=1, firstcode[3]=1, firstcode[4]=2, firstcode[5]=0
To decode T=...00010...: accumulate bits into a value v, one level l at a time, until v ≥ firstcode[l]; then emit Symbol[l, v - firstcode[l]].
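A sketch of the canonical decoding loop, assuming the firstcode[]/Symbol[] layout above; the level-5 symbol names below are hypothetical placeholders:

```python
def canonical_decode(bits, firstcode, symbol):
    # firstcode[l] = numeric value of the smallest codeword of length l;
    # symbol[l][i] = i-th symbol on level l (in codeword order).
    # Accumulate bits into v, descending one level per bit, until
    # v >= firstcode[l]; then v - firstcode[l] indexes the level's symbols.
    it = iter(bits)
    out = []
    try:
        while True:
            v, l = next(it), 1
            while v < firstcode[l]:
                v = 2 * v + next(it)
                l += 1
            out.append(symbol[l][v - firstcode[l]])
    except StopIteration:
        pass
    return out

firstcode = {1: 2, 2: 1, 3: 1, 4: 2, 5: 0}    # the slide's table
symbol = {5: ["w", "x", "y", "z"]}            # hypothetical level-5 symbols
print(canonical_decode([0, 0, 0, 1, 0], firstcode, symbol))   # ['y']
```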
Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
-log2(.999) ≈ .00144 bits
If we were to send 1000 such symbols we might hope to use 1000·.00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
1 extra bit per macro-symbol = 1/k extra bits per symbol
But a larger model has to be transmitted
Shannon took infinite sequences, and k → ∞ !!
In practice, we have:
The model takes |S|^k · (k·log |S|) + h² bits (where h might be |S|)
It is H0(S^L) ≤ L·Hk(S) + O(k·log |S|), for each k ≤ L
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
(figure: the 128-ary Huffman tree for T = "bzip or not bzip", whose symbols are the words {bzip, or, not, space}; each codeword is a sequence of bytes carrying 7 bits of Huffman code plus 1 tag bit marking the first byte, e.g. bzip = 1a 0b)
CGrep and other ideas...
P= bzip = 1a 0b
(figure: GREP must decompress and scan T = "bzip or not bzip", while CGrep searches the compressed pattern C(P) = 1a 0b directly inside C(T); the tag bits prevent false matches inside codewords)
Speed ≈ Compression ratio
You find this under my Software projects.
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary = {a, bzip, not, or, space};  S = "bzip or not bzip"
Find all the occurrences of P = bzip = 1a 0b in S, given only C(S): scan C(S) for P's codeword, exploiting the tag bits.
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
(figure: pattern P sliding over text T, e.g. P = AB occurring inside T = ...ABCABDAB...)
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with small probability.
A randomized algorithm that never errs, whose running time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}ⁿ).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m
H(s) = ∑_{i=1}^{m} 2^{m-i} · s[i]
P = 0101:  H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s=s’ if and only if H(s)=H(s’)
Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T = 10110101,  P = 0101,  H(P) = 5
Position 2:  T2 = 0110,  H(T2) = 6 ≠ H(P)
Position 5:  T5 = 0101,  H(T5) = 5 = H(P)  → Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1)
H(Tr) = 2·H(Tr-1) - 2^m·T[r-1] + T[r+m-1]
T = 10110101, m = 4:
T1 = 1011,  H(T1) = H(1011) = 11
T2 = 0110,  H(T2) = 2·11 - 2⁴·1 + 0 = 22 - 16 = 6 = H(0110)
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P=101111
q=7
H(P) = 47
Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally!
1·2 + 0 = 2 (mod 7)
2·2 + 1 = 5 (mod 7)
5·2 + 1 = 11 ≡ 4 (mod 7)
4·2 + 1 = 9 ≡ 2 (mod 7)
2·2 + 1 = 5 (mod 7)  = Hq(P)
We can still compute Hq(Tr) from Hq(Tr-1), since
2^m (mod q) = 2·(2^{m-1} (mod q)) (mod q)
Intermediate values stay small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
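The whole algorithm fits in a short Python sketch (Las Vegas variant: every fingerprint hit is verified, so the output is exact; the fixed prime stands in for the random choice of q):

```python
def karp_rabin(T, P, q=2_147_483_647):
    # Binary strings; in the real algorithm q is a random prime <= I,
    # a fixed prime is used here for simplicity.
    n, m = len(T), len(P)
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):                    # fingerprints of P and of T_1
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    pow_m = pow(2, m - 1, q)              # 2^(m-1) mod q, to drop T[r]
    occ = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:  # verify -> definite match
            occ.append(r)
        if r < n - m:                     # roll the window one step right
            ht = ((ht - int(T[r]) * pow_m) * 2 + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))   # [4] (0-based; the slide's position 5)
```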
Problem 1: Solution
Dictionary = {a, bzip, not, or, space};  S = "bzip or not bzip"
(figure: scan C(S) matching the codeword C(P) = 1a 0b byte by byte; the tag bits guarantee that matches start at codeword boundaries)
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for
Example: T = california, P = for (m = 3, n = 10)
(figure: the 3×10 matrix M; its only 1-entries are M(1,5), M(2,6), M(3,7), tracking the occurrence of "for" ending at position 7)
How does M solve the exact match problem? P occurs ending at position j iff the last row has M(m,j) = 1.
How to construct M
We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one
Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:
And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting A's bits down by one position and setting the first bit to 1,
e.g. BitShift((0,1,1,0,1)ᵗ) = (1,0,1,1,0)ᵗ
Let w be the word size (e.g., 32 or 64 bits). We'll assume m ≤ w. NOTICE: any column of M then fits in a memory word.
How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 in the positions where x appears in P.
Example: P = abaac
U(a) = 10110,  U(b) = 01000,  U(c) = 00001
How to construct M
Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by
M(j) = BitShift(M(j-1)) & U(T[j])
Indeed, for i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T ending at j-1  ⇔  M(i-1,j-1) = 1
(2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1
BitShift moves bit M(i-1,j-1) into the i-th position; ANDing with the i-th bit of U(T[j]) establishes whether both hold.
An example, j=1  (P = abaac, T = xabxabaaca)
T[1] = x, U(x) = 00000
M(1) = BitShift(M(0)) & U(T[1]) = 10000 & 00000 = 00000
An example, j=2
T[2] = a, U(a) = 10110
M(2) = BitShift(M(1)) & U(T[2]) = 10000 & 10110 = 10000
An example, j=3
T[3] = b, U(b) = 01000
M(3) = BitShift(M(2)) & U(T[3]) = 11000 & 01000 = 01000
An example, j=9
T[9] = c, U(c) = 00001, and M(8) = 10010
M(9) = BitShift(M(8)) & U(T[9]) = 11001 & 00001 = 00001
The 5-th bit of M(9) is set: an occurrence of P = abaac ends at position 9.
Shift-And method: Complexity
If m ≤ w, any column and any vector U() fit in a memory word: each step takes O(1) time.
If m > w, they span ⌈m/w⌉ memory words: each step takes O(m/w) time.
Overall: O(n·(1 + m/w) + m) time.
Thus it is very fast when the pattern length is close to the word size — very often the case in practice, since w = 64 bits on modern architectures.
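A compact Python sketch of the method (Python integers are unbounded, so the m ≤ w restriction is not enforced here):

```python
def shift_and(T, P):
    # Bit i of the mask U[c] is 1 iff P[i] == c; the column M(j) is kept
    # in a single integer.
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, last = 0, 1 << (m - 1)
    occ = []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)   # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & last:                       # bit m set: occurrence ends at j
            occ.append(j - m + 1)          # report the 0-based start
    return occ

print(shift_and("california", "for"))    # [4]
print(shift_and("xabxabaaca", "abaac"))  # [4]
```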
Some simple extensions
We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = 10110,  U(b) = 11000,  U(c) = 00001
(the class [a-b] simply sets the first bit in both U(a) and U(b))
What about '?', '[^…]' (negation)?
Problem 1: Another solution
Dictionary = {a, bzip, not, or, space};  S = "bzip or not bzip";  P = bzip = 1a 0b
(figure: the same compressed search, now cast as a byte-wise Shift-And over C(S))
Speed ≈ Compression ratio
Problem 2
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.
Dictionary = {a, bzip, not, or, space};  S = "bzip or not bzip";  P = o
Terms containing P:  or = 1g 0a 0b,  not = 1g 0g 0a
Speed ≈ Compression ratio? No! Why? Because it requires a scan of C(S) for each term that contains P.
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
(figure: patterns P1 and P2 occurring at various positions of T)
Naïve solution
Use an (optimal) Exact Matching Algorithm searching each
pattern in P
Complexity: O(nl+m) time, not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns in P; R is a bitmap of length m with R[i] = 1 iff S[i] is the first symbol of a pattern.
Use a variant of the Shift-And method searching for S:
For any symbol c, U'(c) = U(c) AND R, i.e. U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern.
At any step j, compute M(j) and then OR it with U'(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j].
Check if there are occurrences ending at j. How? Test the bits of M(j) at the last position of each pattern.
Problem 3
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.
Dictionary = {a, bzip, not, or, space};  S = "bzip or not bzip";  P = bot, k = 2
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.
Agrep
Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define M^l to be an m × n binary matrix such that M^l(i,j) = 1 iff there are at most l mismatches between the first i characters of P and the i characters of T ending at position j.
What is M^0? Exactly the matrix M of the exact-match Shift-And.
How does M^k solve the k-mismatch problem?
Computing Mk
We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
Computing M^l: case 1
The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters of P and T are equal:
BitShift(M^l(j-1)) & U(T[j])
Computing M^l: case 2
The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (one mismatch is spent on position j):
BitShift(M^{l-1}(j-1))
Computing M^l
We compute M^l for all l = 0, …, k; for each j we compute M^0(j), M^1(j), …, M^k(j), initializing every M^l(0) to the zero vector.
Combining the two cases:
M^l(j) = [BitShift(M^l(j-1)) & U(T[j])]  OR  BitShift(M^{l-1}(j-1))
Example: P = abaad, T = xabxabaaca
        j: 1 2 3 4 5 6 7 8 9 10
M⁰ row 1:  0 1 0 0 1 0 1 1 0 1
   row 2:  0 0 1 0 0 1 0 0 0 0
   row 3:  0 0 0 0 0 0 1 0 0 0
   row 4:  0 0 0 0 0 0 0 1 0 0
   row 5:  0 0 0 0 0 0 0 0 0 0
M¹ row 1:  1 1 1 1 1 1 1 1 1 1
   row 2:  0 0 1 0 0 1 0 1 1 0
   row 3:  0 0 0 1 0 0 1 0 0 1
   row 4:  0 0 0 0 1 0 0 1 0 0
   row 5:  0 0 0 0 0 0 0 0 1 0
M¹(5,9) = 1: P occurs at positions 5..9 of T with one mismatch (abaac vs abaad).
How much do we pay?
The running time is O(kn(1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.
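The recurrence above can be sketched as follows (function name is ours):

```python
def shift_and_k_mismatch(T, P, k):
    # M[l] is the current column of matrix M^l, kept as an integer:
    # M^l(j) = (BitShift(M^l(j-1)) & U(T[j]))  OR  BitShift(M^(l-1)(j-1))
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)
    last = 1 << (m - 1)
    occ = []
    for j, c in enumerate(T):
        old = M[:]                                    # the columns M^l(j-1)
        M[0] = ((old[0] << 1) | 1) & U.get(c, 0)      # exact-match matrix
        for l in range(1, k + 1):
            M[l] = ((((old[l] << 1) | 1) & U.get(c, 0))
                    | ((old[l - 1] << 1) | 1))        # spend one mismatch
        if M[k] & last:
            occ.append(j - m + 1)                     # 0-based start
    return occ

print(shift_and_k_mismatch("xabxabaaca", "abaad", 1))   # [4]
print(shift_and_k_mismatch("aatatccacaa", "atcgaa", 2)) # [3]
```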
Problem 3: Solution
Dictionary = {a, bzip, not, or, space};  S = "bzip or not bzip";  P = bot, k = 2
(figure: the k-mismatch search reports the matching terms, e.g. not = 1g 0g 0a)
Agrep: more sophisticated operations
The Shift-And method can solve other ops
The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: replace a symbol of p with a different one
Example: d(ananas,banane) = 3
Search by regular expressions
Example: (a|b)?(abc|a)
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
γ-code for integer encoding
x > 0 is encoded as ⌊log2 x⌋ zeros followed by the binary representation of x (which takes ⌊log2 x⌋ + 1 bits)
e.g., 9 is represented as <000, 1001>
The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of g-coded
integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
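A minimal γ-coder in Python reproduces both the example and the exercise:

```python
def gamma_encode(x):
    # x > 0: floor(log2 x) zeros, then x in binary
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":      # count the leading zeros...
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))  # ...then read z+1 bits
        i += z + 1
    return out

print(gamma_encode(9))   # 0001001, i.e. <000, 1001>
print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]
```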
Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2·log2 i + 1
Key fact: 1 ≥ ∑_{j=1,...,i} pj ≥ i·pi  ⇒  i ≤ 1/pi
How good is this approach wrt Huffman? The cost of the encoding is
∑_{i=1,...,|S|} pi·|γ(i)| ≤ ∑_{i=1,...,|S|} pi·[2·log2(1/pi) + 1] = 2·H0(X) + 1
Not much worse than Huffman, and improvable to H0(X) + 2 + ...
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman
End-tagged dense code
The rank r is mapped to r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix-code
Better compression: it uses all 7-bits configurations
(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is:
s + c = 256 (we are playing with 8 bits); previously we used s = c = 128
Thus s items are encoded with 1 byte, s·c items with 2 bytes, s·c² with 3 bytes, ...
An example: 5000 distinct words
ETDC (s = c = 128) encodes 128 + 128² = 16512 words within 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words within 2 bytes — fewer within 2 bytes, but more (230 vs 128) within 1 byte, which pays off if the distribution is skewed...
Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s:
Brute-force approach, or binary search (on real distributions there seems to be a unique minimum)
(K_s = max codeword length; F_s^k = cumulative probability of the symbols whose codeword length is ≤ k)
Experiments: (s,c)-DC is quite interesting: search is 6% faster than byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
Exploit temporal locality, and it is dynamic
X = 1ⁿ2ⁿ3ⁿ…nⁿ  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits
Not much worse than Huffman... but it may be far better
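The two MTF steps (emit the position, move to front) in a few lines of Python, using the naive linear-time list update:

```python
def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)              # 1) output the position of s in L
        out.append(i)
        L.pop(i)
        L.insert(0, s)              # 2) move s to the front of L
    return out

def mtf_decode(ranks, alphabet):
    L = list(alphabet)
    out = []
    for i in ranks:
        s = L.pop(i)
        L.insert(0, s)
        out.append(s)
    return "".join(out)

enc = mtf_encode("aaabbbccc", "abc")
print(enc)   # temporal locality -> runs of zeros: [0, 0, 0, 1, 0, 0, 2, 0, 0]
```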
MTF: how good is it ?
Encode the emitted positions via γ-coding: |γ(i)| ≤ 2·log2 i + 1
Bringing every symbol to the front once costs O(|S| log |S|); if p_1^x < p_2^x < … are the positions of the n_x occurrences of symbol x, the total cost is at most
O(|S| log |S|) + ∑_x ∑_i |γ(p_i^x - p_{i-1}^x)|
By Jensen's inequality (the gaps of each x sum to at most N):
≤ O(|S| log |S|) + ∑_x n_x·[2·log2(N/n_x) + 1] = O(|S| log |S|) + N·[2·H0(X) + 1]
Hence La[mtf] ≤ 2·H0(X) + O(1)
MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep the MTF-list efficiently:
Search tree: leaves contain the symbols, ordered as in the MTF-list; internal nodes store the size of their subtree
Hash table: the key is a symbol, the data is a pointer to the corresponding tree leaf
Each tree operation takes O(log |S|) time; the total is O(n log |S|), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one initial bit suffice
Properties: exploits spatial locality, and it is a dynamic code (there is a memory)
X = 1ⁿ2ⁿ3ⁿ…nⁿ  ⇒  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
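A minimal RLE sketch matching the example above:

```python
def rle_encode(s):
    # Collapse each maximal run of equal characters into (char, run length)
    out = []
    for c in s:
        if out and out[-1][0] == c:
            out[-1] = (c, out[-1][1] + 1)
        else:
            out.append((c, 1))
    return out

print(rle_encode("abbbaacccca"))
# [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]
```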
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol an interval within [0,1) (inclusive-exclusive): symbol c gets [f(c), f(c)+p(c)), where f(c) = ∑_{j<c} p(j) is the cumulative probability of the preceding symbols.
e.g. with p(a) = .2, p(b) = .5, p(c) = .3:  f(a) = .0, f(b) = .2, f(c) = .7
a = [0,.2),  b = [.2,.7),  c = [.7,1)
The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0,1) and restrict to the current symbol's interval at each step:
b → [.2,.7),   a → [.2,.3),   c → [.27,.3)
The final sequence interval is [.27,.3)
Arithmetic Coding
To code a sequence of symbols c1…cn with probabilities p[c], use:
l0 = 0,  s0 = 1
li = l(i-1) + s(i-1)·f[ci]
si = s(i-1)·p[ci]
where f[c] is the cumulative probability up to symbol c (not included).
The final interval size is sn = ∏_{i=1,...,n} p[ci]
The interval for a message sequence will be called the sequence interval
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
.49 ∈ [.2,.7)  →  b;   rescale: (.49 - .2)/.5 = .58
.58 ∈ [.2,.7)  →  b;   rescale: (.58 - .2)/.5 = .76
.76 ∈ [.7,1)   →  c
The message is bbc.
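The recurrences and the decoding rule can be prototyped with floating-point arithmetic (fine for short messages; real coders use the integer version discussed later):

```python
def arith_interval(msg, p, f):
    # l_i = l_(i-1) + s_(i-1)*f[c_i];  s_i = s_(i-1)*p[c_i]
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, s

def arith_decode(x, n, p, f):
    # At each step pick the symbol interval containing x, then rescale x
    syms = sorted(f, key=f.get)
    out = []
    for _ in range(n):
        for c in reversed(syms):     # the largest f[c] <= x wins
            if x >= f[c]:
                out.append(c)
                x = (x - f[c]) / p[c]
                break
    return "".join(out)

p = {"a": .2, "b": .5, "c": .3}
f = {"a": .0, "b": .2, "c": .7}
l, s = arith_interval("bac", p, f)
print(l, l + s)                     # the sequence interval [.27, .3)
print(arith_decode(.49, 3, p, f))   # bbc
```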
Representing a real number
Binary fractional representation:
.75 = .11,   1/3 = .0101…,   11/16 = .1011
Algorithm: repeat { x = 2·x; if x < 1 output 0, else x = x - 1 and output 1 }
So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) → .01,   [.33,.66) → .1,   [.66,1) → .11
Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:
.11  →  min .110…, max .111…  →  interval [.75, 1.0)
.101 →  min .1010…, max .1011… →  interval [.625, .75)
We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).
e.g. sequence interval [.61,.79); the code interval of .101 is [.625,.75), which is contained in it
Can use L + s/2 truncated to 1 + ⌈log2(1/s)⌉ bits
Bound on Arithmetic length
Note that 1 - log2 s = log2(2/s)
Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + log2(1/s) = 1 + log2 ∏_{i=1,...,n} (1/pi)
≤ 2 + ∑_{i=1,...,n} log2(1/pi)
= 2 + ∑_{k=1,...,|S|} n·pk·log2(1/pk)
= 2 + n·H0 bits
In practice ≈ n·H0 + 0.02·n bits, because of rounding
Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:
Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 (top half): output 1 followed by m 0s, set m = 0, expand the interval by 2
If u < R/2 (bottom half): output 0 followed by m 1s, set m = 0, expand the interval by 2
If l ≥ R/4 and u < 3R/4 (middle half): increment m, expand the interval by 2
In all other cases, just continue...
You find this at
Arithmetic ToolBox
As a state machine
(figure: given the current interval (L,s), the distribution (p1,...,p|S|) and the next symbol c, ATB outputs the new interval (L',s'))
Therefore, even the distribution can change over time
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
(figure: the ATB driven by PPM: at each step the symbol s = c or esc is coded with probability p[s | context])
Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant)
PPM: Example Contexts   (String = ACCBACCACBA·B, k = 2; $ = escape)
Empty context:  A = 4,  B = 2,  C = 5,  $ = 3
Order-1 contexts:
  A:  C = 3, $ = 1
  B:  A = 2, $ = 1
  C:  A = 1, B = 2, C = 2, $ = 3
Order-2 contexts:
  AC: B = 1, C = 2, $ = 2
  BA: C = 1, $ = 1
  CA: C = 1, $ = 1
  CB: A = 2, $ = 1
  CC: A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
(figure: the cursor splits T into the dictionary — all substrings starting before the cursor — and the part still to be parsed; the triple emitted here is <2,3,c>)
Algorithm’s step:
Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match
Advance by len + 1
A buffer “window” has fixed length and moves
Example: LZ77 with window (size 6)
T = a a c a a c a b c a b a a a c
Output: (0,0,a) (1,1,c) (3,4,b) (3,3,a) (1,2,c)
Each step emits (distance of the longest match within W, match length, next character).
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if len > d? (the copy overlaps the text being written)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy one character at a time starting at the cursor:
for (i = 0; i < len; i++) out[cursor+i] = out[cursor-d+i];
Output is correct: abcdcdcdcdcdce
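The decoder, including the overlapping-copy case, in a few lines of Python:

```python
def lz77_decode(triples):
    # Each triple is (distance d, length, next char c).  Copying one
    # character at a time makes the overlapping case (length > d) work.
    out = []
    for d, length, c in triples:
        for _ in range(length):
            out.append(out[-d])
        out.append(c)
    return "".join(out)

# The slide's overlap case: seen "abcd", next codeword (2,9,e)
print(lz77_decode([(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (0, 0, 'd'),
                   (2, 9, 'e')]))
# abcdcdcdcdcdce
```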
LZ77 Optimizations used by gzip
LZSS: output one of the two formats (0, position, length) or (1, char); typically the second is used when length < 3.
Special greedy: possibly use a shorter match so that the next match is better.
Hash table to speed up the search for matching triplets.
The emitted tokens are finally coded with Huffman codes.
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
Output
Dict.
T = a a b a a c a b c a b c b
Output  Dict.
(0,a)   1 = a
(1,b)   2 = ab
(1,a)   3 = aa
(0,c)   4 = c
(2,c)   5 = abc
(5,b)   6 = abcb
LZ78: Decoding Example
Dict.
Input  Output so far            Dict.
(0,a)  a                        1 = a
(1,b)  a ab                     2 = ab
(1,a)  a ab aa                  3 = aa
(0,c)  a ab aa c                4 = c
(2,c)  a ab aa c abc            5 = abc
(5,b)  a ab aa c abc abcb       6 = abcb
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
LZW: Encoding Example
T = a a b a a c a b a b a c b
Output  Dict.
112     256 = aa
112     257 = ab
113     258 = ba
256     259 = aac
114     260 = ca
257     261 = aba
261     262 = abac
114     263 = cb
113     (the trailing b)
LZW: Decoding Example
Input  Output so far          Dict.
112    a
112    a a                    256 = aa
113    a a b                  257 = ab
256    a a b aa               258 = ba
114    a a b aa c             259 = aac
257    a a b aa c ab          260 = ca
261    a a b aa c ab ?        — 261 is not in the dictionary yet!
       One step later the decoder knows it: the entry must be prev + prev[0] = aba
261    a a b aa c ab aba      261 = aba
114    a a b aa c ab aba c    262 = abac
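Both directions, including the "one step behind" special case, in a sketch that mimics the slide's dictionary numbering (a = 112):

```python
def lzw_encode(s, first="abc", base=112):
    # Dictionary initialised as on the slide (a=112, b=113, c=114);
    # new entries start at 256.  No extra char is emitted: S+c is only added.
    d = {c: base + i for i, c in enumerate(first)}
    nxt, out, S = 256, [], ""
    for c in s:
        if S + c in d:
            S += c
        else:
            out.append(d[S])
            d[S + c] = nxt
            nxt += 1
            S = c
    if S:
        out.append(d[S])
    return out

def lzw_decode(codes, first="abc", base=112):
    d = {base + i: c for i, c in enumerate(first)}
    nxt = 256
    prev = d[codes[0]]
    out = [prev]
    for k in codes[1:]:
        # Special case: k may be the entry about to be defined, which
        # must then equal prev + prev[0]
        cur = d[k] if k in d else prev + prev[0]
        out.append(cur)
        d[nxt] = prev + cur[0]
        nxt += 1
        prev = cur
    return "".join(out)

enc = lzw_encode("aabaacababacb")
print(enc)   # [112, 112, 113, 256, 114, 257, 261, 114, 113]
```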
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
Take the text T = mississippi# and build the matrix of all its cyclic rotations:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows (Burrows-Wheeler, 1994):
F                 L
#  mississipp     i
i  #mississip     p
i  ppi#missis     s
i  ssippi#mis     s
i  ssissippi#     m
m  ississippi     #   ← this row is T
p  i#mississi     p
p  pi#mississ     i
s  ippi#missi     s
s  issippi#mi     s
s  sippi#miss     i
s  sissippi#m     i
(a famous example: on much longer texts the clustering effect in L is far stronger)
A useful tool: the L → F mapping
How do we map L's chars onto F's chars? We need to distinguish equal chars in F...
Take two equal chars of L and rotate their rows rightward by one position: both rows now start with that char, and they keep the same relative order after sorting. Hence the k-th occurrence of a char in L corresponds to the k-th occurrence of the same char in F.
The BWT is invertible
(the sorted rotation matrix again, with its F and L columns)
Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[i] precedes F[i] in T
Reconstruct T backward:  T = .... i p p i #
InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
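A direct (quadratic-space, for illustration only) implementation of the transform and of the LF-based inversion:

```python
def bwt(T):
    # T must end with a unique sentinel that sorts smallest ('#' here)
    rots = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(r[-1] for r in rots)

def ibwt(L):
    n = len(L)
    # LF mapping: the j-th smallest char of L (ties kept in order, since
    # Python's sort is stable) sits in row j of the sorted matrix.
    order = sorted(range(n), key=lambda r: L[r])
    LF = [0] * n
    for j, r in enumerate(order):
        LF[r] = j
    out, r = [], 0                 # row 0 is the one starting with '#'
    for _ in range(n):
        out.append(L[r])
        r = LF[r]
    s = "".join(reversed(out))     # T rotated so that '#' comes first
    return s[1:] + s[0]            # rotate the sentinel back to the end

b = bwt("mississippi#")
print(b)                  # ipssm#pissii  (the L column of the slide)
```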
How to compute the BWT ?
SA    BWT matrix     L
12    #mississipp    i
11    i#mississip    p
 8    ippi#missis    s
 5    issippi#mis    s
 2    ississippi#    m
 1    mississippi    #
10    pi#mississi    p
 9    ppi#mississ    i
 7    sippi#missi    s
 4    sissippi#mi    s
 6    ssippi#miss    i
 3    ssissippi#m    i
We said that L[i] precedes F[i] in T. Given SA and T, we have L[i] = T[SA[i]-1]
(e.g. L[3] = T[SA[3]-1] = T[7])
How to construct SA from T?
Input: T = mississippi#. SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3 is the lexicographic ordering of the suffixes:
#, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#
Elegant but inefficient. Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii
# is at position 16;  MTF list = [i,m,p,s]
Mtf = 020030000030030200300300000100000
For RLE0 every symbol is shifted by one (0s become run bits): 030040000040040300400400000200000
Run lengths of 0s are written in binary with Wheeler's code, e.g. Bin(6) = 110
RLE0 = 03141041403141410210
The alphabet grows to |S|+1 symbols.
Bzip2-output = Arithmetic/Huffman on the |S|+1 symbols...
... plus γ(16) (the position of #), plus the original MTF list (i,m,p,s)
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google, 7/2008)
5-40 KB per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages and 25% new links every week
Average page lifetime is about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node can go to any node via an
undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node can go to any node via a
directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
Largest artifact ever conceived by the human
Exploit its structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph (undirected): V = routers, E = communication links
The "cosine" graph (undirected, weighted): V = static web pages, E = semantic distance between pages
Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, ...)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has an hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties; the first:
Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
The in-degree distribution (Altavista crawl 1999, WebBase crawl 2001) indeed follows a power law:
Pr[in-degree(u) = k] ∝ 1/k^α,  with α ≈ 2.1
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: the probability that a node has x links is ∝ 1/x^α, α ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic (URL) order tend to share
many outgoing links
A Picture of the Web Graph
j
i
21 million pages, 150 million links
URL-sorting
Berkeley
Stanford
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Gap-encoded successor list S(x) = {s1−x, s2−s1−1, ..., sk−s(k−1)−1}
For negative entries (only the first gap s1−x can be negative): map v ≥ 0 to 2v and v < 0 to 2|v|−1
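The gap transform above can be sketched in a few lines (a minimal Python sketch; function names are ours, not WebGraph's API — the example successor list is the classic one for node 15):

```python
def gaps_encode(x, successors):
    """Gap-encode the sorted successor list of node x (the 'locality' trick).

    First entry is s1 - x (may be negative and is remapped before coding);
    later entries are s_i - s_{i-1} - 1 (>= 0, since successors increase)."""
    out, prev = [], None
    for s in successors:
        out.append(s - x if prev is None else s - prev - 1)
        prev = s
    return out

def gaps_decode(x, encoded):
    """Invert gaps_encode."""
    succ, prev = [], None
    for g in encoded:
        s = x + g if prev is None else prev + g + 1
        succ.append(s)
        prev = s
    return succ
```

On locality-rich lists most gaps are tiny, which is exactly what a universal code (gamma, zeta) then exploits.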
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
The copy-list of y has one bit per successor of the reference x, telling
whether that successor is also a successor of y;
The reference index is chosen in [0,W] so as to give the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy-block is 0 if the copy-list starts with 0;
The last block is omitted (we know the total length…);
The lengths of the other blocks are decremented by one (each such run has length ≥ 1)
WebGraph is available as a Java and C++ library
(≈3 bits/edge)
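A self-consistent sketch of the copy-block idea, following the three rules above (the real WebGraph/BVGraph byte format differs in details; this only shows the RLE round-trip):

```python
def copy_blocks(copy_list):
    """RLE a copy-list (a '0'/'1' string) into copy-blocks: the sequence
    starts with the length of the leading 1-run (0 if the list starts with
    0), later blocks are run lengths decremented by one, and the last
    block is dropped (recoverable from the total list length)."""
    runs, bit, count = [], '1', 0
    for c in copy_list:
        if c == bit:
            count += 1
        else:
            runs.append(count)
            bit, count = c, 1
    runs.append(count)
    blocks = [runs[0]] + [r - 1 for r in runs[1:]]
    return blocks[:-1]                      # last block is implicit

def copy_list_from_blocks(blocks, n):
    """Invert copy_blocks, given the total length n of the copy-list."""
    out, bit = [], '1'
    for b in blocks:
        ln = b if not out else b + 1        # only the first block keeps its length
        out.append(bit * ln)
        bit = '0' if bit == '1' else '1'
    used = sum(len(p) for p in out)
    out.append(bit * (n - used))            # re-create the omitted last block
    return ''.join(out)
```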
Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.
Consecutivity in extra-nodes
Intervals: encoded with their left extreme and length
Interval length: decremented by Lmin = 2
Residuals: differences between consecutive residuals, or with the source;
v ≥ 0 is mapped to 2v, v < 0 to 2|v|−1
Examples: 0 = (15−15)*2 (positive); 2 = (23−19)−2 (jump ≥ 2);
600 = (316−16)*2; 3 = |13−15|*2−1 (negative); 3018 = 3041−22−1
Algoritmi per IR
Compression of file collections
Background
data
knowledge
about data
at receiver
sender
receiver
network links are getting faster and faster, but
many clients are still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
What if the sender has never seen the data at the receiver?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression [diff, zdelta, REBL, …]
Compress file f deploying file f’
Compress a group of files
Speed up web access by sending differences between the requested
page and the ones available in cache
File synchronization [rsync, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distribution Networks
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersecting inverted lists in a P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
The LZ77 scheme provides an efficient, optimal solution:
treat fknown as the “previously encoded text” and compress the concatenation fknown·fnew starting from fnew
zdelta is one of the best implementations
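zdelta is a dedicated tool, but the idea — let the LZ77 window see fknown before fnew — can be approximated with Python's stdlib by installing fknown as a zlib preset dictionary (a rough sketch, not zdelta itself; the test data is synthetic):

```python
import zlib

def delta_compress(f_known, f_new):
    """Compress f_new deploying f_known: the deflate window is preloaded
    with f_known (preset dictionary), so copies can reference it."""
    c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, 9,
                         zlib.Z_DEFAULT_STRATEGY, zdict=f_known)
    return c.compress(f_new) + c.flush()

def delta_decompress(f_known, f_delta):
    """Recover f_new from f_known and the delta file."""
    d = zlib.decompressobj(zlib.MAX_WBITS, zdict=f_known)
    return d.decompress(f_delta) + d.flush()
```

When fnew is a lightly edited copy of fknown, the delta is a tiny fraction of what plain compression of fnew alone achieves.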
Emacs distribution:
          size     time
uncompr   27Mb     ---
gzip      8Mb      35 secs
zdelta    1.5Mb    42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
[Figure: Client — Proxy — slow link — Proxy — fast link — web server; requests and reference pages flow between the proxies, and only the delta-encoding of the requested page travels over the slow link]
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f ∈ F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF: nodes = files, weights = zdelta-sizes
Insert a dummy node connected to all files, with weights equal to their gzip-compressed sizes
Compute the min branching = directed spanning tree of minimum total cost, covering
G’s nodes.
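On a DAG (plus a dummy root with no incoming edges) no choice of in-edges can create a cycle, so the min branching decomposes: each file independently keeps its cheapest incoming edge. A sketch with made-up file names and weights (real weights would be zdelta/gzip sizes):

```python
def min_branching_dag(nodes, edges, root):
    """Min-cost branching on a DAG: every non-root node keeps its
    cheapest incoming edge; acyclicity guarantees the result is a
    spanning arborescence rooted at `root`.
    edges: dict (u, v) -> weight, with an edge root->v for every v."""
    parent, total = {}, 0
    for v in nodes:
        if v == root:
            continue
        u, w = min(((u, w) for (u, t), w in edges.items() if t == v),
                   key=lambda p: p[1])
        parent[v] = u
        total += w
    return parent, total
```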
[Figure: example weighted graph over the files plus dummy node 0; edge weights are the zdelta/gzip sizes]
          space    time
uncompr   30Mb     ---
tgz       20%      linear
THIS      8%       quadratic
Improvement
What about
many-to-one compression?
(group of files)
Problem: Constructing G is very costly: n² edge calculations (zdelta executions)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files
Assign weights: estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, still Θ(n²) time
          space    time
uncompr   260Mb    ---
tgz       12%      2 mins
THIS      8%       16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
request
f_new
update
Server
f_old
Client
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
hashes
f_new
Server
encoded file
f_old
Client
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
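The 4-byte rolling hash can be sketched as an Adler-like pair of sums that is updated in O(1) per one-character shift (a simplified sketch, not rsync's exact checksum):

```python
def weak_checksum(block):
    """Rolling weak checksum: a = sum of the bytes, b = sum of the
    prefix-weighted bytes, both mod 2**16, packed into one 32-bit value."""
    a = sum(block) % 65536
    b = sum((len(block) - i) * c for i, c in enumerate(block)) % 65536
    return (b << 16) | a

def roll(csum, out_byte, in_byte, blocksize):
    """Slide the window one byte: drop out_byte, append in_byte, O(1)."""
    a = csum & 0xFFFF
    b = csum >> 16
    a = (a - out_byte + in_byte) % 65536
    b = (b - blocksize * out_byte + a) % 65536
    return (b << 16) | a
```

The client would index its blocks by this cheap checksum and confirm candidate matches with the strong (MD5-style) hash.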
Rsync: some experiments
gcc size
total
27288
gzip
7563
zdelta
227
rsync
964
emacs size
27326
8577
1431
4452
Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
The server sends the hashes (unlike rsync, where the client does); the client checks them
The server deploys the common fref to compress the new ftar (rsync compresses just ftar).
A multi-round protocol
k blocks of n/k elements
log(n/k) levels
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Recurring minimum for
improving the estimate
+ 2 SBF
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e., T[i,N])
Occurrences of P in T = all suffixes of T having P as a prefix
Example: P = si occurs in T = mississippi at positions 4 and 7
SUF(T) = sorted set of suffixes of T
Reduction: from substring search to prefix search
The Suffix Tree
[Figure: suffix tree of T# = mississippi#, with 12 leaves labelled by the starting positions 1..12 and edge labels such as “mississippi#”, “ssi”, “ppi#”, “pi#”; the leaves 4 and 7 lie below the path spelling “si”]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N²) space; the suffix array SA stores only the suffix pointers:

SA   SUF(T)
12   #
11   i#
8    ippi#
5    issippi#
2    ississippi#
1    mississippi#
10   pi#
9    ppi#
7    sippi#
4    sissippi#
6    ssippi#
3    ssissippi#

T = mississippi#
P = si occurs at the contiguous SA entries 7 and 4
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
[Figure: binary search over SA for P = si on T = mississippi#; each step compares P with the suffix pointed to by the middle SA entry (2 memory accesses per step); here P is larger than the middle suffix]
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
[Figure: the next binary-search step; here P is smaller than the middle suffix]
Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
overall, O(p log₂ N) time
improvable to O(p + log₂ N) [Manber-Myers, ’90]
and to O(p + log₂ |S|) [Cole et al, ’06]
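The whole scheme fits in a few lines (a didactic sketch: suffixes are sorted naively, so construction is not the linear-time algorithm used in practice):

```python
def suffix_array(t):
    """Sort the suffix start positions lexicographically (O(N^2 log N)
    worst case; fine for a demo, production code uses O(N) builders)."""
    return sorted(range(len(t)), key=lambda i: t[i:])

def search(t, sa, p):
    """All occurrences of p: the suffixes prefixed by p are contiguous
    in SA (Prop 1), so two binary searches find the range boundaries."""
    lo, hi = 0, len(sa)
    while lo < hi:                                  # leftmost suffix >= p
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:                                  # leftmost suffix > p...
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] <= p:        # ...as a p-length prefix
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[i] for i in range(first, lo))  # 0-based positions
```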
Locating the occurrences
[Figure: to locate the occurrences, binary-search the two boundaries of P’s range in SA (conceptually P# and P$); for P = si on T = mississippi# the range contains sippi (7) and sissippi (4), so occ = 2]
Suffix Array search
• O(p + log₂ N + occ) time, searching the two boundaries P# and P$ (with # < S < $)
Suffix Trays: O(p + log₂ |S| + occ)
[Cole et al., ‘06]
String B-tree
[Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
Text mining
Lcp[1,N-1] = array of longest-common-prefix lengths between suffixes adjacent in SA
For T = mississippi#:
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =   0  1 1 4 0 0  1 0 2 1 3
e.g. Lcp = 4 between issippi# and ississippi#
• How long is the common prefix between T[i,...] and T[j,...] ?
• It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
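The Lcp queries above can be sketched directly (a didactic version: the pairwise scan is O(N²) worst case, whereas Kasai's algorithm builds Lcp in O(N)):

```python
def lcp_array(t, sa):
    """Lcp[i] = longest common prefix of the adjacent suffixes
    sa[i] and sa[i+1]."""
    def lcp(i, j):
        k = 0
        while i + k < len(t) and j + k < len(t) and t[i + k] == t[j + k]:
            k += 1
        return k
    return [lcp(sa[i], sa[i + 1]) for i in range(len(sa) - 1)]

def has_repeat(lcp, L, C):
    """Is there a substring of length >= L occurring >= C times?
    Look for a window of C-1 consecutive Lcp entries all >= L."""
    w = C - 1
    return any(min(lcp[i:i + w]) >= L for i in range(len(lcp) - w + 1))
```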
Slide 3
Algoritmi per IR
Prologo
Paradigm shift...
Web 2.0 is about the many
Big DATA Big PC ?
We have three types of algorithms:
T1(n) = n, T2(n) = n2, T3(n) = 2n
... and assume that 1 step = 1 time unit
How many input data n can each algorithm process
within t time units?
n1 = t,
n2 = √t,
n3 = log2 t
What about a k-times faster processor?
...i.e., what is n when the time budget is k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
A new scenario
Data are more available than ever before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Step cost is Θ(1) time
Not just MIN #steps…
The memory hierarchy (CPU registers → Cache L1/L2 → RAM → HD → net):

Level     size        access time        unit fetched
Cache     few MBs     some nanosecs      few words
RAM       few GBs     tens of nanosecs   some words
HD        few TBs     few millisecs      B = 32KB page
net       many TBs    even secs          packets
You should be “??-aware programmers”
I/O-conscious Algorithms
[Figure: magnetic disk — track, read/write head, read/write arm, magnetic surface]
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O [10^5–10^6 steps (Hennessy-Patterson)]
If N = (1+f)M, then the disk-average cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4KB in time C, and the algorithm uses all of them:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)
Space-conscious Algorithms
Compressed data structures: fewer I/Os for search and access
Streaming Algorithms
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock and its daily performance over
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running times for input size n:
n     4K    8K    16K   32K    128K   256K   512K   1M
n³    22s   3m    26m   3.5h   28h    --     --     --
n²    0     0     0     1s     26s    106s   7m     28m
An optimal solution
We assume every prefix sum ≠ 0
[Figure: A splits into a prefix of sum < 0 followed by the optimum window of sum > 0]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
sum = 0; max = -1;
For i = 1,...,n do
  If (sum + A[i] ≤ 0) sum = 0;
  else { sum += A[i]; max = MAX{max, sum}; }
Note:
• sum < 0 just before OPT starts;
• sum > 0 within OPT
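The one-pass scan above is the classic linear-time solution; a minimal runnable version (it returns 0 on an all-negative array, a convention not discussed on the slide):

```python
def max_subarray(a):
    """Kadane-style scan: reset the running sum when it drops to <= 0,
    and track the best sum seen so far. O(n) time, O(1) space."""
    best = cur = 0
    for x in a:
        cur = max(0, cur + x)   # restart the window after a <= 0 running sum
        best = max(best, cur)
    return best
```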
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions ⇒ data get distributed arbitrarily across the leaves (“tuple pointers”)
What about listing the tuples in order ?
Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;           Divide
    Merge-Sort(A,i,m);     Conquer
    Merge-Sort(A,m+1,j);
    Merge(A,i,m,j)         Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10^9 tuples ⇒ few GBs
Typical disk (Seagate Cheetah, 150GB): seek time ≈ 5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log₂ n) random I/Os
⇒ [5ms] * n log₂ n ≈ 1.5 years
In practice, it is faster because of caching
(each level makes 2 passes, R/W)
Merge-Sort Recursion Tree
log2 N
If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: merge-sort recursion tree with log₂ N levels; sorted runs are merged pairwise level by level]
How do we deploy the disk/mem features ?
N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
Pass 1: Produce (N/M) sorted runs.
Pass i: merge X = M/B runs ⇒ log_{M/B}(N/M) passes
[Figure: X = M/B input buffers of B items each stream the runs from disk; one output buffer of B items streams the merged run back to disk]
Multiway Merging
[Figure: multiway merging — each run i keeps its current page in buffer Bf_i with pointer p_i; repeatedly output min(Bf1[p1], Bf2[p2], …, Bfx[pX]) into Bfo; fetch the next page when p_i = B, flush Bfo when full; the X = M/B runs produce one merged run (until EOF)]
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os
In practice
M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
⇒ one multiway merge ⇒ 2 passes = few mins
(Tuning depends on disk features)
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
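The run-then-merge scheme can be sketched with the stdlib (a toy sketch: runs are plain temp files of one integer per line, and heapq.merge plays the role of the X-way merge with its internal buffers; real implementations manage B-sized pages explicitly):

```python
import heapq
import tempfile

def external_sort(items, M):
    """Multiway merge sort: produce ceil(N/M) runs of M items sorted in
    memory, then a single X-way merge pass over all runs."""
    runs, buf = [], []

    def flush():
        if buf:
            f = tempfile.TemporaryFile(mode="w+")
            f.writelines(f"{x}\n" for x in sorted(buf))  # one sorted run
            f.seek(0)
            runs.append(f)
            buf.clear()

    for x in items:
        buf.append(x)
        if len(buf) == M:
            flush()
    flush()
    streams = [(int(line) for line in f) for f in runs]
    return list(heapq.merge(*streams))      # the multiway merge pass
```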
Can compression help?
Goal: enlarge M and reduce N
#passes = O(logM/B N/M)
Cost of a pass = O(N/B)
Part of Vitter’s paper…
In order to address issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how far we can go
Toy problem #3: Top-freq elements
Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space (i.e., assuming the mode occurs > N/2 times).
A = b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables <X,C>
For each item s of the stream:
  if (X == s) then C++
  else { C--; if (C == 0) { X = s; C = 1; } }
Return X;
Proof
If X ≠ y at the end, then every one of y’s occurrences has a
“negative” mate (the distinct item that cancelled it).
Hence these mates number ≥ #occ(y), so N ≥ 2 * #occ(y) > N — a contradiction.
Problems arise if the mode occurs ≤ N/2 times: the returned X must then be
verified with a second pass.
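The pair-of-variables scan is the Boyer-Moore majority vote; a minimal runnable version (phrased slightly differently from the slide's pseudocode, but with the same invariant):

```python
def majority_candidate(stream):
    """One pair of variables <X, C>: returns the only possible majority
    item. If no item is guaranteed to occur > N/2 times, a second pass
    must verify the returned candidate."""
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1      # adopt a fresh candidate
        elif X == s:
            C += 1           # a vote for the current candidate
        else:
            C -= 1           # a "negative mate" cancels one vote
    return X
```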
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 * 10^9 chars ⇒ size = 6GB
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms
What kind of data structure do we build to support
word-based searches ?
Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms; entry = 1 if the play contains the word, 0 otherwise:

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

Space is 500Gb !
Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

1. Typically we use about 12 bytes per posting
2. We have 10^9 total terms ⇒ at least 12GB of space
3. Compressing the 6GB of documents gets ≈1.5GB of data
A better index, but still >10 times the (compressed) text !!!!
We can still do better: i.e. 30÷50% of the original text
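One standard way to shrink the postings is to store d-gaps and code each gap with a variable-byte code (a sketch of the classic vbyte scheme: 7 data bits per byte, high bit marking the last byte of each number):

```python
def gaps(postings):
    """Turn increasing doc-ids into first id + successive gaps."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def vbyte_encode(nums):
    """High-order bytes first (flag clear), last byte has the flag set."""
    out = bytearray()
    for n in nums:
        bs = []
        while True:
            bs.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        out.extend(reversed(bs[1:]))   # leading bytes, continuation
        out.append(bs[0] | 0x80)       # terminal byte, flag set
    return bytes(out)

def vbyte_decode(data):
    nums, n = [], 0
    for b in data:
        if b & 0x80:
            nums.append((n << 7) | (b & 0x7F))
            n = 0
        else:
            n = (n << 7) | b
    return nums
```

Small gaps (the common case, thanks to skewed term distributions) take a single byte instead of 4 or more.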
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string
is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO: they are 2^n, but the shorter compressed messages are only
Σ_{i=1}^{n-1} 2^i = 2^n - 2
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S where symbol s has probability p(s), the
self-information of s is:
i(s) = log₂ (1/p(s)) = -log₂ p(s)
Lower probability ⇒ higher information
Entropy is the weighted average of i(s):
H(S) = Σ_{s∈S} p(s) log₂ (1/p(s)) bits
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be
uniquely decomposed into their codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0
1
0 1
a
0
1
b
c
d
Average Length
For a code C with codeword lengths L[s], the
average length is defined as
La(C) = Σ_{s∈S} p(s) L[s]
We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same codeword lengths and thus the
same optimal average length. And vice versa…
Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}
then pi < pj ⇒ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have
H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that
La(C) ≤ H(S) + 1
(the Shannon code takes ⌈log₂ 1/p⌉ bits per symbol)
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
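The greedy construction — repeatedly merge the two least-probable trees — can be sketched with a heap (note: tie-breaking is by insertion order here, so the actual codewords may differ from the slide's example even though the average length is equally optimal):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build an optimal prefix code for {symbol: probability}."""
    tick = count()                      # tie-breaker so dicts never compare
    heap = [(p, next(tick), {s: ""}) for s, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)      # two least-probable trees
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tick), merged))
    return heap[0][2]
```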
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Merge a(.1) and b(.2) into (.3); merge (.3) and c(.2) into (.5);
merge (.5) and d(.5) into the root (1).
Labelling each left branch 0 and each right branch 1:
a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees (flip the 0/1 labels of the internal nodes)
What about ties (and thus, tree depth) ?
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
With the codes a=000, b=001, c=01, d=1:
abc... → 00000101...
101001... → dcb...
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store for any level L:
firstcode[L] = the value of the first (leftmost, 00…0-style) codeword of length L
Symbol[L,i], for each i in level L
This is ≤ h² + |S| log |S| bits
Canonical Huffman
Encoding
[Figure: codeword table for the levels 1..5]
Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0
T=...00010...
Problem with Huffman Coding
Consider a symbol with probability .999. Its
self-information is
-log₂(.999) ≈ .00144 bits
If we were to send 1000 such symbols we
might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
1 extra bit per macro-symbol = 1/k extra bits per symbol
Larger model to be transmitted
Shannon took infinite sequences, i.e. k → ∞ !!
In practice, we have:
The model takes |S|^k (k * log |S|) + h² bits (where h might be |S|)
It is H₀(Sᴸ) ≤ L * H_k(S) + O(k * log |S|), for each k ≤ L
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
[Figure: tagged word-based Huffman for T = “bzip or not bzip” — each byte carries 7 bits of a fan-out-128 Huffman codeword plus 1 tag bit marking the first byte of a codeword; e.g. the word “bzip” gets the byte-aligned codeword 1a 0b]
CGrep and other ideas...
P= bzip = 1a 0b
[Figure: GREP over C(T), T = “bzip or not bzip” — the codeword of P is searched directly in the compressed text; the tag bits rule out false matches (yes/no marks)]
Speed ≈ Compression ratio
You can find this among my Software projects
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary: a, bzip, not, or
P = bzip = 1a 0b
[Figure: the codeword of P is searched directly in C(S), S = “bzip or not bzip”; the two occurrences of the term bzip are found (yes/no marks)]
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: pattern P slid along text T]
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}ⁿ)
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:
H(s) = Σ_{i=1}^{m} 2^(m-i) · s[i]
P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’)
Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1)
H(T_r) = 2 · H(T_{r-1}) - 2^m · T(r-1) + T(r+m-1)
T = 10110101
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 - 2⁴·1 + 0 = 22 - 16 = 6 ✓
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! Why?
When m is large, it is unreasonable to assume that each
arithmetic operation can be done in O(1) time:
values of H() are m-bit numbers — in general too
BIG to fit in a machine word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P = 101111, q = 7
H(P) = 47
Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally!
(1·2 + 0) (mod 7) = 2
(2·2 + 1) (mod 7) = 5
(5·2 + 1) (mod 7) = 4
(4·2 + 1) (mod 7) = 2
(2·2 + 1) (mod 7) = 5 = Hq(P)
We can still compute Hq(T_r) from Hq(T_{r-1}), since
2^m (mod q) = 2 · (2^(m-1) (mod q)) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
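The whole fingerprint scan fits in a short routine (a sketch of the deterministic variant: candidates are verified, so no false match is ever reported; q would normally be a random prime ≤ I, a fixed prime is hard-coded here for reproducibility):

```python
def karp_rabin(T, P, q=2**31 - 1):
    """Report all 0-based occurrences of P in T via rolling fingerprints."""
    n, m = len(T), len(P)
    if m > n:
        return []
    base = 256
    hp = ht = 0
    for i in range(m):                      # fingerprints of P and T_0
        hp = (hp * base + ord(P[i])) % q
        ht = (ht * base + ord(T[i])) % q
    top = pow(base, m - 1, q)               # weight of the dropped char
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:    # verify: kills false matches
            occ.append(r)
        if r + m < n:                       # roll: drop T[r], append T[r+m]
            ht = ((ht - ord(T[r]) * top) * base + ord(T[r + m])) % q
    return occ
```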
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
Problem 1: Solution
Dictionary: a, bzip, not, or
P = bzip = 1a 0b
[Figure: Karp-Rabin-style scan of C(S), S = “bzip or not bzip”, matching the fingerprint of P’s codeword (yes/no marks)]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for.
M(3,7) = 1 because P[1..3] = for = T[5..7].
[Figure: the m-by-n matrix M for this example; column j = 7 has a 1 in row i = 3]
How does M solve the exact match problem?
How to construct M
We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one
Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:
And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.
Example: BitShift((0,1,1,0,1)ᵀ) = (1,0,1,1,0)ᵀ
Let w be the word size (e.g., 32 or 64 bits). We’ll assume
m ≤ w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x) is set to 1 for the
positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)ᵀ, U(b) = (0,1,0,0,0)ᵀ, U(c) = (0,0,0,0,1)ᵀ
How to construct M
Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained by
M(j) = BitShift(M(j-1)) & U(T[j])
For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T
ending at character j-1 ⇔ M(i-1,j-1) = 1
(2) P[i] = T[j] ⇔ the i-th bit of U(T[j]) = 1
BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing with the i-th bit of U(T[j]) establishes whether both hold
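The recurrence maps directly onto machine words (a minimal sketch keeping each column M(j) as an integer, with bit i−1 standing for row i; it reports 0-based start positions):

```python
def shift_and(T, P):
    """Shift-And exact matching: O(n * ceil(m/w)) time for word size w."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):           # U[c] has a 1 where P carries c
        U[c] = U.get(c, 0) | (1 << i)
    M, last = 0, 1 << (m - 1)
    occ = []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)   # BitShift, then AND with U(T[j])
        if M & last:                       # row m set: full match ends at j
            occ.append(j - m + 1)
    return occ
```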
An example, j=1 (P = abaac, T = xabxabaaca):
M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵀ & U(x) = (0,0,0,0,0)ᵀ
An example, j=2:
M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵀ & U(a) = (1,0,0,0,0)ᵀ
An example, j=3:
M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵀ & U(b) = (0,1,0,0,0)ᵀ
An example, j=9:
M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵀ & U(c) = (0,0,0,0,1)ᵀ
The 5th bit is set: an occurrence of P = abaac ends at position 9 of T.
Shift-And method: Complexity
If m<=w, any column and vector U() fit in a
memory word.
Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when the pattern length is close
to the word size — very often the case in practice:
recall that w = 64 bits in modern architectures.
Some simple extensions
We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)ᵀ, U(b) = (1,1,0,0,0)ᵀ, U(c) = (0,0,0,0,1)ᵀ
(both a and b now set position 1)
What about ‘?’, ‘[^…]’ (not) ?
Problem 1: Another solution
Dictionary: a, bzip, not, or
P = bzip = 1a 0b
[Figure: Shift-And scan of C(S), S = “bzip or not bzip” (yes/no marks)]
Speed ≈ Compression ratio
Problem 2
Dictionary: a, bzip, not, or
Given a pattern P, find all the occurrences in S of all terms containing P as a substring
P = o
[Figure: C(S) for S = “bzip or not bzip”; both or = 1g 0a 0b and not = 1g 0g 0a contain o]
Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: patterns P1 and P2 slid along text T]
Naïve solution
Use an (optimal) Exact Matching Algorithm searching each
pattern in P
Complexity: O(nl+m) time, not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns in P;
R is a bitmap of length m with
R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method searching for S:
For any symbol c, U’(c) = U(c) AND R, so that
U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a
pattern
For any step j:
compute M(j),
then M(j) OR U’(T[j]). Why?
It sets to 1 the first bit of each pattern that starts with
T[j]
Check if there are occurrences ending in j. How?
Test the bits of M(j) at the last position of each pattern.
Problem 3
Dictionary = {a, b, bzip, not, or, space}. Given a pattern P, find
all the occurrences in S of all terms containing P as a substring,
allowing at most k mismatches.
Example: P = bot with k = 2, S = “bzip or not bzip”.
[slide figure: the code tree and the scan of C(S)]
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4; it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa
aatatccacaa
atcgaa
Agrep
Our current goal: given k, find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:
Ml(i,j) = 1 iff
the first i characters of P match the i characters of T
ending at position j, with no more than l mismatches.
What is M0?
How does Mk solve the k-mismatch problem?
Computing Mk
We compute Ml for all l = 0, … , k.
For each j compute M0(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff one of the following two cases holds.
Computing Ml: case 1
The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal:
BitShift( Ml(j-1) ) & U(T[j])
Computing Ml: case 2
The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches (the j-th character is a mismatch):
BitShift( Ml-1(j-1) )
Computing Ml
We compute Ml for all l = 0, … , k.
For each j compute M0(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
Combining the two cases:
Ml(j) = [ BitShift( Ml(j-1) ) & U(T[j]) ]  OR  BitShift( Ml-1(j-1) )
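Putting the recurrence to work, a minimal Python sketch (my naming; one bit-column per error level l):

```python
def agrep_mismatch(P, T, k):
    """Bit-parallel k-mismatch search:
    Ml(j) = (BitShift(Ml(j-1)) & U(T[j])) | BitShift(M(l-1)(j-1))."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    last = 1 << (m - 1)
    M = [0] * (k + 1)       # M[l] = current column of matrix Ml
    occ = []
    for j, c in enumerate(T, start=1):
        prev = M[:]         # columns at j-1, for every l
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            # case 1: match branch; case 2: spend one mismatch on T[j]
            M[l] = ((((prev[l] << 1) | 1) & U.get(c, 0))
                    | ((prev[l - 1] << 1) | 1))
        if M[k] & last:
            occ.append(j)   # occurrence with <= k mismatches ends at j
    return occ
```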
Example M1
P = abaad, T = xabxabaaca

         j:  1 2 3 4 5 6 7 8 9 10
M0:  i=1:    0 1 0 0 1 0 1 1 0 1
     i=2:    0 0 1 0 0 1 0 0 0 0
     i=3:    0 0 0 0 0 0 1 0 0 0
     i=4:    0 0 0 0 0 0 0 1 0 0
     i=5:    0 0 0 0 0 0 0 0 0 0

M1:  i=1:    1 1 1 1 1 1 1 1 1 1
     i=2:    0 0 1 0 0 1 0 1 1 0
     i=3:    0 0 0 1 0 0 1 0 0 1
     i=4:    0 0 0 0 1 0 0 1 0 0
     i=5:    0 0 0 0 0 0 0 0 1 0

M1(5,9) = 1: an occurrence with at most 1 mismatch ends at position 9 (abaac vs abaad).
How much do we pay?
The running time is O(kn(1+m/w))
Again, the method is practically efficient for
small m.
Still, only O(k) columns of the matrices Ml are needed at any
given time. Hence the space used by the
algorithm is O(k) memory words.
Problem 3: Solution
Dictionary = {a, b, bzip, not, or, space}. P = bot, k = 2,
S = “bzip or not bzip”: scanning the codewords of C(S) with the
k-mismatch machinery reports not = 1g0g0a (yes).
[slide figure: the code tree and the scan of C(S)]
Agrep: more sophisticated operations
The Shift-And method can solve other operations too.
The edit distance between two strings p and s is
d(p,s) = minimum number of operations
needed to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one
Example: d(ananas,banane) = 3
Search by regular expressions
Example: (a|b)?(abc|a)
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
g-code for integer encoding (Elias g)
g(x) = 0^(L-1) followed by the binary representation of x,
where x > 0 and L = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.
The g-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e., a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of g-coded
integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
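The g-code can be implemented, and the exercise above checked, in a few lines (an illustrative Python sketch):

```python
def gamma_encode(x):
    """g(x): (L-1) zeros followed by x in binary; requires x >= 1."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode a concatenation of g-codes back into the integer sequence."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":          # count leading zeros = L - 1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out
```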
Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code g(i).
Recall that: |g(i)| ≤ 2 log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 H0(s) + 1
Key fact:
1 ≥ Σ_{i=1..x} pi ≥ x·px, hence x ≤ 1/px
How good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 log i + 1
The cost of the encoding is (recall i ≤ 1/pi):
Σ_{i=1..|S|} pi·|g(i)|  ≤  Σ_{i=1..|S|} pi·[ 2 log(1/pi) + 1 ]  =  2 H0(X) + 1
Not much worse than Huffman,
and improvable to H0(X) + 2 + ....
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman
End-tagged dense code
The rank r is mapped to r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix-code
Better compression: it uses all 7-bit configurations
(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is:
Previously we used: s = c = 128
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s·c with 2 bytes, s·c^2 with 3 bytes, ...
An example
5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...
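A hedged sketch of the (s,c)-dense encoder (my function name; byte values in [0,s) act as stoppers, values in [s,256) as continuers). It reproduces the counts above: s codewords of 1 byte, s·c of 2 bytes:

```python
def sc_dense_encode(r, s=230, c=26):
    """(s,c)-dense codeword of rank r (r = 0,1,2,... by decreasing frequency).
    Assumes s + c = 256."""
    out = [r % s]              # last byte: a stopper
    r //= s
    while r > 0:               # prepend continuer bytes
        r -= 1
        out.append(s + r % c)
        r //= c
    return bytes(reversed(out))
```

ETDC is recovered as the special case s = c = 128.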
Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.
Brute-force approach, or binary search:
on real distributions, there seems to be one unique minimum
K_s = max codeword length
F_s^k = cumulative probability of the symbols whose codeword length is ≤ k
Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than
byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
Exploit temporal locality, and it is dynamic
X = 1^n 2^n 3^n … n^n :  Huff = O(n^2 log n) bits,  MTF = O(n log n) + n^2 bits
Not much worse than Huffman
...but it may be far better
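A direct Python sketch of the coder and its inverse (illustrative; real implementations maintain the list more cleverly, as discussed two slides below):

```python
def mtf_encode(text, alphabet):
    """Output, for each symbol, its current position in the MTF-list."""
    L, out = list(alphabet), []
    for ch in text:
        i = L.index(ch)
        out.append(i)
        L.insert(0, L.pop(i))   # move the symbol to the front
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for i in codes:
        ch = L[i]
        out.append(ch)
        L.insert(0, L.pop(i))
    return "".join(out)
```

Note the memory: a run of equal symbols becomes a run of 0s, which var-length codes compress very well.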
MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 log i + 1
Put S at the front and consider the cost of encoding: each occurrence of a
symbol x is encoded as the g-code of the gap from its previous occurrence, so the cost is
O(|S| log |S|) + Σ_{x∈S} Σ_{i=2..n_x} |g( p_{x,i} − p_{x,i−1} )|
By Jensen’s inequality (the n_x gaps of symbol x sum to at most N):
≤ O(|S| log |S|) + Σ_{x∈S} n_x · [ 2 log(N/n_x) + 1 ]
= O(|S| log |S|) + N · [ 2 H0(X) + 1 ]
Hence La[mtf] ≤ 2 H0(X) + O(1)
MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to efficiently maintain the MTF-list:
Search tree
  leaves contain the symbols, ordered as in the MTF-list
  nodes contain the size of their descending subtree
Hash table
  key is a symbol
  data is a pointer to the corresponding tree leaf
Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings just numbers and one bit
Properties:
Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n (there is a memory)
Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
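The example above in runnable form (a minimal sketch):

```python
def rle_encode(s):
    """Collapse each maximal run into a (char, run-length) pair."""
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append((s[i], j - i))
        i = j
    return out

def rle_decode(pairs):
    return "".join(c * n for c, n in pairs)
```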
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive), with f(i) = Σ_{j=1..i-1} p(j).
e.g. p(a) = .2, p(b) = .5, p(c) = .3, hence f(a) = .0, f(b) = .2, f(c) = .7:
a = [0.0, 0.2),  b = [0.2, 0.7),  c = [0.7, 1.0)
The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac (with p(a)=.2, p(b)=.5, p(c)=.3):
start [0, 1)  →  b: [.2, .7)  →  a: [.2, .3)  →  c: [.27, .3)
The final sequence interval is [.27,.3)
Arithmetic Coding
To code a sequence of symbols c with probabilities
p[c] use the following:
l_0 = 0,  s_0 = 1
l_i = l_{i-1} + s_{i-1} · f[c_i]
s_i = s_{i-1} · p[c_i]
f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is s_n = Π_{i=1..n} p[c_i]
The interval for a message sequence will be called the
sequence interval
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
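The interval arithmetic of the last slides, as a small floating-point Python sketch (illustrative only: real coders use the integer version discussed later to avoid precision problems):

```python
def sequence_interval(msg, p, f):
    """l_i = l_{i-1} + s_{i-1}*f[c_i],  s_i = s_{i-1}*p[c_i]."""
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]
        s = s * p[c]
    return l, s                 # the sequence interval is [l, l+s)

def arith_decode(x, n, p, f):
    """Decode n symbols from a number x in the final interval."""
    syms = sorted(f, key=f.get)
    out = []
    for _ in range(n):
        for c in reversed(syms):        # find the symbol interval containing x
            if x >= f[c]:
                out.append(c)
                x = (x - f[c]) / p[c]   # rescale and continue
                break
    return "".join(out)
```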
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3 (same distribution as above):
.49 ∈ [.2,.7) → b;  rescaled: (.49−.2)/.5 = .58 ∈ [.2,.7) → b;  rescaled: (.58−.2)/.5 = .76 ∈ [.7,1) → c
The message is bbc.
Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .0101...      11/16 = .1011
Algorithm:
1. x = 2 * x
2. If x < 1 output 0
3. else x = x - 1; output 1
So how about just using the shortest binary
fractional representation in the sequence
interval?
e.g. [0,.33) → .01,  [.33,.66) → .1,  [.66,1) → .11
Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.

  code   min     max     interval
  .11    .110    .111    [.75, 1.0)
  .101   .1010   .1011   [.625, .75)

We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).
e.g. sequence interval [.61, .79), code interval (.101) = [.625, .75)
Can use L + s/2 truncated to 1 + log (1/s) bits
Bound on Arithmetic length
Note that –log s+1 = log (2/s)
Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log(1/s) = 1 + log Π_i (1/p_i)
             ≤ 2 + Σ_{i=1..n} log(1/p_i)
             = 2 + Σ_{k=1..|S|} n·p_k·log(1/p_k)
             = 2 + n·H0   bits
nH0 + 0.02·n bits in practice,
because of rounding
Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:
Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
Output 1 followed by m 0s; set m = 0
Message interval is expanded by 2
If u < R/2 (bottom half):
Output 0 followed by m 1s; set m = 0
Message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 (middle half):
Increment m
Message interval is expanded by 2
In all other cases, just continue...
You find this at
Arithmetic ToolBox
As a state machine
ATB maps the current interval (L,s) and the next symbol c, drawn
from the distribution (p1,....,pS), to the new interval (L’,s’).
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
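The per-context statistics can be gathered with a short sketch (illustrative Python, my naming; the escape count $ is set to the number of distinct successors, one common PPM heuristic):

```python
from collections import defaultdict

def ppm_counts(s, k):
    """For each context length 0..k, count the successors of every context,
    plus an escape count "$" = number of distinct successors seen."""
    tables = [defaultdict(lambda: defaultdict(int)) for _ in range(k + 1)]
    for order in range(k + 1):
        for i in range(order, len(s)):
            ctx, nxt = s[i - order:i], s[i]
            tables[order][ctx][nxt] += 1
    for order in range(k + 1):
        for ctx, cnt in tables[order].items():
            cnt["$"] = len([c for c in cnt if c != "$"])
    return tables
```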
PPM + Arithmetic ToolBox
The ATB is driven by the conditional distribution p[ s | context ], where s = c or esc.
Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)
PPM: Example Contexts
String = ACCBACCACBA B (B is the next symbol to encode), k = 2

Context Empty:   A = 4   B = 2   C = 5   $ = 3

Order-1 contexts:
  A:   C = 3   $ = 1
  B:   A = 2   $ = 1
  C:   A = 1   B = 2   C = 2   $ = 3

Order-2 contexts:
  AC:  B = 1   C = 2   $ = 2
  BA:  C = 1   $ = 1
  CA:  C = 1   $ = 1
  CB:  A = 2   $ = 1
  CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)
<2,3,c>
Algorithm’s step:
Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match
Advance by len + 1
A buffer “window” has fixed length and moves
Example: LZ77 with window
a a c a a c a b c a b a a a c
(0,0,a)
a a c a a c a b c a b a a a c
(1,1,c)
a a c a a c a b c a b a a a c
(3,4,b)
a a c a a c a b c a b a a a c
(3,3,a)
a a c a a c a b c a b a a a c
(1,2,c)
Window size = 6
Longest match
within W
Next character
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if l > d? (overlap with text to be compressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
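The decoding loop above, completed into a runnable sketch (it handles the overlap case len > d exactly as described, by copying from the cursor):

```python
def lz77_decode(triples):
    """Decode a sequence of (distance, length, next-char) triples."""
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):        # works even when length > d (overlap)
            out.append(out[start + i])
        out.append(c)                  # the explicit next char
    return "".join(out)
```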
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash table to speed up searches on triplets
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
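Both sides of the example can be reproduced with a compact sketch (illustrative Python):

```python
def lz78_encode(s):
    """Emit (id of longest dictionary match, next char) pairs."""
    dic, out, cur = {"": 0}, [], ""
    for c in s:
        if cur + c in dic:
            cur += c                       # extend the current match
        else:
            out.append((dic[cur], c))
            dic[cur + c] = len(dic)        # add Sc to the dictionary
            cur = ""
    if cur:                                # flush a trailing match, if any
        out.append((dic[cur[:-1]], cur[-1]))
    return out

def lz78_decode(pairs):
    """Rebuild the same dictionary from the ids."""
    dic, out = {0: ""}, []
    for i, c in pairs:
        s = dic[i] + c
        out.append(s)
        dic[len(dic)] = s
    return "".join(out)
```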
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
LZW: Encoding Example
Output
Dict.
a a b a a c a b a b a c b
112 256=aa
a a b a a c a b a b a c b
112 257=ab
a a b a a c a b a b a c b
113
258=ba
a a b a a c a b a b a c b
256
259=aac
a a b a a c a b a b a c b
114
260=ca
a a b a a c a b a b a c b
257
261=aba
a a b a a c a b a b a c b
261
262=abac
a a b a a c a b a b a c b
114
263=cb
LZW: Decoding Example
Input   Decoded so far            Dict.
112     a
112     a a                       256=aa
113     a a b                     257=ab
256     a a b a a                 258=ba
114     a a b a a c               259=aac
257     a a b a a c a b           260=ca
261     a a b a a c a b ?         (261 is not in the dictionary yet!)
        a a b a a c a b a b a     261=aba, resolved one step later as ab + a
114     ...                       262=abac
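A sketch of the decoder, including the special case (when a received id is not yet in the dictionary, the string must be prev + prev[0]); the dictionary here is seeded only with the three letters of the example, following the slides' convention a = 112:

```python
def lzw_decode(codes, seed=None):
    """Minimal LZW decoder (illustrative, tiny seed dictionary)."""
    dic = dict(seed or {112: "a", 113: "b", 114: "c"})
    next_id = 256
    prev = dic[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dic:
            cur = dic[code]
        else:                       # special SSc case: id not known yet
            cur = prev + prev[0]
        dic[next_id] = prev + cur[0]   # entry the encoder added one step ago
        next_id += 1
        out.append(cur)
        prev = cur
    return "".join(out)
```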
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform (1994)
We are given a text T = mississippi#. Write down all its cyclic rotations:

mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows:

F              L
# mississipp   i
i #mississip   p
i ppi#missis   s
i ssippi#mis   s
i ssissippi#   m
m ississippi   #
p i#mississi   p
p pi#mississ   i
s ippi#missi   s
s issippi#mi   s
s sippi#miss   i
s sissippi#m   i

L = BWT(T) = ipssm#pissii. A famous example; on much longer texts the effect is far more striking.
A useful tool: L → F mapping
How do we map L’s chars onto F’s chars ?
... we need to distinguish equal chars in F...
Take two equal chars in L and rotate their rows rightward:
they end up in F in the same relative order !!
The BWT is invertible
Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i p p i #
InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
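Both directions in runnable form (a naive sketch: rotation sorting for the forward transform, the LF-mapping of InvertBWT for the inverse; it assumes T ends with a unique smallest end-marker #):

```python
def bwt(T):
    """BWT by sorting all cyclic rotations and taking the last column."""
    rots = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rots)

def ibwt(L):
    """Invert via the LF-mapping: equal chars keep their relative order."""
    n = len(L)
    order = sorted(range(n), key=lambda r: (L[r], r))  # stable: builds F
    LF = [0] * n
    for f_pos, r in enumerate(order):
        LF[r] = f_pos
    T = [""] * n
    r = 0                           # row 0 is the rotation starting with '#'
    for i in range(n - 2, -1, -1):  # T[n-1] = '#' is known; go backward
        T[i] = L[r]                 # L[r] precedes F[r] in T
        r = LF[r]
    T[n - 1] = "#"
    return "".join(T)
```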
How to compute the BWT ?
SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3
(the suffix array of T = mississippi#, i.e. the starting positions of the sorted rotations)
L = i p s s m # p i s s i i
We said that: L[i] precedes F[i] in T
Given SA and T, we have L[i] = T[SA[i]-1]   (e.g. L[3] = T[SA[3]-1] = T[7] = s)
How to construct SA from T ?
SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3
sorted suffixes: #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#
Elegant but inefficient (input: T = mississippi#). Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip:
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii    (# at 16)
Mtf-list = [i,m,p,s]
Mtf  = 020030000030030200300300000100000
Mtf’ = 030040000040040300400400000200000   (Wheeler’s code, e.g. Bin(6)=110: zero-runs are written with the two digits 0/1, so the other values are shifted and the alphabet grows to |S|+1)
RLE0 = 03141041403141410210
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Lifetime of about 10 days
The Bow Tie
Some definitions
Weakly connected component (WCC): set of nodes such that from any node
one can reach any other node via an undirected path.
Strongly connected component (SCC): set of nodes such that from any node
one can reach any other node via a directed path.
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
The largest artifact ever conceived by humans
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph (undirected)
  V = routers, E = communication links
The “cosine” graph (undirected, weighted)
  V = static web pages, E = semantic distance between pages
Query-Log graph (bipartite, weighted)
  V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some
  user who issued q
Social graph (undirected, unweighted)
  V = users, E = (x,y) if x knows y (facebook, address book, email,..)
Definition
Directed graph G = (V,E):
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: Pr that a node has x links is 1/x^a, a ≈ 2.1
The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power law distribution
Pr[ in-degree(u) = k ] ∝ 1/k^a,  a ≈ 2.1
A Picture of the Web Graph
Definition
Directed graph G = (V,E):
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: Pr that a node has x links is 1/x^a, a ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
[slide figure: adjacency-matrix plot (i,j) of a crawl of 21 million pages and 150 million links, with URL-sorting; the Berkeley and Stanford hosts stand out]
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
For negative entries:
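The gap (delta) encoding of a successor list can be sketched as follows (illustrative Python; node 15 and its successors are my own example values, not the slides' figure):

```python
def gap_encode(x, succ):
    """S(x) = {s1-x, s2-s1-1, ..., sk-s(k-1)-1}; succ must be sorted."""
    if not succ:
        return []
    gaps = [succ[0] - x]                # may be negative; locality keeps it small
    for prev, s in zip(succ, succ[1:]):
        gaps.append(s - prev - 1)       # later gaps are >= 0
    return gaps

def gap_decode(x, gaps):
    if not gaps:
        return []
    succ = [x + gaps[0]]
    for g in gaps[1:]:
        succ.append(succ[-1] + g + 1)
    return succ
```

The small (and mostly zero) gaps are then fed to a variable-length integer code; negative first entries need the signed-to-unsigned folding mentioned below.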
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of the copy-list tells whether the corresponding successor of the
reference x is also a successor of the current node;
the reference index is chosen in [0,W] so as to give the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with copy blocks: exploit the consecutivity in the extra-nodes.
Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
data
knowledge
about data
at receiver
sender
receiver
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
What if the sender has never seen the data at the receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested
page and the ones available in cache
File synchronization
[diff, zdelta, REBL,…]
[rsynch, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
The LZ77-scheme provides an efficient, optimal solution:
fknown is the “previously encoded text”: compress the concatenation fknown·fnew, emitting output only for the fnew part
zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
[slide figure: Client ↔ Proxy over the slow link (delta-encoding of reference + request); Proxy ↔ web over the fast link (full page)]
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f ∈ F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
[slide figure: weighted graph over the files plus a dummy node 0 connected to all; edge weights are zdelta sizes, dummy edges the gzip sizes]

          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic
Improvement
What about
many-to-one compression?
(group of files)
Problem: Constructing G is very costly: n^2 edge calculations (zdelta executions)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and are thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files
Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, strictly n^2 time
          space    time
uncompr   260Mb    ---
tgz       12%      2 mins
THIS      8%       16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
[slide figure: Client holding f_old sends a request; Server holding f_new sends back an update]
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
[slide figure: Client sends the block hashes of f_old; Server replies with the encoded file built from f_new]
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
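The 4-byte rolling hash behaves like rsync's Adler-style checksum: both components can be updated in O(1) when the window slides by one byte (a hedged sketch of the idea, not rsync's actual code):

```python
def rolling_hashes(data, B, M=1 << 16):
    """Weak hash (a, b) for every window of length B, O(1) per shift."""
    a = sum(data[:B]) % M                              # plain byte sum
    b = sum((B - i) * x for i, x in enumerate(data[:B])) % M  # weighted sum
    out = [(a, b)]
    for i in range(B, len(data)):
        old, new = data[i - B], data[i]
        a = (a - old + new) % M        # slide the window by one byte
        b = (b - B * old + a) % M
        out.append((a, b))
    return out
```

Candidate matches found with this weak hash are then confirmed with the strong (MD5-style) hash.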
Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452

Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
The server sends the hashes (unlike the client in rsync), and the client checks them
The server deploys the common fref to compress the new ftar (rsync just compresses it).
A multi-round protocol
k blocks of n/k elems, log(n/k) levels
If the distance is k, then on each level at most k hashes find no match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (ie. T[i,N])
Occurrences of P in T = All suffixes of T having P as a prefix
Example: P = si, T = mississippi → occurrences at positions 4 and 7
SUF(T) = Sorted set of suffixes of T
Reduction
From substring search
To prefix search
The Suffix Tree
T# = mississippi#
[slide figure: the suffix tree of T#; edge labels are substrings (i, s, si, ssi, ppi#, pi#, i#, mississippi#, ...), internal nodes carry string depths, and the 12 leaves carry the starting positions 1..12 of the suffixes]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
T = mississippi#

SA    SUF(T)
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi#
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#

SUF(T) written explicitly takes Θ(N^2) space; storing just the suffix pointers:
• SA: Θ(N log2 N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison,
2 accesses per step (the SA entry and the text).
Example: P = si, T = mississippi#. Compare P with the suffix at the middle
SA position: if P is larger, recurse on the right half; if P is smaller,
recurse on the left half.
Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp → overall, O(p log2 N) time
Improvable to O(p + log2 N) [Manber-Myers, ’90], and to bounds depending on |S| [Cole et al, ’06]
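A runnable sketch of the O(p log2 N) search (with a naive SA construction, fine for illustration; names are mine):

```python
def suffix_array(T):
    """1-based starting positions of the lexicographically sorted suffixes."""
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def sa_range(T, SA, P):
    """Binary search the contiguous SA range of suffixes prefixed by P."""
    def pref(i):                        # first |P| chars of suffix T[i,N]
        return T[i - 1:i - 1 + len(P)]
    lo, hi = 0, len(SA)
    while lo < hi:                      # leftmost suffix with prefix >= P
        mid = (lo + hi) // 2
        if pref(SA[mid]) < P:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    lo, hi = left, len(SA)
    while lo < hi:                      # leftmost suffix with prefix > P
        mid = (lo + hi) // 2
        if pref(SA[mid]) <= P:
            lo = mid + 1
        else:
            hi = mid
    return left, lo                     # occurrences are SA[left:lo]
```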
Locating the occurrences
T = mississippi#, P = si. Binary search for the two boundaries of the range of
suffixes prefixed by P (conceptually, for si# and si$, where # < S < $):
the range contains 7 (sippi#) and 4 (sissippi#), hence occ = 2, at positions 4 and 7.
Suffix Array search
• O(p + log2 N + occ) time
Suffix Trays: O(p + log2 |S| + occ) [Cole et al., ‘06]
String B-tree [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays [Ciriani et al., ’02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
T = mississippi#

SA:   12 11  8  5  2  1 10  9  7  4  6  3
Lcp:      0  1  1  4  0  0  1  0  2  1  3

e.g. Lcp = 4 between issippi# (SA = 5) and ississippi# (SA = 2).
• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a subarray Lcp[i,i+C-2] whose entries are all ≥ L
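The Lcp array and the repeated-substring query, in a naive runnable sketch:

```python
def lcp_array(T, SA):
    """Lcp[i] = length of the longest common prefix of the suffixes
    starting at SA[i] and SA[i+1] (1-based positions, naive comparison)."""
    def lcp(i, j):
        a, b, k = T[i - 1:], T[j - 1:], 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        return k
    return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

def has_repeat(T, SA, L):
    """Does there exist a repeated substring of length >= L ?"""
    return any(v >= L for v in lcp_array(T, SA))
```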
Algoritmi per IR
Prologo
What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]
Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NP-complete means.
Mathematics: …
Operating Systems: …
Coding: …
References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.
Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.
A bunch of scientific papers available on the course site !!
About this course
It is a mix of algorithms for
data compression
data indexing
data streaming (and sketching)
data searching
data mining
Massive data !!
Paradigm shift...
Web 2.0 is about the many
Big DATA vs. big PC ?
We have three types of algorithms:
T1(n) = n, T2(n) = n^2, T3(n) = 2^n
... and assume that 1 step = 1 time unit
How many input data n may each algorithm process
within t time units?
n1 = t,
n2 = √t,
n3 = log2 t
What about a k-times faster processor?
...or, what is n when the available time is k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
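A small Python sketch (illustrative, not from the slides) makes the point concrete: a k-times faster processor helps the linear algorithm by the full factor k, but the exponential one only by an additive log2 k.

```python
import math

def max_input(time_units):
    """Largest n each algorithm can process in `time_units` steps:
    T1(n) = n, T2(n) = n^2, T3(n) = 2^n."""
    return (time_units,                    # n1 = t
            int(math.isqrt(time_units)),   # n2 = floor(sqrt(t))
            int(math.log2(time_units)))    # n3 = floor(log2(t))

t, k = 1_000_000, 100
print(max_input(t))        # budget t
print(max_input(k * t))    # a k-times faster processor: n3 grows only by ~log2(k)
```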
A new scenario
Data are more available than ever before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Every step is charged Θ(1) time
Not just MIN #steps…
[Figure: the memory hierarchy — CPU registers → L1/L2 caches → RAM → HD → network]
Cache: few MBs, some nanosecs, few words fetched
RAM: few GBs, tens of nanosecs, some words fetched
HD: few TBs, few millisecs, B = 32KB page fetched
Network: many TBs, even secs, packets fetched
You should be “??-aware programmers”
I/O-conscious Algorithms
[Figure: disk anatomy — tracks, read/write head and arm, magnetic surface]
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O [10^5–10^6 (Hennessy-Patterson)]
If N = (1+f)M, then the avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4KB in time C, and the algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈
30 * f/(1+f)
Space-conscious Algorithms
Compressed data structures: fewer I/Os for search and access
Streaming Algorithms
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
[Figure: the memory hierarchy — registers → L1/L2 caches → RAM → HD → network, with sizes, latencies and transfer units at each level]
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock and its daily performance over time,
find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running times for input size n (1 step = 1 time unit):
n:    4K   8K  16K  32K   128K  256K  512K  1M
n^3:  22s  3m  26m  3.5h  28h   --    --    --
n^2:  0    0   0    1s    26s   106s  7m    28m
An optimal solution
We assume every subsum ≠ 0
[Figure: A = a prefix summing < 0, followed by the Optimum window summing > 0]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
sum = 0; max = -1;
For i = 1,...,n do
  If (sum + A[i] ≤ 0) then sum = 0;
  else { sum += A[i]; max = MAX{max, sum}; }
Note:
• sum < 0 just before OPT starts;
• sum > 0 within OPT
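The pseudo-code above is essentially a one-pass Kadane-style scan; a runnable Python sketch (assuming at least one positive entry, as the slide does):

```python
def max_subarray_sum(A):
    """One pass: reset the running sum when it can no longer help
    a future window (sum + A[i] <= 0), as in the slide."""
    best = float('-inf')
    running = 0
    for x in A:
        if running + x <= 0:
            running = 0
        else:
            running += x
            best = max(best, running)
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray_sum(A))  # the window 6 1 -2 4 3 sums to 12
```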
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort: Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions ⇒ data get distributed arbitrarily !!!
[Figure: B-tree internal nodes over leaves of “tuple pointers”; what about listing the tuples in order?]
Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;          // Divide
03   Merge-Sort(A,i,m);    // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)        // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10^9 tuples ⇒ few GBs
Typical disk (Seagate Cheetah 150GB): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log2 n) random I/Os
[5ms] * n log2 n ≈ 1.5 years
In practice, it is faster because of caching (2 R/W passes over the data)
Merge-Sort Recursion Tree
(log2 N levels)
If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: merge-sort recursion tree over a sample array; runs of size up to M fit in memory]
How do we deploy the disk/mem features?
N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
Pass 1: Produce N/M sorted runs.
Pass i: merge X = M/B runs ⇒ log_{M/B} (N/M) merging passes
Multiway Merging
[Figure: X = M/B input buffers Bf1,…,BfX in main memory (one per run, B items each, with pointers p1,…,pX) and one output buffer Bfo; repeatedly emit min(Bf1[p1], Bf2[p2], …, BfX[pX]) into Bfo; fetch the next page of run i when pi = B; flush Bfo to the merged run on disk when full, until EOF]
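The merging step can be sketched in Python with the standard library's heap-based heapq.merge (an in-memory stand-in for the disk buffers; the run contents are illustrative):

```python
import heapq

def multiway_merge(runs):
    """Merge X sorted runs into one sorted output.
    heapq.merge keeps one 'current item' per run and repeatedly
    emits the minimum -- the in-memory analogue of X input
    buffers feeding one output buffer."""
    return list(heapq.merge(*runs))

runs = [[1, 5, 7, 9, 13, 19], [2, 8, 10], [3, 4, 6, 11, 12, 15, 17]]
print(multiway_merge(runs))
```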
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs = log_{M/B} (N/M)
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os
In practice
M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
One multiway merge ⇒ 2 passes = few mins
Tuning depends
on disk features
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
Can compression help?
Goal: enlarge M and reduce N
#passes = O(log_{M/B} (N/M))
Cost of a pass = O(N/B)
Part of Vitter’s paper…
In order to address issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how far we can go
Toy problem #3: Top-freq elements
Goal: Top queries over a stream of N items (|S| large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space (i.e., assuming the mode occurs > N/2 times).
A = b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables <X,C>
For each item s of the stream:
  if (X == s) then C++
  else { C--; if (C == 0) { X = s; C = 1; } }
Return X;
Proof
If X ≠ y at the end, then every one of y’s
occurrences has a “negative” mate.
Hence these mates are ≥ #occ(y), so N ≥ 2 * #occ(y) —
but #occ(y) > N/2, a contradiction.
(The answer can be garbage if the mode occurs ≤ N/2 times.)
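The algorithm above is the classical Boyer–Moore majority vote; a Python sketch:

```python
def majority_candidate(stream):
    """One pass, two variables: returns the item occurring > N/2
    times, if such an item exists (otherwise the answer is garbage)."""
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

A = list("bacccdcbaaaccbccc")       # the slide's stream: c occurs 9 of 17 times
print(majority_candidate(A))        # 'c'
```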
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 * 10^9 chars ⇒ size = 6GB
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms
What kind of data structure do we build to support
word-based searches ?
Solution 1: Term-Doc matrix
t = 500K terms, n = 1 million docs; entry is 1 if the play contains the word, 0 otherwise:

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0           0       0        1
Brutus             1                1             0           1       0        0
Caesar             1                1             0           1       1        1
Calpurnia          0                1             0           0       0        0
Cleopatra          1                0             0           0       0        0
mercy              1                0             1           1       1        1
worser             1                0             1           1       1        0

Space is 500GB !
Solution 2: Inverted index
Brutus    → 2, 4, 8, 16, 32, 64, 128
Calpurnia → 1, 2, 3, 5, 8, 13, 21, 34
Caesar    → 13, 16
1. Typically use about 12 bytes per posting
2. We have 10^9 total terms ⇒ at least 12GB space
3. Compressing the 6GB of documents gets 1.5GB of data
Better index, but it is still >10 times the text !!!!
We can still do better: i.e. 30–50% of the original text
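A toy inverted index in Python (documents and terms are illustrative; postings are doc-ids):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc-ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs, start=1):
        for term in text.split():
            index[term].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

docs = ["brutus killed caesar", "caesar married calpurnia", "brutus hates calpurnia"]
idx = build_inverted_index(docs)
print(idx["brutus"])                                       # [1, 3]
# an AND query is an intersection of postings lists
print(sorted(set(idx["brutus"]) & set(idx["calpurnia"])))  # [3]
```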
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string
is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL of them into fewer bits?
NO: they are 2^n, but there are fewer shorter messages:
Σ_{i=1,...,n-1} 2^i = 2^n - 2
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log2 (1/p(s)) = -log2 p(s)
Lower probability ⇒ higher information
Entropy is the weighted average of i(s):
H(S) = Σ_{s∈S} p(s) log2 (1/p(s))  bits
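The definition translates directly into code; a small Python sketch:

```python
import math

def entropy(probs):
    """H(S) = sum over s of p(s) * log2(1/p(s)), in bits per symbol."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# A fair coin carries 1 bit per toss; a uniform 4-symbol source carries 2.
print(entropy([0.5, 0.5]))   # 1.0
print(entropy([0.25] * 4))   # 2.0
print(entropy([0.9, 0.1]))   # less than 1 bit: the source is predictable
```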
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be
uniquely decomposed into its codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: binary trie — a = 0, b = 100, c = 101, d = 11]
Average Length
For a code C with codeword lengths L[s], the
average length is defined as
La(C) = Σ_{s∈S} p(s) L[s]
We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same codeword lengths and thus the
same optimal average length. And vice versa…
Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}
then pi < pj ⇒ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have
H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that
La(C) ≤ H(S) + 1
(the Shannon code takes ⌈log2 (1/p)⌉ bits per symbol)
Huffman Codes
Invented by Huffman as a class assignment in the ’50s.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
[Figure: Huffman tree — merge a(.1) + b(.2) → (.3); then (.3) + c(.2) → (.5); then (.5) + d(.5) → (1); edges labeled 0/1]
a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees
What about ties (and thus, tree depth) ?
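A compact Python construction of a Huffman code using a heap (tie-breaking, and hence the exact codewords, may differ from the figure, but the codeword lengths match):

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a Huffman code for {symbol: probability}.
    Returns {symbol: bitstring}. The counter breaks ties so
    heap tuples never compare the dict payloads."""
    tiebreak = count()
    heap = [(p, next(tiebreak), {s: ""}) for s, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # two least probable trees...
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))  # ...merged
    return heap[0][2]

code = huffman_code({"a": .1, "b": .2, "c": .2, "d": .5})
print({s: len(w) for s, w in sorted(code.items())})  # lengths 3, 3, 2, 1
```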
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Example: abc... → 000 001 01... = 00000101...
Decoding 101001... → d, c, b, ...
[Figure: the running-example Huffman tree]
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store for any level L:
= 00.....0
firstcode[L]
Symbol[L,i], for each i in level L
This is ≤ h^2 + |S| log |S| bits
Canonical Huffman: Encoding and Decoding
[Figure: canonical Huffman tree with levels 1..5]
firstcode[1] = 2
firstcode[2] = 1
firstcode[3] = 1
firstcode[4] = 2
firstcode[5] = 0
T = ...00010...
Problem with Huffman Coding
Consider a symbol with probability .999. Its
self information is
-log2(.999) ≈ .00144 bits
If we were to send 1000 such symbols we
might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
1 extra bit per macro-symbol = 1/k extra bits per symbol
But a larger model has to be transmitted
Shannon took infinite sequences, and k → ∞ !!
In practice, we have:
The model takes |S|^k * (k * log |S|) + h^2 bits
It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L
(where h might be |S|)
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
[Figure: word-based Huffman with tagging — the tree has fan-out 128, each codeword byte carries 7 bits of the code plus a tag bit; byte-aligned codewords shown for T = “bzip or not bzip”, e.g. C(bzip) = 1a 0b]
CGrep and other ideas...
P = bzip, encoded as 1a 0b
[Figure: GREP for the codeword of P directly over C(T), T = “bzip or not bzip” — byte-aligned candidates are accepted (yes) only where the tag bit marks a codeword start, rejected (no) otherwise]
Speed ≈ Compression ratio
You find it under my Software projects
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary: a, bzip, not, or, space
P = bzip = 1a 0b
[Figure: scanning C(S), S = “bzip or not bzip”, for the codeword of P — candidate matches accepted (yes) or rejected (no) via the tag bits]
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: pattern P slid along text T]
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running
time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}^n).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m
H(s) = Σ_{i=1,...,m} 2^(m-i) * s[i]
P = 0101
H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5
s=s’ if and only if H(s)=H(s’)
Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1):
H(T_r) = 2 * H(T_{r-1}) - 2^m * T[r-1] + T[r+m-1]
T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2*11 - 2^4*1 + 0 = 22 - 16 = 6
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bit-long numbers. In general, they are too
BIG to fit in a machine word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P = 101111, q = 7
H(P) = 47; Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally (Horner’s rule):
1*2 + 0 (mod 7) = 2
2*2 + 1 (mod 7) = 5
5*2 + 1 (mod 7) = 4
4*2 + 1 (mod 7) = 2
2*2 + 1 (mod 7) = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr-1):
2^m (mod q) = 2*(2^(m-1) (mod q)) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
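A Python sketch of the fingerprint scan (every candidate is verified, so reported matches are definite; here q is a fixed prime rather than randomly drawn, for illustration):

```python
def karp_rabin(T, P, q=2_147_483_647):
    """All occurrences of binary pattern P in binary text T,
    via rolling fingerprints modulo the prime q."""
    n, m = len(T), len(P)
    if m > n:
        return []
    base_pow = pow(2, m - 1, q)             # 2^(m-1) mod q, to drop the old bit
    hp = ht = 0
    for i in range(m):                       # fingerprints of P and T[0..m-1]
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    hits = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:     # verify: no false matches reported
            hits.append(r + 1)               # 1-based positions, as in the slides
        if r + m < n:                        # slide the window by one
            ht = (2 * (ht - int(T[r]) * base_pow) + int(T[r + m])) % q
    return hits

print(karp_rabin("10110101", "0101"))  # [5]
```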
Problem 1: Solution
Dictionary: a, bzip, not, or, space
P = bzip = 1a 0b
[Figure: Karp-Rabin-style scan of C(S), S = “bzip or not bzip” — every candidate is verified, so each reported match is definite (yes)]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for
[Figure: the m×n matrix M for T = california, P = for — the only 1 in row 3 is M(3,7), since P = for ends at position 7 of T]
How does M solve the exact match problem?
How to construct M
We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one
Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:
And(A,B) is the bit-wise AND between A and B.
BitShift(A) is the value derived by shifting A’s bits down by
one and setting the first bit to 1.
e.g. BitShift((0,1,1,0,1)) = (1,0,1,1,0)
Let w be the word size (e.g., 32 or 64 bits). We’ll assume
m ≤ w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1 for the
positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0), U(b) = (0,1,0,0,0), U(c) = (0,0,0,0,1)
How to construct M
Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained by
M(j) = BitShift(M(j-1)) & U(T[j])
For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T
ending at character j-1 ⇔ M(i-1,j-1) = 1
(2) P[i] = T[j] ⇔ the i-th bit of U(T[j]) = 1
BitShift moves bit M(i-1,j-1) into the i-th position;
AND this with the i-th bit of U(T[j]) to establish whether both hold
An example: T = xabxabaaca, P = abaac
j=1: M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & U(x) = (0,0,0,0,0)
j=2: M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
j=3: M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)
...
j=9: M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)
Bit m of M(9) is set: an occurrence of P ends at position 9 of T
Shift-And method: Complexity
If m ≤ w, any column and any vector U() fit in a
memory word:
each step requires O(1) time.
If m > w, any column and vector U() span
O(m/w) memory words:
each step requires O(m/w) time.
Overall O(n(1 + m/w) + m) time.
Thus, it is very fast when pattern length is close
to the word size.
Very often in practice. Recall that w=64 bits in
modern architectures.
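The Shift-And scan fits in a few lines of Python, using unbounded integers as the bit vectors (bit i stands for row i+1 of M):

```python
def shift_and(T, P):
    """Return the 1-based end positions of occurrences of P in T.
    U[c] has bit i set iff P[i] == c; the column update is
    M = ((M << 1) | 1) & U[T[j]] -- the '| 1' is BitShift's
    'set the first bit to 1'."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, hits = 0, []
    for j, c in enumerate(T, start=1):
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & (1 << (m - 1)):        # row m is set: a full match ends at j
            hits.append(j)
    return hits

print(shift_and("xabxabaaca", "abaac"))  # [9]
print(shift_and("california", "for"))    # [7]
```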
Some simple extensions
We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0), U(b) = (1,1,0,0,0), U(c) = (0,0,0,0,1)
What about ‘?’ and ‘[^…]’ (negation)?
Problem 1: Another solution
Dictionary: a, bzip, not, or, space
P = bzip = 1a 0b
[Figure: Shift-And-style scan of C(S), S = “bzip or not bzip” — candidate matches accepted (yes) or rejected (no) via the tag bits]
Speed ≈ Compression ratio
Problem 2
Dictionary: a, bzip, not, or, space
Given a pattern P, find all the occurrences in S
of all terms containing P as a substring
P = o
Matching terms: not = 1g 0g 0a, or = 1g 0a 0b
[Figure: both codewords searched in C(S), S = “bzip or not bzip”]
Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: patterns P1 and P2 slid along text T]
Naïve solution
Use an (optimal) Exact Matching Algorithm searching each
pattern in P
Complexity: O(nl+m) time, not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns in P
R is a bitmap of length m, with R[i] = 1 iff S[i] is the
first symbol of a pattern
Use a variant of the Shift-And method searching for S:
For any symbol c, U’(c) = U(c) AND R, i.e.
U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a
pattern
For any step j:
compute M(j), then set M(j) = M(j) OR U’(T[j]). Why?
This sets to 1 the first bit of each pattern that starts with
T[j]
Check if there are occurrences ending in j. How?
Problem 3
Dictionary: a, bzip, not, or, space
Given a pattern P, find all the occurrences in S
of all terms containing P as a substring, allowing
at most k mismatches
P = bot, k = 2
[Figure: scan of C(S), S = “bzip or not bzip”]
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa
aatatccacaa
atcgaa
Agrep
Our current goal: given k, find all the occurrences of P
in T with up to k mismatches
We define the matrix M^l to be an m by n binary
matrix, such that:
M^l(i,j) = 1 iff
the first i characters of P match the i characters of T
ending at character j with no more than l mismatches.
What is M0?
How does Mk solve the k-mismatch problem?
Computing Mk
We compute M^l for all l = 0, …, k.
For each j compute M^0(j), M^1(j), …, M^k(j)
For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a
match iff
Computing Ml: case 1
The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.
[Figure: P[1,i-1] aligned against T ending at j-1 with ≤ l mismatches, followed by the equal pair P[i] = T[j]]
Contribution: BitShift(M^l(j-1)) & U(T[j])
Computing Ml: case 2
The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.
[Figure: P[1,i-1] aligned against T ending at j-1 with ≤ l-1 mismatches, followed by a mismatching pair]
Contribution: BitShift(M^(l-1)(j-1))
Computing Ml
We compute M^l for all l = 0, …, k.
For each j compute M^0(j), M^1(j), …, M^k(j)
For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a
match iff
M^l(j) = [BitShift(M^l(j-1)) & U(T(j))] OR BitShift(M^(l-1)(j-1))
Example M1
T = xabxabaaca, P = abaad (columns j = 1,...,10; rows i = 1,...,5)
M1 =
1 1 1 1 1 1 1 1 1 1
0 0 1 0 0 1 0 1 1 0
0 0 0 1 0 0 1 0 0 1
0 0 0 0 1 0 0 1 0 0
0 0 0 0 0 0 0 0 1 0
M0 =
0 1 0 0 1 0 1 1 0 1
0 0 1 0 0 1 0 0 0 0
0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0
How much do we pay?
The running time is O(kn(1 + m/w))
Again, the method is practically efficient for
small m.
Only O(k) columns of the M^l are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.
Problem 3: Solution
Dictionary: a, bzip, not, or, space
Given a pattern P, find all the occurrences in S
of all terms containing P as a substring, allowing
k mismatches
P = bot, k = 2
Matching term: not = 1g 0g 0a
[Figure: Shift-And-with-errors scan of C(S), S = “bzip or not bzip” — candidates accepted (yes) via the tag bits]
Agrep: more sophisticated operations
The Shift-And method can solve other ops
The edit distance between two strings p and s is
d(p,s) = the minimum number of operations
needed to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p into a different one
Example: d(ananas, banane) = 3
Search by regular expressions
Example: (a|b)?(abc|a)
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
g-code for integer encoding
γ(x) = 0^(L-1) followed by x in binary, where x > 0 and L = ⌊log2 x⌋ + 1 is its length
e.g., 9 is represented as <000, 1001>
The γ-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of g-coded
integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
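A γ encoder/decoder in Python, which reproduces the exercise above:

```python
def gamma_encode(x):
    """gamma(x): (length-1) zeros, then x in binary; requires x > 0."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes."""
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":            # count the unary length prefix
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

s = "".join(gamma_encode(x) for x in [8, 6, 3, 59, 7])
print(s)                     # 0001000001100110000011101100111
print(gamma_decode_all(s))   # [8, 6, 3, 59, 7]
```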
Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code γ(i).
Recall that: |γ(i)| ≤ 2 * log2 i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥ Σ_{i=1,...,x} pi ≥ x * px ⇒ x ≤ 1/px
How good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2 * log2 i + 1
The cost of the encoding is (recall i ≤ 1/pi):
Σ_{i=1,...,|S|} pi * |γ(i)| ≤ Σ_{i=1,...,|S|} pi * [2 * log2 (1/pi) + 1] = 2 * H0(X) + 1
Not much worse than Huffman,
and improvable to H0(X) + 2 + ....
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman
End-tagged dense code
The rank r is mapped to r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix-code
Better compression: it uses all 7-bits configurations
(s,c)-dense codes
Distribution of words is skewed: 1/i^θ, where 1 < θ < 2
A new concept: Continuers vs Stoppers
The main idea is:
Previously we used s = c = 128
Now s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte,
s*c with 2 bytes, s*c^2 with 3 bytes, ...
An example
5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words with at most 2 bytes
The (230,26)-dense code encodes 230 + 230*26 = 6210 words with at most 2
bytes, hence more words on 1 byte; thus it wins if the distribution is skewed...
Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256 - s.
Brute-force approach, or
binary search:
on real distributions, there seems to be a unique minimum
Ks = max codeword length
Fs,k = cumulative probability of the symbols whose |cw| ≤ k
Experiments: the (s,c)-DC is quite interesting…
Search is 6% faster than
byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
Exploit temporal locality, and it is dynamic
X = 1^n 2^n 3^n … n^n ⇒ Huff = O(n^2 log n), MTF = O(n log n) + n^2
Not much worse than Huffman
...but it may be far better
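A minimal MTF coder in Python (the initial list L is the alphabet in some agreed order):

```python
def mtf_encode(text, alphabet):
    """Output the position (1-based) of each symbol in the list L,
    then move the symbol to the front of L -- the 'memory'."""
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)
        out.append(i + 1)
        L.insert(0, L.pop(i))    # move-to-front
    return out

print(mtf_encode("aaabbbba", "abc"))  # [1, 1, 1, 2, 1, 1, 1, 2]
```

Runs of equal symbols become runs of 1s, which is why MTF exploits temporal locality and pairs well with RLE.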
MTF: how good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2 * log2 i + 1
Put S in the front and consider the cost of encoding:
O(|S| log |S|) + Σ_{x=1,...,|S|} Σ_i |γ(p_i^x - p_{i-1}^x)|
By Jensen’s inequality, this is:
≤ O(|S| log |S|) + Σ_{x=1,...,|S|} n_x * [2 * log2 (N/n_x) + 1]
= O(|S| log |S|) + N * [2 * H0(X) + 1]
Hence La[mtf] ≤ 2 * H0(X) + O(1)
MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to maintain the MTF-list efficiently:
Search tree: leaves contain the symbols, ordered as in the MTF-list;
nodes contain the size of their descending subtree
Hash table: key is a symbol,
data is a pointer to the corresponding tree leaf
Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings just numbers and one bit
Properties:
Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n (there is a memory)
Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)
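A minimal RLE in Python over the slide's example:

```python
from itertools import groupby

def rle_encode(s):
    """abbbaacccca -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]"""
    return [(c, len(list(g))) for c, g in groupby(s)]

def rle_decode(pairs):
    return "".join(c * k for c, k in pairs)

print(rle_encode("abbbaacccca"))
```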
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).
e.g. p(a) = .2, p(b) = .5, p(c) = .3
f(i) = Σ_{j=1,...,i-1} p(j)
f(a) = .0, f(b) = .2, f(c) = .7
[Figure: [0,1) partitioned as a = [0,.2), b = [.2,.7), c = [.7,1)]
The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: interval refinement — start from [0,1); b narrows it to [.2,.7); a narrows it to [.2,.3); c narrows it to [.27,.3)]
The final sequence interval is [.27,.3)
Arithmetic Coding
To code a sequence of symbols c with probabilities
p[c] use the following:
l_0 = 0, s_0 = 1
l_i = l_{i-1} + s_{i-1} * f[c_i]
s_i = s_{i-1} * p[c_i]
f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is
s_n = Π_{i=1,...,n} p[c_i]
The interval for a message sequence will be called the
sequence interval
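The recurrence above, run on the bac example, in Python:

```python
def sequence_interval(msg, p, f):
    """l_i = l_{i-1} + s_{i-1} * f[c_i];  s_i = s_{i-1} * p[c_i]."""
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, l + s          # the sequence interval [l, l+s)

p = {"a": .2, "b": .5, "c": .3}
f = {"a": .0, "b": .2, "c": .7}
lo, hi = sequence_interval("bac", p, f)
print(round(lo, 3), round(hi, 3))   # 0.27 0.3
```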
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
.49 ∈ [.2,.7) → b; then .49 ∈ [.3,.55) → b; then .49 ∈ [.475,.55) → c
The message is bbc.
Representing a real number
Binary fractional representation:
.75 = .11
1/3 = .010101...
11/16 = .1011
Algorithm:
1. x = 2*x
2. If x < 1, output 0
3. else x = x - 1; output 1
So how about just using the shortest binary
fractional representation in the sequence
interval?
e.g. [0,.33) → .01    [.33,.66) → .1    [.66,1) → .11
Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
code   min completion   max completion   interval
.11    .110...          .111...          [.75, 1.0)
.101   .1010...         .1011...         [.625, .75)
We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).
Sequence interval: [.61, .79)
Code interval of .101: [.625, .75) ⊆ [.61, .79)
Can use L + s/2, truncated to 1 + ⌈log2 (1/s)⌉ bits
Bound on Arithmetic length
Note that -log2 s + 1 = log2 (2/s)
Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log2 (1/s) = 1 + log2 Π_i (1/p_i)
≤ 2 + Σ_{i=1,...,n} log2 (1/p_i)
= 2 + Σ_{k=1,...,|S|} n*p_k * log2 (1/p_k)
= 2 + n H0 bits
In practice ≈ nH0 + 0.02 n bits,
because of rounding
Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:
Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
Output 1 followed by m 0s; set m = 0; expand the interval by a factor 2
If u < R/2 (bottom half):
Output 0 followed by m 1s; set m = 0; expand the interval by a factor 2
If l ≥ R/4 and u < 3R/4 (middle half):
Increment m; expand the interval by a factor 2
In all other cases, just continue...
You find this at
Arithmetic ToolBox
As a state machine:
ATB maps (L,s) and a symbol c with distribution (p1,....,p|S|) to (L’,s’),
where L’ = L + s*f(c) and s’ = s*p(c)
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
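The conditional counts that PPM maintains (for every context length up to k, including the slide's p(e|th) = 7/12 style estimates) can be sketched as follows; the function name and the plain count table are ours, and real PPM variants add escape bookkeeping on top of this:

```python
from collections import defaultdict

def ppm_counts(text, k):
    """counts[ctx][ch] = number of times ch followed context ctx,
    for every context of length 0..k occurring in `text`."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, ch in enumerate(text):
        for j in range(k + 1):          # contexts of length 0, 1, ..., k
            if i - j >= 0:
                counts[text[i - j:i]][ch] += 1
    return counts

c = ppm_counts("ACCBACCACBA", 2)        # the slides' example string
assert c[""]["C"] == 5                  # empty context: C seen 5 times
assert c["A"]["C"] == 3                 # p(C|A) numerator, as on the slide
assert c["CB"]["A"] == 2                # order-2 context CB -> A twice
```

From these counts, an encoder emits an escape and drops to the next shorter context whenever the current context has never been followed by the symbol at hand.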
PPM + Arithmetic ToolBox
[Figure: the ATB driven by PPM. The symbol s (either the char c or esc)
is coded with probability p[s|context], moving the state from (L,s) to
(L',s').]
Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)
PPM: Example Contexts
String = ACCBACCACBA B,  k = 2
Empty context (order 0), counts: A = 4, B = 2, C = 5, $ = 3
Order-1 contexts and counts:
  A:  C = 3, $ = 1
  B:  A = 2, $ = 1
  C:  A = 1, B = 2, C = 2, $ = 3
Order-2 contexts and counts:
  AC: B = 1, C = 2, $ = 2
  BA: C = 1, $ = 1
  CA: C = 1, $ = 1
  CB: A = 2, $ = 1
  CC: A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
Dictionary = all substrings starting before the Cursor
Example output: <2,3,c>
Algorithm’s step:
Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match
Advance by len + 1
A buffer “window” has fixed length and moves
Example: LZ77 with window
Window size = 6. At each step, emit (d, len, next char) for the longest
match within W:
a a c a a c a b c a b a a a c
step 1: (0,0,a)
step 2: (1,1,c)
step 3: (3,4,b)
step 4: (3,3,a)
step 5: (1,2,c)
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if l > d? (overlap with text to be compressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
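The copy loop above works verbatim even when the copy overlaps the part being produced, because output positions are filled left to right. A self-contained sketch of the decoder (function name is ours):

```python
def lz77_decode(triples):
    """Decode LZ77 triples (d, length, char). Copying proceeds left to
    right, so the self-overlapping case d < length works automatically."""
    out = []
    for d, length, ch in triples:
        start = len(out) - d
        for i in range(length):
            out.append(out[start + i])
        out.append(ch)
    return ''.join(out)

# the slide's overlap example: seen = abcd, next codeword is (2, 9, e)
assert lz77_decode([(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (0, 0, 'd'),
                    (2, 9, 'e')]) == 'abcdcdcdcdcdce'
# the windowed example from the previous slide
assert lz77_decode([(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'),
                    (3, 3, 'a'), (1, 2, 'c')]) == 'aacaacabcabaaac'
```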
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
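Both the one-step lag of the decoder and the SSc special case show up naturally in code; a round-trip sketch under standard assumptions (256 single-char entries, ids grow from 256):

```python
def lzw_encode(text):
    """LZW: output dictionary ids only; S+c is added but c is not sent."""
    d = {chr(i): i for i in range(256)}
    out, s = [], ''
    for ch in text:
        if s + ch in d:
            s += ch
        else:
            out.append(d[s])
            d[s + ch] = len(d)
            s = ch
    if s:
        out.append(d[s])
    return out

def lzw_decode(codes):
    """The decoder lags one step behind the coder; a code equal to the
    next free id is the SSc case, resolved as prev + prev[0]."""
    d = {i: chr(i) for i in range(256)}
    prev = d[codes[0]]
    out = [prev]
    for code in codes[1:]:
        cur = d[code] if code in d else prev + prev[0]   # special case
        out.append(cur)
        d[len(d)] = prev + cur[0]
        prev = cur
    return ''.join(out)

s = 'aabaacababacb'          # triggers the special case (code 261 = aba)
assert lzw_decode(lzw_encode(s)) == s
```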
LZW: Encoding Example
Output
Dict.
a a b a a c a b a b a c b
112 256=aa
a a b a a c a b a b a c b
112 257=ab
a a b a a c a b a b a c b
113
258=ba
a a b a a c a b a b a c b
256
259=aac
a a b a a c a b a b a c b
114
260=ca
a a b a a c a b a b a c b
257
261=aba
a a b a a c a b a b a c b
261
262=abac
a a b a a c a b a b a c b
114
263=cb
LZW: Decoding Example
Input
112
Dict
a
112
a a
256=aa
113
a a b
257=ab
256
a a b a a
258=ba
114
a a b a a c
259=aac
257
a a b a a c a b ?
260=ca
261
261
114
a a b a a c a b a b
261=aba
one
step
later
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows   (Burrows-Wheeler, 1994)
#mississippi
i#mississipp
ippi#mississ
issippi#miss
ississippi#m
mississippi#
pi#mississip
ppi#mississi
sippi#missis
sissippi#mis
ssippi#missi
ssissippi#mi
F = first column = # i i i i m p p s s s s
L = last column  = i p s s m # p i s s i i
A famous example, much longer...
A useful tool: L → F mapping
[Same sorted-rotation matrix as above: F = # i i i i m p p s s s s,
L = i p s s m # p i s s i i; the matrix itself is unknown to the decoder.]
How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...
Take two equal L’s chars
Rotate rightward their rows
Same relative order !!
The BWT is invertible
[Same matrix again: F = # i i i i m p p s s s s,
L = i p s s m # p i s s i i; the rotation matrix is unknown.]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i ppi #
InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
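The forward transform (sort the rotations, take the last column) and the backward walk via the LF-mapping fit in a few lines; a sketch assuming, as on the slides, a unique smallest sentinel '#' at the end of T:

```python
def bwt(t):
    """BWT via sorted cyclic rotations: last column of the sorted matrix."""
    rots = sorted(t[i:] + t[:i] for i in range(len(t)))
    return ''.join(r[-1] for r in rots)

def ibwt(l):
    """Invert via the LF-mapping: equal chars of L keep their relative
    order in F, so a stable sort of L's positions IS the LF-array."""
    n = len(l)
    order = sorted(range(n), key=lambda i: (l[i], i))
    lf = [0] * n
    for rank, i in enumerate(order):
        lf[i] = rank
    r, out = 0, []
    for _ in range(n):
        out.append(l[r])          # chars are recovered right-to-left
        r = lf[r]
    t = ''.join(reversed(out))
    k = t.index('#') + 1          # rotate so the sentinel ends the text
    return t[k:] + t[:k]

assert bwt('mississippi#') == 'ipssm#pissii'
assert ibwt('ipssm#pissii') == 'mississippi#'
```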
How to compute the BWT ?
SA    BWT matrix      L
12    #mississipp     i
11    i#mississip     p
 8    ippi#missis     s
 5    issippi#mis     s
 2    ississippi#     m
 1    mississippi     #
10    pi#mississi     p
 9    ppi#mississ     i
 7    sippi#missi     s
 4    sissippi#mi     s
 6    ssippi#miss     i
 3    ssissippi#m     i
We said that: L[i] precedes F[i] in T
L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
How to construct SA from T ?
SA    suffix
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#
Elegant but inefficient
Input: T = mississippi#
Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii
# at 16
Mtf = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code
RLE0 = 03141041403141410210
Alphabet
|S|+1
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
You find this in your Linux distribution
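The Move-to-Front step that makes L compressible is tiny; a sketch (function name is ours) whose last assertion reproduces the start of the slide's Mtf string, since L = ippp ss... with initial list (i,m,p,s) begins 0 2 0 0 3 0:

```python
def mtf_encode(s, alphabet):
    """Move-to-Front: emit the current position of each char, then move
    that char to the front of the list."""
    lst = list(alphabet)
    out = []
    for ch in s:
        i = lst.index(ch)
        out.append(i)
        lst.insert(0, lst.pop(i))
    return out

# runs of equal chars become runs of zeros -> great for RLE + stat coder
assert mtf_encode('cabba', ['a', 'b', 'c']) == [2, 1, 2, 0, 1]
assert mtf_encode('ipppss', list('imps')) == [0, 2, 0, 0, 3, 0]
```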
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google, 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node can go to any node via an
undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node can go to any node via a
directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
Largest artifact ever conceived by humankind
Exploit its structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph
V = routers
E = communication links
The “cosine” graph (undirected, weighted)
V = static web pages
E = semantic distance between pages
Query-Log graph (bipartite, weighted)
V = queries and URLs
E = (q,u) if u is a result for q, and has been clicked by some
user who issued q
Social graph (undirected, unweighted)
V = users
E = (x,y) if x knows y (facebook, address book, email,..)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has an hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is ∝ 1/x^α, α ≈ 2.1
The In-degree distribution
Altavista crawl, 1999
Indegree follows power law distribution
WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1/k^α,  with α ≈ 2.1
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has an hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is ∝ 1/x^α, α ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
[Figure: adjacency-matrix picture of a crawl, nodes i, j.]
21 million pages, 150 million links
URL-sorting
Berkeley
Stanford
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
For negative entries: map v ≥ 0 to 2v and v < 0 to 2|v|−1, so every stored gap is non-negative.
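The gap transformation of the successor list can be sketched directly (function name and the sample list are ours; the signed-to-unsigned fold is the standard trick, only the first entry can be negative):

```python
def encode_gaps(x, succ):
    """Gap-encode the sorted successor list of node x:
    first entry s1 - x (possibly negative), then s_i - s_{i-1} - 1."""
    def nat(v):                          # fold signed -> unsigned
        return 2 * v if v >= 0 else 2 * abs(v) - 1
    gaps = [nat(succ[0] - x)]
    gaps += [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]
    return gaps

# node 15 pointing to 13, 15, 16, 17, 20: first gap is -2 -> coded as 3
assert encode_gaps(15, [13, 15, 16, 17, 20]) == [3, 1, 0, 0, 2]
```

Locality makes most of these gaps tiny, which is exactly what the subsequent universal/statistical coder exploits.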
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.
Consecutivity
3
in extra-nodes
Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
data
knowledge
about data
at receiver
sender
receiver
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
How if the sender has never seen data at receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested
page and the ones available in cache
File synchronization
[diff, zdelta, REBL,…]
[rsynch, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
LZ77-scheme provides an efficient, optimal solution
fknown is the “previously encoded” text: compress the concatenation fknown·fnew starting from fnew
zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
Client
reference
request
Slow-link
Delta-encoding
Proxy
reference
request
Fast-link
web
Page
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
[Figure: weighted graph over files 1, 2, 3, 5 plus a dummy node 0; edge
weights are zdelta sizes (e.g. 20, 123, 220, 620, 2000), dummy edges
carry the gzip sizes; the min branching picks the cheapest reference
for each file.]
          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic
Improvement
What about
many-to-one compression?
(group of files)
Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files
Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
          space    time
uncompr   260Mb    ---
tgz       12%      2 mins
THIS      8%       16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
request
f_new
update
Server
f_old
Client
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
hashes
f_new
Server
encoded file
f_old
Client
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
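The key enabler of rsync is that the block hash can be slid one byte at a time in O(1). rsync's actual function is Adler-32-like; the polynomial rolling hash below is a simplified stand-in of ours that has the same O(1)-slide property:

```python
B, MOD = 256, (1 << 31) - 1          # base and modulus, illustrative

def hash_block(block):
    """Polynomial hash of a byte block."""
    h = 0
    for b in block:
        h = (h * B + b) % MOD
    return h

def roll(h, out_byte, in_byte, k):
    """Slide a window of size k one byte to the right in O(1):
    remove the leading byte's contribution, shift, add the new byte."""
    h = (h - out_byte * pow(B, k - 1, MOD)) % MOD
    return (h * B + in_byte) % MOD

data, k = b'the quick brown fox', 4
h = hash_block(data[0:k])
for i in range(1, len(data) - k + 1):
    h = roll(h, data[i - 1], data[i + k - 1], k)
    assert h == hash_block(data[i:i + k])    # O(1) update == recompute
```

This is why the client can test every alignment of f_old's blocks against f_new without rehashing from scratch; the stronger (MD5-style) hash is only computed on rolling-hash matches.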
Rsync: some experiments
         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452
Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).
A multi-round protocol
k blocks of n/k elems
log n/k levels
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (ie. T[i,N])
[Figure: P aligned at position i of T, matching a prefix of the suffix T[i,N].]
Occurrences of P in T = All suffixes of T having P as a prefix
P = si
T = mississippi
mississippi
4,7
SUF(T) = Sorted set of suffixes of T
Reduction
From substring search
To prefix search
The Suffix Tree
[Figure: suffix tree of T# = mississippi# (suffix positions 1..12).
Edges are labeled with substrings (i, si, ssi, ppi#, pi#, i#, #,
mississippi#, ...); each leaf stores the starting position of its
suffix.]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5
Θ(N²) space, if the suffixes are stored explicitly
SA
SUF(T)
12
11
8
5
2
1
10
9
7
4
6
3
#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#
T = mississippi#
suffix pointer
P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp
SA
12
11
8
5
2
1
10
9
7
4
6
3
T = mississippi#
P is larger
2 accesses per step
P = si
Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp
SA
12
11
8
5
2
1
10
9
7
4
6
3
T = mississippi#
P is smaller
P = si
Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
overall, O(p log2 N) time
Improvements: O(p + log2 N) [Manber-Myers, ’90]; O(p + log |S|) [Cole et al, ’06]
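The O(p log₂ N) search is two indirect binary searches over SA; a self-contained sketch (naive O(N² log N) construction, for clarity only):

```python
def suffix_array(t):
    """SA = starting positions (1-indexed, as on the slides) of the
    suffixes of t in lexicographic order."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

def search(t, sa, p):
    """The suffixes prefixed by p form a contiguous SA range (Prop 1);
    two binary searches delimit it, each cmp costing O(|p|)."""
    lo, hi = 0, len(sa)
    while lo < hi:                               # first suffix >= p
        mid = (lo + hi) // 2
        if t[sa[mid] - 1:] < p:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                               # end of the p-range
        mid = (lo + hi) // 2
        if t[sa[mid] - 1:sa[mid] - 1 + len(p)] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

T = 'mississippi#'
SA = suffix_array(T)
assert SA == [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]   # the slides' SA
assert search(T, SA, 'si') == [4, 7]                   # P = si, occ 4 and 7
```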
Locating the occurrences
SA
occ=2
T = mississippi#
[Figure: the SA rows prefixed by P = si are contiguous: 7 (sippi#) and
4 (sissippi#); binary searching the two range boundaries (conceptually
si# and si$) locates where the range starts and ends.]
Suffix Array search
• O (p + log2 N + occ) time
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree
[Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
Lcp   SA    suffix
      12    #
 0    11    i#
 1     8    ippi#
 1     5    issippi#
 4     2    ississippi#
 0     1    mississippi#
 0    10    pi#
 1     9    ppi#
 0     7    sippi#
 2     4    sissippi#
 1     6    ssippi#
 3     3    ssissippi#
T = mississippi#
(e.g. Lcp = 4 between issippi# and ississippi#: common prefix issi)
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
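The mining queries above only need the Lcp array; a sketch computing it naively, pair by pair (an O(N²) version for clarity, linear-time constructions exist):

```python
def lcp_array(t, sa):
    """Lcp[i] = length of the longest common prefix of the i-th and
    (i+1)-th suffix in SA order (1-indexed SA, as on the slides)."""
    def lcp(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n
    return [lcp(t[sa[i] - 1:], t[sa[i + 1] - 1:])
            for i in range(len(sa) - 1)]

t = 'mississippi#'
sa = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
lcp = lcp_array(t, sa)
# a repeated substring of length >= 4 exists  <=>  some Lcp entry >= 4
assert max(lcp) == 4          # "issi"
```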
Slide 5
Algoritmi per IR
Prologo
Paradigm shift...
Web 2.0 is about the many
Big
DATA Big PC ?
We have three types of algorithms:
T1(n) = n, T2(n) = n2, T3(n) = 2n
... and assume that 1 step = 1 time unit
How many input data n each algorithm may process
within t time units?
n1 = t,
n2 = √t,
n3 = log2 t
What about a k-times faster processor?
...or, what is n, when the time units are k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
A new scenario
Data are more available than even before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Step cost is Θ(1) time
Not just MIN #steps…
[Figure: memory hierarchy CPU → registers/Cache → RAM → HD → net]
registers / Cache:  few Mbs,  some nanosecs,  few words fetched
RAM:                few Gbs,  tens of nanosecs,  some words fetched
HD:                 few Tbs,  few millisecs,  B = 32K page
net:                many Tbs, even secs,  packets
You should be “??-aware programmers”
I/O-conscious Algorithms
track
read/write head
read/write arm
magnetic surface
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O [10^5–10^6 (Hennessy-Patterson)]
If N = (1+f)M, then the average disk cost per step is:  C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algo uses all of them:
(1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)
Space-conscious Algorithms
I/Os
search
access
Compressed
data structures
Streaming Algorithms
track
read/write head
read/write arm
magnetic surface
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
CPU
L1
L2
RAM
HD
net
registers / Cache:  few Mbs,  some nanosecs,  few words fetched
RAM:                few Gbs,  tens of nanosecs,  some words fetched
HD:                 few Tbs,  few millisecs,  B = 32K page
net:                many Tbs, even secs,  packets
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
n      4K    8K   16K   32K    128K   256K   512K   1M
n^3    22s   3m   26m   3.5h   28h    --     --     --
n^2    0     0    0     1s     26s    106s   7m     28m
An optimal solution
We assume every subsum≠0
[Figure: A splits as a prefix ending with sum < 0, followed by the
Optimum window, within which every partial sum is > 0.]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
sum = 0; max = -1;
For i = 1,...,n do
  If (sum + A[i] ≤ 0) sum = 0;
  else { sum += A[i]; max = MAX{max, sum}; }
Note:
• Sum < 0 just before OPT starts;
• Sum > 0 within OPT
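The scan above (a Kadane-style algorithm) runs in one pass and O(1) space; a sketch (it assumes, like the slide, that the optimal sum is positive, otherwise the first element is returned):

```python
def max_subarray(a):
    """One-pass scan: reset the running sum once it drops to <= 0,
    otherwise extend it and track the best value seen so far.
    Assumes the best subarray sum is positive, as on the slide."""
    best, s = a[0], 0
    for x in a:
        if s + x <= 0:
            s = 0                # the optimum cannot start inside here
        else:
            s += x
            best = max(best, s)
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
assert max_subarray(A) == 12     # the window 6 + 1 - 2 + 4 + 3
```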
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions Data get distributed arbitrarily !!!
B-tree internal nodes
B-tree leaves
(“tuple pointers")
What about
listing tuples
in order ?
Tuples
Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;            // Divide
03   Merge-Sort(A,i,m);      // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)          // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10^9 tuples ⇒ a few Gbs
Typical disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log2 n) random I/Os
[5ms] * n log2 n ≈ 1.5 years
In practice, it is faster because of caching... 2 passes (R/W)
Merge-Sort Recursion Tree
log2 N levels
[Figure: recursion tree of binary Merge-Sort; at each level, pairs of
sorted runs are merged into runs of double length.]
If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help.
How do we deploy the disk/mem features?
With internal memory M: N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
Pass 1: Produce (N/M) sorted runs.
Pass i: merge X ≤ M/B runs ⇒ log_{M/B}(N/M) passes
[Figure: X input buffers (one per run) plus one output buffer, each of
B items, held in main memory; runs stream from disk and back to disk.]
Multiway Merging
[Figure: multiway merging of runs 1..X, X = M/B. Buffers Bf1..BfX with
pointers p1..pX hold the current page of each run; repeatedly append
min(Bf1[p1], Bf2[p2], ..., BfX[pX]) to the output buffer Bfo; fetch the
next page of run i when pi = B, flush Bfo to the merged run when full,
until EOF.]
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os
In practice
M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
One multiway merge ⇒ 2 passes (R/W) = few mins
Tuning depends
on disk features
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
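The in-memory heart of a pass, merging X sorted runs by repeatedly extracting the overall minimum, can be sketched with a min-heap (here over in-memory lists; on disk each list would be streamed page by page):

```python
import heapq

def multiway_merge(runs):
    """Merge X sorted runs: keep (value, run, index) triples in a
    min-heap; pop the global minimum, push that run's next element."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        v, i, j = heapq.heappop(heap)
        out.append(v)
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

runs = [[1, 2, 5, 10], [3, 4, 11, 12], [6, 8, 9, 13]]
assert multiway_merge(runs) == sorted(sum(runs, []))
```

Each extraction costs O(log X), so a pass over N items costs O(N log(M/B)) CPU time and N/B page reads.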
Can compression help?
Goal: enlarge M and reduce N
#passes = O(logM/B N/M)
Cost of a pass = O(N/B)
Part of Vitter’s paper…
In order to address issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how much we can go
Toy problem #3: Top-freq elements
Goal: Top queries over a stream of N items (with |S| large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space (i.e. assuming the mode occurs > N/2 times).
A = b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables <X,C>
For each item s of the stream:
  if (X == s) then C++
  else { C--; if (C == 0) { X = s; C = 1; } }
Return X;
Proof: If the returned X ≠ y, then every occurrence of y has been
cancelled by a distinct “negative” mate; those mates number ≥ #occ(y),
so 2 * #occ(y) ≤ N — contradicting #occ(y) > N/2.
Problems: if the mode occurs ≤ N/2 times, the returned X need not be
the mode.
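The pairing argument above is the majority-vote idea (usually credited to Boyer-Moore); a sketch in the standard formulation, which is an equivalent variant of the slide's update rule:

```python
def majority_candidate(stream):
    """One candidate X and one counter C: increment on a match,
    decrement on a mismatch, adopt the new item when C hits 0."""
    x, c = None, 0
    for s in stream:
        if c == 0:
            x, c = s, 1
        elif x == s:
            c += 1
        else:
            c -= 1
    return x    # guaranteed correct only if some item occurs > N/2 times

A = list('bacccdcbaaaccbccc')       # the slide's stream, N = 17
assert majority_candidate(A) == 'c' # 'c' occurs 9 > 17/2 times
```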
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 * 10^9 chars ⇒ size ≈ 6Gb
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms
What kind of data structure do we build to support
word-based searches ?
Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms
Entry = 1 if the play contains the word, 0 otherwise

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1               1              0          0        0        1
Brutus             1               1              0          1        0        0
Caesar             1               1              0          1        1        1
Calpurnia          0               1              0          0        0        0
Cleopatra          1               0              0          0        0        0
mercy              1               0              1          1        1        1
worser             1               0              1          1        1        0

Space is 500Gb !
Solution 2: Inverted index
Brutus    → 2, 4, 8, 16, 32, 64, 128
Calpurnia → 1, 2, 3, 5, 8, 13, 21, 34
Caesar    → 13, 16
We can still do better: i.e. 30÷50% of the original text
1. Typically about 12 bytes are used per posting
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb data
Better index, but yet it is >10 times the text !!!!
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO: there are 2^n messages of length n, but only
∑i=1..n-1 2^i = 2^n − 2
shorter compressed messages to map them to.
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log2 (1/p(s)) = − log2 p(s)
Lower probability higher information
Entropy is the weighted average of i(s)
H(S) = ∑s∈S p(s) log2 (1/p(s))  bits
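The two formulas compute directly; a minimal sketch (function name is ours):

```python
import math

def entropy(probs):
    """H(S) = sum of p(s) * log2(1/p(s)); the lower a symbol's
    probability, the higher its self information i(s)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

assert entropy([0.5, 0.25, 0.25]) == 1.5          # 1.5 bits per symbol
assert abs(entropy([0.1, 0.2, 0.2, 0.5]) - 1.761) < 1e-3
```

The second distribution is the one used in the Huffman running example later on, where the average codeword length will be 1.8 bits, within 1 bit of H as Shannon's upper bound promises.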
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be
uniquely decomposed into their codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0
1
0 1
a
0
1
b
c
d
Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = ∑s∈S p(s) L[s]
We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…
Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}
then pi < pj ⇒ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have
H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that
La(C) ≤ H(S) + 1
The Shannon code takes ⌈log2 (1/p)⌉ bits
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
Merge the two least-probable nodes repeatedly:
a(.1) + b(.2) → (.3);  (.3) + c(.2) → (.5);  (.5) + d(.5) → (1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees
What about ties (and thus, tree depth) ?
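The greedy merging of the running example can be sketched with a min-heap (function name is ours; ties are broken by insertion order, one of the 2^(n-1) equivalent trees):

```python
import heapq

def huffman_lengths(probs):
    """Build the Huffman tree by repeatedly merging the two
    least-probable nodes; return the codeword length of each symbol."""
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    depth = {s: 0 for s in probs}
    heapq.heapify(heap)
    seq = len(heap)                       # tie-breaker for merged nodes
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:                 # every merge deepens its leaves
            depth[s] += 1
        heapq.heappush(heap, (p1 + p2, seq, s1 + s2))
        seq += 1
    return depth

d = huffman_lengths({'a': .1, 'b': .2, 'c': .2, 'd': .5})
assert d == {'a': 3, 'b': 3, 'c': 2, 'd': 1}   # matches a=000 b=001 c=01 d=1
```

The resulting average length is .1*3 + .2*3 + .2*2 + .5*1 = 1.8 bits, within 1 bit of the entropy of this distribution.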
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc... → 00000101
101001... → dcb...
[Figure: the codeword tree of the running example, with leaves a(.1),
b(.2), c(.2), d(.5) and internal nodes (.3), (.5), (1).]
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store for any level L:
= 00.....0
firstcode[L]
Symbol[L,i], for each i in level L
This is ≤ h^2 + |S| log |S| bits
Canonical Huffman
Encoding
[Figure: canonical codeword tree with levels 1..5]
Canonical Huffman
Decoding
firstcode[1]=2, firstcode[2]=1, firstcode[3]=1, firstcode[4]=2, firstcode[5]=0
T = ...00010...
Problem with Huffman Coding
Consider a symbol with probability .999. Its self information is
− log2 (.999) ≈ .00144
If we were to send 1000 such symbols we might hope to use
1000 * .00144 = 1.44 bits.
Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
1 extra bit per macro-symbol = 1/k extra bits per symbol
Larger model to be transmitted
Shannon took infinite sequences, letting k → ∞ !!
In practice, we have:
Model takes |S|^k · (k · log |S|) + h² bits
(where h might be |S|)
It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
[Figure: Huffman vs. tagged Huffman codewords. Each codeword is a sequence of bytes; 7 bits carry the Huffman code and, in the tagged variant, the first bit of each byte is a tag. Example: T = “bzip or not bzip”, with byte-aligned codewords assigned to the words bzip, or, not and to the space, e.g. [bzip] = 1a 0b.]
CGrep and other ideas...
P= bzip = 1a 0b
[Figure: GREP runs directly on the compressed text C(T): the codeword 1a 0b of P = bzip is searched in C(T), T = “bzip or not bzip”, answering yes/no at each word.]
Speed ≈ Compression ratio
You find this at
You find it under my Software projects
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary: a, bzip, not, or, space
P = bzip = 1a 0b
[Figure: the byte-aligned codeword of P is searched directly in C(S), S = “bzip or not bzip”; each candidate alignment answers yes/no.]
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: pattern P slid over text T, checking a candidate alignment]
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running
time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}^n)
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m
H(s) = Σ_{i=1..m} 2^(m-i) · s[i]
P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’)
s=s’ if and only if H(s)=H(s’)
Definition:
let Tr denote the length-m substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1):
H(Tr) = 2·H(Tr-1) - 2^m·T(r-1) + T(r+m-1)
Example (m = 4):
T = 10110101
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 - 2⁴·1 + 0 = 22 - 16 = 6
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P = 101111, q = 7
H(P) = 47
Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally:
1·2 + 0 (mod 7) = 2
2·2 + 1 (mod 7) = 5
5·2 + 1 (mod 7) = 4
4·2 + 1 (mod 7) = 2
2·2 + 1 (mod 7) = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr-1), using
2^m (mod q) = 2·(2^(m-1) (mod q)) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
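A Python sketch of the algorithm on a binary text, as in the slides (here the prime q is fixed instead of randomly drawn, and every fingerprint hit is verified, which is the deterministic variant):

```python
def karp_rabin(P, T, q=2**31 - 1):
    # Binary strings; q is a prime (fixed here; the real algorithm draws
    # it at random in [2, I] to keep the false-match probability small).
    m, n = len(P), len(T)
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    top = pow(2, m - 1, q)          # 2^(m-1) mod q, used to drop the leading bit
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:   # verify: no false matches reported
            occ.append(r)                  # 0-based start position
        if r + m < n:
            # H(T_{r+1}) = 2*(H(T_r) - 2^(m-1)*T[r]) + T[r+m]   (mod q)
            ht = (2 * (ht - top * int(T[r])) + int(T[r + m])) % q
    return occ
```

On the slides' example T = 10110101, P = 0101, the only occurrence starts at position 5 (1-based), i.e. index 4 here.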
Problem 1: Solution
Dictionary: a, bzip, not, or, space
P = bzip = 1a 0b
[Figure: the codeword of P is matched against C(S), S = “bzip or not bzip”; the two occurrences of bzip answer yes, the other words answer no.]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for
[Figure: the m×n matrix M for T = california and P = for; M(3,7) = 1 because P = for matches T[5..7].]
How does M solve the exact match problem?
How to construct M
We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one
Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:
And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.
Example: BitShift((0,1,1,0,1)ᵀ) = (1,0,1,1,0)ᵀ
Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1 for the
positions in P where character x appears.
Example: P = abaac
U(a) = (1,0,1,1,0)ᵀ, U(b) = (0,1,0,0,0)ᵀ, U(c) = (0,0,0,0,1)ᵀ
How to construct M
Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by
M(j) = BitShift(M(j-1)) & U(T[j])
For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T
ending at character j-1 ⇔ M(i-1,j-1) = 1
(2) P[i] = T[j] ⇔ the i-th bit of U(T[j]) = 1
BitShift moves bit M(i-1,j-1) into the i-th position;
AND this with the i-th bit of U(T[j]) to establish whether both hold.
An example (P = abaac, T = xabxabaaca)
[Figure, j=1: M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵀ & U(x) = (0,0,0,0,0)ᵀ]
[Figure, j=2: M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)ᵀ]
[Figure, j=3: M(3) = BitShift(M(2)) & U(b) = (0,1,0,0,0)ᵀ]
[Figure, j=9: M(9) = BitShift(M(8)) & U(c) = (0,0,0,0,1)ᵀ; the 5th bit is 1, so an occurrence of P ends at position 9]
Shift-And method: Complexity
If m<=w, any column and vector U() fit in a
memory word.
Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
Very often in practice. Recall that w=64 bits in
modern architectures.
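When m ≤ w the whole method is a few lines; here Python integers play the role of machine-word bit columns (a sketch, not the course's code):

```python
def shift_and(P, T):
    # One bit per pattern position; bit i (0-based) = position i+1 of P.
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        # BitShift(M) = (M << 1) | 1: shift down and set the first bit.
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & (1 << (m - 1)):            # last bit set: occurrence ends at j
            occ.append(j - m + 1)         # 0-based start position
    return occ
```

On the slides' examples: "for" occurs in "california" starting at index 4 (0-based), and "abaac" occurs in "xabxabaaca" starting at index 4.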
Some simple extensions
We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)ᵀ, U(b) = (1,1,0,0,0)ᵀ, U(c) = (0,0,0,0,1)ᵀ
What about ‘?’, ‘[^…]’ (not)?
Problem 1: Another solution
Dictionary: a, bzip, not, or, space
P = bzip = 1a 0b
[Figure: as before, the codeword of P is searched directly in C(S), S = “bzip or not bzip”.]
Speed ≈ Compression ratio
Problem 2
Dictionary: a, bzip, not, or
Given a pattern P, find all the occurrences in S of all terms containing P as a substring.
P = o
[Figure: P = o occurs inside the terms or and not, whose codewords are or = 1g 0a 0b and not = 1g 0g 0a; these are searched in C(S), S = “bzip or not bzip”.]
Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: patterns P1 and P2 aligned over the text T]
Naïve solution
Use an (optimal) Exact Matching Algorithm searching each
pattern in P
Complexity: O(nl+m) time, not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns in P.
R is a bitmap of length m.
R[i] = 1 iff S[i] is the first symbol of a pattern.
Use a variant of the Shift-And method searching for S:
For any symbol c, U’(c) = U(c) AND R
U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
For any step j:
compute M(j)
then M(j) OR U’(T[j]). Why?
This sets to 1 the first bit of each pattern that starts with T[j].
Check if there are occurrences ending in j. How?
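A sketch of this variant: R is the slides' start-bitmap, while the end-mask E, used to answer the last question (detecting occurrences ending at j), is an added helper not named in the slides:

```python
def multi_shift_and(patterns, T):
    # S = concatenation of the patterns; R marks first positions;
    # E marks last positions, so M & E exposes occurrences ending at j.
    S = "".join(patterns)
    R = E = 0
    starts, pos = [], 0
    for P in patterns:
        R |= 1 << pos
        E |= 1 << (pos + len(P) - 1)
        starts.append(pos)
        pos += len(P)
    U = {}
    for i, c in enumerate(S):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        # Usual update, then OR with U'(c) = U(c) & R to (re)start patterns.
        M = (((M << 1) | 1) & U.get(c, 0)) | (U.get(c, 0) & R)
        hits = M & E
        if hits:
            for k, P in enumerate(patterns):
                if hits & (1 << (starts[k] + len(P) - 1)):
                    occ.append((j - len(P) + 1, P))
    return occ
```

The inner loop over patterns on a hit can be replaced by bit tricks; it is kept explicit here for clarity.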
Problem 3
Dictionary: a, bzip, not, or
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.
P = bot, k = 2
[Figure: the codewords of the dictionary terms are searched in C(S), S = “bzip or not bzip”.]
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at
position 4; it also occurs with 4 mismatches
starting at position 2.
aatatccacaa
atcgaa
aatatccacaa
atcgaa
Agrep
Our current goal: given k, find all the occurrences of P
in T with up to k mismatches.
We define the matrix Ml to be an m by n binary
matrix, such that:
Ml(i,j) = 1 iff
there are at most l mismatches between the
first i characters of P and the i characters of T
ending at character j.
What is M0?
How does Mk solve the k-mismatch problem?
Computing Mk
We compute Ml for all l = 0, …, k.
For each j compute M0(j), M1(j), …, Mk(j)
For all l, initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff one of two cases holds:
Computing Ml: case 1
The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal:
BitShift( Ml(j-1) ) & U(T[j])
Computing Ml: case 2
The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches (and P[i] may mismatch T[j]):
BitShift( Ml-1(j-1) )
Computing Ml
Combining the two cases, for all l = 0, …, k and for each j:
Ml(j) = [ BitShift( Ml(j-1) ) & U(T[j]) ]  OR  BitShift( Ml-1(j-1) )
Example M1
P = abaad, T = xabxabaaca
[Figure: the 5×10 matrices M0 and M1. Row 5 of M0 is never 1 (abaad does not occur exactly); M1(5,9) = 1, since T[5..9] = abaac matches P = abaad with 1 mismatch (the last character).]
How much do we pay?
The running time is O(kn(1 + m/w)).
Again, the method is practically efficient for
small m.
Only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.
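The two cases combine into code as follows (Python integers as bit columns; a minimal sketch of the slides' recurrence, not their implementation):

```python
def agrep_mismatch(P, T, k):
    # M[l] is the bit-column for "at most l mismatches".
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)
    last = 1 << (m - 1)
    occ = []
    for j, c in enumerate(T):
        prev = M[:]                       # columns at position j-1
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            # case 1: l-mismatch prefix extended by a matching character
            # case 2: (l-1)-mismatch prefix extended by a mismatch
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & last:
            occ.append(j - m + 1)         # 0-based start, <= k mismatches
    return occ
```

With k = 0 this degenerates to plain Shift-And; on the slides' example (P = atcgaa, T = aatatccacaa, k = 2) the only reported occurrence starts at 1-based position 4, i.e. index 3.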
Problem 3: Solution
Dictionary: a, bzip, not, or
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.
P = bot, k = 2
[Figure: searching C(S), S = “bzip or not bzip”; the codeword of not (= 1g 0g 0a) matches with at most 2 mismatches, so its occurrences answer yes.]
Agrep: more sophisticated operations
The Shift-And method can solve other ops
The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p into a different one
Example: d(ananas,banane) = 3
Search by regular expressions
Example: (a|b)?(abc|a)
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
g-code for integer encoding
g(x) = 0^(Length-1) followed by x in binary,
where x > 0 and Length = ⌊log₂ x⌋ + 1
e.g., 9 is represented as <000,1001>.
The g-code for x takes 2⌊log₂ x⌋ + 1 bits
(i.e., a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of g-coded
integers, reconstruct the original sequence:
0001000001100110000011101100111
8
6
3
59
7
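A sketch of the g-encoder/decoder in Python (not the course's code), against which the exercise above can be checked:

```python
def gamma_encode(x):
    # x > 0: (Length-1) zeros, then x in binary (which begins with 1).
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":             # count the unary length prefix
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))  # z+1 bits of payload
        i += z + 1
    return out
```

Decoding the exercise string indeed yields 8, 6, 3, 59, 7.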
Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code g(i).
Recall that |g(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:
1 ≥ Σ_{i=1,...,x} pi ≥ x·px, hence x ≤ 1/px
How good is it?
Encode the integers via g-coding: |g(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):
Σ_{i=1,..,|S|} pi·|g(i)| ≤ Σ_{i=1,..,|S|} pi·[ 2·log(1/pi) + 1 ] = 2·H0(X) + 1
Not much worse than Huffman,
and improvable to H0(X) + 2 + ....
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman
End-tagged dense code
The rank r is mapped to r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix-code
Better compression: it uses all 7-bits configurations
(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is:
Previously we used: s = c = 128
Now s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
and s·c with 2 bytes, s·c² with 3 bytes, ...
An example
5000 distinct words
ETDC encodes 128 + 128² = 16512 words within 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words within 2
bytes, hence more on 1 byte, and thus it wins if the distribution is skewed...
Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.
Brute-force approach
Binary search: on real distributions there seems to be a unique minimum
Ks = max codeword length
Fsk = cumulative probability of the symbols whose |cw| ≤ k
Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than
byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
It exploits temporal locality, and it is dynamic
X = 1ⁿ2ⁿ3ⁿ…nⁿ ⇒ Huff = O(n² log n) bits, MTF = O(n log n) + n² bits
Not much worse than Huffman
...but it may be far better
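The two steps above can be sketched as follows (a minimal illustration; positions are 0-based here, while the slides' convention may be 1-based):

```python
def mtf_encode(s, alphabet):
    L, out = list(alphabet), []
    for c in s:
        i = L.index(c)                    # 1) output the position of c in L
        out.append(i)
        L.pop(i)
        L.insert(0, c)                    # 2) move c to the front (the "memory")
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for i in codes:
        c = L.pop(i)                      # the decoder mirrors the same list
        out.append(c)
        L.insert(0, c)
    return "".join(out)
```

Runs of equal symbols turn into runs of 0s, which is what makes MTF pair so well with RLE and a var-length integer coder.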
MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2·log i + 1
Put S in front of the list, and bound the cost of encoding by the gaps
between consecutive occurrences of each symbol:
≤ O(|S| log |S|) + Σ_{x=1,..,|S|} Σ_i g( p_i^x - p_{i-1}^x )
By Jensen’s inequality:
≤ O(|S| log |S|) + Σ_{x=1,..,|S|} n_x · [ 2·log(N/n_x) + 1 ]
= O(|S| log |S|) + N·[ 2·H0(X) + 1 ]
Hence La[mtf] ≤ 2·H0(X) + O(1)
MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to efficiently maintain the MTF-list:
Search tree
Leaves contain the symbols, ordered as in the MTF-list
Nodes contain the size of their descending subtree
Hash Table
key is a symbol
data is a pointer to the corresponding tree leaf
Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings just numbers and one bit
Properties:
Exploit spatial locality, and it is a dynamic code
X = 1ⁿ2ⁿ3ⁿ…nⁿ
There is a memory
Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
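A sketch of the run extractor, reproducing the example above (the run lengths would then be var-length coded):

```python
def rle_encode(s):
    runs, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                        # extend the current run
        runs.append((s[i], j - i))        # (symbol, run length)
        i = j
    return runs
```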
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive), of width equal to its probability:
f(i) = Σ_{j=1}^{i-1} p(j)
e.g., with f(a) = .0, f(b) = .2, f(c) = .7:
a = .2 → [0, .2), b = .5 → [.2, .7), c = .3 → [.7, 1.0)
The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: the interval is repeatedly narrowed: b gives [.2,.7); then a gives [.2,.3); then c gives [.27,.3).]
The final sequence interval is [.27,.3)
Arithmetic Coding
To code a sequence of symbols c with probabilities
p[c] use the following:
l₀ = 0, s₀ = 1
l_i = l_{i-1} + s_{i-1} · f[c_i]
s_i = s_{i-1} · p[c_i]
f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is
s_n = Π_{i=1}^{n} p[c_i]
The interval for a message sequence will be called the
sequence interval
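The recurrence can be sketched with floats, using the running distribution p(a)=.2, p(b)=.5, p(c)=.3 (a toy illustration; real coders use the integer version described later):

```python
def sequence_interval(msg, p, f):
    # l_i = l_{i-1} + s_{i-1} * f[c_i];  s_i = s_{i-1} * p[c_i]
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, s                            # sequence interval is [l, l+s)

def arith_decode(x, n, p, f):
    # Repeatedly find the symbol interval containing x, then rescale x.
    syms = sorted(f, key=f.get)
    out = []
    for _ in range(n):
        for c in reversed(syms):           # try the largest cumulative first
            if x >= f[c]:
                out.append(c)
                x = (x - f[c]) / p[c]
                break
    return "".join(out)

p = {"a": .2, "b": .5, "c": .3}
f = {"a": .0, "b": .2, "c": .7}            # cumulative prob., symbol excluded
```

This reproduces the two worked examples of the slides: coding bac yields the interval [.27,.3), and decoding .49 for a length-3 message yields bbc.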
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: .49 ∈ [.2,.7) → b; within [.2,.7), .49 ∈ [.3,.55) → b; within [.3,.55), .49 ∈ [.475,.55) → c.]
The message is bbc.
Representing a real number
Binary fractional representation:
.75 = .11
1/3 = .0101...
11/16 = .1011
Algorithm:
1. x = 2·x
2. if x < 1, output 0
3. else x = x - 1, output 1
So how about just using the shortest binary
fractional representation in the sequence
interval?
e.g. [0,.33) → .01, [.33,.66) → .1, [.66,1) → .11
Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
number   min      max      interval
.11      .110...  .111...  [.75, 1.0)
.101     .1010... .1011... [.625, .75)
We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).
[Figure: the sequence interval [.61, .79) contains the code interval [.625, .75) of the dyadic number .101.]
Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits
Bound on Arithmetic length
Note that -log s + 1 = log (2/s)
Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) = 1 + log Π_{i=1,n} (1/p_i)
≤ 2 + Σ_{i=1,n} log (1/p_i)
= 2 + Σ_{k=1,|S|} n·p_k·log (1/p_k)
= 2 + n·H0 bits
≈ nH0 + 0.02·n bits in practice,
because of rounding
Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:
Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s; m = 0
Message interval is expanded by 2
If u < R/2 then (bottom half)
Output 0 followed by m 1s; m = 0
Message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2
In all other cases, just continue...
You find this at
Arithmetic ToolBox
As a state machine: given the current interval (L,s), the next symbol c
and its distribution (p1,....,p|S|), the ATB outputs the new interval
(L’,s’), with L’ = L + s·f(c) and s’ = s·p(c).
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
The ATB is driven by p[ s | context ], where s = c or esc.
Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)
PPM: Example Contexts
String = ACCBACCACBA B     (k = 2)

Context   Counts
(empty)   A = 4, B = 2, C = 5, $ = 3
A         C = 3, $ = 1
B         A = 2, $ = 1
C         A = 1, B = 2, C = 2, $ = 3
AC        B = 1, C = 2, $ = 2
BA        C = 1, $ = 1
CA        C = 1, $ = 1
CB        A = 2, $ = 1
CC        A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) as n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
[Figure: the Dictionary consists of all substrings starting before the Cursor; at this step the output is <2,3,c>]
Algorithm’s step:
Output <d, len, c> where
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match
Advance by len + 1
A buffer “window” of fixed length slides over the text
A buffer “window” has fixed length and moves
Example: LZ77 with window
a a c a a c a b c a b a a a c
(0,0,a)
a a c a a c a b c a b a a a c
(1,1,c)
a a c a a c a b c a b a a a c
(3,4,b)
a a c a a c a b c a b a a a c
(3,3,a)
a a c a a c a b c a b a a a c
(1,2,c)
Window size = 6
Longest match
within W
Next character
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if len > d? (overlap with text to be compressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
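The decoder, including the len > d overlap case, can be sketched as:

```python
def lz77_decode(triples):
    # Each triple is (d, len, c); copying char-by-char makes len > d
    # (self-overlapping copies) work for free.
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):
            out.append(out[start + i])
        out.append(c)
    return "".join(out)
```

It reproduces both slide examples: the windowed encoding of "aacaacabcabaaac", and the overlapping copy (2,9,e) on top of abcd.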
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
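The coding and decoding loops can be sketched as follows (a minimal illustration; a trailing partial match is flushed with an empty extra character, a detail the slides' example does not need):

```python
def lz78_encode(s):
    dictionary = {}                        # phrase -> id; id 0 = empty phrase
    out, cur = [], ""
    for c in s:
        if cur + c in dictionary:
            cur += c                       # extend the current match
        else:
            out.append((dictionary.get(cur, 0), c))
            dictionary[cur + c] = len(dictionary) + 1
            cur = ""
    if cur:                                # flush a trailing partial match
        out.append((dictionary[cur], ""))
    return out

def lz78_decode(pairs):
    phrases, out = [""], []
    for pid, c in pairs:
        p = phrases[pid] + c
        out.append(p)
        phrases.append(p)                  # the decoder rebuilds the same ids
    return "".join(out)
```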
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
LZW: Encoding Example
Output
Dict.
a a b a a c a b a b a c b
112 256=aa
a a b a a c a b a b a c b
112 257=ab
a a b a a c a b a b a c b
113
258=ba
a a b a a c a b a b a c b
256
259=aac
a a b a a c a b a b a c b
114
260=ca
a a b a a c a b a b a c b
257
261=aba
a a b a a c a b a b a c b
261
262=abac
a a b a a c a b a b a c b
114
263=cb
LZW: Decoding Example
Input   Output so far           Dict
112     a
112     a a                     256=aa
113     a a b                   257=ab
256     a a b a a               258=ba
114     a a b a a c             259=aac
257     a a b a a c a b         260=ca
261     a a b a a c a b ?       (261 is not in the dictionary yet!)
One step later the decoder learns 261 = aba (the previous match ab extended with its own first character):
261     a a b a a c a b a b a   261=aba
114     a a b a a c a b a b a c 262=abac
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
Consider the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows   (1994)
F                  L
#  mississipp      i
i  #mississip      p
i  ppi#missis      s
i  ssippi#mis      s
i  ssissippi#      m
m  ississippi      #
p  i#mississi      p
p  pi#mississ      i
s  ippi#missi      s
s  issippi#mi      s
s  sippi#miss      i
s  sissippi#m      i
A famous example: [figure of the BWT of a much longer text, showing long runs in L]
A useful tool: the L → F mapping
[Figure: the sorted BWT matrix with columns F and L as above.]
How do we map L’s chars onto F’s chars?
... we need to distinguish equal chars in F...
Take two equal chars in L and rotate their rows rightward by one:
the rows remain sorted, hence the two chars keep the same relative order in F !!
The BWT is invertible
[Figure: the sorted BWT matrix with columns F and L.]
Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i ppi #
InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
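Both transforms can be sketched compactly (explicit rotation sorting, so Θ(n² log n) worst case as noted below; a toy illustration only, assuming the unique smallest terminator ‘#’):

```python
def bwt(T):
    # Assumes T ends with a unique, smallest terminator '#'.
    rots = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(r[-1] for r in rots)

def ibwt(L):
    n = len(L)
    # LF mapping: a stable sort of L gives, for each L-position, its F-position
    # (equal chars keep their relative order, as argued above).
    order = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    # The row whose last char is '#' is T itself; walk backward via LF.
    r, out = L.index("#"), []
    for _ in range(n):
        out.append(L[r])
        r = LF[r]
    return "".join(reversed(out))
```

On the slides' example, bwt("mississippi#") gives the L column ipssm#pissii, and ibwt recovers the text.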
How to compute the BWT ?
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
[Figure: the rows of the BWT matrix are the suffixes of T in SA order; L = i p s s m # p i s s i i]
We said that L[i] precedes F[i] in T; e.g. L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i] - 1]
How to construct SA from T ?
SA: 12 → #, 11 → i#, 8 → ippi#, 5 → issippi#, 2 → ississippi#, 1 → mississippi#, 10 → pi#, 9 → ppi#, 7 → sippi#, 4 → sissippi#, 6 → ssippi#, 3 → ssissippi#
Elegant but inefficient: sort the suffixes directly.
Input: T = mississippi#
Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
Many algorithms are known, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii   (# at position 16)
MTF-list = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Mtf = 030040000040040300400400000200000   (Bin(6)=110, Wheeler’s code)
RLE0 = 03141041403141410210
Alphabet of size |S|+1
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original MTF-list (i,m,p,s)
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google, 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node can go to any node via an
undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node can go to any node via a
directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
Largest artifact ever conceived by the human
Exploit its structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph (undirected)
V = routers
E = communication links
The “cosine” graph (undirected, weighted)
V = static web pages
E = semantic distance between pages
Query-Log graph (bipartite, weighted)
V = queries and URLs
E = (q,u) if u is a result for q, and has been clicked by some
user who issued q
Social graph (undirected, unweighted)
V = users
E = (x,y) if x knows y (facebook, address book, email,..)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1
The In-degree distribution
Altavista crawl 1999 and WebBase crawl 2001:
the indegree follows a power-law distribution
Pr[ in-degree(u) = k ] ∝ 1/k^α ,  α ≈ 2.1
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
[Figure: adjacency-matrix plot of a crawl with 21 million pages and 150 million links; URL-sorting places pages of the same host (e.g. Berkeley, Stanford) close together, producing dense blocks near the diagonal.]
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
For negative entries: map v ≥ 0 to 2v and v < 0 to 2|v|-1, so all entries become non-negative
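The successor-list transformation can be sketched as follows (a minimal illustration; the fold of negatives, v ≥ 0 → 2v and v < 0 → 2|v|-1, matches the positive/negative residual examples later in these slides):

```python
def fold(v):
    # Map an integer to a non-negative code: v >= 0 -> 2v, v < 0 -> 2|v|-1.
    return 2 * v if v >= 0 else 2 * (-v) - 1

def gap_encode(x, successors):
    # successors: non-empty, sorted adjacency list of node x.
    # S(x) = {s1-x, s2-s1-1, ..., sk-s(k-1)-1}; only the first gap
    # can be negative, so only it is folded.
    gaps = [successors[0] - x]
    for prev, s in zip(successors, successors[1:]):
        gaps.append(s - prev - 1)
    return [fold(gaps[0])] + gaps[1:]
```

Thanks to locality, most gaps are small non-negative integers, which a var-length integer coder (e.g. the g-code) then compresses well.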
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
the reference index is chosen in [0,W] so as to give the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.
Consecutivity in extra-nodes
Intervals: use their left extreme and length
Interval lengths are decremented by Lmin = 2
Residuals: differences between consecutive residuals, or wrt the source:
0 = (15-15)·2 (positive)
2 = (23-19)-2 (jump ≥ 2)
600 = (316-16)·2
3 = |13-15|·2-1 (negative)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
data
knowledge
about data
at receiver
sender
receiver
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
How if the sender has never seen data at receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested
page and the ones available in cache
File synchronization
[diff, zdelta, REBL,…]
[rsynch, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
LZ77-scheme provides an efficient, optimal solution
fknown is the “previously encoded text”: compress the concatenation fknown·fnew, emitting output only from fnew onwards
zdelta is one of the best implementations
Emacs     size     time
uncompr   27Mb     ---
gzip      8Mb      35 secs
zdelta    1.5Mb    42 secs
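zdelta itself performs an LZ77 parsing of fknown·fnew; as a rough, illustrative stand-in (not zdelta's actual format), Python's difflib can produce the same kind of copy/insert delta:

```python
import difflib

def delta_encode(f_known, f_new):
    """Encode f_new as copy/insert ops against f_known (delta-compression sketch)."""
    ops = []
    sm = difflib.SequenceMatcher(None, f_known, f_new, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == 'equal':
            ops.append(('copy', i1, i2 - i1))      # copy a block of f_known
        elif j2 > j1:
            ops.append(('insert', f_new[j1:j2]))   # literal characters
    return ops

def delta_decode(f_known, ops):
    """Rebuild f_new from f_known plus the delta ops."""
    out = []
    for op in ops:
        if op[0] == 'copy':
            out.append(f_known[op[1]:op[1] + op[2]])
        else:
            out.append(op[1])
    return ''.join(out)
```

When the two files are similar, the delta is dominated by a few copy ops and is far smaller than gzip-compressing f_new alone.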
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
Client
reference
request
Slow-link
Delta-encoding
Proxy
reference
request
Fast-link
web
Page
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f ∈ F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
[Figure: weighted graph GF with a dummy node; edge weights are zdelta/gzip sizes]
          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic
Improvement
What about
many-to-one compression?
(group of files)
Problem: Constructing G is very costly: n² edge calculations (zdelta executions)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files
Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, still Θ(n²) time
          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
request
f_new
update
Server
f_old
Client
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
hashes
f_new
Server
encoded file
f_old
Client
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
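The 4-byte rolling hash can be sketched as an Adler-like checksum; this is a simplified illustration (two 16-bit components), not rsync's exact code:

```python
def weak_checksum(block):
    """Adler-like weak checksum of a byte block (rsync-style sketch):
    a = sum of bytes, b = position-weighted sum, both mod 2^16."""
    a = sum(block) % 65536
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % 65536
    return (b << 16) | a

def roll(csum, old_byte, new_byte, blen):
    """Slide the window one byte to the right in O(1):
    drop old_byte on the left, append new_byte on the right."""
    a = csum & 0xFFFF
    b = csum >> 16
    a = (a - old_byte + new_byte) % 65536
    b = (b - blen * old_byte + a) % 65536
    return (b << 16) | a
```

The client can thus test the weak checksum at every text position cheaply, and compute the stronger MD5-like hash only on weak matches.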
Rsync: some experiments
gcc size
total
27288
gzip
7563
zdelta
227
rsync
964
emacs size
27326
8577
1431
4452
Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
Server sends hashes (unlike the client in rsync); the client checks them
The server deploys the common fref to compress the new ftar (rsync just compresses ftar on its own)
A multi-round protocol
k blocks of n/k elements, log(n/k) levels
If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Recurring minimum for
improving the estimate
+ 2 SBF
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (ie. T[i,N])
Occurrences of P in T = All suffixes of T having P as a prefix
Example: P = si, T = mississippi ⇒ occurrences at positions 4, 7
SUF(T) = Sorted set of suffixes of T
Reduction
From substring search
To prefix search
The Suffix Tree
T# = mississippi#
[Figure: the suffix tree of T#, with edge labels (i, s, p, si, ssi, ppi#, pi#, i#, mississippi#, …) and 12 leaves storing the starting positions of the suffixes]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space
SA   SUF(T)
12   #
11   i#
8    ippi#
5    issippi#
2    ississippi#
1    mississippi#
10   pi#
9    ppi#
7    sippi#
4    sissippi#
6    ssippi#
3    ssissippi#

T = mississippi#
suffix pointer
P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ in practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
SA = 12 11 8 5 2 1 10 9 7 4 6 3; T = mississippi#
[Figure: a binary-search step for P = si — P is larger than the middle suffix; 2 accesses per step]
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
SA = 12 11 8 5 2 1 10 9 7 4 6 3; T = mississippi#
[Figure: the next binary-search step — P = si is smaller than the middle suffix]
Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
⇒ overall, O(p log2 N) time
Improved to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al., ’06]
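A minimal sketch of the search (naive suffix-array construction plus the indirect binary search; positions are 0-based here, while the slides use 1-based positions):

```python
def suffix_array(t):
    """Naive construction: sort suffix starting positions lexicographically.
    O(N^2 log N) worst case -- fine for illustration, not for massive data."""
    return sorted(range(len(t)), key=lambda i: t[i:])

def search(t, sa, p):
    """Return the (sorted) starting positions of p in t, via two binary
    searches that delimit the contiguous SA range of suffixes prefixed by p."""
    n, m = len(sa), len(p)
    lo, hi = 0, n
    while lo < hi:                                  # first suffix with prefix >= p
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, n
    while lo < hi:                                  # first suffix with prefix > p
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])
```

For T = mississippi# and P = si this returns the 0-based positions 3 and 6, i.e. the slides' positions 4 and 7.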
Locating the occurrences
[Figure: the occ = 2 occurrences of P = si are contiguous in SA for T = mississippi#: entries 7 (sippi…) and 4 (sissippi…)]
Suffix Array search
• O (p + log2 N + occ) time
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree
[Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
SA   suffix         Lcp (with next)
12   #              0
11   i#             1
8    ippi#          1
5    issippi#       4
2    ississippi#    0
1    mississippi#   0
10   pi#            1
9    ppi#           0
7    sippi#         2
4    sissippi#      1
6    ssippi#        3
3    ssissippi#     -

T = mississippi#   (e.g. Lcp = 4 between issippi# and ississippi#)
• How long is the common prefix between T[i,...] and T[j,...] ?
  Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a run Lcp[i,i+C-2] whose entries are all ≥ L.
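The Lcp array itself can be built in O(N) time with Kasai's algorithm (a standard construction, not covered in the slides); a sketch:

```python
def lcp_array(t, sa):
    """Kasai's algorithm: lcp between each pair of SA-adjacent suffixes, O(N).
    Exploits that lcp can decrease by at most 1 when moving from suffix i
    to suffix i+1 in text order."""
    n = len(t)
    rank = [0] * n
    for k, i in enumerate(sa):
        rank[i] = k
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]                 # suffix preceding i in SA
            while i + h < n and j + h < n and t[i + h] == t[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1                          # reuse h-1 matched chars
        else:
            h = 0
    return lcp[1:]                              # N-1 values for adjacent pairs
```

On T = mississippi# this yields max Lcp = 4, witnessing the repeated substring "issi".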
Slide 6
Algoritmi per IR
Prologo (repeated from Slide 1)
Paradigm shift...
Web 2.0 is about the many
Big DATA ⇒ Big PC ?
We have three types of algorithms:
T1(n) = n, T2(n) = n², T3(n) = 2ⁿ
... and assume that 1 step = 1 time unit
How many input data n each algorithm may process
within t time units?
n1 = t,
n2 = √t,
n3 = log2 t
What about a k-times faster processor?
...or, what is n, when the time units are k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
A new scenario
Data are more available than ever before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Step cost is assumed Θ(1) time
Not just MIN #steps…
CPU (registers) ↔ L1 ↔ L2 ↔ RAM ↔ HD ↔ net

Cache: few MBs, some nanosecs, few words fetched
RAM: few GBs, tens of nanosecs, some words fetched
HD: few TBs, few millisecs, B = 32K page
net: many TBs, even secs, packets
You should be “??-aware programmers”
I/O-conscious Algorithms
[Figure: disk internals — track, read/write head, read/write arm, magnetic surface]
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O [10^5–10^6 (Hennessy-Patterson)]
If N = (1+f)M, then the avg disk cost per step is: C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4KB in time C, and the algorithm uses all of them:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)
Space-conscious Algorithms
I/Os
search
access
Compressed
data structures
Streaming Algorithms
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
CPU (registers) ↔ L1 ↔ L2 ↔ RAM ↔ HD ↔ net
[Figure: the same memory hierarchy as before — Cache: few MBs, some nanosecs; RAM: few GBs, tens of nanosecs; HD: few TBs, few millisecs, B = 32K page; net: many TBs, even secs, packets]
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock, and its daily performance over
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
n     4K    8K    16K   32K    128K   256K   512K   1M
n³    22s   3m    26m   3.5h   28h    --     --     --
n²    0     0     0     1s     26s    106s   7m     28m
An optimal solution
We assume every subsum≠0
[Figure: A split as a prefix summing < 0, followed by the Optimum window summing > 0]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
sum = 0; max = -1;
For i = 1, ..., n do
  If (sum + A[i] ≤ 0) then sum = 0;
  else { sum += A[i]; max = MAX{max, sum}; }
Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
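A runnable version of the one-pass algorithm (with a fallback for all-negative arrays, which the slide excludes by assumption):

```python
def max_subarray(a):
    """One-pass max-subarray-sum (Kadane-style, as in the slide):
    reset the running sum whenever it would drop to <= 0,
    and track the best sum seen so far."""
    best = float("-inf")
    run = 0
    for x in a:
        if run + x <= 0:
            run = 0                  # the optimum cannot start inside a <=0 prefix
        else:
            run += x
            best = max(best, run)
    # fallback: all windows non-positive (excluded by the slide's assumption)
    return best if best != float("-inf") else max(a)
```

On the slide's array the optimum window is 6 1 -2 4 3, of sum 12.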
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions Data get distributed arbitrarily !!!
B-tree internal nodes
B-tree leaves
(“tuple pointers")
What about
listing tuples
in order ?
Tuples
Possibly 10⁹ random I/Os = 10⁹ * 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;          // Divide
03   Merge-Sort(A,i,m);    // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)        // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10⁹ tuples ⇒ a few GBs
Typical disk (Seagate Cheetah 150GB): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log2 n) random I/Os
[5ms] * n log2 n ≈ 1.5 years
In practice, it is faster because of caching...
2 passes (R/W) per merge level
Merge-Sort Recursion Tree
log2 N
If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: the recursion tree of merge-sort — log2 N levels of runs being pairwise merged]
How do we deploy the disk/memory features?
N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
Pass 1: Produce (N/M) sorted runs.
Pass i: merge X ≤ M/B runs ⇒ log_{M/B} (N/M) passes
INPUT 1
...
INPUT 2
...
OUTPUT
...
INPUT X
Disk
Disk
Main memory buffers of B items
Multiway Merging
[Figure: multiway merging — X = M/B input buffers Bf1..BfX, one per run, with current pointers p1..pX; repeatedly output min(Bf1[p1], Bf2[p2], …, BfX[pX]) to the output buffer Bfo; fetch the next page of run i when pi = B; flush Bfo to the merged run on disk when it is full, until EOF]
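The min-selection among the X runs is typically done with a min-heap; a purely in-memory sketch of the merge step (a real external sort would read and write pages of B items around it):

```python
import heapq

def multiway_merge(runs):
    """Merge X sorted runs with a min-heap of their current heads:
    pop the global minimum, then push the next element of the same run.
    O(N log X) comparisons for N total elements."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out
```

With X = M/B runs merged at once, one such pass over the data replaces log2(M/B) passes of binary merging.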
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os
In practice
M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
⇒ one multiway merge, 2 passes = few mins
Tuning depends
on disk features
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
Can compression help?
Goal: enlarge M and reduce N
#passes = O(log_{M/B} (N/M))
Cost of a pass = O(N/B)
Part of Vitter’s paper…
In order to address issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how much we can go
Toy problem #3: Top-freq elements
Goal: top queries over a stream of N items (N large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space (i.e. the mode, provided it occurs > N/2 times).
A = b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables <X,C>
For each item s of the stream:
  if (X == s) then C++
  else { C--; if (C == 0) { X = s; C = 1; } }
Return X;
Proof
If X ≠ y at the end, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥ #occ(y).
As a result, 2 * #occ(y) > N would fail: contradiction.
Problems arise if #occ(y) ≤ N/2 (the returned X is then arbitrary).
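The two-variable scheme above is the Boyer-Moore majority vote; a runnable sketch:

```python
def majority_candidate(stream):
    """Boyer-Moore majority vote: one pass, two variables <X, C>.
    Returns the only item that can occur > N/2 times; if a majority is
    not guaranteed to exist, verify X with a second pass."""
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X
```

On the slide's stream, where c occurs 9 times out of 17, the candidate is c.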
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 * 10⁹ chars ⇒ size = 6GB
n = 10⁶ documents
TotT = 10⁹ term occurrences (avg term length is 6 chars)
t = 5 * 10⁵ distinct terms
What kind of data structure should we build to support
word-based searches?
Solution 1: Term-Doc matrix
n = 1 million plays (columns), t = 500K terms (rows);
entry = 1 if the play contains the word, 0 otherwise:

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            1                1              0          0        0        1
Brutus            1                1              0          1        0        0
Caesar            1                1              0          1        1        1
Calpurnia         0                1              0          0        0        0
Cleopatra         1                0              0          0        0        0
mercy             1                0              1          1        1        1
worser            1                0              1          1        1        0

Space is 500GB !
Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

We can still do better: i.e. 30–50% of the original text
1. Typically postings use about 12 bytes each
2. We have 10⁹ total terms ⇒ at least 12GB of space
3. Compressing the 6GB of documents gets 1.5GB of data
A better index, but it is still >10 times the text !!!!
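A minimal in-memory sketch of an inverted index with an AND query answered by intersecting two sorted postings lists (the toy documents below are mine, for illustration):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc ids containing it.
    Storing gaps between consecutive ids is what makes postings compressible."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in sorted(set(text.split())):
            index[term].append(doc_id)      # ids arrive in increasing order
    return index

def query_and(index, t1, t2):
    """AND query = merge-scan intersection of two sorted postings lists."""
    p1, p2 = index.get(t1, []), index.get(t2, [])
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out
```

The merge scan costs O(|p1| + |p2|), independent of the number of documents.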
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL of them into fewer bits?
NO: they are 2^n, but the shorter compressed messages are fewer:
sum_{i=1}^{n-1} 2^i = 2^n - 2
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log2 (1/p(s)) = -log2 p(s)
Lower probability higher information
Entropy is the weighted average of i(s)
H(S) = sum_{s ∈ S} p(s) log2 (1/p(s)) bits
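A direct transcription of the entropy formula:

```python
import math

def entropy(probs):
    """H(S) = sum_s p(s) * log2(1/p(s)), in bits per symbol.
    Terms with p = 0 contribute 0 (the p*log(1/p) limit)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)
```

E.g. a fair coin has entropy 1 bit; a certain event has entropy 0.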
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be
uniquely decomposed into their codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g. a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: the binary trie of the code — 0/1 on the edges; leaves: a = 0, b = 100, c = 101, d = 11]
Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = sum_{s ∈ S} p(s) L[s]
We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same symbol lengths and thus the
same optimal average length. And vice versa…
Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}
then pi < pj ⇒ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have
H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that
La(C) ≤ H(S) + 1
The Shannon code takes ⌈log2 (1/p)⌉ bits per symbol
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
Repeatedly merge the two least probable trees:
a(.1) + b(.2) → (.3);  (.3) + c(.2) → (.5);  (.5) + d(.5) → (1)
a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) “equivalent” Huffman trees
What about ties (and thus, tree depth) ?
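A compact Huffman-code construction with a heap; the insertion counter breaks ties arbitrarily (so the concrete codewords may differ from the slide's, while the code lengths — hence the average length — are optimal either way):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build an optimal prefix code by repeatedly merging the two least
    probable trees. freqs: dict symbol -> probability (or weight)."""
    tick = count()                       # tie-breaker: heapq never compares trees
    heap = [(w, next(tick), s) for s, w in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                   # degenerate one-symbol alphabet
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tick), (t1, t2)))
    codes = {}
    def walk(t, prefix):                 # root-to-leaf paths are the codewords
        if isinstance(t, tuple):
            walk(t[0], prefix + "0")
            walk(t[1], prefix + "1")
        else:
            codes[t] = prefix
    walk(heap[0][2], "")
    return codes
```

For the running example the codeword lengths come out 3, 3, 2, 1, matching a = 000, b = 001, c = 01, d = 1.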
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Example (a = 000, b = 001, c = 01, d = 1):
encode: abc… → 00000101…
decode: 101001… → dcb…
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store for any level L:
firstcode[L] = the value of the first (smallest) codeword on level L
Symbol[L,i], for each i in level L
This is ≤ h² + |S| log |S| bits (h = tree height)
Canonical Huffman: Encoding
[Figure: the canonical codeword tree, levels 1..5]
Canonical Huffman: Decoding
firstcode[1] = 2
firstcode[2] = 1
firstcode[3] = 1
firstcode[4] = 2
firstcode[5] = 0
T = ...00010...
Problem with Huffman Coding
Consider a symbol with probability .999. The
self-information is
-log2(.999) ≈ .00144
If we were to send 1000 such symbols we
might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
1 extra bit per macro-symbol = 1/k extra bits per symbol
Larger model to be transmitted
Shannon took infinite sequences, i.e. k → ∞ !!
In practice, we have:
the model takes |S|^k (k * log |S|) + h² bits (where h might be |S|)
It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
huffman
“or”
tagging
7 bits
g
a
b
1
Codeword
g
C(T)
1
a 0 b
[bzip]
b
[]
1
1
b
g
1
[]
g
0
g
[not]
0
a
a
[or]
0
1
b
0
Byte-aligned codeword
a
T = “bzip or not bzip”
1
a
0
b
[]
b
b
0
b
space
bzip
1
a 0 b
[bzip]
g
a
b
g
a
or not
CGrep and other ideas...
P= bzip = 1a 0b
[Figure: compressed GREP — the codeword 1a 0b of P is searched directly in C(T), T = “bzip or not bzip”, using the tag bits to align on codeword boundaries]
Speed ≈ Compression ratio
You find this at
You find it under my Software projects
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary
P = bzip = 1a 0b
a, bzip, not, or, space
[Figure: as before — the codeword of P is searched directly in C(S), S = “bzip or not bzip”, exploiting the tag bits]
Speed ≈ Compression ratio
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: the pattern P slid along the text T]
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}^n)
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m
H(s) = sum_{i=1}^{m} 2^{m-i} * s[i]
P = 0101
H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5
s=s’ if and only if H(s)=H(s’)
Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1)
H(Tr) = 2 * H(Tr-1) - 2^m * T[r-1] + T[r+m-1]
T = 10110101
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2*11 - 2^4*1 + 0 = 22 - 16 = 6
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P=101111
q=7
H(P) = 47
Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally! (Horner, mod 7)
1*2 + 0 (mod 7) = 2
2*2 + 1 (mod 7) = 5
5*2 + 1 (mod 7) = 4
4*2 + 1 (mod 7) = 2
2*2 + 1 (mod 7) = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr-1):
2^m (mod q) = 2 * (2^{m-1} (mod q)) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
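A sketch of the fingerprint search in its verified (always-correct) variant; base 256 and the fixed Mersenne prime q are illustrative choices, whereas Karp-Rabin picks q at random:

```python
def karp_rabin(T, P, q=2**31 - 1):
    """Find all occurrences of P in T by comparing fingerprints mod q.
    Each fingerprint hit is verified by a direct comparison, so the
    output is always correct (deterministic variant)."""
    n, m = len(T), len(P)
    if m > n:
        return []
    base = 256
    hp = ht = 0
    for i in range(m):                       # Horner's rule, mod q
        hp = (hp * base + ord(P[i])) % q
        ht = (ht * base + ord(T[i])) % q
    top = pow(base, m - 1, q)                # weight of the leaving character
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:     # verify: rules out false matches
            occ.append(r)
        if r + m < n:                        # roll: drop T[r], append T[r+m]
            ht = ((ht - ord(T[r]) * top) * base + ord(T[r + m])) % q
    return occ
```

Positions are 0-based here; e.g. in T = 10110101 the pattern 0101 occurs at index 4, i.e. the slides' position 5.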
Problem 1: Solution
Dictionary
P = bzip = 1a 0b
a, bzip, not, or, space
[Figure: as before — the codeword of P is matched against C(S), S = “bzip or not bzip”, via Karp-Rabin fingerprints over the byte-aligned codewords]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for
[Figure: the m x n matrix M for T = california and P = for — row 3 has its only 1 at column 7, where the occurrence of “for” ends]
How does M solve the exact match problem?
How to construct M
We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one
Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:
And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.
Example: BitShift((0,1,1,0)) = (1,0,1,1)
Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit bit-parallelism to
compute the j-th column of M from the (j-1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x) is set to 1 for the
positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0), U(b) = (0,1,0,0,0), U(c) = (0,0,0,0,1)
How to construct M
Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by
M(j) = BitShift(M(j-1)) & U(T[j])
For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]
⇔ M(i-1,j-1) = 1 and the i-th bit of U(T[j]) = 1
BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing this with the i-th bit of U(T[j]) establishes if both conditions hold
An example (T = xabxabaaca, P = abaac)
j=1: M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)
j=2: M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
j=3: M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)
...
j=9: M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1):
the 5th bit is set, so an occurrence of P ends at position 9
Shift-And method: Complexity
If m<=w, any column and vector U() fit in a
memory word.
Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
Very often in practice. Recall that w=64 bits in
modern architectures.
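The whole method fits in a few lines once each column of M is kept in one integer (bit i−1 corresponds to row i; positions are 0-based):

```python
def shift_and(T, P):
    """Bit-parallel exact matching: column M(j) kept in one integer,
    bit i-1 set iff P[1..i] = T[j-i+1..j]."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)        # U(c): positions of c in P
    M = 0
    occ = []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & (1 << (m - 1)):               # top bit set => occurrence ends at j
            occ.append(j - m + 1)
    return occ
```

Python integers are unbounded, so the m > w case is handled transparently; in C one would keep ⌈m/w⌉ machine words.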
Some simple extensions
We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0), U(b) = (1,1,0,0,0), U(c) = (0,0,0,0,1)
What about ‘?’, ‘[^…]’ (not)?
Problem 1: Another solution
Dictionary
P = bzip = 1a 0b
a, bzip, not, or, space
[Figure: as before — S = “bzip or not bzip”; the codeword of P is now searched in C(S) via the Shift-And method]
Speed ≈ Compression ratio
Problem 2
Dictionary
Given a pattern P find
all the occurrences in S
of all terms containing
P as substring
a
bzip
not
or
P=o
[Figure: as before — C(S) for S = “bzip or not bzip”, dictionary {a, bzip, not, or, space}; the terms containing P = o are “not” = 1g 0g 0a and “or” = 1g 0a 0b]
Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: the patterns P1, P2 slid along the text T]
Naïve solution
Use an (optimal) Exact Matching Algorithm searching each
pattern in P
Complexity: O(nl+m) time, not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns in P
R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method searching for S:
For any symbol c, U’(c) = U(c) AND R, i.e.
U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
For any step j:
compute M(j), then OR it with U’(T[j]). Why?
This sets to 1 the first bit of each pattern that starts with T[j]
Check if there are occurrences ending in j. How?
Problem 3
Dictionary
Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches
a
bzip
not
or
P = bot, k = 2
[Figure: as before — C(S) for S = “bzip or not bzip”, dictionary {a, bzip, not, or, space}]
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa
aatatccacaa
atcgaa
Agrep
Our current goal: given k, find all the occurrences of P
in T with up to k mismatches
We define the matrix M^l to be an m by n binary
matrix, such that:
M^l(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at position j.
What is M0?
How does Mk solve the k-mismatch problem?
Computing Mk
We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
Computing Ml: case 1
The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.
(figure: P[1, i-1] aligned against T[…, j-1], mismatches marked with *)
This case contributes: BitShift( M^l(j-1) ) AND U( T[j] )
Computing Ml: case 2
The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.
(figure: P[1, i-1] aligned against T[…, j-1], mismatches marked with *)
This case contributes: BitShift( M^(l-1)(j-1) )
Computing Ml
We compute M^l for all l = 0, …, k.
For each j compute M^0(j), M^1(j), …, M^k(j).
For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a
match iff
  M^l(j) = [ BitShift( M^l(j-1) ) AND U( T[j] ) ] OR BitShift( M^(l-1)(j-1) )
Example M1
T = xabxabaaca
P = abaad

M^1 =  j: 1 2 3 4 5 6 7 8 9 10
  i=1     1 1 1 1 1 1 1 1 1 1
  i=2     0 0 1 0 0 1 0 1 1 0
  i=3     0 0 0 1 0 0 1 0 0 1
  i=4     0 0 0 0 1 0 0 1 0 0
  i=5     0 0 0 0 0 0 0 0 1 0

M^0 =  j: 1 2 3 4 5 6 7 8 9 10
  i=1     0 1 0 0 1 0 1 1 0 1
  i=2     0 0 1 0 0 1 0 0 0 0
  i=3     0 0 0 0 0 0 1 0 0 0
  i=4     0 0 0 0 0 0 0 1 0 0
  i=5     0 0 0 0 0 0 0 0 0 0
How much do we pay?
The running time is O(kn(1 + m/w)).
Again, the method is practically efficient for
small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.
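The recurrence above can be sketched bit-parallel in Python, keeping one integer per column M^l (a sketch checked against the slide's example T = xabxabaaca, P = abaad):

```python
# Bit-parallel sketch of the k-mismatch search: M[l] holds column M^l(j) as an
# integer; BitShift(x) = (x << 1) | 1 as in the slides.
def agrep_mismatch(P, T, k):
    m = len(P)
    U = {}                                  # U[c]: positions i with P[i] == c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    last = 1 << (m - 1)
    M = [0] * (k + 1)                       # columns M^0(j), ..., M^k(j)
    occ = []
    for j, c in enumerate(T):
        prev = M[:]                         # columns at step j-1
        Uc = U.get(c, 0)
        M[0] = ((prev[0] << 1) | 1) & Uc
        for l in range(1, k + 1):
            # case 1 (chars equal) OR case 2 (one more mismatch spent)
            M[l] = ((((prev[l] << 1) | 1) & Uc)
                    | ((prev[l - 1] << 1) | 1))
        if M[k] & last:
            occ.append(j - m + 1)           # 0-based start of an occurrence
    return occ
```

On the example of the previous slides, the only occurrence with k = 1 starts at (1-based) position 5, i.e. 0-based 4.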
Problem 3: Solution
Dictionary
Given a pattern P, find
all the occurrences in S
of all terms containing
P as a substring, allowing
at most k mismatches
(figure: the same compacted trie C(S) of S = “bzip or not bzip”, searched with
P = bot and k = 2; the branches spelling “bzip” and “not” match: “yes”)
Agrep: more sophisticated operations
The Shift-And method can support other operations
The edit distance between two strings p and s is
d(p,s) = the minimum number of operations
needed to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol of p into a different one
Example: d(ananas, banane) = 3
Search by regular expressions
Example: (a|b)?(abc|a)
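The edit distance just defined admits the classical dynamic-programming computation; a minimal sketch:

```python
# Standard DP for edit distance: D[i][j] = d(p[:i], s[:j]).
def edit_distance(p, s):
    m, n = len(p), len(s)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                                   # delete all of p[:i]
    for j in range(n + 1):
        D[0][j] = j                                   # insert all of s[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + 1,            # deletion
                          D[i][j - 1] + 1,            # insertion
                          D[i - 1][j - 1] + (p[i - 1] != s[j - 1]))  # substitution
    return D[m][n]
```

It reproduces the slide's example d(ananas, banane) = 3.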
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
g-code for integer encoding
  g(x) = (Length-1 zeros) followed by x in binary,
  where x > 0 and Length = floor(log2 x) + 1
e.g., 9 is represented as <000, 1001>.
The g-code for x takes 2*floor(log2 x) + 1 bits
(i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of g-coded
integers, reconstruct the original sequence:
0001000001100110000011101100111
(Answer: 8, 6, 3, 59, 7)
Analysis
Sort the p_i in decreasing order, and encode symbol s_i via
the variable-length code g(i).
Recall that: |g(i)| ≤ 2 * log2 i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
  1 ≥ Σ_{i=1,…,x} p_i ≥ x * p_x, hence x ≤ 1/p_x
How good is it?
Encode the integers via g-coding: |g(i)| ≤ 2 * log2 i + 1
The cost of the encoding is (recall i ≤ 1/p_i):
  Σ_{i=1,…,|S|} p_i * |g(i)| ≤ Σ_{i=1,…,|S|} p_i * [2 * log2(1/p_i) + 1] = 2 * H0(X) + 1
Not much worse than Huffman,
and improvable to H0(X) + 2 + …
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman
End-tagged dense code
The rank r is mapped to r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix-code
Better compression: it uses all 7-bits configurations
(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is:
Previously we used: s = c = 128
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte,
s*c with 2 bytes, s*c^2 with 3 bytes, …
An example
5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on at most 2
bytes, hence more words on 1 byte, and thus better if skewed…
Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.
Brute-force approach
Binary search: on real distributions, there seems to be a unique minimum
  K_s = max codeword length
  F_s^k = cumulative probability of the symbols whose |cw| ≤ k
Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
Exploits temporal locality, and it is dynamic
X = 1^n 2^n 3^n … n^n:  Huff = O(n^2 log n), MTF = O(n log n) + n^2 bits
Not much worse than Huffman…
…but it may be far better
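The two MTF steps above can be sketched directly (a toy list-based implementation; a real coder would then g-encode the output integers):

```python
# Move-to-Front: emit the position of each symbol in L, then move it to the front.
def mtf_encode(s, alphabet):
    L = list(alphabet)
    out = []
    for c in s:
        i = L.index(c)           # 1) position of c in L (0-based here)
        out.append(i)
        L.insert(0, L.pop(i))    # 2) move c to the front of L
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        c = L[i]
        out.append(c)
        L.insert(0, L.pop(i))    # mirror the encoder's list update
    return "".join(out)
```

Repeated symbols map to small integers (position 0), which is where the temporal locality shows up.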
MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 * log2 i + 1
Put S in front and consider the cost of encoding; if p_1^x < p_2^x < … are the
consecutive positions of symbol x, the cost is:
  O(|S| log |S|) + Σ_{x=1,…,|S|} Σ_i |g( p_i^x - p_{i-1}^x )|
By Jensen’s inequality, this is at most:
  O(|S| log |S|) + Σ_{x=1,…,|S|} n_x * [ 2 * log2(N/n_x) + 1 ]
  = O(|S| log |S|) + N * [ 2 * H0(X) + 1 ]
Hence L_a[mtf] ≤ 2 * H0(X) + O(1) bits per symbol
MTF: higher compression
To achieve higher compression we consider words (and
separators) as the symbols to be encoded
How to maintain the MTF-list efficiently:
Search tree:
  leaves contain the symbols, ordered as in the MTF-list
  nodes contain the size of their descending subtree
Hash table:
  the key is a symbol
  the data is a pointer to the corresponding tree leaf
Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, the run lengths plus one starting bit suffice
Properties:
Exploits spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n
There is a memory
Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).
e.g. with p(a) = .2, p(b) = .5, p(c) = .3, so f(a) = .0, f(b) = .2, f(c) = .7:
(figure: the unit interval [0,1) split into a = [.0,.2), b = [.2,.7), c = [.7,1.0))
In general, f(i) = Σ_{j<i} p(j)
The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
(figure: coding “bac” narrows the interval step by step:
[0,1) -b-> [.2,.7) -a-> [.2,.3) -c-> [.27,.3))
The final sequence interval is [.27,.3)
Arithmetic Coding
To code a sequence of symbols c_1 … c_n with probabilities p[c],
use the following recurrences:
  l_0 = 0,  l_i = l_{i-1} + s_{i-1} * f[c_i]
  s_0 = 1,  s_i = s_{i-1} * p[c_i]
f[c] is the cumulative probability up to symbol c (excluded).
The final interval size is
  s_n = Π_{i=1,…,n} p[c_i]
The interval for a message sequence will be called the
sequence interval
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the
final interval uniquely determines the message.
Decoding is similar to encoding, but at each
step we need to determine the symbol
and then reduce the interval accordingly.
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
(figure: .49 lies in b = [.2,.7); within it, .49 lies in b = [.3,.55);
within that, .49 lies in c = [.475,.55))
The message is bbc.
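The two recurrences can be sketched with plain floats, using the slides' toy distribution (fine only for short messages; real coders use the integer version discussed later):

```python
# Toy arithmetic-coding intervals for the 3-symbol example of the slides.
p = {"a": 0.2, "b": 0.5, "c": 0.3}
f = {"a": 0.0, "b": 0.2, "c": 0.7}   # cumulative prob. up to the symbol, excluded

def seq_interval(msg):
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]    # the recurrences l_i, s_i
    return l, s

def decode_number(x, n):
    msg = []
    for _ in range(n):                   # n = message length, known to the decoder
        for c in "abc":
            if f[c] <= x < f[c] + p[c]:
                msg.append(c)
                x = (x - f[c]) / p[c]    # rescale into the chosen symbol interval
                break
    return "".join(msg)
```

It reproduces both slide examples: the interval of “bac” is [.27, .3), and .49 with length 3 decodes to “bbc”.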
Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .010101…      11/16 = .1011
Algorithm (emitting the bits of x in [0,1)):
  1. x = 2*x
  2. If x < 1, output 0
  3. else x = x - 1; output 1
So how about just using the shortest binary
fractional representation in the sequence interval?
e.g. [0,.33) = .01     [.33,.66) = .1     [.66,1) = .11
Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
number    min      max       interval
.11       .110     .111…     [.75, 1.0)
.101      .1010    .1011…    [.625, .75)
We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).
(figure: the sequence interval [.61, .79) contains the code interval of .101,
namely [.625, .75))
Can use L + s/2 truncated to 1 + ceil(log2(1/s)) bits
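This truncation can be sketched with the doubling algorithm from the previous slide (float precision is adequate only for short messages):

```python
import math

# Emit L + s/2 truncated to 1 + ceil(log2(1/s)) bits: a dyadic number whose
# code interval fits inside the sequence interval [L, L+s).
def code_bits(L, s):
    nbits = 1 + math.ceil(math.log2(1.0 / s))
    x = L + s / 2.0
    bits = []
    for _ in range(nbits):
        x *= 2                 # the doubling algorithm
        if x < 1:
            bits.append("0")
        else:
            bits.append("1")
            x -= 1
    return "".join(bits)
```

For the final interval [.27, .3) of “bac”, this yields 7 bits whose code interval is contained in [.27, .3).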
Bound on Arithmetic length
Note that -log s + 1 = log(2/s).
Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
  1 + log2(1/s) = 1 + log2 Π_{i=1,…,n} (1/p_i)
               ≤ 2 + Σ_{i=1,…,n} log2(1/p_i)
               = 2 + Σ_{k=1,…,|S|} n * p_k * log2(1/p_k)
               = 2 + n * H0 bits
In practice ≈ n*H0 + 0.02*n bits, because of rounding.
Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:
Keep integers in range [0..R), where R = 2^k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s; set m = 0
  Message interval is expanded by 2
If u < R/2 then (bottom half)
  Output 0 followed by m 1s; set m = 0
  Message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2
In all other cases, just continue…
You find this at
Arithmetic ToolBox
As a state machine
(figure: the Arithmetic ToolBox as a state machine: from state (L,s), the
distribution (p_1,…,p_|S|) and the next symbol c yield the new state (L’,s’))
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if we have seen “th” 12 times, followed by “e” 7 of those times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
(figure: the same ATB state machine, now driven by p[ s | context ],
where s = c or esc)
Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)
PPM: Example Contexts
String = ACCBACCACBA B,  k = 2

Context: Empty
  Counts: A = 4, B = 2, C = 5, $ = 3

Contexts of length 1:
  A:  C = 3, $ = 1
  B:  A = 2, $ = 1
  C:  A = 1, B = 2, C = 2, $ = 3

Contexts of length 2:
  AC:  B = 1, C = 2, $ = 2
  BA:  C = 1, $ = 1
  CA:  C = 1, $ = 1
  CB:  A = 2, $ = 1
  CC:  A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
(figure: the dictionary is the already-scanned prefix of the text; the cursor
marks the current position, and all substrings starting before it are
available; this step outputs <2,3,c>)
Algorithm’s step:
Output <d, len, c> where:
  d = distance of the copied string wrt the current position
  len = length of the longest match
  c = next char in the text beyond the longest match
Advance by len + 1
A buffer “window” of fixed length moves over the text
Example: LZ77 with window
a a c a a c a b c a b a a a c
(0,0,a)
a a c a a c a b c a b a a a c
(1,1,c)
a a c a a c a b c a b a a a c
(3,4,b)
a a c a a c a b c a b a a a c
(3,3,a)
a a c a a c a b c a b a a a c
(1,2,c)
Window size = 6
Longest match
within W
Next character
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if l > d? (overlap with text to be compressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
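The two loops above can be sketched with the trie kept as a plain dict, reproducing the ids of the example:

```python
# LZ78: the trie is a dict keyed by (parent_id, char); id 0 is the empty string.
def lz78_encode(s):
    dic = {}
    out, cur, nxt = [], 0, 1
    for c in s:
        if (cur, c) in dic:
            cur = dic[(cur, c)]      # extend the current match S
        else:
            out.append((cur, c))     # output (id of S, next char c)
            dic[(cur, c)] = nxt      # add Sc to the dictionary
            nxt, cur = nxt + 1, 0
    if cur:                          # input ended inside a dictionary phrase
        out.append((cur, ""))
    return out

def lz78_decode(codes):
    words = {0: ""}
    out, nxt = [], 1
    for i, c in codes:
        w = words[i] + c             # look up the id, append the explicit char
        out.append(w)
        words[nxt] = w               # rebuild the same dictionary
        nxt += 1
    return "".join(out)
```

Running the encoder on the example string yields exactly the six (id, char) pairs of the slide.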
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
LZW: Encoding Example
Output
Dict.
a a b a a c a b a b a c b
112 256=aa
a a b a a c a b a b a c b
112 257=ab
a a b a a c a b a b a c b
113
258=ba
a a b a a c a b a b a c b
256
259=aac
a a b a a c a b a b a c b
114
260=ca
a a b a a c a b a b a c b
257
261=aba
a a b a a c a b a b a c b
261
262=abac
a a b a a c a b a b a c b
114
263=cb
LZW: Decoding Example
Input   Output so far              Dict
112     a
112     a a                        256 = aa
113     a a b                      257 = ab
256     a a b a a                  258 = ba
114     a a b a a c                259 = aac
257     a a b a a c a b            260 = ca
261     a a b a a c a b ?          (261 is not yet in the dictionary!)
        a a b a a c a b a b a      261 = aba (resolved one step later)
114     a a b a a c a b a b a c    262 = abac
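A decoding sketch including the special one-step-behind case; the toy code assignments a = 112, b = 113, c = 114 follow the slides, and the trailing code 113 (for the final b, which the slide's example stops before) is an assumption added to complete the round trip:

```python
# LZW decoding with the one-step-behind dictionary. When a code equals the next
# free id, it is the special SSc case: the phrase is prev + prev[0].
def lzw_decode(codes, first_code=256):
    dic = {112: "a", 113: "b", 114: "c"}   # toy alphabet, as in the slides
    prev = dic[codes[0]]
    out = [prev]
    nxt = first_code
    for code in codes[1:]:
        if code in dic:
            cur = dic[code]
        else:                              # code == nxt: not yet in the dictionary
            cur = prev + prev[0]
        dic[nxt] = prev + cur[0]           # the entry the encoder added one step earlier
        nxt += 1
        out.append(cur)
        prev = cur
    return "".join(out)
```

On the example stream, code 261 arrives while the decoder only knows up to 260, and is resolved as “aba”.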
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
Take the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows
F                 L
#  mississipp     i
i  #mississip     p
i  ppi#missis     s
i  ssippi#mis     s
i  ssissippi#     m
m  ississippi     #
p  i#mississi     p
p  pi#mississ     i
s  ippi#missi     s
s  issippi#mi     s
s  sippi#miss     i
s  sissippi#m     i
(Burrows-Wheeler, 1994)
A famous example
Much
longer...
A useful tool: the L → F mapping
(the same sorted-rotations matrix as above: the columns F and L are known,
while the rest of each row is unknown)
How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...
Take two equal L’s chars
Rotate rightward their rows
Same relative order !!
The BWT is invertible
(again the sorted-rotations matrix: F and L are known, the middle is unknown)
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i ppi #
InvertBWT(L)
  Compute LF[0, n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
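Both directions can be sketched in Python (quadratic construction via explicit rotations, suitable only for toy inputs; it assumes ‘#’ is the smallest symbol and appears once, at the end):

```python
# BWT via sorted rotations, and inversion via the LF mapping.
def bwt(t):
    rots = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rots)

def ibwt(L):
    n = len(L)
    # Stable sort of L gives F; LF[r] = row of F corresponding to L[r]
    order = sorted(range(n), key=lambda r: (L[r], r))
    LF = [0] * n
    for f_row, r in enumerate(order):
        LF[r] = f_row
    out = []
    r = 0                       # row 0 of the sorted matrix starts with '#'
    for _ in range(n):
        out.append(L[r])        # L[r] precedes F[r] in T
        r = LF[r]
    s = "".join(reversed(out))  # this is '#' followed by T without its final '#'
    return s[1:] + s[0]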
How to compute the BWT ?
SA      BWT matrix       L
12      #mississipp      i
11      i#mississip      p
 8      ippi#missis      s
 5      issippi#mis      s
 2      ississippi#      m
 1      mississippi      #
10      pi#mississi      p
 9      ppi#mississ      i
 7      sippi#missi      s
 4      sissippi#mi      s
 6      ssippi#miss      i
 3      ssissippi#m      i
We said that: L[i] precedes F[i] in T
L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
How to construct SA from T ?
SA    suffix
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#
Elegant but inefficient
Input: T = mississippi#
Obvious inefficiencies:
• Θ(n^2 log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33% compression ratio, but Bzip is slower in (de)compression!
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii
# at 16
Mtf = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code
RLE0 = 03141041403141410210
Alphabet
|S|+1
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google, 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links every week
Average page lifetime ≈ 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node we can reach any other via an
undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node we can reach any other via a
directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
The largest artifact ever conceived by humans
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph (undirected)
  V = routers
  E = communication links
The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages
Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q and has been clicked by some
  user who issued q
Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email, …)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001):
indegree follows a power-law distribution:
  Pr[ in-degree(u) = k ]  ∝  1/k^a,  with a ≈ 2.1
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
(figure: a 21-million-page, 150-million-link web graph; after URL-sorting,
the links cluster into dense blocks, e.g. Berkeley and Stanford hosts)
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s1 - x, s2 - s1 - 1, …, sk - s(k-1) - 1}  (gap encoding)
For negative entries, map v ≥ 0 to 2v and v < 0 to 2|v| - 1 (as in the residual examples below)
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.
Consecutivity in extra-nodes
Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
data
knowledge
about data
at receiver
sender
receiver
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
What if the sender has never seen the data at the receiver?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression   [diff, zdelta, REBL, …]
  Compress file f deploying file f’
  Compress a group of files
  Speed up web access by sending the difference between the requested
  page and the ones available in cache
File synchronization   [rsync, zsync]
  Client updates an old file f_old with f_new available on a server
  Mirroring, shared crawling, content distribution networks
Set reconciliation
  Client updates a structured old file f_old with f_new available on a server
  Update of contacts or appointments, intersection of inverted lists in a P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
The LZ77-scheme provides an efficient, optimal solution:
treat f_known as the “previously encoded” text and compress the concatenation f_known f_new, starting from f_new
zdelta is one of the best implementations
           Emacs size    Emacs time
uncompr    27Mb          ---
gzip       8Mb           35 secs
zdelta     1.5Mb         42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
(figure: Client <-> Proxy <- slow link -> Proxy <-> web; only the delta-encoding
of the requested page wrt a cached reference crosses the slow link)
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
(figure: a weighted graph over the files plus a dummy node 0; edge weights are
zdelta sizes such as 20, 123, 220, 620, 2000)

           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic
Improvement: what about many-to-one compression (of a group of files)?
Problem: constructing G is very costly: n^2 edge calculations (zdelta executions)
We wish to exploit some pruning approach:
Collection analysis: cluster the files that appear similar and are thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’_F containing only the edges between those pairs of files
Assign weights: estimate appropriate edge weights for G’_F, thus saving
zdelta executions. Nonetheless, still n^2 time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
(figure: the client holds f_old, the server holds f_new; the client requests an
update and must obtain f_new without the server sending it entirely)
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
(figure: the client sends block hashes of f_old; the server replies with an
encoded f_new)
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
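The 4-byte rolling hash can be sketched Adler-style (the modulus and the packing into one integer are illustrative assumptions, not rsync's exact constants):

```python
# Adler-style rolling hash over fixed-size blocks: two 16-bit sums a, b.
MOD = 1 << 16

def block_hash(data):
    a = sum(data) % MOD
    b = sum((len(data) - i) * x for i, x in enumerate(data)) % MOD
    return (b << 16) | a

def roll(h, out_byte, in_byte, blocklen):
    # Slide the window one byte: drop out_byte, append in_byte, in O(1).
    a = ((h & 0xFFFF) - out_byte + in_byte) % MOD
    b = (((h >> 16) & 0xFFFF) - blocklen * out_byte + a) % MOD
    return (b << 16) | a
```

The O(1) slide is what lets the server test every alignment of the incoming file against the client's block hashes; a strong checksum (e.g. MD5, as above) confirms candidate matches.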
Rsync: some experiments
          gcc size    emacs size
total     27288       27326
gzip      7563        8577
zdelta    227         1431
rsync     964         4452

Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
Server sends hashes (unlike rsync, where the client sends them); the client checks them
Server deploys the common f_ref to compress the new f_tar (rsync just compresses it)
A multi-round protocol
k blocks of n/k elements each
log(n/k) levels
If the distance is k, then on each level at most k hashes do not find a match in the other file
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (ie. T[i,N])
(figure: P aligned at position i of T, as a prefix of the suffix T[i,N])
Occurrences of P in T = All suffixes of T having P as a prefix
Example: P = si, T = mississippi → P occurs at positions 4 and 7
SUF(T) = Sorted set of suffixes of T
Reduction
From substring search
To prefix search
The Suffix Tree
(figure: the suffix tree of T# = mississippi#; edges are labeled with substrings
such as “si”, “ssi”, “ppi#”, “i#”, and the leaves store the starting positions
1, …, 12 of the suffixes)
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N^2) space; store the suffix pointers instead:

SA     SUF(T)
12     #
11     i#
 8     ippi#
 5     issippi#
 2     ississippi#
 1     mississippi#
10     pi#
 9     ppi#
 7     sippi#
 4     sissippi#
 6     ssippi#
 3     ssissippi#

T = mississippi#,  P = si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
(figure: a binary-search step on SA for P = si over T = mississippi#;
here “P is larger”; 2 accesses per step)
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
(figure: the next binary-search step, where “P is smaller”)
Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
overall, O(p log2 N) time
O(p + log2 N)  [Manber-Myers, ’90]
O(p + log2 |S|)  [Cole et al., ’06]
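The O(p log N) search can be sketched as follows (0-based positions; the quadratic suffix-array construction is for illustration only, as noted for the naïve approach):

```python
# Substring search via binary search on the suffix array: the suffixes prefixed
# by p form a contiguous SA range (Prop 1).
def suffix_array(t):
    return sorted(range(len(t)), key=lambda i: t[i:])   # naive, Θ(N^2 log N) worst case

def occurrences(t, sa, p):
    lo, hi = 0, len(sa)
    while lo < hi:                                      # leftmost suffix with prefix >= p
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    lo, hi = first, len(sa)
    while lo < hi:                                      # just past the suffixes with prefix == p
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])
```

On T = mississippi#, searching P = si returns the (0-based) positions 3 and 6, i.e. the 1-based 4 and 7 of the slides.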
Locating the occurrences
(figure: the occurrences of P = si are contiguous in SA: the entries 7 (sippi…)
and 4 (sissippi…), hence occ = 2)
T = mississippi#
Suffix Array search
• O(p + log2 N + occ) time
Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ‘06]
String B-tree
[Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
Text mining
Lcp[1, N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp    SA     suffix
       12     #
 0     11     i#
 1      8     ippi#
 1      5     issippi#
 4      2     ississippi#
 0      1     mississippi#
 0     10     pi#
 1      9     ppi#
 0      7     sippi#
 2      4     sissippi#
 1      6     ssippi#
 3      3     ssissippi#

T = mississippi#
e.g., Lcp = 4 for the adjacent suffixes issippi# and ississippi#
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L?
• Search for an entry Lcp[i] ≥ L
• Is there a substring of length ≥ L occurring ≥ C times?
• Search for a window Lcp[i, i+C-2] whose entries are all ≥ L
Algoritmi per IR
Prologo
Paradigm shift...
Web 2.0 is about the many
Big DATA vs Big PC?
We have three types of algorithms:
T1(n) = n,  T2(n) = n^2,  T3(n) = 2^n
… and assume that 1 step = 1 time unit
How much input data n can each algorithm process
within t time units?
n1 = t,  n2 = √t,  n3 = log2 t
What about a k-times faster processor?
… or, what is n when the available time is k*t?
n1 = k*t,  n2 = √k * √t,  n3 = log2(kt) = log2 k + log2 t
A new scenario
Data are more available than even before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Step cost is Ω(1) time
Not just MIN #steps…

The memory hierarchy:
  CPU registers
  Cache:  few Mbs,   some nanosecs,     few words fetched
  RAM:    few Gbs,   tens of nanosecs,  some words fetched
  HD:     few Tbs,   few millisecs,     B = 32K page
  net:    many Tbs,  even secs,         packets
You should be “??-aware programmers”
I/O-conscious Algorithms
(figure: a hard disk: track, read/write head, read/write arm, magnetic surface)
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3-0.4 (Hennessy-Patterson)]
C = cost of an I/O [10^5-10^6 steps (Hennessy-Patterson)]
If N = (1+f)*M, then the disk-average cost per step is:
  C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
  (1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)
Space-conscious Algorithms
(figure: compressed data structures trade I/Os for search/access operations)
Streaming Algorithms
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
(figure: the memory hierarchy again — registers, L1/L2 cache, RAM, HD, net — with their sizes, access times and block-transfer units)
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
n:    4K   8K   16K   32K    128K   256K   512K   1M
n³:   22s  3m   26m   3.5h   28h    --     --     --
n²:   0    0    0     1s     26s    106s   7m     28m
An optimal solution
We assume every subsum ≠ 0
(figure: A splits into a prefix with sum < 0, followed by the window achieving the Optimum)
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
sum = 0; max = -1;
For i = 1,...,n do
  If (sum + A[i] ≤ 0) sum = 0;
  else { sum += A[i]; max = MAX{max, sum}; }
Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
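A minimal Python sketch of the one-pass scan above (function name is mine):

```python
def max_subarray_sum(A):
    # One-pass scan: drop the running prefix as soon as
    # its sum becomes non-positive; track the best sum seen.
    best = float("-inf")
    run = 0
    for x in A:
        if run + x <= 0:
            run = 0
        else:
            run += x
            best = max(best, run)
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray_sum(A))  # best window is 6+1-2+4+3 = 12
```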
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions ⇒ data get distributed arbitrarily !!!
B-tree internal nodes
B-tree leaves
(“tuple pointers")
What about
listing tuples
in order ?
Tuples
Possibly 10⁹ random I/Os = 10⁹ * 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;          // Divide
03   Merge-Sort(A,i,m);    // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)        // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10⁹ tuples ⇒ few GBs
Typical disk (Seagate Cheetah 150GB): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log₂ n) random I/Os
[5ms] * n log₂ n ≈ 1.5 years
In practice, it is faster because of caching...
2 passes (R/W) per merge level
Merge-Sort Recursion Tree
log2 N
If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
(figure: the merge-sort recursion tree — sorted runs are pairwise merged, level by level, for log₂ N levels)
How do we deploy the disk/memory features?
N/M runs, each sorted in internal memory (no I/Os)
I/O-cost for merging is ≈ 2 (N/B) log₂ (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main memory M and disk pages of size B:
Pass 1: Produce (N/M) sorted runs.
Pass i: merge X = M/B runs ⇒ log_{M/B}(N/M) passes
(figure: X input buffers and one output buffer, each of B items, in main memory; runs stream from disk and the merged run streams back to disk)
Multiway Merging
(figure: X = M/B runs are merged in one pass — each run keeps its current page in a buffer Bf1, …, BfX with cursor p1, …, pX; repeatedly output min(Bf1[p1], Bf2[p2], …, BfX[pX]) into the output buffer Bfo; fetch the next page of run i when pi = B; flush Bfo into the merged run when full, until EOF)
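An in-memory sketch of the multiway merging step (heap-based selection of the minimum head; page buffering is elided, and the function name is mine):

```python
import heapq

def multiway_merge(runs):
    # Merge X sorted runs by repeatedly extracting the minimum head,
    # as in the slide's min(Bf1[p1], ..., BfX[pX]) selection.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, p = heapq.heappop(heap)
        out.append(val)
        if p + 1 < len(runs[i]):                 # advance cursor of run i
            heapq.heappush(heap, (runs[i][p + 1], i, p + 1))
    return out

print(multiway_merge([[1, 2, 5, 10], [2, 7, 9, 13], [3, 4, 11, 12]]))
```

Each extraction costs O(log X), so merging N items takes O(N log(M/B)) CPU time per pass.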
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≤ log_{M/B} (N/M)
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os
In practice
M/B ≈ 1000 ⇒ #passes = log_{M/B}(N/M) ≈ 1
One multiway merge ⇒ 2 passes = few mins
Tuning depends on disk features
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
Can compression help?
Goal: enlarge M and reduce N
#passes = O(log_{M/B} (N/M))
Cost of a pass = O(N/B)
Part of Vitter’s paper…
In order to address issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how far we can go
Toy problem #3: Top-freq elements
Goal: Top queries over a stream of N items (with a large alphabet S).
Math Problem: Find the item y whose frequency is > N/2, using the smallest space (i.e., the mode, provided it occurs > N/2 times).
A = b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables X, C; take the first item as X, with C = 1.
For each subsequent item s of the stream:
  if (X == s) then C++
  else { C--; if (C == 0) { X = s; C = 1; } }
Return X;
Proof
(Note: the algorithm may return a wrong item if the mode occurs ≤ N/2 times.)
If X ≠ y at the end, then every one of y's occurrences has a distinct "negative" mate.
Hence these mates should be ≥ #occ(y).
As a result, 2 * #occ(y) ≤ N, contradicting #occ(y) > N/2.
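The scan above, as a Python sketch (function name is mine; it assumes a majority item exists):

```python
def majority_candidate(stream):
    # Boyer-Moore majority vote: one pass, O(1) space.
    it = iter(stream)
    X = next(it)
    C = 1
    for s in it:
        if X == s:
            C += 1
        else:
            C -= 1
            if C == 0:            # adopt the current item as new candidate
                X, C = s, 1
    return X

A = "bacccdcbaaaccbccc"          # the slide's stream, c occurs 9/17 times
print(majority_candidate(A))
```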
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 * 10⁹ characters ⇒ size = 6GB
n = 10⁶ documents
TotT = 10⁹ total term occurrences (avg term length is 6 chars)
t = 5 * 10⁵ distinct terms
What kind of data structure should we build to support word-based searches?
Solution 1: Term-Doc matrix
t = 500K terms (rows), n = 1 million documents (columns); an entry is 1 if the play contains the word, 0 otherwise

           Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            1                1             0          0       0        1
Brutus            1                1             0          1       0        0
Caesar            1                1             0          1       1        1
Calpurnia         0                1             0          0       0        0
Cleopatra         1                0             0          0       0        0
mercy             1                0             1          1       1        1
worser            1                0             1          1       1        0

Space is 500Gb !
Solution 2: Inverted index

Brutus    → 2 4 8 16 32 64 128
Caesar    → 1 2 3 5 8 13 21 34
Calpurnia → 13 16

We can still do better: i.e. 30–50% of the original text
1. Typically we use about 12 bytes per posting
2. We have 10⁹ total terms ⇒ at least 12GB of space
3. Compressing the 6GB of documents gets 1.5GB of data
A better index, but it is still >10 times the text !!!!
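A toy sketch of building and intersecting such posting lists (the documents and function name are made up for illustration):

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each term to the sorted list of doc-ids containing it.
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in set(text.split()):
            index[term].append(doc_id)
    return index

docs = ["brutus killed caesar", "caesar met calpurnia", "brutus fled"]
idx = build_inverted_index(docs)
# AND-query = intersection of two posting lists
print(sorted(set(idx["brutus"]) & set(idx["caesar"])))
```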
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in fewer bits?
NO, they are 2ⁿ but we have fewer compressed messages:
∑_{i=1}^{n−1} 2^i = 2ⁿ − 2
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probabilities p(s), the self-information of s is:
i(s) = log₂ (1/p(s)) = −log₂ p(s)
Lower probability ⇒ higher information
Entropy is the weighted average of i(s):
H(S) = ∑_{s∈S} p(s) log₂ (1/p(s)) bits
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable-length code assigns a bit string (codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be uniquely decomposed into its codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
(figure: the binary trie with leaves a = 0, b = 100, c = 101, d = 11)
Average Length
For a code C with codeword lengths L[s], the average length is defined as
La(C) = ∑_{s∈S} p(s) * L[s]
We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths and thus the same optimal average length. And vice versa…
Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn} then pi < pj ⇒ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that La(C) ≤ H(S) + 1
The Shannon code takes ⌈log₂ (1/p(s))⌉ bits per symbol
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
(figure: merge a(.1) + b(.2) → (.3); then (.3) + c(.2) → (.5); then (.5) + d(.5) → (1); label left branches 0 and right branches 1)
a=000, b=001, c=01, d=1
There are 2^(n−1) “equivalent” Huffman trees
What about ties (and thus, tree depth) ?
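The running example can be reproduced with a heap-based construction (a sketch; the tie-breaking counter is my addition to keep heap entries comparable):

```python
import heapq
from itertools import count

def huffman_code(probs):
    # Repeatedly merge the two least-probable trees; each merge
    # prepends 0 to one subtree's codewords and 1 to the other's.
    tick = count()
    heap = [(p, next(tick), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tick), merged))
    return heap[0][2]

code = huffman_code({"a": .1, "b": .2, "c": .2, "d": .5})
print(sorted(len(w) for w in code.values()))  # lengths 1, 2, 3, 3 as on the slide
```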
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
With the code a=000, b=001, c=01, d=1:
abc... ⇒ 000 001 01... = 00000101...
101001... ⇒ d c b ...
(figure: the codeword tree used for encoding and decoding)
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store for any level L:
firstcode[L] (the first codeword on level L, of the form 00.....0)
Symbol[L,i], for each i in level L
This takes ≤ h² + |S| log |S| bits
Canonical Huffman
Encoding
Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0
T=...00010...
Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
−log₂(.999) ≈ .00144 bits
If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
1 extra bit per macro-symbol = 1/k extra bits per symbol
Larger model to be transmitted
Shannon took infinite sequences, and k → ∞ !!
In practice, we have:
The model takes |S|^k * (k * log |S|) + h² bits (where h might be |S|)
It is H₀(S^L) ≤ L * H_k(S) + O(k * log |S|), for each k ≤ L
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the Huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
(figure: each output byte carries 7 Huffman bits plus a tag bit marking codeword beginnings; e.g., with codewords bzip = 1a 0b, or = 1g 0a 0b, not = 1g 0g 0a, the text T = "bzip or not bzip" is encoded byte by byte into C(T))
CGrep and other ideas...
P = bzip = 1a 0b
(figure: GREP is run directly on C(T), T = "bzip or not bzip"; each byte-aligned candidate position is answered yes/no, and the tag bits rule out false alignments)
Speed ≈ Compression ratio
You find this at
You find it under my Software projects
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary: a, bzip, not, or, space
P = bzip = 1a 0b
(figure: the compressed pattern is matched directly against C(S), S = "bzip or not bzip"; each byte-aligned candidate position is answered yes/no)
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
(figure: the pattern P slid along the text T, e.g. P = AB over T = …ABCABDAB…)
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which Arithmetic and Bit operations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with small probability.
A randomized algorithm that never errs, whose running time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}ⁿ).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:
H(s) = ∑_{i=1}^{m} 2^(m−i) * s[i]
P = 0101
H(P) = 2³*0 + 2²*1 + 2¹*0 + 2⁰*1 = 5
s = s’ if and only if H(s) = H(s’)
Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr−1):
H(Tr) = 2 * H(Tr−1) − 2^m * T(r−1) + T(r+m−1)
T = 10110101
T1 = 1011
T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2 * 11 − 2⁴ * 1 + 0 = 22 − 16 = 6
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P = 101111, q = 7
H(P) = 47 and Hq(P) = 47 mod 7 = 5
Hq(P) can be computed incrementally:
(1 * 2 mod 7) + 0 = 2
(2 * 2 mod 7) + 1 = 5
(5 * 2 mod 7) + 1 = 4
(4 * 2 mod 7) + 1 = 2
(2 * 2 mod 7) + 1 = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr−1):
2^m mod q = 2 * (2^(m−1) mod q) mod q
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
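The Karp-Rabin scan can be sketched as follows (the function name and the fixed prime are mine; a production version would pick q at random, as the slides prescribe):

```python
def karp_rabin(T, P, q=101):
    # Report positions r where Hq(T_r) == Hq(P); each candidate is
    # verified, so this is the deterministic (never-erring) variant.
    n, m = len(T), len(P)
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):                     # fingerprints of P and T_1
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    top = pow(2, m - 1, q)                 # 2^(m-1) mod q
    hits = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:   # verification step
            hits.append(r)
        if r + m < n:                      # roll the window by one
            ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
    return hits

print(karp_rabin("10110101", "0101"))  # 0-based position 4 = the slide's T5
```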
Proof on the board
Problem 1: Solution
Dictionary: a, bzip, not, or, space
P = bzip = 1a 0b
(figure: the compressed pattern is matched against C(S), S = "bzip or not bzip"; candidate byte-aligned positions are answered yes/no)
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for (m = 3, n = 10)

M        c a l i f o r n i a
f (i=1)  0 0 0 0 1 0 0 0 0 0
o (i=2)  0 0 0 0 0 1 0 0 0 0
r (i=3)  0 0 0 0 0 0 1 0 0 0

How does M solve the exact match problem?
How does M solve the exact match problem?
How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one
Machines can perform bit and arithmetic operations between two words in constant time.
Examples:
And(A,B) is the bit-wise and between A and B.
BitShift(A) is the value derived by shifting A's bits down by one and setting the first bit to 1.
BitShift((0,1,1,0,1)ᵀ) = (1,0,1,1,0)ᵀ
Let w be the word size (e.g., 32 or 64 bits). We'll assume m ≤ w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 in the positions where character x appears in P.
Example: P = abaac
U(a) = (1,0,1,1,0)ᵀ   U(b) = (0,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ
How to construct M
Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained by
M(j) = BitShift(M(j−1)) & U(T[j])
For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T ending at character j−1 ⇔ M(i−1,j−1) = 1
(2) P[i] = T[j] ⇔ the i-th bit of U(T[j]) = 1
BitShift moves bit M(i−1,j−1) into the i-th position;
AND-ing this with the i-th bit of U(T[j]) establishes whether both hold
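The recurrence above in Python, using unbounded integers as bit-vectors (a sketch; the function name is mine):

```python
def shift_and(T, P):
    # Bit i of M is 1 iff P[0..i] ends at the current text position.
    m = len(P)
    U = {}
    for i, c in enumerate(P):            # U[c] has bit i set iff P[i] == c
        U[c] = U.get(c, 0) | (1 << i)
    M, hits = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)   # BitShift, then AND with U(T[j])
        if M & (1 << (m - 1)):             # top bit set: full match ends at j
            hits.append(j - m + 1)
    return hits

print(shift_and("xabxabaaca", "abaac"))  # occurrence at 0-based position 4
```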
Example run: P = abaac, T = xabxabaaca
j=1: M(1) = BitShift(M(0)) & U(T[1] = x) = (0,0,0,0,0)ᵀ
j=2: M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)ᵀ
j=3: M(3) = BitShift(M(2)) & U(b) = (0,1,0,0,0)ᵀ
… and so on, until
j=9: M(9) = BitShift(M(8)) & U(c) = (0,0,0,0,1)ᵀ: the 5th bit is 1, so an occurrence of P ends at position 9
Shift-And method: Complexity
If m ≤ w, any column and any vector U() fit in a memory word: each step requires O(1) time.
If m > w, any column and vector U() can be divided into ⌈m/w⌉ memory words: each step requires O(m/w) time.
Overall O(n(1 + m/w) + m) time.
Thus, it is very fast when the pattern length is close to the word size, which is very often the case in practice. Recall that w = 64 bits in modern architectures.
Some simple extensions
We want to allow the pattern to contain special symbols, like the character class [a-f]
P = [a-b]baac
U(a) = (1,0,1,1,0)ᵀ   U(b) = (1,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ
What about ‘?’, ‘[^…]’ (not)?
Problem 1: An other solution
Dictionary: a, bzip, not, or, space
P = bzip = 1a 0b
(figure: as before, the compressed pattern is matched against C(S), S = "bzip or not bzip", via byte-aligned yes/no checks, now with the Shift-And machinery)
Speed ≈ Compression ratio
Problem 2
Dictionary: a, bzip, not, or, space
Given a pattern P, find all the occurrences in S of all terms containing P as a substring
P = o
(figure: the terms matching P are not = 1g 0g 0a and or = 1g 0a 0b; their codewords are searched in C(S), S = "bzip or not bzip")
Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1, P2, …, Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].
(figure: the patterns anchored at their occurrences in T)
Naïve solution
Use an (optimal) exact matching algorithm, searching the text for each pattern separately
Complexity: O(nl + m) time, not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns in P
R is a bitmask of length m, with R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method searching for S:
For any symbol c, U’(c) = U(c) AND R, i.e. U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
For any step j: compute M(j), then set M(j) = M(j) OR U’(T[j]). Why?
This sets to 1 the first bit of each pattern that starts with T[j]
Check if there are occurrences ending in j. How?
Problem 3
Dictionary: a, bzip, not, or, space
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches
P = bot, k = 2
(figure: C(S), S = "bzip or not bzip", with the candidate codewords to be matched approximately)
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact occurrences of a pattern in a text.
Example: T = aatatccacaa, P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.
aatatccacaa
   atcgaa
aatatccacaa
 atcgaa
Agrep
Our current goal: given k, find all the occurrences of P in T with up to k mismatches
We define the matrix M^l to be an m by n binary matrix, such that:
M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at position j.
What is M^0?
How does M^k solve the k-mismatch problem?
Computing Mk
We compute M^l for all l = 0, …, k.
For each j compute M^0(j), M^1(j), …, M^k(j)
For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff one of two cases holds.
Case 1: the first i−1 characters of P match a substring of T ending at j−1, with at most l mismatches, and the next pair of characters in P and T are equal:
BitShift(M^l(j−1)) & U(T[j])
Case 2: the first i−1 characters of P match a substring of T ending at j−1, with at most l−1 mismatches (position i may then mismatch):
BitShift(M^(l−1)(j−1))
Hence:
M^l(j) = [BitShift(M^l(j−1)) & U(T[j])] OR BitShift(M^(l−1)(j−1))
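The two-case recurrence, as a Python sketch (function name is mine; integers serve as bit-vectors, so m is not limited to the word size here):

```python
def agrep_k_mismatch(T, P, k):
    # M[l] tracks prefixes of P ending at the current position
    # with at most l mismatches.
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)
    hits = []
    for j, c in enumerate(T):
        prev = M[:]                                  # columns for j-1
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            # extend exactly with <= l errors, OR spend one error now
            M[l] = ((((prev[l] << 1) | 1) & U.get(c, 0))
                    | ((prev[l - 1] << 1) | 1))
        if M[k] & (1 << (m - 1)):
            hits.append(j - m + 1)
    return hits

print(agrep_k_mismatch("aatatccacaa", "atcgaa", 2))  # 0-based position 3
```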
Example M1
T = xabxabaaca, P = abaad

M⁰    j: 1 2 3 4 5 6 7 8 9 10
i=1      0 1 0 0 1 0 1 1 0 1
i=2      0 0 1 0 0 1 0 0 0 0
i=3      0 0 0 0 0 0 1 0 0 0
i=4      0 0 0 0 0 0 0 1 0 0
i=5      0 0 0 0 0 0 0 0 0 0

M¹    j: 1 2 3 4 5 6 7 8 9 10
i=1      1 1 1 1 1 1 1 1 1 1
i=2      0 0 1 0 0 1 0 1 1 0
i=3      0 0 0 1 0 0 1 0 0 1
i=4      0 0 0 0 1 0 0 1 0 0
i=5      0 0 0 0 0 0 0 0 1 0
How much do we pay?
The running time is O(k n (1 + m/w))
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.
Problem 3: Solution
Dictionary: a, bzip, not, or, space
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches
P = bot, k = 2
(figure: the k-mismatch search is run on C(S), S = "bzip or not bzip"; e.g. not = 1g 0g 0a matches P = bot with 1 ≤ k mismatches)
Agrep: more sophisticated operations
The Shift-And method can solve other ops
The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p into a different one
Example: d(ananas, banane) = 3
Search by regular expressions
Example: (a|b)?(abc|a)
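The edit distance just defined can be computed with the classical dynamic program (a sketch, separate from the Shift-And machinery; the function name is mine):

```python
def edit_distance(p, s):
    # d[i][j] = min #insertions/deletions/substitutions
    # turning p[:i] into s[:j].
    m, n = len(p), len(s)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all of p[:i]
    for j in range(n + 1):
        d[0][j] = j                       # insert all of s[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == s[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

print(edit_distance("ananas", "banane"))  # 3, as in the slide
```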
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
γ-code for integer encoding
γ(x) = 000...0 (Length − 1 zeros) followed by x in binary
x > 0 and Length = ⌊log₂ x⌋ + 1
e.g., 9 is represented as <000, 1001>.
The γ-code for x takes 2⌊log₂ x⌋ + 1 bits (i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of γ-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
⇒ 8 6 3 59 7
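The decoding exercise above can be checked mechanically (a sketch; function names are mine):

```python
def gamma_encode(x):
    # gamma(x): (Length-1) zeros, then x in binary (x > 0).
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":            # count the unary length prefix
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

s = "".join(gamma_encode(x) for x in [8, 6, 3, 59, 7])
print(s)                  # the slide's bit string
print(gamma_decode(s))    # [8, 6, 3, 59, 7]
```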
Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H₀(s) + 1
Key fact: 1 ≥ ∑_{i=1,...,x} pi ≥ x * px  ⇒  x ≤ 1/px
How good is it?
The cost of the encoding is (recall i ≤ 1/pi):
∑_{i=1,...,|S|} pi * |γ(i)| ≤ ∑_{i=1,...,|S|} pi * [2 * log (1/pi) + 1] = 2 * H₀(X) + 1
Not much worse than Huffman, and improvable to H₀(X) + 2 + ....
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7 bits: just those of Huffman
End-tagged dense code
The rank r is mapped to the r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix code
Better compression: it uses all the 7-bit configurations
(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is:
Previously we used: s = c = 128
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s*c² on 3 bytes, ...
An example
5000 distinct words
ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
The (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...
Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 − s.
Brute-force approach
Binary search: on real distributions, there seems to be a unique minimum
Ks = max codeword length; Fsk = cumulative probability of the symbols whose |cw| ≤ k
Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded
Start with the list of symbols L = [a,b,c,d,…]
For each input symbol s:
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
Exploits temporal locality, and it is dynamic
X = 1ⁿ 2ⁿ 3ⁿ … nⁿ ⇒ Huff = O(n² log n) bits, MTF = O(n log n) + n² bits
Not much worse than Huffman... but it may be far better
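The two MTF steps above can be sketched as (function name is mine; positions start at 1):

```python
def mtf_encode(text, alphabet):
    # For each symbol: output its current position in L,
    # then move it to the front (recency gives small integers).
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)
        out.append(i + 1)
        L.pop(i)
        L.insert(0, s)
    return out

print(mtf_encode("aaabbbccc", "abc"))  # [1, 1, 1, 2, 1, 1, 3, 1, 1]
```

Note how runs of equal symbols become runs of 1s, which a γ-code then stores in a single bit each.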
MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log i + 1
Put S at the front; the position emitted for the i-th occurrence of symbol x is at most the gap p^x_i − p^x_{i−1} between its consecutive occurrences, so the total cost is at most
O(|S| log |S|) + ∑_{x=1,...,|S|} ∑_i |γ(p^x_i − p^x_{i−1})|
By Jensen's inequality:
≤ O(|S| log |S|) + ∑_{x=1,...,|S|} n_x * [2 * log (N/n_x) + 1]
= O(|S| log |S|) + N * [2 * H₀(X) + 1]
Hence La[mtf] ≤ 2 * H₀(X) + O(1)
MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded
How to maintain the MTF-list efficiently:
Search tree: leaves contain the symbols, ordered as in the MTF-list; nodes contain the size of their descending subtree
Hash table: the key is a symbol, the data is a pointer to the corresponding tree leaf
Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one bit suffice
Properties:
Exploits spatial locality, and it is a dynamic code
There is a memory
X = 1ⁿ 2ⁿ 3ⁿ … nⁿ ⇒ Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
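The run-collapsing step, as a one-function sketch (name is mine):

```python
def rle_encode(s):
    # Collapse each maximal run of equal symbols into (symbol, length).
    out = []
    for c in s:
        if out and out[-1][0] == c:
            out[-1] = (c, out[-1][1] + 1)
        else:
            out.append((c, 1))
    return out

print(rle_encode("abbbaacccca"))
# [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]
```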
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol an interval in the range from 0 (inclusive) to 1 (exclusive):
f(i) = ∑_{j=1}^{i−1} p(j)
e.g. p(a) = .2, p(b) = .5, p(c) = .3 ⇒ f(a) = .0, f(b) = .2, f(c) = .7
(figure: [0, .2) for a, [.2, .7) for b, [.7, 1.0) for c)
The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0, 1); after b the interval is [.2, .7); after a it is [.2, .3); after c it is [.27, .3)
The final sequence interval is [.27, .3)
Arithmetic Coding
To code a sequence of symbols c_i with probabilities p[c], use the following:
l₀ = 0,  s₀ = 1
l_i = l_{i−1} + s_{i−1} * f[c_i]
s_i = s_{i−1} * p[c_i]
f[c] is the cumulative probability up to symbol c (not included)
The final interval size is s_n = ∏_{i=1}^{n} p[c_i]
The interval for a message sequence will be called the sequence interval
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
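The interval recurrences can be reproduced directly (a sketch; the function name is mine):

```python
def sequence_interval(msg, p, f):
    # Narrow [l, l+s) symbol by symbol, per the recurrences
    # l_i = l_{i-1} + s_{i-1}*f[c_i] and s_i = s_{i-1}*p[c_i].
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, l + s

p = {"a": .2, "b": .5, "c": .3}
f = {"a": .0, "b": .2, "c": .7}
lo, hi = sequence_interval("bac", p, f)
print(round(lo, 4), round(hi, 4))  # the slide's interval [.27, .3)
```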
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
.49 ∈ [.2, .7) ⇒ b;  within it, .49 ∈ [.3, .55) ⇒ b;  within that, .49 ∈ [.475, .55) ⇒ c
The message is bbc.
Representing a real number
Binary fractional representation:
.75 = .11
1/3 = .010101…
11/16 = .1011
Algorithm
1. x = 2 * x
2. If x < 1, output 0
3. else x = x − 1; output 1
So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) → .01    [.33,.66) → .1    [.66,1) → .11
Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

number  min     max     interval
.11     .110…   .111…   [.75, 1.0)
.101    .1010…  .1011…  [.625, .75)

We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).
Example: the sequence interval [.61, .79) contains the code interval of .101, i.e. [.625, .75)
Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits
Bound on Arithmetic length
Note that −log s + 1 = log (2/s)
Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + log (1/s) = 1 + log ∏_{i=1,n} (1/p_i)
≤ 2 + ∑_{i=1,n} log (1/p_i)
= 2 + ∑_{k=1,|S|} n p_k log (1/p_k)
= 2 + n H₀ bits
In practice: nH₀ + 0.02 n bits, because of rounding
Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:
Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s; m = 0
Message interval is expanded by 2
If u < R/2 then (bottom half)
Output 0 followed by m 1s; m = 0
Message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2
In all other cases, just continue...
You find this at
Arithmetic ToolBox
As a state machine: the ATB receives the current interval (L, s) and a symbol c drawn from the distribution (p1,....,pS), and outputs the narrowed interval (L', s') with L' = L + s * f(c) and s' = s * p(c)
Therefore, even the distribution can change over time
K-th order models: PPM
Use the previous k characters as the context.
Makes use of conditional probabilities: this is the changing distribution
Base probabilities on counts:
e.g. if we have seen "th" 12 times, followed by "e" 7 times, then the conditional probability is p(e|th) = 7/12.
Need to keep k small so that the dictionary does not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen the context followed by the character before?
We cannot code 0 probabilities!
The key idea of PPM is to reduce the context size if the previous match has not been seen:
If the character has not been seen before with the current context of size 3, send an escape-msg and then try the context of size 2, then again an escape-msg and the context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for the probability.
PPM + Arithmetic ToolBox
(figure: the PPM model feeds the ATB with p[s | context], where s = c or esc, narrowing (L,s) to (L',s'))
Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant)
PPM: Example Contexts (String = ACCBACCACBAB, k = 2)

Context Empty:  A = 4, B = 2, C = 5, $ = 3
Context A:   C = 3, $ = 1
Context B:   A = 2, $ = 1
Context C:   A = 1, B = 2, C = 2, $ = 3
Context AC:  B = 1, C = 2, $ = 2
Context BA:  C = 1, $ = 1
Context CA:  C = 1, $ = 1
Context CB:  A = 2, $ = 1
Context CC:  A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)
<2,3,c>
Algorithm’s step:
Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match
Advance by len + 1
A buffer “window” has fixed length and moves
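The algorithm's step can be sketched directly from the description above. A minimal, quadratic-time sketch (gzip uses hashing for speed); `lz77_encode` and the window parameter `w` are hypothetical names, and the window default matches the slides' example:

```python
def lz77_encode(text, w=6):
    """Emit (d, len, c) triples: copy `len` chars from `d` positions
    back, then append literal c; then advance by len + 1."""
    i, out = 0, []
    while i < len(text):
        best_d, best_l = 0, 0
        for d in range(1, min(i, w) + 1):       # candidates inside window
            l = 0
            # matches may self-overlap, so compare char by char
            while i + l < len(text) - 1 and text[i + l - d] == text[i + l]:
                l += 1
            if l > best_l:
                best_d, best_l = d, l
        out.append((best_d, best_l, text[i + best_l]))
        i += best_l + 1
    return out
```

On the slides' text it reproduces the five triples of the worked example.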
Example: LZ77 with window
a a c a a c a b c a b a a a c
(0,0,a)
a a c a a c a b c a b a a a c
(1,1,c)
a a c a a c a b c a b a a a c
(3,4,b)
a a c a a c a b c a b a a a c
(3,3,a)
a a c a a c a b c a b a a a c
(1,2,c)
Window size = 6
Longest match
within W
Next character
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if l > d? (overlap with text to be compressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
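The decoder's copy loop, including the l > d overlap case, can be written as below; `lz77_decode` is a hypothetical name for this sketch:

```python
def lz77_decode(triples):
    """Rebuild the text from (d, len, c) triples; copying one char at a
    time makes self-overlapping matches (len > d) unfold correctly."""
    out = []
    for d, l, c in triples:
        start = len(out) - d
        for i in range(l):
            out.append(out[start + i])  # out grows while we copy
        out.append(c)
    return "".join(out)
```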
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
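The coding loop can be sketched with a plain dict standing in for the trie (ids are assigned in insertion order, id 0 being the empty string); `lz78_encode` is a hypothetical name:

```python
def lz78_encode(text):
    """Output (id, c) pairs; each step finds the longest dictionary
    match S, emits its id plus the next char c, and adds Sc."""
    dictionary, out, s = {"": 0}, [], ""
    for c in text:
        if s + c in dictionary:
            s += c                                # keep extending the match
        else:
            out.append((dictionary[s], c))
            dictionary[s + c] = len(dictionary)   # new id for Sc
            s = ""
    if s:  # input ended inside a match: emit a shorter (prefix) pair
        out.append((dictionary[s[:-1]], s[-1]))
    return out
```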
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
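Both sides, including the special SSc case, fit in a short sketch. For readability the dictionary is seeded with a small explicit alphabet rather than 256 ASCII entries, and the function names are hypothetical:

```python
def lzw_encode(text, alphabet):
    dic = {c: i for i, c in enumerate(alphabet)}
    out, s = [], text[0]
    for c in text[1:]:
        if s + c in dic:
            s += c
        else:
            out.append(dic[s])
            dic[s + c] = len(dic)        # add Sc, but transmit only the id
            s = c
    out.append(dic[s])
    return out

def lzw_decode(codes, alphabet):
    dic = {i: c for i, c in enumerate(alphabet)}
    prev = dic[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dic:
            cur = dic[code]
        else:                            # the SSc case: id not known yet,
            cur = prev + prev[0]         # it must be prev + prev's first char
        out.append(cur)
        dic[len(dic)] = prev + cur[0]    # decoder is one step behind
        prev = cur
    return "".join(out)
```

"abababab" exercises the special case: the encoder emits an id the decoder has not built yet.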
LZW: Encoding Example
Output
Dict.
a a b a a c a b a b a c b
112 256=aa
a a b a a c a b a b a c b
112 257=ab
a a b a a c a b a b a c b
113
258=ba
a a b a a c a b a b a c b
256
259=aac
a a b a a c a b a b a c b
114
260=ca
a a b a a c a b a b a c b
257
261=aba
a a b a a c a b a b a c b
261
262=abac
a a b a a c a b a b a c b
114
263=cb
LZW: Decoding Example
Input
112
Dict
a
112
a a
256=aa
113
a a b
257=ab
256
a a b a a
258=ba
114
a a b a a c
259=aac
257
a a b a a c a b ?
260=ca
261
261
114
a a b a a c a b a b
261=aba
one
step
later
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
Consider the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows
(Burrows-Wheeler, 1994)

F                L
# mississipp  →  i
i #mississip  →  p
i ppi#missis  →  s
i ssippi#mis  →  s
i ssissippi#  →  m
m ississippi  →  #
p i#mississi  →  p
p pi#mississ  →  i
s ippi#missi  →  s
s issippi#mi  →  s
s sippi#miss  →  i
s sissippi#m  →  i

L is the BWT of T. A famous example is much longer...
A useful tool: L → F mapping
(the same sorted matrix as before, with F = # i i i i m p p s s s s and L = i p s s m # p i s s i i; the first column F is unknown to the decoder, the last column L is what it receives)
How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...
Take two equal L’s chars
Rotate rightward their rows
Same relative order !!
The BWT is invertible
(again the sorted matrix, with F = # i i i i m p p s s s s and L = i p s s m # p i s s i i)
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i ppi #
InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
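The forward transform and the LF-based inversion above can be sketched as follows (a didactic O(n² log n) version, assuming T ends with a unique sentinel '#' that sorts before every letter; function names are hypothetical):

```python
def bwt(t):
    """Last column L of the sorted rotations of t."""
    rots = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rots)

def ibwt(L):
    """Invert via LF: the k-th occurrence of a char in L corresponds to
    its k-th occurrence in F (a stable sort preserves that order)."""
    n = len(L)
    order = sorted(range(n), key=lambda i: L[i])  # stable: ties keep L-order
    LF = [0] * n
    for f_row, l_row in enumerate(order):
        LF[l_row] = f_row
    row, out = 0, []
    for _ in range(n):                   # rebuild T backward from row 0
        out.append(L[row])
        row = LF[row]
    t = "".join(reversed(out))
    return t[1:] + t[0]                  # rotate the sentinel back to the end
```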
How to compute the BWT ?
SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3

BWT matrix (sorted rotations; L = last column):
#mississippi
i#mississipp
ippi#mississ
issippi#miss
ississippi#m
mississippi#
pi#mississip
ppi#mississi
sippi#missis
sissippi#mis
ssippi#missi
ssissippi#mi

L = i p s s m # p i s s i i
We said that: L[i] precedes F[i] in T
L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
How to construct SA from T ?
SA    suffix
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#
Elegant but inefficient
Input: T = mississippi#
Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii
# at 16
Mtf = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code
RLE0 = 03141041403141410210
Bzip2-output = Arithmetic/Huffman on an alphabet of |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
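The two middle stages of the pipeline are easy to sketch. A minimal Move-to-Front plus a simplified zero-run coder (this emits (0, run-length) pairs rather than Wheeler's binary code, and the function names are hypothetical):

```python
def mtf_encode(s, alphabet):
    """Move-to-Front: emit the current rank of each char, then move it
    to the front — runs of equal chars become runs of 0s."""
    lst, out = list(alphabet), []
    for c in s:
        r = lst.index(c)
        out.append(r)
        lst.pop(r)
        lst.insert(0, c)
    return out

def rle0(seq):
    """Collapse each run of 0s into a (0, run_length) pair; other
    symbols pass through unchanged (a stand-in for Wheeler's code)."""
    out, run = [], 0
    for x in seq:
        if x == 0:
            run += 1
        else:
            if run:
                out.append((0, run))
                run = 0
            out.append(x)
    if run:
        out.append((0, run))
    return out
```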
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google, 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node one can reach any other node via an
undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node one can reach any other node via a
directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
Largest artifact ever conceived by humans
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph
V = routers
E = communication links
The “cosine” graph (undirected, weighted)
V = static web pages
E = semantic distance between pages
Query-Log graph (bipartite, weighted)
V = queries and URLs
E = (q,u) if u is a result for q, and has been clicked by some
user who issued q
Social graph (undirected, unweighted)
V = users
E = (x,y) if x knows y (facebook, address book, email,..)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: the probability that a node has x links is ∝ 1/x^a, a ≈ 2.1
The In-degree distribution
Altavista crawl, 1999
Indegree follows power law distribution
WebBase Crawl 2001
Pr[ in-degree(u) = k ] ∝ 1/k^a, with a ≈ 2.1
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: the probability that a node has x links is ∝ 1/x^a, a ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
j
i
21 million pages, 150 million links
URL-sorting
Berkeley
Stanford
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
For negative entries: fold them to non-negatives, coding 2v when v ≥ 0 and 2|v|−1 when v < 0
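The gap transformation above can be sketched as follows (a minimal version with hypothetical names, omitting the folding of the possibly negative first entry):

```python
def gaps(x, succ):
    """Gap-encode the sorted successor list of node x: first entry is
    relative to x, the rest are gaps minus 1 (successors are distinct)."""
    out = [succ[0] - x]
    out += [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]
    return out

def ungaps(x, g):
    """Invert the transformation."""
    s = [x + g[0]]
    for d in g[1:]:
        s.append(s[-1] + d + 1)
    return s
```

Because of locality, the gaps are small and compress well with universal codes.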
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of y’s copy-list tells whether the corresponding successor of the
reference x is also a successor of y;
The reference index is chosen in [0,W] that gives the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.
Consecutivity in extra-nodes
Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
data
knowledge
about data
at receiver
sender
receiver
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
What if the sender has never seen the data at the receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested
page and the ones available in cache
File synchronization
[diff, zdelta, REBL,…]
[rsynch, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
The LZ77-scheme provides an efficient, optimal solution:
fknown is the “previously encoded text”; compress the concatenation fknown·fnew, emitting output only from fnew onwards
zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
Client
reference
request
Slow-link
Delta-encoding
Proxy
reference
request
Fast-link
web
Page
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f ∈ F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic
Improvement
What about
many-to-one compression?
(group of files)
Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files
Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, still quadratic time
          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
request
f_new
update
Server
f_old
Client
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
hashes
f_new
Server
encoded file
f_old
Client
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
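The rolling hash that makes the block search fast can be sketched with a toy Adler-style checksum (rsync's real 4-byte rolling hash differs in details; names and the modulus are illustrative):

```python
MOD = 65521  # largest prime < 2^16, as in Adler-32

def weak_hash(block):
    """Weak checksum of a block of bytes: two 16-bit sums packed together."""
    a = sum(block) % MOD
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % MOD
    return (b << 16) | a

def roll(h, out_byte, in_byte, blocklen):
    """Slide the window one byte to the right, updating the hash in O(1)."""
    a, b = h & 0xFFFF, h >> 16
    a = (a - out_byte + in_byte) % MOD
    b = (b - blocklen * out_byte + a) % MOD
    return (b << 16) | a
```

The client hashes every block of f_old once; the server then rolls this hash over f_new one byte at a time, looking up matches in a table keyed by the weak hash (strong hashes like MD5 confirm candidates).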
Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452

Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
The server sends the hashes (unlike rsync, where the client does) and the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses just it).
A multi-round protocol
k blocks of n/k elems
Log n/k levels
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Recurring minimum for
improving the estimate
+ 2 SBF
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (ie. T[i,N])
Occurrences of P in T = All suffixes of T having P as a prefix
Example: P = si, T = mississippi → occurrences at positions 4, 7
SUF(T) = Sorted set of suffixes of T
Reduction
From substring search
To prefix search
The Suffix Tree
[Figure: the suffix tree of T# = mississippi#; internal edges carry labels such as i, s, si, ssi, #, i#, pi#, ppi#, mississippi#, and the 12 leaves store the starting positions 1..12 of the suffixes]

T# = mississippi#
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N²) space; the suffix array keeps only the starting positions:

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#
T = mississippi#
suffix pointer
P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp
[Figure: binary search for P = si over SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3 of T = mississippi#; comparing P with the suffix at the middle SA entry gives “P is larger”, at 2 memory accesses per step]
Searching a pattern (cont.)
[Figure: the next binary-search step on the same SA; this time “P is smaller”, so the search continues in the lower half]
Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
overall, O(p log2 N) time
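The whole scheme fits in a short sketch: the "elegant but inefficient" O(n² log n) construction from the earlier slide, plus the indirect binary search (function names are hypothetical, positions are 0-based):

```python
def suffix_array(t):
    """Sort the suffixes by comparing them directly (didactic, slow)."""
    return sorted(range(len(t)), key=lambda i: t[i:])

def search(t, sa, p):
    """All occurrences of p: find the leftmost suffix >= p, then scan
    the contiguous SA range whose suffixes start with p."""
    lo, hi = 0, len(sa)
    while lo < hi:                                   # O(p) per comparison
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    res = []
    while lo < len(sa) and t[sa[lo]:sa[lo] + len(p)] == p:
        res.append(sa[lo])
        lo += 1
    return sorted(res)
```

For T = mississippi#, searching si returns the 0-based positions 3 and 6 (the slides' 1-based 4 and 7).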
Improvements: O(p + log2 N) [Manber-Myers, ’90]; O(p + log2 |S|) [Cole et al, ’06]
Locating the occurrences
occ = 2: the suffixes prefixed by P = si occupy a contiguous range of SA of T = mississippi#, here the entries 7 (sippi...) and 4 (sissippi...); the range endpoints are found by (virtually) searching the strings si# and si$.
Suffix Array search
• O (p + log2 N + occ) time
(assuming the ordering # < S < $)
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree
[Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
Lcp = 0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3   (for SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3 of T = mississippi#)

Example: Lcp[4] = 4 = |issi|, the common prefix of the adjacent suffixes issippi# and ississippi#.
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i, i+C-2] whose entries are all ≥ L
Slide 8
Algoritmi per IR
Prologo
Paradigm shift...
Web 2.0 is about the many
Big
DATA Big PC ?
We have three types of algorithms:
T1(n) = n, T2(n) = n2, T3(n) = 2n
... and assume that 1 step = 1 time unit
How many input data n each algorithm may process
within t time units?
n1 = t,
n2 = √t,
n3 = log2 t
What about a k-times faster processor?
...or, what is n, when the time units are k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
A new scenario
Data are more available than ever before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Step cost is Ω(1) time
Not just MIN #steps…
CPU registers → L1 → L2 cache: few Mbs, some nanosecs, few words fetched
RAM: few Gbs, tens of nanosecs, some words fetched
HD: few Tbs, few millisecs, B = 32K page
net: many Tbs, even secs, packets
You should be “??-aware programmers”
I/O-conscious Algorithms
[Figure: a disk — tracks on a magnetic surface, read/write head and arm]
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O [10^5–10^6 steps (Hennessy-Patterson)]
If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈
30 * f/(1+f)
Space-conscious Algorithms
I/Os
search
access
Compressed
data structures
Streaming Algorithms
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
(the same memory hierarchy as before: CPU registers → L1/L2 cache → RAM → HD → net, from few Mbs and some nanosecs up to many Tbs and even secs, with B = 32K pages on disk and packets on the network)
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
n     4K    8K    16K   32K    128K   256K   512K   1M
n³    22s   3m    26m   3.5h   28h    --     --     --
n²    0     0     0     1s     26s    106s   7m     28m
An optimal solution
We assume every subsum≠0
[Figure: A split into a prefix of sum < 0 followed by the optimum window of sum > 0]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
sum=0; max = -1;
For i=1,...,n do
If (sum + A[i] ≤ 0) sum = 0;
else { sum += A[i]; max = MAX{max, sum}; }
Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
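The scan above is a one-liner in practice. A minimal sketch (hypothetical name; as the slide notes, it assumes every subsum ≠ 0 and in particular at least one positive entry):

```python
def max_subarray(a):
    """One left-to-right scan: reset the running sum when it drops
    to <= 0, since such a prefix never helps the optimum."""
    s, best = 0, float("-inf")
    for x in a:
        if s + x <= 0:
            s = 0
        else:
            s += x
            best = max(best, s)
    return best
```

On the slides' array the best window is 6 + 1 − 2 + 4 + 3 = 12.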
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions Data get distributed arbitrarily !!!
B-tree internal nodes
B-tree leaves
(“tuple pointers")
What about
listing tuples
in order ?
Tuples
Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02     m = (i+j)/2;           // Divide
03     Merge-Sort(A,i,m);     // Conquer
04     Merge-Sort(A,m+1,j);
05     Merge(A,i,m,j)         // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10^9 tuples → a few Gbs
Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log2 n) random I/Os
[5ms] * n log2 n ≈ 1.5 years
In practice, it is faster because of caching...
2 passes (R/W)
Merge-Sort Recursion Tree
log2 N levels
If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help
[Figure: the recursion tree of mergesort — at each level pairs of sorted runs, e.g. 1 2 5 10 and 2 7 9 13, are merged into 1 2 2 5 7 9 10 13; the bottom levels fit within the internal memory M]
How do we deploy the disk/mem features ?
N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
Pass 1: Produce (N/M) sorted runs.
Pass i: merge X = M/B runs at a time → log_{M/B} N/M passes
[Figure: X = M/B input buffers of B items each, plus one output buffer, in main memory; runs stream in from disk, are merged, and the output streams back to disk]
Multiway Merging
[Figure: buffers Bf1 ... Bfx (X = M/B), each holding the current page of its run with a pointer pi; repeatedly move min(Bf1[p1], Bf2[p2], …, Bfx[pX]) to the output buffer Bfo; fetch the next page when pi = B, flush Bfo to the merged run when full, until EOF]
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≈ log_{M/B} N/M
Optimal cost = Θ((N/B) log_{M/B} N/M) I/Os
In practice
M/B ≈ 1000 → #passes = log_{M/B} N/M ≈ 1
→ one multiway merge: 2 passes = few mins
Tuning depends
on disk features
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
May compression help?
Goal: enlarge M and reduce N
#passes = O(logM/B N/M)
Cost of a pass = O(N/B)
Part of Vitter’s paper…
In order to address issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how far down we can go
Toy problem #3: Top-freq elements
Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)
A=b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables
For each item s of the stream,
if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}
Return X;
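The pseudocode above is the Boyer-Moore majority vote; a runnable sketch (hypothetical name), merging the "C==0" reset into one branch:

```python
def majority_candidate(stream):
    """O(1)-space scan: the survivor X is the majority item whenever
    some item occurs > N/2 times; otherwise X must be verified."""
    x, c = None, 0
    for s in stream:
        if c == 0:
            x, c = s, 1        # adopt a new candidate
        elif x == s:
            c += 1             # a supporting occurrence
        else:
            c -= 1             # a "negative" mate cancels one vote
    return x
```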
Proof
If X ≠ y at the end, then every one of y’s occurrences has a
“negative” mate, i.e. an occurrence that cancelled it.
Hence these mates are ≥ #occ(y), so 2 * #occ(y) > N items would be
needed — a contradiction.
Problems if the mode occurs ≤ N/2 times: the returned candidate must then be verified.
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 * 10^9 chars → size = 6Gb
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms
What kind of data structure should we build to support
word-based searches ?
Solution 1: Term-Doc matrix
n = 1 million plays (columns), t = 500K terms (rows); 1 if the play contains the word, 0 otherwise:

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony       1                 1              0            0       0        1
Brutus       1                 1              0            1       0        0
Caesar       1                 1              0            1       1        1
Calpurnia    0                 1              0            0       0        0
Cleopatra    1                 0              0            0       0        0
mercy        1                 0              1            1       1        1
worser       1                 0              1            1       1        0

Space is 500Gb !
Solution 2: Inverted index
Brutus → 2, 4, 8, 16, 32, 64, 128
Calpurnia → 1, 2, 3, 5, 8, 13, 21, 34
Caesar → 13, 16

We can still do better, i.e. 30–50% of the original text:
1. Typically about 12 bytes per posting
2. We have 10^9 total terms → at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
A better index, but yet it is >10 times the text !!!!
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2^n, but the number of strictly shorter compressed messages is

Σ_{i=1..n-1} 2^i = 2^n − 2 < 2^n
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log2 (1 / p(s)) = −log2 p(s)

Lower probability → higher information
Entropy is the weighted average of i(s)
H(S) = Σ_{s∈S} p(s) log2 (1 / p(s))  bits
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be
uniquely decomposed into their codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0
1
0 1
a
0
1
b
c
d
Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = Σ_{s∈S} p(s) L[s]
We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…
Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}
then pi < pj → L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have
H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that
La(C) ≤ H(S) + 1
Shannon code
takes ⌈log2 1/p⌉ bits
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
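The greedy construction (repeatedly merge the two least probable subtrees) can be sketched with a heap. This minimal version returns only the codeword lengths, which is all a canonical code needs; names are hypothetical:

```python
import heapq
from itertools import count

def huffman_lengths(probs):
    """Huffman codeword length per symbol: every time a symbol's
    subtree is merged, its codeword grows by one bit."""
    tiebreak = count()                    # makes heap entries comparable
    heap = [(p, next(tiebreak), [s]) for s, p in probs.items()]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)   # two smallest probabilities
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1               # one more level above them
        heapq.heappush(heap, (p1 + p2, next(tiebreak), s1 + s2))
    return lengths
```

With the running example's probabilities it yields lengths 3, 3, 2, 1 for a, b, c, d, matching the codes a=000, b=001, c=01, d=1.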
Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: the Huffman tree — a(.1) and b(.2) merge into (.3); then (.3) and c(.2) merge into (.5); finally (.5) and d(.5) merge into the root (1)]
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees
What about ties (and thus, tree depth) ?
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Example with the tree above:
abc... → 000 001 01 ... = 00000101...
101001... → d c b ...
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store for any level L:
firstcode[L] (the smallest codeword of length L, of the form 00.....0 at the top)
Symbol[L,i], for each i in level L
This is ≤ h² + |S| log |S| bits
Canonical Huffman
Encoding
[Figure: a canonical Huffman tree with levels 1..5]
Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0
T=...00010...
Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
−log2(.999) ≈ .00144
If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.
Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
1 extra bit per macro-symbol = 1/k extra-bits per symbol
Larger model to be transmitted
Shannon took infinite sequences, and k → ∞ !!
In practice, we have:
Model takes |S|^k * (k * log |S|) + h² bits
It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L
(where h might be |S|)
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
[Figure: a word-based tagged Huffman code for T = “bzip or not bzip”: the tree’s symbols are the words bzip, or, not and the space; each codeword is a sequence of 7-bit chunks, each carried in a byte whose extra tag bit marks codeword boundaries, so the encoding C(T) is byte-aligned]
CGrep and other ideas...
[Figure: grep on the compressed text — P = bzip is encoded with the same word-based Huffman (e.g. the codeword 1a0b) and matched directly, byte by byte, against C(T) for T = “bzip or not bzip”; the tag bits rule out false matches across codeword boundaries]
Speed ≈ Compression ratio
You find this under my Software projects
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary: bzip, not, or, space
[Figure: as in the previous slide — P = bzip is encoded with the dictionary’s word-based Huffman (e.g. 1a0b) and searched directly, byte-aligned, in C(S) for S = “bzip or not bzip”]
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: the pattern P slid along the text T, checking each alignment]
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running
time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}^n)
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m
H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]
P=0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s=s’ if and only if H(s)=H(s’)
Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1)
H(T_r) = 2·H(T_{r−1}) − 2^m·T[r−1] + T[r+m−1]
T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P=101111
q=7
H(P) = 47
Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally!
1·2 (mod 7) + 0 = 2
2·2 (mod 7) + 1 = 5
5·2 (mod 7) + 1 = 4
4·2 (mod 7) + 1 = 2
We can still compute Hq(Tr) from Hq(Tr−1):
2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)
2·2 (mod 7) + 1 = 5
5 (mod 7) = 5 = Hq(P)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
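The whole scheme above fits in a few lines. A minimal Python sketch of the deterministic variant on a binary text (fingerprints mod a random prime, each probable match verified explicitly); the naive primality test and the bound I are illustrative simplifications, not the course's exact choices:

```python
import random

def _is_prime(n):
    # naive primality test, fine for this illustration
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def karp_rabin_matches(T, P, I=10**6):
    # pick a random prime q <= I
    while True:
        q = random.randrange(2, I)
        if _is_prime(q):
            break
    m, n = len(P), len(T)
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):                      # Hq(P) and Hq(T1)
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    pow_m = pow(2, m - 1, q)                # 2^(m-1) mod q
    out = []
    for r in range(n - m + 1):
        # verification makes the algorithm deterministic
        if ht == hp and T[r:r+m] == P:
            out.append(r)
        if r + m < n:                       # Hq(T_{r+1}) from Hq(T_r)
            ht = ((ht - int(T[r]) * pow_m) * 2 + int(T[r + m])) % q
    return out
```

On the slides' example T=10110101, P=0101 this reports the single occurrence at position 5 (index 4, 0-based).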
Problem 1: Solution
[Figure: Dictionary = {a, bzip, not, or, space}; searching P = bzip = 1a 0b directly in the compressed text C(S) of S = “bzip or not bzip”, answering yes/no per codeword.]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for
[Figure: the m×n matrix M for T = california and P = for; the only 1s lie along the diagonal of the occurrence of “for” ending at j = 7.]
How does M solve the exact match problem?
How to construct M
We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one
Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:
And(A,B) is bit-wise and between A and B.
BitShift(A) is the value obtained by shifting A’s bits down by one position and setting the first bit to 1.
E.g. BitShift( (0,1,1,0,1)ᵀ ) = (1,0,1,1,0)ᵀ
Let w be the word size (e.g., 32 or 64 bits). We’ll assume
m ≤ w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 in the positions where character x appears in P.
Example:
P = abaac
U(a) = (1,0,1,1,0)ᵀ   U(b) = (0,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ
How to construct M
Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by
M(j) = BitShift( M(j−1) ) & U(T[j])
For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T ending at character j−1 ⇔ M(i−1,j−1) = 1
(2) P[i] = T[j] ⇔ the i-th bit of U(T[j]) = 1
BitShift moves bit M(i−1,j−1) into the i-th position;
AND-ing with the i-th bit of U(T[j]) establishes whether both conditions hold.
An example (T = xabxabaaca, P = abaac)
[Figure: columns M(1), M(2), M(3), …, M(9) computed via M(j) = BitShift(M(j−1)) & U(T[j]); at j = 9 the 5-th bit of M(9) is set, so an occurrence of P ends at position 9.]
Shift-And method: Complexity
If m ≤ w, any column and any vector U() fit in a memory word, and each step requires O(1) time.
If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words, and each step requires O(m/w) time.
Overall O(n(1 + m/w) + m) time.
Thus it is very fast when the pattern length is close to the word size, which is very often the case in practice. Recall that w = 64 bits in modern architectures.
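When m ≤ w the whole method fits in a few lines. A sketch in Python, using an integer as the bit-vector column (bit i−1 of the word represents row i of M; reported positions are 0-based):

```python
def shift_and(T, P):
    m = len(P)
    # U[c]: bitmask with bit i set iff P[i+1] == c (LSB = position 1 of P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = 0
    occs = []
    for j, c in enumerate(T):
        # BitShift: shift up and set the first bit to 1, then AND with U(T[j])
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & (1 << (m - 1)):       # bit m set => a full match ends at j
            occs.append(j - m + 1)   # 0-based starting position
    return occs
```

For the slides' example T = xabxabaaca and P = abaac, the single occurrence starts at position 5 (index 4, 0-based).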
Some simple extensions
We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)ᵀ   U(b) = (1,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ
What about ‘?’, ‘[^…]’ (not)?
Problem 1: Another solution
[Figure: Dictionary = {a, bzip, not, or, space}; searching P = bzip = 1a 0b directly in the compressed text C(S) of S = “bzip or not bzip”, answering yes/no per codeword.]
Speed ≈ Compression ratio
Problem 2
Dictionary
Given a pattern P find
all the occurrences in S
of all terms containing
P as substring
[Figure: Dictionary = {a, bzip, not, or, space}, P = o; the matching terms are not = 1g 0g 0a and or = 1g 0a 0b, searched for in C(S) of S = “bzip or not bzip”.]
Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: the occurrences of patterns P1 and P2 inside text T.]
Naïve solution
Use an (optimal) Exact Matching Algorithm searching each
pattern in P
Complexity: O(nl+m) time, not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns in P
R is a bitmap of length m
R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method searching for S:
For any symbol c, U’(c) = U(c) AND R
U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
For any step j:
compute M(j)
then OR it with U’(T[j]). Why? To set to 1 the first bit of each pattern that starts with T[j]
Check if there are occurrences ending in j. How? Test the bits of M(j) corresponding to the last symbol of each pattern
Problem 3
Dictionary
Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches
[Figure: Dictionary = {a, bzip, not, or, space}, P = bot, k = 2; searching in C(S) of S = “bzip or not bzip”.]
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa
aatatccacaa
atcgaa
Agrep
Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:
Ml(i,j) = 1 iff there are no more than l mismatches
between the first i characters of P and the i characters
of T ending at character j.
What is M0?
How does Mk solve the k-mismatch problem?
Computing Mk
We compute Ml for all l=0, … ,k.
For each j compute M0(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
Computing Ml: case 1
The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.
[Figure: P[1,i−1] aligned against T ending at position j−1, with at most l mismatches, followed by the equal pair P[i] = T[j].]
BitShift( M^l(j−1) ) AND U(T[j])
Computing Ml: case 2
The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.
[Figure: P[1,i−1] aligned against T ending at position j−1, with at most l−1 mismatches; position i may mismatch.]
BitShift( M^{l−1}(j−1) )
Computing Ml
We compute Ml for all l=0, … ,k.
For each j compute M0(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
M^l(j) = [ BitShift( M^l(j−1) ) AND U(T[j]) ]  OR  BitShift( M^{l−1}(j−1) )
Example M1
[Figure: the matrices M0 and M1 for T = xabxabaaca and P = abaad; the 1 in row 5, column 9 of M1 signals an occurrence with at most one mismatch ending at position 9.]
How much do we pay?
The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.
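The recurrence above can be sketched directly, keeping only the previous column of each M^0,…,M^k. A Python sketch (0-based positions; function name is illustrative):

```python
def shift_and_k_mismatch(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    top = 1 << (m - 1)
    M = [0] * (k + 1)              # M[l] = current column of M^l
    occs = []
    for j, c in enumerate(T):
        prev = M[:]                # the columns for position j-1
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            # case 1: extend an l-mismatch prefix with a matching character
            # case 2: extend an (l-1)-mismatch prefix, spending one mismatch
            M[l] = ((((prev[l] << 1) | 1) & U.get(c, 0))
                    | ((prev[l - 1] << 1) | 1))
        if M[k] & top:             # occurrence with <= k mismatches ends at j
            occs.append(j - m + 1)
    return occs
```

On the slides' example T = xabxabaaca, P = abaad, k = 1 this finds the occurrence "abaac" starting at index 4 (one mismatch on the last character).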
Problem 3: Solution
Dictionary
Given a pattern P find all the occurrences in S of all terms containing P as substring allowing k mismatches
[Figure: Dictionary = {a, bzip, not, or, space}, P = bot, k = 2; scanning C(S) of S = “bzip or not bzip” matches not = 1g 0g 0a.]
Agrep: more sophisticated operations
The Shift-And method can solve other ops
The edit distance between two strings p and s is
d(p,s) = the minimum number of operations needed
to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p into a different one
Example: d(ananas,banane) = 3
Search by regular expressions
Example: (a|b)?(abc|a)
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
γ-code for integer encoding
γ(x) = 0 0 … 0 x-in-binary, with (Length−1) leading zeros,
where x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000,1001>.
γ-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e., a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of γ-coded
integers, reconstruct the original sequence:
0001000001100110000011101100111
8, 6, 3, 59, 7
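The exercise above can be checked mechanically. A small sketch of γ-encoding/decoding over bit-strings (function names are illustrative):

```python
def gamma_encode(x):
    # gamma(x): (Length-1) zeros, then x in binary; defined for x >= 1
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":          # count leading zeros = Length-1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out
```

gamma_encode(9) gives "0001001" (i.e. <000,1001>), and decoding the exercise's bit-string yields 8, 6, 3, 59, 7.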
Analysis
Sort the p_i in decreasing order, and encode s_i via
the variable-length code γ(i).
Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:
1 ≥ Σ_{i=1,…,x} p_i ≥ x·p_x  ⇒  x ≤ 1/p_x
How good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):
Σ_{i=1,…,|S|} p_i · |γ(i)| ≤ Σ_{i=1,…,|S|} p_i · [2·log(1/p_i) + 1] = 2·H0(X) + 1
Not much worse than Huffman,
and improvable to H0(X) + 2 + ....
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman
End-tagged dense code
The rank r is mapped to r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix-code
Better compression: it uses all 7-bits configurations
(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is:
Previously we used: s = c = 128
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
and s·c with 2 bytes, s·c² with 3 bytes, ...
An example
5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230·26 = 6210 words on 2
bytes, hence more words on 1 byte; thus, if the distribution is skewed, it compresses better...
Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.
Brute-force approach
Binary search: on real distributions there seems to be a unique minimum
K_s = max codeword length
F_s^k = cumulative probability of the symbols whose codeword length is ≤ k
Experiments: (s,c)-DC is much more interesting…
Search is 6% faster than byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
Exploit temporal locality, and it is dynamic
X = 1^n 2^n 3^n … n^n ⇒ Huff = Θ(n² log n), MTF = O(n log n) + n²
Not much worse than Huffman
...but it may be far better
MTF: how good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log i + 1
Put S in front of the list and consider the cost of encoding:
O(|S| log |S|) + Σ_{x=1,…,|S|} Σ_i |γ(p_i^x − p_{i−1}^x)|
By Jensen’s inequality:
≤ O(|S| log |S|) + Σ_{x=1,…,|S|} n_x · [2·log(N/n_x) + 1]
= O(|S| log |S|) + N·[2·H0(X) + 1]
Hence L_a[mtf] ≤ 2·H0(X) + O(1)
MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to maintain the MTF-list efficiently:
Search tree: leaves contain the symbols, ordered as in the MTF-list; nodes store the size of their descending subtree
Hash table: the key is a symbol, the data is a pointer to the corresponding tree leaf
Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings just numbers and one bit
Properties:
Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n (there is a memory)
Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
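The run-length transform itself is a one-pass scan; a minimal Python sketch producing (symbol, run-length) pairs as in the example above:

```python
def rle_encode(s):
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:   # extend the current run
            j += 1
        out.append((s[i], j - i))            # (symbol, run length)
        i = j
    return out
```

E.g. rle_encode("abbbaacccca") gives (a,1),(b,3),(a,2),(c,4),(a,1); for binary strings only the run lengths and one starting bit are needed.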
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).
e.g. p(a) = .2, p(b) = .5, p(c) = .3
f(i) = Σ_{j=1}^{i−1} p(j), hence f(a) = .0, f(b) = .2, f(c) = .7
[Figure: the unit interval partitioned into a = [0,.2), b = [.2,.7), c = [.7,1).]
The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: interval refinement — b: [.2,.7), then a: [.2,.3), then c: [.27,.3).]
The final sequence interval is [.27,.3)
Arithmetic Coding
To code a sequence of symbols c with probabilities
p[c] use the following:
l_0 = 0,  s_0 = 1
l_i = l_{i−1} + s_{i−1} · f(c_i)
s_i = s_{i−1} · p(c_i)
f[c] is the cumulative prob. up to symbol c (not included)
The final interval size is s_n = Π_{i=1}^{n} p(c_i)
The interval for a message sequence will be called the
sequence interval
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: .49 falls in b = [.2,.7); rescaled, it falls again in b, then in c.]
The message is bbc.
Representing a real number
Binary fractional representation:
.75 = .11
1/3 = .010101…
11/16 = .1011
Algorithm:
1. x = 2·x
2. If x < 1, output 0
3. else x = x − 1; output 1
So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01,  [.33,.66) = .1,  [.66,1) = .11
Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
number   min       max       interval
.11      .11000…   .11111…   [.75, 1.0)
.101     .10100…   .10111…   [.625, .75)
We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).
[Figure: the sequence interval [.61, .79) contains the code interval of .101, i.e. [.625, .75).]
Can use L + s/2 truncated to 1 + log (1/s) bits
Bound on Arithmetic length
Note that –log s+1 = log (2/s)
Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s)
= 1 + log Π_{i=1,n} (1/p_i)
≤ 2 + Σ_{i=1,n} log (1/p_i)
= 2 + Σ_{k=1,|S|} n·p_k·log (1/p_k)
= 2 + n·H0 bits
nH0 + 0.02 n bits in practice
because of rounding
Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:
Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever the sequence interval falls into the top,
bottom or middle half, expand the interval
by a factor of 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s, and set m = 0
The message interval is expanded by 2
If u < R/2 then (bottom half)
Output 0 followed by m 1s, and set m = 0
The message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
The message interval is expanded by 2
In all other cases, just continue...
You find this at
Arithmetic ToolBox
As a state machine
[Figure: the ATB maps the current state (L,s) and a symbol c with distribution (p1,…,p|S|) to the refined state (L’,s’).]
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
[Figure: the ATB driven by p[ s | context ], where s = c or esc; it maps state (L,s) to (L’,s’).]
Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)
PPM: Example Contexts
String = ACCBACCACBA B,  k = 2
Context ∅:  A = 4, B = 2, C = 5, $ = 3
Context A:  C = 3, $ = 1
Context B:  A = 2, $ = 1
Context C:  A = 1, B = 2, C = 2, $ = 3
Context AC: B = 1, C = 2, $ = 2
Context BA: C = 1, $ = 1
Context CA: C = 1, $ = 1
Context CB: A = 2, $ = 1
Context CC: A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
[Figure: the dictionary consists of all substrings starting in the already-scanned part of the text; the cursor marks the current position. Example output: <2,3,c>.]
Algorithm’s step:
Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match
Advance by len + 1
A buffer “window” has fixed length and moves
Example: LZ77 with window
Window size = 6;  T = a a c a a c a b c a b a a a c
(0,0,a) — no match, emit the next character a
(1,1,c) — copy “a” at distance 1, next character c
(3,4,b) — copy “aaca” at distance 3, next character b
(3,3,a) — copy “cab” at distance 3, next character a
(1,2,c) — overlapping copy “aa” at distance 1, next character c
At each step: longest match within the window W, then the next character
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if len > d? (overlap with the part of the text still to be decompressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor:
for (i = 0; i < len; i++)
  out[cursor+i] = out[cursor-d+i];
Output is correct: abcdcdcdcdcdce
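A naive quadratic sketch of the LZ77 step with a sliding window, in Python (real implementations hash triplets for speed, as gzip does; the self-overlapping match is exactly the len > d case above):

```python
def lz77_encode(s, w=6):
    out, cur = [], 0
    n = len(s)
    while cur < n:
        best_d, best_len = 0, 0
        for start in range(max(0, cur - w), cur):
            l = 0
            # the match may extend past the cursor (overlap allowed),
            # but must leave one next character to emit
            while cur + l < n - 1 and s[start + l] == s[cur + l]:
                l += 1
            if l > best_len:
                best_d, best_len = cur - start, l
        out.append((best_d, best_len, s[cur + best_len]))
        cur += best_len + 1            # advance by len + 1
    return out
```

On the example above, lz77_encode("aacaacabcabaaac", 6) reproduces the parsing (0,0,a)(1,1,c)(3,4,b)(3,3,a)(1,2,c).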
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
Input: a a b a a c a b c a b c b
Output   Dict.
(0,a)    1 = a
(1,b)    2 = ab
(1,a)    3 = aa
(0,c)    4 = c
(2,c)    5 = abc
(5,b)    6 = abcb
LZ78: Decoding Example
Input    Decoded so far     Dict.
(0,a)    a                  1 = a
(1,b)    aab                2 = ab
(1,a)    aabaa              3 = aa
(0,c)    aabaac             4 = c
(2,c)    aabaacabc          5 = abc
(5,b)    aabaacabcabcb      6 = abcb
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
LZW: Encoding Example
Input: a a b a a c a b a b a c b
Output   Dict.
112      256 = aa
112      257 = ab
113      258 = ba
256      259 = aac
114      260 = ca
257      261 = aba
261      262 = abac
114      263 = cb
LZW: Decoding Example
Input    Decoded so far    Dict.
112      a
112      aa                256 = aa
113      aab               257 = ab
256      aabaa             258 = ba
114      aabaac            259 = aac
257      aabaacab          260 = ca
261      aabaacababa       261 = aba — code 261 was not yet defined: the decoder is one step behind the coder, and resolves it as prev + prev[0] (the SSc special case)
114      aabaacababac      262 = abac
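A sketch of LZW including the SSc special case, in Python. Unlike the slides' toy numbering (a = 112), this uses real ASCII codes (a = 97); the dictionary-growth policies of the next slide are omitted:

```python
def lzw_encode(s):
    dic, next_id = {chr(i): i for i in range(256)}, 256
    out, cur = [], ""
    for c in s:
        if cur + c in dic:
            cur += c                       # extend the current match
        else:
            out.append(dic[cur])           # emit id, don't emit c itself
            dic[cur + c] = next_id         # but still add Sc to the dictionary
            next_id += 1
            cur = c
    if cur:
        out.append(dic[cur])
    return out

def lzw_decode(codes):
    dic, next_id = {i: chr(i) for i in range(256)}, 256
    prev = dic[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dic:
            entry = dic[code]
        else:                              # SSc special case: code not yet defined
            entry = prev + prev[0]
        out.append(entry)
        dic[next_id] = prev + entry[0]     # decoder is one step behind the coder
        next_id += 1
        prev = entry
    return "".join(out)
```

On "aabaacababacb" the special case fires exactly at code 261, as in the table above.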
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
Consider the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows  (1994)
F                 L
#  mississipp    i
i  #mississip    p
i  ppi#missis    s
i  ssippi#mis    s
i  ssissippi#    m
m  ississippi    #
p  i#mississi    p
p  pi#mississ    i
s  ippi#missi    s
s  issippi#mi    s
s  sippi#miss    i
s  sissippi#m    i
A famous example
[Figure: a much longer example.]
A useful tool: the L→F mapping
[Figure: the sorted BWT matrix; only F = #iiiimppssss and L = ipssm#pissii are known, the middle columns are unknown.]
How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...
Take two equal L’s chars
Rotate rightward their rows
Same relative order !!
The BWT is invertible
[Figure: the sorted BWT matrix; F = #iiiimppssss, L = ipssm#pissii, middle columns unknown.]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i ppi #
InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
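Both directions can be sketched directly from the definitions — the rotation sort is the "elegant but inefficient" construction discussed next, and the inverse follows the LF-mapping and the backward reconstruction above. A Python sketch (the sentinel '#' is assumed unique and smallest):

```python
def bwt(T):
    # sort all cyclic rotations of T and take the last column
    rot = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(r[-1] for r in rot)

def ibwt(L):
    n = len(L)
    # LF-mapping: a stable sort keeps equal chars in the same relative
    # order, so the k-th c in L maps to the k-th c in F
    order = sorted(range(n), key=lambda i: L[i])
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    out, r = [], 0                 # row 0 of the sorted matrix starts with '#'
    for _ in range(n):
        out.append(L[r])           # L[r] precedes F[r] in T
        r = LF[r]
    s = "".join(reversed(out))     # '#' followed by T without its sentinel
    return s[1:] + s[0]
```

For T = mississippi# this yields L = ipssm#pissii, and ibwt recovers T, matching the matrix above.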
How to compute the BWT ?
SA    BWT matrix      L
12    #mississippi    i
11    i#mississipp    p
 8    ippi#mississ    s
 5    issippi#miss    s
 2    ississippi#m    m
 1    mississippi#    #
10    pi#mississip    p
 9    ppi#mississi    i
 7    sippi#missis    s
 4    sissippi#mis    s
 6    ssippi#missi    i
 3    ssissippi#mi    i
We said that: L[i] precedes F[i] in T
L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
How to construct SA from T ?
SA
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#
Elegant but inefficient
Input: T = mississippi#
Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii  (# at position 16)
MTF-list = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Mtf = 030040000040040300400400000200000  (non-zero values shifted up by one, to reserve symbols for the 0-runs)
RLE0 = 03141041403141410210  (run lengths written in Wheeler’s code, e.g. Bin(6) = 110)
Alphabet of size |S|+1
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16), plus the original MTF-list (i,m,p,s)
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links every week
Lifetime of a page is about 10 days
The Bow Tie
Some definitions
Weakly connected component (WCC): set of nodes such that any node can reach any other node via an undirected path.
Strongly connected component (SCC): set of nodes such that any node can reach any other node via a directed path.
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
The Web is the largest artifact ever conceived by humankind
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph
V = routers, E = communication links
The “cosine” graph (undirected, weighted)
V = static web pages, E = semantic distance between pages
Query-Log graph (bipartite, weighted)
V = queries and URLs, E = (q,u) if u is a result for q that has been clicked by some user who issued q
Social graph (undirected, unweighted)
V = users, E = (x,y) if x knows y (facebook, address book, email, ..)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
The In-degree distribution
Indegree follows a power-law distribution (observed on the Altavista crawl, 1999, and the WebBase crawl, 2001):
Pr[ in-degree(u) = k ] ∝ 1/k^α,  α ≈ 2.1
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
[Figure: scatter plot of the adjacency matrix (i,j) of a crawl with 21 million pages and 150 million links; after URL-sorting, hosts such as Berkeley and Stanford appear as dense blocks near the diagonal.]
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s₁, s₂, …, s_k} encoded as gaps: {s₁−x, s₂−s₁−1, …, s_k−s_{k−1}−1}
For negative entries:
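The gap transformation alone is a one-liner; a Python sketch (the first gap may be negative, which WebGraph then maps onto a natural number — that final mapping is omitted here, and the example list is hypothetical):

```python
def gaps(x, succ):
    # succ: the sorted successor list of node x
    # first entry is s1 - x (may be negative, exploiting locality);
    # subsequent entries are s_i - s_{i-1} - 1 (gaps, minus one)
    out = [succ[0] - x]
    for i in range(1, len(succ)):
        out.append(succ[i] - succ[i - 1] - 1)
    return out
```

E.g. for node 15 with successors {13, 15, 16, 17, 23} this yields {−2, 1, 0, 0, 5}: locality keeps most gaps tiny, so they compress well.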
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of y’s copy-list tells whether the corresponding successor of the reference node x is also a successor of y;
The reference index is chosen in [0,W] so as to give the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.
Consecutivity in extra-nodes
Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: encoded as differences between consecutive residuals, or w.r.t. the source; in the example:
0 = (15−15)·2 (positive)
2 = (23−19)−2 (jump ≥ 2)
600 = (316−16)·2
3 = |13−15|·2−1 (negative)
3018 = 3041−22−1
Algoritmi per IR
Compression of file collections
Background
data
knowledge
about data
at receiver
sender
receiver
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
What if the sender has never seen the data at the receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression  [diff, zdelta, REBL,…]
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested
page and the ones available in cache
File synchronization  [rsync, zsync]
Client updates old file f_old with f_new available on a server
Mirroring, Shared Crawling, Content Distribution Networks
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
The LZ77 scheme provides an efficient, optimal solution:
fknown is the "previously encoded text": compress the concatenation fknown · fnew, starting from fnew
zdelta is one of the best implementations
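The idea of letting LZ77 copies reach back into fknown can be sketched with zlib's preset-dictionary support. This is only a rough stand-in for zdelta (not its actual format), and the sample contents are illustrative:

```python
import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    # f_known acts as a preset dictionary: LZ77 back-references in the
    # output may point into it, so only the novel parts of f_new cost bits.
    c = zlib.compressobj(level=9, zdict=f_known)
    return c.compress(f_new) + c.flush()

def delta_decompress(f_known: bytes, f_d: bytes) -> bytes:
    d = zlib.decompressobj(zdict=f_known)
    return d.decompress(f_d) + d.flush()

old = (b"Delta compression computes a compact patch between two files. "
       b"It is useful for mirroring, web caching and software updates. "
       b"The LZ77 scheme gives an efficient and optimal covering.")
new = old.replace(b"optimal", b"provably optimal")

fd = delta_compress(old, new)
assert delta_decompress(old, fd) == new
assert len(fd) < len(zlib.compress(new, 9))  # delta beats plain compression
```

The delta stream is tiny because almost everything in `new` is copied out of the dictionary `old`; plain compression has to pay for the whole text once.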
         Emacs size   Emacs time
uncompr  27Mb         ---
gzip     8Mb          35 secs
zdelta   1.5Mb        42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
Client
reference
request
Slow-link
Delta-encoding
Proxy
reference
request
Fast-link
web
Page
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f ∈ F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
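A toy version of this reduction, using zlib with a preset dictionary as a stand-in for the zdelta edge weights (the file contents are hypothetical). Note that a full Min Branching needs Edmonds' cycle-contraction step; the greedy sketch below only performs the first step of picking each node's cheapest incoming edge:

```python
import zlib

def delta_size(data, ref=None):
    # Edge weight: size of `data` compressed deploying `ref` as a preset
    # dictionary; ref=None plays the dummy node (plain gzip-like coding).
    if ref:
        c = zlib.compressobj(level=9, zdict=ref)
        return len(c.compress(data) + c.flush())
    return len(zlib.compress(data, 9))

def pick_references(files):
    # Greedy first step of min branching: every node keeps its cheapest
    # incoming edge, either from another file or from the dummy node.
    # (Cycles among near-duplicates would be broken by Edmonds' algorithm.)
    choice = {}
    for name, data in files.items():
        best_w, best_ref = delta_size(data), None
        for other, ref in files.items():
            if other != name and delta_size(data, ref) < best_w:
                best_w, best_ref = delta_size(data, ref), other
        choice[name] = best_ref
    return choice

files = {
    "v1": b"The quick brown fox jumps over the lazy dog near the river bank.",
    "v2": b"The quick brown fox jumps over the sleepy dog near the river bank.",
    "misc": b"0123456789" * 3,
}
refs = pick_references(files)
assert refs["v2"] == "v1"   # near-duplicate versions delta against each other
```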
         space   time
uncompr  30Mb    ---
tgz      20%     linear
THIS     8%      quadratic
Improvement
What about
many-to-one compression?
(group of files)
Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files
Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
         space   time
uncompr  260Mb   ---
tgz      12%     2 mins
THIS     8%      16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
request
f_new
update
Server
f_old
Client
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
hashes
f_new
Server
encoded file
f_old
Client
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
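The round-trip above can be sketched end-to-end. This is a heavily simplified toy (tiny fixed block size, file length assumed to be handled by literals; real rsync slides a cheap 4-byte rolling hash over every offset and only confirms hits with MD5, while here we hash every window directly):

```python
import hashlib

BLOCK = 8  # toy block size; rsync defaults to max(700, sqrt(n)) bytes

def client_hashes(f_old):
    # The client splits its outdated file into blocks and sends their hashes.
    return {hashlib.md5(f_old[i:i + BLOCK]).digest(): i
            for i in range(0, len(f_old), BLOCK)}

def server_encode(f_new, hashes):
    # The server scans f_new: on a block-hash hit it emits a copy op,
    # otherwise a single literal byte.
    ops, j = [], 0
    while j < len(f_new):
        h = hashlib.md5(f_new[j:j + BLOCK]).digest()
        if len(f_new) - j >= BLOCK and h in hashes:
            ops.append(("copy", hashes[h])); j += BLOCK
        else:
            ops.append(("lit", f_new[j:j + 1])); j += 1
    return ops

def client_decode(f_old, ops):
    # The client rebuilds f_new from its own blocks plus the literals.
    out = b""
    for kind, v in ops:
        out += f_old[v:v + BLOCK] if kind == "copy" else v
    return out

old = b"abcdefghABCDEFGHijklmnop"
new = b"XXabcdefghijklmnopABCDEFGH"
ops = server_encode(new, client_hashes(old))
assert client_decode(old, ops) == new
```

Note how the block granularity shows up: the two inserted literal bytes "XX" shift nothing, because matches are found at arbitrary offsets of f_new, only the blocks of f_old are fixed.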
Rsync: some experiments
        gcc size   emacs size
total   27288      27326
gzip    7563       8577
zdelta  227        1431
rsync   964        4452
Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
The server sends hashes (unlike the client in rsync); the client checks them
The server deploys the common fref to compress the new ftar (rsync compresses just it).
A multi-round protocol
k blocks of n/k elems
log n/k levels
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T iff P is a prefix of the i-th suffix of T (i.e., T[i,N])
Occurrences of P in T = all suffixes of T having P as a prefix
Example: P = si, T = mississippi ⇒ occurrences at positions 4, 7
SUF(T) = sorted set of suffixes of T
Reduction
From substring search
To prefix search
The Suffix Tree
T# = mississippi#
[Figure: the suffix tree of T#; edge labels are substrings of T# (e.g. "ssi", "ppi#", "pi#", "mississippi#"), and each leaf is labeled with the starting position (1..12) of its suffix.]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.
Storing SUF(T) explicitly takes Θ(N²) space; store suffix pointers instead.
T = mississippi#
SA   SUF(T)
12   #
11   i#
8    ippi#
5    issippi#
2    ississippi#
1    mississippi#
10   pi#
9    ppi#
7    sippi#
4    sissippi#
6    ssippi#
3    ssissippi#
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
P = si, T = mississippi#
Compare P against the suffix pointed to by the middle SA entry: if P is larger, recurse on the right half; if P is smaller, recurse on the left half.
Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char cmp ⇒ overall, O(p log₂ N) time
• Improvable to O(p + log₂ N) [Manber-Myers, ’90], and to O(p + log₂ |S|) [Cole et al., ’06]
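The binary search above can be sketched directly. The construction here just sorts all suffixes (quadratic in the worst case, fine for a demo, not the intended linear-time construction), and the two searches bracket the contiguous SA range of Prop. 1:

```python
def suffix_array(T):
    # Plain sorting of all suffixes: O(N^2 log N) worst case,
    # enough to illustrate the search, not a production construction.
    return sorted(range(len(T)), key=lambda i: T[i:])

def sa_range(T, sa, P):
    # Two binary searches over SA; each comparison inspects O(p) chars,
    # so the whole search costs O(p log N) as stated in the slide.
    def first_not(pred):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(T[sa[mid]:sa[mid] + len(P)]):
                lo = mid + 1
            else:
                hi = mid
        return lo
    left = first_not(lambda s: s < P)    # first suffix with prefix >= P
    right = first_not(lambda s: s <= P)  # first suffix with prefix >  P
    return left, right

T = "mississippi#"
sa = suffix_array(T)
l, r = sa_range(T, sa, "si")
assert sorted(sa[i] for i in range(l, r)) == [3, 6]  # 0-based (slide: 4, 7)
```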
Locating the occurrences
P = si, T = mississippi# ⇒ occ = 2, at positions 4 (sissippi…) and 7 (sippi…).
To locate the range, binary search for P extended with a smallest symbol (si#) and with a largest symbol (si$), since # < S < $.
Suffix Array search: O(p + log₂ N + occ) time
Suffix Trays: O(p + log₂ |S| + occ) [Cole et al., ‘06]
String B-tree [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays [Ciriani et al., ’02]
Text mining
Lcp[1,N−1] = longest common prefix between suffixes adjacent in SA
T = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =  0  1 1 4 0 0  1 0 2 1 3
(e.g., Lcp between issippi# and ississippi# is 4)
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k−1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C−2] whose entries are all ≥ L.
Slide 9
Paradigm shift...
Web 2.0 is about the many
Big
DATA Big PC ?
We have three types of algorithms:
T1(n) = n, T2(n) = n^2, T3(n) = 2^n
... and assume that 1 step = 1 time unit
How many input data n can each algorithm process within t time units?
n1 = t,  n2 = √t,  n3 = log2 t
What about a k-times faster processor?
...or, what is n when the time units are k*t ?
n1 = k*t,  n2 = √k * √t,  n3 = log2 (kt) = log2 k + log2 t
A new scenario
Data are more available than ever before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Step cost is Ω(1) time
Not just MIN #steps…
CPU ↔ registers: few words fetched per access
Cache (L1/L2): few Mbs, some nanosecs, few words fetched
RAM: few Gbs, tens of nanosecs, some words fetched
HD: few Tbs, few millisecs, B = 32K page
net: many Tbs, even secs, packets
You should be “??-aware programmers”
I/O-conscious Algorithms
track
read/write head
read/write arm
magnetic surface
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O [10^5–10^6 time units (Hennessy-Patterson)]
If N = (1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)
Space-conscious Algorithms
I/Os
search
access
Compressed
data structures
Streaming Algorithms
track
read/write head
read/write arm
magnetic surface
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
[Memory hierarchy as before: CPU/registers → L1/L2 cache → RAM → HD → net; sizes, access times and transfer units grow down the hierarchy, with B = 32K disk pages.]
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
n     4K   8K   16K   32K    128K   256K   512K   1M
n^3   22s  3m   26m   3.5h   28h    --     --     --
n^2   0    0    0     1s     26s    106s   7m     28m
An optimal solution
We assume every prefix sum ≠ 0
A = 2 -5 6 1 -2 4 3 -13 9 -6 7   (optimum: 6 1 -2 4 3)
Algorithm
sum=0; max = -1;
For i=1,...,n do
  If (sum + A[i] ≤ 0) sum=0;
  else { sum += A[i]; max = MAX{max, sum}; }
Note:
• Sum < 0 right before OPT starts;
• Sum > 0 within OPT
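This single scan is Kadane's algorithm. A compact variant (restarting the running sum whenever extending it cannot help, which also handles all-negative arrays) might look like:

```python
def max_subarray(A):
    # One pass: `run` is the best sum of a subarray ending here;
    # either extend the previous one or restart at the current element.
    best, run = A[0], 0
    for x in A:
        run = max(run + x, x)
        best = max(best, run)
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]  # the slide's example
assert max_subarray(A) == 12                # subarray 6 1 -2 4 3
```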
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions Data get distributed arbitrarily !!!
B-tree internal nodes
B-tree leaves
(“tuple pointers")
What about
listing tuples
in order ?
Tuples
Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;          // Divide
    Merge-Sort(A,i,m);    // Conquer
    Merge-Sort(A,m+1,j);
    Merge(A,i,m,j)        // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10^9 tuples ⇒ few Gbs
Typical disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log2 n) random I/Os
[5ms] * n log2 n ≈ 1.5 years
In practice, it is faster because of caching (each merge level is 2 passes, read/write)...
Merge-Sort Recursion Tree
[Figure: recursion tree of binary mergesort, log2 N levels of runs being pairwise merged.]
If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.
How do we deploy the disk/memory features?
N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main memory M and disk pages of B items:
Pass 1: produce N/M sorted runs.
Pass i: merge X = M/B runs ⇒ log_{M/B}(N/M) passes
[Figure: X input buffers of B items each, plus one output buffer, between the input disk and the output disk.]
Multiway Merging
Keep one current page per run (Bf1,…,Bfx, with cursors p1,…,pX) and one output page Bfo: repeatedly move min(Bf1[p1], Bf2[p2], …, Bfx[pX]) to Bfo; fetch the next page of run i when pi = B, and flush Bfo to the merged run when it is full, until EOF on all X = M/B runs.
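The two passes can be sketched in a few lines. For brevity the runs stay in memory as lists (a real external sort would spill each run to disk); `heapq.merge` maintains exactly the heap over the X run heads that the min(Bf1[p1],…,BfX[pX]) step describes:

```python
import heapq

def external_sort(items, M=4):
    # Pass 1: cut the input into runs of M items, sort each "in memory".
    runs, buf = [], []
    for x in items:
        buf.append(x)
        if len(buf) == M:
            runs.append(sorted(buf)); buf = []
    if buf:
        runs.append(sorted(buf))
    # Pass 2: one multiway merge of all runs, always emitting the
    # minimum among the current run heads.
    return list(heapq.merge(*runs))

data = [5, 1, 13, 19, 9, 7, 4, 15, 3, 8, 12, 17, 6, 11, 2, 10]
assert external_sort(data) == sorted(data)
```

With N/M ≤ M/B runs this is exactly one merge pass, i.e. the "2 passes = few mins" regime of the next slide.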
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B}(#runs) ≈ log_{M/B}(N/M)
Optimal cost = Θ((N/B) log_{M/B}(N/M)) I/Os
In practice
M/B ≈ 1000 ⇒ #passes = log_{M/B}(N/M) ≈ 1
One multiway merge ⇒ 2 passes = few mins
Tuning depends on disk features
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
Might compression help?
Goal: enlarge M and reduce N
#passes = O(logM/B N/M)
Cost of a pass = O(N/B)
Part of Vitter’s paper addresses issues related to:
Disk striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower bounds: how far we can go
Toy problem #3: Top-freq elements
Goal: top queries over a stream of N items (N large).
Math Problem: find the item y whose frequency is > N/2, using the smallest space (i.e., assuming the mode occurs > N/2 times).
A = b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables <X,C>
For each item s of the stream:
  if (C==0) { X=s; C=1; }
  else if (X==s) C++;
  else C--;
Return X;
Proof (sketch)
If X≠y at the end, then every one of y’s occurrences has a “negative” mate; hence these mates number at least #occ(y). As a result, 2 * #occ(y) > N: a contradiction.
Problem: if the mode occurs ≤ N/2 times, the returned X may be wrong.
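The two-variable scan above is the Boyer-Moore majority vote; a direct transcription on the slide's stream:

```python
def majority_candidate(stream):
    # One pass, two variables (X, C). The returned X is guaranteed to be
    # the mode only when some item occurs more than N/2 times.
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

A = list("bacccdcbaaaccbccc")       # the slide's stream: c occurs 9/17 times
assert majority_candidate(A) == "c"
```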
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 * 10^9 chars ⇒ size = 6Gb
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms
What kind of data structure do we build to support word-based searches?
Solution 1: Term-Doc matrix (t = 500K terms, n = 1 million docs)
            Antony&Cleopatra  JuliusCaesar  TheTempest  Hamlet  Othello  Macbeth
Antony             1               1            0          0       0        1
Brutus             1               1            0          1       0        0
Caesar             1               1            0          1       1        1
Calpurnia          0               1            0          0       0        0
Cleopatra          1               0            0          0       0        0
mercy              1               0            1          1       1        1
worser             1               0            1          1       1        0
Entry = 1 if the play contains the word, 0 otherwise.
Space is 500Gb !
Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16
We can still do better: reach 30–50% of the original text.
1. Typically we use about 12 bytes per posting
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
A better index, but it is still >10 times the text !!!!
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in fewer bits ?
NO: they are 2^n, but we have fewer compressed messages:
Σ_{i=1..n−1} 2^i = 2^n − 2
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:
i(s) = log2 (1/p(s)) = −log2 p(s)
Lower probability ⇒ higher information
Entropy is the weighted average of i(s):
H(S) = Σ_{s∈S} p(s) * log2 (1/p(s))  bits
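The definition translates directly into code; evaluated on the distribution used later in the Huffman running example:

```python
from math import log2

def entropy(p):
    # H(S) = sum_s p(s) * log2(1/p(s)): the average self-information.
    return sum(ps * log2(1 / ps) for ps in p.values() if ps > 0)

p = {"a": 0.1, "b": 0.2, "c": 0.2, "d": 0.5}
H = entropy(p)
assert 1.76 < H < 1.77   # about 1.761 bits/symbol
```

Compare with the Huffman code for the same distribution, whose average length is 0.1*3 + 0.2*3 + 0.2*2 + 0.5*1 = 1.8 bits/symbol, within 1 bit of H as Shannon's upper bound promises.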
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be
uniquely decomposed into their codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0
1
0 1
a
0
1
b
c
d
Average Length
For a code C with codeword lengths L[s], the average length is defined as
La(C) = Σ_{s∈S} p(s) * L[s]
We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…
Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn} then pi < pj ⇒ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that La(C) ≤ H(S) + 1
(the Shannon code assigns s a codeword of ⌈log2 1/p(s)⌉ bits)
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
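The greedy construction, merge the two least-probable trees until one remains, is a few lines with a heap. The tie-breaking counter makes the result deterministic (ties are exactly why 2^(n−1) equivalent trees exist):

```python
import heapq
from itertools import count

def huffman(probs):
    # Build the Huffman tree bottom-up, then read the codes off it.
    tie = count()  # breaks probability ties; also avoids comparing trees
    heap = [(p, next(tie), s) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tie), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

c = huffman({"a": 0.1, "b": 0.2, "c": 0.2, "d": 0.5})
lengths = {s: len(w) for s, w in c.items()}
assert lengths == {"a": 3, "b": 3, "c": 2, "d": 1}  # as in a=000,b=001,c=01,d=1
```

The exact bit patterns depend on tie-breaking, but the codeword lengths (and hence La) match the running example below.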
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Merge the two least-probable trees at each step:
a(.1) + b(.2) → (.3);  (.3) + c(.2) → (.5);  (.5) + d(.5) → (1)
Resulting codes: a=000, b=001, c=01, d=1
There are 2^(n−1) “equivalent” Huffman trees (flip 0/1 on any internal node)
What about ties (and thus, tree depth) ?
Encoding and Decoding
Encoding: emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take the branch for each bit received. When at a leaf, output its symbol and return to the root.
With the example codes (a=000, b=001, c=01, d=1):
abc...    → 00000101...
101001... → dcb...
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store for any level L:
firstcode[L] (the 00…0-based first codeword on level L)
Symbol[L,i], for each i in level L
This is ≤ h^2 + |S| log |S| bits
Canonical Huffman: Encoding (levels 1–5)
Canonical Huffman: Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0
T=...00010...
Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
−log2(.999) ≈ .00144
If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
1 extra bit per macro-symbol = 1/k extra bits per symbol
But a larger model has to be transmitted
Shannon took infinite sequences, i.e. k → ∞ !!
In practice, we have:
The model takes |S|^k * (k * log |S|) + h^2 bits (where h might be |S|^k)
It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
[Figure: word-based Huffman tree over the dictionary {a, bzip, not, or, space} and the tagged, byte-aligned encoding of T = “bzip or not bzip”: each codeword is a sequence of bytes carrying 7 bits of Huffman code, plus 1 tag bit marking the first byte of a codeword.]
CGrep and other ideas...
Encode the pattern P = bzip with the same word-based codebook (P = 1a0b) and run GREP directly over C(T), T = “bzip or not bzip”: the tag bits keep the scan aligned on codeword boundaries, yielding a yes/no at each codeword.
Speed ≈ Compression ratio
You find this in my Software projects.
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary: {a, bzip, not, or, space}; S = “bzip or not bzip”
Search P = bzip (encoded as 1a0b) directly in C(S): scan the compressed bytes, comparing codeword against codeword (yes/no at each codeword boundary).
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: the pattern P slid along the text T, checking an alignment at each position.]
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}^n).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:
H(s) = Σ_{i=1..m} 2^(m−i) * s[i]
P = 0101 ⇒ H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5
s = s’ if and only if H(s) = H(s’)
Definition: let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r,r+m−1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr−1):
H(Tr) = 2 * H(Tr−1) − 2^m * T[r−1] + T[r+m−1]
T = 10110101, m = 4
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2*11 − 2^4*1 + 0 = 22 − 16 = 6
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P = 101111, q = 7
H(P) = 47, Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally (Horner’s rule, mod q):
(1*2 + 0) mod 7 = 2
(2*2 + 1) mod 7 = 5
(5*2 + 1) mod 7 = 4
(4*2 + 1) mod 7 = 2
(2*2 + 1) mod 7 = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr−1), since
2^m (mod q) = 2 * (2^(m−1) (mod q)) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
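The deterministic (verify-on-hit) variant of the algorithm fits in a short function; here on the slide's binary example, with a fixed large prime standing in for the random choice of q:

```python
def karp_rabin(T, P, q=2_147_483_647):
    # Hq(s) = (sum_i 2^(m-i) * s[i]) mod q, updated in O(1) per shift.
    # q is fixed here for reproducibility; the algorithm picks it at random.
    m, n = len(P), len(T)
    if m > n:
        return []
    top = pow(2, m - 1, q)                 # 2^(m-1) mod q
    hp = ht = 0
    for i in range(m):
        hp = (2 * hp + P[i]) % q
        ht = (2 * ht + T[i]) % q
    hits = []
    for r in range(n - m + 1):
        # verify on fingerprint hit, to rule out false matches
        if hp == ht and T[r:r + m] == P:
            hits.append(r)
        if r + m < n:
            ht = (2 * (ht - top * T[r]) + T[r + m]) % q
    return hits

T = [1, 0, 1, 1, 0, 1, 0, 1]   # T = 10110101
P = [0, 1, 0, 1]               # P = 0101
assert karp_rabin(T, P) == [4]  # 0-based; position 5 in the slide
```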
Problem 1: Solution
Dictionary: {a, bzip, not, or, space}; S = “bzip or not bzip”
Encode P = bzip with the dictionary codebook (P = 1a0b) and scan C(S), comparing codeword against codeword; fingerprints can speed up the comparisons.
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m-by-n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j, i.e., M(i,j) = 1 iff P[1…i] = T[j−i+1...j]
Example: T = california and P = for
M(1,5) = 1 (f), M(2,6) = 1 (fo), M(3,7) = 1 (for); all other entries are 0.
How does M solve the exact match problem? P occurs ending at position j iff M(m,j) = 1.
How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
Machines can perform bit and arithmetic operations between two words in constant time.
Examples:
And(A,B) is the bit-wise and between A and B.
BitShift(A) is the value derived by shifting A’s bits down by one and setting the first bit to 1:
BitShift((0,1,1,0,1)^t) = (1,0,1,1,0)^t
Let w be the word size (e.g., 32 or 64 bits). We’ll assume m ≤ w. NOTICE: any column of M then fits in a memory word.
How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 at the positions in P where character x appears.
Example: P = abaac
U(a) = (1,0,1,1,0)^t,  U(b) = (0,1,0,0,0)^t,  U(c) = (0,0,0,0,1)^t
How to construct M
Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by
M(j) = BitShift(M(j−1)) & U(T[j])
For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T ending at character j−1 ⇔ M(i−1,j−1) = 1, and
(2) P[i] = T[j] ⇔ the i-th bit of U(T[j]) = 1
BitShift moves bit M(i−1,j−1) into the i-th position; AND-ing with the i-th bit of U(T[j]) establishes whether both conditions hold.
An example: T = xabxabaaca, P = abaac (m = 5, n = 10)
j=1: M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)^t & U(x) = (1,0,0,0,0)^t & (0,0,0,0,0)^t = (0,0,0,0,0)^t
j=2: M(2) = (1,0,0,0,0)^t & U(a) = (1,0,0,0,0)^t & (1,0,1,1,0)^t = (1,0,0,0,0)^t
j=3: M(3) = (1,1,0,0,0)^t & U(b) = (1,1,0,0,0)^t & (0,1,0,0,0)^t = (0,1,0,0,0)^t
…
j=9: M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1)^t & (0,0,0,0,1)^t = (0,0,0,0,1)^t ⇒ bit m is set: an occurrence of P ends at position 9.
Shift-And method: Complexity
If m<=w, any column and vector U() fit in a
memory word.
Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
Very often in practice. Recall that w=64 bits in
modern architectures.
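Packing each column of M into one machine integer, the whole method is a handful of lines (bit i−1 of the integer plays M(i,j)):

```python
def shift_and(T, P):
    # Column M(j) kept as an integer; bit i-1 set <=> M(i,j) = 1.
    m = len(P)
    U = {}
    for i, c in enumerate(P):           # U[c]: bit i set iff P[i+1] = c
        U[c] = U.get(c, 0) | (1 << i)
    M, hits, goal = 0, [], 1 << (m - 1)
    for j, c in enumerate(T):
        # BitShift(M) & U(T[j]); the OR-ed 1 is the shifted-in first bit
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & goal:
            hits.append(j - m + 1)      # occurrence ending at position j
    return hits

assert shift_and("xabxabaaca", "abaac") == [4]  # 0-based start position
```

Class-of-characters patterns (next slide) need only a change in how U is filled; the scan itself is untouched.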
Some simple extensions
We want to allow the pattern to contain special symbols, like the character class [a-f].
P = [a-b]baac ⇒ U(a) = (1,0,1,1,0)^t,  U(b) = (1,1,0,0,0)^t,  U(c) = (0,0,0,0,1)^t
What about ‘?’, ‘[^…]’ (not)?
Problem 1: Another solution
Dictionary: {a, bzip, not, or, space}; S = “bzip or not bzip”
Run Shift-And with P = bzip = 1a0b directly over the codewords of C(S) (yes/no at each codeword boundary).
Speed ≈ Compression ratio
Problem 2
Dictionary: {a, bzip, not, or, space}; S = “bzip or not bzip”
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.
P = o ⇒ matching terms: not = 1g0g0a, or = 1g0a0b
Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].
Naïve solution
Use an (optimal) exact-matching algorithm, searching for each pattern of P separately.
Complexity: O(nl+m) time; not good with many patterns.
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns in P; R is a bitmap of length m with R[i] = 1 iff S[i] is the first symbol of a pattern.
Use a variant of the Shift-And method searching for S:
For any symbol c, U’(c) = U(c) AND R, so U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern.
For any step j: compute M(j), then M(j) = M(j) OR U’(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j].
Check if there are occurrences ending in j. How? Test the bits of M(j) at the last position of each pattern.
Problem 3
Dictionary: {a, bzip, not, or, space}; S = “bzip or not bzip”
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.
P = bot, k = 2
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa
aatatccacaa
atcgaa
Agrep
Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix M^l to be an m-by-n binary matrix, such that:
M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.
What is M^0? The matrix M of the exact-match Shift-And.
How does M^k solve the k-mismatch problem?
Computing M^k
We compute M^l for all l = 0, …, k. For each j compute M(j), M^1(j), …, M^k(j). For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff one of two cases holds.
Computing M^l: case 1
The first i−1 characters of P match a substring of T ending at j−1 with at most l mismatches, and the next pair of characters in P and T are equal:
BitShift(M^l(j−1)) & U(T[j])
Computing M^l: case 2
The first i−1 characters of P match a substring of T ending at j−1 with at most l−1 mismatches (position i absorbs one mismatch):
BitShift(M^(l−1)(j−1))
Computing M^l
Putting the two cases together:
M^l(j) = [BitShift(M^l(j−1)) & U(T[j])] OR BitShift(M^(l−1)(j−1))
Example: T = xabxabaaca, P = abaad
M^0 =
 j: 1 2 3 4 5 6 7 8 9 10
 1: 0 1 0 0 1 0 1 1 0 1
 2: 0 0 1 0 0 1 0 0 0 0
 3: 0 0 0 0 0 0 1 0 0 0
 4: 0 0 0 0 0 0 0 1 0 0
 5: 0 0 0 0 0 0 0 0 0 0
M^1 =
 j: 1 2 3 4 5 6 7 8 9 10
 1: 1 1 1 1 1 1 1 1 1 1
 2: 0 0 1 0 0 1 0 1 1 0
 3: 0 0 0 1 0 0 1 0 0 1
 4: 0 0 0 0 1 0 0 1 0 0
 5: 0 0 0 0 0 0 0 0 1 0
(M^1(5,9) = 1: abaad matches T[5..9] = abaac with 1 mismatch)
How much do we pay?
The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.
Problem 3: Solution
Dictionary: {a, bzip, not, or, space}; S = “bzip or not bzip”
Run the k-mismatch Shift-And over the codewords of C(S): e.g. P = bot, k = 2 matches not = 1g0g0a.
Agrep: more sophisticated operations
The Shift-And method can solve other ops
The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one
Example: d(ananas,banane) = 3
Search by regular expressions
Example: (a|b)?(abc|a)
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
g-code for integer encoding
g(x) = (Length-1) zeros, followed by x in binary,
where x > 0 and Length = floor(log2 x) + 1
e.g., 9 is represented as <000,1001>.
The g-code for x takes 2 floor(log2 x) + 1 bits
(i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of g-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
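The g-code is easy to implement and decode; a minimal Python sketch (names are mine) that also verifies the exercise above:

```python
def gamma_encode(x):
    """g-code of x > 0: (Length-1) zeros, then x in binary
    (Length = floor(log2 x) + 1 bits)."""
    assert x > 0
    b = bin(x)[2:]                     # binary representation, no leading zeros
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode a concatenation of g-codes back into the integer sequence."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":          # count leading zeros = Length - 1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out
```

Decoding the bit string of the exercise yields 8, 6, 3, 59, 7; prefix-freeness is what lets the decoder stop each codeword unambiguously.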
Analysis
Sort pi in decreasing order, and encode si via the variable-length code g(i).
Recall that: |g(i)| ≤ 2 log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 H0(s) + 1
Key fact: 1 ≥ Σi=1,...,x pi ≥ x * px, hence x ≤ 1/px
How good is it?
Encode the integers via g-coding: |g(i)| ≤ 2 log i + 1
The cost of the encoding is (recall i ≤ 1/pi):
Σi=1,...,|S| pi |g(i)| ≤ Σi=1,...,|S| pi [ 2 log(1/pi) + 1 ] = 2 H0(X) + 1
Not much worse than Huffman, and improvable to H0(X) + 2 + ....
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman
End-tagged dense code
The rank r is mapped to r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix-code
Better compression: it uses all 7-bits configurations
(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is:
Previously we used: s = c = 128
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s*c^2 with 3 bytes, ...
An example
5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on 2 bytes
A (230,26)-dense code encodes 230 + 230*26 = 6210 on 2 bytes, hence more words on 1 byte, which wins if the distribution is skewed...
Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256-s.
Brute-force approach
Binary search: on real distributions, there seems to be one unique minimum
Ks = max codeword length
Fsk = cum. prob. of the symbols whose |cw| <= k
Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
Exploits temporal locality, and it is dynamic
X = 1^n 2^n 3^n … n^n: Huff = O(n^2 log n), MTF = O(n log n) + n^2
Not much worse than Huffman
...but it may be far better
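The two steps above (output the position, move to front) can be sketched directly; a minimal Python illustration with names of my own choosing:

```python
def mtf_encode(text, alphabet):
    """Move-to-Front: for each symbol, 1) output its current position
    in the list L (0-based here), 2) move it to the front of L."""
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)
        out.append(i)
        L.pop(i); L.insert(0, s)       # move s to the front: the "memory"
    return out

def mtf_decode(codes, alphabet):
    """Inverse transform: the decoder maintains the same list L."""
    L = list(alphabet)
    out = []
    for i in codes:
        s = L[i]
        out.append(s)
        L.pop(i); L.insert(0, s)
    return ''.join(out)
```

Note how temporal locality shows up: on "aaaabbbb" over the alphabet [a,b] the output is 0,0,0,0,1,0,0,0 — runs of repeated symbols become runs of zeros, which the variable-length coder (or RLE) then compresses well.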
MTF: how good is it?
Encode the integers via g-coding: |g(i)| ≤ 2 log i + 1
For each symbol x, let p_1^x < p_2^x < ... be the positions of its nx occurrences; putting S in front, the cost of encoding is at most:
O(|S| log |S|) + Σx Σi |g( p_i^x - p_{i-1}^x )|
By Jensen's inequality this is:
≤ O(|S| log |S|) + Σx nx [ 2 log(N/nx) + 1 ]
= O(|S| log |S|) + N [ 2 H0(X) + 1 ]
Hence La[mtf] ≤ 2 H0(X) + O(1)
MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:
Search tree
  Leaves contain the symbols, ordered as in the MTF-list
  Nodes contain the size of their descending subtree
Hash Table
  key is a symbol
  data is a pointer to the corresponding tree leaf
Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one starting bit suffice
Properties:
Exploits spatial locality, and it is a dynamic code
There is a memory
X = 1^n 2^n 3^n … n^n: Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)
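A minimal RLE sketch in Python (names are mine) reproducing the example above:

```python
def rle_encode(s):
    """Run-Length Encoding: collapse each maximal run into (symbol, length)."""
    out = []
    for c in s:
        if out and out[-1][0] == c:
            out[-1] = (c, out[-1][1] + 1)    # extend the current run
        else:
            out.append((c, 1))               # start a new run
    return out

def rle_decode(pairs):
    return ''.join(c * n for c, n in pairs)
```

On X = 1^n 2^n … n^n the encoder emits only n pairs, which is where the Θ(n log n) bound comes from.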
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive), where f(i) = Σj=1,...,i-1 p(j).
e.g. with p(a) = .2, p(b) = .5, p(c) = .3:
  a = [0.0, 0.2)   f(a) = .0
  b = [0.2, 0.7)   f(b) = .2
  c = [0.7, 1.0)   f(c) = .7
The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0,1); after b the interval is [.2,.7); after a it shrinks to [.2,.3); after c to [.27,.3).
The final sequence interval is [.27,.3)
Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:
  l0 = 0,  s0 = 1
  li = li-1 + si-1 * f[ci]
  si = si-1 * p[ci]
f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is sn = Πi=1,...,n p[ci]
The interval for a message sequence will be called the sequence interval
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
.49 falls in [.2,.7) → b; within that interval, .49 falls in [.3,.55) → b; within that, .49 falls in [.475,.55) → c.
The message is bbc.
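The interval recurrences li, si translate directly into code. This is a real-valued sketch for illustration only (the slides' integer version below avoids the precision issues); function names are mine:

```python
def ac_encode(msg, p, f):
    """Shrink the interval symbol by symbol:
    l_i = l_{i-1} + s_{i-1} * f[c],  s_i = s_{i-1} * p[c]."""
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, s                       # the sequence interval [l, l+s)

def ac_decode(x, n, p, f):
    """Decode n symbols: at each step pick the symbol interval containing x,
    then zoom into it."""
    out, l, s = [], 0.0, 1.0
    for _ in range(n):
        for c in p:
            lo = l + s * f[c]
            hi = lo + s * p[c]
            if lo <= x < hi:
                out.append(c)
                l, s = lo, s * p[c]
                break
    return ''.join(out)
```

With p(a)=.2, p(b)=.5, p(c)=.3 this maps "bac" to the interval [.27,.3) and decodes .49 (length 3) back to "bbc", matching the two examples above.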
Representing a real number
Binary fractional representation:
  .75 = .11     1/3 = .0101...     11/16 = .1011
Algorithm:
  1. x = 2 * x
  2. If x < 1 output 0
  3. else x = x - 1; output 1
So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) = .01    [.33,.66) = .1    [.66,1) = .11
Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.
  number  min     max     interval
  .11     .110    .111    [.75, 1.0)
  .101    .1010   .1011   [.625, .75)
We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).
e.g. the sequence interval [.61, .79) contains the code interval of .101, namely [.625, .75).
Can use L + s/2 truncated to 1 + log (1/s) bits
Bound on Arithmetic length
Note that -log s + 1 = log (2/s)
Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + log (1/s) = 1 + log Π (1/pi)
             ≤ 2 + Σi=1,n log (1/pi)
             = 2 + Σk=1,|S| n pk log (1/pk)
             = 2 + n H0 bits
nH0 + 0.02 n bits in practice, because of rounding
Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:
Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s; m = 0
  Message interval is expanded by 2
If u < R/2 then (bottom half)
  Output 0 followed by m 1s; m = 0
  Message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2
All other cases: just continue...
You find this at: the Arithmetic ToolBox
Arithmetic ToolBox: as a state machine
[Figure: the ATB as a state machine: given the current interval (L, s) — with endpoints L and L+s — and the next symbol c drawn from the distribution (p1,...,p|S|), ATB outputs the new interval (L', s').]
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
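The count-gathering part of PPM can be sketched in a few lines. This is only an illustration of the statistics a PPM model keeps (one common escape heuristic: escape count = number of distinct followers); names are mine, not the slides':

```python
from collections import defaultdict

def ppm_counts(text, k):
    """For every context of length 0..k, count the symbols that follow it,
    plus an escape '$' = number of distinct following symbols."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, c in enumerate(text):
        for j in range(k + 1):                 # contexts of length 0, 1, ..., k
            if i - j < 0:
                break
            counts[text[i - j:i]][c] += 1      # context = the j chars before c
    for ctx in counts:                         # add the escape pseudo-count
        counts[ctx]['$'] = len(counts[ctx])
    return counts
```

Running it on the already-seen part of the example below, "ACCBACCACBA" with k = 2, reproduces the count tables of the next slide (e.g. context AC has seen B once and C twice).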
PPM + Arithmetic ToolBox
[Figure: PPM feeds the ATB state machine with p[s|context], where s = c or esc; given the interval (L, s) it outputs (L', s').]
Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant)
PPM: Example Contexts
String = ACCBACCACBA B,  k = 2

Empty context:  A = 4,  B = 2,  C = 5,  $ = 3

Context  Counts
A        C = 3, $ = 1
B        A = 2, $ = 1
C        A = 1, B = 2, C = 2, $ = 3

Context  Counts
AC       B = 1, C = 2, $ = 2
BA       C = 1, $ = 1
CA       C = 1, $ = 1
CB       A = 2, $ = 1
CC       A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
Dictionary: all substrings starting before the Cursor
Output example: <2,3,c>
Algorithm's step:
Output <d, len, c> where
  d = distance of copied string wrt current position
  len = length of longest match
  c = next char in text beyond longest match
Advance by len + 1
A buffer "window" of fixed length bounds the dictionary and moves with the cursor
Example: LZ77 with window
Window size = 6. Each triple is (d, len, c): the longest match within W, plus the next character.
a a c a a c a b c a b a a a c  →  (0,0,a), (1,1,c), (3,4,b), (3,3,a), (1,2,c)
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if l > d? (overlap with text to be compressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
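The copy loop above extends to a full triple decoder; a minimal Python sketch (names mine), where copying one char at a time is exactly what makes the overlapping case len > d work:

```python
def lz77_decode(triples):
    """Decode LZ77 triples (d, len, c); d = 0 means no copy.
    Overlaps (len > d) work because each char is copied after being written."""
    out = []
    for d, length, c in triples:
        cursor = len(out)
        for i in range(length):
            out.append(out[cursor - d + i])   # may read chars of this very copy
        out.append(c)
    return ''.join(out)
```

It reproduces both the windowed example and the overlap example of the slides.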
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
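The coding loop (longest dictionary match, output its id plus next char, insert the extension) fits in a few lines; a minimal Python sketch with a dict standing in for the trie (names are mine):

```python
def lz78_encode(text):
    """LZ78: output (id(S), next char) for the longest dictionary match S,
    then add S + next char to the dictionary (id 0 = empty string)."""
    dictionary = {'': 0}
    out, S = [], ''
    for c in text:
        if S + c in dictionary:
            S += c                                 # keep extending the match
        else:
            out.append((dictionary[S], c))
            dictionary[S + c] = len(dictionary)    # fresh id for the new phrase
            S = ''
    if S:                                          # flush a pending match
        out.append((dictionary[S], ''))
    return out
```

On the example string it emits exactly the six pairs of the coding example above.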
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
LZW: Encoding Example  (assume a = 112, b = 113, c = 114)
Input: a a b a a c a b a b a c b
Output  Dict
112     256=aa
112     257=ab
113     258=ba
256     259=aac
114     260=ca
257     261=aba
261     262=abac
114     263=cb
LZW: Decoding Example
Input  Output (so far)          Dict
112    a
112    a a                      256=aa
113    a a b                    257=ab
256    a a b a a                258=ba
114    a a b a a c              259=aac
257    a a b a a c a b          260=ca
261    a a b a a c a b ?        261 is not yet in the dict (the decoder is one
                                step behind): SSc case, so 261 = ab + a = aba
114    a a b a a c a b a b a c  261=aba (added one step later)
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform (1994)
Let us be given a text T = mississippi#
Take all the rotations of T:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows; the first column is F, the last column is L:
F             L
#mississipp   i
i#mississip   p
ippi#missis   s
issippi#mis   s
ississippi#   m
mississippi   #
pi#mississi   p
ppi#mississ   i
sippi#missi   s
sissippi#mi   s
ssippi#miss   i
ssissippi#m   i
L = ipssm#pissii is the BWT of T
A famous example: on much longer texts the clustering effect in L is even stronger.
A useful tool: L → F mapping
[Figure: the sorted-rotation matrix again, with F = # i i i i m p p s s s s and L = i p s s m # p i s s i i; the middle of the matrix (i.e. T) is unknown to the decoder.]
How do we map L's chars onto F's chars?
... Need to distinguish equal chars in F...
Take two equal chars of L and rotate their rows rightward by one position: the rows stay sorted, so equal chars keep the same relative order in L and in F !!
The BWT is invertible
[Figure: the sorted-rotation matrix, with the F and L columns highlighted; T is unknown.]
Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i p p i #
InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
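The two key properties give a complete forward/inverse pair. A minimal Python sketch (naive O(n^2 log n) forward step, just for illustration; names are mine, and '#' is assumed to be a unique sentinel smaller than every other char):

```python
def bwt(T):
    """Forward BWT: last column of the sorted rotations of T."""
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return ''.join(row[-1] for row in rotations)

def inverse_bwt(L):
    """Invert via the LF-mapping: equal chars keep the same relative
    order in L and in the sorted first column F."""
    n = len(L)
    order = sorted(range(n), key=lambda r: (L[r], r))   # stable rank in F
    LF = [0] * n
    for f_row, r in enumerate(order):
        LF[r] = f_row                 # char L[r] sits at row LF[r] of F
    # row 0 is the rotation starting with '#', so L[0] is the char before '#'
    T = ['#'] * n
    r = 0
    for i in range(n - 2, -1, -1):    # reconstruct T backward
        T[i] = L[r]
        r = LF[r]
    return ''.join(T)
```

It maps mississippi# to ipssm#pissii and back, matching the matrix above.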
How to compute the BWT ?
We said that: L[i] precedes F[i] in T. Given SA and T, we have L[i] = T[SA[i]-1]
SA    BWT matrix     L
12    #mississipp    i
11    i#mississip    p
8     ippi#missis    s
5     issippi#mis    s
2     ississippi#    m
1     mississippi    #
10    pi#mississi    p
9     ppi#mississ    i
7     sippi#missi    s
4     sissippi#mi    s
6     ssippi#miss    i
3     ssissippi#m    i
e.g. L[3] = T[SA[3]-1] = T[7]
How to construct SA from T ?
Input: T = mississippi#
SA    suffix
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi#
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#
Elegant but inefficient. Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii   (# at position 16)
Mtf-list = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Values shifted by one (to make room for the run-length symbols):
Mtf = 030040000040040300400400000200000
RLE0 = 03141041403141410210   (0-runs coded à la Wheeler, e.g. Bin(6) = 110)
Bzip2-output = Arithmetic/Huffman on an alphabet of |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages reported available (Google, 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node you can reach any other node via an undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node you can reach any other node via a directed path.
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
Largest artifact ever conceived by humans
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph: V = routers, E = communication links
The "cosine" graph (undirected, weighted): V = static web pages, E = semantic distance between pages
Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q that has been clicked by some user who issued q
Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email,..)
Definition
Directed graph G = (V,E):
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1
The In-degree distribution
Indegree follows a power law distribution:
Pr[ in-degree(u) = k ]  ∝  1/k^a,   a ≈ 2.1
Observed both on the Altavista crawl (1999) and on the WebBase crawl (2001).
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has an hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
[Figure: dot-plot of the adjacency matrix (i,j) of a crawl with 21 million pages and 150 million links, under URL-sorting; dense blocks correspond to hosts such as Berkeley and Stanford.]
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
For negative entries:
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with copy blocks: exploit consecutivity in the extra-nodes
Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals, or wrt the source
Examples:
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
data
knowledge
about data
at receiver
sender
receiver
network links are getting faster and faster but
many clients are still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
What if the sender has never seen the data at the receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested
page and the ones available in cache
File synchronization
[diff, zdelta, REBL,…]
[rsynch, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
The LZ77-scheme provides an efficient, optimal solution:
fknown is the "previously encoded text"; compress the concatenation fknown·fnew, emitting codewords from fnew onward
zdelta is one of the best implementations
Emacs:     size     time
uncompr    27Mb     ---
gzip       8Mb      35 secs
zdelta     1.5Mb    42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
Client
reference
request
Slow-link
Delta-encoding
Proxy
reference
request
Fast-link
web
Page
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f ∈ F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
[Figure: weighted graph GF plus a dummy node 0; edge weights are zdelta sizes (e.g. 620, 2000, 220, ...), dummy edges carry the gzip-coding sizes; the min branching picks the cheapest reference for each file.]
           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic
Improvement: what about many-to-one compression of a group of files?
Problem: constructing G is very costly: n^2 edge calculations (zdelta executions)
We wish to exploit some pruning approach:
Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression. Build a sparse weighted graph G'F containing only edges between those pairs of files
Assign weights: estimate appropriate edge weights for G'F, thus saving zdelta executions. Nonetheless, still strictly n^2 time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
request
f_new
update
Server
f_old
Client
client wants to update an out-dated file
server has the new file but does not know the old one
update without sending the entire f_new (exploit similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
hashes
f_new
Server
encoded file
f_old
Client
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5; gzip for literals
choice of the block size is problematic (default: max{700, √n} bytes)
not good in theory: the granularity of changes may disrupt the use of blocks
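The point of the weak rolling hash is that sliding the block window by one byte costs O(1) instead of rehashing the whole block. A minimal Adler-style sketch (this is an illustration of the idea, not rsync's exact hash; names are mine):

```python
M = 1 << 16

def weak_hash(block):
    """Adler-style weak hash: a = sum of bytes, b = sum of prefix sums
    (both mod 2^16), packed into one 32-bit value."""
    a = b = 0
    for byte in block:
        a = (a + byte) % M
        b = (b + a) % M
    return (b << 16) | a

def roll(h, out_byte, in_byte, blocksize):
    """Slide the window one byte to the right in O(1)."""
    a = h & 0xFFFF
    b = h >> 16
    a = (a - out_byte + in_byte) % M
    b = (b - blocksize * out_byte + a) % M
    return (b << 16) | a
```

The receiver can thus test every alignment of the file against the block hashes at a cost linear in the file size; the strong hash (MD5 in rsync) is only checked on weak-hash matches.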
Rsync: some experiments
gcc size
total
27288
gzip
7563
zdelta
227
rsync
964
emacs size
27326
8577
1431
4452
Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
Server sends hashes (unlike the client in rsync); the client checks them
Server deploys the common fref to compress the new ftar (rsync compresses just the literals).
A multi-round protocol:
k blocks of n/k elems, log(n/k) levels
If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (ie. T[i,N])
Occurrences of P in T = all suffixes of T having P as a prefix
Example: P = si, T = mississippi → occurrences at 4, 7
SUF(T) = sorted set of suffixes of T
Reduction: from substring search to prefix search
The Suffix Tree
T# = mississippi#
[Figure: the suffix tree of T#, with edge labels such as #, i, s, p, si, ssi, ppi#, pi#, i#, mississippi#; its 12 leaves store the starting positions 1..12 of the suffixes.]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.
T = mississippi#
SA    SUF(T)
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi#
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#
Storing SUF(T) explicitly would take Θ(N^2) space; SA stores just the suffix pointers.
• SA: Θ(N log2 N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
e.g. P = si on T = mississippi#: compare P with the suffix in the middle of SA; if P is larger go right, if P is smaller go left.
Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
overall, O(p log2 N) time
Improvable to O(p + log2 N) [Manber-Myers, '90], or O(p + log2 |S|) [Cole et al, '06]
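The binary search for the contiguous SA range can be sketched as follows. The SA construction here is the naive "elegant but inefficient" one from the previous slide, kept only for illustration; names are mine, and positions are 0-based:

```python
def suffix_array(T):
    """Naive construction: sort suffix start positions by the suffixes.
    O(N^2 log N) worst case -- illustration only."""
    return sorted(range(len(T)), key=lambda i: T[i:])

def sa_search(T, SA, P):
    """All text positions where P occurs: binary search for the
    contiguous SA range of suffixes prefixed by P."""
    def lower(target):
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            if T[SA[mid]:SA[mid] + len(P)] < target:   # O(p) per comparison
                lo = mid + 1
            else:
                hi = mid
        return lo
    lo = lower(P)
    hi = lower(P + chr(0x10FFFF))   # just past every P-prefixed suffix
    return sorted(SA[lo:hi])
```

On T = mississippi# with P = si it returns the 0-based positions 3 and 6, i.e. the occurrences 4 and 7 of the slides' 1-based counting.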
Locating the occurrences
T = mississippi#, P = si → occ = 2, at positions 4 and 7 (suffixes sippi#, sissippi#)
Suffix Array search
• O(p + log2 N + occ) time
Suffix Trays: O(p + log2 |S| + occ)  [Cole et al., '06]
String B-tree  [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays  [Ciriani et al., '02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
T = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp = 0 0 1 4 0 0 1 0 2 1 3
(e.g. the Lcp between issippi# and ississippi# is 4)
• How long is the common prefix between T[i,...] and T[j,...] ?
  Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
Algoritmi per IR
Prologo
Paradigm shift...
Web 2.0 is about the many
Big DATA ⇒ Big PC ?
We have three types of algorithms:
T1(n) = n, T2(n) = n2, T3(n) = 2n
... and assume that 1 step = 1 time unit
How many input data n each algorithm may process
within t time units?
n1 = t,
n2 = √t,
n3 = log2 t
What about a k-times faster processor?
...or, what is n, when the time units are k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
A new scenario
Data are more available than ever before
n → ∞ ... is more than a theoretical assumption
The RAM model is too simple
Step cost is Ω(1) time
Not just MIN #steps…
[Memory hierarchy:]
CPU registers + L1/L2 Cache: few Mbs, some nanosecs, few words fetched
RAM: few Gbs, tens of nanosecs, some words fetched
HD: few Tbs, few millisecs, B = 32K page
net: many Tbs, even secs, packets
You should be “??-aware programmers”
I/O-conscious Algorithms
[Figure: a hard disk — track, read/write head, read/write arm, magnetic surface]
"The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one's desk or by taking an airplane to the other side of the world and using a sharpener on someone else's desk." (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3-0.4 (Hennessy-Patterson)]
C = cost of an I/O [10^5-10^6 steps (Hennessy-Patterson)]
If N = (1+f)M, then the disk-average cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)
Space-conscious Algorithms
I/Os
search
access
Compressed
data structures
Streaming Algorithms
Data arrive continuously, or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
[Memory hierarchy again: CPU registers, L1, L2, RAM, HD, net — sizes from few Mbs to many Tbs, latencies from nanosecs to secs.]
Unknown and/or changing devices
Block access is important on all levels of the memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock, and its D-performance over time, find the time window in which it achieved the best "market performance".
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Running times of the naive algorithms:
n     4K    8K   16K   32K    128K   256K   512K   1M
n^3   22s   3m   26m   3.5h   28h    --     --     --
n^2   0     0    0     1s     26s    106s   7m     28m
An optimal solution
We assume every subsum ≠ 0.
[Figure: A split as a prefix with sum < 0, followed by the Optimum window, within which every proper prefix-sum is > 0.]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
  sum = 0; max = -1;
  For i = 1,...,n do
    If (sum + A[i] ≤ 0) sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }
Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
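The scan above can be written directly; a minimal Python version (the -1 initialization follows the slide, so an all-negative array returns -1):

```python
def max_subarray_sum(A):
    """One-pass scan: reset the running sum when it would drop to <= 0
    (the optimal window cannot start before that point),
    otherwise extend the window and track the best sum seen."""
    best, running = -1, 0
    for a in A:
        if running + a <= 0:
            running = 0
        else:
            running += a
            best = max(best, running)
    return best
```

On the slide's array the answer is 12, achieved by the window 6 1 -2 4 3; the whole scan is a single pass, so it is both O(n) time and streaming-friendly.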
Toy problem #2: sorting
How to sort tuples (objects) on disk?
A is an "array of pointers to objects"; the tuples sit elsewhere in memory.
Key observation: each object-to-object comparison A[i] vs A[j] costs 2 random accesses to the memory locations pointed to by A[i] and A[j]
MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting?
Using a well-tuned B-tree library (Berkeley DB): after n insertions the data get distributed arbitrarily across the leaves ("tuple pointers"), so listing the tuples in order costs possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02     m = (i+j)/2;          // Divide
03     Merge-Sort(A,i,m);    // Conquer
04     Merge-Sort(A,m+1,j);
05     Merge(A,i,m,j)        // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10^9 tuples ⇒ few Gbs
Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log2 n) random I/Os
[5ms] * n log2 n ≈ 1.5 years
In practice, it is faster because of caching (each merge level makes 2 passes, R/W, over the data)...
Merge-Sort Recursion Tree
log2 N levels
If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help
[Figure: merge-sort recursion tree; chunks of M items are sorted in internal memory, then sorted runs are merged pairwise level by level]
How do we deploy the disk/mem features?
N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
Pass 1: Produce (N/M) sorted runs.
Pass i: merge X = M/B runs ⇒ log_{M/B} (N/M) passes
[Figure: X input buffers and one output buffer, each of B items, in main memory; runs stream in from disk and the merged run streams back to disk]
Multiway Merging
[Figure: X = M/B runs, each with its current page buffered in Bf1..Bfx and a pointer p1..pX; an output buffer Bfo with pointer po holds the merged run]
At each step: output min(Bf1[p1], Bf2[p2], …, Bfx[pX]); fetch the next page of run i when pi = B; flush Bfo to the merged run on disk when it is full; stop at EOF of all runs.
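In-memory, the min over the X buffer heads is usually taken with a priority queue. A small Python sketch (a min-heap of run heads stands in for the linear scan of Bf1..Bfx; the I/O behaviour is unchanged):

```python
import heapq

def multiway_merge(runs):
    # Merge X sorted runs into one sorted output.
    # Heap entries are (value, run index, position within run).
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, pos = heapq.heappop(heap)
        out.append(val)
        if pos + 1 < len(runs[i]):
            # Advance the pointer of run i (fetch its next item).
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return out
```

With X runs the heap makes each step cost O(log X) instead of O(X).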
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≤ log_{M/B} (N/M)
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os
In practice
M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
One multiway merge ⇒ 2 passes = few mins
Tuning depends on disk features
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
Could compression help?
Goal: enlarge M and reduce N
#passes = O(log_{M/B} (N/M))
Cost of a pass = O(N/B)
Part of Vitter’s paper…
It addresses issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how far we can go
Toy problem #3: Top-freq elements
Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2, using the smallest space (i.e., the mode, provided it occurs > N/2 times).
A = b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables (a candidate X and a counter C)
For each item s of the stream:
    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }
Return X;
Proof
If X ≠ y, then every one of y’s occurrences has a “negative” mate. Hence these mates should be ≥ #occ(y). As a result, 2 * #occ(y) > N... contradiction.
Problems if the mode occurs ≤ N/2 times: the returned X is then unreliable.
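This is the Boyer-Moore majority-vote scheme; a minimal Python sketch (the answer is guaranteed only when some item really occurs > N/2 times, otherwise a second verification pass is needed):

```python
def majority_candidate(stream):
    # One pass, O(1) space: keep a candidate X and a counter C.
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1        # adopt a new candidate
        elif X == s:
            C += 1             # another occurrence of the candidate
        else:
            C -= 1             # pair off one occurrence with a "mate"
    return X
```

On the slide's stream the majority item is c (9 occurrences out of 17).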
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 * 10^9 characters ⇒ size = 6Gb
n = 10^6 documents
TotT = 10^9 total term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms
What kind of data structure should we build to support word-based searches?
Solution 1: Term-Doc matrix
(n = 1 million documents, t = 500K terms; entry is 1 if the play contains the word, 0 otherwise)

           Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            1                1             0          0       0        1
Brutus            1                1             0          1       0        0
Caesar            1                1             0          1       1        1
Calpurnia         0                1             0          0       0        0
Cleopatra         1                0             0          0       0        0
mercy             1                0             1          1       1        1
worser            1                0             1          1       1        0

Space is 500Gb !
Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Caesar    → 1 2 3 5 8 13 21 34
Calpurnia → 13 16
We can still do better: 30–50% of the original text
1. Typically about 12 bytes are used per posting
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but it is still >10 times the (compressed) text !!!!
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits?
NO: they are 2^n, but the shorter bit strings are fewer:
∑_{i=1}^{n-1} 2^i = 2^n − 2
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self information of s is:
i(s) = log2 (1/p(s)) = −log2 p(s)
Lower probability ⇒ higher information
Entropy is the weighted average of i(s):
H(S) = ∑_{s∈S} p(s) log2 (1/p(s)) bits
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ? It parses both as a·d and as c·a.
A uniquely decodable code can always be uniquely decomposed into its codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: binary trie with edges labeled 0/1; leaves a (codeword 0), b (100), c (101), d (11)]
Average Length
For a code C with codeword length L[s], the average length is defined as
La(C) = ∑_{s∈S} p(s) L[s]
We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…
Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then pi < pj ⇒ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have
H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that
La(C) ≤ H(S) + 1
(Shannon code: the codeword of s takes ⌈log2 1/p(s)⌉ bits)
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
[Figure: Huffman tree; a(.1) and b(.2) merge into (.3), which merges with c(.2) into (.5), which merges with d(.5) into the root (1)]
a=000, b=001, c=01, d=1
There are 2^{n-1} “equivalent” Huffman trees (flip the 0/1 labels of each internal node)
What about ties (and thus, tree depth) ?
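The greedy merge of the two least-probable subtrees can be sketched with a min-heap in Python (a sketch, not the canonical construction; ties are broken arbitrarily, so only the codeword lengths, not the exact bits, are stable):

```python
import heapq

def huffman_codes(probs):
    # probs: symbol -> probability. Returns symbol -> codeword (a bit string).
    # Heap entries carry a tie-breaking counter so dicts are never compared.
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    cnt = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # two least-probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, cnt, merged))
        cnt += 1
    return heap[0][2]
```

On the running example the codeword lengths come out as on the slide: |a| = |b| = 3, |c| = 2, |d| = 1.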
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: Start at root and take the branch for each bit received. When at a leaf, output its symbol and return to the root.
abc... → 000 001 01 = 00000101
101001... → dcb
[Figure: the Huffman tree above, with a=000, b=001, c=01, d=1]
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store for any level L:
firstcode[L] (the first codeword of that level; on the deepest level it is 00.....0)
Symbol[L,i], for each i in level L
This takes ≤ h^2 + |S| log |S| bits (h = tree height)
Canonical Huffman: Encoding
[Figure: canonical codeword tree with levels 1–5]
Canonical Huffman: Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0
T=...00010...
Problem with Huffman Coding
Consider a symbol with probability .999. Its self information is
−log2(.999) ≈ .00144
If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
1 extra bit per macro-symbol = 1/k extra bits per symbol
But a larger model has to be transmitted
Shannon took infinite sequences, i.e. k → ∞ !!
In practice, we have:
Model takes |S|^k (k * log |S|) + h^2 bits (where h might be |S|)
It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the Huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
[Figure: 128-ary Huffman tree over the words and separators of T = “bzip or not bzip”; each codeword byte carries 7 Huffman bits plus one tag bit, set to 1 on the first byte of a codeword; C(T) is the byte-aligned coding of T]
CGrep and other ideas...
P = bzip = 1a 0b
T = “bzip or not bzip”
[Figure: GREP runs directly over the compressed text C(T): the pattern is Huffman-coded once, and its byte-aligned codeword is searched in C(T); the tag bits rule out false matches at non-codeword boundaries, giving a yes/no answer at each alignment]
Speed ≈ Compression ratio
You find this at
You find it under my Software projects
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary: {bzip, not, or, space}
P = bzip = 1a 0b
S = “bzip or not bzip”
[Figure: the tagged codeword of P is searched directly in C(S); each byte-aligned candidate position yields a yes/no answer]
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: pattern P slid along text T, checking one alignment at a time]
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods:
The Random Fingerprint method due to Karp and Rabin
The Shift-And method due to Baeza-Yates and Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in order to obtain:
An efficient randomized algorithm that makes an error with small probability.
A randomized algorithm that never errs, whose running time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}^n).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:
H(s) = ∑_{i=1}^{m} 2^{m−i} s[i]
P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s = s’ if and only if H(s) = H(s’)
Definition:
let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m−1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1):
H(Tr) = 2 H(Tr-1) − 2^m T(r−1) + T(r+m−1)
T = 10110101
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 = 6
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P = 101111, q = 7
H(P) = 47, Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally (Horner’s rule, all mod 7):
1·2 + 0 = 2
2·2 + 1 = 5
5·2 + 1 = 11 ≡ 4
4·2 + 1 = 9 ≡ 2
2·2 + 1 = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr-1), since
2^m (mod q) = 2 (2^{m-1} (mod q)) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
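A minimal Python sketch of the deterministic variant (candidate fingerprint matches are verified by direct comparison, so the output is always correct; a fixed prime stands in for the randomly chosen q):

```python
def karp_rabin(T, P, q=2**31 - 1):
    # T, P: strings over {0,1}. Returns all starting positions of P in T.
    n, m = len(T), len(P)
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):                     # fingerprints of P and T[0:m]
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    top = pow(2, m - 1, q)                 # weight of the leaving symbol
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:   # verify to avoid false matches
            occ.append(r)
        if r + m < n:                      # slide the window by one
            ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
    return occ
```

On the slide's example T = 10110101, P = 0101, the only occurrence starts at position 5 (index 4 when counting from 0).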
Problem 1: Solution
Dictionary: {bzip, not, or, space}
P = bzip = 1a 0b
S = “bzip or not bzip”
[Figure: the tagged-Huffman codeword of P is searched directly in C(S); the tag bits make the check at each byte boundary a yes/no test]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j,
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1 … j]
Example: T = california and P = for
[Figure: the 3×10 matrix M, one row per prefix of P and one column per position of T; M(3,7) = 1 because P = for occurs in T ending at position 7]
How does M solve the exact match problem?
How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one
Machines can perform bit and arithmetic operations between two words in constant time.
Examples:
And(A,B) is the bit-wise and between A and B.
BitShift(A) is the value derived by shifting A’s bits down by one and setting the first bit to 1.
BitShift((0,1,1,0)) = (1,0,1,1)
Let w be the word size (e.g., 32 or 64 bits). We’ll assume m = w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one
We define the m-length binary vector U(x) for each character x of the alphabet. U(x) is set to 1 in the positions of P where character x appears.
Example: P = abaac
U(a) = (1,0,1,1,0)   U(b) = (0,1,0,0,0)   U(c) = (0,0,0,0,1)
How to construct M
Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained by
M(j) = BitShift(M(j-1)) & U(T[j])
For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T ending at character j-1 ⇔ M(i-1,j-1) = 1
(2) P[i] = T[j] ⇔ the i-th bit of U(T[j]) = 1
BitShift moves bit M(i-1,j-1) into the i-th position;
AND this with the i-th bit of U(T[j]) to establish if both conditions hold
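The whole method fits in a few lines of Python, with each column packed into an integer (bit i-1 of the word plays the role of row i; Python integers stand in for the machine word, so the m ≤ w assumption is not enforced here):

```python
def shift_and(T, P):
    # Returns the starting positions (0-based) of P's occurrences in T.
    m = len(P)
    U = {}
    for i, c in enumerate(P):          # U(c): 1-bits at P's positions of c
        U[c] = U.get(c, 0) | (1 << i)
    M = 0
    occ = []
    for j, c in enumerate(T):
        # (M << 1) | 1 is BitShift; AND with U(T[j]) keeps true prefixes.
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & (1 << (m - 1)):         # bit m set: occurrence ending at j
            occ.append(j - m + 1)
    return occ
```

On the slide's example T = california, P = for, the only occurrence starts at index 4.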
An example (T = xabxabaaca, P = abaac; columns written as (row1,…,row5))
j=1: M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)
j=2: M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
j=3: M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)
…
j=9: M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)
The last bit of column 9 is 1: P occurs in T ending at position 9.
Shift-And method: Complexity
If m ≤ w, any column and any vector U() fit in a memory word: each step requires O(1) time.
If m > w, any column and vector U() can be divided into m/w memory words: each step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when the pattern length is close to the word size: very often in practice, since w = 64 bits in modern architectures.
Some simple extensions
We want to allow the pattern to contain special symbols, like the class of chars [a-f]
Example: P = [a-b]baac
U(a) = (1,0,1,1,0)   U(b) = (1,1,0,0,0)   U(c) = (0,0,0,0,1)
What about ‘?’, ‘[^…]’ (not)?
Problem 1: Another solution
Dictionary: {bzip, not, or, space}
P = bzip = 1a 0b
S = “bzip or not bzip”
[Figure: P is first searched in the dictionary, then its codeword is scanned for in C(S)]
Speed ≈ Compression ratio
Problem 2
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring
Dictionary: {bzip, not, or, space}
P = o
S = “bzip or not bzip”
not = 1g 0g 0a
or  = 1g 0a 0b
[Figure: both “or” and “not” contain P, so both codewords must be searched in C(S)]
Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: text T scanned simultaneously for the patterns P1 and P2]
Naïve solution
Use an (optimal) exact-matching algorithm to search for each pattern of P
Complexity: O(nl+m) time, not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns of P
R is a bitmap of length m:
R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method, searching for S:
For any symbol c, U’(c) = U(c) AND R, i.e. U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
For any step j, compute M(j), then M(j) OR U’(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j]
Check if there are occurrences ending in j. How?
Problem 3
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches
Dictionary: {bzip, not, or, space}
P = bot, k = 2
S = “bzip or not bzip”
[Figure: the tagged Huffman codewords of the dictionary terms, to be searched in C(S)]
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact occurrences of a pattern in a text.
Example:
T = aatatccacaa, P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.
aatatccacaa        aatatccacaa
   atcgaa            atcgaa
Agrep
Our current goal: given k, find all the occurrences of P in T with up to k mismatches
We define the matrix M^l to be an m by n binary matrix, such that:
M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.
What is M^0?
How does M^k solve the k-mismatch problem?
Computing M^k
We compute M^l for all l = 0, …, k.
For each j compute M^0(j), M^1(j), …, M^k(j)
For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff either:
Case 1: the first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:
BitShift(M^l(j-1)) & U(T[j])
Case 2: the first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (P[i] ≠ T[j] is then allowed as the l-th mismatch):
BitShift(M^{l-1}(j-1))
Hence:
M^l(j) = [BitShift(M^l(j-1)) & U(T[j])] OR BitShift(M^{l-1}(j-1))
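The recurrence above translates directly into bit-parallel Python (a sketch with Python integers as machine words; M[l] holds the current column of M^l, and BitShift is (x << 1) | 1):

```python
def shift_and_k(T, P, k):
    # Returns the 0-based starting positions of occurrences of P in T
    # with at most k mismatches.
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)
    last = 1 << (m - 1)
    occ = []
    for j, c in enumerate(T):
        prev = M[:]                                  # columns at j-1
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)    # exact-match row
        for l in range(1, k + 1):
            # Case 1: extend an l-mismatch prefix with an equal character;
            # Case 2: extend an (l-1)-mismatch prefix with a mismatch.
            M[l] = ((((prev[l] << 1) | 1) & U.get(c, 0))
                    | ((prev[l - 1] << 1) | 1))
        if M[k] & last:
            occ.append(j - m + 1)
    return occ
```

With k = 0 this reduces to the exact Shift-And method.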
Example M^1 (T = xabxabaaca, P = abaad)

        j: 1 2 3 4 5 6 7 8 9 10
M^0 =  1:  0 1 0 0 1 0 1 1 0 1
       2:  0 0 1 0 0 1 0 0 0 0
       3:  0 0 0 0 0 0 1 0 0 0
       4:  0 0 0 0 0 0 0 1 0 0
       5:  0 0 0 0 0 0 0 0 0 0

M^1 =  1:  1 1 1 1 1 1 1 1 1 1
       2:  0 0 1 0 0 1 0 1 1 0
       3:  0 0 0 1 0 0 1 0 0 1
       4:  0 0 0 0 1 0 0 1 0 0
       5:  0 0 0 0 0 0 0 0 1 0
How much do we pay?
The running time is O(kn(1+m/w))
Again, the method is practically efficient for small m.
Only O(k) columns of the matrices M^l are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.
Problem 3: Solution
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches
Dictionary: {bzip, not, or, space}
P = bot, k = 2
S = “bzip or not bzip”
not = 1g 0g 0a
[Figure: “not” matches P within 2 mismatches, so its codeword is searched in C(S)]
Agrep: more sophisticated operations
The Shift-And method can solve other ops
The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p into a different one
Example: d(ananas,banane) = 3
Search by regular expressions
Example: (a|b)?(abc|a)
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort the pi in decreasing order, and encode si via the variable-length code for the integer i.
γ-code for integer encoding
γ(x) = 0^{Length-1} followed by x in binary, where x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000,1001>.
The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
It is a prefix-free encoding…
Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
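A minimal Python sketch of the γ-coder (the decoder counts the leading zeros z and then reads z+1 bits as the binary value):

```python
def gamma_encode(x):
    # gamma(x): (Length-1) zeros followed by the binary form of x >= 1.
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    # Parse a concatenation of gamma-codes back into a list of integers.
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":          # count leading zeros
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out
```

Decoding the exercise string reproduces 8, 6, 3, 59, 7.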
Analysis
Sort the pi in decreasing order, and encode si via the γ-code γ(i).
Recall that: |γ(i)| ≤ 2 log i + 1
How good is this approach wrt Huffman?
Key fact: 1 ≥ ∑_{i=1,…,x} pi ≥ x · px ⇒ x ≤ 1/px
The cost of the encoding is (recall i ≤ 1/pi):
∑_{i=1,…,|S|} pi |γ(i)| ≤ ∑_{i=1,…,|S|} pi [2 log(1/pi) + 1] = 2 H0(X) + 1
Hence the compression ratio is ≤ 2 H0(X) + 1:
not much worse than Huffman, and improvable to H0(X) + 2 + ....
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7 bits: just those produced by Huffman
End-tagged dense code
The rank r is mapped to the r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes:
It is a prefix code
Better compression: it uses all the 7-bit configurations
(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2
A new concept: Continuers vs Stoppers
Previously we used: s = c = 128
The main idea is: s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte,
s*c with 2 bytes, s*c^2 with 3 bytes, ...
An example
5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
The (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...
Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.
Brute-force approach
Binary search: on real distributions, there seems to be one unique minimum
Ks = max codeword length
Fsk = cumulative probability of the symbols whose |cw| ≤ k
Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded
Start with the list of symbols L = [a,b,c,d,…]
For each input symbol s:
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
Exploits temporal locality, and it is dynamic
X = 1^n 2^n 3^n … n^n ⇒ Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits
Not much worse than Huffman ...but it may be far better
MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 log i + 1
Put S at the front and consider the cost of encoding (p^x_i = position of the i-th occurrence of symbol x):
O(|S| log |S|) + ∑_{x=1,…,|S|} ∑_i |γ(p^x_i − p^x_{i-1})|
By Jensen’s inequality:
≤ O(|S| log |S|) + ∑_{x=1,…,|S|} n_x [2 log(N/n_x) + 1]
= O(|S| log |S|) + N [2 H0(X) + 1]
⇒ La[mtf] ≤ 2 H0(X) + O(1)
MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded
How to keep the MTF-list efficiently:
Search tree: leaves contain the symbols, ordered as in the MTF-list; nodes contain the size of their descending subtree
Hash table: the key is a symbol, the data is a pointer to the corresponding tree leaf
Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one initial bit suffice
Properties:
Exploits spatial locality, and it is a dynamic code
There is a memory
X = 1^n 2^n 3^n … n^n ⇒ Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)
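A minimal Python sketch of the run collapse (the run lengths would then be γ-coded or similar):

```python
def rle(s):
    # Collapse each maximal run into a (symbol, run-length) pair.
    out = []
    for c in s:
        if out and out[-1][0] == c:
            out[-1] = (c, out[-1][1] + 1)   # extend the current run
        else:
            out.append((c, 1))              # start a new run
    return out
```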
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):
f(i) = ∑_{j=1}^{i-1} p(j)
e.g. f(a) = .0, f(b) = .2, f(c) = .7 for p(a) = .2, p(b) = .5, p(c) = .3
[Figure: the unit interval split into a = [0,.2), b = [.2,.7), c = [.7,1)]
The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0,1); b restricts it to [.2,.7); within that, a restricts it to [.2,.3); within that, c restricts it to [.27,.3).
The final sequence interval is [.27,.3)
Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:
l_0 = 0,  l_i = l_{i-1} + s_{i-1} * f[c_i]
s_0 = 1,  s_i = s_{i-1} * p[c_i]
f[c] is the cumulative prob. up to symbol c (not included)
The final interval size is
s_n = ∏_{i=1}^{n} p[c_i]
The interval for a message sequence will be called the sequence interval
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
.49 ∈ [.2,.7) → b; rescale: (.49−.2)/.5 = .58 ∈ [.2,.7) → b; rescale: (.58−.2)/.5 = .76 ∈ [.7,1) → c.
The message is bbc.
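Both directions can be sketched directly from the l/s recurrences (a floating-point sketch for short messages only; a real coder uses the integer version described later):

```python
def sequence_interval(msg, p, f):
    # l_i = l_{i-1} + s_{i-1} * f[c_i];  s_i = s_{i-1} * p[c_i]
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, l + s            # the sequence interval [l, l+s)

def arith_decode(x, n, p, f):
    # Find the symbol interval containing x, output its symbol, rescale x.
    out = []
    for _ in range(n):
        for c in sorted(f, key=f.get, reverse=True):
            if x >= f[c]:
                out.append(c)
                x = (x - f[c]) / p[c]
                break
    return "".join(out)
```

With p = {a:.2, b:.5, c:.3} and f = {a:0, b:.2, c:.7}, "bac" yields the interval [.27,.3), and decoding .49 for 3 symbols yields "bbc", as on the slides.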
Representing a real number
Binary fractional representation:
.75 = .11    1/3 = .0101…    11/16 = .1011
Algorithm
1. x = 2 * x
2. If x < 1, output 0
3. else x = x − 1; output 1
So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) → .01    [.33,.66) → .1    [.66,1) → .11
Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:
.11  → min .110, max .111… → interval [.75, 1.0)
.101 → min .1010, max .1011… → interval [.625, .75)
We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).
e.g. the sequence interval [.61, .79) contains the code interval [.625, .75) of .101
Can use L + s/2 truncated to 1 + log (1/s) bits
Bound on Arithmetic length
Note that −log s + 1 = log (2/s)
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + log (1/s) = 1 + log ∏_{i=1,n} (1/p_i)
             ≤ 2 + ∑_{i=1,n} log (1/p_i)
             = 2 + ∑_{k=1,|S|} n p_k log (1/p_k)
             = 2 + n H0 bits
In practice it is nH0 + 0.02 n bits, because of rounding
Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
Keep integers in range [0..R) where R = 2^k
Use rounding to generate the integer interval
Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 then (top half):
Output 1 followed by m 0s; set m = 0; the message interval is expanded by 2
If u < R/2 then (bottom half):
Output 0 followed by m 1s; set m = 0; the message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half):
Increment m; the message interval is expanded by 2
In all other cases, just continue...
You find this at
Arithmetic ToolBox
As a state machine: given the current interval (L,s) and the next symbol c with distribution (p1,…,p|S|), the ATB outputs the new interval (L’,s’), with L’ = L + s·f(c) and s’ = s·p(c).
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
The ATB is driven by p[s | context], where s = c or esc: each (context, symbol) pair shrinks the current interval (L,s) to (L’,s’).
Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant)
PPM: Example Contexts (String = ACCBACCACBAB, k = 2)

Context empty:  A = 4, B = 2, C = 5, $ = 3
Context A:      C = 3, $ = 1
Context B:      A = 2, $ = 1
Context C:      A = 1, B = 2, C = 2, $ = 3
Context AC:     B = 1, C = 2, $ = 2
Context BA:     C = 1, $ = 1
Context CA:     C = 1, $ = 1
Context CB:     A = 2, $ = 1
Context CC:     A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
[Figure: the dictionary is the set of all substrings starting before the cursor; here the emitted triple is <2,3,c>]
Algorithm’s step:
Output <d, len, c> where
d = distance of the copied string wrt the current position
len = length of the longest match
c = next char in text beyond the longest match
Advance by len + 1
A buffer “window” has fixed length and moves
Example: LZ77 with window (size = 6)
a a c a a c a b c a b a a a c → (0,0,a)
a a c a a c a b c a b a a a c → (1,1,c)
a a c a a c a b c a b a a a c → (3,4,b)
a a c a a c a b c a b a a a c → (3,3,a)
a a c a a c a b c a b a a a c → (1,2,c)
(the longest match is searched within W; the third component is the next character)
LZ77 Decoding
Decoder keeps the same dictionary window as the encoder:
it finds the substring and inserts a copy of it
What if len > d? (overlap with the text to be compressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor:
for (i = 0; i < len; i++)
    out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
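A minimal Python sketch of the decoder (the char-by-char copy loop is exactly what makes the overlapping case len > d work):

```python
def lz77_decode(triples):
    # Each triple is (d, length, c): copy `length` chars from `d` positions
    # back, then append the literal `c`.
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):
            out.append(out[start + i])   # may read chars just written
        out.append(c)
    return "".join(out)
```

Feeding it the window example above reconstructs the original text, and the overlap example yields abcdcdcdcdcdce.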
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
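Both loops fit in a few lines of Python (a hash map stands in for the trie; names are illustrative):

```python
def lz78_encode(s):
    """Emit (id, char) pairs; id 0 denotes the empty string."""
    dictionary = {"": 0}
    out, cur = [], ""
    for c in s:
        if cur + c in dictionary:
            cur += c                                  # extend the match S
        else:
            out.append((dictionary[cur], c))          # output (id(S), c)
            dictionary[cur + c] = len(dictionary)     # add Sc to the dictionary
            cur = ""
    if cur:                                           # flush a pending match
        out.append((dictionary[cur[:-1]], cur[-1]))
    return out

def lz78_decode(pairs):
    """Rebuild the same dictionary from the ids."""
    dictionary = {0: ""}
    out = []
    for i, c in pairs:
        s = dictionary[i] + c
        out.append(s)
        dictionary[len(dictionary)] = s
    return "".join(out)
```

On the coding example above, the encoder emits exactly the pairs (0,a) (1,b) (1,a) (0,c) (2,c) (5,b).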
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
LZW: Encoding Example
Output
Dict.
a a b a a c a b a b a c b
112 256=aa
a a b a a c a b a b a c b
112 257=ab
a a b a a c a b a b a c b
113
258=ba
a a b a a c a b a b a c b
256
259=aac
a a b a a c a b a b a c b
114
260=ca
a a b a a c a b a b a c b
257
261=aba
a a b a a c a b a b a c b
261
262=abac
a a b a a c a b a b a c b
114
263=cb
LZW: Decoding Example
Input
112
Dict
a
112
a a
256=aa
113
a a b
257=ab
256
a a b a a
258=ba
114
a a b a a c
259=aac
257
a a b a a c a b ?
260=ca
261
261
114
a a b a a c a b a b
261=aba
one
step
later
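A sketch of both sides in Python, restricted to the slide's toy alphabet (a = 112, b = 113, c = 114 — an assumption matching the example, not real ASCII); the decoder's `else` branch is exactly the special case above, where the received id is not yet in its dictionary:

```python
def lzw_encode(s, alphabet="abc", base=112):
    """LZW: emit only ids; add Sc to the dictionary without sending c."""
    dictionary = {ch: base + i for i, ch in enumerate(alphabet)}
    next_id = 256
    out, cur = [], s[0]
    for c in s[1:]:
        if cur + c in dictionary:
            cur += c
        else:
            out.append(dictionary[cur])
            dictionary[cur + c] = next_id      # add Sc, but emit only id(S)
            next_id += 1
            cur = c
    out.append(dictionary[cur])                # flush the last pending match
    return out

def lzw_decode(codes, alphabet="abc", base=112):
    """One step behind the coder, since it never sees c explicitly."""
    dictionary = {base + i: ch for i, ch in enumerate(alphabet)}
    next_id = 256
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:
            entry = prev + prev[0]             # the special case: id not yet known
        dictionary[next_id] = prev + entry[0]  # now the decoder learns c
        next_id += 1
        out.append(entry)
        prev = entry
    return "".join(out)
```

On the encoding example above this produces 112 112 113 256 114 257 261 114, plus a final 113 flushing the last pending match, which the slide's table stops short of.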
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows (1994)

F              L
# mississipp  i
i #mississip  p
i ppi#missis  s
i ssippi#mis  s
i ssissippi#  m
m ississippi  #
p i#mississi  p
p pi#mississ  i
s ippi#missi  s
s issippi#mi  s
s sippi#miss  i
s sissippi#m  i

(row 6 is T = mississippi# itself; L is the last column)
A famous example
Much
longer...
A useful tool: the L → F mapping
F = # i i i i m p p s s s s (first column, sorted), L = i p s s m # p i s s i i (last column); the rotations in between are unknown
How do we map L's chars onto F's chars ?
... Need to distinguish equal chars in F...
Take two equal chars of L
Rotate their rows rightward
Same relative order !!
The BWT is invertible
F = # i i i i m p p s s s s (sorted), middle rotations unknown, L = i p s s m # p i s s i i
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i ppi #
InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
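A runnable Python version of InvertBWT (the LF array is obtained by stably sorting the positions of L; it assumes the text ends with a unique smallest end-marker, as mississippi# does):

```python
def bwt_inverse(L):
    """Invert the BWT of a text ending with a unique smallest marker."""
    n = len(L)
    # LF-array: L[r], seen as a char of F, occupies position LF[r] in F.
    order = sorted(range(n), key=lambda r: (L[r], r))  # stable: equal chars keep order
    LF = [0] * n
    for f_pos, r in enumerate(order):
        LF[r] = f_pos
    # Row 0 starts with the end-marker; walk LF to read T backward.
    r, chars = 0, []
    for _ in range(n):
        chars.append(L[r])
        r = LF[r]
    t = "".join(reversed(chars))   # T rotated to start at the marker
    return t[1:] + t[0]            # rotate the marker back to the end
```

Feeding it the L column of the example, ipssm#pissii, returns mississippi#.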
How to compute the BWT ?
SA    BWT matrix (sorted rotations, last char = L)
12    #mississipp i
11    i#mississip p
 8    ippi#missis s
 5    issippi#mis s
 2    ississippi# m
 1    mississippi #
10    pi#mississi p
 9    ppi#mississ i
 7    sippi#missi s
 4    sissippi#mi s
 6    ssippi#miss i
 3    ssissippi#m i
We said that: L[i] precedes F[i] in T
L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
How to construct SA from T ?
SA    suffix
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#
Elegant but inefficient
Input: T = mississippi#
Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii
# at 16
Mtf = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code
RLE0 = 03141041403141410210
Alphabet
|S|+1
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
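The Move-to-Front step can be sketched as follows (the slide's second Mtf line is this output with every rank increased by 1, freeing symbol 0 for RLE0):

```python
def mtf_encode(s, alphabet):
    """Emit, for each char, its current rank in the list, then move it to front."""
    lst = list(alphabet)
    out = []
    for c in s:
        r = lst.index(c)
        out.append(r)
        lst.insert(0, lst.pop(r))   # move-to-front: recent chars get small ranks
    return out
```

Runs of equal chars in L become runs of zeros, which is why RLE0 pays off right after.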
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that any node can reach any other node via an
undirected path.
Strongly connected components (SCC)
Set of nodes such that any node can reach any other node via a
directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
Largest artifact ever conceived by humankind
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph (undirected)
V = routers, E = communication links
The "cosine" graph (undirected, weighted)
V = static web pages, E = semantic distance between pages
Query-Log graph (bipartite, weighted)
V = queries and URLs, E = (q,u) if u is a result for q, and has been clicked by some user who issued q
Social graph (undirected, unweighted)
V = users, E = (x,y) if x knows y (facebook, address book, email,..)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: Pr that a node has x links is ≈ 1/x^α, α ≈ 2.1
The In-degree distribution
Altavista crawl, 1999 — WebBase crawl, 2001
Indegree follows a power-law distribution:
Pr[ in-degree(u) = k ] ∝ 1/k^α,  α ≈ 2.1
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: Pr that a node has x links is ≈ 1/x^α, α ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing links
A Picture of the Web Graph
[figure: adjacency-matrix plot of a crawl — 21 million pages, 150 million links]
URL-sorting
Berkeley
Stanford
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}  (gap encoding)
For negative entries (s1 < x): fold the sign into the value (v ≥ 0 → 2v, v < 0 → 2|v|-1)
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of y's copy-list tells whether the corresponding successor of the
reference x is also a successor of y;
The reference index is chosen in [0,W] so as to give the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.
Consecutivity
3
in extra-nodes
Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
data
knowledge
about data
at receiver
sender
receiver
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
How if the sender has never seen data at receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested
page and the ones available in cache
File synchronization
[diff, zdelta, REBL,…]
[rsynch, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
The LZ77-scheme provides an efficient, optimal solution:
fknown is the "previously encoded text"; compress the concatenation fknown·fnew, emitting output only from fnew onwards
zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
Client
reference
request
Slow-link
Delta-encoding
Proxy
reference
request
Fast-link
web
Page
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f ∈ F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
[figure: weighted graph GF over files 1, 2, 3, 5 plus the dummy node 0; edge weights shown include 620, 2000, 220, 123, 20, 20]
          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic
Improvement
What about
many-to-one compression?
(group of files)
Problem: Constructing G is very costly: n² edge calculations (zdelta executions)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and are thus
good candidates for zdelta-compression. Build a sparse weighted graph
G'F containing only edges between those pairs of files
Assign weights: Estimate appropriate edge weights for G'F, thus saving
zdelta executions. Nonetheless, still n² time
          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
request
f_new
update
Server
f_old
Client
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
hashes
f_new
Server
encoded file
f_old
Client
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
Rsync: some experiments
          gcc     emacs
total     27288   27326
gzip      7563    8577
zdelta    227     1431
rsync     964     4452
Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).
A multi-round protocol
k blocks of n/k elems
Log n/k levels
If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])
[figure: P aligned at position i of T, matching the prefix of T[i,N]]
Occurrences of P in T = All suffixes of T having P as a prefix
Example: P = si, T = mississippi ⇒ occurrences at positions 4 and 7
SUF(T) = Sorted set of suffixes of T
Reduction
From substring search
To prefix search
The Suffix Tree
T# = mississippi#
[figure: suffix tree of T# — internal edges labeled with substrings such as i, s, si, ssi, ppi#, pi#, i#, mississippi#; the 12 leaves store the starting positions 1..12 of the suffixes]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

T = mississippi#
Storing SUF(T) explicitly would take Θ(N²) space; SA stores just the suffix pointers.
Example query: P = si

Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3], T = mississippi#, P = si
Compare P with the suffix at the middle SA entry (2 accesses per step: one to SA, one to T):
P is larger ⇒ recurse on the right half
P is smaller ⇒ recurse on the left half
Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
⇒ overall, O(p log₂ N) time
Improved to O(p + log₂ N) [Manber-Myers, '90]
and to O(p + log |S|) [Cole et al, '06]
Locating the occurrences
T = mississippi#, P = si
The binary search delimits the SA interval of suffixes prefixed by P (search for P padded with the smallest and the largest symbol, e.g. si# and si$):
SA entries 7 (sippi...) and 4 (sissippi...) ⇒ occ = 2, occurrences at positions 4 and 7
Suffix Array search
• O(p + log₂ N + occ) time (pad P with # < S and S < $ to get the two interval endpoints)
Suffix Trays: O(p + log₂ |S| + occ) [Cole et al., '06]
String B-tree [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays [Ciriani et al., '02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
SA  = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
Lcp = [  0,  1, 1, 4, 0, 0,  1, 0, 2, 1, 3]
T = mississippi#
E.g. Lcp = 4 between issippi# and ississippi#
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
Slide 11
Algoritmi per IR
Prologo
Paradigm shift...
Web 2.0 is about the many
Big
DATA Big PC ?
We have three types of algorithms:
T1(n) = n, T2(n) = n², T3(n) = 2ⁿ
... and assume that 1 step = 1 time unit
How many input data n each algorithm may process
within t time units?
n1 = t,
n2 = √t,
n3 = log2 t
What about a k-times faster processor?
...or, what is n, when the time units are k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
A new scenario
Data are more available than even before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Step cost is Θ(1) time
Not just MIN #steps…
[figure: memory hierarchy]
registers / L1 / L2 cache: few Mbs, some nanosecs, few words fetched
RAM: few Gbs, tens of nanosecs, some words fetched
HD: few Tbs, few millisecs, B = 32K page
net: many Tbs, even secs, packets
You should be “??-aware programmers”
I/O-conscious Algorithms
[figure: magnetic disk — track, read/write head, read/write arm, magnetic surface]
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O [10⁵–10⁶ steps (Hennessy-Patterson)]
If N = (1+f)M, then the disk-average cost per step is:
C * p * f/(1+f)
This is at least 10⁴ * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)
Space-conscious Algorithms
I/Os
search
access
Compressed
data structures
Streaming Algorithms
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
[figure: the same memory hierarchy — CPU registers, L1/L2 cache (few Mbs, some nanosecs, few words fetched), RAM (few Gbs, tens of nanosecs), HD (few Tbs, few millisecs, B = 32K page), net (many Tbs, even secs, packets)]
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
n     4K   8K   16K   32K    128K   256K   512K   1M
n³    22s  3m   26m   3.5h   28h    --     --     --
n²    0    0    0     1s     26s    106s   7m     28m
An optimal solution
We assume every prefix-subsum ≠ 0
A = [ part with sum < 0 | part with sum > 0 containing the Optimum ]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
sum = 0; max = -1;
For i = 1,...,n do
  If (sum + A[i] ≤ 0) sum = 0;
  else { sum += A[i]; max = MAX(max, sum); }
Note:
• Sum < 0 right before OPT starts;
• Sum > 0 within OPT
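The same scan in runnable form (the slide assumes the optimum is positive; with an all-negative array this sketch would return the initial max):

```python
def max_subarray_sum(a):
    """Single linear scan: reset the running sum when it would drop <= 0."""
    s, best = 0, -1           # the slide's initialization: max = -1
    for x in a:
        if s + x <= 0:
            s = 0             # sum < 0 right before the optimum starts
        else:
            s += x            # sum > 0 within the optimum
            best = max(best, s)
    return best
```

On the slide's array the answer is 12, achieved by the window 6 1 -2 4 3.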
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions ⇒ data get distributed arbitrarily !!!
B-tree internal nodes
B-tree leaves
(“tuple pointers")
What about
listing tuples
in order ?
Tuples
Possibly 10⁹ random I/Os = 10⁹ * 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;          // Divide
03   Merge-Sort(A,i,m);    // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)        // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10⁹ tuples ⇒ few Gbs
Typical disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log₂ n) random I/Os
[5ms] * n log₂ n ≈ 1.5 years
In practice, it is faster because of caching...
Merge-Sort Recursion Tree
(2 passes (R/W) over the data at each level)
[figure: recursion tree of mergesort — log₂ N levels, each level merging pairs of sorted runs]
If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help
How do we deploy the disk/mem features ?
N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log₂ (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main memory M and disk pages of B items:
Pass 1: Produce N/M sorted runs.
Pass i: merge X = M/B runs ⇒ log_{M/B} (N/M) passes
INPUT 1
...
INPUT 2
...
OUTPUT
...
INPUT X
Disk
Disk
Main memory buffers of B items
Multiway Merging
Bf1
p1
Bf2
Fetch, if pi = B
p2
min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])
Bfo
po
Bfx
pX
Current
page
Run 1
Current
page
Flush, if
Bfo full
Current
page
Run 2
Run X=M/B
Out File:
Merged run
EOF
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os
In practice
M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
⇒ one multiway merge ⇒ 2 passes = few mins
Tuning depends
on disk features
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
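An in-memory sketch of one merge pass with a min-heap over the X run heads (on disk, each run would stream through a B-sized input buffer and the output buffer would be flushed when full):

```python
import heapq

def multiway_merge(runs):
    """Merge X sorted runs in a single pass, always popping the smallest head."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)   # min over the X current heads
        out.append(val)
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))  # fetch next of run i
    return out
```

The heap keeps each extraction at O(log X), so a pass over N items costs O(N log X) comparisons and N/B page reads per run stream.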
Can compression help?
Goal: enlarge M and reduce N
#passes = O(logM/B N/M)
Cost of a pass = O(N/B)
Part of Vitter’s paper…
In order to address issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how far we can go
Toy problem #3: Top-freq elements
Goal: Find the most frequent item over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. the mode, provided it occurs > N/2 times)
A = b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables <X,C>
For each item s of the stream,
  if (X == s) then C++
  else { C--; if (C == 0) { X = s; C = 1; } }
Return X;
Proof
If the returned X ≠ y, then every occurrence of y has a "negative"
mate (an occurrence that cancelled it in the counter).
Hence these mates are ≥ #occ(y), so N ≥ 2 * #occ(y) —
contradicting #occ(y) > N/2.
(Problems arise only if the most frequent item occurs ≤ N/2 times.)
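The scan above in runnable form (the "take a new candidate when the counter hits zero" case made explicit):

```python
def majority_candidate(stream):
    """O(1)-space scan; returns the majority item whenever one occurs > N/2 times
    (with no majority the returned item is arbitrary and must be verified)."""
    x, c = None, 0
    for s in stream:
        if c == 0:
            x, c = s, 1        # adopt s as the new candidate X
        elif x == s:
            c += 1
        else:
            c -= 1             # s is a "negative" mate for one occurrence of x
    return x
```

On the slide's stream A the returned candidate is c, which indeed occurs 9 times out of 17.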
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 * 10⁹ chars ⇒ size = 6Gb
n = 10⁶ documents
TotT = 10⁹ term occurrences (avg term length is 6 chars)
t = 5 * 10⁵ distinct terms
What kind of data structure do we build to support
word-based searches ?
Solution 1: Term-Doc matrix
t = 500K terms, n = 1 million documents

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1               1             0           0       0        1
Brutus             1               1             0           1       0        0
Caesar             1               1             0           1       1        1
Calpurnia          0               1             0           0       0        0
Cleopatra          1               0             0           0       0        0
mercy              1               0             1           1       1        1
worser             1               0             1           1       1        0

(1 if the play contains the word, 0 otherwise)
Space is 500Gb !
Solution 2: Inverted index
Brutus    → 2, 4, 8, 16, 32, 64, 128
Caesar    → 1, 2, 3, 5, 8, 13, 21, 34
Calpurnia → 13, 16
We can still do better: i.e. 30-50% of the original text
1. Typically each posting takes about 12 bytes
2. We have 10⁹ total terms ⇒ at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
A better index, but still >10 times the text !!!!
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string
is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO: they are 2ⁿ, but the shorter compressed messages are at most
∑_{i=1}^{n-1} 2^i = 2ⁿ - 2
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self-information of s is:
i(s) = log₂ (1/p(s)) = -log₂ p(s)
Lower probability ⇒ higher information
Entropy is the weighted average of i(s):
H(S) = ∑_{s∈S} p(s) · log₂ (1/p(s))  bits
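The two formulas in runnable form:

```python
import math

def self_information(p):
    """i(s) = log2(1/p(s)) = -log2 p(s): rarer symbols carry more bits."""
    return -math.log2(p)

def entropy(probs):
    """H(S): the weighted average of the self-information, in bits."""
    return sum(p * self_information(p) for p in probs if p > 0)
```

E.g. four equiprobable symbols give H = 2 bits, the cost of a plain 2-bit code.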
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
In a uniquely decodable code, any bit sequence can be
decomposed into codewords in at most one way.
Prefix Codes
A prefix code is a variable-length code in which
no codeword is a prefix of another one
e.g. a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:
[figure: root —0→ leaf a; —1→ node whose 0-subtree holds b (100) and c (101) and whose 1-child is leaf d]
Average Length
For a code C with codeword lengths L[s], the
average length is defined as
La(C) = ∑_{s∈S} p(s) · L[s]
We say that a prefix code C is optimal if for
all prefix codes C', La(C) ≤ La(C')
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same codeword lengths, and thus the
same optimal average length. And vice versa…
Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn},
then pi < pj ⇒ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have
H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that
La(C) ≤ H(S) + 1
(Shannon code: symbol s takes ⌈log₂ 1/p(s)⌉ bits)
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
[tree: a(.1) + b(.2) → (.3); (.3) + c(.2) → (.5); (.5) + d(.5) → (1)]
a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) "equivalent" Huffman trees
What about ties (and thus, tree depth) ?
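A sketch of the construction with a min-heap (an insertion counter breaks ties, so the concrete bits may differ from the slide's tree, but the code lengths — and thus La — match):

```python
import heapq

def huffman_codes(freqs):
    """Repeatedly merge the two least probable subtrees into one."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)      # least probable subtree
        p2, _, c2 = heapq.heappop(heap)      # second least probable
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]
```

On the running example the code lengths are 3, 3, 2, 1 for a, b, c, d, giving La = 1.8 bits as on the slide.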
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at the root and take the branch for
each bit received. When at a leaf, output its
symbol and return to the root.
abc... → 00000101...
101001... → dcb...
[tree as above: a = 000, b = 001, c = 01, d = 1]
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store for any level L:
= 00.....0
firstcode[L]
Symbol[L,i], for each i in level L
This is ≤ h2+ |S| log |S| bits
Canonical Huffman
Encoding
1
2
3
4
5
Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0
T=...00010...
Problem with Huffman Coding
Consider a symbol with probability .999. Its
self-information is
-log₂(.999) ≈ .00144
If we were to send 1000 such symbols we
might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
⇒ 1 extra bit per macro-symbol = 1/k extra bits per symbol
⇒ larger model to be transmitted
Shannon took infinite sequences, i.e. k → ∞ !!
In practice, we have:
the model takes |S|^k * (k * log |S|) + h² bits (where h might be |S|)
It is H₀(S_L) ≤ L * H_k(S) + O(k * log |S|), for each k ≤ L
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
[figure: word-based tagged Huffman over T = "bzip or not bzip" — the tree has fan-out 128, each codeword is a sequence of 7-bit chunks, and an extra tag bit marks the first byte of every codeword; the byte-aligned codewords (e.g. bzip = 1a 0b) are shown along the tree together with C(T)]
CGrep and other ideas...
P = bzip = 1a 0b
[figure: GREP runs directly on C(T), T = "bzip or not bzip": the pattern's byte-aligned codeword is matched against the compressed text, and the tag bits rule out false matches across codeword boundaries]
Speed ≈ Compression ratio
You find this under my Software projects
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary: bzip, not, or, space
P = bzip = 1a 0b
[figure: searching the compressed S = "bzip or not bzip": the byte-aligned, tagged codeword of P is compared directly against C(S)]
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[figure: pattern P sliding over text T]
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}ⁿ)
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:
H(s) = ∑_{i=1}^{m} 2^(m-i) * s[i]
P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s' if and only if H(s) = H(s')
Definition:
let Tr denote the m-length substring of T starting at
position r (i.e., Tr = T[r, r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1):
H(Tr) = 2 * H(Tr-1) - 2^m * T[r-1] + T[r+m-1]
T = 10110101
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2*11 - 2⁴*1 + 0 = 22 - 16 + 0 = 6
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bit long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P=101111
q=7
H(P) = 47
Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally (Horner’s rule, mod q):
1·2 + 0 (mod 7) = 2
2·2 + 1 (mod 7) = 5
5·2 + 1 (mod 7) = 4
4·2 + 1 (mod 7) = 2
2·2 + 1 (mod 7) = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr-1), since
2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
The randomized algorithm is correct w.h.p.
The deterministic algorithm has expected running time
O(n+m)
Proof on the board
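The whole algorithm above can be put together in a few lines. This is not the slides' code but a minimal Python sketch: a fixed Mersenne prime stands in for the randomly chosen prime q, the alphabet is binary, and every fingerprint hit is verified, so the answer is exact (the deterministic variant).

```python
def karp_rabin(T, P, q=2**31 - 1):
    """Fingerprint search on binary strings; every hit is verified,
    so the output is exact (the deterministic variant)."""
    n, m = len(T), len(P)
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):                     # Horner's rule, mod q
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    pow_m = pow(2, m - 1, q)               # 2^(m-1) mod q, for the rolling update
    occ = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:   # verify to rule out false matches
            occ.append(r)
        if r + m < n:
            # Hq(T_{r+1}) = 2*(Hq(T_r) - 2^(m-1)*T[r]) + T[r+m]   (mod q)
            ht = (2 * (ht - pow_m * int(T[r])) + int(T[r + m])) % q
    return occ
```

On the slides' example T=10110101, P=0101 this reports the single 0-based starting position 4.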
Problem 1: Solution
Dictionary = {a, bzip, not, or}
P = bzip = 1a 0b
S = “bzip or not bzip”
[figure: the compressed text C(S) is scanned codeword by codeword; the codeword of “bzip” matches at the first and last codewords → yes]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for
M is an m × n matrix, T along the columns, P along the rows:

        j: 1 2 3 4 5 6 7 8 9 10
T =        c a l i f o r n i a
f (i=1):   0 0 0 0 1 0 0 0 0 0
o (i=2):   0 0 0 0 0 1 0 0 0 0
r (i=3):   0 0 0 0 0 0 1 0 0 0
How does M solve the exact match problem?
How to construct M
We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one
Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:
And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.
Example: BitShift( 0 1 1 0 ) = 1 0 1 1   (bits listed from position 1 downward; the first bit is set to 1)
Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit the bit-parallelism to
compute the j-th column of M from the (j−1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x) is set to 1 at the
positions in P where character x appears.
Example:
P = abaac
U(a) = 1 0 1 1 0
U(b) = 0 1 0 0 0
U(c) = 0 0 0 0 1
How to construct M
Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by
M(j) = BitShift(M(j−1)) & U(T[j])
For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T
ending at character j−1  ⇔  M(i−1,j−1) = 1
(2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1
BitShift moves bit M(i−1,j−1) into the i-th position;
AND this with the i-th bit of U(T[j]) to establish whether both hold
An example j=1
T = xabxabaaca (n = 10),  P = abaac (m = 5)
U(x) = 0 0 0 0 0
M(1) = BitShift(M(0)) & U(T[1]) = 10000 & 00000 = 00000
An example j=2
T = xabxabaaca,  P = abaac
U(a) = 1 0 1 1 0
M(2) = BitShift(M(1)) & U(T[2]) = 10000 & 10110 = 10000
An example j=3
T = xabxabaaca,  P = abaac
U(b) = 0 1 0 0 0
M(3) = BitShift(M(2)) & U(T[3]) = 11000 & 01000 = 01000
An example j=9
T = xabxabaaca,  P = abaac
U(c) = 0 0 0 0 1
M(9) = BitShift(M(8)) & U(T[9]) = 11001 & 00001 = 00001
The 5th bit is set: an occurrence of P ends at position 9.
Shift-And method: Complexity
If m<=w, any column and vector U() fit in a
memory word.
Any step requires O(1) time.
If m > w, any column and vector U() can be
divided into ⌈m/w⌉ memory words.
Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
Very often in practice. Recall that w=64 bits in
modern architectures.
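The Shift-And loop above can be sketched in Python (an illustration, not the slides' code): the column M(j) is kept as a single integer, bit i−1 corresponds to pattern position i, and occurrences are reported by their 0-based starting positions.

```python
def shift_and(T, P):
    """Bit i-1 of the integer M is M(i, j): P[1..i] ends at T[j]."""
    m = len(P)
    U = {}                                 # U[c]: bit i-1 set iff P[i] == c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, last = 0, 1 << (m - 1)
    occ = []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)   # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & last:                       # the m-th bit set: full match
            occ.append(j - m + 1)          # 0-based starting position
    return occ
```

On T = california, P = for it reports [4], i.e. the match ending at column j = 7 of the matrix M.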
Some simple extensions
We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = 1 0 1 1 0
U(b) = 1 1 0 0 0
U(c) = 0 0 0 0 1
What about ‘?’, ‘[^…]’ (not)?
Problem 1: Another solution
Dictionary = {a, bzip, not, or}
P = bzip = 1a 0b
S = “bzip or not bzip”
[figure: C(S) is scanned codeword by codeword, answering yes/no at each position; matches at the first and last codewords]
Speed ≈ Compression ratio
Problem 2
Dictionary = {a, bzip, not, or}
Given a pattern P, find
all the occurrences in S
of all terms containing
P as a substring
P = o
S = “bzip or not bzip”
not = 1g 0g 0a
or = 1g 0a 0b
[figure: both dictionary terms containing “o” are searched for in C(S); “or” and “not” match → yes]
Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[figure: T = A B C A C A B D D A B A, with the occurrences of P1 and P2 highlighted]
Naïve solution
Use an (optimal) Exact Matching Algorithm searching each
pattern in P
Complexity: O(nl+m) time, not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns in P
R is a bitmap of length m:
R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method searching for S
For any symbol c, U’(c) = U(c) & R, so that
U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
For any step j,
compute M(j),
then M(j) OR U’(T[j]). Why?
This sets to 1 the first bit of each pattern that starts with T[j]
Check if there are occurrences ending in j. How?
Problem 3
Dictionary = {a, bzip, not, or}
Given a pattern P, find
all the occurrences in S
of all terms containing
P as a substring, allowing
at most k mismatches
P = bot, k = 2
S = “bzip or not bzip”
[figure: the dictionary terms are matched against P = bot allowing mismatches; “not” matches with 1 mismatch]
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4; it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
   atcgaa      (2 mismatches, starting at 4)
aatatccacaa
 atcgaa        (4 mismatches, starting at 2)
Agrep
Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix M^l to be an m by n binary
matrix, such that:
M^l(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.
What is M0?
How does Mk solve the k-mismatch problem?
Computing Mk
We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
Computing Ml: case 1
The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.
[figure: P[1..i−1] aligned to T ending at j−1 with at most l mismatches, and P[i] = T[j]]
BitShift(M^l(j−1)) & U(T[j])
Computing Ml: case 2
The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.
[figure: P[1..i−1] aligned to T ending at j−1 with at most l−1 mismatches; T[j] is taken as a mismatch]
BitShift(M^{l−1}(j−1))
Computing Ml
We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
M^l(j) = [ BitShift(M^l(j−1)) & U(T[j]) ]  OR  BitShift(M^{l−1}(j−1))
Example M1
T = xabxabaaca,  P = abaad

        j: 1 2 3 4 5 6 7 8 9 10
M1 = i=1:  1 1 1 1 1 1 1 1 1 1
     i=2:  0 0 1 0 0 1 0 1 1 0
     i=3:  0 0 0 1 0 0 1 0 0 1
     i=4:  0 0 0 0 1 0 0 1 0 0
     i=5:  0 0 0 0 0 0 0 0 1 0

M0 = i=1:  0 1 0 0 1 0 1 1 0 1
     i=2:  0 0 1 0 0 1 0 0 0 0
     i=3:  0 0 0 0 0 0 1 0 0 0
     i=4:  0 0 0 0 0 0 0 1 0 0
     i=5:  0 0 0 0 0 0 0 0 0 0
How much do we pay?
The running time is O(kn(1 + m/w))
Again, the method is practically efficient for
small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.
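The recurrence above can be sketched as follows (a Python illustration, not the slides' code); M[l] holds the current column of M^l as an integer, and occurrences with at most k mismatches are reported by their 0-based starting positions.

```python
def agrep(T, P, k):
    """Shift-And with up to k mismatches; M[l] is the current column of M^l."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)
    last = 1 << (m - 1)
    occ = []
    for j, c in enumerate(T):
        prev = M[:]                       # the columns M^l(j-1)
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            # match T[j] with <= l errors so far, OR mismatch it (<= l-1 before)
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & last:
            occ.append(j - m + 1)         # 0-based starting position
    return occ
```

On the slides' example T = xabxabaaca, P = abaad, k = 1 it finds the single occurrence starting at 0-based position 4 (the 1 in row 5, column 9 of M1).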
Problem 3: Solution
Dictionary = {a, bzip, not, or}
Given a pattern P, find
all the occurrences in S
of all terms containing
P as a substring, allowing
k mismatches
P = bot, k = 2
S = “bzip or not bzip”
not = 1g 0g 0a
[figure: scan C(S) with the k-mismatch method; “not” matches → yes]
Agrep: more sophisticated operations
The Shift-And method can solve other ops
The edit distance between two strings p and s is
d(p,s) = minimum number of operations
needed to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p into a different one
Example: d(ananas,banane) = 3
Search by regular expressions
Example: (a|b)?(abc|a)
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
γ-code for integer encoding
γ(x) = 000…0 followed by x in binary
       (Length − 1 zeros)
x > 0 and Length = ⌊log₂ x⌋ + 1
e.g., 9 is represented as <000,1001>.
The γ-code for x takes 2⌊log₂ x⌋ + 1 bits
(i.e., a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of γ-coded
integers, reconstruct the original sequence:
0001000001100110000011101100111
8
6
3
59
7
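The γ-code can be sketched in a few lines of Python (an illustration, not part of the slides), operating on bit strings:

```python
def gamma_encode(x):
    """gamma(x): (Length - 1) zeros, then x in binary; defined for x >= 1."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Split a concatenation of gamma-codes back into the integer sequence."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":            # leading zeros give Length - 1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out
```

Decoding the exercise bit string above reproduces the sequence 8, 6, 3, 59, 7.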
Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code γ(i).
Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:
1 ≥ Σ_{i=1..x} pi ≥ x·px  ⇒  x ≤ 1/px
How good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):
Σ_{i=1..|S|} pi · |γ(i)|  ≤  Σ_{i=1..|S|} pi · [2·log(1/pi) + 1]  ≤  2·H0(X) + 1
Not much worse than Huffman,
and improvable to H0(X) + 2 + …
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman
End-tagged dense code
The rank r is mapped to r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix-code
Better compression: it uses all 7-bits configurations
(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is:
Previously we used: s = c = 128
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte,
s·c with 2 bytes, s·c² with 3 bytes, …
An example
5000 distinct words
ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
(230,26)-dense code encodes 230 + 230·26 = 6210 on at most 2
bytes, hence more on 1 byte, and thus wins if the distribution is skewed…
Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.
Brute-force approach
Binary search:
On real distributions, there seems to be a unique minimum
Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k
Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than
byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
Exploit temporal locality, and it is dynamic
X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⇒  Huff = O(n² log n), MTF = O(n log n) + n²
Not much worse than Huffman
...but it may be far better
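The two MTF steps above can be sketched directly (a Python illustration; the symbol list L is kept as a plain Python list, which costs O(|S|) per step rather than the O(log |S|) of the tree discussed later):

```python
def mtf_encode(s, alphabet):
    L = list(alphabet)
    out = []
    for c in s:
        i = L.index(c)          # 1) output the position of c in L (0-based)
        out.append(i)
        L.pop(i)
        L.insert(0, c)          # 2) move c to the front of L
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        c = L.pop(i)            # the symbol currently at position i
        out.append(c)
        L.insert(0, c)
    return "".join(out)
```

Temporal locality shows up immediately: with L = [a,b,c,d], the string "abbba" becomes 0,1,0,0,1, i.e. repeated symbols turn into small integers.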
MTF: how good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log i + 1
Put S in front and consider the cost of encoding, where p_x^i is the position in the text of the i-th occurrence of symbol x:
O(|S| log |S|) + Σ_{x=1..|S|} Σ_{i=2..n_x} |γ(p_x^i − p_x^{i−1})|
By Jensen’s inequality:
≤ O(|S| log |S|) + Σ_{x=1..|S|} n_x · [2·log(N/n_x) + 1]
= O(|S| log |S|) + N·[2·H0(X) + 1]
Hence L_a[mtf] ≤ 2·H0(X) + O(1)
MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep the MTF-list efficiently:
Search tree: leaves contain the symbols, ordered as in the MTF-list;
nodes contain the size of their descending subtree
Hash table: the key is a symbol,
the data is a pointer to the corresponding tree leaf
Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings just numbers and one bit
Properties:
Exploit spatial locality, and it is a dynamic code
X = 1ⁿ 2ⁿ 3ⁿ … nⁿ
There is a memory
Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
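A minimal RLE sketch in Python (illustration only), reproducing the example above:

```python
def rle_encode(s):
    runs = []
    for c in s:
        if runs and runs[-1][0] == c:
            runs[-1] = (c, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((c, 1))               # start a new run
    return runs

def rle_decode(runs):
    return "".join(c * n for c, n in runs)
```

On binary strings the symbols alternate, so it suffices to store the run lengths plus one bit for the first symbol.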
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).
e.g., p(a) = .2, p(b) = .5, p(c) = .3
f(i) = Σ_{j=1..i−1} p(j), so f(a) = .0, f(b) = .2, f(c) = .7
a = [0, .2), b = [.2, .7), c = [.7, 1.0)
The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0, 1):
b → [.2, .7)
a → [.2, .3)
c → [.27, .3)
The final sequence interval is [.27,.3)
Arithmetic Coding
To code a sequence of symbols c with probabilities
p[c] use the following:
l₀ = 0,  l_i = l_{i−1} + s_{i−1} · f[c_i]
s₀ = 1,  s_i = s_{i−1} · p[c_i]
f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is
s_n = Π_{i=1..n} p[c_i]
The interval for a message sequence will be called the
sequence interval
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
.49 ∈ [.2, .7)  →  b
within [.2, .7), the b-interval is [.3, .55) and .49 ∈ it  →  b
within [.3, .55), the c-interval is [.475, .55) and .49 ∈ it  →  c
The message is bbc.
Representing a real number
Binary fractional representation:
.75 = .11
1/3 = .0101…
11/16 = .1011
Algorithm (emit the binary expansion of x ∈ [0,1)):
1. x = 2·x
2. If x < 1 output 0
3. else x = x − 1; output 1
So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01
[.33,.66) = .1 [.66,1) = .11
Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
number   min      max       interval
.11      .110     .111…     [.75, 1.0)
.101     .1010    .1011…    [.625, .75)
We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).
Sequence interval = [.61, .79)
Code interval of .101 = [.625, .75) ⊆ [.61, .79)
Can use L + s/2 truncated to 1 + log (1/s) bits
Bound on Arithmetic length
Note that –log s+1 = log (2/s)
Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 + log Π_{i=1..n} (1/p_i)
≤ 2 + Σ_{i=1..n} log (1/p_i)
= 2 + Σ_{k=1..|S|} n·p_k · log (1/p_k)
= 2 + n·H0 bits
nH0 + 0.02 n bits in practice
because of rounding
Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:
Keep integers in range [0, R) where R = 2^k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
Output 1 followed by m 0s; m = 0;
the message interval is expanded by 2
If u < R/2 (bottom half):
Output 0 followed by m 1s; m = 0;
the message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 (middle half):
Increment m;
the message interval is expanded by 2
All other cases: just continue…
You find this at
Arithmetic ToolBox
As a state machine
[figure: the Arithmetic ToolBox (ATB) as a state machine: given the current interval (L,s), the distribution (p1,…,pS) and a symbol c, it outputs the refined interval (L’,s’)]
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
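Basing probabilities on context counts is a one-liner. Below is a Python illustration on a made-up toy string (not from the slides), mimicking the p(e|th) computation:

```python
from collections import defaultdict

def context_counts(text, k):
    """For each length-k context, count the characters that follow it."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(text)):
        counts[text[i - k:i]][text[i]] += 1
    return counts

counts = context_counts("the then three math", 2)   # hypothetical toy text
# 'th' is followed by 'e' twice and by 'r' once, so:
p_e_given_th = counts["th"]["e"] / sum(counts["th"].values())   # = 2/3
```

A real PPM implementation keeps such tables for every context size up to k, together with the escape counts.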
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
[figure: PPM feeds the ATB with p[s|context], where s = c or esc; the ATB maps (L,s) to (L’,s’)]
Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)
PPM: Example Contexts
String = ACCBACCACBA, next symbol B;  k = 2

Context ∅:   A = 4   B = 2   C = 5   $ = 3
Context A:   C = 3   $ = 1
Context B:   A = 2   $ = 1
Context C:   A = 1   B = 2   C = 2   $ = 3
Context AC:  B = 1   C = 2   $ = 2
Context BA:  C = 1   $ = 1
Context CA:  C = 1   $ = 1
Context CB:  A = 2   $ = 1
Context CC:  A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) as n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)
<2,3,c>
Algorithm’s step:
Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match
Advance by len + 1
A buffer “window” has fixed length and moves
Example: LZ77 with window
a a c a a c a b c a b a a a c
(0,0,a)
a a c a a c a b c a b a a a c
(1,1,c)
a a c a a c a b c a b a a a c
(3,4,b)
a a c a a c a b c a b a a a c
(3,3,a)
a a c a a c a b c a b a a a c
(1,2,c)
Window size = 6
Longest match
within W
Next character
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if l > d? (overlap with text to be compressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
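The coding/decoding loops above can be sketched as follows (a Python illustration, not the slides' code); id 0 denotes the empty phrase:

```python
def lz78_encode(s):
    dict_, out, phrase = {}, [], ""     # phrase -> id (the empty phrase is 0)
    for c in s:
        if phrase + c in dict_:
            phrase += c                 # keep extending the match S
        else:
            out.append((dict_.get(phrase, 0), c))   # output (id(S), c)
            dict_[phrase + c] = len(dict_) + 1      # add Sc to the dictionary
            phrase = ""
    if phrase:
        out.append((dict_[phrase], ""))  # flush a pending match
    return out

def lz78_decode(pairs):
    phrases, out = {0: ""}, []
    for i, c in pairs:
        p = phrases[i] + c               # phrase of id i, plus the extra char
        phrases[len(phrases)] = p        # same dictionary as the coder
        out.append(p)
    return "".join(out)
```

On the slides' input "aabaacabcabcb" it produces exactly (0,a)(1,b)(1,a)(0,c)(2,c)(5,b).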
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
LZW: Encoding Example
Output
Dict.
a a b a a c a b a b a c b
112 256=aa
a a b a a c a b a b a c b
112 257=ab
a a b a a c a b a b a c b
113
258=ba
a a b a a c a b a b a c b
256
259=aac
a a b a a c a b a b a c b
114
260=ca
a a b a a c a b a b a c b
257
261=aba
a a b a a c a b a b a c b
261
262=abac
a a b a a c a b a b a c b
114
263=cb
LZW: Decoding Example
Input: 112 112 113 256 114 257 261 261 114
Output grows: a | a a | a a b | a a b a a | a a b a a c | a a b a a c a b | ? …
Dict grows: 256 = aa, 257 = ab, 258 = ba, 259 = aac, 260 = ca, …
The code 261 is not yet in the decoder’s dictionary: the decoder is
one step behind the coder, and learns 261 = aba one step later.
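A Python sketch of LZW, including the special SSc case resolved in the decoder (an illustration; real codecs also bound the dictionary size):

```python
def lzw_encode(s):
    """Emit only the id of the longest match S; still add Sc to the dict."""
    if not s:
        return []
    dict_ = {chr(i): i for i in range(256)}   # 256 single-char entries
    out, phrase = [], ""
    for c in s:
        if phrase + c in dict_:
            phrase += c
        else:
            out.append(dict_[phrase])
            dict_[phrase + c] = len(dict_)    # new entry, next free id
            phrase = c
    out.append(dict_[phrase])
    return out

def lzw_decode(codes):
    dict_ = {i: chr(i) for i in range(256)}
    prev = dict_[codes[0]]
    out = [prev]
    for code in codes[1:]:
        # special case: code not yet known -> it must be prev + prev[0] (SSc)
        cur = dict_[code] if code in dict_ else prev + prev[0]
        dict_[len(dict_)] = prev + cur[0]     # the entry the coder just added
        out.append(cur)
        prev = cur
    return "".join(out)
```

The input "aaaa" exercises the SSc case: the decoder receives a code one step before that code enters its dictionary.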
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows
(Burrows and Wheeler, 1994)

F               L
#  mississipp   i
i  #mississip   p
i  ppi#missis   s
i  ssippi#mis   s
i  ssissippi#   m
m  ississippi   #
p  i#mississi   p
p  pi#mississ   i
s  ippi#missi   s
s  issippi#mi   s
s  sippi#miss   i
s  sissippi#m   i
A famous example
[figure: the same transform on a much longer text]
A useful tool: L → F mapping
[the same sorted BWT matrix as above, with columns
F = # i i i i m p p s s s s and L = i p s s m # p i s s i i]
How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...
Take two equal chars of L;
rotate their rows rightward by one:
they appear in F in the same relative order !!
The BWT is invertible
[the same sorted BWT matrix, with columns
F = # i i i i m p p s s s s and L = i p s s m # p i s s i i]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i ppi #
InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
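A runnable Python version of the inversion (an illustration, not the slides' pseudocode): LF is built by ranking equal characters, and the backward walk starts from the row beginning with '#', which sorts first.

```python
def bwt(T):
    """BWT by sorting all cyclic rotations; T must end with a unique '#'."""
    n = len(T)
    rows = sorted(T[i:] + T[:i] for i in range(n))
    return "".join(row[-1] for row in rows)

def inverse_bwt(L):
    n = len(L)
    F = sorted(L)
    # rank each char of L among its equal chars (same relative order as in F)
    count, rank = {}, []
    for c in L:
        count[c] = count.get(c, 0) + 1
        rank.append(count[c])
    first = {}                         # first row of F holding each char
    for i, c in enumerate(F):
        first.setdefault(c, i)
    LF = [first[L[r]] + rank[r] - 1 for r in range(n)]
    # walk backward through T starting from row 0 (the row starting with '#')
    chars, r = [], 0
    for _ in range(n):
        chars.append(L[r])
        r = LF[r]
    s = "".join(reversed(chars))       # a rotation of T starting with '#'
    return s[1:] + s[0]                # rotate '#' back to the end
```

Round-tripping mississippi# reproduces the slides' L = ipssm#pissii.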
How to compute the BWT ?
SA    sorted rotations (L = last char)
12    #mississippi → i
11    i#mississipp → p
 8    ippi#mississ → s
 5    issippi#miss → s
 2    ississippi#m → m
 1    mississippi# → #
10    pi#mississip → p
 9    ppi#mississi → i
 7    sippi#missis → s
 4    sissippi#mis → s
 6    ssippi#missi → i
 3    ssissippi#mi → i
We said that: L[i] precedes F[i] in T
L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
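Both steps can be sketched directly (a Python illustration): build SA by plainly sorting the suffixes, which is exactly the inefficiency the next slide points out, then read off L.

```python
def suffix_array(T):
    # plain suffix sorting: Theta(n^2 log n) worst case, as noted in the slides
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def bwt_from_sa(T, SA):
    # L[i] = T[SA[i]-1] (1-based positions); SA[i] = 1 wraps to the last char
    return "".join(T[s - 2] if s > 1 else T[-1] for s in SA)

T = "mississippi#"
SA = suffix_array(T)      # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
L = bwt_from_sa(T, SA)    # "ipssm#pissii"
```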
How to construct SA from T ?
SA
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#
Elegant but inefficient
Input: T = mississippi#
Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii
# at 16
Mtf = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code
RLE0 = 03141041403141410210
Alphabet
|S|+1
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages are available (Google, 7/2008)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node can go to any node via an
undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node can go to any node via a
directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
The largest artifact ever conceived by humans
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph
V = routers
E = communication links
The “cosine” graph (undirected, weighted)
V = static web pages
E = semantic distance between pages
Query-Log graph (bipartite, weighted)
V = queries and URLs
E = (q,u) if u is a result for q, and has been clicked by some
user who issued q
Social graph (undirected, unweighted)
V = users
E = (x,y) if x knows y (facebook, address book, email, ..)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1
The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001):
indegree follows a power law distribution
Pr[in-degree(u) = k] ∝ 1/k^α,  α ≈ 2.1
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
[figure: adjacency matrix plot, axes i and j]
21 million pages, 150 million links
URL-sorting
Berkeley
Stanford
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
For negative entries:
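The gap encoding of successor lists can be sketched as follows (a Python illustration; the mapping of the possibly-negative first gap to an integer code such as γ is omitted):

```python
def encode_gaps(x, succ):
    """S(x) = {s1-x, s2-s1-1, ..., sk-s(k-1)-1}; the first gap may be negative."""
    succ = sorted(succ)
    gaps = [succ[0] - x]              # first successor relative to the node x
    for a, b in zip(succ, succ[1:]):
        gaps.append(b - a - 1)        # remaining gaps, shifted to start at 0
    return gaps

def decode_gaps(x, gaps):
    succ = [x + gaps[0]]
    for g in gaps[1:]:
        succ.append(succ[-1] + g + 1)
    return succ
```

Locality makes the gaps small, hence cheap to store with a variable-length integer code.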
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.
Consecutivity
3
in extra-nodes
Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
[figure: a sender transmits data to a receiver over network links; the receiver already has some knowledge about the data]
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
How if the sender has never seen data at receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression [diff, zdelta, REBL, …]
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested
page and the ones available in cache
File synchronization [rsync, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
The LZ77-scheme provides an efficient, optimal solution:
fknown is the “previously encoded text”; compress the concatenation fknown·fnew starting from fnew
zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
[figure: Client ↔ slow link (delta-encoding) ↔ Proxy ↔ fast link ↔ Web; references and requests travel alongside the pages]
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
[figure: example graph over files 1, 2, 3, 5 plus a dummy node 0; edge weights (e.g. 20, 123, 220, 620, 2000) are zdelta sizes, dummy edges are gzip sizes; the min branching picks the cheapest reference for each file]
          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic
Improvement: what about
many-to-one compression?
(group of files)
Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files
Assign weights: estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, strictly n² time
          space    time
uncompr   260Mb    ---
tgz       12%      2 mins
THIS      8%       16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
[Figure: the Client holds f_old, the Server holds f_new; the client sends a request and the server replies with an update]
the client wants to update an outdated file
the server has the new file but does not know the old one
update without sending the entire f_new (exploit file similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
[Figure: the Client sends the block hashes of f_old; the Server replies with an encoded file describing f_new in terms of those blocks]
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
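A minimal sketch of the block-matching idea (block hashes only: the real rsync adds a rolling hash so matches can start at any offset rather than only at block boundaries, plus the 4-byte/MD5 two-level check and gzip for literals):

```python
# rsync-style sketch: the client hashes fixed-size blocks of f_old;
# the server describes f_new as block references + literal bytes.
import hashlib

def block_hashes(f_old: bytes, B: int):
    # client side: one hash per B-byte block
    return {hashlib.md5(f_old[i:i+B]).hexdigest(): i
            for i in range(0, len(f_old), B)}

def encode(f_new: bytes, hashes: dict, B: int):
    # server side: reuse client blocks where the hash matches
    ops, i = [], 0
    while i < len(f_new):
        h = hashlib.md5(f_new[i:i+B]).hexdigest()
        if h in hashes:
            ops.append(('B', hashes[h]))     # reference to a block of f_old
            i += B
        else:
            ops.append(('L', f_new[i:i+1]))  # send one literal byte
            i += 1
    return ops

def decode(f_old: bytes, ops, B: int) -> bytes:
    out = bytearray()
    for tag, v in ops:
        out += f_old[v:v+B] if tag == 'B' else v
    return bytes(out)
```

Because matching here happens only at byte-by-byte restarts after a miss, this sketch also illustrates why the granularity of changes can disrupt the use of blocks.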
Rsync: some experiments
          gcc size    emacs size
total     27288       27326
gzip      7563        8577
zdelta    227         1431
rsync     964         4452
Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
The server sends the hashes (unlike the client in rsync); the client checks them
The server deploys the common f_ref to compress the new f_tar (rsync compresses just the literals)
A multi-round protocol
k blocks of n/k elems
Log n/k levels
If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (ie. T[i,N])
[Figure: P aligned with the suffix T[i,N]]
Example: P = si, T = mississippi ⇒ occurrences at positions 4, 7
SUF(T) = Sorted set of suffixes of T
Reduction: from substring search to prefix search over SUF(T)
The Suffix Tree
[Figure: suffix tree of T# = mississippi#, with edge labels such as “ssi”, “ppi#”, “mississippi#” and 12 leaves storing the starting positions 1..12 of the suffixes]
T# = mississippi#
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
T = mississippi#

SA    SUF(T)
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi#
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#

Storing SUF(T) explicitly takes Θ(N²) space; the suffix array SA stores just the suffix pointers. (P = si prefixes the contiguous entries 7, 4.)
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step (one to SA, one to T).
Example (P = si, T = mississippi#): compare P with the suffix at the middle SA entry; if P is larger, recurse on the right half; if P is smaller, recurse on the left half.
Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
overall, O(p log₂ N) time
• Improvable to O(p + log₂ N) [Manber-Myers, ’90] and to O(p + log₂ |S|) [Cole et al., ’06]
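The indirect binary search can be sketched as follows (the O(N² log N) sort-based construction is a toy choice for small texts; real indexes use faster construction algorithms):

```python
# Suffix array: sorted starting positions of all suffixes, plus the
# indirect binary search of the slides: O(p) chars compared per step.

def suffix_array(T: str):
    return sorted(range(len(T)), key=lambda i: T[i:])

def search(T: str, SA, P: str):
    """Return [lo, hi): the SA interval of suffixes having P as a prefix."""
    n, p = len(SA), len(P)
    lo, hi = 0, n                      # lower bound: first suffix >= P
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + p] < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    lo, hi = first, n                  # upper bound: first suffix > P...
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + p] <= P:
            lo = mid + 1
        else:
            hi = mid
    return first, lo
```

On T = mississippi# and P = si, the interval contains exactly the two suffixes starting at (0-based) positions 3 and 6, i.e. positions 4 and 7 in the slides' 1-based numbering.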
Locating the occurrences
The occurrences of P are contiguous in SA, so two binary searches (delimiting where P·# and P·$ would fall) bracket them.
Example: P = si in T = mississippi# is bracketed by the SA entries 7 (sippi#) and 4 (sissippi#): occ = 2, at positions 4 and 7.
Suffix Array search
• O(p + log₂ N + occ) time
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree
[Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
T = mississippi#

 SA:  12 11  8  5  2  1 10  9  7  4  6  3
Lcp:      0  1  1  4  0  0  1  0  2  1  3

e.g. the adjacent suffixes issippi# and ississippi# share the prefix “issi”, so their Lcp entry is 4.
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for an Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
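The three Lcp-based queries can be sketched directly (the quadratic Lcp construction is a toy stand-in; Kasai's algorithm would give O(N) time):

```python
# Lcp[k] = longest common prefix of the suffixes SA[k] and SA[k+1].

def lcp_array(T, SA):
    def lcp(i, j):
        l = 0
        while i + l < len(T) and j + l < len(T) and T[i+l] == T[j+l]:
            l += 1
        return l
    return [lcp(SA[k], SA[k+1]) for k in range(len(SA) - 1)]

def has_repeat_of_length(Lcp, L):
    # repeated substring of length >= L <=> some Lcp entry >= L
    return any(v >= L for v in Lcp)

def has_frequent_substring(Lcp, L, C):
    # substring of length >= L occurring >= C times <=> a window of
    # C-1 consecutive Lcp entries, all >= L
    w = C - 1
    return any(all(v >= L for v in Lcp[i:i+w])
               for i in range(len(Lcp) - w + 1))
```

On T = mississippi#, the repeated substring “issi” of length 4 shows up as the Lcp entry equal to 4.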
Paradigm shift...
Web 2.0 is about the many
Big DATA vs. Big PC?
We have three types of algorithms:
T1(n) = n, T2(n) = n2, T3(n) = 2n
... and assume that 1 step = 1 time unit
How many input data n each algorithm may process
within t time units?
n1 = t,
n2 = √t,
n3 = log2 t
What about a k-times faster processor?
...or, what is n, when the time units are k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
A new scenario
Data are more available than ever before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Step cost is Ω(1) time
Not just MIN #steps…
[Figure: memory hierarchy — CPU with registers, then:]
Cache (L1/L2):  few Mbs,   some nanosecs,    few words fetched
RAM:            few Gbs,   tens of nanosecs, some words fetched
HD:             few Tbs,   few millisecs,    B = 32K page
net:            many Tbs,  even secs,        packets
You should be “??-aware programmers”
I/O-conscious Algorithms
[Figure: disk internals — tracks on the magnetic surface, read/write head and arm]
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O [10⁵–10⁶ steps (Hennessy-Patterson)]
If N = (1+f)M, then the disk-average cost per step is:
C * p * f/(1+f)
This is at least 10⁴ * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)
Space-conscious Algorithms
Compressed data structures: fewer I/Os for search and access
Streaming Algorithms
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock and its daily performance over
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running times for input size n:
      4K   8K   16K   32K    128K   256K   512K   1M
n³    22s  3m   26m   3.5h   28h    --     --     --
n²    0    0    0     1s     26s    106s   7m     28m
An optimal solution
We assume every subsum ≠ 0
[Figure: A split into stretches with negative/positive sums; the Optimum lies within one positive stretch]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
  sum=0; max = -1;
  For i=1,...,n do
    If (sum + A[i] ≤ 0) sum=0;
    else { sum += A[i]; max = MAX{max, sum}; }
Note:
• Sum < 0 right before OPT starts;
• Sum > 0 within OPT
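The scan above (a variant of Kadane's algorithm) can be sketched with the window tracked as well; as in the slide, it assumes some subarray has positive sum (otherwise it returns the initial sentinel):

```python
# One linear scan: reset when the running sum would drop to <= 0,
# record the best sum (and its window) seen so far.

def max_subarray(A):
    best, cur, start = float('-inf'), 0, 0
    best_range = None
    for i, x in enumerate(A):
        if cur + x <= 0:
            cur = 0
            start = i + 1          # a new candidate window starts after i
        else:
            cur += x
            if cur > best:
                best, best_range = cur, (start, i)
    return best, best_range
```

On the slide's array, the best window is 6 1 -2 4 3 (indices 2..6, 0-based), of sum 12.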
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions ⇒ data get distributed arbitrarily !!!
[Figure: B-tree internal nodes, leaves holding “tuple pointers”, and the tuples themselves on disk]
What about listing tuples in order ?
Possibly 10⁹ random I/Os = 10⁹ × 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;          // Divide
03   Merge-Sort(A,i,m);    // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)        // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10⁹ tuples ⇒ a few Gbs
Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log₂ n) random I/Os
[5ms] × n log₂ n ≈ 1.5 years
In practice, it is faster because of caching; each merge level makes 2 sequential passes (R/W)...
Merge-Sort Recursion Tree
There are log₂ N levels of runs.
If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help
[Figure: merge-sort recursion tree on a sample array, runs of size up to M fitting in memory. Caption question: “How do we deploy the disk/mem features?”]
N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
Pass 1: Produce (N/M) sorted runs.
Pass i: merge X = M/B runs ⇒ log_{M/B} (N/M) passes
[Figure: X input buffers (INPUT 1 … INPUT X) plus one OUTPUT buffer, each of B items, in main memory; runs are streamed from disk and the merged output is written back to disk]
Multiway Merging
[Figure: each run i = 1..X (X = M/B) has a buffer Bf_i with a pointer p_i on its current page; repeatedly output min(Bf1[p1], Bf2[p2], …, BfX[pX]) into the output buffer Bf_o; fetch the next page of run i when p_i = B; flush Bf_o to the merged run on disk whenever it is full, until EOF]
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os
In practice
M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
⇒ one multiway merge: 2 passes (R/W) = a few mins
Tuning depends
on disk features
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
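The two-pass scheme can be sketched in miniature, with in-memory lists standing in for disk runs and heapq.merge playing the role of the X-way buffer merge (a sketch of the structure only: real external sorting streams pages of B items to and from disk):

```python
# Pass 1: sort memory-sized chunks into runs.
# Pass 2: a single multiway merge of all the runs.
import heapq

def external_sort(items, M):
    runs = [sorted(items[i:i+M]) for i in range(0, len(items), M)]
    return list(heapq.merge(*runs))    # X-way merge, X = #runs
```

With M/B large enough, #runs ≤ M/B and this single merge already completes the sort, matching the “2 passes” accounting above.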
Can compression help?
Goal: enlarge M and reduce N
#passes = O(logM/B N/M)
Cost of a pass = O(N/B)
Part of Vitter’s paper addresses related issues:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how far we can go
Toy problem #3: Top-freq elements
Goal: Find the top-frequent item over a stream of N items (with S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space (i.e. assuming the mode occurs > N/2 times).
A=b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables <X,C>
For each item s of the stream:
  if (X == s) then C++
  else { C--; if (C == 0) { X = s; C = 1; } }
Return X;
Proof sketch
If X ≠ y at the end, then every one of y’s occurrences has been
cancelled by a “negative” mate (an occurrence of a different item),
so the number of such mates is ≥ #occ(y). As a result,
N ≥ 2 * #occ(y) > N, a contradiction.
Problems arise if the mode occurs ≤ N/2 times: the returned X is then arbitrary.
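The <X,C> scan above (the Boyer–Moore majority vote) in a few lines; as argued in the proof, the answer is guaranteed only when some item really occurs more than N/2 times:

```python
# One pass, O(1) space: keep a candidate X and a counter C.

def majority(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1      # adopt a new candidate
        elif X == s:
            C += 1
        else:
            C -= 1           # s cancels one occurrence of X
    return X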
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 * 10⁹ chars ⇒ size = 6Gb
n = 10⁶ documents
TotT = 10⁹ term occurrences (avg term length is 6 chars)
t = 5 * 10⁵ distinct terms
What kind of data structure should we build to support
word-based searches ?
Solution 1: Term-Doc matrix
t = 500K terms (rows) × n = 1 million docs (columns); entry is 1 if the play contains the word, 0 otherwise:

            Antony&Cleopatra  JuliusCaesar  Tempest  Hamlet  Othello  Macbeth
Antony              1              1           0        0       0        1
Brutus              1              1           0        1       0        0
Caesar              1              1           0        1       1        1
Calpurnia           0              1           0        0       0        0
Cleopatra           1              0           0        0       0        0
mercy               1              0           1        1       1        1
worser              1              0           1        1       1        0

Space is 500Gb !
Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16
1. Typically use about 12 bytes per posting
2. We have 10⁹ total terms ⇒ at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but it is still >10 times the compressed text !!!!
We can still do better: i.e. 30÷50% of the original text.
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string
is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO: they are 2ⁿ, but we have fewer compressed messages, since
  Σ_{i=1,…,n−1} 2^i = 2ⁿ − 2 < 2ⁿ
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
  i(s) = log₂ (1/p(s)) = − log₂ p(s)
Lower probability ⇒ higher information
Entropy is the weighted average of i(s):
  H(S) = Σ_{s∈S} p(s) log₂ (1/p(s))  bits
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be
uniquely decomposed into their codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:
[Figure: binary trie with left edges labeled 0 and right edges labeled 1; leaves a = 0, b = 100, c = 101, d = 11]
Average Length
For a code C with codeword length L[s], the
average length is defined as
  La(C) = Σ_{s∈S} p(s) L[s]
We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…
Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}
then pi < pj ⇒ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have
  H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that
  La(C) ≤ H(S) + 1
(the Shannon code takes ⌈log₂ 1/p⌉ bits per symbol)
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
[Figure: Huffman tree — merge a(.1)+b(.2) → (.3); merge (.3)+c(.2) → (.5); merge (.5)+d(.5) → (1); left edges labeled 0, right edges 1]
a=000, b=001, c=01, d=1
There are 2^(n−1) “equivalent” Huffman trees
What about ties (and thus, tree depth) ?
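The construction can be sketched with a heap, reproducing the running example; note that ties are broken here by insertion order, so other, equally optimal trees (with different depths) are possible, which is exactly the point of the question above:

```python
# Huffman construction: repeatedly merge the two least-probable
# subtrees; codes are built by prepending 0/1 on the way up.
import heapq

def huffman_codes(probs):
    # heap entries: (probability, tie-breaker, {symbol: partial codeword})
    heap = [(p, i, {s: ''}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c1.items()}
        merged.update({s: '1' + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, tick, merged))
        tick += 1
    return heap[0][2]
```

On the running example the codeword lengths come out as 3, 3, 2, 1 for a, b, c, d, matching the tree in the slide (average length 1.8 bits).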
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Example (on the tree above): abc… → 000 001 01… = 00000101…; 101001… → d c b…
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store, for any level L:
  firstcode[L]  (= 00…0 for the deepest level)
  Symbol[L,i], for each i in level L
This is ≤ h² + |S| log |S| bits
Encoding: levels 1, 2, 3, 4, 5
Decoding example:
  firstcode[1]=2, firstcode[2]=1, firstcode[3]=1, firstcode[4]=2, firstcode[5]=0
  T = ...00010...
Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
  −log₂(.999) ≈ .00144 bits
If we were to send 1000 such symbols we
might hope to use 1000 × .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
  1 extra bit per macro-symbol = 1/k extra bits per symbol
  Larger model to be transmitted
Shannon took infinite sequences, and k → ∞ !!
In practice, we have:
  The model takes |S|^k (k * log |S|) + h² bits (where h might be |S|)
  It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
[Figure: word-based Huffman vs. tagged Huffman for T = “bzip or not bzip” — codewords over 7-bit symbols {a, b, g}, with the tagging bit marking the first byte of each codeword; C(T) concatenates the byte-aligned codewords of “bzip”, space, “or”, “not”]
CGrep and other ideas...
P = bzip = 1a 0b
[Figure: GREP over the compressed text — the pattern’s codeword sequence 1a 0b is searched directly in C(T), T = “bzip or not bzip”; candidate alignments are checked against the tag bits (yes/no)]
Speed ≈ Compression ratio
You can find this under my Software projects.
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary: a, bzip, not, or, (space)
P = bzip = 1a 0b
S = “bzip or not bzip”
[Figure: P’s codeword sequence is searched directly in C(S), checking the tag bits (yes/no)]
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: pattern P sliding over text T; the occurrences of P in T are highlighted]
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}ⁿ).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m
  H(s) = Σ_{i=1,…,m} 2^(m−i) · s[i]
P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s=s’ if and only if H(s)=H(s’)
Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr−1):
  H(Tr) = 2 · H(Tr−1) − 2^m · T[r−1] + T[r+m−1]
T = 10110101, m = 4
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6 = H(0110)
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P = 101111
q = 7
H(P) = 47
Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally, one bit at a time, via h ← (2h + next bit) (mod q):
  (1·2 + 0) mod 7 = 2
  (2·2 + 1) mod 7 = 5
  (5·2 + 1) mod 7 = 4
  (4·2 + 1) mod 7 = 2
  (2·2 + 1) mod 7 = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr−1), using
  2^m (mod q) = 2 · (2^(m−1) (mod q)) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
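The scan can be sketched as follows for a binary text, with the verification step included (the deterministic variant); as a simplification, q is fixed here rather than drawn at random from the primes ≤ I:

```python
# Karp-Rabin over a binary string: roll the fingerprint in O(1) per shift,
# verify candidate matches to rule out false matches.

def rabin_karp(T: str, P: str, q: int = 2**31 - 1):
    n, m = len(T), len(P)
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):                     # Hq(P) and Hq(T_1)
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    top = pow(2, m - 1, q)                 # 2^(m-1) mod q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r+m] == P:     # verify: no false matches reported
            occ.append(r)
        if r + m < n:                      # roll: drop T[r], add T[r+m]
            ht = ((ht - int(T[r]) * top) * 2 + int(T[r+m])) % q
    return occ
```

On the slides' example T = 10110101, P = 0101, the single occurrence starts at 0-based position 4 (position 5 in the slides' numbering).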
Problem 1: Solution
Dictionary: a, bzip, not, or, (space)
P = bzip = 1a 0b
S = “bzip or not bzip”
[Figure: as before, P’s codewords are matched directly in C(S); both occurrences of bzip are reported (yes)]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for
        c a l i f o r n i a
  f     0 0 0 0 1 0 0 0 0 0
  o     0 0 0 0 0 1 0 0 0 0
  r     0 0 0 0 0 0 1 0 0 0

M(3,7) = 1 reveals the occurrence of P ending at position 7.
How does M solve the exact match problem?
How to construct M
We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one
Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:
And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.
  BitShift( (0,1,1,0,1)ᵀ ) = (1,0,1,1,0)ᵀ
Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit the bit-parallelism to
compute the j-th column of M from the (j−1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x)[i] = 1 for the
positions i in P where character x appears.
Example: P = abaac
  U(a) = (1,0,1,1,0)ᵀ   U(b) = (0,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ
How to construct M
Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained by
  M(j) = BitShift(M(j−1)) & U(T[j])
For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T
ending at character j−1  ⇔  M(i−1,j−1) = 1
(2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1
BitShift moves bit M(i−1,j−1) into the i-th position;
ANDing with the i-th bit of U(T[j]) establishes whether both hold
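The construction above fits in a few lines when each column is kept as a machine word (here a Python integer, with bit i−1 of the word playing the role of row i):

```python
# Shift-And: M(j) = BitShift(M(j-1)) & U(T[j]); an occurrence ends at j
# when the last bit (row m) of the column is set.

def shift_and(T: str, P: str):
    m = len(P)
    U = {}                                  # U[x]: bit i set iff P[i+1] == x
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    occ, Mj = [], 0
    last = 1 << (m - 1)
    for j, c in enumerate(T):
        # BitShift shifts the column down by one and sets the first bit to 1
        Mj = ((Mj << 1) | 1) & U.get(c, 0)
        if Mj & last:
            occ.append(j - m + 1)           # 0-based start of an occurrence
    return occ
```

On the slides' example T = xabxabaaca, P = abaac, the single occurrence starts at 0-based position 4 (position 5 in the slides' 1-based numbering).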
An example: T = xabxabaaca, P = abaac (m = 5, n = 10)
  j=1: M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵀ & U(x) = (0,0,0,0,0)ᵀ
  j=2: M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ
  j=3: M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ
  …
  j=9: M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ
M(5,9) = 1: an occurrence of P ends at position 9.
Shift-And method: Complexity
If m<=w, any column and vector U() fit in a
memory word.
Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when the pattern length is close
to the word size, as is very often the case in practice.
Recall that w = 64 bits in modern architectures.
Some simple extensions
We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
  U(a) = (1,0,1,1,0)ᵀ   U(b) = (1,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ
What about ‘?’, ‘[^…]’ (not)?
Problem 1: An other solution
Dictionary: a, bzip, not, or, (space)
P = bzip = 1a 0b
S = “bzip or not bzip”
[Figure: the same compressed search, now run with the Shift-And method over C(S); tag bits resolve the yes/no checks]
Speed ≈ Compression ratio
Problem 2
Dictionary
Given a pattern P find
all the occurrences in S
of all terms containing
P as substring
Dictionary: a, bzip, not, or, (space)
P = o
S = “bzip or not bzip”
Matching terms and their codewords: not = 1g 0g 0a, or = 1g 0a 0b
[Figure: each matching term’s codeword sequence is searched in C(S), checking tag bits (yes/no)]
Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: patterns P1 and P2 sliding over text T; the occurrences of both are sought]
Naïve solution
Use an (optimal) Exact Matching Algorithm searching each
pattern in P
Complexity: O(nl + m) time — not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extention of Shift-And
S is the concatenation of the patterns in P.
R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.
Use a variant of the Shift-And method searching for S:
For any symbol c, U’(c) = U(c) AND R, so that
U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a
pattern
For any step j:
compute M(j),
then M(j) = M(j) OR U’(T[j]). Why?
To set to 1 the first bit of each pattern that starts with
T[j]
Check if there are occurrences ending in j. How?
Test the bits of M(j) at the last position of each pattern.
Problem 3
Dictionary
Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches
a
bzip
not
or
P = bot, k = 2
S = “bzip or not bzip”
[Figure: C(S) is scanned for codeword sequences matching any dictionary term within 2 mismatches]
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa
aatatccacaa
atcgaa
Agrep
Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:
Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters up
through character j of T.
What is M0?
How does Mk solve the k-mismatch problem?
Computing Mk
We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
Computing Ml: case 1
The first i−1 characters of P match a substring of T ending
at j−1, with at most l mismatches, and the next pair of
characters in P and T are equal:
  BitShift(Mˡ(j−1)) & U(T[j])
Computing Ml: case 2
The first i−1 characters of P match a substring of T ending
at j−1, with at most l−1 mismatches:
  BitShift(Mˡ⁻¹(j−1))
Computing Ml
We compute Mˡ for all l = 0,…,k.
For each j compute M(j), M¹(j), …, Mᵏ(j)
For all l, initialize Mˡ(0) to the zero vector.
Combining the two cases:
  Mˡ(j) = [BitShift(Mˡ(j−1)) & U(T[j])]  OR  BitShift(Mˡ⁻¹(j−1))
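The recurrence above, with columns as integers as before (for l = 0 the second term is absent, since there is no M⁻¹):

```python
# k-mismatch Shift-And: M[l] holds column M^l(j); an occurrence with
# <= k mismatches ends at j when the last bit of M[k] is set.

def shift_and_mismatches(T: str, P: str, k: int):
    m = len(P)
    U = {}
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    last, mask = 1 << (m - 1), (1 << m) - 1
    M = [0] * (k + 1)
    occ = []
    for j, c in enumerate(T):
        prev = 0                            # M^(l-1)(j-1); zero for l = 0
        for l in range(k + 1):
            cur = ((M[l] << 1) | 1) & U.get(c, 0)
            if l > 0:
                cur |= (prev << 1) | 1      # case 2: spend one mismatch here
            prev, M[l] = M[l], cur & mask
        if M[k] & last:
            occ.append(j - m + 1)           # 0-based start of an occurrence
    return occ
```

On the example of the next slide (T = xabxabaaca, P = abaad, k = 1), the single reported occurrence starts at 0-based position 4, matching M1(5,9) = 1.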
Example M1
T = xabxabaaca, P = abaad

       1 2 3 4 5 6 7 8 9 10
M1 =   1 1 1 1 1 1 1 1 1 1
       0 0 1 0 0 1 0 1 1 0
       0 0 0 1 0 0 1 0 0 1
       0 0 0 0 1 0 0 1 0 0
       0 0 0 0 0 0 0 0 1 0

M0 =   0 1 0 0 1 0 1 1 0 1
       0 0 1 0 0 1 0 0 0 0
       0 0 0 0 0 0 1 0 0 0
       0 0 0 0 0 0 0 1 0 0
       0 0 0 0 0 0 0 0 0 0

M1(5,9) = 1: P occurs at position 5 of T with at most 1 mismatch.
How much do we pay?
The running time is O(kn(1 + m/w))
Again, the method is practically efficient for
small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.
Problem 3: Solution
Dictionary: a, bzip, not, or, (space)
Given a pattern P find all the occurrences in S of all terms containing
P as substring, allowing k mismatches
P = bot, k = 2
S = “bzip or not bzip”
[Figure: C(S) is scanned with the Shift-And-with-errors automaton; e.g. not = 1g 0g 0a matches (yes)]
Agrep: more sophisticated operations
The Shift-And method can solve other ops
The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p into a different one
Example: d(ananas,banane) = 3
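The classic dynamic program for d(p,s) over these three operations, checking the slide's example:

```python
# Edit distance: D[i][j] = d(p[0:i], s[0:j]), filled row by row.

def edit_distance(p: str, s: str) -> int:
    m, n = len(p), len(s)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                          # i deletions
    for j in range(n + 1):
        D[0][j] = j                          # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i-1][j] + 1,                     # deletion
                          D[i][j-1] + 1,                     # insertion
                          D[i-1][j-1] + (p[i-1] != s[j-1]))  # substitution
    return D[m][n]
```

Here d(ananas, banane) = 3: insert b, delete s, substitute the final a with e.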
Search by regular expressions
Example: (a|b)?(abc|a)
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
γ-code for integer encoding
  γ(x) = 0^(ℓ−1) · (x in binary), where ℓ = ⌊log₂ x⌋ + 1, for x > 0
  e.g., 9 is represented as <000, 1001>
γ-code for x takes 2⌊log₂ x⌋ + 1 bits
(i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding…
It is a prefix-free encoding…
Exercise: given the following sequence of γ-coded
integers, reconstruct the original sequence:
  0001000 00110 011 00000111011 00111
  ⇒ 8, 6, 3, 59, 7
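The coder and its decoder in a few lines (bits are kept as a '0'/'1' string for clarity), decoding the exercise above:

```python
# gamma-code: (length-1) zeroes, then x in binary.

def gamma_encode(x: int) -> str:
    b = bin(x)[2:]
    return '0' * (len(b) - 1) + b

def gamma_decode(bits: str):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == '0':                 # count the leading zeroes
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2)) # next z+1 bits are the value
        i += z + 1
    return out
```

Being prefix-free, the concatenated codes decode unambiguously, which is what the exercise exploits.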
Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code γ(i).
Recall that |γ(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact: 1 ≥ Σ_{j=1,…,i} pj ≥ i * pi ⇒ i ≤ 1/pi
Hence the cost of the encoding is:
  Σ_{i=1,…,|S|} pi · |γ(i)| ≤ Σ_{i=1,…,|S|} pi · [2 * log(1/pi) + 1] = 2 * H0(X) + 1
Not much worse than Huffman,
and improvable to H0(X) + 2 + …
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman
End-tagged dense code
The rank r is mapped to r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix-code
Better compression: it uses all 7-bits configurations
(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is:
Previously we used: s = c = 128
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s·c with 2 bytes, s·c² with 3 bytes, ...
An example
5000 distinct words
ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
(230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2
bytes, hence more words on 1 byte, and thus better if skewed...
Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 − s.
Brute-force approach
Binary search:
On real distributions, there seems to be a unique minimum
Ks = max codeword length
Fsk = cumulative probability of the symbols whose |cw| ≤ k
Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than
byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
Exploit temporal locality, and it is dynamic
X = 1^n 2^n 3^n … n^n ⟹ Huff = O(n² log n), MTF = O(n log n) + n²
Not much worse than Huffman
...but it may be far better
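The two steps above translate directly into code; a minimal sketch (Python; I use 0-based positions, the slides do not fix a convention):

```python
def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)           # 1) output the position of s in L
        out.append(i)
        L.insert(0, L.pop(i))    # 2) move s to the front of L
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        s = L[i]
        out.append(s)
        L.insert(0, L.pop(i))
    return "".join(out)

print(mtf_encode("cabba", "abc"))  # [2, 1, 2, 0, 1]
```

Note the "memory": the second b costs 0 because the list was just reordered.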
MTF: how good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log i + 1
Put S at the front of the list, and consider the cost of encoding the
gaps between consecutive occurrences of each symbol (p1x < p2x < ...
are the positions of symbol x in the text):
O(|S| log |S|) + Σx∈S Σi=1,...,nx |γ(pix − p(i-1)x)|
By Jensen’s:
≤ O(|S| log |S|) + Σx∈S nx·[2·log(N/nx) + 1]
= O(|S| log |S|) + N·[2·H0(X) + 1]
Hence La[mtf] ≤ 2·H0(X) + O(1) bits per symbol
MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:
Search tree: leaves contain the symbols, ordered as in the MTF-list;
nodes contain the size of their descending subtree
Hash Table: the key is a symbol; the data is a pointer to the
corresponding tree leaf
Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings just numbers and one bit
Properties:
Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n
There is a memory
Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
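A minimal RLE sketch (Python; names mine), reproducing the example above:

```python
def rle_encode(s):
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)   # extend the current run
        else:
            out.append((ch, 1))              # start a new run
    return out

print(rle_encode("abbbaacccca"))
# [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]
```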
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).
e.g. p(a) = .2, p(b) = .5, p(c) = .3, so the unit interval [0,1) is
split as a = [0,.2), b = [.2,.7), c = [.7,1)
f(a) = .0, f(b) = .2, f(c) = .7, where f(i) = Σj=1,...,i-1 p(j)
The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start with [0,1):
b ⟹ [.2,.7)   (length .5)
a ⟹ [.2,.3)   (the first .2 fraction of it)
c ⟹ [.27,.3)  (the last .3 fraction of that)
The final sequence interval is [.27,.3)
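The interval narrowing can be sketched directly from the example's distribution (Python; floats are fine at this toy scale — real coders use the integer version discussed later):

```python
p = {"a": 0.2, "b": 0.5, "c": 0.3}
f = {"a": 0.0, "b": 0.2, "c": 0.7}   # cumulative probabilities

def sequence_interval(msg):
    l, s = 0.0, 1.0                  # l0 = 0, s0 = 1
    for c in msg:
        l = l + s * f[c]             # li = l(i-1) + s(i-1) * f(ci)
        s = s * p[c]                 # si = s(i-1) * p(ci)
    return l, s

l, s = sequence_interval("bac")
print(l, l + s)                      # ≈ 0.27 and 0.30
```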
Arithmetic Coding
To code a sequence of symbols c with probabilities
p[c] use the following:
l0 = 0,  li = l(i-1) + s(i-1)·f(ci)
s0 = 1,  si = s(i-1)·p(ci)
f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is  sn = Πi=1,...,n p(ci)
The interval for a message sequence will be called the
sequence interval
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
.49 ∈ [.2,.7)  ⟹  b;  rescale: (.49 − .2)/.5 = .58
.58 ∈ [.2,.7)  ⟹  b;  rescale: (.58 − .2)/.5 = .76
.76 ∈ [.7,1)   ⟹  c
The message is bbc.
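The decoding loop mirrors encoding: locate the symbol interval containing the number, output the symbol, rescale. A sketch under the same toy distribution:

```python
p = {"a": 0.2, "b": 0.5, "c": 0.3}
f = {"a": 0.0, "b": 0.2, "c": 0.7}

def arith_decode(x, n):
    out = []
    for _ in range(n):
        # the symbol whose interval [f(c), f(c)+p(c)) contains x
        c = max((ch for ch in f if f[ch] <= x), key=lambda ch: f[ch])
        out.append(c)
        x = (x - f[c]) / p[c]        # rescale x back into [0,1)
    return "".join(out)

print(arith_decode(0.49, 3))         # bbc
```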
Representing a real number
Binary fractional representation:
.75 = .11    1/3 = .010101…    11/16 = .1011
Algorithm (for x ∈ [0,1)):
1. x = 2·x
2. If x < 1 output 0
3. else x = x − 1; output 1
So how about just using the shortest binary
fractional representation in the sequence
interval?
e.g. [0,.33) = .01    [.33,.66) = .1    [.66,1) = .11
Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
code    min     max     interval
.11     .110    .111    [.75, 1.0)
.101    .1010   .1011   [.625, .75)
We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).
Sequence Interval
.79
.75
Code Interval (.101)
.61
.625
Can use L + s/2 truncated to 1 + log (1/s) bits
Bound on Arithmetic length
Note that –log s+1 = log (2/s)
Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉
≤ 2 + log Πi (1/pi)
= 2 + Σi=1,...,n log (1/pi)
= 2 + Σk=1,...,|S| n·pk·log (1/pk)
= 2 + n·H0 bits
nH0 + 0.02 n bits in practice
because of rounding
Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:
Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2
All other cases,
just continue...
If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2
You find this at
Arithmetic ToolBox
As a state machine
The Arithmetic ToolBox (ATB) maps a state (L,s) and a symbol c, drawn
from a distribution (p1,...,pS), to the new state (L',s') with
L' = L + s·f(c) and s' = s·p(c).
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
The same ATB is now driven by the model: on symbol s (= c or esc) the
state (L,s) is updated using p[ s | context ].
Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)
PPM: Example Contexts
String = ACCBACCACBA B,  k = 2
Context ∅:  A = 4   B = 2   C = 5   $ = 3
Context A:  C = 3   $ = 1
Context B:  A = 2   $ = 1
Context C:  A = 1   B = 2   C = 2   $ = 3
Context AC: B = 1   C = 2   $ = 2
Context BA: C = 1   $ = 1
Context CA: C = 1   $ = 1
Context CB: A = 2   $ = 1
Context CC: A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) as n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)
<2,3,c>
Algorithm’s step:
Output ⟨d, len, c⟩ where:
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match
Advance by len + 1
A buffer “window” has fixed length and moves
Example: LZ77 with window
a a c a a c a b c a b a a a c
(0,0,a)
a a c a a c a b c a b a a a c
(1,1,c)
a a c a a c a b c a b a a a c
(3,4,b)
a a c a a c a b c a b a a a c
(3,3,a)
a a c a a c a b c a b a a a c
(1,2,c)
Window size = 6
Longest match
within W
Next character
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if len > d? (the copy overlaps the text still being written)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
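The copy loop works precisely because it proceeds left to right; a runnable sketch of the decoder (Python; names mine):

```python
def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):        # left-to-right copy: safe when length > d
            out.append(out[start + i])
        out.append(c)
    return "".join(out)

# the overlap case above: seen = abcd, next codeword (2,9,e)
print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (0, 0, "c"), (0, 0, "d"), (2, 9, "e")]))
# abcdcdcdcdcdce
```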
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
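Both directions, including the special case, fit in a short sketch (Python; single-char codes start at 112 only to match the slides' running example, phrases start at 256):

```python
def lzw_encode(text, first_code=112):
    d = {ch: first_code + i for i, ch in enumerate(sorted(set(text)))}
    nxt, out, w = 256, [], text[0]
    for c in text[1:]:
        if w + c in d:
            w += c
        else:
            out.append(d[w])
            d[w + c] = nxt             # add Sc without transmitting c
            nxt += 1
            w = c
    out.append(d[w])                   # flush the pending phrase
    return out

def lzw_decode(codes, chars="abc", first_code=112):
    d = {first_code + i: ch for i, ch in enumerate(chars)}
    nxt, w = 256, d[codes[0]]
    out = [w]
    for k in codes[1:]:
        entry = d[k] if k in d else w + w[0]   # the special not-yet-known case
        out.append(entry)
        d[nxt] = w + entry[0]                  # decoder is one step behind
        nxt += 1
        w = entry
    return "".join(out)

codes = lzw_encode("aabaacababacb")
print(codes)   # [112, 112, 113, 256, 114, 257, 261, 114, 113]
```

On the example below, code 261 is emitted before the decoder knows it; the decoder resolves it as w + w[0] = aba, one step later, exactly as described.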
LZW: Encoding Example
Output
Dict.
a a b a a c a b a b a c b
112 256=aa
a a b a a c a b a b a c b
112 257=ab
a a b a a c a b a b a c b
113
258=ba
a a b a a c a b a b a c b
256
259=aac
a a b a a c a b a b a c b
114
260=ca
a a b a a c a b a b a c b
257
261=aba
a a b a a c a b a b a c b
261
262=abac
a a b a a c a b a b a c b
114
263=cb
LZW: Decoding Example
Input: 112 112 113 256 114 257 261 114
112 ⟹ a
112 ⟹ a      (add 256 = aa)
113 ⟹ b      (add 257 = ab)
256 ⟹ aa     (add 258 = ba)
114 ⟹ c      (add 259 = aac)
257 ⟹ ab     (add 260 = ca)
261 ⟹ ?      261 is not yet in the dictionary: one step later
             we learn that 261 = aba, so output aba
114 ⟹ c      (add 262 = abac)
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows
F                L
# mississipp    i
i #mississip    p
i ppi#missis    s
i ssippi#mis    s
i ssissippi#    m
m ississippi    #
p i#mississi    p
p pi#mississ    i
s ippi#missi    s
s issippi#mi    s
s sippi#miss    i
s sissippi#m    i
(Burrows-Wheeler, 1994)
A famous example
Much
longer...
A useful tool: the L → F mapping
(the same matrix of sorted rotations as before: F = # i i i i m p p s s s s,
L = i p s s m # p i s s i i; the middle of each row is unknown to the decoder)
How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...
Take two equal L’s chars
Rotate rightward their rows
Same relative order !!
The BWT is invertible
(again F = # i i i i m p p s s s s and L = i p s s m # p i s s i i;
the rest of the matrix is unknown)
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i ppi #
InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
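A compact sketch of both directions (Python; the rotation sort is for illustration only — the next slides build the BWT from the suffix array):

```python
def bwt(t):
    # t must end with a unique smallest terminator, here '#'
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)

def inv_bwt(L):
    n = len(L)
    # LF mapping: the k-th occurrence of c in L is the k-th occurrence in F
    order = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    out, r = [], 0           # row 0 starts with the terminator
    for _ in range(n):
        out.append(L[r])     # L[r] precedes F[r] in T
        r = LF[r]
    # out spells T backward, ending on the terminator itself
    return "".join(reversed(out[:-1])) + out[-1]

print(bwt("mississippi#"))      # ipssm#pissii
print(inv_bwt("ipssm#pissii"))  # mississippi#
```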
How to compute the BWT ?
SA = 12 11 8 5 2 1 10 9 7 4 6 3
BWT matrix: the sorted rotations (#mississipp·i, i#mississip·p, …)
whose last column is L = i p s s m # p i s s i i
We said that: L[i] precedes F[i] in T
L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
How to construct SA from T ?
SA
12  #
11  i#
8   ippi#
5   issippi#
2   ississippi#
1   mississippi#
10  pi#
9   ppi#
7   sippi#
4   sissippi#
6   ssippi#
3   ssissippi#
Elegant but inefficient
Input: T = mississippi#
Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii
# at 16
Mtf = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code
RLE0 = 03141041403141410210
Alphabet
|S|+1
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google, 7/2008)
5-40K per page ⟹ hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node can go to any node via an
undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node can go to any node via a
directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
Largest artifact ever conceived by humankind
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph (undirected)
V = Routers
E = communication links
The “cosine” graph (undirected, weighted)
V = static web pages
E = semantic distance between pages
Query-Log graph (bipartite, weighted)
V = queries and URLs
E = (q,u) if u is a result for q, and has been clicked by some
user who issued q
Social graph (undirected, unweighted)
V = users
E = (x,y) if x knows y (facebook, address book, email,..)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has an hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1
The In-degree distribution
Altavista crawl, 1999
Indegree follows power law distribution
WebBase Crawl 2001
Pr[ in-degree(u) = k ] ∝ 1/k^a,  a ≈ 2.1
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has an hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
21 million pages, 150 million links
URL-sorting
Berkeley
Stanford
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
For negative entries:
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.
Consecutivity in extra-nodes
Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
[figure: a sender transmits data over the network to a receiver, which
already has some knowledge about that data]
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
What if the sender has never seen the data at the receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested
page and the ones available in cache
File synchronization
[diff, zdelta, REBL,…]
[rsynch, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
The LZ77-scheme provides an efficient, optimal solution
fknown is the “previously encoded text”; compress fknownfnew starting from fnew
zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
[figure: Client ↔ (slow link, delta-encoding) ↔ Proxy ↔ (fast link) ↔ web;
requests carry a reference to a cached page]
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f ∈ F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
[figure: weighted graph over the files plus a dummy node 0; edge weights
are zdelta sizes; the min branching picks the cheapest reference for
each file]
          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic
Improvement
What about
many-to-one compression?
(group of files)
Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files
Assign weights: estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, strictly n² time
          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
[figure: the Client, holding f_old, sends a request; the Server, holding
f_new, sends back an update]
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
[figure: the Client sends the hashes of f_old’s blocks; the Server,
which holds f_new, sends back the encoded file]
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
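The rolling hash is what makes the scan cheap: sliding the window by one byte is an O(1) update. A sketch of the idea (Adler-style weak checksum; constants and names are illustrative, not rsync's exact ones):

```python
def weak_hash(block):
    # a = sum of bytes, b = position-weighted sum (Adler-style)
    a = sum(block) % 65536
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % 65536
    return (b << 16) | a

def roll(h, out_byte, in_byte, k):
    # slide a k-byte window one byte to the right in O(1)
    a = h & 0xFFFF
    b = h >> 16
    a = (a - out_byte + in_byte) % 65536
    b = (b - k * out_byte + a) % 65536
    return (b << 16) | a

data = b"the quick brown fox jumps"
k = 8
h = weak_hash(data[0:k])
for i in range(1, len(data) - k + 1):
    h = roll(h, data[i - 1], data[i + k - 1], k)
    assert h == weak_hash(data[i:i + k])   # rolling == recomputing
```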
Rsync: some experiments
         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452
Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
Server sends hashes (unlike rsync, where the client sends them), and the client checks them
Server deploys the common fref to compress the new ftar (rsync just compresses it).
A multi-round protocol
k blocks of n/k elems
Log n/k levels
If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (ie. T[i,N])
i P
T
T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix
P = si
T = mississippi
mississippi
4,7
SUF(T) = Sorted set of suffixes of T
Reduction
From substring search
To prefix search
The Suffix Tree
[figure: the suffix tree of T# = mississippi#; edges are labeled with
substrings (#, i, s, si, ssi, ppi#, pi#, …) and the 12 leaves store the
starting positions 1..12 of the corresponding suffixes]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space if SUF(T) is stored explicitly
SA
SUF(T)
12
11
8
5
2
1
10
9
7
4
6
3
#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#
T = mississippi#
suffix pointer
P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp
SA
12
11
8
5
2
1
10
9
7
4
6
3
T = mississippi#
P is larger
2 accesses per step
P = si
Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp
SA
12
11
8
5
2
1
10
9
7
4
6
3
T = mississippi#
P is smaller
P = si
Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
overall, O(p log2 N) time
O(p + log2 N) time [Manber-Myers, ’90]
O(p + log2 |S|) time [Cole et al, ’06]
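A toy version of the whole pipeline (Python; the naive sorted() construction is O(N² log N) in the worst case, fine only for illustration):

```python
from bisect import bisect_left, bisect_right

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])   # naive construction
suffixes = [T[i:] for i in SA]                    # materialized only for brevity

def occurrences(P):
    # Prop 1: the suffixes prefixed by P are contiguous in SA
    lo = bisect_left(suffixes, P)
    hi = bisect_right(suffixes, P + "\uffff")     # past any extension of P
    return sorted(SA[lo:hi])

print(SA)                 # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]  (0-based)
print(occurrences("si"))  # [3, 6], i.e. positions 4 and 7 in 1-based notation
```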
Locating the occurrences
Binary search on SA for the range of suffixes prefixed by P = si:
the range contains SA entries 7 (sippi…) and 4 (sissippi…)
⟹ occ = 2, at positions 4 and 7 of T = mississippi#
Suffix Array search
• O (p + log2 N + occ) time
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree
[Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
Lcp = 0 1 1 4 0 0 1 0 2 1 3
(Lcp[i] is the lcp between the suffixes at SA[i] and SA[i+1];
e.g. lcp(issippi#, ississippi#) = 4)
SA = 12 11 8 5 2 1 10 9 7 4 6 3
T = mississippi#
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
Paradigm shift...
Web 2.0 is about the many
Big
DATA Big PC ?
We have three types of algorithms:
T1(n) = n, T2(n) = n2, T3(n) = 2n
... and assume that 1 step = 1 time unit
How many input data n each algorithm may process
within t time units?
n1 = t,
n2 = √t,
n3 = log2 t
What about a k-times faster processor?
...or, what is n, when the time units are k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
A new scenario
Data are more available than ever before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Step cost is Ω(1) time
Not just MIN #steps…
CPU (registers) → L1/L2 cache → RAM → HD → network
Cache: few Mbs, some nanosecs, few words fetched
RAM: few Gbs, tens of nanosecs, some words fetched
HD: few Tbs, few millisecs, B = 32K page
Net: many Tbs, even secs, packets
You should be “??-aware programmers”
I/O-conscious Algorithms
[figure: a disk — track, read/write head, read/write arm, magnetic surface]
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O [10^5–10^6 steps (Hennessy-Patterson)]
If N = (1+f)·M, then the disk-average cost per step is:
C · p · f/(1+f)
This is at least 10^4 · f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algo uses all of them:
(1/B) · (p · f/(1+f) · C) ≈ 30 · f/(1+f)
Space-conscious Algorithms
I/Os
search
access
Compressed
data structures
Streaming Algorithms
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
CPU (registers) → L1/L2 cache → RAM → HD → network
Cache: few Mbs, some nanosecs, few words fetched
RAM: few Gbs, tens of nanosecs, some words fetched
HD: few Tbs, few millisecs, B = 32K page
Net: many Tbs, even secs, packets
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
n    4K    8K    16K    32K    128K    256K    512K    1M
n³   22s   3m    26m    3.5h   28h     --      --      --
n²   0     0     0      1s     26s     106s    7m      28m
An optimal solution
We assume every subsum ≠ 0
[figure: A is split into a prefix of sum < 0 followed by the Optimum,
within which all running sums are > 0]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
sum = 0; max = -1;
For i = 1,...,n do
If (sum + A[i] ≤ 0) sum = 0;
else { sum += A[i]; max = MAX{max, sum}; }
Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
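The scan above is Kadane's algorithm; in runnable form (Python; names mine):

```python
def max_subarray(A):
    best, cur = None, 0
    for x in A:
        if cur + x <= 0:
            cur = 0                  # Sum < 0: the optimum cannot start here
        else:
            cur += x                 # Sum > 0 within the optimum
            if best is None or cur > best:
                best = cur
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray(A))               # 12 (the subarray 6 1 -2 4 3)
```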
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort ⟹ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions Data get distributed arbitrarily !!!
B-tree internal nodes
B-tree leaves
(“tuple pointers")
What about
listing tuples
in order ?
Tuples
Possibly 10^9 random I/Os = 10^9 · 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;            // Divide
03   Merge-Sort(A,i,m);      // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)          // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10^9 tuples ⟹ a few Gbs
Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log2 n) random I/Os
[5ms] · n log2 n ≈ 1.5 years
In practice, it is faster because of caching...
2 passes (R/W)
Merge-Sort Recursion Tree
log2 N
If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[figure: the recursion tree, with sorted runs of doubling size at each level]
How do we deploy the disk/mem features ?
N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
Pass 1: Produce (N/M) sorted runs.
Pass i: merge X = M/B runs ⟹ log_{M/B} N/M passes
INPUT 1
...
INPUT 2
...
OUTPUT
...
INPUT X
Disk
Disk
Main memory buffers of B items
Multiway Merging
[Figure: runs 1 … X = M/B on disk; the current page of run i is buffered in Bf_i with a pointer p_i. At each step output min(Bf1[p1], Bf2[p2], …, BfX[pX]) into the output buffer Bfo (pointer po); fetch the next page of run i when p_i = B; flush Bfo to the merged run on disk when it is full; stop at EOF.]
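The buffer-based selection above is a k-way merge. An in-memory sketch, with a min-heap standing in for the min(Bf1[p1], …, BfX[pX]) selection (a real external sort would fetch and flush disk pages instead of walking Python lists):

```python
import heapq

def multiway_merge(runs):
    # Merge X sorted runs with a min-heap over their current heads.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, p = heapq.heappop(heap)
        out.append(val)
        if p + 1 < len(runs[i]):              # "fetch" the next item of run i
            heapq.heappush(heap, (runs[i][p + 1], i, p + 1))
    return out

print(multiway_merge([[1, 5, 13, 19], [7, 9], [4, 15]]))
# [1, 4, 5, 7, 9, 13, 15, 19]
```

Each of the N output steps costs O(log X) comparisons, which is why a large fan-out X = M/B reduces the number of passes without blowing up CPU time.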
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os
In practice
M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
One multiway merge ⇒ 2 passes = a few mins
Tuning depends on disk features
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
Can compression help?
Goal: enlarge M and reduce N
#passes = O(log_{M/B} (N/M))
Cost of a pass = O(N/B)
Part of Vitter’s paper…
It addresses issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how low can we go
Toy problem #3: Top-freq elements
Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)
A=b a c c c d c b a a a c c b c c c
.
Algorithm
Use a pair of variables
For each item s of the stream,
if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}
Return X;
Proof
Problems
if ≤ N/2
If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.
As a result, 2 * #occ(y) > N...
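The pair-of-variables scan above is the classic majority-vote idea; a minimal Python sketch (function name is ours):

```python
def majority_candidate(stream):
    # One pass, O(1) space: a candidate X and a counter C.
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1          # adopt a new candidate
        elif X == s:
            C += 1
        else:
            C -= 1               # cancel one occurrence of X
    return X  # correct only if some item really occurs > N/2 times

print(majority_candidate("bacccdcbaaaccbccc"))  # c
```

On the slide's stream (9 occurrences of c out of 17 items) the returned candidate is c.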
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 * 10^9 ⇒ size = 6GB
n = 10^6 documents
TotT = 10^9 total terms (avg term length is 6 chars)
t = 5 * 10^5 distinct terms
What kind of data structure should we build to support word-based searches ?
Solution 1: Term-Doc matrix
t = 500K terms × n = 1 million docs; entry is 1 if the play contains the word, 0 otherwise:

           Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            1                1             0          0       0        1
Brutus            1                1             0          1       0        0
Caesar            1                1             0          1       1        1
Calpurnia         0                1             0          0       0        0
Cleopatra         1                0             0          0       0        0
mercy             1                0             1          1       1        1
worser            1                0             1          1       1        0

Space is 500Gb !
Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 2 3 5 8 13 21 34
Caesar    → 1 13 16
1. Typically about 12 bytes are used per posting
2. We have 10^9 total terms ⇒ at least 12GB of space
3. Compressing the 6GB of documents gets 1.5GB of data
Better index, but it is still >10 times the text !!!!
We can still do better: i.e. 30÷50% of the original text
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL of them into fewer bits ?
NO: they are 2^n, but the shorter compressed messages are fewer:
∑_{i=1}^{n-1} 2^i = 2^n - 2 < 2^n
We need to talk about stochastic sources.
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:
i(s) = log2 (1/p(s)) = -log2 p(s)
Lower probability ⇒ higher information
Entropy is the weighted average of i(s):
H(S) = ∑_{s∈S} p(s) log2 (1/p(s))  bits
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable-length code assigns a bit string (codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ? (It parses both as 1|011 = ad and as 101|1 = ca.)
A code is uniquely decodable if every encoded sequence can be decomposed into codewords in only one way.
Prefix Codes
A prefix code is a variable-length code in which no codeword is a prefix of another one
e.g. a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:
[Figure: binary trie with a at the leaf reached by 0, b at 100, c at 101, d at 11.]
Average Length
For a code C with codeword lengths L[s], the average length is defined as
La(C) = ∑_{s∈S} p(s) L[s]
We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…
Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn} then pi < pj ⇒ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have
H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that
La(C) ≤ H(S) + 1
(the Shannon code, in which symbol s takes ⌈log2 (1/p(s))⌉ bits)
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
[Figure: Huffman tree built bottom-up. Merge a(.1) and b(.2) into (.3); merge (.3) and c(.2) into (.5); merge (.5) and d(.5) into the root (1). Left edges are labeled 0, right edges 1.]
a=000, b=001, c=01, d=1
There are 2^{n-1} “equivalent” Huffman trees
What about ties (and thus, tree depth) ?
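The bottom-up merging can be sketched with a heap. Note that, because of ties and of 0/1 labeling, it may output any one of the 2^{n-1} equivalent trees: the codewords can differ from those above, but the codeword lengths match (here 3, 3, 2, 1):

```python
import heapq

def huffman_codes(probs):
    # probs: dict symbol -> probability. Repeatedly merge the two
    # least-probable subtrees; prepend 0/1 to the codes of their leaves.
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    code = {s: "" for s in probs}
    i = len(heap)                      # tie-breaker for equal probabilities
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)
        p1, _, right = heapq.heappop(heap)
        for s in left:
            code[s] = "0" + code[s]
        for s in right:
            code[s] = "1" + code[s]
        heapq.heappush(heap, (p0 + p1, i, left + right))
        i += 1
    return code

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
```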
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: Start at the root and take a branch for each bit received. When at a leaf, output its symbol and return to the root.
abc...    → 000 001 01 ...
101001... → d c b ...
[Figure: the same Huffman tree as above, with a=000, b=001, c=01, d=1.]
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store for any level L:
firstcode[L] (at the deepest level it is 00.....0)
Symbol[L,i], for each i in level L
This is ≤ h^2 + |S| log |S| bits
Canonical Huffman: Encoding
[Figure: a canonical Huffman tree with levels 1–5; on each level the codewords are consecutive binary numbers starting at firstcode[L].]
Canonical Huffman: Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0
T=...00010...
Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
-log2(.999) ≈ .00144
If we were to send 1000 such symbols we might hope to use 1000 × .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
1 extra bit per macro-symbol = 1/k extra bits per symbol
But a larger model has to be transmitted
Shannon took infinite sequences, i.e. k → ∞ !!
In practice, we have:
The model takes |S|^k * (k * log |S|) + h^2 bits (where h might be |S|)
It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L
Compress + Search ?  [Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the Huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged: one bit per byte marks whether it is the first byte of a codeword, leaving 7 bits for the Huffman code.
[Figure: T = “bzip or not bzip”; each word and the space get a byte-aligned codeword, e.g. [bzip] = 1a 0b (two bytes, the first tagged); C(T) is the concatenation of the codewords.]
CGrep and other ideas...
P = bzip ⇒ compress it as C(P) = 1a 0b and search C(P) in C(T) with a classical string matcher (GREP); the tag bits rule out false matches that cross codeword boundaries.
[Figure: T = “bzip or not bzip”; scanning C(T) for the 2-byte pattern 1a 0b answers yes at the two aligned occurrences of “bzip”, no elsewhere.]
Speed ≈ Compression ratio
You find this at
You find it under my Software projects
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary: a, bzip, not, or, space
P = bzip ⇒ C(P) = 1a 0b
[Figure: S = “bzip or not bzip”; search C(P) directly in C(S), checking the tag bits, and answer yes/no at each candidate position.]
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the pattern P[1,m] in the text T[1,n].
[Figure: the pattern P = AB aligned at its occurrences inside the text T.]
Naïve solution:
For any position i of T, check if T[i,i+m-1] = P[1,m]
Complexity: O(nm) time
(Classical) optimal solutions based on comparisons:
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}^n).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:
H(s) = ∑_{i=1}^{m} 2^{m-i} s[i]
P=0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s = s’ if and only if H(s) = H(s’)
Definition: let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r,r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1):
H(Tr) = 2·H(Tr-1) - 2^m·T[r-1] + T[r+m-1]
T=10110101
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 - 2^4·1 + 0 = 22 - 16 = 6
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P=101111, q=7
H(P) = 47, Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally, bit by bit:
1·2 + 0 (mod 7) = 2
2·2 + 1 (mod 7) = 5
5·2 + 1 (mod 7) = 4
4·2 + 1 (mod 7) = 2
2·2 + 1 (mod 7) = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr-1), since
2^m (mod q) = 2·(2^{m-1} (mod q)) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
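A compact Python sketch of the fingerprint scan (here over characters in base 256 rather than bits in base 2, with verification on hash hits, so no false match is ever reported; q is a fixed prime instead of a randomly drawn one, for brevity):

```python
def rabin_karp(T, P, q=101):
    # Compare rolling fingerprints Hq(Tr) against Hq(P), verifying hits.
    n, m = len(T), len(P)
    if m > n:
        return []
    b = 256
    bm = pow(b, m - 1, q)              # b^(m-1) mod q, for the rolling update
    hp = ht = 0
    for i in range(m):
        hp = (hp * b + ord(P[i])) % q
        ht = (ht * b + ord(T[i])) % q
    hits = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:   # verify: rules out false matches
            hits.append(r)
        if r < n - m:                      # slide the window by one position
            ht = ((ht - ord(T[r]) * bm) * b + ord(T[r + m])) % q
    return hits

print(rabin_karp("10110101", "0101"))  # [4] (0-based; position 5 on the slide)
```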
Problem 1: Solution
Dictionary: a, bzip, not, or, space
P = bzip ⇒ C(P) = 1a 0b
[Figure: S = “bzip or not bzip”; search C(P) in C(S), answering yes at the two aligned occurrences and no elsewhere.]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for.
M is an m×n matrix; its only 1-entries are M(1,5) (f matches), M(2,6) (fo matches) and M(3,7) (for matches: an occurrence of P ends at position 7 of T).
How does M solve the exact match problem?
How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one
Machines can perform bit and arithmetic operations between two words in constant time.
Examples:
And(A,B) is the bit-wise and between A and B.
BitShift(A) is the value derived by shifting A’s bits down by one and setting the first bit to 1, e.g. BitShift((0,1,1,0)) = (1,0,1,1).
Let w be the word size (e.g., 32 or 64 bits). We’ll assume m ≤ w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 in the positions where x appears in P.
Example: P = abaac
U(a) = (1,0,1,1,0)   U(b) = (0,1,0,0,0)   U(c) = (0,0,0,0,1)
How to construct M
Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained by
M(j) = BitShift(M(j-1)) & U(T[j])
For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T ending at character j-1  ⇔  M(i-1,j-1) = 1
(2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1
BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing with the i-th bit of U(T[j]) establishes whether both conditions hold.
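The column update can be sketched with Python integers as bit-vectors (bit i of the state is row i+1 of M; function name is ours):

```python
def shift_and(T, P):
    m = len(P)
    U = {}
    for i, c in enumerate(P):          # U(c): 1-bits at the positions of c in P
        U[c] = U.get(c, 0) | (1 << i)
    state, occ = 0, []
    for j, c in enumerate(T):
        # BitShift: shift and set the first bit; then AND with U(T[j])
        state = ((state << 1) | 1) & U.get(c, 0)
        if state & (1 << (m - 1)):     # last row set: an occurrence ends at j
            occ.append(j - m + 1)
    return occ

print(shift_and("xabxabaaca", "abaac"))  # [4] (0-based start)
```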
An example, j=1
T=xabxabaaca, P=abaac
M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)
An example, j=2
T=xabxabaaca, P=abaac
M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
An example, j=3
T=xabxabaaca, P=abaac
M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)
An example, j=9
T=xabxabaaca, P=abaac
M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)
The 5th bit is set: an occurrence of P ends at position 9 of T.
Shift-And method: Complexity
If m<=w, any column and vector U() fit in a
memory word.
Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when the pattern length is close to the word size, which is very often the case in practice. Recall that w = 64 bits in modern architectures.
Some simple extensions
We want to allow the pattern to contain special symbols, like the [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)   U(b) = (1,1,0,0,0)   U(c) = (0,0,0,0,1)
What about ‘?’, ‘[^…]’ (not) ?
Problem 1: Another solution
Dictionary: a, bzip, not, or, space
P = bzip ⇒ C(P) = 1a 0b
[Figure: S = “bzip or not bzip”; search C(P) in C(S), answering yes at the two aligned occurrences and no elsewhere.]
Speed ≈ Compression ratio
Problem 2
Dictionary: a, bzip, not, or, space
Given a pattern P, find all the occurrences in S of all terms containing P as substring
P = o ⇒ the matching terms and their codewords are: not = 1g 0g 0a, or = 1g 0a 0b
[Figure: S = “bzip or not bzip”; search the codeword of each matching term in C(S).]
Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].
[Figure: patterns P1 and P2 aligned at their occurrences inside T.]
Naïve solution:
Use an (optimal) exact matching algorithm, searching for each pattern in P
Complexity: O(nl+m) time, not good with many patterns
Optimal solution due to Aho and Corasick:
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns in P
R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method searching for S:
For any symbol c, U’(c) = U(c) AND R, i.e. U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
For any step j:
compute M(j), then M(j) OR U’(T[j]). Why?
This sets to 1 the first bit of each pattern that starts with T[j]
Check if there are occurrences ending in j. How?
Problem 3
Dictionary: a, bzip, not, or, space
Given a pattern P, find all the occurrences in S of all terms containing P as substring, allowing at most k mismatches
P = bot, k = 2
[Figure: S = “bzip or not bzip”; C(S) with the codewords of the dictionary terms.]
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa, P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.
Agrep
Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary matrix, such that:
Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.
What is M0?
How does Mk solve the k-mismatch problem?
Computing Mk
We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
Computing Ml: case 1
The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:
BitShift(Ml(j-1)) & U(T[j])
Computing Ml: case 2
The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (P[i] and T[j] may then differ):
BitShift(Ml-1(j-1))
Computing Ml
We compute Ml for all l = 0, … , k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a match iff
Ml(j) = [BitShift(Ml(j-1)) & U(T[j])] OR BitShift(Ml-1(j-1))
Example
T=xabxabaaca, P=abaad

M0 =    1 2 3 4 5 6 7 8 9 10
    1   0 1 0 0 1 0 1 1 0 1
    2   0 0 1 0 0 1 0 0 0 0
    3   0 0 0 0 0 0 1 0 0 0
    4   0 0 0 0 0 0 0 1 0 0
    5   0 0 0 0 0 0 0 0 0 0

M1 =    1 2 3 4 5 6 7 8 9 10
    1   1 1 1 1 1 1 1 1 1 1
    2   0 0 1 0 0 1 0 1 1 0
    3   0 0 0 1 0 0 1 0 0 1
    4   0 0 0 0 1 0 0 1 0 0
    5   0 0 0 0 0 0 0 0 1 0

M1(5,9) = 1: P occurs with at most 1 mismatch ending at position 9 (i.e. starting at position 5) of T.
How much do we pay?
The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Moreover, only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.
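The recurrence above can be sketched in Python, keeping one integer per column Ml (names are ours):

```python
def shift_and_mismatches(T, P, k):
    # Bit-parallel k-mismatch search: M[l] is the current column for <= l
    # mismatches, updated with the recurrence
    # M[l] = (BitShift(M[l]) & U(T[j])) | BitShift(M[l-1]).
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)
    occ = []
    for j, c in enumerate(T):
        prev = M[:]                                   # columns at position j-1
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            # extend by a matching char, OR spend one mismatch on T[j]
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & (1 << (m - 1)):
            occ.append(j - m + 1)
    return occ

print(shift_and_mismatches("xabxabaaca", "abaad", 1))  # [4] (0-based start)
```

On the example above (P=abaad, k=1) it reports the single occurrence with one mismatch starting at position 5 (1-based).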
Problem 3: Solution
Dictionary: a, bzip, not, or, space
Given a pattern P, find all the occurrences in S of all terms containing P as substring, allowing k mismatches
P = bot, k = 2
[Figure: S = “bzip or not bzip”; scanning C(S) answers yes on the codewords of the matching terms, e.g. not = 1g 0g 0a, which matches bot with 1 mismatch.]
Agrep: more sophisticated operations
The Shift-And method can support other operations.
The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol of p into a different one
Example: d(ananas,banane) = 3
Search by regular expressions
Example: (a|b)?(abc|a)
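The three operations give the classic dynamic program for d(p,s); a minimal sketch:

```python
def edit_distance(p, s):
    # d[i][j] = minimum number of ops to turn p[:i] into s[:j]
    m, n = len(p), len(s)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                                  # i deletions
    for j in range(n + 1):
        d[0][j] = j                                  # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == s[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + cost)    # substitution / match
    return d[m][n]

print(edit_distance("ananas", "banane"))  # 3
```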
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
g-code for integer encoding
g(x) = (Length-1 zeros) followed by x written in binary, for x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.
The g-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
It is a prefix-free encoding…
Exercise: given the following sequence of g-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8 6 3 59 7
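A sketch of the g (gamma) code: an encoder, and a decoder for a whole concatenated bit string (function names are ours):

```python
def gamma_encode(x):
    # g(x): (Length-1) zeros, then x in binary, Length = floor(log2 x) + 1
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode_all(bits):
    # Decode a concatenation of g-codes back into the integer sequence
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":          # unary part: number of extra binary digits
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                       # 0001001
print(gamma_decode_all("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]
```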
Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).
Recall that |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Key fact: 1 ≥ ∑_{i=1,...,x} pi ≥ x * px  ⇒  x ≤ 1/px
The cost of the encoding is (recall i ≤ 1/pi):
∑_{i=1,...,|S|} pi * |g(i)| ≤ ∑_{i=1,...,|S|} pi * [2 * log(1/pi) + 1] = 2 * H0(X) + 1
Not much worse than Huffman, and improvable to H0(X) + 2 + ....
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman
End-tagged dense code
The rank r is mapped to r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprisingly:
It is a prefix-code
Better compression: it uses all the 7-bit configurations
(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is:
Previously we used s = c = 128
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s*c^2 with 3 bytes, ...
An example
5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words within 2 bytes
A (230,26)-dense code encodes 230 + 230*26 = 6210 words within 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...
Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.
Brute-force approach
Binary search: on real distributions, there seems to be one unique minimum
Ks = max codeword length
Fsk = cumulative probability of the symbols whose |cw| ≤ k
Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
It exploits temporal locality, and it is dynamic
X = 1^n 2^n 3^n … n^n: Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits
Not much worse than Huffman ...but it may be far better
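The encoding loop above, as a minimal Python sketch (plain list operations, fine for small alphabets; the tree/hash variants make each step O(log |S|)):

```python
def mtf_encode(s, alphabet):
    # Transform a char sequence into an integer sequence (0-based
    # positions), which can then be var-length coded (e.g. g-codes).
    L = list(alphabet)
    out = []
    for c in s:
        i = L.index(c)          # position of c in the current list
        out.append(i)
        L.pop(i)
        L.insert(0, c)          # move c to the front
    return out

print(mtf_encode("aaabbbcc", "abc"))  # [0, 0, 0, 1, 0, 0, 2, 0]
```

Repeated symbols cost 0 after the first hit, which is exactly the temporal-locality effect.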
MTF: how good is it ?
Encode the integers (list positions) via g-coding: |g(i)| ≤ 2 * log i + 1
Put S in front and consider the cost of encoding; the position of x at its i-th occurrence is at most the gap from its previous occurrence:
O(|S| log |S|) + ∑_{x∈S} ∑_{i≥2} |g(p_i^x - p_{i-1}^x)|
By Jensen’s inequality:
≤ O(|S| log |S|) + ∑_{x∈S} n_x * [2 * log(N/n_x) + 1]
= O(|S| log |S|) + N * [2 * H0(X) + 1]
Hence La[mtf] ≤ 2 * H0(X) + O(1)
MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded
How to keep the MTF-list efficiently:
Search tree: leaves contain the symbols, ordered as in the MTF-list; nodes contain the size of their descending subtree
Hash table: key is a symbol, data is a pointer to the corresponding tree leaf
Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one initial bit suffice
Properties:
It exploits spatial locality, and it is a dynamic code (there is a memory)
X = 1^n 2^n 3^n … n^n
Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):
f(i) = ∑_{j=1}^{i-1} p(j)
e.g. with p(a) = .2, p(b) = .5, p(c) = .3:
a = [0, .2), b = [.2, .7), c = [.7, 1.0)
so f(a) = .0, f(b) = .2, f(c) = .7
The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start with [0,1); b restricts it to [.2,.7); within that, a restricts it to [.2,.3); within that, c restricts it to [.27,.3).
The final sequence interval is [.27,.3)
Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:
l_0 = 0,  s_0 = 1
l_i = l_{i-1} + s_{i-1} * f[c_i]
s_i = s_{i-1} * p[c_i]
f[c] is the cumulative probability up to symbol c (not included)
The final interval size is
s_n = ∏_{i=1}^{n} p[c_i]
The interval for a message sequence will be called the sequence interval
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
.49 ∈ [.2,.7) ⇒ b; rescaling, .49 falls in b’s sub-interval [.3,.55) ⇒ b; rescaling again, it falls in c’s sub-interval [.475,.55) ⇒ c.
The message is bbc.
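The interval computation and the length-aware decoder can be sketched with floating point (fine for toy messages; real coders use the integer version discussed below; names are ours):

```python
def seq_interval(msg, p, order):
    # Compute [l, l+s) for msg; f[c] = cumulative prob. of symbols before c.
    f, acc = {}, 0.0
    for c in order:
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, s

def decode(x, n, p, order):
    f, acc = {}, 0.0
    for c in order:
        f[c] = acc
        acc += p[c]
    out = []
    for _ in range(n):
        for c in order:                    # find the symbol interval holding x
            if f[c] <= x < f[c] + p[c]:
                out.append(c)
                x = (x - f[c]) / p[c]      # rescale and continue
                break
    return "".join(out)

p = {"a": .2, "b": .5, "c": .3}
l, s = seq_interval("bac", p, "abc")
print(round(l, 4), round(l + s, 4))        # 0.27 0.3
print(decode(0.49, 3, p, "abc"))           # bbc
```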
Representing a real number
Binary fractional representation:
.75 = .11
1/3 = .0101...
11/16 = .1011
Algorithm:
1. x = 2 * x
2. If x < 1, output 0
3. else x = x - 1; output 1
So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) → .01    [.33,.66) → .1    [.66,1) → .11
Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:
.11  → min .110..., max .111... → interval [.75, 1.0)
.101 → min .1010..., max .1011... → interval [.625, .75)
We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).
Example: sequence interval [.61, .79); the code interval of .101 is [.625, .75) ⊆ [.61, .79).
Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits
Bound on Arithmetic length
Note that -log s + 1 = log (2/s)
Bound on Length
Theorem: For a text of length n, the arithmetic encoder generates at most
1 + log (1/s) = 1 + log ∏_{i=1,n} (1/p_i)
≤ 2 + ∑_{i=1,n} log (1/p_i)
= 2 + ∑_{k=1,|S|} n p_k log (1/p_k)
= 2 + n H0 bits
In practice it is nH0 + 0.02 n bits, because of rounding
Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:
Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s; m = 0
Message interval is expanded by 2
If u < R/2 then (bottom half)
Output 0 followed by m 1s; m = 0
Message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2
In all other cases, just continue...
You find this at
Arithmetic ToolBox
As a state machine
[Figure: the Arithmetic ToolBox (ATB) maps the current interval (L,s) and a symbol c, with distribution (p1,....,p|S|), to the new interval (L’,s’) = (L + s·f(c), s·p(c)).]
Therefore, even the distribution can change over time
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
[Figure: the ATB is driven by p[s | context], where s is either a character c or esc; at each step it maps (L,s) to (L’,s’).]
Encoder and decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant)
PPM: Example Contexts
String = ACCBACCACBA B, k = 2

Empty context:  A=4  B=2  C=5  $=3

Order-1 contexts:
  A:  C=3  $=1
  B:  A=2  $=1
  C:  A=1  B=2  C=2  $=3

Order-2 contexts:
  AC:  B=1  C=2  $=2
  BA:  C=1  $=1
  CA:  C=1  $=1
  CB:  A=2  $=1
  CC:  A=1  B=1  $=2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
The dictionary consists of all substrings starting before the cursor.
Algorithm’s step: output a triple ⟨d, len, c⟩ (e.g. ⟨2,3,c⟩), where
d = distance of the copied string wrt the current position
len = length of the longest match
c = next char in the text beyond the longest match
Advance by len + 1
A buffer “window” has fixed length and moves
Example: LZ77 with window
a a c a a c a b c a b a a a c
(0,0,a)
a a c a a c a b c a b a a a c
(1,1,c)
a a c a a c a b c a b a a a c
(3,4,b)
a a c a a c a b c a b a a a c
(3,3,a)
a a c a a c a b c a b a a a c
(1,2,c)
Window size = 6
Longest match
within W
Next character
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if len > d? (overlap with the text being written)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor:
for (i = 0; i < len; i++)
  out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
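The copy loop above, wrapped into a full decoder sketch (it handles the len > d overlap exactly as described):

```python
def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):        # works even when length > d (overlap)
            out.append(out[start + i])
        out.append(c)                  # the extra literal character
    return "".join(out)

# The windowed example from the slides:
print(lz77_decode([(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]))
# aacaacabcabaaac
```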
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
LZW: Encoding Example
Output
Dict.
a a b a a c a b a b a c b
112 256=aa
a a b a a c a b a b a c b
112 257=ab
a a b a a c a b a b a c b
113
258=ba
a a b a a c a b a b a c b
256
259=aac
a a b a a c a b a b a c b
114
260=ca
a a b a a c a b a b a c b
257
261=aba
a a b a a c a b a b a c b
261
262=abac
a a b a a c a b a b a c b
114
263=cb
LZW: Decoding Example
Input   Output (so far)         Dict.
112     a
112     a a                     256=aa
113     a a b                   257=ab
256     a a b a a               258=ba
114     a a b a a c             259=aac
257     a a b a a c a b         260=ca
261     a a b a a c a b ?       — 261 is not yet defined: since the decoder is one step behind, 261 must be the previous match plus its first char, i.e. aba
114     a a b a a c a b a b a c   261=aba (added one step later)
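A decoder sketch that also handles the “one step later” special case (we use the real ASCII codes, ord('a') = 97, instead of the slide's toy value 112):

```python
def lzw_decode(codes, first_code=256):
    # Codes < 256 are single chars; the dictionary grows by one entry per
    # decoded code. A not-yet-defined code must be prev + prev[0].
    dic = {}
    prev = chr(codes[0])
    out = [prev]
    nxt = first_code
    for code in codes[1:]:
        if code < 256:
            entry = chr(code)
        elif code in dic:
            entry = dic[code]
        else:                          # the SSc special case: code == nxt
            entry = prev + prev[0]
        dic[nxt] = prev + entry[0]     # the entry the encoder added last step
        nxt += 1
        out.append(entry)
        prev = entry
    return "".join(out)

# The slides' stream, with a=97, b=98, c=99:
print(lzw_decode([97, 97, 98, 256, 99, 257, 261, 99, 98]))  # aabaacababacb
```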
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
Let us be given the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows (Burrows-Wheeler, 1994):

F              L
# mississipp  i
i #mississip  p
i ppi#missis  s
i ssippi#mis  s
i ssissippi#  m
m ississippi  #
p i#mississi  p
p pi#mississ  i
s ippi#missi  s
s issippi#mi  s
s sippi#miss  i
s sissippi#m  i

L = last column of the sorted rotations of T; this is the BWT of T.
A famous example: the same transform on a much longer text...
A useful tool: the L → F mapping

F = # i i i i m p p s s s s   (the sorted chars of T: known)
L = i p s s m # p i s s i i   (the BWT: known; the middle of each row is unknown)
How do we map L’s chars onto F’s chars ?
... we need to distinguish equal chars in F...
Take two equal chars of L:
rotating their rows rightward by one position,
their relative order stays the same !!
The BWT is invertible

F = # i i i i m p p s s s s
L = i p s s m # p i s s i i   (rows’ middles unknown)

Two key properties:
1. The LF-mapping maps L’s chars to F’s chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:
T = .... i ppi #
InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;          // row 0 starts with #, so L[0] is the char preceding # in T
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
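The transform and its inversion can be sketched in Python (a sketch, assuming T ends with a unique sentinel '#' that is smaller than every other character; LF is obtained by stably sorting the positions of L):

```python
def bwt(text):
    # Naive forward BWT: sort all rotations, take the last column.
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(L):
    n = len(L)
    # Stable sort of L's positions gives F: F[r] = L[order[r]].
    order = sorted(range(n), key=lambda i: L[i])   # Python's sort is stable
    LF = [0] * n                                   # LF = inverse permutation
    for r, i in enumerate(order):
        LF[i] = r
    # Row 0 of the sorted matrix starts with '#', so L[0] is the char
    # preceding '#' in T: walk backward through T via LF.
    out = []
    r = 0
    for _ in range(n):
        out.append(L[r])
        r = LF[r]
    t = "".join(reversed(out))    # T rotated so that '#' comes first
    return t[1:] + t[0]           # move the sentinel back to the end
```

On the running example this reproduces L = ipssm#pissii and recovers mississippi#.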
How to compute the BWT ?
SA    BWT matrix (sorted rotations)   L
12    #mississipp                     i
11    i#mississip                     p
 8    ippi#missis                     s
 5    issippi#mis                     s
 2    ississippi#                     m
 1    mississippi                     #
10    pi#mississi                     p
 9    ppi#mississ                     i
 7    sippi#missi                     s
 4    sissippi#mi                     s
 6    ssippi#miss                     i
 3    ssissippi#m                     i
We said that: L[i] precedes F[i] in T
L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
How to construct SA from T ?
SA    suffix
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#
Elegant but inefficient
Input: T = mississippi#
Obvious inefficiencies:
• Θ(n^2 log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
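The "elegant but inefficient" construction is a one-liner in Python (a sketch: positions are 1-based as in the slides, and the BWT is read off SA via L[i] = T[SA[i]-1], with the text viewed as circular):

```python
def suffix_array(text):
    # Naive construction: sort the starting positions by their suffixes.
    # Worst case O(n^2 log n) time, as noted above.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def bwt_from_sa(text, sa):
    # L[i] = T[SA[i] - 1], 1-based and circular (position 0 wraps to the end).
    n = len(text)
    return "".join(text[(i - 2) % n] for i in sa)
```

On T = mississippi# this yields the SA and the L column shown above.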
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii
# at 16
Mtf = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code
RLE0 = 03141041403141410210
Alphabet
|S|+1
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
You find this in your Linux distribution
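The Move-to-Front step of the pipeline can be sketched as follows (a sketch: the slides' full bzip pipeline additionally applies RLE0, Wheeler's code and a statistical coder, all omitted here):

```python
def mtf_encode(s, alphabet):
    # Move-to-Front: emit the current position of each char in the list,
    # then move that char to the front. Because the BWT output L is
    # locally homogeneous, runs of equal chars become runs of zeros.
    lst = list(alphabet)
    out = []
    for ch in s:
        i = lst.index(ch)
        out.append(i)
        lst.insert(0, lst.pop(i))
    return out
```

For instance a run-heavy prefix of L such as "ipppss" over the list [i,m,p,s] maps to mostly-zero output, which RLE0 then squeezes.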
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google, 7/2008)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node one can reach any other node via an
undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node one can reach any other node via a
directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
Largest artifact ever conceived by humans
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph (undirected)
V = routers
E = communication links
The “cosine” graph (undirected, weighted)
V = static web pages
E = semantic distance between pages
Query-Log graph (bipartite, weighted)
V = queries and URLs
E = (q,u) if u is a result for q, and has been clicked by some
user who issued q
Social graph (undirected, unweighted)
V = users
E = (x,y) if x knows y (facebook, address book, email,..)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: the probability that a node has x links is ∝ 1/x^α, α ≈ 2.1
The In-degree distribution
(Altavista crawl, 1999; WebBase crawl, 2001)
Indegree follows a power-law distribution:
Pr[ in-degree(u) = k ] ≈ 1/k^α,  α ≈ 2.1
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: the probability that a node has x links is ∝ 1/x^α, α ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
j
i
21 million pages, 150 million links
URL-sorting
Berkeley
Stanford
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
For negative entries:
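The gap encoding of a successor list can be sketched as follows (a sketch of the formula above only: the WebGraph library then feeds these gaps to variable-length instantaneous codes, and handles the possibly-negative first gap with a signed code, both omitted here):

```python
def gap_encode(x, successors):
    # S(x) = {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}:
    # the first gap is relative to the source node x (may be negative),
    # later gaps are differences between consecutive sorted successors.
    gaps = [successors[0] - x]
    for prev, cur in zip(successors, successors[1:]):
        gaps.append(cur - prev - 1)
    return gaps

def gap_decode(x, gaps):
    succ = [x + gaps[0]]
    for g in gaps[1:]:
        succ.append(succ[-1] + g + 1)
    return succ
```

Locality makes these gaps small, which is exactly what the instantaneous codes exploit.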
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen as the value in [0,W] that gives the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.
Consecutivity in extra-nodes
Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
data
knowledge
about data
at receiver
sender
receiver
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
What if the sender has never seen the data at the receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested
page and the ones available in cache
File synchronization
[diff, zdelta, REBL,…]
[rsynch, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
The LZ77-scheme provides an efficient, optimal solution
fknown is “previously encoded text”, compress fknownfnew starting from fnew
zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
Client
reference
request
Slow-link
Delta-encoding
Proxy
reference
request
Fast-link
web
Page
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f ∈ F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic
Improvement
What about
many-to-one compression?
(group of files)
Problem: Constructing G is very costly, n^2 edge calculations (zdelta executions)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files
Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, still strictly n^2 time
          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
request
f_new
update
Server
f_old
Client
the client wants to update an out-dated file
the server has the new file, but does not know the old one
goal: update without sending the entire f_new (exploiting similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
hashes
f_new
Server
encoded file
f_old
Client
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
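The server-side matching step can be sketched as a toy in Python (a sketch only: weak_hash is a simplified Adler-style checksum, not rsync's actual one; real rsync updates the rolling hash in O(1) per one-byte shift and confirms weak matches with a strong MD5-like hash, while here matches are confirmed by direct byte comparison):

```python
def weak_hash(block):
    # Simplified Adler-style (a, b) checksum over a byte block.
    a = sum(block) % 65536
    b = sum((len(block) - i) * ch for i, ch in enumerate(block)) % 65536
    return (b << 16) | a

def delta(f_old, f_new, bs=4):
    # Server side: the client sent the hashes of f_old's blocks;
    # emit block references where possible, literal bytes otherwise.
    table = {weak_hash(f_old[i:i + bs]): i
             for i in range(0, len(f_old) - bs + 1, bs)}
    out, i = [], 0
    while i + bs <= len(f_new):
        h = weak_hash(f_new[i:i + bs])
        if h in table and f_old[table[h]:table[h] + bs] == f_new[i:i + bs]:
            out.append(("copy", table[h], bs))
            i += bs
        else:
            out.append(("lit", f_new[i]))
            i += 1            # slide the window by one byte
    out.extend(("lit", ch) for ch in f_new[i:])
    return out

def apply_delta(f_old, ops):
    # Client side: rebuild f_new from f_old plus the encoded file.
    buf = bytearray()
    for op in ops:
        if op[0] == "copy":
            buf += f_old[op[1]:op[1] + op[2]]
        else:
            buf.append(op[1])
    return bytes(buf)
```

The toy block size bs=4 also illustrates the granularity problem above: one changed byte disrupts the whole block containing it.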
Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452

Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
The server sends hashes (unlike the client in rsync), and the client checks them
The server deploys the common fref to compress the new ftar (rsync compresses just it).
A multi-round protocol
k blocks of n/k elems
log(n/k) levels
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])
(figure: P aligned at position i of T, matching a prefix of the suffix T[i,N])
Occurrences of P in T = All suffixes of T having P as a prefix
Example: P = si, T = mississippi ⇒ P occurs at positions 4 and 7
SUF(T) = Sorted set of suffixes of T
Reduction
From substring search
To prefix search
The Suffix Tree
T# = mississippi#   (positions 1..12)
(figure: the compacted trie of all suffixes of T#; edges carry substring
labels such as i, s, p, si, ssi, i#, pi#, ppi#, mississippi#, and the 12
leaves store the starting positions of the suffixes)
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N^2) space if SUF(T) is stored explicitly

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

T = mississippi#
(each SA entry is a suffix pointer; e.g. P = si selects the contiguous range {7, 4})
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
T = mississippi#, P = si
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
Compare P with the suffix pointed by the middle SA entry:
P is larger ⇒ recurse on the right half
(2 memory accesses per step: one to SA, one to T)
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
Compare P = si with the suffix pointed by the new middle entry:
P is smaller ⇒ recurse on the left half
Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
overall, O(p log2 N) time
+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
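The O(p log2 N) search can be sketched in Python (a sketch: SA is 1-based as in the slides, and each comparison reads at most p chars of a suffix; two searches locate the contiguous range of Prop 1):

```python
def sa_range(text, sa, pattern):
    # Binary search over SA (1-based suffix pointers).
    p = len(pattern)
    pref = lambda r: text[sa[r] - 1:sa[r] - 1 + p]  # first p chars of suffix
    lo, hi = 0, len(sa)
    while lo < hi:                      # leftmost suffix with prefix >= P
        mid = (lo + hi) // 2
        if pref(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                      # leftmost suffix with prefix > P
        mid = (lo + hi) // 2
        if pref(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return first, lo                    # occurrences are sa[first:lo]
```

On the running example with P = si it returns the SA range holding the suffix pointers 7 and 4.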
Locating the occurrences
T = mississippi#, P = si ⇒ occ = 2
Two binary searches delimit the range (conceptually, the searches for
si# and si$): the SA entries 7 (sippi#) and 4 (sissippi#) are found,
i.e. the contiguous range {7, 4}.
Suffix Array search
• O(p + log2 N + occ) time   (assuming # < S < $)
Suffix Trays: O(p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree
[Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
T = mississippi#

 i   SA    suffix          Lcp[i] (with next suffix)
 1   12    #               0
 2   11    i#              1
 3    8    ippi#           1
 4    5    issippi#        4   (e.g. “issi” shared with ississippi#)
 5    2    ississippi#     0
 6    1    mississippi#    0
 7   10    pi#             1
 8    9    ppi#            0
 9    7    sippi#          2
10    4    sissippi#       1
11    6    ssippi#         3
12    3    ssissippi#
• How long is the common prefix between T[i,...] and T[j,...] ?
• It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a subarray Lcp[i,i+C-2] whose entries are all ≥ L
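The Lcp array and the repeated-substring test can be sketched in Python (a sketch: quadratic pairwise comparison, 1-based suffix pointers as in the slides; linear-time constructions such as Kasai's exist but are not shown here):

```python
def lcp_array(text, sa):
    # Lcp[i] = longest common prefix of the suffixes SA[i] and SA[i+1].
    def lcp(a, b):
        k = 0
        while (a + k <= len(text) and b + k <= len(text)
               and text[a + k - 1] == text[b + k - 1]):
            k += 1
        return k
    return [lcp(sa[i], sa[i + 1]) for i in range(len(sa) - 1)]

def has_repeat(text, sa, L):
    # A substring of length >= L occurs at least twice
    # iff some Lcp entry is >= L.
    return any(v >= L for v in lcp_array(text, sa))
```

On mississippi# the maximum Lcp entry is 4 (the repeated “issi”), so repeats of length 4 exist but none of length 5.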
Slide 14
Algoritmi per IR
Prologo
What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]
Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …
References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.
Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.
A bunch of scientific papers available on the course site !!
About this course
It is a mix of algorithms for
data compression
data indexing
data streaming (and sketching)
data searching
data mining
Massive data !!
Paradigm shift...
Web 2.0 is about the many
Big DATA ⇒ Big PC ?
We have three types of algorithms:
T1(n) = n, T2(n) = n^2, T3(n) = 2^n
... and assume that 1 step = 1 time unit
How many input data n can each algorithm process
within t time units?
n1 = t,  n2 = √t,  n3 = log2 t
What about a k-times faster processor?
...or, what is n when the time units are k*t ?
n1 = k*t,  n2 = √k * √t,  n3 = log2 (kt) = log2 k + log2 t
A new scenario
Data are more available than ever before
n ➜ ∞ ... is more than a theoretical assumption
The RAM model is too simple
Every step is charged Θ(1) time
Not just MIN #steps…
The memory hierarchy:
CPU (registers) → L1 → L2 cache → RAM → HD → net
Cache: few Mbs, some nanosecs per access, few words fetched
RAM: few Gbs, tens of nanosecs, some words fetched
HD: few Tbs, few millisecs, fetched in pages of B = 32K
net: many Tbs, even secs, fetched in packets
You should be “??-aware programmers”
I/O-conscious Algorithms
(figure: a magnetic disk — track, read/write head, read/write arm, magnetic surface)
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O [10^5–10^6 steps (Hennessy-Patterson)]
If N = (1+f)M, then the disk-average cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)
Space-conscious Algorithms
I/Os
search
access
Compressed
data structures
Streaming Algorithms
(figure: a magnetic disk surface being scanned)
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
(the same memory hierarchy as before: CPU registers → L1/L2 cache, few Mbs,
some nanosecs → RAM, few Gbs, tens of nanosecs → HD, few Tbs, few millisecs,
B = 32K pages → net, many Tbs, even secs, packets)
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock and its daily performance over time, find the
time window in which it achieved the best “market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
 n      4K    8K    16K   32K   128K   256K   512K   1M
 n^3    22s   3m    26m   3.5h  28h    --     --     --
 n^2    0     0     0     1s    26s    106s   7m     28m
An optimal solution
We assume every subsum ≠ 0
(figure: A split into stretches summing <0 before the optimum and >0 within it)
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
sum = 0; max = -∞;
For i = 1,...,n do
  If (sum + A[i] ≤ 0) then sum = 0;
  else { sum += A[i]; max = MAX{max, sum}; }
Note:
• the running sum is reset just before OPT starts (there it would be < 0);
• the running sum stays > 0 within OPT.
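The scan above can be sketched in Python (a sketch of the slide's variant: like the slide it assumes the array contains at least one positive element, otherwise the returned max is meaningless):

```python
def max_subarray_sum(A):
    # One linear scan: reset the running sum when it drops to <= 0
    # (such a prefix cannot start an optimal window), otherwise extend
    # it and track the best value seen so far.
    best = float("-inf")
    total = 0
    for x in A:
        if total + x <= 0:
            total = 0
        else:
            total += x
            best = max(best, total)
    return best
```

On the slide's array the optimum window is [6, 1, -2, 4, 3] with sum 12.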
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions ⇒ data get distributed arbitrarily !!!
B-tree internal nodes
B-tree leaves
(“tuple pointers")
What about
listing tuples
in order ?
Tuples
Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;          // Divide
03   Merge-Sort(A,i,m);    // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)        // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10^9 tuples ⇒ few Gbs
Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log2 n) random I/Os
[5ms] * n log2 n ≈ 1.5 years
In practice it is faster, because of caching (each level makes 2 R/W passes)...
Merge-Sort Recursion Tree
(figure: the log2 N levels of the recursion tree, merging sorted runs pairwise)
If the run-size is larger than B (i.e. after the first level!!),
fetching all of it in memory for merging does not help.
How do we deploy the disk/mem features ?
Produce N/M runs, each sorted in internal memory (no I/Os while sorting)
⇒ I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
Pass 1: Produce (N/M) sorted runs.
Pass i: merge X = M/B runs ⇒ log_{M/B} N/M passes
(figure: X input buffers and one output buffer, of B items each, held in
main memory between the disk-resident input runs and the merged output run)
Multiway Merging
(figure: one buffer Bf1..Bfx per run, each with a cursor p1..pX over its
current page; the merger repeatedly extracts min(Bf1[p1], Bf2[p2], …,
Bfx[pX]) and appends it to the output buffer Bfo. Fetch the next page of
run i when pi = B; flush Bfo to the merged run on disk when it is full;
stop at EOF of all X = M/B runs.)
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≈ log_{M/B} N/M
Optimal cost = Θ((N/B) log_{M/B} N/M) I/Os
In practice
M/B ≈ 1000 ⇒ #passes = log_{M/B} N/M ≈ 1
⇒ one multiway merge, 2 passes = few mins
(tuning depends on disk features)
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
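The min() over the X buffers is exactly what a min-heap computes; the merging step can be sketched in Python (a sketch: each in-memory "page" is shrunk to a single item, and whole runs are kept as lists instead of disk files):

```python
import heapq

def multiway_merge(runs):
    # Merge X sorted runs keeping one current item per run in memory;
    # the heap plays the role of min(Bf1[p1], ..., Bfx[pX]).
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)                          # goes to the output buffer
        if j + 1 < len(runs[i]):                 # fetch next item of run i
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out
```

With the heap, each of the N output items costs O(log X) comparisons, so one merge pass stays cheap even for large fan-out X = M/B.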
Can compression help?
Goal: enlarge M and reduce N
#passes = O(log_{M/B} N/M)
Cost of a pass = O(N/B)
Part of Vitter’s paper…
In order to address issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how far down we can go
Toy problem #3: Top-freq elements
Goal: Find the top-frequent item over a stream of N items (N large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest possible space (i.e. assuming the mode occurs > N/2 times).
A = b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables <X,C>
For each item s of the stream:
  if (C == 0) { X = s; C = 1; }
  else if (X == s) C++;
  else C--;
Return X;
Proof
If the majority item y exists but X ≠ y at the end, then every one of y’s
occurrences has been cancelled by a distinct “negative” mate.
Hence the mates would number ≥ #occ(y), so N ≥ 2 * #occ(y),
contradicting #occ(y) > N/2.
Problems arise if no item occurs > N/2 times: then X is meaningless.
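The algorithm (known as the Boyer-Moore majority vote) can be sketched in Python:

```python
def majority_candidate(stream):
    # One pair of variables <X, C>: O(1) space, one pass over the stream.
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1       # adopt s as the new candidate
        elif X == s:
            C += 1            # s supports the candidate
        else:
            C -= 1            # s cancels one occurrence of the candidate
    return X  # guaranteed correct only if some item occurs > N/2 times
```

On the stream A above (9 of the 17 items are c) it returns c; without a true majority the returned X must be verified by a second pass.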
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 * 10^9 chars ⇒ size = 6Gb
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms
What kind of data structure should we build to support
word-based searches ?
Solution 1: Term-Doc matrix   (n = 1 million docs, t = 500K terms)

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            1                1              0          0        0        1
Brutus            1                1              0          1        0        0
Caesar            1                1              0          1        1        1
Calpurnia         0                1              0          0        0        0
Cleopatra         1                0              0          0        0        0
mercy             1                0              1          1        1        1
worser            1                0              1          1        1        0

Entry = 1 if the play contains the word, 0 otherwise.  Space is 500Gb !
Solution 2: Inverted index

Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

We can still do better, i.e. reach 30–50% of the original text:
1. Typically …
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
A better index, but it is still >10 times the text !!!!
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string
is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO: they are 2^n, but the strings shorter than n bits number only
Σ_{i=1}^{n-1} 2^i = 2^n - 2
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self-information of s is:
i(s) = log2 (1 / p(s)) = - log2 p(s)
Lower probability ⇒ higher information
Entropy is the weighted average of i(s):
H(S) = Σ_{s ∈ S} p(s) log2 (1 / p(s))   bits
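The definition translates directly into code (a sketch; zero-probability symbols contribute nothing and are skipped):

```python
from math import log2

def entropy(probs):
    # H(S) = sum over s of p(s) * log2(1/p(s)), in bits.
    return sum(p * log2(1 / p) for p in probs if p > 0)
```

For instance, four equiprobable symbols carry exactly 2 bits each, while the skewed distribution {1/2, 1/4, 1/4} needs only 1.5 bits on average.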
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable-length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
(ambiguous: it decodes both as “c a” = 101|1 and as “a d” = 1|011)
A uniquely decodable code is one whose encoded sequences can
always be uniquely decomposed into codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0
1
0 1
a
0
1
b
c
d
Average Length
For a code C with codeword lengths L[s], the average length is defined as

  La(C) = Σ_{s∈S} p(s) L[s]

We say that a prefix code C is optimal if, for all prefix codes C', La(C) ≤ La(C').
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…
Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then pi < pj ⇒ L[si] ≥ L[sj].
Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

  H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

  La(C) ≤ H(S) + 1

(Shannon code: symbol s takes ⌈log2 1/p(s)⌉ bits.)
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

Merge a(.1) and b(.2) into (.3); merge (.3) and c(.2) into (.5); merge (.5) and d(.5) into the root (1).
Resulting codes: a=000, b=001, c=01, d=1

There are 2^(n-1) "equivalent" Huffman trees.
What about ties (and thus, tree depth)?
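The greedy merging of the two least-probable subtrees can be sketched with a heap; the individual bits depend on how ties are broken, but the codeword lengths match the example (a=3, b=3, c=2, d=1). An illustrative sketch:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build an optimal prefix code; returns {symbol: bitstring}."""
    tiebreak = count()  # avoids comparing the dict payloads on equal weights
    heap = [(p, next(tiebreak), {s: ''}) for s, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)  # two least probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in c1.items()}
        merged.update({s: '1' + c for s, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

codes = huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5})
print(sorted((s, len(c)) for s, c in codes.items()))
# -> [('a', 3), ('b', 3), ('c', 2), ('d', 1)]  (the bits depend on tie-breaking)
```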
Encoding and Decoding
Encoding: emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take the branch for each bit received. When at a leaf, output its symbol and return to the root.

  abc…     →  000 001 01…  =  00000101…
  101001…  →  d c b…
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store, for every level L of the tree:
  firstcode[L] = the first codeword on level L (of the form 00…0)
  Symbol[L,i], for each i in level L
This takes ≤ h^2 + |S| log |S| bits.

Canonical Huffman: Encoding (levels 1…5)
Canonical Huffman: Decoding
  firstcode[1]=2, firstcode[2]=1, firstcode[3]=1, firstcode[4]=2, firstcode[5]=0
  T = …00010…
Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is −log2(.999) ≈ .00144 bits.
If we were to send 1000 such symbols we might hope to use 1000 · .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
  1 extra bit per macro-symbol = 1/k extra bits per symbol
  Larger model to be transmitted
Shannon took infinite sequences, and k → ∞ !!
In practice, we have:
  The model takes |S|^k (k · log |S|) + h^2 bits (where h might be |S|)
  It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
  Symbols of the Huffman tree are the words of T
  The Huffman tree has fan-out 128
  Codewords are byte-aligned and tagged

[Figure: the word-based Huffman tree with fan-out 128 assigns each word a sequence of 7-bit digits; each digit is stored in a byte whose first bit tags whether the byte starts a codeword. E.g., for T = "bzip or not bzip": bzip = 1a 0b, or = 1g 0a 0b, not = 1g 0g 0a.]
CGrep and other ideas...
P = bzip = 1a 0b
[Figure: GREP is run directly on C(T), T = "bzip or not bzip": thanks to the tagged bytes, the scanner aligns on codeword boundaries and answers yes/no at each candidate position.]
Speed ≈ Compression ratio
You find this at … (under my Software projects)
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary = {a, bzip, not, or, space}; P = bzip = 1a 0b
[Figure: the tagged codeword of P is searched directly in C(S), S = "bzip or not bzip"; each alignment on a tagged byte yields a yes/no answer.]
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: the pattern P slid along the text T.]
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which arithmetic and bit-operations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}^n)
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

  H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]

P = 0101  →  H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5

s = s' if and only if H(s) = H(s').
Definition: let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m−1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr−1):

  H(Tr) = 2·H(Tr−1) − 2^m·T(r−1) + T(r+m−1)

T = 10110101
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 = 6
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P = 101111, q = 7
H(P) = 47, Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally:
  1·2 + 0 = 2 (mod 7)
  2·2 + 1 = 5 (mod 7)
  5·2 + 1 = 4 (mod 7)
  4·2 + 1 = 2 (mod 7)
  2·2 + 1 = 5 (mod 7)  →  5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr−1):
  2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
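The algorithm above can be sketched as follows, in its deterministic variant (fingerprint-equal candidates are verified, so false matches are filtered out). The base d = 256 and modulus q = 101 are illustrative choices, not taken from the slides:

```python
def rabin_karp(T, P, q=101):
    """Karp-Rabin: compare rolling fingerprints mod q, verifying candidates."""
    n, m = len(T), len(P)
    if m > n:
        return []
    d = 256                      # alphabet size used as the base
    h = pow(d, m - 1, q)         # d^(m-1) mod q, to drop the leftmost char
    hp = ht = 0
    for i in range(m):           # fingerprints of P and of the first window
        hp = (hp * d + ord(P[i])) % q
        ht = (ht * d + ord(T[i])) % q
    out = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:   # verify -> definite match
            out.append(r)
        if r < n - m:                      # roll the window one position
            ht = ((ht - ord(T[r]) * h) * d + ord(T[r + m])) % q
    return out

print(rabin_karp("10110101", "0101"))  # -> [4] (0-based; position 5 in the slides)
```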
Problem 1: Solution
Dictionary = {a, bzip, not, or, space}; P = bzip = 1a 0b
[Figure: as before, the tagged codeword of P is matched against C(S), S = "bzip or not bzip"; every alignment on a tagged byte gives a yes/no answer.]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m-by-n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j, i.e., M(i,j) = 1 iff P[1 … i] = T[j−i+1 … j]

Example: T = california, P = for

        c a l i f o r n i a
   f    0 0 0 0 1 0 0 0 0 0
   o    0 0 0 0 0 1 0 0 0 0
   r    0 0 0 0 0 0 1 0 0 0

M(3,7) = 1 signals an occurrence of P ending at position 7.
How does M solve the exact match problem?
How to construct M
We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one
Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:
And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting A's bits down by one and setting the first bit to 1. E.g., BitShift((0,1,1,0)) = (1,0,1,1).
Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 in the positions where character x appears in P.
Example: P = abaac

  U(a) = (1,0,1,1,0)   U(b) = (0,1,0,0,0)   U(c) = (0,0,0,0,1)
How to construct M
Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

  M(j) = BitShift(M(j−1)) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T ending at character j−1  ⇔  M(i−1, j−1) = 1
(2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1
BitShift moves bit M(i−1, j−1) into the i-th position; AND-ing with the i-th bit of U(T[j]) establishes whether both conditions are true.
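The construction above maps directly to code once each column M(j) is kept as a single integer, one bit per pattern position. A minimal sketch:

```python
def shift_and(T, P):
    """Bit-parallel Shift-And: bit i-1 of M is set iff P[1..i] = T[j-i+1..j]
    (assumes len(P) fits in a machine word; Python ints just grow)."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):      # U[c] has bit i set iff P[i+1] == c
        U[c] = U.get(c, 0) | (1 << i)
    M, out = 0, []
    for j, c in enumerate(T):
        # BitShift: shift and set the first bit, then AND with U(T[j])
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & (1 << (m - 1)):     # last bit set -> occurrence ending at j
            out.append(j - m + 1)
    return out

print(shift_and("xabxabaaca", "abaac"))  # -> [4] (0-based starting position)
```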
An example (T = xabxabaaca, P = abaac)

Column by column, M(j) = BitShift(M(j−1)) & U(T[j]):

  j=1: BitShift(M(0)) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)
  j=2: BitShift(M(1)) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
  j=3: BitShift(M(2)) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)
  …
  j=9: BitShift(M(8)) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)

The 5th bit of M(9) is set: an occurrence of P ends at position 9.
Shift-And method: Complexity
If m<=w, any column and vector U() fit in a
memory word.
Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
Very often in practice. Recall that w=64 bits in
modern architectures.
Some simple extensions
We want to allow the pattern to contain special symbols, like the class of chars [a-f].
Example: P = [a-b]baac

  U(a) = (1,0,1,1,0)   U(b) = (1,1,0,0,0)   U(c) = (0,0,0,0,1)

What about '?', '[^…]' (not)?
Problem 1: Another solution
Dictionary = {a, bzip, not, or, space}; P = bzip = 1a 0b
[Figure: as before, P's tagged codeword is scanned against C(S), S = "bzip or not bzip".]
Speed ≈ Compression ratio
Problem 2
Dictionary = {a, bzip, not, or, space}
Given a pattern P, find all the occurrences in S of all terms containing P as a substring (e.g. P = o).
[Figure: S = "bzip or not bzip"; both not = 1g 0g 0a and or = 1g 0a 0b contain P.]
Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P.
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: the patterns P1 and P2 slid along the text T.]
Naïve solution
Use an (optimal) Exact Matching Algorithm searching each
pattern in P
Complexity: O(nl+m) time, not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns in P.
R is a bitmap of length m, with R[i] = 1 iff S[i] is the first symbol of a pattern.
Use a variant of the Shift-And method searching for S:
  For any symbol c, U'(c) = U(c) AND R, i.e. U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern.
For any step j:
  compute M(j), then M(j) OR U'(T[j]). Why? To set to 1 the first bit of each pattern that starts with T[j].
  Check if there are occurrences ending in j. How?
Problem 3
Dictionary = {a, bzip, not, or, space}
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches (e.g. P = bot, k = 2).
[Figure: S = "bzip or not bzip", with its compressed form C(S) as before.]
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa
aatatccacaa
atcgaa
Agrep
Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m-by-n binary matrix, such that:
M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.
What is M^0?
How does M^k solve the k-mismatch problem?
Computing M^k
We compute M^l for all l = 0, …, k.
For each j compute M^0(j), M^1(j), …, M^k(j).
For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff either:
Computing Ml: case 1
The first i−1 characters of P match a substring of T ending at j−1 with at most l mismatches, and the next pair of characters in P and T are equal:

  BitShift(M^l(j−1)) & U(T[j])
Computing Ml: case 2
The first i−1 characters of P match a substring of T ending at j−1 with at most l−1 mismatches (the pair P[i], T[j] may then mismatch):

  BitShift(M^{l−1}(j−1))
Computing M^l
We compute M^l for all l = 0, …, k. For each j compute M^0(j), …, M^k(j). For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff

  M^l(j) = [BitShift(M^l(j−1)) & U(T[j])]  OR  BitShift(M^{l−1}(j−1))
Example (T = xabxabaaca, P = abaad)

M^0:
        1 2 3 4 5 6 7 8 9 10
    1   0 1 0 0 1 0 1 1 0 1
    2   0 0 1 0 0 1 0 0 0 0
    3   0 0 0 0 0 0 1 0 0 0
    4   0 0 0 0 0 0 0 1 0 0
    5   0 0 0 0 0 0 0 0 0 0

M^1:
        1 2 3 4 5 6 7 8 9 10
    1   1 1 1 1 1 1 1 1 1 1
    2   0 0 1 0 0 1 0 1 1 0
    3   0 0 0 1 0 0 1 0 0 1
    4   0 0 0 0 1 0 0 1 0 0
    5   0 0 0 0 0 0 0 0 1 0

M^1(5,9) = 1: P occurs at T[5,9] with at most 1 mismatch.
How much do we pay?
The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.
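The recurrence M^l(j) = [BitShift(M^l(j−1)) & U(T[j])] OR BitShift(M^{l−1}(j−1)) can be sketched as follows, keeping only the k+1 current columns, each as an integer (an illustrative sketch):

```python
def agrep(T, P, k):
    """Shift-And with errors: M[l] holds column M^l(j); bit i-1 is set iff
    P[1..i] matches T[j-i+1..j] with at most l mismatches."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)
    out = []
    for j, c in enumerate(T):
        prev = None                              # M^(l-1)(j-1)
        for l in range(k + 1):
            old = M[l]
            M[l] = ((old << 1) | 1) & U.get(c, 0)  # extend with a match
            if l > 0:
                M[l] |= (prev << 1) | 1            # or spend one mismatch
            prev = old
        if M[k] & (1 << (m - 1)):  # full prefix matched with <= k mismatches
            out.append(j - m + 1)
    return out

print(agrep("aatatccacaa", "atcgaa", 2))  # -> [3] (0-based; position 4 in the slides)
```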
Problem 3: Solution
Dictionary = {a, bzip, not, or, space}
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches (P = bot, k = 2).
[Figure: the agrep scan over C(S), S = "bzip or not bzip"; not = 1g 0g 0a matches P = bot with at most 2 mismatches.]
Agrep: more sophisticated operations
The Shift-And method can solve other ops.
The edit distance between two strings p and s is d(p,s) = minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol in p into a different one
Example: d(ananas, banane) = 3
Search by regular expressions. Example: (a|b)?(abc|a)
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
γ-code for integer encoding
The γ-code of x > 0 is (Length − 1) zeros followed by the binary representation of x, where Length = ⌊log2 x⌋ + 1.
e.g., 9 is represented as <000, 1001>.
The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).
Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers.
It is a prefix-free encoding…
Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:
  0001000001100110000011101100111
(Answer: 8, 6, 3, 59, 7)
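A γ-coder/decoder sketch that reproduces the exercise (illustrative):

```python
def gamma_encode(x):
    """gamma-code of x > 0: (len-1) zeros, then x in binary."""
    b = bin(x)[2:]
    return '0' * (len(b) - 1) + b

def gamma_decode(bits):
    """Split a concatenation of gamma-codes back into the integers."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == '0':                  # count the leading zeros
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))  # read z+1 bits of the value
        i += z + 1
    return out

print(gamma_encode(9))   # -> 0001001  (i.e. <000,1001>)
print(gamma_decode('0001000001100110000011101100111'))  # -> [8, 6, 3, 59, 7]
```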
Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2 log i + 1.
How good is this approach wrt Huffman? Compression ratio ≤ 2·H0(S) + 1.
Key fact:  1 ≥ Σ_{i=1,…,x} pi ≥ x·px  ⇒  x ≤ 1/px

How good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2 log i + 1.
The cost of the encoding is (recall i ≤ 1/pi):

  Σ_{i=1,…,|S|} pi |γ(i)|  ≤  Σ_{i=1,…,|S|} pi [2 log(1/pi) + 1]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman
End-tagged dense code
The rank r is mapped to r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix-code
Better compression: it uses all 7-bits configurations
(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is:
  Previously we used s = c = 128
  Now s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte
  s·c with 2 bytes, s·c^2 with 3 bytes, …
An example
5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words with at most 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words with at most 2 bytes, hence more words on 1 byte; thus it wins if the distribution is skewed…
Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256 − s.
Brute-force approach, or binary search: on real distributions, there seems to be one unique minimum.
Ks = max codeword length; Fsk = cumulative probability of the symbols whose codeword length is ≤ k.
Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than
byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
  Exploits temporal locality, and it is dynamic
  X = 1^n 2^n 3^n … n^n  →  Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits
Not much worse than Huffman... but it may be far better.
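The MTF transform itself is a few lines (0-based positions, as in the bzip example later in these slides; the alphabet, i.e. the initial list L, must be shared with the decoder):

```python
def mtf_encode(text, alphabet):
    """Replace each symbol by its current position in L, then move it to front."""
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)
        out.append(i)
        L.pop(i); L.insert(0, s)   # the "memory": recent symbols get small codes
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        s = L[i]
        out.append(s)
        L.pop(i); L.insert(0, s)
    return out

codes = mtf_encode("abcccb", "abcd")
print(codes)                               # -> [0, 1, 2, 0, 0, 1]
print(''.join(mtf_decode(codes, "abcd")))  # -> abcccb
```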
MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S
O ( S log S )
nx
g(p - p )
x 1 i 2
x
x
i
i -1
S
By Jensen’s:
O ( S log S )
n
x 1
x
[ 2 * log
N
1]
nx
O ( S log S ) N * [ 2 * H 0 ( X ) 1]
L a [ mtf ] 2 * H 0 ( X ) O (1)
MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:
Search tree
Hash Table
Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves
Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one initial bit suffice.
Properties:
  Exploits spatial locality, and it is a dynamic code
  There is a memory
  X = 1^n 2^n 3^n … n^n  →  Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)
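A sketch of RLE (itertools.groupby does the run detection):

```python
from itertools import groupby

def rle_encode(s):
    """abbbaacccca -> [(a,1),(b,3),(a,2),(c,4),(a,1)]"""
    return [(c, len(list(g))) for c, g in groupby(s)]

def rle_decode(runs):
    return ''.join(c * n for c, n in runs)

runs = rle_encode("abbbaacccca")
print(runs)  # -> [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]
assert rle_decode(runs) == "abbbaacccca"
```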
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):

  f(i) = Σ_{j=1}^{i−1} p(j)

e.g., a = .2, b = .5, c = .3  →  f(a) = .0, f(b) = .2, f(c) = .7
  a = [0, .2), b = [.2, .7), c = [.7, 1.0)

The interval for a particular symbol will be called the symbol interval (e.g., for b it is [.2, .7)).
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
  start: [0, 1)
  b: [.2, .7)
  a: [.2, .3)     (first 20% of [.2, .7))
  c: [.27, .3)    (last 30% of [.2, .3))
The final sequence interval is [.27, .3).
Arithmetic Coding
To code a sequence of symbols c with probabilities p[c], use the following:

  l_0 = 0,  s_0 = 1
  l_i = l_{i−1} + s_{i−1} · f[c_i]
  s_i = s_{i−1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  s_n = Π_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval.
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
  .49 ∈ [.2, .7)    →  b; the interval splits into a=[.2,.3), b=[.3,.55), c=[.55,.7)
  .49 ∈ [.3, .55)   →  b; the interval splits into a=[.3,.35), b=[.35,.475), c=[.475,.55)
  .49 ∈ [.475, .55) →  c
The message is bbc.
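The sequence-interval recurrence and the decoding procedure above can be sketched with floating-point arithmetic (fine for short messages; real coders use the integer version described below):

```python
def arith_interval(msg, p, f):
    """l_i = l_{i-1} + s_{i-1}*f[c_i], s_i = s_{i-1}*p[c_i]; returns (l, s)."""
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, s

def arith_decode(x, n, p, f):
    """Find the length-n message whose sequence interval contains x."""
    out = []
    for _ in range(n):
        for c in sorted(f, key=f.get):
            if f[c] <= x < f[c] + p[c]:
                out.append(c)
                x = (x - f[c]) / p[c]   # rescale x into the symbol interval
                break
    return ''.join(out)

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
l, s = arith_interval("bac", p, f)
print(round(l, 3), round(l + s, 3))  # -> 0.27 0.3
print(arith_decode(0.49, 3, p, f))   # -> bbc
```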
Representing a real number
Binary fractional representation:
  .75 = .11
  1/3 = .010101…
  11/16 = .1011
Algorithm: 1. x = 2·x;  2. if x < 1 output 0;  3. else x = x − 1 and output 1.
So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0, .33) → .01    [.33, .66) → .1    [.66, 1) → .11
Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

  code    min      max      interval
  .11     .110…    .111…    [.75, 1.0)
  .101    .1010…   .1011…   [.625, .75)

We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).
  Sequence interval: [.61, .79);  code interval (.101): [.625, .75)
Can use l + s/2 truncated to 1 + ⌈log(1/s)⌉ bits.
Bound on Arithmetic length: note that −log s + 1 = log(2/s).
Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
  1 + log(1/s) = 1 + log Π_i (1/p_i)
  ≤ 2 + Σ_{i=1,n} log(1/p_i)
  = 2 + Σ_{k=1,|S|} n·p_k log(1/p_k)
  = 2 + n·H0 bits
In practice ≈ n·H0 + 0.02·n bits, because of rounding.
Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:
Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s; m = 0; the message interval is expanded by 2
If u < R/2 then (bottom half)
  Output 0 followed by m 1s; m = 0; the message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m; the message interval is expanded by 2
All other cases: just continue…
You find this at …
Arithmetic ToolBox
As a state machine: the ATB maps the current interval (L,s) and a symbol c with distribution (p1,…,p|S|) to the new interval (L',s').
Therefore, even the distribution can change over time.
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
At each position, the ATB is fed with p[s | context], where s = c or esc: it maps (L,s) to (L',s') as before.
Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).
PPM: Example Contexts
String = ACCBACCACBA·B, k = 2. Counts per context ($ = escape):

  Context empty:  A = 4, B = 2, C = 5, $ = 3
  Context A:   C = 3, $ = 1
  Context B:   A = 2, $ = 1
  Context C:   A = 1, B = 2, C = 2, $ = 3
  Context AC:  B = 1, C = 2, $ = 2
  Context BA:  C = 1, $ = 1
  Context CA:  C = 1, $ = 1
  Context CB:  A = 2, $ = 1
  Context CC:  A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
Dictionary = all substrings starting before the Cursor; e.g. output <2,3,c>
Algorithm’s step:
Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match
Advance by len + 1
A buffer “window” has fixed length and moves
Example: LZ77 with window
a a c a a c a b c a b a a a c
(0,0,a)
a a c a a c a b c a b a a a c
(1,1,c)
a a c a a c a b c a b a a a c
(3,4,b)
a a c a a c a b c a b a a a c
(3,3,a)
a a c a a c a b c a b a a a c
(1,2,c)
Window size = 6
Longest match
within W
Next character
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if l > d? (overlap with text to be compressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
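The coding and decoding loops above can be sketched as follows (the trie of dictionary phrases is represented, for brevity, as a Python dict from strings to ids; id 0 is the empty string):

```python
def lz78_encode(s):
    """Output (id, char) pairs; each emitted phrase Sc is added to the dictionary."""
    dic, out = {'': 0}, []
    cur = ''
    for c in s:
        if cur + c in dic:
            cur += c                       # extend the longest match
        else:
            out.append((dic[cur], c))
            dic[cur + c] = len(dic)        # add Sc to the dictionary
            cur = ''
    if cur:                                # flush a pending match
        out.append((dic[cur[:-1]], cur[-1]))
    return out

def lz78_decode(pairs):
    """Rebuild the same dictionary and concatenate the phrases."""
    dic, out = [''], []
    for i, c in pairs:
        w = dic[i] + c
        dic.append(w)
        out.append(w)
    return ''.join(out)

enc = lz78_encode("aabaacabcabcb")
print(enc)  # -> [(0, 'a'), (1, 'b'), (1, 'a'), (0, 'c'), (2, 'c'), (5, 'b')]
assert lz78_decode(enc) == "aabaacabcabcb"
```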
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
LZW: Encoding Example
T = aabaacababacb

  output 112 (a)    →  add 256 = aa
  output 112 (a)    →  add 257 = ab
  output 113 (b)    →  add 258 = ba
  output 256 (aa)   →  add 259 = aac
  output 114 (c)    →  add 260 = ca
  output 257 (ab)   →  add 261 = aba
  output 261 (aba)  →  add 262 = abac
  output 114 (c)    →  add 263 = cb
LZW: Decoding Example
  input 112  →  a
  input 112  →  a a          add 256 = aa
  input 113  →  a a b        add 257 = ab
  input 256  →  a a b a a    add 258 = ba
  input 114  →  a a b a a c  add 259 = aac
  input 257  →  a a b a a c a b    add 260 = ca
  input 261  →  ? (261 is not yet in the dictionary: the decoder is one step behind)
  input 114  →  a a b a a c a b a b …   261 = aba is added one step later
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
Consider the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows
  F                 L
  #  mississipp     i
  i  #mississip     p
  i  ppi#missis     s
  i  ssippi#mis     s
  i  ssissippi#     m
  m  ississippi     #
  p  i#mississi     p
  p  pi#mississ     i
  s  ippi#missi     s
  s  issippi#mi     s
  s  sippi#miss     i
  s  sissippi#m     i

L is the BWT of T (1994). A famous example: much longer…
A useful tool: L → F mapping
How do we map L's chars onto F's chars? …We need to distinguish equal chars in F…
Take two equal chars of L and rotate their rows rightward: they keep the same relative order!!
The BWT is invertible
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i ppi #
InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
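The transform and its inversion via the LF mapping can be sketched as follows. The forward step sorts the rotations explicitly (quadratic, fine for examples; real implementations derive L from the Suffix Array, as shown next):

```python
def bwt(T):
    """L = last column of the sorted rotations of T (T must end with '#')."""
    n = len(T)
    rows = sorted(T[i:] + T[:i] for i in range(n))
    return ''.join(row[-1] for row in rows)

def ibwt(L):
    """Invert via the LF mapping: equal chars keep their relative order in L and F."""
    n = len(L)
    # F is L sorted; LF[r] = row of F holding the same char occurrence as L[r]
    order = sorted(range(n), key=lambda r: (L[r], r))
    LF = [0] * n
    for f_row, r in enumerate(order):
        LF[r] = f_row
    out, r = [], 0            # row 0 is the rotation starting with '#'
    for _ in range(n):
        out.append(L[r])      # L[r] precedes F[r] in T: walk backward
        r = LF[r]
    return ''.join(reversed(out))[1:] + '#'   # rotate the sentinel to the end

L = bwt("mississippi#")
print(L)        # -> ipssm#pissii
print(ibwt(L))  # -> mississippi#
```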
How to compute the BWT ?
Build the BWT matrix via the Suffix Array SA of T:

  SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3   →   L = i p s s m # p i s s i i

We said that L[i] precedes F[i] in T; e.g., L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i] − 1].
How to construct SA from T ?
Input: T = mississippi#
SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3 corresponds to the sorted suffixes:

  #, i#, ippi#, issippi#, ississippi#, mississippi, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

Elegant but inefficient. Obvious inefficiencies:
  • Θ(n^2 log n) time in the worst case
  • Θ(n log n) cache misses or I/O faults
Many algorithms, now…
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii   (# at position 16)
MTF-list = [i,m,p,s]
MTF = 020030000030030200300300000100000
After the +1 shift for RLE0: 030040000040040300400400000200000   (Bin(6)=110, Wheeler's code)
RLE0 = 03141041403141410210, over an alphabet of size |S|+1
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols… plus γ(16), plus the original MTF-list (i,m,p,s)
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node one can reach any other via an undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node one can reach any other via a directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
Largest artifact ever conceived by humans
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph: V = Routers, E = communication links
The “cosine” graph (undirected, weighted): V = static web pages, E = semantic distance between pages
Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q, and has been clicked by some user who issued q
Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, ...)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: Pr that a node has x links is ∝ 1/x^α, α ≈ 2.1
The In-degree distribution
Altavista crawl, 1999
Indegree follows power law distribution
WebBase Crawl 2001
Pr[ in-degree(u) = k ] ∝ 1/k^α, with α ≈ 2.1
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: Pr that a node has x links is ∝ 1/x^α, α ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing links
A Picture of the Web Graph
j
i
21 million pages, 150 million links
URL-sorting
Berkeley
Stanford
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - sk-1 - 1}
For negative entries (only the first gap can be negative): map v ≥ 0 to 2v and v < 0 to 2|v| - 1
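A minimal sketch of the gap encoding (the node id and successor list here are made up for illustration):

```python
def gap_encode(x, succs):
    # succs sorted; the first gap is relative to the source node x and
    # may be negative, later gaps are (difference - 1) >= 0 by locality
    return ([succs[0] - x] +
            [succs[i] - succs[i - 1] - 1 for i in range(1, len(succs))])

def to_natural(v):
    # fold a possibly-negative first gap into the naturals:
    # v >= 0 -> 2v, v < 0 -> 2|v| - 1 (cf. the *2 / *2-1 rules below)
    return 2 * v if v >= 0 else 2 * (-v) - 1

gaps = gap_encode(15, [13, 15, 16, 17, 18, 19, 23, 24, 203])
```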
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
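The three conventions can be sketched as follows (illustrative Python over a made-up copy-list; note the decrement can only apply to blocks after the first, since the first block may legitimately be 0):

```python
from itertools import groupby

def copy_blocks(bits):
    # RLE of the copy-list: lengths of the alternating runs of 1s/0s.
    runs = [len(list(g)) for _, g in groupby(bits)]
    if bits and bits[0] == 0:
        runs = [0] + runs       # first block is 0 when list starts with 0
    runs = runs[:-1]            # last block omitted (list length is known)
    return [runs[0]] + [r - 1 for r in runs[1:]] if runs else []

blocks = copy_blocks([0, 1, 1, 1, 1, 0, 0, 0, 1, 1])
```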
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.
Consecutivity in extra-nodes
Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
data
knowledge
about data
at receiver
sender
receiver
network links are getting faster and faster, but
many clients are still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
How if the sender has never seen data at receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested
page and the ones available in cache
File synchronization
[diff, zdelta, REBL,…]
[rsync, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
The LZ77-scheme provides an efficient, optimal solution:
fknown is the “previously encoded text”; compress the concatenation fknown · fnew, starting from fnew
zdelta is one of the best implementations
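zdelta itself is not sketched here, but Python's zlib offers a close analogue of the same LZ77 idea: pass fknown as a preset dictionary, so back-references can reach into it (an approximation of zdelta, not its actual implementation):

```python
import zlib

def zdelta_like(f_known: bytes, f_new: bytes) -> bytes:
    # Compress f_new with f_known as a preset dictionary: LZ77 copies
    # can then reach back into f_known, so only the differences cost.
    c = zlib.compressobj(zdict=f_known)
    return c.compress(f_new) + c.flush()

def undelta(f_known: bytes, delta: bytes) -> bytes:
    d = zlib.decompressobj(zdict=f_known)
    return d.decompress(delta) + d.flush()

old = b"the quick brown fox jumps over the lazy dog. " * 50
new = old.replace(b"lazy", b"sleepy")
delta = zdelta_like(old, new)    # much smaller than f_new itself
```

Note that zlib only looks at the last 32Kb of the dictionary (its LZ77 window), while zdelta is engineered around exactly this kind of reference-file access.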
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
Client
reference
request
Slow-link
Delta-encoding
Proxy
reference
request
Fast-link
web
Page
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f ∈ F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
[Figure: weighted graph GF with a dummy node; edge weights are zdelta sizes (e.g. 20, 220, 620, 2000), dummy-node edges are gzip sizes]
          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic
Improvement
What about
many-to-one compression?
(group of files)
Problem: Constructing G is very costly: n² edge calculations (zdelta executions)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files
Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta executions. Nonetheless, still n² time
          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
request
f_new
update
Server
f_old
Client
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
hashes
f_new
Server
encoded file
f_old
Client
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
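A toy sketch of the rsync matching step (block size, hash choices, and data are made up; a real implementation updates the weak rolling hash in O(1) per shift instead of recomputing it, and uses Adler-style + MD5 hashes):

```python
import hashlib

B = 8   # toy block size; rsync defaults to max{700, sqrt(n)} bytes

def weak(block):
    # stand-in for the 4-byte rolling hash (here simply recomputed)
    return sum(block) % 65536

def rsync_encode(f_new, table):
    # table: weak hash -> (block index in f_old, strong hash)
    out, lit, i = [], bytearray(), 0
    while i + B <= len(f_new):
        m = table.get(weak(f_new[i:i + B]))
        if m and m[1] == hashlib.md5(f_new[i:i + B]).hexdigest():
            if lit:
                out.append(bytes(lit))
                lit = bytearray()
            out.append(m[0])            # "copy block m[0] of f_old"
            i += B
        else:
            lit.append(f_new[i])        # unmatched byte sent literally
            i += 1
    lit.extend(f_new[i:])
    if lit:
        out.append(bytes(lit))
    return out

f_old = b"abcdefghABCDEFGHijklmnop"
table = {weak(f_old[j:j + B]): (j // B, hashlib.md5(f_old[j:j + B]).hexdigest())
         for j in range(0, len(f_old), B)}
enc = rsync_encode(b"XXabcdefghijklmnop", table)
```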
Rsync: some experiments
gcc size
total
27288
gzip
7563
zdelta
227
rsync
964
emacs size
27326
8577
1431
4452
Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
Server sends hashes (unlike rsync, where the client does), client checks them
Server deploys the common fref to compress the new ftar (rsync just compresses it)
A multi-round protocol
k blocks of n/k elems
Log n/k levels
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Recurring minimum for
improving the estimate
+ 2 SBF
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (ie. T[i,N])
i P
T
T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix
P = si
T = mississippi
mississippi
4,7
SUF(T) = Sorted set of suffixes of T
Reduction
From substring search
To prefix search
The Suffix Tree
[Figure: suffix tree of T# = mississippi# (positions 2 4 6 8 10); edges labeled with substrings such as i, s, si, ssi, p, pi#, ppi#, mississippi#; the 12 leaves carry the starting positions of the suffixes]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space, if the suffixes are stored explicitly
SA    SUF(T)
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi#
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#
T = mississippi#
suffix pointer
P=si
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp
[Figure: binary-search steps over the SA of T = mississippi#]
P is larger
2 accesses per step
P = si
Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp
[Figure: binary-search steps over the SA of T = mississippi#]
P is smaller
P = si
Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char cmp
⇒ overall, O(p log₂ N) time
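The search just described can be sketched as follows (the SA is built naively for illustration; chr(0x10FFFF) stands in for the large sentinel on the right boundary):

```python
def sa_range(T, sa, P):
    # Indirect binary search on SA: each of the O(log N) steps compares
    # P against a suffix in O(|P|) time, O(|P| log N) overall.
    def lower(Q):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if T[sa[mid]:sa[mid] + len(Q)] < Q:
                lo = mid + 1
            else:
                hi = mid
        return lo
    # suffixes prefixed by P are contiguous (Prop 1); chr(0x10FFFF)
    # plays the role of a symbol larger than any in T
    return lower(P), lower(P + chr(0x10FFFF))

T = "mississippi#"
sa = sorted(range(len(T)), key=lambda i: T[i:])  # naive SA, illustration
lo, hi = sa_range(T, sa, "si")
occ = sorted(sa[k] + 1 for k in range(lo, hi))   # 1-based positions
```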
Improved to O(p + log₂ N) [Manber-Myers, ’90] and to O(p + log₂ |S|) [Cole et al., ’06]
Locating the occurrences
[Figure: SA of T = mississippi#, occ = 2: the contiguous range for P = si contains the entries 7 (sippi#) and 4 (sissippi#); its boundaries are found by searching si# and si$]
Suffix Array search
• O (p + log2 N + occ) time
(range delimited by searching P# and P$, with # < Σ < $)
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree
[Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
Lcp = 0 1 1 4 0 0 1 0 2 1 3
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
T = mississippi#
[Figure: Lcp aligned between adjacent SA entries; e.g. issippi# and ississippi# share a prefix of length 4]
• How long is the common prefix between T[i,...] and T[j,...] ?
• It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
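For completeness (not in the slides): the Lcp array can be built from T and SA in O(N) time with Kasai et al.'s algorithm; a minimal sketch:

```python
def lcp_array(T, sa):
    # Kasai's linear-time construction: lcp[i] = length of the longest
    # common prefix of the adjacent suffixes SA[i] and SA[i+1].
    n = len(T)
    rank = [0] * n
    for i, s in enumerate(sa):
        rank[s] = i
    lcp = [0] * (n - 1)
    h = 0                       # h decreases by at most 1 per text position
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and T[i + h] == T[j + h]:
                h += 1
            lcp[rank[i] - 1] = h
            if h:
                h -= 1
        else:
            h = 0
    return lcp

T = "mississippi#"
sa = sorted(range(len(T)), key=lambda i: T[i:])
lcp = lcp_array(T, sa)
```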
Slide 2
Paradigm shift...
Web 2.0 is about the many
Big
DATA Big PC ?
We have three types of algorithms:
T1(n) = n, T2(n) = n2, T3(n) = 2n
... and assume that 1 step = 1 time unit
How many input data n each algorithm may process
within t time units?
n1 = t,
n2 = √t,
n3 = log2 t
What about a k-times faster processor?
...or, what is n, when the time units are k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
A new scenario
Data are more available than ever before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Step cost is Θ(1) time
Not just MIN #steps…
[Memory hierarchy]
CPU registers + Cache (L1, L2): few Mbs, some nanosecs, few words fetched
RAM: few Gbs, tens of nanosecs, some words fetched
HD: few Tbs, few millisecs, B = 32K page
net: many Tbs, even secs, packets
You should be “??-aware programmers”
I/O-conscious Algorithms
track
read/write head
read/write arm
magnetic surface
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O [10⁵–10⁶ (Hennessy-Patterson)]
If N = (1+f)M, then the average disk cost per step is:
C · p · f/(1+f), which is at least 10⁴ · f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) · (p · f/(1+f) · C) ≈ 30 · f/(1+f)
Space-conscious Algorithms
I/Os
search
access
Compressed
data structures
Streaming Algorithms
track
read/write head
read/write arm
magnetic surface
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
[Memory hierarchy figure, as before: CPU registers + Cache (L1, L2) – RAM – HD – net]
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
n     4K   8K   16K   32K   128K   256K   512K   1M
n³    22s  3m   26m   3.5h  28h    --     --     --
n²    0    0    0     1s    26s    106s   7m     28m
An optimal solution
We assume every subsum≠0
[Figure: A split into a negative prefix (< 0) followed by the Optimum window (> 0)]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
sum = 0; max = -1;
For i = 1,...,n do
  if (sum + A[i] ≤ 0) then sum = 0;
  else { sum += A[i]; max = MAX{max, sum}; }
Note:
• sum < 0 when OPT starts;
• sum > 0 within OPT
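The one-pass algorithm above, runnable:

```python
def max_subarray(A):
    # Reset the running sum as soon as it would drop to <= 0:
    # no optimal window starts with such a prefix.
    best, run = -1, 0
    for x in A:
        if run + x <= 0:
            run = 0
        else:
            run += x
            best = max(best, run)
    return best

ans = max_subarray([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7])  # optimum is 12
```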
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions Data get distributed arbitrarily !!!
B-tree internal nodes
B-tree leaves
(“tuple pointers")
What about
listing tuples
in order ?
Tuples
Possibly 10⁹ random I/Os = 10⁹ × 5ms ⇒ ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
01  if (i < j) then
02    m = (i+j)/2;          // Divide
03    Merge-Sort(A,i,m);    // Conquer
04    Merge-Sort(A,m+1,j);
05    Merge(A,i,m,j)        // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10⁹ tuples ⇒ few Gbs
Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log₂ n) random I/Os
[5ms] × n log₂ n ≈ 1.5 years
In practice, it is faster because of caching...
2 passes (R/W)
Merge-Sort Recursion Tree
log₂ N levels
If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help
[Figure: merge-sort recursion tree; sorted runs are repeatedly merged pairwise, e.g. (5 1)(13 19)(9 7)(4 15)(3 8)(12 17)(6 11) up to the fully sorted sequence, with runs of size M fitting in memory]
How do we deploy the disk/mem features ?
N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
Pass 1: Produce (N/M) sorted runs.
Pass i: merge X = M/B runs ⇒ log_{M/B}(N/M) passes
[Figure: X input buffers (INPUT 1..X) and one OUTPUT buffer, each of B items, in main memory; runs stream from disk and the merged run streams back to disk]
Multiway Merging
[Figure: X = M/B input buffers Bf1..Bfx, with pointers p1..pX over the current pages of Run 1..Run X; repeatedly move min(Bf1[p1], Bf2[p2], …, Bfx[pX]) to the output buffer Bfo; fetch a new page when some pi = B, flush Bfo when full, until EOF; the output file is the merged run]
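The merging logic can be sketched with a min-heap standing in for the X input buffers (in-memory lists replace the disk runs):

```python
import heapq

def multiway_merge(runs):
    # One merge pass: keep the head of each sorted run in a min-heap,
    # repeatedly pop the global minimum and refill from that run.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        v, i, j = heapq.heappop(heap)
        out.append(v)
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

merged = multiway_merge([[1, 2, 5, 10], [2, 7, 9], [3, 4, 8]])
```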
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os
In practice
M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
⇒ one multiway merge ⇒ 2 passes = few mins
Tuning depends
on disk features
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
Can compression help?
Goal: enlarge M and reduce N
#passes = O(logM/B N/M)
Cost of a pass = O(N/B)
Part of Vitter’s paper…
In order to address issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how much we can go
Toy problem #3: Top-freq elements
Goal: find the top-frequent item over a stream of N items (Σ large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space (i.e. the mode, if it occurs > N/2 times).
A=b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables <X,C>
For each item s of the stream:
  if (X == s) then C++
  else { C--; if (C == 0) { X = s; C = 1; } }
Return X;
Proof
If X ≠ y at the end, then every occurrence of y has a “negative”
mate, i.e. a distinct occurrence of another item that cancelled it.
Hence the mates are ≥ #occ(y), so N ≥ 2 · #occ(y),
contradicting #occ(y) > N/2.
Problems if the mode occurs ≤ N/2 times.
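The <X,C> scheme, runnable (this is the Boyer-Moore majority-vote idea; the variant below re-elects X when the counter hits zero, equivalent to the slide's version):

```python
def majority_candidate(stream):
    # O(1) space: X ends up being the majority item whenever
    # one item occurs more than N/2 times.
    X, C = None, 0
    for s in stream:
        if X == s:
            C += 1
        else:
            C -= 1
            if C <= 0:
                X, C = s, 1      # re-elect the candidate
    return X

X = majority_candidate("bacccdcbaaaccbccc")   # 'c' occurs 9 times out of 17
```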
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 · 10⁹ chars ⇒ size = 6Gb
n = 10⁶ documents
TotT = 10⁹ total terms (avg term length is 6 chars)
t = 5 · 10⁵ distinct terms
What kind of data structure should we build to support
word-based searches ?
Solution 1: Term-Doc matrix (t = 500K terms, n = 1 million docs)

           Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            1                1             0          0       0        1
Brutus            1                1             0          1       0        0
Caesar            1                1             0          1       1        1
Calpurnia         0                1             0          0       0        0
Cleopatra         1                0             0          0       0        0
mercy             1                0             1          1       1        1
worser            1                0             1          1       1        0

1 if the play contains the word, 0 otherwise. Space is 500Gb !
Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

1. Typically we can still do better: i.e. 30÷50% of the original text
2. We have 10⁹ total terms ⇒ at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
Better index, but yet it is >10 times the text !!!!
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string
is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2ⁿ, but we have fewer compressed messages:
Σ_{i=1}^{n-1} 2^i = 2ⁿ - 2
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log₂ (1/p(s)) = -log₂ p(s)
Lower probability ⇒ higher information
Entropy is the weighted average of i(s)
H(S) = Σ_{s∈S} p(s) · log₂ (1/p(s)) bits
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be
uniquely decomposed into its codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: binary trie with leaves a = 0, b = 100, c = 101, d = 11]
Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = Σ_{s∈S} p(s) · L[s]
We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely
decodable code, there exists a prefix code with the same
codeword lengths, and thus the same optimal average
length. And vice versa…
Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}
then pi < pj ⇒ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have
H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that
La(C) ≤ H(S) + 1
(the Shannon code, which assigns ⌈log₂ (1/p(s))⌉ bits to s)
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: merge a(.1) + b(.2) → (.3); then (.3) + c(.2) → (.5); then (.5) + d(.5) → (1)]
a=000, b=001, c=01, d=1
There are 2^{n-1} “equivalent” Huffman trees (flip the 0/1 labels of any internal node)
What about ties (and thus, tree depth) ?
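A minimal sketch of the greedy construction (the tie-breaking counter fixes one of the equivalent trees, so the codeword bits may be complemented w.r.t. the slide, but the lengths match):

```python
import heapq
from itertools import count

def huffman_codes(probs):
    # Greedily merge the two least-probable subtrees; the running
    # counter breaks ties deterministically (different tie-breaking
    # rules yield the "equivalent" trees mentioned above).
    tiebreak = count()
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

codes = huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5})
```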
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc... → 00000101
101001... → dcb
[Figure: the codeword tree with a=000, b=001, c=01, d=1]
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store for any level L:
firstcode[L] = the first (smallest) codeword of length L
Symbol[L,i], for each i in level L
This is ≤ h² + |S| log |S| bits
Canonical Huffman
Encoding
[Figure: canonical codeword tree, levels 1–5]
Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0
T=...00010...
Problem with Huffman Coding
Consider a symbol with probability .999. Its
self information is
-log₂(0.999) ≈ 0.00144 bits
If we were to send 1000 such symbols we
might hope to use 1000 × 0.00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
1 extra bit per macro-symbol = 1/k extra-bits per symbol
Larger model to be transmitted
Shannon took infinite sequences, i.e. k → ∞ !!
In practice, we have:
the model takes |S|ᵏ · (k · log |S|) + h² bits (where h might be |S|)
it is H₀(Sᴸ) ≤ L · Hₖ(S) + O(k · log |S|), for each k ≤ L
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
[Figure: tagged Huffman coding of T = “bzip or not bzip”: each byte carries 7 bits of the Huffman code plus 1 tag bit marking the first byte of a codeword, e.g. C(bzip) = 1a 0b over byte symbols a, b, g, …]
CGrep and other ideas...
P = bzip = 1a 0b
T = “bzip or not bzip”
[Figure: compressed-domain GREP searches the byte-aligned codeword of P directly in C(T); tag bits rule out false matches inside other codewords]
Speed ≈ Compression ratio
You find it under my Software projects
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary: a, bzip, not, or
P = bzip = 1a 0b
S = “bzip or not bzip”
[Figure: search the tagged codeword of P in C(S); alignments inside other codewords are rejected (no), true occurrences accepted (yes)]
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T
A
B
P
A
B
C
A
B
D
A
B
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}ⁿ).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:
H(s) = Σ_{i=1}^{m} 2^{m-i} · s[i]
P = 0101 ⇒ H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’)
Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1)
H(T_r) = 2·H(T_{r-1}) - 2ᵐ·T(r-1) + T(r+m-1)
T = 10110101
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 - 2⁴·1 + 0 = 22 - 16 = 6 ✓
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P=101111
q=7
H(P) = 47
Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally!
1·2 + 0 (mod 7) = 2
2·2 + 1 (mod 7) = 5
5·2 + 1 (mod 7) = 4
4·2 + 1 (mod 7) = 2
2·2 + 1 (mod 7) = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr-1), since
2ᵐ (mod q) = 2·(2^{m-1} (mod q)) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
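The whole algorithm in a few lines (over a byte alphabet instead of the binary one, and with a fixed prime q for brevity, where the real algorithm picks a random prime ≤ I):

```python
def karp_rabin(T, P, q=2**31 - 1):
    # Fingerprints mod a prime q, with H_q(T_r) updated from
    # H_q(T_{r-1}) in O(1) time; the explicit check on equal
    # fingerprints means no false match is ever reported.
    n, m = len(T), len(P)
    if m > n:
        return []
    base = 256
    top = pow(base, m - 1, q)               # base^(m-1) mod q
    hp = ht = 0
    for i in range(m):
        hp = (hp * base + ord(P[i])) % q
        ht = (ht * base + ord(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:    # verification step
            occ.append(r + 1)               # 1-based position
        if r + m < n:
            ht = ((ht - ord(T[r]) * top) * base + ord(T[r + m])) % q
    return occ

occ = karp_rabin("10110101", "0101")
```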
Problem 1: Solution
Dictionary: a, bzip, not, or
P = bzip = 1a 0b
S = “bzip or not bzip”
[Figure: scanning C(S) for the tagged codeword of P, as in Problem 1]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for
[Figure: the m×n matrix M for T = california, P = for; the only 1 in row 3 is at column 7, where the occurrence of “for” ends]
How does M solve the exact match problem?
How to construct M
We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one
Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:
And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.
e.g. BitShift( (0,1,1,0)ᵗ ) = (1,0,1,1)ᵗ
Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x) is set to 1 in the
positions of P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)ᵗ   U(b) = (0,1,0,0,0)ᵗ   U(c) = (0,0,0,0,1)ᵗ
How to construct M
Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by
M(j) = BitShift( M(j-1) ) & U( T[j] )
For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T
ending at character j-1 ⇔ M(i-1,j-1) = 1
(2) P[i] = T[j] ⇔ the i-th bit of U(T[j]) = 1
BitShift moves bit M(i-1,j-1) into position i;
ANDing with the i-th bit of U(T[j]) establishes whether both hold
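The construction can be sketched directly with Python's unbounded integers (so the m ≤ w restriction disappears, at the O(m/w)-per-step cost accounted for in the complexity analysis):

```python
def shift_and(T, P):
    # Bit-parallel Shift-And: bit i of M is 1 iff the prefix of P of
    # length i+1 matches the text substring ending at the current char.
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)       # U(x): positions of x in P
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)    # BitShift, then AND with U(T[j])
        if M & (1 << (m - 1)):              # top bit set: full match at j
            occ.append(j - m + 2)           # 1-based starting position
    return occ

occ = shift_and("california", "for")
```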
An example, j=1: T = xabxabaaca, P = abaac
M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵗ & U(x) = (0,0,0,0,0)ᵗ
An example, j=2:
M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵗ & U(a) = (1,0,0,0,0)ᵗ
An example, j=3:
M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵗ & U(b) = (0,1,0,0,0)ᵗ
An example, j=9:
M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵗ & U(c) = (0,0,0,0,1)ᵗ
M(5,9) = 1 ⇒ an occurrence of P ends at position 9 of T
Shift-And method: Complexity
If m<=w, any column and vector U() fit in a
memory word.
Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when the pattern length is close
to the word size, which is very common in practice.
Recall that w = 64 bits in modern architectures.
Some simple extensions
We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)ᵗ   U(b) = (1,1,0,0,0)ᵗ   U(c) = (0,0,0,0,1)ᵗ
What about ‘?’, ‘[^…]’ (not) ?
Problem 1: Another solution
Dictionary: a, bzip, not, or
P = bzip = 1a 0b
S = “bzip or not bzip”
[Figure: scanning C(S) for the tagged codeword of P, as before]
Speed ≈ Compression ratio
Problem 2
Given a pattern P, find all the occurrences in S of all terms containing P as a substring
[Figure: same dictionary and C(S) as before, with P = o and S = "bzip or not bzip". The terms containing "o" are not (= 1g0g0a) and or (= 1g0a0b), and C(S) is scanned for each of their codewords.]
Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].
[Figure: the patterns P1 and P2 occurring at various positions of T.]
Naïve solution
Use an (optimal) exact-matching algorithm to search for each pattern in P
Complexity: O(nl + m) time — not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns in P
R is a bitmap of length m.
R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method, searching for S
For any symbol c, U'(c) = U(c) AND R
U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
For any step j,
compute M(j)
then M(j) OR U'(T[j]). Why?
It sets to 1 the first bit of each pattern that starts with T[j]
Check if there are occurrences ending at j. How?
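A Python sketch of this multi-pattern variant (illustrative names; E marks the last bit of each pattern, which answers the "How?" above):

```python
def multi_shift_and(patterns, T):
    """1-based end positions in T of occurrences of any pattern."""
    S = "".join(patterns)          # concatenation of the patterns
    U, R, E, pos = {}, 0, 0, 0     # R: first-symbol bits, E: last-symbol bits
    for P in patterns:
        R |= 1 << pos
        E |= 1 << (pos + len(P) - 1)
        pos += len(P)
    for i, c in enumerate(S):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T, 1):
        Uc = U.get(c, 0)
        # Shift-And step, then OR in U'(c) = U(c) AND R
        M = ((M << 1) & Uc) | (Uc & R)
        if M & E:                  # some pattern's last bit is set
            occ.append(j)
    return occ
```

For instance, with patterns {ab, ba} and T = abab it reports occurrences ending at positions 2, 3 and 4.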
Problem 3
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches
[Figure: same dictionary and C(S) as before, with P = bot and k = 2; S = "bzip or not bzip".]
Agrep: Shift-And method with errors
We extend the Shift-And method to find inexact occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.
aatatccacaa
   atcgaa
aatatccacaa
 atcgaa
Agrep
Our current goal: given k, find all the occurrences of P in T with up to k mismatches
We define the matrix Ml to be an m by n binary matrix, such that:
Ml(i,j) = 1 iff
there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.
What is M0?
How does Mk solve the k-mismatch problem?
Computing Mk
We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
Computing Ml: case 1
The first i-1 characters of P match a substring of T ending at j-1, with at most l mismatches, and the next pair of characters in P and T are equal.
This contributes: BitShift(Ml(j-1)) & U(T[j])
Computing Ml: case 2
The first i-1 characters of P match a substring of T ending at j-1, with at most l-1 mismatches (the chars at positions i and j may differ).
This contributes: BitShift(Ml-1(j-1))
Computing Ml
We compute Ml for all l = 0, …, k.
For each j compute M(j), M1(j), …, Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we combine the two cases:
Ml(j) = [BitShift(Ml(j-1)) & U(T[j])] OR BitShift(Ml-1(j-1))
Example M1
[Figure: T = xabxabaaca, P = abaad. M0 is the exact-match matrix; M1 additionally sets M1(i,j) = 1 wherever the first i characters of P match with at most one mismatch — e.g. its first row is all 1s, and M1(5,9) = 1 since abaac matches abaad with one mismatch.]
How much do we pay?
The running time is O(kn(1 + m/w))
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.
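The recurrence above can be sketched in Python as follows (an illustrative implementation, keeping the k+1 current columns M0 … Mk as machine words):

```python
def agrep_mismatch(P, T, k):
    """1-based end positions of occurrences of P in T with <= k mismatches."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    last = 1 << (m - 1)
    M = [0] * (k + 1)                  # M[l] = current column of Ml
    occ = []
    for j, c in enumerate(T, 1):
        Uc = U.get(c, 0)
        prev = M[:]                    # the columns Ml(j-1)
        M[0] = ((prev[0] << 1) | 1) & Uc
        for l in range(1, k + 1):
            # case 1: extend an l-mismatch prefix with a matching char
            # case 2: extend an (l-1)-mismatch prefix (chars may differ)
            M[l] = ((((prev[l] << 1) | 1) & Uc) | ((prev[l - 1] << 1) | 1))
        if M[k] & last:
            occ.append(j)
    return occ
```

On the slide's example, agrep_mismatch("atcgaa", "aatatccacaa", 2) reports the occurrence ending at position 9 (i.e. starting at position 4).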
Problem 3: Solution
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches
[Figure: same dictionary and C(S) as before, with P = bot and k = 2; S = "bzip or not bzip". The term not (= 1g0g0a) matches bot within 2 mismatches, so its codeword is searched for in C(S).]
Agrep: more sophisticated operations
The Shift-And method can solve other ops
The edit distance between two strings p and s is
d(p,s) = minimum number of operations needed to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol of p into a different one
Example: d(ananas,banane) = 3
Search by regular expressions
Example: (a|b)?(abc|a)
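The edit distance itself is classically computed with dynamic programming; a compact one-row Python sketch (our own naming), useful to check the example above:

```python
def edit_distance(p, s):
    """d(p,s): min #insertions, deletions, substitutions turning p into s."""
    m, n = len(p), len(s)
    D = list(range(n + 1))              # D[j] = distance between "" and s[:j]
    for i in range(1, m + 1):
        prev_diag, D[0] = D[0], i       # prev_diag holds D[i-1][j-1]
        for j in range(1, n + 1):
            prev_diag, D[j] = D[j], min(
                D[j] + 1,                              # delete p[i-1]
                D[j - 1] + 1,                          # insert s[j-1]
                prev_diag + (p[i - 1] != s[j - 1]),    # substitute (or match)
            )
    return D[n]
```

Indeed edit_distance("ananas", "banane") returns 3.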
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
g-code for integer encoding
g(x) = 000…0 (Length-1 zeroes) followed by x in binary
x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 represented as <000,1001>.
The g-code for x takes 2⌊log2 x⌋ + 1 bits
(ie. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x2), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of g-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
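A small Python sketch of g-encoding and g-decoding (helper names are our own), which can be checked against the exercise above:

```python
def gamma_encode(x):
    """g-code of x > 0: (Length-1) zeroes, then x in binary."""
    b = bin(x)[2:]                   # binary digits of x, Length = len(b)
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode a concatenation of g-codes into the list of integers."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":        # count leading zeroes = Length-1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out
```

For instance gamma_encode(9) gives "0001001", and decoding the exercise string yields 8, 6, 3, 59, 7.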
Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).
Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥ ∑i=1,...,x pi ≥ x * px  ⇒  x ≤ 1/px
How good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):
∑i=1,...,|S| pi * |g(i)| ≤ ∑i=1,...,|S| pi * [2 * log (1/pi) + 1] = 2 * H0(X) + 1
Not much worse than Huffman, and improvable to H0(X) + 2 + ....
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman
End-tagged dense code
The rank r is mapped to r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix-code
Better compression: it uses all 7-bits configurations
(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is:
Previously we used: s = c = 128
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s*c2 with 3 bytes, ...
An example
5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 words on ≤ 2 bytes, hence more words on 1 byte — which pays off if the distribution is skewed...
Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.
Brute-force approach
Binary search:
On real distributions, there seems to be one unique minimum
Ks = max codeword length
Fsk = cum. prob. of symbols whose |cw| ≤ k
Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than
byte-aligned Huffword
Streaming compression
You still need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
It exploits temporal locality, and it is dynamic
X = 1n 2n 3n … nn: Huff = O(n2 log n), MTF = O(n log n) + n2
Not much worse than Huffman
...but it may be far better
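The two steps above can be sketched in a few lines of Python (list-based, hence O(|S|) per symbol rather than the O(log |S|) achievable with the search tree described later; names are our own):

```python
def mtf_encode(s, alphabet):
    """Transform a symbol sequence into an integer sequence via Move-to-Front."""
    L = list(alphabet)               # MTF list, front = most recently seen
    out = []
    for c in s:
        i = L.index(c)               # step 1: output the position of c in L
        out.append(i)
        L.pop(i)
        L.insert(0, c)               # step 2: move c to the front of L
    return out
```

For example, mtf_encode("aaabbb", "ab") yields [0, 0, 0, 1, 0, 0]: after the first access a symbol costs 0, exhibiting the temporal locality mentioned above.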
MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 * log i + 1
Put S at the front and consider the cost of encoding (pxi = position of the i-th occurrence of symbol x, nx = #occurrences of x):
cost ≤ O(|S| log |S|) + ∑x=1,...,|S| ∑i g(pxi - pxi-1)
By Jensen's:
≤ O(|S| log |S|) + ∑x=1,...,|S| nx * [2 * log (N/nx) + 1] = O(|S| log |S|) + N * [2 * H0(X) + 1]
Hence La[mtf] ≤ 2 * H0(X) + O(1) bits per symbol
MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded
How to keep the MTF-list efficiently:
Search tree
Leaves contain the symbols, ordered as in the MTF-list
Nodes contain the size of their descending subtree
Hash Table
key is a symbol
data is a pointer to the corresponding tree leaf
Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one starting bit suffice
Properties:
It exploits spatial locality, and it is a dynamic code
There is a memory
X = 1n 2n 3n … nn
Huff(X) = Θ(n2 log n) > Rle(X) = Θ(n log n)
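A minimal Python sketch of the run extraction used above:

```python
def rle(s):
    """Collapse a string into (char, run-length) pairs."""
    runs = []
    for c in s:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([c, 1])       # start a new run
    return [(c, n) for c, n in runs]
```

Indeed rle("abbbaacccca") returns [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)], matching the example.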
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive).
e.g. p(a) = .2, p(b) = .5, p(c) = .3
f(i) = ∑j<i p(j), so f(a) = .0, f(b) = .2, f(c) = .7
[Figure: the unit interval split into a = [0,.2), b = [.2,.7), c = [.7,1).]
The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: start from [0,1); after b the interval becomes [.2,.7); after a it shrinks to [.2,.3); after c to [.27,.3).]
The final sequence interval is [.27,.3)
Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:
l0 = 0,  li = li-1 + si-1 * f[ci]
s0 = 1,  si = si-1 * p[ci]
f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is sn = ∏i=1,n p[ci]
The interval for a message sequence will be called the sequence interval
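The recurrences above translate directly into code; a Python sketch with illustrative names:

```python
def sequence_interval(msg, p, f):
    """Return (l, s): the sequence interval [l, l+s) of msg.

    p[c] is the probability of c, f[c] the cumulative prob. below c."""
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]      # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]          # s_i = s_{i-1} * p[c_i]
    return l, s
```

With p = {a:.2, b:.5, c:.3} and f = {a:0, b:.2, c:.7}, sequence_interval("bac", p, f) returns (up to floating-point error) the interval [.27,.3) of the encoding example.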
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
[Figure: .49 ∈ [.2,.7) = b; within it, .49 ∈ [.3,.55) = b; within that, .49 ∈ [.475,.55) = c.]
The message is bbc.
Representing a real number
Binary fractional representation:
.75 = .11
1/3 = .0101…
11/16 = .1011
Algorithm
1. x = 2 * x
2. If x < 1, output 0
3. else x = x - 1; output 1
So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) = .01,  [.33,.66) = .1,  [.66,1) = .11
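The three-step algorithm above, sketched in Python:

```python
def frac_bits(x, nbits):
    """First nbits of the binary fractional representation of x in [0,1)."""
    bits = []
    for _ in range(nbits):
        x *= 2                    # step 1: x = 2*x
        if x < 1:
            bits.append("0")      # step 2
        else:
            x -= 1                # step 3
            bits.append("1")
    return "".join(bits)
```

It reproduces the examples: frac_bits(0.75, 2) = "11", frac_bits(11/16, 4) = "1011", and frac_bits(1/3, 4) = "0101".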
Representing a code interval
Can view binary fractional numbers as intervals by considering all completions, e.g.:
.11  → min .110,  max .111… → interval [.75, 1.0)
.101 → min .1010, max .1011… → interval [.625, .75)
We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).
[Figure: sequence interval [.61, .79); the code interval of .101, i.e. [.625, .75), is contained in it.]
Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits
Bound on Arithmetic length
Note that –log s + 1 = log (2/s)
Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + log (1/s) = 1 + log ∏i=1,n (1/pi)
≤ 2 + ∑i=1,n log (1/pi)
= 2 + ∑k=1,|S| n pk log (1/pk)
= 2 + n H0 bits
nH0 + 0.02 n bits in practice, because of rounding
Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
Keep integers in the range [0..R) where R = 2k
Use rounding to generate the integer intervals
Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s; m = 0; the message interval is expanded by 2
If u < R/2 then (bottom half)
Output 0 followed by m 1s; m = 0; the message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m; the message interval is expanded by 2
All other cases: just continue...
You find this in the Arithmetic ToolBox
As a state machine
[Figure: the ATB takes the current interval (L,s) and a symbol c with distribution (p1,…,p|S|), and outputs the new interval (L',s').]
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
[Figure: the ATB is driven with p[s|context], where s = c or esc; it maps (L,s) to (L',s').]
Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant)
PPM: Example Contexts
String = ACCBACCACBA B,  k = 2

Context Empty:  A = 4, B = 2, C = 5, $ = 3
Context A:      C = 3, $ = 1
Context B:      A = 2, $ = 1
Context C:      A = 1, B = 2, C = 2, $ = 3
Context AC:     B = 1, C = 2, $ = 2
Context BA:     C = 1, $ = 1
Context CA:     C = 1, $ = 1
Context CB:     A = 2, $ = 1
Context CC:     A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) as n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
[Figure: the text scanned so far acts as the Dictionary (all substrings starting there); the Cursor marks the current position; e.g. output <2,3,c>.]
Algorithm's step: output <d, len, c>, where
d = distance of the copied string wrt the current position
len = length of the longest match
c = next char in the text beyond the longest match
Advance by len + 1
A buffer "window" has fixed length and moves
Example: LZ77 with window (window size = 6)
a a c a a c a b c a b a a a c
(0,0,a)  no match, emit 'a'
(1,1,c)  copy "a" at distance 1, then 'c'
(3,4,b)  copy "acaa" at distance 3 (overlapping copy), then 'b'
(3,3,a)  copy "cab" at distance 3, then 'a'
(1,2,c)  copy "aa" at distance 1 (overlapping copy), then 'c'
Longest match within W, plus the next character
LZ77 Decoding
Decoder keeps the same dictionary window as the encoder.
It finds the substring and inserts a copy of it
What if len > d? (overlap with the text to be decompressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor:
for (i = 0; i < len; i++)
  out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
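The copy loop above, wrapped into a complete decoder (a Python sketch with illustrative names; note it handles the overlapping case len > d because each copied character is already in the output when it is read):

```python
def lz77_decode(triples):
    """Decode a list of (d, len, c) LZ77 triples into the original string."""
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):          # works even when length > d (overlap)
            out.append(out[start + i])
        out.append(c)                    # next char beyond the longest match
    return "".join(out)
```

Feeding it the five triples of the window example reconstructs aacaacabcabaaac.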
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
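The LZ78 coding loop can be sketched in Python as follows (illustrative names; a final partial match, if any, is flushed with an empty next-char):

```python
def lz78_encode(s):
    """Emit (phrase-id, next-char) pairs; id 0 is the empty phrase."""
    dict_, out = {}, []            # phrase -> id, ids assigned in trie order
    cur, next_id = "", 1
    for c in s:
        if cur + c in dict_:
            cur += c               # keep extending the longest match
        else:
            out.append((dict_.get(cur, 0), c))
            dict_[cur + c] = next_id   # add Sc to the dictionary
            next_id += 1
            cur = ""
    if cur:                        # flush a trailing match
        out.append((dict_[cur], ""))
    return out
```

On the coding example above it reproduces (0,a)(1,b)(1,a)(0,c)(2,c)(5,b).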
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
LZW: Encoding Example
Output
Dict.
a a b a a c a b a b a c b
112 256=aa
a a b a a c a b a b a c b
112 257=ab
a a b a a c a b a b a c b
113
258=ba
a a b a a c a b a b a c b
256
259=aac
a a b a a c a b a b a c b
114
260=ca
a a b a a c a b a b a c b
257
261=aba
a a b a a c a b a b a c b
261
262=abac
a a b a a c a b a b a c b
114
263=cb
LZW: Decoding Example
Input   Output so far            Dict
112     a
112     a a                      256=aa
113     a a b                    257=ab
256     a a b a a                258=ba
114     a a b a a c              259=aac
257     a a b a a c a b          260=ca
261     a a b a a c a b ?        → 261 is not yet in the dictionary!
One step later the special case resolves it: 261 = aba, output a a b a a c a b a b a
114     a a b a a c a b a b a c  262=abac
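A Python sketch of the LZW decoder, over a toy alphabet instead of the 256 ASCII entries (illustrative names); the unknown-code branch is exactly the special case discussed above, arising because the decoder is one step behind the coder:

```python
def lzw_decode(codes, alphabet):
    """Decode LZW codes; ids 0..len(alphabet)-1 are the initial entries."""
    dict_ = {i: c for i, c in enumerate(alphabet)}
    prev = dict_[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dict_:
            cur = dict_[code]
        else:                          # code not yet known: the special case
            cur = prev + prev[0]
        out.append(cur)
        dict_[len(dict_)] = prev + cur[0]   # reconstruct the coder's entry
        prev = cur
    return "".join(out)
```

With alphabet "abc" (so 0↔112, 3↔256, …) the code stream of the encoding example, [0,0,1,3,2,4,8,2,1], decodes back to aabaacababacb; code 8 (= 261) triggers the special-case branch.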
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows

F              L
# mississipp  i
i #mississip  p
i ppi#missis  s
i ssippi#mis  s
i ssissippi#  m
m ississippi  #
p i#mississi  p
p pi#mississ  i
s ippi#missi  s
s issippi#mi  s
s sippi#miss  i
s sissippi#m  i

(1994)
L is the BWT of T
A famous example: much longer...
A useful tool: L → F mapping
[Matrix: the same sorted rows as before, with F = first column and L = last column; T is unknown to the decoder.]
How do we map L's chars onto F's chars ?
... Need to distinguish equal chars in F...
Take two equal chars of L
Rotate their rows rightward
Same relative order !!
The BWT is invertible
[Matrix: the same sorted rows, F and L columns; T is unknown.]
Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i ppi #
InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
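Both directions can be sketched in a few lines of Python (the forward transform uses the naive sorted-rotations construction; the inverse builds the LF-mapping via a stable sort; names are our own):

```python
def bwt(T):
    """BWT via sorted rotations (assumes T ends with a unique smallest '#')."""
    n = len(T)
    rows = sorted(T[i:] + T[:i] for i in range(n))
    return "".join(row[-1] for row in rows)   # L = last column

def ibwt(L):
    """Invert the BWT using the LF-mapping, walking backward from row 0."""
    n = len(L)
    order = sorted(range(n), key=lambda i: L[i])  # stable: ranks of L in F
    LF = [0] * n
    for f, l in enumerate(order):
        LF[l] = f                     # LF maps L's chars onto F's chars
    out, r = [], 0
    for _ in range(n):
        out.append(L[r])
        r = LF[r]
    rev = "".join(reversed(out))      # T rotated so that '#' comes first
    return rev[1:] + rev[0]           # move the sentinel back to the end
```

Indeed bwt("mississippi#") returns "ipssm#pissii", and ibwt recovers the original text.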
How to compute the BWT ?

SA    BWT matrix     L
12    #mississipp    i
11    i#mississip    p
8     ippi#missis    s
5     issippi#mis    s
2     ississippi#    m
1     mississippi    #
10    pi#mississi    p
9     ppi#mississ    i
7     sippi#missi    s
4     sissippi#mi    s
6     ssippi#miss    i
3     ssissippi#m    i

We said that: L[i] precedes F[i] in T
e.g. L[3] = T[7]
Given SA and T, we have L[i] = T[SA[i]-1]
How to construct SA from T ?
Input: T = mississippi#

SA
12   #
11   i#
8    ippi#
5    issippi#
2    ississippi#
1    mississippi#
10   pi#
9    ppi#
7    sippi#
4    sissippi#
6    ssippi#
3    ssissippi#

Elegant but inefficient
Obvious inefficiencies:
• Θ(n2 log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii (# at position 16)
Mtf-list = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Shifted Mtf = 030040000040040300400400000200000 (Bin(6)=110, Wheeler's code)
RLE0 = 03141041403141410210
Alphabet size becomes |S|+1
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google, 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node one can reach any other node via an undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node one can reach any other node via a directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
It is the largest artifact ever conceived by humans
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph (undirected)
V = Routers
E = communication links
The "cosine" graph (undirected, weighted)
V = static web pages
E = semantic distance between pages
Query-Log graph (bipartite, weighted)
V = queries and URLs
E = (q,u) if u is a result for q, and has been clicked by some user who issued q
Social graph (undirected, unweighted)
V = users
E = (x,y) if x knows y (facebook, address book, email,..)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Key property:
Skewed distribution: the probability that a node has x links is 1/xa, a ≈ 2.1
The In-degree distribution
Indegree follows a power law distribution: Pr[in-degree(u) = k] ∝ 1/ka, with a ≈ 2.1
(Altavista crawl, 1999; WebBase crawl, 2001)
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has an hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
[Figure: adjacency matrix with a dot at each link (i,j); 21 million pages, 150 million links. URL-sorting groups the Berkeley and Stanford hosts into dense blocks.]
URL compression + Delta encoding
The library WebGraph
[Figure: uncompressed adjacency list vs. adjacency list with compressed gaps (exploiting locality).]
Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
For negative entries:
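A Python sketch of this gap encoding (illustrative names; the mapping for a possibly-negative first entry follows the positive/negative examples given later for residuals: 2x if x ≥ 0, 2|x|-1 otherwise):

```python
def vnat(x):
    """Map a possibly-negative integer to a natural number."""
    return 2 * x if x >= 0 else 2 * (-x) - 1

def gaps(x, succ):
    """Gap-encode the sorted successor list of node x:
    S(x) = {vnat(s1 - x), s2 - s1 - 1, ..., sk - s(k-1) - 1}."""
    s = sorted(succ)
    return [vnat(s[0] - x)] + [s[i] - s[i - 1] - 1 for i in range(1, len(s))]
```

E.g. for node 15 with successors {13, 15, 16, 18} the first entry 13 - 15 = -2 is mapped to |13-15|*2-1 = 3, and the remaining gaps are 1, 0, 1.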
Copy-lists
Reference chains, possibly limited
[Figure: uncompressed adjacency list vs. adjacency list with copy lists (exploiting similarity).]
Each bit of y's copy list tells whether the corresponding successor of the reference x is also a successor of y;
The reference index is chosen in [0,W] so as to give the best compression.
Copy-blocks = RLE(Copy-list)
[Figure: adjacency list with copy lists vs. adjacency list with copy blocks (RLE on the bit sequences).]
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib (≈3 bits/edge)
Extra-nodes: Compressing Intervals
Consecutivity in extra-nodes:
Intervals: use their left extreme and length
Interval length: decremented by Lmin = 2
Residuals: differences between residuals, or wrt the source
Examples:
0 = (15-15)*2 (positive)
600 = (316-16)*2
2 = (23-19)-2 (jump >= 2)
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
data
knowledge
about data
at receiver
sender
receiver
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
What if the sender has never seen the data at the receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression [diff, zdelta, REBL,…]
Compress file f deploying file f'
Compress a group of files
Speed up web access by sending differences between the requested page and the ones available in cache
File synchronization [rsync, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distribution Networks
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect inverted lists in a P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
The LZ77-scheme provides an efficient, optimal solution
fknown is the "previously encoded text": compress fknown·fnew starting from fnew
zdelta is one of the best implementations

         Emacs size   Emacs time
uncompr  27Mb         ---
gzip     8Mb          35 secs
zdelta   1.5Mb        42 secs
Efficient Web Access
Dual-proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link
[Figure: Client ↔ (slow link, delta-encoding) ↔ Proxy ↔ (fast link) ↔ Web; the reference page is cached on both sides, so only the delta of the requested page crosses the slow link.]
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f in F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF: nodes = files, weights = zdelta-sizes
Insert a dummy node connected to all nodes, whose edge weights are the gzip-coding sizes
Compute the min branching = directed spanning tree of min total cost, covering G's nodes.
[Figure: example branching over files 1, 2, 3, 5 with edge weights such as 0, 20, 123, 220, 620, 2000.]

         space   time
uncompr  30Mb    ---
tgz      20%     linear
THIS     8%      quadratic
Improvement (what about many-to-one compression of a group of files?)
Problem: Constructing G is very costly: n2 edge calculations (zdelta executions)
We wish to exploit some pruning approach
Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression. Build a sparse weighted graph G'F containing only edges between those pairs of files
Assign weights: estimate appropriate edge weights for G'F, thus saving zdelta executions. Nonetheless, strictly n2 time

         space   time
uncompr  260Mb   ---
tgz      12%     2 mins
THIS     8%      16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
[Figure: the Client holds f_old, the Server holds f_new; the client sends a request and receives an update.]
client wants to update an out-dated file
server has the new file but does not know the old file
update without sending the entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch, since the server has both copies of the files
The rsync algorithm
[Figure: the Client sends block hashes of f_old; the Server replies with the encoded file.]
The rsync algorithm (contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size is problematic (default: max{700, √n} bytes)
not good in theory: the granularity of changes may disrupt the use of blocks
Rsync: some experiments

        gcc size   emacs size
total   27288      27326
gzip    7563       8577
zdelta  227        1431
rsync   964        4452

Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
The server sends the hashes (unlike the client in rsync); the client checks them
The server deploys the common fref to compress the new ftar (rsync compresses just it).
A multi-round protocol
k blocks of n/k elems, log(n/k) levels
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (ie. T[i,N])
Occurrences of P in T = all suffixes of T having P as a prefix
Example: P = si occurs in T = mississippi at positions 4 and 7
SUF(T) = sorted set of suffixes of T
Reduction: from substring search to prefix search
The Suffix Tree
T# = mississippi#
[Figure: the suffix tree of T#; internal edges carry substring labels (i, s, si, ssi, ppi#, pi#, i#, mississippi#, …) and the 12 leaves store the starting positions 1..12 of the suffixes.]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.
T = mississippi#   (P = si)

SA    SUF(T)
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi#
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#

Storing SUF(T) explicitly takes Θ(N2) space; the SA stores just the suffix pointers:
• SA: Θ(N log2 N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step
[Figure: binary search over SA for P = si in T = mississippi#; each probe compares P against one suffix and tells whether P is smaller or larger.]
Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
overall, O(p log2 N) time
Improvable to O(p + log2 N) [Manber-Myers, '90] and to O(p + log2 |S|) [Cole et al, '06]
Locating the occurrences
[Figure: SA of T = mississippi#; binary search for the range of suffixes prefixed by si (conceptually between si$ and si#) finds occ = 2 occurrences, at positions 4 and 7.]
Suffix Array search
• O(p + log2 N + occ) time
Suffix Trays: O(p + log2 |S| + occ) [Cole et al., '06]
String B-tree [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays [Ciriani et al., '02]
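The simple SA construction and the binary search described above, sketched in Python (0-based positions, illustrative names):

```python
def suffix_array(T):
    """Naive construction: sort the suffix start positions by the suffix."""
    return sorted(range(len(T)), key=lambda i: T[i:])

def sa_search(T, SA, P):
    """Starting positions of the occurrences of P in T (0-based).

    Suffixes prefixed by P are contiguous in SA (Prop 1), so two binary
    searches delimit their range; each probe costs O(p) char comparisons."""
    def bound(upper):
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            pref = T[SA[mid]:SA[mid] + len(P)]
            if pref < P or (upper and pref == P):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return SA[bound(False):bound(True)]
```

For T = mississippi# the array is [11,10,7,4,1,0,9,8,6,3,5,2] (the 1-based SA of the slides minus one), and searching for "si" returns the starting positions 3 and 6 (i.e. 4 and 7 in 1-based counting).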
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
T = mississippi#

Lcp   SA
      12   #
0     11   i#
1     8    ippi#
1     5    issippi#
4     2    ississippi#
0     1    mississippi#
0     10   pi#
1     9    ppi#
0     7    sippi#
2     4    sissippi#
1     6    ssippi#
3     3    ssissippi#

(e.g. Lcp = 4 between issippi# and ississippi#)
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
Paradigm shift...
Web 2.0 is about the many
Big
DATA Big PC ?
We have three types of algorithms:
T1(n) = n, T2(n) = n^2, T3(n) = 2^n
... and assume that 1 step = 1 time unit
How many input data n each algorithm may process
within t time units?
n1 = t,
n2 = √t,
n3 = log2 t
What about a k-times faster processor?
...or, what is n, when the time units are k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
A new scenario
Data are more available than ever before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Step cost is Ω(1) time
Not just MIN #steps…
[Figure: the memory hierarchy — CPU with registers; L1/L2 caches (few Mbs, some nanosecs, few words fetched); RAM (few Gbs, tens of nanosecs, some words fetched); HD (few Tbs, few millisecs, B = 32K page); net (many Tbs, even secs, packets)]
You should be “??-aware programmers”
I/O-conscious Algorithms
[Figure: magnetic disk — track, read/write head, read/write arm, magnetic surface]
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O [10^5–10^6 (Hennessy-Patterson)]
If N=(1+f)M, then the avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)
Space-conscious Algorithms
I/Os
search
access
Compressed
data structures
Streaming Algorithms
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
[Figure: the same memory hierarchy as before — registers, L1/L2 caches, RAM, HD, net — with the same sizes and access times]
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock, and its daily performance over
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
n   | 4K   8K   16K   32K    128K   256K   512K   1M
n^3 | 22s  3m   26m   3.5h   28h    --     --     --
n^2 | 0    0    0     1s     26s    106s   7m     28m
An optimal solution
We assume every subsum ≠ 0
[Figure: A split into maximal blocks of negative (<0) and positive (>0) sum; the Optimum starts right after a negative block]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
sum=0; max = -1;
For i=1,...,n do
  If (sum + A[i] ≤ 0) sum=0;
  else { sum += A[i]; max = MAX{max, sum}; }
Note:
• sum < 0 when OPT starts;
• sum > 0 within OPT
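A runnable version of this scan (the classic Kadane formulation; the function name is mine):

```python
# One-pass max-subarray sum: a negative running prefix can never help,
# so we restart the window whenever the running sum drops below zero.
def max_subarray(A):
    best = float("-inf")
    s = 0
    for x in A:
        s += x
        best = max(best, s)
        if s < 0:        # the OPT window cannot start inside a <0 prefix
            s = 0
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray(A))   # the best window is 6 1 -2 4 3, summing to 12
```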
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions Data get distributed arbitrarily !!!
B-tree internal nodes
B-tree leaves
(“tuple pointers")
What about
listing tuples
in order ?
Tuples
Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;           // Divide
03   Merge-Sort(A,i,m);     // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)         // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10^9 tuples ⇒ few Gbs
Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log2 n) random I/Os
⇒ [5ms] * n log2 n ≈ 1.5 years
In practice, it is faster because of caching...
2 passes (R/W)
Merge-Sort Recursion Tree
log2 N
If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: the recursion tree of merge-sort — runs are merged pairwise, level by level, up to the root]
How do we deploy the disk/mem features ?
N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
Pass 1: Produce (N/M) sorted runs.
Pass i: merge X = M/B runs ⇒ log_{M/B} (N/M) passes
[Figure: X input buffers and one output buffer, each of B items, in main memory; runs stream in from disk and the merged run streams back to disk]
Multiway Merging
[Figure: X = M/B sorted runs, each with its current page buffered in Bf1…Bfx and a pointer p1…pX; at every step output min(Bf1[p1], Bf2[p2], …, Bfx[pX]) into Bfo; fetch the next page of run i when pi = B, and flush Bfo to the merged output file when full, until EOF]
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os
In practice
M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
⇒ one multiway merge ⇒ 2 passes (R/W) = few mins
Tuning depends
on disk features
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
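An in-memory sketch of the two-pass scheme (names are mine; here the sorted runs live in RAM, whereas on disk they would be files and `heapq.merge` would play the role of the X = M/B input buffers):

```python
# External-sort skeleton: Pass 1 sorts M-sized runs; Pass 2 performs one
# multiway merge of all runs with a heap over their current heads.
import heapq

def external_sort(items, M):
    # Pass 1: produce ceil(N/M) sorted runs of size at most M.
    runs = [sorted(items[i:i + M]) for i in range(0, len(items), M)]
    # Pass 2: one multiway merge (fan-out = number of runs).
    return list(heapq.merge(*runs))

data = [5, 1, 13, 19, 9, 7, 4, 15, 3, 8, 12, 17, 6, 11, 2, 10]
print(external_sort(data, 4))
```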
May compression help?
Goal: enlarge M and reduce N
#passes = O(logM/B N/M)
Cost of a pass = O(N/B)
Part of Vitter’s paper…
In order to address issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how much we can go
Toy problem #3: Top-freq elements
Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space (i.e., find the mode, provided it occurs > N/2 times)
A=b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables <X,C>, initialized as X = first item, C = 1
For each subsequent item s of the stream,
if (X==s) then C++
else { C--; if (C==0) { X=s; C=1; } }
Return X;
Proof
If X≠y at the end, then every one of y’s
occurrences has a “negative” mate.
Hence these mates number ≥ #occ(y), so
N ≥ 2 * #occ(y) > N: a contradiction.
Problems arise if the mode occurs ≤ N/2:
then the returned X needs a verification pass.
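A runnable version of the vote (a common equivalent formulation of the slide's loop; the function name is mine):

```python
# Streaming majority vote: O(1) space. The returned candidate is the
# mode only when the mode truly occurs > N/2 times; otherwise a second
# verification pass over the stream is needed.
def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

A = list("bacccdcbaaaccbccc")   # the slide's stream: 'c' occurs 9/17 times
print(majority_candidate(A))
```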
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 * 10^9 chars ⇒ size = 6Gb
n = 10^6 documents
TotT = 10^9 tokens (avg term length is 6 chars)
t = 5 * 10^5 distinct terms
What kind of data structure do we build to support
word-based searches ?
Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms (1 if play contains word, 0 otherwise)

           Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            1                1             0          0        0        1
Brutus            1                1             0          1        0        0
Caesar            1                1             0          1        1        1
Calpurnia         0                1             0          0        0        0
Cleopatra         1                0             0          0        0        0
mercy             1                0             1          1        1        1
worser            1                0             1          1        1        0

Space is 500Gb !
Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16
1. Typically we can still do better: 30÷50% of the original text
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
Better index, but yet it is >10 times the text !!!!
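A minimal sketch of the structure (names and the tiny corpus are mine, for illustration only):

```python
# Toy inverted index: map each term to the sorted list of doc-ids that
# contain it; an AND query intersects two postings lists.
from collections import defaultdict

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.split():
            index[term].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

docs = ["antony and cleopatra", "julius caesar", "antony and caesar"]
idx = build_index(docs)
# postings of "antony", and docs containing both "antony" AND "caesar"
print(idx["antony"], sorted(set(idx["antony"]) & set(idx["caesar"])))
```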
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2^n but we have fewer compressed msgs:
Σ_{i=1,…,n-1} 2^i = 2^n - 2
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log2 (1/p(s)) = -log2 p(s)
Lower probability ⇒ higher information
Entropy is the weighted average of i(s):
H(S) = Σ_{s∈S} p(s) log2 (1/p(s)) bits
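The two definitions above, computed directly (function names are mine):

```python
# Self information i(s) = log2(1/p(s)) and entropy H(S) as its
# p-weighted average, in bits.
from math import log2

def self_info(p):
    return -log2(p)

def entropy(probs):
    return sum(p * self_info(p) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.25]))   # 0.5*1 + 0.25*2 + 0.25*2 = 1.5 bits
```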
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be
uniquely decomposed into its codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0
1
0 1
a
0
1
b
c
d
Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = Σ_{s∈S} p(s) * L[s]
We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same codeword lengths, and thus the
same optimal average length. And vice versa…
Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}
then pi < pj ⇒ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have
H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that
La(C) ≤ H(S) + 1
(the Shannon code takes ⌈log2 (1/p(s))⌉ bits per symbol)
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: the Huffman tree — merge a(.1) and b(.2) into (.3); merge (.3) and c(.2) into (.5); merge (.5) and d(.5) into (1)]
a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees
What about ties (and thus, tree depth) ?
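A heap-based construction for the running example (names are mine; ties are broken by insertion order, so this builds just one of the equivalent trees — the codeword lengths are the same in all of them):

```python
# Huffman code construction: repeatedly merge the two least-probable
# trees; then walk the final tree to read off the prefix codewords.
import heapq
from itertools import count

def huffman(probs):
    tick = count()   # tie-breaker so the heap never compares subtrees
    heap = [(p, next(tick), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tick), (left, right)))
    codes = {}
    def walk(node, code):
        if isinstance(node, tuple):
            walk(node[0], code + "0")
            walk(node[1], code + "1")
        else:
            codes[node] = code or "0"
        return codes
    return walk(heap[0][2], "")

codes = huffman({"a": .1, "b": .2, "c": .2, "d": .5})
print(codes)   # lengths 3, 3, 2, 1 as in the slide's a=000 b=001 c=01 d=1
```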
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc… → 000 001 01 = 00000101
101001… → d c b
[Figure: the Huffman tree of the running example]
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store for any level L:
firstcode[L], the first (smallest) codeword on level L, of the form 00…0…
Symbol[L,i], for each i in level L
This is ≤ h^2 + |S| log |S| bits
Canonical Huffman
Encoding
[Figure: the canonical codeword tree, levels 1–5]
Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0
T=...00010...
Problem with Huffman Coding
Consider a symbol with probability .999. Its
self information is
-log2(.999) ≈ .00144
If we were to send 1000 such symbols we
might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
1 extra bit per macro-symbol = 1/k extra-bits per symbol
Larger model to be transmitted
Shannon took infinite sequences, and k → ∞ !!
In practice, we have:
Model takes |S|^k * (k * log |S|) + h^2 bits
(where h might be |S|^k)
It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
[Figure: word-based Huffman with fan-out 128 over T = “bzip or not bzip”; each codeword is a sequence of bytes carrying 7 bits of Huffman code, and the first bit of the first byte is tagged to mark codeword beginnings (byte-aligned codewords)]
CGrep and other ideas...
P = bzip = 1a 0b
[Figure: P’s codeword is searched directly in C(T), T = “bzip or not bzip”; the tag bits let the scanner accept codeword-aligned candidates (yes) and reject misaligned ones (no)]
Speed ≈ Compression ratio
You find this at
You find it under my Software projects
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary: bzip, not, or, space
P = bzip = 1a 0b
[Figure: P’s codeword searched directly in C(S), S = “bzip or not bzip”, checking tag-bit alignment (yes/no) at each candidate]
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: the pattern P aligned at a candidate position of the text T]
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which Arithmetic and Bit-operations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}^n)
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m
H(s) = Σ_{i=1,…,m} 2^{m-i} · s[i]
P=0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s=s’ if and only if H(s)=H(s’)
Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1)
H(Tr) = 2·H(Tr-1) - 2^m·T(r-1) + T(r+m-1)
T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 - 2^4·1 + 0 = 22 - 16 = 6 = H(0110)
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P=101111
q=7
H(P) = 47
Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally (Horner, mod 7):
1·2 + 0 (mod 7) = 2
2·2 + 1 (mod 7) = 5
5·2 + 1 (mod 7) = 4
4·2 + 1 (mod 7) = 2
2·2 + 1 (mod 7) = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr-1), since
2^m (mod q) = 2·(2^{m-1} (mod q)) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
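A sketch of the deterministic variant over a binary alphabet (the function name and the fixed prime are mine; a real implementation would pick q at random as described above):

```python
# Karp-Rabin: rolling fingerprint H mod a prime q, with an explicit
# character-by-character check to rule out false matches.
def karp_rabin(T, P, q=2**31 - 1):
    n, m = len(T), len(P)
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):                  # Hq(P) and Hq(T_1), Horner-style
        hp = (2 * hp + P[i]) % q
        ht = (2 * ht + T[i]) % q
    top = pow(2, m - 1, q)              # 2^(m-1) mod q, for the roll
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:   # verify: fingerprints collide
            occ.append(r)
        if r + m < n:                   # Hq(T_{r+1}) from Hq(T_r)
            ht = (2 * (ht - top * T[r]) + T[r + m]) % q
    return occ

T = [1, 0, 1, 1, 0, 1, 0, 1]   # the slide's T = 10110101
P = [0, 1, 0, 1]               # the slide's P = 0101
print(karp_rabin(T, P))        # 0-based start positions
```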
Problem 1: Solution
Dictionary: bzip, not, or, space
P = bzip = 1a 0b
[Figure: as before — P’s codeword searched directly in C(S), S = “bzip or not bzip”; tag bits resolve codeword alignment]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for
Example: T = california and P = for
[Figure: the m×n matrix M over T; the 1 in row 3, column 7 witnesses the occurrence of P = for ending at T[7]]
How does M solve the exact match problem?
How to construct M
We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one
Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:
And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.
e.g. BitShift( (0,1,1,0,1)^T ) = (1,0,1,1,0)^T
Let w be the word size (e.g., 32 or 64 bits). We’ll assume
m ≤ w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)^T
U(b) = (0,1,0,0,0)^T
U(c) = (0,0,0,0,1)^T
How to construct M
Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained as
M(j) = BitShift( M(j-1) ) & U(T[j])
For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T
ending at character j-1  ⇔  M(i-1,j-1) = 1
(2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1
BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing with the i-th bit of U(T[j]) establishes if both hold
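The same recurrence with each column of M packed into a Python integer (names are mine; bit i of the mask corresponds to row i+1 of M):

```python
# Shift-And: column M(j) as an integer bitmask; bit i is set iff
# P[0..i] matches the text ending at position j.
def shift_and(T, P):
    m = len(P)
    U = {}                               # U[x]: bit i set iff P[i] == x
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        # BitShift sets the first bit to 1; AND keeps matching prefixes
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & (1 << (m - 1)):           # top bit set: full match at j
            occ.append(j - m + 1)
    return occ

print(shift_and("xabxabaaca", "abaac"))  # 0-based start positions
```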
An example
T = xabxabaaca, P = abaac
[Figure: the columns of M computed one character of T at a time]
j=1: M(1) = BitShift(M(0)) & U(T[1]) = BitShift(M(0)) & U(x) = (0,0,0,0,0)^T
j=2: M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)^T
j=3: M(3) = BitShift(M(2)) & U(b) = (0,1,0,0,0)^T
…
j=9: M(9) = BitShift(M(8)) & U(c) = (0,0,0,0,1)^T ⇒ an occurrence of P ends at position 9
Shift-And method: Complexity
If m<=w, any column and vector U() fit in a
memory word.
Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
Very often in practice. Recall that w=64 bits in
modern architectures.
Some simple extensions
We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)^T
U(b) = (1,1,0,0,0)^T
U(c) = (0,0,0,0,1)^T
What about ‘?’, ‘[^…]’ (not) ?
Problem 1: Another solution
Dictionary: bzip, not, or, space
P = bzip = 1a 0b
[Figure: as before — P searched directly in C(S), S = “bzip or not bzip”]
Speed ≈ Compression ratio
Problem 2
Dictionary: bzip, not, or, space
Given a pattern P find
all the occurrences in S
of all terms containing
P as substring
P = o
[Figure: C(S) for S = “bzip or not bzip”; the terms containing P are not = 1g 0g 0a and or = 1g 0a 0b]
Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: patterns P1 and P2 aligned at their occurrences in T]
Naïve solution
Use an (optimal) Exact Matching Algorithm searching each
pattern in P
Complexity: O(nl+m) time, not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns in P
R is a bitmap of length m:
R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method searching for S:
For any symbol c, U’(c) = U(c) AND R, i.e.
U’(c)[i] = 1 iff S[i]=c and it is the first symbol of a
pattern
For any step j,
compute M(j), then M(j) OR U’(T[j]). Why?
It sets to 1 the first bit of each pattern that starts with
T[j]
Check if there are occurrences ending in j. How?
Problem 3
Dictionary: bzip, not, or, space
Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches
P = bot, k=2
[Figure: C(S) for S = “bzip or not bzip”]
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at
position 4; it also occurs with 4 mismatches starting at
position 2.
[Figure: the two alignments of P = atcgaa against T = aatatccacaa]
Agrep
Our current goal: given k, find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:
Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters up
through character j of T.
What is M0?
How does Mk solve the k-mismatch problem?
Computing Mk
We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
Computing Ml: case 1
The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.
[Figure: alignment of P[1,i-1] below T, with * marking the ≤ l mismatching positions]
⇒ BitShift( M^l(j-1) ) & U(T[j])
Computing Ml: case 2
The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.
[Figure: alignment of P[1,i-1] below T ending at j-1]
⇒ BitShift( M^{l-1}(j-1) )
Computing Ml
We compute M^l for all l = 0, …, k.
For each j compute M(j), M^1(j), …, M^k(j)
For all l initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a
match iff case 1 or case 2 holds:
M^l(j) = [ BitShift( M^l(j-1) ) & U(T[j]) ]  OR  BitShift( M^{l-1}(j-1) )
Example M1
T = xabxabaaca, P = abaad
[Figure: the matrices M0 and M1, column by column; row 5 of M1 has a 1 at column 9, witnessing an occurrence with at most 1 mismatch ending there]
How much do we pay?
The running time is O(kn(1+m/w)).
Again, the method is practically efficient for
small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.
Problem 3: Solution
Dictionary: bzip, not, or, space
Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches
P = bot, k=2
[Figure: C(S) for S = “bzip or not bzip”; the matching term is not = 1g 0g 0a]
Agrep: more sophisticated operations
The Shift-And method can solve other ops
The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one
Example: d(ananas,banane) = 3
Search by regular expressions
Example: (a|b)?(abc|a)
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
γ-code for integer encoding
x > 0 is encoded as (Length - 1) zeros followed by the
binary representation of x, where Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000,1001>.
γ-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of g-coded
integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
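The exercise above can be checked mechanically (function names are mine):

```python
# γ-code: encode x > 0 as (|bin(x)|-1) zeros followed by bin(x);
# decode by counting the unary length prefix, then reading that many
# more bits as the binary value.
def gamma_encode(x):
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":            # unary prefix: Length-1 zeros
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                   # the slide's <000,1001>
print(gamma_decode("0001000001100110000011101100111"))   # the exercise
```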
Analysis
Sort pi in decreasing order, and encode si via
the variable-length code γ(i).
Recall that: |γ(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥ Σ_{i=1,…,x} pi ≥ x * px  ⇒  x ≤ 1/px
How good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):
Σ_{i=1,…,|S|} pi * |γ(i)| ≤ Σ_{i=1,…,|S|} pi * [2 * log (1/pi) + 1] = 2 * H0(X) + 1
Not much worse than Huffman,
and improvable to H0(X) + 2 + ....
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman
End-tagged dense code
The rank r is mapped to r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix-code
Better compression: it uses all 7-bits configurations
(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is:
Previously we used: s = c = 128
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s*c^2 on 3 bytes, ...
An example
5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...
Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256 - s.
Brute-force approach
Binary search:
On real distributions, it seems that one unique minimum
Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k
Experiments: (s,c)-DC much interesting…
Search is 6% faster than
byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
Exploit temporal locality, and it is dynamic
X = 1^n 2^n 3^n … n^n ⇒ Huff = O(n^2 log n), MTF = O(n log n) + n^2
No much worse than Huffman
...but it may be far better
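The two steps of the encoder, made runnable (the function name is mine; positions here are 0-based):

```python
# Move-to-Front: output the position of each symbol in the list L,
# then move that symbol to the front of L ("there is a memory").
def mtf_encode(s, alphabet):
    L = list(alphabet)
    out = []
    for c in s:
        i = L.index(c)
        out.append(i)
        L.pop(i)
        L.insert(0, c)
    return out

# Temporal locality pays: repeated symbols cost 0 after the first hit.
print(mtf_encode("aabbbccc", "abc"))
```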
MTF: how good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2 * log i + 1
Put S in front and consider the cost of encoding:
O(|S| log |S|) + Σ_{x=1,…,|S|} Σ_{i=2,…,n_x} |γ(p_i^x - p_{i-1}^x)|
By Jensen’s inequality:
≤ O(|S| log |S|) + Σ_{x=1,…,|S|} n_x * [2 * log (N/n_x) + 1]
= O(|S| log |S|) + N * [2 * H0(X) + 1]
⇒ La[mtf] ≤ 2 * H0(X) + O(1)
MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:
Search tree
Hash Table
Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves
Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings just numbers and one bit
Properties:
Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n
There is a memory
Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).
e.g. p(a) = .2, p(b) = .5, p(c) = .3, with f(i) = Σ_{j=1,…,i-1} p(j)
⇒ f(a) = .0, f(b) = .2, f(c) = .7
[Figure: the unit interval split as a = [0,.2), b = [.2,.7), c = [.7,1)]
The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: nesting of intervals while coding bac — b gives [.2,.7), then a gives [.2,.3), then c gives [.27,.3)]
The final sequence interval is [.27,.3)
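The interval computation just shown, executed directly (the function name is mine; floating point is used here only for illustration, real coders use the integer version discussed below):

```python
# Interval arithmetic for the slide's example: at each symbol c,
# l_i = l_{i-1} + s_{i-1} * f[c]  and  s_i = s_{i-1} * p[c].
def arith_interval(msg, f, p):
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, l + s        # the sequence interval [l, l+s)

f = {"a": 0.0, "b": 0.2, "c": 0.7}
p = {"a": 0.2, "b": 0.5, "c": 0.3}
print(arith_interval("bac", f, p))   # approximately (0.27, 0.3)
```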
Arithmetic Coding
To code a sequence of symbols c with probabilities
p[c] use the following:
l_0 = 0    l_i = l_{i-1} + s_{i-1} * f[c_i]
s_0 = 1    s_i = s_{i-1} * p[c_i]
f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is
s_n = ∏_{i=1,…,n} p[c_i]
The interval for a message sequence will be called the
sequence interval
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: .49 ∈ [.2,.7) → b; within it, .49 ∈ [.3,.55) → b; within that, .49 ∈ [.475,.55) → c]
The message is bbc.
Representing a real number
Binary fractional representation:
.75 = .11    1/3 = .0101…    11/16 = .1011
Algorithm
1. x = 2 * x
2. If x < 1 output 0
3. else x = x - 1; output 1
So how about just using the shortest binary
fractional representation in the sequence
interval ?
e.g. [0,.33) → .01    [.33,.66) → .1    [.66,1) → .11
Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions:
code  min    max    interval
.11   .110   .111   [.75, 1.0)
.101  .1010  .1011  [.625, .75)
We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).
[Figure: sequence interval [.61, .79) contains the code interval [.625, .75) of .101]
Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits
Bound on Arithmetic length
Note that –log s+1 = log (2/s)
Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) = 1 + log ∏_{i=1,…,n} (1/p_i)
≤ 2 + Σ_{i=1,…,n} log (1/p_i)
= 2 + Σ_{k=1,…,|S|} n p_k log (1/p_k)
= 2 + n H0 bits
nH0 + 0.02 n bits in practice,
because of rounding
Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:
Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s; set m = 0
  Message interval is expanded by 2
If u < R/2 then (bottom half)
  Output 0 followed by m 1s; set m = 0
  Message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2
In all other cases, just continue...
You find this at: the Arithmetic ToolBox (ATB)
As a state machine
[Figure: the ATB as a state machine: given the current interval (L,s), a symbol c and the distribution (p1,....,pS), it outputs the new interval (L',s') = (L + s*f[c], s*p[c]).]
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
[Figure: the ATB is fed with the conditional distribution p[ s | context ], where s = c or esc; the interval (L,s) becomes (L',s').]
Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)
PPM: Example Contexts
String = ACCBACCACBA B      k = 2

  Context: Empty   Counts: A = 4   B = 2   C = 5   $ = 3

  Context: A       Counts: C = 3   $ = 1
  Context: B       Counts: A = 2   $ = 1
  Context: C       Counts: A = 1   B = 2   C = 2   $ = 3

  Context: AC      Counts: B = 1   C = 2   $ = 2
  Context: BA      Counts: C = 1   $ = 1
  Context: CA      Counts: C = 1   $ = 1
  Context: CB      Counts: A = 2   $ = 1
  Context: CC      Counts: A = 1   B = 1   $ = 2
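The count tables above can be rebuilt with a short sketch (function name illustrative; the escape count shown follows the PPMC heuristic, one count per distinct successor, which matches the $ values in the tables):

```python
def ppm_counts(text, k):
    """For every context of length <= k, count the characters that follow it."""
    tables = {}
    for order in range(k + 1):
        for i in range(order, len(text)):
            ctx, nxt = text[i - order:i], text[i]
            tables.setdefault(ctx, {}).setdefault(nxt, 0)
            tables[ctx][nxt] += 1
    return tables

# counts over the already-seen prefix ACCBACCACBA (the final B is being coded)
counts = ppm_counts("ACCBACCACBA", 2)
escapes = {ctx: len(succ) for ctx, succ in counts.items()}  # PPMC-style $
```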
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!
LZ77
[Figure: text a a c a a c a b c a b a b a c; the Dictionary consists of all substrings starting before the Cursor; an example output is <2,3,c>.]
Algorithm's step:
  Output <d, len, c> where
    d = distance of copied string wrt current position
    len = length of longest match
    c = next char in text beyond longest match
  Advance by len + 1
A buffer "window" has fixed length and moves
Example: LZ77 with window (Window size = 6)
  a a c a a c a b c a b a a a c
  Steps (longest match within W, plus next character):
    (0,0,a)   (1,1,c)   (3,4,b)   (3,3,a)   (1,2,c)
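A sketch of the windowed encoder (function name illustrative; matches may overlap the cursor, and the match length is capped so that a next character always exists):

```python
def lz77_encode(text, window):
    """Emit (d, len, next-char) triples with matches restricted to `window`."""
    out, i = [], 0
    while i < len(text):
        best_d, best_len = 0, 0
        for d in range(1, min(i, window) + 1):
            l = 0
            # the copy source may overlap the cursor: compare char by char
            while i + l < len(text) - 1 and text[i + l - d] == text[i + l]:
                l += 1
            if l > best_len:
                best_d, best_len = d, l
        out.append((best_d, best_len, text[i + best_len]))
        i += best_len + 1          # advance by len + 1
    return out
```

On the slide's windowed example, `lz77_encode("aacaacabcabaaac", 6)` reproduces the five triples above.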
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if len > d? (overlap with text to be compressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor:
  for (i = 0; i < len; i++)
    out[cursor+i] = out[cursor-d+i];
Output is correct: abcdcdcdcdcdce
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash table to speed up the searches on triples
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
  Text: a a b a a c a b c a b c b
  Output   Dict.
  (0,a)    1 = a
  (1,b)    2 = ab
  (1,a)    3 = aa
  (0,c)    4 = c
  (2,c)    5 = abc
  (5,b)    6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (in this example a = 112, b = 113, c = 114)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
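A decoder sketch showing that special case (for brevity the initial dictionary below holds just the three codes of the running example instead of all 256 ASCII entries; function name illustrative):

```python
def lzw_decode(codes, dictionary):
    """LZW decoding; handles the code-not-yet-in-dictionary (SSc) case."""
    d = dict(dictionary)              # id -> string
    nxt = 256                         # first free id
    prev = d[codes[0]]
    out = [prev]
    for c in codes[1:]:
        if c in d:
            entry = d[c]
        else:                         # special case: c is the entry being built
            entry = prev + prev[0]
        out.append(entry)
        d[nxt] = prev + entry[0]      # what the encoder added one step earlier
        nxt += 1
        prev = entry
    return ''.join(out)
```

On the slide's codes (plus a final 113 for the trailing b), decoding yields "aabaacababacb", and code 261 = aba is indeed resolved one step late.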
LZW: Encoding Example
  Text: a a b a a c a b a b a c b
  Output   Dict.
  112      256 = aa
  112      257 = ab
  113      258 = ba
  256      259 = aac
  114      260 = ca
  257      261 = aba
  261      262 = abac
  114      263 = cb
LZW: Decoding Example
  Input   Output (so far)          Dict.
  112     a
  112     a a                      256 = aa
  113     a a b                    257 = ab
  256     a a b a a                258 = ba
  114     a a b a a c              259 = aac
  257     a a b a a c a b          260 = ca
  261     a a b a a c a b ?        261 is not yet in the dictionary!
  114     a a b a a c a b a b a c  261 = aba (one step later: ? = aba)
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform (1994)
Take the text T = mississippi#. Write down all its cyclic rotations, then sort the rows:

  mississippi#            F              L
  ississippi#m            # mississipp  i
  ssissippi#mi            i #mississip  p
  sissippi#mis            i ppi#missis  s
  issippi#miss            i ssippi#mis  s
  ssippi#missi    sort    i ssissippi#  m
  sippi#missis    ====>   m ississippi  #
  ippi#mississ            p i#mississi  p
  ppi#mississi            p pi#mississ  i
  pi#mississip            s ippi#missi  s
  i#mississipp            s issippi#mi  s
  #mississippi            s sippi#miss  i
                          s sissippi#m  i

F = # i i i i m p p s s s s,    L = i p s s m # p i s s i i
A famous example: in L, equal chars cluster into runs much longer than in T.
A useful tool: the L → F mapping
(The decoder sees only L; T and the matrix are unknown.)
How do we map L's chars onto F's chars ?
... Need to distinguish equal chars in F...
Take two equal chars of L: rotating their rows rightward by one position brings them into F, in the same relative order !!
The BWT is invertible
Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
  T = .... i p p i #
InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
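A small end-to-end sketch, forward transform plus the LF-based inversion above (function names illustrative; the rotation-sorting forward step is the naive quadratic one, only for illustration):

```python
def bwt(t):
    """Last column L of the sorted cyclic-rotation matrix of t (t ends with '#')."""
    rows = sorted(t[i:] + t[:i] for i in range(len(t)))
    return ''.join(row[-1] for row in rows)

def ibwt(L):
    """Invert the BWT with the LF mapping (equal chars keep their relative order)."""
    occ, rank = {}, []
    for ch in L:                     # rank of each L[r] among equal chars
        occ[ch] = occ.get(ch, 0) + 1
        rank.append(occ[ch] - 1)
    first, tot = {}, 0
    for ch in sorted(occ):           # first row of each char in F
        first[ch] = tot
        tot += occ[ch]
    LF = [first[L[r]] + rank[r] for r in range(len(L))]
    out, r = [], L.index('#')        # the row ending with '#' is T itself
    for _ in range(len(L)):
        out.append(L[r])
        r = LF[r]
    return ''.join(reversed(out))
```

On the running example, `bwt("mississippi#")` returns "ipssm#pissii" and `ibwt` recovers T exactly.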
How to compute the BWT ?
We said that: L[i] precedes F[i] in T, e.g. L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i]-1]

  SA   row of the BWT matrix   L
  12   #mississipp             i
  11   i#mississip             p
   8   ippi#missis             s
   5   issippi#mis             s
   2   ississippi#             m
   1   mississippi             #
  10   pi#mississi             p
   9   ppi#mississ             i
   7   sippi#missi             s
   4   sissippi#mi             s
   6   ssippi#miss             i
   3   ssissippi#m             i

How to construct SA from T ?    Input: T = mississippi#

  SA
  12   #
  11   i#
   8   ippi#
   5   issippi#
   2   ississippi#
   1   mississippi#
  10   pi#
   9   ppi#
   7   sippi#
   4   sissippi#
   6   ssippi#
   3   ssissippi#

Elegant but inefficient: sort the suffixes by direct comparison.
Obvious inefficiencies:
  • Θ(n² log n) time in the worst-case
  • Θ(n log n) cache misses or I/O faults
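The "elegant but inefficient" construction is a one-liner; it also yields L via L[i] = T[SA[i]-1] (1-based positions, as in the slides; function name illustrative):

```python
def suffix_array(t):
    """Sort suffix start positions by direct suffix comparison (naive)."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

t = "mississippi#"
sa = suffix_array(t)
# BWT via the SA: the char preceding position i, cyclically
# (in 0-based Python, t[i-2]; for i = 1 this wraps to the final '#')
L = ''.join(t[i - 2] for i in sa)
```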
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii     (# at 16)
Mtf-list = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Shifting every digit by one, to reserve 0 for the run-length symbols: 030040000040040300400400000200000
RLE0 = 03141041403141410210     (run lengths coded with Wheeler's code, e.g. Bin(6) = 110)
Bzip2-output = Arithmetic/Huffman on the |S|+1 symbols of the shifted alphabet...
... plus g(16), plus the original Mtf-list (i,m,p,s)
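The Move-to-Front step can be sketched as follows (function names illustrative; the test below checks only a self-consistent encode/decode round-trip on a prefix of L, since the exact handling of # in the slide's digit string is not fully specified):

```python
def mtf_encode(s, alphabet):
    """Replace each char by its rank in a move-to-front list."""
    lst, out = list(alphabet), []
    for ch in s:
        r = lst.index(ch)
        out.append(r)
        lst.insert(0, lst.pop(r))   # move the char to the front
    return out

def mtf_decode(ranks, alphabet):
    lst, out = list(alphabet), []
    for r in ranks:
        ch = lst[r]
        out.append(ch)
        lst.insert(0, lst.pop(r))
    return ''.join(out)
```

Note how runs in L become runs of 0s in the MTF output, which is exactly what RLE0 then squeezes.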
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
  Set of nodes such that from any node one can reach any other via an undirected path.
Strongly connected components (SCC)
  Set of nodes such that from any node one can reach any other via a directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
Largest artifact ever conceived by humans
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
  Physical network graph
    V = Routers
    E = communication links
  The "cosine" graph (undirected, weighted)
    V = static web pages
    E = semantic distance between pages
  Query-Log graph (bipartite, weighted)
    V = queries and URLs
    E = (q,u) if u is a result for q, and has been clicked by some user who issued q
  Social graph (undirected, unweighted)
    V = users
    E = (x,y) if x knows y (facebook, address book, email,..)
Definition
Directed graph G = (V,E)
  V = URLs, E = (u,v) if u has a hyperlink to v
  Isolated URLs are ignored (no IN & no OUT)
Three key properties; the first:
  Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1
The In-degree distribution
[Plots: Altavista crawl (1999) and WebBase crawl (2001); the indegree follows a power-law distribution]
  Pr[ in-degree(u) = k ] ∝ 1/k^α,   α ≈ 2.1
Definition
Directed graph G = (V,E)
  V = URLs, E = (u,v) if u has a hyperlink to v
  Isolated URLs are ignored (no IN, no OUT)
Three key properties:
  Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1
  Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
  Similarity: pages close in lexicographic order tend to share many outgoing lists
A Picture of the Web Graph
[Figure: adjacency matrix of a crawl of 21 million pages and 150 million links; after URL-sorting the nonzeros cluster into host blocks, e.g. Berkeley and Stanford]
URL compression + Delta encoding
The library WebGraph
[Figure: uncompressed adjacency list vs. adjacency list with gaps compressed by exploiting locality]
Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
For negative entries:
Copy-lists
Reference chains, possibly limited
[Figure: uncompressed adjacency list vs. adjacency list with copy lists, exploiting similarity]
Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.
Copy-blocks = RLE(Copy-list)
[Figure: adjacency list with copy lists vs. adjacency list with copy blocks (RLE on the bit sequences)]
  The first copy block is 0 if the copy list starts with 0;
  The last block is omitted (we know the length…);
  The length is decremented by one for all blocks
This is a Java and C++ lib (≈3 bits/edge)
Extra-nodes: Compressing Intervals
[Figure: consecutive runs in the extra-nodes are coded as intervals]
  Intervals: use their left extreme and length
    Int. length: decremented by Lmin = 2
  Residuals: differences between residuals, or the source
    e.g.  0 = (15-15)*2 (positive)      2 = (23-19)-2 (jump >= 2)
          600 = (316-16)*2              3 = |13-15|*2-1 (negative)
          3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
[Figure: a sender transmits data over network links to a receiver that already holds some knowledge about the data]
network links are getting faster and faster but
  many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
What if the sender has never seen the data at the receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression    [diff, zdelta, REBL,…]
  Compress file f deploying file f'
  Compress a group of files
  Speed-up web access by sending differences between the requested page and the ones available in cache
File synchronization    [rsync, zsync]
  Client updates old file f_old with f_new available on a server
  Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
  Client updates structured old file f_old with f_new available on a server
  Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
  Find an optimal covering set of f_new based on f_known
  The LZ77-scheme provides an efficient, optimal solution
  f_known is the "previously encoded text": compress the concatenation f_known f_new, starting from f_new
zdelta is one of the best implementations
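The same "reference file as previously encoded text" idea can be approximated with zlib's preset-dictionary feature (this is not zdelta itself, just a sketch of the principle; function names are illustrative):

```python
import zlib

def delta_compress(f_known, f_new):
    """Compress f_new letting DEFLATE copy substrings from f_known."""
    comp = zlib.compressobj(level=9, zdict=f_known)
    return comp.compress(f_new) + comp.flush()

def delta_decompress(f_known, delta):
    """The receiver needs the same f_known to rebuild f_new."""
    dec = zlib.decompressobj(zdict=f_known)
    return dec.decompress(delta) + dec.flush()
```

When f_new is a small edit of f_known, the delta is typically far smaller than compressing f_new alone.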
           Emacs size   Emacs time
  uncompr  27Mb         ---
  gzip     8Mb          35 secs
  zdelta   1.5Mb        42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
[Figure: Client ↔ slow link ↔ Proxy ↔ fast link ↔ web; the two proxies exchange a delta-encoding of the page against a cached reference, while the request and the full page travel over the fast link]
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
  Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f ∈ F a good reference
Reduction to the Min Branching problem on DAGs
  Build a weighted graph G_F: nodes = files, weights = zdelta-sizes
  Insert a dummy node connected to all, whose edge weights are the gzip-coding sizes
  Compute the min branching = directed spanning tree of min total cost, covering G's nodes
[Figure: example graph on files 1, 2, 3, 5 plus the dummy node 0, with edge weights such as 20, 123, 220, 620, 2000]

           space   time
  uncompr  30Mb    ---
  tgz      20%     linear
  THIS     8%      quadratic

Improvement: what about many-to-one compression (of a group of files)?
Problem: Constructing G is very costly: n² edge calculations (zdelta executions)
We wish to exploit some pruning approach
  Collection analysis: Cluster the files that appear similar and thus are good candidates for zdelta-compression. Build a sparse weighted graph G'_F containing only the edges between those pairs of files
  Assign weights: Estimate appropriate edge weights for G'_F, thus saving zdelta executions. Nonetheless, still n² time

           space   time
  uncompr  260Mb   ---
  tgz      12%     2 mins
  THIS     8%      16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
[Figure: the Client, holding f_old, sends a request to the Server, which holds f_new and returns an update]
  client wants to update an out-dated file
  server has new file but does not know the old file
  update without sending entire f_new (using similarity)
  rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch, since the server has both copies of the files
The rsync algorithm
[Figure: the Client sends the block hashes of f_old; the Server replies with the encoded file, built from f_new]
The rsync algorithm (contd)
  simple, widely used, single roundtrip
  optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
  choice of block size problematic (default: max{700, √n} bytes)
  not good in theory: granularity of changes may disrupt use of blocks
Rsync: some experiments

           gcc size   emacs size
  total    27288      27326
  gzip     7563       8577
  zdelta   227        1431
  rsync    964        4452

Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
  Server sends hashes (unlike the client in rsync), and the client checks them
  Server deploys the common f_ref to compress the new f_tar (rsync compresses just it)
A multi-round protocol
  k blocks of n/k elems, log(n/k) levels
  If the distance is k, then on each level at most k hashes do not find a match in the other file
  The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
  iff
P is a prefix of the i-th suffix of T (ie. T[i,N])
[Figure: P aligned over T at position i, i.e. on the suffix T[i,N]]
Example: P = si, T = mississippi → occurrences at positions 4, 7
Occurrences of P in T = All suffixes of T having P as a prefix
SUF(T) = Sorted set of suffixes of T
Reduction: from substring search to prefix search
The Suffix Tree
T# = mississippi#    (suffix starting positions 1..12)
[Figure: the suffix tree of T#. From the root: edge # leads to leaf 12; edge i branches into # (leaf 11), ppi# (leaf 8) and ssi, which branches into ppi# (leaf 5) and ssippi# (leaf 2); edge mississippi# leads to leaf 1; edge p branches into i# (leaf 10) and pi# (leaf 9); edge s branches into i and si, each further branching with ppi# and ssippi# into leaves 7, 4 and 6, 3.]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position in SUF(T) is the lexicographic position of P.
Writing SUF(T) explicitly takes Θ(N²) space; store just the suffix pointers:

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

T = mississippi#    (each SA entry is a suffix pointer into T)
Example: P = si occurs at positions 4 and 7 → contiguous rows of SA.
Suffix Array
  • SA: Θ(N log2 N) bits
  • Text T: N chars
  In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp, 2 accesses per step
[Figure: binary search for P = si over SA = 12 11 8 5 2 1 10 9 7 4 6 3 of T = mississippi#; at each step P is compared with the suffix T[SA[mid],N]: if P is larger the search continues to the right, if P is smaller to the left.]
Suffix Array search
  • O(log2 N) binary-search steps
  • Each step takes O(p) char cmp
  overall, O(p log2 N) time
  improved to O(p + log2 N) [Manber-Myers, '90] and to O(p + log2 |S|) [Cole et al, '06]
Locating the occurrences
[Figure: two binary searches, for the boundary patterns si# and si$ (where # < S < $), delimit the SA interval of the occ = 2 occurrences of P = si: rows 7 (sippi#) and 4 (sissippi#).]
Suffix Array search
  • O(p + log2 N + occ) time
  Suffix Trays: O(p + log2 |S| + occ)    [Cole et al., '06]
String B-tree
[Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
T = mississippi#

  SA    suffix         Lcp
  12    #
  11    i#             0
   8    ippi#          1
   5    issippi#       1
   2    ississippi#    4
   1    mississippi#   0
  10    pi#            0
   9    ppi#           1
   7    sippi#         0
   4    sissippi#      2
   6    ssippi#        1
   3    ssissippi#     3

(e.g. issippi# and ississippi# share the prefix issi, of length 4)
• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for Lcp[i,i+C-2] whose entries are ≥ L
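The Lcp array can be computed naively from the SA by comparing adjacent suffixes (function name illustrative; faster linear-time constructions exist, e.g. Kasai's algorithm):

```python
def lcp_array(t, sa):
    """Lcp[i] = |common prefix| of the i-th and (i+1)-th suffix (sa is 1-based)."""
    def lcp(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n
    suf = [t[i - 1:] for i in sa]
    return [lcp(suf[i], suf[i + 1]) for i in range(len(sa) - 1)]
```

On T = mississippi# this reproduces the table above, with the maximum 4 for issippi# vs ississippi#.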
Paradigm shift...
Web 2.0 is about the many
Big DATA vs. Big PC ?
We have three types of algorithms:
  T1(n) = n,  T2(n) = n^2,  T3(n) = 2^n
... and assume that 1 step = 1 time unit
How many input data n each algorithm may process
within t time units?
n1 = t,
n2 = √t,
n3 = log2 t
What about a k-times faster processor?
...or, what is n, when the time units are k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
A new scenario
Data are more available than ever before
  n → ∞ ... is more than a theoretical assumption
The RAM model is too simple
  Step cost assumed Θ(1) time
  Not just MIN #steps…
[Figure: the memory hierarchy]
  CPU registers: a few words
  Cache (L1/L2): few Mbs, some nanosecs, few words fetched
  RAM: few Gbs, tens of nanosecs, some words fetched
  HD: few Tbs, few millisecs, B = 32K page
  net: many Tbs, even secs, packets
You should be "??-aware programmers"
I/O-conscious Algorithms
[Figure: a disk drive: magnetic surface, tracks, read/write head on its read/write arm]
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
  M = memory size, N = problem size
  T(n) = time complexity of an algorithm using linear space
  p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)]
  C = cost of an I/O [10^5–10^6 (Hennessy-Patterson)]
If N = (1+f)M, then the D-avg cost per step is:
  C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
  (1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)
Space-conscious Algorithms
  Compressed data structures, supporting search and access with few I/Os
Streaming Algorithms
  Data arrive continuously or we wish FEW scans
  Streaming algorithms:
    Use few scans
    Handle each element fast
    Use small space
Cache-Oblivious Algorithms
[Figure: the same memory hierarchy (registers, L1/L2 cache, RAM, HD, net) with its sizes and latencies]
Unknown and/or changing devices
  Block access important on all levels of memory hierarchy
  But memory hierarchies are very diverse
Cache-oblivious algorithms:
  Explicitly, algorithms do not assume any model parameters
  Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running times of the obvious algorithms:

  size   4K   8K   16K   32K    128K   256K   512K   1M
  n^3    22s  3m   26m   3.5h   28h    --     --     --
  n^2    0    0    0     1s     26s    106s   7m     28m
An optimal solution
We assume every subsum ≠ 0.
[Figure: A splits into a prefix with sum < 0, followed by the part, with positive running sums, that contains the Optimum]
  A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
  sum = 0; max = -1;
  For i = 1,...,n do
    If (sum + A[i] ≤ 0) sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }
Note:
  • Sum < 0 when OPT starts;
  • Sum > 0 within OPT
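The one-pass scan above in runnable form (function name illustrative; like the slide, it initializes max = -1 and assumes some window has positive sum):

```python
def max_subarray_sum(A):
    """One-pass scan: reset the running sum when it would drop to <= 0."""
    best, s = -1, 0
    for x in A:
        if s + x <= 0:
            s = 0              # OPT cannot start inside a negative-sum prefix
        else:
            s += x
            best = max(best, s)
    return best
```

On the slide's array the answer is 12, achieved by the window 6 1 -2 4 3.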
Toy problem #2: sorting
How to sort tuples (objects) on disk
[Figure: memory containing the tuples, addressed through an array A]
Key observation:
  Array A is an "array of pointers to objects"
  For each object-to-object comparison A[i] vs A[j]:
    2 random accesses to memory locations A[i] and A[j]
  MergeSort → Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
  n insertions → Data get distributed arbitrarily !!!
[Figure: B-tree internal nodes and leaves ("tuple pointers") pointing to the tuples]
What about listing tuples in order ?
Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months
Binary Merge-Sort
  Merge-Sort(A,i,j)
  01  if (i < j) then
  02    m = (i+j)/2;           // Divide
  03    Merge-Sort(A,i,m);     // Conquer
  04    Merge-Sort(A,m+1,j);
  05    Merge(A,i,m,j)         // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
  n = 10^9 tuples → few Gbs
  Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log2 n) random I/Os
  [5ms] * n log2 n ≈ 1.5 years
In practice, it is faster because of caching (each level makes 2 passes, R/W)...
Merge-Sort Recursion Tree
[Figure: the recursion tree of mergesort over log2 N levels; sorted runs are merged pairwise level by level, e.g. runs (1 2 5 10) and (2 7 8 13) merge into (1 2 2 5 7 8 10 13).]
If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help
How do we deploy the disk/mem features ?
  Produce N/M runs, each of M items, sorted in internal memory (no extra I/Os)
  The I/O-cost of the merging levels is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
  Pass 1: Produce (N/M) sorted runs.
  Pass i: merge X ≤ M/B runs → log_{M/B} (N/M) passes
[Figure: X input buffers and one output buffer, each of B items, in main memory; the runs stream in from disk and the merged run streams back to disk.]
Multiway Merging
[Figure: runs 1..X (X = M/B), each contributing its current page to a buffer Bf1..BfX with a pointer p1..pX; repeatedly take min(Bf1[p1], Bf2[p2], …, BfX[pX]), append it to the output buffer Bfo (flushed to the merged run when full), and fetch the next page of run i when pi = B, until EOF.]
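The selection of the minimum head among the X runs is usually done with a min-heap rather than a linear scan; an in-memory sketch (function name illustrative; on disk each run would be paged through a B-item buffer as in the figure):

```python
import heapq

def multiway_merge(runs):
    """Merge X sorted runs using a min-heap over the runs' current heads."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)    # overall minimum head
        out.append(val)
        if j + 1 < len(runs[i]):           # advance within run i
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out
```

Each of the N output items costs O(log X) comparisons, for O(N log X) work per merge pass.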
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os
In practice
  M/B ≈ 1000 → #passes = log_{M/B} (N/M) ≈ 1
  One multiway merge → 2 passes = few mins
Tuning depends on disk features
  Large fan-out (M/B) decreases #passes
  Compression would decrease the cost of a pass!
Can compression help?
Goal: enlarge M and reduce N
  #passes = O(log_{M/B} (N/M))
  Cost of a pass = O(N/B)
Part of Vitter's paper…, in order to address issues related to:
  Disk Striping: sorting easily on D disks
  Distribution sort: top-down sorting
  Lower Bounds: how far down we can go
Toy problem #3: Top-freq elements
Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2, using the smallest space (i.e. assuming the mode occurs > N/2).
  A = b a c c c d c b a a a c c b c c c
Algorithm: use a pair of variables <X,C>
  For each item s of the stream:
    if (C == 0) { X = s; C = 1; }
    else if (X == s) C++;
    else C--;
  Return X;
Proof: If X ≠ y, then every occurrence of y has a distinct "negative" mate. Hence these mates number at least #occ(y), so 2 * #occ(y) > N — a contradiction.
Problems if the mode occurs ≤ N/2: the returned X is then only a candidate.
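The pair-of-variables scan in runnable form (function name illustrative; if no majority exists, the returned candidate must be verified with one extra counting pass):

```python
def majority_candidate(stream):
    """One pair of variables (X, C); X is the majority element if one exists."""
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1   # adopt the current item as candidate
        elif X == s:
            C += 1
        else:
            C -= 1        # pair off one X-occurrence with one non-X item
    return X
```

On the slide's stream, where c occurs 9 times out of 17, the candidate is c.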
Toy problem #4: Indexing
Consider the following TREC collection:
  N = 6 * 10^9 chars → size = 6Gb
  n = 10^6 documents
  TotT = 10^9 term occurrences (avg term length is 6 chars)
  t = 5 * 10^5 distinct terms
What kind of data structure do we build to support word-based searches ?
Solution 1: Term-Doc matrix    (t = 500K terms × n = 1 million docs)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony     1                 1              0            0       0        1
  Brutus     1                 1              0            1       0        0
  Caesar     1                 1              0            1       1        1
  Calpurnia  0                 1              0            0       0        0
  Cleopatra  1                 0              0            0       0        0
  mercy      1                 0              1            1       1        1
  worser     1                 0              1            1       1        0

1 if the play contains the word, 0 otherwise.    Space is 500Gb !
Solution 2: Inverted index

  Brutus    → 2 4 8 16 32 64 128
  Calpurnia → 2 3 5 8 13 21 34
  Caesar    → 13 16

1. Typically the postings take 30–50% of the original text (and we can still do better)
2. We have 10^9 total terms → at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb data
Better index, but yet it is >10 times the text !!!!
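An inverted index is just a map from each term to the sorted list of documents containing it; a toy sketch (the documents and function name are illustrative, not from the TREC collection above):

```python
def build_inverted_index(docs):
    """docs: dict doc_id -> text; returns term -> sorted list of doc ids."""
    index = {}
    for doc_id in sorted(docs):               # ascending ids => sorted postings
        for term in set(docs[doc_id].lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index

idx = build_inverted_index({1: "Brutus killed Caesar",
                            2: "Caesar married Calpurnia",
                            3: "Brutus and Calpurnia"})
```

Sorted postings are what makes both fast intersection (AND queries) and gap/delta compression possible.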
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…
n -1
2
i 1
i
2 -2
n
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self information of s is:
  i(s) = log2 (1/p(s)) = -log2 p(s)
Lower probability → higher information
Entropy is the weighted average of i(s):
  H(S) = ∑_{s∈S} p(s) log2 (1/p(s))   bits
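The formula is a one-liner (function name illustrative; zero-probability symbols contribute nothing, by the usual 0·log(1/0) = 0 convention):

```python
import math

def entropy(probs):
    """H(S) = sum over s of p(s) * log2(1/p(s)), in bits per symbol."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)
```

For instance, entropy([0.5, 0.25, 0.25]) = 1.5 bits, and a uniform 4-symbol source has entropy 2 bits.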
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be
uniquely decomposed into their codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0
1
0 1
a
0
1
b
c
d
Average Length
For a code C with codeword length L[s], the average length is defined as
  La(C) = ∑_{s∈S} p(s) * L[s]
We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C')
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same symbol lengths and thus the same optimal average length. And vice versa…
Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn} then pi < pj ⇒ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have
  H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that
  La(C) ≤ H(S) + 1
(the Shannon code takes ⌈log2 (1/p)⌉ bits)
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
[Figure: build the tree bottom-up: merge a(.1) and b(.2) into (.3); merge (.3) and c(.2) into (.5); merge (.5) and d(.5) into (1); label the branches 0/1.]
  a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) "equivalent" Huffman trees
What about ties (and thus, tree depth) ?
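The greedy merge can be sketched with a min-heap; this version tracks only codeword lengths, which is what the optimality statements are about (function name illustrative; tie-breaking, and hence the exact tree shape, is arbitrary):

```python
import heapq

def huffman_lengths(freqs):
    """Codeword length per symbol: repeatedly merge the two lightest trees."""
    heap = [(p, [s]) for s, p in freqs.items()]
    heapq.heapify(heap)
    length = {s: 0 for s in freqs}
    while len(heap) > 1:
        p1, g1 = heapq.heappop(heap)
        p2, g2 = heapq.heappop(heap)
        for s in g1 + g2:           # every symbol in the merged tree sinks 1 level
            length[s] += 1
        heapq.heappush(heap, (p1 + p2, g1 + g2))
    return length
```

On the running example this yields lengths 3, 3, 2, 1 for a, b, c, d, matching a = 000, b = 001, c = 01, d = 1.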
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: Start at root and take branch for each bit received. When at leaf, output its symbol and return to root.
  abc... → 00000101...
  101001... → dcb...
[Figure: the Huffman tree of the running example: a(.1) and b(.2) below (.3), then c(.2) below (.5), then d(.5) at the root.]
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store, for each level L:
firstcode[L] = the numeric value of the first codeword on level L
Symbol[L,i], for each i in level L
This takes ≤ h² + |S| log |S| bits
Canonical Huffman
Encoding
[Figure: a canonical Huffman tree with levels 1–5; the codewords on each level are consecutive binary numbers starting at firstcode[L].]
Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0
T=...00010...
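The decoding loop can be sketched as follows (a hedged sketch: it assumes the convention that firstcode[L] holds the numeric value of the first codeword of length L, and the Symbol table below is hypothetical, not from the slides):

```python
def canonical_decode(bits, firstcode, symbol):
    """Decode a bit string with a canonical Huffman table:
    firstcode[l] = value of the first codeword of length l,
    symbol[l][i] = i-th symbol on level l (assumed given)."""
    out, i = [], 0
    while i < len(bits):
        l, v = 1, int(bits[i]); i += 1
        # extend the codeword while its value is below the level's first code
        while v < firstcode[l]:
            v = 2 * v + int(bits[i]); i += 1
            l += 1
        out.append(symbol[l][v - firstcode[l]])
    return out

firstcode = {1: 2, 2: 1, 3: 1, 4: 2, 5: 0}   # values from the slide
symbol = {5: ["s0", "s1", "s2", "s3"]}        # hypothetical level-5 symbols
```

With the slide’s firstcode values, the input 00010 walks down to level 5 with value 2, i.e. the third symbol on that level.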
Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
-log(.999) ≈ .00144
If we were to send 1000 such symbols we
might hope to use 1000 · .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
1 extra bit per macro-symbol = 1/k extra-bits per symbol
Larger model to be transmitted
Shannon took infinite sequences, and k → ∞ !!
In practice, we have:
The model takes |S|^k · (k · log |S|) + h² bits (where h might be |S|)
It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
[Figure: word-based tagged Huffman for T = “bzip or not bzip”. The Huffman tree over the words of T has fan-out 128; each codeword is a sequence of bytes carrying 7 bits of Huffman code plus 1 tag bit that marks the first byte of a codeword. C(T) is the concatenation of the byte-aligned codewords of bzip, space, or, space, not, space, bzip.]
or not
CGrep and other ideas...
P = bzip = 1a 0b
[Figure: compressed search — GREP is run directly over C(T): the tagged, byte-aligned codeword of P is matched against the compressed text of T = “bzip or not bzip”; the tag bits prevent false matches inside other codewords.]
Speed ≈ Compression ratio
You find this at
You find it under my Software projects
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary: a, bzip, not, or, space
P = bzip = 1a 0b
[Figure: the byte-aligned codeword of P is searched directly in C(S), S = “bzip or not bzip”.]
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: pattern P = AB slid along text T = …ABCABDAB…, occurrences aligned under T.]
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running
time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}^n)
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:
H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]
P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s=s’ if and only if H(s)=H(s’)
Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1):
H(Tr) = 2 · H(Tr-1) − 2^m · T(r−1) + T(r+m−1)
T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P=101111
q=7
H(P) = 47
Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally!
1·2 + 0 (mod 7) = 2
2·2 + 1 (mod 7) = 5
5·2 + 1 (mod 7) = 4
4·2 + 1 (mod 7) = 2
2·2 + 1 (mod 7) = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr-1):
2^m (mod q) = 2 · (2^{m−1} (mod q)) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
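The deterministic variant of the algorithm above (compute the fingerprint incrementally, compare, and verify candidates) can be sketched in Python; this is a minimal sketch over a binary alphabet, with a fixed large prime rather than a randomly drawn one:

```python
def karp_rabin(T, P, q=2_147_483_647):
    """Karp-Rabin over a binary alphabet: roll Hq(Tr) from Hq(Tr-1)
    and verify each candidate to rule out false matches."""
    n, m = len(T), len(P)
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):                   # fingerprints of P and T1
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    top = pow(2, m - 1, q)               # 2^(m-1) mod q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r+m] == P:   # verification step (no false match)
            occ.append(r + 1)            # 1-based position
        if r + m < n:                    # roll: drop T[r], append T[r+m]
            ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
    return occ
```

On the running example T = 10110101, P = 0101 the only occurrence starts at position 5, as in the slides.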
Problem 1: Solution
Dictionary: a, bzip, not, or, space
P = bzip = 1a 0b
[Figure: the tagged codeword of P is searched directly in C(S), S = “bzip or not bzip”; the tag bits rule out false matches.]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for
[Figure: the m×n matrix M for T = california, P = for; column 7 has M(3,7) = 1 since P = T[5..7] = for, and the rows above track the shorter prefixes of P.]
How does M solve the exact match problem?
How to construct M
We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one
Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:
And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting A’s bits down by
one position and setting the first bit to 1.
Example: BitShift((0,1,1,0,1)ᵀ) = (1,0,1,1,0)ᵀ
Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit the bit-parallelism to
compute the j-th column of M from the (j−1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x) is set to 1 for the
positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)ᵀ, U(b) = (0,1,0,0,0)ᵀ, U(c) = (0,0,0,0,1)ᵀ
How to construct M
Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained by
M(j) = BitShift(M(j−1)) & U(T[j])
For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T
ending at character j−1 ⇔ M(i−1,j−1) = 1
(2) P[i] = T[j] ⇔ the i-th bit of U(T[j]) = 1
BitShift moves bit M(i−1,j−1) into the i-th position;
AND this with the i-th bit of U(T[j]) to establish if both are true
An example, j=1: T = xabxabaaca, P = abaac
M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵀ & U(x) = (1,0,0,0,0)ᵀ & (0,0,0,0,0)ᵀ = (0,0,0,0,0)ᵀ
An example, j=2: T[2] = a
M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ
An example, j=3: T[3] = b
M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ
An example, j=9: T[9] = c
M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ
The 1 in the last row signals an occurrence of P ending at position 9.
Shift-And method: Complexity
If m<=w, any column and vector U() fit in a
memory word.
Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
Very often in practice. Recall that w=64 bits in
modern architectures.
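The whole method fits in a few lines when each column of M is kept as a machine integer; here is a minimal Python sketch (Python integers play the role of the w-bit word, bit i−1 of the column standing for row i):

```python
def shift_and(T, P):
    """Bit-parallel Shift-And: column M(j) is an integer whose i-th bit
    (LSB = first char of P) says that P[1..i] ends at T[j]."""
    m = len(P)
    U = {}                              # U[c]: positions where c occurs in P
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ, goal = 0, [], 1 << (m - 1)
    for j, c in enumerate(T, 1):
        M = ((M << 1) | 1) & U.get(c, 0)   # BitShift, then AND with U(T[j])
        if M & goal:
            occ.append(j)                  # occurrence ending at position j
    return occ
```

On the running examples it reports the occurrence of abaac ending at position 9 of xabxabaaca, and of for ending at position 7 of california.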
Some simple extensions
We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)ᵀ, U(b) = (1,1,0,0,0)ᵀ, U(c) = (0,0,0,0,1)ᵀ
What about ‘?’, ‘[^…]’ (not)?
Problem 1: Another solution
Dictionary: a, bzip, not, or, space
P = bzip = 1a 0b
[Figure: as before, the codeword of P is matched against C(S), S = “bzip or not bzip”.]
Speed ≈ Compression ratio
Problem 2
Dictionary: a, bzip, not, or, space
Given a pattern P, find
all the occurrences in S
of all terms containing
P as a substring
P = o
[Figure: the dictionary terms containing P are not = 1g 0g 0a and or = 1g 0a 0b; their codewords are searched in C(S), S = “bzip or not bzip”.]
Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: patterns P1 and P2 aligned under their occurrences in text T.]
Naïve solution
Use an (optimal) Exact Matching Algorithm searching each
pattern in P
Complexity: O(nl+m) time, not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns in P
R is a bitmap of length m:
R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method searching for S
For any symbol c, U’(c) = U(c) AND R, i.e.
U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a
pattern
For any step j:
compute M(j)
then M(j) OR U’(T[j]). Why?
It sets to 1 the first bit of each pattern that starts with T[j]
Check if there are occurrences ending in j. How?
Problem 3
Dictionary: a, bzip, not, or, space
Given a pattern P, find
all the occurrences in S
of all terms containing
P as a substring, allowing
at most k mismatches
P = bot, k = 2
[Figure: the codewords of the candidate terms are searched in C(S), S = “bzip or not bzip”.]
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa
aatatccacaa
atcgaa
Agrep
Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:
Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.
What is M0?
How does Mk solve the k-mismatch problem?
Computing Mk
We compute Ml for all l=0, … ,k.
For each j compute M0(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
Computing Ml: case 1
The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.
[Figure: P[1..i−1] aligned against T ending at j−1.]
This contributes the term BitShift(Mˡ(j−1)) & U(T[j])
Computing Ml: case 2
The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.
[Figure: P[1..i−1] aligned against T ending at j−1, with one unit of mismatch budget spent on position i.]
This contributes the term BitShift(Mˡ⁻¹(j−1))
Computing Ml
We compute Ml for all l=0, … ,k.
For each j compute M0(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
Mˡ(j) = [BitShift(Mˡ(j−1)) & U(T[j])] OR BitShift(Mˡ⁻¹(j−1))
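The recurrence above translates directly into bit-parallel code; this is a minimal Python sketch (Python integers again stand in for w-bit words, and positions are reported 1-based):

```python
def agrep_mismatches(T, P, k):
    """Shift-And with up to k mismatches: M[l] is the column for
    'at most l mismatches', updated by the two-term recurrence."""
    m = len(P)
    mask = (1 << m) - 1
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)
    occ, goal = [], 1 << (m - 1)
    for j, c in enumerate(T, 1):
        prev = M[:]                              # columns at position j-1
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            # exact-extension term OR spend-one-mismatch term
            M[l] = ((((prev[l] << 1) | 1) & U.get(c, 0))
                    | ((prev[l - 1] << 1) | 1)) & mask
        if M[k] & goal:
            occ.append(j - m + 1)                # starting position
    return occ
```

On the slides’ example, P = atcgaa occurs in T = aatatccacaa with 2 mismatches only at position 4; with k = 4 the occurrence at position 2 shows up as well.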
Example M1
[Figure: the matrices M0 and M1 for P = abaad, T = xabxabaaca; the last row of M1 has a 1 in column 9, i.e. an occurrence with at most 1 mismatch ends there.]
How much do we pay?
The running time is O(k·n·(1 + m/w))
Again, the method is practically efficient for
small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.
Problem 3: Solution
Dictionary: a, bzip, not, or, space
Given a pattern P, find
all the occurrences in S
of all terms containing
P as a substring, allowing
k mismatches
P = bot, k = 2
[Figure: the term not = 1g 0g 0a matches P with ≤ 2 mismatches; its codeword is searched in C(S), S = “bzip or not bzip”.]
Agrep: more sophisticated operations
The Shift-And method can solve other ops
The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one
Example: d(ananas,banane) = 3
Search by regular expressions
Example: (a|b)?(abc|a)
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
γ-code for integer encoding
x > 0 is encoded as (Length − 1) zeros followed by the
binary representation of x, where Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as ⟨000, 1001⟩.
The γ-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e., a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of γ-coded
integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
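The γ-code is easy to implement; here is a minimal Python sketch of the encoder and decoder, which also checks the exercise above:

```python
def gamma_encode(x):
    """Gamma-code of x > 0: (len-1) zeros, then x in binary."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode a concatenation of gamma-codes (prefix-free, so no
    separators are needed)."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":            # count the leading zeros
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))  # next z+1 bits are x
        i += z + 1
    return out
```

For instance gamma_encode(9) is "0001001" (⟨000,1001⟩), and decoding the exercise’s bit string yields 8, 6, 3, 59, 7.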
Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code γ(i).
Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:
1 ≥ Σ_{i=1,…,x} pi ≥ x·px ⇒ x ≤ 1/px
How good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):
Σ_{i=1,…,|S|} pi·|γ(i)| ≤ Σ_{i=1,…,|S|} pi·[2·log(1/pi) + 1] = 2·H0(X) + 1
Not much worse than Huffman,
and improvable to H0(X) + 2 + ....
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman
End-tagged dense code
The rank r is mapped to r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix-code
Better compression: it uses all 7-bits configurations
(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is:
Previously we used: s = c = 128
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
and s·c with 2 bytes, s·c² with 3 bytes, ...
An example
5000 distinct words
ETDC encodes 128 + 128² = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...
Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.
Brute-force approach
Binary search:
On real distributions, there seems to be a unique minimum
Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k
Experiments: (s,c)-DC much interesting…
Search is 6% faster than
byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
Exploit temporal locality, and it is dynamic
X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  Huff = O(n² log n), MTF = O(n log n) + n²
Not much worse than Huffman
...but it may be far better
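The two steps above (output the position, then move to front) can be sketched in Python; a minimal sketch with a plain list rather than the balanced search tree discussed later:

```python
def mtf_encode(text, alphabet):
    """Move-to-Front: emit the 1-based position of each symbol in the
    list L, then move that symbol to the front of L."""
    L, out = list(alphabet), []
    for c in text:
        i = L.index(c)
        out.append(i + 1)
        L.insert(0, L.pop(i))          # the 'memory' of the code
    return out

def mtf_decode(ranks, alphabet):
    L, out = list(alphabet), []
    for r in ranks:
        c = L.pop(r - 1)
        out.append(c)
        L.insert(0, c)
    return "".join(out)
```

Temporal locality shows up immediately: repeated symbols are encoded as runs of 1s, e.g. "aabbbc" over the list [a,b,c] becomes 1 1 2 1 1 3.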
MTF: how good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log i + 1
Put S in front and consider the cost of encoding:
O(|S| log |S|) + Σ_{x=1,…,|S|} Σ_{i≥2} |γ(pˣᵢ − pˣᵢ₋₁)|
(where pˣᵢ is the position of the i-th occurrence of symbol x, out of nₓ)
By Jensen’s inequality:
≤ O(|S| log |S|) + Σ_{x=1,…,|S|} nₓ·[2·log(N/nₓ) + 1]
= O(|S| log |S|) + N·[2·H0(X) + 1]
⇒ La[mtf] ≤ 2·H0(X) + O(1)
MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep the MTF-list efficiently:
Search tree:
Leaves contain the symbols, ordered as in the MTF-list
Nodes contain the size of their descending subtree
Hash table:
key is a symbol
data is a pointer to the corresponding tree leaf
Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings just numbers and one bit
Properties:
Exploit spatial locality, and it is a dynamic code
X = 1ⁿ 2ⁿ 3ⁿ … nⁿ
There is a memory
Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
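The run-collapsing step is a one-liner in spirit; a minimal Python sketch:

```python
def rle_encode(s):
    """Run-Length Encoding: collapse maximal runs into (symbol, length)."""
    runs = []
    for c in s:
        if runs and runs[-1][0] == c:
            runs[-1] = (c, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((c, 1))               # start a new run
    return runs
```

On the slide’s example, "abbbaacccca" becomes (a,1),(b,3),(a,2),(c,4),(a,1); the run lengths would then be var-length coded (e.g. with γ-codes).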
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).
e.g., with p(a) = .2, p(b) = .5, p(c) = .3:
f(a) = .0, f(b) = .2, f(c) = .7, where f(i) = Σ_{j<i} p(j)
[Figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1).]
The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: interval refinement. Start from [0,1); after b the interval is [.2,.7); after a it is [.2,.3); after c it is [.27,.3).]
The final sequence interval is [.27,.3)
Arithmetic Coding
To code a sequence of symbols c with probabilities
p[c] use the following:
l₀ = 0   lᵢ = lᵢ₋₁ + sᵢ₋₁ · f[cᵢ]
s₀ = 1   sᵢ = sᵢ₋₁ · p[cᵢ]
f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is
sₙ = Π_{i=1,…,n} p[cᵢ]
The interval for a message sequence will be called the
sequence interval
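The two recurrences above are all that is needed to compute the sequence interval; a minimal Python sketch with plain floats (the real coder uses the integer version discussed later):

```python
def sequence_interval(msg, p, f):
    """Compute the sequence interval [l, l+s) by the recurrences
    l_i = l_{i-1} + s_{i-1}*f[c_i],  s_i = s_{i-1}*p[c_i]."""
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, s
```

On the running example (p(a)=.2, p(b)=.5, p(c)=.3) the message bac yields the interval [.27, .30), as in the encoding example above.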
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: .49 falls in b’s interval [.2,.7); within it, in b’s sub-interval [.3,.55); within that, in c’s sub-interval [.475,.55).]
The message is bbc.
Representing a real number
Binary fractional representation:
.75 = .11
1/3 = .0101…
11/16 = .1011
Algorithm (to emit the bits of x ∈ [0,1)):
1. x = 2·x
2. If x < 1, output 0
3. else x = x − 1; output 1
So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) → .01, [.33,.66) → .1, [.66,1) → .11
Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
number  min     max     interval
.11     .110…   .111…   [.75, 1.0)
.101    .1010…  .1011…  [.625, .75)
We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).
[Figure: sequence interval [.61, .79); the code interval of .101, i.e. [.625, .75), is contained in it.]
Can use L + s/2 truncated to 1 + log (1/s) bits
Bound on Arithmetic length
Note that −log s + 1 = log(2/s)
Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log(1/s) =
= 1 + log Π_{i=1,…,n} (1/pᵢ)
≤ 2 + Σ_{i=1,…,n} log(1/pᵢ)
= 2 + Σ_{k=1,…,|S|} n·pₖ·log(1/pₖ)
= 2 + n·H0 bits
nH0 + 0.02 n bits in practice
because of rounding
Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:
Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s; set m = 0
Message interval is expanded by 2
If u < R/2 then (bottom half)
Output 0 followed by m 1s; set m = 0
Message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2
In all other cases, just continue...
You find this at
Arithmetic ToolBox
As a state machine
[Figure: the Arithmetic ToolBox as a state machine: from state (L,s), given the distribution (p1,…,p|S|) and the next symbol c, ATB moves to state (L’,s’), the refined interval’s left end and size.]
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
[Figure: PPM feeding the ATB state machine: at each step the pair (s = c or esc, p[s|context]) refines the state (L,s) into (L’,s’).]
Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)
PPM: Example Contexts
String = ACCBACCACBA B, k = 2
Context ∅:  A = 4, B = 2, C = 5, $ = 3
Context A:  C = 3, $ = 1
Context B:  A = 2, $ = 1
Context C:  A = 1, B = 2, C = 2, $ = 3
Context AC: B = 1, C = 2, $ = 2
Context BA: C = 1, $ = 1
Context CA: C = 1, $ = 1
Context CB: A = 2, $ = 1
Context CC: A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) as n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
[Figure: the Dictionary is everything to the left of the Cursor (all substrings starting there); the triple emitted here is ⟨2,3,c⟩.]
Algorithm’s step:
Output ⟨d, len, c⟩ where
d = distance of the copied string wrt the current position
len = length of the longest match
c = next char in text beyond the longest match
Advance by len + 1
A buffer “window” has fixed length and moves
Example: LZ77 with window
a a c a a c a b c a b a a a c
(0,0,a)
a a c a a c a b c a b a a a c
(1,1,c)
a a c a a c a b c a b a a a c
(3,4,b)
a a c a a c a b c a b a a a c
(3,3,a)
a a c a a c a b c a b a a a c
(1,2,c)
Window size = 6
Longest match
within W
Next character
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if l > d? (overlap with text to be compressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
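Both directions can be sketched in a few lines of Python (a didactic sketch with a brute-force match search, not gzip’s hash-table implementation; note how the decoder’s copy loop handles the l > d overlap case):

```python
def lz77_encode(text, window=6):
    """LZ77 with a sliding window: emit (d, len, c) triples."""
    out, cur, n = [], 0, len(text)
    while cur < n:
        best_d = best_len = 0
        for d in range(1, min(cur, window) + 1):
            l = 0
            # copies may overlap the text being produced (l can exceed d)
            while cur + l < n - 1 and text[cur + l - d] == text[cur + l]:
                l += 1
            if l > best_len:
                best_d, best_len = d, l
        out.append((best_d, best_len, text[cur + best_len]))
        cur += best_len + 1
    return out

def lz77_decode(triples):
    out = []
    for d, l, c in triples:
        for _ in range(l):
            out.append(out[-d])   # copy char by char: correct even if l > d
        out.append(c)
    return "".join(out)
```

On the windowed example above, "aacaacabcabaaac" encodes to ⟨0,0,a⟩ ⟨1,1,c⟩ ⟨3,4,b⟩ ⟨3,3,a⟩ ⟨1,2,c⟩ and decodes back exactly.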
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
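The coding loop can be sketched in Python; a minimal sketch in which the trie is represented by a dict keyed on (parent id, char) rather than an explicit tree:

```python
def lz78_encode(text):
    """LZ78: walk the phrase trie as far as possible, emit (id, c),
    and add the phrase id.c to the dictionary."""
    D, out = {}, []
    i, next_id = 0, 1
    while i < len(text):
        node = 0                      # 0 = empty phrase (trie root)
        while i < len(text) - 1 and (node, text[i]) in D:
            node = D[(node, text[i])]
            i += 1
        D[(node, text[i])] = next_id  # add phrase Sc
        out.append((node, text[i]))
        next_id += 1
        i += 1
    return out

def lz78_decode(pairs):
    phrases = {0: ""}
    out = []
    for idx, (node, c) in enumerate(pairs, 1):
        phrases[idx] = phrases[node] + c   # rebuild the same dictionary
        out.append(phrases[idx])
    return "".join(out)
```

On the coding example below, "aabaacabcabcb" produces exactly the pairs (0,a),(1,b),(1,a),(0,c),(2,c),(5,b).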
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
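A minimal Python sketch makes the SSc special case concrete (here the initial dictionary is a parameter rather than the 256 ASCII entries, an assumption for brevity):

```python
def lzw_encode(text, alphabet):
    """LZW: emit dictionary ids only; dictionary starts with 'alphabet'."""
    D = {c: i for i, c in enumerate(alphabet)}
    out, S = [], ""
    for c in text:
        if S + c in D:
            S += c                      # extend the current phrase
        else:
            out.append(D[S])
            D[S + c] = len(D)           # add Sc without transmitting c
            S = c
    if S:
        out.append(D[S])
    return out

def lzw_decode(codes, alphabet):
    D = {i: c for i, c in enumerate(alphabet)}
    prev = D[codes[0]]
    out = [prev]
    for k in codes[1:]:
        # SSc special case: k may be the entry being defined right now
        cur = D[k] if k in D else prev + prev[0]
        D[len(D)] = prev + cur[0]       # decoder is one step behind
        out.append(cur)
        prev = cur
    return "".join(out)
```

For instance "abababab" over the alphabet {a,b} triggers the special case: code 4 arrives before entry 4 is defined, and the decoder deduces it as prev + prev[0].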
LZW: Encoding Example
Output
Dict.
a a b a a c a b a b a c b
112 256=aa
a a b a a c a b a b a c b
112 257=ab
a a b a a c a b a b a c b
113
258=ba
a a b a a c a b a b a c b
256
259=aac
a a b a a c a b a b a c b
114
260=ca
a a b a a c a b a b a c b
257
261=aba
a a b a a c a b a b a c b
261
262=abac
a a b a a c a b a b a c b
114
263=cb
LZW: Decoding Example
Input
112
Dict
a
112
a a
256=aa
113
a a b
257=ab
256
a a b a a
258=ba
114
a a b a a c
259=aac
257
a a b a a c a b ?
260=ca
261
261
114
a a b a a c a b a b
261=aba
one
step
later
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
Consider the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows (Burrows-Wheeler, 1994):
F              L
# mississipp  i
i #mississip  p
i ppi#missis  s
i ssippi#mis  s
i ssissippi#  m
m ississippi  #
p i#mississi  p
p pi#mississ  i
s ippi#missi  s
s issippi#mi  s
s sippi#miss  i
s sissippi#m  i
The last column L is the BWT of T.
A famous example: a much longer text…
A useful tool: the L → F mapping
[Figure: columns F and L of the sorted-rotation matrix of mississippi#; the middle of the matrix is unknown to the decoder.]
How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...
Take two equal L’s chars
Rotate rightward their rows
Same relative order !!
The BWT is invertible
[Figure: columns F and L of the sorted-rotation matrix of mississippi#; the middle is unknown.]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i ppi #
InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
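The forward transform and the LF-based inversion above can be sketched in Python (a didactic O(n² log n) sketch via explicit rotations; it assumes T ends with a unique smallest terminator '#'):

```python
def bwt(T):
    """BWT via sorted rotations; T must end with a unique smallest '#'."""
    n = len(T)
    rots = sorted(T[i:] + T[:i] for i in range(n))
    return "".join(r[-1] for r in rots)   # last column L

def inverse_bwt(L):
    """Invert via the LF mapping: equal chars keep the same relative
    order in L and in the first column F = sorted(L)."""
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for f_row, l_row in enumerate(order):
        LF[l_row] = f_row                 # F-row holding the char L[l_row]
    out, r = [], 0                        # row 0 starts with '#'
    for _ in range(n):
        out.append(L[r])                  # L[r] precedes F[r] in T
        r = LF[r]
    s = "".join(reversed(out))            # a rotation of T
    k = s.index("#")
    return s[k + 1:] + s[:k + 1]          # rotate so '#' ends the string
```

On the running example, bwt("mississippi#") returns "ipssm#pissii", and inverting it recovers the text.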
How to compute the BWT ?
SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3
[Figure: the BWT matrix rows, i.e. the sorted rotations of mississippi#, with L = i p s s m # p i s s i i.]
We said that: L[i] precedes F[i] in T
e.g. L[3] = T[7]
Given SA and T, we have L[i] = T[SA[i] − 1]
How to construct SA from T ?
SA
12  #
11  i#
 8  ippi#
 5  issippi#
 2  ississippi#
 1  mississippi#
10  pi#
 9  ppi#
 7  sippi#
 4  sissippi#
 6  ssippi#
 3  ssissippi#
Elegant but inefficient
Input: T = mississippi#
Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii
# at 16
Mtf = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code
RLE0 = 03141041403141410210
Alphabet
|S|+1
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16), plus the original MTF list (i,m,p,s)
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node one can reach any other via an
undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node one can reach any other via a
directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
The largest artifact ever conceived by humankind
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph
V = routers
E = communication links
The “cosine” graph (undirected, weighted)
V = static web pages
E = semantic distance between pages
Query-Log graph (bipartite, weighted)
V = queries and URLs
E = (q,u) if u is a result for q, and has been clicked by some
user who issued q
Social graph (undirected, unweighted)
V = users
E = (x,y) if x knows y (facebook, address book, email, …)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1
The In-degree distribution
Altavista crawl, 1999
Indegree follows power law distribution
WebBase Crawl 2001
Pr[in-degree(u) = k] ∝ 1/k^α, α ≈ 2.1
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
[Figure: adjacency-matrix plot of a web crawl: 21 million pages, 150 million links.]
URL-sorting
Berkeley
Stanford
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s₁ − x, s₂ − s₁ − 1, …, s_k − s_{k−1} − 1}
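The gap transformation above is easy to sketch (a minimal sketch; the node id and successor values are hypothetical, and the signed-to-unsigned mapping for negative entries is left out):

```python
def encode_successors(x, succ):
    """Gap-encode the sorted successor list of node x: the first gap is
    s1 - x (may be negative), then s_i - s_{i-1} - 1 for the rest."""
    if not succ:
        return []
    gaps = [succ[0] - x]
    for a, b in zip(succ, succ[1:]):
        gaps.append(b - a - 1)
    return gaps

def decode_successors(x, gaps):
    if not gaps:
        return []
    succ = [x + gaps[0]]
    for g in gaps[1:]:
        succ.append(succ[-1] + g + 1)
    return succ
```

Locality makes most gaps tiny: e.g. for a hypothetical node 15 with successors 13, 15, 16, 17, 18, 19, 23, 24, 203 the gaps are −2, 1, 0, 0, 0, 0, 3, 0, 178, which compress well with a var-length integer code.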
For negative entries:
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
Adjacency list with
copy lists
(similarity)
Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.
Consecutivity in extra-nodes
Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
data
knowledge
about data
at receiver
sender
receiver
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
How if the sender has never seen data at receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested
page and the ones available in cache
File synchronization
[diff, zdelta, REBL,…]
[rsynch, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
LZ77-scheme provides an efficient, optimal solution
fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting the encoding from fnew
zdelta is one of the best implementations
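The LZ77 idea can be sketched with Python's zlib, whose preset-dictionary parameter lets the compressor copy substrings out of fknown; this is a stand-in for zdelta, not its actual implementation:

```python
import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    """Compress f_new using f_known as a preset dictionary: LZ77 copies
    may reach back into f_known, so shared substrings cost almost nothing."""
    c = zlib.compressobj(level=9, zdict=f_known)
    return c.compress(f_new) + c.flush()

def delta_decompress(f_known: bytes, f_delta: bytes) -> bytes:
    """The receiver needs the same f_known to invert the delta."""
    d = zlib.decompressobj(zdict=f_known)
    return d.decompress(f_delta)
```

When fnew shares most of its content with fknown, the delta fd is far smaller than any stand-alone compression of fnew.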
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
Client
reference
request
Slow-link
Delta-encoding
Proxy
reference
request
Fast-link
web
Page
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f ∈ F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic
Improvement
What about
many-to-one compression?
(group of files)
Problem: Constructing G is very costly, Θ(n²) edge calculations (zdelta execs)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files
Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, this still takes Θ(n²) time
          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
request
f_new
update
Server
f_old
Client
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
hashes
f_new
Server
encoded file
f_old
Client
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
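The weak rolling checksum at the heart of rsync can be sketched as follows (a simplified Adler-style (a,b) hash; real rsync confirms weak matches with a strong MD5-like hash, omitted here):

```python
M = 1 << 16  # both components are kept modulo 2^16

def weak_hash(block: bytes):
    """rsync-style weak checksum of a block: (sum of bytes, weighted sum)."""
    a = sum(block) % M
    b = sum((len(block) - i) * c for i, c in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, blen):
    """Slide the window one byte to the right in O(1) time."""
    a = (a - out_byte + in_byte) % M
    b = (b - blen * out_byte + a) % M
    return a, b
```

The O(1) roll is what lets the client test the server's block hashes at every alignment of f_old without rehashing each window from scratch.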
Rsync: some experiments
          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452
Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
Server sends hashes (unlike the client in rsync), the client checks them
Server deploys the common fref to compress the new ftar (rsync compresses just it).
A multi-round protocol
k blocks of n/k elems
log(n/k) levels
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e., T[i,N])
Occurrences of P in T = All suffixes of T having P as a prefix
Example: P = si, T = mississippi ⇒ occurrences at positions 4 and 7
SUF(T) = Sorted set of suffixes of T
Reduction
From substring search
To prefix search
The Suffix Tree
Figure: the suffix tree of T# = mississippi#; edges are labeled with substrings (i, s, si, ssi, p, pi#, ppi#, mississippi#, …) and each leaf stores the starting position (1–12) of its suffix.
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N²) space
SA    SUF(T)
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi#
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#

T = mississippi#
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp
T = mississippi#
P is larger
2 accesses per step
P = si
Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp
T = mississippi#
P is smaller
P = si
Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
overall, O(p log2 N) time
O(p + log2 N) time [Manber-Myers, ’90]
O(p + log2 |S|) time [Cole et al., ’06]
Locating the occurrences
occ = 2: binary-search SA for the positions of si# (just before the occurrences) and si$ (just after them); the delimited SA range contains the occurrences 4 and 7 of P = si in T = mississippi#
Suffix Array search
• O (p + log2 N + occ) time
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree
[Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
Lcp = 0 0 1 4 0 0 1 0 2 1 3 (one value per pair of adjacent suffixes)
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
T = mississippi#
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a run Lcp[i,i+C-2] whose entries are all ≥ L
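These Lcp-based queries can be sketched as follows (naive O(N²) Lcp computation for clarity; Kasai's algorithm achieves O(N)):

```python
def lcp_array(t, sa):
    """Lcp[i] = longest common prefix of the adjacent suffixes
    sa[i] and sa[i+1]."""
    def lcp(i, j):
        k = 0
        while i + k < len(t) and j + k < len(t) and t[i + k] == t[j + k]:
            k += 1
        return k
    return [lcp(sa[i], sa[i + 1]) for i in range(len(sa) - 1)]

def longest_repeat(t):
    """A repeated substring of length >= L exists iff max(Lcp) >= L."""
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    return max(lcp_array(t, sa), default=0)
```

For mississippi# the maximum Lcp value is 4, witnessing the repeated substring issi.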
Slide 5
Paradigm shift...
Web 2.0 is about the many
Big DATA ⇒ Big PC ?
We have three types of algorithms:
T1(n) = n, T2(n) = n², T3(n) = 2^n
... and assume that 1 step = 1 time unit
How many input data n each algorithm may process
within t time units?
n1 = t,
n2 = √t,
n3 = log2 t
What about a k-times faster processor?
...or, what is n, when the time units are k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
A new scenario
Data are more available than even before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Step cost is Θ(1) time
Not just MIN #steps…
CPU registers
Cache (L1/L2): few Mbs, some nanosecs, few words fetched
RAM: few Gbs, tens of nanosecs, some words fetched
HD: few Tbs, few millisecs, B = 32K page
net: many Tbs, even secs, packets
You should be “??-aware programmers”
I/O-conscious Algorithms
track
read/write head
read/write arm
magnetic surface
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3÷0.4 (Hennessy-Patterson)]
C = cost of an I/O [10^5 ÷ 10^6 steps (Hennessy-Patterson)]
If N = (1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)
Space-conscious Algorithms
I/Os
search
access
Compressed
data structures
Streaming Algorithms
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
n    4K   8K   16K   32K    128K   256K   512K   1M
n³   22s  3m   26m   3.5h   28h    --     --     --
n²   0    0    0     1s     26s    106s   7m     28m
An optimal solution
We assume every subsum≠0
Figure: A split around the Optimum window; the running sum is < 0 just before Optimum starts, and > 0 within it.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
sum = 0; max = -1;
For i = 1,...,n do
  If (sum + A[i] ≤ 0) sum = 0;
  else { sum += A[i]; max = MAX{max, sum}; }
Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
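The one-pass scan above, in code (a variant of Kadane's algorithm; it returns 0 for an all-negative array, a simplification of the slide's max = -1 initialization):

```python
def max_subarray_sum(a):
    """Reset the running sum whenever adding A[i] would make it
    non-positive; track the best sum seen so far. O(n) time, O(1) space."""
    best = cur = 0
    for x in a:
        cur = max(cur + x, 0)
        best = max(best, cur)
    return best
```

On the slide's array the best window is 6+1-2+4+3 = 12.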
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions ⇒ data get distributed arbitrarily !!!
B-tree internal nodes
B-tree leaves
(“tuple pointers")
What about
listing tuples
in order ?
Tuples
Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;          // Divide
03   Merge-Sort(A,i,m);    // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)        // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10^9 tuples ⇒ a few Gbs
Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log2 n) random I/Os
[5ms] * n log2 n ≈ 1.5 years
In practice, it is faster because of caching...
2 passes (R/W)
Merge-Sort Recursion Tree
log2 N
If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
Figure: the recursion tree of binary Merge-Sort; sorted runs double in size at each level.
How do we deploy the disk/mem features?
N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
Pass 1: Produce (N/M) sorted runs.
Pass i: merge X = M/B runs ⇒ log_{M/B} (N/M) passes
Figure: X input buffers and 1 output buffer, each of B items, in main memory; runs stream in from disk and the merged output streams back to disk.
Multiway Merging
Figure: one current page Bf1,…,Bfx per run; repeatedly move min(Bf1[p1], Bf2[p2], …, Bfx[pX]) to the output buffer Bfo; fetch the next page of run i when pi = B, flush Bfo when full, until EOF of every run. The output file is the merged run.
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os
In practice
M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
One multiway merge ⇒ 2 passes (R/W) = few mins
Tuning depends
on disk features
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
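A toy sketch of the whole pipeline (run formation in memory, then one merge pass; run_size plays the role of M, temp files play the disk, and heapq.merge plays the multiway merger — an illustration, not an I/O-tuned implementation):

```python
import heapq, itertools, tempfile

def external_sort(numbers, run_size):
    """Pass 1: produce N/run_size sorted runs on temporary files.
    Pass 2: merge all runs at once (fan-out = #runs, assumed <= M/B)."""
    run_files = []
    it = iter(numbers)
    while True:
        run = sorted(itertools.islice(it, run_size))  # one in-memory run
        if not run:
            break
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{x}\n" for x in run)
        f.seek(0)
        run_files.append(f)
    # heapq.merge streams the runs, keeping only one item per run in memory
    merged = heapq.merge(*((int(line) for line in f) for f in run_files))
    return list(merged)
```

With a large fan-out a single merge pass suffices, matching the "2 passes (R/W)" claim above.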
May compression help?
Goal: enlarge M and reduce N
#passes = O(logM/B N/M)
Cost of a pass = O(N/B)
Part of Vitter’s paper…
In order to address issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how far down we can go
Toy problem #3: Top-freq elements
Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space (i.e., find the mode, assuming it occurs > N/2 times)
A=b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables <X,C>, initially C=0
For each item s of the stream,
  if (X==s) then C++
  else { C--; if (C<=0) { X=s; C=1; } }
Return X;
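The same one-pass vote in code (Boyer-Moore majority voting, with the counter initialization made explicit; a sketch, not tied to any library):

```python
def majority_candidate(stream):
    """O(1)-space majority vote: the returned X is guaranteed to be
    the mode whenever some item occurs more than N/2 times."""
    x, c = None, 0
    for s in stream:
        if c == 0:
            x, c = s, 1
        elif x == s:
            c += 1
        else:
            c -= 1
    return x
```

On the slide's stream (9 occurrences of c out of 17 items) the candidate is c; if no item exceeds N/2, the returned candidate must be verified with a second pass.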
Proof
If X ≠ y at the end, then every one of y’s
occurrences has a “negative” mate.
Hence these mates are ≥ #occ(y).
As a result, N ≥ 2 * #occ(y), contradicting #occ(y) > N/2.
Problems arise if the mode occurs ≤ N/2 times: the returned X need not be the mode.
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 * 10^9 chars ⇒ size = 6Gb
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms
What kind of data structure should we build to support
word-based searches ?
Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms

            Antony&Cleopatra  JuliusCaesar  TheTempest  Hamlet  Othello  Macbeth
Antony            1                1            0          0       0        1
Brutus            1                1            0          1       0        0
Caesar            1                1            0          1       1        1
Calpurnia         0                1            0          0       0        0
Cleopatra         1                0            0          0       0        0
mercy             1                0            1          1       1        1
worser            1                0            1          1       1        0

1 if the play contains the word, 0 otherwise. Space is 500Gb !
Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16
We can do still better: i.e. 30÷50% of the original text
1. Typically …
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
Better index, but yet it is >10 times the text !!!!
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO: they are 2^n, but we have fewer compressed msgs:
∑_{i=1..n-1} 2^i = 2^n - 2
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log2 (1/p(s)) = -log2 p(s)
Lower probability higher information
Entropy is the weighted average of i(s)
H(S) = ∑_{s∈S} p(s) * log2 (1/p(s)) bits
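The weighted average can be computed directly (a one-line sketch; symbols of probability 0 contribute nothing, by convention):

```python
from math import log2

def entropy(probs):
    """Shannon entropy H(S) = sum over s of p(s) * log2(1/p(s)),
    in bits per symbol."""
    return sum(p * log2(1 / p) for p in probs if p > 0)
```

For instance, the distribution (1/2, 1/4, 1/4) has entropy 1.5 bits/symbol.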
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be
uniquely decomposed into their codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0
1
0 1
a
0
1
b
c
d
Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = ∑_{s∈S} p(s) * L[s]
We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…
Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}
then pi < pj ⇒ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have
H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that
La(C) ≤ H(S) + 1
The Shannon code takes ⌈log2 1/p(s)⌉ bits per symbol
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
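A compact sketch of the greedy construction: repeatedly merge the two least probable trees (the tie-breaking counter is an implementation detail so the heap never compares dictionaries, not part of the slides):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build optimal prefix codes from {symbol: probability}."""
    tick = count()
    heap = [(p, next(tick), {s: ""}) for s, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # two least probable trees
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tick), merged))
    return heap[0][2]
```

On the running example below (p(a)=.1, p(b)=.2, p(c)=.2, p(d)=.5) it produces codeword lengths 3, 3, 2, 1, i.e. an average length of 1.8 bits/symbol.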
Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
Merge a(.1) + b(.2) → (.3); merge (.3) + c(.2) → (.5); merge (.5) + d(.5) → (1)
a=000, b=001, c=01, d=1
There are 2^{n-1} “equivalent” Huffman trees
What about ties (and thus, tree depth) ?
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc... → 000 001 01 = 00000101
101001... → 1|01|001 = dcb
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store for any level L:
firstcode[L] (the codewords of a level are consecutive, so the first one suffices)
Symbol[L,i], for each i in level L
This is ≤ h² + |S| log |S| bits
Canonical Huffman
Encoding (levels L = 1,…,5)
Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0
T=...00010...
Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
-log2(.999) ≈ .00144
If we were to send 1000 such symbols we
might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
1 extra bit per macro-symbol = 1/k extra-bits per symbol
Larger model to be transmitted
Shannon took infinite sequences, and k → ∞ !!
In practice, we have:
Model takes |S|^k * (k * log2 |S|) + h² bits
It is H0(S^L) ≤ L * Hk(S) + O(k * log2 |S|), for each k ≤ L
(where h might be |S|)
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
Figure: the word-based Huffman tree with fan-out 128 for T = “bzip or not bzip”; each codeword is a sequence of 7-bit units, and a tagging bit per byte marks the first byte of a codeword, keeping codewords byte-aligned.
CGrep and other ideas...
P= bzip = 1a 0b
Figure: GREP-like scan of C(T), T = “bzip or not bzip”, for the codeword of P; the tag bits rule out matches that are not codeword-aligned (yes/no at each aligned position).
Speed ≈ Compression ratio
You find this at
You find it under my Software projects
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary
P = bzip = 1a 0b
Terms: bzip, not, or (plus the space symbol)
Figure: scanning C(S), S = “bzip or not bzip”, comparing P’s codeword against the compressed stream.
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
Figure: the occurrences of P in T
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}^n)
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:
H(s) = ∑_{i=1..m} 2^(m-i) * s[i]
P = 0101
H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5
s=s’ if and only if H(s)=H(s’)
Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1):
H(Tr) = 2 * H(Tr-1) - 2^m * T(r-1) + T(r+m-1)
T = 10110101
T1 = 1011
T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2*11 - 2^4*1 + 0 = 22 - 16 = 6
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P = 101111, q = 7
H(P) = 47, Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally!
(1*2 + 0) (mod 7) = 2
(2*2 + 1) (mod 7) = 5
(5*2 + 1) (mod 7) = 4
(4*2 + 1) (mod 7) = 2
(2*2 + 1) (mod 7) = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr-1):
2^m (mod q) = 2 * (2^(m-1) (mod q)) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
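The whole fingerprint algorithm, sketched over the binary alphabet of the slides (the explicit verification of candidate positions makes this the deterministic variant; a fixed prime q stands in for the random choice):

```python
def rabin_karp(t, p, q=2_147_483_647):
    """Find all occurrences of binary string p in binary string t by
    comparing fingerprints H(.) mod the prime q, verifying each hit."""
    n, m = len(t), len(p)
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):                       # fingerprints of P and T[0:m]
        hp = (2 * hp + int(p[i])) % q
        ht = (2 * ht + int(t[i])) % q
    high = pow(2, m - 1, q)                  # weight of the outgoing bit
    occ = []
    for r in range(n - m + 1):
        if ht == hp and t[r:r + m] == p:     # verify: no false matches reported
            occ.append(r)
        if r + m < n:                        # roll the window in O(1)
            ht = (2 * (ht - int(t[r]) * high) + int(t[r + m])) % q
    return occ
```

On the slide's example T = 10110101, P = 0101, the only occurrence starts at (0-based) position 4.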
Problem 1: Solution
Dictionary
P = bzip = 1a 0b
Terms: bzip, not, or (plus the space symbol)
Figure: scanning C(S), S = “bzip or not bzip”: the tag bits mark codeword beginnings, so P’s codeword is compared only at codeword-aligned positions and every reported match is a true one.
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for
M is an m-by-n matrix. For T = california and P = for, row 1 has a 1 only where T[j] = f; the only full chain is M(1,5) = M(2,6) = M(3,7) = 1, i.e. P occurs ending at position 7.
How does M solve the exact match problem?
How to construct M
We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one
Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:
And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.
e.g., BitShift of the column (0,1,1,0) is (1,0,1,1)
Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x)[i] = 1 for the
positions i in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)
U(b) = (0,1,0,0,0)
U(c) = (0,0,0,0,1)
How to construct M
Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by
M(j) = BitShift(M(j-1)) AND U(T[j])
For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T
ending at character j-1   ⇔   M(i-1,j-1) = 1
(2) P[i] = T[j]           ⇔   the i-th bit of U(T[j]) = 1
BitShift moves bit M(i-1,j-1) to the i-th position;
AND this with the i-th bit of U(T[j]) to establish if both are true
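The column update can be coded with integer bitmasks (bit i-1 of col stands for M(i,j); a sketch assuming m ≤ w — Python integers are unbounded, so larger m simply costs O(m/w) per step):

```python
def shift_and(t, p):
    """Shift-And exact matching: returns the 0-based starting
    positions of the occurrences of p in t."""
    m = len(p)
    u = {}                                 # U(x): bit i-1 set iff p[i] == x
    for i, c in enumerate(p):
        u[c] = u.get(c, 0) | (1 << i)
    occ, col = [], 0
    for j, c in enumerate(t):
        # BitShift = shift and set the first bit, then AND with U(T[j])
        col = ((col << 1) | 1) & u.get(c, 0)
        if col & (1 << (m - 1)):           # M(m,j) = 1: full match ending at j
            occ.append(j - m + 1)
    return occ
```

On T = xabxabaaca, P = abaac the single occurrence starts at (0-based) position 4, matching the worked example below.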
An example (T = xabxabaaca, P = abaac; columns written as bits for positions 1..5)
j=1: M(1) = BitShift(M(0)) & U(x) = 10000 & 00000 = 00000
j=2: M(2) = BitShift(M(1)) & U(a) = 10000 & 10110 = 10000
j=3: M(3) = BitShift(M(2)) & U(b) = 11000 & 01000 = 01000
…
j=9: M(9) = BitShift(M(8)) & U(c) = 11001 & 00001 = 00001 ⇒ M(5,9) = 1: an occurrence of P ends at position 9 (it starts at position 5)
Shift-And method: Complexity
If m<=w, any column and vector U() fit in a
memory word.
Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
Very often in practice. Recall that w=64 bits in
modern architectures.
Some simple extensions
We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)
U(b) = (1,1,0,0,0)
U(c) = (0,0,0,0,1)
What about ‘?’, ‘[^…]’ (not)?
Problem 1: Another solution
Dictionary
P = bzip = 1a 0b
Terms: bzip, not, or (plus the space symbol)
Figure: scanning C(S), S = “bzip or not bzip”, with a different traversal of the Huffman tree, still touching only codeword-aligned positions.
Speed ≈ Compression ratio
Problem 2
Dictionary
Given a pattern P find
all the occurrences in S
of all terms containing
P as substring
Terms: bzip, not, or (plus the space symbol)
P=o
S = “bzip or not bzip”
Figure: the codewords of all the terms containing P are searched in C(S).
not= 1 g 0 g 0 a
or = 1 g 0 a 0 b
Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
Figure: the occurrences of P1 and P2 in T
Naïve solution
Use an (optimal) Exact Matching Algorithm searching each
pattern in P
Complexity: O(nl+m) time, not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns in P
R is a bitmap of length m:
R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method searching for S:
For any symbol c, U’(c) = U(c) AND R
U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
For any step j:
compute M(j), then M(j) OR U’(T[j]). Why?
This sets to 1 the first bit of each pattern that starts with T[j]
Check if there are occurrences ending in j. How?
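A sketch of this extension (R seeds a 1 at every pattern start, a final mask F marks pattern ends and answers the "How?" above; the helper names are mine):

```python
def multi_shift_and(t, patterns):
    """Multi-pattern Shift-And on the concatenation S of all patterns.
    Returns (0-based start, pattern) pairs in order of occurrence."""
    s = "".join(patterns)
    u, r, f, pos, ends = {}, 0, 0, 0, {}
    for p in patterns:
        r |= 1 << pos                    # R: first symbol of each pattern
        pos += len(p)
        f |= 1 << (pos - 1)              # F: last symbol of each pattern
        ends[pos - 1] = p
    for i, c in enumerate(s):
        u[c] = u.get(c, 0) | (1 << i)
    occ, col = [], 0
    for j, c in enumerate(t):
        uc = u.get(c, 0)
        col = ((col << 1) & uc) | (uc & r)  # seed a 1 at each pattern start
        hits = col & f                      # occurrences ending at j
        while hits:
            b = hits & -hits
            p = ends[b.bit_length() - 1]
            occ.append((j - len(p) + 1, p))
            hits ^= b
    return occ
```

The OR with (uc & r) is exactly the slide's "M(j) OR U'(T[j])" step, fused into the column update.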
Problem 3
Dictionary
Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches
Terms: bzip, not, or (plus the space symbol)
S = “bzip or not bzip”
P = bot k=2
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position 4;
it also occurs with 4 mismatches starting at position 2.
aatatccacaa
   atcgaa
aatatccacaa
 atcgaa
Agrep
Our current goal: given k, find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:
Ml(i,j) = 1 iff
the first i characters of P match the i characters of T
ending at position j, with no more than l mismatches
What is M0?
How does Mk solve the k-mismatch problem?
Computing Mk
We compute Ml for all l = 0, …, k.
For each j compute M0(j), M1(j), …, Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff one of the two following cases holds:
Computing Ml: case 1
The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal:
BitShift( Ml(j-1) ) AND U( T[j] )
Computing Ml: case 2
The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches (the next pair of
characters may mismatch):
BitShift( Ml-1(j-1) )
Computing Ml
We compute Ml for all l = 0, …, k.
For each j compute M0(j), M1(j), …, Mk(j)
For all l initialize Ml(0) to the zero vector.
Combining the two cases:
Ml(j) = [ BitShift( Ml(j-1) ) AND U( T[j] ) ]  OR  BitShift( Ml-1(j-1) )
Example M1
T = xabxabaaca,  P = abaad

        1 2 3 4 5 6 7 8 9 10
M0 = 1  0 1 0 0 1 0 1 1 0 1
     2  0 0 1 0 0 1 0 0 0 0
     3  0 0 0 0 0 0 1 0 0 0
     4  0 0 0 0 0 0 0 1 0 0
     5  0 0 0 0 0 0 0 0 0 0

        1 2 3 4 5 6 7 8 9 10
M1 = 1  1 1 1 1 1 1 1 1 1 1
     2  0 0 1 0 0 1 0 1 1 0
     3  0 0 0 1 0 0 1 0 0 1
     4  0 0 0 0 1 0 0 1 0 0
     5  0 0 0 0 0 0 0 0 1 0
How much do we pay?
The running time is O( kn(1 + m/w) )
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.
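The k-mismatch recurrence above can be sketched in Python with one big integer per column Ml(j) (a minimal illustration, names are mine):

```python
def agrep_mismatch(text, pat, k):
    """Bit-parallel search for occurrences of pat with up to k
    mismatches; returns the 0-based ending positions."""
    m = len(pat)
    U = {}
    for i, ch in enumerate(pat):        # U(c): bit i set iff pat[i] == c
        U[ch] = U.get(ch, 0) | (1 << i)
    M = [0] * (k + 1)                   # M[l] = column Ml(j), current j only
    last = 1 << (m - 1)
    occ = []
    for j, ch in enumerate(text):
        Uc = U.get(ch, 0)
        prev = M[:]                     # columns for j-1
        M[0] = ((prev[0] << 1) | 1) & Uc
        for l in range(1, k + 1):
            # case 1: extend an <=l-mismatch prefix with a matching char
            # case 2: extend an <=l-1-mismatch prefix with a mismatching char
            M[l] = ((((prev[l] << 1) | 1) & Uc)
                    | ((prev[l - 1] << 1) | 1))
        if M[k] & last:
            occ.append(j)
    return occ
```

On the slides' data it reports the expected matches: T = aatatccacaa, P = atcgaa, k = 2 gives the occurrence ending at 0-based position 8 (1-based start 4), and T = xabxabaaca, P = abaad, k = 1 gives the single 1 in row 5 of M1 (column 9).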
Problem 3: Solution
Given a pattern P, find all the occurrences in S of all terms
containing P as a substring, allowing k mismatches.
Dictionary = {a, b, bzip, not, or, space},  P = bot,  k = 2
S = “bzip or not bzip”
Codeword: not = 1g0g0a
[Diagram: the agrep automaton is run over C(S); the terms “not” and “bzip” are checked against P = bot with up to 2 mismatches, and matches are marked yes.]
Agrep: more sophisticated operations
The Shift-And method can solve other operations as well
The edit distance between two strings p and s is
d(p,s) = minimum number of operations
needed to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one
Example: d(ananas,banane) = 3
Search by regular expressions
Example: (a|b)?(abc|a)
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
g-code for integer encoding
g(x), for x > 0: write (Length - 1) zeroes followed by the
binary representation of x, where Length = log2 x + 1
e.g., 9 is represented as <000,1001>.
g-code for x takes 2 log2 x + 1 bits
(i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x2), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of g-coded
integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
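The g-code just defined is easy to implement; a minimal Python sketch (bit strings used for clarity, not efficiency):

```python
def gamma_encode(x):
    """Elias gamma code of x > 0: (Length-1) zeroes, then x in binary."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into integers."""
    out, i, n = [], 0, len(bits)
    while i < n:
        z = 0
        while bits[i] == "0":        # count leading zeroes = Length - 1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out
```

Decoding the exercise stream `0001000001100110000011101100111` yields 8, 6, 3, 59, 7, confirming the answer above.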
Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code g(i).
Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥ Σi=1,…,x pi ≥ x * px   hence   x ≤ 1/px
How good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):
Σi pi * |g(i)|  ≤  Σi pi * [ 2 * log(1/pi) + 1 ]  =  2 * H0(X) + 1
Not much worse than Huffman,
and improvable to H0(X) + 2 + …
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman
End-tagged dense code
The rank r is mapped to r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix-code
Better compression: it uses all 7-bits configurations
(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is:
Previously we used: s = c = 128
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s*c2 with 3 bytes, …
An example
5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2
bytes, hence more words on 1 byte, which pays off if the distribution is skewed…
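The End-Tagged Dense Code mapping (rank r to the r-th 7-bits-per-byte sequence, last byte tagged) can be sketched as follows; a minimal Python illustration, function names mine:

```python
def etdc_encode(rank):
    """End-Tagged Dense Code: the r-th word (r = 0,1,2,...) gets the
    r-th 7-bit-per-byte sequence; the last byte has its high bit set."""
    out = []
    while True:
        out.append(rank % 128)
        rank = rank // 128 - 1       # dense: skip the shorter codewords
        if rank < 0:
            break
    out.reverse()
    out[-1] |= 0x80                  # tag the last byte
    return bytes(out)

def etdc_decode(bs):
    rank = 0
    for b in bs:
        rank = rank * 128 + (b & 0x7F)
    for j in range(1, len(bs)):      # add back counts of shorter codewords
        rank += 128 ** j
    return rank
```

Consistently with the slide, ranks 0..127 take 1 byte and ranks up to 128 + 128^2 - 1 = 16511 take 2 bytes; rank 16512 is the first 3-byte codeword.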
Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256 - s.
Brute-force approach
Binary search:
on real distributions, there seems to be a unique minimum
Ks = max codeword length
Fsk = cum. prob. of symbols whose |cw| ≤ k
Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L = [a,b,c,d,…]
For each input symbol s:
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
Exploits temporal locality, and it is dynamic
X = 1^n 2^n 3^n … n^n: Huff = O(n2 log n), MTF = O(n log n) + n2
Not much worse than Huffman
… but it may be far better
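The two steps above translate directly into code; a minimal Python sketch of the MTF transform and its inverse (names mine):

```python
def mtf_encode(s, alphabet):
    """Output the position of each symbol in L, then move it to front."""
    L = list(alphabet)
    out = []
    for ch in s:
        i = L.index(ch)              # 1) position of ch in L
        out.append(i)
        L.pop(i)
        L.insert(0, ch)              # 2) move ch to the front
    return out

def mtf_decode(codes, alphabet):
    """Invert the transform by replaying the same list updates."""
    L = list(alphabet)
    out = []
    for i in codes:
        ch = L[i]
        out.append(ch)
        L.pop(i)
        L.insert(0, ch)
    return "".join(out)
```

Runs of equal symbols become runs of 0s (the "memory"), which is what makes the output highly compressible by a variable-length integer code.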
MTF: how good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2 * log i + 1
Put S in front of the list and consider the cost of encoding:
≤ O(|S| log |S|) + Σx=1,…,|S| Σi g( pix - pi-1x )
(the sum of the g-codes of the gaps between consecutive occurrences of each symbol x)
By Jensen’s inequality:
≤ O(|S| log |S|) + Σx=1,…,|S| nx * [ 2 * log(N/nx) + 1 ]
=  O(|S| log |S|) + N * [ 2 * H0(X) + 1 ]
Hence La[mtf] ≤ 2 * H0(X) + O(1)
MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to maintain the MTF-list efficiently:
Search tree
Leaves contain the symbols, ordered as in the MTF-list
Nodes contain the size of their descending subtree
Hash Table
key is a symbol
data is a pointer to the corresponding tree leaf
Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one bit suffice
Properties:
Exploits spatial locality, and it is a dynamic code
There is a memory
X = 1^n 2^n 3^n … n^n
Huff(X) = Θ(n2 log n)  >  Rle(X) = Θ(n log n)
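A minimal Python sketch of RLE matching the example above:

```python
def rle_encode(s):
    """Collapse each maximal run into a (symbol, run length) pair."""
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append((s[i], j - i))
        i = j
    return out

def rle_decode(pairs):
    return "".join(ch * n for ch, n in pairs)
```

For a binary string the symbol can be dropped from all pairs but the first, since runs alternate.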
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).
e.g. p(a) = .2, p(b) = .5, p(c) = .3:
f(a) = .0, f(b) = .2, f(c) = .7
where f(i) = Σj=1,…,i-1 p(j)
[Diagram: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1).]
The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0, 1):
b → [0.2, 0.7)
a → [0.2, 0.3)
c → [0.27, 0.3)
The final sequence interval is [.27,.3)
Arithmetic Coding
To code a sequence of symbols c with probabilities
p[c] use the following:
l0 = 0,  s0 = 1
li = li-1 + si-1 * f[ci]
si = si-1 * p[ci]
f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is
sn = Πi=1,…,n p[ci]
The interval for a message sequence will be called the
sequence interval
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
.49 ∈ [.2,.7)  → b, rescale: (.49 - .2)/.5 = .58
.58 ∈ [.2,.7)  → b, rescale: (.58 - .2)/.5 = .76
.76 ∈ [.7,1)   → c
The message is bbc.
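The encoding recurrence (li, si) and the rescaling decoder can be sketched in a few lines of Python over plain floats (a didactic sketch: a real coder uses the integer version discussed later):

```python
def arith_interval(msg, p, f):
    """Sequence interval [l, l+s) for msg, given probabilities p[c]
    and cumulative probabilities f[c] (symbol c excluded)."""
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]
        s = s * p[c]
    return l, l + s

def arith_decode(x, n, p, f):
    """Recover the n-symbol message whose interval contains x."""
    syms = sorted(p, key=lambda c: f[c])
    out = []
    for _ in range(n):
        for c in reversed(syms):       # largest f[c] <= x
            if f[c] <= x:
                break
        out.append(c)
        x = (x - f[c]) / p[c]          # rescale into [0,1)
    return "".join(out)
```

With p = {a:.2, b:.5, c:.3} and f = {a:0, b:.2, c:.7}, "bac" maps to the interval [.27,.3) and decoding .49 with length 3 returns "bbc", matching the two examples above.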
Representing a real number
Binary fractional representation:
.75 = .11
1/3 = .010101…
11/16 = .1011
Algorithm:
1. x = 2 * x
2. If x < 1 output 0
3. else x = x - 1; output 1
So how about just using the shortest binary
fractional representation in the sequence
interval?
e.g. [0,.33) = .01    [.33,.66) = .1    [.66,1) = .11
Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.

         min     max     interval
.11      .110    .111    [.75, 1.0)
.101     .1010   .1011   [.625, .75)

We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (a dyadic number).
e.g. sequence interval [.61, .79); the code interval of .101
is [.625, .75), contained in it.
Can use L + s/2 truncated to 1 + log(1/s) bits
Bound on Arithmetic length
Note that -log s + 1 = log(2/s)
Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log(1/s) = 1 + log Πi=1,…,n (1/pi)
             ≤ 2 + Σi=1,…,n log(1/pi)
             = 2 + Σk=1,…,|S| n pk log(1/pk)
             = 2 + n H0 bits
nH0 + 0.02 n bits in practice, because of rounding
Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:
Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s
m = 0
Message interval is expanded by 2
If u < R/2 then (bottom half)
Output 0 followed by m 1s
m = 0
Message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2
All other cases: just continue…
You find this at
Arithmetic ToolBox
As a state machine
[Diagram: the Arithmetic ToolBox (ATB) as a state machine: given the current interval (L,s), a symbol c and the distribution (p1,…,pS), it outputs the new interval (L’,s’), with L’ = L + s*f(c) and s’ = s*p(c).]
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
[Diagram: the ATB is fed with p[ s | context ], where s = c or esc; it maps (L,s) to (L’,s’).]
Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)
PPM: Example Contexts
String = ACCBACCACBA B,  k = 2

Context: Empty
  A = 4   B = 2   C = 5   $ = 3

Order-1 contexts:
  A:  C = 3   $ = 1
  B:  A = 2   $ = 1
  C:  A = 1   B = 2   C = 2   $ = 3

Order-2 contexts:
  AC:  B = 1   C = 2   $ = 2
  BA:  C = 1   $ = 1
  CA:  C = 1   $ = 1
  CB:  A = 2   $ = 1
  CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!
LZ77
[Diagram: text a a c a a c a b c a b a b a c; the dictionary is the already-scanned prefix (all substrings starting there), the cursor marks the current position; output triple <2,3,c>.]
Algorithm’s step:
Output <d, len, c> where:
d = distance of the copied string wrt current position
len = length of the longest match
c = next char in text beyond the longest match
Advance by len + 1
A buffer “window” has fixed length and moves
Example: LZ77 with window (size = 6)
a a c a a c a b c a b a a a c   → (0,0,a)
a a c a a c a b c a b a a a c   → (1,1,c)
a a c a a c a b c a b a a a c   → (3,4,b)
a a c a a c a b c a b a a a c   → (3,3,a)
a a c a a c a b c a b a a a c   → (1,2,c)
Each triple: (longest match within W, next character)
LZ77 Decoding
Decoder keeps the same dictionary window as the encoder.
Finds the substring and inserts a copy of it
What if len > d? (overlap with text to be compressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor:
for (i = 0; i < len; i++)
  out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
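The character-by-character copy handles the overlapping case automatically; a minimal Python decoder (names mine):

```python
def lz77_decode(triples):
    """Decode LZ77 triples (d, len, c); copies may overlap the output,
    so we copy char by char exactly as in the loop above."""
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):          # handles len > d (overlap)
            out.append(out[start + i])
        out.append(c)
    return "".join(out)
```

Decoding the windowed example above reproduces the text, and the overlapping codeword (2,9,e) after "abcd" yields abcdcdcdcdcdce as claimed.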
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length)  or  (1, char)
Typically uses the second format if length < 3.
Special greedy: possibly use a shorter match so
that the next match is better
Hash table to speed up searches on triples
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
a a b a a c a b c a b c b   → (0,a)  Dict: 1 = a
a a b a a c a b c a b c b   → (1,b)  Dict: 2 = ab
a a b a a c a b c a b c b   → (1,a)  Dict: 3 = aa
a a b a a c a b c a b c b   → (0,c)  Dict: 4 = c
a a b a a c a b c a b c b   → (2,c)  Dict: 5 = abc
a a b a a c a b c a b c b   → (5,b)  Dict: 6 = abcb
LZ78: Decoding Example
(0,a)  → a                           Dict: 1 = a
(1,b)  → a a b                       Dict: 2 = ab
(1,a)  → a a b a a                   Dict: 3 = aa
(0,c)  → a a b a a c                 Dict: 4 = c
(2,c)  → a a b a a c a b c           Dict: 5 = abc
(5,b)  → a a b a a c a b c a b c b   Dict: 6 = abcb
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
LZW: Encoding Example
a a b a a c a b a b a c b   → 112   Dict: 256 = aa
a a b a a c a b a b a c b   → 112   Dict: 257 = ab
a a b a a c a b a b a c b   → 113   Dict: 258 = ba
a a b a a c a b a b a c b   → 256   Dict: 259 = aac
a a b a a c a b a b a c b   → 114   Dict: 260 = ca
a a b a a c a b a b a c b   → 257   Dict: 261 = aba
a a b a a c a b a b a c b   → 261   Dict: 262 = abac
a a b a a c a b a b a c b   → 114   Dict: 263 = cb
LZW: Decoding Example
112  → a
112  → a a                   Dict: 256 = aa
113  → a a b                 Dict: 257 = ab
256  → a a b a a             Dict: 258 = ba
114  → a a b a a c           Dict: 259 = aac
257  → a a b a a c a b       Dict: 260 = ca
261  → ? 261 is not in the dictionary yet: the decoder is one
       step behind, so 261 = prev + prev[0] = aba
     → a a b a a c a b a b a Dict: 261 = aba
114  → a a b a a c a b a b a c   Dict: 262 = abac
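The "one step behind" situation above is the only special case an LZW decoder must handle; a minimal Python sketch (dictionary passed explicitly, following the slides' convention a = 112, b = 113, c = 114):

```python
def lzw_decode(codes, base):
    """LZW decoder; base maps code -> one-char string. New entries
    start at 256. When a code is used before the decoder has defined
    it (the SSc pattern), it must equal prev + prev[0]."""
    d = dict(base)
    nxt = 256
    prev = d[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in d:
            cur = d[code]
        else:                        # special case: decoder one step behind
            cur = prev + prev[0]
        d[nxt] = prev + cur[0]       # the entry the encoder added last step
        nxt += 1
        out.append(cur)
        prev = cur
    return "".join(out)
```

Running it on the slide's code stream reproduces the prefix of the input text covered by those eight codes.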
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform (1994)
Consider the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows:
F                L
#  mississipp  i
i  #mississip  p
i  ppi#missis  s
i  ssippi#mis  s
i  ssissippi#  m
m  ississippi  #
p  i#mississi  p
p  pi#mississ  i
s  ippi#missi  s
s  issippi#mi  s
s  sippi#miss  i
s  sissippi#m  i
L is the BWT of T: a famous example (on longer texts the
effect is much stronger).
A useful tool: L → F mapping
[Same sorted-rotation matrix as above: F = #iiiimppssss and L = ipssm#pissii are known, T is unknown.]
How do we map L’s chars onto F’s chars ?
… we need to distinguish equal chars in F…
Take two equal chars in L and
rotate their rows rightward: same relative order !!
The BWT is invertible
[Same sorted-rotation matrix: F (the sorted L) and L are known, T is unknown.]
Two key properties:
1. LF-array maps L’s chars to F’s chars
2. L[i] precedes F[i] in T
Reconstruct T backward:
T = …. i ppi #
InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
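Both directions can be sketched in a few lines of Python (the forward transform uses the naive sorted-rotations construction, as in the slides; names mine):

```python
def bwt(t):
    """Forward BWT via sorted rotations; t must end with a unique '#'
    that sorts before every other character."""
    n = len(t)
    rot = sorted(t[i:] + t[:i] for i in range(n))
    return "".join(r[-1] for r in rot)

def ibwt(l):
    """Invert the BWT with the LF mapping: equal chars keep their
    relative order between L and F."""
    n = len(l)
    order = sorted(range(n), key=lambda i: (l[i], i))  # F-row -> L-row
    lf = [0] * n
    for f_row, l_row in enumerate(order):
        lf[l_row] = f_row
    out, r = [], 0                    # row 0 starts with the sentinel '#'
    for _ in range(n):
        out.append(l[r])              # L[r] precedes F[r] in T
        r = lf[r]
    s = "".join(reversed(out))        # '#' + text without the sentinel
    return s[1:] + s[0]               # move the sentinel back to the end
```

On the running example, bwt("mississippi#") gives the famous string ipssm#pissii, and ibwt recovers the text.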
How to compute the BWT ?
SA   BWT matrix     L
12   #mississipp    i
11   i#mississip    p
 8   ippi#missis    s
 5   issippi#mis    s
 2   ississippi#    m
 1   mississippi    #
10   pi#mississi    p
 9   ppi#mississ    i
 7   sippi#missi    s
 4   sissippi#mi    s
 6   ssippi#miss    i
 3   ssissippi#m    i
We said that: L[i] precedes F[i] in T
e.g. L[3] = T[7]
Given SA and T, we have L[i] = T[SA[i]-1]
How to construct SA from T ?
Input: T = mississippi#
SA
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#
Elegant but inefficient. Obvious inefficiencies:
• Θ(n2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now…
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii
# at 16
Mtf = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code
RLE0 = 03141041403141410210
Alphabet of size |S|+1
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols…
... plus g(16), plus the original Mtf-list (i,m,p,s)
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of a page is about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node one can go to any other via an
undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node one can go to any other via a
directed path.
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
The largest artifact ever conceived by humankind
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph
V = Routers, E = communication links
The “cosine” graph (undirected, weighted)
V = static web pages, E = semantic distance between pages
Query-Log graph (bipartite, weighted)
V = queries and URLs, E = (q,u) if u is a result for q, and has been
clicked by some user who issued q
Social graph (undirected, unweighted)
V = users, E = (x,y) if x knows y (facebook, address book, email,…)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1
The In-degree distribution
Altavista crawl 1999; WebBase crawl 2001
Indegree follows a power law distribution:
Pr[ in-degree(u) = k ]  ∝  1/k^a,  with a ≈ 2.1
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
[Scatter plot of the adjacency matrix (i,j) for a crawl of 21 million pages and 150 million links; after URL-sorting, hosts such as Berkeley and Stanford show up as dense blocks near the diagonal.]
URL compression + Delta encoding
The library WebGraph
Uncompressed adjacency list
Adjacency list with compressed gaps (locality)
Successor list S(x) = {s1-x, s2-s1-1, …, sk-sk-1-1}
(a special mapping is used for the negative entries; only the first gap may be negative)
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of x’s copy-list tells whether the corresponding successor of the
reference y is also a successor of x;
The reference index is chosen in [0,W] so as to give the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
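The gap encoding of successor lists exploiting locality can be sketched as follows (a simplified illustration of the WebGraph idea, without the copy-lists and the signed-integer mapping; names mine):

```python
def encode_successors(x, succ):
    """Gap-encode the sorted successor list of node x:
    S(x) = {s1-x, s2-s1-1, ..., sk-s(k-1)-1}.
    Only the first gap may be negative; by locality most gaps are small."""
    gaps = [succ[0] - x]
    for a, b in zip(succ, succ[1:]):
        gaps.append(b - a - 1)
    return gaps

def decode_successors(x, gaps):
    succ = [x + gaps[0]]
    for g in gaps[1:]:
        succ.append(succ[-1] + g + 1)
    return succ
```

Small gaps can then be fed to a variable-length integer code (e.g. the g-code seen earlier), which is where the compression comes from.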
Extra-nodes: Compressing Intervals
Adjacency list with copy blocks: exploit consecutivity in the extra-nodes
Intervals: use their left extreme and length
Interval length: decremented by Lmin = 2
Residuals: differences between consecutive residuals, or wrt the source
Examples (from the WebGraph encoding):
0 = (15-15)*2 (positive)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
2 = (23-19)-2 (jump >= 2)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
data
knowledge
about data
at receiver
sender
receiver
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
How if the sender has never seen data at receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver:
Unstructured file: delta compression
“Partial” knowledge:
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression  [diff, zdelta, REBL,…]
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested
page and the ones available in cache
File synchronization  [rsync, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engines
Z-delta compression (one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
The LZ77-scheme provides an efficient, optimal solution:
fknown is the “previously encoded text”; compress fknownfnew starting from fnew
zdelta is one of the best implementations

          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs
Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
[Diagram: Client, slow link with delta-encoding, Proxy, fast link, web; references and requests flow between them.]
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f in F a good reference
Reduction to the Min Branching problem on DAGs:
Build a weighted graph GF, nodes = files, weights = zdelta-sizes
Insert a dummy node connected to all, whose weights are the gzip-codings
Compute the min branching = directed spanning tree of min total cost, covering
G’s nodes.
[Diagram: a small example graph with zdelta-size weights (20, 123, 220, 620, 2000, …) and its min branching.]

          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic
Improvement (what about many-to-one compression of a group of files?)
Problem: Constructing G is very costly: n2 edge calculations (zdelta execs)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and are thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files
Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, strictly n2 time

          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
[Setting: a Client holding f_old sends a request to a Server holding f_new, and receives an update.]
client wants to update an out-dated file
server has the new file but does not know the old file
update without sending the entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch,
since the server has both copies of the files
The rsync algorithm
[Protocol: the Client sends block hashes of f_old; the Server matches them against f_new and returns the encoded file.]
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size is problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
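The 4-byte rolling hash is what lets the server slide a window over f_new in O(1) per position; a simplified Adler-style sketch in that spirit (not the exact rsync checksum, names mine):

```python
def rolling_checksum(block):
    """Two 16-bit sums packed into 32 bits: a = sum of bytes,
    b = position-weighted sum (weights n, n-1, ..., 1)."""
    a = sum(block) % 65536
    b = sum((len(block) - i) * ch for i, ch in enumerate(block)) % 65536
    return (b << 16) | a

def roll(h, out_byte, in_byte, blocklen):
    """Slide the window one byte to the right in O(1)."""
    a = h & 0xFFFF
    b = h >> 16
    a = (a - out_byte + in_byte) % 65536
    b = (b - blocklen * out_byte + a) % 65536
    return (b << 16) | a
```

Rolling the checksum across every window of f_new and looking it up in a hash table of f_old's block hashes (confirming candidates with a strong hash such as MD5) is the core of the matching phase.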
Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452

Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
The server sends the hashes (unlike rsync, where the client does); the client checks them
The server deploys the common fref to compress the new ftar (rsync just compresses it).
A multi-round protocol:
k blocks of n/k elements, log(n/k) levels
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])
[Diagram: T with the suffix T[i,N] starting at position i, and P aligned as its prefix.]
Occurrences of P in T = All suffixes of T having P as a prefix
Example: P = si, T = mississippi → occurrences at 4, 7
SUF(T) = Sorted set of suffixes of T
Reduction
From substring search
To prefix search
The Suffix Tree
T# = mississippi#
[Diagram: the suffix tree of T#; from the root, edges labeled #, i, mississippi#, p, s; internal nodes branch on ppi#/ssi, i#/pi#, i/si, etc.; the 12 leaves store the starting positions 1…12 of the suffixes of T#.]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic one of P.
T = mississippi#,  P = si

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#       ← P = si
 4    sissippi#    ← P = si
 6    ssippi#
 3    ssissippi#

Storing SUF(T) explicitly takes Θ(N2) space; storing only the
suffix pointers gives:
• SA: Θ(N log2 N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
T = mississippi#,  P = si
[Diagram: binary search over SA = 12,11,8,5,2,1,10,9,7,4,6,3; at each step the probed suffix is compared with P, deciding “P is larger” or “P is smaller”; 2 accesses per step.]
Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
⟹ overall, O(p log2 N) time
Improvable to O(p + log2 N)  [Manber-Myers, ’90]
and to O(p + log2 |S|)  [Cole et al, ’06]
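The "elegant but inefficient" construction and the indirect binary search can be sketched in Python (1-based suffix positions, as in the slides; names mine):

```python
def suffix_array(t):
    """O(n^2 log n) construction by direct suffix sorting."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

def sa_search(t, sa, p):
    """Two binary searches delimiting the SA range of suffixes having
    p as a prefix; O(p log n) time. Returns starting positions."""
    def pref(i):                         # first |p| chars of suffix i
        return t[i - 1:i - 1 + len(p)]
    lo, hi = 0, len(sa)
    while lo < hi:                       # leftmost suffix with prefix >= p
        mid = (lo + hi) // 2
        if pref(sa[mid]) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    lo, hi = first, len(sa)
    while lo < hi:                       # leftmost suffix with prefix > p
        mid = (lo + hi) // 2
        if pref(sa[mid]) <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])
```

On T = mississippi# this builds SA = 12,11,8,5,2,1,10,9,7,4,6,3, and searching P = si returns the two occurrences at positions 4 and 7.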
Locating the occurrences
T = mississippi#,  P = si
[Diagram: the binary search delimits the SA range containing 7 (sippi…) and 4 (sissippi…): occ = 2, occurrences at positions 4 and 7. The range endpoints are the lexicographic positions of si# and si$, where # < S < $.]
Suffix Array search:
• O(p + log2 N + occ) time
Suffix Trays: O(p + log2 |S| + occ)  [Cole et al., ‘06]
String B-tree  [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays  [Ciriani et al., ’02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
T = mississippi#

Lcp   SA    suffix
      12    #
 0    11    i#
 1     8    ippi#
 1     5    issippi#
 4     2    ississippi#
 0     1    mississippi#
 0    10    pi#
 1     9    ppi#
 0     7    sippi#
 2     4    sissippi#
 1     6    ssippi#
 3     3    ssissippi#

• How long is the common prefix between T[i,…] and T[j,…] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a range Lcp[i,i+C-2] whose entries are all ≥ L
Slide 6
Paradigm shift...
Web 2.0 is about the many
Big DATA ⟹ Big PC ?
We have three types of algorithms:
T1(n) = n,  T2(n) = n2,  T3(n) = 2n
… and assume that 1 step = 1 time unit
How many input data n can each algorithm process
within t time units?
n1 = t,   n2 = √t,   n3 = log2 t
What about a k-times faster processor?
… or, what is n when the available time is k*t ?
n1 = k*t,   n2 = √k * √t,   n3 = log2(kt) = log2 k + log2 t
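The computation above is easy to check numerically; a tiny Python sketch (function name mine):

```python
import math

def reachable_n(t):
    """Input sizes solvable in t steps by T1(n)=n, T2(n)=n^2, T3(n)=2^n."""
    return t, math.isqrt(t), int(math.log2(t))

# a k-times faster processor gives k*t steps: the linear algorithm
# gains a full factor k, the quadratic one only sqrt(k), and the
# exponential one only an additive log2(k)
```

For t = 1024 this gives n1 = 1024, n2 = 32, n3 = 10; with k = 4 the three become 4096, 64 and 12.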
A new scenario
Data are more available than ever before
n → ∞ is more than a theoretical assumption
The RAM model is too simple
It charges Θ(1) time per step: what matters is not just the MIN #steps…
[Figure: the memory hierarchy]
CPU + registers → L1/L2 caches → RAM → disk (HD) → network
Cache: few MBs, some nanosecs, few words fetched per access
RAM: few GBs, tens of nanosecs, some words fetched
Disk: few TBs, few millisecs, pages of B = 32K
Network: many TBs, even secs, packets
You should be “??-aware programmers”
I/O-conscious Algorithms
[Figure: a magnetic disk — tracks, read/write head and arm, magnetic surface]
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O [10⁵–10⁶ steps (Hennessy-Patterson)]
If N = (1+f)·M, then the average cost per step is: C · p · f/(1+f)
This is at least 10⁴ · f/(1+f)
If we fetch B ≈ 4KB in time C, and the algorithm uses all of them:
(1/B) · (p · f/(1+f) · C) ≈ 30 · f/(1+f)
Space-conscious Algorithms
Goal: support search and access operations with few I/Os, via compressed data structures
Streaming Algorithms
[Figure: a magnetic disk — data is read with sequential scans]
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
[Figure: the memory hierarchy again — CPU/registers, L1/L2 caches (few MBs, some ns, few words fetched), RAM (few GBs, tens of ns, some words fetched), disk (few TBs, few ms, B = 32K pages), network (many TBs, even secs, packets)]
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock and its daily performance over time, find the time window in which it achieved the best "market performance".
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running time of the two obvious solutions, as the input size grows:

n:   4K   8K   16K   32K    128K   256K   512K   1M
n³:  22s  3m   26m   3.5h   28h    --     --     --
n²:  0    0    0     1s     26s    106s   7m     28m
An optimal solution
We assume every subsum ≠ 0
[Figure: A splits as a prefix of sum < 0 followed by the Optimum window]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
sum = 0; max = -1;
For i = 1, ..., n do
  If (sum + A[i] ≤ 0) sum = 0;
  else { sum += A[i]; max = MAX{max, sum}; }
Note:
• sum < 0 just before OPT starts;
• sum > 0 within OPT
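The one-pass scan above can be written out as follows (a sketch; the function name is mine, and an all-negative input returns 0 for the empty window):

```python
def max_subarray_sum(A):
    """Slide's algorithm: reset the running sum whenever adding A[i]
    would make it <= 0; track the best sum seen so far."""
    best, run = 0, 0
    for x in A:
        run = 0 if run + x <= 0 else run + x
        best = max(best, run)
    return best
```

On A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7] it returns 12, the sum of the window 6 1 -2 4 3.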
Toy problem #2 : sorting
How to sort tuples (objects) on disk
[Figure: memory containing the tuples; A is an array of pointers to them]
Key observation:
Array A is an "array of pointers to objects"
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to the memory locations A[i] and A[j]
MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions ⇒ data get distributed arbitrarily !!!
[Figure: B-tree internal nodes and leaves ("tuple pointers") pointing to the tuples]
What about listing the tuples in order ?
Possibly 10⁹ random I/Os = 10⁹ · 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;          // Divide
03   Merge-Sort(A,i,m);    // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)        // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10⁹ tuples ⇒ few GBs
Typical disk (Seagate Cheetah 150GB): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log₂ n) random I/Os
[5ms] · n log₂ n ≈ 1.5 years
In practice, it is faster because of caching...
2 passes (R/W)
Merge-Sort Recursion Tree
log₂ N levels of recursion
If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help
[Figure: merge-sort recursion tree over the input; runs of size ≤ M are sorted in internal memory. How do we deploy the disk/mem features ?]
N/M runs, each sorted in internal memory (no I/Os)
I/O-cost for merging is ≈ 2 (N/B) log₂ (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
Pass 1: Produce (N/M) sorted runs.
Pass i: merge X = M/B runs at a time ⇒ log_{M/B} (N/M) passes
[Figure: X input buffers and one output buffer, each of B items, in main memory; runs stream from disk and the merged run streams back to disk]
Multiway Merging
[Figure: multiway merging — buffers Bf1 … Bfx with pointers p1 … px over the current pages of runs 1 … X = M/B; repeatedly move min(Bf1[p1], …, Bfx[pX]) to the output buffer Bfo; fetch the next page of run i when pi = B, flush Bfo when full; the output file is the merged run]
Cost of Multi-way Merge-Sort
Number of passes = O(log_{M/B} #runs) = O(log_{M/B} (N/M))
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os
In practice
M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
One multiway merge ⇒ 2 passes = few mins
Tuning depends on the disk features
A large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
Can compression help?
Goal: enlarge M and reduce N
#passes = O(log_{M/B} (N/M))
Cost of a pass = O(N/B)
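The pass structure can be mimicked in memory (a sketch: M plays the memory size and X ≤ M/B the fan-out; a real implementation streams the runs from/to disk instead of keeping them in lists):

```python
import heapq

def multiway_mergesort(items, M, X):
    """Pass 1: sorted runs of size M.  Later passes: merge up to X
    runs at a time, until a single run remains."""
    runs = [sorted(items[i:i + M]) for i in range(0, len(items), M)]
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + X]))
                for i in range(0, len(runs), X)]
    return runs[0] if runs else []
```

With N/M initial runs and fan-out X, the number of merge passes is ⌈log_X (N/M)⌉, matching the formula above.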
Part of Vitter’s paper…
In order to address issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how far we can go
Toy problem #3: Top-freq elements
Goal: Top queries over a stream of N items (alphabet S large).
Math Problem: Find the item y whose frequency is > N/2, using the smallest space (i.e. the mode, if it occurs > N/2 times).
A = b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables <X,C>
For each item s of the stream:
  if (C == 0) { X = s; C = 1; }
  else if (X == s) C++;
  else C--;
Return X;
Proof
If X ≠ y at the end, then every one of y's occurrences has a "negative" mate. Hence these mates should be ≥ #occ(y), so 2 · #occ(y) ≤ N: impossible when #occ(y) > N/2.
Problems arise if the most frequent item occurs ≤ N/2 times: then X is only a candidate.
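The <X,C> algorithm (Boyer-Moore majority voting) in runnable form — a sketch; when no element occurs > N/2 times, the returned X must be verified with a second pass:

```python
def majority_candidate(stream):
    """One candidate X and one counter C, O(1) space."""
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1      # adopt a new candidate
        elif X == s:
            C += 1           # a vote for X
        else:
            C -= 1           # a "negative mate" cancels one vote
    return X
```

On the slide's stream b a c c c d c b a a a c c b c c c it returns 'c', which indeed occurs 9 times out of 17.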
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 · 10⁹ chars ⇒ size = 6GB
n = 10⁶ documents
TotT = 10⁹ term occurrences (avg term length is 6 chars)
t = 5 · 10⁵ distinct terms
What kind of data structure should we build to support word-based searches ?
Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms. Entry is 1 if the play contains the word, 0 otherwise:

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                1             0          0       0        1
Brutus              1                1             0          1       0        0
Caesar              1                1             0          1       1        1
Calpurnia           0                1             0          0       0        0
Cleopatra           1                0             0          0       0        0
mercy               1                0             1          1       1        1
worser              1                0             1          1       1        0

Space is 500Gb !
Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 2 3 5 8 13 21 34
Caesar    → 13 16
We can still do better: typically 30–50% of the original text
1. We have 10⁹ total terms ⇒ at least 12GB space
2. Compressing the 6GB of documents gets 1.5GB of data
Better index, but yet it is >10 times the (compressed) text !!!!
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO, they are 2ⁿ, but the shorter strings are fewer:
Σ_{i=1}^{n-1} 2^i = 2ⁿ - 2
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probabilities p(s), the self-information of s is:
i(s) = log₂ (1/p(s)) = -log₂ p(s)
Lower probability ⇒ higher information
Entropy is the weighted average of i(s):
H(S) = Σ_{s∈S} p(s) · log₂ (1/p(s))  bits
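For instance (a two-line sketch):

```python
import math

def entropy(probs):
    """H(S) = sum over s of p(s) * log2(1/p(s)), in bits per symbol."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)
```

A uniform 4-symbol source has H = 2 bits; the skewed source p = (.1, .2, .2, .5) used later in the Huffman example has H ≈ 1.76 bits.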
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ? (It parses both as 101·1 = "ca" and as 1·011 = "ad".)
A uniquely decodable code can always be uniquely decomposed into its codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: the binary trie for a=0, b=100, c=101, d=11 — left edges labeled 0, right edges 1, symbols at the leaves]
Average Length
For a code C with codeword lengths L[s], the average length is defined as
La(C) = Σ_{s∈S} p(s) · L[s]
We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C')
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…
Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn} then pi < pj ⇒ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that La(C) ≤ H(S) + 1
(the Shannon code: symbol s takes ⌈log₂ 1/p(s)⌉ bits)
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
[Figure: the Huffman tree — merge a(.1) and b(.2) into (.3); merge (.3) and c(.2) into (.5); merge (.5) and d(.5) into (1)]
a=000, b=001, c=01, d=1
There are 2^(n-1) "equivalent" Huffman trees (flip the 0/1 labels at any internal node)
What about ties (and thus, tree depth) ?
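The greedy construction can be sketched as follows (the exact codewords depend on tie-breaking and 0/1 flips, so they may differ from the slide's, but the lengths match):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Repeatedly merge the two least probable nodes.
    Returns {symbol: codeword}."""
    tick = count()                 # tie-breaker so the heap never compares dicts
    heap = [(p, next(tick), {s: ''}) for s, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c0.items()}
        merged.update({s: '1' + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tick), merged))
    return heap[0][2]
```

For the running example the codeword lengths come out as 1, 2, 3, 3 and the average length is 1.8 bits, against an entropy of ≈ 1.76.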
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc... → 00000101
101001... → dcb...
[Figure: the same Huffman tree, traversed root-to-leaf for encoding and decoding]
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store for every level L:
firstcode[L] (codewords of a level are consecutive; the first is 00…0 suitably extended)
Symbol[L,i], for each i in level L
This is ≤ h² + |S| log |S| bits
Canonical Huffman
Encoding
[Figure: codeword assignment level by level, 1–5]
Canonical Huffman
Decoding
firstcode[1]=2, firstcode[2]=1, firstcode[3]=1, firstcode[4]=2, firstcode[5]=0
T=...00010...
Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
-log₂(.999) ≈ .00144 bits
If we were to send 1000 such symbols we might hope to use 1000 · .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
1 extra bit per macro-symbol = 1/k extra bits per symbol
Larger model to be transmitted
Shannon took infinite sequences, and k → ∞ !!
In practice, we have:
Model takes |S|^k entries (k · log |S| bits each) + h² (where h might be |S|)
It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
[Figure: word-based tagged Huffman for T = "bzip or not bzip" — each codeword is a sequence of 7-bit chunks, one chunk per byte; the spare bit of each byte tags whether the byte starts a codeword. The compressed text C(T) is the byte-aligned concatenation of the codewords of "bzip", "or", "not" and the separators]
CGrep and other ideas...
P = bzip = 1a 0b
[Figure: GREP runs directly on the compressed text C(T) of T = "bzip or not bzip": the codeword of P is matched byte-by-byte against C(T), and the tag bits rule out false matches at non-codeword boundaries (yes/no at each candidate)]
Speed ≈ Compression ratio
You find this at ... under my Software projects
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary: {a, bzip, not, or, space}
P = bzip = 1a 0b
S = "bzip or not bzip"
[Figure: scan the compressed text C(S), comparing the codeword of P against each tagged codeword (yes/no at each position)]
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: pattern P sliding over text T]
Naïve solution
For any position i of T, check if T[i,i+m-1] = P[1,m]
Complexity: O(nm) time
(Classical) optimal solutions based on comparisons:
Knuth-Morris-Pratt, Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods:
The random fingerprint method due to Karp and Rabin
The Shift-And method due to Baeza-Yates and Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}ⁿ).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:
H(s) = Σ_{i=1}^{m} 2^{m-i} · s[i]
P = 0101 ⇒ H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s' if and only if H(s) = H(s')
Definition: let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T = 10110101, P = 0101, H(P) = 5
H(T2) = H(0110) = 6 ≠ H(P)
H(T5) = H(0101) = 5 = H(P) ⇒ Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(T_{r-1}):
H(Tr) = 2·H(T_{r-1}) - 2^m·T[r-1] + T[r+m-1]
T = 10110101
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = 2·11 - 2⁴·1 + 0 = 22 - 16 = 6 = H(0110) ✓
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T: compute H(Tr) from H(T_{r-1}) in constant time, and make the comparisons (i.e., H(P) = H(Tr)).
Total running time O(n+m)?
NO! Why?
The problem is that when m is large, it is unreasonable to assume that each arithmetic operation can be done in O(1) time: values of H() are m-bit-long numbers, in general too BIG to fit in a machine word.
IDEA! Let's use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s is defined by Hq(s) = H(s) (mod q)
An example
P = 101111, q = 7
H(P) = 47, Hq(P) = 47 mod 7 = 5
Hq(P) can be computed incrementally!
1·2 + 0 (mod 7) = 2
2·2 + 1 (mod 7) = 5
5·2 + 1 (mod 7) = 4
4·2 + 1 (mod 7) = 2
2·2 + 1 (mod 7) = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(T_{r-1}):
2^m (mod q) = 2·(2^{m-1} (mod q)) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
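A runnable sketch of the deterministic variant (verify on a fingerprint hit, so no false match is ever reported; base and function names are mine):

```python
def rabin_karp(T, P, q=2**61 - 1):
    """Roll the fingerprint Hq(Tr) across T and compare with Hq(P);
    on a hit, check the substring to rule out false matches."""
    n, m, b = len(T), len(P), 256
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):
        hp = (hp * b + ord(P[i])) % q
        ht = (ht * b + ord(T[i])) % q
    top = pow(b, m - 1, q)            # b^(m-1) mod q, to drop T[r]
    hits = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:
            hits.append(r)
        if r + m < n:                  # Hq(T_{r+1}) from Hq(T_r)
            ht = ((ht - ord(T[r]) * top) * b + ord(T[r + m])) % q
    return hits
```

On the slide's example T = 10110101 and P = 0101 it reports the single (0-based) position 4, i.e. T5 in the slide's 1-based notation.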
Problem 1: Solution
Dictionary: {a, bzip, not, or, space}
P = bzip = 1a 0b
S = "bzip or not bzip"
[Figure: as before, the matching is run directly over the compressed text C(S) (yes at the two occurrences of "bzip")]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m-by-n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j,
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1 … j]
Example: T = california and P = for
[Figure: the 3×10 matrix M — M(1,5) = 1, M(2,6) = 1, M(3,7) = 1 mark the occurrence of "for" ending at position 7; all other entries are 0]
How does M solve the exact match problem?
How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one:
machines can perform bit and arithmetic operations between two words in constant time.
Examples:
And(A,B) is the bit-wise AND between A and B.
BitShift(A) is the value derived by shifting A's bits down by one and setting the first bit to 1:
BitShift((0,1,1,0)ᵀ) = (1,0,1,1)ᵀ
Let w be the word size (e.g., 32 or 64 bits). We'll assume m ≤ w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x of the alphabet: U(x) is 1 in the positions where x appears in P.
Example: P = abaac
U(a) = (1,0,1,1,0)ᵀ   U(b) = (0,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ
How to construct M
Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained as
M(j) = BitShift(M(j-1)) & U(T[j])
For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T ending at character j-1 ⇔ M(i-1,j-1) = 1
(2) P[i] = T[j] ⇔ the i-th bit of U(T[j]) = 1
BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing with the i-th bit of U(T[j]) establishes whether both conditions hold.
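Keeping each column M(j) as a single integer bitmask, the method is only a few lines (a sketch under the m ≤ w assumption; names are mine):

```python
def shift_and(T, P):
    """Shift-And exact matching: bit i-1 of M is set iff
    P[1..i] matches T ending at the current position."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        # BitShift: shift down and set the first bit to 1, then AND U(T[j])
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & (1 << (m - 1)):
            occ.append(j - m + 1)   # occurrence ending at j (0-based start)
    return occ
```

On T = xabxabaaca and P = abaac it reports the start position 4, i.e. the occurrence ending at column 9 of M in the example that follows.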
An example, j=1 (T = xabxabaaca, P = abaac):
M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵀ & U(x) = (0,0,0,0,0)ᵀ
An example, j=2:
M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵀ & U(a) = (1,0,0,0,0)ᵀ
An example, j=3:
M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵀ & U(b) = (0,1,0,0,0)ᵀ
An example, j=9:
M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵀ & U(c) = (0,0,0,0,1)ᵀ
The 5th bit is set: an occurrence of P = abaac ends at position 9.
Shift-And method: Complexity
If m ≤ w, any column and any vector U() fit in a memory word: each step requires O(1) time.
If m > w, any column and vector U() is divided into ⌈m/w⌉ memory words: each step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when the pattern length is close to the word size — very often in practice; recall that w = 64 bits in modern architectures.
Some simple extensions
We want to allow the pattern to contain special symbols, like the character class [a-f].
Example: P = [a-b]baac
U(a) = (1,0,1,1,0)ᵀ   U(b) = (1,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ
What about '?', '[^…]' (negation)?
Problem 1: Another solution
Dictionary: {a, bzip, not, or, space}
P = bzip = 1a 0b
S = "bzip or not bzip"
[Figure: run Shift-And over the compressed text C(S), testing the codeword of P at each tagged byte position]
Speed ≈ Compression ratio
Problem 2
Dictionary: {a, bzip, not, or}
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.
P = o
S = "bzip or not bzip"
The terms containing P are:
not = 1g 0g 0a
or  = 1g 0a 0b
[Figure: each such term's codeword is searched in the compressed text C(S)]
Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: text T with occurrences of patterns P1 and P2]
Naïve solution
Use an (optimal) exact-matching algorithm to search for each pattern of P
Complexity: O(nl + m) time — not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns in P
R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method searching for S:
For any symbol c, U'(c) = U(c) AND R, i.e.
U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
For any step j:
compute M(j), then M(j) OR U'(T[j]). Why?
It sets to 1 the first bit of each pattern that starts with T[j]
Check if there are occurrences ending in j. How?
Problem 3
Dictionary: {a, bzip, not, or}
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches
P = bot, k = 2
S = "bzip or not bzip"
[Figure: the candidate terms' codewords are searched in the compressed text C(S)]
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa
aatatccacaa
atcgaa
Agrep
Our current goal: given k, find all the occurrences of P in T with up to k mismatches
We define the matrix Mˡ to be an m-by-n binary matrix such that:
Mˡ(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.
What is M⁰?
How does Mᵏ solve the k-mismatch problem?
Computing Mᵏ
We compute Mˡ for all l = 0, …, k.
For each j compute M⁰(j), M¹(j), …, Mᵏ(j)
For all l, initialize Mˡ(0) to the zero vector.
In order to compute Mˡ(j), we observe that there is a match iff one of two cases holds.
Computing Mˡ: case 1
The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:
BitShift(Mˡ(j-1)) & U(T[j])
Computing Mˡ: case 2
The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (position j is allowed to mismatch):
BitShift(Mˡ⁻¹(j-1))
Computing Mˡ
Putting the two cases together:
Mˡ(j) = [BitShift(Mˡ(j-1)) & U(T[j])] OR BitShift(Mˡ⁻¹(j-1))
Example (T = xabxabaaca, P = abaad):

M⁰ =      j: 1 2 3 4 5 6 7 8 9 10
    i=1:     0 1 0 0 1 0 1 1 0 1
    i=2:     0 0 1 0 0 1 0 0 0 0
    i=3:     0 0 0 0 0 0 1 0 0 0
    i=4:     0 0 0 0 0 0 0 1 0 0
    i=5:     0 0 0 0 0 0 0 0 0 0

M¹ =      j: 1 2 3 4 5 6 7 8 9 10
    i=1:     1 1 1 1 1 1 1 1 1 1
    i=2:     0 0 1 0 0 1 0 1 1 0
    i=3:     0 0 0 1 0 0 1 0 0 1
    i=4:     0 0 0 0 1 0 0 1 0 0
    i=5:     0 0 0 0 0 0 0 0 1 0

M¹(5,9) = 1: P occurs with ≤ 1 mismatch ending at position 9.
How much do we pay?
The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of the Mˡ's are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.
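A sketch of the k-mismatch recurrence with integer bitmasks (only the current columns are kept, as noted above; names are mine):

```python
def shift_and_k_mismatches(T, P, k):
    """M[l] holds the current column for <= l mismatches:
    M[l](j) = (BitShift(M[l](j-1)) & U(T[j])) | BitShift(M[l-1](j-1))."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)
    occ = []
    for j, c in enumerate(T):
        new = []
        prev_shift = 0                  # BitShift(M[l-1](j-1)); none for l = 0
        for l in range(k + 1):
            shifted = (M[l] << 1) | 1   # BitShift sets the first bit
            new.append((shifted & U.get(c, 0)) | prev_shift)
            prev_shift = shifted
        M = new
        if M[k] & (1 << (m - 1)):
            occ.append(j - m + 1)       # occurrence with <= k mismatches ends at j
    return occ
```

On the example above (T = xabxabaaca, P = abaad, k = 1) it reports the start position 4, matching M¹(5,9) = 1; with k = 0 it reduces to plain Shift-And.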
Problem 3: Solution
Dictionary: {a, bzip, not, or}
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches
P = bot, k = 2
S = "bzip or not bzip"
not = 1g 0g 0a
[Figure: "not" matches P = bot within 2 mismatches; its codeword is then searched in C(S)]
Agrep: more sophisticated operations
The Shift-And method can solve other ops
The edit distance between two strings p and s is d(p,s) = minimum number of operations needed to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol of p into a different one
Example: d(ananas,banane) = 3
Search by regular expressions
Example: (a|b)?(abc|a)
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
γ-code for integer encoding
γ(x) = 0^(L-1) · (x in binary), where x > 0 and L = ⌊log₂ x⌋ + 1 is its binary length
e.g., 9 is represented as <000,1001>.
The γ-code for x takes 2⌊log₂ x⌋ + 1 bits
(i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of γ-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
→ 8, 6, 3, 59, 7
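Encoder and decoder in a few lines (a sketch):

```python
def gamma_encode(x):
    """Elias gamma code: (L-1) zeros, then x in binary; requires x > 0."""
    b = bin(x)[2:]
    return '0' * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into the integers."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == '0':       # count the leading zeros
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out
```

gamma_encode(9) gives '0001001', and decoding the exercise string above indeed yields 8, 6, 3, 59, 7.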
Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2 log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 · H0(s) + 1
Key fact: 1 ≥ Σ_{i=1,...,x} pi ≥ x · px ⇒ x ≤ 1/px
How good is it ?
The cost of the encoding is (recall i ≤ 1/pi):
Σ_{i=1,…,|S|} pi · |γ(i)| ≤ Σ_{i=1,…,|S|} pi · [2 log(1/pi) + 1] = 2 · H0(X) + 1
Not much worse than Huffman, and improvable to H0(X) + 2 + …
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7 bits: just those of Huffman
End-tagged dense code
The rank r is mapped to the r-th binary sequence on 7·k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix code
Better compression: it uses all the 7-bit configurations
(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2
A new concept: continuers vs stoppers
The main idea is:
Previously we used s = c = 128
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte,
s·c with 2 bytes, s·c² with 3 bytes, ...
An example
5000 distinct words
ETDC encodes 128 + 128² = 16512 words within 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 within 2 bytes, hence more words on 1 byte — better if the distribution is skewed...
Optimal (s,c)-dense codes
Find the optimal s, with c = 256 - s:
Brute-force approach, or
Binary search: on real distributions, there seems to be a unique minimum
(Ks = max codeword length; F_s^k = cumulative prob. of the symbols whose |cw| ≤ k)
Experiments: (s,c)-DC quite interesting…
Search is 6% faster than byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded
Start with the list of symbols L = [a,b,c,d,…]
For each input symbol s:
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
It exploits temporal locality, and it is dynamic
X = 1ⁿ 2ⁿ 3ⁿ … nⁿ ⇒ Huff = O(n² log n), MTF = O(n log n) + n²
Not much worse than Huffman... but it may be far better
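The MTF transform itself is tiny (a sketch; the output integers would then be fed to a variable-length integer coder such as γ):

```python
def mtf_encode(text, alphabet):
    """Output the 1-based position of each symbol in list L,
    then move that symbol to the front of L."""
    L, out = list(alphabet), []
    for c in text:
        i = L.index(c)
        out.append(i + 1)
        L.insert(0, L.pop(i))
    return out

def mtf_decode(codes, alphabet):
    """Replay the same list moves to invert the transform."""
    L, out = list(alphabet), []
    for i in codes:
        c = L.pop(i - 1)
        out.append(c)
        L.insert(0, c)
    return ''.join(out)
```

A run-heavy input like "aaabbbbccc" becomes 1 1 1 2 1 1 1 3 1 1: after the first occurrence of a symbol, repeats cost the smallest integer.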
MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 log i + 1
Put S at the front and consider the cost of encoding the gaps between consecutive occurrences p_x^i of each symbol x:
O(|S| log |S|) + Σ_{x=1,…,|S|} Σ_i |γ(p_x^i - p_x^{i-1})|
By Jensen's inequality:
≤ O(|S| log |S|) + Σ_x n_x · [2 log(N/n_x) + 1]
= O(|S| log |S|) + N · [2 H0(X) + 1]
⇒ La[mtf] ≤ 2 H0(X) + O(1)
MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded
How to keep the MTF-list efficiently:
Search tree: leaves contain the symbols, ordered as in the MTF-list; nodes contain the size of their descending subtree
Hash table: the key is a symbol, the data is a pointer to the corresponding tree leaf
Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings: just the run lengths and one initial bit
Properties:
It exploits spatial locality, and it is a dynamic code
There is a memory
X = 1ⁿ 2ⁿ 3ⁿ … nⁿ ⇒ Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
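RLE in one line each way (a sketch):

```python
from itertools import groupby

def rle_encode(s):
    """Collapse each run of equal symbols into (symbol, run length)."""
    return [(c, len(list(g))) for c, g in groupby(s)]

def rle_decode(pairs):
    return ''.join(c * n for c, n in pairs)
```

On the slide's example: rle_encode("abbbaacccca") gives [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)].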
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time-costly than Huffman, but the integer implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval in the range from 0 (inclusive) to 1 (exclusive), of width p(s) and starting at the cumulative probability
f(i) = Σ_{j=1}^{i-1} p(j)
e.g. with p(a) = .2, p(b) = .5, p(c) = .3:
f(a) = .0, f(b) = .2, f(c) = .7, i.e. a = [0,.2), b = [.2,.7), c = [.7,1.0)
The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0,1); after b: [.2,.7); after a: [.2,.3); after c: [.27,.3)
The final sequence interval is [.27,.3)
Arithmetic Coding
To code a sequence of symbols c_i with probabilities p[c], use the following:
l_0 = 0,  l_i = l_{i-1} + s_{i-1} · f[c_i]
s_0 = 1,  s_i = s_{i-1} · p[c_i]
f[c] is the cumulative prob. up to symbol c (not included)
The final interval size is s_n = Π_{i=1}^{n} p[c_i]
The interval for a message sequence will be called the sequence interval
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
.49 ∈ [.2,.7) → b, rescale: (.49-.2)/.5 = .58
.58 ∈ [.2,.7) → b, rescale: (.58-.2)/.5 = .76
.76 ∈ [.7,1.0) → c
The message is bbc.
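The recurrences and the decoder can be checked directly (a sketch; Python floats stand in for the exact arithmetic a real coder would use):

```python
def cum_freq(probs):
    """f[c] = cumulative probability of the symbols before c."""
    f, acc = {}, 0.0
    for c in sorted(probs):
        f[c] = acc
        acc += probs[c]
    return f

def arith_encode(msg, probs):
    """l_i = l_{i-1} + s_{i-1} * f[c_i];  s_i = s_{i-1} * p[c_i]."""
    f, l, s = cum_freq(probs), 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * probs[c]
    return l, s                        # sequence interval [l, l+s)

def arith_decode(x, n, probs):
    """Recover the n-symbol message whose sequence interval contains x."""
    f, out = cum_freq(probs), []
    for _ in range(n):
        c = max((c for c in probs if f[c] <= x), key=f.get)
        out.append(c)
        x = (x - f[c]) / probs[c]      # rescale x into the symbol interval
    return ''.join(out)
```

With p(a)=.2, p(b)=.5, p(c)=.3, encoding "bac" yields the interval [.27,.3) and decoding .49 over 3 symbols yields "bbc", as in the two slides above.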
Representing a real number
Binary fractional representation:
.75 = .11
1/3 = .0101…
11/16 = .1011
Algorithm (emit the binary expansion of x ∈ [0,1)):
1. x = 2·x
2. If x < 1, output 0
3. else x = x - 1; output 1
So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) → .01   [.33,.66) → .1   [.66,1) → .11
Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:
.11  → min .110…, max .111… → interval [.75, 1.0)
.101 → min .1010…, max .1011… → interval [.625, .75)
We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).
e.g. sequence interval [.61, .79) contains the code interval of .101, i.e. [.625, .75)
Can use L + s/2 truncated to 1 + ⌈log(1/s)⌉ bits
Bound on Arithmetic length
Note that -log s + 1 = log(2/s)
Bound on Length
Theorem: For a text of length n, the arithmetic encoder generates at most
1 + log(1/s) = 1 + log Π_i (1/p_i)
≤ 2 + Σ_{i=1,…,n} log(1/p_i)
= 2 + Σ_{k=1,…,|S|} n·p_k · log(1/p_k)
= 2 + n·H0 bits
(nH0 + 0.02·n bits in practice, because of rounding)
Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
Keep integers in the range [0..R), where R = 2^k
Use rounding to generate integer intervals
Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2
Integer arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 (top half): output 1 followed by m 0s; m = 0; the interval is expanded by 2
If u < R/2 (bottom half): output 0 followed by m 1s; m = 0; the interval is expanded by 2
If l ≥ R/4 and u < 3R/4 (middle half): increment m; the interval is expanded by 2
In all other cases, just continue...
You find this at ...
Arithmetic ToolBox
[Figure: the arithmetic coder as a state machine — from state (L,s), feeding symbol c under the distribution (p1,…,pS) moves the ATB to state (L',s'), the sub-interval of (L,s) assigned to c]
Therefore, even the distribution can change over time
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
[Figure: PPM feeds the ATB: for each emitted symbol s = c or esc, the conditional probability p[ s|context ] drives the interval update (L,s) → (L',s').]
Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)
PPM: Example Contexts (k = 2)
String = ACCBACCACBA, next char to encode: B

Context ∅ (order 0):  A = 4   B = 2   C = 5   $ = 3

Order-1 contexts:
  A:  C = 3   $ = 1
  B:  A = 2   $ = 1
  C:  A = 1   B = 2   C = 2   $ = 3

Order-2 contexts:
  AC:  B = 1   C = 2   $ = 2
  BA:  C = 1   $ = 1
  CA:  C = 1   $ = 1
  CB:  A = 2   $ = 1
  CC:  A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)
<2,3,c>
Algorithm’s step:
Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match
Advance by len + 1
A buffer “window” has fixed length and moves
Example: LZ77 with window
a a c a a c a b c a b a a a c
(0,0,a)
a a c a a c a b c a b a a a c
(1,1,c)
a a c a a c a b c a b a a a c
(3,4,b)
a a c a a c a b c a b a a a c
(3,3,a)
a a c a a c a b c a b a a a c
(1,2,c)
Window size = 6
Longest match
within W
Next character
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if len > d? (overlap with text to be compressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor
for (i = 0; i < len; i++)
  out[cursor + i] = out[cursor - d + i];
Output is correct: abcdcdcdcdcdce
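A minimal decoder sketch showing why the byte-by-byte copy handles the overlapping case (function name and triple layout are illustrative):

```python
def lz77_decode(triples, seed=""):
    """Decode LZ77 triples (d, len, c); copying one char at a time
    handles the overlapping case len > d correctly."""
    out = list(seed)
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):
            out.append(out[start + i])  # may read chars appended in this very loop
        out.append(c)
    return "".join(out)

# the overlap example above: seen = "abcd", next codeword (2,9,e)
assert lz77_decode([(2, 9, "e")], seed="abcd") == "abcdcdcdcdcdce"
```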
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
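The coding loop can be sketched as follows (a dict stands in for the trie; the sketch assumes the input ends exactly at a phrase boundary, otherwise the trailing phrase needs a special final pair):

```python
def lz78_encode(s):
    """LZ78: emit (id of longest dictionary match S, next char c);
    id 0 denotes the empty phrase."""
    ids = {}            # phrase -> id (a trie in real implementations)
    out, w = [], ""
    for c in s:
        if w + c in ids:
            w += c                       # extend the current match
        else:
            out.append((ids.get(w, 0), c))
            ids[w + c] = len(ids) + 1    # add the substring Sc to the dictionary
            w = ""
    return out

# the coding example of the next slide: phrases a|ab|aa|c|abc|abcb
codes = lz78_encode("aabaacabcabcb")
```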
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
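A decoder sketch showing how the special case is handled (the symbol ids follow the slide's convention a = 112, b = 113, c = 114; names are illustrative):

```python
def lzw_decode(codes, base):
    """LZW decoder; `base` maps the initial symbols to their ids.
    The decoder is one step behind: when it receives a not-yet-defined id
    (the special case above), the phrase must be prev + prev[0]."""
    inv = {i: s for s, i in base.items()}   # id -> phrase
    next_id = 256
    prev = inv[codes[0]]
    out = [prev]
    for code in codes[1:]:
        cur = inv[code] if code in inv else prev + prev[0]  # special case
        inv[next_id] = prev + cur[0]        # the entry the encoder just added
        next_id += 1
        out.append(cur)
        prev = cur
    return "".join(out)

# the example of the next slides: code 261 arrives before it is defined
text = lzw_decode([112, 112, 113, 256, 114, 257, 261, 114],
                  {"a": 112, "b": 113, "c": 114})
```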
LZW: Encoding Example (a = 112, b = 113, c = 114)
Input: a a b a a c a b a b a c b

Match   Output   New dict entry
a       112      256 = aa
a       112      257 = ab
b       113      258 = ba
aa      256      259 = aac
c       114      260 = ca
ab      257      261 = aba
aba     261      262 = abac
c       114      263 = cb
LZW: Decoding Example

Input   Output so far             New dict entry
112     a
112     a a                       256 = aa
113     a a b                     257 = ab
256     a a b a a                 258 = ba
114     a a b a a c               259 = aac
257     a a b a a c a b           260 = ca
261     a a b a a c a b a b a     261 = aba (code 261 not yet in dict: deduced one step later as prev + prev[0])
114     a a b a a c a b a b a c   262 = abac
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform (1994)
Given a text T = mississippi#

Rotations of T:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows:
F                L
# mississipp  i
i #mississip  p
i ppi#missis  s
i ssippi#mis  s
i ssissippi#  m
m ississippi  #
p i#mississi  p
p pi#mississ  i
s ippi#missi  s
s issippi#mi  s
s sippi#miss  i
s sissippi#m  i

The last column L is the BWT of T.
A famous example
Much
longer...
A useful tool: the L → F mapping
[The slide shows again the sorted BWT matrix with its columns F and L.]
How do we map L's chars onto F's chars ?
... Need to distinguish equal chars in F...
Take two equal L's chars
Rotate rightward their rows
Same relative order !!
The BWT is invertible
[Again the sorted BWT matrix with its columns F and L.]
Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i ppi #
InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
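A runnable version of InvertBWT (a sketch assuming a unique end-marker '#' that is lexicographically smallest, so that row 0 of the sorted matrix starts with it):

```python
from collections import Counter

def inverse_bwt(L):
    """Invert the BWT via the LF-mapping: equal chars keep the same
    relative order in L and in F."""
    n = len(L)
    seen = Counter()
    rank = []                      # rank[i] = # earlier occurrences of L[i] in L
    for c in L:
        rank.append(seen[c])
        seen[c] += 1
    first, tot = {}, 0             # first[c] = position in F of the first c
    for c in sorted(seen):
        first[c] = tot
        tot += seen[c]
    LF = [first[L[i]] + rank[i] for i in range(n)]
    out, r = [], 0                 # row 0 starts with '#'
    for _ in range(n):
        out.append(L[r])           # L[r] precedes F[r] in T
        r = LF[r]
    out.reverse()                  # out is T rotated so that '#' comes first
    return "".join(out[1:]) + out[0]

assert inverse_bwt("ipssm#pissii") == "mississippi#"
```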
How to compute the BWT ?

SA    BWT matrix row   L
12    #mississippi     i
11    i#mississipp     p
8     ippi#mississ     s
5     issippi#miss     s
2     ississippi#m     m
1     mississippi#     #
10    pi#mississip     p
9     ppi#mississi     i
7     sippi#missis     s
4     sissippi#mis     s
6     ssippi#missi     i
3     ssissippi#mi     i
We said that: L[i] precedes F[i] in T
L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
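The construction above can be sketched directly (0-indexed Python; the naive sort is the "elegant but inefficient" approach of the next slide):

```python
def bwt_via_sa(T):
    """Build the BWT from the suffix array: L[i] = T[SA[i]-1],
    with wraparound for the suffix starting at the first position."""
    n = len(T)
    sa = sorted(range(n), key=lambda i: T[i:])   # naive SA construction
    L = "".join(T[(i - 1) % n] for i in sa)
    return sa, L

sa, L = bwt_via_sa("mississippi#")
# in 1-indexed form this is the SA of the slide: [12,11,8,5,2,1,10,9,7,4,6,3]
```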
How to construct SA from T ?

SA    suffix
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi#
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#

Elegant but inefficient
Input: T = mississippi#
Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii
# at 16
Mtf = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code
RLE0 = 03141041403141410210
Alphabet
|S|+1
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
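The Move-to-Front step of the pipeline can be sketched as follows (a simplified version: real bzip2 restricts the list to the symbols actually occurring and adds the RLE0/bump tricks above):

```python
def mtf_encode(s, alphabet):
    """Move-to-Front: emit the current position of each char, then move it
    to the front; runs of equal chars become runs of 0s (good for RLE)."""
    lst = list(alphabet)
    out = []
    for c in s:
        i = lst.index(c)
        out.append(i)
        lst.insert(0, lst.pop(i))  # move the char to the front
    return out

# a locally homogeneous string (like L) yields many 0s:
ranks = mtf_encode("aaabbbaaacc", list("abc"))
```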
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node one can go to any other via an
undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node one can go to any other via a
directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
Largest artifact ever conceived by humankind
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph (undirected)
V = routers
E = communication links
The “cosine” graph (undirected, weighted)
V = static web pages
E = semantic distance between pages
Query-Log graph (bipartite, weighted)
V = queries and URLs
E = (q,u) if u is a result for q, and has been clicked by some
user who issued q
Social graph (undirected, unweighted)
V = users
E = (x,y) if x knows y (facebook, address book, email,..)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1
The In-degree distribution
Indegree follows a power law distribution:
Pr[ in-degree(u) = k ] ∝ 1/k^α,  α ≈ 2.1
(Altavista crawl, 1999; WebBase crawl, 2001)
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
21 million pages, 150 million links
URL-sorting
Berkeley
Stanford
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}
For negative entries (only the first gap s1 - x can be negative): map v ≥ 0 to 2v, and v < 0 to 2|v| - 1
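The gap computation can be sketched as (hypothetical helper, plain Python):

```python
def gap_encode(x, succs):
    """Gap-encode the successor list of node x, as in the slide:
    S(x) = {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}."""
    gaps = [succs[0] - x]                 # only this first gap can be negative
    for a, b in zip(succs, succs[1:]):
        gaps.append(b - a - 1)            # consecutive successors give gap 0
    return gaps

# locality makes the gaps small:
assert gap_encode(5, [7, 10, 11, 20]) == [2, 2, 0, 8]
```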
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.
Consecutivity
3
in extra-nodes
Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
data
knowledge
about data
at receiver
sender
receiver
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
What if the sender has never seen the data at the receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested
page and the ones available in cache
File synchronization
[diff, zdelta, REBL,…]
[rsynch, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
The LZ77-scheme provides an efficient, optimal solution
fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting from fnew
zdelta is one of the best implementations
            Emacs size   Emacs time
uncompr     27Mb         ---
gzip        8Mb          35 secs
zdelta      1.5Mb        42 secs
Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
[Figure: Client ↔ (slow link, delta-encoded pages) ↔ Proxy ↔ (fast link) ↔ Web; both proxies keep the reference version of each requested page.]
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f ∈ F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
[Figure: example weighted graph over the files, plus a dummy node connected to all; the min branching picks the cheapest reference for each file.]
            space   time
uncompr     30Mb    ---
tgz         20%     linear
THIS        8%      quadratic
Improvement
What about
many-to-one compression?
(group of files)
Problem: Constructing G is very costly, n^2 edge calculations (zdelta exec)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files
Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strictly Θ(n^2) time
            space   time
uncompr     260Mb   ---
tgz         12%     2 mins
THIS        8%      16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
[Figure: the Client holds f_old, the Server holds f_new; the client sends a request and receives an update.]
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
[Figure: the Client sends the block hashes of f_old; the Server replies with the encoded file built from f_new.]
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
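A sketch of the weak rolling-checksum idea (an Adler-style pair of 16-bit sums; the constants and names are illustrative, not rsync's exact definition):

```python
M = 1 << 16   # modulus of the two 16-bit components

def weak_checksum(block):
    """An rsync-style weak checksum (a, b) over a block of bytes."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, x_out, x_in, blen):
    """Slide the window by one byte: drop x_out, add x_in, in O(1) time."""
    a2 = (a - x_out + x_in) % M
    b2 = (b - blen * x_out + a2) % M
    return a2, b2

data = b"abcdefgh"
a, b = weak_checksum(data[0:4])
assert roll(a, b, data[0], data[4], 4) == weak_checksum(data[1:5])
```

The O(1) update is what lets the server test the checksum at every byte offset of f_new against the client's block hashes.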
Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452

Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
The server sends the hashes (unlike rsync, where the client does), and the client checks them
The server deploys the common fref to compress the new ftar (rsync compresses just it).
A multi-round protocol
k blocks of n/k elems
Log n/k levels
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k log n log(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (ie. T[i,N])
Occurrences of P in T = All suffixes of T having P as a prefix
Example: P = si occurs in T = mississippi at positions 4 and 7
SUF(T) = Sorted set of suffixes of T
Reduction
From substring search
To prefix search
The Suffix Tree
T# = mississippi#
[Figure: the suffix tree of T#, with edge labels such as “ssi”, “ppi#”, “pi#”, “mississippi#”, and leaves storing the starting positions 1..12 of the suffixes.]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N^2) space
SA    SUF(T)
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi#
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#

T = mississippi#
Each SA entry is a suffix pointer.
P = si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
T = mississippi#
P = si
Compare P with the middle suffix: here P is larger (2 accesses per step)
Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
T = mississippi#
P = si
Here P is smaller than the middle suffix
Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
overall, O(p log2 N) time
Improved to O(p + log2 N) [Manber-Myers, ’90], and to O(p + log2 |S|) [Cole et al, ’06]
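The two binary searches can be sketched as follows (0-indexed Python; each comparison costs O(p) chars, as above):

```python
def sa_range(T, sa, P):
    """Return the SA entries whose suffixes have prefix P,
    via two binary searches over the suffix array."""
    m = len(P)
    lo, hi = 0, len(sa)
    while lo < hi:                        # leftmost suffix with prefix >= P
        mid = (lo + hi) // 2
        if T[sa[mid]:sa[mid] + m] < P:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                        # first suffix with prefix > P
        mid = (lo + hi) // 2
        if T[sa[mid]:sa[mid] + m] <= P:
            lo = mid + 1
        else:
            hi = mid
    return sa[start:lo]

T = "mississippi#"
sa = sorted(range(len(T)), key=lambda i: T[i:])     # naive 0-indexed SA
assert sorted(sa_range(T, sa, "si")) == [3, 6]      # positions 4 and 7, 1-indexed
```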
Locating the occurrences
T = mississippi#, P = si
The suffixes prefixed by si are contiguous in SA: entries 7 (sippi...) and 4 (sissippi...), hence occ = 2 and the occurrences are {4, 7}.
The two range boundaries can be found by binary-searching si# and si$, where # < S < $.
Suffix Array search
• O (p + log2 N + occ) time
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree
[Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
Lcp = [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
SA  = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
T = mississippi#
E.g. the 4th entry is 4 = |lcp(issippi#, ississippi#)|, the suffixes starting at positions 5 and 2.
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
Paradigm shift...
Web 2.0 is about the many
Big DATA vs. Big PC ?
We have three types of algorithms:
T1(n) = n, T2(n) = n^2, T3(n) = 2^n
... and assume that 1 step = 1 time unit
How much input data n can each algorithm process
within t time units?
n1 = t,
n2 = √t,
n3 = log2 t
What about a k-times faster processor?
...or, what is n, when the time units are k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
A new scenario
Data are more available than ever before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Step cost is Θ(1) time
Not just MIN #steps…
[Figure: the memory hierarchy]
CPU registers / L1 / L2 cache: few Mbs, some nanosecs, few words fetched
RAM: few Gbs, tens of nanosecs, some words fetched
HD: few Tbs, few millisecs, B = 32K page
net: many Tbs, even secs, packets
You should be “??-aware programmers”
I/O-conscious Algorithms
[Figure: a magnetic disk: tracks on the magnetic surface, read/write heads and arms.]
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O
[10^5–10^6 steps (Hennessy-Patterson)]
If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈
30 * f/(1+f)
Space-conscious Algorithms
I/Os
search
access
Compressed
data structures
Streaming Algorithms
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
[Figure: the same memory hierarchy — registers / caches / RAM / HD / net — with very different block sizes and access times at each level.]
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock and its daily performance over time,
find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
n      4K    8K   16K   32K    128K   256K   512K   1M
n^3    22s   3m   26m   3.5h   28h    --     --     --
n^2    0     0    0     1s     26s    106s   7m     28m
An optimal solution
We assume every subsum≠0
A=
<0
>0
Optimum
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
sum = 0; max = -1;
For i = 1,...,n do
  If (sum + A[i] ≤ 0) then sum = 0;
  else { sum += A[i]; max = MAX{max, sum}; }
Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
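The slide's linear scan, written out (a sketch; it returns the best sum, and assumes every subsum ≠ 0 as stated above):

```python
def max_subarray(A):
    """Reset the running sum when it would drop to <= 0;
    otherwise extend it and track the maximum."""
    best, cur = -1, 0
    for x in A:
        if cur + x <= 0:
            cur = 0              # Sum < 0 when OPT starts: restart here
        else:
            cur += x             # Sum > 0 within OPT: keep extending
            best = max(best, cur)
    return best

# the array of the slide: the optimum is 6+1-2+4+3 = 12
assert max_subarray([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]) == 12
```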
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort: Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions Data get distributed arbitrarily !!!
B-tree internal nodes
B-tree leaves
(“tuple pointers")
What about
listing tuples
in order ?
Tuples
Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;           // Divide
    Merge-Sort(A,i,m);     // Conquer
    Merge-Sort(A,m+1,j);
    Merge(A,i,m,j)         // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10^9 tuples, a few Gbs
Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log2 n) random I/Os
[5ms] * n log2 n ≈ 1.5 years
In practice, it is faster because of caching...
2 passes (R/W)
Merge-Sort Recursion Tree
[Figure: the recursion tree of binary mergesort over N items, with log2 N levels; runs are merged pairwise level by level.]
If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help
How do we deploy the disk/mem features ?
With internal memory M: N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
Pass 1: Produce (N/M) sorted runs.
Pass i: merge X = M/B runs → log_{M/B} (N/M) passes
[Figure: X input buffers and one output buffer, each of B items, in main memory; runs are streamed from disk, merged, and written back to disk.]
Multiway Merging
[Figure: X = M/B runs, each with a current page buffer Bf1..Bfx and a pointer p1..pX; repeatedly move min(Bf1[p1], Bf2[p2], …, Bfx[pX]) to the output buffer Bfo; fetch the next page of run i when pi = B; flush Bfo to the merged run when full; stop at EOF.]
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os
In practice
M/B ≈ 1000 → #passes = log_{M/B} (N/M) ≈ 1
One multiway merge → 2 passes = few mins
Tuning depends
on disk features
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
Can compression help?
Goal: enlarge M and reduce N
#passes = O(logM/B N/M)
Cost of a pass = O(N/B)
Part of Vitter’s paper…
In order to address issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how much we can go
Toy problem #3: Top-freq elements
Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)
A=b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables <X,C>
For each item s of the stream,
  if (X == s) then C++
  else { C--; if (C ≤ 0) { X = s; C = 1; } }
Return X;
Proof
If X ≠ y at the end, then every one of y’s
occurrences has a “negative” mate (an item that cancelled it).
Hence these mates are ≥ #occ(y), so N ≥ 2 * #occ(y) > N: a contradiction.
Problems if #occ(y) ≤ N/2: the returned X is then not guaranteed to be the mode.
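The algorithm above can be sketched as (a variant of the Boyer-Moore majority vote; with no majority item the answer is unreliable, as noted):

```python
def stream_majority(stream):
    """One <X,C> pair of variables; if an item occurs > N/2 times,
    it is the value returned."""
    X, C = None, 0
    for s in stream:
        if s == X:
            C += 1
        else:
            C -= 1
            if C <= 0:           # counter exhausted: adopt the current item
                X, C = s, 1
    return X

# the stream of the slide, where c occurs 9 times out of 17:
assert stream_majority("bacccdcbaaaccbccc") == "c"
```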
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 * 10^9 → size = 6Gb
n = 10^6 documents
TotT = 10^9 tokens (avg term length is 6 chars)
t = 5 * 10^5 distinct terms
What kind of data structure do we build to support
word-based searches ?
Solution 1: Term-Doc matrix (t = 500K terms, n = 1 million docs)

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1               1              0          0       0        1
Brutus             1               1              0          1       0        0
Caesar             1               1              0          1       1        1
Calpurnia          0               1              0          0       0        0
Cleopatra          1               0              0          0       0        0
mercy              1               0              1          1       1        1
worser             1               0              1          1       1        0

(1 if the play contains the word, 0 otherwise)
Space is 500Gb !
Solution 2: Inverted index

Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

We can still do better:
1. Typically the index takes 30–50% of the original text
2. We have 10^9 total terms → at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
Better index, but yet it is >10 times the text !!!!
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2^n, but the shorter strings are fewer:
∑_{i=1}^{n-1} 2^i = 2^n − 2
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log2 (1/p(s)) = −log2 p(s)
Lower probability → higher information
Entropy is the weighted average of i(s):
H(S) = ∑_{s∈S} p(s) log2 (1/p(s)) bits
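The definition translates directly into code (a small sketch over a probability vector):

```python
import math

def entropy(probs):
    """H(S) = sum over s of p(s) * log2(1/p(s)), in bits per symbol."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

assert entropy([0.25, 0.25, 0.25, 0.25]) == 2.0   # uniform over 4 symbols
assert entropy([1.0]) == 0.0                      # a certain symbol carries no information
```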
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be
uniquely decomposed into their codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:
[Figure: binary trie with leaf a at path 0, b at 100, c at 101, d at 11.]
Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = ∑_{s∈S} p(s) L[s]
We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same codeword lengths, and thus the
same optimal average length. And vice versa…
Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}
then pi < pj ⇒ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have
H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that
La(C) ≤ H(S) + 1
Shannon code
takes log 1/p bits
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Build the tree bottom-up, repeatedly merging the two least-probable nodes:
a(.1) + b(.2) → (.3);  (.3) + c(.2) → (.5);  (.5) + d(.5) → (1)
Labeling left edges with 0 and right edges with 1:
a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) “equivalent” Huffman trees
What about ties (and thus, tree depth) ?
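The bottom-up construction can be sketched with a heap (a sketch that returns only codeword lengths; the counter breaks ties, which is exactly where the "equivalent" trees come from):

```python
import heapq
import itertools

def huffman_lengths(probs):
    """Codeword lengths of a Huffman code: repeatedly merge the two
    least-probable trees."""
    tie = itertools.count()
    heap = [(p, next(tie), {s: 0}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: l + 1 for s, l in {**d1, **d2}.items()}  # one level deeper
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

L = huffman_lengths({"a": .1, "b": .2, "c": .2, "d": .5})
# matches the running example: a = 000, b = 001, c = 01, d = 1
```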
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Example: abc... → 00000101...
101001... → dcb...
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store for any level L:
firstcode[L] (the first codeword on level L; on the deepest level it is 00.....0)
Symbol[L,i], for each i in level L
This is ≤ h^2 + |S| log |S| bits
Canonical Huffman: Encoding
[Figure: a canonical Huffman tree with codeword levels 1..5.]
Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0
T=...00010...
Problem with Huffman Coding
Consider a symbol with probability .999. Its
self information is
−log2(.999) ≈ .00144
If we were to send 1000 such symbols we
might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
1 extra bit per macro-symbol = 1/k extra-bits per symbol
Larger model to be transmitted
Shannon took infinite sequences, and k → ∞ !!
In practice, we have:
The model takes |S|^k * (k * log |S|) + h^2 bits (where h might be |S|)
It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
[Figure: T = “bzip or not bzip”; the word-based Huffman tree over the words {bzip, or, not, space} assigns byte-aligned codewords (7 bits + 1 tagging bit per byte), where the tag bit marks the first byte of each codeword.]
CGrep and other ideas...
P = bzip = 1a 0b
[Figure: GREP is run directly on the compressed text C(T) of T = “bzip or not bzip”: the tagged, byte-aligned codewords allow matching the compressed pattern without decompressing.]
Speed ≈ Compression ratio
You find it under my Software projects
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary
P = bzip = 1a 0b
[Figure: dictionary = {bzip, not, or, space, ...}; the compressed text C(S) of S = "bzip or not bzip" is scanned, comparing the codeword of P against each byte-aligned codeword (yes/no per word).]
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: text T = ...ABCABDAB..., pattern P = AB, with its occurrences highlighted.]
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
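The naïve solution can be sketched in a few lines (illustrative, not from the slides):

```python
def naive_match(T, P):
    # For any position i of T, check whether T[i, i+m-1] = P[1, m].
    # Worst-case O(n*m) time.
    n, m = len(T), len(P)
    return [i for i in range(n - m + 1) if T[i:i + m] == P]
```

For example, searching "ab" in "abcabdab" returns the (0-based) starting positions 0, 3 and 6.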
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running
time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}ⁿ).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m
H(s) = Σ_{i=1..m} 2^(m−i) · s[i]
P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s=s’ if and only if H(s)=H(s’)
Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1)
H(T_r) = 2·H(T_{r−1}) − 2^m · T[r−1] + T[r+m−1]
T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6 = H(0110)
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P=101111
q=7
H(P) = 47
Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally!
1·2 + 0 = 2 (mod 7)
2·2 + 1 = 5 (mod 7)
5·2 + 1 = 11 ≡ 4 (mod 7)
4·2 + 1 = 9 ≡ 2 (mod 7)
2·2 + 1 = 5 (mod 7) = Hq(P)
We can still compute Hq(T_r) from Hq(T_{r−1}):
2^m (mod q) = 2·(2^(m−1) (mod q)) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
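Putting the pieces together, a sketch of the Karp-Rabin scan (a binary alphabet is assumed, as in the slides; q below is a hypothetical prime, fixed rather than randomly drawn to keep the sketch short):

```python
def karp_rabin(T, P, q=999983):
    # q is a hypothetical prime; fingerprints stay small (< 2q) during updates.
    n, m = len(T), len(P)
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):
        hp = (2 * hp + int(P[i])) % q      # H_q(P)
        ht = (2 * ht + int(T[i])) % q      # H_q(T_1)
    pow_m = pow(2, m - 1, q)               # 2^(m-1) mod q
    occ = []
    for r in range(n - m + 1):
        # The explicit verification rules out false matches (deterministic
        # variant); dropping it gives the randomized, correct-w.h.p. variant.
        if ht == hp and T[r:r + m] == P:
            occ.append(r)
        if r + m < n:                      # roll: drop T[r], append T[r+m]
            ht = ((ht - int(T[r]) * pow_m) * 2 + int(T[r + m])) % q
    return occ
```

On the slide's example T = 10110101, P = 0101, the only occurrence starts at (0-based) position 4.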
Problem 1: Solution
Dictionary
P = bzip = 1a 0b
[Figure: as in Problem 1, the codeword of P is compared against the byte-aligned codewords of C(S), S = "bzip or not bzip", answering yes/no at each word.]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for
[Figure: the m×n matrix M for T = california and P = for; the only 1 in row 3 is M(3,7), marking the occurrence of P ending at position 7 of T.]
How does M solve the exact match problem?
How to construct M
We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one
Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:
And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.
BitShift( (0,1,1,0,1)ᵀ ) = (1,0,1,1,0)ᵀ
Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit the bit-parallelism to
compute the j-th column of M from the (j−1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)
U(b) = (0,1,0,0,0)
U(c) = (0,0,0,0,1)
How to construct M
Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by
M(j) = BitShift( M(j−1) ) & U( T[j] )
For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]
⇔
⇔ M(i-1,j-1) = 1
the i-th bit of U(T[j]) = 1
BitShift moves bit M(i−1,j−1) into the i-th position
AND this with the i-th bit of U(T[j]) to establish if both are true
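A sketch of the whole Shift-And method; Python's unbounded integers play the role of machine words (so the m ≤ w restriction disappears, at the price of O(m/w)-time bit operations):

```python
def shift_and(T, P):
    # U[c] has bit i set iff P[i+1] = c; a column of M is one integer.
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    col, occ = 0, []
    for j, c in enumerate(T):
        # M(j) = BitShift(M(j-1)) & U(T[j]); BitShift sets the first bit to 1.
        col = ((col << 1) | 1) & U.get(c, 0)
        if col & (1 << (m - 1)):      # M(m, j) = 1: occurrence ending at j
            occ.append(j - m + 1)     # 0-based starting position
    return occ
```

On the running example of the next slides (T = xabxabaaca, P = abaac), the single occurrence starts at (0-based) position 4.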
An example (j = 1, 2, 3, ..., 9)
T = xabxabaaca, P = abaac
[Slides: the columns M(1), M(2), M(3), ..., M(9) are computed one after the other via M(j) = BitShift(M(j−1)) & U(T[j]), with U(x) = (0,0,0,0,0), U(a) = (1,0,1,1,0), U(b) = (0,1,0,0,0), U(c) = (0,0,0,0,1). At j = 9, M(5,9) = 1: an occurrence of P ends at position 9.]
Shift-And method: Complexity
If m<=w, any column and vector U() fit in a
memory word.
Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when the pattern length is close
to the word size, as is very often the case in practice.
Recall that w = 64 bits in modern architectures.
Some simple extensions
We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)
U(b) = (1,1,0,0,0)
U(c) = (0,0,0,0,1)
What about ‘?’, ‘[^…]’ (not).
Problem 1: Another solution
Dictionary
P = bzip = 1a 0b
[Figure: as before, the compressed text C(S) of S = "bzip or not bzip" is scanned against the codeword of P (yes/no per word).]
Speed ≈ Compression ratio
Problem 2
Dictionary
Given a pattern P find
all the occurrences in S
of all terms containing
P as substring
P = o
[Figure: dictionary = {bzip, not, or, space, ...}; the terms containing P as substring are not = 1g 0g 0a and or = 1g 0a 0b; their codewords are then searched in C(S), S = "bzip or not bzip".]
Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: text T and the patterns P1, P2 of P, with their occurrences in T.]
Naïve solution
Use an (optimal) Exact Matching Algorithm searching each
pattern in P
Complexity: O(nl+m) time, not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns in P
R is a bitmap of length m:
R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method searching for S:
For any symbol c, U’(c) = U(c) AND R
U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a
pattern
For any step j,
compute M(j)
then M(j) OR U’(T[j]). Why?
Set to 1 the first bit of each pattern that starts with
T[j]
Check if there are occurrences ending in j. How?
Problem 3
Dictionary
Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches
P = bot, k = 2
[Figure: dictionary = {bzip, not, or, space, ...}; compressed text C(S) for S = "bzip or not bzip".]
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa
aatatccacaa
atcgaa
Agrep
Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:
Ml(i,j) = 1 iff the first i characters of P match the i
characters of T ending at position j, with no more
than l mismatches.
What is M0?
How does Mk solve the k-mismatch problem?
Computing Mk
We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
Computing Ml: case 1
The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.
[Figure: the first i−1 characters of P aligned to T ending at j−1, with ≤ l mismatches marked by *.]
BitShift( M^l(j−1) ) & U( T[j] )
Computing Ml: case 2
The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.
[Figure: the first i−1 characters of P aligned to T ending at j−1, with ≤ l−1 mismatches marked by *.]
BitShift( M^(l−1)(j−1) )
Computing Ml
We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
M^l(j) = [ BitShift( M^l(j−1) ) & U( T[j] ) ]  OR  BitShift( M^(l−1)(j−1) )
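The recurrence above (union of the two cases) can be sketched directly; as before, Python integers stand in for machine words:

```python
def agrep_mismatch(T, P, k):
    # M[l] = current column of M^l (prefix matches with <= l mismatches).
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)
    occ = []
    for j, c in enumerate(T):
        prev = M[:]                                  # the columns M^l(j-1)
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            case1 = ((prev[l] << 1) | 1) & U.get(c, 0)  # P[i] matches T[j]
            case2 = (prev[l - 1] << 1) | 1              # spend one mismatch
            M[l] = case1 | case2
        if M[k] & (1 << (m - 1)):     # occurrence with <= k mismatches
            occ.append(j - m + 1)     # 0-based starting position
    return occ
```

On the earlier slide's example (T = aatatccacaa, P = atcgaa, k = 2), the only occurrence with ≤ 2 mismatches starts at (0-based) position 3.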
Example M1
T = xabxabaaca, P = abaad
[Slide: the 5×10 matrices M0 and M1. In M1, M(5,9) = 1: P occurs with at most 1 mismatch ending at position 9 of T.]
How much do we pay?
The running time is O(kn(1+m/w)).
Again, the method is practically efficient for
small m.
Only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.
Problem 3: Solution
Dictionary
Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches
P = bot, k = 2
[Figure: dictionary = {bzip, not, or, space, ...}; the term not = 1g 0g 0a matches P with ≤ 2 mismatches, and its codeword is then searched in C(S), S = "bzip or not bzip".]
Agrep: more sophisticated operations
The Shift-And method can solve other ops
The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one
Example: d(ananas,banane) = 3
Search by regular expressions
Example: (a|b)?(abc|a)
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
γ-code for integer encoding
γ-code of x > 0: (Length − 1) zeros, followed by x in binary,
where Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000,1001>.
γ-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of γ-coded
integers, reconstruct the original sequence:
0001000001100110000011101100111
8
6
3
59
7
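A sketch of γ-encoding and of the decoding used in the exercise above:

```python
def gamma_encode(x):
    # gamma-code of x > 0: (Length - 1) zeros, then x in binary.
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    # Decode a concatenation of gamma-codes back into the integers.
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":            # count the leading zeros
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))  # next z+1 bits are x in binary
        i += z + 1
    return out
```

Decoding the exercise string indeed yields 8, 6, 3, 59, 7.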
Analysis
Sort pi in decreasing order, and encode si via
the variable-length code γ(i).
Recall that: |γ(i)| ≤ 2·log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥ Σ_{i=1..x} p_i ≥ x·p_x  ⟹  x ≤ 1/p_x
How good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):
Σ_{i=1..|S|} p_i · |γ(i)|  ≤  Σ_{i=1..|S|} p_i · [ 2·log(1/p_i) + 1 ]  =  2·H0(X) + 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman
End-tagged dense code
The rank r is mapped to r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix-code
Better compression: it uses all 7-bits configurations
(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is:
Previously we used: s = c = 128
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...
An example
5000 distinct words
ETDC encodes 128 + 128² = 16512 words on ≤ 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words on ≤ 2
bytes, hence more words on 1 byte: better if the distribution is skewed...
Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.
Brute-force approach
Binary search:
On real distributions, it seems there is a unique minimum
Ks = max codeword length
Fsk = cumulative probability of the symbols whose |cw| ≤ k
Experiments: (s,c)-DC is the more interesting…
Search is 6% faster than
byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
Exploit temporal locality, and it is dynamic
X = 1^n 2^n 3^n … n^n: Huff = O(n² log n), MTF = O(n log n) + n²
No much worse than Huffman
...but it may be far better
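A sketch of the MTF transform itself (positions are emitted 0-based here, a choice of this sketch; the slides' variable-length coder would then be applied to the output):

```python
def mtf_encode(text, alphabet):
    # Start with the list of symbols L; for each input symbol s,
    # output its position in L and move s to the front of L.
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)      # position of s in the current list (0-based)
        out.append(i)
        L.pop(i)
        L.insert(0, s)      # move s to the front: the "memory"
    return out
```

Repeated symbols turn into runs of small integers (here, of 0s), which is what makes MTF exploit temporal locality.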
MTF: how good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log i + 1
Put S in the front and consider the cost of encoding:
O(|S| log |S|) + Σ_{x=1..|S|} Σ_i |γ( p_i^x − p_{i−1}^x )|
By Jensen’s inequality:
≤ O(|S| log |S|) + Σ_{x=1..|S|} n_x · [ 2·log(N/n_x) + 1 ]
= O(|S| log |S|) + N · [ 2·H0(X) + 1 ]
Hence La[mtf] ≤ 2·H0(X) + O(1)
MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:
Search tree
Hash Table
Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves
Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings just numbers and one bit
Properties:
Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n
There is a memory
Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
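A sketch of RLE matching the example above:

```python
def rle(s):
    # Collapse each maximal run into a (symbol, run-length) pair.
    out = []
    for c in s:
        if out and out[-1][0] == c:
            out[-1] = (c, out[-1][1] + 1)   # extend the current run
        else:
            out.append((c, 1))              # start a new run
    return out
```

For binary strings only the run lengths plus one starting bit are needed, since the symbols must alternate.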
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).
e.g.
f(i) = Σ_{j=1..i−1} p(j)   (cumulative probability)
e.g. p(a) = .2, p(b) = .5, p(c) = .3, so f(a) = .0, f(b) = .2, f(c) = .7
[Figure: the unit interval split into a = [0,.2), b = [.2,.7), c = [.7,1).]
The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: coding bac. Start from [0,1); b narrows it to [.2,.7); a narrows it to [.2,.3); c narrows it to [.27,.3).]
The final sequence interval is [.27,.3)
Arithmetic Coding
To code a sequence of symbols c with probabilities
p[c] use the following:
l_0 = 0, s_0 = 1
l_i = l_{i−1} + s_{i−1} · f[c_i]
s_i = s_{i−1} · p[c_i]
f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is
s_n = Π_{i=1..n} p[c_i]
The interval for a message sequence will be called the
sequence interval
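The update rule can be run as is; a small sketch, with p and f the symbol probabilities and cumulative probabilities defined above:

```python
def sequence_interval(msg, p, f):
    # Compute the sequence interval [l_n, l_n + s_n) of the message.
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]    # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]        # s_i = s_{i-1} * p[c_i]
    return l, s
```

With p(a) = .2, p(b) = .5, p(c) = .3 as in the example, the message bac yields l = .27 and s = .03, i.e. the interval [.27,.3).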
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the
final interval uniquely determines the message.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: .49 falls in b = [.2,.7); renormalized, (.49−.2)/.5 = .58 falls in b again; (.58−.2)/.5 = .76 falls in c.]
The message is bbc.
Representing a real number
Binary fractional representation:
.75 = .11
1/3 = .0101...
11/16 = .1011
Algorithm
1. x = 2·x
2. If x < 1 output 0
3. else x = x − 1; output 1
So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01
[.33,.66) = .1 [.66,1) = .11
Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
       min     max     interval
.11    .110    .111    [.75, 1.0)
.101   .1010   .1011   [.625, .75)
We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).
Sequence Interval
.79
.75
Code Interval (.101)
.61
.625
Can use L + s/2 truncated to 1 + log (1/s) bits
Bound on Arithmetic length
Note that –log s+1 = log (2/s)
Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 + log Π_{i=1..n} (1/p_i)
≤ 2 + Σ_{i=1..n} log (1/p_i)
= 2 + Σ_{k=1..|S|} n·p_k · log (1/p_k)
= 2 + n·H0 bits
nH0 + 0.02 n bits in practice
because of rounding
Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:
Keep integers in range [0..R) where R = 2^k
Use rounding to generate integer interval
Whenever the sequence interval falls into the top,
bottom or middle half, expand the interval
by a factor of 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s
m = 0
Message interval is expanded by 2
If u < R/2 then (bottom half)
Output 0 followed by m 1s
m = 0
Message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2
In all other cases, just continue...
You find this at
Arithmetic ToolBox
As a state machine
[Figure: the ATB as a state machine: given the current interval (L,s), a symbol c, and the distribution (p1,...,pS), it outputs the new interval (L’,s’) with L’ = L + s·f[c] and s’ = s·p[c].]
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
[Figure: PPM feeds the ATB with p[ s | context ], where s = c or esc; the interval (L,s) is updated to (L’,s’).]
Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)
PPM: Example Contexts
String = ACCBACCACBA B,  k = 2
Context Empty:  A = 4, B = 2, C = 5, $ = 3
Context A:  C = 3, $ = 1
Context B:  A = 2, $ = 1
Context C:  A = 1, B = 2, C = 2, $ = 3
Context AC:  B = 1, C = 2, $ = 2
Context BA:  C = 1, $ = 1
Context CA:  C = 1, $ = 1
Context CB:  A = 2, $ = 1
Context CC:  A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)
<2,3,c>
Algorithm’s step:
Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match
Advance by len + 1
A buffer “window” has fixed length and moves
Example: LZ77 with window
a a c a a c a b c a b a a a c
(0,0,a)
a a c a a c a b c a b a a a c
(1,1,c)
a a c a a c a b c a b a a a c
(3,4,b)
a a c a a c a b c a b a a a c
(3,3,a)
a a c a a c a b c a b a a a c
(1,2,c)
Window size = 6
Longest match
within W
Next character
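A sketch of the LZ77 encoder with a sliding window; it reproduces the triples of the example above. The search for the longest match is the naïve quadratic one (gzip would use a hash table instead), and a match is truncated so that a next character always exists:

```python
def lz77_encode(s, W=6):
    # Emit (d, len, c): copy distance, match length, next character.
    out, cur = [], 0
    n = len(s)
    while cur < n:
        best_d = best_len = 0
        for d in range(1, min(cur, W) + 1):     # candidate distances in window
            length = 0
            # Overlap with the part still to be compressed is allowed;
            # the guard keeps one character free for the 'next char' field.
            while cur + length < n - 1 and s[cur + length - d] == s[cur + length]:
                length += 1
            if length > best_len:
                best_d, best_len = d, length
        out.append((best_d, best_len, s[cur + best_len]))
        cur += best_len + 1                     # advance by len + 1
    return out
```

On the slide's text it outputs (0,0,a), (1,1,c), (3,4,b), (3,3,a), (1,2,c).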
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if l > d? (overlap with text to be compressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
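A sketch of the LZ78 coding loop. The dictionary is kept as a map from strings to ids instead of an explicit trie, and the final flush of a pending match is a choice of this sketch (all prefixes of a match are in the dictionary, so it can always be emitted as a shorter id plus one character):

```python
def lz78_encode(s):
    dic = {"": 0}                    # id 0 = the empty string
    out, cur = [], ""
    for c in s:
        if cur + c in dic:
            cur += c                 # keep extending the longest match
        else:
            out.append((dic[cur], c))
            dic[cur + c] = len(dic)  # add the substring Sc to the dictionary
            cur = ""
    if cur:                          # flush a pending match at end of input
        out.append((dic[cur[:-1]], cur[-1]))
    return out
```

On the slide's example it outputs (0,a), (1,b), (1,a), (0,c), (2,c), (5,b).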
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
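A sketch of the LZW decoder, including the special SSc case: when the received id is not yet in the dictionary, the missing entry must be prev + prev[0]. The toy dictionary below follows the slides' convention a = 112, and `next_code` marks where the non-ascii entries start:

```python
def lzw_decode(codes, dic, next_code=256):
    # dic maps code -> string (the initial "ascii" entries).
    dic = dict(dic)
    prev = dic[codes[0]]
    out = [prev]
    for code in codes[1:]:
        # Special SSc case: the decoder is one step behind the coder,
        # so an unknown code can only be prev extended by its own first char.
        cur = dic[code] if code in dic else prev + prev[0]
        out.append(cur)
        dic[next_code] = prev + cur[0]   # entry the coder added one step earlier
        next_code += 1
        prev = cur
    return "".join(out)
```

Feeding it the codes of the encoding example reconstructs the text (code 261 arrives before the decoder has defined it, exercising the special case).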
LZW: Encoding Example
Output
Dict.
a a b a a c a b a b a c b
112 256=aa
a a b a a c a b a b a c b
112 257=ab
a a b a a c a b a b a c b
113
258=ba
a a b a a c a b a b a c b
256
259=aac
a a b a a c a b a b a c b
114
260=ca
a a b a a c a b a b a c b
257
261=aba
a a b a a c a b a b a c b
261
262=abac
a a b a a c a b a b a c b
114
263=cb
LZW: Decoding Example
Input
112
Dict
a
112
a a
256=aa
113
a a b
257=ab
256
a a b a a
258=ba
114
a a b a a c
259=aac
257 → a a b a a c a b ?   260=ca
261 → a a b a a c a b a b   261=aba
(the entry for 261 is learnt one step later: the SSc case)
114 → a a b a a c a b a b a c
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows   (1994)
[Figure: the sorted rotations of T. First column F = # i i i i m p p s s s s; last column L = i p s s m # p i s s i i. L is the output of the BWT.]
A famous example
Much
longer...
A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s
unknown
mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m
L
i
p
s
s
m
#
p
i
s
s
i
i
F mapping
How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...
Take two equal L’s chars
Rotate rightward their rows
Same relative order !!
The BWT is invertible
[Figure: the sorted BWT matrix with F = # i i i i m p p s s s s and L = i p s s m # p i s s i i.]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i ppi #
InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
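A sketch of the transform and of its inversion via the LF mapping. Sorting all rotations is the "elegant but inefficient" method discussed next; '#' is assumed to be a unique end-marker smaller than every other character, so row 0 of the sorted matrix is the rotation starting with '#':

```python
def bwt(T):
    # Sort all cyclic rotations; the BWT is the last column L.
    n = len(T)
    rot = sorted(T[i:] + T[:i] for i in range(n))
    return "".join(r[-1] for r in rot)

def ibwt(L):
    n = len(L)
    # LF mapping: stably sorting L's positions gives F, and equal
    # characters keep their relative order.
    order = sorted(range(n), key=lambda r: L[r])
    LF = [0] * n
    for f_row, r in enumerate(order):
        LF[r] = f_row
    out, r = [], 0              # row 0 starts with '#'
    for _ in range(n):
        out.append(L[r])        # L[r] precedes F[r] in T
        r = LF[r]
    s = "".join(reversed(out))  # the rotation starting with '#'
    return s[1:] + s[0]         # rotate so '#' is back at the end
```

Round-tripping mississippi# gives L = ipssm#pissii, as on the slides.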
How to compute the BWT ?
SA = 12 11 8 5 2 1 10 9 7 4 6 3
[Figure: the BWT matrix rows in SA order: #mississippi, i#mississipp, ippi#mississ, issippi#miss, ississippi#m, mississippi#, pi#mississip, ppi#mississi, sippi#missis, sissippi#mis, ssippi#missi, ssissippi#mi; last column L = i p s s m # p i s s i i.]
We said that: L[i] precedes F[i] in T
L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
How to construct SA from T ?
SA    suffix
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#
Elegant but inefficient
Input: T = mississippi#
Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii
# at 16
Mtf = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code
RLE0 = 03141041403141410210
Alphabet
|S|+1
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node you can go to any other via an
undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node you can go to any other via a
directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
Largest artifact ever conceived by humans
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph
V = Routers
E = communication links
The “cosine” graph (undirected, weighted)
V = static web pages
E = semantic distance between pages
Query-Log graph (bipartite, weighted)
V = queries and URLs
E = (q,u) if u is a result for q, and has been clicked by some
user who issued q
Social graph (undirected, unweighted)
V = users
E = (x,y) if x knows y (facebook, address book, email,..)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has an hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1
The In-degree distribution
Indegree follows a power law distribution:
Pr[ in-degree(u) = k ] ∝ 1/k^a,  a ≈ 2.1
(Altavista crawl, 1999; WebBase Crawl 2001)
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has an hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
[Figure: adjacency matrix (i,j) of a web crawl.]
21 million pages, 150 million links
URL-sorting
Berkeley
Stanford
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
For negative entries:
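A sketch of the gap encoding of a successor list. Only the first gap s1−x can be negative; mapping it to a non-negative integer follows the zig-zag rule visible in the "Residuals" examples later (2x for positive x, 2|x|−1 for negative). The helper name `vnat` is hypothetical:

```python
def vnat(x):
    # zig-zag: 2x for x >= 0, 2|x|-1 for x < 0 (hypothetical helper name)
    return 2 * x if x >= 0 else 2 * (-x) - 1

def encode_successors(x, succ):
    # S(x) = {s1-x, s2-s1-1, ..., sk-s(k-1)-1}, succ sorted increasing;
    # only the first entry may be negative, so only it goes through vnat.
    gaps = [vnat(succ[0] - x)]
    gaps += [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]
    return gaps
```

Locality makes these gaps small, so a variable-length integer code compresses them well.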
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.
Consecutivity
3
in extra-nodes
Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
data
knowledge
about data
at receiver
sender
receiver
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
What if the sender has never seen the data at the receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested
page and the ones available in cache
File synchronization
[diff, zdelta, REBL,…]
[rsynch, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
The LZ77 scheme provides an efficient, optimal solution:
fknown is the "previously encoded text": compress the concatenation fknown · fnew, emitting output only from fnew onwards
zdelta is one of the best implementations
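A minimal way to experiment with the idea (not zdelta itself): zlib lets fknown act as a preset dictionary, so shared substrings of fnew become LZ77 back-references into it:

```python
import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    # f_known acts as a preset dictionary: shared substrings become
    # LZ77 back-references into it, mimicking "compress f_known.f_new
    # and keep only the part that encodes f_new".
    c = zlib.compressobj(level=9, zdict=f_known)
    return c.compress(f_new) + c.flush()

def delta_decompress(f_known: bytes, f_d: bytes) -> bytes:
    d = zlib.decompressobj(zdict=f_known)
    return d.decompress(f_d) + d.flush()
```

On highly similar files the delta is far smaller than compressing fnew alone.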
Emacs, size and compression time:
  uncompressed: 27Mb, ---
  gzip: 8Mb, 35 secs
  zdelta: 1.5Mb, 42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
Client
reference
request
Slow-link
Delta-encoding
Proxy
reference
request
Fast-link
web
Page
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f ∈ F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
Experimental results (space, time):
  uncompressed: 30Mb, ---
  tgz: 20%, linear
  THIS: 8%, quadratic
Improvement
What about
many-to-one compression?
(group of files)
Problem: constructing G is very costly: n² edge calculations (zdelta executions)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files
Assign weights: estimate appropriate edge weights for G'F, thus saving
zdelta executions. Nonetheless, still strictly n² time
Experimental results (space, time):
  uncompressed: 260Mb, ---
  tgz: 12%, 2 mins
  THIS: 8%, 16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
request
f_new
update
Server
f_old
Client
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
hashes
f_new
Server
encoded file
f_old
Client
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
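The 4-byte rolling hash can be sketched as rsync's weak checksum: two 16-bit sums that slide over the file in O(1) per byte (constants and names here are illustrative, not rsync's actual code):

```python
M = 1 << 16   # the weak checksum is a pair of 16-bit sums

def weak_checksum(block):
    L = len(block)
    a = sum(block) % M                               # plain byte sum
    b = sum((L - i) * x for i, x in enumerate(block)) % M  # position-weighted
    return a, b

def roll(a, b, out_byte, in_byte, L):
    """Slide the window one byte to the right in O(1):
    drop out_byte (weight L in b), every weight grows by 1, add in_byte."""
    a = (a - out_byte + in_byte) % M
    b = (b - L * out_byte + a) % M
    return a, b
```

The server hashes every alignment of f_new this way, falling back to the strong (MD5-like) hash only when the weak one matches.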
Rsync: some experiments
Compressed size in KB (slightly outdated numbers):
           gcc     emacs
  total    27288   27326
  gzip     7563    8577
  zdelta   227     1431
  rsync    964     4452
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
Server sends hashes (unlike the client in rsync); the client checks them
Server deploys the common fref to compress the new ftar (rsync compresses just it).
A multi-round protocol
k blocks of n/k elems
Log n/k levels
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Recurring minimum for improving the estimate + 2 SBF
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (ie. T[i,N])
[Figure: P aligned at position i of T, matching a prefix of the suffix T[i,N]]
Occurrences of P in T = All suffixes of T having P as a prefix
Example: T = mississippi, P = si → occurrences at positions 4, 7
SUF(T) = Sorted set of suffixes of T
Reduction
From substring search
To prefix search
The Suffix Tree
[Figure: the suffix tree of T# = mississippi# — a compacted trie over the 12 suffixes; edge labels are substrings (e.g. i, s, si, ssi, ppi#, pi#, #, mississippi#) and each leaf stores the starting position 1..12 of its suffix]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
SUF(T) stored explicitly takes Θ(N²) space; the suffix array SA stores only the starting positions (suffix pointers):
  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#
T = mississippi#   (e.g. P = si is a prefix of the contiguous suffixes at SA positions 9-10)
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
Example: SA = [12 11 8 5 2 1 10 9 7 4 6 3], T = mississippi#, P = si
Compare P with the suffix T[SA[mid],N]: here P is larger, so recurse on the right half
2 accesses per step (one to SA, one to T)
Searching a pattern
Indirect binary search on SA, continued: when P is smaller than the compared suffix, recurse on the left half
Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
⇒ overall, O(p log₂ N) time
Improvable to O(p + log₂ N) [Manber-Myers, '90] and O(p + log₂ |S|) [Cole et al., '06]
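A toy version of the indirect binary search (naive SA construction, fine for small texts; real systems use linear-time builders such as SA-IS):

```python
def suffix_array(t):
    # Naive O(n^2 log n) construction -- enough for a sketch.
    return sorted(range(len(t)), key=lambda i: t[i:])

def sa_search(t, sa, p):
    """Two binary searches give the contiguous SA range of suffixes
    prefixed by p: O(|p| log n) time; the range holds all occurrences."""
    def first_geq(pred):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(t[sa[mid]:sa[mid] + len(p)]):  # compare only |p| chars
                hi = mid
            else:
                lo = mid + 1
        return lo
    left = first_geq(lambda s: s >= p)
    right = first_geq(lambda s: s > p)
    return sorted(sa[left:right])                  # 0-based positions
```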
Locating the occurrences
Once the SA range [l, r] of suffixes prefixed by P is found, occ = r − l + 1 and the occurrences are SA[l], ..., SA[r].
Example: T = mississippi#, P = si → the contiguous suffixes sippi#, sissippi# give occ = 2, at positions 4 and 7
(the boundaries can be found by searching si# and si$, i.e. P padded with the smallest/largest symbol)
Suffix Array search
• O (p + log2 N + occ) time
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree
[Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
Lcp = [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3], computed between suffixes adjacent in
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3], T = mississippi#
E.g. lcp(issippi#, ississippi#) = 4
• How long is the common prefix between T[i,...] and T[j,...]?
  • It is the min of the subarray Lcp[h, k−1], where SA[h] = i and SA[k] = j (h < k).
• Does there exist a repeated substring of length ≥ L?
  • Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  • Search for a window Lcp[i, i+C−2] whose entries are all ≥ L.
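The three queries above can be sketched over a naively computed Lcp array (Kasai's algorithm would do the same in O(n)):

```python
def lcp_array(t, sa):
    """Lcp[i] = longest common prefix of the suffixes sa[i] and sa[i+1]
    (naive O(n^2); fine for a sketch)."""
    def lcp(a, b):
        k = 0
        while a + k < len(t) and b + k < len(t) and t[a + k] == t[b + k]:
            k += 1
        return k
    return [lcp(sa[i], sa[i + 1]) for i in range(len(sa) - 1)]

def has_repeat(lcp, L):
    # a repeated substring of length >= L exists iff some Lcp entry >= L
    return any(v >= L for v in lcp)

def occurs_C_times(lcp, L, C):
    # substring of length >= L occurring >= C times:
    # a window of C-1 consecutive Lcp entries all >= L
    w = C - 1
    return any(min(lcp[i:i + w]) >= L for i in range(len(lcp) - w + 1))
```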
Slide 8
Paradigm shift...
Web 2.0 is about the many
Big DATA vs Big PC ?
We have three types of algorithms:
T1(n) = n, T2(n) = n2, T3(n) = 2n
... and assume that 1 step = 1 time unit
How many input data n each algorithm may process
within t time units?
n1 = t,
n2 = √t,
n3 = log2 t
What about a k-times faster processor?
...or, what is n, when the time units are k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
A new scenario
Data are more available than ever before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Step cost is Ω(1) time
Not just MIN #steps…
CPU (registers) → L1/L2 cache → RAM → HD → network:
  Cache: few Mbs, some nanosecs, few words fetched
  RAM:   few Gbs, tens of nanosecs, some words fetched
  HD:    few Tbs, few millisecs, B = 32K page
  Net:   many Tbs, even secs, packets
You should be “??-aware programmers”
I/O-conscious Algorithms
track
read/write head
read/write arm
magnetic surface
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses that miss [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O [10⁵–10⁶ time units (Hennessy-Patterson)]
If N = (1+f)M, then the average disk cost per step is:
  C · p · f/(1+f), which is at least 10⁴ · f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
  (1/B) · (p · f/(1+f) · C) ≈ 30 · f/(1+f)
Space-conscious Algorithms
I/Os
search
access
Compressed
data structures
Streaming Algorithms
track
read/write head
read/write arm
magnetic surface
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
CPU (registers) → L1/L2 cache → RAM → HD → network:
  Cache: few Mbs, some nanosecs, few words fetched
  RAM:   few Gbs, tens of nanosecs, some words fetched
  HD:    few Tbs, few millisecs, B = 32K page
  Net:   many Tbs, even secs, packets
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running times of the brute-force algorithms:
  n:    4K   8K   16K   32K   128K   256K   512K   1M
  n³:   22s  3m   26m   3.5h  28h    --     --     --
  n²:   0    0    0     1s    26s    106s   7m     28m
An optimal solution
We assume every subsum ≠ 0
[Figure: A split around the optimum: the run just before OPT sums to < 0, while every prefix of OPT sums to > 0]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm (single scan):
  sum = 0; max = −∞;
  For i = 1, ..., n do
    if (sum + A[i] ≤ 0) sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }
Note:
  • sum < 0 just before OPT starts;
  • sum > 0 within OPT
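A runnable version of the scan, also tracking the window boundaries (a standard Kadane-style formulation of the algorithm above):

```python
def max_subarray(a):
    """One scan: reset the running sum when it drops to <= 0, since a
    non-positive prefix can never help the optimum (the slide's invariant)."""
    best, cur, start = a[0], a[0], 0
    best_range = (0, 0)
    for i in range(1, len(a)):
        if cur <= 0:
            cur, start = a[i], i         # restart just before a[i]
        else:
            cur += a[i]
        if cur > best:
            best, best_range = cur, (start, i)
    return best, best_range
```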
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions Data get distributed arbitrarily !!!
B-tree internal nodes
B-tree leaves
(“tuple pointers")
What about
listing tuples
in order ?
Tuples
Possibly 10⁹ random I/Os = 10⁹ × 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A, i, j)
01 if (i < j) then
02   m = (i + j) / 2;          // Divide
03   Merge-Sort(A, i, m);      // Conquer
04   Merge-Sort(A, m+1, j);
05   Merge(A, i, m, j)         // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
  n = 10⁹ tuples ⇒ a few Gbs
  Typical disk (Seagate Cheetah 150Gb): seek time ≈ 5ms
Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log₂ n) random I/Os
  [5ms] × n log₂ n ≈ 1.5 years
In practice it is faster, because of caching: each merge level makes 2 sequential passes (read/write)
Merge-Sort Recursion Tree
log₂ N merge levels; [Figure: the recursion tree, each node merging two sorted runs of sample keys]
If the run size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help
How do we deploy the disk/memory features?
Produce N/M runs, each sorted in internal memory (no random I/Os)
⇒ the I/O cost of the merging levels is ≈ 2 (N/B) log₂ (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
Pass 1: produce (N/M) sorted runs.
Pass i: merge X = M/B runs at a time ⇒ log_{M/B}(N/M) passes
[Figure: X = M/B input buffers (one per run) and one output buffer, each of B items, in main memory between the input disk and the output disk]
Multiway Merging
[Figure: one buffer Bf_i with pointer p_i per run, X = M/B runs; repeatedly move min(Bf₁[p₁], Bf₂[p₂], ..., Bf_X[p_X]) to the output buffer Bf_o; fetch the next page of run i when p_i = B, flush Bf_o to the merged run when full, until EOF]
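The merging step can be sketched in memory with a min-heap over the X run heads (Python's heapq; on disk, each run would be read through its own B-page buffer):

```python
import heapq

def multiway_merge(runs):
    """Merge X sorted runs with a min-heap over their current heads:
    each item enters and leaves the heap once, O(N log X) comparisons,
    and every run is read strictly sequentially."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        v, i, j = heapq.heappop(heap)
        out.append(v)
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out
```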
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
Optimal cost = Θ((N/B) · log_{M/B} (N/M)) I/Os
In practice:
  M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
  One multiway merge ⇒ 2 passes (R/W) = few mins
Tuning depends
on disk features
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
Can compression help?
Goal: enlarge M and reduce N
#passes = O(logM/B N/M)
Cost of a pass = O(N/B)
Part of Vitter’s paper…
In order to address issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how much we can go
Toy problem #3: Top-freq elements
Goal: top queries over a stream of N items (the stream is large).
Math Problem: find the item y whose frequency is > N/2, using the smallest
space (i.e., assuming the mode occurs > N/2 times).
A=b a c c c d c b a a a c c b c c c
Algorithm (one pass, a pair of variables X, C):
  For each item s of the stream,
    if (C == 0) { X = s; C = 1; }
    else if (X == s) C++;
    else C--;
  Return X;
Proof: each decrement pairs one occurrence of the current X with a "negative" mate.
If X ≠ y at the end, then every one of y's occurrences has such a mate, so the
mates are ≥ #occ(y) and N ≥ 2 · #occ(y) — contradicting #occ(y) > N/2.
Problems arise if the mode occurs ≤ N/2 times: the returned X must then be verified with a second pass.
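A runnable version of the voting scan (Boyer-Moore majority vote; the verification pass for streams without a true majority is left to the caller):

```python
def majority_candidate(stream):
    """O(1) space: if some item occurs > N/2 times it necessarily
    survives the cancellation; otherwise the answer must be verified."""
    x, c = None, 0
    for s in stream:
        if c == 0:
            x, c = s, 1
        elif x == s:
            c += 1
        else:
            c -= 1           # pair one occurrence of x with a "mate"
    return x
```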
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 · 10⁹ chars ⇒ size = 6Gb
n = 10⁶ documents
TotT = 10⁹ term occurrences (avg term length is 6 chars)
t = 5 · 10⁵ distinct terms
What kind of data structure should we build to support word-based searches?
Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms; entry = 1 if the play contains the word, 0 otherwise

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony            1                1             0          0       0        1
  Brutus            1                1             0          1       0        0
  Caesar            1                1             0          1       1        1
  Calpurnia         0                1             0          0       0        0
  Cleopatra         1                0             0          0       0        0
  mercy             1                0             1          1       1        1
  worser            1                0             1          1       1        0

Space is 500Gb !
Solution 2: Inverted index
  Brutus    → 2 4 8 16 32 64 128
  Calpurnia → 2 3 5 8 13 21 34
  Caesar    → 13 16
We can still do better, i.e. 30÷50% of the original text:
1. Typically …
2. We have 10⁹ total terms ⇒ at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
A better index, but it is still >10 times the (compressed) text !!!!
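A toy in-memory inverted index, plus the gap transform that makes the sorted posting lists compressible (illustrative names, not a real engine):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Term -> sorted posting list of doc ids; scanning docs in id order
    keeps each list sorted, ready for gap (delta) encoding."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        for term in sorted(set(text.split())):
            index[term].append(doc_id)
    return dict(index)

def to_gaps(postings):
    # sorted doc ids -> first id, then differences (small, compressible)
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
```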
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits?
NO: they are 2ⁿ, but the shorter encodings are at most Σ_{i=1}^{n−1} 2^i = 2ⁿ − 2
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:
  i(s) = log₂ (1/p(s)) = −log₂ p(s)
Lower probability higher information
Entropy is the weighted average of i(s):
  H(S) = Σ_{s∈S} p(s) · log₂ (1/p(s))  bits
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be
uniquely decomposed into their codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0
1
0 1
a
0
1
b
c
d
Average Length
For a code C with codeword lengths L[s], the average length is defined as
  La(C) = Σ_{s∈S} p(s) · L[s]
We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C')
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there
exists a prefix code with the same codeword lengths, and thus the same
optimal average length. And vice versa…
Theorem (golden rule). If C is an optimal prefix code for a source with
probabilities {p1, …, pn}, then p_i < p_j ⇒ L[s_i] ≥ L[s_j]
Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any
uniquely decodable code C, we have H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any probability distribution, there
exists a prefix code C such that La(C) ≤ H(S) + 1
(Shannon code: s takes ⌈log₂ 1/p(s)⌉ bits)
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Merge a(.1) + b(.2) → (.3); merge (.3) + c(.2) → (.5); merge (.5) + d(.5) → (1)
a = 000, b = 001, c = 01, d = 1
There are 2^(n−1) "equivalent" Huffman trees (flip the 0/1 labels of any internal node)
What about ties (and thus, tree depth) ?
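A compact Huffman construction with a heap (the tie-breaking index makes the result deterministic; codes may differ from the slide's by 0/1 flips at internal nodes, but the lengths are the optimal ones):

```python
import heapq

def huffman_code(freqs):
    """Greedy: repeatedly merge the two least-probable trees; a symbol's
    codeword is its root-to-leaf path (0 = left branch, 1 = right)."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)             # keeps heap tuples comparable on ties
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]
```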
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
With the tree above:
  encoding: abc… → 000 001 01 = 00000101…
  decoding: 101001… → d c b …
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store, for any level L:
  firstcode[L] (the value of the first codeword on level L, of the form 00…0…)
  Symbol[L,i], for each i in level L
This takes ≤ h² + |S| log |S| bits, where h is the tree height
Canonical Huffman
Encoding
[Figure: the canonical codeword tree, levels L = 1, …, 5]
Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0
T=...00010...
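A sketch of the canonical construction and decoding described above; the firstcode recurrence here follows these slides (firstcode[h] = 0, firstcode[l] = ⌈(firstcode[l+1] + num[l+1]) / 2⌉), and on the running example it reproduces a=000, b=001, c=01, d=1:

```python
def canonical_codes(lengths):
    """From code lengths alone, build canonical codewords plus the
    firstcode[] array and the Symbol[L, i] table used for decoding."""
    h = max(lengths.values())
    num = [0] * (h + 2)
    for l in lengths.values():
        num[l] += 1
    firstcode = [0] * (h + 2)
    for l in range(h - 1, 0, -1):
        firstcode[l] = (firstcode[l + 1] + num[l + 1] + 1) // 2
    codes, symbols = {}, {}
    nextcode = {l: firstcode[l] for l in range(1, h + 1)}
    for s in sorted(lengths, key=lambda s: (lengths[s], s)):
        l = lengths[s]
        codes[s] = format(nextcode[l], "0{}b".format(l))
        symbols[(l, nextcode[l] - firstcode[l])] = s   # Symbol[L, i]
        nextcode[l] += 1
    return codes, firstcode, symbols

def canonical_decode(bits, firstcode, symbols):
    out, v, l = [], 0, 0
    for b in bits:
        v, l = 2 * v + int(b), l + 1
        # a codeword of length l is complete once its value reaches firstcode[l]
        if v >= firstcode[l] and (l, v - firstcode[l]) in symbols:
            out.append(symbols[(l, v - firstcode[l])])
            v, l = 0, 0
    return "".join(out)
```

No explicit tree is ever built: the firstcode array and one symbol table per level suffice.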
Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
  −log₂(.999) ≈ .00144 bits
If we were to send 1000 such symbols we might hope to use 1000 · .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
1 extra bit per macro-symbol = 1/k extra-bits per symbol
Larger model to be transmitted
Shannon took infinite sequences, i.e. k → ∞ !!
In practice, we have:
  the model takes |S|^k · (k · log₂ |S|) + h² bits (where h might be |S|^k)
  and H₀(S^L) ≤ L · H_k(S) + O(k · log |S|), for each k ≤ L
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
[Figure: word-based tagged Huffman for T = "bzip or not bzip": the tree (fan-out 128) has the words bzip, or, not and space as symbols; each byte of a codeword carries 7 Huffman bits plus a tag bit marking whether it starts a codeword, yielding the byte-aligned compressed text C(T)]
CGrep and other ideas...
P = bzip → encoded as 1a 0b with the same tree
[Figure: the encoded pattern is searched directly (GREP) in C(T), T = "bzip or not bzip"; the tag bits keep the scan aligned on codeword boundaries, each candidate answered yes/no]
Speed ≈ Compression ratio
You find this under my Software projects.
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary: {a, bzip, not, or}
P = bzip → encoded as 1a 0b
[Figure: the encoded pattern is searched directly in C(S), S = "bzip or not bzip"; tag bits mark codeword starts, each candidate alignment answered yes/no]
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: the pattern P slid along the text T]
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}ⁿ).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:
  H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]
P = 0101 → H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s=s’ if and only if H(s)=H(s’)
Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr−1):
  H(T_r) = 2 · H(T_{r−1}) − 2^m · T[r−1] + T[r+m−1]
T = 10110101, m = 4:
  T₁ = 1011, T₂ = 0110
  H(T₁) = H(1011) = 11
  H(T₂) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P=101111
q=7
H(P) = 47
Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally, digit by digit:
  1·2 + 0 (mod 7) = 2
  2·2 + 1 (mod 7) = 5
  5·2 + 1 (mod 7) = 4
  4·2 + 1 (mod 7) = 2
  2·2 + 1 (mod 7) = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr−1), since 2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q).
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
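A hedged implementation of the full algorithm (here with a fixed Mersenne prime instead of a randomly drawn one, and base 256 instead of binary — both are illustrative choices):

```python
def karp_rabin(T, P, q=2_147_483_647):
    """Compare the fingerprint of P with that of each window Tr, updating
    it in O(1); equal fingerprints are verified, so a false match (possible,
    since Hq(P) = Hq(Tr) does not imply P = Tr) only costs extra time."""
    n, m = len(T), len(P)
    if m > n:
        return []
    B = 256                            # base: one "digit" per character
    top = pow(B, m - 1, q)             # weight of the outgoing character
    hp = ht = 0
    for i in range(m):
        hp = (hp * B + ord(P[i])) % q
        ht = (ht * B + ord(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:   # verify before declaring a match
            occ.append(r)                  # 0-based start position
        if r + m < n:
            ht = ((ht - ord(T[r]) * top) * B + ord(T[r + m])) % q
    return occ
```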
Problem 1: Solution
Dictionary: {a, bzip, not, or}; P = bzip = 1a 0b
[Figure: the fingerprint of the encoded pattern is compared against C(S), S = "bzip or not bzip", at the codeword-aligned positions marked by the tag bits]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for
[Figure: the m × n matrix M for T = california, P = for; the only 1 in the last row is M(3,7), signalling the occurrence of P ending at position 7]
How does M solve the exact match problem? P occurs ending at j iff M(m, j) = 1.
How to construct M
We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one
Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:
And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting A's bits down by one and setting the first bit to 1,
e.g. BitShift((0,1,1,0,…)ᵀ) = (1,0,1,1,…)ᵀ
Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1
0
U (a ) 1
1
0
0
1
U (b ) 0
0
0
0
0
U (c ) 0
0
1
How to construct M
Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by
  M(j) = BitShift(M(j−1)) & U(T[j])
For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T ending at character j−1  ⇔  M(i−1, j−1) = 1
(2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) is 1
BitShift moves bit M(i−1, j−1) into the i-th position; ANDing with the i-th bit of U(T[j]) establishes whether both conditions hold
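The whole method fits in a few lines once a column of M is packed into an integer (0-based indexing here: bit i−1 of the word corresponds to row i of the matrix):

```python
def shift_and(T, P):
    """Bit-parallel exact matching: M holds column j packed in an integer;
    M(j) = (BitShift(M(j-1)) & U(T[j])), BitShift = shift + set first bit."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)      # bit i set iff P has c at position i
    M, last, occ = 0, 1 << (m - 1), []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)   # shift down, set first bit, AND U
        if M & last:                       # full match ends at j
            occ.append(j - m + 1)          # 0-based start position
    return occ
```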
An example: T = xabxabaaca, P = abaac
[Figure: the columns M(1), M(2), M(3), …, M(9) computed step by step as M(j) = BitShift(M(j−1)) & U(T[j]); at j = 9 the last bit M(5,9) = 1, signalling the occurrence of P ending at position 9]
Shift-And method: Complexity
If m<=w, any column and vector U() fit in a
memory word.
Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
Very often in practice. Recall that w=64 bits in
modern architectures.
Some simple extensions
We want to allow the pattern to contain special symbols, like the class of chars [a-f]
Example: P = [a-b]baac
  U(a) = (1,0,1,1,0)ᵀ, U(b) = (1,1,0,0,0)ᵀ, U(c) = (0,0,0,0,1)ᵀ
  (position 1 is set in both U(a) and U(b), since either char matches there)
What about '?', '[^…]' (not)?
Problem 1: An other solution
Dictionary: {a, bzip, not, or}; P = bzip = 1a 0b
[Figure: the encoded pattern is searched in C(S), S = "bzip or not bzip", now with the Shift-And method over the codeword stream]
Speed ≈ Compression ratio
Problem 2
Dictionary: {a, bzip, not, or}
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring; e.g. P = o occurs in the terms:
  not = 1g 0g 0a
  or  = 1g 0a 0b
[Figure: the encodings of all matching terms are then searched in C(S), S = "bzip or not bzip"]
Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: patterns P1 and P2 matched simultaneously against the text T]
Naïve solution
Use an (optimal) Exact Matching Algorithm searching each
pattern in P
Complexity: O(nl+m) time, not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And:
  S is the concatenation of the patterns in P; R is a bitmap of length m
  R[i] = 1 iff S[i] is the first symbol of a pattern
  Use a variant of the Shift-And method searching for S:
  For any symbol c, U'(c) = U(c) AND R, i.e. U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j: compute M(j), then set M(j) = M(j) OR U'(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j].
  Check if there are occurrences ending in j. How? Test the bits of M(j) at the last position of each pattern.
Problem 3
Dictionary: {a, bzip, not, or}
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches; e.g. P = bot, k = 2
[Figure: candidate terms are matched against the dictionary, then their encodings are searched in C(S), S = "bzip or not bzip"]
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa
aatatccacaa
atcgaa
Agrep
Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:
Ml(i,j) = 1 iff there are no more than l mismatches between the first i
characters of P and the i characters of T ending at character j.
What is M0?
How does Mk solve the k-mismatch problem?
Computing Mk
We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
Computing Ml: case 1
The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.
[Figure: an alignment of P[1..i−1] ending at T[j−1] with ≤ l mismatches, extended by the equal pair P[i] = T[j]]
  BitShift(M^l(j−1)) & U(T[j])
Computing Ml: case 2
The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.
[Figure: an alignment of P[1..i−1] ending at T[j−1] with ≤ l−1 mismatches, extended by an arbitrary (possibly mismatching) pair]
  BitShift(M^{l−1}(j−1))
Computing Ml
We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
  M^l(j) = [BitShift(M^l(j−1)) & U(T[j])]  OR  BitShift(M^{l−1}(j−1))
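A sketch of agrep's k-mismatch search, keeping one packed column per matrix M⁰, …, M^k (0-based indexing; the OR term lets one extra mismatch be absorbed at position j):

```python
def agrep_mismatch(T, P, k):
    """Bit-parallel k-mismatch search: M[l] packs column j of matrix M^l,
    updated as M^l(j) = (BitShift(M^l(j-1)) & U(T[j]))
                        OR BitShift(M^(l-1)(j-1))."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)
    last, occ = 1 << (m - 1), []
    for j, c in enumerate(T):
        prev = M[:]                        # columns j-1 of M^0 .. M^k
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & last:
            occ.append(j)                  # 0-based end position of a match
    return occ
```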
Example M1
T = xabxabaaca, P = abaad (rows i = 1..5, columns j = 1..10):
M⁰ =  0 1 0 0 1 0 1 1 0 1
      0 0 1 0 0 1 0 0 0 0
      0 0 0 0 0 0 1 0 0 0
      0 0 0 0 0 0 0 1 0 0
      0 0 0 0 0 0 0 0 0 0
M¹ =  1 1 1 1 1 1 1 1 1 1
      0 0 1 0 0 1 0 1 1 0
      0 0 0 1 0 0 1 0 0 1
      0 0 0 0 1 0 0 1 0 0
      0 0 0 0 0 0 0 0 1 0
M¹(5,9) = 1 ⇒ P occurs with ≤ 1 mismatch ending at position 9
How much do we pay?
The running time is O( k n (1 + m/w) ).
Again, the method is practically efficient for
small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.
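As a concrete sketch of the recurrence above (not from the slides; the function name and the mask table `U` are our own), here is a bit-parallel Python implementation; it reports 0-based starting positions, so the slide's "position 4" appears as index 3:

```python
def shift_and_mismatches(P, T, k):
    """Bit-parallel Shift-And search for P in T with up to k mismatches.
    Bit i-1 of M[l] is set iff P[:i] matches the text substring ending
    at the current position with at most l mismatches."""
    m = len(P)
    # U[c]: bitmask with bit i set iff P[i] == c
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)          # only O(k) words of state
    occ = []
    for j, c in enumerate(T):
        prev = M[:]            # columns M_l(j-1)
        Uc = U.get(c, 0)
        M[0] = ((prev[0] << 1) | 1) & Uc
        for l in range(1, k + 1):
            # exact-extension case OR "spend one mismatch" case
            M[l] = (((prev[l] << 1) | 1) & Uc) | ((prev[l - 1] << 1) | 1)
        if M[k] & (1 << (m - 1)):
            occ.append(j - m + 1)   # 0-based start of an occurrence
    return occ
```

On the slide's example T = aatatccacaa, P = atcgaa, k = 2, this reports the single occurrence starting at index 3 (position 4 in 1-based counting).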
Problem 3: Solution
Dictionary
Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches
[Figure: compressed trie C(S) built on S = “bzip or not bzip” (dictionary {bzip, not, or}); searching P = bot with k = 2 mismatches visits the trie and reports e.g. “not”, which matches bot with one mismatch.]
Agrep: more sophisticated operations
The Shift-And method can be adapted to other operations.
The edit distance between two strings p and s is
d(p,s) = minimum number of operations
needed to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: replace a symbol of p with a different one
Example: d(ananas,banane) = 3
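The three-operation edit distance just defined can be computed with the classic dynamic program (a standard sketch, not the bit-parallel method of these slides):

```python
def edit_distance(p, s):
    """Classic O(|p|*|s|) dynamic program for edit distance with
    insertion, deletion and substitution, each of cost 1."""
    m, n = len(p), len(s)
    # D[i][j] = distance between p[:i] and s[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i              # delete all of p[:i]
    for j in range(n + 1):
        D[0][j] = j              # insert all of s[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == s[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # match / substitution
    return D[m][n]
```

It confirms the example: d(ananas, banane) = 3 (insert b, delete s, substitute a with e).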
Search by regular expressions
Example: (a|b)?(abc|a)
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
γ-code for integer encoding
γ(x) = 0^(Length-1) · (x in binary)
where x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 represented as <000, 1001>.
γ-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of g-coded
integers, reconstruct the original sequence:
0001000001100110000011101100111
→ 8, 6, 3, 59, 7
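The γ-coding scheme above is a few lines of Python (a sketch; function names are ours):

```python
def gamma_encode(x):
    """γ-code of x > 0: (Length-1) zeros followed by x in binary."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode a concatenation of γ-codes back into integers."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i + z] == "0":   # count leading zeros = Length-1
            z += 1
        out.append(int(bits[i + z : i + 2 * z + 1], 2))
        i += 2 * z + 1
    return out
```

Decoding the exercise string reproduces the sequence 8, 6, 3, 59, 7.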
Analysis
Sort pi in decreasing order, and encode si via
the variable-length code γ(i).
Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:
1 ≥ Σi=1,...,x pi ≥ x · px  ⟹  x ≤ 1/px
How good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):
Σi=1,...,|S| pi·|γ(i)|  ≤  Σi=1,...,|S| pi·[ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1
Not much worse than Huffman,
and improvable to H0(X) + 2 + ....
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman
End-tagged dense code
The rank r is mapped to r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix-code
Better compression: it uses all 7-bits configurations
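A possible sketch of the End-Tagged Dense Code idea (not taken from a real ETDC implementation; the exact byte layout may differ): rank 0..127 gets one byte, ranks 128..16511 get two bytes, and so on, with the tag bit 0x80 marking the last byte:

```python
def etdc_encode(rank):
    """End-Tagged Dense Code sketch: map a word's frequency rank
    (0-based) to a byte sequence; bit 0x80 tags the LAST byte."""
    out, r = [], rank
    while True:
        out.append(r % 128)
        r = r // 128 - 1       # "dense": exhaust all shorter codes first
        if r < 0:
            break
    out.reverse()
    out[-1] |= 0x80            # set the end tag
    return bytes(out)
```

The tag makes the code prefix-free and allows resynchronization on byte boundaries; note that all 2^7 configurations of the remaining 7 bits are used, unlike tagged Huffman.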
(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is:
Previously we used: s = c = 128
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
and s·c with 2 bytes, s·c² with 3 bytes, ...
An example
5000 distinct words
ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2
bytes, hence more words on 1 byte; thus it wins if the distribution is skewed...
Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.
Brute-force approach
Binary search:
On real distributions, there seems to be a unique minimum
Ks = max codeword length
Fs(k) = cumulative prob. of symbols whose codeword length is ≤ k
Experiments: (s,c)-DC is rather interesting…
Search is 6% faster than
byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
Exploit temporal locality, and it is dynamic
X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n² log n), MTF = O(n log n) + n²
Not much worse than Huffman
...but it may be far better
MTF: how good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log i + 1
Put |S| in front and consider the cost of encoding, where p_x^i denotes the position of the i-th occurrence of symbol x:
O(|S| log |S|) + Σx=1,...,|S| Σi |γ( p_x^i − p_x^{i−1} )|
By Jensen's inequality, this is at most:
O(|S| log |S|) + Σx=1,...,|S| nx · [ 2·log(N/nx) + 1 ]
 = O(|S| log |S|) + N · [ 2·H0(X) + 1 ]
Hence La[mtf] ≤ 2·H0(X) + O(1)
MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to maintain the MTF-list efficiently:
Search tree
Leaves contain the symbols, ordered as in the MTF-list
Nodes contain the size of their descending subtree
Hash Table
key is a symbol
data is a pointer to the corresponding tree leaf
Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings just numbers and one bit
Properties:
Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n
There is a memory
Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
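A minimal sketch of the RLE step (not from the slides), reproducing the example above:

```python
def rle_encode(s):
    """Run-Length Encoding: collapse each maximal run of equal
    symbols into a (symbol, run-length) pair."""
    out = []
    for c in s:
        if out and out[-1][0] == c:
            out[-1] = (c, out[-1][1] + 1)   # extend the current run
        else:
            out.append((c, 1))              # start a new run
    return out
```

For binary strings only the run lengths (plus the first bit) need to be stored, since the symbols strictly alternate.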
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).
e.g. p(a) = .2, p(b) = .5, p(c) = .3
f(i) = Σj=1,...,i-1 p(j)
f(a) = .0, f(b) = .2, f(c) = .7
so a → [.0, .2), b → [.2, .7), c → [.7, 1.0)
The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: interval refinement. Start from [0,1); b narrows it to [.2,.7); then a narrows it to [.2,.3); then c narrows it to [.27,.3).]
The final sequence interval is [.27,.3)
Arithmetic Coding
To code a sequence of symbols c with probabilities
p[c] use the following recurrences:
l0 = 0,  li = li-1 + si-1 · f[ci]
s0 = 1,  si = si-1 · p[ci]
f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is
sn = Πi=1,...,n p[ci]
The interval for a message sequence will be called the
sequence interval
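These recurrences can be run exactly with rational arithmetic (a sketch with hypothetical function names; real coders use the integer approximation discussed later):

```python
from fractions import Fraction

def arith_interval(msg, p, f):
    """Sequence interval [l, l+s) of msg via the recurrences
    l_i = l_{i-1} + s_{i-1}*f[c],  s_i = s_{i-1}*p[c]."""
    l, s = Fraction(0), Fraction(1)
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, s

def arith_decode(x, n, p, f):
    """Recover the n-symbol message whose sequence interval contains x:
    at each step find the symbol interval containing x, then rescale."""
    syms = sorted(f, key=lambda c: f[c])
    out = []
    for _ in range(n):
        for c in reversed(syms):      # largest f[c] that is <= x
            if x >= f[c]:
                break
        out.append(c)
        x = (x - f[c]) / p[c]         # rescale x into [0,1)
    return "".join(out)
```

With p(a)=.2, p(b)=.5, p(c)=.3 this reproduces both slide examples: "bac" maps to [.27,.3), and decoding .49 with length 3 yields "bbc".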
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: decoding by refinement. .49 lies in b's interval [.2,.7); within it, .49 lies again in b's sub-interval [.3,.55); within that, it lies in c's sub-interval [.475,.55).]
The message is bbc.
Representing a real number
Binary fractional representation:
.75 = .11
1/3 = .0101…
11/16 = .1011
Algorithm (to emit the bits of x ∈ [0,1)):
1. x = 2·x
2. If x < 1 output 0
3. else x = x − 1; output 1
So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01
[.33,.66) = .1 [.66,1) = .11
Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
        min      max      interval
.11     .110…    .111…    [.75, 1.0)
.101    .1010…   .1011…   [.625, .75)
We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).
Sequence Interval
.79
.75
Code Interval (.101)
.61
.625
Can use L + s/2 truncated to 1 + log (1/s) bits
Bound on Arithmetic length
Note that –log s+1 = log (2/s)
Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
 = 1 + log Πi=1,n (1/pi)
 ≤ 2 + Σi=1,n log (1/pi)
 = 2 + Σk=1,|S| n·pk·log (1/pk)
 = 2 + n·H0 bits
nH0 + 0.02 n bits in practice
because of rounding
Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:
Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s
m = 0
Message interval is expanded by 2
If u < R/2 then (bottom half)
Output 0 followed by m 1s
m = 0
Message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2
All other cases: just continue...
You find this at
Arithmetic ToolBox
As a state machine
[Figure: the Arithmetic ToolBox as a state machine: from state (L,s), reading symbol c under distribution (p1,...,p|S|) yields the new state (L',s') = (L + s·f[c], s·p[c]).]
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
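The statistics a k-th order model needs can be gathered in one pass (a simple sketch, not a full PPM implementation; the function name is ours):

```python
from collections import defaultdict

def context_counts(text, k):
    """For every context of length 0..k, count how often each symbol
    follows it; these counts give conditional probabilities like
    p(e|th) = count(th->e) / count(th->*)."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, c in enumerate(text):
        for order in range(min(k, i) + 1):
            counts[text[i - order:i]][c] += 1
    return counts
```

On the string ACCBACCACBA with k = 2 this yields, e.g., 4 occurrences of A in the empty context and 3 occurrences of C after context A, matching the example table below.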
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
[Figure: PPM drives the Arithmetic ToolBox: reading s = c or esc with probability p[s|context] maps state (L,s) to (L',s').]
Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)
PPM: Example Contexts
String = ACCBACCACBA·B   (k = 2)

Context   Counts
Empty     A = 4   B = 2   C = 5   $ = 3

A         C = 3   $ = 1
B         A = 2   $ = 1
C         A = 1   B = 2   C = 2   $ = 3

AC        B = 1   C = 2   $ = 2
BA        C = 1   $ = 1
CA        C = 1   $ = 1
CB        A = 2   $ = 1
CC        A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) as n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)
<2,3,c>
Algorithm’s step:
Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match
Advance by len + 1
A buffer “window” has fixed length and moves
Example: LZ77 with window
a a c a a c a b c a b a a a c
(0,0,a)
a a c a a c a b c a b a a a c
(1,1,c)
a a c a a c a b c a b a a a c
(3,4,b)
a a c a a c a b c a b a a a c
(3,3,a)
a a c a a c a b c a b a a a c
(1,2,c)
Window size = 6
Longest match
within W
Next character
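The example above can be reproduced with a direct (quadratic, unoptimized) sketch of the LZ77 step; names are ours, and real implementations use hashing to find matches:

```python
def lz77_encode(text, window):
    """LZ77 with a sliding window: emit (distance, length, next-char)
    triples; matches may overlap the cursor. The last char is always
    reserved for the 'next char' field."""
    out, cur, n = [], 0, len(text)
    while cur < n:
        best_d, best_len = 0, 0
        for start in range(max(0, cur - window), cur):
            l = 0
            while cur + l < n - 1 and text[start + l] == text[cur + l]:
                l += 1                      # overlap allowed
            if l > best_len:
                best_d, best_len = cur - start, l
        out.append((best_d, best_len, text[cur + best_len]))
        cur += best_len + 1
    return out

def lz77_decode(triples):
    out = []
    for d, l, c in triples:
        for _ in range(l):
            out.append(out[len(out) - d])   # copies may self-overlap
        out.append(c)
    return "".join(out)
```

On the slide's text with window 6 it emits exactly (0,0,a), (1,1,c), (3,4,b), (3,3,a), (1,2,c); the last triple shows a copy overlapping the cursor.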
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if l > d? (overlap with text to be compressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
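The coding loop above fits in a few lines (a sketch with a dict instead of an explicit trie; the trie only matters for efficiency):

```python
def lz78_encode(text):
    """LZ78: emit (phrase-id, next-char) pairs; id 0 = empty phrase.
    After each emission, phrase+char is added to the dictionary."""
    dictionary = {"": 0}
    out, phrase = [], ""
    for c in text:
        if phrase + c in dictionary:
            phrase += c                    # extend the current match
        else:
            out.append((dictionary[phrase], c))
            dictionary[phrase + c] = len(dictionary)
            phrase = ""
    if phrase:                             # flush a pending match
        out.append((dictionary[phrase[:-1]], phrase[-1]))
    return out

def lz78_decode(pairs):
    """The decoder rebuilds the same dictionary from the ids."""
    phrases, out = [""], []
    for idx, c in pairs:
        phrases.append(phrases[idx] + c)
        out.append(phrases[-1])
    return "".join(out)
```

It reproduces the slide's run: aabaacabcabcb → (0,a)(1,b)(1,a)(0,c)(2,c)(5,b).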
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
LZW: Encoding Example
Output
Dict.
a a b a a c a b a b a c b
112 256=aa
a a b a a c a b a b a c b
112 257=ab
a a b a a c a b a b a c b
113
258=ba
a a b a a c a b a b a c b
256
259=aac
a a b a a c a b a b a c b
114
260=ca
a a b a a c a b a b a c b
257
261=aba
a a b a a c a b a b a c b
261
262=abac
a a b a a c a b a b a c b
114
263=cb
LZW: Decoding Example
Input
112
Dict
a
112
a a
256=aa
113
a a b
257=ab
256
a a b a a
258=ba
114
a a b a a c
259=aac
257
a a b a a c a b ?
260=ca
261
261
114
a a b a a c a b a b
261=aba
(the decoder is one step later)
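A compact sketch of LZW, including the special handling of the SSc case where the decoder receives a code it has not built yet (we pass the slide's toy initial codes a=112, b=113, c=114 explicitly; a real coder starts from all 256 byte values):

```python
def lzw_encode(text, first_codes):
    """LZW: emit only phrase codes; the dictionary grows with Sc."""
    dictionary = dict(first_codes)
    next_code = 256
    out, phrase = [], text[0]
    for c in text[1:]:
        if phrase + c in dictionary:
            phrase += c
        else:
            out.append(dictionary[phrase])
            dictionary[phrase + c] = next_code
            next_code += 1
            phrase = c
    out.append(dictionary[phrase])
    return out

def lzw_decode(codes, first_codes):
    inv = {v: k for k, v in first_codes.items()}
    next_code = 256
    prev = inv[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in inv:
            cur = inv[code]
        else:                      # SSc case: code not in dict yet
            cur = prev + prev[0]
        inv[next_code] = prev + cur[0]   # decoder is one step behind
        next_code += 1
        out.append(cur)
        prev = cur
    return "".join(out)
```

On the slide's input the encoder emits 112, 112, 113, 256, 114, 257, 261, 114, ... and code 261 is exactly the "not yet known" case the decoder must resolve.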
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
Consider the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows
(1994)
F                L
#  mississipp  i
i  #mississip  p
i  ppi#missis  s
i  ssippi#mis  s
i  ssissippi#  m
m  ississippi  #
p  i#mississi  p
p  pi#mississ  i
s  ippi#missi  s
s  issippi#mi  s
s  sippi#miss  i
s  sissippi#m  i
L is the BWT of T
A famous example
Much
longer...
A useful tool: L
[Same matrix of sorted rotations as before: F = # i i i i m p p s s s s, L = i p s s m # p i s s i i; the middle of each row is unknown to the decompressor.]
F mapping
How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...
Take two equal L’s chars
Rotate rightward their rows
Same relative order !!
The BWT is invertible
[Same matrix again: F = # i i i i m p p s s s s, L = i p s s m # p i s s i i.]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i ppi #
InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
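The InvertBWT pseudocode above can be sketched in Python (with a naive quadratic forward transform for completeness; names are ours, and the final rotation puts the sentinel back at the end):

```python
def bwt(t):
    """Naive BWT: sort all rotations of t, take the last column.
    t must end with a unique smallest sentinel, here '#'."""
    n = len(t)
    rows = sorted(t[i:] + t[:i] for i in range(n))
    return "".join(row[-1] for row in rows)

def inverse_bwt(L):
    """Invert via the LF-mapping: equal chars keep their relative
    order between L and F, and L[i] precedes F[i] in T."""
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    out, r = [], 0                 # row 0 starts with the sentinel
    for _ in range(n):
        out.append(L[r])
        r = LF[r]
    t = "".join(reversed(out))     # sentinel-first rotation of T
    return t[1:] + t[0]            # rotate the sentinel to the end
```

It reproduces the running example: BWT(mississippi#) = ipssm#pissii, and inverting recovers the text.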
How to compute the BWT ?
SA    BWT matrix (sorted rotations)
12    #mississippi
11    i#mississipp
 8    ippi#mississ
 5    issippi#miss
 2    ississippi#m
 1    mississippi#
10    pi#mississip
 9    ppi#mississi
 7    sippi#missis
 4    sissippi#mis
 6    ssippi#missi
 3    ssissippi#mi
L = ipssm#pissii
We said that: L[i] precedes F[i] in T
L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
How to construct SA from T ?
SA
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#
Elegant but inefficient
Input: T = mississippi#
Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii
# at 16
Mtf = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code
RLE0 = 03141041403141410210
Alphabet
|S|+1
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google, 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node can go to any node via an
undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node can go to any node via a
directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
Largest artifact ever conceived by humankind
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph
V = Routers
E = communication links
The “cosine” graph (undirected, weighted)
V = static web pages
E = semantic distance between pages
Query-Log graph (bipartite, weighted)
V = queries and URLs
E = (q,u) if u is a result for q, and has been clicked by some
user who issued q
Social graph (undirected, unweighted)
V = users
E = (x,y) if x knows y (facebook, address book, email,..)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has an hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1
The In-degree distribution
Altavista crawl, 1999
Indegree follows power law distribution
WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1/k^α,  α ≈ 2.1
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has an hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
[Figure: matrix plot of a web graph with 21 million pages and 150 million links; dot (i,j) marks link i → j.]
URL-sorting
Berkeley
Stanford
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
For negative entries:
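A sketch of this gap encoding (our own function names): only the first entry s1−x can be negative, so it is mapped to a natural number with the usual zig-zag trick v(z) = 2z for z ≥ 0 and 2|z|−1 for z < 0, consistent with the residual examples shown later (e.g. 0 = (15−15)·2 and 3 = |13−15|·2−1):

```python
def gap_encode(x, succ):
    """WebGraph-style gaps for the sorted successor list of node x:
    first v(s1 - x), then the consecutive gaps minus 1."""
    def v(z):                       # 0,-1,1,-2,2,... -> 0,1,2,3,4,...
        return 2 * z if z >= 0 else 2 * abs(z) - 1
    gaps = [v(succ[0] - x)]
    for a, b in zip(succ, succ[1:]):
        gaps.append(b - a - 1)      # sorted list: these are >= 0
    return gaps
```

Locality makes these gaps small, so a universal integer code (e.g. γ) stores them in few bits.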
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.
Consecutivity
3
in extra-nodes
Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
data
knowledge
about data
at receiver
sender
receiver
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
What if the sender has never seen the data at the receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested
page and the ones available in cache
File synchronization
[diff, zdelta, REBL,…]
[rsynch, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
The LZ77-scheme provides an efficient, optimal solution:
fknown is the “previously encoded text”; compress the concatenation fknown·fnew starting from fnew
zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
Client
reference
request
Slow-link
Delta-encoding
Proxy
reference
request
Fast-link
web
Page
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f ∈ F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
[Figure: weighted graph over files 1, 2, 3, 5 plus a dummy node 0; edge weights (e.g. 20, 123, 220, 620, 2000) are zdelta sizes, dummy edges carry the gzip sizes.]
          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic
Improvement
What about
many-to-one compression?
(group of files)
Problem: Constructing G is very costly: n² edge calculations (zdelta execs)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files
Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, still n² time
          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
request
f_new
update
Server
f_old
Client
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
hashes
f_new
Server
encoded file
f_old
Client
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
Rsync: some experiments
         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452
Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
Server sends hashes (unlike the client in rsync); the client checks them
Server deploys the common fref to compress the new ftar (rsync compresses just it).
A multi-round protocol
k blocks of n/k elems
log(n/k) levels
If the distance is ≤ k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k · lg n · lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (ie. T[i,N])
i P
T
T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix
Example: P = si occurs in T = mississippi at positions 4, 7
SUF(T) = Sorted set of suffixes of T
Reduction
From substring search
To prefix search
The Suffix Tree
[Figure: suffix tree of T# = mississippi# (positions 1..12); edge labels include “i”, “s”, “si”, “ssi”, “ppi#”, “pi#”, “i#”, “mississippi#”, and the 12 leaves store the starting positions of the suffixes.]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#
T = mississippi#
Storing SUF(T) explicitly would take Θ(N²) space; SA keeps only the suffix pointers.
P = si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp
[Figure: binary search over SA = 12 11 8 5 2 1 10 9 7 4 6 3 on T = mississippi# for P = si; the probed suffix is smaller, so P is larger — at 2 accesses per step.]
Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp
[Figure: a later step of the same binary search, where P = si is smaller than the probed suffix.]
Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
overall, O(p log2 N) time
improvable to O(p + log2 N) with extra lcp information [Manber-Myers, ’90]
and to O(p + log2 |S|) [Cole et al, ’06]
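The indirected binary search fits in a short sketch (naive O(N² log N) construction, 1-based positions as in the slides; names are ours):

```python
def sa_build(t):
    """Naive suffix array: 1-based suffix start positions,
    sorted by the lexicographic order of the suffixes."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

def sa_search(t, sa, p):
    """Indirected binary search: each step compares p against a
    suffix accessed through SA, at O(|p|) chars per comparison."""
    lo, hi = 0, len(sa)
    while lo < hi:                       # leftmost suffix >= p
        mid = (lo + hi) // 2
        if t[sa[mid] - 1:] < p:
            lo = mid + 1
        else:
            hi = mid
    occ = []                             # Prop 1: occurrences are contiguous
    while lo < len(sa) and t[sa[lo] - 1:].startswith(p):
        occ.append(sa[lo])
        lo += 1
    return sorted(occ)
```

On T = mississippi# it rebuilds the SA of the slides and finds P = si at positions 4 and 7.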
Locating the occurrences
[Figure: the occ = 2 occurrences of P = si in T = mississippi# form the contiguous SA range containing 7 (sippi…) and 4 (sissippi…), delimited by binary searches for si# and si$ — positions 4 and 7.]
Suffix Array search
• O (p + log2 N + occ) time
Suffix Trays: O (p + log2 |S| + occ)  [Cole et al., ‘06]
String B-tree  [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays  [Ciriani et al., ’02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
Lcp = 0 1 1 4 0 0 1 0 2 1 3
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
T = mississippi#
(e.g. Lcp = 4 between the adjacent suffixes issippi# and ississippi#)
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L
Paradigm shift...
Web 2.0 is about the many
Big
DATA Big PC ?
We have three types of algorithms:
T1(n) = n, T2(n) = n2, T3(n) = 2n
... and assume that 1 step = 1 time unit
How many input data n each algorithm may process
within t time units?
n1 = t,
n2 = √t,
n3 = log2 t
What about a k-times faster processor?
...or, what is n, when the time units are k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
A new scenario
Data are more available than even before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Each step is assumed to cost O(1) time
Not just MIN #steps…
[Figure: the memory hierarchy]
CPU registers
Cache:  few Mbs,   some nanosecs,     few words fetched
RAM:    few Gbs,   tens of nanosecs,  some words fetched
HD:     few Tbs,   few millisecs,     B = 32K page
net:    many Tbs,  even secs,         packets
You should be “??-aware programmers”
I/O-conscious Algorithms
track
read/write head
read/write arm
magnetic surface
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O [10^5–10^6 steps (Hennessy-Patterson)]
If N = (1+f)·M, then the D-avg cost per step is:
C · p · f/(1+f)
This is at least 10^4 · f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algo uses all of them:
(1/B) · (p · f/(1+f) · C) ≈ 30 · f/(1+f)
Space-conscious Algorithms
I/Os
search
access
Compressed
data structures
Streaming Algorithms
track
read/write head
read/write arm
magnetic surface
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
CPU registers
Cache:  few Mbs,   some nanosecs,     few words fetched
RAM:    few Gbs,   tens of nanosecs,  some words fetched
HD:     few Tbs,   few millisecs,     B = 32K page
net:    many Tbs,  even secs,         packets
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock and its performance over
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
n     4K   8K   16K  32K   128K  256K  512K  1M
n^3   22s  3m   26m  3.5h  28h   --    --    --
n^2   0    0    0    1s    26s   106s  7m    28m
An optimal solution
We assume every subsum ≠ 0
A = [ <0 part | >0 part (the Optimum) | ... ]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
sum = 0; max = -1;
For i = 1,...,n do
  If (sum + A[i] ≤ 0) then sum = 0;
  else { sum += A[i]; max = MAX{max, sum}; }
Note:
• sum < 0 right before OPT starts;
• sum > 0 within OPT
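The scan above is essentially Kadane's algorithm; a minimal Python sketch (which does not even need the assumption that every subsum is ≠ 0):

```python
def max_subarray(A):
    """One pass: keep the best sum of a window ending here,
    resetting it whenever it would drop to zero or below."""
    best = cur = 0
    for x in A:
        cur = max(cur + x, 0)   # sum <= 0  =>  restart after i
        best = max(best, cur)
    return best

# The array from the slide:
print(max_subarray([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))  # -> 12
```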
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions ⇒ data get distributed arbitrarily !!!
[Figure: B-tree internal nodes and leaves ("tuple pointers") over the tuples.
What about listing the tuples in order ?]
Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;           // Divide
03   Merge-Sort(A,i,m);     // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)         // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10^9 tuples ⇒ few Gbs
Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log2 n) random I/Os
[5ms] * n log2 n ≈ 1.5 years
In practice, it is faster because of caching...
2 passes (R/W)
Merge-Sort Recursion Tree
The tree has depth log2 N.
If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help
[Figure: recursion tree, pairs of sorted runs merged level by level.]
How do we deploy the disk/memory features ?
N/M runs, each sorted in internal memory (no I/Os)
I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
Pass 1: Produce (N/M) sorted runs.
Pass i: merge X = M/B runs ⇒ log_{M/B}(N/M) passes
[Figure: X input buffers + 1 output buffer, each of B items, in main
memory; runs streamed from disk to disk.]
Multiway Merging
[Figure: X = M/B runs, each with a current page Bf1..Bfx and cursor
p1..pX; repeatedly output min(Bf1[p1], Bf2[p2], …, Bfx[pX]) into the
output buffer Bfo; fetch a new page when some pi = B, flush Bfo to the
merged run when full, until EOF.]
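In memory, the X-way merge of the figure can be sketched with a min-heap holding one current element per run (here runs are plain Python lists rather than disk pages):

```python
import heapq

def multiway_merge(runs):
    """Repeatedly extract the minimum among the runs' current
    elements, then advance the run it came from."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    merged = []
    while heap:
        val, i, j = heapq.heappop(heap)
        merged.append(val)
        if j + 1 < len(runs[i]):   # fetch the next element of run i
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return merged

print(multiway_merge([[1, 5, 9], [2, 3, 8], [4, 6, 7]]))
# -> [1, 2, 3, 4, 5, 6, 7, 8, 9]
```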
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os
In practice
M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
One multiway merge ⇒ 2 passes = few mins
Tuning depends
on disk features
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
Can compression help?
Goal: enlarge M and reduce N
#passes = O(logM/B N/M)
Cost of a pass = O(N/B)
Part of Vitter’s paper…
In order to address issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how far we can go
Toy problem #3: Top-freq elements
Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space (i.e., assuming the mode occurs > N/2 times).
A=b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables <X,C>
For each item s of the stream,
  if (X == s) then C++
  else { C--; if (C == 0) { X = s; C = 1; } }
Return X;
Proof
Problems arise if #occ(y) ≤ N/2.
If X ≠ y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥ #occ(y).
As a result, 2 * #occ(y) > N...
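This one-pass, two-variable scheme (the Boyer-Moore majority vote) fits in a few lines; the returned candidate is guaranteed to be the answer only when some item really occurs > N/2 times:

```python
def majority_candidate(stream):
    """One symbol X and one counter C: each non-X item is paired
    off ("negative mate") against one occurrence of X."""
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

# The stream from the slide ('c' occurs 9 times out of 17):
print(majority_candidate("bacccdcbaaaccbccc"))  # -> c
```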
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 * 10^9 ⇒ size = 6Gb
n = 10^6 documents
TotT = 10^9 (avg term length is 6 chars)
t = 5 * 10^5 distinct terms
What kind of data structure should we build to support
word-based searches ?
Solution 1: Term-Doc matrix
t = 500K terms × n = 1 million docs; entry is 1 if the play contains
the word, 0 otherwise:

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1               1              0          0        0        1
Brutus             1               1              0          1        0        0
Caesar             1               1              0          1        1        1
Calpurnia          0               1              0          0        0        0
Cleopatra          1               0              0          0        0        0
mercy              1               0              1          1        1        1
worser             1               0              1          1        1        0

Space is 500Gb !
Solution 2: Inverted index

Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

We can still do better:
1. Typically the index is ~50% of the original text
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
A better index, but still >10 times the (compressed) text !!!!
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string
is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO, they are 2^n, but the shorter compressed messages are fewer:
∑_{i=1}^{n-1} 2^i = 2^n - 2
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log2 (1/p(s)) = -log2 p(s)
Lower probability ⇒ higher information
Entropy is the weighted average of i(s):
H(S) = ∑_{s∈S} p(s) log2 (1/p(s))  bits
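The two formulas transcribe directly; for instance:

```python
import math

def self_information(p):
    """i(s) = log2(1/p(s)) = -log2 p(s)."""
    return -math.log2(p)

def entropy(probs):
    """H(S) = sum over s of p(s) * log2(1/p(s)), in bits."""
    return sum(p * self_information(p) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.25]))  # -> 1.5
```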
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be
uniquely decomposed into its codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: left edge = 0, right edge = 1; leaves a (0), b (100), c (101), d (11).]
Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = ∑_{s∈S} p(s) L[s]
We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same codeword lengths, and thus the
same optimal average length. And vice versa…
Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn},
then pi < pj ⇒ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have
H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that
La(C) ≤ H(S) + 1
[achieved by the Shannon code, which takes ⌈log2 1/p(s)⌉ bits per symbol]
Huffman Codes
Invented by Huffman as a class assignment in the ’50s.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Merge the two least-probable trees repeatedly:
a(.1) + b(.2) → (.3);  (.3) + c(.2) → (.5);  (.5) + d(.5) → (1)
Resulting codewords: a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees
What about ties (and thus, tree depth) ?
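A sketch of the construction with a heap; since ties are broken arbitrarily, any of the equivalent trees may come out, but the codeword lengths stay the same:

```python
import heapq

def huffman_codes(freqs):
    """Repeatedly merge the two least-probable subtrees; each merge
    prepends 0/1 to the codewords of the two halves."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)           # keeps tuple comparisons well-defined
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        codes = {s: "0" + w for s, w in c1.items()}
        codes.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, tiebreak, codes))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5})
print(sorted(len(codes[s]) for s in "abcd"))  # -> [1, 2, 3, 3]
```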
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Example: abc... encodes as 000 001 01 ... = 00000101...
101001... decodes as 1|01|001... = dcb...
[Same running-example tree: a(.1), b(.2) under (.3); c(.2) under (.5); d(.5).]
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store for any level L:
firstcode[L] (the first codeword on that level; 00.....0 on the deepest one)
Symbol[L,i], for each i in level L
This is ≤ h^2 + |S| log |S| bits (h = tree height)
Canonical Huffman
Encoding
[Table: codewords listed level by level, 1…5]
Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0
T=...00010...
Problem with Huffman Coding
Consider a symbol with probability .999. Its
self information is
-log2(.999) ≈ .00144
If we were to send 1000 such symbols we
might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
1 extra bit per macro-symbol = 1/k extra bits per symbol
Larger model to be transmitted
Shannon took infinite sequences, and k → ∞ !!
In practice, we have:
The model takes |S|^k * (k * log |S|) + h^2 bits (where h might be |S|)
It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
[Figure: word-based Huffman tree with fan-out 128 over the words of
T = “bzip or not bzip” (bzip, or, not, space); each codeword is a
sequence of bytes carrying 7 bits of Huffman code plus 1 tag bit that
marks the first byte of a codeword.]
CGrep and other ideas...
P = bzip → its codeword, e.g. 1a 0b
[Figure: GREP run directly over the compressed text C(T) of
T = “bzip or not bzip”: scan the byte-aligned codewords and compare
each against P’s codeword; the tag bits avoid false matches.]
Speed ≈ Compression ratio
You find this under my Software projects
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary: a, bzip, not, or, space (with codewords; e.g. P = bzip = 1a 0b)
[Figure: scan the compressed text C(S) of S = “bzip or not bzip”,
matching P’s codeword against each byte-aligned codeword — yes/no per
position.]
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: pattern P sliding over text T.]
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which Arithmetic and Bit operations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running
time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}^n)
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:
H(s) = ∑_{i=1}^{m} 2^{m-i} * s[i]
P = 0101
H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5
s = s’ if and only if H(s) = H(s’)
Definition:
let Tr denote the m-length substring of T starting at
position r (i.e., Tr = T[r, r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1):
H(Tr) = 2 * H(Tr-1) - 2^m * T[r-1] + T[r+m-1]
T = 10110101
T1 = 1011,  T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2*11 - 2^4*1 + 0 = 22 - 16 = 6
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P=101111
q=7
H(P) = 47
Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally!
(0*2 + 1) (mod 7) = 1
(1*2 + 0) (mod 7) = 2
(2*2 + 1) (mod 7) = 5
(5*2 + 1) (mod 7) = 4
(4*2 + 1) (mod 7) = 2
(2*2 + 1) (mod 7) = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr-1):
2^m (mod q) = 2 * (2^{m-1} (mod q)) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
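A sketch of the deterministic variant over a byte alphabet (base 256); for brevity a small fixed prime q is used here, whereas the algorithm would pick q at random below some I:

```python
def rabin_karp(T, P, q=101):
    """Roll the fingerprint Hq(Tr) from Hq(T_{r-1}) and verify every
    candidate, so false matches are filtered out."""
    n, m = len(T), len(P)
    if m > n:
        return []
    base = 256
    top = pow(base, m - 1, q)            # base^(m-1) mod q
    hp = ht = 0
    for i in range(m):
        hp = (hp * base + ord(P[i])) % q
        ht = (ht * base + ord(T[i])) % q
    hits = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:  # check: definite match
            hits.append(r)
        if r + m < n:                     # roll the window by one
            ht = ((ht - ord(T[r]) * top) * base + ord(T[r + m])) % q
    return hits

print(rabin_karp("10110101", "0101"))  # -> [4]  (0-based, i.e. position 5)
```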
Problem 1: Solution
Dictionary: a, bzip, not, or, space; P = bzip = 1a 0b
[Figure: as before, scan C(S) of S = “bzip or not bzip” comparing
byte-aligned codewords against P’s codeword — two matches.]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for
         c a l i f o r n i a
    j =  1 2 3 4 5 6 7 8 9 10
f (i=1)  0 0 0 0 1 0 0 0 0 0
o (i=2)  0 0 0 0 0 1 0 0 0 0
r (i=3)  0 0 0 0 0 0 1 0 0 0

M(3,7) = 1 ⇒ an occurrence of P ends at position 7 of T.
How does M solve the exact match problem?
How to construct M
We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one
Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:
And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.
Example: BitShift( (0,1,1,0,1) ) = (1,0,1,1,0)
Let w be the word size (e.g., 32 or 64 bits). We’ll assume
m = w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x)[i] = 1 iff P[i] = x.
Example: P = abaac
U(a) = (1,0,1,1,0)
U(b) = (0,1,0,0,0)
U(c) = (0,0,0,0,1)
How to construct M
Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained as
M(j) = BitShift(M(j-1)) & U(T[j])
For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T
ending at character j-1  ⇔  M(i-1,j-1) = 1
(2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1
BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing with the i-th bit of U(T[j]) establishes whether both hold
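The whole method fits in a few lines of Python, using a machine-word-style integer as the column (bit i-1 of the word is row i of M):

```python
def shift_and(T, P):
    """Bit-parallel Shift-And: bit i of column M(j) is 1 iff
    P[1..i] = T[j-i+1..j]; report j when the m-th bit is set."""
    m = len(P)
    U = {}                           # U[x]: bit i set iff P[i+1] == x
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    M, occ = 0, []
    for j, x in enumerate(T):
        # BitShift(M) = (M << 1) | 1, then AND with U(T[j])
        M = ((M << 1) | 1) & U.get(x, 0)
        if M & (1 << (m - 1)):
            occ.append(j - m + 1)    # 0-based starting position
    return occ

print(shift_and("xabxabaaca", "abaac"))  # -> [4]
```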
Worked example: T = xabxabaaca, P = abaac (m = 5, n = 10).
Column M(0) is all zeros; then M(j) = BitShift(M(j-1)) & U(T[j]):
j=1: U(x) = (0,0,0,0,0) ⇒ M(1) = (0,0,0,0,0)
j=2: U(a) = (1,0,1,1,0) ⇒ M(2) = (1,0,0,0,0)
j=3: U(b) = (0,1,0,0,0) ⇒ M(3) = (0,1,0,0,0)
...
j=9: U(c) = (0,0,0,0,1) ⇒ M(9) = (0,0,0,0,1)
Bit 5 of M(9) is 1 ⇒ an occurrence of P ends at position 9.
Shift-And method: Complexity
If m ≤ w, any column and any vector U() fit in a
memory word:
any step requires O(1) time.
If m > w, any column and any vector U() span
⌈m/w⌉ memory words:
any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
Very often in practice. Recall that w=64 bits in
modern architectures.
Some simple extensions
We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)
U(b) = (1,1,0,0,0)
U(c) = (0,0,0,0,1)
What about ‘?’, ‘[^…]’ (not)?
Problem 1: An other solution
Dictionary: a, bzip, not, or, space; P = bzip = 1a 0b
[Figure: the same scan of C(S) for S = “bzip or not bzip”, now driven
by the Shift-And automaton over the codeword bytes.]
Speed ≈ Compression ratio
Problem 2
Dictionary: a, bzip, not, or, space
Given a pattern P, find
all the occurrences in S
of all terms containing
P as a substring
P = o
[Figure: scan of C(S) for S = “bzip or not bzip”; the dictionary terms
containing P are not = 1g 0g 0a and or = 1g 0a 0b.]
Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: patterns P1, P2 sliding over text T.]
Naïve solution
Use an (optimal) exact-matching algorithm to search for each
pattern of P
Complexity: O(nl + m) time — not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns in P
R is a bitmap of length m:
R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method searching for S:
For any symbol c, U’(c) = U(c) AND R, i.e.
U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
For any step j:
compute M(j), then OR it with U’(T[j]). Why?
This sets to 1 the first bit of each pattern that starts with T[j]
Check if there are occurrences ending in j. How?
Problem 3
Dictionary: a, bzip, not, or, space
Given a pattern P, find
all the occurrences in S
of all terms containing
P as a substring allowing
at most k mismatches
P = bot, k = 2
[Figure: scan of C(S) for S = “bzip or not bzip”.]
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa
aatatccacaa
atcgaa
Agrep
Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix M^l to be an m-by-n binary
matrix, such that:
M^l(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.
What is M0?
How does Mk solve the k-mismatch problem?
Computing Mk
We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
Computing M^l: case 1
The first i-1 characters of P match a substring of T ending
at j-1 with at most l mismatches, and the next pair of
characters in P and T are equal:
BitShift(M^l(j-1)) & U(T[j])
Computing M^l: case 2
The first i-1 characters of P match a substring of T ending
at j-1 with at most l-1 mismatches (P[i] may mismatch T[j]):
BitShift(M^{l-1}(j-1))
Computing M^l
We compute M^l for all l = 0, …, k.
For each j compute M(j), M^1(j), …, M^k(j)
For all l initialize M^l(0) to the zero vector.
Combining the two cases, there is a match iff
M^l(j) = [BitShift(M^l(j-1)) & U(T[j])] OR BitShift(M^{l-1}(j-1))
Example M0 and M1
T = xabxabaaca, P = abaad

M0 =          j: 1 2 3 4 5 6 7 8 9 10
  i=1 (a)        0 1 0 0 1 0 1 1 0 1
  i=2 (ab)       0 0 1 0 0 1 0 0 0 0
  i=3 (aba)      0 0 0 0 0 0 1 0 0 0
  i=4 (abaa)     0 0 0 0 0 0 0 1 0 0
  i=5 (abaad)    0 0 0 0 0 0 0 0 0 0

M1 =          j: 1 2 3 4 5 6 7 8 9 10
  i=1            1 1 1 1 1 1 1 1 1 1
  i=2            0 0 1 0 0 1 0 1 1 0
  i=3            0 0 0 1 0 0 1 0 0 1
  i=4            0 0 0 0 1 0 0 1 0 0
  i=5            0 0 0 0 0 0 0 0 1 0

M1(5,9) = 1 ⇒ P occurs ending at position 9 with ≤ 1 mismatch.
How much do we pay?
The running time is O(kn(1+m/w)).
Again, the method is practically efficient for
small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.
Problem 3: Solution
Dictionary: a, bzip, not, or, space
Given a pattern P, find
all the occurrences in S
of all terms containing
P as a substring allowing
k mismatches
P = bot, k = 2
[Figure: scan of C(S) for S = “bzip or not bzip”; e.g. not = 1g 0g 0a
matches P with 2 mismatches.]
Agrep: more sophisticated operations
The Shift-And method can solve other ops
The edit distance between two strings p and s is
d(p,s) = minimum number of operations
needed to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p into a different one
Example: d(ananas,banane) = 3
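For reference, the classic dynamic program for edit distance (not the bit-parallel agrep machinery) is a few lines:

```python
def edit_distance(p, s):
    """D[i][j] = edit distance between p[:i] and s[:j]."""
    m, n = len(p), len(s)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                 # delete all of p[:i]
    for j in range(n + 1):
        D[0][j] = j                 # insert all of s[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + 1,      # deletion
                          D[i][j - 1] + 1,      # insertion
                          D[i - 1][j - 1] + (p[i - 1] != s[j - 1]))
    return D[m][n]

print(edit_distance("ananas", "banane"))  # -> 3
```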
Search by regular expressions
Example: (a|b)?(abc|a)
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
γ-code for integer encoding
x > 0 is encoded as 000...0 (Length-1 zeros) followed by x in binary,
where Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000,1001>.
The γ-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e., a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of g-coded
integers, reconstruct the original sequence:
0001000001100110000011101100111
→ 8, 6, 3, 59, 7
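The exercise can be checked mechanically with a short γ encoder/decoder:

```python
def gamma_encode(x):
    """Length-1 zeros, then x in binary (the binary part starts with 1)."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Count zeros up to the first 1: that many + 1 bits form the value."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))  # -> 0001001
print(gamma_decode("0001000001100110000011101100111"))  # -> [8, 6, 3, 59, 7]
```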
Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code γ(i).
Recall that: |γ(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥ ∑_{i=1,...,x} pi ≥ x * px  ⇒  x ≤ 1/px
How good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):
∑_{i=1,...,|S|} pi * |γ(i)| ≤ ∑_{i=1,...,|S|} pi * [2 * log(1/pi) + 1] = 2 * H0(X) + 1
Not much worse than Huffman,
and improvable to H0(X) + 2 + ....
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman
End-tagged dense code
The rank r is mapped to r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix-code
Better compression: it uses all 7-bits configurations
(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is:
Previously we used: s = c = 128
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...
An example
5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on ≤ 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 words on ≤ 2
bytes, hence more words on 1 byte; thus, if the distribution is skewed, better compression...
Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.
Brute-force approach
Binary search:
on real distributions, there seems to be one unique minimum
Ks = max codeword length
Fsk = cumulative probability of the symbols whose |cw| ≤ k
Experiments: (s,c)-DC is rather interesting…
Search is 6% faster than
byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
It exploits temporal locality, and it is dynamic
X = 1^n 2^n 3^n … n^n ⇒ Huff = Θ(n^2 log n) bits, MTF = O(n log n) + n^2 bits
Not much worse than Huffman
...but it may be far better
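A sketch of the list-based encoder (the tree/hash version below is the efficient one; this linear-scan list is just for clarity):

```python
def mtf_encode(text, alphabet):
    """Output each symbol's current position in the list L,
    then move the symbol to the front of L (the "memory")."""
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)
        out.append(i)
        L.insert(0, L.pop(i))   # move-to-front
    return out

print(mtf_encode("aabbbc", "abc"))  # -> [0, 0, 1, 0, 0, 2]
```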
MTF: how good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2 * log i + 1
Put S in front and consider the cost of encoding
(p_i^x = position of the i-th occurrence of symbol x):
O(|S| log |S|) + ∑_{x=1,...,|S|} ∑_i |γ(p_i^x - p_{i-1}^x)|
By Jensen’s inequality:
≤ O(|S| log |S|) + ∑_{x=1,...,|S|} n_x * [2 * log(N/n_x) + 1]
= O(|S| log |S|) + N * [2 * H0(X) + 1]
⇒ La[mtf] ≤ 2 * H0(X) + O(1)
MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep the MTF-list efficiently:
Search tree:
  leaves contain the symbols, ordered as in the MTF-list
  nodes contain the size of their descending subtree
Hash table:
  key is a symbol
  data is a pointer to the corresponding tree leaf
Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca ⇒ (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one bit suffice
Properties:
It exploits spatial locality, and it is a dynamic code
There is a memory
X = 1^n 2^n 3^n … n^n ⇒ Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)
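The transform itself is a one-pass scan, e.g.:

```python
def rle(s):
    """Collapse each maximal run into a (symbol, length) pair."""
    out = []
    for c in s:
        if out and out[-1][0] == c:
            out[-1] = (c, out[-1][1] + 1)   # extend the current run
        else:
            out.append((c, 1))              # start a new run
    return out

print(rle("abbbaacccca"))
# -> [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]
```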
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).
e.g. f(a) = .0, f(b) = .2, f(c) = .7, where
f(i) = ∑_{j=1}^{i-1} p(j)
a = .2 → [0.0, 0.2)
b = .5 → [0.2, 0.7)
c = .3 → [0.7, 1.0)
The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start with [0, 1):
b = .5 ⇒ [0.2, 0.7)
a = .2 ⇒ [0.2, 0.3)
c = .3 ⇒ [0.27, 0.3)
The final sequence interval is [.27, .3)
Arithmetic Coding
To code a sequence of symbols c with probabilities
p[c] use the following:
l_0 = 0,  s_0 = 1
l_i = l_{i-1} + s_{i-1} * f[c_i]
s_i = s_{i-1} * p[c_i]
f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is
s_n = ∏_{i=1}^{n} p[c_i]
The interval for a message sequence will be called the
sequence interval
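The two recurrences transcribe directly (floating-point here just to illustrate; a real coder uses the integer version discussed below):

```python
def sequence_interval(msg, p, f):
    """Shrink [l, l+s): l_i = l_{i-1} + s_{i-1}*f[c_i],
    s_i = s_{i-1}*p[c_i]."""
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, l + s

p = {"a": .2, "b": .5, "c": .3}
f = {"a": .0, "b": .2, "c": .7}   # cumulative probs, symbol excluded
lo, hi = sequence_interval("bac", p, f)
print(round(lo, 3), round(hi, 3))  # -> 0.27 0.3
```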
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
.49 ∈ [.2,.7) ⇒ b;  rescale: (.49-.2)/.5 = .58 ∈ [.2,.7) ⇒ b;
rescale: (.58-.2)/.5 = .76 ∈ [.7,1) ⇒ c.
The message is bbc.
Representing a real number
Binary fractional representation:
.75 = .11     1/3 = .010101...     11/16 = .1011
Algorithm
1. x = 2 * x
2. If x < 1 output 0
3. else x = x - 1; output 1
So how about just using the shortest binary
fractional representation in the sequence
interval?
e.g. [0,.33) → .01    [.33,.66) → .1    [.66,1) → .11
Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions:

  codeword   min      max      interval
  .11        .110...  .111...  [.75, 1.0)
  .101       .1010..  .1011..  [.625, .75)
We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).
E.g. the sequence interval [.61, .79) contains the code interval of
.101, i.e. [.625, .75).
Can use L + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits
Note that -log2 s + 1 = log2 (2/s)
Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log2 (1/s) = 1 + log2 ∏_{i=1,n} (1/p_i)
≤ 2 + ∑_{i=1,n} log2 (1/p_i)
= 2 + ∑_{k=1,|S|} n*p_k * log2 (1/p_k)
= 2 + n H0 bits
In practice ≈ nH0 + 0.02 n bits,
because of rounding
Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:
Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s; m = 0
Message interval is expanded by 2
If u < R/2 then (bottom half)
Output 0 followed by m 1s; m = 0
Message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2
In all other cases, just continue...
You find this at
Arithmetic ToolBox
As a state machine: ATB maps the current interval (L,s) and a symbol c,
with distribution (p1,....,pS), to the new interval (L’,s’).
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
The ATB is driven by p[s|context], where s = c or esc.
Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)
PPM: Example Contexts
String = ACCBACCACBA B,  k = 2

Context ∅:   A=4  B=2  C=5  $=3
Context A:   C=3  $=1
Context B:   A=2  $=1
Context C:   A=1  B=2  C=2  $=3
Context AC:  B=1  C=2  $=2
Context BA:  C=1  $=1
Context CA:  C=1  $=1
Context CB:  A=2  $=1
Context CC:  A=1  B=1  $=2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
Dictionary = the text before the Cursor
(all substrings starting here); e.g. output <2,3,c>
Algorithm’s step:
Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match
Advance by len + 1
A buffer “window” has fixed length and moves
Example: LZ77 with window
Window size = 6; output <d, len, next char>
(longest match within W, plus the next character):
a a c a a c a b c a b a a a c → (0,0,a)
a a c a a c a b c a b a a a c → (1,1,c)
a a c a a c a b c a b a a a c → (3,4,b)
a a c a a c a b c a b a a a c → (3,3,a)
a a c a a c a b c a b a a a c → (1,2,c)
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if len > d? (the copy overlaps the text still being written)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
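The copy-left-to-right trick above can be checked with a short decoder. This is a sketch of the triple-based scheme in the slides (d, len, c); the optional `seed` argument stands in for text already decoded, an assumption added to replay the "seen = abcd" example.

```python
def lz77_decode(triples, seed=""):
    """Decode LZ77 triples (d, len, c). Copying one char at a time,
    left to right, makes the self-overlapping case len > d work for free."""
    out = list(seed)
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):
            out.append(out[start + i])   # may read a char just written: overlap OK
        out.append(c)
    return "".join(out)
```

The windowed example of the slides and the overlapping codeword (2,9,e) both decode as claimed.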
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
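The coding loop of LZ78 can be sketched directly from the three steps above (longest match, emit id + next char, add Sc). A minimal version, with a plain dict in place of the trie:

```python
def lz78_encode(text):
    """LZ78: emit (phrase-id, next-char) pairs; id 0 is the empty phrase.
    The dictionary is a plain dict here, a trie in the real algorithm."""
    dictionary = {}            # phrase -> id
    output = []
    i = 0
    while i < len(text):
        phrase = ""
        # find the longest match S already in the dictionary
        while i < len(text) and phrase + text[i] in dictionary:
            phrase += text[i]
            i += 1
        c = text[i] if i < len(text) else ""
        output.append((dictionary.get(phrase, 0), c))
        dictionary[phrase + c] = len(dictionary) + 1   # add Sc
        i += 1
    return output
```

On the slide's input it reproduces the six pairs of the coding example.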
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue when the encoder uses a code it created only one step
earlier (inputs of the form cScSc…): the decoder is missing that entry
and must resolve it specially as S + S[0] !!!
LZW: Encoding Example
Output
Dict.
a a b a a c a b a b a c b
112 256=aa
a a b a a c a b a b a c b
112 257=ab
a a b a a c a b a b a c b
113
258=ba
a a b a a c a b a b a c b
256
259=aac
a a b a a c a b a b a c b
114
260=ca
a a b a a c a b a b a c b
257
261=aba
a a b a a c a b a b a c b
261
262=abac
a a b a a c a b a b a c b
114
263=cb
LZW: Decoding Example
Input
112
Dict
a
112
a a
256=aa
113
a a b
257=ab
256
a a b a a
258=ba
114
a a b a a c
259=aac
257
257 → a a b a a c a b ?      260=ca
261 → code 261 is not yet in the decoder’s dictionary: one step later it is
      resolved as 261 = aba (previous string + its first char)
114 → a a b a a c a b a b a c
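The "one step behind" rule can be made concrete in a few lines. This is a sketch, not the full 256-entry ASCII setup of the slides: a 3-symbol alphabet is assumed here to keep the example readable, with a=0, b=1, c=2.

```python
def lzw_decode(codes, alphabet):
    """LZW decoder: rebuilds the dictionary one step behind the encoder.
    A code not yet in the dictionary is the special case cScSc...:
    the missing entry must be prev + prev[0]."""
    dictionary = {i: ch for i, ch in enumerate(alphabet)}
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:
            entry = prev + prev[0]                     # the special case
        dictionary[len(dictionary)] = prev + entry[0]  # one step behind the coder
        out.append(entry)
        prev = entry
    return "".join(out)
```

With the alphabet "abc", encoding "aabaacababacb" yields the code stream below, including code 8 used right after its creation.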
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows                         (Burrows-Wheeler, 1994)

F  (rotation)        L
#  mississipp        i
i  #mississip        p
i  ppi#missis        s
i  ssippi#mis        s
i  ssissippi#        m
m  ississippi        #
p  i#mississi        p
p  pi#mississ        i
s  ippi#missi        s
s  issippi#mi        s
s  sippi#miss        i
s  sissippi#m        i

T = mississippi#  →  L = i p s s m # p i s s i i
A famous example
Much
longer...
A useful tool: L → F mapping
(same sorted-rotation matrix as above: F = # i i i i m p p s s s s,
L = i p s s m # p i s s i i; the text T is unknown to the decoder)
How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...
Take two equal L’s chars
Rotate rightward their rows
Same relative order !!
The BWT is invertible
(again the sorted-rotation matrix: F = # i i i i m p p s s s s,
L = i p s s m # p i s s i i; T unknown)
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i ppi #
InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
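The two properties and the InvertBWT pseudocode translate almost line by line. A naive sketch, assuming a unique sentinel `#` that sorts before every other character; the real construction of course avoids materializing the rotations.

```python
def bwt(text):
    """Naive BWT: sort all rotations, take the last column."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(L):
    """Invert via the LF-mapping: L[i] precedes F[i] in T."""
    n = len(L)
    F = sorted(L)
    first = {}                        # first row of each char in F
    for i, c in enumerate(F):
        first.setdefault(c, i)
    seen = {}                         # rank of each char occurrence in L
    LF = []
    for c in L:
        LF.append(first[c] + seen.get(c, 0))   # same relative order!
        seen[c] = seen.get(c, 0) + 1
    out = []
    r = 0                             # row 0 of the sorted matrix starts with '#'
    for _ in range(n):
        out.append(L[r])              # collect T backward
        r = LF[r]
    s = "".join(reversed(out))        # '#' followed by the text
    return s[1:] + s[0]               # rotate the sentinel back to the end
```

Round-tripping the example of the slides recovers mississippi#.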
How to compute the BWT ?
SA    BWT matrix row    L
12    #mississipp       i
11    i#mississip       p
 8    ippi#missis       s
 5    issippi#mis       s
 2    ississippi#       m
 1    mississippi       #
10    pi#mississi       p
 9    ppi#mississ       i
 7    sippi#missi       s
 4    sissippi#mi       s
 6    ssippi#miss       i
 3    ssissippi#m       i
We said that: L[i] precedes F[i] in T
L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
How to construct SA from T ?
SA    suffix
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#
Elegant but inefficient
Input: T = mississippi#
Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii   (# falls at position 16)
Mtf list = [i,m,p,s]
Mtf  = 020030000030030200300300000100000
Shifted by one, to reserve symbol 0 for the runs:
       030040000040040300400400000200000
RLE0 = 03141041403141410210   (run-lengths of 0s in Wheeler’s code, e.g. Bin(6)=110)
Alphabet of |S|+1 symbols.
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google, 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node one can reach any other node via an
undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node one can reach any other node via a
directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
The largest artifact ever conceived by humankind
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph
  V = routers
  E = communication links
The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages
Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q that has been clicked by some
  user who issued q
Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email,..)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: the probability that a node has x links is ∝ 1/x^α, α ≈ 2.1
The In-degree distribution
Altavista crawl, 1999
Indegree follows power law distribution
WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1 / k^α,  α ≈ 2.1
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: the probability that a node has x links is ∝ 1/x^α, α ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
j
i
21 million pages, 150 million links
URL-sorting
Berkeley
Stanford
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
For negative entries:
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.
Consecutivity
3
in extra-nodes
Intervals: use their left extreme and length
  interval length: decremented by Lmin = 2
Residuals: differences between consecutive residuals, or with the source,
mapped to naturals (v ≥ 0 → 2v, v < 0 → 2|v|−1):
  0    = (15−15)·2         (positive)
  2    = (23−19)−2         (jump ≥ 2)
  600  = (316−16)·2
  3    = |13−15|·2 − 1     (negative)
  3018 = 3041−22−1
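The locality-exploiting gap encoding of successor lists can be sketched as follows. The node id 15 and its successor list are borrowed from the canonical WebGraph example (an assumption, not taken verbatim from these slides); `nat` is the positive/negative-to-natural mapping used for the first, possibly negative, entry.

```python
def gap_encode(x, successors):
    """WebGraph-style gaps: first entry is s1 - x (may be negative),
    then consecutive differences minus 1 (lists are strictly increasing)."""
    s = sorted(successors)
    return [s[0] - x] + [s[i] - s[i - 1] - 1 for i in range(1, len(s))]

def nat(v):
    """Map a possibly-negative gap to a natural number:
    v >= 0 -> 2v, v < 0 -> 2|v| - 1."""
    return 2 * v if v >= 0 else 2 * (-v) - 1
```

Locality makes most gaps tiny, so a universal code (gamma, zeta, ...) then compresses them to a few bits each.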
Algoritmi per IR
Compression of file collections
Background
data
knowledge
about data
at receiver
sender
receiver
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
How if the sender has never seen data at receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested
page and the ones available in cache
File synchronization
[diff, zdelta, REBL,…]
[rsynch, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
LZ77-scheme provides an efficient, optimal solution
fknown is “previously encoded text”, compress fknownfnew starting from fnew
zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
Client
reference
request
Slow-link
Delta-encoding
Proxy
reference
request
Fast-link
web
Page
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
(figure: example weighted graph over files 1, 2, 3, 5 plus the dummy node 0;
edge weights are zdelta sizes, dummy edges carry the plain gzip sizes)
          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic
Improvement
What about
many-to-one compression?
(group of files)
Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files
Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
request
f_new
update
Server
f_old
Client
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
hashes
f_new
Server
encoded file
f_old
Client
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
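The rolling hash that makes rsync's block matching cheap can be sketched as an Adler-32-style checksum: the two 16-bit halves can be updated in O(1) when the window slides by one byte. This is a sketch in the spirit of rsync's weak hash, not its exact implementation.

```python
MOD = 65521   # largest prime < 2^16, as in Adler-32

def weak_hash(block):
    """Two 16-bit sums packed in one int: a = sum of bytes,
    b = position-weighted sum."""
    a = sum(block) % MOD
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % MOD
    return (b << 16) | a

def roll(h, out_byte, in_byte, blocklen):
    """Slide the window one byte to the right in O(1)."""
    a = h & 0xFFFF
    b = h >> 16
    a = (a - out_byte + in_byte) % MOD
    b = (b - blocklen * out_byte + a) % MOD
    return (b << 16) | a
```

The client hashes every window position of f_old this way, and only on a weak-hash hit computes the strong (MD5-like) hash.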
Rsync: some experiments
         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452
Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
Server sends the hashes (unlike the client in rsync), and the client checks them.
Server deploys the common fref to compress the new ftar (rsync compresses just it).
A multi-round protocol
k blocks of n/k elems
Log n/k levels
If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (ie. T[i,N])
i P
T
T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix
P = si
T = mississippi
mississippi
4,7
SUF(T) = Sorted set of suffixes of T
Reduction
From substring search
To prefix search
The Suffix Tree
T# = mississippi#
(figure: the suffix tree of T#. The root branches on #, i, mississippi#, p, s;
internal edges carry substrings such as ssi, si, ppi#, pi#, i#; the 12 leaves
store the starting positions 1..12 of the suffixes.)
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position in SA is the lexicographic position of P.

SUF(T) = sorted set of suffixes of T; storing it explicitly takes Θ(N²) space.

T = mississippi#      P = si
SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#          ← suffix pointers for P = si
 4    sissippi#       ←
 6    ssippi#
 3    ssissippi#

Suffix Array space: SA takes Θ(N log₂ N) bits, the text T takes N chars.
In practice, a total of 5N bytes.
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison,
2 memory accesses per binary-search step.
T = mississippi#, P = si: compare P with the suffix pointed to by the middle
SA entry; if P is larger, recurse on the right half; if P is smaller,
recurse on the left half.
Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char cmps → overall, O(p log₂ N) time
Improvable to O(p + log₂ N) with Lcp information [Manber-Myers, ’90],
and to O(p + log₂ |S|) [Cole et al., ’06]
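The indirect binary search above can be sketched end to end. The quadratic-time SA construction mirrors the "elegant but inefficient" slide; `sa_range` finds the contiguous SA range of suffixes prefixed by P with two binary searches (function names are illustrative).

```python
def suffix_array(T):
    """Naive construction: sort suffix start positions by comparison."""
    return sorted(range(len(T)), key=lambda i: T[i:])

def sa_range(T, SA, P):
    """Return (lo, hi) s.t. SA[lo:hi] are the suffixes having P as a prefix.
    Each comparison looks at at most |P| chars: O(p log n) total."""
    n, m = len(SA), len(P)
    lo, hi = 0, n
    while lo < hi:                                # leftmost suffix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + m] < P:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    lo, hi = start, n
    while lo < hi:                                # leftmost suffix > P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + m] <= P:
            lo = mid + 1
        else:
            hi = mid
    return start, lo
```

On mississippi# with P = si the range holds the two occurrences at (0-based) positions 3 and 6, i.e. 4 and 7 in the slides' 1-based numbering.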
Locating the occurrences
T = mississippi#, P = si: binary search for the two extremes of the SA range,
i.e. the smallest and largest suffixes having P as a prefix.
The range contains the entries SA = 7 (sippi…) and SA = 4 (sissippi…),
hence occ = 2, at positions 4 and 7.
Suffix Array search
• O(p + log₂ N + occ) time
Suffix Trays: O(p + log₂ |S| + occ)   [Cole et al., ‘06]
String B-tree
[Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA.

T  = mississippi#
SA  = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
Lcp = [ 0,  1, 1, 4, 0, 0,  1, 0, 2, 1, 3]
(e.g. Lcp = 4 between issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...]?
  Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L?
  Search for Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
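The mining queries above reduce to scans of the Lcp array; a naive sketch (the quadratic Lcp construction stands in for the linear-time algorithms used in practice):

```python
def lcp_array(T, SA):
    """Lcp[i] = longest common prefix of suffixes SA[i] and SA[i+1]."""
    def lcp(a, b):
        k = 0
        while a + k < len(T) and b + k < len(T) and T[a + k] == T[b + k]:
            k += 1
        return k
    return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

def has_repeat(T, SA, L):
    """Does a substring of length >= L occur at least twice?"""
    return any(v >= L for v in lcp_array(T, SA))
```

On mississippi# the maximum Lcp entry is 4 (issi, shared by issippi# and ississippi#), so length-4 repeats exist and length-5 repeats do not.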
Paradigm shift...
Web 2.0 is about the many
Big
DATA Big PC ?
We have three types of algorithms:
T1(n) = n, T2(n) = n2, T3(n) = 2n
... and assume that 1 step = 1 time unit
How many input data n each algorithm may process
within t time units?
n1 = t,
n2 = √t,
n3 = log2 t
What about a k-times faster processor?
...or, what is n, when the time units are k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
A new scenario
Data are more available than ever before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Step cost is Ω(1) time
Not just MIN #steps…
The memory hierarchy: CPU (registers) — L1 — L2 cache — RAM — HD — net
  Cache: few Mbs, some nanosecs, few words fetched
  RAM:   few Gbs, tens of nanosecs, some words fetched
  HD:    few Tbs, few millisecs, B = 32K page
  net:   many Tbs, even secs, packets
You should be “??-aware programmers”
I/O-conscious Algorithms
track
read/write head
read/write arm
magnetic surface
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses   [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O   [10⁵–10⁶ (Hennessy-Patterson)]
If N = (1+f)M, then the disk-average cost per step is:
  C · p · f/(1+f)   — at least 10⁴ · f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
  (1/B) · (p · f/(1+f) · C)  ≈  30 · f/(1+f)
Space-conscious Algorithms
I/Os
search
access
Compressed
data structures
Streaming Algorithms
track
read/write head
read/write arm
magnetic surface
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
(figure: the same CPU — cache — RAM — HD — net memory hierarchy as before,
with the same sizes and access times)
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
n     4K    8K    16K   32K    128K   256K   512K   1M
n³    22s   3m    26m   3.5h   28h    --     --     --
n²    0     0     0     1s     26s    106s   7m     28m
An optimal solution
We assume every prefix sum ≠ 0. Scanning A, the running sum is < 0 just
before the optimum starts and > 0 within the optimum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
  sum = 0; max = -∞;
  for i = 1,...,n do
    if (sum + A[i] ≤ 0) sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }
Note:
• sum < 0 when OPT starts;
• sum > 0 within OPT
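The single-scan algorithm above (Kadane's algorithm) is a few lines of code; this sketch also handles the all-negative case by tracking the best before resetting:

```python
def max_subarray(A):
    """One linear scan: extend the running sum, record the best,
    reset whenever the running sum drops to <= 0."""
    best = float("-inf")
    cur = 0
    for x in A:
        cur += x
        best = max(best, cur)
        if cur <= 0:
            cur = 0        # a prefix ending here cannot help the optimum
    return best
```

On the slides' array the optimum is 6 + 1 - 2 + 4 + 3 = 12.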
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort Q(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions Data get distributed arbitrarily !!!
B-tree internal nodes
B-tree leaves
(“tuple pointers")
What about
listing tuples
in order ?
Tuples
Possibly 10⁹ random I/Os = 10⁹ × 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;          // Divide
03   Merge-Sort(A,i,m);    // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)        // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n=109 tuples few Gbs
Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log₂ n) random I/Os
[5ms] × n log₂ n ≈ 1.5 years
In practice, it is faster because of caching...
2 passes (R/W)
Merge-Sort Recursion Tree
(figure: the log₂ N levels of the recursion tree, each level merging pairs of
sorted runs of doubling length)
If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help.
How do we deploy the disk/memory features?
Use internal memory M: N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log₂ (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge.
Sort N items with main memory M and disk pages of B items:
  Pass 1: produce N/M sorted runs.
  Pass i: merge X = M/B runs at a time  →  log_{M/B} N/M merge passes.
Multiway merging (figure): X input buffers Bf1..BfX of B items, one per run,
plus one output buffer Bfo. Repeatedly move min(Bf1[p1], Bf2[p2], …, BfX[pX])
to Bfo; fetch the next page of run i when pi = B, and flush Bfo to the merged
run on disk when it is full (until EOF).
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≈ log_{M/B} N/M
Optimal cost = Θ((N/B) log_{M/B} N/M) I/Os
In practice
  M/B ≈ 1000  →  #passes = log_{M/B} N/M ≈ 1
  → one multiway merge: 2 passes (R/W) = few mins
Tuning depends
on disk features
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
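The two phases above (run formation, then one multiway merge) can be sketched with the standard library; here `heapq.merge` plays the role of the X-way merge with a heap over the run heads, and the "disk" is just a Python list.

```python
import heapq
import itertools

def external_sort(items, M):
    """Multi-way merge-sort sketch: sort runs of M items "in memory",
    then a single multiway merge of all runs. Returns (sorted, #runs)."""
    it = iter(items)
    runs = []
    while True:
        run = sorted(itertools.islice(it, M))   # Pass 1: one run per M items
        if not run:
            break
        runs.append(run)                        # each run would live on disk
    # Pass 2: X-way merge -- heapq.merge keeps a heap over the run heads
    return list(heapq.merge(*runs)), len(runs)
```

With M/B large enough, all runs fit in one merge pass, matching the "2 passes = few mins" estimate.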
Can compression help?
Goal: enlarge M and reduce N
#passes = O(logM/B N/M)
Cost of a pass = O(N/B)
Part of Vitter’s paper…
In order to address issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how much we can go
Toy problem #3: Top-freq elements
Goal: top queries over a stream of N items (with S large).
Math Problem: find the item y whose frequency is > N/2, using the smallest
possible space (i.e. the mode, provided it occurs > N/2 times).
A = b a c c c d c b a a a c c b c c c
Algorithm
Use one pair of variables <X,C>
For each item s of the stream:
  if (C == 0) { X = s; C = 1; }
  else if (X == s) C++;
  else C--;
Return X;
Proof sketch: if X ≠ y at the end, then every occurrence of y was cancelled
by a distinct “negative” mate, so the mates are ≥ #occ(y) and N ≥ 2·#occ(y) —
contradicting #occ(y) > N/2.
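The pair-of-variables scheme above is the Boyer-Moore majority vote; a direct sketch (note the candidate is guaranteed correct only when a true majority exists):

```python
def majority_candidate(stream):
    """One pass, O(1) space: pair <X, C> as in the slide's algorithm.
    X is the true majority element iff some item occurs > N/2 times."""
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1           # s cancels one occurrence of the candidate
    return X
```

On the slides' stream the item c occurs 9 times out of 17 and is returned.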
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms
What kind of data structure should we build to support
word-based searches?
Solution 1: Term-Doc matrix   (t = 500K terms  ×  n = 1 million docs)

            Antony&Cleopatra  JuliusCaesar  TheTempest  Hamlet  Othello  Macbeth
Antony             1               1            0          0       0        1
Brutus             1               1            0          1       0        0
Caesar             1               1            0          1       1        1
Calpurnia          0               1            0          0       0        0
Cleopatra          1               0            0          0       0        0
mercy              1               0            1          1       1        1
worser             1               0            1          1       1        0

1 if the play contains the word, 0 otherwise. Space is 500Gb !
Solution 2: Inverted index

Brutus    → 2 4 8 16 32 64 128
Caesar    → 1 2 3 5 8 13 21 34
Calpurnia → 2 13 16

1. Typically the index is 30–50% of the original text — we can still do better
2. We have 10⁹ total terms → at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is > 10 times the compressed text !!!!
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits?
NO: there are 2ⁿ of them, but the shorter encodings number at most
  ∑_{i=1}^{n-1} 2^i = 2ⁿ − 2  <  2ⁿ
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self-information of s is:
  i(s) = log₂ (1/p(s)) = −log₂ p(s)   bits
Lower probability → higher information.
Entropy is the weighted average of i(s):
  H(S) = ∑_{s∈S} p(s) · log₂ (1/p(s))   bits
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be
uniquely decomposed into their codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0
1
0 1
a
0
1
b
c
d
Average Length
For a code C with codeword lengths L[s], the average length is defined as
  La(C) = ∑_{s∈S} p(s) · L[s]
We say that a prefix code C is optimal if, for all prefix codes C’,
La(C) ≤ La(C’).
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…
Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}
then pi < pj  ⇒  L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any
uniquely decodable code C, we have
  H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any probability distribution, there
exists a prefix code C such that
  La(C) ≤ H(S) + 1
(the Shannon code assigns ⌈log₂ 1/p(s)⌉ bits to s)
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Repeatedly merge the two least-probable nodes:
  a(.1) + b(.2) → (.3);  (.3) + c(.2) → (.5);  (.5) + d(.5) → (1)
Codewords: a = 000, b = 001, c = 01, d = 1
There are 2^{n-1} “equivalent” Huffman trees.
What about ties (and thus, tree depth) ?
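The merge-the-two-least-probable loop is easy to sketch with a heap (a common textbook construction, with an index as deterministic tie-breaker; exact codewords depend on how ties are broken, but the lengths do not):

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman code from {symbol: probability};
    returns {symbol: bitstring}."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    cnt = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)      # two least-probable trees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, cnt, merged))
        cnt += 1
    return heap[0][2]
```

On the running example the codeword lengths come out 3, 3, 2, 1 for a, b, c, d, matching the tree of the slides, and the resulting code is prefix-free.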
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Using the tree above (a=000, b=001, c=01, d=1):
  abc…     →  000 001 01 …  =  00000101
  101001…  →  d c b
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store, for every level L of the tree:
  firstcode[L]  (the first codeword on level L, of the form 00…0 padded)
  Symbol[L,i], for each i in level L
This is ≤ h² + |S| log |S| bits.
Canonical Huffman: Encoding and Decoding
firstcode[1]=2, firstcode[2]=1, firstcode[3]=1, firstcode[4]=2, firstcode[5]=0
Decoding T = ...00010...: extend the current code value bit by bit, level by
level, until it identifies a valid codeword on some level L, then emit the
corresponding Symbol[L,i].
Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
−log₂(.999) ≈ .00144 bits
If we were to send 1000 such symbols we might hope to use
1000 · .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol,
so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
  1 extra bit per macro-symbol = 1/k extra bits per symbol
  but a larger model has to be transmitted
Shannon took infinite sequences, and k → ∞ !!
In practice, we have:
  the model takes |S|^k (k · log |S|) + h² bits (where h might be |S|)
  and H₀(S^L) ≤ L · H_k(S) + O(k · log |S|), for each k ≤ L
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
(figure: word-based tagged Huffman for T = “bzip or not bzip”. Each codeword
is a sequence of bytes: 1 tag bit marking the first byte of a codeword plus
7 bits of Huffman code; e.g. bzip = 1a 0b over byte-symbols a, b, g.)
CGrep and other ideas...
P = bzip = 1a 0b
Compress the pattern with the same word-based Huffman and run GREP directly
on C(T): the tag bits enforce codeword alignment, so candidate matches inside
other codewords are rejected (yes / no tests on the byte positions).
T = “bzip or not bzip”
Speed ≈ Compression ratio
You find this under my Software projects
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary = { bzip, not, or, space }, compressed with the word-based tagged
Huffman; P = bzip = 1a 0b.
Search P directly in C(S), with S = “bzip or not bzip”: candidate byte
positions are tested (yes / no / no / yes) thanks to the tag bits.
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
(figure: the pattern P slid along the text T, checking each alignment)
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.
We will consider a binary alphabet. (i.e., T={0,1}n)
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:
  H(s) = ∑_{i=1}^{m} 2^{m−i} · s[i]
P = 0101 → H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’)
Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1):
H(T_r) = 2·H(T_{r-1}) − 2^m·T[r−1] + T[r+m−1]
T=10110101
T1 = 1011,  T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6 = H(0110) ✓
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bit-long numbers: in general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P = 101111, q = 7
H(P) = 47,  Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally!
1·2 + 0 (mod 7) = 2
2·2 + 1 (mod 7) = 5
5·2 + 1 (mod 7) = 4
4·2 + 1 (mod 7) = 2
2·2 + 1 (mod 7) = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr-1), since
2^m (mod q) = 2·( 2^(m-1) (mod q) ) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
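The steps above can be sketched in Python; this is a minimal illustration over a binary text, where the fixed prime q stands in for the random prime chosen below I, and the verification step makes it the deterministic variant:

```python
def karp_rabin(T, P, q=2147483647):
    # T, P are lists of bits; q plays the role of the random prime <= I
    n, m = len(T), len(P)
    if m > n or m == 0:
        return []
    hp = ht = 0
    for i in range(m):                   # Hq(P) and Hq(T1) via Horner's rule
        hp = (2 * hp + P[i]) % q
        ht = (2 * ht + T[i]) % q
    pow_m = pow(2, m - 1, q)             # 2^(m-1) mod q, to drop the leading bit
    occ = []
    for r in range(n - m + 1):
        # verification step -> "definite match" (deterministic variant)
        if ht == hp and T[r:r + m] == P:
            occ.append(r)
        if r + m < n:
            # slide the window: drop T[r], append T[r+m], all mod q
            ht = (2 * (ht - T[r] * pow_m) + T[r + m]) % q
    return occ
```

On T = 10110101 and P = 0101 this reports the single occurrence starting at index 4 (position 5 in the 1-based notation of the slides).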
Problem 1: Solution
Dictionary = { a, bzip, not, or }
P = bzip = 1a 0b
S = “bzip or not bzip”
[diagram: the codeword 1a 0b of P is matched against C(S); every yes
marks an occurrence of bzip in S]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for   (M is an m × n matrix)

          c a l i f o r n i a
          1 2 3 4 5 6 7 8 9 10
  f    1  0 0 0 0 1 0 0 0 0 0
  fo   2  0 0 0 0 0 1 0 0 0 0
  for  3  0 0 0 0 0 0 1 0 0 0
How does M solve the exact match problem?
How to construct M
We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one
Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:
And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.
BitShift( (0,1,1,0,1)ᵗ ) = (1,0,1,1,0)ᵗ
Let w be the word size (e.g., 32 or 64 bits). We’ll assume
m ≤ w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one
We define the m-length binary vector U(x) for each character x in the
alphabet: U(x)[i] = 1 for the positions i in P where character x appears.
Example: P = abaac
U(a) = (1,0,1,1,0)ᵗ    U(b) = (0,1,0,0,0)ᵗ    U(c) = (0,0,0,0,1)ᵗ
How to construct M
Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained by
M(j) = BitShift( M(j-1) ) & U( T[j] )
For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T
ending at character j-1   ⇔  M(i-1,j-1) = 1
(2) P[i] = T[j]           ⇔  the i-th bit of U(T[j]) = 1
BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing it with the i-th bit of U(T[j]) establishes whether both hold
An example j=1
T = xabxabaaca (n = 10),  P = abaac (m = 5)
M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵗ & U(x)
     = (1,0,0,0,0)ᵗ & (0,0,0,0,0)ᵗ = (0,0,0,0,0)ᵗ
An example j=2
T = xabxabaaca,  P = abaac
M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵗ & U(a)
     = (1,0,0,0,0)ᵗ & (1,0,1,1,0)ᵗ = (1,0,0,0,0)ᵗ
An example j=3
T = xabxabaaca,  P = abaac
M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵗ & U(b)
     = (1,1,0,0,0)ᵗ & (0,1,0,0,0)ᵗ = (0,1,0,0,0)ᵗ
An example j=9
T = xabxabaaca,  P = abaac
The columns of M computed so far:
            1 2 3 4 5 6 7 8 9
       1    0 1 0 0 1 0 1 1 0
       2    0 0 1 0 0 1 0 0 0
       3    0 0 0 0 0 0 1 0 0
       4    0 0 0 0 0 0 0 1 0
       5    0 0 0 0 0 0 0 0 1
M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵗ & U(c)
     = (1,1,0,0,1)ᵗ & (0,0,0,0,1)ᵗ = (0,0,0,0,1)ᵗ
The last bit of M(9) is set: an occurrence of P ends at position 9.
Shift-And method: Complexity
If m ≤ w, any column and any vector U() fit in a
memory word: each step requires O(1) time.
If m > w, any column and any vector U() span
⌈m/w⌉ memory words: each step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
Very often in practice. Recall that w=64 bits in
modern architectures.
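As a concrete sketch, the Shift-And method above translates directly into Python, where an arbitrary-precision integer plays the role of the machine word holding a column of M:

```python
def shift_and(T, P):
    U = {}
    for i, c in enumerate(P):            # U[c] has bit i set iff P[i+1] = c
        U[c] = U.get(c, 0) | (1 << i)
    M, occ, last = 0, [], 1 << (len(P) - 1)
    for j, c in enumerate(T):
        # M(j) = BitShift(M(j-1)) & U(T[j])
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & last:                     # last row set: P ends at position j
            occ.append(j - len(P) + 1)   # 0-based starting position
    return occ
```

On the running examples: T = california, P = for gives one occurrence starting at index 4; T = xabxabaaca, P = abaac likewise gives index 4 (the match ending at position 9 of the slides).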
Some simple extensions
We want to allow the pattern to contain special symbols, like the
class of chars [a-f]
P = [a-b]baac
U(a) = (1,0,1,1,0)ᵗ    U(b) = (1,1,0,0,0)ᵗ    U(c) = (0,0,0,0,1)ᵗ
What about ‘?’, ‘[^…]’ (not) ?
Problem 1: Another solution
Dictionary = { a, bzip, not, or }
P = bzip = 1a 0b
S = “bzip or not bzip”
[diagram: the codeword 1a 0b of P is matched directly against the
codeword sequence C(S), answering yes/no at each position]
Speed ≈ Compression ratio
Problem 2
Dictionary = { a, bzip, not, or }
Given a pattern P, find all the occurrences in S of all terms
containing P as substring
P = o
S = “bzip or not bzip”
not = 1g 0g 0a
or  = 1g 0a 0b
[diagram: both codewords are searched for in C(S)]
Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[diagram: the occurrences of P1 and P2 in T]
Naïve solution
Use an (optimal) Exact Matching Algorithm searching each
pattern in P
Complexity: O(nl+m) time, not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns in P
R is a bitmap of length m:
R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method searching for S:
For any symbol c, U’(c) = U(c) AND R, i.e.
U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
For any step j:
compute M(j), then M(j) OR U’(T[j]). Why?
This sets to 1 the first bit of each pattern that starts with T[j]
Check if there are occurrences ending in j. How?
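A hedged sketch of this multi-pattern variant in Python: besides R, it uses a mask F marking the last symbol of each pattern (an assumption not spelled out above, but this is the natural way to answer the final "How?"):

```python
def multi_shift_and(T, patterns):
    # S = concatenation of the patterns; R marks first symbols, F last symbols
    S = "".join(patterns)
    U, R, F, pos = {}, 0, 0, 0
    for p in patterns:
        R |= 1 << pos                    # first symbol of this pattern
        F |= 1 << (pos + len(p) - 1)     # last symbol of this pattern
        pos += len(p)
    for i, c in enumerate(S):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        Uc = U.get(c, 0)
        # usual Shift-And step, then OR with U'(T[j]) = U(T[j]) AND R
        M = (((M << 1) | 1) & Uc) | (Uc & R)
        if M & F:                        # some pattern ends at position j
            occ.append(j)
    return occ
```

For example, searching {ab, ba} in aabba reports patterns ending at positions 2 and 4 (0-based).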
Problem 3
Dictionary = { a, bzip, not, or }
Given a pattern P, find all the occurrences in S of all terms
containing P as substring, allowing at most k mismatches
P = bot, k = 2
S = “bzip or not bzip”
[diagram: the codewords of the matching terms are searched for in C(S)]
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4; it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
   atcgaa        (position 4, 2 mismatches)
aatatccacaa
 atcgaa          (position 2, 4 mismatches)
Agrep
Our current goal: given k, find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:
Ml(i,j) = 1 iff there are no more than l mismatches
between the first i characters of P and the i
characters of T ending at character j.
What is M0?
How does Mk solve the k-mismatch problem?
Computing Mk
We compute Ml for all l=0, … ,k.
For each j compute M0(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that Ml(i,j) = 1
iff one of the two following cases holds:
Computing Ml: case 1
The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal. This contributes:
BitShift( Ml(j-1) ) & U(T[j])
Computing Ml: case 2
The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches (the l-th mismatch is
spent on T[j]). This contributes:
BitShift( Ml-1(j-1) )
Computing Ml
We compute Ml for all l = 0, … , k.
For each j compute M0(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
Combining the two cases, we obtain:
Ml(j) = [ BitShift( Ml(j-1) ) & U(T[j]) ]  OR  BitShift( Ml-1(j-1) )
Example M1
T = xabxabaaca,   P = abaad

M0 =      1 2 3 4 5 6 7 8 9 10
     1    0 1 0 0 1 0 1 1 0 1
     2    0 0 1 0 0 1 0 0 0 0
     3    0 0 0 0 0 0 1 0 0 0
     4    0 0 0 0 0 0 0 1 0 0
     5    0 0 0 0 0 0 0 0 0 0

M1 =      1 2 3 4 5 6 7 8 9 10
     1    1 1 1 1 1 1 1 1 1 1
     2    0 0 1 0 0 1 0 1 1 0
     3    0 0 0 1 0 0 1 0 0 1
     4    0 0 0 0 1 0 0 1 0 0
     5    0 0 0 0 0 0 0 0 1 0
How much do we pay?
The running time is O( k·n·(1 + m/w) )
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time.
Hence, the space used by the algorithm is O(k) memory words
(when m ≤ w).
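The recurrence above can be sketched in Python, again with integers as bit-parallel words; M[l] holds the current column of Ml:

```python
def agrep_mismatch(T, P, k):
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    last = 1 << (len(P) - 1)
    M = [0] * (k + 1)                    # M[l] = current column of M^l
    occ = []
    for j, c in enumerate(T):
        Uc = U.get(c, 0)
        prev = M[0]                      # old M^(l-1)(j-1), saved before updating
        M[0] = ((M[0] << 1) | 1) & Uc
        for l in range(1, k + 1):
            cur = M[l]
            # case 1 (match on T[j])  OR  case 2 (one more mismatch on T[j])
            M[l] = (((cur << 1) | 1) & Uc) | ((prev << 1) | 1)
            prev = cur
        if M[k] & last:
            occ.append(j - len(P) + 1)   # 0-based starting position
    return occ
```

On the earlier example, P = atcgaa in T = aatatccacaa with k = 2 reports the single occurrence starting at 0-based index 3 (position 4 of the slides).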
Problem 3: Solution
Dictionary = { a, bzip, not, or }
Given a pattern P, find all the occurrences in S of all terms
containing P as substring, allowing k mismatches
P = bot, k = 2
S = “bzip or not bzip”
not = 1g 0g 0a
[diagram: the matching codewords are found in C(S)]
Agrep: more sophisticated operations
The Shift-And method can solve other problems.
The edit distance between two strings p and s is
d(p,s) = minimum number of operations needed to
transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol of p into a different one
Example: d(ananas,banane) = 3
Search by regular expressions
Example: (a|b)?(abc|a)
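For reference, here is the textbook dynamic-programming computation of d(p,s) over the three ops (a plain sketch, not the bit-parallel Shift-And variant alluded to above):

```python
def edit_distance(p, s):
    # D[i][j] = distance between p[0:i] and s[0:j]
    m, n = len(p), len(s)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                      # delete all of p[0:i]
    for j in range(n + 1):
        D[0][j] = j                      # insert all of s[0:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + 1,                         # deletion
                          D[i][j - 1] + 1,                         # insertion
                          D[i - 1][j - 1] + (p[i - 1] != s[j - 1]))  # substitution
    return D[m][n]
```

It confirms the example above: d(ananas, banane) = 3.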
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with GBs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
g-code for integer encoding
g(x) = 000...0 (Length−1 zeros) followed by x in binary
where x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000,1001>.
The g-code for x takes 2·⌊log2 x⌋ + 1 bits
(i.e., a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of g-coded
integers, reconstruct the original sequence:
0001000001100110000011101100111
8
6
3
59
7
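A minimal Python sketch of the g-coder and decoder, which also solves the exercise above:

```python
def gamma_encode(x):
    b = bin(x)[2:]                       # x in binary, Length = len(b)
    return "0" * (len(b) - 1) + b        # Length-1 zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":            # count the Length-1 leading zeros
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out
```

Decoding the exercise string 0001000001100110000011101100111 yields 8, 6, 3, 59, 7 as listed.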
Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code g(i).
Recall that |g(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:
1 ≥ Σ_{i=1,...,x} pi ≥ x·px   ⟹   x ≤ 1/px
How good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):
Σ_{i=1,..,S} pi·|g(i)|  ≤  Σ_{i=1,..,S} pi·[ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1
Not much worse than Huffman,
and improvable to H0(X) + 2 + ....
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7 bits: just those of Huffman
End-tagged dense code
The rank r is mapped to the r-th binary sequence of 7·k bits
First bit of the last byte is tagged
A better encoding
Surprising changes:
It is a prefix-code
Better compression: it uses all the 7-bit configurations
(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is: s + c = 256 (we are playing with 8 bits);
previously we used s = c = 128.
Thus s items are encoded with 1 byte,
s·c with 2 bytes, s·c² with 3 bytes, ...
An example
5000 distinct words
ETDC encodes 128 + 128² = 16512 words on ≤ 2 bytes
(230,26)-dense code encodes 230 + 230·26 = 6210 words on ≤ 2
bytes, hence more words on 1 byte, and thus it wins if skewed...
Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 − s.
Brute-force approach, or
binary search: on real distributions there seems to be a unique minimum
Ks = max codeword length
Fs,k = cumulative probability of the symbols whose |cw| ≤ k
Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than
byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
Exploit temporal locality, and it is dynamic
X = 1^n 2^n 3^n … n^n:  Huff = Θ(n² log n) bits, MTF = O(n log n) + n² bits
Not much worse than Huffman
...but it may be far better
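The two steps of the coder translate into a few lines of Python (a sketch; the integer output would then be g-coded):

```python
def mtf_encode(s, alphabet):
    L = list(alphabet)                   # start with the list of symbols
    out = []
    for c in s:
        i = L.index(c)
        out.append(i)                    # 1) output the position of c in L
        L.pop(i)
        L.insert(0, c)                   # 2) move c to the front of L
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        c = L.pop(i)                     # the decoder replays the same moves
        out.append(c)
        L.insert(0, c)
    return "".join(out)
```

Note how runs of equal symbols become runs of zeros: temporal locality is turned into small integers.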
MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2·log i + 1
Put S in front, and consider the cost of encoding symbol x
(occurring nx times, at positions p1 < p2 < …):
Cost ≤ O(S log S) + Σ_{x=1,..,S} Σ_{i≥2} |g( pi − pi-1 )|
By Jensen’s inequality:
Cost ≤ O(S log S) + Σ_{x=1,..,S} nx·[ 2·log(N/nx) + 1 ]
     = O(S log S) + N·[ 2·H0(X) + 1 ]
Hence  La[mtf] ≤ 2·H0(X) + O(1)
MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep the MTF-list efficiently:
Search tree:
leaves contain the symbols, ordered as in the MTF-list;
nodes contain the size of their descending subtree
Hash table:
key is a symbol,
data is a pointer to the corresponding tree leaf
Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings just numbers and one bit
Properties:
Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n
There is a memory
Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
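A one-function sketch of the run-length encoder, reproducing the example above:

```python
def rle_encode(s):
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:   # extend the current run
            j += 1
        out.append((s[i], j - i))            # (symbol, run length)
        i = j
    return out
```

On binary strings one would emit just the run lengths plus the first bit, as noted above.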
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval in the range from 0
(inclusive) to 1 (exclusive), of width p(i), starting at
f(i) = Σ_{j=1,...,i-1} p(j)
e.g. f(a) = .0, f(b) = .2, f(c) = .7:
a = .2  →  [0, .2)
b = .5  →  [.2, .7)
c = .3  →  [.7, 1.0)
The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac   (a = .2, b = .5, c = .3)
start:  [0, 1.0)
b:      [.2, .7)
a:      [.2, .3)
c:      [.27, .3)
The final sequence interval is [.27,.3)
Arithmetic Coding
To code a sequence of symbols c with probabilities
p[c] use the following:
l0 = 0,  s0 = 1
li = li-1 + si-1 · f[ci]
si = si-1 · p[ci]
f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is  sn = Π_{i=1,...,n} p[ci]
The interval for a message sequence will be called the
sequence interval
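The recurrence above is a three-line loop; this sketch computes the sequence interval [l, l+s) for the example message:

```python
def sequence_interval(msg, p, f):
    # l_0 = 0, s_0 = 1;  l_i = l_{i-1} + s_{i-1}*f[c_i];  s_i = s_{i-1}*p[c_i]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]
        s = s * p[c]
    return l, s                          # sequence interval [l, l+s)

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
l, s = sequence_interval("bac", p, f)    # the example above: [.27, .3)
```

(Done here with floats for illustration only; the integer version discussed later avoids arbitrary-precision arithmetic.)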
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the final interval
uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
[0, 1.0):   .49 ∈ [.2, .7)    →  b
[.2, .7):   .49 ∈ [.3, .55)   →  b
[.3, .55):  .49 ∈ [.475, .55) →  c
The message is bbc.
Representing a real number
Binary fractional representation:
.75 = .11
1/3 = .01010101...
11/16 = .1011
Algorithm
1. x = 2·x
2. If x < 1 output 0
3. else x = x − 1; output 1
So how about just using the shortest binary
fractional representation in the sequence interval?
e.g. [0,.33) = .01    [.33,.66) = .1    [.66,1) = .11
Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
        min      max      interval
.11     .110     .111     [.75, 1.0)
.101    .1010    .1011    [.625, .75)
We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).
e.g. sequence interval [.61, .79), code interval (.101) = [.625, .75)
Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits
Bound on Arithmetic length
Note that −log s + 1 = log (2/s)
Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) = 1 + log Π_{i=1,n} (1/pi)
≤ 2 + Σ_{i=1,n} log (1/pi)
= 2 + Σ_{k=1,|S|} n·pk · log (1/pk)
= 2 + n·H0 bits
In practice ≈ nH0 + 0.02·n bits,
because of rounding
Integer Arithmetic Coding
The problem is that operations on arbitrary-precision
real numbers are expensive.
Key Ideas of integer version:
Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s; m = 0
Message interval is expanded by 2
If u < R/2 then (bottom half)
Output 0 followed by m 1s; m = 0
Message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2
All other cases: just continue...
You find this at
Arithmetic ToolBox
As a state machine: ATB(L,s) + symbol c with distribution (p1,....,pS)
→ (L’,s’), where L’ = L + s·f(c) and s’ = s·p(c)
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
ATB(L,s) + symbol s with probability p[ s|context ], where s = c or esc
→ (L’,s’)
Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)
PPM: Example Contexts
String = ACCBACCACBA (next symbol: B), k = 2

Context Empty:   A = 4   B = 2   C = 5   $ = 3

Order-1 contexts:
A:   C = 3   $ = 1
B:   A = 2   $ = 1
C:   A = 1   B = 2   C = 2   $ = 3

Order-2 contexts:
AC:  B = 1   C = 2   $ = 2
BA:  C = 1   $ = 1
CA:  C = 1   $ = 1
CB:  A = 2   $ = 1
CC:  A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
Dictionary = all substrings starting before the Cursor;
e.g. output <2,3,c>
Algorithm’s step:
Output <d, len, c> where
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match
Advance by len + 1
A buffer “window” has fixed length and moves with the cursor
Example: LZ77 with window (size = 6)
a a c a a c a b c a b a a a c   → (0,0,a)
a a c a a c a b c a b a a a c   → (1,1,c)
a a c a a c a b c a b a a a c   → (3,4,b)
a a c a a c a b c a b a a a c   → (3,3,a)
a a c a a c a b c a b a a a c   → (1,2,c)
Each triple reports the longest match within W and the next character
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if len > d? (the copy overlaps the text being written)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
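The decoding loop above, including the overlap case, is a few lines of Python:

```python
def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):          # char-by-char copy handles len > d
            out.append(out[start + i])
        out.append(c)                    # the extra character of the triple
    return "".join(out)
```

It reproduces both the overlapping example (seen = abcd, codeword (2,9,e)) and the windowed example above.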
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use a shorter match so
that the next match is better
Hash table to speed up the search for matches (on char triplets)
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
LZW: Encoding Example
Output
Dict.
a a b a a c a b a b a c b
112 256=aa
a a b a a c a b a b a c b
112 257=ab
a a b a a c a b a b a c b
113
258=ba
a a b a a c a b a b a c b
256
259=aac
a a b a a c a b a b a c b
114
260=ca
a a b a a c a b a b a c b
257
261=aba
a a b a a c a b a b a c b
261
262=abac
a a b a a c a b a b a c b
114
263=cb
LZW: Decoding Example
Input                        Dict.
112    a
112    a a                   256=aa
113    a a b                 257=ab
256    a a b a a             258=ba
114    a a b a a c           259=aac
257    a a b a a c a b ?     260=ca
261    ... 261 is not in the dictionary yet!
114    a a b a a c a b a b   261=aba   (built one step later)
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
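A compact Python sketch of LZW, including the decoder's special case; for simplicity the toy dictionary starts from the given alphabet rather than the 256 ascii entries:

```python
def lzw_encode(s, alphabet):
    dic = {c: i for i, c in enumerate(alphabet)}   # toy initial dictionary
    out, w = [], ""
    for c in s:
        if w + c in dic:
            w += c                       # extend the current match S
        else:
            out.append(dic[w])           # emit id(S); do NOT emit c ...
            dic[w + c] = len(dic)        # ... but still add Sc to the dict
            w = c
    out.append(dic[w])
    return out

def lzw_decode(codes, alphabet):
    dic = {i: c for i, c in enumerate(alphabet)}
    w = dic[codes[0]]
    out = [w]
    for k in codes[1:]:
        # special case: k was created by the encoder just one step ago
        entry = dic[k] if k in dic else w + w[0]
        out.append(entry)
        dic[len(dic)] = w + entry[0]     # the decoder is one step behind
        w = entry
    return "".join(out)
```

With alphabet abc (a=0, b=1, c=2) the string aabaacababacb produces the same parse as the slide example, with codes shifted to start from 3 instead of 256.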
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows   (1994)
F               L
#  mississipp   i
i  #mississip   p
i  ppi#missis   s
i  ssippi#mis   s
i  ssissippi#   m
m  ississippi   #
p  i#mississi   p
p  pi#mississ   i
s  ippi#missi   s
s  issippi#mi   s
s  sippi#miss   i
s  sissippi#m   i
A famous example
Much
longer...
A useful tool: L → F mapping
[the same sorted BWT matrix as above, with first column F and last column L]
How do we map L’s chars onto F’s chars ?
... we need to distinguish equal chars in F...
Take two equal L’s chars
Rotate rightward their rows
Same relative order !!
The BWT is invertible
[again the sorted BWT matrix, with columns F and L]
Two key properties:
1. LF-array maps L’s chars to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i ppi #
InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
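The whole round trip fits in a few lines of Python. This sketch builds the BWT by naive rotation sorting and inverts it with the LF property; the inversion walks forward from the row that is T itself (the one whose last char is the sentinel), exploiting the stability of Python's sort:

```python
def bwt(T):
    # T must end with a unique, lexicographically smallest sentinel '#'
    rot = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(r[-1] for r in rot)

def ibwt(L):
    F = sorted(L)
    # stable sort realizes the key property above: equal chars keep their
    # relative order, so nxt[j] is the row whose rotation starts one
    # position after row j's rotation
    nxt = sorted(range(len(L)), key=lambda i: L[i])
    r = L.index('#')                     # row of the rotation equal to T
    out = []
    for _ in range(len(L)):
        out.append(F[r])
        r = nxt[r]
    return "".join(out)
```

For T = mississippi# this yields L = ipssm#pissii, matching the matrix above, and inverting L recovers T.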
How to compute the BWT ?
We said that L[i] precedes F[i] in T; hence, given SA and T:
L[i] = T[ SA[i] − 1 ]

SA    BWT matrix     L
12    #mississipp    i
11    i#mississip    p
8     ippi#missis    s
5     issippi#mis    s
2     ississippi#    m
1     mississippi    #
10    pi#mississi    p
9     ppi#mississ    i
7     sippi#missi    s
4     sissippi#mi    s
6     ssippi#miss    i
3     ssissippi#m    i

e.g. L[3] = T[ SA[3] − 1 ] = T[7]
How to construct SA from T ?
SA    suffix
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi#
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#
Input: T = mississippi#
Elegant but inefficient. Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii
# at 16
Mtf-list = [i,m,p,s]
Mtf  = 020030000030030200300300000100000
Mtf’ = 030040000040040300400400000200000
(symbols bumped by one; run-lengths written à la Wheeler, e.g. Bin(6)=110)
RLE0 = 03141041403141410210
Alphabet of |S|+1 symbols
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node one can go to any
other node via an undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node one can go to any
other node via a directed path.
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
The largest artifact ever conceived by humankind
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph (undirected)
V = Routers, E = communication links
The “cosine” graph (undirected, weighted)
V = static web pages, E = semantic distance between pages
Query-Log graph (bipartite, weighted)
V = queries and URLs
E = (q,u) if u is a result for q, and has been clicked by some
user who issued q
Social graph (undirected, unweighted)
V = users
E = (x,y) if x knows y (facebook, address book, email,..)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has an hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Key property:
Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1
The In-degree distribution
Pr[ in-degree(u) = k ]  ∝  1/k^a,   a ≈ 2.1
Indegree follows a power law distribution
(measured on the Altavista crawl, 1999, and the WebBase crawl, 2001)
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has an hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
[picture: adjacency matrix of a crawl with 21 millions of pages and 150
millions of links; URL-sorting clusters the links of each host, e.g.
Berkeley and Stanford]
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
For negative entries:
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of the copy-list of x tells whether the corresponding successor
of the reference node is also a successor of x;
the reference index is chosen in [0,W] so as to give the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with copy blocks.
Consecutivity in extra-nodes
Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals, or wrt the source
Examples:
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
data
knowledge
about data
at receiver
sender
receiver
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
What if the sender has never seen the data at the receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression   [diff, zdelta, REBL,…]
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested
page and the ones available in cache
File synchronization   [rsync, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression (one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
The LZ77-scheme provides an efficient, optimal solution:
fknown is the “previously encoded text”; compress fknownfnew starting from fnew
zdelta is one of the best implementations

          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs
Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
[diagram: Client ↔ (slow link, delta-encoding) ↔ Proxy ↔ (fast link) ↔ web;
reference page and request travel along the links]
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
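A rough sketch of the graph-building step, using zlib's preset dictionary as a stand-in for zdelta edge weights, and a greedy acyclic reference choice (each file references the dummy node or an earlier file) instead of the exact min branching, which would need Edmonds' algorithm:

```python
import zlib

def delta_size(ref: bytes, tgt: bytes) -> int:
    # zlib's preset dictionary plays the role of the reference file
    co = zlib.compressobj(9, zlib.DEFLATED, 15, 8, zlib.Z_DEFAULT_STRATEGY, ref)
    return len(co.compress(tgt) + co.flush())

def gzip_size(tgt: bytes) -> int:
    # weight of the edge from the dummy node (plain compression)
    return len(zlib.compress(tgt, 9))

def choose_references(files):
    """For each file pick the cheapest incoming edge: the dummy node (None)
    or an earlier file. Restricting to earlier files keeps the graph acyclic,
    so the result is a valid (if not optimal) branching."""
    plan = []
    for i, f in enumerate(files):
        best_size, best_ref = gzip_size(f), None
        for j in range(i):
            d = delta_size(files[j], f)
            if d < best_size:
                best_size, best_ref = d, j
        plan.append((best_size, best_ref))
    return plan
```

Building all n² edges is exactly the cost the pruning approaches below try to avoid.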
          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic
Improvement
What about
many-to-one compression?
(group of files)
Problem: Constructing G is very costly: n² edge calculations (zdelta executions)
We wish to exploit some pruning approach
Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta compression. Build a sparse weighted graph G’F containing only edges between those pairs of files
Assign weights: estimate appropriate edge weights for G’F, thus saving zdelta executions. Nonetheless, strictly n² time
          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
[Diagram: the Client holds f_old, the Server holds f_new; the client sends an update request and receives what it needs to rebuild f_new]
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
[Diagram: the Client sends the block hashes of f_old; the Server replies with the encoded file (matched block ids + literal bytes), from which the client rebuilds f_new]
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
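A toy sketch of the block-matching idea (BLOCK, weak() and the op format are illustrative; real rsync rolls an Adler-style checksum in O(1) per position and pairs it with MD5, while here the weak hash is simply recomputed):

```python
import hashlib

BLOCK = 8  # toy block size; rsync defaults to max(700, sqrt(n)) bytes

def weak(buf: bytes) -> int:
    # stand-in for rsync's rolling checksum
    return sum(buf) & 0xFFFF

def client_hashes(f_old: bytes):
    """Client side: one (weak, strong) hash pair per block of f_old."""
    table = {}
    for b in range(0, len(f_old), BLOCK):
        blk = f_old[b:b + BLOCK]
        table[(weak(blk), hashlib.md5(blk).hexdigest())] = b // BLOCK
    return table

def server_encode(f_new: bytes, table):
    """Server side: emit block ids where f_new matches a block of f_old,
    literal bytes elsewhere."""
    ops, lit, i = [], bytearray(), 0
    while i < len(f_new):
        blk = f_new[i:i + BLOCK]
        key = (weak(blk), hashlib.md5(blk).hexdigest())
        if len(blk) == BLOCK and key in table:
            if lit:
                ops.append(bytes(lit)); lit = bytearray()
            ops.append(table[key])   # send just the block index
            i += BLOCK
        else:
            lit.append(f_new[i]); i += 1
    if lit:
        ops.append(bytes(lit))
    return ops

def client_decode(f_old: bytes, ops):
    out = bytearray()
    for op in ops:
        out += f_old[op * BLOCK:(op + 1) * BLOCK] if isinstance(op, int) else op
    return bytes(out)
```

Note how the granularity problem shows up: a single inserted byte shifts all later data, and only the weak-hash scan at every offset lets the matching realign.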
Rsync: some experiments
          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452
Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
Server sends hashes (unlike the client in rsync); the client checks them
Server deploys the common f_ref to compress the new f_tar (rsync just compresses it).
A multi-round protocol
k blocks of n/k elems
Log n/k levels
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (ie. T[i,N])
[Figure: P aligned at position i of T, i.e. as a prefix of the suffix T[i,N]]
Occurrences of P in T = All suffixes of T having P as a prefix
Example: P = si, T = mississippi ⇒ occurrences at positions 4, 7
SUF(T) = Sorted set of suffixes of T
Reduction
From substring search
To prefix search
The Suffix Tree
T# = mississippi#
[Figure: suffix tree of T#; edges carry substring labels (i, s, si, ssi, ppi#, pi#, i#, mississippi#, …) and the 12 leaves store the starting positions 1…12 of the corresponding suffixes]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N²) space; the suffix array SA keeps only the starting positions, in lexicographic order of the suffixes:

T = mississippi#

SA   suffix
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#

P = si ⇒ the matching suffixes (suffix pointers 7 and 4) are contiguous in SA.
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
SA = [12 11 8 5 2 1 10 9 7 4 6 3], T = mississippi#, P = si
[Figure: the probed middle suffix is lexicographically smaller than P ⇒ “P is larger”, so the search recurses on the right half; 2 memory accesses per step (one to SA, one to the text)]
P = si
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
[Figure: a later probe finds a suffix greater than P ⇒ “P is smaller”, so the search recurses on the left half]
Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
overall, O(p log2 N) time
Improvable to O(p + log₂ N) time [Manber-Myers, ’90], and to O(p + log₂ |S|) [Cole et al, ’06]
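The indirect binary search above can be sketched in a few lines (0-based positions; the quadratic suffix_array builder is only for toy inputs):

```python
def suffix_array(t: str):
    # toy O(n^2 log n) construction; fine for small examples
    return sorted(range(len(t)), key=lambda i: t[i:])

def search(t: str, sa, p: str):
    """Indirect binary search on SA: each comparison reads O(|p|) chars of T."""
    lo, hi = 0, len(sa)
    while lo < hi:                                    # leftmost suffix >= p
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:                                    # past the last suffix == p
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])                       # 0-based starting positions
```

Because the matching suffixes are contiguous in SA, two binary searches delimit the whole occurrence range.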
Locating the occurrences
SA = [12 11 8 5 2 1 10 9 7 4 6 3], T = mississippi#, P = si ⇒ occ = 2
[Figure: the contiguous SA range of the suffixes starting with “si” — entries 7 (sippi…) and 4 (sissippi…) — tells where the occurrences are]
Suffix Array search
• O(p + log₂ N + occ) time
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree
[Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
T = mississippi#

SA   suffix          Lcp (with next suffix)
12   #                0
11   i#               1
 8   ippi#            1
 5   issippi#         4
 2   ississippi#      0
 1   mississippi#     0
10   pi#              1
 9   ppi#             0
 7   sippi#           2
 4   sissippi#        1
 6   ssippi#          3
 3   ssissippi#       —

e.g. Lcp between issippi# and ississippi# is 4 (“issi”)
• How long is the common prefix between T[i,...] and T[j,...] ?
• It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a run Lcp[i,i+C-2] whose entries are all ≥ L
Paradigm shift...
Web 2.0 is about the many
Big DATA ⇒ Big PC ?
We have three types of algorithms:
T1(n) = n, T2(n) = n², T3(n) = 2ⁿ
... and assume that 1 step = 1 time unit
How many input data n each algorithm may process
within t time units?
n1 = t,
n2 = √t,
n3 = log2 t
What about a k-times faster processor?
...or, what is n, when the time units are k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
A new scenario
Data are more available than even before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Step cost is Ω(1) time
Not just MIN #steps…
[Memory hierarchy: CPU with registers → L1/L2 cache → RAM → HD → net]

level   size       access time        granularity
Cache   few Mbs    some nanosecs      few words fetched
RAM     few Gbs    tens of nanosecs   some words fetched
HD      few Tbs    few millisecs      B = 32K page
net     many Tbs   even secs          packets
You should be “??-aware programmers”
I/O-conscious Algorithms
[Figure: magnetic disk — surface, tracks, read/write head and arm]
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O [10⁵–10⁶ steps (Hennessy-Patterson)]
If N = (1+f)M, then the disk-average cost per step is:
    C * p * f/(1+f)
This is at least 10⁴ * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
    (1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)
Space-conscious Algorithms
I/Os
search
access
Compressed
data structures
Streaming Algorithms
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
n     4K    8K   16K   32K    128K   256K   512K   1M
n³    22s   3m   26m   3.5h   28h    --     --     --
n²    0     0    0     1s     26s    106s   7m     28m
An optimal solution
We assume every subsum≠0
[Figure: A drawn as a line — the region before the Optimum window has sum < 0, every prefix within the window has sum > 0]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
  sum = 0; max = -1;
  for i = 1,...,n do
      if (sum + A[i] ≤ 0) then sum = 0;
      else { sum += A[i]; max = MAX{max, sum}; }
Note:
  • sum < 0 when OPT starts;
  • sum > 0 within OPT
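The scanning algorithm, restated as runnable Python (for an all-negative input it returns 0, i.e. the empty window):

```python
def max_subarray(a):
    """Single scan: reset the running sum whenever adding A[i] would make it
    non-positive (the optimum window never starts there); track the best sum."""
    best = run = 0
    for x in a:
        if run + x <= 0:
            run = 0
        else:
            run += x
            best = max(best, run)
    return best
```

On the slide's example array the best window is 6+1-2+4+3 = 12.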
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions ⇒ data get distributed arbitrarily !!!
B-tree internal nodes
B-tree leaves
(“tuple pointers")
What about
listing tuples
in order ?
Tuples
Possibly 10⁹ random I/Os = 10⁹ × 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
01  if (i < j) then
02      m = (i+j)/2;           // Divide
03      Merge-Sort(A,i,m);     // Conquer
04      Merge-Sort(A,m+1,j);
05      Merge(A,i,m,j)         // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
  n = 10⁹ tuples ⇒ a few Gbs
  Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log₂ n) random I/Os
  ⇒ [5ms] × n log₂ n ≈ 1.5 years
In practice, it is faster because of caching...
2 passes (R/W)
Merge-Sort Recursion Tree
log2 N
If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: merge-sort recursion tree over a sample array; runs that fit in memory M are sorted directly, deeper levels merge runs pairwise]
How do we deploy the disk/mem features ?
N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
Pass 1: Produce (N/M) sorted runs.
Pass i: merge X = M/B runs ⇒ log_{M/B}(N/M) passes
[Figure: X input buffers (INPUT 1 … INPUT X) and one OUTPUT buffer, each of B items, in main memory; runs stream in from disk and the merged output streams back to disk]
Multiway Merging
[Figure: multiway merging — each of the X = M/B runs has a current page Bf_i with pointer p_i; repeatedly output min(Bf₁[p₁], Bf₂[p₂], …, Bf_X[p_X]) into the output page Bf_o; fetch a new page when p_i = B, flush Bf_o to the merged run when full, until EOF]
Cost of Multi-way Merge-Sort
Number of passes = logM/B #runs logM/B N/M
Optimal cost = Q((N/B) logM/B N/M) I/Os
In practice
M/B ≈ 1000 #passes = logM/B N/M 1
One multiway merge 2 passes = few mins
Tuning depends
on disk features
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
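The two passes can be sketched in Python, with heapq.merge playing the multiway merger (external_sort, M and the one-line-per-item run files are illustrative simplifications, not a tuned implementation):

```python
import heapq
import tempfile

def external_sort(items, M=4):
    """Pass 1: cut the input into N/M runs, sort each one in 'memory'
    (M items) and spill it to a temporary file.
    Pass 2: one multiway merge of all runs; heapq.merge keeps only the
    head of each run in memory, like the X = M/B input buffers."""
    runs = []
    for i in range(0, len(items), M):
        run = sorted(items[i:i + M])
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{x}\n" for x in run)
        f.seek(0)
        runs.append(f)
    merged = [int(line) for line in heapq.merge(*runs, key=int)]
    for f in runs:
        f.close()
    return merged
```

With a real fan-out of M/B ≈ 1000 buffers, one such merge already covers runs totalling far more than memory.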
May compression help?
Goal: enlarge M and reduce N
#passes = O(logM/B N/M)
Cost of a pass = O(N/B)
Part of Vitter’s paper…
In order to address issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how far down we can go
Toy problem #3: Top-freq elements
Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)
A=b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables <X,C> (candidate and counter), initially C = 0
For each item s of the stream:
    if (C == 0) { X = s; C = 1; }
    else if (X == s) C++;
    else C--;
Return X;
Proof
If X ≠ y at some point, then every one of y’s occurrences counted so far has a “negative” mate (a decrement caused by a different item). Hence these mates number ≥ #occ(y), so 2 * #occ(y) ≤ N — contradicting #occ(y) > N/2.
Problems arise only if the mode occurs ≤ N/2 times.
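The one-pair algorithm in runnable form (a second pass is still needed to verify that the returned candidate really occurs > N/2 times):

```python
def majority_candidate(stream):
    """One pair <X,C> scanned over the stream: if some item y occurs more
    than N/2 times, X equals y at the end of the scan."""
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1       # adopt a new candidate
        elif X == s:
            C += 1            # another vote for the candidate
        else:
            C -= 1            # cancel one vote against a different item
    return X
</pre>```

On the slide's stream A, the item c (9 occurrences out of 17) survives as the candidate.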
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 * 10⁹ chars ⇒ size = 6Gb
n = 10⁶ documents
TotT = 10⁹ term occurrences (avg term length is 6 chars)
t = 5 * 10⁵ distinct terms
What kind of data structure do we build to support
word-based searches ?
Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms

            Antony &   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra  Caesar   Tempest
Antony         1         1        0         0         0         1
Brutus         1         1        0         1         0         0
Caesar         1         1        0         1         1         1
Calpurnia      0         1        0         0         0         0
Cleopatra      1         0        0         0         0         0
mercy          1         0        1         1         1         1
worser         1         0        1         1         1         0
Space is 500Gb !
1 if play contains
word, 0 otherwise
Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

We can still do better: i.e. 30–50% of the original text
1. Typically …
2. We have 10⁹ total terms ⇒ at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
Better index, but yet it is >10 times the text !!!!
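A toy word-based inverted index with sorted posting lists and a list-merge AND query, sketching the structure above (doc ids start at 1; tokenization by whitespace is a simplification):

```python
from collections import defaultdict

def build_index(docs):
    """term -> sorted list of the ids of the documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs, 1):
        for term in sorted(set(text.lower().split())):
            index[term].append(doc_id)   # doc ids arrive in increasing order
    return index

def and_query(index, t1, t2):
    """AND query = linear merge of the two sorted posting lists."""
    p1, p2 = index[t1], index[t2]
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out
```

The space savings over the term-doc matrix come precisely from storing only the 1-entries as these posting lists.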
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in fewer bits ?
NO, they are 2ⁿ but we have fewer compressed messages:
    ∑_{i=1}^{n−1} 2^i = 2ⁿ − 2 < 2ⁿ
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:
    i(s) = log₂ (1/p(s)) = −log₂ p(s)
Lower probability ⇒ higher information
Entropy is the weighted average of i(s):
    H(S) = ∑_{s∈S} p(s) · log₂ (1/p(s))   bits
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be
uniquely decomposed into their codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: binary trie for a = 0, b = 100, c = 101, d = 11 — each codeword is a root-to-leaf path]
Average Length
For a code C with codeword length L[s], the
average length is defined as
    La(C) = ∑_{s∈S} p(s) · L[s]
We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same symbol lengths and thus the same average optimal length. And vice versa…
Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn} then pi < pj ⇒ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have
    H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that
    La(C) ≤ H(S) + 1
(the Shannon code takes ⌈log₂ 1/p(s)⌉ bits per symbol)
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
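The construction can be sketched with a heap of partial codes (tie-breaking, and hence the exact codewords, may differ from the slides' tree, but the codeword lengths are optimal either way):

```python
import heapq

def huffman_codes(probs: dict) -> dict:
    """Return an optimal prefix code as {symbol: bitstring}."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    counter = len(heap)                      # tie-breaker for equal weights
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)      # two least-probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}   # prepend branch bits
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]
```

For the running example below (p(a)=.1, p(b)=.2, p(c)=.2, p(d)=.5) the codeword lengths come out 3, 3, 2, 1.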
Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: Huffman tree — a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5); (.5) and d(.5) merge into the root (1); branches labeled 0/1]
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees
What about ties (and thus, tree depth) ?
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Example: abc… ⇒ 000 001 01 = 00000101…;  101001… ⇒ 1|01|001 = dcb…
[Figure: the Huffman tree from the previous slide, traversed root-to-leaf]
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store for any level L:
  firstcode[L] = numeric value of the first (smallest) codeword of length L
  Symbol[L,i], for each i in level L
This is ≤ h² + |S| log |S| bits
Canonical Huffman
Encoding
[Figure: a canonical Huffman tree with levels 1…5; on each level the codewords are consecutive binary numbers]
Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0
T=...00010...
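A decoding sketch under the Managing-Gigabytes convention the slides follow (a codeword is complete as soon as its numeric value v reaches firstcode[level]); the tiny firstcode/symbols tables in the test are illustrative, not the slide's actual code:

```python
def canonical_decode(bits: str, firstcode: dict, symbols: dict):
    """Walk the bit string keeping the numeric value v of the bits read so
    far; once v >= firstcode[level], the codeword is complete and its rank
    v - firstcode[level] indexes Symbol[level]."""
    out, v, level, i = [], 0, 0, 0
    while i < len(bits):
        v = 2 * v + int(bits[i])      # append the next bit to the value
        level += 1
        i += 1
        if level in firstcode and v >= firstcode[level]:
            out.append(symbols[level][v - firstcode[level]])
            v, level = 0, 0
    return out
```

No explicit tree is needed: the per-level tables replace it, which is the whole point of the canonical arrangement.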
Problem with Huffman Coding
Consider a symbol with probability .999. The
self-information is
    −log₂(0.999) ≈ 0.00144 bits
If we were to send 1000 such symbols we might hope to use 1000 × 0.00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
  1 extra bit per macro-symbol = 1/k extra bits per symbol
  Larger model to be transmitted
Shannon took infinite sequences, i.e. k → ∞ !!
In practice, we have:
  The model takes |S|^k * (k log |S|) + h² bits (where h might be |S|^k)
  It is H₀(S^L) ≤ L * H_k(S) + O(k log |S|), for each k ≤ L
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
[Figure: word-based Huffman for T = “bzip or not bzip” — the tree has fan-out 128, each edge is a 7-bit digit, digits are packed into bytes and the first bit of each byte is the tag; C(T) is the concatenation of the codewords of bzip, space, or, space, not, space, bzip]
CGrep and other ideas...
P = bzip = 1a 0b ⇒ compress P with the same model and GREP for C(P) directly inside C(T)
[Figure: scanning C(T) for T = “bzip or not bzip”: candidate alignments of C(P) are accepted (yes) or rejected (no); the tag bits guarantee that matches are codeword-aligned]
Speed ≈ Compression ratio
You find this at
You find it under my Software projects
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary: a, bzip, not, or, space; S = “bzip or not bzip”; P = bzip = 1a 0b
Find all the occurrences of the term P in S by scanning the compressed C(S)
[Figure: candidate alignments of C(P) inside C(S) are accepted (yes) or rejected (no)]
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: pattern P slid along text T, checking each alignment character by character]
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}ⁿ)
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:
    H(s) = ∑_{i=1}^{m} 2^{m−i} · s[i]
Example: P = 0101 ⇒ H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s=s’ if and only if H(s)=H(s’)
Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(T_r) from H(T_{r−1}):
    H(T_r) = 2·H(T_{r−1}) − 2^m · T[r−1] + T[r+m−1]
T = 10110101, m = 4:
    T₁ = 1011, T₂ = 0110
    H(T₁) = H(1011) = 11
    H(T₂) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6 = H(0110)
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P = 101111, q = 7
H(P) = 47, Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally:
    (1·2 + 0) mod 7 = 2
    (2·2 + 1) mod 7 = 5
    (5·2 + 1) mod 7 = 4
    (4·2 + 1) mod 7 = 2
    (2·2 + 1) mod 7 = 5 = Hq(P)
We can still compute Hq(T_r) from Hq(T_{r−1}), since
    2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
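A sketch of the fingerprint scan with verification (the deterministic variant; the fixed prime q is illustrative — the real algorithm draws a random prime ≤ I):

```python
def karp_rabin_matches(T: str, P: str, q: int = 2**31 - 1):
    """Search a binary pattern P in a binary text T via mod-q fingerprints.
    Fingerprint hits are verified, so the output is exact."""
    n, m = len(T), len(P)
    if n < m:
        return []
    hp = ht = 0
    for c in P:
        hp = (2 * hp + int(c)) % q        # Hq(P), computed incrementally
    for c in T[:m]:
        ht = (2 * ht + int(c)) % q        # Hq(T_1)
    top = pow(2, m - 1, q)                # 2^(m-1) mod q, for the rolling update
    occ = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:  # fingerprint hit -> verify
            occ.append(r)                 # 0-based occurrence
        if r + m < n:                     # roll: Hq(T_{r+1}) from Hq(T_r)
            ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
    return occ
```

Even with a tiny q such as 7 the verification step filters out all false matches — at the cost of extra comparisons.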
Problem 1: Solution
Dictionary: a, bzip, not, or, space; S = “bzip or not bzip”; P = bzip = 1a 0b
[Figure: the scan of C(S) accepts exactly the two codeword-aligned occurrences of C(P)]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california, P = for (m = 3, n = 10)
[Matrix M, 3×10: the only 1-entries are M(1,5) — “f”, M(2,6) — “fo”, and M(3,7) — “for”; a 1 in the last row of column j flags an occurrence of P ending at position j]
How does M solve the exact match problem?
How to construct M
We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one
Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:
And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting A’s bits down by one and setting the first bit to 1.
Example: BitShift((0,1,1,0)ᵀ) = (1,0,1,1)ᵀ
Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit the bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 in the positions where x appears in P.
Example: P = abaac
    U(a) = (1,0,1,1,0)ᵀ   U(b) = (0,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ
How to construct M
Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained as
    M(j) = BitShift(M(j−1)) & U(T[j])
For i > 1, entry M(i,j) = 1 iff
  (1) the first i−1 characters of P match the i−1 characters of T ending at j−1  ⇔  M(i−1,j−1) = 1
  (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1
BitShift moves bit M(i−1,j−1) into the i-th position; ANDing with the i-th bit of U(T[j]) establishes whether both conditions hold
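The same computation with each column M(j) packed into a Python integer (bit i−1 ⇔ row i; returned positions are 0-based):

```python
def shift_and(T: str, P: str):
    """Bit-parallel Shift-And: one integer per column of M."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):            # U(c): bit i-1 set iff P[i] = c
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0) # M(j) = BitShift(M(j-1)) & U(T[j])
        if M >> (m - 1) & 1:             # last row set: occurrence ends at j
            occ.append(j - m + 1)        # 0-based starting position
    return occ
```

When m ≤ w each column is one machine word, so each text character costs O(1).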
An example: T = xabxabaaca, P = abaac (m = 5, n = 10)

U(x) = 00000, U(a) = 10110, U(b) = 01000, U(c) = 00001  (bits shown left-to-right = rows 1…5)

j    :  1     2     3     4     5     6     7     8     9     10
T[j] :  x     a     b     x     a     b     a     a     c     a
M(j) : 00000 10000 01000 00000 10000 01000 10100 10010 00001 10000

e.g. M(1) = BitShift(M(0)) & U(x) = 10000 & 00000 = 00000,
M(2) = BitShift(M(1)) & U(a) = 10000 & 10110 = 10000, and
M(9) = BitShift(M(8)) & U(c) = 11001 & 00001 = 00001: row m = 5 is set, so an occurrence of P ends at position 9 (T[5..9] = abaac).
Shift-And method: Complexity
If m<=w, any column and vector U() fit in a
memory word.
Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
Very often in practice. Recall that w=64 bits in
modern architectures.
Some simple extensions
We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
    U(a) = (1,0,1,1,0)ᵀ   U(b) = (1,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ
What about ‘?’, ‘[^…]’ (not) ?
Problem 1: An other solution
Dictionary: a, bzip, not, or, space; S = “bzip or not bzip”; P = bzip = 1a 0b
[Figure: the Shift-And scan over C(S), accepting/rejecting each candidate alignment of C(P)]
Speed ≈ Compression ratio
Problem 2
Dictionary: a, bzip, not, or; S = “bzip or not bzip”; P = o
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring
Here the terms containing “o” are or = 1g0a0b and not = 1g0g0a
[Figure: their occurrences located in C(S)]
Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: patterns P1 and P2 slid along text T]
Naïve solution
Use an (optimal) Exact Matching Algorithm searching each
pattern in P
Complexity: O(nl+m) time, not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
  S is the concatenation of the patterns in P
  R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern
  Use a variant of the Shift-And method searching for S:
    For any symbol c, U’(c) = U(c) AND R, so U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
    For any step j, compute M(j) and then M(j) OR U’(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How? Test the bits of M(j) at the last symbol of each pattern
Problem 3
Dictionary: a, bzip, not, or; S = “bzip or not bzip”; P = bot, k = 2
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches
[Figure: the scan of C(S) for this query]
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa
aatatccacaa
atcgaa
Agrep
Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:
Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.
What is M0?
How does Mk solve the k-mismatch problem?
Computing Mk
We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
Computing Ml: case 1
The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.
[Figure: P[1..i−1] aligned to T ending at j−1 with ≤ l mismatches (*), then P[i] = T[j]]
    BitShift(M^l(j−1)) & U(T[j])
Computing Ml: case 2
The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.
[Figure: P[1..i−1] aligned to T ending at j−1 with ≤ l−1 mismatches (*); the i-th character may mismatch]
    BitShift(M^{l−1}(j−1))
Computing Ml
We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
    M^l(j) = [ BitShift(M^l(j−1)) & U(T[j]) ]  OR  BitShift(M^{l−1}(j−1))
Example M1
T = xabxabaaca, P = abaad

M⁰ (exact matches):
  j   : 1 2 3 4 5 6 7 8 9 10
  i=1 : 0 1 0 0 1 0 1 1 0 1
  i=2 : 0 0 1 0 0 1 0 0 0 0
  i=3 : 0 0 0 0 0 0 1 0 0 0
  i=4 : 0 0 0 0 0 0 0 1 0 0
  i=5 : 0 0 0 0 0 0 0 0 0 0

M¹ (≤ 1 mismatch):
  j   : 1 2 3 4 5 6 7 8 9 10
  i=1 : 1 1 1 1 1 1 1 1 1 1
  i=2 : 0 0 1 0 0 1 0 1 1 0
  i=3 : 0 0 0 1 0 0 1 0 0 1
  i=4 : 0 0 0 0 1 0 0 1 0 0
  i=5 : 0 0 0 0 0 0 0 0 1 0

M¹(5,9) = 1: P = abaad occurs with one mismatch ending at position 9 (T[5..9] = abaac).
How much do we pay?
The running time is O(kn(1+m/w))
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.
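A sketch of the k-mismatch extension, keeping the k+1 columns M⁰…M^k as integers (0-based positions returned):

```python
def shift_and_k_mismatch(T: str, P: str, k: int):
    """Agrep-style Shift-And: M[l] allows up to l mismatches, via
    M[l](j) = (BitShift(M[l](j-1)) & U(T[j])) | BitShift(M[l-1](j-1))."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)
    occ = []
    for j, c in enumerate(T):
        prev = M[:]                              # the columns for position j-1
        for l in range(k + 1):
            M[l] = ((prev[l] << 1) | 1) & U.get(c, 0)   # case 1: chars equal
            if l > 0:
                M[l] |= (prev[l - 1] << 1) | 1          # case 2: one more mismatch
        if M[k] >> (m - 1) & 1:
            occ.append(j - m + 1)                # start of a <=k-mismatch match
    return occ
```

Only the previous k+1 columns are kept, matching the O(k)-words space bound above.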
Problem 3: Solution
Dictionary: a, bzip, not, or; S = “bzip or not bzip”; P = bot, k = 2
[Figure: the scan of C(S) accepts “not” = 1g0g0a, which matches bot with ≤ 2 mismatches]
Agrep: more sophisticated operations
The Shift-And method can solve other ops
The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol of p into a different one
Example: d(ananas, banane) = 3
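The three-operation edit distance as the classic dynamic program (not the bit-parallel version):

```python
def edit_distance(p: str, s: str) -> int:
    """d(p,s): minimum insertions + deletions + substitutions turning p into s."""
    m, n = len(p), len(s)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                       # i deletions
    for j in range(n + 1):
        D[0][j] = j                       # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + 1,                        # deletion
                          D[i][j - 1] + 1,                        # insertion
                          D[i - 1][j - 1] + (p[i - 1] != s[j - 1]))  # substitution
    return D[m][n]
```

For ananas → banane one optimal script is: insert b, substitute a→e, delete s, i.e. cost 3.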
Search by regular expressions
Example: (a|b)?(abc|a)
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
γ-code for integer encoding
γ(x) = 0…0 (Length−1 zeroes) followed by x in binary
x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 represented as <000,1001>.
γ-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/2x², and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of γ-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
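The exercise above can be checked mechanically; a small sketch of γ-encoding and decoding over bit strings:

```python
def gamma_encode(x):
    # γ(x): ⌊log2 x⌋ zeroes, then x written in binary (x ≥ 1)
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":         # count leading zeroes = |binary(x)| - 1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

code = "0001000001100110000011101100111"
print(gamma_decode(code))  # → [8, 6, 3, 59, 7]
```

Because each codeword announces its own length, no separators are needed: the code is prefix-free.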
Analysis
Sort the pi in decreasing order, and encode symbol si via the variable-length code γ(i).
Recall that: |γ(i)| ≤ 2·log2 i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact: 1 ≥ Σj=1,…,i pj ≥ i·pi, hence i ≤ 1/pi
How good is it?
The cost of the encoding is (recall i ≤ 1/pi):
Σi=1,…,|S| pi·|γ(i)| ≤ Σi=1,…,|S| pi·[2·log2(1/pi) + 1] = 2·H0(X) + 1
Not much worse than Huffman,
and improvable to H0(X) + 2 + ....
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman
End-tagged dense code
The rank r is mapped to r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix-code
Better compression: it uses all 7-bits configurations
(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is:
Previously we used: s = c = 128
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
and s·c with 2 bytes, s·c² with 3 bytes, ...
An example
5000 distinct words
ETDC encodes 128 + 128² = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...
Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.
Brute-force approach
Binary search:
On real distributions, there seems to be a unique minimum
Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k
Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than
byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
Exploit temporal locality, and it is dynamic
X = 1^n 2^n 3^n … n^n ⇒ Huff = O(n² log n) bits, MTF = O(n log n) + n² bits
Not much worse than Huffman
...but it may be far better
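The two steps above (output the position, move to the front) can be sketched directly; a toy MTF encoder (positions here are 0-based):

```python
def mtf_encode(text, alphabet):
    L = list(alphabet)                # start with the list of symbols
    out = []
    for s in text:
        i = L.index(s)                # 1) output the position of s in L
        out.append(i)
        L.insert(0, L.pop(i))         # 2) move s to the front of L
    return out

print(mtf_encode("aabbbba", "abcd"))  # → [0, 0, 1, 0, 0, 0, 1]
```

Note the "memory": a repeated symbol costs 0, which is why MTF exploits temporal locality.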
MTF: how good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log2 i + 1
Put S in the front and consider the cost of encoding. If symbol x occurs nx times, at positions p1 < p2 < …, then:

L_mtf ≤ O(|S| log |S|) + Σx=1,…,|S| Σi |γ(pi − p(i−1))|

By Jensen’s inequality:

L_mtf ≤ O(|S| log |S|) + Σx=1,…,|S| nx · [2·log2(N/nx) + 1]
     = O(|S| log |S|) + N·[2·H0(X) + 1]

Hence La[mtf] ≤ 2·H0(X) + O(1)
MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:
Search tree
Hash Table
Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves
Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca ⇒ (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one starting bit suffice
Properties:
Exploits spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n
There is a memory
Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
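The run-length transform above is a one-liner over maximal runs; a minimal sketch:

```python
from itertools import groupby

def rle(s):
    # one (char, run-length) pair per maximal run of equal chars
    return [(c, len(list(g))) for c, g in groupby(s)]

print(rle("abbbaacccca"))  # → [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]
```

The run lengths would then be compressed with a variable-length integer code such as γ.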
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).
e.g. p(a) = .2, p(b) = .5, p(c) = .3, so [0,1) is split as a = [0,.2), b = [.2,.7), c = [.7,1.0)
f(i) = Σj=1,…,i−1 p(j), hence f(a) = .0, f(b) = .2, f(c) = .7
The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0,1). Symbol b narrows the interval to [.2,.7); within it, a narrows it to [.2,.3); within it, c narrows it to [.27,.3).
The final sequence interval is [.27,.3)
Arithmetic Coding
To code a sequence of symbols ci with probabilities p[c] use the following:
l0 = 0, s0 = 1
li = l(i−1) + s(i−1) · f[ci]
si = s(i−1) · p[ci]
f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is sn = Πi=1,…,n p[ci]
The interval for a message sequence will be called the sequence interval
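The recurrences above can be run directly on the slides' example; a float-based sketch (real coders use the integer version, this is only to see the interval shrink):

```python
def sequence_interval(msg, p, order):
    f, acc = {}, 0.0
    for c in order:          # f[c] = cumulative prob. up to c (excluded)
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0          # l0 = 0, s0 = 1
    for c in msg:
        l = l + s * f[c]     # li = l(i-1) + s(i-1) * f[ci]
        s = s * p[c]         # si = s(i-1) * p[ci]
    return l, s

p = {"a": 0.2, "b": 0.5, "c": 0.3}
l, s = sequence_interval("bac", p, "abc")
print(round(l, 3), round(l + s, 3))  # → 0.27 0.3
```

The final interval size is exactly the product of the symbol probabilities, which is what makes the output length approach n·H0.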
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
.49 ∈ [.2,.7) = b’s interval ⇒ first symbol is b; rescale: (.49−.2)/.5 = .58
.58 ∈ [.2,.7) ⇒ second symbol is b; rescale: (.58−.2)/.5 = .76
.76 ∈ [.7,1) = c’s interval ⇒ third symbol is c
The message is bbc.
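The decoding step above (find the symbol interval containing the number, then rescale) can be sketched as:

```python
def decode(x, n, p, order):
    f, acc = {}, 0.0
    for c in order:                       # cumulative probabilities
        f[c] = acc
        acc += p[c]
    msg = []
    for _ in range(n):
        for c in order:                   # symbol whose interval contains x
            if f[c] <= x < f[c] + p[c]:
                msg.append(c)
                x = (x - f[c]) / p[c]     # rescale and reduce the interval
                break
    return "".join(msg)

p = {"a": 0.2, "b": 0.5, "c": 0.3}
print(decode(0.49, 3, p, "abc"))  # → bbc
```

Knowing the message length (or using an end-of-message symbol) is what tells the decoder when to stop.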
Representing a real number
Binary fractional representation:
.75 = .11
1/3 = .0101…
11/16 = .1011
Algorithm:
1. x = 2·x
2. If x < 1, output 0
3. else x = x − 1; output 1
So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) → .01, [.33,.66) → .1, [.66,1) → .11
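The three-step expansion above is easy to run; a sketch that emits the first nbits bits of x ∈ [0,1):

```python
def frac_bits(x, nbits):
    # binary fractional expansion of x in [0,1)
    out = []
    for _ in range(nbits):
        x *= 2                        # 1. x = 2*x
        if x < 1:
            out.append("0")           # 2. if x < 1 output 0
        else:
            x -= 1                    # 3. else x = x - 1; output 1
            out.append("1")
    return "." + "".join(out)

print(frac_bits(0.75, 2))    # → .11
print(frac_bits(11 / 16, 4)) # → .1011
```

Non-dyadic numbers like 1/3 never terminate, which is why the coder looks for a dyadic number inside the sequence interval.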
Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

number  min    max     interval
.11     .110   .111…   [.75, 1.0)
.101    .1010  .1011…  [.625, .75)

We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).
e.g. sequence interval [.61,.79) contains the code interval of .101, i.e. [.625,.75)
Can use L + s/2 truncated to 1 + ⌈log2(1/s)⌉ bits
Bound on Arithmetic length
Note that −log2 s + 1 = log2(2/s)
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + log2(1/s) = 1 + log2 Πi (1/pi)
≤ 2 + Σi=1,…,n log2(1/pi)
= 2 + Σk=1,…,|S| n·pk·log2(1/pk)
= 2 + n·H0 bits
nH0 + 0.02·n bits in practice, because of rounding
Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key Ideas of integer version:
Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s; m = 0; the interval is expanded by 2
If u < R/2 then (bottom half)
Output 0 followed by m 1s; m = 0; the interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m; the interval is expanded by 2
All other cases: just continue...
You find this at: Arithmetic ToolBox
As a state machine
[Figure: the ATB as a state machine: given the distribution (p1,…,p|S|) and the next symbol c, it maps the current interval (L,s) to the new interval (L’,s’) ⊆ [L, L+s).]
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
[Figure: the ATB driven by PPM: the symbol s (either a char c or esc) is coded with probability p[s|context], mapping the interval (L,s) to (L’,s’).]
Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant)
PPM: Example Contexts
String = ACCBACCACBA (predicting the next B), k = 2

Context (empty): A = 4, B = 2, C = 5, $ = 3
Context A:  C = 3, $ = 1
Context B:  A = 2, $ = 1
Context C:  A = 1, B = 2, C = 2, $ = 3
Context AC: B = 1, C = 2, $ = 2
Context BA: C = 1, $ = 1
Context CA: C = 1, $ = 1
Context CB: A = 2, $ = 1
Context CC: A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) as n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
Dictionary = the already-scanned text (all substrings starting before the Cursor)
Algorithm’s step: output a triple <d, len, c>, e.g. <2,3,c>, where
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match
Advance by len + 1
A buffer “window” of fixed length slides over the text
Example: LZ77 with window
a a c a a c a b c a b a a a c
(0,0,a)
a a c a a c a b c a b a a a c
(1,1,c)
a a c a a c a b c a b a a a c
(3,4,b)
a a c a a c a b c a b a a a c
(3,3,a)
a a c a a c a b c a b a a a c
(1,2,c)
Window size = 6. Each step outputs <d, len, c> = the longest match within W plus the next character.
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if len > d? (overlap with text to be compressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor:
for (i = 0; i < len; i++)
  out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
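The overlapping-copy loop above works because each copied character is already in the output by the time it is read; a direct sketch:

```python
def lz77_copy(out, d, length):
    # copy `length` chars from distance d back, one at a time;
    # valid even when length > d (the copy overlaps its own output)
    cursor = len(out)
    for i in range(length):
        out.append(out[cursor - d + i])
    return out

out = list("abcd")
lz77_copy(out, 2, 9)
out.append("e")
print("".join(out))  # → abcdcdcdcdcdce
```

A block memcpy would read uninitialized positions here; the char-by-char order is essential.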
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
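The coding loop above can be sketched with a plain dict standing in for the trie (a toy version; the trailing-match flush is our addition for inputs that end inside a phrase):

```python
def lz78_encode(text):
    dict_, out = {}, []                    # phrase -> id
    S = ""
    for c in text:
        if S + c in dict_:
            S += c                         # extend the longest match
        else:
            out.append((dict_.get(S, 0), c))  # (id of S, next char c)
            dict_[S + c] = len(dict_) + 1     # add Sc to the dictionary
            S = ""
    if S:                                  # flush a trailing match (our addition)
        out.append((dict_[S], ""))
    return out

print(lz78_encode("aabaacabcabcb"))
# → [(0, 'a'), (1, 'b'), (1, 'a'), (0, 'c'), (2, 'c'), (5, 'b')]
```

A real implementation stores the phrases in a trie so each input char costs O(1) amortized.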
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
LZW: Encoding Example
Input: a a b a a c a b a b a c b (a = 112, b = 113, c = 114)
output 112 (a), add 256 = aa
output 112 (a), add 257 = ab
output 113 (b), add 258 = ba
output 256 (aa), add 259 = aac
output 114 (c), add 260 = ca
output 257 (ab), add 261 = aba
output 261 (aba), add 262 = abac
output 114 (c), add 263 = cb
LZW: Decoding Example
Input: 112 112 113 256 114 257 261 114
112 → a
112 → a a, add 256 = aa
113 → a a b, add 257 = ab
256 → a a b a a, add 258 = ba
114 → a a b a a c, add 259 = aac
257 → a a b a a c a b, add 260 = ca
261 → ? the decoder is one step behind the coder: 261 is defined only one step later, as previous output + its first char = ab + a = aba
261 → a a b a a c a b a b a, add 261 = aba
114 → a a b a a c a b a b a c, add 262 = abac
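The "one step behind" rule, including the SSc special case, can be sketched as follows (the initial 112/113/114 alphabet is the slides' toy numbering, not real ASCII):

```python
def lzw_decode(codes, init):
    dict_ = dict(init)                    # id -> string
    prev = dict_[codes[0]]
    out = [prev]
    next_id = 256
    for code in codes[1:]:
        if code in dict_:
            cur = dict_[code]
        else:                             # SSc case: code not yet in the dict
            cur = prev + prev[0]
        dict_[next_id] = prev + cur[0]    # the entry the coder added one step ago
        next_id += 1
        out.append(cur)
        prev = cur
    return "".join(out)

init = {112: "a", 113: "b", 114: "c"}     # slides' toy alphabet
print(lzw_decode([112, 112, 113, 256, 114, 257, 261, 114], init))
# → aabaacababac
```

Code 261 arrives before the decoder has defined it; prev + prev[0] reconstructs exactly the entry the coder created.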
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows (Burrows-Wheeler, 1994). The first column is F, the last column is L = BWT(T):

F            L
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i

A famous example… much longer in practice.
A useful tool: the L → F mapping
(same sorted matrix as above: F and L are known, the middle columns are unknown)
How do we map L’s chars onto F’s chars?
... Need to distinguish equal chars in F...
Take two equal chars of L and rotate their rows rightward: they preserve their relative order in F !!
The BWT is invertible
(consider again the F and L columns above)
Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[i] precedes F[i] in T
Reconstruct T backward:
T = .... i ppi #
InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
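Both directions can be sketched in a few lines (a toy version: the forward transform sorts all rotations explicitly, and we assume T ends with a unique smallest char '#'):

```python
def bwt(t):
    # sort all rotations; L is the last column
    rot = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rot)

def ibwt(L):
    n = len(L)
    order = sorted(range(n), key=lambda i: L[i])  # stable sort gives F
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos                         # k-th c in L ↔ k-th c in F
    out, r = [], 0                                # row 0 starts with '#'
    for _ in range(n):
        out.append(L[r])
        r = LF[r]
    t = "".join(reversed(out))                    # yields '#' + T without its '#'
    return t[1:] + t[0]                           # rotate the '#' back to the end

s = "mississippi#"
print(bwt(s))         # → ipssm#pissii
print(ibwt(bwt(s)))   # → mississippi#
```

The stable sort is exactly the "same relative order" property: equal chars of L keep their order in F, which is all the LF-array encodes.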
How to compute the BWT ?
We said that L[i] precedes F[i] in T; given SA and T, we have L[i] = T[SA[i] − 1] (e.g. L[3] = T[7]).

SA   BWT matrix    L
12   #mississipp   i
11   i#mississip   p
 8   ippi#missis   s
 5   issippi#mis   s
 2   ississippi#   m
 1   mississippi   #
10   pi#mississi   p
 9   ppi#mississ   i
 7   sippi#missi   s
 4   sissippi#mi   s
 6   ssippi#miss   i
 3   ssissippi#m   i
How to construct SA from T ?
Input: T = mississippi#

SA   suffix
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#

Elegant but inefficient. Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii
# at 16
Mtf = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code
RLE0 = 03141041403141410210
Alphabet
|S|+1
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google, 7/2008)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Average page lifetime is about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node one can go to any other via an undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node one can go to any other via a directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
Largest artifact ever conceived by humankind
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph: V = Routers, E = communication links
The “cosine” graph (undirected, weighted): V = static web pages, E = semantic distance between pages
Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, ...)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
The In-degree distribution
Indegree follows a power-law distribution:
Pr[in-degree(u) = k] ∝ 1/k^α, α ≈ 2.1
(observed both on the Altavista crawl, 1999, and on the WebBase crawl, 2001)
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
[Figure: adjacency-matrix plot (i,j) of a crawl with 21 million pages and 150 million links, after URL-sorting; dense blocks correspond to hosts such as Berkeley and Stanford.]
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s1 − x, s2 − s1 − 1, ..., sk − s(k−1) − 1}
For negative entries: map v ≥ 0 to 2v and v < 0 to 2|v| − 1
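The gap transformation above is a one-pass difference; a sketch (node id 15 and its successor list are hypothetical toy values, not real WebGraph output):

```python
def gaps(x, successors):
    # S(x) = {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}
    out, prev = [], None
    for s in successors:
        out.append(s - x if prev is None else s - prev - 1)
        prev = s
    return out

def vnat(v):
    # map a signed gap to a natural number: v >= 0 -> 2v, v < 0 -> 2|v| - 1
    return 2 * v if v >= 0 else 2 * (-v) - 1

g = gaps(15, [13, 15, 16, 17, 18, 19, 23, 24])
print(g)                     # → [-2, 1, 0, 0, 0, 0, 3, 0]
print([vnat(v) for v in g])  # → [3, 2, 0, 0, 0, 0, 6, 0]
```

Only the first entry can be negative (a successor smaller than the node itself), hence the signed-to-natural mapping before the variable-length integer code.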
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of y’s copy-list tells whether the corresponding successor of the reference x is also a successor of y;
The reference index is chosen in [0,W] so as to give the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with copy blocks (consecutivity in extra-nodes)
Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals, or wrt the source
Examples: 0 = (15−15)·2 (positive); 600 = (316−16)·2; 3 = |13−15|·2−1 (negative); 2 = (23−19)−2 (jump ≥ 2); 3018 = 3041−22−1
Algoritmi per IR
Compression of file collections
Background
[Figure: sender and receiver connected by network links; the receiver already holds some knowledge about the data.]
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
What if the sender has never seen the data at the receiver? (overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression [diff, zdelta, REBL, …]
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested page and the ones available in cache
File synchronization [rsync, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engines
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
LZ77-scheme provides an efficient, optimal solution:
fknown is the “previously encoded text”: compress fknown·fnew starting from fnew
zdelta is one of the best implementations

Emacs:   size    time
uncompr  27Mb    ---
gzip     8Mb     35 secs
zdelta   1.5Mb   42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
[Figure: Client ↔ Proxy over the slow link (delta-encoding of the requested page wrt a shared reference), Proxy ↔ web over the fast link.]
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f ∈ F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
[Figure: weighted graph over the files plus a dummy node; edge weights (e.g. 20, 123, 220, 620, 2000) are zdelta sizes, and the min branching selects the cheapest reference for each file.]

         space   time
uncompr  30Mb    ---
tgz      20%     linear
THIS     8%      quadratic
Improvement
What about
many-to-one compression?
(group of files)
Problem: Constructing G is very costly, n² edge calculations (zdelta executions)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files
Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, still quadratic time
         space   time
uncompr  260Mb   ---
tgz      12%     2 mins
THIS     8%      16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
[Figure: the Client holds f_old, the Server holds f_new; the client sends a request and receives an update.]
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
[Figure: the Client sends the block hashes of f_old; the Server matches them against f_new and returns the encoded file (copy instructions plus literals).]
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
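The rolling hash mentioned above can be sketched with an Adler-style checksum (a simplified stand-in for rsync's actual weak hash, not its exact implementation): the pair (a,s) of every window is obtained in O(1) from the previous one.

```python
def weak_hashes(data, b):
    # a = sum of the window's bytes, s = position-weighted sum;
    # sliding the window by one updates both in O(1)
    a = sum(data[:b])
    s = sum((b - i) * data[i] for i in range(b))
    out = [(a, s)]
    for j in range(b, len(data)):
        old, new = data[j - b], data[j]
        a += new - old
        s += a - b * old
        out.append((a, s))
    return out

print(len(weak_hashes(b"hello world", 4)))  # → 8
```

The cheap rolling hash filters candidate blocks; only on a match is the strong (MD5-style) hash computed.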
Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452

Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
The server sends the hashes (unlike the client in rsync), and the client checks them
The server deploys the common fref to compress the new ftar (rsync just compresses it)
A multi-round protocol
k blocks of n/k elems, log(n/k) levels
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])
Occurrences of P in T = All suffixes of T having P as a prefix
Example: T = mississippi, P = si occurs at positions 4, 7
SUF(T) = Sorted set of suffixes of T
Reduction: from substring search to prefix search
The Suffix Tree
T# = mississippi#
[Figure: the suffix tree of T#; edges carry substring labels (e.g. ssi, ppi#, mississippi#) and each leaf stores the starting position (1..12) of its suffix.]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.
SUF(T) takes Θ(N²) space; the suffix array SA keeps just the suffix pointers:

T = mississippi#
SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

P = si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
T = mississippi#, P = si
Compare P against the suffix pointed to by the middle entry of SA: if P is larger, recurse on the right half; if P is smaller, recurse on the left half.
Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
⇒ overall, O(p log2 N) time
Improved to O(p + log2 N) [Manber-Myers, ’90]
and to O(p + log2 |S|) [Cole et al., ’06]
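The indirect binary search can be sketched as two binary searches for the range boundaries (the quadratic SA builder below is for illustration only; real constructions run in O(n) or O(n log n)):

```python
def sa_range(t, sa, p):
    # lower bound: first suffix whose p-length prefix is >= p
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)                # upper bound: first suffix whose prefix is > p
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] <= p:
            lo = mid + 1
        else:
            hi = mid
    return start, lo            # occurrences are sa[start:lo]

t = "mississippi#"
sa = sorted(range(len(t)), key=lambda i: t[i:])   # toy O(n^2 log n) builder
beg, end = sa_range(t, sa, "si")
print(sorted(sa[beg:end]))  # → [3, 6]  (positions 4, 7 in 1-based counting)
```

Each comparison inspects at most p characters, giving the O(p log2 N) bound of the slides.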
Locating the occurrences
T = mississippi#, P = si: binary search for the SA range between si# and si$ (where # < every char < $): occ = 2, at positions 4 (sissippi) and 7 (sippi).
Suffix Array search: O(p + log2 N + occ) time
Suffix Trays: O(p + log2 |S| + occ) [Cole et al., ’06]
String B-tree [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays [Ciriani et al., ’02]
Text mining
Lcp[1,N−1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
SA:  12 11 8 5 2 1 10 9 7 4 6 3
Lcp:    0  1 1 4 0 0  1 0 2 1 3

e.g. Lcp = 4 between the adjacent suffixes issippi# (position 5) and ississippi# (position 2).

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k−1] s.t. SA[h] = i and SA[k] = j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C−2] whose entries are all ≥ L
Paradigm shift...
Web 2.0 is about the many
Big DATA ⇒ Big PC ?
We have three types of algorithms:
T1(n) = n, T2(n) = n², T3(n) = 2^n
... and assume that 1 step = 1 time unit
How many input data n can each algorithm process within t time units?
n1 = t, n2 = √t, n3 = log2 t
What about a k-times faster processor?
...or, what is n when the available time is k·t units?
n1 = k·t, n2 = √k·√t, n3 = log2(kt) = log2 k + log2 t
A new scenario
Data are more available than even before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Step cost is Ω(1) time: not just MIN #steps…

CPU (registers) → L1/L2 cache → RAM → disk → net:
Cache: few Mbs, some nanosecs, few words fetched
RAM: few Gbs, tens of nanosecs, some words fetched
Disk: few Tbs, few millisecs, B = 32K page
Net: many Tbs, even secs, packets
You should be “??-aware programmers”
I/O-conscious Algorithms
[Figure: disk anatomy — track, read/write head, read/write arm, magnetic surface]
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3-0.4 (Hennessy-Patterson)]
C = cost of an I/O [10^5-10^6 steps (Hennessy-Patterson)]
If N = (1+f)·M, then the D-avg cost per step is:
C · p · f/(1+f)
This is at least 10^4 · f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algo uses all of it:
(1/B) · (p · f/(1+f) · C) ≈ 30 · f/(1+f)
Space-conscious Algorithms
I/Os
search
access
Compressed
data structures
Streaming Algorithms
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
(same memory hierarchy as above: CPU registers → L1/L2 cache → RAM → disk → net, with block sizes growing at each level)
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
n:   4K   8K   16K   32K   128K  256K  512K  1M
n³:  22s  3m   26m   3.5h  28h   --    --    --
n²:  0    0    0     1s    26s   106s  7m    28m
An optimal solution
We assume every subsum ≠ 0
The optimal subarray is preceded by a prefix of negative sum (<0), and every prefix of the optimum has positive sum (>0)
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
sum = 0; max = -1;
For i = 1,...,n do
  If (sum + A[i] ≤ 0) sum = 0;
  else { sum += A[i]; max = MAX{max, sum}; }
Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
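The one-pass scan above (reset the running sum when it drops below zero, track the best) can be sketched as follows, mirroring the slide's algorithm under its nonzero-subsum assumption:

```python
def max_subarray(A):
    best = cur = 0
    for x in A:
        cur = max(cur + x, 0)     # reset when the running sum would go <= 0
        best = max(best, cur)     # track the best window seen so far
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray(A))  # → 12  (the window 6 1 -2 4 3)
```

One pass, O(1) extra space: the O(n³) and O(n²) brute-force scans in the timing table are entirely avoidable.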
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions Data get distributed arbitrarily !!!
B-tree internal nodes
B-tree leaves
(“tuple pointers")
What about
listing tuples
in order ?
Tuples
Possibly 10^9 random I/Os = 10^9 × 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A, i, j)
01 if (i < j) then
02   m = (i+j)/2;            // Divide
03   Merge-Sort(A, i, m);    // Conquer
04   Merge-Sort(A, m+1, j);
05   Merge(A, i, m, j)       // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10^9 tuples ⇒ few Gbs
Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log2 n) random I/Os
⇒ [5ms] × n log2 n ≈ 1.5 years
In practice, it is faster because of caching...
(each merge level makes 2 passes, R/W)
Merge-Sort Recursion Tree
[Figure: recursion tree of Merge-Sort, log2 N levels of pairwise run merging over sorted runs of growing size]
How do we deploy the disk/mem features ?
If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help
N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
  Pass 1: Produce (N/M) sorted runs.
  Pass i: merge X ≤ M/B runs ⇒ logM/B N/M passes
[Figure: X input buffers (INPUT 1, INPUT 2, ..., INPUT X) and one OUTPUT buffer, each of B items, in main memory between disk and disk]
Multiway Merging
[Figure: X = M/B runs on disk; one current page Bf1..Bfx per run, with cursors p1..pX. Repeatedly output min(Bf1[p1], Bf2[p2], …, Bfx[pX]) into buffer Bfo; fetch a new page when pi = B, flush Bfo to the merged output run when full, until EOF]
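The merging step can be sketched in Python with a heap playing the role of the min(Bf1[p1], …, Bfx[pX]) selection (runs are in-memory lists here, whereas the real algorithm streams disk pages of B items):

```python
import heapq

def multiway_merge(runs):
    # One output sequence from X sorted runs; the heap always yields
    # the smallest element among the runs' current cursors.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    merged = []
    while heap:
        val, i, j = heapq.heappop(heap)
        merged.append(val)
        if j + 1 < len(runs[i]):          # advance this run's cursor
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return merged
```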
Cost of Multi-way Merge-Sort
Number of passes = logM/B #runs ≈ logM/B N/M
Optimal cost = Θ((N/B) logM/B N/M) I/Os
In practice
  M/B ≈ 1000 ⇒ #passes = logM/B N/M ≈ 1
  One multiway merge ⇒ 2 passes = few mins
  (Tuning depends on disk features)
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
Does compression help?
  Goal: enlarge M and reduce N
  #passes = O(logM/B N/M)
  Cost of a pass = O(N/B)
Part of Vitter’s paper addresses related issues:
  Disk Striping: sorting easily on D disks
  Distribution sort: top-down sorting
  Lower Bounds: how far we can go
Toy problem #3: Top-freq elements
Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2, using the smallest space (i.e. if the mode occurs > N/2).
A = b a c c c d c b a a a c c b c c c
Algorithm
  Use a pair of variables <X,C>, with C = 0 initially
  For each item s of the stream:
    if (C == 0) { X = s; C = 1; }
    else if (X == s) C++;
    else C--;
  Return X;
Proof idea: if X ≠ y at the end, then every one of y’s occurrences has a “negative” mate.
Hence these mates would be ≥ #occ(y).
As a result, 2 * #occ(y) > N would be contradicted; problems only if the mode occurs ≤ N/2 times.
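The pairing argument above is exactly the Boyer-Moore majority vote; a minimal Python sketch (the function name is mine):

```python
def majority_candidate(stream):
    # One symbol X and one counter C, as on the slide. The returned
    # candidate is guaranteed correct only if some item occurs > N/2.
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

# 'c' occurs 9 times out of 17 in the slide's stream
print(majority_candidate(list("baccccdcbaaaccbccc")))
```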
Toy problem #4: Indexing
Consider the following TREC collection:
  N = 6 * 10^9 characters, i.e. size = 6Gb
  n = 10^6 documents
  TotT = 10^9 term occurrences (avg term length is 6 chars)
  t = 5 * 10^5 distinct terms
What kind of data structure do we build to support word-based searches ?
Solution 1: Term-Doc matrix (t = 500K terms, n = 1 million docs)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony            1               1              0          0       0        1
  Brutus            1               1              0          1       0        0
  Caesar            1               1              0          1       1        1
  Calpurnia         0               1              0          0       0        0
  Cleopatra         1               0              0          0       0        0
  mercy             1               0              1          1       1        1
  worser            1               0              1          1       1        0

Entry is 1 if the play contains the word, 0 otherwise.
Space is 500Gb !
Solution 2: Inverted index

  Brutus    →  2 4 8 16 32 64 128
  Calpurnia →  1 2 3 5 8 13 21 34
  Caesar    →  13 16

We can still do better: i.e. 30-50% of the original text
1. Typically:
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
Better index, but yet it is >10 times the text !!!!
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2^n but we have fewer compressed msgs:
  sum_{i=1}^{n-1} 2^i = 2^n - 2 < 2^n
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self information of s is:
  i(s) = log2 (1/p(s)) = -log2 p(s)
Lower probability ⇒ higher information
Entropy is the weighted average of i(s):
  H(S) = sum_{s in S} p(s) log2 (1/p(s))   bits
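The two formulas above translate directly into Python (function names are mine):

```python
import math

def self_information(p):
    # i(s) = log2(1/p(s)) = -log2 p(s): rarer symbols carry more bits
    return -math.log2(p)

def entropy(probs):
    # H(S) = sum_s p(s) * log2(1/p(s)): weighted average of i(s)
    return sum(p * self_information(p) for p in probs if p > 0)

# A fair coin carries 1 bit per toss
print(entropy([0.5, 0.5]))
```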
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be uniquely decomposed into its codewords.
Prefix Codes
A prefix code is a variable length code in which no codeword is a prefix of another one
  e.g. a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie (left edge = 0, right edge = 1; symbols at the leaves: a at 0, b at 100, c at 101, d at 11)
Average Length
For a code C with codeword length L[s], the average length is defined as
  La(C) = sum_{s in S} p(s) * L[s]
We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths and thus the same optimal average length. And vice versa…
Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn} then pi < pj ⇒ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have
  H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that
  La(C) ≤ H(S) + 1
(Shannon code: s takes ⌈log2 1/p(s)⌉ bits)
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Merge a(.1) and b(.2) into (.3); merge (.3) and c(.2) into (.5); merge (.5) and d(.5) into (1). Labelling left edges 0 and right edges 1:
  a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) “equivalent” Huffman trees (one per swap of the children of an internal node)
What about ties (and thus, tree depth) ?
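The greedy merging above can be sketched with a heap in Python (a minimal sketch, not the succinct canonical representation discussed later; names are mine):

```python
import heapq
from itertools import count

def huffman_code(freqs):
    # Repeatedly merge the two least-probable trees, as in the running
    # example; `tie` keeps heap entries comparable on equal probs.
    tie = count()
    heap = [(p, next(tie), {sym: ""}) for sym, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}        # left = 0
        merged.update({s: "1" + w for s, w in c1.items()})  # right = 1
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]
```

On the slide's distribution this yields codeword lengths 3, 3, 2, 1 for a, b, c, d (the actual bit patterns depend on tie-breaking).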
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: Start at root and take branch for each bit received. When at leaf, output its symbol and return to root.
With a = 000, b = 001, c = 01, d = 1:
  abc... → 00000101...
  101001... → dcb...
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store for any level L:
  firstcode[L] (the first codeword on level L, of the form 00.....0)
  Symbol[L,i], for each i in level L
This is ≤ h^2 + |S| log |S| bits (h = tree height)
Canonical Huffman: Encoding and Decoding
Example over levels L = 1..5, with
  firstcode[1]=2, firstcode[2]=1, firstcode[3]=1, firstcode[4]=2, firstcode[5]=0
Decoding T = ...00010... compares the current bit-prefix with firstcode[L], level by level.
Problem with Huffman Coding
Consider a symbol with probability .999. Its self information is
  -log2(.999) ≈ .00144 bits
If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
  1 extra bit per macro-symbol = 1/k extra-bits per symbol
  Larger model to be transmitted
Shannon took infinite sequences, and k → ∞ !!
In practice, we have:
  Model takes |S|^k * (k * log |S|) + h^2 bits (where h might be |S|)
  It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L
Compress + Search ?  [Moura et al, 98]
Compressed text derived from a word-based Huffman:
  Symbols of the Huffman tree are the words of T
  The Huffman tree has fan-out 128
  Codewords are byte-aligned and tagged
[Figure: word-based Huffman tree over T = “bzip or not bzip”; each byte of a codeword carries a 1-bit tag plus 7 huffman bits, so codewords are byte-aligned in C(T)]
CGrep and other ideas...
  P = bzip = 1a 0b
[Figure: GREP over the compressed text C(T) of T = “bzip or not bzip”: thanks to the tagged, byte-aligned codewords, one can scan C(T) directly for the codeword of P]
Speed ≈ Compression ratio
You find this at: you find it under my Software projects
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary: a, bzip, not, or (plus space)
  P = bzip = 1a 0b
[Figure: search for the codeword of P in the compressed text C(S) of S = “bzip or not bzip”; matches are found where the tagged, byte-aligned codeword occurs]
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the pattern P[1,m] in the text T[1,n].
[Figure: pattern P sliding over text T]
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods:
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}^n)
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:
  H(s) = sum_{i=1}^{m} 2^(m-i) * s[i]
Example: P = 0101
  H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5
s = s’ if and only if H(s) = H(s’)
Definition:
let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = scan T and compare H(Tr) and H(P):
there is an occurrence of P starting at position r of T if and only if H(P) = H(Tr)
  T = 10110101, P = 0101, H(P) = 5
  r = 2: H(T2) = H(0110) = 6 ≠ H(P)
  r = 5: H(T5) = H(0101) = 5 = H(P) → Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1):
  H(Tr) = 2 * H(Tr-1) - 2^m * T(r-1) + T(r+m-1)
Example (m = 4): T = 10110101, T1 = 1011, T2 = 0110
  H(T1) = H(1011) = 11
  H(T2) = 2*11 - 2^4 * 1 + 0 = 22 - 16 = 6 = H(0110)
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P = 101111, q = 7
H(P) = 47, Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally:
  1*2 + 0 (mod 7) = 2
  2*2 + 1 (mod 7) = 5
  5*2 + 1 (mod 7) = 4
  4*2 + 1 (mod 7) = 2
  2*2 + 1 (mod 7) = 5
  5 (mod 7) = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr-1):
  2^m (mod q) = 2 * (2^(m-1) (mod q)) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
Proof on the board
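The deterministic variant (verify every fingerprint hit) can be sketched in Python over a generic text alphabet; the base 256 and the fixed prime are my illustrative choices, not from the slides:

```python
def rabin_karp(T, P, q=101):
    # Karp-Rabin with fingerprints mod a prime q; every candidate is
    # verified, so the output is exact even on a false match.
    n, m = len(T), len(P)
    if m > n:
        return []
    base = 256
    high = pow(base, m - 1, q)            # base^(m-1) mod q, for rolling
    hp = ht = 0
    for i in range(m):
        hp = (hp * base + ord(P[i])) % q
        ht = (ht * base + ord(T[i])) % q
    hits = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:  # rule out false matches
            hits.append(r)
        if r + m < n:                     # roll the fingerprint
            ht = ((ht - ord(T[r]) * high) * base + ord(T[r + m])) % q
    return hits

# 0-indexed: the slide's occurrence at position 5 is index 4 here
print(rabin_karp("10110101", "0101"))
```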
Problem 1: Solution
Dictionary: a, bzip, not, or (plus space)
  P = bzip = 1a 0b
[Figure: Karp-Rabin-style scan of the compressed text C(S) of S = “bzip or not bzip” for the codeword of P]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for
M is an m x n matrix:

        c  a  l  i  f  o  r  n  i  a
  f     0  0  0  0  1  0  0  0  0  0
  o     0  0  0  0  0  1  0  0  0  0
  r     0  0  0  0  0  0  1  0  0  0
How does M solve the exact match problem?
How to construct M
We want to exploit the bit-parallelism to compute the j-th column of M from the (j-1)-th one
Machines can perform bit and arithmetic operations between two words in constant time.
Examples:
  And(A,B) is the bit-wise and between A and B.
  BitShift(A) is the value derived by shifting A’s bits down by one and setting the first bit to 1.
    BitShift( (0,1,1,0,1) ) = (1,0,1,1,0)
Let w be the word size (e.g., 32 or 64 bits). We’ll assume m = w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit the bit-parallelism to compute the j-th column of M from the (j-1)-th one
We define the m-length binary vector U(x) for each character x in the alphabet. U(x) is set to 1 for the positions in P where character x appears.
Example: P = abaac
  U(a) = (1,0,1,1,0)
  U(b) = (0,1,0,0,0)
  U(c) = (0,0,0,0,1)
How to construct M
Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained by
  M(j) = BitShift( M(j-1) ) & U( T[j] )
For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1  ⇔  M(i-1,j-1) = 1
  (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1
BitShift moves bit M(i-1,j-1) into the i-th position;
AND this with the i-th bit of U(T[j]) to establish if both are true
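The column update above fits in one line of integer bit-twiddling; a minimal Python sketch, with bit i of an integer standing for position i+1 of P (names are mine):

```python
def shift_and(T, P):
    # Column M(j) is an integer: bit i-1 set <=> P[1..i] ends at T[j].
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)     # U(x): positions of x in P
    M, hits = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)  # BitShift, then AND U(T[j])
        if M & (1 << (m - 1)):            # last row set: match ends at j
            hits.append(j - m + 1)        # 0-indexed starting position
    return hits

print(shift_and("california", "for"))
```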
An example, j=1 (T = xabxabaaca, P = abaac)
  M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)
An example, j=2
  M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
An example, j=3
  M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)
An example, j=9
  M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)
The last bit is set: an occurrence of P ends at position 9 of T.
Shift-And method: Complexity
If m ≤ w, any column and vector U() fit in a memory word: any step requires O(1) time.
If m > w, any column and vector U() can be divided into m/w memory words: any step requires O(m/w) time.
Overall O(n(1 + m/w) + m) time.
Thus, it is very fast when the pattern length is close to the word size, as is very often the case in practice. Recall that w = 64 bits in modern architectures.
Some simple extensions
We want to allow the pattern to contain special symbols, like the [a-f] classes of chars
Example: P = [a-b]baac
  U(a) = (1,0,1,1,0)
  U(b) = (1,1,0,0,0)
  U(c) = (0,0,0,0,1)
What about ‘?’, ‘[^…]’ (not) ?
Problem 1: Another solution
Dictionary: a, bzip, not, or (plus space)
  P = bzip = 1a 0b
[Figure: Shift-And search for the codeword of P directly over the compressed text C(S) of S = “bzip or not bzip”]
Speed ≈ Compression ratio
Problem 2
Dictionary: a, bzip, not, or (plus space)
Given a pattern P, find all the occurrences in S of all terms containing P as a substring
  P = o
  not = 1g 0g 0a
  or  = 1g 0a 0b
[Figure: search over the compressed text C(S) of S = “bzip or not bzip” for the codewords of all dictionary terms containing P]
Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P
Multi-Pattern Matching Problem
Given a set of patterns P = {P1, P2, …, Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].
[Figure: patterns P1, P2 occurring in T]
Naïve solution
  Use an (optimal) exact matching algorithm, searching for each pattern in P
  Complexity: O(nl + m) time, not good with many patterns
Optimal solution due to Aho and Corasick
  Complexity: O(n + l + m) time
A simple extension of Shift-And
  S is the concatenation of the patterns in P
  R is a bitmap of length m.
  R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R, i.e. U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j:
    compute M(j)
    then M(j) OR U’(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How?
Problem 3
Dictionary: a, bzip, not, or (plus space)
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches
  P = bot, k = 2
[Figure: search over the compressed text C(S) of S = “bzip or not bzip”]
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa, P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.
  aatatccacaa        aatatccacaa
     atcgaa            atcgaa
Agrep
Our current goal: given k, find all the occurrences of P in T with up to k mismatches
We define the matrix M^l to be an m by n binary matrix, such that:
  M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T up through character j.
What is M^0?
How does M^k solve the k-mismatch problem?
Computing M^k
We compute M^l for all l = 0, …, k.
For each j compute M(j), M^1(j), …, M^k(j)
For all l initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff either:
Case 1: the first i-1 characters of P match a substring of T ending at j-1, with at most l mismatches, and the next pair of characters in P and T are equal:
  BitShift( M^l(j-1) ) & U( T[j] )
Case 2: the first i-1 characters of P match a substring of T ending at j-1, with at most l-1 mismatches (the new pair is a mismatch):
  BitShift( M^(l-1)(j-1) )
Hence:
  M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  OR  BitShift( M^(l-1)(j-1) )
Example (T = xabxabaaca, P = abaad)

  M^1 =  j: 1  2  3  4  5  6  7  8  9  10
    i=1     1  1  1  1  1  1  1  1  1  1
    i=2     0  0  1  0  0  1  0  1  1  0
    i=3     0  0  0  1  0  0  1  0  0  1
    i=4     0  0  0  0  1  0  0  1  0  0
    i=5     0  0  0  0  0  0  0  0  1  0

  M^0 =  j: 1  2  3  4  5  6  7  8  9  10
    i=1     0  1  0  0  1  0  1  1  0  1
    i=2     0  0  1  0  0  1  0  0  0  0
    i=3     0  0  0  0  0  0  1  0  0  0
    i=4     0  0  0  0  0  0  0  1  0  0
    i=5     0  0  0  0  0  0  0  0  0  0
How much do we pay?
The running time is O(kn(1 + m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.
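The k-mismatch recurrence above can be sketched in Python, keeping one integer column per mismatch level (substitutions only, as on the slides; names are mine):

```python
def shift_and_k_mismatch(T, P, k):
    # M[l] is the Shift-And column allowing up to l mismatches:
    #   M[l](j) = (BitShift(M[l](j-1)) & U(T[j])) | BitShift(M[l-1](j-1))
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)
    hits = []
    for j, c in enumerate(T):
        prev = 0                          # M[l-1](j-1); zero for l = 0
        for l in range(k + 1):
            old = M[l]
            M[l] = (((old << 1) | 1) & U.get(c, 0)) | \
                   ((((prev << 1) | 1)) if l else 0)
            prev = old
        if M[k] & (1 << (m - 1)):         # <= k mismatches end at j
            hits.append(j - m + 1)
    return hits
```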
Problem 3: Solution
Dictionary: a, bzip, not, or (plus space)
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches
  P = bot, k = 2
  not = 1g 0g 0a
[Figure: k-mismatch Shift-And search over the compressed text C(S) of S = “bzip or not bzip”]
Agrep: more sophisticated operations
The Shift-And method can solve other ops
The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol in p with a different one
Example: d(ananas, banane) = 3
Search by regular expressions
Example: (a|b)?(abc|a)
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
γ-code for integer encoding
  γ(x) = 000...0 (Length-1 zeros), followed by x in binary
  x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.
γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of γ-coded integers, reconstruct the original sequence:
  0001000001100110000011101100111
Answer: 8 6 3 59 7
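The γ encode/decode rules above are a few lines of Python (function names are mine); decoding counts the leading zeros and then reads that many bits plus one:

```python
def gamma_encode(x):
    # gamma(x): (Length-1) zeros, then x in binary; defined for x >= 1
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":        # count leading zeros = Length - 1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                   # 0001001
print(gamma_decode("0001000001100110000011101100111"))   # the exercise
```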
Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
  Compression ratio ≤ 2 * H0(s) + 1
Key fact: 1 ≥ sum_{i=1,...,x} pi ≥ x * px  ⇒  x ≤ 1/px
The cost of the encoding is (recall i ≤ 1/pi):
  sum_{i=1,...,|S|} pi * |γ(i)| ≤ sum_{i=1,...,|S|} pi * [2 * log(1/pi) + 1] = 2 * H0(X) + 1
Not much worse than Huffman, and improvable to H0(X) + 2 + ....
A better encoding
Byte-aligned and tagged Huffman
  128-ary Huffman tree
  First bit of the first byte is tagged
  Configurations on 7 bits: just those of Huffman
End-tagged dense code
  The rank r is mapped to the r-th binary sequence on 7*k bits
  First bit of the last byte is tagged
Surprising changes
  It is a prefix-code
  Better compression: it uses all 7-bit configurations
(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2
A new concept: Continuers vs Stoppers
  Previously we used: s = c = 128
The main idea is: s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte
  And s*c with 2 bytes, s*c^2 on 3 bytes, ...
An example
5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on ≤ 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on ≤ 2 bytes, hence more on 1 byte, and thus better if skewed...
Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256 - s.
  Brute-force approach
  Binary search: on real distributions, there seems to be one unique minimum
  (Ks = max codeword length; Fsk = cumulative prob. of symbols whose |cw| ≤ k)
Experiments: (s,c)-DC is quite interesting…
  Search is 6% faster than byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
  Move-to-Front (MTF):
    As a freq-sorting approximator
    As a caching strategy
    As a compressor
  Run-Length-Encoding (RLE):
    FAX compression
Move to Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded
  Start with the list of symbols L = [a,b,c,d,…]
  For each input symbol s:
    1) output the position of s in L
    2) move s to the front of L
There is a memory
Properties:
  Exploits temporal locality, and it is dynamic
  X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n), MTF = O(n log n) + n^2
Not much worse than Huffman
...but it may be far better
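The two MTF steps above can be sketched in Python (a minimal list-based version, O(|S|) per symbol rather than the O(log |S|) tree discussed below; names are mine):

```python
def mtf_encode(text, alphabet):
    # Output the 1-based position of each symbol in L, then move the
    # symbol to the front of L.
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)
        out.append(i + 1)
        L.insert(0, L.pop(i))      # move-to-front: the "memory"
    return out

# Repeated symbols cost 1 after their first occurrence
print(mtf_encode("aaab", "abc"))
```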
MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log i + 1
Put S in front and consider the cost of encoding; letting p_{x,i} denote the position of the i-th occurrence of symbol x, the output gap is at most the distance between consecutive occurrences:
  sum_x sum_i |γ( p_{x,i} - p_{x,i-1} )|
  ≤ O(|S| log |S|) + sum_x n_x * [2 * log(N/n_x) + 1]   (by Jensen’s inequality)
  = O(|S| log |S|) + N * [2 * H0(X) + 1]
Hence La[mtf] ≤ 2 * H0(X) + O(1)
MTF: higher compression
To achieve higher compression we consider words (and separators) as symbols to be encoded
How to keep the MTF-list efficiently:
  Search tree
    Leaves contain the symbols, ordered as in the MTF-list
    Nodes contain the size of their descending subtree
  Hash Table
    key is a symbol
    data is a pointer to the corresponding tree leaf
Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the numbers and one bit
Properties:
  Exploits spatial locality, and it is a dynamic code
  There is a memory
  X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)
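The run-collapsing step is a one-pass scan; a minimal Python sketch reproducing the slide's example (the function name is mine):

```python
def rle_encode(s):
    # abbbaacccca -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]
    out = []
    for c in s:
        if out and out[-1][0] == c:
            out[-1] = (c, out[-1][1] + 1)   # extend the current run
        else:
            out.append((c, 1))              # start a new run
    return out

print(rle_encode("abbbaacccca"))
```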
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive), where
  f(i) = sum_{j=1}^{i-1} p(j)
e.g. with f(a) = .0, f(b) = .2, f(c) = .7:
  a = .2 → [0, .2)
  b = .5 → [.2, .7)
  c = .3 → [.7, 1.0)
The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2, .7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac (with a = .2, b = .5, c = .3)
  start: [0, 1)
  b → [.2, .7)
  a → [.2, .3)
  c → [.27, .3)
The final sequence interval is [.27, .3)
Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:
  l_0 = 0,  s_0 = 1
  l_i = l_{i-1} + s_{i-1} * f[c_i]
  s_i = s_{i-1} * p[c_i]
f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is
  s_n = prod_{i=1}^{n} p[c_i]
The interval for a message sequence will be called the sequence interval
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
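Both directions can be sketched in Python with plain floats (fine for short messages; real coders use the integer version discussed later — function names are mine):

```python
def seq_interval(msg, p, f):
    # l_i = l_{i-1} + s_{i-1} * f[c_i];  s_i = s_{i-1} * p[c_i]
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, s

def decode_msg(x, n, p, f):
    # Find the symbol interval containing x, output it, rescale, repeat.
    out = []
    for _ in range(n):
        for c in p:
            if f[c] <= x < f[c] + p[c]:
                out.append(c)
                x = (x - f[c]) / p[c]
                break
    return "".join(out)

# The slides' distribution: a = .2, b = .5, c = .3
p = {"a": 0.2, "b": 0.5, "c": 0.3}
f = {"a": 0.0, "b": 0.2, "c": 0.7}
```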
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
  .49 ∈ [.2, .7)    → b, subdivide [.2, .7)
  .49 ∈ [.3, .55)   → b, subdivide [.3, .55)
  .49 ∈ [.475, .55) → c
The message is bbc.
Representing a real number
Binary fractional representation:
  .75 = .11
  1/3 = .0101...
  11/16 = .1011
Algorithm:
  1. x = 2 * x
  2. if x < 1 output 0
  3. else x = x - 1; output 1
So how about just using the shortest binary fractional representation in the sequence interval?
  e.g. [0, .33) = .01    [.33, .66) = .1    [.66, 1) = .11
Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

  code   min     max     interval
  .11    .110    .111    [.75, 1.0)
  .101   .1010   .1011   [.625, .75)

We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).
  e.g. sequence interval [.61, .79) contains the code interval of .101, i.e. [.625, .75)
Can use L + s/2 truncated to 1 + ⌈log(1/s)⌉ bits
Bound on Arithmetic length
Note that -log s + 1 = log(2/s)
Theorem: For a text of length n, the Arithmetic encoder generates at most
  1 + log(1/s) = 1 + log prod_{i=1,n} (1/p_i)
  ≤ 2 + sum_{i=1,n} log(1/p_i)
  = 2 + sum_{k=1,|S|} n * p_k * log(1/p_k)
  = 2 + n * H0 bits
In practice it is nH0 + 0.02 n bits, because of rounding
Integer Arithmetic Coding
Problem: operations on arbitrary precision real numbers are expensive.
Key ideas of the integer version:
  Keep integers in range [0..R) where R = 2^k
  Use rounding to generate integer intervals
  Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
  If l ≥ R/2 then (top half)
    Output 1 followed by m 0s; m = 0; message interval is expanded by 2
  If u < R/2 then (bottom half)
    Output 0 followed by m 1s; m = 0; message interval is expanded by 2
  If l ≥ R/4 and u < 3R/4 then (middle half)
    Increment m; message interval is expanded by 2
  All other cases: just continue...
You find this at ...
Arithmetic ToolBox
As a state machine: given the distribution (p1, ...., pS) and a symbol c, the ATB maps the current interval (L,s) to the new interval (L’,s’).
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
The ATB is driven by p[ s | context ], where s = c or esc: each step maps (L,s) to (L’,s’).
Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant)
PPM: Example Contexts (String = ACCBACCACBA, next symbol B, k=2)

  Context empty:  A=4  B=2  C=5  $=3
  Context A:      C=3  $=1
  Context B:      A=2  $=1
  Context C:      A=1  B=2  C=2  $=3
  Context AC:     B=1  C=2  $=2
  Context BA:     C=1  $=1
  Context CA:     C=1  $=1
  Context CB:     A=2  $=1
  Context CC:     A=1  B=1  $=2

You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings. The differences are:
  How the dictionary is stored
  How it is extended
  How it is indexed
  How elements are removed
No explicit frequency estimation
LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!
LZ77
  a a c a a c a b c a b a b a c
Dictionary = all substrings starting before the cursor; e.g. output <2,3,c>
Algorithm’s step:
  Output <d, len, c> where
    d = distance of copied string wrt current position
    len = length of longest match
    c = next char in text beyond longest match
  Advance by len + 1
A buffer “window” has fixed length and moves
Example: LZ77 with window size = 6 (longest match within W, then next character):
  a a c a a c a b c a b a a a c  →  (0,0,a)
  a a c a a c a b c a b a a a c  →  (1,1,c)
  a a c a a c a b c a b a a a c  →  (3,4,b)
  a a c a a c a b c a b a a a c  →  (3,3,a)
  a a c a a c a b c a b a a a c  →  (1,2,c)
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
  Finds the substring and inserts a copy of it
What if len > d? (overlap with text to be compressed)
  E.g. seen = abcd, next codeword is (2,9,e)
  Simply copy starting at the cursor:
    for (i = 0; i < len; i++)
      out[cursor+i] = out[cursor-d+i];
  Output is correct: abcdcdcdcdcdce
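A minimal Python decoder for the triples above; copying one character at a time is exactly what makes the overlapping len > d case work (the function name is mine):

```python
def lz77_decode(triples):
    # (d, length, c): copy `length` chars starting d positions back,
    # then append the literal c.
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):
            out.append(out[start + i])   # may read chars just written
        out.append(c)
    return "".join(out)

# The slide's windowed example decodes back to the original text
print(lz77_decode([(0, 0, "a"), (1, 1, "c"), (3, 4, "b"),
                   (3, 3, "a"), (1, 2, "c")]))
```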
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats:
  (0, position, length)  or  (1, char)
Typically uses the second format if length < 3.
Special greedy: possibly use a shorter match so that the next match is better
Hash table to speed up searches on triples
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
LZW: Encoding Example
                                  Output   Dict.
  a a b a a c a b a b a c b       112      256=aa
  a a b a a c a b a b a c b       112      257=ab
  a a b a a c a b a b a c b       113      258=ba
  a a b a a c a b a b a c b       256      259=aac
  a a b a a c a b a b a c b       114      260=ca
  a a b a a c a b a b a c b       257      261=aba
  a a b a a c a b a b a c b       261      262=abac
  a a b a a c a b a b a c b       114      263=cb
LZW: Decoding Example
Input
112
Dict
a
112
a a
256=aa
113
a a b
257=ab
256
a a b a a
258=ba
114
a a b a a c
259=aac
257
a a b a a c a b ?
260=ca
261
261
114
a a b a a c a b a b
261=aba
one
step
later
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows
F                  L
#  mississipp  i
i  #mississip  p
i  ppi#missis  s
i  ssippi#mis  s
i  ssissippi#  m
m  ississippi  #
p  i#mississi  p
p  pi#mississ  i
s  ippi#missi  s
s  issippi#mi  s
s  sippi#miss  i
s  sissippi#m  i
L = BWT(T)   (Burrows-Wheeler, 1994)
A famous example
Much
longer...
A useful tool: L
[Same sorted-rotation matrix as above: F = # i i i i m p p s s s s is known, the rotated text in the middle is unknown, L = i p s s m # p i s s i i]
F mapping
How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...
Take two equal L’s chars
Rotate rightward their rows
Same relative order !!
The BWT is invertible
[Same sorted-rotation matrix as above: F = # i i i i m p p s s s s, unknown middle, L = i p s s m # p i s s i i]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i ppi #
InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
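The whole transform and its inversion fit in a few lines of Python (a sketch: the rotation-sort is the naive method analyzed next, and the stable sort of L's positions plays the role of the LF-array):

```python
def bwt(text):
    """Sort all rotations of text (which must end with '#'); the BWT is
    the last column L of the sorted matrix."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(L):
    n = len(L)
    # Stable sort of L's positions by character: order[j] is the position
    # in L of the j-th char of F, so equal chars keep their relative order.
    order = sorted(range(n), key=lambda i: L[i])
    LF = [0] * n                  # LF[i] = row of F holding L[i]'s occurrence
    for j, i in enumerate(order):
        LF[i] = j
    out, r = [], 0                # row 0 is the rotation starting with '#'
    for _ in range(n):
        out.append(L[r])          # L[r] precedes F[r] in T: walk backward
        r = LF[r]
    t = "".join(reversed(out))
    return t[1:] + t[0]           # rotate the terminator back to the end
```
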
How to compute the BWT ?
SA   BWT matrix (sorted rotations)   L
12   #mississippi                    i
11   i#mississipp                    p
 8   ippi#mississ                    s
 5   issippi#miss                    s
 2   ississippi#m                    m
 1   mississippi#                    #
10   pi#mississip                    p
 9   ppi#mississi                    i
 7   sippi#missis                    s
 4   sissippi#mis                    s
 6   ssippi#missi                    i
 3   ssissippi#mi                    i
We said that: L[i] precedes F[i] in T
L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
How to construct SA from T ?
SA   suffix
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#
Elegant but inefficient
Input: T = mississippi#
Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii
# at 16
Mtf = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code
RLE0 = 03141041403141410210
Alphabet
|S|+1
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
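The Move-to-Front step of the pipeline can be sketched as follows (`mtf_encode` is an illustrative name; the RLE and statistical stages are omitted):

```python
def mtf_encode(L, alphabet):
    """Emit, for each char of L, its current rank in the list, then move
    that char to the front; local homogeneity yields runs of small ranks."""
    table = list(alphabet)
    out = []
    for c in L:
        r = table.index(c)
        out.append(r)
        table.pop(r)
        table.insert(0, c)       # move-to-front
    return out
```

On the prefix i p p p s s s s s s of the slide's L, with Mtf-list (i,m,p,s), this gives 0 2 0 0 3 0 0 0 0 0, matching the slide's output.
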
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google, 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node one can go to any other via an undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node one can go to any other via a directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
Largest artifact ever conceived by humans
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph
V = Routers
E = communication links
The “cosine” graph (undirected, weighted)
V = static web pages
E = semantic distance between pages
Query-Log graph (bipartite, weighted)
V = queries and URLs
E = (q,u) if u is a result for q, and has been clicked by some user who issued q
Social graph (undirected, unweighted)
V = users
E = (x,y) if x knows y (facebook, address book, email, ...)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has an hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1
The In-degree distribution
Altavista crawl, 1999
Indegree follows power law distribution
WebBase Crawl 2001
Pr[ in-degree(u) = k ] ∝ 1/k^α, with α ≈ 2.1
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has an hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
j
i
21 million pages, 150 million links
URL-sorting
Berkeley
Stanford
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
For negative entries: v(x) = 2x if x ≥ 0, 2|x|−1 if x < 0
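A sketch of this gap encoding of a successor list (function names are illustrative; WebGraph then feeds the gaps to instantaneous codes, and the first entry may be negative, which is why it needs the signed-to-unsigned map):

```python
def gap_encode(x, successors):
    """WebGraph-style gaps: the first entry is relative to the source
    node x (it may be negative); later entries are consecutive
    differences minus 1, since successors are strictly increasing."""
    s = sorted(successors)
    return [s[0] - x] + [b - a - 1 for a, b in zip(s, s[1:])]

def gap_decode(x, gaps):
    s = [x + gaps[0]]
    for g in gaps[1:]:
        s.append(s[-1] + g + 1)   # undo the "-1" of the encoder
    return s
```
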
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.
Consecutivity
3
in extra-nodes
Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
data
knowledge
about data
at receiver
sender
receiver
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
How if the sender has never seen data at receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver
Unstructured file: delta compression
“partial” knowledge
Unstructured files: file synchronization
Record-based data: set reconciliation
Formalization
Delta compression
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested
page and the ones available in cache
File synchronization
[diff, zdelta, REBL,…]
[rsynch, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
LZ77-scheme provides an efficient, optimal solution
fknown is the “previously encoded text”: compress fknown·fnew starting from fnew
zdelta is one of the best implementations
        | Emacs size | Emacs time
uncompr | 27Mb       | ---
gzip    | 8Mb        | 35 secs
zdelta  | 1.5Mb      | 42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
Client
reference
request
Slow-link
Delta-encoding
Proxy
reference
request
Fast-link
web
Page
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
[Figure: weighted graph GF plus the dummy node; zdelta sizes label the edges and the min branching picks the cheapest reference for each file]

        | space | time
uncompr | 30Mb  | ---
tgz     | 20%   | linear
THIS    | 8%    | quadratic
Improvement
What about
many-to-one compression?
(group of files)
Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files
Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
        | space | time
uncompr | 260Mb | ---
tgz     | 12%   | 2 mins
THIS    | 8%    | 16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
request
f_new
update
Server
f_old
Client
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
hashes
f_new
Server
encoded file
f_old
Client
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
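The 4-byte rolling hash can be sketched with an Adler-style checksum (illustrative only: real rsync pairs this weak rolling hash with a strong MD5-like hash per block):

```python
def weak_hashes(data, block):
    """Adler-like rolling checksum of every block-sized window of data.
    a = sum of the window's bytes, b = position-weighted sum; both are
    updated in O(1) when the window slides by one byte."""
    M = 1 << 16
    a = sum(data[:block]) % M
    b = sum((block - i) * data[i] for i in range(block)) % M
    hashes = [(a << 16) | b]
    for i in range(block, len(data)):
        a = (a - data[i - block] + data[i]) % M        # drop old, add new
        b = (b - block * data[i - block] + a) % M      # uses the new a
        hashes.append((a << 16) | b)
    return hashes
```
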
Rsync: some experiments
        | gcc size | emacs size
total   | 27288    | 27326
gzip    | 7563     | 8577
zdelta  | 227      | 1431
rsync   | 964      | 4452
Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).
A multi-round protocol
k blocks of n/k elems
Log n/k levels
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (ie. T[i,N])
i P
T
T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix
P = si
T = mississippi
mississippi
4,7
SUF(T) = Sorted set of suffixes of T
Reduction
From substring search
To prefix search
The Suffix Tree
[Figure: suffix tree of T# = mississippi# (positions 1..12). Edge labels include #, i, i#, s, si, ssi, p, pi#, ppi#, mississippi#; the 12 leaves store the starting positions of the suffixes.]
T# = mississippi#
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N²) space; the suffix array SA keeps only the suffix pointers.

SA   SUF(T)
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#

T = mississippi#   (P = si occurs at 4, 7)
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3],  T = mississippi#,  P = si
Compare P with the suffix at the middle SA entry (2 accesses per step): P is larger → recurse on the right half.
Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3],  T = mississippi#,  P = si
P is smaller than the middle suffix → recurse on the left half.
Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
overall, O(p log2 N) time
+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
Locating the occurrences
[Figure: binary search on SA locates the contiguous range of suffixes prefixed by P = si (conceptually bounded by si# and si$): occ = 2, at SA entries 7 (sippi#) and 4 (sissippi#); T = mississippi#]
Suffix Array search
• O (p + log2 N + occ) time
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree
[Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
Lcp   SA   suffix
  -   12   #
  0   11   i#
  1    8   ippi#
  1    5   issippi#
  4    2   ississippi#
  0    1   mississippi#
  0   10   pi#
  1    9   ppi#
  0    7   sippi#
  2    4   sissippi#
  1    6   ssippi#
  3    3   ssissippi#

(each Lcp value is the lcp between that suffix and the previous one; e.g. issippi# and ississippi# share the prefix issi of length 4)
T = mississippi#
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
Slide 13
Algoritmi per IR
Prologo
What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]
Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …
References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.
Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.
A bunch of scientific papers available on the course site !!
About this course
It is a mix of algorithms for
data compression
data indexing
data streaming (and sketching)
data searching
data mining
Massive data !!
Paradigm shift...
Web 2.0 is about the many
Big
DATA Big PC ?
We have three types of algorithms:
T1(n) = n,  T2(n) = n²,  T3(n) = 2ⁿ
... and assume that 1 step = 1 time unit
How many input data n can each algorithm process within t time units?
n1 = t,  n2 = √t,  n3 = log₂ t
What about a k-times faster processor?
...or, what is n, when the time units are k·t ?
n1 = k·t,  n2 = √k·√t,  n3 = log₂(kt) = log₂ k + log₂ t
A new scenario
Data are more available than even before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Step cost is Ω(1) time
Not just MIN #steps…
1
CPU
CPU
L1
L2
RAM
HD
net
registers
Cache
Few Mbs
Some nanosecs
Few words fetched
Few Gbs
Tens of nanosecs
Some words fetched
Few Tbs
Few millisecs
B = 32K page
Many Tbs
Even secs
Packets
You should be “??-aware programmers”
I/O-conscious Algorithms
track
read/write head
read/write arm
magnetic surface
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3 ÷ 0.4 (Hennessy-Patterson)]
C = cost of an I/O [10⁵ ÷ 10⁶ steps (Hennessy-Patterson)]
If N = (1+f)·M, then the D-avg cost per step is:
C · p · f/(1+f)
This is at least 10⁴ · f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algo uses all of it:
(1/B) · (p · f/(1+f) · C) ≈ 30 · f/(1+f)
Space-conscious Algorithms
I/Os
search
access
Compressed
data structures
Streaming Algorithms
track
read/write head
read/write arm
magnetic surface
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
CPU
L1
L2
RAM
HD
net
registers
Cache
Few Mbs
Some nanosecs
Few words fetched
Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
     | 4K  | 8K | 16K | 32K  | 128K | 256K | 512K | 1M
 n³  | 22s | 3m | 26m | 3.5h | 28h  | --   | --   | --
 n²  | 0   | 0  | 0   | 1s   | 26s  | 106s | 7m   | 28m
An optimal solution
We assume every subsum≠0
[Figure: A split into a prefix whose running sums are < 0 and the optimum window, within which running sums stay > 0]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
sum = 0; max = -∞;
For i = 1,...,n do
  if (sum + A[i] ≤ 0) sum = 0;
  else { sum += A[i]; max = MAX{max, sum}; }
Note:
• sum < 0 just before OPT starts;
• sum > 0 within OPT
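In Python the same single scan, also reporting the window (a sketch; `max_subarray` is an illustrative name):

```python
def max_subarray(a):
    """Kadane's scan: restart the window whenever the running sum
    would drop to <= 0; track the best sum and its endpoints."""
    best, running = a[0], 0
    best_lo = best_hi = lo = 0
    for i, x in enumerate(a):
        if running <= 0:
            running, lo = 0, i          # the optimum cannot start earlier
        running += x
        if running > best:
            best, best_lo, best_hi = running, lo, i
    return best, best_lo, best_hi
```

On the slide's array the best window is 6 1 -2 4 3 (sum 12).
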
Toy problem #2 : sorting
How to sort tuples (objects) on disk
Memory containing the tuples
A
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions Data get distributed arbitrarily !!!
B-tree internal nodes
B-tree leaves
(“tuple pointers")
What about
listing tuples
in order ?
Tuples
Possibly 10⁹ random I/Os = 10⁹ × 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10⁹ tuples → few Gbs
Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log₂ n) random I/Os
[5ms] · n log₂ n ≈ 1.5 years
In practice, it is faster because of caching...
2 passes (R/W)
Merge-Sort Recursion Tree
log2 N
If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: recursion tree of binary Merge-Sort; the leaves are the N/M runs, each sorted in internal memory. Embedded question: how do we deploy the disk/memory features?]
N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
Pass 1: Produce (N/M) sorted runs.
Pass i: merge X M/B runs logM/B N/M passes
INPUT 1
...
INPUT 2
...
OUTPUT
...
INPUT X
Disk
Disk
Main memory buffers of B items
Multiway Merging
[Figure: multiway merging. X = M/B input buffers Bf1…BfX (one current page per run), each with a pointer p_i; an output buffer Bfo with pointer p_o. Repeatedly emit min(Bf1[p1], Bf2[p2], …, BfX[pX]); fetch the next page of run i when p_i = B; flush Bfo to the merged output run when full; stop at EOF.]
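An in-memory stand-in for that merging loop, using a min-heap over the X current elements (the disk version would read each run through a B-sized buffer and flush the output buffer when full):

```python
import heapq

def multiway_merge(runs):
    """Merge X sorted runs into one sorted output.  The heap holds one
    (value, run-index, position) triple per run, i.e. the front element
    of each input buffer in the figure above."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)                       # "flush" to the merged run
        if j + 1 < len(runs[i]):              # advance pointer p_i
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out
```
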
Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≈ log_{M/B} N/M
Optimal cost = Θ((N/B) · log_{M/B} N/M) I/Os
In practice
M/B ≈ 1000 ⇒ #passes = log_{M/B} N/M ≈ 1
One multiway merge ⇒ 2 passes = few mins
Tuning depends
on disk features
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
Does compression may help?
Goal: enlarge M and reduce N
#passes = O(logM/B N/M)
Cost of a pass = O(N/B)
Part of Vitter’s paper…
In order to address issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how much we can go
Toy problem #3: Top-freq elements
Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space (i.e., the mode, provided it occurs > N/2 times)
A=b a c c c d c b a a a c c b c c c
Algorithm
Use a pair of variables <X,C>, initially C = 0
For each item s of the stream:
  if (C == 0) { X = s; C = 1; }
  else if (X == s) C++;
  else C--;
Return X;
Proof (sketch): If X ≠ y at the end, then every one of y’s occurrences has a “negative” mate. Hence these mates should be ≥ #occ(y), so 2·#occ(y) ≤ N, contradicting #occ(y) > N/2.
Problems arise if #occ(y) ≤ N/2: the returned X may be wrong.
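This is the Boyer-Moore majority-vote algorithm; a Python sketch (the returned candidate is guaranteed to be the mode only when some item truly occurs > N/2 times):

```python
def majority_candidate(stream):
    """O(1)-space scan: keep one candidate X and a counter C.
    Matching items increment C, non-matching ones cancel a vote."""
    x, c = None, 0
    for s in stream:
        if c == 0:
            x, c = s, 1        # adopt a new candidate
        elif x == s:
            c += 1
        else:
            c -= 1             # one occurrence of x gets a negative mate
    return x
```

On the slide's stream (where c occurs 9 times out of 17) the scan returns c.
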
Toy problem #4: Indexing
Consider the following TREC collection:
N = 6 · 10⁹ chars → size ≈ 6Gb
n = 10⁶ documents
TotT = 10⁹ term occurrences (avg term length is 6 chars)
t = 5 · 10⁵ distinct terms
What kind of data structure should we build to support
word-based searches ?
Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms

          | Antony&Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony    | 1 | 1 | 0 | 0 | 0 | 1
Brutus    | 1 | 1 | 0 | 1 | 0 | 0
Caesar    | 1 | 1 | 0 | 1 | 1 | 1
Calpurnia | 0 | 1 | 0 | 0 | 0 | 0
Cleopatra | 1 | 0 | 0 | 0 | 0 | 0
mercy     | 1 | 0 | 1 | 1 | 1 | 1
worser    | 1 | 0 | 1 | 1 | 1 | 0

1 if play contains word, 0 otherwise.  Space is 500Gb !
Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16
1. Typically we can still do better: i.e. 30-50% of the original text
2. We have 10⁹ total terms → at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
Better index, but yet it is >10 times the text !!!!
Please !!
Do not underestimate
the features of disks
in algorithmic design
Algoritmi per IR
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO: there are 2ⁿ of them, but fewer compressed messages:
Σ_{i=1}^{n−1} 2^i = 2ⁿ − 2
We need to talk
about stochastic sources
Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s ) log
1
2
- log
p(s)
2
p(s)
Lower probability higher information
Entropy is the weighted average of i(s)
H (S )
s S
p ( s ) log
1
2
p(s)
bits
Statistical Coding
How do we use probability p(s) to encode s?
Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be
uniquely decomposed into their codewords.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0
1
0 1
a
0
1
b
c
d
Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = Σ_{s∈S} p(s) · L[s]
We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…
Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}
then pi < pj ⟹ L[si] ≥ L[sj]
Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have
H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that
La(C) ≤ H(S) + 1
Shannon code
takes log 1/p bits
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
gzip, bzip, jpeg (as option), fax compression,…
Properties:
Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2
Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)  b(.2)  c(.2)  d(.5)
merge a,b → (.3);  merge (.3),c → (.5);  merge (.5),d → (1)
a = 000, b = 001, c = 01, d = 1
There are 2n-1 “equivalent” Huffman trees
What about ties (and thus, tree depth) ?
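A minimal Python sketch of the greedy construction, returning only the codeword lengths (ties are broken by insertion order, so one of the many equivalent Huffman trees is produced):

```python
import heapq

def huffman_lengths(weights):
    """Huffman's greedy: repeatedly merge the two lightest subtrees.
    Each merge pushes every symbol of the merged subtrees one level
    deeper, i.e. its codeword grows by one bit."""
    heap = [(w, i, [s]) for i, (s, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    length = {s: 0 for s in weights}
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, i, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            length[s] += 1
        heapq.heappush(heap, (w1 + w2, i, s1 + s2))
    return length
```

With the running example's weights 1, 2, 2, 5 (i.e. probabilities .1, .2, .2, .5) this yields lengths 3, 3, 2, 1, matching a=000, b=001, c=01, d=1.
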
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
[Tree as above]
Encoding: abc → 000·001·01 = 00000101
Decoding: 101001 → 1·01·001 = dcb
A property on tree contraction
Something like substituting symbols x,y with one new symbol x+y
...by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store for any level L:
= 00.....0
firstcode[L]
Symbol[L,i], for each i in level L
This is ≤ h2+ |S| log |S| bits
Canonical Huffman
Encoding
1
2
3
4
5
Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0
T=...00010...
Problem with Huffman Coding
Consider a symbol with probability .999. The self information is
−log₂(.999) ≈ .00144
If we were to send 1000 such symbols we might hope to use 1000·.00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
1 extra bit per macro-symbol = 1/k extra-bits per symbol
Larger model to be transmitted
Shannon took infinite sequences, and k → ∞ !!
In practice, we have:
Model takes |S|k (k * log |S|) + h2
It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L
(where h might be |S|)
Compress + Search ?
[Moura et al, 98]
Compressed text derived from a word-based Huffman:
Symbols of the huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged
[Figure: word-based tagged Huffman. The tree has fan-out 128, so codewords are sequences of 7-bit chunks packed into bytes; the first bit of each byte tags whether it starts a codeword. T = “bzip or not bzip” over the word alphabet {bzip, or, not, space}, with its byte-aligned compressed stream C(T).]
CGrep and other ideas...
P= bzip = 1a 0b
[Figure: compressed matching. GREP is run directly over C(T) for the codeword of P = bzip; the tag bits keep matches byte-aligned, so codeword boundaries rule out false matches.]
Speed ≈ Compression ratio
You find this at
You find it under my Software projects
Algoritmi per IR
Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits
Problem 1
Dictionary
P = bzip = 1a 0b
[Figure: dictionary {bzip, not, or, space} with tagged codewords; GREP over C(S), S = “bzip or not bzip”, marks yes/no matches at codeword boundaries.]
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T
A
B
P
A
B
C
A
B
D
A
B
Naïve solution
For any position i of T, check if T[i,i+m-1]=P[1,m]
Complexity: O(nm) time
(Classical) Optimal solutions based on comparisons
Knuth-Morris-Pratt
Boyer-Moore
Complexity: O(n + m) time
Semi-numerical pattern matching
We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods
The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet
Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in
order to obtain:
An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}ⁿ)
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:
H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]
P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s=s’ if and only if H(s)=H(s’)
Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).
Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers
Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠ H(P)
T=10110101
P=
0101
H(T5) = 5 = H(P)
Match!
Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1)
H(T_r) = 2·H(T_{r−1}) − 2^m·T(r−1) + T(r+m−1)
T = 10110101
T1 = 1011,  T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 + 0 = 6
Arithmetic replaces Comparisons
A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).
Total running time O(n+m)?
NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.
Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.
IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)
An example
P=101111
q=7
H(P) = 47
Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally (Horner + mod):
1·2 + 0 (mod 7) = 2
2·2 + 1 (mod 7) = 5
5·2 + 1 (mod 7) = 4
4·2 + 1 (mod 7) = 2
2·2 + 1 (mod 7) = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr−1), since
2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)
Intermediate values are also small! (< 2q)
Karp-Rabin Fingerprint
How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!
Our goal will be to choose a modulus q such that
q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either
declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)
Running time: excluding verification, O(n+m).
The randomized algorithm is correct w.h.p.
The deterministic algorithm has expected running time O(n+m).
Proof on the board
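A minimal Python sketch of the whole algorithm over bit strings, assuming a fixed prime q (the slides pick q at random below some bound I); verifying each candidate gives the deterministic variant:

```python
def karp_rabin(T, P, q=2**31 - 1):
    """Karp-Rabin matching: compare fingerprints Hq(P) and Hq(T_r) mod a
    prime q, and verify candidates to rule out false matches."""
    n, m = len(T), len(P)
    hp = ht = 0
    for c in P:
        hp = (2 * hp + int(c)) % q          # Hq(P), computed incrementally
    for c in T[:m]:
        ht = (2 * ht + int(c)) % q          # Hq(T_1)
    pow_m = pow(2, m, q)                    # 2^m mod q, for the rolling update
    occ = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:    # verify: exclude false matches
            occ.append(r + 1)               # 1-based positions, as in the slides
        if r + m < n:
            ht = (2 * ht - pow_m * int(T[r]) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))       # [5]
```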
Problem 1: Solution
P = bzip = 1a 0b
[Slide figure: the dictionary {a, bzip, not, or} with its codewords and the compressed text C(S) for S = “bzip or not bzip”; the codeword of P is scanned over C(S), marking “yes” where it aligns with an occurrence.]
Speed ≈ Compression ratio
The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for
[Slide figure: the m×n matrix M for P = for over T = california; the 1s of column j = 7 (the characters f-o-r of T ending at ‘r’) mark the full match.]
How does M solve the exact match problem?
How to construct M
We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one
Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:
And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.
BitShift( (0,1,1,0,1) ) = (1,0,1,1,0)
Let w be the word size (e.g., 32 or 64 bits). We’ll assume
m ≤ w. NOTICE: any column of M fits in a memory word.
How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 at the positions in P where character x appears.
Example: P = abaac
U(a) = (1,0,1,1,0)   U(b) = (0,1,0,0,0)   U(c) = (0,0,0,0,1)
How to construct M
Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained by
M(j) = BitShift(M(j−1)) & U(T[j])
Indeed, for i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T ending at character j−1, i.e. M(i−1, j−1) = 1, and
(2) P[i] = T[j], i.e. the i-th bit of U(T[j]) = 1.
BitShift moves bit M(i−1, j−1) into the i-th position;
AND-ing this with the i-th bit of U(T[j]) establishes whether both conditions hold.
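The construction can be sketched in Python, packing each column M(j) into one integer (assuming m ≤ w; names are mine):

```python
def shift_and(T, P):
    """Shift-And: column M(j) is an integer whose i-th bit (least
    significant bit = i=1) is M(i,j); M(j) = BitShift(M(j-1)) & U(T[j])."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):               # U(c): bit i set iff P[i+1] = c
        U[c] = U.get(c, 0) | (1 << i)
    occ, M = [], 0
    for j, c in enumerate(T, 1):
        M = ((M << 1) | 1) & U.get(c, 0)    # BitShift sets the first bit to 1
        if M & (1 << (m - 1)):              # M(m,j) = 1: occurrence ends at j
            occ.append(j)
    return occ

print(shift_and("xabxabaaca", "abaac"))     # [9]
print(shift_and("california", "for"))       # [7]
```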
An example: j=1
P = abaac, T = xabxabaaca
U(x) = (0,0,0,0,0)
M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)
An example: j=2
P = abaac, T = xabxabaaca
U(a) = (1,0,1,1,0)
M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
An example: j=3
P = abaac, T = xabxabaaca
U(b) = (0,1,0,0,0)
M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)
An example: j=9
P = abaac, T = xabxabaaca
U(c) = (0,0,0,0,1)
M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)
M(5,9) = 1: an occurrence of P ends at position 9.
Shift-And method: Complexity
If m<=w, any column and vector U() fit in a
memory word.
Any step requires O(1) time.
If m > w, any column and vector U() can be
divided into ⌈m/w⌉ memory words.
Any step requires O(m/w) time.
Overall O(n(1 + m/w) + m) time.
Thus, it is very fast when the pattern length is close
to the word size, which is very often the case in practice:
recall that w = 64 bits in modern architectures.
Some simple extensions
We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)   U(b) = (1,1,0,0,0)   U(c) = (0,0,0,0,1)
What about ‘?’ and ‘[^…]’ (not)?
Problem 1: Another solution
P = bzip = 1a 0b
[Slide figure: as before, the dictionary {a, bzip, not, or} and the compressed text C(S) for S = “bzip or not bzip”; each candidate alignment of P’s codeword in C(S) is marked yes/no.]
Speed ≈ Compression ratio
Problem 2
Given a pattern P, find all the occurrences in S of all terms containing P as substring.
P = o
[Slide figure: the dictionary {a, bzip, not, or} and the compressed text C(S) for S = “bzip or not bzip”; the terms containing P are not = 1g 0g 0a and or = 1g 0a 0b.]
Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.
Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Slide figure: a text T (A B C A C A B D D A B A) with the occurrences of patterns P1 and P2 highlighted.]
Naïve solution
Use an (optimal) Exact Matching Algorithm searching each
pattern in P
Complexity: O(nl+m) time, not good with many patterns
Optimal solution due to Aho and Corasick
Complexity: O(n + l + m) time
A simple extension of Shift-And
S is the concatenation of the patterns in P
R is a bitmap of length m, with R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method searching for S:
For any symbol c, U’(c) = U(c) AND R, i.e.
U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
For any step j,
compute M(j)
then OR it with U’(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j]
Check if there are occurrences ending in j. How?
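A minimal sketch of this variant (names mine; an extra bitmap over the last symbols of the patterns is used to detect occurrences):

```python
def multi_shift_and(T, patterns):
    """Shift-And over S = concatenation of the patterns: R marks the first
    symbol of each pattern, U'(c) = U(c) & R, and after each step we OR in
    U'(T[j]) to (re)start every pattern beginning with T[j]."""
    S = "".join(patterns)
    U, R, E, pos = {}, 0, 0, 0
    for p in patterns:
        R |= 1 << pos                       # first symbol of pattern p
        E |= 1 << (pos + len(p) - 1)        # last symbol of pattern p
        pos += len(p)
    for i, c in enumerate(S):
        U[c] = U.get(c, 0) | (1 << i)
    occ, M = [], 0
    for j, c in enumerate(T, 1):
        M = (M << 1) & U.get(c, 0)
        M |= U.get(c, 0) & R                # U'(c): restart patterns starting with c
        if M & E:                           # some pattern ends at position j
            occ.append(j)
    return occ

print(multi_shift_and("abdcabc", ["ab", "ca"]))   # [2, 5, 6]
```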
Problem 3
Dictionary
Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches
a
bzip
not
or
b
C(S)
1
a 0 b
[bzip]
b
[]
1
1
b
g
1
[]
g
0
g
[not]
0
a
a
b
g
a
or not
S = “bzip or not bzip”
1
space
bzip
space
P = bot k=2
b
g
a
[or]
0
1
b
[]
b
0
1
a 0 b
[bzip]
Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position 4;
it also occurs with 4 mismatches starting at position 2.
aatatccacaa
atcgaa
aatatccacaa
atcgaa
Agrep
Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:
Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.
What is M0?
How does Mk solve the k-mismatch problem?
Computing Mk
We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
Computing Ml: case 1
The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.
[Slide figure: P[1..i−1] aligned with T ending at j−1 with at most l mismatches (stars), followed by an equal character pair P[i] = T[j].]
BitShift(Ml(j−1)) & U(T[j])
Computing Ml: case 2
The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.
[Slide figure: P[1..i−1] aligned with T ending at j−1 with at most l−1 mismatches (stars); the next character pair may mismatch.]
BitShift(Ml−1(j−1))
Computing Ml
We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff
Ml(j) = [ BitShift(Ml(j−1)) & U(T[j]) ] OR BitShift(Ml−1(j−1))
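The recurrence translates directly into code; a sketch keeping one integer per column of each matrix M0, …, Mk (names mine):

```python
def agrep(T, P, k):
    """Shift-And with mismatches: M[l] holds column j of matrix M^l; the
    recurrence is M^l(j) = [BitShift(M^l(j-1)) & U(T[j])] | BitShift(M^{l-1}(j-1)),
    with BitShift setting the first bit to 1."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = [0] * (k + 1), []
    for j, c in enumerate(T, 1):
        prev = M[:]                              # columns j-1, for all l
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & (1 << (m - 1)):
            occ.append(j)                        # P ends at j with <= k mismatches
    return occ

print(agrep("xabxabaaca", "abaad", 1))           # [9]
print(agrep("aatatccacaa", "atcgaa", 2))         # [9]
```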
Example M1
P = abaad, T = xabxabaaca
[Slide figure: the matrices M0 and M1 for P = abaad over T = xabxabaaca. In M0 row 5 is all zeros (no exact occurrence); in M1 entry (5,9) = 1, since abaad matches T[5,9] = abaac with one mismatch.]
How much do we pay?
The running time is O(kn(1 + m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any
given time; hence the space used by the
algorithm is O(k) memory words.
Problem 3: Solution
Given a pattern P, find all the occurrences in S of all terms containing P as substring, allowing k mismatches.
P = bot, k = 2
[Slide figure: the dictionary {a, bzip, not, or} and the compressed text C(S) for S = “bzip or not bzip”; the term matching P with at most 2 mismatches is not = 1g 0g 0a.]
Agrep: more sophisticated operations
The Shift-And method can solve other ops
The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:
Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one
Example: d(ananas,banane) = 3
Search by regular expressions
Example: (a|b)?(abc|a)
Algoritmi per IR
Some thoughts
on
some peculiar compressors
Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm
Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.
γ-code for integer encoding
γ(x) = 0…0 (Length − 1 zeros) followed by x in binary,
where x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.
The γ-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e., a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of γ-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
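A small Python sketch of the γ-code, which also checks the exercise above:

```python
def gamma_encode(x):
    """gamma(x): (|bin(x)| - 1) zeros followed by x in binary."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":               # count the unary length prefix
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

code = "0001000001100110000011101100111"
print(gamma_decode(code))                   # [8, 6, 3, 59, 7]
print("".join(gamma_encode(x) for x in [8, 6, 3, 59, 7]) == code)  # True
```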
Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code γ(i).
Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:
1 ≥ Σi=1,…,x pi ≥ x·px  ⟹  x ≤ 1/px
How good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):
Σi=1,…,|S| pi·|γ(i)| ≤ Σi=1,…,|S| pi·[2·log(1/pi) + 1] = 2·H0(X) + 1
Not much worse than Huffman,
and improvable to H0(X) + 2 + …
A better encoding
Byte-aligned and tagged Huffman
128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman
End-tagged dense code
The rank r is mapped to r-th binary sequence on 7*k bits
First bit of the last byte is tagged
A better encoding
Surprising changes
It is a prefix-code
Better compression: it uses all 7-bits configurations
(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2
A new concept: Continuers vs Stoppers
The main idea is:
Previously we used: s = c = 128
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte,
s·c with 2 bytes, s·c² with 3 bytes, …
An example
5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...
Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.
Brute-force approach
Binary search:
On real distributions, it seems that one unique minimum
Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k
Experiments: (s,c)-DC is much more interesting…
Search is 6% faster than
byte-aligned Huffword
Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?
Move-to-Front (MTF):
As a freq-sorting approximator
As a caching strategy
As a compressor
Run-Length-Encoding (RLE):
FAX compression
Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
There is a memory
Properties:
Exploit temporal locality, and it is dynamic
X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits
Not much worse than Huffman
...but it may be far better
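A minimal sketch of the MTF transform (function names mine):

```python
def mtf_encode(s, alphabet):
    """Move-to-Front: output the position of each symbol in the list L,
    then move that symbol to the front of L."""
    L, out = list(alphabet), []
    for c in s:
        i = L.index(c)
        out.append(i)
        L.insert(0, L.pop(i))               # move the symbol to the front
    return out

def mtf_decode(code, alphabet):
    L, out = list(alphabet), []
    for i in code:
        out.append(L[i])
        L.insert(0, L.pop(i))
    return "".join(out)

# Temporal locality turns runs of repeated symbols into small integers:
print(mtf_encode("aaaabbbbcccc", "abc"))    # [0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0]
```

The integer output can then be fed to a variable-length code such as γ.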
MTF: how good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
Put S in the front, and consider the cost of encoding:
O(|S| log |S|) + Σx=1,…,|S| Σi |γ(pxi − pxi−1)|
By Jensen’s inequality this is:
≤ O(|S| log |S|) + Σx=1,…,|S| nx·[2·log(N/nx) + 1]
= O(|S| log |S|) + N·[2·H0(X) + 1]
Hence La[mtf] ≤ 2·H0(X) + O(1) bits per symbol.
MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to maintain the MTF-list efficiently:
Search tree:
leaves contain the symbols, ordered as in the MTF-list;
nodes contain the size of their descending subtree.
Hash table:
key is a symbol;
data is a pointer to the corresponding tree leaf.
Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed
Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings just numbers and one bit
Properties:
Exploits spatial locality, and it is a dynamic code
There is a memory
X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
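A minimal sketch of RLE on the slide’s example:

```python
def rle_encode(s):
    """Run-Length Encoding: emit (symbol, run-length) pairs."""
    out = []
    for c in s:
        if out and out[-1][0] == c:
            out[-1] = (c, out[-1][1] + 1)   # extend the current run
        else:
            out.append((c, 1))              # start a new run
    return out

def rle_decode(pairs):
    return "".join(c * n for c, n in pairs)

print(rle_encode("abbbaacccca"))
# [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]
```

For binary strings, only the run lengths and the first bit need to be stored.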
Algoritmi per IR
Arithmetic coding
Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).
e.g., p(a) = .2, p(b) = .5, p(c) = .3, with
f(i) = Σj=1,…,i−1 p(j),  so f(a) = .0, f(b) = .2, f(c) = .7
[Slide figure: the unit interval [0,1) split into a = [.0,.2), b = [.2,.7), c = [.7,1.0).]
The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))
Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Slide figure: coding bac — b maps [0,1) to [.2,.7); a maps [.2,.7) to [.2,.3); c maps [.2,.3) to [.27,.3).]
The final sequence interval is [.27,.3)
Arithmetic Coding
To code a sequence of symbols c with probabilities
p[c] use the following:
l0 = 0,  s0 = 1
li = li−1 + si−1 · f[ci]
si = si−1 · p[ci]
f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is sn = ∏i=1,…,n p[ci]
The interval for a message sequence will be called the
sequence interval
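The recurrence above can be sketched directly (a float-based illustration; with exact rationals the interval for bac is exactly [.27, .3)):

```python
def sequence_interval(msg, p):
    """Compute the sequence interval [l, l+s) via
    l_i = l_{i-1} + s_{i-1}*f[c_i]  and  s_i = s_{i-1}*p[c_i]."""
    f, acc = {}, 0.0
    for c in sorted(p):                     # fix an order to define f
        f[c] = acc                          # cumulative prob. up to c (excluded)
        acc += p[c]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]
        s = s * p[c]
    return l, s

l, s = sequence_interval("bac", {"a": 0.2, "b": 0.5, "c": 0.3})
print(round(l, 10), round(l + s, 10))       # 0.27 0.3
```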
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Slide figure: .49 falls in b = [.2,.7); within it, in the sub-interval of b = [.3,.55); within that, in the sub-interval of c = [.475,.55).]
The message is bbc.
Representing a real number
Binary fractional representation:
.75 = .11    1/3 = .010101…    11/16 = .1011
Algorithm:
1. x = 2·x
2. If x < 1, output 0
3. else x = x − 1; output 1
So how about just using the shortest binary
fractional representation in the sequence interval?
e.g. [0,.33) → .01    [.33,.66) → .1    [.66,1) → .11
Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
number   min      max      interval
.11      .110…    .111…    [.75, 1.0)
.101     .1010…   .1011…   [.625, .75)
We will call this the code interval.
Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).
[Slide figure: a sequence interval [.61, .79) containing the code interval of .101 = [.625, .75).]
Can use L + s/2 truncated to 1 + log (1/s) bits
Bound on Arithmetic length
Note that –log s+1 = log (2/s)
Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log(1/s) = 1 + log ∏i=1,…,n (1/pi)
≤ 2 + Σi=1,…,n log(1/pi)
= 2 + Σk=1,…,|S| n·pk·log(1/pk)
= 2 + n·H0 bits
In practice nH0 + 0.02·n bits,
because of rounding
Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:
Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2
Integer Arithmetic is an approximation
Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s; m = 0
Message interval is expanded by 2
If u < R/2 then (bottom half)
Output 0 followed by m 1s; m = 0
Message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2
All other cases: just continue...
You find this at
Arithmetic ToolBox
As a state machine
[Slide figure: the Arithmetic ToolBox (ATB) as a state machine — from state (L,s), given the distribution (p1,…,p|S|) and the next symbol c, it moves to state (L’,s’).]
Therefore, even the distribution can change over time
K-th order models: PPM
Use previous k characters as the context.
Makes use of conditional probabilities
This is the changing distribution
Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.
Need to keep k small so that dictionary does
not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?
Cannot code 0 probabilities!
The key idea of PPM is to reduce context size if
previous match has not been seen.
If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….
Keep statistics for each context size < k
The escape is a special character with some probability.
Different variants of PPM use different heuristics for
the probability.
PPM + Arithmetic ToolBox
[Slide figure: the ATB driven by PPM — from state (L,s), given p[s|context] with s = c or esc, it moves to state (L’,s’).]
Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)
PPM: Example Contexts
String = ACCBACCACBA B, k = 2
[Slide table: for each context size, the symbol counts including the escape $ — empty context: A = 4, B = 2, C = 5, $ = 3; contexts of size 1: A, B, C; contexts of size 2: AC, BA, CA, CB, CC.]
You find this at: compression.ru/ds/
Algoritmi per IR
Dictionary-based algorithms
Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
How the dictionary is stored
How it is extended
How it is indexed
How elements are removed
No explicit
frequency estimation
LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) as n → ∞ !!
LZ77
a a c a a c a b c a b a b a c
[Slide figure: the dictionary is the already-scanned part of the text (all substrings starting there); the cursor marks the current position; the triple emitted here is <2,3,c>.]
Algorithm’s step:
Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match
Advance by len + 1
A buffer “window” has fixed length and moves
Example: LZ77 with window
a a c a a c a b c a b a a a c
(0,0,a)
a a c a a c a b c a b a a a c
(1,1,c)
a a c a a c a b c a b a a a c
(3,4,b)
a a c a a c a b c a b a a a c
(3,3,a)
a a c a a c a b c a b a a a c
(1,2,c)
Window size = 6
Longest match
within W
Next character
LZ77 Decoding
Decoder keeps same dictionary window as encoder.
Finds substring and inserts a copy of it
What if l > d? (overlap with text to be compressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
LZ78
Dictionary:
substrings stored in a trie (each has an id).
Coding loop:
find the longest match S in the dictionary
Output its id and the next character c after
the match in the input string
Add the substring Sc to the dictionary
Decoding:
builds the same dictionary and looks at ids
LZ78: Coding Example
Output
Dict.
a a b a a c a b c a b c b
(0,a) 1 = a
a a b a a c a b c a b c b
(1,b) 2 = ab
a a b a a c a b c a b c b
(1,a) 3 = aa
a a b a a c a b c a b c b
(0,c) 4 = c
a a b a a c a b c a b c b
(2,c) 5 = abc
a a b a a c a b c a b c b
(5,b) 6 = abcb
LZ78: Decoding Example
Dict.
Input
(0,a) a
1 = a
(1,b) a a b
2 = ab
(1,a) a a b a a
3 = aa
(0,c) a a b a a c
4 = c
(2,c) a a b a a c a b c
5 = abc
(5,b) a a b a a c a b c a b c b
6 = abcb
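The two examples above can be reproduced with a short sketch (names mine):

```python
def lz78_encode(s):
    """LZ78: emit (id of longest dictionary match, next char), then add
    the extended string to the dictionary (id 0 = empty string)."""
    dic, out, nxt, i = {}, [], 1, 0
    while i < len(s):
        j, cur = i, 0                       # cur = id of the current match
        while j < len(s) and (cur, s[j]) in dic:
            cur = dic[(cur, s[j])]
            j += 1
        c = s[j] if j < len(s) else ""
        out.append((cur, c))
        dic[(cur, c)] = nxt                 # add Sc to the dictionary
        nxt += 1
        i = j + 1
    return out

def lz78_decode(code):
    words, out, nxt = {0: ""}, [], 1
    for i, c in code:
        w = words[i] + c                    # same dictionary, built from ids
        out.append(w)
        words[nxt] = w
        nxt += 1
    return "".join(out)

code = lz78_encode("aabaacabcabcb")
print(code)   # [(0, 'a'), (1, 'b'), (1, 'a'), (0, 'c'), (2, 'c'), (5, 'b')]
print(lz78_decode(code) == "aabaacabcabcb")  # True
```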
LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c
There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
LZW: Encoding Example
Output
Dict.
a a b a a c a b a b a c b
112 256=aa
a a b a a c a b a b a c b
112 257=ab
a a b a a c a b a b a c b
113
258=ba
a a b a a c a b a b a c b
256
259=aac
a a b a a c a b a b a c b
114
260=ca
a a b a a c a b a b a c b
257
261=aba
a a b a a c a b a b a c b
261
262=abac
a a b a a c a b a b a c b
114
263=cb
LZW: Decoding Example
Input   Output so far              Dict
112     a
112     a a                        256=aa
113     a a b                      257=ab
256     a a b a a                  258=ba
114     a a b a a c                259=aac
257     a a b a a c a b            260=ca
261     a a b a a c a b ? → aba    261=aba (one step later: 261 is the special case, resolved as ab + a)
114     a a b a a c a b a b a c    262=abac
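A minimal sketch with the slides’ toy ids (a = 112, …) and the special-case handling:

```python
def lzw_encode(s, alphabet="abc"):
    dic = {c: i + 112 for i, c in enumerate(alphabet)}  # toy ids as in the slides
    nxt, out, w = 256, [], ""
    for c in s:
        if w + c in dic:
            w += c
        else:
            out.append(dic[w])
            dic[w + c] = nxt                # add Sc, but do not transmit c
            nxt += 1
            w = c
    out.append(dic[w])
    return out

def lzw_decode(code, alphabet="abc"):
    dic = {i + 112: c for i, c in enumerate(alphabet)}
    nxt, w = 256, dic[code[0]]
    out = [w]
    for k in code[1:]:
        entry = dic[k] if k in dic else w + w[0]   # the special not-yet-known case
        out.append(entry)
        dic[nxt] = w + entry[0]             # decoder is one step behind
        nxt += 1
        w = entry
    return "".join(out)

code = lzw_encode("aabaacababacb")
print(code)   # [112, 112, 113, 256, 114, 257, 261, 114, 113]
print(lzw_decode(code) == "aabaacababacb")  # True
```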
LZ78 and LZW issues
How do we keep the dictionary small?
Throw the dictionary away when it reaches a
certain size (used in GIF)
Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/
Algoritmi per IR
Burrows-Wheeler Transform
The big (unconscious) step...
The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows (1994):
F                 L
#  mississipp     i
i  #mississip     p
i  ppi#missis     s
i  ssippi#mis     s
i  ssissippi#     m
m  ississippi     #
p  i#mississi     p
p  pi#mississ     i
s  ippi#missi     s
s  issippi#mi     s
s  sippi#miss     i
s  sissippi#m     i
A famous example. In practice T is much longer...
A useful tool: the L → F mapping
[Slide figure: the sorted-rotation matrix with columns F = # i i i i m p p s s s s and L = i p s s m # p i s s i i; the middle of each row is unknown to the decoder.]
How do we map L’s chars onto F’s chars?
... we need to distinguish equal chars in F...
Take two equal L’s chars and rotate their rows rightward: they keep the same relative order !!
The BWT is invertible
[Slide figure: the sorted-rotation matrix again, with F = # i i i i m p p s s s s and L = i p s s m # p i s s i i.]
Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[i] precedes F[i] in T
Reconstruct T backward:
T = .... i p p i #
InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
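A runnable sketch of the transform and its inversion via the LF mapping (the naive sorted-rotations construction; names mine):

```python
def bwt(T):
    """BWT via sorted rotations: L is the last column."""
    rots = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(r[-1] for r in rots)

def ibwt(L):
    """Invert the BWT: F is the (stably) sorted L; LF maps each char of L
    to its position in F; since L[r] precedes F[r] in T, walking LF from
    row 0 (the row whose F char is '#') spells T backward."""
    n = len(L)
    F = sorted(range(n), key=lambda i: (L[i], i))   # stable sort: F[f] is an L-position
    LF = [0] * n
    for f, l in enumerate(F):
        LF[l] = f
    r, out = 0, []
    for _ in range(n):
        out.append(L[r])                    # L[r] precedes F[r] in T
        r = LF[r]
    s = "".join(reversed(out))              # a rotation of T starting with '#'
    return s[1:] + s[0]                     # put the terminator back at the end

L = bwt("mississippi#")
print(L)            # ipssm#pissii
print(ibwt(L))      # mississippi#
```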
How to compute the BWT ?
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
[Slide figure: the BWT matrix rows (sorted rotations of mississippi#) paired with the SA values above and the column L = i p s s m # p i s s i i.]
We said that L[i] precedes F[i] in T, e.g. L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i] − 1]
How to construct SA from T ?
SA   suffix
12   #
11   i#
8    ippi#
5    issippi#
2    ississippi#
1    mississippi#
10   pi#
9    ppi#
7    sippi#
4    sissippi#
6    ssippi#
3    ssissippi#
Elegant but inefficient
Input: T = mississippi#
Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...
Compressing L seems promising...
Key observation:
L is locally homogeneous
L is highly compressible
Algorithm Bzip :
Move-to-Front coding of L
Run-Length coding
Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii
# at 16
Mtf = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code
RLE0 = 03141041403141410210
Alphabet
|S|+1
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
You find this in your Linux distribution
Algoritmi per IR
Web-graph Compression
The Web’s Characteristics
Size
1 trillion pages available (Google, 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages, 25% new links change weekly
Life time of about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node can go to any node via an
undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node can go to any node via a
directed path.
WCC
SCC
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
Largest artifact ever conceived by humankind
Exploit its structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph:
V = routers, E = communication links
The “cosine” graph (undirected, weighted):
V = static web pages, E = semantic distance between pages
Query-Log graph (bipartite, weighted):
V = queries and URLs, E = (q,u) if u is a result for q, and has been clicked by some user who issued q
Social graph (undirected, unweighted):
V = users, E = (x,y) if x knows y (facebook, address book, email, ..)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1
The In-degree distribution
Altavista crawl, 1999 — WebBase crawl, 2001
Indegree follows a power-law distribution:
Pr[in-degree(u) = k] ≈ 1/k^a,  a ≈ 2.1
A Picture of the Web Graph
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
Similarity: pages close in lexicographic order tend to share
many outgoing lists
A Picture of the Web Graph
[Slide figure: adjacency-matrix plot of 21 million pages and 150 million links (Berkeley/Stanford crawl), after URL-sorting.]
URL compression + Delta encoding
The library WebGraph
Uncompressed
adjacency list
Adjacency list with
compressed gaps
(locality)
Successor list S(x) = {s1−x, s2−s1−1, ..., sk−sk−1−1}
For negative entries (only the first gap can be negative): map them to naturals, e.g. v(g) = 2g if g ≥ 0, v(g) = 2|g|−1 otherwise.
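A sketch of the gap transformation (the zig-zag map v() is one illustrative choice; WebGraph’s actual instantaneous codes differ):

```python
def encode_gaps(x, successors):
    """Gap-encode the successor list of node x:
    S(x) = {s1-x, s2-s1-1, ..., sk-s(k-1)-1}; the first gap may be
    negative, so it is mapped to a natural via v(g)."""
    s = sorted(successors)
    gaps = [s[0] - x] + [s[i] - s[i - 1] - 1 for i in range(1, len(s))]
    v = lambda g: 2 * g if g >= 0 else 2 * (-g) - 1   # zig-zag: int -> nat
    return [v(gaps[0])] + gaps[1:]

def decode_gaps(x, code):
    g0 = code[0] // 2 if code[0] % 2 == 0 else -(code[0] + 1) // 2
    s = [x + g0]
    for g in code[1:]:
        s.append(s[-1] + g + 1)
    return s

succ = [13, 15, 16, 17, 18, 19, 30]
code = encode_gaps(15, succ)
print(code)                          # [3, 1, 0, 0, 0, 0, 10]
print(decode_gaps(15, code) == succ) # True
```

Locality makes most gaps tiny, which is why a universal code over them compresses well.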
Copy-lists
Reference chains
possibly limited
Uncompressed
adjacency list
D
Adjacency list with
copy lists
(similarity)
Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.
Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.
Adjacency list with
copy blocks
(RLE on bit sequences)
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks
This is a Java and C++ lib
(≈3 bits/edge)
Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.
Consecutivity in extra-nodes
Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
Algoritmi per IR
Compression of file collections
Background
[Slide figure: a sender transmits data over network links to a receiver, which already holds some knowledge about the data.]
network links are getting faster and faster but
many clients still connected by fairly slow links (mobile?)
people wish to send more and more data
how can we make this transparent to the user?
Two standard techniques
caching: “avoid sending the same object again”
done on the basis of objects
only works if objects completely unchanged
How about objects that are slightly changed?
compression: “remove redundancy in transmitted data”
avoid repeated substrings in data
can be extended to history of past transmissions
How if the sender has never seen data at receiver ?
(overhead)
Types of Techniques
Common knowledge between sender & receiver:
unstructured file → delta compression
“Partial” knowledge:
unstructured files → file synchronization
record-based data → set reconciliation
Formalization
Delta compression [diff, zdelta, REBL, …]
Compress file f deploying file f’
Compress a group of files
Speed-up web access by sending differences between the requested page and the ones available in cache
File synchronization [rsync, zsync]
Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net
Set reconciliation
Client updates structured old file fold with fnew available on a server
Update of contacts or appointments, intersect IL in P2P search engine
Z-delta compression
(one-to-one)
Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd
Assume that block moves and copies are allowed
Find an optimal covering set of fnew based on fknown
The LZ77-scheme provides an efficient, optimal solution:
fknown is the “previously encoded text”; compress the concatenation fknown·fnew starting from fnew
zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs
Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
[Slide figure: Client ⇄ (slow link, delta-encoded pages) ⇄ Proxy ⇄ (fast link) ⇄ web; both sides hold the reference page and the request.]
Use zdelta to reduce traffic:
Old version available at both proxies
Restricted to pages already visited (30% hits), URL-prefix match
Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F
Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find for each f F a good reference
Reduction to the Min Branching problem on DAGs
Build a weighted graph GF, nodes=files, weights= zdelta-size
Insert a dummy node connected to all, and weights are gzip-coding
Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
[Slide figure: the weighted graph GF with a dummy node 0; edge weights (e.g. 20, 123, 220, 620, 2000) are zdelta sizes, and the min branching selects the cheapest reference for each file.]
          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic
Improvement
What about
many-to-one compression?
(group of files)
Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)
We wish to exploit some pruning approach
Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files
Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins
Algoritmi per IR
File Synchronization
File synch: The problem
[Slide figure: the Client holds f_old and sends a request; the Server holds f_new and sends back an update.]
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch
Since the server has both copies of the files
The rsync algorithm
[Slide figure: the Client sends block hashes of f_old; the Server, holding f_new, replies with the encoded file.]
The rsync algorithm
(contd)
simple, widely used, single roundtrip
optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
choice of block size problematic (default: max{700, √n} bytes)
not good in theory: granularity of changes may disrupt use of blocks
Rsync: some experiments
         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452
Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!
A new framework: zsync
Server sends hashes (unlike the client in rsync); the client checks them.
Server deploys the common f_ref to compress the new f_tar (rsync just compresses it).
A multi-round protocol
k blocks of n/k elems
Log n/k levels
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
Next lecture
Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.
Note:
set reconciliation is “easier” than file sync [it is record-based]
Not perfectly true but...
Algoritmi per IR
Text Indexing
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...
How do we solve Prefix Search?
Trie !!
Array of string pointers !!
What about Substring Search ?
Basic notation and facts
Pattern P occurs at position i of T iff P is a prefix of the i-th suffix of T (i.e. T[i,N]).
Occurrences of P in T = all suffixes of T having P as a prefix.
Example: P = si, T = mississippi ⇒ occurrences at positions 4, 7.
SUF(T) = Sorted set of suffixes of T
Reduction: from substring search to prefix search.
The Suffix Tree
[Figure: suffix tree of T# = mississippi# (suffix positions 1..12). Edges carry substring labels (i, s, p, si, ssi, ppi#, pi#, i#, mississippi#, …) and each leaf stores the starting position of its suffix.]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space if SUF(T) is stored explicitly; the suffix array SA stores only the suffix pointers:

SA    SUF(T)
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi#
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#

T = mississippi#
(e.g. P = si prefixes the contiguous suffixes starting at 7 and 4)
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
[Figure: a binary-search step over SA = 12 11 8 5 2 1 10 9 7 4 6 3]
T = mississippi#
P is larger
2 accesses per step
P = si
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
[Figure: a binary-search step over SA = 12 11 8 5 2 1 10 9 7 4 6 3]
T = mississippi#
P is smaller
P = si
Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
overall, O(p log2 N) time
Improvable to O(p + log2 N) time [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al., ’06]
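The indirect binary search can be sketched in a few lines of Python. A naive O(N² log N) construction is used here for illustration only; real indexes build SA in O(N) time:

```python
def suffix_array(T):
    # naive construction: sort the suffix starting positions lexicographically
    return sorted(range(len(T)), key=lambda i: T[i:])

def search(T, SA, P):
    # indirect binary search: O(p) chars compared per step, O(p log N) total;
    # returns the suffixes prefixed by P (they are contiguous in SA)
    lo, hi = 0, len(SA)
    while lo < hi:                                  # first suffix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] < P: lo = mid + 1
        else: hi = mid
    left = lo
    lo, hi = left, len(SA)
    while lo < hi:                                  # first suffix not prefixed by P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] <= P: lo = mid + 1
        else: hi = mid
    return SA[left:lo]

T = "mississippi#"
SA = suffix_array(T)
print(search(T, SA, "si"))   # → [6, 3]  (0-based starts of sippi…, sissippi…)
```

Note the two boundary searches: one for the leftmost suffix ≥ P, one for the leftmost suffix whose length-p prefix exceeds P; the occurrences are exactly the interval in between.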
Locating the occurrences
[Figure: in T = mississippi#, the occ = 2 occurrences of P = si are contiguous in SA, at the entries pointing to suffixes 7 (sippi…) and 4 (sissippi…); the binary searches for si# and si$ delimit the interval.]
Suffix Array search
• O (p + log2 N + occ) time
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree
[Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
SA    suffix          Lcp
12    #               0
11    i#              1
8     ippi#           1
5     issippi#        4
2     ississippi#     0
1     mississippi#    0
10    pi#             1
9     ppi#            0
7     sippi#          2
4     sissippi#       1
6     ssippi#         3
3     ssissippi#      -

(the Lcp in row r refers to the suffixes in rows r and r+1)
T = mississippi#
Example: lcp(issippi#, ississippi#) = 4 (the shared prefix issi).
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
• Search for some Lcp[i] ≥ L.
• Is there a substring of length ≥ L occurring ≥ C times ?
• Search for a range Lcp[i,i+C-2] whose entries are all ≥ L.
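The queries above follow directly from the definitions. A naive Θ(N²)-time sketch for illustration; real implementations pair Lcp with an O(1) range-minimum-query structure instead of the linear scan used here:

```python
def suffix_array(T):
    # naive construction, for illustration only
    return sorted(range(len(T)), key=lambda i: T[i:])

def lcp_array(T, SA):
    # Lcp[r] = longest common prefix of the r-th and (r+1)-th suffix in SA
    def lcp(a, b):
        n = 0
        while a + n < len(T) and b + n < len(T) and T[a + n] == T[b + n]:
            n += 1
        return n
    return [lcp(SA[r], SA[r + 1]) for r in range(len(SA) - 1)]

def common_prefix(T, SA, Lcp, i, j):
    # lcp(T[i:], T[j:]) = min of Lcp over the SA interval between them
    rank = {s: r for r, s in enumerate(SA)}
    h, k = sorted((rank[i], rank[j]))
    return min(Lcp[h:k])

T = "mississippi#"
SA = suffix_array(T)
Lcp = lcp_array(T, SA)
print(common_prefix(T, SA, Lcp, 1, 4))  # → 4  (issi, as in the slide's example)
```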
Slide 14
Paradigm shift...
Web 2.0 is about the many
Big DATA ➜ Big PC ?
We have three types of algorithms:
T1(n) = n, T2(n) = n², T3(n) = 2ⁿ
... and assume that 1 step = 1 time unit
How many input data n may each algorithm process
within t time units?
n1 = t,
n2 = √t,
n3 = log2 t
What about a k-times faster processor?
...or, what is n when the time units are k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
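The point is easy to check numerically (t and k below are arbitrary illustrative values): a k-times faster machine helps the linear algorithm k-fold, the quadratic one only √k-fold, and the exponential one almost not at all:

```python
import math

t, k = 10**6, 100              # time budget and speed-up factor (illustrative)

n1 = lambda t: t               # T1(n) = n   -> n = t
n2 = lambda t: math.isqrt(t)   # T2(n) = n²  -> n = sqrt(t)
n3 = lambda t: math.log2(t)    # T3(n) = 2ⁿ  -> n = log2(t)

print(n1(k * t) / n1(t))   # 100.0 : k times more data
print(n2(k * t) / n2(t))   # 10.0  : sqrt(k) times
print(n3(k * t) / n3(t))   # ≈ 1.33 : log2(kt)/log2(t), barely moves
```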
A new scenario
Data are more available than ever before
n➜∞
... is more than a theoretical assumption
The RAM model is too simple
Step cost is Ω(1) time
Not just MIN #steps…
[Memory hierarchy: CPU ↔ registers/Cache (L1, L2) ↔ RAM ↔ HD ↔ net]
registers + Cache:  few Mbs,   some nanosecs,     few words fetched
RAM:                few Gbs,   tens of nanosecs,  some words fetched
HD:                 few Tbs,   few millisecs,     B = 32K page
net:                many Tbs,  even secs,         packets
You should be “??-aware programmers”
I/O-conscious Algorithms
[Figure: hard-disk internals: magnetic surface, track, read/write head and arm.]
“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality vs Temporal locality
The space issue
M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O
[10^5–10^6 time units (Hennessy-Patterson)]
If N=(1+f)M, then the avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
(1/B) * (p * f/(1+f) * C) ≈
30 * f/(1+f)
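Plugging the slide's ranges into the formulas (the values below are one arbitrary choice within those ranges):

```python
p = 0.35     # fraction of memory accesses (0.3-0.4)
C = 10**5    # cost of an I/O, in time units (10^5-10^6)
B = 4096     # items fetched per I/O (~4Kb)
f = 1.0      # N = (1+f)M, i.e. data twice the memory size

avg_step = C * p * f / (1 + f)   # avg cost per step, C*p*f/(1+f)
amortized = avg_step / B         # if the algorithm uses all B fetched items

print(avg_step)    # 17500.0 : above the 10^4 * f/(1+f) bound
print(amortized)   # ≈ 4.27  : same ballpark as 30 * f/(1+f), depending on C
```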
Space-conscious Algorithms
[Diagram: compressed data structures supporting search/access with few I/Os.]
Streaming Algorithms
Data arrive continuously or we wish FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
Cache-Oblivious Algorithms
Unknown and/or changing devices
Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
Goal: Given a stock and its daily performance over time, find the
time window in which it achieved the best “market performance”.
Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
n     4K   8K   16K  32K   128K  256K  512K  1M
n³    22s  3m   26m  3.5h  28h   --    --    --
n²    0    0    0    1s    26s   106s  7m    28m
An optimal solution
We assume every subsum ≠ 0
[Figure: A splits into a prefix of sum < 0 followed by the Optimum window of sum > 0]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
sum=0; max = -1;
For i=1,...,n do
  If (sum + A[i] ≤ 0) sum=0;
  else { sum += A[i]; max = MAX{max, sum}; }
Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
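The scanning algorithm above translates almost line-for-line into Python. This sketch also tracks the window's endpoints; like the slide, it assumes at least one positive subsum:

```python
def max_subarray(A):
    best, cur = -1, 0                 # max = -1, sum = 0 as in the slide
    best_lo = best_hi = start = 0
    for i, x in enumerate(A):
        if cur + x <= 0:              # sum would drop <= 0: restart after i
            cur = 0
            start = i + 1
        else:                         # extend the current window
            cur += x
            if cur > best:
                best, best_lo, best_hi = cur, start, i
    return best, (best_lo, best_hi)

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray(A))   # → (12, (2, 6)): the window 6 1 -2 4 3
```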
Toy problem #2 : sorting
How to sort tuples (objects) on disk
[Figure: array A in memory, whose entries point to the tuples.]
Key observation:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:
2 random accesses to memory locations A[i] and A[j]
MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)
B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
n insertions ⇒ data get distributed arbitrarily !!!
[Figure: B-tree internal nodes over leaves holding “tuple pointers” to the tuples.]
What about listing tuples in order ?
Possibly 10^9 random I/Os = 10^9 * 5ms ⇒ ≈ 2 months
Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;           // Divide
03   Merge-Sort(A,i,m);     // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)         // Combine
Cost of Mergesort on large data
Take Wikipedia in Italian, compute word freq:
n = 10^9 tuples ⇒ few Gbs
Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
It is an indirect sort: Θ(n log2 n) random I/Os
[5ms] * n log2 n ≈ 1.5 years
In practice, it is faster because of caching...
2 passes (R/W)
Merge-Sort Recursion Tree
log2 N
If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: merge-sort recursion tree; at each of the log2 N levels, pairs of adjacent sorted runs are merged into runs of doubled size. Runs of size up to M fit in memory.]
How do we deploy the disk/mem features ?
N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)
Multi-way Merge-Sort
The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
Pass 1: Produce (N/M) sorted runs.
Pass i: merge X = M/B runs ⇒ logM/B N/M passes
[Figure: X = M/B input buffers (INPUT 1 … INPUT X) of B items each, plus one OUTPUT buffer, held in main memory; runs stream in from disk and the merged run is written back to disk.]
Multiway Merging
[Figure: multiway merging of Run 1 … Run X=M/B. Each run i is read through an input buffer Bf_i with pointer p_i to its current element; at each step the output buffer Bf_o receives min(Bf1[p1], Bf2[p2], …, Bfx[pX]). Fetch the next page of run i when p_i = B; flush Bf_o to the merged run (Out File) when it is full; stop at EOF.]
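In memory, the merging pattern just described (take the min of the X buffer fronts, then advance that run) is exactly what a heap does. A sketch that ignores the disk-page bookkeeping:

```python
import heapq

def multiway_merge(runs):
    # one (value, run-id, position) entry per non-empty run:
    # the heap root plays the role of min(Bf1[p1], ..., Bfx[pX])
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []                      # stands in for the output buffer Bf_o
    while heap:
        val, i, p = heapq.heappop(heap)
        out.append(val)
        if p + 1 < len(runs[i]):  # "fetch": advance run i's pointer
            heapq.heappush(heap, (runs[i][p + 1], i, p + 1))
    return out
```

With X = M/B runs, each output element costs O(log(M/B)) comparisons, while the I/O cost stays at one read and one write per page, i.e. 2(N/B) per pass.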
Cost of Multi-way Merge-Sort
Number of passes = logM/B #runs ≈ logM/B N/M
Optimal cost = Θ((N/B) logM/B N/M) I/Os
In practice
M/B ≈ 1000 ⇒ #passes = logM/B N/M ≈ 1
One multiway merge ⇒ 2 passes (R/W) = few mins
Tuning depends
on disk features
Large fan-out (M/B) decreases #passes
Compression would decrease the cost of a pass!
Can compression help?
Goal: enlarge M and reduce N
#passes = O(logM/B N/M)
Cost of a pass = O(N/B)
Part of Vitter’s paper…
In order to address issues related to:
Disk Striping: sorting easily on D disks
Distribution sort: top-down sorting
Lower Bounds: how far we can go
Toy problem #3: Top-freq elements
Goal: Top queries over a stream S of N items (N large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space (i.e., assuming the mode occurs > N/2 times).
A=b a c c c d c b a a a c c b c c c
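A classical one-pass, O(1)-space answer to this problem is the Boyer-Moore majority vote (named here as background; the slide itself only states the problem): keep a candidate and a counter, and let each non-matching item cancel one vote:

```python
def majority_candidate(stream):
    # If some y occurs > N/2 times, it survives all the cancellations.
    # (If no majority is guaranteed to exist, verify the returned
    #  candidate with a second pass over the stream.)
    cand, count = None, 0
    for x in stream:
        if count == 0:
            cand, count = x, 1
        elif x == cand:
            count += 1
        else:
            count -= 1
    return cand

A = list("bacccdcbaaaccbccc")       # the slide's stream
print(majority_candidate(A))        # → c  (occurs 9 times out of 17)
```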